4.3 SciPy.stats & LinearModels
"Without data, you're just another person with an opinion.""没有数据,你只是另一个有观点的人。"— W. Edwards Deming, Statistician (统计学家)
统计检验与计量经济学专业工具
Part A: SciPy.stats —— 统计推断瑞士军刀
核心功能
定位:快速统计检验,不需要复杂输出
优势:
- 函数式API(简单直接)
- 覆盖所有经典检验
- 概率分布操作
- 速度快
假设检验
1. t 检验
python
from scipy import stats
import numpy as np
# 数据
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
# (1) 单样本 t 检验
t_stat, p_value = stats.ttest_1samp(group1, popmean=25)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# (2) 独立样本 t 检验
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# (3) 配对 t 检验
before = [100, 105, 110, 115, 120]
after = [105, 108, 112, 118, 125]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"配对 t = {t_stat:.3f}, p = {p_value:.4f}")
# (4) Welch's t 检验(不假设方差齐性)
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)2. 卡方检验
python
# 列联表
contingency_table = [[10, 20, 30],
[15, 25, 35]]
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.4f}, df = {dof}")3. 方差分析(ANOVA)
python
# 单因素 ANOVA
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
group3 = [20, 22, 24, 26, 28]
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")4. 正态性检验
python
data = np.random.normal(0, 1, 100)
# Shapiro-Wilk(小样本)
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk: p = {p:.4f}")
# Kolmogorov-Smirnov
stat, p = stats.kstest(data, 'norm')
print(f"KS test: p = {p:.4f}")5. 相关性检验
python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Pearson
r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")
# Spearman(非参数)
rho, p = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p:.4f}")
# Kendall's Tau
tau, p = stats.kendalltau(x, y)
print(f"Kendall τ = {tau:.3f}, p = {p:.4f}")概率分布
python
from scipy.stats import norm, t, chi2, f
# 正态分布
print(f"P(Z < 1.96) = {norm.cdf(1.96):.4f}") # 0.9750
print(f"95% 分位数 = {norm.ppf(0.95):.4f}") # 1.6449
# t 分布
print(f"t(df=10) 95% 分位数 = {t.ppf(0.95, df=10):.4f}")
# 生成随机数
random_sample = norm.rvs(loc=0, scale=1, size=1000)Part B: LinearModels —— 计量经济学专业工具
为什么需要 LinearModels?
Statsmodels 的局限:
- 面板数据支持有限
- 工具变量不够完善
- 聚类标准误计算复杂
LinearModels 的优势:
- 专为面板数据设计
- 完善的 IV 估计
- 聚类稳健标准误
- 系统 GMM
安装:
bash
pip install linearmodels面板数据回归
1. 固定效应(FE)
python
from linearmodels.panel import PanelOLS
import pandas as pd
# 面板数据(MultiIndex)
df = pd.read_csv('wage_panel.csv')
df = df.set_index(['person_id', 'year'])
# 固定效应回归
model = PanelOLS(
dependent=df['log_wage'],
exog=df[['education', 'experience']],
entity_effects=True # 个体固定效应
).fit(cov_type='clustered', cluster_entity=True)
print(model)输出解读:
PanelOLS Estimation Summary
================================================================================
Dep. Variable: log_wage R-squared: 0.8234
Estimator: PanelOLS R-squared (Between): 0.7123
No. Observations: 5000 R-squared (Within): 0.6543
Entity Effects: Yes F-statistic: 1234.56
Time Effects: No P-value (F-stat): 0.0000
Cov. Estimator: Clustered
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
education 0.0850 0.0123 6.911 0.0000 0.0609 0.1091
experience 0.0234 0.0045 5.200 0.0000 0.0146 0.0322
==============================================================================2. 随机效应(RE)
python
from linearmodels.panel import RandomEffects
model = RandomEffects(
dependent=df['log_wage'],
exog=df[['education', 'experience']]
).fit()
print(model)3. Hausman 检验(FE vs RE)
python
# 分别拟合 FE 和 RE
fe_model = PanelOLS(df['log_wage'], df[['education', 'experience']],
entity_effects=True).fit()
re_model = RandomEffects(df['log_wage'], df[['education', 'experience']]).fit()
# Hausman 检验(手动)
from scipy.stats import chi2
# 计算检验统计量
diff = fe_model.params - re_model.params
var_diff = fe_model.cov - re_model.cov
hausman_stat = diff.T @ np.linalg.inv(var_diff) @ diff
p_value = 1 - chi2.cdf(hausman_stat, df=len(diff))
print(f"Hausman 统计量: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")
# 解释:
# p < 0.05 → 拒绝 RE,使用 FE
# p > 0.05 → 接受 RE4. 双向固定效应
python
model = PanelOLS(
dependent=df['log_wage'],
exog=df[['education', 'experience']],
entity_effects=True, # 个体固定效应
time_effects=True # 时间固定效应
).fit(cov_type='clustered', cluster_entity=True)工具变量(IV)
1. 2SLS 回归
python
from linearmodels.iv import IV2SLS
# 示例:教育的回报(能力是内生变量)
# 模型:log(wage) = β₀ + β₁*education + β₂*ability + ε
# 问题:ability 不可观测,与 education 相关
# 工具变量:父亲教育(father_education)
model = IV2SLS(
dependent=df['log_wage'],
exog=df[['experience']], # 外生变量
endog=df[['education']], # 内生变量
instruments=df[['father_education']] # 工具变量
).fit(cov_type='robust')
print(model)输出关键部分:
IV-2SLS Estimation Summary
================================================================================
Endogenous: education Instruments: father_education
Exogenous: experience
...
First Stage Estimation Results:
education = γ₀ + γ₁*experience + γ₂*father_education + ν
F-statistic: 156.34 # ️ 需要 > 10(强工具变量)
...2. 弱工具变量检验
python
# 第一阶段 F 统计量
first_stage_fstat = model.first_stage.diagnostics['f.stat']['education']
print(f"第一阶段 F 统计量: {first_stage_fstat:.2f}")
# 判断标准:
# F > 10:强工具变量
# F < 10:弱工具变量 ️(结果不可信)3. 过度识别检验
python
# 如果工具变量多于内生变量
model = IV2SLS(
dependent=df['log_wage'],
exog=df[['experience']],
endog=df[['education']],
instruments=df[['father_education', 'mother_education']] # 2 个 IV
).fit()
# Sargan-Hansen J 统计量(自动计算)
print(f"J-statistic: {model.sargan.stat:.3f}")
print(f"p-value: {model.sargan.pval:.4f}")
# 解释:
# p > 0.05 → 工具变量有效(无法拒绝外生性)完整面板数据示例
python
import pandas as pd
import numpy as np
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS
# 1. 构造面板数据
np.random.seed(42)
n_individuals = 500
n_years = 5
data = []
for i in range(n_individuals):
for t in range(n_years):
data.append({
'person_id': i,
'year': 2016 + t,
'log_wage': 10 + 0.05*t + np.random.normal(0, 0.5),
'education': 12 + np.random.randint(0, 9),
'experience': t + np.random.randint(0, 10)
})
df = pd.DataFrame(data)
df = df.set_index(['person_id', 'year'])
# 2. 固定效应回归
fe_model = PanelOLS(
dependent=df['log_wage'],
exog=df[['education', 'experience']],
entity_effects=True,
time_effects=True
).fit(cov_type='clustered', cluster_entity=True)
print("=" * 70)
print("固定效应回归结果")
print("=" * 70)
print(fe_model)
# 3. 检查 F 统计量
print(f"\nF-statistic: {fe_model.f_statistic.stat:.2f}")
print(f"p-value: {fe_model.f_statistic.pval:.4f}")
# 4. 提取系数
print(f"\n教育回报率: {fe_model.params['education']:.4f}")
print(f"标准误: {fe_model.std_errors['education']:.4f}")小结
SciPy.stats 核心函数
| 任务 | 函数 | 用途 |
|---|---|---|
| t 检验 | ttest_ind(), ttest_1samp() | 均值比较 |
| 卡方 | chi2_contingency() | 独立性检验 |
| ANOVA | f_oneway() | 多组比较 |
| 相关 | pearsonr(), spearmanr() | 相关性 |
| 正态性 | shapiro(), kstest() | 分布检验 |
LinearModels 核心功能
| 任务 | 类 | 用途 |
|---|---|---|
| 面板 OLS | PanelOLS | FE/RE 回归 |
| 随机效应 | RandomEffects | RE 模型 |
| 2SLS | IV2SLS | 工具变量 |
| GMM | IVGMM | 系统 GMM |
何时使用?
| 场景 | 工具 |
|---|---|
| 快速假设检验 | scipy.stats |
| 面板数据 | linearmodels.PanelOLS |
| 工具变量 | linearmodels.IV2SLS |
| 详细诊断 | statsmodels |
下一步
下一节:工具包对比与综合工作流
参考:
- SciPy 文档:https://docs.scipy.org/doc/scipy/reference/stats.html
- LinearModels 文档:https://bashtage.github.io/linearmodels/