Skip to content

4.3 SciPy.stats & LinearModels

"Without data, you're just another person with an opinion.""没有数据,你只是另一个有观点的人。"— W. Edwards Deming, Statistician (统计学家)

统计检验与计量经济学专业工具


Part A: SciPy.stats —— 统计推断瑞士军刀

核心功能

定位:快速统计检验,不需要复杂输出

优势

  • 函数式API(简单直接)
  • 覆盖所有经典检验
  • 概率分布操作
  • 速度快

假设检验

1. t 检验

python
from scipy import stats
import numpy as np

# 数据
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]

# (1) 单样本 t 检验
t_stat, p_value = stats.ttest_1samp(group1, popmean=25)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (2) 独立样本 t 检验
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (3) 配对 t 检验
before = [100, 105, 110, 115, 120]
after = [105, 108, 112, 118, 125]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"配对 t = {t_stat:.3f}, p = {p_value:.4f}")

# (4) Welch's t 检验(不假设方差齐性)
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

2. 卡方检验

python
# 列联表
contingency_table = [[10, 20, 30],
                     [15, 25, 35]]

chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.4f}, df = {dof}")

3. 方差分析(ANOVA)

python
# 单因素 ANOVA
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
group3 = [20, 22, 24, 26, 28]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

4. 正态性检验

python
data = np.random.normal(0, 1, 100)

# Shapiro-Wilk(小样本)
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk: p = {p:.4f}")

# Kolmogorov-Smirnov
stat, p = stats.kstest(data, 'norm')
print(f"KS test: p = {p:.4f}")

5. 相关性检验

python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Pearson
r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")

# Spearman(非参数)
rho, p = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p:.4f}")

# Kendall's Tau
tau, p = stats.kendalltau(x, y)
print(f"Kendall τ = {tau:.3f}, p = {p:.4f}")

概率分布

python
from scipy.stats import norm, t, chi2, f

# 正态分布
print(f"P(Z < 1.96) = {norm.cdf(1.96):.4f}")  # 0.9750
print(f"95% 分位数 = {norm.ppf(0.95):.4f}")   # 1.6449

# t 分布
print(f"t(df=10) 95% 分位数 = {t.ppf(0.95, df=10):.4f}")

# 生成随机数
random_sample = norm.rvs(loc=0, scale=1, size=1000)

Part B: LinearModels —— 计量经济学专业工具

为什么需要 LinearModels?

Statsmodels 的局限

  • 面板数据支持有限
  • 工具变量不够完善
  • 聚类标准误计算复杂

LinearModels 的优势

  • 专为面板数据设计
  • 完善的 IV 估计
  • 聚类稳健标准误
  • 系统 GMM

安装

bash
pip install linearmodels

面板数据回归

1. 固定效应(FE)

python
from linearmodels.panel import PanelOLS
import pandas as pd

# 面板数据(MultiIndex)
df = pd.read_csv('wage_panel.csv')
df = df.set_index(['person_id', 'year'])

# 固定效应回归
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True  # 个体固定效应
).fit(cov_type='clustered', cluster_entity=True)

print(model)

输出解读

                          PanelOLS Estimation Summary                          
================================================================================
Dep. Variable:               log_wage   R-squared:                        0.8234
Estimator:                   PanelOLS   R-squared (Between):              0.7123
No. Observations:                5000   R-squared (Within):               0.6543
Entity Effects:                   Yes   F-statistic:                    1234.56
Time Effects:                      No   P-value (F-stat):                0.0000
Cov. Estimator:             Clustered
                                        
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
education      0.0850     0.0123      6.911     0.0000      0.0609      0.1091
experience     0.0234     0.0045      5.200     0.0000      0.0146      0.0322
==============================================================================

2. 随机效应(RE)

python
from linearmodels.panel import RandomEffects

model = RandomEffects(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']]
).fit()

print(model)

3. Hausman 检验(FE vs RE)

python
# 分别拟合 FE 和 RE
fe_model = PanelOLS(df['log_wage'], df[['education', 'experience']], 
                    entity_effects=True).fit()

re_model = RandomEffects(df['log_wage'], df[['education', 'experience']]).fit()

# Hausman 检验(手动)
from scipy.stats import chi2

# 计算检验统计量
diff = fe_model.params - re_model.params
var_diff = fe_model.cov - re_model.cov
hausman_stat = diff.T @ np.linalg.inv(var_diff) @ diff
p_value = 1 - chi2.cdf(hausman_stat, df=len(diff))

print(f"Hausman 统计量: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")

# 解释:
# p < 0.05 → 拒绝 RE,使用 FE
# p > 0.05 → 接受 RE

4. 双向固定效应

python
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,  # 个体固定效应
    time_effects=True     # 时间固定效应
).fit(cov_type='clustered', cluster_entity=True)

工具变量(IV)

1. 2SLS 回归

python
from linearmodels.iv import IV2SLS

# 示例:教育的回报(能力是内生变量)
# 模型:log(wage) = β₀ + β₁*education + β₂*ability + ε
# 问题:ability 不可观测,与 education 相关
# 工具变量:父亲教育(father_education)

model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],  # 外生变量
    endog=df[['education']],  # 内生变量
    instruments=df[['father_education']]  # 工具变量
).fit(cov_type='robust')

print(model)

输出关键部分

                          IV-2SLS Estimation Summary                          
================================================================================
Endogenous:                  education   Instruments:            father_education
Exogenous:                  experience
...
First Stage Estimation Results:
education = γ₀ + γ₁*experience + γ₂*father_education + ν

F-statistic:                     156.34   # ️ 需要 > 10(强工具变量)
...

2. 弱工具变量检验

python
# 第一阶段 F 统计量
first_stage_fstat = model.first_stage.diagnostics['f.stat']['education']

print(f"第一阶段 F 统计量: {first_stage_fstat:.2f}")

# 判断标准:
# F > 10:强工具变量 
# F < 10:弱工具变量 ️(结果不可信)

3. 过度识别检验

python
# 如果工具变量多于内生变量
model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],
    endog=df[['education']],
    instruments=df[['father_education', 'mother_education']]  # 2 个 IV
).fit()

# Sargan-Hansen J 统计量(自动计算)
print(f"J-statistic: {model.sargan.stat:.3f}")
print(f"p-value: {model.sargan.pval:.4f}")

# 解释:
# p > 0.05 → 工具变量有效(无法拒绝外生性)

完整面板数据示例

python
import pandas as pd
import numpy as np
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS

# 1. 构造面板数据
np.random.seed(42)
n_individuals = 500
n_years = 5

data = []
for i in range(n_individuals):
    for t in range(n_years):
        data.append({
            'person_id': i,
            'year': 2016 + t,
            'log_wage': 10 + 0.05*t + np.random.normal(0, 0.5),
            'education': 12 + np.random.randint(0, 9),
            'experience': t + np.random.randint(0, 10)
        })

df = pd.DataFrame(data)
df = df.set_index(['person_id', 'year'])

# 2. 固定效应回归
fe_model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

print("=" * 70)
print("固定效应回归结果")
print("=" * 70)
print(fe_model)

# 3. 检查 F 统计量
print(f"\nF-statistic: {fe_model.f_statistic.stat:.2f}")
print(f"p-value: {fe_model.f_statistic.pval:.4f}")

# 4. 提取系数
print(f"\n教育回报率: {fe_model.params['education']:.4f}")
print(f"标准误: {fe_model.std_errors['education']:.4f}")

小结

SciPy.stats 核心函数

任务函数用途
t 检验ttest_ind(), ttest_1samp()均值比较
卡方chi2_contingency()独立性检验
ANOVAf_oneway()多组比较
相关pearsonr(), spearmanr()相关性
正态性shapiro(), kstest()分布检验

LinearModels 核心功能

任务用途
面板 OLSPanelOLSFE/RE 回归
随机效应RandomEffectsRE 模型
2SLSIV2SLS工具变量
GMMIVGMM系统 GMM

何时使用?

场景工具
快速假设检验scipy.stats
面板数据linearmodels.PanelOLS
工具变量linearmodels.IV2SLS
详细诊断statsmodels

下一步

下一节:工具包对比与综合工作流


参考

基于 MIT 许可证发布。内容版权归作者所有。