4.3 SciPy.stats & LinearModels

"Without data, you're just another person with an opinion.""没有数据，你只是另一个有观点的人。"— W. Edwards Deming, Statistician (统计学家)

统计检验与计量经济学专业工具

Part A: SciPy.stats —— 统计推断瑞士军刀

核心功能

定位：快速统计检验，不需要复杂输出

优势：

函数式API（简单直接）
覆盖所有经典检验
概率分布操作
速度快

假设检验

1. t 检验

python

from scipy import stats
import numpy as np

# 数据
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]

# (1) 单样本 t 检验
t_stat, p_value = stats.ttest_1samp(group1, popmean=25)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (2) 独立样本 t 检验
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (3) 配对 t 检验
before = [100, 105, 110, 115, 120]
after = [105, 108, 112, 118, 125]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"配对 t = {t_stat:.3f}, p = {p_value:.4f}")

# (4) Welch's t 检验（不假设方差齐性）
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

2. 卡方检验

python

# 列联表
contingency_table = [[10, 20, 30],
                     [15, 25, 35]]

chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.4f}, df = {dof}")

3. 方差分析（ANOVA）

python

# 单因素 ANOVA
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
group3 = [20, 22, 24, 26, 28]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

4. 正态性检验

python

data = np.random.normal(0, 1, 100)

# Shapiro-Wilk（小样本）
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk: p = {p:.4f}")

# Kolmogorov-Smirnov
stat, p = stats.kstest(data, 'norm')
print(f"KS test: p = {p:.4f}")

5. 相关性检验

python

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Pearson
r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")

# Spearman（非参数）
rho, p = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p:.4f}")

# Kendall's Tau
tau, p = stats.kendalltau(x, y)
print(f"Kendall τ = {tau:.3f}, p = {p:.4f}")

概率分布

python

from scipy.stats import norm, t, chi2, f

# 正态分布
print(f"P(Z < 1.96) = {norm.cdf(1.96):.4f}")  # 0.9750
print(f"95% 分位数 = {norm.ppf(0.95):.4f}")   # 1.6449

# t 分布
print(f"t(df=10) 95% 分位数 = {t.ppf(0.95, df=10):.4f}")

# 生成随机数
random_sample = norm.rvs(loc=0, scale=1, size=1000)

Part B: LinearModels —— 计量经济学专业工具

为什么需要 LinearModels？

Statsmodels 的局限：

面板数据支持有限
工具变量不够完善
聚类标准误计算复杂

LinearModels 的优势：

专为面板数据设计
完善的 IV 估计
聚类稳健标准误
系统 GMM

安装：

bash

pip install linearmodels

面板数据回归

1. 固定效应（FE）

python

from linearmodels.panel import PanelOLS
import pandas as pd

# 面板数据（MultiIndex）
df = pd.read_csv('wage_panel.csv')
df = df.set_index(['person_id', 'year'])

# 固定效应回归
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True  # 个体固定效应
).fit(cov_type='clustered', cluster_entity=True)

print(model)

输出解读：

                          PanelOLS Estimation Summary                          
================================================================================
Dep. Variable:               log_wage   R-squared:                        0.8234
Estimator:                   PanelOLS   R-squared (Between):              0.7123
No. Observations:                5000   R-squared (Within):               0.6543
Entity Effects:                   Yes   F-statistic:                    1234.56
Time Effects:                      No   P-value (F-stat):                0.0000
Cov. Estimator:             Clustered
                                        
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
education      0.0850     0.0123      6.911     0.0000      0.0609      0.1091
experience     0.0234     0.0045      5.200     0.0000      0.0146      0.0322
==============================================================================

2. 随机效应（RE）

python

from linearmodels.panel import RandomEffects

model = RandomEffects(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']]
).fit()

print(model)

3. Hausman 检验（FE vs RE）

python

# 分别拟合 FE 和 RE
fe_model = PanelOLS(df['log_wage'], df[['education', 'experience']], 
                    entity_effects=True).fit()

re_model = RandomEffects(df['log_wage'], df[['education', 'experience']]).fit()

# Hausman 检验（手动）
from scipy.stats import chi2

# 计算检验统计量
diff = fe_model.params - re_model.params
var_diff = fe_model.cov - re_model.cov
hausman_stat = diff.T @ np.linalg.inv(var_diff) @ diff
p_value = 1 - chi2.cdf(hausman_stat, df=len(diff))

print(f"Hausman 统计量: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")

# 解释：
# p < 0.05 → 拒绝 RE，使用 FE
# p > 0.05 → 接受 RE

4. 双向固定效应

python

model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,  # 个体固定效应
    time_effects=True     # 时间固定效应
).fit(cov_type='clustered', cluster_entity=True)

工具变量（IV）

1. 2SLS 回归

python

from linearmodels.iv import IV2SLS

# 示例：教育的回报（能力是内生变量）
# 模型：log(wage) = β₀ + β₁*education + β₂*ability + ε
# 问题：ability 不可观测，与 education 相关
# 工具变量：父亲教育（father_education）

model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],  # 外生变量
    endog=df[['education']],  # 内生变量
    instruments=df[['father_education']]  # 工具变量
).fit(cov_type='robust')

print(model)

输出关键部分：

                          IV-2SLS Estimation Summary                          
================================================================================
Endogenous:                  education   Instruments:            father_education
Exogenous:                  experience
...
First Stage Estimation Results:
education = γ₀ + γ₁*experience + γ₂*father_education + ν

F-statistic:                     156.34   # ️ 需要 > 10（强工具变量）
...

2. 弱工具变量检验

python

# 第一阶段 F 统计量
first_stage_fstat = model.first_stage.diagnostics['f.stat']['education']

print(f"第一阶段 F 统计量: {first_stage_fstat:.2f}")

# 判断标准：
# F > 10：强工具变量 
# F < 10：弱工具变量 ️（结果不可信）

3. 过度识别检验

python

# 如果工具变量多于内生变量
model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],
    endog=df[['education']],
    instruments=df[['father_education', 'mother_education']]  # 2 个 IV
).fit()

# Sargan-Hansen J 统计量（自动计算）
print(f"J-statistic: {model.sargan.stat:.3f}")
print(f"p-value: {model.sargan.pval:.4f}")

# 解释：
# p > 0.05 → 工具变量有效（无法拒绝外生性）

完整面板数据示例

python

import pandas as pd
import numpy as np
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS

# 1. 构造面板数据
np.random.seed(42)
n_individuals = 500
n_years = 5

data = []
for i in range(n_individuals):
    for t in range(n_years):
        data.append({
            'person_id': i,
            'year': 2016 + t,
            'log_wage': 10 + 0.05*t + np.random.normal(0, 0.5),
            'education': 12 + np.random.randint(0, 9),
            'experience': t + np.random.randint(0, 10)
        })

df = pd.DataFrame(data)
df = df.set_index(['person_id', 'year'])

# 2. 固定效应回归
fe_model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

print("=" * 70)
print("固定效应回归结果")
print("=" * 70)
print(fe_model)

# 3. 检查 F 统计量
print(f"\nF-statistic: {fe_model.f_statistic.stat:.2f}")
print(f"p-value: {fe_model.f_statistic.pval:.4f}")

# 4. 提取系数
print(f"\n教育回报率: {fe_model.params['education']:.4f}")
print(f"标准误: {fe_model.std_errors['education']:.4f}")

小结

SciPy.stats 核心函数

任务	函数	用途
t 检验	`ttest_ind()`, `ttest_1samp()`	均值比较
卡方	`chi2_contingency()`	独立性检验
ANOVA	`f_oneway()`	多组比较
相关	`pearsonr()`, `spearmanr()`	相关性
正态性	`shapiro()`, `kstest()`	分布检验

LinearModels 核心功能

任务	类	用途
面板 OLS	`PanelOLS`	FE/RE 回归
随机效应	`RandomEffects`	RE 模型
2SLS	`IV2SLS`	工具变量
GMM	`IVGMM`	系统 GMM

何时使用？

场景	工具
快速假设检验	scipy.stats
面板数据	linearmodels.PanelOLS
工具变量	linearmodels.IV2SLS
详细诊断	statsmodels

下一步

下一节：工具包对比与综合工作流

参考：

SciPy 文档：https://docs.scipy.org/doc/scipy/reference/stats.html
LinearModels 文档：https://bashtage.github.io/linearmodels/

4.3 SciPy.stats & LinearModels ​

Part A: SciPy.stats —— 统计推断瑞士军刀 ​

核心功能 ​

假设检验 ​

1. t 检验 ​

2. 卡方检验 ​

3. 方差分析（ANOVA） ​

4. 正态性检验 ​

5. 相关性检验 ​

概率分布 ​

Part B: LinearModels —— 计量经济学专业工具 ​

为什么需要 LinearModels？ ​

面板数据回归 ​

1. 固定效应（FE） ​

2. 随机效应（RE） ​

3. Hausman 检验（FE vs RE） ​

4. 双向固定效应 ​

工具变量（IV） ​

1. 2SLS 回归 ​

2. 弱工具变量检验 ​

3. 过度识别检验 ​

完整面板数据示例 ​

小结 ​

SciPy.stats 核心函数 ​

LinearModels 核心功能 ​

何时使用？ ​

下一步 ​

4.3 SciPy.stats & LinearModels

Part A: SciPy.stats —— 统计推断瑞士军刀

核心功能

假设检验

1. t 检验

2. 卡方检验

3. 方差分析（ANOVA）

4. 正态性检验

5. 相关性检验

概率分布

Part B: LinearModels —— 计量经济学专业工具

为什么需要 LinearModels？

面板数据回归

1. 固定效应（FE）

2. 随机效应（RE）

3. Hausman 检验（FE vs RE）

4. 双向固定效应

工具变量（IV）

1. 2SLS 回归

2. 弱工具变量检验

3. 过度识别检验

完整面板数据示例

小结

SciPy.stats 核心函数

LinearModels 核心功能

何时使用？

下一步