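4.1 Chapter Introduction (An Overview of Python's Statistics and Econometrics Packages)

From descriptive statistics to causal inference: getting to know the Python statistics ecosystem

Why learn more than one statistics package?

The Stata user's dilemma

A Stata user is used to this:

stata

* In Stata, everything is one command away
regress wage education experience

After switching to Python, the questions start:

Why are there so many packages (statsmodels, scipy, linearmodels, ...)?
Which package should I use, and when?
Why does the same functionality have several implementations?

The answer: Python is an ecosystem, not a single piece of software. Each package covers one part of what Stata bundles into a single program.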

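The Python statistics ecosystem at a glance

Core statistics packages compared

Package	Positioning	Core functionality	Typical use
statsmodels	Statistical modeling	OLS, GLM, time series, diagnostics	Classical statistical analysis, publication-style output
scipy.stats	Scientific computing	Probability distributions, hypothesis tests, descriptive statistics	Quick tests, univariate analysis
linearmodels	Econometrics	Panel data, instrumental variables, GMM	Panel regression, handling endogeneity
pingouin	User-friendly statistics	t-tests, ANOVA, correlation, power analysis	Quick statistics with readable output
scikit-learn	Machine learning	Predictive models, feature engineering, validation	Prediction tasks, machine learning
PyMC	Bayesian inference	MCMC, Bayesian models	Bayesian statistics, uncertainty quantification

Stata vs Python: two paradigms

Dimension	Stata	Python
Philosophy	Integrated software	Modular ecosystem
Regression	`regress y x1 x2`	`sm.OLS(y, sm.add_constant(X)).fit()`
Output	Displayed automatically	Requires a call to `.summary()`
Extensibility	Limited (ado files)	Open-ended (open-source packages)
Learning curve	Gentle	Steeper, but more flexible
Cost	Commercial software (expensive)	Free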

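Chapter roadmap

Section 1: Statsmodels, the foundation of statistical analysis in Python

Role: the closest thing to Stata in Python

Main features:

python

import statsmodels.api as sm
import statsmodels.formula.api as smf

# 1. OLS regression (sm.OLS does not add an intercept, so add one explicitly)
X = sm.add_constant(X)
ols_res = sm.OLS(y, X).fit()
print(ols_res.summary())  # Stata-style output table

# 2. Formula interface (R style; the intercept is added automatically)
formula_res = smf.ols('wage ~ education + experience + C(region)', data=df).fit()

# 3. Generalized linear model (GLM), here a Poisson regression
glm_res = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# 4. Time series
from statsmodels.tsa.arima.model import ARIMA
arima_res = ARIMA(df['sales'], order=(1, 1, 1)).fit()

# 5. Diagnostics: Breusch-Pagan heteroskedasticity test on the OLS residuals
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_res.resid, ols_res.model.exog)

Output features:

Publication-style tables (similar to Stata)
Detailed diagnostic statistics
Several goodness-of-fit measures, including AIC, BIC and R²
Heteroskedasticity-robust standard errors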

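Section 2: SciPy.stats, quick statistical tests

Role: the Swiss Army knife of statistical inference

Main features:

python

from scipy import stats

# 1. Two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(group1, group2)

# 2. Chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# 3. Normality test (Shapiro-Wilk)
statistic, p_value = stats.shapiro(data)

# 4. Pearson correlation
corr, p_value = stats.pearsonr(x, y)

# 5. Distribution fitting: norm.fit returns the estimated (loc, scale)
mu, sigma = stats.norm.fit(data)

When to use it:

Quick hypothesis tests
Univariate analysis
Working with probability distributions
Cases where no elaborate output table is needed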
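"Working with probability distributions" usually means fitting a distribution and then querying its CDF or quantiles. A small sketch on simulated data follows; the sample and its parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=10.0, scale=2.0, size=2000)

# Fit a normal distribution: returns maximum-likelihood estimates (loc, scale)
mu_hat, sigma_hat = stats.norm.fit(data)

# "Freeze" the fitted distribution and query it
fitted = stats.norm(loc=mu_hat, scale=sigma_hat)
p_below_8 = fitted.cdf(8.0)   # P(X <= 8) under the fitted distribution
q975 = fitted.ppf(0.975)      # 97.5th percentile of the fitted distribution

# Two-sided 5% critical value of the standard normal (the familiar 1.96)
z_crit = stats.norm.ppf(0.975)

print(mu_hat, sigma_hat, p_below_8, q975, z_crit)
```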

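Section 3: LinearModels, the specialist econometrics toolkit

Role: the first choice for panel data and instrumental variables

Main features:

python

import statsmodels.api as sm
from linearmodels.panel import PanelOLS, RandomEffects
from linearmodels.iv import IV2SLS, IVGMM

# Panel estimators expect df to carry a two-level (entity, time) MultiIndex

# 1. Panel data (entity and time fixed effects)
# Regressors that do not vary within an entity are absorbed by the entity effects
model = PanelOLS(
    dependent=df['wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

# 2. Instrumental variables (2SLS)
# education is the endogenous regressor (correlated with unobserved ability);
# father_education is its instrument, experience is an exogenous control
model = IV2SLS(
    dependent=df['wage'],
    exog=sm.add_constant(df[['experience']]),
    endog=df[['education']],
    instruments=df[['father_education']]
).fit(cov_type='robust')

# 3. GMM (same argument layout as IV2SLS)
model = IVGMM(...).fit()

Strengths:

Designed specifically for panel data
Cluster-robust standard errors
Instrumental-variable diagnostics (first-stage and weak-instrument statistics)
GMM estimators (IVGMM for a single equation, IVSystemGMM for systems of equations)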
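What `entity_effects=True` estimates can be illustrated without linearmodels: the fixed-effects (within) estimator is equivalent to subtracting each entity's mean from every variable and running OLS on the demeaned data. The sketch below does this by hand on a simulated panel. The column names, panel size and true coefficient are assumptions, and this shortcut does not produce correct standard errors, so `PanelOLS` remains the tool for actual inference.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_entities, n_periods = 200, 5
n_obs = n_entities * n_periods

# Simulated balanced panel (hypothetical column names)
ids = np.repeat(np.arange(n_entities), n_periods)
years = np.tile(np.arange(2000, 2000 + n_periods), n_entities)
alpha = np.repeat(rng.normal(0, 2, n_entities), n_periods)   # unobserved entity effect
experience = rng.normal(10, 3, n_obs) + alpha                # correlated with alpha
wage = 1.5 * experience + 3.0 * alpha + rng.normal(0, 1, n_obs)

df = pd.DataFrame({"id": ids, "year": years, "wage": wage, "experience": experience})
df = df.set_index(["id", "year"])   # the (entity, time) index that PanelOLS expects

x, y = df["experience"], df["wage"]

# Pooled OLS slope: biased, because experience is correlated with alpha
pooled_slope = x.cov(y) / x.var()

# Within transformation: subtract entity means, then regress
x_within = x - x.groupby(level="id").transform("mean")
y_within = y - y.groupby(level="id").transform("mean")
within_slope = (x_within * y_within).sum() / (x_within ** 2).sum()

print(f"pooled OLS slope:  {pooled_slope:.3f}")
print(f"within (FE) slope: {within_slope:.3f}")   # close to the true value of 1.5
```

The pooled estimate absorbs part of the entity effect into the slope, while the within estimate removes it. This is the reason fixed effects are used when unobserved, time-invariant heterogeneity is a concern.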

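Section 4: Specialized packages

Pingouin, a user-friendly statistics package

python

import pingouin as pg

# 1. t-test (clearer output; correction=True applies Welch's correction)
pg.ttest(group1, group2, correction=True)

# 2. One-way ANOVA
pg.anova(data=df, dv='score', between='group')

# 3. Power analysis: solves for power given effect size d, sample size n and alpha
pg.power_ttest(d=0.5, n=50, alpha=0.05)

# 4. Post-hoc pairwise tests (pairwise_ttests is the older, deprecated name)
pg.pairwise_tests(data=df, dv='score', between='group')

Features:

Results are returned as DataFrames (easy to process further)
Effect sizes are computed automatically (Cohen's d, η²)
Built-in plotting functions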
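To make the effect-size column concrete: Cohen's d for two independent groups is the difference in means divided by the pooled standard deviation. The sketch below computes it with NumPy under that textbook definition. It is meant to explain the quantity, not to reproduce pingouin's internals, and both the helper name `cohens_d` and the simulated groups are assumptions.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Simulated groups whose true standardized difference is 0.5
rng = np.random.default_rng(4)
group1 = rng.normal(0.5, 1.0, 400)
group2 = rng.normal(0.0, 1.0, 400)

d = cohens_d(group1, group2)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, values around 0.2, 0.5 and 0.8 are read as small, medium and large effects.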

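Statsmodels.formula.api, R-style formulas

python

import statsmodels.formula.api as smf

# Formula interface (more readable than building the design matrix by hand)
model = smf.ols('log_wage ~ education + experience + I(experience**2) + C(region)',
                data=df).fit()

# Advantages:
# - The constant term is added automatically
# - Categorical variables are expanded into dummies with C()
# - Transformations are supported (I(), np.log())
# - Interaction terms are supported (education:experience)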
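The sketch below shows what such a formula expands into. It uses simulated data (all column names and coefficients are assumptions) and prints the names of the generated design-matrix columns, which makes the automatic intercept and the dummy coding of `C(region)` visible.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 600
df = pd.DataFrame({
    "education": rng.integers(8, 21, n),
    "experience": rng.uniform(0, 30, n),
    "region": rng.choice(["east", "north", "west"], n),
})
# Simulated log-linear wage equation (illustrative coefficients)
df["wage"] = np.exp(
    1.0 + 0.08 * df["education"] + 0.03 * df["experience"]
    - 0.0005 * df["experience"] ** 2 + rng.normal(0, 0.2, n)
)

res = smf.ols(
    "np.log(wage) ~ education + experience + I(experience**2) + C(region)",
    data=df,
).fit()

# Column names of the design matrix built from the formula
names = res.model.exog_names
print(names)
print(res.params)
```

With three regions, `C(region)` contributes two dummy columns. One category is dropped as the reference level to avoid perfect collinearity with the intercept.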

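Decision tree for choosing a package

Start
│
├─ Simple hypothesis test? (t-test, chi-square test)
│  └─ Yes → scipy.stats or pingouin
│
├─ OLS regression?
│  ├─ Need detailed diagnostics → sm.OLS (statsmodels)
│  ├─ Quick prototyping → statsmodels.formula.api
│  └─ Mainly prediction → scikit-learn
│
├─ Panel data?
│  ├─ Fixed effects / random effects → linearmodels (PanelOLS / RandomEffects)
│  └─ Dynamic panel (Arellano-Bond type) → no built-in estimator in linearmodels; use Stata or a dedicated package
│
├─ Instrumental variables?
│  └─ linearmodels.IV2SLS
│
├─ Time series?
│  ├─ ARIMA/SARIMA → statsmodels.tsa
│  └─ Complex forecasting → prophet, neuralprophet
│
├─ GLM (binary or count outcomes)?
│  └─ statsmodels GLM (sm.GLM)
│
└─ Bayesian inference?
   └─ PyMC, ArviZ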

Installation guide

Basic installation

bash

# Core statistics packages
pip install statsmodels scipy pandas

# Econometrics
pip install linearmodels

# User-friendly statistics
pip install pingouin

# Visualization
pip install matplotlib seaborn

# Alternative (recommended): install the whole stack from conda-forge
conda install -c conda-forge statsmodels scipy pandas linearmodels pingouin

Version requirements

python

import statsmodels
import scipy
import linearmodels

print(f"statsmodels: {statsmodels.__version__}")  # recommended >= 0.14
print(f"scipy: {scipy.__version__}")              # recommended >= 1.10
print(f"linearmodels: {linearmodels.__version__}")  # recommended >= 5.0
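The check above stops with an `ImportError` as soon as one package is missing. A more forgiving variant, sketched below, reads the installed versions through the standard library's `importlib.metadata` and reports each package as ok, too old, or missing. The helper names `parse_major_minor` and `check_versions` are hypothetical, and the comparison deliberately looks only at the major and minor version numbers.

```python
import re
from importlib import metadata

# Minimum versions recommended in this chapter, as (major, minor)
RECOMMENDED = {"statsmodels": (0, 14), "scipy": (1, 10), "linearmodels": (5, 0)}

def parse_major_minor(version):
    """Return (major, minor) as integers, ignoring suffixes such as 'rc1'."""
    numbers = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)

def check_versions(requirements=RECOMMENDED):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = parse_major_minor(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report

print(check_versions())
```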

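Learning objectives

After completing this chapter, you will be able to:

Skill area	Specific goals
Tool awareness	Understand the overall structure of the Python statistics ecosystem
	Know which package to use when
Statsmodels	Build OLS, GLM and time-series models
	Understand model diagnostics and robust standard errors
	Use the formula interface for fast model building
SciPy.stats	Run a wide range of hypothesis tests quickly
	Work with probability distributions
LinearModels	Run panel regressions (fixed effects, random effects)
	Carry out instrumental-variable estimation (2SLS, GMM)
	Compute cluster-robust standard errors
Integrated practice	Follow a complete workflow from data to paper
	Produce publication-style regression tables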

Comparison with Stata and R

In the statsmodels rows below that take `X`, `X` is assumed to already contain a constant column (see `sm.add_constant`). Stata and R add the intercept automatically; the array interface of statsmodels does not.

Stata → Python mapping

Stata command	Python equivalent	Package
`regress y x1 x2`	`sm.OLS(y, X).fit()`	statsmodels
`logit y x1 x2`	`sm.Logit(y, X).fit()`	statsmodels
`xtreg y x, fe`	`PanelOLS(..., entity_effects=True).fit()`	linearmodels
`ivregress 2sls y (x1=z) x2`	`IV2SLS(...).fit()`	linearmodels
`arima y, ar(1) ma(1)`	`ARIMA(y, order=(1,0,1)).fit()`	statsmodels
`ttest x == 0`	`stats.ttest_1samp(x, 0)`	scipy.stats

R → Python mapping

R command	Python equivalent	Package
`lm(y ~ x1 + x2)`	`smf.ols('y ~ x1 + x2', df).fit()`	statsmodels.formula
`glm(y ~ x, family=binomial)`	`sm.GLM(y, X, family=sm.families.Binomial()).fit()`	statsmodels
`t.test(x, y)`	`stats.ttest_ind(x, y, equal_var=False)`	scipy.stats
`cor.test(x, y)`	`stats.pearsonr(x, y)`	scipy.stats
`plm(y ~ x, effect='individual')`	`PanelOLS(..., entity_effects=True).fit()`	linearmodels
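One mapping deserves a closer look: R's `t.test(x, y)` runs Welch's unequal-variance test by default, whereas `stats.ttest_ind(x, y)` assumes equal variances unless told otherwise. The sketch below shows the two results side by side on simulated groups with different sizes and variances (the data are an assumption for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, 30)    # small group, small variance
y = rng.normal(0.5, 4.0, 120)   # large group, large variance

student = stats.ttest_ind(x, y)                  # SciPy default: equal_var=True
welch = stats.ttest_ind(x, y, equal_var=False)   # Welch's test, the default in R's t.test

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

When group sizes and variances differ this much, the two tests can lead to different conclusions, so a "translated" analysis should state which one was used.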

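Study tips

DO (recommended practice)

Start with statsmodels: it is the foundation and the closest match to Stata
Understand what each package is for: every package has a specific purpose
Read the official documentation: the documentation of these packages is detailed
Compare with Stata/R: look for the mapping to commands you already know
Learn by doing: run the example code for every package

DON'T (common pitfalls)

Don't rely on a single package: pick the tool that best fits the task
Don't memorize functions by rote: understanding each package's design matters more
Don't ignore versions: statistics packages change frequently, so watch for compatibility
Don't trust defaults blindly: check settings such as the standard-error type and degrees of freedom
Don't forget to cite: academic papers should state the packages and versions used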
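A small but common instance of the "defaults" pitfall is the degrees-of-freedom setting for the standard deviation: NumPy divides by n by default, while pandas (like Stata's `summarize`) divides by n - 1. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sd_numpy = np.std(values)                 # ddof=0: divides by n (population formula)
sd_pandas = pd.Series(values).std()       # ddof=1: divides by n - 1 (sample formula)
sd_numpy_sample = np.std(values, ddof=1)  # NumPy with the sample formula

print(sd_numpy)          # 2.0
print(sd_pandas)         # larger than the NumPy default
print(sd_numpy_sample)   # equal to the pandas value
```

The gap shrinks as n grows, but in small samples it is large enough to change reported results.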

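Recommended resources

Official documentation

Package	Documentation link
Statsmodels	https://www.statsmodels.org/
SciPy	https://docs.scipy.org/doc/scipy/reference/stats.html
LinearModels	https://bashtage.github.io/linearmodels/
Pingouin	https://pingouin-stats.org/

Datasets used in this chapter

Dataset	Description	Source	Used for
wage_panel.csv	Panel wage data	Simulated	linearmodels examples
treatment_iv.csv	Instrumental-variable data	Simulated	IV2SLS examples
time_series.csv	Macroeconomic time series	FRED	ARIMA examples
survey_data.csv	Cross-sectional survey	Simulated	statsmodels examples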

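Ready?

The Python statistics ecosystem is powerful and flexible. Mastering it gives you:

More extensibility than Stata
No license cost (Stata licenses can cost $1,000 or more)
Access to one of the largest data science communities
A solid base for machine learning and causal inference

Note: this chapter is not introductory. It assumes that you:

Are comfortable with basic Python syntax
Understand the basic concepts of regression analysis
Have completed Modules 1-3

Let's start exploring the Python statistics universe!

Files in this chapter

module-4_Python Statistical Packages/
├── 00-本章介绍.md                    # This file
├── 01-statsmodels-essentials.md      # Statsmodels core features
├── 02-scipy-stats.md                 # Statistical inference with SciPy
├── 03-linearmodels.md                # LinearModels: panel data and IV
├── 04-specialized-packages.md        # Pingouin and other specialized packages
├── 05-package-ecosystem.md           # Package comparison and selection
└── 06-integrated-workflow.md         # Workflow from data to paper

Estimated study time: 20-24 hours
Difficulty: ⭐⭐⭐⭐ (requires a statistics background)
Practical value: ⭐⭐⭐⭐⭐ (core skill)

Next section: 01 - Statsmodels core features

Start your Python statistics journey!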
