5.4 回归诊断（Regression Diagnostics）

"All models are wrong, some are useful, and a few are insightful.""所有模型都是错的，有些是有用的，少数是有洞察力的。"— George Box, Statistician (统计学家)

检验假设，诊断问题，确保推断有效

本节目标

完成本节后，你将能够：

理解 Gauss-Markov 假设的含义
检验异方差性（Heteroskedasticity）
检验自相关（Autocorrelation）
进行残差分析（Residual Analysis）
识别影响点和异常值（Outliers & Influential Points）
使用稳健标准误（Robust Standard Errors）

Gauss-Markov 假设（Classical Linear Model Assumptions）

完整假设列表

假设	数学表达	含义	违反后果
MLR.1 线性性		总体模型是线性的	系数有偏
MLR.2 随机抽样	是 i.i.d.	样本独立同分布	推断失效
MLR.3 无完全共线性	矩阵满秩	自变量不存在完全线性关系	无法估计
MLR.4 零条件均值	$E[\varepsilon	X] = 0$	误差项与不相关（外生性）
MLR.5 同方差性	$Var(\varepsilon	X) = \sigma^2$	误差方差为常数
MLR.6 正态性		误差服从正态分布	影响小样本推断

Gauss-Markov 定理

定理：在假设 MLR.1-MLR.5 下，OLS 估计量是 BLUE（Best Linear Unbiased Estimator）

实践意义：

假设 1-4 保证无偏性：
假设 5 保证标准误正确
假设 6 不是必需的（但对小样本推断有用）

诊断工具箱

1. 残差分析（Residual Analysis）

残差类型：

类型	公式	用途
原始残差		基本诊断
标准化残差		识别大残差
学生化残差		识别异常值

其中是杠杆值（Leverage）

Python 实现

python

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# 生成示例数据
np.random.seed(42)
n = 200
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = 1.5 + 0.08*education + 0.03*experience - 0.0005*experience**2 + np.random.normal(0, 0.3, n)

df = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'experience': experience,
    'experience_sq': experience**2
})

# 回归
X = sm.add_constant(df[['education', 'experience', 'experience_sq']])
y = df['log_wage']
model = sm.OLS(y, X).fit()

# 获取残差和诊断统计量
influence = model.get_influence()
standardized_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]

df['resid'] = model.resid
df['fitted'] = model.fittedvalues
df['std_resid'] = standardized_resid
df['leverage'] = leverage
df['cooks_d'] = cooks_d

print("前 10 个观测的诊断统计量:")
print(df[['resid', 'std_resid', 'leverage', 'cooks_d']].head(10))

2. 残差可视化（四合一诊断图）

python

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 图 1：残差 vs 拟合值（检验线性性和同方差性）
axes[0, 0].scatter(df['fitted'], df['resid'], alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('拟合值')
axes[0, 0].set_ylabel('残差')
axes[0, 0].set_title('残差 vs 拟合值')
axes[0, 0].grid(True, alpha=0.3)

# 添加局部回归曲线（LOWESS）
from statsmodels.nonparametric.smoothers_lowess import lowess
lowess_result = lowess(df['resid'], df['fitted'], frac=0.3)
axes[0, 0].plot(lowess_result[:, 0], lowess_result[:, 1], 'b-', linewidth=2, label='LOWESS')
axes[0, 0].legend()

# 图 2：Q-Q 图（检验正态性）
stats.probplot(df['resid'], dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('正态 Q-Q 图')
axes[0, 1].grid(True, alpha=0.3)

# 图 3：Scale-Location 图（检验同方差性）
axes[1, 0].scatter(df['fitted'], np.sqrt(np.abs(df['std_resid'])), alpha=0.5)
axes[1, 0].set_xlabel('拟合值')
axes[1, 0].set_ylabel('√|标准化残差|')
axes[1, 0].set_title('Scale-Location 图')
axes[1, 0].grid(True, alpha=0.3)

# 添加 LOWESS 曲线
lowess_result2 = lowess(np.sqrt(np.abs(df['std_resid'])), df['fitted'], frac=0.3)
axes[1, 0].plot(lowess_result2[:, 0], lowess_result2[:, 1], 'r-', linewidth=2)

# 图 4：残差 vs 杠杆值（识别影响点）
axes[1, 1].scatter(df['leverage'], df['std_resid'], alpha=0.5)
axes[1, 1].set_xlabel('杠杆值')
axes[1, 1].set_ylabel('标准化残差')
axes[1, 1].set_title('残差 vs 杠杆值')
axes[1, 1].grid(True, alpha=0.3)

# 标记高影响力点（Cook's D > 0.5）
high_influence = df['cooks_d'] > 0.5
if high_influence.any():
    axes[1, 1].scatter(df.loc[high_influence, 'leverage'], 
                      df.loc[high_influence, 'std_resid'], 
                      color='red', s=100, label="高影响力点")
    axes[1, 1].legend()

plt.tight_layout()
plt.show()

解读指南：

图	好的模式	问题模式
残差 vs 拟合值	随机分布，水平带状	漏斗形（异方差）、曲线（非线性）
Q-Q 图	点在直线上	系统性偏离（非正态）
Scale-Location	水平线	向上或向下趋势（异方差）
残差 vs 杠杆值	无极端点	右上/右下角有点（高影响力）

异方差性（Heteroskedasticity）

定义

同方差性（Homoskedasticity）：（常数）

异方差性（Heteroskedasticity）：（随变化）

后果

好消息：OLS 估计量仍然无偏
坏消息：标准误有偏，统计量和值不可靠

检验方法

1. Breusch-Pagan 检验

原假设：（同方差）

python

from statsmodels.stats.diagnostic import het_breuschpagan

# BP 检验
bp_test = het_breuschpagan(model.resid, model.model.exog)
bp_stat, bp_pvalue, bp_f, bp_f_pvalue = bp_test

print("Breusch-Pagan 检验:")
print(f"  LM 统计量 = {bp_stat:.3f}")
print(f"  p 值 = {bp_pvalue:.4f}")

if bp_pvalue < 0.05:
    print("  结论：拒绝同方差假设，存在异方差性")
else:
    print("  结论：不能拒绝同方差假设")

2. White 检验

优势：不假设异方差的具体形式

python

from statsmodels.stats.diagnostic import het_white

# White 检验
white_test = het_white(model.resid, model.model.exog)
white_stat, white_pvalue, white_f, white_f_pvalue = white_test

print("\nWhite 检验:")
print(f"  LM 统计量 = {white_stat:.3f}")
print(f"  p 值 = {white_pvalue:.4f}")

3. 图形诊断

python

# 绘制残差平方与拟合值
plt.figure(figsize=(10, 6))
plt.scatter(df['fitted'], df['resid']**2, alpha=0.5)
plt.xlabel('拟合值')
plt.ylabel('残差平方')
plt.title('残差平方 vs 拟合值（异方差诊断）')
plt.grid(True, alpha=0.3)

# 添加趋势线
z = np.polyfit(df['fitted'], df['resid']**2, 2)
p = np.poly1d(z)
x_line = np.linspace(df['fitted'].min(), df['fitted'].max(), 100)
plt.plot(x_line, p(x_line), "r-", linewidth=2, label='二次拟合')
plt.legend()
plt.show()

处理方法

方法 1：稳健标准误（Heteroskedasticity-Robust Standard Errors）

python

# 使用 HC3 稳健标准误（推荐）
model_robust = sm.OLS(y, X).fit(cov_type='HC3')

print("普通 OLS:")
print(model.summary().tables[1])

print("\n稳健标准误（HC3）:")
print(model_robust.summary().tables[1])

# 对比标准误
comparison = pd.DataFrame({
    'Variable': model.params.index,
    'Coefficient': model.params.values,
    'SE (OLS)': model.bse.values,
    'SE (Robust)': model_robust.bse.values,
    'Ratio': model_robust.bse.values / model.bse.values
})
print("\n标准误对比:")
print(comparison)

稳健标准误类型：

类型	适用场景	Python 参数
HC0	基本版	`cov_type='HC0'`
HC1	小样本调整	`cov_type='HC1'`
HC2	杠杆值调整	`cov_type='HC2'`
HC3	推荐（最保守）	`cov_type='HC3'`

方法 2：加权最小二乘法（WLS）

python

# 如果知道方差结构：Var(ε) ∝ X²
# 使用 WLS
weights = 1 / df['education']**2
model_wls = sm.WLS(y, X, weights=weights).fit()

print("WLS 回归结果:")
print(model_wls.summary())

方法 3：变量变换

python

# 对因变量取对数（常用于工资数据）
# 可以稳定方差

自相关（Autocorrelation）

定义

无自相关假设： for

自相关：误差项之间存在相关性（常见于时间序列数据）

Durbin-Watson 检验

python

from statsmodels.stats.stattools import durbin_watson

# DW 统计量
dw_stat = durbin_watson(model.resid)
print(f"Durbin-Watson 统计量: {dw_stat:.3f}")

# 判断标准
if dw_stat < 1.5:
    print("可能存在正自相关")
elif dw_stat > 2.5:
    print("可能存在负自相关")
else:
    print("无明显自相关")

DW 统计量解释：

DW ≈ 2：无自相关
DW < 2：正自相关
DW > 2：负自相关
范围：[0, 4]

处理方法

python

# 方法 1：Newey-West 标准误（HAC）
model_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
print("Newey-West 标准误:")
print(model_hac.bse)

# 方法 2：加入滞后因变量（仅时间序列）
# y_t = β₀ + β₁ x_t + ρ y_{t-1} + ε_t

影响力分析（Influence Analysis）

关键指标

1. 杠杆值（Leverage）

含义：观测值的值与平均值的距离

判断标准：为高杠杆值

python

# 计算杠杆值阈值
k = X.shape[1] - 1  # 自变量个数
n = len(df)
leverage_threshold = 2 * (k + 1) / n

print(f"杠杆值阈值: {leverage_threshold:.4f}")
print(f"高杠杆值点数量: {(df['leverage'] > leverage_threshold).sum()}")

# 可视化
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['leverage'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=leverage_threshold, color='r', linestyle='--', linewidth=2, label=f'阈值 = {leverage_threshold:.3f}')
plt.xlabel('观测序号')
plt.ylabel('杠杆值')
plt.title('杠杆值诊断')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

2. Cook's Distance

含义：删除观测值后，所有拟合值的变化程度

判断标准：为高影响力点

python

# Cook's D 阈值
cooks_threshold = 4 / n

print(f"Cook's D 阈值: {cooks_threshold:.4f}")
high_influence = df['cooks_d'] > cooks_threshold
print(f"高影响力点数量: {high_influence.sum()}")

if high_influence.any():
    print("\n高影响力点:")
    print(df.loc[high_influence, ['education', 'experience', 'log_wage', 'cooks_d']])

# 可视化
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['cooks_d'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=cooks_threshold, color='r', linestyle='--', linewidth=2, label=f'阈值 = {cooks_threshold:.3f}')
plt.xlabel('观测序号')
plt.ylabel("Cook's Distance")
plt.title('影响力诊断')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

3. DFBETAS

含义：删除观测值后，每个系数的变化

python

# DFBETAS
dfbetas = influence.dfbetas
dfbetas_df = pd.DataFrame(dfbetas, columns=X.columns)

print("DFBETAS 描述统计:")
print(dfbetas_df.describe())

# 可视化（针对教育系数）
plt.figure(figsize=(10, 6))
plt.stem(df.index, dfbetas_df['education'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=2/np.sqrt(n), color='r', linestyle='--', linewidth=2, label='阈值')
plt.axhline(y=-2/np.sqrt(n), color='r', linestyle='--', linewidth=2)
plt.xlabel('观测序号')
plt.ylabel('DFBETAS (education)')
plt.title('教育系数的影响力分析')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

处理影响点

python

# 方法 1：删除高影响力点后重新估计
df_no_outliers = df[~high_influence].copy()
X_no = sm.add_constant(df_no_outliers[['education', 'experience', 'experience_sq']])
y_no = df_no_outliers['log_wage']
model_no_outliers = sm.OLS(y_no, X_no).fit()

print("原模型（包含异常值）:")
print(model.params)

print("\n删除异常值后:")
print(model_no_outliers.params)

# 方法 2：稳健回归（Robust Regression）
from statsmodels.robust.robust_linear_model import RLM

model_rlm = RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("\n稳健回归:")
print(model_rlm.params)

正态性检验（Normality Tests）

1. Shapiro-Wilk 检验

python

from scipy.stats import shapiro

# Shapiro-Wilk 检验
stat, p_value = shapiro(model.resid)
print(f"Shapiro-Wilk 统计量: {stat:.4f}")
print(f"p 值: {p_value:.4f}")

if p_value < 0.05:
    print("结论：拒绝正态性假设")
else:
    print("结论：不能拒绝正态性假设")

2. Jarque-Bera 检验

python

from statsmodels.stats.stattools import jarque_bera

# Jarque-Bera 检验
jb_stat, jb_pvalue, jb_skew, jb_kurtosis = jarque_bera(model.resid)
print(f"\nJarque-Bera 统计量: {jb_stat:.3f}")
print(f"p 值: {jb_pvalue:.4f}")
print(f"偏度: {jb_skew:.3f}")
print(f"峰度: {jb_kurtosis:.3f}")

3. 可视化

python

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 直方图
axes[0].hist(model.resid, bins=30, density=True, alpha=0.7, edgecolor='black')
# 叠加正态分布
mu, sigma = model.resid.mean(), model.resid.std()
x = np.linspace(model.resid.min(), model.resid.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='理论正态分布')
axes[0].set_xlabel('残差')
axes[0].set_ylabel('密度')
axes[0].set_title('残差分布')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Q-Q 图
stats.probplot(model.resid, dist="norm", plot=axes[1])
axes[1].set_title('正态 Q-Q 图')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

注意：

正态性假设不是必需的（Gauss-Markov 定理不需要）
大样本下，CLT 保证渐近正态性
小样本下，非正态性影响检验和置信区间

完整诊断流程

诊断检查清单

python

def comprehensive_diagnostics(model, X, y):
    """
    完整的回归诊断
    """
    print("="*70)
    print("回归诊断报告")
    print("="*70)
    
    # 1. 基本信息
    print(f"\n1. 样本量: {len(y)}")
    print(f"   自变量数: {X.shape[1] - 1}")
    print(f"   R²: {model.rsquared:.4f}")
    print(f"   Adjusted R²: {model.rsquared_adj:.4f}")
    
    # 2. 异方差检验
    print("\n2. 异方差检验:")
    bp_test = het_breuschpagan(model.resid, model.model.exog)
    print(f"   Breusch-Pagan: LM = {bp_test[0]:.3f}, p = {bp_test[1]:.4f}")
    
    white_test = het_white(model.resid, model.model.exog)
    print(f"   White: LM = {white_test[0]:.3f}, p = {white_test[1]:.4f}")
    
    if bp_test[1] < 0.05 or white_test[1] < 0.05:
        print("   ️  建议：使用稳健标准误（HC3）")
    else:
        print("    无明显异方差")
    
    # 3. 自相关检验
    print("\n3. 自相关检验:")
    dw_stat = durbin_watson(model.resid)
    print(f"   Durbin-Watson: {dw_stat:.3f}")
    if 1.5 < dw_stat < 2.5:
        print("    无明显自相关")
    else:
        print("   ️  可能存在自相关")
    
    # 4. 正态性检验
    print("\n4. 正态性检验:")
    jb_stat, jb_pvalue, _, _ = jarque_bera(model.resid)
    print(f"   Jarque-Bera: JB = {jb_stat:.3f}, p = {jb_pvalue:.4f}")
    if jb_pvalue < 0.05:
        print("   ️  残差可能不服从正态分布（大样本下通常不影响推断）")
    else:
        print("    残差近似正态分布")
    
    # 5. 影响力分析
    print("\n5. 影响力分析:")
    influence = model.get_influence()
    cooks_d = influence.cooks_distance[0]
    high_influence = cooks_d > 4/len(y)
    print(f"   高影响力点数量: {high_influence.sum()} ({high_influence.sum()/len(y)*100:.1f}%)")
    if high_influence.sum() > 0:
        print("   ️  建议：检查高影响力点，考虑稳健回归")
    else:
        print("    无明显高影响力点")
    
    # 6. VIF
    print("\n6. 多重共线性（VIF）:")
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    vif_data = pd.DataFrame()
    vif_data["Variable"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif_data = vif_data[vif_data['Variable'] != 'const']
    print(vif_data.to_string(index=False))
    if (vif_data['VIF'] > 10).any():
        print("   ️  部分变量 VIF > 10，存在严重多重共线性")
    else:
        print("    VIF 在可接受范围")
    
    print("\n" + "="*70)
    print("诊断完成")
    print("="*70)

# 运行诊断
comprehensive_diagnostics(model, X, y)

练习题

练习 1：识别异方差

给定以下残差图，判断是否存在异方差：

python

# 模拟异方差数据
np.random.seed(123)
X_het = np.random.uniform(0, 10, 200)
y_het = 2 + 3*X_het + X_het * np.random.normal(0, 1, 200)  # 方差随 X 增加

df_het = pd.DataFrame({'X': X_het, 'y': y_het})
X_het_model = sm.add_constant(df_het['X'])
model_het = sm.OLS(df_het['y'], X_het_model).fit()

plt.scatter(model_het.fittedvalues, model_het.resid)
plt.xlabel('拟合值')
plt.ylabel('残差')
plt.title('残差 vs 拟合值')
plt.axhline(y=0, color='r', linestyle='--')
plt.show()

任务：

进行 BP 检验
使用稳健标准误重新估计
对比标准误的变化

练习 2：影响力分析

在上面的工资数据中：

人为添加一个异常值
计算 Cook's D
对比删除前后的回归系数

本节小结

诊断流程

数据准备
   ↓
OLS 回归
   ↓
检验假设 ——→ 异方差？ ——→ 是 → 使用稳健 SE
   ↓          ↓ 否
   ↓          ↓
   ↓      自相关？ ——→ 是 → 使用 HAC SE
   ↓          ↓ 否
   ↓          ↓
   ↓      非正态？ ——→ 是 → 大样本 OK，小样本谨慎
   ↓          ↓ 否
   ↓          ↓
   ↓      影响点？ ——→ 是 → 稳健回归或删除
   ↓          ↓ 否
   ↓          ↓
   ↓        通过所有检验
   ↓
 最终模型

关键要点

问题	检验	后果	解决方案
异方差	BP, White	SE 有偏	HC3 稳健 SE
自相关	DW	SE 有偏	HAC SE
非正态	JB, Shapiro	小样本推断失效	大样本下无妨
影响点	Cook's D	系数不稳定	稳健回归

下节预告

在下一节中，我们将学习：

分类变量的处理（Dummy Variables）
虚拟变量陷阱（Dummy Variable Trap）
交互效应（Interaction Effects）
非线性关系建模

扩展回归模型的表达能力！

5.4 回归诊断（Regression Diagnostics） ​

本节目标 ​

Gauss-Markov 假设（Classical Linear Model Assumptions） ​

完整假设列表 ​

Gauss-Markov 定理 ​

诊断工具箱 ​

1. 残差分析（Residual Analysis） ​

Python 实现 ​

2. 残差可视化（四合一诊断图） ​

异方差性（Heteroskedasticity） ​

定义 ​

后果 ​

检验方法 ​

1. Breusch-Pagan 检验 ​

2. White 检验 ​

3. 图形诊断 ​

处理方法 ​

方法 1：稳健标准误（Heteroskedasticity-Robust Standard Errors） ​

方法 2：加权最小二乘法（WLS） ​

方法 3：变量变换 ​

自相关（Autocorrelation） ​

定义 ​

Durbin-Watson 检验 ​

处理方法 ​

影响力分析（Influence Analysis） ​

关键指标 ​

1. 杠杆值（Leverage） ​

2. Cook's Distance ​

3. DFBETAS ​

处理影响点 ​

正态性检验（Normality Tests） ​

1. Shapiro-Wilk 检验 ​

2. Jarque-Bera 检验 ​

3. 可视化 ​

完整诊断流程 ​

诊断检查清单 ​

练习题 ​

练习 1：识别异方差 ​

练习 2：影响力分析 ​

本节小结 ​

诊断流程 ​

关键要点 ​

下节预告 ​

延伸阅读 ​

5.4 回归诊断（Regression Diagnostics）

本节目标

Gauss-Markov 假设（Classical Linear Model Assumptions）

完整假设列表

Gauss-Markov 定理

诊断工具箱

1. 残差分析（Residual Analysis）

Python 实现

2. 残差可视化（四合一诊断图）

异方差性（Heteroskedasticity）

定义

后果

检验方法

1. Breusch-Pagan 检验

2. White 检验

3. 图形诊断

处理方法

方法 1：稳健标准误（Heteroskedasticity-Robust Standard Errors）

方法 2：加权最小二乘法（WLS）

方法 3：变量变换

自相关（Autocorrelation）

定义

Durbin-Watson 检验

处理方法

影响力分析（Influence Analysis）

关键指标

1. 杠杆值（Leverage）

2. Cook's Distance

3. DFBETAS

处理影响点

正态性检验（Normality Tests）

1. Shapiro-Wilk 检验

2. Jarque-Bera 检验

3. 可视化

完整诊断流程

诊断检查清单

练习题

练习 1：识别异方差

练习 2：影响力分析

本节小结

诊断流程

关键要点

下节预告

延伸阅读