6.4 Regression Visualization

"Visualization is the language of discovery." -- John W. Tukey, statistician

Diagnose the model, present the results

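Section Objectives

After completing this section, you will be able to:

Create regression fit plots
Draw the four-in-one residual diagnostic plot
Visualize regression coefficients and their confidence intervals
Present prediction results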

The Four-in-One Regression Diagnostic Plot

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from statsmodels.nonparametric.smoothers_lowess import lowess

# Generate data and fit the model
np.random.seed(42)
n = 200
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = 1.5 + 0.08*education + 0.03*experience - 0.0005*experience**2 + np.random.normal(0, 0.3, n)

df = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'experience': experience,
    'experience_sq': experience**2
})

# Regression
X = sm.add_constant(df[['education', 'experience', 'experience_sq']])
model = sm.OLS(df['log_wage'], X).fit()

# Diagnostic statistics
influence = model.get_influence()
standardized_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag

# Four-in-one diagnostic plot
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Residuals vs. fitted values
axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs. Fitted')
axes[0, 0].grid(True, alpha=0.3)

# Add a LOWESS curve; a roughly flat curve near zero supports linearity
lowess_result = lowess(model.resid, model.fittedvalues, frac=0.3)
axes[0, 0].plot(lowess_result[:, 0], lowess_result[:, 1], 'b-', linewidth=2)

# 2. Q-Q plot
stats.probplot(model.resid, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)

# 3. Scale-Location
axes[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.5)
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized residuals|')
axes[1, 0].set_title('Scale-Location Plot')
axes[1, 0].grid(True, alpha=0.3)

# 4. Residuals vs. leverage
axes[1, 1].scatter(leverage, standardized_resid, alpha=0.5)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized residuals')
axes[1, 1].set_title('Residuals vs. Leverage')
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle('Regression Diagnostic Plots', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()
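
To make the quantities behind the fourth panel less of a black box, here is a minimal sketch that computes leverage, internally studentized residuals, and Cook's distance by hand with NumPy. It is an illustration, not the section's own method: the data are re-simulated with a fresh generator (so the draws differ from the sample above), and the 4/n cutoff for Cook's distance is a common rule of thumb assumed here, not something the section prescribes.

```python
import numpy as np

# Simulated data mirroring the section's setup (assumption: a fresh generator,
# so values differ from the np.random.seed(42) sample used above).
rng = np.random.default_rng(42)
n = 200
education = rng.normal(13, 3, n)
experience = rng.uniform(0, 30, n)
log_wage = (1.5 + 0.08 * education + 0.03 * experience
            - 0.0005 * experience**2 + rng.normal(0, 0.3, n))

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), education, experience, experience**2])
p = X.shape[1]

# OLS fit and raw residuals
beta, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
resid = log_wage - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

# Internally studentized residuals: resid / sqrt(sigma^2 * (1 - h_ii))
sigma2 = resid @ resid / (n - p)
std_resid = resid / np.sqrt(sigma2 * (1 - leverage))

# Cook's distance combines residual size and leverage
cooks_d = std_resid**2 * leverage / (p * (1 - leverage))

# Rule-of-thumb cutoff (assumed here): flag points with D > 4/n
threshold = 4 / n
flagged = np.flatnonzero(cooks_d > threshold)
print(f"{flagged.size} observations exceed Cook's D > {threshold:.3f}")
```

With a fitted statsmodels model, comparable values are available from `influence.hat_matrix_diag`, `influence.resid_studentized_internal`, and `influence.cooks_distance`, so the hand computation is mainly useful for understanding what the plot shows.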

Coefficient Plot

python

# Extract coefficients and confidence intervals (intercept excluded)
coefs = model.params.drop('const')
ci = model.conf_int().drop('const')
ci_lower = ci[0]
ci_upper = ci[1]

# Draw the coefficient plot
fig, ax = plt.subplots(figsize=(10, 6))
y_pos = np.arange(len(coefs))

# Horizontal error bars span the 95% confidence interval of each coefficient
ax.errorbar(coefs, y_pos, xerr=[coefs - ci_lower, ci_upper - coefs],
           fmt='o', markersize=8, capsize=5, capthick=2, linewidth=2)
ax.axvline(x=0, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
ax.set_yticks(y_pos)
ax.set_yticklabels(coefs.index)
ax.set_xlabel('Coefficient estimate')
ax.set_title('Regression Coefficients with 95% Confidence Intervals', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
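
One practical wrinkle with a coefficient plot is that raw coefficients live on the scales of their predictors, so a term such as `experience_sq` can look negligible next to `education` purely because of units. A possible remedy, sketched below under the assumption that per-standard-deviation effects are acceptable for the comparison, is to z-score the predictors before fitting. The helper `ols` and the re-simulated data are hypothetical and used only for illustration.

```python
import numpy as np

# Re-simulated data in the spirit of the section (assumption: fresh generator)
rng = np.random.default_rng(0)
n = 200
education = rng.normal(13, 3, n)
experience = rng.uniform(0, 30, n)
log_wage = (1.5 + 0.08 * education + 0.03 * experience
            - 0.0005 * experience**2 + rng.normal(0, 0.3, n))

predictors = np.column_stack([education, experience, experience**2])
names = ['education', 'experience', 'experience_sq']

def ols(X, y):
    """Hypothetical helper: OLS with an intercept, returns [const, slopes...]."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Fit on raw predictors
beta_raw = ols(predictors, log_wage)

# Fit on z-scored predictors: each slope is now "per one standard deviation"
sd = predictors.std(axis=0, ddof=1)
z = (predictors - predictors.mean(axis=0)) / sd
beta_std = ols(z, log_wage)

for name, b_raw, b_std in zip(names, beta_raw[1:], beta_std[1:]):
    print(f"{name:>14}: raw = {b_raw: .5f}, per-SD = {b_std: .5f}")
```

Each standardized slope equals the raw slope multiplied by the predictor's standard deviation, so the fit itself is unchanged; only the plotting scale differs. Standardizing a squared term separately makes its individual coefficient harder to interpret, so this is a display choice rather than a modeling one.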

Prediction Visualization

python

# Predict wages at different education levels
edu_range = np.linspace(6, 20, 100)
pred_data = pd.DataFrame({
    'const': 1,
    'education': edu_range,
    'experience': 10,  # hold experience fixed at 10 years
    'experience_sq': 100
})

# Predict
predictions = model.get_prediction(pred_data)
pred_summary = predictions.summary_frame(alpha=0.05)

# Plot
plt.figure(figsize=(10, 6))

# Observed data (experience ≈ 10 years)
mask = (df['experience'] >= 8) & (df['experience'] <= 12)
plt.scatter(df.loc[mask, 'education'], df.loc[mask, 'log_wage'],
           alpha=0.5, s=50, label='Observed data (experience ≈ 10 years)')

# Prediction line
plt.plot(edu_range, pred_summary['mean'], 'r-', linewidth=2, label='Predicted mean')

# Confidence interval for the predicted mean
plt.fill_between(edu_range, pred_summary['mean_ci_lower'], pred_summary['mean_ci_upper'],
                alpha=0.2, color='red', label='95% confidence interval')

plt.xlabel('Years of education', fontsize=12)
plt.ylabel('log(wage)', fontsize=12)
plt.title('Predicted Wage by Education (experience held at 10 years)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
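
The shaded band above is a confidence interval for the predicted mean, which is why many observed points fall outside it. A prediction interval for an individual observation is wider because it adds the error variance. The sketch below computes both by hand with NumPy and SciPy on re-simulated data (an assumption made to keep the block self-contained); in statsmodels the individual-level bounds appear in the `obs_ci_lower` and `obs_ci_upper` columns of `summary_frame`.

```python
import numpy as np
from scipy import stats

# Re-simulated data in the spirit of the section (assumption: fresh generator)
rng = np.random.default_rng(1)
n = 200
education = rng.normal(13, 3, n)
experience = rng.uniform(0, 30, n)
log_wage = (1.5 + 0.08 * education + 0.03 * experience
            - 0.0005 * experience**2 + rng.normal(0, 0.3, n))

X = np.column_stack([np.ones(n), education, experience, experience**2])
p = X.shape[1]
beta, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
resid = log_wage - X @ beta
sigma2 = resid @ resid / (n - p)          # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

# Prediction grid: vary education, hold experience at 10 years
edu_range = np.linspace(6, 20, 100)
X0 = np.column_stack([np.ones_like(edu_range), edu_range,
                      np.full_like(edu_range, 10.0),
                      np.full_like(edu_range, 100.0)])
mean_pred = X0 @ beta

# x0' (X'X)^{-1} x0 for every grid row
h0 = np.einsum('ij,jk,ik->i', X0, XtX_inv, X0)
se_mean = np.sqrt(sigma2 * h0)            # uncertainty of the fitted mean
se_obs = np.sqrt(sigma2 * (1 + h0))       # adds noise of a new observation

t_crit = stats.t.ppf(0.975, df=n - p)
ci_lower, ci_upper = mean_pred - t_crit * se_mean, mean_pred + t_crit * se_mean
pi_lower, pi_upper = mean_pred - t_crit * se_obs, mean_pred + t_crit * se_obs

print(f"Mean CI width at 13 years: {np.interp(13, edu_range, ci_upper - ci_lower):.3f}")
print(f"Prediction interval width at 13 years: {np.interp(13, edu_range, pi_upper - pi_lower):.3f}")
```

Plotting both bands with two `fill_between` calls makes the distinction visible: the narrow band answers "where is the average?", the wide band answers "where will a new person fall?".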

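Section Summary

Key Diagnostic Plots

Residuals vs. fitted values: checks linearity and homoskedasticity
Q-Q plot: checks normality of the residuals
Scale-Location: checks homoskedasticity
Residuals vs. leverage: identifies influential points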
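
Visual checks are easier to defend when paired with a numerical test. As a rough companion to the Scale-Location and Q-Q plots, the sketch below computes Koenker's studentized variant of the Breusch-Pagan test by hand (regress squared residuals on the predictors, then use n times the auxiliary R²) and a Shapiro-Wilk test via SciPy. The choice of these two tests and the re-simulated data are assumptions made for illustration; they are not part of the section's original workflow.

```python
import numpy as np
from scipy import stats

# Re-simulated data in the spirit of the section (assumption: fresh generator)
rng = np.random.default_rng(2)
n = 200
education = rng.normal(13, 3, n)
experience = rng.uniform(0, 30, n)
log_wage = (1.5 + 0.08 * education + 0.03 * experience
            - 0.0005 * experience**2 + rng.normal(0, 0.3, n))

X = np.column_stack([np.ones(n), education, experience, experience**2])
p = X.shape[1]
beta, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
resid = log_wage - X @ beta

# Breusch-Pagan (Koenker's studentized form):
# regress squared residuals on X, then LM = n * R^2 ~ chi2(p - 1) under H0
e2 = resid**2
gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
aux_fitted = X @ gamma
r2_aux = 1 - np.sum((e2 - aux_fitted)**2) / np.sum((e2 - e2.mean())**2)
lm_stat = n * r2_aux
bp_pvalue = stats.chi2.sf(lm_stat, df=p - 1)

# Shapiro-Wilk test for normality of the residuals
sw_stat, sw_pvalue = stats.shapiro(resid)

print(f"Breusch-Pagan LM = {lm_stat:.3f}, p = {bp_pvalue:.3f}")
print(f"Shapiro-Wilk W = {sw_stat:.3f}, p = {sw_pvalue:.3f}")
```

A small p-value in the first test points toward heteroskedasticity, and in the second toward non-normal residuals. With large samples these tests can flag departures too small to matter, so they complement the plots rather than replace them.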

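Next Section Preview

In the next section, we will learn how to compare distributions across multiple groups.

Keep building your visualization skills!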
