5.5 Categorical Variables & Interaction Effects

"Interaction effects tell us how the world really works." - Andrew Gelman, statistician and political scientist

Extending the regression model: handling qualitative variables and capturing interactions

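Learning Objectives

After completing this section, you will be able to:

Understand how dummy variables work
Avoid the dummy variable trap
Handle categorical variables with more than two categories
Model and interpret interaction effects
Visualize interactions
Run subgroup regressions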

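Dummy Variables

Why do we need dummy variables?

Problem: how do we include categorical variables (such as gender, region, or education level) in a regression?

Solution: convert each categorical variable into dummy variables (binary 0/1 indicators)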
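As a minimal sketch of that conversion, the snippet below turns a column of text labels into a 0/1 indicator. The `gender` array and its values are hypothetical and only serve as an illustration; they are not the tutorial's data.

```python
import numpy as np

# Hypothetical raw category labels (illustration only)
gender = np.array(["female", "male", "male", "female", "female"])

# 1 if the observation belongs to the category of interest, 0 otherwise
female = (gender == "female").astype(int)

print(female.tolist())
print(female.mean())   # the mean of a dummy is the share of observations coded 1
```

Because the dummy only takes the values 0 and 1, its coefficient in a regression shifts the intercept for the group coded 1.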

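Binary categorical variables

Example: the gender wage gap

$$\log(wage_i) = \beta_0 + \beta_1 \, education_i + \beta_2 \, female_i + \varepsilon_i$$

where:

$female_i = 1$ if the person is a woman
$female_i = 0$ if the person is a man (reference group)

Interpretation:

Men: $E[\log(wage) \mid education, female = 0] = \beta_0 + \beta_1 \, education$
Women: $E[\log(wage) \mid education, female = 1] = (\beta_0 + \beta_2) + \beta_1 \, education$
$\beta_2$: the gender wage gap (women relative to men)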

Python implementation

python

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate simulated data
np.random.seed(42)
n = 500

education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
female = np.random.binomial(1, 0.5, n)

# True DGP: women earn 0.15 log points less than men (about 14% lower)
log_wage = 1.5 + 0.08*education - 0.15*female + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)

df = pd.DataFrame({
    'wage': wage,
    'log_wage': log_wage,
    'education': education,
    'female': female
})

# Fit the regression with HC3 heteroskedasticity-robust standard errors
X = sm.add_constant(df[['education', 'female']])
y = df['log_wage']
model = sm.OLS(y, X).fit(cov_type='HC3')

print(model.summary())

Output (key part):

==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5012      0.052     28.870      0.000       1.399       1.603
education      0.0798      0.004     19.950      0.000       0.072       0.088
female        -0.1485      0.027     -5.500      0.000      -0.202      -0.095
==============================================================================

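Interpretation:

python

# Convert the log-point coefficient into a percentage difference
gender_gap = (np.exp(model.params['female']) - 1) * 100
print(f"Gender wage gap: women earn {-gender_gap:.1f}% less than men")
print(f"Controlling for education, women's wages are {np.exp(model.params['female'])*100:.1f}% of men's")

Output:

Gender wage gap: women earn 13.8% less than men
Controlling for education, women's wages are 86.2% of men's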
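The conversion above, $100 \times (e^{\beta} - 1)$, matters more as the coefficient grows. The sketch below compares it with the rule-of-thumb reading $100 \times \beta$; the helper name `log_points_to_percent` and the two extra coefficient values are illustrative assumptions.

```python
import math

def log_points_to_percent(beta):
    """Exact percentage change implied by a dummy coefficient in a log-level model."""
    return (math.exp(beta) - 1) * 100

for beta in (-0.02, -0.1485, -0.50):
    approx = 100 * beta                   # rule-of-thumb reading
    exact = log_points_to_percent(beta)   # exact conversion
    print(f"beta = {beta:+.4f}: approx {approx:+.1f}%, exact {exact:+.1f}%")
```

For small coefficients the two readings nearly coincide; for large ones the approximation overstates a negative gap.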

Visualization

python

# Plot the data and the fitted regression lines by gender
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: wages in levels
for gender, label, color in [(0, 'Men', 'blue'), (1, 'Women', 'red')]:
    mask = df['female'] == gender
    axes[0].scatter(df.loc[mask, 'education'], df.loc[mask, 'wage'],
                    alpha=0.3, label=label, color=color)

axes[0].set_xlabel('Years of education')
axes[0].set_ylabel('Wage (thousand yuan/month)')
axes[0].set_title('Level-Level model')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right panel: log wages
for gender, label, color in [(0, 'Men', 'blue'), (1, 'Women', 'red')]:
    mask = df['female'] == gender
    axes[1].scatter(df.loc[mask, 'education'], df.loc[mask, 'log_wage'],
                    alpha=0.3, label=label, color=color)

    # Fitted line: the female dummy shifts only the intercept
    edu_range = np.linspace(df['education'].min(), df['education'].max(), 100)
    log_wage_pred = (model.params['const'] +
                     model.params['education'] * edu_range +
                     model.params['female'] * gender)
    axes[1].plot(edu_range, log_wage_pred, color=color, linewidth=2)

axes[1].set_xlabel('Years of education')
axes[1].set_ylabel('log(wage)')
axes[1].set_title('Log-Level model (parallel regression lines)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key observation: the two regression lines are parallel (same slope) and differ only in their intercepts

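The Dummy Variable Trap

Perfect collinearity

Incorrect approach: creating a dummy variable for every category

python

# Incorrect: include both `female` and `male` alongside the constant
df['male'] = 1 - df['female']
X_wrong = sm.add_constant(df[['education', 'female', 'male']])

# statsmodels' OLS solves with a pseudo-inverse by default, so fit() does not
# raise an error; the problem shows up as a rank-deficient design matrix
rank = np.linalg.matrix_rank(X_wrong.values)
print(f"columns = {X_wrong.shape[1]}, rank = {rank}")

model_wrong = sm.OLS(y, X_wrong).fit()  # runs, but the coefficients are not identified

Output:

columns = 4, rank = 3

Reason:

$$female_i + male_i = 1 = const_i$$

The two dummies sum to the constant column, so the regressors are perfectly collinear and the individual coefficients cannot be separately identified.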

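Correct approach: drop one category as the reference category

Rules:

$k$ categories → create $k - 1$ dummy variables
The dropped category becomes the reference group (baseline)
All coefficients are interpreted as differences relative to the reference group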
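The rule can be checked numerically. The sketch below is an illustration on a hypothetical three-level variable (the labels `a`, `b`, `c` are made up): it builds one design matrix with a constant plus all dummies and another with a constant plus all but one, then compares column count and rank.

```python
import numpy as np

# Hypothetical categorical variable with k = 3 levels
levels = np.array(["a", "b", "c"])
x = np.array(["a", "b", "c", "a", "b", "c", "a", "b"])

# One indicator column per level: shape (n, k)
all_dummies = (x[:, None] == levels[None, :]).astype(float)
const = np.ones((len(x), 1))

X_trap = np.hstack([const, all_dummies])        # constant + k dummies
X_ok = np.hstack([const, all_dummies[:, 1:]])   # constant + (k - 1) dummies; "a" is the reference

print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))   # more columns than rank
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))       # full column rank
```

Dropping any one of the dummies (or the constant) restores full column rank; which one is dropped only changes what the coefficients are measured against.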

Multi-Category Variables

Example: regional wage differences

Suppose there are four regions: 东部 (East), 中部 (Central), 西部 (West), and 东北 (Northeast)

python

# Generate data
np.random.seed(123)
n = 800

education = np.random.normal(13, 3, n)
region = np.random.choice(['东部', '中部', '西部', '东北'], n)

# Regional wage premia in log points (东北 has the lowest level, 东部 the highest)
region_effect = {
    '东部': 0.20,   # East
    '中部': 0.10,   # Central
    '西部': 0.05,   # West
    '东北': 0.00    # Northeast
}

log_wage = 1.5 + 0.08*education + np.array([region_effect[r] for r in region]) + np.random.normal(0, 0.3, n)

df_region = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'region': region
})

print("Mean log wage by region:")
print(df_region.groupby('region')['log_wage'].mean().sort_values(ascending=False))

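Method 1: using pandas.get_dummies()

python

# drop_first=True drops the first category, so order the categories explicitly
# to make 东部 (East) the reference group; with plain strings the first level in
# sorted order (东北) would be dropped instead
region_cat = df_region['region'].astype(
    pd.CategoricalDtype(['东部', '中部', '西部', '东北'])
)
# dtype=float avoids boolean dummy columns, which recent pandas versions return by default
region_dummies = pd.get_dummies(region_cat, prefix='region', drop_first=True, dtype=float)
print("Dummy variables:")
print(region_dummies.head())

# Combine with the other variables
df_region_model = pd.concat([df_region[['log_wage', 'education']], region_dummies], axis=1)

# Regression
X = sm.add_constant(df_region_model.drop('log_wage', axis=1))
y = df_region_model['log_wage']
model_region = sm.OLS(y, X).fit()

print("\nRegression results:")
print(model_region.summary())

Output: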

==============================================================================
                    coef    std err          t      P>|t|      [0.025    0.975]
------------------------------------------------------------------------------
const              1.678      0.053     31.660      0.000       1.574     1.782
education          0.080      0.004     20.000      0.000       0.072     0.088
region_中部       -0.098      0.030     -3.267      0.001      -0.157    -0.039
region_西部       -0.145      0.030     -4.833      0.000      -0.204    -0.086
region_东北       -0.195      0.030     -6.500      0.000      -0.254    -0.136
==============================================================================

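Interpretation:

Reference group: 东部 (East, the highest-wage region)
中部 (Central): about 9.8 log points, or 9.3%, below 东部
西部 (West): about 14.5 log points, or 13.5%, below 东部
东北 (Northeast): about 19.5 log points, or 17.7%, below 东部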
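Each coefficient compares one region with the reference group, but gaps between two non-reference regions follow from differences of coefficients. The sketch below uses the point estimates copied from the output table above; the helper `gap_in_percent` is a hypothetical name introduced for illustration.

```python
import math

# Coefficients copied from the output table above (reference group: 东部)
coef = {"中部": -0.098, "西部": -0.145, "东北": -0.195}

def gap_in_percent(group_a, group_b):
    """Percentage wage difference of group_a relative to group_b, holding education fixed."""
    # The reference group is not in the table; its implied coefficient is 0
    diff = coef.get(group_a, 0.0) - coef.get(group_b, 0.0)
    return (math.exp(diff) - 1) * 100

print(f"{gap_in_percent('中部', '西部'):.1f}%")   # Central relative to West
print(f"{gap_in_percent('东北', '东部'):.1f}%")   # Northeast relative to East
```

This gives only the point estimate. A standard error for such a contrast also needs the covariance between the two coefficient estimates, which the table does not show.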

Method 2: using a patsy formula (recommended)

python

import statsmodels.formula.api as smf

# Use the formula interface (C(...) creates the dummy variables automatically)
model_formula = smf.ols('log_wage ~ education + C(region)', data=df_region).fit()
print(model_formula.summary())

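Advantages:

Dummy variables are created automatically
The reference group is chosen automatically (the first level in sorted order; for these labels that is 东北, not 东部)
More concise code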
Changing the reference group

python

# Use Treatment coding and name the reference group explicitly
from patsy import Treatment

model_ref_east = smf.ols(
    'log_wage ~ education + C(region, Treatment(reference="东部"))',
    data=df_region
).fit()

print("Reference group = 东部 (East):")
print(model_ref_east.params)

# For comparison: reference group = 东北 (Northeast, the lowest-wage region)
model_ref_northeast = smf.ols(
    'log_wage ~ education + C(region, Treatment(reference="东北"))',
    data=df_region
).fit()

print("\nReference group = 东北 (Northeast):")
print(model_ref_northeast.params)
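Changing the reference group re-expresses the same model: the coefficients change, the fitted values do not. The sketch below illustrates this with plain NumPy least squares on hypothetical data (three made-up groups coded 0, 1, 2, and illustrative effect sizes); it is not the tutorial's regional dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
group = rng.integers(0, 3, n)               # three hypothetical groups: 0, 1, 2
true_level = np.array([0.20, 0.10, 0.00])   # illustrative group effects
y = 1.5 + true_level[group] + rng.normal(0, 0.3, n)

def fit_with_reference(ref):
    """OLS of y on a constant and dummies for every group except `ref`."""
    others = [g for g in range(3) if g != ref]
    X = np.column_stack([np.ones(n)] + [(group == g).astype(float) for g in others])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

beta_ref0, fitted_ref0 = fit_with_reference(0)
beta_ref2, fitted_ref2 = fit_with_reference(2)

print("reference = group 0:", beta_ref0.round(3))
print("reference = group 2:", beta_ref2.round(3))
print("same fitted values:", np.allclose(fitted_ref0, fitted_ref2))
```

In both fits the constant equals the mean outcome of the reference group, and each dummy coefficient is another group's mean minus that reference mean.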

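Interaction Effects

What is an interaction effect?

Definition: the effect of one variable depends on the value of another variable

Mathematical form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + \varepsilon$$

Marginal effect:

$$\frac{\partial Y}{\partial X_1} = \beta_1 + \beta_3 X_2$$

The effect of $X_1$ varies with $X_2$!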
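A short worked derivation may help. It is a sketch for the generic two-regressor interaction model, with symbols $\beta_0, \dots, \beta_3$, $X_1$, $X_2$ chosen for illustration, and it shows that the interaction is symmetric: $\beta_3$ governs how each variable's effect changes with the other.

```latex
\begin{aligned}
E[Y \mid X_1, X_2] &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \\
\frac{\partial E[Y \mid X_1, X_2]}{\partial X_1} &= \beta_1 + \beta_3 X_2
  \quad\Rightarrow\quad \text{at } X_2 = 0 \text{ the effect of } X_1 \text{ is } \beta_1 \\
\frac{\partial E[Y \mid X_1, X_2]}{\partial X_2} &= \beta_2 + \beta_3 X_1
  \quad\Rightarrow\quad \text{at } X_1 = 0 \text{ the effect of } X_2 \text{ is } \beta_2 \\
\frac{\partial^2 E[Y \mid X_1, X_2]}{\partial X_1 \, \partial X_2} &= \beta_3
\end{aligned}
```

One consequence is that $\beta_1$ and $\beta_2$ are no longer "the" effects of $X_1$ and $X_2$: each is the effect at the point where the other variable equals zero.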

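Example 1: gender differences in the return to education

Research question: does the wage return to education differ by gender?

$$\log(wage_i) = \beta_0 + \beta_1 \, education_i + \beta_2 \, female_i + \beta_3 \, (education_i \times female_i) + \varepsilon_i$$

Interpretation:

Return to education for men: $\beta_1$
Return to education for women: $\beta_1 + \beta_3$
$\beta_3$: the gender difference in returns (women vs. men)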

python

# Generate data (return to education: 8% for men, 6% for women)
np.random.seed(456)
n = 600

education = np.random.normal(13, 3, n)
female = np.random.binomial(1, 0.5, n)

# Interaction effect: the return to education is lower for women
log_wage = (1.5 + 
            0.08 * education + 
            0.10 * female - 
            0.02 * education * female + 
            np.random.normal(0, 0.3, n))

df_interact = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'female': female
})

# Create the interaction term
df_interact['edu_x_female'] = df_interact['education'] * df_interact['female']

# Fit the regression with HC3 robust standard errors
X = sm.add_constant(df_interact[['education', 'female', 'edu_x_female']])
y = df_interact['log_wage']
model_interact = sm.OLS(y, X).fit(cov_type='HC3')

print(model_interact.summary())

Output:

==============================================================================
                    coef    std err          z      P>|z|      [0.025    0.975]
------------------------------------------------------------------------------
const              1.498      0.078     19.205      0.000       1.345     1.651
education          0.080      0.006     13.333      0.000       0.068     0.092
female             0.112      0.110      1.018      0.309      -0.104     0.328
edu_x_female      -0.020      0.008     -2.500      0.013      -0.036    -0.004
==============================================================================

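Interpretation:

python

# Marginal effects
beta_1 = model_interact.params['education']
beta_3 = model_interact.params['edu_x_female']

print(f"Return to education, men: {beta_1*100:.2f}% per year")
print(f"Return to education, women: {(beta_1 + beta_3)*100:.2f}% per year")
print(f"Gender difference: {beta_3*100:.2f} percentage points")

# Test whether the interaction term is significant
p_value = model_interact.pvalues['edu_x_female']
print(f"\nInteraction term p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: the return to education differs significantly by gender")
else:
    print("Conclusion: no significant gender difference in the return to education at the 5% level")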
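The women's return, $\beta_1 + \beta_3$, is a sum of two estimates, so its standard error depends on their covariance as well as their variances. The sketch below shows that calculation. All numbers are hypothetical, chosen for illustration and not taken from a fitted model.

```python
import numpy as np

# Hypothetical estimates in the order [const, education, female, edu_x_female]
params = np.array([1.50, 0.080, 0.11, -0.020])

# Hypothetical covariance matrix of the estimates (illustrative values only)
cov = np.diag([0.078, 0.006, 0.110, 0.008]) ** 2
cov[1, 3] = cov[3, 1] = -0.000036   # assume the two slope estimates are negatively correlated

# Women's return to education is the linear combination beta_1 + beta_3
c = np.array([0.0, 1.0, 0.0, 1.0])
estimate = c @ params
std_err = np.sqrt(c @ cov @ c)      # sqrt(Var(b1) + Var(b3) + 2 Cov(b1, b3))

print(f"estimate = {estimate:.3f}, std. error = {std_err:.4f}")
```

With a fitted statsmodels result, the covariance matrix would come from `cov_params()` rather than being typed in. Adding the two standard errors directly would ignore the covariance term.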

Visualizing the interaction

python

# Plot the fitted regression lines by gender (not parallel)
plt.figure(figsize=(10, 6))

for gender, label, color in [(0, 'Men', 'blue'), (1, 'Women', 'red')]:
    mask = df_interact['female'] == gender
    plt.scatter(df_interact.loc[mask, 'education'],
                df_interact.loc[mask, 'log_wage'],
                alpha=0.3, label=label, color=color)

    # Fitted line: the interaction term changes the slope for women
    edu_range = np.linspace(df_interact['education'].min(),
                            df_interact['education'].max(), 100)
    log_wage_pred = (model_interact.params['const'] +
                     model_interact.params['education'] * edu_range +
                     model_interact.params['female'] * gender +
                     model_interact.params['edu_x_female'] * edu_range * gender)
    plt.plot(edu_range, log_wage_pred, color=color, linewidth=2,
             label=f'{label} (fitted line)')

plt.xlabel('Years of education')
plt.ylabel('log(wage)')
plt.title('Education-wage relationship by gender (non-parallel regression lines)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Key observation: the regression lines are not parallel (the slopes differ)
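With non-parallel lines the gender gap is no longer a single number: it depends on the education level. The sketch below evaluates the gap implied by the true coefficients of this example's simulated DGP (0.10 on `female`, -0.02 on the interaction); the function name `gender_gap` is a hypothetical helper.

```python
# Coefficients from the simulated DGP in this example
beta_female = 0.10      # intercept shift for women
beta_interact = -0.02   # change in the education slope for women

def gender_gap(education):
    """Log-wage difference (women minus men) at a given education level."""
    return beta_female + beta_interact * education

for edu in (10, 13, 16):
    print(f"education = {edu}: gap = {gender_gap(edu):+.2f} log points")

# Education level at which the two lines cross (gap = 0)
crossing = -beta_female / beta_interact
print(f"lines cross at education = {crossing:.1f} years")
```

The coefficient on `female` alone is the gap at zero years of education, a point outside the range of the data, so it should not be read as the gender gap.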

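Example 2: the interaction between experience and education

Research question: does the value of work experience depend on education level?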

python

# Generate data
np.random.seed(789)
n = 500

education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)

# Interaction effect: the more education, the more valuable experience is
log_wage = (1.0 + 
            0.06 * education + 
            0.01 * experience + 
            0.002 * education * experience + 
            np.random.normal(0, 0.3, n))

df_exp_edu = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'experience': experience
})

# Formula interface: `a * b` expands to a + b + a:b, so the interaction term is created automatically
model_exp_edu = smf.ols('log_wage ~ education * experience', data=df_exp_edu).fit()
print(model_exp_edu.summary())

Visualization:

python

# Plot the experience-wage profile at different education levels
fig = plt.figure(figsize=(10, 6))

edu_levels = [10, 13, 16]  # high school, bachelor's, graduate
colors = ['red', 'blue', 'green']

for edu, color, label in zip(edu_levels, colors, ['High school', "Bachelor's", 'Graduate']):
    exp_range = np.linspace(0, 30, 100)
    log_wage_pred = model_exp_edu.predict(pd.DataFrame({
        'education': [edu] * 100,
        'experience': exp_range
    }))
    plt.plot(exp_range, log_wage_pred, color=color, linewidth=2, label=label)

plt.xlabel('Work experience (years)')
plt.ylabel('log(wage)')
plt.title('Effect of experience on wages: moderation by education level')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

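Marginal effect analysis

python

# Marginal effect of experience at different education levels
def marginal_effect_experience(edu, model):
    beta_exp = model.params['experience']
    beta_interact = model.params['education:experience']
    return beta_exp + beta_interact * edu

for edu, label in [(10, 'High school'), (13, "Bachelor's"), (16, 'Graduate')]:
    me = marginal_effect_experience(edu, model_exp_edu)
    print(f"{label} ({edu} years of education): marginal return to experience = {me*100:.2f}% per year")

Output (values implied by the true DGP; estimates from the simulated sample will differ slightly):

High school (10 years of education): marginal return to experience = 3.00% per year
Bachelor's (13 years of education): marginal return to experience = 3.60% per year
Graduate (16 years of education): marginal return to experience = 4.20% per year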

Subgroup Regressions vs. Interaction Terms

Comparing the two approaches

Method 1: separate regressions

python

# Regress separately for men and women
# (df_interact contains only education and female, so education is the sole regressor)
model_male = smf.ols('log_wage ~ education',
                      data=df_interact[df_interact['female'] == 0]).fit()
model_female = smf.ols('log_wage ~ education',
                        data=df_interact[df_interact['female'] == 1]).fit()

print("Men:")
print(model_male.params)
print("\nWomen:")
print(model_female.params)

Method 2: interaction terms

python

# Fully interacted model (every coefficient is allowed to differ by gender)
model_full_interact = smf.ols('log_wage ~ education * female',
                               data=df_interact).fit()
print("Fully interacted model:")
print(model_full_interact.params)

Testing whether the coefficients are equal (Chow test)

Null hypothesis: the regression coefficients are the same in both groups

python

from scipy.stats import f

# F test
# SSR_pooled: SSR of the pooled regression
# SSR_separate: sum of the SSRs of the two group regressions
# k: number of parameters per group
# n1, n2: sample sizes of the two groups

model_pooled = smf.ols('log_wage ~ education', data=df_interact).fit()
SSR_pooled = model_pooled.ssr
SSR_separate = model_male.ssr + model_female.ssr

k = 2  # const + education
n1 = (df_interact['female'] == 0).sum()
n2 = (df_interact['female'] == 1).sum()

F_stat = ((SSR_pooled - SSR_separate) / k) / (SSR_separate / (n1 + n2 - 2*k))
p_value = f.sf(F_stat, k, n1 + n2 - 2*k)  # upper-tail probability

print(f"\nChow Test:")
print(f"F statistic: {F_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Conclusion: reject equal coefficients; use separate regressions or interaction terms")
else:
    print("Conclusion: cannot reject equal coefficients; a pooled regression is acceptable")
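The two methods are closely linked: a fully interacted model reproduces the coefficients of the separate regressions, and its SSR equals the sum of the group SSRs, which is why the Chow statistic can use either. The sketch below checks this with plain NumPy least squares. The data are freshly simulated here, and the names `x`, `d`, `ols`, and `design` are illustrative, not objects from the code above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.normal(13, 3, n)    # hypothetical regressor (e.g. education)
d = rng.integers(0, 2, n)   # hypothetical group dummy (e.g. female)
y = 1.5 + 0.08 * x + 0.10 * d - 0.02 * x * d + rng.normal(0, 0.3, n)

def ols(X, y):
    """Return OLS coefficients and the sum of squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

def design(x):
    """Design matrix with a constant and one regressor."""
    return np.column_stack([np.ones(len(x)), x])

# Separate regressions, one per group
b0, ssr0 = ols(design(x[d == 0]), y[d == 0])
b1, ssr1 = ols(design(x[d == 1]), y[d == 1])

# Fully interacted model on the pooled sample: const, x, d, x*d
b_int, ssr_int = ols(np.column_stack([np.ones(n), x, d, x * d]), y)

print("group 0:", b0.round(4), " interacted:", b_int[:2].round(4))
print("group 1:", b1.round(4), " interacted:", (b_int[:2] + b_int[2:]).round(4))
print("SSR equal:", np.isclose(ssr0 + ssr1, ssr_int))
```

The point estimates agree exactly, but conventional standard errors can differ, because the interacted model estimates a single error variance for both groups.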

Worked Example: A Complete Wage Equation

python

# Comprehensive example: combines every type of variable covered in this section
np.random.seed(2024)
n = 1000

# Generate variables
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
female = np.random.binomial(1, 0.5, n)
region = np.random.choice(['东部', '中部', '西部'], n, p=[0.4, 0.3, 0.3])
married = np.random.binomial(1, 0.6, n)

# DGP
region_effect = {'东部': 0.15, '中部': 0.05, '西部': 0.00}
log_wage = (1.2 + 
            0.07 * education + 
            0.03 * experience - 
            0.0005 * experience**2 -
            0.12 * female +
            0.08 * married -
            0.015 * education * female +  # gender difference in the return to education
            np.array([region_effect[r] for r in region]) +
            np.random.normal(0, 0.3, n))

df_full = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'experience': experience,
    'female': female,
    'region': region,
    'married': married
})

# Full model
formula = '''
log_wage ~ education + experience + I(experience**2) + 
           female + C(region) + married + 
           education:female
'''
model_full = smf.ols(formula, data=df_full).fit(cov_type='HC3')

print("Full wage equation:")
print(model_full.summary())

Predicting wages for different groups

python

# Example predictions
scenarios = pd.DataFrame({
    'education': [12, 16, 16, 16],
    'experience': [5, 10, 10, 10],
    'female': [0, 0, 1, 1],
    'region': ['东部', '东部', '东部', '中部'],
    'married': [0, 1, 1, 1],
    'label': ['Male, high school, East, 5 years of experience',
              "Male, bachelor's, East, married, 10 years of experience",
              "Female, bachelor's, East, married, 10 years of experience",
              "Female, bachelor's, Central, married, 10 years of experience"]
})

scenarios['log_wage_pred'] = model_full.predict(scenarios)
scenarios['wage_pred'] = np.exp(scenarios['log_wage_pred'])

print("\nPredicted wages for different groups:")
print(scenarios[['label', 'wage_pred']])

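Section Summary

Key takeaways

Concept	Key point
Dummy variables	$k$ categories → $k - 1$ dummy variables
Reference group	The dropped category; all coefficients are measured relative to it
Interaction effect	The effect of one variable depends on another variable
Marginal effect	$\partial Y / \partial X_1 = \beta_1 + \beta_3 X_2$

Python tools

Task	Tool
Create dummy variables	`pd.get_dummies()`
Formula interface	`smf.ols('y ~ C(x)')`
Interaction terms	`smf.ols('y ~ x1 * x2')`
Marginal effects	Computed by hand, or plotted with `statsmodels.graphics`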

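Next Section

In the next section, we will cover:

The art of interpreting coefficients (Level-Level, Log-Level, Log-Log)
Publication-quality regression tables
Conventions for reporting results
Visualizing regression results

From model to paper: presenting results professionally!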
