11.5 经典案例和Python实现

"A picture is worth a thousand regressions.""一张图胜过一千次回归。"— David Lee, On RDD Graphical Evidence (关于RDD图形证据)

从经典论文学习RDD的最佳实践

本节概要

本节将深入复现三个经典RDD研究：

Angrist & Lavy (1999)：班级规模与学生成绩（Fuzzy RDD）
Lee (2008)：选举优势与连任（Sharp RDD）
Carpenter & Dobkin (2009)：最低饮酒年龄与死亡率（Sharp RDD）

每个案例都包含完整的Python实现、数据模拟和可视化。

案例 1：Angrist & Lavy (1999) - 班级规模与学生成绩

研究背景

研究问题：减少班级规模是否提高学生成绩？

挑战：

班级规模是内生的（好学校可能有更小的班级）
简单OLS回归会有偏差

自然实验：Maimonides' Rule（迈蒙尼德规则）

以色列法律规定：班级人数不得超过40人
如果学校有41名学生 → 必须分成2个班（每班≈20人）
如果学校有40名学生 → 1个班（40人）

研究设计（Fuzzy RDD）

驱动变量（Running Variable）：

断点（Cutoffs）：

处理变量（Treatment）：

工具变量（Instrument）：

公式：

为什么是 Fuzzy RDD？

规则不是完美执行（有些学校不遵守）
班级规模在断点处跳跃，但不是从0到1

Python 实现（模拟数据）

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from scipy import stats

# 设置
np.random.seed(42)
sns.set_style("whitegrid")

# 生成数据
n_schools = 500

# 学校年级学生数（聚集在40的倍数附近）
enrollment = []
for _ in range(n_schools):
    # 随机选择一个"目标"人数（40的倍数附近）
    target = np.random.choice([40, 80, 120]) + np.random.randint(-15, 16)
    enrollment.append(max(20, target))

enrollment = np.array(enrollment)

# Maimonides' Rule：预测的班级规模
def maimonides_rule(n_students):
    """根据Maimonides' Rule计算预测的班级规模"""
    n_classes = int((n_students - 1) / 40) + 1
    return n_students / n_classes

predicted_class_size = np.array([maimonides_rule(n) for n in enrollment])

# 实际班级规模（有噪音，不完美遵守规则）
# 真实班级规模 = 预测值 + 噪音
actual_class_size = predicted_class_size + np.random.normal(0, 3, n_schools)
actual_class_size = np.clip(actual_class_size, 15, 45)  # 限制在合理范围

# 生成成绩（班级规模的因果效应 = -0.3）
# DGP: test_score = 70 - 0.3 * class_size + noise
true_effect = -0.3  # 班级规模增加1人，成绩下降0.3分
test_score = 70 - 0.3 * actual_class_size + np.random.normal(0, 5, n_schools)

# 创建数据框
df_angrist = pd.DataFrame({
    'enrollment': enrollment,
    'predicted_class_size': predicted_class_size,
    'actual_class_size': actual_class_size,
    'test_score': test_score
})

print("=" * 70)
print("Angrist & Lavy (1999) 数据模拟")
print("=" * 70)
print(df_angrist.head(10))
print(f"\n真实因果效应: {true_effect:.2f}")

可视化：Maimonides' Rule

python

# 绘制 Maimonides' Rule
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 子图1：预测的班级规模
ax1 = axes[0]
enrollment_range = np.arange(20, 160)
predicted_range = [maimonides_rule(n) for n in enrollment_range]

ax1.plot(enrollment_range, predicted_range, color='red', linewidth=2.5,
         label='Maimonides\' Rule (Predicted)')
ax1.scatter(df_angrist['enrollment'], df_angrist['predicted_class_size'],
            alpha=0.3, s=30, color='blue', label='Schools')

# 断点
for cutoff in [40, 80, 120]:
    ax1.axvline(x=cutoff, color='green', linestyle='--', linewidth=1.5, alpha=0.6)

ax1.set_xlabel('Enrollment (学生总数)', fontsize=13, fontweight='bold')
ax1.set_ylabel('Predicted Class Size', fontsize=13, fontweight='bold')
ax1.set_title('Maimonides\' Rule', fontsize=15, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# 子图2：实际 vs 预测班级规模
ax2 = axes[1]
ax2.scatter(df_angrist['predicted_class_size'], df_angrist['actual_class_size'],
            alpha=0.5, s=40, edgecolors='black', linewidth=0.5)
ax2.plot([15, 45], [15, 45], 'r--', linewidth=2, label='45° Line (Perfect Compliance)')

ax2.set_xlabel('Predicted Class Size', fontsize=13, fontweight='bold')
ax2.set_ylabel('Actual Class Size', fontsize=13, fontweight='bold')
ax2.set_title('First Stage: Compliance with Maimonides\' Rule',
              fontsize=15, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Fuzzy RDD 估计

python

from rdrobust import rdrobust

# 重新编码：以40为中心
df_angrist['enrollment_centered'] = df_angrist['enrollment'] - 40

# Fuzzy RDD（使用predicted_class_size作为工具变量）
# 第一阶段：预测的班级规模 vs 实际班级规模
print("\n" + "=" * 70)
print("第一阶段（First Stage）")
print("=" * 70)

first_stage = rdrobust(
    y=df_angrist['actual_class_size'],
    x=df_angrist['enrollment'],
    c=40,
    fuzzy=None  # Sharp RDD for first stage
)
print(f"预测班级规模在断点处的跳跃: {first_stage.coef[0]:.3f}")
print(f"F-统计量: {first_stage.z[0]**2:.2f}")

# Fuzzy RDD：使用工具变量
print("\n" + "=" * 70)
print("Fuzzy RDD 估计")
print("=" * 70)

# 注意：rdrobust 的 fuzzy 参数接受实际处理变量
fuzzy_result = rdrobust(
    y=df_angrist['test_score'],
    x=df_angrist['enrollment'],
    c=40,
    fuzzy=df_angrist['actual_class_size']
)

print(f"班级规模对成绩的因果效应: {fuzzy_result.coef[0]:.3f}")
print(f"p-value: {fuzzy_result.pval[0]:.4f}")
print(f"95% CI: [{fuzzy_result.ci[0][0]:.3f}, {fuzzy_result.ci[0][1]:.3f}]")
print(f"\n真实效应: {true_effect:.3f}")

# 手动实现 Fuzzy RDD（2SLS）
print("\n" + "=" * 70)
print("手动实现：2SLS")
print("=" * 70)

# 限制样本在断点附近
df_local = df_angrist[np.abs(df_angrist['enrollment_centered']) <= 20].copy()
df_local['D'] = (df_local['enrollment'] >= 40).astype(int)

# 第一阶段
fs_model = smf.ols('actual_class_size ~ D + enrollment_centered + D:enrollment_centered',
                   data=df_local).fit()
df_local['class_size_hat'] = fs_model.predict(df_local)

print("第一阶段回归:")
print(f"  断点效应（F-stat）: {fs_model.fvalue:.2f}")

# 第二阶段
ss_model = smf.ols('test_score ~ class_size_hat + enrollment_centered',
                   data=df_local).fit()

print("\n第二阶段回归:")
print(f"  班级规模效应: {ss_model.params['class_size_hat']:.3f}")
print(f"  SE: {ss_model.bse['class_size_hat']:.3f}")

Angrist & Lavy 的发现：

班级规模减少1人 → 成绩提高约0.2-0.3个标准差
小班教学确实有益（尤其对弱势学生）

️ 案例 2：Lee (2008) - 选举优势与连任

研究背景

研究问题：现任议员的身份是否带来连任优势？

挑战：

现任身份是内生的（能力强的候选人更可能当选）
观察性比较会有选择性偏差

自然实验：选举得票率

得票率 ≥ 50% → 当选（成为现任）
得票率 < 50% → 落选

研究设计（Sharp RDD）

驱动变量：

断点：

处理：

结果：

关键直觉：

得票率49.9%的候选人 vs 50.1%的候选人
这两人几乎一样（能力、资金、选民支持）
唯一差别：一个当选（成为现任），一个落选

Python 实现（模拟数据）

python

# 生成数据
n_elections = 1000

# 当前选举的得票率（集中在50%附近）
vote_share_t = 50 + np.random.normal(0, 10, n_elections)

# 是否当选
won_election = (vote_share_t >= 50).astype(int)

# 下次选举的得票率
# 现任优势：当选者在下次选举中得票率提高8个百分点
incumbency_advantage = 8
vote_share_t1 = (vote_share_t + incumbency_advantage * won_election +
                 np.random.normal(0, 8, n_elections))

# 创建数据框
df_lee = pd.DataFrame({
    'vote_share_t': vote_share_t,
    'vote_margin_t': vote_share_t - 50,  # 中心化
    'won_election': won_election,
    'vote_share_t1': vote_share_t1
})

# 限制在合理范围
df_lee = df_lee[(df_lee['vote_share_t'] >= 30) & (df_lee['vote_share_t'] <= 70)]

print("=" * 70)
print("Lee (2008) 数据模拟")
print("=" * 70)
print(df_lee.head(10))
print(f"\n真实现任优势: {incumbency_advantage:.1f} 个百分点")

可视化：选举RDD

python

from rdrobust import rdplot

# RDD 图
fig, ax = plt.subplots(figsize=(14, 8))

# 使用 rdplot（手动实现以便控制）
# Binning
vote_bins = pd.cut(df_lee['vote_margin_t'], bins=30)
df_binned = df_lee.groupby([vote_bins, 'won_election']).agg({
    'vote_share_t1': 'mean',
    'vote_margin_t': 'mean'
}).reset_index()

df_binned_left = df_binned[df_binned['won_election'] == 0]
df_binned_right = df_binned[df_binned['won_election'] == 1]

# 散点图
ax.scatter(df_binned_left['vote_margin_t'], df_binned_left['vote_share_t1'],
           s=100, alpha=0.7, color='blue', edgecolors='black', linewidths=1.5,
           label='Lost election (t)')
ax.scatter(df_binned_right['vote_margin_t'], df_binned_right['vote_share_t1'],
           s=100, alpha=0.7, color='red', edgecolors='black', linewidths=1.5,
           label='Won election (t)')

# 拟合线
df_local = df_lee[np.abs(df_lee['vote_margin_t']) <= 15]
df_left = df_local[df_local['won_election'] == 0]
df_right = df_local[df_local['won_election'] == 1]

from sklearn.linear_model import LinearRegression
if len(df_left) > 0:
    lr_left = LinearRegression().fit(
        df_left[['vote_margin_t']], df_left['vote_share_t1']
    )
    X_range_left = np.linspace(df_left['vote_margin_t'].min(), 0, 100).reshape(-1, 1)
    ax.plot(X_range_left, lr_left.predict(X_range_left),
            color='blue', linewidth=3)

if len(df_right) > 0:
    lr_right = LinearRegression().fit(
        df_right[['vote_margin_t']], df_right['vote_share_t1']
    )
    X_range_right = np.linspace(0, df_right['vote_margin_t'].max(), 100).reshape(-1, 1)
    ax.plot(X_range_right, lr_right.predict(X_range_right),
            color='red', linewidth=3)

# 断点
ax.axvline(x=0, color='green', linestyle='--', linewidth=2.5)

# 标注
y_left_at_cutoff = lr_left.predict([[0]])[0]
y_right_at_cutoff = lr_right.predict([[0]])[0]
jump = y_right_at_cutoff - y_left_at_cutoff

ax.annotate('', xy=(0.5, y_right_at_cutoff), xytext=(0.5, y_left_at_cutoff),
            arrowprops=dict(arrowstyle='<->', color='purple', lw=3))
ax.text(1, (y_left_at_cutoff + y_right_at_cutoff) / 2,
        f'现任优势\n≈ {jump:.1f} pp',
        fontsize=12, color='purple', fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

ax.set_xlabel('Vote Margin at t (%)', fontsize=13, fontweight='bold')
ax.set_ylabel('Vote Share at t+1 (%)', fontsize=13, fontweight='bold')
ax.set_title('Lee (2008): Incumbency Advantage in U.S. House Elections',
             fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

RDD 估计

python

# Sharp RDD
print("\n" + "=" * 70)
print("Sharp RDD 估计")
print("=" * 70)

lee_result = rdrobust(
    y=df_lee['vote_share_t1'],
    x=df_lee['vote_margin_t'],
    c=0
)

print(f"现任优势: {lee_result.coef[0]:.3f} 个百分点")
print(f"稳健 p-value: {lee_result.pval[0]:.4f}")
print(f"95% CI: [{lee_result.ci[0][0]:.3f}, {lee_result.ci[0][1]:.3f}]")
print(f"\n真实优势: {incumbency_advantage:.1f} 个百分点")

# 有效性检验：密度检验
print("\n" + "=" * 70)
print("McCrary 密度检验（是否有操控？）")
print("=" * 70)

try:
    from rddensity import rddensity
    density_test = rddensity(X=df_lee['vote_margin_t'], c=0)
    print(f"p-value: {density_test.pval[0]:.4f}")
    print(f"结论: {' 无操控证据' if density_test.pval[0] > 0.05 else ' 可能存在操控'}")
except ImportError:
    print("未安装 rddensity 包")

Lee 的发现：

现任优势巨大：约40个百分点！
刚好赢得选举的候选人，在下次选举中得票率比刚好输掉的候选人高约40%

案例 3：Carpenter & Dobkin (2009) - 最低饮酒年龄

研究背景

研究问题：最低饮酒年龄（21岁）是否降低年轻人的死亡率？

自然实验：

美国法律：21岁生日后可以合法饮酒
21岁生日前后，个体几乎完全一样（除了合法饮酒权）

研究设计（Sharp RDD）

驱动变量：

断点：（21岁生日）

处理：（合法饮酒）

结果：

（每10万人）
分类：交通事故死亡、酒精中毒死亡、总死亡率

Python 实现（模拟数据）

python

# 生成数据
n_days = 365 * 4  # 19-23岁，共4年
ages = np.linspace(19, 23, n_days)

# 合法饮酒
legal_drinking = (ages >= 21).astype(int)

# 死亡率（每10万人）
# 基础死亡率随年龄轻微上升
base_mortality = 80 + 2 * (ages - 19)

# 合法饮酒导致死亡率跳升（效应 = 10）
alcohol_effect = 10
mortality_rate = base_mortality + alcohol_effect * legal_drinking + np.random.normal(0, 5, n_days)

# 创建数据框
df_alcohol = pd.DataFrame({
    'age': ages,
    'age_centered': ages - 21,
    'legal_drinking': legal_drinking,
    'mortality_rate': mortality_rate
})

print("=" * 70)
print("Carpenter & Dobkin (2009) 数据模拟")
print("=" * 70)
print(df_alcohol.head(10))
print(f"\n真实酒精效应: {alcohol_effect:.1f} 死亡/10万人")

可视化

python

fig, ax = plt.subplots(figsize=(14, 8))

# 散点图（每周平均）
df_alcohol['week'] = (df_alcohol['age'] * 52).astype(int)
df_weekly = df_alcohol.groupby('week').agg({
    'age_centered': 'mean',
    'mortality_rate': 'mean'
}).reset_index()

ax.scatter(df_weekly['age_centered'], df_weekly['mortality_rate'],
           s=30, alpha=0.6, color='gray', edgecolors='black', linewidths=0.5)

# 拟合线
df_left = df_alcohol[df_alcohol['age'] < 21]
df_right = df_alcohol[df_alcohol['age'] >= 21]

lr_left = LinearRegression().fit(df_left[['age_centered']], df_left['mortality_rate'])
lr_right = LinearRegression().fit(df_right[['age_centered']], df_right['mortality_rate'])

X_range_left = np.linspace(-2, 0, 100).reshape(-1, 1)
X_range_right = np.linspace(0, 2, 100).reshape(-1, 1)

ax.plot(X_range_left, lr_left.predict(X_range_left),
        color='blue', linewidth=3, label='Age < 21 (Illegal)')
ax.plot(X_range_right, lr_right.predict(X_range_right),
        color='red', linewidth=3, label='Age ≥ 21 (Legal)')

# 断点
ax.axvline(x=0, color='green', linestyle='--', linewidth=2.5)
ax.text(0.05, 75, '21岁生日', fontsize=12, color='green', fontweight='bold')

ax.set_xlabel('Age - 21 (years)', fontsize=13, fontweight='bold')
ax.set_ylabel('Mortality Rate (per 100,000)', fontsize=13, fontweight='bold')
ax.set_title('Carpenter & Dobkin (2009): Alcohol and Mortality',
             fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

RDD 估计

python

# Sharp RDD
print("\n" + "=" * 70)
print("Sharp RDD 估计")
print("=" * 70)

alcohol_result = rdrobust(
    y=df_alcohol['mortality_rate'],
    x=df_alcohol['age_centered'],
    c=0
)

print(f"合法饮酒对死亡率的影响: {alcohol_result.coef[0]:.3f} 死亡/10万人")
print(f"p-value: {alcohol_result.pval[0]:.4f}")
print(f"95% CI: [{alcohol_result.ci[0][0]:.3f}, {alcohol_result.ci[0][1]:.3f}]")
print(f"\n真实效应: {alcohol_effect:.1f}")

Carpenter & Dobkin 的发现：

21岁生日后，死亡率跳升约9%
主要由交通事故和酒精中毒导致
最低饮酒年龄确实降低了年轻人死亡率

关键要点

从经典案例中学到的

Angrist & Lavy (1999)：
- Fuzzy RDD 的典范
- 制度规则（Maimonides' Rule）作为IV
- 第一阶段必须强（F > 10）
Lee (2008)：
- Sharp RDD 在政治经济学中的应用
- 检验操控非常重要（选举舞弊）
- 现任优势的惊人发现
Carpenter & Dobkin (2009)：
- 公共卫生政策的RDD评估
- 断点非常清晰（法定年龄）
- 高频数据的优势

最佳实践总结

清晰的断点规则：寻找外生的、不可操控的断点
有效性检验：协变量平衡、密度检验
稳健性检验：多个带宽、多项式阶数
可视化：RDD图是论文的核心
因果解释：明确LATE的含义

本节总结

在本节中，我们复现了：

Angrist & Lavy (1999) 的 Fuzzy RDD
Lee (2008) 的选举RDD
Carpenter & Dobkin (2009) 的年龄断点RDD
完整的Python实现和可视化

关键教训：

"经典研究之所以经典，不仅因为问题重要，更因为设计巧妙、执行严谨。学习这些案例是掌握RDD的最佳途径。"

下一步：在第6节中，我们将总结RDD方法，提供练习题和文献推荐。

从经典案例学习，让你的研究也成为经典！

11.5 经典案例和Python实现 ​

本节概要 ​

案例 1：Angrist & Lavy (1999) - 班级规模与学生成绩 ​

研究背景 ​

研究设计（Fuzzy RDD） ​

Python 实现（模拟数据） ​

可视化：Maimonides' Rule ​

Fuzzy RDD 估计 ​

️ 案例 2：Lee (2008) - 选举优势与连任 ​

研究背景 ​

研究设计（Sharp RDD） ​

Python 实现（模拟数据） ​

可视化：选举RDD ​

RDD 估计 ​

案例 3：Carpenter & Dobkin (2009) - 最低饮酒年龄 ​

研究背景 ​

研究设计（Sharp RDD） ​

Python 实现（模拟数据） ​

可视化 ​

RDD 估计 ​

关键要点 ​

从经典案例中学到的 ​

最佳实践总结 ​

本节总结 ​

11.5 经典案例和Python实现

本节概要

案例 1：Angrist & Lavy (1999) - 班级规模与学生成绩

研究背景

研究设计（Fuzzy RDD）

Python 实现（模拟数据）

可视化：Maimonides' Rule

Fuzzy RDD 估计

️ 案例 2：Lee (2008) - 选举优势与连任

研究背景

研究设计（Sharp RDD）

Python 实现（模拟数据）

可视化：选举RDD

RDD 估计

案例 3：Carpenter & Dobkin (2009) - 最低饮酒年龄

研究背景

研究设计（Sharp RDD）

Python 实现（模拟数据）

可视化

RDD 估计

关键要点

从经典案例中学到的

最佳实践总结

本节总结