11.4 Bandwidth Selection and Robustness Tests
"Bandwidth choice involves a fundamental bias-variance tradeoff."— Guido Imbens & Karthik Kalyanaraman, Optimal Bandwidth Authors
Make your RDD estimates withstand reviewers' scrutiny
Section Overview
Bandwidth selection is one of the most critical decisions in RDD empirical analysis. This section introduces:
- Theory and methods of optimal bandwidth selection (IK, CCT)
- Sensitivity analysis
- Polynomial order selection
- Donut-hole RDD (excluding observations very close to cutoff)
- Complete robustness testing workflow
Core Tradeoffs in Bandwidth Selection
Bias-Variance Tradeoff
Mean Squared Error (MSE): the estimator's squared bias plus its variance,

$$\text{MSE}\big(\hat{\tau}(h)\big) = \text{Bias}\big(\hat{\tau}(h)\big)^2 + \text{Var}\big(\hat{\tau}(h)\big)$$

Small bandwidth $h$:
- Low bias: the local linear approximation is more accurate near the cutoff
- High variance: few observations, unstable estimate
Large bandwidth $h$:
- Low variance: more observations, stable estimate
- High bias: the local linearity assumption may fail far from the cutoff
Optimal bandwidth $h^*$: the value that minimizes MSE, trading bias off against variance (see the simulation sketch below):

$$h^* = \arg\min_h \, \text{MSE}\big(\hat{\tau}(h)\big)$$
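A minimal simulation sketch of this tradeoff. The data-generating process, coefficients, sample size, and bandwidth grid below are illustrative assumptions chosen so that the two sides of the cutoff have different curvature; with a known true effect, we can watch bias rise and variance fall as the bandwidth grows.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
true_tau = 2.0            # assumed true jump at the cutoff
n, n_sims = 2000, 200     # sample size per draw, Monte Carlo replications

def simulate_rdd_estimate(h):
    """Simulate one dataset and return the local linear RDD estimate with bandwidth h."""
    X = rng.uniform(-50, 50, n)
    D = (X >= 0).astype(int)
    # Different curvature on each side of the cutoff, so a wide bandwidth induces bias
    Y = 0.05 * X + 0.002 * X**2 - 0.004 * X**2 * D + true_tau * D + rng.normal(0, 1, n)
    df_sim = pd.DataFrame({'Y': Y, 'X_c': X, 'D': D})
    df_loc = df_sim[np.abs(df_sim['X_c']) <= h]
    model = smf.ols('Y ~ D + X_c + D:X_c', data=df_loc).fit()
    return model.params['D']

for h in [5, 10, 20, 40]:
    estimates = np.array([simulate_rdd_estimate(h) for _ in range(n_sims)])
    bias = estimates.mean() - true_tau
    var = estimates.var()
    print(f"h = {h:>2}: bias = {bias:+.3f}, variance = {var:.3f}, MSE = {bias**2 + var:.3f}")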
Asymptotic Theory
Fan & Gijbels (1996) derived the asymptotic form of the optimal bandwidth for local linear regression:

$$h^* = C \cdot n^{-1/5}$$

where the constant $C$ depends on the kernel, the error variance, the density of the running variable, and the curvature of the regression function.
Intuition:
- A larger sample allows a smaller optimal bandwidth (more data is available locally)
- The bandwidth shrinks at rate $n^{-1/5}$, so the estimator converges at the nonparametric rate $n^{-2/5}$, slower than the parametric rate $n^{-1/2}$
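A quick back-of-the-envelope implication of this rate (a worked example, not part of the original derivation):

$$\frac{h^*(32n)}{h^*(n)} = 32^{-1/5} = \frac{1}{2}$$

Increasing the sample 32-fold only halves the optimal bandwidth, so even very large datasets still rely on observations fairly far from the cutoff.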
Automatic Bandwidth Selection Methods
Method 1: Imbens-Kalyanaraman (IK) Bandwidth (2012)
Approach: Based on asymptotic expansion of MSE, derive data-driven optimal bandwidth.
Simplified formula (omitting the regularization terms of the full derivation):

$$\hat{h}_{IK} = C_K \left[ \frac{\hat{\sigma}^2_{-}(c) + \hat{\sigma}^2_{+}(c)}{\hat{f}(c)\,\big(\hat{m}''_{+}(c) - \hat{m}''_{-}(c)\big)^2} \right]^{1/5} n^{-1/5}$$

where:
- $C_K$: constant that depends on the kernel function
- $\hat{\sigma}^2_{-}(c)$, $\hat{\sigma}^2_{+}(c)$: residual variances to the left and right of the cutoff
- $\hat{f}(c)$: density of the running variable at the cutoff
- $\hat{m}''_{-}(c)$, $\hat{m}''_{+}(c)$: second derivatives of the outcome regression function on each side of the cutoff
Python implementation:
from rdrobust import rdbwselect
# MSE-optimal bandwidth selection ('mserd': one common bandwidth, in the spirit of IK)
bw_ik = rdbwselect(y=df['Y'], x=df['X'], c=0, bwselect='mserd')
print(f"IK Bandwidth: {bw_ik.bws[0]:.2f}")Method 2: Calonico-Cattaneo-Titiunik (CCT) Bandwidth (2014)
Improvements:
- Bias correction: Considers finite-sample bias
- Two bandwidths:
  - Main bandwidth $h$: for point estimation
  - Bias bandwidth $b$: for estimating and correcting the bias
Coverage-error-rate (CER) optimal bandwidth: a related criterion that targets confidence-interval coverage rather than the MSE of the point estimate (available in rdbwselect via bwselect='cerrd').
Python implementation (rdrobust default):
from rdrobust import rdrobust
# CCT method (default)
result_cct = rdrobust(y=df['Y'], x=df['X'], c=0)
print(f"CCT Main Bandwidth: {result_cct.bws[0]:.2f}")
print(f"CCT Bias Bandwidth: {result_cct.bws[1]:.2f}")Method 3: Cross-Validation (CV)
Leave-one-out cross-validation:
- For each observation $i$, remove it from the sample
- Fit the model on the remaining data using bandwidth $h$
- Predict $\hat{Y}_i$ for the held-out observation
- Calculate the prediction error:

$$CV(h) = \frac{1}{n} \sum_{i=1}^{n} \big(Y_i - \hat{Y}_i(h)\big)^2$$

- Choose the $h$ that minimizes $CV(h)$
Note: Use only data on the same side of the cutoff as each held-out observation (to avoid using the jump itself).
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold

def cross_validation_bandwidth(df, X_col, Y_col, cutoff, h_candidates):
"""
Cross-validation bandwidth selection (simplified)
Parameters:
- df: dataframe
- X_col, Y_col: variable names
- cutoff: cutoff
- h_candidates: candidate bandwidth list
"""
cv_scores = []
for h in h_candidates:
# Restrict sample
df_local = df[np.abs(df[X_col] - cutoff) <= h].copy()
df_local['X_c'] = df_local[X_col] - cutoff
df_local['D'] = (df_local[X_col] >= cutoff).astype(int)
        # Leave-one-out CV (simplified here: 5-fold CV instead of true leave-one-out)
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mse = []
for train_idx, test_idx in kf.split(df_local):
df_train = df_local.iloc[train_idx]
df_test = df_local.iloc[test_idx]
# Fit
model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_train).fit()
# Predict
y_pred = model.predict(df_test)
mse = np.mean((df_test[Y_col] - y_pred) ** 2)
fold_mse.append(mse)
cv_scores.append(np.mean(fold_mse))
optimal_h = h_candidates[np.argmin(cv_scores)]
return optimal_h, cv_scores
# Example
h_candidates = np.arange(5, 30, 2)
optimal_h, cv_scores = cross_validation_bandwidth(df, 'X', 'Y', 0, h_candidates)
print(f"CV Optimal Bandwidth: {optimal_h:.2f}")Robustness Tests: Sensitivity Analysis
1. Comparison Across Multiple Bandwidths
Best practice: Report estimates under multiple bandwidths.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
# Try a range of bandwidths
bandwidths = [5, 10, 15, 20, 25, 30]
results = []
for h in bandwidths:
df_local = df[np.abs(df['X']) <= h].copy()
df_local['X_c'] = df_local['X']
df_local['D'] = (df_local['X'] >= 0).astype(int)
model = smf.ols('Y ~ D + X_c + D:X_c', data=df_local).fit()
results.append({
'bandwidth': h,
'effect': model.params['D'],
'se': model.bse['D'],
'ci_lower': model.conf_int().loc['D', 0],
'ci_upper': model.conf_int().loc['D', 1],
'n_obs': len(df_local)
})
results_df = pd.DataFrame(results)
# Visualization
fig, ax = plt.subplots(figsize=(12, 7))
ax.plot(results_df['bandwidth'], results_df['effect'],
'o-', linewidth=2, markersize=8, label='Point Estimate')
ax.fill_between(results_df['bandwidth'],
results_df['ci_lower'],
results_df['ci_upper'],
alpha=0.3, label='95% CI')
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Bandwidth', fontsize=13, fontweight='bold')
ax.set_ylabel('RDD Effect Estimate', fontsize=13, fontweight='bold')
ax.set_title('Sensitivity to Bandwidth Choice', fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print table
print("=" * 80)
print("Bandwidth Sensitivity Analysis")
print("=" * 80)
print(results_df.to_string(index=False))
Interpretation:
- Robust: if the estimate is roughly stable across bandwidths, the result is credible
- Not robust: if the estimate varies wildly with the bandwidth, further investigation is needed (a quick numeric check is sketched below)
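As an informal numeric summary of the table and figure above (a rough heuristic, not a formal test), we can compare how much the point estimate moves across bandwidths with its typical sampling uncertainty, reusing results_df from the previous block:
# Spread of point estimates across bandwidths vs. typical sampling uncertainty
spread = results_df['effect'].max() - results_df['effect'].min()
typical_se = results_df['se'].mean()
print(f"Estimate range across bandwidths: {spread:.3f}")
print(f"Average standard error:           {typical_se:.3f}")
if spread < typical_se:
    print("Estimates move by less than one SE across bandwidths -> reasonably robust")
else:
    print("Estimates move by more than one SE across bandwidths -> investigate further")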
2. Sensitivity to Polynomial Order
Warning (Gelman & Imbens 2019): Do not use high-order polynomials ($p \geq 3$)!
Reasons:
- High-order polynomials prone to overfitting
- Data far from cutoff excessively influences estimates at cutoff
- Confidence intervals may be overly optimistic
Recommendation:
- Use local linear ($p = 1$, recommended)
- At most use local quadratic ($p = 2$)
- Avoid $p \geq 3$
# Compare different polynomial orders
poly_orders = [1, 2, 3]
poly_results = []
for p in poly_orders:
df_local = df[np.abs(df['X']) <= 20].copy()
df_local['X_c'] = df_local['X']
df_local['D'] = (df_local['X'] >= 0).astype(int)
# Build formula
formula_terms = ['D', 'X_c']
for k in range(2, p + 1):
formula_terms.append(f'I(X_c**{k})')
# Add interaction terms
for term in formula_terms[1:]:
formula_terms.append(f'D:{term}')
formula = 'Y ~ ' + ' + '.join(formula_terms)
model = smf.ols(formula, data=df_local).fit()
poly_results.append({
'polynomial_order': p,
'effect': model.params['D'],
'se': model.bse['D'],
'ci_lower': model.conf_int().loc['D', 0],
'ci_upper': model.conf_int().loc['D', 1]
})
poly_df = pd.DataFrame(poly_results)
print("\n" + "=" * 80)
print("Polynomial Order Sensitivity Analysis")
print("=" * 80)
print(poly_df.to_string(index=False))
print("\nRecommendation: Use linear (p=1) or quadratic (p=2), avoid high-order")3. Donut-hole RDD
Motivation: Exclude observations very close to cutoff to test robustness.
Reasons:
- If precise manipulation exists, most likely occurs near cutoff
- If results remain robust after excluding these observations, increases credibility
Implementation:
- Define "donut" size (e.g., )
- Exclude observations with
- Re-estimate RDD
def donut_rdd(df, X_col, Y_col, cutoff, donut_size, bandwidth):
"""
Donut-hole RDD
Parameters:
- df: dataframe
- X_col, Y_col: variable names
- cutoff: cutoff
- donut_size: donut size (exclude |X - c| < donut_size observations)
- bandwidth: bandwidth
"""
df_donut = df[
(np.abs(df[X_col] - cutoff) >= donut_size) &
(np.abs(df[X_col] - cutoff) <= bandwidth)
].copy()
df_donut['X_c'] = df_donut[X_col] - cutoff
df_donut['D'] = (df_donut[X_col] >= cutoff).astype(int)
model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_donut).fit()
return {
'donut_size': donut_size,
'effect': model.params['D'],
'se': model.bse['D'],
'n_dropped': len(df[(np.abs(df[X_col] - cutoff) < donut_size)]),
'n_used': len(df_donut)
}
# Try different donut sizes
donut_sizes = [0, 1, 2, 3, 5]
donut_results = []
for ds in donut_sizes:
result = donut_rdd(df, 'X', 'Y', 0, ds, 20)
donut_results.append(result)
donut_df = pd.DataFrame(donut_results)
print("\n" + "=" * 80)
print("Donut-hole RDD Test")
print("=" * 80)
print(donut_df.to_string(index=False))
print("\nIf estimate is robust, results are not driven by observations very close to cutoff")Complete Robustness Testing Workflow
Comprehensive Report
def rdd_robustness_report(df, X_col, Y_col, cutoff):
"""
Generate complete RDD robustness test report
Parameters:
- df: dataframe
- X_col: running variable
- Y_col: outcome variable
- cutoff: cutoff
"""
from rdrobust import rdrobust
print("=" * 80)
print(" " * 25 + "RDD Robustness Test Report")
print("=" * 80)
df = df.copy()
df['X_c'] = df[X_col] - cutoff
df['D'] = (df[X_col] >= cutoff).astype(int)
# 1. Baseline estimate (CCT optimal bandwidth)
print("\n[1] Baseline Estimate (CCT Optimal Bandwidth)")
print("-" * 80)
baseline = rdrobust(y=df[Y_col], x=df[X_col], c=cutoff)
print(f"Optimal bandwidth: {baseline.bws[0]:.2f}")
print(f"RDD effect: {baseline.coef[0]:.4f}")
print(f"Robust SE: {baseline.se[0]:.4f}")
print(f"Robust p-value: {baseline.pval[0]:.4f}")
print(f"95% CI: [{baseline.ci[0][0]:.4f}, {baseline.ci[0][1]:.4f}]")
# 2. Bandwidth sensitivity
print("\n[2] Bandwidth Sensitivity Analysis")
print("-" * 80)
h_baseline = baseline.bws[0]
bandwidths = [0.5 * h_baseline, 0.75 * h_baseline, h_baseline,
1.25 * h_baseline, 1.5 * h_baseline]
bw_results = []
for h in bandwidths:
df_local = df[np.abs(df['X_c']) <= h]
if len(df_local) > 50:
model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_local).fit()
bw_results.append({
'Bandwidth': f'{h:.2f}',
'Effect': f"{model.params['D']:.4f}",
'SE': f"{model.bse['D']:.4f}",
'N': len(df_local)
})
bw_df = pd.DataFrame(bw_results)
print(bw_df.to_string(index=False))
# 3. Polynomial order
print("\n[3] Polynomial Order Sensitivity")
print("-" * 80)
poly_results = []
for p in [1, 2]:
df_local = df[np.abs(df['X_c']) <= h_baseline]
formula_parts = ['D', 'X_c']
for k in range(2, p + 1):
formula_parts.append(f'I(X_c**{k})')
for part in formula_parts[1:]:
formula_parts.append(f'D:{part}')
formula = f'{Y_col} ~ ' + ' + '.join(formula_parts)
model = smf.ols(formula, data=df_local).fit()
poly_results.append({
'Polynomial': f'p={p}',
'Effect': f"{model.params['D']:.4f}",
'SE': f"{model.bse['D']:.4f}"
})
poly_df = pd.DataFrame(poly_results)
print(poly_df.to_string(index=False))
# 4. Donut-hole
print("\n[4] Donut-hole RDD")
print("-" * 80)
donut_results = []
for ds in [0, 1, 2, 5]:
df_donut = df[
(np.abs(df['X_c']) >= ds) &
(np.abs(df['X_c']) <= h_baseline)
]
if len(df_donut) > 50:
model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_donut).fit()
donut_results.append({
'Donut': ds,
'Effect': f"{model.params['D']:.4f}",
'N_dropped': len(df[np.abs(df['X_c']) < ds]),
'N_used': len(df_donut)
})
donut_df = pd.DataFrame(donut_results)
print(donut_df.to_string(index=False))
print("\n" + "=" * 80)
print(" " * 30 + "Report Complete")
print("=" * 80)
# Generate report
rdd_robustness_report(df, 'X', 'Y', 0)
Best Practices in Practice
Bandwidth Selection Recommendations
- Default to the CCT optimal bandwidth (the rdrobust default)
- Report multiple bandwidths (e.g., 0.5h, 0.75h, h, 1.25h, 1.5h), as in the sketch after this list
- Check sensitivity: if results are stable across a reasonable bandwidth range, they are credible
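A minimal sketch of the second recommendation: refit rdrobust at fixed fractions and multiples of the MSE-optimal bandwidth using its h argument. The multiplier grid is an illustrative choice, and the .bws[0] / .coef[0] / .se[0] access pattern simply follows the earlier examples in this section:
from rdrobust import rdrobust

# Baseline fit: data-driven MSE-optimal bandwidth (rdrobust default)
base = rdrobust(y=df['Y'], x=df['X'], c=0)
h_opt = base.bws[0]  # same access pattern as in the robustness report above

# Re-estimate at fractions/multiples of the optimal bandwidth
for mult in [0.5, 0.75, 1.0, 1.25, 1.5]:
    fit = rdrobust(y=df['Y'], x=df['X'], c=0, h=mult * h_opt)
    print(f"h = {mult:.2f} x h_opt: effect = {fit.coef[0]:.4f}, SE = {fit.se[0]:.4f}")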
Polynomial Order Recommendations
- Main results: use local linear ($p = 1$)
- Robustness: check local quadratic ($p = 2$), as in the sketch after this list
- Avoid: don't use $p \geq 3$
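For the quadratic robustness check, rdrobust exposes the local polynomial order through its p argument, so the check can reuse the same data-driven bandwidth machinery (a brief sketch, with output access following the earlier examples):
from rdrobust import rdrobust

# Local linear (p=1, the default) vs. local quadratic (p=2)
for p in [1, 2]:
    fit = rdrobust(y=df['Y'], x=df['X'], c=0, p=p)
    print(f"p = {p}: effect = {fit.coef[0]:.4f}, SE = {fit.se[0]:.4f}")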
Reporting Checklist
What to report in papers:
- Main effect estimate (CCT optimal bandwidth + robust SE)
- Bandwidth sensitivity (table or figure)
- Validity tests:
- Covariate balance
- McCrary density test
- Placebo tests
- Polynomial order sensitivity (optional)
- Donut-hole test (if concerned about manipulation)
Key Takeaways
Bandwidth Selection
- MSE optimal: IK and CCT methods (CCT recommended)
- Automation: use the rdrobust package defaults
- Robustness: report results under multiple bandwidths
Robustness Tests
- Bandwidth sensitivity: Estimates should be stable within reasonable range
- Polynomial order: Avoid high-order polynomials ($p \geq 3$)
- Donut-hole: Exclude observations very close to cutoff, test robustness
Reporting Standards
- Main results: CCT + robust confidence intervals
- Robustness tables: Bandwidth, polynomial, donut-hole
- Validity tests: Covariates, density, placebo
Section Summary
In this section, we learned:
- Theory of bandwidth selection (bias-variance tradeoff)
- Automatic bandwidth selection methods (IK, CCT)
- Sensitivity analysis (bandwidth, polynomial order)
- Donut-hole RDD
- Complete robustness testing workflow
Key lesson:
"RDD's credibility depends not only on point estimates, but also on result robustness across various specifications. Reviewers will scrutinize these carefully!"
Next step: In Section 5, we will replicate classic RDD studies, including Angrist & Lavy (1999) and Lee (2008).
Rigorous robustness testing makes your research stand the test of time!