11.4 Bandwidth Selection and Robustness Tests

"Bandwidth choice involves a fundamental bias-variance tradeoff."— Guido Imbens & Karthik Kalyanaraman, Optimal Bandwidth Authors

Make your RDD estimates withstand reviewers' scrutiny


Section Overview

Bandwidth selection is one of the most critical decisions in RDD empirical analysis. This section introduces:

  • Theory and methods of optimal bandwidth selection (IK, CCT)
  • Sensitivity analysis
  • Polynomial order selection
  • Donut-hole RDD (excluding observations very close to cutoff)
  • Complete robustness testing workflow

Core Tradeoffs in Bandwidth Selection

Bias-Variance Tradeoff

Mean Squared Error (MSE):

$$\text{MSE}(h) = \text{Bias}(h)^2 + \text{Variance}(h)$$

Small bandwidth $h$:

  • Low bias: Local linear approximation is more accurate
  • High variance: Few observations, so the estimate is unstable

Large bandwidth $h$:

  • Low variance: More observations, so the estimate is stable
  • High bias: May violate the local linearity assumption

Optimal bandwidth $h^*$:

$$h^* = \arg\min_h \text{MSE}(h)$$
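
A small Monte Carlo experiment can make the tradeoff concrete. The sketch below is illustrative only: the data-generating process (curvature on the treated side only, true jump `tau = 1`), the uniform kernel, and the helper names `local_linear_jump` and `simulate_tradeoff` are assumptions introduced here, not part of any library.

```python
import numpy as np

def local_linear_jump(x, y, cutoff, h):
    """Local linear RD estimate with a uniform kernel: difference between
    the two side-specific intercepts fitted within |x - cutoff| <= h."""
    xc = x - cutoff
    left = (xc < 0) & (xc >= -h)
    right = (xc >= 0) & (xc <= h)
    # np.polyfit returns [slope, intercept]; the intercept is the fit at the cutoff
    b_left = np.polyfit(xc[left], y[left], 1)[1]
    b_right = np.polyfit(xc[right], y[right], 1)[1]
    return b_right - b_left

def simulate_tradeoff(bandwidths, n=500, reps=200, tau=1.0, seed=0):
    """Return (bias^2, variance, MSE) of the RD estimate for each bandwidth."""
    rng = np.random.default_rng(seed)
    est = np.empty((reps, len(bandwidths)))
    for r in range(reps):
        x = rng.uniform(-1, 1, n)
        # Assumed DGP: linear on the left, curved on the right, so wide
        # windows break local linearity on the treated side
        y = x + 3 * x**2 * (x >= 0) + tau * (x >= 0) + rng.normal(0, 0.5, n)
        for j, h in enumerate(bandwidths):
            est[r, j] = local_linear_jump(x, y, 0.0, h)
    bias2 = (est.mean(axis=0) - tau) ** 2
    var = est.var(axis=0)
    return bias2, var, bias2 + var

bandwidths = [0.1, 0.4, 1.0]
bias2, var, mse = simulate_tradeoff(bandwidths)
for h, b, v, m in zip(bandwidths, bias2, var, mse):
    print(f"h={h:.1f}  bias^2={b:.4f}  variance={v:.4f}  MSE={m:.4f}")
```

Under these assumptions, squared bias grows and variance shrinks as the window widens, and the MSE is lowest at the intermediate bandwidth.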

Asymptotic Theory

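Fan & Gijbels (1996) derived the asymptotic form of the optimal bandwidth:

$$h_{\text{opt}} = C \cdot n^{-1/5}$$

Intuition:

  • A larger sample allows a smaller optimal bandwidth (enough observations remain close to the cutoff)
  • The bandwidth shrinks at rate $n^{-1/5}$, so the estimator converges at rate $n^{-2/5}$, slower than the parametric $n^{-1/2}$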
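
To get a feel for how slowly the bandwidth shrinks, the sketch below rescales a reference bandwidth under the rate $h \propto n^{-1/5}$. The helper `rescale_bandwidth` and the reference values are hypothetical and used only for the arithmetic.

```python
def rescale_bandwidth(h_ref, n_ref, n_new):
    """Scale a reference bandwidth to a new sample size, assuming h is
    proportional to n^(-1/5) with the same constant C."""
    return h_ref * (n_new / n_ref) ** (-1 / 5)

h_1k = 10.0  # hypothetical optimal bandwidth at n = 1,000
h_32k = rescale_bandwidth(h_1k, 1_000, 32_000)
print(f"n = 1,000 -> h = {h_1k:.2f};  n = 32,000 -> h = {h_32k:.2f}")
```

Because $32^{-1/5} = 1/2$, the sample must grow 32-fold before the optimal bandwidth halves.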

Automatic Bandwidth Selection Methods

Method 1: Imbens-Kalyanaraman (IK) Bandwidth (2012)

Approach: Based on asymptotic expansion of MSE, derive data-driven optimal bandwidth.

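Simplified formula:

$$h_{IK} = C_K \left[ \frac{\sigma_-^2(c) + \sigma_+^2(c)}{f(c)\,\bigl(m''_+(c) - m''_-(c)\bigr)^2} \right]^{1/5} n^{-1/5}$$

where:

  • $C_K$: Constant (depends on kernel function)
  • $\sigma_-^2(c), \sigma_+^2(c)$: Residual variance left and right of cutoff
  • $f(c)$: Density of running variable at cutoff
  • $m''_-(c), m''_+(c)$: Second derivative of the potential outcome (conditional mean) function on each side of the cutoff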
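
The plug-in logic can be sketched as a function that takes already-estimated ingredients and returns $C_K \,[(\sigma_-^2 + \sigma_+^2) / (f(c)\,(m''_+ - m''_-)^2)]^{1/5}\, n^{-1/5}$. This is a simplified illustration, not the full IK procedure: the regularization terms that guard against a near-zero curvature gap are omitted, the function name `ik_plugin_bandwidth` is hypothetical, and the default `C_K = 3.4375` (the value usually quoted for the triangular kernel) should be treated as an assumption.

```python
def ik_plugin_bandwidth(sigma2_left, sigma2_right, f_c,
                        m2_left, m2_right, n, C_K=3.4375):
    """Simplified IK-style plug-in bandwidth (no regularization).

    sigma2_left, sigma2_right : residual variances on each side of the cutoff
    f_c                       : density of the running variable at the cutoff
    m2_left, m2_right         : second derivatives of the conditional mean
    n                         : sample size
    C_K                       : kernel constant (assumed default)
    """
    curvature_gap_sq = (m2_right - m2_left) ** 2
    if curvature_gap_sq == 0:
        # The full IK procedure adds regularization to avoid this case
        raise ValueError("curvature gap is zero; bandwidth is undefined")
    ratio = (sigma2_left + sigma2_right) / (f_c * curvature_gap_sq)
    return C_K * ratio ** (1 / 5) * n ** (-1 / 5)

# Hypothetical inputs, for illustration only
h_small_n = ik_plugin_bandwidth(1.0, 1.2, 0.05, -0.01, 0.02, n=500)
h_large_n = ik_plugin_bandwidth(1.0, 1.2, 0.05, -0.01, 0.02, n=5000)
print(f"n=500: h={h_small_n:.2f}   n=5000: h={h_large_n:.2f}")
```

Noisier outcomes push the bandwidth up, while a larger curvature gap, a denser running variable, or a larger sample pull it down.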

Python implementation:

python
from rdrobust import rdbwselect

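# MSE-optimal bandwidth selection. Note: 'mserd' is the CCT MSE-optimal
# selector (one common bandwidth on both sides), the successor to the
# original IK selector rather than the IK formula itself.
bw_mse = rdbwselect(y=df['Y'], x=df['X'], c=0, bwselect='mserd')
# In the Python rdrobust package .bws is a pandas DataFrame (check your
# version); the first column holds the main bandwidth h
print(f"MSE-optimal bandwidth: {bw_mse.bws.iloc[0, 0]:.2f}")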

Method 2: Calonico-Cattaneo-Titiunik (CCT) Bandwidth (2014)

Improvements:

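  1. Bias correction: Estimates the leading smoothing bias of the local linear estimator, subtracts it, and adjusts the standard errors for this extra estimation step
  2. Two bandwidths:
    • Main bandwidth $h$: For point estimation
    • Bias bandwidth $b$: For estimating and correcting bias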

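Coverage Error Rate (CER) optimal bandwidth: chosen to minimize the coverage error of the confidence interval rather than the MSE of the point estimate. A common rule of thumb shrinks the MSE-optimal bandwidth:

$$h_{\text{CER}} = n^{-p/((3+p)(3+2p))} \cdot h_{\text{MSE}}$$

For local linear estimation ($p = 1$) the factor is $n^{-1/20}$. In rdrobust this selector can be requested with bwselect='cerrd'.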
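
As a rough illustration of how a coverage-error-rate bandwidth relates to the MSE-optimal one, the sketch below applies the rule-of-thumb shrinkage $h_{\text{CER}} = n^{-p/((3+p)(3+2p))} \cdot h_{\text{MSE}}$. The function name `cer_bandwidth` and the input values are hypothetical; in practice the selector built into rdrobust should be preferred over this hand calculation.

```python
def cer_bandwidth(h_mse, n, p=1):
    """Rule-of-thumb CER bandwidth: shrink the MSE-optimal bandwidth by
    n^(-p / ((3 + p)(3 + 2p))), where p is the polynomial order."""
    shrink = n ** (-p / ((3 + p) * (3 + 2 * p)))
    return h_mse * shrink

h_mse = 12.0  # hypothetical MSE-optimal bandwidth
for n in (500, 5_000, 50_000):
    print(f"n={n:>6}: h_MSE={h_mse:.2f} -> h_CER={cer_bandwidth(h_mse, n):.2f}")
```

The CER bandwidth is always narrower than the MSE-optimal one, which trades some precision in the point estimate for confidence intervals with better coverage.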

Python implementation (rdrobust default):

python
from rdrobust import rdrobust

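# CCT method (default; uses the 'mserd' bandwidth selector)
result_cct = rdrobust(y=df['Y'], x=df['X'], c=0)
# In the Python rdrobust package .bws is a pandas DataFrame (check your
# version): row 0 is the main bandwidth h, row 1 the bias bandwidth b
print(f"CCT Main Bandwidth: {result_cct.bws.iloc[0, 0]:.2f}")
print(f"CCT Bias Bandwidth: {result_cct.bws.iloc[1, 0]:.2f}")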

Method 3: Cross-Validation (CV)

Leave-one-out cross-validation:

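  1. For each observation $i$, remove it
  2. Fit model with remaining data (bandwidth $h$)
  3. Predict $\hat{Y}_{-i}(h)$
  4. Calculate prediction error: $(Y_i - \hat{Y}_{-i}(h))^2$
  5. Choose $h$ minimizing $\text{CV}(h) = \frac{1}{n} \sum_i (Y_i - \hat{Y}_{-i}(h))^2$

Note: Predict each observation using only data on its own side of the cutoff, so that the jump itself never enters the fit.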

python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold


def cross_validation_bandwidth(df, X_col, Y_col, cutoff, h_candidates):
    """
    Cross-validation bandwidth selection (simplified)

    Parameters:
    - df: dataframe
    - X_col, Y_col: variable names
    - cutoff: cutoff
    - h_candidates: candidate bandwidth list

    Caveat: each bandwidth is scored on its own window, so the scores
    are a rough guide rather than a strict like-for-like comparison.
    """
    cv_scores = []

    for h in h_candidates:
        # Restrict sample to the window |X - cutoff| <= h
        df_local = df[np.abs(df[X_col] - cutoff) <= h].copy()
        df_local['X_c'] = df_local[X_col] - cutoff
        df_local['D'] = (df_local[X_col] >= cutoff).astype(int)

        # Simplification: 5-fold CV stands in for leave-one-out CV
        kf = KFold(n_splits=5, shuffle=True, random_state=42)

        fold_mse = []
        for train_idx, test_idx in kf.split(df_local):
            df_train = df_local.iloc[train_idx]
            df_test = df_local.iloc[test_idx]

            # Fit
            model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_train).fit()

            # Predict
            y_pred = model.predict(df_test)
            mse = np.mean((df_test[Y_col] - y_pred) ** 2)
            fold_mse.append(mse)

        cv_scores.append(np.mean(fold_mse))

    optimal_h = h_candidates[np.argmin(cv_scores)]
    return optimal_h, cv_scores

# Example
h_candidates = np.arange(5, 30, 2)
optimal_h, cv_scores = cross_validation_bandwidth(df, 'X', 'Y', 0, h_candidates)
print(f"CV Optimal Bandwidth: {optimal_h:.2f}")
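
The K-fold routine above pools both sides of the cutoff in each fit. A version closer to the note on one-sided data treats every observation as if it were a boundary point and predicts it only from neighbours farther from the cutoff on the same side. The sketch below is one possible reading of that idea with a uniform kernel; the function name `one_sided_cv_score` and the minimum of three neighbours are assumptions made here.

```python
import numpy as np

def one_sided_cv_score(x, y, cutoff, h, min_obs=3):
    """Mean squared one-sided prediction error for bandwidth h.

    Each point x_i is predicted by a local linear fit that uses only
    observations within distance h on the side pointing away from the
    cutoff, so no fit ever straddles the discontinuity."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    errors = []
    for i in range(len(x)):
        if x[i] < cutoff:
            mask = (x < x[i]) & (x >= x[i] - h)   # neighbours further left
        else:
            mask = (x > x[i]) & (x <= x[i] + h)   # neighbours further right
        if mask.sum() < min_obs:
            continue  # too few neighbours for a stable linear fit
        slope, intercept = np.polyfit(x[mask] - x[i], y[mask], 1)
        errors.append((y[i] - intercept) ** 2)    # intercept = prediction at x_i
    return float(np.mean(errors)) if errors else float("inf")

# Simulated data for illustration only
rng = np.random.default_rng(1)
x_sim = rng.uniform(-10, 10, 400)
y_sim = 2 + 0.5 * x_sim + 3 * (x_sim >= 0) + rng.normal(0, 1, 400)
candidates = [1.0, 2.0, 4.0, 6.0]
scores = [one_sided_cv_score(x_sim, y_sim, 0.0, h) for h in candidates]
best_h = candidates[int(np.argmin(scores))]
print(f"One-sided CV bandwidth: {best_h}")
```

Because every prediction mimics estimation at a boundary, the criterion targets the same problem the RDD estimator faces at the cutoff.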

Robustness Tests: Sensitivity Analysis

1. Comparison Across Multiple Bandwidths

Best practice: Report estimates under multiple bandwidths.

python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Try a range of bandwidths
bandwidths = [5, 10, 15, 20, 25, 30]
results = []

for h in bandwidths:
    df_local = df[np.abs(df['X']) <= h].copy()
    df_local['X_c'] = df_local['X']
    df_local['D'] = (df_local['X'] >= 0).astype(int)

    model = smf.ols('Y ~ D + X_c + D:X_c', data=df_local).fit()

    results.append({
        'bandwidth': h,
        'effect': model.params['D'],
        'se': model.bse['D'],
        'ci_lower': model.conf_int().loc['D', 0],
        'ci_upper': model.conf_int().loc['D', 1],
        'n_obs': len(df_local)
    })

results_df = pd.DataFrame(results)

# Visualization
fig, ax = plt.subplots(figsize=(12, 7))

ax.plot(results_df['bandwidth'], results_df['effect'],
        'o-', linewidth=2, markersize=8, label='Point Estimate')

ax.fill_between(results_df['bandwidth'],
                results_df['ci_lower'],
                results_df['ci_upper'],
                alpha=0.3, label='95% CI')

ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Bandwidth', fontsize=13, fontweight='bold')
ax.set_ylabel('RDD Effect Estimate', fontsize=13, fontweight='bold')
ax.set_title('Sensitivity to Bandwidth Choice', fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print table
print("=" * 80)
print("Bandwidth Sensitivity Analysis")
print("=" * 80)
print(results_df.to_string(index=False))

Interpretation:

  • Robust: If estimate roughly stable across different bandwidths → Results credible
  • Not robust: If estimate varies wildly with bandwidth → Further investigation needed
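
One informal way to put a number on "roughly stable" is to count how many alternative-bandwidth point estimates fall inside the confidence interval of the preferred specification. This is a heuristic offered here as an assumption, not an established test; the helper `share_within_ci` and the numbers are hypothetical.

```python
import numpy as np

def share_within_ci(effects, ci_lower, ci_upper):
    """Share of point estimates lying inside a reference confidence interval."""
    effects = np.asarray(effects, dtype=float)
    inside = (effects >= ci_lower) & (effects <= ci_upper)
    return float(inside.mean())

# Hypothetical numbers for illustration only (not results from any dataset)
effects = [2.1, 2.4, 2.3, 2.6, 3.9]
share = share_within_ci(effects, ci_lower=1.8, ci_upper=3.0)
print(f"{share:.0%} of estimates fall inside the baseline CI")
```

With a table like `results_df`, the same call could take `results_df['effect']` together with the interval from the preferred bandwidth.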

2. Sensitivity to Polynomial Order

Warning (Gelman & Imbens 2019): Do not use high-order polynomials ($p \geq 3$)!

Reasons:

  • High-order polynomials prone to overfitting
  • Data far from cutoff excessively influences estimates at cutoff
  • Confidence intervals may be overly optimistic
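
The second reason can be made visible by looking at the implicit weights a polynomial regression places on each observation when it predicts the outcome at the cutoff. The sketch below assumes an evenly spaced running variable on the treated side with the cutoff normalized to 0; `boundary_weights` is a hypothetical helper.

```python
import numpy as np

def boundary_weights(x, order):
    """Weights w such that the fitted value at x = 0 equals sum_i w_i * y_i
    for an OLS polynomial regression of the given order."""
    X = np.vander(x, N=order + 1, increasing=True)  # columns 1, x, ..., x^order
    # pinv(X) = (X'X)^{-1} X' for full-rank X; its first row gives the intercept
    return np.linalg.pinv(X)[0]

x_grid = np.linspace(0.01, 1.0, 200)  # treated side, cutoff at 0
far = x_grid > 0.8                    # observations far from the cutoff
far_weight = {}
for p in (1, 2, 5):
    w = boundary_weights(x_grid, p)
    far_weight[p] = float(np.abs(w[far]).max())
    print(f"p={p}: sum of weights = {w.sum():.3f}, "
          f"largest |weight| beyond x=0.8 = {far_weight[p]:.4f}")
```

The weights always sum to one, but as the order rises the weights on distant observations grow in magnitude and alternate in sign, so points far from the cutoff move the boundary estimate more.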

Recommendation:

  • Use local linear ($p = 1$, recommended)
  • At most use local quadratic ($p = 2$)
  • Avoid $p \geq 3$

python
# Compare different polynomial orders
poly_orders = [1, 2, 3]
poly_results = []

for p in poly_orders:
    df_local = df[np.abs(df['X']) <= 20].copy()
    df_local['X_c'] = df_local['X']
    df_local['D'] = (df_local['X'] >= 0).astype(int)

    # Build formula
    formula_terms = ['D', 'X_c']
    for k in range(2, p + 1):
        formula_terms.append(f'I(X_c**{k})')

    # Add interaction terms
    for term in formula_terms[1:]:
        formula_terms.append(f'D:{term}')

    formula = 'Y ~ ' + ' + '.join(formula_terms)

    model = smf.ols(formula, data=df_local).fit()

    poly_results.append({
        'polynomial_order': p,
        'effect': model.params['D'],
        'se': model.bse['D'],
        'ci_lower': model.conf_int().loc['D', 0],
        'ci_upper': model.conf_int().loc['D', 1]
    })

poly_df = pd.DataFrame(poly_results)

print("\n" + "=" * 80)
print("Polynomial Order Sensitivity Analysis")
print("=" * 80)
print(poly_df.to_string(index=False))
print("\nRecommendation: Use linear (p=1) or quadratic (p=2), avoid high-order")

3. Donut-hole RDD

Motivation: Exclude observations very close to cutoff to test robustness.

Reasons:

  • If precise manipulation exists, most likely occurs near cutoff
  • If results remain robust after excluding these observations, increases credibility

Implementation:

  • Define the "donut" size $\delta$ (a small radius around the cutoff)
  • Exclude observations with $|X_i - c| < \delta$
  • Re-estimate RDD

python
def donut_rdd(df, X_col, Y_col, cutoff, donut_size, bandwidth):
    """
    Donut-hole RDD

    Parameters:
    - df: dataframe
    - X_col, Y_col: variable names
    - cutoff: cutoff
    - donut_size: donut size (exclude |X - c| < donut_size observations)
    - bandwidth: bandwidth
    """
    df_donut = df[
        (np.abs(df[X_col] - cutoff) >= donut_size) &
        (np.abs(df[X_col] - cutoff) <= bandwidth)
    ].copy()

    df_donut['X_c'] = df_donut[X_col] - cutoff
    df_donut['D'] = (df_donut[X_col] >= cutoff).astype(int)

    model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_donut).fit()

    return {
        'donut_size': donut_size,
        'effect': model.params['D'],
        'se': model.bse['D'],
        'n_dropped': len(df[(np.abs(df[X_col] - cutoff) < donut_size)]),
        'n_used': len(df_donut)
    }

# Try different donut sizes
donut_sizes = [0, 1, 2, 3, 5]
donut_results = []

for ds in donut_sizes:
    result = donut_rdd(df, 'X', 'Y', 0, ds, 20)
    donut_results.append(result)

donut_df = pd.DataFrame(donut_results)

print("\n" + "=" * 80)
print("Donut-hole RDD Test")
print("=" * 80)
print(donut_df.to_string(index=False))
print("\nIf estimate is robust, results are not driven by observations very close to cutoff")

Complete Robustness Testing Workflow

Comprehensive Report

python
def rdd_robustness_report(df, X_col, Y_col, cutoff):
    """
    Generate complete RDD robustness test report

    Parameters:
    - df: dataframe
    - X_col: running variable
    - Y_col: outcome variable
    - cutoff: cutoff
    """
    from rdrobust import rdrobust

    print("=" * 80)
    print(" " * 25 + "RDD Robustness Test Report")
    print("=" * 80)

    df = df.copy()
    df['X_c'] = df[X_col] - cutoff
    df['D'] = (df[X_col] >= cutoff).astype(int)

    # 1. Baseline estimate (CCT optimal bandwidth)
    print("\n[1] Baseline Estimate (CCT Optimal Bandwidth)")
    print("-" * 80)
    baseline = rdrobust(y=df[Y_col], x=df[X_col], c=cutoff)
    # In the Python rdrobust package the results are pandas DataFrames
    # (check your version): rows of coef/se/pv/ci are Conventional (0),
    # Bias-Corrected (1), Robust (2); row 0 of bws is the main bandwidth h.
    h_baseline = float(baseline.bws.iloc[0, 0])
    print(f"Optimal bandwidth: {h_baseline:.2f}")
    print(f"RDD effect: {baseline.coef.iloc[0, 0]:.4f}")
    print(f"Robust SE: {baseline.se.iloc[2, 0]:.4f}")
    print(f"Robust p-value: {baseline.pv.iloc[2, 0]:.4f}")
    print(f"Robust 95% CI: [{baseline.ci.iloc[2, 0]:.4f}, {baseline.ci.iloc[2, 1]:.4f}]")

    # 2. Bandwidth sensitivity
    print("\n[2] Bandwidth Sensitivity Analysis")
    print("-" * 80)
    bandwidths = [0.5 * h_baseline, 0.75 * h_baseline, h_baseline,
                  1.25 * h_baseline, 1.5 * h_baseline]

    bw_results = []
    for h in bandwidths:
        df_local = df[np.abs(df['X_c']) <= h]
        if len(df_local) > 50:
            model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_local).fit()
            bw_results.append({
                'Bandwidth': f'{h:.2f}',
                'Effect': f"{model.params['D']:.4f}",
                'SE': f"{model.bse['D']:.4f}",
                'N': len(df_local)
            })

    bw_df = pd.DataFrame(bw_results)
    print(bw_df.to_string(index=False))

    # 3. Polynomial order
    print("\n[3] Polynomial Order Sensitivity")
    print("-" * 80)
    poly_results = []

    for p in [1, 2]:
        df_local = df[np.abs(df['X_c']) <= h_baseline]
        formula_parts = ['D', 'X_c']
        for k in range(2, p + 1):
            formula_parts.append(f'I(X_c**{k})')
        for part in formula_parts[1:]:
            formula_parts.append(f'D:{part}')

        formula = f'{Y_col} ~ ' + ' + '.join(formula_parts)
        model = smf.ols(formula, data=df_local).fit()

        poly_results.append({
            'Polynomial': f'p={p}',
            'Effect': f"{model.params['D']:.4f}",
            'SE': f"{model.bse['D']:.4f}"
        })

    poly_df = pd.DataFrame(poly_results)
    print(poly_df.to_string(index=False))

    # 4. Donut-hole
    print("\n[4] Donut-hole RDD")
    print("-" * 80)
    donut_results = []

    for ds in [0, 1, 2, 5]:
        df_donut = df[
            (np.abs(df['X_c']) >= ds) &
            (np.abs(df['X_c']) <= h_baseline)
        ]
        if len(df_donut) > 50:
            model = smf.ols(f'{Y_col} ~ D + X_c + D:X_c', data=df_donut).fit()
            donut_results.append({
                'Donut': ds,
                'Effect': f"{model.params['D']:.4f}",
                'N_dropped': len(df[np.abs(df['X_c']) < ds]),
                'N_used': len(df_donut)
            })

    donut_df = pd.DataFrame(donut_results)
    print(donut_df.to_string(index=False))

    print("\n" + "=" * 80)
    print(" " * 30 + "Report Complete")
    print("=" * 80)

# Generate report
rdd_robustness_report(df, 'X', 'Y', 0)

Best Practices in Practice

Bandwidth Selection Recommendations

  1. Default to CCT optimal bandwidth (rdrobust default)
  2. Report multiple bandwidths (e.g., 0.5h, 0.75h, h, 1.25h, 1.5h)
  3. Check sensitivity: If results stable across reasonable bandwidth range → Credible

Polynomial Order Recommendations

  1. Main results: Use local linear ($p = 1$)
  2. Robustness: Check local quadratic ($p = 2$)
  3. Avoid: Don't use $p \geq 3$

Reporting Checklist

What to report in papers:

  1. Main effect estimate (CCT optimal bandwidth + robust SE)
  2. Bandwidth sensitivity (table or figure)
  3. Validity tests:
    • Covariate balance
    • McCrary density test
    • Placebo tests
  4. Polynomial order sensitivity (optional)
  5. Donut-hole test (if concerned about manipulation)
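
Of the validity tests in the checklist, the placebo-cutoff test is the easiest to sketch: re-estimate the jump at fake cutoffs where no treatment change occurs, using only observations on one side of the true cutoff so the real discontinuity cannot contaminate the fit. The helpers `jump_at` and `placebo_jumps`, the uniform kernel, and the simulated data are assumptions made for this illustration.

```python
import numpy as np

def jump_at(x, y, cutoff, h):
    """Local linear jump estimate at `cutoff` with a uniform kernel."""
    xc = x - cutoff
    left = (xc < 0) & (xc >= -h)
    right = (xc >= 0) & (xc <= h)
    # np.polyfit returns [slope, intercept]; intercept is the fit at the cutoff
    return np.polyfit(xc[right], y[right], 1)[1] - np.polyfit(xc[left], y[left], 1)[1]

def placebo_jumps(x, y, true_cutoff, fake_cutoffs, h):
    """Estimate jumps at fake cutoffs, each on one side of the true cutoff."""
    out = {}
    for c in fake_cutoffs:
        # Keep only the side of the true cutoff that contains the fake cutoff
        side = (x >= true_cutoff) if c > true_cutoff else (x < true_cutoff)
        out[c] = float(jump_at(x[side], y[side], c, h))
    return out

# Simulated data for illustration only: true jump of 3 at X = 0
rng = np.random.default_rng(0)
x_sim = rng.uniform(-10, 10, 2000)
y_sim = 1 + 0.5 * x_sim + 3 * (x_sim >= 0) + rng.normal(0, 1, 2000)
print("True cutoff:", round(float(jump_at(x_sim, y_sim, 0.0, 3.0)), 3))
print("Placebos:   ", placebo_jumps(x_sim, y_sim, 0.0, [-5.0, 5.0], h=3.0))
```

Placebo estimates that are close to zero relative to the estimate at the true cutoff support the design; sizeable placebo jumps suggest the outcome is discontinuous for reasons unrelated to treatment.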

Key Takeaways

Bandwidth Selection

  1. MSE optimal: IK and CCT methods (CCT recommended)
  2. Automation: Use rdrobust package defaults
  3. Robustness: Report results under multiple bandwidths

Robustness Tests

  1. Bandwidth sensitivity: Estimates should be stable within reasonable range
  2. Polynomial order: Avoid high-order ($p \geq 3$)
  3. Donut-hole: Exclude observations very close to cutoff, test robustness

Reporting Standards

  1. Main results: CCT + robust confidence intervals
  2. Robustness tables: Bandwidth, polynomial, donut-hole
  3. Validity tests: Covariates, density, placebo

Section Summary

In this section, we learned:

  • Theory of bandwidth selection (bias-variance tradeoff)
  • Automatic bandwidth selection methods (IK, CCT)
  • Sensitivity analysis (bandwidth, polynomial order)
  • Donut-hole RDD
  • Complete robustness testing workflow

Key lesson:

"RDD's credibility depends not only on point estimates, but also on result robustness across various specifications. Reviewers will scrutinize these carefully!"

Next step: In Section 5, we will replicate classic RDD studies, including Angrist & Lavy (1999) and Lee (2008).


Rigorous robustness testing makes your research stand the test of time!

Released under the MIT License. Content © Author.