9.2 DID Fundamentals
Deep Dive into the Causal Inference Logic of DID
Learning Objectives
- Understand DID's causal inference logic (potential outcomes framework)
- Master panel data DID models (two-way fixed effects, TWFE)
- Learn to use fixed effects to control for unobserved variables
- Adjust standard errors by clustering
- Extend the DID model with control variables
- Use linearmodels to implement professional panel regressions
I. Causal Inference Framework of DID
Potential Outcomes Framework (Rubin Causal Model)
Basic Notation
- $Y_i(1)$: Individual $i$'s potential outcome under treatment
- $Y_i(0)$: Individual $i$'s potential outcome without treatment
- $D_i$: Whether individual $i$ receives treatment ($D_i = 1$ if treated)
Observed Outcome
$$Y_i = D_i\,Y_i(1) + (1 - D_i)\,Y_i(0)$$
Individual Treatment Effect (ITE)
$$\tau_i = Y_i(1) - Y_i(0)$$
Fundamental Problem: For the same individual we can never observe $Y_i(1)$ and $Y_i(0)$ simultaneously; this is the fundamental problem of causal inference.
Average Treatment Effect (ATE)
Definition
$$\text{ATE} = E[Y_i(1) - Y_i(0)]$$
Average Treatment Effect on the Treated (ATT)
$$\text{ATT} = E[Y_i(1) - Y_i(0) \mid D_i = 1]$$
Why can't we simply compare?
A naive comparison of group means decomposes as:
$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}$$
Conclusion: Simple comparison conflates the causal effect with selection bias, leading to biased estimates.
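To make the decomposition concrete, here is a minimal simulated sketch (all numbers assumed purely for illustration): individuals with higher baseline outcomes select into treatment, so the naive comparison overstates the effect even though the true treatment effect is constant.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
y0 = rng.normal(50, 10, n)                       # Y_i(0): baseline potential outcome
d = (y0 + rng.normal(0, 5, n) > 55).astype(int)  # selection: high-baseline units opt in
y1 = y0 + 8                                      # Y_i(1): true effect is +8 for everyone
y = d * y1 + (1 - d) * y0                        # observed outcome
naive = y[d == 1].mean() - y[d == 0].mean()
att = (y1 - y0)[d == 1].mean()                   # true ATT (= 8 by construction)
print(f"Naive comparison: {naive:.2f}  |  True ATT: {att:.2f}")
# The gap between the two numbers is exactly the selection bias term.
```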
II. How DID Eliminates Bias
Panel Data Setup for DID
Time Dimension: $t = 0$ (pre-policy), $t = 1$ (post-policy)
Potential Outcomes
- $Y_{it}(0)$: Potential outcome for individual $i$ at time $t$ without policy
- $Y_{it}(1)$: Potential outcome for individual $i$ at time $t$ with policy
Observed Outcome
$$Y_{it} = Y_{it}(0) + D_{it}\left[Y_{it}(1) - Y_{it}(0)\right]$$
where $D_{it} = \text{Treated}_i \times \text{Post}_t$ equals 1 only for treated individuals in the post-policy period.
Key Assumption of DID (Parallel Trends Assumption):
$$E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 1] = E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 0]$$
Meaning: In the absence of the policy intervention, the outcomes of the treatment and control groups would have followed parallel trends.
DID Estimator
$$\hat{\tau}_{DID} = \left(\bar{Y}_{T,post} - \bar{Y}_{T,pre}\right) - \left(\bar{Y}_{C,post} - \bar{Y}_{C,pre}\right)$$
Conclusion: Under the parallel trends assumption, the DID estimator unbiasedly identifies the average treatment effect on the treated (ATT).
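As a minimal numerical sketch (the four group means below are toy numbers, assumed purely for illustration), the DID estimate is just a difference of two differences:

```python
# Toy 2x2 group means: (group, period) -> mean outcome
means = {('treat', 'pre'): 52.0, ('treat', 'post'): 80.0,
         ('control', 'pre'): 50.0, ('control', 'post'): 60.0}
did = ((means[('treat', 'post')] - means[('treat', 'pre')])          # change in treated: 28
       - (means[('control', 'post')] - means[('control', 'pre')]))   # change in control: 10
print(f"DID estimate: {did:.1f}")  # 28 - 10 = 18
```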
III. Panel Data DID Models
Two-Way Fixed Effects Model (TWFE)
Standard DID Regression Equation
$$Y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it}$$
Components
- $\alpha_i$: Unit Fixed Effects (Entity Fixed Effects)
- Controls for time-invariant individual characteristics
- Example: geographic location, cultural traditions, etc.
- $\lambda_t$: Time Fixed Effects
- Controls for common time shocks faced by all individuals
- Example: nationwide economic cycles, technological progress, etc.
- $D_{it}$: Treatment dummy variable ($D_{it} = \text{Treated}_i \times \text{Post}_t$)
- $\tau$: DID Estimator (ATT)
How Fixed Effects Eliminate Bias
Unit Fixed Effects eliminate cross-sectional heterogeneity
- Achieved through the within transformation (demeaning within each unit)
Time Fixed Effects eliminate common time trends
- Achieved through time demeaning (see the sketch below)
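The following is a minimal sketch of the two-way within transformation on a tiny simulated balanced panel (all numbers assumed). For a balanced panel, regressing the double-demeaned outcome on the double-demeaned treatment reproduces the TWFE coefficient:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame([
    {'unit': i, 'period': t,
     'did': int(i >= 2 and t >= 2),
     # Unit effect + time effect + true effect of 5 + small noise
     'y': 10 * i + 3 * t + 5 * int(i >= 2 and t >= 2) + rng.normal(0, 0.1)}
    for i in range(4) for t in range(4)
])

def demean2(col):
    """Two-way within transformation: x - unit mean - period mean + grand mean."""
    return (toy[col]
            - toy.groupby('unit')[col].transform('mean')
            - toy.groupby('period')[col].transform('mean')
            + toy[col].mean())

y_dd, d_dd = demean2('y'), demean2('did')
print(f"TWFE coefficient: {(y_dd @ d_dd) / (d_dd @ d_dd):.3f}")  # close to the true 5
```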
Deep Understanding of DID's Double Differencing
Using dummy variable representation (more intuitive):
$$Y_{it} = \beta_0 + \beta_1\,\text{Treated}_i + \beta_2\,\text{Post}_t + \beta_3\,(\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it}$$
Equivalent to:
$$\hat{\beta}_3 = \left(\bar{Y}_{T,post} - \bar{Y}_{T,pre}\right) - \left(\bar{Y}_{C,post} - \bar{Y}_{C,pre}\right)$$
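In code, this dummy-variable form is a one-line regression. A minimal sketch (it assumes the simulated df constructed in Section IV below; in the formula, 'treated * post' expands to both main effects plus their interaction):

```python
import statsmodels.formula.api as smf

# The coefficient on the interaction term is the DID estimate
model_2x2 = smf.ols('y ~ treated * post', data=df).fit(cov_type='HC1')
print(f"DID (interaction) coefficient: {model_2x2.params['treated:post']:.3f}")
```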
IV. Python Implementation of Panel DID
Method 1: statsmodels (OLS with dummies)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
# Setup
np.random.seed(42)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
# Generate simulated panel data
n_units = 50 # Number of units
n_periods = 10 # Number of periods
treatment_time = 5 # Policy intervention time
treatment_effect = 20 # True policy effect
data = []
for unit in range(n_units):
    treated = 1 if unit >= n_units // 2 else 0  # Half of units receive treatment
    unit_effect = np.random.normal(10 * treated, 5)  # Unit fixed effect
    for period in range(n_periods):
        time_effect = 2 * period  # Time trend
        post = 1 if period >= treatment_time else 0
        # Outcome variable
        y = (50 + unit_effect + time_effect +
             treatment_effect * treated * post +
             np.random.normal(0, 3))
        data.append({
            'unit': unit,
            'period': period,
            'treated': treated,
            'post': post,
            'did': treated * post,
            'y': y
        })
df = pd.DataFrame(data)
print("=" * 70)
print("Simulated Data Descriptive Statistics")
print("=" * 70)
print(df.groupby(['treated', 'period'])['y'].mean().unstack())
print("\n")
# Method 1: OLS with fixed effects dummies
# Note: Including all dummies causes collinearity; need to set baseline group
model_fe = smf.ols('y ~ C(unit) + C(period) + did', data=df).fit(cov_type='HC1')
print("=" * 70)
print("Method 1: OLS with FE dummies")
print("=" * 70)
print(f"DID Coefficient: {model_fe.params['did']:.3f}")
print(f"Standard Error: {model_fe.bse['did']:.3f}")
print(f"95% CI: [{model_fe.conf_int().loc['did', 0]:.3f}, {model_fe.conf_int().loc['did', 1]:.3f}]")
print(f"True Effect: {treatment_effect}")
print("\n")Method 2: linearmodels (Professional Panel Data Tool)
from linearmodels.panel import PanelOLS
# Set data to panel format (multi-index)
df_panel = df.set_index(['unit', 'period'])
# Use PanelOLS with entity and time effects
model_panel = PanelOLS(
    dependent=df_panel['y'],
    exog=df_panel[['did']],
    entity_effects=True,  # Unit fixed effects
    time_effects=True,    # Time fixed effects
).fit(cov_type='clustered', cluster_entity=True)  # Clustered standard errors
print("=" * 70)
print("Method 2: linearmodels PanelOLS (Recommended)")
print("=" * 70)
print(model_panel.summary)
print("\n")
# Extract results
did_coef = model_panel.params['did']
did_se = model_panel.std_errors['did']
did_ci = model_panel.conf_int().loc['did']
print("=" * 70)
print("Estimation Results Summary")
print("=" * 70)
print(f"DID Coefficient: {did_coef:.3f} (SE = {did_se:.3f})")
print(f"95% CI: [{did_ci[0]:.3f}, {did_ci[1]:.3f}]")
print(f"True Effect: {treatment_effect}")
print(f"Estimation Bias: {did_coef - treatment_effect:.3f}")Important Parameters
- entity_effects=True: Add unit fixed effects
- time_effects=True: Add time fixed effects
- cov_type='clustered', cluster_entity=True: Use entity-level clustered standard errors (recommended by Bertrand et al. 2004)
V. Visualizing Parallel Trends
Event Study Plot
# Construct relative time variable
df['rel_period'] = df['period'] - treatment_time
df['rel_period'] = df['rel_period'] * df['treated']  # Only for treatment group

# Create leads and lags dummies (t = -1 as baseline)
# Note: patsy parses '-' as subtraction, so encode negative periods with 'm'
def rel_name(t):
    return f'lead_lag_m{-t}' if t < 0 else f'lead_lag_{t}'

for t in range(-treatment_time, n_periods - treatment_time):
    if t != -1:  # Baseline period is omitted
        df[rel_name(t)] = ((df['rel_period'] == t) & (df['treated'] == 1)).astype(int)

# Construct regression formula
lead_lag_vars = [rel_name(t) for t in range(-treatment_time, n_periods - treatment_time) if t != -1]

# Event study regression
formula = 'y ~ C(unit) + C(period) + ' + ' + '.join(lead_lag_vars)
model_es = smf.ols(formula, data=df).fit(cov_type='HC1')

# Extract coefficients and construct plotting data
event_study_results = []
for t in range(-treatment_time, n_periods - treatment_time):
    if t == -1:
        # Baseline period coefficient is normalized to 0
        event_study_results.append({'period': t, 'coef': 0, 'ci_lower': 0, 'ci_upper': 0})
    else:
        var_name = rel_name(t)
        coef = model_es.params[var_name]
        ci = model_es.conf_int().loc[var_name]
        event_study_results.append({
            'period': t,
            'coef': coef,
            'ci_lower': ci[0],
            'ci_upper': ci[1]
        })
es_df = pd.DataFrame(event_study_results)
# Plot event study
fig, ax = plt.subplots(figsize=(14, 8))
ax.plot(es_df['period'], es_df['coef'], 'o-', linewidth=2, markersize=8, color='navy', label='DID Coefficient')
ax.fill_between(es_df['period'], es_df['ci_lower'], es_df['ci_upper'], alpha=0.2, color='navy', label='95% CI')
ax.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax.axvline(x=-0.5, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Policy Implementation Time')
ax.set_xlabel('Time Relative to Policy Implementation', fontsize=14, fontweight='bold')
ax.set_ylabel('Estimated Coefficient', fontsize=14, fontweight='bold')
ax.set_title('Event Study: DID Dynamic Effects', fontsize=16, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("=" * 70)
print("Parallel Trends Test")
print("=" * 70)
print("Examine pre-policy coefficients (t < 0):")
pre_treatment = es_df[es_df['period'] < 0]
print(pre_treatment[['period', 'coef', 'ci_lower', 'ci_upper']])
print("\nIf parallel trends holds, pre-policy coefficients should be close to 0 and insignificant")How to Interpret
- Pre-policy ($t < 0$): Coefficients should be close to 0 (parallel trends check); a joint significance test sketch follows below
- Post-policy ($t \geq 0$): Significantly positive coefficients indicate policy effectiveness
- Dynamic Effects: Observe whether effects strengthen, weaken, or remain stable over time
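Beyond visual inspection, a joint significance test of all pre-treatment coefficients is a common formal check. A minimal sketch using statsmodels' wald_test on the event-study model fitted above (rel_name is the helper defined in the event-study code):

```python
# Jointly test that all pre-treatment (t < -1) coefficients equal zero
pre_vars = [rel_name(t) for t in range(-treatment_time, -1)]
constraint = ', '.join(f'{v} = 0' for v in pre_vars)
wald = model_es.wald_test(constraint, scalar=True)
print(f"Joint pre-trend test: F = {wald.statistic:.2f}, p = {wald.pvalue:.3f}")
# A large p-value is consistent with (but does not prove) parallel trends
```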
VI. Clustered Standard Errors Adjustment
Why Clustered Standard Errors Are Needed
Important Finding by Bertrand et al. (2004)
Common statistical problems in DID research:
- Serial Correlation: Error terms for an individual across periods are correlated
- Standard Error Underestimation: OLS standard errors severely underestimate true standard errors
- Significance Exaggeration: Leads to excessive rejection of null hypothesis
Solution: Use entity-level clustered standard errors
Different Options in Python
from linearmodels.panel import PanelOLS
import statsmodels.formula.api as smf
df_panel = df.set_index(['unit', 'period'])
# 1. Naive OLS (not recommended)
# 1. Naive (unadjusted) standard errors (not recommended)
model_ols = PanelOLS(
    df_panel['y'],
    df_panel[['did']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='unadjusted')

# 2. Heteroskedasticity-robust (still ignores serial correlation; not recommended)
model_robust = PanelOLS(
    df_panel['y'],
    df_panel[['did']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='robust')

# 3. Clustered standard errors - entity level (recommended)
model_cluster_entity = PanelOLS(
    df_panel['y'],
    df_panel[['did']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

# 4. Two-way clustering (entity + time)
model_cluster_two_way = PanelOLS(
    df_panel['y'],
    df_panel[['did']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
# Compare results
print("=" * 70)
print("Standard Error Comparison")
print("=" * 70)
from scipy import stats
results_comparison = pd.DataFrame({
    'Coefficient': [
        model_ols.params['did'],
        model_robust.params['did'],
        model_cluster_entity.params['did'],
        model_cluster_two_way.params['did']
    ],
    'Std Error': [
        model_ols.std_errors['did'],
        model_robust.std_errors['did'],
        model_cluster_entity.std_errors['did'],
        model_cluster_two_way.std_errors['did']
    ]
}, index=['OLS', 'Robust', 'Cluster(Entity)', 'Cluster(Two-way)'])
results_comparison['t-stat'] = results_comparison['Coefficient'] / results_comparison['Std Error']
# Use a t distribution with G - 1 df (G = number of clusters) as a conservative reference
results_comparison['p-value'] = 2 * (1 - stats.t.cdf(np.abs(results_comparison['t-stat']), df=n_units - 1))
print(results_comparison)
print("\n")
print("Notes:")
print(" • OLS standard errors typically underestimate true standard errors")
print(" • Recommended to use clustered standard errors (entity-level clustering)")Best Practice Recommendations
- Minimum Requirement: Use entity-level clustering (cluster_entity=True)
- More Robust: Two-way clustering (entity + time)
- Small Samples: Consider wild bootstrap and other resampling methods (a simple cluster bootstrap sketch follows below)
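One resampling alternative that can be written by hand is the pairs (cluster) bootstrap: resample whole units with replacement and re-estimate the model. This is a hedged sketch using the simulated df from above, not a substitute for a dedicated wild cluster bootstrap implementation:

```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

def cluster_bootstrap_se(df, n_boot=200, seed=0):
    """Pairs cluster bootstrap: resample entire units with replacement."""
    rng = np.random.default_rng(seed)
    units = df['unit'].unique()
    coefs = []
    for _ in range(n_boot):
        sampled = rng.choice(units, size=len(units), replace=True)
        # Relabel resampled units so duplicates count as distinct panel entities
        parts = [df[df['unit'] == u].assign(unit=i) for i, u in enumerate(sampled)]
        boot_df = pd.concat(parts).set_index(['unit', 'period'])
        res = PanelOLS(boot_df['y'], boot_df[['did']],
                       entity_effects=True, time_effects=True).fit()
        coefs.append(res.params['did'])
    return np.std(coefs, ddof=1)

# boot_se = cluster_bootstrap_se(df)  # can be slow; lower n_boot to try it out
```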
VII. Adding Control Variables
DID with Control Variables
Extended Regression Equation
$$Y_{it} = \alpha_i + \lambda_t + \tau D_{it} + X_{it}'\beta + \varepsilon_{it}$$
where $X_{it}$ are time-varying covariates
When to Add Control Variables
- To improve estimation efficiency (precision) and reduce standard errors
- To test robustness of parallel trends (control variables shouldn't change parallel trends)
- But beware of "bad controls"
Python Implementation
# Generate time-varying control variables
np.random.seed(42)
df['x1'] = np.random.normal(10, 2, len(df)) # Continuous variable
df['x2'] = np.random.binomial(1, 0.5, len(df)) # Binary variable
# Regression (with control variables)
df_panel = df.set_index(['unit', 'period'])
model_with_controls = PanelOLS(
    df_panel['y'],
    df_panel[['did', 'x1', 'x2']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)
print("=" * 70)
print("DID with Control Variables")
print("=" * 70)
print(model_with_controls.summary)

Considerations
- Only control for exogenous variables: Control variables must be unaffected by the treatment
- Don't include "bad controls": Don't control for variables that mediate the policy's effect (a simulated illustration follows below)
- Be selective: Too many controls reduce estimation efficiency; methods such as Lasso can help choose them
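To see why mediators are "bad controls", here is a hedged simulated illustration (the mediator and its coefficients are assumed for demonstration only): part of the policy effect flows through a mediator, and controlling for it absorbs that part, shrinking the DID coefficient below the true total effect.

```python
import numpy as np
from linearmodels.panel import PanelOLS

np.random.seed(0)
df_bad = df.copy()
df_bad['mediator'] = 5 * df_bad['did'] + np.random.normal(0, 1, len(df_bad))
df_bad['y2'] = df_bad['y'] + 2 * df_bad['mediator']  # total effect = direct + mediated
panel_bad = df_bad.set_index(['unit', 'period'])
for exog in (['did'], ['did', 'mediator']):
    res = PanelOLS(panel_bad['y2'], panel_bad[exog],
                   entity_effects=True, time_effects=True
                   ).fit(cov_type='clustered', cluster_entity=True)
    print(f"exog = {exog}: DID coefficient = {res.params['did']:.2f}")
# Without the mediator the coefficient recovers the total effect (~30 here);
# controlling for it leaves only the direct effect (~20).
```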
VIII. Extensions of DID
Staggered DID
Scenario: Individuals receive treatment at different times
Problem with Traditional TWFE (Goodman-Bacon 2021)
When treatment timing is staggered, the TWFE coefficient $\hat{\tau}$ is no longer a simple DID but a weighted average of many heterogeneous 2×2 DID comparisons, and some weights can be negative (see the simulated sketch after the list below).
"Bad Control Group" Problem
- Already-treated individuals become control group for later-treated individuals
- Can lead to biased estimates
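A hedged simulated sketch of the problem (all numbers assumed): three adoption cohorts with treatment effects that grow over time. The static TWFE coefficient mixes clean and "forbidden" comparisons and can deviate noticeably from the true average effect on the treated:

```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

rng = np.random.default_rng(1)
rows = []
for unit in range(60):
    first_treat = {0: 4, 1: 8, 2: 10**9}[unit % 3]  # early / late / never treated
    for t in range(12):
        effect = 3 * (t - first_treat + 1) if t >= first_treat else 0  # grows over time
        rows.append({'unit': unit, 'period': t, 'did': int(t >= first_treat),
                     'effect': effect,
                     'y': 50 + 2 * t + effect + rng.normal(0, 1)})
stag = pd.DataFrame(rows)
true_att = stag.loc[stag['did'] == 1, 'effect'].mean()  # avg effect across treated obs
panel = stag.set_index(['unit', 'period'])
twfe = PanelOLS(panel['y'], panel[['did']], entity_effects=True,
                time_effects=True).fit(cov_type='clustered', cluster_entity=True)
print(f"Static TWFE estimate: {twfe.params['did']:.2f} | true average effect: {true_att:.2f}")
```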
Solutions
- Callaway & Sant'Anna (2021): Estimate group-time specific ATT
- Sun & Abraham (2021): Construct clean interaction terms
- De Chaisemartin & D'Haultfoeuille (2020): Robust DID for varying treatment effects
How to Use CS Estimator
# Pseudocode sketch only: the reference implementation is the R package 'did'
# (Callaway & Sant'Anna); Python ports such as csdid are less mature. One option
# is to call R via rpy2. The interface below is illustrative, not a real API:
"""
from csdid import ATT
# Estimate group-time specific ATT
att_results = ATT(
data=df,
yname='y',
gname='first_treat', # First time unit receives treatment
tname='period',
idname='unit',
control_group='notyettreated' # Use not-yet-treated units as control
)
att_results.summary()
att_results.plot()
"""
print("=" * 70)
print("Handling Staggered DID")
print("=" * 70)
print("If your data has staggered treatment timing, recommended to use:")
print("1. Callaway & Sant'Anna (2021) method")
print("2. Sun & Abraham (2021) method")
print("3. Or manually correct potential TWFE problems")IX. Section Summary
Key Takeaways
DID's Causal Inference Logic
- Parallel trends assumption is the core of identification
- DID estimator identifies ATT (average treatment effect on the treated)
Panel DID Models
- Use two-way fixed effects (TWFE) models
- Unit fixed effects control for time-invariant characteristics
- Time fixed effects control for common time trends
Standard Errors
- Must use clustered standard errors (entity-level clustering)
- Consider bootstrap and other resampling methods for small samples
Extensions and Considerations
- Staggered DID requires special methods
- Recommended to use robust estimators
Python Toolbox
| Task | Recommended Package |
|---|---|
| Basic DID | statsmodels.formula.api.ols() |
| Panel DID | linearmodels.panel.PanelOLS() |
| Clustered Standard Errors | cov_type='clustered' |
| Staggered DID | csdid, did (R packages or Python interfaces) |
Next Steps
Proceed to Section 3: Parallel Trends Assumption to learn more:
- How to test parallel trends assumption
- Drawing event study plots
- Handling violations of parallel trends
Mastering panel DID is the cornerstone of policy evaluation!