8.5 Advanced Panel Data Topics
Mastering Frontier Techniques: Two-Way Fixed Effects, Clustered Standard Errors, Dynamic Panels, and DID
Section Objectives
- Deeply understand the identification logic of two-way fixed effects (Two-Way FE)
- Correctly use clustered standard errors (Clustered SE)
- Gain a preliminary understanding of dynamic panel models (Arellano-Bond)
- Apply panel methods to DID research
- Learn techniques for handling unbalanced panels
- Navigate common pitfalls in panel data
Two-Way Fixed Effects
Model Definition
One-way fixed effects:
$$y_{it} = \beta x_{it} + \alpha_i + \varepsilon_{it}$$
- Controls: Individual heterogeneity $\alpha_i$
Two-way fixed effects:
$$y_{it} = \beta x_{it} + \alpha_i + \lambda_t + \varepsilon_{it}$$
- Controls: Individual heterogeneity $\alpha_i$ + time effects $\lambda_t$
Meaning of the time effects $\lambda_t$:
- Time-specific shocks affecting all individuals
- Examples: Macroeconomic cycles, policy changes, technological progress, natural disasters
Why Do We Need Two-Way FE?
Scenario 1: Common Time Trends Exist
Problem: If $x_{it}$ and $y_{it}$ both grow over time, the observed relationship might be driven by common time factors
Example: Studying the effect of advertising expenditure on sales
- Advertising expenditure increases annually (technological progress, media cost reduction)
- Sales also increase annually (economic growth, rising consumer income)
- Without controlling for time trends, we might incorrectly attribute this common growth to the advertising effect
Solution: Two-way FE
model_twoway = PanelOLS(sales, ads,
entity_effects=True, # Control for company fixed effects
                        time_effects=True).fit()  # Control for year fixed effects

Scenario 2: Standard Practice in DID Research
DID Model:
$$y_{it} = \alpha_i + \lambda_t + \delta\,(\text{treated}_i \times \text{post}_t) + \varepsilon_{it}$$
- $\alpha_i$: Controls for fixed differences between treatment and control groups
- $\lambda_t$: Controls for common time trends (the embodiment of the parallel trends assumption)
Python Implementation:
# Standard DID implementation is two-way FE + interaction term
model_did = PanelOLS(y, treated_post,
entity_effects=True,
                     time_effects=True).fit()

Identification Logic of Two-Way FE
Demeaning Transformation (twice):
- Individual demeaning: $y_{it} - \bar{y}_i$ (removes $\alpha_i$)
- Time demeaning: then subtract period means, giving $\tilde{y}_{it} = y_{it} - \bar{y}_i - \bar{y}_t + \bar{y}$ (removes $\lambda_t$)
Final Estimation:
$$\tilde{y}_{it} = \beta \tilde{x}_{it} + \tilde{\varepsilon}_{it}$$
Intuition:
- First step: Eliminate individual heterogeneity (between-group differences)
- Second step: Eliminate time trends (common macro shocks)
- Remaining variation: Individual-specific time variation
Python Complete Example
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
sns.set_style("whitegrid")
# Simulate data: add time trend
np.random.seed(42)
N = 100 # 100 companies
T = 10 # 10 years
data = []
for i in range(N):
alpha_i = np.random.normal(0, 1) # Company fixed effect
for t in range(T):
year = 2010 + t
lambda_t = 0.05 * t # Time fixed effect (common growth trend)
x = 10 + 0.3 * t + np.random.normal(0, 2)
y = 5 + 2 * x + alpha_i + lambda_t + np.random.normal(0, 1)
data.append({'company': i, 'year': year, 'y': y, 'x': x})
df = pd.DataFrame(data)
df_panel = df.set_index(['company', 'year'])
# Model 1: Pooled OLS (biased)
import statsmodels.api as sm
X_pooled = sm.add_constant(df[['x']])
model_pooled = sm.OLS(df['y'], X_pooled).fit()
# Model 2: One-way FE (control for companies)
model_oneway = PanelOLS(df_panel['y'], df_panel[['x']],
entity_effects=True).fit(cov_type='clustered',
cluster_entity=True)
# Model 3: Two-way FE (control for companies + year)
model_twoway = PanelOLS(df_panel['y'], df_panel[['x']],
entity_effects=True,
time_effects=True).fit(cov_type='clustered',
cluster_entity=True)
print("=" * 70)
print("One-Way FE vs Two-Way FE")
print("=" * 70)
print(f"True parameter: 2.0000")
print(f"Pooled OLS: {model_pooled.params['x']:.4f}")
print(f"One-way FE: {model_oneway.params['x']:.4f}")
print(f"Two-way FE: {model_twoway.params['x']:.4f} (closest to true value)")
# Visualize time effects
# Extract the estimated time fixed effects.
# estimated_effects holds the combined (entity + time) effect for each observation;
# averaging by year and demeaning recovers the time effects up to a constant
# (exact for a balanced panel like this one).
combined_effects = model_twoway.estimated_effects.iloc[:, 0]
time_effects = combined_effects.groupby(level='year').mean()
time_effects = time_effects - time_effects.mean()
print("\n" + "=" * 70)
print("Estimated Time Fixed Effects")
print("=" * 70)
print(time_effects.head(10))
# Plot time effects
plt.figure(figsize=(12, 6))
plt.plot(time_effects.index, time_effects.values, 'o-',
linewidth=2, markersize=8, color='darkblue')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Time Fixed Effects', fontweight='bold', fontsize=12)
plt.title('Estimated Time Fixed Effects (Common Time Trends)', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Output Interpretation:
- One-way FE: If time trends exist but are not controlled for, estimates may be biased
- Two-way FE: Controls for both individual and time effects, more accurate estimation
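To connect the estimates back to the demeaning logic above, the two-way FE slope can be reproduced by hand. A minimal sketch, reusing df and model_twoway from the example above (for a balanced panel like this one, manual two-way demeaning matches PanelOLS exactly):

```python
# Manual two-way within transformation (balanced panel):
# subtract company means and year means, then add back the grand mean.
def two_way_demean(s, entity, time):
    return (s
            - s.groupby(entity).transform('mean')   # remove company means
            - s.groupby(time).transform('mean')     # remove year means
            + s.mean())                             # add back the grand mean

y_tilde = two_way_demean(df['y'], df['company'], df['year'])
x_tilde = two_way_demean(df['x'], df['company'], df['year'])

# OLS slope on the demeaned data (no intercept needed: both series have mean ~0)
beta_manual = (x_tilde * y_tilde).sum() / (x_tilde ** 2).sum()
print(f"Manual two-way demeaning: {beta_manual:.4f}")
print(f"PanelOLS two-way FE:      {model_twoway.params['x']:.4f}")
```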
Clustered Standard Errors
Why Do We Need Clustered Standard Errors?
Problem: In panel data, errors at different times for the same individual are typically correlated (serial correlation)
Consequences:
- Classical OLS standard errors assume error independence
- If errors are correlated, OLS standard errors underestimate true uncertainty
- Leads to inflated $t$-statistics and more false positives (Type I errors)
Example:
- Individual $i$ experiences a positive shock in 2015 ($\varepsilon_{i,2015} > 0$)
- This shock may persist into 2016 ($\varepsilon_{i,2016} > 0$)
- Therefore $\operatorname{Corr}(\varepsilon_{i,2015}, \varepsilon_{i,2016}) > 0$
Principle of Clustered Standard Errors
Core Idea: Allow observations within the same cluster to be correlated, assuming independence between different clusters
Standard Practice for Panel Data: Cluster at individual level
- Allow all time observations for individual $i$ to be correlated
- Assume independence between different individuals
Python Implementation:
model = PanelOLS(y, X, entity_effects=True).fit(
cov_type='clustered',
cluster_entity=True # Cluster at entity (individual) level
)

Clustering Choices
| Clustering Level | When to Use | Python Implementation |
|---|---|---|
| Entity | Standard practice for panel data | cluster_entity=True |
| Time | Different individuals at same time may be correlated (rare) | cluster_time=True |
| Two-way clustering | Allow both entity and time clustering | cluster_entity=True, cluster_time=True |
| Custom clustering | E.g., cluster by state, industry | clusters=df['state'] |
Recommendations:
- Panel data → cluster_entity=True (most common)
- DID research → cluster at the treatment unit level (e.g., state, city); see the sketch below for two-way and custom clustering
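For the non-default choices in the table, here is a schematic sketch of the fit calls. The y, X placeholders and the state column follow the same illustrative style as the snippet above and are assumptions, not variables defined in this section:

```python
# Two-way clustering: allow correlation within entities and within time periods
model_2way = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(
    cov_type='clustered', cluster_entity=True, cluster_time=True)

# Custom clusters, e.g. by state when treatment varies at the state level;
# the clusters argument must be aligned with the panel's (entity, time) index.
model_state = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(
    cov_type='clustered', clusters=df_panel['state'])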
Clustered SE vs Robust SE
| Type | Allowed Error Patterns | When to Use |
|---|---|---|
| Classical OLS SE | Homoskedasticity + Independence | Almost never (assumptions too strong) |
| Robust SE | Heteroskedasticity + Independence | Cross-sectional data |
| Clustered SE | Heteroskedasticity + Within-cluster correlation | Panel data ⭐ |
Important Rule:
- Panel data must use clustered SE
- Not using clustered SE can severely understate standard errors (underestimation of 50% or more is possible)
Python Comparison Example
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
# Simulate data: strong serial correlation
np.random.seed(123)
data = []
for i in range(100):
shock = np.random.normal(0, 2) # Individual-specific persistent shock
for t in range(10):
x = 10 + np.random.normal(0, 1)
# Error term has persistent component (serial correlation)
epsilon = shock + np.random.normal(0, 0.5)
y = 5 + 2 * x + epsilon
data.append({'id': i, 'year': 2010 + t, 'y': y, 'x': x})
df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])
X = sm.add_constant(df_panel[['x']])  # include an intercept to match the DGP
# Three types of standard errors (identical point estimates, different SEs)
model_unadjusted = PanelOLS(df_panel['y'], X).fit(
    cov_type='unadjusted'  # Classical OLS SE
)
model_robust = PanelOLS(df_panel['y'], X).fit(
    cov_type='robust'  # Robust SE (heteroskedasticity only)
)
model_clustered = PanelOLS(df_panel['y'], X).fit(
    cov_type='clustered',
    cluster_entity=True  # Clustered SE (heteroskedasticity + serial correlation)
)
print("=" * 70)
print("Standard Error Comparison")
print("=" * 70)
print(f"Coefficient estimate: {model_clustered.params['x']:.4f} (same for all three methods)")
print(f"Classical SE: {model_unadjusted.std_errors['x']:.4f} (underestimate!)")
print(f"Robust SE: {model_robust.std_errors['x']:.4f} (still underestimate)")
print(f"Clustered SE: {model_clustered.std_errors['x']:.4f} (correct)")
print(f"\nClustered SE / Classical SE: {model_clustered.std_errors['x'] / model_unadjusted.std_errors['x']:.2f}x")Key Finding:
- Clustered SE is typically 1.5-3 times classical SE
- Without clustered SE, $t$-statistics are inflated, leading to incorrect rejections of the null hypothesis
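To see the serial correlation that drives this gap, you can inspect the within-individual autocorrelation of the residuals. A minimal sketch continuing from the comparison above (it assumes the fitted result exposes residuals via .resids, as recent linearmodels versions do):

```python
# First-order autocorrelation of residuals within each individual
resid = model_clustered.resids.reset_index()
resid.columns = ['id', 'year', 'resid']
resid = resid.sort_values(['id', 'year'])
resid['resid_lag'] = resid.groupby('id')['resid'].shift(1)

rho = resid[['resid', 'resid_lag']].dropna().corr().iloc[0, 1]
print(f"Within-individual residual autocorrelation: {rho:.2f}")
# A value far from zero is exactly the correlation that clustered SEs account for.
```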
Dynamic Panel Models
What is a Dynamic Panel?
Model:
$$y_{it} = \beta_1 y_{i,t-1} + \beta_2 x_{it} + \alpha_i + \varepsilon_{it}$$
Characteristic: The lagged dependent variable $y_{i,t-1}$ appears as a regressor
Application Scenarios:
- Persistence: Income, GDP, health status
- Adjustment Costs: Corporate investment, employment
- Habit Formation: Consumption, savings
Why Doesn't Regular FE Work?
Problem: $y_{i,t-1}$ is endogenous: it is correlated with the demeaned error term
Reason:
- $y_{i,t-1}$ depends on $\varepsilon_{i,t-1}$
- After the within transformation, $\tilde{y}_{i,t-1}$ depends on $\bar{\varepsilon}_i$ (which includes $\varepsilon_{it}$)
- This leads to $\operatorname{Cov}(\tilde{y}_{i,t-1}, \tilde{\varepsilon}_{it}) \neq 0$
Consequence: FE estimation is biased and inconsistent for fixed $T$ (even as $N \to \infty$); this is the Nickell bias
Arellano-Bond Estimator
Core Idea: Use instrumental variables (IV) + first differencing
Step 1: First Difference to Eliminate Fixed Effects
$$\Delta y_{it} = \beta_1 \Delta y_{i,t-1} + \beta_2 \Delta x_{it} + \Delta \varepsilon_{it}$$
Step 2: Use Earlier Lags $y_{i,t-2}, y_{i,t-3}, \ldots$ as Instruments for $\Delta y_{i,t-1}$
The instruments are:
- Correlated with $\Delta y_{i,t-1}$ (relevance condition)
- Uncorrelated with $\Delta \varepsilon_{it}$ (exogeneity condition)
Estimation Method: GMM (Generalized Method of Moments)
Python Implementation (Simplified Version)
from linearmodels.panel import PanelOLS
import pandas as pd
import numpy as np
# Simulate dynamic panel data
np.random.seed(42)
data = []
for i in range(100):
alpha_i = np.random.normal(0, 1)
y_lag = 5 # Initial value
for t in range(10):
x = 10 + np.random.normal(0, 2)
epsilon = np.random.normal(0, 1)
y = 0.5 * y_lag + 1.5 * x + alpha_i + epsilon # True parameters: beta1=0.5, beta2=1.5
data.append({'id': i, 'year': 2010 + t, 'y': y, 'x': x})
y_lag = y # Update lagged value
df = pd.DataFrame(data)
# Create lagged variable
df = df.sort_values(['id', 'year'])
df['y_lag'] = df.groupby('id')['y'].shift(1)
df = df.dropna()
df_panel = df.set_index(['id', 'year'])
# Wrong method: Regular FE (biased!)
model_fe_wrong = PanelOLS(df_panel['y'],
df_panel[['y_lag', 'x']],
entity_effects=True).fit()
print("=" * 70)
print("Dynamic Panel Model")
print("=" * 70)
print(f"True parameters: y_lag=0.5, x=1.5")
print(f"\nFE estimate (biased):")
print(f" y_lag: {model_fe_wrong.params['y_lag']:.4f}")
print(f" x: {model_fe_wrong.params['x']:.4f}")
print("\nNote: FE estimation is biased! Should use Arellano-Bond GMM")Note:
- Python's
linearmodelscurrently doesn't support Arellano-Bond - Need to use Stata's
xtabondor R'splmpackage - This is an advanced topic in dynamic panels, beyond this course's scope
Panel Data and DID
DID is Two-Way Fixed Effects + Interaction Term
Standard DID Model:
$$y_{it} = \alpha_i + \lambda_t + \delta\,(\text{treated}_i \times \text{post}_t) + \varepsilon_{it}$$
Equivalent to:
model_did = PanelOLS(y, treated_post,
entity_effects=True, # Control for α_i
                     time_effects=True).fit()  # Control for λ_t

Python Complete DID Example
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
# Simulate DID data
np.random.seed(2024)
data = []
# Treatment group: ID 1-50, receive treatment in 2018
# Control group: ID 51-100, don't receive treatment
for i in range(1, 101):
treated = 1 if i <= 50 else 0
alpha_i = np.random.normal(0, 1)
for t in range(2015, 2021):
year = t
post = 1 if year >= 2018 else 0
treated_post = treated * post
# DID effect = 10
y = 50 + 10 * treated_post + alpha_i + 0.5 * year + np.random.normal(0, 2)
data.append({
'id': i,
'year': year,
'y': y,
'treated': treated,
'post': post,
'treated_post': treated_post
})
df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])
# DID regression
model_did = PanelOLS(df_panel['y'],
df_panel[['treated_post']],
entity_effects=True,
time_effects=True).fit(cov_type='clustered',
cluster_entity=True)
print("=" * 70)
print("DID Estimation Results")
print("=" * 70)
print(model_did)
print(f"\nDID effect: {model_did.params['treated_post']:.2f} (true value: 10.00)")
# Event study plot
# Create year dummy variables
for year in range(2015, 2021):
df[f'treated_x_{year}'] = df['treated'] * (df['year'] == year)
# Use 2017 as baseline (last year before treatment)
event_vars = [f'treated_x_{y}' for y in [2015, 2016, 2018, 2019, 2020]]
df_panel_event = df.set_index(['id', 'year'])
model_event = PanelOLS(df_panel_event['y'],
df_panel_event[event_vars],
entity_effects=True,
time_effects=True).fit(cov_type='clustered',
cluster_entity=True)
# Extract coefficients
years = [2015, 2016, 2017, 2018, 2019, 2020]
coefs = [model_event.params[f'treated_x_{y}'] if y != 2017 else 0 for y in years]
se = [model_event.std_errors[f'treated_x_{y}'] if y != 2017 else 0 for y in years]
# Plot event study graph
plt.figure(figsize=(12, 6))
plt.errorbar(years, coefs, yerr=1.96*np.array(se), marker='o',
markersize=8, linewidth=2, capsize=5, color='darkblue')
plt.axhline(0, color='red', linestyle='--', linewidth=1)
plt.axvline(2017.5, color='green', linestyle='--', linewidth=1.5, alpha=0.7)
plt.text(2017.5, max(coefs) * 0.8, 'Policy Implementation', fontsize=12, color='green')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Treatment Effect', fontweight='bold', fontsize=12)
plt.title('Event Study Plot', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Interpretation:
- 2015-2017: Coefficients close to 0 (parallel trends hold)
- 2018-2020: Coefficients significantly positive (treatment effect)
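A common robustness check is a placebo test: pretend the policy started earlier and re-estimate the DID on pre-treatment data only; a significant "effect" would undermine the parallel trends assumption. A minimal sketch continuing from the DID example above (the 2016 placebo date is an arbitrary illustrative choice):

```python
# Placebo DID: fake treatment date of 2016, using only pre-treatment years 2015-2017
df_pre = df[df['year'] <= 2017].copy()
df_pre['placebo_post'] = (df_pre['year'] >= 2016).astype(int)
df_pre['treated_placebo'] = df_pre['treated'] * df_pre['placebo_post']
df_pre_panel = df_pre.set_index(['id', 'year'])

model_placebo = PanelOLS(df_pre_panel['y'],
                         df_pre_panel[['treated_placebo']],
                         entity_effects=True,
                         time_effects=True).fit(cov_type='clustered',
                                                cluster_entity=True)

print(f"Placebo effect: {model_placebo.params['treated_placebo']:.2f}, "
      f"p-value: {model_placebo.pvalues['treated_placebo']:.2f} (should be near zero and insignificant)")
```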
Handling Unbalanced Panels
Types of Unbalanced Panels
Attrition: Individuals exit sample
- Example: Company bankruptcy, individual exits survey
Entry and Exit: New individuals join sample
- Example: New company IPO, new hospital established
Random Missing: Data missing at certain time points
- Example: Survey incomplete, data entry error
Problems with Unbalanced Panels
Problem 1: Selection Bias
- If exit is related to outcome variable, estimates are biased
- Example: Companies with poor performance more likely to delist
Problem 2: Efficiency Loss
- Missing data reduces sample size
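Before choosing among the handling methods below, it helps to quantify how unbalanced the panel actually is. A minimal diagnostic sketch, assuming a long-format DataFrame df with 'id' and 'year' columns:

```python
# How many periods is each entity observed in?
obs_per_id = df.groupby('id')['year'].nunique()
T_max = df['year'].nunique()

print(f"Fully observed entities : {(obs_per_id == T_max).sum()}")
print(f"Entities with gaps      : {(obs_per_id < T_max).sum()}")
print(obs_per_id.value_counts().sort_index())  # distribution of observed periods per entity
```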
Handling Methods
Method 1: Keep Unbalanced (Recommended) ⭐
linearmodels automatically handles unbalanced panels:
# No special operations needed, linearmodels will handle automatically
model = PanelOLS(y, X, entity_effects=True).fit()

Advantages:
- Retain all available information
- Avoid arbitrarily deleting data
Prerequisites:
- Missing is random (Missing at Random, MAR)
- Or missingness is correlated with the independent variables, but not with the error term
Method 2: Use Balanced Subsample
Construct balanced panel:
# Only keep individuals observed in all time periods
T = df['year'].nunique()  # total number of time periods in the panel
counts = df.groupby('id')['year'].count()
complete_ids = counts[counts == T].index
df_balanced = df[df['id'].isin(complete_ids)]

Advantages:
- Avoid selection bias (if concerned about non-random attrition)
Disadvantages:
- Loss of substantial data
- Low efficiency
Method 3: Sample Selection Models (Heckman)
Applicable to: Non-random attrition (e.g., company bankruptcy)
Method:
- First stage: Estimate attrition probability (Probit)
- Second stage: Add Inverse Mills Ratio as control variable
Beyond this course's scope, refer to Wooldridge (2010) Chapter 19
Section Summary
Key Points
Two-way fixed effects:
- Control for individual + time effects
- Standard practice for DID
- Eliminate common time trends
Clustered standard errors:
- Essential tool for panel data
- Cluster at individual level (standard practice)
- Avoid underestimating standard errors
Dynamic panels:
- Include lagged dependent variable
- Regular FE is biased
- Need Arellano-Bond GMM
Panel data + DID:
- DID = Two-way FE + interaction term
- Event study plot tests parallel trends
- Cluster at treatment unit level
Unbalanced panels:
- linearmodels handles automatically
- Prioritize keeping unbalanced (if MAR)
- Use balanced subsample when concerned about selection bias
Practical Recommendations
Standard Panel Regression Checklist:
- ✓ Use two-way FE (if time trends exist)
- ✓ Use clustered standard errors (cluster_entity=True)
- ✓ Check if within variation is sufficient
- ✓ Conduct Hausman test (FE vs RE)
- ✓ Report N, T, and total observations (a minimal reporting sketch follows this list)
- ✓ Check for bad controls (mediators)
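For the reporting item above, a minimal sketch, reusing df_panel and model_did from the DID example; rsquared_within and nobs are attribute names from linearmodels' results API and worth double-checking against your installed version:

```python
# Report N, T, total observations, and the within R-squared
res = model_did  # any fitted PanelOLS result
N = df_panel.index.get_level_values('id').nunique()    # number of entities
T = df_panel.index.get_level_values('year').nunique()  # number of time periods
print(f"N = {N}, T = {T}, total observations = {res.nobs}")
print(f"Within R-squared: {res.rsquared_within:.3f}")
```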
DID Research Checklist:
- ✓ Use two-way FE
- ✓ Cluster at treatment unit level
- ✓ Plot event study graph
- ✓ Test parallel trends
- ✓ Conduct placebo tests
Next Steps
In Section 8.6: Summary and Review, we will:
- Summarize panel data methods decision tree
- Provide 10 practice problems
- Recommend classic literature
Master these advanced techniques and become a panel data expert!