8.5 Advanced Panel Data Topics
Mastering Frontier Techniques: Two-Way Fixed Effects, Clustered Standard Errors, Dynamic Panels, and DID
Section Objectives
- Deeply understand the identification logic of two-way fixed effects (Two-Way FE)
- Correctly use clustered standard errors (Clustered SE)
- Gain a preliminary understanding of dynamic panel models (Arellano-Bond)
- Apply panel methods to DID research
- Learn techniques for handling unbalanced panels
- Navigate common pitfalls in panel data
Two-Way Fixed Effects
Model Definition
One-way fixed effects:
$$y_{it} = \beta x_{it} + \alpha_i + \varepsilon_{it}$$
- Controls: Individual heterogeneity $\alpha_i$
Two-way fixed effects:
$$y_{it} = \beta x_{it} + \alpha_i + \lambda_t + \varepsilon_{it}$$
- Controls: Individual heterogeneity $\alpha_i$ + time effects $\lambda_t$
Meaning of the time effects $\lambda_t$:
- Time-specific shocks affecting all individuals
- Examples: Macroeconomic cycles, policy changes, technological progress, natural disasters
Why Do We Need Two-Way FE?
Scenario 1: Common Time Trends Exist
Problem: If $x_{it}$ and $y_{it}$ both grow over time, the observed relationship might be driven by common time factors
Example: Studying the effect of advertising expenditure on sales
- Advertising expenditure increases annually (technological progress, media cost reduction)
- Sales also increase annually (economic growth, rising consumer income)
- Without controlling for time trends, we might incorrectly attribute this common growth to the advertising effect
Solution: Two-way FE
model_twoway = PanelOLS(sales, ads,
entity_effects=True, # Control for company fixed effects
                        time_effects=True).fit()  # Control for year fixed effects

Scenario 2: Standard Practice in DID Research
DID Model:
$$y_{it} = \alpha_i + \lambda_t + \delta\,(\text{treated}_i \times \text{post}_t) + \varepsilon_{it}$$
- $\alpha_i$: Controls for fixed differences between treatment and control groups
- $\lambda_t$: Controls for common time trends (the embodiment of the parallel trends assumption)
Python Implementation:
# Standard DID implementation is two-way FE + interaction term
model_did = PanelOLS(y, treated_post,
entity_effects=True,
                     time_effects=True).fit()

Identification Logic of Two-Way FE
Demeaning Transformation (twice):
- Individual demeaning: $y_{it} - \bar{y}_i$ (removes $\alpha_i$)
- Time demeaning: then subtract period means, giving $\tilde{y}_{it} = y_{it} - \bar{y}_i - \bar{y}_t + \bar{y}$ (removes $\lambda_t$)
Final Estimation:
$$\tilde{y}_{it} = \beta \tilde{x}_{it} + \tilde{\varepsilon}_{it}$$
Intuition:
- First step: Eliminate individual heterogeneity (between-group differences)
- Second step: Eliminate time trends (common macro shocks)
- Remaining variation: Individual-specific time variation
Python Complete Example
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
sns.set_style("whitegrid")
# Simulate data: add time trend
np.random.seed(42)
N = 100 # 100 companies
T = 10 # 10 years
data = []
for i in range(N):
alpha_i = np.random.normal(0, 1) # Company fixed effect
for t in range(T):
year = 2010 + t
lambda_t = 0.05 * t # Time fixed effect (common growth trend)
x = 10 + 0.3 * t + np.random.normal(0, 2)
y = 5 + 2 * x + alpha_i + lambda_t + np.random.normal(0, 1)
data.append({'company': i, 'year': year, 'y': y, 'x': x})
df = pd.DataFrame(data)
df_panel = df.set_index(['company', 'year'])
# Model 1: Pooled OLS (biased)
import statsmodels.api as sm
X_pooled = sm.add_constant(df[['x']])
model_pooled = sm.OLS(df['y'], X_pooled).fit()
# Model 2: One-way FE (control for companies)
model_oneway = PanelOLS(df_panel['y'], df_panel[['x']],
entity_effects=True).fit(cov_type='clustered',
cluster_entity=True)
# Model 3: Two-way FE (control for companies + year)
model_twoway = PanelOLS(df_panel['y'], df_panel[['x']],
entity_effects=True,
time_effects=True).fit(cov_type='clustered',
cluster_entity=True)
print("=" * 70)
print("One-Way FE vs Two-Way FE")
print("=" * 70)
print(f"True parameter: 2.0000")
print(f"Pooled OLS: {model_pooled.params['x']:.4f}")
print(f"One-way FE: {model_oneway.params['x']:.4f}")
print(f"Two-way FE: {model_twoway.params['x']:.4f} (closest to true value)")
# Visualize time effects
# Extract the estimated time fixed effects.
# estimated_effects holds the combined (entity + time) effect for each observation;
# averaging by year and demeaning recovers the time effects up to a constant
# (exact for a balanced panel like this one).
combined_effects = model_twoway.estimated_effects.iloc[:, 0]
time_effects = combined_effects.groupby(level='year').mean()
time_effects = time_effects - time_effects.mean()
print("\n" + "=" * 70)
print("Estimated Time Fixed Effects")
print("=" * 70)
print(time_effects.head(10))
# Plot time effects
plt.figure(figsize=(12, 6))
plt.plot(time_effects.index, time_effects.values, 'o-',
linewidth=2, markersize=8, color='darkblue')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Time Fixed Effects', fontweight='bold', fontsize=12)
plt.title('Estimated Time Fixed Effects (Common Time Trends)', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Output Interpretation:
- One-way FE: If time trends exist but are not controlled for, estimates may be biased
- Two-way FE: Controls for both individual and time effects, more accurate estimation
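To connect the estimates back to the demeaning logic above, the two-way FE slope can be reproduced by hand. A minimal sketch, reusing df and model_twoway from the example above (for a balanced panel like this one, manual two-way demeaning matches PanelOLS exactly):

```python
# Manual two-way within transformation (balanced panel):
# subtract company means and year means, then add back the grand mean.
def two_way_demean(s, entity, time):
    return (s
            - s.groupby(entity).transform('mean')   # remove company means
            - s.groupby(time).transform('mean')     # remove year means
            + s.mean())                             # add back the grand mean

y_tilde = two_way_demean(df['y'], df['company'], df['year'])
x_tilde = two_way_demean(df['x'], df['company'], df['year'])

# OLS slope on the demeaned data (no intercept needed: both series have mean ~0)
beta_manual = (x_tilde * y_tilde).sum() / (x_tilde ** 2).sum()
print(f"Manual two-way demeaning: {beta_manual:.4f}")
print(f"PanelOLS two-way FE:      {model_twoway.params['x']:.4f}")
```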
Clustered Standard Errors
Why Do We Need Clustered Standard Errors?
Problem: In panel data, errors at different times for the same individual are typically correlated (serial correlation)
Consequences:
- Classical OLS standard errors assume error independence
- If errors are correlated, OLS standard errors underestimate true uncertainty
- Leads to inflated $t$-statistics and more false positives (Type I errors)
Example:
- Individual $i$ experiences a positive shock in 2015 ($\varepsilon_{i,2015} > 0$)
- This shock may persist into 2016 ($\varepsilon_{i,2016} > 0$)
- Therefore $\operatorname{Corr}(\varepsilon_{i,2015}, \varepsilon_{i,2016}) > 0$
Principle of Clustered Standard Errors
Core Idea: Allow observations within the same cluster to be correlated, assuming independence between different clusters
Standard Practice for Panel Data: Cluster at individual level
- Allow all time observations for individual $i$ to be correlated
- Assume independence between different individuals
Python Implementation:
model = PanelOLS(y, X, entity_effects=True).fit(
cov_type='clustered',
cluster_entity=True # Cluster at entity (individual) level
)

Clustering Choices
| Clustering Level | When to Use | Python Implementation |
|---|---|---|
| Entity | Standard practice for panel data | cluster_entity=True |
| Time | Different individuals at same time may be correlated (rare) | cluster_time=True |
| Two-way clustering | Allow both entity and time clustering | cluster_entity=True, cluster_time=True |
| Custom clustering | E.g., cluster by state, industry | clusters=df['state'] |
Recommendations:
- Panel data → cluster_entity=True (most common)
- DID research → cluster at the treatment unit level (e.g., state, city); see the sketch below for two-way and custom clustering
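For the non-default choices in the table, here is a schematic sketch of the fit calls. The y, X placeholders and the state column follow the same illustrative style as the snippet above and are assumptions, not variables defined in this section:

```python
# Two-way clustering: allow correlation within entities and within time periods
model_2way = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(
    cov_type='clustered', cluster_entity=True, cluster_time=True)

# Custom clusters, e.g. by state when treatment varies at the state level;
# the clusters argument must be aligned with the panel's (entity, time) index.
model_state = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(
    cov_type='clustered', clusters=df_panel['state'])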
Clustered SE vs Robust SE
| Type | Allowed Error Patterns | When to Use |
|---|---|---|
| Classical OLS SE | Homoskedasticity + Independence | Almost never (assumptions too strong) |
| Robust SE | Heteroskedasticity + Independence | Cross-sectional data |
| Clustered SE | Heteroskedasticity + Within-cluster correlation | Panel data ⭐ |
Important Rule:
- Panel data must use clustered SE
- Not using clustered SE can severely understate standard errors (underestimation of 50% or more is possible)
Python Comparison Example
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
# Simulate data: strong serial correlation
np.random.seed(123)
data = []
for i in range(100):
shock = np.random.normal(0, 2) # Individual-specific persistent shock
for t in range(10):
x = 10 + np.random.normal(0, 1)
# Error term has persistent component (serial correlation)
epsilon = shock + np.random.normal(0, 0.5)
y = 5 + 2 * x + epsilon
data.append({'id': i, 'year': 2010 + t, 'y': y, 'x': x})
df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])
X = sm.add_constant(df_panel[['x']])  # include an intercept to match the DGP
# Three types of standard errors (identical point estimates, different SEs)
model_unadjusted = PanelOLS(df_panel['y'], X).fit(
    cov_type='unadjusted'  # Classical OLS SE
)
model_robust = PanelOLS(df_panel['y'], X).fit(
    cov_type='robust'  # Robust SE (heteroskedasticity only)
)
model_clustered = PanelOLS(df_panel['y'], X).fit(
    cov_type='clustered',
    cluster_entity=True  # Clustered SE (heteroskedasticity + serial correlation)
)
print("=" * 70)
print("Standard Error Comparison")
print("=" * 70)
print(f"Coefficient estimate: {model_clustered.params['x']:.4f} (same for all three methods)")
print(f"Classical SE: {model_unadjusted.std_errors['x']:.4f} (underestimate!)")
print(f"Robust SE: {model_robust.std_errors['x']:.4f} (still underestimate)")
print(f"Clustered SE: {model_clustered.std_errors['x']:.4f} (correct)")
print(f"\nClustered SE / Classical SE: {model_clustered.std_errors['x'] / model_unadjusted.std_errors['x']:.2f}x")Key Finding:
- Clustered SE is typically 1.5-3 times classical SE
- Without clustered SE, $t$-statistics are inflated, leading to incorrect rejections of the null hypothesis
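To see the serial correlation that drives this gap, you can inspect the within-individual autocorrelation of the residuals. A minimal sketch continuing from the comparison above (it assumes the fitted result exposes residuals via .resids, as recent linearmodels versions do):

```python
# First-order autocorrelation of residuals within each individual
resid = model_clustered.resids.reset_index()
resid.columns = ['id', 'year', 'resid']
resid = resid.sort_values(['id', 'year'])
resid['resid_lag'] = resid.groupby('id')['resid'].shift(1)

rho = resid[['resid', 'resid_lag']].dropna().corr().iloc[0, 1]
print(f"Within-individual residual autocorrelation: {rho:.2f}")
# A value far from zero is exactly the correlation that clustered SEs account for.
```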
Dynamic Panel Models
What is a Dynamic Panel?
Model:
$$y_{it} = \beta_1 y_{i,t-1} + \beta_2 x_{it} + \alpha_i + \varepsilon_{it}$$
Characteristic: The lagged dependent variable $y_{i,t-1}$ appears as a regressor
Application Scenarios:
- Persistence: Income, GDP, health status
- Adjustment Costs: Corporate investment, employment
- Habit Formation: Consumption, savings
Why Doesn't Regular FE Work?
Problem: $y_{i,t-1}$ is endogenous: it is correlated with the demeaned error term
Reason:
- $y_{i,t-1}$ depends on $\varepsilon_{i,t-1}$
- After the within transformation, $\tilde{y}_{i,t-1}$ depends on $\bar{\varepsilon}_i$ (which includes $\varepsilon_{it}$)
- This leads to $\operatorname{Cov}(\tilde{y}_{i,t-1}, \tilde{\varepsilon}_{it}) \neq 0$
Consequence: FE estimation is biased and inconsistent for fixed $T$ (even as $N \to \infty$); this is the Nickell bias
Arellano-Bond Estimator
Core Idea: Use instrumental variables (IV) + first differencing
Step 1: First Difference to Eliminate Fixed Effects
$$\Delta y_{it} = \beta_1 \Delta y_{i,t-1} + \beta_2 \Delta x_{it} + \Delta \varepsilon_{it}$$
Step 2: Use Earlier Lags $y_{i,t-2}, y_{i,t-3}, \ldots$ as Instruments for $\Delta y_{i,t-1}$
The instruments are:
- Correlated with $\Delta y_{i,t-1}$ (relevance condition)
- Uncorrelated with $\Delta \varepsilon_{it}$ (exogeneity condition)
Estimation Method: GMM (Generalized Method of Moments)
Python Implementation (Simplified Version)
from linearmodels.panel import PanelOLS
import pandas as pd
import numpy as np
# Simulate dynamic panel data
np.random.seed(42)
data = []
for i in range(100):
alpha_i = np.random.normal(0, 1)
y_lag = 5 # Initial value
for t in range(10):
x = 10 + np.random.normal(0, 2)
epsilon = np.random.normal(0, 1)
y = 0.5 * y_lag + 1.5 * x + alpha_i + epsilon # True parameters: beta1=0.5, beta2=1.5
data.append({'id': i, 'year': 2010 + t, 'y': y, 'x': x})
y_lag = y # Update lagged value
df = pd.DataFrame(data)
# Create lagged variable
df = df.sort_values(['id', 'year'])
df['y_lag'] = df.groupby('id')['y'].shift(1)
df = df.dropna()
df_panel = df.set_index(['id', 'year'])
# Wrong method: Regular FE (biased!)
model_fe_wrong = PanelOLS(df_panel['y'],
df_panel[['y_lag', 'x']],
entity_effects=True).fit()
print("=" * 70)
print("Dynamic Panel Model")
print("=" * 70)
print(f"True parameters: y_lag=0.5, x=1.5")
print(f"\nFE estimate (biased):")
print(f" y_lag: {model_fe_wrong.params['y_lag']:.4f}")
print(f" x: {model_fe_wrong.params['x']:.4f}")
print("\nNote: FE estimation is biased! Should use Arellano-Bond GMM")Note:
- Python's
linearmodelscurrently doesn't support Arellano-Bond - Need to use Stata's
xtabondor R'splmpackage - This is an advanced topic in dynamic panels, beyond this course's scope
Panel Data and DID
DID is Two-Way Fixed Effects + Interaction Term
Standard DID Model:
$$y_{it} = \alpha_i + \lambda_t + \delta\,(\text{treated}_i \times \text{post}_t) + \varepsilon_{it}$$
Equivalent to:
model_did = PanelOLS(y, treated_post,
entity_effects=True, # Control for α_i
                     time_effects=True).fit()  # Control for λ_t

Python Complete DID Example
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
# Simulate DID data
np.random.seed(2024)
data = []
# Treatment group: ID 1-50, receive treatment in 2018
# Control group: ID 51-100, don't receive treatment
for i in range(1, 101):
treated = 1 if i <= 50 else 0
alpha_i = np.random.normal(0, 1)
for t in range(2015, 2021):
year = t
post = 1 if year >= 2018 else 0
treated_post = treated * post
# DID effect = 10
y = 50 + 10 * treated_post + alpha_i + 0.5 * year + np.random.normal(0, 2)
data.append({
'id': i,
'year': year,
'y': y,
'treated': treated,
'post': post,
'treated_post': treated_post
})
df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])
# DID regression
model_did = PanelOLS(df_panel['y'],
df_panel[['treated_post']],
entity_effects=True,
time_effects=True).fit(cov_type='clustered',
cluster_entity=True)
print("=" * 70)
print("DID Estimation Results")
print("=" * 70)
print(model_did)
print(f"\nDID effect: {model_did.params['treated_post']:.2f} (true value: 10.00)")
# Event study plot
# Create year dummy variables
for year in range(2015, 2021):
df[f'treated_x_{year}'] = df['treated'] * (df['year'] == year)
# Use 2017 as baseline (last year before treatment)
event_vars = [f'treated_x_{y}' for y in [2015, 2016, 2018, 2019, 2020]]
df_panel_event = df.set_index(['id', 'year'])
model_event = PanelOLS(df_panel_event['y'],
df_panel_event[event_vars],
entity_effects=True,
time_effects=True).fit(cov_type='clustered',
cluster_entity=True)
# Extract coefficients
years = [2015, 2016, 2017, 2018, 2019, 2020]
coefs = [model_event.params[f'treated_x_{y}'] if y != 2017 else 0 for y in years]
se = [model_event.std_errors[f'treated_x_{y}'] if y != 2017 else 0 for y in years]
# Plot event study graph
plt.figure(figsize=(12, 6))
plt.errorbar(years, coefs, yerr=1.96*np.array(se), marker='o',
markersize=8, linewidth=2, capsize=5, color='darkblue')
plt.axhline(0, color='red', linestyle='--', linewidth=1)
plt.axvline(2017.5, color='green', linestyle='--', linewidth=1.5, alpha=0.7)
plt.text(2017.5, max(coefs) * 0.8, 'Policy Implementation', fontsize=12, color='green')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Treatment Effect', fontweight='bold', fontsize=12)
plt.title('Event Study Plot', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Interpretation:
- 2015-2017: Coefficients close to 0 (parallel trends hold)
- 2018-2020: Coefficients significantly positive (treatment effect)
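A common robustness check is a placebo test: pretend the policy started earlier and re-estimate the DID on pre-treatment data only; a significant "effect" would undermine the parallel trends assumption. A minimal sketch continuing from the DID example above (the 2016 placebo date is an arbitrary illustrative choice):

```python
# Placebo DID: fake treatment date of 2016, using only pre-treatment years 2015-2017
df_pre = df[df['year'] <= 2017].copy()
df_pre['placebo_post'] = (df_pre['year'] >= 2016).astype(int)
df_pre['treated_placebo'] = df_pre['treated'] * df_pre['placebo_post']
df_pre_panel = df_pre.set_index(['id', 'year'])

model_placebo = PanelOLS(df_pre_panel['y'],
                         df_pre_panel[['treated_placebo']],
                         entity_effects=True,
                         time_effects=True).fit(cov_type='clustered',
                                                cluster_entity=True)

print(f"Placebo effect: {model_placebo.params['treated_placebo']:.2f}, "
      f"p-value: {model_placebo.pvalues['treated_placebo']:.2f} (should be near zero and insignificant)")
```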
Handling Unbalanced Panels
Types of Unbalanced Panels
Attrition: Individuals exit sample
- Example: Company bankruptcy, individual exits survey
Entry and Exit: New individuals join sample
- Example: New company IPO, new hospital established
Random Missing: Data missing at certain time points
- Example: Survey incomplete, data entry error
Problems with Unbalanced Panels
Problem 1: Selection Bias
- If exit is related to outcome variable, estimates are biased
- Example: Companies with poor performance more likely to delist
Problem 2: Efficiency Loss
- Missing data reduces sample size
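Before choosing among the handling methods below, it helps to quantify how unbalanced the panel actually is. A minimal diagnostic sketch, assuming a long-format DataFrame df with 'id' and 'year' columns:

```python
# How many periods is each entity observed in?
obs_per_id = df.groupby('id')['year'].nunique()
T_max = df['year'].nunique()

print(f"Fully observed entities : {(obs_per_id == T_max).sum()}")
print(f"Entities with gaps      : {(obs_per_id < T_max).sum()}")
print(obs_per_id.value_counts().sort_index())  # distribution of observed periods per entity
```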
Handling Methods
Method 1: Keep Unbalanced (Recommended) ⭐
linearmodels automatically handles unbalanced panels:
# No special operations needed, linearmodels will handle automatically
model = PanelOLS(y, X, entity_effects=True).fit()

Advantages:
- Retain all available information
- Avoid arbitrarily deleting data
Prerequisites:
- Missing is random (Missing at Random, MAR)
- Or missingness is correlated with the independent variables, but not with the error term
Method 2: Use Balanced Subsample
Construct balanced panel:
# Only keep individuals observed in all time periods
T = df['year'].nunique()  # total number of time periods in the panel
counts = df.groupby('id')['year'].count()
complete_ids = counts[counts == T].index
df_balanced = df[df['id'].isin(complete_ids)]

Advantages:
- Avoid selection bias (if concerned about non-random attrition)
Disadvantages:
- Loss of substantial data
- Low efficiency
Method 3: Sample Selection Models (Heckman)
Applicable to: Non-random attrition (e.g., company bankruptcy)
Method:
- First stage: Estimate attrition probability (Probit)
- Second stage: Add Inverse Mills Ratio as control variable
Beyond this course's scope, refer to Wooldridge (2010) Chapter 19
Section Summary
Key Points
Two-way fixed effects:
- Control for individual + time effects
- Standard practice for DID
- Eliminate common time trends
Clustered standard errors:
- Essential tool for panel data
- Cluster at individual level (standard practice)
- Avoid underestimating standard errors
Dynamic panels:
- Include lagged dependent variable
- Regular FE is biased
- Need Arellano-Bond GMM
Panel data + DID:
- DID = Two-way FE + interaction term
- Event study plot tests parallel trends
- Cluster at treatment unit level
Unbalanced panels:
- linearmodels handles automatically
- Prioritize keeping unbalanced (if MAR)
- Use balanced subsample when concerned about selection bias
Practical Recommendations
Standard Panel Regression Checklist:
- ✓ Use two-way FE (if time trends exist)
- ✓ Use clustered standard errors (cluster_entity=True)
- ✓ Check if within variation is sufficient
- ✓ Conduct Hausman test (FE vs RE)
- ✓ Report N, T, and total observations (a minimal reporting sketch follows this list)
- ✓ Check for bad controls (mediators)
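For the reporting item above, a minimal sketch, reusing df_panel and model_did from the DID example; rsquared_within and nobs are attribute names from linearmodels' results API and worth double-checking against your installed version:

```python
# Report N, T, total observations, and the within R-squared
res = model_did  # any fitted PanelOLS result
N = df_panel.index.get_level_values('id').nunique()    # number of entities
T = df_panel.index.get_level_values('year').nunique()  # number of time periods
print(f"N = {N}, T = {T}, total observations = {res.nobs}")
print(f"Within R-squared: {res.rsquared_within:.3f}")
```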
DID Research Checklist:
- ✓ Use two-way FE
- ✓ Cluster at treatment unit level
- ✓ Plot event study graph
- ✓ Test parallel trends
- ✓ Conduct placebo tests
Next Steps
In Section 8.6: Summary and Review, we will:
- Summarize panel data methods decision tree
- Provide 10 practice problems
- Recommend classic literature
Master these advanced techniques and become a panel data expert!