
8.5 Advanced Panel Data Topics

Mastering Frontier Techniques: Two-Way Fixed Effects, Clustered Standard Errors, Dynamic Panels, and DID

Difficulty: Frontier


Section Objectives

  • Deeply understand the identification logic of two-way fixed effects (Two-Way FE)
  • Correctly use clustered standard errors (Clustered SE)
  • Gain a preliminary understanding of dynamic panel models (Arellano-Bond)
  • Apply panel methods to DID research
  • Handle unbalanced panels
  • Navigate common pitfalls in panel data

Two-Way Fixed Effects

Model Definition

One-way fixed effects:

y_it = β x_it + α_i + ε_it

  • Controls: Individual heterogeneity (α_i)

Two-way fixed effects:

y_it = β x_it + α_i + λ_t + ε_it

  • Controls: Individual heterogeneity (α_i) + Time effects (λ_t)

Meaning of Time Effects (λ_t):

  • Time-specific shocks affecting all individuals
  • Examples: Macroeconomic cycles, policy changes, technological progress, natural disasters

Why Do We Need Two-Way FE?

Problem: If x_it and y_it both grow over time, the observed correlation might be due to common time factors

Scenario 1: Studying the effect of advertising expenditure on sales

  • Advertising expenditure increases annually (technological progress, media cost reduction)
  • Sales also increase annually (economic growth, rising consumer income)
  • Without controlling for time trends, we might incorrectly attribute this common growth to the advertising effect

Solution: Two-way FE

python
model_twoway = PanelOLS(sales, ads,
                        entity_effects=True,  # Control for company fixed effects
                        time_effects=True).fit()  # Control for year fixed effects

Scenario 2: Standard Practice in DID Research

DID Model:

y_it = β (Treated_i × Post_t) + α_i + λ_t + ε_it

  • α_i: Controls for fixed differences between treatment and control groups
  • λ_t: Controls for common time trends (embodiment of the parallel trends assumption)

Python Implementation:

python
# Standard DID implementation is two-way FE + interaction term
model_did = PanelOLS(y, treated_post,
                     entity_effects=True,
                     time_effects=True).fit()

Identification Logic of Two-Way FE

Demeaning Transformation (twice):

  1. Individual demeaning: y_it − ȳ_i (subtract each individual's time-series mean)
  2. Time demeaning: subtract each period's cross-sectional mean from the result

Final Estimation: regress the twice-demeaned ỹ_it = y_it − ȳ_i − ȳ_t + ȳ on the identically transformed x̃_it (this closed form holds exactly in a balanced panel).

Intuition:

  • First step: Eliminate individual heterogeneity (between-group differences)
  • Second step: Eliminate time trends (common macro shocks)
  • Remaining variation: Individual-specific time variation
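The two demeaning steps above can be checked directly with pandas. This is a toy sketch on simulated data (all variable names are illustrative), showing that after both steps the transformed variable has mean zero within every individual and every period:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Small balanced panel: 4 individuals, 3 periods
df = pd.DataFrame({
    'id': np.repeat(np.arange(4), 3),
    't': np.tile(np.arange(3), 4),
})
df['y'] = rng.normal(size=len(df))

# Step 1: subtract each individual's mean over time (removes alpha_i)
df['y_d1'] = df['y'] - df.groupby('id')['y'].transform('mean')
# Step 2: subtract each period's cross-sectional mean of the result (removes lambda_t)
df['y_dd'] = df['y_d1'] - df.groupby('t')['y_d1'].transform('mean')

# In a balanced panel, both group means are now (numerically) zero
print(df.groupby('id')['y_dd'].mean().round(10))
print(df.groupby('t')['y_dd'].mean().round(10))
```

This is exactly the variation PanelOLS uses when both `entity_effects` and `time_effects` are switched on.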

Python Complete Example

python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
sns.set_style("whitegrid")

# Simulate data: add time trend
np.random.seed(42)

N = 100  # 100 companies
T = 10   # 10 years

data = []
for i in range(N):
    alpha_i = np.random.normal(0, 1)  # Company fixed effect

    for t in range(T):
        year = 2010 + t
        lambda_t = 0.05 * t  # Time fixed effect (common growth trend)

        x = 10 + 0.3 * t + np.random.normal(0, 2)
        y = 5 + 2 * x + alpha_i + lambda_t + np.random.normal(0, 1)

        data.append({'company': i, 'year': year, 'y': y, 'x': x})

df = pd.DataFrame(data)
df_panel = df.set_index(['company', 'year'])

# Model 1: Pooled OLS (biased)
import statsmodels.api as sm
X_pooled = sm.add_constant(df[['x']])
model_pooled = sm.OLS(df['y'], X_pooled).fit()

# Model 2: One-way FE (control for companies)
model_oneway = PanelOLS(df_panel['y'], df_panel[['x']],
                        entity_effects=True).fit(cov_type='clustered',
                                                 cluster_entity=True)

# Model 3: Two-way FE (control for companies + year)
model_twoway = PanelOLS(df_panel['y'], df_panel[['x']],
                        entity_effects=True,
                        time_effects=True).fit(cov_type='clustered',
                                               cluster_entity=True)

print("=" * 70)
print("One-Way FE vs Two-Way FE")
print("=" * 70)
print(f"True parameter:    2.0000")
print(f"Pooled OLS:        {model_pooled.params['x']:.4f}")
print(f"One-way FE:        {model_oneway.params['x']:.4f}")
print(f"Two-way FE:        {model_twoway.params['x']:.4f} (closest to true value)")

# Visualize time effects
# estimated_effects returns alpha_i + lambda_t for each observation;
# averaging within each year recovers lambda_t up to a constant (balanced panel)
time_effects = model_twoway.estimated_effects.groupby(level='year').mean().iloc[:, 0]
print("\n" + "=" * 70)
print("Estimated Time Fixed Effects")
print("=" * 70)
print(time_effects.head(10))

# Plot time effects
plt.figure(figsize=(12, 6))
plt.plot(time_effects.index, time_effects.values, 'o-',
         linewidth=2, markersize=8, color='darkblue')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Time Fixed Effects', fontweight='bold', fontsize=12)
plt.title('Estimated Time Fixed Effects (Common Time Trends)', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Output Interpretation:

  • One-way FE: If time trends exist but not controlled, estimates may be biased
  • Two-way FE: Controls for both individual and time effects, more accurate estimation

Clustered Standard Errors

Why Do We Need Clustered Standard Errors?

Problem: In panel data, errors at different times for the same individual are typically correlated (serial correlation)

Consequences:

  • Classical OLS standard errors assume error independence
  • If errors are correlated, OLS standard errors underestimate true uncertainty
  • Leads to inflated t-statistics and more false positives (Type I error)

Example:

  • Individual i experiences a positive shock in 2015 (ε_{i,2015} > 0)
  • This shock may persist into 2016 (ε_{i,2016} > 0)
  • Therefore Corr(ε_{i,2015}, ε_{i,2016}) ≠ 0
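This persistence is easy to diagnose: compute the within-individual autocorrelation of the error term. A sketch on simulated errors (the variances and names are illustrative, chosen to mimic a strong persistent component):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
N, T = 200, 10
ids = np.repeat(np.arange(N), T)
shock = np.repeat(rng.normal(0, 2, N), T)   # persistent individual-level component
eps = shock + rng.normal(0, 0.5, N * T)     # total error term

df = pd.DataFrame({'id': ids, 'eps': eps})
df['eps_lag'] = df.groupby('id')['eps'].shift(1)
rho = df[['eps', 'eps_lag']].dropna().corr().iloc[0, 1]
print(f"Within-individual serial correlation of the error: {rho:.2f}")
```

A correlation this far from zero is precisely the situation in which classical standard errors break down.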

Principle of Clustered Standard Errors

Core Idea: Allow observations within the same cluster to be correlated, assuming independence between different clusters

Standard Practice for Panel Data: Cluster at individual level

  • Allow all time observations for individual i to be correlated
  • Assume independence between different individuals

Python Implementation:

python
model = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True  # Cluster at entity (individual) level
)
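Under the hood this is the cluster-robust "sandwich" estimator: V = (X'X)⁻¹ [Σ_g X_g' u_g u_g' X_g] (X'X)⁻¹, where g indexes clusters. A minimal numpy sketch (all data simulated; it ignores the small-sample corrections that linearmodels applies, so numbers will differ slightly from library output):

```python
import numpy as np

rng = np.random.default_rng(1)
G, T = 50, 8                                   # 50 clusters, 8 observations each
cluster = np.repeat(np.arange(G), T)
# Both x and the error have a cluster-level component -> within-cluster correlation
x = np.repeat(rng.normal(size=G), T) + 0.5 * rng.normal(size=G * T)
u_shock = np.repeat(rng.normal(size=G), T)
y = 2.0 * x + u_shock + rng.normal(size=G * T)

X = np.column_stack([np.ones(G * T), x])       # constant + regressor
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

bread = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for g in range(G):
    Xg, ug = X[cluster == g], resid[cluster == g]
    score = Xg.T @ ug                          # sum of scores within cluster g
    meat += np.outer(score, score)
V_clustered = bread @ meat @ bread
V_classical = bread * (resid @ resid / (G * T - 2))

print(f"classical SE(x): {np.sqrt(V_classical[1, 1]):.4f}")
print(f"clustered SE(x): {np.sqrt(V_clustered[1, 1]):.4f}")
```

Because both x and the error are correlated within clusters, the clustered SE comes out noticeably larger than the classical one, which is the point of the comparison below.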

Clustering Choices

| Clustering Level | When to Use | Python Implementation |
|---|---|---|
| Entity | Standard practice for panel data | `cluster_entity=True` |
| Time | Different individuals at the same time may be correlated (rare) | `cluster_time=True` |
| Two-way clustering | Allow both entity and time clustering | `cluster_entity=True, cluster_time=True` |
| Custom clustering | E.g., cluster by state or industry | `clusters=df['state']` |

Recommendations:

  • Panel data → cluster_entity=True (most common)
  • DID research → Cluster at treatment unit level (e.g., state, city)

Clustered SE vs Robust SE

| Type | Allowed Error Patterns | When to Use |
|---|---|---|
| Classical OLS SE | Homoskedasticity + Independence | Almost never (assumptions too strong) |
| Robust SE | Heteroskedasticity + Independence | Cross-sectional data |
| Clustered SE | Heteroskedasticity + Within-cluster correlation | Panel data ⭐ |

Important Rule:

  • Panel data should virtually always use clustered SE
  • Omitting clustering can severely underestimate standard errors (often by 50% or more)

Python Comparison Example

python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

# Simulate data: strong serial correlation
np.random.seed(123)
data = []
for i in range(100):
    shock = np.random.normal(0, 2)  # Individual-specific persistent shock
    for t in range(10):
        x = 10 + np.random.normal(0, 1)
        # Error term has persistent component (serial correlation)
        epsilon = shock + np.random.normal(0, 0.5)
        y = 5 + 2 * x + epsilon
        data.append({'id': i, 'year': 2010 + t, 'y': y, 'x': x})

df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

# Three types of standard errors (include a constant so the intercept is estimated)
import statsmodels.api as sm
X = sm.add_constant(df_panel[['x']])

model_unadjusted = PanelOLS(df_panel['y'], X).fit(
    cov_type='unadjusted'  # Classical OLS SE
)

model_robust = PanelOLS(df_panel['y'], X).fit(
    cov_type='robust'  # Robust SE (heteroskedasticity only)
)

model_clustered = PanelOLS(df_panel['y'], X).fit(
    cov_type='clustered',
    cluster_entity=True  # Clustered SE (heteroskedasticity + serial correlation)
)

print("=" * 70)
print("Standard Error Comparison")
print("=" * 70)
print(f"Coefficient estimate:      {model_clustered.params['x']:.4f} (same for all three methods)")
print(f"Classical SE:              {model_unadjusted.std_errors['x']:.4f} (underestimate!)")
print(f"Robust SE:                 {model_robust.std_errors['x']:.4f} (still underestimate)")
print(f"Clustered SE:              {model_clustered.std_errors['x']:.4f} (correct)")
print(f"\nClustered SE / Classical SE: {model_clustered.std_errors['x'] / model_unadjusted.std_errors['x']:.2f}x")

Key Finding:

  • Clustered SE is typically 1.5-3 times classical SE
  • Without clustered SE, t-statistics are inflated, leading to incorrect rejection of the null hypothesis

Dynamic Panel Models

What is a Dynamic Panel?

Model:

y_it = β₁ y_{i,t−1} + β₂ x_it + α_i + ε_it

Characteristic: The lagged dependent variable appears as a regressor

Application Scenarios:

  • Persistence: Income, GDP, health status
  • Adjustment Costs: Corporate investment, employment
  • Habit Formation: Consumption, savings

Why Doesn't Regular FE Work?

Problem: y_{i,t−1} is endogenous with respect to the demeaned error term

Reason:

  • y_{i,t−1} depends on ε_{i,t−1}
  • After the within transformation, the demeaned regressor y_{i,t−1} − ȳ_{i,−1} depends on ε̄_i (which includes ε_{i,t−1})
  • Leads to Corr(demeaned regressor, demeaned error) ≠ 0

Consequence: FE estimation is biased and inconsistent for fixed T (the Nickell bias; it does not vanish even as N → ∞)


Arellano-Bond Estimator

Core Idea: Use instrumental variables (IV) + first difference

Step 1: First Difference to Eliminate Fixed Effects

Δy_it = β₁ Δy_{i,t−1} + β₂ Δx_it + Δε_it

Step 2: Use Earlier Lags of y as Instrumental Variables

Instrumental variables: y_{i,t−2}, y_{i,t−3}, …

  • Correlated with Δy_{i,t−1} (relevance condition)
  • Uncorrelated with Δε_it (exogeneity condition, assuming ε_it is not serially correlated)

Estimation Method: GMM (Generalized Method of Moments)


Python Implementation (Simplified Version)

python
from linearmodels.panel import PanelOLS
import pandas as pd
import numpy as np

# Simulate dynamic panel data
np.random.seed(42)
data = []
for i in range(100):
    alpha_i = np.random.normal(0, 1)
    y_lag = 5  # Initial value

    for t in range(10):
        x = 10 + np.random.normal(0, 2)
        epsilon = np.random.normal(0, 1)
        y = 0.5 * y_lag + 1.5 * x + alpha_i + epsilon  # True parameters: beta1=0.5, beta2=1.5

        data.append({'id': i, 'year': 2010 + t, 'y': y, 'x': x})
        y_lag = y  # Update lagged value

df = pd.DataFrame(data)

# Create lagged variable
df = df.sort_values(['id', 'year'])
df['y_lag'] = df.groupby('id')['y'].shift(1)
df = df.dropna()

df_panel = df.set_index(['id', 'year'])

# Wrong method: Regular FE (biased!)
model_fe_wrong = PanelOLS(df_panel['y'],
                          df_panel[['y_lag', 'x']],
                          entity_effects=True).fit()

print("=" * 70)
print("Dynamic Panel Model")
print("=" * 70)
print(f"True parameters: y_lag=0.5, x=1.5")
print(f"\nFE estimate (biased):")
print(f"  y_lag: {model_fe_wrong.params['y_lag']:.4f}")
print(f"  x:     {model_fe_wrong.params['x']:.4f}")
print("\nNote: FE estimation is biased! Should use Arellano-Bond GMM")

Note:

  • Python's linearmodels currently doesn't support Arellano-Bond
  • Need to use Stata's xtabond or R's plm package
  • This is an advanced topic in dynamic panels, beyond this course's scope
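That said, the core idea — first-difference, then instrument with a deeper lag — can be illustrated without extra packages. Below is a hand-rolled sketch of the just-identified Anderson-Hsiao estimator (a precursor to Arellano-Bond) on the same data-generating process as above; with one endogenous regressor and one instrument, 2SLS reduces to β = (Z'X)⁻¹ Z'y:

```python
import numpy as np
import pandas as pd

# Same DGP as above: y = 0.5*y_lag + 1.5*x + alpha_i + epsilon
np.random.seed(42)
rows = []
for i in range(200):
    alpha_i = np.random.normal(0, 1)
    y_lag = 5.0
    for t in range(12):
        x = 10 + np.random.normal(0, 2)
        y = 0.5 * y_lag + 1.5 * x + alpha_i + np.random.normal(0, 1)
        rows.append({'id': i, 't': t, 'y': y, 'x': x})
        y_lag = y

df = pd.DataFrame(rows).sort_values(['id', 't'])
df['dy'] = df.groupby('id')['y'].diff()        # Delta y_it
df['dx'] = df.groupby('id')['x'].diff()        # Delta x_it
df['dy_lag'] = df.groupby('id')['dy'].shift(1) # Delta y_{i,t-1} (endogenous)
df['y_lag2'] = df.groupby('id')['y'].shift(2)  # instrument: y_{i,t-2}
df = df.dropna()

# Just-identified 2SLS: instrument dy_lag with y_lag2, keep dx as its own instrument
Z = df[['y_lag2', 'dx']].to_numpy()
X = df[['dy_lag', 'dx']].to_numpy()
yv = df['dy'].to_numpy()
beta = np.linalg.solve(Z.T @ X, Z.T @ yv)
print(f"y_lag: {beta[0]:.3f} (true 0.5)")
print(f"x:     {beta[1]:.3f} (true 1.5)")
```

Unlike the FE estimates above, this IV estimator is consistent for fixed T; full Arellano-Bond GMM additionally exploits deeper lags (y_{i,t−3}, …) for efficiency.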

Panel Data and DID

DID is Two-Way Fixed Effects + Interaction Term

Standard DID Model:

y_it = β (Treated_i × Post_t) + α_i + λ_t + ε_it

Equivalent to:

python
model_did = PanelOLS(y, treated_post,
                     entity_effects=True,   # Control for α_i
                     time_effects=True).fit()  # Control for λ_t

Python Complete DID Example

python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

# Simulate DID data
np.random.seed(2024)

data = []
# Treatment group: ID 1-50, receive treatment in 2018
# Control group: ID 51-100, don't receive treatment

for i in range(1, 101):
    treated = 1 if i <= 50 else 0
    alpha_i = np.random.normal(0, 1)

    for t in range(2015, 2021):
        year = t
        post = 1 if year >= 2018 else 0
        treated_post = treated * post

        # DID effect = 10
        y = 50 + 10 * treated_post + alpha_i + 0.5 * year + np.random.normal(0, 2)

        data.append({
            'id': i,
            'year': year,
            'y': y,
            'treated': treated,
            'post': post,
            'treated_post': treated_post
        })

df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

# DID regression
model_did = PanelOLS(df_panel['y'],
                     df_panel[['treated_post']],
                     entity_effects=True,
                     time_effects=True).fit(cov_type='clustered',
                                            cluster_entity=True)

print("=" * 70)
print("DID Estimation Results")
print("=" * 70)
print(model_did)
print(f"\nDID effect: {model_did.params['treated_post']:.2f} (true value: 10.00)")

# Event study plot
# Create year dummy variables
for year in range(2015, 2021):
    df[f'treated_x_{year}'] = df['treated'] * (df['year'] == year)

# Use 2017 as baseline (last year before treatment)
event_vars = [f'treated_x_{y}' for y in [2015, 2016, 2018, 2019, 2020]]
df_panel_event = df.set_index(['id', 'year'])

model_event = PanelOLS(df_panel_event['y'],
                       df_panel_event[event_vars],
                       entity_effects=True,
                       time_effects=True).fit(cov_type='clustered',
                                              cluster_entity=True)

# Extract coefficients
years = [2015, 2016, 2017, 2018, 2019, 2020]
coefs = [model_event.params[f'treated_x_{y}'] if y != 2017 else 0 for y in years]
se = [model_event.std_errors[f'treated_x_{y}'] if y != 2017 else 0 for y in years]

# Plot event study graph
plt.figure(figsize=(12, 6))
plt.errorbar(years, coefs, yerr=1.96*np.array(se), marker='o',
             markersize=8, linewidth=2, capsize=5, color='darkblue')
plt.axhline(0, color='red', linestyle='--', linewidth=1)
plt.axvline(2017.5, color='green', linestyle='--', linewidth=1.5, alpha=0.7)
plt.text(2017.5, max(coefs) * 0.8, 'Policy Implementation', fontsize=12, color='green')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Treatment Effect', fontweight='bold', fontsize=12)
plt.title('Event Study Plot', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Interpretation:

  • 2015-2017: Coefficients close to 0 (parallel trends hold)
  • 2018-2020: Coefficients significantly positive (treatment effect)

Handling Unbalanced Panels

Types of Unbalanced Panels

  1. Attrition: Individuals exit sample

    • Example: Company bankruptcy, individual exits survey
  2. Entry and Exit: New individuals join sample

    • Example: New company IPO, new hospital established
  3. Random Missing: Data missing at certain time points

    • Example: Survey incomplete, data entry error

Problems with Unbalanced Panels

Problem 1: Selection Bias

  • If exit is related to outcome variable, estimates are biased
  • Example: Companies with poor performance more likely to delist

Problem 2: Efficiency Loss

  • Missing data reduces sample size

Handling Methods

Method 1: Keep the Unbalanced Panel (Default)

linearmodels automatically handles unbalanced panels:

python
# No special operations needed, linearmodels will handle automatically
model = PanelOLS(y, X, entity_effects=True).fit()

Advantages:

  • Retain all available information
  • Avoid arbitrarily deleting data

Prerequisites:

  • Missingness is random (Missing at Random, MAR)
  • Or missingness is correlated with the independent variables, but not with the error term

Method 2: Use Balanced Subsample

Construct balanced panel:

python
# Only keep individuals observed in all time periods
T = df['year'].nunique()
complete_ids = df.groupby('id')['year'].count()
complete_ids = complete_ids[complete_ids == T].index
df_balanced = df[df['id'].isin(complete_ids)]

Advantages:

  • Avoid selection bias (if concerned about non-random attrition)

Disadvantages:

  • Loss of substantial data
  • Low efficiency

Method 3: Sample Selection Models (Heckman)

Applicable to: Non-random attrition (e.g., company bankruptcy)

Method:

  1. First stage: Estimate attrition probability (Probit)
  2. Second stage: Add Inverse Mills Ratio as control variable

Beyond this course's scope, refer to Wooldridge (2010) Chapter 19


Section Summary

Key Points

  1. Two-way fixed effects:

    • Control for individual + time effects
    • Standard practice for DID
    • Eliminate common time trends
  2. Clustered standard errors:

    • Essential tool for panel data
    • Cluster at individual level (standard practice)
    • Avoid underestimating standard errors
  3. Dynamic panels:

    • Include lagged dependent variable
    • Regular FE is biased
    • Need Arellano-Bond GMM
  4. Panel data + DID:

    • DID = Two-way FE + interaction term
    • Event study plot tests parallel trends
    • Cluster at treatment unit level
  5. Unbalanced panels:

    • linearmodels handles automatically
    • Prioritize keeping unbalanced (if MAR)
    • Use balanced subsample when concerned about selection bias

Practical Recommendations

Standard Panel Regression Checklist:

  • ✓ Use two-way FE (if time trends exist)
  • ✓ Use clustered standard errors (cluster_entity=True)
  • ✓ Check if within variation is sufficient
  • ✓ Conduct Hausman test (FE vs RE)
  • ✓ Report N, T, and the total number of observations
  • ✓ Check for bad controls (mediators)

DID Research Checklist:

  • ✓ Use two-way FE
  • ✓ Cluster at treatment unit level
  • ✓ Plot event study graph
  • ✓ Test parallel trends
  • ✓ Conduct placebo tests

Next Steps

In Section 6: Summary and Review, we will:

  • Summarize panel data methods decision tree
  • Provide 10 practice problems
  • Recommend classic literature

Master advanced techniques, become a panel data expert!

Released under the MIT License. Content © Author.