8.6 Summary and Review

Comprehensive Integration: Systematic Summary and Practical Training in Panel Data Methods



Core Content Review

Three Major Advantages of Panel Data

  1. Control for Unobserved Heterogeneity ⭐⭐⭐

    • Differencing eliminates time-invariant individual characteristics (ability, personality, family background, etc.)
    • Solves omitted variable bias (OVB)
  2. More Variation and Observations

    • Sample size: N × T observations (more than a single cross-section or time series)
    • Utilize both between and within variation
  3. Dynamic Analysis and Causal Identification

    • Track individual changes over time
    • Foundation for DID, event study, and other methods

Panel Regression Methods Decision Tree

Start: Have panel data (N × T)

Need to estimate time-invariant variables?
  ↓ Yes
  Consider Random Effects (RE)

    Conduct Hausman test
    ↓ p < 0.05 → use Fixed Effects (FE)
    ↓ p ≥ 0.05 → use Random Effects (RE)

  ↓ No (don't need to estimate time-invariant variables)
  Prioritize Fixed Effects (FE)

Common time trends exist?
  ↓ Yes
  Two-way Fixed Effects (Two-Way FE)
  ↓ No
  One-way Fixed Effects (One-Way FE)

Use clustered standard errors (cluster_entity=True)

Check if within variation is sufficient
  ↓ Small within variation
  Consider RE or add control variables
  ↓ Sufficient within variation
  Complete!
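
The FE-vs-RE comparison at the heart of this tree takes only a few lines with linearmodels. This is a minimal sketch, assuming a panel DataFrame df_panel indexed by (id, year), an outcome column 'y', and a list of regressor names x_cols (all hypothetical names):

python
# Sketch of the FE vs RE decision step ('df_panel', 'y', 'x_cols' are hypothetical)
import statsmodels.api as sm
from linearmodels.panel import PanelOLS, RandomEffects, compare

y = df_panel['y']
X = sm.add_constant(df_panel[x_cols])

# FE with clustered standard errors (the default recommendation in the tree)
fe = PanelOLS(y, X, entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)
# RE as the alternative when time-invariant variables must be estimated
re = RandomEffects(y, X).fit()

# Side-by-side coefficients and standard errors;
# large FE-RE differences (a Hausman rejection) point toward FE
print(compare({'FE': fe, 'RE': re}))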

Comparison of Three Panel Methods

| Method | Model | Assumptions | Advantages | Disadvantages | Python Implementation |
|---|---|---|---|---|---|
| Pooled OLS | y_it = β0 + β1·x_it + u_it | No individual heterogeneity | Simple | Omitted variable bias | sm.OLS(y, X) |
| Fixed Effects (FE) | y_it = β1·x_it + α_i + ε_it | α_i may be correlated with x_it | Consistent (even when α_i is endogenous) | Cannot estimate time-invariant variables | PanelOLS(..., entity_effects=True) |
| Random Effects (RE) | y_it = β0 + β1·x_it + α_i + ε_it | Requires Cov(α_i, x_it) = 0 | Efficient, can estimate time-invariant variables | Inconsistent if α_i is correlated with x_it | RandomEffects(...) |

Core Estimation Methods

Three FE Estimations

  1. Within Transformation: Demean each variable within individuals (y_it − ȳ_i, x_it − x̄_i); see the numerical check below

  2. LSDV (Least Squares Dummy Variables): Add a dummy variable for each individual

  3. First Difference: Difference adjacent periods (Δy_it = y_it − y_{i,t−1})
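
The equivalence of the first two estimators is easy to verify numerically. Below is a small self-contained check (not from the text) on simulated data with hypothetical column names 'id', 'x', 'y': the within (demeaned) regression and the LSDV regression return the same slope.

python
# Numerical check: within transformation and LSDV give identical slopes
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_id, n_t = 50, 5
df = pd.DataFrame({'id': np.repeat(np.arange(n_id), n_t),
                   'x': rng.normal(size=n_id * n_t)})
alpha = np.repeat(rng.normal(size=n_id), n_t)            # individual effects
df['y'] = 1.5 * df['x'] + alpha + rng.normal(size=n_id * n_t)

# 1. Within transformation: demean y and x within each individual
yd = df['y'] - df.groupby('id')['y'].transform('mean')
xd = df['x'] - df.groupby('id')['x'].transform('mean')
beta_within = sm.OLS(yd, xd).fit().params.iloc[0]

# 2. LSDV: regress y on x plus a full set of individual dummies (no constant)
X_lsdv = pd.concat([df['x'], pd.get_dummies(df['id'], dtype=float)], axis=1)
beta_lsdv = sm.OLS(df['y'], X_lsdv).fit().params['x']

print(f"Within: {beta_within:.4f}, LSDV: {beta_lsdv:.4f}")  # identical up to rounding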

RE's GLS Estimation

Quasi-demeaning transformation: y_it − θ·ȳ_i (and similarly for x), where

θ = 1 − √( σ²_ε / (σ²_ε + T·σ²_α) )

θ = 0 reduces to pooled OLS; θ → 1 approaches the within (FE) estimator.
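
The transformation itself is just a partial demeaning. The sketch below illustrates it with assumed variance components (the values of σ_α, σ_ε, and T are made up for illustration, not estimated from any dataset); in practice the components are estimated, e.g. internally by linearmodels' RandomEffects.

python
# Quasi-demeaning illustration with assumed (not estimated) variance components
import numpy as np

sigma_alpha, sigma_eps, T = 1.0, 0.5, 10          # assumed values
theta = 1 - np.sqrt(sigma_eps**2 / (sigma_eps**2 + T * sigma_alpha**2))
print(f"theta = {theta:.3f}")  # theta = 0 -> pooled OLS, theta -> 1 -> FE

# Applied to a panel DataFrame df indexed by (id, year) with columns 'y' and 'x':
# y_star = df['y'] - theta * df.groupby(level='id')['y'].transform('mean')
# x_star = df['x'] - theta * df.groupby(level='id')['x'].transform('mean')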

Practical Checklist

Standard Panel Regression Workflow

  • [ ] Step 1: Check data structure

    • Confirm balanced or unbalanced panel
    • Calculate N, T, and the total number of observations
  • [ ] Step 2: Exploratory analysis

    • Plot spaghetti plot (time trends)
    • Calculate within/between variation proportions (see the sketch after this checklist)
  • [ ] Step 3: Choose model

    • Default to FE (the standard choice in applied economics)
    • If need to estimate time-invariant variables, consider RE
    • Conduct Hausman test
  • [ ] Step 4: Decide whether to use two-way FE

    • Check for common time trends
    • DID research must use two-way FE
  • [ ] Step 5: Choose correct standard errors

    • Panel data must use clustered standard errors
    • Cluster at individual level (cluster_entity=True)
  • [ ] Step 6: Diagnostics and robustness checks

    • Check if within variation is sufficient
    • Compare FE and RE results
    • Check for bad control problems
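
A sketch of Step 2 is shown below, assuming a long-format DataFrame df with columns 'id', 'year', and an outcome column 'y' (hypothetical names):

python
# Step 2 sketch: spaghetti plot and within/between variation shares
import matplotlib.pyplot as plt

# Spaghetti plot: one line per individual
fig, ax = plt.subplots(figsize=(10, 5))
for _, g in df.groupby('id'):
    ax.plot(g['year'], g['y'], alpha=0.3, linewidth=0.8)
ax.set_xlabel('Year')
ax.set_ylabel('y')
ax.set_title('Spaghetti Plot: Individual Trajectories Over Time')
plt.tight_layout()
plt.show()

# Within/between variation decomposition
total_var = df['y'].var()
between_var = df.groupby('id')['y'].mean().var()
within_var = df.groupby('id')['y'].apply(lambda s: (s - s.mean()).var()).mean()
print(f"Between share: {between_var / total_var:.1%}, "
      f"within share: {within_var / total_var:.1%}")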

Common Pitfalls and Solutions

Pitfall 1: Forgetting to Use Clustered Standard Errors

Problem: Standard errors are underestimated and t-statistics are inflated

Solution:

python
model = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True  # Must use!
)

Pitfall 2: Attempting to Estimate Time-Invariant Variables

Problem: FE cannot estimate gender, race, and other time-invariant variables

Wrong Example:

python
# gender is time-invariant, so its coefficient cannot be estimated:
# it is absorbed (eliminated) by the entity effects
X_bad = pd.concat([X, df_panel['gender']], axis=1)
model = PanelOLS(y, X_bad, entity_effects=True).fit()

Solutions:

  • Use RE (if Hausman test passes)
  • Study interaction effects between time-invariant and time-varying variables (sketched below)
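
A sketch of the second option, using hypothetical column names gender, experience, and log_wage in df_panel: the gender main effect is still absorbed by the entity effects, but the gender × experience interaction varies over time and is therefore identified.

python
# Interacting a time-invariant variable with a time-varying one under FE
# ('gender', 'experience', 'log_wage' are hypothetical column names)
from linearmodels.panel import PanelOLS

X = df_panel[['experience']].copy()
X['gender_x_experience'] = df_panel['gender'] * df_panel['experience']

model = PanelOLS(df_panel['log_wage'], X, entity_effects=True).fit(
    cov_type='clustered', cluster_entity=True)
# The interaction coefficient measures the differential return to experience
# by gender; the gender level effect itself remains unidentified under FE.
print(model.params)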

Pitfall 3: Bad Control Problem

Problem: Controlling for variables affected by treatment (mediators)

Wrong Example:

python
# Occupation is a result of education (mediator)
# Controlling for occupation blocks part of education's effect
model = PanelOLS(df_panel['log_wage'],
                 df_panel[['education', 'occupation']],
                 entity_effects=True).fit()

Decision Rule:

  • Control: Confounders (variables that affect both X and Y)
  • Don't control: Mediators (X → M → Y) and outcome variables
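
A small simulation (not from the text) illustrates the rule: occupation is generated as a mediator of education, and controlling for it attenuates the estimated education coefficient from the total effect toward the direct effect only.

python
# Bad control simulation: controlling for a mediator attenuates the effect
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
education = rng.normal(12, 2, n)
occupation = 0.8 * education + rng.normal(0, 1, n)      # mediator: caused by education
log_wage = 0.10 * education + 0.05 * occupation + rng.normal(0, 0.3, n)
# True total effect of education = 0.10 + 0.05 * 0.8 = 0.14

good = sm.OLS(log_wage, sm.add_constant(pd.DataFrame({'education': education}))).fit()
bad = sm.OLS(log_wage, sm.add_constant(pd.DataFrame({'education': education,
                                                     'occupation': occupation}))).fit()
print(f"Without mediator: {good.params['education']:.3f}  (total effect, ~0.14)")
print(f"With mediator:    {bad.params['education']:.3f}  (direct effect only, ~0.10)")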

Pitfall 4: Insufficient Within Variation

Problem: If independent variables barely change within groups, FE estimates are imprecise

Diagnostics:

python
# Calculate within variation proportion
total_var = df['education'].var()
within_var = df.groupby('id')['education'].apply(lambda x: (x - x.mean()).var()).mean()
print(f"Within variation proportion: {within_var / total_var * 100:.1f}%")

Solutions:

  • If the within share is below 10%: Consider RE (if the Hausman test passes)
  • Increase time span or find variables with more variation

Pitfall 5: Ignoring Common Time Trends

Problem: If both y and x trend upward over time, the estimated relationship might be driven by common time factors

Solution:

python
# Use two-way FE
model = PanelOLS(y, X,
                 entity_effects=True,
                 time_effects=True).fit()  # Control for time trends

Practice Problems

Problem 1: Conceptual Question

Question: Explain why fixed effects can eliminate omitted variable bias. What conditions must be satisfied?


Answer:

  • FE eliminates time-invariant individual characteristics (α_i) through differencing
  • Condition: Omitted variables must be time-invariant
  • Examples: Ability, family background, personality, etc.
  • If omitted variables change over time (e.g., health status), FE cannot eliminate bias

Problem 2: Hausman Test

Question: You estimated FE and RE, and the Hausman test p-value is 0.001. Which model should you use? Why?


Answer:

  • Should use FE
  • Reason: p = 0.001 < 0.05, so we reject the null hypothesis (that the individual effects are uncorrelated with the regressors, Cov(α_i, x_it) = 0)
  • Indicates individual effects correlated with independent variables, RE is inconsistent
  • Although FE is less efficient, it is consistent

Problem 3: Programming Exercise

Data: Simulate 100 companies, 10 years, study effect of R&D expenditure on profit

Tasks:

  1. Generate data, ensure company effects correlated with R&D (endogeneity)
  2. Estimate pooled OLS, FE, RE
  3. Conduct Hausman test
  4. Compare bias of three methods

Answer:

python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS, RandomEffects
import statsmodels.api as sm

# Generate data
np.random.seed(42)
data = []
for i in range(100):
    company_effect = np.random.normal(0, 1)
    for t in range(10):
        # R&D correlated with company effect (endogeneity)
        rd = 5 + 0.5 * company_effect + np.random.normal(0, 1)
        # Profit
        profit = 10 + 2 * rd + company_effect + np.random.normal(0, 0.5)
        data.append({'id': i, 'year': 2010 + t, 'profit': profit, 'rd': rd})

df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

# 1. Pooled OLS
X_pooled = sm.add_constant(df[['rd']])
model_pooled = sm.OLS(df['profit'], X_pooled).fit()

# 2. FE
model_fe = PanelOLS(df_panel['profit'], df_panel[['rd']],
                    entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)

# 3. RE (add a constant; RandomEffects does not include one automatically)
X_re = sm.add_constant(df_panel[['rd']])
model_re = RandomEffects(df_panel['profit'], X_re).fit()

# 4. Hausman test (simplified single-coefficient version; linearmodels'
#    compare() can also be used for an informal FE-vs-RE comparison)
from scipy.stats import chi2

beta_diff = model_fe.params['rd'] - model_re.params['rd']
var_diff = model_fe.cov.loc['rd', 'rd'] - model_re.cov.loc['rd', 'rd']
hausman_stat = (beta_diff ** 2) / var_diff
p_value = 1 - chi2.cdf(hausman_stat, df=1)

print("=" * 70)
print("Estimation Results Comparison")
print("=" * 70)
print(f"True parameter:  2.00")
print(f"Pooled OLS:      {model_pooled.params['rd']:.4f}  (bias: {model_pooled.params['rd'] - 2:.4f})")
print(f"FE:              {model_fe.params['rd']:.4f}  (bias: {model_fe.params['rd'] - 2:.4f})")
print(f"RE:              {model_re.params['rd']:.4f}  (bias: {model_re.params['rd'] - 2:.4f})")
print(f"\nHausman test p-value: {p_value:.4f}")
print(f"Conclusion: {'Use FE' if p_value < 0.05 else 'Use RE'}")

Problem 4: DID Application

Scenario: Province A raised minimum wage in 2018, Province B did not. Data includes employment rates for both provinces 2015-2020.

Tasks:

  1. Set up DID model
  2. Estimate policy effect
  3. Plot event study graph

Answer:

python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt


# Generate DID data
np.random.seed(123)
data = []

for province in ['A', 'B']:
    treated = 1 if province == 'A' else 0
    province_effect = 0.5 if province == 'A' else 0

    for year in range(2015, 2021):
        post = 1 if year >= 2018 else 0
        treated_post = treated * post

        # Employment rate (%)
        # DID effect = 2 (increase of 2 percentage points)
        # Center the year so the common trend stays on a percentage scale
        employment = (60 + 2 * treated_post + province_effect +
                      0.3 * (year - 2015) + np.random.normal(0, 1))

        data.append({
            'province': province,
            'year': year,
            'employment': employment,
            'treated': treated,
            'post': post,
            'treated_post': treated_post
        })

df = pd.DataFrame(data)

# Create province ID
df['province_id'] = df['province'].map({'A': 1, 'B': 2})
df_panel = df.set_index(['province_id', 'year'])

# DID regression
model_did = PanelOLS(df_panel['employment'],
                     df_panel[['treated_post']],
                     entity_effects=True,
                     time_effects=True).fit(cov_type='clustered',
                                            cluster_entity=True)

print("=" * 70)
print("DID Estimation Results")
print("=" * 70)
print(model_did)

# Event study plot
df_avg = df.groupby(['treated', 'year'])['employment'].mean().reset_index()
df_treated = df_avg[df_avg['treated'] == 1]
df_control = df_avg[df_avg['treated'] == 0]

plt.figure(figsize=(12, 6))
plt.plot(df_treated['year'], df_treated['employment'], 'o-',
         label='Treatment Group (Province A)', linewidth=2, markersize=8)
plt.plot(df_control['year'], df_control['employment'], 's-',
         label='Control Group (Province B)', linewidth=2, markersize=8)
plt.axvline(2017.5, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
plt.text(2017.6, 63, 'Policy Implementation', fontsize=12, color='red')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Employment Rate (%)', fontweight='bold', fontsize=12)
plt.title('Difference-in-Differences: Effect of Minimum Wage on Employment', fontweight='bold', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Problem 5: Interpretation Question

Scenario: You study the effect of education on wages using 5-year panel data. Within variation accounts for only 5% of total variation.

Questions:

  1. What is the impact on FE estimation?
  2. What would you do?

Answer:

  1. Impact:

    • FE only uses within variation (5%)
    • Standard errors will be large (imprecise estimates)
    • Although consistent, very inefficient
  2. Solutions:

    • Option 1: Use RE (if Hausman test passes)
      • RE uses all variation (more efficient)
    • Option 2: Increase panel time span
      • Longer time → more within variation
    • Option 3: Accept FE's low efficiency
      • If consistency is more important than efficiency (conservative strategy)

Classic Literature Recommendations

Foundational Papers

  1. Mundlak, Y. (1978). "On the Pooling of Time Series and Cross Section Data." Econometrica, 46(1), 69-85.

    • Theoretical foundation of FE vs RE
    • Proposes Mundlak method (add means of time-varying variables in RE)
  2. Hausman, J. A. (1978). "Specification Tests in Econometrics." Econometrica, 46(6), 1251-1271.

    • Proposes Hausman test
    • Scientific tool for choosing between FE vs RE
  3. Arellano, M., & Bond, S. (1991). "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations." Review of Economic Studies, 58(2), 277-297.

    • Pioneering work on dynamic panel models
    • GMM estimation method
  4. Bertrand, M., Duflo, E., & Mullainathan, S. (2004). "How Much Should We Trust Differences-In-Differences Estimates?" Quarterly Journal of Economics, 119(1), 249-275.

    • Importance of clustered standard errors
    • Statistical inference issues in DID

Authoritative Textbooks

  1. Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd ed., MIT Press

    • Authoritative textbook on panel data
    • Chapters 10-11 detail FE and RE
  2. Baltagi, B. H. (2021). Econometric Analysis of Panel Data, 6th ed., Springer

    • Comprehensive coverage of panel methods
    • Includes latest developments (e.g., Synthetic Control)
  3. Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications, Cambridge University Press

    • Chapters 21-23: Panel data
    • Empirical application oriented, rich in cases
  4. Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics, Princeton University Press

    • Chapter 5: Fixed effects, DID, RDD
    • Strong intuition, suitable for beginners

Classic Empirical Applications

  1. Card, D. (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling." Aspects of Labour Market Behaviour, 201-222.

    • Uses college proximity as an instrument to estimate returns to schooling
  2. Jacobson, L. S., LaLonde, R. J., & Sullivan, D. G. (1993). "Earnings Losses of Displaced Workers." American Economic Review, 83(4), 685-709.

    • Classic case of event study design

Further Learning

Python Resources

  1. linearmodels Official Documentation

  2. Python Panel Data Tutorials

    • Kevin Sheppard's Panel Data lecture notes
    • Paired with linearmodels library

Online Courses

  1. MIT OpenCourseWare - 14.32 Econometrics

    • Joshua Angrist's panel data lectures
    • Emphasis on causal inference
  2. Coursera - Econometrics: Methods and Applications

    • University of Amsterdam
    • Includes panel data module

Panel Data Quick Reference

| Task | Python Code | Notes |
|---|---|---|
| Set panel index | df.set_index(['id', 'year']) | Must be a MultiIndex |
| One-way FE | PanelOLS(y, X, entity_effects=True) | Controls for individual heterogeneity |
| Two-way FE | PanelOLS(y, X, entity_effects=True, time_effects=True) | Controls for individual + time effects |
| Random effects | RandomEffects(y, X) | Assumes Cov(α_i, x_it) = 0 |
| Clustered SE | .fit(cov_type='clustered', cluster_entity=True) | Must use for panel data! |
| Hausman test | compare({'FE': fe, 'RE': re}) | Choose between FE and RE |
| Extract fixed effects | model.estimated_effects | Only available for FE |
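
Strung together, the commands above form a minimal end-to-end sketch (df with columns 'id' and 'year', plus y_col and x_cols, are hypothetical names):

python
# End-to-end sketch combining the quick-reference commands
# ('df', 'y_col', 'x_cols' are hypothetical)
from linearmodels.panel import PanelOLS, RandomEffects, compare

df_panel = df.set_index(['id', 'year'])                  # MultiIndex required

fe = PanelOLS(df_panel[y_col], df_panel[x_cols],
              entity_effects=True, time_effects=True).fit(
    cov_type='clustered', cluster_entity=True)           # two-way FE, clustered SE
re = RandomEffects(df_panel[y_col], df_panel[x_cols]).fit()

print(compare({'Two-way FE': fe, 'RE': re}))             # FE vs RE comparison
print(fe.estimated_effects.head())                       # extract the fixed effects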

Conclusion

Congratulations on completing your study of Panel Data and Fixed Effects Models!

You now master:

  • ✓ Panel data structure and advantages
  • ✓ Fixed effects and random effects principles
  • ✓ Hausman test and model selection
  • ✓ Two-way fixed effects and clustered standard errors
  • ✓ Panel data applications in DID
  • ✓ Professional use of linearmodels library

Next Steps:

  • Read classic papers to understand best practices in empirical research
  • Apply panel methods to your own research topics
  • Learn more advanced causal inference methods (DID, RDD, IV)

Remember the core idea:

"Panel data + Fixed effects = The workhorse of modern applied econometrics!"

Continue forward, explore Module 9: Difference-in-Differences (DID)!


Panel data, opening a new chapter in causal inference!
