8.6 Summary and Review

Comprehensive Integration: Systematic Summary and Practical Training in Panel Data Methods



Core Content Review

Three Major Advantages of Panel Data

  1. Control for Unobserved Heterogeneity ⭐⭐⭐

    • Differencing eliminates time-invariant individual characteristics (ability, personality, family background, etc.)
    • Solves omitted variable bias (OVB)
  2. More Variation and Observations

    • Sample size: N × T observations (more than a single cross-section or time series)
    • Utilize both between and within variation
  3. Dynamic Analysis and Causal Identification

    • Track individual changes over time
    • Foundation for DID, event study, and other methods

Panel Regression Methods Decision Tree

Start: Have panel data (N × T)

Need to estimate time-invariant variables?
  ↓ Yes
  Consider Random Effects (RE)

    Conduct Hausman test
    ↓ p < 0.05 → use Fixed Effects (FE)
    ↓ p ≥ 0.05 → use Random Effects (RE)

  ↓ No (don't need to estimate time-invariant variables)
  Prioritize Fixed Effects (FE)

Common time trends exist?
  ↓ Yes
  Two-way Fixed Effects (Two-Way FE)
  ↓ No
  One-way Fixed Effects (One-Way FE)

Use clustered standard errors (cluster_entity=True)

Check if within variation is sufficient
  ↓ Small within variation
  Consider RE or add control variables
  ↓ Sufficient within variation
  Complete!
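
The FE-vs-RE comparison at the heart of this tree takes only a few lines with linearmodels. This is a minimal sketch, assuming a panel DataFrame df_panel indexed by (id, year), an outcome column 'y', and a list of regressor names x_cols (all hypothetical names):

python
# Sketch of the FE vs RE decision step ('df_panel', 'y', 'x_cols' are hypothetical)
import statsmodels.api as sm
from linearmodels.panel import PanelOLS, RandomEffects, compare

y = df_panel['y']
X = sm.add_constant(df_panel[x_cols])

# FE with clustered standard errors (the default recommendation in the tree)
fe = PanelOLS(y, X, entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)
# RE as the alternative when time-invariant variables must be estimated
re = RandomEffects(y, X).fit()

# Side-by-side coefficients and standard errors;
# large FE-RE differences (a Hausman rejection) point toward FE
print(compare({'FE': fe, 'RE': re}))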

Comparison of Three Panel Methods

| Method | Model | Assumptions | Advantages | Disadvantages | Python Implementation |
|---|---|---|---|---|---|
| Pooled OLS | y_it = β0 + β1·x_it + u_it | No individual heterogeneity | Simple | Omitted variable bias | sm.OLS(y, X) |
| Fixed Effects (FE) | y_it = β1·x_it + α_i + ε_it | α_i may be correlated with x_it | Consistent (even when α_i is endogenous) | Cannot estimate time-invariant variables | PanelOLS(..., entity_effects=True) |
| Random Effects (RE) | y_it = β0 + β1·x_it + α_i + ε_it | Requires Cov(α_i, x_it) = 0 | Efficient, can estimate time-invariant variables | Inconsistent if α_i is correlated with x_it | RandomEffects(...) |

Core Estimation Methods

Three FE Estimations

  1. Within Transformation: Demean each variable within individuals (y_it − ȳ_i, x_it − x̄_i); see the numerical check below

  2. LSDV (Least Squares Dummy Variables): Add a dummy variable for each individual

  3. First Difference: Difference adjacent periods (Δy_it = y_it − y_{i,t−1})
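
The equivalence of the first two estimators is easy to verify numerically. Below is a small self-contained check (not from the text) on simulated data with hypothetical column names 'id', 'x', 'y': the within (demeaned) regression and the LSDV regression return the same slope.

python
# Numerical check: within transformation and LSDV give identical slopes
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_id, n_t = 50, 5
df = pd.DataFrame({'id': np.repeat(np.arange(n_id), n_t),
                   'x': rng.normal(size=n_id * n_t)})
alpha = np.repeat(rng.normal(size=n_id), n_t)            # individual effects
df['y'] = 1.5 * df['x'] + alpha + rng.normal(size=n_id * n_t)

# 1. Within transformation: demean y and x within each individual
yd = df['y'] - df.groupby('id')['y'].transform('mean')
xd = df['x'] - df.groupby('id')['x'].transform('mean')
beta_within = sm.OLS(yd, xd).fit().params.iloc[0]

# 2. LSDV: regress y on x plus a full set of individual dummies (no constant)
X_lsdv = pd.concat([df['x'], pd.get_dummies(df['id'], dtype=float)], axis=1)
beta_lsdv = sm.OLS(df['y'], X_lsdv).fit().params['x']

print(f"Within: {beta_within:.4f}, LSDV: {beta_lsdv:.4f}")  # identical up to rounding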

RE's GLS Estimation

Quasi-demeaning transformation: y_it − θ·ȳ_i (and similarly for x), where

θ = 1 − √( σ²_ε / (σ²_ε + T·σ²_α) )

θ = 0 reduces to pooled OLS; θ → 1 approaches the within (FE) estimator.
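
The transformation itself is just a partial demeaning. The sketch below illustrates it with assumed variance components (the values of σ_α, σ_ε, and T are made up for illustration, not estimated from any dataset); in practice the components are estimated, e.g. internally by linearmodels' RandomEffects.

python
# Quasi-demeaning illustration with assumed (not estimated) variance components
import numpy as np

sigma_alpha, sigma_eps, T = 1.0, 0.5, 10          # assumed values
theta = 1 - np.sqrt(sigma_eps**2 / (sigma_eps**2 + T * sigma_alpha**2))
print(f"theta = {theta:.3f}")  # theta = 0 -> pooled OLS, theta -> 1 -> FE

# Applied to a panel DataFrame df indexed by (id, year) with columns 'y' and 'x':
# y_star = df['y'] - theta * df.groupby(level='id')['y'].transform('mean')
# x_star = df['x'] - theta * df.groupby(level='id')['x'].transform('mean')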

Practical Checklist

Standard Panel Regression Workflow

  • [ ] Step 1: Check data structure

    • Confirm balanced or unbalanced panel
    • Calculate N, T, and the total number of observations
  • [ ] Step 2: Exploratory analysis

    • Plot spaghetti plot (time trends)
    • Calculate within/between variation proportions (see the sketch after this checklist)
  • [ ] Step 3: Choose model

    • Default to FE (the standard choice in applied economics)
    • If need to estimate time-invariant variables, consider RE
    • Conduct Hausman test
  • [ ] Step 4: Decide whether to use two-way FE

    • Check for common time trends
    • DID research must use two-way FE
  • [ ] Step 5: Choose correct standard errors

    • Panel data must use clustered standard errors
    • Cluster at individual level (cluster_entity=True)
  • [ ] Step 6: Diagnostics and robustness checks

    • Check if within variation is sufficient
    • Compare FE and RE results
    • Check for bad control problems
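
A sketch of Step 2 is shown below, assuming a long-format DataFrame df with columns 'id', 'year', and an outcome column 'y' (hypothetical names):

python
# Step 2 sketch: spaghetti plot and within/between variation shares
import matplotlib.pyplot as plt

# Spaghetti plot: one line per individual
fig, ax = plt.subplots(figsize=(10, 5))
for _, g in df.groupby('id'):
    ax.plot(g['year'], g['y'], alpha=0.3, linewidth=0.8)
ax.set_xlabel('Year')
ax.set_ylabel('y')
ax.set_title('Spaghetti Plot: Individual Trajectories Over Time')
plt.tight_layout()
plt.show()

# Within/between variation decomposition
total_var = df['y'].var()
between_var = df.groupby('id')['y'].mean().var()
within_var = df.groupby('id')['y'].apply(lambda s: (s - s.mean()).var()).mean()
print(f"Between share: {between_var / total_var:.1%}, "
      f"within share: {within_var / total_var:.1%}")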

Common Pitfalls and Solutions

Pitfall 1: Forgetting to Use Clustered Standard Errors

Problem: Standard errors are underestimated and t-statistics are inflated

Solution:

python
model = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True  # Must use!
)

Pitfall 2: Attempting to Estimate Time-Invariant Variables

Problem: FE cannot estimate gender, race, and other time-invariant variables

Wrong Example:

python
# gender is time-invariant, so its coefficient cannot be estimated:
# it is absorbed (eliminated) by the entity effects
X_bad = pd.concat([X, df_panel['gender']], axis=1)
model = PanelOLS(y, X_bad, entity_effects=True).fit()

Solutions:

  • Use RE (if Hausman test passes)
  • Study interaction effects between time-invariant and time-varying variables (sketched below)
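
A sketch of the second option, using hypothetical column names gender, experience, and log_wage in df_panel: the gender main effect is still absorbed by the entity effects, but the gender × experience interaction varies over time and is therefore identified.

python
# Interacting a time-invariant variable with a time-varying one under FE
# ('gender', 'experience', 'log_wage' are hypothetical column names)
from linearmodels.panel import PanelOLS

X = df_panel[['experience']].copy()
X['gender_x_experience'] = df_panel['gender'] * df_panel['experience']

model = PanelOLS(df_panel['log_wage'], X, entity_effects=True).fit(
    cov_type='clustered', cluster_entity=True)
# The interaction coefficient measures the differential return to experience
# by gender; the gender level effect itself remains unidentified under FE.
print(model.params)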

Pitfall 3: Bad Control Problem

Problem: Controlling for variables affected by treatment (mediators)

Wrong Example:

python
# Occupation is a result of education (mediator)
# Controlling for occupation blocks part of education's effect
model = PanelOLS(df_panel['log_wage'],
                 df_panel[['education', 'occupation']],
                 entity_effects=True).fit()

Decision Rule:

  • Control: Confounders (variables that affect both X and Y)
  • Don't control: Mediators (X → M → Y) and outcome variables
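
A small simulation (not from the text) illustrates the rule: occupation is generated as a mediator of education, and controlling for it attenuates the estimated education coefficient from the total effect toward the direct effect only.

python
# Bad control simulation: controlling for a mediator attenuates the effect
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
education = rng.normal(12, 2, n)
occupation = 0.8 * education + rng.normal(0, 1, n)      # mediator: caused by education
log_wage = 0.10 * education + 0.05 * occupation + rng.normal(0, 0.3, n)
# True total effect of education = 0.10 + 0.05 * 0.8 = 0.14

good = sm.OLS(log_wage, sm.add_constant(pd.DataFrame({'education': education}))).fit()
bad = sm.OLS(log_wage, sm.add_constant(pd.DataFrame({'education': education,
                                                     'occupation': occupation}))).fit()
print(f"Without mediator: {good.params['education']:.3f}  (total effect, ~0.14)")
print(f"With mediator:    {bad.params['education']:.3f}  (direct effect only, ~0.10)")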

Pitfall 4: Insufficient Within Variation

Problem: If independent variables barely change within groups, FE estimates are imprecise

Diagnostics:

python
# Calculate within variation proportion
total_var = df['education'].var()
within_var = df.groupby('id')['education'].apply(lambda x: (x - x.mean()).var()).mean()
print(f"Within variation proportion: {within_var / total_var * 100:.1f}%")

Solutions:

  • If the within share is below 10%: Consider RE (if the Hausman test passes)
  • Increase time span or find variables with more variation

Pitfall 5: Ignoring Common Time Trends

Problem: If both y and x trend upward over time, the estimated relationship might be driven by common time factors

Solution:

python
# Use two-way FE
model = PanelOLS(y, X,
                 entity_effects=True,
                 time_effects=True).fit()  # Control for time trends

Practice Problems

Problem 1: Conceptual Question

Question: Explain why fixed effects can eliminate omitted variable bias. What conditions must be satisfied?


Answer:

  • FE eliminates time-invariant individual characteristics (α_i) through differencing
  • Condition: Omitted variables must be time-invariant
  • Examples: Ability, family background, personality, etc.
  • If omitted variables change over time (e.g., health status), FE cannot eliminate bias

Problem 2: Hausman Test

Question: You estimated FE and RE, and the Hausman test p-value is 0.001. Which model should you use? Why?


Answer:

  • Should use FE
  • Reason: p = 0.001 < 0.05, so we reject the null hypothesis (that the individual effects are uncorrelated with the regressors, Cov(α_i, x_it) = 0)
  • Indicates individual effects correlated with independent variables, RE is inconsistent
  • Although FE is less efficient, it is consistent

Problem 3: Programming Exercise

Data: Simulate 100 companies, 10 years, study effect of R&D expenditure on profit

Tasks:

  1. Generate data, ensure company effects correlated with R&D (endogeneity)
  2. Estimate pooled OLS, FE, RE
  3. Conduct Hausman test
  4. Compare bias of three methods

Answer:

python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS, RandomEffects
import statsmodels.api as sm

# Generate data
np.random.seed(42)
data = []
for i in range(100):
    company_effect = np.random.normal(0, 1)
    for t in range(10):
        # R&D correlated with company effect (endogeneity)
        rd = 5 + 0.5 * company_effect + np.random.normal(0, 1)
        # Profit
        profit = 10 + 2 * rd + company_effect + np.random.normal(0, 0.5)
        data.append({'id': i, 'year': 2010 + t, 'profit': profit, 'rd': rd})

df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

# 1. Pooled OLS
X_pooled = sm.add_constant(df[['rd']])
model_pooled = sm.OLS(df['profit'], X_pooled).fit()

# 2. FE
model_fe = PanelOLS(df_panel['profit'], df_panel[['rd']],
                    entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)

# 3. RE (add a constant; RandomEffects does not include one automatically)
X_re = sm.add_constant(df_panel[['rd']])
model_re = RandomEffects(df_panel['profit'], X_re).fit()

# 4. Hausman test (simplified single-coefficient version; linearmodels'
#    compare() can also be used for an informal FE-vs-RE comparison)
from scipy.stats import chi2

beta_diff = model_fe.params['rd'] - model_re.params['rd']
var_diff = model_fe.cov.loc['rd', 'rd'] - model_re.cov.loc['rd', 'rd']
hausman_stat = (beta_diff ** 2) / var_diff
p_value = 1 - chi2.cdf(hausman_stat, df=1)

print("=" * 70)
print("Estimation Results Comparison")
print("=" * 70)
print(f"True parameter:  2.00")
print(f"Pooled OLS:      {model_pooled.params['rd']:.4f}  (bias: {model_pooled.params['rd'] - 2:.4f})")
print(f"FE:              {model_fe.params['rd']:.4f}  (bias: {model_fe.params['rd'] - 2:.4f})")
print(f"RE:              {model_re.params['rd']:.4f}  (bias: {model_re.params['rd'] - 2:.4f})")
print(f"\nHausman test p-value: {p_value:.4f}")
print(f"Conclusion: {'Use FE' if p_value < 0.05 else 'Use RE'}")

Problem 4: DID Application

Scenario: Province A raised minimum wage in 2018, Province B did not. Data includes employment rates for both provinces 2015-2020.

Tasks:

  1. Set up DID model
  2. Estimate policy effect
  3. Plot event study graph

Answer:

python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt


# Generate DID data
np.random.seed(123)
data = []

for province in ['A', 'B']:
    treated = 1 if province == 'A' else 0
    province_effect = 0.5 if province == 'A' else 0

    for year in range(2015, 2021):
        post = 1 if year >= 2018 else 0
        treated_post = treated * post

        # Employment rate (%)
        # DID effect = 2 (increase of 2 percentage points)
        # Center the year so the common trend stays on a percentage scale
        employment = (60 + 2 * treated_post + province_effect +
                      0.3 * (year - 2015) + np.random.normal(0, 1))

        data.append({
            'province': province,
            'year': year,
            'employment': employment,
            'treated': treated,
            'post': post,
            'treated_post': treated_post
        })

df = pd.DataFrame(data)

# Create province ID
df['province_id'] = df['province'].map({'A': 1, 'B': 2})
df_panel = df.set_index(['province_id', 'year'])

# DID regression
model_did = PanelOLS(df_panel['employment'],
                     df_panel[['treated_post']],
                     entity_effects=True,
                     time_effects=True).fit(cov_type='clustered',
                                            cluster_entity=True)

print("=" * 70)
print("DID Estimation Results")
print("=" * 70)
print(model_did)

# Event study plot
df_avg = df.groupby(['treated', 'year'])['employment'].mean().reset_index()
df_treated = df_avg[df_avg['treated'] == 1]
df_control = df_avg[df_avg['treated'] == 0]

plt.figure(figsize=(12, 6))
plt.plot(df_treated['year'], df_treated['employment'], 'o-',
         label='Treatment Group (Province A)', linewidth=2, markersize=8)
plt.plot(df_control['year'], df_control['employment'], 's-',
         label='Control Group (Province B)', linewidth=2, markersize=8)
plt.axvline(2017.5, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
plt.text(2017.6, 63, 'Policy Implementation', fontsize=12, color='red')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Employment Rate (%)', fontweight='bold', fontsize=12)
plt.title('Difference-in-Differences: Effect of Minimum Wage on Employment', fontweight='bold', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Problem 5: Interpretation Question

Scenario: You study the effect of education on wages using 5-year panel data. Within variation accounts for only 5% of total variation.

Questions:

  1. What is the impact on FE estimation?
  2. What would you do?

Answer:

  1. Impact:

    • FE only uses within variation (5%)
    • Standard errors will be large (imprecise estimates)
    • Although consistent, very inefficient
  2. Solutions:

    • Option 1: Use RE (if Hausman test passes)
      • RE uses all variation (more efficient)
    • Option 2: Increase panel time span
      • Longer time → more within variation
    • Option 3: Accept FE's low efficiency
      • If consistency is more important than efficiency (conservative strategy)

Classic Literature Recommendations

Foundational Papers

  1. Mundlak, Y. (1978). "On the Pooling of Time Series and Cross Section Data." Econometrica, 46(1), 69-85.

    • Theoretical foundation of FE vs RE
    • Proposes Mundlak method (add means of time-varying variables in RE)
  2. Hausman, J. A. (1978). "Specification Tests in Econometrics." Econometrica, 46(6), 1251-1271.

    • Proposes Hausman test
    • Scientific tool for choosing between FE vs RE
  3. Arellano, M., & Bond, S. (1991). "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations." Review of Economic Studies, 58(2), 277-297.

    • Pioneering work on dynamic panel models
    • GMM estimation method
  4. Bertrand, M., Duflo, E., & Mullainathan, S. (2004). "How Much Should We Trust Differences-In-Differences Estimates?" Quarterly Journal of Economics, 119(1), 249-275.

    • Importance of clustered standard errors
    • Statistical inference issues in DID

Authoritative Textbooks

  1. Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd ed., MIT Press

    • Authoritative textbook on panel data
    • Chapters 10-11 detail FE and RE
  2. Baltagi, B. H. (2021). Econometric Analysis of Panel Data, 6th ed., Springer

    • Comprehensive coverage of panel methods
    • Includes latest developments (e.g., Synthetic Control)
  3. Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications, Cambridge University Press

    • Chapters 21-23: Panel data
    • Empirical application oriented, rich in cases
  4. Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics, Princeton University Press

    • Chapter 5: Fixed effects, DID, RDD
    • Strong intuition, suitable for beginners

Classic Empirical Applications

  1. Card, D. (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling." Aspects of Labour Market Behaviour, 201-222.

    • Uses college proximity as an instrument to estimate returns to schooling
  2. Jacobson, L. S., LaLonde, R. J., & Sullivan, D. G. (1993). "Earnings Losses of Displaced Workers." American Economic Review, 83(4), 685-709.

    • Classic case of event study design

Further Learning

Python Resources

  1. linearmodels Official Documentation

  2. Python Panel Data Tutorials

    • Kevin Sheppard's Panel Data lecture notes
    • Paired with linearmodels library

Online Courses

  1. MIT OpenCourseWare - 14.32 Econometrics

    • Joshua Angrist's panel data lectures
    • Emphasis on causal inference
  2. Coursera - Econometrics: Methods and Applications

    • University of Amsterdam
    • Includes panel data module

Panel Data Quick Reference

| Task | Python Code | Notes |
|---|---|---|
| Set panel index | df.set_index(['id', 'year']) | Must be a MultiIndex |
| One-way FE | PanelOLS(y, X, entity_effects=True) | Controls for individual heterogeneity |
| Two-way FE | PanelOLS(y, X, entity_effects=True, time_effects=True) | Controls for individual + time effects |
| Random effects | RandomEffects(y, X) | Assumes Cov(α_i, x_it) = 0 |
| Clustered SE | .fit(cov_type='clustered', cluster_entity=True) | Must use for panel data! |
| Hausman test | compare({'FE': fe, 'RE': re}) | Choose between FE and RE |
| Extract fixed effects | model.estimated_effects | Only available for FE |
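
Strung together, the commands above form a minimal end-to-end sketch (df with columns 'id' and 'year', plus y_col and x_cols, are hypothetical names):

python
# End-to-end sketch combining the quick-reference commands
# ('df', 'y_col', 'x_cols' are hypothetical)
from linearmodels.panel import PanelOLS, RandomEffects, compare

df_panel = df.set_index(['id', 'year'])                  # MultiIndex required

fe = PanelOLS(df_panel[y_col], df_panel[x_cols],
              entity_effects=True, time_effects=True).fit(
    cov_type='clustered', cluster_entity=True)           # two-way FE, clustered SE
re = RandomEffects(df_panel[y_col], df_panel[x_cols]).fit()

print(compare({'Two-way FE': fe, 'RE': re}))             # FE vs RE comparison
print(fe.estimated_effects.head())                       # extract the fixed effects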

Conclusion

Congratulations on completing your study of Panel Data and Fixed Effects Models!

You now master:

  • ✓ Panel data structure and advantages
  • ✓ Fixed effects and random effects principles
  • ✓ Hausman test and model selection
  • ✓ Two-way fixed effects and clustered standard errors
  • ✓ Panel data applications in DID
  • ✓ Professional use of linearmodels library

Next Steps:

  • Read classic papers to understand best practices in empirical research
  • Apply panel methods to your own research topics
  • Learn more advanced causal inference methods (DID, RDD, IV)

Remember the core idea:

"Panel data + Fixed effects = The workhorse of modern applied econometrics!"

Continue forward, explore Module 9: Difference-in-Differences (DID)!


Panel data, opening a new chapter in causal inference!
