8.6 Summary and Review
A systematic summary and practical review of panel data methods
Core Content Review
Three Major Advantages of Panel Data
Control for Unobserved Heterogeneity ⭐⭐⭐
- Differencing eliminates time-invariant individual characteristics (ability, personality, family background, etc.)
- Solves omitted variable bias (OVB)
More Variation and Observations
- Sample size grows from N (a single cross-section) to N × T observations
- Utilize both between and within variation
Dynamic Analysis and Causal Identification
- Track individual changes over time
- Foundation for DID, event study, and other methods
Panel Regression Methods Decision Tree
```
Start: Have panel data (N × T)
  ↓
Need to estimate time-invariant variables?
  ↓ Yes
  Consider Random Effects (RE)
    ↓ Conduct Hausman test
    ↓ p < 0.05 → switch to Fixed Effects (FE)
  ↓ No (don't need to estimate time-invariant variables)
  Prioritize Fixed Effects (FE)
  ↓
Common time trends exist?
  ↓ Yes → Two-way Fixed Effects (Two-Way FE)
  ↓ No  → One-way Fixed Effects (One-Way FE)
  ↓
Use clustered standard errors (cluster_entity=True)
  ↓
Check if within variation is sufficient
  ↓ Small within variation      → Consider RE or add control variables
  ↓ Sufficient within variation → Complete!
```
Comparison of Three Panel Methods
| Method | Model | Assumptions | Advantages | Disadvantages | Python Implementation |
|---|---|---|---|---|---|
| Pooled OLS | y_it = α + β·x_it + u_it | No individual heterogeneity | Simple | Omitted variable bias | sm.OLS(y, X) |
| Fixed Effects (FE) | y_it = β·x_it + α_i + ε_it | Allows α_i to be correlated with x_it | Consistent (even under this endogeneity) | Cannot estimate time-invariant variables | PanelOLS(..., entity_effects=True) |
| Random Effects (RE) | y_it = β·x_it + α_i + ε_it (α_i random) | Requires Cov(α_i, x_it) = 0 | Efficient; can estimate time-invariant variables | Inconsistent if α_i is correlated with x_it | RandomEffects(...) |
Core Estimation Methods
Three FE Estimations
Within Transformation: Demean each variable by its individual mean (x_it − x̄_i)
LSDV: Add a dummy variable for each individual
First Difference: Difference adjacent periods (x_it − x_{i,t−1})
RE's GLS Estimation
Quasi-demeaning transformation: ỹ_it = y_it − θ·ȳ_i (and likewise for x), where θ = 1 − √(σ²_ε / (σ²_ε + T·σ²_α)). θ = 0 reduces to pooled OLS; θ → 1 approaches FE.
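A small numeric sketch of the quasi-demeaning factor θ, using illustrative (assumed, not estimated) variance components:

```python
import numpy as np

# Illustrative variance components: sigma_eps^2 = sigma_alpha^2 = 1, T = 5
T = 5
sigma_eps2, sigma_alpha2 = 1.0, 1.0
theta = 1 - np.sqrt(sigma_eps2 / (sigma_eps2 + T * sigma_alpha2))
print(round(theta, 3))  # ~0.592: RE sits between pooled OLS and FE

# As T * sigma_alpha^2 grows, theta approaches 1 and RE approaches FE
theta_long = 1 - np.sqrt(sigma_eps2 / (sigma_eps2 + 1000 * sigma_alpha2))
print(round(theta_long, 3))
```

In practice θ is estimated from the data's variance components (this is what RandomEffects does internally); the fixed values above are purely for intuition.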
Practical Checklist
Standard Panel Regression Workflow
[ ] Step 1: Check data structure
- Confirm balanced or unbalanced panel
- Calculate N, T, and the total number of observations
[ ] Step 2: Exploratory analysis
- Plot spaghetti plot (time trends)
- Calculate within/between variation proportions
[ ] Step 3: Choose model
- Default use FE (economics standard)
- If need to estimate time-invariant variables, consider RE
- Conduct Hausman test
[ ] Step 4: Decide whether to use two-way FE
- Check for common time trends
- DID research must use two-way FE
[ ] Step 5: Choose correct standard errors
- Panel data must use clustered standard errors
- Cluster at individual level (cluster_entity=True)
[ ] Step 6: Diagnostics and robustness checks
- Check if within variation is sufficient
- Compare FE and RE results
- Check for bad control problems
Common Pitfalls and Solutions
Pitfall 1: Forgetting to Use Clustered Standard Errors
Problem: Standard errors underestimated, t-statistics inflated
Solution:

```python
model = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True  # Must use!
)
```

Pitfall 2: Attempting to Estimate Time-Invariant Variables
Problem: FE cannot estimate gender, race, and other time-invariant variables
Wrong Example:

```python
# gender is time-invariant: its coefficient cannot be estimated
# (it is eliminated by demeaning)
model = PanelOLS(y, X.join(gender), entity_effects=True).fit()
```

Solutions:
- Use RE (if Hausman test passes)
- Study interactions between time-invariant and time-varying variables (the interaction term varies within individuals and remains estimable)
Pitfall 3: Bad Control Problem
Problem: Controlling for variables affected by treatment (mediators)
Wrong Example:

```python
# Occupation is a result of education (a mediator)
# Controlling for occupation blocks part of education's effect
model = PanelOLS(log_wage, df[['education', 'occupation']],
                 entity_effects=True).fit()
```

Decision Rule:
- Control: Confounders (variables that simultaneously affect x and y)
- Don't control: Mediators (variables on the path x → M → y) and outcome variables
Pitfall 4: Insufficient Within Variation
Problem: If independent variables barely change within groups, FE estimates are imprecise
Diagnostics:

```python
# Calculate the within variation proportion
total_var = df['education'].var()
within_var = df.groupby('id')['education'].apply(lambda x: (x - x.mean()).var()).mean()
print(f"Within variation proportion: {within_var / total_var * 100:.1f}%")
```

Solutions:
- If the within share is below 10%: consider RE (if the Hausman test passes)
- Increase time span or find variables with more variation
Pitfall 5: Ignoring Time Trends
Problem: If both y and x trend upward over time, their correlation might reflect common time factors rather than a causal effect
Solution:

```python
# Use two-way FE
model = PanelOLS(y, X,
                 entity_effects=True,
                 time_effects=True).fit()  # Control for time trends
```

Practice Problems
Problem 1: Conceptual Question
Question: Explain why fixed effects can eliminate omitted variable bias. What conditions must be satisfied?
Answer:
- FE eliminates time-invariant individual characteristics (α_i) through demeaning or differencing
- Condition: Omitted variables must be time-invariant
- Examples: Ability, family background, personality, etc.
- If omitted variables change over time (e.g., health status), FE cannot eliminate bias
Problem 2: Hausman Test
Question: You estimated FE and RE; the Hausman test p-value is 0.001. Which model should you use? Why?
Answer:
- Should use FE
- Reason: p = 0.001 < 0.05, so we reject the null hypothesis (H₀: α_i is uncorrelated with the regressors)
- Indicates individual effects correlated with independent variables, RE is inconsistent
- Although FE is less efficient, it is consistent
Problem 3: Programming Exercise
Data: Simulate 100 companies, 10 years, study effect of R&D expenditure on profit
Tasks:
- Generate data, ensure company effects correlated with R&D (endogeneity)
- Estimate pooled OLS, FE, RE
- Conduct Hausman test
- Compare bias of three methods
```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS, RandomEffects
from scipy.stats import chi2
import statsmodels.api as sm

# Generate data
np.random.seed(42)
data = []
for i in range(100):
    company_effect = np.random.normal(0, 1)
    for t in range(10):
        # R&D correlated with company effect (endogeneity)
        rd = 5 + 0.5 * company_effect + np.random.normal(0, 1)
        # Profit (true coefficient on R&D = 2)
        profit = 10 + 2 * rd + company_effect + np.random.normal(0, 0.5)
        data.append({'id': i, 'year': 2010 + t, 'profit': profit, 'rd': rd})
df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

# 1. Pooled OLS
X_pooled = sm.add_constant(df[['rd']])
model_pooled = sm.OLS(df['profit'], X_pooled).fit()

# 2. FE
model_fe = PanelOLS(df_panel['profit'], df_panel[['rd']],
                    entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)

# 3. RE
model_re = RandomEffects(df_panel['profit'], df_panel[['rd']]).fit()

# 4. Hausman test
beta_diff = model_fe.params['rd'] - model_re.params['rd']
var_diff = model_fe.cov.loc['rd', 'rd'] - model_re.cov.loc['rd', 'rd']
hausman_stat = (beta_diff ** 2) / var_diff
p_value = 1 - chi2.cdf(hausman_stat, df=1)

print("=" * 70)
print("Estimation Results Comparison")
print("=" * 70)
print("True parameter: 2.00")
print(f"Pooled OLS: {model_pooled.params['rd']:.4f} (bias: {model_pooled.params['rd'] - 2:.4f})")
print(f"FE: {model_fe.params['rd']:.4f} (bias: {model_fe.params['rd'] - 2:.4f})")
print(f"RE: {model_re.params['rd']:.4f} (bias: {model_re.params['rd'] - 2:.4f})")
print(f"\nHausman test p-value: {p_value:.4f}")
print(f"Conclusion: {'Use FE' if p_value < 0.05 else 'Use RE'}")
```

Problem 4: DID Application
Scenario: Province A raised minimum wage in 2018, Province B did not. Data includes employment rates for both provinces 2015-2020.
Tasks:
- Set up DID model
- Estimate policy effect
- Plot event study graph
```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import matplotlib.pyplot as plt

# Generate DID data
np.random.seed(123)
data = []
for province in ['A', 'B']:
    treated = 1 if province == 'A' else 0
    province_effect = 0.5 if province == 'A' else 0
    for year in range(2015, 2021):
        post = 1 if year >= 2018 else 0
        treated_post = treated * post
        # Employment rate (%); true DID effect = 2 percentage points
        employment = (60 + 2 * treated_post + province_effect +
                      0.3 * (year - 2015) + np.random.normal(0, 1))
        data.append({
            'province': province,
            'year': year,
            'employment': employment,
            'treated': treated,
            'post': post,
            'treated_post': treated_post
        })
df = pd.DataFrame(data)

# Create province ID
df['province_id'] = df['province'].map({'A': 1, 'B': 2})
df_panel = df.set_index(['province_id', 'year'])

# DID regression
model_did = PanelOLS(df_panel['employment'],
                     df_panel[['treated_post']],
                     entity_effects=True,
                     time_effects=True).fit(cov_type='clustered',
                                            cluster_entity=True)
print("=" * 70)
print("DID Estimation Results")
print("=" * 70)
print(model_did)

# Event study plot
df_avg = df.groupby(['treated', 'year'])['employment'].mean().reset_index()
df_treated = df_avg[df_avg['treated'] == 1]
df_control = df_avg[df_avg['treated'] == 0]

plt.figure(figsize=(12, 6))
plt.plot(df_treated['year'], df_treated['employment'], 'o-',
         label='Treatment Group (Province A)', linewidth=2, markersize=8)
plt.plot(df_control['year'], df_control['employment'], 's-',
         label='Control Group (Province B)', linewidth=2, markersize=8)
plt.axvline(2017.5, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
plt.text(2017.6, 63, 'Policy Implementation', fontsize=12, color='red')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Employment Rate (%)', fontweight='bold', fontsize=12)
plt.title('Difference-in-Differences: Effect of Minimum Wage on Employment',
          fontweight='bold', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
```

Problem 5: Interpretation Question
Scenario: You study the effect of education on wages using 5-year panel data. Within variation accounts for only 5% of total variation.
Questions:
- What is the impact on FE estimation?
- What would you do?
Answer:
Impact:
- FE only uses within variation (5%)
- Standard errors will be large (imprecise estimates)
- Although consistent, very inefficient
Solutions:
- Option 1: Use RE (if the Hausman test passes)
  - RE uses all variation (more efficient)
- Option 2: Increase the panel's time span
  - Longer time → more within variation
- Option 3: Accept FE's low efficiency
  - If consistency is more important than efficiency (conservative strategy)
Classic Literature Recommendations
Foundational Papers
Mundlak, Y. (1978). "On the Pooling of Time Series and Cross Section Data." Econometrica, 46(1), 69-85.
- Theoretical foundation of FE vs RE
- Proposes Mundlak method (add means of time-varying variables in RE)
Hausman, J. A. (1978). "Specification Tests in Econometrics." Econometrica, 46(6), 1251-1271.
- Proposes Hausman test
- Scientific tool for choosing between FE vs RE
Arellano, M., & Bond, S. (1991). "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations." Review of Economic Studies, 58(2), 277-297.
- Pioneering work on dynamic panel models
- GMM estimation method
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). "How Much Should We Trust Differences-In-Differences Estimates?" Quarterly Journal of Economics, 119(1), 249-275.
- Importance of clustered standard errors
- Statistical inference issues in DID
Authoritative Textbooks
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd ed., MIT Press
- Authoritative textbook on panel data
- Chapters 10-11 detail FE and RE
Baltagi, B. H. (2021). Econometric Analysis of Panel Data, 6th ed., Springer
- Comprehensive coverage of panel methods
- Includes latest developments (e.g., Synthetic Control)
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications, Cambridge University Press
- Chapters 21-23: Panel data
- Empirical application oriented, rich in cases
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics, Princeton University Press
- Chapter 5: Fixed effects, DID, RDD
- Strong intuition, suitable for beginners
Classic Empirical Applications
Card, D. (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling." Aspects of Labour Market Behaviour, 201-222.
- Uses panel data + IV to estimate returns to education
Jacobson, L. S., LaLonde, R. J., & Sullivan, D. G. (1993). "Earnings Losses of Displaced Workers." American Economic Review, 83(4), 685-709.
- Classic case of event study design
Further Learning
Python Resources
linearmodels Official Documentation
- https://bashtage.github.io/linearmodels/
- Detailed API documentation and examples
Python Panel Data Tutorials
- Kevin Sheppard's Panel Data lecture notes
- Paired with linearmodels library
Online Courses
MIT OpenCourseWare - 14.32 Econometrics
- Joshua Angrist's panel data lectures
- Emphasis on causal inference
Coursera - Econometrics: Methods and Applications
- University of Amsterdam
- Includes panel data module
Panel Data Quick Reference
| Task | Python Code | Notes |
|---|---|---|
| Set panel index | df.set_index(['id', 'year']) | Must be MultiIndex |
| One-way FE | PanelOLS(y, X, entity_effects=True) | Control for individual heterogeneity |
| Two-way FE | PanelOLS(y, X, entity_effects=True, time_effects=True) | Control for individual + time |
| Random effects | RandomEffects(y, X) | Assumes Cov(α_i, x_it) = 0 |
| Clustered SE | .fit(cov_type='clustered', cluster_entity=True) | Must use for panel data! |
| Compare FE vs RE | compare({'FE': fe, 'RE': re}) | Side-by-side comparison; Hausman statistic computed manually |
| Extract fixed effects | model.estimated_effects | Only available for FE |
Conclusion
Congratulations on completing your study of Panel Data and Fixed Effects Models!
You have now mastered:
- ✓ Panel data structure and advantages
- ✓ Fixed effects and random effects principles
- ✓ Hausman test and model selection
- ✓ Two-way fixed effects and clustered standard errors
- ✓ Panel data applications in DID
- ✓ Professional use of linearmodels library
Next Steps:
- Read classic papers to understand best practices in empirical research
- Apply panel methods to your own research topics
- Learn more advanced causal inference methods (DID, RDD, IV)
Remember the core idea:
"Panel data + Fixed effects = The workhorse of modern applied econometrics!"
Continue forward, explore Module 9: Difference-in-Differences (DID)!
Panel data, opening a new chapter in causal inference!