8.2 Panel Data Basics
Understanding panel data structure, variance decomposition, and the pitfalls of pooled OLS
Section Objectives
- Master long format and wide format of panel data
- Understand within variation vs between variation
- Identify problems with pooled OLS (omitted variable bias)
- Use pandas MultiIndex for panel data processing
- Visualize time trends in panel data
Structure of Panel Data
What is Panel Data?
Definition: Panel Data = Multiple individuals (Cross-Section) × Multiple time points (Time Series)
Notation: $y_{it}$ denotes the outcome of individual $i$ at time $t$, with $i = 1, \dots, N$ and $t = 1, \dots, T$.
Three Dimensions:
- $N$: Number of individuals (entities, units, cross-sectional dimension)
- Examples: firms, people, countries, cities
- $T$: Number of time periods (temporal dimension)
- Examples: years, quarters, months
- Total observations: $N \times T$ (ideally, i.e., for a balanced panel)
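These dimensions are easy to read off a long-format DataFrame. A minimal sketch (the column names id and year are assumptions, matching the examples below):
import pandas as pd
# Hypothetical long-format panel with 'id' and 'year' columns
df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2],
    'year': [2015, 2016, 2017, 2015, 2016, 2017],
    'wage': [5000, 5200, 5500, 6000, 6300, 6600],
})
N = df['id'].nunique()    # number of individuals
T = df['year'].nunique()  # number of time periods
print(N, T, len(df))      # 2 3 6 -> len(df) == N * T only if the panel is balanced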
Types of Panel Data
1. Balanced Panel
Definition: Each individual has observations at all time points
Example:
ID Year Wage Education
1 2015 5000 12
1 2016 5200 12
1 2017 5500 13
2 2015 6000 16
2 2016 6300 16
2 2017 6600 16
Characteristics:
- Each ID has $T$ observations
- Total observations = $N \times T$
- Simple analysis, no need to handle missing values
2. Unbalanced Panel
Definition: Some individuals missing observations at some time points
Example:
ID Year Wage Education
1 2015 5000 12
1 2016 5200 12
1 2017 5500 13
2 2015 6000 16
2 2017 6600 16 # Missing 2016
3 2016 4800 10
3 2017 5000 10 # Missing 2015
Causes:
- Sample attrition: Individuals drop out of surveys
- Firms going public/delisting, bankruptcy
- Data collection issues
Handling: Modern software (like linearmodels) can automatically handle unbalanced panels
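A quick balance check, sketched with the df built above: a panel is balanced exactly when every individual appears in every period.
counts = df.groupby('id')['year'].count()          # observations per individual
is_balanced = (counts == df['year'].nunique()).all()
print(counts)
print("Balanced:", is_balanced)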
Data Format: Long Format vs Wide Format
Long Format ⭐ Recommended
Structure: Each row is an observation (individual-time pair)
id year wage education experience
0 1 2015 5000 12 3
1 1 2016 5200 12 4
2 1 2017 5500 13 5
3 2 2015 6000 16 5
4 2 2016 6300 16 6
5 2 2017 6600 16 7
Pros:
- Directly corresponds to regression models
- Convenient for handling unbalanced panels
- Standard format for linearmodels, statsmodels
- Easy to add new variables
Python Implementation:
import pandas as pd
# Long format data
long_data = pd.DataFrame({
'id': [1, 1, 1, 2, 2, 2],
'year': [2015, 2016, 2017, 2015, 2016, 2017],
'wage': [5000, 5200, 5500, 6000, 6300, 6600],
'education': [12, 12, 13, 16, 16, 16]
})
Wide Format
Structure: Each row is an individual, different time points in different columns
id wage_2015 wage_2016 wage_2017 education_2015 education_2016 education_2017
0 1 5000 5200 5500 12 12 13
1 2 6000 6300 6600 16 16 16
Pros:
- Intuitive, easy to view time series for a single individual
- More convenient for certain visualizations
Cons:
- Cannot be directly used for regression
- Difficult to handle unbalanced panels
- Adding new variables requires multiple columns
Long-Wide Format Conversion
Wide Format → Long Format (melt)
import pandas as pd
# Wide format data
wide_data = pd.DataFrame({
'id': [1, 2],
'wage_2015': [5000, 6000],
'wage_2016': [5200, 6300],
'wage_2017': [5500, 6600],
'education': [12, 16] # Time-invariant
})
# Convert to long format
long_data = wide_data.melt(
id_vars=['id', 'education'], # Invariant columns
value_vars=['wage_2015', 'wage_2016', 'wage_2017'],
var_name='year_var',
value_name='wage'
)
# Clean year column
long_data['year'] = long_data['year_var'].str.extract(r'(\d+)', expand=False).astype(int)
long_data = long_data.drop('year_var', axis=1).sort_values(['id', 'year'])
print(long_data)
# id education wage year
# 0 1 12 5000 2015
# 1 1 12 5200 2016
# 2 1 12 5500 2017
# 3 2 16 6000 2015
# 4 2 16 6300 2016
# 5 2 16 6600 2017
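For this wage_YYYY naming pattern, pandas also ships a purpose-built helper, pd.wide_to_long, which can replace the melt-plus-cleanup steps above. A minimal sketch reusing the wide_data frame:
# pd.wide_to_long parses 'wage_2015' into stub 'wage' and year 2015
long_alt = pd.wide_to_long(
    wide_data, stubnames='wage', i='id', j='year', sep='_'
).reset_index().sort_values(['id', 'year'])
print(long_alt)  # same content as the melt result above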
Long Format → Wide Format (pivot)
# Convert from long format back to wide format
wide_data_back = long_data.pivot(
index='id',
columns='year',
values='wage'
).reset_index()
# Rename columns
wide_data_back.columns = ['id'] + [f'wage_{c}' for c in wide_data_back.columns[1:]]
print(wide_data_back)
# id wage_2015 wage_2016 wage_2017
# 0 1 5000 5200 5500
# 1 2 6000 6300 6600
Variance Decomposition: Within vs Between
Panel data's core advantage is that it contains two types of variation:
1. Total Variation
Definition: Variation of all observations around the grand mean
$$SS_{total} = \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \bar{y})^2$$
where $\bar{y} = \frac{1}{NT} \sum_{i} \sum_{t} y_{it}$ is the grand mean
2. Between Variation
Definition: Differences between individual means
$$SS_{between} = \sum_{i=1}^{N} T\,(\bar{y}_i - \bar{y})^2$$
where $\bar{y}_i = \frac{1}{T} \sum_{t} y_{it}$ is individual $i$'s time mean
Intuition:
- Compares different individuals' "average levels"
- Example: Average wage of ID=1 vs average wage of ID=2
- This is the variation used by cross-sectional regression
3. Within Variation ⭐
Definition: Variation within the same individual over time
$$SS_{within} = \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \bar{y}_i)^2$$
Intuition:
- Individual's fluctuation around their own average
- Example: ID=1's wage in 2016 vs ID=1's average wage
- This is the variation used by fixed effects regression
Key Property:
$$SS_{total} = SS_{between} + SS_{within}$$
Python Implementation: Variance Decomposition
import numpy as np
import pandas as pd
# Simulate data
np.random.seed(42)
data = []
for i in range(1, 6): # 5 individuals
for t in range(2015, 2020): # 5 years
# Individual effect + time trend + noise
wage = 5000 + i * 500 + (t - 2015) * 100 + np.random.normal(0, 200)
data.append({'id': i, 'year': t, 'wage': wage})
df = pd.DataFrame(data)
print("=" * 70)
print("Data Preview")
print("=" * 70)
print(df.head(10))
# Calculate means
overall_mean = df['wage'].mean() # Grand mean
group_means = df.groupby('id')['wage'].transform('mean') # Individual means
# 1. Total variation
total_var = np.var(df['wage'], ddof=1)
# 2. Between variation: variance of each observation's individual mean
# (using the per-observation transform keeps the NT-1 divisor consistent,
# so total = between + within holds exactly)
between_var = np.var(group_means, ddof=1)
# 3. Within variation (residual method)
df['wage_demeaned'] = df['wage'] - group_means
within_var = np.var(df['wage_demeaned'], ddof=1)
print("\n" + "=" * 70)
print("Variance Decomposition")
print("=" * 70)
print(f"Total variation: {total_var:,.2f}")
print(f"Between variation: {between_var:,.2f} ({between_var/total_var*100:.1f}%)")
print(f"Within variation: {within_var:,.2f} ({within_var/total_var*100:.1f}%)")
print(f"Between + Within: {between_var + within_var:,.2f}")
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Between variation (differences in individual means)
group_means_unique = df.groupby('id')['wage'].mean()
axes[0].bar(group_means_unique.index, group_means_unique.values, color='steelblue', alpha=0.7)
axes[0].axhline(overall_mean, color='red', linestyle='--', linewidth=2, label='Grand Mean')
axes[0].set_xlabel('Individual ID', fontweight='bold')
axes[0].set_ylabel('Average Wage', fontweight='bold')
axes[0].set_title('Between Variation\nAverage wages of different individuals', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Right plot: Within variation (within-individual time changes)
for i in [1, 2, 3]:
subset = df[df['id'] == i]
axes[1].plot(subset['year'], subset['wage'], 'o-', label=f'ID {i}', linewidth=2)
# Plot individual mean line
axes[1].axhline(subset['wage'].mean(), color='gray', linestyle=':', alpha=0.5)
axes[1].set_xlabel('Year', fontweight='bold')
axes[1].set_ylabel('Wage', fontweight='bold')
axes[1].set_title('Within Variation\nWithin-individual time changes', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
Output Interpretation:
- Large between variation: Large differences between individuals (e.g., different industries, education levels)
- Large within variation: Large changes over time within individuals (e.g., wage growth, experience accumulation)
- Fixed effects only use within variation, eliminating between differences (see the sketch below)
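To make the last point concrete, here is a minimal self-contained sketch (hypothetical simulated data, separate from the wage example above): OLS on individual-demeaned data, which is what fixed effects amounts to, uses only within variation and recovers the true slope, while pooled OLS on the raw data does not.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Hypothetical DGP: y depends on x with slope 0.5, plus an individual
# effect a_i that is correlated with x
rng = np.random.default_rng(0)
rows = []
for i in range(50):
    a_i = rng.normal(0, 2)
    for t in range(5):
        x = a_i + rng.normal(0, 1)                  # x correlated with the effect
        y = 1 + 0.5 * x + a_i + rng.normal(0, 0.5)
        rows.append({'id': i, 'x': x, 'y': y})
df_demo = pd.DataFrame(rows)
# Within transformation: demean x and y by individual
g = df_demo.groupby('id')
x_w = df_demo['x'] - g['x'].transform('mean')
y_w = df_demo['y'] - g['y'].transform('mean')
print(sm.OLS(y_w, x_w).fit().params)  # within variation only -> close to 0.5
print(sm.OLS(df_demo['y'], sm.add_constant(df_demo[['x']])).fit().params['x'])
# pooled OLS mixes in between variation -> biased well above 0.5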
Problems with Pooled OLS
Pooled OLS
Model:
$$y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}$$
Assumption: The same intercept and slopes apply to all individuals and time periods, and $u_{it}$ is i.i.d.
Python Implementation:
import statsmodels.api as sm
# Ignore panel structure, direct OLS
X = sm.add_constant(df[['education']])
model_pooled = sm.OLS(df['wage'], X).fit()
print(model_pooled.summary())
Pros:
- Simple, easy to implement
- Efficient (if model is correct)
Fatal Flaws:
Problem 1: Omitted Variable Bias (OVB) ⭐⭐⭐
Scenario: Studying the effect of education on wages
True Model:
$$\log(wage_{it}) = \beta_0 + \beta_1\, education_{it} + \alpha_i + \epsilon_{it}$$
where $\alpha_i$ captures unobservable individual characteristics (ability, family background, personality, etc.)
Model Estimated by Pooled OLS:
$$\log(wage_{it}) = \beta_0 + \beta_1\, education_{it} + u_{it}$$
where $u_{it} = \alpha_i + \epsilon_{it}$
Problem: If $\mathrm{Cov}(education_{it}, \alpha_i) \neq 0$, then $\hat{\beta}_1$ is biased!
Intuition:
- High-ability people get more education AND earn more
- Pooled OLS mistakenly attributes the effect of ability to the effect of education
- Result: Overestimates the returns to education
Direction of Omitted Variable Bias
OVB Formula (when omitting a third variable $z$):
$$\mathrm{plim}\ \hat{\beta}_{OLS} = \beta + \gamma \cdot \frac{\mathrm{Cov}(x, z)}{\mathrm{Var}(x)}$$
where:
- $\beta$: True effect of $x$ on $y$
- $\gamma$: True effect of $z$ on $y$
- $\mathrm{Cov}(x, z)$: Covariance between $x$ and $z$
Applied to Our Example:
- $x$: Education
- $z$: Ability (unobservable)
- $y$: Wage
Analysis:
- $\gamma > 0$: Ability increases wages
- $\mathrm{Cov}(x, z) > 0$: Smart people get more education
- Therefore: $\gamma \cdot \mathrm{Cov}(x, z) / \mathrm{Var}(x) > 0$ → Positive bias (overestimation)
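The formula can be checked numerically before the full panel demonstration below. This is a self-contained sketch whose parameter values mirror the simulation that follows:
import numpy as np
rng = np.random.default_rng(42)
n = 100_000
beta, gamma = 0.08, 0.5                       # true effects of x (education) and z (ability)
z = rng.normal(0, 0.5, n)                     # "ability"
x = 12 + 0.8 * z + rng.normal(0, 0.3, n)      # education, correlated with ability
y = 1.5 + beta * x + gamma * z + rng.normal(0, 0.1, n)
# short regression of y on x alone (slope via cov/var)
beta_short = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# OVB formula prediction
beta_pred = beta + gamma * np.cov(x, z)[0, 1] / np.var(x, ddof=1)
print(f"Short-regression slope: {beta_short:.4f}")
print(f"Formula prediction:     {beta_pred:.4f}")  # the two should match closely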
Python Demonstration: Generating OVB
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
np.random.seed(123)
# Parameter settings
N = 500
T = 5
true_beta_education = 0.08 # True effect of education
# Simulate data
data = []
for i in range(N):
# Individual fixed effect (ability, family background, etc.)
ability = np.random.normal(0, 0.5)
for t in range(T):
# Education level (positively correlated with ability!)
education = 12 + 0.8 * ability + t * 0.2 + np.random.normal(0, 0.3)
# Wage (log)
log_wage = 1.5 + true_beta_education * education + 0.5 * ability + np.random.normal(0, 0.1)
data.append({
'id': i,
'year': 2015 + t,
'log_wage': log_wage,
'education': education,
'ability': ability # Unobservable in actual research!
})
df = pd.DataFrame(data)
print("=" * 70)
print("Data Generating Process (DGP)")
print("=" * 70)
print("True model: log(wage) = 1.5 + 0.08*education + 0.5*ability + noise")
print(f"True effect of education: {true_beta_education}")
print(f"Correlation between ability and education: {df[['education', 'ability']].corr().iloc[0, 1]:.3f}")
# Method 1: Pooled OLS (omitting ability)
X_pooled = sm.add_constant(df[['education']])
model_pooled = sm.OLS(df['log_wage'], X_pooled).fit()
print("\n" + "=" * 70)
print("Method 1: Pooled OLS (omitting ability → biased estimate)")
print("=" * 70)
print(f"Education coefficient: {model_pooled.params['education']:.4f}")
print(f"Standard error: {model_pooled.bse['education']:.4f}")
print(f"Bias: {model_pooled.params['education'] - true_beta_education:.4f} (overestimate!)")
# Method 2: Pooled OLS (including ability, for comparison only)
X_with_ability = sm.add_constant(df[['education', 'ability']])
model_with_ability = sm.OLS(df['log_wage'], X_with_ability).fit()
print("\n" + "=" * 70)
print("Method 2: Pooled OLS (including ability → unbiased, but infeasible in practice)")
print("=" * 70)
print(f"Education coefficient: {model_with_ability.params['education']:.4f}")
print(f"Ability coefficient: {model_with_ability.params['ability']:.4f}")
# Method 3: Fixed effects (eliminates ability)
df_panel = df.set_index(['id', 'year'])
model_fe = PanelOLS(df_panel['log_wage'],
df_panel[['education']],
entity_effects=True).fit(cov_type='clustered',
cluster_entity=True)
print("\n" + "=" * 70)
print("Method 3: Fixed Effects (differencing eliminates ability → unbiased)")
print("=" * 70)
print(f"Education coefficient: {model_fe.params['education']:.4f}")
print(f"Standard error: {model_fe.std_errors['education']:.4f}")
# Comparison summary
print("\n" + "=" * 70)
print("Estimation Comparison")
print("=" * 70)
print(f"True parameter: {true_beta_education:.4f}")
print(f"Pooled OLS (omitted): {model_pooled.params['education']:.4f} (bias: {model_pooled.params['education'] - true_beta_education:+.4f})")
print(f"Pooled OLS (included): {model_with_ability.params['education']:.4f} (bias: {model_with_ability.params['education'] - true_beta_education:+.4f})")
print(f"Fixed effects: {model_fe.params['education']:.4f} (bias: {model_fe.params['education'] - true_beta_education:+.4f})")Core Conclusion:
- Pooled OLS overestimates the education effect (because ability is omitted)
- Fixed effects eliminates ability through differencing, obtaining unbiased estimates
- This is the power of panel data!
Problem 2: Serial Correlation
Definition: Error terms of the same individual at different times are correlated
Consequences:
- Coefficient estimates still unbiased
- But standard errors are biased (usually underestimated)
- Leads to inflated $t$-statistics and increased false positives
Solution:
- Use clustered standard errors
- Cluster at the individual level: allows errors within the same individual to be correlated (demonstrated in the sketch below)
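A minimal simulation sketch of the issue (hypothetical data; both the regressor and the error get a persistent individual component, since clustering only matters when both are correlated within individuals):
import numpy as np
import pandas as pd
import statsmodels.api as sm
rng = np.random.default_rng(7)
rows = []
for i in range(200):                  # 200 individuals
    x_i = rng.normal(0, 1)            # persistent component of the regressor
    u_i = rng.normal(0, 1)            # persistent component of the error
    for t in range(8):                # 8 periods
        x = x_i + rng.normal(0, 1)
        y = 1.0 + 0.5 * x + u_i + rng.normal(0, 1)
        rows.append({'id': i, 'x': x, 'y': y})
df_sc = pd.DataFrame(rows)
X = sm.add_constant(df_sc[['x']])
naive = sm.OLS(df_sc['y'], X).fit()   # default SEs assume i.i.d. errors
clustered = sm.OLS(df_sc['y'], X).fit(
    cov_type='cluster', cov_kwds={'groups': df_sc['id']})
print(f"Default SE:   {naive.bse['x']:.4f}")      # too small
print(f"Clustered SE: {clustered.bse['x']:.4f}")  # larger, honest
The coefficient estimate is identical in both fits; only the standard errors change.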
# Fixed effects + clustered standard errors
# (illustrative: y and X stand for a prepared dependent variable and
# regressors indexed by an (entity, time) MultiIndex)
model_fe = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True  # Cluster at individual level
)
Problem 3: Heteroskedasticity
Definition: Variance of error terms differs across individuals or time
Sources:
- Scale differences across individuals (large firms vs small firms)
- Volatility differences across time (during economic crises)
Solution:
- Use robust standard errors
- Or use both robust + clustered standard errors
# Fixed effects + robust clustered standard errors
# (clustered standard errors are also robust to heteroskedasticity;
# cov_type='robust' gives heteroskedasticity-robust SEs without clustering)
model_fe = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True
)
pandas MultiIndex: Foundation for Panel Data
What is MultiIndex?
MultiIndex: pandas' multi-level index, very suitable for panel data
Structure: the pair (id, year) serves as a composite index
MultiIndex([(1, 2015),
(1, 2016),
(1, 2017),
(2, 2015),
(2, 2016),
(2, 2017)])
Creating MultiIndex
Method 1: set_index
import pandas as pd
# Original data (long format)
df = pd.DataFrame({
'id': [1, 1, 2, 2],
'year': [2015, 2016, 2015, 2016],
'wage': [5000, 5200, 6000, 6300]
})
# Set dual-level index
df_panel = df.set_index(['id', 'year'])
print(df_panel)
# wage
# id year
# 1 2015 5000
# 2016 5200
# 2 2015 6000
# 2016 6300
Method 2: from_arrays
index = pd.MultiIndex.from_arrays(
[df['id'], df['year']],
names=['id', 'year']
)
df_panel = df.set_index(index)
Method 3: from_product (balanced panel)
# Create all (id, year) combinations
index = pd.MultiIndex.from_product(
[[1, 2, 3], [2015, 2016, 2017]],
names=['id', 'year']
)
print(index)
# MultiIndex([(1, 2015), (1, 2016), (1, 2017),
# (2, 2015), (2, 2016), (2, 2017),
# (3, 2015), (3, 2016), (3, 2017)])
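One practical use of from_product, sketched with the small df from Method 1: reindex a panel onto the full grid so that any missing (id, year) cells become explicit NaN rows. This balanced toy panel reports zero gaps; an unbalanced one would reveal exactly what is missing.
# reindex onto the full grid; missing (id, year) cells become NaN rows
full_grid = pd.MultiIndex.from_product(
    [df['id'].unique(), df['year'].unique()],
    names=['id', 'year']
)
df_full = df.set_index(['id', 'year']).reindex(full_grid)
print(df_full['wage'].isna().sum(), "missing cells")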
MultiIndex Operations
1. Selecting Data
# Select all times for id=1
df_panel.loc[1]
# Select observation for id=1, year=2016
df_panel.loc[(1, 2016)]
# Select multiple ids
df_panel.loc[[1, 2]]
# Using xs (cross-section)
df_panel.xs(1, level='id') # All times for id=1
df_panel.xs(2015, level='year')  # All individuals in 2015
2. Grouping by Level
# Group by id (calculate individual means)
df_panel.groupby(level='id').mean()
# Group by year (calculate time means)
df_panel.groupby(level='year').mean()
3. Sorting
# Sort by index
df_panel.sort_index()
# Sort by first level
df_panel.sort_index(level=0)
Practice: Calculating Within/Between Variation
import pandas as pd
import numpy as np
# Simulate data
np.random.seed(42)
data = []
for i in range(1, 4):
for t in range(2015, 2020):
wage = 5000 + i * 1000 + (t - 2015) * 200 + np.random.normal(0, 100)
data.append({'id': i, 'year': t, 'wage': wage})
df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])
print("=" * 70)
print("Data Preview")
print("=" * 70)
print(df_panel.head(10))
# Calculate individual means (between)
group_means = df_panel.groupby(level='id')['wage'].mean()
print("\n" + "=" * 70)
print("Individual Means (Between Variation)")
print("=" * 70)
print(group_means)
# Calculate demeaned data (within)
df_panel['wage_within'] = df_panel.groupby(level='id')['wage'].transform(
lambda x: x - x.mean()
)
print("\n" + "=" * 70)
print("Demeaned Wage (Within Variation)")
print("=" * 70)
print(df_panel[['wage', 'wage_within']].head(10))
# Visualization
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Original data
for i in [1, 2, 3]:
subset = df[df['id'] == i]
axes[0].plot(subset['year'], subset['wage'], 'o-', label=f'ID {i}', linewidth=2)
axes[0].set_xlabel('Year', fontweight='bold')
axes[0].set_ylabel('Wage', fontweight='bold')
axes[0].set_title('Original Data (Total Variation = Between + Within)', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Right plot: Demeaned data (within variation)
for i in [1, 2, 3]:
subset = df_panel.loc[i]
axes[1].plot(subset.index, subset['wage_within'], 'o-', label=f'ID {i}', linewidth=2)
axes[1].axhline(0, color='black', linestyle='--', linewidth=1)
axes[1].set_xlabel('Year', fontweight='bold')
axes[1].set_ylabel('Demeaned Wage', fontweight='bold')
axes[1].set_title('Within Variation (Variation Used by FE)', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
Core Insight:
- Left plot: Obvious level differences between individuals (between variation)
- Right plot: After demeaning, only within-individual time changes remain (within variation)
- Fixed effects only use the variation in the right plot!
Visualizing Panel Data
1. Spaghetti Plot
Purpose: Show time trends of all individuals
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
sns.set_style("whitegrid")
# Simulate data
np.random.seed(42)
data = []
for i in range(1, 21): # 20 individuals
for t in range(2010, 2020):
y = 100 + i * 5 + (t - 2010) * 3 + np.random.normal(0, 10)
data.append({'id': i, 'year': t, 'y': y})
df = pd.DataFrame(data)
# Plot
plt.figure(figsize=(12, 6))
for i in df['id'].unique():
subset = df[df['id'] == i]
plt.plot(subset['year'], subset['y'], alpha=0.4, linewidth=1)
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Outcome Variable', fontweight='bold', fontsize=12)
plt.title('Spaghetti Plot: Time Trends of All Individuals', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Observations:
- Level differences between individuals (between)
- Common time trends (if exist)
- Heterogeneity (different slopes for different individuals)
2. Mean Trend Plot
Purpose: Show average time trends
# Calculate yearly mean and standard error
yearly_stats = df.groupby('year')['y'].agg(['mean', 'std', 'count'])
yearly_stats['se'] = yearly_stats['std'] / np.sqrt(yearly_stats['count'])
# Plot
plt.figure(figsize=(12, 6))
plt.plot(yearly_stats.index, yearly_stats['mean'], 'o-',
color='darkblue', linewidth=2, markersize=8, label='Mean')
plt.fill_between(yearly_stats.index,
yearly_stats['mean'] - 1.96 * yearly_stats['se'],
yearly_stats['mean'] + 1.96 * yearly_stats['se'],
alpha=0.3, label='95% CI')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Outcome Variable', fontweight='bold', fontsize=12)
plt.title('Average Time Trend (± 95% CI)', fontweight='bold', fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Section Summary
Key Points
Panel Data Structure
- Long format (recommended) vs wide format
- Balanced panel vs unbalanced panel
- pandas MultiIndex is a powerful tool for panel data
Variance Decomposition
- Total variation = between variation + within variation
- Cross-sectional OLS uses total variation (including between)
- Fixed effects only use within variation
Problems with Pooled OLS
- Omitted variable bias (OVB): overestimation or underestimation
- Serial correlation: biased standard errors
- Need fixed effects + clustered standard errors
Practical Skills
- Long-wide format conversion (melt / pivot)
- MultiIndex operations (set_index, loc, xs)
- Calculate within/between variation
- Visualize panel data
Next Steps
In Section 3: Fixed Effects Models, we will dive deep into:
- Fixed effects estimation methods (within transformation, LSDV)
- One-way FE vs two-way FE
- Complete implementation with linearmodels.PanelOLS
- Real case study: Wage determinants
Mastering panel data structure is the first step toward fixed effects!