8.2 Panel Data Basics
Understanding panel data structure, variance decomposition, and the pitfalls of pooled OLS
Section Objectives
- Master long format and wide format of panel data
- Understand within variation vs between variation
- Identify problems with pooled OLS (omitted variable bias)
- Use pandas MultiIndex for panel data processing
- Visualize time trends in panel data
Structure of Panel Data
What is Panel Data?
Definition: Panel Data = Multiple individuals (Cross-Section) × Multiple time points (Time Series)
Notation: $y_{it}$ denotes the outcome of individual $i$ at time $t$, with $i = 1, \dots, N$ and $t = 1, \dots, T$.
Three Dimensions:
- $N$: Number of individuals (entities, units, cross-sectional dimension)
- Examples: firms, people, countries, cities
- $T$: Number of time periods (temporal dimension)
- Examples: years, quarters, months
- Total observations: $N \times T$ (ideally, i.e., for a balanced panel)
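These dimensions are easy to read off a long-format DataFrame. A minimal sketch (the column names id and year are assumptions, matching the examples below):
import pandas as pd
# Hypothetical long-format panel with 'id' and 'year' columns
df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2],
    'year': [2015, 2016, 2017, 2015, 2016, 2017],
    'wage': [5000, 5200, 5500, 6000, 6300, 6600],
})
N = df['id'].nunique()    # number of individuals
T = df['year'].nunique()  # number of time periods
print(N, T, len(df))      # 2 3 6 -> len(df) == N * T only if the panel is balanced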
Types of Panel Data
1. Balanced Panel
Definition: Each individual has observations at all time points
Example:
ID Year Wage Education
1 2015 5000 12
1 2016 5200 12
1 2017 5500 13
2 2015 6000 16
2 2016 6300 16
2 2017 6600 16
Characteristics:
- Each ID has $T$ observations
- Total observations = $N \times T$
- Simple analysis, no need to handle missing values
2. Unbalanced Panel
Definition: Some individuals missing observations at some time points
Example:
ID Year Wage Education
1 2015 5000 12
1 2016 5200 12
1 2017 5500 13
2 2015 6000 16
2 2017 6600 16 # Missing 2016
3 2016 4800 10
3 2017 5000 10 # Missing 2015
Causes:
- Sample attrition: Individuals drop out of surveys
- Firms going public/delisting, bankruptcy
- Data collection issues
Handling: Modern software (like linearmodels) can automatically handle unbalanced panels
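A quick balance check, sketched with the df built above: a panel is balanced exactly when every individual appears in every period.
counts = df.groupby('id')['year'].count()          # observations per individual
is_balanced = (counts == df['year'].nunique()).all()
print(counts)
print("Balanced:", is_balanced)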
Data Format: Long Format vs Wide Format
Long Format ⭐ Recommended
Structure: Each row is an observation (individual-time pair)
id year wage education experience
0 1 2015 5000 12 3
1 1 2016 5200 12 4
2 1 2017 5500 13 5
3 2 2015 6000 16 5
4 2 2016 6300 16 6
5 2 2017 6600 16 7
Pros:
- Directly corresponds to regression models
- Convenient for handling unbalanced panels
- Standard format for linearmodels, statsmodels
- Easy to add new variables
Python Implementation:
import pandas as pd
# Long format data
long_data = pd.DataFrame({
'id': [1, 1, 1, 2, 2, 2],
'year': [2015, 2016, 2017, 2015, 2016, 2017],
'wage': [5000, 5200, 5500, 6000, 6300, 6600],
'education': [12, 12, 13, 16, 16, 16]
})
Wide Format
Structure: Each row is an individual, different time points in different columns
id wage_2015 wage_2016 wage_2017 education_2015 education_2016 education_2017
0 1 5000 5200 5500 12 12 13
1 2 6000 6300 6600 16 16 16
Pros:
- Intuitive, easy to view time series for a single individual
- More convenient for certain visualizations
Cons:
- Cannot be directly used for regression
- Difficult to handle unbalanced panels
- Adding new variables requires multiple columns
Long-Wide Format Conversion
Wide Format → Long Format (melt)
import pandas as pd
# Wide format data
wide_data = pd.DataFrame({
'id': [1, 2],
'wage_2015': [5000, 6000],
'wage_2016': [5200, 6300],
'wage_2017': [5500, 6600],
'education': [12, 16] # Time-invariant
})
# Convert to long format
long_data = wide_data.melt(
id_vars=['id', 'education'], # Invariant columns
value_vars=['wage_2015', 'wage_2016', 'wage_2017'],
var_name='year_var',
value_name='wage'
)
# Clean year column
long_data['year'] = long_data['year_var'].str.extract(r'(\d+)', expand=False).astype(int)
long_data = long_data.drop('year_var', axis=1).sort_values(['id', 'year'])
print(long_data)
# id education wage year
# 0 1 12 5000 2015
# 1 1 12 5200 2016
# 2 1 12 5500 2017
# 3 2 16 6000 2015
# 4 2 16 6300 2016
# 5 2 16 6600 2017
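For this wage_YYYY naming pattern, pandas also ships a purpose-built helper, pd.wide_to_long, which can replace the melt-plus-cleanup steps above. A minimal sketch reusing the wide_data frame:
# pd.wide_to_long parses 'wage_2015' into stub 'wage' and year 2015
long_alt = pd.wide_to_long(
    wide_data, stubnames='wage', i='id', j='year', sep='_'
).reset_index().sort_values(['id', 'year'])
print(long_alt)  # same content as the melt result above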
Long Format → Wide Format (pivot)
# Convert from long format back to wide format
wide_data_back = long_data.pivot(
index='id',
columns='year',
values='wage'
).reset_index()
# Rename columns
wide_data_back.columns = ['id'] + [f'wage_{c}' for c in wide_data_back.columns[1:]]
print(wide_data_back)
# id wage_2015 wage_2016 wage_2017
# 0 1 5000 5200 5500
# 1 2 6000 6300 6600
Variance Decomposition: Within vs Between
Panel data's core advantage is that it contains two types of variation:
1. Total Variation
Definition: Variation of all observations around the grand mean
$$SS_{total} = \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \bar{y})^2$$
where $\bar{y} = \frac{1}{NT} \sum_{i} \sum_{t} y_{it}$ is the grand mean
2. Between Variation
Definition: Differences between individual means
$$SS_{between} = \sum_{i=1}^{N} T\,(\bar{y}_i - \bar{y})^2$$
where $\bar{y}_i = \frac{1}{T} \sum_{t} y_{it}$ is individual $i$'s time mean
Intuition:
- Compares different individuals' "average levels"
- Example: Average wage of ID=1 vs average wage of ID=2
- This is the variation used by cross-sectional regression
3. Within Variation ⭐
Definition: Variation within the same individual over time
$$SS_{within} = \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \bar{y}_i)^2$$
Intuition:
- Individual's fluctuation around their own average
- Example: ID=1's wage in 2016 vs ID=1's average wage
- This is the variation used by fixed effects regression
Key Property:
$$SS_{total} = SS_{between} + SS_{within}$$
Python Implementation: Variance Decomposition
import numpy as np
import pandas as pd
# Simulate data
np.random.seed(42)
data = []
for i in range(1, 6): # 5 individuals
for t in range(2015, 2020): # 5 years
# Individual effect + time trend + noise
wage = 5000 + i * 500 + (t - 2015) * 100 + np.random.normal(0, 200)
data.append({'id': i, 'year': t, 'wage': wage})
df = pd.DataFrame(data)
print("=" * 70)
print("Data Preview")
print("=" * 70)
print(df.head(10))
# Calculate means
overall_mean = df['wage'].mean() # Grand mean
group_means = df.groupby('id')['wage'].transform('mean') # Individual means
# 1. Total variation
total_var = np.var(df['wage'], ddof=1)
# 2. Between variation: variance of each observation's individual mean
# (using the per-observation transform keeps the NT-1 divisor consistent,
# so total = between + within holds exactly)
between_var = np.var(group_means, ddof=1)
# 3. Within variation (residual method)
df['wage_demeaned'] = df['wage'] - group_means
within_var = np.var(df['wage_demeaned'], ddof=1)
print("\n" + "=" * 70)
print("Variance Decomposition")
print("=" * 70)
print(f"Total variation: {total_var:,.2f}")
print(f"Between variation: {between_var:,.2f} ({between_var/total_var*100:.1f}%)")
print(f"Within variation: {within_var:,.2f} ({within_var/total_var*100:.1f}%)")
print(f"Between + Within: {between_var + within_var:,.2f}")
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Between variation (differences in individual means)
group_means_unique = df.groupby('id')['wage'].mean()
axes[0].bar(group_means_unique.index, group_means_unique.values, color='steelblue', alpha=0.7)
axes[0].axhline(overall_mean, color='red', linestyle='--', linewidth=2, label='Grand Mean')
axes[0].set_xlabel('Individual ID', fontweight='bold')
axes[0].set_ylabel('Average Wage', fontweight='bold')
axes[0].set_title('Between Variation\nAverage wages of different individuals', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Right plot: Within variation (within-individual time changes)
for i in [1, 2, 3]:
subset = df[df['id'] == i]
axes[1].plot(subset['year'], subset['wage'], 'o-', label=f'ID {i}', linewidth=2)
# Plot individual mean line
axes[1].axhline(subset['wage'].mean(), color='gray', linestyle=':', alpha=0.5)
axes[1].set_xlabel('Year', fontweight='bold')
axes[1].set_ylabel('Wage', fontweight='bold')
axes[1].set_title('Within Variation\nWithin-individual time changes', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
Output Interpretation:
- Large between variation: Large differences between individuals (e.g., different industries, education levels)
- Large within variation: Large changes over time within individuals (e.g., wage growth, experience accumulation)
- Fixed effects only use within variation, eliminating between differences (see the sketch below)
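To make the last point concrete, here is a minimal self-contained sketch (hypothetical simulated data, separate from the wage example above): OLS on individual-demeaned data, which is what fixed effects amounts to, uses only within variation and recovers the true slope, while pooled OLS on the raw data does not.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Hypothetical DGP: y depends on x with slope 0.5, plus an individual
# effect a_i that is correlated with x
rng = np.random.default_rng(0)
rows = []
for i in range(50):
    a_i = rng.normal(0, 2)
    for t in range(5):
        x = a_i + rng.normal(0, 1)                  # x correlated with the effect
        y = 1 + 0.5 * x + a_i + rng.normal(0, 0.5)
        rows.append({'id': i, 'x': x, 'y': y})
df_demo = pd.DataFrame(rows)
# Within transformation: demean x and y by individual
g = df_demo.groupby('id')
x_w = df_demo['x'] - g['x'].transform('mean')
y_w = df_demo['y'] - g['y'].transform('mean')
print(sm.OLS(y_w, x_w).fit().params)  # within variation only -> close to 0.5
print(sm.OLS(df_demo['y'], sm.add_constant(df_demo[['x']])).fit().params['x'])
# pooled OLS mixes in between variation -> biased well above 0.5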
Problems with Pooled OLS
Pooled OLS
Model:
$$y_{it} = \beta_0 + \beta_1 x_{it} + u_{it}$$
Assumption: The same intercept and slopes apply to all individuals and time periods, and $u_{it}$ is i.i.d.
Python Implementation:
import statsmodels.api as sm
# Ignore panel structure, direct OLS
X = sm.add_constant(df[['education']])
model_pooled = sm.OLS(df['wage'], X).fit()
print(model_pooled.summary())
Pros:
- Simple, easy to implement
- Efficient (if model is correct)
Fatal Flaws:
Problem 1: Omitted Variable Bias (OVB) ⭐⭐⭐
Scenario: Studying the effect of education on wages
True Model:
$$\log(wage_{it}) = \beta_0 + \beta_1\, education_{it} + \alpha_i + \epsilon_{it}$$
where $\alpha_i$ captures unobservable individual characteristics (ability, family background, personality, etc.)
Model Estimated by Pooled OLS:
$$\log(wage_{it}) = \beta_0 + \beta_1\, education_{it} + u_{it}$$
where $u_{it} = \alpha_i + \epsilon_{it}$
Problem: If $\mathrm{Cov}(education_{it}, \alpha_i) \neq 0$, then $\hat{\beta}_1$ is biased!
Intuition:
- High-ability people get more education AND earn more
- Pooled OLS mistakenly attributes the effect of ability to the effect of education
- Result: Overestimates the returns to education
Direction of Omitted Variable Bias
OVB Formula (when omitting a third variable $z$):
$$\mathrm{plim}\ \hat{\beta}_{OLS} = \beta + \gamma \cdot \frac{\mathrm{Cov}(x, z)}{\mathrm{Var}(x)}$$
where:
- $\beta$: True effect of $x$ on $y$
- $\gamma$: True effect of $z$ on $y$
- $\mathrm{Cov}(x, z)$: Covariance between $x$ and $z$
Applied to Our Example:
- $x$: Education
- $z$: Ability (unobservable)
- $y$: Wage
Analysis:
- $\gamma > 0$: Ability increases wages
- $\mathrm{Cov}(x, z) > 0$: Smart people get more education
- Therefore: $\gamma \cdot \mathrm{Cov}(x, z) / \mathrm{Var}(x) > 0$ → Positive bias (overestimation)
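The formula can be checked numerically before the full panel demonstration below. This is a self-contained sketch whose parameter values mirror the simulation that follows:
import numpy as np
rng = np.random.default_rng(42)
n = 100_000
beta, gamma = 0.08, 0.5                       # true effects of x (education) and z (ability)
z = rng.normal(0, 0.5, n)                     # "ability"
x = 12 + 0.8 * z + rng.normal(0, 0.3, n)      # education, correlated with ability
y = 1.5 + beta * x + gamma * z + rng.normal(0, 0.1, n)
# short regression of y on x alone (slope via cov/var)
beta_short = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# OVB formula prediction
beta_pred = beta + gamma * np.cov(x, z)[0, 1] / np.var(x, ddof=1)
print(f"Short-regression slope: {beta_short:.4f}")
print(f"Formula prediction:     {beta_pred:.4f}")  # the two should match closely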
Python Demonstration: Generating OVB
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
np.random.seed(123)
# Parameter settings
N = 500
T = 5
true_beta_education = 0.08 # True effect of education
# Simulate data
data = []
for i in range(N):
# Individual fixed effect (ability, family background, etc.)
ability = np.random.normal(0, 0.5)
for t in range(T):
# Education level (positively correlated with ability!)
education = 12 + 0.8 * ability + t * 0.2 + np.random.normal(0, 0.3)
# Wage (log)
log_wage = 1.5 + true_beta_education * education + 0.5 * ability + np.random.normal(0, 0.1)
data.append({
'id': i,
'year': 2015 + t,
'log_wage': log_wage,
'education': education,
'ability': ability # Unobservable in actual research!
})
df = pd.DataFrame(data)
print("=" * 70)
print("Data Generating Process (DGP)")
print("=" * 70)
print("True model: log(wage) = 1.5 + 0.08*education + 0.5*ability + noise")
print(f"True effect of education: {true_beta_education}")
print(f"Correlation between ability and education: {df[['education', 'ability']].corr().iloc[0, 1]:.3f}")
# Method 1: Pooled OLS (omitting ability)
X_pooled = sm.add_constant(df[['education']])
model_pooled = sm.OLS(df['log_wage'], X_pooled).fit()
print("\n" + "=" * 70)
print("Method 1: Pooled OLS (omitting ability → biased estimate)")
print("=" * 70)
print(f"Education coefficient: {model_pooled.params['education']:.4f}")
print(f"Standard error: {model_pooled.bse['education']:.4f}")
print(f"Bias: {model_pooled.params['education'] - true_beta_education:.4f} (overestimate!)")
# Method 2: Pooled OLS (including ability, for comparison only)
X_with_ability = sm.add_constant(df[['education', 'ability']])
model_with_ability = sm.OLS(df['log_wage'], X_with_ability).fit()
print("\n" + "=" * 70)
print("Method 2: Pooled OLS (including ability → unbiased, but infeasible in practice)")
print("=" * 70)
print(f"Education coefficient: {model_with_ability.params['education']:.4f}")
print(f"Ability coefficient: {model_with_ability.params['ability']:.4f}")
# Method 3: Fixed effects (eliminates ability)
df_panel = df.set_index(['id', 'year'])
model_fe = PanelOLS(df_panel['log_wage'],
df_panel[['education']],
entity_effects=True).fit(cov_type='clustered',
cluster_entity=True)
print("\n" + "=" * 70)
print("Method 3: Fixed Effects (differencing eliminates ability → unbiased)")
print("=" * 70)
print(f"Education coefficient: {model_fe.params['education']:.4f}")
print(f"Standard error: {model_fe.std_errors['education']:.4f}")
# Comparison summary
print("\n" + "=" * 70)
print("Estimation Comparison")
print("=" * 70)
print(f"True parameter: {true_beta_education:.4f}")
print(f"Pooled OLS (omitted): {model_pooled.params['education']:.4f} (bias: {model_pooled.params['education'] - true_beta_education:+.4f})")
print(f"Pooled OLS (included): {model_with_ability.params['education']:.4f} (bias: {model_with_ability.params['education'] - true_beta_education:+.4f})")
print(f"Fixed effects: {model_fe.params['education']:.4f} (bias: {model_fe.params['education'] - true_beta_education:+.4f})")Core Conclusion:
- Pooled OLS overestimates the education effect (because ability is omitted)
- Fixed effects eliminates ability through differencing, obtaining unbiased estimates
- This is the power of panel data!
Problem 2: Serial Correlation
Definition: Error terms of the same individual at different times are correlated
Consequences:
- Coefficient estimates still unbiased
- But standard errors are biased (usually underestimated)
- Leads to inflated $t$-statistics and increased false positives
Solution:
- Use clustered standard errors
- Cluster at the individual level: allows errors within the same individual to be correlated (demonstrated in the sketch below)
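A minimal simulation sketch of the issue (hypothetical data; both the regressor and the error get a persistent individual component, since clustering only matters when both are correlated within individuals):
import numpy as np
import pandas as pd
import statsmodels.api as sm
rng = np.random.default_rng(7)
rows = []
for i in range(200):                  # 200 individuals
    x_i = rng.normal(0, 1)            # persistent component of the regressor
    u_i = rng.normal(0, 1)            # persistent component of the error
    for t in range(8):                # 8 periods
        x = x_i + rng.normal(0, 1)
        y = 1.0 + 0.5 * x + u_i + rng.normal(0, 1)
        rows.append({'id': i, 'x': x, 'y': y})
df_sc = pd.DataFrame(rows)
X = sm.add_constant(df_sc[['x']])
naive = sm.OLS(df_sc['y'], X).fit()   # default SEs assume i.i.d. errors
clustered = sm.OLS(df_sc['y'], X).fit(
    cov_type='cluster', cov_kwds={'groups': df_sc['id']})
print(f"Default SE:   {naive.bse['x']:.4f}")      # too small
print(f"Clustered SE: {clustered.bse['x']:.4f}")  # larger, honest
The coefficient estimate is identical in both fits; only the standard errors change.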
# Fixed effects + clustered standard errors
# (illustrative: y and X stand for a prepared dependent variable and
# regressors indexed by an (entity, time) MultiIndex)
model_fe = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True  # Cluster at individual level
)
Problem 3: Heteroskedasticity
Definition: Variance of error terms differs across individuals or time
Sources:
- Scale differences across individuals (large firms vs small firms)
- Volatility differences across time (during economic crises)
Solution:
- Use robust standard errors
- Or use both robust + clustered standard errors
# Fixed effects + robust clustered standard errors
# (clustered standard errors are also robust to heteroskedasticity;
# cov_type='robust' gives heteroskedasticity-robust SEs without clustering)
model_fe = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True
)
pandas MultiIndex: Foundation for Panel Data
What is MultiIndex?
MultiIndex: pandas' multi-level index, very suitable for panel data
Structure: the pair (id, year) serves as a composite index
MultiIndex([(1, 2015),
(1, 2016),
(1, 2017),
(2, 2015),
(2, 2016),
(2, 2017)])
Creating MultiIndex
Method 1: set_index
import pandas as pd
# Original data (long format)
df = pd.DataFrame({
'id': [1, 1, 2, 2],
'year': [2015, 2016, 2015, 2016],
'wage': [5000, 5200, 6000, 6300]
})
# Set dual-level index
df_panel = df.set_index(['id', 'year'])
print(df_panel)
# wage
# id year
# 1 2015 5000
# 2016 5200
# 2 2015 6000
# 2016 6300
Method 2: from_arrays
index = pd.MultiIndex.from_arrays(
[df['id'], df['year']],
names=['id', 'year']
)
df_panel = df.set_index(index)
Method 3: from_product (balanced panel)
# Create all (id, year) combinations
index = pd.MultiIndex.from_product(
[[1, 2, 3], [2015, 2016, 2017]],
names=['id', 'year']
)
print(index)
# MultiIndex([(1, 2015), (1, 2016), (1, 2017),
# (2, 2015), (2, 2016), (2, 2017),
# (3, 2015), (3, 2016), (3, 2017)])
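One practical use of from_product, sketched with the small df from Method 1: reindex a panel onto the full grid so that any missing (id, year) cells become explicit NaN rows. This balanced toy panel reports zero gaps; an unbalanced one would reveal exactly what is missing.
# reindex onto the full grid; missing (id, year) cells become NaN rows
full_grid = pd.MultiIndex.from_product(
    [df['id'].unique(), df['year'].unique()],
    names=['id', 'year']
)
df_full = df.set_index(['id', 'year']).reindex(full_grid)
print(df_full['wage'].isna().sum(), "missing cells")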
MultiIndex Operations
1. Selecting Data
# Select all times for id=1
df_panel.loc[1]
# Select observation for id=1, year=2016
df_panel.loc[(1, 2016)]
# Select multiple ids
df_panel.loc[[1, 2]]
# Using xs (cross-section)
df_panel.xs(1, level='id') # All times for id=1
df_panel.xs(2015, level='year')  # All individuals in 2015
2. Grouping by Level
# Group by id (calculate individual means)
df_panel.groupby(level='id').mean()
# Group by year (calculate time means)
df_panel.groupby(level='year').mean()
3. Sorting
# Sort by index
df_panel.sort_index()
# Sort by first level
df_panel.sort_index(level=0)
Practice: Calculating Within/Between Variation
import pandas as pd
import numpy as np
# Simulate data
np.random.seed(42)
data = []
for i in range(1, 4):
for t in range(2015, 2020):
wage = 5000 + i * 1000 + (t - 2015) * 200 + np.random.normal(0, 100)
data.append({'id': i, 'year': t, 'wage': wage})
df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])
print("=" * 70)
print("Data Preview")
print("=" * 70)
print(df_panel.head(10))
# Calculate individual means (between)
group_means = df_panel.groupby(level='id')['wage'].mean()
print("\n" + "=" * 70)
print("Individual Means (Between Variation)")
print("=" * 70)
print(group_means)
# Calculate demeaned data (within)
df_panel['wage_within'] = df_panel.groupby(level='id')['wage'].transform(
lambda x: x - x.mean()
)
print("\n" + "=" * 70)
print("Demeaned Wage (Within Variation)")
print("=" * 70)
print(df_panel[['wage', 'wage_within']].head(10))
# Visualization
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Original data
for i in [1, 2, 3]:
subset = df[df['id'] == i]
axes[0].plot(subset['year'], subset['wage'], 'o-', label=f'ID {i}', linewidth=2)
axes[0].set_xlabel('Year', fontweight='bold')
axes[0].set_ylabel('Wage', fontweight='bold')
axes[0].set_title('Original Data (Total Variation = Between + Within)', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Right plot: Demeaned data (within variation)
for i in [1, 2, 3]:
subset = df_panel.loc[i]
axes[1].plot(subset.index, subset['wage_within'], 'o-', label=f'ID {i}', linewidth=2)
axes[1].axhline(0, color='black', linestyle='--', linewidth=1)
axes[1].set_xlabel('Year', fontweight='bold')
axes[1].set_ylabel('Demeaned Wage', fontweight='bold')
axes[1].set_title('Within Variation (Variation Used by FE)', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
Core Insight:
- Left plot: Obvious level differences between individuals (between variation)
- Right plot: After demeaning, only within-individual time changes remain (within variation)
- Fixed effects only use the variation in the right plot!
Visualizing Panel Data
1. Spaghetti Plot
Purpose: Show time trends of all individuals
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
sns.set_style("whitegrid")
# Simulate data
np.random.seed(42)
data = []
for i in range(1, 21): # 20 individuals
for t in range(2010, 2020):
y = 100 + i * 5 + (t - 2010) * 3 + np.random.normal(0, 10)
data.append({'id': i, 'year': t, 'y': y})
df = pd.DataFrame(data)
# Plot
plt.figure(figsize=(12, 6))
for i in df['id'].unique():
subset = df[df['id'] == i]
plt.plot(subset['year'], subset['y'], alpha=0.4, linewidth=1)
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Outcome Variable', fontweight='bold', fontsize=12)
plt.title('Spaghetti Plot: Time Trends of All Individuals', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Observations:
- Level differences between individuals (between)
- Common time trends (if exist)
- Heterogeneity (different slopes for different individuals)
2. Mean Trend Plot
Purpose: Show average time trends
# Calculate yearly mean and standard error
yearly_stats = df.groupby('year')['y'].agg(['mean', 'std', 'count'])
yearly_stats['se'] = yearly_stats['std'] / np.sqrt(yearly_stats['count'])
# Plot
plt.figure(figsize=(12, 6))
plt.plot(yearly_stats.index, yearly_stats['mean'], 'o-',
color='darkblue', linewidth=2, markersize=8, label='Mean')
plt.fill_between(yearly_stats.index,
yearly_stats['mean'] - 1.96 * yearly_stats['se'],
yearly_stats['mean'] + 1.96 * yearly_stats['se'],
alpha=0.3, label='95% CI')
plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Outcome Variable', fontweight='bold', fontsize=12)
plt.title('Average Time Trend (± 95% CI)', fontweight='bold', fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Section Summary
Key Points
Panel Data Structure
- Long format (recommended) vs wide format
- Balanced panel vs unbalanced panel
- pandas MultiIndex is a powerful tool for panel data
Variance Decomposition
- Total variation = between variation + within variation
- Cross-sectional OLS uses total variation (including between)
- Fixed effects only use within variation
Problems with Pooled OLS
- Omitted variable bias (OVB): overestimation or underestimation
- Serial correlation: biased standard errors
- Need fixed effects + clustered standard errors
Practical Skills
- Long-wide format conversion (melt / pivot)
- MultiIndex operations (set_index, loc, xs)
- Calculate within/between variation
- Visualize panel data
Next Steps
In Section 3: Fixed Effects Models, we will dive deep into:
- Fixed effects estimation methods (within transformation, LSDV)
- One-way FE vs two-way FE
- Complete implementation with linearmodels.PanelOLS
- Real case study: Wage determinants
Mastering panel data structure is the first step toward fixed effects!