8.2 Panel Data Basics

Understanding panel data structure, variance decomposition, and the pitfalls of pooled OLS



Section Objectives

  • Master long format and wide format of panel data
  • Understand within variation vs between variation
  • Identify problems with pooled OLS (omitted variable bias)
  • Use pandas MultiIndex for panel data processing
  • Visualize time trends in panel data

Structure of Panel Data

What is Panel Data?

Definition: Panel Data = Multiple individuals (Cross-Section) × Multiple time points (Time Series)

Notation: $y_{it}$ denotes the outcome for individual $i$ at time $t$, with $i = 1, \dots, N$ and $t = 1, \dots, T$

Three Dimensions:

  1. $N$: Number of individuals (entities, units, cross-sectional dimension)

    • Examples: firms, people, countries, cities
  2. $T$: Number of time periods (temporal dimension)

    • Examples: years, quarters, months
  3. Total observations: $N \times T$ (for a balanced panel)


Types of Panel Data

1. Balanced Panel

Definition: Each individual has observations at all time points

Example:

ID   Year   Wage   Education
1    2015   5000      12
1    2016   5200      12
1    2017   5500      13
2    2015   6000      16
2    2016   6300      16
2    2017   6600      16

Characteristics:

  • Each ID has exactly $T$ observations
  • Total observations = $N \times T$
  • Simple analysis, no need to handle missing values

2. Unbalanced Panel

Definition: Some individuals missing observations at some time points

Example:

ID   Year   Wage   Education
1    2015   5000      12
1    2016   5200      12
1    2017   5500      13
2    2015   6000      16
2    2017   6600      16    # Missing 2016
3    2016   4800      10
3    2017   5000      10    # Missing 2015

Causes:

  • Sample attrition: Individuals drop out of surveys
  • Firms going public/delisting, bankruptcy
  • Data collection issues

Handling: Modern software (like linearmodels) can automatically handle unbalanced panels
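
In practice you can check balance directly with pandas; a minimal sketch (assuming a long-format DataFrame with id and year columns, as in the examples above):

python
import pandas as pd

# The unbalanced example from above
df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 3, 3],
    'year': [2015, 2016, 2017, 2015, 2017, 2016, 2017],
    'wage': [5000, 5200, 5500, 6000, 6600, 4800, 5000]
})

N = df['id'].nunique()    # number of individuals
T = df['year'].nunique()  # number of time periods
print(f"N = {N}, T = {T}, N*T = {N*T}, observations = {len(df)}")

# Balanced if and only if every (id, year) pair is present exactly once
is_balanced = len(df) == N * T and not df.duplicated(['id', 'year']).any()
print(f"Balanced: {is_balanced}")  # False

# A cross-tab makes the missing cells visible (0 = missing)
print(pd.crosstab(df['id'], df['year']))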


Data Format: Long Format vs Wide Format

Long Format

Structure: Each row is an observation (individual-time pair)

python
   id  year   wage  education  experience
0   1  2015   5000         12           3
1   1  2016   5200         12           4
2   1  2017   5500         13           5
3   2  2015   6000         16           5
4   2  2016   6300         16           6
5   2  2017   6600         16           7

Pros:

  • Directly corresponds to regression models
  • Convenient for handling unbalanced panels
  • Standard format for linearmodels, statsmodels
  • Easy to add new variables

Python Implementation:

python
import pandas as pd

# Long format data
long_data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'year': [2015, 2016, 2017, 2015, 2016, 2017],
    'wage': [5000, 5200, 5500, 6000, 6300, 6600],
    'education': [12, 12, 13, 16, 16, 16]
})

Wide Format

Structure: Each row is an individual, different time points in different columns

python
   id  wage_2015  wage_2016  wage_2017  education_2015  education_2016  education_2017
0   1       5000       5200       5500              12              12              13
1   2       6000       6300       6600              16              16              16

Pros:

  • Intuitive, easy to view time series for a single individual
  • More convenient for certain visualizations

Cons:

  • Cannot be directly used for regression
  • Difficult to handle unbalanced panels
  • Adding new variables requires multiple columns

Long-Wide Format Conversion

Wide Format → Long Format (melt)

python
import pandas as pd

# Wide format data
wide_data = pd.DataFrame({
    'id': [1, 2],
    'wage_2015': [5000, 6000],
    'wage_2016': [5200, 6300],
    'wage_2017': [5500, 6600],
    'education': [12, 16]  # Time-invariant
})

# Convert to long format
long_data = wide_data.melt(
    id_vars=['id', 'education'],           # Invariant columns
    value_vars=['wage_2015', 'wage_2016', 'wage_2017'],
    var_name='year_var',
    value_name='wage'
)

# Clean year column
long_data['year'] = long_data['year_var'].str.extract(r'(\d+)', expand=False).astype(int)
long_data = long_data.drop('year_var', axis=1).sort_values(['id', 'year'])

print(long_data)
#    id  education   wage  year
# 0   1         12   5000  2015
# 1   1         12   5200  2016
# 2   1         12   5500  2017
# 3   2         16   6000  2015
# 4   2         16   6300  2016
# 5   2         16   6600  2017
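
pandas also provides pd.wide_to_long, which is purpose-built for stub-named columns like wage_2015; a minimal sketch using the same wide_data:

python
# wide_to_long splits 'wage_2015' into stub 'wage' + numeric suffix 2015
long_alt = pd.wide_to_long(
    wide_data,
    stubnames='wage',  # common column prefix to gather
    i='id',            # identifier variable
    j='year',          # name for the new suffix column
    sep='_'
).reset_index().sort_values(['id', 'year'])
# One row per (id, year); the time-invariant education column is carried along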

Long Format → Wide Format (pivot)

python
# Convert from long format back to wide format
wide_data_back = long_data.pivot(
    index='id',
    columns='year',
    values='wage'
).reset_index()

# Rename columns
wide_data_back.columns = ['id'] + [f'wage_{c}' for c in wide_data_back.columns[1:]]

print(wide_data_back)
#    id  wage_2015  wage_2016  wage_2017
# 0   1       5000       5200       5500
# 1   2       6000       6300       6600

Variance Decomposition: Within vs Between

Panel data's core advantage is containing two types of variation:

1. Total Variation

Definition: Variation of all observations around the grand mean

$$SS_{total} = \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \bar{y})^2$$

where $\bar{y} = \frac{1}{NT} \sum_{i} \sum_{t} y_{it}$ is the grand mean


2. Between Variation

Definition: Differences between individual means

$$SS_{between} = T \sum_{i=1}^{N} (\bar{y}_i - \bar{y})^2$$

where $\bar{y}_i = \frac{1}{T} \sum_{t} y_{it}$ is individual $i$'s time mean

Intuition:

  • Compares different individuals' "average levels"
  • Example: Average wage of ID=1 vs average wage of ID=2
  • This is the variation used by cross-sectional regression

3. Within Variation ⭐

Definition: Variation within the same individual over time

$$SS_{within} = \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \bar{y}_i)^2$$

Intuition:

  • Individual's fluctuation around their own average
  • Example: ID=1's wage in 2016 vs ID=1's average wage
  • This is the variation used by fixed effects regression

Key Property:

$$SS_{total} = SS_{between} + SS_{within}$$

Python Implementation: Variance Decomposition

python
import numpy as np
import pandas as pd

# Simulate data
np.random.seed(42)
data = []
for i in range(1, 6):  # 5 individuals
    for t in range(2015, 2020):  # 5 years
        # Individual effect + time trend + noise
        wage = 5000 + i * 500 + (t - 2015) * 100 + np.random.normal(0, 200)
        data.append({'id': i, 'year': t, 'wage': wage})

df = pd.DataFrame(data)

print("=" * 70)
print("Data Preview")
print("=" * 70)
print(df.head(10))

# Calculate means
overall_mean = df['wage'].mean()  # Grand mean
group_means = df.groupby('id')['wage'].transform('mean')  # Individual means

# Use population variances (ddof=0) so that the identity
# Var_total = Var_between + Var_within holds exactly

# 1. Total variation
total_var = np.var(df['wage'], ddof=0)

# 2. Between variation (variance of individual means)
between_var = np.var(df.groupby('id')['wage'].mean(), ddof=0)

# 3. Within variation (variance of the demeaned data)
df['wage_demeaned'] = df['wage'] - group_means
within_var = np.var(df['wage_demeaned'], ddof=0)

print("\n" + "=" * 70)
print("Variance Decomposition")
print("=" * 70)
print(f"Total variation:     {total_var:,.2f}")
print(f"Between variation: {between_var:,.2f}  ({between_var/total_var*100:.1f}%)")
print(f"Within variation:  {within_var:,.2f}  ({within_var/total_var*100:.1f}%)")
print(f"Between + Within:   {between_var + within_var:,.2f}")

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Between variation (differences in individual means)
group_means_unique = df.groupby('id')['wage'].mean()
axes[0].bar(group_means_unique.index, group_means_unique.values, color='steelblue', alpha=0.7)
axes[0].axhline(overall_mean, color='red', linestyle='--', linewidth=2, label='Grand Mean')
axes[0].set_xlabel('Individual ID', fontweight='bold')
axes[0].set_ylabel('Average Wage', fontweight='bold')
axes[0].set_title('Between Variation\nAverage wages of different individuals', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Right plot: Within variation (within-individual time changes)
for i in [1, 2, 3]:
    subset = df[df['id'] == i]
    axes[1].plot(subset['year'], subset['wage'], 'o-', label=f'ID {i}', linewidth=2)
    # Plot individual mean line
    axes[1].axhline(subset['wage'].mean(), color='gray', linestyle=':', alpha=0.5)

axes[1].set_xlabel('Year', fontweight='bold')
axes[1].set_ylabel('Wage', fontweight='bold')
axes[1].set_title('Within Variation\nWithin-individual time changes', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

Output Interpretation:

  • Large between variation: Large differences between individuals (e.g., different industries, education levels)
  • Large within variation: Large changes over time within individuals (e.g., wage growth, experience accumulation)
  • Fixed effects only use within variation, eliminating between differences
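
The decomposition generalizes to any variable. A small reusable helper (the name variance_shares is ours; same population-variance convention as above):

python
def variance_shares(df, entity_col, var_col):
    """Total variance of var_col and its between/within shares."""
    group_means = df.groupby(entity_col)[var_col].transform('mean')
    total = np.var(df[var_col], ddof=0)
    between = np.var(group_means, ddof=0)                # variance of individual means
    within = np.var(df[var_col] - group_means, ddof=0)   # variance around own mean
    return {'total': total,
            'between_share': between / total,
            'within_share': within / total}

print(variance_shares(df, 'id', 'wage'))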

Problems with Pooled OLS

Pooled OLS

Model:

$$y_{it} = \alpha + \beta x_{it} + \varepsilon_{it}$$

Assumption: The same intercept and slopes apply to every individual and period, and the error $\varepsilon_{it}$ is i.i.d.

Python Implementation:

python
import statsmodels.api as sm

# Ignore the panel structure and run OLS directly
# (assumes a long-format df with 'wage' and 'education' columns)
X = sm.add_constant(df[['education']])
model_pooled = sm.OLS(df['wage'], X).fit()
print(model_pooled.summary())

Pros:

  • Simple, easy to implement
  • Efficient (if model is correct)

Fatal Flaws: pooled OLS suffers from the three problems below.


Problem 1: Omitted Variable Bias (OVB) ⭐⭐⭐

Scenario: Studying the effect of education on wages

True Model:

$$\log(wage_{it}) = \alpha + \beta \cdot education_{it} + \gamma \cdot ability_i + \varepsilon_{it}$$

where $ability_i$ is an unobservable individual characteristic (ability, family background, personality, etc.)

Model Estimated by Pooled OLS:

$$\log(wage_{it}) = \alpha + \beta \cdot education_{it} + u_{it}, \quad u_{it} = \gamma \cdot ability_i + \varepsilon_{it}$$

Problem: If $Cov(education_{it}, ability_i) \neq 0$, then $\hat{\beta}$ is biased!

Intuition:

  • High-ability people get more education AND earn more
  • Pooled OLS mistakenly attributes the effect of ability to the effect of education
  • Result: Overestimates the returns to education

Direction of Omitted Variable Bias

OVB Formula (when omitting a third variable $z$):

$$\text{plim}\, \hat{\beta} = \beta + \gamma \cdot \frac{Cov(x, z)}{Var(x)}$$

where:

  • $\gamma$: True effect of $z$ on $y$
  • $Cov(x, z)$: Covariance between $x$ and $z$
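
One line of algebra shows where this formula comes from. Substitute the true model into the probability limit of the short-regression OLS slope (assuming $Cov(x, \varepsilon) = 0$):

$$\hat{\beta} \xrightarrow{p} \frac{Cov(x, y)}{Var(x)} = \frac{Cov(x, \beta x + \gamma z + \varepsilon)}{Var(x)} = \beta + \gamma \cdot \frac{Cov(x, z)}{Var(x)}$$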

Applied to Our Example:

  • $x$: Education
  • $z$: Ability (unobservable)
  • $y$: Wage

Analysis:

  1. $\gamma > 0$: Ability increases wages
  2. $Cov(x, z) > 0$: Smart people get more education
  3. Therefore: $\gamma \cdot \frac{Cov(x, z)}{Var(x)} > 0$, a positive bias (overestimation)

Python Demonstration: Generating OVB

python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS

np.random.seed(123)

# Parameter settings
N = 500
T = 5
true_beta_education = 0.08  # True effect of education

# Simulate data
data = []
for i in range(N):
    # Individual fixed effect (ability, family background, etc.)
    ability = np.random.normal(0, 0.5)

    for t in range(T):
        # Education level (positively correlated with ability!)
        education = 12 + 0.8 * ability + t * 0.2 + np.random.normal(0, 0.3)

        # Wage (log)
        log_wage = 1.5 + true_beta_education * education + 0.5 * ability + np.random.normal(0, 0.1)

        data.append({
            'id': i,
            'year': 2015 + t,
            'log_wage': log_wage,
            'education': education,
            'ability': ability  # Unobservable in actual research!
        })

df = pd.DataFrame(data)

print("=" * 70)
print("Data Generating Process (DGP)")
print("=" * 70)
print("True model: log(wage) = 1.5 + 0.08*education + 0.5*ability + noise")
print(f"True effect of education: {true_beta_education}")
print(f"Correlation between ability and education: {df[['education', 'ability']].corr().iloc[0, 1]:.3f}")

# Method 1: Pooled OLS (omitting ability)
X_pooled = sm.add_constant(df[['education']])
model_pooled = sm.OLS(df['log_wage'], X_pooled).fit()

print("\n" + "=" * 70)
print("Method 1: Pooled OLS (omitting ability → biased estimate)")
print("=" * 70)
print(f"Education coefficient: {model_pooled.params['education']:.4f}")
print(f"Standard error:  {model_pooled.bse['education']:.4f}")
print(f"Bias:    {model_pooled.params['education'] - true_beta_education:.4f} (overestimate!)")

# Method 2: Pooled OLS (including ability, for comparison only)
X_with_ability = sm.add_constant(df[['education', 'ability']])
model_with_ability = sm.OLS(df['log_wage'], X_with_ability).fit()

print("\n" + "=" * 70)
print("Method 2: Pooled OLS (including ability → unbiased, but infeasible in practice)")
print("=" * 70)
print(f"Education coefficient: {model_with_ability.params['education']:.4f}")
print(f"Ability coefficient: {model_with_ability.params['ability']:.4f}")

# Method 3: Fixed effects (eliminates ability)
df_panel = df.set_index(['id', 'year'])
model_fe = PanelOLS(df_panel['log_wage'],
                    df_panel[['education']],
                    entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)

print("\n" + "=" * 70)
print("Method 3: Fixed Effects (differencing eliminates ability → unbiased)")
print("=" * 70)
print(f"Education coefficient: {model_fe.params['education']:.4f}")
print(f"Standard error:  {model_fe.std_errors['education']:.4f}")

# Comparison summary
print("\n" + "=" * 70)
print("Estimation Comparison")
print("=" * 70)
print(f"True parameter:          {true_beta_education:.4f}")
print(f"Pooled OLS (omitted):  {model_pooled.params['education']:.4f}  (bias: {model_pooled.params['education'] - true_beta_education:+.4f})")
print(f"Pooled OLS (included):  {model_with_ability.params['education']:.4f}  (bias: {model_with_ability.params['education'] - true_beta_education:+.4f})")
print(f"Fixed effects:          {model_fe.params['education']:.4f}  (bias: {model_fe.params['education'] - true_beta_education:+.4f})")

Core Conclusion:

  • Pooled OLS overestimates the education effect (because ability is omitted)
  • Fixed effects eliminate ability through demeaning (the within transformation), yielding an unbiased estimate
  • This is the power of panel data!

Problem 2: Serial Correlation

Definition: Error terms of the same individual are correlated across time: $Corr(\varepsilon_{it}, \varepsilon_{is}) \neq 0$ for $t \neq s$

Consequences:

  • Coefficient estimates remain unbiased
  • But standard errors are biased (usually underestimated)
  • Leads to inflated t-statistics and more false positives

Solution:

  • Use clustered standard errors
  • Cluster at the individual level: allows errors within the same individual to be correlated
python
# Fixed effects + clustered standard errors
# (y and X are a Series / DataFrame indexed by (entity, time))
model_fe = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True  # Cluster at individual level
)

Problem 3: Heteroskedasticity

Definition: Variance of the error term differs across individuals or time: $Var(\varepsilon_{it})$ is not constant

Sources:

  • Scale differences across individuals (large firms vs small firms)
  • Volatility differences across time (during economic crises)

Solution:

  • Use heteroskedasticity-robust standard errors
  • In panel settings, clustered standard errors are themselves heteroskedasticity-robust, so the clustered specification above handles both problems at once
python
# Fixed effects + robust clustered standard errors
model_fe = PanelOLS(y, X, entity_effects=True).fit(
    cov_type='clustered',
    cluster_entity=True
)
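
For reference, plain heteroskedasticity-robust (but not cluster-robust) standard errors are available in statsmodels through the cov_type argument; a minimal sketch reusing df and X_pooled from the OVB demonstration above:

python
# Pooled OLS with heteroskedasticity-robust (HC1) standard errors.
# Note: HC1 does NOT fix serial correlation; cluster by entity for that.
model_robust = sm.OLS(df['log_wage'], X_pooled).fit(cov_type='HC1')
print(model_robust.bse['education'])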

pandas MultiIndex: Foundation for Panel Data

What is MultiIndex?

MultiIndex: pandas' multi-level index, very suitable for panel data

Structure: (entity, time) pairs, here (id, year), form a composite index

python
MultiIndex([(1, 2015),
            (1, 2016),
            (1, 2017),
            (2, 2015),
            (2, 2016),
            (2, 2017)])

Creating MultiIndex

Method 1: set_index

python
import pandas as pd

# Original data (long format)
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'year': [2015, 2016, 2015, 2016],
    'wage': [5000, 5200, 6000, 6300]
})

# Set dual-level index
df_panel = df.set_index(['id', 'year'])

print(df_panel)
#            wage
# id year
# 1  2015   5000
#    2016   5200
# 2  2015   6000
#    2016   6300

Method 2: from_arrays

python
index = pd.MultiIndex.from_arrays(
    [df['id'], df['year']],
    names=['id', 'year']
)
df_panel = df.drop(columns=['id', 'year']).set_index(index)  # drop the raw columns so the result matches Method 1

Method 3: from_product (balanced panel)

python
# Create all (id, year) combinations
index = pd.MultiIndex.from_product(
    [[1, 2, 3], [2015, 2016, 2017]],
    names=['id', 'year']
)
print(index)
# MultiIndex([(1, 2015), (1, 2016), (1, 2017),
#             (2, 2015), (2, 2016), (2, 2017),
#             (3, 2015), (3, 2016), (3, 2017)])
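
One practical use: reindex an unbalanced panel onto the full from_product index so that missing (id, year) cells become explicit NaN rows (a sketch with a small inline example):

python
import pandas as pd

# Unbalanced panel: id=2 is missing 2016
df_unbal = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2],
    'year': [2015, 2016, 2017, 2015, 2017],
    'wage': [5000, 5200, 5500, 6000, 6600]
})

full_index = pd.MultiIndex.from_product(
    [df_unbal['id'].unique(), df_unbal['year'].unique()],
    names=['id', 'year']
)

# The missing (2, 2016) cell shows up as a NaN row
print(df_unbal.set_index(['id', 'year']).reindex(full_index))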

MultiIndex Operations

1. Selecting Data

python
# Select all times for id=1
df_panel.loc[1]

# Select observation for id=1, year=2016
df_panel.loc[(1, 2016)]

# Select multiple ids
df_panel.loc[[1, 2]]

# Using xs (cross-section)
df_panel.xs(1, level='id')        # All times for id=1
df_panel.xs(2015, level='year')   # All individuals in 2015

2. Grouping by Level

python
# Group by id (calculate individual means)
df_panel.groupby(level='id').mean()

# Group by year (calculate time means)
df_panel.groupby(level='year').mean()

3. Sorting

python
# Sort by index
df_panel.sort_index()

# Sort by first level
df_panel.sort_index(level=0)

Practice: Calculating Within/Between Variation

python
import pandas as pd
import numpy as np

# Simulate data
np.random.seed(42)
data = []
for i in range(1, 4):
    for t in range(2015, 2020):
        wage = 5000 + i * 1000 + (t - 2015) * 200 + np.random.normal(0, 100)
        data.append({'id': i, 'year': t, 'wage': wage})

df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

print("=" * 70)
print("Data Preview")
print("=" * 70)
print(df_panel.head(10))

# Calculate individual means (between)
group_means = df_panel.groupby(level='id')['wage'].mean()
print("\n" + "=" * 70)
print("Individual Means (Between Variation)")
print("=" * 70)
print(group_means)

# Calculate demeaned data (within)
df_panel['wage_within'] = df_panel.groupby(level='id')['wage'].transform(
    lambda x: x - x.mean()
)

print("\n" + "=" * 70)
print("Demeaned Wage (Within Variation)")
print("=" * 70)
print(df_panel[['wage', 'wage_within']].head(10))

# Visualization
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Original data
for i in [1, 2, 3]:
    subset = df[df['id'] == i]
    axes[0].plot(subset['year'], subset['wage'], 'o-', label=f'ID {i}', linewidth=2)

axes[0].set_xlabel('Year', fontweight='bold')
axes[0].set_ylabel('Wage', fontweight='bold')
axes[0].set_title('Original Data (Total Variation = Between + Within)', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Right plot: Demeaned data (within variation)
for i in [1, 2, 3]:
    subset = df_panel.loc[i]
    axes[1].plot(subset.index, subset['wage_within'], 'o-', label=f'ID {i}', linewidth=2)

axes[1].axhline(0, color='black', linestyle='--', linewidth=1)
axes[1].set_xlabel('Year', fontweight='bold')
axes[1].set_ylabel('Demeaned Wage', fontweight='bold')
axes[1].set_title('Within Variation (Variation Used by FE)', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

Core Insight:

  • Left plot: Obvious level differences between individuals (between variation)
  • Right plot: After demeaning, only within-individual time changes remain (within variation)
  • Fixed effects only use the variation in the right plot!

Visualizing Panel Data

1. Spaghetti Plot

Purpose: Show time trends of all individuals

python
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
sns.set_style("whitegrid")

# Simulate data
np.random.seed(42)
data = []
for i in range(1, 21):  # 20 individuals
    for t in range(2010, 2020):
        y = 100 + i * 5 + (t - 2010) * 3 + np.random.normal(0, 10)
        data.append({'id': i, 'year': t, 'y': y})

df = pd.DataFrame(data)

# Plot
plt.figure(figsize=(12, 6))
for i in df['id'].unique():
    subset = df[df['id'] == i]
    plt.plot(subset['year'], subset['y'], alpha=0.4, linewidth=1)

plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Outcome Variable', fontweight='bold', fontsize=12)
plt.title('Spaghetti Plot: Time Trends of All Individuals', fontweight='bold', fontsize=14)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Observations:

  • Level differences between individuals (between)
  • Common time trends (if any)
  • Heterogeneity (different slopes for different individuals)
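
seaborn can draw the same plot in one call via its units argument (estimator=None turns off aggregation so each individual gets its own line); a minimal sketch using the simulated df above:

python
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='year', y='y', units='id',
             estimator=None, alpha=0.4, linewidth=1, color='steelblue')
plt.title('Spaghetti Plot (seaborn)', fontweight='bold')
plt.tight_layout()
plt.show()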

2. Mean Trend Plot

Purpose: Show average time trends

python
# Calculate yearly mean and standard error
yearly_stats = df.groupby('year')['y'].agg(['mean', 'std', 'count'])
yearly_stats['se'] = yearly_stats['std'] / np.sqrt(yearly_stats['count'])

# Plot
plt.figure(figsize=(12, 6))
plt.plot(yearly_stats.index, yearly_stats['mean'], 'o-',
         color='darkblue', linewidth=2, markersize=8, label='Mean')
plt.fill_between(yearly_stats.index,
                 yearly_stats['mean'] - 1.96 * yearly_stats['se'],
                 yearly_stats['mean'] + 1.96 * yearly_stats['se'],
                 alpha=0.3, label='95% CI')

plt.xlabel('Year', fontweight='bold', fontsize=12)
plt.ylabel('Outcome Variable', fontweight='bold', fontsize=12)
plt.title('Average Time Trend (± 95% CI)', fontweight='bold', fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Section Summary

Key Points

  1. Panel Data Structure

    • Long format (recommended) vs wide format
    • Balanced panel vs unbalanced panel
    • pandas MultiIndex is a powerful tool for panel data
  2. Variance Decomposition

    • Total variation = between variation + within variation
    • Cross-sectional OLS uses total variation (including between)
    • Fixed effects only use within variation
  3. Problems with Pooled OLS

    • Omitted variable bias (OVB): overestimation or underestimation
    • Serial correlation: biased standard errors
    • Need fixed effects + clustered standard errors
  4. Practical Skills

    • Long-wide format conversion (melt / pivot)
    • MultiIndex operations (set_index, loc, xs)
    • Calculate within/between variation
    • Visualize panel data

Next Steps

In Section 3: Fixed Effects Models, we will dive deep into:

  • Fixed effects estimation methods (within transformation, LSDV)
  • One-way FE vs two-way FE
  • Complete implementation with linearmodels.PanelOLS
  • Real case study: Wage determinants

Mastering panel data structure is the first step toward fixed effects!

Released under the MIT License. Content © Author.