
8.4 Random Effects Models

The Trade-off Between Efficiency and Consistency: When is Random Effects Better Than Fixed Effects?



Section Objectives

  • Understand the mathematical principles of Random Effects (RE) models
  • Master GLS estimation methods
  • Distinguish core assumption differences between FE and RE
  • Implement Hausman test to choose between FE vs RE
  • Understand RE's efficiency advantage and consistency risk
  • Use linearmodels.RandomEffects for RE regression
  • Complete case study: Corporate capital structure determinants

Core Idea of Random Effects

FE vs RE: Key Differences

Fixed Effects (FE):

  • $u_i$ is a fixed parameter (each individual has its own parameter)
  • Allows $u_i$ to be correlated with $X_{it}$: $\mathrm{Cov}(u_i, X_{it}) \neq 0$
  • Estimation method: within transformation (demeaning) to eliminate $u_i$

Random Effects (RE):

  • $u_i$ is a random variable, $u_i \sim \text{i.i.d.}(0, \sigma_u^2)$
  • Assumes $u_i$ is uncorrelated with $X_{it}$: $\mathrm{Cov}(u_i, X_{it}) = 0$
  • Estimation method: Generalized Least Squares (GLS)

Why Called "Random" Effects?

Intuition:

  • FE: $u_i$ is an inherent characteristic of individual $i$ (fixed)
  • RE: $u_i$ is randomly drawn from a population distribution (random)

Statistical Meaning:

  • FE: $u_1, \dots, u_N$ consists of $N$ parameters to be estimated
  • RE: $u_i$ doesn't need estimation, only its variance $\sigma_u^2$ needs estimation

Analogy:

  • FE: Like a "fixed intercept" model, each individual has its own intercept
  • RE: Like a "hierarchical model", individual intercepts follow a distribution

Mathematical Expression of Random Effects Model

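Standard RE Model

$$y_{it} = \beta_0 + X_{it}'\beta + u_i + \varepsilon_{it}$$

Symbol Definitions:

  • $u_i \sim (0, \sigma_u^2)$: Individual random effect (unobservable)
  • $\varepsilon_{it} \sim (0, \sigma_\varepsilon^2)$: Random error term
  • $u_i \perp \varepsilon_{it}$: The two are independent of each other
  • Key assumption: $\mathrm{Cov}(u_i, X_{it}) = 0$ (exogeneity)

Composite Error Term:

$$v_{it} = u_i + \varepsilon_{it}$$

Therefore, the model can be written as:

$$y_{it} = \beta_0 + X_{it}'\beta + v_{it}$$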

Variance-Covariance Structure of Composite Error

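Variance:

$$\mathrm{Var}(v_{it}) = \sigma_u^2 + \sigma_\varepsilon^2$$

Covariance (same individual, different times):

$$\mathrm{Cov}(v_{it}, v_{is}) = \sigma_u^2, \quad t \neq s$$

Intra-class Correlation:

$$\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}$$

Interpretation:

  • $\rho$: Correlation between different time observations of the same individual
  • $\rho = 0$: No individual effects, reduces to pooled OLS
  • $\rho \to 1$: Errors completely determined by individual effects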
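The intra-class correlation can be checked numerically. The sketch below assumes normally distributed components with standard deviations 1 (individual effect) and 2 (idiosyncratic error), values chosen only for illustration, and compares the theoretical ratio of variances with the sample correlation between two periods of the same individual.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20000, 4
sigma_u, sigma_e = 1.0, 2.0                    # assumed standard deviations

u = rng.normal(0.0, sigma_u, size=(N, 1))      # one draw per individual
eps = rng.normal(0.0, sigma_e, size=(N, T))    # one draw per observation
v = u + eps                                    # composite error, shape (N, T)

# Theoretical intra-class correlation: 1 / (1 + 4) = 0.2
rho_theory = sigma_u**2 / (sigma_u**2 + sigma_e**2)

# Sample correlation between periods 0 and 1 of the same individual
rho_sample = np.corrcoef(v[:, 0], v[:, 1])[0, 1]

print(f"theoretical rho: {rho_theory:.3f}")
print(f"sample rho:      {rho_sample:.3f}")
```

Any pair of distinct periods should give roughly the same sample correlation, because the shared individual effect is the only source of dependence.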

Random Effects Estimation: GLS

Why Can't We Use OLS?

Problem: Composite error has serial correlation

  • Errors at different times for the same individual are correlated: $\mathrm{Cov}(v_{it}, v_{is}) = \sigma_u^2 \neq 0$ for $t \neq s$
  • Violates OLS independence assumption

Consequences:

  • OLS coefficients are still unbiased (if $\mathrm{Cov}(u_i, X_{it}) = 0$)
  • But the conventional standard errors are biased (typically underestimated) → $t$-statistics are inflated
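A small Monte Carlo sketch makes the standard-error problem concrete. It assumes a regressor that is constant within each individual (the case where the problem is most severe) and equal variances for the individual effect and the idiosyncratic error; both choices are illustrative assumptions, not part of the original text. The conventional OLS standard error is compared with the actual spread of the slope across replications.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, reps = 100, 5, 500
slopes, naive_ses = [], []

for _ in range(reps):
    z = rng.normal(size=(N, 1))
    x = np.repeat(z, T, axis=1)            # regressor constant within individual
    u = rng.normal(size=(N, 1))            # individual effect, independent of x
    y = 1.0 + 2.0 * x + u + rng.normal(size=(N, T))

    X = np.column_stack([np.ones(N * T), x.ravel()])
    yv = y.ravel()
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    resid = yv - X @ beta
    s2 = resid @ resid / (N * T - 2)
    cov = s2 * np.linalg.inv(X.T @ X)      # classical OLS covariance (assumes independent errors)

    slopes.append(beta[1])
    naive_ses.append(np.sqrt(cov[1, 1]))

true_sd = np.std(slopes, ddof=1)           # actual sampling variability of the slope
mean_naive_se = np.mean(naive_ses)         # what pooled OLS reports on average

print(f"mean slope:        {np.mean(slopes):.3f}")
print(f"empirical SD:      {true_sd:.3f}")
print(f"mean classical SE: {mean_naive_se:.3f}")
```

The slope is centered on the true value of 2, yet the reported standard error is clearly smaller than the empirical spread, which is exactly the inflation of $t$-statistics described above.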

Generalized Least Squares (GLS)

Core Idea: Transform data so that transformed errors satisfy OLS assumptions

Step 1: Construct Transformation

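Quasi-Demeaning Transformation:

$$y_{it} - \theta \bar{y}_i = (1 - \theta)\beta_0 + (X_{it} - \theta \bar{X}_i)'\beta + (v_{it} - \theta \bar{v}_i)$$

where $\bar{y}_i$, $\bar{X}_i$, $\bar{v}_i$ are individual time averages, $v_{it} = u_i + \varepsilon_{it}$ is the composite error, and:

$$\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_u^2}}$$

Special Cases:

  • If $\sigma_u^2 = 0$ (no individual effects): $\theta = 0$ → pooled OLS
  • If $\sigma_u^2 \to \infty$ (individual effects dominate): $\theta \to 1$ → fixed effects (within transformation)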

Intuition:

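  • RE sits between pooled OLS ($\theta = 0$) and FE ($\theta = 1$); formally it is a matrix-weighted average of the within (FE) and between estimators
  • The weight depends on the relative importance of individual effects and on the panel length $T$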
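The special cases are easy to verify numerically. The helper below, `re_theta` (a name introduced here for illustration), evaluates the standard balanced-panel formula $\theta = 1 - \sqrt{\sigma_\varepsilon^2 / (\sigma_\varepsilon^2 + T\sigma_u^2)}$ for a few assumed variance values.

```python
import math

def re_theta(sigma_u2: float, sigma_e2: float, T: int) -> float:
    """Quasi-demeaning weight for a balanced panel with T periods."""
    return 1.0 - math.sqrt(sigma_e2 / (sigma_e2 + T * sigma_u2))

print(re_theta(0.0, 1.0, 5))    # 0.0: no individual effects, pooled OLS
print(re_theta(1.0, 1.0, 5))    # 1 - sqrt(1/6), roughly 0.59
print(re_theta(1e6, 1.0, 5))    # very close to 1: within (FE) transformation
print(re_theta(1.0, 1.0, 50))   # a longer panel also pushes theta toward 1
```

Note that $\theta$ grows with $T$ as well as with $\sigma_u^2$, so in long panels RE and FE estimates tend to be close.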

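Step 2: OLS Estimation on Transformed Data

$$\hat{\beta}_{RE} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y}$$

where $\tilde{y}$ and $\tilde{X}$ stack the quasi-demeaned observations $y_{it} - \theta \bar{y}_i$ and $X_{it} - \theta \bar{X}_i$ (including the transformed constant $1 - \theta$).

This is the Random Effects Estimator (RE Estimator)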
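As a minimal sketch of the two steps, the code below quasi-demeans a simulated balanced panel and runs OLS on the result. It assumes the variance components are known (both set to 1 here) and a single regressor that is independent of the individual effect; `quasi_demean_ols` is a hypothetical helper, not a library function.

```python
import numpy as np

def quasi_demean_ols(y, x, theta):
    """OLS after subtracting theta times the individual means.

    y, x: arrays of shape (N, T) for a balanced panel with one regressor.
    Returns [intercept, slope].
    """
    y_t = (y - theta * y.mean(axis=1, keepdims=True)).ravel()
    x_t = (x - theta * x.mean(axis=1, keepdims=True)).ravel()
    const = np.full_like(x_t, 1.0 - theta)     # the constant is transformed too
    X = np.column_stack([const, x_t])
    beta, *_ = np.linalg.lstsq(X, y_t, rcond=None)
    return beta

rng = np.random.default_rng(42)
N, T = 500, 5
sigma_u2, sigma_e2 = 1.0, 1.0                  # assumed known for this sketch

u = rng.normal(0, np.sqrt(sigma_u2), size=(N, 1))
x = rng.normal(10, 2, size=(N, T))             # independent of u: RE assumption holds
y = 5 + 2 * x + u + rng.normal(0, np.sqrt(sigma_e2), size=(N, T))

theta = 1 - np.sqrt(sigma_e2 / (sigma_e2 + T * sigma_u2))
b0, b1 = quasi_demean_ols(y, x, theta)
print(f"theta = {theta:.3f}, intercept = {b0:.3f}, slope = {b1:.3f}")
```

With the true intercept 5 and slope 2 used in the simulation, both estimates should land near those values.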


Feasible GLS (FGLS)

Problem: $\theta$ depends on unknown parameters $\sigma_u^2$ and $\sigma_\varepsilon^2$

Solution: Two-step estimation

  1. Step 1: Estimate variance components

    • Run pooled OLS or FE, obtain residuals
    • Calculate $\hat{\sigma}_u^2$ and $\hat{\sigma}_\varepsilon^2$
  2. Step 2: Use estimated variances to calculate $\hat{\theta}$, perform GLS

Python Implementation: linearmodels automatically performs FGLS
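To see what the first FGLS step involves, here is a hand-rolled sketch of one common way to estimate the variance components, using residuals from the within and between regressions (in the spirit of the Swamy-Arora approach). It assumes a balanced panel with one regressor; the helper names are hypothetical, and the exact computation inside linearmodels may differ in details such as degrees-of-freedom corrections.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def variance_components(y, x):
    """Estimate (sigma_u2, sigma_e2) for a balanced (N, T) panel, one regressor."""
    N, T = y.shape
    y_bar, x_bar = y.mean(axis=1), x.mean(axis=1)

    # Within regression: deviations from individual means identify sigma_e2
    yw = (y - y_bar[:, None]).ravel()
    xw = (x - x_bar[:, None]).ravel()
    _, e_w = ols(xw[:, None], yw)
    sigma_e2 = e_w @ e_w / (N * (T - 1) - 1)

    # Between regression on individual means: error variance is sigma_u2 + sigma_e2 / T
    _, e_b = ols(np.column_stack([np.ones(N), x_bar]), y_bar)
    sigma_b2 = e_b @ e_b / (N - 2)
    sigma_u2 = max(sigma_b2 - sigma_e2 / T, 0.0)   # truncate at zero if negative
    return sigma_u2, sigma_e2

rng = np.random.default_rng(7)
N, T = 2000, 5
u = rng.normal(0, 1.0, size=(N, 1))                    # true sigma_u2 = 1.0
x = rng.normal(10, 2, size=(N, T))
y = 5 + 2 * x + u + rng.normal(0, 0.5, size=(N, T))    # true sigma_e2 = 0.25

sigma_u2_hat, sigma_e2_hat = variance_components(y, x)
theta_hat = 1 - np.sqrt(sigma_e2_hat / (sigma_e2_hat + T * sigma_u2_hat))
print(f"sigma_u2_hat = {sigma_u2_hat:.3f}, sigma_e2_hat = {sigma_e2_hat:.3f}, "
      f"theta_hat = {theta_hat:.3f}")
```

The second step then plugs `theta_hat` into the quasi-demeaning transformation and runs OLS on the transformed data.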


FE vs RE: In-Depth Comparison

Comparison Table

| Dimension | Fixed Effects (FE) | Random Effects (RE) |
|---|---|---|
| Individual Effect | $u_i$ (fixed parameter) | $u_i$ (random variable) |
| Core Assumption | Allows $u_i$ correlated with $X_{it}$ | Requires $\mathrm{Cov}(u_i, X_{it}) = 0$ |
| Estimation Method | Within transformation | GLS |
| Variation Used | Within only | Within + Between (Both) |
| Efficiency | Relatively low (only uses within variation) | Relatively high (uses all variation) |
| Consistency | Consistent (even if $u_i$ correlated with $X_{it}$) | Consistent only when $\mathrm{Cov}(u_i, X_{it}) = 0$ |
| Time-Invariant Variables | Cannot estimate | Can estimate |
| Applicable Scenarios | $u_i$ correlated with $X_{it}$ (endogeneity) | $u_i$ uncorrelated with $X_{it}$ (exogeneity) |

Core Trade-off: Efficiency vs Consistency

Consistency:

  • As sample size increases, estimator converges to true parameter

Efficiency:

  • Estimator has smaller variance (smaller standard errors)

Trade-off:

  • FE: Consistent (even with endogeneity), but less efficient (only uses within variation)
  • RE: More efficient (uses all variation), but inconsistent if $u_i$ is correlated with $X_{it}$

Decision Rule:

  • If $\mathrm{Cov}(u_i, X_{it}) = 0$: RE is better (more efficient)
  • If $u_i$ is correlated with $X_{it}$: FE is better (consistent)
  • Key: How to determine? → Hausman Test
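The trade-off can be illustrated with a small Monte Carlo sketch. It assumes a balanced panel, one regressor, unit variances for both error components, and an "oracle" $\theta$ computed from those known variances; the parameter `gamma` (introduced here) controls how strongly the regressor loads on the individual effect. With `gamma = 0` the RE assumption holds, with `gamma = 0.5` it fails.

```python
import numpy as np

def slope(y, x, theta):
    """Slope from OLS on quasi-demeaned data; theta=1 gives FE, theta=0 pooled OLS."""
    y_t = (y - theta * y.mean(axis=1, keepdims=True)).ravel()
    x_t = (x - theta * x.mean(axis=1, keepdims=True)).ravel()
    X = np.column_stack([np.full_like(x_t, 1.0 - theta), x_t])
    return np.linalg.lstsq(X, y_t, rcond=None)[0][1]

def simulate(gamma, reps=300, N=200, T=3, seed=0):
    """Return (FE mean, FE sd, RE mean, RE sd) of the slope; the true slope is 2."""
    rng = np.random.default_rng(seed)
    theta = 1 - np.sqrt(1.0 / (1.0 + T * 1.0))    # sigma_u2 = sigma_e2 = 1, assumed known
    fe, re = [], []
    for _ in range(reps):
        u = rng.normal(size=(N, 1))                # individual effect
        m = rng.normal(0, 2, size=(N, 1))          # exogenous between variation in x
        x = gamma * u + m + rng.normal(size=(N, T))
        y = 2.0 * x + u + rng.normal(size=(N, T))
        fe.append(slope(y, x, 1.0))
        re.append(slope(y, x, theta))
    return np.mean(fe), np.std(fe), np.mean(re), np.std(re)

for gamma in (0.0, 0.5):
    fe_mean, fe_sd, re_mean, re_sd = simulate(gamma)
    print(f"gamma={gamma}: FE mean={fe_mean:.3f} sd={fe_sd:.3f} | "
          f"RE mean={re_mean:.3f} sd={re_sd:.3f}")
```

When the regressor is exogenous, both estimators center on 2 and RE has the smaller spread. Once the regressor is correlated with the individual effect, FE stays centered on 2 while RE drifts away from it, and no amount of extra precision repairs that.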

Hausman Test: Decision Tool for FE vs RE

Logic of Hausman Test

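Core Question: Is $u_i$ correlated with $X_{it}$?

Null Hypothesis:

$$H_0: \mathrm{Cov}(u_i, X_{it}) = 0 \quad \text{(RE is consistent and efficient)}$$

Alternative Hypothesis:

$$H_1: \mathrm{Cov}(u_i, X_{it}) \neq 0 \quad \text{(RE is inconsistent, FE remains consistent)}$$

Test Statistic:

$$H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[\mathrm{Var}(\hat{\beta}_{FE}) - \mathrm{Var}(\hat{\beta}_{RE})\right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \sim \chi^2(k)$$

where $k$ is the number of independent variables whose coefficients are estimated by both FE and RE (time-varying regressors, excluding the constant)

Intuition:

  • If $H_0$ holds ($\mathrm{Cov}(u_i, X_{it}) = 0$), both FE and RE are consistent, so the estimates should be close
  • If $H_1$ holds ($u_i$ correlated with $X_{it}$), FE is consistent but RE is not, so the estimates will differ significantly

Decision Rule:

  • $p < 0.05$: Reject $H_0$ → Use FE
  • $p \geq 0.05$: Fail to reject $H_0$ → Use RE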
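A worked single-regressor example shows the arithmetic. The numbers are hypothetical and chosen only for illustration: suppose FE gives $\hat{\beta}_{FE} = 2.01$ with standard error $0.05$, and RE gives $\hat{\beta}_{RE} = 2.10$ with standard error $0.03$. Then

$$H = \frac{(\hat{\beta}_{FE} - \hat{\beta}_{RE})^2}{\mathrm{Var}(\hat{\beta}_{FE}) - \mathrm{Var}(\hat{\beta}_{RE})} = \frac{(2.01 - 2.10)^2}{0.05^2 - 0.03^2} = \frac{0.0081}{0.0016} \approx 5.06$$

The 5% critical value of $\chi^2(1)$ is $3.84$, so $H \approx 5.06 > 3.84$ and the null hypothesis is rejected: in this hypothetical case FE would be preferred. Note that the denominator is only guaranteed to be positive asymptotically; in finite samples it can turn out negative, in which case the statistic is not usable as computed.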

Python Implementation: Hausman Test

python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2
from linearmodels.panel import PanelOLS, RandomEffects, compare

# Simulate data: u_i correlated with X (FE should win)
np.random.seed(123)

N = 200
T = 5

data = []
for i in range(N):
    # Individual effect
    u_i = np.random.normal(0, 1)

    for t in range(T):
        # X correlated with u_i (violates RE assumption!)
        x = 10 + 0.5 * u_i + np.random.normal(0, 2)

        # Y
        y = 5 + 2 * x + u_i + np.random.normal(0, 1)

        data.append({'id': i, 'year': 2015 + t, 'y': y, 'x': x})

df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

# RE needs an explicit intercept; FE absorbs it into the entity effects
exog_re = sm.add_constant(df_panel[['x']])

# Estimate FE and RE. The classical Hausman statistic compares the default
# (unadjusted) covariance estimates, so neither fit uses clustered errors here.
model_fe = PanelOLS(df_panel['y'], df_panel[['x']],
                    entity_effects=True).fit()

model_re = RandomEffects(df_panel['y'], exog_re).fit()

print("=" * 70)
print("FE vs RE Estimation Results")
print("=" * 70)
print(f"True parameter:  2.0000")
print(f"FE estimate:     {model_fe.params['x']:.4f}")
print(f"RE estimate:     {model_re.params['x']:.4f}")

# Hausman test (manual implementation, single regressor)
beta_diff = model_fe.params['x'] - model_re.params['x']
var_diff = model_fe.cov.loc['x', 'x'] - model_re.cov.loc['x', 'x']
if var_diff <= 0:
    # Can happen in finite samples; the statistic is then undefined
    raise ValueError("Var(FE) - Var(RE) is not positive")
hausman_stat = (beta_diff ** 2) / var_diff
p_value = chi2.sf(hausman_stat, df=1)

print("\n" + "=" * 70)
print("Hausman Test")
print("=" * 70)
print(f"H statistic:   {hausman_stat:.3f}")
print(f"p-value:       {p_value:.4f}")

if p_value < 0.05:
    print("Conclusion:    Reject H0, should use FE (RE is inconsistent)")
else:
    print("Conclusion:    Fail to reject H0, can use RE (consistent and more efficient)")

# Use linearmodels built-in comparison function (side-by-side table, not a test)
print("\n" + "=" * 70)
print("linearmodels Built-in Comparison")
print("=" * 70)
comparison = compare({'FE': model_fe, 'RE': model_re})
print(comparison)

Output Interpretation:

  • If $p < 0.05$: FE and RE differ significantly → Use FE
  • If $p \geq 0.05$: FE and RE don't differ significantly → Use RE (more efficient)

Practical Recommendations

Conservative Strategy (recommended):

  1. Report both FE and RE
  2. Conduct Hausman test
  3. Prioritize FE for main results (because endogeneity is common)

Exception Cases (may prioritize RE):

  • Education research: students randomly assigned to schools
  • Medical research: patients randomly assigned to hospitals
  • Survey research: individuals randomly sampled from population

Economics Research:

  • Typically use FE (because endogeneity almost always exists)
  • RE often used for robustness checks

linearmodels.RandomEffects

Basic Syntax

python
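import statsmodels.api as sm
from linearmodels.panel import RandomEffects

# df is a long-format DataFrame with columns 'id', 'year', 'y', 'x1', 'x2'
# Set panel index (entity first, then time)
df_panel = df.set_index(['id', 'year'])

# RandomEffects does not add an intercept automatically
exog = sm.add_constant(df_panel[['x1', 'x2']])

# Random effects regression
model_re = RandomEffects(
    dependent=df_panel['y'],
    exog=exog
).fit()

print(model_re)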

Complete Example: Corporate Capital Structure

python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2
from linearmodels.panel import PanelOLS, PooledOLS, RandomEffects, compare

# Simulate corporate panel data
np.random.seed(2024)

N = 300  # 300 companies
T = 10   # 10 years

data = []
for i in range(N):
    # Company effect (management style, industry characteristics, etc.)
    company_effect = np.random.normal(0, 0.1)

    # Industry (time-invariant)
    industry = np.random.choice(['Manufacturing', 'Services', 'Technology'],
                                p=[0.4, 0.3, 0.3])

    for t in range(T):
        year = 2010 + t

        # Profitability (ROA), correlated with the company effect
        roa = 0.05 + company_effect * 0.5 + np.random.normal(0, 0.02)

        # Company size (log(assets))
        log_assets = 10 + 0.1 * t + np.random.normal(0, 0.5)

        # Growth opportunities (Tobin's Q)
        tobins_q = 1.5 + np.random.normal(0, 0.3)

        # Leverage (dependent variable)
        # True parameters: roa=-0.3, log_assets=0.05, tobins_q=-0.1
        leverage = (0.3 - 0.3 * roa + 0.05 * log_assets -
                    0.1 * tobins_q + company_effect + np.random.normal(0, 0.05))

        data.append({
            'company_id': i,
            'year': year,
            'leverage': leverage,
            'roa': roa,
            'log_assets': log_assets,
            'tobins_q': tobins_q,
            'industry': industry
        })

df = pd.DataFrame(data)

# Industry dummies (dtype=float keeps the design matrix numeric)
df = pd.get_dummies(df, columns=['industry'], drop_first=True, dtype=float)

print("=" * 70)
print("Corporate Capital Structure Study")
print("=" * 70)
print(f"Sample size: {len(df):,}")
print(f"Number of companies: {df['company_id'].nunique()}")
print(f"Time span: {df['year'].min()} - {df['year'].max()}")

# Set panel index
df_panel = df.set_index(['company_id', 'year'])

slopes = ['roa', 'log_assets', 'tobins_q']
y = df_panel['leverage']
X = df_panel[slopes]
X_const = sm.add_constant(X)  # pooled OLS and RE need an explicit intercept

# Model 1: Pooled OLS
model_pooled = PooledOLS(y, X_const).fit()

# Model 2: Fixed effects (clustered standard errors for reporting)
fe = PanelOLS(y, X, entity_effects=True)
model_fe = fe.fit(cov_type='clustered', cluster_entity=True)

# Model 3: Random effects
model_re = RandomEffects(y, X_const).fit()

# Model 4: RE + industry dummies (utilizing RE's advantage of estimating time-invariant variables)
industry_cols = ['industry_Services', 'industry_Technology']
model_re_industry = RandomEffects(
    y, sm.add_constant(df_panel[slopes + industry_cols])
).fit()

# Hausman test: slope coefficients only, both fits with unadjusted covariances
model_fe_h = fe.fit()
beta_diff = (model_fe_h.params[slopes] - model_re.params[slopes]).to_numpy()
var_diff = (model_fe_h.cov.loc[slopes, slopes] -
            model_re.cov.loc[slopes, slopes]).to_numpy()
hausman_stat = float(beta_diff @ np.linalg.inv(var_diff) @ beta_diff)
p_value = chi2.sf(hausman_stat, df=len(slopes))

print("\n" + "=" * 70)
print("Hausman Test")
print("=" * 70)
print(f"H statistic: {hausman_stat:.3f}")
print(f"p-value:     {p_value:.4f}")
print(f"Conclusion:  {'Use FE' if p_value < 0.05 else 'Use RE'}")

# Compare results (linearmodels' compare handles its own result objects)
print("\n" + "=" * 70)
print("Regression Results Comparison")
print("=" * 70)
print(compare({'Pooled OLS': model_pooled, 'FE': model_fe, 'RE': model_re}))

print("\n" + "=" * 70)
print("RE + Industry Dummies (Utilizing RE's Advantage)")
print("=" * 70)
print(model_re_industry.summary)

# Interpret coefficients (leverage and ROA are both measured as fractions)
print("\n" + "=" * 70)
print("Economic Interpretation")
print("=" * 70)
print(f"ROA coefficient (FE):    {model_fe.params['roa']:.4f}")
print("  → A 1 percentage point increase in ROA changes leverage by {:.2f} percentage points".format(
    model_fe.params['roa']))
print(f"\nlog(assets) coefficient (FE): {model_fe.params['log_assets']:.4f}")
print("  → Company size doubles (log increases by 0.693), leverage increases by {:.2f} percentage points".format(
    model_fe.params['log_assets'] * 0.693 * 100))

Output Interpretation:

  1. Hausman Test: If rejected, use FE; otherwise use RE
  2. RE's Advantage: Can estimate industry dummies (time-invariant)
  3. Economic Meaning:
    • Negative ROA coefficient: High-profit companies reduce debt (pecking order theory)
    • Positive size coefficient: Large companies can obtain debt financing more easily

RE's Advantage Scenarios

Scenario 1: Estimating Time-Invariant Variables

Example: Studying gender wage gap

python
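import statsmodels.api as sm
from linearmodels.panel import RandomEffects

# df_panel: wage panel indexed by (person_id, year) with columns
# 'log_wage', 'education' and a 0/1 'gender' dummy

# FE cannot estimate gender (time-invariant): the within transformation
# removes every variable that is constant for a given person
# PanelOLS(df_panel['log_wage'], df_panel[['education', 'gender']], entity_effects=True)

# RE can estimate gender
exog = sm.add_constant(df_panel[['education', 'gender']])
model_re = RandomEffects(df_panel['log_wage'], exog).fit()
print(model_re.params['gender'])

Note: The RE estimate is consistent only if gender (and every other regressor) is uncorrelated with the individual effects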


Scenario 2: Small Within Variation

Example: Studying effect of education on wages (short panel)

If panel time span is short (e.g., 2-3 years), education level barely changes:

  • FE only uses within variation (almost 0) → large standard errors
  • RE uses between variation (large) → more precise

Trade-off:

  • FE: Consistent but imprecise
  • RE: Precise but possibly inconsistent (if endogeneity exists)
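Before choosing an estimator it helps to measure how much of a regressor's variation is within versus between individuals. The sketch below uses hypothetical schooling data (large differences across people, almost no change over three years) and a helper `within_between_sd` introduced here for illustration.

```python
import numpy as np

def within_between_sd(x):
    """Return (within SD, between SD) of a balanced (N, T) panel variable."""
    means = x.mean(axis=1, keepdims=True)
    within = (x - means).std()      # variation around each individual's own mean
    between = means.std()           # variation of individual means
    return within, between

rng = np.random.default_rng(3)
N, T = 1000, 3

# Hypothetical years of schooling: person-specific level plus tiny changes over time
base = rng.normal(12, 3, size=(N, 1))
educ = base + rng.normal(0, 0.1, size=(N, T))

w_sd, b_sd = within_between_sd(educ)
print(f"within SD:  {w_sd:.3f}")
print(f"between SD: {b_sd:.3f}")
```

A within SD that is a small fraction of the between SD signals that FE will rely on very little information and produce large standard errors for that variable.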

Scenario 3: Random Sampling

Example: Randomly sample 100 schools from nationwide schools

If schools are randomly sampled, the school effect is less likely to be correlated with student characteristics

  • RE assumption more reasonable
  • RE estimation more efficient

Contrast:

  • If studying a specific set of 100 schools (non-random), FE is more appropriate

Section Summary

Key Points

  1. Essence of RE:

    • Individual effects are random variables, drawn from a distribution
    • Core assumption: $\mathrm{Cov}(u_i, X_{it}) = 0$ (exogeneity)
  2. GLS Estimation:

    • Quasi-demeaning transformation: $\theta$ depends on the variance ratio $\sigma_\varepsilon^2 / (\sigma_\varepsilon^2 + T\sigma_u^2)$
    • RE lies between pooled OLS ($\theta = 0$) and FE ($\theta = 1$)
  3. FE vs RE:

    • Efficiency: RE > FE (uses all variation)
    • Consistency: FE is consistent even when $u_i$ is correlated with $X_{it}$, RE is consistent only when $\mathrm{Cov}(u_i, X_{it}) = 0$
    • Time-invariant variables: FE cannot estimate, RE can
  4. Hausman Test:

    • Tests whether $u_i$ is correlated with $X_{it}$
    • $p < 0.05$: Use FE
    • $p \geq 0.05$: Use RE
  5. Practical Recommendations:

    • Economics research: Prioritize FE (endogeneity common)
    • Education/medical research: Consider RE (random sampling)
    • Robustness check: Report both FE and RE
  6. RE's Advantage Scenarios:

    • Need to estimate time-invariant variables
    • Small within variation
    • Individual random sampling

Decision Tree

Start

Need to estimate time-invariant variables?
  ↓ Yes
  Use RE (if Hausman test passes)
  ↓ No
Estimate FE and RE, conduct Hausman test

Hausman test p < 0.05?
  ↓ Yes
  Use FE (RE is inconsistent)
  ↓ No
  Use RE (more efficient)
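The tree reduces to a small rule once the Hausman p-value is available. The function below is a hypothetical helper that encodes it, with the extra flag only affecting the note returned when time-invariant variables are of interest but the test rejects RE.

```python
def choose_estimator(hausman_p: float, need_time_invariant: bool = False,
                     alpha: float = 0.05) -> tuple[str, str]:
    """Return (estimator, note) following the decision tree above."""
    if not 0.0 <= hausman_p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if hausman_p < alpha:
        note = "RE is rejected as inconsistent"
        if need_time_invariant:
            note += "; time-invariant coefficients cannot be estimated by FE"
        return "FE", note
    return "RE", "no evidence against RE; it is the more efficient choice"

print(choose_estimator(0.003))
print(choose_estimator(0.003, need_time_invariant=True))
print(choose_estimator(0.40, need_time_invariant=True))
```

The rule is mechanical by design; the practical recommendations above (report both, prefer FE when endogeneity is plausible) still apply on top of it.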

Next Steps

In Section 5: Advanced Panel Data Topics, we will learn:

  • Two-way fixed effects (Two-Way FE) detailed explanation
  • Correct use of clustered standard errors
  • Dynamic panel models (Arellano-Bond)
  • Handling unbalanced panels

Choose wisely between efficiency and consistency!

Released under the MIT License. Content © Author.