8.4 Random Effects Models
The Trade-off Between Efficiency and Consistency: When is Random Effects Better Than Fixed Effects?
Section Objectives
- Understand the mathematical principles of Random Effects (RE) models
- Master GLS estimation methods
- Distinguish core assumption differences between FE and RE
- Implement the Hausman test to choose between FE and RE
- Understand RE's efficiency advantage and consistency risk
- Use linearmodels.RandomEffects for RE regression
- Complete case study: Corporate capital structure determinants
Core Idea of Random Effects
FE vs RE: Key Differences
Fixed Effects (FE):
- $u_i$ is a fixed parameter (each individual has its own parameter)
- Allows $u_i$ to be correlated with $X_{it}$: $\text{Cov}(u_i, X_{it}) \neq 0$
- Estimation method: within transformation (demeaning) to eliminate $u_i$
Random Effects (RE):
- $u_i$ is a random variable, $u_i \sim \text{iid}(0, \sigma_u^2)$
- Assumes $u_i$ is uncorrelated with $X_{it}$: $\text{Cov}(u_i, X_{it}) = 0$ ⭐
- Estimation method: Generalized Least Squares (GLS)
Why Called "Random" Effects?
Intuition:
- FE: $u_i$ is an inherent characteristic of individual $i$ (fixed)
- RE: $u_i$ is randomly drawn from a population distribution (random)
Statistical Meaning:
- FE: $u_1, u_2, \ldots, u_N$ are $N$ parameters to be estimated
- RE: $u_i$ doesn't need estimation; only its variance $\sigma_u^2$ needs to be estimated
Analogy:
- FE: Like a "fixed intercept" model, each individual has its own intercept
- RE: Like a "hierarchical model", individual intercepts follow a distribution
Mathematical Expression of Random Effects Model
Standard RE Model

$$y_{it} = \beta_0 + \beta_1 x_{it} + u_i + \varepsilon_{it}$$

Symbol Definitions:
- $u_i \sim \text{iid}(0, \sigma_u^2)$: Individual random effect (unobservable)
- $\varepsilon_{it} \sim \text{iid}(0, \sigma_\varepsilon^2)$: Random error term
- $u_i$ and $\varepsilon_{it}$ are independent of each other
- Key assumption: $\text{Cov}(u_i, x_{it}) = 0$ (exogeneity)
Composite Error Term:

$$v_{it} = u_i + \varepsilon_{it}$$

Therefore, the model can be written as:

$$y_{it} = \beta_0 + \beta_1 x_{it} + v_{it}$$
Variance-Covariance Structure of Composite Error
Variance:

$$\text{Var}(v_{it}) = \sigma_u^2 + \sigma_\varepsilon^2$$

Covariance (same individual, different times, $t \neq s$):

$$\text{Cov}(v_{it}, v_{is}) = \sigma_u^2$$

Intra-class Correlation:

$$\rho = \text{Corr}(v_{it}, v_{is}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}$$

Interpretation:
- $\rho$: Correlation between different time observations of the same individual
- $\rho = 0$: No individual effects, reduces to pooled OLS
- $\rho = 1$: Errors completely determined by individual effects
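As a quick numerical check of the intra-class correlation formula, the following minimal sketch simulates composite errors with assumed $\sigma_u = \sigma_\varepsilon = 1$ (so the true $\rho = 0.5$) and estimates $\rho$ empirically:

```python
import numpy as np

# A minimal sketch: verify rho = sigma_u^2 / (sigma_u^2 + sigma_eps^2)
# on simulated composite errors (assumed sigma_u = sigma_eps = 1, true rho = 0.5)
rng = np.random.default_rng(42)
N, T = 1000, 5
u = rng.normal(0, 1, size=(N, 1))      # individual effects, one per entity
eps = rng.normal(0, 1, size=(N, T))    # idiosyncratic errors
v = u + eps                            # composite error v_it = u_i + eps_it

# Empirical correlation between two different periods of the same individual
rho_hat = np.corrcoef(v[:, 0], v[:, 1])[0, 1]
print(f"True rho: {1 / (1 + 1):.3f}, estimated rho: {rho_hat:.3f}")
```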
Random Effects Estimation: GLS
Why Can't We Use OLS?
Problem: The composite error $v_{it}$ has serial correlation
- Errors at different times for the same individual are correlated: $\text{Cov}(v_{it}, v_{is}) = \sigma_u^2 > 0$
- Violates the OLS independence assumption
Consequences:
- OLS coefficients are still unbiased (if $\text{Cov}(u_i, x_{it}) = 0$)
- But standard errors are biased (underestimated) → $t$-statistics are inflated
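To see the standard-error problem concretely, the following minimal sketch (simulation parameters are illustrative assumptions) fits pooled OLS twice on a panel whose composite error contains an individual effect: once with naive iid standard errors, once with entity-clustered standard errors:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Composite errors with an individual effect make naive OLS standard
# errors too small; entity-clustered SEs correct this.
rng = np.random.default_rng(0)
N, T = 200, 5
ids = np.repeat(np.arange(N), T)
u = np.repeat(rng.normal(0, 1, N), T)                 # individual effect u_i
# x varies mostly between entities but is uncorrelated with u
x = np.repeat(rng.normal(0, 1, N), T) + 0.5 * rng.normal(size=N * T)
y = 1 + 2 * x + u + rng.normal(0, 1, N * T)

X = sm.add_constant(pd.Series(x, name='x'))
naive = sm.OLS(y, X).fit()                            # treats all errors as iid
clustered = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': ids})
print(f"Naive SE on x:     {naive.bse['x']:.4f}")     # too small
print(f"Clustered SE on x: {clustered.bse['x']:.4f}") # larger, honest
```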
Generalized Least Squares (GLS)
Core Idea: Transform data so that transformed errors satisfy OLS assumptions
Step 1: Construct Transformation

Quasi-Demeaning Transformation:

$$y_{it} - \theta \bar{y}_i = \beta_0 (1 - \theta) + \beta_1 (x_{it} - \theta \bar{x}_i) + (v_{it} - \theta \bar{v}_i)$$

where:

$$\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T \sigma_u^2}}$$

Special Cases:
- If $\sigma_u^2 = 0$ (no individual effects): $\theta = 0$ → pooled OLS
- If $T \sigma_u^2 \gg \sigma_\varepsilon^2$ (individual effects dominate): $\theta \to 1$ → fixed effects (within transformation)
Intuition:
- RE is a weighted average of FE and pooled OLS
- The weight $\theta$ depends on the relative importance of individual effects
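To make the weight concrete, here is a quick numerical sketch (the variance values are illustrative assumptions, not from the text):

```python
import numpy as np

def theta(sigma_u2, sigma_e2, T):
    """Quasi-demeaning weight for the RE (GLS) transformation."""
    return 1 - np.sqrt(sigma_e2 / (sigma_e2 + T * sigma_u2))

# theta grows with the share of individual-effect variance (and with T):
# 0 means pooled OLS, values near 1 approach the FE within transformation
for su2 in [0.0, 0.5, 1.0, 10.0]:
    print(f"sigma_u^2 = {su2:>4}: theta = {theta(su2, 1.0, T=5):.3f}")
```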
Step 2: OLS Estimation on Transformed Data
This is the Random Effects Estimator (RE Estimator)
Feasible GLS (FGLS)
Problem: $\theta$ depends on the unknown variance components $\sigma_u^2$ and $\sigma_\varepsilon^2$

Solution: Two-step estimation

Step 1: Estimate variance components
- Run pooled OLS or FE, obtain residuals
- Calculate $\hat{\sigma}_u^2$ and $\hat{\sigma}_\varepsilon^2$
Step 2: Use the estimated variances to calculate $\hat{\theta}$, then perform GLS

Python Implementation: linearmodels performs FGLS automatically; a manual sketch follows below
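For intuition about what the library does under the hood, here is a hand-rolled FGLS sketch on simulated data, compared against linearmodels.RandomEffects. The variance-component estimators below are simple moment-based choices; packages use slightly different small-sample corrections, so the two estimates should be close but need not match exactly:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS, RandomEffects

# Simulated balanced panel where the RE assumption holds by construction
rng = np.random.default_rng(7)
N, T = 300, 6
ids = np.repeat(np.arange(N), T)
u = np.repeat(rng.normal(0, 1, N), T)
x = rng.normal(0, 1, N * T)
y = 1 + 2 * x + u + rng.normal(0, 1, N * T)
df = pd.DataFrame({'id': ids, 't': np.tile(np.arange(T), N), 'y': y, 'x': x})
dfp = df.set_index(['id', 't'])

# Step 1a: sigma_eps^2 from FE (within) residuals
fe = PanelOLS(dfp['y'], dfp[['x']], entity_effects=True).fit()
sigma_e2 = float(fe.resids.pow(2).sum() / (N * T - N - 1))

# Step 1b: sigma_u^2 from pooled OLS residual variance minus sigma_eps^2
pooled = sm.OLS(df['y'], sm.add_constant(df['x'])).fit()
sigma_u2 = max(pooled.resid.var() - sigma_e2, 0.0)

# Step 2: quasi-demean with the estimated theta and run OLS
theta = 1 - np.sqrt(sigma_e2 / (sigma_e2 + T * sigma_u2))
gm = df.groupby('id')[['y', 'x']].transform('mean')
y_star = df['y'] - theta * gm['y']
X_star = pd.DataFrame({'const': 1 - theta, 'x': df['x'] - theta * gm['x']})
manual = sm.OLS(y_star, X_star).fit()

re = RandomEffects(dfp['y'], dfp[['x']]).fit()
print(f"theta = {theta:.3f}")
print(f"Manual FGLS beta_x:   {manual.params['x']:.4f}")
print(f"RandomEffects beta_x: {re.params['x']:.4f}")
```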
FE vs RE: In-Depth Comparison
Comparison Table
| Dimension | Fixed Effects (FE) | Random Effects (RE) |
|---|---|---|
| Individual Effect | $u_i$ is a fixed parameter | $u_i$ is a random variable |
| Core Assumption | Allows $u_i$ correlated with $X_{it}$ | Requires $\text{Cov}(u_i, X_{it}) = 0$ |
| Estimation Method | Within transformation | GLS |
| Variation Used | Within only | Within + between |
| Efficiency | Relatively low (only uses within variation) | Relatively high (uses all variation) |
| Consistency | Consistent (even if $u_i$ correlated with $X_{it}$) | Consistent only when $\text{Cov}(u_i, X_{it}) = 0$ |
| Time-Invariant Variables | Cannot estimate | Can estimate |
| Applicable Scenarios | $u_i$ correlated with $X_{it}$ (endogeneity) | $u_i$ uncorrelated with $X_{it}$ (exogeneity) |
Core Trade-off: Efficiency vs Consistency
Consistency:
- As the sample size increases, the estimator converges to the true parameter
Efficiency:
- The estimator has smaller variance (smaller standard errors)
Trade-off:
- FE: Consistent (even with endogeneity), but less efficient (only uses within variation)
- RE: More efficient (uses all variation), but inconsistent if $u_i$ is correlated with $X_{it}$
Decision Rule:
- If $\text{Cov}(u_i, X_{it}) = 0$: RE is better (more efficient)
- If $u_i$ is correlated with $X_{it}$: FE is better (consistent)
- Key question: how do we decide which holds? → Hausman Test
Hausman Test: Decision Tool for FE vs RE
Logic of Hausman Test
Core Question: Is $u_i$ correlated with $X_{it}$?

Null Hypothesis: $H_0: \text{Cov}(u_i, X_{it}) = 0$ (both FE and RE are consistent; RE is efficient)

Alternative Hypothesis: $H_1: \text{Cov}(u_i, X_{it}) \neq 0$ (only FE is consistent)

Test Statistic:

$$H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[\widehat{\text{Var}}(\hat{\beta}_{FE}) - \widehat{\text{Var}}(\hat{\beta}_{RE})\right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \sim \chi^2(k)$$

where $k$ is the number of independent variables

Intuition:
- If $H_0$ holds ($\text{Cov}(u_i, X_{it}) = 0$), both FE and RE are consistent, so their estimates should be close
- If $H_1$ holds ($u_i$ correlated with $X_{it}$), FE is consistent but RE is not, so the estimates will differ significantly
Decision Rule:
- $p < 0.05$: Reject $H_0$ → Use FE
- $p \geq 0.05$: Fail to reject $H_0$ → Use RE
Python Implementation: Hausman Test
```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from linearmodels.panel import PanelOLS, RandomEffects, compare

# Simulate data: u_i correlated with X (FE should win)
np.random.seed(123)
N = 200
T = 5
data = []
for i in range(N):
    # Individual effect
    u_i = np.random.normal(0, 1)
    for t in range(T):
        # X correlated with u_i (violates the RE assumption!)
        x = 10 + 0.5 * u_i + np.random.normal(0, 2)
        y = 5 + 2 * x + u_i + np.random.normal(0, 1)
        data.append({'id': i, 'year': 2015 + t, 'y': y, 'x': x})

df = pd.DataFrame(data)
df_panel = df.set_index(['id', 'year'])

# Estimate FE and RE
model_fe = PanelOLS(df_panel['y'], df_panel[['x']],
                    entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)
model_re = RandomEffects(df_panel['y'], df_panel[['x']]).fit()

print("=" * 70)
print("FE vs RE Estimation Results")
print("=" * 70)
print("True parameter: 2.0000")
print(f"FE estimate: {model_fe.params['x']:.4f}")
print(f"RE estimate: {model_re.params['x']:.4f}")

# Hausman test (manual implementation, single regressor)
# Note: the textbook statistic assumes conventional covariances; with
# clustered FE covariances the variance difference can turn negative.
beta_diff = model_fe.params['x'] - model_re.params['x']
var_diff = model_fe.cov.loc['x', 'x'] - model_re.cov.loc['x', 'x']
hausman_stat = (beta_diff ** 2) / var_diff
p_value = 1 - chi2.cdf(hausman_stat, df=1)

print("\n" + "=" * 70)
print("Hausman Test")
print("=" * 70)
print(f"H statistic: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Reject H0, should use FE (RE is inconsistent)")
else:
    print("Conclusion: Fail to reject H0, should use RE "
          "(consistent and more efficient)")

# Use the linearmodels built-in comparison function
print("\n" + "=" * 70)
print("linearmodels Built-in Comparison")
print("=" * 70)
comparison = compare({'FE': model_fe, 'RE': model_re})
print(comparison)
```

Output Interpretation:
- If $p < 0.05$: FE and RE differ significantly → Use FE
- If $p \geq 0.05$: FE and RE don't differ significantly → Use RE (more efficient)
Practical Recommendations
Conservative Strategy (recommended):
- Report both FE and RE
- Conduct Hausman test
- Prioritize FE for main results (because endogeneity is common)
Exception Cases (may prioritize RE):
- Education research: students randomly assigned to schools
- Medical research: patients randomly assigned to hospitals
- Survey research: individuals randomly sampled from population
Economics Research:
- Typically use FE (because endogeneity almost always exists)
- RE often used for robustness checks
linearmodels.RandomEffects
Basic Syntax
```python
from linearmodels.panel import RandomEffects

# Set panel index
df_panel = df.set_index(['id', 'year'])

# Random effects regression
model_re = RandomEffects(
    dependent=df_panel['y'],
    exog=df_panel[['x1', 'x2']]
).fit()
print(model_re)
```

Complete Example: Corporate Capital Structure
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2
from linearmodels.panel import PanelOLS, PooledOLS, RandomEffects, compare

# Simulate corporate panel data
np.random.seed(2024)
N = 300  # 300 companies
T = 10   # 10 years
data = []
for i in range(N):
    # Company fixed effect (management style, industry characteristics, etc.)
    company_effect = np.random.normal(0, 0.1)
    # Industry (time-invariant)
    industry = np.random.choice(['Manufacturing', 'Services', 'Technology'],
                                p=[0.4, 0.3, 0.3])
    for t in range(T):
        year = 2010 + t
        # Profitability (ROA), correlated with the company effect
        roa = 0.05 + company_effect * 0.5 + np.random.normal(0, 0.02)
        # Company size (log(assets))
        log_assets = 10 + 0.1 * t + np.random.normal(0, 0.5)
        # Growth opportunities (Tobin's Q)
        tobins_q = 1.5 + np.random.normal(0, 0.3)
        # Leverage (dependent variable)
        # True parameters: roa=-0.3, log_assets=0.05, tobins_q=-0.1
        leverage = (0.3 - 0.3 * roa + 0.05 * log_assets -
                    0.1 * tobins_q + company_effect + np.random.normal(0, 0.05))
        data.append({
            'company_id': i, 'year': year, 'leverage': leverage,
            'roa': roa, 'log_assets': log_assets,
            'tobins_q': tobins_q, 'industry': industry
        })

df = pd.DataFrame(data)
# Industry dummies (dtype=float so linearmodels receives numeric columns)
df = pd.get_dummies(df, columns=['industry'], drop_first=True, dtype=float)

print("=" * 70)
print("Corporate Capital Structure Study")
print("=" * 70)
print(f"Sample size: {len(df):,}")
print(f"Number of companies: {df['company_id'].nunique()}")
print(f"Time span: {df['year'].min()} - {df['year'].max()}")

# Set panel index
df_panel = df.set_index(['company_id', 'year'])
exog_vars = ['roa', 'log_assets', 'tobins_q']

# Model 1: Pooled OLS (PooledOLS, so all models can go into compare())
model_pooled = PooledOLS(df_panel['leverage'],
                         sm.add_constant(df_panel[exog_vars])).fit()

# Model 2: Fixed effects
model_fe = PanelOLS(df_panel['leverage'], df_panel[exog_vars],
                    entity_effects=True).fit(cov_type='clustered',
                                             cluster_entity=True)

# Model 3: Random effects
model_re = RandomEffects(df_panel['leverage'], df_panel[exog_vars]).fit()

# Model 4: RE + industry dummies (RE can estimate time-invariant variables)
model_re_industry = RandomEffects(
    df_panel['leverage'],
    df_panel[exog_vars + ['industry_Services', 'industry_Technology']]
).fit()

# Hausman test (textbook form; assumes conventional covariances)
beta_diff = model_fe.params - model_re.params
var_diff = model_fe.cov - model_re.cov
hausman_stat = float(beta_diff.T @ np.linalg.inv(var_diff) @ beta_diff)
p_value = 1 - chi2.cdf(hausman_stat, df=len(beta_diff))

print("\n" + "=" * 70)
print("Hausman Test")
print("=" * 70)
print(f"H statistic: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Conclusion: {'Use FE' if p_value < 0.05 else 'Use RE'}")

# Compare results
print("\n" + "=" * 70)
print("Regression Results Comparison")
print("=" * 70)
print(compare({'Pooled OLS': model_pooled, 'FE': model_fe, 'RE': model_re}))

print("\n" + "=" * 70)
print("RE + Industry Dummies (Utilizing RE's Advantage)")
print("=" * 70)
print(model_re_industry.summary)

# Interpret coefficients (leverage and ROA are both in decimal form)
print("\n" + "=" * 70)
print("Economic Interpretation")
print("=" * 70)
print(f"ROA coefficient (FE): {model_fe.params['roa']:.4f}")
print(" → A 1 percentage point increase in ROA changes leverage by "
      f"{model_fe.params['roa']:.2f} percentage points")
print(f"\nlog(assets) coefficient (FE): {model_fe.params['log_assets']:.4f}")
print(" → Doubling company size (log assets up by 0.693) changes leverage by "
      f"{model_fe.params['log_assets'] * 0.693 * 100:.2f} percentage points")
```

Output Interpretation:
- Hausman Test: If rejected, use FE; otherwise use RE
- RE's Advantage: Can estimate industry dummies (time-invariant)
- Economic Meaning:
- Negative ROA coefficient: High-profit companies reduce debt (pecking order theory)
- Positive size coefficient: Large companies easier to obtain debt financing
RE's Advantage Scenarios
Scenario 1: Estimating Time-Invariant Variables
Example: Studying gender wage gap
```python
# FE cannot estimate the gender coefficient (time-invariant):
# the within transformation absorbs any variable that is constant per individual.
# model_fe = PanelOLS(df_panel['log_wage'], df_panel[['education', 'gender']],
#                     entity_effects=True).fit()  # → gender dropped/absorbed

# RE keeps between variation, so the gender coefficient is identified
# (sketch; assumes df_panel has hypothetical columns log_wage, education, gender)
model_re = RandomEffects(df_panel['log_wage'],
                         df_panel[['education', 'gender']]).fit()
```

Note: Only if gender is uncorrelated with the individual effect is the RE estimate consistent.
Scenario 2: Small Within Variation
Example: Studying effect of education on wages (short panel)
If the panel's time span is short (e.g., 2-3 years), education barely changes within individuals:
- FE only uses within variation (almost zero) → large standard errors
- RE also uses between variation (large) → more precise estimates
Trade-off (illustrated in the sketch below):
- FE: Consistent but imprecise
- RE: Precise but possibly inconsistent (if endogeneity exists)
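A minimal simulation (parameters are illustrative assumptions; the RE exogeneity assumption holds here by construction) makes the precision gap visible:

```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS, RandomEffects

# When a regressor barely varies within individuals, FE standard errors
# blow up while RE stays precise.
rng = np.random.default_rng(1)
N, T = 300, 3
ids = np.repeat(np.arange(N), T)
educ = np.repeat(rng.normal(12, 2, N), T) + 0.05 * rng.normal(size=N * T)
u = np.repeat(rng.normal(0, 1, N), T)   # individual effect, uncorrelated with educ
log_wage = 1 + 0.1 * educ + u + rng.normal(0, 0.5, N * T)

dfp = pd.DataFrame({'id': ids, 't': np.tile(np.arange(T), N),
                    'log_wage': log_wage, 'educ': educ}).set_index(['id', 't'])
fe = PanelOLS(dfp['log_wage'], dfp[['educ']], entity_effects=True).fit()
re = RandomEffects(dfp['log_wage'], dfp[['educ']]).fit()
print(f"FE: beta = {fe.params['educ']:.3f}, SE = {fe.std_errors['educ']:.3f}")
print(f"RE: beta = {re.params['educ']:.3f}, SE = {re.std_errors['educ']:.3f}")
```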
Scenario 3: Random Sampling
Example: Randomly sample 100 schools from nationwide schools
If schools are randomly sampled, the school effect is unlikely to be correlated with student characteristics:
- The RE assumption is more plausible
- RE estimation is more efficient
Contrast:
- If studying a specific, non-randomly chosen set of 100 schools, FE is more appropriate
Section Summary
Key Points
Essence of RE:
- Individual effects $u_i$ are random variables drawn from a distribution
- Core assumption: $\text{Cov}(u_i, X_{it}) = 0$ (exogeneity)
GLS Estimation:
- Quasi-demeaning transformation: $\theta$ depends on the variance ratio
- RE is a weighted average of FE and pooled OLS
FE vs RE:
- Efficiency: RE > FE (uses all variation)
- Consistency: FE always consistent; RE consistent only when $\text{Cov}(u_i, X_{it}) = 0$
- Time-invariant variables: FE cannot estimate, RE can
Hausman Test:
- Tests whether $u_i$ is correlated with $X_{it}$
- $p < 0.05$: Use FE
- $p \geq 0.05$: Use RE
Practical Recommendations:
- Economics research: Prioritize FE (endogeneity common)
- Education/medical research: Consider RE (random sampling)
- Robustness check: Report both FE and RE
RE's Advantage Scenarios:
- Need to estimate time-invariant variables
- Small within variation
- Individual random sampling
Decision Tree
Start
 ↓
Need to estimate time-invariant variables?
 ├─ Yes → Use RE (if the Hausman test passes)
 └─ No → Estimate both FE and RE, conduct Hausman test
           ↓
       Hausman test p < 0.05?
        ├─ Yes → Use FE (RE is inconsistent)
        └─ No → Use RE (more efficient)

Next Steps
In Section 8.5: Advanced Panel Data Topics, we will learn:
- Two-way fixed effects (Two-Way FE) detailed explanation
- Correct use of clustered standard errors
- Dynamic panel models (Arellano-Bond)
- Handling unbalanced panels
Choose wisely between efficiency and consistency!