4.3 SciPy.stats & LinearModels
"Without data, you're just another person with an opinion."— W. Edwards Deming, Statistician
Statistical Testing and Professional Econometric Tools
Part A: SciPy.stats — Swiss Army Knife of Statistical Inference
Core Functionality
Positioning: quick statistical tests when rich model output and diagnostics are not required
Advantages:
- Functional API (simple and direct)
- Covers all classical tests
- Probability distribution operations
- Fast
Hypothesis Testing
1. t-tests
```python
from scipy import stats
import numpy as np
# Data
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
# (1) One-sample t-test
t_stat, p_value = stats.ttest_1samp(group1, popmean=25)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# (2) Independent samples t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# (3) Paired t-test
before = [100, 105, 110, 115, 120]
after = [105, 108, 112, 118, 125]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t = {t_stat:.3f}, p = {p_value:.4f}")
# (4) Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
```
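Not sure whether the equal-variance assumption holds? One common workflow (a sketch, not the only valid approach) is to check with Levene's test first and fall back to Welch's version when variances differ:

```python
from scipy import stats

group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]

# Levene's test: H0 is that the groups have equal variances
stat, p = stats.levene(group1, group2)
print(f"Levene: W = {stat:.3f}, p = {p:.4f}")

# Rule of thumb: if p < 0.05, prefer Welch's t-test (equal_var=False)
```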
2. Chi-square Test
```python
# Contingency table
contingency_table = [[10, 20, 30],
                     [15, 25, 35]]
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.4f}, df = {dof}")3. Analysis of Variance (ANOVA)
3. Analysis of Variance (ANOVA)
```python
# One-way ANOVA
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
group3 = [20, 22, 24, 26, 28]
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")4. Normality Tests
4. Normality Tests
```python
data = np.random.normal(0, 1, 100)
# Shapiro-Wilk (small samples)
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk: p = {p:.4f}")
# Kolmogorov-Smirnov
stat, p = stats.kstest(data, 'norm')
print(f"KS test: p = {p:.4f}")5. Correlation Tests
5. Correlation Tests
```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Pearson
r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")
# Spearman (non-parametric)
rho, p = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p:.4f}")
# Kendall's Tau
tau, p = stats.kendalltau(x, y)
print(f"Kendall τ = {tau:.3f}, p = {p:.4f}")Probability Distributions
Probability Distributions
```python
from scipy.stats import norm, t, chi2, f
# Normal distribution
print(f"P(Z < 1.96) = {norm.cdf(1.96):.4f}") # 0.9750
print(f"95% quantile = {norm.ppf(0.95):.4f}") # 1.6449
# t-distribution
print(f"t(df=10) 95% quantile = {t.ppf(0.95, df=10):.4f}")
# Generate random numbers
random_sample = norm.rvs(loc=0, scale=1, size=1000)
```
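These quantile functions are the building blocks of interval estimation. For example, a 95% confidence interval for a mean combines the t quantile with the standard error of the mean — a minimal sketch:

```python
import numpy as np
from scipy import stats

data = np.random.normal(100, 15, 30)
mean = data.mean()
sem = stats.sem(data)  # standard error of the mean

# 95% CI: mean ± t(0.975, n-1) * SEM
low, high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"Mean = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```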
Part B: LinearModels — Professional Econometric Tools
Why LinearModels?
Statsmodels Limitations:
- Limited panel data support
- Incomplete instrumental variable functionality
- Cluster-robust standard errors require manual setup
LinearModels Advantages:
- Designed specifically for panel data
- Comprehensive IV estimation
- Cluster-robust standard errors
- System GMM
Installation:
```bash
pip install linearmodels
```
Panel Data Regression
1. Fixed Effects (FE)
```python
from linearmodels.panel import PanelOLS
import pandas as pd
# Panel data (MultiIndex)
df = pd.read_csv('wage_panel.csv')
df = df.set_index(['person_id', 'year'])
# Fixed effects regression
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True  # individual fixed effects
).fit(cov_type='clustered', cluster_entity=True)
print(model)
```
Output Interpretation:
```
                       PanelOLS Estimation Summary
================================================================================
Dep. Variable:        log_wage   R-squared:                  0.8234
Estimator:            PanelOLS   R-squared (Between):        0.7123
No. Observations:         5000   R-squared (Within):         0.6543
Entity Effects:            Yes   F-statistic:               1234.56
Time Effects:               No   P-value (F-stat):           0.0000
Cov. Estimator:      Clustered

                          Parameter Estimates
==============================================================================
             Parameter  Std. Err.   T-stat  P-value  Lower CI  Upper CI
------------------------------------------------------------------------------
education       0.0850     0.0123    6.911   0.0000    0.0609    0.1091
experience      0.0234     0.0045    5.200   0.0000    0.0146    0.0322
==============================================================================
```
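The fixed effects themselves are absorbed rather than listed in the summary. If you need them, the results object exposes them via `estimated_effects` (a sketch continuing the fitted `model` above; it assumes the entity index level is named `person_id` as in the example data):

```python
# Per-observation estimated fixed effects (entity effects here)
effects = model.estimated_effects
print(effects.head())

# Average effect per individual, e.g. to rank entities
person_effects = effects.groupby(level='person_id').mean()
```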
2. Random Effects (RE)
```python
from linearmodels.panel import RandomEffects
model = RandomEffects(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']]
).fit()
print(model)
```
3. Hausman Test (FE vs RE)
```python
# Fit FE and RE with the same regressors
fe_model = PanelOLS(df['log_wage'], df[['education', 'experience']],
                    entity_effects=True).fit()
re_model = RandomEffects(df['log_wage'], df[['education', 'experience']]).fit()
# Hausman test (manual)
import numpy as np
from scipy.stats import chi2
# Statistic: H = (b_FE - b_RE)' [Var(b_FE) - Var(b_RE)]^(-1) (b_FE - b_RE)
diff = (fe_model.params - re_model.params).values
var_diff = (fe_model.cov - re_model.cov).values  # assumed positive definite
hausman_stat = diff.T @ np.linalg.inv(var_diff) @ diff
p_value = 1 - chi2.cdf(hausman_stat, df=len(diff))
print(f"Hausman statistic: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")
# Interpretation:
# p < 0.05 → reject RE, use FE
# p ≥ 0.05 → fail to reject RE (RE is consistent and more efficient)
```
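linearmodels also ships a `compare` helper that lines up several fitted panel models in one table, which is convenient alongside the Hausman test when weighing FE against RE (using the `fe_model` and `re_model` fitted above):

```python
from linearmodels.panel import compare

# Side-by-side coefficients, standard errors, and fit statistics
print(compare({'FE': fe_model, 'RE': re_model}))
```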
4. Two-way Fixed Effects
```python
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,  # individual fixed effects
    time_effects=True     # time fixed effects
).fit(cov_type='clustered', cluster_entity=True)
```
Instrumental Variables (IV)
1. 2SLS Regression
```python
from linearmodels.iv import IV2SLS
# Example: Returns to education (ability is endogenous)
# Model: log(wage) = β₀ + β₁*education + β₂*ability + ε
# Problem: ability is unobservable and correlated with education
# Instrument: father's education (father_education)
model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],              # exogenous variables
    endog=df[['education']],              # endogenous variables
    instruments=df[['father_education']]  # instruments
).fit(cov_type='robust')
print(model)
```
Key Output:
```
                     IV-2SLS Estimation Summary
================================================================================
Endogenous: education            Instruments: father_education
Exogenous:  experience
...
First Stage Estimation Results:
  education = γ₀ + γ₁*experience + γ₂*father_education + ν
  F-statistic: 156.34    # Need > 10 (strong instrument)
...
```
2. Weak Instrument Test
```python
# First-stage F-statistic for the endogenous regressor
first_stage_fstat = model.first_stage.diagnostics.loc['education', 'f.stat']
print(f"First stage F-statistic: {first_stage_fstat:.2f}")
# Rule of thumb:
# F > 10: strong instrument
# F < 10: weak instrument (2SLS results unreliable)
```
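Before committing to IV at all, you can test whether education is actually endogenous: the fitted IV2SLS results expose the Durbin and Wu-Hausman statistics (a sketch on the `model` above):

```python
# H0: education is exogenous (OLS would be consistent and more efficient)
print(model.durbin())
print(model.wu_hausman())
# p < 0.05 → evidence of endogeneity, so IV is warranted
```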
3. Overidentification Test
```python
# Requires more instruments than endogenous variables (overidentified model)
model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],
    endog=df[['education']],
    instruments=df[['father_education', 'mother_education']]  # 2 IVs
).fit()
# Sargan-Hansen J statistic (calculated automatically)
print(f"J-statistic: {model.sargan.stat:.3f}")
print(f"p-value: {model.sargan.pval:.4f}")
# Interpretation:
# p > 0.05 → cannot reject instrument exogeneity (consistent with valid instruments)
```
Complete Panel Data Example
```python
import pandas as pd
import numpy as np
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS
# 1. Construct panel data
np.random.seed(42)
n_individuals = 500
n_years = 5
data = []
for i in range(n_individuals):
    for t in range(n_years):
        education = 12 + np.random.randint(0, 9)
        experience = t + np.random.randint(0, 10)
        data.append({
            'person_id': i,
            'year': 2016 + t,
            # True effects built in: 0.08 per year of education,
            # 0.02 per year of experience, plus a 0.05 time trend
            'log_wage': (10 + 0.08 * education + 0.02 * experience
                         + 0.05 * t + np.random.normal(0, 0.5)),
            'education': education,
            'experience': experience,
        })
df = pd.DataFrame(data)
df = df.set_index(['person_id', 'year'])
# 2. Fixed effects regression
fe_model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)
print("=" * 70)
print("Fixed Effects Regression Results")
print("=" * 70)
print(fe_model)
# 3. Check F-statistic
print(f"\nF-statistic: {fe_model.f_statistic.stat:.2f}")
print(f"p-value: {fe_model.f_statistic.pval:.4f}")
# 4. Extract coefficients
print(f"\nReturns to education: {fe_model.params['education']:.4f}")
print(f"Standard error: {fe_model.std_errors['education']:.4f}")Summary
SciPy.stats Core Functions
| Task | Function | Purpose |
|---|---|---|
| t-test | `ttest_1samp()`, `ttest_ind()`, `ttest_rel()` | Mean comparison |
| Chi-square | `chi2_contingency()` | Independence test |
| ANOVA | `f_oneway()` | Multiple-group comparison |
| Correlation | `pearsonr()`, `spearmanr()`, `kendalltau()` | Correlation measures |
| Normality | `shapiro()`, `kstest()` | Distribution tests |
LinearModels Core Functionality
| Task | Class | Purpose |
|---|---|---|
| Panel OLS | `PanelOLS` | Fixed-effects regression |
| Random Effects | `RandomEffects` | Random-effects models |
| 2SLS | `IV2SLS` | Instrumental variables |
| GMM | `IVGMM` | GMM-based IV estimation |
When to Use?
| Scenario | Tool |
|---|---|
| Rapid hypothesis testing | `scipy.stats` |
| Panel data | `linearmodels.PanelOLS` |
| Instrumental variables | `linearmodels.IV2SLS` |
| Detailed diagnostics | `statsmodels` |
Next Steps
Next Section: Integrated Workflow
References:
- SciPy Documentation: https://docs.scipy.org/doc/scipy/reference/stats.html
- LinearModels Documentation: https://bashtage.github.io/linearmodels/