Skip to content

4.3 SciPy.stats & LinearModels

"Without data, you're just another person with an opinion."— W. Edwards Deming, Statistician

Statistical Testing and Professional Econometric Tools


Part A: SciPy.stats — Swiss Army Knife of Statistical Inference

Core Functionality

Positioning: Rapid statistical testing without complex output requirements

Advantages:

  • Functional API (simple and direct)
  • Covers all classical tests
  • Probability distribution operations
  • Fast

Hypothesis Testing

1. t-tests

python
from scipy import stats
import numpy as np

# Data
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]

# (1) One-sample t-test
t_stat, p_value = stats.ttest_1samp(group1, popmean=25)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (2) Independent samples t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (3) Paired t-test
before = [100, 105, 110, 115, 120]
after = [105, 108, 112, 118, 125]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t = {t_stat:.3f}, p = {p_value:.4f}")

# (4) Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

2. Chi-square Test

python
# Contingency table
contingency_table = [[10, 20, 30],
                     [15, 25, 35]]

chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.4f}, df = {dof}")

3. Analysis of Variance (ANOVA)

python
# One-way ANOVA
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
group3 = [20, 22, 24, 26, 28]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

4. Normality Tests

python
data = np.random.normal(0, 1, 100)

# Shapiro-Wilk (small samples)
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk: p = {p:.4f}")

# Kolmogorov-Smirnov
stat, p = stats.kstest(data, 'norm')
print(f"KS test: p = {p:.4f}")

5. Correlation Tests

python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Pearson
r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")

# Spearman (non-parametric)
rho, p = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p:.4f}")

# Kendall's Tau
tau, p = stats.kendalltau(x, y)
print(f"Kendall τ = {tau:.3f}, p = {p:.4f}")

Probability Distributions

python
from scipy.stats import norm, t, chi2, f

# Normal distribution
print(f"P(Z < 1.96) = {norm.cdf(1.96):.4f}")  # 0.9750
print(f"95% quantile = {norm.ppf(0.95):.4f}")   # 1.6449

# t-distribution
print(f"t(df=10) 95% quantile = {t.ppf(0.95, df=10):.4f}")

# Generate random numbers
random_sample = norm.rvs(loc=0, scale=1, size=1000)

Part B: LinearModels — Professional Econometric Tools

Why LinearModels?

Statsmodels Limitations:

  • Limited panel data support
  • Incomplete instrumental variable functionality
  • Complex cluster-robust standard error calculation

LinearModels Advantages:

  • Designed specifically for panel data
  • Comprehensive IV estimation
  • Cluster-robust standard errors
  • System GMM

Installation:

bash
pip install linearmodels

Panel Data Regression

1. Fixed Effects (FE)

python
from linearmodels.panel import PanelOLS
import pandas as pd

# Panel data (MultiIndex)
df = pd.read_csv('wage_panel.csv')
df = df.set_index(['person_id', 'year'])

# Fixed effects regression
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True  # Individual fixed effects
).fit(cov_type='clustered', cluster_entity=True)

print(model)

Output Interpretation:

                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:               log_wage   R-squared:                        0.8234
Estimator:                   PanelOLS   R-squared (Between):              0.7123
No. Observations:                5000   R-squared (Within):               0.6543
Entity Effects:                   Yes   F-statistic:                    1234.56
Time Effects:                      No   P-value (F-stat):                0.0000
Cov. Estimator:             Clustered

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
education      0.0850     0.0123      6.911     0.0000      0.0609      0.1091
experience     0.0234     0.0045      5.200     0.0000      0.0146      0.0322
==============================================================================

2. Random Effects (RE)

python
from linearmodels.panel import RandomEffects

model = RandomEffects(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']]
).fit()

print(model)

3. Hausman Test (FE vs RE)

python
# Fit FE and RE separately
fe_model = PanelOLS(df['log_wage'], df[['education', 'experience']],
                    entity_effects=True).fit()

re_model = RandomEffects(df['log_wage'], df[['education', 'experience']]).fit()

# Hausman test (manual)
from scipy.stats import chi2

# Compute test statistic
diff = fe_model.params - re_model.params
var_diff = fe_model.cov - re_model.cov
hausman_stat = diff.T @ np.linalg.inv(var_diff) @ diff
p_value = 1 - chi2.cdf(hausman_stat, df=len(diff))

print(f"Hausman statistic: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")

# Interpretation:
# p < 0.05 → Reject RE, use FE
# p > 0.05 → Accept RE

4. Two-way Fixed Effects

python
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,  # Individual fixed effects
    time_effects=True     # Time fixed effects
).fit(cov_type='clustered', cluster_entity=True)

Instrumental Variables (IV)

1. 2SLS Regression

python
from linearmodels.iv import IV2SLS

# Example: Returns to education (ability is endogenous)
# Model: log(wage) = β₀ + β₁*education + β₂*ability + ε
# Problem: ability is unobservable and correlated with education
# Instrument: father's education (father_education)

model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],  # Exogenous variables
    endog=df[['education']],  # Endogenous variables
    instruments=df[['father_education']]  # Instruments
).fit(cov_type='robust')

print(model)

Key Output:

                          IV-2SLS Estimation Summary
================================================================================
Endogenous:                  education   Instruments:            father_education
Exogenous:                  experience
...
First Stage Estimation Results:
education = γ₀ + γ₁*experience + γ₂*father_education + ν

F-statistic:                     156.34   # Need > 10 (strong instrument)
...

2. Weak Instrument Test

python
# First stage F-statistic
first_stage_fstat = model.first_stage.diagnostics['f.stat']['education']

print(f"First stage F-statistic: {first_stage_fstat:.2f}")

# Criteria:
# F > 10: Strong instrument
# F < 10: Weak instrument (results unreliable)

3. Overidentification Test

python
# If instruments exceed endogenous variables
model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],
    endog=df[['education']],
    instruments=df[['father_education', 'mother_education']]  # 2 IVs
).fit()

# Sargan-Hansen J statistic (automatically calculated)
print(f"J-statistic: {model.sargan.stat:.3f}")
print(f"p-value: {model.sargan.pval:.4f}")

# Interpretation:
# p > 0.05 → Instruments are valid (cannot reject exogeneity)

Complete Panel Data Example

python
import pandas as pd
import numpy as np
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS

# 1. Construct panel data
np.random.seed(42)
n_individuals = 500
n_years = 5

data = []
for i in range(n_individuals):
    for t in range(n_years):
        data.append({
            'person_id': i,
            'year': 2016 + t,
            'log_wage': 10 + 0.05*t + np.random.normal(0, 0.5),
            'education': 12 + np.random.randint(0, 9),
            'experience': t + np.random.randint(0, 10)
        })

df = pd.DataFrame(data)
df = df.set_index(['person_id', 'year'])

# 2. Fixed effects regression
fe_model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

print("=" * 70)
print("Fixed Effects Regression Results")
print("=" * 70)
print(fe_model)

# 3. Check F-statistic
print(f"\nF-statistic: {fe_model.f_statistic.stat:.2f}")
print(f"p-value: {fe_model.f_statistic.pval:.4f}")

# 4. Extract coefficients
print(f"\nReturns to education: {fe_model.params['education']:.4f}")
print(f"Standard error: {fe_model.std_errors['education']:.4f}")

Summary

SciPy.stats Core Functions

TaskFunctionPurpose
t-testttest_ind(), ttest_1samp()Mean comparison
Chi-squarechi2_contingency()Independence test
ANOVAf_oneway()Multiple group comparison
Correlationpearsonr(), spearmanr()Correlation
Normalityshapiro(), kstest()Distribution test

LinearModels Core Functionality

TaskClassPurpose
Panel OLSPanelOLSFE/RE regression
Random EffectsRandomEffectsRE models
2SLSIV2SLSInstrumental variables
GMMIVGMMSystem GMM

When to Use?

ScenarioTool
Rapid hypothesis testingscipy.stats
Panel datalinearmodels.PanelOLS
Instrumental variableslinearmodels.IV2SLS
Detailed diagnosticsstatsmodels

Next Steps

Next Section: Integrated Workflow


References:

Released under the MIT License. Content © Author.