4.3 SciPy.stats & LinearModels

"Without data, you're just another person with an opinion."— W. Edwards Deming, Statistician

Statistical Testing and Professional Econometric Tools

Part A: SciPy.stats — Swiss Army Knife of Statistical Inference

Core Functionality

Positioning: Rapid statistical testing without complex output requirements

Advantages:

Functional API (simple and direct)
Covers all classical tests
Probability distribution operations
Fast

Hypothesis Testing

1. t-tests

python

from scipy import stats
import numpy as np

# Data
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]

# (1) One-sample t-test
t_stat, p_value = stats.ttest_1samp(group1, popmean=25)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (2) Independent samples t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# (3) Paired t-test
before = [100, 105, 110, 115, 120]
after = [105, 108, 112, 118, 125]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t = {t_stat:.3f}, p = {p_value:.4f}")

# (4) Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

2. Chi-square Test

python

# Contingency table
contingency_table = [[10, 20, 30],
                     [15, 25, 35]]

chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.4f}, df = {dof}")

3. Analysis of Variance (ANOVA)

python

# One-way ANOVA
group1 = [23, 25, 27, 29, 31]
group2 = [18, 20, 22, 24, 26]
group3 = [20, 22, 24, 26, 28]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

4. Normality Tests

python

data = np.random.normal(0, 1, 100)

# Shapiro-Wilk (small samples)
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk: p = {p:.4f}")

# Kolmogorov-Smirnov
stat, p = stats.kstest(data, 'norm')
print(f"KS test: p = {p:.4f}")

5. Correlation Tests

python

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Pearson
r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")

# Spearman (non-parametric)
rho, p = stats.spearmanr(x, y)
print(f"Spearman ρ = {rho:.3f}, p = {p:.4f}")

# Kendall's Tau
tau, p = stats.kendalltau(x, y)
print(f"Kendall τ = {tau:.3f}, p = {p:.4f}")

Probability Distributions

python

from scipy.stats import norm, t, chi2, f

# Normal distribution
print(f"P(Z < 1.96) = {norm.cdf(1.96):.4f}")  # 0.9750
print(f"95% quantile = {norm.ppf(0.95):.4f}")   # 1.6449

# t-distribution
print(f"t(df=10) 95% quantile = {t.ppf(0.95, df=10):.4f}")

# Generate random numbers
random_sample = norm.rvs(loc=0, scale=1, size=1000)

Part B: LinearModels — Professional Econometric Tools

Why LinearModels?

Statsmodels Limitations:

Limited panel data support
Incomplete instrumental variable functionality
Complex cluster-robust standard error calculation

LinearModels Advantages:

Designed specifically for panel data
Comprehensive IV estimation
Cluster-robust standard errors
System GMM

Installation:

bash

pip install linearmodels

Panel Data Regression

1. Fixed Effects (FE)

python

from linearmodels.panel import PanelOLS
import pandas as pd

# Panel data (MultiIndex)
df = pd.read_csv('wage_panel.csv')
df = df.set_index(['person_id', 'year'])

# Fixed effects regression
model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True  # Individual fixed effects
).fit(cov_type='clustered', cluster_entity=True)

print(model)

Output Interpretation:

                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:               log_wage   R-squared:                        0.8234
Estimator:                   PanelOLS   R-squared (Between):              0.7123
No. Observations:                5000   R-squared (Within):               0.6543
Entity Effects:                   Yes   F-statistic:                    1234.56
Time Effects:                      No   P-value (F-stat):                0.0000
Cov. Estimator:             Clustered

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
education      0.0850     0.0123      6.911     0.0000      0.0609      0.1091
experience     0.0234     0.0045      5.200     0.0000      0.0146      0.0322
==============================================================================

2. Random Effects (RE)

python

from linearmodels.panel import RandomEffects

model = RandomEffects(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']]
).fit()

print(model)

3. Hausman Test (FE vs RE)

python

# Fit FE and RE separately
fe_model = PanelOLS(df['log_wage'], df[['education', 'experience']],
                    entity_effects=True).fit()

re_model = RandomEffects(df['log_wage'], df[['education', 'experience']]).fit()

# Hausman test (manual)
from scipy.stats import chi2

# Compute test statistic
diff = fe_model.params - re_model.params
var_diff = fe_model.cov - re_model.cov
hausman_stat = diff.T @ np.linalg.inv(var_diff) @ diff
p_value = 1 - chi2.cdf(hausman_stat, df=len(diff))

print(f"Hausman statistic: {hausman_stat:.3f}")
print(f"p-value: {p_value:.4f}")

# Interpretation:
# p < 0.05 → Reject RE, use FE
# p > 0.05 → Accept RE

4. Two-way Fixed Effects

python

model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,  # Individual fixed effects
    time_effects=True     # Time fixed effects
).fit(cov_type='clustered', cluster_entity=True)

Instrumental Variables (IV)

1. 2SLS Regression

python

from linearmodels.iv import IV2SLS

# Example: Returns to education (ability is endogenous)
# Model: log(wage) = β₀ + β₁*education + β₂*ability + ε
# Problem: ability is unobservable and correlated with education
# Instrument: father's education (father_education)

model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],  # Exogenous variables
    endog=df[['education']],  # Endogenous variables
    instruments=df[['father_education']]  # Instruments
).fit(cov_type='robust')

print(model)

Key Output:

                          IV-2SLS Estimation Summary
================================================================================
Endogenous:                  education   Instruments:            father_education
Exogenous:                  experience
...
First Stage Estimation Results:
education = γ₀ + γ₁*experience + γ₂*father_education + ν

F-statistic:                     156.34   # Need > 10 (strong instrument)
...

2. Weak Instrument Test

python

# First stage F-statistic
first_stage_fstat = model.first_stage.diagnostics['f.stat']['education']

print(f"First stage F-statistic: {first_stage_fstat:.2f}")

# Criteria:
# F > 10: Strong instrument
# F < 10: Weak instrument (results unreliable)

3. Overidentification Test

python

# If instruments exceed endogenous variables
model = IV2SLS(
    dependent=df['log_wage'],
    exog=df[['experience']],
    endog=df[['education']],
    instruments=df[['father_education', 'mother_education']]  # 2 IVs
).fit()

# Sargan-Hansen J statistic (automatically calculated)
print(f"J-statistic: {model.sargan.stat:.3f}")
print(f"p-value: {model.sargan.pval:.4f}")

# Interpretation:
# p > 0.05 → Instruments are valid (cannot reject exogeneity)

Complete Panel Data Example

python

import pandas as pd
import numpy as np
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS

# 1. Construct panel data
np.random.seed(42)
n_individuals = 500
n_years = 5

data = []
for i in range(n_individuals):
    for t in range(n_years):
        data.append({
            'person_id': i,
            'year': 2016 + t,
            'log_wage': 10 + 0.05*t + np.random.normal(0, 0.5),
            'education': 12 + np.random.randint(0, 9),
            'experience': t + np.random.randint(0, 10)
        })

df = pd.DataFrame(data)
df = df.set_index(['person_id', 'year'])

# 2. Fixed effects regression
fe_model = PanelOLS(
    dependent=df['log_wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

print("=" * 70)
print("Fixed Effects Regression Results")
print("=" * 70)
print(fe_model)

# 3. Check F-statistic
print(f"\nF-statistic: {fe_model.f_statistic.stat:.2f}")
print(f"p-value: {fe_model.f_statistic.pval:.4f}")

# 4. Extract coefficients
print(f"\nReturns to education: {fe_model.params['education']:.4f}")
print(f"Standard error: {fe_model.std_errors['education']:.4f}")

Summary

SciPy.stats Core Functions

Task	Function	Purpose
t-test	`ttest_ind()`, `ttest_1samp()`	Mean comparison
Chi-square	`chi2_contingency()`	Independence test
ANOVA	`f_oneway()`	Multiple group comparison
Correlation	`pearsonr()`, `spearmanr()`	Correlation
Normality	`shapiro()`, `kstest()`	Distribution test

LinearModels Core Functionality

Task	Class	Purpose
Panel OLS	`PanelOLS`	FE/RE regression
Random Effects	`RandomEffects`	RE models
2SLS	`IV2SLS`	Instrumental variables
GMM	`IVGMM`	System GMM

When to Use?

Scenario	Tool
Rapid hypothesis testing	scipy.stats
Panel data	linearmodels.PanelOLS
Instrumental variables	linearmodels.IV2SLS
Detailed diagnostics	statsmodels

Next Steps

Next Section: Integrated Workflow

References:

SciPy Documentation: https://docs.scipy.org/doc/scipy/reference/stats.html
LinearModels Documentation: https://bashtage.github.io/linearmodels/

4.3 SciPy.stats & LinearModels ​

Part A: SciPy.stats — Swiss Army Knife of Statistical Inference ​

Core Functionality ​

Hypothesis Testing ​

1. t-tests ​

2. Chi-square Test ​

3. Analysis of Variance (ANOVA) ​

4. Normality Tests ​

5. Correlation Tests ​

Probability Distributions ​

Part B: LinearModels — Professional Econometric Tools ​

Why LinearModels? ​

Panel Data Regression ​

1. Fixed Effects (FE) ​

2. Random Effects (RE) ​

3. Hausman Test (FE vs RE) ​

4. Two-way Fixed Effects ​

Instrumental Variables (IV) ​

1. 2SLS Regression ​

2. Weak Instrument Test ​

3. Overidentification Test ​

Complete Panel Data Example ​

Summary ​

SciPy.stats Core Functions ​

LinearModels Core Functionality ​

When to Use? ​

Next Steps ​

4.3 SciPy.stats & LinearModels

Part A: SciPy.stats — Swiss Army Knife of Statistical Inference

Core Functionality

Hypothesis Testing

1. t-tests

2. Chi-square Test

3. Analysis of Variance (ANOVA)

4. Normality Tests

5. Correlation Tests

Probability Distributions

Part B: LinearModels — Professional Econometric Tools

Why LinearModels?

Panel Data Regression

1. Fixed Effects (FE)

2. Random Effects (RE)

3. Hausman Test (FE vs RE)

4. Two-way Fixed Effects

Instrumental Variables (IV)

1. 2SLS Regression

2. Weak Instrument Test

3. Overidentification Test

Complete Panel Data Example

Summary

SciPy.stats Core Functions

LinearModels Core Functionality

When to Use?

Next Steps