4.2 Statsmodels Essentials
"Make everything as simple as possible, but not simpler."— Albert Einstein, Physicist
The Foundation of Python Statistical Analysis
Section Objectives
- Master the core Statsmodels API
- Understand OLS, GLM, and time series modeling
- Learn model diagnostics and robust standard errors
- Use formula interface for rapid modeling
Statsmodels Overview
Positioning: Python's Stata — First choice for classical statistical analysis
Core Features:
- Publication-quality output (detailed statistical tables)
- Rich diagnostic tools
- Heteroskedasticity-robust standard errors
- Comprehensive time series support
Installation:
bash
pip install statsmodels
# or
conda install -c conda-forge statsmodels
Two API Styles
API 1: Standard Interface (Recommended for Programming)
python
import statsmodels.api as sm
import pandas as pd
import numpy as np
# 1. Prepare data
X = df[['education', 'experience']]
X = sm.add_constant(X) # Must manually add intercept
y = df['wage']
# 2. Fit model
model = sm.OLS(y, X).fit()
# 3. View results
print(model.summary())
API 2: Formula Interface (Recommended for Interactive Analysis)
python
import statsmodels.formula.api as smf
# R-style formula
model = smf.ols('wage ~ education + experience', data=df).fit()
print(model.summary())
# Advantages:
# Automatically adds intercept
# Automatically handles categorical variables
# Supports transformations and interactions
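As a quick sanity check, the two interfaces fit the same model and should agree on the estimates; a minimal sketch, assuming a df with wage, education, and experience columns as in the snippets above:
python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Same model via both APIs; coefficient estimates should match
m1 = sm.OLS(df['wage'], sm.add_constant(df[['education', 'experience']])).fit()
m2 = smf.ols('wage ~ education + experience', data=df).fit()
print(np.allclose(m1.params.values, m2.params.values))  # True
OLS Regression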
Basic OLS
python
import statsmodels.api as sm
import pandas as pd
# Data
df = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500],
    'education': [12, 14, 16, 18, 20],
    'experience': [2, 4, 5, 8, 9]  # varied so the regressors are not perfectly collinear
})
# OLS Regression
X = sm.add_constant(df[['education', 'experience']])
y = df['wage']
model = sm.OLS(y, X).fit()
print(model.summary())
Output Interpretation (illustrative excerpt):
OLS Regression Results
==============================================================================
Dep. Variable: wage R-squared: 0.999
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 857.5
Prob (F-statistic): 0.00116
...
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -250.0000 164.317 -1.521 0.265 -1155.959 655.959
education 125.0000 10.000 12.500 0.003 87.156 162.844
experience 125.0000 20.000 6.250 0.016 38.313 211.687
==============================================================================
Heteroskedasticity-Robust Standard Errors
python
# Robust options: HC0, HC1, HC2, HC3 (HC1 is what Stata's robust option uses)
model_robust = sm.OLS(y, X).fit(cov_type='HC3')
print(model_robust.summary())
# Comparison
print(f"Standard SE: {model.bse['education']:.4f}")
print(f"Robust SE: {model_robust.bse['education']:.4f}")Standard Error Types:
- 'nonrobust': Default (assumes homoskedasticity)
- 'HC0': White robust standard errors
- 'HC1': Small-sample adjustment of HC0
- 'HC2': More conservative
- 'HC3': Most robust (recommended)
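To see how much the choice matters in practice, you can refit the same model under each covariance type; a small sketch reusing the y and X defined above:
python
# Compare the SE on education across covariance estimators
for cov in ['nonrobust', 'HC0', 'HC1', 'HC2', 'HC3']:
    res = sm.OLS(y, X).fit(cov_type=cov)
    print(f"{cov:>9}: SE(education) = {res.bse['education']:.4f}")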
Weighted Least Squares (WLS)
python
# If heteroskedasticity structure is known
weights = 1 / df['variance']
model_wls = sm.WLS(y, X, weights=weights).fit()
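When the variance structure must be estimated rather than assumed, a common two-step (FGLS-style) approach first models the squared OLS residuals; a minimal sketch, not the only way to construct weights:
python
import numpy as np
# Step 1: regress the log of squared OLS residuals on the same regressors
ols_fit = sm.OLS(y, X).fit()
log_var_fit = sm.OLS(np.log(ols_fit.resid ** 2), X).fit()
# Step 2: weight by the inverse of the estimated variances
model_fgls = sm.WLS(y, X, weights=1 / np.exp(log_var_fit.fittedvalues)).fit()
Generalized Linear Models (GLM)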
Logit Regression (Binary Dependent Variable)
python
# Example: employment status (0/1), using a fresh 8-observation sample
df = pd.DataFrame({
    'education': [12, 14, 16, 12, 18, 20, 18, 16],
    'experience': [2, 5, 4, 3, 9, 6, 6, 8],
    'employed': [0, 1, 1, 0, 1, 1, 0, 1]
})
X = sm.add_constant(df[['education', 'experience']])
y = df['employed']
# Logit
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())
# Marginal effects
marginal_effects = logit_model.get_margeff()
print(marginal_effects.summary())
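Beyond marginal effects, fitted probabilities are often useful; .predict() on a Logit results object returns predicted probabilities for the given design matrix:
python
# Predicted employment probabilities for the estimation sample
prob = logit_model.predict(X)
print(prob.round(3))
Probit Regression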
python
probit_model = sm.Probit(y, X).fit()
Poisson Regression (Count Data)
python
# Example: Patent applications
df['patents'] = [0, 1, 2, 5, 3, 1, 0, 2]
poisson_model = sm.GLM(df['patents'], X,
                       family=sm.families.Poisson()).fit()
print(poisson_model.summary())
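Poisson coefficients are on the log scale; exponentiating them gives incidence rate ratios (IRRs), which are easier to interpret. A short sketch (assumes numpy is imported as np):
python
# exp(coef) = multiplicative effect on the expected count (IRR)
irr = np.exp(poisson_model.params)
print(irr)
Model Diagnostics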
1. Residual Analysis
python
import matplotlib.pyplot as plt
# Fit the model (the diagnostics below assume a continuous outcome, e.g. the wage regression)
model = sm.OLS(y, X).fit()
# Residuals
residuals = model.resid
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# (1) Residuals vs Fitted values
axes[0, 0].scatter(model.fittedvalues, residuals)
axes[0, 0].axhline(0, color='red', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# (2) Residual histogram
axes[0, 1].hist(residuals, bins=20, edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_title('Histogram of Residuals')
# (3) Q-Q plot (normality)
sm.qqplot(residuals, line='s', ax=axes[1, 0])
axes[1, 0].set_title('Normal Q-Q Plot')
# (4) Residuals vs Leverage
influence = model.get_influence()
leverage = influence.hat_matrix_diag
axes[1, 1].scatter(leverage, residuals)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
plt.tight_layout()
plt.show()
2. Heteroskedasticity Test
python
from statsmodels.stats.diagnostic import het_breuschpagan
# Breusch-Pagan test
bp_test = het_breuschpagan(model.resid, model.model.exog)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, bp_test)))
# Interpretation:
# p-value < 0.05 → Reject homoskedasticity, heteroskedasticity present
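A complementary check is White's test, which also detects heteroskedasticity of nonlinear form; a sketch reusing the labels list from above:
python
from statsmodels.stats.diagnostic import het_white
white_test = het_white(model.resid, model.model.exog)
print(dict(zip(labels, white_test)))
3. Multicollinearity (VIF)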
python
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for each column (the VIF of the constant term can be ignored)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# Criteria:
# VIF < 5: No problem
# 5 < VIF < 10: Moderate multicollinearity
# VIF > 10: Severe multicollinearity
4. Normality Test
python
from scipy import stats
# Shapiro-Wilk test
stat, p_value = stats.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
# Jarque-Bera test (better suited to large samples)
from statsmodels.stats.stattools import jarque_bera
jb_test = jarque_bera(model.resid)
print(f"Jarque-Bera p-value: {jb_test[1]:.4f}")
Formula Interface Advanced Usage
Categorical Variables
python
import statsmodels.formula.api as smf
# Automatically create dummy variables (C())
model = smf.ols('wage ~ education + experience + C(region)', data=df).fit()
# Specify reference group
model = smf.ols('wage ~ education + C(region, Treatment("East"))', data=df).fit()
Transformations
python
# Log transformation
model = smf.ols('np.log(wage) ~ education + experience', data=df).fit()
# Polynomial
model = smf.ols('wage ~ education + experience + I(experience**2)', data=df).fit()
# Standardization
model = smf.ols('scale(wage) ~ scale(education) + scale(experience)', data=df).fit()
Interaction Terms
python
# Method 1: Explicit interaction
model = smf.ols('wage ~ education + female + education:female', data=df).fit()
# Method 2: Full interaction
model = smf.ols('wage ~ education * female', data=df).fit()
# Equivalent to: wage ~ education + female + education:female
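From the full-interaction fit, group-specific slopes can be read directly off the parameters; a sketch assuming female is a 0/1 numeric column (so patsy names the terms 'female' and 'education:female'):
python
b = model.params
print("Effect of education (female=0):", b['education'])
print("Effect of education (female=1):", b['education'] + b['education:female'])
Time Series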
ARIMA Model
python
from statsmodels.tsa.arima.model import ARIMA
import pandas as pd
# Example: Quarterly GDP
df = pd.read_csv('gdp.csv', index_col='date', parse_dates=True)
# ARIMA(1,1,1)
model = ARIMA(df['gdp'], order=(1, 1, 1)).fit()
print(model.summary())
# Forecast
forecast = model.forecast(steps=8)
print(forecast)
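For forecast uncertainty, get_forecast() returns a results object with both point forecasts and confidence intervals:
python
fc = model.get_forecast(steps=8)
print(fc.predicted_mean)
print(fc.conf_int(alpha=0.05))  # 95% intervals
Stationarity Test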
python
from statsmodels.tsa.stattools import adfuller
# ADF test
result = adfuller(df['gdp'])
print(f'ADF Statistic: {result[0]:.4f}')
print(f'p-value: {result[1]:.4f}')
# Interpretation:
# p-value < 0.05 → Reject unit root, series is stationary
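If the level series fails the test, common practice is to difference once and retest; a minimal sketch on the same series:
python
# ADF test on the first difference
result_diff = adfuller(df['gdp'].diff().dropna())
print(f'p-value (first difference): {result_diff[1]:.4f}')
Outputting Regression Tables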
Single Model
python
# LaTeX format
print(model.summary().as_latex())
# HTML format
print(model.summary().as_html())
# Save
with open('regression_results.tex', 'w') as f:
    f.write(model.summary().as_latex())
Multiple Model Comparison
python
from statsmodels.iolib.summary2 import summary_col
# Fit multiple models
model1 = smf.ols('wage ~ education', data=df).fit()
model2 = smf.ols('wage ~ education + experience', data=df).fit()
model3 = smf.ols('wage ~ education + experience + female', data=df).fit()
# Comparison table
results = summary_col(
    [model1, model2, model3],
    model_names=['(1)', '(2)', '(3)'],
    stars=True,
    info_dict={
        'N': lambda x: f"{int(x.nobs):,}",
        'R²': lambda x: f"{x.rsquared:.3f}"
    }
)
print(results)
# Export as LaTeX
print(results.as_latex())
Summary
Core API
| Task | Function | Example |
|---|---|---|
| OLS | sm.OLS() | Linear regression |
| Logit | sm.Logit() | Binary dependent variable |
| Probit | sm.Probit() | Binary dependent variable |
| GLM | sm.GLM() | Generalized linear model |
| WLS | sm.WLS() | Weighted least squares |
| ARIMA | ARIMA() | Time series |
Best Practices
- Use robust standard errors: cov_type='HC3'
- Check diagnostics: residuals, VIF, heteroskedasticity tests
- Formula interface for prototyping: rapid exploration
- Standard interface for production: precise control
- Save results: .summary().as_latex()
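Putting these together, a minimal end-to-end sketch (assumes a DataFrame df with wage, education, and experience columns; the output file name is illustrative):
python
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
# Prototype with the formula interface, robust SEs from the start
model = smf.ols('wage ~ education + experience', data=df).fit(cov_type='HC3')
# Diagnostics before trusting the estimates
bp = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp[1]:.4f}")
# Save publication-ready output
with open('results.tex', 'w') as f:
    f.write(model.summary().as_latex())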
Next Steps
Next Section: SciPy.stats and LinearModels
References:
- Statsmodels Official Documentation: https://www.statsmodels.org/
- Seabold & Perktold (2010): "Statsmodels: Econometric modeling with Python"