
4.2 Statsmodels Essentials

"Make everything as simple as possible, but not simpler."— Albert Einstein, Physicist

The Foundation of Python Statistical Analysis


Section Objectives

  • Master the core Statsmodels API
  • Understand OLS, GLM, and time series modeling
  • Learn model diagnostics and robust standard errors
  • Use formula interface for rapid modeling

Statsmodels Overview

Positioning: the "Stata of Python", the first choice for classical statistical analysis

Core Features:

  • Publication-quality output (detailed statistical tables)
  • Rich diagnostic tools
  • Heteroskedasticity-robust standard errors
  • Comprehensive time series support

Installation:

bash
pip install statsmodels
# or
conda install -c conda-forge statsmodels

Two API Styles

Style 1: Array interface (statsmodels.api)

python
import statsmodels.api as sm
import pandas as pd
import numpy as np

# 1. Prepare data (assumes a DataFrame df with wage, education, experience columns)
X = df[['education', 'experience']]
X = sm.add_constant(X)  # Must manually add intercept
y = df['wage']

# 2. Fit model
model = sm.OLS(y, X).fit()

# 3. View results
print(model.summary())

Style 2: Formula interface (statsmodels.formula.api)

python
import statsmodels.formula.api as smf

# R-style formula
model = smf.ols('wage ~ education + experience', data=df).fit()
print(model.summary())

# Advantages:
# Automatically adds intercept
# Automatically handles categorical variables
# Supports transformations and interactions

OLS Regression

Basic OLS

python
import statsmodels.api as sm
import pandas as pd

# Data
df = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500],
    'education': [12, 14, 16, 18, 20],
    'experience': [2, 5, 3, 7, 6]   # not collinear with education
})

# OLS Regression
X = sm.add_constant(df[['education', 'experience']])
y = df['wage']

model = sm.OLS(y, X).fit()
print(model.summary())

Output Interpretation (illustrative values):

                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                     857.5
Prob (F-statistic):            0.00116
...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -250.0000    164.317     -1.521      0.265   -1155.959     655.959
education    125.0000     10.000     12.500      0.003      87.156     162.844
experience   125.0000     20.000      6.250      0.016      38.313     211.687
==============================================================================
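
Beyond the printed summary, the key statistics can also be read off the fitted results object directly; a quick sketch using standard RegressionResults attributes:

python
# Access key results programmatically
print(model.params)       # coefficient estimates (pandas Series)
print(model.bse)          # standard errors
print(model.pvalues)      # p-values
print(model.conf_int())   # 95% confidence intervals
print(model.rsquared, model.rsquared_adj)  # fit statistics
print(int(model.nobs))    # number of observations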

Heteroskedasticity-Robust Standard Errors

python
# Robust covariance options: HC0, HC1, HC2, HC3 (the fit() default is 'nonrobust')
model_robust = sm.OLS(y, X).fit(cov_type='HC3')
print(model_robust.summary())

# Comparison
print(f"Standard SE: {model.bse['education']:.4f}")
print(f"Robust SE: {model_robust.bse['education']:.4f}")

Standard Error Types:

  • 'nonrobust': Default (assumes homoskedasticity)
  • 'HC0': White's original heteroskedasticity-robust standard errors
  • 'HC1': HC0 with a small-sample degrees-of-freedom correction
  • 'HC2': Adjusts for observation leverage
  • 'HC3': Most conservative; commonly recommended for small samples (compared in the sketch below)
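
The covariance options can be compared side by side after fitting; a short sketch looping over the cov_type values:

python
# Compare the standard error on 'education' across covariance estimators
for cov in ['nonrobust', 'HC0', 'HC1', 'HC2', 'HC3']:
    res = sm.OLS(y, X).fit(cov_type=cov)
    print(f"{cov:>9}: SE(education) = {res.bse['education']:.4f}")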

Weighted Least Squares (WLS)

python
# If the heteroskedasticity structure is known
# (assumes df['variance'] holds each observation's known error variance)
weights = 1 / df['variance']
model_wls = sm.WLS(y, X, weights=weights).fit()
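
In practice the error variances are rarely known. One common feasible-WLS approach (a sketch, not the only option) estimates them from the OLS residuals via an auxiliary regression of the log squared residuals on the regressors:

python
import numpy as np

# Feasible WLS: estimate the variance function from OLS residuals (illustrative)
ols_res = sm.OLS(y, X).fit()

# Auxiliary regression for the (log) variance function
aux = sm.OLS(np.log(ols_res.resid ** 2), X).fit()

# Implied variances and weights
var_hat = np.exp(aux.fittedvalues)
model_fwls = sm.WLS(y, X, weights=1 / var_hat).fit()
print(model_fwls.summary())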

Generalized Linear Models (GLM)

Logit Regression (Binary Dependent Variable)

python
# Example: Employment status (0/1)
df['employed'] = [0, 1, 1, 0, 1]  # one value per row of df

X = sm.add_constant(df[['education', 'experience']])
y = df['employed']

# Logit
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())

# Marginal effects
marginal_effects = logit_model.get_margeff()
print(marginal_effects.summary())
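
Predicted probabilities come from .predict() on the fitted results; a minimal sketch:

python
# Predicted probability of employment for each observation
prob = logit_model.predict(X)
print(prob)

# Classify at a 0.5 cutoff (illustrative threshold)
print((prob >= 0.5).astype(int))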

Probit Regression

python
probit_model = sm.Probit(y, X).fit()

Poisson Regression (Count Data)

python
# Example: Patent applications
df['patents'] = [0, 1, 2, 5, 3]  # one value per row of df

poisson_model = sm.GLM(df['patents'], X,
                       family=sm.families.Poisson()).fit()
print(poisson_model.summary())
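
Because the Poisson model uses a log link, exponentiated coefficients can be read as incidence rate ratios (multiplicative effects on the expected count); a quick sketch:

python
import numpy as np

# Exponentiate coefficients to get incidence rate ratios
irr = np.exp(poisson_model.params)
print(irr)
# An IRR of, say, 1.2 on education would mean each extra year of schooling is
# associated with roughly 20% more expected patents, holding experience fixed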

Model Diagnostics

1. Residual Analysis

python
import matplotlib.pyplot as plt

# Fit model
model = sm.OLS(y, X).fit()

# Residuals
residuals = model.resid

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# (1) Residuals vs Fitted values
axes[0, 0].scatter(model.fittedvalues, residuals)
axes[0, 0].axhline(0, color='red', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# (2) Residual histogram
axes[0, 1].hist(residuals, bins=20, edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_title('Histogram of Residuals')

# (3) Q-Q plot (normality)
sm.qqplot(residuals, line='s', ax=axes[1, 0])
axes[1, 0].set_title('Normal Q-Q Plot')

# (4) Residuals vs Leverage
influence = model.get_influence()
leverage = influence.hat_matrix_diag
axes[1, 1].scatter(leverage, residuals)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals vs Leverage')

plt.tight_layout()
plt.show()

2. Heteroskedasticity Test

python
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test
bp_test = het_breuschpagan(model.resid, model.model.exog)

labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, bp_test)))

# Interpretation:
# p-value < 0.05 → reject the null of homoskedasticity (heteroskedasticity present)
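
A common alternative is the White test, which also detects heteroskedasticity that depends on squares and cross-products of the regressors; a short sketch:

python
from statsmodels.stats.diagnostic import het_white

# White test on the same fitted model
white_test = het_white(model.resid, model.model.exog)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, white_test)))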

3. Multicollinearity (VIF)

python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

# Criteria:
# VIF < 5: No problem
# 5 < VIF < 10: Moderate multicollinearity
# VIF > 10: Severe multicollinearity
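
Note that X includes the constant: keep it in when computing the VIFs, but its own VIF is not meaningful and can be dropped from the report, e.g.:

python
# Report VIFs for the substantive regressors only
print(vif_data[vif_data["Variable"] != "const"])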

4. Normality Test

python
from scipy import stats

# Shapiro-Wilk test
stat, p_value = stats.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")

# Jarque-Bera test (large samples)
jb_test = sm.stats.jarque_bera(model.resid)
print(f"Jarque-Bera p-value: {jb_test[1]:.4f}")

Formula Interface Advanced Usage

Categorical Variables

python
import statsmodels.formula.api as smf

# Automatically create dummy variables (C())
model = smf.ols('wage ~ education + experience + C(region)', data=df).fit()

# Specify reference group
model = smf.ols('wage ~ education + C(region, Treatment("East"))', data=df).fit()

Transformations

python
# Log transformation
model = smf.ols('np.log(wage) ~ education + experience', data=df).fit()

# Polynomial
model = smf.ols('wage ~ education + experience + I(experience**2)', data=df).fit()

# Standardization
model = smf.ols('scale(wage) ~ scale(education) + scale(experience)', data=df).fit()

Interaction Terms

python
# Method 1: Explicit interaction
model = smf.ols('wage ~ education + female + education:female', data=df).fit()

# Method 2: Full interaction
model = smf.ols('wage ~ education * female', data=df).fit()
# Equivalent to: wage ~ education + female + education:female
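
Whether the interaction actually adds explanatory power can be checked with an F test on the nested models, e.g. via anova_lm (a sketch; assumes df contains a female dummy as in the examples above):

python
import statsmodels.api as sm

# Nested comparison: does the education x female interaction improve the model?
restricted = smf.ols('wage ~ education + female', data=df).fit()
full = smf.ols('wage ~ education * female', data=df).fit()
print(sm.stats.anova_lm(restricted, full))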

Time Series

ARIMA Model

python
from statsmodels.tsa.arima.model import ARIMA
import pandas as pd

# Example: Quarterly GDP
df = pd.read_csv('gdp.csv', index_col='date', parse_dates=True)

# ARIMA(1,1,1)
model = ARIMA(df['gdp'], order=(1, 1, 1)).fit()
print(model.summary())

# Forecast
forecast = model.forecast(steps=8)
print(forecast)
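
Forecast intervals come from get_forecast(), which returns predicted means together with confidence bounds; a brief sketch:

python
# Point forecasts with 95% confidence intervals
pred = model.get_forecast(steps=8)
print(pred.predicted_mean)
print(pred.conf_int(alpha=0.05))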

Stationarity Test

python
from statsmodels.tsa.stattools import adfuller

# ADF test
result = adfuller(df['gdp'])
print(f'ADF Statistic: {result[0]:.4f}')
print(f'p-value: {result[1]:.4f}')

# Interpretation:
# p-value < 0.05 → reject the null of a unit root; the series is stationary
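
If the level series is non-stationary, the usual next step is to difference it and re-run the test; a minimal sketch:

python
# First-difference the series, drop the initial NaN, and test again
gdp_diff = df['gdp'].diff().dropna()
result_diff = adfuller(gdp_diff)
print(f'ADF Statistic (differenced): {result_diff[0]:.4f}')
print(f'p-value (differenced): {result_diff[1]:.4f}')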

Outputting Regression Tables

Single Model

python
# LaTeX format
print(model.summary().as_latex())

# HTML format
print(model.summary().as_html())

# Save
with open('regression_results.tex', 'w') as f:
    f.write(model.summary().as_latex())

Multiple Model Comparison

python
from statsmodels.iolib.summary2 import summary_col

# Fit multiple models
model1 = smf.ols('wage ~ education', data=df).fit()
model2 = smf.ols('wage ~ education + experience', data=df).fit()
model3 = smf.ols('wage ~ education + experience + female', data=df).fit()

# Comparison table
results = summary_col(
    [model1, model2, model3],
    model_names=['(1)', '(2)', '(3)'],
    stars=True,
    info_dict={
        'N': lambda x: f"{int(x.nobs):,}",
        'R²': lambda x: f"{x.rsquared:.3f}"
    }
)

print(results)

# Export as LaTeX
print(results.as_latex())

Summary

Core API

Task      Function      Example
OLS       sm.OLS()      Linear regression
Logit     sm.Logit()    Binary dependent variable
Probit    sm.Probit()   Binary dependent variable
GLM       sm.GLM()      Generalized linear model
WLS       sm.WLS()      Weighted least squares
ARIMA     ARIMA()       Time series

Best Practices

  1. Use robust standard errors: cov_type='HC3'
  2. Check diagnostics: Residuals, VIF, heteroskedasticity
  3. Formula interface for prototyping: Rapid exploration
  4. Standard interface for production: Precise control
  5. Save results: .summary().as_latex()

Next Steps

Next Section: SciPy.stats and LinearModels

