
4.2 Statsmodels Essentials

"Make everything as simple as possible, but not simpler."— Albert Einstein, Physicist

The Foundation of Python Statistical Analysis


Section Objectives

  • Master the core Statsmodels API
  • Understand OLS, GLM, and time series modeling
  • Learn model diagnostics and robust standard errors
  • Use formula interface for rapid modeling

Statsmodels Overview

Positioning: the "Stata of Python", the first choice for classical statistical analysis

Core Features:

  • Publication-quality output (detailed statistical tables)
  • Rich diagnostic tools
  • Heteroskedasticity-robust standard errors
  • Comprehensive time series support

Installation:

bash
pip install statsmodels
# or
conda install -c conda-forge statsmodels

Two API Styles

Style 1: Array interface (statsmodels.api)

python
import statsmodels.api as sm
import pandas as pd
import numpy as np

# 1. Prepare data (assumes a DataFrame df with wage, education, experience columns)
X = df[['education', 'experience']]
X = sm.add_constant(X)  # Must manually add intercept
y = df['wage']

# 2. Fit model
model = sm.OLS(y, X).fit()

# 3. View results
print(model.summary())

Style 2: Formula interface (statsmodels.formula.api)

python
import statsmodels.formula.api as smf

# R-style formula
model = smf.ols('wage ~ education + experience', data=df).fit()
print(model.summary())

# Advantages:
# Automatically adds intercept
# Automatically handles categorical variables
# Supports transformations and interactions

OLS Regression

Basic OLS

python
import statsmodels.api as sm
import pandas as pd

# Data
df = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500],
    'education': [12, 14, 16, 18, 20],
    'experience': [2, 5, 3, 7, 6]   # not collinear with education
})

# OLS Regression
X = sm.add_constant(df[['education', 'experience']])
y = df['wage']

model = sm.OLS(y, X).fit()
print(model.summary())

Output Interpretation (illustrative values):

                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                     857.5
Prob (F-statistic):            0.00116
...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -250.0000    164.317     -1.521      0.265   -1155.959     655.959
education    125.0000     10.000     12.500      0.003      87.156     162.844
experience   125.0000     20.000      6.250      0.016      38.313     211.687
==============================================================================
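
Beyond the printed summary, the key statistics can also be read off the fitted results object directly; a quick sketch using standard RegressionResults attributes:

python
# Access key results programmatically
print(model.params)       # coefficient estimates (pandas Series)
print(model.bse)          # standard errors
print(model.pvalues)      # p-values
print(model.conf_int())   # 95% confidence intervals
print(model.rsquared, model.rsquared_adj)  # fit statistics
print(int(model.nobs))    # number of observations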

Heteroskedasticity-Robust Standard Errors

python
# Robust covariance options: HC0, HC1, HC2, HC3 (the fit() default is 'nonrobust')
model_robust = sm.OLS(y, X).fit(cov_type='HC3')
print(model_robust.summary())

# Comparison
print(f"Standard SE: {model.bse['education']:.4f}")
print(f"Robust SE: {model_robust.bse['education']:.4f}")

Standard Error Types:

  • 'nonrobust': Default (assumes homoskedasticity)
  • 'HC0': White's original heteroskedasticity-robust standard errors
  • 'HC1': HC0 with a small-sample degrees-of-freedom correction
  • 'HC2': Adjusts for observation leverage
  • 'HC3': Most conservative; commonly recommended for small samples (compared in the sketch below)
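
The covariance options can be compared side by side after fitting; a short sketch looping over the cov_type values:

python
# Compare the standard error on 'education' across covariance estimators
for cov in ['nonrobust', 'HC0', 'HC1', 'HC2', 'HC3']:
    res = sm.OLS(y, X).fit(cov_type=cov)
    print(f"{cov:>9}: SE(education) = {res.bse['education']:.4f}")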

Weighted Least Squares (WLS)

python
# If the heteroskedasticity structure is known
# (assumes df['variance'] holds each observation's known error variance)
weights = 1 / df['variance']
model_wls = sm.WLS(y, X, weights=weights).fit()
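
In practice the error variances are rarely known. One common feasible-WLS approach (a sketch, not the only option) estimates them from the OLS residuals via an auxiliary regression of the log squared residuals on the regressors:

python
import numpy as np

# Feasible WLS: estimate the variance function from OLS residuals (illustrative)
ols_res = sm.OLS(y, X).fit()

# Auxiliary regression for the (log) variance function
aux = sm.OLS(np.log(ols_res.resid ** 2), X).fit()

# Implied variances and weights
var_hat = np.exp(aux.fittedvalues)
model_fwls = sm.WLS(y, X, weights=1 / var_hat).fit()
print(model_fwls.summary())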

Generalized Linear Models (GLM)

Logit Regression (Binary Dependent Variable)

python
# Example: Employment status (0/1)
df['employed'] = [0, 1, 1, 0, 1]  # one value per row of df

X = sm.add_constant(df[['education', 'experience']])
y = df['employed']

# Logit
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())

# Marginal effects
marginal_effects = logit_model.get_margeff()
print(marginal_effects.summary())
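
Predicted probabilities come from .predict() on the fitted results; a minimal sketch:

python
# Predicted probability of employment for each observation
prob = logit_model.predict(X)
print(prob)

# Classify at a 0.5 cutoff (illustrative threshold)
print((prob >= 0.5).astype(int))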

Probit Regression

python
probit_model = sm.Probit(y, X).fit()

Poisson Regression (Count Data)

python
# Example: Patent applications
df['patents'] = [0, 1, 2, 5, 3]  # one value per row of df

poisson_model = sm.GLM(df['patents'], X,
                       family=sm.families.Poisson()).fit()
print(poisson_model.summary())
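
Because the Poisson model uses a log link, exponentiated coefficients can be read as incidence rate ratios (multiplicative effects on the expected count); a quick sketch:

python
import numpy as np

# Exponentiate coefficients to get incidence rate ratios
irr = np.exp(poisson_model.params)
print(irr)
# An IRR of, say, 1.2 on education would mean each extra year of schooling is
# associated with roughly 20% more expected patents, holding experience fixed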

Model Diagnostics

1. Residual Analysis

python
import matplotlib.pyplot as plt

# Fit model
model = sm.OLS(y, X).fit()

# Residuals
residuals = model.resid

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# (1) Residuals vs Fitted values
axes[0, 0].scatter(model.fittedvalues, residuals)
axes[0, 0].axhline(0, color='red', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# (2) Residual histogram
axes[0, 1].hist(residuals, bins=20, edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_title('Histogram of Residuals')

# (3) Q-Q plot (normality)
sm.qqplot(residuals, line='s', ax=axes[1, 0])
axes[1, 0].set_title('Normal Q-Q Plot')

# (4) Residuals vs Leverage
influence = model.get_influence()
leverage = influence.hat_matrix_diag
axes[1, 1].scatter(leverage, residuals)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals vs Leverage')

plt.tight_layout()
plt.show()

2. Heteroskedasticity Test

python
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test
bp_test = het_breuschpagan(model.resid, model.model.exog)

labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, bp_test)))

# Interpretation:
# p-value < 0.05 → reject the null of homoskedasticity (heteroskedasticity present)
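
A common alternative is the White test, which also detects heteroskedasticity that depends on squares and cross-products of the regressors; a short sketch:

python
from statsmodels.stats.diagnostic import het_white

# White test on the same fitted model
white_test = het_white(model.resid, model.model.exog)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, white_test)))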

3. Multicollinearity (VIF)

python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

# Criteria:
# VIF < 5: No problem
# 5 < VIF < 10: Moderate multicollinearity
# VIF > 10: Severe multicollinearity
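
Note that X includes the constant: keep it in when computing the VIFs, but its own VIF is not meaningful and can be dropped from the report, e.g.:

python
# Report VIFs for the substantive regressors only
print(vif_data[vif_data["Variable"] != "const"])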

4. Normality Test

python
from scipy import stats

# Shapiro-Wilk test
stat, p_value = stats.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")

# Jarque-Bera test (large samples)
jb_test = sm.stats.jarque_bera(model.resid)
print(f"Jarque-Bera p-value: {jb_test[1]:.4f}")

Formula Interface Advanced Usage

Categorical Variables

python
import statsmodels.formula.api as smf

# Automatically create dummy variables (C())
model = smf.ols('wage ~ education + experience + C(region)', data=df).fit()

# Specify reference group
model = smf.ols('wage ~ education + C(region, Treatment("East"))', data=df).fit()

Transformations

python
# Log transformation
model = smf.ols('np.log(wage) ~ education + experience', data=df).fit()

# Polynomial
model = smf.ols('wage ~ education + experience + I(experience**2)', data=df).fit()

# Standardization
model = smf.ols('scale(wage) ~ scale(education) + scale(experience)', data=df).fit()

Interaction Terms

python
# Method 1: Explicit interaction
model = smf.ols('wage ~ education + female + education:female', data=df).fit()

# Method 2: Full interaction
model = smf.ols('wage ~ education * female', data=df).fit()
# Equivalent to: wage ~ education + female + education:female
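
Whether the interaction actually adds explanatory power can be checked with an F test on the nested models, e.g. via anova_lm (a sketch; assumes df contains a female dummy as in the examples above):

python
import statsmodels.api as sm

# Nested comparison: does the education x female interaction improve the model?
restricted = smf.ols('wage ~ education + female', data=df).fit()
full = smf.ols('wage ~ education * female', data=df).fit()
print(sm.stats.anova_lm(restricted, full))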

Time Series

ARIMA Model

python
from statsmodels.tsa.arima.model import ARIMA
import pandas as pd

# Example: Quarterly GDP
df = pd.read_csv('gdp.csv', index_col='date', parse_dates=True)

# ARIMA(1,1,1)
model = ARIMA(df['gdp'], order=(1, 1, 1)).fit()
print(model.summary())

# Forecast
forecast = model.forecast(steps=8)
print(forecast)
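
Forecast intervals come from get_forecast(), which returns predicted means together with confidence bounds; a brief sketch:

python
# Point forecasts with 95% confidence intervals
pred = model.get_forecast(steps=8)
print(pred.predicted_mean)
print(pred.conf_int(alpha=0.05))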

Stationarity Test

python
from statsmodels.tsa.stattools import adfuller

# ADF test
result = adfuller(df['gdp'])
print(f'ADF Statistic: {result[0]:.4f}')
print(f'p-value: {result[1]:.4f}')

# Interpretation:
# p-value < 0.05 → reject the null of a unit root; the series is stationary
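
If the level series is non-stationary, the usual next step is to difference it and re-run the test; a minimal sketch:

python
# First-difference the series, drop the initial NaN, and test again
gdp_diff = df['gdp'].diff().dropna()
result_diff = adfuller(gdp_diff)
print(f'ADF Statistic (differenced): {result_diff[0]:.4f}')
print(f'p-value (differenced): {result_diff[1]:.4f}')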

Outputting Regression Tables

Single Model

python
# LaTeX format
print(model.summary().as_latex())

# HTML format
print(model.summary().as_html())

# Save
with open('regression_results.tex', 'w') as f:
    f.write(model.summary().as_latex())

Multiple Model Comparison

python
from statsmodels.iolib.summary2 import summary_col

# Fit multiple models
model1 = smf.ols('wage ~ education', data=df).fit()
model2 = smf.ols('wage ~ education + experience', data=df).fit()
model3 = smf.ols('wage ~ education + experience + female', data=df).fit()

# Comparison table
results = summary_col(
    [model1, model2, model3],
    model_names=['(1)', '(2)', '(3)'],
    stars=True,
    info_dict={
        'N': lambda x: f"{int(x.nobs):,}",
        'R²': lambda x: f"{x.rsquared:.3f}"
    }
)

print(results)

# Export as LaTeX
print(results.as_latex())

Summary

Core API

Task      Function      Example
OLS       sm.OLS()      Linear regression
Logit     sm.Logit()    Binary dependent variable
Probit    sm.Probit()   Binary dependent variable
GLM       sm.GLM()      Generalized linear model
WLS       sm.WLS()      Weighted least squares
ARIMA     ARIMA()       Time series

Best Practices

  1. Use robust standard errors: cov_type='HC3'
  2. Check diagnostics: Residuals, VIF, heteroskedasticity
  3. Formula interface for prototyping: Rapid exploration
  4. Standard interface for production: Precise control
  5. Save results: .summary().as_latex()

Next Steps

Next Section: SciPy.stats and LinearModels

