1.2 Python Regression Analysis Quick Start
"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." — John Tukey, Statistician
Experience Python's regression analysis capabilities in 5 minutes
Section Objectives
- Quickly run your first OLS regression
- Compare regression syntax across Python, Stata, and R
- Understand core functions of statsmodels
Core Tool: statsmodels
The core library for regression analysis in Python is statsmodels, which provides regression output similar to Stata.
```python
# Install statsmodels (the ! prefix is for Jupyter notebooks; drop it in a terminal)
!pip install statsmodels
```

Your First Regression Model
Python Code
```python
import pandas as pd
import statsmodels.api as sm

# Simulated data: studying the impact of education years on wage
data = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500, 6000, 7000, 8000],
    'education': [12, 12, 14, 14, 16, 16, 18, 18],
    'experience': [0, 2, 1, 3, 2, 4, 3, 5]
})

# OLS regression: wage = β0 + β1*education + β2*experience + ε
X = data[['education', 'experience']]
X = sm.add_constant(X)  # Add constant term
y = data['wage']
model = sm.OLS(y, X).fit()
print(model.summary())
```

Output:
```
                            OLS Regression Results
==============================================================================
Dep. Variable:          wage          R-squared:               0.982
Model:                  OLS           Adj. R-squared:          0.975
Method:                 Least Squares F-statistic:             134.8
Date:                   ...           Prob (F-statistic):      7.09e-05
Time:                   ...           Log-Likelihood:          -41.234
No. Observations:       8             AIC:                     88.47
Df Residuals:           5             BIC:                     88.78
Df Model:               2
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -5250.0000   1162.054     -4.518      0.006   -8234.761   -2265.239
education    625.0000     75.000      8.333      0.000     432.196     817.804
experience   375.0000     89.443      4.193      0.008     145.132     604.868
==============================================================================
```

Interpretation:
- Each additional year of education is associated with a wage increase of 625 dollars, holding experience constant (p < 0.001, significant)
- Each additional year of work experience is associated with a wage increase of 375 dollars, holding education constant (p < 0.01, significant)
- R² = 0.982: the model explains about 98% of the variation in wage (unsurprisingly high for 8 simulated observations)
Three-Language Comparison
Stata Code
```stata
* Load data
use wage_data.dta, clear

* Run OLS regression (results are displayed automatically)
regress wage education experience
```

R Code
```r
# Load data
data <- read.csv("wage_data.csv")

# Run OLS regression
model <- lm(wage ~ education + experience, data = data)

# View regression results
summary(model)
```

Python Code
```python
import pandas as pd
import statsmodels.api as sm

# Load data
data = pd.read_csv("wage_data.csv")

# Run OLS regression
X = sm.add_constant(data[['education', 'experience']])
y = data['wage']
model = sm.OLS(y, X).fit()

# View regression results
print(model.summary())
```

Syntax Comparison Summary
| Function | Stata | R | Python (statsmodels) |
|---|---|---|---|
| Regression command | `regress y x1 x2` | `lm(y ~ x1 + x2)` | `sm.OLS(y, X).fit()` |
| Add constant | Auto-added | Auto-added | Manual `sm.add_constant()` |
| View results | Auto-displayed | `summary(model)` | `model.summary()` |
| Get coefficients | `_b[x1]` | `coef(model)` | `model.params` |
| Get R² | `e(r2)` | `summary(model)$r.squared` | `model.rsquared` |
| Predict | `predict yhat` | `predict(model)` | `model.predict(X)` |
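If you miss R's formula syntax, statsmodels also ships a formula interface, `statsmodels.formula.api`, which accepts `y ~ x1 + x2` strings and adds the intercept automatically, so `sm.add_constant()` is not needed. A minimal sketch on the simulated wage data:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500, 6000, 7000, 8000],
    'education': [12, 12, 14, 14, 16, 16, 18, 18],
    'experience': [0, 2, 1, 3, 2, 4, 3, 5]
})

# R-style formula; the constant shows up in the output as 'Intercept'
model = smf.ols('wage ~ education + experience', data=data).fit()
print(model.summary())
```

The fitted result exposes the same attributes (`params`, `rsquared`, `predict`) as the `sm.OLS` version, so everything in the table above still applies.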
Key Considerations for Python Regression
1. Must Manually Add Constant Term
```python
# ❌ Wrong: forgot to add the constant (the regression is forced through the origin)
X = data[['education', 'experience']]
model = sm.OLS(y, X).fit()  # Coefficient estimates will be biased!

# ✅ Correct: add the constant
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(y, X).fit()
```

2. Order of X and y
```python
# Python/statsmodels takes (y, X): dependent variable first, then regressors
model = sm.OLS(y, X).fit()

# R syntax instead uses a formula: lm(y ~ x1 + x2)
# Note: Python is (y, X); R is formula form
```

3. Must Call summary() to View Results
```python
# ❌ Only displays the model object's repr
print(model)

# ✅ Displays the complete regression results
print(model.summary())
```

Quick Practice
Run the following code to experience Python regression:
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Generate simulated data
np.random.seed(42)
n = 100
data = pd.DataFrame({
    'income': np.random.normal(5000, 1500, n),
    'age': np.random.randint(22, 65, n),
    'education': np.random.randint(9, 22, n)
})

# Income = f(age, education)
# Note: income is generated independently of age and education here,
# so expect small, insignificant coefficients and a near-zero R²
X = sm.add_constant(data[['age', 'education']])
y = data['income']
model = sm.OLS(y, X).fit()
print(model.summary())

# Extract key results
print(f"\n📊 Key Metrics:")
print(f"R² = {model.rsquared:.3f}")
print(f"Education coefficient = {model.params['education']:.2f}")
print(f"Education p-value = {model.pvalues['education']:.4f}")
```

Next Steps
- Article 02: Deep Dive into OLS Regression - Model Diagnostics & Interpretation
- Article 03: Logit Regression - Binary Dependent Variable Models
- Article 04: summary_col() - Elegantly Comparing Multiple Models
🎉 Congratulations! You've run your first Python regression model!