1.2 Python Regression Analysis Quick Start
"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." — John Tukey, Statistician
Experience Python's regression analysis capabilities in 5 minutes
Section Objectives
- Quickly run your first OLS regression
- Compare regression syntax across Python, Stata, and R
- Understand core functions of statsmodels
Core Tool: statsmodels
The core library for regression analysis in Python is statsmodels, which provides regression output similar to Stata.
```python
# Install statsmodels (the ! prefix is for Jupyter notebooks; drop it in a terminal)
!pip install statsmodels
```

Your First Regression Model
Python Code
```python
import pandas as pd
import statsmodels.api as sm

# Simulated data: studying the impact of education years on wage
data = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500, 6000, 7000, 8000],
    'education': [12, 12, 14, 14, 16, 16, 18, 18],
    'experience': [0, 2, 1, 3, 2, 4, 3, 5]
})

# OLS regression: wage = β0 + β1*education + β2*experience + ε
X = data[['education', 'experience']]
X = sm.add_constant(X)  # Add constant term
y = data['wage']
model = sm.OLS(y, X).fit()
print(model.summary())
```

Output:
```
                            OLS Regression Results
==============================================================================
Dep. Variable:          wage          R-squared:               0.982
Model:                  OLS           Adj. R-squared:          0.975
Method:                 Least Squares F-statistic:             134.8
Date:                   ...           Prob (F-statistic):      7.09e-05
Time:                   ...           Log-Likelihood:          -41.234
No. Observations:       8             AIC:                     88.47
Df Residuals:           5             BIC:                     88.78
Df Model:               2
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -5250.0000   1162.054     -4.518      0.006   -8234.761   -2265.239
education    625.0000     75.000      8.333      0.000     432.196     817.804
experience   375.0000     89.443      4.193      0.008     145.132     604.868
==============================================================================
```

Interpretation:
- Each additional year of education is associated with a wage increase of 625 dollars, holding experience constant (p < 0.001, significant)
- Each additional year of work experience is associated with a wage increase of 375 dollars, holding education constant (p < 0.01, significant)
- R² = 0.982: the model explains about 98% of the variation in wage (unsurprisingly high for 8 simulated observations)
Three-Language Comparison
Stata Code
```stata
* Load data
use wage_data.dta, clear

* Run OLS regression (results are displayed automatically)
regress wage education experience
```

R Code
```r
# Load data
data <- read.csv("wage_data.csv")

# Run OLS regression
model <- lm(wage ~ education + experience, data = data)

# View regression results
summary(model)
```

Python Code
```python
import pandas as pd
import statsmodels.api as sm

# Load data
data = pd.read_csv("wage_data.csv")

# Run OLS regression
X = sm.add_constant(data[['education', 'experience']])
y = data['wage']
model = sm.OLS(y, X).fit()

# View regression results
print(model.summary())
```

Syntax Comparison Summary
| Function | Stata | R | Python (statsmodels) |
|---|---|---|---|
| Regression command | `regress y x1 x2` | `lm(y ~ x1 + x2)` | `sm.OLS(y, X).fit()` |
| Add constant | Auto-added | Auto-added | Manual `sm.add_constant()` |
| View results | Auto-displayed | `summary(model)` | `model.summary()` |
| Get coefficients | `_b[x1]` | `coef(model)` | `model.params` |
| Get R² | `e(r2)` | `summary(model)$r.squared` | `model.rsquared` |
| Predict | `predict yhat` | `predict(model)` | `model.predict(X)` |
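If you miss R's formula syntax, statsmodels also ships a formula interface, `statsmodels.formula.api`, which accepts `y ~ x1 + x2` strings and adds the intercept automatically, so `sm.add_constant()` is not needed. A minimal sketch on the simulated wage data:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500, 6000, 7000, 8000],
    'education': [12, 12, 14, 14, 16, 16, 18, 18],
    'experience': [0, 2, 1, 3, 2, 4, 3, 5]
})

# R-style formula; the constant shows up in the output as 'Intercept'
model = smf.ols('wage ~ education + experience', data=data).fit()
print(model.summary())
```

The fitted result exposes the same attributes (`params`, `rsquared`, `predict`) as the `sm.OLS` version, so everything in the table above still applies.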
Key Considerations for Python Regression
1. Must Manually Add Constant Term
```python
# ❌ Wrong: forgot to add the constant (the regression is forced through the origin)
X = data[['education', 'experience']]
model = sm.OLS(y, X).fit()  # Coefficient estimates will be biased!

# ✅ Correct: add the constant
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(y, X).fit()
```

2. Order of X and y
```python
# Python/statsmodels takes (y, X): dependent variable first, then regressors
model = sm.OLS(y, X).fit()

# R syntax instead uses a formula: lm(y ~ x1 + x2)
# Note: Python is (y, X); R is formula form
```

3. Must Call summary() to View Results
```python
# ❌ Only displays the model object's repr
print(model)

# ✅ Displays the complete regression results
print(model.summary())
```

Quick Practice
Run the following code to experience Python regression:
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Generate simulated data
np.random.seed(42)
n = 100
data = pd.DataFrame({
    'income': np.random.normal(5000, 1500, n),
    'age': np.random.randint(22, 65, n),
    'education': np.random.randint(9, 22, n)
})

# Income = f(age, education)
# Note: income is generated independently of age and education here,
# so expect small, insignificant coefficients and a near-zero R²
X = sm.add_constant(data[['age', 'education']])
y = data['income']
model = sm.OLS(y, X).fit()
print(model.summary())

# Extract key results
print(f"\n📊 Key Metrics:")
print(f"R² = {model.rsquared:.3f}")
print(f"Education coefficient = {model.params['education']:.2f}")
print(f"Education p-value = {model.pvalues['education']:.4f}")
```

Next Steps
- Article 02: Deep Dive into OLS Regression - Model Diagnostics & Interpretation
- Article 03: Logit Regression - Binary Dependent Variable Models
- Article 04: summary_col() - Elegantly Comparing Multiple Models
🎉 Congratulations! You've run your first Python regression model!