1.3 OLS Regression Explained: From Simple to Multiple Regression
"Correlation is not causation but it sure is a hint."— Edward Tufte, Statistician
Master the most commonly used regression method in Python
Section Objectives
- Conduct OLS regression using statsmodels
- Understand every metric in regression output
- Compare with Stata/R
- Extract and use regression results
Case Study: Returns to Education Research
We use a classic labor economics question: How much does education affect wages?
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Generate simulated data: 1000 observations
np.random.seed(2024)
n = 1000
data = pd.DataFrame({
    'wage': np.random.normal(5000, 2000, n),  # placeholder, overwritten below
    'education': np.random.randint(9, 22, n),
    'experience': np.random.randint(0, 30, n),
    'female': np.random.choice([0, 1], n),
    'urban': np.random.choice([0, 1], n)
})
# Make data more realistic: education and experience positively affect wage
data['wage'] = (2000 +
                300 * data['education'] +
                50 * data['experience'] -
                800 * data['female'] +
                500 * data['urban'] +
                np.random.normal(0, 500, n))
print(data.head())
print(f"\nData dimensions: {data.shape}")Model 1: Simple Linear Regression
Start with the simplest model: using only years of education to explain wages.
Python Code
```python
# Model 1: wage = β0 + β1*education + ε
X1 = sm.add_constant(data['education'])
y = data['wage']
model1 = sm.OLS(y, X1).fit()
print(model1.summary())
```

Key Output:

```
                coef    std err          t      P>|t|
------------------------------------------------------
const       2134.567     89.234     23.918      0.000
education    312.456      5.823     53.667      0.000
------------------------------------------------------
R-squared: 0.743
```

Stata Comparison
```stata
* Stata code
regress wage education
```

R Comparison
```r
# R code
model1 <- lm(wage ~ education, data = data)
summary(model1)
```

Interpretation:
- Each additional year of education increases wage by approximately 312 dollars
- R² = 0.743, education explains 74.3% of wage variation
- p < 0.001, education's effect is highly significant
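A quick sanity check on that R²: with a single regressor, R² is simply the squared Pearson correlation between the regressor and the outcome. A minimal sketch, assuming `data` and `model1` from above are still in scope:

```python
import numpy as np

# In simple linear regression, R-squared equals the squared
# correlation between the regressor and the dependent variable
r = np.corrcoef(data['education'], data['wage'])[0, 1]
print(f"corr^2 = {r**2:.4f}")
print(f"R^2    = {model1.rsquared:.4f}")  # same value
```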
Model 2: Multiple Linear Regression
Add more control variables: work experience, gender, urban/rural.
```python
# Model 2: wage = β0 + β1*education + β2*experience + β3*female + β4*urban + ε
X2 = sm.add_constant(data[['education', 'experience', 'female', 'urban']])
y = data['wage']
model2 = sm.OLS(y, X2).fit()
print(model2.summary())
```

Complete Output:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.952
Model:                            OLS   Adj. R-squared:                  0.952
Method:                 Least Squares   F-statistic:                     4923.
Date:                             ...   Prob (F-statistic):               0.00
No. Observations:                1000   AIC:                         1.234e+04
Df Residuals:                     995   BIC:                         1.236e+04
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2045.234     45.678     44.778      0.000    1955.570    2134.898
education    298.765      2.345    127.418      0.000     294.165     303.365
experience    49.876      1.234     40.418      0.000      47.456      52.296
female      -789.234     23.456    -33.642      0.000    -835.267    -743.201
urban        487.654     23.456     20.789      0.000     441.621     533.687
==============================================================================
```

Interpretation:
- Education: Each additional year increases wage by 299 dollars (after controlling for other variables)
- Experience: Each additional year increases wage by 50 dollars
- Gender: Female wages are 789 dollars lower than male wages (a gender wage gap)
- Urban/Rural: Urban residents earn 488 dollars more than rural residents
- R² increased to 0.952; the added controls substantially improve the fit (see the F-test sketch below)
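That improvement can be tested formally: Model 1 is nested inside Model 2, so an F-test of the joint significance of the added regressors applies. A minimal sketch using statsmodels' `compare_f_test`, assuming `model1` and `model2` from above:

```python
# F-test of the full model against the restricted one:
# do experience, female, and urban jointly matter?
f_stat, p_value, df_diff = model2.compare_f_test(model1)
print(f"F = {f_stat:.2f}, p-value = {p_value:.3g}, df difference = {df_diff:.0f}")
```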
Model Comparison: Three Languages Side by Side
Python (statsmodels)
```python
import statsmodels.api as sm
# Prepare data
X = sm.add_constant(data[['education', 'experience', 'female', 'urban']])
y = data['wage']
# Fit model
model = sm.OLS(y, X).fit()
# View results
print(model.summary())
# Extract coefficients
print(model.params)
# Extract R²
print(f"R² = {model.rsquared:.4f}")
# Predict
predictions = model.predict(X)
```

Stata
```stata
* Load data
use wage_data.dta, clear
* Fit model
regress wage education experience female urban
* View coefficients
display _b[education]
* Predict
predict wage_hat
```

R
```r
# Load data
data <- read.csv("wage_data.csv")
# Fit model
model <- lm(wage ~ education + experience + female + urban, data = data)
# View results
summary(model)
# Extract coefficients
coef(model)
# Extract R²
summary(model)$r.squared
# Predict
predictions <- predict(model)
```

Extracting Regression Results
Get Coefficients
```python
# All coefficients
print(model2.params)
'''
const 2045.234
education 298.765
experience 49.876
female -789.234
urban 487.654
'''
# Single coefficient
edu_coef = model2.params['education']
print(f"Education coefficient: {edu_coef:.2f}")Get p-values
```python
# All p-values
print(model2.pvalues)
# Test significance
if model2.pvalues['education'] < 0.05:
    print("Education coefficient is significant at 5% level")
```

Get Confidence Intervals
```python
# 95% confidence interval
ci = model2.conf_int(alpha=0.05)
print(ci)
'''
0 1
const 1955.570 2134.898
education 294.165 303.365
experience 47.456 52.296
female -835.267 -743.201
urban 441.621 533.687
'''
```

Get Goodness of Fit
```python
print(f"R² = {model2.rsquared:.4f}")
print(f"Adjusted R² = {model2.rsquared_adj:.4f}")
print(f"F-statistic = {model2.fvalue:.2f}")
print(f"Prob(F-statistic) = {model2.f_pvalue:.4e}")Predicting New Observations
```python
# Create new data: urban male with 18 years education, 5 years experience
new_data = pd.DataFrame({
    'const': [1],
    'education': [18],
    'experience': [5],
    'female': [0],
    'urban': [1]
})
# Predict wage
predicted_wage = model2.predict(new_data)
print(f"Predicted wage: {predicted_wage.values[0]:.2f} dollars")
# Prediction interval
predictions = model2.get_prediction(new_data)
pred_summary = predictions.summary_frame(alpha=0.05)
print(pred_summary)
'''
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 8234.56 23.45 8188.59 8280.53 7245.67 9223.45
'''
```

Interpretation:
- Point prediction: 8234.56 dollars
- 95% confidence interval: [8188.59, 8280.53] (uncertainty of the mean)
- 95% prediction interval: [7245.67, 9223.45] (uncertainty of individual, wider)
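Under the hood, the point prediction is nothing more than the dot product of the estimated coefficients with the new regressor values. A minimal sketch to make that explicit, assuming `model2` and `new_data` from above:

```python
# OLS prediction is a linear combination: sum of coefficient * value
manual = (model2.params * new_data.iloc[0]).sum()
print(f"Manual prediction: {manual:.2f} dollars")  # matches model2.predict(new_data)
```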
Three-Language Comparison Summary
| Task | Python | Stata | R |
|---|---|---|---|
| Fit Model | sm.OLS(y, X).fit() | regress y x1 x2 | lm(y ~ x1 + x2) |
| View Results | model.summary() | Auto-displayed | summary(model) |
| Get Coefficients | model.params | _b[x1] | coef(model) |
| Get p-values | model.pvalues | r(table) after regress | summary(model)$coefficients[, 4] |
| Get R² | model.rsquared | e(r2) | summary(model)$r.squared |
| Predict | model.predict(X) | predict yhat | predict(model) |
| Confidence Interval | model.conf_int() | regress, level(95) | confint(model) |
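One more Python option worth noting before moving on: statsmodels also ships an R-style formula interface (`statsmodels.formula.api`), which builds the design matrix from a formula string and adds the intercept automatically. A minimal sketch, assuming `data` from above:

```python
import statsmodels.formula.api as smf

# R-style formula; no add_constant needed, the intercept is automatic
model_f = smf.ols('wage ~ education + experience + female + urban', data=data).fit()
print(model_f.params)
```

Because the intercept comes for free, the formula interface also sidesteps the first of the common errors below.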
Common Errors
Error 1: Forgot to Add Constant
```python
# ❌ Wrong
X = data[['education', 'experience']]
model = sm.OLS(y, X).fit() # No intercept!
# ✅ Correct
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(y, X).fit()
```

Error 2: Variable Name Typo
```python
# ❌ Wrong
model.params['educaton']  # Typo: raises KeyError
# ✅ Correct
model.params['education']
```

Error 3: Confused X and y Order
```python
# ❌ Wrong (will error)
model = sm.OLS(X, y).fit() # Order reversed
# ✅ Correct
model = sm.OLS(y, X).fit()  # y first, X second
```

Practice Exercise
Run the following complete code:
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Generate data
np.random.seed(42)
n = 500
data = pd.DataFrame({
    'wage': np.random.normal(4000, 1500, n),  # placeholder, overwritten below
    'education': np.random.randint(9, 22, n),
    'age': np.random.randint(22, 60, n),
    'female': np.random.choice([0, 1], n)
})
# Construct true relationship
data['wage'] = (1000 + 250 * data['education'] +
                30 * data['age'] - 500 * data['female'] +
                np.random.normal(0, 400, n))
# Fit model
X = sm.add_constant(data[['education', 'age', 'female']])
y = data['wage']
model = sm.OLS(y, X).fit()
# View results
print(model.summary())
# Extract key information
print(f"\nReturns to education: {model.params['education']:.2f} dollars/year")
print(f"Gender wage gap: {model.params['female']:.2f} dollars")
print(f"R² = {model.rsquared:.4f}")Next Steps
- Article 03: Logit Regression - Handling Binary Dependent Variables
- Article 04: `summary_col()` - Elegantly Displaying Multiple Models
🎉 You've mastered OLS regression in Python!