Skip to content

1.3 OLS Regression Explained: From Simple to Multiple Regression

"Correlation is not causation but it sure is a hint."— Edward Tufte, Statistician

Master the most commonly used regression method in Python


Section Objectives

  • Conduct OLS regression using statsmodels
  • Understand every metric in regression output
  • Compare with Stata/R
  • Extract and use regression results

Case Study: Returns to Education Research

We use a classic labor economics question: How much does education affect wages?

python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Generate simulated data: 1000 observations
np.random.seed(2024)
n = 1000

data = pd.DataFrame({
    'wage': np.random.normal(5000, 2000, n),
    'education': np.random.randint(9, 22, n),
    'experience': np.random.randint(0, 30, n),
    'female': np.random.choice([0, 1], n),
    'urban': np.random.choice([0, 1], n)
})

# Make data more realistic: education and experience positively affect wage
data['wage'] = (2000 +
                300 * data['education'] +
                50 * data['experience'] -
                800 * data['female'] +
                500 * data['urban'] +
                np.random.normal(0, 500, n))

print(data.head())
print(f"\nData dimensions: {data.shape}")

Model 1: Simple Linear Regression

Start with the simplest model: using only years of education to explain wages.

Python Code

python
# Model 1: wage = β0 + β1*education + ε
X1 = sm.add_constant(data['education'])
y = data['wage']

model1 = sm.OLS(y, X1).fit()
print(model1.summary())

Key Output:

                 coef    std err          t      P>|t|
------------------------------------------------------
const       2134.567     89.234     23.918      0.000
education    312.456     5.823      53.667      0.000
------------------------------------------------------
R-squared:                       0.743

Stata Comparison

stata
* Stata code
regress wage education

R Comparison

r
# R code
model1 <- lm(wage ~ education, data = data)
summary(model1)

Interpretation:

  • Each additional year of education increases wage by approximately 312 dollars
  • R² = 0.743, education explains 74.3% of wage variation
  • p < 0.001, education's effect is highly significant

Model 2: Multiple Linear Regression

Add more control variables: work experience, gender, urban/rural.

python
# Model 2: wage = β0 + β1*education + β2*experience + β3*female + β4*urban + ε
X2 = sm.add_constant(data[['education', 'experience', 'female', 'urban']])
y = data['wage']

model2 = sm.OLS(y, X2).fit()
print(model2.summary())

Complete Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.952
Model:                            OLS   Adj. R-squared:                  0.952
Method:                 Least Squares   F-statistic:                     4923.
Date:                ...              Prob (F-statistic):               0.00
No. Observations:                1000   AIC:                         1.234e+04
Df Residuals:                     995   BIC:                         1.236e+04
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2045.234     45.678     44.778      0.000    1955.570    2134.898
education    298.765      2.345    127.418      0.000     294.165     303.365
experience    49.876      1.234     40.418      0.000      47.456      52.296
female      -789.234     23.456    -33.642      0.000    -835.267    -743.201
urban        487.654     23.456     20.789      0.000     441.621     533.687
==============================================================================

Interpretation:

  • Education: Each additional year increases wage by 299 dollars (after controlling for other variables)
  • Experience: Each additional year increases wage by 50 dollars
  • Gender: Female wages are 789 dollars lower than male (gender wage gap)
  • Urban/Rural: Urban residents earn 488 dollars more than rural
  • increased to 0.952, model fit significantly improved

Model Comparison: Three Languages Side by Side

Python (statsmodels)

python
import statsmodels.api as sm

# Prepare data
X = sm.add_constant(data[['education', 'experience', 'female', 'urban']])
y = data['wage']

# Fit model
model = sm.OLS(y, X).fit()

# View results
print(model.summary())

# Extract coefficients
print(model.params)

# Extract R²
print(f"R² = {model.rsquared:.4f}")

# Predict
predictions = model.predict(X)

Stata

stata
* Load data
use wage_data.dta, clear

* Fit model
regress wage education experience female urban

* View coefficients
display _b[education]

* Predict
predict wage_hat

R

r
# Load data
data <- read.csv("wage_data.csv")

# Fit model
model <- lm(wage ~ education + experience + female + urban, data = data)

# View results
summary(model)

# Extract coefficients
coef(model)

# Extract R²
summary(model)$r.squared

# Predict
predictions <- predict(model)

Extracting Regression Results

Get Coefficients

python
# All coefficients
print(model2.params)
'''
const        2045.234
education     298.765
experience     49.876
female       -789.234
urban         487.654
'''

# Single coefficient
edu_coef = model2.params['education']
print(f"Education coefficient: {edu_coef:.2f}")

Get p-values

python
# All p-values
print(model2.pvalues)

# Test significance
if model2.pvalues['education'] < 0.05:
    print("Education coefficient is significant at 5% level")

Get Confidence Intervals

python
# 95% confidence interval
ci = model2.conf_int(alpha=0.05)
print(ci)
'''
                   0           1
const       1955.570    2134.898
education    294.165     303.365
experience    47.456      52.296
female      -835.267    -743.201
urban        441.621     533.687
'''

Get Goodness of Fit

python
print(f"R² = {model2.rsquared:.4f}")
print(f"Adjusted R² = {model2.rsquared_adj:.4f}")
print(f"F-statistic = {model2.fvalue:.2f}")
print(f"Prob(F-statistic) = {model2.f_pvalue:.4e}")

Predicting New Observations

python
# Create new data: urban male with 18 years education, 5 years experience
new_data = pd.DataFrame({
    'const': [1],
    'education': [18],
    'experience': [5],
    'female': [0],
    'urban': [1]
})

# Predict wage
predicted_wage = model2.predict(new_data)
print(f"Predicted wage: {predicted_wage.values[0]:.2f} dollars")

# Prediction interval
predictions = model2.get_prediction(new_data)
pred_summary = predictions.summary_frame(alpha=0.05)
print(pred_summary)
'''
       mean    mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  8234.56      23.45        8188.59        8280.53       7245.67       9223.45
'''

Interpretation:

  • Point prediction: 8234.56 dollars
  • 95% confidence interval: [8188.59, 8280.53] (uncertainty of the mean)
  • 95% prediction interval: [7245.67, 9223.45] (uncertainty of individual, wider)

Three-Language Comparison Summary

TaskPythonStataR
Fit Modelsm.OLS(y, X).fit()regress y x1 x2lm(y ~ x1 + x2)
View Resultsmodel.summary()Auto-displayedsummary(model)
Get Coefficientsmodel.params_b[x1]coef(model)
Get p-valuesmodel.pvaluestest x1summary(model)$coefficients
Get R²model.rsquarede(r2)summary(model)$r.squared
Predictmodel.predict(X)predict yhatpredict(model)
Confidence Intervalmodel.conf_int()regress, level(95)confint(model)

Common Errors

Error 1: Forgot to Add Constant

python
# ❌ Wrong
X = data[['education', 'experience']]
model = sm.OLS(y, X).fit()  # No intercept!

# ✅ Correct
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(y, X).fit()

Error 2: Variable Name Typo

python
# ❌ Wrong
model.params['educaton']  # Typo

# ✅ Correct
model.params['education']

Error 3: Confused X and y Order

python
# ❌ Wrong (will error)
model = sm.OLS(X, y).fit()  # Order reversed

# ✅ Correct
model = sm.OLS(y, X).fit()  # y first, X second

Practice Exercise

Run the following complete code:

python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Generate data
np.random.seed(42)
n = 500
data = pd.DataFrame({
    'wage': np.random.normal(4000, 1500, n),
    'education': np.random.randint(9, 22, n),
    'age': np.random.randint(22, 60, n),
    'female': np.random.choice([0, 1], n)
})

# Construct true relationship
data['wage'] = (1000 + 250 * data['education'] +
                30 * data['age'] - 500 * data['female'] +
                np.random.normal(0, 400, n))

# Fit model
X = sm.add_constant(data[['education', 'age', 'female']])
y = data['wage']
model = sm.OLS(y, X).fit()

# View results
print(model.summary())

# Extract key information
print(f"\nReturns to education: {model.params['education']:.2f} dollars/year")
print(f"Gender wage gap: {model.params['female']:.2f} dollars")
print(f"R² = {model.rsquared:.4f}")

Next Steps

  • Article 03: Logit Regression - Handling Binary Dependent Variables
  • Article 04: summary_col() - Elegantly Displaying Multiple Models

🎉 You've mastered OLS regression in Python!

Released under the MIT License. Content © Author.