1.2 Python Regression Analysis Quick Start

"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."— John Tukey, Statistician

Experience Python's regression analysis capabilities in 5 minutes


Section Objectives

  • Quickly run your first OLS regression
  • Compare regression syntax across Python, Stata, and R
  • Understand core functions of statsmodels

Core Tool: statsmodels

The core library for regression analysis in Python is statsmodels, which provides regression output similar to Stata.

python
# Install statsmodels
!pip install statsmodels

Your First Regression Model

Python Code

python
import pandas as pd
import statsmodels.api as sm

# Simulated data: studying the impact of education years on wage
data = pd.DataFrame({
    'wage': [3000, 3500, 4000, 5000, 5500, 6000, 7000, 8000],
    'education': [12, 12, 14, 14, 16, 16, 18, 18],
    'experience': [0, 2, 1, 3, 2, 4, 3, 5]
})

# OLS regression: wage = β0 + β1*education + β2*experience + ε
X = data[['education', 'experience']]
X = sm.add_constant(X)  # Add constant term
y = data['wage']

model = sm.OLS(y, X).fit()
print(model.summary())

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     188.4
Date:                ...                Prob (F-statistic):           1.96e-05
Time:                ...                Log-Likelihood:                -53.132
No. Observations:                   8   AIC:                             112.3
Df Residuals:                       5   BIC:                             112.5
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -3375.0000    698.659     -4.831      0.005   -5170.960   -1579.040
education     512.5000     55.621      9.214      0.000     369.520     655.480
experience    375.0000     82.916      4.523      0.006     161.859     588.141
==============================================================================

Interpretation:

  • Each additional year of education is associated with about 512.5 dollars more wage, holding experience constant (p < 0.001, significant)
  • Each additional year of work experience is associated with 375 dollars more wage, holding education constant (p < 0.01, significant)
  • R² = 0.987: the model explains nearly all of the variation in wage, which is unsurprising for eight hand-constructed observations
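
To put the fitted coefficients to work, you can pass a new observation to model.predict(). The sketch below is only illustrative: new_obs is a made-up data point, and its columns must match the design matrix used to fit the model (const, education, experience).

python
# Hypothetical new observation: 16 years of education, 3 years of experience
new_obs = pd.DataFrame({'const': [1.0], 'education': [16], 'experience': [3]})
print(model.predict(new_obs))  # ≈ -3375 + 512.5*16 + 375*3 = 5950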

Three-Language Comparison

Stata Code

stata
* Load data
use wage_data.dta, clear

* Run OLS regression
regress wage education experience

* Results are displayed automatically; no extra command is needed

R Code

r
# Load data
data <- read.csv("wage_data.csv")

# Run OLS regression
model <- lm(wage ~ education + experience, data = data)

# View regression results
summary(model)

Python Code

python
# Load data
data = pd.read_csv("wage_data.csv")

# Run OLS regression
X = sm.add_constant(data[['education', 'experience']])
y = data['wage']
model = sm.OLS(y, X).fit()

# View regression results
print(model.summary())

Syntax Comparison Summary

| Function | Stata | R | Python (statsmodels) |
|---|---|---|---|
| Regression command | regress y x1 x2 | lm(y ~ x1 + x2) | sm.OLS(y, X).fit() |
| Add constant | Auto-added | Auto-added | Manual sm.add_constant() |
| View results | Auto-displayed | summary(model) | model.summary() |
| Get coefficients | _b[x1] | coef(model) | model.params |
| Get R² | e(r2) | summary(model)$r.squared | model.rsquared |
| Predict | predict yhat | predict(model) | model.predict(X) |
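
The Python column of this table maps directly onto attributes of the fitted results object. A minimal sketch, reusing model and X from the wage example above:

python
# Coefficients as a pandas Series indexed by variable name
print(model.params)

# R-squared of the fit
print(model.rsquared)

# In-sample predictions (equivalent to model.fittedvalues)
print(model.predict(X))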

Key Considerations for Python Regression

1. Must Manually Add Constant Term

python
# ❌ Wrong: constant term omitted
X = data[['education', 'experience']]
model = sm.OLS(y, X).fit()  # Forces the regression through the origin, so the estimates are biased!

# ✅ Correct: Add constant
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(y, X).fit()

2. Order of X and y

python
# Python/statsmodels: OLS(y, X)
model = sm.OLS(y, X).fit()

# R syntax: lm(y ~ X)
# Note: Python is (y, X), R is formula form
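
If you prefer R's formula style, statsmodels also provides a formula interface (statsmodels.formula.api) that takes a "y ~ x1 + x2" string and adds the constant automatically. A minimal sketch using the simulated wage data from above:

python
import statsmodels.formula.api as smf

# R-style formula; the intercept is added automatically
formula_model = smf.ols('wage ~ education + experience', data=data).fit()
print(formula_model.summary())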

3. Must Call summary() to View Results

python
# ❌ Only displays model object
print(model)

# ✅ Displays complete regression results
print(model.summary())
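
Note that model.summary() returns a Summary object rather than a plain string; print() renders it as text, and the object can also export itself in other formats. A minimal sketch:

python
results = model.summary()   # a Summary object, not a plain string
print(results.as_text())    # same output as print(model.summary())
print(results.as_latex())   # LaTeX version of the regression tables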

Quick Practice

Run the following code to experience Python regression:

python
import pandas as pd
import statsmodels.api as sm
import numpy as np

# Generate simulated data
np.random.seed(42)
n = 100
data = pd.DataFrame({
    'income': np.random.normal(5000, 1500, n),
    'age': np.random.randint(22, 65, n),
    'education': np.random.randint(9, 22, n)
})

# Regress income on age and education
# (income was simulated independently of both, so expect small, insignificant coefficients)
X = sm.add_constant(data[['age', 'education']])
y = data['income']

model = sm.OLS(y, X).fit()
print(model.summary())

# Extract key results
print(f"\n📊 Key Metrics:")
print(f"R² = {model.rsquared:.3f}")
print(f"Education coefficient = {model.params['education']:.2f}")
print(f"Education p-value = {model.pvalues['education']:.4f}")

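Beyond params, rsquared, and pvalues, the fitted results object exposes other quantities you will often want to report. A short follow-on sketch, run after the practice script above:

python
# Standard errors of the coefficients
print(model.bse)

# 95% confidence intervals (lower and upper bound for each coefficient)
print(model.conf_int())

# Number of observations used in the fit
print(int(model.nobs))
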
Next Steps

  • Article 02: Deep Dive into OLS Regression - Model Diagnostics & Interpretation
  • Article 03: Logit Regression - Binary Dependent Variable Models
  • Article 04: summary_col() - Elegantly Comparing Multiple Models

🎉 Congratulations! You've run your first Python regression model!
