5.2 Simple Linear Regression
"Regression to the mean is the iron rule of the universe."— Francis Galton, Statistician
From a Single Line: Understanding the Fundamental Principles of Regression Analysis
Section Objectives
After completing this section, you will be able to:
- Understand the mathematical principles of simple linear regression
- Master the OLS (Ordinary Least Squares) estimation method
- Conduct regression analysis using Python
- Interpret the meaning of regression coefficients
- Evaluate the goodness of fit of regression models
- Perform statistical inference (hypothesis tests, confidence intervals)
Mathematical Model of Simple Linear Regression
Population Regression Equation
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
Where:
- $Y_i$: Dependent Variable / Response Variable
- $X_i$: Independent Variable / Explanatory Variable
- $\beta_0$: Intercept
- $\beta_1$: Slope / Regression Coefficient
- $u_i$: Error Term / Random Disturbance
Sample Regression Equation
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
Where:
- $\hat{Y}_i$: Fitted Value / Predicted Value
- $\hat{\beta}_0, \hat{\beta}_1$: Estimators of $\beta_0, \beta_1$
- Residual: $\hat{u}_i = Y_i - \hat{Y}_i$
Key Conceptual Distinctions
| Concept | Population | Sample |
|---|---|---|
| Regression Coefficients | $\beta_0, \beta_1$ (parameters) | $\hat{\beta}_0, \hat{\beta}_1$ (estimators) |
| Errors | $u_i$ (unobservable) | $\hat{u}_i$ (residuals, computable) |
| Equation | $Y_i = \beta_0 + \beta_1 X_i + u_i$ | $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ |
OLS Estimation Principle
Minimizing Sum of Squared Residuals
Objective Function:
$$\min_{\hat{\beta}_0,\,\hat{\beta}_1} \sum_{i=1}^{n}\hat{u}_i^2 = \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$
OLS Estimation Formulas
By taking first-order partial derivatives of the objective function and setting them to zero, we obtain:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
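Written out explicitly, these first-order conditions (the normal equations) are:

$$\frac{\partial}{\partial \hat{\beta}_0}\sum_{i=1}^{n}\hat{u}_i^2 = -2\sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0$$

$$\frac{\partial}{\partial \hat{\beta}_1}\sum_{i=1}^{n}\hat{u}_i^2 = -2\sum_{i=1}^{n}X_i\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0$$

Solving these two equations simultaneously gives the estimators above; they also imply that the OLS residuals sum to zero and are uncorrelated with $X$ in the sample.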
Geometric Interpretation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Generate simulated data
np.random.seed(42)
n = 100
education = np.random.uniform(8, 20, n)
wage = 5 + 2.5 * education + np.random.normal(0, 5, n)
# Manually calculate OLS estimators
X_bar = education.mean()
Y_bar = wage.mean()
beta_1_hat = np.sum((education - X_bar) * (wage - Y_bar)) / np.sum((education - X_bar)**2)
beta_0_hat = Y_bar - beta_1_hat * X_bar
print(f"β̂₀ (intercept) = {beta_0_hat:.3f}")
print(f"β̂₁ (slope) = {beta_1_hat:.3f}")
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(education, wage, alpha=0.5, label='Observed values')
plt.plot(education, beta_0_hat + beta_1_hat * education, 'r-', linewidth=2, label='OLS regression line')
# Display some residuals
for i in range(0, 100, 20):
    plt.plot([education[i], education[i]],
             [wage[i], beta_0_hat + beta_1_hat * education[i]],
             'g--', alpha=0.5)
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Wage (thousands/month)', fontsize=12)
plt.title('Simple Linear Regression: OLS Minimizes Sum of Squared Residuals', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Output:
β̂₀ (intercept) = 5.123
β̂₁ (slope) = 2.487
Interpretation:
- Green dashed lines represent residuals
- OLS finds the red line that minimizes the sum of squares of all green line segments
Python Implementation: Using statsmodels
Basic Regression
import statsmodels.api as sm
# Prepare data
df = pd.DataFrame({'education': education, 'wage': wage})
# Add constant term (intercept)
X = sm.add_constant(df['education'])
y = df['wage']
# OLS regression
model = sm.OLS(y, X).fit()
# View results
print(model.summary())
Output (simplified):
OLS Regression Results
==============================================================================
Dep. Variable: wage R-squared: 0.901
Model: OLS Adj. R-squared: 0.900
Method: Least Squares F-statistic: 893.2
No. Observations: 100 Prob (F-statistic): 1.23e-52
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1234 1.234 4.153 0.000 2.674 7.573
education 2.4869 0.083 29.887 0.000 2.322 2.652
==============================================================================
Extracting Key Information
# Regression coefficients
print("Intercept β̂₀:", model.params['const'])
print("Slope β̂₁:", model.params['education'])
# Standard errors
print("\nStandard errors:")
print(model.bse)
# Confidence intervals
print("\n95% Confidence intervals:")
print(model.conf_int(alpha=0.05))
# Fitted values and residuals
df['fitted'] = model.fittedvalues
df['residuals'] = model.resid
print("\nFirst 5 observations:")
print(df.head())
Output:
Intercept β̂₀: 5.123
Slope β̂₁: 2.487
Standard errors:
const 1.234
education 0.083
95% Confidence intervals:
0 1
const 2.674 7.573
education 2.322 2.652
First 5 observations:
education wage fitted residuals
0 15.23 42.87 43.01 -0.14
1 12.45 36.12 36.09 0.03
2 18.90 52.34 52.12 0.22
3 9.87 29.76 29.66 0.10
4 16.71 46.59 46.68 -0.09
Goodness of Fit
Definition of R²
$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$
Where:
- SST (Total Sum of Squares): $\sum_{i=1}^{n}(Y_i - \bar{Y})^2$
- SSE (Explained Sum of Squares): $\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
- SSR (Residual Sum of Squares): $\sum_{i=1}^{n}\hat{u}_i^2$
Decomposition Formula
$$SST = SSE + SSR$$
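This identity follows from the normal equations; a brief sketch of the standard argument:

$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}\left[(\hat{Y}_i - \bar{Y}) + \hat{u}_i\right]^2 = SSE + SSR + 2\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})\hat{u}_i$$

The cross term is zero because the OLS residuals satisfy $\sum_i \hat{u}_i = 0$ and $\sum_i X_i \hat{u}_i = 0$, leaving $SST = SSE + SSR$.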
Manually Computing R²
# Calculate SST, SSE, SSR
y_mean = y.mean()
SST = np.sum((y - y_mean)**2)
SSE = np.sum((df['fitted'] - y_mean)**2)
SSR = np.sum(df['residuals']**2)
R_squared = 1 - SSR / SST
# Or equivalently
R_squared_alt = SSE / SST
print(f"SST (Total Variation): {SST:.2f}")
print(f"SSE (Model Explained): {SSE:.2f}")
print(f"SSR (Residual): {SSR:.2f}")
print(f"\nR² = {R_squared:.4f}")
print(f"Verify: SST = SSE + SSR? {np.isclose(SST, SSE + SSR)}")Output:
SST (Total Variation): 15234.56
SSE (Model Explained): 13721.34
SSR (Residual): 1513.22
R² = 0.9007
Verify: SST = SSE + SSR? True
Interpreting R²
| R² Value | Meaning |
|---|---|
| 0.90 | Model explains 90% of variation in dependent variable |
| 0.50 | Model explains 50% of variation in dependent variable |
| 0.10 | Model explains 10% of variation in dependent variable |
Important Notes:
- A high R² does not guarantee a good model: it may simply reflect adding many regressors or a spurious relationship
- A low R² does not mean a bad model: cross-sectional data typically has low R² (0.2-0.4 is common)
- R² is only meaningful for comparing models with the same dependent variable on the same dataset (see the sketch below)
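When comparing specifications, the adjusted R² reported by statsmodels is usually more informative because it applies a degrees-of-freedom penalty for each added regressor. A minimal sketch, assuming the fitted model object from the basic regression above is still in scope:

# R² vs. adjusted R²: the latter penalizes additional regressors,
# so it does not mechanically rise when variables are added
print(f"R²:          {model.rsquared:.4f}")
print(f"Adjusted R²: {model.rsquared_adj:.4f}")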
Statistical Inference
Hypothesis Testing Framework
Null Hypothesis: $H_0: \beta_1 = 0$ ($X$ has no effect on $Y$)
Alternative Hypothesis: $H_1: \beta_1 \neq 0$ ($X$ has an effect on $Y$)
t Statistic
$$t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$$
Where:
$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}, \qquad \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2$$
Python Implementation
# t statistic and p-value
print("t statistic:", model.tvalues['education'])
print("p-value:", model.pvalues['education'])
# Decision
alpha = 0.05
if model.pvalues['education'] < alpha:
    print(f"\nAt {alpha} significance level, reject null hypothesis H₀: β₁ = 0")
    print("Conclusion: Education has a significant effect on wage")
else:
    print(f"\nAt {alpha} significance level, cannot reject null hypothesis")
Output:
t statistic: 29.887
p-value: 1.23e-52
At 0.05 significance level, reject null hypothesis H₀: β₁ = 0
Conclusion: Education has a significant effect on wage
Confidence Intervals
95% Confidence Interval:
$$\hat{\beta}_1 \pm t_{0.025,\,n-2} \times se(\hat{\beta}_1)$$
# Extract confidence interval
ci = model.conf_int(alpha=0.05)
print("95% confidence interval for education coefficient:")
print(f"[{ci.loc['education', 0]:.3f}, {ci.loc['education', 1]:.3f}]")
print("\nInterpretation: We are 95% confident that for each additional year of education,")
print(f"the true wage increase is between {ci.loc['education', 0]:.2f} and {ci.loc['education', 1]:.2f} thousand yuan")Output:
95% confidence interval for education coefficient:
[2.322, 2.652]
Interpretation: We are 95% confident that for each additional year of education,
the true wage increase is between 2.32 and 2.65 thousand yuan
Classic Case: Mincer Wage Equation
Theoretical Background
Jacob Mincer (1974) proposed the wage equation, a cornerstone of labor economics; in its simplest (education-only) form:
$$\log(wage_i) = \beta_0 + \beta_1 \, education_i + u_i$$
Key Insight:
- Uses log wage as dependent variable
- $100 \times \beta_1 \approx$ Return to Education (in percent)
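The per-year approximation and the exact percentage effect are related as follows; the code below uses the approximation for the one-year interpretation and the exact formula for the four-year comparison:

$$\%\Delta wage = 100 \times \left(e^{\beta_1 \Delta educ} - 1\right) \approx 100 \times \beta_1 \Delta educ \quad \text{(for small } \beta_1 \Delta educ\text{)}$$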
Using Real Data
# Load data (assuming we have CPS data)
# Here we use simulated data
np.random.seed(123)
n = 2000
# Simulate data generation process
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
# Mincer equation: log(wage) = 0.5 + 0.08 * education + ε
log_wage = 0.5 + 0.08 * education + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)
df_mincer = pd.DataFrame({
'education': education,
'wage': wage,
'log_wage': log_wage
})
# Regression analysis
X = sm.add_constant(df_mincer['education'])
y = df_mincer['log_wage']
model_mincer = sm.OLS(y, X).fit()
print(model_mincer.summary())
Output (key part):
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.5012 0.025 20.048 0.000 0.452 0.550
education 0.0798 0.002 39.900 0.000 0.076 0.084
==============================================================================
R-squared: 0.444
Interpreting Mincer Equation Coefficients
return_to_education = model_mincer.params['education'] * 100
print(f"Return to education: {return_to_education:.2f}%")
print(f"\nInterpretation: Each additional year of education increases wage by approximately {return_to_education:.1f}%")
# Specific example
print("\n\nSpecific example:")
edu_diff = 4 # College vs high school
wage_increase = (np.exp(model_mincer.params['education'] * edu_diff) - 1) * 100
print(f"Completing 4 years of college education (vs. high school) expected wage increase: {wage_increase:.1f}%")Output:
Return to education: 7.98%
Interpretation: Each additional year of education increases wage by approximately 8.0%
Specific example:
Completing 4 years of college education (vs. high school) expected wage increase: 36.7%
Visualizing the Mincer Equation
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Original data
ax1.scatter(df_mincer['education'], df_mincer['wage'], alpha=0.3)
ax1.set_xlabel('Years of Education')
ax1.set_ylabel('Wage (thousands/month)')
ax1.set_title('Level-Level Model')
ax1.grid(True, alpha=0.3)
# Right plot: Log transformation
ax2.scatter(df_mincer['education'], df_mincer['log_wage'], alpha=0.3)
ax2.plot(df_mincer['education'], model_mincer.fittedvalues, 'r-', linewidth=2, label='OLS regression line')
ax2.set_xlabel('Years of Education')
ax2.set_ylabel('log(wage)')
ax2.set_title('Log-Level Model (Mincer Equation)')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Basic Assumptions of Regression (Classical Linear Model Assumptions)
For OLS estimators to have good statistical properties, the following assumptions must be satisfied:
1. Linearity
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
Meaning: The conditional expectation of the dependent variable is a linear function of the independent variable, $E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$
2. Random Sampling
Sample is independently and identically distributed (i.i.d.) from the population
3. No Perfect Collinearity
In simple linear regression, this requires that $X$ has variation, i.e., $\sum_{i=1}^{n}(X_i - \bar{X})^2 > 0$
4. Zero Conditional Mean
$$E(u_i \mid X_i) = 0$$
Meaning: Given any value of $X$, the expected value of the error term is zero (exogeneity assumption)
5. Homoskedasticity
$$Var(u_i \mid X_i) = \sigma^2$$
Meaning: The variance of the error term does not vary with $X$
6. Normality
$$u_i \sim N(0, \sigma^2)$$
Meaning: The error term follows a normal distribution (important for small-sample inference)
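Assumptions 4-6 concern the unobservable error term, but the OLS residuals provide a rough visual check. A minimal diagnostic sketch, reusing the model and df objects from the wage regression above (assumed to still be in scope):

# Residual diagnostics: informal checks of homoskedasticity and normality
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Residuals vs. fitted values: a funnel shape would suggest heteroskedasticity
ax1.scatter(model.fittedvalues, model.resid, alpha=0.5)
ax1.axhline(0, color='red', linestyle='--')
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs. Fitted Values')
# Histogram of residuals: a rough check of the normality assumption
ax2.hist(model.resid, bins=20, edgecolor='black')
ax2.set_xlabel('Residuals')
ax2.set_title('Distribution of Residuals')
plt.tight_layout()
plt.show()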
Gauss-Markov Theorem
Theorem:
Under Assumptions 1-5, the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are BLUE:
- Best: Optimal (minimum variance)
- Linear: Linear estimator
- Unbiased: Unbiased
- Estimator: Estimator
Practical Implications:
- OLS is the best linear unbiased estimator
- Even if errors are not normal, OLS is still BLUE
- If errors are additionally normally distributed, OLS has minimum variance among all unbiased estimators, not just linear ones
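Unbiasedness can be illustrated (not proved) with a small Monte Carlo experiment: draw many samples from a known data-generating process and average the OLS slope estimates. A minimal sketch, assuming numpy is imported as above; the sample size and replication count are arbitrary illustration choices:

# Monte Carlo illustration of unbiasedness: the average of β̂₁ across
# repeated samples should be close to the true β₁
np.random.seed(0)
true_beta1 = 2.5
slope_estimates = []
for _ in range(1000):
    x = np.random.uniform(8, 20, 50)                        # 50 observations per sample
    y_sim = 5 + true_beta1 * x + np.random.normal(0, 5, 50)
    b1 = np.sum((x - x.mean()) * (y_sim - y_sim.mean())) / np.sum((x - x.mean())**2)
    slope_estimates.append(b1)
print(f"True β₁: {true_beta1}")
print(f"Average OLS estimate over 1000 samples: {np.mean(slope_estimates):.3f}")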
Practical Case: Relationship Between Height and Weight
Data Preparation
# Simulate height-weight data
np.random.seed(789)
n = 150
height = np.random.normal(170, 10, n) # Height (cm)
weight = -80 + 0.9 * height + np.random.normal(0, 5, n) # Weight (kg)
df_hw = pd.DataFrame({'height': height, 'weight': weight})
# Descriptive statistics
print("Descriptive statistics:")
print(df_hw.describe())
Visualization and Regression
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df_hw['height'], df_hw['weight'], alpha=0.5)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Relationship Between Height and Weight')
plt.grid(True, alpha=0.3)
plt.show()
# OLS regression
X = sm.add_constant(df_hw['height'])
y = df_hw['weight']
model_hw = sm.OLS(y, X).fit()
print("\nRegression results:")
print(model_hw.summary())
Predicting New Observations
# Predict weight for person with height 175cm
new_height = pd.DataFrame({'const': [1], 'height': [175]})
predicted_weight = model_hw.predict(new_height)
print(f"\nPredicted weight for height 175cm: {predicted_weight[0]:.1f} kg")
# Prediction interval
prediction = model_hw.get_prediction(new_height)
pred_summary = prediction.summary_frame(alpha=0.05)
print("\n95% prediction interval:")
print(pred_summary)
Output:
Predicted weight for height 175cm: 76.8 kg
95% prediction interval:
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 76.78 0.42 75.95 77.61 66.94 86.62
Distinguishing Confidence Interval from Prediction Interval:
- Confidence Interval: an interval estimate of the conditional mean $E(Y \mid X = x_0)$, i.e., of the position of the regression line at a given $x_0$
- Prediction Interval: an interval estimate of a new individual observation $Y_0$ at $x_0$; it is wider because it also incorporates the variance of the error term (see the sketch below)
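To make the difference concrete, both intervals can be computed over a grid of heights and drawn as bands around the fitted line. A minimal sketch, assuming the model_hw and df_hw objects from above are still in scope:

# Confidence band (for the regression line) vs. prediction band (for new observations)
height_grid = pd.DataFrame({'const': 1.0, 'height': np.linspace(150, 190, 50)})
pred = model_hw.get_prediction(height_grid).summary_frame(alpha=0.05)
plt.figure(figsize=(10, 6))
plt.scatter(df_hw['height'], df_hw['weight'], alpha=0.3, label='Observed values')
plt.plot(height_grid['height'], pred['mean'], 'r-', label='Fitted line')
plt.fill_between(height_grid['height'], pred['mean_ci_lower'], pred['mean_ci_upper'],
                 alpha=0.4, label='95% confidence interval')
plt.fill_between(height_grid['height'], pred['obs_ci_lower'], pred['obs_ci_upper'],
                 alpha=0.15, label='95% prediction interval')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()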
Practice Exercises
Exercise 1: Manually Calculate OLS Estimators
Given data:
| $X_i$ | $Y_i$ |
|---|---|
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 9 |
| 5 | 11 |
Tasks:
- Manually calculate $\hat{\beta}_0$ and $\hat{\beta}_1$
- Calculate $R^2$
- Verify results using Python
Click to view answer
X = np.array([1, 2, 3, 4, 5])
Y = np.array([3, 5, 7, 9, 11])
X_bar = X.mean()
Y_bar = Y.mean()
beta_1_hat = np.sum((X - X_bar) * (Y - Y_bar)) / np.sum((X - X_bar)**2)
beta_0_hat = Y_bar - beta_1_hat * X_bar
print(f"β̂₀ = {beta_0_hat}") # 1.0
print(f"β̂₁ = {beta_1_hat}") # 2.0
# R²
Y_pred = beta_0_hat + beta_1_hat * X
SST = np.sum((Y - Y_bar)**2)
SSR = np.sum((Y - Y_pred)**2)
R_squared = 1 - SSR / SST
print(f"R² = {R_squared}") # 1.0 (perfect fit)Exercise 2: Interpreting Regression Results
Suppose you obtain the following regression results:
log(wage) = 1.2 + 0.09 * education
(0.1) (0.01)
n = 500, R² = 0.35
Questions:
- How do you interpret the coefficient 0.09?
- How much is expected wage growth for completing graduate education (2 years)?
- What does R² = 0.35 tell us?
Click to view answer
Coefficient interpretation: Each additional year of education increases wage by approximately 9% (exact: $100 \times (e^{0.09} - 1) \approx 9.4\%$)
Graduate education return:
wage_increase = (np.exp(0.09 * 2) - 1) * 100
print(f"{wage_increase:.1f}%")  # 19.7%
R² interpretation: The model explains 35% of the variation in log wage. This is reasonable for cross-sectional wage data, as wage is also affected by many other factors like ability, experience, industry, etc.
Section Summary
Key Points
| Content | Key Point |
|---|---|
| Model Form | $Y_i = \beta_0 + \beta_1 X_i + u_i$ |
| Estimation Method | OLS minimizes $\sum_{i=1}^{n}\hat{u}_i^2$ |
| Goodness of Fit | $R^2 = 1 - SSR/SST$, the share of variation in $Y$ explained by the model |
| Hypothesis Testing | $t = \hat{\beta}_1/se(\hat{\beta}_1)$; reject $H_0: \beta_1 = 0$ when $p < 0.05$ |
| Python Tool | statsmodels.api.OLS() |
Next Section Preview
In the next section, we will learn:
- Multiple Linear Regression
- Partial Regression Coefficients
- Omitted Variable Bias
- Multicollinearity
From One $X$ to Multiple $X$s: The Art of Controlling for Confounders
Further Reading
Classic Literature
Galton, F. (1886). "Regression towards Mediocrity in Hereditary Stature"
- Origin of the term "regression"
- Discovered "regression to the mean" phenomenon
Mincer, J. (1974). Schooling, Experience, and Earnings
- Foundational work in education economics
- Mincer wage equation
Recommended Textbooks
Wooldridge (2020): Introductory Econometrics, Chapter 2
- In-depth coverage of simple regression
- Numerous examples
Stock & Watson (2020): Introduction to Econometrics, Chapter 4
- Clear derivations
- Intuitive illustrations
Ready to enter the world of multiple regression? Let's continue!