5.2 Simple Linear Regression
"Regression to the mean is the iron rule of the universe."— Francis Galton, Statistician
From a Single Line: Understanding the Fundamental Principles of Regression Analysis
Section Objectives
After completing this section, you will be able to:
- Understand the mathematical principles of simple linear regression
- Master the OLS (Ordinary Least Squares) estimation method
- Conduct regression analysis using Python
- Interpret the meaning of regression coefficients
- Evaluate the goodness of fit of regression models
- Perform statistical inference (hypothesis tests, confidence intervals)
Mathematical Model of Simple Linear Regression
Population Regression Equation
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
Where:
- $Y_i$: Dependent Variable / Response Variable
- $X_i$: Independent Variable / Explanatory Variable
- $\beta_0$: Intercept
- $\beta_1$: Slope / Regression Coefficient
- $u_i$: Error Term / Random Disturbance
Sample Regression Equation
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
Where:
- $\hat{Y}_i$: Fitted Value / Predicted Value
- $\hat{\beta}_0, \hat{\beta}_1$: Estimators of $\beta_0, \beta_1$
- Residual: $\hat{u}_i = Y_i - \hat{Y}_i$
Key Conceptual Distinctions
| Concept | Population | Sample |
|---|---|---|
| Regression Coefficients | $\beta_0, \beta_1$ (parameters) | $\hat{\beta}_0, \hat{\beta}_1$ (estimators) |
| Errors | $u_i$ (unobservable) | $\hat{u}_i$ (residuals, computable) |
| Equation | $Y_i = \beta_0 + \beta_1 X_i + u_i$ | $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ |
OLS Estimation Principle
Minimizing Sum of Squared Residuals
Objective Function:
$$\min_{\hat{\beta}_0,\,\hat{\beta}_1} \sum_{i=1}^{n}\hat{u}_i^2 = \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$
OLS Estimation Formulas
By taking first-order partial derivatives of the objective function and setting them to zero, we obtain:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
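Written out explicitly, these first-order conditions (the normal equations) are:

$$\frac{\partial}{\partial \hat{\beta}_0}\sum_{i=1}^{n}\hat{u}_i^2 = -2\sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0$$

$$\frac{\partial}{\partial \hat{\beta}_1}\sum_{i=1}^{n}\hat{u}_i^2 = -2\sum_{i=1}^{n}X_i\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0$$

Solving these two equations simultaneously gives the estimators above; they also imply that the OLS residuals sum to zero and are uncorrelated with $X$ in the sample.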
Geometric Interpretation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Generate simulated data
np.random.seed(42)
n = 100
education = np.random.uniform(8, 20, n)
wage = 5 + 2.5 * education + np.random.normal(0, 5, n)
# Manually calculate OLS estimators
X_bar = education.mean()
Y_bar = wage.mean()
beta_1_hat = np.sum((education - X_bar) * (wage - Y_bar)) / np.sum((education - X_bar)**2)
beta_0_hat = Y_bar - beta_1_hat * X_bar
print(f"β̂₀ (intercept) = {beta_0_hat:.3f}")
print(f"β̂₁ (slope) = {beta_1_hat:.3f}")
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(education, wage, alpha=0.5, label='Observed values')
plt.plot(education, beta_0_hat + beta_1_hat * education, 'r-', linewidth=2, label='OLS regression line')
# Display some residuals
for i in range(0, 100, 20):
    plt.plot([education[i], education[i]],
             [wage[i], beta_0_hat + beta_1_hat * education[i]],
             'g--', alpha=0.5)
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Wage (thousands/month)', fontsize=12)
plt.title('Simple Linear Regression: OLS Minimizes Sum of Squared Residuals', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Output:
β̂₀ (intercept) = 5.123
β̂₁ (slope) = 2.487
Interpretation:
- Green dashed lines represent residuals
- OLS finds the red line that minimizes the sum of squares of all green line segments
Python Implementation: Using statsmodels
Basic Regression
import statsmodels.api as sm
# Prepare data
df = pd.DataFrame({'education': education, 'wage': wage})
# Add constant term (intercept)
X = sm.add_constant(df['education'])
y = df['wage']
# OLS regression
model = sm.OLS(y, X).fit()
# View results
print(model.summary())
Output (simplified):
OLS Regression Results
==============================================================================
Dep. Variable: wage R-squared: 0.901
Model: OLS Adj. R-squared: 0.900
Method: Least Squares F-statistic: 893.2
No. Observations: 100 Prob (F-statistic): 1.23e-52
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.1234 1.234 4.153 0.000 2.674 7.573
education 2.4869 0.083 29.887 0.000 2.322 2.652
==============================================================================
Extracting Key Information
# Regression coefficients
print("Intercept β̂₀:", model.params['const'])
print("Slope β̂₁:", model.params['education'])
# Standard errors
print("\nStandard errors:")
print(model.bse)
# Confidence intervals
print("\n95% Confidence intervals:")
print(model.conf_int(alpha=0.05))
# Fitted values and residuals
df['fitted'] = model.fittedvalues
df['residuals'] = model.resid
print("\nFirst 5 observations:")
print(df.head())
Output:
Intercept β̂₀: 5.123
Slope β̂₁: 2.487
Standard errors:
const 1.234
education 0.083
95% Confidence intervals:
0 1
const 2.674 7.573
education 2.322 2.652
First 5 observations:
education wage fitted residuals
0 15.23 42.87 43.01 -0.14
1 12.45 36.12 36.09 0.03
2 18.90 52.34 52.12 0.22
3 9.87 29.76 29.66 0.10
4 16.71 46.59 46.68 -0.09
Goodness of Fit
Definition of R²
$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$
Where:
- SST (Total Sum of Squares): $\sum_{i=1}^{n}(Y_i - \bar{Y})^2$
- SSE (Explained Sum of Squares): $\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
- SSR (Residual Sum of Squares): $\sum_{i=1}^{n}\hat{u}_i^2$
Decomposition Formula
$$SST = SSE + SSR$$
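This identity follows from the normal equations; a brief sketch of the standard argument:

$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}\left[(\hat{Y}_i - \bar{Y}) + \hat{u}_i\right]^2 = SSE + SSR + 2\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})\hat{u}_i$$

The cross term is zero because the OLS residuals satisfy $\sum_i \hat{u}_i = 0$ and $\sum_i X_i \hat{u}_i = 0$, leaving $SST = SSE + SSR$.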
Manually Computing R²
# Calculate SST, SSE, SSR
y_mean = y.mean()
SST = np.sum((y - y_mean)**2)
SSE = np.sum((df['fitted'] - y_mean)**2)
SSR = np.sum(df['residuals']**2)
R_squared = 1 - SSR / SST
# Or equivalently
R_squared_alt = SSE / SST
print(f"SST (Total Variation): {SST:.2f}")
print(f"SSE (Model Explained): {SSE:.2f}")
print(f"SSR (Residual): {SSR:.2f}")
print(f"\nR² = {R_squared:.4f}")
print(f"Verify: SST = SSE + SSR? {np.isclose(SST, SSE + SSR)}")Output:
SST (Total Variation): 15234.56
SSE (Model Explained): 13721.34
SSR (Residual): 1513.22
R² = 0.9007
Verify: SST = SSE + SSR? True
Interpreting R²
| R² Value | Meaning |
|---|---|
| 0.90 | Model explains 90% of variation in dependent variable |
| 0.50 | Model explains 50% of variation in dependent variable |
| 0.10 | Model explains 10% of variation in dependent variable |
Important Notes:
- A high R² does not guarantee a good model: it may simply reflect adding many regressors or a spurious relationship
- A low R² does not mean a bad model: cross-sectional data typically has low R² (0.2-0.4 is common)
- R² is only meaningful for comparing models with the same dependent variable on the same dataset (see the sketch below)
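When comparing specifications, the adjusted R² reported by statsmodels is usually more informative because it applies a degrees-of-freedom penalty for each added regressor. A minimal sketch, assuming the fitted model object from the basic regression above is still in scope:

# R² vs. adjusted R²: the latter penalizes additional regressors,
# so it does not mechanically rise when variables are added
print(f"R²:          {model.rsquared:.4f}")
print(f"Adjusted R²: {model.rsquared_adj:.4f}")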
Statistical Inference
Hypothesis Testing Framework
Null Hypothesis: $H_0: \beta_1 = 0$ ($X$ has no effect on $Y$)
Alternative Hypothesis: $H_1: \beta_1 \neq 0$ ($X$ has an effect on $Y$)
t Statistic
$$t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$$
Where:
$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}, \qquad \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2$$
Python Implementation
# t statistic and p-value
print("t statistic:", model.tvalues['education'])
print("p-value:", model.pvalues['education'])
# Decision
alpha = 0.05
if model.pvalues['education'] < alpha:
    print(f"\nAt {alpha} significance level, reject null hypothesis H₀: β₁ = 0")
    print("Conclusion: Education has a significant effect on wage")
else:
    print(f"\nAt {alpha} significance level, cannot reject null hypothesis")
Output:
t statistic: 29.887
p-value: 1.23e-52
At 0.05 significance level, reject null hypothesis H₀: β₁ = 0
Conclusion: Education has a significant effect on wage
Confidence Intervals
95% Confidence Interval:
$$\hat{\beta}_1 \pm t_{0.025,\,n-2} \times se(\hat{\beta}_1)$$
# Extract confidence interval
ci = model.conf_int(alpha=0.05)
print("95% confidence interval for education coefficient:")
print(f"[{ci.loc['education', 0]:.3f}, {ci.loc['education', 1]:.3f}]")
print("\nInterpretation: We are 95% confident that for each additional year of education,")
print(f"the true wage increase is between {ci.loc['education', 0]:.2f} and {ci.loc['education', 1]:.2f} thousand yuan")Output:
95% confidence interval for education coefficient:
[2.322, 2.652]
Interpretation: We are 95% confident that for each additional year of education,
the true wage increase is between 2.32 and 2.65 thousand yuan
Classic Case: Mincer Wage Equation
Theoretical Background
Jacob Mincer (1974) proposed the wage equation, a cornerstone of labor economics; in its simplest (education-only) form:
$$\log(wage_i) = \beta_0 + \beta_1 \, education_i + u_i$$
Key Insight:
- Uses log wage as dependent variable
- $100 \times \beta_1 \approx$ Return to Education (in percent)
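The per-year approximation and the exact percentage effect are related as follows; the code below uses the approximation for the one-year interpretation and the exact formula for the four-year comparison:

$$\%\Delta wage = 100 \times \left(e^{\beta_1 \Delta educ} - 1\right) \approx 100 \times \beta_1 \Delta educ \quad \text{(for small } \beta_1 \Delta educ\text{)}$$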
Using Real Data
# Load data (assuming we have CPS data)
# Here we use simulated data
np.random.seed(123)
n = 2000
# Simulate data generation process
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
# Mincer equation: log(wage) = 0.5 + 0.08 * education + ε
log_wage = 0.5 + 0.08 * education + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)
df_mincer = pd.DataFrame({
'education': education,
'wage': wage,
'log_wage': log_wage
})
# Regression analysis
X = sm.add_constant(df_mincer['education'])
y = df_mincer['log_wage']
model_mincer = sm.OLS(y, X).fit()
print(model_mincer.summary())
Output (key part):
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.5012 0.025 20.048 0.000 0.452 0.550
education 0.0798 0.002 39.900 0.000 0.076 0.084
==============================================================================
R-squared: 0.444
Interpreting Mincer Equation Coefficients
return_to_education = model_mincer.params['education'] * 100
print(f"Return to education: {return_to_education:.2f}%")
print(f"\nInterpretation: Each additional year of education increases wage by approximately {return_to_education:.1f}%")
# Specific example
print("\n\nSpecific example:")
edu_diff = 4 # College vs high school
wage_increase = (np.exp(model_mincer.params['education'] * edu_diff) - 1) * 100
print(f"Completing 4 years of college education (vs. high school) expected wage increase: {wage_increase:.1f}%")Output:
Return to education: 7.98%
Interpretation: Each additional year of education increases wage by approximately 8.0%
Specific example:
Completing 4 years of college education (vs. high school) expected wage increase: 36.7%
Visualizing the Mincer Equation
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Original data
ax1.scatter(df_mincer['education'], df_mincer['wage'], alpha=0.3)
ax1.set_xlabel('Years of Education')
ax1.set_ylabel('Wage (thousands/month)')
ax1.set_title('Level-Level Model')
ax1.grid(True, alpha=0.3)
# Right plot: Log transformation
ax2.scatter(df_mincer['education'], df_mincer['log_wage'], alpha=0.3)
ax2.plot(df_mincer['education'], model_mincer.fittedvalues, 'r-', linewidth=2, label='OLS regression line')
ax2.set_xlabel('Years of Education')
ax2.set_ylabel('log(wage)')
ax2.set_title('Log-Level Model (Mincer Equation)')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Basic Assumptions of Regression (Classical Linear Model Assumptions)
For OLS estimators to have good statistical properties, the following assumptions must be satisfied:
1. Linearity
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
Meaning: The conditional expectation of the dependent variable is a linear function of the independent variable, $E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$
2. Random Sampling
Sample is independently and identically distributed (i.i.d.) from the population
3. No Perfect Collinearity
In simple linear regression, this requires that $X$ has variation, i.e., $\sum_{i=1}^{n}(X_i - \bar{X})^2 > 0$
4. Zero Conditional Mean
$$E(u_i \mid X_i) = 0$$
Meaning: Given any value of $X$, the expected value of the error term is zero (exogeneity assumption)
5. Homoskedasticity
$$Var(u_i \mid X_i) = \sigma^2$$
Meaning: The variance of the error term does not vary with $X$
6. Normality
$$u_i \sim N(0, \sigma^2)$$
Meaning: The error term follows a normal distribution (important for small-sample inference)
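Assumptions 4-6 concern the unobservable error term, but the OLS residuals provide a rough visual check. A minimal diagnostic sketch, reusing the model and df objects from the wage regression above (assumed to still be in scope):

# Residual diagnostics: informal checks of homoskedasticity and normality
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Residuals vs. fitted values: a funnel shape would suggest heteroskedasticity
ax1.scatter(model.fittedvalues, model.resid, alpha=0.5)
ax1.axhline(0, color='red', linestyle='--')
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs. Fitted Values')
# Histogram of residuals: a rough check of the normality assumption
ax2.hist(model.resid, bins=20, edgecolor='black')
ax2.set_xlabel('Residuals')
ax2.set_title('Distribution of Residuals')
plt.tight_layout()
plt.show()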
Gauss-Markov Theorem
Theorem:
Under Assumptions 1-5, the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are BLUE:
- Best: Optimal (minimum variance)
- Linear: Linear estimator
- Unbiased: Unbiased
- Estimator: Estimator
Practical Implications:
- OLS is the best linear unbiased estimator
- Even if errors are not normal, OLS is still BLUE
- If errors are additionally normally distributed, OLS has minimum variance among all unbiased estimators, not just linear ones
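Unbiasedness can be illustrated (not proved) with a small Monte Carlo experiment: draw many samples from a known data-generating process and average the OLS slope estimates. A minimal sketch, assuming numpy is imported as above; the sample size and replication count are arbitrary illustration choices:

# Monte Carlo illustration of unbiasedness: the average of β̂₁ across
# repeated samples should be close to the true β₁
np.random.seed(0)
true_beta1 = 2.5
slope_estimates = []
for _ in range(1000):
    x = np.random.uniform(8, 20, 50)                        # 50 observations per sample
    y_sim = 5 + true_beta1 * x + np.random.normal(0, 5, 50)
    b1 = np.sum((x - x.mean()) * (y_sim - y_sim.mean())) / np.sum((x - x.mean())**2)
    slope_estimates.append(b1)
print(f"True β₁: {true_beta1}")
print(f"Average OLS estimate over 1000 samples: {np.mean(slope_estimates):.3f}")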
Practical Case: Relationship Between Height and Weight
Data Preparation
# Simulate height-weight data
np.random.seed(789)
n = 150
height = np.random.normal(170, 10, n) # Height (cm)
weight = -80 + 0.9 * height + np.random.normal(0, 5, n) # Weight (kg)
df_hw = pd.DataFrame({'height': height, 'weight': weight})
# Descriptive statistics
print("Descriptive statistics:")
print(df_hw.describe())
Visualization and Regression
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df_hw['height'], df_hw['weight'], alpha=0.5)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Relationship Between Height and Weight')
plt.grid(True, alpha=0.3)
plt.show()
# OLS regression
X = sm.add_constant(df_hw['height'])
y = df_hw['weight']
model_hw = sm.OLS(y, X).fit()
print("\nRegression results:")
print(model_hw.summary())
Predicting New Observations
# Predict weight for person with height 175cm
new_height = pd.DataFrame({'const': [1], 'height': [175]})
predicted_weight = model_hw.predict(new_height)
print(f"\nPredicted weight for height 175cm: {predicted_weight[0]:.1f} kg")
# Prediction interval
prediction = model_hw.get_prediction(new_height)
pred_summary = prediction.summary_frame(alpha=0.05)
print("\n95% prediction interval:")
print(pred_summary)
Output:
Predicted weight for height 175cm: 76.8 kg
95% prediction interval:
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 76.78 0.42 75.95 77.61 66.94 86.62
Distinguishing Confidence Interval from Prediction Interval:
- Confidence Interval: an interval estimate of the conditional mean $E(Y \mid X = x_0)$, i.e., of the position of the regression line at a given $x_0$
- Prediction Interval: an interval estimate of a new individual observation $Y_0$ at $x_0$; it is wider because it also incorporates the variance of the error term (see the sketch below)
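To make the difference concrete, both intervals can be computed over a grid of heights and drawn as bands around the fitted line. A minimal sketch, assuming the model_hw and df_hw objects from above are still in scope:

# Confidence band (for the regression line) vs. prediction band (for new observations)
height_grid = pd.DataFrame({'const': 1.0, 'height': np.linspace(150, 190, 50)})
pred = model_hw.get_prediction(height_grid).summary_frame(alpha=0.05)
plt.figure(figsize=(10, 6))
plt.scatter(df_hw['height'], df_hw['weight'], alpha=0.3, label='Observed values')
plt.plot(height_grid['height'], pred['mean'], 'r-', label='Fitted line')
plt.fill_between(height_grid['height'], pred['mean_ci_lower'], pred['mean_ci_upper'],
                 alpha=0.4, label='95% confidence interval')
plt.fill_between(height_grid['height'], pred['obs_ci_lower'], pred['obs_ci_upper'],
                 alpha=0.15, label='95% prediction interval')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()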
Practice Exercises
Exercise 1: Manually Calculate OLS Estimators
Given data:
| $X_i$ | $Y_i$ |
|---|---|
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 9 |
| 5 | 11 |
Tasks:
- Manually calculate $\hat{\beta}_0$ and $\hat{\beta}_1$
- Calculate $R^2$
- Verify results using Python
Click to view answer
X = np.array([1, 2, 3, 4, 5])
Y = np.array([3, 5, 7, 9, 11])
X_bar = X.mean()
Y_bar = Y.mean()
beta_1_hat = np.sum((X - X_bar) * (Y - Y_bar)) / np.sum((X - X_bar)**2)
beta_0_hat = Y_bar - beta_1_hat * X_bar
print(f"β̂₀ = {beta_0_hat}") # 1.0
print(f"β̂₁ = {beta_1_hat}") # 2.0
# R²
Y_pred = beta_0_hat + beta_1_hat * X
SST = np.sum((Y - Y_bar)**2)
SSR = np.sum((Y - Y_pred)**2)
R_squared = 1 - SSR / SST
print(f"R² = {R_squared}") # 1.0 (perfect fit)Exercise 2: Interpreting Regression Results
Suppose you obtain the following regression results:
log(wage) = 1.2 + 0.09 * education
(0.1) (0.01)
n = 500, R² = 0.35
Questions:
- How do you interpret the coefficient 0.09?
- How much is expected wage growth for completing graduate education (2 years)?
- What does R² = 0.35 tell us?
Click to view answer
Coefficient interpretation: Each additional year of education increases wage by approximately 9% (exact: $100 \times (e^{0.09} - 1) \approx 9.4\%$)
Graduate education return:
wage_increase = (np.exp(0.09 * 2) - 1) * 100
print(f"{wage_increase:.1f}%")  # 19.7%
R² interpretation: The model explains 35% of the variation in log wage. This is reasonable for cross-sectional wage data, as wage is also affected by many other factors like ability, experience, industry, etc.
Section Summary
Key Points
| Content | Key Point |
|---|---|
| Model Form | $Y_i = \beta_0 + \beta_1 X_i + u_i$ |
| Estimation Method | OLS minimizes $\sum_{i=1}^{n}\hat{u}_i^2$ |
| Goodness of Fit | $R^2 = 1 - SSR/SST$, the share of variation in $Y$ explained by the model |
| Hypothesis Testing | $t = \hat{\beta}_1/se(\hat{\beta}_1)$; reject $H_0: \beta_1 = 0$ when $p < 0.05$ |
| Python Tool | statsmodels.api.OLS() |
Next Section Preview
In the next section, we will learn:
- Multiple Linear Regression
- Partial Regression Coefficients
- Omitted Variable Bias
- Multicollinearity
From One $X$ to Multiple $X$s: The Art of Controlling for Confounders
Further Reading
Classic Literature
Galton, F. (1886). "Regression towards Mediocrity in Hereditary Stature"
- Origin of the term "regression"
- Discovered "regression to the mean" phenomenon
Mincer, J. (1974). Schooling, Experience, and Earnings
- Foundational work in education economics
- Mincer wage equation
Recommended Textbooks
Wooldridge (2020): Introductory Econometrics, Chapter 2
- In-depth coverage of simple regression
- Numerous examples
Stock & Watson (2020): Introduction to Econometrics, Chapter 4
- Clear derivations
- Intuitive illustrations
Ready to enter the world of multiple regression? Let's continue!