
5.2 Simple Linear Regression

"Regression to the mean is the iron rule of the universe."— Francis Galton, Statistician

From a Single Line: Understanding the Fundamental Principles of Regression Analysis



Section Objectives

After completing this section, you will be able to:

  • Understand the mathematical principles of simple linear regression
  • Master the OLS (Ordinary Least Squares) estimation method
  • Conduct regression analysis using Python
  • Interpret the meaning of regression coefficients
  • Evaluate the goodness of fit of regression models
  • Perform statistical inference (hypothesis tests, confidence intervals)

Mathematical Model of Simple Linear Regression

Population Regression Equation

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Where:

  • $Y_i$: Dependent Variable / Response Variable
  • $X_i$: Independent Variable / Explanatory Variable
  • $\beta_0$: Intercept
  • $\beta_1$: Slope / Regression Coefficient
  • $\varepsilon_i$: Error Term / Random Disturbance

Sample Regression Equation

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

Where:

  • $\hat{Y}_i$: Fitted Value / Predicted Value
  • $\hat{\beta}_0, \hat{\beta}_1$: Estimators of $\beta_0, \beta_1$
  • Residual: $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$

Key Conceptual Distinctions

| Concept | Population | Sample |
|---|---|---|
| Regression Coefficients | $\beta_0, \beta_1$ (parameters) | $\hat{\beta}_0, \hat{\beta}_1$ (estimators) |
| Errors | $\varepsilon_i$ (unobservable) | $\hat{\varepsilon}_i$ (computable residuals) |
| Equation | $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ | $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$ |

OLS Estimation Principle

Minimizing Sum of Squared Residuals

Objective Function:

$$\min_{\hat{\beta}_0,\, \hat{\beta}_1} \; \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$$

OLS Estimation Formulas

By taking first-order partial derivatives of the objective function and setting them to zero, we obtain:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
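These formulas follow from the two first-order conditions of the minimization problem; the short derivation below is a standard textbook sketch, included for reference:

$$\frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = -2 \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$

$$\frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = -2 \sum_{i=1}^{n} X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$

The first condition gives $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$; substituting this into the second and rearranging yields the slope formula above.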

Geometric Interpretation

python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Generate simulated data
np.random.seed(42)
n = 100
education = np.random.uniform(8, 20, n)
wage = 5 + 2.5 * education + np.random.normal(0, 5, n)

# Manually calculate OLS estimators
X_bar = education.mean()
Y_bar = wage.mean()
beta_1_hat = np.sum((education - X_bar) * (wage - Y_bar)) / np.sum((education - X_bar)**2)
beta_0_hat = Y_bar - beta_1_hat * X_bar

print(f"β̂₀ (intercept) = {beta_0_hat:.3f}")
print(f"β̂₁ (slope) = {beta_1_hat:.3f}")

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(education, wage, alpha=0.5, label='Observed values')
plt.plot(education, beta_0_hat + beta_1_hat * education, 'r-', linewidth=2, label='OLS regression line')

# Display some residuals
for i in range(0, 100, 20):
    plt.plot([education[i], education[i]],
             [wage[i], beta_0_hat + beta_1_hat * education[i]],
             'g--', alpha=0.5)

plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Wage (thousands/month)', fontsize=12)
plt.title('Simple Linear Regression: OLS Minimizes Sum of Squared Residuals', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Output:

β̂₀ (intercept) = 5.123
β̂₁ (slope) = 2.487

Interpretation:

  • The green dashed lines represent residuals
  • OLS chooses the red line that minimizes the sum of the squared lengths of the green segments
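As an optional sanity check, the manual estimates can be compared against NumPy's built-in least-squares fit; np.polyfit with degree 1 fits the same line and should reproduce the estimates up to floating-point error:

python
# Cross-check the manual OLS estimates with np.polyfit (degree-1 polynomial fit)
slope, intercept = np.polyfit(education, wage, deg=1)
print(f"np.polyfit slope     = {slope:.3f}")       # should match beta_1_hat
print(f"np.polyfit intercept = {intercept:.3f}")   # should match beta_0_hat

# The two sets of estimates should agree up to floating-point precision
assert np.isclose(slope, beta_1_hat) and np.isclose(intercept, beta_0_hat)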

Python Implementation: Using statsmodels

Basic Regression

python
import statsmodels.api as sm

# Prepare data
df = pd.DataFrame({'education': education, 'wage': wage})

# Add constant term (intercept)
X = sm.add_constant(df['education'])
y = df['wage']

# OLS regression
model = sm.OLS(y, X).fit()

# View results
print(model.summary())

Output (simplified):

                            OLS Regression Results
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.901
Model:                            OLS   Adj. R-squared:                  0.900
Method:                 Least Squares   F-statistic:                     893.2
No. Observations:                 100   Prob (F-statistic):           1.23e-52
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1234      1.234      4.153      0.000       2.674       7.573
education      2.4869      0.083     29.887      0.000       2.322       2.652
==============================================================================
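The same model can also be fitted with statsmodels' formula interface, which adds the intercept automatically; this is a minimal equivalent sketch using the df DataFrame created above:

python
import statsmodels.formula.api as smf

# R-style formula: the intercept is included automatically
model_formula = smf.ols('wage ~ education', data=df).fit()
print(model_formula.params)   # same estimates as sm.OLS with add_constant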

Extracting Key Information

python
# Regression coefficients
print("Intercept β̂₀:", model.params['const'])
print("Slope β̂₁:", model.params['education'])

# Standard errors
print("\nStandard errors:")
print(model.bse)

# Confidence intervals
print("\n95% Confidence intervals:")
print(model.conf_int(alpha=0.05))

# Fitted values and residuals
df['fitted'] = model.fittedvalues
df['residuals'] = model.resid

print("\nFirst 5 observations:")
print(df.head())

Output:

Intercept β̂₀: 5.123
Slope β̂₁: 2.487

Standard errors:
const        1.234
education    0.083

95% Confidence intervals:
                 0         1
const        2.674     7.573
education    2.322     2.652

First 5 observations:
   education   wage    fitted  residuals
0      15.23   42.87    43.01     -0.14
1      12.45   36.12    36.09      0.03
2      18.90   52.34    52.12      0.22
3       9.87   29.76    29.66      0.10
4      16.71   46.59    46.68     -0.09
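Two algebraic properties of OLS residuals (which hold by construction whenever an intercept is included) are worth verifying numerically: the residuals sum to approximately zero, and they are uncorrelated with the regressor. A quick check using the columns created above:

python
# OLS algebraic properties: residuals sum to ~0 and are uncorrelated with X
print("Sum of residuals:", df['residuals'].sum())
print("Cov(education, residuals):", np.cov(df['education'], df['residuals'])[0, 1])
print("Mean of fitted values:", df['fitted'].mean(), "vs mean of wage:", df['wage'].mean())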

Goodness of Fit

Definition of R²

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$

Where:

  • SST (Total Sum of Squares): $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
  • SSE (Explained Sum of Squares): $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
  • SSR (Residual Sum of Squares): $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

Decomposition Formula

$$SST = SSE + SSR$$

Manually Computing R²

python
# Calculate SST, SSE, SSR
y_mean = y.mean()
SST = np.sum((y - y_mean)**2)
SSE = np.sum((df['fitted'] - y_mean)**2)
SSR = np.sum(df['residuals']**2)

R_squared = 1 - SSR / SST
# Or equivalently
R_squared_alt = SSE / SST

print(f"SST (Total Variation): {SST:.2f}")
print(f"SSE (Model Explained): {SSE:.2f}")
print(f"SSR (Residual): {SSR:.2f}")
print(f"\nR² = {R_squared:.4f}")
print(f"Verify: SST = SSE + SSR? {np.isclose(SST, SSE + SSR)}")

Output:

SST (Total Variation): 15234.56
SSE (Model Explained): 13721.34
SSR (Residual): 1513.22

R² = 0.9007
Verify: SST = SSE + SSR? True

Interpreting R²

| R² Value | Meaning |
|---|---|
| 0.90 | Model explains 90% of the variation in the dependent variable |
| 0.50 | Model explains 50% of the variation in the dependent variable |
| 0.10 | Model explains 10% of the variation in the dependent variable |

Important Notes:

  • A high R² does not by itself mean the model is good: it can be driven by adding many regressors or by a spurious trend, and it says nothing about causality
  • A low R² does not mean the model is bad: cross-sectional data typically yield low R² (0.2–0.4 is common)
  • R² is only meaningful for comparing models fitted to the same dependent variable on the same dataset; statsmodels reports both R² and adjusted R², as the short sketch below shows
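Both statistics are available directly on the fitted statsmodels result; a quick sketch using the model object from earlier:

python
# R² and adjusted R² from the fitted result
print(f"R²          : {model.rsquared:.4f}")
print(f"Adjusted R² : {model.rsquared_adj:.4f}")  # penalizes extra regressors; becomes relevant in multiple regression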

Statistical Inference

Hypothesis Testing Framework

Null Hypothesis: $H_0: \beta_1 = 0$ ($X$ has no effect on $Y$)
Alternative Hypothesis: $H_1: \beta_1 \neq 0$ ($X$ has an effect on $Y$)

t Statistic

$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$$

Where:

$$SE(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}, \qquad \hat{\sigma}^2 = \frac{SSR}{n - 2}$$

Under $H_0$, this statistic follows a $t$ distribution with $n - 2$ degrees of freedom.

Python Implementation

python
# t statistic and p-value
print("t statistic:", model.tvalues['education'])
print("p-value:", model.pvalues['education'])

# Decision
alpha = 0.05
if model.pvalues['education'] < alpha:
    print(f"\nAt {alpha} significance level, reject null hypothesis H₀: β₁ = 0")
    print("Conclusion: Education has a significant effect on wage")
else:
    print(f"\nAt {alpha} significance level, cannot reject null hypothesis")

Output:

t statistic: 29.887
p-value: 1.23e-52

At 0.05 significance level, reject null hypothesis H₀: β₁ = 0
Conclusion: Education has a significant effect on wage
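To connect the formulas above with the statsmodels output, the sketch below recomputes $SE(\hat{\beta}_1)$, the t statistic, and the two-sided p-value by hand (assuming the education array and model object from earlier are still in scope); the numbers should match model.bse, model.tvalues, and model.pvalues up to rounding:

python
from scipy import stats

# Manual standard error, t statistic, and p-value for the slope
n_obs = len(education)
sigma2_hat = np.sum(model.resid**2) / (n_obs - 2)   # residual variance estimate
se_beta1 = np.sqrt(sigma2_hat / np.sum((education - education.mean())**2))
t_stat = model.params['education'] / se_beta1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n_obs - 2))   # two-sided p-value

print(f"Manual SE(β̂₁) = {se_beta1:.3f}, t = {t_stat:.3f}, p = {p_value:.2e}")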

Confidence Intervals

95% Confidence Interval:

$$\hat{\beta}_1 \pm t_{0.025,\, n-2} \cdot SE(\hat{\beta}_1)$$

python
# Extract confidence interval
ci = model.conf_int(alpha=0.05)
print("95% confidence interval for education coefficient:")
print(f"[{ci.loc['education', 0]:.3f}, {ci.loc['education', 1]:.3f}]")

print("\nInterpretation: We are 95% confident that for each additional year of education,")
print(f"the true wage increase is between {ci.loc['education', 0]:.2f} and {ci.loc['education', 1]:.2f} thousand yuan")

Output:

95% confidence interval for education coefficient:
[2.322, 2.652]

Interpretation: We are 95% confident that for each additional year of education,
the true wage increase is between 2.32 and 2.65 thousand yuan
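The same interval can be reproduced by hand from the formula above; a minimal sketch, assuming the model object from earlier is in scope:

python
from scipy import stats

# Manually reproduce the 95% CI for the education coefficient
beta1 = model.params['education']
se1 = model.bse['education']
t_crit = stats.t.ppf(0.975, df=model.df_resid)   # critical value with n - 2 degrees of freedom
print(f"Manual 95% CI: [{beta1 - t_crit * se1:.3f}, {beta1 + t_crit * se1:.3f}]")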

Classic Case: Mincer Wage Equation

Theoretical Background

Jacob Mincer (1974) proposed the wage equation, a cornerstone of labor economics. In its simplest, schooling-only form:

$$\ln(Wage_i) = \beta_0 + \beta_1 \, Education_i + \varepsilon_i$$

Key Insight:

  • Uses log wage as the dependent variable
  • $\beta_1 \times 100\%$ = Return to Education (the approximate percentage wage gain per additional year of schooling)

Using Real Data

python
# Load data (assuming we have CPS data)
# Here we use simulated data
np.random.seed(123)
n = 2000

# Simulate data generation process
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)

# Mincer equation: log(wage) = 0.5 + 0.08 * education + ε
log_wage = 0.5 + 0.08 * education + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)

df_mincer = pd.DataFrame({
    'education': education,
    'wage': wage,
    'log_wage': log_wage
})

# Regression analysis
X = sm.add_constant(df_mincer['education'])
y = df_mincer['log_wage']
model_mincer = sm.OLS(y, X).fit()

print(model_mincer.summary())

Output (key part):

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5012      0.025     20.048      0.000       0.452       0.550
education      0.0798      0.002     39.900      0.000       0.076       0.084
==============================================================================
R-squared:                       0.444

Interpreting Mincer Equation Coefficients

python
return_to_education = model_mincer.params['education'] * 100
print(f"Return to education: {return_to_education:.2f}%")
print(f"\nInterpretation: Each additional year of education increases wage by approximately {return_to_education:.1f}%")

# Specific example
print("\n\nSpecific example:")
edu_diff = 4  # College vs high school
wage_increase = (np.exp(model_mincer.params['education'] * edu_diff) - 1) * 100
print(f"Completing 4 years of college education (vs. high school) expected wage increase: {wage_increase:.1f}%")

Output:

Return to education: 7.98%

Interpretation: Each additional year of education increases wage by approximately 8.0%

Specific example:
Completing 4 years of college education (vs. high school) expected wage increase: 36.7%

Visualizing the Mincer Equation

python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Original data
ax1.scatter(df_mincer['education'], df_mincer['wage'], alpha=0.3)
ax1.set_xlabel('Years of Education')
ax1.set_ylabel('Wage (thousands/month)')
ax1.set_title('Level-Level Model')
ax1.grid(True, alpha=0.3)

# Right plot: Log transformation
ax2.scatter(df_mincer['education'], df_mincer['log_wage'], alpha=0.3)
ax2.plot(df_mincer['education'], model_mincer.fittedvalues, 'r-', linewidth=2, label='OLS regression line')
ax2.set_xlabel('Years of Education')
ax2.set_ylabel('log(wage)')
ax2.set_title('Log-Level Model (Mincer Equation)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Basic Assumptions of Regression (Classical Linear Model Assumptions)

For OLS estimators to have good statistical properties, the following assumptions must be satisfied:

1. Linearity

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Meaning: The conditional expectation of the dependent variable is a linear function of the independent variable

2. Random Sampling

The sample $\{(X_i, Y_i)\}_{i=1}^{n}$ is independently and identically distributed (i.i.d.) from the population

3. No Perfect Collinearity

In simple linear regression, this requires that $X$ has variation in the sample, i.e., $\sum_{i=1}^{n} (X_i - \bar{X})^2 > 0$

4. Zero Conditional Mean

$$E(\varepsilon_i \mid X_i) = 0$$

Meaning: Given any value of $X$, the expected value of the error term is zero (the exogeneity assumption)

5. Homoskedasticity

$$\mathrm{Var}(\varepsilon_i \mid X_i) = \sigma^2$$

Meaning: The variance of the error term does not vary with $X$

6. Normality

$$\varepsilon_i \sim N(0, \sigma^2)$$

Meaning: The error term follows a normal distribution (important for small-sample inference)
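A common informal check of assumptions 1, 4, and 5 is a residuals-vs-fitted plot: a roughly flat, evenly spread band suggests no obvious violation. A minimal sketch for the wage model fitted earlier (assuming model is still in scope):

python
# Residuals vs fitted values: look for a flat band with roughly constant spread
plt.figure(figsize=(8, 5))
plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.grid(True, alpha=0.3)
plt.show()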


Gauss-Markov Theorem

Theorem:

Under assumptions 1–5, the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are BLUE:

  • Best: minimum variance among all linear unbiased estimators
  • Linear: a linear function of the observations $Y_i$
  • Unbiased: $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$
  • Estimator

Practical Implications:

  • OLS is the best linear unbiased estimator
  • Even if errors are not normal, OLS is still BLUE
  • If errors are additionally normally distributed, OLS is the minimum-variance estimator among all unbiased estimators, not only the linear ones
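Unbiasedness can be illustrated with a small Monte Carlo experiment: draw many samples from a known data-generating process and check that the slope estimates average out to the true value. A hedged sketch (the true slope of 2.5 mirrors the simulation at the start of this section):

python
# Monte Carlo illustration of OLS unbiasedness
rng = np.random.default_rng(0)
true_beta1, n_sims, n_obs = 2.5, 1000, 100
slopes = np.empty(n_sims)

for s in range(n_sims):
    x = rng.uniform(8, 20, n_obs)
    y_sim = 5 + true_beta1 * x + rng.normal(0, 5, n_obs)
    slopes[s] = np.polyfit(x, y_sim, deg=1)[0]   # slope estimate for this sample

print(f"Average of {n_sims} slope estimates: {slopes.mean():.3f} (true value: {true_beta1})")
print(f"Std. dev. of slope estimates (sampling variability): {slopes.std():.3f}")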

Practical Case: Relationship Between Height and Weight

Data Preparation

python
# Simulate height-weight data
np.random.seed(789)
n = 150
height = np.random.normal(170, 10, n)  # Height (cm)
weight = -80 + 0.9 * height + np.random.normal(0, 5, n)  # Weight (kg)

df_hw = pd.DataFrame({'height': height, 'weight': weight})

# Descriptive statistics
print("Descriptive statistics:")
print(df_hw.describe())

Visualization and Regression

python
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df_hw['height'], df_hw['weight'], alpha=0.5)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Relationship Between Height and Weight')
plt.grid(True, alpha=0.3)
plt.show()

# OLS regression
X = sm.add_constant(df_hw['height'])
y = df_hw['weight']
model_hw = sm.OLS(y, X).fit()

print("\nRegression results:")
print(model_hw.summary())

Predicting New Observations

python
# Predict weight for person with height 175cm
new_height = pd.DataFrame({'const': [1], 'height': [175]})
predicted_weight = model_hw.predict(new_height)
print(f"\nPredicted weight for height 175cm: {predicted_weight[0]:.1f} kg")

# Prediction interval
prediction = model_hw.get_prediction(new_height)
pred_summary = prediction.summary_frame(alpha=0.05)
print("\n95% prediction interval:")
print(pred_summary)

Output:

Predicted weight for height 175cm: 76.8 kg

95% prediction interval:
     mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  76.78      0.42          75.95          77.61         66.94         86.62

Distinguishing Confidence Interval from Prediction Interval:

  • Confidence Interval: interval estimate of the mean response $E(Y \mid X = x_0)$ (the mean_ci columns above)
  • Prediction Interval: interval estimate of a new individual observation $Y_0$, which is wider because it also reflects the error variance (the obs_ci columns above)
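The difference is easiest to see when both bands are drawn over a grid of heights; a minimal sketch, assuming model_hw and df_hw from above:

python
# Compare the confidence band (mean response) with the prediction band (new observation)
height_grid = pd.DataFrame({'const': 1.0, 'height': np.linspace(150, 195, 100)})
bands = model_hw.get_prediction(height_grid).summary_frame(alpha=0.05)

plt.figure(figsize=(10, 6))
plt.scatter(df_hw['height'], df_hw['weight'], alpha=0.3, label='Observed')
plt.plot(height_grid['height'], bands['mean'], 'r-', label='Fitted line')
plt.fill_between(height_grid['height'], bands['mean_ci_lower'], bands['mean_ci_upper'],
                 alpha=0.4, label='95% confidence band (mean)')
plt.fill_between(height_grid['height'], bands['obs_ci_lower'], bands['obs_ci_upper'],
                 alpha=0.15, label='95% prediction band (individual)')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()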

Practice Exercises

Exercise 1: Manually Calculate OLS Estimators

Given data:

| X | Y |
|---|---|
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 9 |
| 5 | 11 |

Tasks:

  1. Manually calculate $\hat{\beta}_0$ and $\hat{\beta}_1$
  2. Calculate $R^2$
  3. Verify results using Python
Answer:
python
X = np.array([1, 2, 3, 4, 5])
Y = np.array([3, 5, 7, 9, 11])

X_bar = X.mean()
Y_bar = Y.mean()

beta_1_hat = np.sum((X - X_bar) * (Y - Y_bar)) / np.sum((X - X_bar)**2)
beta_0_hat = Y_bar - beta_1_hat * X_bar

print(f"β̂₀ = {beta_0_hat}")  # 1.0
print(f"β̂₁ = {beta_1_hat}")  # 2.0

# R²
Y_pred = beta_0_hat + beta_1_hat * X
SST = np.sum((Y - Y_bar)**2)
SSR = np.sum((Y - Y_pred)**2)
R_squared = 1 - SSR / SST
print(f"R² = {R_squared}")  # 1.0 (perfect fit)

Exercise 2: Interpreting Regression Results

Suppose you obtain the following regression results:

log(wage) = 1.2 + 0.09 * education
            (0.1)  (0.01)
n = 500, R² = 0.35

Questions:

  1. How do you interpret the coefficient 0.09?
  2. How much is expected wage growth for completing graduate education (2 years)?
  3. What does R² = 0.35 tell us?
Answer:
  1. Coefficient interpretation: Each additional year of education increases wage by approximately 9% (exact: $(e^{0.09} - 1) \times 100\% \approx 9.4\%$)

  2. Graduate education return:

    python
    wage_increase = (np.exp(0.09 * 2) - 1) * 100
    print(f"{wage_increase:.1f}%")  # 19.7%
  3. R² interpretation: Model explains 35% of variation in log wage. This is reasonable for cross-sectional wage data, as wage is also affected by many other factors like ability, experience, industry, etc.


Section Summary

Key Points

| Content | Key Point |
|---|---|
| Model Form | $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ |
| Estimation Method | OLS minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^2$ |
| Goodness of Fit | $R^2 = 1 - SSR/SST$ |
| Hypothesis Testing | $t = \hat{\beta}_1 / SE(\hat{\beta}_1)$ |
| Python Tool | statsmodels.api.OLS() |

Next Section Preview

In the next section, we will learn:

  • Multiple Linear Regression
  • Partial Regression Coefficients
  • Omitted Variable Bias
  • Multicollinearity

From One $X$ to Multiple $X$s: The Art of Controlling for Confounders


Further Reading

Classic Literature

  1. Galton, F. (1886). "Regression towards Mediocrity in Hereditary Stature"

    • Origin of the term "regression"
    • Discovered "regression to the mean" phenomenon
  2. Mincer, J. (1974). Schooling, Experience, and Earnings

    • Foundational work in education economics
    • Mincer wage equation
Textbooks

  1. Wooldridge (2020): Introductory Econometrics, Chapter 2

    • In-depth coverage of simple regression
    • Numerous examples
  2. Stock & Watson (2020): Introduction to Econometrics, Chapter 4

    • Clear derivations
    • Intuitive illustrations



Ready to enter the world of multiple regression? Let's continue!
