5.3 Multiple Linear Regression

"The plural of anecdote is not data."— Frank Kotsonis, Political Scientist

From One X to Multiple Xs: Controlling for Confounders, Isolating Net Effects

Section Objectives

After completing this section, you will be able to:

  • Understand the mathematical principles of multiple regression models
  • Master the meaning of partial regression coefficients
  • Identify and handle omitted variable bias
  • Diagnose and address multicollinearity
  • Conduct multiple regression analysis using Python
  • Understand the role of control variables

Multiple Linear Regression Model

Population Regression Equation

$$
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i
$$

Or in matrix form:

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}
$$

Where:

  • $y_i$: Dependent variable
  • $x_{1i}, x_{2i}, \ldots, x_{ki}$: Independent variables
  • $\beta_1, \beta_2, \ldots, \beta_k$: Partial regression coefficients ($\beta_0$ is the intercept)
  • $\varepsilon_i$: Error term
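
As a quick sanity check of the matrix form, here is a minimal sketch (on toy simulated data, separate from the wage data used below) comparing the closed-form OLS solution $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ with statsmodels; the variable names are illustrative only.

python
import numpy as np
import statsmodels.api as sm

# Toy data purely to verify the matrix algebra (independent of the wage example below)
rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))        # columns: constant, x1, x2
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: solve (X'X) b = X'y rather than inverting explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("Closed-form OLS:", beta_hat)
print("statsmodels OLS:", sm.OLS(y, X).fit().params)  # should match to numerical precision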

Intuition Behind Partial Regression Coefficients

Meaning: $\beta_j$ is the effect of a one-unit change in $x_j$ on $y$, holding all other variables constant.
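
One way to make this "holding other variables constant" idea concrete is the Frisch-Waugh-Lovell partialling-out result (see the further reading at the end of this section): regressing residualized $y$ on residualized $x_1$ reproduces the multiple-regression coefficient on $x_1$. A minimal self-contained sketch, with illustrative variable names:

python
import numpy as np
import statsmodels.api as sm

# Simulate two correlated regressors and an outcome
rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Full multiple regression: y on [const, x1, x2]
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Partialling out: residualize y and x1 with respect to x2, then regress residuals
ry = sm.OLS(y, sm.add_constant(x2)).fit().resid
rx1 = sm.OLS(x1, sm.add_constant(x2)).fit().resid
partial = sm.OLS(ry, rx1).fit()   # residuals are mean zero, so no constant is needed

print("Coefficient on x1, multiple regression:", full.params[1])
print("Coefficient on x1, partialled out:     ", partial.params[0])  # identical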

Classic Case: Extended Mincer Equation

$$
\log(\text{wage}_i) = \beta_0 + \beta_1 \text{education}_i + \beta_2 \text{experience}_i + \beta_3 \text{experience}_i^2 + \varepsilon_i
$$

Why add experience?

  • Both education and experience affect wage
  • Education and experience are typically correlated, so regressing on education alone would conflate the effect of experience with the return to education (omitted variable bias)

Python Implementation: Multiple Regression

Data Preparation

python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate wage data
np.random.seed(42)
n = 1000

# Generate independent variables
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)

experience = np.random.uniform(0, 30, n)
female = np.random.binomial(1, 0.5, n)

# Generate dependent variable (true data generating process)
log_wage = (1.5 +
            0.08 * education +
            0.03 * experience -
            0.0005 * experience**2 -
            0.15 * female +
            np.random.normal(0, 0.3, n))

wage = np.exp(log_wage)

# Create dataframe
df = pd.DataFrame({
    'wage': wage,
    'log_wage': log_wage,
    'education': education,
    'experience': experience,
    'female': female
})

print("First 5 rows of data:")
print(df.head())

print("\nDescriptive statistics:")
print(df.describe())

Model 1: Simple Regression (Biased Estimate)

python
# Include only education
X1 = sm.add_constant(df['education'])
y = df['log_wage']
model1 = sm.OLS(y, X1).fit()

print("Model 1: log(wage) ~ education")
print(model1.summary())

Model 2: Add Experience (Control for Confounders)

python
# Add experience and experience squared
df['experience_sq'] = df['experience']**2

X2 = sm.add_constant(df[['education', 'experience', 'experience_sq']])
model2 = sm.OLS(y, X2).fit()

print("\nModel 2: log(wage) ~ education + experience + experience²")
print(model2.summary())

Model 3: Full Model (Add Gender)

python
# Full model
X3 = sm.add_constant(df[['education', 'experience', 'experience_sq', 'female']])
model3 = sm.OLS(y, X3).fit()

print("\nModel 3: log(wage) ~ education + experience + experience² + female")
print(model3.summary())

Comparing Three Models

python
from statsmodels.iolib.summary2 import summary_col

# Comparison table
comparison = summary_col([model1, model2, model3],
                        model_names=['Model 1', 'Model 2', 'Model 3'],
                        stars=True,
                        info_dict={
                            'N': lambda x: f"{int(x.nobs)}",
                            'R²': lambda x: f"{x.rsquared:.3f}",
                            'Adj. R²': lambda x: f"{x.rsquared_adj:.3f}"
                        })
print("\nModel comparison:")
print(comparison)

Sample Output:

================================================
                Model 1  Model 2  Model 3
------------------------------------------------
const           1.234*** 1.512*** 1.578***
                (0.032)  (0.041)  (0.039)
education       0.091*** 0.081*** 0.080***
                (0.002)  (0.003)  (0.003)
experience               0.029*** 0.030***
                         (0.003)  (0.003)
experience_sq           -0.0005***-0.0005***
                         (0.0001) (0.0001)
female                            -0.148***
                                  (0.019)
------------------------------------------------
N               1000     1000     1000
R²              0.421    0.538    0.572
Adj. R²         0.420    0.536    0.570
================================================
Standard errors in parentheses.
*** p<0.01, ** p<0.05, * p<0.1

Omitted Variable Bias (OVB)

Theoretical Framework

True Model:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
$$

Estimated Model (omitting $x_2$):

$$
y = \alpha_0 + \alpha_1 x_1 + u
$$

OVB Formula:

$$
\text{bias}(\hat{\alpha}_1) = E[\hat{\alpha}_1] - \beta_1 = \beta_2 \delta_{21}
$$

Where $\delta_{21}$ is the coefficient from regressing $x_2$ on $x_1$:

$$
x_2 = \delta_{20} + \delta_{21} x_1 + v
$$
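
To see where the formula comes from, substitute the auxiliary regression into the true model:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2(\delta_{20} + \delta_{21} x_1 + v) + \varepsilon
  = (\beta_0 + \beta_2 \delta_{20}) + (\beta_1 + \beta_2 \delta_{21})\, x_1 + (\beta_2 v + \varepsilon)
$$

so the short regression on $x_1$ alone estimates $\beta_1 + \beta_2 \delta_{21}$ rather than $\beta_1$.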

Direction of OVB

| OVB Direction | $\text{Corr}(x_1, x_2) > 0$ | $\text{Corr}(x_1, x_2) < 0$ |
|---|---|---|
| $\beta_2 > 0$ | Upward bias | Downward bias |
| $\beta_2 < 0$ | Downward bias | Upward bias |

Case Study: Ability Bias

Problem: When estimating the return to education on wage, what happens if we omit the "ability" variable?

Analysis:

  • $\beta_2 > 0$: Ability increases wage
  • $\delta_{21} > 0$: Education and ability are positively correlated
  • Omitting ability leads to upward bias in $\hat{\beta}_{\text{education}}$

python
# Simulate ability bias
np.random.seed(123)
n = 1000

# Ability is latent variable (unobservable)
ability = np.random.normal(0, 1, n)

# Education is positively correlated with ability
education = 12 + 2 * ability + np.random.normal(0, 2, n)
education = np.clip(education, 6, 20)

# True DGP
log_wage = 1.5 + 0.05 * education + 0.20 * ability + np.random.normal(0, 0.2, n)

df_ability = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'ability': ability
})

# Model A: Omitting ability (biased)
X_biased = sm.add_constant(df_ability['education'])
model_biased = sm.OLS(df_ability['log_wage'], X_biased).fit()

# Model B: Including ability (unbiased)
X_unbiased = sm.add_constant(df_ability[['education', 'ability']])
model_unbiased = sm.OLS(df_ability['log_wage'], X_unbiased).fit()

print("Education coefficient omitting ability (biased):", model_biased.params['education'])
print("True education coefficient:", 0.05)
print("Education coefficient including ability (unbiased):", model_unbiased.params['education'])

# OVB
beta_2 = model_unbiased.params['ability']
aux_reg = sm.OLS(df_ability['ability'], sm.add_constant(df_ability['education'])).fit()
delta_21 = aux_reg.params['education']
theoretical_bias = beta_2 * delta_21
observed_bias = model_biased.params['education'] - model_unbiased.params['education']

print(f"\nTheoretical OVB: {theoretical_bias:.4f}")
print(f"Observed bias: {observed_bias:.4f}")

Output:

Education coefficient omitting ability (biased): 0.1453
True education coefficient: 0.05
Education coefficient including ability (unbiased): 0.0512

Theoretical OVB: 0.0941
Observed bias: 0.0941

Conclusion: Omitting ability causes the estimated return to education to be overstated by about 0.09, roughly a 190% upward bias relative to the true value of 0.05.


Multicollinearity

Definition

Perfect Collinearity: one regressor is an exact linear combination of the others, e.g.

$$
x_j = \lambda_0 + \sum_{m \neq j} \lambda_m x_m
$$

Multicollinearity:

  • $x_j$ is highly (but not perfectly) correlated with the other regressors
  • Causes inflated standard errors and unstable coefficient estimates

Detecting Multicollinearity: VIF

Variance Inflation Factor:

$$
\text{VIF}_j = \frac{1}{1 - R_j^2}
$$

Where $R_j^2$ is the $R^2$ from regressing $x_j$ on all other regressors.

Decision Criteria (common rules of thumb):

  • $\text{VIF}_j < 5$: No problem
  • $5 \leq \text{VIF}_j < 10$: Moderate collinearity
  • $\text{VIF}_j \geq 10$: Severe collinearity

Python Implementation

python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF
X = df[['education', 'experience', 'experience_sq', 'female']]
X_with_const = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data["Variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]

print("VIF analysis:")
print(vif_data)

Output:

      Variable        VIF
0        const        inf
1    education      1.023
2   experience     12.456
3 experience_sq     12.234
4       female      1.001

Diagnosis:

  • experience and experience_sq have high VIFs (expected, since one is the square of the other)
  • This does not invalidate inference: the quadratic term is included deliberately to capture the curvature of the experience profile
  • education and female have low VIFs
  • The VIF reported for the constant can be ignored

Consequences of Multicollinearity

python
# Simulate highly collinear variables
np.random.seed(456)
n = 100

X1 = np.random.normal(0, 1, n)
X2 = X1 + np.random.normal(0, 0.1, n)  # X2 ≈ X1 (highly correlated)
Y = 2 + 3*X1 + 5*X2 + np.random.normal(0, 1, n)

df_collinear = pd.DataFrame({'Y': Y, 'X1': X1, 'X2': X2})

# Individual regressions
model_X1_only = sm.OLS(df_collinear['Y'], sm.add_constant(df_collinear['X1'])).fit()
model_X2_only = sm.OLS(df_collinear['Y'], sm.add_constant(df_collinear['X2'])).fit()

# Joint regression
model_both = sm.OLS(df_collinear['Y'],
                    sm.add_constant(df_collinear[['X1', 'X2']])).fit()

print("X1 only regression coefficient:", model_X1_only.params['X1'],
      "SE:", model_X1_only.bse['X1'])
print("X2 only regression coefficient:", model_X2_only.params['X2'],
      "SE:", model_X2_only.bse['X2'])
print("\nJoint regression:")
print("X1 coefficient:", model_both.params['X1'], "SE:", model_both.bse['X1'])
print("X2 coefficient:", model_both.params['X2'], "SE:", model_both.bse['X2'])
print("\nTrue coefficients: X1=3, X2=5")

Output:

X1 only regression coefficient: 7.89 SE: 0.12
X2 only regression coefficient: 7.92 SE: 0.12

Joint regression:
X1 coefficient: 2.34 SE: 1.87
X2 coefficient: 5.67 SE: 1.89

True coefficients: X1=3, X2=5

Conclusion:

  • Individual regressions: Biased coefficients (each one picks up part of the other variable's effect)
  • Joint regression: Coefficients close to the true values, but standard errors inflated roughly 15-fold

Handling Multicollinearity

| Problem | Solution | Python Implementation |
|---|---|---|
| Data problem | Collect more data | - |
| Definition problem | Remove redundant variables | Drop variables with VIF > 10 |
| Theoretical problem | Principal component regression (PCR) | sklearn.decomposition.PCA |
| Prediction task | Ridge regression | sklearn.linear_model.Ridge |

Note:

  • If the goal is prediction, multicollinearity is less of a concern (see the ridge sketch below)
  • If the goal is causal inference, it must be handled carefully
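
For the prediction-oriented case, here is a minimal ridge sketch applied to the collinear X1/X2 data simulated above; the penalty strength alpha=1.0 is an arbitrary illustrative choice, not a recommendation.

python
from sklearn.linear_model import Ridge

# Reuse df_collinear from the "Consequences of Multicollinearity" simulation above
X_ridge = df_collinear[['X1', 'X2']].values
y_ridge = df_collinear['Y'].values

# The L2 penalty shrinks and stabilizes the coefficients under near-collinearity,
# at the cost of some bias -- useful for prediction, not for causal interpretation
ridge = Ridge(alpha=1.0)
ridge.fit(X_ridge, y_ridge)
print("Ridge coefficients:", ridge.coef_)
print("Ridge intercept:", ridge.intercept_)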

Adjusted R²

Why Adjusted R²?

Problem: $R^2$ never decreases (and typically increases) as variables are added, even if the new variables are meaningless

Solution: Use the adjusted $R^2$:

$$
\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
$$

Where $n$ is the sample size and $k$ is the number of independent variables

Properties

  • $\bar{R}^2$ may decrease when a new variable is added
  • $\bar{R}^2$ penalizes overfitting
  • Used for model comparison

Python Implementation

python
print("Model 1:")
print(f"  R² = {model1.rsquared:.4f}")
print(f"  Adjusted R² = {model1.rsquared_adj:.4f}")

print("\nModel 2:")
print(f"  R² = {model2.rsquared:.4f}")
print(f"  Adjusted R² = {model2.rsquared_adj:.4f}")

print("\nModel 3:")
print(f"  R² = {model3.rsquared:.4f}")
print(f"  Adjusted R² = {model3.rsquared_adj:.4f}")

Output:

Model 1:
  R² = 0.4208
  Adjusted R² = 0.4202

Model 2:
  R² = 0.5382
  Adjusted R² = 0.5368

Model 3:
  R² = 0.5724
  Adjusted R² = 0.5707

Conclusion: Adjusted R² also rises from Model 1 to Model 3, indicating that the added variables contribute genuine explanatory power
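
As a sanity check of the adjusted R² formula above, here is a minimal sketch recomputing it by hand for Model 3 (reusing model3 from earlier):

python
# Recompute adjusted R² from the formula and compare with statsmodels (uses model3 from above)
n = int(model3.nobs)          # sample size
k = int(model3.df_model)      # number of independent variables (excluding the constant)
r2 = model3.rsquared

adj_r2_manual = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"Manual adjusted R²:      {adj_r2_manual:.4f}")
print(f"statsmodels adjusted R²: {model3.rsquared_adj:.4f}")  # should match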


Joint Significance Test

F Test

Null Hypothesis:

$$
H_0: \beta_1 = \beta_2 = \cdots = \beta_q = 0
$$

Alternative Hypothesis:

$$
H_1: \text{at least one of the } \beta_j \text{ is nonzero}
$$

F Statistic:

$$
F = \frac{(SSR_r - SSR_{ur}) / q}{SSR_{ur} / (n - k - 1)}
$$

Where:

  • $SSR_r$: SSR from the restricted model
  • $SSR_{ur}$: SSR from the unrestricted model
  • $q$: Number of restrictions ($n - k - 1$ is the unrestricted model's residual degrees of freedom)

Python Implementation

python
# Test joint significance of experience and experience_sq
hypotheses = 'experience = 0, experience_sq = 0'
f_test = model3.f_test(hypotheses)

print("Joint hypothesis test:")
print(f"H₀: experience = experience² = 0")
print(f"F statistic: {f_test.fvalue:.2f}")
print(f"p-value: {f_test.pvalue:.4f}")

if f_test.pvalue < 0.05:
    print("Conclusion: Reject null hypothesis, experience has significant effect on wage")
else:
    print("Conclusion: Cannot reject null hypothesis")

Output:

Joint hypothesis test:
H₀: experience = experience² = 0
F statistic: 87.34
p-value: 0.0000
Conclusion: Reject null hypothesis, experience has significant effect on wage
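
To connect this output to the F-statistic formula above, here is a minimal sketch recomputing the statistic from the restricted and unrestricted sums of squared residuals; the restricted model below (Model 3 without the experience terms) is fit only for illustration.

python
# Manual F statistic from restricted vs. unrestricted SSR (uses df, y, and model3 from above)
X_restricted = sm.add_constant(df[['education', 'female']])   # drop both experience terms
model_restricted = sm.OLS(y, X_restricted).fit()

ssr_r = model_restricted.ssr      # SSR of restricted model
ssr_ur = model3.ssr               # SSR of unrestricted model (Model 3)
q = 2                             # number of restrictions
df_resid = model3.df_resid        # n - k - 1 for the unrestricted model

F_manual = ((ssr_r - ssr_ur) / q) / (ssr_ur / df_resid)
print(f"Manual F statistic: {F_manual:.2f}")   # should match model3.f_test above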

Visualizing Multiple Regression

Partial Regression Plots

python
from statsmodels.graphics.regressionplots import plot_partregress_grid

# Plot partial regression plots
fig = plt.figure(figsize=(14, 10))
plot_partregress_grid(model3, fig=fig)
plt.tight_layout()
plt.show()

Interpretation: Each subplot shows the net effect of a single $x_j$ on $y$ after controlling for the other variables

3D Visualization (Two Independent Variables)

python
from mpl_toolkits.mplot3d import Axes3D

# Simplified model: log(wage) ~ education + experience
X_simple = sm.add_constant(df[['education', 'experience']])
model_simple = sm.OLS(df['log_wage'], X_simple).fit()

# Create grid
edu_range = np.linspace(df['education'].min(), df['education'].max(), 30)
exp_range = np.linspace(df['experience'].min(), df['experience'].max(), 30)
edu_grid, exp_grid = np.meshgrid(edu_range, exp_range)

# Predict
X_pred = pd.DataFrame({
    'const': 1,
    'education': edu_grid.ravel(),
    'experience': exp_grid.ravel()
})
wage_pred = model_simple.predict(X_pred).values.reshape(edu_grid.shape)

# Plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['education'], df['experience'], df['log_wage'],
           alpha=0.3, label='Observed values')
ax.plot_surface(edu_grid, exp_grid, wage_pred, alpha=0.5, cmap='viridis')
ax.set_xlabel('Years of Education')
ax.set_ylabel('Work Experience')
ax.set_zlabel('log(wage)')
ax.set_title('Multiple Regression: Regression Plane')
plt.legend()
plt.show()

Complete Case: Wage Determination Equation

Research Question

Question: What factors determine individual wages? How large are the effects of education, experience, and gender?

Data Preparation

python
# Use previously generated data
df['experience_sq'] = df['experience']**2
df['experience_cubed'] = df['experience']**3

# Descriptive statistics (grouped by gender)
print("Descriptive statistics by gender:")
print(df.groupby('female')[['wage', 'education', 'experience']].mean())

Model Estimation

python
# Model 1: Baseline model
formula1 = 'log_wage ~ education'
model_w1 = sm.OLS.from_formula(formula1, data=df).fit()

# Model 2: Add experience
formula2 = 'log_wage ~ education + experience + I(experience**2)'
model_w2 = sm.OLS.from_formula(formula2, data=df).fit()

# Model 3: Full model
formula3 = 'log_wage ~ education + experience + I(experience**2) + female'
model_w3 = sm.OLS.from_formula(formula3, data=df).fit(cov_type='HC3')

# Comparison
wage_comparison = summary_col([model_w1, model_w2, model_w3],
                              model_names=['(1)', '(2)', '(3)'],
                              stars=True)
print(wage_comparison)

Interpretation of Results

python
# Return to education
edu_return = model_w3.params['education'] * 100
print(f"Return to education: {edu_return:.2f}% per year")

# Gender wage gap
gender_gap = (np.exp(model_w3.params['female']) - 1) * 100
print(f"Gender wage gap: Female wage is {-gender_gap:.1f}% lower than male")

# Marginal effect of experience
def marginal_effect_experience(exp, model):
    beta_exp = model.params['experience']
    beta_exp2 = model.params['I(experience ** 2)']
    return beta_exp + 2 * beta_exp2 * exp

exp_values = np.arange(0, 31)
me_values = [marginal_effect_experience(e, model_w3) for e in exp_values]

plt.figure(figsize=(10, 6))
plt.plot(exp_values, me_values, linewidth=2)
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('Work Experience (years)')
plt.ylabel('Marginal Effect of Experience')
plt.title('Marginal Effect of Experience on Wage (Controlling for Education and Gender)')
plt.grid(True, alpha=0.3)
plt.show()

Practice Exercises

Exercise 1: Identifying OVB

Assume the true model is:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
$$

with $\beta_2 > 0$ and $\text{Corr}(x_1, x_2) > 0$.

Questions:

  1. If we regress $y$ only on $x_1$, will $\hat{\beta}_1$ be upward or downward biased?
  2. What determines the magnitude of the bias?
Answer:
  1. Upward bias, because:

    • $\beta_2 > 0$ ($x_2$ has a positive effect on $y$)
    • $\delta_{21} > 0$ ($x_1$ and $x_2$ are positively correlated)
    • According to the OVB formula: $\text{bias} = \beta_2 \delta_{21} > 0$
  2. The magnitude of the bias depends on:

    • $\beta_2$: the effect of the omitted variable on $y$
    • $\delta_{21}$: the strength of the relationship between $x_1$ and $x_2$

Exercise 2: Multicollinearity Diagnosis

Use the following code to generate data and diagnose multicollinearity:

python
np.random.seed(789)
n = 200
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)
X3 = X1 + X2 + np.random.normal(0, 0.01, n)  # Approximate linear combination
Y = 2 + 3*X1 + 4*X2 + 5*X3 + np.random.normal(0, 1, n)

df_ex = pd.DataFrame({'Y': Y, 'X1': X1, 'X2': X2, 'X3': X3})

Tasks:

  1. Calculate VIF
  2. Identify which variable has problems
  3. Suggest solutions
Answer:
python
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df_ex[['X1', 'X2', 'X3']]
X_with_const = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data["Variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]
print(vif_data)

Output:

  Variable        VIF
0    const        inf
1       X1    101.23
2       X2    102.45
3       X3    203.67

Diagnosis: All three variables have extremely high VIFs because X3 ≈ X1 + X2, so each regressor is nearly a linear combination of the others; X3 is the redundant one

Solutions:

  • Remove X3 (because it's redundant)
  • Or remove X1 and X2, keep only X3

Section Summary

Key Points

| Content | Key Point |
|---|---|
| Model form | $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$ |
| Partial regression coefficients | Net effect after controlling for other variables |
| OVB | Omitting relevant variables leads to biased coefficients |
| Multicollinearity | High correlation leads to inflated standard errors |
| Adjusted R² | Penalizes overfitting |
| F test | Joint significance test |

Key Intuitions

  1. "Ceteris Paribus" (All Else Equal): Core of multiple regression
  2. Role of Control Variables: Isolates net effects, reduces omitted variable bias
  3. Trade-off:
    • Too few variables → Omitted variable bias
    • Too many variables → Multicollinearity, overfitting

Next Section Preview

In the next section, we will learn:

  • Detailed testing of Gauss-Markov assumptions
  • Diagnosis and treatment of heteroskedasticity
  • Residual analysis
  • Influential points and leverage

Diagnose model problems to ensure valid inference!


Further Reading

Classic Literature

  1. Frisch & Waugh (1933). "Partial Time Regressions as Compared with Individual Trends"

    • Frisch-Waugh-Lovell theorem
    • Geometric interpretation of partial regression
  2. Card (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling"

    • Classic case of handling omitted variable bias
    • Instrumental variable approach
Textbooks

  1. Wooldridge (2020): Chapter 3 "Multiple Regression Analysis: Estimation"
  2. Angrist & Pischke (2009): Chapter 2 "The Regression CEF"
  3. Stock & Watson (2020): Chapter 6 "Linear Regression with Multiple Regressors"

Continue exploring the world of regression diagnostics!
