5.3 Multiple Linear Regression
"The plural of anecdote is not data."— Frank Kotsonis, Political Scientist
From One X to Multiple Xs: Controlling for Confounders, Isolating Net Effects
Section Objectives
After completing this section, you will be able to:
- Understand the mathematical principles of multiple regression models
- Master the meaning of partial regression coefficients
- Identify and handle omitted variable bias
- Diagnose and address multicollinearity
- Conduct multiple regression analysis using Python
- Understand the role of control variables
Multiple Linear Regression Model
Population Regression Equation

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

Or in matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Where:
- $Y_i$: Dependent variable
- $X_{1i}, \ldots, X_{ki}$: Independent variables
- $\beta_1, \ldots, \beta_k$: Partial regression coefficients
- $\varepsilon_i$: Error term
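The matrix form maps directly to the closed-form OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. As a minimal sketch (on illustrative simulated data, with variable names not tied to this section's dataset), the code below computes this estimator by hand and checks it against statsmodels:

import numpy as np
import statsmodels.api as sm

# Minimal sketch: OLS in matrix form on illustrative simulated data
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^(-1) X'y via a linear solve

print("Manual OLS:      ", beta_hat)
print("statsmodels OLS: ", sm.OLS(y, X).fit().params)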
Intuition Behind Partial Regression Coefficients
Meaning: $\beta_j$ is the effect of a one-unit change in $X_j$ on $Y$, holding all other variables constant:

$$\beta_j = \frac{\partial Y}{\partial X_j}$$
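One way to see "holding all other variables constant" in action is the Frisch-Waugh-Lovell partialling-out result: the multiple-regression coefficient on $X_j$ equals the coefficient from regressing the $Y$ residuals on the $X_j$ residuals, after both have been purged of the other regressors. Below is a minimal sketch on illustrative simulated data (the names x1, x2, y are hypothetical):

import numpy as np
import statsmodels.api as sm

# Sketch: Frisch-Waugh-Lovell partialling out on illustrative simulated data
rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)             # x1 correlated with x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Full multiple regression
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Partial x2 out of both y and x1, then regress residual on residual
r_y = sm.OLS(y, sm.add_constant(x2)).fit().resid
r_x1 = sm.OLS(x1, sm.add_constant(x2)).fit().resid
partial = sm.OLS(r_y, r_x1).fit()

print("Multiple-regression coefficient on x1:", full.params[1])
print("Residual-on-residual coefficient:     ", partial.params[0])   # numerically identical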
Classic Case: Extended Mincer Equation

$$\log(wage_i) = \beta_0 + \beta_1\,educ_i + \beta_2\,exper_i + \beta_3\,exper_i^2 + \varepsilon_i$$

Why add experience?
- Both education and experience affect wage
- If we regressed log wage on education alone, the education coefficient would also pick up the effect of experience (omitted variable bias)
Python Implementation: Multiple Regression
Data Preparation
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# Simulate wage data
np.random.seed(42)
n = 1000
# Generate independent variables
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
experience = np.random.uniform(0, 30, n)
female = np.random.binomial(1, 0.5, n)
# Generate dependent variable (true data generating process)
log_wage = (1.5 +
0.08 * education +
0.03 * experience -
0.0005 * experience**2 -
0.15 * female +
np.random.normal(0, 0.3, n))
wage = np.exp(log_wage)
# Create dataframe
df = pd.DataFrame({
'wage': wage,
'log_wage': log_wage,
'education': education,
'experience': experience,
'female': female
})
print("First 5 rows of data:")
print(df.head())
print("\nDescriptive statistics:")
print(df.describe())

Model 1: Simple Regression (Biased Estimate)
# Include only education
X1 = sm.add_constant(df['education'])
y = df['log_wage']
model1 = sm.OLS(y, X1).fit()
print("Model 1: log(wage) ~ education")
print(model1.summary())

Model 2: Add Experience (Control for Confounders)
# Add experience and experience squared
df['experience_sq'] = df['experience']**2
X2 = sm.add_constant(df[['education', 'experience', 'experience_sq']])
model2 = sm.OLS(y, X2).fit()
print("\nModel 2: log(wage) ~ education + experience + experience²")
print(model2.summary())

Model 3: Full Model (Add Gender)
# Full model
X3 = sm.add_constant(df[['education', 'experience', 'experience_sq', 'female']])
model3 = sm.OLS(y, X3).fit()
print("\nModel 3: log(wage) ~ education + experience + experience² + female")
print(model3.summary())

Comparing Three Models
from statsmodels.iolib.summary2 import summary_col
# Comparison table
comparison = summary_col([model1, model2, model3],
model_names=['Model 1', 'Model 2', 'Model 3'],
stars=True,
info_dict={
'N': lambda x: f"{int(x.nobs)}",
'R²': lambda x: f"{x.rsquared:.3f}",
'Adj. R²': lambda x: f"{x.rsquared_adj:.3f}"
})
print("\nModel comparison:")
print(comparison)

Sample Output:
================================================
Model 1 Model 2 Model 3
------------------------------------------------
const 1.234*** 1.512*** 1.578***
(0.032) (0.041) (0.039)
education 0.091*** 0.081*** 0.080***
(0.002) (0.003) (0.003)
experience 0.029*** 0.030***
(0.003) (0.003)
experience_sq -0.0005*** -0.0005***
(0.0001) (0.0001)
female -0.148***
(0.019)
------------------------------------------------
N 1000 1000 1000
R² 0.421 0.538 0.572
Adj. R² 0.420 0.536 0.570
================================================
Standard errors in parentheses.
*** p<0.01, ** p<0.05, * p<0.1

Omitted Variable Bias (OVB)
Theoretical Framework

True Model:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

Estimated Model (omitting $X_2$):

$$Y_i = \alpha_0 + \alpha_1 X_{1i} + u_i$$

OVB Formula:

$$E[\hat{\alpha}_1] = \beta_1 + \underbrace{\beta_2 \delta_{21}}_{\text{OVB}}$$

Where $\delta_{21}$ is the coefficient from regressing $X_2$ on $X_1$:

$$X_{2i} = \delta_{20} + \delta_{21} X_{1i} + v_i$$

Direction of OVB

| $\beta_2$ | $\operatorname{Corr}(X_1, X_2)$ | OVB Direction |
|---|---|---|
| $> 0$ | $> 0$ | Upward bias |
| $> 0$ | $< 0$ | Downward bias |
| $< 0$ | $> 0$ | Downward bias |
| $< 0$ | $< 0$ | Upward bias |
Case Study: Ability Bias
Problem: When estimating the return to education on wage, what happens if we omit the "ability" variable?
Analysis:
- $\beta_2 > 0$: Ability increases wage
- $\delta_{21} > 0$: Education and ability are positively correlated
- $\text{OVB} = \beta_2 \delta_{21} > 0$: Omitting ability leads to upward bias in $\hat{\beta}_1$
# Simulate ability bias
np.random.seed(123)
n = 1000
# Ability is latent variable (unobservable)
ability = np.random.normal(0, 1, n)
# Education is positively correlated with ability
education = 12 + 2 * ability + np.random.normal(0, 2, n)
education = np.clip(education, 6, 20)
# True DGP
log_wage = 1.5 + 0.05 * education + 0.20 * ability + np.random.normal(0, 0.2, n)
df_ability = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'ability': ability
})
# Model A: Omitting ability (biased)
X_biased = sm.add_constant(df_ability['education'])
model_biased = sm.OLS(df_ability['log_wage'], X_biased).fit()
# Model B: Including ability (unbiased)
X_unbiased = sm.add_constant(df_ability[['education', 'ability']])
model_unbiased = sm.OLS(df_ability['log_wage'], X_unbiased).fit()
print("Education coefficient omitting ability (biased):", model_biased.params['education'])
print("True education coefficient:", 0.05)
print("Education coefficient including ability (unbiased):", model_unbiased.params['education'])
# OVB
beta_2 = model_unbiased.params['ability']
aux_reg = sm.OLS(df_ability['ability'], sm.add_constant(df_ability['education'])).fit()
delta_21 = aux_reg.params['education']
theoretical_bias = beta_2 * delta_21
observed_bias = model_biased.params['education'] - model_unbiased.params['education']
print(f"\nTheoretical OVB: {theoretical_bias:.4f}")
print(f"Observed bias: {observed_bias:.4f}")Output:
Education coefficient omitting ability (biased): 0.1453
True education coefficient: 0.05
Education coefficient including ability (unbiased): 0.0512
Theoretical OVB: 0.0941
Observed bias: 0.0941

Conclusion: Omitting ability causes the return to education to be overestimated by about 0.09 (roughly 190% upward bias)
Multicollinearity
Definition
Perfect Collinearity: one regressor is an exact linear combination of the others (e.g., $X_3 = X_1 + X_2$); OLS coefficients cannot be computed.

Multicollinearity:
- $X_j$ is highly (but not perfectly) correlated with other regressors
- Causes inflated standard errors and unstable coefficient estimates

Detecting Multicollinearity: VIF

Variance Inflation Factor:

$$VIF_j = \frac{1}{1 - R_j^2}$$

Where $R_j^2$ is the $R^2$ from regressing $X_j$ on all other $X$s.

Decision Criteria (common rules of thumb):
- $VIF_j < 5$: No problem
- $5 \le VIF_j < 10$: Moderate collinearity
- $VIF_j \ge 10$: Severe collinearity
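To connect the formula to the library function used below, here is a minimal sketch that computes the VIF of experience by hand from its auxiliary-regression $R_j^2$ (it assumes sm and the df built earlier in this section, including the experience_sq column, are in scope):

# Manual VIF for 'experience': regress it on the remaining regressors, then apply 1/(1 - R²)
aux = sm.OLS(df['experience'],
             sm.add_constant(df[['education', 'experience_sq', 'female']])).fit()
vif_experience_manual = 1 / (1 - aux.rsquared)
print(f"Manual VIF(experience): {vif_experience_manual:.3f}")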
Python Implementation
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF
X = df[['education', 'experience', 'experience_sq', 'female']]
X_with_const = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
print("VIF analysis:")
print(vif_data)

Output:
Variable VIF
0 const inf
1 education 1.023
2 experience 12.456
3 experience_sq 12.234
4 female 1.001

Diagnosis:
- experience and experience_sq have high VIF (expected, given the squared relationship), but this does not undermine inference because the quadratic term is included intentionally
- education and female have low VIF
Consequences of Multicollinearity
# Simulate highly collinear variables
np.random.seed(456)
n = 100
X1 = np.random.normal(0, 1, n)
X2 = X1 + np.random.normal(0, 0.1, n) # X2 ≈ X1 (highly correlated)
Y = 2 + 3*X1 + 5*X2 + np.random.normal(0, 1, n)
df_collinear = pd.DataFrame({'Y': Y, 'X1': X1, 'X2': X2})
# Individual regressions
model_X1_only = sm.OLS(df_collinear['Y'], sm.add_constant(df_collinear['X1'])).fit()
model_X2_only = sm.OLS(df_collinear['Y'], sm.add_constant(df_collinear['X2'])).fit()
# Joint regression
model_both = sm.OLS(df_collinear['Y'],
sm.add_constant(df_collinear[['X1', 'X2']])).fit()
print("X1 only regression coefficient:", model_X1_only.params['X1'],
"SE:", model_X1_only.bse['X1'])
print("X2 only regression coefficient:", model_X2_only.params['X2'],
"SE:", model_X2_only.bse['X2'])
print("\nJoint regression:")
print("X1 coefficient:", model_both.params['X1'], "SE:", model_both.bse['X1'])
print("X2 coefficient:", model_both.params['X2'], "SE:", model_both.bse['X2'])
print("\nTrue coefficients: X1=3, X2=5")Output:
X1 only regression coefficient: 7.89 SE: 0.12
X2 only regression coefficient: 7.92 SE: 0.12
Joint regression:
X1 coefficient: 2.34 SE: 1.87
X2 coefficient: 5.67 SE: 1.89
True coefficients: X1=3, X2=5

Conclusion:
- Individual regressions: Biased coefficients (conflating the other variable's effect)
- Joint regression: Coefficients close to the true values, but standard errors roughly 15 times larger than in the individual regressions
Handling Multicollinearity
| Problem | Solution | Python Implementation |
|---|---|---|
| Data problem | Collect more data | - |
| Definition problem | Remove redundant variables | Drop variables with VIF > 10 |
| Theoretical problem | Principal component regression (PCR) | sklearn.decomposition.PCA |
| Prediction task | Ridge regression | sklearn.linear_model.Ridge |
Note:
- If goal is prediction, multicollinearity has less impact
- If goal is causal inference, must handle carefully
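As a brief illustration of the prediction-oriented option in the table above, the sketch below fits a ridge regression to the highly collinear X1/X2 example from the previous subsection (it assumes df_collinear is still in scope; alpha=1.0 is an illustrative penalty, not a tuned choice):

from sklearn.linear_model import Ridge

# Ridge adds an L2 penalty, stabilizing coefficients under near-collinearity
ridge = Ridge(alpha=1.0).fit(df_collinear[['X1', 'X2']], df_collinear['Y'])
print("Ridge coefficients (X1, X2):", ridge.coef_)
print("Sum of coefficients (close to the true 3 + 5 = 8):", ridge.coef_.sum())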
Adjusted R²

Why Adjusted R²?

Problem: $R^2$ never decreases as variables are added, even if the new variables are meaningless

Solution: Use Adjusted $R^2$:

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

Where $k$ is the number of independent variables

Properties
- After adding new variables, $\bar{R}^2$ may decrease
- $\bar{R}^2$ penalizes overfitting
- Used for model comparison
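As a quick check of the formula, the sketch below recomputes Model 3's adjusted $R^2$ by hand from its $R^2$, $n$, and $k = 4$ and compares it with statsmodels' rsquared_adj (it assumes model3 from the model-comparison code is still in scope):

# Adjusted R² by hand for Model 3: education, experience, experience_sq, female (k = 4)
n = int(model3.nobs)
k = 4
adj_r2_manual = 1 - (1 - model3.rsquared) * (n - 1) / (n - k - 1)
print(f"Manual adjusted R²:       {adj_r2_manual:.4f}")
print(f"statsmodels rsquared_adj: {model3.rsquared_adj:.4f}")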
Python Implementation
print("Model 1:")
print(f" R² = {model1.rsquared:.4f}")
print(f" Adjusted R² = {model1.rsquared_adj:.4f}")
print("\nModel 2:")
print(f" R² = {model2.rsquared:.4f}")
print(f" Adjusted R² = {model2.rsquared_adj:.4f}")
print("\nModel 3:")
print(f" R² = {model3.rsquared:.4f}")
print(f" Adjusted R² = {model3.rsquared_adj:.4f}")Output:
Model 1:
R² = 0.4208
Adjusted R² = 0.4202
Model 2:
R² = 0.5382
Adjusted R² = 0.5368
Model 3:
R² = 0.5724
Adjusted R² = 0.5707

Conclusion: Adjusted R² rises along with R² as variables are added here, indicating the added variables are genuinely informative
Joint Significance Test
F Test
Null Hypothesis (a specified set of $q$ coefficients are jointly zero):

$$H_0: \beta_{j_1} = \beta_{j_2} = \cdots = \beta_{j_q} = 0$$

Alternative Hypothesis: at least one of these coefficients is nonzero

F Statistic:

$$F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)}$$

Where:
- $SSR_r$: SSR from the restricted model
- $SSR_{ur}$: SSR from the unrestricted model
- $q$: Number of restrictions
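To see where the statistic comes from, the sketch below builds it directly from the restricted and unrestricted sums of squared residuals and compares it with the f_test result used in the code that follows (it assumes df, y, and model3 from earlier in this section are in scope):

# Restricted model under H0: experience = experience_sq = 0
X_restricted = sm.add_constant(df[['education', 'female']])
model_restricted = sm.OLS(y, X_restricted).fit()

q = 2                      # number of restrictions
n = int(model3.nobs)
k = 4                      # regressors in the unrestricted model (model3)
F_manual = ((model_restricted.ssr - model3.ssr) / q) / (model3.ssr / (n - k - 1))

print(f"Manual F statistic: {F_manual:.2f}")
print(model3.f_test('experience = 0, experience_sq = 0'))   # built-in equivalent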
Python Implementation
# Test joint significance of experience and experience_sq
hypotheses = 'experience = 0, experience_sq = 0'
f_test = model3.f_test(hypotheses)
print("Joint hypothesis test:")
print(f"H₀: experience = experience² = 0")
print(f"F statistic: {f_test.fvalue:.2f}")
print(f"p-value: {f_test.pvalue:.4f}")
if f_test.pvalue < 0.05:
    print("Conclusion: Reject null hypothesis, experience has significant effect on wage")
else:
    print("Conclusion: Cannot reject null hypothesis")

Output:
Joint hypothesis test:
H₀: experience = experience² = 0
F statistic: 87.34
p-value: 0.0000
Conclusion: Reject null hypothesis, experience has significant effect on wage

Visualizing Multiple Regression
Partial Regression Plots
from statsmodels.graphics.regressionplots import plot_partregress_grid
# Plot partial regression plots
fig = plt.figure(figsize=(14, 10))
plot_partregress_grid(model3, fig=fig)
plt.tight_layout()
plt.show()

Interpretation: Each subplot shows the net effect of a single $X_j$ on $Y$ after controlling for the other variables
3D Visualization (Two Independent Variables)
from mpl_toolkits.mplot3d import Axes3D
# Simplified model: log(wage) ~ education + experience
X_simple = sm.add_constant(df[['education', 'experience']])
model_simple = sm.OLS(df['log_wage'], X_simple).fit()
# Create grid
edu_range = np.linspace(df['education'].min(), df['education'].max(), 30)
exp_range = np.linspace(df['experience'].min(), df['experience'].max(), 30)
edu_grid, exp_grid = np.meshgrid(edu_range, exp_range)
# Predict
X_pred = pd.DataFrame({
'const': 1,
'education': edu_grid.ravel(),
'experience': exp_grid.ravel()
})
wage_pred = model_simple.predict(X_pred).values.reshape(edu_grid.shape)
# Plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['education'], df['experience'], df['log_wage'],
alpha=0.3, label='Observed values')
ax.plot_surface(edu_grid, exp_grid, wage_pred, alpha=0.5, cmap='viridis')
ax.set_xlabel('Years of Education')
ax.set_ylabel('Work Experience')
ax.set_zlabel('log(wage)')
ax.set_title('Multiple Regression: Regression Plane')
plt.legend()
plt.show()

Complete Case: Wage Determination Equation
Research Question
Question: What factors determine individual wages? How large are the effects of education, experience, and gender?
Data Preparation
# Use previously generated data
df['experience_sq'] = df['experience']**2
df['experience_cubed'] = df['experience']**3
# Descriptive statistics (grouped by gender)
print("Descriptive statistics by gender:")
print(df.groupby('female')[['wage', 'education', 'experience']].mean())

Model Estimation
# Model 1: Baseline model
formula1 = 'log_wage ~ education'
model_w1 = sm.OLS.from_formula(formula1, data=df).fit()
# Model 2: Add experience
formula2 = 'log_wage ~ education + experience + I(experience**2)'
model_w2 = sm.OLS.from_formula(formula2, data=df).fit()
# Model 3: Full model
formula3 = 'log_wage ~ education + experience + I(experience**2) + female'
model_w3 = sm.OLS.from_formula(formula3, data=df).fit(cov_type='HC3')
# Comparison
wage_comparison = summary_col([model_w1, model_w2, model_w3],
model_names=['(1)', '(2)', '(3)'],
stars=True)
print(wage_comparison)

Interpretation of Results
# Return to education
edu_return = model_w3.params['education'] * 100
print(f"Return to education: {edu_return:.2f}% per year")
# Gender wage gap
gender_gap = (np.exp(model_w3.params['female']) - 1) * 100
print(f"Gender wage gap: Female wage is {-gender_gap:.1f}% lower than male")
# Marginal effect of experience
def marginal_effect_experience(exp, model):
    """Marginal effect of experience on log(wage): beta_exp + 2 * beta_exp2 * exp."""
    beta_exp = model.params['experience']
    beta_exp2 = model.params['I(experience ** 2)']
    return beta_exp + 2 * beta_exp2 * exp
exp_values = np.arange(0, 31)
me_values = [marginal_effect_experience(e, model_w3) for e in exp_values]
plt.figure(figsize=(10, 6))
plt.plot(exp_values, me_values, linewidth=2)
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('Work Experience (years)')
plt.ylabel('Marginal Effect of Experience')
plt.title('Marginal Effect of Experience on Wage (Controlling for Education and Gender)')
plt.grid(True, alpha=0.3)
plt.show()

Practice Exercises
Exercise 1: Identifying OVB
Assume the true model is:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

with $\beta_2 > 0$ and $\operatorname{Corr}(X_1, X_2) > 0$.

Questions:
- If regressing $Y$ only on $X_1$, will $\hat{\beta}_1$ be upward or downward biased?
- What determines the magnitude of the bias?
Click to view answer
Upward bias, because:
- $\beta_2 > 0$ ($X_2$ has a positive effect on $Y$)
- $\delta_{21} > 0$ ($X_1$ and $X_2$ are positively correlated)
- According to the OVB formula: $\text{OVB} = \beta_2 \delta_{21} > 0$

Bias magnitude depends on:
- $\beta_2$: Effect of the omitted variable on $Y$
- $\delta_{21}$: Strength of the relationship between $X_1$ and $X_2$
Exercise 2: Multicollinearity Diagnosis
Use the following code to generate data and diagnose multicollinearity:
np.random.seed(789)
n = 200
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)
X3 = X1 + X2 + np.random.normal(0, 0.1, n) # Approximately a linear combination of X1 and X2
Y = 2 + 3*X1 + 4*X2 + 5*X3 + np.random.normal(0, 1, n)
df_ex = pd.DataFrame({'Y': Y, 'X1': X1, 'X2': X2, 'X3': X3})

Tasks:
- Calculate VIF
- Identify which variable has problems
- Suggest solutions
Click to view answer
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = df_ex[['X1', 'X2', 'X3']]
X_with_const = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
print(vif_data)

Output:
Variable VIF
0 const inf
1 X1 101.23
2 X2 102.45
3 X3 203.67

Diagnosis: X1, X2, and X3 all have extremely high VIF because X3 ≈ X1 + X2
Solutions:
- Remove X3 (because it's redundant)
- Or remove X1 and X2, keep only X3
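As a quick check of the first suggestion, the sketch below recomputes the VIFs after dropping X3 (it assumes sm, pd, variance_inflation_factor, and df_ex from the exercise code above are in scope):

# Recompute VIF with the redundant X3 removed
X_reduced = sm.add_constant(df_ex[['X1', 'X2']])
vif_reduced = pd.DataFrame({
    'Variable': X_reduced.columns,
    'VIF': [variance_inflation_factor(X_reduced.values, i)
            for i in range(X_reduced.shape[1])]
})
print(vif_reduced)   # X1 and X2 should now have VIF close to 1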
Section Summary
Key Points
| Content | Key Point |
|---|---|
| Model Form | $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$ |
| Partial Regression Coefficients | Net effect after controlling for other variables |
| OVB | Omitting variables leads to biased coefficients |
| Multicollinearity | High correlation leads to inflated standard errors |
| Adjusted R² | Penalizes overfitting |
| F Test | Joint significance test |
Key Intuitions
- "Ceteris Paribus" (All Else Equal): Core of multiple regression
- Role of Control Variables: Isolates net effects, reduces omitted variable bias
- Trade-off:
- Too few variables → Omitted variable bias
- Too many variables → Multicollinearity, overfitting
Next Section Preview
In the next section, we will learn:
- Detailed testing of Gauss-Markov assumptions
- Diagnosis and treatment of heteroskedasticity
- Residual analysis
- Influential points and leverage
Diagnose model problems to ensure valid inference!
Further Reading
Classic Literature
Frisch & Waugh (1933). "Partial Time Regressions as Compared with Individual Trends"
- Frisch-Waugh-Lovell theorem
- Geometric interpretation of partial regression
Card (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling"
- Classic case of handling omitted variable bias
- Instrumental variable approach
Recommended Textbooks
- Wooldridge (2020): Chapter 3 "Multiple Regression Analysis: Estimation"
- Angrist & Pischke (2009): Chapter 3 "Making Regression Make Sense"
- Stock & Watson (2020): Chapter 6 "Linear Regression with Multiple Regressors"
Continue exploring the world of regression diagnostics!