5.3 Multiple Linear Regression
"The plural of anecdote is not data."— Frank Kotsonis, Political Scientist
From One X to Multiple Xs: Controlling for Confounders, Isolating Net Effects
Section Objectives
After completing this section, you will be able to:
- Understand the mathematical principles of multiple regression models
- Master the meaning of partial regression coefficients
- Identify and handle omitted variable bias
- Diagnose and address multicollinearity
- Conduct multiple regression analysis using Python
- Understand the role of control variables
Multiple Linear Regression Model
Population Regression Equation

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

Or in matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Where:
- $Y_i$: Dependent variable
- $X_{1i}, \ldots, X_{ki}$: Independent variables
- $\beta_1, \ldots, \beta_k$: Partial regression coefficients
- $\varepsilon_i$: Error term
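The matrix form maps directly to the closed-form OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. As a minimal sketch (on illustrative simulated data, with variable names not tied to this section's dataset), the code below computes this estimator by hand and checks it against statsmodels:

import numpy as np
import statsmodels.api as sm

# Minimal sketch: OLS in matrix form on illustrative simulated data
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^(-1) X'y via a linear solve

print("Manual OLS:      ", beta_hat)
print("statsmodels OLS: ", sm.OLS(y, X).fit().params)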
Intuition Behind Partial Regression Coefficients
Meaning: $\beta_j$ is the effect of a one-unit change in $X_j$ on $Y$, holding all other variables constant:

$$\beta_j = \frac{\partial Y}{\partial X_j}$$
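One way to see "holding all other variables constant" in action is the Frisch-Waugh-Lovell partialling-out result: the multiple-regression coefficient on $X_j$ equals the coefficient from regressing the $Y$ residuals on the $X_j$ residuals, after both have been purged of the other regressors. Below is a minimal sketch on illustrative simulated data (the names x1, x2, y are hypothetical):

import numpy as np
import statsmodels.api as sm

# Sketch: Frisch-Waugh-Lovell partialling out on illustrative simulated data
rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)             # x1 correlated with x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Full multiple regression
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Partial x2 out of both y and x1, then regress residual on residual
r_y = sm.OLS(y, sm.add_constant(x2)).fit().resid
r_x1 = sm.OLS(x1, sm.add_constant(x2)).fit().resid
partial = sm.OLS(r_y, r_x1).fit()

print("Multiple-regression coefficient on x1:", full.params[1])
print("Residual-on-residual coefficient:     ", partial.params[0])   # numerically identical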
Classic Case: Extended Mincer Equation

$$\log(wage_i) = \beta_0 + \beta_1\,educ_i + \beta_2\,exper_i + \beta_3\,exper_i^2 + \varepsilon_i$$

Why add experience?
- Both education and experience affect wage
- If we regressed log wage on education alone, the education coefficient would also pick up the effect of experience (omitted variable bias)
Python Implementation: Multiple Regression
Data Preparation
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# Simulate wage data
np.random.seed(42)
n = 1000
# Generate independent variables
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
experience = np.random.uniform(0, 30, n)
female = np.random.binomial(1, 0.5, n)
# Generate dependent variable (true data generating process)
log_wage = (1.5 +
0.08 * education +
0.03 * experience -
0.0005 * experience**2 -
0.15 * female +
np.random.normal(0, 0.3, n))
wage = np.exp(log_wage)
# Create dataframe
df = pd.DataFrame({
'wage': wage,
'log_wage': log_wage,
'education': education,
'experience': experience,
'female': female
})
print("First 5 rows of data:")
print(df.head())
print("\nDescriptive statistics:")
print(df.describe())

Model 1: Simple Regression (Biased Estimate)
# Include only education
X1 = sm.add_constant(df['education'])
y = df['log_wage']
model1 = sm.OLS(y, X1).fit()
print("Model 1: log(wage) ~ education")
print(model1.summary())

Model 2: Add Experience (Control for Confounders)
# Add experience and experience squared
df['experience_sq'] = df['experience']**2
X2 = sm.add_constant(df[['education', 'experience', 'experience_sq']])
model2 = sm.OLS(y, X2).fit()
print("\nModel 2: log(wage) ~ education + experience + experience²")
print(model2.summary())

Model 3: Full Model (Add Gender)
# Full model
X3 = sm.add_constant(df[['education', 'experience', 'experience_sq', 'female']])
model3 = sm.OLS(y, X3).fit()
print("\nModel 3: log(wage) ~ education + experience + experience² + female")
print(model3.summary())

Comparing Three Models
from statsmodels.iolib.summary2 import summary_col
# Comparison table
comparison = summary_col([model1, model2, model3],
model_names=['Model 1', 'Model 2', 'Model 3'],
stars=True,
info_dict={
'N': lambda x: f"{int(x.nobs)}",
'R²': lambda x: f"{x.rsquared:.3f}",
'Adj. R²': lambda x: f"{x.rsquared_adj:.3f}"
})
print("\nModel comparison:")
print(comparison)

Sample Output:
================================================
Model 1 Model 2 Model 3
------------------------------------------------
const 1.234*** 1.512*** 1.578***
(0.032) (0.041) (0.039)
education 0.091*** 0.081*** 0.080***
(0.002) (0.003) (0.003)
experience 0.029*** 0.030***
(0.003) (0.003)
experience_sq -0.0005*** -0.0005***
(0.0001) (0.0001)
female -0.148***
(0.019)
------------------------------------------------
N 1000 1000 1000
R² 0.421 0.538 0.572
Adj. R² 0.420 0.536 0.570
================================================
Standard errors in parentheses.
*** p<0.01, ** p<0.05, * p<0.1

Omitted Variable Bias (OVB)
Theoretical Framework

True Model:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

Estimated Model (omitting $X_2$):

$$Y_i = \alpha_0 + \alpha_1 X_{1i} + u_i$$

OVB Formula:

$$E[\hat{\alpha}_1] = \beta_1 + \underbrace{\beta_2 \delta_{21}}_{\text{OVB}}$$

Where $\delta_{21}$ is the coefficient from regressing $X_2$ on $X_1$:

$$X_{2i} = \delta_{20} + \delta_{21} X_{1i} + v_i$$

Direction of OVB

| $\beta_2$ | $\operatorname{Corr}(X_1, X_2)$ | OVB Direction |
|---|---|---|
| $> 0$ | $> 0$ | Upward bias |
| $> 0$ | $< 0$ | Downward bias |
| $< 0$ | $> 0$ | Downward bias |
| $< 0$ | $< 0$ | Upward bias |
Case Study: Ability Bias
Problem: When estimating the return to education on wage, what happens if we omit the "ability" variable?
Analysis:
- $\beta_2 > 0$: Ability increases wage
- $\delta_{21} > 0$: Education and ability are positively correlated
- $\text{OVB} = \beta_2 \delta_{21} > 0$: Omitting ability leads to upward bias in $\hat{\beta}_1$
# Simulate ability bias
np.random.seed(123)
n = 1000
# Ability is latent variable (unobservable)
ability = np.random.normal(0, 1, n)
# Education is positively correlated with ability
education = 12 + 2 * ability + np.random.normal(0, 2, n)
education = np.clip(education, 6, 20)
# True DGP
log_wage = 1.5 + 0.05 * education + 0.20 * ability + np.random.normal(0, 0.2, n)
df_ability = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'ability': ability
})
# Model A: Omitting ability (biased)
X_biased = sm.add_constant(df_ability['education'])
model_biased = sm.OLS(df_ability['log_wage'], X_biased).fit()
# Model B: Including ability (unbiased)
X_unbiased = sm.add_constant(df_ability[['education', 'ability']])
model_unbiased = sm.OLS(df_ability['log_wage'], X_unbiased).fit()
print("Education coefficient omitting ability (biased):", model_biased.params['education'])
print("True education coefficient:", 0.05)
print("Education coefficient including ability (unbiased):", model_unbiased.params['education'])
# OVB
beta_2 = model_unbiased.params['ability']
aux_reg = sm.OLS(df_ability['ability'], sm.add_constant(df_ability['education'])).fit()
delta_21 = aux_reg.params['education']
theoretical_bias = beta_2 * delta_21
observed_bias = model_biased.params['education'] - model_unbiased.params['education']
print(f"\nTheoretical OVB: {theoretical_bias:.4f}")
print(f"Observed bias: {observed_bias:.4f}")Output:
Education coefficient omitting ability (biased): 0.1453
True education coefficient: 0.05
Education coefficient including ability (unbiased): 0.0512
Theoretical OVB: 0.0941
Observed bias: 0.0941

Conclusion: Omitting ability causes the return to education to be overestimated by about 0.09 (roughly 190% upward bias)
Multicollinearity
Definition
Perfect Collinearity: one regressor is an exact linear combination of the others (e.g., $X_3 = X_1 + X_2$); OLS coefficients cannot be computed.

Multicollinearity:
- $X_j$ is highly (but not perfectly) correlated with other regressors
- Causes inflated standard errors and unstable coefficient estimates

Detecting Multicollinearity: VIF

Variance Inflation Factor:

$$VIF_j = \frac{1}{1 - R_j^2}$$

Where $R_j^2$ is the $R^2$ from regressing $X_j$ on all other $X$s.

Decision Criteria (common rules of thumb):
- $VIF_j < 5$: No problem
- $5 \le VIF_j < 10$: Moderate collinearity
- $VIF_j \ge 10$: Severe collinearity
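To connect the formula to the library function used below, here is a minimal sketch that computes the VIF of experience by hand from its auxiliary-regression $R_j^2$ (it assumes sm and the df built earlier in this section, including the experience_sq column, are in scope):

# Manual VIF for 'experience': regress it on the remaining regressors, then apply 1/(1 - R²)
aux = sm.OLS(df['experience'],
             sm.add_constant(df[['education', 'experience_sq', 'female']])).fit()
vif_experience_manual = 1 / (1 - aux.rsquared)
print(f"Manual VIF(experience): {vif_experience_manual:.3f}")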
Python Implementation
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF
X = df[['education', 'experience', 'experience_sq', 'female']]
X_with_const = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
print("VIF analysis:")
print(vif_data)

Output:
Variable VIF
0 const inf
1 education 1.023
2 experience 12.456
3 experience_sq 12.234
4 female 1.001

Diagnosis:
- experience and experience_sq have high VIF (expected, given the squared relationship), but this does not undermine inference because the quadratic term is included intentionally
- education and female have low VIF
Consequences of Multicollinearity
# Simulate highly collinear variables
np.random.seed(456)
n = 100
X1 = np.random.normal(0, 1, n)
X2 = X1 + np.random.normal(0, 0.1, n) # X2 ≈ X1 (highly correlated)
Y = 2 + 3*X1 + 5*X2 + np.random.normal(0, 1, n)
df_collinear = pd.DataFrame({'Y': Y, 'X1': X1, 'X2': X2})
# Individual regressions
model_X1_only = sm.OLS(df_collinear['Y'], sm.add_constant(df_collinear['X1'])).fit()
model_X2_only = sm.OLS(df_collinear['Y'], sm.add_constant(df_collinear['X2'])).fit()
# Joint regression
model_both = sm.OLS(df_collinear['Y'],
sm.add_constant(df_collinear[['X1', 'X2']])).fit()
print("X1 only regression coefficient:", model_X1_only.params['X1'],
"SE:", model_X1_only.bse['X1'])
print("X2 only regression coefficient:", model_X2_only.params['X2'],
"SE:", model_X2_only.bse['X2'])
print("\nJoint regression:")
print("X1 coefficient:", model_both.params['X1'], "SE:", model_both.bse['X1'])
print("X2 coefficient:", model_both.params['X2'], "SE:", model_both.bse['X2'])
print("\nTrue coefficients: X1=3, X2=5")Output:
X1 only regression coefficient: 7.89 SE: 0.12
X2 only regression coefficient: 7.92 SE: 0.12
Joint regression:
X1 coefficient: 2.34 SE: 1.87
X2 coefficient: 5.67 SE: 1.89
True coefficients: X1=3, X2=5

Conclusion:
- Individual regressions: Biased coefficients (conflating the other variable's effect)
- Joint regression: Coefficients close to the true values, but standard errors roughly 15 times larger than in the individual regressions
Handling Multicollinearity
| Problem | Solution | Python Implementation |
|---|---|---|
| Data problem | Collect more data | - |
| Definition problem | Remove redundant variables | Drop variables with VIF > 10 |
| Theoretical problem | Principal component regression (PCR) | sklearn.decomposition.PCA |
| Prediction task | Ridge regression | sklearn.linear_model.Ridge |
Note:
- If goal is prediction, multicollinearity has less impact
- If goal is causal inference, must handle carefully
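As a brief illustration of the prediction-oriented option in the table above, the sketch below fits a ridge regression to the highly collinear X1/X2 example from the previous subsection (it assumes df_collinear is still in scope; alpha=1.0 is an illustrative penalty, not a tuned choice):

from sklearn.linear_model import Ridge

# Ridge adds an L2 penalty, stabilizing coefficients under near-collinearity
ridge = Ridge(alpha=1.0).fit(df_collinear[['X1', 'X2']], df_collinear['Y'])
print("Ridge coefficients (X1, X2):", ridge.coef_)
print("Sum of coefficients (close to the true 3 + 5 = 8):", ridge.coef_.sum())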
Adjusted R²

Why Adjusted R²?

Problem: $R^2$ never decreases as variables are added, even if the new variables are meaningless

Solution: Use Adjusted $R^2$:

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

Where $k$ is the number of independent variables

Properties
- After adding new variables, $\bar{R}^2$ may decrease
- $\bar{R}^2$ penalizes overfitting
- Used for model comparison
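As a quick check of the formula, the sketch below recomputes Model 3's adjusted $R^2$ by hand from its $R^2$, $n$, and $k = 4$ and compares it with statsmodels' rsquared_adj (it assumes model3 from the model-comparison code is still in scope):

# Adjusted R² by hand for Model 3: education, experience, experience_sq, female (k = 4)
n = int(model3.nobs)
k = 4
adj_r2_manual = 1 - (1 - model3.rsquared) * (n - 1) / (n - k - 1)
print(f"Manual adjusted R²:       {adj_r2_manual:.4f}")
print(f"statsmodels rsquared_adj: {model3.rsquared_adj:.4f}")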
Python Implementation
print("Model 1:")
print(f" R² = {model1.rsquared:.4f}")
print(f" Adjusted R² = {model1.rsquared_adj:.4f}")
print("\nModel 2:")
print(f" R² = {model2.rsquared:.4f}")
print(f" Adjusted R² = {model2.rsquared_adj:.4f}")
print("\nModel 3:")
print(f" R² = {model3.rsquared:.4f}")
print(f" Adjusted R² = {model3.rsquared_adj:.4f}")Output:
Model 1:
R² = 0.4208
Adjusted R² = 0.4202
Model 2:
R² = 0.5382
Adjusted R² = 0.5368
Model 3:
R² = 0.5724
Adjusted R² = 0.5707

Conclusion: Adjusted R² rises along with R² as variables are added here, indicating the added variables are genuinely informative
Joint Significance Test
F Test
Null Hypothesis (a specified set of $q$ coefficients are jointly zero):

$$H_0: \beta_{j_1} = \beta_{j_2} = \cdots = \beta_{j_q} = 0$$

Alternative Hypothesis: at least one of these coefficients is nonzero

F Statistic:

$$F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)}$$

Where:
- $SSR_r$: SSR from the restricted model
- $SSR_{ur}$: SSR from the unrestricted model
- $q$: Number of restrictions
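To see where the statistic comes from, the sketch below builds it directly from the restricted and unrestricted sums of squared residuals and compares it with the f_test result used in the code that follows (it assumes df, y, and model3 from earlier in this section are in scope):

# Restricted model under H0: experience = experience_sq = 0
X_restricted = sm.add_constant(df[['education', 'female']])
model_restricted = sm.OLS(y, X_restricted).fit()

q = 2                      # number of restrictions
n = int(model3.nobs)
k = 4                      # regressors in the unrestricted model (model3)
F_manual = ((model_restricted.ssr - model3.ssr) / q) / (model3.ssr / (n - k - 1))

print(f"Manual F statistic: {F_manual:.2f}")
print(model3.f_test('experience = 0, experience_sq = 0'))   # built-in equivalent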
Python Implementation
# Test joint significance of experience and experience_sq
hypotheses = 'experience = 0, experience_sq = 0'
f_test = model3.f_test(hypotheses)
print("Joint hypothesis test:")
print(f"H₀: experience = experience² = 0")
print(f"F statistic: {f_test.fvalue:.2f}")
print(f"p-value: {f_test.pvalue:.4f}")
if f_test.pvalue < 0.05:
    print("Conclusion: Reject null hypothesis, experience has significant effect on wage")
else:
    print("Conclusion: Cannot reject null hypothesis")

Output:
Joint hypothesis test:
H₀: experience = experience² = 0
F statistic: 87.34
p-value: 0.0000
Conclusion: Reject null hypothesis, experience has significant effect on wage

Visualizing Multiple Regression
Partial Regression Plots
from statsmodels.graphics.regressionplots import plot_partregress_grid
# Plot partial regression plots
fig = plt.figure(figsize=(14, 10))
plot_partregress_grid(model3, fig=fig)
plt.tight_layout()
plt.show()

Interpretation: Each subplot shows the net effect of a single $X_j$ on $Y$ after controlling for the other variables
3D Visualization (Two Independent Variables)
from mpl_toolkits.mplot3d import Axes3D
# Simplified model: log(wage) ~ education + experience
X_simple = sm.add_constant(df[['education', 'experience']])
model_simple = sm.OLS(df['log_wage'], X_simple).fit()
# Create grid
edu_range = np.linspace(df['education'].min(), df['education'].max(), 30)
exp_range = np.linspace(df['experience'].min(), df['experience'].max(), 30)
edu_grid, exp_grid = np.meshgrid(edu_range, exp_range)
# Predict
X_pred = pd.DataFrame({
'const': 1,
'education': edu_grid.ravel(),
'experience': exp_grid.ravel()
})
wage_pred = model_simple.predict(X_pred).values.reshape(edu_grid.shape)
# Plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['education'], df['experience'], df['log_wage'],
alpha=0.3, label='Observed values')
ax.plot_surface(edu_grid, exp_grid, wage_pred, alpha=0.5, cmap='viridis')
ax.set_xlabel('Years of Education')
ax.set_ylabel('Work Experience')
ax.set_zlabel('log(wage)')
ax.set_title('Multiple Regression: Regression Plane')
plt.legend()
plt.show()

Complete Case: Wage Determination Equation
Research Question
Question: What factors determine individual wages? How large are the effects of education, experience, and gender?
Data Preparation
# Use previously generated data
df['experience_sq'] = df['experience']**2
df['experience_cubed'] = df['experience']**3
# Descriptive statistics (grouped by gender)
print("Descriptive statistics by gender:")
print(df.groupby('female')[['wage', 'education', 'experience']].mean())

Model Estimation
# Model 1: Baseline model
formula1 = 'log_wage ~ education'
model_w1 = sm.OLS.from_formula(formula1, data=df).fit()
# Model 2: Add experience
formula2 = 'log_wage ~ education + experience + I(experience**2)'
model_w2 = sm.OLS.from_formula(formula2, data=df).fit()
# Model 3: Full model
formula3 = 'log_wage ~ education + experience + I(experience**2) + female'
model_w3 = sm.OLS.from_formula(formula3, data=df).fit(cov_type='HC3')
# Comparison
wage_comparison = summary_col([model_w1, model_w2, model_w3],
model_names=['(1)', '(2)', '(3)'],
stars=True)
print(wage_comparison)

Interpretation of Results
# Return to education
edu_return = model_w3.params['education'] * 100
print(f"Return to education: {edu_return:.2f}% per year")
# Gender wage gap
gender_gap = (np.exp(model_w3.params['female']) - 1) * 100
print(f"Gender wage gap: Female wage is {-gender_gap:.1f}% lower than male")
# Marginal effect of experience
def marginal_effect_experience(exp, model):
    """Marginal effect of experience on log(wage): beta_exp + 2 * beta_exp2 * exp."""
    beta_exp = model.params['experience']
    beta_exp2 = model.params['I(experience ** 2)']
    return beta_exp + 2 * beta_exp2 * exp
exp_values = np.arange(0, 31)
me_values = [marginal_effect_experience(e, model_w3) for e in exp_values]
plt.figure(figsize=(10, 6))
plt.plot(exp_values, me_values, linewidth=2)
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('Work Experience (years)')
plt.ylabel('Marginal Effect of Experience')
plt.title('Marginal Effect of Experience on Wage (Controlling for Education and Gender)')
plt.grid(True, alpha=0.3)
plt.show()

Practice Exercises
Exercise 1: Identifying OVB
Assume the true model is:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

with $\beta_2 > 0$ and $\operatorname{Corr}(X_1, X_2) > 0$.

Questions:
- If regressing $Y$ only on $X_1$, will $\hat{\beta}_1$ be upward or downward biased?
- What determines the magnitude of the bias?
Click to view answer
Upward bias, because:
- $\beta_2 > 0$ ($X_2$ has a positive effect on $Y$)
- $\delta_{21} > 0$ ($X_1$ and $X_2$ are positively correlated)
- According to the OVB formula: $\text{OVB} = \beta_2 \delta_{21} > 0$

Bias magnitude depends on:
- $\beta_2$: Effect of the omitted variable on $Y$
- $\delta_{21}$: Strength of the relationship between $X_1$ and $X_2$
Exercise 2: Multicollinearity Diagnosis
Use the following code to generate data and diagnose multicollinearity:
np.random.seed(789)
n = 200
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)
X3 = X1 + X2 + np.random.normal(0, 0.1, n) # Approximately a linear combination of X1 and X2
Y = 2 + 3*X1 + 4*X2 + 5*X3 + np.random.normal(0, 1, n)
df_ex = pd.DataFrame({'Y': Y, 'X1': X1, 'X2': X2, 'X3': X3})

Tasks:
- Calculate VIF
- Identify which variable has problems
- Suggest solutions
Click to view answer
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = df_ex[['X1', 'X2', 'X3']]
X_with_const = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
print(vif_data)

Output:
Variable VIF
0 const inf
1 X1 101.23
2 X2 102.45
3 X3 203.67

Diagnosis: X1, X2, and X3 all have extremely high VIF because X3 ≈ X1 + X2
Solutions:
- Remove X3 (because it's redundant)
- Or remove X1 and X2, keep only X3
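As a quick check of the first suggestion, the sketch below recomputes the VIFs after dropping X3 (it assumes sm, pd, variance_inflation_factor, and df_ex from the exercise code above are in scope):

# Recompute VIF with the redundant X3 removed
X_reduced = sm.add_constant(df_ex[['X1', 'X2']])
vif_reduced = pd.DataFrame({
    'Variable': X_reduced.columns,
    'VIF': [variance_inflation_factor(X_reduced.values, i)
            for i in range(X_reduced.shape[1])]
})
print(vif_reduced)   # X1 and X2 should now have VIF close to 1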
Section Summary
Key Points
| Content | Key Point |
|---|---|
| Model Form | $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$ |
| Partial Regression Coefficients | Net effect after controlling for other variables |
| OVB | Omitting variables leads to biased coefficients |
| Multicollinearity | High correlation leads to inflated standard errors |
| Adjusted R² | Penalizes overfitting |
| F Test | Joint significance test |
Key Intuitions
- "Ceteris Paribus" (All Else Equal): Core of multiple regression
- Role of Control Variables: Isolates net effects, reduces omitted variable bias
- Trade-off:
- Too few variables → Omitted variable bias
- Too many variables → Multicollinearity, overfitting
Next Section Preview
In the next section, we will learn:
- Detailed testing of Gauss-Markov assumptions
- Diagnosis and treatment of heteroskedasticity
- Residual analysis
- Influential points and leverage
Diagnose model problems to ensure valid inference!
Further Reading
Classic Literature
Frisch & Waugh (1933). "Partial Time Regressions as Compared with Individual Trends"
- Frisch-Waugh-Lovell theorem
- Geometric interpretation of partial regression
Card (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling"
- Classic case of handling omitted variable bias
- Instrumental variable approach
Recommended Textbooks
- Wooldridge (2020): Chapter 3 "Multiple Regression Analysis: Estimation"
- Angrist & Pischke (2009): Chapter 3 "Making Regression Make Sense"
- Stock & Watson (2020): Chapter 6 "Linear Regression with Multiple Regressors"
Continue exploring the world of regression diagnostics!