5.4 Regression Diagnostics
"All models are wrong, some are useful, and a few are insightful."— George Box, Statistician
Testing Assumptions, Diagnosing Problems, Ensuring Valid Inference
Section Objectives
After completing this section, you will be able to:
- Understand the meaning of Gauss-Markov assumptions
- Test for heteroskedasticity
- Test for autocorrelation
- Perform residual analysis
- Identify influential points and outliers
- Use robust standard errors
Gauss-Markov Assumptions (Classical Linear Model Assumptions)
Complete List of Assumptions
| Assumption | Mathematical Expression | Meaning | Consequence of Violation |
|---|---|---|---|
| MLR.1 Linearity | $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$ | Population model is linear in parameters | Biased coefficients |
| MLR.2 Random Sampling | $\{(x_{i1}, \dots, x_{ik}, y_i)\}_{i=1}^{n}$ are i.i.d. | Sample is independently and identically distributed | Invalid inference |
| MLR.3 No Perfect Collinearity | $X$ has full column rank | No perfect linear relationship among independent variables | Coefficients cannot be estimated |
| MLR.4 Zero Conditional Mean | $E[\varepsilon \mid X] = 0$ | Error term uncorrelated with $X$ (exogeneity) | Biased and inconsistent coefficients |
| MLR.5 Homoskedasticity | $Var(\varepsilon \mid X) = \sigma^2$ | Error variance is constant | Invalid standard errors and inference |
| MLR.6 Normality | $\varepsilon \sim N(0, \sigma^2)$ | Errors follow a normal distribution | Affects small-sample inference |
Gauss-Markov Theorem
Theorem: Under assumptions MLR.1-MLR.5, the OLS estimators are BLUE (Best Linear Unbiased Estimators).
Practical Implications:
- Assumptions MLR.1-MLR.4 ensure unbiasedness: $E[\hat{\beta}] = \beta$ (a one-line sketch follows below)
- Assumption MLR.5 ensures that the usual OLS standard error formulas are valid and, together with MLR.1-MLR.4, that OLS is efficient among linear unbiased estimators
- Assumption MLR.6 is not required for the theorem (but is useful for exact small-sample inference)
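As a quick sketch of why MLR.1-MLR.4 deliver unbiasedness (standard textbook algebra, stated in matrix form):

$$
\hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon,
\qquad
E[\hat{\beta} \mid X] = \beta + (X'X)^{-1}X'\,E[\varepsilon \mid X] = \beta .
$$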
Diagnostic Toolbox
1. Residual Analysis
Types of Residuals:
| Type | Formula | Use |
|---|---|---|
| Raw Residuals | $\hat{\varepsilon}_i = y_i - \hat{y}_i$ | Basic diagnosis |
| Standardized Residuals | $\hat{\varepsilon}_i \,/\, (\hat{\sigma}\sqrt{1-h_i})$ | Identify large residuals |
| Studentized Residuals | $\hat{\varepsilon}_i \,/\, (\hat{\sigma}_{(i)}\sqrt{1-h_i})$ | Identify outliers |
where $h_i$ is the leverage of observation $i$ and $\hat{\sigma}_{(i)}$ is the error standard deviation estimated with observation $i$ excluded.
Python Implementation
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Generate example data
np.random.seed(42)
n = 200
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = 1.5 + 0.08*education + 0.03*experience - 0.0005*experience**2 + np.random.normal(0, 0.3, n)
df = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'experience': experience,
'experience_sq': experience**2
})
# Regression
X = sm.add_constant(df[['education', 'experience', 'experience_sq']])
y = df['log_wage']
model = sm.OLS(y, X).fit()
# Get residuals and diagnostic statistics
influence = model.get_influence()
standardized_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
df['resid'] = model.resid
df['fitted'] = model.fittedvalues
df['std_resid'] = standardized_resid
df['leverage'] = leverage
df['cooks_d'] = cooks_d
print("First 10 observations diagnostic statistics:")
print(df[['resid', 'std_resid', 'leverage', 'cooks_d']].head(10))
2. Residual Visualization (Four-Panel Diagnostic Plot)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Plot 1: Residuals vs Fitted Values (test linearity and homoskedasticity)
axes[0, 0].scatter(df['fitted'], df['resid'], alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')
axes[0, 0].grid(True, alpha=0.3)
# Add LOWESS curve
from statsmodels.nonparametric.smoothers_lowess import lowess
lowess_result = lowess(df['resid'], df['fitted'], frac=0.3)
axes[0, 0].plot(lowess_result[:, 0], lowess_result[:, 1], 'b-', linewidth=2, label='LOWESS')
axes[0, 0].legend()
# Plot 2: Q-Q Plot (test normality)
stats.probplot(df['resid'], dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)
# Plot 3: Scale-Location Plot (test homoskedasticity)
axes[1, 0].scatter(df['fitted'], np.sqrt(np.abs(df['std_resid'])), alpha=0.5)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot')
axes[1, 0].grid(True, alpha=0.3)
# Add LOWESS curve
lowess_result2 = lowess(np.sqrt(np.abs(df['std_resid'])), df['fitted'], frac=0.3)
axes[1, 0].plot(lowess_result2[:, 0], lowess_result2[:, 1], 'r-', linewidth=2)
# Plot 4: Residuals vs Leverage (identify influential points)
axes[1, 1].scatter(df['leverage'], df['std_resid'], alpha=0.5)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
axes[1, 1].grid(True, alpha=0.3)
# Mark high influence points (Cook's D > 0.5)
high_influence = df['cooks_d'] > 0.5
if high_influence.any():
axes[1, 1].scatter(df.loc[high_influence, 'leverage'],
df.loc[high_influence, 'std_resid'],
color='red', s=100, label="High Influence Points")
axes[1, 1].legend()
plt.tight_layout()
plt.show()
Interpretation Guide:
| Plot | Good Pattern | Problem Pattern |
|---|---|---|
| Residuals vs Fitted | Random scatter, horizontal band | Funnel shape (heteroskedasticity), curve (nonlinearity) |
| Q-Q Plot | Points on diagonal line | Systematic deviation (non-normality) |
| Scale-Location | Horizontal line | Upward or downward trend (heteroskedasticity) |
| Residuals vs Leverage | No extreme points | Points in upper-right/lower-right (high influence) |
Heteroskedasticity
Definition
Homoskedasticity: $Var(\varepsilon_i \mid X) = \sigma^2$ (constant)
Heteroskedasticity: $Var(\varepsilon_i \mid X) = \sigma_i^2$ (varies across observations)
Consequences
Good news: OLS coefficient estimates remain unbiased. Bad news: the usual standard errors are biased, so $t$ statistics and $p$-values are unreliable.
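To see why the usual standard errors fail, compare the variance formulas (a standard result, assuming independent errors):

$$
Var(\hat{\beta} \mid X) = (X'X)^{-1} X'\Omega X (X'X)^{-1},
\qquad \Omega = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2),
$$

which reduces to the textbook $\sigma^2 (X'X)^{-1}$ only when every $\sigma_i^2$ equals the same $\sigma^2$. Robust standard errors estimate this sandwich form directly.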
Testing Methods
1. Breusch-Pagan Test
Null Hypothesis: $H_0: Var(\varepsilon \mid X) = \sigma^2$ (homoskedasticity)
from statsmodels.stats.diagnostic import het_breuschpagan
# BP test
bp_test = het_breuschpagan(model.resid, model.model.exog)
bp_stat, bp_pvalue, bp_f, bp_f_pvalue = bp_test
print("Breusch-Pagan Test:")
print(f" LM statistic = {bp_stat:.3f}")
print(f" p-value = {bp_pvalue:.4f}")
if bp_pvalue < 0.05:
print(" Conclusion: Reject homoskedasticity hypothesis, heteroskedasticity present")
else:
print(" Conclusion: Cannot reject homoskedasticity hypothesis")2. White Test
Advantage: Does not assume specific form of heteroskedasticity
from statsmodels.stats.diagnostic import het_white
# White test
white_test = het_white(model.resid, model.model.exog)
white_stat, white_pvalue, white_f, white_f_pvalue = white_test
print("\nWhite Test:")
print(f" LM statistic = {white_stat:.3f}")
print(f" p-value = {white_pvalue:.4f}")3. Graphical Diagnosis
# Plot squared residuals vs fitted values
plt.figure(figsize=(10, 6))
plt.scatter(df['fitted'], df['resid']**2, alpha=0.5)
plt.xlabel('Fitted Values')
plt.ylabel('Squared Residuals')
plt.title('Squared Residuals vs Fitted Values (Heteroskedasticity Diagnosis)')
plt.grid(True, alpha=0.3)
# Add trend line
z = np.polyfit(df['fitted'], df['resid']**2, 2)
p = np.poly1d(z)
x_line = np.linspace(df['fitted'].min(), df['fitted'].max(), 100)
plt.plot(x_line, p(x_line), "r-", linewidth=2, label='Quadratic fit')
plt.legend()
plt.show()
Treatment Methods
Method 1: Robust Standard Errors (Heteroskedasticity-Robust Standard Errors)
# Use HC3 robust standard errors (recommended)
model_robust = sm.OLS(y, X).fit(cov_type='HC3')
print("Ordinary OLS:")
print(model.summary().tables[1])
print("\nRobust Standard Errors (HC3):")
print(model_robust.summary().tables[1])
# Compare standard errors
comparison = pd.DataFrame({
'Variable': model.params.index,
'Coefficient': model.params.values,
'SE (OLS)': model.bse.values,
'SE (Robust)': model_robust.bse.values,
'Ratio': model_robust.bse.values / model.bse.values
})
print("\nStandard Error Comparison:")
print(comparison)
Robust Standard Error Types:
| Type | Applicable Scenario | Python Parameter |
|---|---|---|
| HC0 | Basic version | cov_type='HC0' |
| HC1 | Small sample adjustment | cov_type='HC1' |
| HC2 | Leverage adjustment | cov_type='HC2' |
| HC3 | Recommended (most conservative) | cov_type='HC3' |
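For intuition about what these options compute, the sketch below is a minimal, hand-rolled version of the HC0/HC1 sandwich formula (my own illustration, not the statsmodels implementation). It reuses `X`, `y`, and `model` from the code above and checks the result against `cov_type='HC1'`.
# Manual sandwich estimator: Var(b) = (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
X_mat = X.values                      # design matrix including the constant
e = model.resid.values                # OLS residuals
XtX_inv = np.linalg.inv(X_mat.T @ X_mat)
meat = X_mat.T @ np.diag(e**2) @ X_mat
cov_hc0 = XtX_inv @ meat @ XtX_inv                        # HC0 covariance
cov_hc1 = cov_hc0 * len(e) / (len(e) - X_mat.shape[1])    # HC1 small-sample scaling
print("Manual HC1 SEs:   ", np.sqrt(np.diag(cov_hc1)))
print("statsmodels HC1:  ", sm.OLS(y, X).fit(cov_type='HC1').bse.values)
The two sets of standard errors should agree, which is a useful sanity check that the robust option is doing exactly this calculation.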
Method 2: Weighted Least Squares (WLS)
# If variance structure is known: Var(ε) ∝ X²
# Use WLS
weights = 1 / df['education']**2
model_wls = sm.WLS(y, X, weights=weights).fit()
print("WLS regression results:")
print(model_wls.summary())
Method 3: Variable Transformation
# Take log of dependent variable (common for wage data)
# Can stabilize variance
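As a hedged illustration of this point using the simulated data above, the sketch below constructs a hypothetical level `wage` column from `log_wage` (my own construction, not part of the original data) and compares Breusch-Pagan p-values for the level and log specifications; the multiplicative errors in the level model typically show much stronger evidence of heteroskedasticity.
# Illustration only: 'wage' is reconstructed from the simulated log_wage
df['wage'] = np.exp(df['log_wage'])
model_level = sm.OLS(df['wage'], X).fit()       # level specification
bp_level = het_breuschpagan(model_level.resid, model_level.model.exog)
bp_log = het_breuschpagan(model.resid, model.model.exog)   # log specification
print(f"BP p-value, level wage: {bp_level[1]:.4f}")
print(f"BP p-value, log wage:   {bp_log[1]:.4f}")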
Autocorrelation
Definition
No Autocorrelation Assumption: $Corr(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$
Autocorrelation: Correlation exists among error terms (common in time series data)
Durbin-Watson Test
from statsmodels.stats.stattools import durbin_watson
# DW statistic
dw_stat = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw_stat:.3f}")
# Decision criteria
if dw_stat < 1.5:
print("Possible positive autocorrelation")
elif dw_stat > 2.5:
print("Possible negative autocorrelation")
else:
print("No obvious autocorrelation")DW Statistic Interpretation:
- DW ≈ 2: No autocorrelation
- DW < 2: Positive autocorrelation
- DW > 2: Negative autocorrelation
- Range: [0, 4]
Treatment Methods
# Method 1: Newey-West standard errors (HAC)
model_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
print("Newey-West standard errors:")
print(model_hac.bse)
# Method 2: Add lagged dependent variable (time series only)
# y_t = β₀ + β₁ x_t + ρ y_{t-1} + ε_t
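A hedged sketch of Method 2 on synthetic time-series data follows (the wage data above is cross-sectional, so a lagged dependent variable is not meaningful there; all variable names are illustrative only).
# Simulate a simple dynamic process and fit y_t on x_t and its own lag
rng = np.random.default_rng(0)
T = 200
x_t = rng.normal(size=T)
y_t = np.zeros(T)
for t in range(1, T):
    y_t[t] = 0.5 + 0.8 * x_t[t] + 0.6 * y_t[t - 1] + rng.normal(scale=0.5)
ts = pd.DataFrame({'y': y_t, 'x': x_t, 'y_lag': pd.Series(y_t).shift(1)}).dropna()
model_lag = sm.OLS(ts['y'], sm.add_constant(ts[['x', 'y_lag']])).fit()
print(model_lag.params)
# Note: the Durbin-Watson statistic is biased toward 2 when a lagged dependent
# variable is included; a Breusch-Godfrey test is usually preferred in that case.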
Influence Analysis
Key Metrics
1. Leverage
Meaning: How far observation $i$'s regressor values lie from the sample means of the regressors
Decision Criterion: $h_i > 2(k+1)/n$ indicates high leverage
# Calculate leverage threshold
k = X.shape[1] - 1 # Number of independent variables
n = len(df)
leverage_threshold = 2 * (k + 1) / n
print(f"Leverage threshold: {leverage_threshold:.4f}")
print(f"Number of high leverage points: {(df['leverage'] > leverage_threshold).sum()}")
# Visualize
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['leverage'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=leverage_threshold, color='r', linestyle='--', linewidth=2, label=f'Threshold = {leverage_threshold:.3f}')
plt.xlabel('Observation Number')
plt.ylabel('Leverage')
plt.title('Leverage Diagnostics')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
2. Cook's Distance
Meaning: How much all fitted values change when observation $i$ is deleted
Decision Criterion: $D_i > 4/n$ indicates a high influence point
# Cook's D threshold
cooks_threshold = 4 / n
print(f"Cook's D threshold: {cooks_threshold:.4f}")
high_influence = df['cooks_d'] > cooks_threshold
print(f"Number of high influence points: {high_influence.sum()}")
if high_influence.any():
print("\nHigh influence points:")
print(df.loc[high_influence, ['education', 'experience', 'log_wage', 'cooks_d']])
# Visualize
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['cooks_d'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=cooks_threshold, color='r', linestyle='--', linewidth=2, label=f'Threshold = {cooks_threshold:.3f}')
plt.xlabel('Observation Number')
plt.ylabel("Cook's Distance")
plt.title('Influence Diagnostics')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
3. DFBETAS
Meaning: Standardized change in each coefficient when observation $i$ is deleted
# DFBETAS
dfbetas = influence.dfbetas
dfbetas_df = pd.DataFrame(dfbetas, columns=X.columns)
print("DFBETAS descriptive statistics:")
print(dfbetas_df.describe())
# Visualize (for education coefficient)
plt.figure(figsize=(10, 6))
plt.stem(df.index, dfbetas_df['education'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=2/np.sqrt(n), color='r', linestyle='--', linewidth=2, label='Threshold')
plt.axhline(y=-2/np.sqrt(n), color='r', linestyle='--', linewidth=2)
plt.xlabel('Observation Number')
plt.ylabel('DFBETAS (education)')
plt.title('Influence Analysis of Education Coefficient')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Handling Influential Points
# Method 1: Re-estimate after removing high influence points
df_no_outliers = df[~high_influence].copy()
X_no = sm.add_constant(df_no_outliers[['education', 'experience', 'experience_sq']])
y_no = df_no_outliers['log_wage']
model_no_outliers = sm.OLS(y_no, X_no).fit()
print("Original model (including outliers):")
print(model.params)
print("\nAfter removing outliers:")
print(model_no_outliers.params)
# Method 2: Robust Regression
from statsmodels.robust.robust_linear_model import RLM
model_rlm = RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("\nRobust regression:")
print(model_rlm.params)
Normality Tests
1. Shapiro-Wilk Test
from scipy.stats import shapiro
# Shapiro-Wilk test
stat, p_value = shapiro(model.resid)
print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Conclusion: Reject normality hypothesis")
else:
print("Conclusion: Cannot reject normality hypothesis")2. Jarque-Bera Test
from statsmodels.stats.stattools import jarque_bera
# Jarque-Bera test
jb_stat, jb_pvalue, jb_skew, jb_kurtosis = jarque_bera(model.resid)
print(f"\nJarque-Bera statistic: {jb_stat:.3f}")
print(f"p-value: {jb_pvalue:.4f}")
print(f"Skewness: {jb_skew:.3f}")
print(f"Kurtosis: {jb_kurtosis:.3f}")3. Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Histogram
axes[0].hist(model.resid, bins=30, density=True, alpha=0.7, edgecolor='black')
# Overlay normal distribution
mu, sigma = model.resid.mean(), model.resid.std()
x = np.linspace(model.resid.min(), model.resid.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Theoretical normal distribution')
axes[0].set_xlabel('Residuals')
axes[0].set_ylabel('Density')
axes[0].set_title('Residual Distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Q-Q plot
stats.probplot(model.resid, dist="norm", plot=axes[1])
axes[1].set_title('Normal Q-Q Plot')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Note:
- Normality assumption is not required (Gauss-Markov theorem doesn't need it)
- With large samples, CLT ensures asymptotic normality
- With small samples, non-normality affects $t$ and $F$ tests and confidence intervals
Complete Diagnostic Workflow
Diagnostic Checklist
def comprehensive_diagnostics(model, X, y):
"""
Complete regression diagnostics
"""
print("="*70)
print("Regression Diagnostics Report")
print("="*70)
# 1. Basic information
print(f"\n1. Sample size: {len(y)}")
print(f" Number of independent variables: {X.shape[1] - 1}")
print(f" R²: {model.rsquared:.4f}")
print(f" Adjusted R²: {model.rsquared_adj:.4f}")
# 2. Heteroskedasticity test
print("\n2. Heteroskedasticity Test:")
bp_test = het_breuschpagan(model.resid, model.model.exog)
print(f" Breusch-Pagan: LM = {bp_test[0]:.3f}, p = {bp_test[1]:.4f}")
white_test = het_white(model.resid, model.model.exog)
print(f" White: LM = {white_test[0]:.3f}, p = {white_test[1]:.4f}")
if bp_test[1] < 0.05 or white_test[1] < 0.05:
print(" Recommendation: Use robust standard errors (HC3)")
else:
print(" No obvious heteroskedasticity")
# 3. Autocorrelation test
print("\n3. Autocorrelation Test:")
dw_stat = durbin_watson(model.resid)
print(f" Durbin-Watson: {dw_stat:.3f}")
if 1.5 < dw_stat < 2.5:
print(" No obvious autocorrelation")
else:
print(" Possible autocorrelation")
# 4. Normality test
print("\n4. Normality Test:")
jb_stat, jb_pvalue, _, _ = jarque_bera(model.resid)
print(f" Jarque-Bera: JB = {jb_stat:.3f}, p = {jb_pvalue:.4f}")
if jb_pvalue < 0.05:
print(" Residuals may not follow normal distribution (usually doesn't affect inference with large samples)")
else:
print(" Residuals approximately normally distributed")
# 5. Influence analysis
print("\n5. Influence Analysis:")
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
high_influence = cooks_d > 4/len(y)
print(f" Number of high influence points: {high_influence.sum()} ({high_influence.sum()/len(y)*100:.1f}%)")
if high_influence.sum() > 0:
print(" Recommendation: Check high influence points, consider robust regression")
else:
print(" No obvious high influence points")
# 6. VIF
print("\n6. Multicollinearity (VIF):")
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data = vif_data[vif_data['Variable'] != 'const']
print(vif_data.to_string(index=False))
if (vif_data['VIF'] > 10).any():
print(" Some variables have VIF > 10, severe multicollinearity present")
else:
print(" VIF within acceptable range")
print("\n" + "="*70)
print("Diagnostics complete")
print("="*70)
# Run diagnostics
comprehensive_diagnostics(model, X, y)
Section Summary
Diagnostic Workflow
Data preparation
↓
OLS regression
↓
Test assumptions ——→ Heteroskedasticity? ——→ Yes → Use robust SE
↓ ↓ No
↓ ↓
↓ Autocorrelation? ——→ Yes → Use HAC SE
↓ ↓ No
↓ ↓
↓ Non-normality? ——→ Yes → Large sample OK, small sample cautious
↓ ↓ No
↓ ↓
↓ Influential points? ——→ Yes → Robust regression or remove
↓ ↓ No
↓ ↓
↓ Pass all tests
↓
Final model
Key Points
| Problem | Test | Consequence | Solution |
|---|---|---|---|
| Heteroskedasticity | BP, White | Biased SE | HC3 robust SE |
| Autocorrelation | DW | Biased SE | HAC SE |
| Non-normality | JB, Shapiro | Small sample inference failure | Okay with large samples |
| Influential Points | Cook's D | Unstable coefficients | Robust regression |
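As a closing sketch (my own wrap-up, reusing `X` and `y` from above and only statsmodels options shown earlier), the final specification can simply be re-estimated with whichever covariance estimator the diagnostics point to; the coefficients are identical to plain OLS, only the standard errors change.
# HC3 if only heteroskedasticity is a concern; HAC (Newey-West) if autocorrelation is too
final_hc3 = sm.OLS(y, X).fit(cov_type='HC3')
final_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
print(pd.DataFrame({'Coefficient': final_hc3.params,
                    'SE (HC3)': final_hc3.bse,
                    'SE (HAC)': final_hac.bse}))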
Next Section Preview
In the next section, we will learn:
- Handling categorical variables (Dummy Variables)
- Dummy variable trap
- Interaction effects
- Modeling nonlinear relationships
Expanding the expressive power of regression models!
Further Reading
- Wooldridge (2020): Chapter 8 "Heteroskedasticity"
- White, H. (1980). "A Heteroskedasticity-Consistent Covariance Matrix Estimator"
- Cook, R. D. (1977). "Detection of Influential Observation in Linear Regression"
- Belsley, Kuh, & Welsch (1980). Regression Diagnostics
Continue exploring categorical variables and interaction effects!