5.4 Regression Diagnostics
"All models are wrong, some are useful, and a few are insightful."— George Box, Statistician
Testing Assumptions, Diagnosing Problems, Ensuring Valid Inference
Section Objectives
After completing this section, you will be able to:
- Understand the meaning of Gauss-Markov assumptions
- Test for heteroskedasticity
- Test for autocorrelation
- Perform residual analysis
- Identify influential points and outliers
- Use robust standard errors
Gauss-Markov Assumptions (Classical Linear Model Assumptions)
Complete List of Assumptions
| Assumption | Mathematical Expression | Meaning | Consequence of Violation |
|---|---|---|---|
| MLR.1 Linearity | $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$ | Population model is linear in parameters | Biased coefficients |
| MLR.2 Random Sampling | $\{(x_{i1}, \dots, x_{ik}, y_i)\}_{i=1}^{n}$ are i.i.d. | Sample is independently and identically distributed | Invalid inference |
| MLR.3 No Perfect Collinearity | $X$ has full column rank | No perfect linear relationship among independent variables | Coefficients cannot be estimated |
| MLR.4 Zero Conditional Mean | $E[\varepsilon \mid X] = 0$ | Error term uncorrelated with $X$ (exogeneity) | Biased and inconsistent coefficients |
| MLR.5 Homoskedasticity | $Var(\varepsilon \mid X) = \sigma^2$ | Error variance is constant | Invalid standard errors and inference |
| MLR.6 Normality | $\varepsilon \sim N(0, \sigma^2)$ | Errors follow a normal distribution | Affects small-sample inference |
Gauss-Markov Theorem
Theorem: Under assumptions MLR.1-MLR.5, the OLS estimators are BLUE (Best Linear Unbiased Estimators).
Practical Implications:
- Assumptions MLR.1-MLR.4 ensure unbiasedness: $E[\hat{\beta}] = \beta$ (a one-line sketch follows below)
- Assumption MLR.5 ensures that the usual OLS standard error formulas are valid and, together with MLR.1-MLR.4, that OLS is efficient among linear unbiased estimators
- Assumption MLR.6 is not required for the theorem (but is useful for exact small-sample inference)
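As a quick sketch of why MLR.1-MLR.4 deliver unbiasedness (standard textbook algebra, stated in matrix form):

$$
\hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon,
\qquad
E[\hat{\beta} \mid X] = \beta + (X'X)^{-1}X'\,E[\varepsilon \mid X] = \beta .
$$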
Diagnostic Toolbox
1. Residual Analysis
Types of Residuals:
| Type | Formula | Use |
|---|---|---|
| Raw Residuals | $\hat{\varepsilon}_i = y_i - \hat{y}_i$ | Basic diagnosis |
| Standardized Residuals | $\hat{\varepsilon}_i \,/\, (\hat{\sigma}\sqrt{1-h_i})$ | Identify large residuals |
| Studentized Residuals | $\hat{\varepsilon}_i \,/\, (\hat{\sigma}_{(i)}\sqrt{1-h_i})$ | Identify outliers |
where $h_i$ is the leverage of observation $i$ and $\hat{\sigma}_{(i)}$ is the error standard deviation estimated with observation $i$ excluded.
Python Implementation
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Generate example data
np.random.seed(42)
n = 200
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = 1.5 + 0.08*education + 0.03*experience - 0.0005*experience**2 + np.random.normal(0, 0.3, n)
df = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'experience': experience,
'experience_sq': experience**2
})
# Regression
X = sm.add_constant(df[['education', 'experience', 'experience_sq']])
y = df['log_wage']
model = sm.OLS(y, X).fit()
# Get residuals and diagnostic statistics
influence = model.get_influence()
standardized_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
df['resid'] = model.resid
df['fitted'] = model.fittedvalues
df['std_resid'] = standardized_resid
df['leverage'] = leverage
df['cooks_d'] = cooks_d
print("First 10 observations diagnostic statistics:")
print(df[['resid', 'std_resid', 'leverage', 'cooks_d']].head(10))
2. Residual Visualization (Four-Panel Diagnostic Plot)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Plot 1: Residuals vs Fitted Values (test linearity and homoskedasticity)
axes[0, 0].scatter(df['fitted'], df['resid'], alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')
axes[0, 0].grid(True, alpha=0.3)
# Add LOWESS curve
from statsmodels.nonparametric.smoothers_lowess import lowess
lowess_result = lowess(df['resid'], df['fitted'], frac=0.3)
axes[0, 0].plot(lowess_result[:, 0], lowess_result[:, 1], 'b-', linewidth=2, label='LOWESS')
axes[0, 0].legend()
# Plot 2: Q-Q Plot (test normality)
stats.probplot(df['resid'], dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)
# Plot 3: Scale-Location Plot (test homoskedasticity)
axes[1, 0].scatter(df['fitted'], np.sqrt(np.abs(df['std_resid'])), alpha=0.5)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot')
axes[1, 0].grid(True, alpha=0.3)
# Add LOWESS curve
lowess_result2 = lowess(np.sqrt(np.abs(df['std_resid'])), df['fitted'], frac=0.3)
axes[1, 0].plot(lowess_result2[:, 0], lowess_result2[:, 1], 'r-', linewidth=2)
# Plot 4: Residuals vs Leverage (identify influential points)
axes[1, 1].scatter(df['leverage'], df['std_resid'], alpha=0.5)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
axes[1, 1].grid(True, alpha=0.3)
# Mark high influence points (Cook's D > 0.5)
high_influence = df['cooks_d'] > 0.5
if high_influence.any():
axes[1, 1].scatter(df.loc[high_influence, 'leverage'],
df.loc[high_influence, 'std_resid'],
color='red', s=100, label="High Influence Points")
axes[1, 1].legend()
plt.tight_layout()
plt.show()
Interpretation Guide:
| Plot | Good Pattern | Problem Pattern |
|---|---|---|
| Residuals vs Fitted | Random scatter, horizontal band | Funnel shape (heteroskedasticity), curve (nonlinearity) |
| Q-Q Plot | Points on diagonal line | Systematic deviation (non-normality) |
| Scale-Location | Horizontal line | Upward or downward trend (heteroskedasticity) |
| Residuals vs Leverage | No extreme points | Points in upper-right/lower-right (high influence) |
Heteroskedasticity
Definition
Homoskedasticity: $Var(\varepsilon_i \mid X) = \sigma^2$ (constant)
Heteroskedasticity: $Var(\varepsilon_i \mid X) = \sigma_i^2$ (varies across observations)
Consequences
Good news: OLS coefficient estimates remain unbiased. Bad news: the usual standard errors are biased, so $t$ statistics and $p$-values are unreliable.
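To see why the usual standard errors fail, compare the variance formulas (a standard result, assuming independent errors):

$$
Var(\hat{\beta} \mid X) = (X'X)^{-1} X'\Omega X (X'X)^{-1},
\qquad \Omega = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2),
$$

which reduces to the textbook $\sigma^2 (X'X)^{-1}$ only when every $\sigma_i^2$ equals the same $\sigma^2$. Robust standard errors estimate this sandwich form directly.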
Testing Methods
1. Breusch-Pagan Test
Null Hypothesis: $H_0: Var(\varepsilon \mid X) = \sigma^2$ (homoskedasticity)
from statsmodels.stats.diagnostic import het_breuschpagan
# BP test
bp_test = het_breuschpagan(model.resid, model.model.exog)
bp_stat, bp_pvalue, bp_f, bp_f_pvalue = bp_test
print("Breusch-Pagan Test:")
print(f" LM statistic = {bp_stat:.3f}")
print(f" p-value = {bp_pvalue:.4f}")
if bp_pvalue < 0.05:
print(" Conclusion: Reject homoskedasticity hypothesis, heteroskedasticity present")
else:
print(" Conclusion: Cannot reject homoskedasticity hypothesis")2. White Test
Advantage: Does not assume specific form of heteroskedasticity
from statsmodels.stats.diagnostic import het_white
# White test
white_test = het_white(model.resid, model.model.exog)
white_stat, white_pvalue, white_f, white_f_pvalue = white_test
print("\nWhite Test:")
print(f" LM statistic = {white_stat:.3f}")
print(f" p-value = {white_pvalue:.4f}")3. Graphical Diagnosis
# Plot squared residuals vs fitted values
plt.figure(figsize=(10, 6))
plt.scatter(df['fitted'], df['resid']**2, alpha=0.5)
plt.xlabel('Fitted Values')
plt.ylabel('Squared Residuals')
plt.title('Squared Residuals vs Fitted Values (Heteroskedasticity Diagnosis)')
plt.grid(True, alpha=0.3)
# Add trend line
z = np.polyfit(df['fitted'], df['resid']**2, 2)
p = np.poly1d(z)
x_line = np.linspace(df['fitted'].min(), df['fitted'].max(), 100)
plt.plot(x_line, p(x_line), "r-", linewidth=2, label='Quadratic fit')
plt.legend()
plt.show()
Treatment Methods
Method 1: Robust Standard Errors (Heteroskedasticity-Robust Standard Errors)
# Use HC3 robust standard errors (recommended)
model_robust = sm.OLS(y, X).fit(cov_type='HC3')
print("Ordinary OLS:")
print(model.summary().tables[1])
print("\nRobust Standard Errors (HC3):")
print(model_robust.summary().tables[1])
# Compare standard errors
comparison = pd.DataFrame({
'Variable': model.params.index,
'Coefficient': model.params.values,
'SE (OLS)': model.bse.values,
'SE (Robust)': model_robust.bse.values,
'Ratio': model_robust.bse.values / model.bse.values
})
print("\nStandard Error Comparison:")
print(comparison)
Robust Standard Error Types:
| Type | Applicable Scenario | Python Parameter |
|---|---|---|
| HC0 | Basic version | cov_type='HC0' |
| HC1 | Small sample adjustment | cov_type='HC1' |
| HC2 | Leverage adjustment | cov_type='HC2' |
| HC3 | Recommended (most conservative) | cov_type='HC3' |
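For intuition about what these options compute, the sketch below is a minimal, hand-rolled version of the HC0/HC1 sandwich formula (my own illustration, not the statsmodels implementation). It reuses `X`, `y`, and `model` from the code above and checks the result against `cov_type='HC1'`.
# Manual sandwich estimator: Var(b) = (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
X_mat = X.values                      # design matrix including the constant
e = model.resid.values                # OLS residuals
XtX_inv = np.linalg.inv(X_mat.T @ X_mat)
meat = X_mat.T @ np.diag(e**2) @ X_mat
cov_hc0 = XtX_inv @ meat @ XtX_inv                        # HC0 covariance
cov_hc1 = cov_hc0 * len(e) / (len(e) - X_mat.shape[1])    # HC1 small-sample scaling
print("Manual HC1 SEs:   ", np.sqrt(np.diag(cov_hc1)))
print("statsmodels HC1:  ", sm.OLS(y, X).fit(cov_type='HC1').bse.values)
The two sets of standard errors should agree, which is a useful sanity check that the robust option is doing exactly this calculation.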
Method 2: Weighted Least Squares (WLS)
# If variance structure is known: Var(ε) ∝ X²
# Use WLS
weights = 1 / df['education']**2
model_wls = sm.WLS(y, X, weights=weights).fit()
print("WLS regression results:")
print(model_wls.summary())
Method 3: Variable Transformation
# Take log of dependent variable (common for wage data)
# Can stabilize variance
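As a hedged illustration of this point using the simulated data above, the sketch below constructs a hypothetical level `wage` column from `log_wage` (my own construction, not part of the original data) and compares Breusch-Pagan p-values for the level and log specifications; the multiplicative errors in the level model typically show much stronger evidence of heteroskedasticity.
# Illustration only: 'wage' is reconstructed from the simulated log_wage
df['wage'] = np.exp(df['log_wage'])
model_level = sm.OLS(df['wage'], X).fit()       # level specification
bp_level = het_breuschpagan(model_level.resid, model_level.model.exog)
bp_log = het_breuschpagan(model.resid, model.model.exog)   # log specification
print(f"BP p-value, level wage: {bp_level[1]:.4f}")
print(f"BP p-value, log wage:   {bp_log[1]:.4f}")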
Autocorrelation
Definition
No Autocorrelation Assumption: $Corr(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$
Autocorrelation: Correlation exists among error terms (common in time series data)
Durbin-Watson Test
from statsmodels.stats.stattools import durbin_watson
# DW statistic
dw_stat = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw_stat:.3f}")
# Decision criteria
if dw_stat < 1.5:
print("Possible positive autocorrelation")
elif dw_stat > 2.5:
print("Possible negative autocorrelation")
else:
print("No obvious autocorrelation")DW Statistic Interpretation:
- DW ≈ 2: No autocorrelation
- DW < 2: Positive autocorrelation
- DW > 2: Negative autocorrelation
- Range: [0, 4]
Treatment Methods
# Method 1: Newey-West standard errors (HAC)
model_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
print("Newey-West standard errors:")
print(model_hac.bse)
# Method 2: Add lagged dependent variable (time series only)
# y_t = β₀ + β₁ x_t + ρ y_{t-1} + ε_t
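A hedged sketch of Method 2 on synthetic time-series data follows (the wage data above is cross-sectional, so a lagged dependent variable is not meaningful there; all variable names are illustrative only).
# Simulate a simple dynamic process and fit y_t on x_t and its own lag
rng = np.random.default_rng(0)
T = 200
x_t = rng.normal(size=T)
y_t = np.zeros(T)
for t in range(1, T):
    y_t[t] = 0.5 + 0.8 * x_t[t] + 0.6 * y_t[t - 1] + rng.normal(scale=0.5)
ts = pd.DataFrame({'y': y_t, 'x': x_t, 'y_lag': pd.Series(y_t).shift(1)}).dropna()
model_lag = sm.OLS(ts['y'], sm.add_constant(ts[['x', 'y_lag']])).fit()
print(model_lag.params)
# Note: the Durbin-Watson statistic is biased toward 2 when a lagged dependent
# variable is included; a Breusch-Godfrey test is usually preferred in that case.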
Influence Analysis
Key Metrics
1. Leverage
Meaning: How far observation $i$'s regressor values lie from the sample means of the regressors
Decision Criterion: $h_i > 2(k+1)/n$ indicates high leverage
# Calculate leverage threshold
k = X.shape[1] - 1 # Number of independent variables
n = len(df)
leverage_threshold = 2 * (k + 1) / n
print(f"Leverage threshold: {leverage_threshold:.4f}")
print(f"Number of high leverage points: {(df['leverage'] > leverage_threshold).sum()}")
# Visualize
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['leverage'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=leverage_threshold, color='r', linestyle='--', linewidth=2, label=f'Threshold = {leverage_threshold:.3f}')
plt.xlabel('Observation Number')
plt.ylabel('Leverage')
plt.title('Leverage Diagnostics')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
2. Cook's Distance
Meaning: How much all fitted values change when observation $i$ is deleted
Decision Criterion: $D_i > 4/n$ indicates a high influence point
# Cook's D threshold
cooks_threshold = 4 / n
print(f"Cook's D threshold: {cooks_threshold:.4f}")
high_influence = df['cooks_d'] > cooks_threshold
print(f"Number of high influence points: {high_influence.sum()}")
if high_influence.any():
print("\nHigh influence points:")
print(df.loc[high_influence, ['education', 'experience', 'log_wage', 'cooks_d']])
# Visualize
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['cooks_d'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=cooks_threshold, color='r', linestyle='--', linewidth=2, label=f'Threshold = {cooks_threshold:.3f}')
plt.xlabel('Observation Number')
plt.ylabel("Cook's Distance")
plt.title('Influence Diagnostics')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
3. DFBETAS
Meaning: Standardized change in each coefficient when observation $i$ is deleted
# DFBETAS
dfbetas = influence.dfbetas
dfbetas_df = pd.DataFrame(dfbetas, columns=X.columns)
print("DFBETAS descriptive statistics:")
print(dfbetas_df.describe())
# Visualize (for education coefficient)
plt.figure(figsize=(10, 6))
plt.stem(df.index, dfbetas_df['education'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=2/np.sqrt(n), color='r', linestyle='--', linewidth=2, label='Threshold')
plt.axhline(y=-2/np.sqrt(n), color='r', linestyle='--', linewidth=2)
plt.xlabel('Observation Number')
plt.ylabel('DFBETAS (education)')
plt.title('Influence Analysis of Education Coefficient')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Handling Influential Points
# Method 1: Re-estimate after removing high influence points
df_no_outliers = df[~high_influence].copy()
X_no = sm.add_constant(df_no_outliers[['education', 'experience', 'experience_sq']])
y_no = df_no_outliers['log_wage']
model_no_outliers = sm.OLS(y_no, X_no).fit()
print("Original model (including outliers):")
print(model.params)
print("\nAfter removing outliers:")
print(model_no_outliers.params)
# Method 2: Robust Regression
from statsmodels.robust.robust_linear_model import RLM
model_rlm = RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("\nRobust regression:")
print(model_rlm.params)
Normality Tests
1. Shapiro-Wilk Test
from scipy.stats import shapiro
# Shapiro-Wilk test
stat, p_value = shapiro(model.resid)
print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Conclusion: Reject normality hypothesis")
else:
print("Conclusion: Cannot reject normality hypothesis")2. Jarque-Bera Test
from statsmodels.stats.stattools import jarque_bera
# Jarque-Bera test
jb_stat, jb_pvalue, jb_skew, jb_kurtosis = jarque_bera(model.resid)
print(f"\nJarque-Bera statistic: {jb_stat:.3f}")
print(f"p-value: {jb_pvalue:.4f}")
print(f"Skewness: {jb_skew:.3f}")
print(f"Kurtosis: {jb_kurtosis:.3f}")3. Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Histogram
axes[0].hist(model.resid, bins=30, density=True, alpha=0.7, edgecolor='black')
# Overlay normal distribution
mu, sigma = model.resid.mean(), model.resid.std()
x = np.linspace(model.resid.min(), model.resid.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Theoretical normal distribution')
axes[0].set_xlabel('Residuals')
axes[0].set_ylabel('Density')
axes[0].set_title('Residual Distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Q-Q plot
stats.probplot(model.resid, dist="norm", plot=axes[1])
axes[1].set_title('Normal Q-Q Plot')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Note:
- Normality assumption is not required (Gauss-Markov theorem doesn't need it)
- With large samples, CLT ensures asymptotic normality
- With small samples, non-normality affects $t$ and $F$ tests and confidence intervals
Complete Diagnostic Workflow
Diagnostic Checklist
def comprehensive_diagnostics(model, X, y):
"""
Complete regression diagnostics
"""
print("="*70)
print("Regression Diagnostics Report")
print("="*70)
# 1. Basic information
print(f"\n1. Sample size: {len(y)}")
print(f" Number of independent variables: {X.shape[1] - 1}")
print(f" R²: {model.rsquared:.4f}")
print(f" Adjusted R²: {model.rsquared_adj:.4f}")
# 2. Heteroskedasticity test
print("\n2. Heteroskedasticity Test:")
bp_test = het_breuschpagan(model.resid, model.model.exog)
print(f" Breusch-Pagan: LM = {bp_test[0]:.3f}, p = {bp_test[1]:.4f}")
white_test = het_white(model.resid, model.model.exog)
print(f" White: LM = {white_test[0]:.3f}, p = {white_test[1]:.4f}")
if bp_test[1] < 0.05 or white_test[1] < 0.05:
print(" Recommendation: Use robust standard errors (HC3)")
else:
print(" No obvious heteroskedasticity")
# 3. Autocorrelation test
print("\n3. Autocorrelation Test:")
dw_stat = durbin_watson(model.resid)
print(f" Durbin-Watson: {dw_stat:.3f}")
if 1.5 < dw_stat < 2.5:
print(" No obvious autocorrelation")
else:
print(" Possible autocorrelation")
# 4. Normality test
print("\n4. Normality Test:")
jb_stat, jb_pvalue, _, _ = jarque_bera(model.resid)
print(f" Jarque-Bera: JB = {jb_stat:.3f}, p = {jb_pvalue:.4f}")
if jb_pvalue < 0.05:
print(" Residuals may not follow normal distribution (usually doesn't affect inference with large samples)")
else:
print(" Residuals approximately normally distributed")
# 5. Influence analysis
print("\n5. Influence Analysis:")
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
high_influence = cooks_d > 4/len(y)
print(f" Number of high influence points: {high_influence.sum()} ({high_influence.sum()/len(y)*100:.1f}%)")
if high_influence.sum() > 0:
print(" Recommendation: Check high influence points, consider robust regression")
else:
print(" No obvious high influence points")
# 6. VIF
print("\n6. Multicollinearity (VIF):")
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data = vif_data[vif_data['Variable'] != 'const']
print(vif_data.to_string(index=False))
if (vif_data['VIF'] > 10).any():
print(" Some variables have VIF > 10, severe multicollinearity present")
else:
print(" VIF within acceptable range")
print("\n" + "="*70)
print("Diagnostics complete")
print("="*70)
# Run diagnostics
comprehensive_diagnostics(model, X, y)
Section Summary
Diagnostic Workflow
Data preparation
↓
OLS regression
↓
Test assumptions ——→ Heteroskedasticity? ——→ Yes → Use robust SE
↓ ↓ No
↓ ↓
↓ Autocorrelation? ——→ Yes → Use HAC SE
↓ ↓ No
↓ ↓
↓ Non-normality? ——→ Yes → Large sample OK, small sample cautious
↓ ↓ No
↓ ↓
↓ Influential points? ——→ Yes → Robust regression or remove
↓ ↓ No
↓ ↓
↓ Pass all tests
↓
Final model
Key Points
| Problem | Test | Consequence | Solution |
|---|---|---|---|
| Heteroskedasticity | BP, White | Biased SE | HC3 robust SE |
| Autocorrelation | DW | Biased SE | HAC SE |
| Non-normality | JB, Shapiro | Small sample inference failure | Okay with large samples |
| Influential Points | Cook's D | Unstable coefficients | Robust regression |
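As a closing sketch (my own wrap-up, reusing `X` and `y` from above and only statsmodels options shown earlier), the final specification can simply be re-estimated with whichever covariance estimator the diagnostics point to; the coefficients are identical to plain OLS, only the standard errors change.
# HC3 if only heteroskedasticity is a concern; HAC (Newey-West) if autocorrelation is too
final_hc3 = sm.OLS(y, X).fit(cov_type='HC3')
final_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
print(pd.DataFrame({'Coefficient': final_hc3.params,
                    'SE (HC3)': final_hc3.bse,
                    'SE (HAC)': final_hac.bse}))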
Next Section Preview
In the next section, we will learn:
- Handling categorical variables (Dummy Variables)
- Dummy variable trap
- Interaction effects
- Modeling nonlinear relationships
Expanding the expressive power of regression models!
Further Reading
- Wooldridge (2020): Chapter 8 "Heteroskedasticity"
- White, H. (1980). "A Heteroskedasticity-Consistent Covariance Matrix Estimator"
- Cook, R. D. (1977). "Detection of Influential Observation in Linear Regression"
- Belsley, Kuh, & Welsch (1980). Regression Diagnostics
Continue exploring categorical variables and interaction effects!