
5.4 Regression Diagnostics

"All models are wrong, some are useful, and a few are insightful."— George Box, Statistician

Testing Assumptions, Diagnosing Problems, Ensuring Valid Inference



Section Objectives

After completing this section, you will be able to:

  • Understand the meaning of Gauss-Markov assumptions
  • Test for heteroskedasticity
  • Test for autocorrelation
  • Perform residual analysis
  • Identify influential points and outliers
  • Use robust standard errors

Gauss-Markov Assumptions (Classical Linear Model Assumptions)

Complete List of Assumptions

| Assumption | Mathematical Expression | Meaning | Consequence of Violation |
|---|---|---|---|
| MLR.1 Linearity | $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$ | Population model is linear in parameters | Biased coefficients |
| MLR.2 Random Sampling | $(x_{i1}, \dots, x_{ik}, y_i)$ are i.i.d. | Sample is independently and identically distributed | Invalid inference |
| MLR.3 No Perfect Collinearity | $X$ has full column rank | No perfect linear relationship among the independent variables | Cannot estimate |
| MLR.4 Zero Conditional Mean | $E[\varepsilon \mid X] = 0$ | Error term uncorrelated with the regressors (exogeneity) | Biased and inconsistent estimates |
| MLR.5 Homoskedasticity | $Var(\varepsilon \mid X) = \sigma^2$ | Error variance is constant | Invalid standard errors |
| MLR.6 Normality | $\varepsilon \sim N(0, \sigma^2)$ | Errors follow a normal distribution | Affects small-sample inference |

Gauss-Markov Theorem

Theorem: Under assumptions MLR.1-MLR.5, OLS estimators are BLUE (Best Linear Unbiased Estimator)

Practical Implications:

  • Assumptions MLR.1–MLR.4 ensure unbiasedness: $E[\hat{\beta}_j] = \beta_j$
  • Assumption MLR.5 additionally ensures that the usual standard errors are correct
  • Assumption MLR.6 is not required for BLUE (but is useful for exact small-sample inference)
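
In matrix form, the theorem says that OLS attains the smallest sampling variance among all linear unbiased estimators:

$$
\hat{\beta}_{OLS} = (X'X)^{-1}X'y, \qquad Var(\hat{\beta}_{OLS} \mid X) = \sigma^2 (X'X)^{-1},
$$

and for any other linear unbiased estimator $\tilde{\beta}$, the matrix $Var(\tilde{\beta} \mid X) - Var(\hat{\beta}_{OLS} \mid X)$ is positive semidefinite.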

Diagnostic Toolbox

1. Residual Analysis

Types of Residuals:

| Type | Formula | Use |
|---|---|---|
| Raw residuals | $\hat{u}_i = y_i - \hat{y}_i$ | Basic diagnosis |
| Standardized residuals | $r_i = \hat{u}_i / (\hat{\sigma}\sqrt{1 - h_i})$ | Identify large residuals |
| Studentized residuals | $t_i = \hat{u}_i / (\hat{\sigma}_{(i)}\sqrt{1 - h_i})$ | Identify outliers |

where $h_i$ is the leverage value of observation $i$ and $\hat{\sigma}_{(i)}$ is the error standard deviation estimated with observation $i$ removed.

Python Implementation

python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Generate example data
np.random.seed(42)
n = 200
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = 1.5 + 0.08*education + 0.03*experience - 0.0005*experience**2 + np.random.normal(0, 0.3, n)

df = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'experience': experience,
    'experience_sq': experience**2
})

# Regression
X = sm.add_constant(df[['education', 'experience', 'experience_sq']])
y = df['log_wage']
model = sm.OLS(y, X).fit()

# Get residuals and diagnostic statistics
influence = model.get_influence()
standardized_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]

df['resid'] = model.resid
df['fitted'] = model.fittedvalues
df['std_resid'] = standardized_resid
df['leverage'] = leverage
df['cooks_d'] = cooks_d

print("First 10 observations diagnostic statistics:")
print(df[['resid', 'std_resid', 'leverage', 'cooks_d']].head(10))
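
The studentized residuals from the table above (the externally studentized version, where observation $i$ is excluded from the variance estimate) are available from the same influence object. A minimal sketch continuing from the code above; the |t| > 2 cut-off is a common rule of thumb:

python
# Externally studentized residuals: each residual is scaled by an error
# variance estimated WITHOUT that observation
studentized_resid = influence.resid_studentized_external

# Rule of thumb: |studentized residual| > 2 flags a potential outlier
outlier_flag = np.abs(studentized_resid) > 2
print(f"Observations with |studentized residual| > 2: {outlier_flag.sum()}")
print(df.index[outlier_flag].tolist())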

2. Residual Visualization (Four-Panel Diagnostic Plot)

python
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Residuals vs Fitted Values (test linearity and homoskedasticity)
axes[0, 0].scatter(df['fitted'], df['resid'], alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')
axes[0, 0].grid(True, alpha=0.3)

# Add LOWESS curve
from statsmodels.nonparametric.smoothers_lowess import lowess
lowess_result = lowess(df['resid'], df['fitted'], frac=0.3)
axes[0, 0].plot(lowess_result[:, 0], lowess_result[:, 1], 'b-', linewidth=2, label='LOWESS')
axes[0, 0].legend()

# Plot 2: Q-Q Plot (test normality)
stats.probplot(df['resid'], dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Scale-Location Plot (test homoskedasticity)
axes[1, 0].scatter(df['fitted'], np.sqrt(np.abs(df['std_resid'])), alpha=0.5)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot')
axes[1, 0].grid(True, alpha=0.3)

# Add LOWESS curve
lowess_result2 = lowess(np.sqrt(np.abs(df['std_resid'])), df['fitted'], frac=0.3)
axes[1, 0].plot(lowess_result2[:, 0], lowess_result2[:, 1], 'r-', linewidth=2)

# Plot 4: Residuals vs Leverage (identify influential points)
axes[1, 1].scatter(df['leverage'], df['std_resid'], alpha=0.5)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
axes[1, 1].grid(True, alpha=0.3)

# Mark high influence points (Cook's D > 0.5)
high_influence = df['cooks_d'] > 0.5
if high_influence.any():
    axes[1, 1].scatter(df.loc[high_influence, 'leverage'],
                      df.loc[high_influence, 'std_resid'],
                      color='red', s=100, label="High Influence Points")
    axes[1, 1].legend()

plt.tight_layout()
plt.show()

Interpretation Guide:

| Plot | Good Pattern | Problem Pattern |
|---|---|---|
| Residuals vs Fitted | Random scatter in a horizontal band | Funnel shape (heteroskedasticity), curvature (nonlinearity) |
| Q-Q Plot | Points on the diagonal line | Systematic deviation (non-normality) |
| Scale-Location | Roughly horizontal line | Upward or downward trend (heteroskedasticity) |
| Residuals vs Leverage | No extreme points | Points in the upper-right/lower-right corners (high influence) |

Heteroskedasticity

Definition

Homoskedasticity: $Var(\varepsilon_i \mid X) = \sigma^2$ (constant across observations)

Heteroskedasticity: $Var(\varepsilon_i \mid X) = \sigma_i^2$ (varies with $x_i$)

Consequences

Good News: OLS estimators are still unbiased.

Bad News: Standard errors are biased, so $t$ statistics and $p$-values are unreliable.

Testing Methods

1. Breusch-Pagan Test

Null Hypothesis: $H_0: Var(\varepsilon \mid X) = \sigma^2$ (homoskedasticity)

python
from statsmodels.stats.diagnostic import het_breuschpagan

# BP test
bp_test = het_breuschpagan(model.resid, model.model.exog)
bp_stat, bp_pvalue, bp_f, bp_f_pvalue = bp_test

print("Breusch-Pagan Test:")
print(f"  LM statistic = {bp_stat:.3f}")
print(f"  p-value = {bp_pvalue:.4f}")

if bp_pvalue < 0.05:
    print("  Conclusion: Reject homoskedasticity hypothesis, heteroskedasticity present")
else:
    print("  Conclusion: Cannot reject homoskedasticity hypothesis")

2. White Test

Advantage: Does not assume a specific functional form for the heteroskedasticity

python
from statsmodels.stats.diagnostic import het_white

# White test
white_test = het_white(model.resid, model.model.exog)
white_stat, white_pvalue, white_f, white_f_pvalue = white_test

print("\nWhite Test:")
print(f"  LM statistic = {white_stat:.3f}")
print(f"  p-value = {white_pvalue:.4f}")

3. Graphical Diagnosis

python
# Plot squared residuals vs fitted values
plt.figure(figsize=(10, 6))
plt.scatter(df['fitted'], df['resid']**2, alpha=0.5)
plt.xlabel('Fitted Values')
plt.ylabel('Squared Residuals')
plt.title('Squared Residuals vs Fitted Values (Heteroskedasticity Diagnosis)')
plt.grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(df['fitted'], df['resid']**2, 2)
p = np.poly1d(z)
x_line = np.linspace(df['fitted'].min(), df['fitted'].max(), 100)
plt.plot(x_line, p(x_line), "r-", linewidth=2, label='Quadratic fit')
plt.legend()
plt.show()

Treatment Methods

Method 1: Heteroskedasticity-Robust Standard Errors

python
# Use HC3 robust standard errors (recommended)
model_robust = sm.OLS(y, X).fit(cov_type='HC3')

print("Ordinary OLS:")
print(model.summary().tables[1])

print("\nRobust Standard Errors (HC3):")
print(model_robust.summary().tables[1])

# Compare standard errors
comparison = pd.DataFrame({
    'Variable': model.params.index,
    'Coefficient': model.params.values,
    'SE (OLS)': model.bse.values,
    'SE (Robust)': model_robust.bse.values,
    'Ratio': model_robust.bse.values / model.bse.values
})
print("\nStandard Error Comparison:")
print(comparison)

Robust Standard Error Types:

| Type | Applicable Scenario | Python Parameter |
|---|---|---|
| HC0 | Basic version | `cov_type='HC0'` |
| HC1 | Small-sample adjustment | `cov_type='HC1'` |
| HC2 | Leverage adjustment | `cov_type='HC2'` |
| HC3 | Recommended (most conservative) | `cov_type='HC3'` |
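
All four variants are requested the same way via `cov_type`. A small sketch, reusing `X` and `y` from above, that compares the standard error of the education coefficient across HC0–HC3:

python
# Compare the education coefficient's standard error under each HC variant
for cov in ['HC0', 'HC1', 'HC2', 'HC3']:
    fit = sm.OLS(y, X).fit(cov_type=cov)
    print(f"{cov}: SE(education) = {fit.bse['education']:.4f}")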

Method 2: Weighted Least Squares (WLS)

python
# If the variance structure is known, e.g. Var(ε_i) ∝ education_i²,
# weight each observation by the inverse of its variance (use WLS)
weights = 1 / df['education']**2
model_wls = sm.WLS(y, X, weights=weights).fit()

print("WLS regression results:")
print(model_wls.summary())

Method 3: Variable Transformation

python
# Take log of dependent variable (common for wage data)
# Can stabilize variance
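
A minimal sketch of the idea. Since the worked example above already models `log_wage`, a level wage is backed out purely for illustration; the BP p-values typically show the log specification with more stable variance:

python
# Illustration only: back out a level wage from the simulated log_wage,
# then compare heteroskedasticity in the level vs. log specifications
wage_level = np.exp(df['log_wage'])

model_level = sm.OLS(wage_level, X).fit()          # levels: often heteroskedastic
model_log = sm.OLS(np.log(wage_level), X).fit()    # logs: variance usually more stable

print("BP p-value, levels:", het_breuschpagan(model_level.resid, X.values)[1])
print("BP p-value, logs:  ", het_breuschpagan(model_log.resid, X.values)[1])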

Autocorrelation

Definition

No Autocorrelation Assumption: $Cov(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$

Autocorrelation: Correlation exists among error terms (common in time series data)

Durbin-Watson Test

python
from statsmodels.stats.stattools import durbin_watson

# DW statistic
dw_stat = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw_stat:.3f}")

# Decision criteria
if dw_stat < 1.5:
    print("Possible positive autocorrelation")
elif dw_stat > 2.5:
    print("Possible negative autocorrelation")
else:
    print("No obvious autocorrelation")

DW Statistic Interpretation:

  • DW ≈ 2: No autocorrelation
  • DW < 2: Positive autocorrelation
  • DW > 2: Negative autocorrelation
  • Range: [0, 4]
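
These cut-offs follow from the definition of the statistic, which is approximately $2(1-\hat{\rho})$, where $\hat{\rho}$ is the first-order autocorrelation of the residuals:

$$
DW = \frac{\sum_{t=2}^{T}(\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=1}^{T}\hat{u}_t^2} \approx 2(1 - \hat{\rho})
$$

so $\hat{\rho} \approx 0$ gives $DW \approx 2$, $\hat{\rho} > 0$ pushes $DW$ below 2, and $\hat{\rho} < 0$ pushes it above 2.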

Treatment Methods

python
# Method 1: Newey-West standard errors (HAC)
model_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
print("Newey-West standard errors:")
print(model_hac.bse)

# Method 2: Add lagged dependent variable (time series only)
# y_t = β₀ + β₁ x_t + ρ y_{t-1} + ε_t
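
A minimal sketch of Method 2 on hypothetical simulated time-series data (it does not apply to the cross-sectional wage example used elsewhere in this section):

python
# Hypothetical AR(1)-style data to illustrate adding a lagged dependent variable
rng = np.random.default_rng(0)
T = 100
x_t = rng.normal(size=T)
y_t = np.zeros(T)
for t in range(1, T):
    y_t[t] = 0.5 + 0.8 * y_t[t - 1] + 0.3 * x_t[t] + rng.normal(scale=0.5)

ts = pd.DataFrame({'y': y_t, 'x': x_t})
ts['y_lag1'] = ts['y'].shift(1)   # lagged dependent variable
ts = ts.dropna()

model_dyn = sm.OLS(ts['y'], sm.add_constant(ts[['x', 'y_lag1']])).fit()
print(model_dyn.params)
print(f"DW after adding the lag: {durbin_watson(model_dyn.resid):.3f}")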

Influence Analysis

Key Metrics

1. Leverage

Meaning: Distance of observation $i$'s regressor values from the sample mean of the regressors

Decision Criterion: $h_i > 2(k+1)/n$ indicates high leverage
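
The leverage values are the diagonal elements of the hat matrix $H = X(X'X)^{-1}X'$:

$$
h_i = x_i'(X'X)^{-1}x_i, \qquad 0 \le h_i \le 1, \qquad \sum_{i=1}^{n} h_i = k + 1
$$

so the average leverage is $(k+1)/n$, which motivates the $2(k+1)/n$ rule of thumb.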

python
# Calculate leverage threshold
k = X.shape[1] - 1  # Number of independent variables
n = len(df)
leverage_threshold = 2 * (k + 1) / n

print(f"Leverage threshold: {leverage_threshold:.4f}")
print(f"Number of high leverage points: {(df['leverage'] > leverage_threshold).sum()}")

# Visualize
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['leverage'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=leverage_threshold, color='r', linestyle='--', linewidth=2, label=f'Threshold = {leverage_threshold:.3f}')
plt.xlabel('Observation Number')
plt.ylabel('Leverage')
plt.title('Leverage Diagnostics')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

2. Cook's Distance

Meaning: Overall change in the fitted values when observation $i$ is deleted

Decision Criterion: $D_i > 4/n$ indicates a high-influence point
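
Cook's distance combines the size of the standardized residual with the observation's leverage:

$$
D_i = \frac{r_i^2}{k+1} \cdot \frac{h_i}{1 - h_i}
$$

where $r_i$ is the standardized residual and $h_i$ the leverage of observation $i$.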

python
# Cook's D threshold
cooks_threshold = 4 / n

print(f"Cook's D threshold: {cooks_threshold:.4f}")
high_influence = df['cooks_d'] > cooks_threshold
print(f"Number of high influence points: {high_influence.sum()}")

if high_influence.any():
    print("\nHigh influence points:")
    print(df.loc[high_influence, ['education', 'experience', 'log_wage', 'cooks_d']])

# Visualize
plt.figure(figsize=(10, 6))
plt.stem(df.index, df['cooks_d'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=cooks_threshold, color='r', linestyle='--', linewidth=2, label=f'Threshold = {cooks_threshold:.3f}')
plt.xlabel('Observation Number')
plt.ylabel("Cook's Distance")
plt.title('Influence Diagnostics')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

3. DFBETAS

Meaning: Standardized change in each coefficient when observation $i$ is deleted
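
Formally, for coefficient $j$ and observation $i$:

$$
DFBETAS_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{se_{(i)}(\hat{\beta}_j)}
$$

where $\hat{\beta}_{j(i)}$ is the estimate with observation $i$ removed; values exceeding $2/\sqrt{n}$ in absolute value are conventionally flagged (the threshold used in the plot below).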

python
# DFBETAS
dfbetas = influence.dfbetas
dfbetas_df = pd.DataFrame(dfbetas, columns=X.columns)

print("DFBETAS descriptive statistics:")
print(dfbetas_df.describe())

# Visualize (for education coefficient)
plt.figure(figsize=(10, 6))
plt.stem(df.index, dfbetas_df['education'], linefmt='b-', markerfmt='bo', basefmt=' ')
plt.axhline(y=2/np.sqrt(n), color='r', linestyle='--', linewidth=2, label='Threshold')
plt.axhline(y=-2/np.sqrt(n), color='r', linestyle='--', linewidth=2)
plt.xlabel('Observation Number')
plt.ylabel('DFBETAS (education)')
plt.title('Influence Analysis of Education Coefficient')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Handling Influential Points

python
# Method 1: Re-estimate after removing high influence points
df_no_outliers = df[~high_influence].copy()
X_no = sm.add_constant(df_no_outliers[['education', 'experience', 'experience_sq']])
y_no = df_no_outliers['log_wage']
model_no_outliers = sm.OLS(y_no, X_no).fit()

print("Original model (including outliers):")
print(model.params)

print("\nAfter removing outliers:")
print(model_no_outliers.params)

# Method 2: Robust Regression
from statsmodels.robust.robust_linear_model import RLM

model_rlm = RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("\nRobust regression:")
print(model_rlm.params)

Normality Tests

1. Shapiro-Wilk Test

python
from scipy.stats import shapiro

# Shapiro-Wilk test
stat, p_value = shapiro(model.resid)
print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Conclusion: Reject normality hypothesis")
else:
    print("Conclusion: Cannot reject normality hypothesis")

2. Jarque-Bera Test

python
from statsmodels.stats.stattools import jarque_bera

# Jarque-Bera test
jb_stat, jb_pvalue, jb_skew, jb_kurtosis = jarque_bera(model.resid)
print(f"\nJarque-Bera statistic: {jb_stat:.3f}")
print(f"p-value: {jb_pvalue:.4f}")
print(f"Skewness: {jb_skew:.3f}")
print(f"Kurtosis: {jb_kurtosis:.3f}")

3. Visualization

python
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(model.resid, bins=30, density=True, alpha=0.7, edgecolor='black')
# Overlay normal distribution
mu, sigma = model.resid.mean(), model.resid.std()
x = np.linspace(model.resid.min(), model.resid.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Theoretical normal distribution')
axes[0].set_xlabel('Residuals')
axes[0].set_ylabel('Density')
axes[0].set_title('Residual Distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Q-Q plot
stats.probplot(model.resid, dist="norm", plot=axes[1])
axes[1].set_title('Normal Q-Q Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Note:

  • Normality assumption is not required (Gauss-Markov theorem doesn't need it)
  • With large samples, CLT ensures asymptotic normality
  • With small samples, non-normality affects tests and confidence intervals

Complete Diagnostic Workflow

Diagnostic Checklist

python
def comprehensive_diagnostics(model, X, y):
    """
    Complete regression diagnostics
    """
    print("="*70)
    print("Regression Diagnostics Report")
    print("="*70)

    # 1. Basic information
    print(f"\n1. Sample size: {len(y)}")
    print(f"   Number of independent variables: {X.shape[1] - 1}")
    print(f"   R²: {model.rsquared:.4f}")
    print(f"   Adjusted R²: {model.rsquared_adj:.4f}")

    # 2. Heteroskedasticity test
    print("\n2. Heteroskedasticity Test:")
    bp_test = het_breuschpagan(model.resid, model.model.exog)
    print(f"   Breusch-Pagan: LM = {bp_test[0]:.3f}, p = {bp_test[1]:.4f}")

    white_test = het_white(model.resid, model.model.exog)
    print(f"   White: LM = {white_test[0]:.3f}, p = {white_test[1]:.4f}")

    if bp_test[1] < 0.05 or white_test[1] < 0.05:
        print("   Recommendation: Use robust standard errors (HC3)")
    else:
        print("   No obvious heteroskedasticity")

    # 3. Autocorrelation test
    print("\n3. Autocorrelation Test:")
    dw_stat = durbin_watson(model.resid)
    print(f"   Durbin-Watson: {dw_stat:.3f}")
    if 1.5 < dw_stat < 2.5:
        print("   No obvious autocorrelation")
    else:
        print("   Possible autocorrelation")

    # 4. Normality test
    print("\n4. Normality Test:")
    jb_stat, jb_pvalue, _, _ = jarque_bera(model.resid)
    print(f"   Jarque-Bera: JB = {jb_stat:.3f}, p = {jb_pvalue:.4f}")
    if jb_pvalue < 0.05:
        print("   Residuals may not follow normal distribution (usually doesn't affect inference with large samples)")
    else:
        print("   Residuals approximately normally distributed")

    # 5. Influence analysis
    print("\n5. Influence Analysis:")
    influence = model.get_influence()
    cooks_d = influence.cooks_distance[0]
    high_influence = cooks_d > 4/len(y)
    print(f"   Number of high influence points: {high_influence.sum()} ({high_influence.sum()/len(y)*100:.1f}%)")
    if high_influence.sum() > 0:
        print("   Recommendation: Check high influence points, consider robust regression")
    else:
        print("   No obvious high influence points")

    # 6. VIF
    print("\n6. Multicollinearity (VIF):")
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    vif_data = pd.DataFrame()
    vif_data["Variable"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif_data = vif_data[vif_data['Variable'] != 'const']
    print(vif_data.to_string(index=False))
    if (vif_data['VIF'] > 10).any():
        print("   Some variables have VIF > 10, severe multicollinearity present")
    else:
        print("   VIF within acceptable range")

    print("\n" + "="*70)
    print("Diagnostics complete")
    print("="*70)

# Run diagnostics
comprehensive_diagnostics(model, X, y)

Section Summary

Diagnostic Workflow

Data preparation → OLS regression → test the assumptions:

  • Heteroskedasticity? Yes → use robust (HC) standard errors
  • Autocorrelation? Yes → use HAC (Newey-West) standard errors
  • Non-normality? Yes → acceptable in large samples, be cautious in small samples
  • Influential points? Yes → robust regression, or investigate and possibly remove

Once every check passes (or the appropriate fix has been applied) → final model

Key Points

| Problem | Test | Consequence | Solution |
|---|---|---|---|
| Heteroskedasticity | BP, White | Biased SE | HC3 robust SE |
| Autocorrelation | DW | Biased SE | HAC SE |
| Non-normality | JB, Shapiro | Small-sample inference fails | Acceptable with large samples |
| Influential points | Cook's D | Unstable coefficients | Robust regression |

Next Section Preview

In the next section, we will learn:

  • Handling categorical variables (Dummy Variables)
  • Dummy variable trap
  • Interaction effects
  • Modeling nonlinear relationships

Expanding the expressive power of regression models!


Further Reading

  1. Wooldridge (2020): Chapter 8 "Heteroskedasticity"
  2. White, H. (1980). "A Heteroskedasticity-Consistent Covariance Matrix Estimator"
  3. Cook, R. D. (1977). "Detection of Influential Observation in Linear Regression"
  4. Belsley, Kuh, & Welsch (1980). Regression Diagnostics

Continue exploring categorical variables and interaction effects!
