
5.5 Categorical Variables and Interaction Effects

"Interaction effects tell us how the world really works."— Andrew Gelman, Statistician & Political Scientist

Expanding Regression Models: Handling Qualitative Variables, Capturing Interactive Relationships



Section Objectives

After completing this section, you will be able to:

  • Understand the principles of dummy variables
  • Avoid the dummy variable trap
  • Handle multi-category variables
  • Model and interpret interaction effects
  • Visualize interaction relationships
  • Conduct group-wise regression analysis

Dummy Variables

Why Do We Need Dummy Variables?

Problem: How do we include categorical variables (such as gender, region, education level) in regression?

Solution: Convert categorical variables into dummy variables (0/1 binary variables)
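For intuition, here is a minimal sketch (using made-up toy data, separate from the case below) of how a text category becomes a 0/1 dummy:

python
import pandas as pd

# Hypothetical toy data: encode 'gender' as a 0/1 dummy variable
toy = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})
toy['female'] = (toy['gender'] == 'female').astype(int)
print(toy)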

Binary Categorical Variables

Case: Gender Wage Gap

Model:

$$\log(wage_i) = \beta_0 + \beta_1\, education_i + \beta_2\, female_i + u_i$$

Where:

  • $female_i = 1$ if individual $i$ is female
  • $female_i = 0$ if individual $i$ is male (reference group)

Interpretation:

  • Male ($female_i = 0$): $E[\log(wage)] = \beta_0 + \beta_1\, education$
  • Female ($female_i = 1$): $E[\log(wage)] = \beta_0 + \beta_1\, education + \beta_2$
  • $\beta_2$: Gender wage gap (female relative to male)

Python Implementation

python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate simulated data
np.random.seed(42)
n = 500

education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
female = np.random.binomial(1, 0.5, n)

# True DGP: female log wage is 0.15 lower than male's (about 14% lower)
log_wage = 1.5 + 0.08*education - 0.15*female + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)

df = pd.DataFrame({
    'wage': wage,
    'log_wage': log_wage,
    'education': education,
    'female': female
})

# Regression
X = sm.add_constant(df[['education', 'female']])
y = df['log_wage']
model = sm.OLS(y, X).fit(cov_type='HC3')

print(model.summary())

Output (key part):

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5012      0.052     28.870      0.000       1.399       1.603
education      0.0798      0.004     19.950      0.000       0.072       0.088
female        -0.1485      0.027     -5.500      0.000      -0.202      -0.095
==============================================================================

Interpretation:

python
# Coefficient interpretation
gender_gap = (np.exp(model.params['female']) - 1) * 100
print(f"Gender wage gap: Female wage is {-gender_gap:.1f}% lower than male")
print(f"Specifically, after controlling for education, female wage is {np.exp(model.params['female'])*100:.1f}% of male")

Output:

Gender wage gap: Female wage is 13.8% lower than male
Specifically, after controlling for education, female wage is 86.2% of male

Visualization

python
# Plot regression lines for different genders
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Original wage
for gender, label, color in [(0, 'Male', 'blue'), (1, 'Female', 'red')]:
    mask = df['female'] == gender
    axes[0].scatter(df.loc[mask, 'education'], df.loc[mask, 'wage'],
                   alpha=0.3, label=label, color=color)

axes[0].set_xlabel('Years of Education')
axes[0].set_ylabel('Wage (thousands/month)')
axes[0].set_title('Level-Level Model')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right plot: Log wage
for gender, label, color in [(0, 'Male', 'blue'), (1, 'Female', 'red')]:
    mask = df['female'] == gender
    axes[1].scatter(df.loc[mask, 'education'], df.loc[mask, 'log_wage'],
                   alpha=0.3, label=label, color=color)

    # Plot regression line
    edu_range = np.linspace(df['education'].min(), df['education'].max(), 100)
    log_wage_pred = (model.params['const'] +
                     model.params['education'] * edu_range +
                     model.params['female'] * gender)
    axes[1].plot(edu_range, log_wage_pred, color=color, linewidth=2)

axes[1].set_xlabel('Years of Education')
axes[1].set_ylabel('log(wage)')
axes[1].set_title('Log-Level Model (Parallel Regression Lines)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key Observation: Two regression lines are parallel (same slope), only intercepts differ


Dummy Variable Trap

Perfect Collinearity Problem

Wrong Approach: Create dummy variables for every category

python
# Wrong example
df['male'] = 1 - df['female']
X_wrong = sm.add_constant(df[['education', 'female', 'male']])

try:
    model_wrong = sm.OLS(y, X_wrong).fit()
except Exception as e:
    print(f"Error: {e}")

Output:

Error: Singular matrix

(Note: depending on the statsmodels version, OLS may instead fit via the pseudo-inverse without raising an error; the summary then warns that the design matrix is singular or nearly so. Either way, the coefficients are not uniquely identified.)

Reason: for every observation,

$$female_i + male_i = 1$$

which exactly reproduces the constant term — perfect collinearity exists!

Correct Approach: Drop One Reference Category

Principle:

  • A variable with $k$ categories → create $k-1$ dummy variables
  • The dropped category becomes the reference group (baseline)
  • All dummy coefficients are interpreted as the "difference from the reference group"

(The rank check below shows why including all $k$ dummies alongside a constant is redundant.)
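A quick sketch using the gender data generated above: with both female and male in the design matrix, one column is an exact linear combination of the others, so the matrix loses rank.

python
# Design matrix with the full set of dummies (the trap) vs. the k-1 version
X_trap = np.column_stack([np.ones(n), df['education'], df['female'], 1 - df['female']])
X_ok = np.column_stack([np.ones(n), df['education'], df['female']])

print(np.linalg.matrix_rank(X_trap), 'independent columns out of', X_trap.shape[1])  # 3 out of 4
print(np.linalg.matrix_rank(X_ok), 'independent columns out of', X_ok.shape[1])      # 3 out of 3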

Multi-Category Variables

Case: Regional Wage Differences

Assume 4 regions: East, Central, West, Northeast

python
# Generate data
np.random.seed(123)
n = 800

education = np.random.normal(13, 3, n)
region = np.random.choice(['East', 'Central', 'West', 'Northeast'], n)

# Different wage levels by region
region_effect = {
    'East': 0.20,
    'Central': 0.10,
    'West': 0.05,
    'Northeast': 0.00   # base level in the DGP
}

log_wage = 1.5 + 0.08*education + np.array([region_effect[r] for r in region]) + np.random.normal(0, 0.3, n)

df_region = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'region': region
})

print("Average wage by region:")
print(df_region.groupby('region')['log_wage'].mean().sort_values(ascending=False))

Method 1: Using pandas.get_dummies()

python
# Create dummy variables, dropping one reference group.
# drop_first=True drops the first category, so order the categories explicitly
# to make 'East' the reference group (matching the interpretation below).
region_cat = pd.Categorical(df_region['region'],
                            categories=['East', 'Central', 'Northeast', 'West'])
region_dummies = pd.get_dummies(pd.Series(region_cat, name='region'),
                                prefix='region', drop_first=True, dtype=float)
print("Dummy variables:")
print(region_dummies.head())

# Merge into dataframe
df_region_model = pd.concat([df_region[['log_wage', 'education']], region_dummies], axis=1)

# Regression
X = sm.add_constant(df_region_model.drop('log_wage', axis=1))
y = df_region_model['log_wage']
model_region = sm.OLS(y, X).fit()

print("\nRegression results:")
print(model_region.summary())

Output:

==============================================================================
                    coef    std err          t      P>|t|      [0.025    0.975]
------------------------------------------------------------------------------
const              1.678      0.053     31.660      0.000       1.574     1.782
education          0.080      0.004     20.000      0.000       0.072     0.088
region_Central    -0.098      0.030     -3.267      0.001      -0.157    -0.039
region_Northeast  -0.195      0.030     -6.500      0.000      -0.254    -0.136
region_West       -0.145      0.030     -4.833      0.000      -0.204    -0.086
==============================================================================

Interpretation:

  • Reference group: East (the wealthiest region)
  • Central wage is roughly 9.8% lower than East
  • West wage is roughly 14.5% lower than East
  • Northeast wage is roughly 19.5% lower than East

(The log-point coefficients are approximate percentage differences; exact values are $100(e^{\hat\beta}-1)$, as in the gender example above.)

Method 2: Using the Formula Interface C()
python
import statsmodels.formula.api as smf

# Use formula interface (automatically handles dummy variables)
model_formula = smf.ols('log_wage ~ education + C(region)', data=df_region).fit()
print(model_formula.summary())

Advantages:

  • Automatically creates dummy variables
  • Automatically selects reference group (first alphabetically)
  • More concise code

Changing Reference Group

python
# Use Treatment coding, specify reference group
from patsy import Treatment

model_ref_east = smf.ols(
    'log_wage ~ education + C(region, Treatment(reference="East"))',
    data=df_region
).fit()

print("Reference group = East:")
print(model_ref_east.params)

# Compare: Reference group = Northeast (poorest)
model_ref_northeast = smf.ols(
    'log_wage ~ education + C(region, Treatment(reference="Northeast"))',
    data=df_region
).fit()

print("\nReference group = Northeast:")
print(model_ref_northeast.params)
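A quick consistency check (a sketch, not part of the original output): switching the reference group only re-bases the coefficients, so the Northeast-vs-East gap in the first model should be the negative of the East-vs-Northeast gap in the second.

python
# Parameter labels follow patsy's Treatment-coding convention '[T.<level>]'
b_ne_vs_east = [v for k, v in model_ref_east.params.items() if k.endswith('[T.Northeast]')][0]
b_east_vs_ne = [v for k, v in model_ref_northeast.params.items() if k.endswith('[T.East]')][0]
print(f"Northeast vs East: {b_ne_vs_east:.4f};  East vs Northeast: {b_east_vs_ne:.4f}  (signs flip)")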

Interaction Effects

What Are Interaction Effects?

Definition: The effect of one variable depends on the value of another variable

Mathematical Expression:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + u$$

Marginal Effect:

$$\frac{\partial Y}{\partial X_1} = \beta_1 + \beta_3 X_2$$

The effect of $X_1$ varies with $X_2$!

Case 1: Gender Differences in Education Returns

Research Question: Does the return to education vary by gender?

Model:

$$\log(wage) = \beta_0 + \beta_1\, education + \beta_2\, female + \beta_3\,(education \times female) + u$$

Interpretation:

  • Male education return: $\beta_1$
  • Female education return: $\beta_1 + \beta_3$
  • $\beta_3$: Gender difference in returns (female vs male)
python
# Generate data (education return: male 8%, female 6%)
np.random.seed(456)
n = 600

education = np.random.normal(13, 3, n)
female = np.random.binomial(1, 0.5, n)

# Interaction effect: Female education return is lower
log_wage = (1.5 +
            0.08 * education +
            0.10 * female -
            0.02 * education * female +
            np.random.normal(0, 0.3, n))

df_interact = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'female': female
})

# Create interaction term
df_interact['edu_x_female'] = df_interact['education'] * df_interact['female']

# Regression
X = sm.add_constant(df_interact[['education', 'female', 'edu_x_female']])
y = df_interact['log_wage']
model_interact = sm.OLS(y, X).fit(cov_type='HC3')

print(model_interact.summary())

Output:

==============================================================================
                    coef    std err          t      P>|t|      [0.025    0.975]
------------------------------------------------------------------------------
const              1.498      0.078     19.205      0.000       1.345     1.651
education          0.080      0.006     13.333      0.000       0.068     0.092
female             0.112      0.110      1.018      0.309      -0.104     0.328
edu_x_female      -0.020      0.008     -2.500      0.013      -0.036    -0.004
==============================================================================

Interpretation:

python
# Marginal effects
beta_1 = model_interact.params['education']
beta_3 = model_interact.params['edu_x_female']

print(f"Male education return: {beta_1*100:.2f}% per year")
print(f"Female education return: {(beta_1 + beta_3)*100:.2f}% per year")
print(f"Gender difference: {beta_3*100:.2f} percentage points")

# Test interaction term significance
p_value = model_interact.pvalues['edu_x_female']
print(f"\nInteraction term p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Significant gender difference in education returns")

Visualizing Interaction Effects

python
# Plot regression lines for different genders (non-parallel)
plt.figure(figsize=(10, 6))

for gender, label, color in [(0, 'Male', 'blue'), (1, 'Female', 'red')]:
    mask = df_interact['female'] == gender
    plt.scatter(df_interact.loc[mask, 'education'],
               df_interact.loc[mask, 'log_wage'],
               alpha=0.3, label=label, color=color)

    # Regression line
    edu_range = np.linspace(df_interact['education'].min(),
                           df_interact['education'].max(), 100)
    log_wage_pred = (model_interact.params['const'] +
                     model_interact.params['education'] * edu_range +
                     model_interact.params['female'] * gender +
                     model_interact.params['edu_x_female'] * edu_range * gender)
    plt.plot(edu_range, log_wage_pred, color=color, linewidth=2,
            label=f'{label} regression line')

plt.xlabel('Years of Education')
plt.ylabel('log(wage)')
plt.title('Gender Differences in Education-Wage Relationship (Non-Parallel Regression Lines)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Key Observation: Regression lines are not parallel (different slopes)

Case 2: Interaction Between Experience and Education

Research Question: Does the value of work experience depend on education level?

python
# Generate data
np.random.seed(789)
n = 500

education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)

# Interaction effect: Higher education, greater value of experience
log_wage = (1.0 +
            0.06 * education +
            0.01 * experience +
            0.002 * education * experience +
            np.random.normal(0, 0.3, n))

df_exp_edu = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'experience': experience
})

# Use formula interface (automatically creates interaction term)
model_exp_edu = smf.ols('log_wage ~ education * experience', data=df_exp_edu).fit()
print(model_exp_edu.summary())

Visualization:

python
# Plot experience-wage curves for different education levels
fig = plt.figure(figsize=(10, 6))

edu_levels = [10, 13, 16]  # High school, college, graduate
colors = ['red', 'blue', 'green']

for edu, color, label in zip(edu_levels, colors, ['High School', 'College', 'Graduate']):
    exp_range = np.linspace(0, 30, 100)
    log_wage_pred = model_exp_edu.predict(pd.DataFrame({
        'education': [edu] * 100,
        'experience': exp_range
    }))
    plt.plot(exp_range, log_wage_pred, color=color, linewidth=2, label=label)

plt.xlabel('Work Experience (years)')
plt.ylabel('log(wage)')
plt.title('Effect of Experience on Wage: Moderating Role of Education')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Marginal Effect Analysis

python
# Calculate marginal effect of experience at different education levels
def marginal_effect_experience(edu, model):
    beta_exp = model.params['experience']
    beta_interact = model.params['education:experience']
    return beta_exp + beta_interact * edu

for edu, label in [(10, 'High school'), (13, 'College'), (16, 'Graduate')]:
    me = marginal_effect_experience(edu, model_exp_edu)
    print(f"{label} ({edu} years education): Marginal return to experience = {me*100:.2f}% per year")

Output:

High school (10 years education): Marginal return to experience = 3.00% per year
College (13 years education): Marginal return to experience = 3.60% per year
Graduate (16 years education): Marginal return to experience = 4.20% per year
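The point estimates above come with sampling uncertainty. A minimal delta-method sketch (reusing the model and helper defined above): since $Var(\hat\beta_{exp} + edu \cdot \hat\beta_{int}) = Var(\hat\beta_{exp}) + edu^2\, Var(\hat\beta_{int}) + 2\, edu\, Cov(\hat\beta_{exp}, \hat\beta_{int})$, a standard error can be attached to each marginal effect.

python
# Delta-method standard error for the marginal effect of experience
def marginal_effect_se(edu, model):
    cov = model.cov_params()  # covariance matrix of the coefficient estimates
    var = (cov.loc['experience', 'experience']
           + edu**2 * cov.loc['education:experience', 'education:experience']
           + 2 * edu * cov.loc['experience', 'education:experience'])
    return np.sqrt(var)

for edu in [10, 13, 16]:
    me = marginal_effect_experience(edu, model_exp_edu)
    se = marginal_effect_se(edu, model_exp_edu)
    print(f"education = {edu}: ME = {me*100:.2f}%/year, "
          f"95% CI ≈ [{(me - 1.96*se)*100:.2f}%, {(me + 1.96*se)*100:.2f}%]")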

Group-wise Regression vs Interaction Terms

Method Comparison

Method 1: Separate Regressions

python
# Separate regressions for males and females
# (df_interact contains education and female only, so the grouped model uses education)
model_male = smf.ols('log_wage ~ education',
                      data=df_interact[df_interact['female'] == 0]).fit()
model_female = smf.ols('log_wage ~ education',
                        data=df_interact[df_interact['female'] == 1]).fit()

print("Male regression:")
print(model_male.params)
print("\nFemale regression:")
print(model_female.params)

Method 2: Interaction Terms

python
# Full interaction model (lets both intercept and education slope differ by gender)
model_full_interact = smf.ols('log_wage ~ education * female',
                               data=df_interact).fit()
print("Full interaction model:")
print(model_full_interact.params)
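As a cross-check (a sketch, not in the original text), the same "no difference between groups" null can be tested inside the interaction model by jointly restricting the female main effect and the interaction term to zero; the result should agree with the Chow test below.

python
# Joint F test that the female intercept shift and slope shift are both zero.
# Build the restriction matrix by position to avoid issues with ':' in parameter names.
names = list(model_full_interact.params.index)
R = np.zeros((2, len(names)))
R[0, names.index('female')] = 1
R[1, names.index('education:female')] = 1

print(model_full_interact.f_test(R))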

Testing Coefficient Equality (Chow Test)

Null Hypothesis: Regression coefficients are equal for both groups
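The statistic computed in the code below is

$$F = \frac{(SSR_{pooled} - SSR_1 - SSR_2)/k}{(SSR_1 + SSR_2)/(n_1 + n_2 - 2k)} \sim F(k,\; n_1 + n_2 - 2k)$$

where $SSR_1 + SSR_2$ is the combined sum of squared residuals from the separate group regressions.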

python
# F test
# SSR_pooled: SSR from pooled regression
# SSR_separate: Sum of SSRs from separate regressions
# k: Number of parameters per group
# n1, n2: Sample sizes for two groups

model_pooled = smf.ols('log_wage ~ education', data=df_interact).fit()
SSR_pooled = model_pooled.ssr
SSR_separate = model_male.ssr + model_female.ssr

k = 2  # const + education
n1 = (df_interact['female'] == 0).sum()
n2 = (df_interact['female'] == 1).sum()

F_stat = ((SSR_pooled - SSR_separate) / k) / (SSR_separate / (n1 + n2 - 2*k))

from scipy.stats import f
p_value = 1 - f.cdf(F_stat, k, n1 + n2 - 2*k)

print(f"\nChow Test:")
print(f"F statistic: {F_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Conclusion: Reject coefficient equality, should use separate regressions or interaction terms")
else:
    print("Conclusion: Cannot reject coefficient equality, can use pooled regression")

Practical Case: Complete Wage Determination Equation

python
# Comprehensive case: Including all types of variables
np.random.seed(2024)
n = 1000

# Generate variables
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
female = np.random.binomial(1, 0.5, n)
region = np.random.choice(['East', 'Central', 'West'], n, p=[0.4, 0.3, 0.3])
married = np.random.binomial(1, 0.6, n)

# DGP
region_effect = {'East': 0.15, 'Central': 0.05, 'West': 0.00}
log_wage = (1.2 +
            0.07 * education +
            0.03 * experience -
            0.0005 * experience**2 -
            0.12 * female +
            0.08 * married -
            0.015 * education * female +  # Gender difference in education return
            np.array([region_effect[r] for r in region]) +
            np.random.normal(0, 0.3, n))

df_full = pd.DataFrame({
    'log_wage': log_wage,
    'education': education,
    'experience': experience,
    'female': female,
    'region': region,
    'married': married
})

# Full model
formula = '''
log_wage ~ education + experience + I(experience**2) +
           female + C(region) + married +
           education:female
'''
model_full = smf.ols(formula, data=df_full).fit(cov_type='HC3')

print("Complete wage determination equation:")
print(model_full.summary())
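One way to read the quadratic experience term (a small follow-up sketch, not part of the original output): the implied wage-experience profile peaks where the marginal effect of experience hits zero, at $experience^* = -\hat\beta_{exp} / (2\hat\beta_{exp^2})$.

python
# Implied peak of the wage-experience profile from the quadratic specification.
# The exact label of the squared term can vary, so look it up by pattern.
sq_name = [name for name in model_full.params.index
           if 'experience ** 2' in name or 'experience**2' in name][0]
b_exp = model_full.params['experience']
b_exp2 = model_full.params[sq_name]
print(f"Wage-experience profile peaks at about {-b_exp / (2 * b_exp2):.1f} years of experience")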

Predicting Wages for Different Groups

python
# Prediction examples
scenarios = pd.DataFrame({
    'education': [12, 16, 16, 16],
    'experience': [5, 10, 10, 10],
    'female': [0, 0, 1, 1],
    'region': ['East', 'East', 'East', 'Central'],
    'married': [0, 1, 1, 1],
    'label': ['High school male, East, 5 years experience',
              'College male, East, married, 10 years experience',
              'College female, East, married, 10 years experience',
              'College female, Central, married, 10 years experience']
})

scenarios['log_wage_pred'] = model_full.predict(scenarios)
scenarios['wage_pred'] = np.exp(scenarios['log_wage_pred'])

print("\nPredicted wages for different groups:")
print(scenarios[['label', 'wage_pred']])

Section Summary

Key Points

| Concept | Key Point |
| --- | --- |
| Dummy Variables | $k$ categories → $k-1$ dummy variables |
| Reference Group | The dropped category; all coefficients are relative to it |
| Interaction Effects | The effect of one variable depends on the value of another |
| Marginal Effects | $\partial Y / \partial X_1 = \beta_1 + \beta_3 X_2$ |

Python Tools

| Task | Tool |
| --- | --- |
| Create dummy variables | pd.get_dummies() |
| Formula interface | smf.ols('y ~ C(x)') |
| Interaction terms | smf.ols('y ~ x1 * x2') |
| Marginal effects | Manual calculation or statsmodels.graphics |

Next Section Preview

In the next section, we will learn:

  • The art of coefficient interpretation (Level-Level, Log-Level, Log-Log)
  • Publication-grade regression tables
  • Standards for results reporting
  • Visualizing regression results

From Model to Paper: Professional Presentation!


Further Reading

  1. Wooldridge (2020): Chapter 7 "Multiple Regression Analysis with Qualitative Information"
  2. Aiken & West (1991). Multiple Regression: Testing and Interpreting Interactions
  3. Brambor, Clark, & Golder (2006). "Understanding Interaction Models"

Ready to write professional regression reports?
