5.5 Categorical Variables and Interaction Effects
"Interaction effects tell us how the world really works."— Andrew Gelman, Statistician & Political Scientist
Expanding Regression Models: Handling Qualitative Variables, Capturing Interactive Relationships
Section Objectives
After completing this section, you will be able to:
- Understand the principles of dummy variables
- Avoid the dummy variable trap
- Handle multi-category variables
- Model and interpret interaction effects
- Visualize interaction relationships
- Conduct group-wise regression analysis
Dummy Variables
Why Do We Need Dummy Variables?
Problem: How do we include categorical variables (such as gender, region, education level) in regression?
Solution: Convert categorical variables into dummy variables (0/1 binary variables)
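For example, a minimal sketch of the encoding (the DataFrame d and its 'gender' column are hypothetical):
import pandas as pd
# Hypothetical raw data with a categorical gender column
d = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})
# Dummy variable: 1 for female, 0 for male (the reference group)
d['female'] = (d['gender'] == 'female').astype(int)
print(d)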
Binary Categorical Variables
Case: Gender Wage Gap
Model:

$$\log(wage_i) = \beta_0 + \beta_1\, education_i + \beta_2\, female_i + u_i$$

Where:
- $female_i = 1$ if individual $i$ is female
- $female_i = 0$ if individual $i$ is male (reference group)

Interpretation:
- Male: $E[\log(wage) \mid education] = \beta_0 + \beta_1\, education$
- Female: $E[\log(wage) \mid education] = (\beta_0 + \beta_2) + \beta_1\, education$
- $\beta_2$: Gender wage gap (female relative to male, in log points)
Python Implementation
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Generate simulated data
np.random.seed(42)
n = 500
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
female = np.random.binomial(1, 0.5, n)
# True DGP: female log wage is 0.15 lower than male (about 14% lower wage)
log_wage = 1.5 + 0.08*education - 0.15*female + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)
df = pd.DataFrame({
'wage': wage,
'log_wage': log_wage,
'education': education,
'female': female
})
# Regression
X = sm.add_constant(df[['education', 'female']])
y = df['log_wage']
model = sm.OLS(y, X).fit(cov_type='HC3')
print(model.summary())

Output (key part):
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.5012 0.052 28.870 0.000 1.399 1.603
education 0.0798 0.004 19.950 0.000 0.072 0.088
female -0.1485 0.027 -5.500 0.000 -0.202 -0.095
==============================================================================

Interpretation:
# Coefficient interpretation
gender_gap = (np.exp(model.params['female']) - 1) * 100
print(f"Gender wage gap: Female wage is {-gender_gap:.1f}% lower than male")
print(f"Specifically, after controlling for education, female wage is {np.exp(model.params['female'])*100:.1f}% of male")Output:
Gender wage gap: Female wage is 13.8% lower than male
Specifically, after controlling for education, female wage is 86.2% of the male wage

Visualization
# Plot regression lines for different genders
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Original wage
for gender, label, color in [(0, 'Male', 'blue'), (1, 'Female', 'red')]:
mask = df['female'] == gender
axes[0].scatter(df.loc[mask, 'education'], df.loc[mask, 'wage'],
alpha=0.3, label=label, color=color)
axes[0].set_xlabel('Years of Education')
axes[0].set_ylabel('Wage (thousands/month)')
axes[0].set_title('Level-Level Model')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Right plot: Log wage
for gender, label, color in [(0, 'Male', 'blue'), (1, 'Female', 'red')]:
mask = df['female'] == gender
axes[1].scatter(df.loc[mask, 'education'], df.loc[mask, 'log_wage'],
alpha=0.3, label=label, color=color)
# Plot regression line
edu_range = np.linspace(df['education'].min(), df['education'].max(), 100)
log_wage_pred = (model.params['const'] +
model.params['education'] * edu_range +
model.params['female'] * gender)
axes[1].plot(edu_range, log_wage_pred, color=color, linewidth=2)
axes[1].set_xlabel('Years of Education')
axes[1].set_ylabel('log(wage)')
axes[1].set_title('Log-Level Model (Parallel Regression Lines)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Key Observation: The two regression lines are parallel (same slope); only the intercepts differ.
Dummy Variable Trap
Perfect Collinearity Problem
Wrong Approach: Create dummy variables for every category
# Wrong example
df['male'] = 1 - df['female']
X_wrong = sm.add_constant(df[['education', 'female', 'male']])
try:
model_wrong = sm.OLS(y, X_wrong).fit()
except Exception as e:
print(f"Error: {e}")Output:
Error: Singular matrixReason:
Perfect collinearity exists!
Correct Approach: Drop One Reference Category
Principle:
- A categorical variable with $k$ categories → create $k-1$ dummy variables
- The dropped category becomes the baseline (reference group)
- All dummy coefficients are interpreted as differences from the reference group
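An alternative, equally valid parameterization keeps both dummies but drops the constant; each dummy coefficient is then a group-specific intercept rather than a difference. A minimal sketch, reusing df, y, and the male column created above:
# Keep both dummies, drop the constant: coefficients are group-specific intercepts
X_cells = df[['education', 'female', 'male']]
model_cells = sm.OLS(y, X_cells).fit()
print(model_cells.params)  # 'male' - 'female' gap matches the earlier female coefficient (in absolute value)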
Multi-Category Variables
Case: Regional Wage Differences
Assume 4 regions: East, Central, West, Northeast
# Generate data
np.random.seed(123)
n = 800
education = np.random.normal(13, 3, n)
region = np.random.choice(['East', 'Central', 'West', 'Northeast'], n)
# Different wage levels by region
region_effect = {
    'East': 0.20,       # highest premium; will serve as the reference group in the regression
    'Central': 0.10,
    'West': 0.05,
    'Northeast': 0.00
}
log_wage = 1.5 + 0.08*education + np.array([region_effect[r] for r in region]) + np.random.normal(0, 0.3, n)
df_region = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'region': region
})
print("Average wage by region:")
print(df_region.groupby('region')['log_wage'].mean().sort_values(ascending=False))

Method 1: Using pandas.get_dummies()
# Create dummy variables and drop 'East' explicitly so it becomes the reference group
# (drop_first=True would drop the first category alphabetically, i.e. 'Central')
region_dummies = pd.get_dummies(df_region['region'], prefix='region', dtype=float).drop(columns=['region_East'])
print("Dummy variables:")
print(region_dummies.head())
# Merge into dataframe
df_region_model = pd.concat([df_region[['log_wage', 'education']], region_dummies], axis=1)
# Regression
X = sm.add_constant(df_region_model.drop('log_wage', axis=1))
y = df_region_model['log_wage']
model_region = sm.OLS(y, X).fit()
print("\nRegression results:")
print(model_region.summary())

Output:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.678 0.053 31.660 0.000 1.574 1.782
education 0.080 0.004 20.000 0.000 0.072 0.088
region_Central -0.098 0.030 -3.267 0.001 -0.157 -0.039
region_Northeast -0.195 0.030 -6.500 0.000 -0.254 -0.136
region_West -0.145 0.030 -4.833 0.000 -0.204 -0.086
==============================================================================

Interpretation:
- Reference group: East (the highest-wage region in this simulation)
- Central wage is about 9.8 log points lower than East ($e^{-0.098}-1 \approx -9.3\%$)
- West wage is about 14.5 log points lower than East ($e^{-0.145}-1 \approx -13.5\%$)
- Northeast wage is about 19.5 log points lower than East ($e^{-0.195}-1 \approx -17.7\%$)
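As with the gender gap earlier, the exact percentage differences can be recovered with the exponential transformation; a short sketch using the model_region fit above:
# Exact percentage wage differences relative to the reference group (East)
for name in ['region_Central', 'region_West', 'region_Northeast']:
    pct = (np.exp(model_region.params[name]) - 1) * 100
    print(f"{name}: {pct:.1f}% relative to East")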
Method 2: Using patsy Formula (Recommended)
import statsmodels.formula.api as smf
# Use formula interface (automatically handles dummy variables)
model_formula = smf.ols('log_wage ~ education + C(region)', data=df_region).fit()
print(model_formula.summary())

Advantages:
- Automatically creates dummy variables
- Automatically selects reference group (first alphabetically)
- More concise code
Changing Reference Group
# Use Treatment coding, specify reference group
from patsy import Treatment
model_ref_east = smf.ols(
'log_wage ~ education + C(region, Treatment(reference="East"))',
data=df_region
).fit()
print("Reference group = East:")
print(model_ref_east.params)
# Compare: Reference group = Northeast (poorest)
model_ref_northeast = smf.ols(
'log_wage ~ education + C(region, Treatment(reference="Northeast"))',
data=df_region
).fit()
print("\nReference group = Northeast:")
print(model_ref_northeast.params)

Interaction Effects
What Are Interaction Effects?
Definition: The effect of one variable depends on the value of another variable
Mathematical Expression:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + u$$

Marginal Effect:

$$\frac{\partial y}{\partial x_1} = \beta_1 + \beta_3 x_2$$

The effect of $x_1$ varies with $x_2$!
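For example, with hypothetical values $\beta_1 = 0.06$ and $\beta_3 = 0.002$, the marginal effect of $x_1$ is $0.06 + 0.002 \times 10 = 0.08$ when $x_2 = 10$, but $0.06 + 0.002 \times 20 = 0.10$ when $x_2 = 20$.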
Case 1: Gender Differences in Education Returns
Research Question: Does the return to education vary by gender?
Model:

$$\log(wage) = \beta_0 + \beta_1\, education + \beta_2\, female + \beta_3\, (education \times female) + u$$

Interpretation:
- Male education return: $\beta_1$
- Female education return: $\beta_1 + \beta_3$
- $\beta_3$: Gender difference in the return to education (female vs male)
# Generate data (education return: male 8%, female 6%)
np.random.seed(456)
n = 600
education = np.random.normal(13, 3, n)
female = np.random.binomial(1, 0.5, n)
# Interaction effect: Female education return is lower
log_wage = (1.5 +
0.08 * education +
0.10 * female -
0.02 * education * female +
np.random.normal(0, 0.3, n))
df_interact = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'female': female
})
# Create interaction term
df_interact['edu_x_female'] = df_interact['education'] * df_interact['female']
# Regression
X = sm.add_constant(df_interact[['education', 'female', 'edu_x_female']])
y = df_interact['log_wage']
model_interact = sm.OLS(y, X).fit(cov_type='HC3')
print(model_interact.summary())

Output:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.498 0.078 19.205 0.000 1.345 1.651
education 0.080 0.006 13.333 0.000 0.068 0.092
female 0.112 0.110 1.018 0.309 -0.104 0.328
edu_x_female -0.020 0.008 -2.500 0.013 -0.036 -0.004
==============================================================================

Interpretation:
# Marginal effects
beta_1 = model_interact.params['education']
beta_3 = model_interact.params['edu_x_female']
print(f"Male education return: {beta_1*100:.2f}% per year")
print(f"Female education return: {(beta_1 + beta_3)*100:.2f}% per year")
print(f"Gender difference: {beta_3*100:.2f} percentage points")
# Test interaction term significance
p_value = model_interact.pvalues['edu_x_female']
print(f"\nInteraction term p-value: {p_value:.4f}")
if p_value < 0.05:
print("Conclusion: Significant gender difference in education returns")Visualizing Interaction Effects
# Plot regression lines for different genders (non-parallel)
plt.figure(figsize=(10, 6))
for gender, label, color in [(0, 'Male', 'blue'), (1, 'Female', 'red')]:
mask = df_interact['female'] == gender
plt.scatter(df_interact.loc[mask, 'education'],
df_interact.loc[mask, 'log_wage'],
alpha=0.3, label=label, color=color)
# Regression line
edu_range = np.linspace(df_interact['education'].min(),
df_interact['education'].max(), 100)
log_wage_pred = (model_interact.params['const'] +
model_interact.params['education'] * edu_range +
model_interact.params['female'] * gender +
model_interact.params['edu_x_female'] * edu_range * gender)
plt.plot(edu_range, log_wage_pred, color=color, linewidth=2,
label=f'{label} regression line')
plt.xlabel('Years of Education')
plt.ylabel('log(wage)')
plt.title('Gender Differences in Education-Wage Relationship (Non-Parallel Regression Lines)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Key Observation: The regression lines are not parallel (different slopes)
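The same model can also be estimated more concisely with the formula interface, where education * female expands to both main effects plus the interaction; a sketch, assuming smf has been imported as in the multi-category example above:
# Formula interface: 'education * female' = education + female + education:female
model_interact_f = smf.ols('log_wage ~ education * female', data=df_interact).fit(cov_type='HC3')
print(model_interact_f.params)  # 'education:female' should match the manual edu_x_female coefficient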
Case 2: Interaction Between Experience and Education
Research Question: Does the value of work experience depend on education level?
# Generate data
np.random.seed(789)
n = 500
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
# Interaction effect: Higher education, greater value of experience
log_wage = (1.0 +
0.06 * education +
0.01 * experience +
0.002 * education * experience +
np.random.normal(0, 0.3, n))
df_exp_edu = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'experience': experience
})
# Use formula interface (automatically creates interaction term)
model_exp_edu = smf.ols('log_wage ~ education * experience', data=df_exp_edu).fit()
print(model_exp_edu.summary())

Visualization:
# Plot experience-wage curves for different education levels
fig = plt.figure(figsize=(10, 6))
edu_levels = [10, 13, 16] # High school, college, graduate
colors = ['red', 'blue', 'green']
for edu, color, label in zip(edu_levels, colors, ['High School', 'College', 'Graduate']):
exp_range = np.linspace(0, 30, 100)
log_wage_pred = model_exp_edu.predict(pd.DataFrame({
'education': [edu] * 100,
'experience': exp_range
}))
plt.plot(exp_range, log_wage_pred, color=color, linewidth=2, label=label)
plt.xlabel('Work Experience (years)')
plt.ylabel('log(wage)')
plt.title('Effect of Experience on Wage: Moderating Role of Education')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Marginal Effect Analysis
# Calculate marginal effect of experience at different education levels
def marginal_effect_experience(edu, model):
beta_exp = model.params['experience']
beta_interact = model.params['education:experience']
return beta_exp + beta_interact * edu
for edu, label in [(10, 'High school'), (13, 'College'), (16, 'Graduate')]:
me = marginal_effect_experience(edu, model_exp_edu)
print(f"{label} ({edu} years education): Marginal return to experience = {me*100:.2f}% per year")Output:
High school (10 years education): Marginal return to experience = 3.00% per year
College (13 years education): Marginal return to experience = 3.60% per year
Graduate (16 years education): Marginal return to experience = 4.20% per year

Group-wise Regression vs Interaction Terms
Method Comparison
Method 1: Separate Regressions
# Separate regressions for males and females
# Note: df_interact (from Case 1) has no experience column, so regress log wage on education only
model_male = smf.ols('log_wage ~ education',
                     data=df_interact[df_interact['female'] == 0]).fit()
model_female = smf.ols('log_wage ~ education',
                       data=df_interact[df_interact['female'] == 1]).fit()
print("Male regression:")
print(model_male.params)
print("\nFemale regression:")
print(model_female.params)

Method 2: Interaction Terms
# Full interaction model (allows both the intercept and the education slope to differ by gender)
model_full_interact = smf.ols('log_wage ~ education * female',
                              data=df_interact).fit()
print("Full interaction model:")
print(model_full_interact.params)

Testing Coefficient Equality (Chow Test)
Null Hypothesis: Regression coefficients are equal for both groups
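The test statistic, with $k$ parameters per group and group sample sizes $n_1, n_2$, is

$$F = \frac{(SSR_{pooled} - SSR_1 - SSR_2)/k}{(SSR_1 + SSR_2)/(n_1 + n_2 - 2k)} \;\sim\; F(k,\; n_1 + n_2 - 2k),$$

where $SSR_1 + SSR_2$ is what the code below calls SSR_separate.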
# F test
# SSR_pooled: SSR from pooled regression
# SSR_separate: Sum of SSRs from separate regressions
# k: Number of parameters per group
# n1, n2: Sample sizes for two groups
model_pooled = smf.ols('log_wage ~ education', data=df_interact).fit()
SSR_pooled = model_pooled.ssr
SSR_separate = model_male.ssr + model_female.ssr
k = 2  # const + education
n1 = (df_interact['female'] == 0).sum()
n2 = (df_interact['female'] == 1).sum()
F_stat = ((SSR_pooled - SSR_separate) / k) / (SSR_separate / (n1 + n2 - 2*k))
from scipy.stats import f
p_value = 1 - f.cdf(F_stat, k, n1 + n2 - 2*k)
print(f"\nChow Test:")
print(f"F statistic: {F_stat:.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Conclusion: Reject coefficient equality, should use separate regressions or interaction terms")
else:
print("Conclusion: Cannot reject coefficient equality, can use pooled regression")Practical Case: Complete Wage Determination Equation
# Comprehensive case: Including all types of variables
np.random.seed(2024)
n = 1000
# Generate variables
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
female = np.random.binomial(1, 0.5, n)
region = np.random.choice(['East', 'Central', 'West'], n, p=[0.4, 0.3, 0.3])
married = np.random.binomial(1, 0.6, n)
# DGP
region_effect = {'East': 0.15, 'Central': 0.05, 'West': 0.00}
log_wage = (1.2 +
0.07 * education +
0.03 * experience -
0.0005 * experience**2 -
0.12 * female +
0.08 * married -
0.015 * education * female + # Gender difference in education return
np.array([region_effect[r] for r in region]) +
np.random.normal(0, 0.3, n))
df_full = pd.DataFrame({
'log_wage': log_wage,
'education': education,
'experience': experience,
'female': female,
'region': region,
'married': married
})
# Full model
formula = '''
log_wage ~ education + experience + I(experience**2) +
female + C(region) + married +
education:female
'''
model_full = smf.ols(formula, data=df_full).fit(cov_type='HC3')
print("Complete wage determination equation:")
print(model_full.summary())

Predicting Wages for Different Groups
# Prediction examples
scenarios = pd.DataFrame({
'education': [12, 16, 16, 16],
'experience': [5, 10, 10, 10],
'female': [0, 0, 1, 1],
'region': ['East', 'East', 'East', 'Central'],
'married': [0, 1, 1, 1],
'label': ['High school male, East, 5 years experience',
'College male, East, married, 10 years experience',
'College female, East, married, 10 years experience',
'College female, Central, married, 10 years experience']
})
scenarios['log_wage_pred'] = model_full.predict(scenarios)
scenarios['wage_pred'] = np.exp(scenarios['log_wage_pred'])
print("\nPredicted wages for different groups:")
print(scenarios[['label', 'wage_pred']])

Section Summary
Key Points
| Concept | Key Point |
|---|---|
| Dummy Variables | A variable with $k$ categories → $k-1$ dummy variables |
| Reference Group | Dropped category, all coefficients relative to it |
| Interaction Effects | Effect of one variable depends on another |
| Marginal Effects | With interaction $x_1 x_2$: $\partial y/\partial x_1 = \beta_1 + \beta_3 x_2$ |
Python Tools
| Task | Tool |
|---|---|
| Create Dummy Variables | pd.get_dummies() |
| Formula Interface | smf.ols('y ~ C(x)') |
| Interaction Terms | smf.ols('y ~ x1 * x2') |
| Marginal Effects | Manual calculation from the coefficients (see the sketch after this table); statsmodels.graphics.factorplots.interaction_plot for visualization |
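Beyond the point estimate, a standard error for a marginal effect can be obtained from a linear contrast on the fitted interaction model; a minimal sketch reusing model_exp_edu from Case 2 (education = 16 is an illustrative value):
import numpy as np
# Marginal effect of experience at education = 16: beta_experience + 16 * beta_interaction
names = list(model_exp_edu.params.index)
contrast = np.zeros(len(names))
contrast[names.index('experience')] = 1.0
contrast[names.index('education:experience')] = 16.0
print(model_exp_edu.t_test(contrast))  # estimate, standard error, t statistic, confidence interval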
Next Section Preview
In the next section, we will learn:
- The art of coefficient interpretation (Level-Level, Log-Level, Log-Log)
- Publication-grade regression tables
- Standards for results reporting
- Visualizing regression results
From Model to Paper: Professional Presentation!
Further Reading
- Wooldridge (2020): Chapter 7 "Multiple Regression Analysis with Qualitative Information"
- Aiken & West (1991). Multiple Regression: Testing and Interpreting Interactions
- Brambor, Clark, & Golder (2006). "Understanding Interaction Models"
Ready to write professional regression reports?