3.4 Variable Construction
"Feature engineering is the key to machine learning success.""特征工程是机器学习成功的关键。"— Pedro Domingos, Professor at University of Washington (华盛顿大学教授)
From raw variables to analysis-ready data
📋 Learning Objectives
- Master the creation and theory of dummy variables
- Deeply understand interaction terms and marginal effects
- Learn lag operators in dynamic panel data
- Master complete variable construction for the Mincer wage equation
- Understand DID variables and treatment effect decomposition
- Build a production-level VariableConstructor class
📐 Theoretical Foundation
The Role of Variable Construction in Causal Inference
Angrist & Pischke (2009) emphasize:
"Good econometrics is about good research design, and good research design requires careful variable construction."
Three Key Functions:
1. Confounding control
   - Time fixed effects $\lambda_t$: captured with time dummy variables
   - Individual or group fixed effects $\alpha_i$: removed by group mean centering (the within transformation)
2. Capturing heterogeneous effects
   - The interaction term $D_i \times X_i$ captures how the treatment effect varies with $X_i$
3. Constructing instrumental variables
   - For example, lags of endogenous regressors serve as instruments in dynamic panel models (see the Arellano-Bond discussion below)
🎯 Dummy Variables
Theory: Numerical Encoding of Categorical Variables
Mathematical Representation:
Let a categorical variable $X$ have $k$ categories $\{1, 2, \dots, k\}$. Create $k-1$ dummy variables:
$$D_{ij} = \mathbb{1}[X_i = j], \quad j = 2, \dots, k$$
Regression Model:
$$Y_i = \beta_0 + \beta_2 D_{i2} + \beta_3 D_{i3} + \dots + \beta_k D_{ik} + \varepsilon_i$$
Coefficient Interpretation:
- $\beta_0$: Expected value for the reference group ($j = 1$): $\beta_0 = E[Y \mid X = 1]$
- $\beta_j$: Difference between category $j$ and the reference group: $\beta_j = E[Y \mid X = j] - E[Y \mid X = 1]$
The Dummy Variable Trap (Multicollinearity)
Perfect Collinearity:
If all $k$ dummy variables are included, then for every observation
$$\sum_{j=1}^{k} D_{ij} = 1.$$
This creates perfect collinearity with the intercept term (which equals 1 for all observations), making the design matrix $X'X$ non-invertible, so the OLS estimator is not defined.
Solution: Drop one category as the reference group
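A small numeric check of the trap (a sketch using a toy education variable with hypothetical values): with an intercept plus one dummy per category, the columns of the design matrix are linearly dependent, so the matrix loses full column rank.
import numpy as np
import pandas as pd
edu = pd.Series(['High School', 'Bachelor', 'Master', 'High School', 'PhD'])
all_dummies = pd.get_dummies(edu, dtype=int)  # one column per category; each row sums to 1
X_trap = np.column_stack([np.ones(len(edu)), all_dummies.to_numpy()])
print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))  # 5 columns, but rank is only 4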
Method 1: pd.get_dummies()
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'education': ['High School', 'Bachelor', 'Master', 'High School', 'PhD']
})
# Create dummy variables
dummies = pd.get_dummies(df['education'], prefix='edu', drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
#    id    education  edu_High School  edu_Master  edu_PhD
# 0   1  High School                1           0        0
# 1   2     Bachelor                0           0        0
# 2   3       Master                0           1        0
# 3   4  High School                1           0        0
# 4   5          PhD                0           0        1

Parameter Explanation:
- prefix='edu': Variable name prefix
- drop_first=True: Drop the first category (alphabetically) to avoid the dummy variable trap
- dtype=int: Return 0/1 integers instead of booleans (the default in recent pandas versions)
- Reference group: Bachelor, because it comes first alphabetically (all remaining dummies equal 0 for Bachelor rows). To use a different reference group, such as High School, construct the dummies manually (Method 2 below) or drop the reference column yourself (see the sketch below).
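If you want a specific reference group but prefer to stay with pd.get_dummies, one option (a small sketch) is to create dummies for every category and then drop the reference column explicitly:
# Keep all categories, then drop the chosen reference group ('High School' here)
edu_all = pd.get_dummies(df['education'], prefix='edu', dtype=int)
edu_ref_hs = edu_all.drop(columns='edu_High School')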
Method 2: Manual Construction (More Flexible)
# Create single dummy variables
# (illustrative: assumes the data also contain gender, income, age, and urban columns)
df['college'] = (df['education'] == 'Bachelor').astype(int)
df['female'] = (df['gender'] == 'Female').astype(int)
df['high_income'] = (df['income'] > df['income'].median()).astype(int)
# Multiple conditions (cross-classification)
df['young_female'] = ((df['age'] < 30) & (df['gender'] == 'Female')).astype(int)
df['urban_college'] = ((df['urban'] == 1) & (df['education'] == 'Bachelor')).astype(int)
# Specify reference group
def create_education_dummies(education_series, ref='High School'):
"""Create education dummy variables with specified reference group"""
categories = education_series.unique()
dummies = {}
for cat in categories:
if cat != ref:
dummies[f'edu_{cat}'] = (education_series == cat).astype(int)
return pd.DataFrame(dummies)
edu_dummies = create_education_dummies(df['education'], ref='High School')
df = pd.concat([df, edu_dummies], axis=1)

Regression and Interpretation
import statsmodels.api as sm
# Simulate wage data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'education': np.random.choice(['High School', 'Bachelor', 'Master'], n),
'experience': np.random.uniform(0, 30, n),
'age': np.random.uniform(22, 65, n)
})
# Generate wages (true model)
df['wage'] = (30000 +
5000 * (df['education'] == 'Bachelor') +
10000 * (df['education'] == 'Master') +
1000 * df['experience'] +
np.random.normal(0, 5000, n))
# Create dummy variables with High School as the reference group
edu_dummies = pd.get_dummies(df['education'], prefix='edu', dtype=int)
edu_dummies = edu_dummies.drop(columns='edu_High School')  # drop the reference category
df = pd.concat([df, edu_dummies], axis=1)
# OLS regression
X = sm.add_constant(df[['experience', 'edu_Bachelor', 'edu_Master']])
y = df['wage']
model = sm.OLS(y, X).fit(cov_type='HC3')
print(model.summary().tables[1])

Coefficient Interpretation:
- Intercept (30127): Expected wage for high school graduates with zero experience
- Bachelor coefficient (5084): Wage premium for bachelor's degree relative to high school
- Master coefficient (10156): Wage premium for master's degree relative to high school
Significance Testing:
Hypothesis test: $H_0: \beta_{Bachelor} = 0$ (no bachelor's premium)
If the p-value is below 0.05, reject $H_0$; the bachelor's premium is statistically significant.
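As a quick check, statsmodels can run this single-coefficient test directly on the fitted model object from the regression above (a small sketch):
# t test of H0: beta_Bachelor = 0
t_result = model.t_test('edu_Bachelor = 0')
print(t_result)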
🔄 Interaction Terms
Theory: Heterogeneity in Marginal Effects
Research Question: Does the effect of education on income vary by gender?
Model Without Interaction (parallel slopes):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \varepsilon_i$$
Marginal Effect:
$$\frac{\partial \ln(wage)}{\partial educ} = \beta_1$$
Problem: Assumes the effect of education on wages does not vary with gender.
Model With Interaction (non-parallel slopes):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \beta_3 (educ_i \times female_i) + \varepsilon_i$$
Marginal Effect:
$$\frac{\partial \ln(wage)}{\partial educ} = \beta_1 + \beta_3 \, female_i$$
Interpretation: The effect of education on wages depends on the value of the gender dummy.
Mathematical Derivation: Angrist & Pischke (2009) Framework
Basic Model (education × gender interaction):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \beta_3 (educ_i \times female_i) + \varepsilon_i$$
Group-Specific Regressions:
Males ($female_i = 0$):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \varepsilon_i$$
Education return: $\beta_1$
Females ($female_i = 1$):
$$\ln(wage_i) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \, educ_i + \varepsilon_i$$
Education return: $\beta_1 + \beta_3$
Interaction Coefficient Interpretation:
If $\beta_3 < 0$, the education return for females is lower than for males (often read as evidence consistent with gender discrimination).
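This decomposition is easy to verify in code. The sketch below assumes a DataFrame df with a log_wage column, years of education in education, and a 0/1 female dummy; the pooled interaction model reproduces the two group-specific education returns.
import statsmodels.formula.api as smf
# Pooled model with the interaction
m_pool = smf.ols('log_wage ~ education * female', data=df).fit()
# Separate regressions by gender
m_male = smf.ols('log_wage ~ education', data=df[df['female'] == 0]).fit()
m_female = smf.ols('log_wage ~ education', data=df[df['female'] == 1]).fit()
# beta1 matches the male return; beta1 + beta3 matches the female return
print(m_pool.params['education'], m_male.params['education'])
print(m_pool.params['education'] + m_pool.params['education:female'], m_female.params['education'])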
Creating Interaction Terms
# (Illustrative: assumes education is measured in years and female/married are 0/1 dummies)
# 1. Continuous × Dummy
df['educ_x_female'] = df['education'] * df['female']
# 2. Continuous × Continuous
df['educ_x_experience'] = df['education'] * df['experience']
# 3. Dummy × Dummy
df['female_x_married'] = df['female'] * df['married']
# 4. Three-way interaction
df['educ_x_exp_x_female'] = df['education'] * df['experience'] * df['female']

Centering: Reducing Multicollinearity
Problem: $x_1$, $x_2$, and $x_1 \times x_2$ are often highly correlated
When both $x_1$ and $x_2$ are strictly positive, the correlation between each variable and the interaction can approach 1.
Solution: Centering
$$\tilde{x}_1 = x_1 - \bar{x}_1, \qquad \tilde{x}_2 = x_2 - \bar{x}_2$$
Centered Model:
$$y = \beta_0 + \beta_1 \tilde{x}_1 + \beta_2 \tilde{x}_2 + \beta_3 (\tilde{x}_1 \times \tilde{x}_2) + \varepsilon$$
Advantages:
- $\beta_1$ interpretation: Effect of $x_1$ on $y$ when $x_2$ is at its mean
- $\beta_0$ interpretation: Expected value of $y$ when both $x_1$ and $x_2$ are at their means
# Centering
df['educ_centered'] = df['education'] - df['education'].mean()
df['exp_centered'] = df['experience'] - df['experience'].mean()
# Create centered interaction
df['educ_x_exp_centered'] = df['educ_centered'] * df['exp_centered']
# Regression
X = sm.add_constant(df[['educ_centered', 'exp_centered', 'educ_x_exp_centered']])
y = df['log_wage']  # assumes log_wage = np.log(df['wage']) has been created
model = sm.OLS(y, X).fit()

Statistical Testing of Interactions
1. Wald Test
Test $H_0: \beta_3 = 0$ (no interaction effect). Under $H_0$, the Wald statistic $W = (\hat{\beta}_3 / SE(\hat{\beta}_3))^2$ follows a $\chi^2(1)$ distribution.
# Wald test
from scipy import stats
beta3 = model.params['educ_x_exp_centered']
se_beta3 = model.bse['educ_x_exp_centered']
wald_stat = (beta3 / se_beta3) ** 2
p_value = 1 - stats.chi2.cdf(wald_stat, df=1)
print(f"Wald statistic: {wald_stat:.4f}, p-value: {p_value:.4f}")2. F Test (Multiple Interactions)
Test $H_0$: all interaction coefficients equal zero
$$F = \frac{(RSS_r - RSS_u)/q}{RSS_u/(n - k - 1)} \sim F(q, \; n - k - 1)$$
Where:
- $RSS_r$: Residual sum of squares from the restricted model (without interactions)
- $RSS_u$: Residual sum of squares from the unrestricted model (with interactions)
- $q$: Number of interaction terms (restrictions)
- $k$: Number of regressors in the unrestricted model (excluding the constant)
# F test
# Restricted model
X_restricted = sm.add_constant(df[['educ_centered', 'exp_centered']])
model_r = sm.OLS(y, X_restricted).fit()
# Unrestricted model
X_unrestricted = sm.add_constant(df[['educ_centered', 'exp_centered', 'educ_x_exp_centered']])
model_u = sm.OLS(y, X_unrestricted).fit()
# F statistic
rss_r = model_r.ssr
rss_u = model_u.ssr
q = 1
n = len(df)
k = X_unrestricted.shape[1] - 1
f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - k - 1))
p_value = 1 - stats.f.cdf(f_stat, q, n - k - 1)
print(f"F statistic: {f_stat:.4f}, p-value: {p_value:.4f}")Visualizing Interaction Effects
import matplotlib.pyplot as plt
# Simulate data
np.random.seed(42)
educ = np.linspace(8, 20, 100)
# Predicted wages for males and females
wage_male = 8 + 0.08 * educ
wage_female = 8.2 + 0.06 * educ
# Plot
plt.figure(figsize=(10, 6))
plt.plot(educ, wage_male, label='Male', linewidth=2, color='blue')
plt.plot(educ, wage_female, label='Female', linewidth=2, color='red')
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Log Wage', fontsize=12)
plt.title('Gender Differences in Education-Income Relationship (Interaction Effect)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

⏱️ Lag Operators
Theory: Dynamic Panel Data Models
Lag Operator Definition:
$$L \, Y_t = Y_{t-1}, \qquad L^k Y_t = Y_{t-k}$$
Properties:
- Linearity: $L(aX_t + bY_t) = aX_{t-1} + bY_{t-1}$
- Multiplication: $L^j L^k Y_t = L^{j+k} Y_t = Y_{t-j-k}$
- Polynomial: $(1 - \phi L) Y_t = Y_t - \phi Y_{t-1}$
Difference Operator:
$$\Delta Y_t = (1 - L) Y_t = Y_t - Y_{t-1}$$
Applications:
- Unit root testing: $\Delta Y_t = \alpha + \rho Y_{t-1} + \varepsilon_t$ (ADF test; see the sketch below)
- Error correction models: $\Delta Y_t = \alpha \, \Delta X_t - \lambda (Y_{t-1} - \beta X_{t-1}) + \varepsilon_t$
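A minimal unit-root illustration for the ADF application above (a sketch, separate from the panel example): a simulated random walk typically fails to reject the unit root in levels but rejects it after first-differencing.
import numpy as np
from statsmodels.tsa.stattools import adfuller
rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))  # non-stationary in levels
p_level = adfuller(random_walk)[1]
p_diff = adfuller(np.diff(random_walk))[1]
print(f"ADF p-value (level): {p_level:.3f}")       # usually large: cannot reject the unit root
print(f"ADF p-value (difference): {p_diff:.3f}")   # usually near 0: stationary after differencing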
Dynamic Panel Models
Arellano-Bond Model:
$$Y_{it} = \rho \, Y_{i,t-1} + X_{it}'\beta + \alpha_i + \varepsilon_{it}$$
Problem: $Y_{i,t-1}$ is correlated with $\alpha_i$ (endogeneity)
Solution: First-differencing + GMM
Use $Y_{i,t-2}, Y_{i,t-3}, \dots$ as instrumental variables for $\Delta Y_{i,t-1}$.
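A minimal Anderson-Hsiao style sketch of this idea (illustration only: it assumes a long panel DataFrame named panel with columns 'id', 'year', and 'y', and the manual second-stage standard errors are not valid 2SLS standard errors):
import pandas as pd
import statsmodels.api as sm
panel = panel.sort_values(['id', 'year'])
panel['dy'] = panel.groupby('id')['y'].diff()           # ΔY_it
panel['dy_lag1'] = panel.groupby('id')['dy'].shift(1)   # ΔY_{i,t-1} (endogenous)
panel['y_lag2'] = panel.groupby('id')['y'].shift(2)     # Y_{i,t-2} (instrument)
est = panel.dropna(subset=['dy', 'dy_lag1', 'y_lag2']).copy()
# First stage: project the endogenous lagged difference on the instrument
first = sm.OLS(est['dy_lag1'], sm.add_constant(est['y_lag2'])).fit()
est['dy_lag1_hat'] = first.fittedvalues
# Second stage: point estimate of rho (standard errors here are not corrected)
second = sm.OLS(est['dy'], sm.add_constant(est['dy_lag1_hat'])).fit()
print(second.params)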
Creating Lags
# Example: panel data
df = pd.DataFrame({
'person_id': [1, 1, 1, 1, 2, 2, 2, 2],
'year': [2017, 2018, 2019, 2020, 2017, 2018, 2019, 2020],
'income': [50000, 52000, 55000, 58000, 60000, 62000, 65000, 68000]
})
# Sort by individual and time
df = df.sort_values(['person_id', 'year'])
# Create 1-period lag
df['income_lag1'] = df.groupby('person_id')['income'].shift(1)
# Create 2-period lag
df['income_lag2'] = df.groupby('person_id')['income'].shift(2)
# Create difference
df['income_diff'] = df.groupby('person_id')['income'].diff()
# Create log difference (growth rate)
df['log_income'] = np.log(df['income'])
df['income_growth'] = df.groupby('person_id')['log_income'].diff()
print(df)
# person_id year income income_lag1 income_lag2 income_diff log_income income_growth
# 0 1 2017 50000 NaN NaN NaN 10.819778 NaN
# 1 1 2018 52000 50000.0 NaN 2000.0 10.859073 0.039295
# 2 1 2019 55000 52000.0 50000.0 3000.0 10.914785 0.055713
# 3 1 2020 58000 55000.0 52000.0 3000.0 10.968557 0.053771
# 4 2 2017 60000 NaN NaN NaN 11.002100 NaN
# 5 2 2018 62000 60000.0 NaN 2000.0 11.034699 0.032599
# 6 2 2019 65000 62000.0 60000.0 3000.0 11.082414 0.047715
# 7 2 2020 68000 65000.0 62000.0 3000.0 11.127384 0.044970

Lead Variables
Application: Event studies
Test for anticipation effects before policy implementation:
$$Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k \, D_{it}^{k} + \varepsilon_{it}$$
Where $k$ indexes event time: $D_{it}^{k}$ equals 1 when unit $i$ is $k$ periods before ($k < 0$) or after ($k \geq 0$) the event.
# Create 1-period lead (future value)
df['income_lead1'] = df.groupby('person_id')['income'].shift(-1)
df['income_lead2'] = df.groupby('person_id')['income'].shift(-2)
# Event study: pre- and post-policy indicators (assumes a 0/1 'treated' column exists)
df['treat_post'] = ((df['year'] >= 2019) & (df['treated'] == 1)).astype(int)
df['treat_pre1'] = ((df['year'] == 2018) & (df['treated'] == 1)).astype(int)
df['treat_pre2'] = ((df['year'] == 2017) & (df['treated'] == 1)).astype(int)

📊 Group Statistics Variables
Theory: Controlling for Group Fixed Effects
Within-Group Variation:
$$Y_{ig} = \beta_0 + \beta_1 X_{ig} + \alpha_g + \varepsilon_{ig}$$
Where $\alpha_g$ represents group fixed effects.
Group Mean Centering (Within Transformation):
$$\tilde{Y}_{ig} = Y_{ig} - \bar{Y}_g, \qquad \tilde{X}_{ig} = X_{ig} - \bar{X}_g$$
Centered Model:
$$\tilde{Y}_{ig} = \beta_1 \tilde{X}_{ig} + \tilde{\varepsilon}_{ig}$$
The term $\alpha_g$ is eliminated, avoiding omitted variable bias from unobserved group-level factors.
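A quick sanity check of this result (a sketch; it assumes df contains income, education, and region columns): regressing group-demeaned variables reproduces the slope from a regression with explicit group dummies (LSDV).
import statsmodels.formula.api as smf
# LSDV: include region dummies explicitly
lsdv = smf.ols('income ~ education + C(region)', data=df).fit()
# Within estimator: demean income and education by region, regress without an intercept
df['income_w'] = df['income'] - df.groupby('region')['income'].transform('mean')
df['educ_w'] = df['education'] - df.groupby('region')['education'].transform('mean')
within = smf.ols('income_w ~ educ_w - 1', data=df).fit()
print(lsdv.params['education'], within.params['educ_w'])  # point estimates coincide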
Method 1: groupby + transform
# Calculate average income by region
df['region_mean_income'] = df.groupby('region')['income'].transform('mean')
# Calculate individual deviation from regional mean (centering)
df['income_deviation'] = df['income'] - df['region_mean_income']
# Other statistics
df['region_median_income'] = df.groupby('region')['income'].transform('median')
df['region_std_income'] = df.groupby('region')['income'].transform('std')
df['region_count'] = df.groupby('region')['income'].transform('count')
df['region_min_income'] = df.groupby('region')['income'].transform('min')
df['region_max_income'] = df.groupby('region')['income'].transform('max')
# Percentiles
df['region_p25_income'] = df.groupby('region')['income'].transform(lambda x: x.quantile(0.25))
df['region_p75_income'] = df.groupby('region')['income'].transform(lambda x: x.quantile(0.75))

Method 2: Ranking and Percentiles
# Individual's income rank within region (percentile)
df['income_pct_rank'] = df.groupby('region')['income'].transform(
lambda x: x.rank(pct=True)
)
# Interpretation: 0.75 means the individual's income exceeds that of 75% of people in the same region
# Quartile groups
df['income_quartile'] = df.groupby('region')['income'].transform(
lambda x: pd.qcut(x, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
)

Method 3: Rolling Window Statistics
# Time series: average income over the past 3 years
df = df.sort_values(['person_id', 'year'])
df['income_rolling_mean_3y'] = df.groupby('person_id')['income'].transform(
lambda x: x.rolling(window=3, min_periods=1).mean()
)
# Standard deviation over the past 3 years (income volatility)
df['income_rolling_std_3y'] = df.groupby('person_id')['income'].transform(
lambda x: x.rolling(window=3, min_periods=1).std()
)
# Growth rate over the past 3 years
df['income_growth_3y'] = df.groupby('person_id')['income'].pct_change(periods=3)

⚙️ Production-Level VariableConstructor Class
import pandas as pd
import numpy as np
from typing import List, Union, Tuple, Optional
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
class VariableConstructor:
"""
Production-level variable constructor - Nobel Prize Standards
Functions:
-----
1. Dummy variable creation
2. Interaction term construction
3. Lag operators
4. Group statistics variables
5. Mincer equation variable set
6. DID variable construction
7. Treatment effect decomposition
Author: StatsPai Team
Reference: Angrist & Pischke (2009), Wooldridge (2010)
"""
def __init__(self, df: pd.DataFrame, id_col: Optional[str] = None,
time_col: Optional[str] = None):
"""
Initialize variable constructor
Parameters:
------
df : pd.DataFrame
Raw data
id_col : str, optional
Individual identifier column for panel data
time_col : str, optional
Time identifier column for panel data
"""
self.df = df.copy()
self.id_col = id_col
self.time_col = time_col
self.constructed_vars = []
# Validate panel data structure
if id_col and time_col:
self._validate_panel_structure()
def _validate_panel_structure(self):
"""Validate panel data structure"""
if self.id_col not in self.df.columns:
raise ValueError(f"id_col '{self.id_col}' not in data")
if self.time_col not in self.df.columns:
raise ValueError(f"time_col '{self.time_col}' not in data")
# Check if balanced panel
counts = self.df.groupby(self.id_col)[self.time_col].count()
if counts.nunique() == 1:
print(f"✓ Balanced panel: {counts.iloc[0]} periods per individual")
else:
print(f"⚠️ Unbalanced panel: period count ranges {counts.min()} - {counts.max()}")
def create_dummies(self, var: str, prefix: Optional[str] = None,
drop_first: bool = True, ref_category: Optional[str] = None) -> pd.DataFrame:
"""
Create dummy variables
Parameters:
------
var : str
Categorical variable name
prefix : str, optional
Variable name prefix, defaults to var
drop_first : bool, default=True
Whether to drop the first category (avoid multicollinearity)
ref_category : str, optional
Specify reference group (takes precedence over drop_first)
Returns:
------
pd.DataFrame
Data with dummy variables added
Example:
------
>>> constructor.create_dummies('education', prefix='edu', ref_category='High School')
"""
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
prefix = prefix or var
if ref_category:
# Specify reference group
categories = self.df[var].unique()
for cat in categories:
if cat != ref_category:
dummy_name = f'{prefix}_{cat}'
self.df[dummy_name] = (self.df[var] == cat).astype(int)
self.constructed_vars.append(dummy_name)
print(f"✓ Created dummy variables: reference group = {ref_category}")
else:
# Use pd.get_dummies
dummies = pd.get_dummies(self.df[var], prefix=prefix, drop_first=drop_first)
for col in dummies.columns:
self.df[col] = dummies[col]
self.constructed_vars.append(col)
print(f"✓ Created dummy variables: {len(dummies.columns)} total")
return self.df
def create_interactions(self, var1: str, var2: str, center: bool = False,
name: Optional[str] = None) -> pd.DataFrame:
"""
Create interaction terms
Theory:
-----
For model: y = β₀ + β₁x₁ + β₂x₂ + β₃(x₁×x₂) + ε
Marginal effect:
∂y/∂x₁ = β₁ + β₃x₂
Parameters:
------
var1 : str
First variable
var2 : str
Second variable
center : bool, default=False
Whether to center (reduce multicollinearity)
name : str, optional
Interaction term name, defaults to 'var1_x_var2'
Returns:
------
pd.DataFrame
Data with interaction term added
Example:
------
>>> constructor.create_interactions('education', 'female', center=True)
"""
if var1 not in self.df.columns or var2 not in self.df.columns:
raise ValueError(f"Variable '{var1}' or '{var2}' not in data")
name = name or f'{var1}_x_{var2}'
if center:
# Center
var1_centered = self.df[var1] - self.df[var1].mean()
var2_centered = self.df[var2] - self.df[var2].mean()
self.df[name] = var1_centered * var2_centered
print(f"✓ Created centered interaction: {name}")
else:
self.df[name] = self.df[var1] * self.df[var2]
print(f"✓ Created interaction: {name}")
self.constructed_vars.append(name)
return self.df
def create_lags(self, var: str, lags: Union[int, List[int]] = 1,
fill_method: str = 'drop') -> pd.DataFrame:
"""
Create lag variables (panel data)
Requirements: Must set id_col and time_col
Math:
-----
L(Y_it, k) = Y_{i,t-k}
Parameters:
------
var : str
Variable name
lags : int or list of int
Lag periods, e.g., 1 or [1, 2, 3]
fill_method : {'drop', 'zero', 'ffill'}
Missing value handling method
Returns:
------
pd.DataFrame
Data with lag variables added
Example:
------
>>> constructor.create_lags('income', lags=[1, 2])
"""
if not self.id_col or not self.time_col:
raise ValueError("Creating lags requires id_col and time_col to be set")
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
# Ensure sorted
self.df = self.df.sort_values([self.id_col, self.time_col])
# Convert to list
if isinstance(lags, int):
lags = [lags]
for lag in lags:
lag_name = f'{var}_lag{lag}'
self.df[lag_name] = self.df.groupby(self.id_col)[var].shift(lag)
# Handle missing values
if fill_method == 'drop':
pass # Keep NaN, remove in subsequent analysis
elif fill_method == 'zero':
self.df[lag_name] = self.df[lag_name].fillna(0)
elif fill_method == 'ffill':
self.df[lag_name] = self.df.groupby(self.id_col)[lag_name].ffill()
self.constructed_vars.append(lag_name)
print(f"✓ Created lag variable: {lag_name}")
return self.df
def create_leads(self, var: str, leads: Union[int, List[int]] = 1) -> pd.DataFrame:
"""
Create lead variables (panel data)
Application: Event studies
Parameters:
------
var : str
Variable name
leads : int or list of int
Lead periods
Returns:
------
pd.DataFrame
Data with lead variables added
"""
if not self.id_col or not self.time_col:
raise ValueError("Creating leads requires id_col and time_col to be set")
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
self.df = self.df.sort_values([self.id_col, self.time_col])
if isinstance(leads, int):
leads = [leads]
for lead in leads:
lead_name = f'{var}_lead{lead}'
self.df[lead_name] = self.df.groupby(self.id_col)[var].shift(-lead)
self.constructed_vars.append(lead_name)
print(f"✓ Created lead variable: {lead_name}")
return self.df
def create_group_stats(self, var: str, groupby: Union[str, List[str]],
stats: List[str] = ['mean', 'std']) -> pd.DataFrame:
"""
Create group statistics variables
Examples:
-----
- Average wage by province
- Individual wage deviation from provincial mean
Parameters:
------
var : str
Variable name
groupby : str or list of str
Grouping variable(s)
stats : list of str
Statistics, options: 'mean', 'std', 'min', 'max', 'median', 'count'
Returns:
------
pd.DataFrame
Data with group statistics variables added
Example:
------
>>> constructor.create_group_stats('wage', 'region', stats=['mean', 'std'])
"""
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
if isinstance(groupby, str):
groupby = [groupby]
for gb in groupby:
if gb not in self.df.columns:
raise ValueError(f"Grouping variable '{gb}' not in data")
group_name = '_'.join(groupby)
for stat in stats:
stat_name = f'{group_name}_{stat}_{var}'
if stat == 'mean':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('mean')
elif stat == 'std':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('std')
elif stat == 'min':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('min')
elif stat == 'max':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('max')
elif stat == 'median':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('median')
elif stat == 'count':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('count')
else:
raise ValueError(f"Unknown statistic: {stat}")
self.constructed_vars.append(stat_name)
print(f"✓ Created group statistic: {stat_name}")
# Create deviation variable
if 'mean' in stats:
deviation_name = f'{var}_dev_{group_name}'
mean_name = f'{group_name}_mean_{var}'
self.df[deviation_name] = self.df[var] - self.df[mean_name]
self.constructed_vars.append(deviation_name)
print(f"✓ Created deviation variable: {deviation_name}")
return self.df
def mincer_equation_vars(self, wage_col: str, education_col: str,
age_col: str, start_age: int = 6) -> pd.DataFrame:
"""
Construct complete Mincer wage equation variables
Model:
-----
ln(wage) = β₀ + β₁·education + β₂·experience + β₃·experience² + ε
Required variables:
--------
- education: years of education
- age: age
- experience = age - education - start_age
- experience_sq = experience²
- log_wage = ln(wage)
Parameters:
------
wage_col : str
Wage variable name
education_col : str
Education years variable name
age_col : str
Age variable name
start_age : int, default=6
Age when schooling starts
Returns:
------
pd.DataFrame
Data with Mincer variables added
Example:
------
>>> constructor.mincer_equation_vars('wage', 'education', 'age')
"""
required_cols = [wage_col, education_col, age_col]
for col in required_cols:
if col not in self.df.columns:
raise ValueError(f"Variable '{col}' not in data")
# Create experience
self.df['experience'] = self.df[age_col] - self.df[education_col] - start_age
self.df['experience'] = self.df['experience'].clip(lower=0) # Ensure non-negative
# Create experience²
self.df['experience_sq'] = self.df['experience'] ** 2
# Create log(wage)
self.df['log_wage'] = np.log(self.df[wage_col])
mincer_vars = ['experience', 'experience_sq', 'log_wage']
self.constructed_vars.extend(mincer_vars)
print("✓ Created Mincer equation variables:")
print(f" - experience = {age_col} - {education_col} - {start_age}")
print(f" - experience_sq = experience²")
print(f" - log_wage = ln({wage_col})")
return self.df
def did_variables(self, treat_col: str, post_period: Union[int, str],
time_col: Optional[str] = None) -> pd.DataFrame:
"""
Construct DID variables
Variables:
-----
- treat: treatment group indicator (already exists)
- post: post-policy indicator
- treat_post: interaction term (DID estimator)
Parameters:
------
treat_col : str
Treatment group variable name
post_period : int or str
Policy implementation time point
time_col : str, optional
Time variable name, defaults to self.time_col
Returns:
------
pd.DataFrame
Data with DID variables added
Example:
------
>>> constructor.did_variables('treated', post_period=2010)
"""
if treat_col not in self.df.columns:
raise ValueError(f"Treatment variable '{treat_col}' not in data")
time_col = time_col or self.time_col
if not time_col:
raise ValueError("Must specify time_col")
if time_col not in self.df.columns:
raise ValueError(f"Time variable '{time_col}' not in data")
# Create post variable
self.df['post'] = (self.df[time_col] >= post_period).astype(int)
# Create DID interaction
self.df['treat_post'] = self.df[treat_col] * self.df['post']
did_vars = ['post', 'treat_post']
self.constructed_vars.extend(did_vars)
print("✓ Created DID variables:")
print(f" - post = ({time_col} >= {post_period})")
print(f" - treat_post = {treat_col} × post")
print(f"\nDID regression equation:")
print(f" Y = β₀ + β₁·{treat_col} + β₂·post + β₃·treat_post + ε")
print(f" β₃ = DID estimator (treatment effect)")
return self.df
def decompose_treatment_effect(self, outcome: str, treatment: str,
covariates: Optional[List[str]] = None) -> dict:
"""
Treatment effect decomposition
Theory:
-----
ATE = E[Y(1) - Y(0)]
= [E[Y|D=1] - E[Y|D=0]]
- {E[Y(0)|D=1] - E[Y(0)|D=0]}
= Observed Difference - Selection Bias
Parameters:
------
outcome : str
Outcome variable
treatment : str
Treatment variable (0/1)
covariates : list of str, optional
Control variables
Returns:
------
dict
Dictionary containing decomposition results
Example:
------
>>> results = constructor.decompose_treatment_effect('wage', 'college_degree')
"""
if outcome not in self.df.columns:
raise ValueError(f"Outcome variable '{outcome}' not in data")
if treatment not in self.df.columns:
raise ValueError(f"Treatment variable '{treatment}' not in data")
# Observed difference
treated = self.df[self.df[treatment] == 1][outcome]
control = self.df[self.df[treatment] == 0][outcome]
observed_diff = treated.mean() - control.mean()
# Regression adjustment (control for covariates)
if covariates:
X = sm.add_constant(self.df[[treatment] + covariates])
else:
X = sm.add_constant(self.df[[treatment]])
y = self.df[outcome]
model = sm.OLS(y, X).fit(cov_type='HC3')
ate = model.params[treatment]
se = model.bse[treatment]
# Selection bias
selection_bias = observed_diff - ate
results = {
'observed_difference': observed_diff,
'ate': ate,
'se': se,
'selection_bias': selection_bias,
't_stat': ate / se,
'p_value': model.pvalues[treatment],
'ci_lower': ate - 1.96 * se,
'ci_upper': ate + 1.96 * se
}
print("✓ Treatment effect decomposition:")
print(f" Observed difference: {observed_diff:.4f}")
print(f" ATE: {ate:.4f} (SE: {se:.4f})")
print(f" Selection bias: {selection_bias:.4f}")
print(f" 95% CI: [{results['ci_lower']:.4f}, {results['ci_upper']:.4f}]")
print(f" t-stat: {results['t_stat']:.4f}, p-value: {results['p_value']:.4f}")
return results
def get_constructed_vars(self) -> List[str]:
"""Return all constructed variable names"""
return self.constructed_vars
def summary(self):
"""Print variable construction summary"""
print("\n" + "="*60)
print("📊 Variable Construction Summary")
print("="*60)
print(f"Original variables: {len(self.df.columns) - len(self.constructed_vars)}")
print(f"Constructed variables: {len(self.constructed_vars)}")
print(f"Total variables: {len(self.df.columns)}")
if self.id_col and self.time_col:
n_id = self.df[self.id_col].nunique()
n_time = self.df[self.time_col].nunique()
print(f"\nPanel structure: {n_id} individuals × {n_time} periods = {len(self.df)} observations")
print(f"\nConstructed variables:")
for var in self.constructed_vars:
print(f" - {var}")
print("="*60)💼 Complete Case Study: NLSY97 Mincer Wage Equation
Case Background
Data: National Longitudinal Survey of Youth 1997 (NLSY97)
Research Questions:
- What is the return to education on wages?
- Does the return to education vary by gender?
- Does work experience have an inverted U-shaped effect on wages?
- How do regional and industry differences affect wages?
Data Simulation (Substitute for Real NLSY97)
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed
np.random.seed(2024)
# Simulate 5000 individuals
n = 5000
# Generate basic features
df = pd.DataFrame({
'person_id': range(1, n+1),
'age': np.random.uniform(25, 55, n),
'education': np.random.choice([12, 14, 16, 18, 20], n, p=[0.3, 0.2, 0.3, 0.15, 0.05]),
'female': np.random.binomial(1, 0.5, n),
'race': np.random.choice(['White', 'Black', 'Hispanic', 'Other'], n, p=[0.6, 0.15, 0.2, 0.05]),
'urban': np.random.binomial(1, 0.7, n),
'married': np.random.binomial(1, 0.55, n),
'region': np.random.choice(['Northeast', 'South', 'Midwest', 'West'], n, p=[0.18, 0.37, 0.21, 0.24]),
'industry': np.random.choice(['Manufacturing', 'Service', 'Tech', 'Finance', 'Other'], n,
p=[0.15, 0.35, 0.2, 0.15, 0.15])
})
# Calculate work experience
df['experience'] = df['age'] - df['education'] - 6
df['experience'] = df['experience'].clip(lower=0)
# Generate wages (based on real economic relationships)
log_wage = (
9.5 + # Base log wage
0.08 * df['education'] + # Education return (8%)
0.05 * df['experience'] + # Experience return
-0.0008 * df['experience']**2 + # Diminishing marginal return to experience
-0.15 * df['female'] + # Gender wage gap
-0.02 * df['education'] * df['female'] + # Gender difference in education return
0.10 * (df['race'] == 'White') + # Racial differences
-0.08 * (df['race'] == 'Black') +
-0.05 * (df['race'] == 'Hispanic') +
0.05 * df['urban'] + # Urban premium
0.08 * df['married'] + # Marriage premium
0.12 * (df['region'] == 'West') + # Regional differences
0.08 * (df['region'] == 'Northeast') +
-0.05 * (df['region'] == 'South') +
0.15 * (df['industry'] == 'Tech') + # Industry differences
0.12 * (df['industry'] == 'Finance') +
-0.08 * (df['industry'] == 'Service') +
np.random.normal(0, 0.3, n) # Error term
)
df['wage'] = np.exp(log_wage)
print("✓ Data generation complete")
print(f"Sample size: {len(df)}")
print(f"\nDescriptive statistics:")
print(df[['age', 'education', 'experience', 'wage']].describe())

Using VariableConstructor to Construct Variables
# Initialize constructor
constructor = VariableConstructor(df)
# 1. Create basic Mincer equation variables
constructor.mincer_equation_vars('wage', 'education', 'age', start_age=6)
# 2. Create dummy variables
constructor.create_dummies('race', prefix='race', ref_category='White')
constructor.create_dummies('region', prefix='region', ref_category='Northeast')
constructor.create_dummies('industry', prefix='industry', ref_category='Other')
# 3. Create interaction terms
constructor.create_interactions('education', 'female', center=False, name='educ_x_female')  # uncentered so coefficients decompose into male vs. female returns
constructor.create_interactions('experience', 'female', center=False, name='exp_x_female')
constructor.create_interactions('education', 'experience', center=True, name='educ_x_exp')
# 4. Create group statistics variables
constructor.create_group_stats('wage', 'region', stats=['mean', 'std'])
constructor.create_group_stats('wage', 'industry', stats=['mean', 'median'])
# 5. View constructed variables
print("\nConstructed variables:")
for var in constructor.get_constructed_vars():
print(f" - {var}")
# Get constructed data
df_constructed = constructor.df

Estimating the Mincer Wage Equation
# Model 1: Basic Mincer equation
X1 = sm.add_constant(df_constructed[['education', 'experience', 'experience_sq']])
y = df_constructed['log_wage']
model1 = sm.OLS(y, X1).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 1: Basic Mincer Wage Equation")
print("="*80)
print(model1.summary().tables[1])
# Model 2: Add demographic features
X2 = sm.add_constant(df_constructed[[
'education', 'experience', 'experience_sq',
'female', 'race_Black', 'race_Hispanic', 'race_Other',
'urban', 'married'
]])
model2 = sm.OLS(y, X2).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 2: With Demographic Features")
print("="*80)
print(model2.summary().tables[1])
# Model 3: Add interactions
X3 = sm.add_constant(df_constructed[[
'education', 'experience', 'experience_sq',
'female', 'race_Black', 'race_Hispanic', 'race_Other',
'urban', 'married',
'educ_x_female', 'exp_x_female'
]])
model3 = sm.OLS(y, X3).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 3: With Interactions (Gender Heterogeneity)")
print("="*80)
print(model3.summary().tables[1])
# Model 4: Full model (add region and industry)
X4 = sm.add_constant(df_constructed[[
'education', 'experience', 'experience_sq',
'female', 'race_Black', 'race_Hispanic', 'race_Other',
'urban', 'married',
'educ_x_female', 'exp_x_female',
'region_South', 'region_Midwest', 'region_West',
'industry_Manufacturing', 'industry_Service', 'industry_Tech', 'industry_Finance'
]])
model4 = sm.OLS(y, X4).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 4: Full Model (Region + Industry)")
print("="*80)
print(model4.summary().tables[1])

Results Interpretation
print("\n" + "="*80)
print("📊 Results Interpretation")
print("="*80)
# Education return
beta_educ = model4.params['education']
print(f"\n1. Return to Education")
print(f" - Each additional year of education increases wages by {beta_educ*100:.2f}%")
print(f" - 95% confidence interval: [{(beta_educ - 1.96*model4.bse['education'])*100:.2f}%, "
f"{(beta_educ + 1.96*model4.bse['education'])*100:.2f}%]")
# Gender wage gap
beta_female = model4.params['female']
beta_educ_female = model4.params['educ_x_female']
mean_educ = df_constructed['education'].mean()
gap_at_mean_educ = beta_female + beta_educ_female * mean_educ  # gap evaluated at average education
print(f"\n2. Gender Wage Gap")
print(f" - At the average education level, female wages are approximately {abs(gap_at_mean_educ)*100:.2f}% lower than male wages")
print(f" - p-value (female main effect): {model4.pvalues['female']:.4f}")
# Gender difference in education return
print(f"\n3. Gender Difference in Education Return")
print(f" - Male education return: {beta_educ*100:.2f}%")
print(f" - Female education return: {(beta_educ + beta_educ_female)*100:.2f}%")
print(f" - Difference: {beta_educ_female*100:.2f}% (lower for females)")
# Inverted U-shaped effect of experience
beta_exp = model4.params['experience']
beta_exp_sq = model4.params['experience_sq']
optimal_exp = -beta_exp / (2 * beta_exp_sq)
print(f"\n4. Inverted U-shaped Effect of Experience")
print(f" - Experience coefficient: {beta_exp:.6f}")
print(f" - Experience squared coefficient: {beta_exp_sq:.6f}")
print(f" - Optimal experience years: {optimal_exp:.1f} years")
# Regional differences
print(f"\n5. Regional Wage Differences (relative to Northeast)")
for region in ['South', 'Midwest', 'West']:
var = f'region_{region}'
if var in model4.params.index:
print(f" - {region}: {model4.params[var]*100:.2f}% ({model4.pvalues[var]:.4f})")
# Industry differences
print(f"\n6. Industry Wage Differences (relative to Other)")
for industry in ['Manufacturing', 'Service', 'Tech', 'Finance']:
var = f'industry_{industry}'
if var in model4.params.index:
print(f" - {industry}: {model4.params[var]*100:.2f}% ({model4.pvalues[var]:.4f})")
# Model fit
print(f"\n7. Model Fit")
print(f" - R²: {model4.rsquared:.4f}")
print(f" - Adjusted R²: {model4.rsquared_adj:.4f}")
print(f" - F-statistic: {model4.fvalue:.2f} (p-value: {model4.f_pvalue:.4e})")Visualization Analysis
# 1. Education-wage relationship (by gender)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for gender, label in [(0, 'Male'), (1, 'Female')]:
subset = df_constructed[df_constructed['female'] == gender]
avg_wage = subset.groupby('education')['wage'].mean()
plt.plot(avg_wage.index, avg_wage.values, marker='o', label=label, linewidth=2)
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Average Wage ($)', fontsize=12)
plt.title('Education-Wage Relationship (by Gender)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
# 2. Experience-wage relationship (inverted U)
plt.subplot(1, 2, 2)
exp_range = np.linspace(0, 40, 100)
# Predicted log wage holding education at its mean and the remaining covariates at zero (shape illustration only)
log_wage_pred = (model4.params['const'] +
model4.params['education'] * df_constructed['education'].mean() +
model4.params['experience'] * exp_range +
model4.params['experience_sq'] * exp_range**2)
wage_pred = np.exp(log_wage_pred)
plt.plot(exp_range, wage_pred, linewidth=2, color='green')
plt.axvline(optimal_exp, color='red', linestyle='--', label=f'Optimal experience = {optimal_exp:.1f} years')
plt.xlabel('Work Experience (years)', fontsize=12)
plt.ylabel('Predicted Wage ($)', fontsize=12)
plt.title('Experience-Wage Relationship (Inverted U)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# 3. Regional and industry wage distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Region
df_constructed.boxplot(column='wage', by='region', ax=axes[0])
axes[0].set_title('Wage Distribution by Region', fontsize=14)
axes[0].set_xlabel('Region', fontsize=12)
axes[0].set_ylabel('Wage ($)', fontsize=12)
plt.sca(axes[0])
plt.xticks(rotation=45)
# Industry
df_constructed.boxplot(column='wage', by='industry', ax=axes[1])
axes[1].set_title('Wage Distribution by Industry', fontsize=14)
axes[1].set_xlabel('Industry', fontsize=12)
axes[1].set_ylabel('Wage ($)', fontsize=12)
plt.sca(axes[1])
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# 4. Residual diagnostics
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs fitted values
axes[0, 0].scatter(model4.fittedvalues, model4.resid, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values', fontsize=12)
axes[0, 0].set_ylabel('Residuals', fontsize=12)
axes[0, 0].set_title('Residuals vs Fitted Values', fontsize=14)
axes[0, 0].grid(alpha=0.3)
# QQ plot
from scipy import stats as sp_stats
sp_stats.probplot(model4.resid, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('QQ Plot (Normality Test)', fontsize=14)
axes[0, 1].grid(alpha=0.3)
# Residual histogram
axes[1, 0].hist(model4.resid, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals', fontsize=12)
axes[1, 0].set_ylabel('Frequency', fontsize=12)
axes[1, 0].set_title('Residual Distribution', fontsize=14)
axes[1, 0].grid(alpha=0.3)
# Standardized residuals
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(model4)
standardized_resid = influence.resid_studentized_internal
axes[1, 1].scatter(range(len(standardized_resid)), standardized_resid, alpha=0.5)
axes[1, 1].axhline(y=3, color='r', linestyle='--', label='±3σ')
axes[1, 1].axhline(y=-3, color='r', linestyle='--')
axes[1, 1].set_xlabel('Observation Index', fontsize=12)
axes[1, 1].set_ylabel('Standardized Residuals', fontsize=12)
axes[1, 1].set_title('Standardized Residuals (Outlier Detection)', fontsize=14)
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
plt.show()

Heterogeneity Analysis: Regional Differences in Education Return
# Estimate education return by region
print("\n" + "="*80)
print("📊 Heterogeneity Analysis: Regional Differences in Education Return")
print("="*80)
results_by_region = {}
for region in df_constructed['region'].unique():
subset = df_constructed[df_constructed['region'] == region]
X_subset = sm.add_constant(subset[['education', 'experience', 'experience_sq', 'female']])
y_subset = subset['log_wage']
model_subset = sm.OLS(y_subset, X_subset).fit(cov_type='HC3')
educ_return = model_subset.params['education']
educ_se = model_subset.bse['education']
results_by_region[region] = {
'return': educ_return,
'se': educ_se,
'ci_lower': educ_return - 1.96 * educ_se,
'ci_upper': educ_return + 1.96 * educ_se
}
print(f"\n{region}:")
print(f" Education return: {educ_return*100:.2f}% (SE: {educ_se*100:.2f}%)")
print(f" 95% CI: [{results_by_region[region]['ci_lower']*100:.2f}%, "
f"{results_by_region[region]['ci_upper']*100:.2f}%]")
# Visualize regional differences
fig, ax = plt.subplots(figsize=(10, 6))
regions = list(results_by_region.keys())
returns = [results_by_region[r]['return'] * 100 for r in regions]
errors = [1.96 * results_by_region[r]['se'] * 100 for r in regions]
ax.barh(regions, returns, xerr=errors, capsize=5, color='skyblue', edgecolor='black')
ax.set_xlabel('Education Return (%)', fontsize=12)
ax.set_ylabel('Region', fontsize=12)
ax.set_title('Regional Differences in Education Return (95% CI)', fontsize=14)
ax.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

🔧 Other Common Variable Constructions
1. Polynomial Terms (Non-linear Relationships)
# Study non-linear relationship: income = f(experience, experience²)
df['experience_sq'] = df['experience'] ** 2
df['experience_cube'] = df['experience'] ** 3
# Identify optimal point
# ∂wage/∂experience = β₁ + 2β₂·experience = 0
# experience* = -β₁ / (2β₂)

2. Ratio Variables
# Calculate ratios
df['income_per_capita'] = df['household_income'] / df['household_size']
df['debt_to_income'] = df['debt'] / df['income']
df['saving_rate'] = df['saving'] / df['income']
# Log ratios
df['log_debt_to_income'] = np.log(df['debt'] / df['income'])

3. Log Difference (Growth Rates)
# Log difference ≈ growth rate
df['log_income'] = np.log(df['income'])
df['income_growth'] = df.groupby('person_id')['log_income'].diff()
# Interpretation: income_growth = 0.05 → income grows by approximately 5%
# Verification:
# ln(Y_t) - ln(Y_{t-1}) = ln(Y_t / Y_{t-1}) ≈ (Y_t - Y_{t-1}) / Y_{t-1}

4. Standardized Variables (Z-score)
from scipy import stats
# Z-score standardization (population)
df['income_std'] = stats.zscore(df['income'])
# Within-group standardization
df['income_std_within'] = df.groupby('region')['income'].transform(
lambda x: stats.zscore(x)
)
# Interpretation: income_std = 2 → individual's income is 2 standard deviations above the mean

📝 Practice Exercises
⭐⭐ Exercise 1: Create Education Level Dummy Variables
Problem: Given an education years variable, create education level dummy variables (high school, bachelor's, graduate), with high school as the reference group.
# Data
df = pd.DataFrame({
'id': range(1, 11),
'education_years': [12, 16, 18, 12, 14, 16, 20, 12, 16, 14]
})
# Task: Create dummy variables
# High school: 12 years
# Bachelor's: 14-16 years
# Graduate: 18-20 years

Hint: Use pd.cut() to group, then pd.get_dummies()
⭐⭐⭐ Exercise 2: Construct Interactions and Calculate Marginal Effects
Problem: Estimate the model
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \beta_3 (educ_i \times female_i) + \varepsilon_i$$
- Create the interaction term $educ_i \times female_i$
- Run regression
- Calculate education return for males and females
- Test whether the difference in returns is significant (Wald test)
⭐⭐⭐⭐ Exercise 3: Dynamic Panel Model
Problem: Use the Arellano-Bond method to estimate a dynamic panel model
$$Y_{it} = \rho \, Y_{i,t-1} + X_{it}'\beta + \alpha_i + \varepsilon_{it}$$
- Create the lag $Y_{i,t-1}$
- First-difference to eliminate $\alpha_i$
- Use $Y_{i,t-2}$ as an instrumental variable for $\Delta Y_{i,t-1}$
- GMM estimation
⭐⭐⭐⭐⭐ Exercise 4: Heterogeneous Treatment Effect Decomposition
Problem: Analyze the treatment effect of a college degree on wages, and decompose it into:
- ATE (Average Treatment Effect): $E[Y(1) - Y(0)]$
- ATT (Average Treatment effect on the Treated): $E[Y(1) - Y(0) \mid D = 1]$
- ATE vs ATT difference: evidence of selection bias
Use Propensity Score Matching for sensitivity analysis.
📚 Summary
Core Functions Summary
| Task | Method | Mathematical Representation | Application |
|---|---|---|---|
| Dummy Variables | pd.get_dummies() | $D_{ij} = \mathbb{1}[X_i = j]$ | Numerical encoding of categorical variables |
| Interactions | df['A'] * df['B'] | $x_1 \times x_2$ | Effect heterogeneity |
| Lags | groupby().shift(k) | $L^k Y_{it} = Y_{i,t-k}$ | Dynamic models |
| Group Stats | groupby().transform() | $\bar{Y}_g$, $Y_{ig} - \bar{Y}_g$ | Control for group effects |
| Polynomials | df['x'] ** 2 | $x, x^2, x^3$ | Non-linear relationships |
Best Practices
- ✅ Understand reference groups: Clearly specify reference group in dummy variable regressions
- ✅ Avoid multicollinearity: Use drop_first=True or specify a reference group
- ✅ Center interactions: Improve interpretability and reduce collinearity
- ✅ Sort panel data: Sort by id and time before creating lags
- ✅ Naming conventions: e.g., income_lag1, edu_x_female, region_mean_wage
- ✅ Documentation: Record construction logic and theoretical basis for each variable
- ✅ Validation: Check distribution and missing values of constructed variables
Theoretical Key Points
- Dummy variable trap: Including all $k$ dummies together with an intercept causes perfect collinearity
- Interaction interpretation: $\beta_3$ represents how the marginal effect of $x_1$ on $y$ varies with $x_2$
- Lag operator properties: $L^k Y_t = Y_{t-k}$, $\Delta = 1 - L$
- Within transformation: $Y_{ig} - \bar{Y}_g$ eliminates the group fixed effect $\alpha_g$
- Mincer equation: $\ln(wage) = \beta_0 + \beta_1 \, educ + \beta_2 \, exp + \beta_3 \, exp^2 + \varepsilon$
🎯 Next Steps
Next section: Data Transformation (standardization, log transformation, Box-Cox transformation, binning)
📖 References
Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
- Chapter 3: Making Regression Make Sense
- Interactions and heterogeneous effects
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.
- Chapter 10: Basic Linear Unobserved Effects Panel Data Models
- Dynamic panel models
Mincer, J. (1974). Schooling, Experience, and Earnings. NBER.
- Original reference for the Mincer wage equation
Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. The Review of Economic Studies, 58(2), 277-297.
- Dynamic panel GMM estimation
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.
- Chapter 21: Panel Data Models
- Chapter 25: Treatment Evaluation
Version: StatsPai Module 3.4 - Nobel Prize Standards | Last Updated: 2024-01-15 | Author: StatsPai Team | License: CC BY-NC-SA 4.0