
3.4 Variable Construction

"Feature engineering is the key to machine learning success.""特征工程是机器学习成功的关键。"— Pedro Domingos, Professor at University of Washington (华盛顿大学教授)

From raw variables to analysis-ready data


📋 Learning Objectives

  • Master the creation and theory of dummy variables
  • Deeply understand interaction terms and marginal effects
  • Learn lag operators in dynamic panel data
  • Master complete variable construction for the Mincer wage equation
  • Understand DID variables and treatment effect decomposition
  • Build a production-level VariableConstructor class

📐 Theoretical Foundation

The Role of Variable Construction in Causal Inference

Angrist & Pischke (2009) emphasize:

"Good econometrics is about good research design, and good research design requires careful variable construction."

Three Key Functions:

  1. Confounding Control

    • Time fixed effects δₜ: controlled with year dummy variables
    • Individual fixed effects αᵢ: removed by group mean centering
  2. Capturing Heterogeneous Effects

    • The interaction term D×X captures how the treatment effect varies with a covariate X
  3. Constructing Instrumental Variables

    • Lagged values such as Yᵢ,ₜ₋₂ serve as instruments for endogenous regressors in dynamic panels (Arellano & Bond, 1991)


🎯 Dummy Variables

Theory: Numerical Encoding of Categorical Variables

Mathematical Representation:

Let a categorical variable C have K categories. Create K−1 dummy variables:

Dₖ = 1 if C = k, else 0, for k = 2, …, K

Regression Model:

Y = β₀ + Σₖ βₖDₖ + ε  (sum over k = 2, …, K)

Coefficient Interpretation:

  • β₀: Expected value for the reference group (k = 1): E[Y | C = 1]
  • βₖ: Difference between category k and the reference group: E[Y | C = k] − E[Y | C = 1]

The Dummy Variable Trap (Multicollinearity)

Perfect Collinearity:

If all K dummy variables are included, they sum to one for every observation:

Σₖ Dₖ = 1, k = 1, …, K

This creates perfect collinearity with the intercept term (which equals 1 for all observations), making the design matrix non-invertible.

Solution: Drop one category as the reference group
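
A quick numerical check of the trap (a minimal sketch with numpy; the design matrix is built by hand for a 3-category variable):

python
import numpy as np

# Intercept column plus ALL three category dummies
X = np.array([
    [1, 1, 0, 0],   # observation in category A
    [1, 0, 1, 0],   # category B
    [1, 0, 0, 1],   # category C
    [1, 1, 0, 0],   # category A again
])

# The dummies sum to the intercept column, so the matrix loses a rank
print(np.linalg.matrix_rank(X))   # 3, not 4
print(np.linalg.det(X.T @ X))     # ≈ 0: X'X is singular, OLS has no unique solution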

Method 1: pd.get_dummies()

python
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'education': ['High School', 'Bachelor', 'Master', 'High School', 'PhD']
})

# Create dummy variables
dummies = pd.get_dummies(df['education'], prefix='edu', drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df)
#    id    education  edu_High School  edu_Master  edu_PhD
# 0   1  High School                1           0        0
# 1   2     Bachelor                0           0        0
# 2   3       Master                0           1        0
# 3   4  High School                1           0        0
# 4   5          PhD                0           0        1

Parameter Explanation:

  • prefix='edu': Variable name prefix
  • drop_first=True: Drop the first category (alphabetically) as the reference group
  • dtype=int: Return 0/1 integers (recent pandas versions default to booleans)
  • Reference group: Bachelor, the alphabetically first category (all dummy variables equal 0); use the manual method below to choose a different reference

Method 2: Manual Construction (More Flexible)

python
# Create single dummy variables
# (illustrative snippets: assume df also has 'gender', 'income', 'age', 'urban' columns)
df['college'] = (df['education'] == 'Bachelor').astype(int)
df['female'] = (df['gender'] == 'Female').astype(int)
df['high_income'] = (df['income'] > df['income'].median()).astype(int)

# Multiple conditions (cross-classification)
df['young_female'] = ((df['age'] < 30) & (df['gender'] == 'Female')).astype(int)
df['urban_college'] = ((df['urban'] == 1) & (df['education'] == 'Bachelor')).astype(int)

# Specify reference group
def create_education_dummies(education_series, ref='High School'):
    """Create education dummy variables with specified reference group"""
    categories = education_series.unique()
    dummies = {}
    for cat in categories:
        if cat != ref:
            dummies[f'edu_{cat}'] = (education_series == cat).astype(int)
    return pd.DataFrame(dummies)

edu_dummies = create_education_dummies(df['education'], ref='High School')
df = pd.concat([df, edu_dummies], axis=1)

Regression and Interpretation

python
import statsmodels.api as sm

# Simulate wage data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'education': np.random.choice(['High School', 'Bachelor', 'Master'], n),
    'experience': np.random.uniform(0, 30, n),
    'age': np.random.uniform(22, 65, n)
})

# Generate wages (true model)
df['wage'] = (30000 +
              5000 * (df['education'] == 'Bachelor') +
              10000 * (df['education'] == 'Master') +
              1000 * df['experience'] +
              np.random.normal(0, 5000, n))

# Create dummy variables (High School as the reference group)
edu_dummies = pd.get_dummies(df['education'], prefix='edu', dtype=int)
edu_dummies = edu_dummies.drop(columns='edu_High School')
df = pd.concat([df, edu_dummies], axis=1)

# OLS regression
X = sm.add_constant(df[['experience', 'edu_Bachelor', 'edu_Master']])
y = df['wage']
model = sm.OLS(y, X).fit(cov_type='HC3')

print(model.summary().tables[1])

Coefficient Interpretation:

  • Intercept (30127): Expected wage for high school graduates with zero experience
  • Bachelor coefficient (5084): Wage premium for bachelor's degree relative to high school
  • Master coefficient (10156): Wage premium for master's degree relative to high school

Significance Testing:

Hypothesis test: H₀: β_Bachelor = 0 (no bachelor's premium)

If |t| = |β̂_Bachelor / SE(β̂_Bachelor)| > 1.96, reject H₀ at the 5% level; the bachelor's premium is significant.
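
In code, this test can be read directly off the model fitted above:

python
# t statistic and p-value for the bachelor's premium
t_stat = model.params['edu_Bachelor'] / model.bse['edu_Bachelor']
print(f"t = {t_stat:.2f}")
print(f"p = {model.pvalues['edu_Bachelor']:.4f}")  # statsmodels computes the same test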


🔄 Interaction Terms

Theory: Heterogeneity in Marginal Effects

Research Question: Does the effect of education on income vary by gender?

Model Without Interaction (parallel slopes):

Y = β₀ + β₁X₁ + β₂X₂ + ε

Marginal Effect:

∂Y/∂X₁ = β₁ (a constant)

Problem: Assumes the effect of X₁ on Y does not vary with X₂

Model With Interaction (non-parallel slopes):

Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁×X₂) + ε

Marginal Effect:

∂Y/∂X₁ = β₁ + β₃X₂

Interpretation: The effect of X₁ on Y depends on the value of X₂

Mathematical Derivation: Angrist & Pischke (2009) Framework

Basic Model (education × gender interaction):

ln(wage) = β₀ + β₁·educ + β₂·female + β₃·(educ×female) + ε

Group-Specific Regressions:

  • Males (female = 0): ln(wage) = β₀ + β₁·educ + ε

    Education return: β₁

  • Females (female = 1): ln(wage) = (β₀ + β₂) + (β₁ + β₃)·educ + ε

    Education return: β₁ + β₃

Interaction Coefficient Interpretation:

If β₃ < 0, the education return for females is lower than for males (consistent with gender discrimination in returns).
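
Both returns, and the standard error of their difference's sum, can be recovered from one fitted model. A minimal sketch, assuming a fitted statsmodels result `model` whose coefficients are named 'education' and 'educ_x_female' (an uncentered interaction, as created in the code below):

python
import numpy as np

b = model.params
V = model.cov_params()  # coefficient covariance matrix

return_male = b['education']                          # β₁
return_female = b['education'] + b['educ_x_female']   # β₁ + β₃

# SE of (β₁ + β₃) via Var(β₁) + Var(β₃) + 2·Cov(β₁, β₃)
var_sum = (V.loc['education', 'education']
           + V.loc['educ_x_female', 'educ_x_female']
           + 2 * V.loc['education', 'educ_x_female'])
print(f"Male return: {return_male:.4f}")
print(f"Female return: {return_female:.4f} (SE: {np.sqrt(var_sum):.4f})")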

Creating Interaction Terms

python
# 1. Continuous × Dummy
# (assumes education is years of schooling and female/married are 0/1 dummies)
df['educ_x_female'] = df['education'] * df['female']

# 2. Continuous × Continuous
df['educ_x_experience'] = df['education'] * df['experience']

# 3. Dummy × Dummy
df['female_x_married'] = df['female'] * df['married']

# 4. Three-way interaction
df['educ_x_exp_x_female'] = df['education'] * df['experience'] * df['female']

Centering: Reducing Multicollinearity

Problem: and are highly correlated

When both and are positive, the correlation approaches 1.

Solution: Centering

Centered Model:

Advantages:

  • interpretation: Effect of on when is at its mean
  • interpretation: Expected value of when both and are at their means
python
# Centering
df['educ_centered'] = df['education'] - df['education'].mean()
df['exp_centered'] = df['experience'] - df['experience'].mean()

# Create centered interaction
df['educ_x_exp_centered'] = df['educ_centered'] * df['exp_centered']

# Regression (assumes df already has a log_wage column, e.g., np.log(df['wage']))
X = sm.add_constant(df[['educ_centered', 'exp_centered', 'educ_x_exp_centered']])
y = df['log_wage']
model = sm.OLS(y, X).fit()

Statistical Testing of Interactions

1. Wald Test

Test H₀: β₃ = 0 (no interaction effect)

W = (β̂₃ / SE(β̂₃))² ~ χ²(1) under H₀

python
# Wald test
from scipy import stats

beta3 = model.params['educ_x_exp_centered']
se_beta3 = model.bse['educ_x_exp_centered']
wald_stat = (beta3 / se_beta3) ** 2
p_value = 1 - stats.chi2.cdf(wald_stat, df=1)

print(f"Wald statistic: {wald_stat:.4f}, p-value: {p_value:.4f}")

2. F Test (Multiple Interactions)

Test H₀: all q interaction coefficients are zero

F = [(RSS_r − RSS_u) / q] / [RSS_u / (n − k − 1)]

Where:

  • RSS_r: Residual sum of squares from the restricted model (without interactions)
  • RSS_u: Residual sum of squares from the unrestricted model (with interactions)
  • q: Number of interaction terms
python
# F test

# Restricted model
X_restricted = sm.add_constant(df[['educ_centered', 'exp_centered']])
model_r = sm.OLS(y, X_restricted).fit()

# Unrestricted model
X_unrestricted = sm.add_constant(df[['educ_centered', 'exp_centered', 'educ_x_exp_centered']])
model_u = sm.OLS(y, X_unrestricted).fit()

# F statistic
rss_r = model_r.ssr
rss_u = model_u.ssr
q = 1
n = len(df)
k = X_unrestricted.shape[1] - 1

f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - k - 1))
p_value = 1 - stats.f.cdf(f_stat, q, n - k - 1)

print(f"F statistic: {f_stat:.4f}, p-value: {p_value:.4f}")

Visualizing Interaction Effects

python
import matplotlib.pyplot as plt

# Simulate data
np.random.seed(42)
educ = np.linspace(8, 20, 100)

# Predicted wages for males and females
wage_male = 8 + 0.08 * educ
wage_female = 8.2 + 0.06 * educ

# Plot
plt.figure(figsize=(10, 6))
plt.plot(educ, wage_male, label='Male', linewidth=2, color='blue')
plt.plot(educ, wage_female, label='Female', linewidth=2, color='red')
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Log Wage', fontsize=12)
plt.title('Gender Differences in Education-Income Relationship (Interaction Effect)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

⏱️ Lag Operators

Theory: Dynamic Panel Data Models

Lag Operator Definition:

L·Yₜ = Yₜ₋₁,  Lᵏ·Yₜ = Yₜ₋ₖ

Properties:

  1. Linearity: L(aXₜ + bYₜ) = aXₜ₋₁ + bYₜ₋₁
  2. Multiplication: Lʲ·Lᵏ = Lʲ⁺ᵏ
  3. Polynomial: (1 − φL)Yₜ = Yₜ − φYₜ₋₁

Difference Operator:

ΔYₜ = Yₜ − Yₜ₋₁ = (1 − L)Yₜ

Applications:

  • Unit root testing: ΔYₜ = α + ρYₜ₋₁ + εₜ, test ρ = 0 (ADF test; see the sketch below)
  • Error correction models: ΔYₜ = α₀ + α₁ΔXₜ + γ(Yₜ₋₁ − θXₜ₋₁) + εₜ
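
As a small illustration of the unit-root application, statsmodels' adfuller runs the ADF regression above (a sketch on simulated series):

python
import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(42)
random_walk = np.cumsum(np.random.normal(size=500))  # has a unit root
white_noise = np.random.normal(size=500)             # stationary

for name, series in [('random walk', random_walk), ('white noise', white_noise)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF stat = {stat:.2f}, p = {pvalue:.3f}")
# Expect: cannot reject the unit root for the random walk; reject it for white noise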

Dynamic Panel Models

Arellano-Bond Model:

Yᵢₜ = ρYᵢ,ₜ₋₁ + β·Xᵢₜ + αᵢ + εᵢₜ

Problem: Yᵢ,ₜ₋₁ is correlated with αᵢ (endogeneity)

Solution: First-differencing + GMM

First-differencing eliminates αᵢ; then use Yᵢ,ₜ₋₂, Yᵢ,ₜ₋₃, … as instrumental variables for ΔYᵢ,ₜ₋₁.
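
A minimal sketch of the first-difference IV idea (the Anderson-Hsiao estimator, the 2SLS precursor to Arellano-Bond GMM) on a simulated panel; the hand-rolled second stage gives the point estimate only, since its printed standard errors are not valid 2SLS standard errors:

python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulate a dynamic panel: y_it = 0.5·y_{i,t-1} + α_i + ε_it
rng = np.random.default_rng(1)
rows = []
for i in range(200):
    alpha, y = rng.normal(), 0.0
    for t in range(10):
        y = 0.5 * y + alpha + rng.normal()
        rows.append((i, t, y))
panel = pd.DataFrame(rows, columns=['person_id', 'year', 'y'])

panel = panel.sort_values(['person_id', 'year'])
g = panel.groupby('person_id')['y']
panel['dy'] = g.diff()                      # ΔY_it
panel['dy_lag1'] = g.shift(1) - g.shift(2)  # ΔY_{i,t-1} (endogenous)
panel['y_lag2'] = g.shift(2)                # Y_{i,t-2} (instrument)

sub = panel.dropna(subset=['dy', 'dy_lag1', 'y_lag2']).copy()

# Stage 1: project the endogenous regressor on the instrument
stage1 = sm.OLS(sub['dy_lag1'], sm.add_constant(sub['y_lag2'])).fit()
sub['dy_lag1_hat'] = stage1.fittedvalues

# Stage 2: the slope on the fitted values estimates ρ (true value 0.5)
stage2 = sm.OLS(sub['dy'], sm.add_constant(sub['dy_lag1_hat'])).fit()
print(stage2.params['dy_lag1_hat'])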

Creating Lags

python
# Example: panel data
df = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'year': [2017, 2018, 2019, 2020, 2017, 2018, 2019, 2020],
    'income': [50000, 52000, 55000, 58000, 60000, 62000, 65000, 68000]
})

# Sort by individual and time
df = df.sort_values(['person_id', 'year'])

# Create 1-period lag
df['income_lag1'] = df.groupby('person_id')['income'].shift(1)

# Create 2-period lag
df['income_lag2'] = df.groupby('person_id')['income'].shift(2)

# Create difference
df['income_diff'] = df.groupby('person_id')['income'].diff()

# Create log difference (growth rate)
df['log_income'] = np.log(df['income'])
df['income_growth'] = df.groupby('person_id')['log_income'].diff()

print(df)
#    person_id  year  income  income_lag1  income_lag2  income_diff  log_income  income_growth
# 0          1  2017   50000          NaN          NaN          NaN   10.819778            NaN
# 1          1  2018   52000      50000.0          NaN       2000.0   10.859073       0.039295
# 2          1  2019   55000      52000.0      50000.0       3000.0   10.914785       0.055713
# 3          1  2020   58000      55000.0      52000.0       3000.0   10.968557       0.053771
# 4          2  2017   60000          NaN          NaN          NaN   11.002100            NaN
# 5          2  2018   62000      60000.0          NaN       2000.0   11.034699       0.032599
# 6          2  2019   65000      62000.0      60000.0       3000.0   11.082414       0.047715
# 7          2  2020   68000      65000.0      62000.0       3000.0   11.127384       0.044970

Lead Variables

Application: Event studies

Test for anticipation effects before policy implementation:

Yᵢₜ = αᵢ + δₜ + Σₖ βₖ·Dᵢₜ⁽ᵏ⁾ + εᵢₜ

Where k indexes periods before (k < 0) or after (k ≥ 0) the event, and Dᵢₜ⁽ᵏ⁾ = 1 when period t is k periods from individual i's event.

python
# Create 1-period lead (future value)
df['income_lead1'] = df.groupby('person_id')['income'].shift(-1)
df['income_lead2'] = df.groupby('person_id')['income'].shift(-2)

# Event study: pre- and post-policy indicators
# (assumes df has a 0/1 'treated' column marking the treatment group)
df['treat_post'] = ((df['year'] >= 2019) & (df['treated'] == 1)).astype(int)
df['treat_pre1'] = ((df['year'] == 2018) & (df['treated'] == 1)).astype(int)
df['treat_pre2'] = ((df['year'] == 2017) & (df['treated'] == 1)).astype(int)

📊 Group Statistics Variables

Theory: Controlling for Group Fixed Effects

Within-Group Variation:

Y_ig = α_g + β·X_ig + ε_ig

Where α_g represents group fixed effects.

Group Mean Centering (Within Transformation):

Ỹ_ig = Y_ig − Ȳ_g,  X̃_ig = X_ig − X̄_g

Centered Model:

Ỹ_ig = β·X̃_ig + ε̃_ig

The α_g term is eliminated, avoiding omitted variable bias from unobserved group-level characteristics.
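
A quick check that group demeaning reproduces the fixed-effects slope (a sketch on simulated grouped data; the dummy-variable regression is the benchmark):

python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
g = rng.integers(0, 5, n)
alpha = np.array([1.0, -2.0, 0.5, 3.0, -1.0])[g]  # group fixed effects
x = rng.normal(size=n) + alpha                     # x correlated with α_g
y = 2.0 * x + alpha + rng.normal(size=n)           # true β = 2

df = pd.DataFrame({'y': y, 'x': x, 'g': g})

# (1) LSDV: regress on x plus a full set of group dummies (no constant)
X_lsdv = pd.concat([df[['x']], pd.get_dummies(df['g'], prefix='g', dtype=int)], axis=1)
beta_lsdv = sm.OLS(df['y'], X_lsdv).fit().params['x']

# (2) Within transformation: demean y and x by group, then regress
df['y_dm'] = df['y'] - df.groupby('g')['y'].transform('mean')
df['x_dm'] = df['x'] - df.groupby('g')['x'].transform('mean')
beta_within = sm.OLS(df['y_dm'], df['x_dm']).fit().params['x_dm']

print(beta_lsdv, beta_within)  # the two slope estimates coincide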

Method 1: groupby + transform

python
# Calculate average income by region
df['region_mean_income'] = df.groupby('region')['income'].transform('mean')

# Calculate individual deviation from regional mean (centering)
df['income_deviation'] = df['income'] - df['region_mean_income']

# Other statistics
df['region_median_income'] = df.groupby('region')['income'].transform('median')
df['region_std_income'] = df.groupby('region')['income'].transform('std')
df['region_count'] = df.groupby('region')['income'].transform('count')
df['region_min_income'] = df.groupby('region')['income'].transform('min')
df['region_max_income'] = df.groupby('region')['income'].transform('max')

# Percentiles
df['region_p25_income'] = df.groupby('region')['income'].transform(lambda x: x.quantile(0.25))
df['region_p75_income'] = df.groupby('region')['income'].transform(lambda x: x.quantile(0.75))

Method 2: Ranking and Percentiles

python
# Individual's income rank within region (percentile)
df['income_pct_rank'] = df.groupby('region')['income'].transform(
    lambda x: x.rank(pct=True)
)

# Interpretation: 0.75 means the individual's income exceeds 75% of incomes within the region

# Quartile groups
df['income_quartile'] = df.groupby('region')['income'].transform(
    lambda x: pd.qcut(x, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
)

Method 3: Rolling Window Statistics

python
# Time series: average income over the past 3 years
df = df.sort_values(['person_id', 'year'])
df['income_rolling_mean_3y'] = df.groupby('person_id')['income'].transform(
    lambda x: x.rolling(window=3, min_periods=1).mean()
)

# Standard deviation over the past 3 years (income volatility)
df['income_rolling_std_3y'] = df.groupby('person_id')['income'].transform(
    lambda x: x.rolling(window=3, min_periods=1).std()
)

# Growth rate over the past 3 years
df['income_growth_3y'] = df.groupby('person_id')['income'].pct_change(periods=3)

⚙️ Production-Level VariableConstructor Class

python
import pandas as pd
import numpy as np
from typing import List, Union, Optional
import statsmodels.api as sm

class VariableConstructor:
    """
    Production-level variable constructor - Nobel Prize Standards

    Functions:
    -----
    1. Dummy variable creation
    2. Interaction term construction
    3. Lag operators
    4. Group statistics variables
    5. Mincer equation variable set
    6. DID variable construction
    7. Treatment effect decomposition

    Author: StatsPai Team
    Reference: Angrist & Pischke (2009), Wooldridge (2010)
    """

    def __init__(self, df: pd.DataFrame, id_col: Optional[str] = None,
                 time_col: Optional[str] = None):
        """
        Initialize variable constructor

        Parameters:
        ------
        df : pd.DataFrame
            Raw data
        id_col : str, optional
            Individual identifier column for panel data
        time_col : str, optional
            Time identifier column for panel data
        """
        self.df = df.copy()
        self.id_col = id_col
        self.time_col = time_col
        self.constructed_vars = []

        # Validate panel data structure
        if id_col and time_col:
            self._validate_panel_structure()

    def _validate_panel_structure(self):
        """Validate panel data structure"""
        if self.id_col not in self.df.columns:
            raise ValueError(f"id_col '{self.id_col}' not in data")
        if self.time_col not in self.df.columns:
            raise ValueError(f"time_col '{self.time_col}' not in data")

        # Check if balanced panel
        counts = self.df.groupby(self.id_col)[self.time_col].count()
        if counts.nunique() == 1:
            print(f"✓ Balanced panel: {counts.iloc[0]} periods per individual")
        else:
            print(f"⚠️  Unbalanced panel: period count ranges {counts.min()} - {counts.max()}")

    def create_dummies(self, var: str, prefix: Optional[str] = None,
                      drop_first: bool = True, ref_category: Optional[str] = None) -> pd.DataFrame:
        """
        Create dummy variables

        Parameters:
        ------
        var : str
            Categorical variable name
        prefix : str, optional
            Variable name prefix, defaults to var
        drop_first : bool, default=True
            Whether to drop the first category (avoid multicollinearity)
        ref_category : str, optional
            Specify reference group (takes precedence over drop_first)

        Returns:
        ------
        pd.DataFrame
            Data with dummy variables added

        Example:
        ------
        >>> constructor.create_dummies('education', prefix='edu', ref_category='High School')
        """
        if var not in self.df.columns:
            raise ValueError(f"Variable '{var}' not in data")

        prefix = prefix or var

        if ref_category:
            # Specify reference group
            categories = self.df[var].unique()
            for cat in categories:
                if cat != ref_category:
                    dummy_name = f'{prefix}_{cat}'
                    self.df[dummy_name] = (self.df[var] == cat).astype(int)
                    self.constructed_vars.append(dummy_name)
            print(f"✓ Created dummy variables: reference group = {ref_category}")
        else:
            # Use pd.get_dummies
            dummies = pd.get_dummies(self.df[var], prefix=prefix, drop_first=drop_first, dtype=int)
            for col in dummies.columns:
                self.df[col] = dummies[col]
                self.constructed_vars.append(col)
            print(f"✓ Created dummy variables: {len(dummies.columns)} total")

        return self.df

    def create_interactions(self, var1: str, var2: str, center: bool = False,
                          name: Optional[str] = None) -> pd.DataFrame:
        """
        Create interaction terms

        Theory:
        -----
        For model: y = β₀ + β₁x₁ + β₂x₂ + β₃(x₁×x₂) + ε

        Marginal effect:
        ∂y/∂x₁ = β₁ + β₃x₂

        Parameters:
        ------
        var1 : str
            First variable
        var2 : str
            Second variable
        center : bool, default=False
            Whether to center (reduce multicollinearity)
        name : str, optional
            Interaction term name, defaults to 'var1_x_var2'

        Returns:
        ------
        pd.DataFrame
            Data with interaction term added

        Example:
        ------
        >>> constructor.create_interactions('education', 'female', center=True)
        """
        if var1 not in self.df.columns or var2 not in self.df.columns:
            raise ValueError(f"Variable '{var1}' or '{var2}' not in data")

        name = name or f'{var1}_x_{var2}'

        if center:
            # Center
            var1_centered = self.df[var1] - self.df[var1].mean()
            var2_centered = self.df[var2] - self.df[var2].mean()
            self.df[name] = var1_centered * var2_centered
            print(f"✓ Created centered interaction: {name}")
        else:
            self.df[name] = self.df[var1] * self.df[var2]
            print(f"✓ Created interaction: {name}")

        self.constructed_vars.append(name)
        return self.df

    def create_lags(self, var: str, lags: Union[int, List[int]] = 1,
                   fill_method: str = 'drop') -> pd.DataFrame:
        """
        Create lag variables (panel data)

        Requirements: Must set id_col and time_col

        Math:
        -----
        L(Y_it, k) = Y_{i,t-k}

        Parameters:
        ------
        var : str
            Variable name
        lags : int or list of int
            Lag periods, e.g., 1 or [1, 2, 3]
        fill_method : {'drop', 'zero', 'ffill'}
            Missing value handling method

        Returns:
        ------
        pd.DataFrame
            Data with lag variables added

        Example:
        ------
        >>> constructor.create_lags('income', lags=[1, 2])
        """
        if not self.id_col or not self.time_col:
            raise ValueError("Creating lags requires id_col and time_col to be set")

        if var not in self.df.columns:
            raise ValueError(f"Variable '{var}' not in data")

        # Ensure sorted
        self.df = self.df.sort_values([self.id_col, self.time_col])

        # Convert to list
        if isinstance(lags, int):
            lags = [lags]

        for lag in lags:
            lag_name = f'{var}_lag{lag}'
            self.df[lag_name] = self.df.groupby(self.id_col)[var].shift(lag)

            # Handle missing values
            if fill_method == 'drop':
                pass  # Keep NaN, remove in subsequent analysis
            elif fill_method == 'zero':
                self.df[lag_name] = self.df[lag_name].fillna(0)
            elif fill_method == 'ffill':
                self.df[lag_name] = self.df.groupby(self.id_col)[lag_name].ffill()

            self.constructed_vars.append(lag_name)
            print(f"✓ Created lag variable: {lag_name}")

        return self.df

    def create_leads(self, var: str, leads: Union[int, List[int]] = 1) -> pd.DataFrame:
        """
        Create lead variables (panel data)

        Application: Event studies

        Parameters:
        ------
        var : str
            Variable name
        leads : int or list of int
            Lead periods

        Returns:
        ------
        pd.DataFrame
            Data with lead variables added
        """
        if not self.id_col or not self.time_col:
            raise ValueError("Creating leads requires id_col and time_col to be set")

        if var not in self.df.columns:
            raise ValueError(f"Variable '{var}' not in data")

        self.df = self.df.sort_values([self.id_col, self.time_col])

        if isinstance(leads, int):
            leads = [leads]

        for lead in leads:
            lead_name = f'{var}_lead{lead}'
            self.df[lead_name] = self.df.groupby(self.id_col)[var].shift(-lead)
            self.constructed_vars.append(lead_name)
            print(f"✓ Created lead variable: {lead_name}")

        return self.df

    def create_group_stats(self, var: str, groupby: Union[str, List[str]],
                          stats: List[str] = ['mean', 'std']) -> pd.DataFrame:
        """
        Create group statistics variables

        Examples:
        -----
        - Average wage by province
        - Individual wage deviation from provincial mean

        Parameters:
        ------
        var : str
            Variable name
        groupby : str or list of str
            Grouping variable(s)
        stats : list of str
            Statistics, options: 'mean', 'std', 'min', 'max', 'median', 'count'

        Returns:
        ------
        pd.DataFrame
            Data with group statistics variables added

        Example:
        ------
        >>> constructor.create_group_stats('wage', 'region', stats=['mean', 'std'])
        """
        if var not in self.df.columns:
            raise ValueError(f"Variable '{var}' not in data")

        if isinstance(groupby, str):
            groupby = [groupby]

        for gb in groupby:
            if gb not in self.df.columns:
                raise ValueError(f"Grouping variable '{gb}' not in data")

        group_name = '_'.join(groupby)

        for stat in stats:
            stat_name = f'{group_name}_{stat}_{var}'

            if stat == 'mean':
                self.df[stat_name] = self.df.groupby(groupby)[var].transform('mean')
            elif stat == 'std':
                self.df[stat_name] = self.df.groupby(groupby)[var].transform('std')
            elif stat == 'min':
                self.df[stat_name] = self.df.groupby(groupby)[var].transform('min')
            elif stat == 'max':
                self.df[stat_name] = self.df.groupby(groupby)[var].transform('max')
            elif stat == 'median':
                self.df[stat_name] = self.df.groupby(groupby)[var].transform('median')
            elif stat == 'count':
                self.df[stat_name] = self.df.groupby(groupby)[var].transform('count')
            else:
                raise ValueError(f"Unknown statistic: {stat}")

            self.constructed_vars.append(stat_name)
            print(f"✓ Created group statistic: {stat_name}")

        # Create deviation variable
        if 'mean' in stats:
            deviation_name = f'{var}_dev_{group_name}'
            mean_name = f'{group_name}_mean_{var}'
            self.df[deviation_name] = self.df[var] - self.df[mean_name]
            self.constructed_vars.append(deviation_name)
            print(f"✓ Created deviation variable: {deviation_name}")

        return self.df

    def mincer_equation_vars(self, wage_col: str, education_col: str,
                            age_col: str, start_age: int = 6) -> pd.DataFrame:
        """
        Construct complete Mincer wage equation variables

        Model:
        -----
        ln(wage) = β₀ + β₁·education + β₂·experience + β₃·experience² + ε

        Required variables:
        --------
        - education: years of education
        - age: age
        - experience = age - education - start_age
        - experience_sq = experience²
        - log_wage = ln(wage)

        Parameters:
        ------
        wage_col : str
            Wage variable name
        education_col : str
            Education years variable name
        age_col : str
            Age variable name
        start_age : int, default=6
            Age when schooling starts

        Returns:
        ------
        pd.DataFrame
            Data with Mincer variables added

        Example:
        ------
        >>> constructor.mincer_equation_vars('wage', 'education', 'age')
        """
        required_cols = [wage_col, education_col, age_col]
        for col in required_cols:
            if col not in self.df.columns:
                raise ValueError(f"Variable '{col}' not in data")

        # Create experience
        self.df['experience'] = self.df[age_col] - self.df[education_col] - start_age
        self.df['experience'] = self.df['experience'].clip(lower=0)  # Ensure non-negative

        # Create experience²
        self.df['experience_sq'] = self.df['experience'] ** 2

        # Create log(wage)
        self.df['log_wage'] = np.log(self.df[wage_col])

        mincer_vars = ['experience', 'experience_sq', 'log_wage']
        self.constructed_vars.extend(mincer_vars)

        print("✓ Created Mincer equation variables:")
        print(f"   - experience = {age_col} - {education_col} - {start_age}")
        print(f"   - experience_sq = experience²")
        print(f"   - log_wage = ln({wage_col})")

        return self.df

    def did_variables(self, treat_col: str, post_period: Union[int, str],
                     time_col: Optional[str] = None) -> pd.DataFrame:
        """
        Construct DID variables

        Variables:
        -----
        - treat: treatment group indicator (already exists)
        - post: post-policy indicator
        - treat_post: interaction term (DID estimator)

        Parameters:
        ------
        treat_col : str
            Treatment group variable name
        post_period : int or str
            Policy implementation time point
        time_col : str, optional
            Time variable name, defaults to self.time_col

        Returns:
        ------
        pd.DataFrame
            Data with DID variables added

        Example:
        ------
        >>> constructor.did_variables('treated', post_period=2010)
        """
        if treat_col not in self.df.columns:
            raise ValueError(f"Treatment variable '{treat_col}' not in data")

        time_col = time_col or self.time_col
        if not time_col:
            raise ValueError("Must specify time_col")

        if time_col not in self.df.columns:
            raise ValueError(f"Time variable '{time_col}' not in data")

        # Create post variable
        self.df['post'] = (self.df[time_col] >= post_period).astype(int)

        # Create DID interaction
        self.df['treat_post'] = self.df[treat_col] * self.df['post']

        did_vars = ['post', 'treat_post']
        self.constructed_vars.extend(did_vars)

        print("✓ Created DID variables:")
        print(f"   - post = ({time_col} >= {post_period})")
        print(f"   - treat_post = {treat_col} × post")
        print(f"\nDID regression equation:")
        print(f"   Y = β₀ + β₁·{treat_col} + β₂·post + β₃·treat_post + ε")
        print(f"   β₃ = DID estimator (treatment effect)")

        return self.df

    def decompose_treatment_effect(self, outcome: str, treatment: str,
                                   covariates: Optional[List[str]] = None) -> dict:
        """
        Treatment effect decomposition

        Theory (Angrist & Pischke, 2009):
        -----
        Observed difference = E[Y|D=1] - E[Y|D=0]
                            = ATT + {E[Y(0)|D=1] - E[Y(0)|D=0]}
                            = Treatment effect + Selection bias

        The covariate-adjusted regression coefficient on D estimates the
        treatment effect; selection bias is reported as the observed
        difference minus that estimate.

        Parameters:
        ------
        outcome : str
            Outcome variable
        treatment : str
            Treatment variable (0/1)
        covariates : list of str, optional
            Control variables

        Returns:
        ------
        dict
            Dictionary containing decomposition results

        Example:
        ------
        >>> results = constructor.decompose_treatment_effect('wage', 'college_degree')
        """
        if outcome not in self.df.columns:
            raise ValueError(f"Outcome variable '{outcome}' not in data")
        if treatment not in self.df.columns:
            raise ValueError(f"Treatment variable '{treatment}' not in data")

        # Observed difference
        treated = self.df[self.df[treatment] == 1][outcome]
        control = self.df[self.df[treatment] == 0][outcome]
        observed_diff = treated.mean() - control.mean()

        # Regression adjustment (control for covariates)
        if covariates:
            X = sm.add_constant(self.df[[treatment] + covariates])
        else:
            X = sm.add_constant(self.df[[treatment]])

        y = self.df[outcome]
        model = sm.OLS(y, X).fit(cov_type='HC3')

        ate = model.params[treatment]
        se = model.bse[treatment]

        # Selection bias
        selection_bias = observed_diff - ate

        results = {
            'observed_difference': observed_diff,
            'ate': ate,
            'se': se,
            'selection_bias': selection_bias,
            't_stat': ate / se,
            'p_value': model.pvalues[treatment],
            'ci_lower': ate - 1.96 * se,
            'ci_upper': ate + 1.96 * se
        }

        print("✓ Treatment effect decomposition:")
        print(f"   Observed difference: {observed_diff:.4f}")
        print(f"   ATE: {ate:.4f} (SE: {se:.4f})")
        print(f"   Selection bias: {selection_bias:.4f}")
        print(f"   95% CI: [{results['ci_lower']:.4f}, {results['ci_upper']:.4f}]")
        print(f"   t-stat: {results['t_stat']:.4f}, p-value: {results['p_value']:.4f}")

        return results

    def get_constructed_vars(self) -> List[str]:
        """Return all constructed variable names"""
        return self.constructed_vars

    def summary(self):
        """Print variable construction summary"""
        print("\n" + "="*60)
        print("📊 Variable Construction Summary")
        print("="*60)
        print(f"Original variables: {len(self.df.columns) - len(self.constructed_vars)}")
        print(f"Constructed variables: {len(self.constructed_vars)}")
        print(f"Total variables: {len(self.df.columns)}")

        if self.id_col and self.time_col:
            n_id = self.df[self.id_col].nunique()
            n_time = self.df[self.time_col].nunique()
            print(f"\nPanel structure: {n_id} individuals × {n_time} periods = {len(self.df)} observations")

        print(f"\nConstructed variables:")
        for var in self.constructed_vars:
            print(f"  - {var}")

        print("="*60)

💼 Complete Case Study: NLSY97 Mincer Wage Equation

Case Background

Data: National Longitudinal Survey of Youth 1997 (NLSY97)

Research Questions:

  1. What is the return to education on wages?
  2. Does the return to education vary by gender?
  3. Does work experience have an inverted U-shaped effect on wages?
  4. How do regional and industry differences affect wages?

Data Simulation (Substitute for Real NLSY97)

python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed
np.random.seed(2024)

# Simulate 5000 individuals
n = 5000

# Generate basic features
df = pd.DataFrame({
    'person_id': range(1, n+1),
    'age': np.random.uniform(25, 55, n),
    'education': np.random.choice([12, 14, 16, 18, 20], n, p=[0.3, 0.2, 0.3, 0.15, 0.05]),
    'female': np.random.binomial(1, 0.5, n),
    'race': np.random.choice(['White', 'Black', 'Hispanic', 'Other'], n, p=[0.6, 0.15, 0.2, 0.05]),
    'urban': np.random.binomial(1, 0.7, n),
    'married': np.random.binomial(1, 0.55, n),
    'region': np.random.choice(['Northeast', 'South', 'Midwest', 'West'], n, p=[0.18, 0.37, 0.21, 0.24]),
    'industry': np.random.choice(['Manufacturing', 'Service', 'Tech', 'Finance', 'Other'], n,
                                 p=[0.15, 0.35, 0.2, 0.15, 0.15])
})

# Calculate work experience
df['experience'] = df['age'] - df['education'] - 6
df['experience'] = df['experience'].clip(lower=0)

# Generate wages (based on real economic relationships)
log_wage = (
    9.5 +                                          # Base log wage
    0.08 * df['education'] +                       # Education return (8%)
    0.05 * df['experience'] +                      # Experience return
    -0.0008 * df['experience']**2 +                # Diminishing marginal return to experience
    -0.15 * df['female'] +                         # Gender wage gap
    -0.02 * df['education'] * df['female'] +       # Gender difference in education return
    0.10 * (df['race'] == 'White') +               # Racial differences
    -0.08 * (df['race'] == 'Black') +
    -0.05 * (df['race'] == 'Hispanic') +
    0.05 * df['urban'] +                           # Urban premium
    0.08 * df['married'] +                         # Marriage premium
    0.12 * (df['region'] == 'West') +              # Regional differences
    0.08 * (df['region'] == 'Northeast') +
    -0.05 * (df['region'] == 'South') +
    0.15 * (df['industry'] == 'Tech') +            # Industry differences
    0.12 * (df['industry'] == 'Finance') +
    -0.08 * (df['industry'] == 'Service') +
    np.random.normal(0, 0.3, n)                    # Error term
)

df['wage'] = np.exp(log_wage)

print("✓ Data generation complete")
print(f"Sample size: {len(df)}")
print(f"\nDescriptive statistics:")
print(df[['age', 'education', 'experience', 'wage']].describe())

Using VariableConstructor to Construct Variables

python
# Initialize constructor
constructor = VariableConstructor(df)

# 1. Create basic Mincer equation variables
constructor.mincer_equation_vars('wage', 'education', 'age', start_age=6)

# 2. Create dummy variables
constructor.create_dummies('race', prefix='race', ref_category='White')
constructor.create_dummies('region', prefix='region', ref_category='Northeast')
constructor.create_dummies('industry', prefix='industry', ref_category='Other')

# 3. Create interaction terms
# (education × female is left uncentered so the coefficients in the results
#  section read directly as male vs. female education returns)
constructor.create_interactions('education', 'female', center=False, name='educ_x_female')
constructor.create_interactions('experience', 'female', center=False, name='exp_x_female')
constructor.create_interactions('education', 'experience', center=True, name='educ_x_exp')

# 4. Create group statistics variables
constructor.create_group_stats('wage', 'region', stats=['mean', 'std'])
constructor.create_group_stats('wage', 'industry', stats=['mean', 'median'])

# 5. View constructed variables
print("\nConstructed variables:")
for var in constructor.get_constructed_vars():
    print(f"  - {var}")

# Get constructed data
df_constructed = constructor.df

Estimating the Mincer Wage Equation

python
# Model 1: Basic Mincer equation
X1 = sm.add_constant(df_constructed[['education', 'experience', 'experience_sq']])
y = df_constructed['log_wage']
model1 = sm.OLS(y, X1).fit(cov_type='HC3')

print("\n" + "="*80)
print("Model 1: Basic Mincer Wage Equation")
print("="*80)
print(model1.summary().tables[1])

# Model 2: Add demographic features
X2 = sm.add_constant(df_constructed[[
    'education', 'experience', 'experience_sq',
    'female', 'race_Black', 'race_Hispanic', 'race_Other',
    'urban', 'married'
]])
model2 = sm.OLS(y, X2).fit(cov_type='HC3')

print("\n" + "="*80)
print("Model 2: With Demographic Features")
print("="*80)
print(model2.summary().tables[1])

# Model 3: Add interactions
X3 = sm.add_constant(df_constructed[[
    'education', 'experience', 'experience_sq',
    'female', 'race_Black', 'race_Hispanic', 'race_Other',
    'urban', 'married',
    'educ_x_female', 'exp_x_female'
]])
model3 = sm.OLS(y, X3).fit(cov_type='HC3')

print("\n" + "="*80)
print("Model 3: With Interactions (Gender Heterogeneity)")
print("="*80)
print(model3.summary().tables[1])

# Model 4: Full model (add region and industry)
X4 = sm.add_constant(df_constructed[[
    'education', 'experience', 'experience_sq',
    'female', 'race_Black', 'race_Hispanic', 'race_Other',
    'urban', 'married',
    'educ_x_female', 'exp_x_female',
    'region_South', 'region_Midwest', 'region_West',
    'industry_Manufacturing', 'industry_Service', 'industry_Tech', 'industry_Finance'
]])
model4 = sm.OLS(y, X4).fit(cov_type='HC3')

print("\n" + "="*80)
print("Model 4: Full Model (Region + Industry)")
print("="*80)
print(model4.summary().tables[1])

Results Interpretation

python
print("\n" + "="*80)
print("📊 Results Interpretation")
print("="*80)

# Education return
beta_educ = model4.params['education']
print(f"\n1. Return to Education")
print(f"   - Each additional year of education increases wages by {beta_educ*100:.2f}%")
print(f"   - 95% confidence interval: [{(beta_educ - 1.96*model4.bse['education'])*100:.2f}%, "
      f"{(beta_educ + 1.96*model4.bse['education'])*100:.2f}%]")

# Gender wage gap
beta_female = model4.params['female']
print(f"\n2. Gender Wage Gap")
print(f"   - Female wages are {abs(beta_female)*100:.2f}% lower than male wages")
print(f"   - p-value: {model4.pvalues['female']:.4f}")

# Gender difference in education return
beta_educ_female = model4.params['educ_x_female']
print(f"\n3. Gender Difference in Education Return")
print(f"   - Male education return: {beta_educ*100:.2f}%")
print(f"   - Female education return: {(beta_educ + beta_educ_female)*100:.2f}%")
print(f"   - Difference: {beta_educ_female*100:.2f}% (females lower)")

# Inverted U-shaped effect of experience
beta_exp = model4.params['experience']
beta_exp_sq = model4.params['experience_sq']
optimal_exp = -beta_exp / (2 * beta_exp_sq)
print(f"\n4. Inverted U-shaped Effect of Experience")
print(f"   - Experience coefficient: {beta_exp:.6f}")
print(f"   - Experience squared coefficient: {beta_exp_sq:.6f}")
print(f"   - Optimal experience years: {optimal_exp:.1f} years")

# Regional differences
print(f"\n5. Regional Wage Differences (relative to Northeast)")
for region in ['South', 'Midwest', 'West']:
    var = f'region_{region}'
    if var in model4.params.index:
        print(f"   - {region}: {model4.params[var]*100:.2f}% ({model4.pvalues[var]:.4f})")

# Industry differences
print(f"\n6. Industry Wage Differences (relative to Other)")
for industry in ['Manufacturing', 'Service', 'Tech', 'Finance']:
    var = f'industry_{industry}'
    if var in model4.params.index:
        print(f"   - {industry}: {model4.params[var]*100:.2f}% ({model4.pvalues[var]:.4f})")

# Model fit
print(f"\n7. Model Fit")
print(f"   - R²: {model4.rsquared:.4f}")
print(f"   - Adjusted R²: {model4.rsquared_adj:.4f}")
print(f"   - F-statistic: {model4.fvalue:.2f} (p-value: {model4.f_pvalue:.4e})")

Visualization Analysis

python
# 1. Education-wage relationship (by gender)
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for gender, label in [(0, 'Male'), (1, 'Female')]:
    subset = df_constructed[df_constructed['female'] == gender]
    avg_wage = subset.groupby('education')['wage'].mean()
    plt.plot(avg_wage.index, avg_wage.values, marker='o', label=label, linewidth=2)
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Average Wage ($)', fontsize=12)
plt.title('Education-Wage Relationship (by Gender)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)

# 2. Experience-wage relationship (inverted U)
plt.subplot(1, 2, 2)
exp_range = np.linspace(0, 40, 100)
log_wage_pred = (model4.params['const'] +
                 model4.params['education'] * df_constructed['education'].mean() +
                 model4.params['experience'] * exp_range +
                 model4.params['experience_sq'] * exp_range**2)
wage_pred = np.exp(log_wage_pred)
plt.plot(exp_range, wage_pred, linewidth=2, color='green')
plt.axvline(optimal_exp, color='red', linestyle='--', label=f'Optimal experience = {optimal_exp:.1f} years')
plt.xlabel('Work Experience (years)', fontsize=12)
plt.ylabel('Predicted Wage ($)', fontsize=12)
plt.title('Experience-Wage Relationship (Inverted U)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# 3. Regional and industry wage distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Region
df_constructed.boxplot(column='wage', by='region', ax=axes[0])
axes[0].set_title('Wage Distribution by Region', fontsize=14)
axes[0].set_xlabel('Region', fontsize=12)
axes[0].set_ylabel('Wage ($)', fontsize=12)
plt.sca(axes[0])
plt.xticks(rotation=45)

# Industry
df_constructed.boxplot(column='wage', by='industry', ax=axes[1])
axes[1].set_title('Wage Distribution by Industry', fontsize=14)
axes[1].set_xlabel('Industry', fontsize=12)
axes[1].set_ylabel('Wage ($)', fontsize=12)
plt.sca(axes[1])
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# 4. Residual diagnostics
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs fitted values
axes[0, 0].scatter(model4.fittedvalues, model4.resid, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values', fontsize=12)
axes[0, 0].set_ylabel('Residuals', fontsize=12)
axes[0, 0].set_title('Residuals vs Fitted Values', fontsize=14)
axes[0, 0].grid(alpha=0.3)

# QQ plot
from scipy import stats as sp_stats
sp_stats.probplot(model4.resid, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('QQ Plot (Normality Test)', fontsize=14)
axes[0, 1].grid(alpha=0.3)

# Residual histogram
axes[1, 0].hist(model4.resid, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals', fontsize=12)
axes[1, 0].set_ylabel('Frequency', fontsize=12)
axes[1, 0].set_title('Residual Distribution', fontsize=14)
axes[1, 0].grid(alpha=0.3)

# Standardized residuals
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(model4)
standardized_resid = influence.resid_studentized_internal
axes[1, 1].scatter(range(len(standardized_resid)), standardized_resid, alpha=0.5)
axes[1, 1].axhline(y=3, color='r', linestyle='--', label='±3σ')
axes[1, 1].axhline(y=-3, color='r', linestyle='--')
axes[1, 1].set_xlabel('Observation Index', fontsize=12)
axes[1, 1].set_ylabel('Standardized Residuals', fontsize=12)
axes[1, 1].set_title('Standardized Residuals (Outlier Detection)', fontsize=14)
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

Heterogeneity Analysis: Regional Differences in Education Return

python
# Estimate education return by region
print("\n" + "="*80)
print("📊 Heterogeneity Analysis: Regional Differences in Education Return")
print("="*80)

results_by_region = {}

for region in df_constructed['region'].unique():
    subset = df_constructed[df_constructed['region'] == region]
    X_subset = sm.add_constant(subset[['education', 'experience', 'experience_sq', 'female']])
    y_subset = subset['log_wage']
    model_subset = sm.OLS(y_subset, X_subset).fit(cov_type='HC3')

    educ_return = model_subset.params['education']
    educ_se = model_subset.bse['education']

    results_by_region[region] = {
        'return': educ_return,
        'se': educ_se,
        'ci_lower': educ_return - 1.96 * educ_se,
        'ci_upper': educ_return + 1.96 * educ_se
    }

    print(f"\n{region}:")
    print(f"  Education return: {educ_return*100:.2f}% (SE: {educ_se*100:.2f}%)")
    print(f"  95% CI: [{results_by_region[region]['ci_lower']*100:.2f}%, "
          f"{results_by_region[region]['ci_upper']*100:.2f}%]")

# Visualize regional differences
fig, ax = plt.subplots(figsize=(10, 6))
regions = list(results_by_region.keys())
returns = [results_by_region[r]['return'] * 100 for r in regions]
errors = [1.96 * results_by_region[r]['se'] * 100 for r in regions]

ax.barh(regions, returns, xerr=errors, capsize=5, color='skyblue', edgecolor='black')
ax.set_xlabel('Education Return (%)', fontsize=12)
ax.set_ylabel('Region', fontsize=12)
ax.set_title('Regional Differences in Education Return (95% CI)', fontsize=14)
ax.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

🔧 Other Common Variable Constructions

1. Polynomial Terms (Non-linear Relationships)

python
# Study non-linear relationship: income = f(experience, experience²)
df['experience_sq'] = df['experience'] ** 2
df['experience_cube'] = df['experience'] ** 3

# Identify optimal point
# ∂wage/∂experience = β₁ + 2β₂·experience = 0
# experience* = -β₁ / (2β₂)

2. Ratio Variables

python
# Calculate ratios
df['income_per_capita'] = df['household_income'] / df['household_size']
df['debt_to_income'] = df['debt'] / df['income']
df['saving_rate'] = df['saving'] / df['income']

# Log ratios
df['log_debt_to_income'] = np.log(df['debt'] / df['income'])

3. Log Difference (Growth Rates)

python
# Log difference ≈ growth rate
df['log_income'] = np.log(df['income'])
df['income_growth'] = df.groupby('person_id')['log_income'].diff()

# Interpretation: income_growth = 0.05 → income grows by approximately 5%

# Verification:
# ln(Y_t) - ln(Y_{t-1}) = ln(Y_t / Y_{t-1}) ≈ (Y_t - Y_{t-1}) / Y_{t-1}
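
A two-line numeric check of the approximation:

python
import numpy as np

y_prev, y_now = 100.0, 105.0
print(np.log(y_now) - np.log(y_prev))  # 0.0488: log difference
print((y_now - y_prev) / y_prev)       # 0.0500: exact growth rate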

4. Standardized Variables (Z-score)

python
from scipy import stats

# Z-score standardization (population)
df['income_std'] = stats.zscore(df['income'])

# Within-group standardization
df['income_std_within'] = df.groupby('region')['income'].transform(
    lambda x: stats.zscore(x)
)

# Interpretation: income_std = 2 → individual's income is 2 standard deviations above the mean

📝 Practice Exercises

⭐⭐ Exercise 1: Create Education Level Dummy Variables

Problem: Given an education years variable, create education level dummy variables (high school, bachelor's, graduate), with high school as the reference group.

python
# Data
df = pd.DataFrame({
    'id': range(1, 11),
    'education_years': [12, 16, 18, 12, 14, 16, 20, 12, 16, 14]
})

# Task: Create dummy variables
# High school: 12 years
# Bachelor's: 14-16 years
# Graduate: 18-20 years

Hint: Use pd.cut() to group, then pd.get_dummies()
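
One possible shape of the binning step (a sketch using the df defined above; the cut-offs follow the grouping in the comments):

python
# Bin years of schooling into levels, then expand into dummies
df['edu_level'] = pd.cut(df['education_years'],
                         bins=[0, 12, 16, 20],
                         labels=['High School', 'Bachelor', 'Graduate'])

# High School is the reference group, so drop its dummy explicitly
dummies = pd.get_dummies(df['edu_level'], prefix='edu', dtype=int)
dummies = dummies.drop(columns='edu_High School')
df = pd.concat([df, dummies], axis=1)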

⭐⭐⭐ Exercise 2: Construct Interactions and Calculate Marginal Effects

Problem: Estimate the model

ln(wage) = β₀ + β₁·educ + β₂·female + β₃·(educ×female) + ε

  1. Create the interaction term educ×female
  2. Run the regression
  3. Calculate the education return for males (β₁) and females (β₁ + β₃)
  4. Test whether the difference in returns is significant (Wald test on β₃)

⭐⭐⭐⭐ Exercise 3: Dynamic Panel Model

Problem: Use the Arellano-Bond method to estimate a dynamic panel model

Yᵢₜ = ρYᵢ,ₜ₋₁ + αᵢ + εᵢₜ

  1. Create the lag Yᵢ,ₜ₋₁
  2. First-difference to eliminate αᵢ
  3. Use Yᵢ,ₜ₋₂ as an instrumental variable for ΔYᵢ,ₜ₋₁
  4. GMM estimation

⭐⭐⭐⭐⭐ Exercise 4: Heterogeneous Treatment Effect Decomposition

Problem: Analyze the treatment effect of a college degree on wages, and decompose it into:

  1. ATE: Average Treatment Effect
  2. ATT: Average Treatment effect on the Treated
  3. ATE vs. ATT difference: evidence of selection bias

Use Propensity Score Matching for sensitivity analysis.


📚 Summary

Core Functions Summary

| Task | Method | Mathematical Representation | Application |
| --- | --- | --- | --- |
| Dummy Variables | pd.get_dummies() | Dₖ = 1 if C = k | Numerical encoding of categorical variables |
| Interactions | df['A'] * df['B'] | X₁×X₂ | Effect heterogeneity |
| Lags | df.shift() | L·Yₜ = Yₜ₋₁ | Dynamic models |
| Group Stats | groupby().transform() | Ȳ_g | Control for group effects |
| Polynomials | df['x'] ** 2 | X² | Non-linear relationships |

Best Practices

  1. Understand reference groups: Clearly specify reference group in dummy variable regressions
  2. Avoid multicollinearity: Use drop_first=True or specify reference group
  3. Center interactions: Improve interpretability, reduce collinearity
  4. Sort panel data: Sort by id and time before creating lags
  5. Naming conventions: e.g., income_lag1, edu_x_female, region_mean_wage
  6. Documentation: Record construction logic and theoretical basis for each variable
  7. Validation: Check distribution and missing values of constructed variables

Theoretical Key Points

  1. Dummy variable trap: including all K dummies gives Σₖ Dₖ = 1, causing perfect collinearity with the intercept
  2. Interaction interpretation: β₃ represents how the marginal effect of X₁ on Y varies with X₂
  3. Lag operator properties: Lᵏ·Yₜ = Yₜ₋ₖ, Δ = 1 − L
  4. Within transformation: Y_ig − Ȳ_g eliminates group fixed effects α_g
  5. Mincer equation: ln(wage) = β₀ + β₁·educ + β₂·exp + β₃·exp² + ε

🎯 Next Steps

Next section: Data Transformation (standardization, log transformation, Box-Cox transformation, binning)


📖 References

  1. Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

    • Chapter 3: Making Regression Make Sense
    • Interactions and heterogeneous effects
  2. Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.

    • Chapter 10: Basic Linear Unobserved Effects Panel Data Models
    • Dynamic panel models
  3. Mincer, J. (1974). Schooling, Experience, and Earnings. NBER.

    • Original reference for the Mincer wage equation
  4. Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. The Review of Economic Studies, 58(2), 277-297.

    • Dynamic panel GMM estimation
  5. Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.

    • Chapter 21: Panel Data Models
    • Chapter 25: Treatment Evaluation

Version: StatsPai Module 3.4 (Nobel Prize Standards) · Last Updated: 2024-01-15 · Author: StatsPai Team · License: CC BY-NC-SA 4.0
