3.4 Variable Construction
"Feature engineering is the key to machine learning success.""特征工程是机器学习成功的关键。"— Pedro Domingos, Professor at University of Washington (华盛顿大学教授)
From raw variables to analysis-ready data
📋 Learning Objectives
- Master the creation and theory of dummy variables
- Deeply understand interaction terms and marginal effects
- Learn lag operators in dynamic panel data
- Master complete variable construction for the Mincer wage equation
- Understand DID variables and treatment effect decomposition
- Build a production-level VariableConstructor class
📐 Theoretical Foundation
The Role of Variable Construction in Causal Inference
Angrist & Pischke (2009) emphasize:
"Good econometrics is about good research design, and good research design requires careful variable construction."
Three Key Functions:
1. Confounding control
   - Time fixed effects $\lambda_t$: captured with time dummy variables
   - Individual or group fixed effects $\alpha_i$: removed by group mean centering (the within transformation)
2. Capturing heterogeneous effects
   - The interaction term $D_i \times X_i$ captures how the treatment effect varies with $X_i$
3. Constructing instrumental variables
   - For example, lags of endogenous regressors serve as instruments in dynamic panel models (see the Arellano-Bond discussion below)
🎯 Dummy Variables
Theory: Numerical Encoding of Categorical Variables
Mathematical Representation:
Let a categorical variable $X$ have $k$ categories $\{1, 2, \dots, k\}$. Create $k-1$ dummy variables:
$$D_{ij} = \mathbb{1}[X_i = j], \quad j = 2, \dots, k$$
Regression Model:
$$Y_i = \beta_0 + \beta_2 D_{i2} + \beta_3 D_{i3} + \dots + \beta_k D_{ik} + \varepsilon_i$$
Coefficient Interpretation:
- $\beta_0$: Expected value for the reference group ($j = 1$): $\beta_0 = E[Y \mid X = 1]$
- $\beta_j$: Difference between category $j$ and the reference group: $\beta_j = E[Y \mid X = j] - E[Y \mid X = 1]$
The Dummy Variable Trap (Multicollinearity)
Perfect Collinearity:
If all $k$ dummy variables are included, then for every observation
$$\sum_{j=1}^{k} D_{ij} = 1.$$
This creates perfect collinearity with the intercept term (which equals 1 for all observations), making the design matrix $X'X$ non-invertible, so the OLS estimator is not defined.
Solution: Drop one category as the reference group
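A small numeric check of the trap (a sketch using a toy education variable with hypothetical values): with an intercept plus one dummy per category, the columns of the design matrix are linearly dependent, so the matrix loses full column rank.
import numpy as np
import pandas as pd
edu = pd.Series(['High School', 'Bachelor', 'Master', 'High School', 'PhD'])
all_dummies = pd.get_dummies(edu, dtype=int)  # one column per category; each row sums to 1
X_trap = np.column_stack([np.ones(len(edu)), all_dummies.to_numpy()])
print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))  # 5 columns, but rank is only 4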
Method 1: pd.get_dummies()
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'education': ['High School', 'Bachelor', 'Master', 'High School', 'PhD']
})
# Create dummy variables
dummies = pd.get_dummies(df['education'], prefix='edu', drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
#    id    education  edu_High School  edu_Master  edu_PhD
# 0   1  High School                1           0        0
# 1   2     Bachelor                0           0        0
# 2   3       Master                0           1        0
# 3   4  High School                1           0        0
# 4   5          PhD                0           0        1

Parameter Explanation:
- prefix='edu': Variable name prefix
- drop_first=True: Drop the first category (alphabetically) to avoid the dummy variable trap
- dtype=int: Return 0/1 integers instead of booleans (the default in recent pandas versions)
- Reference group: Bachelor, because it comes first alphabetically (all remaining dummies equal 0 for Bachelor rows). To use a different reference group, such as High School, construct the dummies manually (Method 2 below) or drop the reference column yourself (see the sketch below).
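If you want a specific reference group but prefer to stay with pd.get_dummies, one option (a small sketch) is to create dummies for every category and then drop the reference column explicitly:
# Keep all categories, then drop the chosen reference group ('High School' here)
edu_all = pd.get_dummies(df['education'], prefix='edu', dtype=int)
edu_ref_hs = edu_all.drop(columns='edu_High School')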
Method 2: Manual Construction (More Flexible)
# Create single dummy variables
# (illustrative: assumes the data also contain gender, income, age, and urban columns)
df['college'] = (df['education'] == 'Bachelor').astype(int)
df['female'] = (df['gender'] == 'Female').astype(int)
df['high_income'] = (df['income'] > df['income'].median()).astype(int)
# Multiple conditions (cross-classification)
df['young_female'] = ((df['age'] < 30) & (df['gender'] == 'Female')).astype(int)
df['urban_college'] = ((df['urban'] == 1) & (df['education'] == 'Bachelor')).astype(int)
# Specify reference group
def create_education_dummies(education_series, ref='High School'):
"""Create education dummy variables with specified reference group"""
categories = education_series.unique()
dummies = {}
for cat in categories:
if cat != ref:
dummies[f'edu_{cat}'] = (education_series == cat).astype(int)
return pd.DataFrame(dummies)
edu_dummies = create_education_dummies(df['education'], ref='High School')
df = pd.concat([df, edu_dummies], axis=1)

Regression and Interpretation
import statsmodels.api as sm
# Simulate wage data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'education': np.random.choice(['High School', 'Bachelor', 'Master'], n),
'experience': np.random.uniform(0, 30, n),
'age': np.random.uniform(22, 65, n)
})
# Generate wages (true model)
df['wage'] = (30000 +
5000 * (df['education'] == 'Bachelor') +
10000 * (df['education'] == 'Master') +
1000 * df['experience'] +
np.random.normal(0, 5000, n))
# Create dummy variables with High School as the reference group
edu_dummies = pd.get_dummies(df['education'], prefix='edu', dtype=int)
edu_dummies = edu_dummies.drop(columns='edu_High School')  # drop the reference category
df = pd.concat([df, edu_dummies], axis=1)
# OLS regression
X = sm.add_constant(df[['experience', 'edu_Bachelor', 'edu_Master']])
y = df['wage']
model = sm.OLS(y, X).fit(cov_type='HC3')
print(model.summary().tables[1])

Coefficient Interpretation:
- Intercept (30127): Expected wage for high school graduates with zero experience
- Bachelor coefficient (5084): Wage premium for bachelor's degree relative to high school
- Master coefficient (10156): Wage premium for master's degree relative to high school
Significance Testing:
Hypothesis test: $H_0: \beta_{Bachelor} = 0$ (no bachelor's premium)
If the p-value is below 0.05, reject $H_0$; the bachelor's premium is statistically significant.
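As a quick check, statsmodels can run this single-coefficient test directly on the fitted model object from the regression above (a small sketch):
# t test of H0: beta_Bachelor = 0
t_result = model.t_test('edu_Bachelor = 0')
print(t_result)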
🔄 Interaction Terms
Theory: Heterogeneity in Marginal Effects
Research Question: Does the effect of education on income vary by gender?
Model Without Interaction (parallel slopes):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \varepsilon_i$$
Marginal Effect:
$$\frac{\partial \ln(wage)}{\partial educ} = \beta_1$$
Problem: Assumes the effect of education on wages does not vary with gender.
Model With Interaction (non-parallel slopes):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \beta_3 (educ_i \times female_i) + \varepsilon_i$$
Marginal Effect:
$$\frac{\partial \ln(wage)}{\partial educ} = \beta_1 + \beta_3 \, female_i$$
Interpretation: The effect of education on wages depends on the value of the gender dummy.
Mathematical Derivation: Angrist & Pischke (2009) Framework
Basic Model (education × gender interaction):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \beta_3 (educ_i \times female_i) + \varepsilon_i$$
Group-Specific Regressions:
Males ($female_i = 0$):
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \varepsilon_i$$
Education return: $\beta_1$
Females ($female_i = 1$):
$$\ln(wage_i) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \, educ_i + \varepsilon_i$$
Education return: $\beta_1 + \beta_3$
Interaction Coefficient Interpretation:
If $\beta_3 < 0$, the education return for females is lower than for males (often read as evidence consistent with gender discrimination).
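This decomposition is easy to verify in code. The sketch below assumes a DataFrame df with a log_wage column, years of education in education, and a 0/1 female dummy; the pooled interaction model reproduces the two group-specific education returns.
import statsmodels.formula.api as smf
# Pooled model with the interaction
m_pool = smf.ols('log_wage ~ education * female', data=df).fit()
# Separate regressions by gender
m_male = smf.ols('log_wage ~ education', data=df[df['female'] == 0]).fit()
m_female = smf.ols('log_wage ~ education', data=df[df['female'] == 1]).fit()
# beta1 matches the male return; beta1 + beta3 matches the female return
print(m_pool.params['education'], m_male.params['education'])
print(m_pool.params['education'] + m_pool.params['education:female'], m_female.params['education'])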
Creating Interaction Terms
# (Illustrative: assumes education is measured in years and female/married are 0/1 dummies)
# 1. Continuous × Dummy
df['educ_x_female'] = df['education'] * df['female']
# 2. Continuous × Continuous
df['educ_x_experience'] = df['education'] * df['experience']
# 3. Dummy × Dummy
df['female_x_married'] = df['female'] * df['married']
# 4. Three-way interaction
df['educ_x_exp_x_female'] = df['education'] * df['experience'] * df['female']

Centering: Reducing Multicollinearity
Problem: $x_1$, $x_2$, and $x_1 \times x_2$ are often highly correlated
When both $x_1$ and $x_2$ are strictly positive, the correlation between each variable and the interaction can approach 1.
Solution: Centering
$$\tilde{x}_1 = x_1 - \bar{x}_1, \qquad \tilde{x}_2 = x_2 - \bar{x}_2$$
Centered Model:
$$y = \beta_0 + \beta_1 \tilde{x}_1 + \beta_2 \tilde{x}_2 + \beta_3 (\tilde{x}_1 \times \tilde{x}_2) + \varepsilon$$
Advantages:
- $\beta_1$ interpretation: Effect of $x_1$ on $y$ when $x_2$ is at its mean
- $\beta_0$ interpretation: Expected value of $y$ when both $x_1$ and $x_2$ are at their means
# Centering
df['educ_centered'] = df['education'] - df['education'].mean()
df['exp_centered'] = df['experience'] - df['experience'].mean()
# Create centered interaction
df['educ_x_exp_centered'] = df['educ_centered'] * df['exp_centered']
# Regression
X = sm.add_constant(df[['educ_centered', 'exp_centered', 'educ_x_exp_centered']])
y = df['log_wage']  # assumes log_wage = np.log(df['wage']) has been created
model = sm.OLS(y, X).fit()

Statistical Testing of Interactions
1. Wald Test
Test $H_0: \beta_3 = 0$ (no interaction effect). Under $H_0$, the Wald statistic $W = (\hat{\beta}_3 / SE(\hat{\beta}_3))^2$ follows a $\chi^2(1)$ distribution.
# Wald test
from scipy import stats
beta3 = model.params['educ_x_exp_centered']
se_beta3 = model.bse['educ_x_exp_centered']
wald_stat = (beta3 / se_beta3) ** 2
p_value = 1 - stats.chi2.cdf(wald_stat, df=1)
print(f"Wald statistic: {wald_stat:.4f}, p-value: {p_value:.4f}")2. F Test (Multiple Interactions)
Test $H_0$: all interaction coefficients equal zero
$$F = \frac{(RSS_r - RSS_u)/q}{RSS_u/(n - k - 1)} \sim F(q, \; n - k - 1)$$
Where:
- $RSS_r$: Residual sum of squares from the restricted model (without interactions)
- $RSS_u$: Residual sum of squares from the unrestricted model (with interactions)
- $q$: Number of interaction terms (restrictions)
- $k$: Number of regressors in the unrestricted model (excluding the constant)
# F test
# Restricted model
X_restricted = sm.add_constant(df[['educ_centered', 'exp_centered']])
model_r = sm.OLS(y, X_restricted).fit()
# Unrestricted model
X_unrestricted = sm.add_constant(df[['educ_centered', 'exp_centered', 'educ_x_exp_centered']])
model_u = sm.OLS(y, X_unrestricted).fit()
# F statistic
rss_r = model_r.ssr
rss_u = model_u.ssr
q = 1
n = len(df)
k = X_unrestricted.shape[1] - 1
f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - k - 1))
p_value = 1 - stats.f.cdf(f_stat, q, n - k - 1)
print(f"F statistic: {f_stat:.4f}, p-value: {p_value:.4f}")Visualizing Interaction Effects
import matplotlib.pyplot as plt
# Simulate data
np.random.seed(42)
educ = np.linspace(8, 20, 100)
# Predicted wages for males and females
wage_male = 8 + 0.08 * educ
wage_female = 8.2 + 0.06 * educ
# Plot
plt.figure(figsize=(10, 6))
plt.plot(educ, wage_male, label='Male', linewidth=2, color='blue')
plt.plot(educ, wage_female, label='Female', linewidth=2, color='red')
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Log Wage', fontsize=12)
plt.title('Gender Differences in Education-Income Relationship (Interaction Effect)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

⏱️ Lag Operators
Theory: Dynamic Panel Data Models
Lag Operator Definition:
$$L \, Y_t = Y_{t-1}, \qquad L^k Y_t = Y_{t-k}$$
Properties:
- Linearity: $L(aX_t + bY_t) = aX_{t-1} + bY_{t-1}$
- Multiplication: $L^j L^k Y_t = L^{j+k} Y_t = Y_{t-j-k}$
- Polynomial: $(1 - \phi L) Y_t = Y_t - \phi Y_{t-1}$
Difference Operator:
$$\Delta Y_t = (1 - L) Y_t = Y_t - Y_{t-1}$$
Applications:
- Unit root testing: $\Delta Y_t = \alpha + \rho Y_{t-1} + \varepsilon_t$ (ADF test; see the sketch below)
- Error correction models: $\Delta Y_t = \alpha \, \Delta X_t - \lambda (Y_{t-1} - \beta X_{t-1}) + \varepsilon_t$
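A minimal unit-root illustration for the ADF application above (a sketch, separate from the panel example): a simulated random walk typically fails to reject the unit root in levels but rejects it after first-differencing.
import numpy as np
from statsmodels.tsa.stattools import adfuller
rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))  # non-stationary in levels
p_level = adfuller(random_walk)[1]
p_diff = adfuller(np.diff(random_walk))[1]
print(f"ADF p-value (level): {p_level:.3f}")       # usually large: cannot reject the unit root
print(f"ADF p-value (difference): {p_diff:.3f}")   # usually near 0: stationary after differencing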
Dynamic Panel Models
Arellano-Bond Model:
$$Y_{it} = \rho \, Y_{i,t-1} + X_{it}'\beta + \alpha_i + \varepsilon_{it}$$
Problem: $Y_{i,t-1}$ is correlated with $\alpha_i$ (endogeneity)
Solution: First-differencing + GMM
Use $Y_{i,t-2}, Y_{i,t-3}, \dots$ as instrumental variables for $\Delta Y_{i,t-1}$.
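A minimal Anderson-Hsiao style sketch of this idea (illustration only: it assumes a long panel DataFrame named panel with columns 'id', 'year', and 'y', and the manual second-stage standard errors are not valid 2SLS standard errors):
import pandas as pd
import statsmodels.api as sm
panel = panel.sort_values(['id', 'year'])
panel['dy'] = panel.groupby('id')['y'].diff()           # ΔY_it
panel['dy_lag1'] = panel.groupby('id')['dy'].shift(1)   # ΔY_{i,t-1} (endogenous)
panel['y_lag2'] = panel.groupby('id')['y'].shift(2)     # Y_{i,t-2} (instrument)
est = panel.dropna(subset=['dy', 'dy_lag1', 'y_lag2']).copy()
# First stage: project the endogenous lagged difference on the instrument
first = sm.OLS(est['dy_lag1'], sm.add_constant(est['y_lag2'])).fit()
est['dy_lag1_hat'] = first.fittedvalues
# Second stage: point estimate of rho (standard errors here are not corrected)
second = sm.OLS(est['dy'], sm.add_constant(est['dy_lag1_hat'])).fit()
print(second.params)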
Creating Lags
# Example: panel data
df = pd.DataFrame({
'person_id': [1, 1, 1, 1, 2, 2, 2, 2],
'year': [2017, 2018, 2019, 2020, 2017, 2018, 2019, 2020],
'income': [50000, 52000, 55000, 58000, 60000, 62000, 65000, 68000]
})
# Sort by individual and time
df = df.sort_values(['person_id', 'year'])
# Create 1-period lag
df['income_lag1'] = df.groupby('person_id')['income'].shift(1)
# Create 2-period lag
df['income_lag2'] = df.groupby('person_id')['income'].shift(2)
# Create difference
df['income_diff'] = df.groupby('person_id')['income'].diff()
# Create log difference (growth rate)
df['log_income'] = np.log(df['income'])
df['income_growth'] = df.groupby('person_id')['log_income'].diff()
print(df)
# person_id year income income_lag1 income_lag2 income_diff log_income income_growth
# 0 1 2017 50000 NaN NaN NaN 10.819778 NaN
# 1 1 2018 52000 50000.0 NaN 2000.0 10.859073 0.039295
# 2 1 2019 55000 52000.0 50000.0 3000.0 10.914785 0.055713
# 3 1 2020 58000 55000.0 52000.0 3000.0 10.968557 0.053771
# 4 2 2017 60000 NaN NaN NaN 11.002100 NaN
# 5 2 2018 62000 60000.0 NaN 2000.0 11.034699 0.032599
# 6 2 2019 65000 62000.0 60000.0 3000.0 11.082414 0.047715
# 7 2 2020 68000 65000.0 62000.0 3000.0 11.127384 0.044970

Lead Variables
Application: Event studies
Test for anticipation effects before policy implementation:
$$Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k \, D_{it}^{k} + \varepsilon_{it}$$
Where $k$ indexes event time: $D_{it}^{k}$ equals 1 when unit $i$ is $k$ periods before ($k < 0$) or after ($k \geq 0$) the event.
# Create 1-period lead (future value)
df['income_lead1'] = df.groupby('person_id')['income'].shift(-1)
df['income_lead2'] = df.groupby('person_id')['income'].shift(-2)
# Event study: pre- and post-policy indicators (assumes a 0/1 'treated' column exists)
df['treat_post'] = ((df['year'] >= 2019) & (df['treated'] == 1)).astype(int)
df['treat_pre1'] = ((df['year'] == 2018) & (df['treated'] == 1)).astype(int)
df['treat_pre2'] = ((df['year'] == 2017) & (df['treated'] == 1)).astype(int)

📊 Group Statistics Variables
Theory: Controlling for Group Fixed Effects
Within-Group Variation:
$$Y_{ig} = \beta_0 + \beta_1 X_{ig} + \alpha_g + \varepsilon_{ig}$$
Where $\alpha_g$ represents group fixed effects.
Group Mean Centering (Within Transformation):
$$\tilde{Y}_{ig} = Y_{ig} - \bar{Y}_g, \qquad \tilde{X}_{ig} = X_{ig} - \bar{X}_g$$
Centered Model:
$$\tilde{Y}_{ig} = \beta_1 \tilde{X}_{ig} + \tilde{\varepsilon}_{ig}$$
The term $\alpha_g$ is eliminated, avoiding omitted variable bias from unobserved group-level factors.
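A quick sanity check of this result (a sketch; it assumes df contains income, education, and region columns): regressing group-demeaned variables reproduces the slope from a regression with explicit group dummies (LSDV).
import statsmodels.formula.api as smf
# LSDV: include region dummies explicitly
lsdv = smf.ols('income ~ education + C(region)', data=df).fit()
# Within estimator: demean income and education by region, regress without an intercept
df['income_w'] = df['income'] - df.groupby('region')['income'].transform('mean')
df['educ_w'] = df['education'] - df.groupby('region')['education'].transform('mean')
within = smf.ols('income_w ~ educ_w - 1', data=df).fit()
print(lsdv.params['education'], within.params['educ_w'])  # point estimates coincide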
Method 1: groupby + transform
# Calculate average income by region
df['region_mean_income'] = df.groupby('region')['income'].transform('mean')
# Calculate individual deviation from regional mean (centering)
df['income_deviation'] = df['income'] - df['region_mean_income']
# Other statistics
df['region_median_income'] = df.groupby('region')['income'].transform('median')
df['region_std_income'] = df.groupby('region')['income'].transform('std')
df['region_count'] = df.groupby('region')['income'].transform('count')
df['region_min_income'] = df.groupby('region')['income'].transform('min')
df['region_max_income'] = df.groupby('region')['income'].transform('max')
# Percentiles
df['region_p25_income'] = df.groupby('region')['income'].transform(lambda x: x.quantile(0.25))
df['region_p75_income'] = df.groupby('region')['income'].transform(lambda x: x.quantile(0.75))

Method 2: Ranking and Percentiles
# Individual's income rank within region (percentile)
df['income_pct_rank'] = df.groupby('region')['income'].transform(
lambda x: x.rank(pct=True)
)
# Interpretation: 0.75 means the individual's income exceeds that of 75% of people in the same region
# Quartile groups
df['income_quartile'] = df.groupby('region')['income'].transform(
lambda x: pd.qcut(x, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
)

Method 3: Rolling Window Statistics
# Time series: average income over the past 3 years
df = df.sort_values(['person_id', 'year'])
df['income_rolling_mean_3y'] = df.groupby('person_id')['income'].transform(
lambda x: x.rolling(window=3, min_periods=1).mean()
)
# Standard deviation over the past 3 years (income volatility)
df['income_rolling_std_3y'] = df.groupby('person_id')['income'].transform(
lambda x: x.rolling(window=3, min_periods=1).std()
)
# Growth rate over the past 3 years
df['income_growth_3y'] = df.groupby('person_id')['income'].pct_change(periods=3)

⚙️ Production-Level VariableConstructor Class
import pandas as pd
import numpy as np
from typing import List, Union, Tuple, Optional
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
class VariableConstructor:
"""
Production-level variable constructor - Nobel Prize Standards
Functions:
-----
1. Dummy variable creation
2. Interaction term construction
3. Lag operators
4. Group statistics variables
5. Mincer equation variable set
6. DID variable construction
7. Treatment effect decomposition
Author: StatsPai Team
Reference: Angrist & Pischke (2009), Wooldridge (2010)
"""
def __init__(self, df: pd.DataFrame, id_col: Optional[str] = None,
time_col: Optional[str] = None):
"""
Initialize variable constructor
Parameters:
------
df : pd.DataFrame
Raw data
id_col : str, optional
Individual identifier column for panel data
time_col : str, optional
Time identifier column for panel data
"""
self.df = df.copy()
self.id_col = id_col
self.time_col = time_col
self.constructed_vars = []
# Validate panel data structure
if id_col and time_col:
self._validate_panel_structure()
def _validate_panel_structure(self):
"""Validate panel data structure"""
if self.id_col not in self.df.columns:
raise ValueError(f"id_col '{self.id_col}' not in data")
if self.time_col not in self.df.columns:
raise ValueError(f"time_col '{self.time_col}' not in data")
# Check if balanced panel
counts = self.df.groupby(self.id_col)[self.time_col].count()
if counts.nunique() == 1:
print(f"✓ Balanced panel: {counts.iloc[0]} periods per individual")
else:
print(f"⚠️ Unbalanced panel: period count ranges {counts.min()} - {counts.max()}")
def create_dummies(self, var: str, prefix: Optional[str] = None,
drop_first: bool = True, ref_category: Optional[str] = None) -> pd.DataFrame:
"""
Create dummy variables
Parameters:
------
var : str
Categorical variable name
prefix : str, optional
Variable name prefix, defaults to var
drop_first : bool, default=True
Whether to drop the first category (avoid multicollinearity)
ref_category : str, optional
Specify reference group (takes precedence over drop_first)
Returns:
------
pd.DataFrame
Data with dummy variables added
Example:
------
>>> constructor.create_dummies('education', prefix='edu', ref_category='High School')
"""
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
prefix = prefix or var
if ref_category:
# Specify reference group
categories = self.df[var].unique()
for cat in categories:
if cat != ref_category:
dummy_name = f'{prefix}_{cat}'
self.df[dummy_name] = (self.df[var] == cat).astype(int)
self.constructed_vars.append(dummy_name)
print(f"✓ Created dummy variables: reference group = {ref_category}")
else:
# Use pd.get_dummies
dummies = pd.get_dummies(self.df[var], prefix=prefix, drop_first=drop_first)
for col in dummies.columns:
self.df[col] = dummies[col]
self.constructed_vars.append(col)
print(f"✓ Created dummy variables: {len(dummies.columns)} total")
return self.df
def create_interactions(self, var1: str, var2: str, center: bool = False,
name: Optional[str] = None) -> pd.DataFrame:
"""
Create interaction terms
Theory:
-----
For model: y = β₀ + β₁x₁ + β₂x₂ + β₃(x₁×x₂) + ε
Marginal effect:
∂y/∂x₁ = β₁ + β₃x₂
Parameters:
------
var1 : str
First variable
var2 : str
Second variable
center : bool, default=False
Whether to center (reduce multicollinearity)
name : str, optional
Interaction term name, defaults to 'var1_x_var2'
Returns:
------
pd.DataFrame
Data with interaction term added
Example:
------
>>> constructor.create_interactions('education', 'female', center=True)
"""
if var1 not in self.df.columns or var2 not in self.df.columns:
raise ValueError(f"Variable '{var1}' or '{var2}' not in data")
name = name or f'{var1}_x_{var2}'
if center:
# Center
var1_centered = self.df[var1] - self.df[var1].mean()
var2_centered = self.df[var2] - self.df[var2].mean()
self.df[name] = var1_centered * var2_centered
print(f"✓ Created centered interaction: {name}")
else:
self.df[name] = self.df[var1] * self.df[var2]
print(f"✓ Created interaction: {name}")
self.constructed_vars.append(name)
return self.df
def create_lags(self, var: str, lags: Union[int, List[int]] = 1,
fill_method: str = 'drop') -> pd.DataFrame:
"""
Create lag variables (panel data)
Requirements: Must set id_col and time_col
Math:
-----
L(Y_it, k) = Y_{i,t-k}
Parameters:
------
var : str
Variable name
lags : int or list of int
Lag periods, e.g., 1 or [1, 2, 3]
fill_method : {'drop', 'zero', 'ffill'}
Missing value handling method
Returns:
------
pd.DataFrame
Data with lag variables added
Example:
------
>>> constructor.create_lags('income', lags=[1, 2])
"""
if not self.id_col or not self.time_col:
raise ValueError("Creating lags requires id_col and time_col to be set")
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
# Ensure sorted
self.df = self.df.sort_values([self.id_col, self.time_col])
# Convert to list
if isinstance(lags, int):
lags = [lags]
for lag in lags:
lag_name = f'{var}_lag{lag}'
self.df[lag_name] = self.df.groupby(self.id_col)[var].shift(lag)
# Handle missing values
if fill_method == 'drop':
pass # Keep NaN, remove in subsequent analysis
elif fill_method == 'zero':
self.df[lag_name] = self.df[lag_name].fillna(0)
elif fill_method == 'ffill':
self.df[lag_name] = self.df.groupby(self.id_col)[lag_name].ffill()
self.constructed_vars.append(lag_name)
print(f"✓ Created lag variable: {lag_name}")
return self.df
def create_leads(self, var: str, leads: Union[int, List[int]] = 1) -> pd.DataFrame:
"""
Create lead variables (panel data)
Application: Event studies
Parameters:
------
var : str
Variable name
leads : int or list of int
Lead periods
Returns:
------
pd.DataFrame
Data with lead variables added
"""
if not self.id_col or not self.time_col:
raise ValueError("Creating leads requires id_col and time_col to be set")
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
self.df = self.df.sort_values([self.id_col, self.time_col])
if isinstance(leads, int):
leads = [leads]
for lead in leads:
lead_name = f'{var}_lead{lead}'
self.df[lead_name] = self.df.groupby(self.id_col)[var].shift(-lead)
self.constructed_vars.append(lead_name)
print(f"✓ Created lead variable: {lead_name}")
return self.df
def create_group_stats(self, var: str, groupby: Union[str, List[str]],
stats: List[str] = ['mean', 'std']) -> pd.DataFrame:
"""
Create group statistics variables
Examples:
-----
- Average wage by province
- Individual wage deviation from provincial mean
Parameters:
------
var : str
Variable name
groupby : str or list of str
Grouping variable(s)
stats : list of str
Statistics, options: 'mean', 'std', 'min', 'max', 'median', 'count'
Returns:
------
pd.DataFrame
Data with group statistics variables added
Example:
------
>>> constructor.create_group_stats('wage', 'region', stats=['mean', 'std'])
"""
if var not in self.df.columns:
raise ValueError(f"Variable '{var}' not in data")
if isinstance(groupby, str):
groupby = [groupby]
for gb in groupby:
if gb not in self.df.columns:
raise ValueError(f"Grouping variable '{gb}' not in data")
group_name = '_'.join(groupby)
for stat in stats:
stat_name = f'{group_name}_{stat}_{var}'
if stat == 'mean':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('mean')
elif stat == 'std':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('std')
elif stat == 'min':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('min')
elif stat == 'max':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('max')
elif stat == 'median':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('median')
elif stat == 'count':
self.df[stat_name] = self.df.groupby(groupby)[var].transform('count')
else:
raise ValueError(f"Unknown statistic: {stat}")
self.constructed_vars.append(stat_name)
print(f"✓ Created group statistic: {stat_name}")
# Create deviation variable
if 'mean' in stats:
deviation_name = f'{var}_dev_{group_name}'
mean_name = f'{group_name}_mean_{var}'
self.df[deviation_name] = self.df[var] - self.df[mean_name]
self.constructed_vars.append(deviation_name)
print(f"✓ Created deviation variable: {deviation_name}")
return self.df
def mincer_equation_vars(self, wage_col: str, education_col: str,
age_col: str, start_age: int = 6) -> pd.DataFrame:
"""
Construct complete Mincer wage equation variables
Model:
-----
ln(wage) = β₀ + β₁·education + β₂·experience + β₃·experience² + ε
Required variables:
--------
- education: years of education
- age: age
- experience = age - education - start_age
- experience_sq = experience²
- log_wage = ln(wage)
Parameters:
------
wage_col : str
Wage variable name
education_col : str
Education years variable name
age_col : str
Age variable name
start_age : int, default=6
Age when schooling starts
Returns:
------
pd.DataFrame
Data with Mincer variables added
Example:
------
>>> constructor.mincer_equation_vars('wage', 'education', 'age')
"""
required_cols = [wage_col, education_col, age_col]
for col in required_cols:
if col not in self.df.columns:
raise ValueError(f"Variable '{col}' not in data")
# Create experience
self.df['experience'] = self.df[age_col] - self.df[education_col] - start_age
self.df['experience'] = self.df['experience'].clip(lower=0) # Ensure non-negative
# Create experience²
self.df['experience_sq'] = self.df['experience'] ** 2
# Create log(wage)
self.df['log_wage'] = np.log(self.df[wage_col])
mincer_vars = ['experience', 'experience_sq', 'log_wage']
self.constructed_vars.extend(mincer_vars)
print("✓ Created Mincer equation variables:")
print(f" - experience = {age_col} - {education_col} - {start_age}")
print(f" - experience_sq = experience²")
print(f" - log_wage = ln({wage_col})")
return self.df
def did_variables(self, treat_col: str, post_period: Union[int, str],
time_col: Optional[str] = None) -> pd.DataFrame:
"""
Construct DID variables
Variables:
-----
- treat: treatment group indicator (already exists)
- post: post-policy indicator
- treat_post: interaction term (DID estimator)
Parameters:
------
treat_col : str
Treatment group variable name
post_period : int or str
Policy implementation time point
time_col : str, optional
Time variable name, defaults to self.time_col
Returns:
------
pd.DataFrame
Data with DID variables added
Example:
------
>>> constructor.did_variables('treated', post_period=2010)
"""
if treat_col not in self.df.columns:
raise ValueError(f"Treatment variable '{treat_col}' not in data")
time_col = time_col or self.time_col
if not time_col:
raise ValueError("Must specify time_col")
if time_col not in self.df.columns:
raise ValueError(f"Time variable '{time_col}' not in data")
# Create post variable
self.df['post'] = (self.df[time_col] >= post_period).astype(int)
# Create DID interaction
self.df['treat_post'] = self.df[treat_col] * self.df['post']
did_vars = ['post', 'treat_post']
self.constructed_vars.extend(did_vars)
print("✓ Created DID variables:")
print(f" - post = ({time_col} >= {post_period})")
print(f" - treat_post = {treat_col} × post")
print(f"\nDID regression equation:")
print(f" Y = β₀ + β₁·{treat_col} + β₂·post + β₃·treat_post + ε")
print(f" β₃ = DID estimator (treatment effect)")
return self.df
def decompose_treatment_effect(self, outcome: str, treatment: str,
covariates: Optional[List[str]] = None) -> dict:
"""
Treatment effect decomposition
Theory:
-----
ATE = E[Y(1) - Y(0)]
= [E[Y|D=1] - E[Y|D=0]]
- {E[Y(0)|D=1] - E[Y(0)|D=0]}
= Observed Difference - Selection Bias
Parameters:
------
outcome : str
Outcome variable
treatment : str
Treatment variable (0/1)
covariates : list of str, optional
Control variables
Returns:
------
dict
Dictionary containing decomposition results
Example:
------
>>> results = constructor.decompose_treatment_effect('wage', 'college_degree')
"""
if outcome not in self.df.columns:
raise ValueError(f"Outcome variable '{outcome}' not in data")
if treatment not in self.df.columns:
raise ValueError(f"Treatment variable '{treatment}' not in data")
# Observed difference
treated = self.df[self.df[treatment] == 1][outcome]
control = self.df[self.df[treatment] == 0][outcome]
observed_diff = treated.mean() - control.mean()
# Regression adjustment (control for covariates)
if covariates:
X = sm.add_constant(self.df[[treatment] + covariates])
else:
X = sm.add_constant(self.df[[treatment]])
y = self.df[outcome]
model = sm.OLS(y, X).fit(cov_type='HC3')
ate = model.params[treatment]
se = model.bse[treatment]
# Selection bias
selection_bias = observed_diff - ate
results = {
'observed_difference': observed_diff,
'ate': ate,
'se': se,
'selection_bias': selection_bias,
't_stat': ate / se,
'p_value': model.pvalues[treatment],
'ci_lower': ate - 1.96 * se,
'ci_upper': ate + 1.96 * se
}
print("✓ Treatment effect decomposition:")
print(f" Observed difference: {observed_diff:.4f}")
print(f" ATE: {ate:.4f} (SE: {se:.4f})")
print(f" Selection bias: {selection_bias:.4f}")
print(f" 95% CI: [{results['ci_lower']:.4f}, {results['ci_upper']:.4f}]")
print(f" t-stat: {results['t_stat']:.4f}, p-value: {results['p_value']:.4f}")
return results
def get_constructed_vars(self) -> List[str]:
"""Return all constructed variable names"""
return self.constructed_vars
def summary(self):
"""Print variable construction summary"""
print("\n" + "="*60)
print("📊 Variable Construction Summary")
print("="*60)
print(f"Original variables: {len(self.df.columns) - len(self.constructed_vars)}")
print(f"Constructed variables: {len(self.constructed_vars)}")
print(f"Total variables: {len(self.df.columns)}")
if self.id_col and self.time_col:
n_id = self.df[self.id_col].nunique()
n_time = self.df[self.time_col].nunique()
print(f"\nPanel structure: {n_id} individuals × {n_time} periods = {len(self.df)} observations")
print(f"\nConstructed variables:")
for var in self.constructed_vars:
print(f" - {var}")
print("="*60)💼 Complete Case Study: NLSY97 Mincer Wage Equation
Case Background
Data: National Longitudinal Survey of Youth 1997 (NLSY97)
Research Questions:
- What is the return to education on wages?
- Does the return to education vary by gender?
- Does work experience have an inverted U-shaped effect on wages?
- How do regional and industry differences affect wages?
Data Simulation (Substitute for Real NLSY97)
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed
np.random.seed(2024)
# Simulate 5000 individuals
n = 5000
# Generate basic features
df = pd.DataFrame({
'person_id': range(1, n+1),
'age': np.random.uniform(25, 55, n),
'education': np.random.choice([12, 14, 16, 18, 20], n, p=[0.3, 0.2, 0.3, 0.15, 0.05]),
'female': np.random.binomial(1, 0.5, n),
'race': np.random.choice(['White', 'Black', 'Hispanic', 'Other'], n, p=[0.6, 0.15, 0.2, 0.05]),
'urban': np.random.binomial(1, 0.7, n),
'married': np.random.binomial(1, 0.55, n),
'region': np.random.choice(['Northeast', 'South', 'Midwest', 'West'], n, p=[0.18, 0.37, 0.21, 0.24]),
'industry': np.random.choice(['Manufacturing', 'Service', 'Tech', 'Finance', 'Other'], n,
p=[0.15, 0.35, 0.2, 0.15, 0.15])
})
# Calculate work experience
df['experience'] = df['age'] - df['education'] - 6
df['experience'] = df['experience'].clip(lower=0)
# Generate wages (based on real economic relationships)
log_wage = (
9.5 + # Base log wage
0.08 * df['education'] + # Education return (8%)
0.05 * df['experience'] + # Experience return
-0.0008 * df['experience']**2 + # Diminishing marginal return to experience
-0.15 * df['female'] + # Gender wage gap
-0.02 * df['education'] * df['female'] + # Gender difference in education return
0.10 * (df['race'] == 'White') + # Racial differences
-0.08 * (df['race'] == 'Black') +
-0.05 * (df['race'] == 'Hispanic') +
0.05 * df['urban'] + # Urban premium
0.08 * df['married'] + # Marriage premium
0.12 * (df['region'] == 'West') + # Regional differences
0.08 * (df['region'] == 'Northeast') +
-0.05 * (df['region'] == 'South') +
0.15 * (df['industry'] == 'Tech') + # Industry differences
0.12 * (df['industry'] == 'Finance') +
-0.08 * (df['industry'] == 'Service') +
np.random.normal(0, 0.3, n) # Error term
)
df['wage'] = np.exp(log_wage)
print("✓ Data generation complete")
print(f"Sample size: {len(df)}")
print(f"\nDescriptive statistics:")
print(df[['age', 'education', 'experience', 'wage']].describe())

Using VariableConstructor to Construct Variables
# Initialize constructor
constructor = VariableConstructor(df)
# 1. Create basic Mincer equation variables
constructor.mincer_equation_vars('wage', 'education', 'age', start_age=6)
# 2. Create dummy variables
constructor.create_dummies('race', prefix='race', ref_category='White')
constructor.create_dummies('region', prefix='region', ref_category='Northeast')
constructor.create_dummies('industry', prefix='industry', ref_category='Other')
# 3. Create interaction terms
constructor.create_interactions('education', 'female', center=False, name='educ_x_female')  # uncentered so coefficients decompose into male vs. female returns
constructor.create_interactions('experience', 'female', center=False, name='exp_x_female')
constructor.create_interactions('education', 'experience', center=True, name='educ_x_exp')
# 4. Create group statistics variables
constructor.create_group_stats('wage', 'region', stats=['mean', 'std'])
constructor.create_group_stats('wage', 'industry', stats=['mean', 'median'])
# 5. View constructed variables
print("\nConstructed variables:")
for var in constructor.get_constructed_vars():
print(f" - {var}")
# Get constructed data
df_constructed = constructor.df

Estimating the Mincer Wage Equation
# Model 1: Basic Mincer equation
X1 = sm.add_constant(df_constructed[['education', 'experience', 'experience_sq']])
y = df_constructed['log_wage']
model1 = sm.OLS(y, X1).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 1: Basic Mincer Wage Equation")
print("="*80)
print(model1.summary().tables[1])
# Model 2: Add demographic features
X2 = sm.add_constant(df_constructed[[
'education', 'experience', 'experience_sq',
'female', 'race_Black', 'race_Hispanic', 'race_Other',
'urban', 'married'
]])
model2 = sm.OLS(y, X2).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 2: With Demographic Features")
print("="*80)
print(model2.summary().tables[1])
# Model 3: Add interactions
X3 = sm.add_constant(df_constructed[[
'education', 'experience', 'experience_sq',
'female', 'race_Black', 'race_Hispanic', 'race_Other',
'urban', 'married',
'educ_x_female', 'exp_x_female'
]])
model3 = sm.OLS(y, X3).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 3: With Interactions (Gender Heterogeneity)")
print("="*80)
print(model3.summary().tables[1])
# Model 4: Full model (add region and industry)
X4 = sm.add_constant(df_constructed[[
'education', 'experience', 'experience_sq',
'female', 'race_Black', 'race_Hispanic', 'race_Other',
'urban', 'married',
'educ_x_female', 'exp_x_female',
'region_South', 'region_Midwest', 'region_West',
'industry_Manufacturing', 'industry_Service', 'industry_Tech', 'industry_Finance'
]])
model4 = sm.OLS(y, X4).fit(cov_type='HC3')
print("\n" + "="*80)
print("Model 4: Full Model (Region + Industry)")
print("="*80)
print(model4.summary().tables[1])

Results Interpretation
print("\n" + "="*80)
print("📊 Results Interpretation")
print("="*80)
# Education return
beta_educ = model4.params['education']
print(f"\n1. Return to Education")
print(f" - Each additional year of education increases wages by {beta_educ*100:.2f}%")
print(f" - 95% confidence interval: [{(beta_educ - 1.96*model4.bse['education'])*100:.2f}%, "
f"{(beta_educ + 1.96*model4.bse['education'])*100:.2f}%]")
# Gender wage gap
beta_female = model4.params['female']
beta_educ_female = model4.params['educ_x_female']
mean_educ = df_constructed['education'].mean()
gap_at_mean_educ = beta_female + beta_educ_female * mean_educ  # gap evaluated at average education
print(f"\n2. Gender Wage Gap")
print(f" - At the average education level, female wages are approximately {abs(gap_at_mean_educ)*100:.2f}% lower than male wages")
print(f" - p-value (female main effect): {model4.pvalues['female']:.4f}")
# Gender difference in education return
print(f"\n3. Gender Difference in Education Return")
print(f" - Male education return: {beta_educ*100:.2f}%")
print(f" - Female education return: {(beta_educ + beta_educ_female)*100:.2f}%")
print(f" - Difference: {beta_educ_female*100:.2f}% (lower for females)")
# Inverted U-shaped effect of experience
beta_exp = model4.params['experience']
beta_exp_sq = model4.params['experience_sq']
optimal_exp = -beta_exp / (2 * beta_exp_sq)
print(f"\n4. Inverted U-shaped Effect of Experience")
print(f" - Experience coefficient: {beta_exp:.6f}")
print(f" - Experience squared coefficient: {beta_exp_sq:.6f}")
print(f" - Optimal experience years: {optimal_exp:.1f} years")
# Regional differences
print(f"\n5. Regional Wage Differences (relative to Northeast)")
for region in ['South', 'Midwest', 'West']:
var = f'region_{region}'
if var in model4.params.index:
print(f" - {region}: {model4.params[var]*100:.2f}% ({model4.pvalues[var]:.4f})")
# Industry differences
print(f"\n6. Industry Wage Differences (relative to Other)")
for industry in ['Manufacturing', 'Service', 'Tech', 'Finance']:
var = f'industry_{industry}'
if var in model4.params.index:
print(f" - {industry}: {model4.params[var]*100:.2f}% ({model4.pvalues[var]:.4f})")
# Model fit
print(f"\n7. Model Fit")
print(f" - R²: {model4.rsquared:.4f}")
print(f" - Adjusted R²: {model4.rsquared_adj:.4f}")
print(f" - F-statistic: {model4.fvalue:.2f} (p-value: {model4.f_pvalue:.4e})")Visualization Analysis
# 1. Education-wage relationship (by gender)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for gender, label in [(0, 'Male'), (1, 'Female')]:
subset = df_constructed[df_constructed['female'] == gender]
avg_wage = subset.groupby('education')['wage'].mean()
plt.plot(avg_wage.index, avg_wage.values, marker='o', label=label, linewidth=2)
plt.xlabel('Years of Education', fontsize=12)
plt.ylabel('Average Wage ($)', fontsize=12)
plt.title('Education-Wage Relationship (by Gender)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
# 2. Experience-wage relationship (inverted U)
plt.subplot(1, 2, 2)
exp_range = np.linspace(0, 40, 100)
# Predicted log wage holding education at its mean and the remaining covariates at zero (shape illustration only)
log_wage_pred = (model4.params['const'] +
model4.params['education'] * df_constructed['education'].mean() +
model4.params['experience'] * exp_range +
model4.params['experience_sq'] * exp_range**2)
wage_pred = np.exp(log_wage_pred)
plt.plot(exp_range, wage_pred, linewidth=2, color='green')
plt.axvline(optimal_exp, color='red', linestyle='--', label=f'Optimal experience = {optimal_exp:.1f} years')
plt.xlabel('Work Experience (years)', fontsize=12)
plt.ylabel('Predicted Wage ($)', fontsize=12)
plt.title('Experience-Wage Relationship (Inverted U)', fontsize=14)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# 3. Regional and industry wage distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Region
df_constructed.boxplot(column='wage', by='region', ax=axes[0])
axes[0].set_title('Wage Distribution by Region', fontsize=14)
axes[0].set_xlabel('Region', fontsize=12)
axes[0].set_ylabel('Wage ($)', fontsize=12)
plt.sca(axes[0])
plt.xticks(rotation=45)
# Industry
df_constructed.boxplot(column='wage', by='industry', ax=axes[1])
axes[1].set_title('Wage Distribution by Industry', fontsize=14)
axes[1].set_xlabel('Industry', fontsize=12)
axes[1].set_ylabel('Wage ($)', fontsize=12)
plt.sca(axes[1])
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# 4. Residual diagnostics
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs fitted values
axes[0, 0].scatter(model4.fittedvalues, model4.resid, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values', fontsize=12)
axes[0, 0].set_ylabel('Residuals', fontsize=12)
axes[0, 0].set_title('Residuals vs Fitted Values', fontsize=14)
axes[0, 0].grid(alpha=0.3)
# QQ plot
from scipy import stats as sp_stats
sp_stats.probplot(model4.resid, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('QQ Plot (Normality Test)', fontsize=14)
axes[0, 1].grid(alpha=0.3)
# Residual histogram
axes[1, 0].hist(model4.resid, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals', fontsize=12)
axes[1, 0].set_ylabel('Frequency', fontsize=12)
axes[1, 0].set_title('Residual Distribution', fontsize=14)
axes[1, 0].grid(alpha=0.3)
# Standardized residuals
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(model4)
standardized_resid = influence.resid_studentized_internal
axes[1, 1].scatter(range(len(standardized_resid)), standardized_resid, alpha=0.5)
axes[1, 1].axhline(y=3, color='r', linestyle='--', label='±3σ')
axes[1, 1].axhline(y=-3, color='r', linestyle='--')
axes[1, 1].set_xlabel('Observation Index', fontsize=12)
axes[1, 1].set_ylabel('Standardized Residuals', fontsize=12)
axes[1, 1].set_title('Standardized Residuals (Outlier Detection)', fontsize=14)
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
plt.show()

Heterogeneity Analysis: Regional Differences in Education Return
# Estimate education return by region
print("\n" + "="*80)
print("📊 Heterogeneity Analysis: Regional Differences in Education Return")
print("="*80)
results_by_region = {}
for region in df_constructed['region'].unique():
subset = df_constructed[df_constructed['region'] == region]
X_subset = sm.add_constant(subset[['education', 'experience', 'experience_sq', 'female']])
y_subset = subset['log_wage']
model_subset = sm.OLS(y_subset, X_subset).fit(cov_type='HC3')
educ_return = model_subset.params['education']
educ_se = model_subset.bse['education']
results_by_region[region] = {
'return': educ_return,
'se': educ_se,
'ci_lower': educ_return - 1.96 * educ_se,
'ci_upper': educ_return + 1.96 * educ_se
}
print(f"\n{region}:")
print(f" Education return: {educ_return*100:.2f}% (SE: {educ_se*100:.2f}%)")
print(f" 95% CI: [{results_by_region[region]['ci_lower']*100:.2f}%, "
f"{results_by_region[region]['ci_upper']*100:.2f}%]")
# Visualize regional differences
fig, ax = plt.subplots(figsize=(10, 6))
regions = list(results_by_region.keys())
returns = [results_by_region[r]['return'] * 100 for r in regions]
errors = [1.96 * results_by_region[r]['se'] * 100 for r in regions]
ax.barh(regions, returns, xerr=errors, capsize=5, color='skyblue', edgecolor='black')
ax.set_xlabel('Education Return (%)', fontsize=12)
ax.set_ylabel('Region', fontsize=12)
ax.set_title('Regional Differences in Education Return (95% CI)', fontsize=14)
ax.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

🔧 Other Common Variable Constructions
1. Polynomial Terms (Non-linear Relationships)
# Study non-linear relationship: income = f(experience, experience²)
df['experience_sq'] = df['experience'] ** 2
df['experience_cube'] = df['experience'] ** 3
# Identify optimal point
# ∂wage/∂experience = β₁ + 2β₂·experience = 0
# experience* = -β₁ / (2β₂)

2. Ratio Variables
# Calculate ratios
df['income_per_capita'] = df['household_income'] / df['household_size']
df['debt_to_income'] = df['debt'] / df['income']
df['saving_rate'] = df['saving'] / df['income']
# Log ratios
df['log_debt_to_income'] = np.log(df['debt'] / df['income'])

3. Log Difference (Growth Rates)
# Log difference ≈ growth rate
df['log_income'] = np.log(df['income'])
df['income_growth'] = df.groupby('person_id')['log_income'].diff()
# Interpretation: income_growth = 0.05 → income grows by approximately 5%
# Verification:
# ln(Y_t) - ln(Y_{t-1}) = ln(Y_t / Y_{t-1}) ≈ (Y_t - Y_{t-1}) / Y_{t-1}

4. Standardized Variables (Z-score)
from scipy import stats
# Z-score standardization (population)
df['income_std'] = stats.zscore(df['income'])
# Within-group standardization
df['income_std_within'] = df.groupby('region')['income'].transform(
lambda x: stats.zscore(x)
)
# Interpretation: income_std = 2 → individual's income is 2 standard deviations above the mean

📝 Practice Exercises
⭐⭐ Exercise 1: Create Education Level Dummy Variables
Problem: Given an education years variable, create education level dummy variables (high school, bachelor's, graduate), with high school as the reference group.
# Data
df = pd.DataFrame({
'id': range(1, 11),
'education_years': [12, 16, 18, 12, 14, 16, 20, 12, 16, 14]
})
# Task: Create dummy variables
# High school: 12 years
# Bachelor's: 14-16 years
# Graduate: 18-20 years

Hint: Use pd.cut() to group, then pd.get_dummies()
⭐⭐⭐ Exercise 2: Construct Interactions and Calculate Marginal Effects
Problem: Estimate the model
$$\ln(wage_i) = \beta_0 + \beta_1 \, educ_i + \beta_2 \, female_i + \beta_3 (educ_i \times female_i) + \varepsilon_i$$
- Create the interaction term $educ_i \times female_i$
- Run regression
- Calculate education return for males and females
- Test whether the difference in returns is significant (Wald test)
⭐⭐⭐⭐ Exercise 3: Dynamic Panel Model
Problem: Use the Arellano-Bond method to estimate a dynamic panel model
$$Y_{it} = \rho \, Y_{i,t-1} + X_{it}'\beta + \alpha_i + \varepsilon_{it}$$
- Create the lag $Y_{i,t-1}$
- First-difference to eliminate $\alpha_i$
- Use $Y_{i,t-2}$ as an instrumental variable for $\Delta Y_{i,t-1}$
- GMM estimation
⭐⭐⭐⭐⭐ Exercise 4: Heterogeneous Treatment Effect Decomposition
Problem: Analyze the treatment effect of a college degree on wages, and decompose it into:
- ATE (Average Treatment Effect): $E[Y(1) - Y(0)]$
- ATT (Average Treatment effect on the Treated): $E[Y(1) - Y(0) \mid D = 1]$
- ATE vs ATT difference: evidence of selection bias
Use Propensity Score Matching for sensitivity analysis.
📚 Summary
Core Functions Summary
| Task | Method | Mathematical Representation | Application |
|---|---|---|---|
| Dummy Variables | pd.get_dummies() | $D_{ij} = \mathbb{1}[X_i = j]$ | Numerical encoding of categorical variables |
| Interactions | df['A'] * df['B'] | $x_1 \times x_2$ | Effect heterogeneity |
| Lags | groupby().shift(k) | $L^k Y_{it} = Y_{i,t-k}$ | Dynamic models |
| Group Stats | groupby().transform() | $\bar{Y}_g$, $Y_{ig} - \bar{Y}_g$ | Control for group effects |
| Polynomials | df['x'] ** 2 | $x, x^2, x^3$ | Non-linear relationships |
Best Practices
- ✅ Understand reference groups: Clearly specify reference group in dummy variable regressions
- ✅ Avoid multicollinearity: Use drop_first=True or specify a reference group
- ✅ Center interactions: Improve interpretability and reduce collinearity
- ✅ Sort panel data: Sort by id and time before creating lags
- ✅ Naming conventions: e.g., income_lag1, edu_x_female, region_mean_wage
- ✅ Documentation: Record construction logic and theoretical basis for each variable
- ✅ Validation: Check distribution and missing values of constructed variables
Theoretical Key Points
- Dummy variable trap: Including all $k$ dummies together with an intercept causes perfect collinearity
- Interaction interpretation: $\beta_3$ represents how the marginal effect of $x_1$ on $y$ varies with $x_2$
- Lag operator properties: $L^k Y_t = Y_{t-k}$, $\Delta = 1 - L$
- Within transformation: $Y_{ig} - \bar{Y}_g$ eliminates the group fixed effect $\alpha_g$
- Mincer equation: $\ln(wage) = \beta_0 + \beta_1 \, educ + \beta_2 \, exp + \beta_3 \, exp^2 + \varepsilon$
🎯 Next Steps
Next section: Data Transformation (standardization, log transformation, Box-Cox transformation, binning)
📖 References
Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
- Chapter 3: Making Regression Make Sense
- Interactions and heterogeneous effects
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.
- Chapter 10: Basic Linear Unobserved Effects Panel Data Models
- Dynamic panel models
Mincer, J. (1974). Schooling, Experience, and Earnings. NBER.
- Original reference for the Mincer wage equation
Arellano, M., & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. The Review of Economic Studies, 58(2), 277-297.
- Dynamic panel GMM estimation
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.
- Chapter 21: Panel Data Models
- Chapter 25: Treatment Evaluation
Version: StatsPai Module 3.4 - Nobel Prize Standards | Last Updated: 2024-01-15 | Author: StatsPai Team | License: CC BY-NC-SA 4.0