Skip to content

3.7 Complete Case Study: Wage Determination Research

"In God we trust, all others must bring data.""除了上帝,其他人都必须用数据说话。"— W. Edwards Deming, Statistician & Quality Control Expert (统计学家、质量管理专家)

Complete workflow from raw data to regression analysis


🎯 Case Objectives

Research Question: What is the causal effect of education on wages?

Data Source: Simulated NLSY (National Longitudinal Survey of Youth) data

Complete Workflow:

  1. Data import and inspection
  2. Data cleaning
  3. Variable construction
  4. Data transformation
  5. Descriptive statistics
  6. Regression analysis
  7. Results visualization and reporting

📥 Step 1: Data Import and Inspection

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

# Settings
sns.set_style('whitegrid')
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
np.random.seed(42)

# 1. Read data
print("=" * 70)
print("【Step 1】Data Import")
print("=" * 70)

df_raw = pd.read_csv('wage_data_raw.csv', encoding='utf-8')
print(f"✓ Successfully read {len(df_raw)} rows, {df_raw.shape[1]} columns")

# 2. Initial inspection
print("\nFirst 5 rows:")
print(df_raw.head())

print("\nData types:")
print(df_raw.dtypes)

print("\nBasic statistics:")
print(df_raw.describe())

# 3. Missing values
print("\nMissing values:")
missing = df_raw.isnull().sum()
print(missing[missing > 0])

# 4. Duplicates
print(f"\nDuplicate rows: {df_raw.duplicated().sum()}")

🧹 Step 2: Data Cleaning

python
print("\n" + "=" * 70)
print("【Step 2】Data Cleaning")
print("=" * 70)

df = df_raw.copy()  # Preserve original data

# 1. Remove complete duplicates
before = len(df)
df = df.drop_duplicates()
print(f"✓ Removed duplicate rows: {before - len(df)} rows")

# 2. Handle missing values
# Remove rows with missing key variables (wage or education)
before = len(df)
df = df.dropna(subset=['wage', 'education'])
print(f"✓ Removed rows with key variable missing: {before - len(df)} rows")

# Fill other variables with median/mode
df['age'].fillna(df['age'].median(), inplace=True)
df['experience'].fillna(df['experience'].median(), inplace=True)

# 3. Handle outliers
# Remove obviously incorrect values
before = len(df)
df = df[
    (df['age'] >= 18) & (df['age'] <= 65) &
    (df['wage'] > 0) & (df['wage'] < 500000) &
    (df['education'] >= 0) & (df['education'] <= 30) &
    (df['experience'] >= 0) & (df['experience'] <= 50)
]
print(f"✓ Removed outliers: {before - len(df)} rows")

# 4. Winsorize income (handle extreme values)
from scipy.stats.mstats import winsorize
df['wage'] = winsorize(df['wage'], limits=[0.01, 0.01])
print("✓ Winsorized wage (1%-99%)")

# 5. Standardize string formats
df['gender'] = df['gender'].str.lower().str.strip()
df['race'] = df['race'].str.lower().str.strip()
df['region'] = df['region'].str.lower().str.strip()

print(f"\nAfter cleaning: {len(df)} rows, {df.shape[1]} columns")

🔧 Step 3: Variable Construction

python
print("\n" + "=" * 70)
print("【Step 3】Variable Construction")
print("=" * 70)

# 1. Continuous variable transformations
df['log_wage'] = np.log(df['wage'])
df['experience'] = df['age'] - df['education'] - 6  # Recalculate experience
df['experience_sq'] = df['experience'] ** 2  # Quadratic term

print("✓ Created log_wage, experience, experience_sq")

# 2. Dummy variables
df['female'] = (df['gender'] == 'female').astype(int)
df['married'] = (df['marital_status'] == 'married').astype(int)
df['urban'] = (df['area_type'] == 'urban').astype(int)

# Race dummy variables
race_dummies = pd.get_dummies(df['race'], prefix='race', drop_first=True)
df = pd.concat([df, race_dummies], axis=1)

# Region dummy variables
region_dummies = pd.get_dummies(df['region'], prefix='region', drop_first=True)
df = pd.concat([df, region_dummies], axis=1)

print("✓ Created dummy variables: female, married, urban, race_*, region_*")

# 3. Interaction terms
df['edu_x_experience'] = df['education'] * df['experience']
df['female_x_married'] = df['female'] * df['married']
df['education_x_urban'] = df['education'] * df['urban']

print("✓ Created interactions: edu_x_experience, female_x_married, education_x_urban")

# 4. Group statistics variables
df['region_mean_wage'] = df.groupby('region')['wage'].transform('mean')
df['wage_deviation'] = df['wage'] - df['region_mean_wage']

print("✓ Created group variables: region_mean_wage, wage_deviation")

print(f"\nAfter variable construction: {df.shape[1]} columns")

📊 Step 4: Descriptive Statistics

python
print("\n" + "=" * 70)
print("【Step 4】Descriptive Statistics")
print("=" * 70)

# 1. Core variable statistics
print("\nCore variables:")
print(df[['wage', 'log_wage', 'education', 'experience', 'age']].describe())

# 2. Categorical variable frequencies
print("\nGender distribution:")
print(df['gender'].value_counts())

print("\nRace distribution:")
print(df['race'].value_counts())

print("\nRegion distribution:")
print(df['region'].value_counts())

# 3. Group statistics
print("\nAverage wage by gender:")
print(df.groupby('gender')['wage'].agg(['mean', 'median', 'std']))

print("\nAverage wage by education level:")
df['education_group'] = pd.cut(df['education'],
                               bins=[0, 12, 16, 30],
                               labels=['Below HS', 'HS-College', 'Above College'])
print(df.groupby('education_group')['wage'].mean())

# 4. Correlation matrix
print("\nCorrelation matrix:")
corr_vars = ['log_wage', 'education', 'experience', 'age', 'female']
print(df[corr_vars].corr().round(3))

📈 Step 5: Visualization

python
print("\n" + "=" * 70)
print("【Step 5】Data Visualization")
print("=" * 70)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Wage Determination Research: Exploratory Data Analysis', fontsize=16, fontweight='bold')

# Subplot 1: Wage distribution
axes[0, 0].hist(df['wage'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Wage ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('(1) Wage Distribution (Right-Skewed)')

# Subplot 2: log(wage) distribution
axes[0, 1].hist(df['log_wage'], bins=50, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_xlabel('log(Wage)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('(2) log(Wage) Distribution (Approximately Normal)')

# Subplot 3: Education vs wage (scatter plot)
axes[0, 2].scatter(df['education'], df['log_wage'], alpha=0.3, s=10)
z = np.polyfit(df['education'], df['log_wage'], 1)
p = np.poly1d(z)
axes[0, 2].plot(df['education'], p(df['education']), 'r--', linewidth=2)
axes[0, 2].set_xlabel('Years of Education')
axes[0, 2].set_ylabel('log(Wage)')
axes[0, 2].set_title('(3) Education vs log(Wage)')

# Subplot 4: Wage by gender (box plot)
df.boxplot(column='log_wage', by='gender', ax=axes[1, 0])
axes[1, 0].set_xlabel('Gender')
axes[1, 0].set_ylabel('log(Wage)')
axes[1, 0].set_title('(4) Gender Wage Gap')
plt.sca(axes[1, 0])
plt.xticks([1, 2], ['Female', 'Male'])

# Subplot 5: Education-experience-wage (3D scatter)
from mpl_toolkits.mplot3d import Axes3D
ax_3d = fig.add_subplot(2, 3, 5, projection='3d')
scatter = ax_3d.scatter(df['education'], df['experience'], df['log_wage'],
                        c=df['log_wage'], cmap='viridis', alpha=0.5, s=10)
ax_3d.set_xlabel('Education')
ax_3d.set_ylabel('Experience')
ax_3d.set_zlabel('log(Wage)')
ax_3d.set_title('(5) Education-Experience-Wage Relationship')

# Subplot 6: Wage distribution by education groups
axes[1, 2].violinplot([df[df['education_group'] == g]['log_wage'].dropna()
                        for g in ['Below HS', 'HS-College', 'Above College']],
                       positions=[1, 2, 3])
axes[1, 2].set_xticks([1, 2, 3])
axes[1, 2].set_xticklabels(['Below HS', 'HS-College', 'Above College'], rotation=15)
axes[1, 2].set_ylabel('log(Wage)')
axes[1, 2].set_title('(6) Wage Distribution by Education Group')

plt.tight_layout()
plt.savefig('wage_eda.png', dpi=300, bbox_inches='tight')
print("✓ Visualization saved to wage_eda.png")

📐 Step 6: Regression Analysis

python
print("\n" + "=" * 70)
print("【Step 6】Regression Analysis")
print("=" * 70)

# Model 1: Simple regression (basic Mincer equation)
print("\n【Model 1】Simple Mincer Equation")
X1 = sm.add_constant(df[['education', 'experience', 'experience_sq']])
model1 = sm.OLS(df['log_wage'], X1).fit(cov_type='HC3')
print(model1.summary())

print(f"\nEducation return: {model1.params['education']:.4f}")
print(f"Interpretation: Each additional year of education increases wages by {model1.params['education']*100:.2f}%")

# Model 2: Add gender and marital status
print("\n【Model 2】With Demographic Variables")
X2 = sm.add_constant(df[['education', 'experience', 'experience_sq',
                         'female', 'married']])
model2 = sm.OLS(df['log_wage'], X2).fit(cov_type='HC3')

# Model 3: Add race and region dummy variables
print("\n【Model 3】With Race and Region Controls")
control_vars = ['education', 'experience', 'experience_sq',
                'female', 'married', 'urban']

# Add race dummy variables
race_cols = [col for col in df.columns if col.startswith('race_')]
control_vars.extend(race_cols)

# Add region dummy variables
region_cols = [col for col in df.columns if col.startswith('region_')]
control_vars.extend(region_cols)

X3 = sm.add_constant(df[control_vars])
model3 = sm.OLS(df['log_wage'], X3).fit(cov_type='HC3')

# Model 4: Add interactions (education × experience)
print("\n【Model 4】With Interactions")
X4 = sm.add_constant(df[control_vars + ['edu_x_experience']])
model4 = sm.OLS(df['log_wage'], X4).fit(cov_type='HC3')

# Regression results comparison table
from statsmodels.iolib.summary2 import summary_col

print("\n" + "=" * 70)
print("【Regression Results Comparison】")
print("=" * 70)

results_table = summary_col(
    [model1, model2, model3, model4],
    model_names=['(1) Basic', '(2) +Demo', '(3) +Race/Region', '(4) +Interaction'],
    stars=True,
    info_dict={'N': lambda x: f"{int(x.nobs):,}",
               'R²': lambda x: f"{x.rsquared:.3f}"}
)

print(results_table)

# Save regression results
with open('regression_results.txt', 'w', encoding='utf-8') as f:
    f.write(results_table.as_text())
print("\n✓ Regression results saved to regression_results.txt")

📊 Step 7: Results Interpretation and Reporting

python
print("\n" + "=" * 70)
print("【Step 7】Results Interpretation")
print("=" * 70)

# Extract key coefficients
edu_coef = model3.params['education']
edu_se = model3.bse['education']
edu_pval = model3.pvalues['education']

female_coef = model3.params['female']
experience_coef = model3.params['experience']

print("\n【Key Findings】")
print(f"1. Return to Education: {edu_coef:.4f} ({edu_se:.4f})")
print(f"   Interpretation: Each additional year of education increases wages by {edu_coef*100:.2f}%")
print(f"   Significance: p = {edu_pval:.4f} ***")

print(f"\n2. Gender Wage Gap: {female_coef:.4f}")
print(f"   Interpretation: Female wages are {abs(female_coef)*100:.2f}% lower than male wages")

print(f"\n3. Return to Experience: {experience_coef:.4f}")
print(f"   Interpretation: Each additional year of experience increases wages by {experience_coef*100:.2f}%")

# Calculate optimal experience years (quadratic term vertex)
exp_sq_coef = model3.params['experience_sq']
optimal_exp = -experience_coef / (2 * exp_sq_coef)
print(f"\n4. Optimal Experience Years: {optimal_exp:.1f} years")

# Generate predictions
print("\n【Prediction Example】")
# Predict: college graduate, 5 years experience, male, white, east
pred_data = pd.DataFrame({
    'education': [16],
    'experience': [5],
    'experience_sq': [25],
    'female': [0],
    'married': [1],
    'urban': [1]
})

# Add race and region dummy variables (all 0, i.e., reference group)
for col in race_cols + region_cols:
    pred_data[col] = 0

pred_data_const = sm.add_constant(pred_data)
log_wage_pred = model3.predict(pred_data_const)[0]
wage_pred = np.exp(log_wage_pred)

print(f"Predicted wage: ${wage_pred:,.2f}")

# Visualize predictions
fig, ax = plt.subplots(figsize=(10, 6))

# Wage-experience curves for different education levels
exp_range = np.arange(0, 40, 1)
for edu_level in [12, 16, 20]:
    wages = []
    for exp in exp_range:
        pred_temp = pred_data.copy()
        pred_temp['education'] = edu_level
        pred_temp['experience'] = exp
        pred_temp['experience_sq'] = exp ** 2
        pred_temp_const = sm.add_constant(pred_temp)
        log_w = model3.predict(pred_temp_const)[0]
        wages.append(np.exp(log_w))

    ax.plot(exp_range, wages, label=f'Education={edu_level} years', linewidth=2)

ax.set_xlabel('Work Experience (years)', fontsize=12)
ax.set_ylabel('Predicted Wage ($)', fontsize=12)
ax.set_title('Wage-Experience Curves (by Education Level)', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('wage_prediction.png', dpi=300, bbox_inches='tight')
print("\n✓ Prediction visualization saved to wage_prediction.png")

📋 Complete Summary

python
print("\n" + "=" * 70)
print("【Complete Analysis Summary】")
print("=" * 70)

summary_report = f"""
## Wage Determination Research: Complete Analysis Report

### 1. Data Overview
- Original sample: {len(df_raw)} rows
- Cleaned sample: {len(df)} rows
- Number of variables: {df.shape[1]}

### 2. Key Findings

#### (1) Return to Education: {edu_coef*100:.2f}%
- Each additional year of education increases wages by {edu_coef*100:.2f}%
- Statistical significance: p < 0.001
- 95% confidence interval: [{(edu_coef - 1.96*edu_se)*100:.2f}%, {(edu_coef + 1.96*edu_se)*100:.2f}%]

#### (2) Gender Wage Gap: {abs(female_coef)*100:.2f}%
- Female wages are {abs(female_coef)*100:.2f}% lower than male wages (after controlling for other factors)
- This gap cannot be explained by education, experience, and other factors

#### (3) Return to Experience: {experience_coef*100:.2f}%
- Each additional year of experience increases wages by {experience_coef*100:.2f}%
- Exhibits diminishing marginal returns (negative squared term)
- Optimal experience years: {optimal_exp:.1f} years

### 3. Model Fit
- R²: {model3.rsquared:.3f}
- Adjusted R²: {model3.rsquared_adj:.3f}
- F-statistic: {model3.fvalue:.2f} (p < 0.001)

### 4. Robustness Checks
✓ Used heteroskedasticity-robust standard errors (HC3)
✓ Winsorized extreme values
✓ Controlled for race and region fixed effects

### 5. Policy Implications
- Significant and robust return to education investment
- Gender wage gap requires policy intervention
- Experience accumulation is important, but has an optimal point

---
**Analysis Completion Time**: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}
**Analysis Tools**: Python + Pandas + Statsmodels
"""

print(summary_report)

# Save report
with open('wage_analysis_report.md', 'w', encoding='utf-8') as f:
    f.write(summary_report)

print("\n✓ Complete report saved to wage_analysis_report.md")
print("=" * 70)
print("Analysis Complete!")
print("=" * 70)

📚 Summary

Complete Workflow Review

StepTaskCore Functions
1Data importpd.read_csv()
2Data cleaningdropna(), winsorize()
3Variable constructionget_dummies(), transform()
4Descriptive statisticsdescribe(), groupby()
5Visualizationmatplotlib, seaborn
6Regression analysissm.OLS()
7Results interpretationCoefficient interpretation, prediction

Key Takeaways

  1. Preserve original data: Use df.copy()
  2. Validate each step: Check data shape and statistics
  3. Reproducibility: Set random seed, save code
  4. Robustness: Heteroskedasticity-robust standard errors, Winsorize
  5. Visualization: Charts are more intuitive than numbers

📝 Practice

Using your own dataset, complete the full workflow from data import to regression analysis.


Module 3 Complete!


Next Steps: Module 4 - Hypothesis Testing and Statistical Inference


Reference Resources:

  • Angrist & Pischke (2009): Mostly Harmless Econometrics
  • Wooldridge (2020): Introductory Econometrics (7th Ed)

Released under the MIT License. Content © Author.