3.7 Complete Case Study: Wage Determination Research
"In God we trust, all others must bring data.""除了上帝,其他人都必须用数据说话。"— W. Edwards Deming, Statistician & Quality Control Expert (统计学家、质量管理专家)
Complete workflow from raw data to regression analysis
🎯 Case Objectives
Research Question: What is the causal effect of education on wages?
Data Source: Simulated NLSY (National Longitudinal Survey of Youth) data
Complete Workflow:
- Data import and inspection
- Data cleaning
- Variable construction
- Data transformation
- Descriptive statistics
- Regression analysis
- Results visualization and reporting
📥 Step 1: Data Import and Inspection
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
# Settings
sns.set_style('whitegrid')
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
np.random.seed(42)
# 1. Read data
print("=" * 70)
print("【Step 1】Data Import")
print("=" * 70)
df_raw = pd.read_csv('wage_data_raw.csv', encoding='utf-8')
print(f"✓ Successfully read {len(df_raw)} rows, {df_raw.shape[1]} columns")
# 2. Initial inspection
print("\nFirst 5 rows:")
print(df_raw.head())
print("\nData types:")
print(df_raw.dtypes)
print("\nBasic statistics:")
print(df_raw.describe())
# 3. Missing values
print("\nMissing values:")
missing = df_raw.isnull().sum()
print(missing[missing > 0])
# 4. Duplicates
print(f"\nDuplicate rows: {df_raw.duplicated().sum()}")🧹 Step 2: Data Cleaning
python
print("\n" + "=" * 70)
print("【Step 2】Data Cleaning")
print("=" * 70)
df = df_raw.copy() # Preserve original data
# 1. Remove complete duplicates
before = len(df)
df = df.drop_duplicates()
print(f"✓ Removed duplicate rows: {before - len(df)} rows")
# 2. Handle missing values
# Remove rows with missing key variables (wage or education)
before = len(df)
df = df.dropna(subset=['wage', 'education'])
print(f"✓ Removed rows with key variable missing: {before - len(df)} rows")
# Fill other variables with median/mode
df['age'].fillna(df['age'].median(), inplace=True)
df['experience'].fillna(df['experience'].median(), inplace=True)
# 3. Handle outliers
# Remove obviously incorrect values
before = len(df)
df = df[
(df['age'] >= 18) & (df['age'] <= 65) &
(df['wage'] > 0) & (df['wage'] < 500000) &
(df['education'] >= 0) & (df['education'] <= 30) &
(df['experience'] >= 0) & (df['experience'] <= 50)
]
print(f"✓ Removed outliers: {before - len(df)} rows")
# 4. Winsorize income (handle extreme values)
from scipy.stats.mstats import winsorize
df['wage'] = winsorize(df['wage'], limits=[0.01, 0.01])
print("✓ Winsorized wage (1%-99%)")
# 5. Standardize string formats
df['gender'] = df['gender'].str.lower().str.strip()
df['race'] = df['race'].str.lower().str.strip()
df['region'] = df['region'].str.lower().str.strip()
print(f"\nAfter cleaning: {len(df)} rows, {df.shape[1]} columns")🔧 Step 3: Variable Construction
python
print("\n" + "=" * 70)
print("【Step 3】Variable Construction")
print("=" * 70)
# 1. Continuous variable transformations
df['log_wage'] = np.log(df['wage'])
df['experience'] = df['age'] - df['education'] - 6 # Recalculate experience
df['experience_sq'] = df['experience'] ** 2 # Quadratic term
print("✓ Created log_wage, experience, experience_sq")
# 2. Dummy variables
df['female'] = (df['gender'] == 'female').astype(int)
df['married'] = (df['marital_status'] == 'married').astype(int)
df['urban'] = (df['area_type'] == 'urban').astype(int)
# Race dummy variables
race_dummies = pd.get_dummies(df['race'], prefix='race', drop_first=True)
df = pd.concat([df, race_dummies], axis=1)
# Region dummy variables
region_dummies = pd.get_dummies(df['region'], prefix='region', drop_first=True)
df = pd.concat([df, region_dummies], axis=1)
print("✓ Created dummy variables: female, married, urban, race_*, region_*")
# 3. Interaction terms
df['edu_x_experience'] = df['education'] * df['experience']
df['female_x_married'] = df['female'] * df['married']
df['education_x_urban'] = df['education'] * df['urban']
print("✓ Created interactions: edu_x_experience, female_x_married, education_x_urban")
# 4. Group statistics variables
df['region_mean_wage'] = df.groupby('region')['wage'].transform('mean')
df['wage_deviation'] = df['wage'] - df['region_mean_wage']
print("✓ Created group variables: region_mean_wage, wage_deviation")
print(f"\nAfter variable construction: {df.shape[1]} columns")📊 Step 4: Descriptive Statistics
python
print("\n" + "=" * 70)
print("【Step 4】Descriptive Statistics")
print("=" * 70)
# 1. Core variable statistics
print("\nCore variables:")
print(df[['wage', 'log_wage', 'education', 'experience', 'age']].describe())
# 2. Categorical variable frequencies
print("\nGender distribution:")
print(df['gender'].value_counts())
print("\nRace distribution:")
print(df['race'].value_counts())
print("\nRegion distribution:")
print(df['region'].value_counts())
# 3. Group statistics
print("\nAverage wage by gender:")
print(df.groupby('gender')['wage'].agg(['mean', 'median', 'std']))
print("\nAverage wage by education level:")
df['education_group'] = pd.cut(df['education'],
bins=[0, 12, 16, 30],
labels=['Below HS', 'HS-College', 'Above College'])
print(df.groupby('education_group')['wage'].mean())
# 4. Correlation matrix
print("\nCorrelation matrix:")
corr_vars = ['log_wage', 'education', 'experience', 'age', 'female']
print(df[corr_vars].corr().round(3))📈 Step 5: Visualization
python
print("\n" + "=" * 70)
print("【Step 5】Data Visualization")
print("=" * 70)
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Wage Determination Research: Exploratory Data Analysis', fontsize=16, fontweight='bold')
# Subplot 1: Wage distribution
axes[0, 0].hist(df['wage'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Wage ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('(1) Wage Distribution (Right-Skewed)')
# Subplot 2: log(wage) distribution
axes[0, 1].hist(df['log_wage'], bins=50, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_xlabel('log(Wage)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('(2) log(Wage) Distribution (Approximately Normal)')
# Subplot 3: Education vs wage (scatter plot)
axes[0, 2].scatter(df['education'], df['log_wage'], alpha=0.3, s=10)
z = np.polyfit(df['education'], df['log_wage'], 1)
p = np.poly1d(z)
axes[0, 2].plot(df['education'], p(df['education']), 'r--', linewidth=2)
axes[0, 2].set_xlabel('Years of Education')
axes[0, 2].set_ylabel('log(Wage)')
axes[0, 2].set_title('(3) Education vs log(Wage)')
# Subplot 4: Wage by gender (box plot)
df.boxplot(column='log_wage', by='gender', ax=axes[1, 0])
axes[1, 0].set_xlabel('Gender')
axes[1, 0].set_ylabel('log(Wage)')
axes[1, 0].set_title('(4) Gender Wage Gap')
plt.sca(axes[1, 0])
plt.xticks([1, 2], ['Female', 'Male'])
# Subplot 5: Education-experience-wage (3D scatter)
from mpl_toolkits.mplot3d import Axes3D
ax_3d = fig.add_subplot(2, 3, 5, projection='3d')
scatter = ax_3d.scatter(df['education'], df['experience'], df['log_wage'],
c=df['log_wage'], cmap='viridis', alpha=0.5, s=10)
ax_3d.set_xlabel('Education')
ax_3d.set_ylabel('Experience')
ax_3d.set_zlabel('log(Wage)')
ax_3d.set_title('(5) Education-Experience-Wage Relationship')
# Subplot 6: Wage distribution by education groups
axes[1, 2].violinplot([df[df['education_group'] == g]['log_wage'].dropna()
for g in ['Below HS', 'HS-College', 'Above College']],
positions=[1, 2, 3])
axes[1, 2].set_xticks([1, 2, 3])
axes[1, 2].set_xticklabels(['Below HS', 'HS-College', 'Above College'], rotation=15)
axes[1, 2].set_ylabel('log(Wage)')
axes[1, 2].set_title('(6) Wage Distribution by Education Group')
plt.tight_layout()
plt.savefig('wage_eda.png', dpi=300, bbox_inches='tight')
print("✓ Visualization saved to wage_eda.png")📐 Step 6: Regression Analysis
python
print("\n" + "=" * 70)
print("【Step 6】Regression Analysis")
print("=" * 70)
# Model 1: Simple regression (basic Mincer equation)
print("\n【Model 1】Simple Mincer Equation")
X1 = sm.add_constant(df[['education', 'experience', 'experience_sq']])
model1 = sm.OLS(df['log_wage'], X1).fit(cov_type='HC3')
print(model1.summary())
print(f"\nEducation return: {model1.params['education']:.4f}")
print(f"Interpretation: Each additional year of education increases wages by {model1.params['education']*100:.2f}%")
# Model 2: Add gender and marital status
print("\n【Model 2】With Demographic Variables")
X2 = sm.add_constant(df[['education', 'experience', 'experience_sq',
'female', 'married']])
model2 = sm.OLS(df['log_wage'], X2).fit(cov_type='HC3')
# Model 3: Add race and region dummy variables
print("\n【Model 3】With Race and Region Controls")
control_vars = ['education', 'experience', 'experience_sq',
'female', 'married', 'urban']
# Add race dummy variables
race_cols = [col for col in df.columns if col.startswith('race_')]
control_vars.extend(race_cols)
# Add region dummy variables
region_cols = [col for col in df.columns if col.startswith('region_')]
control_vars.extend(region_cols)
X3 = sm.add_constant(df[control_vars])
model3 = sm.OLS(df['log_wage'], X3).fit(cov_type='HC3')
# Model 4: Add interactions (education × experience)
print("\n【Model 4】With Interactions")
X4 = sm.add_constant(df[control_vars + ['edu_x_experience']])
model4 = sm.OLS(df['log_wage'], X4).fit(cov_type='HC3')
# Regression results comparison table
from statsmodels.iolib.summary2 import summary_col
print("\n" + "=" * 70)
print("【Regression Results Comparison】")
print("=" * 70)
results_table = summary_col(
[model1, model2, model3, model4],
model_names=['(1) Basic', '(2) +Demo', '(3) +Race/Region', '(4) +Interaction'],
stars=True,
info_dict={'N': lambda x: f"{int(x.nobs):,}",
'R²': lambda x: f"{x.rsquared:.3f}"}
)
print(results_table)
# Save regression results
with open('regression_results.txt', 'w', encoding='utf-8') as f:
f.write(results_table.as_text())
print("\n✓ Regression results saved to regression_results.txt")📊 Step 7: Results Interpretation and Reporting
python
print("\n" + "=" * 70)
print("【Step 7】Results Interpretation")
print("=" * 70)
# Extract key coefficients
edu_coef = model3.params['education']
edu_se = model3.bse['education']
edu_pval = model3.pvalues['education']
female_coef = model3.params['female']
experience_coef = model3.params['experience']
print("\n【Key Findings】")
print(f"1. Return to Education: {edu_coef:.4f} ({edu_se:.4f})")
print(f" Interpretation: Each additional year of education increases wages by {edu_coef*100:.2f}%")
print(f" Significance: p = {edu_pval:.4f} ***")
print(f"\n2. Gender Wage Gap: {female_coef:.4f}")
print(f" Interpretation: Female wages are {abs(female_coef)*100:.2f}% lower than male wages")
print(f"\n3. Return to Experience: {experience_coef:.4f}")
print(f" Interpretation: Each additional year of experience increases wages by {experience_coef*100:.2f}%")
# Calculate optimal experience years (quadratic term vertex)
exp_sq_coef = model3.params['experience_sq']
optimal_exp = -experience_coef / (2 * exp_sq_coef)
print(f"\n4. Optimal Experience Years: {optimal_exp:.1f} years")
# Generate predictions
print("\n【Prediction Example】")
# Predict: college graduate, 5 years experience, male, white, east
pred_data = pd.DataFrame({
'education': [16],
'experience': [5],
'experience_sq': [25],
'female': [0],
'married': [1],
'urban': [1]
})
# Add race and region dummy variables (all 0, i.e., reference group)
for col in race_cols + region_cols:
pred_data[col] = 0
pred_data_const = sm.add_constant(pred_data)
log_wage_pred = model3.predict(pred_data_const)[0]
wage_pred = np.exp(log_wage_pred)
print(f"Predicted wage: ${wage_pred:,.2f}")
# Visualize predictions
fig, ax = plt.subplots(figsize=(10, 6))
# Wage-experience curves for different education levels
exp_range = np.arange(0, 40, 1)
for edu_level in [12, 16, 20]:
wages = []
for exp in exp_range:
pred_temp = pred_data.copy()
pred_temp['education'] = edu_level
pred_temp['experience'] = exp
pred_temp['experience_sq'] = exp ** 2
pred_temp_const = sm.add_constant(pred_temp)
log_w = model3.predict(pred_temp_const)[0]
wages.append(np.exp(log_w))
ax.plot(exp_range, wages, label=f'Education={edu_level} years', linewidth=2)
ax.set_xlabel('Work Experience (years)', fontsize=12)
ax.set_ylabel('Predicted Wage ($)', fontsize=12)
ax.set_title('Wage-Experience Curves (by Education Level)', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('wage_prediction.png', dpi=300, bbox_inches='tight')
print("\n✓ Prediction visualization saved to wage_prediction.png")📋 Complete Summary
python
print("\n" + "=" * 70)
print("【Complete Analysis Summary】")
print("=" * 70)
summary_report = f"""
## Wage Determination Research: Complete Analysis Report
### 1. Data Overview
- Original sample: {len(df_raw)} rows
- Cleaned sample: {len(df)} rows
- Number of variables: {df.shape[1]}
### 2. Key Findings
#### (1) Return to Education: {edu_coef*100:.2f}%
- Each additional year of education increases wages by {edu_coef*100:.2f}%
- Statistical significance: p < 0.001
- 95% confidence interval: [{(edu_coef - 1.96*edu_se)*100:.2f}%, {(edu_coef + 1.96*edu_se)*100:.2f}%]
#### (2) Gender Wage Gap: {abs(female_coef)*100:.2f}%
- Female wages are {abs(female_coef)*100:.2f}% lower than male wages (after controlling for other factors)
- This gap cannot be explained by education, experience, and other factors
#### (3) Return to Experience: {experience_coef*100:.2f}%
- Each additional year of experience increases wages by {experience_coef*100:.2f}%
- Exhibits diminishing marginal returns (negative squared term)
- Optimal experience years: {optimal_exp:.1f} years
### 3. Model Fit
- R²: {model3.rsquared:.3f}
- Adjusted R²: {model3.rsquared_adj:.3f}
- F-statistic: {model3.fvalue:.2f} (p < 0.001)
### 4. Robustness Checks
✓ Used heteroskedasticity-robust standard errors (HC3)
✓ Winsorized extreme values
✓ Controlled for race and region fixed effects
### 5. Policy Implications
- Significant and robust return to education investment
- Gender wage gap requires policy intervention
- Experience accumulation is important, but has an optimal point
---
**Analysis Completion Time**: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}
**Analysis Tools**: Python + Pandas + Statsmodels
"""
print(summary_report)
# Save report
with open('wage_analysis_report.md', 'w', encoding='utf-8') as f:
f.write(summary_report)
print("\n✓ Complete report saved to wage_analysis_report.md")
print("=" * 70)
print("Analysis Complete!")
print("=" * 70)📚 Summary
Complete Workflow Review
| Step | Task | Core Functions |
|---|---|---|
| 1 | Data import | pd.read_csv() |
| 2 | Data cleaning | dropna(), winsorize() |
| 3 | Variable construction | get_dummies(), transform() |
| 4 | Descriptive statistics | describe(), groupby() |
| 5 | Visualization | matplotlib, seaborn |
| 6 | Regression analysis | sm.OLS() |
| 7 | Results interpretation | Coefficient interpretation, prediction |
Key Takeaways
- Preserve original data: Use
df.copy() - Validate each step: Check data shape and statistics
- Reproducibility: Set random seed, save code
- Robustness: Heteroskedasticity-robust standard errors, Winsorize
- Visualization: Charts are more intuitive than numbers
📝 Practice
Using your own dataset, complete the full workflow from data import to regression analysis.
Module 3 Complete!
Next Steps: Module 4 - Hypothesis Testing and Statistical Inference
Reference Resources:
- Angrist & Pischke (2009): Mostly Harmless Econometrics
- Wooldridge (2020): Introductory Econometrics (7th Ed)