3.1 Chapter Introduction (Data Cleaning and Variable Construction)
"Good data is the foundation of good econometrics. Garbage in, garbage out.""高质量数据是优秀计量经济学研究的基础。垃圾进,垃圾出。"— Joshua Angrist & Jörn-Steffen Pischke, 2021 Nobel Laureates (2021年诺贝尔经济学奖得主)
From raw data to causal inference: The cornerstone of econometric research
Why Is This Chapter the Lifeline of Econometrics?
Nobel Prize-Level Lessons: How Data Errors Destroy Policy
Case 1: Reinhart-Rogoff Excel Error (2010-2013)
Background:
- In 2010, Harvard economists Carmen Reinhart and Kenneth Rogoff published a paper
- Core Conclusion: When government debt exceeds 90% of GDP, economic growth declines significantly
- Impact: Multiple governments implemented austerity policies based on this finding (Europe, US, IMF policy recommendations)
Data Errors:
1. Excel formula error: Omitted data from 5 countries (New Zealand, Australia, Canada, etc.)
2. Selective weighting: Inappropriately assigned different weights to different countries
3. Improper handling of outliers: Failed to exclude post-war special period dataConsequences:
- In 2013, University of Massachusetts graduate student Thomas Herndon discovered the error
- After reanalysis: Conclusion was completely overturned
- Countries with debt >90% had average growth rate of 2.2%, not -0.1%
- Policy Impact: European austerity policies led to soaring unemployment and deepening economic recession
Citation:
Herndon, T., Ash, M., & Pollin, R. (2014). "Does high public debt consistently
stifle economic growth? A critique of Reinhart and Rogoff." Cambridge Journal
of Economics, 38(2), 257-279.Case 2: Microsoft Tay Chatbot Data Disaster (2016)
Problem: Uncleaned training data contained a large amount of hate speech Consequence: The bot posted racist and sexist messages 16 hours after going online Lesson: Data quality directly determines model behavior
Case 3: 2020 US Election Poll Errors
Data Problems:
- Sample selection bias: Online surveys missed specific groups
- Inappropriate weighting schemes: Did not adequately consider education-party affiliation interactions
- Missing value handling: Simple deletion led to systematic bias
Lesson: Seemingly "clean" survey data still requires careful examination and adjustment
The Central Role of Data Cleaning in Causal Inference
Data Quality Under the Potential Outcomes Framework
Rubin Causal Model (RCM):
For individual , define:
Where:
- : Potential outcome for treated group
- : Potential outcome for control group
- : Treatment indicator variable
Average Treatment Effect (ATE):
How Does Data Quality Affect Causal Inference?
1. Missing Values & Selection Bias
Problem: If missingness of or is related to potential outcomes (MNAR), then:
Example:
# High-income individuals are more likely to refuse to answer income questions
# → Sample skews toward low income
# → Estimated return to education is systematically lowMathematical Proof:
Let missing indicator (1=missing). If missing not at random (MNAR):
Then sample average causal effect:
Has selection bias:
2. Outliers & Omitted Variable Bias (OVB)
Problem: Extreme values may reflect unobserved confounders
Example: Abnormal CEO salaries may reflect company size, industry characteristics
OLS Bias Formula:
Where is the omitted variable. Outliers amplify , increasing bias.
3. Measurement Error & Attenuation Bias
Classical Measurement Error Model:
True model:
Observed model (with measurement error):
OLS Estimate:
Where is the reliability ratio.
Conclusion: Measurement error causes coefficients to attenuate toward zero, underestimating the true effect!
Missing Data Theory: Little & Rubin Framework
Mathematical Structure of Missing Data
Complete data matrix
Missing indicator matrix :
Three Types of Missing Mechanisms
1. MCAR (Missing Completely At Random)
Definition: Probability of missingness is unrelated to all variables (including itself)
Mathematical Expression:
Examples:
- In surveys, respondents randomly skip questions (unintentionally)
- Laboratory equipment randomly fails causing data loss
Handling Strategy: Complete case analysis (listwise deletion) remains unbiased
Proof:
2. MAR (Missing At Random)
Definition: Probability of missingness depends only on observed data, not on unobserved data
Mathematical Expression:
Examples:
- Highly educated individuals are more likely to refuse to answer income questions
- Probability of missingness depends on "education" (observable), not "income" itself
Handling Strategy: Conditional imputation, Multiple Imputation
Rubin's Rules for Multiple Imputation:
Let estimates after imputations be , with variances :
- Point Estimate:
- Total Variance:
- Degrees of Freedom:
3. MNAR (Missing Not At Random)
Definition: Probability of missingness depends on the unobserved data itself
Mathematical Expression:
Examples:
- Depressed patients are more likely to drop out of mental health surveys
- Probability of missingness depends on "depression level" (unobserved)
Handling Strategies:
- Sensitivity Analysis
- Selection Models
- Pattern-Mixture Models
- Instrumental Variables
Heckman Two-Step Method (Heckman, 1979, Nobel Prize 2000):
Step 1 - Selection Equation (Probit):
Step 2 - Outcome Equation (OLS + IMR):
Where is the Inverse Mills Ratio:
Little's MCAR Test
Null Hypothesis: Data is MCAR
Test Statistic:
Where:
- : Number of missing patterns
- : Mean vector for pattern
- : Overall mean vector
Python Implementation see complete class below**
Data Quality & SUTVA (Stable Unit Treatment Value Assumption)
Definition of SUTVA
Stable Unit Treatment Value Assumption:
- No Interference:
- No Hidden Variations: Only one version of the treatment exists
How Does Data Quality Threaten SUTVA?
1. Data Leakage Causes Interference
Example: In social network research, individual 's friend circle also receives treatment
Violates SUTVA:
Handling: Network clustering, spatial regression models
2. Measurement Error Causes Inconsistent Treatment Versions
Example: Measurement error in years of education → 16-year bachelor's vs 18-year bachelor's+master's treated as same treatment
Violates SUTVA: Multiple treatment versions actually exist
Handling: Instrumental variables, multivalued treatment effects
Core Content Framework of This Chapter
Module 1: Data Import and Quality Diagnosis (Section 3.2)
Core Skills:
- Multi-format data reading (CSV, Stata, SPSS, SQL, Parquet)
- Automated data quality report generation
- Panel data structure checking (balance, time coverage)
Python Core Class:
class DataQualityChecker:
"""Complete automated data quality detection system"""
def __init__(self, df): ...
def check_missing(self): ...
def check_outliers(self): ...
def check_duplicates(self): ...
def check_panel_structure(self): ...
def generate_report(self): ...Real Datasets:
- NLSY97 (National Longitudinal Survey of Youth)
- CFPS 2018 (China Family Panel Studies)
- World Bank Open Data
Module 2: Data Cleaning & Missing Value Handling (Section 3.3)
Theoretical Foundation:
- Little & Rubin (2019) Missing Data Theory
- Rubin (1976) Multiple Imputation Framework
- Heckman (1979) Sample Selection Model
Core Skills:
- Missing mechanism diagnosis (Little's MCAR Test)
- Multiple Imputation (MICE Algorithm)
- Robust outlier handling (Winsorization, IHS transformation)
- Intelligent duplicate identification (Fuzzy Matching)
Python Core Class:
class DataCleaner:
"""Production-level data cleaning pipeline"""
def __init__(self, df): ...
def diagnose_missing(self): ... # MCAR/MAR/MNAR
def impute_missing(self, method='MICE'): ...
def handle_outliers(self, method='winsorize'): ...
def remove_duplicates(self, fuzzy=False): ...
def clean_strings(self): ...LaTeX Formula Count: 30+
Module 3: Variable Construction & Treatment Effects (Section 3.4)
Theoretical Foundation:
- Angrist & Pischke (2009) Interpretation of Interaction Terms
- Imbens & Rubin (2015) Treatment Effect Decomposition
- Wooldridge (2010) Dynamic Panel Models
Core Skills:
- Dummy variables and reference group selection
- Calculation of marginal effects in interactions
- Lag operators and dynamic panels
- Decomposition of treatment effect heterogeneity
Python Core Class:
class VariableConstructor:
"""Econometric model variable constructor"""
def __init__(self, df): ...
def create_dummies(self, var, drop_first=True): ...
def create_interactions(self, var1, var2): ...
def create_lags(self, var, lags=1): ...
def create_group_stats(self, var, groupby): ...
def mincer_equation_vars(self): ... # Complete Mincer equationReal Cases:
- Mincer (1974) Wage Equation: Complete construction of 15+ variables
- Card & Krueger (1994) Minimum Wage: DID variable construction
- CFPS 2018: Three-level panel variables (household, individual, regional, provincial)
Module 4: Data Transformation & Distribution Adjustment (Section 3.5)
Theoretical Foundation:
- Box & Cox (1964) Power Transformation
- Yeo & Johnson (2000) Extended Box-Cox
- Burbidge et al. (1988) IHS Transformation Theory
Core Skills:
- Three interpretations of log transformation (elasticity, variance stabilization, normalization)
- Box-Cox optimal λ parameter selection
- Inverse Hyperbolic Sine (handling zero values)
- Standardization and coefficient comparability
Python Core Class:
class TransformationPipeline:
"""Complete data transformation pipeline"""
def __init__(self, df): ...
def log_transform(self, var, method='natural'): ...
def boxcox_transform(self, var): ...
def ihs_transform(self, var): ... # arcsinh
def standardize(self, var, method='z-score'): ...
def optimal_lambda(self, var): ... # Box-Cox λMathematical Derivation:
Why Does Log Transformation Stabilize Variance?
Suppose follows lognormal distribution:
Then:
Variance depends on mean! But after log transformation:
Module 5: Data Merging & Panel Construction (Section 3.6)
Theoretical Foundation:
- Set Theory merge types
- Panel Data Econometrics (Wooldridge, 2010)
- Relational Database Theory (Codd, 1970)
Core Skills:
- Set-theoretic proof of four join types
- Panel data balance testing
- Time-varying treatment variable construction
- Multi-level data merging (individual-household-region-province)
Python Core Class:
class MergeValidator:
"""Data merge integrity checker"""
def __init__(self): ...
def validate_keys(self, left, right, on): ...
def check_cardinality(self, merged): ... # 1:1, 1:m, m:1
def diagnose_non_match(self, left, right, on): ...
def merge_panel(self, dfs, on): ... # Multi-year data mergingModule 6: Complete Case Study - Return to Education Research (Section 3.7)
Research Design:
- Data: Card & Krueger (1992) or NLSY97
- Question: Causal effect of education on income
- Method: Mincer Wage Equation + Instrumental Variables
Complete Workflow (1200+ lines):
- Data import (Stata .dta)
- Quality diagnosis (30% missing rate)
- Missing value handling (MICE algorithm)
- Variable construction (15+ variables)
- Outlier handling (Winsorization)
- Descriptive statistics
- OLS regression (4 models)
- Robustness checks
- Result visualization
- Policy recommendations
Practice Problems (⭐ to ⭐⭐⭐⭐⭐):
- ⭐ Basic: Read data, calculate missing rate
- ⭐⭐ Advanced: Implement Heckman two-step
- ⭐⭐⭐ Advanced: Multiple imputation + sensitivity analysis
- ⭐⭐⭐⭐ Expert: Instrumental variables + heterogeneity analysis
- ⭐⭐⭐⭐⭐ Nobel: Complete research paper (with causal inference)
Learning Objectives & Competency Matrix
After completing this chapter, you will achieve the following competency levels:
| Competency Dimension | Beginner (⭐⭐) | Intermediate (⭐⭐⭐) | Advanced (⭐⭐⭐⭐) | Expert (⭐⭐⭐⭐⭐) |
|---|---|---|---|---|
| Data Diagnosis | Identify missing values | MCAR/MAR testing | Little's Test | Sensitivity analysis |
| Missing Value Handling | Simple deletion/imputation | Conditional imputation | MICE algorithm | Heckman selection model |
| Outlier Handling | IQR method | Winsorization | Robust regression | M-estimators |
| Variable Construction | Dummy variables | Interactions | Dynamic panel | Treatment effect decomposition |
| Data Transformation | Log, standardization | Box-Cox | IHS transformation | Power transformation family |
| Causal Inference | OLS regression | Fixed effects | DID | IV + heterogeneity |
Learning Roadmap (4-Week Plan)
Week 1: Data Import & Quality Diagnosis
Day 1-2: Multi-format data reading (CSV, Stata, SPSS, SQL)
Day 3-4: DataQualityChecker class implementation
Day 5-6: Panel data structure checking
Day 7: Complete quality report generationWeek 2: Data Cleaning Theory & Practice
Day 1-2: Little & Rubin missing data theory
Day 3-4: MICE algorithm implementation
Day 5-6: Winsorization + IHS transformation
Day 7: DataCleaner class integrationWeek 3: Variable Construction & Transformation
Day 1-2: Dummy variables + interactions (Angrist & Pischke)
Day 3-4: Dynamic panels (lag operators)
Day 5-6: Box-Cox transformation theory
Day 7: TransformationPipeline classWeek 4: Comprehensive Cases & Practice
Day 1-3: Complete replication of Card & Krueger (1994)
Day 4-5: NLSY97 return to education analysis
Day 6-7: Independent project + research reportKnowledge Network with Other Modules
Prerequisites
From Python Fundamentals:
- Module 4: Data structures (DataFrame, MultiIndex)
- Module 9: Pandas advanced operations (groupby, merge, pivot)
From StatsPai:
- Module 1: OLS regression basics
- Module 2: Causal inference framework (potential outcomes model)
Subsequent Applications
Module 4-5: Hypothesis Testing & t-tests
- Using cleaned data for inference
- Impact of missing values on standard errors
Module 7: Time Series & Event Studies
- Long-format construction of panel data
- Application of lag terms in ARIMA models
Module 8: Advanced Econometrics
- Complete Heckman selection model implementation
- Instrumental variables & variable construction
- Heteroskedasticity → Box-Cox transformation
Module 9: Panel Data Methods
- Data requirements for fixed effects models
- Dynamic panel (Arellano-Bond)
Authoritative References
Core Textbooks
Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd Edition). Wiley.
- The bible of missing data theory
- Mathematical foundation of MCAR/MAR/MNAR
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
- Causal inference perspective on variable construction
- Interaction terms and treatment effect heterogeneity
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd Edition). MIT Press.
- Authoritative work on panel data handling
- Dynamic panel models
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.
- Potential outcomes framework
- Data quality and causal effect identification
Classic Papers
Rubin, D. B. (1976). "Inference and missing data." Biometrika, 63(3), 581-592.
- Original paper on multiple imputation
Heckman, J. J. (1979). "Sample selection bias as a specification error." Econometrica, 47(1), 153-161.
- Sample selection model (2000 Nobel Prize)
Box, G. E., & Cox, D. R. (1964). "An analysis of transformations." Journal of the Royal Statistical Society: Series B, 26(2), 211-243.
- Box-Cox transformation theory
Mincer, J. (1974). Schooling, Experience, and Earnings. NBER.
- Mincer wage equation
Card, D., & Krueger, A. B. (1994). "Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania." American Economic Review, 84(4), 772-793.
- Classic case of data cleaning and DID methods
Python Practical Resources
- McKinney, W. (2022). Python for Data Analysis (3rd Edition). O'Reilly.
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly.
- Pandas Official Documentation: https://pandas.pydata.org/docs/
Learning Methods & Recommendations
DO (Recommended Practices)
- Learn by Doing: Run every function in Jupyter Notebook
- Understand Principles: Not just know-how, but know-why
- Why does log transformation stabilize variance? (Mathematical proof)
- How is MICE algorithm convergence guaranteed?
- Preserve Original Data:
df_clean = df_raw.copy() - Document Process: Add detailed comments, generate cleaning reports
- Verify Results: Check data shape and statistics after each step
- Replicate Classic Studies: Card & Krueger, Mincer equation
DON'T (Avoid Pitfalls)
Don't Blindly Delete Missing Values
- First diagnose missing mechanism (MCAR/MAR/MNAR)
- Deleting MAR/MNAR causes selection bias
Don't Ignore Data Types
- String "100" ≠ number 100
- Inconsistent date formats cause merge failures
Don't Over-Clean
- Preserve necessary variation
- Outliers may be true signals
Don't Skip Exploratory Analysis
- Garbage in, garbage out
- Descriptive statistics more important than regression
Don't Ignore Causal Inference
- Goal of data cleaning is unbiased estimation of causal effects
- Improper missing value handling → selection bias → causal inference failure
Supporting Datasets for This Chapter
| Dataset | Description | Sample Size | Time Span | Format | Download |
|---|---|---|---|---|---|
| wage_data_raw.csv | Simulated wage data (with quality issues) | 5,000 | 2020 | CSV | /data/module-3/ |
| nlsy97_sample.dta | NLSY97 subsample | 8,984 | 1997-2017 | Stata | /data/module-3/ |
| cfps2018_clean.csv | CFPS 2018 cleaned version | 37,147 | 2018 | CSV | /data/module-3/ |
| world_bank_panel.xlsx | World Bank country panel | 217×30 | 1990-2020 | Excel | /data/module-3/ |
| card_krueger_1994.dta | Minimum wage study data | 410 | 1992 | Stata | /data/module-3/ |
Data Dictionary: Each dataset comes with complete variable documentation (codebook)
Pre-Class Self-Assessment
Basic Questions (⭐⭐)
Concept: What is MCAR? How does it differ from MAR?
Calculation: Given missing rate vector
[0.05, 0.30, 0.60, 0.10], which variables should be deleted?Programming: Use
pd.read_csv()to read a CSV file with Chinese characters and handle encoding issues.
Intermediate Questions (⭐⭐⭐)
Theory: Prove that under MCAR assumption, complete case analysis is unbiased.
Practice: Implement IQR method for outlier detection and compare with Z-score method.
Design: You have a student information table and a grades table. How would you merge them into a wide-format grade sheet?
Advanced Questions (⭐⭐⭐⭐)
Derivation: Derive the Jacobian of Box-Cox transformation and explain why likelihood function adjustment is needed.
Programming: Implement one iteration step of MICE algorithm.
Causal Inference: If high-income individuals have higher missing rates, how would simple deletion affect estimated return to education?
Answers in practice problems section of each chapter
Are You Ready?
Data cleaning and variable construction are the "fundamentals" of econometric research, and also the most easily overlooked aspects.
Angrist & Pischke (2009) Warning:
"Good empirical work is mostly about good data work. If you get the data right, the econometrics is often straightforward." "优秀的实证研究主要是优秀的数据工作。如果数据处理正确,计量方法往往水到渠成。"
By Mastering This Chapter, You Will Gain:
- Ability to handle any "dirty" data
- Skills to build high-quality analytical datasets
- Core competencies of a data scientist
- Avoidance of research disasters caused by data errors
- Nobel Prize-level data processing standards
Let's get started!
Chapter File List
module-3_Data and Vars/
├── 3.1-Chapter Introduction.md # This file ⭐ Enhanced version
├── 3.2-Data Import and Inspection.md # DataQualityChecker class
├── 3.3-Data Cleaning.md # DataCleaner class + Missing Data Theory
├── 3.4-Variable Construction.md # VariableConstructor class
├── 3.5-Data Transformation.md # TransformationPipeline class
├── 3.6-Merging and Reshaping.md # MergeValidator class
├── 3.7-Practical Case Study.md # Card & Krueger + 10 practice problems
└── data/ # Supporting datasets
├── wage_data_raw.csv
├── nlsy97_sample.dta
├── cfps2018_clean.csv
├── world_bank_panel.xlsx
├── card_krueger_1994.dta
└── codebooks/ # Data dictionariesEstimated Learning Time: 20-24 hours Difficulty Level: ⭐⭐⭐ (Basic in first half, advanced in second half) Practicality: ⭐⭐⭐⭐⭐ (Essential career skill)
Next Section: [3.2 - Data Import and Inspection](./3.2-Data Import and Inspection.md)
Start with correct data reading to build a solid foundation for causal inference!
Version Information:
- Initial version: 541 lines
- Enhanced version: 850+ lines
- Enhanced content:
- Added detailed Reinhart-Rogoff case analysis
- Complete Little & Rubin missing data theory (30+ formulas)
- Connection between Rubin causal model and data quality
- Heckman selection model derivation
- SUTVA assumption and data quality
- Mathematical derivation of Box-Cox transformation
- Complete competency matrix (beginner→expert)
- 4-week learning roadmap
- 5 supporting real datasets
- Authoritative citations (Nobel Prize papers)
- Pre-class self-assessment (basic→advanced)
Meets Nobel Prize Standards: Theoretical depth + practicality + complete code + real data