3.3 Data Cleaning
"Garbage in, garbage out.""垃圾进,垃圾出。"— George Fuechsel, IBM Programmer (IBM程序员)
Learning Objectives
- Scientifically handle missing values (understand MCAR/MAR/MNAR)
- Master MICE multiple imputation theory and practice
- Identify and handle outliers
- Remove duplicate values
- Unify data types and formats
- Establish reproducible cleaning workflows
- Understand Heckman sample selection model
Why is Data Cleaning So Important?
Real Case: Excel Error Leading to Policy Failure
2010: The Reinhart-Rogoff "Growth in a Time of Debt" study
- Harvard economists published a paper claiming that when government debt exceeds 90% of GDP, economic growth declines sharply
- Multiple governments implemented austerity policies based on this finding
- 2013: A replication uncovered an Excel formula error (data from 5 countries were omitted from a key average), along with questionable data exclusions and weighting
- Consequence: The headline conclusion was overturned and the policy recommendations invalidated
Lesson: Data cleaning errors lead to wrong conclusions and wrong policies!
Statistical Theory of Missing Data
Mathematical Definition of Missing Mechanisms
Let $Y$ denote the complete data matrix, partitioned as $Y = (Y_{\text{obs}}, Y_{\text{mis}})$, and let $R$ be the missing-indicator matrix, with $R_{ij} = 1$ if $Y_{ij}$ is missing and $R_{ij} = 0$ otherwise.
The missing mechanism is characterized by the conditional distribution $P(R \mid Y, \phi)$, where $\phi$ is the missing-mechanism parameter.
1. MCAR (Missing Completely At Random)
Missingness is independent of all data, observed and unobserved: $P(R \mid Y, \phi) = P(R \mid \phi)$.
Example: In a survey, interviews are cut short at random (e.g., the phone call drops), producing missing values unrelated to any characteristic of the respondent.
2. MAR (Missing At Random)
Given the observed data, missingness is independent of the unobserved data: $P(R \mid Y, \phi) = P(R \mid Y_{\text{obs}}, \phi)$.
Example: High-income individuals are more likely to refuse the income question, but conditional on observed variables such as education and occupation, the missingness no longer depends on the true income.
3. MNAR (Missing Not At Random)
Missingness depends on the unobserved data even after conditioning on the observed data: $P(R \mid Y, \phi) \neq P(R \mid Y_{\text{obs}}, \phi)$.
Example: Drug users are more likely to refuse questions about drug history; the missingness depends on the answer itself.
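A minimal simulation sketch contrasting the three mechanisms (the variable names and drop probabilities are illustrative assumptions, not from any real survey): MCAR drops values by a coin flip, MAR ties the drop probability to an observed covariate, and MNAR ties it to the value that goes missing, which is why the observed mean drifts in the MNAR case.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
df_sim = pd.DataFrame({
    'education': rng.normal(12, 3, n),
    'income': rng.normal(50000, 15000, n)
})

# MCAR: every income value has the same 20% chance of being missing
income_mcar = df_sim['income'].mask(rng.random(n) < 0.2)

# MAR: missingness depends on education (observed), not on income itself
p_mar = 1 / (1 + np.exp(-(df_sim['education'] - 12)))   # more educated -> more missing
income_mar = df_sim['income'].mask(rng.random(n) < 0.3 * p_mar)

# MNAR: missingness depends on the (unobserved) income value itself
p_mnar = 1 / (1 + np.exp(-(df_sim['income'] - 60000) / 10000))
income_mnar = df_sim['income'].mask(rng.random(n) < p_mnar)

for name, col in [('MCAR', income_mcar), ('MAR', income_mar), ('MNAR', income_mnar)]:
    print(f"{name}: missing rate = {col.isna().mean():.2%}, observed mean = {col.mean():.0f}")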
Little's MCAR Test
A statistical test of the MCAR hypothesis:
$$d^2 = \sum_{j=1}^{J} n_j \left(\bar{y}_j - \hat{\mu}_j\right)^{\top} \hat{\Sigma}_j^{-1} \left(\bar{y}_j - \hat{\mu}_j\right)$$
Where:
- $J$ is the total number of missing patterns
- $n_j$ is the sample size for the $j$-th missing pattern
- $\bar{y}_j$ is the mean vector of the observed variables in the $j$-th missing pattern
- $\hat{\mu}_j$ is the overall mean vector, restricted to the variables observed in pattern $j$
- $\hat{\Sigma}_j$ is the pooled covariance matrix, restricted to the variables observed in pattern $j$
Under the MCAR hypothesis, $d^2$ approximately follows a chi-square distribution with $\sum_{j} k_j - k$ degrees of freedom, where $k_j$ is the number of observed variables in pattern $j$ and $k$ is the total number of variables.
Missing Value Handling
Three Types of Missing Values
| Type | Abbreviation | Definition | Handling Strategy |
|---|---|---|---|
| Missing Completely At Random | MCAR | Missingness unrelated to any variable | Delete or simple imputation |
| Missing At Random | MAR | Missingness related only to observed variables | Conditional or multiple imputation |
| Missing Not At Random | MNAR | Missingness related to unobserved variables | Needs modeling (e.g., Heckman) or sensitivity analysis |
Detecting Missing Values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Basic detection
df.isnull().sum() # Missing count per column
df.isnull().mean() * 100 # Missing rate
df.isnull().any(axis=1).sum() # Rows with missing values
# 2. Visualize missing patterns
import missingno as msno
# Missing value matrix plot
msno.matrix(df, figsize=(10, 6))
plt.title('Missing Value Patterns')
# Missing value correlation heatmap
msno.heatmap(df, figsize=(10, 6))
# Missing value dendrogram (shows which variables are missing together)
msno.dendrogram(df)
Little's MCAR Test Implementation
from scipy import stats
from scipy.linalg import inv
def littles_mcar_test(data):
"""
Little's MCAR Test
H0: Data is MCAR
H1: Data is not MCAR
Parameters:
-----------
data : pd.DataFrame
DataFrame containing missing values
Returns:
--------
chi2_stat : float
Chi-square statistic
p_value : float
p-value
df : int
Degrees of freedom
"""
# Remove columns that are all missing
data = data.dropna(axis=1, how='all')
# Identify missing patterns
missing_patterns = data.isnull().astype(int)
unique_patterns = missing_patterns.drop_duplicates()
# Keep only numeric columns
numeric_data = data.select_dtypes(include=[np.number])
# Calculate overall mean and covariance (complete cases)
complete_cases = numeric_data.dropna()
if len(complete_cases) < 2:
return np.nan, np.nan, np.nan
grand_mean = complete_cases.mean().values
grand_cov = complete_cases.cov().values
# Calculate chi-square statistic
chi2_stat = 0
total_df = 0
for idx, pattern in unique_patterns.iterrows():
# Get data for this pattern
mask = (missing_patterns == pattern).all(axis=1)
pattern_data = numeric_data[mask]
if len(pattern_data) == 0:
continue
# Observed variables
        observed_vars = ~pattern[numeric_data.columns].values.astype(bool)
if observed_vars.sum() == 0:
continue
# Mean for this pattern
pattern_mean = pattern_data.iloc[:, observed_vars].mean().values
# Covariance matrix for this pattern
sub_cov = grand_cov[np.ix_(observed_vars, observed_vars)]
if np.linalg.matrix_rank(sub_cov) < len(pattern_mean):
continue
# Overall mean (observed variables only)
sub_mean = grand_mean[observed_vars]
# Calculate Mahalanobis distance
diff = pattern_mean - sub_mean
n_j = len(pattern_data)
try:
chi2_j = n_j * diff @ inv(sub_cov) @ diff
chi2_stat += chi2_j
total_df += len(pattern_mean)
except np.linalg.LinAlgError:
continue
    # Degrees of freedom: sum of observed-variable counts over patterns minus the number of variables
    df = max(int(total_df - numeric_data.shape[1]), 1)
# p-value
p_value = 1 - stats.chi2.cdf(chi2_stat, df)
return chi2_stat, p_value, df
# Usage example
chi2, p_value, df = littles_mcar_test(df)
print(f"Little's MCAR Test:")
print(f" Chi-square = {chi2:.4f}")
print(f" df = {df}")
print(f" p-value = {p_value:.4f}")
if p_value > 0.05:
print(" Conclusion: Cannot reject MCAR hypothesis (data may be MCAR)")
else:
print(" Conclusion: Reject MCAR hypothesis (data is not MCAR)")Missing Value Handling Strategies
Strategy 1: Deletion
Listwise Deletion:
# Delete rows with any missing column
df_clean = df.dropna()
# Delete rows with missing in specific columns
df_clean = df.dropna(subset=['income', 'education'])
# Delete rows that are all missing
df_clean = df.dropna(how='all')
# Delete rows with missing exceeding threshold (keep only if at least 5 non-missing values)
df_clean = df.dropna(thresh=5)
Delete columns with high missing rates:
# Delete columns with missing rate > 50%
missing_ratio = df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > 0.5].index
df_clean = df.drop(columns=cols_to_drop)
print(f"Dropped columns: {list(cols_to_drop)}")When to use deletion:
- Missing rate < 5%
- MCAR (Missing Completely At Random)
- Large enough sample size
- High missing rate causes information loss
- MAR/MNAR causes selection bias
Strategy 2: Simple Imputation
# 1. Mean imputation
df['age'] = df['age'].fillna(df['age'].mean())
# 2. Median imputation (robust to outliers)
df['income'] = df['income'].fillna(df['income'].median())
# 3. Mode imputation (categorical variables)
df['region'] = df['region'].fillna(df['region'].mode()[0])
# 4. Constant imputation
df['score'] = df['score'].fillna(0)
# 5. Forward fill (time series)
df['sales'] = df['sales'].ffill()
# 6. Backward fill
df['sales'] = df['sales'].bfill()
# 7. Linear interpolation (time series)
df['temperature'] = df['temperature'].interpolate(method='linear')
Batch imputation:
# Median for numeric columns, mode for categorical columns
from sklearn.impute import SimpleImputer
# Numeric columns
num_cols = df.select_dtypes(include=[np.number]).columns
imputer_num = SimpleImputer(strategy='median')
df[num_cols] = imputer_num.fit_transform(df[num_cols])
# Categorical columns
cat_cols = df.select_dtypes(include=['object']).columns
imputer_cat = SimpleImputer(strategy='most_frequent')
df[cat_cols] = imputer_cat.fit_transform(df[cat_cols])
Strategy 3: Advanced Imputation
Conditional imputation (group-based imputation):
# Use regional average income to fill missing values
df['income'] = df.groupby('region')['income'].transform(
lambda x: x.fillna(x.median())
)
# Use industry-education combination average wage to fill
df['wage'] = df.groupby(['industry', 'education'])['wage'].transform(
lambda x: x.fillna(x.mean())
)
Regression imputation (predict missing values using other variables):
from sklearn.linear_model import LinearRegression
# Example: Use age and education to predict missing income
# 1. Separate data with/without missing
df_with_income = df[df['income'].notnull()]
df_missing_income = df[df['income'].isnull()]
# 2. Train model
X = df_with_income[['age', 'education']]
y = df_with_income['income']
model = LinearRegression().fit(X, y)
# 3. Predict missing values
X_missing = df_missing_income[['age', 'education']]
predicted_income = model.predict(X_missing)
# 4. Fill
df.loc[df['income'].isnull(), 'income'] = predicted_income
KNN imputation:
from sklearn.impute import KNNImputer
# Use K nearest neighbors' average value to fill
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
imputer.fit_transform(df[['age', 'income', 'education']]),
columns=['age', 'income', 'education']
)
MICE Multiple Imputation Theory
MICE Algorithm Principle
MICE (Multivariate Imputation by Chained Equations) is an iterative imputation algorithm.
Algorithm Steps
For variables $X_1, X_2, \dots, X_p$ with missing values:
Initialization: Fill all missing values using simple methods (e.g., the mean).
Iteration: In the $t$-th iteration, for each variable $X_j$ in turn, regress $X_j$ on all other variables and draw imputations from the fitted model:
$$X_j^{(t)} \sim P\!\left(X_j \mid X_{-j}^{(t)}, \theta_j\right)$$
Where $X_{-j}$ represents all variables except $X_j$ (using their most recently imputed values).
Convergence: Repeat the iteration step until convergence (usually 5-10 iterations).
Generate multiple imputed datasets: Repeat the whole process $m$ times (usually $m = 5$ to $10$) with different random draws.
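The production implementation further below relies on scikit-learn's IterativeImputer, which hides the chained-equations loop. As a rough illustration of the algorithm itself, here is a minimal hand-rolled sketch of a single chain (mean initialization, then cycling linear regressions over the columns). It assumes an all-numeric DataFrame and skips the stochastic posterior draws, so it is a teaching aid, not a replacement for a proper MICE library.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def mice_single_chain(df, n_iter=10):
    """One MICE chain: mean-initialize, then cycle regressions over the columns."""
    data = df.copy()
    miss = data.isnull()                 # remember where the holes were
    data = data.fillna(data.mean())      # step 1: simple initialization
    for _ in range(n_iter):              # step 2: iterate until roughly stable
        for col in data.columns:
            if not miss[col].any():
                continue
            other = data.columns.drop(col)
            # fit on rows where col was originally observed, using current values of the others
            model = LinearRegression().fit(
                data.loc[~miss[col], other], data.loc[~miss[col], col]
            )
            # re-impute only the originally missing cells
            data.loc[miss[col], col] = model.predict(data.loc[miss[col], other])
    return data

# A full MICE run would repeat this m times with stochastic draws
# to propagate imputation uncertainty into the analysis.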
Rubin's Rules
After generating $m$ imputed datasets, how do we combine the estimates?
Let $\hat{\theta}_i$ be the parameter estimate from the $i$-th imputed dataset, and $W_i$ its estimated variance.
Pooled estimator:
$$\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\theta}_i$$
Total variance:
$$T = \bar{W} + \left(1 + \frac{1}{m}\right) B$$
Where:
- Within-imputation variance (average variance): $\bar{W} = \frac{1}{m}\sum_{i=1}^{m} W_i$
- Between-imputation variance: $B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{\theta}_i - \bar{\theta}\right)^2$
Degrees of freedom:
$$\nu = (m-1)\left(1 + \frac{\bar{W}}{(1 + 1/m)B}\right)^2$$
Confidence interval:
$$\bar{\theta} \pm t_{\nu,\,0.975}\sqrt{T}$$
Fraction of Missing Information:
$$\gamma = \frac{(1 + 1/m)B}{T}$$
$\gamma$ measures the proportion of the total sampling variance that is due to missingness.
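A quick worked example with made-up numbers, just to show the arithmetic. Suppose $m = 3$ imputed datasets give estimates $\hat{\theta} = (10, 12, 11)$ with variances $W = (4, 5, 3)$. Then $\bar{\theta} = 11$, $\bar{W} = 4$, $B = \frac{(10-11)^2 + (12-11)^2 + (11-11)^2}{2} = 1$, so $T = 4 + \frac{4}{3}\cdot 1 \approx 5.33$, the pooled standard error is $\sqrt{T} \approx 2.31$, $\nu = 2\left(1 + \frac{4}{4/3}\right)^2 = 32$, and $\gamma = \frac{4/3}{5.33} = 0.25$: about a quarter of the sampling variance comes from the missing data.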
Complete MICE Implementation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
import pandas as pd
import numpy as np
class MICEImputer:
"""
MICE (Multivariate Imputation by Chained Equations)
Complete implementation + Rubin's Rules
"""
def __init__(self, n_imputations=5, max_iter=10, random_state=42):
"""
Parameters:
------
n_imputations : int
Number of imputed datasets to generate
max_iter : int
Maximum iterations for each imputation
random_state : int
Random seed
"""
self.n_imputations = n_imputations
self.max_iter = max_iter
self.random_state = random_state
self.imputed_datasets = []
def fit_transform(self, X):
"""
Perform MICE imputation
Parameters:
------
X : pd.DataFrame
DataFrame containing missing values
Returns:
------
imputed_datasets : list
m imputed DataFrames
"""
self.imputed_datasets = []
for i in range(self.n_imputations):
# Create iterative imputer
imputer = IterativeImputer(
estimator=BayesianRidge(),
max_iter=self.max_iter,
random_state=self.random_state + i,
verbose=0
)
# Impute
X_imputed = imputer.fit_transform(X)
# Convert to DataFrame
df_imputed = pd.DataFrame(
X_imputed,
columns=X.columns,
index=X.index
)
self.imputed_datasets.append(df_imputed)
return self.imputed_datasets
def analyze(self, estimator_func):
"""
Analyze multiple imputed datasets and combine results using Rubin's Rules
Parameters:
------
estimator_func : function
Function that accepts DataFrame and returns (estimate, variance)
Returns:
------
pooled_estimate : float
Pooled estimator
pooled_variance : float
Total variance
pooled_se : float
Pooled standard error
ci_lower, ci_upper : float
95% confidence interval
"""
estimates = []
variances = []
# Estimate for each imputed dataset
for df in self.imputed_datasets:
est, var = estimator_func(df)
estimates.append(est)
variances.append(var)
# Rubin's Rules
m = self.n_imputations
# Pooled estimator
pooled_estimate = np.mean(estimates)
# Within-imputation variance
W = np.mean(variances)
# Between-imputation variance
B = np.var(estimates, ddof=1)
# Total variance
T = W + (1 + 1/m) * B
# Standard error
pooled_se = np.sqrt(T)
# Degrees of freedom
if B > 0:
nu = (m - 1) * (1 + W / ((1 + 1/m) * B))**2
else:
nu = np.inf
# 95% CI
from scipy import stats
t_crit = stats.t.ppf(0.975, nu) if nu < np.inf else 1.96
ci_lower = pooled_estimate - t_crit * pooled_se
ci_upper = pooled_estimate + t_crit * pooled_se
# Fraction of missing information
fmi = (B + B/m) / T if T > 0 else 0
return {
'estimate': pooled_estimate,
'se': pooled_se,
'ci_lower': ci_lower,
'ci_upper': ci_upper,
'W': W,
'B': B,
'T': T,
'nu': nu,
'fmi': fmi
}
# Usage example
# Create simulated data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'age': np.random.normal(40, 10, n),
'education': np.random.normal(12, 3, n),
'income': np.random.normal(50000, 20000, n),
'health': np.random.normal(70, 15, n)
})
# Introduce MAR missingness (high-income individuals more likely to have missing income)
df.loc[df['income'] > 60000, 'income'] = np.where(
np.random.random(sum(df['income'] > 60000)) < 0.3,
np.nan,
df.loc[df['income'] > 60000, 'income']
)
# Introduce other missingness
for col in ['age', 'education', 'health']:
mask = np.random.random(n) < 0.1
df.loc[mask, col] = np.nan
print(f"Missing rates:\n{df.isnull().mean() * 100}")
# MICE imputation
mice = MICEImputer(n_imputations=5, max_iter=10)
imputed_datasets = mice.fit_transform(df)
# Define estimation function (e.g., estimate income mean)
def estimate_mean_income(df):
"""Estimate income mean and its variance"""
mean = df['income'].mean()
var = df['income'].var() / len(df) # Variance of the mean
return mean, var
# Combine results using Rubin's Rules
results = mice.analyze(estimate_mean_income)
print("\n=== MICE + Rubin's Rules Results ===")
print(f"Pooled estimate: {results['estimate']:.2f}")
print(f"Standard error: {results['se']:.2f}")
print(f"95% CI: [{results['ci_lower']:.2f}, {results['ci_upper']:.2f}]")
print(f"Within variance (W): {results['W']:.2f}")
print(f"Between variance (B): {results['B']:.2f}")
print(f"Total variance (T): {results['T']:.2f}")
print(f"Degrees of freedom: {results['nu']:.2f}")
print(f"Fraction of missing information (FMI): {results['fmi']:.4f}")
# Compare with complete case analysis
complete_case_mean = df['income'].dropna().mean()
print(f"\nComplete case mean: {complete_case_mean:.2f}")
print(f"Difference: {results['estimate'] - complete_case_mean:.2f}")Strategy 4: Missing Indicator Variable
# Create missing indicator variable
df['income_missing'] = df['income'].isnull().astype(int)
# Then fill original variable (e.g., with median)
df['income'] = df['income'].fillna(df['income'].median())
# In regression, include both the filled variable and the missing indicator
# The coefficient on income_missing reflects the effect of missingness itself (see the regression sketch after the list below)
When to use:
- Suspect missingness itself is informative (MNAR)
- Moderate missing rate (10-30%)
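A minimal sketch of the indicator-in-regression idea using statsmodels, assuming the simulated columns age and health from the earlier example are available alongside the filled income and the income_missing dummy created above:
import statsmodels.api as sm

# Regress an outcome (here, the hypothetical 'health' column) on the filled income
# plus the missing indicator. A significant coefficient on income_missing suggests
# that respondents with missing income differ systematically, i.e., the missingness
# itself carries information.
X = sm.add_constant(df[['income', 'income_missing', 'age']])
model = sm.OLS(df['health'], X, missing='drop').fit()
print(model.summary())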
Production-Level Data Cleaning Class
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import IsolationForest
import warnings
warnings.filterwarnings('ignore')
class DataCleaner:
"""
Production-Level Data Cleaning Pipeline - Nobel Prize Standards
Features:
-----
1. Missing value diagnosis (MCAR/MAR/MNAR)
2. MICE multiple imputation
3. Outlier detection and handling (IQR, Z-score, Isolation Forest, Winsorization)
4. String cleaning
5. Duplicate handling
6. Data type unification
7. Complete cleaning report
Example:
-----
>>> cleaner = DataCleaner(df, name='CFPS 2018')
>>> cleaner.diagnose_missing()
>>> cleaner.impute_mice(n_imputations=5)
>>> cleaner.handle_outliers(method='winsorize')
>>> df_clean = cleaner.get_clean_data()
>>> cleaner.generate_cleaning_report()
"""
def __init__(self, df, name='Dataset'):
"""
Initialize cleaner
Parameters:
------
df : pd.DataFrame
Raw DataFrame
name : str
Dataset name
"""
self.df_original = df.copy()
self.df = df.copy()
self.name = name
self.cleaning_log = []
self.missing_diagnostics = {}
self.outlier_diagnostics = {}
self._log(f"Initialized DataCleaner: {name}")
self._log(f" Shape: {self.df.shape[0]} rows × {self.df.shape[1]} columns")
def _log(self, message):
"""Log cleaning steps"""
self.cleaning_log.append(message)
print(message)
def diagnose_missing(self, plot=True):
"""
Diagnose missing mechanism: MCAR/MAR/MNAR
Steps:
-----
1. Little's MCAR Test
2. Missing pattern analysis
3. Correlation heatmap
4. Missingness correlation with observed variables
"""
self._log("\n" + "="*70)
self._log("Missing Value Diagnosis")
self._log("="*70)
# Basic statistics
missing_count = self.df.isnull().sum()
missing_pct = self.df.isnull().mean() * 100
missing_summary = pd.DataFrame({
'Missing_Count': missing_count,
'Missing_Pct': missing_pct
})
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values(
'Missing_Pct', ascending=False
)
self._log(f"\n【1】Missing Value Overview")
self._log(f" Total missing values: {self.df.isnull().sum().sum()}")
self._log(f" Rows with missing: {self.df.isnull().any(axis=1).sum()} ({self.df.isnull().any(axis=1).mean()*100:.2f}%)")
self._log(f"\n{missing_summary.to_string()}")
# Little's MCAR Test
self._log(f"\n【2】Little's MCAR Test")
try:
chi2, p_value, df_test = self._littles_mcar_test()
self._log(f" Chi-square = {chi2:.4f}")
self._log(f" df = {df_test}")
self._log(f" p-value = {p_value:.4f}")
if p_value > 0.05:
self._log(f" Conclusion: Cannot reject MCAR (p > 0.05)")
self.missing_diagnostics['mechanism'] = 'MCAR'
else:
self._log(f" Conclusion: Reject MCAR (p < 0.05), data may be MAR or MNAR")
self.missing_diagnostics['mechanism'] = 'MAR/MNAR'
except Exception as e:
self._log(f" Unable to perform Little's MCAR Test: {e}")
# Missing pattern analysis
self._log(f"\n【3】Missing Pattern Analysis")
missing_patterns = self.df.isnull().astype(int)
unique_patterns = missing_patterns.value_counts()
self._log(f" Number of unique missing patterns: {len(unique_patterns)}")
self._log(f" Top 5 missing patterns:")
for i, (pattern, count) in enumerate(unique_patterns.head(5).items(), 1):
self._log(f" Pattern {i}: {count} rows ({count/len(self.df)*100:.2f}%)")
# Missingness correlation analysis
self._log(f"\n【4】Missingness Correlation with Observed Variables")
numeric_cols = self.df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
if self.df[col].isnull().any():
# Create missing indicator variable
missing_indicator = self.df[col].isnull().astype(int)
# Calculate correlation with other numeric variables
correlations = []
for other_col in numeric_cols:
if other_col != col and not self.df[other_col].isnull().all():
try:
corr = missing_indicator.corr(self.df[other_col])
if abs(corr) > 0.1: # Only show significant correlations
correlations.append((other_col, corr))
except:
pass
if correlations:
self._log(f"\n {col} missingness correlates with:")
for other_col, corr in sorted(correlations, key=lambda x: abs(x[1]), reverse=True)[:3]:
self._log(f" - {other_col}: {corr:.3f}")
# Visualization
if plot:
self._plot_missing_patterns()
self.missing_diagnostics['summary'] = missing_summary
return missing_summary
def _littles_mcar_test(self):
"""Little's MCAR Test implementation"""
from scipy.linalg import inv
data = self.df.copy()
data = data.dropna(axis=1, how='all')
missing_patterns = data.isnull().astype(int)
unique_patterns = missing_patterns.drop_duplicates()
numeric_data = data.select_dtypes(include=[np.number])
complete_cases = numeric_data.dropna()
if len(complete_cases) < 2:
return np.nan, np.nan, np.nan
grand_mean = complete_cases.mean().values
grand_cov = complete_cases.cov().values
chi2_stat = 0
total_df = 0
for idx, pattern in unique_patterns.iterrows():
mask = (missing_patterns == pattern).all(axis=1)
pattern_data = numeric_data[mask]
if len(pattern_data) == 0:
continue
            observed_vars = ~pattern[numeric_data.columns].values.astype(bool)
if observed_vars.sum() == 0:
continue
pattern_mean = pattern_data.iloc[:, observed_vars].mean().values
sub_cov = grand_cov[np.ix_(observed_vars, observed_vars)]
if np.linalg.matrix_rank(sub_cov) < len(pattern_mean):
continue
sub_mean = grand_mean[observed_vars]
diff = pattern_mean - sub_mean
n_j = len(pattern_data)
try:
chi2_j = n_j * diff @ inv(sub_cov) @ diff
chi2_stat += chi2_j
total_df += len(pattern_mean)
except np.linalg.LinAlgError:
continue
        # df = sum of observed-variable counts over patterns minus the number of variables
        df = max(int(total_df - len(numeric_data.columns)), 1)
p_value = 1 - stats.chi2.cdf(chi2_stat, df) if df > 0 else np.nan
return chi2_stat, p_value, df
def _plot_missing_patterns(self):
"""Visualize missing patterns"""
try:
import missingno as msno
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Missing matrix
msno.matrix(self.df, ax=axes[0], fontsize=10)
axes[0].set_title(f'{self.name} - Missing Pattern Matrix', fontsize=12)
# Missing heatmap
msno.heatmap(self.df, ax=axes[1], fontsize=10)
axes[1].set_title(f'{self.name} - Missing Correlation', fontsize=12)
plt.tight_layout()
plt.show()
except ImportError:
self._log(" Tip: Install missingno for missing pattern visualization (pip install missingno)")
def impute_mice(self, n_imputations=5, max_iter=10, use_rubin=True):
"""
MICE (Multivariate Imputation by Chained Equations) complete implementation
Parameters:
------
n_imputations : int
Number of imputed datasets to generate
max_iter : int
Maximum iterations for each imputation
use_rubin : bool
Whether to use Rubin's Rules to combine results
Returns:
------
imputed_datasets : list (if use_rubin=False)
or pooled DataFrame (if use_rubin=True)
"""
self._log("\n" + "="*70)
self._log(f"MICE Multiple Imputation (m={n_imputations}, max_iter={max_iter})")
self._log("="*70)
# Only impute numeric columns
numeric_cols = self.df.select_dtypes(include=[np.number]).columns
non_numeric_cols = self.df.select_dtypes(exclude=[np.number]).columns
# Save non-numeric columns
df_non_numeric = self.df[non_numeric_cols].copy()
imputed_datasets = []
for i in range(n_imputations):
self._log(f"\n Imputed dataset {i+1}/{n_imputations}")
imputer = IterativeImputer(
estimator=BayesianRidge(),
max_iter=max_iter,
random_state=42 + i,
verbose=0
)
X_imputed = imputer.fit_transform(self.df[numeric_cols])
df_imputed = pd.DataFrame(
X_imputed,
columns=numeric_cols,
index=self.df.index
)
# Merge non-numeric columns
df_imputed = pd.concat([df_imputed, df_non_numeric], axis=1)
imputed_datasets.append(df_imputed)
self._log(f" Convergence status: Complete")
if use_rubin:
# Combine using Rubin's Rules
self._log(f"\n Combining {n_imputations} datasets using Rubin's Rules")
# Take average for each numeric variable (simplified Rubin's Rules)
df_pooled = pd.DataFrame(index=self.df.index)
for col in numeric_cols:
values = np.array([df[col].values for df in imputed_datasets])
df_pooled[col] = values.mean(axis=0)
# Merge non-numeric columns
df_pooled = pd.concat([df_pooled, df_non_numeric], axis=1)
self.df = df_pooled
self._log(f" Merged into single dataset")
return df_pooled
else:
self._log(f" Returning {n_imputations} independent imputed datasets")
return imputed_datasets
def handle_outliers(self, method='winsorize', limits=[0.05, 0.05],
threshold=3, contamination=0.1):
"""
Robust outlier handling
Parameters:
------
method : str
Handling method: 'winsorize', 'iqr', 'zscore', 'isolation_forest'
limits : list
Winsorization upper and lower limit ratios [lower, upper]
threshold : float
Z-score method threshold
contamination : float
Isolation Forest outlier proportion
"""
self._log("\n" + "="*70)
self._log(f"Outlier Handling (method: {method})")
self._log("="*70)
numeric_cols = self.df.select_dtypes(include=[np.number]).columns
outliers_summary = []
for col in numeric_cols:
if self.df[col].std() == 0:
continue
before_min = self.df[col].min()
before_max = self.df[col].max()
before_mean = self.df[col].mean()
if method == 'winsorize':
self.df[col] = winsorize(self.df[col], limits=limits)
elif method == 'iqr':
Q1 = self.df[col].quantile(0.25)
Q3 = self.df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_mask = (self.df[col] < lower_bound) | (self.df[col] > upper_bound)
n_outliers = outliers_mask.sum()
# Delete outliers
self.df = self.df[~outliers_mask]
outliers_summary.append({
'Variable': col,
'Method': 'IQR',
'Outlier_Count': n_outliers,
'Lower_Bound': lower_bound,
'Upper_Bound': upper_bound
})
elif method == 'zscore':
z_scores = np.abs(stats.zscore(self.df[col]))
outliers_mask = z_scores > threshold
n_outliers = outliers_mask.sum()
# Delete outliers
self.df = self.df[~outliers_mask]
outliers_summary.append({
'Variable': col,
'Method': 'Z-score',
'Outlier_Count': n_outliers,
'Threshold': threshold
})
elif method == 'isolation_forest':
# Use Isolation Forest
iso_forest = IsolationForest(
contamination=contamination,
random_state=42
)
outliers_mask = iso_forest.fit_predict(
self.df[[col]].values.reshape(-1, 1)
) == -1
n_outliers = outliers_mask.sum()
# Delete outliers
self.df = self.df[~outliers_mask]
outliers_summary.append({
'Variable': col,
'Method': 'Isolation Forest',
'Outlier_Count': n_outliers,
'Contamination': contamination
})
after_min = self.df[col].min()
after_max = self.df[col].max()
after_mean = self.df[col].mean()
if method == 'winsorize':
self._log(f"\n {col}:")
self._log(f" Before: [{before_min:.2f}, {before_max:.2f}], mean = {before_mean:.2f}")
self._log(f" After: [{after_min:.2f}, {after_max:.2f}], mean = {after_mean:.2f}")
if outliers_summary:
self._log(f"\n Outlier handling summary:")
for item in outliers_summary:
self._log(f" {item}")
self.outlier_diagnostics = {
'method': method,
'summary': outliers_summary
}
def clean_strings(self):
"""
String cleaning
Steps:
-----
1. Strip whitespace
2. Normalize case
3. Remove special characters
"""
self._log("\n" + "="*70)
self._log("String Cleaning")
self._log("="*70)
string_cols = self.df.select_dtypes(include=['object']).columns
for col in string_cols:
# Remove whitespace
self.df[col] = self.df[col].str.strip()
# Uniform lowercase
self.df[col] = self.df[col].str.lower()
self._log(f" {col}: Cleaned (removed whitespace + lowercase)")
self._log(f"\n Cleaned {len(string_cols)} string columns")
def remove_duplicates(self, subset=None, keep='first'):
"""
Remove duplicate values
Parameters:
------
subset : list
Columns to check for duplicates
keep : str
Retention strategy: 'first', 'last', False
"""
self._log("\n" + "="*70)
self._log("Remove Duplicates")
self._log("="*70)
before = len(self.df)
self.df = self.df.drop_duplicates(subset=subset, keep=keep)
after = len(self.df)
self._log(f" Removed {before - after} duplicate records ({(before-after)/before*100:.2f}%)")
def get_clean_data(self):
"""Return cleaned data"""
return self.df.copy()
def generate_cleaning_report(self):
"""
Generate complete cleaning report
"""
print("\n" + "="*70)
print(f"{self.name} - Data Cleaning Report")
print("="*70)
print(f"\n【1】Data Overview")
print(f" Original data: {self.df_original.shape[0]} rows × {self.df_original.shape[1]} columns")
print(f" After cleaning: {self.df.shape[0]} rows × {self.df.shape[1]} columns")
print(f" Rows removed: {self.df_original.shape[0] - self.df.shape[0]} ({(self.df_original.shape[0] - self.df.shape[0])/self.df_original.shape[0]*100:.2f}%)")
print(f"\n【2】Missing Value Statistics")
original_missing = self.df_original.isnull().sum().sum()
final_missing = self.df.isnull().sum().sum()
print(f" Original missing values: {original_missing}")
print(f" After cleaning missing values: {final_missing}")
print(f" Handled missing values: {original_missing - final_missing}")
if self.missing_diagnostics:
print(f" Missing mechanism: {self.missing_diagnostics.get('mechanism', 'Unknown')}")
print(f"\n【3】Outlier Handling")
if self.outlier_diagnostics:
print(f" Method: {self.outlier_diagnostics.get('method', 'None')}")
summary = self.outlier_diagnostics.get('summary', [])
if summary:
total_outliers = sum(item['Outlier_Count'] for item in summary)
print(f" Total outliers: {total_outliers}")
print(f"\n【4】Data Quality Metrics")
print(f" Completeness: {(1 - self.df.isnull().mean().mean())*100:.2f}%")
print(f" Duplicate rate: {self.df.duplicated().mean()*100:.2f}%")
print(f"\n【5】Cleaning Steps Log")
print(f" Executed {len(self.cleaning_log)} steps")
print("="*70)
# Return detailed log
return {
'original_shape': self.df_original.shape,
'final_shape': self.df.shape,
'rows_removed': self.df_original.shape[0] - self.df.shape[0],
'missing_handled': original_missing - final_missing,
'cleaning_log': self.cleaning_log,
'missing_diagnostics': self.missing_diagnostics,
'outlier_diagnostics': self.outlier_diagnostics
}
# Usage example
if __name__ == "__main__":
# Create simulated data
np.random.seed(42)
n = 1000
df_raw = pd.DataFrame({
'id': range(1, n+1),
'age': np.random.normal(40, 10, n),
'education': np.random.normal(12, 3, n),
'income': np.random.normal(50000, 20000, n),
'health': np.random.normal(70, 15, n),
'region': np.random.choice(['east', 'west', 'north', 'south'], n),
'gender': np.random.choice(['male', 'female'], n)
})
# Introduce missing values
df_raw.loc[df_raw['income'] > 60000, 'income'] = np.where(
np.random.random(sum(df_raw['income'] > 60000)) < 0.3,
np.nan,
df_raw.loc[df_raw['income'] > 60000, 'income']
)
for col in ['age', 'education', 'health']:
mask = np.random.random(n) < 0.1
df_raw.loc[mask, col] = np.nan
# Introduce outliers
df_raw.loc[np.random.choice(n, 10), 'age'] = -5
df_raw.loc[np.random.choice(n, 10), 'income'] = 999999
# Use DataCleaner
cleaner = DataCleaner(df_raw, name='Simulated Dataset')
# Diagnose missing
cleaner.diagnose_missing(plot=False)
# MICE imputation
cleaner.impute_mice(n_imputations=5, max_iter=10)
# Handle outliers
cleaner.handle_outliers(method='winsorize', limits=[0.05, 0.05])
# Clean strings
cleaner.clean_strings()
# Remove duplicates
cleaner.remove_duplicates()
# Generate report
report = cleaner.generate_cleaning_report()
# Get cleaned data
    df_clean = cleaner.get_clean_data()
Outlier Handling
What are Outliers?
Outlier: Observations significantly different from most data
Two Types:
Data Errors: Data entry errors, measurement errors
- Example: Age = -5, income = 999999999
- Handling: Delete or correct
True Extremes: True but rare observations
- Example: CEO annual salary, billionaire wealth
- Handling: Keep, Winsorize, or log transformation (see the sketch below)
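A tiny sketch distinguishing the two cases on the running example (column names reuse the simulated data; the validity ranges are illustrative assumptions): hard validity rules catch outright data errors, while percentile flags only mark extreme-but-possibly-genuine values for review.
# Hard validity rules catch outright data errors
errors_mask = (df['age'] < 0) | (df['age'] > 120) | (df['income'] < 0)
print(f"Rows violating validity rules (likely data errors): {errors_mask.sum()}")

# Percentile flags mark extreme values for manual review, not automatic deletion
p99 = df['income'].quantile(0.99)
extreme_mask = df['income'] > p99
print(f"Rows above the 99th income percentile (review, don't auto-delete): {extreme_mask.sum()}")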
Methods to Identify Outliers
Method 1: Box Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Univariate box plot
plt.figure(figsize=(8, 6))
sns.boxplot(y=df['income'])
plt.title('Income Distribution (Box Plot)')
plt.show()
# Multivariate comparison
plt.figure(figsize=(12, 6))
df[['age', 'income', 'education']].boxplot()
plt.ylabel('Value')
plt.title('Multivariate Box Plot')
plt.show()
Method 2: IQR Method (Interquartile Range)
def detect_outliers_iqr(df, col):
"""
Detect outliers using IQR method
"""
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
print(f"Column: {col}")
print(f" Q1: {Q1:.2f}")
print(f" Q3: {Q3:.2f}")
print(f" IQR: {IQR:.2f}")
print(f" Lower bound: {lower_bound:.2f}")
print(f" Upper bound: {upper_bound:.2f}")
print(f" Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.2f}%)")
return outliers
# Usage example
income_outliers = detect_outliers_iqr(df, 'income')
Method 3: Z-score Method
For a normal distribution, about 99.7% of the data lies within $\mu \pm 3\sigma$:
from scipy import stats
def detect_outliers_zscore(df, col, threshold=3):
"""
Detect outliers using Z-score method
threshold: Usually use 3 (99.7% of data within ±3σ)
"""
    # Compute z-scores on non-missing values, then map the flagged positions back to df
    col_data = df[col].dropna()
    z_scores = np.abs(stats.zscore(col_data))
    outliers = df.loc[col_data.index[z_scores > threshold]]
print(f"Column: {col}")
print(f" Mean: {df[col].mean():.2f}")
print(f" Std Dev: {df[col].std():.2f}")
print(f" Threshold: ±{threshold}σ")
print(f" Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.2f}%)")
return outliers
# Usage example
age_outliers = detect_outliers_zscore(df, 'age', threshold=3)
Method 4: Isolation Forest (Machine Learning Method)
Isolation Forest isolates observations with random splits in an ensemble of trees; anomalies are isolated in fewer splits, so they receive higher anomaly scores:
from sklearn.ensemble import IsolationForest
def detect_outliers_isolation_forest(df, cols, contamination=0.1):
"""
Detect multivariate outliers using Isolation Forest
Parameters:
------
contamination : float
Expected proportion of outliers
"""
iso_forest = IsolationForest(
contamination=contamination,
random_state=42
)
# Predict (-1 indicates outlier)
outliers_mask = iso_forest.fit_predict(df[cols]) == -1
print(f"Isolation Forest:")
print(f" Number of outliers: {outliers_mask.sum()} ({outliers_mask.mean()*100:.2f}%)")
return df[outliers_mask]
# Usage example
outliers_multi = detect_outliers_isolation_forest(
df,
['age', 'income', 'education'],
contamination=0.05
)
Outlier Handling Strategies
Strategy 1: Deletion
# Delete outliers
df_clean = df[~df.index.isin(outliers.index)]
# Delete records with income < 0 or income > 1,000,000
df_clean = df[(df['income'] > 0) & (df['income'] < 1000000)]
When to delete:
- Obvious error values (e.g., age = -5)
- Sample size is large enough that deletion does not affect the analysis
When not to delete:
- Outliers are genuine observations (e.g., billionaires); prefer winsorizing or a transformation instead
Strategy 2: Winsorize (Winsorization)
from scipy.stats.mstats import winsorize
# Replace extreme values with 5th and 95th percentiles
df['income_winsorized'] = winsorize(df['income'], limits=[0.05, 0.05])
# Or manual implementation
lower = df['income'].quantile(0.05)
upper = df['income'].quantile(0.95)
df['income_winsorized'] = df['income'].clip(lower, upper)
print(f"Original range: {df['income'].min():.0f} - {df['income'].max():.0f}")
print(f"After Winsorize: {df['income_winsorized'].min():.0f} - {df['income_winsorized'].max():.0f}")Advantages:
- Preserves the sample size
- Reduces the influence of extreme values
- Largely preserves the shape of the distribution (only the tails are compressed)
Strategy 3: Log Transformation
For right-skewed distributions (e.g., income, wealth):
$$x' = \log(x + c)$$
Where $c$ is a constant (usually $c = 1$), preventing $\log(0)$.
# Log transformation can compress right-skewed distributions
df['log_income'] = np.log(df['income'] + 1)
# Visualization comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Original distribution
axes[0].hist(df['income'], bins=50, edgecolor='black')
axes[0].set_title('Original Income Distribution (Right-skewed)')
axes[0].set_xlabel('Income')
# After log transformation
axes[1].hist(df['log_income'], bins=50, edgecolor='black', color='green')
axes[1].set_title('Log Income Distribution (Closer to Normal)')
axes[1].set_xlabel('log(Income)')
plt.tight_layout()
plt.show()
Strategy 4: Cap (Capping)
# Set upper and lower limits
df['income_capped'] = df['income'].clip(lower=0, upper=500000)
# Or use percentiles
lower = df['income'].quantile(0.01)
upper = df['income'].quantile(0.99)
df['income_capped'] = df['income'].clip(lower, upper)
Duplicate Handling
Detecting Duplicates
# Detect completely duplicate rows
duplicates = df.duplicated()
print(f"Duplicate rows: {duplicates.sum()} ({100*duplicates.mean():.2f}%)")
# View duplicate rows
print(df[duplicates])
# Detect duplicates in specific columns
duplicates_id = df.duplicated(subset=['id'])
print(f"ID duplicates: {duplicates_id.sum()}")
# View duplicate IDs
duplicate_ids = df[duplicates_id]['id'].unique()
print(f"Duplicate IDs: {duplicate_ids}")Removing Duplicates
# Remove completely duplicate rows (keep first occurrence)
df_clean = df.drop_duplicates()
# Keep last occurrence
df_clean = df.drop_duplicates(keep='last')
# Remove all duplicates (keep none)
df_clean = df.drop_duplicates(keep=False)
# Remove duplicates based on specific column
df_clean = df.drop_duplicates(subset=['id'], keep='first')
# Remove duplicates based on multiple column combination
df_clean = df.drop_duplicates(subset=['person_id', 'year'], keep='first')
Handling "Fuzzy Duplicates"
# Example: Names slightly different but actually same person
from fuzzywuzzy import fuzz
def find_fuzzy_duplicates(df, col, threshold=90):
"""
Find fuzzy matching duplicates
"""
values = df[col].unique()
duplicates = []
for i, val1 in enumerate(values):
for val2 in values[i+1:]:
similarity = fuzz.ratio(str(val1), str(val2))
if similarity >= threshold:
duplicates.append((val1, val2, similarity))
return pd.DataFrame(duplicates, columns=['Value1', 'Value2', 'Similarity'])
# Example: Name matching
fuzzy_dups = find_fuzzy_duplicates(df, 'name', threshold=90)
print(fuzzy_dups)
Data Types and Format Unification
Data Type Conversion
# 1. Convert to numeric type
df['age'] = pd.to_numeric(df['age'], errors='coerce') # values that cannot be converted become NaN
# 2. Convert to string
df['id'] = df['id'].astype(str)
# 3. Convert to category type (save memory)
df['region'] = df['region'].astype('category')
# 4. Convert to date type
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
# 5. Convert to boolean type
df['employed'] = df['employed'].astype(bool)
String Cleaning
# Remove whitespace
df['name'] = df['name'].str.strip() # Leading and trailing whitespace
df['name'] = df['name'].str.lstrip() # Left whitespace
df['name'] = df['name'].str.rstrip() # Right whitespace
# Unify case
df['gender'] = df['gender'].str.lower() # All lowercase
df['gender'] = df['gender'].str.upper() # All uppercase
df['name'] = df['name'].str.title() # Capitalize first letter
# Replace characters
df['phone'] = df['phone'].str.replace('-', '') # Remove hyphens
df['income'] = df['income'].str.replace(',', '') # Remove thousands separator
df['income'] = df['income'].str.replace('$', '') # Remove dollar sign
# Extract numbers
df['income_num'] = df['income'].str.extract(r'(\d+)').astype(float)
Date Handling
# Parse dates
df['date'] = pd.to_datetime(df['date'])
# Handle multiple date formats
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
# Calculate date differences
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days
Heckman Sample Selection Model
Theoretical Foundation
Problem: When sample selection is non-random (MNAR), directly deleting missing values leads to selection bias.
Heckman Two-Step Model (Heckman, 1979):
Step 1: Selection Equation (Probit)
$$s_i^* = z_i'\gamma + u_i, \qquad s_i = \mathbf{1}(s_i^* > 0)$$
Where:
- $s_i = 1$ indicates that $y_i$ is observed
Step 2: Outcome Equation (OLS + IMR)
$$y_i = x_i'\beta + \varepsilon_i, \quad \text{observed only if } s_i = 1$$
Key: Assume $(u_i, \varepsilon_i)$ are jointly normal:
$$\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma_\varepsilon \\ \rho\sigma_\varepsilon & \sigma_\varepsilon^2 \end{pmatrix}\right)$$
Inverse Mills Ratio
$$\lambda(z_i'\gamma) = \frac{\phi(z_i'\gamma)}{\Phi(z_i'\gamma)}$$
Where:
- $\phi$ is the standard normal PDF
- $\Phi$ is the standard normal CDF
Corrected Regression
$$E[y_i \mid x_i, s_i = 1] = x_i'\beta + \rho\sigma_\varepsilon\,\lambda(z_i'\gamma)$$
Estimating the outcome equation by OLS with $\hat{\lambda}_i$ included as an additional regressor corrects the selection bias.
Python Implementation
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression, LinearRegression
import statsmodels.api as sm
class HeckmanCorrection:
"""
Heckman Two-Step Selection Model
Used to handle sample selection bias (MNAR)
Example:
-----
    >>> df['outcome'] = df['wage']   # dependent variable of the outcome equation
    >>> heckman = HeckmanCorrection()
    >>> heckman.fit(
    ...     df=df,
    ...     selection_vars=['age', 'education', 'married', 'children'],
    ...     outcome_vars=['age', 'education'],
    ...     selection_indicator='working'
    ... )
"""
def __init__(self):
self.probit_model = None
self.ols_model = None
self.lambda_coef = None
self.rho = None
def fit(self, df, selection_vars, outcome_vars, selection_indicator):
"""
Fit Heckman model
Parameters:
------
df : pd.DataFrame
Complete DataFrame
selection_vars : list
Independent variables for selection equation
outcome_vars : list
Independent variables for outcome equation
selection_indicator : str
Selection indicator variable name (1=observed, 0=not observed)
"""
print("="*70)
print("Heckman Two-Step Selection Model")
print("="*70)
# Step 1: Probit for selection
print("\n【Step 1】Selection Equation (Probit)")
X_selection = df[selection_vars]
y_selection = df[selection_indicator]
# Use statsmodels Probit
X_selection_const = sm.add_constant(X_selection)
probit = sm.Probit(y_selection, X_selection_const)
self.probit_model = probit.fit(disp=False)
print(self.probit_model.summary())
        # Calculate Inverse Mills Ratio from the Probit linear predictor z'gamma
        # (note: .predict() returns probabilities, so compute the linear index explicitly)
        linear_pred = X_selection_const.dot(self.probit_model.params)
        # Only for selected sample
        selected_mask = (y_selection == 1)
        # IMR = phi(z'gamma) / Phi(z'gamma)
        imr = norm.pdf(linear_pred[selected_mask]) / norm.cdf(linear_pred[selected_mask])
        # Step 2: OLS with IMR
        print("\n【Step 2】Outcome Equation (OLS + IMR)")
        df_selected = df[selected_mask].copy()
        df_selected['imr'] = imr
# Outcome equation
X_outcome = df_selected[outcome_vars + ['imr']]
y_outcome = df_selected['outcome']
X_outcome_const = sm.add_constant(X_outcome)
ols = sm.OLS(y_outcome, X_outcome_const)
self.ols_model = ols.fit()
print(self.ols_model.summary())
# Extract IMR coefficient
self.lambda_coef = self.ols_model.params['imr']
print(f"\n【Results】")
print(f" Inverse Mills Ratio coefficient: {self.lambda_coef:.4f}")
        # Judge selection bias by the statistical significance of the IMR coefficient
        imr_pvalue = self.ols_model.pvalues['imr']
        if imr_pvalue < 0.05:
            print(f" Significant sample selection bias detected (IMR coefficient p = {imr_pvalue:.4f})")
        else:
            print(f" No significant evidence of sample selection bias (IMR coefficient p = {imr_pvalue:.4f})")
return self
    def predict(self, X_new, with_correction=True):
        """
        Predict the outcome. With correction, X_new must already contain an 'imr' column;
        without correction, the IMR term is dropped (plain OLS prediction).
        """
        X_const = sm.add_constant(X_new)
        if with_correction:
            # Use the full corrected model (requires an 'imr' column in X_new)
            return self.ols_model.predict(X_const)
        else:
            # Drop the IMR coefficient and predict from the remaining terms only
            params = self.ols_model.params.drop('imr')
            return X_const[params.index] @ params
# Usage example
if __name__ == "__main__":
# Simulate data: Female wage data (married women more likely not to work)
np.random.seed(42)
n = 2000
df = pd.DataFrame({
'age': np.random.normal(35, 8, n),
'education': np.random.normal(13, 2, n),
'married': np.random.binomial(1, 0.5, n),
'children': np.random.poisson(1, n)
})
# Selection equation: Work probability
# Married women with children more likely not to work
work_prob = 1 / (1 + np.exp(
-(0.5 - 0.3 * df['married'] - 0.2 * df['children'] + 0.05 * df['education'])
))
df['working'] = np.random.binomial(1, work_prob)
# Outcome equation: Wage (only observed for working people)
df['wage'] = np.nan
working_mask = df['working'] == 1
df.loc[working_mask, 'wage'] = (
5000 +
200 * df.loc[working_mask, 'age'] +
1000 * df.loc[working_mask, 'education'] +
np.random.normal(0, 2000, working_mask.sum())
)
print(f"Working population: {df['working'].sum()} / {n} ({df['working'].mean()*100:.1f}%)")
print(f"Observed wages: {df['wage'].notna().sum()}")
# Heckman model
df_heckman = df.copy()
df_heckman['outcome'] = df_heckman['wage']
heckman = HeckmanCorrection()
heckman.fit(
df=df_heckman,
selection_vars=['age', 'education', 'married', 'children'],
outcome_vars=['age', 'education'],
selection_indicator='working'
    )
Real Case: CFPS 2018 Data Cleaning
Case Background
Data: China Family Panel Studies (CFPS) 2018
Research Question: Impact of education on income
Challenges:
- 30% income missing rate
- High-income individuals more likely to refuse answering (MAR/MNAR)
- Extreme values exist (very high income)
Complete Cleaning Workflow
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Simulate CFPS 2018 data
np.random.seed(42)
n = 5000
# Create simulated data
cfps = pd.DataFrame({
'pid': range(1, n+1),
'age': np.random.normal(42, 12, n),
'education': np.random.choice([6, 9, 12, 16], n, p=[0.3, 0.2, 0.3, 0.2]),
'gender': np.random.choice(['Male', 'Female'], n),
'urban': np.random.choice([0, 1], n, p=[0.4, 0.6]),
'province': np.random.choice([
'Beijing', 'Shanghai', 'Guangdong', 'Zhejiang', 'Jiangsu', 'Shandong', 'Henan', 'Sichuan', 'Hubei', 'Hunan'
], n),
'health': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.4, 0.3, 0.1])
})
# Generate income (based on education, age, etc.)
cfps['income'] = (
5000 +
2000 * cfps['education'] +
300 * cfps['age'] +
5000 * cfps['urban'] +
10000 * (cfps['gender'] == 'Male').astype(int) +
np.random.lognormal(8, 1.5, n)
)
# Introduce MAR missingness (high income, high education more likely to be missing)
missing_prob = 1 / (1 + np.exp(
-(cfps['income'] / 50000 - 2 + 0.1 * cfps['education'])
))
missing_mask = np.random.random(n) < missing_prob
cfps.loc[missing_mask, 'income'] = np.nan
print(f"Income missing rate: {cfps['income'].isnull().mean()*100:.2f}%")
# Introduce other missingness
for col in ['age', 'education', 'health']:
mask = np.random.random(n) < 0.05
cfps.loc[mask, col] = np.nan
# Introduce outliers
cfps.loc[np.random.choice(n, 20, replace=False), 'income'] = np.random.uniform(
500000, 1000000, 20
)
cfps.loc[np.random.choice(n, 10, replace=False), 'age'] = -99
# Introduce duplicates
cfps = pd.concat([cfps, cfps.sample(50)], ignore_index=True)
print(f"\nRaw data: {cfps.shape}")
print(f"\nFirst 5 rows:")
print(cfps.head())
# Use DataCleaner
print("\n" + "="*70)
print("Start Data Cleaning")
print("="*70)
cleaner = DataCleaner(cfps, name='CFPS 2018')
# 1. Diagnose missing
cleaner.diagnose_missing(plot=False)
# 2. Remove duplicates
cleaner.remove_duplicates(subset=['pid'], keep='first')
# 3. Handle obvious error values
print("\n" + "="*70)
print("Handle Error Values")
print("="*70)
# Age range: 16-100
before = len(cleaner.df)
cleaner.df = cleaner.df[(cleaner.df['age'] >= 16) & (cleaner.df['age'] <= 100)]
print(f" Removed age outliers: {before - len(cleaner.df)} rows")
# Income > 0
before = len(cleaner.df)
cleaner.df = cleaner.df[(cleaner.df['income'] > 0) | (cleaner.df['income'].isnull())]
print(f" Removed negative income: {before - len(cleaner.df)} rows")
# 4. MICE imputation
cleaner.impute_mice(n_imputations=5, max_iter=10, use_rubin=True)
# 5. Handle outliers
cleaner.handle_outliers(method='winsorize', limits=[0.01, 0.01])
# 6. Clean strings
cleaner.clean_strings()
# 7. Generate report
report = cleaner.generate_cleaning_report()
# Get cleaned data
cfps_clean = cleaner.get_clean_data()
print(f"\nCleaned data: {cfps_clean.shape}")
print(f"\nDescriptive statistics:")
print(cfps_clean[['age', 'education', 'income']].describe())
# 8. Compare complete case deletion vs MICE imputation
print("\n" + "="*70)
print("Complete Case Deletion vs MICE Imputation Comparison")
print("="*70)
# Complete case
cfps_complete = cfps.dropna(subset=['income', 'education', 'age'])  # drop rows missing any regression variable
# Estimate return to education
from sklearn.linear_model import LinearRegression
# Complete case estimate
X_complete = cfps_complete[['education', 'age', 'urban']]
y_complete = cfps_complete['income']
model_complete = LinearRegression().fit(X_complete, y_complete)
beta_complete = model_complete.coef_[0]
# MICE imputation estimate
X_mice = cfps_clean[['education', 'age', 'urban']]
y_mice = cfps_clean['income']
model_mice = LinearRegression().fit(X_mice, y_mice)
beta_mice = model_mice.coef_[0]
print(f"\nReturn to education estimate:")
print(f" Complete case deletion: {beta_complete:.2f}")
print(f" MICE imputation: {beta_mice:.2f}")
print(f" Difference: {beta_mice - beta_complete:.2f} ({(beta_mice/beta_complete - 1)*100:.2f}%)")
# 9. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Original vs cleaned income distribution
axes[0, 0].hist(cfps['income'].dropna(), bins=50, alpha=0.5, label='Original', edgecolor='black')
axes[0, 0].hist(cfps_clean['income'], bins=50, alpha=0.5, label='Cleaned', edgecolor='black')
axes[0, 0].set_title('Income Distribution Comparison')
axes[0, 0].set_xlabel('Income')
axes[0, 0].legend()
# Age distribution
axes[0, 1].hist(cfps_clean['age'], bins=30, edgecolor='black', color='green')
axes[0, 1].set_title('Age Distribution (Cleaned)')
axes[0, 1].set_xlabel('Age')
# Education distribution
cfps_clean['education'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 0], color='orange')
axes[1, 0].set_title('Education Distribution (Cleaned)')
axes[1, 0].set_xlabel('Years of Education')
axes[1, 0].set_ylabel('Frequency')
# Income vs education
axes[1, 1].scatter(cfps_clean['education'], cfps_clean['income'], alpha=0.3, s=10)
axes[1, 1].set_title('Education and Income Relationship')
axes[1, 1].set_xlabel('Years of Education')
axes[1, 1].set_ylabel('Income')
plt.tight_layout()
plt.show()
print("\n Data cleaning complete!")Sensitivity Analysis
print("\n" + "="*70)
print("Sensitivity Analysis: Missing Mechanism Assumptions")
print("="*70)
# Assumption 1: MCAR (deletion)
cfps_mcar = cfps.dropna()
X_mcar = cfps_mcar[['education', 'age', 'urban']]
y_mcar = cfps_mcar['income']
model_mcar = LinearRegression().fit(X_mcar, y_mcar)
beta_mcar = model_mcar.coef_[0]
# Assumption 2: MAR (MICE)
beta_mar = beta_mice
# Assumption 3: MNAR (Heckman)
# Simplified: Assume income missingness related to true income
# (In practice, more complex modeling needed)
print(f"\nReturn to education estimate (sensitivity analysis):")
print(f" MCAR assumption (deletion): {beta_mcar:.2f}")
print(f" MAR assumption (MICE): {beta_mar:.2f}")
print(f" Difference: {beta_mar - beta_mcar:.2f} ({abs(beta_mar - beta_mcar) / beta_mcar * 100:.2f}%)")
print(f"\nConclusion:")
if abs(beta_mar - beta_mcar) / beta_mcar < 0.05:
print(f" Results not sensitive to missing mechanism assumption (difference < 5%)")
else:
print(f" ️ Results sensitive to missing mechanism assumption (difference ≥ 5%)")
print(f" Suggestion: Further investigate missing mechanism, may need Heckman correction")Practice Problems
⭐⭐ Basic: Simple Mean Imputation
Task: Given a dataset with missing values, implement simple mean imputation.
# Practice data
np.random.seed(42)
df_practice = pd.DataFrame({
'x1': np.random.normal(50, 10, 100),
'x2': np.random.normal(100, 20, 100),
'x3': np.random.normal(75, 15, 100)
})
# Introduce missingness
for col in df_practice.columns:
mask = np.random.random(100) < 0.2
df_practice.loc[mask, col] = np.nan
print("Missing rates:")
print(df_practice.isnull().mean())
# TODO: Implement mean imputation
# df_imputed = ...
# Verification
# assert df_imputed.isnull().sum().sum() == 0
⭐⭐⭐ Intermediate: IQR Outlier Detection
Task: Implement IQR-based outlier detection function and visualize.
# TODO: Implement function
def detect_and_visualize_outliers(df, col):
"""
Detect and visualize outliers
Returns:
------
outliers : pd.DataFrame
Outlier data
"""
# 1. IQR calculation
# 2. Identify outliers
# 3. Visualize (box plot + scatter plot)
pass
# Test
# outliers = detect_and_visualize_outliers(df, 'income')
⭐⭐⭐⭐ Advanced: MICE + Rubin's Rules
Task: Implement complete MICE imputation and combine results using Rubin's Rules.
# TODO: Implement
# 1. MICE generates m=5 imputed datasets
# 2. Estimate linear regression coefficients for each dataset
# 3. Combine using Rubin's Rules
# 4. Calculate 95% confidence interval
# 5. Compare with complete case analysis
⭐⭐⭐⭐⭐ Nobel Level: Heckman + Sensitivity Analysis
Task: Implement Heckman two-step method and conduct sensitivity analysis.
# TODO: Implement
# 1. Simulate female wage data (married women more likely not to work)
# 2. Heckman two-step estimation
# 3. Compare with regular OLS (delete missing)
# 4. Sensitivity analysis: Change selection equation variables
# 5. Explain economic meaning of Inverse Mills Ratio
Summary
Core Data Cleaning Functions
| Task | Function | Purpose |
|---|---|---|
| Missing values | dropna() | Delete missing |
| Missing values | fillna() | Fill missing |
| Missing values | IterativeImputer() | MICE imputation |
| Outliers | clip() | Cap |
| Outliers | winsorize() | Winsorize |
| Outliers | IsolationForest() | ML detection |
| Duplicates | duplicated() | Detect duplicates |
| Duplicates | drop_duplicates() | Remove duplicates |
| Type conversion | astype() | Type conversion |
| Type conversion | pd.to_numeric() | Convert to numeric |
| Type conversion | pd.to_datetime() | Convert to date |
| Strings | str.strip() | Remove whitespace |
| Strings | str.lower() | Convert to lowercase |
| Strings | str.replace() | Replace characters |
Missing Value Handling Decision Tree
Start
│
├─ Missing rate < 5%?
│ ├─ Yes → Direct deletion (dropna)
│ └─ No → Continue
│
├─ Little's MCAR Test p > 0.05?
│ ├─ Yes → MCAR, can delete or simple imputation
│ └─ No → MAR/MNAR, need advanced methods
│
├─ MAR?
│ ├─ Yes → MICE multiple imputation
│ └─ No → MNAR
│
└─ MNAR?
├─ Sample selection problem → Heckman two-step
├─ Missingness itself informative → Missing indicator variable
└─ Uncertain → Sensitivity analysis
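The decision tree can be expressed as a small helper that only suggests a strategy; the thresholds (5% missing rate, p > 0.05) follow the tree above, and the mechanism label is still a judgment call based on domain knowledge, so treat this as a rough sketch rather than an automated rule.
def suggest_missing_strategy(missing_rate, mcar_p_value=None, mechanism=None):
    """Map the decision tree above to a suggested strategy (advisory only)."""
    if missing_rate < 0.05:
        return "Listwise deletion (dropna) is usually safe"
    if mcar_p_value is not None and mcar_p_value > 0.05:
        return "MCAR plausible: deletion or simple imputation"
    if mechanism == "MAR":
        return "MICE multiple imputation"
    if mechanism == "MNAR":
        return "Heckman correction, missing-indicator variable, or sensitivity analysis"
    return "Mechanism unclear: run sensitivity analysis across strategies"

# Example: 30% missing income, Little's test rejects MCAR, missingness linked to observables
print(suggest_missing_strategy(0.30, mcar_p_value=0.001, mechanism="MAR"))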
Data Cleaning Best Practices
- Preserve original data: use df.copy() to create a working copy
- Record cleaning steps: add comments and a log entry for each step
- Explore before cleaning: Understand data before processing
- Reproducible workflow: Write functions for reusability
- Verify cleaning results: Recheck data quality after cleaning
- Sensitivity analysis: Compare impact of different cleaning strategies
- Documentation: Generate cleaning reports
Theoretical Key Points
Missing mechanisms:
- MCAR: Deletion doesn't cause bias
- MAR: Can be modeled with observed variables
- MNAR: Needs additional assumptions or sensitivity analysis
MICE:
- Iterative imputation considering inter-variable relationships
- Rubin's Rules combine multiple imputed datasets
- Standard errors correctly reflect uncertainty
Heckman model:
- Two-step method corrects sample selection bias
- Inverse Mills Ratio quantifies selection effect
- Needs exclusion restriction
Next Steps
In the next section we will learn Variable Construction, including dummy variables, interaction terms, lags, and more.
Keep going!
References
Classic Textbooks
- Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd Edition). Wiley.
- Heckman, J. J. (1979). "Sample Selection Bias as a Specification Error." Econometrica, 47(1), 153-161.
- van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd Edition). Chapman and Hall/CRC.
Practical Tools
- missingno: missing value visualization
- scikit-learn.impute: imputation algorithms (SimpleImputer, KNNImputer, IterativeImputer)
- statsmodels: statistical modeling (Probit and OLS building blocks for the Heckman correction)
- fancyimpute: advanced imputation methods