
3.3 Data Cleaning

"Garbage in, garbage out.""垃圾进,垃圾出。"— George Fuechsel, IBM Programmer (IBM程序员)



Learning Objectives

  • Scientifically handle missing values (understand MCAR/MAR/MNAR)
  • Master MICE multiple imputation theory and practice
  • Identify and handle outliers
  • Remove duplicate values
  • Unify data types and formats
  • Establish reproducible cleaning workflows
  • Understand Heckman sample selection model

Why is Data Cleaning So Important?

Real Case: Excel Error Leading to Policy Failure

2010, Reinhart-Rogoff Debt and Growth Study

  • Harvard economists published a paper claiming: government debt above 90% of GDP → economic growth declines significantly
  • Multiple governments implemented austerity policies based on this finding
  • 2013 discovery: an Excel formula error (data from 5 countries omitted from the analysis)
  • Consequence: the conclusion was overturned and the policy recommendations invalidated

Lesson: Data cleaning errors lead to wrong conclusions and policies!


Statistical Theory of Missing Data

Mathematical Definition of Missing Mechanisms

Let $Y$ be the complete data matrix, with $Y = (Y_{\text{obs}}, Y_{\text{mis}})$, and let $R$ be the missing indicator matrix ($r_{ij} = 1$ if $y_{ij}$ is missing, $0$ otherwise).

The missing mechanism is characterized by the conditional distribution $P(R \mid Y, \phi)$, where $\phi$ is the missing mechanism parameter.

1. MCAR (Missing Completely At Random)

Missingness is independent of all variables (observed and unobserved):

$$P(R \mid Y, \phi) = P(R \mid \phi)$$

Example: In surveys, a respondent is randomly interrupted (e.g., the phone call drops), producing missingness.

2. MAR (Missing At Random)

Given the observed data, missingness is independent of the unobserved data:

$$P(R \mid Y, \phi) = P(R \mid Y_{\text{obs}}, \phi)$$

Example: High-income individuals are more likely to refuse to answer income questions, but given observed variables such as education and occupation, missingness is independent of true income.

3. MNAR (Missing Not At Random)

Missingness depends on the unobserved data even after conditioning on the observed data:

$$P(R \mid Y, \phi) \neq P(R \mid Y_{\text{obs}}, \phi)$$

Example: Drug users are more likely to refuse questions about drug history; missingness is related to the answer itself.
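
To make the three mechanisms concrete, here is a minimal sketch (simulated data; the variable names and cutoffs are illustrative assumptions, not from any survey) that generates MCAR, MAR, and MNAR missingness in the same toy income variable:

python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000
df_demo = pd.DataFrame({
    'education': np.random.normal(12, 3, n),
    'income': np.random.normal(50000, 15000, n)
})

# MCAR: missingness is pure chance, unrelated to any variable
mcar = df_demo.copy()
mcar.loc[np.random.random(n) < 0.2, 'income'] = np.nan

# MAR: missingness depends on an observed variable (education), not on income itself
mar = df_demo.copy()
p_miss = 1 / (1 + np.exp(-(mar['education'] - 12)))  # more educated -> more refusals
mar.loc[np.random.random(n) < 0.4 * p_miss, 'income'] = np.nan

# MNAR: missingness depends on the unobserved value of income itself
mnar = df_demo.copy()
mnar.loc[(mnar['income'] > 60000) & (np.random.random(n) < 0.5), 'income'] = np.nan

for name, d in [('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    print(f"{name}: missing rate = {d['income'].isnull().mean():.2%}, "
          f"observed mean = {d['income'].mean():.0f}")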

Little's MCAR Test

Statistical test of the MCAR hypothesis:

$$d^2 = \sum_{j=1}^{J} n_j \left(\bar{y}_j - \hat{\mu}_j\right)^\top \hat{\Sigma}_j^{-1} \left(\bar{y}_j - \hat{\mu}_j\right)$$

Where:

  • $J$ is the total number of missing patterns
  • $n_j$ is the sample size for the $j$-th missing pattern
  • $\bar{y}_j$ is the mean vector of the observed variables in the $j$-th missing pattern
  • $\hat{\mu}_j$ is the overall mean vector (restricted to those variables)
  • $\hat{\Sigma}_j$ is the pooled covariance matrix (restricted to those variables)

Under the MCAR hypothesis, $d^2$ approximately follows a chi-square distribution with $\sum_j p_j - p$ degrees of freedom, where $p_j$ is the number of observed variables in pattern $j$ and $p$ is the total number of variables.


Missing Value Handling

Three Types of Missing Values

| Type | Abbreviation | Definition | Handling Strategy |
|------|--------------|------------|-------------------|
| Missing Completely At Random | MCAR | Missingness unrelated to any variable | Delete or simple imputation |
| Missing At Random | MAR | Missingness related to observed variables | Conditional imputation or interpolation |
| Missing Not At Random | MNAR | Missingness related to unobserved variables | Needs modeling or sensitivity analysis |

Detecting Missing Values

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Basic detection
df.isnull().sum()        # Missing count per column
df.isnull().mean() * 100 # Missing rate
df.isnull().any(axis=1).sum()  # Rows with missing values

# 2. Visualize missing patterns
import missingno as msno

# Missing value matrix plot
msno.matrix(df, figsize=(10, 6))
plt.title('Missing Value Patterns')

# Missing value correlation heatmap
msno.heatmap(df, figsize=(10, 6))

# Missing value dendrogram (shows which variables are missing together)
msno.dendrogram(df)

Little's MCAR Test Implementation

python
from scipy import stats
from scipy.linalg import inv

def littles_mcar_test(data):
    """
    Little's MCAR Test

    H0: Data is MCAR
    H1: Data is not MCAR

    Parameters:
    -----------
    data : pd.DataFrame
        DataFrame containing missing values

    Returns:
    --------
    chi2_stat : float
        Chi-square statistic
    p_value : float
        p-value
    df : int
        Degrees of freedom
    """
    # Remove columns that are all missing
    data = data.dropna(axis=1, how='all')

    # Identify missing patterns
    missing_patterns = data.isnull().astype(int)
    unique_patterns = missing_patterns.drop_duplicates()

    # Keep only numeric columns
    numeric_data = data.select_dtypes(include=[np.number])

    # Calculate overall mean and covariance (complete cases)
    complete_cases = numeric_data.dropna()
    if len(complete_cases) < 2:
        return np.nan, np.nan, np.nan

    grand_mean = complete_cases.mean().values
    grand_cov = complete_cases.cov().values

    # Calculate chi-square statistic
    chi2_stat = 0
    total_df = 0

    for idx, pattern in unique_patterns.iterrows():
        # Get data for this pattern
        mask = (missing_patterns == pattern).all(axis=1)
        pattern_data = numeric_data[mask]

        if len(pattern_data) == 0:
            continue

        # Observed variables (pattern value 0 = observed)
        observed_vars = (pattern[numeric_data.columns].values == 0)

        if observed_vars.sum() == 0:
            continue

        # Mean for this pattern
        pattern_mean = pattern_data.iloc[:, observed_vars].mean().values

        # Covariance matrix for this pattern
        sub_cov = grand_cov[np.ix_(observed_vars, observed_vars)]

        if np.linalg.matrix_rank(sub_cov) < len(pattern_mean):
            continue

        # Overall mean (observed variables only)
        sub_mean = grand_mean[observed_vars]

        # Calculate Mahalanobis distance
        diff = pattern_mean - sub_mean
        n_j = len(pattern_data)

        try:
            chi2_j = n_j * diff @ inv(sub_cov) @ diff
            chi2_stat += chi2_j
            total_df += len(pattern_mean)
        except np.linalg.LinAlgError:
            continue

    # Degrees of freedom: sum of observed-variable counts across patterns,
    # minus the total number of variables (Little, 1988)
    df = total_df - numeric_data.shape[1]

    # p-value
    p_value = 1 - stats.chi2.cdf(chi2_stat, df)

    return chi2_stat, p_value, df

# Usage example
chi2, p_value, df = littles_mcar_test(df)
print(f"Little's MCAR Test:")
print(f"  Chi-square = {chi2:.4f}")
print(f"  df = {df}")
print(f"  p-value = {p_value:.4f}")

if p_value > 0.05:
    print("  Conclusion: Cannot reject MCAR hypothesis (data may be MCAR)")
else:
    print("  Conclusion: Reject MCAR hypothesis (data is not MCAR)")

Missing Value Handling Strategies

Strategy 1: Deletion

Listwise Deletion:

python
# Delete rows with any missing column
df_clean = df.dropna()

# Delete rows with missing in specific columns
df_clean = df.dropna(subset=['income', 'education'])

# Delete rows that are all missing
df_clean = df.dropna(how='all')

# Delete rows with missing exceeding threshold (keep only if at least 5 non-missing values)
df_clean = df.dropna(thresh=5)

Delete columns with high missing rates:

python
# Delete columns with missing rate > 50%
missing_ratio = df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > 0.5].index
df_clean = df.drop(columns=cols_to_drop)

print(f"Dropped columns: {list(cols_to_drop)}")

When to use deletion:

  • Missing rate < 5%
  • MCAR (Missing Completely At Random)
  • Sample size is large enough

Risks of deletion:

  • With a high missing rate, deletion discards substantial information
  • Under MAR/MNAR, deletion introduces selection bias (see the sketch below)
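
A minimal sketch (simulated data, not one of the chapter's datasets) showing how listwise deletion biases an estimate under MAR missingness: higher-educated respondents refuse the income question more often, so the complete-case mean understates true income.

python
import numpy as np
import pandas as pd

np.random.seed(1)
n = 5000
edu = np.random.normal(12, 3, n)
income = 20000 + 3000 * edu + np.random.normal(0, 10000, n)
df_mar = pd.DataFrame({'education': edu, 'income': income})

# MAR: higher education -> higher probability of refusing the income question
p_miss = 1 / (1 + np.exp(-(edu - 14)))
df_mar.loc[np.random.random(n) < p_miss, 'income'] = np.nan

print(f"True mean income:   {income.mean():.0f}")
print(f"Complete-case mean: {df_mar['income'].mean():.0f}")  # biased downward
print(f"Missing rate:       {df_mar['income'].isnull().mean():.1%}")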

Strategy 2: Simple Imputation

python
# 1. Mean imputation
df['age'] = df['age'].fillna(df['age'].mean())

# 2. Median imputation (robust to outliers)
df['income'] = df['income'].fillna(df['income'].median())

# 3. Mode imputation (categorical variables)
df['region'] = df['region'].fillna(df['region'].mode()[0])

# 4. Constant imputation
df['score'] = df['score'].fillna(0)

# 5. Forward fill (time series)
df['sales'] = df['sales'].ffill()

# 6. Backward fill
df['sales'] = df['sales'].bfill()

# 7. Linear interpolation (time series)
df['temperature'] = df['temperature'].interpolate(method='linear')

Batch imputation:

python
# Median for numeric columns, mode for categorical columns
from sklearn.impute import SimpleImputer

# Numeric columns
num_cols = df.select_dtypes(include=[np.number]).columns
imputer_num = SimpleImputer(strategy='median')
df[num_cols] = imputer_num.fit_transform(df[num_cols])

# Categorical columns
cat_cols = df.select_dtypes(include=['object']).columns
imputer_cat = SimpleImputer(strategy='most_frequent')
df[cat_cols] = imputer_cat.fit_transform(df[cat_cols])

Strategy 3: Advanced Imputation

Conditional imputation (group-based imputation):

python
# Use regional average income to fill missing values
df['income'] = df.groupby('region')['income'].transform(
    lambda x: x.fillna(x.median())
)

# Use industry-education combination average wage to fill
df['wage'] = df.groupby(['industry', 'education'])['wage'].transform(
    lambda x: x.fillna(x.mean())
)

Regression imputation (predict missing values using other variables):

python
from sklearn.linear_model import LinearRegression

# Example: Use age and education to predict missing income
# 1. Separate data with/without missing
df_with_income = df[df['income'].notnull()]
df_missing_income = df[df['income'].isnull()]

# 2. Train model
X = df_with_income[['age', 'education']]
y = df_with_income['income']
model = LinearRegression().fit(X, y)

# 3. Predict missing values
X_missing = df_missing_income[['age', 'education']]
predicted_income = model.predict(X_missing)

# 4. Fill
df.loc[df['income'].isnull(), 'income'] = predicted_income

KNN imputation:

python
from sklearn.impute import KNNImputer

# Use K nearest neighbors' average value to fill
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df[['age', 'income', 'education']]),
    columns=['age', 'income', 'education']
)

MICE Multiple Imputation Theory

MICE Algorithm Principle

MICE (Multivariate Imputation by Chained Equations) is an iterative imputation algorithm.

Algorithm Steps

For variables $X_1, X_2, \ldots, X_p$ containing missing values:

  1. Initialization: Fill all missing values using a simple method (e.g., the mean)

  2. Iteration: At the $t$-th iteration, for each variable $X_j$ with missing values, regress $X_j$ on all other variables and draw imputations from the fitted model:

$$X_j^{(t)} \sim P\!\left(X_j \,\middle|\, X_{-j}^{(t)}, \theta_j\right)$$

Where $X_{-j}$ represents all variables except $X_j$ (using their most recently imputed values).

  3. Convergence: Repeat step 2 until convergence (usually 5-10 iterations)

  4. Generate multiple imputed datasets: Repeat the above process $m$ times (usually $m = 5$), yielding $m$ complete datasets (a minimal single-imputation sketch follows this list)
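
Before the full multiple-imputation class later in this section, here is a minimal single-imputation sketch with scikit-learn's IterativeImputer, which implements the chained-equations idea (one imputed dataset only; the toy variables are illustrative):

python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

np.random.seed(0)
df_toy = pd.DataFrame({
    'age': np.random.normal(40, 10, 200),
    'education': np.random.normal(12, 3, 200),
    'income': np.random.normal(50000, 20000, 200)
})
df_toy.loc[np.random.random(200) < 0.2, 'income'] = np.nan

# Each variable with missing values is regressed on the others, iteratively
imputer = IterativeImputer(max_iter=10, random_state=42)
df_one = pd.DataFrame(imputer.fit_transform(df_toy), columns=df_toy.columns)
print(df_one.isnull().sum())  # all zeros after imputation

Running this process $m$ times with different random seeds, as the MICEImputer class below does, turns single imputation into proper multiple imputation.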

Rubin's Rules

For $m$ imputed datasets, how do we combine the estimates?

Let $\hat{Q}_i$ be the parameter estimate from the $i$-th imputed dataset, and $\hat{W}_i$ be its variance estimate.

Pooled estimator:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$$

Total variance:

$$T = \bar{W} + \left(1 + \frac{1}{m}\right) B$$

Where:

  • Within-imputation variance (average variance): $\bar{W} = \frac{1}{m} \sum_{i=1}^{m} \hat{W}_i$
  • Between-imputation variance: $B = \frac{1}{m-1} \sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2$

Degrees of freedom:

$$\nu = (m - 1) \left(1 + \frac{\bar{W}}{(1 + 1/m) B}\right)^2$$

Confidence interval:

$$\bar{Q} \pm t_{\nu,\,0.975} \sqrt{T}$$

Fraction of Missing Information

$$\text{FMI} = \frac{(1 + 1/m)\, B}{T}$$

FMI measures the proportion of information lost due to missingness.
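
A small numeric sketch of the pooling formulas, using three hypothetical point estimates and variances chosen purely for illustration:

python
import numpy as np
from scipy import stats

# Hypothetical per-imputation estimates and their variances (m = 3)
estimates = np.array([50200.0, 50800.0, 50500.0])
variances = np.array([400.0**2, 410.0**2, 395.0**2])
m = len(estimates)

Q_bar = estimates.mean()                     # pooled estimate
W = variances.mean()                         # within-imputation variance
B = estimates.var(ddof=1)                    # between-imputation variance
T = W + (1 + 1/m) * B                        # total variance
nu = (m - 1) * (1 + W / ((1 + 1/m) * B))**2  # degrees of freedom
fmi = (1 + 1/m) * B / T                      # fraction of missing information

se = np.sqrt(T)
t_crit = stats.t.ppf(0.975, nu)
print(f"Pooled estimate: {Q_bar:.1f}, SE: {se:.1f}")
print(f"95% CI: [{Q_bar - t_crit * se:.1f}, {Q_bar + t_crit * se:.1f}]")
print(f"FMI: {fmi:.3f}")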

Complete MICE Implementation

python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
import pandas as pd
import numpy as np

class MICEImputer:
    """
    MICE (Multivariate Imputation by Chained Equations)
    Complete implementation + Rubin's Rules
    """

    def __init__(self, n_imputations=5, max_iter=10, random_state=42):
        """
        Parameters:
        ------
        n_imputations : int
            Number of imputed datasets to generate
        max_iter : int
            Maximum iterations for each imputation
        random_state : int
            Random seed
        """
        self.n_imputations = n_imputations
        self.max_iter = max_iter
        self.random_state = random_state
        self.imputed_datasets = []

    def fit_transform(self, X):
        """
        Perform MICE imputation

        Parameters:
        ------
        X : pd.DataFrame
            DataFrame containing missing values

        Returns:
        ------
        imputed_datasets : list
            m imputed DataFrames
        """
        self.imputed_datasets = []

        for i in range(self.n_imputations):
            # Create iterative imputer
            imputer = IterativeImputer(
                estimator=BayesianRidge(),
                max_iter=self.max_iter,
                random_state=self.random_state + i,
                verbose=0
            )

            # Impute
            X_imputed = imputer.fit_transform(X)

            # Convert to DataFrame
            df_imputed = pd.DataFrame(
                X_imputed,
                columns=X.columns,
                index=X.index
            )

            self.imputed_datasets.append(df_imputed)

        return self.imputed_datasets

    def analyze(self, estimator_func):
        """
        Analyze multiple imputed datasets and combine results using Rubin's Rules

        Parameters:
        ------
        estimator_func : function
            Function that accepts DataFrame and returns (estimate, variance)

        Returns:
        ------
        pooled_estimate : float
            Pooled estimator
        pooled_variance : float
            Total variance
        pooled_se : float
            Pooled standard error
        ci_lower, ci_upper : float
            95% confidence interval
        """
        estimates = []
        variances = []

        # Estimate for each imputed dataset
        for df in self.imputed_datasets:
            est, var = estimator_func(df)
            estimates.append(est)
            variances.append(var)

        # Rubin's Rules
        m = self.n_imputations

        # Pooled estimator
        pooled_estimate = np.mean(estimates)

        # Within-imputation variance
        W = np.mean(variances)

        # Between-imputation variance
        B = np.var(estimates, ddof=1)

        # Total variance
        T = W + (1 + 1/m) * B

        # Standard error
        pooled_se = np.sqrt(T)

        # Degrees of freedom
        if B > 0:
            nu = (m - 1) * (1 + W / ((1 + 1/m) * B))**2
        else:
            nu = np.inf

        # 95% CI
        from scipy import stats
        t_crit = stats.t.ppf(0.975, nu) if nu < np.inf else 1.96
        ci_lower = pooled_estimate - t_crit * pooled_se
        ci_upper = pooled_estimate + t_crit * pooled_se

        # Fraction of missing information
        fmi = (B + B/m) / T if T > 0 else 0

        return {
            'estimate': pooled_estimate,
            'se': pooled_se,
            'ci_lower': ci_lower,
            'ci_upper': ci_upper,
            'W': W,
            'B': B,
            'T': T,
            'nu': nu,
            'fmi': fmi
        }

# Usage example
# Create simulated data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'age': np.random.normal(40, 10, n),
    'education': np.random.normal(12, 3, n),
    'income': np.random.normal(50000, 20000, n),
    'health': np.random.normal(70, 15, n)
})

# Introduce MAR missingness (high-income individuals more likely to have missing income)
df.loc[df['income'] > 60000, 'income'] = np.where(
    np.random.random(sum(df['income'] > 60000)) < 0.3,
    np.nan,
    df.loc[df['income'] > 60000, 'income']
)

# Introduce other missingness
for col in ['age', 'education', 'health']:
    mask = np.random.random(n) < 0.1
    df.loc[mask, col] = np.nan

print(f"Missing rates:\n{df.isnull().mean() * 100}")

# MICE imputation
mice = MICEImputer(n_imputations=5, max_iter=10)
imputed_datasets = mice.fit_transform(df)

# Define estimation function (e.g., estimate income mean)
def estimate_mean_income(df):
    """Estimate income mean and its variance"""
    mean = df['income'].mean()
    var = df['income'].var() / len(df)  # Variance of the mean
    return mean, var

# Combine results using Rubin's Rules
results = mice.analyze(estimate_mean_income)

print("\n=== MICE + Rubin's Rules Results ===")
print(f"Pooled estimate: {results['estimate']:.2f}")
print(f"Standard error: {results['se']:.2f}")
print(f"95% CI: [{results['ci_lower']:.2f}, {results['ci_upper']:.2f}]")
print(f"Within variance (W): {results['W']:.2f}")
print(f"Between variance (B): {results['B']:.2f}")
print(f"Total variance (T): {results['T']:.2f}")
print(f"Degrees of freedom: {results['nu']:.2f}")
print(f"Fraction of missing information (FMI): {results['fmi']:.4f}")

# Compare with complete case analysis
complete_case_mean = df['income'].dropna().mean()
print(f"\nComplete case mean: {complete_case_mean:.2f}")
print(f"Difference: {results['estimate'] - complete_case_mean:.2f}")

Strategy 4: Missing Indicator Variable

python
# Create missing indicator variable
df['income_missing'] = df['income'].isnull().astype(int)

# Then fill original variable (e.g., with median)
df['income'].fillna(df['income'].median(), inplace=True)

# In regression, include both the original variable and the missing indicator;
# the coefficient of income_missing reflects the effect of missingness itself
# (see the regression sketch after this subsection)

When to use:

  • Suspect missingness itself is informative (MNAR)
  • Moderate missing rate (10-30%)
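
A hedged sketch of the regression described in the code comment above, on simulated data (the variable names and the consumption outcome are illustrative assumptions): a significant coefficient on the indicator suggests the missingness itself carries information.

python
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(0)
n = 1000
income = np.random.normal(50000, 15000, n)
consumption = 5000 + 0.6 * income + np.random.normal(0, 5000, n)
df_ind = pd.DataFrame({'income': income, 'consumption': consumption})

# Informative missingness: high earners refuse the income question more often
df_ind.loc[(df_ind['income'] > 60000) & (np.random.random(n) < 0.4), 'income'] = np.nan

# Missing indicator + median fill
df_ind['income_missing'] = df_ind['income'].isnull().astype(int)
df_ind['income_filled'] = df_ind['income'].fillna(df_ind['income'].median())

# Include both the filled variable and the indicator as regressors
X = sm.add_constant(df_ind[['income_filled', 'income_missing']])
model = sm.OLS(df_ind['consumption'], X).fit()
print(model.summary().tables[1])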

Production-Level Data Cleaning Class

python
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import IsolationForest
import warnings
warnings.filterwarnings('ignore')

class DataCleaner:
    """
    Production-Level Data Cleaning Pipeline - Nobel Prize Standards

    Features:
    -----
    1. Missing value diagnosis (MCAR/MAR/MNAR)
    2. MICE multiple imputation
    3. Outlier detection and handling (IQR, Z-score, Isolation Forest, Winsorization)
    4. String cleaning
    5. Duplicate handling
    6. Data type unification
    7. Complete cleaning report

    Example:
    -----
    >>> cleaner = DataCleaner(df, name='CFPS 2018')
    >>> cleaner.diagnose_missing()
    >>> cleaner.impute_mice(n_imputations=5)
    >>> cleaner.handle_outliers(method='winsorize')
    >>> df_clean = cleaner.get_clean_data()
    >>> cleaner.generate_cleaning_report()
    """

    def __init__(self, df, name='Dataset'):
        """
        Initialize cleaner

        Parameters:
        ------
        df : pd.DataFrame
            Raw DataFrame
        name : str
            Dataset name
        """
        self.df_original = df.copy()
        self.df = df.copy()
        self.name = name
        self.cleaning_log = []
        self.missing_diagnostics = {}
        self.outlier_diagnostics = {}

        self._log(f"Initialized DataCleaner: {name}")
        self._log(f"  Shape: {self.df.shape[0]} rows × {self.df.shape[1]} columns")

    def _log(self, message):
        """Log cleaning steps"""
        self.cleaning_log.append(message)
        print(message)

    def diagnose_missing(self, plot=True):
        """
        Diagnose missing mechanism: MCAR/MAR/MNAR

        Steps:
        -----
        1. Little's MCAR Test
        2. Missing pattern analysis
        3. Correlation heatmap
        4. Missingness correlation with observed variables
        """
        self._log("\n" + "="*70)
        self._log("Missing Value Diagnosis")
        self._log("="*70)

        # Basic statistics
        missing_count = self.df.isnull().sum()
        missing_pct = self.df.isnull().mean() * 100

        missing_summary = pd.DataFrame({
            'Missing_Count': missing_count,
            'Missing_Pct': missing_pct
        })
        missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values(
            'Missing_Pct', ascending=False
        )

        self._log(f"\n【1】Missing Value Overview")
        self._log(f"  Total missing values: {self.df.isnull().sum().sum()}")
        self._log(f"  Rows with missing: {self.df.isnull().any(axis=1).sum()} ({self.df.isnull().any(axis=1).mean()*100:.2f}%)")
        self._log(f"\n{missing_summary.to_string()}")

        # Little's MCAR Test
        self._log(f"\n【2】Little's MCAR Test")
        try:
            chi2, p_value, df_test = self._littles_mcar_test()
            self._log(f"  Chi-square = {chi2:.4f}")
            self._log(f"  df = {df_test}")
            self._log(f"  p-value = {p_value:.4f}")

            if p_value > 0.05:
                self._log(f"  Conclusion: Cannot reject MCAR (p > 0.05)")
                self.missing_diagnostics['mechanism'] = 'MCAR'
            else:
                self._log(f"  Conclusion: Reject MCAR (p < 0.05), data may be MAR or MNAR")
                self.missing_diagnostics['mechanism'] = 'MAR/MNAR'
        except Exception as e:
            self._log(f"  Unable to perform Little's MCAR Test: {e}")

        # Missing pattern analysis
        self._log(f"\n【3】Missing Pattern Analysis")
        missing_patterns = self.df.isnull().astype(int)
        unique_patterns = missing_patterns.value_counts()
        self._log(f"  Number of unique missing patterns: {len(unique_patterns)}")
        self._log(f"  Top 5 missing patterns:")
        for i, (pattern, count) in enumerate(unique_patterns.head(5).items(), 1):
            self._log(f"    Pattern {i}: {count} rows ({count/len(self.df)*100:.2f}%)")

        # Missingness correlation analysis
        self._log(f"\n【4】Missingness Correlation with Observed Variables")
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns

        for col in numeric_cols:
            if self.df[col].isnull().any():
                # Create missing indicator variable
                missing_indicator = self.df[col].isnull().astype(int)

                # Calculate correlation with other numeric variables
                correlations = []
                for other_col in numeric_cols:
                    if other_col != col and not self.df[other_col].isnull().all():
                        try:
                            corr = missing_indicator.corr(self.df[other_col])
                            if abs(corr) > 0.1:  # Only show significant correlations
                                correlations.append((other_col, corr))
                        except:
                            pass

                if correlations:
                    self._log(f"\n  {col} missingness correlates with:")
                    for other_col, corr in sorted(correlations, key=lambda x: abs(x[1]), reverse=True)[:3]:
                        self._log(f"    - {other_col}: {corr:.3f}")

        # Visualization
        if plot:
            self._plot_missing_patterns()

        self.missing_diagnostics['summary'] = missing_summary

        return missing_summary

    def _littles_mcar_test(self):
        """Little's MCAR Test implementation"""
        from scipy.linalg import inv

        data = self.df.copy()
        data = data.dropna(axis=1, how='all')

        missing_patterns = data.isnull().astype(int)
        unique_patterns = missing_patterns.drop_duplicates()

        numeric_data = data.select_dtypes(include=[np.number])

        complete_cases = numeric_data.dropna()
        if len(complete_cases) < 2:
            return np.nan, np.nan, np.nan

        grand_mean = complete_cases.mean().values
        grand_cov = complete_cases.cov().values

        chi2_stat = 0
        total_df = 0

        for idx, pattern in unique_patterns.iterrows():
            mask = (missing_patterns == pattern).all(axis=1)
            pattern_data = numeric_data[mask]

            if len(pattern_data) == 0:
                continue

            # Observed variables (pattern value 0 = observed)
            observed_vars = (pattern[numeric_data.columns].values == 0)

            if observed_vars.sum() == 0:
                continue

            pattern_mean = pattern_data.iloc[:, observed_vars].mean().values
            sub_cov = grand_cov[np.ix_(observed_vars, observed_vars)]

            if np.linalg.matrix_rank(sub_cov) < len(pattern_mean):
                continue

            sub_mean = grand_mean[observed_vars]
            diff = pattern_mean - sub_mean
            n_j = len(pattern_data)

            try:
                chi2_j = n_j * diff @ inv(sub_cov) @ diff
                chi2_stat += chi2_j
                total_df += len(pattern_mean)
            except np.linalg.LinAlgError:
                continue

        # Degrees of freedom: sum of observed-variable counts across patterns,
        # minus the total number of variables (Little, 1988)
        df = total_df - numeric_data.shape[1]

        p_value = 1 - stats.chi2.cdf(chi2_stat, df) if df > 0 else np.nan

        return chi2_stat, p_value, df

    def _plot_missing_patterns(self):
        """Visualize missing patterns"""
        try:
            import missingno as msno
            import matplotlib.pyplot as plt

            fig, axes = plt.subplots(1, 2, figsize=(14, 6))

            # Missing matrix
            msno.matrix(self.df, ax=axes[0], fontsize=10)
            axes[0].set_title(f'{self.name} - Missing Pattern Matrix', fontsize=12)

            # Missing heatmap
            msno.heatmap(self.df, ax=axes[1], fontsize=10)
            axes[1].set_title(f'{self.name} - Missing Correlation', fontsize=12)

            plt.tight_layout()
            plt.show()
        except ImportError:
            self._log("  Tip: Install missingno for missing pattern visualization (pip install missingno)")

    def impute_mice(self, n_imputations=5, max_iter=10, use_rubin=True):
        """
        MICE (Multivariate Imputation by Chained Equations) complete implementation

        Parameters:
        ------
        n_imputations : int
            Number of imputed datasets to generate
        max_iter : int
            Maximum iterations for each imputation
        use_rubin : bool
            Whether to use Rubin's Rules to combine results

        Returns:
        ------
        imputed_datasets : list (if use_rubin=False)
            or pooled DataFrame (if use_rubin=True)
        """
        self._log("\n" + "="*70)
        self._log(f"MICE Multiple Imputation (m={n_imputations}, max_iter={max_iter})")
        self._log("="*70)

        # Only impute numeric columns
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        non_numeric_cols = self.df.select_dtypes(exclude=[np.number]).columns

        # Save non-numeric columns
        df_non_numeric = self.df[non_numeric_cols].copy()

        imputed_datasets = []

        for i in range(n_imputations):
            self._log(f"\n  Imputed dataset {i+1}/{n_imputations}")

            imputer = IterativeImputer(
                estimator=BayesianRidge(),
                max_iter=max_iter,
                random_state=42 + i,
                verbose=0
            )

            X_imputed = imputer.fit_transform(self.df[numeric_cols])

            df_imputed = pd.DataFrame(
                X_imputed,
                columns=numeric_cols,
                index=self.df.index
            )

            # Merge non-numeric columns
            df_imputed = pd.concat([df_imputed, df_non_numeric], axis=1)

            imputed_datasets.append(df_imputed)

            self._log(f"    Convergence status: Complete")

        if use_rubin:
            # Combine using Rubin's Rules
            self._log(f"\n  Combining {n_imputations} datasets using Rubin's Rules")

            # Take average for each numeric variable (simplified Rubin's Rules)
            df_pooled = pd.DataFrame(index=self.df.index)

            for col in numeric_cols:
                values = np.array([df[col].values for df in imputed_datasets])
                df_pooled[col] = values.mean(axis=0)

            # Merge non-numeric columns
            df_pooled = pd.concat([df_pooled, df_non_numeric], axis=1)

            self.df = df_pooled
            self._log(f"  Merged into single dataset")

            return df_pooled
        else:
            self._log(f"  Returning {n_imputations} independent imputed datasets")
            return imputed_datasets

    def handle_outliers(self, method='winsorize', limits=[0.05, 0.05],
                       threshold=3, contamination=0.1):
        """
        Robust outlier handling

        Parameters:
        ------
        method : str
            Handling method: 'winsorize', 'iqr', 'zscore', 'isolation_forest'
        limits : list
            Winsorization upper and lower limit ratios [lower, upper]
        threshold : float
            Z-score method threshold
        contamination : float
            Isolation Forest outlier proportion
        """
        self._log("\n" + "="*70)
        self._log(f"Outlier Handling (method: {method})")
        self._log("="*70)

        numeric_cols = self.df.select_dtypes(include=[np.number]).columns

        outliers_summary = []

        for col in numeric_cols:
            if self.df[col].std() == 0:
                continue

            before_min = self.df[col].min()
            before_max = self.df[col].max()
            before_mean = self.df[col].mean()

            if method == 'winsorize':
                self.df[col] = winsorize(self.df[col], limits=limits)

            elif method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                outliers_mask = (self.df[col] < lower_bound) | (self.df[col] > upper_bound)
                n_outliers = outliers_mask.sum()

                # Delete outliers
                self.df = self.df[~outliers_mask]

                outliers_summary.append({
                    'Variable': col,
                    'Method': 'IQR',
                    'Outlier_Count': n_outliers,
                    'Lower_Bound': lower_bound,
                    'Upper_Bound': upper_bound
                })

            elif method == 'zscore':
                z_scores = np.abs(stats.zscore(self.df[col]))
                outliers_mask = z_scores > threshold
                n_outliers = outliers_mask.sum()

                # Delete outliers
                self.df = self.df[~outliers_mask]

                outliers_summary.append({
                    'Variable': col,
                    'Method': 'Z-score',
                    'Outlier_Count': n_outliers,
                    'Threshold': threshold
                })

            elif method == 'isolation_forest':
                # Use Isolation Forest
                iso_forest = IsolationForest(
                    contamination=contamination,
                    random_state=42
                )

                outliers_mask = iso_forest.fit_predict(
                    self.df[[col]].values.reshape(-1, 1)
                ) == -1

                n_outliers = outliers_mask.sum()

                # Delete outliers
                self.df = self.df[~outliers_mask]

                outliers_summary.append({
                    'Variable': col,
                    'Method': 'Isolation Forest',
                    'Outlier_Count': n_outliers,
                    'Contamination': contamination
                })

            after_min = self.df[col].min()
            after_max = self.df[col].max()
            after_mean = self.df[col].mean()

            if method == 'winsorize':
                self._log(f"\n  {col}:")
                self._log(f"    Before: [{before_min:.2f}, {before_max:.2f}], mean = {before_mean:.2f}")
                self._log(f"    After: [{after_min:.2f}, {after_max:.2f}], mean = {after_mean:.2f}")

        if outliers_summary:
            self._log(f"\n  Outlier handling summary:")
            for item in outliers_summary:
                self._log(f"    {item}")

        self.outlier_diagnostics = {
            'method': method,
            'summary': outliers_summary
        }

    def clean_strings(self):
        """
        String cleaning

        Steps:
        -----
        1. Strip whitespace
        2. Normalize case
        3. Remove special characters
        """
        self._log("\n" + "="*70)
        self._log("String Cleaning")
        self._log("="*70)

        string_cols = self.df.select_dtypes(include=['object']).columns

        for col in string_cols:
            # Remove whitespace
            self.df[col] = self.df[col].str.strip()

            # Uniform lowercase
            self.df[col] = self.df[col].str.lower()

            self._log(f"  {col}: Cleaned (removed whitespace + lowercase)")

        self._log(f"\n  Cleaned {len(string_cols)} string columns")

    def remove_duplicates(self, subset=None, keep='first'):
        """
        Remove duplicate values

        Parameters:
        ------
        subset : list
            Columns to check for duplicates
        keep : str
            Retention strategy: 'first', 'last', False
        """
        self._log("\n" + "="*70)
        self._log("Remove Duplicates")
        self._log("="*70)

        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset, keep=keep)
        after = len(self.df)

        self._log(f"  Removed {before - after} duplicate records ({(before-after)/before*100:.2f}%)")

    def get_clean_data(self):
        """Return cleaned data"""
        return self.df.copy()

    def generate_cleaning_report(self):
        """
        Generate complete cleaning report
        """
        print("\n" + "="*70)
        print(f"{self.name} - Data Cleaning Report")
        print("="*70)

        print(f"\n【1】Data Overview")
        print(f"  Original data: {self.df_original.shape[0]} rows × {self.df_original.shape[1]} columns")
        print(f"  After cleaning: {self.df.shape[0]} rows × {self.df.shape[1]} columns")
        print(f"  Rows removed: {self.df_original.shape[0] - self.df.shape[0]} ({(self.df_original.shape[0] - self.df.shape[0])/self.df_original.shape[0]*100:.2f}%)")

        print(f"\n【2】Missing Value Statistics")
        original_missing = self.df_original.isnull().sum().sum()
        final_missing = self.df.isnull().sum().sum()
        print(f"  Original missing values: {original_missing}")
        print(f"  After cleaning missing values: {final_missing}")
        print(f"  Handled missing values: {original_missing - final_missing}")

        if self.missing_diagnostics:
            print(f"  Missing mechanism: {self.missing_diagnostics.get('mechanism', 'Unknown')}")

        print(f"\n【3】Outlier Handling")
        if self.outlier_diagnostics:
            print(f"  Method: {self.outlier_diagnostics.get('method', 'None')}")
            summary = self.outlier_diagnostics.get('summary', [])
            if summary:
                total_outliers = sum(item['Outlier_Count'] for item in summary)
                print(f"  Total outliers: {total_outliers}")

        print(f"\n【4】Data Quality Metrics")
        print(f"  Completeness: {(1 - self.df.isnull().mean().mean())*100:.2f}%")
        print(f"  Duplicate rate: {self.df.duplicated().mean()*100:.2f}%")

        print(f"\n【5】Cleaning Steps Log")
        print(f"  Executed {len(self.cleaning_log)} steps")

        print("="*70)

        # Return detailed log
        return {
            'original_shape': self.df_original.shape,
            'final_shape': self.df.shape,
            'rows_removed': self.df_original.shape[0] - self.df.shape[0],
            'missing_handled': original_missing - final_missing,
            'cleaning_log': self.cleaning_log,
            'missing_diagnostics': self.missing_diagnostics,
            'outlier_diagnostics': self.outlier_diagnostics
        }

# Usage example
if __name__ == "__main__":
    # Create simulated data
    np.random.seed(42)
    n = 1000

    df_raw = pd.DataFrame({
        'id': range(1, n+1),
        'age': np.random.normal(40, 10, n),
        'education': np.random.normal(12, 3, n),
        'income': np.random.normal(50000, 20000, n),
        'health': np.random.normal(70, 15, n),
        'region': np.random.choice(['east', 'west', 'north', 'south'], n),
        'gender': np.random.choice(['male', 'female'], n)
    })

    # Introduce missing values
    df_raw.loc[df_raw['income'] > 60000, 'income'] = np.where(
        np.random.random(sum(df_raw['income'] > 60000)) < 0.3,
        np.nan,
        df_raw.loc[df_raw['income'] > 60000, 'income']
    )

    for col in ['age', 'education', 'health']:
        mask = np.random.random(n) < 0.1
        df_raw.loc[mask, col] = np.nan

    # Introduce outliers
    df_raw.loc[np.random.choice(n, 10), 'age'] = -5
    df_raw.loc[np.random.choice(n, 10), 'income'] = 999999

    # Use DataCleaner
    cleaner = DataCleaner(df_raw, name='Simulated Dataset')

    # Diagnose missing
    cleaner.diagnose_missing(plot=False)

    # MICE imputation
    cleaner.impute_mice(n_imputations=5, max_iter=10)

    # Handle outliers
    cleaner.handle_outliers(method='winsorize', limits=[0.05, 0.05])

    # Clean strings
    cleaner.clean_strings()

    # Remove duplicates
    cleaner.remove_duplicates()

    # Generate report
    report = cleaner.generate_cleaning_report()

    # Get cleaned data
    df_clean = cleaner.get_clean_data()

Outlier Handling

What are Outliers?

Outlier: Observations significantly different from most data

Two Types:

  1. Data Errors: data entry or measurement mistakes

    • Example: age = -5, income = 999999999
    • Handling: delete or correct (a quick validity-check sketch follows this list)
  2. True Extremes: genuine but rare observations

    • Example: CEO annual salary, billionaire wealth
    • Handling: keep, winsorize, or apply a log transformation
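
For the first type, a minimal rule-based validity check can catch impossible values before any statistical outlier detection (the bounds below are illustrative assumptions, not survey definitions):

python
import pandas as pd

# Domain rules for obviously impossible values (bounds are illustrative)
rules = {
    'age': (0, 120),
    'income': (0, 10_000_000),
}

def flag_invalid(df, rules):
    """Return a boolean DataFrame marking values outside their plausible range."""
    flags = pd.DataFrame(False, index=df.index, columns=list(rules))
    for col, (lo, hi) in rules.items():
        if col in df.columns:
            flags[col] = (df[col] < lo) | (df[col] > hi)
    return flags

df_check = pd.DataFrame({'age': [34, -5, 52], 'income': [48000, 999999999, 61000]})
invalid = flag_invalid(df_check, rules)
print(invalid)
print(f"Rows with at least one impossible value: {invalid.any(axis=1).sum()}")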

Methods to Identify Outliers

Method 1: Box Plot

python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate box plot
plt.figure(figsize=(8, 6))
sns.boxplot(y=df['income'])
plt.title('Income Distribution (Box Plot)')
plt.show()

# Multivariate comparison
plt.figure(figsize=(12, 6))
df[['age', 'income', 'education']].boxplot()
plt.ylabel('Value')
plt.title('Multivariate Box Plot')
plt.show()

Method 2: IQR Method (Interquartile Range)

python
def detect_outliers_iqr(df, col):
    """
    Detect outliers using IQR method
    """
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # Define outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    print(f"Column: {col}")
    print(f"  Q1: {Q1:.2f}")
    print(f"  Q3: {Q3:.2f}")
    print(f"  IQR: {IQR:.2f}")
    print(f"  Lower bound: {lower_bound:.2f}")
    print(f"  Upper bound: {upper_bound:.2f}")
    print(f"  Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.2f}%)")

    return outliers

# Usage example
income_outliers = detect_outliers_iqr(df, 'income')

Method 3: Z-score Method

For a normal distribution, about 99.7% of the data lies within $\mu \pm 3\sigma$:

python
from scipy import stats

def detect_outliers_zscore(df, col, threshold=3):
    """
    Detect outliers using Z-score method
    threshold: Usually use 3 (99.7% of data within ±3σ)
    """
    valid = df[col].dropna()
    z_scores = np.abs(stats.zscore(valid))
    outliers = df.loc[valid.index[z_scores > threshold]]

    print(f"Column: {col}")
    print(f"  Mean: {df[col].mean():.2f}")
    print(f"  Std Dev: {df[col].std():.2f}")
    print(f"  Threshold: ±{threshold}σ")
    print(f"  Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.2f}%)")

    return outliers

# Usage example
age_outliers = detect_outliers_zscore(df, 'age', threshold=3)

Method 4: Isolation Forest (Machine Learning Method)

Isolation Forest isolates outliers with an ensemble of random trees; anomalous points can be isolated with fewer random splits:

python
from sklearn.ensemble import IsolationForest

def detect_outliers_isolation_forest(df, cols, contamination=0.1):
    """
    Detect multivariate outliers using Isolation Forest

    Parameters:
    ------
    contamination : float
        Expected proportion of outliers
    """
    iso_forest = IsolationForest(
        contamination=contamination,
        random_state=42
    )

    # Predict (-1 indicates outlier)
    outliers_mask = iso_forest.fit_predict(df[cols]) == -1

    print(f"Isolation Forest:")
    print(f"  Number of outliers: {outliers_mask.sum()} ({outliers_mask.mean()*100:.2f}%)")

    return df[outliers_mask]

# Usage example
outliers_multi = detect_outliers_isolation_forest(
    df,
    ['age', 'income', 'education'],
    contamination=0.05
)

Outlier Handling Strategies

Strategy 1: Deletion

python
# Delete outliers
df_clean = df[~df.index.isin(outliers.index)]

# Delete records with income < 0 or income > 1,000,000
df_clean = df[(df['income'] > 0) & (df['income'] < 1000000)]

When to delete:

  • Obviously erroneous values (e.g., age = -5)
  • Sample size is large enough that deletion does not affect the analysis

When not to delete:

  • The outliers are genuine (e.g., billionaires); prefer winsorizing or a log transformation

Strategy 2: Winsorize (Winsorization)

python
from scipy.stats.mstats import winsorize

# Replace extreme values with 5th and 95th percentiles
df['income_winsorized'] = winsorize(df['income'], limits=[0.05, 0.05])

# Or manual implementation
lower = df['income'].quantile(0.05)
upper = df['income'].quantile(0.95)
df['income_winsorized'] = df['income'].clip(lower, upper)

print(f"Original range: {df['income'].min():.0f} - {df['income'].max():.0f}")
print(f"After Winsorize: {df['income_winsorized'].min():.0f} - {df['income_winsorized'].max():.0f}")

Advantages:

  • Preserve sample size
  • Reduce extreme value impact
  • Maintain data distribution shape

Strategy 3: Log Transformation

For right-skewed distributions (e.g., income, wealth):

$$x' = \log(x + c)$$

Where $c$ is a constant (usually $c = 1$), preventing $\log(0)$.

python
# Log transformation can compress right-skewed distributions
df['log_income'] = np.log(df['income'] + 1)

# Visualization comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original distribution
axes[0].hist(df['income'], bins=50, edgecolor='black')
axes[0].set_title('Original Income Distribution (Right-skewed)')
axes[0].set_xlabel('Income')

# After log transformation
axes[1].hist(df['log_income'], bins=50, edgecolor='black', color='green')
axes[1].set_title('Log Income Distribution (Closer to Normal)')
axes[1].set_xlabel('log(Income)')

plt.tight_layout()
plt.show()

Strategy 4: Cap (Capping)

python
# Set upper and lower limits
df['income_capped'] = df['income'].clip(lower=0, upper=500000)

# Or use percentiles
lower = df['income'].quantile(0.01)
upper = df['income'].quantile(0.99)
df['income_capped'] = df['income'].clip(lower, upper)

Duplicate Handling

Detecting Duplicates

python
# Detect completely duplicate rows
duplicates = df.duplicated()
print(f"Duplicate rows: {duplicates.sum()} ({100*duplicates.mean():.2f}%)")

# View duplicate rows
print(df[duplicates])

# Detect duplicates in specific columns
duplicates_id = df.duplicated(subset=['id'])
print(f"ID duplicates: {duplicates_id.sum()}")

# View duplicate IDs
duplicate_ids = df[duplicates_id]['id'].unique()
print(f"Duplicate IDs: {duplicate_ids}")

Removing Duplicates

python
# Remove completely duplicate rows (keep first occurrence)
df_clean = df.drop_duplicates()

# Keep last occurrence
df_clean = df.drop_duplicates(keep='last')

# Remove all duplicates (keep none)
df_clean = df.drop_duplicates(keep=False)

# Remove duplicates based on specific column
df_clean = df.drop_duplicates(subset=['id'], keep='first')

# Remove duplicates based on multiple column combination
df_clean = df.drop_duplicates(subset=['person_id', 'year'], keep='first')

Handling "Fuzzy Duplicates"

python
# Example: Names slightly different but actually same person
from fuzzywuzzy import fuzz

def find_fuzzy_duplicates(df, col, threshold=90):
    """
    Find fuzzy matching duplicates
    """
    values = df[col].unique()
    duplicates = []

    for i, val1 in enumerate(values):
        for val2 in values[i+1:]:
            similarity = fuzz.ratio(str(val1), str(val2))
            if similarity >= threshold:
                duplicates.append((val1, val2, similarity))

    return pd.DataFrame(duplicates, columns=['Value1', 'Value2', 'Similarity'])

# Example: Name matching
fuzzy_dups = find_fuzzy_duplicates(df, 'name', threshold=90)
print(fuzzy_dups)

Data Types and Format Unification

Data Type Conversion

python
# 1. Convert to numeric type
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # values that can't be parsed become NaN

# 2. Convert to string
df['id'] = df['id'].astype(str)

# 3. Convert to category type (save memory)
df['region'] = df['region'].astype('category')

# 4. Convert to date type
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')

# 5. Convert to boolean type
df['employed'] = df['employed'].astype(bool)

String Cleaning

python
# Remove whitespace
df['name'] = df['name'].str.strip()  # Leading and trailing whitespace
df['name'] = df['name'].str.lstrip()  # Left whitespace
df['name'] = df['name'].str.rstrip()  # Right whitespace

# Unify case
df['gender'] = df['gender'].str.lower()  # All lowercase
df['gender'] = df['gender'].str.upper()  # All uppercase
df['name'] = df['name'].str.title()     # Capitalize first letter

# Replace characters
df['phone'] = df['phone'].str.replace('-', '', regex=False)   # Remove hyphens
df['income'] = df['income'].str.replace(',', '', regex=False)  # Remove thousands separator
df['income'] = df['income'].str.replace('$', '', regex=False)  # Remove dollar sign

# Extract numbers
df['income_num'] = df['income'].str.extract(r'(\d+)').astype(float)

Date Handling

python
# Parse dates
df['date'] = pd.to_datetime(df['date'])

# Handle multiple date formats
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()

# Calculate date differences
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days

Heckman Sample Selection Model

Theoretical Foundation

Problem: When sample selection is non-random (MNAR), directly deleting missing values leads to selection bias.

Heckman Two-Step Model (Heckman, 1979):

Step 1: Selection Equation (Probit)

$$s_i^* = Z_i \gamma + u_i, \qquad s_i = \mathbf{1}(s_i^* > 0)$$

Where:

  • $s_i = 1$ indicates that $y_i$ is observed

Step 2: Outcome Equation (OLS + IMR)

$$y_i = X_i \beta + \varepsilon_i, \quad \text{observed only when } s_i = 1$$

Key: Assume $(u_i, \varepsilon_i)$ are jointly normal:

$$\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \sigma_\varepsilon \\ \rho \sigma_\varepsilon & \sigma_\varepsilon^2 \end{pmatrix} \right)$$

Inverse Mills Ratio

$$\lambda(Z_i \hat{\gamma}) = \frac{\phi(Z_i \hat{\gamma})}{\Phi(Z_i \hat{\gamma})}$$

Where:

  • $\phi$ is the standard normal PDF
  • $\Phi$ is the standard normal CDF

Corrected Regression

$$E\left[y_i \mid X_i, s_i = 1\right] = X_i \beta + \rho \sigma_\varepsilon\, \lambda(Z_i \hat{\gamma})$$
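
To build intuition for the correction term, the short sketch below (illustrative values only) evaluates the inverse Mills ratio over a range of probit indices; λ is large when selection is unlikely and shrinks toward zero as the selection probability approaches one.

python
import numpy as np
from scipy.stats import norm

z = np.linspace(-2, 2, 9)         # probit index Z_i * gamma_hat
imr = norm.pdf(z) / norm.cdf(z)   # inverse Mills ratio lambda(z)
for zi, li in zip(z, imr):
    print(f"z = {zi:+.1f}   P(selected) = {norm.cdf(zi):.2f}   lambda = {li:.3f}")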

Python Implementation

python
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression, LinearRegression
import statsmodels.api as sm

class HeckmanCorrection:
    """
    Heckman Two-Step Selection Model

    Used to handle sample selection bias (MNAR)

    Example:
    -----
    >>> heckman = HeckmanCorrection()
    >>> heckman.fit(
    ...     df=df,
    ...     selection_vars=['age', 'education', 'married', 'children'],
    ...     outcome_vars=['age', 'education'],
    ...     selection_indicator='working'
    ... )
    """

    def __init__(self):
        self.probit_model = None
        self.ols_model = None
        self.lambda_coef = None
        self.rho = None

    def fit(self, df, selection_vars, outcome_vars, selection_indicator):
        """
        Fit Heckman model

        Parameters:
        ------
        df : pd.DataFrame
            Complete DataFrame
        selection_vars : list
            Independent variables for selection equation
        outcome_vars : list
            Independent variables for outcome equation
        selection_indicator : str
            Selection indicator variable name (1=observed, 0=not observed).
            The dependent variable must be stored in a column named 'outcome'.
        """
        print("="*70)
        print("Heckman Two-Step Selection Model")
        print("="*70)

        # Step 1: Probit for selection
        print("\n【Step 1】Selection Equation (Probit)")

        X_selection = df[selection_vars]
        y_selection = df[selection_indicator]

        # Use statsmodels Probit
        X_selection_const = sm.add_constant(X_selection)
        probit = sm.Probit(y_selection, X_selection_const)
        self.probit_model = probit.fit(disp=False)

        print(self.probit_model.summary())

        # Calculate Inverse Mills Ratio from the probit linear index Z*gamma_hat
        # (predict() would return probabilities, not the linear index)
        linear_pred = np.dot(X_selection_const, self.probit_model.params)

        # Only for selected sample
        selected_mask = (y_selection == 1).values

        # IMR = phi(z) / Phi(z)
        imr = norm.pdf(linear_pred[selected_mask]) / norm.cdf(linear_pred[selected_mask])

        # Step 2: OLS with IMR
        print("\n【Step 2】Outcome Equation (OLS + IMR)")

        df_selected = df[selected_mask].copy()
        df_selected['imr'] = imr

        # Outcome equation
        X_outcome = df_selected[outcome_vars + ['imr']]
        y_outcome = df_selected['outcome']

        X_outcome_const = sm.add_constant(X_outcome)
        ols = sm.OLS(y_outcome, X_outcome_const)
        self.ols_model = ols.fit()

        print(self.ols_model.summary())

        # Extract IMR coefficient
        self.lambda_coef = self.ols_model.params['imr']

        print(f"\n【Results】")
        print(f"  Inverse Mills Ratio coefficient: {self.lambda_coef:.4f}")

        # Judge selection bias by the significance of the IMR coefficient
        imr_pvalue = self.ols_model.pvalues['imr']
        if imr_pvalue < 0.05:
            print(f"  Warning: significant sample selection bias (IMR p-value = {imr_pvalue:.4f})")
        else:
            print(f"  Sample selection bias appears minor (IMR p-value = {imr_pvalue:.4f})")

        return self

    def predict(self, X_new, with_correction=True):
        """
        Predict outcomes. X_new must contain the outcome-equation variables plus
        an 'imr' column, in the same order used for fitting; set
        with_correction=False to zero out the selection-correction (IMR) term.
        """
        X = X_new.copy()
        if not with_correction:
            # Drop the selection-correction term
            X['imr'] = 0.0
        return self.ols_model.predict(sm.add_constant(X))

# Usage example
if __name__ == "__main__":
    # Simulate data: Female wage data (married women more likely not to work)
    np.random.seed(42)
    n = 2000

    df = pd.DataFrame({
        'age': np.random.normal(35, 8, n),
        'education': np.random.normal(13, 2, n),
        'married': np.random.binomial(1, 0.5, n),
        'children': np.random.poisson(1, n)
    })

    # Selection equation: Work probability
    # Married women with children more likely not to work
    work_prob = 1 / (1 + np.exp(
        -(0.5 - 0.3 * df['married'] - 0.2 * df['children'] + 0.05 * df['education'])
    ))
    df['working'] = np.random.binomial(1, work_prob)

    # Outcome equation: Wage (only observed for working people)
    df['wage'] = np.nan
    working_mask = df['working'] == 1
    df.loc[working_mask, 'wage'] = (
        5000 +
        200 * df.loc[working_mask, 'age'] +
        1000 * df.loc[working_mask, 'education'] +
        np.random.normal(0, 2000, working_mask.sum())
    )

    print(f"Working population: {df['working'].sum()} / {n} ({df['working'].mean()*100:.1f}%)")
    print(f"Observed wages: {df['wage'].notna().sum()}")

    # Heckman model
    df_heckman = df.copy()
    df_heckman['outcome'] = df_heckman['wage']

    heckman = HeckmanCorrection()
    heckman.fit(
        df=df_heckman,
        selection_vars=['age', 'education', 'married', 'children'],
        outcome_vars=['age', 'education'],
        selection_indicator='working'
    )

Real Case: CFPS 2018 Data Cleaning

Case Background

Data: China Family Panel Studies (CFPS) 2018

Research Question: Impact of education on income

Challenges:

  • 30% income missing rate
  • High-income individuals more likely to refuse answering (MAR/MNAR)
  • Extreme values exist (very high income)

Complete Cleaning Workflow

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate CFPS 2018 data
np.random.seed(42)
n = 5000

# Create simulated data
cfps = pd.DataFrame({
    'pid': range(1, n+1),
    'age': np.random.normal(42, 12, n),
    'education': np.random.choice([6, 9, 12, 16], n, p=[0.3, 0.2, 0.3, 0.2]),
    'gender': np.random.choice(['Male', 'Female'], n),
    'urban': np.random.choice([0, 1], n, p=[0.4, 0.6]),
    'province': np.random.choice([
        'Beijing', 'Shanghai', 'Guangdong', 'Zhejiang', 'Jiangsu', 'Shandong', 'Henan', 'Sichuan', 'Hubei', 'Hunan'
    ], n),
    'health': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.4, 0.3, 0.1])
})

# Generate income (based on education, age, etc.)
cfps['income'] = (
    5000 +
    2000 * cfps['education'] +
    300 * cfps['age'] +
    5000 * cfps['urban'] +
    10000 * (cfps['gender'] == 'Male').astype(int) +
    np.random.lognormal(8, 1.5, n)
)

# Introduce MAR missingness (high income, high education more likely to be missing)
missing_prob = 1 / (1 + np.exp(
    -(cfps['income'] / 50000 - 2 + 0.1 * cfps['education'])
))
missing_mask = np.random.random(n) < missing_prob
cfps.loc[missing_mask, 'income'] = np.nan

print(f"Income missing rate: {cfps['income'].isnull().mean()*100:.2f}%")

# Introduce other missingness
for col in ['age', 'education', 'health']:
    mask = np.random.random(n) < 0.05
    cfps.loc[mask, col] = np.nan

# Introduce outliers
cfps.loc[np.random.choice(n, 20, replace=False), 'income'] = np.random.uniform(
    500000, 1000000, 20
)
cfps.loc[np.random.choice(n, 10, replace=False), 'age'] = -99

# Introduce duplicates
cfps = pd.concat([cfps, cfps.sample(50)], ignore_index=True)

print(f"\nRaw data: {cfps.shape}")
print(f"\nFirst 5 rows:")
print(cfps.head())

# Use DataCleaner
print("\n" + "="*70)
print("Start Data Cleaning")
print("="*70)

cleaner = DataCleaner(cfps, name='CFPS 2018')

# 1. Diagnose missing
cleaner.diagnose_missing(plot=False)

# 2. Remove duplicates
cleaner.remove_duplicates(subset=['pid'], keep='first')

# 3. Handle obvious error values
print("\n" + "="*70)
print("Handle Error Values")
print("="*70)

# Age range: 16-100
before = len(cleaner.df)
cleaner.df = cleaner.df[(cleaner.df['age'] >= 16) & (cleaner.df['age'] <= 100)]
print(f"  Removed age outliers: {before - len(cleaner.df)} rows")

# Income > 0
before = len(cleaner.df)
cleaner.df = cleaner.df[(cleaner.df['income'] > 0) | (cleaner.df['income'].isnull())]
print(f"  Removed negative income: {before - len(cleaner.df)} rows")

# 4. MICE imputation
cleaner.impute_mice(n_imputations=5, max_iter=10, use_rubin=True)

# 5. Handle outliers
cleaner.handle_outliers(method='winsorize', limits=[0.01, 0.01])

# 6. Clean strings
cleaner.clean_strings()

# 7. Generate report
report = cleaner.generate_cleaning_report()

# Get cleaned data
cfps_clean = cleaner.get_clean_data()

print(f"\nCleaned data: {cfps_clean.shape}")
print(f"\nDescriptive statistics:")
print(cfps_clean[['age', 'education', 'income']].describe())

# 8. Compare complete case deletion vs MICE imputation
print("\n" + "="*70)
print("Complete Case Deletion vs MICE Imputation Comparison")
print("="*70)

# Complete case
cfps_complete = cfps.dropna(subset=['income'])

# Estimate return to education
from sklearn.linear_model import LinearRegression

# Complete case estimate
X_complete = cfps_complete[['education', 'age', 'urban']]
y_complete = cfps_complete['income']
model_complete = LinearRegression().fit(X_complete, y_complete)
beta_complete = model_complete.coef_[0]

# MICE imputation estimate
X_mice = cfps_clean[['education', 'age', 'urban']]
y_mice = cfps_clean['income']
model_mice = LinearRegression().fit(X_mice, y_mice)
beta_mice = model_mice.coef_[0]

print(f"\nReturn to education estimate:")
print(f"  Complete case deletion: {beta_complete:.2f}")
print(f"  MICE imputation: {beta_mice:.2f}")
print(f"  Difference: {beta_mice - beta_complete:.2f} ({(beta_mice/beta_complete - 1)*100:.2f}%)")

# 9. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Original vs cleaned income distribution
axes[0, 0].hist(cfps['income'].dropna(), bins=50, alpha=0.5, label='Original', edgecolor='black')
axes[0, 0].hist(cfps_clean['income'], bins=50, alpha=0.5, label='Cleaned', edgecolor='black')
axes[0, 0].set_title('Income Distribution Comparison')
axes[0, 0].set_xlabel('Income')
axes[0, 0].legend()

# Age distribution
axes[0, 1].hist(cfps_clean['age'], bins=30, edgecolor='black', color='green')
axes[0, 1].set_title('Age Distribution (Cleaned)')
axes[0, 1].set_xlabel('Age')

# Education distribution
cfps_clean['education'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 0], color='orange')
axes[1, 0].set_title('Education Distribution (Cleaned)')
axes[1, 0].set_xlabel('Years of Education')
axes[1, 0].set_ylabel('Frequency')

# Income vs education
axes[1, 1].scatter(cfps_clean['education'], cfps_clean['income'], alpha=0.3, s=10)
axes[1, 1].set_title('Education and Income Relationship')
axes[1, 1].set_xlabel('Years of Education')
axes[1, 1].set_ylabel('Income')

plt.tight_layout()
plt.show()

print("\n Data cleaning complete!")

Sensitivity Analysis

python
print("\n" + "="*70)
print("Sensitivity Analysis: Missing Mechanism Assumptions")
print("="*70)

# Assumption 1: MCAR (deletion)
cfps_mcar = cfps.dropna()
X_mcar = cfps_mcar[['education', 'age', 'urban']]
y_mcar = cfps_mcar['income']
model_mcar = LinearRegression().fit(X_mcar, y_mcar)
beta_mcar = model_mcar.coef_[0]

# Assumption 2: MAR (MICE)
beta_mar = beta_mice

# Assumption 3: MNAR (Heckman)
# Simplified: Assume income missingness related to true income
# (In practice, more complex modeling needed)

print(f"\nReturn to education estimate (sensitivity analysis):")
print(f"  MCAR assumption (deletion): {beta_mcar:.2f}")
print(f"  MAR assumption (MICE): {beta_mar:.2f}")
print(f"  Difference: {beta_mar - beta_mcar:.2f} ({abs(beta_mar - beta_mcar) / beta_mcar * 100:.2f}%)")

print(f"\nConclusion:")
if abs(beta_mar - beta_mcar) / beta_mcar < 0.05:
    print(f"   Results not sensitive to missing mechanism assumption (difference < 5%)")
else:
    print(f"  ️  Results sensitive to missing mechanism assumption (difference ≥ 5%)")
    print(f"  Suggestion: Further investigate missing mechanism, may need Heckman correction")

Practice Problems

⭐⭐ Basic: Simple Mean Imputation

Task: Given a dataset with missing values, implement simple mean imputation.

python
# Practice data
np.random.seed(42)
df_practice = pd.DataFrame({
    'x1': np.random.normal(50, 10, 100),
    'x2': np.random.normal(100, 20, 100),
    'x3': np.random.normal(75, 15, 100)
})

# Introduce missingness
for col in df_practice.columns:
    mask = np.random.random(100) < 0.2
    df_practice.loc[mask, col] = np.nan

print("Missing rates:")
print(df_practice.isnull().mean())

# TODO: Implement mean imputation
# df_imputed = ...

# Verification
# assert df_imputed.isnull().sum().sum() == 0

⭐⭐⭐ Intermediate: IQR Outlier Detection

Task: Implement IQR-based outlier detection function and visualize.

python
# TODO: Implement function
def detect_and_visualize_outliers(df, col):
    """
    Detect and visualize outliers

    Returns:
    ------
    outliers : pd.DataFrame
        Outlier data
    """
    # 1. IQR calculation
    # 2. Identify outliers
    # 3. Visualize (box plot + scatter plot)
    pass

# Test
# outliers = detect_and_visualize_outliers(df, 'income')

⭐⭐⭐⭐ Advanced: MICE + Rubin's Rules

Task: Implement complete MICE imputation and combine results using Rubin's Rules.

python
# TODO: Implement
# 1. MICE generates m=5 imputed datasets
# 2. Estimate linear regression coefficients for each dataset
# 3. Combine using Rubin's Rules
# 4. Calculate 95% confidence interval
# 5. Compare with complete case analysis

⭐⭐⭐⭐⭐ Nobel Level: Heckman + Sensitivity Analysis

Task: Implement Heckman two-step method and conduct sensitivity analysis.

python
# TODO: Implement
# 1. Simulate female wage data (married women more likely not to work)
# 2. Heckman two-step estimation
# 3. Compare with regular OLS (delete missing)
# 4. Sensitivity analysis: Change selection equation variables
# 5. Explain economic meaning of Inverse Mills Ratio

Summary

Core Data Cleaning Functions

| Task | Function | Example |
|------|----------|---------|
| Missing values | dropna() | Delete missing |
| | fillna() | Fill missing |
| | IterativeImputer() | MICE imputation |
| Outliers | clip() | Cap |
| | winsorize() | Winsorize |
| | IsolationForest() | ML detection |
| Duplicates | duplicated() | Detect duplicates |
| | drop_duplicates() | Remove duplicates |
| Type conversion | astype() | Type conversion |
| | pd.to_numeric() | Convert to numeric |
| | pd.to_datetime() | Convert to date |
| Strings | str.strip() | Remove whitespace |
| | str.lower() | Convert to lowercase |
| | str.replace() | Replace characters |

Missing Value Handling Decision Tree

Start

├─ Missing rate < 5%?
│  ├─ Yes → Direct deletion (dropna)
│  └─ No → Continue

├─ Little's MCAR Test p > 0.05?
│  ├─ Yes → MCAR, can delete or simple imputation
│  └─ No → MAR/MNAR, need advanced methods

├─ MAR?
│  ├─ Yes → MICE multiple imputation
│  └─ No → MNAR

└─ MNAR?
   ├─ Sample selection problem → Heckman two-step
   ├─ Missingness itself informative → Missing indicator variable
   └─ Uncertain → Sensitivity analysis
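
A hedged helper that codifies the decision tree above as a starting-point suggestion (the thresholds mirror the tree; the function name and signature are illustrative, and it is no substitute for inspecting the data):

python
def suggest_missing_strategy(missing_rate, mcar_p_value=None,
                             selection_problem=False, missingness_informative=False):
    """
    Suggest a starting strategy from the decision tree above.
    missing_rate: fraction of missing values (0-1)
    mcar_p_value: p-value from Little's MCAR test, if available
    """
    if missing_rate < 0.05:
        return "Direct deletion (dropna)"
    if mcar_p_value is not None and mcar_p_value > 0.05:
        return "MCAR plausible: deletion or simple imputation"
    if selection_problem:
        return "MNAR with sample selection: Heckman two-step"
    if missingness_informative:
        return "MNAR: missing indicator variable + sensitivity analysis"
    return "MAR assumed: MICE multiple imputation (+ sensitivity analysis)"

print(suggest_missing_strategy(0.30, mcar_p_value=0.001))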

Data Cleaning Best Practices

  1. Preserve original data: Use df.copy() to create copy
  2. Record cleaning steps: Add comments and log for each step
  3. Explore before cleaning: Understand data before processing
  4. Reproducible workflow: Write functions for reusability (see the sketch after this list)
  5. Verify cleaning results: Recheck data quality after cleaning
  6. Sensitivity analysis: Compare impact of different cleaning strategies
  7. Documentation: Generate cleaning reports
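
A minimal sketch of such a reproducible, logged pipeline (the function and step names are illustrative; a real project would add the diagnostics covered earlier in this section):

python
import pandas as pd

def clean_pipeline(df: pd.DataFrame):
    """Run cleaning steps in a fixed, documented order and keep a log."""
    log = []
    out = df.copy()                  # 1. never modify the original data

    before = len(out)
    out = out.drop_duplicates()      # 2. remove duplicates first
    log.append(f"drop_duplicates: removed {before - len(out)} rows")

    num_cols = out.select_dtypes(include='number').columns
    n_missing = int(out[num_cols].isnull().sum().sum())
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())  # 3. simple imputation
    log.append(f"median imputation: filled {n_missing} cells")

    return out, log

df_raw = pd.DataFrame({'x': [1.0, None, 3.0, 3.0], 'y': [2.0, 5.0, None, None]})
df_clean, log = clean_pipeline(df_raw)
print("\n".join(log))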

Theoretical Key Points

  1. Missing mechanisms:

    • MCAR: Deletion doesn't cause bias
    • MAR: Can be modeled with observed variables
    • MNAR: Needs additional assumptions or sensitivity analysis
  2. MICE:

    • Iterative imputation considering inter-variable relationships
    • Rubin's Rules combine multiple imputed datasets
    • Standard errors correctly reflect uncertainty
  3. Heckman model:

    • Two-step method corrects sample selection bias
    • Inverse Mills Ratio quantifies selection effect
    • Needs exclusion restriction

Next Steps

Next section we will learn Variable Construction, including dummy variables, interactions, lag terms, etc.

Keep going!


References

Classic Textbooks

  • Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd Edition). Wiley.
  • Heckman, J. J. (1979). "Sample Selection Bias as a Specification Error." Econometrica, 47(1), 153-161.
  • van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd Edition). Chapman and Hall/CRC.

Practical Tools

  • missingno: Missing value visualization
  • scikit-learn.impute: Imputation algorithms
  • statsmodels: Statistical modeling (including Heckman)
  • fancyimpute: Advanced imputation methods
