
3.3 Data Cleaning

"Garbage in, garbage out.""垃圾进,垃圾出。"— George Fuechsel, IBM Programmer (IBM程序员)



Learning Objectives

  • Scientifically handle missing values (understand MCAR/MAR/MNAR)
  • Master MICE multiple imputation theory and practice
  • Identify and handle outliers
  • Remove duplicate values
  • Unify data types and formats
  • Establish reproducible cleaning workflows
  • Understand Heckman sample selection model

Why is Data Cleaning So Important?

Real Case: Excel Error Leading to Policy Failure

2010, Reinhart-Rogoff Debt and Growth Study

  • Harvard economists published a paper claiming: government debt above 90% of GDP → economic growth declines significantly
  • Multiple governments implemented austerity policies based on this finding
  • 2013 discovery: an Excel formula error (data from 5 countries omitted from the analysis)
  • Consequence: the conclusion was overturned and the policy recommendations invalidated

Lesson: Data cleaning errors lead to wrong conclusions and policies!


Statistical Theory of Missing Data

Mathematical Definition of Missing Mechanisms

Let $Y$ be the complete data matrix, with $Y = (Y_{\text{obs}}, Y_{\text{mis}})$, and let $R$ be the missing indicator matrix ($r_{ij} = 1$ if $y_{ij}$ is missing, $0$ otherwise).

The missing mechanism is characterized by the conditional distribution $P(R \mid Y, \phi)$, where $\phi$ is the missing mechanism parameter.

1. MCAR (Missing Completely At Random)

Missingness is independent of all variables (observed and unobserved):

$$P(R \mid Y, \phi) = P(R \mid \phi)$$

Example: In surveys, a respondent is randomly interrupted (e.g., the phone call drops), producing missingness.

2. MAR (Missing At Random)

Given the observed data, missingness is independent of the unobserved data:

$$P(R \mid Y, \phi) = P(R \mid Y_{\text{obs}}, \phi)$$

Example: High-income individuals are more likely to refuse to answer income questions, but given observed variables such as education and occupation, missingness is independent of true income.

3. MNAR (Missing Not At Random)

Missingness depends on the unobserved data even after conditioning on the observed data:

$$P(R \mid Y, \phi) \neq P(R \mid Y_{\text{obs}}, \phi)$$

Example: Drug users are more likely to refuse questions about drug history; missingness is related to the answer itself.
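
To make the three mechanisms concrete, here is a minimal sketch (simulated data; the variable names and cutoffs are illustrative assumptions, not from any survey) that generates MCAR, MAR, and MNAR missingness in the same toy income variable:

python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000
df_demo = pd.DataFrame({
    'education': np.random.normal(12, 3, n),
    'income': np.random.normal(50000, 15000, n)
})

# MCAR: missingness is pure chance, unrelated to any variable
mcar = df_demo.copy()
mcar.loc[np.random.random(n) < 0.2, 'income'] = np.nan

# MAR: missingness depends on an observed variable (education), not on income itself
mar = df_demo.copy()
p_miss = 1 / (1 + np.exp(-(mar['education'] - 12)))  # more educated -> more refusals
mar.loc[np.random.random(n) < 0.4 * p_miss, 'income'] = np.nan

# MNAR: missingness depends on the unobserved value of income itself
mnar = df_demo.copy()
mnar.loc[(mnar['income'] > 60000) & (np.random.random(n) < 0.5), 'income'] = np.nan

for name, d in [('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    print(f"{name}: missing rate = {d['income'].isnull().mean():.2%}, "
          f"observed mean = {d['income'].mean():.0f}")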

Little's MCAR Test

Statistical test of the MCAR hypothesis:

$$d^2 = \sum_{j=1}^{J} n_j \left(\bar{y}_j - \hat{\mu}_j\right)^\top \hat{\Sigma}_j^{-1} \left(\bar{y}_j - \hat{\mu}_j\right)$$

Where:

  • $J$ is the total number of missing patterns
  • $n_j$ is the sample size for the $j$-th missing pattern
  • $\bar{y}_j$ is the mean vector of the observed variables in the $j$-th missing pattern
  • $\hat{\mu}_j$ is the overall mean vector (restricted to those variables)
  • $\hat{\Sigma}_j$ is the pooled covariance matrix (restricted to those variables)

Under the MCAR hypothesis, $d^2$ approximately follows a chi-square distribution with $\sum_j p_j - p$ degrees of freedom, where $p_j$ is the number of observed variables in pattern $j$ and $p$ is the total number of variables.


Missing Value Handling

Three Types of Missing Values

| Type | Abbreviation | Definition | Handling Strategy |
|------|--------------|------------|-------------------|
| Missing Completely At Random | MCAR | Missingness unrelated to any variable | Delete or simple imputation |
| Missing At Random | MAR | Missingness related to observed variables | Conditional imputation or interpolation |
| Missing Not At Random | MNAR | Missingness related to unobserved variables | Needs modeling or sensitivity analysis |

Detecting Missing Values

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Basic detection
df.isnull().sum()        # Missing count per column
df.isnull().mean() * 100 # Missing rate
df.isnull().any(axis=1).sum()  # Rows with missing values

# 2. Visualize missing patterns
import missingno as msno

# Missing value matrix plot
msno.matrix(df, figsize=(10, 6))
plt.title('Missing Value Patterns')

# Missing value correlation heatmap
msno.heatmap(df, figsize=(10, 6))

# Missing value dendrogram (shows which variables are missing together)
msno.dendrogram(df)

Little's MCAR Test Implementation

python
from scipy import stats
from scipy.linalg import inv

def littles_mcar_test(data):
    """
    Little's MCAR Test

    H0: Data is MCAR
    H1: Data is not MCAR

    Parameters:
    -----------
    data : pd.DataFrame
        DataFrame containing missing values

    Returns:
    --------
    chi2_stat : float
        Chi-square statistic
    p_value : float
        p-value
    df : int
        Degrees of freedom
    """
    # Remove columns that are all missing
    data = data.dropna(axis=1, how='all')

    # Identify missing patterns
    missing_patterns = data.isnull().astype(int)
    unique_patterns = missing_patterns.drop_duplicates()

    # Keep only numeric columns
    numeric_data = data.select_dtypes(include=[np.number])

    # Calculate overall mean and covariance (complete cases)
    complete_cases = numeric_data.dropna()
    if len(complete_cases) < 2:
        return np.nan, np.nan, np.nan

    grand_mean = complete_cases.mean().values
    grand_cov = complete_cases.cov().values

    # Calculate chi-square statistic
    chi2_stat = 0
    total_df = 0

    for idx, pattern in unique_patterns.iterrows():
        # Get data for this pattern
        mask = (missing_patterns == pattern).all(axis=1)
        pattern_data = numeric_data[mask]

        if len(pattern_data) == 0:
            continue

        # Observed variables (pattern value 0 = observed)
        observed_vars = (pattern[numeric_data.columns].values == 0)

        if observed_vars.sum() == 0:
            continue

        # Mean for this pattern
        pattern_mean = pattern_data.iloc[:, observed_vars].mean().values

        # Covariance matrix for this pattern
        sub_cov = grand_cov[np.ix_(observed_vars, observed_vars)]

        if np.linalg.matrix_rank(sub_cov) < len(pattern_mean):
            continue

        # Overall mean (observed variables only)
        sub_mean = grand_mean[observed_vars]

        # Calculate Mahalanobis distance
        diff = pattern_mean - sub_mean
        n_j = len(pattern_data)

        try:
            chi2_j = n_j * diff @ inv(sub_cov) @ diff
            chi2_stat += chi2_j
            total_df += len(pattern_mean)
        except np.linalg.LinAlgError:
            continue

    # Degrees of freedom: sum of observed-variable counts across patterns,
    # minus the total number of variables (Little, 1988)
    df = total_df - numeric_data.shape[1]

    # p-value
    p_value = 1 - stats.chi2.cdf(chi2_stat, df)

    return chi2_stat, p_value, df

# Usage example
chi2, p_value, df = littles_mcar_test(df)
print(f"Little's MCAR Test:")
print(f"  Chi-square = {chi2:.4f}")
print(f"  df = {df}")
print(f"  p-value = {p_value:.4f}")

if p_value > 0.05:
    print("  Conclusion: Cannot reject MCAR hypothesis (data may be MCAR)")
else:
    print("  Conclusion: Reject MCAR hypothesis (data is not MCAR)")

Missing Value Handling Strategies

Strategy 1: Deletion

Listwise Deletion:

python
# Delete rows with any missing column
df_clean = df.dropna()

# Delete rows with missing in specific columns
df_clean = df.dropna(subset=['income', 'education'])

# Delete rows that are all missing
df_clean = df.dropna(how='all')

# Delete rows with missing exceeding threshold (keep only if at least 5 non-missing values)
df_clean = df.dropna(thresh=5)

Delete columns with high missing rates:

python
# Delete columns with missing rate > 50%
missing_ratio = df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > 0.5].index
df_clean = df.drop(columns=cols_to_drop)

print(f"Dropped columns: {list(cols_to_drop)}")

When to use deletion:

  • Missing rate < 5%
  • MCAR (Missing Completely At Random)
  • Sample size is large enough

Risks of deletion:

  • With a high missing rate, deletion discards substantial information
  • Under MAR/MNAR, deletion introduces selection bias (see the sketch below)
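
A minimal sketch (simulated data, not one of the chapter's datasets) showing how listwise deletion biases an estimate under MAR missingness: higher-educated respondents refuse the income question more often, so the complete-case mean understates true income.

python
import numpy as np
import pandas as pd

np.random.seed(1)
n = 5000
edu = np.random.normal(12, 3, n)
income = 20000 + 3000 * edu + np.random.normal(0, 10000, n)
df_mar = pd.DataFrame({'education': edu, 'income': income})

# MAR: higher education -> higher probability of refusing the income question
p_miss = 1 / (1 + np.exp(-(edu - 14)))
df_mar.loc[np.random.random(n) < p_miss, 'income'] = np.nan

print(f"True mean income:   {income.mean():.0f}")
print(f"Complete-case mean: {df_mar['income'].mean():.0f}")  # biased downward
print(f"Missing rate:       {df_mar['income'].isnull().mean():.1%}")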

Strategy 2: Simple Imputation

python
# 1. Mean imputation
df['age'] = df['age'].fillna(df['age'].mean())

# 2. Median imputation (robust to outliers)
df['income'] = df['income'].fillna(df['income'].median())

# 3. Mode imputation (categorical variables)
df['region'] = df['region'].fillna(df['region'].mode()[0])

# 4. Constant imputation
df['score'] = df['score'].fillna(0)

# 5. Forward fill (time series)
df['sales'] = df['sales'].ffill()

# 6. Backward fill
df['sales'] = df['sales'].bfill()

# 7. Linear interpolation (time series)
df['temperature'] = df['temperature'].interpolate(method='linear')

Batch imputation:

python
# Median for numeric columns, mode for categorical columns
from sklearn.impute import SimpleImputer

# Numeric columns
num_cols = df.select_dtypes(include=[np.number]).columns
imputer_num = SimpleImputer(strategy='median')
df[num_cols] = imputer_num.fit_transform(df[num_cols])

# Categorical columns
cat_cols = df.select_dtypes(include=['object']).columns
imputer_cat = SimpleImputer(strategy='most_frequent')
df[cat_cols] = imputer_cat.fit_transform(df[cat_cols])

Strategy 3: Advanced Imputation

Conditional imputation (group-based imputation):

python
# Use regional average income to fill missing values
df['income'] = df.groupby('region')['income'].transform(
    lambda x: x.fillna(x.median())
)

# Use industry-education combination average wage to fill
df['wage'] = df.groupby(['industry', 'education'])['wage'].transform(
    lambda x: x.fillna(x.mean())
)

Regression imputation (predict missing values using other variables):

python
from sklearn.linear_model import LinearRegression

# Example: Use age and education to predict missing income
# 1. Separate data with/without missing
df_with_income = df[df['income'].notnull()]
df_missing_income = df[df['income'].isnull()]

# 2. Train model
X = df_with_income[['age', 'education']]
y = df_with_income['income']
model = LinearRegression().fit(X, y)

# 3. Predict missing values
X_missing = df_missing_income[['age', 'education']]
predicted_income = model.predict(X_missing)

# 4. Fill
df.loc[df['income'].isnull(), 'income'] = predicted_income

KNN imputation:

python
from sklearn.impute import KNNImputer

# Use K nearest neighbors' average value to fill
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df[['age', 'income', 'education']]),
    columns=['age', 'income', 'education']
)

MICE Multiple Imputation Theory

MICE Algorithm Principle

MICE (Multivariate Imputation by Chained Equations) is an iterative imputation algorithm.

Algorithm Steps

For variables $X_1, X_2, \ldots, X_p$ containing missing values:

  1. Initialization: Fill all missing values using a simple method (e.g., the mean)

  2. Iteration: At the $t$-th iteration, for each variable $X_j$ with missing values, regress $X_j$ on all other variables and draw imputations from the fitted model:

$$X_j^{(t)} \sim P\!\left(X_j \,\middle|\, X_{-j}^{(t)}, \theta_j\right)$$

Where $X_{-j}$ represents all variables except $X_j$ (using their most recently imputed values).

  3. Convergence: Repeat step 2 until convergence (usually 5-10 iterations)

  4. Generate multiple imputed datasets: Repeat the above process $m$ times (usually $m = 5$), yielding $m$ complete datasets (a minimal single-imputation sketch follows this list)
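
Before the full multiple-imputation class later in this section, here is a minimal single-imputation sketch with scikit-learn's IterativeImputer, which implements the chained-equations idea (one imputed dataset only; the toy variables are illustrative):

python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

np.random.seed(0)
df_toy = pd.DataFrame({
    'age': np.random.normal(40, 10, 200),
    'education': np.random.normal(12, 3, 200),
    'income': np.random.normal(50000, 20000, 200)
})
df_toy.loc[np.random.random(200) < 0.2, 'income'] = np.nan

# Each variable with missing values is regressed on the others, iteratively
imputer = IterativeImputer(max_iter=10, random_state=42)
df_one = pd.DataFrame(imputer.fit_transform(df_toy), columns=df_toy.columns)
print(df_one.isnull().sum())  # all zeros after imputation

Running this process $m$ times with different random seeds, as the MICEImputer class below does, turns single imputation into proper multiple imputation.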

Rubin's Rules

For $m$ imputed datasets, how do we combine the estimates?

Let $\hat{Q}_i$ be the parameter estimate from the $i$-th imputed dataset, and $\hat{W}_i$ be its variance estimate.

Pooled estimator:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$$

Total variance:

$$T = \bar{W} + \left(1 + \frac{1}{m}\right) B$$

Where:

  • Within-imputation variance (average variance): $\bar{W} = \frac{1}{m} \sum_{i=1}^{m} \hat{W}_i$
  • Between-imputation variance: $B = \frac{1}{m-1} \sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2$

Degrees of freedom:

$$\nu = (m - 1) \left(1 + \frac{\bar{W}}{(1 + 1/m) B}\right)^2$$

Confidence interval:

$$\bar{Q} \pm t_{\nu,\,0.975} \sqrt{T}$$

Fraction of Missing Information

$$\text{FMI} = \frac{(1 + 1/m)\, B}{T}$$

FMI measures the proportion of information lost due to missingness.
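
A small numeric sketch of the pooling formulas, using three hypothetical point estimates and variances chosen purely for illustration:

python
import numpy as np
from scipy import stats

# Hypothetical per-imputation estimates and their variances (m = 3)
estimates = np.array([50200.0, 50800.0, 50500.0])
variances = np.array([400.0**2, 410.0**2, 395.0**2])
m = len(estimates)

Q_bar = estimates.mean()                     # pooled estimate
W = variances.mean()                         # within-imputation variance
B = estimates.var(ddof=1)                    # between-imputation variance
T = W + (1 + 1/m) * B                        # total variance
nu = (m - 1) * (1 + W / ((1 + 1/m) * B))**2  # degrees of freedom
fmi = (1 + 1/m) * B / T                      # fraction of missing information

se = np.sqrt(T)
t_crit = stats.t.ppf(0.975, nu)
print(f"Pooled estimate: {Q_bar:.1f}, SE: {se:.1f}")
print(f"95% CI: [{Q_bar - t_crit * se:.1f}, {Q_bar + t_crit * se:.1f}]")
print(f"FMI: {fmi:.3f}")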

Complete MICE Implementation

python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
import pandas as pd
import numpy as np

class MICEImputer:
    """
    MICE (Multivariate Imputation by Chained Equations)
    Complete implementation + Rubin's Rules
    """

    def __init__(self, n_imputations=5, max_iter=10, random_state=42):
        """
        Parameters:
        ------
        n_imputations : int
            Number of imputed datasets to generate
        max_iter : int
            Maximum iterations for each imputation
        random_state : int
            Random seed
        """
        self.n_imputations = n_imputations
        self.max_iter = max_iter
        self.random_state = random_state
        self.imputed_datasets = []

    def fit_transform(self, X):
        """
        Perform MICE imputation

        Parameters:
        ------
        X : pd.DataFrame
            DataFrame containing missing values

        Returns:
        ------
        imputed_datasets : list
            m imputed DataFrames
        """
        self.imputed_datasets = []

        for i in range(self.n_imputations):
            # Create iterative imputer
            imputer = IterativeImputer(
                estimator=BayesianRidge(),
                max_iter=self.max_iter,
                random_state=self.random_state + i,
                verbose=0
            )

            # Impute
            X_imputed = imputer.fit_transform(X)

            # Convert to DataFrame
            df_imputed = pd.DataFrame(
                X_imputed,
                columns=X.columns,
                index=X.index
            )

            self.imputed_datasets.append(df_imputed)

        return self.imputed_datasets

    def analyze(self, estimator_func):
        """
        Analyze multiple imputed datasets and combine results using Rubin's Rules

        Parameters:
        ------
        estimator_func : function
            Function that accepts DataFrame and returns (estimate, variance)

        Returns:
        ------
        pooled_estimate : float
            Pooled estimator
        pooled_variance : float
            Total variance
        pooled_se : float
            Pooled standard error
        ci_lower, ci_upper : float
            95% confidence interval
        """
        estimates = []
        variances = []

        # Estimate for each imputed dataset
        for df in self.imputed_datasets:
            est, var = estimator_func(df)
            estimates.append(est)
            variances.append(var)

        # Rubin's Rules
        m = self.n_imputations

        # Pooled estimator
        pooled_estimate = np.mean(estimates)

        # Within-imputation variance
        W = np.mean(variances)

        # Between-imputation variance
        B = np.var(estimates, ddof=1)

        # Total variance
        T = W + (1 + 1/m) * B

        # Standard error
        pooled_se = np.sqrt(T)

        # Degrees of freedom
        if B > 0:
            nu = (m - 1) * (1 + W / ((1 + 1/m) * B))**2
        else:
            nu = np.inf

        # 95% CI
        from scipy import stats
        t_crit = stats.t.ppf(0.975, nu) if nu < np.inf else 1.96
        ci_lower = pooled_estimate - t_crit * pooled_se
        ci_upper = pooled_estimate + t_crit * pooled_se

        # Fraction of missing information
        fmi = (B + B/m) / T if T > 0 else 0

        return {
            'estimate': pooled_estimate,
            'se': pooled_se,
            'ci_lower': ci_lower,
            'ci_upper': ci_upper,
            'W': W,
            'B': B,
            'T': T,
            'nu': nu,
            'fmi': fmi
        }

# Usage example
# Create simulated data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'age': np.random.normal(40, 10, n),
    'education': np.random.normal(12, 3, n),
    'income': np.random.normal(50000, 20000, n),
    'health': np.random.normal(70, 15, n)
})

# Introduce MAR missingness (high-income individuals more likely to have missing income)
df.loc[df['income'] > 60000, 'income'] = np.where(
    np.random.random(sum(df['income'] > 60000)) < 0.3,
    np.nan,
    df.loc[df['income'] > 60000, 'income']
)

# Introduce other missingness
for col in ['age', 'education', 'health']:
    mask = np.random.random(n) < 0.1
    df.loc[mask, col] = np.nan

print(f"Missing rates:\n{df.isnull().mean() * 100}")

# MICE imputation
mice = MICEImputer(n_imputations=5, max_iter=10)
imputed_datasets = mice.fit_transform(df)

# Define estimation function (e.g., estimate income mean)
def estimate_mean_income(df):
    """Estimate income mean and its variance"""
    mean = df['income'].mean()
    var = df['income'].var() / len(df)  # Variance of the mean
    return mean, var

# Combine results using Rubin's Rules
results = mice.analyze(estimate_mean_income)

print("\n=== MICE + Rubin's Rules Results ===")
print(f"Pooled estimate: {results['estimate']:.2f}")
print(f"Standard error: {results['se']:.2f}")
print(f"95% CI: [{results['ci_lower']:.2f}, {results['ci_upper']:.2f}]")
print(f"Within variance (W): {results['W']:.2f}")
print(f"Between variance (B): {results['B']:.2f}")
print(f"Total variance (T): {results['T']:.2f}")
print(f"Degrees of freedom: {results['nu']:.2f}")
print(f"Fraction of missing information (FMI): {results['fmi']:.4f}")

# Compare with complete case analysis
complete_case_mean = df['income'].dropna().mean()
print(f"\nComplete case mean: {complete_case_mean:.2f}")
print(f"Difference: {results['estimate'] - complete_case_mean:.2f}")

Strategy 4: Missing Indicator Variable

python
# Create missing indicator variable
df['income_missing'] = df['income'].isnull().astype(int)

# Then fill original variable (e.g., with median)
df['income'].fillna(df['income'].median(), inplace=True)

# In regression, include both the original variable and the missing indicator;
# the coefficient of income_missing reflects the effect of missingness itself
# (see the regression sketch after this subsection)

When to use:

  • Suspect missingness itself is informative (MNAR)
  • Moderate missing rate (10-30%)
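
A hedged sketch of the regression described in the code comment above, on simulated data (the variable names and the consumption outcome are illustrative assumptions): a significant coefficient on the indicator suggests the missingness itself carries information.

python
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(0)
n = 1000
income = np.random.normal(50000, 15000, n)
consumption = 5000 + 0.6 * income + np.random.normal(0, 5000, n)
df_ind = pd.DataFrame({'income': income, 'consumption': consumption})

# Informative missingness: high earners refuse the income question more often
df_ind.loc[(df_ind['income'] > 60000) & (np.random.random(n) < 0.4), 'income'] = np.nan

# Missing indicator + median fill
df_ind['income_missing'] = df_ind['income'].isnull().astype(int)
df_ind['income_filled'] = df_ind['income'].fillna(df_ind['income'].median())

# Include both the filled variable and the indicator as regressors
X = sm.add_constant(df_ind[['income_filled', 'income_missing']])
model = sm.OLS(df_ind['consumption'], X).fit()
print(model.summary().tables[1])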

Production-Level Data Cleaning Class

python
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import IsolationForest
import warnings
warnings.filterwarnings('ignore')

class DataCleaner:
    """
    Production-Level Data Cleaning Pipeline - Nobel Prize Standards

    Features:
    -----
    1. Missing value diagnosis (MCAR/MAR/MNAR)
    2. MICE multiple imputation
    3. Outlier detection and handling (IQR, Z-score, Isolation Forest, Winsorization)
    4. String cleaning
    5. Duplicate handling
    6. Data type unification
    7. Complete cleaning report

    Example:
    -----
    >>> cleaner = DataCleaner(df, name='CFPS 2018')
    >>> cleaner.diagnose_missing()
    >>> cleaner.impute_mice(n_imputations=5)
    >>> cleaner.handle_outliers(method='winsorize')
    >>> df_clean = cleaner.get_clean_data()
    >>> cleaner.generate_cleaning_report()
    """

    def __init__(self, df, name='Dataset'):
        """
        Initialize cleaner

        Parameters:
        ------
        df : pd.DataFrame
            Raw DataFrame
        name : str
            Dataset name
        """
        self.df_original = df.copy()
        self.df = df.copy()
        self.name = name
        self.cleaning_log = []
        self.missing_diagnostics = {}
        self.outlier_diagnostics = {}

        self._log(f"Initialized DataCleaner: {name}")
        self._log(f"  Shape: {self.df.shape[0]} rows × {self.df.shape[1]} columns")

    def _log(self, message):
        """Log cleaning steps"""
        self.cleaning_log.append(message)
        print(message)

    def diagnose_missing(self, plot=True):
        """
        Diagnose missing mechanism: MCAR/MAR/MNAR

        Steps:
        -----
        1. Little's MCAR Test
        2. Missing pattern analysis
        3. Correlation heatmap
        4. Missingness correlation with observed variables
        """
        self._log("\n" + "="*70)
        self._log("Missing Value Diagnosis")
        self._log("="*70)

        # Basic statistics
        missing_count = self.df.isnull().sum()
        missing_pct = self.df.isnull().mean() * 100

        missing_summary = pd.DataFrame({
            'Missing_Count': missing_count,
            'Missing_Pct': missing_pct
        })
        missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values(
            'Missing_Pct', ascending=False
        )

        self._log(f"\n【1】Missing Value Overview")
        self._log(f"  Total missing values: {self.df.isnull().sum().sum()}")
        self._log(f"  Rows with missing: {self.df.isnull().any(axis=1).sum()} ({self.df.isnull().any(axis=1).mean()*100:.2f}%)")
        self._log(f"\n{missing_summary.to_string()}")

        # Little's MCAR Test
        self._log(f"\n【2】Little's MCAR Test")
        try:
            chi2, p_value, df_test = self._littles_mcar_test()
            self._log(f"  Chi-square = {chi2:.4f}")
            self._log(f"  df = {df_test}")
            self._log(f"  p-value = {p_value:.4f}")

            if p_value > 0.05:
                self._log(f"  Conclusion: Cannot reject MCAR (p > 0.05)")
                self.missing_diagnostics['mechanism'] = 'MCAR'
            else:
                self._log(f"  Conclusion: Reject MCAR (p < 0.05), data may be MAR or MNAR")
                self.missing_diagnostics['mechanism'] = 'MAR/MNAR'
        except Exception as e:
            self._log(f"  Unable to perform Little's MCAR Test: {e}")

        # Missing pattern analysis
        self._log(f"\n【3】Missing Pattern Analysis")
        missing_patterns = self.df.isnull().astype(int)
        unique_patterns = missing_patterns.value_counts()
        self._log(f"  Number of unique missing patterns: {len(unique_patterns)}")
        self._log(f"  Top 5 missing patterns:")
        for i, (pattern, count) in enumerate(unique_patterns.head(5).items(), 1):
            self._log(f"    Pattern {i}: {count} rows ({count/len(self.df)*100:.2f}%)")

        # Missingness correlation analysis
        self._log(f"\n【4】Missingness Correlation with Observed Variables")
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns

        for col in numeric_cols:
            if self.df[col].isnull().any():
                # Create missing indicator variable
                missing_indicator = self.df[col].isnull().astype(int)

                # Calculate correlation with other numeric variables
                correlations = []
                for other_col in numeric_cols:
                    if other_col != col and not self.df[other_col].isnull().all():
                        try:
                            corr = missing_indicator.corr(self.df[other_col])
                            if abs(corr) > 0.1:  # Only show significant correlations
                                correlations.append((other_col, corr))
                        except:
                            pass

                if correlations:
                    self._log(f"\n  {col} missingness correlates with:")
                    for other_col, corr in sorted(correlations, key=lambda x: abs(x[1]), reverse=True)[:3]:
                        self._log(f"    - {other_col}: {corr:.3f}")

        # Visualization
        if plot:
            self._plot_missing_patterns()

        self.missing_diagnostics['summary'] = missing_summary

        return missing_summary

    def _littles_mcar_test(self):
        """Little's MCAR Test implementation"""
        from scipy.linalg import inv

        data = self.df.copy()
        data = data.dropna(axis=1, how='all')

        missing_patterns = data.isnull().astype(int)
        unique_patterns = missing_patterns.drop_duplicates()

        numeric_data = data.select_dtypes(include=[np.number])

        complete_cases = numeric_data.dropna()
        if len(complete_cases) < 2:
            return np.nan, np.nan, np.nan

        grand_mean = complete_cases.mean().values
        grand_cov = complete_cases.cov().values

        chi2_stat = 0
        total_df = 0

        for idx, pattern in unique_patterns.iterrows():
            mask = (missing_patterns == pattern).all(axis=1)
            pattern_data = numeric_data[mask]

            if len(pattern_data) == 0:
                continue

            # Observed variables (pattern value 0 = observed)
            observed_vars = (pattern[numeric_data.columns].values == 0)

            if observed_vars.sum() == 0:
                continue

            pattern_mean = pattern_data.iloc[:, observed_vars].mean().values
            sub_cov = grand_cov[np.ix_(observed_vars, observed_vars)]

            if np.linalg.matrix_rank(sub_cov) < len(pattern_mean):
                continue

            sub_mean = grand_mean[observed_vars]
            diff = pattern_mean - sub_mean
            n_j = len(pattern_data)

            try:
                chi2_j = n_j * diff @ inv(sub_cov) @ diff
                chi2_stat += chi2_j
                total_df += len(pattern_mean)
            except np.linalg.LinAlgError:
                continue

        # Degrees of freedom: sum of observed-variable counts across patterns,
        # minus the total number of variables (Little, 1988)
        df = total_df - numeric_data.shape[1]

        p_value = 1 - stats.chi2.cdf(chi2_stat, df) if df > 0 else np.nan

        return chi2_stat, p_value, df

    def _plot_missing_patterns(self):
        """Visualize missing patterns"""
        try:
            import missingno as msno
            import matplotlib.pyplot as plt

            fig, axes = plt.subplots(1, 2, figsize=(14, 6))

            # Missing matrix
            msno.matrix(self.df, ax=axes[0], fontsize=10)
            axes[0].set_title(f'{self.name} - Missing Pattern Matrix', fontsize=12)

            # Missing heatmap
            msno.heatmap(self.df, ax=axes[1], fontsize=10)
            axes[1].set_title(f'{self.name} - Missing Correlation', fontsize=12)

            plt.tight_layout()
            plt.show()
        except ImportError:
            self._log("  Tip: Install missingno for missing pattern visualization (pip install missingno)")

    def impute_mice(self, n_imputations=5, max_iter=10, use_rubin=True):
        """
        MICE (Multivariate Imputation by Chained Equations) complete implementation

        Parameters:
        ------
        n_imputations : int
            Number of imputed datasets to generate
        max_iter : int
            Maximum iterations for each imputation
        use_rubin : bool
            Whether to use Rubin's Rules to combine results

        Returns:
        ------
        imputed_datasets : list (if use_rubin=False)
            or pooled DataFrame (if use_rubin=True)
        """
        self._log("\n" + "="*70)
        self._log(f"MICE Multiple Imputation (m={n_imputations}, max_iter={max_iter})")
        self._log("="*70)

        # Only impute numeric columns
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        non_numeric_cols = self.df.select_dtypes(exclude=[np.number]).columns

        # Save non-numeric columns
        df_non_numeric = self.df[non_numeric_cols].copy()

        imputed_datasets = []

        for i in range(n_imputations):
            self._log(f"\n  Imputed dataset {i+1}/{n_imputations}")

            imputer = IterativeImputer(
                estimator=BayesianRidge(),
                max_iter=max_iter,
                random_state=42 + i,
                verbose=0
            )

            X_imputed = imputer.fit_transform(self.df[numeric_cols])

            df_imputed = pd.DataFrame(
                X_imputed,
                columns=numeric_cols,
                index=self.df.index
            )

            # Merge non-numeric columns
            df_imputed = pd.concat([df_imputed, df_non_numeric], axis=1)

            imputed_datasets.append(df_imputed)

            self._log(f"    Convergence status: Complete")

        if use_rubin:
            # Combine using Rubin's Rules
            self._log(f"\n  Combining {n_imputations} datasets using Rubin's Rules")

            # Take average for each numeric variable (simplified Rubin's Rules)
            df_pooled = pd.DataFrame(index=self.df.index)

            for col in numeric_cols:
                values = np.array([df[col].values for df in imputed_datasets])
                df_pooled[col] = values.mean(axis=0)

            # Merge non-numeric columns
            df_pooled = pd.concat([df_pooled, df_non_numeric], axis=1)

            self.df = df_pooled
            self._log(f"  Merged into single dataset")

            return df_pooled
        else:
            self._log(f"  Returning {n_imputations} independent imputed datasets")
            return imputed_datasets

    def handle_outliers(self, method='winsorize', limits=[0.05, 0.05],
                       threshold=3, contamination=0.1):
        """
        Robust outlier handling

        Parameters:
        ------
        method : str
            Handling method: 'winsorize', 'iqr', 'zscore', 'isolation_forest'
        limits : list
            Winsorization upper and lower limit ratios [lower, upper]
        threshold : float
            Z-score method threshold
        contamination : float
            Isolation Forest outlier proportion
        """
        self._log("\n" + "="*70)
        self._log(f"Outlier Handling (method: {method})")
        self._log("="*70)

        numeric_cols = self.df.select_dtypes(include=[np.number]).columns

        outliers_summary = []

        for col in numeric_cols:
            if self.df[col].std() == 0:
                continue

            before_min = self.df[col].min()
            before_max = self.df[col].max()
            before_mean = self.df[col].mean()

            if method == 'winsorize':
                self.df[col] = winsorize(self.df[col], limits=limits)

            elif method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                outliers_mask = (self.df[col] < lower_bound) | (self.df[col] > upper_bound)
                n_outliers = outliers_mask.sum()

                # Delete outliers
                self.df = self.df[~outliers_mask]

                outliers_summary.append({
                    'Variable': col,
                    'Method': 'IQR',
                    'Outlier_Count': n_outliers,
                    'Lower_Bound': lower_bound,
                    'Upper_Bound': upper_bound
                })

            elif method == 'zscore':
                z_scores = np.abs(stats.zscore(self.df[col]))
                outliers_mask = z_scores > threshold
                n_outliers = outliers_mask.sum()

                # Delete outliers
                self.df = self.df[~outliers_mask]

                outliers_summary.append({
                    'Variable': col,
                    'Method': 'Z-score',
                    'Outlier_Count': n_outliers,
                    'Threshold': threshold
                })

            elif method == 'isolation_forest':
                # Use Isolation Forest
                iso_forest = IsolationForest(
                    contamination=contamination,
                    random_state=42
                )

                outliers_mask = iso_forest.fit_predict(
                    self.df[[col]].values.reshape(-1, 1)
                ) == -1

                n_outliers = outliers_mask.sum()

                # Delete outliers
                self.df = self.df[~outliers_mask]

                outliers_summary.append({
                    'Variable': col,
                    'Method': 'Isolation Forest',
                    'Outlier_Count': n_outliers,
                    'Contamination': contamination
                })

            after_min = self.df[col].min()
            after_max = self.df[col].max()
            after_mean = self.df[col].mean()

            if method == 'winsorize':
                self._log(f"\n  {col}:")
                self._log(f"    Before: [{before_min:.2f}, {before_max:.2f}], mean = {before_mean:.2f}")
                self._log(f"    After: [{after_min:.2f}, {after_max:.2f}], mean = {after_mean:.2f}")

        if outliers_summary:
            self._log(f"\n  Outlier handling summary:")
            for item in outliers_summary:
                self._log(f"    {item}")

        self.outlier_diagnostics = {
            'method': method,
            'summary': outliers_summary
        }

    def clean_strings(self):
        """
        String cleaning

        Steps:
        -----
        1. Strip whitespace
        2. Normalize case
        3. Remove special characters
        """
        self._log("\n" + "="*70)
        self._log("String Cleaning")
        self._log("="*70)

        string_cols = self.df.select_dtypes(include=['object']).columns

        for col in string_cols:
            # Remove whitespace
            self.df[col] = self.df[col].str.strip()

            # Uniform lowercase
            self.df[col] = self.df[col].str.lower()

            self._log(f"  {col}: Cleaned (removed whitespace + lowercase)")

        self._log(f"\n  Cleaned {len(string_cols)} string columns")

    def remove_duplicates(self, subset=None, keep='first'):
        """
        Remove duplicate values

        Parameters:
        ------
        subset : list
            Columns to check for duplicates
        keep : str
            Retention strategy: 'first', 'last', False
        """
        self._log("\n" + "="*70)
        self._log("Remove Duplicates")
        self._log("="*70)

        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset, keep=keep)
        after = len(self.df)

        self._log(f"  Removed {before - after} duplicate records ({(before-after)/before*100:.2f}%)")

    def get_clean_data(self):
        """Return cleaned data"""
        return self.df.copy()

    def generate_cleaning_report(self):
        """
        Generate complete cleaning report
        """
        print("\n" + "="*70)
        print(f"{self.name} - Data Cleaning Report")
        print("="*70)

        print(f"\n【1】Data Overview")
        print(f"  Original data: {self.df_original.shape[0]} rows × {self.df_original.shape[1]} columns")
        print(f"  After cleaning: {self.df.shape[0]} rows × {self.df.shape[1]} columns")
        print(f"  Rows removed: {self.df_original.shape[0] - self.df.shape[0]} ({(self.df_original.shape[0] - self.df.shape[0])/self.df_original.shape[0]*100:.2f}%)")

        print(f"\n【2】Missing Value Statistics")
        original_missing = self.df_original.isnull().sum().sum()
        final_missing = self.df.isnull().sum().sum()
        print(f"  Original missing values: {original_missing}")
        print(f"  After cleaning missing values: {final_missing}")
        print(f"  Handled missing values: {original_missing - final_missing}")

        if self.missing_diagnostics:
            print(f"  Missing mechanism: {self.missing_diagnostics.get('mechanism', 'Unknown')}")

        print(f"\n【3】Outlier Handling")
        if self.outlier_diagnostics:
            print(f"  Method: {self.outlier_diagnostics.get('method', 'None')}")
            summary = self.outlier_diagnostics.get('summary', [])
            if summary:
                total_outliers = sum(item['Outlier_Count'] for item in summary)
                print(f"  Total outliers: {total_outliers}")

        print(f"\n【4】Data Quality Metrics")
        print(f"  Completeness: {(1 - self.df.isnull().mean().mean())*100:.2f}%")
        print(f"  Duplicate rate: {self.df.duplicated().mean()*100:.2f}%")

        print(f"\n【5】Cleaning Steps Log")
        print(f"  Executed {len(self.cleaning_log)} steps")

        print("="*70)

        # Return detailed log
        return {
            'original_shape': self.df_original.shape,
            'final_shape': self.df.shape,
            'rows_removed': self.df_original.shape[0] - self.df.shape[0],
            'missing_handled': original_missing - final_missing,
            'cleaning_log': self.cleaning_log,
            'missing_diagnostics': self.missing_diagnostics,
            'outlier_diagnostics': self.outlier_diagnostics
        }

# Usage example
if __name__ == "__main__":
    # Create simulated data
    np.random.seed(42)
    n = 1000

    df_raw = pd.DataFrame({
        'id': range(1, n+1),
        'age': np.random.normal(40, 10, n),
        'education': np.random.normal(12, 3, n),
        'income': np.random.normal(50000, 20000, n),
        'health': np.random.normal(70, 15, n),
        'region': np.random.choice(['east', 'west', 'north', 'south'], n),
        'gender': np.random.choice(['male', 'female'], n)
    })

    # Introduce missing values
    df_raw.loc[df_raw['income'] > 60000, 'income'] = np.where(
        np.random.random(sum(df_raw['income'] > 60000)) < 0.3,
        np.nan,
        df_raw.loc[df_raw['income'] > 60000, 'income']
    )

    for col in ['age', 'education', 'health']:
        mask = np.random.random(n) < 0.1
        df_raw.loc[mask, col] = np.nan

    # Introduce outliers
    df_raw.loc[np.random.choice(n, 10), 'age'] = -5
    df_raw.loc[np.random.choice(n, 10), 'income'] = 999999

    # Use DataCleaner
    cleaner = DataCleaner(df_raw, name='Simulated Dataset')

    # Diagnose missing
    cleaner.diagnose_missing(plot=False)

    # MICE imputation
    cleaner.impute_mice(n_imputations=5, max_iter=10)

    # Handle outliers
    cleaner.handle_outliers(method='winsorize', limits=[0.05, 0.05])

    # Clean strings
    cleaner.clean_strings()

    # Remove duplicates
    cleaner.remove_duplicates()

    # Generate report
    report = cleaner.generate_cleaning_report()

    # Get cleaned data
    df_clean = cleaner.get_clean_data()

Outlier Handling

What are Outliers?

Outlier: Observations significantly different from most data

Two Types:

  1. Data Errors: data entry or measurement mistakes

    • Example: age = -5, income = 999999999
    • Handling: delete or correct (a quick validity-check sketch follows this list)
  2. True Extremes: genuine but rare observations

    • Example: CEO annual salary, billionaire wealth
    • Handling: keep, winsorize, or apply a log transformation
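
For the first type, a minimal rule-based validity check can catch impossible values before any statistical outlier detection (the bounds below are illustrative assumptions, not survey definitions):

python
import pandas as pd

# Domain rules for obviously impossible values (bounds are illustrative)
rules = {
    'age': (0, 120),
    'income': (0, 10_000_000),
}

def flag_invalid(df, rules):
    """Return a boolean DataFrame marking values outside their plausible range."""
    flags = pd.DataFrame(False, index=df.index, columns=list(rules))
    for col, (lo, hi) in rules.items():
        if col in df.columns:
            flags[col] = (df[col] < lo) | (df[col] > hi)
    return flags

df_check = pd.DataFrame({'age': [34, -5, 52], 'income': [48000, 999999999, 61000]})
invalid = flag_invalid(df_check, rules)
print(invalid)
print(f"Rows with at least one impossible value: {invalid.any(axis=1).sum()}")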

Methods to Identify Outliers

Method 1: Box Plot

python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate box plot
plt.figure(figsize=(8, 6))
sns.boxplot(y=df['income'])
plt.title('Income Distribution (Box Plot)')
plt.show()

# Multivariate comparison
plt.figure(figsize=(12, 6))
df[['age', 'income', 'education']].boxplot()
plt.ylabel('Value')
plt.title('Multivariate Box Plot')
plt.show()

Method 2: IQR Method (Interquartile Range)

python
def detect_outliers_iqr(df, col):
    """
    Detect outliers using IQR method
    """
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # Define outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    print(f"Column: {col}")
    print(f"  Q1: {Q1:.2f}")
    print(f"  Q3: {Q3:.2f}")
    print(f"  IQR: {IQR:.2f}")
    print(f"  Lower bound: {lower_bound:.2f}")
    print(f"  Upper bound: {upper_bound:.2f}")
    print(f"  Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.2f}%)")

    return outliers

# Usage example
income_outliers = detect_outliers_iqr(df, 'income')

Method 3: Z-score Method

For a normal distribution, about 99.7% of the data lies within $\mu \pm 3\sigma$:

python
from scipy import stats

def detect_outliers_zscore(df, col, threshold=3):
    """
    Detect outliers using Z-score method
    threshold: Usually use 3 (99.7% of data within ±3σ)
    """
    valid = df[col].dropna()
    z_scores = np.abs(stats.zscore(valid))
    outliers = df.loc[valid.index[z_scores > threshold]]

    print(f"Column: {col}")
    print(f"  Mean: {df[col].mean():.2f}")
    print(f"  Std Dev: {df[col].std():.2f}")
    print(f"  Threshold: ±{threshold}σ")
    print(f"  Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.2f}%)")

    return outliers

# Usage example
age_outliers = detect_outliers_zscore(df, 'age', threshold=3)

Method 4: Isolation Forest (Machine Learning Method)

Isolation Forest isolates outliers with an ensemble of random trees; anomalous points can be isolated with fewer random splits:

python
from sklearn.ensemble import IsolationForest

def detect_outliers_isolation_forest(df, cols, contamination=0.1):
    """
    Detect multivariate outliers using Isolation Forest

    Parameters:
    ------
    contamination : float
        Expected proportion of outliers
    """
    iso_forest = IsolationForest(
        contamination=contamination,
        random_state=42
    )

    # Predict (-1 indicates outlier)
    outliers_mask = iso_forest.fit_predict(df[cols]) == -1

    print(f"Isolation Forest:")
    print(f"  Number of outliers: {outliers_mask.sum()} ({outliers_mask.mean()*100:.2f}%)")

    return df[outliers_mask]

# Usage example
outliers_multi = detect_outliers_isolation_forest(
    df,
    ['age', 'income', 'education'],
    contamination=0.05
)

Outlier Handling Strategies

Strategy 1: Deletion

python
# Delete outliers
df_clean = df[~df.index.isin(outliers.index)]

# Delete records with income < 0 or income > 1,000,000
df_clean = df[(df['income'] > 0) & (df['income'] < 1000000)]

When to delete:

  • Obviously erroneous values (e.g., age = -5)
  • Sample size is large enough that deletion does not affect the analysis

When not to delete:

  • The outliers are genuine (e.g., billionaires); prefer winsorizing or a log transformation

Strategy 2: Winsorize (Winsorization)

python
from scipy.stats.mstats import winsorize

# Replace extreme values with 5th and 95th percentiles
df['income_winsorized'] = winsorize(df['income'], limits=[0.05, 0.05])

# Or manual implementation
lower = df['income'].quantile(0.05)
upper = df['income'].quantile(0.95)
df['income_winsorized'] = df['income'].clip(lower, upper)

print(f"Original range: {df['income'].min():.0f} - {df['income'].max():.0f}")
print(f"After Winsorize: {df['income_winsorized'].min():.0f} - {df['income_winsorized'].max():.0f}")

Advantages:

  • Preserve sample size
  • Reduce extreme value impact
  • Maintain data distribution shape

Strategy 3: Log Transformation

For right-skewed distributions (e.g., income, wealth):

$$x' = \log(x + c)$$

Where $c$ is a constant (usually $c = 1$), preventing $\log(0)$.

python
# Log transformation can compress right-skewed distributions
df['log_income'] = np.log(df['income'] + 1)

# Visualization comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original distribution
axes[0].hist(df['income'], bins=50, edgecolor='black')
axes[0].set_title('Original Income Distribution (Right-skewed)')
axes[0].set_xlabel('Income')

# After log transformation
axes[1].hist(df['log_income'], bins=50, edgecolor='black', color='green')
axes[1].set_title('Log Income Distribution (Closer to Normal)')
axes[1].set_xlabel('log(Income)')

plt.tight_layout()
plt.show()

Strategy 4: Cap (Capping)

python
# Set upper and lower limits
df['income_capped'] = df['income'].clip(lower=0, upper=500000)

# Or use percentiles
lower = df['income'].quantile(0.01)
upper = df['income'].quantile(0.99)
df['income_capped'] = df['income'].clip(lower, upper)

Duplicate Handling

Detecting Duplicates

python
# Detect completely duplicate rows
duplicates = df.duplicated()
print(f"Duplicate rows: {duplicates.sum()} ({100*duplicates.mean():.2f}%)")

# View duplicate rows
print(df[duplicates])

# Detect duplicates in specific columns
duplicates_id = df.duplicated(subset=['id'])
print(f"ID duplicates: {duplicates_id.sum()}")

# View duplicate IDs
duplicate_ids = df[duplicates_id]['id'].unique()
print(f"Duplicate IDs: {duplicate_ids}")

Removing Duplicates

python
# Remove completely duplicate rows (keep first occurrence)
df_clean = df.drop_duplicates()

# Keep last occurrence
df_clean = df.drop_duplicates(keep='last')

# Remove all duplicates (keep none)
df_clean = df.drop_duplicates(keep=False)

# Remove duplicates based on specific column
df_clean = df.drop_duplicates(subset=['id'], keep='first')

# Remove duplicates based on multiple column combination
df_clean = df.drop_duplicates(subset=['person_id', 'year'], keep='first')

Handling "Fuzzy Duplicates"

python
# Example: Names slightly different but actually same person
from fuzzywuzzy import fuzz

def find_fuzzy_duplicates(df, col, threshold=90):
    """
    Find fuzzy matching duplicates
    """
    values = df[col].unique()
    duplicates = []

    for i, val1 in enumerate(values):
        for val2 in values[i+1:]:
            similarity = fuzz.ratio(str(val1), str(val2))
            if similarity >= threshold:
                duplicates.append((val1, val2, similarity))

    return pd.DataFrame(duplicates, columns=['Value1', 'Value2', 'Similarity'])

# Example: Name matching
fuzzy_dups = find_fuzzy_duplicates(df, 'name', threshold=90)
print(fuzzy_dups)

Data Types and Format Unification

Data Type Conversion

python
# 1. Convert to numeric type
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # values that can't be parsed become NaN

# 2. Convert to string
df['id'] = df['id'].astype(str)

# 3. Convert to category type (save memory)
df['region'] = df['region'].astype('category')

# 4. Convert to date type
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')

# 5. Convert to boolean type
df['employed'] = df['employed'].astype(bool)

String Cleaning

python
# Remove whitespace
df['name'] = df['name'].str.strip()  # Leading and trailing whitespace
df['name'] = df['name'].str.lstrip()  # Left whitespace
df['name'] = df['name'].str.rstrip()  # Right whitespace

# Unify case
df['gender'] = df['gender'].str.lower()  # All lowercase
df['gender'] = df['gender'].str.upper()  # All uppercase
df['name'] = df['name'].str.title()     # Capitalize first letter

# Replace characters
df['phone'] = df['phone'].str.replace('-', '', regex=False)   # Remove hyphens
df['income'] = df['income'].str.replace(',', '', regex=False)  # Remove thousands separator
df['income'] = df['income'].str.replace('$', '', regex=False)  # Remove dollar sign

# Extract numbers
df['income_num'] = df['income'].str.extract(r'(\d+)').astype(float)

Date Handling

python
# Parse dates
df['date'] = pd.to_datetime(df['date'])

# Handle multiple date formats
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()

# Calculate date differences
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days

Heckman Sample Selection Model

Theoretical Foundation

Problem: When sample selection is non-random (MNAR), directly deleting missing values leads to selection bias.

Heckman Two-Step Model (Heckman, 1979):

Step 1: Selection Equation (Probit)

$$s_i^* = Z_i \gamma + u_i, \qquad s_i = \mathbf{1}(s_i^* > 0)$$

Where:

  • $s_i = 1$ indicates that $y_i$ is observed

Step 2: Outcome Equation (OLS + IMR)

$$y_i = X_i \beta + \varepsilon_i, \quad \text{observed only when } s_i = 1$$

Key: Assume $(u_i, \varepsilon_i)$ are jointly normal:

$$\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \sigma_\varepsilon \\ \rho \sigma_\varepsilon & \sigma_\varepsilon^2 \end{pmatrix} \right)$$

Inverse Mills Ratio

$$\lambda(Z_i \hat{\gamma}) = \frac{\phi(Z_i \hat{\gamma})}{\Phi(Z_i \hat{\gamma})}$$

Where:

  • $\phi$ is the standard normal PDF
  • $\Phi$ is the standard normal CDF

Corrected Regression

$$E\left[y_i \mid X_i, s_i = 1\right] = X_i \beta + \rho \sigma_\varepsilon\, \lambda(Z_i \hat{\gamma})$$
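
To build intuition for the correction term, the short sketch below (illustrative values only) evaluates the inverse Mills ratio over a range of probit indices; λ is large when selection is unlikely and shrinks toward zero as the selection probability approaches one.

python
import numpy as np
from scipy.stats import norm

z = np.linspace(-2, 2, 9)         # probit index Z_i * gamma_hat
imr = norm.pdf(z) / norm.cdf(z)   # inverse Mills ratio lambda(z)
for zi, li in zip(z, imr):
    print(f"z = {zi:+.1f}   P(selected) = {norm.cdf(zi):.2f}   lambda = {li:.3f}")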

Python Implementation

python
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression, LinearRegression
import statsmodels.api as sm

class HeckmanCorrection:
    """
    Heckman Two-Step Selection Model

    Used to handle sample selection bias (MNAR)

    Example:
    -----
    >>> heckman = HeckmanCorrection()
    >>> heckman.fit(
    ...     df=df,
    ...     selection_vars=['age', 'education', 'married', 'children'],
    ...     outcome_vars=['age', 'education'],
    ...     selection_indicator='working'
    ... )
    """

    def __init__(self):
        self.probit_model = None
        self.ols_model = None
        self.lambda_coef = None
        self.rho = None

    def fit(self, df, selection_vars, outcome_vars, selection_indicator):
        """
        Fit Heckman model

        Parameters:
        ------
        df : pd.DataFrame
            Complete DataFrame
        selection_vars : list
            Independent variables for selection equation
        outcome_vars : list
            Independent variables for outcome equation
        selection_indicator : str
            Selection indicator variable name (1=observed, 0=not observed).
            The dependent variable must be stored in a column named 'outcome'.
        """
        print("="*70)
        print("Heckman Two-Step Selection Model")
        print("="*70)

        # Step 1: Probit for selection
        print("\n【Step 1】Selection Equation (Probit)")

        X_selection = df[selection_vars]
        y_selection = df[selection_indicator]

        # Use statsmodels Probit
        X_selection_const = sm.add_constant(X_selection)
        probit = sm.Probit(y_selection, X_selection_const)
        self.probit_model = probit.fit(disp=False)

        print(self.probit_model.summary())

        # Calculate Inverse Mills Ratio from the probit linear index Z*gamma_hat
        # (predict() would return probabilities, not the linear index)
        linear_pred = np.dot(X_selection_const, self.probit_model.params)

        # Only for selected sample
        selected_mask = (y_selection == 1).values

        # IMR = phi(z) / Phi(z)
        imr = norm.pdf(linear_pred[selected_mask]) / norm.cdf(linear_pred[selected_mask])

        # Step 2: OLS with IMR
        print("\n【Step 2】Outcome Equation (OLS + IMR)")

        df_selected = df[selected_mask].copy()
        df_selected['imr'] = imr

        # Outcome equation
        X_outcome = df_selected[outcome_vars + ['imr']]
        y_outcome = df_selected['outcome']

        X_outcome_const = sm.add_constant(X_outcome)
        ols = sm.OLS(y_outcome, X_outcome_const)
        self.ols_model = ols.fit()

        print(self.ols_model.summary())

        # Extract IMR coefficient
        self.lambda_coef = self.ols_model.params['imr']

        print(f"\n【Results】")
        print(f"  Inverse Mills Ratio coefficient: {self.lambda_coef:.4f}")

        # Judge selection bias by the significance of the IMR coefficient
        imr_pvalue = self.ols_model.pvalues['imr']
        if imr_pvalue < 0.05:
            print(f"  Warning: significant sample selection bias (IMR p-value = {imr_pvalue:.4f})")
        else:
            print(f"  Sample selection bias appears minor (IMR p-value = {imr_pvalue:.4f})")

        return self

    def predict(self, X_new, with_correction=True):
        """
        Predict outcomes. X_new must contain the outcome-equation variables plus
        an 'imr' column, in the same order used for fitting; set
        with_correction=False to zero out the selection-correction (IMR) term.
        """
        X = X_new.copy()
        if not with_correction:
            # Drop the selection-correction term
            X['imr'] = 0.0
        return self.ols_model.predict(sm.add_constant(X))

# Usage example
if __name__ == "__main__":
    # Simulate data: Female wage data (married women more likely not to work)
    np.random.seed(42)
    n = 2000

    df = pd.DataFrame({
        'age': np.random.normal(35, 8, n),
        'education': np.random.normal(13, 2, n),
        'married': np.random.binomial(1, 0.5, n),
        'children': np.random.poisson(1, n)
    })

    # Selection equation: Work probability
    # Married women with children more likely not to work
    work_prob = 1 / (1 + np.exp(
        -(0.5 - 0.3 * df['married'] - 0.2 * df['children'] + 0.05 * df['education'])
    ))
    df['working'] = np.random.binomial(1, work_prob)

    # Outcome equation: Wage (only observed for working people)
    df['wage'] = np.nan
    working_mask = df['working'] == 1
    df.loc[working_mask, 'wage'] = (
        5000 +
        200 * df.loc[working_mask, 'age'] +
        1000 * df.loc[working_mask, 'education'] +
        np.random.normal(0, 2000, working_mask.sum())
    )

    print(f"Working population: {df['working'].sum()} / {n} ({df['working'].mean()*100:.1f}%)")
    print(f"Observed wages: {df['wage'].notna().sum()}")

    # Heckman model
    df_heckman = df.copy()
    df_heckman['outcome'] = df_heckman['wage']

    heckman = HeckmanCorrection()
    heckman.fit(
        df=df_heckman,
        selection_vars=['age', 'education', 'married', 'children'],
        outcome_vars=['age', 'education'],
        selection_indicator='working'
    )

Real Case: CFPS 2018 Data Cleaning

Case Background

Data: China Family Panel Studies (CFPS) 2018

Research Question: Impact of education on income

Challenges:

  • 30% income missing rate
  • High-income individuals more likely to refuse answering (MAR/MNAR)
  • Extreme values exist (very high income)

Complete Cleaning Workflow

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate CFPS 2018 data
np.random.seed(42)
n = 5000

# Create simulated data
cfps = pd.DataFrame({
    'pid': range(1, n+1),
    'age': np.random.normal(42, 12, n),
    'education': np.random.choice([6, 9, 12, 16], n, p=[0.3, 0.2, 0.3, 0.2]),
    'gender': np.random.choice(['Male', 'Female'], n),
    'urban': np.random.choice([0, 1], n, p=[0.4, 0.6]),
    'province': np.random.choice([
        'Beijing', 'Shanghai', 'Guangdong', 'Zhejiang', 'Jiangsu', 'Shandong', 'Henan', 'Sichuan', 'Hubei', 'Hunan'
    ], n),
    'health': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.4, 0.3, 0.1])
})

# Generate income (based on education, age, etc.)
cfps['income'] = (
    5000 +
    2000 * cfps['education'] +
    300 * cfps['age'] +
    5000 * cfps['urban'] +
    10000 * (cfps['gender'] == 'Male').astype(int) +
    np.random.lognormal(8, 1.5, n)
)

# Introduce MAR missingness (high income, high education more likely to be missing)
missing_prob = 1 / (1 + np.exp(
    -(cfps['income'] / 50000 - 2 + 0.1 * cfps['education'])
))
missing_mask = np.random.random(n) < missing_prob
cfps.loc[missing_mask, 'income'] = np.nan

print(f"Income missing rate: {cfps['income'].isnull().mean()*100:.2f}%")

# Introduce other missingness
for col in ['age', 'education', 'health']:
    mask = np.random.random(n) < 0.05
    cfps.loc[mask, col] = np.nan

# Introduce outliers
cfps.loc[np.random.choice(n, 20, replace=False), 'income'] = np.random.uniform(
    500000, 1000000, 20
)
cfps.loc[np.random.choice(n, 10, replace=False), 'age'] = -99

# Introduce duplicates
cfps = pd.concat([cfps, cfps.sample(50)], ignore_index=True)

print(f"\nRaw data: {cfps.shape}")
print(f"\nFirst 5 rows:")
print(cfps.head())

# Use DataCleaner
print("\n" + "="*70)
print("Start Data Cleaning")
print("="*70)

cleaner = DataCleaner(cfps, name='CFPS 2018')

# 1. Diagnose missing
cleaner.diagnose_missing(plot=False)

# 2. Remove duplicates
cleaner.remove_duplicates(subset=['pid'], keep='first')

# 3. Handle obvious error values
print("\n" + "="*70)
print("Handle Error Values")
print("="*70)

# Age range: 16-100
before = len(cleaner.df)
cleaner.df = cleaner.df[(cleaner.df['age'] >= 16) & (cleaner.df['age'] <= 100)]
print(f"  Removed age outliers: {before - len(cleaner.df)} rows")

# Income > 0
before = len(cleaner.df)
cleaner.df = cleaner.df[(cleaner.df['income'] > 0) | (cleaner.df['income'].isnull())]
print(f"  Removed negative income: {before - len(cleaner.df)} rows")

# 4. MICE imputation
cleaner.impute_mice(n_imputations=5, max_iter=10, use_rubin=True)

# 5. Handle outliers
cleaner.handle_outliers(method='winsorize', limits=[0.01, 0.01])

# 6. Clean strings
cleaner.clean_strings()

# 7. Generate report
report = cleaner.generate_cleaning_report()

# Get cleaned data
cfps_clean = cleaner.get_clean_data()

print(f"\nCleaned data: {cfps_clean.shape}")
print(f"\nDescriptive statistics:")
print(cfps_clean[['age', 'education', 'income']].describe())

# 8. Compare complete case deletion vs MICE imputation
print("\n" + "="*70)
print("Complete Case Deletion vs MICE Imputation Comparison")
print("="*70)

# Complete case
cfps_complete = cfps.dropna(subset=['income'])

# Estimate return to education
from sklearn.linear_model import LinearRegression

# Complete case estimate
X_complete = cfps_complete[['education', 'age', 'urban']]
y_complete = cfps_complete['income']
model_complete = LinearRegression().fit(X_complete, y_complete)
beta_complete = model_complete.coef_[0]

# MICE imputation estimate
X_mice = cfps_clean[['education', 'age', 'urban']]
y_mice = cfps_clean['income']
model_mice = LinearRegression().fit(X_mice, y_mice)
beta_mice = model_mice.coef_[0]

print(f"\nReturn to education estimate:")
print(f"  Complete case deletion: {beta_complete:.2f}")
print(f"  MICE imputation: {beta_mice:.2f}")
print(f"  Difference: {beta_mice - beta_complete:.2f} ({(beta_mice/beta_complete - 1)*100:.2f}%)")

# 9. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Original vs cleaned income distribution
axes[0, 0].hist(cfps['income'].dropna(), bins=50, alpha=0.5, label='Original', edgecolor='black')
axes[0, 0].hist(cfps_clean['income'], bins=50, alpha=0.5, label='Cleaned', edgecolor='black')
axes[0, 0].set_title('Income Distribution Comparison')
axes[0, 0].set_xlabel('Income')
axes[0, 0].legend()

# Age distribution
axes[0, 1].hist(cfps_clean['age'], bins=30, edgecolor='black', color='green')
axes[0, 1].set_title('Age Distribution (Cleaned)')
axes[0, 1].set_xlabel('Age')

# Education distribution
cfps_clean['education'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 0], color='orange')
axes[1, 0].set_title('Education Distribution (Cleaned)')
axes[1, 0].set_xlabel('Years of Education')
axes[1, 0].set_ylabel('Frequency')

# Income vs education
axes[1, 1].scatter(cfps_clean['education'], cfps_clean['income'], alpha=0.3, s=10)
axes[1, 1].set_title('Education and Income Relationship')
axes[1, 1].set_xlabel('Years of Education')
axes[1, 1].set_ylabel('Income')

plt.tight_layout()
plt.show()

print("\n Data cleaning complete!")

Sensitivity Analysis

python
print("\n" + "="*70)
print("Sensitivity Analysis: Missing Mechanism Assumptions")
print("="*70)

# Assumption 1: MCAR (deletion)
cfps_mcar = cfps.dropna()
X_mcar = cfps_mcar[['education', 'age', 'urban']]
y_mcar = cfps_mcar['income']
model_mcar = LinearRegression().fit(X_mcar, y_mcar)
beta_mcar = model_mcar.coef_[0]

# Assumption 2: MAR (MICE)
beta_mar = beta_mice

# Assumption 3: MNAR (Heckman)
# Simplified: Assume income missingness related to true income
# (In practice, more complex modeling needed)

print(f"\nReturn to education estimate (sensitivity analysis):")
print(f"  MCAR assumption (deletion): {beta_mcar:.2f}")
print(f"  MAR assumption (MICE): {beta_mar:.2f}")
print(f"  Difference: {beta_mar - beta_mcar:.2f} ({abs(beta_mar - beta_mcar) / beta_mcar * 100:.2f}%)")

print(f"\nConclusion:")
if abs(beta_mar - beta_mcar) / beta_mcar < 0.05:
    print(f"   Results not sensitive to missing mechanism assumption (difference < 5%)")
else:
    print(f"  ️  Results sensitive to missing mechanism assumption (difference ≥ 5%)")
    print(f"  Suggestion: Further investigate missing mechanism, may need Heckman correction")

Practice Problems

⭐⭐ Basic: Simple Mean Imputation

Task: Given a dataset with missing values, implement simple mean imputation.

python
# Practice data
np.random.seed(42)
df_practice = pd.DataFrame({
    'x1': np.random.normal(50, 10, 100),
    'x2': np.random.normal(100, 20, 100),
    'x3': np.random.normal(75, 15, 100)
})

# Introduce missingness
for col in df_practice.columns:
    mask = np.random.random(100) < 0.2
    df_practice.loc[mask, col] = np.nan

print("Missing rates:")
print(df_practice.isnull().mean())

# TODO: Implement mean imputation
# df_imputed = ...

# Verification
# assert df_imputed.isnull().sum().sum() == 0

⭐⭐⭐ Intermediate: IQR Outlier Detection

Task: Implement IQR-based outlier detection function and visualize.

python
# TODO: Implement function
def detect_and_visualize_outliers(df, col):
    """
    Detect and visualize outliers

    Returns:
    ------
    outliers : pd.DataFrame
        Outlier data
    """
    # 1. IQR calculation
    # 2. Identify outliers
    # 3. Visualize (box plot + scatter plot)
    pass

# Test
# outliers = detect_and_visualize_outliers(df, 'income')

⭐⭐⭐⭐ Advanced: MICE + Rubin's Rules

Task: Implement complete MICE imputation and combine results using Rubin's Rules.

python
# TODO: Implement
# 1. MICE generates m=5 imputed datasets
# 2. Estimate linear regression coefficients for each dataset
# 3. Combine using Rubin's Rules
# 4. Calculate 95% confidence interval
# 5. Compare with complete case analysis

⭐⭐⭐⭐⭐ Nobel Level: Heckman + Sensitivity Analysis

Task: Implement Heckman two-step method and conduct sensitivity analysis.

python
# TODO: Implement
# 1. Simulate female wage data (married women more likely not to work)
# 2. Heckman two-step estimation
# 3. Compare with regular OLS (delete missing)
# 4. Sensitivity analysis: Change selection equation variables
# 5. Explain economic meaning of Inverse Mills Ratio

Summary

Core Data Cleaning Functions

| Task | Function | Example |
|------|----------|---------|
| Missing values | dropna() | Delete missing |
| | fillna() | Fill missing |
| | IterativeImputer() | MICE imputation |
| Outliers | clip() | Cap |
| | winsorize() | Winsorize |
| | IsolationForest() | ML detection |
| Duplicates | duplicated() | Detect duplicates |
| | drop_duplicates() | Remove duplicates |
| Type conversion | astype() | Type conversion |
| | pd.to_numeric() | Convert to numeric |
| | pd.to_datetime() | Convert to date |
| Strings | str.strip() | Remove whitespace |
| | str.lower() | Convert to lowercase |
| | str.replace() | Replace characters |

Missing Value Handling Decision Tree

Start

├─ Missing rate < 5%?
│  ├─ Yes → Direct deletion (dropna)
│  └─ No → Continue

├─ Little's MCAR Test p > 0.05?
│  ├─ Yes → MCAR, can delete or simple imputation
│  └─ No → MAR/MNAR, need advanced methods

├─ MAR?
│  ├─ Yes → MICE multiple imputation
│  └─ No → MNAR

└─ MNAR?
   ├─ Sample selection problem → Heckman two-step
   ├─ Missingness itself informative → Missing indicator variable
   └─ Uncertain → Sensitivity analysis
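
A hedged helper that codifies the decision tree above as a starting-point suggestion (the thresholds mirror the tree; the function name and signature are illustrative, and it is no substitute for inspecting the data):

python
def suggest_missing_strategy(missing_rate, mcar_p_value=None,
                             selection_problem=False, missingness_informative=False):
    """
    Suggest a starting strategy from the decision tree above.
    missing_rate: fraction of missing values (0-1)
    mcar_p_value: p-value from Little's MCAR test, if available
    """
    if missing_rate < 0.05:
        return "Direct deletion (dropna)"
    if mcar_p_value is not None and mcar_p_value > 0.05:
        return "MCAR plausible: deletion or simple imputation"
    if selection_problem:
        return "MNAR with sample selection: Heckman two-step"
    if missingness_informative:
        return "MNAR: missing indicator variable + sensitivity analysis"
    return "MAR assumed: MICE multiple imputation (+ sensitivity analysis)"

print(suggest_missing_strategy(0.30, mcar_p_value=0.001))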

Data Cleaning Best Practices

  1. Preserve original data: Use df.copy() to create copy
  2. Record cleaning steps: Add comments and log for each step
  3. Explore before cleaning: Understand data before processing
  4. Reproducible workflow: Write functions for reusability (see the sketch after this list)
  5. Verify cleaning results: Recheck data quality after cleaning
  6. Sensitivity analysis: Compare impact of different cleaning strategies
  7. Documentation: Generate cleaning reports
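
A minimal sketch of such a reproducible, logged pipeline (the function and step names are illustrative; a real project would add the diagnostics covered earlier in this section):

python
import pandas as pd

def clean_pipeline(df: pd.DataFrame):
    """Run cleaning steps in a fixed, documented order and keep a log."""
    log = []
    out = df.copy()                  # 1. never modify the original data

    before = len(out)
    out = out.drop_duplicates()      # 2. remove duplicates first
    log.append(f"drop_duplicates: removed {before - len(out)} rows")

    num_cols = out.select_dtypes(include='number').columns
    n_missing = int(out[num_cols].isnull().sum().sum())
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())  # 3. simple imputation
    log.append(f"median imputation: filled {n_missing} cells")

    return out, log

df_raw = pd.DataFrame({'x': [1.0, None, 3.0, 3.0], 'y': [2.0, 5.0, None, None]})
df_clean, log = clean_pipeline(df_raw)
print("\n".join(log))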

Theoretical Key Points

  1. Missing mechanisms:

    • MCAR: Deletion doesn't cause bias
    • MAR: Can be modeled with observed variables
    • MNAR: Needs additional assumptions or sensitivity analysis
  2. MICE:

    • Iterative imputation considering inter-variable relationships
    • Rubin's Rules combine multiple imputed datasets
    • Standard errors correctly reflect uncertainty
  3. Heckman model:

    • Two-step method corrects sample selection bias
    • Inverse Mills Ratio quantifies selection effect
    • Needs exclusion restriction

Next Steps

Next section we will learn Variable Construction, including dummy variables, interactions, lag terms, etc.

Keep going!


References

Classic Textbooks

  • Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd Edition). Wiley.
  • Heckman, J. J. (1979). "Sample Selection Bias as a Specification Error." Econometrica, 47(1), 153-161.
  • van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd Edition). Chapman and Hall/CRC.

Practical Tools

  • missingno: Missing value visualization
  • scikit-learn.impute: Imputation algorithms
  • statsmodels: Statistical modeling (including Heckman)
  • fancyimpute: Advanced imputation methods
