
3.5 Data Transformation

"Data transforms are as important as the algorithms themselves.""数据转换与算法本身同样重要。"— Hadley Wickham, Chief Scientist at RStudio (RStudio首席科学家)

Making data more suitable for statistical analysis


📋 Learning Objectives

  • Master standardization and normalization
  • Understand application scenarios for log transformations
  • Learn binning techniques
  • Understand when data transformation is necessary

📊 Standardization

Z-score Standardization

python
from scipy import stats
import pandas as pd
import numpy as np

# Method 1: scipy (ddof=1 so the result matches the pandas calculation below;
# stats.zscore defaults to the population SD, ddof=0)
df['income_std'] = stats.zscore(df['income'], ddof=1)

# Method 2: manual calculation (pandas .std() uses the sample SD, ddof=1)
df['income_std'] = (df['income'] - df['income'].mean()) / df['income'].std()

# Properties: mean = 0, standard deviation = 1

Application Scenarios:

  • Comparing regression coefficients across variables with different units
  • Machine learning algorithms (e.g., SVM, KNN)
  • Detecting outliers (|z| > 3; see the sketch below)
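For the outlier-screening use above, a minimal sketch (assuming df['income_std'] was created with the code in this subsection; the |z| > 3 cutoff is a common rule of thumb rather than a fixed rule):

python
# Flag observations whose standardized income is more than 3 SDs from the mean
outlier_mask = df['income_std'].abs() > 3
print(f"Flagged {outlier_mask.sum()} potential outliers out of {len(df)} observations")
print(df.loc[outlier_mask, ['income', 'income_std']].head())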

Min-Max Normalization

python
# Scale to [0, 1]
df['income_norm'] = (df['income'] - df['income'].min()) / \
                    (df['income'].max() - df['income'].min())

# Scale to [a, b]
a, b = 0, 100
df['income_scaled'] = a + (b - a) * df['income_norm']
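If scikit-learn is available, MinMaxScaler performs the same rescaling; a sketch assuming the same df and income column:

python
from sklearn.preprocessing import MinMaxScaler

# feature_range sets the target interval [a, b]
scaler = MinMaxScaler(feature_range=(0, 100))
df['income_scaled_sk'] = scaler.fit_transform(df[['income']]).ravel()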

Standardized Regression Coefficients

python
import statsmodels.api as sm

# Original regression
X = sm.add_constant(df[['education', 'experience']])
y = df['income']
model_raw = sm.OLS(y, X).fit()

# Standardized regression
df_std = df[['income', 'education', 'experience']].apply(stats.zscore)
X_std = sm.add_constant(df_std[['education', 'experience']])
y_std = df_std['income']
model_std = sm.OLS(y_std, X_std).fit()

print("Original coefficients:", model_raw.params)
print("Standardized coefficients:", model_std.params)

# Standardized coefficient interpretation:
# β_std = 0.4 → one standard deviation increase in education leads to 0.4 standard deviation increase in income
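Because standardization only rescales each variable, the standardized slope can also be recovered directly from the raw regression: β_std = β_raw × sd(x) / sd(y). A quick consistency check using the models above:

python
# Recover the standardized education coefficient from the raw coefficient
beta_check = model_raw.params['education'] * df['education'].std() / df['income'].std()
print("Standardized coefficient recovered from raw model:", beta_check)
print("Standardized coefficient from standardized model:", model_std.params['education'])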

📈 Log Transformation

Why Use Logarithms?

  1. Handling Right-Skewed Distributions
python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original distribution (right-skewed)
axes[0].hist(df['income'], bins=50, edgecolor='black')
axes[0].set_title('Original Income Distribution (Right-Skewed)')

# Log-transformed (approximately normal)
axes[1].hist(np.log(df['income']), bins=50, edgecolor='black', color='green')
axes[1].set_title('log(Income) Distribution (Approximately Normal)')

plt.show()
  2. Stabilizing Variance (reducing heteroskedasticity)

  3. Coefficient Interpretation as Elasticity

python
# log-level model
# log(Y) = β₀ + β₁*X + ε
# Interpretation: 1 unit increase in X → Y increases by approximately 100×β₁%

# log-log model
# log(Y) = β₀ + β₁*log(X) + ε
# Interpretation: 1% increase in X → Y increases by β₁% (elasticity)

# level-log model
# Y = β₀ + β₁*log(X) + ε
# Interpretation: 1% increase in X → Y increases by β₁/100 units

The Four Functional Forms (Level and Log Combinations)

python
import statsmodels.api as sm

# Original data
df['log_wage'] = np.log(df['wage'])
df['log_education'] = np.log(df['education'])

# Model 1: Level-Level
X1 = sm.add_constant(df['education'])
model1 = sm.OLS(df['wage'], X1).fit()
print("Level-Level: β₁ =", model1.params['education'])
print("Interpretation: 1 year increase in education → wage increases by β₁ dollars")

# Model 2: Log-Level
X2 = sm.add_constant(df['education'])
model2 = sm.OLS(df['log_wage'], X2).fit()
print("Log-Level: β₁ =", model2.params['education'])
print(f"Interpretation: 1 year increase in education → wage increases by {model2.params['education']*100:.2f}%")

# Model 3: Level-Log
X3 = sm.add_constant(df['log_education'])
model3 = sm.OLS(df['wage'], X3).fit()
print("Level-Log: β₁ =", model3.params['log_education'])
print(f"Interpretation: 1% increase in education → wage increases by {model3.params['log_education']/100:.2f} dollars")

# Model 4: Log-Log (Elasticity)
X4 = sm.add_constant(df['log_education'])
model4 = sm.OLS(df['log_wage'], X4).fit()
print("Log-Log: β₁ =", model4.params['log_education'])
print("Interpretation: 1% increase in education → wage increases by β₁%")

Handling Zeros and Negative Values

python
# Problem: log(0) = -∞, log(negative) = NaN

# Solution 1: log(x + 1); np.log1p(x) computes log(1 + x) with better numerical stability
df['log_income_plus1'] = np.log1p(df['income'])

# Solution 2: log(x + c), where c is half of the minimum positive value
c = df[df['income'] > 0]['income'].min() / 2
df['log_income_c'] = np.log(df['income'] + c)

# Solution 3: IHS transformation (inverse hyperbolic sine)
df['ihs_income'] = np.arcsinh(df['income'])
# Advantage: can handle negative values, approximates logarithm properties
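For large positive values, arcsinh(x) ≈ log(2x) = log(2) + log(x), which is why IHS coefficients are usually read much like log coefficients. A quick illustration:

python
# arcsinh is defined for zero and negative values, unlike log
x = np.array([0.0, -5.0, 10.0, 1000.0])
print(np.arcsinh(x))
# For the large positive entries, arcsinh(x) is close to log(2x)
print(np.log(2 * x[x > 0]))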

🗂️ Binning

Why Bin?

  • Handle non-linear relationships
  • Robust to outliers
  • Easier interpretation
  • Test piecewise linear assumptions

Equal-Width Binning

python
# Divide age into 4 groups
df['age_group'] = pd.cut(
    df['age'],
    bins=4,
    labels=['Young', 'Middle-aged', 'Mature', 'Senior']
)

# Custom boundaries (kept in a separate column so the labeled groups above remain available)
df['age_bracket'] = pd.cut(
    df['age'],
    bins=[0, 25, 40, 60, 100],
    labels=['<25', '25-40', '40-60', '60+']
)

Equal-Frequency Binning (Quantiles)

python
# Divide into 4 equal-frequency groups (same sample size per group)
df['income_quartile'] = pd.qcut(
    df['income'],
    q=4,
    labels=['Q1', 'Q2', 'Q3', 'Q4']
)

# Custom quantiles
df['income_decile'] = pd.qcut(
    df['income'],
    q=10,
    labels=[f'D{i}' for i in range(1, 11)]
)
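A quick way to see the difference between the two approaches: equal-width bins (pd.cut) can produce very unbalanced group sizes, while equal-frequency bins (pd.qcut) put roughly the same number of observations in each group (assuming the columns created above):

python
# Group sizes under equal-width vs. equal-frequency binning
print(df['age_group'].value_counts().sort_index())        # from pd.cut: sizes can vary widely
print(df['income_quartile'].value_counts().sort_index())  # from pd.qcut: roughly n/4 in each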

Using Bins in Regression

python
# Create dummy variables from the labeled age groups (dtype=int gives numeric 0/1 columns)
age_dummies = pd.get_dummies(df['age_group'], prefix='age', drop_first=True, dtype=int)
df = pd.concat([df, age_dummies], axis=1)

# Regression
X = sm.add_constant(df[['education', 'age_Middle-aged', 'age_Mature', 'age_Senior']])
y = df['log_wage']
model = sm.OLS(y, X).fit()

print(model.summary())

# Coefficient interpretation:
# age_Middle-aged coefficient: wage difference for middle-aged compared to young (reference group)
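The same regression can be run through the statsmodels formula interface, which builds the dummies (and drops a reference category) automatically; a sketch assuming df contains the log_wage, education, and labeled age_group columns used above:

python
import statsmodels.formula.api as smf

# C() tells the formula parser to treat age_group as categorical
model_f = smf.ols('log_wage ~ education + C(age_group)', data=df).fit()
print(model_f.params)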

Visualizing Binning Effects

python
# Calculate mean for each group
age_group_means = df.groupby('age_group')['wage'].mean()

# Plot
age_group_means.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Age Group')
plt.ylabel('Average Wage')
plt.title('Age and Wage Relationship (Binned)')
plt.show()

🔄 Other Transformations

Box-Cox Transformation

python
from scipy.stats import boxcox

# Automatically select the optimal λ parameter (Box-Cox requires strictly positive values)
df['income_boxcox'], lambda_opt = boxcox(df['income'])

print(f"Optimal λ = {lambda_opt:.4f}")
# λ = 0 → log transformation
# λ = 1 → no transformation
# λ = 0.5 → square root transformation
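Values on the Box-Cox scale (for example, fitted values from a regression) can be mapped back to the original units with scipy's inverse transform, using the same λ; a minimal sketch:

python
from scipy.special import inv_boxcox

# Back-transform the Box-Cox values to the original income scale
df['income_back'] = inv_boxcox(df['income_boxcox'], lambda_opt)
print(np.allclose(df['income_back'], df['income']))  # should be True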

Square Root Transformation

python
# Often used to stabilize the variance of count data (e.g., Poisson-distributed counts)
df['sqrt_count'] = np.sqrt(df['count'])

Rank Transformation (Non-parametric)

python
# Convert data to ranks (extremely robust to outliers)
df['income_rank'] = df['income'].rank()

# Percentile rank
df['income_pct_rank'] = df['income'].rank(pct=True)
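Rank transformation is also what underlies Spearman correlation: Spearman's ρ is simply the Pearson correlation computed on ranks. A short check (assuming df also has the education column used earlier):

python
from scipy.stats import spearmanr, pearsonr

rho, _ = spearmanr(df['income'], df['education'])
r_ranks, _ = pearsonr(df['income'].rank(), df['education'].rank())
print(rho, r_ranks)  # the two values should agree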

📚 Summary

Transformation Methods Comparison

| Transformation Method | Application Scenario | Python Code |
| --- | --- | --- |
| Standardization | Coefficient comparability, machine learning | stats.zscore() |
| Normalization | Scaling to the [0, 1] range | (x - min) / (max - min) |
| Logarithm | Right-skewed distributions, elasticity interpretation | np.log() |
| Binning | Non-linearity, robustness | pd.cut(), pd.qcut() |
| Box-Cox | Automatic choice of the optimal transformation | boxcox() |
| Ranking | Extreme robustness to outliers | df.rank() |

When to Transform?

| Problem | Transformation Method |
| --- | --- |
| Right-skewed distribution | Logarithm, Box-Cox |
| Heteroskedasticity | Logarithm, square root |
| Many outliers | Winsorizing, ranking |
| Coefficient comparability | Standardization |
| Non-linear relationship | Polynomial terms, binning |
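Winsorizing appears in the table above but is not demonstrated elsewhere in this section; a minimal sketch that caps income at the 1st and 99th percentiles instead of dropping the extreme values (the cutoffs are illustrative):

python
# Cap extreme values at chosen percentiles rather than removing them
lower, upper = df['income'].quantile([0.01, 0.99])
df['income_wins'] = df['income'].clip(lower=lower, upper=upper)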

📝 Practice Exercises

  1. Apply log transformation to a right-skewed income variable and compare the distributions before and after transformation.

  2. Add polynomial terms of age (up to the fourth degree) to a regression and test for a non-linear relationship.

  3. Bin education years (below high school, high school, college, graduate), create dummy variables, and run regression.


🎯 Next Steps

Next section: Data Merging and Reshaping


References:

  • Box, G. E. P., & Cox, D. R. (1964). "An Analysis of Transformations." Journal of the Royal Statistical Society, Series B.
  • Wooldridge, J. M. (2020). Introductory Econometrics: A Modern Approach, Chapter 6 (Functional Form).
