
3.5 Data Transformation

"Data transforms are as important as the algorithms themselves.""数据转换与算法本身同样重要。"— Hadley Wickham, Chief Scientist at RStudio (RStudio首席科学家)

Making data more suitable for statistical analysis


📋 Learning Objectives

  • Master standardization and normalization
  • Understand application scenarios for log transformations
  • Learn binning techniques
  • Understand when data transformation is necessary

📊 Standardization

Z-score Standardization

python
from scipy import stats
import pandas as pd
import numpy as np

# Method 1: scipy (ddof=1 so the result matches the pandas calculation below;
# stats.zscore defaults to the population SD, ddof=0)
df['income_std'] = stats.zscore(df['income'], ddof=1)

# Method 2: manual calculation (pandas .std() uses the sample SD, ddof=1)
df['income_std'] = (df['income'] - df['income'].mean()) / df['income'].std()

# Properties: mean = 0, standard deviation = 1

Application Scenarios:

  • Comparing regression coefficients across variables with different units
  • Machine learning algorithms (e.g., SVM, KNN)
  • Detecting outliers (|z| > 3; see the sketch below)
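For the outlier-screening use above, a minimal sketch (assuming df['income_std'] was created with the code in this subsection; the |z| > 3 cutoff is a common rule of thumb rather than a fixed rule):

python
# Flag observations whose standardized income is more than 3 SDs from the mean
outlier_mask = df['income_std'].abs() > 3
print(f"Flagged {outlier_mask.sum()} potential outliers out of {len(df)} observations")
print(df.loc[outlier_mask, ['income', 'income_std']].head())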

Min-Max Normalization

python
# Scale to [0, 1]
df['income_norm'] = (df['income'] - df['income'].min()) / \
                    (df['income'].max() - df['income'].min())

# Scale to [a, b]
a, b = 0, 100
df['income_scaled'] = a + (b - a) * df['income_norm']
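If scikit-learn is available, MinMaxScaler performs the same rescaling; a sketch assuming the same df and income column:

python
from sklearn.preprocessing import MinMaxScaler

# feature_range sets the target interval [a, b]
scaler = MinMaxScaler(feature_range=(0, 100))
df['income_scaled_sk'] = scaler.fit_transform(df[['income']]).ravel()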

Standardized Regression Coefficients

python
import statsmodels.api as sm

# Original regression
X = sm.add_constant(df[['education', 'experience']])
y = df['income']
model_raw = sm.OLS(y, X).fit()

# Standardized regression
df_std = df[['income', 'education', 'experience']].apply(stats.zscore)
X_std = sm.add_constant(df_std[['education', 'experience']])
y_std = df_std['income']
model_std = sm.OLS(y_std, X_std).fit()

print("Original coefficients:", model_raw.params)
print("Standardized coefficients:", model_std.params)

# Standardized coefficient interpretation:
# β_std = 0.4 → one standard deviation increase in education leads to 0.4 standard deviation increase in income
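Because standardization only rescales each variable, the standardized slope can also be recovered directly from the raw regression: β_std = β_raw × sd(x) / sd(y). A quick consistency check using the models above:

python
# Recover the standardized education coefficient from the raw coefficient
beta_check = model_raw.params['education'] * df['education'].std() / df['income'].std()
print("Standardized coefficient recovered from raw model:", beta_check)
print("Standardized coefficient from standardized model:", model_std.params['education'])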

📈 Log Transformation

Why Use Logarithms?

  1. Handling Right-Skewed Distributions
python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original distribution (right-skewed)
axes[0].hist(df['income'], bins=50, edgecolor='black')
axes[0].set_title('Original Income Distribution (Right-Skewed)')

# Log-transformed (approximately normal)
axes[1].hist(np.log(df['income']), bins=50, edgecolor='black', color='green')
axes[1].set_title('log(Income) Distribution (Approximately Normal)')

plt.show()
  2. Stabilizing Variance (reducing heteroskedasticity)

  3. Coefficient Interpretation as Elasticity

python
# log-level model
# log(Y) = β₀ + β₁*X + ε
# Interpretation: 1 unit increase in X → Y increases by approximately 100×β₁%

# log-log model
# log(Y) = β₀ + β₁*log(X) + ε
# Interpretation: 1% increase in X → Y increases by β₁% (elasticity)

# level-log model
# Y = β₀ + β₁*log(X) + ε
# Interpretation: 1% increase in X → Y increases by β₁/100 units

The Four Functional Forms (Level and Log Combinations)

python
import statsmodels.api as sm

# Original data
df['log_wage'] = np.log(df['wage'])
df['log_education'] = np.log(df['education'])

# Model 1: Level-Level
X1 = sm.add_constant(df['education'])
model1 = sm.OLS(df['wage'], X1).fit()
print("Level-Level: β₁ =", model1.params['education'])
print("Interpretation: 1 year increase in education → wage increases by β₁ dollars")

# Model 2: Log-Level
X2 = sm.add_constant(df['education'])
model2 = sm.OLS(df['log_wage'], X2).fit()
print("Log-Level: β₁ =", model2.params['education'])
print(f"Interpretation: 1 year increase in education → wage increases by {model2.params['education']*100:.2f}%")

# Model 3: Level-Log
X3 = sm.add_constant(df['log_education'])
model3 = sm.OLS(df['wage'], X3).fit()
print("Level-Log: β₁ =", model3.params['log_education'])
print(f"Interpretation: 1% increase in education → wage increases by {model3.params['log_education']/100:.2f} dollars")

# Model 4: Log-Log (Elasticity)
X4 = sm.add_constant(df['log_education'])
model4 = sm.OLS(df['log_wage'], X4).fit()
print("Log-Log: β₁ =", model4.params['log_education'])
print("Interpretation: 1% increase in education → wage increases by β₁%")

Handling Zeros and Negative Values

python
# Problem: log(0) = -∞, log(negative) = NaN

# Solution 1: log(x + 1); np.log1p(x) computes log(1 + x) with better numerical stability
df['log_income_plus1'] = np.log1p(df['income'])

# Solution 2: log(x + c), where c is half of the minimum positive value
c = df[df['income'] > 0]['income'].min() / 2
df['log_income_c'] = np.log(df['income'] + c)

# Solution 3: IHS transformation (inverse hyperbolic sine)
df['ihs_income'] = np.arcsinh(df['income'])
# Advantage: can handle negative values, approximates logarithm properties
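For large positive values, arcsinh(x) ≈ log(2x) = log(2) + log(x), which is why IHS coefficients are usually read much like log coefficients. A quick illustration:

python
# arcsinh is defined for zero and negative values, unlike log
x = np.array([0.0, -5.0, 10.0, 1000.0])
print(np.arcsinh(x))
# For the large positive entries, arcsinh(x) is close to log(2x)
print(np.log(2 * x[x > 0]))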

🗂️ Binning

Why Bin?

  • Handle non-linear relationships
  • Robust to outliers
  • Easier interpretation
  • Test piecewise linear assumptions

Equal-Width Binning

python
# Divide age into 4 groups
df['age_group'] = pd.cut(
    df['age'],
    bins=4,
    labels=['Young', 'Middle-aged', 'Mature', 'Senior']
)

# Custom boundaries (kept in a separate column so the labeled groups above remain available)
df['age_bracket'] = pd.cut(
    df['age'],
    bins=[0, 25, 40, 60, 100],
    labels=['<25', '25-40', '40-60', '60+']
)

Equal-Frequency Binning (Quantiles)

python
# Divide into 4 equal-frequency groups (same sample size per group)
df['income_quartile'] = pd.qcut(
    df['income'],
    q=4,
    labels=['Q1', 'Q2', 'Q3', 'Q4']
)

# Custom quantiles
df['income_decile'] = pd.qcut(
    df['income'],
    q=10,
    labels=[f'D{i}' for i in range(1, 11)]
)
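A quick way to see the difference between the two approaches: equal-width bins (pd.cut) can produce very unbalanced group sizes, while equal-frequency bins (pd.qcut) put roughly the same number of observations in each group (assuming the columns created above):

python
# Group sizes under equal-width vs. equal-frequency binning
print(df['age_group'].value_counts().sort_index())        # from pd.cut: sizes can vary widely
print(df['income_quartile'].value_counts().sort_index())  # from pd.qcut: roughly n/4 in each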

Using Bins in Regression

python
# Create dummy variables from the labeled age groups (dtype=int gives numeric 0/1 columns)
age_dummies = pd.get_dummies(df['age_group'], prefix='age', drop_first=True, dtype=int)
df = pd.concat([df, age_dummies], axis=1)

# Regression
X = sm.add_constant(df[['education', 'age_Middle-aged', 'age_Mature', 'age_Senior']])
y = df['log_wage']
model = sm.OLS(y, X).fit()

print(model.summary())

# Coefficient interpretation:
# age_Middle-aged coefficient: wage difference for middle-aged compared to young (reference group)
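The same regression can be run through the statsmodels formula interface, which builds the dummies (and drops a reference category) automatically; a sketch assuming df contains the log_wage, education, and labeled age_group columns used above:

python
import statsmodels.formula.api as smf

# C() tells the formula parser to treat age_group as categorical
model_f = smf.ols('log_wage ~ education + C(age_group)', data=df).fit()
print(model_f.params)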

Visualizing Binning Effects

python
# Calculate mean for each group
age_group_means = df.groupby('age_group')['wage'].mean()

# Plot
age_group_means.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Age Group')
plt.ylabel('Average Wage')
plt.title('Age and Wage Relationship (Binned)')
plt.show()

🔄 Other Transformations

Box-Cox Transformation

python
from scipy.stats import boxcox

# Automatically select the optimal λ parameter (Box-Cox requires strictly positive values)
df['income_boxcox'], lambda_opt = boxcox(df['income'])

print(f"Optimal λ = {lambda_opt:.4f}")
# λ = 0 → log transformation
# λ = 1 → no transformation
# λ = 0.5 → square root transformation
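Values on the Box-Cox scale (for example, fitted values from a regression) can be mapped back to the original units with scipy's inverse transform, using the same λ; a minimal sketch:

python
from scipy.special import inv_boxcox

# Back-transform the Box-Cox values to the original income scale
df['income_back'] = inv_boxcox(df['income_boxcox'], lambda_opt)
print(np.allclose(df['income_back'], df['income']))  # should be True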

Square Root Transformation

python
# Often used to stabilize the variance of count data (e.g., Poisson-distributed counts)
df['sqrt_count'] = np.sqrt(df['count'])

Rank Transformation (Non-parametric)

python
# Convert data to ranks (extremely robust to outliers)
df['income_rank'] = df['income'].rank()

# Percentile rank
df['income_pct_rank'] = df['income'].rank(pct=True)
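Rank transformation is also what underlies Spearman correlation: Spearman's ρ is simply the Pearson correlation computed on ranks. A short check (assuming df also has the education column used earlier):

python
from scipy.stats import spearmanr, pearsonr

rho, _ = spearmanr(df['income'], df['education'])
r_ranks, _ = pearsonr(df['income'].rank(), df['education'].rank())
print(rho, r_ranks)  # the two values should agree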

📚 Summary

Transformation Methods Comparison

| Transformation Method | Application Scenario | Python Code |
| --- | --- | --- |
| Standardization | Coefficient comparability, machine learning | stats.zscore() |
| Normalization | Scaling to the [0, 1] range | (x - min) / (max - min) |
| Logarithm | Right-skewed distributions, elasticity interpretation | np.log() |
| Binning | Non-linearity, robustness | pd.cut(), pd.qcut() |
| Box-Cox | Automatic choice of the optimal transformation | boxcox() |
| Ranking | Extreme robustness to outliers | df.rank() |

When to Transform?

| Problem | Transformation Method |
| --- | --- |
| Right-skewed distribution | Logarithm, Box-Cox |
| Heteroskedasticity | Logarithm, square root |
| Many outliers | Winsorizing, ranking |
| Coefficient comparability | Standardization |
| Non-linear relationship | Polynomial terms, binning |
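Winsorizing appears in the table above but is not demonstrated elsewhere in this section; a minimal sketch that caps income at the 1st and 99th percentiles instead of dropping the extreme values (the cutoffs are illustrative):

python
# Cap extreme values at chosen percentiles rather than removing them
lower, upper = df['income'].quantile([0.01, 0.99])
df['income_wins'] = df['income'].clip(lower=lower, upper=upper)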

📝 Practice Exercises

  1. Apply log transformation to a right-skewed income variable and compare the distributions before and after transformation.

  2. Add polynomial terms of age (up to the fourth degree) to a regression and test for a non-linear relationship.

  3. Bin education years (below high school, high school, college, graduate), create dummy variables, and run regression.


🎯 Next Steps

Next section: Data Merging and Reshaping


References:

  • Box, G. E. P., & Cox, D. R. (1964). "An Analysis of Transformations." Journal of the Royal Statistical Society, Series B.
  • Wooldridge, J. M. (2020). Introductory Econometrics: A Modern Approach, Chapter 6 (Functional Form).
