3.6 Data Merging and Reshaping

"The art of data wrangling is the art of knowing what shape your data should be in.""数据整理的艺术在于知道数据应该是什么形状。"— Wes McKinney, Creator of pandas (pandas创始人)

Integrating multi-source data to construct analytical datasets

📋 Learning Objectives

Master data merging (merge, join, concat)
Understand wide format and long format conversions
Learn panel data construction
Avoid common merging errors

🔗 Data Merging (Merge & Join)

Four Types of Merges

Type	Description	SQL Equivalent	Use Case
`inner`	Intersection (both sides match)	INNER JOIN	Keep only matched records
`left`	Keep all from left table	LEFT JOIN	Main dataset is complete
`right`	Keep all from right table	RIGHT JOIN	Rarely used (use left instead)
`outer`	Union (all records)	FULL OUTER JOIN	Completeness check

Basic Merge

python

import pandas as pd

# Example: Individual data + Regional GDP
individuals = pd.DataFrame({
    'person_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'region': ['East', 'West', 'East', 'South']
})

region_gdp = pd.DataFrame({
    'region': ['East', 'West', 'North'],
    'gdp': [100, 120, 90]
})

# Left join (keep all individuals)
merged = pd.merge(
    individuals,
    region_gdp,
    on='region',
    how='left'
)

print(merged)
#    person_id     name region    gdp
# 0          1    Alice   East  100.0
# 1          2      Bob   West  120.0
# 2          3  Charlie   East  100.0
# 3          4    David  South    NaN  # South doesn't exist in region_gdp

Multi-Key Merge

python

# Merge based on multiple columns
df1 = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019],
    'region': ['A', 'B', 'A', 'B'],
    'sales': [100, 120, 110, 130]
})

df2 = pd.DataFrame({
    'year': [2018, 2019],
    'region': ['A', 'A'],
    'gdp': [500, 520]
})

merged = pd.merge(
    df1, df2,
    on=['year', 'region'],
    how='left'
)

Merging with Different Column Names

python

# Left table uses 'region_code', right table uses 'code'
merged = pd.merge(
    df1, df2,
    left_on='region_code',
    right_on='code',
    how='left'
)

Index-Based Merge

python

# Merge based on index
merged = pd.merge(
    df1, df2,
    left_index=True,
    right_index=True,
    how='left'
)

# Or use join directly
merged = df1.join(df2, how='left')

Merge Validation (Important!)

python

# Validate merge relationship
merged = pd.merge(
    individuals,
    region_gdp,
    on='region',
    how='left',
    validate='m:1'  # Many-to-one relationship
)

# Common relationship types:
# '1:1' - One-to-one
# '1:m' - One-to-many
# 'm:1' - Many-to-one
# 'm:m' - Many-to-many (usually not recommended)

⚠️ Common Error: Cartesian Product

python

# Wrong example: region is not unique in both tables
df1 = pd.DataFrame({'region': ['A', 'A', 'B'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'region': ['A', 'A'], 'y': [10, 20]})

merged = pd.merge(df1, df2, on='region')
# Result: 4 rows (2×2 Cartesian product)
#   region  x   y
# 0      A  1  10
# 1      A  1  20  # Duplicate!
# 2      A  2  10  # Duplicate!
# 3      A  2  20  # Duplicate!

# Solution: deduplicate first or use validate parameter

📚 Data Concatenation (Concat)

Vertical Concatenation (Stacking)

python

# Combine multi-year data
df_2018 = pd.read_csv('survey_2018.csv')
df_2019 = pd.read_csv('survey_2019.csv')
df_2020 = pd.read_csv('survey_2020.csv')

# Vertical stacking
df_panel = pd.concat(
    [df_2018, df_2019, df_2020],
    ignore_index=True
)

# Add year identifier
for year, df in zip([2018, 2019, 2020], [df_2018, df_2019, df_2020]):
    df['year'] = year

df_panel = pd.concat([df_2018, df_2019, df_2020], ignore_index=True)

Horizontal Concatenation

python

# Merge columns (ensure index alignment)
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [4, 5, 6]})

df_combined = pd.concat([df1, df2], axis=1)
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6

🔄 Wide Format ↔ Long Format

Long Format → Wide Format (Pivot)

python

# Long format (panel data)
df_long = pd.DataFrame({
    'person_id': [1, 1, 1, 2, 2, 2],
    'year': [2018, 2019, 2020, 2018, 2019, 2020],
    'income': [50000, 52000, 55000, 60000, 62000, 65000]
})

# Convert to wide format
df_wide = df_long.pivot(
    index='person_id',
    columns='year',
    values='income'
)

print(df_wide)
# year       2018   2019   2020
# person_id
# 1         50000  52000  55000
# 2         60000  62000  65000

Wide Format → Long Format (Melt)

python

# Wide format
df_wide = pd.DataFrame({
    'person_id': [1, 2],
    'income_2018': [50000, 60000],
    'income_2019': [52000, 62000],
    'income_2020': [55000, 65000]
})

# Convert to long format
df_long = pd.melt(
    df_wide,
    id_vars=['person_id'],
    value_vars=['income_2018', 'income_2019', 'income_2020'],
    var_name='year',
    value_name='income'
)

# Clean year column
df_long['year'] = df_long['year'].str.extract(r'(\d{4})').astype(int)

print(df_long)
#    person_id  year  income
# 0          1  2018   50000
# 1          2  2018   60000
# 2          1  2019   52000
# 3          2  2019   62000
# 4          1  2020   55000
# 5          2  2020   65000

When to Use Which Format?

Format	Use Cases
Long format	Panel data regression, time series analysis, visualization
Wide format	Excel reports, cross-sectional data analysis, human readability

📊 Panel Data Construction

Complete Workflow

python

import pandas as pd
import numpy as np

# Step 1: Read multi-year data
df_2018 = pd.read_csv('survey_2018.csv')
df_2019 = pd.read_csv('survey_2019.csv')
df_2020 = pd.read_csv('survey_2020.csv')

# Add year
df_2018['year'] = 2018
df_2019['year'] = 2019
df_2020['year'] = 2020

# Step 2: Vertical merge
df_panel = pd.concat([df_2018, df_2019, df_2020], ignore_index=True)

# Step 3: Sort (important!)
df_panel = df_panel.sort_values(['person_id', 'year'])

# Step 4: Set multi-level index
df_panel = df_panel.set_index(['person_id', 'year'])

# Step 5: Check balance
obs_per_person = df_panel.groupby('person_id').size()
if obs_per_person.nunique() == 1:
    print("✓ Balanced panel")
else:
    print("⚠️ Unbalanced panel")
    print(obs_per_person.value_counts())

# Step 6: Create lags
df_panel = df_panel.sort_index()
df_panel['income_lag1'] = df_panel.groupby('person_id')['income'].shift(1)

# Step 7: Panel regression
from linearmodels.panel import PanelOLS

model = PanelOLS(
    df_panel['income'],
    df_panel[['education', 'experience', 'income_lag1']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

print(model)

📚 Summary

Core Functions

Task	Function	Example
Merge	`pd.merge()`	Merge based on keys
	`df.join()`	Merge based on index
Concatenate	`pd.concat()`	Vertical/horizontal concatenation
Reshape	`df.pivot()`	Long → Wide
	`pd.melt()`	Wide → Long

Best Practices

✅ Use validate parameter: Prevent accidental Cartesian products
✅ Check merge results: Compare row counts before and after merging
✅ Handle missing values: Check for NaN after merging
✅ Preserve original data: Use df.copy()
✅ Document merge logic: Comment on keys and merge types

Common Errors

Error	Consequence	Solution
Not sorting before creating lags	Incorrect lag variables	`sort_values()` first
Duplicate keys causing Cartesian product	Row explosion	Use `validate`
Inconsistent merge key formats	Cannot match	Unify case, remove spaces
Forgetting `how` parameter	Data loss	Explicitly specify `how='left'`

📝 Practice Exercises

Merge a student information table (student_id, name) with a grades table (student_id, subject, score), keeping all students.
Convert wide format panel data (id, income_2018, income_2019, income_2020) to long format.
Construct a balanced panel dataset with 100 individuals and 3 years of data, and create a 1-period lag of income.

🎯 Next Steps

Final section: Complete Case Study

References:

Python for Data Analysis (3rd Ed) - Chapter 8
Pandas Official Documentation: Merge, Join, Concatenate

3.6 Data Merging and Reshaping ​

📋 Learning Objectives ​

🔗 Data Merging (Merge & Join) ​

Four Types of Merges ​

Basic Merge ​

Multi-Key Merge ​

Merging with Different Column Names ​

Index-Based Merge ​

Merge Validation (Important!) ​

📚 Data Concatenation (Concat) ​

Vertical Concatenation (Stacking) ​

Horizontal Concatenation ​

🔄 Wide Format ↔ Long Format ​

Long Format → Wide Format (Pivot) ​

Wide Format → Long Format (Melt) ​

When to Use Which Format? ​

📊 Panel Data Construction ​

Complete Workflow ​

📚 Summary ​

Core Functions ​

Best Practices ​

Common Errors ​

📝 Practice Exercises ​

🎯 Next Steps ​

3.6 Data Merging and Reshaping

📋 Learning Objectives

🔗 Data Merging (Merge & Join)

Four Types of Merges

Basic Merge

Multi-Key Merge

Merging with Different Column Names

Index-Based Merge

Merge Validation (Important!)

📚 Data Concatenation (Concat)

Vertical Concatenation (Stacking)

Horizontal Concatenation

🔄 Wide Format ↔ Long Format

Long Format → Wide Format (Pivot)

Wide Format → Long Format (Melt)

When to Use Which Format?

📊 Panel Data Construction

Complete Workflow

📚 Summary

Core Functions

Best Practices

Common Errors

📝 Practice Exercises

🎯 Next Steps