3.6 Data Merging and Reshaping
"The art of data wrangling is the art of knowing what shape your data should be in.""数据整理的艺术在于知道数据应该是什么形状。"— Wes McKinney, Creator of pandas (pandas创始人)
Integrating multi-source data to construct analytical datasets
📋 Learning Objectives
- Master data merging (merge, join, concat)
- Understand wide format and long format conversions
- Learn panel data construction
- Avoid common merging errors
🔗 Data Merging (Merge & Join)
Four Types of Merges
| Type | Description | SQL Equivalent | Use Case |
|---|---|---|---|
inner | Intersection (both sides match) | INNER JOIN | Keep only matched records |
left | Keep all from left table | LEFT JOIN | Main dataset is complete |
right | Keep all from right table | RIGHT JOIN | Rarely used (use left instead) |
outer | Union (all records) | FULL OUTER JOIN | Completeness check |
Basic Merge
python
import pandas as pd
# Example: Individual data + Regional GDP
individuals = pd.DataFrame({
'person_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'region': ['East', 'West', 'East', 'South']
})
region_gdp = pd.DataFrame({
'region': ['East', 'West', 'North'],
'gdp': [100, 120, 90]
})
# Left join (keep all individuals)
merged = pd.merge(
individuals,
region_gdp,
on='region',
how='left'
)
print(merged)
# person_id name region gdp
# 0 1 Alice East 100.0
# 1 2 Bob West 120.0
# 2 3 Charlie East 100.0
# 3 4 David South NaN # South doesn't exist in region_gdpMulti-Key Merge
python
# Merge based on multiple columns
df1 = pd.DataFrame({
'year': [2018, 2018, 2019, 2019],
'region': ['A', 'B', 'A', 'B'],
'sales': [100, 120, 110, 130]
})
df2 = pd.DataFrame({
'year': [2018, 2019],
'region': ['A', 'A'],
'gdp': [500, 520]
})
merged = pd.merge(
df1, df2,
on=['year', 'region'],
how='left'
)Merging with Different Column Names
python
# Left table uses 'region_code', right table uses 'code'
merged = pd.merge(
df1, df2,
left_on='region_code',
right_on='code',
how='left'
)Index-Based Merge
python
# Merge based on index
merged = pd.merge(
df1, df2,
left_index=True,
right_index=True,
how='left'
)
# Or use join directly
merged = df1.join(df2, how='left')Merge Validation (Important!)
python
# Validate merge relationship
merged = pd.merge(
individuals,
region_gdp,
on='region',
how='left',
validate='m:1' # Many-to-one relationship
)
# Common relationship types:
# '1:1' - One-to-one
# '1:m' - One-to-many
# 'm:1' - Many-to-one
# 'm:m' - Many-to-many (usually not recommended)⚠️ Common Error: Cartesian Product
python
# Wrong example: region is not unique in both tables
df1 = pd.DataFrame({'region': ['A', 'A', 'B'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'region': ['A', 'A'], 'y': [10, 20]})
merged = pd.merge(df1, df2, on='region')
# Result: 4 rows (2×2 Cartesian product)
# region x y
# 0 A 1 10
# 1 A 1 20 # Duplicate!
# 2 A 2 10 # Duplicate!
# 3 A 2 20 # Duplicate!
# Solution: deduplicate first or use validate parameter📚 Data Concatenation (Concat)
Vertical Concatenation (Stacking)
python
# Combine multi-year data
df_2018 = pd.read_csv('survey_2018.csv')
df_2019 = pd.read_csv('survey_2019.csv')
df_2020 = pd.read_csv('survey_2020.csv')
# Vertical stacking
df_panel = pd.concat(
[df_2018, df_2019, df_2020],
ignore_index=True
)
# Add year identifier
for year, df in zip([2018, 2019, 2020], [df_2018, df_2019, df_2020]):
df['year'] = year
df_panel = pd.concat([df_2018, df_2019, df_2020], ignore_index=True)Horizontal Concatenation
python
# Merge columns (ensure index alignment)
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [4, 5, 6]})
df_combined = pd.concat([df1, df2], axis=1)
# A B
# 0 1 4
# 1 2 5
# 2 3 6🔄 Wide Format ↔ Long Format
Long Format → Wide Format (Pivot)
python
# Long format (panel data)
df_long = pd.DataFrame({
'person_id': [1, 1, 1, 2, 2, 2],
'year': [2018, 2019, 2020, 2018, 2019, 2020],
'income': [50000, 52000, 55000, 60000, 62000, 65000]
})
# Convert to wide format
df_wide = df_long.pivot(
index='person_id',
columns='year',
values='income'
)
print(df_wide)
# year 2018 2019 2020
# person_id
# 1 50000 52000 55000
# 2 60000 62000 65000Wide Format → Long Format (Melt)
python
# Wide format
df_wide = pd.DataFrame({
'person_id': [1, 2],
'income_2018': [50000, 60000],
'income_2019': [52000, 62000],
'income_2020': [55000, 65000]
})
# Convert to long format
df_long = pd.melt(
df_wide,
id_vars=['person_id'],
value_vars=['income_2018', 'income_2019', 'income_2020'],
var_name='year',
value_name='income'
)
# Clean year column
df_long['year'] = df_long['year'].str.extract(r'(\d{4})').astype(int)
print(df_long)
# person_id year income
# 0 1 2018 50000
# 1 2 2018 60000
# 2 1 2019 52000
# 3 2 2019 62000
# 4 1 2020 55000
# 5 2 2020 65000When to Use Which Format?
| Format | Use Cases |
|---|---|
| Long format | Panel data regression, time series analysis, visualization |
| Wide format | Excel reports, cross-sectional data analysis, human readability |
📊 Panel Data Construction
Complete Workflow
python
import pandas as pd
import numpy as np
# Step 1: Read multi-year data
df_2018 = pd.read_csv('survey_2018.csv')
df_2019 = pd.read_csv('survey_2019.csv')
df_2020 = pd.read_csv('survey_2020.csv')
# Add year
df_2018['year'] = 2018
df_2019['year'] = 2019
df_2020['year'] = 2020
# Step 2: Vertical merge
df_panel = pd.concat([df_2018, df_2019, df_2020], ignore_index=True)
# Step 3: Sort (important!)
df_panel = df_panel.sort_values(['person_id', 'year'])
# Step 4: Set multi-level index
df_panel = df_panel.set_index(['person_id', 'year'])
# Step 5: Check balance
obs_per_person = df_panel.groupby('person_id').size()
if obs_per_person.nunique() == 1:
print("✓ Balanced panel")
else:
print("⚠️ Unbalanced panel")
print(obs_per_person.value_counts())
# Step 6: Create lags
df_panel = df_panel.sort_index()
df_panel['income_lag1'] = df_panel.groupby('person_id')['income'].shift(1)
# Step 7: Panel regression
from linearmodels.panel import PanelOLS
model = PanelOLS(
df_panel['income'],
df_panel[['education', 'experience', 'income_lag1']],
entity_effects=True,
time_effects=True
).fit(cov_type='clustered', cluster_entity=True)
print(model)📚 Summary
Core Functions
| Task | Function | Example |
|---|---|---|
| Merge | pd.merge() | Merge based on keys |
df.join() | Merge based on index | |
| Concatenate | pd.concat() | Vertical/horizontal concatenation |
| Reshape | df.pivot() | Long → Wide |
pd.melt() | Wide → Long |
Best Practices
- ✅ Use validate parameter: Prevent accidental Cartesian products
- ✅ Check merge results: Compare row counts before and after merging
- ✅ Handle missing values: Check for NaN after merging
- ✅ Preserve original data: Use
df.copy() - ✅ Document merge logic: Comment on keys and merge types
Common Errors
| Error | Consequence | Solution |
|---|---|---|
| Not sorting before creating lags | Incorrect lag variables | sort_values() first |
| Duplicate keys causing Cartesian product | Row explosion | Use validate |
| Inconsistent merge key formats | Cannot match | Unify case, remove spaces |
Forgetting how parameter | Data loss | Explicitly specify how='left' |
📝 Practice Exercises
Merge a student information table (student_id, name) with a grades table (student_id, subject, score), keeping all students.
Convert wide format panel data (id, income_2018, income_2019, income_2020) to long format.
Construct a balanced panel dataset with 100 individuals and 3 years of data, and create a 1-period lag of income.
🎯 Next Steps
Final section: Complete Case Study
References:
- Python for Data Analysis (3rd Ed) - Chapter 8
- Pandas Official Documentation: Merge, Join, Concatenate