Skip to content

3.6 Data Merging and Reshaping

"The art of data wrangling is the art of knowing what shape your data should be in.""数据整理的艺术在于知道数据应该是什么形状。"— Wes McKinney, Creator of pandas (pandas创始人)

Integrating multi-source data to construct analytical datasets


📋 Learning Objectives

  • Master data merging (merge, join, concat)
  • Understand wide format and long format conversions
  • Learn panel data construction
  • Avoid common merging errors

🔗 Data Merging (Merge & Join)

Four Types of Merges

TypeDescriptionSQL EquivalentUse Case
innerIntersection (both sides match)INNER JOINKeep only matched records
leftKeep all from left tableLEFT JOINMain dataset is complete
rightKeep all from right tableRIGHT JOINRarely used (use left instead)
outerUnion (all records)FULL OUTER JOINCompleteness check

Basic Merge

python
import pandas as pd

# Example: Individual data + Regional GDP
individuals = pd.DataFrame({
    'person_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'region': ['East', 'West', 'East', 'South']
})

region_gdp = pd.DataFrame({
    'region': ['East', 'West', 'North'],
    'gdp': [100, 120, 90]
})

# Left join (keep all individuals)
merged = pd.merge(
    individuals,
    region_gdp,
    on='region',
    how='left'
)

print(merged)
#    person_id     name region    gdp
# 0          1    Alice   East  100.0
# 1          2      Bob   West  120.0
# 2          3  Charlie   East  100.0
# 3          4    David  South    NaN  # South doesn't exist in region_gdp

Multi-Key Merge

python
# Merge based on multiple columns
df1 = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019],
    'region': ['A', 'B', 'A', 'B'],
    'sales': [100, 120, 110, 130]
})

df2 = pd.DataFrame({
    'year': [2018, 2019],
    'region': ['A', 'A'],
    'gdp': [500, 520]
})

merged = pd.merge(
    df1, df2,
    on=['year', 'region'],
    how='left'
)

Merging with Different Column Names

python
# Left table uses 'region_code', right table uses 'code'
merged = pd.merge(
    df1, df2,
    left_on='region_code',
    right_on='code',
    how='left'
)

Index-Based Merge

python
# Merge based on index
merged = pd.merge(
    df1, df2,
    left_index=True,
    right_index=True,
    how='left'
)

# Or use join directly
merged = df1.join(df2, how='left')

Merge Validation (Important!)

python
# Validate merge relationship
merged = pd.merge(
    individuals,
    region_gdp,
    on='region',
    how='left',
    validate='m:1'  # Many-to-one relationship
)

# Common relationship types:
# '1:1' - One-to-one
# '1:m' - One-to-many
# 'm:1' - Many-to-one
# 'm:m' - Many-to-many (usually not recommended)

⚠️ Common Error: Cartesian Product

python
# Wrong example: region is not unique in both tables
df1 = pd.DataFrame({'region': ['A', 'A', 'B'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'region': ['A', 'A'], 'y': [10, 20]})

merged = pd.merge(df1, df2, on='region')
# Result: 4 rows (2×2 Cartesian product)
#   region  x   y
# 0      A  1  10
# 1      A  1  20  # Duplicate!
# 2      A  2  10  # Duplicate!
# 3      A  2  20  # Duplicate!

# Solution: deduplicate first or use validate parameter

📚 Data Concatenation (Concat)

Vertical Concatenation (Stacking)

python
# Combine multi-year data
df_2018 = pd.read_csv('survey_2018.csv')
df_2019 = pd.read_csv('survey_2019.csv')
df_2020 = pd.read_csv('survey_2020.csv')

# Vertical stacking
df_panel = pd.concat(
    [df_2018, df_2019, df_2020],
    ignore_index=True
)

# Add year identifier
for year, df in zip([2018, 2019, 2020], [df_2018, df_2019, df_2020]):
    df['year'] = year

df_panel = pd.concat([df_2018, df_2019, df_2020], ignore_index=True)

Horizontal Concatenation

python
# Merge columns (ensure index alignment)
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [4, 5, 6]})

df_combined = pd.concat([df1, df2], axis=1)
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6

🔄 Wide Format ↔ Long Format

Long Format → Wide Format (Pivot)

python
# Long format (panel data)
df_long = pd.DataFrame({
    'person_id': [1, 1, 1, 2, 2, 2],
    'year': [2018, 2019, 2020, 2018, 2019, 2020],
    'income': [50000, 52000, 55000, 60000, 62000, 65000]
})

# Convert to wide format
df_wide = df_long.pivot(
    index='person_id',
    columns='year',
    values='income'
)

print(df_wide)
# year       2018   2019   2020
# person_id
# 1         50000  52000  55000
# 2         60000  62000  65000

Wide Format → Long Format (Melt)

python
# Wide format
df_wide = pd.DataFrame({
    'person_id': [1, 2],
    'income_2018': [50000, 60000],
    'income_2019': [52000, 62000],
    'income_2020': [55000, 65000]
})

# Convert to long format
df_long = pd.melt(
    df_wide,
    id_vars=['person_id'],
    value_vars=['income_2018', 'income_2019', 'income_2020'],
    var_name='year',
    value_name='income'
)

# Clean year column
df_long['year'] = df_long['year'].str.extract(r'(\d{4})').astype(int)

print(df_long)
#    person_id  year  income
# 0          1  2018   50000
# 1          2  2018   60000
# 2          1  2019   52000
# 3          2  2019   62000
# 4          1  2020   55000
# 5          2  2020   65000

When to Use Which Format?

FormatUse Cases
Long formatPanel data regression, time series analysis, visualization
Wide formatExcel reports, cross-sectional data analysis, human readability

📊 Panel Data Construction

Complete Workflow

python
import pandas as pd
import numpy as np

# Step 1: Read multi-year data
df_2018 = pd.read_csv('survey_2018.csv')
df_2019 = pd.read_csv('survey_2019.csv')
df_2020 = pd.read_csv('survey_2020.csv')

# Add year
df_2018['year'] = 2018
df_2019['year'] = 2019
df_2020['year'] = 2020

# Step 2: Vertical merge
df_panel = pd.concat([df_2018, df_2019, df_2020], ignore_index=True)

# Step 3: Sort (important!)
df_panel = df_panel.sort_values(['person_id', 'year'])

# Step 4: Set multi-level index
df_panel = df_panel.set_index(['person_id', 'year'])

# Step 5: Check balance
obs_per_person = df_panel.groupby('person_id').size()
if obs_per_person.nunique() == 1:
    print("✓ Balanced panel")
else:
    print("⚠️ Unbalanced panel")
    print(obs_per_person.value_counts())

# Step 6: Create lags
df_panel = df_panel.sort_index()
df_panel['income_lag1'] = df_panel.groupby('person_id')['income'].shift(1)

# Step 7: Panel regression
from linearmodels.panel import PanelOLS

model = PanelOLS(
    df_panel['income'],
    df_panel[['education', 'experience', 'income_lag1']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

print(model)

📚 Summary

Core Functions

TaskFunctionExample
Mergepd.merge()Merge based on keys
df.join()Merge based on index
Concatenatepd.concat()Vertical/horizontal concatenation
Reshapedf.pivot()Long → Wide
pd.melt()Wide → Long

Best Practices

  1. Use validate parameter: Prevent accidental Cartesian products
  2. Check merge results: Compare row counts before and after merging
  3. Handle missing values: Check for NaN after merging
  4. Preserve original data: Use df.copy()
  5. Document merge logic: Comment on keys and merge types

Common Errors

ErrorConsequenceSolution
Not sorting before creating lagsIncorrect lag variablessort_values() first
Duplicate keys causing Cartesian productRow explosionUse validate
Inconsistent merge key formatsCannot matchUnify case, remove spaces
Forgetting how parameterData lossExplicitly specify how='left'

📝 Practice Exercises

  1. Merge a student information table (student_id, name) with a grades table (student_id, subject, score), keeping all students.

  2. Convert wide format panel data (id, income_2018, income_2019, income_2020) to long format.

  3. Construct a balanced panel dataset with 100 individuals and 3 years of data, and create a 1-period lag of income.


🎯 Next Steps

Final section: Complete Case Study


References:

  • Python for Data Analysis (3rd Ed) - Chapter 8
  • Pandas Official Documentation: Merge, Join, Concatenate

Released under the MIT License. Content © Author.