Skip to content

Advanced Pandas Operations

Merging, Reshaping, and Group Statistics — Replicating Stata's Advanced Features


Data Merging

Similar to Stata's merge

python
import pandas as pd

# Main data
df_main = pd.DataFrame({
    'country': ['USA', 'China', 'India'],
    'population': [330, 1400, 1380]
})

# GDP data
df_gdp = pd.DataFrame({
    'country': ['USA', 'China', 'Brazil'],
    'gdp': [21, 14, 2]
})

# Merge (inner join)
merged = pd.merge(df_main, df_gdp, on='country', how='inner')
# Result: USA, China (present in both)

# Merge (left join)
merged = pd.merge(df_main, df_gdp, on='country', how='left')
# Result: USA, China, India (keep all from left)

# Merge (outer join)
merged = pd.merge(df_main, df_gdp, on='country', how='outer')
# Result: USA, China, India, Brazil (keep all)

Stata Comparison

stata
* Stata
merge 1:1 country using "gdp_data.dta"
python
# Pandas equivalent
merged = pd.merge(df_main, df_gdp, on='country', how='outer')

Group Statistics (GroupBy)

Basic Grouping

python
df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'age_group': ['Young', 'Young', 'Old', 'Old', 'Young', 'Old'],
    'income': [50000, 55000, 80000, 75000, 60000, 85000]
})

# Group by gender
df.groupby('gender')['income'].mean()

# Group by multiple variables
df.groupby(['gender', 'age_group'])['income'].mean()

# Multiple statistics
df.groupby('gender').agg({
    'income': ['mean', 'std', 'count'],
    'age': ['mean', 'min', 'max']
})

Stata Comparison

stata
* Stata
bysort gender: summarize income
collapse (mean) avg_income=income, by(gender age_group)
python
# Pandas equivalent
df.groupby(['gender', 'age_group'])['income'].mean()

Data Reshaping

Long ↔ Wide Conversion

python
# Wide format
df_wide = pd.DataFrame({
    'id': [1, 2, 3],
    'income_2020': [50000, 60000, 70000],
    'income_2021': [55000, 65000, 75000],
    'income_2022': [60000, 70000, 80000]
})

# Wide → Long (similar to Stata's reshape long)
df_long = pd.melt(
    df_wide,
    id_vars=['id'],
    var_name='year',
    value_name='income'
)

# Long → Wide (similar to Stata's reshape wide)
df_wide_again = df_long.pivot(
    index='id',
    columns='year',
    values='income'
)

Stata Comparison

stata
* Stata: long → wide
reshape wide income, i(id) j(year)

* Stata: wide → long
reshape long income, i(id) j(year)

Pivot Tables

python
df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'education': ['HS', 'BA', 'BA', 'MA', 'HS', 'MA'],
    'income': [45000, 55000, 65000, 80000, 50000, 90000]
})

# Create pivot table
pivot = pd.pivot_table(
    df,
    values='income',
    index='gender',
    columns='education',
    aggfunc='mean'
)

Practical Examples

Example 1: Panel Data Analysis

python
# Multi-year income data
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
    'income': [50000, 55000, 60000, 60000, 65000, 70000, 45000, 50000, 55000]
})

# Calculate income growth rate
df = df.sort_values(['id', 'year'])
df['income_growth'] = df.groupby('id')['income'].pct_change() * 100

# Statistics by individual
summary = df.groupby('id').agg({
    'income': ['mean', 'std'],
    'income_growth': 'mean'
})

Example 2: Cross-Tabulation

python
df = pd.DataFrame({
    'gender': ['M']*50 + ['F']*50,
    'education': ['HS']*20 + ['BA']*30 + ['BA']*25 + ['MA']*25,
    'employed': [True]*40 + [False]*10 + [True]*35 + [False]*15
})

# Cross-tabulation
crosstab = pd.crosstab(
    df['gender'],
    df['education'],
    values=df['employed'],
    aggfunc='mean'
) * 100  # Convert to percentage

print("Employment rate by group (%):")
print(crosstab)

Example 3: Time Series Resampling

python
# Date index
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'sales': np.random.randint(100, 200, 100)
})
df.set_index('date', inplace=True)

# Weekly aggregation
weekly = df.resample('W').sum()

# Monthly aggregation
monthly = df.resample('M').agg({
    'sales': ['sum', 'mean', 'std']
})

Advanced Techniques

transform (Preserve Original Structure)

python
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'value': [10, 20, 30, 40, 50, 60]
})

# Add group mean column
df['group_mean'] = df.groupby('group')['value'].transform('mean')

apply (Custom Functions)

python
def standardize(x):
    return (x - x.mean()) / x.std()

df['value_std'] = df.groupby('group')['value'].transform(standardize)

Practice Exercises

python
# Data: Student grades across multiple semesters
df = pd.DataFrame({
    'student_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'semester': ['S1', 'S2', 'S3', 'S1', 'S2', 'S3', 'S1', 'S2', 'S3'],
    'gpa': [3.5, 3.7, 3.8, 3.2, 3.4, 3.6, 3.9, 3.8, 4.0]
})

# Tasks:
# 1. Calculate each student's GPA growth rate
# 2. Convert to wide format (student × semester)
# 3. Add average GPA column

Next Steps

The next section will cover Data Visualization (Matplotlib/Seaborn).

Keep going!

Released under the MIT License. Content © Author.