Advanced Pandas Operations
Merging, Reshaping, and Group Statistics — Replicating Stata's Advanced Features
Data Merging
Similar to Stata's merge
python
import pandas as pd
# Main data
df_main = pd.DataFrame({
'country': ['USA', 'China', 'India'],
'population': [330, 1400, 1380]
})
# GDP data
df_gdp = pd.DataFrame({
'country': ['USA', 'China', 'Brazil'],
'gdp': [21, 14, 2]
})
# Merge (inner join)
merged = pd.merge(df_main, df_gdp, on='country', how='inner')
# Result: USA, China (present in both)
# Merge (left join)
merged = pd.merge(df_main, df_gdp, on='country', how='left')
# Result: USA, China, India (keep all from left)
# Merge (outer join)
merged = pd.merge(df_main, df_gdp, on='country', how='outer')
# Result: USA, China, India, Brazil (keep all)Stata Comparison
stata
* Stata
merge 1:1 country using "gdp_data.dta"python
# Pandas equivalent
merged = pd.merge(df_main, df_gdp, on='country', how='outer')Group Statistics (GroupBy)
Basic Grouping
python
df = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'age_group': ['Young', 'Young', 'Old', 'Old', 'Young', 'Old'],
'income': [50000, 55000, 80000, 75000, 60000, 85000]
})
# Group by gender
df.groupby('gender')['income'].mean()
# Group by multiple variables
df.groupby(['gender', 'age_group'])['income'].mean()
# Multiple statistics
df.groupby('gender').agg({
'income': ['mean', 'std', 'count'],
'age': ['mean', 'min', 'max']
})Stata Comparison
stata
* Stata
bysort gender: summarize income
collapse (mean) avg_income=income, by(gender age_group)python
# Pandas equivalent
df.groupby(['gender', 'age_group'])['income'].mean()Data Reshaping
Long ↔ Wide Conversion
python
# Wide format
df_wide = pd.DataFrame({
'id': [1, 2, 3],
'income_2020': [50000, 60000, 70000],
'income_2021': [55000, 65000, 75000],
'income_2022': [60000, 70000, 80000]
})
# Wide → Long (similar to Stata's reshape long)
df_long = pd.melt(
df_wide,
id_vars=['id'],
var_name='year',
value_name='income'
)
# Long → Wide (similar to Stata's reshape wide)
df_wide_again = df_long.pivot(
index='id',
columns='year',
values='income'
)Stata Comparison
stata
* Stata: long → wide
reshape wide income, i(id) j(year)
* Stata: wide → long
reshape long income, i(id) j(year)Pivot Tables
python
df = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'education': ['HS', 'BA', 'BA', 'MA', 'HS', 'MA'],
'income': [45000, 55000, 65000, 80000, 50000, 90000]
})
# Create pivot table
pivot = pd.pivot_table(
df,
values='income',
index='gender',
columns='education',
aggfunc='mean'
)Practical Examples
Example 1: Panel Data Analysis
python
# Multi-year income data
df = pd.DataFrame({
'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
'income': [50000, 55000, 60000, 60000, 65000, 70000, 45000, 50000, 55000]
})
# Calculate income growth rate
df = df.sort_values(['id', 'year'])
df['income_growth'] = df.groupby('id')['income'].pct_change() * 100
# Statistics by individual
summary = df.groupby('id').agg({
'income': ['mean', 'std'],
'income_growth': 'mean'
})Example 2: Cross-Tabulation
python
df = pd.DataFrame({
'gender': ['M']*50 + ['F']*50,
'education': ['HS']*20 + ['BA']*30 + ['BA']*25 + ['MA']*25,
'employed': [True]*40 + [False]*10 + [True]*35 + [False]*15
})
# Cross-tabulation
crosstab = pd.crosstab(
df['gender'],
df['education'],
values=df['employed'],
aggfunc='mean'
) * 100 # Convert to percentage
print("Employment rate by group (%):")
print(crosstab)Example 3: Time Series Resampling
python
# Date index
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100, freq='D'),
'sales': np.random.randint(100, 200, 100)
})
df.set_index('date', inplace=True)
# Weekly aggregation
weekly = df.resample('W').sum()
# Monthly aggregation
monthly = df.resample('M').agg({
'sales': ['sum', 'mean', 'std']
})Advanced Techniques
transform (Preserve Original Structure)
python
df = pd.DataFrame({
'group': ['A', 'A', 'B', 'B', 'C', 'C'],
'value': [10, 20, 30, 40, 50, 60]
})
# Add group mean column
df['group_mean'] = df.groupby('group')['value'].transform('mean')apply (Custom Functions)
python
def standardize(x):
return (x - x.mean()) / x.std()
df['value_std'] = df.groupby('group')['value'].transform(standardize)Practice Exercises
python
# Data: Student grades across multiple semesters
df = pd.DataFrame({
'student_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'semester': ['S1', 'S2', 'S3', 'S1', 'S2', 'S3', 'S1', 'S2', 'S3'],
'gpa': [3.5, 3.7, 3.8, 3.2, 3.4, 3.6, 3.9, 3.8, 4.0]
})
# Tasks:
# 1. Calculate each student's GPA growth rate
# 2. Convert to wide format (student × semester)
# 3. Add average GPA columnNext Steps
The next section will cover Data Visualization (Matplotlib/Seaborn).
Keep going!