Pandas 高级操作
合并、重塑、分组统计 —— 复刻 Stata 的高级功能
数据合并(Merge)
类似 Stata 的 merge
python
import pandas as pd
# 主数据
df_main = pd.DataFrame({
'country': ['USA', 'China', 'India'],
'population': [330, 1400, 1380]
})
# GDP 数据
df_gdp = pd.DataFrame({
'country': ['USA', 'China', 'Brazil'],
'gdp': [21, 14, 2]
})
# 合并(inner join)
merged = pd.merge(df_main, df_gdp, on='country', how='inner')
# 结果:USA, China(两边都有)
# 合并(left join)
merged = pd.merge(df_main, df_gdp, on='country', how='left')
# 结果:USA, China, India(保留左边所有)
# 合并(outer join)
merged = pd.merge(df_main, df_gdp, on='country', how='outer')
# 结果:USA, China, India, Brazil(保留所有)Stata 对比
stata
* Stata
merge 1:1 country using "gdp_data.dta"python
# Pandas 等价
merged = pd.merge(df_main, df_gdp, on='country', how='outer')分组统计(GroupBy)
基本分组
python
df = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'age_group': ['Young', 'Young', 'Old', 'Old', 'Young', 'Old'],
'income': [50000, 55000, 80000, 75000, 60000, 85000]
})
# 按性别分组
df.groupby('gender')['income'].mean()
# 按多个变量分组
df.groupby(['gender', 'age_group'])['income'].mean()
# 多种统计
df.groupby('gender').agg({
'income': ['mean', 'std', 'count'],
'age': ['mean', 'min', 'max']
})Stata 对比
stata
* Stata
bysort gender: summarize income
collapse (mean) avg_income=income, by(gender age_group)python
# Pandas 等价
df.groupby(['gender', 'age_group'])['income'].mean()数据重塑
Long ↔ Wide 转换
python
# Wide 格式
df_wide = pd.DataFrame({
'id': [1, 2, 3],
'income_2020': [50000, 60000, 70000],
'income_2021': [55000, 65000, 75000],
'income_2022': [60000, 70000, 80000]
})
# Wide → Long (类似 Stata 的 reshape long)
df_long = pd.melt(
df_wide,
id_vars=['id'],
var_name='year',
value_name='income'
)
# Long → Wide (类似 Stata 的 reshape wide)
df_wide_again = df_long.pivot(
index='id',
columns='year',
values='income'
)Stata 对比
stata
* Stata: long → wide
reshape wide income, i(id) j(year)
* Stata: wide → long
reshape long income, i(id) j(year)透视表(Pivot Table)
python
df = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'education': ['HS', 'BA', 'BA', 'MA', 'HS', 'MA'],
'income': [45000, 55000, 65000, 80000, 50000, 90000]
})
# 创建透视表
pivot = pd.pivot_table(
df,
values='income',
index='gender',
columns='education',
aggfunc='mean'
)实战案例
案例 1:面板数据分析
python
# 多年收入数据
df = pd.DataFrame({
'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
'income': [50000, 55000, 60000, 60000, 65000, 70000, 45000, 50000, 55000]
})
# 计算收入增长率
df = df.sort_values(['id', 'year'])
df['income_growth'] = df.groupby('id')['income'].pct_change() * 100
# 按个体统计
summary = df.groupby('id').agg({
'income': ['mean', 'std'],
'income_growth': 'mean'
})案例 2:交叉统计表
python
df = pd.DataFrame({
'gender': ['M']*50 + ['F']*50,
'education': ['HS']*20 + ['BA']*30 + ['BA']*25 + ['MA']*25,
'employed': [True]*40 + [False]*10 + [True]*35 + [False]*15
})
# 交叉表
crosstab = pd.crosstab(
df['gender'],
df['education'],
values=df['employed'],
aggfunc='mean'
) * 100 # 转为百分比
print("各组就业率(%):")
print(crosstab)案例 3:时间序列重采样
python
# 日期索引
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100, freq='D'),
'sales': np.random.randint(100, 200, 100)
})
df.set_index('date', inplace=True)
# 按周汇总
weekly = df.resample('W').sum()
# 按月汇总
monthly = df.resample('M').agg({
'sales': ['sum', 'mean', 'std']
})高级技巧
transform(保持原数据结构)
python
df = pd.DataFrame({
'group': ['A', 'A', 'B', 'B', 'C', 'C'],
'value': [10, 20, 30, 40, 50, 60]
})
# 添加组均值列
df['group_mean'] = df.groupby('group')['value'].transform('mean')apply(自定义函数)
python
def standardize(x):
return (x - x.mean()) / x.std()
df['value_std'] = df.groupby('group')['value'].transform(standardize)练习题
python
# 数据:学生多学期成绩
df = pd.DataFrame({
'student_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'semester': ['S1', 'S2', 'S3', 'S1', 'S2', 'S3', 'S1', 'S2', 'S3'],
'gpa': [3.5, 3.7, 3.8, 3.2, 3.4, 3.6, 3.9, 3.8, 4.0]
})
# 任务:
# 1. 计算每个学生的 GPA 增长率
# 2. 转换为 wide 格式(学生×学期)
# 3. 添加平均 GPA 列下一步
下一节学习 数据可视化(Matplotlib/Seaborn)。
继续!