Python vs Stata vs R: A Syntax Comparison Cheat Sheet
Build a Python mindset quickly by mapping it onto the Stata/R you already know
Core Concept Comparison
1. The DataFrame Concept
All three languages center on a two-dimensional data table:
| Concept | Stata | R | Python (pandas) |
|---|---|---|---|
| Data frame | Dataset (only one in memory at a time) | data.frame | DataFrame (multiple can coexist) |
| Variable (column) | Variable | Column | Column |
| Observation (row) | Observation | Row | Row |
Key difference:
- Stata: operates on only one dataset at a time
- R/Python: can work with multiple data frames simultaneously
Common Operations Compared
Operation 1: Read a CSV file
```stata
* Stata
import delimited "data.csv", clear
```

```r
# R
df <- read.csv("data.csv")
```

```python
# Python
import pandas as pd
df = pd.read_csv("data.csv")
```

Operation 2: View the first few rows
```stata
* Stata
list in 1/5
browse in 1/5
```

```r
# R
head(df)
```

```python
# Python
df.head()
```

Operation 3: Create a new variable
Example: create the log of income
```stata
* Stata
gen log_income = log(income)
```

```r
# R
df$log_income <- log(df$income)
```

```python
# Python
import numpy as np
df['log_income'] = np.log(df['income'])
```

Operation 4: Conditional filtering
Example: keep observations with age greater than 30
```stata
* Stata
keep if age > 30
```

```r
# R (base R)
df_filtered <- df[df$age > 30, ]

# or with dplyr
library(dplyr)
df_filtered <- df %>% filter(age > 30)
```

```python
# Python
df_filtered = df[df['age'] > 30]
```

Operation 5: Group and summarize
Example: compute mean income by country
```stata
* Stata
collapse (mean) avg_income=income, by(country)
```

```r
# R (base R)
aggregate(income ~ country, data=df, FUN=mean)

# R (dplyr)
df %>%
  group_by(country) %>%
  summarise(avg_income = mean(income))
```

```python
# Python
df.groupby('country')['income'].mean()

# or, more explicitly
df.groupby('country').agg({'income': 'mean'})
```

Operation 6: Descriptive statistics
```stata
* Stata
summarize income age education
```

```r
# R
summary(df[c("income", "age", "education")])
```

```python
# Python
df[['income', 'age', 'education']].describe()
```

Operation 7: Regression analysis
Example: OLS regression
```stata
* Stata
regress income education age i.gender
```

```r
# R
model <- lm(income ~ education + age + factor(gender), data=df)
summary(model)
```

```python
# Python (statsmodels, closest to Stata)
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age + C(gender)', data=df).fit()
print(model.summary())

# or scikit-learn (more concise, but different output)
from sklearn.linear_model import LinearRegression
X = df[['education', 'age']]
y = df['income']
model = LinearRegression().fit(X, y)
```

Operation 8: Merge datasets
Example: merge two datasets on country
```stata
* Stata
merge 1:1 country using "gdp_data.dta"
```

```r
# R (base R)
merged <- merge(df1, df2, by="country", all=TRUE)

# R (dplyr)
merged <- df1 %>% full_join(df2, by="country")
```

```python
# Python
merged = pd.merge(df1, df2, on='country', how='outer')
```

Key Syntax Differences
1. Indexing
| Operation | Stata | R | Python |
|---|---|---|---|
| Select a column | income | df$income | df['income'] |
| Select multiple columns | keep income age | df[c("income", "age")] | df[['income', 'age']] |
| Select rows | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
2. Assignment
| Operation | Stata | R | Python |
|---|---|---|---|
| Create a new variable | gen x = 1 | df$x <- 1 | df['x'] = 1 |
| Conditionally replace | replace x = 2 if age > 30 | df$x[df$age > 30] <- 2 | df.loc[df['age'] > 30, 'x'] = 2 |
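A minimal pandas sketch of the two assignment patterns above, using a small made-up DataFrame (the `age` and `x` columns are illustrative only); note that conditional replacement should go through `.loc` rather than chained indexing:

```python
import pandas as pd

# Toy data, purely illustrative
df = pd.DataFrame({'age': [25, 35, 45]})

# Create a new variable (Stata: gen x = 1)
df['x'] = 1

# Conditional replacement (Stata: replace x = 2 if age > 30).
# Use .loc for the combined row/column selection; chained indexing such as
# df[df['age'] > 30]['x'] = 2 writes to a temporary copy and is unreliable.
df.loc[df['age'] > 30, 'x'] = 2
print(df)
```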
3. Missing values
| Operation | Stata | R | Python |
|---|---|---|---|
| Missing value marker | . | NA | NaN or None |
| Drop missing values | drop if missing(income) | df <- na.omit(df) | df.dropna() |
| Fill missing values | replace income = 0 if missing(income) | df$income[is.na(df$income)] <- 0 | df['income'].fillna(0) |
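A short pandas sketch of the missing-value operations in the table, on a made-up `income` column; note that `dropna()` and `fillna()` return new objects, so the result must be assigned back:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value, purely illustrative
df = pd.DataFrame({'income': [50_000, np.nan, 62_000]})

# Drop rows with missing income (Stata: drop if missing(income))
df_complete = df.dropna(subset=['income'])

# Fill missing income with 0 (Stata: replace income = 0 if missing(income));
# fillna returns a new Series, so assign it back to the column
df['income'] = df['income'].fillna(0)
print(df_complete, df, sep='\n')
```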
Mindset Shift: From Stata/R to Python
For Stata Users
Stata's mindset: operate on one dataset at a time

```stata
use "data1.dta", clear
gen new_var = old_var * 2
save "data1_new.dta", replace
```

Python's mindset: work with multiple data frames at once

```python
df1 = pd.read_csv("data1.csv")
df1['new_var'] = df1['old_var'] * 2
df1.to_csv("data1_new.csv")

# df2, df3, ... can all live in memory at the same time
df2 = pd.read_csv("data2.csv")
```

For R Users
R's mindset: functional programming with pipes

```r
df %>%
  filter(age > 30) %>%
  mutate(log_income = log(income)) %>%
  group_by(country) %>%
  summarise(mean_income = mean(income))
```

Python's mindset: method chaining on objects (similar to R's pipe)

```python
(df
 .query('age > 30')
 .assign(log_income=lambda x: np.log(x['income']))
 .groupby('country')
 .agg({'income': 'mean'})
)
```

Worked Example: Replicating a Classic Stata Analysis
Stata Code

```stata
* 1. Load the data
use "survey_data.dta", clear

* 2. Clean the data
drop if missing(income)
keep if age >= 18 & age <= 65

* 3. Create new variables
gen log_income = log(income)
gen age_squared = age^2

* 4. Descriptive statistics
tabstat income education age, by(gender) stat(mean sd)

* 5. Regression
regress log_income education age age_squared i.gender
```

Equivalent Python Code
```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# 1. Load the data
df = pd.read_stata("survey_data.dta")

# 2. Clean the data
df = df.dropna(subset=['income'])
df = df[(df['age'] >= 18) & (df['age'] <= 65)]

# 3. Create new variables
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# 4. Descriptive statistics
df.groupby('gender')[['income', 'education', 'age']].agg(['mean', 'std'])

# 5. Regression
model = smf.ols('log_income ~ education + age + age_squared + C(gender)', data=df).fit()
print(model.summary())
```

Comparing the results: the Python output is nearly identical to Stata's!
Python's Distinctive Strengths
1. Work with multiple datasets at once

```python
# Load data for several countries at the same time
df_china = pd.read_csv("china_data.csv")
df_usa = pd.read_csv("usa_data.csv")
df_india = pd.read_csv("india_data.csv")

# Stack them into one DataFrame
df_all = pd.concat([df_china, df_usa, df_india])
```

2. Loops (more flexible than Stata)
```python
# Log-transform several variables in one loop
for var in ['income', 'gdp', 'population']:
    df[f'log_{var}'] = np.log(df[var])
```

3. Call external APIs directly
```python
# Much harder in Stata: fetch data directly from a web API
import requests
response = requests.get("https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD?format=json")
data = response.json()
```

Advanced Operations Compared
Operation 9: Panel regression (fixed effects)
Example: estimate the effect of education on wages, controlling for individual fixed effects

```stata
* Stata - very concise!
xtset individual_id year
xtreg wage education experience, fe
```

```r
# R (plm package)
library(plm)
model <- plm(wage ~ education + experience,
             data = panel_data,
             index = c("individual_id", "year"),
             model = "within")
summary(model)
```

```python
# Python (linearmodels)
from linearmodels.panel import PanelOLS

# Set a MultiIndex of entity and time
panel_data = df.set_index(['individual_id', 'year'])

# Fixed-effects regression
model = PanelOLS.from_formula(
    'wage ~ education + experience + EntityEffects',
    data=panel_data
)
result = model.fit(cov_type='clustered', cluster_entity=True)
print(result.summary)
```

Note: Stata is the most concise for panel data, but Python's linearmodels is just as capable.
Operation 10: Categorical variables (factor encoding)

```stata
* Stata - handled automatically with factor notation
regress wage i.education_level i.industry
```

```r
# R - handled automatically via factor()
model <- lm(wage ~ factor(education_level) + factor(industry),
            data = df)
```

```python
# Python - must be specified explicitly
import statsmodels.formula.api as smf
model = smf.ols('wage ~ C(education_level) + C(industry)',
                data=df).fit()

# or encode manually with pd.get_dummies()
df_encoded = pd.get_dummies(df,
                            columns=['education_level', 'industry'],
                            drop_first=True)
```

Operation 11: Time-series operations
Example: compute lags and growth rates

```stata
* Stata
tsset date
gen gdp_lag1 = L.gdp
gen gdp_growth = (gdp - L.gdp) / L.gdp
```

```r
# R (dplyr)
df <- df %>%
  arrange(date) %>%
  mutate(gdp_lag1 = lag(gdp),
         gdp_growth = (gdp - lag(gdp)) / lag(gdp))
```

```python
# Python (pandas)
df = df.sort_values('date')
df['gdp_lag1'] = df['gdp'].shift(1)
df['gdp_growth'] = df['gdp'].pct_change()
```

Operation 12: String processing
Example: extract the last name and change case

```stata
* Stata
gen last_name = word(name, -1)
gen name_upper = upper(name)
```

```r
# R (stringr)
library(stringr)
df$last_name <- word(df$name, -1)
df$name_upper <- str_to_upper(df$name)
```

```python
# Python (pandas .str methods)
df['last_name'] = df['name'].str.split().str[-1]
df['name_upper'] = df['name'].str.upper()

# or with a regular expression
df['last_name'] = df['name'].str.extract(r'(\w+)$')
```

Comparing a Full Analysis Workflow
Case Study: A Complete Empirical Research Workflow
Research question: estimate the effect of a minimum-wage policy on employment (difference-in-differences design)
Stata Implementation

```stata
* 1. Load the data
use "employment_data.dta", clear

* 2. Clean the data
drop if missing(employment, wage, treatment)
keep if year >= 2010 & year <= 2020

* 3. Generate the interaction terms
gen post = (year >= 2015)
gen treated = (state == "CA")
gen did = post * treated

* 4. Descriptive statistics (tabstat's by() takes a single variable, so summarize by cell)
bysort treated post: summarize employment

* 5. DID regression
regress employment did treated post controls, cluster(state)

* 6. Parallel-trends test
forvalues y = 2010/2020 {
    gen year_`y' = (year == `y')
    gen treat_year_`y' = treated * year_`y'
}
regress employment treat_year_*, cluster(state)
```

Python Implementation
```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# 1. Load the data
df = pd.read_stata("employment_data.dta")

# 2. Clean the data
df = df.dropna(subset=['employment', 'wage', 'treatment'])
df = df[(df['year'] >= 2010) & (df['year'] <= 2020)]

# 3. Generate the interaction terms
df['post'] = (df['year'] >= 2015).astype(int)
df['treated'] = (df['state'] == "CA").astype(int)
df['did'] = df['post'] * df['treated']

# 4. Descriptive statistics
summary = df.groupby(['treated', 'post'])['employment'].agg([
    ('mean', 'mean'),
    ('std', 'std'),
    ('n', 'count')
])
print(summary)

# 5. DID regression (clustered standard errors)
model = smf.ols('employment ~ did + treated + post + controls',
                data=df).fit(cov_type='cluster',
                             cov_kwds={'groups': df['state']})
print(model.summary())

# 6. Parallel-trends test
df['year_str'] = df['year'].astype(str)
model_parallel = smf.ols(
    'employment ~ C(year_str):treated + C(year_str) + controls',
    data=df
).fit(cov_type='cluster', cov_kwds={'groups': df['state']})

# Extract the coefficients and plot them
coefs = model_parallel.params.filter(regex='year_str.*treated')
ci = model_parallel.conf_int().loc[coefs.index]
plt.figure(figsize=(10, 6))
plt.errorbar(range(len(coefs)), coefs,
             yerr=[(coefs - ci[0]), (ci[1] - coefs)],
             fmt='o-')
plt.axhline(y=0, color='red', linestyle='--')
plt.axvline(x=5, color='gray', linestyle='--', label='Treatment Year')
plt.title('Parallel Trends Test')
plt.xlabel('Year')
plt.ylabel('Treatment Effect')
plt.legend()
plt.show()
```

Comparison:
- Stata: more concise (roughly 20 lines)
- Python: somewhat longer (roughly 40 lines), but stronger on visualization and flexibility
Data Visualization Compared
Case: a coefficient plot of regression results
Stata

```stata
regress wage education experience female urban
coefplot, drop(_cons) xline(0)
```

R (ggplot2)

```r
library(ggplot2)
library(broom)

model <- lm(wage ~ education + experience + female + urban,
            data = df)
coef_df <- tidy(model, conf.int = TRUE)

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Regression Coefficients",
       x = "Estimate", y = "Variable")
```

Python (matplotlib + seaborn)
```python
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

model = smf.ols('wage ~ education + experience + female + urban',
                data=df).fit()

# Extract the coefficients and confidence intervals
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')

fig, ax = plt.subplots(figsize=(8, 6))
y_pos = range(len(coefs))
ax.errorbar(coefs, y_pos,
            xerr=[(coefs - ci[0]), (ci[1] - coefs)],
            fmt='o', capsize=5)
ax.axvline(x=0, color='red', linestyle='--', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(coefs.index)
ax.set_xlabel('Coefficient Estimate')
ax.set_title('Regression Coefficients with 95% CI')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
```

Performance Comparison
Speed on large data (1 million rows)
| Operation | Stata | R (dplyr) | Python (pandas) |
|---|---|---|---|
| Read CSV | ~5 s | ~3 s | ~2 s |
| Group and summarize | ~2 s | ~1 s | ~0.8 s |
| Merge | ~4 s | ~2 s | ~1.5 s |
| Regression | ~0.5 s | ~0.8 s | ~0.6 s |
Note: Python + pandas is usually fastest on large data operations, but Stata's regression routines are highly optimized.
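Timings like these depend heavily on hardware, file format, and library versions, so treat the table as rough orders of magnitude. A minimal sketch of how the pandas side could be timed on synthetic data (the column names and group labels are made up for illustration):

```python
import time
import numpy as np
import pandas as pd

# Synthetic data: 1 million rows, illustrative only
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'country': rng.choice(['CN', 'US', 'IN', 'DE'], size=n),
    'income': rng.lognormal(mean=10, sigma=1, size=n),
})

# Time a group-by aggregation, the second row of the table above
start = time.perf_counter()
result = df.groupby('country')['income'].mean()
elapsed = time.perf_counter() - start
print(f"groupby mean over {n:,} rows: {elapsed:.3f} s")
```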
Ecosystem Comparison
Stata's strengths
- Deepest coverage of applied econometrics (IV, DID, RDD, PSM)
- Most concise panel-data handling
- Regression output closest to journal conventions
- Concentrated community (Statalist, SSC)
R's strengths
- Broadest range of statistical methods (Bayesian, survival analysis, factor analysis)
- Most elegant data visualization (ggplot2)
- Free and open source
- Huge CRAN package ecosystem (20,000+ packages)
Python's strengths
- Strongest machine-learning ecosystem (scikit-learn, PyTorch, TensorFlow)
- The de facto standard for deep learning and LLM work
- Strongest general-purpose programming capabilities
- Most complete data-engineering tooling (web scraping, APIs, databases)
- Highest demand in the job market
Learning Advice
If you come from Stata
- Focus on pandas (it can replicate roughly 80% of what you do in Stata)
- Use statsmodels (its output format is closest to Stata's)
- Remember: `df['var']` ≈ referring to a variable by name in Stata
- Learning path:
  - Week 1: pandas basics (the equivalent of Stata's data management)
  - Week 2: statsmodels regression (the equivalent of Stata's regress)
  - Week 3: linearmodels for panel data (the equivalent of Stata's xtreg)
  - Week 4: scikit-learn for machine learning (something Stata cannot do)
If you come from R
- Learn pandas (similar to dplyr + data.table)
- Use plotnine (a Python port of ggplot2)
- Remember: Python chains methods with `.` rather than `%>%`
- Learning path:
  - Week 1: basic Python syntax (R users can pick it up in as little as 3 days)
  - Week 2: pandas (similar to the tidyverse)
  - Week 3: Matplotlib/Seaborn (not as elegant as ggplot2, but good enough)
  - Week 4: scikit-learn + PyTorch (R's weak spot)
A Mixed Three-Language Strategy
Best practice: choose the tool for the task

```
Data cleaning → Python (pandas)
        ↓
Descriptive statistics → Stata/R (personal preference)
        ↓
Traditional econometrics → Stata (panel data, IV)
        ↓
Machine learning → Python (scikit-learn)
        ↓
Text analysis → Python (transformers)
        ↓
Visualization → R (ggplot2) or Python (seaborn)
```

Next Steps
In the next section, we will write our first Python program and experience Python's simplicity and power.
Ready?