
Python vs Stata vs R: A Syntax Comparison Cheat Sheet

Build Python intuition quickly by mapping it onto the Stata and R you already know


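Core Concepts Compared

1. The DataFrame concept

All three languages are built around the same core object: a two-dimensional data table.

| Concept | Stata | R | Python (pandas) |
|---|---|---|---|
| Data frame | Dataset (traditionally one in memory at a time) | data.frame | DataFrame (many can coexist) |
| Variable (column) | Variable | Column | Column |
| Observation (row) | Observation | Row | Row |

Key difference

  • Stata: the classic workflow operates on one dataset at a time (frames, added in Stata 16, relax this)
  • R/Python: several data frames can be held and worked on at once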

Common Operations Compared

Operation 1: Read a CSV file

stata
* Stata
import delimited "data.csv", clear
r
# R
df <- read.csv("data.csv")
python
# Python
import pandas as pd
df = pd.read_csv("data.csv")

Operation 2: View the first few rows

stata
* Stata
list in 1/5
browse in 1/5
r
# R
head(df)
python
# Python
df.head()

Operation 3: Create a new variable

Example: create the log of income

stata
* Stata
gen log_income = log(income)
r
# R
df$log_income <- log(df$income)
python
# Python
import numpy as np
df['log_income'] = np.log(df['income'])

Operation 4: Filter by condition

Example: keep observations with age greater than 30

stata
* Stata
keep if age > 30
r
# R
df_filtered <- df[df$age > 30, ]
# or with dplyr
df_filtered <- df %>% filter(age > 30)
python
# Python
df_filtered = df[df['age'] > 30]

Operation 5: Group and summarize

Example: mean income by country

stata
* Stata
collapse (mean) avg_income=income, by(country)
r
# R (base R)
aggregate(income ~ country, data=df, FUN=mean)

# R (dplyr)
df %>%
  group_by(country) %>%
  summarise(avg_income = mean(income))
python
# Python
df.groupby('country')['income'].mean()

# or the more explicit form
df.groupby('country').agg({'income': 'mean'})
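
The two pandas calls above return the mean under the original column name `income`, whereas the Stata `collapse` and dplyr `summarise` examples name the result `avg_income`. A minimal sketch of one way to get the same output shape, using pandas named aggregation on made-up toy data (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the example above.
df = pd.DataFrame({
    "country": ["CN", "CN", "US", "US"],
    "income": [10.0, 20.0, 30.0, 50.0],
})

# Named aggregation: new_column=(source_column, function).
# as_index=False keeps country as an ordinary column, as collapse does.
collapsed = df.groupby("country", as_index=False).agg(
    avg_income=("income", "mean")
)
print(collapsed)
```

Unlike `collapse`, this leaves the original `df` untouched and returns a new data frame.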

Operation 6: Descriptive statistics

stata
* Stata
summarize income age education
r
# R
summary(df[c("income", "age", "education")])
python
# Python
df[['income', 'age', 'education']].describe()

Operation 7: Regression analysis

Example: OLS regression

stata
* Stata
regress income education age i.gender
r
# R
model <- lm(income ~ education + age + factor(gender), data=df)
summary(model)
python
# Python (statsmodels, the closest to Stata)
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age + C(gender)', data=df).fit()
print(model.summary())

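# or with sklearn (terser, but no summary table; gender is left out
# here because categorical inputs would need to be encoded first)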
from sklearn.linear_model import LinearRegression
X = df[['education', 'age']]
y = df['income']
model = LinearRegression().fit(X, y)

Operation 8: Merge data

Example: merge two datasets on country

stata
* Stata
merge 1:1 country using "gdp_data.dta"
r
# R
merged <- merge(df1, df2, by="country", all=TRUE)

# R (dplyr)
merged <- df1 %>% full_join(df2, by="country")
python
# Python
merged = pd.merge(df1, df2, on='country', how='outer')
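
Stata users often rely on the `_merge` variable and on the `1:1` check that `merge` performs. A small sketch of how pandas can provide both, using the `indicator` and `validate` arguments of `pd.merge`; the two toy data frames are hypothetical:

```python
import pandas as pd

# Hypothetical toy data: IN appears only on the left, BR only on the right.
df1 = pd.DataFrame({"country": ["CN", "US", "IN"], "income": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"country": ["CN", "US", "BR"], "gdp": [10.0, 20.0, 30.0]})

merged = pd.merge(
    df1, df2,
    on="country",
    how="outer",        # keep unmatched rows from both sides
    validate="1:1",     # raise an error if country is duplicated on either side
    indicator=True,     # add a _merge column: left_only / right_only / both
)

# Roughly what Stata prints after a merge: counts by match status.
print(merged["_merge"].value_counts())
```

Here `left_only`, `right_only`, and `both` play the role of Stata's `_merge` codes 1, 2, and 3.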

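Key Syntax Differences

1. Indexing

| Operation | Stata | R | Python |
|---|---|---|---|
| Select a column | income | df$income | df['income'] |
| Select several columns | keep income age | df[c("income", "age")] | df[['income', 'age']] |
| Select rows | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |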

2. Assignment

| Operation | Stata | R | Python |
|---|---|---|---|
| Create a new variable | gen x = 1 | df$x <- 1 | df['x'] = 1 |
| Replace values | replace x = 2 if age > 30 | df$x[df$age > 30] <- 2 | df.loc[df['age'] > 30, 'x'] = 2 |

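3. Missing values

| Operation | Stata | R | Python |
|---|---|---|---|
| Missing-value symbol | . | NA | NaN / None |
| Drop missing values | drop if missing(income) | df <- df[!is.na(df$income), ] | df = df.dropna(subset=['income']) |
| Fill missing values | replace income = 0 if missing(income) | df$income[is.na(df$income)] <- 0 | df['income'] = df['income'].fillna(0) |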
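
Two details of pandas missing-value handling tend to trip up newcomers: `dropna()` with no arguments drops a row if any column is missing, and `fillna()` returns a new object rather than changing the data in place. The sketch below illustrates both on a hypothetical three-row data frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing number and one missing string.
df = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],
    "city": ["Beijing", None, "Boston"],
})

# isna() flags both NaN and None.
n_missing = df.isna().sum()

# Drop rows only where income is missing, like `drop if missing(income)`.
dropped = df.dropna(subset=["income"])

# fillna() returns a new Series; assign it back to keep the result.
filled = df.copy()
filled["income"] = filled["income"].fillna(0)

print(n_missing)
print(dropped)
print(filled)
```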

Shifting Mindset: From Stata/R to Python

For Stata users

The Stata mindset: work on one dataset at a time

stata
use "data1.dta", clear
gen new_var = old_var * 2
save "data1_new.dta", replace

The Python mindset: work with several data frames at once

python
df1 = pd.read_csv("data1.csv")
df1['new_var'] = df1['old_var'] * 2
df1.to_csv("data1_new.csv", index=False)  # index=False: do not write the row index as a column

# df2, df3, ... can sit in memory at the same time
df2 = pd.read_csv("data2.csv")

For R users

The R mindset: functional programming

r
df %>%
  filter(age > 30) %>%
  mutate(log_income = log(income)) %>%
  group_by(country) %>%
  summarise(mean_income = mean(income))

The Python mindset: object method chaining (similar to R's pipe)

python
(df
 .query('age > 30')
 .assign(log_income=lambda x: np.log(x['income']))
 .groupby('country')
 .agg({'income': 'mean'})
)

Worked Example: Replicating a Classic Stata Analysis

Stata code

stata
* 1. Load the data
use "survey_data.dta", clear

* 2. Clean the data
drop if missing(income)
keep if age >= 18 & age <= 65

* 3. Create new variables
gen log_income = log(income)
gen age_squared = age^2

* 4. Descriptive statistics
tabstat income education age, by(gender) stat(mean sd)

* 5. Regression
regress log_income education age age_squared i.gender

Equivalent Python code

python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# 1. Load the data
df = pd.read_stata("survey_data.dta")

# 2. Clean the data
df = df.dropna(subset=['income'])
df = df[(df['age'] >= 18) & (df['age'] <= 65)]

# 3. Create new variables
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# 4. Descriptive statistics
df.groupby('gender')[['income', 'education', 'age']].agg(['mean', 'std'])

# 5. Regression
model = smf.ols('log_income ~ education + age + age_squared + C(gender)', data=df).fit()
print(model.summary())

Comparing the output: Python's regression table should report the same coefficients and standard errors as Stata's; only the layout differs.


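Python's Distinctive Strengths

1. Working with several datasets at once

python
# Load data for several countries side by side
df_china = pd.read_csv("china_data.csv")
df_usa = pd.read_csv("usa_data.csv")
df_india = pd.read_csv("india_data.csv")

# Stack them into one data frame with a fresh row index
df_all = pd.concat([df_china, df_usa, df_india], ignore_index=True)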

2. Loops (more flexible than Stata's)

python
# Log-transform several variables
for var in ['income', 'gdp', 'population']:
    df[f'log_{var}'] = np.log(df[var])

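3. Calling external APIs directly

python
# Fetch data straight from a web API (awkward to do in Stata)
import requests
response = requests.get(
    "https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD?format=json",
    timeout=30,  # avoid hanging forever on a slow connection
)
response.raise_for_status()  # stop early on an HTTP error
data = response.json()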

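Advanced Operations Compared

Operation 9: Panel regression (fixed effects)

Example: the effect of education on wages, controlling for individual fixed effects

stata
* Stata - very concise!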
xtset individual_id year
xtreg wage education experience, fe
r
# R (plm package)
library(plm)
model <- plm(wage ~ education + experience,
             data = panel_data,
             index = c("individual_id", "year"),
             model = "within")
summary(model)
python
# Python (linearmodels)
from linearmodels.panel import PanelOLS

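# Set a MultiIndex of (entity, time)
panel_data = df.set_index(['individual_id', 'year'])

# Fixed-effects regression
model = PanelOLS.from_formula(
    'wage ~ education + experience + EntityEffects',
    data=panel_data
)
# Standard errors are clustered by individual here; the Stata call above
# would need vce(cluster individual_id) to report comparable ones
result = model.fit(cov_type='clustered', cluster_entity=True)
print(result.summary)

Note: Stata is the most concise for panel data, but Python's linearmodels is just as capable.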
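
To see what the fixed-effects estimator does under the hood, it can help to carry out the within transformation by hand: subtract each individual's mean from every variable, then run plain OLS on the demeaned data. The sketch below does this with pandas and NumPy on simulated data; the panel, the true slope of 2, and all variable values are made up for illustration, and it uses a single regressor rather than the two in the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical balanced panel: 3 individuals, 4 periods, true slope = 2.
rng = np.random.default_rng(0)
ids = np.repeat([1, 2, 3], 4)
alpha = np.repeat([5.0, -3.0, 10.0], 4)           # individual fixed effects
education = rng.normal(size=12) + alpha            # regressor correlated with alpha
wage = 2.0 * education + alpha + rng.normal(scale=0.01, size=12)
df = pd.DataFrame({"individual_id": ids, "education": education, "wage": wage})

# Within transformation: subtract each individual's own mean.
cols = ["wage", "education"]
demeaned = df[cols] - df.groupby("individual_id")[cols].transform("mean")

# OLS slope on the demeaned data is the fixed-effects estimate.
x = demeaned["education"].to_numpy()
y = demeaned["wage"].to_numpy()
beta_fe = float(x @ y / (x @ x))

# Pooled OLS (with intercept) for comparison: biased by the omitted alpha.
X = np.column_stack([np.ones(len(df)), df["education"].to_numpy()])
beta_pooled = float(np.linalg.lstsq(X, df["wage"].to_numpy(), rcond=None)[0][1])

print(f"fixed effects: {beta_fe:.3f}, pooled OLS: {beta_pooled:.3f}")
```

The fixed-effects slope lands close to the true value, while the pooled slope is pulled away from it because the individual effects are correlated with the regressor.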


Operation 10: Categorical variables (factor encoding)

stata
* Stata - the i. prefix expands the categories automatically
regress wage i.education_level i.industry
r
# R - factor() expands the categories automatically
model <- lm(wage ~ factor(education_level) + factor(industry),
            data = df)
python
# Python - mark categorical terms explicitly with C()
import statsmodels.formula.api as smf

model = smf.ols('wage ~ C(education_level) + C(industry)',
                data=df).fit()
# or encode by hand with pd.get_dummies()
df_encoded = pd.get_dummies(df,
                             columns=['education_level', 'industry'],
                             drop_first=True)
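
With manual encoding it is worth checking which category ends up as the omitted baseline. A minimal sketch on hypothetical data: `pd.get_dummies` orders categories alphabetically, so `drop_first=True` drops `BA` here and the remaining dummies are measured relative to it.

```python
import pandas as pd

# Hypothetical toy data; the category labels are made up.
df = pd.DataFrame({
    "wage": [10, 12, 15, 20],
    "education_level": ["HS", "BA", "MA", "BA"],
})

# dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(
    df, columns=["education_level"], drop_first=True, dtype=int
)
print(encoded)
```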

Operation 11: Time-series operations

Example: compute lagged values and growth rates

stata
* Stata
tsset date
gen gdp_lag1 = L.gdp
gen gdp_growth = (gdp - L.gdp) / L.gdp
r
# R (dplyr)
df <- df %>%
  arrange(date) %>%
  mutate(gdp_lag1 = lag(gdp),
         gdp_growth = (gdp - lag(gdp)) / lag(gdp))
python
# Python (pandas)
df = df.sort_values('date')
df['gdp_lag1'] = df['gdp'].shift(1)
df['gdp_growth'] = df['gdp'].pct_change()
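
One caveat when moving from Stata: `shift()` and `pct_change()` are purely positional, whereas Stata's `L.` operator respects the panel and time structure declared with `xtset`. For data with several units, the lag therefore has to be taken within each group. A short sketch on a hypothetical two-country panel:

```python
import pandas as pd

# Hypothetical two-country panel.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp": [100.0, 110.0, 121.0, 50.0, 55.0, 66.0],
})

# Sort within country, then lag within country so that
# B's first year does not pick up A's last value.
df = df.sort_values(["country", "year"])
by_country = df.groupby("country")["gdp"]
df["gdp_lag1"] = by_country.shift(1)
df["gdp_growth"] = by_country.pct_change()
print(df)
```

The first year of each country has a missing lag and growth rate. If years can be missing inside a country, the positional lag will silently skip the gap, so it is worth checking the year spacing first.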

Operation 12: String handling

Example: extract the last name and convert case

stata
* Stata
gen last_name = word(name, -1)
gen name_upper = upper(name)
r
# R (stringr)
library(stringr)
df$last_name <- word(df$name, -1)
df$name_upper <- str_to_upper(df$name)
python
# Python (pandas + str methods)
df['last_name'] = df['name'].str.split().str[-1]
df['name_upper'] = df['name'].str.upper()

# or with a regular expression (expand=False returns a Series)
df['last_name'] = df['name'].str.extract(r'(\w+)$', expand=False)

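Comparing a Full Analysis Workflow

Case study: an end-to-end empirical workflow

Research question: the effect of a minimum-wage policy on employment (difference-in-differences design)

Stata implementation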

stata
* 1. Load the data
use "employment_data.dta", clear

* 2. Clean the data
drop if missing(employment, wage, treatment)
keep if year >= 2010 & year <= 2020

* 3. Build the interaction term
gen post = (year >= 2015)
gen treated = (state == "CA")
gen did = post * treated

* 4. Descriptive statistics
tabstat employment, by(treated post) stat(mean sd n)

* 5. DID regression
regress employment did treated post controls, cluster(state)

* 6. Parallel-trends check
forvalues y = 2010/2020 {
    gen year_`y' = (year == `y')
    gen treat_year_`y' = treated * year_`y'
}
regress employment treat_year_*, cluster(state)

Python implementation

python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# 1. Load the data
df = pd.read_stata("employment_data.dta")

# 2. Clean the data
df = df.dropna(subset=['employment', 'wage', 'treatment'])
df = df[(df['year'] >= 2010) & (df['year'] <= 2020)]

# 3. Build the interaction term
df['post'] = (df['year'] >= 2015).astype(int)
df['treated'] = (df['state'] == "CA").astype(int)
df['did'] = df['post'] * df['treated']

# 4. Descriptive statistics
summary = df.groupby(['treated', 'post'])['employment'].agg([
    ('mean', 'mean'),
    ('std', 'std'),
    ('n', 'count')
])
print(summary)

# 5. DID regression (cluster-robust standard errors)
model = smf.ols('employment ~ did + treated + post + controls',
                data=df).fit(cov_type='cluster',
                             cov_kwds={'groups': df['state']})
print(model.summary())

# 6. Parallel-trends check
df['year_str'] = df['year'].astype(str)
model_parallel = smf.ols(
    'employment ~ C(year_str):treated + C(year_str) + controls',
    data=df
).fit(cov_type='cluster', cov_kwds={'groups': df['state']})

# Extract the coefficients and plot them
coefs = model_parallel.params.filter(regex='year_str.*treated')
ci = model_parallel.conf_int().loc[coefs.index]

plt.figure(figsize=(10, 6))
plt.errorbar(range(len(coefs)), coefs,
             yerr=[(coefs - ci[0]), (ci[1] - coefs)],
             fmt='o-')
plt.axhline(y=0, color='red', linestyle='--')
plt.axvline(x=5, color='gray', linestyle='--', label='Treatment Year')
plt.title('Parallel Trends Test')
plt.xlabel('Year')
plt.ylabel('Treatment Effect')
plt.legend()
plt.show()

Comparison

  • Stata: more concise (~20 lines)
  • Python: somewhat longer (~40 lines), but stronger on visualization and flexibility
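
The logic behind the `did` coefficient can be checked by hand. In a two-group, two-period setting with no extra controls, the coefficient on `treated * post` equals the difference of the two before/after differences in group means. The sketch below confirms this on a hypothetical eight-row dataset, using NumPy least squares in place of statsmodels to stay dependency-light:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 groups x 2 periods, 2 observations per cell.
df = pd.DataFrame({
    "treated":    [0, 0, 0, 0, 1, 1, 1, 1],
    "post":       [0, 0, 1, 1, 0, 0, 1, 1],
    "employment": [10.0, 12.0, 13.0, 15.0, 20.0, 22.0, 28.0, 30.0],
})

# DID from the four cell means.
means = df.groupby(["treated", "post"])["employment"].mean()
did_means = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (
    means.loc[(0, 1)] - means.loc[(0, 0)]
)

# The same number as the interaction coefficient in OLS.
X = np.column_stack([
    np.ones(len(df)),
    df["treated"].to_numpy(),
    df["post"].to_numpy(),
    (df["treated"] * df["post"]).to_numpy(),
])
coef, *_ = np.linalg.lstsq(X, df["employment"].to_numpy(), rcond=None)
did_ols = float(coef[3])

print(did_means, did_ols)
```

Once control variables are added, the regression coefficient no longer reduces to this simple difference of means, which is why the regression form is used in practice.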

Data Visualization Compared

Case study: a coefficient plot of regression results

Stata

stata
* coefplot is a community-contributed command: ssc install coefplot
regress wage education experience female urban
coefplot, drop(_cons) xline(0)

R (ggplot2)

r
library(ggplot2)
library(broom)

model <- lm(wage ~ education + experience + female + urban,
            data = df)
coef_df <- tidy(model, conf.int = TRUE)

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Regression Coefficients",
       x = "Estimate", y = "Variable")

Python (matplotlib + seaborn)

python
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

model = smf.ols('wage ~ education + experience + female + urban',
                data=df).fit()

# Extract the coefficients and confidence intervals
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')

fig, ax = plt.subplots(figsize=(8, 6))
y_pos = range(len(coefs))

ax.errorbar(coefs, y_pos,
            xerr=[(coefs - ci[0]), (ci[1] - coefs)],
            fmt='o', capsize=5)
ax.axvline(x=0, color='red', linestyle='--', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(coefs.index)
ax.set_xlabel('Coefficient Estimate')
ax.set_title('Regression Coefficients with 95% CI')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

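Performance Compared

Processing speed on large data (1 million rows)

| Operation | Stata | R (dplyr) | Python (pandas) |
|---|---|---|---|
| Read CSV | ~5 s | ~3 s | ~2 s |
| Group and summarize | ~2 s | ~1 s | ~0.8 s |
| Merge data | ~4 s | ~2 s | ~1.5 s |
| Regression | ~0.5 s | ~0.8 s | ~0.6 s |

Note: these figures are approximate and will vary with hardware and software versions. pandas is often the fastest of the three on large data, while Stata's regression routines are well optimized.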


Ecosystems Compared

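Stata's strengths

  • The most complete set of econometric methods (IV, DID, RDD, PSM)
  • The most concise panel-data handling
  • Regression output closest to academic formatting conventions
  • A concentrated community (Statalist, SSC)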

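R's strengths

  • The richest range of statistical methods (Bayesian, survival analysis, factor analysis)
  • The most elegant data visualization (ggplot2)
  • Free and open source
  • A huge CRAN package ecosystem (20,000+ packages)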

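Python's strengths

  • The strongest machine-learning ecosystem (scikit-learn, PyTorch, TensorFlow)
  • The de facto standard for deep learning and LLM work
  • The strongest general-purpose programming capabilities
  • The most complete data-engineering tooling (web scraping, APIs, databases)
  • The largest job-market demand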

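Learning Advice

If you are a Stata user

  1. Focus on pandas (roughly 80% of what you do in Stata can be replicated)
  2. Use statsmodels (its output format is the closest to Stata's)
  3. Remember: df['var'] ≈ a Stata variable name
  4. Learning path
    • Week 1: pandas basics (the counterpart of Stata's data management)
    • Week 2: statsmodels regression (the counterpart of Stata's regress)
    • Week 3: linearmodels panel data (the counterpart of Stata's xtreg)
    • Week 4: scikit-learn machine learning (an area where Stata offers far less)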

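If you are an R user

  1. Learn pandas (similar to dplyr + data.table)
  2. Use plotnine (a Python port of ggplot2)
  3. Remember: Python chains with . instead of %>%
  4. Learning path
    • Week 1: Python basics (as little as 3 days for an R user)
    • Week 2: pandas (similar to the tidyverse)
    • Week 3: Matplotlib/Seaborn (less polished than ggplot2, but good enough)
    • Week 4: sklearn + PyTorch (a weaker area for R)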

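A Mixed Three-Language Strategy

Best practice: pick the tool that fits the task

Data cleaning → Python (pandas)

Descriptive statistics → Stata/R (whichever you prefer)

Traditional econometrics → Stata (panel data, IV)

Machine learning → Python (sklearn)

Text analysis → Python (transformers)

Visualization → R (ggplot2) or Python (seaborn)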
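
Mixing tools in this way depends on passing data between them. One option, sketched below under the assumption that Stata's `.dta` format is the exchange file, is to clean in pandas, write with `DataFrame.to_stata`, and read back with `pd.read_stata`; the file name `cleaned.dta` and the toy data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data produced on the Python side.
df = pd.DataFrame({"country": ["CN", "US"], "income": [1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned.dta")

    # write_index=False: do not store the pandas row index as a variable.
    df.to_stata(path, write_index=False)

    # In Stata this file would be opened with: use "cleaned.dta", clear
    # Reading it back here only confirms that the round trip works.
    roundtrip = pd.read_stata(path)

print(roundtrip)
```

CSV works as an exchange format too, but `.dta` preserves variable types, which saves a re-import step on the Stata side.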

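Next Steps

In the next section we will write our first Python program and get a feel for Python's simplicity and power.

Ready?

Released under the MIT License. Content copyright belongs to the author.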