Python vs Stata vs R: A Syntax Comparison Cheat Sheet
Build a Python mindset quickly by mapping it onto the Stata/R you already know
Core Concept Comparison
1. The DataFrame Concept
All three languages center on a two-dimensional data table:
| Concept | Stata | R | Python (pandas) |
|---|---|---|---|
| Data frame | Dataset (only one in memory at a time) | data.frame | DataFrame (multiple can coexist) |
| Variable (column) | Variable | Column | Column |
| Observation (row) | Observation | Row | Row |
Key difference:
- Stata: operates on only one dataset at a time
- R/Python: can work with multiple data frames simultaneously
Common Operations Compared
Operation 1: Read a CSV file
```stata
* Stata
import delimited "data.csv", clear
```

```r
# R
df <- read.csv("data.csv")
```

```python
# Python
import pandas as pd
df = pd.read_csv("data.csv")
```

Operation 2: View the first few rows
```stata
* Stata
list in 1/5
browse in 1/5
```

```r
# R
head(df)
```

```python
# Python
df.head()
```

Operation 3: Create a new variable
Example: create the log of income
```stata
* Stata
gen log_income = log(income)
```

```r
# R
df$log_income <- log(df$income)
```

```python
# Python
import numpy as np
df['log_income'] = np.log(df['income'])
```

Operation 4: Conditional filtering
Example: keep observations with age greater than 30
```stata
* Stata
keep if age > 30
```

```r
# R (base R)
df_filtered <- df[df$age > 30, ]

# or with dplyr
library(dplyr)
df_filtered <- df %>% filter(age > 30)
```

```python
# Python
df_filtered = df[df['age'] > 30]
```

Operation 5: Group and summarize
Example: compute mean income by country
```stata
* Stata
collapse (mean) avg_income=income, by(country)
```

```r
# R (base R)
aggregate(income ~ country, data=df, FUN=mean)

# R (dplyr)
df %>%
  group_by(country) %>%
  summarise(avg_income = mean(income))
```

```python
# Python
df.groupby('country')['income'].mean()

# or, more explicitly
df.groupby('country').agg({'income': 'mean'})
```

Operation 6: Descriptive statistics
```stata
* Stata
summarize income age education
```

```r
# R
summary(df[c("income", "age", "education")])
```

```python
# Python
df[['income', 'age', 'education']].describe()
```

Operation 7: Regression analysis
Example: OLS regression
```stata
* Stata
regress income education age i.gender
```

```r
# R
model <- lm(income ~ education + age + factor(gender), data=df)
summary(model)
```

```python
# Python (statsmodels, closest to Stata)
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age + C(gender)', data=df).fit()
print(model.summary())

# or scikit-learn (more concise, but different output)
from sklearn.linear_model import LinearRegression
X = df[['education', 'age']]
y = df['income']
model = LinearRegression().fit(X, y)
```

Operation 8: Merge datasets
Example: merge two datasets on country
```stata
* Stata
merge 1:1 country using "gdp_data.dta"
```

```r
# R (base R)
merged <- merge(df1, df2, by="country", all=TRUE)

# R (dplyr)
merged <- df1 %>% full_join(df2, by="country")
```

```python
# Python
merged = pd.merge(df1, df2, on='country', how='outer')
```

Key Syntax Differences
1. Indexing
| Operation | Stata | R | Python |
|---|---|---|---|
| Select a column | income | df$income | df['income'] |
| Select multiple columns | keep income age | df[c("income", "age")] | df[['income', 'age']] |
| Select rows | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
2. Assignment
| Operation | Stata | R | Python |
|---|---|---|---|
| Create a new variable | gen x = 1 | df$x <- 1 | df['x'] = 1 |
| Conditionally replace | replace x = 2 if age > 30 | df$x[df$age > 30] <- 2 | df.loc[df['age'] > 30, 'x'] = 2 |
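A minimal pandas sketch of the two assignment patterns above, using a small made-up DataFrame (the `age` and `x` columns are illustrative only); note that conditional replacement should go through `.loc` rather than chained indexing:

```python
import pandas as pd

# Toy data, purely illustrative
df = pd.DataFrame({'age': [25, 35, 45]})

# Create a new variable (Stata: gen x = 1)
df['x'] = 1

# Conditional replacement (Stata: replace x = 2 if age > 30).
# Use .loc for the combined row/column selection; chained indexing such as
# df[df['age'] > 30]['x'] = 2 writes to a temporary copy and is unreliable.
df.loc[df['age'] > 30, 'x'] = 2
print(df)
```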
3. Missing values
| Operation | Stata | R | Python |
|---|---|---|---|
| Missing value marker | . | NA | NaN or None |
| Drop missing values | drop if missing(income) | df <- na.omit(df) | df.dropna() |
| Fill missing values | replace income = 0 if missing(income) | df$income[is.na(df$income)] <- 0 | df['income'].fillna(0) |
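A short pandas sketch of the missing-value operations in the table, on a made-up `income` column; note that `dropna()` and `fillna()` return new objects, so the result must be assigned back:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value, purely illustrative
df = pd.DataFrame({'income': [50_000, np.nan, 62_000]})

# Drop rows with missing income (Stata: drop if missing(income))
df_complete = df.dropna(subset=['income'])

# Fill missing income with 0 (Stata: replace income = 0 if missing(income));
# fillna returns a new Series, so assign it back to the column
df['income'] = df['income'].fillna(0)
print(df_complete, df, sep='\n')
```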
Mindset Shift: From Stata/R to Python
For Stata Users
Stata's mindset: operate on one dataset at a time

```stata
use "data1.dta", clear
gen new_var = old_var * 2
save "data1_new.dta", replace
```

Python's mindset: work with multiple data frames at once

```python
df1 = pd.read_csv("data1.csv")
df1['new_var'] = df1['old_var'] * 2
df1.to_csv("data1_new.csv")

# df2, df3, ... can all live in memory at the same time
df2 = pd.read_csv("data2.csv")
```

For R Users
R's mindset: functional programming with pipes

```r
df %>%
  filter(age > 30) %>%
  mutate(log_income = log(income)) %>%
  group_by(country) %>%
  summarise(mean_income = mean(income))
```

Python's mindset: method chaining on objects (similar to R's pipe)

```python
(df
 .query('age > 30')
 .assign(log_income=lambda x: np.log(x['income']))
 .groupby('country')
 .agg({'income': 'mean'})
)
```

Worked Example: Replicating a Classic Stata Analysis
Stata Code

```stata
* 1. Load the data
use "survey_data.dta", clear

* 2. Clean the data
drop if missing(income)
keep if age >= 18 & age <= 65

* 3. Create new variables
gen log_income = log(income)
gen age_squared = age^2

* 4. Descriptive statistics
tabstat income education age, by(gender) stat(mean sd)

* 5. Regression
regress log_income education age age_squared i.gender
```

Equivalent Python Code
```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# 1. Load the data
df = pd.read_stata("survey_data.dta")

# 2. Clean the data
df = df.dropna(subset=['income'])
df = df[(df['age'] >= 18) & (df['age'] <= 65)]

# 3. Create new variables
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# 4. Descriptive statistics
df.groupby('gender')[['income', 'education', 'age']].agg(['mean', 'std'])

# 5. Regression
model = smf.ols('log_income ~ education + age + age_squared + C(gender)', data=df).fit()
print(model.summary())
```

Comparing the results: the Python output is nearly identical to Stata's!
Python's Distinctive Strengths
1. Work with multiple datasets at once

```python
# Load data for several countries at the same time
df_china = pd.read_csv("china_data.csv")
df_usa = pd.read_csv("usa_data.csv")
df_india = pd.read_csv("india_data.csv")

# Stack them into one DataFrame
df_all = pd.concat([df_china, df_usa, df_india])
```

2. Loops (more flexible than Stata)
```python
# Log-transform several variables in one loop
for var in ['income', 'gdp', 'population']:
    df[f'log_{var}'] = np.log(df[var])
```

3. Call external APIs directly
```python
# Much harder in Stata: fetch data directly from a web API
import requests
response = requests.get("https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD?format=json")
data = response.json()
```

Advanced Operations Compared
Operation 9: Panel regression (fixed effects)
Example: estimate the effect of education on wages, controlling for individual fixed effects

```stata
* Stata - very concise!
xtset individual_id year
xtreg wage education experience, fe
```

```r
# R (plm package)
library(plm)
model <- plm(wage ~ education + experience,
             data = panel_data,
             index = c("individual_id", "year"),
             model = "within")
summary(model)
```

```python
# Python (linearmodels)
from linearmodels.panel import PanelOLS

# Set a MultiIndex of entity and time
panel_data = df.set_index(['individual_id', 'year'])

# Fixed-effects regression
model = PanelOLS.from_formula(
    'wage ~ education + experience + EntityEffects',
    data=panel_data
)
result = model.fit(cov_type='clustered', cluster_entity=True)
print(result.summary)
```

Note: Stata is the most concise for panel data, but Python's linearmodels is just as capable.
Operation 10: Categorical variables (factor encoding)

```stata
* Stata - handled automatically with factor notation
regress wage i.education_level i.industry
```

```r
# R - handled automatically via factor()
model <- lm(wage ~ factor(education_level) + factor(industry),
            data = df)
```

```python
# Python - must be specified explicitly
import statsmodels.formula.api as smf
model = smf.ols('wage ~ C(education_level) + C(industry)',
                data=df).fit()

# or encode manually with pd.get_dummies()
df_encoded = pd.get_dummies(df,
                            columns=['education_level', 'industry'],
                            drop_first=True)
```

Operation 11: Time-series operations
Example: compute lags and growth rates

```stata
* Stata
tsset date
gen gdp_lag1 = L.gdp
gen gdp_growth = (gdp - L.gdp) / L.gdp
```

```r
# R (dplyr)
df <- df %>%
  arrange(date) %>%
  mutate(gdp_lag1 = lag(gdp),
         gdp_growth = (gdp - lag(gdp)) / lag(gdp))
```

```python
# Python (pandas)
df = df.sort_values('date')
df['gdp_lag1'] = df['gdp'].shift(1)
df['gdp_growth'] = df['gdp'].pct_change()
```

Operation 12: String processing
Example: extract the last name and change case

```stata
* Stata
gen last_name = word(name, -1)
gen name_upper = upper(name)
```

```r
# R (stringr)
library(stringr)
df$last_name <- word(df$name, -1)
df$name_upper <- str_to_upper(df$name)
```

```python
# Python (pandas .str methods)
df['last_name'] = df['name'].str.split().str[-1]
df['name_upper'] = df['name'].str.upper()

# or with a regular expression
df['last_name'] = df['name'].str.extract(r'(\w+)$')
```

Comparing a Full Analysis Workflow
Case Study: A Complete Empirical Research Workflow
Research question: estimate the effect of a minimum-wage policy on employment (difference-in-differences design)
Stata Implementation

```stata
* 1. Load the data
use "employment_data.dta", clear

* 2. Clean the data
drop if missing(employment, wage, treatment)
keep if year >= 2010 & year <= 2020

* 3. Generate the interaction terms
gen post = (year >= 2015)
gen treated = (state == "CA")
gen did = post * treated

* 4. Descriptive statistics (tabstat's by() takes a single variable, so summarize by cell)
bysort treated post: summarize employment

* 5. DID regression
regress employment did treated post controls, cluster(state)

* 6. Parallel-trends test
forvalues y = 2010/2020 {
    gen year_`y' = (year == `y')
    gen treat_year_`y' = treated * year_`y'
}
regress employment treat_year_*, cluster(state)
```

Python Implementation
```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# 1. Load the data
df = pd.read_stata("employment_data.dta")

# 2. Clean the data
df = df.dropna(subset=['employment', 'wage', 'treatment'])
df = df[(df['year'] >= 2010) & (df['year'] <= 2020)]

# 3. Generate the interaction terms
df['post'] = (df['year'] >= 2015).astype(int)
df['treated'] = (df['state'] == "CA").astype(int)
df['did'] = df['post'] * df['treated']

# 4. Descriptive statistics
summary = df.groupby(['treated', 'post'])['employment'].agg([
    ('mean', 'mean'),
    ('std', 'std'),
    ('n', 'count')
])
print(summary)

# 5. DID regression (clustered standard errors)
model = smf.ols('employment ~ did + treated + post + controls',
                data=df).fit(cov_type='cluster',
                             cov_kwds={'groups': df['state']})
print(model.summary())

# 6. Parallel-trends test
df['year_str'] = df['year'].astype(str)
model_parallel = smf.ols(
    'employment ~ C(year_str):treated + C(year_str) + controls',
    data=df
).fit(cov_type='cluster', cov_kwds={'groups': df['state']})

# Extract the coefficients and plot them
coefs = model_parallel.params.filter(regex='year_str.*treated')
ci = model_parallel.conf_int().loc[coefs.index]
plt.figure(figsize=(10, 6))
plt.errorbar(range(len(coefs)), coefs,
             yerr=[(coefs - ci[0]), (ci[1] - coefs)],
             fmt='o-')
plt.axhline(y=0, color='red', linestyle='--')
plt.axvline(x=5, color='gray', linestyle='--', label='Treatment Year')
plt.title('Parallel Trends Test')
plt.xlabel('Year')
plt.ylabel('Treatment Effect')
plt.legend()
plt.show()
```

Comparison:
- Stata: more concise (roughly 20 lines)
- Python: somewhat longer (roughly 40 lines), but stronger on visualization and flexibility
Data Visualization Compared
Case: a coefficient plot of regression results
Stata

```stata
regress wage education experience female urban
coefplot, drop(_cons) xline(0)
```

R (ggplot2)

```r
library(ggplot2)
library(broom)

model <- lm(wage ~ education + experience + female + urban,
            data = df)
coef_df <- tidy(model, conf.int = TRUE)

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Regression Coefficients",
       x = "Estimate", y = "Variable")
```

Python (matplotlib + seaborn)
```python
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

model = smf.ols('wage ~ education + experience + female + urban',
                data=df).fit()

# Extract the coefficients and confidence intervals
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')

fig, ax = plt.subplots(figsize=(8, 6))
y_pos = range(len(coefs))
ax.errorbar(coefs, y_pos,
            xerr=[(coefs - ci[0]), (ci[1] - coefs)],
            fmt='o', capsize=5)
ax.axvline(x=0, color='red', linestyle='--', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(coefs.index)
ax.set_xlabel('Coefficient Estimate')
ax.set_title('Regression Coefficients with 95% CI')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
```

Performance Comparison
Speed on large data (1 million rows)
| Operation | Stata | R (dplyr) | Python (pandas) |
|---|---|---|---|
| Read CSV | ~5 s | ~3 s | ~2 s |
| Group and summarize | ~2 s | ~1 s | ~0.8 s |
| Merge | ~4 s | ~2 s | ~1.5 s |
| Regression | ~0.5 s | ~0.8 s | ~0.6 s |
Note: Python + pandas is usually fastest on large data operations, but Stata's regression routines are highly optimized.
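Timings like these depend heavily on hardware, file format, and library versions, so treat the table as rough orders of magnitude. A minimal sketch of how the pandas side could be timed on synthetic data (the column names and group labels are made up for illustration):

```python
import time
import numpy as np
import pandas as pd

# Synthetic data: 1 million rows, illustrative only
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'country': rng.choice(['CN', 'US', 'IN', 'DE'], size=n),
    'income': rng.lognormal(mean=10, sigma=1, size=n),
})

# Time a group-by aggregation, the second row of the table above
start = time.perf_counter()
result = df.groupby('country')['income'].mean()
elapsed = time.perf_counter() - start
print(f"groupby mean over {n:,} rows: {elapsed:.3f} s")
```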
Ecosystem Comparison
Stata's strengths
- Deepest coverage of applied econometrics (IV, DID, RDD, PSM)
- Most concise panel-data handling
- Regression output closest to journal conventions
- Concentrated community (Statalist, SSC)
R's strengths
- Broadest range of statistical methods (Bayesian, survival analysis, factor analysis)
- Most elegant data visualization (ggplot2)
- Free and open source
- Huge CRAN package ecosystem (20,000+ packages)
Python's strengths
- Strongest machine-learning ecosystem (scikit-learn, PyTorch, TensorFlow)
- The de facto standard for deep learning and LLM work
- Strongest general-purpose programming capabilities
- Most complete data-engineering tooling (web scraping, APIs, databases)
- Highest demand in the job market
Learning Advice
If you come from Stata
- Focus on pandas (it can replicate roughly 80% of what you do in Stata)
- Use statsmodels (its output format is closest to Stata's)
- Remember: `df['var']` ≈ referring to a variable by name in Stata
- Learning path:
  - Week 1: pandas basics (the equivalent of Stata's data management)
  - Week 2: statsmodels regression (the equivalent of Stata's regress)
  - Week 3: linearmodels for panel data (the equivalent of Stata's xtreg)
  - Week 4: scikit-learn for machine learning (something Stata cannot do)
If you come from R
- Learn pandas (similar to dplyr + data.table)
- Use plotnine (a Python port of ggplot2)
- Remember: Python chains methods with `.` rather than `%>%`
- Learning path:
  - Week 1: basic Python syntax (R users can pick it up in as little as 3 days)
  - Week 2: pandas (similar to the tidyverse)
  - Week 3: Matplotlib/Seaborn (not as elegant as ggplot2, but good enough)
  - Week 4: scikit-learn + PyTorch (R's weak spot)
A Mixed Three-Language Strategy
Best practice: choose the tool for the task

```
Data cleaning → Python (pandas)
        ↓
Descriptive statistics → Stata/R (personal preference)
        ↓
Traditional econometrics → Stata (panel data, IV)
        ↓
Machine learning → Python (scikit-learn)
        ↓
Text analysis → Python (transformers)
        ↓
Visualization → R (ggplot2) or Python (seaborn)
```

Next Steps
In the next section, we will write our first Python program and experience Python's simplicity and power.
Ready?