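Reading and Writing Stata Data Files
Connecting Python and Stata

Why read and write Stata files?
- Status quo: many social science datasets are stored in .dta format
- Need: use Stata data in Python
- Benefit: variable labels, value labels, and other metadata can be preserved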
Reading and writing Stata files with pandas

Installing dependencies
bash
pip install pandas
# pandas has built-in support for .dta files

Reading Stata files
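python
import pandas as pd

# Basic read
df = pd.read_stata('survey_data.dta')
print(df.head())

# Inspect variable names. Variable labels are not stored on the DataFrame;
# they are available through pandas.io.stata.StataReader.variable_labels()
print(df.columns)

# Keep value labels (for example: 1='Male', 2='Female');
# convert_categoricals=True is the default
df = pd.read_stata('survey_data.dta', convert_categoricals=True)

Writing Stata files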
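python
import pandas as pd

df = pd.DataFrame({
    'respondent_id': [1, 2, 3],
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000],
    'gender': ['Male', 'Female', 'Male']
})

# Save in the Stata 13 format
df.to_stata('output.dta', write_index=False, version=117)

# .dta format versions accepted by to_stata:
# 114 = readable by Stata 10 and later (the pandas default)
# 117 = readable by Stata 13 and later
# 118 = readable by Stata 14 and later (Unicode support)
# 119 = readable by Stata 15 and later (more than 32,767 variables)

Working with variable labels and value labels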
Preserving metadata on read
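Variable labels and value-label mappings are not stored on the DataFrame that `read_stata` returns. One way to reach them is `pandas.io.stata.StataReader`. The sketch below is a minimal illustration, not a definitive recipe: the file name `labels_demo.dta` and the sample data are assumptions, and the file is written first only so the example can run on its own.

```python
import os
import tempfile

import pandas as pd
from pandas.io.stata import StataReader

# Hypothetical sample data, written first so the sketch is self-contained.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": pd.Categorical(["Male", "Female", "Male"]),
})
path = os.path.join(tempfile.mkdtemp(), "labels_demo.dta")
sample.to_stata(
    path,
    write_index=False,
    variable_labels={"id": "Respondent ID", "gender": "Gender"},
)

# StataReader exposes the metadata that a plain DataFrame does not carry.
with StataReader(path) as reader:
    df = reader.read()
    var_labels = reader.variable_labels()  # {column name: variable label}
    val_labels = reader.value_labels()     # {label set name: {code: label}}

print(var_labels)
print(val_labels)
```

With a real survey file, only the `with StataReader(...)` part is needed.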
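python
import pandas as pd

# Read the file and keep categorical variables
df = pd.read_stata(
    'survey.dta',
    convert_categoricals=True,  # keep value labels as categories
    preserve_dtypes=True        # keep the storage types from the file
)

# Inspect the labels of a categorical variable
if df['gender'].dtype.name == 'category':
    print(df['gender'].cat.categories)

Adding labels on write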
python
import pandas as pd

# Categorical columns are written with value labels
df = pd.DataFrame({
    'id': [1, 2, 3],
    'gender': pd.Categorical(['Male', 'Female', 'Male']),
    'education': pd.Categorical(['High School', 'Bachelor', 'Master'])
})

# Add variable labels
variable_labels = {
    'id': 'Respondent ID',
    'gender': 'Gender',
    'education': 'Education Level'
}
df.to_stata(
    'output.dta',
    write_index=False,
    variable_labels=variable_labels
)

Practical examples
Case 1: A Stata-to-Python data flow
python
import pandas as pd
import numpy as np

# 1. Read the Stata data
df = pd.read_stata('raw_survey.dta')
print(f"Raw data: {len(df)} rows")

# 2. Clean the data (Python)
df_clean = df[
    (df['age'] >= 18) &
    (df['age'] <= 100) &
    (df['income'] > 0)
].copy()

# 3. Generate new variables
df_clean['log_income'] = np.log(df_clean['income'])
df_clean['age_squared'] = df_clean['age'] ** 2

# 4. Save back to Stata format
df_clean.to_stata('clean_survey.dta', write_index=False)
print(f"After cleaning: {len(df_clean)} rows")

Case 2: Batch-processing multiple Stata files
python
import pandas as pd
from pathlib import Path

# Read data for several years
years = [2020, 2021, 2022, 2023]
all_data = []
for year in years:
    file_path = Path(f'survey_{year}.dta')
    if file_path.exists():
        df = pd.read_stata(file_path)
        df['year'] = year  # add a year identifier
        all_data.append(df)
        print(f"{year}: {len(df)} rows")

# pd.concat raises on an empty list, so stop early if no file was found
if not all_data:
    raise FileNotFoundError('No survey_<year>.dta files found')

# Combine
combined_df = pd.concat(all_data, ignore_index=True)
combined_df.to_stata('panel_data.dta', write_index=False)
print(f"Total: {len(combined_df)} rows")

Case 3: A round trip between Stata and Python
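Before relying on a round trip with real data, it can help to check on a small example what survives it. The following is a minimal sketch under stated assumptions: the data and the file name `roundtrip.dta` are hypothetical, and format version 118 is chosen only for illustration. It writes a DataFrame, reads it back, and compares values and dtypes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for a real survey file.
original = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [50000.0, 75000.0, 85000.0],
    "gender": pd.Categorical(["Male", "Female", "Male"]),
})

path = os.path.join(tempfile.mkdtemp(), "roundtrip.dta")
original.to_stata(path, write_index=False, version=118)

restored = pd.read_stata(path)

# Compare values column by column. Dtypes are printed rather than compared,
# because Stata has no 64-bit integer type and integer widths can change.
same_values = (
    restored["id"].tolist() == original["id"].tolist()
    and restored["income"].tolist() == original["income"].tolist()
    and restored["gender"].astype(str).tolist()
    == original["gender"].astype(str).tolist()
)
print("values preserved:", same_values)
print(original.dtypes.to_dict())
print(restored.dtypes.to_dict())
```

Comparing values and dtypes separately makes it clear whether a difference is a real data change or only a change in storage type.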
python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Read from Stata
df = pd.read_stata('original.dta')

# Process in Python: standardize both columns in one call
# (each column is scaled independently)
scaler = StandardScaler()
df[['income_std', 'age_std']] = scaler.fit_transform(df[['income', 'age']])

# Save back to Stata (with labels)
variable_labels = {
    'income_std': 'Standardized Income',
    'age_std': 'Standardized Age'
}
df.to_stata(
    'processed.dta',
    write_index=False,
    variable_labels=variable_labels,
    version=117
)

Python vs Stata: data operations compared
Reading data
stata
* Stata
use "survey_data.dta", clear
python
# Python
import pandas as pd
df = pd.read_stata('survey_data.dta')

Filtering data
stata
* Stata
keep if age >= 18 & age <= 65
keep if income > 0
python
# Python
# Note: Stata treats missing values as larger than any number, so
# `income > 0` keeps missing incomes in Stata but drops NaN in pandas
df = df[(df['age'] >= 18) & (df['age'] <= 65)]
df = df[df['income'] > 0]

Generating new variables
stata
* Stata
gen log_income = log(income)
gen age_squared = age^2
python
# Python
import numpy as np
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

Saving data
stata
* Stata
save "output.dta", replace
python
# Python
df.to_stata('output.dta', write_index=False)

Best practices
1. Choose an appropriate Stata version
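The limits that separate the format versions, as described in the pandas `to_stata` documentation, are: version 114 limits strings to 244 characters, versions 118 and 119 support Unicode, and version 119 is meant for more than 32,767 variables. The sketch below turns those limits into a rough rule of thumb. `choose_dta_version` is a hypothetical helper, not part of pandas. It is deliberately conservative (any non-ASCII text selects 118) and looks only at plain string columns, not at category labels.

```python
import pandas as pd


def _is_text(series: pd.Series) -> bool:
    """True for object or pandas string columns."""
    return series.dtype == object or isinstance(series.dtype, pd.StringDtype)


def choose_dta_version(df: pd.DataFrame) -> int:
    """Hypothetical helper: pick the oldest .dta version likely to hold df."""
    if df.shape[1] > 32767:
        return 119  # only version 119 allows this many variables

    needs_unicode = not all(str(name).isascii() for name in df.columns)
    longest = 0
    for col in df.columns:
        if not _is_text(df[col]):
            continue
        values = df[col].dropna().astype(str)
        if values.empty:
            continue
        longest = max(longest, int(values.str.len().max()))
        if not values.map(str.isascii).all():
            needs_unicode = True

    if needs_unicode:
        return 118  # Unicode support starts here
    if longest > 244:
        return 117  # version 114 limits strings to 244 characters
    return 114      # the pandas default, readable by Stata 10 and later


demo = pd.DataFrame({"city": ["Beijing", "北京"], "n": [1, 2]})
print(choose_dta_version(demo))
```

The result can be passed straight to `to_stata`, for example `df.to_stata('output.dta', version=choose_dta_version(df))`.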
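python
# Format 117: readable by Stata 13 and later
# (the default, 114, is readable by Stata 10 and later)
df.to_stata('output.dta', version=117)
# Format 119: readable by Stata 15 and later; meant for more than 32,767 variables
df.to_stata('output.dta', version=119)

2. Handle large files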
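When only a few variables are needed from a wide file, `read_stata` accepts a `columns` argument and returns just those variables. A minimal sketch, assuming a hypothetical file `wide_demo.dta` that is created here only so the example runs on its own:

```python
import os
import tempfile

import pandas as pd

# Hypothetical wide file, written first so the sketch is self-contained.
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [17, 30, 45],
    "income": [0.0, 52000.0, 61000.0],
    "notes": ["a", "b", "c"],
})
path = os.path.join(tempfile.mkdtemp(), "wide_demo.dta")
wide.to_stata(path, write_index=False)

# Only the listed variables are returned.
subset = pd.read_stata(path, columns=["id", "age"])
print(subset.columns.tolist())
```

Selecting columns can be combined with `chunksize` when a file is both wide and long.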
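python
# Read a large Stata file in chunks
import pandas as pd

results = []
# With chunksize set, read_stata returns a reader that can be used as a
# context manager, so the file is closed when the loop finishes
with pd.read_stata('large_file.dta', chunksize=10000) as reader:
    for chunk in reader:
        # Process each chunk
        processed = chunk[chunk['age'] > 18]
        results.append(processed)
df = pd.concat(results, ignore_index=True)

3. Preserve data types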
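Dates are a common place where types drift. On the writing side, `to_stata` takes a `convert_dates` mapping from column name to a Stata date format. Datetime columns without an entry are written as `tc` (millisecond timestamps). The sketch below is a minimal example that writes a daily date with `td` and reads it back. The column name `interview_date` and the file name are assumptions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with one datetime column.
df = pd.DataFrame({
    "id": [1, 2],
    "interview_date": pd.to_datetime(["2023-01-15", "2023-06-30"]),
})
path = os.path.join(tempfile.mkdtemp(), "dates_demo.dta")

# "td" asks for Stata's daily date format instead of the default "tc".
df.to_stata(path, write_index=False, convert_dates={"interview_date": "td"})

# convert_dates=True (the default) turns Stata dates back into datetimes.
restored = pd.read_stata(path, convert_dates=True)
print(restored["interview_date"])
```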
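python
# Make sure dates, categoricals, and similar types are converted correctly
df = pd.read_stata(
    'data.dta',
    convert_dates=True,        # convert Stata dates to datetimes
    convert_categoricals=True  # keep value labels as categories
)

Python-Stata workflows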
Workflow 1: Stata preprocessing → Python analysis
python
# 1. Clean the data in Stata (.do file)
# 2. Read and analyze it in Python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_stata('clean_data.dta')

# Machine learning (not one of Stata's strengths)
X = df[['age', 'education_years']]
y = df['income']
model = LinearRegression()
model.fit(X, y)

Workflow 2: Python processing → Stata regression
python
# 1. Do the feature engineering in Python
import pandas as pd
import numpy as np

df = pd.read_stata('raw.dta')
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# 2. Save the result for Stata
df.to_stata('for_regression.dta', write_index=False)

# 3. Run the regression in Stata
# regress log_income age age_squared education

Exercises
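python
# Exercise 1: Format conversion
# Read survey.dta
# Add a new column 'income_category' (low/medium/high)
# Save as a new .dta file, keeping the variable labels

# Exercise 2: Batch processing
# Read all .dta files in a folder
# Combine them into one large dataset
# Add the source file name as a new column
# Save as combined.dta

Next steps
The next section covers JSON data processing.
Keep going!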