Python vs Stata vs R: Syntax Comparison Quick Reference
Quickly Build Python Thinking — Understand Python Through Familiar Stata/R
Core Concept Comparison
1. DataFrame Concept
The core of all three languages is the two-dimensional data table:
| Concept | Stata | R | Python (Pandas) |
|---|---|---|---|
| Data Frame | Dataset (only one in memory) | data.frame | DataFrame (multiple allowed) |
| Variable (Column) | Variable | Column | Column |
| Observation (Row) | Observation | Row | Row |
Key Difference:
- Stata: Traditionally works with only one dataset in memory at a time (frames in Stata 16+ relax this)
- R/Python: Can handle multiple data frames simultaneously
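A minimal pandas sketch of these terms, using made-up values: the DataFrame is the table, each column is a variable, each row is an observation, and several DataFrames can coexist in memory.

```python
import pandas as pd

# A DataFrame: columns are variables, rows are observations
df = pd.DataFrame({'country': ['CHN', 'USA'],
                   'income': [12000, 65000]})

# A second DataFrame can live in memory alongside the first
gdp = pd.DataFrame({'country': ['CHN', 'USA'],
                    'gdp': [17.7, 25.5]})
```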
Common Operations Comparison
Operation 1: Read CSV File
```stata
* Stata
import delimited "data.csv", clear
```

```r
# R
df <- read.csv("data.csv")
```

```python
# Python
import pandas as pd
df = pd.read_csv("data.csv")
```

Operation 2: View First Few Rows
```stata
* Stata
list in 1/5
browse in 1/5
```

```r
# R
head(df)
```

```python
# Python
df.head()
```

Operation 3: Create New Variable
Example: Create log of income
```stata
* Stata
gen log_income = log(income)
```

```r
# R
df$log_income <- log(df$income)
```

```python
# Python
import numpy as np
df['log_income'] = np.log(df['income'])
```

Operation 4: Conditional Filtering
Example: Filter observations where age > 30
```stata
* Stata
keep if age > 30
```

```r
# R
df_filtered <- df[df$age > 30, ]
# Or using dplyr
df_filtered <- df %>% filter(age > 30)
```

```python
# Python
df_filtered = df[df['age'] > 30]
```

Operation 5: Group Aggregation
Example: Calculate average income by country
```stata
* Stata
collapse (mean) avg_income=income, by(country)
```

```r
# R (base R)
aggregate(income ~ country, data=df, FUN=mean)

# R (dplyr)
df %>%
  group_by(country) %>%
  summarise(avg_income = mean(income))
```

```python
# Python
df.groupby('country')['income'].mean()
# Or more detailed syntax
df.groupby('country').agg({'income': 'mean'})
```

Operation 6: Descriptive Statistics
```stata
* Stata
summarize income age education
```

```r
# R
summary(df[c("income", "age", "education")])
```

```python
# Python
df[['income', 'age', 'education']].describe()
```

Operation 7: Regression Analysis
Example: OLS Regression
```stata
* Stata
regress income education age i.gender
```

```r
# R
model <- lm(income ~ education + age + factor(gender), data=df)
summary(model)
```

```python
# Python (statsmodels, closest to Stata)
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age + C(gender)', data=df).fit()
print(model.summary())

# Or using sklearn (more concise, but different output)
from sklearn.linear_model import LinearRegression
X = df[['education', 'age']]
y = df['income']
model = LinearRegression().fit(X, y)
```

Operation 8: Merge Data
Example: Merge two datasets by country
```stata
* Stata
merge 1:1 country using "gdp_data.dta"
```

```r
# R
merged <- merge(df1, df2, by="country", all=TRUE)
# R (dplyr)
merged <- df1 %>% full_join(df2, by="country")
```

```python
# Python
merged = pd.merge(df1, df2, on='country', how='outer')
```

Key Syntax Differences
1. Indexing Methods
| Operation | Stata | R | Python |
|---|---|---|---|
| Select Column | income | df$income | df['income'] |
| Select Multiple Columns | keep income age | df[c("income", "age")] | df[['income', 'age']] |
| Select Rows | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
2. Assignment Operations
| Operation | Stata | R | Python |
|---|---|---|---|
| Create New Variable | gen x = 1 | df$x <- 1 | df['x'] = 1 |
| Replace Variable | replace x = 2 if age > 30 | df$x[df$age > 30] <- 2 | df.loc[df['age'] > 30, 'x'] = 2 |
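A minimal pandas sketch tying the indexing and assignment tables together; the column names income, age, and x are purely illustrative.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'income': [52000, 61000, 38000],
                   'age': [28, 45, 33]})

# Indexing: single column, multiple columns, row filter
incomes = df['income']
subset = df[['income', 'age']]
older = df[df['age'] > 30]

# Assignment: create a new variable, then replace it conditionally
df['x'] = 1
df.loc[df['age'] > 30, 'x'] = 2
```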
3. Missing Values
| Operation | Stata | R | Python |
|---|---|---|---|
| Missing Value Symbol | . | NA | NaN or None |
| Drop Missing | drop if missing(income) | df <- na.omit(df) | df.dropna() |
| Fill Missing | replace income = 0 if missing(income) | df$income[is.na(df$income)] <- 0 | df['income'].fillna(0) |
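The same pattern for missing values, as a short pandas sketch; the income column is illustrative, and note that dropna/fillna return new objects unless reassigned.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [52000, np.nan, 38000]})

# Drop rows where income is missing
df_complete = df.dropna(subset=['income'])

# Or fill missing income with 0 (reassign, since fillna returns a copy)
df['income'] = df['income'].fillna(0)
```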
Mental Shift: From Stata/R to Python
Stata User Transition
Stata Thinking: Work with one dataset at a time
use "data1.dta", clear
gen new_var = old_var * 2
save "data1_new.dta", replacePython Thinking: Work with multiple data frames simultaneously
df1 = pd.read_csv("data1.csv")
df1['new_var'] = df1['old_var'] * 2
df1.to_csv("data1_new.csv")
# Can have df2, df3, ... in memory at the same time
df2 = pd.read_csv("data2.csv")R User Transition
R Thinking: Functional programming
```r
df %>%
  filter(age > 30) %>%
  mutate(log_income = log(income)) %>%
  group_by(country) %>%
  summarise(mean_income = mean(income))
```

Python Thinking: Object method chaining (similar to R pipes)

```python
(df
 .query('age > 30')
 .assign(log_income=lambda x: np.log(x['income']))
 .groupby('country')
 .agg({'income': 'mean'})
)
```

Practical Example: Replicating Classic Stata Analysis
Stata Code
```stata
* 1. Load data
use "survey_data.dta", clear

* 2. Data cleaning
drop if missing(income)
keep if age >= 18 & age <= 65

* 3. Create new variables
gen log_income = log(income)
gen age_squared = age^2

* 4. Descriptive statistics
tabstat income education age, by(gender) stat(mean sd)

* 5. Regression analysis
regress log_income education age age_squared i.gender
```

Python Equivalent Code
```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# 1. Load data
df = pd.read_stata("survey_data.dta")

# 2. Data cleaning
df = df.dropna(subset=['income'])
df = df[(df['age'] >= 18) & (df['age'] <= 65)]

# 3. Create new variables
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# 4. Descriptive statistics
print(df.groupby('gender')[['income', 'education', 'age']].agg(['mean', 'std']))

# 5. Regression analysis
model = smf.ols('log_income ~ education + age + age_squared + C(gender)', data=df).fit()
print(model.summary())
```

Output Comparison: Python's statsmodels regression output is nearly identical to Stata's!
Python's Unique Advantages
1. Handle Multiple Datasets Simultaneously
```python
# Load data from multiple countries at once
df_china = pd.read_csv("china_data.csv")
df_usa = pd.read_csv("usa_data.csv")
df_india = pd.read_csv("india_data.csv")

# Combine them
df_all = pd.concat([df_china, df_usa, df_india])
```

2. Loop Processing (More Flexible Than Stata)
```python
# Log transform multiple variables
for var in ['income', 'gdp', 'population']:
    df[f'log_{var}'] = np.log(df[var])
```

3. Direct API Calls
```python
# Difficult in Stata/R: get data directly from an API
import requests
response = requests.get("https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD?format=json")
data = response.json()
```

Advanced Operations Comparison
Operation 9: Panel Data Regression (Fixed Effects)
Example: Analyze impact of education on wages (controlling for individual fixed effects)
```stata
* Stata - Very concise!
xtset individual_id year
xtreg wage education experience, fe
```

```r
# R (plm package)
library(plm)
model <- plm(wage ~ education + experience,
             data = panel_data,
             index = c("individual_id", "year"),
             model = "within")
summary(model)
```

```python
# Python (linearmodels)
from linearmodels.panel import PanelOLS

# Set (entity, time) multi-index
panel_data = df.set_index(['individual_id', 'year'])

# Fixed effects regression
model = PanelOLS.from_formula(
    'wage ~ education + experience + EntityEffects',
    data=panel_data
)
result = model.fit(cov_type='clustered', cluster_entity=True)
print(result.summary)
```

Note: Stata is most concise for panel data, but Python's linearmodels is equally powerful.
Operation 10: Handle Categorical Variables (Factor Encoding)
```stata
* Stata - Auto-handled
regress wage i.education_level i.industry
```

```r
# R - Auto-handled
model <- lm(wage ~ factor(education_level) + factor(industry),
            data = df)
```

```python
# Python - Need explicit specification
import statsmodels.formula.api as smf
model = smf.ols('wage ~ C(education_level) + C(industry)',
                data=df).fit()

# Or manually encode with pd.get_dummies()
df_encoded = pd.get_dummies(df,
                            columns=['education_level', 'industry'],
                            drop_first=True)
```

Operation 11: Time Series Operations
Example: Calculate lags and growth rates
```stata
* Stata
tsset date
gen gdp_lag1 = L.gdp
gen gdp_growth = (gdp - L.gdp) / L.gdp
```

```r
# R (dplyr)
df <- df %>%
  arrange(date) %>%
  mutate(gdp_lag1 = lag(gdp),
         gdp_growth = (gdp - lag(gdp)) / lag(gdp))
```

```python
# Python (pandas)
df = df.sort_values('date')
df['gdp_lag1'] = df['gdp'].shift(1)
df['gdp_growth'] = df['gdp'].pct_change()
```

Operation 12: String Processing
Example: Extract last name, convert case
```stata
* Stata
gen last_name = word(name, -1)
gen name_upper = upper(name)
```

```r
# R (stringr)
library(stringr)
df$last_name <- word(df$name, -1)
df$name_upper <- str_to_upper(df$name)
```

```python
# Python (pandas str methods)
df['last_name'] = df['name'].str.split().str[-1]
df['name_upper'] = df['name'].str.upper()
# Or use a regex
df['last_name'] = df['name'].str.extract(r'(\w+)$')
```

Complex Analysis Workflow Comparison
Case: Complete Empirical Research Workflow
Research Question: Analyze impact of minimum wage policy on employment (DID design)
Stata Implementation
```stata
* 1. Load data
use "employment_data.dta", clear

* 2. Data cleaning
drop if missing(employment, wage, treatment)
keep if year >= 2010 & year <= 2020

* 3. Generate interaction terms
gen post = (year >= 2015)
gen treated = (state == "CA")
gen did = post * treated

* 4. Descriptive statistics (tabstat's by() takes one variable, so group the cells first)
egen cell = group(treated post), label
tabstat employment, by(cell) stat(mean sd n)

* 5. DID regression
regress employment did treated post controls, cluster(state)

* 6. Parallel trends test
forvalues y = 2010/2020 {
    gen year_`y' = (year == `y')
    gen treat_year_`y' = treated * year_`y'
}
regress employment treat_year_*, cluster(state)
```

Python Implementation
```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# 1. Load data
df = pd.read_stata("employment_data.dta")

# 2. Data cleaning
df = df.dropna(subset=['employment', 'wage', 'treatment'])
df = df[(df['year'] >= 2010) & (df['year'] <= 2020)]

# 3. Generate interaction terms
df['post'] = (df['year'] >= 2015).astype(int)
df['treated'] = (df['state'] == "CA").astype(int)
df['did'] = df['post'] * df['treated']

# 4. Descriptive statistics
summary = df.groupby(['treated', 'post'])['employment'].agg([
    ('mean', 'mean'),
    ('std', 'std'),
    ('n', 'count')
])
print(summary)

# 5. DID regression (clustered standard errors)
model = smf.ols('employment ~ did + treated + post + controls',
                data=df).fit(cov_type='cluster',
                             cov_kwds={'groups': df['state']})
print(model.summary())

# 6. Parallel trends test
df['year_str'] = df['year'].astype(str)
model_parallel = smf.ols(
    'employment ~ C(year_str):treated + C(year_str) + controls',
    data=df
).fit(cov_type='cluster', cov_kwds={'groups': df['state']})

# Extract coefficients and visualize
coefs = model_parallel.params.filter(regex='year_str.*treated')
ci = model_parallel.conf_int().loc[coefs.index]
plt.figure(figsize=(10, 6))
plt.errorbar(range(len(coefs)), coefs,
             yerr=[(coefs - ci[0]), (ci[1] - coefs)],
             fmt='o-')
plt.axhline(y=0, color='red', linestyle='--')
plt.axvline(x=5, color='gray', linestyle='--', label='Treatment Year')
plt.title('Parallel Trends Test')
plt.xlabel('Year')
plt.ylabel('Treatment Effect')
plt.legend()
plt.show()
```

Comparison:
- Stata: More concise code (~20 lines)
- Python: Slightly longer code (~40 lines), but stronger visualization and flexibility
Data Visualization Comparison
Case: Coefficient Plot of Regression Results
Stata
```stata
regress wage education experience female urban
coefplot, drop(_cons) xline(0)
```

R (ggplot2)
```r
library(ggplot2)
library(broom)

model <- lm(wage ~ education + experience + female + urban,
            data = df)
coef_df <- tidy(model, conf.int = TRUE)

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Regression Coefficients",
       x = "Estimate", y = "Variable")
```

Python (matplotlib)
```python
import matplotlib.pyplot as plt

model = smf.ols('wage ~ education + experience + female + urban',
                data=df).fit()

# Extract coefficients and confidence intervals
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')

fig, ax = plt.subplots(figsize=(8, 6))
y_pos = range(len(coefs))
ax.errorbar(coefs, y_pos,
            xerr=[(coefs - ci[0]), (ci[1] - coefs)],
            fmt='o', capsize=5)
ax.axvline(x=0, color='red', linestyle='--', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(coefs.index)
ax.set_xlabel('Coefficient Estimate')
ax.set_title('Regression Coefficients with 95% CI')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
```

Performance Comparison
Big Data Processing Speed (1 million rows)
| Operation | Stata | R (dplyr) | Python (pandas) |
|---|---|---|---|
| Read CSV | ~5 sec | ~3 sec | ~2 sec |
| Group Aggregation | ~2 sec | ~1 sec | ~0.8 sec |
| Merge Data | ~4 sec | ~2 sec | ~1.5 sec |
| Regression Analysis | ~0.5 sec | ~0.8 sec | ~0.6 sec |
Note: Python + pandas is usually fastest for big data, but Stata is optimized for regression analysis.
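These timings are indicative and will vary with hardware and data; one way to benchmark your own workflow in Python is the rough sketch below, where the file name and column names are placeholders.

```python
import time
import pandas as pd

# Time reading a large CSV (placeholder file)
start = time.perf_counter()
df = pd.read_csv("big_data.csv")
print(f"read_csv: {time.perf_counter() - start:.2f} s")

# Time a group aggregation (placeholder column names)
start = time.perf_counter()
df.groupby('country')['income'].mean()
print(f"groupby:  {time.perf_counter() - start:.2f} s")
```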
Ecosystem Comparison
Stata's Advantages
- Most comprehensive econometric methods (IV, DID, RDD, PSM)
- Most concise panel data handling
- Regression output format closest to academic requirements
- Concentrated community (StataList, SSC)
R's Advantages
- Richest statistical methods (Bayesian, survival analysis, factor analysis)
- Most elegant data visualization (ggplot2)
- Open source and free
- Huge CRAN package ecosystem (20,000+ packages)
Python's Advantages
- Strongest machine learning ecosystem (sklearn, PyTorch, TensorFlow)
- The dominant choice for deep learning and LLMs
- Strongest general programming capabilities
- Most comprehensive data engineering tools (scraping, APIs, databases)
- Largest job market demand
Learning Recommendations
If You're a Stata User
- Focus on learning Pandas (80% of Stata functionality can be replicated)
- Use statsmodels (output format closest to Stata)
- Remember: df['var'] ≈ a Stata variable name
- Learning Path:
  - Week 1: Pandas basics (equivalent to Stata data operations)
  - Week 2: statsmodels regression (equivalent to Stata regress)
  - Week 3: linearmodels panel data (equivalent to Stata xtreg)
  - Week 4: scikit-learn machine learning (something Stata cannot do)
If You're an R User
- Learn Pandas (similar to dplyr + data.table)
- Use plotnine (Python version of ggplot2)
- Remember: Python chains methods with . rather than piping with %>%
- Learning Path:
  - Week 1: Python basic syntax (R users pick this up fastest, about 3 days)
  - Week 2: Pandas (similar to the tidyverse)
  - Week 3: Matplotlib/Seaborn (not as elegant as ggplot2, but sufficient)
  - Week 4: sklearn + PyTorch (R's weak spot)
Three-Language Mixed Use Strategy
Best Practice: Choose tool based on task
```
Data Cleaning → Python (pandas)
        ↓
Descriptive Statistics → Stata/R (personal preference)
        ↓
Traditional Econometrics → Stata (panel data, IV)
        ↓
Machine Learning → Python (sklearn)
        ↓
Text Analysis → Python (transformers)
        ↓
Visualization → R (ggplot2) or Python (seaborn)
```

Next Steps
In the next section, we will write Your First Python Program and experience Python's simplicity and power.
Ready?