
Python vs Stata vs R: Syntax Comparison Quick Reference

Quickly Build Python Thinking — Understand Python Through Familiar Stata/R


Core Concept Comparison

1. DataFrame Concept

The core of all three languages is the two-dimensional data table:

Concept            | Stata                        | R          | Python (Pandas)
Data Frame         | Dataset (only one in memory) | data.frame | DataFrame (multiple allowed)
Variable (Column)  | Variable                     | Column     | Column
Observation (Row)  | Observation                  | Row        | Row

Key Difference:

  • Stata: Can only work with one dataset at a time
  • R/Python: Can handle multiple data frames simultaneously
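To make this concrete, here is a minimal sketch (with made-up values) that builds two small DataFrames in the same session, something the classic one-dataset-at-a-time Stata workflow does not allow:

python
# Illustrative only: hypothetical data showing columns as variables
# and rows as observations; both DataFrames coexist in memory.
import pandas as pd

people = pd.DataFrame({
    'name': ['Ana', 'Bo', 'Chen'],      # each column is a variable
    'age': [34, 28, 41],
    'income': [52000, 61000, 48000],
})                                       # each row is an observation

countries = pd.DataFrame({
    'country': ['CHN', 'USA'],
    'gdp_per_capita': [12500, 76300],
})

print(people.shape)     # (3, 3)
print(countries.shape)  # (2, 2)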

Common Operations Comparison

Operation 1: Read CSV File

stata
* Stata
import delimited "data.csv", clear
r
# R
df <- read.csv("data.csv")
python
# Python
import pandas as pd
df = pd.read_csv("data.csv")

Operation 2: View First Few Rows

stata
* Stata
list in 1/5
browse in 1/5
r
# R
head(df)
python
# Python
df.head()

Operation 3: Create New Variable

Example: Create log of income

stata
* Stata
gen log_income = log(income)
r
# R
df$log_income <- log(df$income)
python
# Python
import numpy as np
df['log_income'] = np.log(df['income'])

Operation 4: Conditional Filtering

Example: Filter observations where age > 30

stata
* Stata
keep if age > 30
r
# R
df_filtered <- df[df$age > 30, ]
# Or using dplyr
df_filtered <- df %>% filter(age > 30)
python
# Python
df_filtered = df[df['age'] > 30]

Operation 5: Group Aggregation

Example: Calculate average income by country

stata
* Stata
collapse (mean) avg_income=income, by(country)
r
# R (base R)
aggregate(income ~ country, data=df, FUN=mean)

# R (dplyr)
df %>%
  group_by(country) %>%
  summarise(avg_income = mean(income))
python
# Python
df.groupby('country')['income'].mean()

# Or more detailed syntax
df.groupby('country').agg({'income': 'mean'})

Operation 6: Descriptive Statistics

stata
* Stata
summarize income age education
r
# R
summary(df[c("income", "age", "education")])
python
# Python
df[['income', 'age', 'education']].describe()

Operation 7: Regression Analysis

Example: OLS Regression

stata
* Stata
regress income education age i.gender
r
# R
model <- lm(income ~ education + age + factor(gender), data=df)
summary(model)
python
# Python (statsmodels, closest to Stata)
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age + C(gender)', data=df).fit()
print(model.summary())

# Or using sklearn (more concise, but different output)
from sklearn.linear_model import LinearRegression
X = df[['education', 'age']]
y = df['income']
model = LinearRegression().fit(X, y)

Operation 8: Merge Data

Example: Merge two datasets by country

stata
* Stata
merge 1:1 country using "gdp_data.dta"
r
# R
merged <- merge(df1, df2, by="country", all=TRUE)

# R (dplyr)
merged <- df1 %>% full_join(df2, by="country")
python
# Python
merged = pd.merge(df1, df2, on='country', how='outer')

Key Syntax Differences

1. Indexing Methods

Operation               | Stata            | R                      | Python
Select Column           | income           | df$income              | df['income']
Select Multiple Columns | keep income age  | df[c("income", "age")] | df[['income', 'age']]
Select Rows             | keep if age > 30 | df[df$age > 30, ]      | df[df['age'] > 30]
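As a quick consolidation of the Python column (assuming a DataFrame df with income and age columns, as in the earlier examples):

python
# Selection sketch; assumes df has 'income' and 'age' columns
income = df['income']                               # single column -> Series
subset = df[['income', 'age']]                      # list of columns -> DataFrame
over_30 = df[df['age'] > 30]                        # boolean mask -> filtered rows
over_30_income = df.loc[df['age'] > 30, 'income']   # rows and column in one step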

2. Assignment Operations

Operation           | Stata                     | R                      | Python
Create New Variable | gen x = 1                 | df$x <- 1              | df['x'] = 1
Replace Variable    | replace x = 2 if age > 30 | df$x[df$age > 30] <- 2 | df.loc[df['age'] > 30, 'x'] = 2
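In context, the same assignments look like this (a sketch assuming df has an age column; conditional replacement should go through .loc rather than chained indexing):

python
# Assignment sketch; assumes df has an 'age' column
df['x'] = 1                          # create a new constant column
df.loc[df['age'] > 30, 'x'] = 2      # replace only where the condition holds
# Avoid df['x'][df['age'] > 30] = 2 -- chained indexing may fail to write back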

3. Missing Values

Operation            | Stata                                 | R                                | Python
Missing Value Symbol | .                                     | NA                               | NaN or None
Drop Missing         | drop if missing(income)               | df <- na.omit(df)                | df.dropna()
Fill Missing         | replace income = 0 if missing(income) | df$income[is.na(df$income)] <- 0 | df['income'].fillna(0)
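Unlike the Stata commands, dropna() and fillna() return new objects rather than modifying the data in place, so the result must be assigned back (a sketch assuming an income column):

python
# Missing-value sketch; assumes df has an 'income' column
df_complete = df.dropna(subset=['income'])   # keep rows with non-missing income
df['income'] = df['income'].fillna(0)        # fill missing income with 0
print(df['income'].isna().sum())             # remaining missing values: 0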

Mental Shift: From Stata/R to Python

Stata User Transition

Stata Thinking: Work with one dataset at a time

stata
use "data1.dta", clear
gen new_var = old_var * 2
save "data1_new.dta", replace

Python Thinking: Work with multiple data frames simultaneously

python
df1 = pd.read_csv("data1.csv")
df1['new_var'] = df1['old_var'] * 2
df1.to_csv("data1_new.csv", index=False)  # index=False avoids writing the row index as an extra column

# Can have df2, df3, ... in memory at the same time
df2 = pd.read_csv("data2.csv")

R User Transition

R Thinking: Functional programming

r
df %>%
  filter(age > 30) %>%
  mutate(log_income = log(income)) %>%
  group_by(country) %>%
  summarise(mean_income = mean(income))

Python Thinking: Object method chaining (similar to R pipes)

python
(df
 .query('age > 30')
 .assign(log_income=lambda x: np.log(x['income']))
 .groupby('country')
 .agg({'income': 'mean'})
)

Practical Example: Replicating Classic Stata Analysis

Stata Code

stata
* 1. Load data
use "survey_data.dta", clear

* 2. Data cleaning
drop if missing(income)
keep if age >= 18 & age <= 65

* 3. Create new variables
gen log_income = log(income)
gen age_squared = age^2

* 4. Descriptive statistics
tabstat income education age, by(gender) stat(mean sd)

* 5. Regression analysis
regress log_income education age age_squared i.gender

Python Equivalent Code

python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# 1. Load data
df = pd.read_stata("survey_data.dta")

# 2. Data cleaning
df = df.dropna(subset=['income'])
df = df[(df['age'] >= 18) & (df['age'] <= 65)]

# 3. Create new variables
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# 4. Descriptive statistics
df.groupby('gender')[['income', 'education', 'age']].agg(['mean', 'std'])

# 5. Regression analysis
model = smf.ols('log_income ~ education + age + age_squared + C(gender)', data=df).fit()
print(model.summary())

Output Comparison: the statsmodels regression table is very close to Stata's output.


Python's Unique Advantages

1. Handle Multiple Datasets Simultaneously

python
# Load data from multiple countries at once
df_china = pd.read_csv("china_data.csv")
df_usa = pd.read_csv("usa_data.csv")
df_india = pd.read_csv("india_data.csv")

# Combine them
df_all = pd.concat([df_china, df_usa, df_india])

2. Loop Processing (More Flexible Than Stata)

python
# Log transform multiple variables
for var in ['income', 'gdp', 'population']:
    df[f'log_{var}'] = np.log(df[var])

3. Direct API Calls

python
# Fetch data directly from a web API (awkward in Stata, needs extra packages in R)
import requests
response = requests.get("https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD?format=json")
data = response.json()
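To continue the example, the World Bank v2 API typically returns a two-element JSON list (metadata first, then the records); assuming that layout and its 'date'/'value' field names, a rough sketch of turning the records into a DataFrame:

python
# Sketch only: assumes data[0] is metadata and data[1] holds records
# with 'date' and 'value' fields (the usual World Bank v2 layout).
import pandas as pd

records = data[1]
gdp = pd.DataFrame(records)[['date', 'value']]
gdp = gdp.rename(columns={'value': 'gdp_per_capita'})
print(gdp.head())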

Advanced Operations Comparison

Operation 9: Panel Data Regression (Fixed Effects)

Example: Analyze impact of education on wages (controlling for individual fixed effects)

stata
* Stata - Very concise!
xtset individual_id year
xtreg wage education experience, fe
r
# R (plm package)
library(plm)
model <- plm(wage ~ education + experience,
             data = panel_data,
             index = c("individual_id", "year"),
             model = "within")
summary(model)
python
# Python (linearmodels)
from linearmodels.panel import PanelOLS

# Set multi-index
panel_data = df.set_index(['individual_id', 'year'])

# Fixed effects regression
model = PanelOLS.from_formula(
    'wage ~ education + experience + EntityEffects',
    data=panel_data
)
result = model.fit(cov_type='clustered', cluster_entity=True)
print(result.summary)

Note: Stata is most concise for panel data, but Python's linearmodels is equally powerful.


Operation 10: Handle Categorical Variables (Factor Encoding)

stata
* Stata - Auto-handled
regress wage i.education_level i.industry
r
# R - Auto-handled
model <- lm(wage ~ factor(education_level) + factor(industry),
            data = df)
python
# Python - Need explicit specification
import statsmodels.formula.api as smf

model = smf.ols('wage ~ C(education_level) + C(industry)',
                data=df).fit()
# Or manually encode with pd.get_dummies()
df_encoded = pd.get_dummies(df,
                             columns=['education_level', 'industry'],
                             drop_first=True)

Operation 11: Time Series Operations

Example: Calculate lags and growth rates

stata
* Stata
tsset date
gen gdp_lag1 = L.gdp
gen gdp_growth = (gdp - L.gdp) / L.gdp
r
# R (dplyr)
df <- df %>%
  arrange(date) %>%
  mutate(gdp_lag1 = lag(gdp),
         gdp_growth = (gdp - lag(gdp)) / lag(gdp))
python
# Python (pandas)
df = df.sort_values('date')
df['gdp_lag1'] = df['gdp'].shift(1)
df['gdp_growth'] = df['gdp'].pct_change()

Operation 12: String Processing

Example: Extract last name, convert case

stata
* Stata
gen last_name = word(name, -1)
gen name_upper = upper(name)
r
# R (stringr)
library(stringr)
df$last_name <- word(df$name, -1)
df$name_upper <- str_to_upper(df$name)
python
# Python (pandas + str methods)
df['last_name'] = df['name'].str.split().str[-1]
df['name_upper'] = df['name'].str.upper()

# Or use regex
df['last_name'] = df['name'].str.extract(r'(\w+)$')

Complex Analysis Workflow Comparison

Case: Complete Empirical Research Workflow

Research Question: Analyze impact of minimum wage policy on employment (DID design)

Stata Implementation

stata
* 1. Load data
use "employment_data.dta", clear

* 2. Data cleaning
drop if missing(employment, wage, treatment)
keep if year >= 2010 & year <= 2020

* 3. Generate interaction terms
gen post = (year >= 2015)
gen treated = (state == "CA")
gen did = post * treated

* 4. Descriptive statistics
egen cell = group(treated post), label
tabstat employment, by(cell) stat(mean sd n)

* 5. DID regression
regress employment did treated post controls, cluster(state)

* 6. Parallel trends test
forvalues y = 2010/2020 {
    gen year_`y' = (year == `y')
    gen treat_year_`y' = treated * year_`y'
}
regress employment treat_year_*, cluster(state)

Python Implementation

python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# 1. Load data
df = pd.read_stata("employment_data.dta")

# 2. Data cleaning
df = df.dropna(subset=['employment', 'wage', 'treatment'])
df = df[(df['year'] >= 2010) & (df['year'] <= 2020)]

# 3. Generate interaction terms
df['post'] = (df['year'] >= 2015).astype(int)
df['treated'] = (df['state'] == "CA").astype(int)
df['did'] = df['post'] * df['treated']

# 4. Descriptive statistics
summary = df.groupby(['treated', 'post'])['employment'].agg([
    ('mean', 'mean'),
    ('std', 'std'),
    ('n', 'count')
])
print(summary)

# 5. DID regression (clustered standard errors)
model = smf.ols('employment ~ did + treated + post + controls',
                data=df).fit(cov_type='cluster',
                             cov_kwds={'groups': df['state']})
print(model.summary())

# 6. Parallel trends test
df['year_str'] = df['year'].astype(str)
model_parallel = smf.ols(
    'employment ~ C(year_str):treated + C(year_str) + controls',
    data=df
).fit(cov_type='cluster', cov_kwds={'groups': df['state']})

# Extract coefficients and visualize
coefs = model_parallel.params.filter(regex='year_str.*treated')
ci = model_parallel.conf_int().loc[coefs.index]

plt.figure(figsize=(10, 6))
plt.errorbar(range(len(coefs)), coefs,
             yerr=[(coefs - ci[0]), (ci[1] - coefs)],
             fmt='o-')
plt.axhline(y=0, color='red', linestyle='--')
plt.axvline(x=5, color='gray', linestyle='--', label='Treatment Year')
plt.title('Parallel Trends Test')
plt.xlabel('Year')
plt.ylabel('Treatment Effect')
plt.legend()
plt.show()

Comparison:

  • Stata: More concise code (~20 lines)
  • Python: Slightly longer code (~40 lines), but stronger visualization and flexibility

Data Visualization Comparison

Case: Plot Coefficient Plot of Regression Results

Stata

stata
* coefplot is a user-written command from SSC (install once with: ssc install coefplot)
regress wage education experience female urban
coefplot, drop(_cons) xline(0)

R (ggplot2)

r
library(ggplot2)
library(broom)

model <- lm(wage ~ education + experience + female + urban,
            data = df)
coef_df <- tidy(model, conf.int = TRUE)

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Regression Coefficients",
       x = "Estimate", y = "Variable")

Python (matplotlib + seaborn)

python
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

model = smf.ols('wage ~ education + experience + female + urban',
                data=df).fit()

# Extract coefficients and confidence intervals
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')

fig, ax = plt.subplots(figsize=(8, 6))
y_pos = range(len(coefs))

ax.errorbar(coefs, y_pos,
            xerr=[(coefs - ci[0]), (ci[1] - coefs)],
            fmt='o', capsize=5)
ax.axvline(x=0, color='red', linestyle='--', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(coefs.index)
ax.set_xlabel('Coefficient Estimate')
ax.set_title('Regression Coefficients with 95% CI')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

Performance Comparison

Big Data Processing Speed (1 million rows)

Operation           | Stata    | R (dplyr) | Python (pandas)
Read CSV            | ~5 sec   | ~3 sec    | ~2 sec
Group Aggregation   | ~2 sec   | ~1 sec    | ~0.8 sec
Merge Data          | ~4 sec   | ~2 sec    | ~1.5 sec
Regression Analysis | ~0.5 sec | ~0.8 sec  | ~0.6 sec

Note: For data manipulation at this scale, Python + pandas is usually fastest, while Stata's regression routines remain highly optimized.
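These timings are indicative only and vary with hardware, file format, and library versions. Below is a minimal, hypothetical sketch of how one might time a pandas group aggregation on synthetic data of the same size:

python
# Rough benchmark sketch on synthetic data (1 million rows)
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'country': rng.choice(['CHN', 'USA', 'IND', 'BRA'], size=1_000_000),
    'income': rng.normal(50_000, 10_000, size=1_000_000),
})

start = time.perf_counter()
df.groupby('country')['income'].mean()
print(f"Group aggregation: {time.perf_counter() - start:.3f} sec")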


Ecosystem Comparison

Stata's Advantages

  • Most comprehensive econometric methods (IV, DID, RDD, PSM)
  • Most concise panel data handling
  • Regression output format closest to academic requirements
  • Concentrated community (StataList, SSC)

R's Advantages

  • Richest statistical methods (Bayesian, survival analysis, factor analysis)
  • Most elegant data visualization (ggplot2)
  • Open source and free
  • Huge CRAN package ecosystem (20,000+ packages)

Python's Advantages

  • Strongest machine learning ecosystem (sklearn, PyTorch, TensorFlow)
  • De facto standard for deep learning and LLMs
  • Strongest general programming capabilities
  • Most comprehensive data engineering tools (scraping, APIs, databases)
  • Largest job market demand

Learning Recommendations

If You're a Stata User

  1. Focus on learning Pandas (80% of Stata functionality can be replicated)
  2. Use statsmodels (output format closest to Stata)
  3. Remember: df['var'] ≈ Stata variable name
  4. Learning Path:
    • Week 1: Pandas basics (equivalent to Stata data operations)
    • Week 2: statsmodels regression (equivalent to Stata regress)
    • Week 3: linearmodels panel data (equivalent to Stata xtreg)
    • Week 4: scikit-learn machine learning (not available in Stata)

If You're an R User

  1. Learn Pandas (similar to dplyr + data.table)
  2. Use plotnine (Python version of ggplot2)
  3. Remember: Python chains methods with . instead of piping with %>%
  4. Learning Path:
    • Week 1: Python basic syntax (R users pick this up fastest, about 3 days)
    • Week 2: Pandas (similar to the tidyverse)
    • Week 3: Matplotlib/Seaborn (less elegant than ggplot2, but sufficient)
    • Week 4: sklearn + PyTorch (areas where R is weaker)

Three-Language Mixed Use Strategy

Best Practice: Choose tool based on task

  • Data Cleaning → Python (pandas)
  • Descriptive Statistics → Stata/R (personal preference)
  • Traditional Econometrics → Stata (panel data, IV)
  • Machine Learning → Python (sklearn)
  • Text Analysis → Python (transformers)
  • Visualization → R (ggplot2) or Python (seaborn)

Next Steps

In the next section, we will write Your First Python Program and experience Python's simplicity and power.

Ready?
