Why Should Social Science Students Learn Python?

From Stata/R to Python — Exploring New Tools for Data Analysis


Evolution of Computing Tools in Social Science Research

Limitations of Traditional Tools

As a social science student, you may have already encountered:

  • Stata: The mainstream tool in economics and political science, excels at panel data and econometric models
  • R: A powerful tool for statistics with a rich collection of statistical packages
  • SPSS: Commonly used in psychology and sociology, user-friendly interface but limited functionality

These tools are excellent in their respective fields, but in the AI era, they face some challenges:

| Dimension | Stata | R | Python |
| --- | --- | --- | --- |
| Data Analysis | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Statistical Modeling | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Machine Learning | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Deep Learning | ⭐ | ⭐ | ⭐⭐⭐⭐⭐ |
| LLM API Calls | ⭐ | ⭐ | ⭐⭐⭐⭐⭐ |
| Web Scraping | ⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| General Programming | ⭐ | ⭐ | ⭐⭐⭐⭐⭐ |
| Community Ecosystem | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

Unique Value of Python for Social Science Students

1. Unified Workflow

With Python, you can complete everything in one environment:

python
# Step 1: Scrape data
import requests
resp = requests.get('https://api.worldbank.org/v2/country/all/indicator/NY.GDP.PCAP.CD')

# Step 2: Data cleaning and analysis
import pandas as pd
df = pd.read_csv('survey_data.csv')
df.groupby('country')['income'].mean()

# Step 3: Statistical modeling
from statsmodels.formula.api import ols
model = ols('income ~ education + age', data=df).fit()

# Step 4: Machine learning prediction
# (X_train, y_train stand for a feature matrix and labels prepared earlier)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Step 5: Call LLM to generate text
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this research report"}]
)

Comparison with Stata/R:

  • Stata: Needs external tools to scrape data → import into Stata → model → no way to call LLMs
  • R: Can scrape data and model, but its machine learning and LLM ecosystems are weaker

2. Essential Skills for the AI Era

Current frontier trends in social science research:

  • Text Analysis: Use LLMs to analyze policy documents, social media, historical documents
  • Causal Inference: New methods combining machine learning + causal inference (Double ML, Causal Forest)
  • Experimental Design: A/B testing, multi-armed bandit algorithms (a minimal sketch follows this list)
  • Big Data Processing: Handle millions of rows of administrative data, network data

The best tool for all these frontier methods is Python.
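
To make the experimental-design item concrete, here is a minimal epsilon-greedy bandit sketch on simulated data; the three treatment arms and their conversion rates are invented for illustration:

python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.05, 0.08, 0.12]   # hypothetical conversion rates of three treatments
counts = np.zeros(3)
rewards = np.zeros(3)
epsilon = 0.1                     # exploration probability

for t in range(10_000):
    if rng.random() < epsilon or counts.min() == 0:
        arm = int(rng.integers(3))              # explore: pick a random arm
    else:
        arm = int(np.argmax(rewards / counts))  # exploit: pick the best arm so far
    reward = float(rng.random() < true_rates[arm])  # simulated subject response
    counts[arm] += 1
    rewards[arm] += reward

print("Assignments per arm:", counts)           # the best arm should dominate
print("Estimated rates:", rewards / counts)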

3. Job Market Demand

| Job Type | Stata | R | Python |
| --- | --- | --- | --- |
| Academic Research (Economics) | ✓ | ✓ | ✓ |
| Data Analyst | | ✓ | ✓ |
| Data Scientist | | ✓ | ✓ |
| AI Product Manager | | | ✓ |
| Policy Analyst | ✓ | ✓ | ✓ |

4. Native Language of Large Language Models

If you've used ChatGPT, Doubao, or Qwen, you know that when you ask an AI to solve a problem with code, especially a data science problem, Python is its default language. If you're fluent in Python too, communicating with AI becomes far more effective.

5. Econometric Modeling Capability: On Par with Stata

Many economics students worry that Python isn't as professional as Stata for econometric modeling. In fact, with the statsmodels and linearmodels packages, Python already has a complete econometric toolchain.

Case Study: Multi-Model Comparison Analysis of Wage Regression

Suppose we want to study the impact of education and work experience on wages, and compare different model specifications:

python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.iolib.summary2 import summary_col

# Step 1: Generate simulated data (replace with real data in actual research)
np.random.seed(42)
n = 1000

data = pd.DataFrame({
    'education': np.random.randint(12, 20, n),   # Years of education
    'experience': np.random.randint(0, 30, n),   # Years of work experience
    'age': np.random.randint(22, 60, n),
    'gender': np.random.choice([0, 1], n),       # 0 = female, 1 = male
    'region': np.random.choice(['East', 'West', 'South', 'North'], n)
})

# Generate wages from a known causal structure (so the true coefficients are known)
data['wage'] = (
    20000 +
    3000 * data['education'] +
    800 * data['experience'] +
    5000 * data['gender'] +
    np.random.normal(0, 5000, n)
)

# Step 2: Build multiple regression models

# Model 1: Baseline OLS (education only)
model1 = ols('wage ~ education', data=data).fit()

# Model 2: Add work experience
model2 = ols('wage ~ education + experience', data=data).fit()

# Model 3: Add gender control variable
model3 = ols('wage ~ education + experience + gender', data=data).fit()

# Model 4: Add interaction term (education × experience)
model4 = ols('wage ~ education + experience + gender + education:experience',
             data=data).fit()

# Model 5: Add region fixed effects
model5 = ols('wage ~ education + experience + gender + C(region)',
             data=data).fit()

# Step 3: Use summary_col to consolidate the output into one table
# (Python's counterpart to Stata's esttab)
results_table = summary_col(
    [model1, model2, model3, model4, model5],
    model_names=['Model(1)', 'Model(2)', 'Model(3)', 'Model(4)', 'Model(5)'],
    stars=True,  # Add significance stars
    float_format='%.2f',
    info_dict={
        'N': lambda x: f"{int(x.nobs)}",
        'R²': lambda x: f"{x.rsquared:.3f}"
    }
)

print(results_table)

Output (similar to Stata regression tables):

===============================================================================
                       Model(1)   Model(2)   Model(3)   Model(4)   Model(5)
-------------------------------------------------------------------------------
education            2891.35*** 2902.44*** 2903.12*** 2654.89*** 2897.85***
                     (138.99)   (99.87)    (79.45)    (156.32)   (79.82)
experience                      822.15***  819.43***  645.23***  816.78***
                                (45.23)    (35.98)    (89.45)    (36.15)
gender                                     4998.67*** 4987.34*** 4992.11***
                                           (298.76)   (299.12)   (299.34)
education:experience                                  12.56
                                                      (6.78)
C(region)[T.North]                                               -234.56
                                                                 (445.23)
C(region)[T.South]                                               156.78
                                                                 (438.91)
C(region)[T.West]                                                89.34
                                                                 (442.67)
R²                   0.293      0.456      0.612      0.615      0.613
N                    1000       1000       1000       1000       1000
===============================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

What does this case demonstrate?

  1. Professional regression output: summary_col can generate regression tables as professional as Stata's esttab
  2. Multi-model comparison: Easily display different model specifications for robustness checks
  3. Flexible model specification: Supports interaction terms, fixed effects, clustered standard errors, and other advanced features
  4. Complete ecosystem:
    • statsmodels: OLS, Logit, Probit, time series (ARIMA), plus the regression machinery behind designs like DID and RDD
    • linearmodels: Panel data (fixed effects, random effects, instrumental variables), GMM estimation
    • econml/dowhy: Combining machine learning with causal inference (Double ML, Causal Forest)

Comparison with Stata:

  • Python can also generate journal-quality regression tables
  • Supports all mainstream econometric methods (IV, DID, RDD, panel data, etc.)
  • Additional advantage: Seamless integration with machine learning, deep learning, and LLM analysis

Practical Usage Tips:

  • To export to LaTeX or Word format, use the stargazer package (Python version)
  • For panel data, linearmodels.PanelOLS provides the same functionality as Stata's xtreg
  • Clustered standard errors can be implemented via .fit(cov_type='cluster', cov_kwds={'groups': data['cluster_id']})
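
To make the last two tips concrete, here is a minimal sketch of an entity fixed-effects regression with clustered standard errors in linearmodels; the file name and the person_id / year identifiers are hypothetical, so substitute your own panel identifiers:

python
import pandas as pd
from linearmodels.panel import PanelOLS

# Hypothetical long-format panel: one row per person-year
df = pd.read_csv('panel_data.csv')
panel = df.set_index(['person_id', 'year'])   # linearmodels expects an (entity, time) index

# EntityEffects adds individual fixed effects (the analogue of Stata's xtreg, fe);
# regressors must vary within person, or they are absorbed by the fixed effects
fe = PanelOLS.from_formula(
    'wage ~ education + experience + EntityEffects',
    data=panel
).fit(cov_type='clustered', cluster_entity=True)  # SEs clustered by person

print(fe.summary)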

6. Jupyter Notebook: The Gold Standard for Research Reproducibility

In modern scientific research, reproducibility has become a core requirement for academic integrity and research quality. Top journals (such as AER, QJE, Nature, and Science) require authors to submit reproducible code and data, and Jupyter Notebook is the best tool for achieving research reproducibility.

Why is Jupyter Notebook the Best Tool for Presenting Data Analysis?

Core Advantage: Code, Results, and Explanations in One

Traditional workflow (like Stata):

Code file (.do) → Run → Output file (.log, .tex) → Manually organize into paper

Problem: code and results live in separate files; after every code change you must re-run the script and manually update all the outputs

Jupyter Notebook's Revolutionary Change:

One document (.ipynb) = Code + Execution results + Charts + Explanatory text

Real Case: A Complete Research Analysis Workflow

Suppose you want to analyze "the impact of education on income," with Jupyter Notebook you can present it like this:

Traditional Method (Stata):

  1. Write analysis.do file
  2. Run to get results.log
  3. Use esttab to export regression table to .tex
  4. Manually copy-paste to Word/LaTeX paper
  5. If modifications needed, repeat steps 2-4

Jupyter Notebook Method:

markdown
# Analysis of Education's Impact on Income

## 1. Data Loading and Cleaning

```python
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
from statsmodels.iolib.summary2 import summary_col
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('wage_data.csv')

# View data overview
print(f"Sample size: {len(data)}")
print(f"Variable list: {data.columns.tolist()}")
data.describe()
```

**Output** (automatically displayed in notebook):
```
Sample size: 5000
Variable list: ['wage', 'education', 'experience', 'age', 'gender']
              wage   education  experience        age     gender
count    5000.000      5000.00     5000.00    5000.00    5000.00
mean    48234.56        15.23       12.45      38.67       0.51
std     18456.78         2.34        8.12      10.23       0.50
...
```

## 2. Descriptive Statistics Visualization

```python
# Plot scatter of education vs wage
plt.figure(figsize=(10, 6))
plt.scatter(data['education'], data['wage'], alpha=0.5)
plt.xlabel('Years of Education')
plt.ylabel('Annual Income')
plt.title('Relationship Between Education and Income')
plt.show()
```

**Output** (chart embedded directly):
[Scatter plot automatically displays in notebook, no manual insertion needed]

## 3. Regression Analysis

```python
# Build multiple models
model1 = ols('wage ~ education', data=data).fit()
model2 = ols('wage ~ education + experience', data=data).fit()
model3 = ols('wage ~ education + experience + gender', data=data).fit()

# Generate regression table
results = summary_col([model1, model2, model3],
                      model_names=['Model(1)', 'Model(2)', 'Model(3)'],
                      stars=True)
print(results)
```

**Output** (regression table displayed directly):
```
                       Model(1)   Model(2)   Model(3)
education            2850.34*** 2860.12*** 2855.67***
                     (145.23)   (102.34)   (85.12)
...
```

## 4. Conclusion

We found that each additional year of education increases income by approximately 2,850 yuan on average, and remains significant after controlling for work experience and gender.
This supports the predictions of human capital theory.

Core Advantages of Jupyter Notebook

1. One-Click Reproducibility

  • Recipients can open the .ipynb file and click "Run All" to reproduce all results
  • No need to manually run multiple files or manually organize outputs
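
The same "Run All" can also be scripted from the command line with nbconvert, Jupyter's bundled converter (the notebook name is a placeholder):

bash
# Execute every cell from top to bottom and save the executed notebook in place
jupyter nbconvert --to notebook --execute --inplace main_analysis.ipynb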

2. Code and Results Update Synchronously

  • After modifying code, re-run the cell and output updates automatically
  • No mismatch between code and results

3. Interactive Exploration

  • Can modify parameters and re-run code blocks at any time
  • Convenient for debugging and trying different model specifications

4. Rich Output Formats

  • Supports tables, charts, mathematical formulas (LaTeX), Markdown text
  • Can export to PDF, HTML, Slides, and other formats
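
For example, a finished notebook can be exported from the command line (the file name is a placeholder):

bash
# Self-contained HTML, viewable without Jupyter
jupyter nbconvert --to html main_analysis.ipynb

# PDF (requires a LaTeX installation)
jupyter nbconvert --to pdf main_analysis.ipynb

# reveal.js slides
jupyter nbconvert --to slides main_analysis.ipynb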

5. Meets Journal Reproducibility Standards

Mainstream journal reproducibility policies (2024):

| Journal | Reproducibility Requirement | Jupyter Notebook Support |
| --- | --- | --- |
| AER | Must provide runnable code and data | Explicitly accepts .ipynb |
| QJE | Requires a replication package | Accepts Jupyter Notebook |
| Nature | Must provide a Code Availability statement | Recommends using Jupyter |
| Science | Requires code archived in a public repository | Supports .ipynb |

Real Examples:

  • 2019 Nobel Economics Prize winner Abhijit Banerjee's team uses Jupyter Notebook to publish replication code
  • Over 60% of data science papers in Nature use Jupyter Notebook as supplementary material

Comparison with Traditional Tools

| Dimension | Stata (.do + .log) | R Script + RMarkdown | Jupyter Notebook |
| --- | --- | --- | --- |
| Code-result integration | Separated | Supported via RMarkdown | Native support |
| Interactive execution | Must re-run everything | Partial support | Full support |
| Auto-embedded charts | Needs manual export | Supported via RMarkdown | Automatic display |
| Learning curve | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Cross-language support | Stata only | R only | Python/R/Julia, etc. |
| Sharing convenience | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Version control (Git) | Plain text, Git-friendly | Plain text, Git-friendly | Works, but needs configuration |

RMarkdown vs Jupyter Notebook:

  • RMarkdown is popular in the R community, but Jupyter is more universal in data science
  • Jupyter supports 40+ programming languages (Python, R, Julia, Stata, etc.)
  • Google Colab, Kaggle, and other platforms are all based on Jupyter

Practical Use Cases

Scenario 1: Course Assignment Submission

  • Students submit .ipynb files, teachers can directly run to verify results
  • Clearer than submitting Word documents + code files

Scenario 2: Sharing Within Research Groups

  • Team members share analysis notebooks, everyone sees identical results
  • Avoids "works on my machine" problems

Scenario 3: Replication Package for Paper Submission

submission/
├── data/
│   └── analysis_data.csv
├── code/
│   └── main_analysis.ipynb    # Main analysis notebook
├── figures/
│   └── (automatically generated by notebook)
└── README.md

Reviewers and editors only need to run main_analysis.ipynb to reproduce all the results.

Scenario 4: Public Research Code (Boost Citation Rates)

  • Publish to GitHub, other researchers can directly view and run
  • Many highly-cited papers publicly share Jupyter Notebook code

Getting Started with Jupyter Notebook

Installation and startup are very simple:

bash
# Install Jupyter
pip install jupyter

# Start Jupyter Notebook
jupyter notebook

# Or use more modern JupyterLab
pip install jupyterlab
jupyter lab

Free Online Use (no installation required):

  • Google Colab: https://colab.research.google.com
  • Kaggle Notebooks: https://www.kaggle.com/code

Summary

Jupyter Notebook is not just a programming tool, but the reproducibility standard for modern scientific research. It makes your research:

  • More transparent: Code and results correspond exactly, so every reported number can be traced to the code that produced it
  • Easier to reproduce: Recipients can verify with one-click execution
  • Easier to share: Export to HTML/PDF, non-programmers can also view
  • Meets journal requirements: Satisfies reproducibility policies of top journals

In the AI and big data era, mastering Jupyter Notebook is an essential skill for social science students. All code examples in this tutorial will provide Jupyter Notebook versions to help you get started quickly.


Real Cases: Python Applications in Social Science Research

Case 1: Large-Scale Text Analysis (Political Science)

Research Question: Analyze 1 million tweets to study political polarization

python
import pandas as pd
from transformers import pipeline

# Load pre-trained sentiment analysis model
sentiment_analyzer = pipeline("sentiment-analysis")

# Batch process Twitter data
# (row-by-row .apply is fine for a demo; for a million tweets,
#  pass the text column to the pipeline in batches instead)
tweets = pd.read_csv("twitter_data.csv")
tweets['sentiment'] = tweets['text'].apply(
    lambda x: sentiment_analyzer(x)[0]['label']
)

# Group analysis by party
result = tweets.groupby(['party', 'sentiment']).size()

Why not Stata/R?

  • Stata: Cannot call Transformer models
  • R: Can call them, but the ecosystem is less mature than Python's
  • Python: Has Hugging Face ecosystem, rich model selection

Case 2: Machine Learning in Causal Inference (Economics)

Research Question: Use Double Machine Learning to estimate returns to education

python
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor

# Double ML estimation
# (wages, education, controls, X_test stand for prepared arrays/DataFrames)
dml = LinearDML(
    model_y=RandomForestRegressor(),   # model for the outcome
    model_t=RandomForestRegressor()    # model for the treatment
)
dml.fit(Y=wages, T=education, X=controls)

# Get causal effect
treatment_effect = dml.effect(X_test)
print(f"Causal effect of education: {treatment_effect.mean():.2f}")

Why Python?

  • EconML (Microsoft), DoWhy (also from Microsoft Research, now part of the PyWhy project), and other causal inference libraries
  • Best tool for combining machine learning and causal inference

Case 3: Real-Time Data Acquisition and Analysis (Financial Economics)

Research Question: Daily tracking of stock market data, analyzing policy shocks

python
import yfinance as yf

# Download S&P 500 data
sp500 = yf.download("^GSPC", start="2020-01-01")

# Calculate daily returns
sp500['returns'] = sp500['Close'].pct_change()

# Event study: analyze the abnormal return around a policy date
# (use a trading day: 2020-03-15 was a Sunday, markets were closed)
event_date = "2020-03-16"
event_return = sp500.loc[event_date, 'returns']

# Benchmark: trailing 30-day mean return, shifted so the event day is excluded
normal_returns = sp500['returns'].rolling(30).mean().shift(1)
abnormal_return = event_return - normal_returns[event_date]

print(f"Event day abnormal return: {abnormal_return:.2%}")

Case 4: Network Analysis (Sociology)

Research Question: Analyze community structure in social networks

python
import networkx as nx
from networkx.algorithms import community

# Create social network graph
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4)])

# Detect communities
communities = community.greedy_modularity_communities(G)

# Calculate centrality
centrality = nx.betweenness_centrality(G)
print(f"Node importance: {centrality}")

Common Questions

Q1: I already know Stata/R, do I still need to learn Python?

Answer: Depends on your goals

  • If only doing traditional econometrics research → Stata is sufficient
  • If doing statistical methods research → R is powerful
  • If using machine learning, LLMs, or crossing over to tech industry → Must learn Python

Recommendation: The three are not substitutes for one another, but complements

  • Stata: Panel data, IV, DID and other econometric methods
  • R: Statistical inference, Bayesian analysis
  • Python: Machine learning, deep learning, LLMs, general programming

Real Data (based on 2024 Stack Overflow survey):

  • Data science positions requiring Python: 91%
  • Requiring R: 47%
  • Requiring Stata: 12%

Q2: Is Python difficult to learn?

Answer: Simpler than you think!

If you know Stata, you already understand:

  • Variable concepts (Stata's gen, replace)
  • Data frame concepts (Stata's dataset)
  • Loops and conditionals (Stata's foreach, if)

Python just uses different syntax to express the same logic.
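
For example, here are two everyday Stata operations and their pandas equivalents (the data file and variable names are hypothetical):

python
import numpy as np
import pandas as pd

df = pd.read_csv('survey_data.csv')

# Stata:  gen lnwage = ln(wage)
df['lnwage'] = np.log(df['wage'])

# Stata:  replace wage = . if wage < 0
df.loc[df['wage'] < 0, 'wage'] = np.nan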

Learning Curve Comparison (from zero to doing research):

  • Stata: ~4 weeks (simple syntax, but limited functionality)
  • R: ~6 weeks (powerful statistical functions, but inconsistent syntax)
  • Python: ~6-8 weeks (slightly harder initially, but high long-term returns)

Q3: How long does it take to learn Python?

Answer: Depends on goals

  • Data analysis basics (Pandas + statistics): 2-3 weeks
  • Machine learning intro (sklearn): Another 2-3 weeks
  • Deep learning/LLMs: Another 4-6 weeks

This tutorial aims to take you from zero to doing research with Python in 6-8 weeks.

Q4: Do top journals accept Python code?

Answer: Absolutely!

Code Policies of Mainstream Economics Journals (2024):

  • AER (American Economic Review): Accepts Python/R/Stata/Julia
  • QJE (Quarterly Journal of Economics): Accepts all mainstream languages
  • Econometrica: Accepts Python, requires reproducible code
  • JPE (Journal of Political Economy): Accepts Python

Trend: More and more top-journal papers use Python, especially those involving:

  • Machine learning methods
  • Text analysis
  • Network data
  • Real-time data acquisition

Q5: What are Python's disadvantages?

Answer: Honestly speaking, Python also has weaknesses

Compared to Stata:

  • Panel data commands are less concise than Stata's (though no less capable)
  • Regression output takes more manual formatting (Stata's esttab is more convenient)
  • Slightly steeper learning curve

Compared to R:

  • Statistical packages are not as comprehensive as R's (some frontier statistical methods land in R first)
  • plotnine, the Python port of ggplot2, is less mature than the original

But these gaps are closing quickly, and Python's advantages (machine learning, LLMs, general versatility) far outweigh them.


Learning Path of This Tutorial

Week 1-2: Python Basics → Can write simple scripts

Week 3-4: Pandas Data Analysis → Can replicate all Stata operations

Week 5-6: Statistical Modeling + Machine Learning → Can do regression, classification, clustering

Week 7-8: LLM APIs + Advanced Applications → Can call GPT, Claude for text analysis

Next Steps

In the next section, we will compare Python, Stata, and R syntax in detail to help you quickly build Python thinking patterns.

Ready? Let's begin!

Released under the MIT License. Content © Author.