1.1 Chapter Introduction (Regression Analysis & Python Applications)
"Essentially, all models are wrong, but some are useful."— George E. P. Box, Statistician
Master the core methods of Python regression analysis from scratch
Chapter Objectives
After completing this chapter, you will be able to:
- Conduct OLS linear regression analysis using Python
- Master binary dependent variable models like Logit/Probit
- Present regression tables in the style of top-tier journals
- Understand and interpret every metric in regression output
- Compare Python regression results with Stata/R
Why Start with Regression Analysis?
The Role of Regression Analysis in Social Sciences
Regression analysis is the cornerstone of empirical research in social sciences. Whether in economics, sociology, political science, or management, the vast majority of empirical papers use regression methods:
| Field | Typical Research Questions | Regression Type |
|---|---|---|
| Labor Economics | Returns to education, gender wage gap | OLS, Mincer equation |
| Development Economics | Poverty traps, economic growth factors | Panel regression, IV |
| Corporate Finance | Capital structure, corporate governance | Logit, fixed effects |
| Sociology | Social mobility, inequality | Multilevel regression, Logit |
| Political Science | Voting behavior, policy effects | Probit, DID |
Python vs Stata vs R: Why Choose Python?
| Feature | Python | Stata | R |
|---|---|---|---|
| Regression Capabilities | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Learning Curve | Medium | Easy | Steep |
| Versatility | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Data Processing | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Machine Learning | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ |
| Free & Open Source | ✅ | ❌ (Expensive) | ✅ |
| Community Support | Most active | Medium | Active |
Python's Unique Advantages:
- All-in-one solution: From data cleaning → statistical analysis → machine learning → web applications
- Career development: Python is the standard language for data science and AI
- Ecosystem: Perfect integration of pandas (data processing) + statsmodels (statistics) + scikit-learn (machine learning) + PyTorch (deep learning)
- Industry recognition: Python is the standard language across the tech industry, while Stata is used mostly in academia
Chapter Content Overview
Section 1: Python Regression Analysis Quick Start
Study Time: 30 minutes
- Run your first OLS regression in 5 minutes
- Core tool: Introduction to statsmodels library
- Python vs Stata vs R syntax comparison
- Understanding key metrics in regression output
You will learn:
```python
import statsmodels.api as sm

# Complete regression in 3 lines
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(data['wage'], X).fit()
print(model.summary())
```
Section 2: OLS Regression Explained
Study Time: 1.5 hours
- From simple to multiple regression
- Deep dive into regression output: R², F-statistic, t-statistic, p-value
- Case study: Researching returns to education (Mincer equation)
- Extracting regression results: coefficients, standard errors, confidence intervals
- Model diagnostics: residual analysis, multicollinearity
Core Concepts:
- Mincer Equation: The classic wage equation in economics
- Returns to Education: Each additional year of education increases wages by 8-12% on average
- Control Variables: How to include gender, city, industry, and other control variables
Practical Skills:
```python
# Extract regression coefficients
coef = model.params['education']
se = model.bse['education']
pvalue = model.pvalues['education']

# 95% confidence interval
conf_int = model.conf_int(alpha=0.05)

# Prediction
predictions = model.predict(new_data)
```
Section 3: Logit Regression - Binary Dependent Variable Models
Study Time: 1.5 hours
- When to use Logit/Probit?
- Mathematical principles of Logit model
- Case study: Factors affecting college admission
- Interpreting coefficients: log odds ratio vs marginal effects
- Comparison with Stata/R's `logit` command
Typical Application Scenarios:
| Research Question | Dependent Variable | Example Independent Variables |
|---|---|---|
| College attendance | Attend=1, Not=0 | Family income, parental education, SAT score |
| Employment | Employed=1, Unemployed=0 | Education, experience, gender, location |
| Default | Default=1, Not=0 | Credit score, income, debt ratio |
| Voting | Vote=1, Not=0 | Age, education, income, political orientation |
Core Skills:
```python
from statsmodels.formula.api import logit

# Logit regression
model = logit('admitted ~ gpa + sat + income', data=df).fit()

# Marginal effects (most important!)
marginal_effects = model.get_margeff()
print(marginal_effects.summary())

# Predict probabilities
prob = model.predict(new_data)
```
Marginal Effects Interpretation:
- Coefficients cannot be directly interpreted (because they are log odds)
- Marginal effects: How much the probability of dependent variable changes when independent variable changes by 1 unit
- Example: Each 0.1 increase in GPA raises admission probability by 5.2 percentage points
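The point that coefficients are log odds, not probability changes, can be seen with plain arithmetic. A sketch using made-up coefficients (not estimates from the chapter's dataset):

```python
import numpy as np

# Hypothetical logit coefficients, for illustration only:
# log-odds = -10.0 + 2.0*gpa + 0.002*sat
b0, b_gpa, b_sat = -10.0, 2.0, 0.002

def admit_prob(gpa, sat):
    """Predicted admission probability via the logistic function."""
    z = b0 + b_gpa * gpa + b_sat * sat
    return 1 / (1 + np.exp(-z))

p1 = admit_prob(3.0, 1200)
p2 = admit_prob(3.1, 1200)  # GPA up by 0.1
print(f"P(admit | gpa=3.0) = {p1:.3f}")
print(f"P(admit | gpa=3.1) = {p2:.3f}")
print(f"Change in probability: {p2 - p1:.3f}")
```

Note that the probability change depends on where you start on the logistic curve: unlike OLS, the same 0.1 GPA increase shifts the probability by different amounts for weak and strong applicants, which is why marginal effects are reported at the mean or averaged over the sample.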
Section 4: summary_col - Elegantly Comparing Multiple Models
Study Time: 1 hour
- Academic paper standard: Side-by-side presentation of multiple models
- Using `summary_col()` to generate regression tables
- Comparison with Stata's `esttab` and R's `stargazer`
- Custom output: significance asterisks, added statistics
Why Compare Models?
In top journals (AER, QJE, JPE), the standard practice is to present 3-6 progressive models:
| Model | Included Variables | Purpose |
|---|---|---|
| Model 1 | Only core explanatory variables | Show basic relationship |
| Model 2 | + Basic control variables | Control confounding factors |
| Model 3 | + More control variables | Robustness checks |
| Model 4 | + Fixed effects | Control unobserved heterogeneity |
Hands-on Example:
```python
from statsmodels.iolib.summary2 import summary_col

# Build 4 progressive models
model1 = sm.OLS(y, X1).fit()
model2 = sm.OLS(y, X2).fit()
model3 = sm.OLS(y, X3).fit()
model4 = sm.OLS(y, X4).fit()

# Generate comparison table with one command
table = summary_col([model1, model2, model3, model4],
                    stars=True,  # Add significance asterisks
                    float_format='%.3f',
                    model_names=['(1)', '(2)', '(3)', '(4)'],
                    info_dict={'N': lambda x: f"{int(x.nobs):,}",
                               'R-squared': lambda x: f"{x.rsquared:.3f}"})
print(table)
```
Output Effect:
```
================================================================
                  (1)          (2)          (3)          (4)
----------------------------------------------------------------
education     450.000***   380.000***   320.000***   310.000***
             (25.000)     (22.000)     (20.000)     (19.000)
experience                 50.000***    45.000***    42.000***
                          (5.000)      (4.800)      (4.500)
female                                -800.000***  -750.000***
                                      (100.000)    (95.000)
urban                                               400.000***
                                                   (80.000)
----------------------------------------------------------------
N             1,000        1,000        1,000        1,000
R-squared     0.450        0.520        0.580        0.600
================================================================
Standard errors in parentheses.
* p<0.1, ** p<0.05, *** p<0.01
```
Core Tools & Tech Stack
Essential Python Libraries
| Library | Purpose | Installation |
|---|---|---|
| pandas | Data processing | pip install pandas |
| numpy | Numerical computing | pip install numpy |
| statsmodels | Statistical modeling (chapter core) | pip install statsmodels |
| matplotlib | Data visualization | pip install matplotlib |
| seaborn | Advanced visualization | pip install seaborn |
statsmodels Core Functions
```python
import statsmodels.api as sm
from statsmodels.formula.api import ols, logit, probit

# 1. OLS regression (matrix form)
model = sm.OLS(y, X).fit()

# 2. OLS regression (formula form, similar to R)
model = ols('wage ~ education + experience', data=df).fit()

# 3. Logit regression
model = logit('admitted ~ gpa + sat', data=df).fit()

# 4. Probit regression
model = probit('admitted ~ gpa + sat', data=df).fit()

# 5. Weighted Least Squares (WLS)
model = sm.WLS(y, X, weights=weights).fit()

# 6. Generalized Linear Model (GLM)
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
```
Chapter Case Datasets
Case 1: Returns to Education Research (Mincer Equation)
- Sample Size: 1,000 observations
- Variables:
  - Dependent variable: `wage` (salary in dollars)
  - Core explanatory variable: `education` (years of education)
  - Control variables: `experience` (years of work experience), `female` (gender), `urban` (urban/rural)
- Research Question: How much does wage increase for each additional year of education?
Case 2: College Admission Determinants (Logit Model)
- Sample Size: 500 applicants
- Variables:
  - Dependent variable: `admitted` (admitted or not, 1=yes, 0=no)
  - Explanatory variables: `gpa` (GPA score), `sat` (SAT score), `extracurricular` (number of extracurricular activities), `income` (family income)
- Research Question: What factors affect college admission probability?
Case 3: Progressive Model Comparison (summary_col Application)
- Sample Size: 1,000 observations
- Model Comparison:
- Model 1: Only education variable
- Model 2: + Work experience
- Model 3: + Gender
- Model 4: + Urban/rural
- Purpose: Show how to present regression results like top-tier journals
Learning Path Recommendations
Recommended Study Sequence (Total 4-6 hours)
```
Section 1 (30 minutes)
        ↓
Run first regression, build confidence
        ↓
Section 2 (1.5 hours)
        ↓
Deep dive into OLS, master core skills
        ↓
Section 3 (1.5 hours)
        ↓
Learn Logit models, expand toolkit
        ↓
Section 4 (1 hour)
        ↓
Learn professional presentation of regression results
        ↓
Complete chapter!
```
Study Suggestions for Each Section
- Learn Theory First: Understand basic regression principles (15 minutes)
- Run Code: Execute all code examples on webpage (30 minutes)
- Modify Parameters: Try changing variables, sample sizes, observe result changes (15 minutes)
- Compare Stata/R: If you know Stata or R, compare syntax differences (15 minutes)
- Complete Exercises: Programming exercises at end of each section (30 minutes)
Prerequisites
Required Skills ✅
Python Basics:
- Variables, data types, lists, dictionaries
- Function definition and calls
- Conditionals (if/else) and loops (for/while)
- ⚠️ If these are unfamiliar, complete Module 1: Python Programming Introduction first
pandas Basics:
- Read CSV files: `pd.read_csv()`
- Data selection: `df['column']`, `df[['col1', 'col2']]`
- Data filtering: `df[df['age'] > 30]`
- 💡 Can be learned alongside this chapter by consulting the documentation
Statistics Basics (Recommended but Not Required)
- Understand concepts of mean, variance, standard deviation
- Know what correlation coefficient is
- Heard of "significance testing" and "p-value"
- ✅ Don't worry if you don't know: This chapter explains all statistical concepts in plain language
What You Don't Need to Know
- ❌ Advanced mathematics (linear algebra, calculus)
- ❌ Complex statistical theory
- ❌ Stata or R (but helpful for comparative learning if you do)
Learning Outcome Assessment
After completing this chapter, you should be able to:
Theoretical Understanding 📚
- [ ] Explain basic principles of OLS regression
- [ ] State when to use Logit/Probit instead of OLS
- [ ] Understand meaning of R², p-value, confidence intervals
- [ ] Know how to judge if regression coefficient is significant
Programming Skills 💻
- [ ] Conduct OLS regression using statsmodels
- [ ] Conduct Logit regression using statsmodels
- [ ] Extract regression coefficients, standard errors, p-values
- [ ] Calculate marginal effects (Logit models)
- [ ] Compare multiple models using summary_col
Practical Abilities 🎯
- [ ] Independently complete a full regression analysis project
- [ ] Present regression results like academic papers
- [ ] Interpret regression output and draw conclusions
- [ ] Identify common regression pitfalls (multicollinearity, endogeneity)
Next Steps in Learning
After completing this chapter, you can continue with:
- Module 2: Counterfactuals & RCTs - Learn the basic framework of causal inference
- Module 3: Data Cleaning & Variable Construction - Master data processing skills for empirical research
- Module 5: Advanced Regression Analysis - Learn advanced topics like heteroscedasticity, autocorrelation, endogeneity
- Module 9: Difference-in-Differences (DID) - Master the most commonly used causal identification method in top economics journals
References
Recommended Textbooks
Wooldridge, J. M. (2020). Introductory Econometrics: A Modern Approach (7th ed.). Cengage Learning.
- Most classic econometrics textbook, suitable for beginners
Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics. Princeton University Press.
- Must-read for empirical research, explains causal inference accessibly
Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics (4th ed.). Pearson.
- Another classic textbook with rich empirical cases
Python Official Documentation
- statsmodels Documentation: https://www.statsmodels.org/stable/index.html
- pandas Documentation: https://pandas.pydata.org/docs/
Classic Papers (Using Regression Analysis)
Mincer, J. (1974). "Schooling, Experience, and Earnings." NBER.
- Foundational work on returns to education research
Card, D., & Krueger, A. B. (1994). "Minimum Wages and Employment." American Economic Review.
- Uses DID method to study impact of minimum wage on employment
Study Recommendations
For Beginners
"Regression analysis looks complex, but the essence is simple: finding relationships between variables. Don't be intimidated by mathematical formulas—focus on understanding intuition; code will handle the calculations."
For Learners with Stata/R Experience
"Python's regression syntax is very similar to Stata's and R's. This chapter compares all three in detail to help you transition quickly, and statsmodels' output format closely resembles Stata's!"
For Students Wanting to Do Empirical Research
"After mastering this chapter's content, you'll have the skills to read 80% of economics and sociology papers. Regression analysis is the foundation of empirical research—no matter what research you do in the future, these skills won't become outdated."
Start Learning
Ready? Let's start with Section 1 and run your first Python regression model in 5 minutes!
Your statistical learning journey starts here!