1.1 Chapter Introduction (Regression Analysis & Python Applications)
"Essentially, all models are wrong, but some are useful."— George E. P. Box, Statistician
Master the core methods of Python regression analysis from scratch
Chapter Objectives
After completing this chapter, you will be able to:
- Conduct OLS linear regression analysis using Python
- Master binary dependent variable models like Logit/Probit
- Present regression tables in the style of top-tier journals
- Understand and interpret every metric in regression output
- Compare Python regression results with Stata/R
Why Start with Regression Analysis?
The Role of Regression Analysis in Social Sciences
Regression analysis is the cornerstone of empirical research in social sciences. Whether in economics, sociology, political science, or management, the vast majority of empirical papers use regression methods:
| Field | Typical Research Questions | Regression Type |
|---|---|---|
| Labor Economics | Returns to education, gender wage gap | OLS, Mincer equation |
| Development Economics | Poverty traps, economic growth factors | Panel regression, IV |
| Corporate Finance | Capital structure, corporate governance | Logit, fixed effects |
| Sociology | Social mobility, inequality | Multilevel regression, Logit |
| Political Science | Voting behavior, policy effects | Probit, DID |
Python vs Stata vs R: Why Choose Python?
| Feature | Python | Stata | R |
|---|---|---|---|
| Regression Capabilities | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Learning Curve | Medium | Easy | Steep |
| Versatility | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Data Processing | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Machine Learning | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ |
| Free & Open Source | ✅ | ❌ (Expensive) | ✅ |
| Community Support | Most active | Medium | Active |
Python's Unique Advantages:
- All-in-one solution: From data cleaning → statistical analysis → machine learning → web applications
- Career development: Python is the standard language for data science and AI
- Ecosystem: Perfect integration of pandas (data processing) + statsmodels (statistics) + scikit-learn (machine learning) + PyTorch (deep learning)
- Industry recognition: Python is the standard language across the tech industry, while Stata is used mostly in academia
Chapter Content Overview
Section 1: Python Regression Analysis Quick Start
Study Time: 30 minutes
- Run your first OLS regression in 5 minutes
- Core tool: Introduction to statsmodels library
- Python vs Stata vs R syntax comparison
- Understanding key metrics in regression output
You will learn:
```python
import statsmodels.api as sm

# Complete regression in 3 lines
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(data['wage'], X).fit()
print(model.summary())
```
Section 2: OLS Regression Explained
Study Time: 1.5 hours
- From simple to multiple regression
- Deep dive into regression output: R², F-statistic, t-statistic, p-value
- Case study: Researching returns to education (Mincer equation)
- Extracting regression results: coefficients, standard errors, confidence intervals
- Model diagnostics: residual analysis, multicollinearity
Core Concepts:
- Mincer Equation: The classic wage equation in economics
- Returns to Education: Each additional year of education increases wages by 8-12% on average
- Control Variables: How to include gender, city, industry, and other control variables
Practical Skills:
```python
# Extract regression coefficients
coef = model.params['education']
se = model.bse['education']
pvalue = model.pvalues['education']

# 95% confidence interval
conf_int = model.conf_int(alpha=0.05)

# Prediction
predictions = model.predict(new_data)
```
Section 3: Logit Regression - Binary Dependent Variable Models
Study Time: 1.5 hours
- When to use Logit/Probit?
- Mathematical principles of Logit model
- Case study: Factors affecting college admission
- Interpreting coefficients: log odds ratio vs marginal effects
- Comparison with Stata/R's `logit` command
Typical Application Scenarios:
| Research Question | Dependent Variable | Example Independent Variables |
|---|---|---|
| College attendance | Attend=1, Not=0 | Family income, parental education, SAT score |
| Employment | Employed=1, Unemployed=0 | Education, experience, gender, location |
| Default | Default=1, Not=0 | Credit score, income, debt ratio |
| Voting | Vote=1, Not=0 | Age, education, income, political orientation |
Core Skills:
```python
from statsmodels.formula.api import logit

# Logit regression
model = logit('admitted ~ gpa + sat + income', data=df).fit()

# Marginal effects (most important!)
marginal_effects = model.get_margeff()
print(marginal_effects.summary())

# Predict probabilities
prob = model.predict(new_data)
```
Marginal Effects Interpretation:
- Coefficients cannot be directly interpreted (because they are log odds)
- Marginal effects: How much the probability of dependent variable changes when independent variable changes by 1 unit
- Example: Each 0.1 increase in GPA raises admission probability by 5.2 percentage points
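The point that coefficients are log odds, not probability changes, can be seen with plain arithmetic. A sketch using made-up coefficients (not estimates from the chapter's dataset):

```python
import numpy as np

# Hypothetical logit coefficients, for illustration only:
# log-odds = -10.0 + 2.0*gpa + 0.002*sat
b0, b_gpa, b_sat = -10.0, 2.0, 0.002

def admit_prob(gpa, sat):
    """Predicted admission probability via the logistic function."""
    z = b0 + b_gpa * gpa + b_sat * sat
    return 1 / (1 + np.exp(-z))

p1 = admit_prob(3.0, 1200)
p2 = admit_prob(3.1, 1200)  # GPA up by 0.1
print(f"P(admit | gpa=3.0) = {p1:.3f}")
print(f"P(admit | gpa=3.1) = {p2:.3f}")
print(f"Change in probability: {p2 - p1:.3f}")
```

Note that the probability change depends on where you start on the logistic curve: unlike OLS, the same 0.1 GPA increase shifts the probability by different amounts for weak and strong applicants, which is why marginal effects are reported at the mean or averaged over the sample.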
Section 4: summary_col - Elegantly Comparing Multiple Models
Study Time: 1 hour
- Academic paper standard: Side-by-side presentation of multiple models
- Using `summary_col()` to generate regression tables
- Comparison with Stata's `esttab` and R's `stargazer`
- Custom output: significance asterisks, added statistics
Why Compare Models?
In top journals (AER, QJE, JPE), the standard practice is to present 3-6 progressive models:
| Model | Included Variables | Purpose |
|---|---|---|
| Model 1 | Only core explanatory variables | Show basic relationship |
| Model 2 | + Basic control variables | Control confounding factors |
| Model 3 | + More control variables | Robustness checks |
| Model 4 | + Fixed effects | Control unobserved heterogeneity |
Hands-on Example:
```python
from statsmodels.iolib.summary2 import summary_col

# Build 4 progressive models
model1 = sm.OLS(y, X1).fit()
model2 = sm.OLS(y, X2).fit()
model3 = sm.OLS(y, X3).fit()
model4 = sm.OLS(y, X4).fit()

# Generate comparison table with one command
table = summary_col([model1, model2, model3, model4],
                    stars=True,  # Add significance asterisks
                    float_format='%.3f',
                    model_names=['(1)', '(2)', '(3)', '(4)'],
                    info_dict={'N': lambda x: f"{int(x.nobs):,}",
                               'R-squared': lambda x: f"{x.rsquared:.3f}"})
print(table)
```
Output Effect:
```
================================================================
                  (1)          (2)          (3)          (4)
----------------------------------------------------------------
education     450.000***   380.000***   320.000***   310.000***
             (25.000)     (22.000)     (20.000)     (19.000)
experience                 50.000***    45.000***    42.000***
                          (5.000)      (4.800)      (4.500)
female                                -800.000***  -750.000***
                                      (100.000)    (95.000)
urban                                               400.000***
                                                   (80.000)
----------------------------------------------------------------
N             1,000        1,000        1,000        1,000
R-squared     0.450        0.520        0.580        0.600
================================================================
Standard errors in parentheses.
* p<0.1, ** p<0.05, *** p<0.01
```
Core Tools & Tech Stack
Essential Python Libraries
| Library | Purpose | Installation |
|---|---|---|
| pandas | Data processing | pip install pandas |
| numpy | Numerical computing | pip install numpy |
| statsmodels | Statistical modeling (chapter core) | pip install statsmodels |
| matplotlib | Data visualization | pip install matplotlib |
| seaborn | Advanced visualization | pip install seaborn |
statsmodels Core Functions
```python
import statsmodels.api as sm
from statsmodels.formula.api import ols, logit, probit

# 1. OLS regression (matrix form)
model = sm.OLS(y, X).fit()

# 2. OLS regression (formula form, similar to R)
model = ols('wage ~ education + experience', data=df).fit()

# 3. Logit regression
model = logit('admitted ~ gpa + sat', data=df).fit()

# 4. Probit regression
model = probit('admitted ~ gpa + sat', data=df).fit()

# 5. Weighted Least Squares (WLS)
model = sm.WLS(y, X, weights=weights).fit()

# 6. Generalized Linear Model (GLM)
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
```
Chapter Case Datasets
Case 1: Returns to Education Research (Mincer Equation)
- Sample Size: 1,000 observations
- Variables:
  - Dependent variable: `wage` (salary in dollars)
  - Core explanatory variable: `education` (years of education)
  - Control variables: `experience` (years of work experience), `female` (gender), `urban` (urban/rural)
- Research Question: How much does wage increase for each additional year of education?
Case 2: College Admission Determinants (Logit Model)
- Sample Size: 500 applicants
- Variables:
  - Dependent variable: `admitted` (admitted or not, 1=yes, 0=no)
  - Explanatory variables: `gpa` (GPA score), `sat` (SAT score), `extracurricular` (number of extracurricular activities), `income` (family income)
- Research Question: What factors affect college admission probability?
Case 3: Progressive Model Comparison (summary_col Application)
- Sample Size: 1,000 observations
- Model Comparison:
- Model 1: Only education variable
- Model 2: + Work experience
- Model 3: + Gender
- Model 4: + Urban/rural
- Purpose: Show how to present regression results like top-tier journals
Learning Path Recommendations
Recommended Study Sequence (Total 4-6 hours)
```
Section 1 (30 minutes)
        ↓
Run first regression, build confidence
        ↓
Section 2 (1.5 hours)
        ↓
Deep dive into OLS, master core skills
        ↓
Section 3 (1.5 hours)
        ↓
Learn Logit models, expand toolkit
        ↓
Section 4 (1 hour)
        ↓
Learn professional presentation of regression results
        ↓
Complete chapter!
```
Study Suggestions for Each Section
- Learn Theory First: Understand basic regression principles (15 minutes)
- Run Code: Execute all code examples on webpage (30 minutes)
- Modify Parameters: Try changing variables, sample sizes, observe result changes (15 minutes)
- Compare Stata/R: If you know Stata or R, compare syntax differences (15 minutes)
- Complete Exercises: Programming exercises at end of each section (30 minutes)
Prerequisites
Required Skills ✅
Python Basics:
- Variables, data types, lists, dictionaries
- Function definition and calls
- Conditionals (if/else) and loops (for/while)
- ⚠️ If these are unfamiliar, complete Module 1: Python Programming Introduction first
pandas Basics:
- Read CSV files: `pd.read_csv()`
- Data selection: `df['column']`, `df[['col1', 'col2']]`
- Data filtering: `df[df['age'] > 30]`
- 💡 Can be learned alongside this chapter by consulting the documentation
Statistics Basics (Recommended but Not Required)
- Understand concepts of mean, variance, standard deviation
- Know what correlation coefficient is
- Heard of "significance testing" and "p-value"
- ✅ Don't worry if you don't know: This chapter explains all statistical concepts in plain language
What You Don't Need to Know
- ❌ Advanced mathematics (linear algebra, calculus)
- ❌ Complex statistical theory
- ❌ Stata or R (but helpful for comparative learning if you do)
Learning Outcome Assessment
After completing this chapter, you should be able to:
Theoretical Understanding 📚
- [ ] Explain basic principles of OLS regression
- [ ] State when to use Logit/Probit instead of OLS
- [ ] Understand meaning of R², p-value, confidence intervals
- [ ] Know how to judge if regression coefficient is significant
Programming Skills 💻
- [ ] Conduct OLS regression using statsmodels
- [ ] Conduct Logit regression using statsmodels
- [ ] Extract regression coefficients, standard errors, p-values
- [ ] Calculate marginal effects (Logit models)
- [ ] Compare multiple models using summary_col
Practical Abilities 🎯
- [ ] Independently complete a full regression analysis project
- [ ] Present regression results like academic papers
- [ ] Interpret regression output and draw conclusions
- [ ] Identify common regression pitfalls (multicollinearity, endogeneity)
Next Steps in Learning
After completing this chapter, you can continue with:
- Module 2: Counterfactuals & RCTs - Learn the basic framework of causal inference
- Module 3: Data Cleaning & Variable Construction - Master data processing skills for empirical research
- Module 5: Advanced Regression Analysis - Learn advanced topics like heteroscedasticity, autocorrelation, endogeneity
- Module 9: Difference-in-Differences (DID) - Master the most commonly used causal identification method in top economics journals
References
Recommended Textbooks
Wooldridge, J. M. (2020). Introductory Econometrics: A Modern Approach (7th ed.). Cengage Learning.
- Most classic econometrics textbook, suitable for beginners
Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics. Princeton University Press.
- Must-read for empirical research, explains causal inference accessibly
Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics (4th ed.). Pearson.
- Another classic textbook with rich empirical cases
Python Official Documentation
- statsmodels Documentation: https://www.statsmodels.org/stable/index.html
- pandas Documentation: https://pandas.pydata.org/docs/
Classic Papers (Using Regression Analysis)
Mincer, J. (1974). "Schooling, Experience, and Earnings." NBER.
- Foundational work on returns to education research
Card, D., & Krueger, A. B. (1994). "Minimum Wages and Employment." American Economic Review.
- Uses DID method to study impact of minimum wage on employment
Study Recommendations
For Beginners
"Regression analysis looks complex, but the essence is simple: finding relationships between variables. Don't be intimidated by mathematical formulas—focus on understanding intuition; code will handle the calculations."
For Learners with Stata/R Experience
"Python's regression syntax is very similar to Stata's and R's. This chapter compares all three in detail to help you transition quickly, and statsmodels' output format closely resembles Stata's!"
For Students Wanting to Do Empirical Research
"After mastering this chapter's content, you'll have the skills to read 80% of economics and sociology papers. Regression analysis is the foundation of empirical research—no matter what research you do in the future, these skills won't become outdated."
Start Learning
Ready? Let's start with Section 1 and run your first Python regression model in 5 minutes!
Your statistical learning journey starts here!