
1.1 Chapter Introduction (Regression Analysis & Python Applications)

"Essentially, all models are wrong, but some are useful."— George E. P. Box, Statistician

Master the core methods of Python regression analysis from scratch



Chapter Objectives

After completing this chapter, you will be able to:

  • Conduct OLS linear regression analysis using Python
  • Master binary dependent variable models like Logit/Probit
  • Present regression results tables like top-tier journals
  • Understand and interpret every metric in regression output
  • Compare Python regression results with Stata/R

Why Start with Regression Analysis?

The Role of Regression Analysis in Social Sciences

Regression analysis is the cornerstone of empirical research in social sciences. Whether in economics, sociology, political science, or management, the vast majority of empirical papers use regression methods:

| Field | Typical Research Questions | Regression Type |
|---|---|---|
| Labor Economics | Returns to education, gender wage gap | OLS, Mincer equation |
| Development Economics | Poverty traps, economic growth factors | Panel regression, IV |
| Corporate Finance | Capital structure, corporate governance | Logit, fixed effects |
| Sociology | Social mobility, inequality | Multilevel regression, Logit |
| Political Science | Voting behavior, policy effects | Probit, DID |

Python vs Stata vs R: Why Choose Python?

| Feature | Python | Stata | R |
|---|---|---|---|
| Regression Capabilities | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Learning Curve | Medium | Easy | Steep |
| Versatility | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Data Processing | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Machine Learning | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ |
| Free & Open Source | ✅ | ❌ (Expensive) | ✅ |
| Community Support | Most active | Medium | Active |

Python's Unique Advantages:

  • All-in-one solution: From data cleaning → statistical analysis → machine learning → web applications
  • Career development: Python is the standard language for data science and AI
  • Ecosystem: Perfect integration of pandas (data processing) + statsmodels (statistics) + scikit-learn (machine learning) + PyTorch (deep learning)
  • Industry recognition: Python is ubiquitous at tech companies, while Stata is rarely used outside academia

Chapter Content Overview

Section 1: Python Regression Analysis Quick Start

Study Time: 30 minutes

  • Run your first OLS regression in 5 minutes
  • Core tool: Introduction to statsmodels library
  • Python vs Stata vs R syntax comparison
  • Understanding key metrics in regression output

You will learn:

```python
import statsmodels.api as sm

# Complete regression in 3 lines
X = sm.add_constant(data[['education', 'experience']])
model = sm.OLS(data['wage'], X).fit()
print(model.summary())
```

Section 2: OLS Regression Explained

Study Time: 1.5 hours

  • From simple to multiple regression
  • Deep dive into regression output: R², F-statistic, t-statistic, p-value
  • Case study: Researching returns to education (Mincer equation)
  • Extracting regression results: coefficients, standard errors, confidence intervals
  • Model diagnostics: residual analysis, multicollinearity

Core Concepts:

  • Mincer Equation: the classic wage equation in economics
  • Returns to Education: each additional year of education raises wages by 8-12% on average
  • Control Variables: how to include gender, city, industry, and other controls
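The Mincer specification regresses log wages on schooling plus a quadratic in experience. As a minimal sketch of what Section 2 covers (the data below are simulated for illustration, and the variable names and coefficients are assumptions, not the chapter's actual dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
educ = rng.integers(8, 21, n)    # years of schooling
exper = rng.integers(0, 31, n)   # years of work experience

# Simulate log wages with a true return to schooling of 10%
log_wage = (1.0 + 0.10 * educ + 0.04 * exper
            - 0.0007 * exper**2 + rng.normal(0, 0.3, n))
df = pd.DataFrame({'log_wage': log_wage, 'educ': educ, 'exper': exper})

# Mincer specification: ln(wage) ~ schooling + experience + experience^2
model = smf.ols('log_wage ~ educ + exper + I(exper**2)', data=df).fit()
print(model.params['educ'])  # estimate should be close to the true 0.10
```

Because the dependent variable is in logs, the schooling coefficient reads directly as an approximate percentage return per year of education.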

Practical Skills:

```python
# Extract regression coefficients
coef = model.params['education']
se = model.bse['education']
pvalue = model.pvalues['education']

# 95% confidence interval
conf_int = model.conf_int(alpha=0.05)

# Prediction
predictions = model.predict(new_data)
```

Section 3: Logit Regression - Binary Dependent Variable Models

Study Time: 1.5 hours

  • When to use Logit/Probit?
  • Mathematical principles of Logit model
  • Case study: Factors affecting college admission
  • Interpreting coefficients: log odds ratio vs marginal effects
  • Comparison with Stata/R's logit command

Typical Application Scenarios:

| Research Question | Dependent Variable | Example Independent Variables |
|---|---|---|
| College attendance | Attend=1, Not=0 | Family income, parental education, SAT score |
| Employment | Employed=1, Unemployed=0 | Education, experience, gender, location |
| Default | Default=1, Not=0 | Credit score, income, debt ratio |
| Voting | Vote=1, Not=0 | Age, education, income, political orientation |

Core Skills:

```python
from statsmodels.formula.api import logit

# Logit regression
model = logit('admitted ~ gpa + sat + income', data=df).fit()

# Marginal effects (most important!)
marginal_effects = model.get_margeff()
print(marginal_effects.summary())

# Predict probabilities
prob = model.predict(new_data)
```

Marginal Effects Interpretation:

  • Coefficients cannot be interpreted directly as probability changes (they are on the log-odds scale)
  • Marginal effects: the change in the outcome's probability when an independent variable increases by one unit
  • Example: each 0.1 increase in GPA raises admission probability by 5.2 percentage points
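To make the distinction concrete, here is a small sketch on simulated admission data (the true coefficient and variable names are illustrative assumptions): exponentiating a logit coefficient gives an odds ratio, while get_margeff() gives the average marginal effect on the probability scale.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

rng = np.random.default_rng(1)
n = 500
gpa = rng.uniform(2.0, 4.0, n)
# Simulate admissions with true log-odds = -6 + 2*gpa
p = 1 / (1 + np.exp(-(-6 + 2 * gpa)))
admitted = rng.binomial(1, p)
df = pd.DataFrame({'admitted': admitted, 'gpa': gpa})

model = logit('admitted ~ gpa', data=df).fit(disp=0)

# The raw coefficient is on the log-odds scale; exponentiate for an odds ratio
odds_ratio = np.exp(model.params['gpa'])

# Average marginal effect: probability change per unit of GPA
ame = model.get_margeff().margeff[0]
print(odds_ratio, ame)
```

Note that the odds ratio is a multiplicative effect on the odds, while the marginal effect is an additive effect on the probability; papers usually report the latter because it is easier to interpret.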

Section 4: summary_col - Elegantly Comparing Multiple Models

Study Time: 1 hour

  • Academic paper standard: Side-by-side presentation of multiple models
  • Using summary_col() to generate regression tables
  • Comparison with Stata's esttab and R's stargazer
  • Custom output: Significance asterisks, adding statistics

Why Compare Models?

In top journals (AER, QJE, JPE), the standard practice is to present 3-6 progressive models:

| Model | Included Variables | Purpose |
|---|---|---|
| Model 1 | Only core explanatory variables | Show basic relationship |
| Model 2 | + Basic control variables | Control confounding factors |
| Model 3 | + More control variables | Robustness checks |
| Model 4 | + Fixed effects | Control unobserved heterogeneity |

Hands-on Example:

```python
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

# Build 4 progressive models
model1 = sm.OLS(y, X1).fit()
model2 = sm.OLS(y, X2).fit()
model3 = sm.OLS(y, X3).fit()
model4 = sm.OLS(y, X4).fit()

# Generate comparison table with one command
table = summary_col([model1, model2, model3, model4],
                    stars=True,  # Add significance asterisks
                    float_format='%.3f',
                    model_names=['(1)', '(2)', '(3)', '(4)'],
                    info_dict={'N': lambda x: f"{int(x.nobs):,}",
                               'R-squared': lambda x: f"{x.rsquared:.3f}"})
print(table)
```

Output Effect:

```
================================================================
                   (1)        (2)        (3)        (4)
----------------------------------------------------------------
education       450.000*** 380.000*** 320.000*** 310.000***
                (25.000)   (22.000)   (20.000)   (19.000)
experience                  50.000***  45.000***  42.000***
                            (5.000)    (4.800)    (4.500)
female                                -800.000*** -750.000***
                                      (100.000)   (95.000)
urban                                             400.000***
                                                  (80.000)
----------------------------------------------------------------
N                1,000      1,000      1,000      1,000
R-squared        0.450      0.520      0.580      0.600
================================================================
Standard errors in parentheses.
* p<0.1, ** p<0.05, *** p<0.01
```

Core Tools & Tech Stack

Essential Python Libraries

| Library | Purpose | Installation |
|---|---|---|
| pandas | Data processing | pip install pandas |
| numpy | Numerical computing | pip install numpy |
| statsmodels | Statistical modeling (chapter core) | pip install statsmodels |
| matplotlib | Data visualization | pip install matplotlib |
| seaborn | Advanced visualization | pip install seaborn |

statsmodels Core Functions

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols, logit, probit

# 1. OLS regression (matrix form)
model = sm.OLS(y, X).fit()

# 2. OLS regression (formula form, similar to R)
model = ols('wage ~ education + experience', data=df).fit()

# 3. Logit regression
model = logit('admitted ~ gpa + sat', data=df).fit()

# 4. Probit regression
model = probit('admitted ~ gpa + sat', data=df).fit()

# 5. Weighted Least Squares (WLS)
model = sm.WLS(y, X, weights=weights).fit()

# 6. Generalized Linear Model (GLM)
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
```

Chapter Case Datasets

Case 1: Returns to Education Research (Mincer Equation)

  • Sample Size: 1,000 observations
  • Variables:
    • Dependent variable: wage (salary in dollars)
    • Core explanatory variable: education (years of education)
    • Control variables: experience (years of work experience), female (gender), urban (urban/rural)
  • Research Question: How much does wage increase for each additional year of education?

Case 2: College Admission Determinants (Logit Model)

  • Sample Size: 500 applicants
  • Variables:
    • Dependent variable: admitted (admitted or not, 1=yes, 0=no)
    • Explanatory variables: gpa (GPA score), sat (SAT score), extracurricular (number of extracurricular activities), income (family income)
  • Research Question: What factors affect college admission probability?
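As a sketch of how a fitted admission model is used for prediction (the data below are simulated under assumed coefficients, not the chapter's actual dataset):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({'gpa': rng.uniform(2.0, 4.0, n),
                   'sat': rng.integers(1000, 1601, n)})
# Simulate admissions with assumed true log-odds = -12 + 2*gpa + 0.003*sat
logodds = -12 + 2 * df['gpa'] + 0.003 * df['sat']
df['admitted'] = rng.binomial(1, 1 / (1 + np.exp(-logodds)))

model = logit('admitted ~ gpa + sat', data=df).fit(disp=0)

# Predicted admission probability for a hypothetical applicant
applicant = pd.DataFrame({'gpa': [3.5], 'sat': [1400]})
prob = model.predict(applicant)
print(prob)
```

Unlike OLS, predict() here returns a probability between 0 and 1, which is exactly why Logit is the right tool for a binary outcome.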

Case 3: Progressive Model Comparison (summary_col Application)

  • Sample Size: 1,000 observations
  • Model Comparison:
    • Model 1: Only education variable
    • Model 2: + Work experience
    • Model 3: + Gender
    • Model 4: + Urban/rural
  • Purpose: Show how to present regression results like top-tier journals

Learning Path Recommendations

Section 1 (30 minutes)

Run first regression, build confidence

Section 2 (1.5 hours)

Deep dive into OLS, master core skills

Section 3 (1.5 hours)

Learn Logit models, expand toolkit

Section 4 (1 hour)

Learn professional presentation of regression results

Complete chapter!

Study Suggestions for Each Section

  1. Learn Theory First: Understand basic regression principles (15 minutes)
  2. Run Code: Execute all code examples on webpage (30 minutes)
  3. Modify Parameters: Try changing variables, sample sizes, observe result changes (15 minutes)
  4. Compare Stata/R: If you know Stata or R, compare syntax differences (15 minutes)
  5. Complete Exercises: Programming exercises at end of each section (30 minutes)

Prerequisites

Required Skills ✅

  • Python Basics
  • pandas Basics:

    • Read CSV files: pd.read_csv()
    • Data selection: df['column'], df[['col1', 'col2']]
    • Data filtering: df[df['age'] > 30]
    • 💡 Can learn while referencing documentation if unfamiliar
  • Statistics Basics:

    • Understand the concepts of mean, variance, and standard deviation
    • Know what a correlation coefficient is
    • Have heard of "significance testing" and "p-values"
    • 💡 Don't worry if you haven't: this chapter explains all statistical concepts in plain language
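If the pandas operations listed above are unfamiliar, this minimal example covers everything this chapter assumes (the column names are illustrative):

```python
import pandas as pd

# The pandas operations used throughout this chapter
df = pd.DataFrame({'wage': [3000, 5200, 4100],
                   'education': [12, 16, 14],
                   'age': [25, 35, 45]})

col = df['wage']                  # select one column (a Series)
sub = df[['wage', 'education']]   # select several columns (a DataFrame)
older = df[df['age'] > 30]        # filter rows by a condition
print(older)
```

In practice you would build df with pd.read_csv() instead of a dictionary; the selection and filtering syntax is identical.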

What You Don't Need to Know

  • ❌ Advanced mathematics (linear algebra, calculus)
  • ❌ Complex statistical theory
  • ❌ Stata or R (but helpful for comparative learning if you do)

Learning Outcome Assessment

After completing this chapter, you should be able to:

Theoretical Understanding 📚

  • [ ] Explain basic principles of OLS regression
  • [ ] State when to use Logit/Probit instead of OLS
  • [ ] Understand meaning of R², p-value, confidence intervals
  • [ ] Know how to judge if regression coefficient is significant

Programming Skills 💻

  • [ ] Conduct OLS regression using statsmodels
  • [ ] Conduct Logit regression using statsmodels
  • [ ] Extract regression coefficients, standard errors, p-values
  • [ ] Calculate marginal effects (Logit models)
  • [ ] Compare multiple models using summary_col

Practical Abilities 🎯

  • [ ] Independently complete a full regression analysis project
  • [ ] Present regression results like academic papers
  • [ ] Interpret regression output and draw conclusions
  • [ ] Identify common regression pitfalls (multicollinearity, endogeneity)

Next Steps in Learning

After completing this chapter, you can continue with:

  1. Module 2: Counterfactuals & RCTs - Learn the basic framework of causal inference
  2. Module 3: Data Cleaning & Variable Construction - Master data processing skills for empirical research
  3. Module 5: Advanced Regression Analysis - Learn advanced topics like heteroscedasticity, autocorrelation, endogeneity
  4. Module 9: Difference-in-Differences (DID) - Master the most commonly used causal identification method in top economics journals

References

  1. Wooldridge, J. M. (2020). Introductory Econometrics: A Modern Approach (7th ed.). Cengage Learning.

    • Most classic econometrics textbook, suitable for beginners
  2. Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics. Princeton University Press.

    • Must-read for empirical research, explains causal inference accessibly
  3. Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics (4th ed.). Pearson.

    • Another classic textbook with rich empirical cases


Classic Papers (Using Regression Analysis)

  1. Mincer, J. (1974). "Schooling, Experience, and Earnings." NBER.

    • Foundational work on returns to education research
  2. Card, D., & Krueger, A. B. (1994). "Minimum Wages and Employment." American Economic Review.

    • Uses DID method to study impact of minimum wage on employment

Study Recommendations

For Beginners

"Regression analysis looks complex, but the essence is simple: finding relationships between variables. Don't be intimidated by mathematical formulas—focus on understanding intuition; code will handle the calculations."

For Learners with Stata/R Experience

"Python's regression syntax is very similar to Stata/R. This chapter compares all three in detail to help you transition quickly. statsmodels output format is almost identical to Stata!"

For Students Wanting to Do Empirical Research

"After mastering this chapter's content, you'll have the skills to read 80% of economics and sociology papers. Regression analysis is the foundation of empirical research—no matter what research you do in the future, these skills won't become outdated."


Start Learning

Ready? Let's start with Section 1 and run your first Python regression model in 5 minutes!


Your statistical learning journey starts here!

Released under the MIT License. Content © Author.