5.1 Chapter Introduction (Foundations of Regression Analysis)
From Simple Regression to Multiple Regression: The Core Method in Social Science Research
Why is Regression Analysis So Important?
Core Questions in Social Science Research
Nearly all social science research seeks to answer similar questions:
| Discipline | Core Question | Regression Application |
|---|---|---|
| Economics | Effect of education on income? | log(wage) ~ education + experience |
| Sociology | Role of social capital in upward mobility? | mobility ~ social_capital + education + family_background |
| Political Science | Impact of democracy on economic growth? | gdp_growth ~ democracy + institutions + controls |
| Education | Effect of class size on student achievement? | test_score ~ class_size + teacher_quality + controls |
| Psychology | Impact of stress on mental health? | mental_health ~ stress + social_support + personality |
Commonality: All require regression analysis to establish relationships between variables
The Essence of Regression Analysis
From Correlation to Causation
Correlation:
```python
# Simple correlation (df is assumed to be a DataFrame holding the wage data)
corr = df['education'].corr(df['wage'])
# Problem: Cannot control for confounders, cannot infer causality
```
Regression Analysis:
```python
import statsmodels.api as sm

# Multiple regression
model = sm.OLS(df['log_wage'],
               sm.add_constant(df[['education', 'experience', 'ability']])).fit()
# Advantage: Controls for other factors, isolates net effects
```
Three Primary Uses of Regression
Describing Relationships (Descriptive)
- Quantifies the strength of relationships between variables
- Example: Each additional year of education is associated with 8% higher wages
Prediction
- Predicts dependent variables based on independent variables
- Example: Predicting future income based on individual characteristics (see the sketch after this list)
Causal Inference
- Infers causal relationships under certain assumptions
- Example: The causal effect of education on income
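As a minimal illustration of the prediction use, the `model` fitted in the block above can score new observations. A sketch, where the new individuals' values are hypothetical:
```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical new individuals: education, experience, ability
new_X = pd.DataFrame({'education': [12, 16],
                      'experience': [5, 10],
                      'ability': [0.0, 0.5]})
new_X = sm.add_constant(new_X, has_constant='add')  # match the fitted design
predicted_log_wage = model.predict(new_X)
```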
Learning Roadmap
Section 1: Simple Linear Regression
Core Concepts:
- Simple linear regression model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
- OLS estimation principle: minimizing the sum of squared residuals (closed-form solution below)
- Goodness of fit: the meaning of $R^2$
- Statistical inference: $t$ tests, confidence intervals
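For reference, minimizing the sum of squared residuals has a closed-form solution; the standard derivation gives:

$$
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2,
\qquad
\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}.
$$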
Python Implementation:
```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simple linear regression
X = sm.add_constant(df['education'])
y = df['wage']
model = sm.OLS(y, X).fit()

# Visualization: scatter plus fitted regression line
plt.scatter(df['education'], df['wage'], alpha=0.5)
plt.plot(df['education'], model.fittedvalues, 'r-', linewidth=2)
plt.xlabel('Education (years)')
plt.ylabel('Wage')
plt.title('Simple Linear Regression')
plt.show()
```
Classic Cases:
- Galton's height regression (origin of "regression")
- Mincer wage equation (foundation of education economics)
Section 2: Multiple Linear Regression
Core Concepts:
- Multiple regression model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
- Partial regression coefficients
- Multicollinearity
- Omitted variable bias (simulated below)
Mathematical Intuition:
Interpretation: $\beta_j$ is the net effect of $X_j$ on $Y$ after controlling for the other variables in the model. Omitting a relevant variable that is correlated with $X_j$ biases this estimate.
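To make omitted variable bias concrete, here is a minimal simulation sketch (all parameter values are made up for illustration): wages depend on both education and ability, ability also drives education, so dropping ability inflates the education coefficient.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)
n = 10_000
ability = rng.normal(size=n)
education = 2.0 * ability + rng.normal(size=n)           # ability raises education
log_wage = 0.08 * education + 0.5 * ability + rng.normal(size=n)

short = sm.OLS(log_wage, sm.add_constant(education)).fit()
full = sm.OLS(log_wage, sm.add_constant(np.column_stack([education, ability]))).fit()
print(short.params[1])  # biased upward: omits ability
print(full.params[1])   # close to the true 0.08
```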
Python Implementation:
```python
# Multiple regression with robust (HC3) standard errors
X = sm.add_constant(df[['education', 'experience', 'female', 'urban']])
model = sm.OLS(df['log_wage'], X).fit(cov_type='HC3')
print(model.summary())
```
Classic Cases:
- Wage determination equation (Mincer Equation)
- Cobb-Douglas production function (log-linearized below)
- Demand/Supply models
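The Cobb-Douglas case also shows why log transformations are so common in these models: taking logs turns a multiplicative production function into an equation that is linear in its parameters,

$$
Y = A K^{\alpha} L^{\beta}
\quad\Longrightarrow\quad
\log Y = \log A + \alpha \log K + \beta \log L + \varepsilon,
$$

so $\alpha$ and $\beta$ can be estimated by OLS and read directly as output elasticities of capital and labor.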
Section 3: Regression Diagnostics
Core Assumptions (the Gauss-Markov assumptions, plus normality for exact small-sample inference):
| Assumption | Meaning | Violation Consequence | Test Method |
|---|---|---|---|
| Linearity | $E[Y \mid X] = X\beta$ | Biased coefficients | RESET test |
| Exogeneity | $E[\varepsilon \mid X] = 0$ | Biased coefficients | Not directly testable |
| Homoskedasticity | $\mathrm{Var}(\varepsilon \mid X) = \sigma^2$ | Biased standard errors | Breusch-Pagan test |
| No autocorrelation | $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ | Biased standard errors | Durbin-Watson (DW) test |
| Normality | $\varepsilon \mid X \sim N(0, \sigma^2)$ | Affects small-sample inference | Shapiro-Wilk test |
Diagnostic Tools:
```python
# 1. Heteroskedasticity test (Breusch-Pagan)
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(model.resid, model.model.exog)

# 2. Multicollinearity (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# 3. Residual analysis: residuals vs. fitted, Q-Q plot, histogram, leverage
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
axes[0, 0].set_title('Residuals vs. Fitted')
sm.qqplot(model.resid, line='45', fit=True, ax=axes[0, 1])
axes[1, 0].hist(model.resid, bins=30)
axes[1, 0].set_title('Residual Histogram')
sm.graphics.influence_plot(model, ax=axes[1, 1])
plt.tight_layout()
plt.show()
```
Problem Solutions:
- Heteroskedasticity → Robust standard errors (HC3), WLS (sketch below)
- Multicollinearity → Remove variables, principal component regression
- Nonlinearity → Transformations, polynomials, splines
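A minimal sketch of the two heteroskedasticity remedies, assuming `y`, `X`, and `model` from above; the `var_hat` column of estimated error variances is hypothetical (WLS requires some model of the error variance):
```python
import statsmodels.api as sm

# Remedy 1: keep OLS point estimates, report HC3 robust standard errors
robust = sm.OLS(y, X).fit(cov_type='HC3')

# Remedy 2: weighted least squares, weighting by inverse error variance
# ('var_hat' is a hypothetical column of estimated variances)
wls = sm.WLS(y, X, weights=1.0 / df['var_hat']).fit()
```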
Section 4: Categorical Variables and Interaction Terms
Dummy Variable Trap:
```python
import pandas as pd

# Wrong: Including all dummy variables
dummies = pd.get_dummies(df['region'])  # 4 regions → 4 dummy variables
# Together with the constant term, this causes perfect collinearity

# Correct: Drop one reference category
dummies = pd.get_dummies(df['region'], drop_first=True)  # 3 dummy variables
```
Interaction Effects:
```python
import statsmodels.formula.api as smf

# Does the return to education vary by gender?
# The formula term education:female creates the interaction automatically
model = smf.ols('log_wage ~ education + female + education:female',
                data=df).fit()
# Interpretation:
#   Male return to education   = β₁
#   Female return to education = β₁ + β₃
```
Visualizing Interaction Effects:
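A minimal sketch, assuming the fitted `model` above and that `df` contains `education` and a 0/1 `female` indicator:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Plot education-wage curves for different genders
edu_grid = np.linspace(df['education'].min(), df['education'].max(), 50)
for female, label in [(0, 'Male'), (1, 'Female')]:
    pred = model.predict(pd.DataFrame({'education': edu_grid,
                                       'female': female}))
    plt.plot(edu_grid, pred, label=label)
plt.xlabel('Education (years)')
plt.ylabel('Predicted log(wage)')
plt.legend()
plt.show()
```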
Section 5: Interpretation and Reporting of Regression Results
The Art of Coefficient Interpretation:
| Model Form | Interpretation | Example |
|---|---|---|
| Level-Level: $Y = \beta_0 + \beta_1 X$ | $X$ increases by 1 unit, $Y$ increases by $\beta_1$ units | wage ~ education |
| Log-Level: $\log Y = \beta_0 + \beta_1 X$ | $X$ increases by 1 unit, $Y$ increases by about $100\beta_1$% (see note below) | log(wage) ~ education |
| Level-Log: $Y = \beta_0 + \beta_1 \log X$ | $X$ increases by 1%, $Y$ increases by about $\beta_1/100$ units | — |
| Log-Log: $\log Y = \beta_0 + \beta_1 \log X$ | $X$ increases by 1%, $Y$ increases by about $\beta_1$% (elasticity) | log(quantity) ~ log(price) |
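One caveat on the Log-Level row: $100\beta_1$% is only an approximation that works for small coefficients; the exact percentage change is $100(e^{\beta_1} - 1)$%. A quick check in Python, using an illustrative coefficient of 0.08:
```python
import numpy as np

beta = 0.08                    # log-level education coefficient (illustrative)
approx = 100 * beta            # approximate percent change: 8.0%
exact = 100 * np.expm1(beta)   # exact percent change: about 8.33%
print(approx, exact)
```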
Publication-Grade Tables:
```python
from statsmodels.iolib.summary2 import summary_col

# Multi-model comparison
results = summary_col([model1, model2, model3],
                      model_names=['(1)', '(2)', '(3)'],
                      stars=True)
print(results.as_latex())
```
A Complete Regression Report Should Include:
- Coefficient estimates and standard errors
- Significance markers (*** p<0.01, ** p<0.05, * p<0.1)
- Sample size, $R^2$, adjusted $R^2$
- $F$ statistic
- Standard error type (robust, clustered)
- Control variable specifications
Learning Objectives
After completing this chapter, you will be able to:
| Competency Dimension | Specific Objectives |
|---|---|
| Theoretical Understanding | Understand OLS estimation principles |
| | Master the Gauss-Markov assumptions |
| | Understand the meaning of partial regression coefficients |
| Python Implementation | Conduct OLS regression using statsmodels |
| | Perform comprehensive regression diagnostics |
| | Handle categorical variables and interaction terms |
| Results Interpretation | Correctly interpret regression coefficients |
| | Distinguish statistical significance from substantive significance |
| | Identify pitfalls in causal inference |
| Academic Writing | Produce publication-grade regression tables |
| | Write standardized reports of regression results |
Common Misconceptions in Regression Analysis
Misconception 1: Correlation Implies Causation
Wrong:
Regression shows ice cream sales are significantly positively correlated with drowning deaths
→ Conclusion: ice cream causes drowning
Correct:
- Confounding variable: summer temperature drives both
- Correlation ≠ causation
- Endogeneity and omitted variables must be considered
Misconception 2: A Higher $R^2$ is Always Better
Wrong: Blindly pursuing a high $R^2$
Correct:
- For prediction tasks: $R^2$ matters
- For causal inference: unbiasedness of the coefficients matters more
- A very high $R^2$ may indicate overfitting
Misconception 3: Statistical Significance = Practical Importance
Wrong:
p < 0.001, therefore the effect is very important
Correct:
- Statistical significance: the effect is distinguishable from zero
- Substantive significance: the effect size is meaningful
- With large samples, even tiny effects become significant (see the simulation below)
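A minimal simulation sketch of that last point, with a deliberately negligible effect size (all values made up for illustration):
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1_000_000
x = rng.normal(size=n)
y = 0.005 * x + rng.normal(size=n)    # tiny, substantively trivial effect

res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.params[1], res.pvalues[1])  # ~0.005, yet p is far below 0.001
```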
Misconception 4: More Control Variables is Better
Wrong: Including all possible variables
Correct:
- Only control for confounders
- Over-controlling can induce collider bias (see the sketch below)
- Avoid bad controls (mediators and outcome variables)
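A minimal collider sketch (made-up parameters): $x$ causes $y$, and both cause $c$; conditioning on $c$ distorts the coefficient on $x$.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)   # true effect of x on y is 1
c = x + y + rng.normal(size=n)     # collider: caused by both x and y

clean = sm.OLS(y, sm.add_constant(x)).fit()
biased = sm.OLS(y, sm.add_constant(np.column_stack([x, c]))).fit()
print(clean.params[1], biased.params[1])  # ~1.0 vs. ~0: the effect vanishes
```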
Classic Literature and Cases
Pioneering Research
Mincer (1974): "Schooling, Experience, and Earnings"
- Mincer wage equation
Card & Krueger (1994): "Minimum Wages and Employment"
- Impact of minimum wage on employment
- Difference-in-differences regression
Angrist & Krueger (1991): "Does Compulsory School Attendance Affect Schooling and Earnings?"
- Birth quarter as instrumental variable
- Causal return to education
Classic Textbooks
Wooldridge (2020): Introductory Econometrics (7th Edition)
- Econometrics bible
- Python example code
Angrist & Pischke (2009): Mostly Harmless Econometrics
- Causal inference perspective
- Practice-oriented
Stock & Watson (2020): Introduction to Econometrics (4th Edition)
- Clear and accessible
- Abundant case studies
Learning Recommendations
DO (Recommended Practices)
- Understand Before Operating: Don't just run code, understand the statistical principles
- Visualization First: Plot scatter plots before regression to intuitively understand relationships
- Diagnostics are Essential: Check assumptions with every regression
- Caution with Causality: Distinguish correlation from causation
- Standardized Reporting: Report results according to academic standards
DON'T (Avoid Pitfalls)
- Don't p-hack: Don't try models repeatedly until significant
- Don't Ignore Assumptions: Gauss-Markov assumptions matter
- Don't Over-interpret: Correlation does not equal causation
- Don't Forget Robust Standard Errors: Especially with cross-sectional data
- Don't Blindly Trust $R^2$: a high $R^2$ doesn't mean a good model
Chapter Datasets
| Dataset | Description | Source | Variables |
|---|---|---|---|
| wage_data.csv | Wage data (cross-sectional) | CPS | wage, education, experience, female |
| housing_prices.csv | Housing price data | Boston Housing | price, rooms, crime, distance |
| student_achievement.csv | Student achievement | PISA | test_score, class_size, teacher_qual |
| country_growth.csv | Country growth | Penn World Table | gdp_growth, investment, education |
Ready to Begin?
Regression analysis is the core method of social science research. It is:
- An extension of descriptive statistics
- An application of hypothesis testing
- A foundation for causal inference
- A starting point for predictive modeling
Mastering regression analysis will enable you to:
- Scientifically study relationships between variables
- Write standardized empirical research papers
- Conduct data-driven decision analysis
- Build a foundation for advanced econometric methods
Let's begin our journey into regression analysis!
Chapter File List
```
module-5_Regression Analysis/
├── 5.1-Chapter Introduction.md                    # This file
├── 5.2-Simple Linear Regression.md                # Simple linear regression
├── 5.3-Multiple Regression.md                     # Multiple linear regression
├── 5.4-Regression Diagnostics.md                  # Regression diagnostics
├── 5.5-Categorical Variables and Interactions.md  # Categorical variables and interaction terms
└── 5.6-Interpretation and Reporting.md            # Results interpretation and reporting
```
Estimated Learning Time: 20-24 hours
Difficulty Level: ⭐⭐⭐⭐ (requires statistics background)
Practicality: ⭐⭐⭐⭐⭐ (essential for social science research)
Next Section: 5.2 - Simple Linear Regression
Let's start with the simplest model!