5.1 Chapter Introduction (Foundations of Regression Analysis)
From Simple Regression to Multiple Regression: The Core Method in Social Science Research
Why is Regression Analysis So Important?
Core Questions in Social Science Research
Nearly all social science research seeks to answer similar questions:
| Discipline | Core Question | Regression Application |
|---|---|---|
| Economics | Effect of education on income? | log(wage) ~ education + experience |
| Sociology | Role of social capital in upward mobility? | mobility ~ social_capital + education + family_background |
| Political Science | Impact of democracy on economic growth? | gdp_growth ~ democracy + institutions + controls |
| Education | Effect of class size on student achievement? | test_score ~ class_size + teacher_quality + controls |
| Psychology | Impact of stress on mental health? | mental_health ~ stress + social_support + personality |
Commonality: All require regression analysis to establish relationships between variables
The Essence of Regression Analysis
From Correlation to Causation
Correlation:
```python
# Simple correlation (df is assumed to be a DataFrame holding the wage data)
corr = df['education'].corr(df['wage'])
# Problem: Cannot control for confounders, cannot infer causality
```
Regression Analysis:
```python
import statsmodels.api as sm

# Multiple regression
model = sm.OLS(df['log_wage'],
               sm.add_constant(df[['education', 'experience', 'ability']])).fit()
# Advantage: Controls for other factors, isolates net effects
```
Three Primary Uses of Regression
Describing Relationships (Descriptive)
- Quantifies the strength of relationships between variables
- Example: Each additional year of education is associated with 8% higher wages
Prediction
- Predicts dependent variables based on independent variables
- Example: Predicting future income based on individual characteristics (see the sketch after this list)
Causal Inference
- Infers causal relationships under certain assumptions
- Example: The causal effect of education on income
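As a minimal illustration of the prediction use, the `model` fitted in the block above can score new observations. A sketch, where the new individuals' values are hypothetical:
```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical new individuals: education, experience, ability
new_X = pd.DataFrame({'education': [12, 16],
                      'experience': [5, 10],
                      'ability': [0.0, 0.5]})
new_X = sm.add_constant(new_X, has_constant='add')  # match the fitted design
predicted_log_wage = model.predict(new_X)
```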
Learning Roadmap
Section 1: Simple Linear Regression
Core Concepts:
- Simple linear regression model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
- OLS estimation principle: minimizing the sum of squared residuals (closed-form solution below)
- Goodness of fit: the meaning of $R^2$
- Statistical inference: $t$ tests, confidence intervals
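For reference, minimizing the sum of squared residuals has a closed-form solution; the standard derivation gives:

$$
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2,
\qquad
\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}.
$$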
Python Implementation:
```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simple linear regression
X = sm.add_constant(df['education'])
y = df['wage']
model = sm.OLS(y, X).fit()

# Visualization: scatter plus fitted regression line
plt.scatter(df['education'], df['wage'], alpha=0.5)
plt.plot(df['education'], model.fittedvalues, 'r-', linewidth=2)
plt.xlabel('Education (years)')
plt.ylabel('Wage')
plt.title('Simple Linear Regression')
plt.show()
```
Classic Cases:
- Galton's height regression (origin of "regression")
- Mincer wage equation (foundation of education economics)
Section 2: Multiple Linear Regression
Core Concepts:
- Multiple regression model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
- Partial regression coefficients
- Multicollinearity
- Omitted variable bias (simulated below)
Mathematical Intuition:
Interpretation: $\beta_j$ is the net effect of $X_j$ on $Y$ after controlling for the other variables in the model. Omitting a relevant variable that is correlated with $X_j$ biases this estimate.
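To make omitted variable bias concrete, here is a minimal simulation sketch (all parameter values are made up for illustration): wages depend on both education and ability, ability also drives education, so dropping ability inflates the education coefficient.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)
n = 10_000
ability = rng.normal(size=n)
education = 2.0 * ability + rng.normal(size=n)           # ability raises education
log_wage = 0.08 * education + 0.5 * ability + rng.normal(size=n)

short = sm.OLS(log_wage, sm.add_constant(education)).fit()
full = sm.OLS(log_wage, sm.add_constant(np.column_stack([education, ability]))).fit()
print(short.params[1])  # biased upward: omits ability
print(full.params[1])   # close to the true 0.08
```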
Python Implementation:
```python
# Multiple regression with robust (HC3) standard errors
X = sm.add_constant(df[['education', 'experience', 'female', 'urban']])
model = sm.OLS(df['log_wage'], X).fit(cov_type='HC3')
print(model.summary())
```
Classic Cases:
- Wage determination equation (Mincer Equation)
- Cobb-Douglas production function (log-linearized below)
- Demand/Supply models
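The Cobb-Douglas case also shows why log transformations are so common in these models: taking logs turns a multiplicative production function into an equation that is linear in its parameters,

$$
Y = A K^{\alpha} L^{\beta}
\quad\Longrightarrow\quad
\log Y = \log A + \alpha \log K + \beta \log L + \varepsilon,
$$

so $\alpha$ and $\beta$ can be estimated by OLS and read directly as output elasticities of capital and labor.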
Section 3: Regression Diagnostics
Core Assumptions (the Gauss-Markov assumptions, plus normality for exact small-sample inference):
| Assumption | Meaning | Violation Consequence | Test Method |
|---|---|---|---|
| Linearity | $E[Y \mid X] = X\beta$ | Biased coefficients | RESET test |
| Exogeneity | $E[\varepsilon \mid X] = 0$ | Biased coefficients | Not directly testable |
| Homoskedasticity | $\mathrm{Var}(\varepsilon \mid X) = \sigma^2$ | Biased standard errors | Breusch-Pagan test |
| No autocorrelation | $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ | Biased standard errors | Durbin-Watson (DW) test |
| Normality | $\varepsilon \mid X \sim N(0, \sigma^2)$ | Affects small-sample inference | Shapiro-Wilk test |
Diagnostic Tools:
```python
# 1. Heteroskedasticity test (Breusch-Pagan)
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(model.resid, model.model.exog)

# 2. Multicollinearity (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# 3. Residual analysis: residuals vs. fitted, Q-Q plot, histogram, leverage
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
axes[0, 0].set_title('Residuals vs. Fitted')
sm.qqplot(model.resid, line='45', fit=True, ax=axes[0, 1])
axes[1, 0].hist(model.resid, bins=30)
axes[1, 0].set_title('Residual Histogram')
sm.graphics.influence_plot(model, ax=axes[1, 1])
plt.tight_layout()
plt.show()
```
Problem Solutions:
- Heteroskedasticity → Robust standard errors (HC3), WLS (sketch below)
- Multicollinearity → Remove variables, principal component regression
- Nonlinearity → Transformations, polynomials, splines
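A minimal sketch of the two heteroskedasticity remedies, assuming `y`, `X`, and `model` from above; the `var_hat` column of estimated error variances is hypothetical (WLS requires some model of the error variance):
```python
import statsmodels.api as sm

# Remedy 1: keep OLS point estimates, report HC3 robust standard errors
robust = sm.OLS(y, X).fit(cov_type='HC3')

# Remedy 2: weighted least squares, weighting by inverse error variance
# ('var_hat' is a hypothetical column of estimated variances)
wls = sm.WLS(y, X, weights=1.0 / df['var_hat']).fit()
```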
Section 4: Categorical Variables and Interaction Terms
Dummy Variable Trap:
```python
import pandas as pd

# Wrong: Including all dummy variables
dummies = pd.get_dummies(df['region'])  # 4 regions → 4 dummy variables
# Together with the constant term, this causes perfect collinearity

# Correct: Drop one reference category
dummies = pd.get_dummies(df['region'], drop_first=True)  # 3 dummy variables
```
Interaction Effects:
```python
import statsmodels.formula.api as smf

# Does the return to education vary by gender?
# The formula term education:female creates the interaction automatically
model = smf.ols('log_wage ~ education + female + education:female',
                data=df).fit()
# Interpretation:
#   Male return to education   = β₁
#   Female return to education = β₁ + β₃
```
Visualizing Interaction Effects:
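A minimal sketch, assuming the fitted `model` above and that `df` contains `education` and a 0/1 `female` indicator:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Plot education-wage curves for different genders
edu_grid = np.linspace(df['education'].min(), df['education'].max(), 50)
for female, label in [(0, 'Male'), (1, 'Female')]:
    pred = model.predict(pd.DataFrame({'education': edu_grid,
                                       'female': female}))
    plt.plot(edu_grid, pred, label=label)
plt.xlabel('Education (years)')
plt.ylabel('Predicted log(wage)')
plt.legend()
plt.show()
```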
Section 5: Interpretation and Reporting of Regression Results
The Art of Coefficient Interpretation:
| Model Form | Interpretation | Example |
|---|---|---|
| Level-Level: $Y = \beta_0 + \beta_1 X$ | $X$ increases by 1 unit, $Y$ increases by $\beta_1$ units | wage ~ education |
| Log-Level: $\log Y = \beta_0 + \beta_1 X$ | $X$ increases by 1 unit, $Y$ increases by about $100\beta_1$% (see note below) | log(wage) ~ education |
| Level-Log: $Y = \beta_0 + \beta_1 \log X$ | $X$ increases by 1%, $Y$ increases by about $\beta_1/100$ units | — |
| Log-Log: $\log Y = \beta_0 + \beta_1 \log X$ | $X$ increases by 1%, $Y$ increases by about $\beta_1$% (elasticity) | log(quantity) ~ log(price) |
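One caveat on the Log-Level row: $100\beta_1$% is only an approximation that works for small coefficients; the exact percentage change is $100(e^{\beta_1} - 1)$%. A quick check in Python, using an illustrative coefficient of 0.08:
```python
import numpy as np

beta = 0.08                    # log-level education coefficient (illustrative)
approx = 100 * beta            # approximate percent change: 8.0%
exact = 100 * np.expm1(beta)   # exact percent change: about 8.33%
print(approx, exact)
```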
Publication-Grade Tables:
```python
from statsmodels.iolib.summary2 import summary_col

# Multi-model comparison
results = summary_col([model1, model2, model3],
                      model_names=['(1)', '(2)', '(3)'],
                      stars=True)
print(results.as_latex())
```
A Complete Regression Report Should Include:
- Coefficient estimates and standard errors
- Significance markers (*** p<0.01, ** p<0.05, * p<0.1)
- Sample size, $R^2$, adjusted $R^2$
- $F$ statistic
- Standard error type (robust, clustered)
- Control variable specifications
Learning Objectives
After completing this chapter, you will be able to:
| Competency Dimension | Specific Objectives |
|---|---|
| Theoretical Understanding | Understand OLS estimation principles |
| | Master the Gauss-Markov assumptions |
| | Understand the meaning of partial regression coefficients |
| Python Implementation | Conduct OLS regression using statsmodels |
| | Perform comprehensive regression diagnostics |
| | Handle categorical variables and interaction terms |
| Results Interpretation | Correctly interpret regression coefficients |
| | Distinguish statistical significance from substantive significance |
| | Identify pitfalls in causal inference |
| Academic Writing | Produce publication-grade regression tables |
| | Write standardized reports of regression results |
Common Misconceptions in Regression Analysis
Misconception 1: Correlation Implies Causation
Wrong:
Regression shows ice cream sales are significantly positively correlated with drowning deaths
→ Conclusion: ice cream causes drowning
Correct:
- Confounding variable: summer temperature drives both
- Correlation ≠ causation
- Endogeneity and omitted variables must be considered
Misconception 2: A Higher $R^2$ is Always Better
Wrong: Blindly pursuing a high $R^2$
Correct:
- For prediction tasks: $R^2$ matters
- For causal inference: unbiasedness of the coefficients matters more
- A very high $R^2$ may indicate overfitting
Misconception 3: Statistical Significance = Practical Importance
Wrong:
p < 0.001, therefore the effect is very important
Correct:
- Statistical significance: the effect is distinguishable from zero
- Substantive significance: the effect size is meaningful
- With large samples, even tiny effects become significant (see the simulation below)
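A minimal simulation sketch of that last point, with a deliberately negligible effect size (all values made up for illustration):
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1_000_000
x = rng.normal(size=n)
y = 0.005 * x + rng.normal(size=n)    # tiny, substantively trivial effect

res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.params[1], res.pvalues[1])  # ~0.005, yet p is far below 0.001
```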
Misconception 4: More Control Variables is Better
Wrong: Including all possible variables
Correct:
- Only control for confounders
- Over-controlling can induce collider bias (see the sketch below)
- Avoid bad controls (mediators and outcome variables)
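A minimal collider sketch (made-up parameters): $x$ causes $y$, and both cause $c$; conditioning on $c$ distorts the coefficient on $x$.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)   # true effect of x on y is 1
c = x + y + rng.normal(size=n)     # collider: caused by both x and y

clean = sm.OLS(y, sm.add_constant(x)).fit()
biased = sm.OLS(y, sm.add_constant(np.column_stack([x, c]))).fit()
print(clean.params[1], biased.params[1])  # ~1.0 vs. ~0: the effect vanishes
```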
Classic Literature and Cases
Pioneering Research
Mincer (1974): "Schooling, Experience, and Earnings"
- Mincer wage equation
Card & Krueger (1994): "Minimum Wages and Employment"
- Impact of minimum wage on employment
- Difference-in-differences regression
Angrist & Krueger (1991): "Does Compulsory School Attendance Affect Schooling and Earnings?"
- Birth quarter as instrumental variable
- Causal return to education
Classic Textbooks
Wooldridge (2020): Introductory Econometrics (7th Edition)
- Econometrics bible
- Python example code
Angrist & Pischke (2009): Mostly Harmless Econometrics
- Causal inference perspective
- Practice-oriented
Stock & Watson (2020): Introduction to Econometrics (4th Edition)
- Clear and accessible
- Abundant case studies
Learning Recommendations
DO (Recommended Practices)
- Understand Before Operating: Don't just run code, understand the statistical principles
- Visualization First: Plot scatter plots before regression to intuitively understand relationships
- Diagnostics are Essential: Check assumptions with every regression
- Caution with Causality: Distinguish correlation from causation
- Standardized Reporting: Report results according to academic standards
DON'T (Avoid Pitfalls)
- Don't p-hack: Don't try models repeatedly until significant
- Don't Ignore Assumptions: Gauss-Markov assumptions matter
- Don't Over-interpret: Correlation does not equal causation
- Don't Forget Robust Standard Errors: Especially with cross-sectional data
- Don't Blindly Trust $R^2$: a high $R^2$ doesn't mean a good model
Chapter Datasets
| Dataset | Description | Source | Variables |
|---|---|---|---|
| wage_data.csv | Wage data (cross-sectional) | CPS | wage, education, experience, female |
| housing_prices.csv | Housing price data | Boston Housing | price, rooms, crime, distance |
| student_achievement.csv | Student achievement | PISA | test_score, class_size, teacher_qual |
| country_growth.csv | Country growth | Penn World Table | gdp_growth, investment, education |
Ready to Begin?
Regression analysis is the core method of social science research. It is:
- An extension of descriptive statistics
- An application of hypothesis testing
- A foundation for causal inference
- A starting point for predictive modeling
Mastering regression analysis will enable you to:
- Scientifically study relationships between variables
- Write standardized empirical research papers
- Conduct data-driven decision analysis
- Build a foundation for advanced econometric methods
Let's begin our journey into regression analysis!
Chapter File List
```
module-5_Regression Analysis/
├── 5.1-Chapter Introduction.md                    # This file
├── 5.2-Simple Linear Regression.md                # Simple linear regression
├── 5.3-Multiple Regression.md                     # Multiple linear regression
├── 5.4-Regression Diagnostics.md                  # Regression diagnostics
├── 5.5-Categorical Variables and Interactions.md  # Categorical variables and interaction terms
└── 5.6-Interpretation and Reporting.md            # Results interpretation and reporting
```
Estimated Learning Time: 20-24 hours
Difficulty Level: ⭐⭐⭐⭐ (requires statistics background)
Practicality: ⭐⭐⭐⭐⭐ (essential for social science research)
Next Section: 5.2 - Simple Linear Regression
Let's start with the simplest model!