5.1 Chapter Introduction (Foundations of Regression Analysis)

From Simple Regression to Multiple Regression: The Core Method in Social Science Research


Why is Regression Analysis So Important?

Core Questions in Social Science Research

Nearly all social science research seeks to answer similar questions:

| Discipline | Core Question | Regression Application |
| --- | --- | --- |
| Economics | Effect of education on income? | `log(wage) ~ education + experience` |
| Sociology | Role of social capital in upward mobility? | `mobility ~ social_capital + education + family_background` |
| Political Science | Impact of democracy on economic growth? | `gdp_growth ~ democracy + institutions + controls` |
| Education | Effect of class size on student achievement? | `test_score ~ class_size + teacher_quality + controls` |
| Psychology | Impact of stress on mental health? | `mental_health ~ stress + social_support + personality` |

Commonality: All require regression analysis to establish relationships between variables


The Essence of Regression Analysis

From Correlation to Causation

Correlation:

python
# Simple correlation
corr = df['education'].corr(df['wage'])
# Problem: Cannot control for confounders, cannot infer causality

Regression Analysis:

python
# Multiple regression
import statsmodels.api as sm

X = sm.add_constant(df[['education', 'experience', 'ability']])
model = sm.OLS(df['log_wage'], X).fit()
# Advantage: holds the other included regressors fixed, isolating net effects
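
The contrast between a raw association and a regression with controls can be illustrated with simulated data. The sketch below is a minimal example under stated assumptions: the data-generating process, the coefficient values (0.08 and 0.10), and the helper `ols` are all made up for illustration and do not come from the chapter datasets. It uses NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process: ability raises both schooling and wages.
ability = rng.normal(size=n)
education = 12 + 2 * ability + rng.normal(size=n)
log_wage = 1.0 + 0.08 * education + 0.10 * ability + rng.normal(scale=0.3, size=n)

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_short = ols(log_wage, education)           # ability omitted
b_long = ols(log_wage, education, ability)   # ability controlled

print(f"education coefficient, ability omitted:    {b_short[1]:.3f}")
print(f"education coefficient, ability controlled: {b_long[1]:.3f}")
```

In this setup the coefficient from the short regression overstates the simulated return to education, because it also picks up the effect of the omitted ability variable.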

Three Primary Uses of Regression

  1. Describing Relationships (Descriptive)

    • Quantifies the strength of relationships between variables
    • Example: Each additional year of education is associated with about 8% higher wages
  2. Prediction

    • Predicts dependent variables based on independent variables
    • Example: Predicting future income based on individual characteristics
  3. Causal Inference

    • Infers causal relationships under certain assumptions
    • Example: The causal effect of education on income

Learning Roadmap

Section 1: Simple Linear Regression

Core Concepts:

  • Simple linear regression model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
  • OLS estimation principle: Minimizing sum of squared residuals
  • Goodness of fit: Meaning of $R^2$
  • Statistical inference: $t$ tests, confidence intervals
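
The OLS principle and the goodness-of-fit measure listed above can be worked through by hand. The following is a small sketch, assuming a made-up eight-observation sample (the numbers are illustrative, not real wage data). It computes the closed-form slope and intercept that minimize the sum of squared residuals, then $R^2$, and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Small made-up sample: years of education and hourly wage.
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([9.0, 11.5, 13.0, 14.5, 17.0, 19.5, 21.0, 24.0])

# Closed-form OLS solution for y = b0 + b1 * x + e
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# R^2 = 1 - SSR / SST
resid = y - (b0 + b1 * x)
ssr = np.sum(resid ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - ssr / sst

# Cross-check against NumPy's least-squares fit of a degree-1 polynomial
slope_check, intercept_check = np.polyfit(x, y, deg=1)

print(f"b0={b0:.3f}, b1={b1:.3f}, R^2={r2:.3f}")
```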

Python Implementation:

python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simple linear regression
X = sm.add_constant(df['education'])
y = df['wage']
model = sm.OLS(y, X).fit()

# Visualization
plt.scatter(df['education'], df['wage'], alpha=0.5)
plt.plot(df['education'], model.fittedvalues, 'r-', linewidth=2)
plt.xlabel('Education (years)')
plt.ylabel('Wage')
plt.title('Simple Linear Regression')
plt.show()

Classic Cases:

  • Galton's height regression (origin of "regression")
  • Mincer wage equation (foundation of education economics)

Section 2: Multiple Linear Regression

Core Concepts:

  • Multiple regression model: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i$
  • Partial regression coefficient
  • Multicollinearity
  • Omitted variable bias

Mathematical Intuition:

$\beta_j = \dfrac{\partial E[y \mid x_1, \dots, x_k]}{\partial x_j}$

Interpretation: $\beta_j$ is the net effect of $x_j$ on $y$ after controlling for other variables
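
One way to see what "after controlling for other variables" means is the Frisch-Waugh-Lovell result: the coefficient on a regressor in a multiple regression equals the coefficient obtained by first removing from that regressor everything the other regressors explain. The sketch below checks this numerically on simulated data. The variable names, coefficient values, and the helper `fit` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
experience = rng.normal(10, 3, size=n)
education = 12 + 0.2 * experience + rng.normal(size=n)   # correlated regressors
log_wage = 0.5 + 0.08 * education + 0.02 * experience + rng.normal(scale=0.3, size=n)

def fit(y, X):
    """Least-squares coefficients of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

const = np.ones(n)

# Full multiple regression: log_wage on constant, education, experience
beta_full = fit(log_wage, np.column_stack([const, education, experience]))

# Step 1: strip from education the part explained by the other regressors
Z = np.column_stack([const, experience])
edu_resid = education - Z @ fit(education, Z)

# Step 2: regress log_wage on the leftover variation in education
beta_partial = (edu_resid @ log_wage) / (edu_resid @ edu_resid)

print(f"multiple regression: {beta_full[1]:.6f}")
print(f"partialled out:      {beta_partial:.6f}")
```

The two numbers agree up to floating-point error, which is why the coefficient is called a partial regression coefficient.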

Python Implementation:

python
# Multiple regression
X = sm.add_constant(df[['education', 'experience', 'female', 'urban']])
model = sm.OLS(df['log_wage'], X).fit(cov_type='HC3')
print(model.summary())

Classic Cases:

  • Wage determination equation (Mincer Equation)
  • Cobb-Douglas production function
  • Demand/Supply models

Section 3: Regression Diagnostics

Core Assumptions (Gauss-Markov Assumptions):

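| Assumption | Meaning | Violation Consequence | Test Method |
| --- | --- | --- | --- |
| Linearity | $E[Y \mid X] = X\beta$ | Biased coefficients | |
| Exogeneity | $E[\varepsilon \mid X] = 0$ | Biased coefficients | |
| Homoskedasticity | $Var(\varepsilon \mid X) = \sigma^2$ | Biased standard errors (SE) | |
| No Autocorrelation | $Cov(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$ | Biased standard errors (SE) | Durbin-Watson (DW) test |
| Normality | $\varepsilon \mid X \sim N(0, \sigma^2)$ | Affects small sample inference | Shapiro-Wilk test |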

Diagnostic Tools:

python
# 1. Heteroskedasticity test
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(model.resid, model.model.exog)

# 2. Multicollinearity (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# 3. Residual analysis
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs fitted values, Q-Q plot, residual histogram, leverage plot
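
To make the multicollinearity diagnostic less of a black box, the variance inflation factor can be computed directly from its definition, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing regressor $j$ on the remaining regressors. The sketch below is a hand-rolled version on simulated data. The function `vif` and the variables `education`, `experience`, and `age` are illustrative assumptions, and this is not the statsmodels implementation.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (regressors only, no constant column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Auxiliary regression: column j on a constant plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
n = 500
education = rng.normal(13, 2, size=n)
experience = rng.normal(15, 5, size=n)
# age is nearly a linear combination of the other two, by construction
age = education + experience + 6 + rng.normal(scale=0.5, size=n)

vifs = vif(np.column_stack([education, experience, age]))
print(dict(zip(["education", "experience", "age"], vifs.round(1))))
```

All three simulated regressors produce large values here because any one of them is almost determined by the other two.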

Problem Solutions:

  • Heteroskedasticity → Robust standard errors (HC3), WLS
  • Multicollinearity → Remove variables, principal component regression
  • Nonlinearity → Transformations, polynomials, splines
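
As a rough illustration of what HC3 robust standard errors do, the sketch below computes both the classical and the HC3 standard errors by hand on simulated data whose error spread depends on the regressor. The data-generating process is an assumption chosen to make the heteroskedasticity strong. In practice `fit(cov_type='HC3')` in statsmodels would be used instead of this manual version.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=n)
# Error spread is proportional to |x|, so homoskedasticity fails by construction.
y = 1.0 + 0.5 * x + np.abs(x) * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors: sigma^2 * (X'X)^-1
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC3: sandwich estimator with residuals rescaled by leverage h_ii
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
omega = (resid / (1 - leverage)) ** 2
meat = X.T @ (X * omega[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(f"slope SE, classical: {se_classical[1]:.4f}")
print(f"slope SE, HC3:       {se_hc3[1]:.4f}")
```

The coefficient estimates are identical under both approaches. Only the standard errors, and therefore the tests and confidence intervals, change.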

Section 4: Categorical Variables and Interaction Terms

Dummy Variable Trap:

python
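import pandas as pd

# Wrong: including every dummy variable alongside an intercept
dummies = pd.get_dummies(df['region'])  # 4 regions → 4 dummy variables
# The 4 dummies sum to the constant column, which causes perfect collinearity

# Correct: drop one category to serve as the reference group
# (dtype=float avoids boolean columns in recent pandas versions)
dummies = pd.get_dummies(df['region'], drop_first=True, dtype=float)  # 3 dummy variables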
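
The collinearity behind the dummy variable trap can be verified by checking the rank of the design matrix. This is a minimal sketch on a made-up eight-row `regions` array, where the region labels are illustrative. With a constant and all four dummies the matrix has five columns but only rank four, so the OLS coefficients are not uniquely determined.

```python
import numpy as np

regions = np.array(["north", "south", "east", "west", "north", "east", "south", "west"])
levels = np.unique(regions)                      # 4 categories, sorted alphabetically

# One indicator column per category
dummies = (regions[:, None] == levels[None, :]).astype(float)
const = np.ones((len(regions), 1))

X_trap = np.hstack([const, dummies])             # constant + all 4 dummies
X_ok = np.hstack([const, dummies[:, 1:]])        # constant + 3 dummies (first level is the reference)

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(f"trap: {X_trap.shape[1]} columns, rank {rank_trap}")
print(f"ok:   {X_ok.shape[1]} columns, rank {rank_ok}")
```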

Interaction Effects:

python
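import statsmodels.formula.api as smf

# Does the return to education vary by gender?
# The formula term education:female builds the product education * female,
# so no manually created interaction column is needed.
model = smf.ols('log_wage ~ education + female + education:female',
                data=df).fit()

# Interpretation (β₁ on education, β₃ on education:female, female coded 0/1):
# Male education return = β₁
# Female education return = β₁ + β₃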
Visualizing Interaction Effects:

python
# Plot education-wage curves for different genders
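
A plot of this kind needs one fitted line per group. The sketch below shows how those lines follow from the interaction model's coefficients, using simulated data. The coefficient values and variable names are assumptions for illustration, and the plotting call itself is left out so that the example depends only on NumPy.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4_000
education = rng.uniform(8, 20, size=n)
female = rng.integers(0, 2, size=n).astype(float)

# Made-up coefficients: return of 0.07 for men and 0.07 + 0.02 = 0.09 for women
log_wage = (1.0 + 0.07 * education - 0.40 * female
            + 0.02 * education * female + rng.normal(scale=0.3, size=n))

X = np.column_stack([np.ones(n), education, female, education * female])
b0, b1, b2, b3 = np.linalg.lstsq(X, log_wage, rcond=None)[0]

return_male = b1            # slope when female = 0
return_female = b1 + b3     # slope when female = 1

# Fitted lines that a plot would draw, one per group
grid = np.linspace(8, 20, 5)
line_male = b0 + b1 * grid
line_female = (b0 + b2) + (b1 + b3) * grid

print(f"male return:   {return_male:.3f}")
print(f"female return: {return_female:.3f}")
```

Passing `grid` with `line_male` and `line_female` to `plt.plot` would give two lines with different slopes, which is the visual signature of an interaction effect.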

Section 5: Interpretation and Reporting of Regression Results

The Art of Coefficient Interpretation:

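| Model Form | Specification | Interpretation |
| --- | --- | --- |
| Level-Level | $y = \beta_0 + \beta_1 x$ | $x$ increases by 1 unit, $y$ increases by $\beta_1$ units |
| Log-Level | $\log(y) = \beta_0 + \beta_1 x$ | $x$ increases by 1 unit, $y$ increases by approximately $100\beta_1$% |
| Level-Log | $y = \beta_0 + \beta_1 \log(x)$ | $x$ increases by 1%, $y$ increases by approximately $\beta_1 / 100$ units |
| Log-Log | $\log(y) = \beta_0 + \beta_1 \log(x)$ | $x$ increases by 1%, $y$ increases by approximately $\beta_1$% (elasticity) |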
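
The percentage reading of a log-level coefficient is an approximation that works well for small coefficients and drifts for larger ones. The exact change is $100(e^{\beta} - 1)$ percent. The short sketch below compares the two. The function name `pct_change_log_level` and the example coefficient values are illustrative.

```python
import math

def pct_change_log_level(beta: float) -> float:
    """Exact % change in y for a one-unit increase in x when log(y) = a + beta * x."""
    return 100 * (math.exp(beta) - 1)

for beta in (0.02, 0.08, 0.30):
    approx = 100 * beta
    exact = pct_change_log_level(beta)
    print(f"beta={beta:.2f}  approx={approx:.1f}%  exact={exact:.2f}%")
```

For a coefficient of 0.08 the exact figure is about 8.3%, close to the approximate 8%. For 0.30 it is about 35%, noticeably above 30%.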

Publication-Grade Tables:

python
from statsmodels.iolib.summary2 import summary_col

# Multi-model comparison
results = summary_col([model1, model2, model3],
                      model_names=['(1)', '(2)', '(3)'],
                      stars=True)
print(results.as_latex())

Complete Regression Report Should Include:

  1. Coefficient estimates and standard errors
  2. Significance markers (*** p<0.01, ** p<0.05, * p<0.1)
  3. Sample size, $R^2$, adjusted $R^2$
  4. F statistic
  5. Standard error type (robust, clustered)
  6. Control variable specifications
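
When a table is assembled by hand rather than with `summary_col`, the star convention listed above is easy to encode. The helpers below are a hypothetical sketch: the names `significance_stars` and `format_cell` are not part of statsmodels, and they only show one way to render a coefficient cell.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the convention *** p<0.01, ** p<0.05, * p<0.1."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""

def format_cell(coef: float, se: float, p_value: float) -> str:
    """Render 'coefficient + stars' with the standard error in parentheses."""
    return f"{coef:.3f}{significance_stars(p_value)} ({se:.3f})"

print(format_cell(0.0812, 0.0123, 0.0004))
print(format_cell(0.0150, 0.0090, 0.0960))
```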

Learning Objectives

After completing this chapter, you will be able to:

| Competency Dimension | Specific Objectives |
| --- | --- |
| Theoretical Understanding | Understand OLS estimation principles; master Gauss-Markov assumptions; understand the meaning of partial regression coefficients |
| Python Implementation | Conduct OLS regression using statsmodels; perform comprehensive regression diagnostics; handle categorical variables and interaction terms |
| Results Interpretation | Correctly interpret regression coefficients; distinguish statistical significance from substantive significance; identify pitfalls in causal inference |
| Academic Writing | Produce publication-grade regression tables; write standardized regression result reports |

Common Misconceptions in Regression Analysis

Misconception 1: Correlation Implies Causation

Wrong:

Regression shows ice cream sales are significantly positively correlated with drowning deaths
→ Conclusion: Ice cream causes drowning

Correct:

  • Confounding variable: Summer temperature
  • Correlation ≠ Causation
  • Need to consider endogeneity, omitted variables, etc.

Misconception 2: Higher $R^2$ is Always Better

Wrong: Blindly pursuing a high $R^2$

Correct:

  • For prediction tasks: $R^2$ matters
  • For causal inference: Unbiasedness of coefficients is more important
  • A high $R^2$ may indicate overfitting
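
The link between a high $R^2$ and overfitting can be seen by adding regressors that are pure noise: in-sample $R^2$ cannot fall when regressors are added, even if they carry no information. The sketch below is a simulation under assumed settings (60 observations, 30 noise regressors). The helper `r2_and_adjusted` is illustrative.

```python
import numpy as np

def r2_and_adjusted(y, X):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include the constant column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - resid @ resid / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj

rng = np.random.default_rng(6)
n = 60
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
noise = rng.normal(size=(n, 30))            # regressors unrelated to y

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise])

r2_small, adj_small = r2_and_adjusted(y, X_small)
r2_big, adj_big = r2_and_adjusted(y, X_big)

print(f"1 regressor:    R^2={r2_small:.3f}, adjusted={adj_small:.3f}")
print(f"+30 noise vars: R^2={r2_big:.3f}, adjusted={adj_big:.3f}")
```

Adjusted $R^2$ penalizes the extra parameters, so the gap between the two measures widens as uninformative regressors are added.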

Misconception 3: Statistical Significance = Practical Importance

Wrong:

p < 0.001, therefore the effect is very important

Correct:

  • Statistical significance: The data provide evidence that the effect is nonzero
  • Substantive significance: Effect size is meaningful
  • With large samples, tiny effects can be significant
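
The last point is easy to reproduce. In the simulation below the true slope is set to 0.01, a value chosen purely for illustration, and the sample has one million observations. The slope, its standard error, the $t$ statistic, and $R^2$ are computed by hand with NumPy.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)      # true slope is tiny by construction

xc = x - x.mean()
slope = (xc @ y) / (xc @ xc)
resid = (y - y.mean()) - slope * xc
se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
t_stat = slope / se
r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(f"slope={slope:.4f}, t={t_stat:.1f}, R^2={r2:.5f}")
```

The $t$ statistic is far beyond conventional critical values, yet $x$ explains almost none of the variation in $y$. Whether such an effect matters is a substantive question that the p-value cannot answer.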

Misconception 4: More Control Variables is Better

Wrong: Including all possible variables

Correct:

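  • Control for confounders (variables that affect both the regressor of interest and the outcome), not for every available variable
  • Over-controlling can introduce bias, for example collider bias from conditioning on a common effect
  • Avoid bad controls: intermediate variables (mediators) and variables that are themselves outcomes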
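
Collider bias is counterintuitive, so a small simulation may help. In the sketch below `x` and `y` are independent by construction, and `c` is caused by both. All three names and the data-generating process are assumptions for illustration. Adding `c` as a control creates an association between `x` and `y` that does not exist.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
x = rng.normal(size=n)
y = rng.normal(size=n)                 # independent of x by construction
c = x + y + rng.normal(size=n)         # collider: a common effect of x and y

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_without = ols(y, x)[1]       # coefficient on x, no controls
b_with = ols(y, x, c)[1]       # coefficient on x after "controlling" for c

print(f"coefficient on x without c: {b_without:.3f}")
print(f"coefficient on x with c:    {b_with:.3f}")
```

Without the collider the coefficient on `x` is close to zero, as it should be. With the collider included it is clearly negative.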

Classic Literature and Cases

Pioneering Research

  1. Mincer (1974): "Schooling, Experience, and Earnings"

    • Mincer wage equation
  2. Card & Krueger (1994): "Minimum Wages and Employment"

    • Impact of minimum wage on employment
    • Difference-in-differences regression
  3. Angrist & Krueger (1991): "Does Compulsory School Attendance Affect Schooling and Earnings?"

    • Quarter of birth as an instrumental variable
    • Causal return to education

Classic Textbooks

  1. Wooldridge (2020): Introductory Econometrics (7th Edition)

    • Econometrics bible
    • Python example code
  2. Angrist & Pischke (2009): Mostly Harmless Econometrics

    • Causal inference perspective
    • Practice-oriented
  3. Stock & Watson (2020): Introduction to Econometrics (4th Edition)

    • Clear and accessible
    • Abundant case studies

Learning Recommendations

  1. Understand Before Operating: Don't just run code, understand the statistical principles
  2. Visualization First: Plot scatter plots before regression to intuitively understand relationships
  3. Diagnostics are Essential: Check assumptions with every regression
  4. Caution with Causality: Distinguish correlation from causation
  5. Standardized Reporting: Report results according to academic standards

DON'T (Avoid Pitfalls)

  1. Don't p-hack: Don't try models repeatedly until significant
  2. Don't Ignore Assumptions: Gauss-Markov assumptions matter
  3. Don't Over-interpret: Correlation does not equal causation
  4. Don't Forget Robust Standard Errors: Especially with cross-sectional data
  5. Don't Blindly Trust $R^2$: A high $R^2$ doesn't mean a good model

Chapter Datasets

| Dataset | Description | Source | Variables |
| --- | --- | --- | --- |
| wage_data.csv | Wage data (cross-sectional) | CPS | wage, education, experience, female |
| housing_prices.csv | Housing price data | Boston Housing | price, rooms, crime, distance |
| student_achievement.csv | Student achievement | PISA | test_score, class_size, teacher_qual |
| country_growth.csv | Country growth | Penn World Table | gdp_growth, investment, education |

Ready to Begin?

Regression analysis is the core method in social science research, and is:

  • An extension of descriptive statistics
  • An application of hypothesis testing
  • A foundation for causal inference
  • A starting point for predictive modeling

Mastering regression analysis will enable you to:

  • Scientifically study relationships between variables
  • Write standardized empirical research papers
  • Conduct data-driven decision analysis
  • Build a foundation for advanced econometric methods

Let's begin our journey into regression analysis!


Chapter File List

module-5_Regression Analysis/
├── 5.1-Chapter Introduction.md            # This file
├── 5.2-Simple Linear Regression.md        # Simple linear regression
├── 5.3-Multiple Regression.md             # Multiple linear regression
├── 5.4-Regression Diagnostics.md          # Regression diagnostics
├── 5.5-Categorical Variables and Interactions.md  # Categorical variables and interaction terms
└── 5.6-Interpretation and Reporting.md    # Results interpretation and reporting

Estimated Learning Time: 20-24 hours
Difficulty Level: ⭐⭐⭐⭐ (Requires statistics background)
Practicality: ⭐⭐⭐⭐⭐ (Essential for social science research)


Next Section: 5.2 - Simple Linear Regression

Let's start with the simplest model!

Released under the MIT License. Content © Author.