
2.1 Chapter Introduction (Counterfactual Framework & Randomized Controlled Trials)

The Foundation of Causal Inference: From Potential Outcomes to the Gold Standard



Why Is This Chapter Crucial?

In data science and economic research, mistaking correlation for causation is one of the most common errors.

Classic Example: Ice Cream and Drowning

Observed Correlation:

  • Ice cream sales ↑ alongside drowning incidents ↑
  • The correlation is statistically significant (p < 0.01)

Wrong Conclusion: ❌ "Ice cream causes drowning, we should ban ice cream"

Correct Analysis: ✅ Confounding variable: Summer temperature

  • Summer → Ice cream sales ↑
  • Summer → Swimming population ↑ → Drowning ↑
  • Causal path: Temperature → {Ice cream, Drowning}, no causal relationship between them
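The confounding story above can be checked with a small simulation. This is a minimal sketch under assumed numbers: the data-generating process, the coefficients, and the helper `residualize` are all hypothetical and chosen only so that temperature drives both variables while ice cream has no direct effect on drowning.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process: temperature drives both variables,
# and ice cream has NO direct effect on drowning.
temperature = rng.normal(20, 8, n)                          # daily temperature (°C)
ice_cream = 50 + 3.0 * temperature + rng.normal(0, 10, n)   # sales
drowning = 2 + 0.2 * temperature + rng.normal(0, 1, n)      # incidents

# The raw correlation is strongly positive despite the absence of a causal link
raw_corr = np.corrcoef(ice_cream, drowning)[0, 1]

def residualize(y, x):
    """Return residuals from a least-squares fit of y on x (with intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

# Remove the part of each variable explained by temperature, then correlate
partial_corr = np.corrcoef(
    residualize(ice_cream, temperature),
    residualize(drowning, temperature),
)[0, 1]

print(f"Raw correlation:               {raw_corr:.3f}")
print(f"Correlation given temperature: {partial_corr:.3f}")
```

Once temperature is held fixed, the association between ice cream and drowning should shrink to roughly zero, which is what a pure confounding structure predicts.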

Real-World Case: Minimum Wage and Employment

Policy Question: Does raising minimum wage reduce employment?

Problem with Traditional Regression:

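```python
# ❌ Simple regression has severe endogeneity issues
import statsmodels.formula.api as smf
# df: hypothetical DataFrame with employment_rate and min_wage columns
model = smf.ols("employment_rate ~ min_wage", data=df).fit()
# Cannot distinguish causal effects from selection bias
```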

Issues:

  • States that raise minimum wage may have stronger economies to begin with (selection bias)
  • States with high unemployment may be more likely to raise minimum wage, so employment affects the policy itself (reverse causality)
  • Other policies implemented simultaneously (confounding factors)

Counterfactual Framework Solution:

  • Use Difference-in-Differences (DID) or RCT to identify causal effects
  • Construct counterfactual control groups
  • Eliminate selection bias and confounding factors
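To make the idea of a constructed counterfactual concrete, here is a minimal 2×2 Difference-in-Differences sketch. All numbers (`true_effect`, `trend`, `gap`, the sample size) are assumptions invented for illustration, not estimates from any minimum-wage study, and the simulation builds in the parallel-trends condition that DID relies on.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000  # hypothetical observations per group and period

true_effect = -1.0   # assumed causal effect of the policy on the outcome
trend = 2.0          # common time trend shared by both groups
gap = 5.0            # pre-existing level difference between the groups

# Simulated outcomes for each (group, period) cell
control_pre = rng.normal(50, 3, n)
control_post = rng.normal(50 + trend, 3, n)
treated_pre = rng.normal(50 + gap, 3, n)
treated_post = rng.normal(50 + gap + trend + true_effect, 3, n)

# A naive post-period comparison absorbs the pre-existing gap
naive = treated_post.mean() - control_post.mean()

# DID: change in the treated group minus change in the control group.
# The control group's change stands in for the treated group's counterfactual trend.
did = (treated_post.mean() - treated_pre.mean()) - (
    control_post.mean() - control_pre.mean()
)

print(f"Naive post-period difference: {naive:.2f}")
print(f"DID estimate:                 {did:.2f}")
```

The naive comparison mixes the policy effect with the pre-existing gap, while the DID estimate should land close to the assumed effect because both groups share the same trend by construction.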

Core Content of This Chapter

Section 1: Potential Outcomes Framework

Core Idea: The essence of causal inference is comparing outcomes for the same individual under different treatment states

  • Rubin Causal Model (RCM)
  • Definition of potential outcomes: Yi(1) vs Yi(0)
  • Fundamental problem: We can never observe both states simultaneously
  • Definition of causal effect: τi = Yi(1) - Yi(0)

Case: Causal Effect of Education Training

Individual i:
- Yi(1) = Income after attending training
- Yi(0) = Income without attending training (counterfactual)
- Causal effect τi = Yi(1) - Yi(0)

Problem: We can only observe one outcome!
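The fundamental problem can be illustrated with a toy table. The five incomes below are made-up values; the sketch pretends that both potential outcomes are known and then shows what a researcher would actually see.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes for five people; in real data Y1 and Y0 are never both known.
df = pd.DataFrame({
    "Y1": [32.0, 41.0, 28.0, 55.0, 37.0],  # income with training
    "Y0": [30.0, 36.0, 29.0, 50.0, 31.0],  # income without training
    "D":  [1, 0, 1, 0, 1],                  # actual training status
})
df["tau"] = df["Y1"] - df["Y0"]             # individual effect (unobservable in practice)

# What the researcher observes: exactly one outcome per person
df["Y_obs"] = np.where(df["D"] == 1, df["Y1"], df["Y0"])

observed = df[["D", "Y_obs"]].copy()
observed["Y1_seen"] = df["Y1"].where(df["D"] == 1)  # NaN marks the missing counterfactual
observed["Y0_seen"] = df["Y0"].where(df["D"] == 0)

print(observed)
print(f"True average effect (knowable only in this toy example): {df['tau'].mean():.1f}")
```

Every row has one NaN, so no individual τi can be computed from observed data alone. Causal inference is in this sense a missing-data problem.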

Section 2: Randomized Controlled Trials (RCTs)

Why is RCT the Gold Standard?

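RCT addresses the fundamental problem of causal inference through randomization. Individual effects remain unobservable, but average effects become identifiable because randomization:

  • Eliminates selection bias in expectation
  • Balances observed and unobserved confounding variables across groups, on average
  • Makes treatment and control groups comparable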

Core Mechanism of RCT:

Random assignment: Di ~ Bernoulli(0.5)
- Di = 1 → Treatment group
- Di = 0 → Control group

Key property: E[Yi(0)|Di=1] = E[Yi(0)|Di=0]
i.e., Without treatment, both groups have the same average outcome
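The key property E[Yi(0)|Di=1] = E[Yi(0)|Di=0] can be checked in a simulation where Y0 is known for everyone. The sketch below contrasts self-selection with a coin flip; the selection rule, the constant effect of 5, and the helper names `y0_gap` and `naive_diff` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

Y0 = rng.normal(75, 15, n)   # untreated potential outcome (hypothetical scores)
Y1 = Y0 + 5                  # constant true effect of 5, assumed for illustration

# Self-selection: units with higher Y0 are more likely to take the treatment
p_select = 1 / (1 + np.exp(-(Y0 - 75) / 15))
D_self = rng.binomial(1, p_select)

# Random assignment: a coin flip that ignores Y0
D_rand = rng.binomial(1, 0.5, n)

def y0_gap(D):
    """E[Y0 | D=1] - E[Y0 | D=0], i.e. the selection bias term."""
    return Y0[D == 1].mean() - Y0[D == 0].mean()

def naive_diff(D):
    """Difference in observed means between treated and control units."""
    Y_obs = np.where(D == 1, Y1, Y0)
    return Y_obs[D == 1].mean() - Y_obs[D == 0].mean()

for name, D in [("self-selection", D_self), ("randomized", D_rand)]:
    print(f"{name:>15}: Y0 gap = {y0_gap(D):6.2f}, naive difference = {naive_diff(D):6.2f}")
```

Under self-selection the Y0 gap is large and the naive difference overstates the effect. Under random assignment the gap should be close to zero and the naive difference close to 5.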

Experimental Designs:

  • Simple Randomization
  • Stratified Randomization
  • Matched-Pair Randomization
  • Cluster Randomization
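As one example of these designs, stratified randomization can be sketched as follows. The sample, the stratum labels, and the column names `unit_id`, `stratum`, and `D` are hypothetical; the idea is to randomize separately inside each stratum so that every stratum is split evenly between treatment and control.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical sample: 200 units with one categorical stratum variable
df = pd.DataFrame({
    "unit_id": np.arange(200),
    "stratum": rng.choice(["low", "mid", "high"], size=200),
})

# Within each stratum, treat exactly half of the units (rounded down),
# choosing which units at random.
df["D"] = 0
for stratum, idx in df.groupby("stratum").groups.items():
    labels = np.zeros(len(idx), dtype=int)
    labels[: len(idx) // 2] = 1
    df.loc[idx, "D"] = rng.permutation(labels)

# Each stratum now has a (near) 50/50 split by construction
print(df.groupby("stratum")["D"].agg(["size", "sum"]))
```

Simple randomization would balance the strata only on average, whereas this design guarantees balance on the stratifying variable in every realized assignment.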

Section 3: Average Treatment Effects

Core Concepts:

| Effect Type | Definition | Application Scenario |
| --- | --- | --- |
| ATE | Average Treatment Effect | Population-level average causal effect |
| ATT | Average Treatment Effect on the Treated | Average effect for treatment group |
| ATU | Average Treatment Effect on the Untreated | Average effect for control group |
| LATE | Local Average Treatment Effect | Local effect for compliers |
| CATE | Conditional Average Treatment Effect | Conditional average effect (heterogeneity) |

Mathematical Definitions:

ATE = E[Yi(1) - Yi(0)]
    = E[Yi(1)] - E[Yi(0)]

ATT = E[Yi(1) - Yi(0) | Di = 1]
    = E[Yi(1) | Di = 1] - E[Yi(0) | Di = 1]
                          ^^^^^^^^^^^^^^
                          (Counterfactual, unobservable)
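The definitions above can be computed directly in a simulation where both potential outcomes are known. This is a sketch under assumed distributions: individual effects `tau_i` are heterogeneous, and the first treatment rule deliberately selects units with larger gains so that ATE, ATT, and ATU differ.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

Y0 = rng.normal(75, 15, n)
tau_i = rng.normal(5, 4, n)      # heterogeneous individual effects (assumed)
Y1 = Y0 + tau_i

# Selection on gains: units with larger effects are more likely to be treated
D = (tau_i + rng.normal(0, 4, n) > 5).astype(int)

ATE = tau_i.mean()
ATT = tau_i[D == 1].mean()
ATU = tau_i[D == 0].mean()

# ATE is the share-weighted average of ATT and ATU
share_treated = D.mean()
ATE_check = share_treated * ATT + (1 - share_treated) * ATU

# With a coin-flip assignment, the treated are a random sample, so ATT ≈ ATE
D_rand = rng.binomial(1, 0.5, n)
ATT_rand = tau_i[D_rand == 1].mean()

print(f"ATE = {ATE:.2f}, ATT = {ATT:.2f}, ATU = {ATU:.2f}")
print(f"Weighted average of ATT and ATU = {ATE_check:.2f}")
print(f"ATT under random assignment     = {ATT_rand:.2f}")
```

These quantities are computable here only because the simulation exposes `tau_i`; with real data the counterfactual terms have to be identified through design or assumptions.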

Advantage of RCT:

  • Under RCT (random assignment with full compliance): ATE = ATT = ATU
  • The simple difference in group means is an unbiased estimator of the ATE
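Why the simple difference works can be seen from a standard decomposition, written here in the same notation as the definitions above (Y_i denotes the observed outcome). This is the textbook argument rather than anything specific to this chapter's datasets.

```latex
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
  &= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] \\
  &= \underbrace{E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 1]}_{\text{ATT}}
   + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{aligned}
```

Random assignment makes D_i independent of the potential outcomes, so the selection bias term is zero and the ATT coincides with the ATE.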

Section 4: Identification Strategies and Validity

Internal Validity:

  • Whether causal inference is correct within the study sample
  • Threats:
    • Selection Bias
    • Confounding
    • Contemporaneous Events
    • Attrition

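External Validity:

  • Whether causal effects generalize to other populations, settings, and time periods

SUTVA Assumption (Stable Unit Treatment Value Assumption), required for potential outcomes to be well defined:

  • No spillover effects: one unit's outcome does not depend on other units' treatment status
  • Treatment consistency: there is a single version of the treatment, with no hidden variations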
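A small simulation shows what can go wrong when the no-spillover condition fails. The setup is hypothetical: units come in pairs, treatment is randomized individually, and an untreated unit is assumed to gain `spillover` points whenever its partner is treated.

```python
import numpy as np

rng = np.random.default_rng(5)
n_pairs = 50_000

Y0 = rng.normal(75, 15, (n_pairs, 2))     # two units per pair (e.g., deskmates)
D = rng.binomial(1, 0.5, (n_pairs, 2))    # individual-level random assignment
partner_D = D[:, ::-1]                    # treatment status of the other pair member

direct_effect = 5.0   # assumed effect of a unit's own treatment
spillover = 2.0       # assumed gain for an untreated unit with a treated partner

# Controls with a treated partner are "contaminated" by the spillover
Y_obs = Y0 + direct_effect * D + spillover * (1 - D) * partner_D

diff_in_means = Y_obs[D == 1].mean() - Y_obs[D == 0].mean()
print(f"Assumed direct effect: {direct_effect:.2f}")
print(f"Difference in means:   {diff_in_means:.2f}")
```

Because half of the controls have a treated partner, the control mean rises by about 1 point and the difference in means should come out near 4 rather than 5, even though assignment was perfectly random. Randomizing at the pair level (cluster randomization) is one common response to this problem.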

Identification Strategy Comparison:

| Method | Randomness Source | Internal Validity / External Validity | Implementation Difficulty |
| --- | --- | --- | --- |
| RCT | Random assignment | ⭐⭐⭐⭐⭐⭐⭐⭐ | High |
| DID | Exogenous policy shock | ⭐⭐⭐⭐⭐⭐⭐⭐ | Medium |
| RDD | Randomness near cutoff | ⭐⭐⭐⭐⭐⭐ | Medium |
| IV | Instrumental variable | ⭐⭐⭐⭐⭐⭐ | High |
| PSM | Conditional independence | ⭐⭐⭐⭐⭐ | Low |

Section 5: Python Practice - Complete RCT Analysis Workflow

Complete Case: A/B Testing for Online Education Platform

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

# 1. Data generation (simulate RCT)
np.random.seed(42)
n = 1000

# Random assignment
treatment = np.random.binomial(1, 0.5, n)

# Potential outcomes
Y0 = np.random.normal(75, 15, n)  # Scores without treatment
tau = 5  # True average causal effect
Y1 = Y0 + tau + np.random.normal(0, 2, n)

# Observed outcome (fundamental problem: only observe one)
Y_obs = treatment * Y1 + (1 - treatment) * Y0

# 2. Balance check
# Y0 is known for everyone only because the data are simulated;
# with real data, test balance on pre-treatment covariates instead.
balance_test = stats.ttest_ind(
    Y0[treatment == 1],
    Y0[treatment == 0]
)
print(f"Balance test p-value: {balance_test.pvalue:.4f}")

# 3. ATE estimation (simple difference)
ATE_simple = Y_obs[treatment == 1].mean() - Y_obs[treatment == 0].mean()
print(f"ATE (simple difference): {ATE_simple:.2f}")

# 4. Regression estimation (heteroskedasticity-robust standard errors)
X = sm.add_constant(treatment)
model = sm.OLS(Y_obs, X).fit(cov_type='HC3')
print(model.summary())

# 5. Heterogeneity analysis (CATE)
# Stratify by student baseline (Y0 stands in for a pre-treatment score here)
baseline = Y0
high_baseline = baseline > np.median(baseline)  # ndarrays have no .median() method

# Build the masks explicitly: & binds more tightly than ==,
# so `treatment == 1 & high_baseline` would not mean what it appears to.
treated = treatment == 1
control = treatment == 0

CATE_high = (Y_obs[treated & high_baseline].mean() -
             Y_obs[control & high_baseline].mean())
CATE_low = (Y_obs[treated & ~high_baseline].mean() -
            Y_obs[control & ~high_baseline].mean())

print(f"High baseline students CATE: {CATE_high:.2f}")
print(f"Low baseline students CATE: {CATE_low:.2f}")
```
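Beyond the regression-based standard errors, a permutation (randomization) test is a natural complement for RCT data, because it relies only on the random assignment itself. The sketch below regenerates a simulated dataset with the same design as the case above (assumed true effect of 5); the helper `diff_in_means` and the choice of 2,000 permutations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Simulated RCT with an assumed true effect of 5
treatment = rng.binomial(1, 0.5, n)
Y0 = rng.normal(75, 15, n)
Y_obs = Y0 + treatment * (5 + rng.normal(0, 2, n))

def diff_in_means(y, d):
    """Mean outcome of treated units minus mean outcome of control units."""
    return y[d == 1].mean() - y[d == 0].mean()

observed = diff_in_means(Y_obs, treatment)

# Sharp null hypothesis: the treatment has no effect on any unit.
# Under that null, re-shuffling the labels leaves the outcomes unchanged,
# so the shuffled statistics trace out the null distribution.
n_perm = 2000
perm_stats = np.array([
    diff_in_means(Y_obs, rng.permutation(treatment)) for _ in range(n_perm)
])

# Two-sided p-value: share of shuffled statistics at least as extreme as observed
p_value = (np.abs(perm_stats) >= abs(observed)).mean()

print(f"Observed difference in means: {observed:.2f}")
print(f"Permutation p-value:          {p_value:.4f}")
```

The permutation distribution should be centered near zero, and an observed difference far in its tail is evidence against the sharp null of no effect.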

Learning Objectives

After completing this chapter, you will be able to:

| Capability | Specific Goals |
| --- | --- |
| Conceptual Understanding | ✅ Understand potential outcomes framework and counterfactual logic |
| | ✅ Master core challenges of causal inference (selection bias, confounding) |
| | ✅ Understand why RCT is the gold standard |
| Technical Mastery | ✅ Design and analyze RCT experiments |
| | ✅ Distinguish between ATE, ATT, LATE, and other effects |
| | ✅ Conduct balance checks and validity diagnostics |
| Practical Skills | ✅ Implement complete RCT analysis using Python |
| | ✅ Conduct heterogeneity analysis (CATE) |
| | ✅ Correctly interpret and report causal effects |

Learning Roadmap

Week 1: Introduction to Counterfactual Thinking
├─ Understand potential outcomes framework
├─ Fundamental problem of causal inference
└─ Simple case analysis

Week 2: RCT Theory and Design
├─ The magic of randomization
├─ Types of experimental designs
└─ Balance and validity

Week 3: Effect Estimation and Inference
├─ Differences between ATE/ATT/LATE
├─ Standard errors and hypothesis testing
└─ Heterogeneity analysis

Week 4: Python Practice
├─ Data generation and simulation
├─ Complete analysis workflow
└─ Results visualization and reporting

Connections to Other Modules

Prerequisites (from Python Fundamentals)

  • Module 3: Basic syntax (conditional statements, loops)
  • Module 4: Data structures (lists, dictionaries, DataFrame)
  • Module 5: Functions and modules
  • Module 9: NumPy, Pandas, visualization

Subsequent Applications

  • Module 3: Data cleaning and variable construction (preparing data for causal analysis)
  • Module 6: OLS regression (regression with control variables)
  • Module 8: Econometrics (IV, DID, RDD, and quasi-experiments)
  • Module 10: Causal inference models (DoWhy, CausalML)

Classic Textbooks

  1. Angrist & Pischke (2009): Mostly Harmless Econometrics

    • Chapter 2: Random Assignment Solves the Selection Problem
    • Practical, intuitive, rich in examples
  2. Imbens & Rubin (2015): Causal Inference for Statistics, Social, and Biomedical Sciences

    • Authoritative textbook on potential outcomes framework
    • Mathematically rigorous yet accessible
  3. Pearl (2009): Causality: Models, Reasoning, and Inference

    • DAG (Directed Acyclic Graph) perspective
    • Highest theoretical depth

Frontier Papers

  • Athey & Imbens (2017): "The State of Applied Econometrics: Causality and Policy Evaluation"
  • Abadie (2020): "Statistical Nonsignificance in Empirical Economics"

Online Resources

  • Mixtape Sessions: Scott Cunningham's causal inference course
  • YouTube: Ben Lambert's econometrics series

Study Recommendations

✅ DO (Recommended Practices)

  1. Start with examples: Every concept needs concrete cases
  2. Comparative thinking: Distinguish correlation vs causation
  3. Hands-on practice: Run Python code, modify parameters to observe changes
  4. Draw diagrams: DAGs (Directed Acyclic Graphs) are the best tool for understanding causality

❌ DON'T (Common Pitfalls)

  1. Don't memorize formulas: Understanding concepts is more important than memorization
  2. Don't skip balance checks: They are the main diagnostic that randomization worked as intended
  3. Don't over-interpret: Causal estimates hold only under their assumptions (such as SUTVA) and for the population studied
  4. Don't ignore standard errors: Statistical inference is as important as point estimates

Chapter Datasets

We will use the following real/simulated datasets:

| Dataset | Description | Source | Sample Size |
| --- | --- | --- | --- |
| STAR Project | Tennessee class-size reduction experiment | Real RCT | 11,600 |
| Progresa/Oportunidades | Mexico conditional cash transfer program | Real RCT | 506 villages |
| Online Education A/B Test | Online course RCT simulated data | Simulated | 1,000 |
| Job Training RCT | Employment training experiment | Simulated | 2,000 |

Self-Assessment Questions (Before Starting)

Before studying this chapter, test your understanding:

  1. Conceptual question: What is a counterfactual? Why is it the core of causal inference?

  2. Case question: Research finds "students who use social media have lower grades." Does this prove social media causes lower grades? Why or why not?

  3. Design question: To study "the causal effect of remote work on employee productivity," how would you design an RCT?

Answer hints:

  • If you can clearly answer question 1, you already think in causal terms
  • If question 2 confuses you, this chapter will help build a rigorous causal reasoning framework
  • If question 3 is difficult, this chapter will teach you the complete RCT design process

Are You Ready?

The counterfactual framework and RCT are the foundation of modern causal inference, and the most credible source of causal evidence in economics, sociology, medicine, and other fields.

After mastering this chapter, you will be able to:

  • ✅ Establish rigorous causal thinking
  • ✅ Understand core principles of experimental design
  • ✅ Independently analyze RCT data
  • ✅ Build a foundation for learning advanced quasi-experimental methods

Let's begin! 🚀


Chapter File List

module-2_Counter Factual and RCTs/
├── 00-Chapter Introduction.md              # This file
├── 01-potential-outcomes-framework.md      # Potential outcomes framework
├── 02-randomized-controlled-trials.md      # RCT principles and design
├── 03-average-treatment-effects.md         # Average treatment effects
├── 04-identification-strategies.md         # Identification strategies and validity
└── 05-practical-implementation.md          # Python practice

Estimated Study Time: 12-16 hours
Difficulty Level: ⭐⭐⭐⭐ (Requires strong abstract thinking)
Practicality: ⭐⭐⭐⭐⭐ (Essential for modern causal inference)


Next Section: 01 - Potential Outcomes Framework

Let the causal inference journey begin! 🎯

Released under the MIT License. Content © Author.