2.1 Chapter Introduction (Counterfactual Framework & Randomized Controlled Trials)

The Foundation of Causal Inference: From Potential Outcomes to the Gold Standard

Why Is This Chapter Crucial?

In data science and economic research, mistaking correlation for causation is the most common error.

Classic Example: Ice Cream and Drowning

Observed Correlation:

Ice cream sales ↑ → Drowning incidents ↑
Correlation coefficient significant (p < 0.01)

Wrong Conclusion: ❌ "Ice cream causes drowning, we should ban ice cream"

Correct Analysis: ✅ Confounding variable: Summer temperature

Summer → Ice cream sales ↑
Summer → Swimming population ↑ → Drowning ↑
Causal path: Temperature → {Ice cream, Drowning}, no causal relationship between them
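
The confounding story above can be checked with a small simulation. This is a minimal sketch under made-up assumptions: the coefficients, noise levels, and the helper `residualize` are hypothetical, and ice cream is given no effect on drowning by construction. The raw correlation is strongly positive, while the correlation that remains after removing temperature from both variables is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process: temperature drives both variables,
# and ice cream has no effect on drowning at all.
temperature = rng.normal(25, 5, n)
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
drowning = 0.5 * temperature + rng.normal(0, 2, n)

# Raw correlation is clearly positive even though there is no causal link.
raw_corr = np.corrcoef(ice_cream, drowning)[0, 1]


def residualize(y, x):
    """Return the part of y not linearly explained by x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


# Partial correlation: correlate what is left after removing temperature.
partial_corr = np.corrcoef(
    residualize(ice_cream, temperature),
    residualize(drowning, temperature),
)[0, 1]

print(f"Raw correlation:     {raw_corr:.3f}")
print(f"Partial correlation: {partial_corr:.3f}")
```

Controlling for the confounder works here only because temperature is observed and its effect is linear; with unobserved confounders this adjustment is not available, which is what motivates the designs in this chapter.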

Real-World Case: Minimum Wage and Employment

Policy Question: Does raising minimum wage reduce employment?

Problem with Traditional Regression:

python

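import statsmodels.formula.api as smf

# ❌ Naive regression: the min_wage coefficient suffers from severe endogeneity
# df is assumed to be a state-level DataFrame containing both columns
model = smf.ols("employment_rate ~ min_wage", data=df).fit()
# The estimate cannot separate the causal effect from selection bias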

Issues:

States that raise minimum wage may have stronger economies (selection bias)
States with high unemployment may be more likely to raise minimum wage, so the outcome influences the policy (reverse causality)
Other policies implemented simultaneously (confounding factors)

Counterfactual Framework Solution:

Use Difference-in-Differences (DID) or RCT to identify causal effects
Construct counterfactual control groups
Eliminate selection bias and confounding factors
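
To make the DID idea concrete, here is a minimal 2x2 sketch on simulated data. Everything in it is an assumption for illustration: the column names, the baseline gap between groups, the common time trend, and `true_effect` are invented, and the estimate is valid only because the simulation satisfies parallel trends by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 4_000
true_effect = -1.0  # hypothetical effect, chosen for illustration

df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # 1 = state raised the minimum wage
    "post": rng.integers(0, 2, n),     # 1 = observation after the policy
})
df["employment"] = (
    60.0
    + 3.0 * df["treated"]                       # treated states differ at baseline
    + 2.0 * df["post"]                          # common time trend
    + true_effect * df["treated"] * df["post"]  # effect only for treated, post
    + rng.normal(0, 1, n)
)

# Cell means: rows are treated (0/1), columns are post (0/1)
m = df.groupby(["treated", "post"])["employment"].mean().unstack("post")

# Naive post-period comparison mixes the effect with the baseline gap
naive = m.loc[1, 1] - m.loc[0, 1]

# DID: change in the treated group minus change in the control group
did = (m.loc[1, 1] - m.loc[1, 0]) - (m.loc[0, 1] - m.loc[0, 0])

print(f"Naive post-period difference: {naive:.2f}")
print(f"DID estimate:                 {did:.2f}")
```

The control group's change over time stands in for the counterfactual change the treated group would have experienced without the policy, which is why the baseline gap and the common trend both cancel.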

Core Content of This Chapter

Section 1: Potential Outcomes Framework

Core Idea: The essence of causal inference is comparing outcomes for the same individual under different treatment states

Rubin Causal Model (RCM)
Definition of potential outcomes: Yi(1) vs Yi(0)
Fundamental problem: We can never observe both states simultaneously
Definition of causal effect: τi = Yi(1) - Yi(0)

Case: Causal Effect of Education Training

Individual i:
- Yi(1) = Income after attending training
- Yi(0) = Income without attending training (counterfactual)
- Causal effect τi = Yi(1) - Yi(0)

Problem: We can only observe one outcome!
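
The missing-data nature of this problem is easy to see in a toy table. The sketch below assumes six hypothetical individuals with invented incomes and a constant effect of 200; the names `science` and `observed` are illustrative only. The full table of potential outcomes exists only in a simulation, while the observed table always has one of the two outcomes missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 6

y0 = rng.normal(3000, 300, n).round()  # income without training
y1 = y0 + 200                          # income with training (hypothetical effect)
d = np.array([1, 0, 1, 0, 1, 0])       # who actually attended training

# What we would need to know: both potential outcomes for everyone
science = pd.DataFrame({"D": d, "Y0": y0, "Y1": y1})
science["tau"] = science["Y1"] - science["Y0"]

# What we actually see: the counterfactual is missing for every unit
observed = pd.DataFrame({
    "D": d,
    "Y0": np.where(d == 0, y0, np.nan),
    "Y1": np.where(d == 1, y1, np.nan),
    "Y_obs": np.where(d == 1, y1, y0),
})

print(science)
print(observed)
```

Every row of `observed` contains exactly one missing potential outcome, so no individual τi can be computed from real data.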

Section 2: Randomized Controlled Trials (RCTs)

Why is RCT the Gold Standard?

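An RCT cannot reveal any individual's counterfactual, but randomization works around the fundamental problem of causal inference at the group level, so that average effects can be identified: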

Eliminates selection bias
Balances confounding variables
Makes treatment and control groups comparable

Core Mechanism of RCT:

Random assignment: Di ~ Bernoulli(0.5)
- Di = 1 → Treatment group
- Di = 0 → Control group

Key property: E[Yi(0)|Di=1] = E[Yi(0)|Di=0]
i.e., Without treatment, both groups have the same average outcome
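
The key property can be illustrated by comparing self-selection with random assignment. This is a sketch under stated assumptions: a constant effect `tau`, a made-up logistic selection rule in which units with higher Y(0) are more likely to opt in, and helper names that are hypothetical. With a constant effect, the naive difference in means equals the true effect plus the selection-bias term E[Yi(0)|Di=1] - E[Yi(0)|Di=0].

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
tau = 5.0  # constant treatment effect (assumed for illustration)

y0 = rng.normal(75, 15, n)
y1 = y0 + tau


def naive_difference(d):
    """Observed difference in means between treated and untreated units."""
    y_obs = np.where(d == 1, y1, y0)
    return y_obs[d == 1].mean() - y_obs[d == 0].mean()


def selection_bias(d):
    """E[Y(0) | D=1] - E[Y(0) | D=0], computable only in a simulation."""
    return y0[d == 1].mean() - y0[d == 0].mean()


# Self-selection: units with higher Y(0) are more likely to take treatment
p_self = 1 / (1 + np.exp(-(y0 - 75) / 15))
d_self = rng.binomial(1, p_self)

# Randomization: assignment is independent of the potential outcomes
d_rand = rng.binomial(1, 0.5, n)

for label, d in [("self-selected", d_self), ("randomized", d_rand)]:
    print(f"{label:>13}: naive = {naive_difference(d):6.2f}, "
          f"selection bias = {selection_bias(d):6.2f}")
```

Under self-selection the naive estimate is inflated by the bias term; under randomization the bias term is close to zero and the naive difference is close to `tau`.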

Experimental Designs:

Simple Randomization
Stratified Randomization
Matched-Pair Randomization
Cluster Randomization
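
Of the designs listed, stratified randomization is perhaps the easiest to sketch in code. The example below is one possible implementation, not a prescribed one: the strata labels, the sample sizes, and the helper `assign_within_stratum` are hypothetical. It assigns exactly half of each stratum to treatment, which guarantees balance on the stratifying variable instead of leaving it to chance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical sample: 200 units in two strata of unequal size
units = pd.DataFrame({
    "unit_id": np.arange(200),
    "stratum": np.repeat(["urban", "rural"], [140, 60]),
})


def assign_within_stratum(group_size, rng):
    """Assign exactly half of a stratum (rounded down) to treatment."""
    assignment = np.zeros(group_size, dtype=int)
    assignment[: group_size // 2] = 1
    return rng.permutation(assignment)


# Randomize separately inside each stratum
units["treatment"] = 0
for _, idx in units.groupby("stratum").groups.items():
    units.loc[idx, "treatment"] = assign_within_stratum(len(idx), rng)

print(units.groupby("stratum")["treatment"].agg(["size", "sum", "mean"]))
```

Simple randomization with `Di ~ Bernoulli(0.5)` would balance the strata only on average; in a small sample, one stratum could end up mostly treated by chance.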

Section 3: Average Treatment Effects

Core Concepts:

Effect Type	Full Name	Interpretation
ATE	Average Treatment Effect	Population-level average causal effect
ATT	Average Treatment Effect on the Treated	Average effect for treatment group
ATU	Average Treatment Effect on the Untreated	Average effect for control group
LATE	Local Average Treatment Effect	Local effect for compliers
CATE	Conditional Average Treatment Effect	Conditional average effect (heterogeneity)

Mathematical Definitions:

ATE = E[Yi(1) - Yi(0)]
    = E[Yi(1)] - E[Yi(0)]

ATT = E[Yi(1) - Yi(0) | Di = 1]
    = E[Yi(1) | Di = 1] - E[Yi(0) | Di = 1]
                          ^^^^^^^^^^^^^^
                          (Counterfactual, unobservable)

Advantage of RCT:

Under an RCT with full compliance: ATE = ATT = ATU (in expectation)
The simple difference in means is an unbiased estimator of the ATE
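
The equality of ATE, ATT, and ATU under randomization can be checked numerically. The sketch below assumes heterogeneous individual effects drawn from an invented distribution and a hypothetical "selection on gains" rule for contrast; the function `effects` is illustrative. Because this is a simulation, the individual effects are known and the three averages can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

y0 = rng.normal(75, 15, n)
tau_i = rng.normal(5, 3, n)  # heterogeneous individual effects (hypothetical)
y1 = y0 + tau_i


def effects(d):
    """True ATE/ATT/ATU for assignment d, plus the difference in means."""
    return {
        "ATE": tau_i.mean(),
        "ATT": tau_i[d == 1].mean(),
        "ATU": tau_i[d == 0].mean(),
        "diff_in_means": y1[d == 1].mean() - y0[d == 0].mean(),
    }


# Random assignment: treatment is independent of the individual effect
d_rand = rng.binomial(1, 0.5, n)

# Selection on gains: units with larger effects tend to take treatment
d_gain = (tau_i + rng.normal(0, 3, n) > 5).astype(int)

for label, d in [("randomized", d_rand), ("selection on gains", d_gain)]:
    summary = ", ".join(f"{k} = {v:.2f}" for k, v in effects(d).items())
    print(f"{label}: {summary}")
```

With random assignment the three averages coincide up to sampling noise and the difference in means recovers them. When units select on their own gains, ATT exceeds ATU, so the question of which effect is being estimated starts to matter.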

Section 4: Identification Strategies and Validity

Internal Validity:

Whether causal inference is correct within the study sample
Threats:
- Selection Bias
- Confounding
- Contemporaneous Events
- Attrition

External Validity:

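Whether causal effects generalize to other populations, settings, and time periods

SUTVA Assumption (Stable Unit Treatment Value Assumption):

Required for the potential outcomes Yi(1) and Yi(0) to be well defined; a violation threatens identification itself, not only generalization
- No spillover effects (no interference between units)
- Treatment consistency (no hidden versions of the treatment)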

Identification Strategy Comparison:

Method	Randomness Source	Internal Validity	External Validity	Implementation Difficulty
RCT	Random assignment	⭐⭐⭐⭐⭐	⭐⭐⭐	High
DID	Exogenous policy shock	⭐⭐⭐⭐	⭐⭐⭐⭐	Medium
RDD	Randomness near cutoff	⭐⭐⭐⭐	⭐⭐	Medium
IV	Instrumental variable	⭐⭐⭐	⭐⭐⭐	High
PSM	Conditional independence	⭐⭐	⭐⭐⭐	Low

Section 5: Python Practice - Complete RCT Analysis Workflow

Complete Case: A/B Testing for Online Education Platform

python

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm

# 1. Data generation (simulate RCT)
np.random.seed(42)
n = 1000

# Random assignment
treatment = np.random.binomial(1, 0.5, n)

# Potential outcomes
Y0 = np.random.normal(75, 15, n)  # Control group scores
tau = 5  # True causal effect
Y1 = Y0 + tau + np.random.normal(0, 2, n)

# Observed outcome (fundamental problem: only observe one)
Y_obs = treatment * Y1 + (1 - treatment) * Y0

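# 2. Balance check (possible on Y0 only in a simulation; with real data,
#    compare pre-treatment covariates, since Y0 is unobserved for treated units)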
balance_test = stats.ttest_ind(
    Y0[treatment == 1],
    Y0[treatment == 0]
)
print(f"Balance test p-value: {balance_test.pvalue:.4f}")

# 3. ATE estimation (simple difference)
ATE_simple = Y_obs[treatment == 1].mean() - Y_obs[treatment == 0].mean()
print(f"ATE (simple difference): {ATE_simple:.2f}")

# 4. Regression estimation (heteroskedasticity-robust standard errors)
X = sm.add_constant(treatment)
model = sm.OLS(Y_obs, X).fit(cov_type='HC3')
print(model.summary())

# 5. Heterogeneity analysis (CATE)
# Stratify by student baseline (Y0 is fully known only because this is a simulation)
baseline = Y0
high_baseline = baseline > np.median(baseline)  # NumPy arrays have no .median() method
treated = treatment == 1  # boolean mask; avoids the `==` vs `&` precedence trap

CATE_high = (Y_obs[treated & high_baseline].mean() -
             Y_obs[~treated & high_baseline].mean())
CATE_low = (Y_obs[treated & ~high_baseline].mean() -
            Y_obs[~treated & ~high_baseline].mean())

print(f"High baseline students CATE: {CATE_high:.2f}")
print(f"Low baseline students CATE: {CATE_low:.2f}")
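
With real data, Y0 is not observed for treated units, so heterogeneity has to be analyzed along a pre-treatment covariate. The sketch below shows one common way to do this, an OLS regression with a treatment-by-subgroup interaction and HC3 standard errors. The variable names (`pretest`, `high_pretest`, `score`) and the subgroup effects of 3 and 7 points are hypothetical values chosen for the simulation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 10_000

df = pd.DataFrame({
    "treatment": rng.binomial(1, 0.5, n),
    "pretest": rng.normal(70, 10, n),  # hypothetical pre-treatment score
})
df["high_pretest"] = (df["pretest"] > df["pretest"].median()).astype(int)

# Hypothetical effects: 3 points for low-pretest, 7 for high-pretest students
effect = np.where(df["high_pretest"] == 1, 7.0, 3.0)
df["score"] = (
    20 + 0.8 * df["pretest"] + effect * df["treatment"] + rng.normal(0, 5, n)
)

# treatment * high_pretest expands to both main effects plus the interaction
model = smf.ols("score ~ treatment * high_pretest", data=df).fit(cov_type="HC3")

cate_low = model.params["treatment"]
cate_high = cate_low + model.params["treatment:high_pretest"]

print(f"CATE (low pretest):  {cate_low:.2f}")
print(f"CATE (high pretest): {cate_high:.2f}")
print(f"Interaction p-value: {model.pvalues['treatment:high_pretest']:.4f}")
```

The interaction coefficient is the difference between the two subgroup effects, so its p-value gives a direct test of whether the effect differs across subgroups, which two separate subgroup estimates do not provide.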

Learning Objectives

After completing this chapter, you will be able to:

Capability	Specific Goals
Conceptual Understanding	✅ Understand potential outcomes framework and counterfactual logic
	✅ Master core challenges of causal inference (selection bias, confounding)
	✅ Understand why RCT is the gold standard
Technical Mastery	✅ Design and analyze RCT experiments
	✅ Distinguish between ATE, ATT, LATE, and other effects
	✅ Conduct balance checks and validity diagnostics
Practical Skills	✅ Implement complete RCT analysis using Python
	✅ Conduct heterogeneity analysis (CATE)
	✅ Correctly interpret and report causal effects

Learning Roadmap

Week 1: Introduction to Counterfactual Thinking
├─ Understand potential outcomes framework
├─ Fundamental problem of causal inference
└─ Simple case analysis

Week 2: RCT Theory and Design
├─ The magic of randomization
├─ Types of experimental designs
└─ Balance and validity

Week 3: Effect Estimation and Inference
├─ Differences between ATE/ATT/LATE
├─ Standard errors and hypothesis testing
└─ Heterogeneity analysis

Week 4: Python Practice
├─ Data generation and simulation
├─ Complete analysis workflow
└─ Results visualization and reporting

Connections to Other Modules

Prerequisites (from Python Fundamentals)

Module 3: Basic syntax (conditional statements, loops)
Module 4: Data structures (lists, dictionaries, DataFrame)
Module 5: Functions and modules
Module 9: NumPy, Pandas, visualization

Subsequent Applications

Module 3: Data cleaning and variable construction (preparing data for causal analysis)
Module 6: OLS regression (regression with control variables)
Module 8: Econometrics (IV, DID, RDD, and quasi-experiments)
Module 10: Causal inference models (DoWhy, CausalML)

Study Recommendations

✅ DO (Recommended Practices)

Start with examples: Every concept needs concrete cases
Comparative thinking: Distinguish correlation vs causation
Hands-on practice: Run Python code, modify parameters to observe changes
Draw diagrams: DAGs (Directed Acyclic Graphs) are the best tool for understanding causality

❌ DON'T (Common Pitfalls)

Don't memorize formulas: Understanding concepts is more important than memorization
Don't skip balance checks: They are the main diagnostic that randomization worked as intended
Don't over-interpret: Causal effects have boundary conditions (SUTVA)
Don't ignore standard errors: Statistical inference is as important as point estimates

Chapter Datasets

We will use the following real/simulated datasets:

Dataset	Description	Source	Sample Size
STAR Project	Tennessee class-size reduction experiment	Real RCT	11,600
Progresa/Oportunidades	Mexico conditional cash transfer program	Real RCT	506 villages
Online Education A/B Test	Online course RCT simulated data	Simulated	1,000
Job Training RCT	Employment training experiment	Simulated	2,000

Self-Assessment Questions (Before Starting)

Before studying this chapter, test your understanding:

Conceptual question: What is a counterfactual? Why is it the core of causal inference?
Case question: Research finds "students who use social media have lower grades." Does this prove social media causes lower grades? Why or why not?
Design question: To study "the causal effect of remote work on employee productivity," how would you design an RCT?

Answer hints:

If you can clearly answer question 1, you already have causal inference thinking
If question 2 confuses you, this chapter will help build a rigorous causal reasoning framework
If question 3 is difficult, this chapter will teach you the complete RCT design process

Are You Ready?

The counterfactual framework and RCT are the foundation of modern causal inference, and the most credible source of causal evidence in economics, sociology, medicine, and other fields.

After mastering this chapter, you will be able to:

✅ Establish rigorous causal thinking
✅ Understand core principles of experimental design
✅ Independently analyze RCT data
✅ Build a foundation for learning advanced quasi-experimental methods

Let's begin! 🚀

Chapter File List

module-2_Counter Factual and RCTs/
├── 00-Chapter Introduction.md              # This file
├── 01-potential-outcomes-framework.md      # Potential outcomes framework
├── 02-randomized-controlled-trials.md      # RCT principles and design
├── 03-average-treatment-effects.md         # Average treatment effects
├── 04-identification-strategies.md         # Identification strategies and validity
└── 05-practical-implementation.md          # Python practice

Estimated Study Time: 12-16 hours
Difficulty Level: ⭐⭐⭐⭐ (Requires strong abstract thinking)
Practicality: ⭐⭐⭐⭐⭐ (Required course in modern causal inference)

Next Section: 01 - Potential Outcomes Framework

Let the causal inference journey begin! 🎯
