1.4 Logit Regression: Binary Dependent Variable Models
"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."— H.G. Wells, Writer & Futurist
How do you model an outcome that is simply "yes" or "no"?
Section Objectives
- Understand application scenarios of Logit regression
- Conduct Logit regression using statsmodels
- Interpret Logit regression coefficients and marginal effects
- Compare with Stata/R
When to Use Logit Regression?
When the dependent variable is binary (0/1), you cannot use OLS—you need Logit or Probit.
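What makes Logit work is the logistic (sigmoid) link: it squashes any linear index z = Xβ into a probability strictly between 0 and 1. A minimal sketch (the sigmoid helper is defined here for illustration, it is not a library function):

```python
import numpy as np

def sigmoid(z):
    """Logistic link: maps any real z into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Even extreme values of z yield valid probabilities
print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.00005, 0.5, ~0.99995
```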
Typical Application Scenarios
| Research Question | Dependent Variable | Example Independent Variables |
|---|---|---|
| College attendance | Attend=1, Not=0 | Family income, parental education |
| Employment | Employed=1, Unemployed=0 | Education, experience, gender |
| Default | Default=1, Not=0 | Credit score, income, debt ratio |
| Voting | Vote=1, Not=0 | Age, education, income |
| Illness | Sick=1, Healthy=0 | Age, BMI, smoking history |
Case Study: Factors Affecting College Admission
Research question: What factors influence whether a student is admitted to college?
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import logit
# Generate simulated data: 500 applicants
np.random.seed(2024)
n = 500
data = pd.DataFrame({
    'gpa': np.random.uniform(2.0, 4.0, n),           # GPA score
    'sat': np.random.randint(800, 1600, n),          # SAT score
    'extracurricular': np.random.randint(0, 10, n),  # Number of extracurricular activities
    'income': np.random.uniform(20, 150, n)          # Family income (thousands)
})
# Construct admission probability (logit model)
z = -8 + 1.5 * data['gpa'] + 0.003 * data['sat'] + 0.1 * data['extracurricular'] + 0.01 * data['income']
prob = 1 / (1 + np.exp(-z))
data['admitted'] = (np.random.uniform(0, 1, n) < prob).astype(int)
print(data.head(10))
print(f"\nAdmission rate: {data['admitted'].mean():.2%}")Logit Regression: Python Implementation
Method 1: Using sm.Logit() (Recommended)
```python
import statsmodels.api as sm
# Prepare data
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']
# Fit Logit model
logit_model = sm.Logit(y, X).fit()
# View results
print(logit_model.summary())
```
Output:
```
                           Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                  500
Model:                          Logit   Df Residuals:                      495
Method:                           MLE   Df Model:                            4
Date:                             ...   Pseudo R-squ.:                   0.524
Time:                             ...   Log-Likelihood:                -178.23
converged:                       True   LL-Null:                       -374.56
Covariance Type:            nonrobust   LLR p-value:                  1.23e-82
==================================================================================
                     coef   std err         z     P>|z|    [0.025    0.975]
----------------------------------------------------------------------------------
const              -7.823     0.987    -7.926     0.000    -9.758    -5.888
gpa                 1.485     0.198     7.500     0.000     1.097     1.873
sat                 0.0029    0.001     3.625     0.000     0.001     0.005
extracurricular     0.098     0.034     2.882     0.004     0.031     0.165
income              0.0095    0.004     2.375     0.018     0.002     0.017
==================================================================================
```
Interpretation:
- GPA: Coefficient 1.485, p < 0.001, significant positive effect on admission
- SAT: Coefficient 0.0029, p < 0.001, significant positive effect
- Extracurricular: Coefficient 0.098, p < 0.01, significant positive effect
- Family Income: Coefficient 0.0095, p < 0.05, significant at the 5% level
- Pseudo R² = 0.524, good model fit (these fit statistics can also be read directly off the results object, as sketched below)
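The fit statistics reported in the summary header are available as attributes of the fitted results; prsquared, llr, and llr_pvalue are the relevant statsmodels names:

```python
# Fit statistics straight from the results object
print(f"Pseudo R²:    {logit_model.prsquared:.3f}")
print(f"LR statistic: {logit_model.llr:.2f}")       # 2 * (LL - LL_null)
print(f"LLR p-value:  {logit_model.llr_pvalue:.3g}")
```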
Method 2: Using Formula Interface (Similar to R)
```python
from statsmodels.formula.api import logit
# Use R-style formula
logit_model = logit('admitted ~ gpa + sat + extracurricular + income', data=data).fit()
print(logit_model.summary())
```
Three-Language Comparison
Python (statsmodels)
```python
import statsmodels.api as sm
# Prepare data
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']
# Fit model
model = sm.Logit(y, X).fit()
# View results
print(model.summary())
# Predict probabilities
prob = model.predict(X)
```
Stata
```stata
* Load data
use admission_data.dta, clear
* Logit regression
logit admitted gpa sat extracurricular income
* View marginal effects
margins, dydx(*)
* Predict probabilities
predict prob
```
R
```r
# Load data
data <- read.csv("admission_data.csv")
# Logit regression
model <- glm(admitted ~ gpa + sat + extracurricular + income,
             data = data,
             family = binomial(link = "logit"))
# View results
summary(model)
# Predict probabilities
prob <- predict(model, type = "response")
```
Interpreting Logit Coefficients
Meaning of Coefficients
Logit regression coefficients are not marginal effects, but changes in log-odds.
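A quick numerical check of that statement (a sketch: for statsmodels discrete models, predict() with no arguments returns the fitted probabilities, and fittedvalues holds the linear index Xβ):

```python
# The fitted log-odds equal the linear index X·beta exactly
p = logit_model.predict()        # fitted probabilities
log_odds = np.log(p / (1 - p))
print(np.allclose(log_odds, logit_model.fittedvalues))  # True: log-odds are linear in X
```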
```python
# GPA coefficient is 1.485, meaning:
# Each 1-point increase in GPA increases log-odds by 1.485
# Equivalent to: odds multiply by exp(1.485) = 4.42
print(f"GPA coefficient: {logit_model.params['gpa']:.3f}")
print(f"Odds Ratio: {np.exp(logit_model.params['gpa']):.3f}")Interpretation:
- Each 1-point increase in GPA increases admission odds by a factor of 4.42
- This is a very strong effect!
Get Odds Ratios
```python
# Odds Ratios for all variables
odds_ratios = np.exp(logit_model.params)
print(odds_ratios)
'''
const              0.000
gpa                4.416
sat                1.003
extracurricular    1.103
income             1.010
'''
```
Interpretation:
- GPA: Each 1-point increase multiplies admission odds by 4.42
- SAT: Each 1-point increase multiplies admission odds by 1.003 (small effect)
- Extracurricular: Each additional activity multiplies admission odds by 1.10
- Income: Each additional $1,000 multiplies admission odds by 1.01 (see the worked check below)
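To see the odds-ratio interpretation in action, compare two hypothetical applicants who differ only in GPA. The profile below (SAT 1200, 5 activities, income 80) is illustrative, not drawn from the data:

```python
# Two applicants identical except for GPA (3.0 vs 4.0)
pair = pd.DataFrame({'const': [1, 1], 'gpa': [3.0, 4.0], 'sat': [1200, 1200],
                     'extracurricular': [5, 5], 'income': [80, 80]})
p = logit_model.predict(pair)
odds = p / (1 - p)
print(odds[1] / odds[0])   # ≈ exp(1.485) ≈ 4.42, the GPA odds ratio
```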
Marginal Effects
Marginal effects are easier to interpret: how much does the admission probability change when an independent variable increases by 1 unit?
```python
# Calculate Average Marginal Effects (AME)
marginal_effects = logit_model.get_margeff()
print(marginal_effects.summary())
```
Output:
```
       Logit Marginal Effects
=====================================
Dep. Variable:               admitted
Method:                          dydx
At:                           overall
=============================================================
                    dy/dx    std err          z      P>|z|
-------------------------------------------------------------
gpa                 0.298      0.039      7.641      0.000
sat                 0.001      0.000      3.702      0.000
extracurricular     0.020      0.007      2.857      0.004
income              0.002      0.001      2.344      0.019
-------------------------------------------------------------
```
Interpretation:
- GPA: Each 1-point increase raises admission probability by 29.8 percentage points on average
- SAT: Each 1-point increase raises admission probability by 0.1 percentage points
- Extracurricular: Each additional activity raises admission probability by 2.0 percentage points
- Income: Each additional $1,000 raises admission probability by 0.2 percentage points (a variant evaluated at the regressor means is sketched below)
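get_margeff() averages effects over the sample by default (the AME reported above). To evaluate effects at the means of the regressors instead (MEM), statsmodels accepts at='mean':

```python
# Marginal effects at the regressor means (MEM) rather than averaged (AME)
mem = logit_model.get_margeff(at='mean')
print(mem.summary())
```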
Prediction
Predict Probabilities
```python
# Predict admission probabilities for all samples
predicted_prob = logit_model.predict(X)
print(predicted_prob[:10])
# Add to dataframe
data['predicted_prob'] = predicted_prob
print(data[['gpa', 'sat', 'admitted', 'predicted_prob']].head())
```
Predict New Student
```python
# New applicant: GPA=3.5, SAT=1200, 5 extracurricular activities, income=80k
new_student = pd.DataFrame({
    'const': [1],
    'gpa': [3.5],
    'sat': [1200],
    'extracurricular': [5],
    'income': [80]
})
prob = logit_model.predict(new_student)
print(f"Admission probability: {prob.values[0]:.2%}")
# Output: Admission probability: 68.34%
```
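If you fitted via the formula interface from Method 2, the intercept is handled automatically, so the new applicant needs no 'const' column (a small convenience sketch):

```python
from statsmodels.formula.api import logit

# Refit with the formula API; the intercept is added automatically
f_model = logit('admitted ~ gpa + sat + extracurricular + income', data=data).fit()
new_student = pd.DataFrame({'gpa': [3.5], 'sat': [1200],
                            'extracurricular': [5], 'income': [80]})
print(f"Admission probability: {f_model.predict(new_student)[0]:.2%}")
```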
Model Evaluation
Confusion Matrix
```python
from sklearn.metrics import confusion_matrix, classification_report
# Predict class (probability > 0.5 predicts admission)
predicted_class = (predicted_prob > 0.5).astype(int)
# Confusion matrix
cm = confusion_matrix(data['admitted'], predicted_class)
print("Confusion Matrix:")
print(cm)
'''
[[210  35]
 [ 28 227]]
'''
print("\nClassification Report:")
print(classification_report(data['admitted'], predicted_class))
```
Output:
```
              precision    recall  f1-score   support

           0       0.88      0.86      0.87       245
           1       0.87      0.89      0.88       255

    accuracy                           0.87       500
```
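As a sanity check, the overall accuracy can be computed straight from the confusion matrix above:

```python
# Accuracy = correct predictions / total = (210 + 227) / 500 ≈ 0.874
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(f"Accuracy: {accuracy:.3f}")
```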
Logit vs OLS
Why Not Use OLS?
```python
# ❌ Wrong: Using OLS for binary variable
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']
ols_model = sm.OLS(y, X).fit()
ols_pred = ols_model.predict(X)
# Problem: Predictions may fall outside [0, 1] range
print(f"OLS prediction minimum: {ols_pred.min():.3f}") # May be < 0
print(f"OLS prediction maximum: {ols_pred.max():.3f}") # May be > 1Logit's Advantage
```python
# ✅ Correct: Logit guarantees predictions in [0, 1]
logit_pred = logit_model.predict(X)
print(f"Logit prediction minimum: {logit_pred.min():.3f}") # Always ≥ 0
print(f"Logit prediction maximum: {logit_pred.max():.3f}") # Always ≤ 1Practice Exercise
Practice Exercise
Complete code: Studying employment determinants
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Generate data
np.random.seed(42)
n = 800
data = pd.DataFrame({
    'education': np.random.randint(9, 22, n),
    'age': np.random.randint(22, 60, n),
    'female': np.random.choice([0, 1], n)
})
# Construct employment probability from a logit model
z = -3 + 0.2 * data['education'] + 0.02 * data['age'] - 0.3 * data['female']
prob = 1 / (1 + np.exp(-z))
data['employed'] = (np.random.uniform(0, 1, n) < prob).astype(int)
# Logit regression
X = sm.add_constant(data[['education', 'age', 'female']])
y = data['employed']
model = sm.Logit(y, X).fit()
print(model.summary())
# Marginal effects
print("\nMarginal Effects:")
print(model.get_margeff().summary())
# Odds Ratios
print("\nOdds Ratios:")
print(np.exp(model.params))
```
Key Takeaways
| Content | OLS Regression | Logit Regression |
|---|---|---|
| Dependent Variable | Continuous | Binary (0/1) |
| Prediction Range | (-∞, +∞) | [0, 1] |
| Coefficient Interpretation | Marginal effect | Change in log-odds |
| More Intuitive Interpretation | Coefficient itself | Odds Ratios or marginal effects |
| Python Command | sm.OLS() | sm.Logit() |
Next Steps
- Article 04: summary_col() - Elegantly Displaying Multiple Model Comparisons
🎉 You've mastered Logit regression in Python!