1.4 Logit Regression: Binary Dependent Variable Models
"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."— H.G. Wells, Writer & Futurist
How do you model an outcome that is simply "yes" or "no"?
Section Objectives
- Understand application scenarios of Logit regression
- Conduct Logit regression using statsmodels
- Interpret Logit regression coefficients and marginal effects
- Compare with Stata/R
When to Use Logit Regression?
When the dependent variable is binary (0/1), you cannot use OLS—you need Logit or Probit.
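What makes Logit work is the logistic (sigmoid) link: it squashes any linear index z = Xβ into a probability strictly between 0 and 1. A minimal sketch (the sigmoid helper is defined here for illustration, it is not a library function):

```python
import numpy as np

def sigmoid(z):
    """Logistic link: maps any real z into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Even extreme values of z yield valid probabilities
print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.00005, 0.5, ~0.99995
```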
Typical Application Scenarios
| Research Question | Dependent Variable | Example Independent Variables |
|---|---|---|
| College attendance | Attend=1, Not=0 | Family income, parental education |
| Employment | Employed=1, Unemployed=0 | Education, experience, gender |
| Default | Default=1, Not=0 | Credit score, income, debt ratio |
| Voting | Vote=1, Not=0 | Age, education, income |
| Illness | Sick=1, Healthy=0 | Age, BMI, smoking history |
Case Study: Factors Affecting College Admission
Research question: What factors influence whether a student is admitted to college?
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import logit
# Generate simulated data: 500 applicants
np.random.seed(2024)
n = 500
data = pd.DataFrame({
    'gpa': np.random.uniform(2.0, 4.0, n),           # GPA score
    'sat': np.random.randint(800, 1600, n),          # SAT score
    'extracurricular': np.random.randint(0, 10, n),  # Number of extracurricular activities
    'income': np.random.uniform(20, 150, n)          # Family income (thousands)
})
# Construct admission probability (logit model)
z = -8 + 1.5 * data['gpa'] + 0.003 * data['sat'] + 0.1 * data['extracurricular'] + 0.01 * data['income']
prob = 1 / (1 + np.exp(-z))
data['admitted'] = (np.random.uniform(0, 1, n) < prob).astype(int)
print(data.head(10))
print(f"\nAdmission rate: {data['admitted'].mean():.2%}")Logit Regression: Python Implementation
Method 1: Using sm.Logit() (Recommended)
```python
import statsmodels.api as sm
# Prepare data
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']
# Fit Logit model
logit_model = sm.Logit(y, X).fit()
# View results
print(logit_model.summary())
```
Output:
```
                           Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                  500
Model:                          Logit   Df Residuals:                      495
Method:                           MLE   Df Model:                            4
Date:                             ...   Pseudo R-squ.:                   0.524
Time:                             ...   Log-Likelihood:                -178.23
converged:                       True   LL-Null:                       -374.56
Covariance Type:            nonrobust   LLR p-value:                  1.23e-82
==================================================================================
                     coef   std err         z     P>|z|    [0.025    0.975]
----------------------------------------------------------------------------------
const              -7.823     0.987    -7.926     0.000    -9.758    -5.888
gpa                 1.485     0.198     7.500     0.000     1.097     1.873
sat                 0.0029    0.001     3.625     0.000     0.001     0.005
extracurricular     0.098     0.034     2.882     0.004     0.031     0.165
income              0.0095    0.004     2.375     0.018     0.002     0.017
==================================================================================
```
Interpretation:
- GPA: Coefficient 1.485, p < 0.001, significant positive effect on admission
- SAT: Coefficient 0.0029, p < 0.001, significant positive effect
- Extracurricular: Coefficient 0.098, p < 0.01, significant positive effect
- Family Income: Coefficient 0.0095, p < 0.05, significant at the 5% level
- Pseudo R² = 0.524, good model fit (these fit statistics can also be read directly off the results object, as sketched below)
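The fit statistics reported in the summary header are available as attributes of the fitted results; prsquared, llr, and llr_pvalue are the relevant statsmodels names:

```python
# Fit statistics straight from the results object
print(f"Pseudo R²:    {logit_model.prsquared:.3f}")
print(f"LR statistic: {logit_model.llr:.2f}")       # 2 * (LL - LL_null)
print(f"LLR p-value:  {logit_model.llr_pvalue:.3g}")
```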
Method 2: Using Formula Interface (Similar to R)
```python
from statsmodels.formula.api import logit
# Use R-style formula
logit_model = logit('admitted ~ gpa + sat + extracurricular + income', data=data).fit()
print(logit_model.summary())
```
Three-Language Comparison
Python (statsmodels)
```python
import statsmodels.api as sm
# Prepare data
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']
# Fit model
model = sm.Logit(y, X).fit()
# View results
print(model.summary())
# Predict probabilities
prob = model.predict(X)
```
Stata
```stata
* Load data
use admission_data.dta, clear
* Logit regression
logit admitted gpa sat extracurricular income
* View marginal effects
margins, dydx(*)
* Predict probabilities
predict prob
```
R
```r
# Load data
data <- read.csv("admission_data.csv")
# Logit regression
model <- glm(admitted ~ gpa + sat + extracurricular + income,
             data = data,
             family = binomial(link = "logit"))
# View results
summary(model)
# Predict probabilities
prob <- predict(model, type = "response")
```
Interpreting Logit Coefficients
Meaning of Coefficients
Logit regression coefficients are not marginal effects, but changes in log-odds.
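A quick numerical check of that statement (a sketch: for statsmodels discrete models, predict() with no arguments returns the fitted probabilities, and fittedvalues holds the linear index Xβ):

```python
# The fitted log-odds equal the linear index X·beta exactly
p = logit_model.predict()        # fitted probabilities
log_odds = np.log(p / (1 - p))
print(np.allclose(log_odds, logit_model.fittedvalues))  # True: log-odds are linear in X
```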
```python
# GPA coefficient is 1.485, meaning:
# Each 1-point increase in GPA increases log-odds by 1.485
# Equivalent to: odds multiply by exp(1.485) = 4.42
print(f"GPA coefficient: {logit_model.params['gpa']:.3f}")
print(f"Odds Ratio: {np.exp(logit_model.params['gpa']):.3f}")Interpretation:
- Each 1-point increase in GPA increases admission odds by a factor of 4.42
- This is a very strong effect!
Get Odds Ratios
```python
# Odds Ratios for all variables
odds_ratios = np.exp(logit_model.params)
print(odds_ratios)
'''
const              0.000
gpa                4.416
sat                1.003
extracurricular    1.103
income             1.010
'''
```
Interpretation:
- GPA: Each 1-point increase multiplies admission odds by 4.42
- SAT: Each 1-point increase multiplies admission odds by 1.003 (small effect)
- Extracurricular: Each additional activity multiplies admission odds by 1.10
- Income: Each additional $1,000 multiplies admission odds by 1.01 (see the worked check below)
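To see the odds-ratio interpretation in action, compare two hypothetical applicants who differ only in GPA. The profile below (SAT 1200, 5 activities, income 80) is illustrative, not drawn from the data:

```python
# Two applicants identical except for GPA (3.0 vs 4.0)
pair = pd.DataFrame({'const': [1, 1], 'gpa': [3.0, 4.0], 'sat': [1200, 1200],
                     'extracurricular': [5, 5], 'income': [80, 80]})
p = logit_model.predict(pair)
odds = p / (1 - p)
print(odds[1] / odds[0])   # ≈ exp(1.485) ≈ 4.42, the GPA odds ratio
```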
Marginal Effects
Marginal effects are easier to interpret: how much does the admission probability change when an independent variable increases by 1 unit?
```python
# Calculate Average Marginal Effects (AME)
marginal_effects = logit_model.get_margeff()
print(marginal_effects.summary())
```
Output:
```
       Logit Marginal Effects
=====================================
Dep. Variable:               admitted
Method:                          dydx
At:                           overall
=============================================================
                    dy/dx    std err          z      P>|z|
-------------------------------------------------------------
gpa                 0.298      0.039      7.641      0.000
sat                 0.001      0.000      3.702      0.000
extracurricular     0.020      0.007      2.857      0.004
income              0.002      0.001      2.344      0.019
-------------------------------------------------------------
```
Interpretation:
- GPA: Each 1-point increase raises admission probability by 29.8 percentage points on average
- SAT: Each 1-point increase raises admission probability by 0.1 percentage points
- Extracurricular: Each additional activity raises admission probability by 2.0 percentage points
- Income: Each additional $1,000 raises admission probability by 0.2 percentage points (a variant evaluated at the regressor means is sketched below)
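get_margeff() averages effects over the sample by default (the AME reported above). To evaluate effects at the means of the regressors instead (MEM), statsmodels accepts at='mean':

```python
# Marginal effects at the regressor means (MEM) rather than averaged (AME)
mem = logit_model.get_margeff(at='mean')
print(mem.summary())
```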
Prediction
Predict Probabilities
```python
# Predict admission probabilities for all samples
predicted_prob = logit_model.predict(X)
print(predicted_prob[:10])
# Add to dataframe
data['predicted_prob'] = predicted_prob
print(data[['gpa', 'sat', 'admitted', 'predicted_prob']].head())
```
Predict New Student
```python
# New applicant: GPA=3.5, SAT=1200, 5 extracurricular activities, income=80k
new_student = pd.DataFrame({
    'const': [1],
    'gpa': [3.5],
    'sat': [1200],
    'extracurricular': [5],
    'income': [80]
})
prob = logit_model.predict(new_student)
print(f"Admission probability: {prob.values[0]:.2%}")
# Output: Admission probability: 68.34%
```
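If you fitted via the formula interface from Method 2, the intercept is handled automatically, so the new applicant needs no 'const' column (a small convenience sketch):

```python
from statsmodels.formula.api import logit

# Refit with the formula API; the intercept is added automatically
f_model = logit('admitted ~ gpa + sat + extracurricular + income', data=data).fit()
new_student = pd.DataFrame({'gpa': [3.5], 'sat': [1200],
                            'extracurricular': [5], 'income': [80]})
print(f"Admission probability: {f_model.predict(new_student)[0]:.2%}")
```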
Model Evaluation
Confusion Matrix
```python
from sklearn.metrics import confusion_matrix, classification_report
# Predict class (probability > 0.5 predicts admission)
predicted_class = (predicted_prob > 0.5).astype(int)
# Confusion matrix
cm = confusion_matrix(data['admitted'], predicted_class)
print("Confusion Matrix:")
print(cm)
'''
[[210  35]
 [ 28 227]]
'''
print("\nClassification Report:")
print(classification_report(data['admitted'], predicted_class))
```
Output:
```
              precision    recall  f1-score   support

           0       0.88      0.86      0.87       245
           1       0.87      0.89      0.88       255

    accuracy                           0.87       500
```
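As a sanity check, the overall accuracy can be computed straight from the confusion matrix above:

```python
# Accuracy = correct predictions / total = (210 + 227) / 500 ≈ 0.874
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(f"Accuracy: {accuracy:.3f}")
```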
Logit vs OLS
Why Not Use OLS?
```python
# ❌ Wrong: Using OLS for binary variable
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']
ols_model = sm.OLS(y, X).fit()
ols_pred = ols_model.predict(X)
# Problem: Predictions may fall outside [0, 1] range
print(f"OLS prediction minimum: {ols_pred.min():.3f}") # May be < 0
print(f"OLS prediction maximum: {ols_pred.max():.3f}") # May be > 1Logit's Advantage
```python
# ✅ Correct: Logit guarantees predictions in [0, 1]
logit_pred = logit_model.predict(X)
print(f"Logit prediction minimum: {logit_pred.min():.3f}") # Always ≥ 0
print(f"Logit prediction maximum: {logit_pred.max():.3f}") # Always ≤ 1Practice Exercise
Practice Exercise
Complete code: Studying employment determinants
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Generate data
np.random.seed(42)
n = 800
data = pd.DataFrame({
    'education': np.random.randint(9, 22, n),
    'age': np.random.randint(22, 60, n),
    'female': np.random.choice([0, 1], n)
})
# Construct employment probability from a logit model
z = -3 + 0.2 * data['education'] + 0.02 * data['age'] - 0.3 * data['female']
prob = 1 / (1 + np.exp(-z))
data['employed'] = (np.random.uniform(0, 1, n) < prob).astype(int)
# Logit regression
X = sm.add_constant(data[['education', 'age', 'female']])
y = data['employed']
model = sm.Logit(y, X).fit()
print(model.summary())
# Marginal effects
print("\nMarginal Effects:")
print(model.get_margeff().summary())
# Odds Ratios
print("\nOdds Ratios:")
print(np.exp(model.params))
```
Key Takeaways
| Content | OLS Regression | Logit Regression |
|---|---|---|
| Dependent Variable | Continuous | Binary (0/1) |
| Prediction Range | (-∞, +∞) | [0, 1] |
| Coefficient Interpretation | Marginal effect | Change in log-odds |
| More Intuitive Interpretation | Coefficient itself | Odds Ratios or marginal effects |
| Python Command | sm.OLS() | sm.Logit() |
Next Steps
- Article 04: summary_col() - Elegantly Displaying Multiple Model Comparisons
🎉 You've mastered Logit regression in Python!