
1.4 Logit Regression: Binary Dependent Variable Models

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."— H.G. Wells, Writer & Futurist

How do you model an outcome that is a simple "yes" or "no"?


Section Objectives

  • Understand application scenarios of Logit regression
  • Conduct Logit regression using statsmodels
  • Interpret Logit regression coefficients and marginal effects
  • Compare with Stata/R

When to Use Logit Regression?

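When the dependent variable is binary (0/1), OLS is a poor choice: its fitted values can fall outside [0, 1], and its errors are heteroskedastic by construction. Logit or Probit is the standard alternative.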
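
As a rough sketch of why Logit suits a binary outcome: the model passes a linear index z through the logistic function, so every prediction lands strictly between 0 and 1. The helper names `sigmoid` and `log_odds` below are illustrative only and are not part of statsmodels.

```python
import math

def sigmoid(z: float) -> float:
    """Map a linear index z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of sigmoid: map a probability back to its log-odds."""
    return math.log(p / (1.0 - p))

# The probability stays inside (0, 1) however extreme the index gets
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = sigmoid(z)
    print(f"z = {z:+.1f} -> P(y=1) = {p:.3f}, log-odds = {log_odds(p):+.3f}")
```

The log-odds column reproduces z, which is why Logit coefficients are read as changes in log-odds.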

Typical Application Scenarios

| Research Question | Dependent Variable | Example Independent Variables |
|---|---|---|
| College attendance | Attend=1, Not=0 | Family income, parental education |
| Employment | Employed=1, Unemployed=0 | Education, experience, gender |
| Default | Default=1, Not=0 | Credit score, income, debt ratio |
| Voting | Vote=1, Not=0 | Age, education, income |
| Illness | Sick=1, Healthy=0 | Age, BMI, smoking history |

Case Study: Factors Affecting College Admission

Research question: What factors influence whether a student is admitted to college?

python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import logit

# Generate simulated data: 500 applicants
np.random.seed(2024)
n = 500

data = pd.DataFrame({
    'gpa': np.random.uniform(2.0, 4.0, n),          # GPA score
    'sat': np.random.randint(800, 1600, n),         # SAT score
    'extracurricular': np.random.randint(0, 10, n), # Number of extracurricular activities
    'income': np.random.uniform(20, 150, n)         # Family income (thousands)
})

# Construct admission probability (logit model)
z = -8 + 1.5 * data['gpa'] + 0.003 * data['sat'] + 0.1 * data['extracurricular'] + 0.01 * data['income']
prob = 1 / (1 + np.exp(-z))
data['admitted'] = (np.random.uniform(0, 1, n) < prob).astype(int)

print(data.head(10))
print(f"\nAdmission rate: {data['admitted'].mean():.2%}")

Logit Regression: Python Implementation

python
import statsmodels.api as sm

# Prepare data
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']

# Fit Logit model
logit_model = sm.Logit(y, X).fit()

# View results
print(logit_model.summary())

Output (illustrative; exact values depend on the simulated data):

                           Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                  500
Model:                          Logit   Df Residuals:                      495
Method:                           MLE   Df Model:                            4
Date:                ...              Pseudo R-squ.:                     0.524
Time:                        ...      Log-Likelihood:                -178.23
converged:                       True   LL-Null:                       -374.56
Covariance Type:            nonrobust   LLR p-value:                 1.23e-82
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -7.823      0.987     -7.926      0.000      -9.758      -5.888
gpa                1.485      0.198      7.500      0.000       1.097       1.873
sat                0.0029     0.001      3.625      0.000       0.001       0.005
extracurricular    0.098      0.034      2.882      0.004       0.031       0.165
income             0.0095     0.004      2.375      0.018       0.002       0.017
==================================================================================

Interpretation:

  • GPA: Coefficient 1.485, p < 0.001; significant positive effect on the log-odds of admission
  • SAT: Coefficient 0.0029, p < 0.001; significant positive effect
  • Extracurricular: Coefficient 0.098, p = 0.004; significant positive effect
  • Family Income: Coefficient 0.0095, p = 0.018; significant at the 5% level
  • Pseudo R² = 0.524 (McFadden): the model improves substantially on the intercept-only log-likelihood. Unlike OLS R², it is not a share of variance explained
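
The fit statistics in the summary can be recomputed by hand from the two log-likelihoods. The sketch below uses numbers copied from the table above rather than a fitted model, so it only illustrates the arithmetic.

```python
# Values copied from the summary table above
ll_model = -178.23   # Log-Likelihood of the fitted model
ll_null = -374.56    # LL-Null: log-likelihood of the intercept-only model

# McFadden's pseudo R-squared: relative improvement in log-likelihood
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic behind the "LLR p-value"
# (chi-squared with Df Model = 4 degrees of freedom under the null)
lr_stat = 2 * (ll_model - ll_null)

# z statistic for a coefficient: estimate divided by its standard error
z_gpa = 1.485 / 0.198

print(f"Pseudo R-squared: {pseudo_r2:.3f}")
print(f"LR statistic:     {lr_stat:.2f}")
print(f"z for gpa:        {z_gpa:.3f}")
```

On a fitted statsmodels result, the same quantities are available as the attributes `prsquared`, `llr`, and `tvalues`.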

Method 2: Using Formula Interface (Similar to R)

python
from statsmodels.formula.api import logit

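# Use an R-style formula; the intercept is added automatically (no add_constant needed)
# Stored under a separate name so logit_model (fit above) keeps its 'const' label
logit_formula_model = logit('admitted ~ gpa + sat + extracurricular + income', data=data).fit()
print(logit_formula_model.summary())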

Three-Language Comparison

Python (statsmodels)

python
import statsmodels.api as sm

# Prepare data
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']

# Fit model
model = sm.Logit(y, X).fit()

# View results
print(model.summary())

# Predict probabilities
prob = model.predict(X)

Stata

stata
* Load data
use admission_data.dta, clear

* Logit regression
logit admitted gpa sat extracurricular income

* View marginal effects
margins, dydx(*)

* Predict probabilities
predict prob

R

r
# Load data
data <- read.csv("admission_data.csv")

# Logit regression
model <- glm(admitted ~ gpa + sat + extracurricular + income,
             data = data,
             family = binomial(link = "logit"))

# View results
summary(model)

# Predict probabilities
prob <- predict(model, type = "response")

Interpreting Logit Coefficients

Meaning of Coefficients

Logit regression coefficients are not marginal effects, but changes in log-odds.

python
# GPA coefficient is 1.485, meaning:
# Each 1-point increase in GPA increases log-odds by 1.485
# Equivalent to: odds multiply by exp(1.485) = 4.42

print(f"GPA coefficient: {logit_model.params['gpa']:.3f}")
print(f"Odds Ratio: {np.exp(logit_model.params['gpa']):.3f}")

Interpretation:

  • Each 1-point increase in GPA multiplies the odds of admission by about 4.42, holding the other variables fixed
  • This is a very strong effect, but it applies to the odds, not directly to the probability
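
Because an odds ratio acts on odds, the same coefficient moves the probability by different amounts depending on where an applicant starts. The following is a minimal sketch using the GPA coefficient reported above; the function name `shift_probability` is made up for illustration.

```python
import math

def shift_probability(p: float, coef: float, delta: float = 1.0) -> float:
    """Probability after the predictor rises by `delta`, starting from probability p."""
    odds = p / (1 - p)                          # convert probability to odds
    new_odds = odds * math.exp(coef * delta)    # apply the odds ratio
    return new_odds / (1 + new_odds)            # convert back to a probability

gpa_coef = 1.485  # coefficient reported in the summary above

for p0 in (0.05, 0.20, 0.50, 0.80):
    p1 = shift_probability(p0, gpa_coef)
    print(f"start {p0:.2f} -> after +1 GPA point {p1:.3f} (change {p1 - p0:+.3f})")
```

The odds are multiplied by the same factor in every row, yet the change in probability is largest for applicants near the middle of the range.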

Get Odds Ratios

python
# Odds Ratios for all variables
odds_ratios = np.exp(logit_model.params)
print(odds_ratios)
'''
const                 0.000
gpa                   4.416
sat                   1.003
extracurricular       1.103
income                1.010
'''

Interpretation:

  • GPA: Each 1-point increase multiplies admission odds by 4.42
  • SAT: Each 1-point increase multiplies admission odds by 1.003 (small per point, but SAT scores differ by hundreds of points across applicants)
  • Extracurricular: Each additional activity multiplies admission odds by 1.10
  • Income: Each additional $1,000 multiplies admission odds by 1.01
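
Odds ratios can also be rescaled to more meaningful units and paired with confidence intervals. This sketch works from the coefficients and 95% bounds printed in the summary table above, not from a live model object.

```python
import math

# Coefficient and 95% confidence bounds copied from the summary table above
sat_coef = 0.0029
gpa_coef, gpa_low, gpa_high = 1.485, 1.097, 1.873

# Odds ratio for a k-unit change is exp(k * coefficient)
or_sat_100 = math.exp(100 * sat_coef)

# Exponentiating the interval endpoints gives an interval for the odds ratio
gpa_or_ci = (math.exp(gpa_low), math.exp(gpa_high))

print(f"Odds ratio for +100 SAT points: {or_sat_100:.3f}")
print(f"GPA odds ratio: {math.exp(gpa_coef):.2f}, "
      f"95% CI: ({gpa_or_ci[0]:.2f}, {gpa_or_ci[1]:.2f})")
```

With a fitted result, `np.exp(logit_model.conf_int())` should give the corresponding intervals for every variable at once.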

Marginal Effects

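Marginal effects are easier to interpret: by how much does the admission probability change when an independent variable increases by one unit? In a Logit model the answer depends on the starting probability, so the effects are usually averaged over the sample.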

python
# Calculate Average Marginal Effects (AME)
marginal_effects = logit_model.get_margeff()
print(marginal_effects.summary())

Output:

        Logit Marginal Effects
=====================================
Dep. Variable:               admitted
Method:                          dydx
At:                           overall
=====================================
                     dy/dx    std err          z      P>|z|
-------------------------------------------------------------
gpa                  0.298      0.039      7.641      0.000
sat                  0.001      0.000      3.702      0.000
extracurricular      0.020      0.007      2.857      0.004
income               0.002      0.001      2.344      0.019
-------------------------------------------------------------

Interpretation:

  • GPA: Each 1-point increase raises admission probability by about 29.8 percentage points on average (a local approximation, since the effect varies with the starting probability)
  • SAT: Each 1-point increase raises admission probability by about 0.1 percentage points on average
  • Extracurricular: Each additional activity raises admission probability by about 2.0 percentage points on average
  • Income: Each additional $1,000 raises admission probability by about 0.2 percentage points on average
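
For a continuous variable in a Logit model, the marginal effect at probability p is the coefficient times p(1 - p), and the AME is the average of that quantity over the sample. Here is a minimal sketch of the formula, assuming the GPA coefficient from the summary above and a small set of made-up fitted probabilities.

```python
gpa_coef = 1.485  # coefficient reported in the summary above

def marginal_effect(p: float, coef: float) -> float:
    """Logit marginal effect dP/dx = coef * p * (1 - p) at probability p."""
    return coef * p * (1 - p)

# The effect is largest at p = 0.5 and shrinks toward 0 and 1
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}: dP/dGPA = {marginal_effect(p, gpa_coef):.3f}")

# AME averages the per-observation effects; these fitted probabilities are hypothetical
fitted_probs = [0.05, 0.20, 0.45, 0.60, 0.85, 0.95]
ame = sum(marginal_effect(p, gpa_coef) for p in fitted_probs) / len(fitted_probs)
print(f"AME over the hypothetical sample: {ame:.3f}")
```

To evaluate the effect at the sample means of the regressors instead of averaging over observations, statsmodels accepts `get_margeff(at='mean')`.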

Prediction

Predict Probabilities

python
# Predict admission probabilities for all samples
predicted_prob = logit_model.predict(X)
print(predicted_prob[:10])

# Add to dataframe
data['predicted_prob'] = predicted_prob
print(data[['gpa', 'sat', 'admitted', 'predicted_prob']].head())

Predict New Student

python
# New applicant: GPA=3.5, SAT=1200, 5 extracurricular activities, income=80k
new_student = pd.DataFrame({
    'const': [1],
    'gpa': [3.5],
    'sat': [1200],
    'extracurricular': [5],
    'income': [80]
})

prob = logit_model.predict(new_student)
print(f"Admission probability: {prob.values[0]:.2%}")

# With the coefficients reported above (z ≈ 2.10), this prints a probability of about 89%

Model Evaluation

Confusion Matrix

python
from sklearn.metrics import confusion_matrix, classification_report

# Predict class (probability > 0.5 predicts admission)
predicted_class = (predicted_prob > 0.5).astype(int)

# Confusion matrix
cm = confusion_matrix(data['admitted'], predicted_class)
print("Confusion Matrix:")
print(cm)
'''
[[210  35]
 [ 28 227]]
'''

print("\nClassification Report:")
print(classification_report(data['admitted'], predicted_class))

Output:

              precision    recall  f1-score   support
           0       0.88      0.86      0.87       245
           1       0.87      0.89      0.88       255
    accuracy                           0.87       500
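
The figures in the classification report follow directly from the four cells of the confusion matrix. The sketch below recomputes the class-1 metrics by hand from the matrix shown above, assuming scikit-learn's layout of rows for actual classes and columns for predicted classes.

```python
# Confusion matrix copied from above: rows = actual, columns = predicted
cm = [[210, 35],
      [28, 227]]

tn, fp = cm[0]  # actual 0: correctly rejected, wrongly predicted admitted
fn, tp = cm[1]  # actual 1: wrongly predicted rejected, correctly admitted

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted admits that were actually admitted
recall = tp / (tp + fn)      # share of actual admits the model identified
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

All of these depend on the 0.5 cutoff; a different threshold trades precision against recall.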

Logit vs OLS

Why Not Use OLS?

python
# ❌ Problematic: OLS on a binary outcome (the linear probability model)
X = sm.add_constant(data[['gpa', 'sat', 'extracurricular', 'income']])
y = data['admitted']

ols_model = sm.OLS(y, X).fit()
ols_pred = ols_model.predict(X)

# Problem: Predictions may fall outside [0, 1] range
print(f"OLS prediction minimum: {ols_pred.min():.3f}")  # May be < 0
print(f"OLS prediction maximum: {ols_pred.max():.3f}")  # May be > 1

Logit's Advantage

python
# ✅ Correct: Logit guarantees predictions in [0, 1]
logit_pred = logit_model.predict(X)
print(f"Logit prediction minimum: {logit_pred.min():.3f}")  # Always ≥ 0
print(f"Logit prediction maximum: {logit_pred.max():.3f}")  # Always ≤ 1

Practice Exercise

Complete code: Studying employment determinants

python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Generate data
np.random.seed(42)
n = 800

data = pd.DataFrame({
    'education': np.random.randint(9, 22, n),   # Years of education (9-21)
    'age': np.random.randint(22, 60, n),        # Age in years (22-59)
    'female': np.random.choice([0, 1], n)       # 1 = female, 0 = male
})

# Construct employment from a logit model with known true coefficients
z = -3 + 0.2 * data['education'] + 0.02 * data['age'] - 0.3 * data['female']
prob = 1 / (1 + np.exp(-z))
data['employed'] = (np.random.uniform(0, 1, n) < prob).astype(int)

# Logit regression
X = sm.add_constant(data[['education', 'age', 'female']])
y = data['employed']

model = sm.Logit(y, X).fit()
print(model.summary())

# Marginal effects
print("\nMarginal Effects:")
print(model.get_margeff().summary())

# Odds Ratios
print("\nOdds Ratios:")
print(np.exp(model.params))
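
Since the exercise data are simulated, the true coefficients are known, which makes it possible to check the fitted model against the probabilities the simulation implies. This is a small sketch for one hypothetical profile (16 years of education, age 30); the function name and the profile are illustrative choices.

```python
import math

# True parameters from the data-generating process in the exercise above
TRUE_PARAMS = {"const": -3.0, "education": 0.2, "age": 0.02, "female": -0.3}

def true_employment_prob(education: float, age: float, female: int) -> float:
    """Employment probability implied by the simulation's true parameters."""
    z = (TRUE_PARAMS["const"]
         + TRUE_PARAMS["education"] * education
         + TRUE_PARAMS["age"] * age
         + TRUE_PARAMS["female"] * female)
    return 1 / (1 + math.exp(-z))

# Hypothetical profile: 16 years of education, age 30
p_female = true_employment_prob(16, 30, female=1)
p_male = true_employment_prob(16, 30, female=0)
print(f"female: {p_female:.3f}, male: {p_male:.3f}, gap: {p_male - p_female:.3f}")
```

Predictions from the fitted model for the same profile should land near these values, with the gap shrinking as the sample size grows.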

Key Takeaways

| Aspect | OLS Regression | Logit Regression |
|---|---|---|
| Dependent Variable | Continuous | Binary (0/1) |
| Prediction Range | (-∞, +∞) | [0, 1] |
| Coefficient Interpretation | Marginal effect | Log-odds |
| More Intuitive Interpretation | Coefficient itself | Odds Ratios or marginal effects |
| Python Command | sm.OLS() | sm.Logit() |

Next Steps

  • Article 04: summary_col() - Elegantly Displaying Multiple Model Comparisons

🎉 You've mastered Logit regression in Python!

Released under the MIT License. Content © Author.