Scikit-learn Machine Learning Basics

Quick Start Guide to Regression, Classification, and Clustering


What is Scikit-learn?

Scikit-learn (imported as sklearn) is one of Python's most widely used machine learning libraries.

Ideal for Social Science Research:

  • Predictive models (income prediction, voting prediction)
  • Classification problems (risk assessment, purchase prediction)
  • Clustering analysis (market segmentation)
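
The clustering use case in the list above can be sketched with KMeans. This is a minimal illustration, not a recommended segmentation method: the two respondent groups, their ages and incomes, and the choice of two clusters are all made-up assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical survey data: two loose groups of respondents (age, income)
group_a = rng.normal(loc=[30, 40000], scale=[4, 5000], size=(100, 2))
group_b = rng.normal(loc=[55, 90000], scale=[4, 5000], size=(100, 2))
X = np.vstack([group_a, group_b])

# Put both columns on the same scale so income does not dominate the distances
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an assumption; real data rarely announces its cluster count
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(np.bincount(labels))  # Number of respondents assigned to each cluster
```

Unlike the supervised models below, clustering has no target column, so there is no train/test score to report; the cluster labels themselves are the output.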

Basic Workflow

python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1. Prepare data
X = df[['age', 'education_years']]  # Features
y = df['income']                     # Target

# 2. Split training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Create model
model = LinearRegression()

# 4. Train
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate
r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.3f}")

# 7. View coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
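
To make step 7 concrete, here is a small sketch of how the fitted coefficients can be read. The data is simulated with a known relationship (every number below is invented for illustration), so the estimates can be compared against the values used to generate it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 1000

# Synthetic data with a known relationship (all numbers are made up):
# income = 10000 + 500 * age + 3000 * education_years + noise
df = pd.DataFrame({
    'age': rng.integers(22, 65, n),
    'education_years': rng.integers(12, 20, n),
})
noise = rng.normal(0, 5000, n)
df['income'] = 10000 + 500 * df['age'] + 3000 * df['education_years'] + noise

X = df[['age', 'education_years']]
y = df['income']

model = LinearRegression().fit(X, y)

# Each coefficient is the expected change in income for a one-unit
# increase in that feature, holding the other feature fixed
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:,.0f}")
print(f"Intercept: {model.intercept_:,.0f}")
```

The printed coefficients should land close to 500 and 3000, the values used to build the data. With real survey data there is no such ground truth, and scikit-learn does not report standard errors for the estimates.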

Common Models

1. Linear Regression

python
from sklearn.linear_model import LinearRegression

# Income prediction
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

2. Logistic Regression (Classification)

python
from sklearn.linear_model import LogisticRegression

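# High income prediction (>70k): build the binary labels from y_train
# so they line up row for row with X_train
y_train_binary = (y_train > 70000).astype(int)

model = LogisticRegression()
model.fit(X_train, y_train_binary)
predictions = model.predict(X_test)

# Predict probabilities: one column per class, [P(class 0), P(class 1)]
probs = model.predict_proba(X_test)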
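
As a rough sketch of how the probabilities might be used, the example below applies a custom cutoff instead of the default one. The data is simulated, and the 0.7 cutoff and the link between education and high income are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: more education raises the chance of high income
education_years = rng.integers(12, 20, n)
p_high = 1 / (1 + np.exp(-(education_years - 16)))
y = rng.binomial(1, p_high)
X = education_years.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in clf.classes_
probs = clf.predict_proba(X_test)
p_high_income = probs[:, 1]

# predict() corresponds to a 0.5 cutoff for binary problems; a stricter
# cutoff can be applied to the probabilities directly
custom_pred = (p_high_income >= 0.7).astype(int)

print(f"Accuracy (default cutoff): {accuracy_score(y_test, clf.predict(X_test)):.3f}")
print(f"Flagged at 0.7 cutoff: {custom_pred.sum()} of {len(custom_pred)}")
```

Raising the cutoff flags fewer respondents as high income, which can be useful when false positives are costly.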

3. Decision Trees

python
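from sklearn.tree import DecisionTreeClassifier

# A classifier needs class labels, not a continuous income
y_train_binary = (y_train > 70000).astype(int)

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train_binary)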
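
One reason to choose a shallow tree is that its rules can be read directly. The sketch below prints them with export_text; the data and the rule that generates the labels (16 or more years of education) are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 300

# Hypothetical features and a binary label driven entirely by education
age = rng.integers(22, 65, n)
education_years = rng.integers(12, 20, n)
X = np.column_stack([age, education_years])
y = (education_years >= 16).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned splits as indented if/else rules
rules = export_text(tree, feature_names=['age', 'education_years'])
print(rules)
```

Because the labels here depend on a single threshold, the tree only needs one split on education_years; real data will usually produce a deeper and messier set of rules.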

4. Random Forests

python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_
print(dict(zip(X.columns, importances)))

Practical Example

Case Study: Income Prediction Model

python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

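# Simulated data. Income is drawn independently of the features,
# so expect a test R² near zero or below; swap in real data to see signal.
np.random.seed(42)  # Make the simulated data reproducible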
df = pd.DataFrame({
    'age': np.random.randint(22, 65, 500),
    'education_years': np.random.randint(12, 20, 500),
    'experience': np.random.randint(0, 30, 500),
    'income': np.random.normal(60000, 20000, 500)
})

# Select features and target
X = df[['age', 'education_years', 'experience']]
y = df['income']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: ${rmse:,.0f}")
print(f"R²: {r2:.3f}")

# Feature importance
for feature, importance in zip(X.columns, model.feature_importances_):
    print(f"{feature}: {importance:.3f}")
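
In the case study the simulated income is drawn independently of age, education, and experience, so the model has nothing to learn from. The variant below is a sketch in which income is built from the features, using an invented formula, so that R² and the feature importances become meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

df = pd.DataFrame({
    'age': rng.integers(22, 65, n),
    'education_years': rng.integers(12, 20, n),
    'experience': rng.integers(0, 30, n),
})
# Assumed (made-up) relationship: education and experience drive income,
# age has no effect
df['income'] = (
    20000
    + 3000 * df['education_years']
    + 1000 * df['experience']
    + rng.normal(0, 5000, n)
)

X = df[['age', 'education_years', 'experience']]
y = df['income']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
print(f"R²: {r2:.3f}")

# Importances sum to 1; sorting makes the ranking easier to read
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Here age should rank last, since it plays no part in the formula that generates income.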

Comparing with Stata

Stata: Linear Regression

stata
* Stata
regress income age education_years
predict income_hat

Python: Equivalent Code

python
from sklearn.linear_model import LinearRegression

# Stata's regress uses the full sample, so fit on all rows here too
X = df[['age', 'education_years']]
y = df['income']
model = LinearRegression()
model.fit(X, y)
income_hat = model.predict(X)

# statsmodels is closer to Stata: it reports standard errors and p-values
import statsmodels.formula.api as smf
results = smf.ols('income ~ age + education_years', data=df).fit()
print(results.summary())  # Similar to Stata's regression table

Key Concepts

Training Set vs Test Set

python
# Why split? To measure performance on data the model has not seen, which exposes overfitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42    # Fix random seed
)
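
The value of a held-out test set shows up most clearly with a model that can memorize its training data. The sketch below uses simulated data (an assumed linear trend plus noise) and an unrestricted decision tree to compare the two scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 400

# Hypothetical noisy data: income depends linearly on education plus noise
X = rng.uniform(12, 20, size=(n, 1))
y = 3000 * X[:, 0] + rng.normal(0, 10000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unrestricted tree can memorize the training rows, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"Train R²: {train_r2:.3f}")
print(f"Test R²:  {test_r2:.3f}")
```

A large gap between the training and test scores is the usual sign of overfitting; limiting max_depth is one way to narrow it.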

Cross-Validation

python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='r2')  # 5-fold CV, scored by R²
print(f"CV R²: {scores.mean():.3f} (+/- {scores.std():.3f})")
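
Cross-validation can also report an error in the target's own units. The following sketch, on simulated data with made-up parameters, shuffles the rows before splitting and scores each fold by RMSE rather than R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: a linear trend in education plus noise
X = rng.uniform(12, 20, size=(n, 1))
y = 3000 * X[:, 0] + rng.normal(0, 5000, n)

# Shuffle before splitting in case the rows are stored in a meaningful order
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated
neg_rmse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring='neg_root_mean_squared_error'
)
rmse = -neg_rmse
print(f"CV RMSE: {rmse.mean():,.0f} (+/- {rmse.std():,.0f})")
```

Passing an integer such as cv=5 for a regressor uses unshuffled folds, so an explicit KFold is worth considering when the data is sorted.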

Practice Exercises

python
# Using sklearn, complete the following:
# 1. Predict whether respondents will purchase a product (logistic regression)
# 2. Predict income from demographic features (random forest)
# 3. Compare the performance of the two kinds of model on a held-out test set

Next Steps

Next Section: PyTorch/TensorFlow Deep Learning Introduction

Keep going!

Released under the MIT License. Content © Author.