Scikit-learn Machine Learning Basics
Quick Start Guide to Regression, Classification, and Clustering
What is Scikit-learn?
Scikit-learn (sklearn) is Python's most popular machine learning library.
Ideal for Social Science Research:
- Predictive models (income prediction, voting prediction)
- Classification problems (risk assessment, customer segmentation)
- Clustering analysis (market segmentation; see the quick sketch below)
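Clustering does not come up again later in this guide, so here is a minimal sketch using KMeans for a simple market segmentation. The DataFrame df and its 'age' and 'income' columns are illustrative assumptions, not a real dataset:
python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Assumed DataFrame df with 'age' and 'income' columns (illustrative only)
X_cluster = df[['age', 'income']]
# Standardize so income's larger scale does not dominate the distance metric
X_scaled = StandardScaler().fit_transform(X_cluster)
# Fit a 3-cluster solution; random_state makes the result reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['segment'] = kmeans.fit_predict(X_scaled)
print(df['segment'].value_counts())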
Basic Workflow
python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# 1. Prepare data
X = df[['age', 'education_years']] # Features
y = df['income'] # Target
# 2. Split training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 3. Create model
model = LinearRegression()
# 4. Train
model.fit(X_train, y_train)
# 5. Predict
y_pred = model.predict(X_test)
# 6. Evaluate
r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.3f}")
# 7. View coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")Common Models
1. Linear Regression
python
from sklearn.linear_model import LinearRegression
# Income prediction
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
2. Logistic Regression (Classification)
python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# High income prediction (>70k): build a binary target
y_binary = (df['income'] > 70000).astype(int)
# Re-split so the features and the binary target stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Predict class probabilities
probs = model.predict_proba(X_test)
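For classification, R² is not the right metric; accuracy and a classification report are the usual quick checks. A minimal sketch, assuming the y_test and predictions produced by the split above:
python
from sklearn.metrics import accuracy_score, classification_report
# Share of respondents classified correctly
print(f"Accuracy: {accuracy_score(y_test, predictions):.3f}")
# Precision, recall, and F1 for each class
print(classification_report(y_test, predictions))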
3. Decision Trees
python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3)
# Reuses the binary high-income target (y_train) from the logistic regression example
model.fit(X_train, y_train)
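To see what the fitted tree has learned, export_text prints the split rules as plain text; a minimal sketch, assuming the model and the feature matrix X defined above:
python
from sklearn.tree import export_text
# Print the learned split rules, labelled with the feature names from X
print(export_text(model, feature_names=list(X.columns)))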
4. Random Forests
python
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Feature importance
importances = model.feature_importances_
print(dict(zip(X.columns, importances)))
Practical Example
Case Study: Income Prediction Model
python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Simulated data for illustration
df = pd.DataFrame({
    'age': np.random.randint(22, 65, 500),
    'education_years': np.random.randint(12, 20, 500),
    'experience': np.random.randint(0, 30, 500),
    'income': np.random.normal(60000, 20000, 500)
})
# Select features and target
X = df[['age', 'education_years', 'experience']]
y = df['income']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: ${rmse:,.0f}")
print(f"R²: {r2:.3f}")
# Feature importance
for feature, importance in zip(X.columns, model.feature_importances_):
print(f"{feature}: {importance:.3f}")Comparing with Stata
Stata: Linear Regression
stata
* Stata
regress income age education_years
predict income_hat
Python: Equivalent Code
python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
income_hat = model.predict(X_test)
# statsmodels gives regression output closer to Stata's
import statsmodels.formula.api as smf
model = smf.ols('income ~ age + education_years', data=df).fit()
print(model.summary())  # Similar to Stata's regress output
Key Concepts
Training Set vs Test Set
python
# Why split? To evaluate the model on data it has never seen and guard against overfitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # Hold out 20% for testing
    random_state=42   # Fix the random seed for reproducibility
)
Cross-Validation
python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV; model is any unfitted sklearn estimator, e.g. LinearRegression()
print(f"CV R²: {scores.mean():.3f} (+/- {scores.std():.3f})")Practice Exercises
python
# Using sklearn, complete:
# 1. Predict whether respondents will purchase a product (logistic regression)
# 2. Predict income based on demographic features (random forest)
# 3. Compare the performance of different models (see the starter sketch below)
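As a starting point for exercise 3, cross-validation makes side-by-side comparison straightforward; a minimal sketch, assuming the X and y from the case study above:
python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# Compare mean cross-validated R² for two candidate regressors
candidates = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}
for name, candidate in candidates.items():
    scores = cross_val_score(candidate, X, y, cv=5)
    print(f"{name}: R² = {scores.mean():.3f} (+/- {scores.std():.3f})")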
Next Steps
Next Section: PyTorch/TensorFlow Deep Learning Introduction
Keep going!