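Introduction to Machine Learning with Scikit-learn
A quick start to regression, classification, and clustering
What is Scikit-learn?
Scikit-learn (sklearn) is one of the most widely used machine learning libraries for Python.
It is well suited to social science research:
- Predictive models (income prediction, vote prediction)
- Classification problems (risk assessment, customer segmentation)
- Cluster analysis (market segmentation)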
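The list above mentions clustering for market segmentation. Here is a minimal sketch of what that might look like with `KMeans`. The customer table, the column names `annual_spend` and `visits_per_month`, the synthetic data, and the choice of three clusters are all illustrative assumptions, not part of the original text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: three synthetic groups (illustrative only)
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "annual_spend": np.concatenate([
        rng.normal(500, 50, 100),
        rng.normal(2000, 150, 100),
        rng.normal(5000, 300, 100),
    ]),
    "visits_per_month": np.concatenate([
        rng.normal(1, 0.3, 100),
        rng.normal(4, 0.5, 100),
        rng.normal(10, 1.0, 100),
    ]),
})

# Put both features on the same scale so neither dominates the distance
X_scaled = StandardScaler().fit_transform(customers)

# Clustering has no target y: fit_predict assigns each row to a segment
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X_scaled)

# Average profile of each segment
print(customers.groupby("segment").mean().round(1))
```

Unlike the supervised models in the rest of this section, there is no train/test split here because there are no labels to score against. The segments have to be judged by whether their profiles make substantive sense.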
Basic workflow
python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# 1. Prepare the data (df is assumed to be a pandas DataFrame with these columns)
X = df[['age', 'education_years']]  # features
y = df['income']  # target
# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 3. Create the model
model = LinearRegression()
# 4. Train
model.fit(X_train, y_train)
# 5. Predict
y_pred = model.predict(X_test)
# 6. Evaluate
r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.3f}")
# 7. Inspect the coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
Common models
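Every estimator in this section exposes the same `fit` / `predict` interface, so swapping one model for another is mostly a one-line change. The sketch below shows this under stated assumptions: the data is synthetic (income built from `age` and `education_years` with made-up coefficients plus noise), and the `models` dictionary is an illustrative helper, not part of sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): income depends linearly on the features plus noise
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(22, 65, n),
    "education_years": rng.integers(12, 20, n),
})
df["income"] = (
    10_000 + 500 * df["age"] + 2_500 * df["education_years"]
    + rng.normal(0, 5_000, n)
)

X = df[["age", "education_years"]]
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical helper: a dict of candidate regressors to loop over
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=42),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # same call for every model
    results[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R² = {results[name]:.3f}")
```

Because the synthetic income is linear in the features, the linear model should do well here. On real data the ranking can differ, which is why comparing several models on the same held-out set is useful.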
1. Linear regression
python
from sklearn.linear_model import LinearRegression
# Income prediction
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
2. Logistic regression (classification)
python
from sklearn.linear_model import LogisticRegression
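# Binary target: high income (> 70k) or not
y_binary = (df['income'] > 70000).astype(int)
model = LogisticRegression()
# y_binary covers every row of df, so select the training rows by index
model.fit(X_train, y_binary.loc[X_train.index])
predictions = model.predict(X_test)
# Predicted probabilities (one column per class)
probs = model.predict_proba(X_test)
3. Decision tree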
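A decision tree classifier is trained on class labels, and a shallow tree can be printed as readable if/else rules. The sketch below assumes synthetic data in which the `high_income` label (a hypothetical column) follows a made-up rule on `education_years` and `age`, with 10% of the labels flipped at random.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data (assumption): label follows a simple rule, 10% of labels flipped
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(22, 65, n),
    "education_years": rng.integers(12, 20, n),
})
rule = (df["education_years"] >= 16) & (df["age"] >= 35)
flip = rng.random(n) < 0.10
df["high_income"] = (rule ^ flip).astype(int)

X = df[["age", "education_years"]]
y = df["high_income"]
# stratify=y keeps the class balance similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")

# Print the learned splits as text rules
print(export_text(tree, feature_names=list(X.columns)))
```

Limiting `max_depth` keeps the printed rules short enough to read and interpret, which is often the main reason to choose a single tree over an ensemble.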
python
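from sklearn.tree import DecisionTreeClassifier
# A classifier needs class labels: y_train must be a categorical target here
# (for example the binary high-income label), not the continuous income
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)
4. Random forest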
python
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Feature importances
importances = model.feature_importances_
print(dict(zip(X.columns, importances)))
Case study
Case: an income prediction model
python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Data (synthetic: income is drawn independently of the features,
# so expect R² near or below zero; real data would carry actual signal)
df = pd.DataFrame({
    'age': np.random.randint(22, 65, 500),
    'education_years': np.random.randint(12, 20, 500),
    'experience': np.random.randint(0, 30, 500),
    'income': np.random.normal(60000, 20000, 500)
})
# Select features and target
X = df[['age', 'education_years', 'experience']]
y = df['income']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: ${rmse:,.0f}")
print(f"R²: {r2:.3f}")
# Feature importances
for feature, importance in zip(X.columns, model.feature_importances_):
    print(f"{feature}: {importance:.3f}")
Comparison with Stata
Stata: linear regression
stata
* Stata
regress income age education_years
predict income_hat
Python: equivalent code
python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
income_hat = model.predict(X_test)
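# statsmodels is closer to Stata: formula syntax, plus standard errors, t-statistics, and p-values
import statsmodels.formula.api as smf
model = smf.ols('income ~ age + education_years', data=df).fit()
print(model.summary())  # output resembles Stata's regression table
Key concepts
Training set vs. test set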
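A held-out test set matters because a flexible model can memorize its training data. The sketch below uses synthetic data (the data-generating formula and noise level are assumptions for illustration). An unrestricted decision tree scores almost perfectly on the rows it was trained on, and the test score shows how much of that carries over to new rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a linear signal with substantial noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = 2.0 * X[:, 0] + rng.normal(0, 5.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No depth limit: the tree can grow until every training row is fit exactly
deep_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
# A shallow tree is forced to capture only the broad pattern
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_train)

for name, model in [("deep", deep_tree), ("shallow", shallow_tree)]:
    train_r2 = model.score(X_train, y_train)  # R² on data the model has seen
    test_r2 = model.score(X_test, y_test)     # R² on held-out data
    print(f"{name}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")
```

The gap between training and test R² for the deep tree is the symptom of overfitting. Without a test set, only the training score would be visible.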
python
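# Why split? Scoring on data the model never saw shows how well it generalizes and exposes overfitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% for testing
    random_state=42   # fix the random seed for reproducibility
)
Cross-validation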
python
from sklearn.model_selection import cross_val_score
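# model must be an sklearn estimator (e.g. LinearRegression()); the default score for regressors is R²
scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV
print(f"CV R²: {scores.mean():.3f} (+/- {scores.std():.3f})")
Exercises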
python
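# Using sklearn:
# 1. Predict whether a respondent will buy a product (logistic regression)
# 2. Predict income from demographic characteristics (random forest)
# 3. Compare the performance of the different models
Next steps
Next section: Introduction to deep learning with PyTorch/TensorFlow
Keep going!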