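Introduction to Machine Learning with Scikit-learn

A quick start to regression, classification, and clustering

What is Scikit-learn?

Scikit-learn (imported as sklearn) is one of the most widely used machine learning libraries for Python.

It is well suited to social science research:

Predictive models (income prediction, vote prediction)
Classification problems (risk assessment, assigning customers to known groups)
Cluster analysis (market segmentation)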
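The clustering use case in the list above has no example elsewhere on this page, so here is a minimal sketch. It assumes a synthetic customer table (the column names `annual_spend` and `visits_per_month` are invented for illustration) and uses `KMeans` with `n_clusters=3`, an arbitrary choice that happens to match how the fake data is generated.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical customer data made of three loose groups
customers = pd.DataFrame({
    "annual_spend": np.concatenate([
        rng.normal(500, 50, 100),
        rng.normal(2000, 200, 100),
        rng.normal(5000, 400, 100),
    ]),
    "visits_per_month": np.concatenate([
        rng.normal(1, 0.3, 100),
        rng.normal(4, 0.8, 100),
        rng.normal(10, 1.5, 100),
    ]),
})

# Standardize so both columns contribute comparably to the distances
X_scaled = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X_scaled)

# Average profile and size of each segment
print(customers.groupby("segment").mean().round(1))
print(customers["segment"].value_counts())
```

Unlike the regression and classification examples, there is no target column here: the segment labels come from the data itself, and their numbering (0, 1, 2) carries no meaning.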

Basic workflow

python

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1. Prepare the data
# (df is assumed to be a pandas DataFrame with these three columns)
X = df[['age', 'education_years']]  # features
y = df['income']                     # target

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Create the model
model = LinearRegression()

# 4. Train
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate
r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.3f}")

# 7. Inspect the coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
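The workflow above assumes a `df` that already exists. To run it end to end without a dataset, one option is to simulate one. The sketch below is self-contained; the data-generating process (an intercept of 10000, 500 per year of age, 3000 per year of education, plus noise) is entirely made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

age = rng.integers(22, 65, n)
education_years = rng.integers(12, 20, n)
# Invented relationship, used only so the model has something to recover
income = 10000 + 500 * age + 3000 * education_years + rng.normal(0, 5000, n)

df = pd.DataFrame({"age": age, "education_years": education_years, "income": income})

X = df[["age", "education_years"]]
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

print(f"Test R²: {r2:.3f}")
print(dict(zip(X.columns, model.coef_.round(1))))
```

Because the true coefficients are known here, the printed estimates can be compared against 500 and 3000 as a sanity check on the whole pipeline.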

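Common models

1. Linear regression

python

from sklearn.linear_model import LinearRegression

# Income prediction (X_train, y_train, X_test come from the split in the basic workflow)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)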

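2. Logistic regression (classification)

python

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary target: high income (> 70k) or not
y_binary = (df['income'] > 70000).astype(int)

# Split again so the binary labels stay aligned with the feature rows
X_train, X_test, y_train_bin, y_test_bin = train_test_split(
    X, y_binary, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train_bin)
predictions = model.predict(X_test)

# Predicted probabilities, one column per class (order given by model.classes_)
probs = model.predict_proba(X_test)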
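The snippet above stops at the predictions. A sketch of how one might evaluate them follows, using accuracy and a confusion matrix. The data is simulated (the income formula is a made-up assumption), and `stratify` and `max_iter=1000` are choices added here, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age": rng.integers(22, 65, n),
    "education_years": rng.integers(12, 20, n),
})
# Hypothetical income process, for illustration only
df["income"] = 500 * df["age"] + 3000 * df["education_years"] + rng.normal(0, 8000, n)

X = df[["age", "education_years"]]
y_binary = (df["income"] > 70000).astype(int)

# stratify keeps the share of high earners similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
probs = clf.predict_proba(X_test)  # shape (n_samples, 2), columns ordered as clf.classes_
accuracy = accuracy_score(y_test, y_pred)

print("classes:", clf.classes_)
print(f"accuracy: {accuracy:.3f}")
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
```

Accuracy alone can mislead when one class is rare, which is why the confusion matrix is printed alongside it.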

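3. Decision tree

python

from sklearn.tree import DecisionTreeClassifier

# A classifier needs discrete labels, so binarize the continuous income target
y_train_bin = (y_train > 70000).astype(int)

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train_bin)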
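One appeal of a shallow tree is that its rules can be read directly. The sketch below uses `export_text` to print them. The data comes from `make_classification`, and the feature names are placeholders attached to synthetic columns, not real survey variables.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary classification data; the names below are placeholders
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ["age", "education_years", "experience"]

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Text rendering of the fitted split rules
rules = export_text(tree, feature_names=feature_names)
print(rules)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

With `max_depth=3` the tree has at most eight leaves, which keeps the printed rules short enough to discuss in a paper or report.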

4. Random forest

python

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Feature importances (they sum to 1 across features)
importances = model.feature_importances_
print(dict(zip(X.columns, importances)))

Worked example

Case study: an income prediction model

python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Simulated data. Income is drawn independently of the features,
# so expect R² near (or below) zero; swap in real data to see signal.
np.random.seed(42)  # make the simulated data reproducible
df = pd.DataFrame({
    'age': np.random.randint(22, 65, 500),
    'education_years': np.random.randint(12, 20, 500),
    'experience': np.random.randint(0, 30, 500),
    'income': np.random.normal(60000, 20000, 500)
})

# Select features and target
X = df[['age', 'education_years', 'experience']]
y = df['income']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: ${rmse:,.0f}")
print(f"R²: {r2:.3f}")

# Feature importances
for feature, importance in zip(X.columns, model.feature_importances_):
    print(f"{feature}: {importance:.3f}")
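In the case study, `income` is generated without reference to the three features, so there is nothing for the forest to learn. A baseline makes that visible. The sketch below regenerates data of the same shape and compares the forest with `DummyRegressor`, which always predicts the training mean. Using a dummy baseline is a suggestion added here, not part of the original case study.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Same shape as the case study data: income is independent of the features
df = pd.DataFrame({
    "age": rng.integers(22, 65, n),
    "education_years": rng.integers(12, 20, n),
    "experience": rng.integers(0, 30, n),
    "income": rng.normal(60000, 20000, n),
})

X = df[["age", "education_years", "experience"]]
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, m in models.items():
    m.fit(X_train, y_train)
    y_pred = m.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results[name] = {"rmse": rmse, "r2": r2_score(y_test, y_pred)}
    print(f"{name}: RMSE={rmse:,.0f}, R²={results[name]['r2']:.3f}")
```

If a model cannot clearly beat the mean baseline on the test set, its feature importances describe fitted noise and should not be interpreted.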

Comparison with Stata

Stata: linear regression

stata

* Stata
regress income age education_years
predict income_hat

Python: equivalent code

python

from sklearn.linear_model import LinearRegression

# Note: regress uses every observation; here the model is fit on the training split only
model = LinearRegression()
model.fit(X_train, y_train)
income_hat = model.predict(X_test)

# statsmodels is closer to Stata, though
import statsmodels.formula.api as smf
ols_model = smf.ols('income ~ age + education_years', data=df).fit()
print(ols_model.summary())  # output similar to Stata's regression table
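To check that `LinearRegression` computes ordinary least squares, the same estimator `regress` uses, one can fit it on all rows and compare against a direct least-squares solve. The sketch below does that with NumPy on simulated data; the income formula is a made-up assumption, and no Stata output is reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({
    "age": rng.integers(22, 65, n),
    "education_years": rng.integers(12, 20, n),
})
# Hypothetical data-generating process
df["income"] = 10000 + 500 * df["age"] + 3000 * df["education_years"] + rng.normal(0, 5000, n)

X = df[["age", "education_years"]]
y = df["income"]

# Fit on ALL rows, with no train/test split
model = LinearRegression().fit(X, y)

# Direct least-squares solve with an explicit constant column
A = np.column_stack([np.ones(n), X.to_numpy()])
beta, *_ = np.linalg.lstsq(A, y.to_numpy(), rcond=None)

print("sklearn:", model.intercept_, model.coef_)
print("lstsq:  ", beta)
```

The two lines should agree up to floating-point error. What sklearn does not provide is the standard errors, t statistics, and confidence intervals of a regression table, which is the practical reason to reach for statsmodels when the goal is inference rather than prediction.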

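Key concepts

Training set vs. test set

python

# Why split? Scoring on data the model has never seen gives an honest
# estimate of how well it generalizes, and exposes overfitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% for testing
    random_state=42    # fix the random seed so the split is reproducible
)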
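A small experiment shows what the test set reveals. The sketch below fits a shallow and a fully grown decision tree to noisy synthetic data (a sine curve plus noise, chosen only for illustration) and reports R² on both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, 300)  # noisy synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train R²={scores[depth][0]:.3f}, test R²={scores[depth][1]:.3f}")
```

The unrestricted tree memorizes the training rows, so its training R² is essentially perfect while its test R² is much lower. Judged on training data alone, it would look like the better model.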

Cross-validation

python

from sklearn.model_selection import cross_val_score

# 5-fold CV; for a regressor the default score is R²
scores = cross_val_score(model, X, y, cv=5)
print(f"CV R²: {scores.mean():.3f} (+/- {scores.std():.3f})")
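Passing `cv=5` with a regressor uses consecutive, unshuffled folds, which can mislead if the rows are sorted. A sketch with an explicit shuffled `KFold` and a second metric follows. The data is synthetic and the true coefficients (2.0, -1.0, 0.5) are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, 200)

# Shuffle before splitting so each fold is a random sample of rows
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()

r2_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
# sklearn scorers follow "higher is better", so RMSE comes back negated
rmse_scores = -cross_val_score(
    model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
)

print(f"CV R²:   {r2_scores.mean():.3f} (+/- {r2_scores.std():.3f})")
print(f"CV RMSE: {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")
```

Reusing the same `cv` object for both calls means the two metrics are computed on identical folds, so they describe the same fits.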

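Exercises

python

# Using sklearn:
# 1. Predict whether a respondent will buy a product (logistic regression)
# 2. Predict income from demographic characteristics (random forest)
# 3. Compare the performance of the different models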

Next steps

Next section: Introduction to deep learning with PyTorch/TensorFlow

Keep going!
