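Module 9: Summary and Review

Core Data Science Libraries: NumPy, Pandas, Matplotlib

Key Concepts

1. NumPy Basics

Core concepts:

ndarray: an N-dimensional array, an efficient container for numerical computation
Vectorized operations: replace explicit Python loops with whole-array expressions, typically 10-100x faster on large numeric arrays
Broadcasting: the rules that let arrays of different but compatible shapes be combined in one operation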
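
To make vectorization and broadcasting concrete, here is a minimal sketch. The `prices` and `quantities` arrays are made-up illustration data, not part of the course material.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])           # shape (3,)
quantities = np.array([[1, 2, 3],
                       [4, 5, 6]])               # shape (2, 3)

# Vectorized: one expression replaces an explicit Python loop.
loop_result = np.array([p * 1.1 for p in prices])
vector_result = prices * 1.1

# Broadcasting: the (3,) array is applied to each row of the (2, 3) array.
revenue = quantities * prices
print(revenue.shape)
print(revenue)
```

Both versions of the price calculation give the same numbers; the vectorized form simply pushes the loop into compiled NumPy code.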

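Basic operations:

python

import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

# Common constructors
np.zeros((3, 4))        # all zeros, shape (3, 4)
np.ones((2, 3))         # all ones, shape (2, 3)
np.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5)    # [0, 0.25, 0.5, 0.75, 1]

# Array attributes
arr.shape    # shape, here (5,)
arr.dtype    # element data type
arr.ndim     # number of dimensions
arr.size     # total number of elements

# Vectorized operations
arr * 2      # multiply every element by 2
arr + 10     # add 10 to every element
arr ** 2     # square every element

# Indexing
arr[0]       # first element
arr[1:4]     # slice: elements at positions 1, 2, 3
arr2d[0, 1]  # 2-D indexing: row 0, column 1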

Statistical functions:

python

arr.mean()      # mean
arr.std()       # standard deviation (population, ddof=0, by default)
arr.sum()       # sum
arr.min()       # minimum
arr.max()       # maximum
np.median(arr)  # median
np.percentile(arr, 25)  # 25th percentile
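
One detail worth checking when comparing results across libraries: NumPy's `std()` defaults to the population formula (`ddof=0`), while pandas' `Series.std()` defaults to the sample formula (`ddof=1`). A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4, 5])

pop_std = arr.std()                 # NumPy default: ddof=0 (population)
sample_std = arr.std(ddof=1)        # sample standard deviation
pandas_std = pd.Series(arr).std()   # pandas default: ddof=1

print(pop_std, sample_std, pandas_std)
```

For small samples the two differ noticeably, so it helps to state which one a report uses.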

2. Pandas Essentials

Two core data structures:

Series: a one-dimensional labeled array
DataFrame: a two-dimensional labeled table

Basic DataFrame operations:

python

import numpy as np
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})

# Inspect the data
df.head()          # first 5 rows
df.tail()          # last 5 rows
df.info()          # column types and non-null counts
df.describe()      # summary statistics

# Select data
df['age']          # one column (returns a Series)
df[['name', 'age']]  # several columns (returns a DataFrame)
df.loc[0]          # row by index label
df.iloc[0]         # row by integer position

# Filter rows
df[df['age'] > 25]
df.query('age > 25 and income < 80000')

# Add columns
df['age_squared'] = df['age'] ** 2
df['log_income'] = np.log(df['income'])

# Sort
df.sort_values('age')
df.sort_values(['age', 'income'], ascending=[True, False])

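Data cleaning:

python

# Missing values (each call returns a new object; assign it to keep the result)
df.isnull()          # boolean mask of missing values
df.dropna()          # drop rows containing missing values
df.fillna(0)         # fill missing values with 0
df['age'] = df['age'].fillna(df['age'].mean())  # fill with the column mean

# Duplicates
df.duplicated()      # flag rows that repeat an earlier row
df.drop_duplicates() # drop repeated rows

# Type conversion
df['age'] = df['age'].astype(int)  # raises if the column still contains NaN
df['income'] = pd.to_numeric(df['income'], errors='coerce')  # unparseable values become NaN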
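
As a rough end-to-end illustration of these cleaning calls, the sketch below runs them on a tiny made-up table (`raw` is hypothetical data, chosen so that each step has a visible effect).

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'age': [25, np.nan, 35, 35],
    'income': ['50000', '75000', 'n/a', 'n/a'],
})

cleaned = raw.copy()
# 'n/a' cannot be parsed, so errors='coerce' turns it into NaN.
cleaned['income'] = pd.to_numeric(cleaned['income'], errors='coerce')
# fillna returns a new Series, so the result is assigned back.
cleaned['age'] = cleaned['age'].fillna(cleaned['age'].mean())
# Drop the repeated row, then rows that still lack an income.
cleaned = cleaned.drop_duplicates().dropna(subset=['income'])
print(cleaned)
```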

Grouping and aggregation (these examples assume df also has 'gender' and 'education' columns):

python

# GroupBy
df.groupby('gender')['income'].mean()

# Several aggregations at once
df.groupby('education').agg({
    'income': ['mean', 'median', 'std'],
    'age': ['mean', 'min', 'max']
})

# Pivot table
pd.pivot_table(df, values='income',
               index='education',
               columns='gender',
               aggfunc='mean')
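
Since the small `df` defined earlier has no `gender` or `education` column, here is a self-contained sketch of the same three operations. The `survey` table is invented for illustration, and the named-aggregation form is one possible style, not the only one.

```python
import pandas as pd

survey = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'education': ['Bachelor', 'Bachelor', 'Master', 'Master', 'Bachelor', 'Master'],
    'income': [50000, 54000, 70000, 66000, 58000, 74000],
})

mean_by_gender = survey.groupby('gender')['income'].mean()

# Named aggregation gives flat, readable column names.
by_education = survey.groupby('education').agg(
    n=('income', 'count'),
    mean_income=('income', 'mean'),
)

table = pd.pivot_table(survey, values='income', index='education',
                       columns='gender', aggfunc='mean')
print(by_education)
print(table)
```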

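3. Matplotlib and Seaborn

Matplotlib basics:

python

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data for illustration
x = np.linspace(0, 10, 50)
y = np.sin(x)
categories = ['A', 'B', 'C']
values = [3, 7, 5]
data = np.random.default_rng(0).normal(size=200)

# Basic line plot
plt.plot(x, y)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Title')
plt.show()

# Scatter plot
plt.scatter(x, y)

# Bar chart
plt.bar(categories, values)

# Histogram
plt.hist(data, bins=20)

# Subplots: axes is a 2x2 array of Axes objects
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)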

Statistical visualization with Seaborn (df is assumed to contain the columns named below):

python

import seaborn as sns

# Set the style
sns.set_style('whitegrid')

# Distribution plots
sns.histplot(df['income'], kde=True)
sns.boxplot(data=df, x='education', y='income')
sns.violinplot(data=df, x='gender', y='income')

# Relationship plots
sns.scatterplot(data=df, x='age', y='income', hue='gender')
sns.lineplot(data=df, x='year', y='value')

# Heatmap (numeric_only=True skips text columns, which otherwise raise in pandas 2.x)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Regression plot
sns.regplot(data=df, x='education_years', y='income')

# Pair plot
sns.pairplot(df, hue='gender')

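4. A Complete Data Analysis Workflow

Standard workflow:

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load the data
df = pd.read_csv('survey_data.csv')

# 2. First look
print(df.head())
df.info()                # prints its own summary and returns None
print(df.describe())

# 3. Clean
df = df.dropna(subset=['age', 'income'])
df = df[(df['age'] >= 18) & (df['age'] <= 100)]
df = df[df['income'] > 0]

# 4. Feature engineering
df['log_income'] = np.log(df['income'])
# right=False makes bins left-closed: [18, 30), [30, 40), [40, 50), [50, 101),
# so age 18 is included and age 30 falls in '30-39', matching the labels.
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 40, 50, 101],
                          labels=['18-29', '30-39', '40-49', '50+'],
                          right=False)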

# 5. Descriptive statistics
summary = df.groupby('education').agg({
    'income': ['count', 'mean', 'median', 'std']
})

# 6. Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Income distribution
axes[0, 0].hist(df['income'], bins=30)
axes[0, 0].set_title('Income Distribution')

# Mean income by education level
df.groupby('education')['income'].mean().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Income by Education')

# Age vs. income
axes[1, 0].scatter(df['age'], df['income'], alpha=0.5)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Income')

# Correlation heatmap
sns.heatmap(df[['age', 'income', 'education_years']].corr(),
            annot=True, ax=axes[1, 1])

plt.tight_layout()
plt.savefig('analysis_report.png', dpi=300)
plt.show()

# 7. Save the results
df.to_csv('clean_data.csv', index=False)
summary.to_excel('summary_statistics.xlsx')  # requires an Excel writer such as openpyxl
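
The age-group step in the workflow depends on how `pd.cut` treats bin edges. By default the bins are right-closed, which interacts badly with labels such as '18-29'. The sketch below compares the default with `right=False` on a few hand-picked boundary ages.

```python
import pandas as pd

ages = pd.Series([18, 29, 30, 50, 100])
labels = ['18-29', '30-39', '40-49', '50+']

# Default (right=True): bins are (18, 30], (30, 40], (40, 50], (50, 100].
default_groups = pd.cut(ages, bins=[18, 30, 40, 50, 100], labels=labels)

# right=False: bins are [18, 30), [30, 40), [40, 50), [50, 101).
left_closed = pd.cut(ages, bins=[18, 30, 40, 50, 101], labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'default': default_groups, 'left_closed': left_closed}))
```

With the default, age 18 falls outside every bin (NaN) and age 30 is labeled '18-29'; the left-closed version matches the labels.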

Comparison: Pandas vs R vs Stata

Operation	Pandas	R	Stata
Read CSV	`pd.read_csv()`	`read.csv()`	`import delimited`
Preview data	`df.head()`	`head()`	`list in 1/5`
Filter	`df[df['age']>25]`	`subset(df, age>25)`	`keep if age>25`
Group and aggregate	`df.groupby('group')['x'].mean()`	`aggregate(x ~ group, df, mean)`	`collapse (mean) x, by(group)`
New variable	`df['x2'] = df['x']**2`	`df$x2 <- df$x^2`	`gen x2 = x^2`

Common Mistakes

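1. Forgetting to keep the result (assignment or inplace)

python

# Wrong: the result is discarded
df.dropna()  # returns a new DataFrame; df itself is unchanged

# Option 1: assign the result (usually preferred, and works in method chains)
df = df.dropna()

# Option 2: modify in place
df.dropna(inplace=True)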

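2. Chained-indexing warning

python

# Wrong: triggers SettingWithCopyWarning; the assignment hits a temporary copy and df may not change
df[df['age'] > 25]['income'] = 100000

# Right: select rows and column in a single .loc call
df.loc[df['age'] > 25, 'income'] = 100000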
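
A short sketch of the two safe patterns, using a made-up three-row table: a single `.loc` assignment when the original should change, and an explicit `.copy()` when a filtered subset is going to be modified on its own.

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 30, 41], 'income': [40000, 60000, 80000]})

# Single .loc call: row mask and column label together modify df itself.
df.loc[df['age'] > 25, 'income'] = 100000

# A subset that will be modified independently: take an explicit copy first.
older = df[df['age'] > 25].copy()
older['income_k'] = older['income'] / 1000

print(df)
print(older)
```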

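3. Mismatched array shapes

python

# Pitfall: this raises no error, but the result is probably not what was intended
arr1 = np.array([1, 2, 3])            # shape (3,)
arr2 = np.array([[1], [2], [3]])      # shape (3, 1)
result = arr1 + arr2  # broadcasts to shape (3, 3) instead of adding element by element

# Fix: make the shapes match explicitly
arr1 = arr1.reshape(-1, 1)            # shape (3, 1)
result = arr1 + arr2                  # shape (3, 1): [[2], [4], [6]]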
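
Note that in NumPy, adding a `(3,)` array to a `(3, 1)` array is valid broadcasting and produces a `(3, 3)` result without raising; only shapes that cannot be broadcast raise `ValueError`. The sketch below shows both cases with small example arrays.

```python
import numpy as np

row = np.array([1, 2, 3])            # shape (3,)
col = np.array([[1], [2], [3]])      # shape (3, 1)

outer_sum = row + col                # silently broadcasts to shape (3, 3)

# Shapes that cannot be broadcast together do raise ValueError.
try:
    np.array([1, 2, 3]) + np.array([1, 2])
    raised = False
except ValueError:
    raised = True

elementwise = row.reshape(-1, 1) + col   # shape (3, 1)
print(outer_sum.shape, elementwise.shape, raised)
```

Checking `.shape` before and after an operation is a cheap way to catch the silent case.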

Best Practices

1. Method chaining

python

# Chain the steps so the pipeline reads top to bottom
result = (df
    .query('age >= 18')
    .dropna(subset=['income'])
    .assign(log_income=lambda x: np.log(x['income']))
    .groupby('education')['log_income']
    .mean()
    .sort_values(ascending=False)
)

2. Prefer vectorized operations to loops

python

# Slow: row-by-row loop
for i in range(len(df)):
    df.loc[i, 'income_log'] = np.log(df.loc[i, 'income'])

# Fast: one vectorized call on the whole column
df['income_log'] = np.log(df['income'])
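
Conditional logic is a common reason people fall back to loops. As a sketch, `np.where` can express a simple if/else over a whole column; the 60000 threshold and the labels are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [30000, 52000, 75000, 120000]})

# Loop version, shown only for comparison.
loop_labels = []
for value in df['income']:
    loop_labels.append('high' if value >= 60000 else 'low')

# Vectorized version: one call over the whole column.
df['income_level'] = np.where(df['income'] >= 60000, 'high', 'low')
print(df)
```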

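3. Memory optimization

python

# Specify dtypes at load time to save memory
df = pd.read_csv('data.csv', dtype={
    'id': 'int32',
    'age': 'int8',        # int8 holds -128..127; integer dtypes cannot store NaN
    'income': 'float32',
    'gender': 'category'  # efficient for columns with few distinct values
})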
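
To see whether narrower dtypes actually help, `DataFrame.memory_usage(deep=True)` can be compared before and after conversion. The sketch below uses synthetic data built in memory rather than a CSV file; exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

n = 10_000
df = pd.DataFrame({
    'age': [25, 40, 63, 18] * (n // 4),
    'gender': ['F', 'M', 'M', 'F'] * (n // 4),
})

before = df.memory_usage(deep=True).sum()

# Same values, narrower integer type and categorical strings.
compact = df.astype({'age': 'int8', 'gender': 'category'})
after = compact.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```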

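Programming Exercises

Exercise 1: NumPy Array Operations (Basic)

Difficulty: ⭐⭐ Time: 15 minutes

python

"""
Task: compute statistics with NumPy

Given an array of incomes, compute:
1. Basic statistics (mean, median, standard deviation)
2. Quantiles (25%, 50%, 75%)
3. Standardized values (z-scores)
"""

import numpy as np

incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])

# Your code here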

Reference solution

python

import numpy as np

incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])

print("Income Data Analysis")
print("=" * 50)

# 1. Basic statistics
mean = incomes.mean()
median = np.median(incomes)
std = incomes.std()          # population standard deviation (ddof=0)
min_val = incomes.min()
max_val = incomes.max()

print(f"Sample size: {len(incomes)}")
print(f"Mean: ${mean:,.2f}")
print(f"Median: ${median:,.2f}")
print(f"Std dev: ${std:,.2f}")
print(f"Min: ${min_val:,}")
print(f"Max: ${max_val:,}")

# 2. Quantiles
q25 = np.percentile(incomes, 25)
q50 = np.percentile(incomes, 50)
q75 = np.percentile(incomes, 75)

print("\nQuantiles:")
print(f"25%: ${q25:,.2f}")
print(f"50%: ${q50:,.2f}")
print(f"75%: ${q75:,.2f}")

# 3. Standardization (z-scores)
z_scores = (incomes - mean) / std
print("\nZ-scores:")
for income, z in zip(incomes, z_scores):
    print(f"  ${income:,}: {z:+.2f}")

# 4. Flag outliers (|Z| > 2)
outliers = incomes[np.abs(z_scores) > 2]
if len(outliers) > 0:
    print("\nOutliers (|Z| > 2):")
    for val in outliers:
        print(f"  ${val:,}")
else:
    print("\nNo outliers")

# 5. Income brackets: [0, 50000), [50000, 60000), [60000, inf)
bins = [0, 50000, 60000, np.inf]
labels = ['Low income', 'Middle income', 'High income']
income_categories = np.digitize(incomes, bins) - 1   # bracket index 0, 1, or 2

print("\nIncome brackets:")
for label_idx in range(len(labels)):
    count = np.sum(income_categories == label_idx)
    percentage = count / len(incomes) * 100
    print(f"  {labels[label_idx]}: {count} people ({percentage:.1f}%)")

Exercise 2: Pandas Data Cleaning (Basic)

Difficulty: ⭐⭐ Time: 20 minutes

python

"""
Task: clean survey data

Problems in the data:
- Missing values
- Outliers (age > 100, income < 0)
- Duplicate records

Requirements:
1. Handle missing values
2. Remove outliers
3. Remove duplicate records
4. Produce a cleaning report
"""

import pandas as pd
import numpy as np

# Raw data (contains all of the problems above)
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})

def clean_survey_data(df):
    """Clean the survey data."""
    # Your code here
    pass

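Reference solution

python

import pandas as pd
import numpy as np

def clean_survey_data(df):
    """Clean the survey data.

    Returns:
        (cleaned_df, report): the cleaned data and a report dict
    """
    report = {}
    report['original_count'] = len(df)

    print("=" * 60)
    print("Data Cleaning Report")
    print("=" * 60)
    print(f"Original data: {len(df)} rows\n")

    # 1. Check for duplicate records
    # Note: duplicated() compares all columns, so rows that share an id but
    # differ in another column are not flagged; pass subset=['id'] to match on id only.
    duplicates = df.duplicated()
    duplicate_count = duplicates.sum()
    if duplicate_count > 0:
        print(f"1. Duplicate records: {duplicate_count}")
        print(f"   Duplicated IDs: {df[duplicates]['id'].tolist()}")
        df = df.drop_duplicates()
        print(f"   After removal: {len(df)} rows\n")
    else:
        print("1. Duplicate records: none\n")

    report['duplicate_removed'] = duplicate_count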

    # 2. Missing-value analysis
    print("2. Missing values:")
    missing = df.isnull().sum()
    for col in missing.index:
        if missing[col] > 0:
            pct = missing[col] / len(df) * 100
            print(f"   {col}: {missing[col]} ({pct:.1f}%)")

    # Strategy: drop rows where a key column is missing
    before_missing = len(df)
    df = df.dropna(subset=['age', 'income'])
    after_missing = len(df)
    print(f"   Rows dropped for missing age/income: {before_missing - after_missing}")
    print(f"   Remaining: {len(df)} rows\n")

    report['missing_removed'] = before_missing - after_missing

    # 3. Outlier detection
    print("3. Outlier detection:")

    # Implausible ages
    age_outliers = (df['age'] < 18) | (df['age'] > 100)
    age_outlier_count = age_outliers.sum()
    if age_outlier_count > 0:
        print(f"   Age outliers: {age_outlier_count}")
        print(f"   Values: {df[age_outliers]['age'].tolist()}")
        df = df[~age_outliers]

    # Negative incomes
    income_outliers = df['income'] < 0
    income_outlier_count = income_outliers.sum()
    if income_outlier_count > 0:
        print(f"   Income outliers (negative): {income_outlier_count}")
        print(f"   Values: {df[income_outliers]['income'].tolist()}")
        df = df[~income_outliers]

    print(f"   After removing outliers: {len(df)} rows\n")
    report['outliers_removed'] = age_outlier_count + income_outlier_count

    # 4. Type conversion
    print("4. Type conversion:")
    df = df.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
    df['age'] = df['age'].astype(int)
    df['income'] = df['income'].astype(float)
    print(f"   age: {df['age'].dtype}")
    print(f"   income: {df['income'].dtype}\n")

    # 5. Final statistics
    report['final_count'] = len(df)
    report['removed_total'] = report['original_count'] - report['final_count']
    report['retention_rate'] = (report['final_count'] / report['original_count']) * 100

    print("Cleaning summary:")
    print(f"  Original: {report['original_count']} rows")
    print(f"  Removed: {report['removed_total']} rows")
    print(f"    - Duplicates: {report['duplicate_removed']}")
    print(f"    - Missing: {report['missing_removed']}")
    print(f"    - Outliers: {report['outliers_removed']}")
    print(f"  Kept: {report['final_count']} rows ({report['retention_rate']:.1f}%)")
    print("=" * 60)

    return df, report


# Test data
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})

# Clean
cleaned_df, report = clean_survey_data(data)

# Show the cleaned data
print("\nCleaned data:")
print(cleaned_df)

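Exercise 3: Grouping and Aggregation (Intermediate)

Difficulty: ⭐⭐⭐ Time: 30 minutes

python

"""
Task: analyze income differences across education levels

Requirements:
1. Group by education level
2. Compute statistics for each group
3. Create visualizations comparing income
4. Produce a summary report
"""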

Reference solution

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate example data
np.random.seed(42)
n = 200

data = pd.DataFrame({
    'id': range(1, n+1),
    'age': np.random.randint(25, 60, n),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n, p=[0.3, 0.4, 0.2, 0.1]),
    'gender': np.random.choice(['M', 'F'], n),
    'income': np.random.lognormal(11, 0.5, n)  # log-normal distribution
})

# Scale income by education level
education_multiplier = {
    'High School': 0.7,
    'Bachelor': 1.0,
    'Master': 1.3,
    'PhD': 1.6
}
# Vectorized lookup: map each education label to its multiplier
data['income'] = data['income'] * data['education'].map(education_multiplier)

print("Education Level and Income Analysis")
print("=" * 70)

# 1. Income statistics grouped by education level
print("\n1. Income statistics by education level:")
edu_stats = data.groupby('education')['income'].agg([
    ('样本量', 'count'),      # sample size
    ('平均收入', 'mean'),     # mean income
    ('中位数', 'median'),     # median
    ('标准差', 'std'),        # standard deviation
    ('最小值', 'min'),        # minimum
    ('最大值', 'max')         # maximum
]).round(2)

# Sort by mean income
edu_stats = edu_stats.sort_values('平均收入', ascending=False)
print(edu_stats)

# 2. Group by education and gender
print("\n2. Mean income by education and gender:")
gender_edu_stats = data.groupby(['education', 'gender'])['income'].mean().unstack()
gender_edu_stats = gender_edu_stats.loc[edu_stats.index]  # keep the same row order
print(gender_edu_stats.round(2))

# 3. Income quantiles
print("\n3. Income quantiles by education level:")
percentiles = data.groupby('education')['income'].quantile([0.25, 0.5, 0.75]).unstack()
percentiles.columns = ['25%', '50%', '75%']
percentiles = percentiles.loc[edu_stats.index]
print(percentiles.round(2))

# 4. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 4.1 Box plot
education_order = edu_stats.index.tolist()
sns.boxplot(data=data, x='education', y='income', order=education_order, ax=axes[0, 0])
axes[0, 0].set_title('Income Distribution by Education Level', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Education Level')
axes[0, 0].set_ylabel('Income')
axes[0, 0].tick_params(axis='x', rotation=45)

# 4.2 Bar chart of mean income
edu_stats['平均收入'].plot(kind='bar', ax=axes[0, 1], color='skyblue')
axes[0, 1].set_title('Average Income by Education', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Education Level')
axes[0, 1].set_ylabel('Average Income')
axes[0, 1].tick_params(axis='x', rotation=45)

# Add value labels above the bars
for i, v in enumerate(edu_stats['平均收入']):
    axes[0, 1].text(i, v + 5000, f'${v:,.0f}', ha='center')

# 4.3 Grouped by gender
gender_edu_stats.plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Income by Education and Gender', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Education Level')
axes[1, 0].set_ylabel('Average Income')
axes[1, 0].legend(title='Gender')
axes[1, 0].tick_params(axis='x', rotation=45)

# 4.4 Violin plot
sns.violinplot(data=data, x='education', y='income', order=education_order, ax=axes[1, 1])
axes[1, 1].set_title('Income Distribution (Violin Plot)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Education Level')
axes[1, 1].set_ylabel('Income')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('education_income_analysis.png', dpi=300, bbox_inches='tight')
print("\nFigure saved: education_income_analysis.png")
plt.show()

# 5. Income gap analysis (a descriptive comparison of means, not a formal statistical test)
print("\n5. Income gap analysis:")
high_school_income = data[data['education'] == 'High School']['income'].mean()
phd_income = data[data['education'] == 'PhD']['income'].mean()
income_gap = phd_income - high_school_income
gap_percentage = (income_gap / high_school_income) * 100

print(f"Mean income, High School: ${high_school_income:,.2f}")
print(f"Mean income, PhD: ${phd_income:,.2f}")
print(f"Income gap: ${income_gap:,.2f} ({gap_percentage:.1f}%)")

# 6. Build the report
report = {
    'Analysis date': pd.Timestamp.now().strftime('%Y-%m-%d'),
    'Sample size': len(data),
    'Education levels': edu_stats.index.tolist(),
    'Sample size per level': edu_stats['样本量'].tolist(),
    'Mean income': edu_stats['平均收入'].round(2).tolist(),
    'Income gap (PhD vs High School)': f'${income_gap:,.2f}',
    'Gap percentage': f'{gap_percentage:.1f}%'
}

print("\n" + "=" * 70)
print("Analysis Report")
print("=" * 70)
for key, value in report.items():
    print(f"{key}: {value}")
print("=" * 70)

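Exercise 4: Time Series Analysis (Advanced)

Difficulty: ⭐⭐⭐⭐ Time: 40 minutes

Build a system that analyzes annual income trends.

Hints

Use pd.date_range() to create the dates
Use df.resample() to aggregate over time
Use rolling() to compute moving averages
Use Matplotlib to plot the trend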
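
As a starting point rather than a full solution, the first three hints might be combined as in the sketch below. The monthly income series is synthetic (a straight line rising by 10 per month), and the `'MS'` and `'YS'` frequency aliases (month start, year start) are one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 48 months of income rising by 10 per month.
months = pd.date_range('2020-01-01', periods=48, freq='MS')
income = pd.Series(4000 + 10 * np.arange(48), index=months, name='income')

annual_mean = income.resample('YS').mean()       # one value per calendar year
rolling_12m = income.rolling(window=12).mean()   # NaN until 12 months are available
yoy_growth = annual_mean.pct_change() * 100      # year-over-year change in percent

print(annual_mean)
print(yoy_growth.round(2))
```

From here, `annual_mean.plot()` or `rolling_12m.plot()` would draw the trend with Matplotlib; real data would replace the synthetic series.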

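Next Steps

After completing this module, you have covered:

NumPy array operations and vectorization
Pandas data processing (cleaning, transformation, aggregation)
Data visualization with Matplotlib and Seaborn
A complete data analysis workflow

Congratulations on finishing Module 9! It is the core module for data analysis in Python.

Modules 10 and 11 cover machine learning and best practices.

Further Reading

Your data science journey is just beginning!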
