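6.2 Univariate Visualization

"The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey, statistician

Understanding the distribution of a single variable

Section Objectives

After completing this section, you will be able to:

Use histograms and kernel density plots to show the distribution of a continuous variable
Use box plots and violin plots to identify outliers
Use bar charts and pie charts to display categorical variables
Diagnose distribution shape (skewness, kurtosis, normality)
Choose an appropriate chart type

Visualizing Continuous Variables

1. Histogram

Purpose: show the frequency distribution of the data

How it works: split the range of the data into intervals (bins) and count the observations that fall in each bin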
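
As a minimal sketch of the binning step, the snippet below counts a handful of made-up values with `np.histogram`. The values and the choice of four bins are illustrative assumptions, picked so the counts are easy to verify by hand.

```python
import numpy as np

# Illustrative values (hypothetical, chosen so the counts are easy to check by hand)
values = np.array([1.0, 1.5, 2.2, 2.8, 3.1, 3.3, 3.9, 4.6, 5.0])

# Four equal-width bins covering [1, 5]
counts, edges = np.histogram(values, bins=4, range=(1, 5))

# Each bin is half-open [left, right), except the last one,
# which also includes its right edge
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {c} observation(s)")
```

A histogram simply draws one bar per bin with height equal to these counts.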

Basic usage

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set the plot style
sns.set_style("whitegrid")
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

# Simulate wage data: log wage is linear in education plus normal noise
np.random.seed(42)
n = 1000
education = np.random.normal(13, 3, n)
log_wage = 1.5 + 0.08*education + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)

# Histograms
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw wage (right-skewed)
axes[0].hist(wage, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].set_xlabel('Wage (thousand CNY/month)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Wage distribution (right-skewed)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Log wage (approximately normal)
axes[1].hist(log_wage, bins=30, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_xlabel('log(wage)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('log(wage) distribution (approximately normal)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Choosing bins

The trade-off: too few bins → information is lost; too many bins → noise dominates

python

# Compare different numbers of bins
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
bins_list = [5, 15, 30, 100]

for i, bins in enumerate(bins_list):
    ax = axes[i//2, i%2]
    ax.hist(wage, bins=bins, edgecolor='black', alpha=0.7)
    ax.set_title(f'bins = {bins}', fontsize=14)
    ax.set_xlabel('Wage (thousand CNY/month)')
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3, axis='y')

plt.suptitle('Effect of the number of bins', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

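Common rules:

Sturges' Rule: number of bins $k = \lceil \log_2 n + 1 \rceil$
Freedman-Diaconis Rule: bin width $h = 2 \cdot \mathrm{IQR} / n^{1/3}$
Scott's Rule: bin width $h = 3.5\,\hat{\sigma} / n^{1/3}$

For the two width-based rules, the number of bins is $\lceil (\max - \min) / h \rceil$.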

python

# Compute the number of bins suggested by each rule
from scipy.stats import iqr

n = len(wage)
bins_sturges = int(np.ceil(np.log2(n) + 1))
bins_fd = int(np.ceil((wage.max() - wage.min()) / (2 * iqr(wage) / n**(1/3))))
bins_scott = int(np.ceil((wage.max() - wage.min()) / (3.5 * wage.std() / n**(1/3))))

print(f"Sturges: {bins_sturges} bins")
print(f"Freedman-Diaconis: {bins_fd} bins")
print(f"Scott: {bins_scott} bins")

# Let the library pick the bins with bins='auto'
plt.figure(figsize=(10, 6))
plt.hist(wage, bins='auto', edgecolor='black', alpha=0.7)
plt.xlabel('Wage (thousand CNY/month)')
plt.ylabel('Frequency')
plt.title('Histogram with automatic bins', fontsize=14)
plt.grid(True, alpha=0.3, axis='y')
plt.show()
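
NumPy also exposes these rules by name through `np.histogram_bin_edges`, so they do not have to be coded by hand. The sketch below uses a simulated log-normal sample (an assumption standing in for the wage data) and reports how many bins each named rule produces. In NumPy, `'auto'` uses the narrower of the Sturges and Freedman-Diaconis bin widths.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for the wage data
sample = rng.lognormal(mean=2.5, sigma=0.4, size=1000)

bin_counts = {}
for rule in ["sturges", "fd", "scott", "auto"]:
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1  # k + 1 edges define k bins
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

The same rule names can be passed directly as the `bins` argument of `plt.hist`.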

Normalized histogram (density)

python

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Frequency histogram
axes[0].hist(wage, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Frequency histogram', fontsize=14)
axes[0].set_ylabel('Frequency')

# Density histogram
axes[1].hist(wage, bins=30, density=True, edgecolor='black', alpha=0.7)
axes[1].set_title('Density histogram', fontsize=14)
axes[1].set_ylabel('Probability density')

for ax in axes:
    ax.set_xlabel('Wage (thousand CNY/month)')
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()
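
To make the difference between the two panels concrete, the following sketch checks what `density=True` does numerically: each bar height becomes count / (n × bin width), so the bar areas sum to 1. The sample is simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=2.5, sigma=0.4, size=1000)  # hypothetical data

counts, edges = np.histogram(sample, bins=30)
density, _ = np.histogram(sample, bins=30, density=True)
widths = np.diff(edges)

# density = count / (n * bin width), so the bar areas sum to 1
manual_density = counts / (counts.sum() * widths)
total_area = np.sum(density * widths)
print(f"total area under the density histogram: {total_area:.6f}")
```

Because the heights are densities rather than counts, a density histogram can share a y-axis with a KDE curve.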

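2. Kernel Density Estimate (KDE) Plot

Advantage: a smooth density curve that is easier to read than a histogram

How it works: place a kernel function (usually a Gaussian) on every data point, then sum the kernels and normalize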
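
The idea can be sketched in a few lines of NumPy. This is a hand-rolled illustration, not the implementation used by seaborn or SciPy; the function name `gaussian_kde_manual` and the simulated sample are assumptions made for this example, and the bandwidth follows the one-dimensional form of Scott's rule.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Average of Gaussian kernels centred on each data point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(2)
data = rng.normal(loc=0.0, scale=1.0, size=300)  # hypothetical sample

# Scott-style bandwidth in one dimension: sd * n^(-1/5)
h = data.std(ddof=1) * len(data) ** (-1 / 5)

grid = np.linspace(data.min() - 5 * h, data.max() + 5 * h, 2001)
density = gaussian_kde_manual(data, grid, h)

# A density should integrate to (approximately) 1
dx = grid[1] - grid[0]
area = density.sum() * dx
print(f"bandwidth = {h:.3f}, area under curve = {area:.4f}")
```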

Basic usage

python

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Method 1: seaborn
sns.kdeplot(wage, ax=axes[0], fill=True, color='steelblue', alpha=0.6)
axes[0].set_xlabel('Wage (thousand CNY/month)')
axes[0].set_ylabel('Density')
axes[0].set_title('KDE plot (seaborn)', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Method 2: matplotlib + scipy
from scipy.stats import gaussian_kde
kde = gaussian_kde(wage)
x_range = np.linspace(wage.min(), wage.max(), 1000)
axes[1].plot(x_range, kde(x_range), linewidth=2, color='coral')
axes[1].fill_between(x_range, kde(x_range), alpha=0.3, color='coral')
axes[1].set_xlabel('Wage (thousand CNY/month)')
axes[1].set_ylabel('Density')
axes[1].set_title('KDE plot (matplotlib)', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Histogram + KDE (recommended)

python

plt.figure(figsize=(10, 6))

# Histogram (normalized to a density)
plt.hist(wage, bins=30, density=True, alpha=0.6, color='lightblue',
         edgecolor='black', label='Histogram')

# KDE curve
sns.kdeplot(wage, color='darkblue', linewidth=2, label='KDE')

plt.xlabel('Wage (thousand CNY/month)', fontsize=12)
plt.ylabel('Probability density', fontsize=12)
plt.title('Wage distribution: histogram + KDE', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Effect of bandwidth

python

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
bandwidths = [0.5, 1.0, 2.0]

for i, bw in enumerate(bandwidths):
    sns.kdeplot(wage, ax=axes[i], bw_adjust=bw, fill=True, color='steelblue')
    axes[i].set_title(f'Bandwidth multiplier = {bw}', fontsize=14)
    axes[i].set_xlabel('Wage (thousand CNY/month)')
    axes[i].set_ylabel('Density')
    axes[i].grid(True, alpha=0.3)

plt.suptitle('Effect of bandwidth on the KDE', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

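Guidelines:

Bandwidth too small → overfitting (the curve follows sampling noise)
Bandwidth too large → underfitting (the curve is over-smoothed and real features are blurred)
The default (Scott's rule) usually works well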
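
One way to see this trade-off without a plot is to count the peaks of the estimated curve. The sketch below is an illustration under stated assumptions: the helpers `kde_on_grid` and `count_peaks` are hypothetical names written for this example, the sample is simulated from a single normal distribution, and the multipliers play the role of seaborn's `bw_adjust`.

```python
import numpy as np

def kde_on_grid(data, grid, bandwidth):
    """Gaussian KDE evaluated on a grid (average of per-point kernels)."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    return (np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))).mean(axis=1)

def count_peaks(y):
    """Number of strict interior local maxima in a sampled curve."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

rng = np.random.default_rng(3)
data = rng.normal(size=300)                        # hypothetical unimodal sample
h_ref = data.std(ddof=1) * len(data) ** (-1 / 5)   # Scott-style reference bandwidth
grid = np.linspace(data.min() - 3, data.max() + 3, 4001)

peaks = {}
for mult in [0.1, 1.0, 3.0]:                       # plays the role of bw_adjust
    peaks[mult] = count_peaks(kde_on_grid(data, grid, mult * h_ref))
    print(f"multiplier {mult}: {peaks[mult]} peak(s)")
```

The data come from one bell-shaped distribution, so any extra peaks at the small multiplier are artifacts of the bandwidth rather than features of the population.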

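3. Box Plot

Purpose: show the five-number summary (minimum, Q1, median, Q3, maximum) and flag outliers

What the plot draws:

Lower whisker: the smallest observation no lower than Q1 - 1.5×IQR (this equals the sample minimum when there are no low outliers)
Q1 (first quartile, 25%)
Median (Q2, 50%)
Q3 (third quartile, 75%)
Upper whisker: the largest observation no higher than Q3 + 1.5×IQR (this equals the sample maximum when there are no high outliers)

IQR (interquartile range): $\mathrm{IQR} = Q_3 - Q_1$. Observations beyond the whiskers are drawn as individual points and treated as potential outliers.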
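
A small worked example may help separate the fences (Q1 - 1.5×IQR and Q3 + 1.5×IQR) from the whisker ends, which stop at actual observations. The ten values below are made up for illustration, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Small hypothetical sample with one obvious high value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), whiskers=({whisker_low}, {whisker_high})")
print(f"outliers={outliers}")
```

Here the upper fence is 14.5, but the upper whisker ends at 9, the largest observation below the fence; the value 30 is plotted separately as an outlier.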

Basic usage

python

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Vertical box plot
axes[0].boxplot(wage, vert=True, patch_artist=True,
                boxprops=dict(facecolor='lightblue', alpha=0.7),
                medianprops=dict(color='red', linewidth=2),
                whiskerprops=dict(linewidth=1.5),
                capprops=dict(linewidth=1.5))
axes[0].set_ylabel('Wage (thousand CNY/month)')
axes[0].set_title('Vertical box plot', fontsize=14)
axes[0].grid(True, alpha=0.3, axis='y')

# Horizontal box plot (easier to compare)
axes[1].boxplot(wage, vert=False, patch_artist=True,
                boxprops=dict(facecolor='lightcoral', alpha=0.7),
                medianprops=dict(color='darkred', linewidth=2))
axes[1].set_xlabel('Wage (thousand CNY/month)')
axes[1].set_title('Horizontal box plot', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

Box plot with data points

python

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# seaborn box plot
sns.boxplot(y=wage, ax=axes[0], color='lightblue')
axes[0].set_ylabel('Wage (thousand CNY/month)')
axes[0].set_title('Box plot (seaborn)', fontsize=14)
axes[0].grid(True, alpha=0.3, axis='y')

# Box plot with the individual points overlaid
sns.boxplot(y=wage, ax=axes[1], color='lightblue', width=0.5)
sns.stripplot(y=wage, ax=axes[1], color='black', alpha=0.3, size=3)
axes[1].set_ylabel('Wage (thousand CNY/month)')
axes[1].set_title('Box plot + data points', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Identifying outliers

python

# Flag outliers with the 1.5×IQR rule
Q1 = np.percentile(wage, 25)
Q3 = np.percentile(wage, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = wage[(wage < lower_bound) | (wage > upper_bound)]

print(f"Q1 = {Q1:.2f}, Q3 = {Q3:.2f}, IQR = {IQR:.2f}")
print(f"Outlier thresholds: < {lower_bound:.2f} or > {upper_bound:.2f}")
print(f"Number of outliers: {len(outliers)} ({len(outliers)/len(wage)*100:.1f}%)")
print(f"Outliers: {outliers[:10]}...")  # show the first 10

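4. Violin Plot

Advantage: combines the strengths of the box plot and the kernel density plot

How to read it:

The center shows a box plot
The two sides are mirrored KDE curves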

python

fig, axes = plt.subplots(1, 3, figsize=(16, 6))

# Box plot
sns.boxplot(y=wage, ax=axes[0], color='lightblue')
axes[0].set_title('Box plot', fontsize=14)
axes[0].set_ylabel('Wage (thousand CNY/month)')

# Violin plot
sns.violinplot(y=wage, ax=axes[1], color='lightgreen')
axes[1].set_title('Violin plot', fontsize=14)
axes[1].set_ylabel('')

# Violin plot + box plot (recommended)
sns.violinplot(y=wage, ax=axes[2], color='lightgreen', inner=None)
sns.boxplot(y=wage, ax=axes[2], width=0.15, color='white',
            boxprops=dict(zorder=2))
axes[2].set_title('Violin plot + box plot', fontsize=14)
axes[2].set_ylabel('')

for ax in axes:
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

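5. Cumulative Distribution Function (CDF) Plot

Purpose: show $F(x) = P(X \le x)$, the proportion of observations at or below each value $x$; well suited to comparing distributions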

python

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Empirical CDF computed by hand
sorted_wage = np.sort(wage)
cdf = np.arange(1, len(sorted_wage)+1) / len(sorted_wage)

axes[0].plot(sorted_wage, cdf, linewidth=2, color='steelblue')
axes[0].set_xlabel('Wage (thousand CNY/month)')
axes[0].set_ylabel('Cumulative probability')
axes[0].set_title('Empirical cumulative distribution function (ECDF)', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Using seaborn (more concise)
sns.ecdfplot(wage, ax=axes[1], linewidth=2, color='coral')
axes[1].set_xlabel('Wage (thousand CNY/month)')
axes[1].set_ylabel('Cumulative probability')
axes[1].set_title('ECDF (seaborn)', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Compute quantiles
print("Wage quantiles:")
for q in [0.25, 0.50, 0.75, 0.90, 0.95]:
    print(f"  P{int(q*100)}: {np.quantile(wage, q):.2f} thousand CNY")
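
The ECDF can also be read in the other direction: given a value, what share of the sample lies at or below it? The helper below, `ecdf_at`, is a hypothetical name written for this sketch, and the five-value sample is a toy example chosen so the answers can be checked by hand.

```python
import numpy as np

def ecdf_at(sample, x):
    """Empirical CDF: share of observations less than or equal to x."""
    ordered = np.sort(sample)
    # side='right' counts every value <= x, including ties
    return np.searchsorted(ordered, x, side="right") / len(ordered)

scores = np.array([1, 2, 2, 3, 5])  # hypothetical toy sample

for threshold in [0, 2, 4, 5]:
    print(f"F({threshold}) = {ecdf_at(scores, threshold):.1f}")
```

Three of the five values are at or below 2, so F(2) = 0.6; quantiles answer the inverse question.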

Visualizing Categorical Variables

1. Bar Chart

python

# Categorical data (pre-aggregated counts per region)
regions = ['East', 'Central', 'West', 'Northeast']
counts = [450, 280, 190, 80]

df_region = pd.DataFrame({'region': regions, 'count': counts})
df_region['percentage'] = df_region['count'] / df_region['count'].sum() * 100

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Bar chart of counts
axes[0].bar(df_region['region'], df_region['count'],
           color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
           edgecolor='black', alpha=0.8)
axes[0].set_ylabel('Number of people', fontsize=12)
axes[0].set_title('Sample size by region', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Add value labels above the bars
for i, (region, count) in enumerate(zip(df_region['region'], df_region['count'])):
    axes[0].text(i, count + 10, str(count), ha='center', fontsize=11)

# Bar chart of percentages
axes[1].bar(df_region['region'], df_region['percentage'],
           color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
           edgecolor='black', alpha=0.8)
axes[1].set_ylabel('Percentage (%)', fontsize=12)
axes[1].set_title('Share by region', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Add percentage labels above the bars
for i, (region, pct) in enumerate(zip(df_region['region'], df_region['percentage'])):
    axes[1].text(i, pct + 1, f'{pct:.1f}%', ha='center', fontsize=11)

plt.tight_layout()
plt.show()
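
The example above starts from counts that are already aggregated. With raw data, the counts have to be tallied first; the sketch below does this with `np.unique` on a made-up column of ten region labels (an assumption for illustration). With a pandas Series, `value_counts()` gives the same tallies.

```python
import numpy as np

# Hypothetical raw survey column: one region label per respondent
raw_regions = np.array(["East", "West", "East", "Central", "East",
                        "Northeast", "Central", "East", "West", "East"])

labels, counts = np.unique(raw_regions, return_counts=True)
shares = counts / counts.sum() * 100

# Sort from most to least frequent before plotting
order = np.argsort(-counts)
for label, count, share in zip(labels[order], counts[order], shares[order]):
    print(f"{label:<10} {count:>2}  {share:5.1f}%")
```

The sorted `labels[order]` and `counts[order]` arrays can be passed straight to `plt.bar`.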

Horizontal bar chart (better when there are many categories)

python

# Sort by count (ascending, so the largest bar is drawn at the top)
df_sorted = df_region.sort_values('count')

plt.figure(figsize=(10, 6))
plt.barh(df_sorted['region'], df_sorted['count'],
         color='steelblue', edgecolor='black', alpha=0.8)
plt.xlabel('Number of people', fontsize=12)
plt.title('Sample size by region (largest at top)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')

# Add value labels at the end of each bar
for i, (region, count) in enumerate(zip(df_sorted['region'], df_sorted['count'])):
    plt.text(count + 10, i, str(count), va='center', fontsize=11)

plt.tight_layout()
plt.show()

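2. Pie Chart

Note: a pie chart suits part-to-whole relationships, but it is less precise than a bar chart because slice sizes are harder to compare than bar lengths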

python

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Basic pie chart
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
axes[0].pie(df_region['count'], labels=df_region['region'], colors=colors,
           autopct='%1.1f%%', startangle=90)
axes[0].set_title('Share by region (pie chart)', fontsize=14, fontweight='bold')

# Pull one slice out for emphasis
explode = (0.1, 0, 0, 0)  # offset the first slice
axes[1].pie(df_region['count'], labels=df_region['region'], colors=colors,
           autopct='%1.1f%%', startangle=90, explode=explode,
           shadow=True)
axes[1].set_title('Share by region (first slice emphasized)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

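Recommendations:

Use a pie chart only when there are few categories (≤ 5)
Use it to emphasize a part-to-whole relationship
Avoid 3D pie charts (they distort proportions)
Switch to a bar chart when there are many categories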

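Diagnosing Distributions

1. Q-Q Plot (Quantile-Quantile Plot)

Purpose: check whether the data follow a specific distribution (usually the normal distribution)

How it works:

X-axis: theoretical quantiles
Y-axis: sample quantiles
Points falling on the straight line → the data are consistent with that distribution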
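
The construction can be sketched by hand: sort the sample, pair each value with the standard-normal quantile at a matching probability, and measure how close the pairs are to a straight line. This is an illustration under stated assumptions: `qq_correlation` is a hypothetical helper, the data are simulated, and the plotting positions (i - 0.5)/n are a simple choice that differs slightly from the positions `scipy.stats.probplot` uses.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(sample):
    """Correlation between sorted sample values and standard-normal quantiles."""
    n = len(sample)
    # Plotting positions (i - 0.5) / n: a simple choice assumed for this sketch
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(theoretical, np.sort(sample))[0, 1]

rng = np.random.default_rng(4)
log_values = rng.normal(loc=2.5, scale=0.8, size=500)  # hypothetical normal data
raw_values = np.exp(log_values)                         # right-skewed counterpart

r_log = qq_correlation(log_values)
r_raw = qq_correlation(raw_values)
print(f"Q-Q correlation, log scale: {r_log:.4f}")
print(f"Q-Q correlation, raw scale: {r_raw:.4f}")
```

A correlation near 1 means the points hug the line; the skewed raw values give a visibly lower figure than their logs.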

python

from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Q-Q plot of the right-skewed variable
stats.probplot(wage, dist="norm", plot=axes[0])
axes[0].set_title('Q-Q plot of wage (not normal)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Q-Q plot after the log transform
stats.probplot(log_wage, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q plot of log(wage) (approximately normal)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

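How to read it:

S-shaped curve (lower end below the line, upper end above it): heavy tails
Inverted S-shaped curve (lower end above the line, upper end below it): light tails
Curve bending upward (convex): right skew
Curve bending downward (concave): left skew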

2. Skewness and Kurtosis

python

from scipy.stats import skew, kurtosis

# Compute skewness and kurtosis
skewness_wage = skew(wage)
kurtosis_wage = kurtosis(wage, fisher=True)  # fisher=True returns excess kurtosis (normal = 0)

skewness_log = skew(log_wage)
kurtosis_log = kurtosis(log_wage, fisher=True)

print("Wage distribution:")
print(f"  Skewness = {skewness_wage:.3f} (right-skewed)")
print(f"  Excess kurtosis = {kurtosis_wage:.3f}")

print("\nlog(wage) distribution:")
print(f"  Skewness = {skewness_log:.3f} (approximately symmetric)")
print(f"  Excess kurtosis = {kurtosis_log:.3f}")

# Visualize both distributions with their statistics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, data, title in zip(axes, [wage, log_wage], ['Wage', 'log(wage)']):
    ax.hist(data, bins=30, density=True, alpha=0.6, edgecolor='black')
    sns.kdeplot(data, ax=ax, color='red', linewidth=2)

    # Annotate with the statistics
    sk = skew(data)
    ku = kurtosis(data, fisher=True)
    ax.text(0.02, 0.95, f'Skewness = {sk:.3f}\nExcess kurtosis = {ku:.3f}',
           transform=ax.transAxes, fontsize=12, verticalalignment='top',
           bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

    ax.set_title(f'{title} distribution', fontsize=14)
    ax.set_xlabel(title)
    ax.set_ylabel('Density')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

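Rules of thumb (a common convention; the cutoffs are not strict):

Skewness:
- $|\text{skewness}| < 0.5$: approximately symmetric
- $0.5 \le |\text{skewness}| < 1$: moderately skewed
- $|\text{skewness}| \ge 1$: highly skewed
Kurtosis (excess kurtosis):
- $\approx 0$: similar to the normal distribution
- $> 0$: sharp peak (heavy tails)
- $< 0$: flat peak (light tails)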
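
Both statistics are ratios of central moments, which the following sketch computes directly. The function names and the two five-value samples are assumptions made for illustration; the formulas are the moment-based (biased) versions, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` return by default.

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness: m3 / m2^1.5."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d**3) / np.mean(d**2) ** 1.5

def excess_kurtosis(x):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (zero for a normal)."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d**4) / np.mean(d**2) ** 2 - 3

toy = [1, 2, 3, 4, 10]          # hypothetical sample with a long right tail
symmetric = [1, 2, 3, 4, 5]     # hypothetical symmetric sample

print(f"toy:       skewness = {skewness(toy):.3f}, excess kurtosis = {excess_kurtosis(toy):.3f}")
print(f"symmetric: skewness = {skewness(symmetric):.3f}, excess kurtosis = {excess_kurtosis(symmetric):.3f}")
```

For the toy sample the central moments are m2 = 10, m3 = 36, and m4 = 278.8, giving a skewness of about 1.138 (highly skewed by the rule of thumb) and an excess kurtosis of -0.212.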

Case Study: Analyzing an Income Distribution

python

# Simulate more realistic income data
np.random.seed(2024)
n = 5000

# Mixture distribution: the bulk of earners plus a high-income group
income_low = np.random.lognormal(mean=2.5, sigma=0.5, size=int(n*0.9))
income_high = np.random.lognormal(mean=3.5, sigma=0.3, size=int(n*0.1))
income = np.concatenate([income_low, income_high])

# Full univariate analysis
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# 1. Histogram + KDE
ax1 = fig.add_subplot(gs[0, :2])
ax1.hist(income, bins=50, density=True, alpha=0.6, color='lightblue', edgecolor='black')
sns.kdeplot(income, ax=ax1, color='darkblue', linewidth=2)
ax1.set_xlabel('Income (10,000 CNY)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('Income distribution: histogram + KDE', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Mark the mean and the median
mean_income = income.mean()
median_income = np.median(income)
ax1.axvline(mean_income, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_income:.2f}')
ax1.axvline(median_income, color='green', linestyle='--', linewidth=2, label=f'Median = {median_income:.2f}')
ax1.legend(fontsize=11)

# 2. Box plot
ax2 = fig.add_subplot(gs[0, 2])
ax2.boxplot(income, vert=True, patch_artist=True,
           boxprops=dict(facecolor='lightgreen', alpha=0.7),
           medianprops=dict(color='red', linewidth=2))
ax2.set_ylabel('Income (10,000 CNY)', fontsize=12)
ax2.set_title('Box plot', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

# 3. Violin plot
ax3 = fig.add_subplot(gs[1, 0])
sns.violinplot(y=income, ax=ax3, color='lightcoral', inner='box')
ax3.set_ylabel('Income (10,000 CNY)', fontsize=12)
ax3.set_title('Violin plot', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3, axis='y')

# 4. ECDF
ax4 = fig.add_subplot(gs[1, 1])
sns.ecdfplot(income, ax=ax4, linewidth=2, color='purple')
ax4.set_xlabel('Income (10,000 CNY)', fontsize=12)
ax4.set_ylabel('Cumulative probability', fontsize=12)
ax4.set_title('Cumulative distribution function', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)

# Mark the key quantiles
for q in [0.25, 0.50, 0.75]:
    val = np.quantile(income, q)
    ax4.plot(val, q, 'ro', markersize=8)
    ax4.text(val, q+0.05, f'P{int(q*100)}', fontsize=10)

# 5. Q-Q plot
ax5 = fig.add_subplot(gs[1, 2])
stats.probplot(income, dist="norm", plot=ax5)
ax5.set_title('Q-Q plot (vs. normal distribution)', fontsize=14, fontweight='bold')
ax5.grid(True, alpha=0.3)

# 6. Distribution after the log transform
ax6 = fig.add_subplot(gs[2, 0])
log_income = np.log(income)
ax6.hist(log_income, bins=50, density=True, alpha=0.6, color='lightyellow', edgecolor='black')
sns.kdeplot(log_income, ax=ax6, color='orange', linewidth=2)
ax6.set_xlabel('log(income)', fontsize=12)
ax6.set_ylabel('Density', fontsize=12)
ax6.set_title('Distribution after log transform', fontsize=14, fontweight='bold')
ax6.grid(True, alpha=0.3)

# 7. Q-Q plot after the log transform
ax7 = fig.add_subplot(gs[2, 1])
stats.probplot(log_income, dist="norm", plot=ax7)
ax7.set_title('Q-Q plot of log(income)', fontsize=14, fontweight='bold')
ax7.grid(True, alpha=0.3)

# 8. Table of descriptive statistics (monetary values in units of 10,000 CNY)
ax8 = fig.add_subplot(gs[2, 2])
ax8.axis('off')

stats_data = [
    ['Sample size', f'{len(income):,}'],
    ['Mean', f'{income.mean():.2f}'],
    ['Median', f'{np.median(income):.2f}'],
    ['Std. dev.', f'{income.std():.2f}'],
    ['Minimum', f'{income.min():.2f}'],
    ['Maximum', f'{income.max():.2f}'],
    ['Skewness', f'{skew(income):.3f}'],
    ['Excess kurtosis', f'{kurtosis(income, fisher=True):.3f}'],
    ['Q1', f'{np.quantile(income, 0.25):.2f}'],
    ['Q3', f'{np.quantile(income, 0.75):.2f}'],
    ['IQR', f'{np.quantile(income, 0.75) - np.quantile(income, 0.25):.2f}']
]

table = ax8.table(cellText=stats_data, colLabels=['Statistic', 'Value'],
                 cellLoc='left', loc='center',
                 bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)

# Style the header row
for i in range(2):
    table[(0, i)].set_facecolor('#4CAF50')
    table[(0, i)].set_text_props(weight='bold', color='white')

ax8.set_title('Descriptive statistics (10,000 CNY)', fontsize=14, fontweight='bold', pad=20)

plt.suptitle('Full univariate analysis of the income data', fontsize=18, fontweight='bold', y=0.995)
plt.show()

# Print a detailed report (monetary values in units of 10,000 CNY)
print("\nIncome data analysis report")
print("="*60)
print(f"Sample size: {len(income):,}")
print(f"Mean: {income.mean():.2f}")
print(f"Median: {np.median(income):.2f}")
print(f"Standard deviation: {income.std():.2f}")
print(f"Coefficient of variation: {income.std()/income.mean():.2f}")
print(f"\nSkewness: {skew(income):.3f} ({'right-skewed' if skew(income) > 0 else 'left-skewed'})")
print(f"Excess kurtosis: {kurtosis(income, fisher=True):.3f}")
print(f"\nQuantiles:")
for q in [0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f"  P{int(q*100):2d}: {np.quantile(income, q):6.2f}")

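Section Summary

Chart Selection Guide

Goal	Recommended chart	Python code
Inspect the shape of a distribution	Histogram + KDE	`plt.hist()` + `sns.kdeplot()`
Identify outliers	Box plot	`plt.boxplot()` or `sns.boxplot()`
Compare distributions in detail	Violin plot	`sns.violinplot()`
Check normality	Q-Q plot	`stats.probplot()`
Compare quantiles	ECDF	`sns.ecdfplot()`
Show category frequencies	Bar chart	`plt.bar()`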

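Key Takeaways

Always plot first: summary numbers can mislead, and a chart reveals what they hide
Choose bins carefully: too few lose information, too many add noise
Consider transformations: try a log transform for right-skewed data
Look from several angles: combine multiple chart types to understand the data fully
Watch for outliers: the box plot is the most direct tool for spotting them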
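
As a quick numerical illustration of the log-transform advice, the sketch below compares the gap between mean and median before and after taking logs. The log-normal sample is simulated for this example (it is not the case-study data), and `mean_median_gap` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical right-skewed incomes (log-normal), not the case-study data
sim_income = rng.lognormal(mean=2.5, sigma=0.6, size=2000)
sim_log_income = np.log(sim_income)

def mean_median_gap(x):
    """Gap between mean and median, in units of the standard deviation."""
    return (np.mean(x) - np.median(x)) / np.std(x)

gap_raw = mean_median_gap(sim_income)
gap_log = mean_median_gap(sim_log_income)
print(f"raw scale: (mean - median) / sd = {gap_raw:.3f}")
print(f"log scale: (mean - median) / sd = {gap_log:.3f}")
```

On the raw scale the long right tail pulls the mean well above the median; after the transform the two nearly coincide, which is what a roughly symmetric distribution looks like.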

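Coming Up Next

In the next section, we will cover:

Scatter plots and visualizing correlation
Charts for continuous vs. categorical variables
Scatter plot matrices
Correlation matrix heatmaps

From one variable to two: exploring the relationships between variables!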
