6.2 Univariate Visualization

"The greatest value of a picture is when it forces us to notice what we never expected to see."— John Tukey, Statistician

Understanding the distributional characteristics of a single variable

Section Objectives

After completing this section, you will be able to:

Use histograms and kernel density plots to display continuous variable distributions
Use box plots and violin plots to identify outliers
Use bar charts and pie charts to display categorical variables
Diagnose distribution shapes (skewness, kurtosis, normality)
Choose appropriate chart types

Continuous Variable Visualization

1. Histogram

Purpose: Display frequency distribution of data

Principle: Divide data into bins and count frequency in each bin
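
To make the binning step concrete, here is a minimal sketch that counts a handful of made-up values with `np.histogram`. The sample and the choice of three bins over the range 1 to 7 are assumptions for illustration only, not part of the wage example that follows.

```python
import numpy as np

# Small illustrative sample (made-up values, not the wage data used below)
data = np.array([1.2, 1.9, 2.1, 2.4, 2.8, 3.0, 3.3, 3.7, 4.5, 6.8])

# Split the range [1, 7] into 3 equal-width bins with edges 1, 3, 5, 7.
# Every bin includes its left edge; only the last bin also includes its right edge.
counts, edges = np.histogram(data, bins=3, range=(1, 7))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} observations")
```

`plt.hist()` performs the same counting and then draws one bar per bin, with bar height equal to the count.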

Basic Usage

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

# Generate wage data
np.random.seed(42)
n = 1000
education = np.random.normal(13, 3, n)
log_wage = 1.5 + 0.08*education + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)

# Histogram
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original wage (right-skewed distribution)
axes[0].hist(wage, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].set_xlabel('Wage (thousand yuan/month)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Wage Distribution (Right-Skewed)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Log wage (approximately normal)
axes[1].hist(log_wage, bins=30, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_xlabel('log(Wage)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('log(Wage) Distribution (Approximately Normal)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Choosing Bins

Problem: Too few bins → information loss; too many bins → excessive noise

python

# Compare different bins
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
bins_list = [5, 15, 30, 100]

for i, bins in enumerate(bins_list):
    ax = axes[i//2, i%2]
    ax.hist(wage, bins=bins, edgecolor='black', alpha=0.7)
    ax.set_title(f'bins = {bins}', fontsize=14)
    ax.set_xlabel('Wage (thousand yuan/month)')
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3, axis='y')

plt.suptitle('Impact of Different Bins', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

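Common Rules (here $n$ is the sample size, $k$ the number of bins, and $h$ the bin width, so that $k = \lceil (\max - \min) / h \rceil$):

Sturges' Rule: $k = \lceil \log_2 n + 1 \rceil$
Freedman-Diaconis Rule: $h = 2 \cdot \mathrm{IQR} / n^{1/3}$
Scott's Rule: $h = 3.5\,\hat{\sigma} / n^{1/3}$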

python

# Automatic bin selection
from scipy.stats import iqr

n = len(wage)
bins_sturges = int(np.ceil(np.log2(n) + 1))
bins_fd = int(np.ceil((wage.max() - wage.min()) / (2 * iqr(wage) / n**(1/3))))
bins_scott = int(np.ceil((wage.max() - wage.min()) / (3.5 * wage.std() / n**(1/3))))

print(f"Sturges: {bins_sturges} bins")
print(f"Freedman-Diaconis: {bins_fd} bins")
print(f"Scott: {bins_scott} bins")

# Use 'auto' for automatic selection
plt.figure(figsize=(10, 6))
plt.hist(wage, bins='auto', edgecolor='black', alpha=0.7)
plt.xlabel('Wage (thousand yuan/month)')
plt.ylabel('Frequency')
plt.title('Using Auto Bins', fontsize=14)
plt.grid(True, alpha=0.3, axis='y')
plt.show()
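
As a cross-check on the manual formulas, NumPy can apply the same named rules through `np.histogram_bin_edges`. The sketch below regenerates the simulated wage data so that it runs on its own. NumPy's `'scott'` rule uses a constant of about 3.49 rather than 3.5, so its count may differ slightly from the manual calculation, and `'auto'` picks the narrower of the Sturges and Freedman-Diaconis bin widths.

```python
import numpy as np

# Regenerate the simulated wage data from the earlier example
np.random.seed(42)
n = 1000
education = np.random.normal(13, 3, n)
log_wage = 1.5 + 0.08 * education + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)

# Ask NumPy for the bin edges under each rule and count the resulting bins
bin_counts = {}
for rule in ["sturges", "fd", "scott", "auto"]:
    edges = np.histogram_bin_edges(wage, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

The same rule names can be passed directly as `bins=` to `plt.hist()`.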

Normalized Histogram (Density Plot)

python

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Frequency histogram
axes[0].hist(wage, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Frequency Histogram', fontsize=14)
axes[0].set_ylabel('Frequency')

# Density histogram
axes[1].hist(wage, bins=30, density=True, edgecolor='black', alpha=0.7)
axes[1].set_title('Density Histogram', fontsize=14)
axes[1].set_ylabel('Probability Density')

for ax in axes:
    ax.set_xlabel('Wage (thousand yuan/month)')
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()
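
What `density=True` does can be verified numerically: each bar height becomes count / (n × bin width), so the bar areas add up to 1. The following is a small sketch on an illustrative lognormal sample (the sample is an assumption, chosen only to be right-skewed).

```python
import numpy as np

np.random.seed(42)
sample = np.random.lognormal(mean=2.5, sigma=0.4, size=1000)  # illustrative right-skewed data

counts, edges = np.histogram(sample, bins=30)
density, _ = np.histogram(sample, bins=30, density=True)
widths = np.diff(edges)

# Density height = count / (n * bin width), so the bar areas sum to 1
manual_density = counts / (counts.sum() * widths)
total_area = np.sum(density * widths)

print("Manual formula matches density=True:", np.allclose(density, manual_density))
print(f"Total area under the bars: {total_area:.6f}")
```

Because the area is fixed at 1, a density histogram can be overlaid with a KDE curve on the same axis, which a frequency histogram cannot.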

2. Kernel Density Estimate (KDE)

Advantage: Produces a smooth density curve whose shape does not depend on where bin edges fall

Principle: Place a kernel function (typically Gaussian) at each data point, then average the kernels to obtain the density estimate
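
The principle can be written as $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$, where $K$ is the kernel and $h$ the bandwidth. The sketch below implements this formula directly with NumPy. The helper name `gaussian_kde_manual` and the sample are hypothetical, and the bandwidth uses Scott's rule for one-dimensional data, $h = \hat{\sigma}\, n^{-1/5}$.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid` (hypothetical helper for illustration)."""
    # One row per grid point, one column per observation
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and divide by the bandwidth so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

np.random.seed(42)
data = np.random.normal(loc=10, scale=2, size=200)  # illustrative sample

h = data.std(ddof=1) * len(data) ** (-1 / 5)  # Scott's rule in one dimension
grid = np.linspace(data.min() - 4 * h, data.max() + 4 * h, 2000)
density = gaussian_kde_manual(data, grid, h)

# Riemann-sum approximation of the area under the estimated curve
area = np.sum(density) * (grid[1] - grid[0])
print(f"bandwidth = {h:.3f}, area under curve = {area:.4f}")
```

In practice `sns.kdeplot()` or `scipy.stats.gaussian_kde` does this work; the manual version is only meant to show what those functions compute.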

Basic Usage

python

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Method 1: seaborn
sns.kdeplot(wage, ax=axes[0], fill=True, color='steelblue', alpha=0.6)
axes[0].set_xlabel('Wage (thousand yuan/month)')
axes[0].set_ylabel('Density')
axes[0].set_title('KDE Plot (seaborn)', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Method 2: matplotlib + scipy
from scipy.stats import gaussian_kde
kde = gaussian_kde(wage)
x_range = np.linspace(wage.min(), wage.max(), 1000)
axes[1].plot(x_range, kde(x_range), linewidth=2, color='coral')
axes[1].fill_between(x_range, kde(x_range), alpha=0.3, color='coral')
axes[1].set_xlabel('Wage (thousand yuan/month)')
axes[1].set_ylabel('Density')
axes[1].set_title('KDE Plot (matplotlib)', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Histogram + KDE (Recommended)

python

plt.figure(figsize=(10, 6))

# Histogram (normalized)
plt.hist(wage, bins=30, density=True, alpha=0.6, color='lightblue',
         edgecolor='black', label='Histogram')

# KDE curve
sns.kdeplot(wage, color='darkblue', linewidth=2, label='KDE')

plt.xlabel('Wage (thousand yuan/month)', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('Wage Distribution: Histogram + KDE', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Impact of Bandwidth

python

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
bandwidths = [0.5, 1.0, 2.0]

for i, bw in enumerate(bandwidths):
    sns.kdeplot(wage, ax=axes[i], bw_adjust=bw, fill=True, color='steelblue')
    axes[i].set_title(f'Bandwidth Multiplier = {bw}', fontsize=14)
    axes[i].set_xlabel('Wage (thousand yuan/month)')
    axes[i].set_ylabel('Density')
    axes[i].grid(True, alpha=0.3)

plt.suptitle('Impact of Bandwidth on KDE', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

Selection Advice:

Bandwidth too small → overfitting (noise)
Bandwidth too large → underfitting (over-smoothing)
Default value (Scott's rule) usually works well
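
One way to quantify this advice is to count the bumps in the estimated curve as the bandwidth changes. The sketch below is an illustration under stated assumptions: the helpers `kde_on_grid` and `count_local_maxima` are hypothetical, the sample is simulated, and the multipliers mirror the `bw_adjust` values used above by scaling a Scott's-rule bandwidth.

```python
import numpy as np

def kde_on_grid(data, grid, bandwidth):
    """Gaussian KDE evaluated on a grid (hypothetical helper)."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

def count_local_maxima(y):
    """Count grid points that are higher than both of their neighbours."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

np.random.seed(42)
data = np.random.lognormal(mean=2.5, sigma=0.4, size=300)  # illustrative right-skewed data
scott_bw = data.std(ddof=1) * len(data) ** (-1 / 5)        # Scott's rule in one dimension
grid = np.linspace(data.min(), data.max(), 1000)

results = {}
for multiplier in [0.5, 1.0, 2.0]:
    density = kde_on_grid(data, grid, multiplier * scott_bw)
    results[multiplier] = (count_local_maxima(density), density.max())
    bumps, peak = results[multiplier]
    print(f"multiplier {multiplier}: {bumps} local maxima, peak density {peak:.4f}")
```

A smaller bandwidth tends to give more local maxima and a taller peak, while a larger one flattens the curve and can hide real features such as a second mode.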

3. Box Plot

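Purpose: Display a five-number summary (lower whisker, Q1, median, Q3, upper whisker) + outliers

Five-Number Summary (as drawn by a box plot):

Lower whisker: smallest observation at or above Q1 - 1.5×IQR (not necessarily the sample minimum)
Q1 (first quartile, 25%)
Median (Q2, 50%)
Q3 (third quartile, 75%)
Upper whisker: largest observation at or below Q3 + 1.5×IQR (not necessarily the sample maximum)

Observations beyond the whiskers are drawn as individual points and treated as potential outliers.

IQR (Interquartile Range): IQR = Q3 - Q1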
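
The sketch below computes the box plot elements by hand for a small made-up sample, to show that the whiskers stop at the most extreme observations inside the 1.5×IQR fences rather than at the fences themselves. The factor 1.5 matches the default `whis=1.5` in matplotlib; the data values are assumptions for illustration.

```python
import numpy as np

# Made-up sample with one clearly extreme value
data = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is plotted as an individual outlier point
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Whiskers: [{whisker_low}, {whisker_high}]")
print(f"Outliers: {outliers}")
```

Here the upper fence is 14 but the upper whisker ends at 9, the largest value below the fence, and 30 is flagged as an outlier.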

Basic Usage

python

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Vertical box plot
axes[0].boxplot(wage, vert=True, patch_artist=True,
                boxprops=dict(facecolor='lightblue', alpha=0.7),
                medianprops=dict(color='red', linewidth=2),
                whiskerprops=dict(linewidth=1.5),
                capprops=dict(linewidth=1.5))
axes[0].set_ylabel('Wage (thousand yuan/month)')
axes[0].set_title('Vertical Box Plot', fontsize=14)
axes[0].grid(True, alpha=0.3, axis='y')

# Horizontal box plot (easier to compare)
axes[1].boxplot(wage, vert=False, patch_artist=True,
                boxprops=dict(facecolor='lightcoral', alpha=0.7),
                medianprops=dict(color='darkred', linewidth=2))
axes[1].set_xlabel('Wage (thousand yuan/month)')
axes[1].set_title('Horizontal Box Plot', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

Box Plot with Data Points

python

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# seaborn box plot
sns.boxplot(y=wage, ax=axes[0], color='lightblue')
axes[0].set_ylabel('Wage (thousand yuan/month)')
axes[0].set_title('Box Plot (seaborn)', fontsize=14)
axes[0].grid(True, alpha=0.3, axis='y')

# Box plot + scatter
sns.boxplot(y=wage, ax=axes[1], color='lightblue', width=0.5)
sns.stripplot(y=wage, ax=axes[1], color='black', alpha=0.3, size=3)
axes[1].set_ylabel('Wage (thousand yuan/month)')
axes[1].set_title('Box Plot + Data Points', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Identifying Outliers

python

# Calculate outliers
Q1 = np.percentile(wage, 25)
Q3 = np.percentile(wage, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = wage[(wage < lower_bound) | (wage > upper_bound)]

print(f"Q1 = {Q1:.2f}, Q3 = {Q3:.2f}, IQR = {IQR:.2f}")
print(f"Outlier range: < {lower_bound:.2f} or > {upper_bound:.2f}")
print(f"Number of outliers: {len(outliers)} ({len(outliers)/len(wage)*100:.1f}%)")
print(f"Outliers: {outliers[:10]}...")  # Show first 10

4. Violin Plot

Advantage: Combines the benefits of box plots and kernel density plots

Interpretation:

Middle section is a box plot
Sides are mirrored KDEs

python

fig, axes = plt.subplots(1, 3, figsize=(16, 6))

# Box plot
sns.boxplot(y=wage, ax=axes[0], color='lightblue')
axes[0].set_title('Box Plot', fontsize=14)
axes[0].set_ylabel('Wage (thousand yuan/month)')

# Violin plot
sns.violinplot(y=wage, ax=axes[1], color='lightgreen')
axes[1].set_title('Violin Plot', fontsize=14)
axes[1].set_ylabel('')

# Violin plot + box plot (recommended)
sns.violinplot(y=wage, ax=axes[2], color='lightgreen', inner=None)
sns.boxplot(y=wage, ax=axes[2], width=0.15, color='white',
            boxprops=dict(zorder=2))
axes[2].set_title('Violin Plot + Box Plot', fontsize=14)
axes[2].set_ylabel('')

for ax in axes:
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

5. Cumulative Distribution Function (CDF) Plot

Purpose: Display $F(x) = P(X \le x)$, the proportion of observations at or below each value $x$; suitable for comparing distributions

python

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Empirical CDF
sorted_wage = np.sort(wage)
cdf = np.arange(1, len(sorted_wage)+1) / len(sorted_wage)

axes[0].plot(sorted_wage, cdf, linewidth=2, color='steelblue')
axes[0].set_xlabel('Wage (thousand yuan/month)')
axes[0].set_ylabel('Cumulative Probability')
axes[0].set_title('Empirical Cumulative Distribution Function (ECDF)', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Using seaborn (more concise)
sns.ecdfplot(wage, ax=axes[1], linewidth=2, color='coral')
axes[1].set_xlabel('Wage (thousand yuan/month)')
axes[1].set_ylabel('Cumulative Probability')
axes[1].set_title('ECDF (seaborn)', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate quantiles
print("Wage Quantiles:")
for q in [0.25, 0.50, 0.75, 0.90, 0.95]:
    print(f"  P{int(q*100)}: {np.quantile(wage, q):.2f} thousand yuan")
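
Reading an ECDF in the other direction, from a value to a cumulative share, is a one-line lookup on the sorted sample. The helper `ecdf_at` below is a hypothetical name and the five values are made up; it follows the same definition as the manual ECDF above (share of observations less than or equal to x).

```python
import numpy as np

def ecdf_at(sample, x):
    """Share of observations less than or equal to x (hypothetical helper)."""
    sorted_sample = np.sort(sample)
    # side="right" returns the number of values <= x
    return np.searchsorted(sorted_sample, x, side="right") / len(sorted_sample)

sample = np.array([3.0, 5.0, 5.0, 8.0, 12.0])  # made-up values

print(ecdf_at(sample, 5.0))   # 3 of the 5 values are <= 5
print(ecdf_at(sample, 2.0))   # no value is <= 2
print(ecdf_at(sample, 12.0))  # every value is <= 12
```

Quantiles answer the inverse question: `np.quantile(sample, q)` returns the value below which a share q of the data falls.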

Categorical Variable Visualization

1. Bar Chart

python

# Generate categorical data
np.random.seed(42)
regions = ['East', 'Central', 'West', 'Northeast']
counts = [450, 280, 190, 80]

df_region = pd.DataFrame({'region': regions, 'count': counts})
df_region['percentage'] = df_region['count'] / df_region['count'].sum() * 100

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Frequency bar chart
axes[0].bar(df_region['region'], df_region['count'],
           color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
           edgecolor='black', alpha=0.8)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Sample Size by Region', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Add value labels
for i, (region, count) in enumerate(zip(df_region['region'], df_region['count'])):
    axes[0].text(i, count + 10, str(count), ha='center', fontsize=11)

# Percentage bar chart
axes[1].bar(df_region['region'], df_region['percentage'],
           color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
           edgecolor='black', alpha=0.8)
axes[1].set_ylabel('Percentage (%)', fontsize=12)
axes[1].set_title('Proportion by Region', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Add percentage labels
for i, (region, pct) in enumerate(zip(df_region['region'], df_region['percentage'])):
    axes[1].text(i, pct + 1, f'{pct:.1f}%', ha='center', fontsize=11)

plt.tight_layout()
plt.show()

Horizontal Bar Chart (better for many categories)

python

# Sort by value
df_sorted = df_region.sort_values('count')

plt.figure(figsize=(10, 6))
plt.barh(df_sorted['region'], df_sorted['count'],
         color='steelblue', edgecolor='black', alpha=0.8)
plt.xlabel('Count', fontsize=12)
plt.title('Sample Size by Region (Sorted)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')

# Add values
for i, (region, count) in enumerate(zip(df_sorted['region'], df_sorted['count'])):
    plt.text(count + 10, i, str(count), va='center', fontsize=11)

plt.tight_layout()
plt.show()

2. Pie Chart

Note: Pie charts are suitable for showing part-whole relationships, but are less precise than bar charts

python

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Basic pie chart
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
axes[0].pie(df_region['count'], labels=df_region['region'], colors=colors,
           autopct='%1.1f%%', startangle=90)
axes[0].set_title('Regional Distribution (Pie Chart)', fontsize=14, fontweight='bold')

# Highlight a slice
explode = (0.1, 0, 0, 0)  # Highlight first slice
axes[1].pie(df_region['count'], labels=df_region['region'], colors=colors,
           autopct='%1.1f%%', startangle=90, explode=explode,
           shadow=True)
axes[1].set_title('Regional Distribution (Highlight East)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

Recommendations:

Use for few categories (≤ 5)
Emphasize part-whole relationship
Avoid 3D pie charts (distorts proportions)
Use bar charts for many categories

Distribution Diagnostics

1. Q-Q Plot (Quantile-Quantile Plot)

Purpose: Check visually whether data follow a specific distribution (usually the normal distribution)

Principle:

X-axis: Theoretical quantiles
Y-axis: Sample quantiles
If the points lie close to a straight line → the data are consistent with that distribution
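
The points of a normal Q-Q plot can be built by hand, which shows where the straight line comes from. The sketch below is an illustration under stated assumptions: it uses the plotting positions (i - 0.5)/n, which is one common convention, whereas `scipy.stats.probplot` uses a slightly different formula (Filliben's estimate), and it takes the normal quantile function from the standard library's `statistics.NormalDist` to keep dependencies small.

```python
import numpy as np
from statistics import NormalDist

np.random.seed(42)
sample = np.random.normal(loc=5, scale=2, size=500)  # illustrative normal sample

n = len(sample)
# Plotting positions (i - 0.5) / n: an assumed convention, not the one probplot uses
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])  # x-axis
ordered = np.sort(sample)                                         # y-axis

# For normal data the points lie close to the line y = mean + sd * x
slope, intercept = np.polyfit(theoretical, ordered, 1)
r = np.corrcoef(theoretical, ordered)[0, 1]
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
```

For a normal sample the fitted slope approximates the standard deviation and the intercept approximates the mean; systematic curvature away from the line signals a departure from normality.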

python

from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Q-Q plot for right-skewed distribution
stats.probplot(wage, dist="norm", plot=axes[0])
axes[0].set_title('Q-Q Plot for Wage (Not Normal)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Q-Q plot after log transformation
stats.probplot(log_wage, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot for log(Wage) (Approximately Normal)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Interpretation:

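Ends bend away from the line (below it on the left, above it on the right): Heavy-tailed distribution
Ends flatten out (above the line on the left, below it on the right): Light-tailed distribution
Curve bends upward (convex): Right-skewed
Curve bends downward (concave): Left-skewed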

2. Skewness and Kurtosis

python

from scipy.stats import skew, kurtosis

# Calculate skewness and kurtosis
skewness_wage = skew(wage)
kurtosis_wage = kurtosis(wage, fisher=True)  # fisher=True uses excess kurtosis

skewness_log = skew(log_wage)
kurtosis_log = kurtosis(log_wage, fisher=True)

print("Wage Distribution:")
print(f"  Skewness = {skewness_wage:.3f} (right-skewed)")
print(f"  Kurtosis = {kurtosis_wage:.3f}")

print("\nlog(Wage) Distribution:")
print(f"  Skewness = {skewness_log:.3f} (approximately symmetric)")
print(f"  Kurtosis = {kurtosis_log:.3f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, data, title in zip(axes, [wage, log_wage], ['Wage', 'log(Wage)']):
    ax.hist(data, bins=30, density=True, alpha=0.6, edgecolor='black')
    sns.kdeplot(data, ax=ax, color='red', linewidth=2)

    # Add statistics
    sk = skew(data)
    ku = kurtosis(data, fisher=True)
    ax.text(0.02, 0.95, f'Skewness = {sk:.3f}\nKurtosis = {ku:.3f}',
           transform=ax.transAxes, fontsize=12, verticalalignment='top',
           bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

    ax.set_title(f'{title} Distribution', fontsize=14)
    ax.set_xlabel(title)
    ax.set_ylabel('Density')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

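Decision Criteria (common rules of thumb, not strict cutoffs):

Skewness:
- $|\text{skewness}| < 0.5$: Approximately symmetric
- $0.5 \le |\text{skewness}| < 1$: Moderately skewed
- $|\text{skewness}| \ge 1$: Highly skewed
Kurtosis (excess kurtosis):
- $\approx 0$: Similar to normal distribution
- $> 0$: Leptokurtic (heavy-tailed)
- $< 0$: Platykurtic (light-tailed)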
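
Both statistics are standardized central moments: skewness is $m_3 / m_2^{3/2}$ and excess kurtosis is $m_4 / m_2^2 - 3$, where $m_k$ is the $k$-th central moment. The sketch below computes them with NumPy; these are the uncorrected (biased) moment estimators, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` return by default. The helper names are hypothetical, and the 0.5 and 1 cutoffs in `describe_skew` are a common rule of thumb assumed here rather than a formal test.

```python
import numpy as np

def moment_skew_kurtosis(x):
    """Moment-based skewness and excess kurtosis (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered**2)
    m3 = np.mean(centered**3)
    m4 = np.mean(centered**4)
    return m3 / m2**1.5, m4 / m2**2 - 3.0

def describe_skew(skewness):
    """Label skewness with rule-of-thumb cutoffs (assumed: 0.5 and 1)."""
    size = abs(skewness)
    if size < 0.5:
        return "approximately symmetric"
    if size < 1:
        return "moderately skewed"
    return "highly skewed"

# A symmetric toy sample and a simulated right-skewed sample
sym_skew, sym_kurt = moment_skew_kurtosis([1, 2, 3, 4, 5])
np.random.seed(42)
right_skew, right_kurt = moment_skew_kurtosis(np.random.lognormal(0, 0.5, 1000))

print(f"Symmetric sample:    skewness = {sym_skew:.3f}, excess kurtosis = {sym_kurt:.3f}")
print(f"Right-skewed sample: skewness = {right_skew:.3f} ({describe_skew(right_skew)})")
```

The symmetric toy sample has zero skewness and negative excess kurtosis, since five evenly spaced values have lighter tails than a normal distribution.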

Case Study: Income Distribution Analysis

python

# Generate more realistic income data
np.random.seed(2024)
n = 5000

# Mixture distribution: majority + high-income group
income_low = np.random.lognormal(mean=2.5, sigma=0.5, size=int(n*0.9))
income_high = np.random.lognormal(mean=3.5, sigma=0.3, size=int(n*0.1))
income = np.concatenate([income_low, income_high])

# Complete univariate analysis
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# 1. Histogram + KDE
ax1 = fig.add_subplot(gs[0, :2])
ax1.hist(income, bins=50, density=True, alpha=0.6, color='lightblue', edgecolor='black')
sns.kdeplot(income, ax=ax1, color='darkblue', linewidth=2)
ax1.set_xlabel('Income (ten thousand yuan)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('Income Distribution: Histogram + KDE', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Add statistics
mean_income = income.mean()
median_income = np.median(income)
ax1.axvline(mean_income, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_income:.2f}')
ax1.axvline(median_income, color='green', linestyle='--', linewidth=2, label=f'Median = {median_income:.2f}')
ax1.legend(fontsize=11)

# 2. Box plot
ax2 = fig.add_subplot(gs[0, 2])
ax2.boxplot(income, vert=True, patch_artist=True,
           boxprops=dict(facecolor='lightgreen', alpha=0.7),
           medianprops=dict(color='red', linewidth=2))
ax2.set_ylabel('Income (ten thousand yuan)', fontsize=12)
ax2.set_title('Box Plot', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

# 3. Violin plot
ax3 = fig.add_subplot(gs[1, 0])
sns.violinplot(y=income, ax=ax3, color='lightcoral', inner='box')
ax3.set_ylabel('Income (ten thousand yuan)', fontsize=12)
ax3.set_title('Violin Plot', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3, axis='y')

# 4. ECDF
ax4 = fig.add_subplot(gs[1, 1])
sns.ecdfplot(income, ax=ax4, linewidth=2, color='purple')
ax4.set_xlabel('Income (ten thousand yuan)', fontsize=12)
ax4.set_ylabel('Cumulative Probability', fontsize=12)
ax4.set_title('Cumulative Distribution Function', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)

# Mark key quantiles
for q in [0.25, 0.50, 0.75]:
    val = np.quantile(income, q)
    ax4.plot(val, q, 'ro', markersize=8)
    ax4.text(val, q+0.05, f'P{int(q*100)}', fontsize=10)

# 5. Q-Q plot
ax5 = fig.add_subplot(gs[1, 2])
stats.probplot(income, dist="norm", plot=ax5)
ax5.set_title('Q-Q Plot (vs Normal Distribution)', fontsize=14, fontweight='bold')
ax5.grid(True, alpha=0.3)

# 6. Log-transformed distribution
ax6 = fig.add_subplot(gs[2, 0])
log_income = np.log(income)
ax6.hist(log_income, bins=50, density=True, alpha=0.6, color='lightyellow', edgecolor='black')
sns.kdeplot(log_income, ax=ax6, color='orange', linewidth=2)
ax6.set_xlabel('log(Income)', fontsize=12)
ax6.set_ylabel('Density', fontsize=12)
ax6.set_title('Log-Transformed Distribution', fontsize=14, fontweight='bold')
ax6.grid(True, alpha=0.3)

# 7. Q-Q plot after log transformation
ax7 = fig.add_subplot(gs[2, 1])
stats.probplot(log_income, dist="norm", plot=ax7)
ax7.set_title('Q-Q Plot for log(Income)', fontsize=14, fontweight='bold')
ax7.grid(True, alpha=0.3)

# 8. Descriptive statistics table
ax8 = fig.add_subplot(gs[2, 2])
ax8.axis('off')

# Units match the axis labels above (ten thousand yuan)
unit = 'ten thousand yuan'
stats_data = [
    ['Sample Size', f'{len(income):,}'],
    ['Mean', f'{income.mean():.2f} {unit}'],
    ['Median', f'{np.median(income):.2f} {unit}'],
    ['Std Dev', f'{income.std():.2f} {unit}'],
    ['Minimum', f'{income.min():.2f} {unit}'],
    ['Maximum', f'{income.max():.2f} {unit}'],
    ['Skewness', f'{skew(income):.3f}'],
    ['Kurtosis', f'{kurtosis(income, fisher=True):.3f}'],
    ['Q1', f'{np.quantile(income, 0.25):.2f} {unit}'],
    ['Q3', f'{np.quantile(income, 0.75):.2f} {unit}'],
    ['IQR', f'{np.quantile(income, 0.75) - np.quantile(income, 0.25):.2f} {unit}']
]

table = ax8.table(cellText=stats_data, colLabels=['Statistic', 'Value'],
                 cellLoc='left', loc='center',
                 bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)

# Set header style
for i in range(2):
    table[(0, i)].set_facecolor('#4CAF50')
    table[(0, i)].set_text_props(weight='bold', color='white')

ax8.set_title('Descriptive Statistics', fontsize=14, fontweight='bold', pad=20)

plt.suptitle('Complete Univariate Analysis of Income Data', fontsize=18, fontweight='bold', y=0.995)
plt.show()

# Print detailed report
print("\nIncome Data Analysis Report")
print("="*60)
print(f"Sample Size: {len(income):,}")
print(f"Mean: {income.mean():.2f} ten thousand yuan")
print(f"Median: {np.median(income):.2f} ten thousand yuan")
print(f"Standard Deviation: {income.std():.2f} ten thousand yuan")
print(f"Coefficient of Variation: {income.std()/income.mean():.2f}")
print(f"\nSkewness: {skew(income):.3f} ({'right-skewed' if skew(income) > 0 else 'left-skewed'})")
print(f"Kurtosis: {kurtosis(income, fisher=True):.3f}")
print(f"\nQuantiles:")
for q in [0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f"  P{int(q*100):2d}: {np.quantile(income, q):6.2f} ten thousand yuan")

Section Summary

Chart Selection Guide

| Purpose | Recommended Chart | Python Code |
| --- | --- | --- |
| View distribution shape | Histogram + KDE | `plt.hist()` + `sns.kdeplot()` |
| Identify outliers | Box plot | `plt.boxplot()` or `sns.boxplot()` |
| Compare distribution details | Violin plot | `sns.violinplot()` |
| Test normality | Q-Q plot | `stats.probplot()` |
| Compare quantiles | ECDF | `sns.ecdfplot()` |
| Display categorical frequencies | Bar chart | `plt.bar()` |

Key Takeaways

Always plot first: Summary statistics can hide skewness, multiple modes, and outliers that a chart makes obvious
Choose appropriate bins: Too few loses information, too many creates noise
Consider data transformations: For right-skewed distributions, consider log transformation
Multi-angle observation: Combine multiple charts for comprehensive understanding
Focus on outliers: Box plots flag observations more than 1.5×IQR beyond the quartiles, which makes them a quick first check

Next Section Preview

In the next section, we will learn:

Scatter plots and correlation visualization
Charts for continuous vs categorical variables
Pair plots
Correlation matrix heatmaps

From univariate to bivariate, exploring relationships between variables!
