6.2 Univariate Visualization
"The greatest value of a picture is when it forces us to notice what we never expected to see."— John Tukey, Statistician
Understanding the distributional characteristics of a single variable
Section Objectives
After completing this section, you will be able to:
- Use histograms and kernel density plots to display continuous variable distributions
- Use box plots and violin plots to identify outliers
- Use bar charts and pie charts to display categorical variables
- Diagnose distribution shapes (skewness, kurtosis, normality)
- Choose appropriate chart types
Continuous Variable Visualization
1. Histogram
Purpose: Display frequency distribution of data
Principle: Divide data into bins and count frequency in each bin
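The binning principle can be checked by hand on a tiny sample. The sketch below uses a hypothetical eight-value array (not the wage data) and `np.histogram`, which returns the counts that a histogram would draw as bar heights.

```python
import numpy as np

# Hypothetical toy sample, chosen so the counts are easy to verify by hand
values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0])

# Split the range [1, 9] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} value(s)")
```

The empty third bin and the single value in the last bin are exactly what a histogram of this sample would show as a gap followed by an isolated bar.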
Basic Usage
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
# Generate wage data
np.random.seed(42)
n = 1000
education = np.random.normal(13, 3, n)
log_wage = 1.5 + 0.08*education + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)
# Histogram
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Original wage (right-skewed distribution)
axes[0].hist(wage, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].set_xlabel('Wage (thousand yuan/month)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Wage Distribution (Right-Skewed)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
# Log wage (approximately normal)
axes[1].hist(log_wage, bins=30, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_xlabel('log(Wage)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('log(Wage) Distribution (Approximately Normal)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Choosing Bins
Problem: Too few bins → information loss; too many bins → excessive noise
# Compare different bins
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
bins_list = [5, 15, 30, 100]
for i, bins in enumerate(bins_list):
    ax = axes[i//2, i%2]
    ax.hist(wage, bins=bins, edgecolor='black', alpha=0.7)
    ax.set_title(f'bins = {bins}', fontsize=14)
    ax.set_xlabel('Wage (thousand yuan/month)')
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3, axis='y')
plt.suptitle('Impact of Different Bins', fontsize=16, fontweight='bold')
plt.tight_layout()
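plt.show()
Common Rules:
- Sturges' Rule: number of bins $k = \lceil \log_2 n + 1 \rceil$
- Freedman-Diaconis Rule: bin width $h = 2 \cdot \mathrm{IQR} / n^{1/3}$
- Scott's Rule: bin width $h = 3.5\,\hat{\sigma} / n^{1/3}$
For the two width-based rules, the number of bins is $\lceil (\max - \min) / h \rceil$.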
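NumPy also ships these rules as named options of `np.histogram_bin_edges`. The sketch below applies them to a hypothetical right-skewed sample (not the wage data). NumPy's constants and standard deviation convention may differ slightly from a hand calculation, so the counts need not match manual results exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for real data
sample = np.exp(rng.normal(2.5, 0.4, size=1000))

bin_counts = {}
for rule in ["sturges", "fd", "scott"]:
    # bins=<rule name> asks NumPy to pick the bin edges with that rule
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

Sturges' rule depends only on the sample size, while the other two also react to the spread of the data, which is why they tend to disagree on skewed samples.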
# Automatic bin selection
from scipy.stats import iqr
n = len(wage)
bins_sturges = int(np.ceil(np.log2(n) + 1))
bins_fd = int(np.ceil((wage.max() - wage.min()) / (2 * iqr(wage) / n**(1/3))))
bins_scott = int(np.ceil((wage.max() - wage.min()) / (3.5 * wage.std() / n**(1/3))))
print(f"Sturges: {bins_sturges} bins")
print(f"Freedman-Diaconis: {bins_fd} bins")
print(f"Scott: {bins_scott} bins")
# Use 'auto' for automatic selection
plt.figure(figsize=(10, 6))
plt.hist(wage, bins='auto', edgecolor='black', alpha=0.7)
plt.xlabel('Wage (thousand yuan/month)')
plt.ylabel('Frequency')
plt.title('Using Auto Bins', fontsize=14)
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Normalized Histogram (Density Plot)
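A density histogram rescales each count by the sample size and the bin width, so the bar areas (not the bar heights) sum to 1. The following sketch checks this numerically on a hypothetical normal sample, using `np.histogram` rather than a plot.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=10, scale=2, size=500)  # hypothetical data

counts, edges = np.histogram(sample, bins=20)
density, _ = np.histogram(sample, bins=20, density=True)

widths = np.diff(edges)
# Density height = count / (n * bin width)
manual_density = counts / (counts.sum() * widths)

# Summing height * width over all bins gives the total area
total_area = np.sum(density * widths)
print(f"Total area under the density histogram: {total_area:.6f}")
```

Because heights are densities, an individual bar can exceed 1 when bins are narrow; only the total area is constrained.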
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Frequency histogram
axes[0].hist(wage, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Frequency Histogram', fontsize=14)
axes[0].set_ylabel('Frequency')
# Density histogram
axes[1].hist(wage, bins=30, density=True, edgecolor='black', alpha=0.7)
axes[1].set_title('Density Histogram', fontsize=14)
axes[1].set_ylabel('Probability Density')
for ax in axes:
    ax.set_xlabel('Wage (thousand yuan/month)')
    ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
2. Kernel Density Estimate (KDE)
Advantage: Smooth density curve, more intuitive
Principle: Place a kernel function (typically Gaussian) at each data point, then sum
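Written out, the estimate at a point $x$ is $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$, where $K$ is the kernel and $h$ the bandwidth. The sketch below implements this directly with a Gaussian kernel. The function name `gaussian_kde_manual`, the sample, and the rule-of-thumb bandwidth are assumptions made for illustration; library implementations may differ in detail.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Average Gaussian bumps centred on each observation, evaluated on grid."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid point i to observation j
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(data) * bandwidth)

rng = np.random.default_rng(2)
sample = rng.normal(0, 1, size=200)  # hypothetical data

# Scott-style rule of thumb: sample standard deviation times n^(-1/5)
h = sample.std(ddof=1) * len(sample) ** (-1 / 5)

grid = np.linspace(sample.min() - 4 * h, sample.max() + 4 * h, 2000)
density = gaussian_kde_manual(sample, grid, h)

# Riemann-sum check: a density estimate should integrate to roughly 1
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth = {h:.3f}, area = {area:.4f}")
```

The grid extends four bandwidths beyond the data so that the tails of the outermost bumps are included in the area check.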
Basic Usage
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Method 1: seaborn
sns.kdeplot(wage, ax=axes[0], fill=True, color='steelblue', alpha=0.6)
axes[0].set_xlabel('Wage (thousand yuan/month)')
axes[0].set_ylabel('Density')
axes[0].set_title('KDE Plot (seaborn)', fontsize=14)
axes[0].grid(True, alpha=0.3)
# Method 2: matplotlib + scipy
from scipy.stats import gaussian_kde
kde = gaussian_kde(wage)
x_range = np.linspace(wage.min(), wage.max(), 1000)
axes[1].plot(x_range, kde(x_range), linewidth=2, color='coral')
axes[1].fill_between(x_range, kde(x_range), alpha=0.3, color='coral')
axes[1].set_xlabel('Wage (thousand yuan/month)')
axes[1].set_ylabel('Density')
axes[1].set_title('KDE Plot (matplotlib)', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Histogram + KDE (Recommended)
plt.figure(figsize=(10, 6))
# Histogram (normalized)
plt.hist(wage, bins=30, density=True, alpha=0.6, color='lightblue',
         edgecolor='black', label='Histogram')
# KDE curve
sns.kdeplot(wage, color='darkblue', linewidth=2, label='KDE')
plt.xlabel('Wage (thousand yuan/month)', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('Wage Distribution: Histogram + KDE', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Impact of Bandwidth
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
bandwidths = [0.5, 1.0, 2.0]
for i, bw in enumerate(bandwidths):
    sns.kdeplot(wage, ax=axes[i], bw_adjust=bw, fill=True, color='steelblue')
    axes[i].set_title(f'Bandwidth Multiplier = {bw}', fontsize=14)
    axes[i].set_xlabel('Wage (thousand yuan/month)')
    axes[i].set_ylabel('Density')
    axes[i].grid(True, alpha=0.3)
plt.suptitle('Impact of Bandwidth on KDE', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
Selection Advice:
- Bandwidth too small → overfitting (noise)
- Bandwidth too large → underfitting (over-smoothing)
- Default value (Scott's rule) usually works well
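One way to make the advice above concrete is to count the local maxima of the estimated curve as the bandwidth changes. The sketch below uses a hypothetical two-group sample and a hand-written Gaussian KDE; the helper names `kde_on_grid` and `count_modes` are illustrative, and the multipliers play the same role as `bw_adjust` without claiming to reproduce seaborn's exact numbers.

```python
import numpy as np

def kde_on_grid(data, grid, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.sum(axis=1) / (len(data) * bandwidth)

def count_modes(density):
    """Count strict local maxima of a curve sampled on a grid."""
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))

rng = np.random.default_rng(3)
# Hypothetical bimodal sample: two well-separated groups
sample = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
grid = np.linspace(-5, 11, 1000)

# Scott-style default bandwidth, then scaled down and up
base_h = sample.std(ddof=1) * len(sample) ** (-1 / 5)
modes = {}
for multiplier in [0.1, 1.0, 5.0]:
    modes[multiplier] = count_modes(kde_on_grid(sample, grid, multiplier * base_h))
    print(f"multiplier = {multiplier}: {modes[multiplier]} local maxima")
```

A very small multiplier produces many spurious bumps, while a very large one merges the two real groups into a single peak, which is the trade-off described in the list above.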
3. Box Plot
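Purpose: Display the five-number summary (minimum, Q1, median, Q3, maximum) and flag outliers
Five-Number Summary:
- Minimum
- Q1 (first quartile, 25%)
- Median (Q2, 50%)
- Q3 (third quartile, 75%)
- Maximum
IQR (Interquartile Range): $\mathrm{IQR} = Q_3 - Q_1$
Whiskers: By default the whiskers do not run to the raw minimum and maximum. They extend to the most extreme observations that still lie within the fences $Q_1 - 1.5 \times \mathrm{IQR}$ and $Q_3 + 1.5 \times \mathrm{IQR}$; points beyond the fences are drawn individually as outliers.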
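The quantities a box plot draws can be computed directly. The sketch below uses a small hypothetical sample with one extreme value, so the difference between the fences (computed limits) and the whisker ends (actual data points) is visible.

```python
import numpy as np

# Hypothetical sample with one obvious extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr_value = q3 - q1

# Fences: limits beyond which a point is flagged as an outlier
lower_fence = q1 - 1.5 * iqr_value
upper_fence = q3 + 1.5 * iqr_value

# Whiskers end at the most extreme observations inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr_value}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Whiskers: [{whisker_low}, {whisker_high}], outliers: {outliers}")
```

Here the upper fence is 14.5 but the upper whisker stops at 9, the largest observation below the fence, and 50 is the only flagged point.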
Basic Usage
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Vertical box plot
axes[0].boxplot(wage, vert=True, patch_artist=True,
                boxprops=dict(facecolor='lightblue', alpha=0.7),
                medianprops=dict(color='red', linewidth=2),
                whiskerprops=dict(linewidth=1.5),
                capprops=dict(linewidth=1.5))
axes[0].set_ylabel('Wage (thousand yuan/month)')
axes[0].set_title('Vertical Box Plot', fontsize=14)
axes[0].grid(True, alpha=0.3, axis='y')
# Horizontal box plot (easier to compare)
axes[1].boxplot(wage, vert=False, patch_artist=True,
                boxprops=dict(facecolor='lightcoral', alpha=0.7),
                medianprops=dict(color='darkred', linewidth=2))
axes[1].set_xlabel('Wage (thousand yuan/month)')
axes[1].set_title('Horizontal Box Plot', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
Box Plot with Data Points
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# seaborn box plot
sns.boxplot(y=wage, ax=axes[0], color='lightblue')
axes[0].set_ylabel('Wage (thousand yuan/month)')
axes[0].set_title('Box Plot (seaborn)', fontsize=14)
axes[0].grid(True, alpha=0.3, axis='y')
# Box plot + scatter
sns.boxplot(y=wage, ax=axes[1], color='lightblue', width=0.5)
sns.stripplot(y=wage, ax=axes[1], color='black', alpha=0.3, size=3)
axes[1].set_ylabel('Wage (thousand yuan/month)')
axes[1].set_title('Box Plot + Data Points', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Identifying Outliers
# Calculate outliers
Q1 = np.percentile(wage, 25)
Q3 = np.percentile(wage, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = wage[(wage < lower_bound) | (wage > upper_bound)]
print(f"Q1 = {Q1:.2f}, Q3 = {Q3:.2f}, IQR = {IQR:.2f}")
print(f"Outlier range: < {lower_bound:.2f} or > {upper_bound:.2f}")
print(f"Number of outliers: {len(outliers)} ({len(outliers)/len(wage)*100:.1f}%)")
print(f"Outliers: {outliers[:10]}...")  # Show first 10
4. Violin Plot
Advantage: Combines the benefits of box plots and kernel density plots
Interpretation:
- Middle section is a box plot
- Sides are mirrored KDEs
fig, axes = plt.subplots(1, 3, figsize=(16, 6))
# Box plot
sns.boxplot(y=wage, ax=axes[0], color='lightblue')
axes[0].set_title('Box Plot', fontsize=14)
axes[0].set_ylabel('Wage (thousand yuan/month)')
# Violin plot
sns.violinplot(y=wage, ax=axes[1], color='lightgreen')
axes[1].set_title('Violin Plot', fontsize=14)
axes[1].set_ylabel('')
# Violin plot + box plot (recommended)
sns.violinplot(y=wage, ax=axes[2], color='lightgreen', inner=None)
sns.boxplot(y=wage, ax=axes[2], width=0.15, color='white',
            boxprops=dict(zorder=2))
axes[2].set_title('Violin Plot + Box Plot', fontsize=14)
axes[2].set_ylabel('')
for ax in axes:
    ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
5. Cumulative Distribution Function (CDF) Plot
Purpose: Display $F(x) = P(X \le x)$, suitable for comparing distributions
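For a sample, the empirical version of this function is simply the fraction of observations at or below a given value. A minimal sketch, assuming a hypothetical five-value sample and an illustrative helper name `ecdf_at`:

```python
import numpy as np

def ecdf_at(data, x):
    """Fraction of observations less than or equal to x."""
    sorted_data = np.sort(np.asarray(data, dtype=float))
    # side="right" counts every observation <= x, including ties
    return np.searchsorted(sorted_data, x, side="right") / len(sorted_data)

sample = [1, 2, 2, 3, 5]  # hypothetical data
for x in [0, 2, 4, 5]:
    print(f"F({x}) = {ecdf_at(sample, x):.1f}")
```

The curve jumps by 1/n at each observation (by 2/n at the tied value 2) and stays flat in between, which is why an ECDF plot looks like a staircase for small samples.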
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Empirical CDF
sorted_wage = np.sort(wage)
cdf = np.arange(1, len(sorted_wage)+1) / len(sorted_wage)
axes[0].plot(sorted_wage, cdf, linewidth=2, color='steelblue')
axes[0].set_xlabel('Wage (thousand yuan/month)')
axes[0].set_ylabel('Cumulative Probability')
axes[0].set_title('Empirical Cumulative Distribution Function (ECDF)', fontsize=14)
axes[0].grid(True, alpha=0.3)
# Using seaborn (more concise)
sns.ecdfplot(wage, ax=axes[1], linewidth=2, color='coral')
axes[1].set_xlabel('Wage (thousand yuan/month)')
axes[1].set_ylabel('Cumulative Probability')
axes[1].set_title('ECDF (seaborn)', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate quantiles
print("Wage Quantiles:")
for q in [0.25, 0.50, 0.75, 0.90, 0.95]:
    print(f" P{int(q*100)}: {np.quantile(wage, q):.2f} thousand yuan")
Categorical Variable Visualization
1. Bar Chart
# Generate categorical data
np.random.seed(42)
regions = ['East', 'Central', 'West', 'Northeast']
counts = [450, 280, 190, 80]
df_region = pd.DataFrame({'region': regions, 'count': counts})
df_region['percentage'] = df_region['count'] / df_region['count'].sum() * 100
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Frequency bar chart
axes[0].bar(df_region['region'], df_region['count'],
            color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
            edgecolor='black', alpha=0.8)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Sample Size by Region', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
# Add value labels
for i, (region, count) in enumerate(zip(df_region['region'], df_region['count'])):
    axes[0].text(i, count + 10, str(count), ha='center', fontsize=11)
# Percentage bar chart
axes[1].bar(df_region['region'], df_region['percentage'],
            color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
            edgecolor='black', alpha=0.8)
axes[1].set_ylabel('Percentage (%)', fontsize=12)
axes[1].set_title('Proportion by Region', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
# Add percentage labels
for i, (region, pct) in enumerate(zip(df_region['region'], df_region['percentage'])):
    axes[1].text(i, pct + 1, f'{pct:.1f}%', ha='center', fontsize=11)
plt.tight_layout()
plt.show()
Horizontal Bar Chart (better for many categories)
# Sort by value
df_sorted = df_region.sort_values('count')
plt.figure(figsize=(10, 6))
plt.barh(df_sorted['region'], df_sorted['count'],
         color='steelblue', edgecolor='black', alpha=0.8)
plt.xlabel('Count', fontsize=12)
plt.title('Sample Size by Region (Sorted)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
# Add values
for i, (region, count) in enumerate(zip(df_sorted['region'], df_sorted['count'])):
    plt.text(count + 10, i, str(count), va='center', fontsize=11)
plt.tight_layout()
plt.show()
2. Pie Chart
Note: Pie charts are suitable for showing part-whole relationships, but are less precise than bar charts
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Basic pie chart
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
axes[0].pie(df_region['count'], labels=df_region['region'], colors=colors,
            autopct='%1.1f%%', startangle=90)
axes[0].set_title('Regional Distribution (Pie Chart)', fontsize=14, fontweight='bold')
# Highlight a slice
explode = (0.1, 0, 0, 0) # Highlight first slice
axes[1].pie(df_region['count'], labels=df_region['region'], colors=colors,
            autopct='%1.1f%%', startangle=90, explode=explode,
            shadow=True)
axes[1].set_title('Regional Distribution (Highlight East)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Recommendations:
- Use for few categories (≤ 5)
- Emphasize part-whole relationship
- Avoid 3D pie charts (distorts proportions)
- Use bar charts for many categories
Distribution Diagnostics
1. Q-Q Plot (Quantile-Quantile Plot)
Purpose: Check visually whether data follow a specific distribution (usually the normal distribution)
Principle:
- X-axis: Theoretical quantiles
- Y-axis: Sample quantiles
- If the points lie close to a straight line → the data are consistent with that distribution
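The points of a normal Q-Q plot can be built by hand: sort the sample, then pair each value with the standard normal quantile at a matching probability. The sketch below assumes plotting positions $(i - 0.5)/n$ and hypothetical simulated data; plotting libraries may use slightly different positions, so this illustrates the principle rather than reproducing any particular function.

```python
import numpy as np
from statistics import NormalDist

def qq_points(data):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    sample_q = np.sort(np.asarray(data, dtype=float))
    n = len(sample_q)
    # Plotting positions (i - 0.5) / n are an assumption of this sketch
    probs = (np.arange(1, n + 1) - 0.5) / n
    standard_normal = NormalDist()
    theoretical_q = np.array([standard_normal.inv_cdf(p) for p in probs])
    return theoretical_q, sample_q

rng = np.random.default_rng(4)
normal_sample = rng.normal(0, 1, 1000)  # hypothetical data
skewed_sample = np.exp(normal_sample)   # right-skewed transform of the same draws

correlations = {}
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    theo, samp = qq_points(sample)
    # Correlation near 1 means the points fall close to a straight line
    correlations[name] = np.corrcoef(theo, samp)[0, 1]
    print(f"{name}: correlation of Q-Q points = {correlations[name]:.4f}")
```

The correlation is only a rough numerical summary of straightness; the plot itself shows where the departure happens (tails versus centre), which the single number cannot.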
from scipy import stats
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Q-Q plot for right-skewed distribution
stats.probplot(wage, dist="norm", plot=axes[0])
axes[0].set_title('Q-Q Plot for Wage (Not Normal)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Q-Q plot after log transformation
stats.probplot(log_wage, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot for log(Wage) (Approximately Normal)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
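plt.show()
Interpretation:
- S-shaped curve (lower end below the line, upper end above it): Heavy-tailed distribution
- Inverted S (lower end above the line, upper end below it): Light-tailed distribution
- Concave up (both ends above the line): Right-skewed
- Concave down (both ends below the line): Left-skewed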
2. Skewness and Kurtosis
from scipy.stats import skew, kurtosis
# Calculate skewness and kurtosis
skewness_wage = skew(wage)
kurtosis_wage = kurtosis(wage, fisher=True) # fisher=True uses excess kurtosis
skewness_log = skew(log_wage)
kurtosis_log = kurtosis(log_wage, fisher=True)
print("Wage Distribution:")
print(f" Skewness = {skewness_wage:.3f} (right-skewed)")
print(f" Kurtosis = {kurtosis_wage:.3f}")
print("\nlog(Wage) Distribution:")
print(f" Skewness = {skewness_log:.3f} (approximately symmetric)")
print(f" Kurtosis = {kurtosis_log:.3f}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, data, title in zip(axes, [wage, log_wage], ['Wage', 'log(Wage)']):
    ax.hist(data, bins=30, density=True, alpha=0.6, edgecolor='black')
    sns.kdeplot(data, ax=ax, color='red', linewidth=2)
    # Add statistics
    sk = skew(data)
    ku = kurtosis(data, fisher=True)
    ax.text(0.02, 0.95, f'Skewness = {sk:.3f}\nKurtosis = {ku:.3f}',
            transform=ax.transAxes, fontsize=12, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    ax.set_title(f'{title} Distribution', fontsize=14)
    ax.set_xlabel(title)
    ax.set_ylabel('Density')
    ax.grid(True, alpha=0.3)
plt.tight_layout()
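plt.show()
Decision Criteria:
Skewness (a common rule of thumb):
- $|\text{skewness}| < 0.5$: Approximately symmetric
- $0.5 \le |\text{skewness}| < 1$: Moderately skewed
- $|\text{skewness}| \ge 1$: Highly skewed
Kurtosis (excess kurtosis):
- $\approx 0$: Similar to normal distribution
- $> 0$: Leptokurtic (heavy-tailed)
- $< 0$: Platykurtic (light-tailed)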
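Both statistics are ratios of central moments: skewness is $m_3 / m_2^{3/2}$ and excess kurtosis is $m_4 / m_2^2 - 3$, where $m_k$ is the mean of $(x - \bar{x})^k$. The sketch below computes them by hand on two hypothetical toy samples; the helper name `moment_shape` is illustrative, and the formulas are the simple (biased) moment versions, so results from functions that apply a bias correction will differ slightly.

```python
import numpy as np

def moment_shape(data):
    """Return (skewness, excess kurtosis) from central moments."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean()
    m2 = np.mean(centered**2)  # variance (divides by n, not n - 1)
    m3 = np.mean(centered**3)
    m4 = np.mean(centered**4)
    skewness = m3 / m2**1.5
    excess_kurtosis = m4 / m2**2 - 3.0  # subtract 3 so a normal scores 0
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]       # hypothetical symmetric sample
right_tailed = [1, 1, 1, 1, 10]   # hypothetical sample with one large value

for name, sample in [("symmetric", symmetric), ("right-tailed", right_tailed)]:
    sk, ku = moment_shape(sample)
    print(f"{name}: skewness = {sk:.3f}, excess kurtosis = {ku:.3f}")
```

The symmetric sample has zero skewness and negative excess kurtosis (flatter than a normal), while the single large value pushes the skewness of the second sample well above zero.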
Case Study: Income Distribution Analysis
# Generate more realistic income data
np.random.seed(2024)
n = 5000
# Mixture distribution: majority + high-income group
income_low = np.random.lognormal(mean=2.5, sigma=0.5, size=int(n*0.9))
income_high = np.random.lognormal(mean=3.5, sigma=0.3, size=int(n*0.1))
income = np.concatenate([income_low, income_high])
# Complete univariate analysis
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
# 1. Histogram + KDE
ax1 = fig.add_subplot(gs[0, :2])
ax1.hist(income, bins=50, density=True, alpha=0.6, color='lightblue', edgecolor='black')
sns.kdeplot(income, ax=ax1, color='darkblue', linewidth=2)
ax1.set_xlabel('Income (ten thousand yuan)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('Income Distribution: Histogram + KDE', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
# Add statistics
mean_income = income.mean()
median_income = np.median(income)
ax1.axvline(mean_income, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_income:.2f}')
ax1.axvline(median_income, color='green', linestyle='--', linewidth=2, label=f'Median = {median_income:.2f}')
ax1.legend(fontsize=11)
# 2. Box plot
ax2 = fig.add_subplot(gs[0, 2])
ax2.boxplot(income, vert=True, patch_artist=True,
            boxprops=dict(facecolor='lightgreen', alpha=0.7),
            medianprops=dict(color='red', linewidth=2))
ax2.set_ylabel('Income (ten thousand yuan)', fontsize=12)
ax2.set_title('Box Plot', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')
# 3. Violin plot
ax3 = fig.add_subplot(gs[1, 0])
sns.violinplot(y=income, ax=ax3, color='lightcoral', inner='box')
ax3.set_ylabel('Income (ten thousand yuan)', fontsize=12)
ax3.set_title('Violin Plot', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3, axis='y')
# 4. ECDF
ax4 = fig.add_subplot(gs[1, 1])
sns.ecdfplot(income, ax=ax4, linewidth=2, color='purple')
ax4.set_xlabel('Income (ten thousand yuan)', fontsize=12)
ax4.set_ylabel('Cumulative Probability', fontsize=12)
ax4.set_title('Cumulative Distribution Function', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)
# Mark key quantiles
for q in [0.25, 0.50, 0.75]:
    val = np.quantile(income, q)
    ax4.plot(val, q, 'ro', markersize=8)
    ax4.text(val, q+0.05, f'P{int(q*100)}', fontsize=10)
# 5. Q-Q plot
ax5 = fig.add_subplot(gs[1, 2])
stats.probplot(income, dist="norm", plot=ax5)
ax5.set_title('Q-Q Plot (vs Normal Distribution)', fontsize=14, fontweight='bold')
ax5.grid(True, alpha=0.3)
# 6. Log-transformed distribution
ax6 = fig.add_subplot(gs[2, 0])
log_income = np.log(income)
ax6.hist(log_income, bins=50, density=True, alpha=0.6, color='lightyellow', edgecolor='black')
sns.kdeplot(log_income, ax=ax6, color='orange', linewidth=2)
ax6.set_xlabel('log(Income)', fontsize=12)
ax6.set_ylabel('Density', fontsize=12)
ax6.set_title('Log-Transformed Distribution', fontsize=14, fontweight='bold')
ax6.grid(True, alpha=0.3)
# 7. Q-Q plot after log transformation
ax7 = fig.add_subplot(gs[2, 1])
stats.probplot(log_income, dist="norm", plot=ax7)
ax7.set_title('Q-Q Plot for log(Income)', fontsize=14, fontweight='bold')
ax7.grid(True, alpha=0.3)
# 8. Descriptive statistics table
ax8 = fig.add_subplot(gs[2, 2])
ax8.axis('off')
# Units match the axis labels and the printed report (ten thousand yuan)
stats_data = [
    ['Sample Size', f'{len(income):,}'],
    ['Mean', f'{income.mean():.2f} ten thousand yuan'],
    ['Median', f'{np.median(income):.2f} ten thousand yuan'],
    ['Std Dev', f'{income.std():.2f} ten thousand yuan'],
    ['Minimum', f'{income.min():.2f} ten thousand yuan'],
    ['Maximum', f'{income.max():.2f} ten thousand yuan'],
    ['Skewness', f'{skew(income):.3f}'],
    ['Kurtosis', f'{kurtosis(income, fisher=True):.3f}'],
    ['Q1', f'{np.quantile(income, 0.25):.2f} ten thousand yuan'],
    ['Q3', f'{np.quantile(income, 0.75):.2f} ten thousand yuan'],
    ['IQR', f'{np.quantile(income, 0.75) - np.quantile(income, 0.25):.2f} ten thousand yuan']
]
table = ax8.table(cellText=stats_data, colLabels=['Statistic', 'Value'],
                  cellLoc='left', loc='center',
                  bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)
# Set header style
for i in range(2):
    table[(0, i)].set_facecolor('#4CAF50')
    table[(0, i)].set_text_props(weight='bold', color='white')
ax8.set_title('Descriptive Statistics', fontsize=14, fontweight='bold', pad=20)
plt.suptitle('Complete Univariate Analysis of Income Data', fontsize=18, fontweight='bold', y=0.995)
plt.show()
# Print detailed report
print("\nIncome Data Analysis Report")
print("="*60)
print(f"Sample Size: {len(income):,}")
print(f"Mean: {income.mean():.2f} ten thousand yuan")
print(f"Median: {np.median(income):.2f} ten thousand yuan")
print(f"Standard Deviation: {income.std():.2f} ten thousand yuan")
print(f"Coefficient of Variation: {income.std()/income.mean():.2f}")
print(f"\nSkewness: {skew(income):.3f} ({'right-skewed' if skew(income) > 0 else 'left-skewed'})")
print(f"Kurtosis: {kurtosis(income, fisher=True):.3f}")
print(f"\nQuantiles:")
for q in [0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f" P{int(q*100):2d}: {np.quantile(income, q):6.2f} ten thousand yuan")
Section Summary
Chart Selection Guide
| Purpose | Recommended Chart | Python Code |
|---|---|---|
| View distribution shape | Histogram + KDE | plt.hist() + sns.kdeplot() |
| Identify outliers | Box plot | plt.boxplot() or sns.boxplot() |
| Compare distribution details | Violin plot | sns.violinplot() |
| Test normality | Q-Q plot | stats.probplot() |
| Compare quantiles | ECDF | sns.ecdfplot() |
| Display categorical frequencies | Bar chart | plt.bar() |
Key Takeaways
- Always plot first: Summary statistics can hide skewness, multiple modes, and outliers that a chart makes obvious
- Choose appropriate bins: Too few loses information, too many creates noise
- Consider data transformations: For right-skewed distributions, consider log transformation
- Multi-angle observation: Combine multiple charts for comprehensive understanding
- Focus on outliers: Box plots flag them quickly using the 1.5×IQR rule
Next Section Preview
In the next section, we will learn:
- Scatter plots and correlation visualization
- Charts for continuous vs categorical variables
- Pair plots
- Correlation matrix heatmaps
From univariate to bivariate, exploring relationships between variables!