6.1 Chapter Introduction: Data Visualization - Telling the Story of Data
From numbers to charts: making research findings crystal clear
Chapter Objectives
After completing this chapter, you will be able to:
- Understand fundamental principles of data visualization
- Create various statistical charts using matplotlib and seaborn
- Produce exploratory plots for univariate and bivariate data
- Visualize regression analysis results
- Compare distributions across multiple groups
- Create publication-quality figures meeting academic standards
Why is Data Visualization So Important?
The Power of Charts: Anscombe's Quartet
In 1973, statistician Francis Anscombe constructed four datasets with nearly identical descriptive statistics:
- Same means
- Same variances
- Same correlation coefficients
- Same regression lines
But when we plot these data...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Anscombe's Quartet data
anscombe = {
    'I': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]},
    'II': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
           'y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]},
    'III': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            'y': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]},
    'IV': {'x': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           'y': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]}
}
# Calculate statistics
from scipy.stats import linregress
print("Anscombe's Quartet Statistics:")
print("="*60)
for name, data in anscombe.items():
    x, y = np.array(data['x']), np.array(data['y'])
    print(f"\nDataset {name}:")
    print(f"  X mean: {x.mean():.2f}, Y mean: {y.mean():.2f}")
    # ddof=1 gives the sample variance (NumPy's default is the population variance)
    print(f"  X variance: {x.var(ddof=1):.2f}, Y variance: {y.var(ddof=1):.2f}")
    print(f"  Correlation: {np.corrcoef(x, y)[0,1]:.3f}")
    # Regression
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    print(f"  Regression equation: Y = {intercept:.2f} + {slope:.2f}X")
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
for i, (name, data) in enumerate(anscombe.items()):
    x, y = np.array(data['x']), np.array(data['y'])
    # Scatter plot
    axes[i].scatter(x, y, s=80, alpha=0.7)
    # Regression line
    slope, intercept, _, _, _ = linregress(x, y)
    x_line = np.linspace(x.min(), x.max(), 100)
    axes[i].plot(x_line, intercept + slope * x_line, 'r-', linewidth=2)
    axes[i].set_xlabel('X', fontsize=12)
    axes[i].set_ylabel('Y', fontsize=12)
    axes[i].set_title(f'Dataset {name}', fontsize=14, fontweight='bold')
    axes[i].grid(True, alpha=0.3)
    # Shared axis limits so the four panels are directly comparable
    axes[i].set_xlim(3, 20)
    axes[i].set_ylim(3, 14)
plt.suptitle("Anscombe's Quartet: Same Statistics, Different Data Structures",
             fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()
Key Insights:
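- Dataset I: A roughly linear relationship with random scatter
- Dataset II: A non-linear (curved, roughly quadratic) relationship that a straight line misses
- Dataset III: An almost perfectly linear relationship plus one outlier that tilts the fitted line
- Dataset IV: All x values identical except one high-leverage point, which alone determines the fitted line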
Conclusion: Always plot your data first!
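To make the effect of the single outlier in Dataset III concrete, the following sketch refits the line after dropping that one point. It is an illustration only: it assumes the outlier can be identified simply as the observation with the largest absolute residual, and the variable names are ours rather than part of the example above.

```python
import numpy as np

# Dataset III from Anscombe's Quartet (same values as above)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# Fit on all 11 points; for degree 1, np.polyfit returns [slope, intercept]
slope_all, intercept_all = np.polyfit(x, y, 1)

# Assumption for this sketch: treat the point with the largest
# absolute residual as the outlier
residuals = y - (intercept_all + slope_all * x)
outlier_idx = int(np.argmax(np.abs(residuals)))

# Refit without that point
keep = np.arange(len(x)) != outlier_idx
slope_trim, intercept_trim = np.polyfit(x[keep], y[keep], 1)
r_all = np.corrcoef(x, y)[0, 1]
r_trim = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"Outlier: x={x[outlier_idx]:.0f}, y={y[outlier_idx]:.2f}")
print(f"All points:      Y = {intercept_all:.2f} + {slope_all:.2f}X, r = {r_all:.3f}")
print(f"Without outlier: Y = {intercept_trim:.2f} + {slope_trim:.2f}X, r = {r_trim:.3f}")
```

The summary statistics alone give no hint that one observation is responsible for so much of the fitted slope; the scatter plot makes it obvious at a glance.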
Three Major Goals of Data Visualization
1. Exploratory Analysis
Goal: Understand the basic characteristics of the data, discover patterns, and identify anomalies
Typical Applications:
- Examine data distributions
- Discover variable relationships
- Identify outliers
- Generate research hypotheses
Example:
# Explore wage data
np.random.seed(42)
n = 500
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = 1.5 + 0.08*education + 0.03*experience + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)
df = pd.DataFrame({'wage': wage, 'education': education, 'experience': experience})
# Quick exploration
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Histogram
axes[0, 0].hist(df['wage'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Wage Distribution')
axes[0, 0].set_xlabel('Wage (thousand yuan/month)')
# Box plot
axes[0, 1].boxplot(df['wage'])
axes[0, 1].set_title('Wage Box Plot')
axes[0, 1].set_ylabel('Wage (thousand yuan/month)')
# Scatter plot 1
axes[1, 0].scatter(df['education'], df['wage'], alpha=0.5)
axes[1, 0].set_title('Education vs Wage')
axes[1, 0].set_xlabel('Years of Education')
axes[1, 0].set_ylabel('Wage (thousand yuan/month)')
# Scatter plot 2
axes[1, 1].scatter(df['experience'], df['wage'], alpha=0.5)
axes[1, 1].set_title('Experience vs Wage')
axes[1, 1].set_xlabel('Work Experience (years)')
axes[1, 1].set_ylabel('Wage (thousand yuan/month)')
plt.tight_layout()
plt.show()
2. Explanatory Visualization
Goal: Clearly communicate research findings
Typical Applications:
- Paper figures
- Report charts
- Presentations
- Policy briefs
Requirements:
- Clear titles and labels
- Appropriate legends
- Professional color schemes
- Publication-quality standards
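The requirements above can be sketched in a few lines of matplotlib. The data are simulated purely for illustration, and the group names, label text and color choices are assumptions rather than the requirements of any particular journal.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Simulated data (hypothetical): mean log wage by years of education, two groups
rng = np.random.default_rng(0)
years = np.arange(8, 21)
group_a = 1.5 + 0.08 * years + rng.normal(0, 0.03, years.size)
group_b = 1.3 + 0.08 * years + rng.normal(0, 0.03, years.size)

fig, ax = plt.subplots(figsize=(6, 4))
# Distinct colors and markers so the two series are easy to tell apart
ax.plot(years, group_a, marker="o", color="#377eb8", label="Group A")
ax.plot(years, group_b, marker="s", color="#ff7f00", label="Group B")

# Clear title and axis labels
ax.set_title("Log Wage Rises with Education in Both Groups (simulated data)")
ax.set_xlabel("Years of education")
ax.set_ylabel("Mean log wage")

# Frameless legend and no top/right spines keep the focus on the data
ax.legend(frameon=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
fig.tight_layout()
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` called at the end.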
3. Confirmatory Visualization
Goal: Verify statistical and model assumptions
Typical Applications:
- Residual diagnostic plots
- Q-Q plots
- Influence analysis plots
- Hypothesis test visualizations
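As a rough illustration of the first two items, the sketch below draws a residuals-versus-fitted plot and a normal Q-Q plot for a simple regression. It assumes simulated data with normal errors, and it builds the Q-Q plot by hand from the standard library's `statistics.NormalDist` rather than relying on a dedicated plotting helper; all variable names are illustrative.

```python
from statistics import NormalDist

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Simulated data (assumption): a linear model with normal errors
rng = np.random.default_rng(42)
n = 500
x = rng.normal(13, 3, n)
y = 1.5 + 0.08 * x + rng.normal(0, 0.3, n)

# Ordinary least squares fit and residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Q-Q plot ingredients: sorted standardized residuals against
# standard normal quantiles at plotting positions (i - 0.5) / n
std_resid = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4.5))

# Residuals vs fitted: look for curvature or a funnel shape
ax1.scatter(fitted, residuals, alpha=0.5)
ax1.axhline(0, color="red", linewidth=1)
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax1.set_title("Residuals vs Fitted")

# Q-Q plot: points close to the 45-degree line suggest roughly normal residuals
ax2.scatter(theoretical, std_resid, alpha=0.5)
lims = [theoretical.min(), theoretical.max()]
ax2.plot(lims, lims, color="red", linewidth=1)
ax2.set_xlabel("Theoretical normal quantiles")
ax2.set_ylabel("Standardized residual quantiles")
ax2.set_title("Normal Q-Q Plot")

fig.tight_layout()
```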
Fundamental Principles of Visualization
Edward Tufte's Data Visualization Principles
1. Data-Ink Ratio
Principle: Maximize the data-ink ratio by removing unnecessary elements
# Bad example: over-decoration
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Over-decorated chart
ax1.bar(range(5), [3, 5, 2, 7, 4], color=['red', 'blue', 'green', 'yellow', 'purple'])
ax1.set_facecolor('lightgray')
ax1.grid(True, which='both', linestyle='--', linewidth=2)
ax1.set_title('Over-decorated', fontsize=14, fontweight='bold', color='red')
ax1.legend(['Data'], loc='best', frameon=True, shadow=True)
# Good example: clean and clear
ax2.bar(range(5), [3, 5, 2, 7, 4], color='steelblue', alpha=0.8)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.set_title('Clean and Clear', fontsize=14, fontweight='bold')
ax2.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
2. Chart Selection Principles
| Data Type | Recommended Charts | Python Tools |
|---|---|---|
| Single continuous variable | Histogram, KDE, box plot | hist(), kdeplot(), boxplot() |
| Single categorical variable | Bar chart, pie chart | bar(), pie() |
| Two continuous variables | Scatter plot, hexbin | scatter(), hexbin() |
| Continuous vs categorical | Grouped box plot, violin plot | boxplot(), violinplot() |
| Time series | Line plot | plot() |
| Multivariate relationships | Pair plot, heatmap | pairplot(), heatmap() |
| Distribution comparison | Overlapping density plots, CDF | kdeplot(), ecdfplot() |
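To illustrate one row of the table, the sketch below compares two distributions with empirical CDFs. It uses only NumPy and matplotlib, so the small `ecdf` helper is our own stand-in for the `ecdfplot()` entry in the table, and the two "regions" are simulated, hypothetical data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np


def ecdf(values):
    """Return the sorted values and the empirical CDF evaluated at each of them."""
    xs = np.sort(np.asarray(values, dtype=float))
    ps = np.arange(1, xs.size + 1) / xs.size
    return xs, ps


# Simulated wages for two hypothetical regions (log-normal, illustration only)
rng = np.random.default_rng(1)
region_a = rng.lognormal(mean=2.0, sigma=0.4, size=300)
region_b = rng.lognormal(mean=2.2, sigma=0.4, size=300)

fig, ax = plt.subplots(figsize=(6, 4))
for name, sample in [("Region A", region_a), ("Region B", region_b)]:
    xs, ps = ecdf(sample)
    # Step plot: the ECDF jumps by 1/n at each observed value
    ax.step(xs, ps, where="post", label=name)

ax.set_xlabel("Wage")
ax.set_ylabel("Cumulative proportion")
ax.set_title("Empirical CDFs of Wages by Region (simulated data)")
ax.legend(frameon=False)
fig.tight_layout()
```

Unlike a histogram or KDE, an ECDF needs no bin width or bandwidth choice, which is one reason it suits side-by-side distribution comparisons.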
3. Color Usage Principles
# Colorblind-friendly color schemes
# Eight colors drawn from ColorBrewer's Set1 palette, reordered
colorblind_safe = ['#377eb8', '#ff7f00', '#4daf4a', '#f781bf',
                   '#a65628', '#984ea3', '#999999', '#e41a1c']
# Example
fig, ax = plt.subplots(figsize=(10, 6))
for i, color in enumerate(colorblind_safe):
    ax.bar(i, i+1, color=color, edgecolor='black')
ax.set_title('Colorblind-Friendly Palette', fontsize=14)
ax.set_xticks(range(len(colorblind_safe)))
ax.set_xticklabels([f'Color {i+1}' for i in range(len(colorblind_safe))])
plt.show()
Avoid:
- Red-green combinations (roughly 8% of men have some form of red-green color vision deficiency)
- Overly bright colors
- Gradients for categorical variables
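One way to act on these points is to give each category both a distinct hue and a distinct marker, so that groups stay distinguishable even when colors are hard to tell apart. The sketch below is an illustration on simulated data; the group names are hypothetical, and pairing color with marker shape is a common convention rather than something prescribed above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)

# Three hypothetical categories, each with its own color AND marker
styles = {
    "Group A": {"color": "#377eb8", "marker": "o"},
    "Group B": {"color": "#ff7f00", "marker": "s"},
    "Group C": {"color": "#984ea3", "marker": "^"},
}

fig, ax = plt.subplots(figsize=(6, 4))
for i, (name, style) in enumerate(styles.items()):
    # Simulated cluster centered at (i, i)
    x = rng.normal(loc=i, scale=0.5, size=40)
    y = rng.normal(loc=i, scale=0.5, size=40)
    ax.scatter(x, y, label=name, alpha=0.8, **style)

ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_title("Categories Encoded by Both Color and Marker")
ax.legend(frameon=False)
fig.tight_layout()
```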
4. Titles and Labels
Good titles should:
- Clearly describe chart content
- Include key information (sample size, time range, etc.)
- Use active voice
Example:
# Poor title
plt.title('Chart')
# Good title
plt.title('Impact of Education on Wages (N=500, 2020 data)', fontsize=14, fontweight='bold')
Python Visualization Ecosystem
Core Library Comparison
| Library | Strengths | Weaknesses | Use Cases |
|---|---|---|---|
| matplotlib | Highly customizable, stable | Verbose syntax | Fine control, paper figures |
| seaborn | Beautiful, strong statistical functions | Less customizable | Quick exploration, statistical charts |
| plotly | Interactive, beautiful | Interactivity is lost in static print figures | Presentations, online reports |
| plotnine | ggplot2 syntax | Smaller community | R users transitioning |
Chapter Focus: matplotlib + seaborn
# Set default styles
import matplotlib.pyplot as plt
import seaborn as sns
# seaborn style
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)
# Or use a matplotlib style instead (this style name requires matplotlib 3.6 or later)
# plt.style.use('seaborn-v0_8-darkgrid')
# Chinese font settings (only needed when labels contain Chinese characters; avoids garbled text)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # macOS
# plt.rcParams['font.sans-serif'] = ['SimHei']  # Windows
plt.rcParams['axes.unicode_minus'] = False  # Render the minus sign correctly with these fonts
Chapter Structure
Section 1: Univariate Visualization
- Continuous variables: histogram, KDE, box plot, violin plot
- Categorical variables: bar chart, pie chart
- Distribution diagnostics: Q-Q plot, P-P plot
- Case Study: Visualizing income distribution
Section 2: Bivariate Visualization
- Two continuous variables: scatter plot, hexbin, contour plot
- Continuous vs categorical: grouped box plot, violin plot, strip plot
- Two categorical variables: stacked bar chart, heatmap
- Correlation visualization: correlation matrix heatmap, pair plot
- Case Study: Relationship between education and wages
Section 3: Regression Visualization
- Regression fit plots
- Residual diagnostic plots (four-in-one)
- Partial regression plots
- Influence diagnostic plots
- Prediction interval visualization
- Case Study: Complete regression analysis report
Section 4: Distribution Comparison
- Overlapping density plots
- CDF comparison plots
- Grouped box plots and violin plots
- Ridgeline plots
- Case Study: Wage distribution comparison across regions
Section 5: Publication-Quality Figures
- Chart size and resolution settings
- Font and label specifications
- Multi-panel layouts
- Exporting high-quality images (PNG, PDF, SVG)
- LaTeX figure integration
- Common journal figure requirements
- Case Study: Complete workflow for creating paper figures
Learning Path Recommendations
Beginners (1-2 days)
- Master the basic charts in Sections 1-2
- Conduct exploratory data analysis
- Use seaborn for quick plotting
Intermediate Learning (3-5 days)
- Work through Sections 3-4 in depth
- Master regression diagnostic visualization
- Create complex multi-panel layouts
Advanced Applications (1 week)
- Complete all content in Section 5
- Create publication-quality figures
- Master custom styles and themes
Practical Recommendations
Visualization Checklist (for every analysis)
Exploration Phase:
- [ ] Plot distributions of all variables
- [ ] Check for outliers (box plots)
- [ ] Plot pair plot of key variables
- [ ] Calculate and visualize correlation matrix
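The last item of the exploration checklist can be sketched with NumPy and matplotlib alone. The data below are simulated along the lines of the wage example earlier in the chapter, and the choice of a diverging colormap centered at zero is our assumption, not a requirement.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Simulated data mirroring the wage example (illustration only)
rng = np.random.default_rng(42)
n = 500
education = rng.normal(13, 3, n)
experience = rng.uniform(0, 30, n)
log_wage = 1.5 + 0.08 * education + 0.03 * experience + rng.normal(0, 0.3, n)

names = ["log_wage", "education", "experience"]
data = np.vstack([log_wage, education, experience])  # one row per variable
corr = np.corrcoef(data)  # 3 x 3 Pearson correlation matrix

fig, ax = plt.subplots(figsize=(5, 4))
# Diverging colormap centered at zero so sign and strength are both visible
im = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(len(names)))
ax.set_yticks(range(len(names)))
ax.set_xticklabels(names, rotation=45, ha="right")
ax.set_yticklabels(names)

# Print each coefficient inside its cell
for i in range(len(names)):
    for j in range(len(names)):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation Matrix (simulated data)")
fig.tight_layout()
```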
Modeling Phase:
- [ ] Plot regression fit
- [ ] Check residual diagnostics
- [ ] Identify influential points
- [ ] Visualize predictions
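For the "identify influential points" item, one possible starting point is the leverage of each observation, that is, the diagonal of the hat matrix. The sketch below computes it for Dataset IV of Anscombe's Quartet. The cutoff of 2p/n is a common rule of thumb that we assume here for illustration; it is not a formal test.

```python
import numpy as np

# Dataset IV from Anscombe's Quartet
x = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Leverage = diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Rule of thumb (an assumption in this sketch): flag points whose leverage
# exceeds 2 * p / n, where p is the number of estimated parameters
n, p = X.shape
threshold = 2 * p / n
flagged = np.where(leverage > threshold)[0]

for i in flagged:
    print(f"Point {i}: x={x[i]:.0f}, y={y[i]:.2f}, leverage={leverage[i]:.2f}")
```

Leverage depends only on the x values, so it flags points that could dominate the fit; whether they actually do also depends on y, which is what measures such as Cook's distance add.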
Reporting Phase:
- [ ] Select the 3-5 most informative charts
- [ ] Optimize chart styles (titles, labels, legends)
- [ ] Ensure charts are self-contained (understandable without the surrounding text)
- [ ] Export high-resolution images
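The export step might look like the following sketch. The figure size, the 300 dpi setting and the file name `figure1` are assumptions chosen for illustration; the target journal's own guidelines should take precedence.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Figure size is given in inches; 3.5 x 2.6 is an assumed single-column size
fig, ax = plt.subplots(figsize=(3.5, 2.6))
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), color="#377eb8")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
fig.tight_layout()

# Written to a temporary directory here; replace with your project's figure folder
out_dir = Path(tempfile.mkdtemp())
paths = []
for ext in ("png", "pdf", "svg"):
    path = out_dir / f"figure1.{ext}"
    # dpi matters for raster output (PNG); PDF and SVG are vector formats
    fig.savefig(path, dpi=300, bbox_inches="tight")
    paths.append(path)

print([p.name for p in paths])
```

Saving from the same figure object to several formats keeps the raster and vector versions identical in content.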
Classic References
Essential Books
Tufte, E. R. (2001). The Visual Display of Quantitative Information
- The bible of data visualization
- Data-ink ratio principle
Wilke, C. O. (2019). Fundamentals of Data Visualization
- Modern visualization guide
- Free online version
Few, S. (2012). Show Me the Numbers: Designing Tables and Graphs to Enlighten
- Business chart design
Academic Journal Figure Guidelines
- Nature: Figure Guidelines
- Science: Figure Preparation
- PNAS: Figure and Table Specifications
Getting Started
Ready? Let's begin with Section 1: Univariate Visualization!
Remember:
"A picture is worth a thousand words, but a good plot is worth a thousand numbers."
Let the data speak, and tell its story with charts!