
6.1 Chapter Introduction: Data Visualization - Telling the Story of Data

From numbers to charts: making research findings crystal clear


Chapter Objectives

After completing this chapter, you will be able to:

  • Understand fundamental principles of data visualization
  • Create various statistical charts using matplotlib and seaborn
  • Produce exploratory plots for univariate and bivariate data
  • Visualize regression analysis results
  • Compare distributions across multiple groups
  • Create publication-quality figures meeting academic standards

Why is Data Visualization So Important?

The Power of Charts: Anscombe's Quartet

In 1973, statistician Francis Anscombe constructed four datasets with nearly identical descriptive statistics:

  • Same means
  • Same variances
  • Same correlation coefficients
  • Same regression lines

But when we plot these data...

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Anscombe's Quartet data
anscombe = {
    'I': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]},
    'II': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
           'y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]},
    'III': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            'y': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]},
    'IV': {'x': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           'y': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]}
}

# Calculate statistics
from scipy.stats import linregress  # imported once, before the loops that use it

print("Anscombe's Quartet Statistics:")
print("="*60)
for name, data in anscombe.items():
    x, y = np.array(data['x']), np.array(data['y'])
    print(f"\nDataset {name}:")
    print(f"  X mean: {x.mean():.2f}, Y mean: {y.mean():.2f}")
    # ddof=1 gives the sample variance, the usual convention for descriptive statistics
    print(f"  X variance: {x.var(ddof=1):.2f}, Y variance: {y.var(ddof=1):.2f}")
    print(f"  Correlation: {np.corrcoef(x, y)[0,1]:.3f}")

    # Regression
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    print(f"  Regression equation: Y = {intercept:.2f} + {slope:.2f}X")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i, (name, data) in enumerate(anscombe.items()):
    x, y = np.array(data['x']), np.array(data['y'])

    # Scatter plot
    axes[i].scatter(x, y, s=80, alpha=0.7)

    # Regression line
    slope, intercept, _, _, _ = linregress(x, y)
    x_line = np.linspace(x.min(), x.max(), 100)
    axes[i].plot(x_line, intercept + slope * x_line, 'r-', linewidth=2)

    axes[i].set_xlabel('X', fontsize=12)
    axes[i].set_ylabel('Y', fontsize=12)
    axes[i].set_title(f'Dataset {name}', fontsize=14, fontweight='bold')
    axes[i].grid(True, alpha=0.3)
    axes[i].set_xlim(3, 20)
    axes[i].set_ylim(3, 14)

plt.suptitle("Anscombe's Quartet: Same Statistics, Different Data Structures",
             fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

Key Insights:

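  • Dataset I: Roughly linear relationship with random scatter
  • Dataset II: Non-linear (curved, approximately quadratic) relationship
  • Dataset III: Tight linear relationship plus one outlier that tilts the fitted line
  • Dataset IV: All x values identical except one high-leverage point that alone determines the slope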

Conclusion: Always plot your data first!


Three Major Goals of Data Visualization

1. Exploratory Analysis

Goal: Understand basic characteristics of data, discover patterns, identify anomalies

Typical Applications:

  • Examine data distributions
  • Discover variable relationships
  • Identify outliers
  • Generate research hypotheses

Example:

python
# Explore wage data
np.random.seed(42)
n = 500
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = 1.5 + 0.08*education + 0.03*experience + np.random.normal(0, 0.3, n)
wage = np.exp(log_wage)

df = pd.DataFrame({'wage': wage, 'education': education, 'experience': experience})

# Quick exploration
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Histogram
axes[0, 0].hist(df['wage'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Wage Distribution')
axes[0, 0].set_xlabel('Wage (thousand yuan/month)')

# Box plot
axes[0, 1].boxplot(df['wage'])
axes[0, 1].set_title('Wage Box Plot')
axes[0, 1].set_ylabel('Wage (thousand yuan/month)')

# Scatter plot 1
axes[1, 0].scatter(df['education'], df['wage'], alpha=0.5)
axes[1, 0].set_title('Education vs Wage')
axes[1, 0].set_xlabel('Years of Education')
axes[1, 0].set_ylabel('Wage (thousand yuan/month)')

# Scatter plot 2
axes[1, 1].scatter(df['experience'], df['wage'], alpha=0.5)
axes[1, 1].set_title('Experience vs Wage')
axes[1, 1].set_xlabel('Work Experience (years)')
axes[1, 1].set_ylabel('Wage (thousand yuan/month)')

plt.tight_layout()
plt.show()

2. Explanatory Visualization

Goal: Clearly communicate research findings

Typical Applications:

  • Paper figures
  • Report charts
  • Presentations
  • Policy briefs

Requirements:

  • Clear titles and labels
  • Appropriate legends
  • Professional color schemes
  • Publication-quality standards
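
The requirements above can be sketched in a single figure. This is a minimal illustration on simulated data, not a prescribed template: the variable names, the output file name `education_wage.png`, and the 300 dpi setting are all assumptions chosen for the example (check your target journal's actual requirements).

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Simulated data (illustrative assumption, mirrors the wage example above)
rng = np.random.default_rng(1)
education = rng.normal(13, 3, 300)
wage = np.exp(1.5 + 0.08 * education + rng.normal(0, 0.3, 300))

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(education, wage, s=15, alpha=0.5, color='#377eb8',
           label='Simulated workers')

# Fit log(wage) on education and draw the implied curve on the wage scale
slope, intercept = np.polyfit(education, np.log(wage), 1)
grid = np.linspace(education.min(), education.max(), 100)
ax.plot(grid, np.exp(intercept + slope * grid), color='#ff7f00',
        linewidth=2, label='Log-linear fit')

# Clear title and labels, a legend, and reduced chart clutter
ax.set_title('Wages Rise with Education (simulated, N=300)')
ax.set_xlabel('Years of education')
ax.set_ylabel('Wage (thousand yuan/month)')
ax.legend(frameon=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.tight_layout()

# Export at a print-oriented resolution (300 dpi is an assumed target)
out_path = os.path.join(tempfile.gettempdir(), 'education_wage.png')
fig.savefig(out_path, dpi=300, bbox_inches='tight')
```

Call `plt.show()` afterwards to view the figure interactively; the saved file is unaffected.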

3. Confirmatory Visualization

Goal: Verify statistical and model assumptions

Typical Applications:

  • Residual diagnostic plots
  • Q-Q plots
  • Influence analysis plots
  • Hypothesis test visualizations
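
As a rough sketch of the first two items, the code below fits a simple regression to simulated data and draws a residuals-vs-fitted plot next to a normal Q-Q plot. The data and variable names are assumptions made for illustration; `scipy.stats.probplot` is used for the Q-Q panel.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated data (illustrative assumption, not one of the chapter's datasets)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 200)

# Fit a simple linear regression and compute residuals
fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted

fig, (ax_res, ax_qq) = plt.subplots(1, 2, figsize=(12, 5))

# Residuals vs fitted: curvature or a funnel shape would signal a problem
ax_res.scatter(fitted, residuals, alpha=0.5)
ax_res.axhline(0, color='red', linestyle='--', linewidth=1)
ax_res.set_xlabel('Fitted values')
ax_res.set_ylabel('Residuals')
ax_res.set_title('Residuals vs Fitted')

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
stats.probplot(residuals, dist='norm', plot=ax_qq)
ax_qq.set_title('Normal Q-Q Plot of Residuals')

fig.tight_layout()
```

Because the simulated errors are normal with constant variance, both panels should look unremarkable here; with real data, these are the plots where violations tend to show up first.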

Fundamental Principles of Visualization

Edward Tufte's Data Visualization Principles

1. Data-Ink Ratio

Principle: Maximize the data-ink ratio by removing unnecessary elements

python
# Bad example: over-decoration
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Over-decorated chart
ax1.bar(range(5), [3, 5, 2, 7, 4], color=['red', 'blue', 'green', 'yellow', 'purple'])
ax1.set_facecolor('lightgray')
ax1.grid(True, which='both', linestyle='--', linewidth=2)
ax1.set_title('Over-decorated', fontsize=14, fontweight='bold', color='red')
ax1.legend(['Data'], loc='best', frameon=True, shadow=True)

# Good example: clean and clear
ax2.bar(range(5), [3, 5, 2, 7, 4], color='steelblue', alpha=0.8)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.set_title('Clean and Clear', fontsize=14, fontweight='bold')
ax2.grid(True, axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

2. Chart Selection Principles

| Data Type | Recommended Charts | Python Tools |
|---|---|---|
| Single continuous variable | Histogram, KDE, box plot | hist(), kdeplot(), boxplot() |
| Single categorical variable | Bar chart, pie chart | bar(), pie() |
| Two continuous variables | Scatter plot, hexbin | scatter(), hexbin() |
| Continuous vs categorical | Grouped box plot, violin plot | boxplot(), violinplot() |
| Time series | Line plot | plot() |
| Multivariate relationships | Pair plot, heatmap | pairplot(), heatmap() |
| Distribution comparison | Overlapping density plots, CDF | kdeplot(), ecdfplot() |

3. Color Usage Principles

python
# Colorblind-friendly color schemes

# Hex codes from the ColorBrewer Set1 palette (yellow dropped, order rearranged)
colorblind_safe = ['#377eb8', '#ff7f00', '#4daf4a', '#f781bf',
                   '#a65628', '#984ea3', '#999999', '#e41a1c']

# Example
fig, ax = plt.subplots(figsize=(10, 6))
for i, color in enumerate(colorblind_safe):
    ax.bar(i, i+1, color=color, edgecolor='black')
ax.set_title('Colorblind-Friendly Palette', fontsize=14)
ax.set_xticks(range(len(colorblind_safe)))
ax.set_xticklabels([f'Color {i+1}' for i in range(len(colorblind_safe))])
plt.show()

Avoid:

  • Red-green combinations (roughly 8% of men have some form of color vision deficiency, most often red-green)
  • Overly bright colors
  • Gradients for categorical variables
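
The last point can be illustrated with a side-by-side comparison. In this sketch the group names and values are invented, and seaborn's built-in `'colorblind'` and `'Blues'` palettes stand in for a qualitative and a sequential scheme respectively.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up categories and values, for illustration only
groups = ['East', 'Central', 'West', 'Northeast']
values = [5.2, 4.1, 3.8, 4.5]

# Qualitative palette: distinct hues with no implied order
qualitative = sns.color_palette('colorblind', n_colors=len(groups))
# Sequential palette: light-to-dark gradient, which implies an ordering
sequential = sns.color_palette('Blues', n_colors=len(groups))

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

ax_bad.bar(groups, values, color=sequential)
ax_bad.set_title('Gradient: suggests an order that is not there')

ax_good.bar(groups, values, color=qualitative)
ax_good.set_title('Qualitative palette: distinct, unordered hues')

fig.tight_layout()
```

A gradient is still the right choice when the variable is ordered or continuous, for example income brackets or a correlation heatmap.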

4. Titles and Labels

Good titles should:

  • Clearly describe chart content
  • Include key information (sample size, time range, etc.)
  • Use active voice

Example:

python
# Poor title
plt.title('Chart')

# Good title
plt.title('Impact of Education on Wages (N=500, 2020 data)', fontsize=14, fontweight='bold')

Python Visualization Ecosystem

Core Library Comparison

| Library | Strengths | Weaknesses | Use Cases |
|---|---|---|---|
| matplotlib | Highly customizable, stable | Verbose syntax | Fine control, paper figures |
| seaborn | Beautiful, strong statistical functions | Less customizable | Quick exploration, statistical charts |
| plotly | Interactive, beautiful | Not suitable for papers | Presentations, online reports |
| plotnine | ggplot2 syntax | Smaller community | R users transitioning |

Chapter Focus: matplotlib + seaborn

python
# Set default styles
import matplotlib.pyplot as plt
import seaborn as sns

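# seaborn style
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)

# Or a matplotlib style (use one or the other: a later call overrides the earlier one)
# plt.style.use('seaborn-v0_8-darkgrid')  # this style name requires matplotlib >= 3.6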

# Chinese font settings (avoid garbled text)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # macOS
# plt.rcParams['font.sans-serif'] = ['SimHei']  # Windows
plt.rcParams['axes.unicode_minus'] = False  # Negative sign display

Chapter Structure

Section 1: Univariate Visualization

  • Continuous variables: histogram, KDE, box plot, violin plot
  • Categorical variables: bar chart, pie chart
  • Distribution diagnostics: Q-Q plot, P-P plot
  • Case Study: Visualizing income distribution

Section 2: Bivariate Visualization

  • Two continuous variables: scatter plot, hexbin, contour plot
  • Continuous vs categorical: grouped box plot, violin plot, strip plot
  • Two categorical variables: stacked bar chart, heatmap
  • Correlation visualization: correlation matrix heatmap, pair plot
  • Case Study: Relationship between education and wages

Section 3: Regression Visualization

  • Regression fit plots
  • Residual diagnostic plots (four-in-one)
  • Partial regression plots
  • Influence diagnostic plots
  • Prediction interval visualization
  • Case Study: Complete regression analysis report

Section 4: Distribution Comparison

  • Overlapping density plots
  • CDF comparison plots
  • Grouped box plots and violin plots
  • Ridgeline plots
  • Case Study: Wage distribution comparison across regions

Section 5: Publication-Quality Figures

  • Chart size and resolution settings
  • Font and label specifications
  • Multi-panel layouts
  • Exporting high-quality images (PNG, PDF, SVG)
  • LaTeX figure integration
  • Common journal figure requirements
  • Case Study: Complete workflow for creating paper figures

Learning Path Recommendations

Beginners (1-2 days)

  1. Master basic charts in Sections 1-2
  2. Conduct exploratory data analysis
  3. Use seaborn for quick plotting

Intermediate Learning (3-5 days)

  1. Deep dive into Sections 3-4
  2. Master regression diagnostic visualization
  3. Create complex multi-panel layouts

Advanced Applications (1 week)

  1. Complete all content in Section 5
  2. Create publication-quality figures
  3. Master custom styles and themes

Practical Recommendations

Visualization Checklist (for every analysis)

Exploration Phase:

  • [ ] Plot distributions of all variables
  • [ ] Check for outliers (box plots)
  • [ ] Plot pair plot of key variables
  • [ ] Calculate and visualize correlation matrix

Modeling Phase:

  • [ ] Plot regression fit
  • [ ] Check residual diagnostics
  • [ ] Identify influential points
  • [ ] Visualize predictions

Reporting Phase:

  • [ ] Select the 3-5 most informative charts
  • [ ] Optimize chart styles (titles, labels, legends)
  • [ ] Ensure charts are self-contained (understandable without the surrounding text)
  • [ ] Export high-resolution images

Classic References

Essential Books

  1. Tufte, E. R. (2001). The Visual Display of Quantitative Information

    • The bible of data visualization
    • Data-ink ratio principle
  2. Wilke, C. O. (2019). Fundamentals of Data Visualization

  3. Few, S. (2012). Show Me the Numbers: Designing Tables and Graphs

    • Business chart design

Academic Journal Figure Guidelines


Getting Started

Ready? Let's begin with Section 1: Univariate Visualization!

Remember:

"A picture is worth a thousand words, but a good plot is worth a thousand numbers."


Let the data speak, and tell its story with charts!

Released under the MIT License. Content © Author.