
6.7 Chapter Summary and Review

"The simple graph has brought more information to the data analyst's mind than any other device."— John Tukey, Statistician

Consolidate what you have learned and integrate it into one framework



Chapter Knowledge Framework

Core Content Review

Data Visualization
├── Univariate Visualization
│   ├── Continuous Variables
│   │   ├── Histogram (Frequency vs Density)
│   │   ├── Kernel Density Plot (KDE)
│   │   ├── Box Plot (Outlier Identification)
│   │   ├── Violin Plot (Distribution + Box)
│   │   └── ECDF (Cumulative Distribution)
│   ├── Categorical Variables
│   │   ├── Bar Chart
│   │   └── Pie Chart
│   └── Distribution Diagnostics
│       ├── Q-Q Plot (Normality)
│       ├── Skewness and Kurtosis
│       └── Data Transformation

├── Bivariate Visualization
│   ├── Two Continuous Variables
│   │   ├── Scatter Plot + Regression Line
│   │   ├── Non-linear Relationships (LOWESS, Polynomial)
│   │   ├── Hexbin (Large Data)
│   │   ├── 2D Histogram
│   │   └── Contour Plot
│   ├── Continuous vs Categorical
│   │   ├── Grouped Box Plot
│   │   ├── Grouped Violin Plot
│   │   ├── Swarm Plot
│   │   └── Point Plot (Mean + CI)
│   ├── Correlation Analysis
│   │   ├── Correlation Matrix Heatmap
│   │   ├── Pair Plot
│   │   └── Pearson vs Spearman
│   └── Simpson's Paradox

├── Regression Visualization
│   ├── Regression Fit Plot
│   ├── Residual Diagnostics (Four-in-One)
│   │   ├── Residuals vs Fitted Values
│   │   ├── Q-Q Plot
│   │   ├── Scale-Location
│   │   └── Residuals vs Leverage
│   ├── Influence Diagnostics
│   │   ├── Cook's Distance
│   │   ├── DFBETAS
│   │   └── Leverage
│   ├── Coefficient Plot
│   └── Prediction Visualization

├── Distribution Comparison
│   ├── Overlapping Density Plots
│   ├── ECDF Comparison
│   ├── Grouped Box Plots
│   ├── Grouped Violin Plots
│   └── Ridgeline Plots

└── Publication-Quality Figures
    ├── Figure Size and Resolution
    ├── Fonts and Styles
    ├── Multi-Panel Layouts (GridSpec)
    ├── Export Formats (PDF, PNG, SVG)
    └── Journal Requirements

Key Concepts Summary

1. Anscombe's Quartet

Core Lesson: Always plot your data first!

  • Four datasets with nearly identical means, variances, correlation coefficients, and regression lines
  • But completely different data structures
  • Summary statistics cannot replace visualization

2. Three Principles of Data Visualization

| Principle | Meaning | Practice |
|---|---|---|
| Data-Ink Ratio | Maximize data information density | Remove chart clutter |
| Accuracy | Don't mislead readers | Y-axis from zero, avoid 3D charts |
| Readability | Quickly convey information | Clear labels, reasonable colors |

3. Simpson's Paradox

Lesson: The overall trend can differ from, or even reverse, the trend within each group

  • Cause: Confounding variables
  • Solution: Grouped analysis, control confounding factors
  • Common trap in social science research

4. Chart Selection Decision Framework

python
def choose_plot(var1_type, var2_type=None, n_observations=None):
    """
    Intelligent chart selection system

    Parameters:
    -----------
    var1_type : str
        'continuous' or 'categorical'
    var2_type : str or None
        'continuous', 'categorical', or None (univariate)
    n_observations : int or None
        Sample size; if None, the small-sample recommendation is returned

    Returns:
    --------
    str : Recommended chart type
    """
    valid_types = ('continuous', 'categorical')
    if var1_type not in valid_types or var2_type not in valid_types + (None,):
        raise ValueError("variable types must be 'continuous' or 'categorical'")

    # Treat an unknown sample size as small so the comparisons never see None
    n = 0 if n_observations is None else n_observations

    if var2_type is None:
        # Univariate
        if var1_type == 'continuous':
            return 'hist + kde' if n < 10000 else 'kde only'
        return 'bar chart' if n < 20 else 'sorted bar chart'

    # Bivariate
    if var1_type == 'continuous' and var2_type == 'continuous':
        return 'scatter + regline' if n < 1000 else 'hexbin or 2D histogram'

    if var1_type != var2_type:
        # One continuous and one categorical variable (in either order)
        return 'violin plot' if n < 500 else 'box plot'

    # Two categorical variables
    return 'stacked bar or heatmap'

10 Advanced Programming Exercises

Exercise 1: Reproducing Anscombe's Quartet (⭐⭐⭐)

Task:

  1. Use the provided Anscombe's Quartet data
  2. For each dataset, calculate: mean, standard deviation, correlation coefficient, regression equation
  3. Verify that the four datasets have nearly identical statistics
  4. Create a 2×2 subplot showing scatter plots + regression lines for all four datasets
  5. Annotate each subplot with statistics

Grading Criteria:

  • Correct statistical calculations (20 points)
  • Beautiful visualization (20 points)
  • Clear annotations (10 points)

Data:

python
anscombe = {
    'I': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]},
    'II': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
           'y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]},
    'III': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            'y': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]},
    'IV': {'x': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           'y': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]}
}
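
As a starting point for steps 2 and 3, the summary statistics can be computed with NumPy alone. The following is a minimal sketch rather than a reference solution: it repeats the data above so that it runs on its own, the helper name `summarize` is hypothetical, and it assumes the sample standard deviation (`ddof=1`) and a degree-1 `np.polyfit` fit are what the task intends.

```python
import numpy as np

# Anscombe's Quartet, repeated from the exercise so the sketch is self-contained
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    'I': {'x': x123,
          'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]},
    'II': {'x': x123,
           'y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]},
    'III': {'x': x123,
            'y': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]},
    'IV': {'x': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           'y': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]},
}


def summarize(x, y):
    """Return summary statistics for one dataset (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = intercept + slope * x
    return {
        'mean_x': x.mean(), 'mean_y': y.mean(),
        'sd_x': x.std(ddof=1), 'sd_y': y.std(ddof=1),  # sample standard deviations
        'r': np.corrcoef(x, y)[0, 1],                  # Pearson correlation
        'slope': slope, 'intercept': intercept,
    }


stats = {name: summarize(d['x'], d['y']) for name, d in anscombe.items()}
for name, s in stats.items():
    print(f"{name:>3}: mean_y={s['mean_y']:.2f}  sd_y={s['sd_y']:.2f}  "
          f"r={s['r']:.3f}  y = {s['intercept']:.2f} + {s['slope']:.2f}x")
```

The printed rows come out nearly identical, which is exactly why the scatter plots in step 4 are needed to tell the datasets apart.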

Exercise 2: Exploring Simpson's Paradox (⭐⭐⭐⭐)

Background: A company wants to study the relationship between "training duration" and "performance rating". The data includes three departments (Sales, Technical, Management), each with different baseline performance.

Task:

  1. Generate simulated data:
    • Sales department: Baseline performance 60, training-performance slope -0.3
    • Technical department: Baseline performance 75, training-performance slope -0.3
    • Management department: Baseline performance 85, training-performance slope -0.3
  2. Calculate and visualize:
    • Overall correlation (ignoring department)
    • Within-department correlation
  3. Create comparison plots demonstrating Simpson's Paradox
  4. Write a 100-word analysis explaining why this phenomenon occurs

Grading Criteria:

  • Correct data generation (15 points)
  • Clear visualization of paradox (25 points)
  • Analysis report (10 points)

Hint: Use np.random.seed() to ensure reproducibility
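
One way to set up the simulation is sketched below. The exercise fixes the baselines and the within-department slope; everything else is an assumption made for illustration: the mean training hours per department (10, 30, 50), the spread of hours, the noise level, and the sample size of 100 per department. The key design choice is that departments with higher baselines also train longer on average, which is what makes the pooled correlation positive even though every within-department slope is negative.

```python
import numpy as np

np.random.seed(42)  # reproducibility, as the hint suggests

baselines = {'Sales': 60, 'Technical': 75, 'Management': 85}   # from the exercise
mean_hours = {'Sales': 10, 'Technical': 30, 'Management': 50}  # assumption
slope = -0.3       # within-department slope, from the exercise
n_per_dept = 100   # assumption

hours_parts, rating_parts, dept_labels = [], [], []
for name, base in baselines.items():
    # Assumed spread of training hours (sd = 5) and rating noise (sd = 2)
    h = np.random.normal(mean_hours[name], 5, n_per_dept)
    r = base + slope * h + np.random.normal(0, 2, n_per_dept)
    hours_parts.append(h)
    rating_parts.append(r)
    dept_labels += [name] * n_per_dept

hours = np.concatenate(hours_parts)
rating = np.concatenate(rating_parts)
dept = np.array(dept_labels)

# Pooled correlation ignores department; within-department correlations control for it
overall_r = np.corrcoef(hours, rating)[0, 1]
within_r = {name: np.corrcoef(hours[dept == name], rating[dept == name])[0, 1]
            for name in baselines}

print(f"Overall r = {overall_r:.2f}")
for name, r in within_r.items():
    print(f"{name:>10} r = {r:.2f}")
```

If the three departments shared the same distribution of training hours, the pooled slope would also be negative and no paradox would appear.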


Exercise 3: Four-in-One Regression Diagnostics (⭐⭐⭐⭐)

Task:

  1. Use the provided wage data (education, experience, wage)
  2. Fit multiple regression model: log(wage) ~ education + experience + experience²
  3. Create standard four-in-one diagnostic plots:
    • Residuals vs Fitted Values (add LOWESS curve)
    • Q-Q Plot
    • Scale-Location Plot
    • Residuals vs Leverage (mark points with Cook's D > 4/n)
  4. Add panel labels (A, B, C, D) to each subplot
  5. Write a diagnostic report (200 words) based on the plots

Data Generation:

python
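import numpy as np

np.random.seed(42)  # fixed seed for reproducibility
n = 200
education = np.random.normal(13, 3, n)     # years of schooling
experience = np.random.uniform(0, 30, n)   # years of work experience
# Log wage: linear in education, concave in experience, plus Gaussian noise
log_wage = 1.5 + 0.08*education + 0.03*experience - 0.0005*experience**2 + np.random.normal(0, 0.3, n)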

Grading Criteria:

  • Four diagnostic plots correct (30 points)
  • LOWESS curve and annotations (10 points)
  • Diagnostic report (10 points)
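
The quantities behind the Residuals vs Leverage panel can be computed directly from the design matrix. The sketch below regenerates the exercise data and uses only NumPy; it is one possible way to obtain leverage and Cook's distance, and in practice a statistics library would normally supply these values. With h_i the leverage, r_i the internally studentized residual, and p the number of coefficients, it uses D_i = (r_i^2 / p) * h_i / (1 - h_i). Variable names are illustrative.

```python
import numpy as np

# Same data-generating process as in the exercise
np.random.seed(42)
n = 200
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
log_wage = (1.5 + 0.08 * education + 0.03 * experience
            - 0.0005 * experience**2 + np.random.normal(0, 0.3, n))

# Design matrix: intercept, education, experience, experience squared
X = np.column_stack([np.ones(n), education, experience, experience**2])
p = X.shape[1]

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
fitted = X @ beta
resid = log_wage - fitted

# Leverage = diagonal of the hat matrix, obtained from the thin QR factor of X
Q, _ = np.linalg.qr(X)
leverage = (Q**2).sum(axis=1)

# Internally studentized residuals and Cook's distance
mse = resid @ resid / (n - p)
std_resid = resid / np.sqrt(mse * (1 - leverage))
cooks_d = std_resid**2 / p * leverage / (1 - leverage)

# Observations to mark on the Residuals vs Leverage panel
flagged = np.flatnonzero(cooks_d > 4 / n)
print(f"{flagged.size} of {n} observations exceed Cook's D = 4/n")
```

The arrays `fitted`, `std_resid`, and `leverage` are the inputs for the four panels; `np.sqrt(np.abs(std_resid))` gives the vertical axis of the Scale-Location plot.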

Exercise 4: Enhanced Correlation Matrix Heatmap (⭐⭐⭐⭐)

Task:

  1. Calculate correlation matrix and p-value matrix for multiple variables
  2. Create heatmap showing only correlations where p < 0.05
  3. Display "n.s." for non-significant positions
  4. Add asterisk markers for each correlation:
    • ***: p < 0.001
    • **: p < 0.01
    • *: p < 0.05
  5. Use a colorblind-friendly color scheme

Variables: Wage, education, experience, age, commute time (generate yourself)

Grading Criteria:

  • Correct p-value calculation (20 points)
  • Beautiful heatmap (15 points)
  • Correct significance markers (15 points)
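
The marker rules in step 4 reduce to a small lookup. The helper below is a hypothetical sketch (the name `significance_label` and the two-decimal formatting are assumptions) that turns one correlation and its p-value into the text for a heatmap cell, returning "n.s." when p is not below 0.05.

```python
def significance_label(r, p):
    """Format one heatmap cell: correlation plus stars, or 'n.s.' (hypothetical helper)."""
    if p < 0.001:
        stars = '***'
    elif p < 0.01:
        stars = '**'
    elif p < 0.05:
        stars = '*'
    else:
        return 'n.s.'  # non-significant cells show no coefficient
    return f'{r:.2f}{stars}'


# Example with made-up (r, p) pairs
for r, p in [(0.456, 0.0004), (0.25, 0.005), (-0.30, 0.02), (0.10, 0.30)]:
    print(significance_label(r, p))
```

Applying the helper to every cell of the correlation and p-value matrices yields a matrix of strings that can serve as the heatmap's annotation text.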

Exercise 5: Distribution Comparison Visualization (⭐⭐⭐⭐⭐)

Background: Compare wage distributions across four regions (East, Central, West, Northeast).

Task:

  1. Generate wage data for four regions (different means and variances)
  2. Create 2×2 comprehensive comparison:
    • Overlapping density plots (KDE)
    • ECDF comparison
    • Grouped violin plot
    • Ridgeline plot
  3. Perform statistical test (ANOVA)
  4. If ANOVA is significant, perform post-hoc test (Tukey HSD)
  5. Annotate statistical significance on plots (e.g., ***, **, *)

Grading Criteria:

  • Four visualization methods (30 points)
  • Correct statistical tests (10 points)
  • Significance annotations (10 points)
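
For the ECDF panel, the curve for each region is just the sorted sample paired with cumulative proportions. The sketch below shows this with NumPy; the helper name `ecdf`, the region means and standard deviations, and the sample size of 300 are made-up illustration values, not part of the exercise.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and their cumulative proportions (hypothetical helper)."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size  # proportion of observations <= x
    return x, y


np.random.seed(42)

# (mean, standard deviation) per region: assumed values for illustration only
regions = {'East': (8000, 2000), 'Central': (6000, 1500),
           'West': (5500, 1800), 'Northeast': (6500, 1200)}
wages = {name: np.random.normal(mu, sd, 300) for name, (mu, sd) in regions.items()}

# One (x, y) pair per region, ready to be drawn as step curves
curves = {name: ecdf(w) for name, w in wages.items()}
for name, (x, y) in curves.items():
    median = x[np.searchsorted(y, 0.5)]  # first value where the ECDF reaches 0.5
    print(f"{name:>10}: median wage about {median:.0f}")
```

Where one region's curve lies entirely to the right of another's, its wages are higher at every quantile, which is easier to read off an ECDF than off overlapping densities.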

Scoring Summary Table

| Exercise | Difficulty | Total Points | Focus Areas |
|---|---|---|---|
| 1. Anscombe's Quartet | ⭐⭐⭐ | 50 | Data exploration, visualization basics |
| 2. Simpson's Paradox | ⭐⭐⭐⭐ | 50 | Confounding variables, grouped analysis |
| 3. Four-in-One Diagnostics | ⭐⭐⭐⭐ | 50 | Model diagnostics, LOWESS |
| 4. Enhanced Correlation Matrix | ⭐⭐⭐⭐ | 50 | Statistical inference, heatmaps |
| 5. Distribution Comparison | ⭐⭐⭐⭐⭐ | 50 | Multi-group comparison, statistical tests |

Note: Exercises 6-10 are available in the complete version.

Total Points: 250 (out of 500 in full version)


Essential Books

  1. Tufte, E. R. (2001). The Visual Display of Quantitative Information

    • The bible of data visualization
  2. Wilke, C. O. (2019). Fundamentals of Data Visualization

  3. Few, S. (2012). Show Me the Numbers

    • Business chart design guide

Online Tutorials

  1. Matplotlib Official Tutorials: https://matplotlib.org/stable/tutorials/index.html
  2. Seaborn Gallery: https://seaborn.pydata.org/examples/index.html
  3. Python Graph Gallery: https://python-graph-gallery.com/

Academic Journal Figure Guidelines


Learning Recommendations

Beginners (Just Completed This Chapter)

  1. Solidify Basics:

    • Complete exercises 1-3
    • Focus on: basic chart types, meaning of R²
  2. Daily Practice:

    • Practice with real data (e.g., UCI datasets)
    • Try reproducing figures from papers

Intermediate Learning (1-2 Months)

  1. Deepen Understanding:

    • Complete exercises 4-7
    • Learn statistical inference visualization
  2. Read Papers:

    • Find 3-5 top economics journal papers
    • Analyze their figure designs
    • Try to reproduce them

Advanced Application (3-6 Months)

  1. Comprehensive Skills:

    • Complete exercises 8-10
    • Participate in Kaggle data visualization competitions
  2. Develop Style:

    • Build personal chart template library
    • Form unified visual style

Continue Learning

After completing this chapter, recommended learning path:

  1. Module 7: Time Series Visualization

    • Trend plots, seasonal decomposition
    • Autocorrelation plots (ACF/PACF)
  2. Module 8: Causal Inference Visualization

    • Parallel trends test
    • Event study plots
    • RDD plots
  3. Advanced Topics:

    • Interactive visualization (Plotly, Bokeh)
    • Animated charts (GIF, Video)
    • Geographic data visualization (Geopandas)

Self-Assessment

After completing this chapter, you should be able to:

  • [ ] Create 10+ types of statistical charts without consulting documentation
  • [ ] Diagnose common regression model issues
  • [ ] Identify Simpson's Paradox
  • [ ] Create journal-compliant figures
  • [ ] Write complete EDA reports
  • [ ] Explain complex statistical charts to non-technical audiences

If you can do all of the above, congratulations on mastering core data visualization skills!


Final Advice

Tufte's Three Core Principles

  1. Above all else, show the data

    • Data first, decoration second
  2. Maximize the data-ink ratio

    • Devote as much of the ink as possible to the data itself
  3. Erase non-data ink

    • Remove non-data elements

The Highest Level of Visualization

"The best graph is the one that doesn't need a title or caption to be understood."

Continuous Improvement

  • After each visualization, ask yourself:
    • Can this chart stand alone?
    • Can the core message be understood in 5 seconds?
    • Can colorblind people distinguish it?
    • Is it still clear when printed in black and white?
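
The black-and-white question can be checked roughly before printing. The sketch below converts hex colors to an approximate gray level using the Rec. 601 luma weights (0.299, 0.587, 0.114). The helper names and the threshold of 30 gray levels are assumptions chosen for illustration, not an established standard, and the check does not simulate color vision deficiency.

```python
from itertools import combinations


def gray_level(hex_color):
    """Approximate gray level (0-255) of a '#rrggbb' color using Rec. 601 luma weights."""
    h = hex_color.lstrip('#')
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def hard_to_tell_apart(palette, min_gap=30):
    """Return color pairs whose gray levels differ by less than min_gap (assumed threshold)."""
    return [(a, b) for a, b in combinations(palette, 2)
            if abs(gray_level(a) - gray_level(b)) < min_gap]


palette = ['#1f77b4', '#ff7f0e', '#2ca02c']  # example blue, orange, green
for color in palette:
    print(color, round(gray_level(color), 1))
print(hard_to_tell_apart(palette))
```

For this example palette the blue and green land close together in gray, so a figure relying on that pair would also need different line styles or markers to survive a grayscale printout.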

Happy learning! Continue exploring other exciting chapters in StatsPai!
