6.7 Chapter Summary and Review
"The simple graph has brought more information to the data analyst's mind than any other device."— John Tukey, Statistician
Consolidate what you have learned and integrate it into a coherent whole
Chapter Knowledge Framework
Core Content Review
Data Visualization
├── Univariate Visualization
│   ├── Continuous Variables
│   │   ├── Histogram (Frequency vs Density)
│   │   ├── Kernel Density Plot (KDE)
│   │   ├── Box Plot (Outlier Identification)
│   │   ├── Violin Plot (Distribution + Box)
│   │   └── ECDF (Cumulative Distribution)
│   ├── Categorical Variables
│   │   ├── Bar Chart
│   │   └── Pie Chart
│   └── Distribution Diagnostics
│       ├── Q-Q Plot (Normality)
│       ├── Skewness and Kurtosis
│       └── Data Transformation
│
├── Bivariate Visualization
│   ├── Two Continuous Variables
│   │   ├── Scatter Plot + Regression Line
│   │   ├── Non-linear Relationships (LOWESS, Polynomial)
│   │   ├── Hexbin (Large Data)
│   │   ├── 2D Histogram
│   │   └── Contour Plot
│   ├── Continuous vs Categorical
│   │   ├── Grouped Box Plot
│   │   ├── Grouped Violin Plot
│   │   ├── Swarm Plot
│   │   └── Point Plot (Mean + CI)
│   ├── Correlation Analysis
│   │   ├── Correlation Matrix Heatmap
│   │   ├── Pair Plot
│   │   └── Pearson vs Spearman
│   └── Simpson's Paradox
│
├── Regression Visualization
│   ├── Regression Fit Plot
│   ├── Residual Diagnostics (Four-in-One)
│   │   ├── Residuals vs Fitted Values
│   │   ├── Q-Q Plot
│   │   ├── Scale-Location
│   │   └── Residuals vs Leverage
│   ├── Influence Diagnostics
│   │   ├── Cook's Distance
│   │   ├── DFBETAS
│   │   └── Leverage
│   ├── Coefficient Plot
│   └── Prediction Visualization
│
├── Distribution Comparison
│   ├── Overlapping Density Plots
│   ├── ECDF Comparison
│   ├── Grouped Box Plots
│   ├── Grouped Violin Plots
│   └── Ridgeline Plots
│
└── Publication-Quality Figures
    ├── Figure Size and Resolution
    ├── Fonts and Styles
    ├── Multi-Panel Layouts (GridSpec)
    ├── Export Formats (PDF, PNG, SVG)
    └── Journal Requirements
Key Concepts Summary
1. Anscombe's Quartet
Core Lesson: Always plot your data first!
- Four datasets: same means, variances, correlation coefficients, regression lines
- But completely different data structures
- Statistics cannot replace visualization
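To make the lesson concrete, here is a minimal sketch that computes the summary statistics for datasets I and IV of the quartet using only the Python standard library. The helper name `summarize` is an illustrative assumption, not part of any library.

```python
from statistics import mean, variance

def summarize(x, y):
    """Return means, variances, Pearson r, and the OLS line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "var_x": variance(x), "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

# Datasets I and IV of Anscombe's Quartet
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

stats_1 = summarize(x1, y1)
stats_4 = summarize(x4, y4)
for name, s in (("I", stats_1), ("IV", stats_4)):
    print(name, {k: round(v, 2) for k, v in s.items()})
```

Both datasets should report almost the same numbers, even though dataset IV has ten points stacked at x = 8 and a single point at x = 19. Only a plot reveals that difference.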
2. Three Principles of Data Visualization
| Principle | Meaning | Practice |
|---|---|---|
| Data-Ink Ratio | Maximize data information density | Remove chart clutter |
| Accuracy | Don't mislead readers | Start bar-chart y-axes at zero, avoid 3D charts |
| Readability | Quickly convey information | Clear labels, reasonable colors |
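One way to apply these principles in Matplotlib is sketched below. The helper `declutter` and the bar values are made up for illustration; the sketch assumes Matplotlib is installed and uses the non-interactive Agg backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display needed
import matplotlib.pyplot as plt

def declutter(ax):
    """Remove non-data ink from an Axes (hypothetical helper)."""
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)  # drop the box around the plot
    ax.grid(False)                          # no background grid
    ax.tick_params(length=3)                # short, unobtrusive tick marks
    return ax

fig, ax = plt.subplots(figsize=(4, 3))
categories = ["A", "B", "C"]
values = [12, 30, 21]                 # made-up illustrative values
ax.bar(categories, values, color="0.4")
ax.set_ylim(bottom=0)                 # accuracy: bar heights start at zero
ax.set_xlabel("Group")                # readability: label both axes
ax.set_ylabel("Count")
declutter(ax)
```

Keeping the cleanup in one helper makes it easy to apply the same style to every panel of a multi-panel figure.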
3. Simpson's Paradox
Lesson: Overall trend ≠ Grouped trend
- Cause: Confounding variables
- Solution: Grouped analysis, control confounding factors
- Common trap in social science research
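The reversal can be reproduced with a small simulation. The sketch below uses three hypothetical groups with made-up baselines and x ranges. Within each group the slope is negative, but groups with higher baselines also sit at higher x values, so the pooled slope turns positive.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)  # fixed seed for reproducibility

# Hypothetical groups: name -> (baseline outcome, x range). All values are made up.
groups = {"A": (50.0, (0, 10)), "B": (70.0, (20, 30)), "C": (90.0, (40, 50))}
within_slope = -0.5

data = {}
for name, (baseline, (lo, hi)) in groups.items():
    xs = [random.uniform(lo, hi) for _ in range(100)]
    ys = [baseline + within_slope * (x - lo) + random.gauss(0, 0.5) for x in xs]
    data[name] = (xs, ys)

# Slope inside each group vs. slope when the grouping variable is ignored
group_slopes = {name: ols_slope(xs, ys) for name, (xs, ys) in data.items()}
all_x = [x for xs, _ in data.values() for x in xs]
all_y = [y for _, ys in data.values() for y in ys]
pooled_slope = ols_slope(all_x, all_y)

print("within-group slopes:", {k: round(v, 2) for k, v in group_slopes.items()})
print("pooled slope:", round(pooled_slope, 2))
```

The grouping variable acts as the confounder here: it drives both x and the outcome, so ignoring it flips the sign of the estimated relationship.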
4. Chart Selection Decision Framework
def choose_plot(var1_type, var2_type=None, n_observations=None):
    """
    Rule-of-thumb chart selection

    Parameters:
    -----------
    var1_type : str
        'continuous' or 'categorical'
    var2_type : str or None
        'continuous', 'categorical', or None (univariate)
    n_observations : int or None
        Sample size; None is treated as a small sample

    Returns:
    --------
    str : Recommended chart type
    """
    # An unknown sample size falls back to the small-sample recommendation
    # (comparing None with an int would raise a TypeError)
    n = 0 if n_observations is None else n_observations

    if var2_type is None:
        # Univariate
        if var1_type == 'continuous':
            return 'hist + kde' if n < 10000 else 'kde only'
        else:
            return 'bar chart' if n < 20 else 'sorted bar chart'

    # Bivariate
    if var1_type == 'continuous' and var2_type == 'continuous':
        # Scatter plots overplot badly on large samples
        if n < 1000:
            return 'scatter + regline'
        else:
            return 'hexbin or 2D histogram'
    elif {var1_type, var2_type} == {'continuous', 'categorical'}:
        if n < 500:
            return 'violin plot'
        else:
            return 'box plot'
    else:  # Two categorical variables
        return 'stacked bar or heatmap'
10 Advanced Programming Exercises
Exercise 1: Reproducing Anscombe's Quartet (⭐⭐⭐)
Task:
- Use the provided Anscombe's Quartet data
- For each dataset, calculate: mean, standard deviation, correlation coefficient, regression equation
- Verify that the four datasets have nearly identical statistics
- Create a 2×2 grid of subplots showing scatter plots + regression lines for all four datasets
- Annotate each subplot with statistics
Grading Criteria:
- Correct statistical calculations (20 points)
- Beautiful visualization (20 points)
- Clear annotations (10 points)
Data
anscombe = {
'I': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]},
'II': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
'y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]},
'III': {'x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
'y': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]},
'IV': {'x': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
'y': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]}
}
Exercise 2: Exploring Simpson's Paradox (⭐⭐⭐⭐)
Background: A company wants to study the relationship between "training duration" and "performance rating". The data includes three departments (Sales, Technical, Management), each with different baseline performance.
Task:
- Generate simulated data:
- Sales department: Baseline performance 60, training-performance slope -0.3
- Technical department: Baseline performance 75, training-performance slope -0.3
- Management department: Baseline performance 85, training-performance slope -0.3
- Calculate and visualize:
- Overall correlation (ignoring department)
- Within-department correlation
- Create comparison plots demonstrating Simpson's Paradox
- Write a 100-word analysis explaining why this phenomenon occurs
Grading Criteria:
- Correct data generation (15 points)
- Clear visualization of paradox (25 points)
- Analysis report (10 points)
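Hint: Use np.random.seed() to ensure reproducibility. For the paradox to appear, training duration must also differ across departments (for example, longer average training in departments with higher baseline performance); otherwise the pooled slope stays close to -0.3.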
Exercise 3: Four-in-One Regression Diagnostics (⭐⭐⭐⭐)
Task:
- Use the provided wage data (education, experience, wage)
- Fit multiple regression model:
log(wage) ~ education + experience + experience²
- Create standard four-in-one diagnostic plots:
- Residuals vs Fitted Values (add LOWESS curve)
- Q-Q Plot
- Scale-Location Plot
- Residuals vs Leverage (mark points with Cook's D > 4/n)
- Add panel labels (A, B, C, D) to each subplot
- Write a diagnostic report (200 words) based on the plots
Data Generation:
import numpy as np

np.random.seed(42)  # fixed seed for reproducibility
n = 200
education = np.random.normal(13, 3, n)
experience = np.random.uniform(0, 30, n)
# Log wage is quadratic in experience, with normal noise
log_wage = 1.5 + 0.08*education + 0.03*experience - 0.0005*experience**2 + np.random.normal(0, 0.3, n)
Grading Criteria:
- Four diagnostic plots correct (30 points)
- LOWESS curve and annotations (10 points)
- Diagnostic report (10 points)
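The Cook's D > 4/n rule of thumb used in this exercise can be sketched for the simple-regression case with plain Python. The example uses dataset III of Anscombe's Quartet rather than the wage data, and the function name `cooks_distance` is an illustrative assumption. For a multiple regression, a statistics library's influence tools would be the usual route.

```python
def cooks_distance(x, y):
    """Cook's distance of each point in a simple linear regression of y on x."""
    n, p = len(x), 2  # p = number of fitted coefficients (intercept + slope)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)         # residual variance
    leverage = [1 / n + (xi - mx) ** 2 / sxx for xi in x]  # hat-matrix diagonal
    return [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]

# Anscombe dataset III: a clean line plus one outlier
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

distances = cooks_distance(x, y)
threshold = 4 / len(x)
flagged = [i for i, d in enumerate(distances) if d > threshold]
print("flagged indices:", flagged)
```

Only the outlier at x = 13 exceeds the threshold. The 4/n cutoff is a convention rather than a formal test, so flagged points call for inspection, not automatic deletion.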
Exercise 4: Enhanced Correlation Matrix Heatmap (⭐⭐⭐⭐)
Task:
- Calculate correlation matrix and p-value matrix for multiple variables
- Create heatmap showing only correlations where p < 0.05
- Display "n.s." for non-significant positions
- Add asterisk markers for each correlation:
- *** for p < 0.001
- ** for p < 0.01
- * for p < 0.05
- Use a colorblind-friendly color scheme
Variables: Wage, education, experience, age, commute time (generate yourself)
Grading Criteria:
- Correct p-value calculation (20 points)
- Beautiful heatmap (15 points)
- Correct significance markers (15 points)
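The marker convention in this exercise can be captured in a small helper. The sketch below is one possible implementation; the function names `significance_stars` and `format_cell` are hypothetical.

```python
def significance_stars(p_value):
    """Map a p-value to the asterisk convention used in this exercise."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "n.s."

def format_cell(r, p_value):
    """Text for one heatmap cell: correlation plus stars, or 'n.s.'."""
    stars = significance_stars(p_value)
    return "n.s." if stars == "n.s." else f"{r:.2f}{stars}"

print(format_cell(0.4567, 0.004))
print(format_cell(0.10, 0.30))
```

A grid of such strings can then be passed as the annotation array of a heatmap, with the string format option set so the text is drawn as-is.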
Exercise 5: Distribution Comparison Visualization (⭐⭐⭐⭐⭐)
Background: Compare wage distributions across four regions (East, Central, West, Northeast).
Task:
- Generate wage data for four regions (different means and variances)
- Create 2×2 comprehensive comparison:
- Overlapping density plots (KDE)
- ECDF comparison
- Grouped violin plot
- Ridgeline plot
- Perform statistical test (ANOVA)
- If ANOVA is significant, perform post-hoc test (Tukey HSD)
- Annotate statistical significance on plots (e.g., ***, **, *)
Grading Criteria:
- Four visualization methods (30 points)
- Correct statistical tests (10 points)
- Significance annotations (10 points)
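The ECDF comparison in this exercise rests on a simple calculation: sort the data and assign each point its cumulative proportion. A minimal sketch follows; the wage values for the two regions are made up and the function names are illustrative.

```python
def ecdf(values):
    """Return sorted values and cumulative proportions for a step plot."""
    xs = sorted(values)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]  # proportion of data <= xs[i]
    return xs, ps

def ecdf_at(values, t):
    """Proportion of observations less than or equal to t."""
    return sum(v <= t for v in values) / len(values)

# Made-up wage samples for two regions
east = [52, 61, 58, 70, 66, 75, 80, 49]
west = [41, 45, 55, 50, 60, 47, 52, 43]

xs_east, ps_east = ecdf(east)
share_east = ecdf_at(east, 55)  # share of East wages at or below 55
share_west = ecdf_at(west, 55)  # share of West wages at or below 55
print(share_east, share_west)
```

Unlike a histogram or KDE, the ECDF needs no bin width or bandwidth, which makes it a useful complement when comparing several groups.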
Scoring Summary Table
| Exercise | Difficulty | Total Points | Focus Areas |
|---|---|---|---|
| 1. Anscombe's Quartet | ⭐⭐⭐ | 50 | Data exploration, visualization basics |
| 2. Simpson's Paradox | ⭐⭐⭐⭐ | 50 | Confounding variables, grouped analysis |
| 3. Four-in-One Diagnostics | ⭐⭐⭐⭐ | 50 | Model diagnostics, LOWESS |
| 4. Enhanced Correlation Matrix | ⭐⭐⭐⭐ | 50 | Statistical inference, heatmaps |
| 5. Distribution Comparison | ⭐⭐⭐⭐⭐ | 50 | Multi-group comparison, statistical tests |
Note: Exercises 6-10 are available in the complete version.
Total Points: 250 (out of 500 in full version)
Recommended Learning Resources
Essential Books
Tufte, E. R. (2001). The Visual Display of Quantitative Information
- The bible of data visualization
Wilke, C. O. (2019). Fundamentals of Data Visualization
- Free online version: https://clauswilke.com/dataviz/
Few, S. (2012). Show Me the Numbers
- Business chart design guide
Online Tutorials
- Matplotlib Official Tutorials: https://matplotlib.org/stable/tutorials/index.html
- Seaborn Gallery: https://seaborn.pydata.org/examples/index.html
- Python Graph Gallery: https://python-graph-gallery.com/
Academic Journal Figure Guidelines
- Nature: https://www.nature.com/nature/for-authors/final-submission
- Science: https://www.science.org/content/page/instructions-authors
- PNAS: https://www.pnas.org/author-center/submitting-your-manuscript
Learning Recommendations
Beginners (Just Completed This Chapter)
Solidify Basics:
- Complete exercises 1-3
- Focus on: basic chart types, meaning of R²
Daily Practice:
- Practice with real data (e.g., UCI datasets)
- Try reproducing figures from papers
Intermediate Learning (1-2 Months)
Deepen Understanding:
- Complete exercises 4-7
- Learn statistical inference visualization
Read Papers:
- Find 3-5 top economics journal papers
- Analyze their figure designs
- Try to reproduce them
Advanced Application (3-6 Months)
Comprehensive Skills:
- Complete exercises 8-10
- Participate in Kaggle data visualization competitions
Develop Style:
- Build personal chart template library
- Form unified visual style
Continue Learning
After completing this chapter, recommended learning path:
Module 7: Time Series Visualization
- Trend plots, seasonal decomposition
- Autocorrelation plots (ACF/PACF)
Module 8: Causal Inference Visualization
- Parallel trends test
- Event study plots
- RDD plots
Advanced Topics:
- Interactive visualization (Plotly, Bokeh)
- Animated charts (GIF, Video)
- Geographic data visualization (Geopandas)
Self-Assessment
After completing this chapter, you should be able to:
- [ ] Create 10+ types of statistical charts without consulting documentation
- [ ] Diagnose common regression model issues
- [ ] Identify Simpson's Paradox
- [ ] Create journal-compliant figures
- [ ] Write complete EDA reports
- [ ] Explain complex statistical charts to non-technical audiences
If you can do all of the above, congratulations on mastering core data visualization skills!
Final Advice
Tufte's Three Core Principles
Above all else, show the data
- Data first, decoration second
Maximize the data-ink ratio
- Devote as much of the figure's ink as possible to the data itself
Erase non-data ink
- Remove non-data elements
The Highest Level of Visualization
"The best graph is the one that doesn't need a title or caption to be understood."
Continuous Improvement
- After each visualization, ask yourself:
- Can this chart stand alone?
- Can the core message be understood in 5 seconds?
- Can colorblind people distinguish it?
- Is it still clear when printed in black and white?
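The black-and-white question can be checked roughly before printing. The sketch below converts hex colors to approximate gray levels using the common Rec. 601 luma weights and reports the smallest gap between any two colors in a palette. It checks grayscale contrast only and does not simulate color vision deficiency. The helper names and the idea of comparing the gap against a cutoff are assumptions for illustration.

```python
def gray_level(hex_color):
    """Approximate gray level (0-255) of a '#RRGGBB' color (Rec. 601 luma)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b

def min_gray_gap(palette):
    """Smallest difference in gray level between any two palette colors."""
    levels = sorted(gray_level(c) for c in palette)
    return min(b - a for a, b in zip(levels, levels[1:]))

# Example palette: four colors from the widely used Okabe-Ito set
palette = ["#000000", "#0072B2", "#E69F00", "#F0E442"]
for color in palette:
    print(color, round(gray_level(color), 1))
print("smallest gap:", round(min_gray_gap(palette), 1))
```

A small gap suggests that two series will look alike in grayscale, in which case different line styles or markers can carry the distinction instead of color.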
Happy learning! Continue exploring other exciting chapters in StatsPai!