Skip to content

6.3 Bivariate Visualization

"A picture shows me at a glance what it takes dozens of pages of a book to expound."— Ivan Turgenev, Russian Novelist

Exploring relationships between variables: the first step from correlation to causation

DifficultyImportance


Section Objectives

After completing this section, you will be able to:

  • Use scatter plots to display relationships between two continuous variables
  • Identify linear, non-linear, and conditional relationships
  • Use grouped charts to compare distributions across categories
  • Create and interpret correlation matrix heatmaps
  • Use pair plots for multivariate exploration
  • Visualize Simpson's Paradox
  • Handle large-data visualization challenges

Relationships Between Two Continuous Variables

1. Scatter Plot

Purpose: First step in exploring bivariate relationships

Basic Scatter Plot + Regression Line

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress, pearsonr, spearmanr

# Set Chinese font
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

# Generate simulated data
np.random.seed(42)
n = 500

# Data Generating Process
education = np.random.normal(13, 3, n)
education = np.clip(education, 6, 20)
experience = np.random.uniform(0, 30, n)
ability = np.random.normal(0, 1, n)  # Unobserved ability

# Mincer wage equation + ability bias
log_wage = (1.5 + 0.08*education + 0.03*experience -
            0.0005*experience**2 + 0.15*ability +
            np.random.normal(0, 0.3, n))
wage = np.exp(log_wage)

df = pd.DataFrame({
    'wage': wage,
    'log_wage': log_wage,
    'education': education,
    'experience': experience,
    'ability': ability
})

# Create comparison plot: original vs log transformation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: original wage (non-linear relationship)
axes[0].scatter(df['education'], df['wage'], alpha=0.5, s=50, color='steelblue')
slope, intercept, r_value, p_value, std_err = linregress(df['education'], df['wage'])
line_x = np.array([df['education'].min(), df['education'].max()])
line_y = intercept + slope * line_x
axes[0].plot(line_x, line_y, 'r-', linewidth=2, label=f'R² = {r_value**2:.3f}')
axes[0].set_xlabel('Years of Education', fontsize=12)
axes[0].set_ylabel('Wage (thousand yuan/month)', fontsize=12)
axes[0].set_title('Education vs Wage (Level-Level)', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Right plot: log wage (stronger linear relationship)
axes[1].scatter(df['education'], df['log_wage'], alpha=0.5, s=50, color='coral')
slope2, intercept2, r_value2, p_value2, std_err2 = linregress(df['education'], df['log_wage'])
line_y2 = intercept2 + slope2 * line_x
axes[1].plot(line_x, line_y2, 'r-', linewidth=2, label=f'R² = {r_value2**2:.3f}')
axes[1].set_xlabel('Years of Education', fontsize=12)
axes[1].set_ylabel('log(Wage)', fontsize=12)
axes[1].set_title('Education vs log(Wage) (Log-Level)', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Model Comparison:")
print(f"Level-Level: R² = {r_value**2:.4f}, Slope = {slope:.3f}")
print(f"Log-Level:   R² = {r_value2**2:.4f}, Slope = {slope2:.3f}")
print(f"\nInterpretation: After log transformation, R² improved from {r_value**2:.3f} to {r_value2**2:.3f}")
print(f"      Each additional year of education increases wage by approximately {slope2*100:.1f}%")

Correlation Coefficient Calculation and Significance Testing

python
# Calculate Pearson and Spearman correlation coefficients
pearson_corr, pearson_p = pearsonr(df['education'], df['wage'])
spearman_corr, spearman_p = spearmanr(df['education'], df['wage'])

print("\nCorrelation Analysis:")
print("="*60)
print(f"Pearson correlation:  r = {pearson_corr:.4f}, p = {pearson_p:.4e}")
print(f"Spearman correlation: ρ = {spearman_corr:.4f}, p = {spearman_p:.4e}")

if pearson_p < 0.001:
    print("\nConclusion: Education and wage are significantly positively correlated at the 0.001 level")
else:
    print("\nConclusion: No significant correlation found")

# Visualize correlations of different strengths
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

correlations = [0.3, 0.6, 0.9]
for i, target_corr in enumerate(correlations):
    # Generate data with specified correlation
    x = np.random.randn(200)
    y = target_corr * x + np.sqrt(1 - target_corr**2) * np.random.randn(200)

    axes[i].scatter(x, y, alpha=0.5, s=40)

    # Add regression line
    slope, intercept, r_value, _, _ = linregress(x, y)
    x_line = np.array([x.min(), x.max()])
    y_line = intercept + slope * x_line
    axes[i].plot(x_line, y_line, 'r-', linewidth=2)

    axes[i].set_title(f'r = {target_corr:.1f} (R² = {target_corr**2:.2f})',
                     fontsize=14, fontweight='bold')
    axes[i].set_xlabel('X')
    axes[i].set_ylabel('Y')
    axes[i].grid(True, alpha=0.3)

plt.suptitle('Linear Correlations of Different Strengths', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

2. Visualizing Non-Linear Relationships

Case: Inverted U-Shaped Relationship Between Experience and Wage

python
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Scatter plot
axes[0].scatter(df['experience'], df['wage'], alpha=0.5, s=50, color='steelblue')
axes[0].set_xlabel('Work Experience (years)', fontsize=12)
axes[0].set_ylabel('Wage (thousand yuan/month)', fontsize=12)
axes[0].set_title('Experience vs Wage (Raw Data)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Plot 2: Add linear fit (inappropriate)
axes[1].scatter(df['experience'], df['wage'], alpha=0.5, s=50, color='steelblue')
slope, intercept, r_value, _, _ = linregress(df['experience'], df['wage'])
x_line = np.linspace(df['experience'].min(), df['experience'].max(), 100)
y_line = intercept + slope * x_line
axes[1].plot(x_line, y_line, 'r-', linewidth=2, label=f'Linear fit (R² = {r_value**2:.3f})')
axes[1].set_xlabel('Work Experience (years)', fontsize=12)
axes[1].set_ylabel('Wage (thousand yuan/month)', fontsize=12)
axes[1].set_title('Linear Fit (Inappropriate)', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Add quadratic fit (more appropriate)
from numpy.polynomial import polynomial as P
axes[2].scatter(df['experience'], df['wage'], alpha=0.5, s=50, color='steelblue')

# Quadratic polynomial fit
coeffs = np.polyfit(df['experience'], df['wage'], 2)
poly = np.poly1d(coeffs)
y_poly = poly(x_line)
axes[2].plot(x_line, y_poly, 'r-', linewidth=2, label='Quadratic fit')

# Calculate R² for quadratic model
residuals = df['wage'] - poly(df['experience'])
ss_res = np.sum(residuals**2)
ss_tot = np.sum((df['wage'] - df['wage'].mean())**2)
r2_poly = 1 - (ss_res / ss_tot)

axes[2].set_xlabel('Work Experience (years)', fontsize=12)
axes[2].set_ylabel('Wage (thousand yuan/month)', fontsize=12)
axes[2].set_title(f'Quadratic Fit (R² = {r2_poly:.3f})', fontsize=14, fontweight='bold')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nModel Comparison:")
print(f"Linear model R²:   {r_value**2:.4f}")
print(f"Quadratic model R²: {r2_poly:.4f}")
print(f"R² improvement:    {(r2_poly - r_value**2)*100:.2f} percentage points")

LOWESS Smoothing Curve (Locally Weighted Regression)

python
from statsmodels.nonparametric.smoothers_lowess import lowess

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Scatter plot + LOWESS
axes[0].scatter(df['experience'], df['log_wage'], alpha=0.4, s=40, color='lightgray')

# LOWESS curve
smoothed = lowess(df['log_wage'], df['experience'], frac=0.3)
axes[0].plot(smoothed[:, 0], smoothed[:, 1], 'r-', linewidth=3, label='LOWESS curve')
axes[0].set_xlabel('Work Experience (years)', fontsize=12)
axes[0].set_ylabel('log(Wage)', fontsize=12)
axes[0].set_title('LOWESS Smoothing: Identifying Non-Linear Relationships', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Plot 2: Different smoothing levels
axes[1].scatter(df['experience'], df['log_wage'], alpha=0.3, s=30, color='lightgray')

for frac, color, label in [(0.1, 'red', 'Overfitting (frac=0.1)'),
                            (0.3, 'blue', 'Moderate (frac=0.3)'),
                            (0.6, 'green', 'Over-smoothing (frac=0.6)')]:
    smoothed = lowess(df['log_wage'], df['experience'], frac=frac)
    axes[1].plot(smoothed[:, 0], smoothed[:, 1], color=color, linewidth=2, label=label)

axes[1].set_xlabel('Work Experience (years)', fontsize=12)
axes[1].set_ylabel('log(Wage)', fontsize=12)
axes[1].set_title('Impact of Different LOWESS Parameters', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Selection Advice:

  • Bandwidth too small → overfitting (noise)
  • Bandwidth too large → underfitting (over-smoothing)
  • Default value (Scott's rule) usually works well

3. Large Data Visualization: Hexbin and Contour Plots

python
# Generate large dataset
np.random.seed(123)
n_large = 10000
edu_large = np.random.normal(13, 3, n_large)
exp_large = np.random.uniform(0, 30, n_large)
log_wage_large = (1.5 + 0.08*edu_large + 0.03*exp_large -
                  0.0005*exp_large**2 + np.random.normal(0, 0.3, n_large))
wage_large = np.exp(log_wage_large)

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Plot 1: Regular scatter plot (heavy overlap)
axes[0, 0].scatter(edu_large, wage_large, alpha=0.1, s=10, color='steelblue')
axes[0, 0].set_title(f'Scatter Plot (N={n_large:,}, Heavy Overlap)', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Years of Education')
axes[0, 0].set_ylabel('Wage (thousand yuan/month)')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Hexbin plot
hb = axes[0, 1].hexbin(edu_large, wage_large, gridsize=40, cmap='YlOrRd', mincnt=1)
axes[0, 1].set_title('Hexbin Plot (Clear Density)', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Years of Education')
axes[0, 1].set_ylabel('Wage (thousand yuan/month)')
plt.colorbar(hb, ax=axes[0, 1], label='Number of Points')

# Plot 3: 2D histogram
h = axes[1, 0].hist2d(edu_large, wage_large, bins=50, cmap='Blues')
axes[1, 0].set_title('2D Histogram', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Years of Education')
axes[1, 0].set_ylabel('Wage (thousand yuan/month)')
plt.colorbar(h[3], ax=axes[1, 0], label='Frequency')

# Plot 4: Contour plot
from scipy.stats import gaussian_kde
kde = gaussian_kde([edu_large, wage_large])
xi, yi = np.mgrid[edu_large.min():edu_large.max():100j,
                  wage_large.min():wage_large.max():100j]
zi = kde(np.vstack([xi.flatten(), yi.flatten()]))
axes[1, 1].contourf(xi, yi, zi.reshape(xi.shape), levels=15, cmap='viridis')
axes[1, 1].contour(xi, yi, zi.reshape(xi.shape), levels=15, colors='white',
                   linewidths=0.5, alpha=0.5)
axes[1, 1].set_title('Contour Density Plot', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Years of Education')
axes[1, 1].set_ylabel('Wage (thousand yuan/month)')

plt.tight_layout()
plt.show()

Continuous vs Categorical Variables

1. Grouped Box Plots and Violin Plots

python
# Add categorical variables
df['female'] = np.random.binomial(1, 0.5, len(df))
df['gender'] = df['female'].map({0: 'Male', 1: 'Female'})
df['region'] = np.random.choice(['East', 'Central', 'West'], len(df))

# Adjust wage (introduce gender gap)
df.loc[df['female'] == 1, 'log_wage'] -= 0.15
df['wage'] = np.exp(df['log_wage'])

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# First row: Gender comparison
# Plot 1: Box plot
sns.boxplot(x='gender', y='wage', data=df, ax=axes[0, 0], palette=['lightblue', 'lightcoral'])
axes[0, 0].set_title('Gender Wage Gap (Box Plot)', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('Wage (thousand yuan/month)')
axes[0, 0].grid(True, alpha=0.3, axis='y')

# Plot 2: Violin plot
sns.violinplot(x='gender', y='wage', data=df, ax=axes[0, 1], palette=['lightblue', 'lightcoral'])
axes[0, 1].set_title('Gender Wage Gap (Violin Plot)', fontsize=14, fontweight='bold')
axes[0, 1].set_ylabel('Wage (thousand yuan/month)')
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Plot 3: Swarm plot (for small sample sizes)
sample_df = df.sample(100, random_state=42)
sns.violinplot(x='gender', y='wage', data=sample_df, ax=axes[0, 2],
              palette=['lightblue', 'lightcoral'], inner=None, alpha=0.3)
sns.swarmplot(x='gender', y='wage', data=sample_df, ax=axes[0, 2],
             color='black', alpha=0.5, size=3)
axes[0, 2].set_title('Gender Wage Gap (Swarm Plot)', fontsize=14, fontweight='bold')
axes[0, 2].set_ylabel('Wage (thousand yuan/month)')
axes[0, 2].grid(True, alpha=0.3, axis='y')

# Second row: Regional comparison
# Plot 4: Box plot
sns.boxplot(x='region', y='wage', data=df, ax=axes[1, 0],
           order=['East', 'Central', 'West'])
axes[1, 0].set_title('Regional Wage Gap (Box Plot)', fontsize=14, fontweight='bold')
axes[1, 0].set_ylabel('Wage (thousand yuan/month)')
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Plot 5: Violin plot
sns.violinplot(x='region', y='wage', data=df, ax=axes[1, 1],
              order=['East', 'Central', 'West'])
axes[1, 1].set_title('Regional Wage Gap (Violin Plot)', fontsize=14, fontweight='bold')
axes[1, 1].set_ylabel('Wage (thousand yuan/month)')
axes[1, 1].grid(True, alpha=0.3, axis='y')

# Plot 6: Point plot (shows mean and CI)
sns.pointplot(x='region', y='wage', data=df, ax=axes[1, 2],
             order=['East', 'Central', 'West'], capsize=0.2, errwidth=2)
axes[1, 2].set_title('Regional Wage Gap (Point Plot, Mean±SE)', fontsize=14, fontweight='bold')
axes[1, 2].set_ylabel('Wage (thousand yuan/month)')
axes[1, 2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Statistical tests
from scipy.stats import ttest_ind, f_oneway

# t-test (gender)
male_wage = df[df['gender'] == 'Male']['wage']
female_wage = df[df['gender'] == 'Female']['wage']
t_stat, t_p = ttest_ind(male_wage, female_wage)

print("\nGender Wage Gap Test:")
print("="*60)
print(f"Male average wage: {male_wage.mean():.2f} thousand yuan")
print(f"Female average wage: {female_wage.mean():.2f} thousand yuan")
print(f"Difference: {male_wage.mean() - female_wage.mean():.2f} thousand yuan ({(male_wage.mean() / female_wage.mean() - 1)*100:.1f}%)")
print(f"t-test: t = {t_stat:.3f}, p = {t_p:.4e}")

# ANOVA (region)
east = df[df['region'] == 'East']['wage']
central = df[df['region'] == 'Central']['wage']
west = df[df['region'] == 'West']['wage']
f_stat, f_p = f_oneway(east, central, west)

print("\nRegional Wage Gap Test (ANOVA):")
print("="*60)
print(f"East average wage: {east.mean():.2f} thousand yuan")
print(f"Central average wage: {central.mean():.2f} thousand yuan")
print(f"West average wage: {west.mean():.2f} thousand yuan")
print(f"F-test: F = {f_stat:.3f}, p = {f_p:.4e}")

2. Conditional Relationships: Grouped Scatter Plots

python
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Grouped by gender
for gender, color, marker in [('Male', 'blue', 'o'), ('Female', 'red', 's')]:
    mask = df['gender'] == gender
    axes[0].scatter(df.loc[mask, 'education'], df.loc[mask, 'wage'],
                   alpha=0.4, s=50, color=color, marker=marker, label=gender)

    # Fit separate regression lines
    x = df.loc[mask, 'education']
    y = df.loc[mask, 'wage']
    slope, intercept, r_value, _, _ = linregress(x, y)
    x_line = np.array([x.min(), x.max()])
    y_line = intercept + slope * x_line
    axes[0].plot(x_line, y_line, color=color, linewidth=2, linestyle='--')

axes[0].set_xlabel('Years of Education', fontsize=12)
axes[0].set_ylabel('Wage (thousand yuan/month)', fontsize=12)
axes[0].set_title('Education-Wage Relationship by Gender (Parallel Test)', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Plot 2: Using seaborn lmplot style
sns.scatterplot(x='education', y='log_wage', hue='gender', style='gender',
               data=df, ax=axes[1], alpha=0.5, s=50)
sns.regplot(x='education', y='log_wage', data=df[df['gender'] == 'Male'],
           ax=axes[1], scatter=False, color='blue', line_kws={'linewidth': 2})
sns.regplot(x='education', y='log_wage', data=df[df['gender'] == 'Female'],
           ax=axes[1], scatter=False, color='red', line_kws={'linewidth': 2})
axes[1].set_xlabel('Years of Education', fontsize=12)
axes[1].set_ylabel('log(Wage)', fontsize=12)
axes[1].set_title('Education-log(Wage) Relationship by Gender', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Simpson's Paradox

Definition: Phenomenon where overall trends are opposite to grouped trends

python
# Generate Simpson's Paradox data
np.random.seed(999)

# Three schools with different teaching quality and student quality
schools = []
for school, base_score, slope in [('Elite', 80, -0.5),
                                    ('Average', 60, -0.5),
                                    ('Poor', 40, -0.5)]:
    n_school = 100
    study_hours = np.random.uniform(1, 10, n_school)
    scores = base_score + slope * study_hours + np.random.normal(0, 5, n_school)
    schools.append(pd.DataFrame({
        'study_hours': study_hours,
        'score': scores,
        'school': school
    }))

df_simpson = pd.concat(schools, ignore_index=True)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Overall trend (positive correlation)
axes[0].scatter(df_simpson['study_hours'], df_simpson['score'], alpha=0.5, s=50)
slope_all, intercept_all, r_all, _, _ = linregress(df_simpson['study_hours'],
                                                     df_simpson['score'])
x_line = np.array([df_simpson['study_hours'].min(), df_simpson['study_hours'].max()])
y_line = intercept_all + slope_all * x_line
axes[0].plot(x_line, y_line, 'r-', linewidth=2,
            label=f'Overall: slope={slope_all:.3f}')
axes[0].set_xlabel('Study Hours (hours/day)', fontsize=12)
axes[0].set_ylabel('Test Score', fontsize=12)
axes[0].set_title('Overall Trend: More Study, Better Scores?', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Grouped trends (negative correlation)
colors = {'Elite': 'red', 'Average': 'blue', 'Poor': 'green'}
for school, color in colors.items():
    mask = df_simpson['school'] == school
    x = df_simpson.loc[mask, 'study_hours']
    y = df_simpson.loc[mask, 'score']

    axes[1].scatter(x, y, alpha=0.5, s=50, color=color, label=school)

    # Grouped regression lines
    slope, intercept, _, _, _ = linregress(x, y)
    x_line = np.array([x.min(), x.max()])
    y_line = intercept + slope * x_line
    axes[1].plot(x_line, y_line, color=color, linewidth=2, linestyle='--',
                label=f'{school}: slope={slope:.3f}')

axes[1].set_xlabel('Study Hours (hours/day)', fontsize=12)
axes[1].set_ylabel('Test Score', fontsize=12)
axes[1].set_title('Grouped Trend: After Controlling for School, More Study, Worse Scores!', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper right')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nSimpson's Paradox Demonstration:")
print("="*60)
print(f"Overall correlation: {pearsonr(df_simpson['study_hours'], df_simpson['score'])[0]:.3f}")
for school in ['Elite', 'Average', 'Poor']:
    mask = df_simpson['school'] == school
    corr = pearsonr(df_simpson.loc[mask, 'study_hours'],
                    df_simpson.loc[mask, 'score'])[0]
    print(f"{school} school correlation: {corr:.3f}")

print("\nConclusion: Ignoring confounding variables (school) leads to incorrect causal inference!")

Correlation Analysis and Visualization

1. Enhanced Correlation Matrix Heatmap

python
# Select multiple variables
analysis_vars = ['wage', 'education', 'experience', 'log_wage']
df_corr = df[analysis_vars].copy()

# Calculate correlation matrix and p-values
from scipy.stats import pearsonr

def corrfunc(x, y):
    """Calculate correlation coefficient and p-value"""
    return pearsonr(x, y)

# Calculate correlation matrix
corr_matrix = df_corr.corr()

# Calculate p-value matrix
n_vars = len(analysis_vars)
p_matrix = np.zeros((n_vars, n_vars))
for i, var1 in enumerate(analysis_vars):
    for j, var2 in enumerate(analysis_vars):
        if i != j:
            _, p = pearsonr(df_corr[var1], df_corr[var2])
            p_matrix[i, j] = p

# Plot heatmaps
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Correlation coefficients
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
           square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=axes[0],
           vmin=-1, vmax=1)
axes[0].set_title('Correlation Coefficient Matrix', fontsize=14, fontweight='bold')

# Plot 2: Correlation coefficients + significance markers
mask = p_matrix > 0.05  # Non-significant positions
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
           square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=axes[1],
           vmin=-1, vmax=1, mask=mask)
axes[1].set_title('Correlation Coefficient Matrix (Only p<0.05)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Print detailed information
print("\nCorrelation Coefficient Matrix (with Significance):")
print("="*60)
for i, var1 in enumerate(analysis_vars):
    for j, var2 in enumerate(analysis_vars):
        if i < j:  # Only print upper triangle
            corr = corr_matrix.iloc[i, j]
            p = p_matrix[i, j]
            sig = '***' if p < 0.001 else ('**' if p < 0.01 else ('*' if p < 0.05 else ''))
            print(f"{var1:12s} vs {var2:12s}: r = {corr:6.3f}{sig:3s}  (p = {p:.4f})")

2. Pair Plot

python
# Create enhanced pair plot
plot_vars = ['wage', 'education', 'experience', 'log_wage']
df_plot = df[plot_vars + ['gender']].copy()

# Use seaborn's pairplot
g = sns.pairplot(df_plot, hue='gender', diag_kind='kde',
                plot_kws={'alpha': 0.5, 's': 30},
                diag_kws={'linewidth': 2},
                height=3, aspect=1)

# Add correlation coefficients
def corrfunc_plot(x, y, **kwargs):
    """Add correlation coefficients to subplots"""
    r, p = pearsonr(x, y)
    ax = plt.gca()
    ax.annotate(f'r = {r:.3f}', xy=(0.1, 0.9), xycoords=ax.transAxes,
               fontsize=11, fontweight='bold')

g.map_lower(corrfunc_plot)

plt.suptitle('Pair Plot: Multivariate Relationship Exploration', y=1.01, fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

Section Summary

Chart Selection Decision Tree

Two-Variable Visualization
├─ Both continuous?
│  ├─ Sample size < 1000?
│  │  └─ Scatter plot + regression line (sns.regplot)
│  └─ Sample size ≥ 1000?
│     ├─ Hexbin plot (hexbin)
│     └─ 2D histogram (hist2d)

├─ One continuous + one categorical?
│  ├─ Focus on distribution?
│  │  ├─ Box plot (boxplot)
│  │  └─ Violin plot (violinplot)
│  └─ Focus on relationship?
│     └─ Grouped scatter plot + grouped regression lines

└─ Multiple continuous variables?
   ├─ Correlation matrix heatmap (heatmap)
   └─ Pair plot (pairplot)

Key Takeaways

  1. Correlation ≠ Causation:

    • Scatter plots show correlation
    • Causal inference requires controlling confounding variables
    • Beware of Simpson's Paradox
  2. Importance of Data Transformation:

    • Right-skewed distribution → log transformation
    • Non-linear relationship → polynomial or LOWESS
  3. Grouped Analysis:

    • Test for interaction effects
    • Identify heterogeneity
    • Control for confounding factors
  4. Large Data Visualization:

    • N > 1000: Use hexbin or contour plots
    • Avoid overplotting

Next Section Preview

In the next section, we will dive deep into regression analysis visualization, including residual diagnostics, influence analysis, and other advanced techniques.

From bivariate to multivariate, from correlation to causation!

Released under the MIT License. Content © Author.