Module 9 Summary and Review
Core Data Science Libraries — NumPy, Pandas, Matplotlib
Knowledge Summary
1. NumPy Basics
Core Concepts:
- ndarray: N-dimensional array, efficient numerical computing container
- Vectorized operations: avoid explicit Python loops; typically 10-100x faster
- Broadcasting: operations on arrays of different but compatible shapes (see the sketch below)
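Broadcasting stretches the smaller array across the larger one when trailing dimensions match or equal 1. A minimal sketch (the array values are only for illustration):

```python
import numpy as np

row = np.array([1, 2, 3])           # shape (3,)
col = np.array([[10], [20], [30]])  # shape (3, 1)

# A scalar broadcasts across every element:
print(row * 2)   # [2 4 6]

# (3,) against (3, 1) broadcasts to a (3, 3) grid:
print(row + col)
# [[11 12 13]
#  [21 22 23]
#  [31 32 33]]
```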
Basic Operations:
```python
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
# Common creation functions
np.zeros((3, 4)) # All-zeros array
np.ones((2, 3)) # All-ones array
np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
# Array attributes
arr.shape # Shape
arr.dtype # Data type
arr.ndim # Number of dimensions
arr.size # Total elements
# Vectorized operations
arr * 2 # Multiply each element by 2
arr + 10 # Add 10 to each element
arr ** 2 # Square each element
# Array indexing
arr[0] # First element
arr[1:4] # Slicing
arr2d[0, 1] # 2D indexing
```

Statistical Functions:
```python
arr.mean() # Mean
arr.std() # Standard deviation
arr.sum() # Sum
arr.min() # Minimum
arr.max() # Maximum
np.median(arr) # Median
np.percentile(arr, 25) # 25th percentile
```

2. Pandas Core
Two Main Data Structures:
- Series: one-dimensional labeled array
- DataFrame: two-dimensional tabular data (see the example below)
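A Series is essentially a single labeled column, and a DataFrame is a collection of Series sharing one index. A quick illustration (the values are arbitrary):

```python
import pandas as pd

s = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Carol'], name='age')
print(s['Bob'])   # 30 (label-based lookup)
print(s.iloc[0])  # 25 (position-based lookup)
```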
DataFrame Basic Operations:
```python
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})
# View data
df.head() # First 5 rows
df.tail() # Last 5 rows
df.info() # Data information
df.describe() # Statistical summary
# Select data
df['age'] # Select column
df[['name', 'age']] # Select multiple columns
df.loc[0] # Select row by label
df.iloc[0] # Select row by position
# Filter data
df[df['age'] > 25]
df.query('age > 25 and income < 80000')
# Add columns
df['age_squared'] = df['age'] ** 2
df['log_income'] = np.log(df['income'])
# Sort
df.sort_values('age')
df.sort_values(['age', 'income'], ascending=[True, False])
```

Data Cleaning:
```python
# Missing value handling
df.isnull() # Detect missing values
df.dropna() # Remove missing values
df.fillna(0) # Fill missing values
df['age'] = df['age'].fillna(df['age'].mean()) # Fill with mean (note the assignment)
# Duplicates
df.duplicated() # Detect duplicates
df.drop_duplicates() # Remove duplicates
# Data type conversion
df['age'] = df['age'].astype(int)
df['income'] = pd.to_numeric(df['income'], errors='coerce')
```

Group Aggregation:
```python
# GroupBy operations
df.groupby('gender')['income'].mean()
# Multiple aggregations
df.groupby('education').agg({
    'income': ['mean', 'median', 'std'],
    'age': ['mean', 'min', 'max']
})
# Pivot table
pd.pivot_table(df, values='income',
               index='education',
               columns='gender',
               aggfunc='mean')
```

3. Matplotlib and Seaborn
Matplotlib Basics:
```python
import matplotlib.pyplot as plt
# Basic line plot
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Title')
plt.show()
# Scatter plot
plt.scatter(x, y)
# Bar chart
plt.bar(categories, values)
# Histogram
plt.hist(data, bins=20)
# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
```

Seaborn Advanced Visualization:
```python
import seaborn as sns
# Set style
sns.set_style('whitegrid')
# Distribution plots
sns.histplot(df['income'], kde=True)
sns.boxplot(data=df, x='education', y='income')
sns.violinplot(data=df, x='gender', y='income')
# Relationship plots
sns.scatterplot(data=df, x='age', y='income', hue='gender')
sns.lineplot(data=df, x='year', y='value')
# Heatmap
corr = df.corr(numeric_only=True)  # restrict to numeric columns
sns.heatmap(corr, annot=True, cmap='coolwarm')
# Regression plot
sns.regplot(data=df, x='education_years', y='income')
# Pair plot
sns.pairplot(df, hue='gender')
```

4. Complete Data Analysis Workflow
Standard Workflow:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Read data
df = pd.read_csv('survey_data.csv')
# 2. Initial exploration
print(df.head())
print(df.info())
print(df.describe())
# 3. Data cleaning
df = df.dropna(subset=['age', 'income'])
df = df[(df['age'] >= 18) & (df['age'] <= 100)]
df = df[df['income'] > 0]
# 4. Feature engineering
df['log_income'] = np.log(df['income'])
# right=False keeps the boundary ages 18, 30, 40, 50 in the intended groups
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 40, 50, np.inf], right=False,
                         labels=['18-29', '30-39', '40-49', '50+'])
# 5. Descriptive statistics
summary = df.groupby('education').agg({
    'income': ['count', 'mean', 'median', 'std']
})
# 6. Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Income distribution
axes[0, 0].hist(df['income'], bins=30)
axes[0, 0].set_title('Income Distribution')
# Income by education
df.groupby('education')['income'].mean().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Income by Education')
# Age vs income
axes[1, 0].scatter(df['age'], df['income'], alpha=0.5)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Income')
# Correlation heatmap
sns.heatmap(df[['age', 'income', 'education_years']].corr(),
            annot=True, ax=axes[1, 1])
plt.tight_layout()
plt.savefig('analysis_report.png', dpi=300)
plt.show()
# 7. Save results
df.to_csv('clean_data.csv', index=False)
summary.to_excel('summary_statistics.xlsx')
```

Comparison: Pandas vs R vs Stata
| Operation | Pandas | R | Stata |
|---|---|---|---|
| Read CSV | pd.read_csv() | read.csv() | import delimited |
| View data | df.head() | head() | list in 1/5 |
| Filter | df[df['age']>25] | subset(df, age>25) | keep if age>25 |
| Group aggregate | df.groupby().mean() | aggregate() | collapse (mean) x, by(group) |
| New variable | df['x2'] = df['x']**2 | df$x2 <- df$x^2 | gen x2 = x^2 |
Common Mistakes
1. Forgetting inplace Parameter
```python
# Wrong: Not saving result
df.dropna() # Doesn't modify original df
# Method 1: Assignment
df = df.dropna()
# Method 2: inplace
df.dropna(inplace=True)
```

Note: assignment is generally the safer, more idiomatic choice; inplace=True is discouraged in modern pandas code.

2. Chained Indexing Warning
```python
# Wrong: SettingWithCopyWarning
df[df['age'] > 25]['income'] = 100000
# Correct: Use loc
df.loc[df['age'] > 25, 'income'] = 100000
```

3. Unexpected Broadcasting from Shape Mismatch
```python
# Pitfall: this does NOT raise an error
arr1 = np.array([1, 2, 3])        # shape (3,)
arr2 = np.array([[1], [2], [3]])  # shape (3, 1)
result = arr1 + arr2              # broadcasts to shape (3, 3): an outer sum, not element-wise
# Intended element-wise addition: align the shapes first
arr1 = arr1.reshape(-1, 1)        # shape (3, 1)
result = arr1 + arr2              # shape (3, 1)
```

Best Practices
1. Method Chaining
```python
# Use method chaining for clear code
result = (df
    .query('age >= 18')
    .dropna(subset=['income'])
    .assign(log_income=lambda x: np.log(x['income']))
    .groupby('education')['log_income']
    .mean()
    .sort_values(ascending=False)
)
```

2. Use Vectorization Instead of Loops
```python
# Slow: Using loops
for i in range(len(df)):
    df.loc[i, 'income_log'] = np.log(df.loc[i, 'income'])
# Fast: Using vectorization
df['income_log'] = np.log(df['income'])
```

3. Memory Optimization
```python
# Specify data types to save memory
df = pd.read_csv('data.csv', dtype={
    'id': 'int32',
    'age': 'int8',
    'income': 'float32',
    'gender': 'category'
})
```

Programming Exercises
Exercise 1: NumPy Array Operations (Basic)
Difficulty: ⭐⭐ Time: 15 minutes
```python
"""
Task: Use NumPy for data statistics
Given an income array, calculate:
1. Basic statistics (mean, median, standard deviation)
2. Quantiles (25%, 50%, 75%)
3. Standardization (Z-score)
"""
import numpy as np
incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])
# Your code here
```

Reference Answer
```python
import numpy as np
incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])
print("Income Data Analysis")
print("=" * 50)
# 1. Basic statistics
mean = incomes.mean()
median = np.median(incomes)
std = incomes.std()
min_val = incomes.min()
max_val = incomes.max()
print(f"Sample size: {len(incomes)}")
print(f"Mean: ${mean:,.2f}")
print(f"Median: ${median:,.2f}")
print(f"Standard deviation: ${std:,.2f}")
print(f"Minimum: ${min_val:,}")
print(f"Maximum: ${max_val:,}")
# 2. Quantiles
q25 = np.percentile(incomes, 25)
q50 = np.percentile(incomes, 50)
q75 = np.percentile(incomes, 75)
print(f"\nQuantiles:")
print(f"25%: ${q25:,.2f}")
print(f"50%: ${q50:,.2f}")
print(f"75%: ${q75:,.2f}")
# 3. Standardization (Z-score)
z_scores = (incomes - mean) / std
print(f"\nZ-scores:")
for i, (income, z) in enumerate(zip(incomes, z_scores), 1):
print(f" ${income:,}: {z:+.2f}")
# 4. Identify outliers (|Z| > 2)
outliers = incomes[np.abs(z_scores) > 2]
if len(outliers) > 0:
print(f"\nOutliers (|Z| > 2):")
for val in outliers:
print(f" ${val:,}")
else:
print(f"\nNo outliers")
# 5. Create income categories
bins = [0, 50000, 60000, np.inf]
labels = ['Low Income', 'Middle Income', 'High Income']
income_categories = np.digitize(incomes, bins) - 1
print(f"\nIncome Categories:")
for label_idx in range(len(labels)):
    count = np.sum(income_categories == label_idx)
    percentage = count / len(incomes) * 100
    print(f" {labels[label_idx]}: {count} people ({percentage:.1f}%)")
```

Exercise 2: Pandas Data Cleaning (Basic)
Difficulty: ⭐⭐ Time: 20 minutes
```python
"""
Task: Clean survey data
Data issues:
- Missing values
- Outliers (age>100, income<0)
- Duplicate records
Requirements:
1. Handle missing values
2. Remove outliers
3. Remove duplicates
4. Generate cleaning report
"""
import pandas as pd
import numpy as np
# Raw data (with various issues)
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})
def clean_survey_data(df):
"""Clean survey data"""
# Your code here
passReference Answer
```python
import pandas as pd
import numpy as np
def clean_survey_data(df):
    """Clean survey data

    Returns:
        (cleaned_df, report): Cleaned data and report
    """
    report = {}
    report['original_count'] = len(df)
    print("=" * 60)
    print("Data Cleaning Report")
    print("=" * 60)
    print(f"Original data: {len(df)} rows\n")
    # 1. Check for fully duplicated rows (all columns identical)
    duplicates = df.duplicated()
    duplicate_count = duplicates.sum()
    if duplicate_count > 0:
        print(f"1. Duplicate records: {duplicate_count}")
        print(f" Duplicate IDs: {df[duplicates]['id'].tolist()}")
        df = df.drop_duplicates()
        print(f" After removal: {len(df)} rows\n")
    else:
        print(f"1. Duplicate records: None\n")
    report['duplicate_removed'] = duplicate_count
    # 2. Missing value analysis
    print(f"2. Missing value analysis:")
    missing = df.isnull().sum()
    for col in missing.index:
        if missing[col] > 0:
            pct = missing[col] / len(df) * 100
            print(f" {col}: {missing[col]} ({pct:.1f}%)")
    # Handling strategy: remove rows with missing key columns
    before_missing = len(df)
    df = df.dropna(subset=['age', 'income'])
    after_missing = len(df)
    print(f" Removed rows with missing age/income: {before_missing - after_missing}")
    print(f" Retained: {len(df)} rows\n")
    report['missing_removed'] = before_missing - after_missing
    # 3. Outlier detection
    print(f"3. Outlier detection:")
    # Age outliers (plausible range: 18-100)
    age_outliers = (df['age'] < 18) | (df['age'] > 100)
    age_outlier_count = age_outliers.sum()
    if age_outlier_count > 0:
        print(f" Age outliers: {age_outlier_count}")
        print(f" Outlier values: {df[age_outliers]['age'].tolist()}")
        df = df[~age_outliers]
    # Income outliers
    income_outliers = df['income'] < 0
    income_outlier_count = income_outliers.sum()
    if income_outlier_count > 0:
        print(f" Income outliers (negative): {income_outlier_count}")
        print(f" Outlier values: {df[income_outliers]['income'].tolist()}")
        df = df[~income_outliers]
    print(f" After outlier removal: {len(df)} rows\n")
    report['outliers_removed'] = age_outlier_count + income_outlier_count
    # 4. Data type conversion
    print(f"4. Data type conversion:")
    df['age'] = df['age'].astype(int)
    df['income'] = df['income'].astype(float)
    print(f" age: {df['age'].dtype}")
    print(f" income: {df['income'].dtype}\n")
    # 5. Final statistics
    report['final_count'] = len(df)
    report['removed_total'] = report['original_count'] - report['final_count']
    report['retention_rate'] = (report['final_count'] / report['original_count']) * 100
    print(f"Cleaning summary:")
    print(f" Original: {report['original_count']} rows")
    print(f" Removed: {report['removed_total']} rows")
    print(f" - Duplicates: {report['duplicate_removed']}")
    print(f" - Missing: {report['missing_removed']}")
    print(f" - Outliers: {report['outliers_removed']}")
    print(f" Retained: {report['final_count']} rows ({report['retention_rate']:.1f}%)")
    print("=" * 60)
    return df, report
# Test data
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})
# Clean
cleaned_df, report = clean_survey_data(data)
# Display cleaned data
print("\nCleaned data:")
print(cleaned_df)
```

Exercise 3: Data Grouping and Aggregation (Intermediate)
Difficulty: ⭐⭐⭐ Time: 30 minutes
```python
"""
Task: Analyze income differences by education level
Requirements:
1. Group by education level
2. Calculate statistics for each group
3. Create income comparison visualization
4. Generate summary report
"""Reference Answer
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate sample data
np.random.seed(42)
n = 200
data = pd.DataFrame({
    'id': range(1, n+1),
    'age': np.random.randint(25, 60, n),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'],
                                  n, p=[0.3, 0.4, 0.2, 0.1]),
    'gender': np.random.choice(['M', 'F'], n),
    'income': np.random.lognormal(11, 0.5, n)  # Lognormal distribution
})
# Adjust income by education level
education_multiplier = {
    'High School': 0.7,
    'Bachelor': 1.0,
    'Master': 1.3,
    'PhD': 1.6
}
data['income'] = data.apply(
    lambda row: row['income'] * education_multiplier[row['education']], axis=1
)
print("Education Level and Income Analysis")
print("=" * 70)
# 1. Group statistics by education
print("\n1. Income statistics by education level:")
edu_stats = data.groupby('education')['income'].agg([
    ('Sample Size', 'count'),
    ('Average Income', 'mean'),
    ('Median', 'median'),
    ('Std Dev', 'std'),
    ('Minimum', 'min'),
    ('Maximum', 'max')
]).round(2)
# Sort by average income
edu_stats = edu_stats.sort_values('Average Income', ascending=False)
print(edu_stats)
# 2. Group by education and gender
print("\n2. Average income by education and gender:")
gender_edu_stats = data.groupby(['education', 'gender'])['income'].mean().unstack()
gender_edu_stats = gender_edu_stats.loc[edu_stats.index] # Maintain order
print(gender_edu_stats.round(2))
# 3. Income quantiles
print("\n3. Income quantiles by education level:")
percentiles = data.groupby('education')['income'].quantile([0.25, 0.5, 0.75]).unstack()
percentiles.columns = ['25%', '50%', '75%']
percentiles = percentiles.loc[edu_stats.index]
print(percentiles.round(2))
# 4. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 4.1 Box plot
education_order = edu_stats.index.tolist()
sns.boxplot(data=data, x='education', y='income', order=education_order, ax=axes[0, 0])
axes[0, 0].set_title('Income Distribution by Education Level', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Education Level')
axes[0, 0].set_ylabel('Income')
axes[0, 0].tick_params(axis='x', rotation=45)
# 4.2 Average income bar chart
edu_stats['Average Income'].plot(kind='bar', ax=axes[0, 1], color='skyblue')
axes[0, 1].set_title('Average Income by Education', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Education Level')
axes[0, 1].set_ylabel('Average Income')
axes[0, 1].tick_params(axis='x', rotation=45)
# Add value labels
for i, v in enumerate(edu_stats['Average Income']):
    axes[0, 1].text(i, v + 5000, f'${v:,.0f}', ha='center')
# 4.3 Group by gender
gender_edu_stats.plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Income by Education and Gender', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Education Level')
axes[1, 0].set_ylabel('Average Income')
axes[1, 0].legend(title='Gender')
axes[1, 0].tick_params(axis='x', rotation=45)
# 4.4 Violin plot
sns.violinplot(data=data, x='education', y='income', order=education_order, ax=axes[1, 1])
axes[1, 1].set_title('Income Distribution (Violin Plot)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Education Level')
axes[1, 1].set_ylabel('Income')
axes[1, 1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('education_income_analysis.png', dpi=300, bbox_inches='tight')
print("\nVisualization saved: education_income_analysis.png")
plt.show()
# 5. Statistical test (simplified)
print("\n5. Income gap analysis:")
high_school_income = data[data['education'] == 'High School']['income'].mean()
phd_income = data[data['education'] == 'PhD']['income'].mean()
income_gap = phd_income - high_school_income
gap_percentage = (income_gap / high_school_income) * 100
print(f"High School average income: ${high_school_income:,.2f}")
print(f"PhD average income: ${phd_income:,.2f}")
print(f"Income gap: ${income_gap:,.2f} ({gap_percentage:.1f}%)")
# 6. Generate report
report = {
    'Analysis date': pd.Timestamp.now().strftime('%Y-%m-%d'),
    'Sample size': len(data),
    'Education levels': edu_stats.index.tolist(),
    'Sample size by education': edu_stats['Sample Size'].tolist(),
    'Average income': edu_stats['Average Income'].round(2).tolist(),
    'Income gap (PhD vs High School)': f'${income_gap:,.2f}',
    'Gap percentage': f'{gap_percentage:.1f}%'
}
print("\n" + "=" * 70)
print("Analysis Report")
print("=" * 70)
for key, value in report.items():
print(f"{key}: {value}")
print("=" * 70)Exercise 4: Time Series Analysis (Advanced)
Difficulty: ⭐⭐⭐⭐ Time: 40 minutes
Create an annual income trend analysis system.
Hints
- Use pd.date_range() to create dates
- Use df.resample() for temporal aggregation
- Use rolling() to calculate moving averages
- Use Matplotlib to plot trends (a sketch follows the list)
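This exercise has no reference answer; the sketch below is one possible starting point combining the hinted functions on synthetic monthly data (all values, column names, and parameters are illustrative assumptions):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Synthetic monthly income series: an upward trend plus noise
rng = np.random.default_rng(42)
dates = pd.date_range('2015-01-01', periods=120, freq='MS')  # 10 years of months
income = 50000 + 300 * np.arange(120) + rng.normal(0, 2000, 120)
ts = pd.DataFrame({'income': income}, index=dates)

# Annual aggregation with resample()
annual = ts['income'].resample('YS').mean()

# 12-month moving average with rolling()
ts['ma_12'] = ts['income'].rolling(window=12).mean()

# Plot raw series, moving average, and annual means
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ts.index, ts['income'], alpha=0.4, label='Monthly income')
ax.plot(ts.index, ts['ma_12'], linewidth=2, label='12-month moving average')
ax.plot(annual.index, annual.values, 'o--', label='Annual mean')
ax.set_xlabel('Year')
ax.set_ylabel('Income')
ax.set_title('Annual Income Trend')
ax.legend()
plt.tight_layout()
plt.show()
```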
Next Steps
After completing this module, you have mastered:
- NumPy array operations and vectorization
- Pandas data manipulation (cleaning, transformation, aggregation)
- Matplotlib/Seaborn data visualization
- Complete data analysis workflows
Congratulations on completing Module 9! This is the core module for Python data analysis.
In the next Modules 10 and 11, you'll learn machine learning and best practices.
Your data science journey has just begun!