
Module 9 Summary and Review

Core Data Science Libraries — NumPy, Pandas, Matplotlib


Knowledge Summary

1. NumPy Basics

Core Concepts:

  • ndarray: N-dimensional array, an efficient container for numerical computing
  • Vectorized operations: Avoid Python loops; often 10-100x faster for elementwise math
  • Broadcasting: Operations on arrays of different shapes (see the sketch below)
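
A minimal broadcasting sketch: NumPy stretches size-1 dimensions so arrays with compatible shapes combine elementwise without copying data.

python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)
grid = col + row                    # broadcasts to shape (3, 4)
# [[ 1  2  3  4]
#  [11 12 13 14]
#  [21 22 23 24]]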

Basic Operations:

python
import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

# Common creation functions
np.zeros((3, 4))        # All-zeros array
np.ones((2, 3))         # All-ones array
np.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5)    # [0, 0.25, 0.5, 0.75, 1]

# Array attributes
arr.shape    # Shape
arr.dtype    # Data type
arr.ndim     # Number of dimensions
arr.size     # Total elements

# Vectorized operations
arr * 2      # Multiply each element by 2
arr + 10     # Add 10 to each element
arr ** 2     # Square each element

# Array indexing
arr[0]       # First element
arr[1:4]     # Slicing
arr2d[0, 1]  # 2D indexing

Statistical Functions:

python
arr.mean()      # Mean
arr.std()       # Standard deviation (population, ddof=0; use arr.std(ddof=1) for sample)
arr.sum()       # Sum
arr.min()       # Minimum
arr.max()       # Maximum
np.median(arr)  # Median
np.percentile(arr, 25)  # 25th percentile

2. Pandas Core

Two Main Data Structures:

  • Series: One-dimensional labeled array (see the short example below)
  • DataFrame: Two-dimensional tabular data
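
A quick Series example before the DataFrame operations (the values here are arbitrary):

python
import pandas as pd

s = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Carol'], name='age')
s['Bob']       # 30 (access by label)
s[s > 28]      # boolean filtering, as with NumPy arrays
s.mean()       # 30.0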

DataFrame Basic Operations:

python
import pandas as pd
import numpy as np  # used below for np.log

# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})

# View data
df.head()          # First 5 rows
df.tail()          # Last 5 rows
df.info()          # Data information
df.describe()      # Statistical summary

# Select data
df['age']          # Select column
df[['name', 'age']]  # Select multiple columns
df.loc[0]          # Select row by label
df.iloc[0]         # Select row by position

# Filter data
df[df['age'] > 25]
df.query('age > 25 and income < 80000')

# Add columns
df['age_squared'] = df['age'] ** 2
df['log_income'] = np.log(df['income'])

# Sort
df.sort_values('age')
df.sort_values(['age', 'income'], ascending=[True, False])

Data Cleaning:

python
# Missing value handling
df.isnull()          # Detect missing values
df.dropna()          # Remove missing values
df.fillna(0)         # Fill missing values
df['age'] = df['age'].fillna(df['age'].mean())  # Fill a column with its mean

# Duplicates
df.duplicated()      # Detect duplicates
df.drop_duplicates() # Remove duplicates

# Data type conversion
df['age'] = df['age'].astype(int)
df['income'] = pd.to_numeric(df['income'], errors='coerce')

Group Aggregation:

python
# GroupBy operations
df.groupby('gender')['income'].mean()

# Multiple aggregations
df.groupby('education').agg({
    'income': ['mean', 'median', 'std'],
    'age': ['mean', 'min', 'max']
})

# Pivot table
pd.pivot_table(df, values='income',
               index='education',
               columns='gender',
               aggfunc='mean')

3. Matplotlib and Seaborn

Matplotlib Basics:

python
import matplotlib.pyplot as plt

# Basic line plot
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Title')
plt.show()

# Scatter plot
plt.scatter(x, y)

# Bar chart
plt.bar(categories, values)

# Histogram
plt.hist(data, bins=20)

# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)

Seaborn Advanced Visualization:

python
import seaborn as sns

# Set style
sns.set_style('whitegrid')

# Distribution plots
sns.histplot(df['income'], kde=True)
sns.boxplot(data=df, x='education', y='income')
sns.violinplot(data=df, x='gender', y='income')

# Relationship plots
sns.scatterplot(data=df, x='age', y='income', hue='gender')
sns.lineplot(data=df, x='year', y='value')

# Heatmap
corr = df.corr(numeric_only=True)  # restrict to numeric columns (required for mixed dtypes in pandas >= 2.0)
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Regression plot
sns.regplot(data=df, x='education_years', y='income')

# Pair plot
sns.pairplot(df, hue='gender')

4. Complete Data Analysis Workflow

Standard Workflow:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Read data
df = pd.read_csv('survey_data.csv')

# 2. Initial exploration
print(df.head())
print(df.info())
print(df.describe())

# 3. Data cleaning
df = df.dropna(subset=['age', 'income'])
df = df[(df['age'] >= 18) & (df['age'] <= 100)]
df = df[df['income'] > 0]

# 4. Feature engineering
df['log_income'] = np.log(df['income'])
df['age_group'] = pd.cut(df['age'], bins=[17, 29, 39, 49, 100],
                         labels=['18-29', '30-39', '40-49', '50+'])  # (17, 29] covers ages 18-29, etc.

# 5. Descriptive statistics
summary = df.groupby('education').agg({
    'income': ['count', 'mean', 'median', 'std']
})

# 6. Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Income distribution
axes[0, 0].hist(df['income'], bins=30)
axes[0, 0].set_title('Income Distribution')

# Income by education
df.groupby('education')['income'].mean().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Income by Education')

# Age vs income
axes[1, 0].scatter(df['age'], df['income'], alpha=0.5)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Income')

# Correlation heatmap
sns.heatmap(df[['age', 'income', 'education_years']].corr(),
            annot=True, ax=axes[1, 1])

plt.tight_layout()
plt.savefig('analysis_report.png', dpi=300)
plt.show()

# 7. Save results
df.to_csv('clean_data.csv', index=False)
summary.to_excel('summary_statistics.xlsx')

Comparison: Pandas vs R vs Stata

Operation       | Pandas                 | R                    | Stata
----------------|------------------------|----------------------|------------------------------
Read CSV        | pd.read_csv()          | read.csv()           | import delimited
View data       | df.head()              | head()               | list in 1/5
Filter          | df[df['age'] > 25]     | subset(df, age > 25) | keep if age > 25
Group aggregate | df.groupby().mean()    | aggregate()          | collapse (mean) x, by(group)
New variable    | df['x2'] = df['x']**2  | df$x2 <- df$x^2      | gen x2 = x^2

Common Mistakes

1. Forgetting to Assign the Result

python
# Wrong: Not saving result
df.dropna()  # Doesn't modify original df

# Method 1: Assignment
df = df.dropna()

# Method 2: inplace (works, but discouraged in modern pandas)
df.dropna(inplace=True)

2. Chained Indexing Warning

python
# Wrong: SettingWithCopyWarning
df[df['age'] > 25]['income'] = 100000

# Correct: Use loc
df.loc[df['age'] > 25, 'income'] = 100000

3. Unexpected Broadcasting

python
# Surprise: this does NOT raise a shape error
arr1 = np.array([1, 2, 3])          # shape (3,)
arr2 = np.array([[1], [2], [3]])    # shape (3, 1)
result = arr1 + arr2                # broadcasts to shape (3, 3), rarely what you want

# Correct: reshape for elementwise addition
arr1 = arr1.reshape(-1, 1)          # shape (3, 1)
result = arr1 + arr2                # shape (3, 1)

Best Practices

1. Method Chaining

python
# Use method chaining for clear code
result = (df
    .query('age >= 18')
    .dropna(subset=['income'])
    .assign(log_income=lambda x: np.log(x['income']))
    .groupby('education')['log_income']
    .mean()
    .sort_values(ascending=False)
)

2. Use Vectorization Instead of Loops

python
# Slow: looping row by row (assumes a default RangeIndex)
for i in range(len(df)):
    df.loc[i, 'income_log'] = np.log(df.loc[i, 'income'])

# Fast: Using vectorization
df['income_log'] = np.log(df['income'])

3. Memory Optimization

python
# Specify data types to save memory
df = pd.read_csv('data.csv', dtype={
    'id': 'int32',
    'age': 'int8',
    'income': 'float32',
    'gender': 'category'
})
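
To verify the savings, inspect the footprint; memory_usage(deep=True) also measures the contents of object-dtype string columns (the column names reuse the hypothetical ones above):

python
print(df.memory_usage(deep=True))                                    # bytes per column
print(f"Total: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Downcast an existing numeric column after loading
df['income'] = pd.to_numeric(df['income'], downcast='float')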

Programming Exercises

Exercise 1: NumPy Array Operations (Basic)

Difficulty: ⭐⭐ Time: 15 minutes

python
"""
Task: Use NumPy for data statistics

Given an income array, calculate:
1. Basic statistics (mean, median, standard deviation)
2. Quantiles (25%, 50%, 75%)
3. Standardization (Z-score)
"""

import numpy as np

incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])

# Your code here
Reference Answer
python
import numpy as np

incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])

print("Income Data Analysis")
print("=" * 50)

# 1. Basic statistics
mean = incomes.mean()
median = np.median(incomes)
std = incomes.std()  # population std (ddof=0), matching the z-scores below
min_val = incomes.min()
max_val = incomes.max()

print(f"Sample size: {len(incomes)}")
print(f"Mean: ${mean:,.2f}")
print(f"Median: ${median:,.2f}")
print(f"Standard deviation: ${std:,.2f}")
print(f"Minimum: ${min_val:,}")
print(f"Maximum: ${max_val:,}")

# 2. Quantiles
q25 = np.percentile(incomes, 25)
q50 = np.percentile(incomes, 50)
q75 = np.percentile(incomes, 75)

print(f"\nQuantiles:")
print(f"25%: ${q25:,.2f}")
print(f"50%: ${q50:,.2f}")
print(f"75%: ${q75:,.2f}")

# 3. Standardization (Z-score)
z_scores = (incomes - mean) / std
print(f"\nZ-scores:")
for income, z in zip(incomes, z_scores):
    print(f"  ${income:,}: {z:+.2f}")

# 4. Identify outliers (|Z| > 2)
outliers = incomes[np.abs(z_scores) > 2]
if len(outliers) > 0:
    print(f"\nOutliers (|Z| > 2):")
    for val in outliers:
        print(f"  ${val:,}")
else:
    print(f"\nNo outliers")

# 5. Create income categories
bins = [0, 50000, 60000, np.inf]
labels = ['Low Income', 'Middle Income', 'High Income']
income_categories = np.digitize(incomes, bins) - 1  # digitize is 1-based; shift to index into labels

print(f"\nIncome Categories:")
for label_idx in range(len(labels)):
    count = np.sum(income_categories == label_idx)
    percentage = count / len(incomes) * 100
    print(f"  {labels[label_idx]}: {count} people ({percentage:.1f}%)")

Exercise 2: Pandas Data Cleaning (Basic)

Difficulty: ⭐⭐ Time: 20 minutes

python
"""
Task: Clean survey data

Data issues:
- Missing values
- Outliers (age>100, income<0)
- Duplicate records

Requirements:
1. Handle missing values
2. Remove outliers
3. Remove duplicates
4. Generate cleaning report
"""

import pandas as pd
import numpy as np

# Raw data (with various issues)
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})

def clean_survey_data(df):
    """Clean survey data"""
    # Your code here
    pass
Reference Answer
python
import pandas as pd
import numpy as np

def clean_survey_data(df):
    """Clean survey data

    Returns:
        (cleaned_df, report): Cleaned data and report
    """
    report = {}
    report['original_count'] = len(df)

    print("=" * 60)
    print("Data Cleaning Report")
    print("=" * 60)
    print(f"Original data: {len(df)} rows\n")

    # 1. Check for duplicate records (same id counts as a duplicate;
    #    a plain df.duplicated() would miss the two id-3 rows, which differ in income)
    duplicates = df.duplicated(subset=['id'], keep='first')
    duplicate_count = duplicates.sum()
    if duplicate_count > 0:
        print(f"1. Duplicate records: {duplicate_count}")
        print(f"   Duplicate IDs: {df[duplicates]['id'].tolist()}")
        df = df.drop_duplicates()
        print(f"   After removal: {len(df)} rows\n")
    else:
        print(f"1. Duplicate records: None\n")

    report['duplicate_removed'] = duplicate_count

    # 2. Missing value analysis
    print(f"2. Missing value analysis:")
    missing = df.isnull().sum()
    for col in missing.index:
        if missing[col] > 0:
            pct = missing[col] / len(df) * 100
            print(f"   {col}: {missing[col]} ({pct:.1f}%)")

    # Handling strategy: Remove rows with missing key columns
    before_missing = len(df)
    df = df.dropna(subset=['age', 'income'])
    after_missing = len(df)
    print(f"   Removed rows with missing age/income: {before_missing - after_missing}")
    print(f"   Retained: {len(df)} rows\n")

    report['missing_removed'] = before_missing - after_missing

    # 3. Outlier detection
    print(f"3. Outlier detection:")

    # Age outliers
    age_outliers = (df['age'] < 18) | (df['age'] > 100)
    age_outlier_count = age_outliers.sum()
    if age_outlier_count > 0:
        print(f"   Age outliers: {age_outlier_count}")
        print(f"   Outlier values: {df[age_outliers]['age'].tolist()}")
        df = df[~age_outliers].copy()  # .copy() avoids SettingWithCopyWarning on later assignment

    # Income outliers
    income_outliers = df['income'] < 0
    income_outlier_count = income_outliers.sum()
    if income_outlier_count > 0:
        print(f"   Income outliers (negative): {income_outlier_count}")
        print(f"   Outlier values: {df[income_outliers]['income'].tolist()}")
        df = df[~income_outliers].copy()

    print(f"   After outlier removal: {len(df)} rows\n")
    report['outliers_removed'] = age_outlier_count + income_outlier_count

    # 4. Data type conversion
    print(f"4. Data type conversion:")
    df['age'] = df['age'].astype(int)
    df['income'] = df['income'].astype(float)
    print(f"   age: {df['age'].dtype}")
    print(f"   income: {df['income'].dtype}\n")

    # 5. Final statistics
    report['final_count'] = len(df)
    report['removed_total'] = report['original_count'] - report['final_count']
    report['retention_rate'] = (report['final_count'] / report['original_count']) * 100

    print(f"Cleaning summary:")
    print(f"  Original: {report['original_count']} rows")
    print(f"  Removed: {report['removed_total']} rows")
    print(f"    - Duplicates: {report['duplicate_removed']}")
    print(f"    - Missing: {report['missing_removed']}")
    print(f"    - Outliers: {report['outliers_removed']}")
    print(f"  Retained: {report['final_count']} rows ({report['retention_rate']:.1f}%)")
    print("=" * 60)

    return df, report


# Test data
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})

# Clean
cleaned_df, report = clean_survey_data(data)

# Display cleaned data
print("\nCleaned data:")
print(cleaned_df)

Exercise 3: Data Grouping and Aggregation (Intermediate)

Difficulty: ⭐⭐⭐ Time: 30 minutes

python
"""
Task: Analyze income differences by education level

Requirements:
1. Group by education level
2. Calculate statistics for each group
3. Create income comparison visualization
4. Generate summary report
"""
Reference Answer
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate sample data
np.random.seed(42)
n = 200

data = pd.DataFrame({
    'id': range(1, n+1),
    'age': np.random.randint(25, 60, n),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n, p=[0.3, 0.4, 0.2, 0.1]),
    'gender': np.random.choice(['M', 'F'], n),
    'income': np.random.lognormal(11, 0.5, n)  # Lognormal distribution
})

# Adjust income by education level
education_multiplier = {
    'High School': 0.7,
    'Bachelor': 1.0,
    'Master': 1.3,
    'PhD': 1.6
}
# Vectorized: map each education level to its multiplier (faster than a row-wise apply)
data['income'] = data['income'] * data['education'].map(education_multiplier)

print("Education Level and Income Analysis")
print("=" * 70)

# 1. Group statistics by education
print("\n1. Income statistics by education level:")
edu_stats = data.groupby('education')['income'].agg([
    ('Sample Size', 'count'),
    ('Average Income', 'mean'),
    ('Median', 'median'),
    ('Std Dev', 'std'),
    ('Minimum', 'min'),
    ('Maximum', 'max')
]).round(2)

# Sort by average income
edu_stats = edu_stats.sort_values('Average Income', ascending=False)
print(edu_stats)

# 2. Group by education and gender
print("\n2. Average income by education and gender:")
gender_edu_stats = data.groupby(['education', 'gender'])['income'].mean().unstack()
gender_edu_stats = gender_edu_stats.loc[edu_stats.index]  # Maintain order
print(gender_edu_stats.round(2))

# 3. Income quantiles
print("\n3. Income quantiles by education level:")
percentiles = data.groupby('education')['income'].quantile([0.25, 0.5, 0.75]).unstack()
percentiles.columns = ['25%', '50%', '75%']
percentiles = percentiles.loc[edu_stats.index]
print(percentiles.round(2))

# 4. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 4.1 Box plot
education_order = edu_stats.index.tolist()
sns.boxplot(data=data, x='education', y='income', order=education_order, ax=axes[0, 0])
axes[0, 0].set_title('Income Distribution by Education Level', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Education Level')
axes[0, 0].set_ylabel('Income')
axes[0, 0].tick_params(axis='x', rotation=45)

# 4.2 Average income bar chart
edu_stats['Average Income'].plot(kind='bar', ax=axes[0, 1], color='skyblue')
axes[0, 1].set_title('Average Income by Education', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Education Level')
axes[0, 1].set_ylabel('Average Income')
axes[0, 1].tick_params(axis='x', rotation=45)

# Add value labels
for i, v in enumerate(edu_stats['Average Income']):
    axes[0, 1].text(i, v + 5000, f'${v:,.0f}', ha='center')

# 4.3 Group by gender
gender_edu_stats.plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Income by Education and Gender', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Education Level')
axes[1, 0].set_ylabel('Average Income')
axes[1, 0].legend(title='Gender')
axes[1, 0].tick_params(axis='x', rotation=45)

# 4.4 Violin plot
sns.violinplot(data=data, x='education', y='income', order=education_order, ax=axes[1, 1])
axes[1, 1].set_title('Income Distribution (Violin Plot)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Education Level')
axes[1, 1].set_ylabel('Income')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('education_income_analysis.png', dpi=300, bbox_inches='tight')
print("\nVisualization saved: education_income_analysis.png")
plt.show()

# 5. Statistical test (simplified)
print("\n5. Income gap analysis:")
high_school_income = data[data['education'] == 'High School']['income'].mean()
phd_income = data[data['education'] == 'PhD']['income'].mean()
income_gap = phd_income - high_school_income
gap_percentage = (income_gap / high_school_income) * 100

print(f"High School average income: ${high_school_income:,.2f}")
print(f"PhD average income: ${phd_income:,.2f}")
print(f"Income gap: ${income_gap:,.2f} ({gap_percentage:.1f}%)")

# 6. Generate report
report = {
    'Analysis date': pd.Timestamp.now().strftime('%Y-%m-%d'),
    'Sample size': len(data),
    'Education levels': edu_stats.index.tolist(),
    'Sample size by education': edu_stats['Sample Size'].tolist(),
    'Average income': edu_stats['Average Income'].round(2).tolist(),
    'Income gap (PhD vs High School)': f'${income_gap:,.2f}',
    'Gap percentage': f'{gap_percentage:.1f}%'
}

print("\n" + "=" * 70)
print("Analysis Report")
print("=" * 70)
for key, value in report.items():
    print(f"{key}: {value}")
print("=" * 70)

Exercise 4: Time Series Analysis (Advanced)

Difficulty: ⭐⭐⭐⭐ Time: 40 minutes

Create an annual income trend analysis system. A possible starting sketch follows the hints below.

Hint
  • Use pd.date_range() to create dates
  • Use df.resample() for temporal aggregation
  • Use rolling() to calculate moving averages
  • Use Matplotlib to plot trends
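
Since this exercise ships without a reference answer, here is one possible starting sketch rather than an official solution; the synthetic monthly data and all variable names are illustrative.

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Synthetic monthly income series: upward trend plus noise (illustrative data)
rng = np.random.default_rng(0)
dates = pd.date_range('2015-01-01', periods=120, freq='ME')  # use 'M' in pandas < 2.2
income = pd.Series(3000 + np.arange(120) * 15 + rng.normal(0, 200, 120),
                   index=dates, name='income')

# Annual means via resample, smoothing via a 12-month rolling window
annual_mean = income.resample('YE').mean()   # use 'A' in pandas < 2.2
rolling_12m = income.rolling(window=12).mean()

# Plot raw series, smoothed trend, and annual means together
plt.plot(income.index, income, alpha=0.4, label='Monthly')
plt.plot(rolling_12m.index, rolling_12m, label='12-month rolling mean')
plt.plot(annual_mean.index, annual_mean, 'o-', label='Annual mean')
plt.xlabel('Year')
plt.ylabel('Income')
plt.title('Annual Income Trend')
plt.legend()
plt.tight_layout()
plt.show()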

Next Steps

After completing this module, you have mastered:

  • NumPy array operations and vectorization
  • Pandas data manipulation (cleaning, transformation, aggregation)
  • Matplotlib/Seaborn data visualization
  • Complete data analysis workflows

Congratulations on completing Module 9! This is the core module for Python data analysis.

In the next Modules 10 and 11, you'll learn machine learning and best practices.


Your data science journey has just begun!
