Module 9 Summary and Review
Core Data Science Libraries — NumPy, Pandas, Matplotlib
Knowledge Summary
1. NumPy Basics
Core Concepts:
- ndarray: N-dimensional array, efficient numerical computing container
- Vectorized operations: avoid explicit Python loops; typically 10-100x faster
- Broadcasting: operations on arrays of different but compatible shapes (see the sketch below)
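Broadcasting stretches the smaller array across the larger one when trailing dimensions match or equal 1. A minimal sketch (the array values are only for illustration):

```python
import numpy as np

row = np.array([1, 2, 3])           # shape (3,)
col = np.array([[10], [20], [30]])  # shape (3, 1)

# A scalar broadcasts across every element:
print(row * 2)   # [2 4 6]

# (3,) against (3, 1) broadcasts to a (3, 3) grid:
print(row + col)
# [[11 12 13]
#  [21 22 23]
#  [31 32 33]]
```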
Basic Operations:
```python
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
# Common creation functions
np.zeros((3, 4)) # All-zeros array
np.ones((2, 3)) # All-ones array
np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
# Array attributes
arr.shape # Shape
arr.dtype # Data type
arr.ndim # Number of dimensions
arr.size # Total elements
# Vectorized operations
arr * 2 # Multiply each element by 2
arr + 10 # Add 10 to each element
arr ** 2 # Square each element
# Array indexing
arr[0] # First element
arr[1:4] # Slicing
arr2d[0, 1] # 2D indexing
```

Statistical Functions:
```python
arr.mean() # Mean
arr.std() # Standard deviation
arr.sum() # Sum
arr.min() # Minimum
arr.max() # Maximum
np.median(arr) # Median
np.percentile(arr, 25) # 25th percentile
```

2. Pandas Core
Two Main Data Structures:
- Series: one-dimensional labeled array
- DataFrame: two-dimensional tabular data (see the example below)
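A Series is essentially a single labeled column, and a DataFrame is a collection of Series sharing one index. A quick illustration (the values are arbitrary):

```python
import pandas as pd

s = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Carol'], name='age')
print(s['Bob'])   # 30 (label-based lookup)
print(s.iloc[0])  # 25 (position-based lookup)
```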
DataFrame Basic Operations:
```python
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})
# View data
df.head() # First 5 rows
df.tail() # Last 5 rows
df.info() # Data information
df.describe() # Statistical summary
# Select data
df['age'] # Select column
df[['name', 'age']] # Select multiple columns
df.loc[0] # Select row by label
df.iloc[0] # Select row by position
# Filter data
df[df['age'] > 25]
df.query('age > 25 and income < 80000')
# Add columns
df['age_squared'] = df['age'] ** 2
df['log_income'] = np.log(df['income'])
# Sort
df.sort_values('age')
df.sort_values(['age', 'income'], ascending=[True, False])
```

Data Cleaning:
```python
# Missing value handling
df.isnull() # Detect missing values
df.dropna() # Remove missing values
df.fillna(0) # Fill missing values
df['age'] = df['age'].fillna(df['age'].mean()) # Fill with mean (note the assignment)
# Duplicates
df.duplicated() # Detect duplicates
df.drop_duplicates() # Remove duplicates
# Data type conversion
df['age'] = df['age'].astype(int)
df['income'] = pd.to_numeric(df['income'], errors='coerce')
```

Group Aggregation:
```python
# GroupBy operations
df.groupby('gender')['income'].mean()
# Multiple aggregations
df.groupby('education').agg({
    'income': ['mean', 'median', 'std'],
    'age': ['mean', 'min', 'max']
})
# Pivot table
pd.pivot_table(df, values='income',
               index='education',
               columns='gender',
               aggfunc='mean')
```

3. Matplotlib and Seaborn
Matplotlib Basics:
```python
import matplotlib.pyplot as plt
# Basic line plot
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Title')
plt.show()
# Scatter plot
plt.scatter(x, y)
# Bar chart
plt.bar(categories, values)
# Histogram
plt.hist(data, bins=20)
# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
```

Seaborn Advanced Visualization:
```python
import seaborn as sns
# Set style
sns.set_style('whitegrid')
# Distribution plots
sns.histplot(df['income'], kde=True)
sns.boxplot(data=df, x='education', y='income')
sns.violinplot(data=df, x='gender', y='income')
# Relationship plots
sns.scatterplot(data=df, x='age', y='income', hue='gender')
sns.lineplot(data=df, x='year', y='value')
# Heatmap
corr = df.corr(numeric_only=True)  # restrict to numeric columns
sns.heatmap(corr, annot=True, cmap='coolwarm')
# Regression plot
sns.regplot(data=df, x='education_years', y='income')
# Pair plot
sns.pairplot(df, hue='gender')
```

4. Complete Data Analysis Workflow
Standard Workflow:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Read data
df = pd.read_csv('survey_data.csv')
# 2. Initial exploration
print(df.head())
print(df.info())
print(df.describe())
# 3. Data cleaning
df = df.dropna(subset=['age', 'income'])
df = df[(df['age'] >= 18) & (df['age'] <= 100)]
df = df[df['income'] > 0]
# 4. Feature engineering
df['log_income'] = np.log(df['income'])
# right=False keeps the boundary ages 18, 30, 40, 50 in the intended groups
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 40, 50, np.inf], right=False,
                         labels=['18-29', '30-39', '40-49', '50+'])
# 5. Descriptive statistics
summary = df.groupby('education').agg({
    'income': ['count', 'mean', 'median', 'std']
})
# 6. Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Income distribution
axes[0, 0].hist(df['income'], bins=30)
axes[0, 0].set_title('Income Distribution')
# Income by education
df.groupby('education')['income'].mean().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Income by Education')
# Age vs income
axes[1, 0].scatter(df['age'], df['income'], alpha=0.5)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Income')
# Correlation heatmap
sns.heatmap(df[['age', 'income', 'education_years']].corr(),
            annot=True, ax=axes[1, 1])
plt.tight_layout()
plt.savefig('analysis_report.png', dpi=300)
plt.show()
# 7. Save results
df.to_csv('clean_data.csv', index=False)
summary.to_excel('summary_statistics.xlsx')
```

Comparison: Pandas vs R vs Stata
| Operation | Pandas | R | Stata |
|---|---|---|---|
| Read CSV | pd.read_csv() | read.csv() | import delimited |
| View data | df.head() | head() | list in 1/5 |
| Filter | df[df['age']>25] | subset(df, age>25) | keep if age>25 |
| Group aggregate | df.groupby().mean() | aggregate() | collapse (mean) x, by(group) |
| New variable | df['x2'] = df['x']**2 | df$x2 <- df$x^2 | gen x2 = x^2 |
Common Mistakes
1. Forgetting inplace Parameter
```python
# Wrong: Not saving result
df.dropna() # Doesn't modify original df
# Method 1: Assignment
df = df.dropna()
# Method 2: inplace
df.dropna(inplace=True)
```

Note: assignment is generally the safer, more idiomatic choice; inplace=True is discouraged in modern pandas code.

2. Chained Indexing Warning
```python
# Wrong: SettingWithCopyWarning
df[df['age'] > 25]['income'] = 100000
# Correct: Use loc
df.loc[df['age'] > 25, 'income'] = 100000
```

3. Unexpected Broadcasting from Shape Mismatch
```python
# Pitfall: this does NOT raise an error
arr1 = np.array([1, 2, 3])        # shape (3,)
arr2 = np.array([[1], [2], [3]])  # shape (3, 1)
result = arr1 + arr2              # broadcasts to shape (3, 3): an outer sum, not element-wise
# Intended element-wise addition: align the shapes first
arr1 = arr1.reshape(-1, 1)        # shape (3, 1)
result = arr1 + arr2              # shape (3, 1)
```

Best Practices
1. Method Chaining
```python
# Use method chaining for clear code
result = (df
    .query('age >= 18')
    .dropna(subset=['income'])
    .assign(log_income=lambda x: np.log(x['income']))
    .groupby('education')['log_income']
    .mean()
    .sort_values(ascending=False)
)
```

2. Use Vectorization Instead of Loops
```python
# Slow: Using loops
for i in range(len(df)):
    df.loc[i, 'income_log'] = np.log(df.loc[i, 'income'])
# Fast: Using vectorization
df['income_log'] = np.log(df['income'])
```

3. Memory Optimization
```python
# Specify data types to save memory
df = pd.read_csv('data.csv', dtype={
    'id': 'int32',
    'age': 'int8',
    'income': 'float32',
    'gender': 'category'
})
```

Programming Exercises
Exercise 1: NumPy Array Operations (Basic)
Difficulty: ⭐⭐ Time: 15 minutes
```python
"""
Task: Use NumPy for data statistics
Given an income array, calculate:
1. Basic statistics (mean, median, standard deviation)
2. Quantiles (25%, 50%, 75%)
3. Standardization (Z-score)
"""
import numpy as np
incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])
# Your code here
```

Reference Answer
```python
import numpy as np
incomes = np.array([45000, 52000, 38000, 67000, 58000,
                    71000, 43000, 55000, 62000, 49000])
print("Income Data Analysis")
print("=" * 50)
# 1. Basic statistics
mean = incomes.mean()
median = np.median(incomes)
std = incomes.std()
min_val = incomes.min()
max_val = incomes.max()
print(f"Sample size: {len(incomes)}")
print(f"Mean: ${mean:,.2f}")
print(f"Median: ${median:,.2f}")
print(f"Standard deviation: ${std:,.2f}")
print(f"Minimum: ${min_val:,}")
print(f"Maximum: ${max_val:,}")
# 2. Quantiles
q25 = np.percentile(incomes, 25)
q50 = np.percentile(incomes, 50)
q75 = np.percentile(incomes, 75)
print(f"\nQuantiles:")
print(f"25%: ${q25:,.2f}")
print(f"50%: ${q50:,.2f}")
print(f"75%: ${q75:,.2f}")
# 3. Standardization (Z-score)
z_scores = (incomes - mean) / std
print(f"\nZ-scores:")
for i, (income, z) in enumerate(zip(incomes, z_scores), 1):
print(f" ${income:,}: {z:+.2f}")
# 4. Identify outliers (|Z| > 2)
outliers = incomes[np.abs(z_scores) > 2]
if len(outliers) > 0:
print(f"\nOutliers (|Z| > 2):")
for val in outliers:
print(f" ${val:,}")
else:
print(f"\nNo outliers")
# 5. Create income categories
bins = [0, 50000, 60000, np.inf]
labels = ['Low Income', 'Middle Income', 'High Income']
income_categories = np.digitize(incomes, bins) - 1
print(f"\nIncome Categories:")
for label_idx in range(len(labels)):
    count = np.sum(income_categories == label_idx)
    percentage = count / len(incomes) * 100
    print(f" {labels[label_idx]}: {count} people ({percentage:.1f}%)")
```

Exercise 2: Pandas Data Cleaning (Basic)
Difficulty: ⭐⭐ Time: 20 minutes
```python
"""
Task: Clean survey data
Data issues:
- Missing values
- Outliers (age>100, income<0)
- Duplicate records
Requirements:
1. Handle missing values
2. Remove outliers
3. Remove duplicates
4. Generate cleaning report
"""
import pandas as pd
import numpy as np
# Raw data (with various issues)
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})
def clean_survey_data(df):
"""Clean survey data"""
# Your code here
passReference Answer
```python
import pandas as pd
import numpy as np
def clean_survey_data(df):
    """Clean survey data

    Returns:
        (cleaned_df, report): Cleaned data and report
    """
    report = {}
    report['original_count'] = len(df)
    print("=" * 60)
    print("Data Cleaning Report")
    print("=" * 60)
    print(f"Original data: {len(df)} rows\n")
    # 1. Check for fully duplicated rows (all columns identical)
    duplicates = df.duplicated()
    duplicate_count = duplicates.sum()
    if duplicate_count > 0:
        print(f"1. Duplicate records: {duplicate_count}")
        print(f" Duplicate IDs: {df[duplicates]['id'].tolist()}")
        df = df.drop_duplicates()
        print(f" After removal: {len(df)} rows\n")
    else:
        print(f"1. Duplicate records: None\n")
    report['duplicate_removed'] = duplicate_count
    # 2. Missing value analysis
    print(f"2. Missing value analysis:")
    missing = df.isnull().sum()
    for col in missing.index:
        if missing[col] > 0:
            pct = missing[col] / len(df) * 100
            print(f" {col}: {missing[col]} ({pct:.1f}%)")
    # Handling strategy: remove rows with missing key columns
    before_missing = len(df)
    df = df.dropna(subset=['age', 'income'])
    after_missing = len(df)
    print(f" Removed rows with missing age/income: {before_missing - after_missing}")
    print(f" Retained: {len(df)} rows\n")
    report['missing_removed'] = before_missing - after_missing
    # 3. Outlier detection
    print(f"3. Outlier detection:")
    # Age outliers (plausible range: 18-100)
    age_outliers = (df['age'] < 18) | (df['age'] > 100)
    age_outlier_count = age_outliers.sum()
    if age_outlier_count > 0:
        print(f" Age outliers: {age_outlier_count}")
        print(f" Outlier values: {df[age_outliers]['age'].tolist()}")
        df = df[~age_outliers]
    # Income outliers
    income_outliers = df['income'] < 0
    income_outlier_count = income_outliers.sum()
    if income_outlier_count > 0:
        print(f" Income outliers (negative): {income_outlier_count}")
        print(f" Outlier values: {df[income_outliers]['income'].tolist()}")
        df = df[~income_outliers]
    print(f" After outlier removal: {len(df)} rows\n")
    report['outliers_removed'] = age_outlier_count + income_outlier_count
    # 4. Data type conversion
    print(f"4. Data type conversion:")
    df['age'] = df['age'].astype(int)
    df['income'] = df['income'].astype(float)
    print(f" age: {df['age'].dtype}")
    print(f" income: {df['income'].dtype}\n")
    # 5. Final statistics
    report['final_count'] = len(df)
    report['removed_total'] = report['original_count'] - report['final_count']
    report['retention_rate'] = (report['final_count'] / report['original_count']) * 100
    print(f"Cleaning summary:")
    print(f" Original: {report['original_count']} rows")
    print(f" Removed: {report['removed_total']} rows")
    print(f" - Duplicates: {report['duplicate_removed']}")
    print(f" - Missing: {report['missing_removed']}")
    print(f" - Outliers: {report['outliers_removed']}")
    print(f" Retained: {report['final_count']} rows ({report['retention_rate']:.1f}%)")
    print("=" * 60)
    return df, report
# Test data
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    'age': [25, None, 35, 35, 28, 150, 32, 40, 27, 22],
    'income': [50000, 75000, None, 85000, 60000, 70000, -5000, 90000, 55000, 65000],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', None, 'M', 'F']
})
# Clean
cleaned_df, report = clean_survey_data(data)
# Display cleaned data
print("\nCleaned data:")
print(cleaned_df)
```

Exercise 3: Data Grouping and Aggregation (Intermediate)
Difficulty: ⭐⭐⭐ Time: 30 minutes
```python
"""
Task: Analyze income differences by education level
Requirements:
1. Group by education level
2. Calculate statistics for each group
3. Create income comparison visualization
4. Generate summary report
"""Reference Answer
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate sample data
np.random.seed(42)
n = 200
data = pd.DataFrame({
    'id': range(1, n+1),
    'age': np.random.randint(25, 60, n),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'],
                                  n, p=[0.3, 0.4, 0.2, 0.1]),
    'gender': np.random.choice(['M', 'F'], n),
    'income': np.random.lognormal(11, 0.5, n)  # Lognormal distribution
})
# Adjust income by education level
education_multiplier = {
    'High School': 0.7,
    'Bachelor': 1.0,
    'Master': 1.3,
    'PhD': 1.6
}
data['income'] = data.apply(
    lambda row: row['income'] * education_multiplier[row['education']], axis=1
)
print("Education Level and Income Analysis")
print("=" * 70)
# 1. Group statistics by education
print("\n1. Income statistics by education level:")
edu_stats = data.groupby('education')['income'].agg([
    ('Sample Size', 'count'),
    ('Average Income', 'mean'),
    ('Median', 'median'),
    ('Std Dev', 'std'),
    ('Minimum', 'min'),
    ('Maximum', 'max')
]).round(2)
# Sort by average income
edu_stats = edu_stats.sort_values('Average Income', ascending=False)
print(edu_stats)
# 2. Group by education and gender
print("\n2. Average income by education and gender:")
gender_edu_stats = data.groupby(['education', 'gender'])['income'].mean().unstack()
gender_edu_stats = gender_edu_stats.loc[edu_stats.index] # Maintain order
print(gender_edu_stats.round(2))
# 3. Income quantiles
print("\n3. Income quantiles by education level:")
percentiles = data.groupby('education')['income'].quantile([0.25, 0.5, 0.75]).unstack()
percentiles.columns = ['25%', '50%', '75%']
percentiles = percentiles.loc[edu_stats.index]
print(percentiles.round(2))
# 4. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 4.1 Box plot
education_order = edu_stats.index.tolist()
sns.boxplot(data=data, x='education', y='income', order=education_order, ax=axes[0, 0])
axes[0, 0].set_title('Income Distribution by Education Level', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Education Level')
axes[0, 0].set_ylabel('Income')
axes[0, 0].tick_params(axis='x', rotation=45)
# 4.2 Average income bar chart
edu_stats['Average Income'].plot(kind='bar', ax=axes[0, 1], color='skyblue')
axes[0, 1].set_title('Average Income by Education', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Education Level')
axes[0, 1].set_ylabel('Average Income')
axes[0, 1].tick_params(axis='x', rotation=45)
# Add value labels
for i, v in enumerate(edu_stats['Average Income']):
    axes[0, 1].text(i, v + 5000, f'${v:,.0f}', ha='center')
# 4.3 Group by gender
gender_edu_stats.plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Income by Education and Gender', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Education Level')
axes[1, 0].set_ylabel('Average Income')
axes[1, 0].legend(title='Gender')
axes[1, 0].tick_params(axis='x', rotation=45)
# 4.4 Violin plot
sns.violinplot(data=data, x='education', y='income', order=education_order, ax=axes[1, 1])
axes[1, 1].set_title('Income Distribution (Violin Plot)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Education Level')
axes[1, 1].set_ylabel('Income')
axes[1, 1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('education_income_analysis.png', dpi=300, bbox_inches='tight')
print("\nVisualization saved: education_income_analysis.png")
plt.show()
# 5. Statistical test (simplified)
print("\n5. Income gap analysis:")
high_school_income = data[data['education'] == 'High School']['income'].mean()
phd_income = data[data['education'] == 'PhD']['income'].mean()
income_gap = phd_income - high_school_income
gap_percentage = (income_gap / high_school_income) * 100
print(f"High School average income: ${high_school_income:,.2f}")
print(f"PhD average income: ${phd_income:,.2f}")
print(f"Income gap: ${income_gap:,.2f} ({gap_percentage:.1f}%)")
# 6. Generate report
report = {
    'Analysis date': pd.Timestamp.now().strftime('%Y-%m-%d'),
    'Sample size': len(data),
    'Education levels': edu_stats.index.tolist(),
    'Sample size by education': edu_stats['Sample Size'].tolist(),
    'Average income': edu_stats['Average Income'].round(2).tolist(),
    'Income gap (PhD vs High School)': f'${income_gap:,.2f}',
    'Gap percentage': f'{gap_percentage:.1f}%'
}
print("\n" + "=" * 70)
print("Analysis Report")
print("=" * 70)
for key, value in report.items():
print(f"{key}: {value}")
print("=" * 70)Exercise 4: Time Series Analysis (Advanced)
Difficulty: ⭐⭐⭐⭐ Time: 40 minutes
Create an annual income trend analysis system.
Hints
- Use pd.date_range() to create dates
- Use df.resample() for temporal aggregation
- Use rolling() to calculate moving averages
- Use Matplotlib to plot trends (a sketch follows the list)
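This exercise has no reference answer; the sketch below is one possible starting point combining the hinted functions on synthetic monthly data (all values, column names, and parameters are illustrative assumptions):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Synthetic monthly income series: an upward trend plus noise
rng = np.random.default_rng(42)
dates = pd.date_range('2015-01-01', periods=120, freq='MS')  # 10 years of months
income = 50000 + 300 * np.arange(120) + rng.normal(0, 2000, 120)
ts = pd.DataFrame({'income': income}, index=dates)

# Annual aggregation with resample()
annual = ts['income'].resample('YS').mean()

# 12-month moving average with rolling()
ts['ma_12'] = ts['income'].rolling(window=12).mean()

# Plot raw series, moving average, and annual means
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ts.index, ts['income'], alpha=0.4, label='Monthly income')
ax.plot(ts.index, ts['ma_12'], linewidth=2, label='12-month moving average')
ax.plot(annual.index, annual.values, 'o--', label='Annual mean')
ax.set_xlabel('Year')
ax.set_ylabel('Income')
ax.set_title('Annual Income Trend')
ax.legend()
plt.tight_layout()
plt.show()
```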
Next Steps
After completing this module, you have mastered:
- NumPy array operations and vectorization
- Pandas data manipulation (cleaning, transformation, aggregation)
- Matplotlib/Seaborn data visualization
- Complete data analysis workflows
Congratulations on completing Module 9! This is the core module for Python data analysis.
In the next Modules 10 and 11, you'll learn machine learning and best practices.
Your data science journey has just begun!