Module 9: Core Data Science Libraries
NumPy, Pandas, Matplotlib — The Three Pillars of Python Data Analysis
Module Overview
This is the most important chapter in the entire book! This module provides in-depth coverage of Python's core data science libraries: NumPy (numerical computing), Pandas (data manipulation), and Matplotlib/Seaborn (data visualization). Master these three libraries and you will have roughly 80% of the Python skills needed for everyday data analysis.
Important Note: This chapter contains the most content (6 articles) but is also the most practical. We recommend allocating 2-3 weeks for in-depth study and practice.
Learning Objectives
After completing this module, you will be able to:
- Use NumPy for efficient numerical computing
- Use Pandas to read, clean, and transform data
- Master advanced Pandas operations like grouping, merging, and pivoting
- Create professional charts using Matplotlib and Seaborn
- Calculate descriptive statistics
- Scrape web data using Requests
- Build complete data analysis workflows
Chapter Contents
01 - NumPy Basics
Core Question: Why do we need NumPy?
Core Content:
- What is NumPy?
- Foundation library for scientific computing in Python
- Pandas is built on top of NumPy
- Often 10-100 times faster than Python lists for numerical work (see the benchmark sketch below)
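To get a feel for the gap, here is a rough benchmark sketch (exact timings depend on your machine):

```python
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.perf_counter()
doubled_list = [x * 2 for x in py_list]  # Pure-Python loop in the interpreter
list_time = time.perf_counter() - start

start = time.perf_counter()
doubled_arr = np_arr * 2                 # Vectorized NumPy operation in C
numpy_time = time.perf_counter() - start

print(f'list: {list_time:.4f}s  numpy: {numpy_time:.4f}s')
```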
- Creating Arrays:

```python
import numpy as np

# Create an array from a list
ages = np.array([25, 30, 35, 40])

# Special arrays
zeros = np.zeros(5)         # [0. 0. 0. 0. 0.]
ones = np.ones(5)           # [1. 1. 1. 1. 1.]
seq = np.arange(0, 10, 2)   # [0 2 4 6 8]

# Random numbers
rand = np.random.rand(5)    # Uniform distribution over [0, 1)
randn = np.random.randn(5)  # Standard normal distribution
```

- Array Operations:
- Vectorized operations (no loops needed)
- Indexing and slicing
- Statistical functions: `mean()`, `std()`, `sum()`, `max()`
- Linear algebra: matrix multiplication, transpose (a combined sketch follows this list)
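A minimal sketch combining the operations above (array values are made up for illustration):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Vectorized operations (no loops needed)
arr * 2          # [ 2  4  6  8 10]
arr + 10         # [11 12 13 14 15]

# Indexing and slicing
arr[0]           # 1
arr[1:4]         # [2 3 4]

# Statistical functions
arr.mean(), arr.std(), arr.sum(), arr.max()

# Linear algebra: transpose and matrix multiplication
A = np.array([[1, 2], [3, 4]])
A.T              # Transpose
A @ A.T          # Matrix multiplication
```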
- Comparison with R and Stata:
  - R: `c(1, 2, 3)` ≈ a NumPy array
  - Stata: the `matrix` command ≈ NumPy matrices
Why is this important?
- Pandas DataFrames are built on NumPy arrays
- Understanding NumPy helps you better understand Pandas
- Essential for numerical computing and matrix operations
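A quick illustration of this relationship (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

# Every DataFrame column is backed by a NumPy array
arr = df['age'].to_numpy()
print(type(arr))             # <class 'numpy.ndarray'>

# NumPy functions therefore work directly on Pandas columns
df['log_age'] = np.log(df['age'])
```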
Practical Application:

```python
import numpy as np

# Generate simulated data
np.random.seed(42)
n = 1000
education = np.random.normal(16, 2, n)  # Years of education
income = 30000 + 5000 * education + np.random.normal(0, 10000, n)

# Calculate the correlation coefficient
correlation = np.corrcoef(education, income)[0, 1]
```

02 - Pandas Introduction
Core Question: How do I use Pandas to handle tabular data?
Core Content:
- What is Pandas?
- Python's equivalent of "Excel/Stata"
- Core data structures: DataFrame (table) and Series (column)
- Creating DataFrames:

```python
import pandas as pd

# Create from a dictionary
df = pd.DataFrame({
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})

# Read from a CSV file
df = pd.read_csv('survey.csv')
```

- Data Exploration:
  - `df.head()`: First few rows
  - `df.info()`: Data types and missing values
  - `df.describe()`: Descriptive statistics
  - `df.shape`: Number of rows and columns
- Data Selection:
  - Select columns: `df['age']` or `df[['age', 'income']]`
  - Select rows: `df.loc[0]` (by label), `df.iloc[0]` (by position)
  - Conditional filtering: `df[df['age'] > 30]`
- Data Operations:
  - New column: `df['log_income'] = np.log(df['income'])`
  - Delete a column: `df.drop('column', axis=1)`
  - Sort: `df.sort_values('income')`
  - Rename: `df.rename(columns={'old': 'new'})`
Comparison with Stata/R:
| Operation | Stata | R | Pandas |
|---|---|---|---|
| Select columns | keep age income | df[c('age', 'income')] | df[['age', 'income']] |
| Filter | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
| New column | gen log_income = log(income) | df$log_income <- log(df$income) | df['log_income'] = np.log(df['income']) |
03 - Advanced Pandas Operations
Core Question: How do I perform complex data manipulations?
Core Content:
- Missing Value Handling:

```python
df.isnull().sum()     # Count missing values per column
df.dropna()           # Remove rows with missing values
df.fillna(df.mean())  # Fill with column means
```

- GroupBy Aggregation:

```python
# Group by gender and calculate average income
df.groupby('gender')['income'].mean()

# Multiple statistics
df.groupby('gender').agg({
    'income': ['mean', 'std', 'count'],
    'age': 'mean'
})
```

- Merging Data:

```python
# Similar to SQL JOIN
df_merged = pd.merge(df1, df2, on='id', how='left')

# Vertical merge (append rows)
df_combined = pd.concat([df1, df2], ignore_index=True)
```

- Pivot Tables:

```python
# Similar to an Excel pivot table
pivot = df.pivot_table(
    values='income',
    index='gender',
    columns='education_level',
    aggfunc='mean'
)
```

- Data Reshaping (a sketch follows this list):
  - `melt()`: Wide format → Long format
  - `pivot()`: Long format → Wide format
- Apply and Custom Functions:

```python
df['income_category'] = df['income'].apply(
    lambda x: 'High Income' if x > 80000 else 'Low-Mid Income'
)
```
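A minimal reshaping sketch (the wide-format layout and column names are made up for illustration):

```python
import pandas as pd

# Wide format: one row per person, one column per year
wide = pd.DataFrame({
    'id': [1, 2],
    'income_2020': [50000, 60000],
    'income_2021': [52000, 63000],
})

# Wide -> long
long_df = wide.melt(id_vars='id', var_name='year', value_name='income')

# Long -> wide
wide_again = long_df.pivot(index='id', columns='year', values='income')
```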
Comparison with Stata:
| Operation | Stata | Pandas |
|---|---|---|
| Group statistics | bysort gender: summarize income | df.groupby('gender')['income'].describe() |
| Merge | merge 1:1 id using other.dta | pd.merge(df1, df2, on='id') |
| Pivot | table gender education, c(mean income) | df.pivot_table(...) |
| Reshape | reshape | melt() / pivot() |
04 - Matplotlib and Seaborn
Core Question: How do I create professional data visualizations?
Core Content:
- Matplotlib Basics:

```python
import matplotlib.pyplot as plt

# Scatter plot
plt.scatter(df['education'], df['income'])
plt.xlabel('Education Years')
plt.ylabel('Income ($)')
plt.title('Education vs Income')
plt.show()

# Histogram
plt.hist(df['age'], bins=20, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
```

- Seaborn (More Beautiful):

```python
import seaborn as sns

# Scatter plot with a regression line
sns.regplot(x='education', y='income', data=df)

# Box plot
sns.boxplot(x='gender', y='income', data=df)

# Heatmap (correlation matrix)
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
```

- Common Chart Types:
- Scatter plot: Variable relationships
- Histogram: Distributions
- Box plot: Group comparisons
- Bar chart: Categorical statistics
- Line plot: Time series (bar and line examples are sketched after this list)
- Heatmap: Correlation matrix
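The scatter, histogram, box, and heatmap types are demonstrated above; here is a quick sketch of the remaining two, assuming hypothetical 'gender', 'income', and 'year' columns:

```python
import matplotlib.pyplot as plt

# Bar chart: mean income by category
df.groupby('gender')['income'].mean().plot(kind='bar')
plt.ylabel('Mean Income ($)')
plt.show()

# Line plot: mean income over time (assumes a 'year' column)
df.groupby('year')['income'].mean().plot(kind='line')
plt.ylabel('Mean Income ($)')
plt.show()
```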
- Subplots and Layout:

```python
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].hist(df['age'])
axes[0, 1].scatter(df['education'], df['income'])
plt.tight_layout()
plt.show()
```
Comparison with Stata/R:
- Stata: `graph twoway scatter`, `histogram`
- R: `ggplot2` (a very powerful visualization library)
- Python: Matplotlib (low-level) + Seaborn (high-level)
05 - Descriptive Statistics
Core Question: How do I calculate statistical measures?
Core Content:
- Univariate Statistics:

```python
# Mean, median, standard deviation
df['income'].mean()
df['income'].median()
df['income'].std()

# Quantiles
df['income'].quantile([0.25, 0.5, 0.75])

# Complete summary
df['income'].describe()
```

- Grouped Statistics:

```python
df.groupby('gender')['income'].agg(['mean', 'std', 'count'])
```

- Correlation Analysis:

```python
# Correlation matrix
df[['age', 'education', 'income']].corr()

# Correlation between two variables
df['education'].corr(df['income'])
```

- Frequency Distribution:

```python
# Frequency table
df['gender'].value_counts()

# Cross-tabulation
pd.crosstab(df['gender'], df['education_level'])
```

- Using scipy.stats:

```python
from scipy import stats

# t-test for two independent groups
stats.ttest_ind(group1, group2)

# Chi-square test on a contingency table
stats.chi2_contingency(crosstab)
```
Comparison with Stata:
| Operation | Stata | Pandas |
|---|---|---|
| Descriptive statistics | summarize | df.describe() |
| Group statistics | bysort gender: summarize income | df.groupby('gender')['income'].describe() |
| Correlation | correlate income education | df[['income', 'education']].corr() |
| Frequency table | tabulate gender | df['gender'].value_counts() |
06 - Web Scraping Introduction
Core Question: How do I retrieve web data?
Core Content:
- Requests Library:

```python
import requests

# GET request
response = requests.get('https://example.com')
print(response.status_code)  # 200 means success
print(response.text)         # HTML content
```

- BeautifulSoup for HTML Parsing:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title
title = soup.find('title').text

# Extract all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
```

- Practical Case: Scraping Table Data:

```python
import pandas as pd

# Scrape a Wikipedia table
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url)  # Automatically extract all tables on the page
df = tables[0]              # First table
```

- API Calls:

```python
# Get JSON data
response = requests.get('https://api.example.com/data')
data = response.json()
df = pd.DataFrame(data)
```

- Important Considerations (a polite-scraping sketch follows this list):
- Respect robots.txt
- Add delays to avoid being blocked
- Check website terms of use
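A minimal sketch of polite scraping under these guidelines (the URL list, User-Agent string, and delay are made-up examples):

```python
import time
import requests

# Hypothetical list of pages to scrape
urls = ['https://example.com/page1', 'https://example.com/page2']

# Identify yourself so site owners can contact you
headers = {'User-Agent': 'course-practice-bot (your_email@example.com)'}

for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        pass  # Parse response.text here
    time.sleep(2)  # Pause between requests to avoid overloading the server
```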
Social Science Applications:
- Scrape news articles
- Retrieve social media data
- Download public datasets
- Call government/academic APIs
Python vs Stata vs R Data Analysis Comparison
| Task | Stata | R | Python |
|---|---|---|---|
| Read data | use data.dta | read.csv() | pd.read_csv() |
| View data | describe | summary() | df.describe() |
| Select columns | keep age income | df[c('age', 'income')] | df[['age', 'income']] |
| Filter rows | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
| New column | gen log_income = log(income) | df$log_income <- log(df$income) | df['log_income'] = np.log(df['income']) |
| Group statistics | bysort gender: summarize | aggregate() / group_by() | df.groupby().agg() |
| Merge | merge | merge() | pd.merge() |
| Visualization | graph | ggplot2 | matplotlib / seaborn |
How to Study This Module?
Learning Path (2-3 Weeks)
Days 1-2 (4 hours): NumPy Basics
- Read 01 - NumPy Basics
- Practice array creation and operations
- Understand vectorized operations
Days 3-5 (10 hours): Pandas Introduction
- Read 02 - Pandas Introduction
- Practice reading CSV files, filtering, creating new columns
- Complete practice exercises
Days 6-8 (10 hours): Advanced Pandas
- Read 03 - Advanced Pandas Operations
- Practice GroupBy, Merge, Pivot
- Work with real datasets
Days 9-10 (6 hours): Data Visualization
- Read 04 - Matplotlib and Seaborn
- Create various chart types
- Beautify charts
Day 11 (3 hours): Descriptive Statistics
- Read 05 - Descriptive Statistics
- Calculate statistical measures
- Correlation analysis
Day 12 (3 hours): Web Scraping (Optional)
- Read 06 - Web Scraping Introduction
- Try scraping simple web pages
- Call public APIs
Total Time: 36 hours (2-3 weeks)
Minimalist Learning Path
For social science students, absolute essentials:
Must Learn (Essential for daily analysis, 20 hours):
- 02 - Pandas Introduction (complete study)
- 03 - Advanced Pandas Operations (GroupBy, Merge)
- 04 - Matplotlib Basics (scatter plots, histograms, box plots)
- 05 - Descriptive Statistics
Important (Efficiency boost, 10 hours):
- 01 - NumPy Basics (understand the foundation)
- Seaborn visualization
- Pandas method chaining
Optional (Specific needs):
- 06 - Web Scraping (when you need to scrape data)
- NumPy linear algebra
- Advanced visualization (Plotly interactive charts)
Study Tips
Learn by doing with real data
- Don't just read sample code
- Download real datasets (Kaggle, UCI, government data)
- Try to replicate descriptive statistics tables from papers
Develop Pandas thinking patterns
```python
# Loop thinking (slow and ugly)
for i in range(len(df)):
    if df.loc[i, 'age'] > 30:
        df.loc[i, 'age_group'] = 'old'
    else:
        df.loc[i, 'age_group'] = 'young'

# Pandas thinking (fast and elegant)
df['age_group'] = df['age'].apply(lambda x: 'old' if x > 30 else 'young')
# Or
df['age_group'] = np.where(df['age'] > 30, 'old', 'young')
```

Memorize common operation idioms

```python
# Read data
df = pd.read_csv('data.csv')

# Quick view
df.head()
df.info()
df.describe()

# Data cleaning pipeline
df_clean = (df
    .dropna(subset=['age', 'income'])                  # Remove missing values
    .query('age >= 18 & age <= 100')                   # Filter rows
    .assign(log_income=lambda x: np.log(x['income']))  # New column
    .sort_values('income')                             # Sort
)
```

Practice project: Complete data analysis

```python
# Project: Income inequality analysis
# 1. Read data
df = pd.read_csv('income_survey.csv')

# 2. Data cleaning
df_clean = df.dropna().query('age >= 18 & income > 0')

# 3. Descriptive statistics
print(df_clean.describe())
print(df_clean.groupby('gender')['income'].mean())

# 4. Visualization
sns.boxplot(x='education_level', y='income', data=df_clean)
sns.histplot(df_clean['income'], bins=50)

# 5. Save results
df_clean.to_csv('income_clean.csv', index=False)
```
Common Questions
Q: What's the difference between NumPy and Pandas? A:
- NumPy: Numerical computing, suitable for homogeneous data (all numbers)
- Pandas: Data manipulation, suitable for heterogeneous data (numbers + text)
- Pandas is built on top of NumPy
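A small demonstration of the difference (made-up data):

```python
import numpy as np
import pandas as pd

# NumPy: homogeneous -- mixing types forces a single common dtype
arr = np.array([25, 'Alice'])
print(arr.dtype)   # A string dtype such as <U21; the number became text

# Pandas: heterogeneous -- each column keeps its own dtype
df = pd.DataFrame({'age': [25, 30], 'name': ['Alice', 'Bob']})
print(df.dtypes)   # age: int64, name: object
```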
Q: When should I use loc vs iloc? A:
- `loc`: By label (column name, row index label)
- `iloc`: By position (0, 1, 2, ...)

```python
df.loc[0, 'age']  # Row labeled 0, column 'age'
df.iloc[0, 1]     # Row 0, column 1 (by position)
```

Q: Why is my Pandas code slow? A:
- Don't use loops: `for i in range(len(df))`
- Use vectorized operations: `df['new'] = df['old'] * 2`
- Use `apply()`: `df['new'] = df['old'].apply(func)`
- Use NumPy: `df['new'] = np.log(df['old'])`
Q: Should I choose Matplotlib or Seaborn? A:
- Matplotlib: Low-level, flexible, complex
- Seaborn: High-level, beautiful, simple
- Recommendation: Use Seaborn for quick exploration, Matplotlib for fine-tuning
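A common pattern combining the two (a sketch, reusing the hypothetical df columns from earlier in this module):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Quick exploration with Seaborn...
ax = sns.boxplot(x='gender', y='income', data=df)

# ...then fine-tune the details with Matplotlib
ax.set_title('Income by Gender')
ax.set_ylabel('Annual Income ($)')
plt.tight_layout()
plt.show()
```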
Q: How do I learn Pandas? I can't remember all these functions! A:
- Don't memorize everything
- Remember the core operations: `read_csv()`, `head()`, `query()`, `groupby()`, `merge()`
- Look up the documentation for other functions as needed
- Practice more to develop muscle memory
Next Steps
After completing this module, you will have mastered:
- NumPy numerical computing and array operations
- Pandas data reading, cleaning, and transformation
- Group aggregation, data merging, and pivot tables
- Matplotlib/Seaborn data visualization
- Descriptive statistical analysis
- Web data scraping
In Module 10, we will learn machine learning and LLM API calls, exploring more cutting-edge applications.
In Module 11, we will learn code standards, debugging, and Git, improving code quality.
This is the most important chapter! Master NumPy and Pandas, and you'll have mastered the core of Python data analysis!