Module 9: Core Data Science Libraries
NumPy, Pandas, Matplotlib — The Three Pillars of Python Data Analysis
Module Overview
This is the most important chapter in the entire book! This module provides in-depth coverage of Python's core data science libraries: NumPy (numerical computing), Pandas (data manipulation), and Matplotlib/Seaborn (data visualization). Master these three libraries and you will have roughly 80% of the Python skills needed for everyday data analysis.
Important Note: This chapter contains the most content (6 articles) but is also the most practical. We recommend allocating 2-3 weeks for in-depth study and practice.
Learning Objectives
After completing this module, you will be able to:
- Use NumPy for efficient numerical computing
- Use Pandas to read, clean, and transform data
- Master advanced Pandas operations like grouping, merging, and pivoting
- Create professional charts using Matplotlib and Seaborn
- Calculate descriptive statistics
- Scrape web data using Requests
- Build complete data analysis workflows
Chapter Contents
01 - NumPy Basics
Core Question: Why do we need NumPy?
Core Content:
- What is NumPy?
- Foundation library for scientific computing in Python
- Pandas is built on top of NumPy
- Often 10-100 times faster than Python lists for numerical work (see the benchmark sketch below)
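To get a feel for the gap, here is a rough benchmark sketch (exact timings depend on your machine):

```python
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.perf_counter()
doubled_list = [x * 2 for x in py_list]  # Pure-Python loop in the interpreter
list_time = time.perf_counter() - start

start = time.perf_counter()
doubled_arr = np_arr * 2                 # Vectorized NumPy operation in C
numpy_time = time.perf_counter() - start

print(f'list: {list_time:.4f}s  numpy: {numpy_time:.4f}s')
```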
- Creating Arrays:

```python
import numpy as np

# Create an array from a list
ages = np.array([25, 30, 35, 40])

# Special arrays
zeros = np.zeros(5)         # [0. 0. 0. 0. 0.]
ones = np.ones(5)           # [1. 1. 1. 1. 1.]
seq = np.arange(0, 10, 2)   # [0 2 4 6 8]

# Random numbers
rand = np.random.rand(5)    # Uniform distribution over [0, 1)
randn = np.random.randn(5)  # Standard normal distribution
```

- Array Operations:
- Vectorized operations (no loops needed)
- Indexing and slicing
- Statistical functions: `mean()`, `std()`, `sum()`, `max()`
- Linear algebra: matrix multiplication, transpose (a combined sketch follows this list)
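A minimal sketch combining the operations above (array values are made up for illustration):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Vectorized operations (no loops needed)
arr * 2          # [ 2  4  6  8 10]
arr + 10         # [11 12 13 14 15]

# Indexing and slicing
arr[0]           # 1
arr[1:4]         # [2 3 4]

# Statistical functions
arr.mean(), arr.std(), arr.sum(), arr.max()

# Linear algebra: transpose and matrix multiplication
A = np.array([[1, 2], [3, 4]])
A.T              # Transpose
A @ A.T          # Matrix multiplication
```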
- Comparison with R and Stata:
  - R: `c(1, 2, 3)` ≈ a NumPy array
  - Stata: the `matrix` command ≈ NumPy matrices
Why is this important?
- Pandas DataFrames are built on NumPy arrays
- Understanding NumPy helps you better understand Pandas
- Essential for numerical computing and matrix operations
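A quick illustration of this relationship (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

# Every DataFrame column is backed by a NumPy array
arr = df['age'].to_numpy()
print(type(arr))             # <class 'numpy.ndarray'>

# NumPy functions therefore work directly on Pandas columns
df['log_age'] = np.log(df['age'])
```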
Practical Application:

```python
import numpy as np

# Generate simulated data
np.random.seed(42)
n = 1000
education = np.random.normal(16, 2, n)  # Years of education
income = 30000 + 5000 * education + np.random.normal(0, 10000, n)

# Calculate the correlation coefficient
correlation = np.corrcoef(education, income)[0, 1]
```

02 - Pandas Introduction
Core Question: How do I use Pandas to handle tabular data?
Core Content:
- What is Pandas?
- Python's equivalent of "Excel/Stata"
- Core data structures: DataFrame (table) and Series (column)
- Creating DataFrames:

```python
import pandas as pd

# Create from a dictionary
df = pd.DataFrame({
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})

# Read from a CSV file
df = pd.read_csv('survey.csv')
```

- Data Exploration:
  - `df.head()`: First few rows
  - `df.info()`: Data types and missing values
  - `df.describe()`: Descriptive statistics
  - `df.shape`: Number of rows and columns
- Data Selection:
  - Select columns: `df['age']` or `df[['age', 'income']]`
  - Select rows: `df.loc[0]` (by label), `df.iloc[0]` (by position)
  - Conditional filtering: `df[df['age'] > 30]`
- Data Operations:
  - New column: `df['log_income'] = np.log(df['income'])`
  - Delete a column: `df.drop('column', axis=1)`
  - Sort: `df.sort_values('income')`
  - Rename: `df.rename(columns={'old': 'new'})`
Comparison with Stata/R:
| Operation | Stata | R | Pandas |
|---|---|---|---|
| Select columns | keep age income | df[c('age', 'income')] | df[['age', 'income']] |
| Filter | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
| New column | gen log_income = log(income) | df$log_income <- log(df$income) | df['log_income'] = np.log(df['income']) |
03 - Advanced Pandas Operations
Core Question: How do I perform complex data manipulations?
Core Content:
- Missing Value Handling:

```python
df.isnull().sum()     # Count missing values per column
df.dropna()           # Remove rows with missing values
df.fillna(df.mean())  # Fill with column means
```

- GroupBy Aggregation:

```python
# Group by gender and calculate average income
df.groupby('gender')['income'].mean()

# Multiple statistics
df.groupby('gender').agg({
    'income': ['mean', 'std', 'count'],
    'age': 'mean'
})
```

- Merging Data:

```python
# Similar to SQL JOIN
df_merged = pd.merge(df1, df2, on='id', how='left')

# Vertical merge (append rows)
df_combined = pd.concat([df1, df2], ignore_index=True)
```

- Pivot Tables:

```python
# Similar to an Excel pivot table
pivot = df.pivot_table(
    values='income',
    index='gender',
    columns='education_level',
    aggfunc='mean'
)
```

- Data Reshaping (a sketch follows this list):
  - `melt()`: Wide format → Long format
  - `pivot()`: Long format → Wide format
- Apply and Custom Functions:

```python
df['income_category'] = df['income'].apply(
    lambda x: 'High Income' if x > 80000 else 'Low-Mid Income'
)
```
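A minimal reshaping sketch (the wide-format layout and column names are made up for illustration):

```python
import pandas as pd

# Wide format: one row per person, one column per year
wide = pd.DataFrame({
    'id': [1, 2],
    'income_2020': [50000, 60000],
    'income_2021': [52000, 63000],
})

# Wide -> long
long_df = wide.melt(id_vars='id', var_name='year', value_name='income')

# Long -> wide
wide_again = long_df.pivot(index='id', columns='year', values='income')
```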
Comparison with Stata:
| Operation | Stata | Pandas |
|---|---|---|
| Group statistics | bysort gender: summarize income | df.groupby('gender')['income'].describe() |
| Merge | merge 1:1 id using other.dta | pd.merge(df1, df2, on='id') |
| Pivot | table gender education, c(mean income) | df.pivot_table(...) |
| Reshape | reshape | melt() / pivot() |
04 - Matplotlib and Seaborn
Core Question: How do I create professional data visualizations?
Core Content:
- Matplotlib Basics:

```python
import matplotlib.pyplot as plt

# Scatter plot
plt.scatter(df['education'], df['income'])
plt.xlabel('Education Years')
plt.ylabel('Income ($)')
plt.title('Education vs Income')
plt.show()

# Histogram
plt.hist(df['age'], bins=20, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
```

- Seaborn (More Beautiful):

```python
import seaborn as sns

# Scatter plot with a regression line
sns.regplot(x='education', y='income', data=df)

# Box plot
sns.boxplot(x='gender', y='income', data=df)

# Heatmap (correlation matrix)
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
```

- Common Chart Types:
- Scatter plot: Variable relationships
- Histogram: Distributions
- Box plot: Group comparisons
- Bar chart: Categorical statistics
- Line plot: Time series (bar and line examples are sketched after this list)
- Heatmap: Correlation matrix
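The scatter, histogram, box, and heatmap types are demonstrated above; here is a quick sketch of the remaining two, assuming hypothetical 'gender', 'income', and 'year' columns:

```python
import matplotlib.pyplot as plt

# Bar chart: mean income by category
df.groupby('gender')['income'].mean().plot(kind='bar')
plt.ylabel('Mean Income ($)')
plt.show()

# Line plot: mean income over time (assumes a 'year' column)
df.groupby('year')['income'].mean().plot(kind='line')
plt.ylabel('Mean Income ($)')
plt.show()
```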
- Subplots and Layout:

```python
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].hist(df['age'])
axes[0, 1].scatter(df['education'], df['income'])
plt.tight_layout()
plt.show()
```
Comparison with Stata/R:
- Stata: `graph twoway scatter`, `histogram`
- R: `ggplot2` (a very powerful visualization library)
- Python: Matplotlib (low-level) + Seaborn (high-level)
05 - Descriptive Statistics
Core Question: How do I calculate statistical measures?
Core Content:
- Univariate Statistics:

```python
# Mean, median, standard deviation
df['income'].mean()
df['income'].median()
df['income'].std()

# Quantiles
df['income'].quantile([0.25, 0.5, 0.75])

# Complete summary
df['income'].describe()
```

- Grouped Statistics:

```python
df.groupby('gender')['income'].agg(['mean', 'std', 'count'])
```

- Correlation Analysis:

```python
# Correlation matrix
df[['age', 'education', 'income']].corr()

# Correlation between two variables
df['education'].corr(df['income'])
```

- Frequency Distribution:

```python
# Frequency table
df['gender'].value_counts()

# Cross-tabulation
pd.crosstab(df['gender'], df['education_level'])
```

- Using scipy.stats:

```python
from scipy import stats

# t-test for two independent groups
stats.ttest_ind(group1, group2)

# Chi-square test on a contingency table
stats.chi2_contingency(crosstab)
```
Comparison with Stata:
| Operation | Stata | Pandas |
|---|---|---|
| Descriptive statistics | summarize | df.describe() |
| Group statistics | bysort gender: summarize income | df.groupby('gender')['income'].describe() |
| Correlation | correlate income education | df[['income', 'education']].corr() |
| Frequency table | tabulate gender | df['gender'].value_counts() |
06 - Web Scraping Introduction
Core Question: How do I retrieve web data?
Core Content:
- Requests Library:

```python
import requests

# GET request
response = requests.get('https://example.com')
print(response.status_code)  # 200 means success
print(response.text)         # HTML content
```

- BeautifulSoup for HTML Parsing:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title
title = soup.find('title').text

# Extract all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
```

- Practical Case: Scraping Table Data:

```python
import pandas as pd

# Scrape a Wikipedia table
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url)  # Automatically extract all tables on the page
df = tables[0]              # First table
```

- API Calls:

```python
# Get JSON data
response = requests.get('https://api.example.com/data')
data = response.json()
df = pd.DataFrame(data)
```

- Important Considerations (a polite-scraping sketch follows this list):
- Respect robots.txt
- Add delays to avoid being blocked
- Check website terms of use
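A minimal sketch of polite scraping under these guidelines (the URL list, User-Agent string, and delay are made-up examples):

```python
import time
import requests

# Hypothetical list of pages to scrape
urls = ['https://example.com/page1', 'https://example.com/page2']

# Identify yourself so site owners can contact you
headers = {'User-Agent': 'course-practice-bot (your_email@example.com)'}

for url in urls:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        pass  # Parse response.text here
    time.sleep(2)  # Pause between requests to avoid overloading the server
```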
Social Science Applications:
- Scrape news articles
- Retrieve social media data
- Download public datasets
- Call government/academic APIs
Python vs Stata vs R Data Analysis Comparison
| Task | Stata | R | Python |
|---|---|---|---|
| Read data | use data.dta | read.csv() | pd.read_csv() |
| View data | describe | summary() | df.describe() |
| Select columns | keep age income | df[c('age', 'income')] | df[['age', 'income']] |
| Filter rows | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
| New column | gen log_income = log(income) | df$log_income <- log(df$income) | df['log_income'] = np.log(df['income']) |
| Group statistics | bysort gender: summarize | aggregate() / group_by() | df.groupby().agg() |
| Merge | merge | merge() | pd.merge() |
| Visualization | graph | ggplot2 | matplotlib / seaborn |
How to Study This Module?
Learning Path (2-3 Weeks)
Days 1-2 (4 hours): NumPy Basics
- Read 01 - NumPy Basics
- Practice array creation and operations
- Understand vectorized operations
Days 3-5 (10 hours): Pandas Introduction
- Read 02 - Pandas Introduction
- Practice reading CSV files, filtering, creating new columns
- Complete practice exercises
Days 6-8 (10 hours): Advanced Pandas
- Read 03 - Advanced Pandas Operations
- Practice GroupBy, Merge, Pivot
- Work with real datasets
Days 9-10 (6 hours): Data Visualization
- Read 04 - Matplotlib and Seaborn
- Create various chart types
- Beautify charts
Day 11 (3 hours): Descriptive Statistics
- Read 05 - Descriptive Statistics
- Calculate statistical measures
- Correlation analysis
Day 12 (3 hours): Web Scraping (Optional)
- Read 06 - Web Scraping Introduction
- Try scraping simple web pages
- Call public APIs
Total Time: 36 hours (2-3 weeks)
Minimalist Learning Path
For social science students, absolute essentials:
Must Learn (Essential for daily analysis, 20 hours):
- 02 - Pandas Introduction (complete study)
- 03 - Advanced Pandas Operations (GroupBy, Merge)
- 04 - Matplotlib Basics (scatter plots, histograms, box plots)
- 05 - Descriptive Statistics
Important (Efficiency boost, 10 hours):
- 01 - NumPy Basics (understand the foundation)
- Seaborn visualization
- Pandas method chaining
Optional (Specific needs):
- 06 - Web Scraping (when you need to scrape data)
- NumPy linear algebra
- Advanced visualization (Plotly interactive charts)
Study Tips
Learn by doing with real data
- Don't just read sample code
- Download real datasets (Kaggle, UCI, government data)
- Try to replicate descriptive statistics tables from papers
Develop Pandas thinking patterns
```python
# Loop thinking (slow and ugly)
for i in range(len(df)):
    if df.loc[i, 'age'] > 30:
        df.loc[i, 'age_group'] = 'old'
    else:
        df.loc[i, 'age_group'] = 'young'

# Pandas thinking (fast and elegant)
df['age_group'] = df['age'].apply(lambda x: 'old' if x > 30 else 'young')
# Or
df['age_group'] = np.where(df['age'] > 30, 'old', 'young')
```

Memorize common operation idioms

```python
# Read data
df = pd.read_csv('data.csv')

# Quick view
df.head()
df.info()
df.describe()

# Data cleaning pipeline
df_clean = (df
    .dropna(subset=['age', 'income'])                  # Remove missing values
    .query('age >= 18 & age <= 100')                   # Filter rows
    .assign(log_income=lambda x: np.log(x['income']))  # New column
    .sort_values('income')                             # Sort
)
```

Practice project: Complete data analysis

```python
# Project: Income inequality analysis
# 1. Read data
df = pd.read_csv('income_survey.csv')

# 2. Data cleaning
df_clean = df.dropna().query('age >= 18 & income > 0')

# 3. Descriptive statistics
print(df_clean.describe())
print(df_clean.groupby('gender')['income'].mean())

# 4. Visualization
sns.boxplot(x='education_level', y='income', data=df_clean)
sns.histplot(df_clean['income'], bins=50)

# 5. Save results
df_clean.to_csv('income_clean.csv', index=False)
```
Common Questions
Q: What's the difference between NumPy and Pandas? A:
- NumPy: Numerical computing, suitable for homogeneous data (all numbers)
- Pandas: Data manipulation, suitable for heterogeneous data (numbers + text)
- Pandas is built on top of NumPy
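A small demonstration of the difference (made-up data):

```python
import numpy as np
import pandas as pd

# NumPy: homogeneous -- mixing types forces a single common dtype
arr = np.array([25, 'Alice'])
print(arr.dtype)   # A string dtype such as <U21; the number became text

# Pandas: heterogeneous -- each column keeps its own dtype
df = pd.DataFrame({'age': [25, 30], 'name': ['Alice', 'Bob']})
print(df.dtypes)   # age: int64, name: object
```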
Q: When should I use loc vs iloc? A:
- `loc`: By label (column name, row index label)
- `iloc`: By position (0, 1, 2, ...)

```python
df.loc[0, 'age']  # Row labeled 0, column 'age'
df.iloc[0, 1]     # Row 0, column 1 (by position)
```

Q: Why is my Pandas code slow? A:
- Don't use loops: `for i in range(len(df))`
- Use vectorized operations: `df['new'] = df['old'] * 2`
- Use `apply()`: `df['new'] = df['old'].apply(func)`
- Use NumPy: `df['new'] = np.log(df['old'])`
Q: Should I choose Matplotlib or Seaborn? A:
- Matplotlib: Low-level, flexible, complex
- Seaborn: High-level, beautiful, simple
- Recommendation: Use Seaborn for quick exploration, Matplotlib for fine-tuning
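A common pattern combining the two (a sketch, reusing the hypothetical df columns from earlier in this module):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Quick exploration with Seaborn...
ax = sns.boxplot(x='gender', y='income', data=df)

# ...then fine-tune the details with Matplotlib
ax.set_title('Income by Gender')
ax.set_ylabel('Annual Income ($)')
plt.tight_layout()
plt.show()
```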
Q: How do I learn Pandas? I can't remember all these functions! A:
- Don't memorize everything
- Remember the core operations: `read_csv()`, `head()`, `query()`, `groupby()`, `merge()`
- Look up the documentation for other functions as needed
- Practice more to develop muscle memory
Next Steps
After completing this module, you will have mastered:
- NumPy numerical computing and array operations
- Pandas data reading, cleaning, and transformation
- Group aggregation, data merging, and pivot tables
- Matplotlib/Seaborn data visualization
- Descriptive statistical analysis
- Web data scraping
In Module 10, we will learn machine learning and LLM API calls, exploring more cutting-edge applications.
In Module 11, we will learn code standards, debugging, and Git, improving code quality.
This is the most important chapter! Master NumPy and Pandas, and you'll have mastered the core of Python data analysis!