Module 9: Core Data Science Libraries

NumPy, Pandas, Matplotlib — The Three Pillars of Python Data Analysis


Module Overview

This is the most important chapter in the entire book! This module provides in-depth coverage of Python's core data science libraries: NumPy (numerical computing), Pandas (data manipulation), and Matplotlib/Seaborn (data visualization). Master these three libraries, and you'll have mastered 80% of Python's core data analysis skills.

Important Note: This chapter contains the most content (6 articles) but is also the most practical. We recommend allocating 2-3 weeks for in-depth study and practice.


Learning Objectives

After completing this module, you will be able to:

  • Use NumPy for efficient numerical computing
  • Use Pandas to read, clean, and transform data
  • Master advanced Pandas operations like grouping, merging, and pivoting
  • Create professional charts using Matplotlib and Seaborn
  • Calculate descriptive statistics
  • Scrape web data using Requests
  • Build complete data analysis workflows

Chapter Contents

01 - NumPy Basics

Core Question: Why do we need NumPy?

Core Content:

  • What is NumPy?
    • Foundation library for scientific computing in Python
    • Pandas is built on top of NumPy
    • Vectorized array operations are typically 10-100 times faster than equivalent loops over Python lists
  • Creating Arrays:
    python
    import numpy as np
    
    # Create from list
    ages = np.array([25, 30, 35, 40])
    
    # Special arrays
    zeros = np.zeros(5)  # [0. 0. 0. 0. 0.]
    ones = np.ones(5)    # [1. 1. 1. 1. 1.]
    seq = np.arange(0, 10, 2)  # [0 2 4 6 8]
    
    # Random numbers
    rand = np.random.rand(5)  # Uniform distribution [0, 1)
    randn = np.random.randn(5)  # Standard normal distribution
  • Array Operations (see the sketch after this list):
    • Vectorized operations (no loops needed)
    • Indexing and slicing
    • Statistical functions: mean(), std(), sum(), max()
    • Linear algebra: matrix multiplication, transpose
  • Comparison with R and Stata:
    • R: c(1, 2, 3) ≈ NumPy array
    • Stata: matrix command ≈ NumPy
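
A minimal sketch of these array operations (variable names are illustrative):

python
import numpy as np

ages = np.array([25, 30, 35, 40])

# Vectorized operations: act on the whole array, no loop needed
ages_in_months = ages * 12

# Indexing and slicing (including boolean masks)
first_two = ages[:2]        # [25 30]
over_30 = ages[ages > 30]   # [35 40]

# Statistical functions
print(ages.mean(), ages.std(), ages.sum(), ages.max())

# Linear algebra: transpose and matrix multiplication
X = np.array([[1, 2], [3, 4]])
print(X.T)      # transpose
print(X @ X.T)  # matrix product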

Why is this important?

  • Pandas DataFrames are built on NumPy arrays
  • Understanding NumPy helps you better understand Pandas
  • Essential for numerical computing and matrix operations

Practical Application:

python
# Generate simulated data
np.random.seed(42)
n = 1000
education = np.random.normal(16, 2, n)  # Years of education
income = 30000 + 5000 * education + np.random.normal(0, 10000, n)

# Calculate correlation coefficient
correlation = np.corrcoef(education, income)[0, 1]
print(correlation)  # ≈ 0.7 by construction: the education signal (5000 * SD of 2) and the noise both have SD 10000

02 - Pandas Introduction

Core Question: How do I use Pandas to handle tabular data?

Core Content:

  • What is Pandas?
    • Python's equivalent of "Excel/Stata"
    • Core data structures: DataFrame (table) and Series (column)
  • Creating DataFrames:
    python
    import pandas as pd
    
    # Create from dictionary
    df = pd.DataFrame({
        'age': [25, 30, 35],
        'income': [50000, 75000, 85000]
    })
    
    # Read from CSV
    df = pd.read_csv('survey.csv')
  • Data Exploration:
    • df.head(): First few rows
    • df.info(): Data types and missing values
    • df.describe(): Descriptive statistics
    • df.shape: Number of rows and columns
  • Data Selection:
    • Select columns: df['age'] or df[['age', 'income']]
    • Select rows: df.loc[0] (by label), df.iloc[0] (by position)
    • Conditional filtering: df[df['age'] > 30]
  • Data Operations (combined in the sketch after this list):
    • New column: df['log_income'] = np.log(df['income'])
    • Delete: df.drop('column', axis=1)
    • Sort: df.sort_values('income')
    • Rename: df.rename(columns={'old': 'new'})
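
Putting exploration, selection, and operations together (a minimal sketch; column names follow the examples above):

python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})

# Explore
print(df.head())
print(df.shape)  # (3, 2)

# Select and filter
older = df[df['age'] > 30]

# Transform: new column, sort, rename
df['log_income'] = np.log(df['income'])
df = df.sort_values('income', ascending=False)
df = df.rename(columns={'income': 'annual_income'})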

Comparison with Stata/R:

| Operation | Stata | R | Pandas |
|-----------|-------|---|--------|
| Select columns | keep age income | df[c('age', 'income')] | df[['age', 'income']] |
| Filter | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
| New column | gen log_income = log(income) | df$log_income <- log(df$income) | df['log_income'] = np.log(df['income']) |

03 - Advanced Pandas Operations

Core Question: How do I perform complex data manipulations?

Core Content:

  • Missing Value Handling:
    python
    df.isnull().sum()  # Count missing values
    df.dropna()  # Remove missing values
    df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their means
  • GroupBy Aggregation:
    python
    # Group by gender and calculate average income
    df.groupby('gender')['income'].mean()
    
    # Multiple statistics
    df.groupby('gender').agg({
        'income': ['mean', 'std', 'count'],
        'age': 'mean'
    })
  • Merging Data:
    python
    # Similar to SQL JOIN
    df_merged = pd.merge(df1, df2, on='id', how='left')
    
    # Vertical merge (append rows)
    df_combined = pd.concat([df1, df2], ignore_index=True)
  • Pivot Tables:
    python
    # Similar to Excel pivot table
    pivot = df.pivot_table(
        values='income',
        index='gender',
        columns='education_level',
        aggfunc='mean'
    )
  • Data Reshaping (see the sketch after this list):
    • melt(): Wide format → Long format
    • pivot(): Long format → Wide format
  • Apply and Custom Functions:
    python
    df['income_category'] = df['income'].apply(
        lambda x: 'High Income' if x > 80000 else 'Low-Mid Income'
    )
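
A minimal reshaping sketch (the wide-format columns are hypothetical):

python
import pandas as pd

wide = pd.DataFrame({
    'id': [1, 2],
    'income_2020': [50000, 60000],
    'income_2021': [52000, 63000]
})

# Wide -> long: one row per (id, year-column) pair
long = wide.melt(id_vars='id', var_name='year', value_name='income')

# Long -> wide: back to one column per year
back = long.pivot(index='id', columns='year', values='income')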

Comparison with Stata:

| Operation | Stata | Pandas |
|-----------|-------|--------|
| Group statistics | bysort gender: summarize income | df.groupby('gender')['income'].describe() |
| Merge | merge 1:1 id using other.dta | pd.merge(df1, df2, on='id') |
| Pivot | table gender education, c(mean income) | df.pivot_table(...) |
| Reshape | reshape | melt() / pivot() |

04 - Matplotlib and Seaborn

Core Question: How do I create professional data visualizations?

Core Content:

  • Matplotlib Basics:
    python
    import matplotlib.pyplot as plt
    
    # Scatter plot
    plt.scatter(df['education'], df['income'])
    plt.xlabel('Education Years')
    plt.ylabel('Income ($)')
    plt.title('Education vs Income')
    plt.show()
    
    # Histogram
    plt.hist(df['age'], bins=20, edgecolor='black')
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.show()
  • Seaborn (higher-level, prettier defaults):
    python
    import seaborn as sns
    
    # Scatter plot + regression line
    sns.regplot(x='education', y='income', data=df)
    
    # Box plot
    sns.boxplot(x='gender', y='income', data=df)
    
    # Heatmap (correlation)
    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
  • Common Chart Types:
    • Scatter plot: Variable relationships
    • Histogram: Distributions
    • Box plot: Group comparisons
    • Bar chart: Categorical statistics
    • Line plot: Time series
    • Heatmap: Correlation matrix
  • Subplots and Layout:
    python
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes[0, 0].hist(df['age'])
    axes[0, 1].scatter(df['education'], df['income'])
    plt.tight_layout()  # Prevent axis labels from overlapping
    plt.show()
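
To save a chart for a paper or slide deck (a minimal sketch; the filename is illustrative):

python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df['education'], df['income'])
ax.set_xlabel('Education Years')
ax.set_ylabel('Income ($)')
fig.savefig('education_income.png', dpi=300, bbox_inches='tight')  # High-resolution export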

Comparison with Stata/R:

  • Stata: graph twoway scatter, histogram
  • R: ggplot2 (most powerful visualization library)
  • Python: Matplotlib (low-level) + Seaborn (high-level)

05 - Descriptive Statistics

Core Question: How do I calculate statistical measures?

Core Content:

  • Univariate Statistics:
    python
    # Mean, median, standard deviation
    df['income'].mean()
    df['income'].median()
    df['income'].std()
    
    # Quantiles
    df['income'].quantile([0.25, 0.5, 0.75])
    
    # Complete description
    df['income'].describe()
  • Grouped Statistics:
    python
    df.groupby('gender')['income'].agg(['mean', 'std', 'count'])
  • Correlation Analysis:
    python
    # Correlation matrix
    df[['age', 'education', 'income']].corr()
    
    # Two-variable correlation
    df['education'].corr(df['income'])
  • Frequency Distribution:
    python
    # Frequency table
    df['gender'].value_counts()
    
    # Cross-tabulation
    pd.crosstab(df['gender'], df['education_level'])
  • Using scipy.stats:
    python
    from scipy import stats
    
    # t-test: compare mean income across two groups (category labels illustrative)
    group1 = df.loc[df['gender'] == 'Male', 'income']
    group2 = df.loc[df['gender'] == 'Female', 'income']
    stats.ttest_ind(group1, group2)
    
    # Chi-square test of independence on a cross-tabulation
    crosstab = pd.crosstab(df['gender'], df['education_level'])
    stats.chi2_contingency(crosstab)
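
Survey data often includes sampling weights, and unweighted means can mislead. A minimal sketch (the weight column is hypothetical):

python
import numpy as np

# Weighted mean income using survey weights
weighted_mean = np.average(df['income'], weights=df['weight'])

# Weighted mean by group
by_gender = df.groupby('gender').apply(
    lambda g: np.average(g['income'], weights=g['weight'])
)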

Comparison with Stata:

| Operation | Stata | Pandas |
|-----------|-------|--------|
| Descriptive statistics | summarize | df.describe() |
| Group statistics | bysort gender: summarize income | df.groupby('gender')['income'].describe() |
| Correlation | correlate income education | df[['income', 'education']].corr() |
| Frequency table | tabulate gender | df['gender'].value_counts() |

06 - Web Scraping Introduction

Core Question: How do I retrieve web data?

Core Content:

  • Requests Library:
    python
    import requests
    
    # GET request
    response = requests.get('https://example.com')
    print(response.status_code)  # 200 means success
    print(response.text)  # HTML content
  • BeautifulSoup for HTML Parsing:
    python
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Extract title
    title = soup.find('title').text
    
    # Extract all links
    links = soup.find_all('a')
    for link in links:
        print(link.get('href'))
  • Practical Case: Scraping Table Data:
    python
    # Scrape Wikipedia table
    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
    tables = pd.read_html(url)  # Extracts all tables (requires an HTML parser such as lxml)
    df = tables[0]  # First table
  • API Calls:
    python
    # Get JSON data
    response = requests.get('https://api.example.com/data')
    data = response.json()
    df = pd.DataFrame(data)
  • Important Considerations:
    • Respect robots.txt
    • Add delays to avoid being blocked
    • Check website terms of use
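
A minimal polite-scraping sketch illustrating these considerations (URLs and delay length are illustrative):

python
import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        print(url, len(response.text))
    time.sleep(2)  # Pause between requests so you don't overload the server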

Social Science Applications:

  • Scrape news articles
  • Retrieve social media data
  • Download public datasets
  • Call government/academic APIs

Python vs Stata vs R Data Analysis Comparison

| Task | Stata | R | Pandas |
|------|-------|---|--------|
| Read data | use data.dta | read.csv() | pd.read_csv() |
| View data | describe | summary() | df.describe() |
| Select columns | keep age income | df[c('age', 'income')] | df[['age', 'income']] |
| Filter rows | keep if age > 30 | df[df$age > 30, ] | df[df['age'] > 30] |
| Group statistics | bysort gender: summarize | aggregate() / group_by() | df.groupby().agg() |
| Merge | merge | merge() | pd.merge() |
| New column | gen log_income = log(income) | df$log_income <- log(df$income) | df['log_income'] = np.log(df['income']) |
| Visualization | graph | ggplot2 | matplotlib / seaborn |

How to Study This Module?

Learning Path (2-3 Weeks)

Days 1-2 (4 hours): NumPy Basics

  • Read 01 - NumPy Basics
  • Practice array creation and operations
  • Understand vectorized operations

Days 3-5 (10 hours): Pandas Introduction

  • Read 02 - Pandas Introduction
  • Practice reading CSV files, filtering, creating new columns
  • Complete practice exercises

Days 6-8 (10 hours): Advanced Pandas

  • Read 03 - Advanced Pandas Operations
  • Practice GroupBy, Merge, Pivot
  • Work with real datasets

Days 9-10 (6 hours): Data Visualization

  • Read 04 - Matplotlib and Seaborn
  • Create various chart types
  • Beautify charts

Day 11 (3 hours): Descriptive Statistics

  • Read 05 - Descriptive Statistics
  • Calculate statistical measures
  • Correlation analysis

Day 12 (3 hours): Web Scraping (Optional)

  • Read 06 - Web Scraping Introduction
  • Try scraping simple web pages
  • Call public APIs

Total Time: 36 hours (2-3 weeks)

Minimalist Learning Path

For social science students, absolute essentials:

Must Learn (Essential for daily analysis, 20 hours):

  • 02 - Pandas Introduction (complete study)
  • 03 - Advanced Pandas Operations (GroupBy, Merge)
  • 04 - Matplotlib Basics (scatter plots, histograms, box plots)
  • 05 - Descriptive Statistics

Important (Efficiency boost, 10 hours):

  • 01 - NumPy Basics (understand the foundation)
  • Seaborn visualization
  • Pandas method chaining

Optional (Specific needs):

  • 06 - Web Scraping (when you need to scrape data)
  • NumPy linear algebra
  • Advanced visualization (Plotly interactive charts)

Study Tips

  1. Learn by doing with real data

    • Don't just read sample code
    • Download real datasets (Kaggle, UCI, government data)
    • Try to replicate descriptive statistics tables from papers
  2. Develop Pandas thinking patterns

    python
    # Loop thinking (slow and ugly)
    for i in range(len(df)):
        if df.loc[i, 'age'] > 30:
            df.loc[i, 'age_group'] = 'old'
        else:
            df.loc[i, 'age_group'] = 'young'
    
    # Pandas thinking (fast and elegant)
    df['age_group'] = df['age'].apply(lambda x: 'old' if x > 30 else 'young')
    # Or
    df['age_group'] = np.where(df['age'] > 30, 'old', 'young')
  3. Memorize common operation idioms

    python
    # Read data
    df = pd.read_csv('data.csv')
    
    # Quick view
    df.head()
    df.info()
    df.describe()
    
    # Data cleaning pipeline
    df_clean = (df
        .dropna(subset=['age', 'income'])  # Remove missing values
        .query('age >= 18 & age <= 100')   # Filter
        .assign(log_income=lambda x: np.log(x['income']))  # New column
        .sort_values('income')             # Sort
    )
  4. Practice project: Complete data analysis

    python
    # Project: Income inequality analysis
    import pandas as pd
    import seaborn as sns
    
    # 1. Read data
    df = pd.read_csv('income_survey.csv')
    
    # 2. Data cleaning
    df_clean = df.dropna().query('age >= 18 & income > 0')
    
    # 3. Descriptive statistics
    print(df_clean.describe())
    print(df_clean.groupby('gender')['income'].mean())
    
    # 4. Visualization
    sns.boxplot(x='education_level', y='income', data=df_clean)
    sns.histplot(df_clean['income'], bins=50)
    
    # 5. Save results
    df_clean.to_csv('income_clean.csv', index=False)

Common Questions

Q: What's the difference between NumPy and Pandas? A:

  • NumPy: Numerical computing, suitable for homogeneous data (all numbers)
  • Pandas: Data manipulation, suitable for heterogeneous data (numbers + text)
  • Pandas is built on top of NumPy

Q: When should I use loc vs iloc? A:

  • loc: By label (column name, row index name)
  • iloc: By position (0, 1, 2...)
python
df.loc[0, 'age']  # Row 0, column 'age'
df.iloc[0, 1]     # Row 0, column 1

Q: Why is my Pandas code slow? A:

  • Avoid row-by-row loops: for i in range(len(df))
  • Prefer vectorized operations: df['new'] = df['old'] * 2
  • Use apply when there is no vectorized equivalent: df['new'] = df['old'].apply(func)
  • Call NumPy functions directly on columns: df['new'] = np.log(df['old'])
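
A quick way to see the difference yourself (a sketch; exact timings vary by machine):

python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'old': np.random.rand(100_000)})

start = time.time()
slow = [x * 2 for x in df['old']]  # Element-by-element loop
loop_time = time.time() - start

start = time.time()
df['new'] = df['old'] * 2  # Vectorized: whole column at once
vec_time = time.time() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")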

Q: Should I choose Matplotlib or Seaborn? A:

  • Matplotlib: Low-level, flexible, complex
  • Seaborn: High-level, beautiful, simple
  • Recommendation: Use Seaborn for quick exploration, Matplotlib for fine-tuning

Q: How do I learn Pandas? I can't remember all these functions! A:

  • Don't memorize everything
  • Remember core operations: read_csv(), head(), query(), groupby(), merge()
  • Look up documentation for other functions as needed
  • Practice more to develop muscle memory

Next Steps

After completing this module, you will have mastered:

  • NumPy numerical computing and array operations
  • Pandas data reading, cleaning, and transformation
  • Group aggregation, data merging, and pivot tables
  • Matplotlib/Seaborn data visualization
  • Descriptive statistical analysis
  • Web data scraping

In Module 10, we will learn machine learning and LLM API calls, exploring more cutting-edge applications.

In Module 11, we will learn code standards, debugging, and Git, improving code quality.

This is the most important chapter! Master NumPy and Pandas, and you'll have mastered the core of Python data analysis!


Released under the MIT License. Content © Author.