NumPy Basics
Efficient Numerical Computing Library — The Foundation of Pandas
What is NumPy?
NumPy (Numerical Python) is the foundation library for scientific computing in Python.
Why learn NumPy?
- Pandas is built on NumPy
- Often 10-100 times faster than Python lists for element-wise numerical operations on large arrays
- Matrix operations (linear algebra)
Analogies:
- Stata: Built-in matrix functionality
- R: Vector operations
- Python: NumPy arrays
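To make the "matrix operations" point concrete, here is a minimal sketch of the kind of work that Stata's matrix functionality or R's vectors cover, done with NumPy arrays. The matrix `A`, the vector `b`, and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical 2x2 coefficient matrix and right-hand side
A = np.array([
    [2.0, 1.0],
    [1.0, 3.0],
])
b = np.array([3.0, 5.0])

# Transpose and matrix product (the @ operator)
AtA = A.T @ A

# Solve the linear system A x = b
x = np.linalg.solve(A, b)
print(x)      # approximately [0.8 1.4]

# Check the solution by multiplying back
print(A @ x)  # approximately [3. 5.]
```

Note that `*` multiplies arrays element by element; `@` is the matrix product.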
Installation and Import
bash
pip install numpy
python
import numpy as np # Standard alias
Creating Arrays
1. Creating from Lists
python
import numpy as np
# 1D array
ages = np.array([25, 30, 35, 40])
print(ages) # [25 30 35 40]
print(type(ages)) # <class 'numpy.ndarray'>
# 2D array (matrix)
data = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(data)
# [[1 2 3]
#  [4 5 6]]
2. Special Arrays
python
# All zeros
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
# All ones
ones = np.ones((2, 3)) # 2x3 matrix of ones
# Sequence
seq = np.arange(0, 10, 2) # [0 2 4 6 8]
# Linear spacing
lin = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
# Random numbers
rand = np.random.rand(5) # 5 uniform random numbers in [0, 1)
randn = np.random.randn(5) # 5 standard normal random numbers
Array Attributes
python
data = np.array([[1, 2, 3], [4, 5, 6]])
print(data.shape) # (2, 3) - shape
print(data.ndim) # 2 - dimensions
print(data.size) # 6 - total elements
print(data.dtype) # int64 - data type (int32 on some platforms, such as Windows with NumPy 1.x)
Array Operations
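Before the element-wise examples, here is a minimal sketch of how `shape` determines what the `axis` argument does on a 2D array. It reuses the 2x3 `data` array from the attributes example; the names `col_sums`, `row_means`, and `reshaped` are illustrative.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the rows, giving one value per column
col_sums = data.sum(axis=0)
print(col_sums)   # [5 7 9]

# axis=1 collapses the columns, giving one value per row
row_means = data.mean(axis=1)
print(row_means)  # [2. 5.]

# reshape rearranges the same 6 elements into a new shape
reshaped = data.reshape(3, 2)
print(reshaped.shape)  # (3, 2)
```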
Vectorized Operations (Faster than Loops)
python
incomes = np.array([50000, 60000, 75000, 80000])
# Vectorized (fast)
after_tax = incomes * 0.75
log_incomes = np.log(incomes)
# Loop (slow): same result, computed one element at a time
after_tax_loop = []
for income in incomes:
    after_tax_loop.append(income * 0.75)
Basic Operations
python
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b) # [11 22 33 44]
print(a * b) # [10 40 90 160]
print(a ** 2) # [1 4 9 16]
print(a > 2) # [False False True True]
Statistical Functions
python
scores = np.array([85, 92, 78, 90, 88])
print(scores.mean()) # 86.6 - mean
print(scores.std()) # 4.88 - standard deviation (population, ddof=0)
print(scores.min()) # 78 - minimum
print(scores.max()) # 92 - maximum
print(scores.sum()) # 433 - sum
Practical Examples
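One detail worth checking before relying on `std()`: NumPy divides by n by default (population standard deviation, `ddof=0`), whereas Stata's `summarize` and R's `sd()` divide by n - 1. A minimal sketch of the difference, using the `scores` array from the statistical functions example:

```python
import numpy as np

scores = np.array([85, 92, 78, 90, 88])

pop_sd = scores.std()           # ddof=0: divide by n
sample_sd = scores.std(ddof=1)  # ddof=1: divide by n - 1

print(f"Population SD: {pop_sd:.2f}")     # 4.88
print(f"Sample SD:     {sample_sd:.2f}")  # 5.46
```

Pass `ddof=1` when the goal is to match the sample standard deviation reported by Stata or R.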
Example 1: Standardizing Data
python
# Raw income data
incomes = np.array([50000, 60000, 75000, 80000, 95000])
# Z-score standardization
mean = incomes.mean()
std = incomes.std()
incomes_std = (incomes - mean) / std
print(f"Standardized: {incomes_std}")
# Standardized (rounded to 2 decimals): approximately [-1.40 -0.77 0.19 0.51 1.47]
Example 2: Batch Calculations
python
# Income of 100 respondents
np.random.seed(42)
incomes = np.random.normal(70000, 15000, 100) # Mean 70k, std 15k
# Calculate after-tax income (25% tax rate)
after_tax = incomes * 0.75
# Statistics
print(f"Average after-tax income: ${after_tax.mean():,.0f}")
print(f"Median: ${np.median(after_tax):,.0f}")
print(f"Standard deviation: ${after_tax.std():,.0f}")
Example 3: Conditional Filtering
python
ages = np.array([22, 35, 45, 28, 55, 30, 48])
# Filter ages 30-50
mask = (ages >= 30) & (ages <= 50)
middle_aged = ages[mask]
print(middle_aged) # [35 45 30 48]
# Statistics
print(f"Ages 30-50: {len(middle_aged)} people")
print(f"Percentage: {len(middle_aged)/len(ages)*100:.1f}%")
NumPy vs Python Lists
python
import time
# Create large array
size = 1000000
py_list = list(range(size))
np_array = np.arange(size)
# Python list (slow)
start = time.perf_counter()
result = [x * 2 for x in py_list]
print(f"List: {time.perf_counter() - start:.4f}s")
# NumPy array (fast)
start = time.perf_counter()
result = np_array * 2
print(f"NumPy: {time.perf_counter() - start:.4f}s")
# NumPy is often 10-100 times faster here; the exact ratio depends on the machine and array size
Common Functions
Mathematical Functions
python
x = np.array([1, 4, 9, 16, 25])
np.sqrt(x) # Square root
np.log(x) # Natural logarithm
np.log10(x) # Base-10 logarithm
np.exp(x) # e^x
np.abs(x) # Absolute value
Aggregation Functions
python
data = np.array([85, 92, 78, 90, 88, 76, 95])
np.mean(data) # Mean
np.median(data) # Median
np.std(data) # Standard deviation
np.var(data) # Variance
np.min(data) # Minimum
np.max(data) # Maximum
np.percentile(data, 25) # 25th percentile
NumPy in Data Analysis
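The aggregation functions listed above combine naturally into a quick descriptive summary, often the first step of an analysis. The sketch below reuses the `data` array from the aggregation example; the helper name `describe` is hypothetical and not part of NumPy, and the choice of sample standard deviation (`ddof=1`) is an assumption made to mirror Stata and R.

```python
import numpy as np

def describe(values):
    """Return basic descriptive statistics as a dict (illustrative helper)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "n": int(values.size),
        "mean": float(np.mean(values)),
        "sd": float(np.std(values, ddof=1)),  # sample SD (divide by n - 1)
        "min": float(np.min(values)),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(np.max(values)),
    }

data = np.array([85, 92, 78, 90, 88, 76, 95])
summary = describe(data)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```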
Application 1: Correlation Coefficient
python
# Age and income
ages = np.array([25, 30, 35, 40, 45, 50])
incomes = np.array([50000, 60000, 75000, 80000, 90000, 95000])
# Calculate correlation coefficient
correlation = np.corrcoef(ages, incomes)[0, 1]
print(f"Correlation: {correlation:.3f}") # 0.988 (strong positive)
Application 2: Group Statistics
python
# Income by gender
male_incomes = np.array([60000, 75000, 80000, 90000])
female_incomes = np.array([55000, 70000, 72000, 85000])
print(f"Male average: ${male_incomes.mean():,.0f}")
print(f"Female average: ${female_incomes.mean():,.0f}")
print(f"Gender gap: ${male_incomes.mean() - female_incomes.mean():,.0f}")
Practice Exercises
python
# Exercise 1: Creating and manipulating arrays
# Create an array from 1-100, filter even numbers, calculate sum of squares
# Exercise 2: Statistical analysis
ages = np.array([22, 25, 28, 30, 35, 40, 45, 50, 55, 60])
# Calculate: mean, median, standard deviation, 25th and 75th percentiles
# Exercise 3: Data standardization
scores = np.array([65, 72, 85, 90, 78, 88, 92, 70, 95, 82])
# Perform Min-Max scaling (normalization) to the 0-1 range
Next Steps
NumPy is the foundation. In the next section, we'll learn Pandas (a data analysis library built on NumPy), which is the core tool for social science data analysis!
Keep going!