
NumPy Basics

Efficient Numerical Computing Library — The Foundation of Pandas


What is NumPy?

NumPy (Numerical Python) is the foundational library for scientific computing in Python.

Why learn NumPy?

  • Pandas is built on NumPy
  • Vectorized operations often run 10-100 times faster than equivalent loops over Python lists
  • Matrix operations (linear algebra)

Analogies:

  • Stata: Built-in matrix functionality
  • R: Vector operations
  • Python: NumPy arrays

Installation and Import

bash
pip install numpy

python
import numpy as np  # Standard alias

Creating Arrays

1. Creating from Lists

python
import numpy as np

# 1D array
ages = np.array([25, 30, 35, 40])
print(ages)  # [25 30 35 40]
print(type(ages))  # <class 'numpy.ndarray'>

# 2D array (matrix)
data = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(data)
# [[1 2 3]
#  [4 5 6]]

2. Special Arrays

python
# All zeros
zeros = np.zeros(5)  # [0. 0. 0. 0. 0.]

# All ones
ones = np.ones((2, 3))  # 2x3 matrix of ones

# Sequence
seq = np.arange(0, 10, 2)  # [0 2 4 6 8]

# Linear spacing
lin = np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ]

# Random numbers
rand = np.random.rand(5)  # 5 uniform random numbers in [0, 1)
randn = np.random.randn(5)  # 5 draws from the standard normal distribution
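
The random draws above change on every run. For reproducible results, one option is NumPy's `Generator` interface, created with `np.random.default_rng`. This is a minimal sketch; the seed value 42 and the variable names are arbitrary choices for illustration:

```python
import numpy as np

# Seeded generator: the same seed reproduces the same draws
rng = np.random.default_rng(42)    # 42 is an arbitrary seed
uniform = rng.random(5)            # 5 draws from [0, 1)
normal = rng.standard_normal(5)    # 5 standard normal draws

# A second generator with the same seed yields identical numbers
rng2 = np.random.default_rng(42)
print(np.array_equal(uniform, rng2.random(5)))  # True
```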

Array Attributes

python
data = np.array([[1, 2, 3], [4, 5, 6]])

print(data.shape)   # (2, 3) - shape
print(data.ndim)    # 2 - number of dimensions
print(data.size)    # 6 - total elements
print(data.dtype)   # int64 on most platforms - data type
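
The default integer type depends on the platform and NumPy version, so it can help to set the `dtype` explicitly when it matters. A small sketch, with illustrative variable names:

```python
import numpy as np

# Request a specific dtype at creation time
counts = np.array([1, 2, 3], dtype=np.int64)
print(counts.dtype)          # int64

# Convert an existing array; astype returns a new array
as_float = counts.astype(np.float64)
print(as_float.dtype)        # float64

# Mixing integers and floats in one list yields a float array
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)           # float64
```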

Array Operations

Vectorized Operations (Faster than Loops)

python
incomes = np.array([50000, 60000, 75000, 80000])

# Vectorized (fast)
after_tax = incomes * 0.75
log_incomes = np.log(incomes)

# Loop (slow): same values, built one element at a time into a plain list
after_tax_loop = []
for income in incomes:
    after_tax_loop.append(income * 0.75)

Basic Operations

python
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print(a + b)   # [11 22 33 44]
print(a * b)   # [ 10  40  90 160]
print(a ** 2)  # [ 1  4  9 16]
print(a > 2)   # [False False  True  True]
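
All of the operators above work element by element. The matrix operations mentioned at the top of this page use the `@` operator and the `np.linalg` module instead. The sketch below is illustrative only; the 2x2 system `A` and `rhs` are made-up values:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Dot product: element-wise products, then summed
print(a @ b)           # 300
print(np.dot(a, b))    # same result

# Hypothetical 2x2 linear system A @ x = rhs (illustrative values)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rhs = np.array([3.0, 5.0])
x = np.linalg.solve(A, rhs)
print(np.allclose(A @ x, rhs))  # True
```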

Statistical Functions

python
scores = np.array([85, 92, 78, 90, 88])

print(scores.mean())   # 86.6 - mean
print(scores.std())    # 4.88 - standard deviation (population formula, divides by n)
print(scores.min())    # 78 - minimum
print(scores.max())    # 92 - maximum
print(scores.sum())    # 433 - sum
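
Note that `std()` divides by n by default (the population formula), whereas Stata's `summarize` and R's `sd()` divide by n - 1. Passing `ddof=1` gives the sample version. The sketch below also shows the `axis` argument on a made-up 2D array of grades:

```python
import numpy as np

scores = np.array([85, 92, 78, 90, 88])

pop_std = scores.std()           # divides by n (ddof=0, the default)
sample_std = scores.std(ddof=1)  # divides by n - 1
print(round(pop_std, 2))         # 4.88
print(round(sample_std, 2))      # 5.46

# Hypothetical grades: rows are students, columns are exams
grades = np.array([[80, 90],
                   [70, 100]])
print(grades.mean(axis=0))       # per-exam means: [75. 95.]
print(grades.mean(axis=1))       # per-student means: [85. 85.]
```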

Practical Examples

Example 1: Standardizing Data

python
# Raw income data
incomes = np.array([50000, 60000, 75000, 80000, 95000])

# Z-score standardization
mean = incomes.mean()
std = incomes.std()
incomes_std = (incomes - mean) / std

print(f"Standardized: {np.round(incomes_std, 2)}")
# Standardized: [-1.4  -0.77  0.19  0.51  1.47]
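
A quick sanity check on z-scores: the standardized values should have mean 0 and standard deviation 1, up to floating-point error. A minimal sketch using the same income data:

```python
import numpy as np

incomes = np.array([50000, 60000, 75000, 80000, 95000])
incomes_std = (incomes - incomes.mean()) / incomes.std()

# Both checks compare with a small floating-point tolerance
print(np.isclose(incomes_std.mean(), 0.0))  # True
print(np.isclose(incomes_std.std(), 1.0))   # True
```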

Example 2: Batch Calculations

python
# Income of 100 respondents
np.random.seed(42)
incomes = np.random.normal(70000, 15000, 100)  # Mean 70k, std 15k

# Calculate after-tax income (25% tax rate)
after_tax = incomes * 0.75

# Statistics
print(f"Average after-tax income: ${after_tax.mean():,.0f}")
print(f"Median: ${np.median(after_tax):,.0f}")
print(f"Standard deviation: ${after_tax.std():,.0f}")

Example 3: Conditional Filtering

python
ages = np.array([22, 35, 45, 28, 55, 30, 48])

# Filter ages 30-50
mask = (ages >= 30) & (ages <= 50)
middle_aged = ages[mask]
print(middle_aged)  # [35 45 30 48]

# Statistics
print(f"Ages 30-50: {len(middle_aged)} people")
print(f"Percentage: {len(middle_aged)/len(ages)*100:.1f}%")
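
Because `True` counts as 1, a boolean mask can also be summed or averaged directly, and `np.where` can recode values based on it. A sketch using the same ages; the group labels are illustrative:

```python
import numpy as np

ages = np.array([22, 35, 45, 28, 55, 30, 48])
mask = (ages >= 30) & (ages <= 50)

print(mask.sum())                     # 4 people aged 30-50
print(f"{mask.mean() * 100:.1f}%")    # 57.1%

# Recode into two groups based on the mask
group = np.where(mask, "30-50", "other")
print(group)
```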

NumPy vs Python Lists

python
import time

# Create large array
size = 1000000
py_list = list(range(size))
np_array = np.arange(size)

# Python list (slow)
start = time.perf_counter()
result = [x * 2 for x in py_list]
print(f"List: {time.perf_counter() - start:.4f}s")

# NumPy array (fast)
start = time.perf_counter()
result = np_array * 2
print(f"NumPy: {time.perf_counter() - start:.4f}s")

# NumPy is often 10-100 times faster here; the exact ratio depends on hardware and array size
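
A single timing measurement is noisy. A somewhat more robust sketch uses the standard library's `timeit` module and takes the best of several runs. The repeat counts here are arbitrary, and the speedup you see will depend on your machine:

```python
import timeit
import numpy as np

size = 1_000_000
py_list = list(range(size))
np_array = np.arange(size)

# Best of 3 measurements, each executing the statement 5 times
list_time = min(timeit.repeat(lambda: [x * 2 for x in py_list], number=5, repeat=3))
numpy_time = min(timeit.repeat(lambda: np_array * 2, number=5, repeat=3))

print(f"List:  {list_time / 5:.4f}s per run")
print(f"NumPy: {numpy_time / 5:.4f}s per run")
print(f"Speedup: {list_time / numpy_time:.0f}x")
```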

Common Functions

Mathematical Functions

python
x = np.array([1, 4, 9, 16, 25])

np.sqrt(x)    # Square root
np.log(x)     # Natural logarithm
np.log10(x)   # Base-10 logarithm
np.exp(x)     # e^x
np.abs(x)     # Absolute value

Aggregation Functions

python
data = np.array([85, 92, 78, 90, 88, 76, 95])

np.mean(data)      # Mean
np.median(data)    # Median
np.std(data)       # Standard deviation
np.var(data)       # Variance
np.min(data)       # Minimum
np.max(data)       # Maximum
np.percentile(data, 25)  # 25th percentile

NumPy in Data Analysis

Application 1: Correlation Coefficient

python
# Age and income
ages = np.array([25, 30, 35, 40, 45, 50])
incomes = np.array([50000, 60000, 75000, 80000, 90000, 95000])

# Calculate correlation coefficient
correlation = np.corrcoef(ages, incomes)[0, 1]
print(f"Correlation: {correlation:.3f}")  # 0.988 (strong positive)
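
For reference, the Pearson coefficient is the sum of the products of deviations divided by the square root of the product of the summed squared deviations. The sketch below computes it by hand and compares the result with `np.corrcoef`; the names `dx`, `dy`, and `r_manual` are illustrative:

```python
import numpy as np

ages = np.array([25, 30, 35, 40, 45, 50])
incomes = np.array([50000, 60000, 75000, 80000, 90000, 95000])

# Deviations from the mean
dx = ages - ages.mean()
dy = incomes - incomes.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_numpy = np.corrcoef(ages, incomes)[0, 1]

print(f"{r_manual:.3f}")               # 0.988
print(np.isclose(r_manual, r_numpy))   # True
```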

Application 2: Group Statistics

python
# Income by gender
male_incomes = np.array([60000, 75000, 80000, 90000])
female_incomes = np.array([55000, 70000, 72000, 85000])

print(f"Male average: ${male_incomes.mean():,.0f}")
print(f"Female average: ${female_incomes.mean():,.0f}")
print(f"Gender gap: ${male_incomes.mean() - female_incomes.mean():,.0f}")

Practice Exercises

python
# Exercise 1: Creating and manipulating arrays
# Create an array from 1-100, filter even numbers, calculate sum of squares

# Exercise 2: Statistical analysis
ages = np.array([22, 25, 28, 30, 35, 40, 45, 50, 55, 60])
# Calculate: mean, median, standard deviation, 25th and 75th percentiles

# Exercise 3: Data standardization
scores = np.array([65, 72, 85, 90, 78, 88, 92, 70, 95, 82])
# Rescale to the 0-1 range with min-max scaling: (x - min) / (max - min)

Next Steps

NumPy is the foundation. In the next section, we'll learn Pandas (a data analysis library built on NumPy), which is the core tool for social science data analysis!

Keep going!

Released under the MIT License. Content © Author.