Web Data Scraping
Using requests and BeautifulSoup to Retrieve Web Data
Why Learn Web Scraping?
- Retrieve public data (news, social media)
- Collect research data
- Monitor website changes
Note: Respect website robots.txt and terms of use
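The robots.txt rule above can be checked in code with the standard library's `urllib.robotparser`. A minimal offline sketch, using an inline robots.txt whose rules are made up for illustration (normally you would point the parser at `https://<site>/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, inlined so the example needs no network access
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/data"))       # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

If `can_fetch` returns False for a URL, your scraper should skip it.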
requests Basics
```python
import requests

# GET request
response = requests.get('https://example.com')
print(response.status_code)  # 200 means success
print(response.text)         # HTML content
print(response.json())       # if the response is a JSON API
```

BeautifulSoup for HTML Parsing
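BeautifulSoup turns an HTML string into a searchable tree. Before fetching live pages, you can try it offline on an inline snippet (the tag names and values here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Inline HTML so the example runs without any network access
html = '''
<ul>
  <li class="item"><span class="name">Alice</span><span class="value">42</span></li>
  <li class="item"><span class="name">Bob</span><span class="value">17</span></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; .text gives the inner text
names = [li.find('span', class_='name').text
         for li in soup.find_all('li', class_='item')]
print(names)  # ['Alice', 'Bob']
```

The same `find`/`find_all` calls work unchanged on HTML fetched with requests, as the next example shows.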
```python
from bs4 import BeautifulSoup
import requests

# Get the webpage
url = 'https://example.com/data'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find elements
title = soup.find('h1').text
items = soup.find_all('div', class_='item')

# Extract data
for item in items:
    name = item.find('span', class_='name').text
    value = item.find('span', class_='value').text
    print(f"{name}: {value}")
```

Practical Examples
Example: Scraping Table Data
```python
import pandas as pd

# pandas can read HTML tables directly
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP'
tables = pd.read_html(url)

# Select the table you need
df = tables[0]
print(df.head())
```

Example: API Data Retrieval
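The JSON-to-DataFrame step in the example below can be rehearsed offline with hand-written records shaped like a typical API payload (the values here are invented for illustration):

```python
import pandas as pd

# Hand-written records shaped like the data portion of a JSON API
# response (the numbers are made up)
records = [
    {'date': '2022', 'value': 12720.2},
    {'date': '2021', 'value': 12617.5},
    {'date': '2020', 'value': 10408.7},
]
df = pd.DataFrame(records)
print(df[['date', 'value']])
```

Each dictionary becomes one row, and the dictionary keys become column names.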
```python
import requests
import pandas as pd

# World Bank API example: GDP per capita for China
url = 'https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD'
params = {
    'format': 'json',
    'per_page': 10
}
response = requests.get(url, params=params)
data = response.json()

# Convert to DataFrame: the first list element is pagination
# metadata, the second holds the records
df = pd.DataFrame(data[1])
print(df[['date', 'value']].head())
```

Best Practices
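Two more polite habits: reuse a `requests.Session` for repeated requests to the same site (it keeps the connection alive), and set a descriptive User-Agent so site owners can identify your scraper. A sketch, where the scraper name and contact address are placeholders you should replace with your own:

```python
import requests

session = requests.Session()
# Identify your scraper; many sites block anonymous default clients
session.headers.update({
    'User-Agent': 'course-scraper/0.1 (contact: you@example.com)'
})

# session.get() works like requests.get(), but reuses the connection
# and sends the headers above with every request, e.g.:
#   response = session.get('https://example.com/data', timeout=10)
print(session.headers['User-Agent'])
```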
```python
import time
import requests

def safe_get(url, max_retries=3):
    """GET a URL with a timeout, retries, and exponential backoff."""
    for i in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            time.sleep(1)  # polite delay between requests
            return response
        except requests.RequestException as e:
            print(f"Attempt {i+1} failed: {e}")
            time.sleep(2 ** i)  # exponential backoff
    return None
```

Practice Exercises
```python
# Try scraping:
# 1. A faculty list from a university website
# 2. Headlines from a news website
# 3. Weather data
# (Use public websites that allow scraping)
```

Module 9 Complete!
You have now mastered:
- NumPy arrays
- Pandas data analysis
- Data visualization
- Descriptive statistics
- Web scraping
Next module: Advanced Data Science (sklearn, PyTorch, LLMs)
Keep going!