
Web Data Scraping

Using requests and BeautifulSoup to Retrieve Web Data


Why Learn Web Scraping?

  • Retrieve public data (news, social media)
  • Collect research data
  • Monitor website changes

Note: Respect each website's robots.txt and terms of use, and throttle your requests
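Python's standard library can parse robots.txt rules for you. A minimal sketch, using a made-up robots.txt parsed inline (a real scraper would fetch the site's actual file with `RobotFileParser.set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed inline for illustration
rules = """User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```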


requests Basics

python
import requests

# GET request
response = requests.get('https://example.com')

print(response.status_code)  # 200 means success
print(response.text)         # HTML content as a string
print(response.json())       # Only if the response body is JSON; raises an error otherwise
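Real requests usually carry query parameters and headers, which `requests` accepts as dicts. A sketch that builds (but does not send) such a request, so you can inspect the final URL; the endpoint and User-Agent string here are made up:

```python
import requests

# Prepare a GET request with query params and a custom header,
# without actually sending it over the network
req = requests.Request(
    'GET',
    'https://example.com/search',   # hypothetical endpoint
    params={'q': 'python', 'page': 2},
    headers={'User-Agent': 'my-scraper/0.1'},
)
prepared = req.prepare()
print(prepared.url)  # https://example.com/search?q=python&page=2
```

In everyday code you would simply pass the same dicts to `requests.get(url, params=..., headers=...)`.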

BeautifulSoup for HTML Parsing

python
from bs4 import BeautifulSoup
import requests

# Get webpage
url = 'https://example.com/data'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find elements (find() returns None when nothing matches)
title = soup.find('h1').text
items = soup.find_all('div', class_='item')

# Extract data
for item in items:
    name = item.find('span', class_='name').text
    value = item.find('span', class_='value').text
    print(f"{name}: {value}")
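BeautifulSoup also supports CSS selectors via `select()` and `select_one()`, which are often terser than chained `find()` calls. A sketch using a small inline HTML snippet as a stand-in for a fetched page (the tag structure here is an assumption):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page
html = """
<div class="item"><span class="name">CPU</span><span class="value">3.2 GHz</span></div>
<div class="item"><span class="name">RAM</span><span class="value">16 GB</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() takes CSS selectors, e.g. 'div.item' for <div class="item">
rows = []
for item in soup.select('div.item'):
    name = item.select_one('span.name').text
    value = item.select_one('span.value').text
    rows.append((name, value))

print(rows)  # [('CPU', '3.2 GHz'), ('RAM', '16 GB')]
```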

Practical Examples

Example: Scraping Table Data

python
import pandas as pd

# Pandas can directly read HTML tables (requires lxml or html5lib installed)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP'
tables = pd.read_html(url)

# Select the needed table
df = tables[0]
print(df.head())

Example: API Data Retrieval

python
import requests
import pandas as pd

# World Bank API example
url = 'https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD'
params = {
    'format': 'json',
    'per_page': 10
}

response = requests.get(url, params=params)
data = response.json()

# Convert to DataFrame
df = pd.DataFrame(data[1])
print(df[['date', 'value']].head())
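API records often contain nulls; in the World Bank payload, `value` is null for years with no data. A sketch of cleaning such records before analysis, using a tiny hand-written sample shaped like the API response (the numbers are illustrative, not real data):

```python
import pandas as pd

# Records shaped like the World Bank payload; 'value' may be None
records = [
    {'date': '2023', 'value': None},
    {'date': '2022', 'value': 12720.2},
]

df = pd.DataFrame(records)
df = df.dropna(subset=['value'])  # drop years with no data
print(df)
```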

Best Practices

python
import time
import requests

def safe_get(url, max_retries=3):
    """GET a URL with a timeout, retries, and exponential backoff."""
    for i in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            time.sleep(1)  # Polite delay
            return response
        except requests.RequestException as e:
            print(f"Attempt {i+1} failed: {e}")
            time.sleep(2 ** i)
    return None
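When making many requests to the same site, a `requests.Session` reuses the underlying connection and carries shared headers and cookies across calls. A sketch; the User-Agent string (identifying your scraper with a contact address) is a made-up example:

```python
import requests

# A Session reuses TCP connections and applies these headers to every request
session = requests.Session()
session.headers.update({'User-Agent': 'course-scraper/0.1 (contact@example.com)'})

# session.get(url) is then used exactly like requests.get(url)
print(session.headers['User-Agent'])
```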

Practice Exercises

python
# Try scraping:
# 1. Faculty list from a university
# 2. Headlines from a news website
# 3. Weather data
# (Use public websites that allow scraping)

Module 9 Complete!

You have now mastered:

  • NumPy arrays
  • Pandas data analysis
  • Data visualization
  • Descriptive statistics
  • Web scraping

Next module: Advanced Data Science (sklearn, PyTorch, LLMs)

Keep going!

Released under the MIT License. Content © Author.