Web Scraping

Fetching web data with requests and BeautifulSoup

Why learn web scraping?

Obtain public data (news, social media)
Collect research data
Monitor websites for changes

Note: follow each site's robots.txt and terms of service.
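
As a minimal sketch of what checking robots.txt can look like in code, the standard library's urllib.robotparser can answer whether a given URL may be fetched. The robots.txt content and the helper name is_allowed below are made up for illustration; a real site's rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/data"))            # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/report"))  # False
```

For a live site, the same parser can download the file itself: call set_url() with the site's robots.txt address, then read(), before can_fetch().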

requests basics

python

import requests

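# GET request
response = requests.get('https://example.com', timeout=10)

print(response.status_code)  # 200 means success
print(response.text)         # HTML content

# response.json() only works when the body is JSON (e.g. an API endpoint);
# calling it on an HTML page raises a decoding error.
if 'json' in response.headers.get('Content-Type', ''):
    print(response.json())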

Parsing HTML with BeautifulSoup

python

from bs4 import BeautifulSoup
import requests

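# Fetch the page
url = 'https://example.com/data'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')

# Find elements; find() returns None when nothing matches
h1 = soup.find('h1')
title = h1.text if h1 is not None else None
items = soup.find_all('div', class_='item')

# Extract data, skipping items that lack either span
for item in items:
    name = item.find('span', class_='name')
    value = item.find('span', class_='value')
    if name is not None and value is not None:
        print(f"{name.text}: {value.text}")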
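
The same extraction can be tried without any network access by parsing an HTML string directly. The sketch below uses an invented snippet that mirrors the div.item / span.name / span.value layout above, and uses CSS selectors (select, select_one) as an alternative to find and find_all. The helper name extract_items is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the second item has no value span.
SAMPLE_HTML = """
<h1>Prices</h1>
<div class="item"><span class="name">apple</span><span class="value">3</span></div>
<div class="item"><span class="name">pear</span></div>
"""


def extract_items(html):
    """Return (name, value) pairs; value is None when the span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.item"):
        name = item.select_one("span.name")
        value = item.select_one("span.value")
        if name is None:
            continue  # an item without a name is not usable
        value_text = value.get_text(strip=True) if value is not None else None
        rows.append((name.get_text(strip=True), value_text))
    return rows


print(extract_items(SAMPLE_HTML))  # [('apple', '3'), ('pear', None)]
```

Testing the parsing logic against a saved or hand-written snippet like this keeps it separate from the download step, which makes failures easier to locate.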

Hands-on examples

Example: scraping table data

python

import pandas as pd

# pandas can read HTML tables directly
# (read_html needs an HTML parser such as lxml installed)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP'
tables = pd.read_html(url)

# Pick the table you need
df = tables[0]
print(df.head())

Example: fetching data from an API

python

import requests
import pandas as pd

# World Bank API example
url = 'https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.PCAP.CD'
params = {
    'format': 'json',
    'per_page': 10
}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()

# The response is [metadata, records]; convert the records to a DataFrame
df = pd.DataFrame(data[1])
print(df[['date', 'value']].head())
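
Indexing data[1] relies on the response being a two-element list of metadata followed by records. The sketch below shows one way to handle that shape defensively in plain Python. The payload is a placeholder with invented numbers (not real GDP figures), the helper name to_series is hypothetical, and it assumes annual data whose date field is a year string.

```python
# Placeholder payload shaped like [metadata, records]; values are invented.
SAMPLE_PAYLOAD = [
    {"page": 1, "pages": 1, "per_page": 10, "total": 3},
    [
        {"date": "2002", "value": 300.0},
        {"date": "2001", "value": None},  # missing observation
        {"date": "2000", "value": 100.0},
    ],
]


def to_series(payload):
    """Return (year, value) pairs sorted by year, skipping missing values."""
    # Assumption: anything other than a [metadata, records] list is treated
    # as "no data" (for example an error message in a shorter list).
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    pairs = [
        (int(record["date"]), record["value"])
        for record in payload[1]
        if record["value"] is not None
    ]
    return sorted(pairs)


print(to_series(SAMPLE_PAYLOAD))  # [(2000, 100.0), (2002, 300.0)]
```

The metadata element also carries the page count, so a fuller version could loop over pages when "pages" is greater than 1.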

Best practices

python

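import time

import requests

def safe_get(url, max_retries=3):
    """Fetch url with retries; return the response, or None if every attempt fails."""
    for i in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            time.sleep(1)  # polite delay before the caller's next request
            return response
        except requests.RequestException as e:
            print(f"Attempt {i+1} failed: {e}")
            time.sleep(2 ** i)  # exponential backoff: 1s, 2s, 4s
    return None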
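
The retry-with-backoff pattern in safe_get can be exercised without a network by passing in the fetch and sleep functions. This is a sketch under that assumption, not a replacement for safe_get: the names retry and flaky are hypothetical, and unlike safe_get it skips the wait after the final failed attempt.

```python
import time


def retry(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds; return None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:  # in real code, narrow this to requests.RequestException
            if attempt == max_retries - 1:
                break  # no point waiting after the last attempt
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return None


# Simulated endpoint that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


delays = []  # record the requested waits instead of actually sleeping
result = retry(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In practice this could wrap a real request, for example retry(lambda: requests.get(url, timeout=10)).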

Exercises

python

# Try scraping:
# 1. A university's faculty list
# 2. Headlines from a news site
# 3. Weather data
# (Use public sites that permit scraping)

Module 9 complete!

You have now learned:

NumPy arrays
Data analysis with Pandas
Data visualization
Descriptive statistics
Web scraping

Next module: Advanced Data Science (sklearn, PyTorch, LLMs)

Keep going!
