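Sets

An unordered collection of unique elements: the go-to tool for deduplication and set operations

What Is a Set?

A set is an unordered Python data structure that stores unique elements, much like a set in mathematics.

Key characteristics:

Unordered: there are no indexes, so set[0] does not work
Unique: duplicate elements are removed automatically
Mutable: elements can be added and removed
Fast lookup: checking whether an element exists takes O(1) time on average

Main uses:

Deduplication
Membership checks (does an element exist?)
Set operations (intersection, union, difference)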

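Creating Sets

python

# Empty set (note: {} does not work; it creates an empty dict)
empty_set = set()

# Basic set
majors = {"Economics", "Sociology", "Political Science"}

# Create from a list (duplicates are removed automatically)
ages = [25, 30, 25, 35, 30, 40]
unique_ages = set(ages)
print(unique_ages)  # {25, 30, 35, 40} (order may vary)

# Create from a string (split into characters)
letters = set("hello")
print(letters)  # {'h', 'e', 'l', 'o'} (duplicates removed; order may vary)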

Basic Operations

1. Adding Elements

python

majors = {"Economics", "Sociology"}

# add(): add a single element
majors.add("Political Science")
print(majors)  # {'Economics', 'Sociology', 'Political Science'} (order may vary)

# Adding a duplicate has no effect
majors.add("Economics")
print(majors)  # still 3 elements

# update(): add multiple elements
majors.update(["Psychology", "Anthropology"])
print(majors)  # 5 elements
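One detail worth seeing in code is how add() and update() treat a string differently. The sketch below uses a made-up `tags` set purely for illustration: update() iterates over its argument, so a string is split into characters, while add() inserts the whole string as one element.

```python
tags = {"survey"}  # hypothetical example data

# update() iterates over its argument, so a string is split into characters.
tags.update("ab")
print(sorted(tags))  # ['a', 'b', 'survey']

# add() inserts the whole string as a single element.
tags.add("ab")
print(sorted(tags))  # ['a', 'ab', 'b', 'survey']

# update() also accepts several iterables at once.
tags.update(["panel"], ("wave1", "wave2"))
print(len(tags))  # 7
```

If the intent is to add one string, add() is usually the safer call.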

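2. Removing Elements

python

majors = {"Economics", "Sociology", "Political Science"}

# remove(): remove an element (raises KeyError if it is missing)
majors.remove("Sociology")
print(majors)

# discard(): remove an element (no error if it is missing)
majors.discard("Physics")  # no error, even though "Physics" is not in the set

# pop(): remove and return an arbitrary element (not a truly random one)
removed = majors.pop()
print(f"Removed: {removed}")

# clear(): empty the set
majors.clear()
print(majors)  # set()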
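The difference between remove() and discard() matters most when the element might be absent. The following is a minimal sketch of handling that case, along with pop() on an empty set, which also raises KeyError.

```python
majors = {"Economics"}

# remove() raises KeyError when the element is missing.
try:
    majors.remove("Physics")
except KeyError as err:
    print(f"Not in the set: {err}")

# discard() is the quiet alternative: no error if the element is absent.
majors.discard("Physics")

# pop() on an empty set raises KeyError too, so check for emptiness first.
majors.clear()
removed = majors.pop() if majors else None
print(removed)  # None
```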

3. Membership Checks

python

majors = {"Economics", "Sociology", "Political Science"}

# Check whether an element exists
print("Economics" in majors)  # True
print("Physics" in majors)    # False

# Size and iteration
print(len(majors))  # 3

for major in majors:  # iteration order is not guaranteed
    print(major)

Set Operations

1. Union

python

# Respondent IDs from two surveys
survey1 = {101, 102, 103, 104}
survey2 = {103, 104, 105, 106}

# Union: everyone who took part in either survey
all_respondents = survey1 | survey2
# or
all_respondents = survey1.union(survey2)

print(all_respondents)  # {101, 102, 103, 104, 105, 106}

2. Intersection

python

# survey1 and survey2 are defined in the union example above
# Intersection: people who took part in both surveys
both_surveys = survey1 & survey2
# or
both_surveys = survey1.intersection(survey2)

print(both_surveys)  # {103, 104}

3. Difference

python

# Difference: people who took part only in the first survey
only_first = survey1 - survey2
# or
only_first = survey1.difference(survey2)

print(only_first)  # {101, 102}

# Difference in the other direction
only_second = survey2 - survey1
print(only_second)  # {105, 106}

4. Symmetric Difference

python

# Symmetric difference: people who took part in exactly one survey (excludes those in both)
only_one_survey = survey1 ^ survey2
# or
only_one_survey = survey1.symmetric_difference(survey2)

print(only_one_survey)  # {101, 102, 105, 106}

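Summary of set operations:

Operation	Operator	Method	Meaning
Union	`A | B`	`A.union(B)`	All elements in A or B
Intersection	`A & B`	`A.intersection(B)`	Elements in both A and B
Difference	`A - B`	`A.difference(B)`	Elements in A but not in B
Symmetric difference	`A ^ B`	`A.symmetric_difference(B)`	Elements in A or B, but not in both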
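The operator and method forms in the table are not fully interchangeable. As a brief sketch (the `new_ids` list and `seen` set are made up for illustration): the methods accept any iterable, the operators require a set on both sides, and each operator has an in-place variant.

```python
survey1 = {101, 102, 103, 104}
new_ids = [103, 104, 105, 106]  # a plain list, not a set

# Methods accept any iterable.
combined = survey1.union(new_ids)
print(sorted(combined))  # [101, 102, 103, 104, 105, 106]

# Operators need a set (or frozenset) on both sides.
try:
    survey1 | new_ids
except TypeError as err:
    print(f"TypeError: {err}")

# In-place operators modify the left-hand set.
seen = {101, 102}
seen |= {102, 103}   # same effect as seen.update({102, 103})
seen -= {101}        # same effect as seen.difference_update({101})
print(sorted(seen))  # [102, 103]
```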

Practical Examples

Case 1: Deduplicating Data

python

# Respondent IDs (with duplicates)
respondent_ids = [1001, 1002, 1001, 1003, 1002, 1004, 1003]

# Deduplicate
unique_ids = set(respondent_ids)
print(f"Original count: {len(respondent_ids)}")
print(f"After deduplication: {len(unique_ids)}")
print(f"Duplicates: {len(respondent_ids) - len(unique_ids)}")

# Convert back to a sorted list (sorted() accepts a set directly)
unique_ids_list = sorted(unique_ids)
print(unique_ids_list)  # [1001, 1002, 1003, 1004]
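Converting to a set discards the original order of the data. When first-seen order matters, or when you want to know which IDs were repeated, a sketch along these lines may help. It relies on dicts preserving insertion order (Python 3.7+); the sample IDs are made up.

```python
respondent_ids = [1003, 1001, 1003, 1002, 1001]

# dict keys are unique and keep insertion order (Python 3.7+).
first_seen_order = list(dict.fromkeys(respondent_ids))
print(first_seen_order)  # [1003, 1001, 1002]

# Track which IDs appear more than once.
seen = set()
duplicates = set()
for rid in respondent_ids:
    if rid in seen:
        duplicates.add(rid)
    seen.add(rid)
print(sorted(duplicates))  # [1001, 1003]
```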

Case 2: Finding New Respondents

python

# Respondents in wave 1
wave1 = {1001, 1002, 1003, 1004, 1005}

# Respondents in wave 2
wave2 = {1003, 1004, 1005, 1006, 1007, 1008}

# Analysis
print("=== Survey Analysis ===")
print(f"Wave 1: {len(wave1)} respondents")
print(f"Wave 2: {len(wave2)} respondents")
print(f"In both waves: {len(wave1 & wave2)} respondents")
print(f"New respondents: {len(wave2 - wave1)} → {wave2 - wave1}")
print(f"Dropped out: {len(wave1 - wave2)} → {wave1 - wave2}")
print(f"Total coverage: {len(wave1 | wave2)} respondents")

Case 3: Questionnaire Quality Check

python

# Required fields
required_fields = {"id", "age", "gender", "income"}

# Fields provided by each respondent
respondent1 = {"id", "age", "gender", "income", "education"}
respondent2 = {"id", "age", "gender"}  # income is missing

# Check completeness
print("=== Respondent 1 ===")
missing1 = required_fields - respondent1
if missing1:
    print(f"Missing fields: {missing1}")
else:
    print("Data complete")

print("\n=== Respondent 2 ===")
missing2 = required_fields - respondent2
if missing2:
    print(f"Missing fields: {missing2}")
else:
    print("Data complete")
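In practice, responses usually arrive as dictionaries of field names and values, not as bare sets of field names. The sketch below assumes that shape, with hypothetical `records` data. It relies on difference() accepting any iterable and on iteration over a dict yielding its keys.

```python
required_fields = {"id", "age", "gender", "income"}

# Hypothetical records for illustration; real data would come from a file.
records = [
    {"id": 1001, "age": 34, "gender": "F", "income": 52000},
    {"id": 1002, "age": 41, "gender": "M"},
]

for record in records:
    # Iterating a dict yields its keys, so this compares field names only.
    missing = required_fields.difference(record)
    status = f"missing {sorted(missing)}" if missing else "complete"
    print(f"Respondent {record['id']}: {status}")
```

This checks only that a field is present, not that its value is valid.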

Case 4: Course Enrollment Overlap

python

# Students enrolled in each course
econ_students = {"Alice", "Bob", "Carol", "David", "Emma"}
stat_students = {"Bob", "Carol", "Frank", "Grace"}
python_students = {"Alice", "Carol", "Emma", "Frank", "Henry"}

# Analysis
print("=== Enrollment Analysis ===")

# Students taking all three courses
all_three = econ_students & stat_students & python_students
print(f"All three courses: {all_three}")  # {'Carol'}

# Students taking at least one course
at_least_one = econ_students | stat_students | python_students
print(f"At least one course: {len(at_least_one)} students")  # 8

# Students taking only economics
only_econ = econ_students - stat_students - python_students
print(f"Economics only: {only_econ}")  # {'David'}

# Students taking economics or statistics but not Python
econ_or_stat_not_python = (econ_students | stat_students) - python_students
print(f"Economics/statistics but not Python: {econ_or_stat_not_python}")
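Questions such as "who takes exactly two courses?" get awkward with operators alone. One possible approach, sketched here with collections.Counter from the standard library, is to count how many rosters each name appears in. It is an alternative technique, not part of the set API.

```python
from collections import Counter

econ_students = {"Alice", "Bob", "Carol", "David", "Emma"}
stat_students = {"Bob", "Carol", "Frank", "Grace"}
python_students = {"Alice", "Carol", "Emma", "Frank", "Henry"}

# Count how many course rosters each name appears in.
enrollment_counts = Counter()
for roster in (econ_students, stat_students, python_students):
    enrollment_counts.update(roster)

# A set comprehension keeps only the students enrolled in exactly two courses.
exactly_two = {name for name, n in enrollment_counts.items() if n == 2}
print(sorted(exactly_two))  # ['Alice', 'Bob', 'Emma', 'Frank']
```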

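Advanced Techniques

1. Frozen Sets (frozenset)

A frozenset is an immutable set, so it can serve as a dictionary key or as an element of another set.

python

# Regular sets cannot be nested
# s = {{1, 2}, {3, 4}}  # TypeError: unhashable type: 'set'

# frozensets can
s = {frozenset({1, 2}), frozenset({3, 4})}
print(s)  # {frozenset({1, 2}), frozenset({3, 4})} (order may vary)

# As dictionary keys
survey_participants = {
    frozenset({1001, 1002}): "Group 1",
    frozenset({1003, 1004}): "Group 2"
}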
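A frozenset key is handy when the order of a pair should not matter. The following sketch uses a hypothetical `record_pair` helper and `pair_counts` dict to count unordered pairs of respondent IDs, and also shows that a frozenset rejects mutation.

```python
# Hypothetical counter keyed by an unordered pair of respondent IDs.
pair_counts = {}

def record_pair(a, b):
    """Count one occurrence of the pair, ignoring the order of a and b."""
    key = frozenset({a, b})  # frozenset({1, 2}) == frozenset({2, 1})
    pair_counts[key] = pair_counts.get(key, 0) + 1

record_pair(1001, 1002)
record_pair(1002, 1001)  # same pair, reversed order
print(pair_counts[frozenset({1001, 1002})])  # 2

# frozensets have no add() or remove() methods.
group = frozenset({1001, 1002})
try:
    group.add(1003)
except AttributeError as err:
    print(f"AttributeError: {err}")
```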

2. Set Comprehensions

python

# Build the unique squares from a list
numbers = [1, 2, 2, 3, 3, 3, 4]
squares = {x**2 for x in numbers}
print(squares)  # {1, 4, 9, 16}

# Squares of even numbers only
even_squares = {x**2 for x in range(10) if x % 2 == 0}
print(even_squares)  # {0, 4, 16, 36, 64}
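Set comprehensions are also a compact way to clean values while deduplicating them. As an illustration with made-up survey answers, the sketch below normalizes whitespace and case so that variants of the same major collapse into one element.

```python
# Hypothetical free-text answers with inconsistent spacing and capitalization.
raw_majors = ["Economics", " economics", "SOCIOLOGY", "Sociology ", "Math"]

# Normalize each value, then let the set drop the duplicates.
normalized = {m.strip().lower() for m in raw_majors}
print(sorted(normalized))  # ['economics', 'math', 'sociology']
```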

3. Subset and Superset Tests

python

# Define the sets
social_science = {"Economics", "Sociology", "Political Science"}
all_majors = {"Economics", "Sociology", "Political Science", "Physics", "Math"}

# Subset test
print(social_science.issubset(all_majors))  # True
print(social_science <= all_majors)         # True (equivalent)

# Superset test
print(all_majors.issuperset(social_science))  # True
print(all_majors >= social_science)           # True (equivalent)

# Disjoint test
physics = {"Physics", "Chemistry"}
print(social_science.isdisjoint(physics))  # True (no elements in common)
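The `<=` and `>=` operators also hold for equal sets. For a proper (strict) subset or superset, Python provides `<` and `>`. A short sketch with illustrative sets `a`, `b`, and `c`:

```python
a = {"Economics", "Sociology"}
b = {"Economics", "Sociology"}
c = {"Economics", "Sociology", "Physics"}

print(a <= b)  # True: every element of a is in b
print(a < b)   # False: equal sets are not proper subsets
print(a < c)   # True: c has an extra element
print(c > a)   # True: c is a proper superset of a

# issubset() accepts any iterable; the operators require sets.
print(a.issubset(["Economics", "Sociology", "Math"]))  # True
```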

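When to Use a Set?

Scenario	List	Set
Preserve order	Yes	No
Allow duplicates	Yes	No
Fast lookup	No	Yes
Deduplication	No	Yes
Set operations	No	Yes
Access by index	Yes	No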

Example:

python

# Lookup in a list (slow)
students = ["Alice", "Bob", "Carol"]  # imagine 1,000 students here
if "Alice" in students:  # scans the list, O(n)
    print("Found")

# Lookup in a set (fast)
students = {"Alice", "Bob", "Carol"}  # imagine 1,000 students here
if "Alice" in students:  # hash lookup, O(1) on average
    print("Found")
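To see the difference on your own machine, a rough measurement with the standard timeit module could look like the sketch below. The collection size and repeat count are arbitrary choices, and the timings vary by hardware, so no expected output is shown.

```python
import timeit

n = 100_000  # arbitrary size chosen for illustration
ids_list = list(range(n))
ids_set = set(ids_list)
target = n - 1  # worst case for the list: the last element

# Time 100 membership checks against each container.
list_time = timeit.timeit(lambda: target in ids_list, number=100)
set_time = timeit.timeit(lambda: target in ids_set, number=100)

print(f"list: {list_time:.5f} s")
print(f"set:  {set_time:.5f} s")
```

On typical hardware the set lookup should be far faster, and the gap should widen as `n` grows.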

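Common Mistakes

Mistake 1: Trying to Use an Index

python

majors = {"Economics", "Sociology"}
print(majors[0])  # TypeError: 'set' object is not subscriptable

Mistake 2: Confusing an Empty Set with an Empty Dict

python

empty = {}         # Wrong: this is an empty dict
empty_set = set()  # Right: this is an empty set

print(type(empty))      # <class 'dict'>
print(type(empty_set))  # <class 'set'>

Mistake 3: Adding Mutable Objects

python

# Wrong: lists cannot be added to a set
# s = {[1, 2], [3, 4]}  # TypeError: unhashable type: 'list'

# Right: tuples can
s = {(1, 2), (3, 4)}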
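When the data already arrives as a list of lists, one workable approach is to convert each inner list to a tuple first. This sketch uses made-up `rows`. It also shows that a tuple is hashable only if everything inside it is hashable.

```python
rows = [[1, 2], [3, 4], [1, 2]]  # hypothetical data with a repeated row

# Convert each inner list to a tuple so it can be stored in a set.
unique_rows = {tuple(row) for row in rows}
print(sorted(unique_rows))  # [(1, 2), (3, 4)]

# A tuple that contains a list is still unhashable.
try:
    {(1, [2, 3])}
except TypeError as err:
    print(f"TypeError: {err}")
```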

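Exercises

Exercise 1: Deduplicate and Sort

python

# Respondent ages (with duplicates)
ages = [25, 30, 25, 35, 30, 40, 25, 28, 30, 35]

# Tasks:
# 1. Remove the duplicates
# 2. Sort in ascending order
# 3. Print the unique ages and how many there are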

Exercise 2: Questionnaire Completeness Check

python

# Required fields
required_fields = {"id", "age", "gender", "income", "education"}

# Batch check
responses = [
    {"id", "age", "gender", "income", "education"},  # complete
    {"id", "age", "gender", "income"},                # missing education
    {"id", "age", "gender"},                          # missing income, education
]

# Task: check whether each response is complete and print its missing fields

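Exercise 3: Mutual Friends

python

# Alice's friends
alice_friends = {"Bob", "Carol", "David", "Emma"}

# Bob's friends
bob_friends = {"Alice", "Carol", "Frank", "Grace"}

# Tasks:
# 1. Find the friends Alice and Bob have in common
# 2. Find the people who are friends with Alice only
# 3. Count the total number of distinct friends across both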

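Summary

You have now learned Python's four core data structures:

Data structure	Ordered	Duplicates	Use
List	Yes	Yes	General-purpose sequence
Tuple	Yes	Yes	Immutable data
Dict	Yes*	Keys unique	Key-value pairs
Set	No	No	Deduplication, set operations

*Dicts preserve insertion order in Python 3.7+

Next: we will cover functions and modules, which make code more modular and reusable.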
