Sets
Unordered collections of unique elements: the tool for deduplication and set operations
What is a Set?
A set is a Python data structure for storing unique elements in an unordered manner, similar to the mathematical concept of sets.
Key Characteristics:
- Unordered: No indexing, so set[0] cannot be used
- Unique: Automatically removes duplicate elements
- Mutable: Can add/remove elements
- Fast lookup: Checking whether an element exists takes O(1) time on average, regardless of the set's size
Main Uses:
- Deduplication
- Membership checking (whether an element exists)
- Set operations (intersection, union, difference)
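The characteristics and uses listed above can be seen in a few lines. This is a minimal sketch; the data and the names `responses` and `seen` are illustrative and do not come from the original text.

```python
# Illustrative data: survey responses with repeated city names (hypothetical).
responses = ["Paris", "Lima", "Paris", "Oslo", "Lima"]

# Unique: building a set drops the repeated values.
seen = set(responses)
print(len(responses), len(seen))  # 5 3

# Fast lookup: membership is a hash lookup, not a scan.
print("Oslo" in seen)  # True
print("Rome" in seen)  # False

# Mutable: elements can be added and removed after creation.
seen.add("Rome")
seen.discard("Paris")
print(sorted(seen))  # ['Lima', 'Oslo', 'Rome']

# Unordered: there is no position, so indexing raises TypeError.
try:
    seen[0]
except TypeError as exc:
    print(f"TypeError: {exc}")
```

`sorted()` is used for printing because a set's own display order is not guaranteed.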
Creating Sets
python
# Empty set (note: cannot use {}, that's an empty dictionary)
empty_set = set()
# Basic set
majors = {"Economics", "Sociology", "Political Science"}
# From list (auto-deduplication)
ages = [25, 30, 25, 35, 30, 40]
unique_ages = set(ages)
print(unique_ages) # {25, 30, 35, 40} (order may vary)
# From string (split into characters)
letters = set("hello")
print(letters) # {'h', 'e', 'l', 'o'} (auto-deduplication)
✏️ Basic Operations
1. Adding Elements
python
majors = {"Economics", "Sociology"}
# add(): Add single element
majors.add("Political Science")
print(majors) # {'Economics', 'Sociology', 'Political Science'}
# Adding duplicate element (no effect)
majors.add("Economics")
print(majors) # Still 3 elements (auto-deduplication)
# update(): Add multiple elements
majors.update(["Psychology", "Anthropology"])
print(majors) # 5 elements
2. Removing Elements
python
majors = {"Economics", "Sociology", "Political Science"}
# remove(): Remove element (raises KeyError if it doesn't exist)
majors.remove("Sociology")
print(majors)
# discard(): Remove element (no error if it doesn't exist)
majors.discard("Physics") # "Physics" is not in the set, so nothing happens
# pop(): Remove and return an arbitrary element (which one is not defined, and it is not random)
removed = majors.pop()
print(f"Removed: {removed}")
# clear(): Empty the set
majors.clear()
print(majors) # set()
3. Membership Checking
python
majors = {"Economics", "Sociology", "Political Science"}
# Check if element exists
print("Economics" in majors) # True
print("Physics" in majors) # False
# Count and iteration
print(len(majors)) # 3
for major in majors:
    print(major)
🔄 Set Operations
1. Union
python
# Respondent IDs from two surveys
survey1 = {101, 102, 103, 104}
survey2 = {103, 104, 105, 106}
# Union: all participants
all_respondents = survey1 | survey2
# or
all_respondents = survey1.union(survey2)
print(all_respondents) # {101, 102, 103, 104, 105, 106}
2. Intersection
python
# Intersection: participated in both surveys
both_surveys = survey1 & survey2
# or
both_surveys = survey1.intersection(survey2)
print(both_surveys) # {103, 104}
3. Difference
python
# Difference: only participated in first survey
only_first = survey1 - survey2
# or
only_first = survey1.difference(survey2)
print(only_first) # {101, 102}
# Reverse difference
only_second = survey2 - survey1
print(only_second) # {105, 106}
4. Symmetric Difference
python
# Symmetric difference: participated in only one survey (not both)
only_one_survey = survey1 ^ survey2
# or
only_one_survey = survey1.symmetric_difference(survey2)
print(only_one_survey) # {101, 102, 105, 106}
Set Operations Summary:
| Operation | Symbol | Method | Meaning |
|---|---|---|---|
| Union | `A \| B` | A.union(B) | Elements in A or B (or both) |
| Intersection | A & B | A.intersection(B) | Elements in both A and B |
| Difference | A - B | A.difference(B) | Elements in A but not B |
| Symmetric Difference | A ^ B | A.symmetric_difference(B) | Elements in A or B, but not both |
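The operator and method columns in the table can be checked against each other. The sketch below uses two small illustrative sets, `A` and `B`, which are not from the original text. It also shows one practical difference: the methods accept any iterable, while the operators require both operands to be sets.

```python
A = {1, 2, 3}
B = {3, 4, 5}

# Each operator gives the same result as its method.
assert A | B == A.union(B) == {1, 2, 3, 4, 5}
assert A & B == A.intersection(B) == {3}
assert A - B == A.difference(B) == {1, 2}
assert A ^ B == A.symmetric_difference(B) == {1, 2, 4, 5}

# Methods accept any iterable, such as a list.
print(sorted(A.union([3, 4, 5])))  # [1, 2, 3, 4, 5]

# Operators require a set on both sides.
try:
    A | [3, 4, 5]
except TypeError as exc:
    print(f"TypeError: {exc}")

# None of these operations modify A or B; each returns a new set.
print(sorted(A), sorted(B))  # [1, 2, 3] [3, 4, 5]
```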
🔬 Real-World Cases
Case 1: Data Deduplication
python
# Respondent IDs (with duplicates)
respondent_ids = [1001, 1002, 1001, 1003, 1002, 1004, 1003]
# Deduplicate
unique_ids = set(respondent_ids)
print(f"Original count: {len(respondent_ids)}")
print(f"After deduplication: {len(unique_ids)}")
print(f"Duplicates removed: {len(respondent_ids) - len(unique_ids)}")
# Convert back to list
unique_ids_list = sorted(unique_ids) # sorted() already returns a list
print(unique_ids_list) # [1001, 1002, 1003, 1004]
Case 2: Finding New Respondents
python
# First wave respondents
wave1 = {1001, 1002, 1003, 1004, 1005}
# Second wave respondents
wave2 = {1003, 1004, 1005, 1006, 1007, 1008}
# Analysis
print("=== Survey Analysis ===")
print(f"Wave 1: {len(wave1)} people")
print(f"Wave 2: {len(wave2)} people")
print(f"Both waves: {len(wave1 & wave2)} people")
print(f"New respondents: {len(wave2 - wave1)} people → {wave2 - wave1}")
print(f"Lost respondents: {len(wave1 - wave2)} people → {wave1 - wave2}")
print(f"Total coverage: {len(wave1 | wave2)} people")
Case 3: Survey Quality Check
python
# Required fields
required_fields = {"id", "age", "gender", "income"}
# Fields present in each respondent's record
respondent1 = {"id", "age", "gender", "income", "education"}
respondent2 = {"id", "age", "gender"} # Missing income
# Check completeness
print("=== Respondent 1 ===")
missing1 = required_fields - respondent1
if missing1:
    print(f"❌ Missing fields: {missing1}")
else:
    print("✅ Data complete")
print("\n=== Respondent 2 ===")
missing2 = required_fields - respondent2
if missing2:
    print(f"❌ Missing fields: {missing2}")
else:
    print("✅ Data complete")
Case 4: Course Enrollment Analysis
python
# Students enrolled in different courses
econ_students = {"Alice", "Bob", "Carol", "David", "Emma"}
stat_students = {"Bob", "Carol", "Frank", "Grace"}
python_students = {"Alice", "Carol", "Emma", "Frank", "Henry"}
# Analysis
print("=== Course Enrollment Analysis ===")
# Students taking all three courses
all_three = econ_students & stat_students & python_students
print(f"All three courses: {all_three}")
# Students taking at least one course
at_least_one = econ_students | stat_students | python_students
print(f"At least one course: {len(at_least_one)} students")
# Students taking only economics
only_econ = econ_students - stat_students - python_students
print(f"Only economics: {only_econ}")
# Students taking economics or statistics but not Python
econ_or_stat_not_python = (econ_students | stat_students) - python_students
print(f"Econ/Stat but not Python: {econ_or_stat_not_python}")
🚀 Advanced Techniques
1. Frozen Sets (frozenset)
A frozenset is an immutable set. Because it is hashable, it can be used as a dictionary key or as an element of another set.
python
# Regular sets cannot be nested
# s = {{1, 2}, {3, 4}} # ❌ TypeError
# frozenset can
s = {frozenset({1, 2}), frozenset({3, 4})}
print(s) # {frozenset({1, 2}), frozenset({3, 4})}
# As dictionary keys
survey_participants = {
    frozenset({1001, 1002}): "Group 1",
    frozenset({1003, 1004}): "Group 2"
}
2. Set Comprehensions
python
# Generate unique squares from list
numbers = [1, 2, 2, 3, 3, 3, 4]
squares = {x**2 for x in numbers}
print(squares) # {1, 4, 9, 16}
# Filter even number squares
even_squares = {x**2 for x in range(10) if x % 2 == 0}
print(even_squares) # {0, 4, 16, 36, 64}
3. Subset and Superset Testing
python
# Define sets
social_science = {"Economics", "Sociology", "Political Science"}
all_majors = {"Economics", "Sociology", "Political Science", "Physics", "Math"}
# Test subset
print(social_science.issubset(all_majors)) # True
print(social_science <= all_majors) # True (equivalent)
# Test superset
print(all_majors.issuperset(social_science)) # True
print(all_majors >= social_science) # True (equivalent)
# Test disjoint
physics = {"Physics", "Chemistry"}
print(social_science.isdisjoint(physics)) # True (no intersection)
🤔 When to Use Sets?
| Scenario | Use List | Use Set |
|---|---|---|
| Preserve order | ✅ | ❌ |
| Allow duplicates | ✅ | ❌ |
| Fast lookup | ❌ | ✅ |
| Deduplication | ❌ | ✅ |
| Set operations | ❌ | ✅ |
| Access by index | ✅ | ❌ |
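The "Fast lookup" row can be checked with a rough timing sketch using the standard-library `timeit` module. The data size, the names `ids_list` and `ids_set`, and the repeat count are assumptions chosen for illustration. Absolute timings will vary by machine.

```python
import timeit

# Hypothetical data: 100,000 integer IDs stored both ways.
ids_list = list(range(100_000))
ids_set = set(ids_list)

# Look up a value at the end of the list, the worst case for a linear scan.
target = 99_999

# Time 200 membership checks against each structure.
list_time = timeit.timeit(lambda: target in ids_list, number=200)
set_time = timeit.timeit(lambda: target in ids_set, number=200)

print(f"list: {list_time:.5f} s for 200 lookups")
print(f"set:  {set_time:.5f} s for 200 lookups")
```

The list has to compare elements one by one until it finds a match, while the set goes straight to the element through its hash, so the gap grows with the size of the collection.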
Example:
python
# ❌ Using list for lookup (slow)
students = ["Alice", "Bob", "Carol"] # imagine 1000 students here
if "Alice" in students: # Scans element by element, O(n)
    print("Found")
# ✅ Using set for lookup (fast)
students = {"Alice", "Bob", "Carol"} # imagine 1000 students here
if "Alice" in students: # Hash lookup, O(1) on average
    print("Found")
⚠️ Common Errors
Error 1: Trying to Use Indexing
python
majors = {"Economics", "Sociology"}
print(majors[0]) # ❌ TypeError: 'set' object is not subscriptable
Error 2: Confusing Empty Set and Empty Dictionary
python
empty = {} # ❌ This is an empty dictionary
empty_set = set() # ✅ This is an empty set
print(type(empty)) # <class 'dict'>
print(type(empty_set)) # <class 'set'>
Error 3: Adding Mutable Objects
python
# ❌ Lists cannot be added to sets
# s = {[1, 2], [3, 4]} # TypeError
# ✅ Tuples can
s = {(1, 2), (3, 4)}
💪 Practice Problems
Exercise 1: Deduplicate and Sort
python
# Respondent ages (with duplicates)
ages = [25, 30, 25, 35, 30, 40, 25, 28, 30, 35]
# Tasks:
# 1. Deduplicate
# 2. Sort from low to high
# 3. Output unique ages and count
Exercise 2: Survey Completeness Check
python
# Required fields
required_fields = {"id", "age", "gender", "income", "education"}
# Batch check
responses = [
    {"id", "age", "gender", "income", "education"}, # Complete
    {"id", "age", "gender", "income"}, # Missing education
    {"id", "age", "gender"}, # Missing income, education
]
# Task: Check each response for completeness, output missing fields
Exercise 3: Common Friends
python
# Alice's friends
alice_friends = {"Bob", "Carol", "David", "Emma"}
# Bob's friends
bob_friends = {"Alice", "Carol", "Frank", "Grace"}
# Tasks:
# 1. Find common friends of Alice and Bob
# 2. Find people who are only Alice's friends
# 3. Find total number of friends (no duplicates)
📝 Summary
You've now covered Python's four built-in collection types:
| Data Structure | Ordered | Mutable | Duplicates | Use |
|---|---|---|---|---|
| List | ✓ | ✓ | ✓ | General sequences |
| Tuple | ✓ | ✗ | ✓ | Immutable data |
| Dict | * | ✓ | Keys unique | Key-value pairs |
| Set | ✗ | ✓ | ✗ | Deduplication, set operations |
*Python 3.7+ dictionaries maintain insertion order
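The rows of the summary table can be seen side by side by storing the same values in each structure. This is a small sketch with illustrative data; the variable names are not from the original text, and `dict.fromkeys` is used here only as a quick way to build a dictionary whose keys are the input values.

```python
# The same values stored in each of the four structures (illustrative data).
values = ["b", "a", "b", "c"]

as_list = list(values)           # ordered, mutable, keeps duplicates
as_tuple = tuple(values)         # ordered, immutable, keeps duplicates
as_dict = dict.fromkeys(values)  # keys unique, insertion order kept (3.7+)
as_set = set(values)             # unique, no guaranteed order

print(as_list)         # ['b', 'a', 'b', 'c']
print(as_tuple)        # ('b', 'a', 'b', 'c')
print(list(as_dict))   # ['b', 'a', 'c']
print(sorted(as_set))  # ['a', 'b', 'c']

# Tuples reject item assignment because they are immutable.
try:
    as_tuple[0] = "z"
except TypeError as exc:
    print(f"TypeError: {exc}")
```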
Next Step: We'll learn about Functions and Modules, making code more modular and reusable.
Ready? Keep going!