
Sets

Unordered and unique element collections — the tool for deduplication and set operations


What is a Set?

A set is a Python data structure for storing unique elements in an unordered manner, similar to the mathematical concept of sets.

Key Characteristics:

  • Unordered: Elements have no position, so indexing such as my_set[0] is not supported
  • Unique: Duplicate elements are dropped automatically
  • Mutable: Elements can be added and removed
  • Fast lookup: Checking whether an element exists takes O(1) time on average

Main Uses:

  • Deduplication
  • Membership checking (whether an element exists)
  • Set operations (intersection, union, difference)

Creating Sets

python
# Empty set (note: cannot use {}, that's an empty dictionary)
empty_set = set()

# Basic set
majors = {"Economics", "Sociology", "Political Science"}

# From list (auto-deduplication)
ages = [25, 30, 25, 35, 30, 40]
unique_ages = set(ages)
print(unique_ages)  # {25, 30, 35, 40} (order may vary)

# From string (split into characters)
letters = set("hello")
print(letters)  # {'h', 'e', 'l', 'o'} (order may vary; the duplicate 'l' is dropped)

✏️ Basic Operations

1. Adding Elements

python
majors = {"Economics", "Sociology"}

# add(): Add single element
majors.add("Political Science")
print(majors)  # {'Economics', 'Sociology', 'Political Science'} (order may vary)

# Adding duplicate element (no effect)
majors.add("Economics")
print(majors)  # Still 3 elements (auto-deduplication)

# update(): Add multiple elements
majors.update(["Psychology", "Anthropology"])
print(majors)  # 5 elements

2. Removing Elements

python
majors = {"Economics", "Sociology", "Political Science"}

# remove(): Remove element (raises KeyError if it doesn't exist)
majors.remove("Sociology")
print(majors)

# discard(): Remove element (no error if doesn't exist)
majors.discard("Physics")  # No error even if doesn't exist

# pop(): Remove and return an arbitrary element (raises KeyError if the set is empty)
removed = majors.pop()
print(f"Removed: {removed}")

# clear(): Empty the set
majors.clear()
print(majors)  # set()
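
The difference between remove() and discard() matters most when you cannot be sure an element is present. The following is a minimal sketch of handling a missing element and an empty set; the set contents and the names outcome and pop_failed are illustrative only.

```python
majors = {"Economics", "Sociology"}

# remove() raises KeyError for a missing element, so guard it when unsure
try:
    majors.remove("Physics")
    outcome = "removed"
except KeyError:
    outcome = "not found"
print(outcome)  # not found

# discard() is the quiet alternative: no exception either way
majors.discard("Physics")
print(len(majors))  # 2

# pop() on an empty set also raises KeyError
empty = set()
try:
    empty.pop()
    pop_failed = False
except KeyError:
    pop_failed = True
print(pop_failed)  # True
```

As a rule of thumb, remove() suits cases where a missing element signals a bug, and discard() suits cases where absence is expected.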

3. Membership Checking

python
majors = {"Economics", "Sociology", "Political Science"}

# Check if element exists
print("Economics" in majors)  # True
print("Physics" in majors)    # False

# Count and iteration
print(len(majors))  # 3

for major in majors:
    print(major)

🔄 Set Operations

1. Union

python
# Respondent IDs from two surveys
survey1 = {101, 102, 103, 104}
survey2 = {103, 104, 105, 106}

# Union: all participants
all_respondents = survey1 | survey2
# or
all_respondents = survey1.union(survey2)

print(all_respondents)  # {101, 102, 103, 104, 105, 106}

2. Intersection

python
# Intersection: participated in both surveys
both_surveys = survey1 & survey2
# or
both_surveys = survey1.intersection(survey2)

print(both_surveys)  # {103, 104}

3. Difference

python
# Difference: only participated in first survey
only_first = survey1 - survey2
# or
only_first = survey1.difference(survey2)

print(only_first)  # {101, 102}

# Reverse difference
only_second = survey2 - survey1
print(only_second)  # {105, 106}

4. Symmetric Difference

python
# Symmetric difference: participated in only one survey (not both)
only_one_survey = survey1 ^ survey2
# or
only_one_survey = survey1.symmetric_difference(survey2)

print(only_one_survey)  # {101, 102, 105, 106}

Set Operations Summary:

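| Operation | Symbol | Method | Meaning |
| --- | --- | --- | --- |
| Union | `A \| B` | `A.union(B)` | Elements in A or B (or both) |
| Intersection | `A & B` | `A.intersection(B)` | Elements in both A and B |
| Difference | `A - B` | `A.difference(B)` | Elements in A but not B |
| Symmetric Difference | `A ^ B` | `A.symmetric_difference(B)` | Elements in A or B, but not both |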
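
The symbol and method forms in the summary are not fully interchangeable. The sketch below uses made-up IDs and a hypothetical extra_ids list to show that the methods accept any iterable, while the operators need a set on both sides. It also shows the in-place form |=.

```python
survey1 = {101, 102, 103}
extra_ids = [103, 104]  # a list, not a set (illustrative data)

# Methods accept any iterable
combined = survey1.union(extra_ids)
print(sorted(combined))  # [101, 102, 103, 104]

# Operators require both operands to be sets
try:
    survey1 | extra_ids
    operator_worked = True
except TypeError:
    operator_worked = False
print(operator_worked)  # False

# In-place variant: |= updates the set on the left
survey1 |= set(extra_ids)
print(sorted(survey1))  # [101, 102, 103, 104]
```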

🔬 Real-World Cases

Case 1: Data Deduplication

python
# Respondent IDs (with duplicates)
respondent_ids = [1001, 1002, 1001, 1003, 1002, 1004, 1003]

# Deduplicate
unique_ids = set(respondent_ids)
print(f"Original count: {len(respondent_ids)}")
print(f"After deduplication: {len(unique_ids)}")
print(f"Duplicates removed: {len(respondent_ids) - len(unique_ids)}")

# Convert back to a sorted list (sorted() accepts any iterable and returns a list)
unique_ids_list = sorted(unique_ids)
print(unique_ids_list)  # [1001, 1002, 1003, 1004]
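
Converting to a set discards the original order of the IDs. If the order of first appearance matters, one common approach is to track seen values in a set while building a list. This is a sketch with illustrative data; the names seen and unique_in_order are not from the lesson.

```python
respondent_ids = [1003, 1001, 1003, 1002, 1001]

seen = set()          # fast membership checks
unique_in_order = []  # keeps first-appearance order
for rid in respondent_ids:
    if rid not in seen:
        seen.add(rid)
        unique_in_order.append(rid)
print(unique_in_order)  # [1003, 1001, 1002]

# Same result in one line: dict keys are unique and keep insertion order (Python 3.7+)
print(list(dict.fromkeys(respondent_ids)))  # [1003, 1001, 1002]
```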

Case 2: Finding New Respondents

python
# First wave respondents
wave1 = {1001, 1002, 1003, 1004, 1005}

# Second wave respondents
wave2 = {1003, 1004, 1005, 1006, 1007, 1008}

# Analysis
print("=== Survey Analysis ===")
print(f"Wave 1: {len(wave1)} people")
print(f"Wave 2: {len(wave2)} people")
print(f"Both waves: {len(wave1 & wave2)} people")
print(f"New respondents: {len(wave2 - wave1)} people → {wave2 - wave1}")
print(f"Lost respondents: {len(wave1 - wave2)} people → {wave1 - wave2}")
print(f"Total coverage: {len(wave1 | wave2)} people")

Case 3: Survey Quality Check

python
# Required fields
required_fields = {"id", "age", "gender", "income"}

# Respondent 1 data
respondent1 = {"id", "age", "gender", "income", "education"}
respondent2 = {"id", "age", "gender"}  # Missing income

# Check completeness
print("=== Respondent 1 ===")
missing1 = required_fields - respondent1
if missing1:
    print(f"❌ Missing fields: {missing1}")
else:
    print("✅ Data complete")

print("\n=== Respondent 2 ===")
missing2 = required_fields - respondent2
if missing2:
    print(f"❌ Missing fields: {missing2}")
else:
    print("✅ Data complete")

Case 4: Course Enrollment Analysis

python
# Students enrolled in different courses
econ_students = {"Alice", "Bob", "Carol", "David", "Emma"}
stat_students = {"Bob", "Carol", "Frank", "Grace"}
python_students = {"Alice", "Carol", "Emma", "Frank", "Henry"}

# Analysis
print("=== Course Enrollment Analysis ===")

# Students taking all three courses
all_three = econ_students & stat_students & python_students
print(f"All three courses: {all_three}")

# Students taking at least one course
at_least_one = econ_students | stat_students | python_students
print(f"At least one course: {len(at_least_one)} students")

# Students taking only economics
only_econ = econ_students - stat_students - python_students
print(f"Only economics: {only_econ}")

# Students taking economics or statistics but not Python
econ_or_stat_not_python = (econ_students | stat_students) - python_students
print(f"Econ/Stat but not Python: {econ_or_stat_not_python}")
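
The same building blocks can answer slightly harder questions, such as who takes exactly two courses. The sketch below reuses the enrollment sets from this case; the Counter cross-check is an extra added here to confirm the result.

```python
from collections import Counter

econ_students = {"Alice", "Bob", "Carol", "David", "Emma"}
stat_students = {"Bob", "Carol", "Frank", "Grace"}
python_students = {"Alice", "Carol", "Emma", "Frank", "Henry"}

# Students in at least two courses: union of the pairwise intersections
in_two_or_more = (
    (econ_students & stat_students)
    | (econ_students & python_students)
    | (stat_students & python_students)
)
all_three = econ_students & stat_students & python_students

# Exactly two = at least two, minus those in all three
exactly_two = in_two_or_more - all_three
print(sorted(exactly_two))  # ['Alice', 'Bob', 'Emma', 'Frank']

# Cross-check by counting how many courses each name appears in
counts = Counter(
    name
    for course in (econ_students, stat_students, python_students)
    for name in course
)
exactly_two_check = {name for name, n in counts.items() if n == 2}
print(exactly_two == exactly_two_check)  # True
```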

🚀 Advanced Techniques

1. Frozen Sets (frozenset)

A frozenset is an immutable set. Because it cannot change, it is hashable and can be used as a dictionary key or as an element of another set.

python
# Regular sets cannot be nested
# s = {{1, 2}, {3, 4}}  # ❌ TypeError

# frozenset can
s = {frozenset({1, 2}), frozenset({3, 4})}
print(s)  # {frozenset({1, 2}), frozenset({3, 4})}

# As dictionary keys
survey_participants = {
    frozenset({1001, 1002}): "Group 1",
    frozenset({1003, 1004}): "Group 2"
}
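
A short sketch of how such a dictionary behaves in practice, using the same illustrative groups: lookups ignore element order, mutation is not available, and set operations return a new frozenset.

```python
survey_participants = {
    frozenset({1001, 1002}): "Group 1",
    frozenset({1003, 1004}): "Group 2",
}

# Element order does not matter for the key
print(survey_participants[frozenset([1002, 1001])])  # Group 1

# frozenset has no add() method, so mutation fails
pair = frozenset({1001, 1002})
try:
    pair.add(1003)
    mutated = True
except AttributeError:
    mutated = False
print(mutated)  # False

# Set operations still work and return a new frozenset
merged = pair | {1003}
print(type(merged).__name__, sorted(merged))  # frozenset [1001, 1002, 1003]
```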

2. Set Comprehensions

python
# Generate unique squares from list
numbers = [1, 2, 2, 3, 3, 3, 4]
squares = {x**2 for x in numbers}
print(squares)  # {1, 4, 9, 16}

# Filter even number squares
even_squares = {x**2 for x in range(10) if x % 2 == 0}
print(even_squares)  # {0, 4, 16, 36, 64}
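
Set comprehensions are also handy for cleaning text data, where the same value shows up with different spacing or capitalization. A small sketch with made-up survey answers (raw_majors is hypothetical):

```python
raw_majors = [" Economics", "economics", "Sociology ", "SOCIOLOGY", "Math"]

# Normalize each answer, then let the set drop the duplicates
clean_majors = {m.strip().lower() for m in raw_majors}
print(sorted(clean_majors))  # ['economics', 'math', 'sociology']
```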

3. Subset and Superset Testing

python
# Define sets
social_science = {"Economics", "Sociology", "Political Science"}
all_majors = {"Economics", "Sociology", "Political Science", "Physics", "Math"}

# Test subset
print(social_science.issubset(all_majors))  # True
print(social_science <= all_majors)         # True (equivalent)

# Test superset
print(all_majors.issuperset(social_science))  # True
print(all_majors >= social_science)           # True (equivalent)

# Test disjoint
physics = {"Physics", "Chemistry"}
print(social_science.isdisjoint(physics))  # True (no intersection)
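
Note that <= treats equal sets as subsets of each other. For a proper subset (contained in the other set but not equal to it), use <. A brief sketch with illustrative sets a, b, and c:

```python
a = {"Economics", "Sociology"}
b = {"Economics", "Sociology"}
c = {"Economics", "Sociology", "Physics"}

print(a <= b)  # True  (equal sets count as subsets)
print(a < b)   # False (not a proper subset: the sets are equal)
print(a < c)   # True  (c contains everything in a, plus more)

# The method form accepts any iterable, not only sets
print(a.issubset(["Economics", "Sociology", "Math"]))  # True
```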

🤔 When to Use Sets?

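| Scenario | Use List | Use Set |
| --- | --- | --- |
| Preserve order | ✅ | ❌ |
| Allow duplicates | ✅ | ❌ |
| Fast lookup | ❌ | ✅ |
| Deduplication | ❌ | ✅ |
| Set operations | ❌ | ✅ |
| Access by index | ✅ | ❌ |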

Example:

python
# ❌ Using a list for lookup (slow)
students = ["Alice", "Bob", "Carol"]  # imagine 1,000 names here
if "Alice" in students:  # Scans element by element, O(n)
    print("Found")

# ✅ Using a set for lookup (fast)
students = {"Alice", "Bob", "Carol"}  # imagine 1,000 names here
if "Alice" in students:  # Hash lookup, O(1) on average
    print("Found")
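
To see the O(n) versus O(1) difference on your own machine, you can time both lookups with the standard timeit module. This is a rough sketch, not a rigorous benchmark: the collection size and repeat count are arbitrary choices, and the timings will differ between machines.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: the last element

# Run each membership check 100 times and compare total time
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

The gap grows with the size of the collection, which is why sets pay off most for large datasets that are checked repeatedly.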

⚠️ Common Errors

Error 1: Trying to Use Indexing

python
majors = {"Economics", "Sociology"}
print(majors[0])  # ❌ TypeError: 'set' object is not subscriptable
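
If you do need a specific element, convert or iterate instead of indexing. A minimal sketch of two options (the variable names are illustrative):

```python
majors = {"Economics", "Sociology"}

# Sort for a deterministic order, then index the resulting list
first_alphabetical = sorted(majors)[0]
print(first_alphabetical)  # Economics

# Grab an arbitrary element without removing it
some_major = next(iter(majors))
print(some_major in majors)  # True
```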

Error 2: Confusing Empty Set and Empty Dictionary

python
empty = {}         # ❌ This is empty dictionary
empty_set = set()  # ✅ This is empty set

print(type(empty))      # <class 'dict'>
print(type(empty_set))  # <class 'set'>

Error 3: Adding Mutable Objects

python
# ❌ Lists are mutable and therefore unhashable, so they cannot be set elements
# s = {[1, 2], [3, 4]}  # TypeError

# ✅ Tuples of hashable values can
s = {(1, 2), (3, 4)}
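
When the data already arrives as a list of lists, converting each inner list to a tuple is a common workaround. A sketch with illustrative data (pairs is hypothetical):

```python
pairs = [[1, 2], [3, 4], [1, 2]]

# Building a set directly from lists fails because lists are unhashable
try:
    set(pairs)
    hashable = True
except TypeError:
    hashable = False
print(hashable)  # False

# Convert each inner list to a tuple first
unique_pairs = {tuple(p) for p in pairs}
print(sorted(unique_pairs))  # [(1, 2), (3, 4)]
```

This only works when the tuple's own contents are hashable; a tuple that contains a list still cannot be added to a set.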

💪 Practice Problems

Exercise 1: Deduplicate and Sort

python
# Respondent ages (with duplicates)
ages = [25, 30, 25, 35, 30, 40, 25, 28, 30, 35]

# Tasks:
# 1. Deduplicate
# 2. Sort from low to high
# 3. Output unique ages and count

Exercise 2: Survey Completeness Check

python
# Required fields
required_fields = {"id", "age", "gender", "income", "education"}

# Batch check
responses = [
    {"id", "age", "gender", "income", "education"},  # Complete
    {"id", "age", "gender", "income"},                # Missing education
    {"id", "age", "gender"},                          # Missing income, education
]

# Task: Check each response for completeness, output missing fields

Exercise 3: Common Friends

python
# Alice's friends
alice_friends = {"Bob", "Carol", "David", "Emma"}

# Bob's friends
bob_friends = {"Alice", "Carol", "Frank", "Grace"}

# Tasks:
# 1. Find common friends of Alice and Bob
# 2. Find people who are only Alice's friends
# 3. Find total number of friends (no duplicates)

📝 Summary

You've now covered Python's four core built-in data structures:

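| Data Structure | Ordered | Mutable | Duplicates | Use |
| --- | --- | --- | --- | --- |
| List | ✅ | ✅ | ✅ | General sequences |
| Tuple | ✅ | ❌ | ✅ | Immutable data |
| Dict | ✅* | ✅ | Keys unique | Key-value pairs |
| Set | ❌ | ✅ | ❌ | Deduplication, set operations |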

*Python 3.7+ dictionaries maintain insertion order

Next Step: We'll learn about Functions and Modules, making code more modular and reusable.

Ready? Keep going!

Released under the MIT License. Content © Author.