
Module 4: Python Data Structures

The Art of Organizing Data — Lists, Dictionaries, Tuples, Sets


Chapter Overview

If variables are containers for storing individual data points, then data structures are the ways to organize multiple data points. This chapter introduces Python's four core data structures: Lists, Tuples, Dictionaries, and Sets. Mastering them will enable you to efficiently handle complex research data.


Learning Objectives

After completing this chapter, you will be able to:

  • Understand the characteristics and use cases of the four data structures
  • Proficiently manipulate lists (create, read, update, delete, slice, sort)
  • Use dictionaries to store key-value pair data
  • Understand tuple immutability and its applications
  • Use sets for deduplication and set operations
  • Choose the appropriate data structure for practical problems
  • Compare Python's data organization methods with Stata/R

Chapter Contents

01 - Lists

Core Question: How to store and manipulate ordered collections of data?

Core Content:

  • List creation: [], list(), list(range(n))
  • Index access (forward/backward)
  • Slicing operations: list[start:end:step]
  • List methods:
    • Adding: append(), insert(), extend()
    • Removing: remove(), pop(), clear()
    • Sorting: sort(), sorted(), reverse()
    • Finding: index(), count(), in
  • List comprehensions (advanced syntax)
  • Nested lists (two-dimensional data)
  • Comparison with R vectors and Stata variables
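
The operations listed above can be sketched in a few lines. This is a minimal illustration, not the chapter's own example; the `ages` values are made up.

```python
# Illustrative data (hypothetical ages from a small sample)
ages = [34, 27, 45, 52, 19]

# Index access: forward indices start at 0, negative indices count from the end
first_age = ages[0]      # 34
last_age = ages[-1]      # 19

# Slicing: list[start:end:step], where `end` is excluded
first_three = ages[0:3]  # [34, 27, 45]
every_other = ages[::2]  # [34, 45, 19]

# Adding elements
ages.append(61)          # add one element at the end
ages.insert(0, 23)       # insert at position 0
ages.extend([38, 41])    # append every element of another list

# Removing elements
ages.remove(27)          # remove the first occurrence of the value 27
popped = ages.pop()      # remove and return the last element (41)

# sorted() returns a new list; sort() modifies the list in place
ascending = sorted(ages)
ages.sort(reverse=True)

# Finding
position = ages.index(45)  # index of the first 45 in the sorted list
has_52 = 52 in ages        # membership test
```

Note the difference between `sorted()` and `sort()`: the first leaves the original list untouched, the second changes it and returns `None`.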

Practical Application:

python
# Store grades for multiple students
grades = [85, 92, 78, 95, 88]

# Filter passing grades
passed = [g for g in grades if g >= 60]

# Calculate average grade
avg_grade = sum(grades) / len(grades)

# Find highest and lowest grades
max_grade = max(grades)
min_grade = min(grades)

Research Scenarios:

  • Storing sample data (age, income, years of education)
  • Batch processing variable names
  • Storing regression coefficients
  • Time series data
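
Two of these scenarios can be sketched with comprehensions and a nested list. The variable names (`q1`..`q5`) and the sample values are hypothetical, chosen only for illustration.

```python
# Batch-generate variable names (hypothetical survey items q1..q5)
var_names = [f"q{i}" for i in range(1, 6)]

# Nested list: each inner list is one respondent's
# [age, income, years_of_education] (values are made up)
sample = [
    [25, 42000, 16],
    [38, 58000, 12],
    [51, 61000, 18],
]

# Extract one "column" (income) from the row-oriented nested list
incomes = [row[1] for row in sample]
mean_income = sum(incomes) / len(incomes)
```

A nested list stores data row by row, so pulling out a column takes a comprehension. That is one reason tabular data is usually easier to handle in a DataFrame later on.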

02 - Tuples

Core Question: When do you need an immutable data structure?

Core Content:

  • Tuple creation: (), tuple()
  • Meaning of immutability
  • Tuple unpacking
  • Single-element tuple trap: (1,) vs (1)
  • Tuple vs List: when to use which?
  • Named tuples (namedtuple)
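
A short sketch of the points above: unpacking, the single-element trap, immutability, and `namedtuple`. The name `Estimate` and its fields are hypothetical, introduced only for this example.

```python
from collections import namedtuple

# Unpacking assigns each element to its own name
model_type, alpha, n_obs = ("OLS", 0.05, 1000)

# Single-element trap: the trailing comma makes the tuple, not the parentheses
not_a_tuple = (1)    # just the integer 1
one_tuple = (1,)     # a tuple with one element

# Immutability: item assignment raises TypeError
point = (3, 4)
try:
    point[0] = 10
except TypeError as err:
    error_message = str(err)

# namedtuple: fields are accessible by name as well as by position
Estimate = namedtuple("Estimate", ["coef", "se"])
est = Estimate(coef=0.45, se=0.12)
t_stat = est.coef / est.se
```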

Practical Application:

python
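from statistics import mean, median, stdev

# Store fixed configuration (should not be modified)
regression_params = ("OLS", 0.05, 1000)  # Model type, significance level, sample size

# Function returns multiple values
def calculate_stats(data):
    # Returns (mean, median, sample standard deviation) as one tuple
    return (mean(data), median(data), stdev(data))

incomes = [42000, 58000, 61000, 39000]  # illustrative values
mean_val, median_val, std_val = calculate_stats(incomes)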

# Dictionary keys (must be immutable)
results = {
    ("Model1", "OLS"): 0.85,
    ("Model2", "Logit"): 0.78
}

When to Use Tuples?

  • Data should not be modified (configuration parameters, constants)
  • As dictionary keys
  • Function returns multiple values
  • Performance matters (in CPython, tuples use slightly less memory than lists and are quicker to create, though the gain is usually small)
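
One way to see the difference between tuples and lists is to compare container sizes. This is a rough sketch that assumes CPython; the exact byte counts vary by Python version and platform.

```python
import sys

values_list = [85, 92, 78, 95, 88]
values_tuple = (85, 92, 78, 95, 88)

# Container overhead in bytes (CPython-specific; exact numbers vary by version)
list_size = sys.getsizeof(values_list)
tuple_size = sys.getsizeof(values_tuple)

print(f"list: {list_size} bytes, tuple: {tuple_size} bytes")
```

For most research scripts this difference is negligible. Choose tuples mainly because the data should not change, not for speed.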

03 - Dictionaries

Core Question: How to store key-value mapped data?

Core Content:

  • Dictionary creation: {}, dict(), dictionary comprehensions
  • Access and modification: dict[key], dict.get(key, default)
  • Dictionary methods:
    • Keys/values/items: keys(), values(), items()
    • Update: update(), setdefault()
    • Delete: pop(), del, clear()
  • Nested dictionaries (multi-level data)
  • Dictionary iteration
  • Comparison with R's named list

Practical Application:

python
# Store personal information
student = {
    "name": "Alice",
    "age": 22,
    "major": "Economics",
    "gpa": 3.8
}

# Store regression results
regression_results = {
    "Model1": {"coef": 0.45, "se": 0.12, "r2": 0.65},
    "Model2": {"coef": 0.52, "se": 0.10, "r2": 0.72}
}

# Variable mapping
var_labels = {
    "edu": "Years of Education",
    "income": "Annual Income",
    "age": "Age"
}

Research Scenarios:

  • Storing individual attributes (ID → attribute values)
  • Organizing regression results
  • Variable name mapping and labels
  • Configuration files (parameter settings)
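
As a sketch of how nested dictionaries and label mappings might be used in practice, the snippet below reuses the shape of the `regression_results` example above. The numbers are illustrative, and `columns` is a hypothetical list of variable names.

```python
# Same shape as the regression_results example above; numbers are illustrative
regression_results = {
    "Model1": {"coef": 0.45, "se": 0.12, "r2": 0.65},
    "Model2": {"coef": 0.52, "se": 0.10, "r2": 0.72},
}

# Nested access: outer key first, then inner key
model2_r2 = regression_results["Model2"]["r2"]

# Iterate over models and print a small results table
for name, stats in regression_results.items():
    print(f"{name}: coef={stats['coef']:.2f}, se={stats['se']:.2f}, R2={stats['r2']:.2f}")

# Pick the model with the highest R-squared
best_model = max(regression_results, key=lambda m: regression_results[m]["r2"])

# Apply variable labels, falling back to the raw name when no label exists
var_labels = {"edu": "Years of Education", "income": "Annual Income"}
columns = ["edu", "income", "region"]
labeled = [var_labels.get(col, col) for col in columns]
```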

04 - Sets

Core Question: How to handle unique values and set operations?

Core Content:

  • Set creation: {1, 2, 3}, set() (note that {} creates an empty dictionary, not an empty set)
  • Set characteristics: unordered, unique, mutable
  • Set operations:
    • Add/remove: add(), remove(), discard()
    • Set operations: union (|), intersection (&), difference (-), symmetric difference (^)
    • Subset testing: issubset(), issuperset()
  • Deduplication
  • Membership testing (high efficiency)

Practical Application:

python
# Data deduplication
all_ids = [101, 102, 103, 101, 104, 102]
unique_ids = set(all_ids)  # {101, 102, 103, 104}

# Find intersection of two respondent groups
group_a = {101, 102, 103, 104}
group_b = {103, 104, 105, 106}
both_groups = group_a & group_b  # {103, 104}

# Fast membership checking (faster than lists)
if 101 in unique_ids:
    print("ID 101 exists")

Research Scenarios:

  • Data deduplication
  • Sample matching (intersection)
  • Difference analysis (difference set)
  • ID duplicate checking
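
These scenarios might look like the following in code. The two survey waves and their IDs are hypothetical.

```python
from collections import Counter

# Hypothetical respondent IDs from two survey waves
wave1_ids = [101, 102, 103, 101, 104, 102]
wave2_ids = [103, 104, 105, 106]

# ID duplicate checking: a list has duplicates if its set is shorter
has_duplicates = len(set(wave1_ids)) < len(wave1_ids)

# Which IDs are duplicated? Count occurrences and keep those above 1
counts = Counter(wave1_ids)
duplicated_ids = sorted(id_ for id_, n in counts.items() if n > 1)

# Sample matching: IDs present in both waves
matched = set(wave1_ids) & set(wave2_ids)

# Difference analysis: IDs that appear in wave 1 but not in wave 2
dropped_out = set(wave1_ids) - set(wave2_ids)
```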

05 - Summary and Review

Content:

  • Comparison table of four data structures
  • Selection decision tree
  • Comprehensive practice problems
  • Performance comparison
  • Common errors and best practices

Four Data Structures Comparison

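| Feature     | List               | Tuple                | Dict                                   | Set                          |
|-------------|--------------------|----------------------|----------------------------------------|------------------------------|
| Syntax      | [...]              | (...)                | {k: v, ...}                            | {...}                        |
| Ordered     | Yes                | Yes                  | Yes (3.7+ maintains insertion order)   | No                           |
| Mutable     | Yes                | No                   | Yes                                    | Yes                          |
| Duplicates  | Allowed            | Allowed              | Keys unique, values can repeat         | Not allowed                  |
| Indexing    | Integer index      | Integer index        | Key index                              | No index                     |
| Typical Use | Ordered collection | Immutable collection | Key-value mapping                      | Unique values, set operations |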

Selection Guide

When to use lists?

  • Need to store ordered elements (grades, years, prices)
  • Need to modify data (add, delete, sort)
  • Need to access by index

When to use tuples?

  • Data should not be modified (configuration, constants)
  • As dictionary keys
  • Function returns multiple values
  • Performance matters (slightly lighter and quicker to create than lists; the gain is usually small)

When to use dictionaries?

  • Need to look up data by name/ID
  • Store attributes (name → properties)
  • Counting, mapping, lookup tables
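
Counting and lookup tables are two of the most common dictionary patterns. The sketch below uses made-up categorical data; `region_labels` and its codes are hypothetical.

```python
from collections import Counter

# Hypothetical categorical responses
majors = ["Economics", "Sociology", "Economics", "Statistics", "Economics"]

# Manual counting with a dictionary and get()
major_counts = {}
for major in majors:
    major_counts[major] = major_counts.get(major, 0) + 1

# collections.Counter from the standard library does the same in one step
counter = Counter(majors)

# Lookup table: map numeric codes to labels
region_labels = {1: "North", 2: "South"}
codes = [2, 1, 2]
decoded = [region_labels[c] for c in codes]
```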

When to use sets?

  • Need deduplication
  • Need set operations (intersection/union/difference)
  • Fast membership testing
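
To get a feel for the membership-testing difference, a rough timing comparison can be run with the standard `timeit` module. The data size and repetition count are arbitrary choices, and the timings depend on the machine.

```python
import timeit

# 100,000 hypothetical IDs stored both as a list and as a set
id_list = list(range(100_000))
id_set = set(id_list)
target = 99_999  # worst case for the list: the last element

# Both containers give the same answer...
same_answer = (target in id_list) == (target in id_set)

# ...but the list scans elements one by one, while the set uses a hash lookup
list_time = timeit.timeit(lambda: target in id_list, number=100)
set_time = timeit.timeit(lambda: target in id_set, number=100)

print(f"list: {list_time:.5f} s, set: {set_time:.5f} s")
```

If the same collection is checked many times, converting it to a set once up front usually pays off.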

How to Study This Chapter?

Learning Roadmap

Day 1 (3 hours): Lists

  • Read 01 - Lists
  • Practice indexing, slicing, methods
  • Write list comprehensions

Day 2 (2 hours): Tuples

  • Read 02 - Tuples
  • Understand immutability
  • Practice tuple unpacking

Day 3 (3 hours): Dictionaries

  • Read 03 - Dictionaries
  • Create nested dictionaries
  • Practice dictionary iteration and methods

Day 4 (2 hours): Sets

  • Read 04 - Sets
  • Practice set operations
  • Data deduplication practice

Day 5 (2 hours): Review and comprehensive application

  • Complete 05 - Summary and Review
  • Comprehensive practice problems
  • Compare four structures

Total Time: 12 hours (1-2 weeks)

Minimal Learning Path

If time is limited:

Must Learn (core structures, 8 hours):

  • 01 - Lists (complete)
  • 03 - Dictionaries (complete)
  • 02 - Tuples (basics)
  • 04 - Sets (deduplication)

Optional (advanced techniques):

  • List comprehensions
  • Nested dictionaries
  • Set operations
  • Named tuples

Study Recommendations

  1. Start from Use Cases

    • Think: "Which structure fits my research data?"
    • Map Stata/R data organization to Python
    • Practice with real data
  2. Comparative Learning

    python
    # List: ordered, mutable
    grades = [85, 92, 78]
    grades.append(95)  # Can modify
    
    # Tuple: ordered, immutable
    config = (85, 92, 78)
    # config.append(95)  # Error! Cannot modify
    
    # Dictionary: key-value pairs
    student = {"name": "Alice", "grade": 85}
    
    # Set: unique values
    unique_grades = {85, 92, 78, 85}  # {85, 92, 78}
  3. Performance Awareness

    • List search: O(n), which gets slow as the list grows
    • Dictionary/set lookup: O(1) on average, so it stays fast
    • Choose the appropriate structure for large datasets
  4. Avoid Common Errors

    • Don't confuse [] (list), () (tuple), and {} (dictionary/set); an empty {} is always a dictionary
    • Remember that list indexing starts at 0
    • Accessing a non-existent dictionary key raises KeyError; use get() when the key may be missing
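
The common errors above can be reproduced in a few lines. This is a small sketch with made-up data, meant only to show what each mistake looks like.

```python
# {} is an empty dictionary; an empty set needs set()
empty_braces_type = type({}).__name__     # "dict"
empty_set_type = type(set()).__name__     # "set"

# Indexing starts at 0, so the last valid index is len(grades) - 1
grades = [85, 92, 78]
first = grades[0]
try:
    grades[3]
except IndexError:
    index_error_raised = True

# A missing key raises KeyError with [], while get() returns a default
student = {"name": "Alice", "grade": 85}
try:
    student["major"]
except KeyError:
    key_error_raised = True
major = student.get("major", "unknown")
```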

Common Questions

Q: Why so many data structures? Can't we just use lists? A: Different structures have different advantages. Lists are good for ordered data, dictionaries for lookups, sets for deduplication. Choosing the right structure makes code more efficient and readable.

Q: What's the Stata/R equivalent of Python dictionaries? A:

  • Stata has no direct equivalent (closest is value labels)
  • R's named list is similar to dictionaries

Q: When should I use list comprehensions? A: When you need a loop to create a list, comprehensions are more concise. But if logic is complex, regular loops are clearer.
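
For illustration, here is the same filter written both ways, followed by a case where a plain loop is arguably clearer. The grades and the A/Pass/Fail cutoffs are hypothetical.

```python
grades = [85, 92, 58, 95, 47]

# Loop version
passed_loop = []
for g in grades:
    if g >= 60:
        passed_loop.append(g)

# Equivalent list comprehension
passed = [g for g in grades if g >= 60]

# When the logic branches several ways, a plain loop is usually easier to read
letters = []
for g in grades:
    if g >= 90:
        letters.append("A")
    elif g >= 60:
        letters.append("Pass")
    else:
        letters.append("Fail")
```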

Q: What's the difference between sets and list deduplication? A: Converting a list to a set, set(items), is the simplest way to remove duplicates, but it does not preserve the original order. If you need to preserve order, use list(dict.fromkeys(items)).
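
A minimal sketch of the two deduplication approaches, using made-up IDs:

```python
ids = [103, 101, 103, 102, 101]

# set() removes duplicates but does not promise any particular order
unique_unordered = set(ids)

# dict.fromkeys() keeps the first occurrence of each value, in original order
unique_ordered = list(dict.fromkeys(ids))   # [103, 101, 102]

# If sorted output is acceptable, sorted(set(...)) is another option
unique_sorted = sorted(set(ids))            # [101, 102, 103]
```

The `dict.fromkeys()` approach relies on dictionaries keeping insertion order, which is guaranteed from Python 3.7 onward.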

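Q: Why must dictionary keys be immutable types? A: Because dictionaries are implemented with hash tables, keys must be hashable, and Python's built-in mutable containers (lists, dictionaries, sets) are not. So tuples work as keys, provided all their elements are hashable, but lists don't.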
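
The hashability rule can be checked directly. In this sketch the keys and values are illustrative:

```python
# A tuple of hashable elements works as a key
results = {("Model1", "OLS"): 0.85}
r2 = results[("Model1", "OLS")]

# A list is mutable and unhashable, so using it as a key raises TypeError
try:
    bad = {["Model1", "OLS"]: 0.85}
except TypeError as err:
    list_key_error = str(err)   # the message mentions an unhashable type

# A tuple that contains a list is not hashable either
try:
    hash(("Model1", ["OLS"]))
except TypeError:
    nested_list_unhashable = True
```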


Next Steps

After completing this chapter, you will have mastered:

  • Python's four core data structures
  • How to choose the appropriate structure for organizing data
  • Efficient data manipulation methods

In Module 5, we'll learn about functions and modules, making code more modular and reusable.

In Modules 6-7, we'll learn Pandas, which integrates all these structures into the powerful DataFrame!

Keep going! Mastering data structures puts you one step away from real data analysis!


Released under the MIT License. Content © Author.