Module 4: Python Data Structures

The Art of Organizing Data — Lists, Dictionaries, Tuples, Sets

Chapter Overview

If variables are containers for storing individual data points, then data structures are the ways to organize multiple data points. This chapter introduces Python's four core data structures: Lists, Tuples, Dictionaries, and Sets. Mastering them will enable you to efficiently handle complex research data.

Learning Objectives

After completing this chapter, you will be able to:

Understand the characteristics and use cases of the four data structures
Proficiently manipulate lists (create, read, update, delete, slice, sort)
Use dictionaries to store key-value pair data
Understand tuple immutability and its applications
Use sets for deduplication and set operations
Choose the appropriate data structure for practical problems
Compare Python's data organization methods with Stata/R

Chapter Contents

01 - Lists

Core Question: How to store and manipulate ordered collections of data?

Core Content:

List creation: [], list(), list(range(...)) (range() alone returns a lazy range object, not a list)
Index access (forward/backward)
Slicing operations: list[start:end:step]
List methods:
- Adding: append(), insert(), extend()
- Removing: remove(), pop(), clear()
- Sorting: sort(), reverse(), and the built-in function sorted()
- Finding: index(), count(), and the in operator
List comprehensions (advanced syntax)
Nested lists (two-dimensional data)
Comparison with R vectors and Stata variables
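The operations listed above can be sketched in a few lines. This is a minimal illustration rather than the subchapter's own material; the names `years`, `grades`, and `panel` and all values are made up for the example.

```python
# Creation: a literal, or list() over a range (range() alone is lazy)
years = list(range(2018, 2023))      # [2018, 2019, 2020, 2021, 2022]
grades = [85, 92, 78, 95, 88]        # hypothetical sample values

# Indexing: forward from 0, backward from -1
first, last = grades[0], grades[-1]  # 85 and 88

# Slicing: list[start:end:step], where end is exclusive
first_three = grades[:3]             # [85, 92, 78]
every_other = grades[::2]            # [85, 78, 88]

# Adding
grades.append(70)                    # add one element at the end
grades.insert(0, 60)                 # insert at index 0
grades.extend([99, 100])             # add several elements at once

# Removing
grades.remove(60)                    # remove the first occurrence of a value
popped = grades.pop()                # remove and return the last element

# Sorting: sorted() returns a new list, sort() changes the list in place
ranked = sorted(grades, reverse=True)
grades.sort()

# Finding
position = grades.index(92)          # index of the first 92
has_95 = 95 in grades                # membership test

# Nested list: each row holds [age, income]
panel = [[25, 50000], [30, 62000]]
second_income = panel[1][1]          # row 1, column 1
```

Note the difference between `sorted(grades)`, which leaves `grades` untouched, and `grades.sort()`, which returns `None` and reorders the list itself.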

Practical Application:

python

# Store grades for multiple students
grades = [85, 92, 78, 95, 88]

# Filter passing grades
passed = [g for g in grades if g >= 60]

# Calculate average grade
avg_grade = sum(grades) / len(grades)

# Find highest and lowest grades
max_grade = max(grades)
min_grade = min(grades)

Research Scenarios:

Storing sample data (age, income, years of education)
Batch processing variable names
Storing regression coefficients
Time series data
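As a rough sketch of how lists serve these scenarios, the example below builds variable names in bulk, pairs coefficients with their names, and computes period-to-period changes. All names and numbers are hypothetical.

```python
# Batch-build variable names for several survey waves
waves = [2018, 2019, 2020]
income_vars = [f"income_{year}" for year in waves]

# Store coefficients in the same order as their variable names
var_names = ["edu", "age", "female"]
coefs = [0.45, 0.02, -0.10]           # made-up values

# zip() pairs the two lists element by element
for name, coef in zip(var_names, coefs):
    print(f"{name:>8}: {coef:+.2f}")

# Time series: change between consecutive periods
gdp_index = [100.0, 103.0, 101.5, 106.0]   # made-up values
changes = [later - earlier for earlier, later in zip(gdp_index, gdp_index[1:])]
```

Keeping names and coefficients in two parallel lists works for small cases, but it relies on the order staying in sync; a dictionary is often the safer choice when values must be looked up by name.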

02 - Tuples

Core Question: When do you need an immutable data structure?

Core Content:

Tuple creation: (), tuple()
Meaning of immutability
Tuple unpacking
Single-element tuple trap: (1,) vs (1)
Tuple vs List: when to use which?
Named tuples (namedtuple)
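The points above can be illustrated with a short sketch. The `ModelSpec` named tuple and its fields are hypothetical, chosen only to mirror the regression example used in this section.

```python
from collections import namedtuple

# Creation
point = (3, 4)
from_list = tuple([1, 2, 3])

# Immutability: item assignment raises TypeError
try:
    point[0] = 10
except TypeError as err:
    error_message = str(err)

# Unpacking, including the starred form
x, y = point
head, *rest = from_list      # head is 1, rest is the list [2, 3]

# Single-element trap: the comma makes the tuple, not the parentheses
not_a_tuple = (1)            # just the integer 1
one_tuple = (1,)             # a tuple of length 1

# Named tuple: fields are accessible by name as well as by position
ModelSpec = namedtuple("ModelSpec", ["method", "alpha", "n"])
spec = ModelSpec(method="OLS", alpha=0.05, n=1000)
print(spec.method, spec[1], spec.n)
```

A named tuple keeps the immutability of a plain tuple while removing the need to remember which position holds which value.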

Practical Application:

python

# Store fixed configuration (should not be modified)
regression_params = ("OLS", 0.05, 1000)  # Model type, significance level, sample size

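# Function returns multiple values
from statistics import mean, median, stdev

incomes = [32000, 45000, 58000, 75000]  # example values for illustration

def calculate_stats(data):
    """Return the mean, median, and sample standard deviation of data."""
    return (mean(data), median(data), stdev(data))

mean_val, median_val, std_val = calculate_stats(incomes)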

# Dictionary keys (must be immutable)
results = {
    ("Model1", "OLS"): 0.85,
    ("Model2", "Logit"): 0.78
}

When to Use Tuples?

Data should not be modified (configuration parameters, constants)
As dictionary keys
Function returns multiple values
Performance-sensitive code (in CPython, tuples use slightly less memory than lists and are slightly faster to create, though the difference is usually small)

03 - Dictionaries

Core Question: How to store key-value mapped data?

Core Content:

Dictionary creation: {}, dict(), dictionary comprehensions
Access and modification: dict[key], dict.get(key, default)
Dictionary methods:
- Keys/values/items: keys(), values(), items()
- Update: update(), setdefault()
- Delete: pop(), del, clear()
Nested dictionaries (multi-level data)
Dictionary iteration
Comparison with R's named list
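A minimal sketch of the dictionary operations listed above. The `student` and `results` records are hypothetical.

```python
student = {"name": "Alice", "age": 22}

# Access: [] requires the key to exist, get() returns a default otherwise
age = student["age"]
gpa = student.get("gpa", "n/a")

# Modify and add
student["age"] = 23
student.update({"major": "Economics", "gpa": 3.8})
student.setdefault("minor", "Statistics")   # set only if the key is absent

# Delete
removed = student.pop("minor")              # remove a key and return its value
del student["gpa"]                          # remove a key without returning it

# Iterate over key-value pairs
for key, value in student.items():
    print(key, "->", value)

# Dictionary comprehension
squares = {n: n ** 2 for n in range(1, 4)}

# Nested dictionary: chain the keys to reach an inner value
results = {"Model1": {"coef": 0.45, "se": 0.12}}
model1_se = results["Model1"]["se"]
```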

Practical Application:

python

# Store personal information
student = {
    "name": "Alice",
    "age": 22,
    "major": "Economics",
    "gpa": 3.8
}

# Store regression results
regression_results = {
    "Model1": {"coef": 0.45, "se": 0.12, "r2": 0.65},
    "Model2": {"coef": 0.52, "se": 0.10, "r2": 0.72}
}

# Variable mapping
var_labels = {
    "edu": "Years of Education",
    "income": "Annual Income",
    "age": "Age"
}

Research Scenarios:

Storing individual attributes (ID → attribute values)
Organizing regression results
Variable name mapping and labels
Configuration files (parameter settings)

04 - Sets

Core Question: How to handle unique values and set operations?

Core Content:

Set creation: {1, 2, 3}, set() (note that {} alone creates an empty dictionary, not an empty set)
Set characteristics: unordered, unique, mutable
Set operations:
- Add/remove: add(), remove(), discard()
- Set operations: union (|), intersection (&), difference (-), symmetric difference (^)
- Subset testing: issubset(), issuperset()
Deduplication
Membership testing (high efficiency)
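The set operations above can be tried out as follows. This is a small sketch; the `treated` and `control` groups and their IDs are made up.

```python
# Creation: braces with elements, or set() around another collection
treated = {101, 102, 103}
control = set([103, 104, 105])
empty = set()                # {} would be an empty dict instead

# Add / remove
treated.add(106)
treated.discard(999)         # no error if the element is missing
try:
    treated.remove(999)      # remove() raises KeyError for a missing element
except KeyError:
    remove_failed = True

# Set algebra
union = treated | control
intersection = treated & control
difference = treated - control
sym_diff = treated ^ control

# Subset testing
is_sub = {101, 102}.issubset(treated)
is_super = treated.issuperset({101})
```

Prefer `discard()` when a missing element is not an error, and `remove()` when it should be treated as one.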

Practical Application:

python

# Data deduplication
all_ids = [101, 102, 103, 101, 104, 102]
unique_ids = set(all_ids)  # {101, 102, 103, 104}

# Find intersection of two respondent groups
group_a = {101, 102, 103, 104}
group_b = {103, 104, 105, 106}
both_groups = group_a & group_b  # {103, 104}

# Fast membership checking (faster than lists)
if 101 in unique_ids:
    print("ID 101 exists")

Research Scenarios:

Data deduplication
Sample matching (intersection)
Difference analysis (difference set)
ID duplicate checking
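One way these scenarios might look in practice, assuming two hypothetical survey waves identified by respondent ID:

```python
from collections import Counter

# Hypothetical respondent IDs from two survey waves
wave1_ids = [101, 102, 103, 104, 102]
wave2_ids = [103, 104, 105]

# ID duplicate checking: a length mismatch signals repeated IDs
has_duplicates = len(wave1_ids) != len(set(wave1_ids))
duplicated = sorted(i for i, n in Counter(wave1_ids).items() if n > 1)

# Sample matching: respondents present in both waves
matched = set(wave1_ids) & set(wave2_ids)

# Difference analysis: attrition and new entrants
dropped_out = set(wave1_ids) - set(wave2_ids)
new_entrants = set(wave2_ids) - set(wave1_ids)

print("duplicated:", duplicated)
print("matched:", sorted(matched))
```

Sets say whether duplicates exist, but not which values repeat; `collections.Counter` from the standard library fills that gap here.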

05 - Summary and Review

Content:

Comparison table of four data structures
Selection decision tree
Comprehensive practice problems
Performance comparison
Common errors and best practices

Four Data Structures Comparison

Feature	List	Tuple	Dict	Set
Syntax	`[...]`	`(...)`	`{k:v, ...}`	`{...}`
Ordered	✓	✓	✓ (3.7+ maintains insertion order)	✗
Mutable	✓	✗	✓	✓
Duplicates	✓	✓	Keys unique, values can repeat	✗
Indexing	Integer index	Integer index	Key index	No index
Typical Use	Ordered collection	Immutable collection	Key-value mapping	Unique values, set operations

Selection Guide

When to use lists?

Need to store ordered elements (grades, years, prices)
Need to modify data (add, delete, sort)
Need to access by index

When to use tuples?

Data should not be modified (configuration, constants)
As dictionary keys
Function returns multiple values
Performance priority (slightly lighter than lists, though the gain is usually small)

When to use dictionaries?

Need to look up data by name/ID
Store attributes (name → properties)
Counting, mapping, lookup tables
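Counting and lookup tables are two of the most common dictionary patterns. A minimal sketch with made-up survey responses:

```python
# Hypothetical survey responses
majors = ["Econ", "Stats", "Econ", "Math", "Econ", "Stats"]

# Counting: get() supplies 0 the first time a key is seen
counts = {}
for major in majors:
    counts[major] = counts.get(major, 0) + 1

# Lookup table: map short codes to labels, with a fallback for unknown codes
labels = {"Econ": "Economics", "Stats": "Statistics", "Math": "Mathematics"}
full_names = [labels.get(code, "Unknown") for code in majors]

print(counts)
```

The standard library's `collections.Counter` performs the same counting in one call; the explicit loop is shown here because it makes the role of `get()` visible.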

When to use sets?

Need deduplication
Need set operations (intersection/union/difference)
Fast membership testing
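The membership-testing point can be checked with a rough timing sketch. The collection size and repetition count are arbitrary choices for illustration, and actual timings depend on the machine and Python version.

```python
import timeit

n = 100_000                       # arbitrary size chosen for illustration
ids_list = list(range(n))
ids_set = set(ids_list)
target = n - 1                    # worst case for the list: the last element

# Time 200 membership tests against each structure
list_time = timeit.timeit(lambda: target in ids_list, number=200)
set_time = timeit.timeit(lambda: target in ids_set, number=200)

print(f"list: {list_time:.4f} s, set: {set_time:.6f} s")
```

The list has to be scanned element by element, while the set uses a hash lookup, so the gap widens as the collection grows.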

How to Study This Chapter?

Learning Roadmap

Day 1 (3 hours): Lists

Read 01 - Lists
Practice indexing, slicing, methods
Write list comprehensions

Day 2 (2 hours): Tuples

Read 02 - Tuples
Understand immutability
Practice tuple unpacking

Day 3 (3 hours): Dictionaries

Read 03 - Dictionaries
Create nested dictionaries
Practice dictionary iteration and methods

Day 4 (2 hours): Sets

Read 04 - Sets
Practice set operations
Data deduplication practice

Day 5 (2 hours): Review and comprehensive application

Complete 05 - Summary and Review
Comprehensive practice problems
Compare four structures

Total Time: 12 hours (1-2 weeks)

Minimal Learning Path

If time is limited:

Must Learn (core structures, 8 hours):

01 - Lists (complete)
03 - Dictionaries (complete)
02 - Tuples (basics)
04 - Sets (deduplication)

Optional (advanced techniques):

List comprehensions
Nested dictionaries
Set operations
Named tuples

Study Recommendations

Start from Use Cases
- Think: "Which structure fits my research data?"
- Map Stata/R data organization to Python
- Practice with real data

Comparative Learning

python

# List: ordered, mutable
grades = [85, 92, 78]
grades.append(95)  # Can modify

# Tuple: ordered, immutable
config = (85, 92, 78)
# config.append(95)  # Error! Cannot modify

# Dictionary: key-value pairs
student = {"name": "Alice", "grade": 85}

# Set: unique values
unique_grades = {85, 92, 78, 85}  # {85, 92, 78}

Performance Awareness
- List search: O(n), slow for large lists
- Dictionary/set search: O(1) on average, fast
- Choose appropriate structure for large datasets

Avoid Common Errors
- Don't confuse [] (list), () (tuple), {} (dictionary, or set when it contains elements)
- Remember list indexing starts at 0
- Accessing a non-existent dictionary key raises KeyError; use get() for safe access
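Each of these pitfalls is easy to reproduce. The sketch below triggers them deliberately and catches the resulting exceptions; the variable names are illustrative only.

```python
# Empty braces create a dictionary; an empty set needs set()
empty_braces = {}
empty_set = set()
print(type(empty_braces).__name__, type(empty_set).__name__)

# Indexing starts at 0, so the last valid index is len(grades) - 1
grades = [85, 92, 78]
first = grades[0]
try:
    grades[3]
except IndexError:
    out_of_range = True

# Missing dictionary keys: [] raises KeyError, get() returns a default
student = {"name": "Alice"}
try:
    student["gpa"]
except KeyError:
    missing_key = True
gpa = student.get("gpa")               # None when no default is given
gpa_or_zero = student.get("gpa", 0.0)  # explicit default
```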

Common Questions

Q: Why so many data structures? Can't we just use lists? A: Different structures have different advantages. Lists are good for ordered data, dictionaries for lookups, sets for deduplication. Choosing the right structure makes code more efficient and readable.

Q: What's the Stata/R equivalent of Python dictionaries? A:

Stata has no direct equivalent (closest is value labels)
R's named list is similar to dictionaries

Q: When should I use list comprehensions? A: When you need a loop to create a list, comprehensions are more concise. But if logic is complex, regular loops are clearer.
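To make the trade-off concrete, here is a small sketch with hypothetical income values, where 0 is assumed to mark a missing response and the bracket cut-offs are invented for the example.

```python
incomes = [32000, 58000, 0, 75000, 41000]

# Simple filter-and-transform: a comprehension reads well
in_thousands = [x / 1000 for x in incomes if x > 0]

# Several branches: a regular loop is easier to follow
brackets = []
for x in incomes:
    if x == 0:
        brackets.append("missing")
    elif x < 40000:
        brackets.append("low")
    elif x < 60000:
        brackets.append("middle")
    else:
        brackets.append("high")

print(in_thousands)
print(brackets)
```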

Q: What's the difference between sets and list deduplication? A: set(items) is a quick way to deduplicate, but it does not preserve the original order. If you need to preserve order, use list(dict.fromkeys(items)), which relies on dictionaries keeping insertion order (Python 3.7+).
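The two approaches from the answer above can be compared directly. A minimal sketch; `ids` and its values are made up for illustration.

```python
ids = [104, 101, 104, 103, 101, 102]   # hypothetical IDs with repeats

# set() removes duplicates, but the resulting order is not guaranteed
unique_unordered = set(ids)

# dict.fromkeys() keeps the first occurrence of each value, in original order
unique_ordered = list(dict.fromkeys(ids))

print(unique_ordered)
```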

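Q: Why must dictionary keys be immutable types? A: Because dictionaries are implemented with hash tables, keys must be hashable, meaning their hash value never changes. Python's built-in immutable types (strings, numbers, and tuples whose elements are all hashable) qualify; mutable containers such as lists do not. So tuples work as keys, but lists don't.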
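A short sketch of the hashability rule in action. The `scores` dictionary and its keys are illustrative only.

```python
scores = {}

# A tuple of immutable values is hashable, so it works as a key
scores[("Model1", "OLS")] = 0.85

# A list is mutable and unhashable, so using it as a key raises TypeError
try:
    scores[["Model1", "OLS"]] = 0.85
except TypeError as err:
    list_error = str(err)

# A tuple that contains a list is also unhashable
try:
    hash((1, [2, 3]))
except TypeError:
    nested_unhashable = True

print(list_error)
```

The last case shows that being a tuple is not enough on its own: every element inside it must be hashable too.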

Next Steps

After completing this chapter, you will have mastered:

Python's four core data structures
How to choose the appropriate structure for organizing data
Efficient data manipulation methods

In Module 5, we'll learn about functions and modules, making code more modular and reusable.

In Modules 6-7, we'll learn pandas, which builds on these structures to provide the powerful DataFrame!

Keep going! Mastering data structures puts you one step away from real data analysis!
