Module 4: Python Data Structures
The Art of Organizing Data — Lists, Dictionaries, Tuples, Sets
Chapter Overview
If variables are containers for storing individual data points, then data structures are the ways to organize multiple data points. This chapter introduces Python's four core data structures: Lists, Tuples, Dictionaries, and Sets. Mastering them will enable you to efficiently handle complex research data.
Learning Objectives
After completing this chapter, you will be able to:
- Understand the characteristics and use cases of four data structures
- Proficiently manipulate lists (create, read, update, delete, slice, sort)
- Use dictionaries to store key-value pair data
- Understand tuple immutability and its applications
- Use sets for deduplication and set operations
- Choose the appropriate data structure for practical problems
- Compare Python's data organization methods with Stata/R
Chapter Contents
01 - Lists
Core Question: How to store and manipulate ordered collections of data?
Core Content:
- List creation: [], list(), range()
- Index access (forward/backward)
- Slicing operations: list[start:end:step]
- List methods:
  - Adding: append(), insert(), extend()
  - Removing: remove(), pop(), clear()
  - Sorting: sort(), sorted(), reverse()
  - Finding: index(), count(), in
- List comprehensions (advanced syntax)
- Nested lists (two-dimensional data)
- Comparison with R vectors and Stata variables
Practical Application:
# Store grades for multiple students
grades = [85, 92, 78, 95, 88]
# Filter passing grades
passed = [g for g in grades if g >= 60]
# Calculate average grade
avg_grade = sum(grades) / len(grades)
# Find highest and lowest grades
max_grade = max(grades)
min_grade = min(grades)
Research Scenarios:
- Storing sample data (age, income, years of education)
- Batch processing variable names
- Storing regression coefficients
- Time series data
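The indexing, slicing, and sorting operations listed under Core Content can be sketched as follows. The `ages` values and the `panel` nested list are made-up illustration data, not drawn from any real dataset.

```python
# Illustrative data (hypothetical sample ages)
ages = [34, 27, 45, 52, 19, 38]

# Indexing: forward from 0, backward from -1
first_age = ages[0]      # 34
last_age = ages[-1]      # 38

# Slicing: list[start:end:step], where end is exclusive
first_three = ages[:3]   # [34, 27, 45]
every_other = ages[::2]  # [34, 45, 19]

# sorted() returns a new list; sort() would reorder the list in place
ages_sorted = sorted(ages)              # ages itself is unchanged
ages_desc = sorted(ages, reverse=True)

# Adding and removing elements
ages.append(61)          # add one element at the end
removed = ages.pop()     # remove and return the last element (61)

# Nested list as a small two-dimensional table: [age, income]
panel = [[34, 52000], [27, 41000], [45, 67000]]
incomes = [row[1] for row in panel]  # extract the second column
```

Note the difference between `sorted(ages)`, which leaves the original untouched, and `ages.sort()`, which modifies it; preferring `sorted()` is often safer when the original order still matters.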
02 - Tuples
Core Question: When do you need an immutable data structure?
Core Content:
- Tuple creation: (), tuple()
- Meaning of immutability
- Tuple unpacking
- Single-element tuple trap: (1,) vs (1)
- Tuple vs List: when to use which?
- Named tuples (namedtuple)
Practical Application:
from statistics import mean, median, stdev
# Store fixed configuration (should not be modified)
regression_params = ("OLS", 0.05, 1000)  # Model type, significance level, sample size
# Function returns multiple values as a tuple
def calculate_stats(data):
    return (mean(data), median(data), stdev(data))
incomes = [42000, 55000, 61000, 48000]  # Illustrative values
mean_val, median_val, std_val = calculate_stats(incomes)
# Dictionary keys (must be hashable, so tuples work)
results = {
    ("Model1", "OLS"): 0.85,
    ("Model2", "Logit"): 0.78
}
When to Use Tuples?
- Data should not be modified (configuration parameters, constants)
- As dictionary keys
- Function returns multiple values
- Performance matters (tuples use slightly less memory and are slightly cheaper to create than lists, though the difference is usually small)
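The single-element trap, immutability, unpacking, and named tuples from the Core Content list can be illustrated with a short sketch. The `ModelSpec` type and its field names are hypothetical, chosen only for this example.

```python
from collections import namedtuple

# Single-element tuple: the trailing comma makes the tuple, not the parentheses
not_a_tuple = (1)    # just the integer 1
one_tuple = (1,)     # a tuple containing 1

# Tuples cannot be modified in place
params = ("OLS", 0.05, 1000)
try:
    params[1] = 0.01
except TypeError as err:
    error_message = str(err)

# Unpacking assigns each element to its own name
model_type, alpha, n_obs = params

# namedtuple adds readable field names (ModelSpec is a hypothetical type)
ModelSpec = namedtuple("ModelSpec", ["model_type", "alpha", "n_obs"])
spec = ModelSpec("OLS", 0.05, 1000)
print(spec.alpha, spec[0])  # access by field name or by position
```

A named tuple keeps the immutability of a plain tuple while removing the need to remember which position holds which value.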
03 - Dictionaries
Core Question: How to store key-value mapped data?
Core Content:
- Dictionary creation: {}, dict(), dictionary comprehensions
- Access and modification: dict[key], dict.get(key, default)
- Dictionary methods:
  - Keys/values/items: keys(), values(), items()
  - Update: update(), setdefault()
  - Delete: pop(), del, clear()
- Nested dictionaries (multi-level data)
- Dictionary iteration
- Comparison with R's named list
Practical Application:
# Store personal information
student = {
    "name": "Alice",
    "age": 22,
    "major": "Economics",
    "gpa": 3.8
}
# Store regression results
regression_results = {
    "Model1": {"coef": 0.45, "se": 0.12, "r2": 0.65},
    "Model2": {"coef": 0.52, "se": 0.10, "r2": 0.72}
}
# Variable mapping
var_labels = {
    "edu": "Years of Education",
    "income": "Annual Income",
    "age": "Age"
}
Research Scenarios:
- Storing individual attributes (ID → attribute values)
- Organizing regression results
- Variable name mapping and labels
- Configuration files (parameter settings)
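Safe access, iteration, and comprehensions over a nested dictionary might look like the sketch below. The numbers mirror the illustrative regression results above; the `"n"` field and its value are assumptions added only for the `setdefault()` demonstration.

```python
# Illustrative regression output (same structure as the example above)
regression_results = {
    "Model1": {"coef": 0.45, "se": 0.12, "r2": 0.65},
    "Model2": {"coef": 0.52, "se": 0.10, "r2": 0.72},
}

# get() returns a default instead of raising KeyError for a missing key
model3 = regression_results.get("Model3", {})

# Iterate over key-value pairs with items()
t_stats = {}
for name, res in regression_results.items():
    t_stats[name] = round(res["coef"] / res["se"], 2)

# Dictionary comprehension: pull one field out of each nested dictionary
r2_by_model = {name: res["r2"] for name, res in regression_results.items()}

# Find the key with the largest value
best_model = max(r2_by_model, key=r2_by_model.get)

# setdefault() returns the existing value, or inserts the default if missing
n_obs = regression_results["Model1"].setdefault("n", 1000)  # hypothetical field
```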
04 - Sets
Core Question: How to handle unique values and set operations?
Core Content:
- Set creation: {1, 2, 3}, set() (note that an empty {} creates a dictionary, not a set)
- Set characteristics: unordered, unique, mutable
- Set operations:
  - Add/remove: add(), remove(), discard()
  - Set operations: union (|), intersection (&), difference (-), symmetric difference (^)
  - Subset testing: issubset(), issuperset()
- Deduplication
- Membership testing (high efficiency)
Practical Application:
# Data deduplication
all_ids = [101, 102, 103, 101, 104, 102]
unique_ids = set(all_ids) # {101, 102, 103, 104}
# Find intersection of two respondent groups
group_a = {101, 102, 103, 104}
group_b = {103, 104, 105, 106}
both_groups = group_a & group_b # {103, 104}
# Fast membership checking (faster than lists)
if 101 in unique_ids:
    print("ID 101 exists")
Research Scenarios:
- Data deduplication
- Sample matching (intersection)
- Difference analysis (difference set)
- ID duplicate checking
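The remaining set operations (union, difference, symmetric difference, subset testing) and the contrast between `remove()` and `discard()` can be sketched with the same two respondent groups. The ID `999` is an arbitrary value assumed to be absent from both groups.

```python
group_a = {101, 102, 103, 104}
group_b = {103, 104, 105, 106}

either_group = group_a | group_b   # union: in A or B
only_a = group_a - group_b         # difference: in A but not in B
in_one_only = group_a ^ group_b    # symmetric difference: in exactly one group

# Subset testing
is_sub = {103, 104}.issubset(group_a)

# discard() silently ignores a missing element; remove() raises KeyError
group_a.discard(999)
try:
    group_a.remove(999)
except KeyError:
    remove_failed = True
```

For sample matching, `only_a` answers "who responded in wave A but dropped out of wave B", which is a common attrition check.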
05 - Summary and Review
Content:
- Comparison table of four data structures
- Selection decision tree
- Comprehensive practice problems
- Performance comparison
- Common errors and best practices
Four Data Structures Comparison
| Feature | List | Tuple | Dict | Set |
|---|---|---|---|---|
| Syntax | [...] | (...) | {k:v, ...} | {...} |
| Ordered | ✓ | ✓ | ✓ (3.7+ maintains insertion order) | ✗ |
| Mutable | ✓ | ✗ | ✓ | ✓ |
| Duplicates | ✓ | ✓ | Keys unique, values can repeat | ✗ |
| Indexing | Integer index | Integer index | Key index | No index |
| Typical Use | Ordered collection | Immutable collection | Key-value mapping | Unique values, set operations |
Selection Guide
When to use lists?
- Need to store ordered elements (grades, years, prices)
- Need to modify data (add, delete, sort)
- Need to access by index
When to use tuples?
- Data should not be modified (configuration, constants)
- As dictionary keys
- Function returns multiple values
- Performance priority (slightly lighter than lists, though the difference is usually small)
When to use dictionaries?
- Need to look up data by name/ID
- Store attributes (name → properties)
- Counting, mapping, lookup tables
When to use sets?
- Need deduplication
- Need set operations (intersection/union/difference)
- Fast membership testing
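As one illustration of the guide above, counting and lookup tables are typical dictionary tasks. The survey responses and region codes below are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses
majors = ["Economics", "Sociology", "Economics", "Politics", "Economics"]

# Counting with a plain dictionary and get()
counts = {}
for m in majors:
    counts[m] = counts.get(m, 0) + 1

# Counter from the standard library does the same in one call
counts_counter = Counter(majors)

# Lookup table: map a numeric code to a label
region_labels = {1: "North", 2: "South"}
label = region_labels.get(2, "Unknown")
```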
How to Study This Chapter?
Learning Roadmap
Day 1 (3 hours): Lists
- Read 01 - Lists
- Practice indexing, slicing, methods
- Write list comprehensions
Day 2 (2 hours): Tuples
- Read 02 - Tuples
- Understand immutability
- Practice tuple unpacking
Day 3 (3 hours): Dictionaries
- Read 03 - Dictionaries
- Create nested dictionaries
- Practice dictionary iteration and methods
Day 4 (2 hours): Sets
- Read 04 - Sets
- Practice set operations
- Data deduplication practice
Day 5 (2 hours): Review and comprehensive application
- Complete 05 - Summary and Review
- Comprehensive practice problems
- Compare four structures
Total Time: 12 hours (1-2 weeks)
Minimal Learning Path
If time is limited:
Must Learn (core structures, 8 hours):
- 01 - Lists (complete)
- 03 - Dictionaries (complete)
- 02 - Tuples (basics)
- 04 - Sets (deduplication)
Optional (advanced techniques):
- List comprehensions
- Nested dictionaries
- Set operations
- Named tuples
Study Recommendations
Start from Use Cases
- Think: "Which structure fits my research data?"
- Map Stata/R data organization to Python
- Practice with real data
Comparative Learning
# List: ordered, mutable
grades = [85, 92, 78]
grades.append(95)  # Can modify
# Tuple: ordered, immutable
config = (85, 92, 78)
# config.append(95)  # Error! Cannot modify
# Dictionary: key-value pairs
student = {"name": "Alice", "grade": 85}
# Set: unique values
unique_grades = {85, 92, 78, 85}  # {85, 92, 78}
Performance Awareness
- List search: O(n), slow for large lists because elements are checked one by one
- Dictionary/set search: O(1) on average, fast because lookups use hashing
- Choose appropriate structure for large datasets
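The difference in lookup cost can be observed with a rough timing sketch using the standard `timeit` module. The collection size and repetition count are arbitrary choices, and absolute timings will vary by machine.

```python
import timeit

n = 100_000
ids_list = list(range(n))
ids_set = set(ids_list)
target = n - 1  # worst case for the list: the last element

# Time 100 membership checks against each structure
list_time = timeit.timeit(lambda: target in ids_list, number=100)
set_time = timeit.timeit(lambda: target in ids_set, number=100)

print(f"list: {list_time:.5f}s, set: {set_time:.5f}s")
```

On typical hardware the set lookup should be orders of magnitude faster, since the list must scan every element before reaching the target.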
Avoid Common Errors
- Don't confuse [] (list), () (tuple), {} (dictionary/set)
- Remember list indexing starts at 0
- Accessing a non-existent dictionary key with dict[key] raises a KeyError; use get() for safety
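These pitfalls are easy to reproduce in a few lines. The sketch below uses made-up values and shows what each bracket type creates and how a missing key behaves.

```python
# Bracket types create different structures
as_list = [1, 2, 3]
as_tuple = (1, 2, 3)
as_set = {1, 2, 3}
as_dict = {"a": 1}
empty = {}               # an empty dict, not an empty set

# Indexing starts at 0
first = as_list[0]       # 1

# Missing keys: [] raises KeyError, get() returns a default
student = {"name": "Alice", "age": 22}
try:
    student["gpa"]
except KeyError:
    missing = True
gpa = student.get("gpa")               # None
gpa_default = student.get("gpa", 0.0)  # 0.0
```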
Common Questions
Q: Why so many data structures? Can't we just use lists?
A: Different structures have different advantages. Lists are good for ordered data, dictionaries for lookups, sets for deduplication. Choosing the right structure makes code more efficient and readable.
Q: What's the Stata/R equivalent of Python dictionaries?
A:
- Stata has no direct equivalent (closest is value labels)
- R's named list is similar to dictionaries
Q: When should I use list comprehensions?
A: When you need a loop to create a list, comprehensions are more concise. But if the logic is complex, regular loops are clearer.
Q: What's the difference between sets and list deduplication?
A: For a list named items, set(items) is the simplest and usually the fastest way to deduplicate, but it loses the original order. If you need to preserve order, use list(dict.fromkeys(items)).
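A minimal sketch of the two deduplication approaches, using a made-up list of IDs:

```python
ids = [103, 101, 103, 102, 101]

unique_unordered = set(ids)                # order is not guaranteed
unique_ordered = list(dict.fromkeys(ids))  # keeps first-seen order

print(unique_ordered)
```

The `dict.fromkeys` approach relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward.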
Q: Why must dictionary keys be immutable types?
A: Because dictionaries are implemented with hash tables, keys must be hashable, which in practice means immutable built-in types. So tuples work (as long as everything inside them is hashable too), but lists don't.
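The hashability requirement can be checked directly. This sketch reuses the illustrative model keys from the tuples section.

```python
results = {}
results[("Model1", "OLS")] = 0.85       # tuple key: works

try:
    results[["Model1", "OLS"]] = 0.85   # list key: fails
except TypeError as err:
    message = str(err)                  # message mentions an unhashable type

# A tuple that contains a list is not hashable either
try:
    hash(("Model1", ["OLS"]))
except TypeError:
    nested_unhashable = True
```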
Next Steps
After completing this chapter, you will have mastered:
- Python's four core data structures
- How to choose the appropriate structure for organizing data
- Efficient data manipulation methods
In Module 5, we'll learn about functions and modules, making code more modular and reusable.
In Modules 6-7, we'll learn Pandas, which integrates all these structures into the powerful DataFrame!
Keep going! Mastering data structures puts you one step away from real data analysis!