Module 5: Functions and Modules
Making Code Reusable — From Repetitive Tasks to Elegant Programming
Chapter Overview
When writing code, you'll often find many operations need to be executed repeatedly: calculating statistics, cleaning data, running regressions. Functions allow you to encapsulate these operations for reuse anytime, avoiding code duplication. Modules let you organize and reuse functions written by others (or yourself). Mastering functions and modules will elevate your code from "scripts" to "programs."
Learning Objectives
After completing this chapter, you will be able to:
- ✅ Understand the concept and purpose of functions
- ✅ Define and call your own functions
- ✅ Master parameter passing (positional, keyword, default parameters)
- ✅ Use `*args` and `**kwargs` to handle variable arguments
- ✅ Understand Lambda functions and their use cases
- ✅ Import and use standard libraries and third-party modules
- ✅ Organize your own code into modules and packages
- ✅ Compare Python functions with Stata's `program` and R's `function()`
Chapter Contents
5.2 - Function Basics
Core Question: How to make code reusable?
Core Content:
- Function definition and calling
- Parameters and return values
- Docstrings
- Local vs. global variables
- The meaning of `None`
- Comparison with Stata (`program define`) and R (`function()`)
Practical Application:
```python
def calculate_bmi(weight, height):
    """Calculate body mass index (BMI)

    Parameters:
        weight: Weight in kilograms
        height: Height in meters

    Returns:
        BMI value
    """
    bmi = weight / (height ** 2)
    return bmi

# Call the function
result = calculate_bmi(70, 1.75)
print(f"BMI: {result:.2f}")  # BMI: 22.86
```
Research Scenarios:
- Calculate statistics (mean, standard deviation, quantiles)
- Data cleaning (handling missing values, outlier detection)
- Repeated analysis (multiple models, multiple subsamples)
- Robustness checks
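The ideas of local vs. global variables and the meaning of `None` can be illustrated with a minimal sketch. The names `rate`, `add_tax`, and `log_price` are hypothetical and chosen only for this example.

```python
rate = 0.25  # global variable, defined at module level


def add_tax(price):
    """Return the price plus tax, using a local rate."""
    rate = 0.5  # local variable: shadows the global `rate` inside this function only
    return price * (1 + rate)


def log_price(price):
    """Print the price; there is no return statement."""
    print(f"Price: {price}")


taxed = add_tax(100)      # uses the local rate: 150.0
nothing = log_price(100)  # prints, but the call evaluates to None

print(taxed)            # 150.0
print(rate)             # 0.25 (the global value is unchanged)
print(nothing is None)  # True
```

Assigning to a name inside a function creates a local variable, so the global `rate` is untouched. A function that never reaches a `return` statement hands back `None`.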
5.3 - Function Arguments
Core Question: How to flexibly pass data to functions?
Core Content:
- Positional arguments: Pass by order
- Keyword arguments: Pass by name
- Default arguments: Provide default values
- Variable positional arguments (`*args`): Accept any number of positional arguments
- Variable keyword arguments (`**kwargs`): Accept any number of keyword arguments
- Parameter order rules
- Argument unpacking (`*`, `**`)
Practical Application:
```python
def run_regression(y, X, model_type="OLS", robust=True, **options):
    """Run regression analysis

    Parameters:
        y: Dependent variable
        X: Independent variables
        model_type: Model type (default OLS)
        robust: Whether to use robust standard errors (default True)
        **options: Additional options
    """
    print(f"Model: {model_type}")
    print(f"Robust standard errors: {robust}")
    print(f"Other options: {options}")

# Placeholder data so the calls below run (illustrative values only)
y = [1.0, 2.0, 3.0]
X = [[1, 0], [0, 1], [1, 1]]

# Flexible calling
run_regression(y, X)                                   # Use default parameters
run_regression(y, X, model_type="Logit")               # Change model type
run_regression(y, X, robust=False, cluster="firm_id")  # Pass extra options
```
Research Scenarios:
- Provide default configurations (significance level, iteration count)
- Flexible model options
- Batch processing (pass multiple parameter combinations)
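As a rough sketch of `*args`, `**kwargs`, the parameter order rules, and argument unpacking, consider the function below. `describe` and its parameters are hypothetical names invented for this illustration.

```python
def describe(label, *values, digits=2, **extras):
    """Summarize any number of values.

    Signature order: positional parameters, then *args,
    then keyword-only parameters with defaults, then **kwargs.
    """
    mean = sum(values) / len(values)
    summary = {"label": label, "n": len(values), "mean": round(mean, digits)}
    summary.update(extras)  # any extra keyword arguments are stored as-is
    return summary


# Extra positional arguments are collected into the tuple `values`
s1 = describe("income", 50000, 60000, 75000)

# Unpacking: * spreads a list into positional arguments,
# ** spreads a dict into keyword arguments
incomes = [50000, 60000, 75000]
options = {"digits": 1, "unit": "USD"}
s2 = describe("income", *incomes, **options)

print(s1)
print(s2)
```

Because `digits` comes after `*values`, it can only be passed by keyword, which keeps it from being mistaken for one more data value.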
5.4 - Lambda Functions
Core Question: When to use anonymous functions?
Core Content:
- Lambda syntax: `lambda arguments: expression`
- Comparison with regular functions
- Use cases:
  - Sorting (`sorted()`, `list.sort()`)
  - Mapping (`map()`)
  - Filtering (`filter()`)
  - Pandas data operations
- Limitations of Lambda
Practical Application:
```python
import numpy as np
import pandas as pd

# Sort student list by income
students = [
    {"name": "Alice", "income": 50000},
    {"name": "Bob", "income": 75000},
    {"name": "Carol", "income": 60000}
]
sorted_students = sorted(students, key=lambda x: x["income"])

# Pandas data cleaning (df is built from the list above so the example runs)
df = pd.DataFrame(students)
df["log_income"] = df["income"].apply(lambda x: np.log(x) if x > 0 else np.nan)

# Filter high earners
high_earners = list(filter(lambda x: x["income"] > 60000, students))
```
When to Use Lambda?
- ✅ Simple one-time functions
- ✅ Sorting, mapping, filtering
- ✅ Pandas `apply()` operations
- ❌ Complex logic (use regular functions)
- ❌ Need docstrings (use regular functions)
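To make the comparison with regular functions concrete, here is a small sketch. The names `to_thousands` and `to_thousands_def` are made up for illustration, and binding a lambda to a name is done only to put the two forms side by side (a `def` is normally preferred in that case).

```python
incomes = [50000, 75000, 60000]

# The same operation written as a lambda and as a regular function
to_thousands = lambda x: x / 1000


def to_thousands_def(x):
    """Convert an amount to thousands."""
    return x / 1000


# Typical lambda use: a short, throwaway function passed to another function
in_thousands = list(map(to_thousands, incomes))  # [50.0, 75.0, 60.0]
descending = sorted(incomes, key=lambda x: -x)   # [75000, 60000, 50000]

# Limitations: a lambda holds a single expression, has a generic name,
# and carries no docstring
print(to_thousands.__name__)      # <lambda>
print(to_thousands.__doc__)       # None
print(to_thousands_def.__doc__)   # Convert an amount to thousands.
```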
5.5 - Modules and Packages
Core Question: How to organize and reuse code?
Core Content:
- Module: Single `.py` file
- Package: Folder containing multiple modules
- Import methods:
  - `import module`
  - `from module import function`
  - `import module as alias`
  - `from module import *` (not recommended)
- Standard library introduction:
  - `math`: Mathematical functions
  - `statistics`: Statistical functions
  - `random`: Random number generation
  - `datetime`: Date and time handling
  - `os`, `sys`: System operations
- Third-party library installation (`pip install`)
- Creating your own modules and packages
Practical Application:
```python
# Import standard library
import math
import statistics as stats
from datetime import datetime

# Use mathematical functions
log_income = math.log(50000)
sqrt_value = math.sqrt(100)

# Calculate statistics
mean_income = stats.mean([50000, 60000, 75000])
median_income = stats.median([50000, 60000, 75000])

# Import data analysis libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
```
```python
# Create your own module
# my_stats.py
import numpy as np  # the module needs its own import

def winsorize(data, lower=0.01, upper=0.99):
    """Winsorization: clip values to the lower and upper quantiles"""
    p_lower = np.quantile(data, lower)
    p_upper = np.quantile(data, upper)
    return np.clip(data, p_lower, p_upper)
```
```python
# Import in other files (my_stats.py must be in the same folder or on the import path)
from my_stats import winsorize

incomes = [50000, 60000, 75000, 1000000]  # illustrative data
clean_data = winsorize(incomes)
```
Research Scenarios:
- Use NumPy for numerical computation
- Use Pandas for data processing
- Use statsmodels for regression
- Organize your own data cleaning function library
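The import methods listed above can be tried with the standard library alone. This is a minimal sketch; the variable names are arbitrary and the alias `rnd` is just an example.

```python
import math                    # import module
from statistics import median  # from module import function
import random as rnd           # import module as alias
from datetime import date

# `from module import *` is avoided here: it copies every public name into
# the current namespace and can silently overwrite names you already use.

root = math.sqrt(144)    # 12.0, accessed through the module name
mid = median([3, 1, 2])  # 2, the function name is used directly

rnd.seed(42)  # fixing the seed makes the random draw reproducible
sample = [rnd.randint(1, 100) for _ in range(5)]

# Date arithmetic: days between two dates (2024 is a leap year)
days = (date(2024, 3, 1) - date(2024, 1, 1)).days  # 60

print(root, mid, days)
print(sample)
```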
5.6 - Summary and Review
Content:
- Function design best practices
- Common errors and debugging tips
- Module organization recommendations
- Comprehensive exercises
- Stata/R/Python comparison
Stata vs R vs Python Function Comparison
Defining Functions
| Language | Syntax |
|---|---|
| Stata | program define func_name |
| R | func_name <- function(args) { ... } |
| Python | def func_name(args): ... |
Example: Calculate BMI
Stata:
```stata
program define calc_bmi
    args weight height
    gen bmi = `weight' / (`height'^2)
end

calc_bmi 70 1.75
```
R:
```r
calc_bmi <- function(weight, height) {
  bmi <- weight / (height^2)
  return(bmi)
}

result <- calc_bmi(70, 1.75)
```
Python:
```python
def calc_bmi(weight, height):
    bmi = weight / (height ** 2)
    return bmi

result = calc_bmi(70, 1.75)
```
How to Learn This Chapter?
Learning Path
Day 1 (3 hours): Function Basics
- 📖 Read 5.2 - Function Basics
- 💻 Define simple functions
- 📝 Understand return values and docstrings
Day 2 (3 hours): Function Arguments
- 📖 Read 5.3 - Function Arguments
- 💻 Practice positional/keyword/default parameters
- 📝 Understand `*args` and `**kwargs`
Day 3 (2 hours): Lambda Functions
- 📖 Read 5.4 - Lambda Functions
- 💻 Use Lambda for sorting and filtering
- 📝 Use Lambda in Pandas
Day 4 (3 hours): Modules and Packages
- 📖 Read 5.5 - Modules and Packages
- 💻 Import and use standard libraries
- 📝 Create your own modules
Day 5 (2 hours): Review and Comprehensive Application
- 📖 Complete 5.6 - Summary and Review
- 💻 Write a complete analysis function library
- 📝 Organize code into modules
Total Time: 13 hours (1-2 weeks)
Minimal Learning Path
If time is limited:
Must Learn (Core concepts, 8 hours):
- ✅ 5.2 - Function Basics (complete study)
- ✅ 5.3 - Function Arguments (positional/keyword/default)
- ✅ 5.5 - Modules and Packages (import standard libraries)
Optional (Advanced techniques):
- 📌 `*args` and `**kwargs`
- 📌 Lambda functions
- 📌 Creating your own packages
Learning Recommendations
Start from Requirements
- Discover repeated code → Extract into function
- Need configuration options → Use default parameters
- Code getting too long → Split into modules
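One way the first two steps might look in practice: the same summary is needed for several groups, so it is extracted into one function, and the rounding precision becomes a default parameter. `summarize`, `treated`, and `control` are hypothetical names with made-up data.

```python
from statistics import mean, stdev


def summarize(values, digits=2):
    """Return n, mean, and sample standard deviation, rounded to `digits`."""
    return {
        "n": len(values),
        "mean": round(mean(values), digits),
        "sd": round(stdev(values), digits),
    }


# Illustrative data for two subsamples
treated = [40, 60, 80]
control = [45, 50, 55]

# One function call per group replaces copy-pasted calculation code
results = {
    name: summarize(group)
    for name, group in [("treated", treated), ("control", control)]
}

print(results["treated"])  # {'n': 3, 'mean': 60, 'sd': 20.0}
print(results["control"])  # {'n': 3, 'mean': 50, 'sd': 5.0}
```

If this function and its siblings grow, moving them into their own `.py` file is the natural third step.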
Function Design Principles
- Single Responsibility: One function does one thing
- Meaningful Names: `calculate_bmi()` is better than `func1()`
- Write Docstrings: Explain functionality, parameters, return values
- Avoid Side Effects: Try not to modify global variables
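A sketch of what avoiding side effects can mean in practice, assuming the data is a list of dictionaries. Both function names are invented for this example. The first version changes data that lives outside the function, and the second returns a new result and leaves its input alone.

```python
import math


def add_log_income_inplace(records):
    """Side effect: adds a key to the caller's dictionaries."""
    for r in records:
        r["log_income"] = math.log(r["income"])


def with_log_income(records):
    """No side effect: builds and returns new dictionaries."""
    return [{**r, "log_income": math.log(r["income"])} for r in records]


raw = [{"name": "Alice", "income": 50000}, {"name": "Bob", "income": 75000}]

enriched = with_log_income(raw)
print("log_income" in raw[0])       # False: the original data is untouched
print("log_income" in enriched[0])  # True

add_log_income_inplace(raw)
print("log_income" in raw[0])       # True: the caller's data was changed
```

The second style is usually easier to test and reuse, because the result depends only on the arguments passed in.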
Practice Project
Create a `my_stats.py` module containing:
```python
# my_stats.py
import numpy as np


def mean(data):
    """Calculate mean"""
    return sum(data) / len(data)


def std(data):
    """Calculate standard deviation (population formula, divides by n)"""
    m = mean(data)
    variance = sum((x - m) ** 2 for x in data) / len(data)
    return variance ** 0.5


def winsorize(data, lower=0.01, upper=0.99):
    """Winsorization: clip values to the lower and upper quantiles"""
    p_lower = np.quantile(data, lower)
    p_upper = np.quantile(data, upper)
    return np.clip(data, p_lower, p_upper)
```
Comparative Learning
- Reproduce Stata's `program` in Python
- Reproduce R's `function()` in Python
- Understand the philosophy behind syntax differences
Common Questions
Q: What's the difference between functions and variables? A: Variables store data, functions store operations. Variables are "nouns," functions are "verbs."
Q: When should I write a function? A: Follow the DRY principle (Don't Repeat Yourself). If a piece of code is used more than 3 times, it should be extracted into a function.
Q: Why do we need docstrings? A: In a few weeks, you'll forget how the function works. Docstrings are documentation for your future self (and collaborators).
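As a small illustration of how a docstring pays off later, the text can be read back without opening the source file. The sketch reuses the chapter's `calculate_bmi` function with a shortened docstring.

```python
import inspect


def calculate_bmi(weight, height):
    """Calculate BMI from weight (kg) and height (m)."""
    return weight / (height ** 2)


# The docstring is stored on the function object itself
print(calculate_bmi.__doc__)

# inspect.getdoc() returns the same text with indentation cleaned up;
# in an interactive session, help(calculate_bmi) displays it as well
doc = inspect.getdoc(calculate_bmi)
print(doc)
```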
Q: What's the difference between `return` and `print()`? A:
- `return`: Returns a value to the caller, which can be assigned to a variable
- `print()`: Only displays output on screen; it does not return a value
```python
def bad_function(x):
    print(x * 2)  # Only prints, doesn't return

def good_function(x):
    return x * 2  # Returns value

result = bad_function(5)   # prints 10; result = None
result = good_function(5)  # result = 10
```
Q: When to use Lambda vs regular functions? A:
- Lambda: Simple one-line expressions (sorting, mapping, filtering)
- Regular function: Complex logic, multiple lines of code, needs docstring
Q: How to find useful third-party libraries? A:
- Data analysis: `numpy`, `pandas`, `statsmodels`
- Data visualization: `matplotlib`, `seaborn`, `plotly`
- Machine learning: `scikit-learn`, `xgboost`, `lightgbm`
- Text analysis: `nltk`, `spaCy`, `transformers`
- Web scraping: `requests`, `beautifulsoup4`, `scrapy`
Next Steps
After completing this chapter, you will master:
- ✅ Writing reusable functions
- ✅ Flexibly using parameter passing
- ✅ Importing and organizing modules
- ✅ Elevating code from "scripts" to "programs"
In Modules 6-7, we'll learn Pandas, the core library for Python data analysis, integrating all the concepts we've learned!
In Modules 8-9, we'll learn data visualization and advanced data processing techniques.
Congratulations! After completing the first 5 modules, you've mastered Python's core syntax! Now it's time for practical data analysis!