Module 5: Functions and Modules

Making Code Reusable — From Repetitive Tasks to Elegant Programming

Chapter Overview

When writing code, you'll often find many operations need to be executed repeatedly: calculating statistics, cleaning data, running regressions. Functions allow you to encapsulate these operations for reuse anytime, avoiding code duplication. Modules let you organize and reuse functions written by others (or yourself). Mastering functions and modules will elevate your code from "scripts" to "programs."

Learning Objectives

After completing this chapter, you will be able to:

✅ Understand the concept and purpose of functions
✅ Define and call your own functions
✅ Master parameter passing (positional, keyword, default parameters)
✅ Use *args and **kwargs to handle variable arguments
✅ Understand Lambda functions and their use cases
✅ Import and use standard libraries and third-party modules
✅ Organize your own code into modules and packages
✅ Compare Python's def with Stata's program and R's function()

Chapter Contents

5.2 - Function Basics

Core Question: How to make code reusable?

Core Content:

Function definition and calling
Parameters and return values
Docstrings
Local vs. global variables
The meaning of None
Comparison with Stata (program define) and R (function())

Practical Application:

python

def calculate_bmi(weight, height):
    """Calculate BMI index

    Parameters:
        weight: Weight in kilograms
        height: Height in meters

    Returns:
        BMI value
    """
    bmi = weight / (height ** 2)
    return bmi

# Call the function
result = calculate_bmi(70, 1.75)
print(f"BMI: {result:.2f}")  # BMI: 22.86
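
Two other items from the Core Content list, local vs. global variables and the meaning of None, can be illustrated with a minimal sketch. The names `counter`, `increment_local`, and `log_message` are hypothetical and chosen only for this illustration.

```python
counter = 0  # global (module-level) variable


def increment_local():
    """Assign to a local name; the global 'counter' is left untouched."""
    counter = 100  # this creates a new local variable inside the function
    return counter


def log_message(message):
    """Print a message. There is no return statement, so the call evaluates to None."""
    print(message)


local_value = increment_local()
returned = log_message("cleaning data...")

print(local_value)       # 100 (the local value)
print(counter)           # 0 (the global is unchanged)
print(returned is None)  # True
```

Assigning to a name inside a function creates a local variable by default, which is why the global `counter` keeps its original value.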

Research Scenarios:

Calculate statistics (mean, standard deviation, quantiles)
Data cleaning (handling missing values, outlier detection)
Repeated analysis (multiple models, multiple subsamples)
Robustness checks

5.3 - Function Arguments

Core Question: How to flexibly pass data to functions?

Core Content:

Positional arguments: Pass by order
Keyword arguments: Pass by name
Default arguments: Provide default values
Variable positional arguments (*args): Accept any number of positional arguments
Variable keyword arguments (**kwargs): Accept any number of keyword arguments
Parameter order rules
Argument unpacking (*, **)

Practical Application:

python

def run_regression(y, X, model_type="OLS", robust=True, **options):
    """Run regression analysis

    Parameters:
        y: Dependent variable
        X: Independent variables
        model_type: Model type (default OLS)
        robust: Whether to use robust standard errors (default True)
        **options: Additional options
    """
    print(f"Model: {model_type}")
    print(f"Robust standard errors: {robust}")
    print(f"Other options: {options}")

# Flexible calling (y and X are assumed to be defined earlier)
run_regression(y, X)  # Use default parameters
run_regression(y, X, model_type="Logit")  # Change model type
run_regression(y, X, robust=False, cluster="firm_id")  # Pass extra options
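
The example above covers default arguments and **options. The remaining items, *args, parameter order, and argument unpacking, might look like the following sketch. The function `summarize` and its `digits` and `unit` options are hypothetical, not part of any library.

```python
def summarize(label, *values, digits=2, **meta):
    """Return a one-line summary of any number of numeric values.

    Parameter order: positional, *args, keyword-only with default, **kwargs.
    """
    mean = sum(values) / len(values)
    extras = ", ".join(f"{key}={value}" for key, value in meta.items())
    summary = f"{label}: n={len(values)}, mean={mean:.{digits}f}"
    return f"{summary} ({extras})" if extras else summary


# Pass values directly as positional arguments
direct = summarize("income", 50000, 60000, 75000)

# Unpack a list with * and a dict with **
incomes = [50000, 60000, 75000]
settings = {"digits": 1, "unit": "USD"}
unpacked = summarize("income", *incomes, **settings)

print(direct)    # income: n=3, mean=61666.67
print(unpacked)  # income: n=3, mean=61666.7 (unit=USD)
```

Because `digits` comes after `*values`, it can only be passed by keyword, which keeps it from being mistaken for one of the data values.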

Research Scenarios:

Provide default configurations (significance level, iteration count)
Flexible model options
Batch processing (pass multiple parameter combinations)

5.4 - Lambda Functions

Core Question: When to use anonymous functions?

Core Content:

Lambda syntax: lambda arguments: expression
Comparison with regular functions
Use cases:
- Sorting (sorted(), list.sort())
- Mapping (map())
- Filtering (filter())
- Pandas data operations
Limitations of Lambda

Practical Application:

python

# Sort student list by income
students = [
    {"name": "Alice", "income": 50000},
    {"name": "Bob", "income": 75000},
    {"name": "Carol", "income": 60000}
]

sorted_students = sorted(students, key=lambda x: x["income"])

# Pandas data cleaning (assumes numpy is imported as np and df is an existing DataFrame)
df["log_income"] = df["income"].apply(lambda x: np.log(x) if x > 0 else np.nan)

# Filter high earners
high_earners = list(filter(lambda x: x["income"] > 60000, students))

When to Use Lambda?

✅ Simple one-time functions
✅ Sorting, mapping, filtering
✅ Pandas apply() operations
❌ Complex logic (use regular functions)
❌ Need docstrings (use regular functions)
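
As a rough illustration of the checklist above, the sketch below uses a lambda for a one-line mapping and a regular function once the logic needs a branch and a docstring. The function `income_bracket` and its `threshold` parameter are hypothetical.

```python
incomes = [50000, 75000, 60000]

# Simple one-time transformation: a lambda with map() is enough
in_thousands = list(map(lambda x: x / 1000, incomes))


# Branching logic plus documentation: use a regular function
def income_bracket(income, threshold=60000):
    """Classify an income as 'high' or 'low' relative to a threshold."""
    if income > threshold:
        return "high"
    return "low"


brackets = [income_bracket(x) for x in incomes]

print(in_thousands)  # [50.0, 75.0, 60.0]
print(brackets)      # ['low', 'high', 'low']
```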

5.5 - Modules and Packages

Core Question: How to organize and reuse code?

Core Content:

Module: Single .py file
Package: Folder containing multiple modules
Import methods:
- import module
- from module import function
- import module as alias
- from module import * (not recommended)
Standard library introduction:
- math: Mathematical functions
- statistics: Statistical functions
- random: Random number generation
- datetime: Date and time handling
- os, sys: System operations
Third-party library installation (pip install)
Creating your own modules and packages

Practical Application:

python

# Import standard library
import math
import statistics as stats
from datetime import datetime

# Use mathematical functions
log_income = math.log(50000)
sqrt_value = math.sqrt(100)

# Calculate statistics
mean_income = stats.mean([50000, 60000, 75000])
median_income = stats.median([50000, 60000, 75000])

# Import data analysis libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm

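# Create your own module
# --- my_stats.py ---
import numpy as np  # the module file needs its own import

def winsorize(data, lower=0.01, upper=0.99):
    """Winsorize: clip values to the [lower, upper] quantiles"""
    p_lower = np.quantile(data, lower)
    p_upper = np.quantile(data, upper)
    return np.clip(data, p_lower, p_upper)

# --- another file in the same folder ---
from my_stats import winsorize
clean_data = winsorize(incomes)  # incomes: a list or array of numbers defined elsewhere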
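
The standard library list above also mentions random, datetime, and os, which the example does not exercise. A minimal sketch of how they might appear in a research workflow follows; the sample size, dates, and file name are made up for illustration.

```python
import os
import random
from datetime import datetime, timedelta

# random: draw a reproducible sample of respondent IDs
random.seed(42)  # fixing the seed makes the draw repeatable
sample_ids = random.sample(range(1, 101), k=5)  # 5 distinct IDs from 1..100

# datetime: compute a follow-up date 90 days after the survey start
survey_start = datetime(2024, 1, 15)
follow_up = survey_start + timedelta(days=90)
follow_up_label = follow_up.strftime("%Y-%m-%d")

# os: split a file name into its stem and extension
stem, extension = os.path.splitext("survey_2024.csv")

print(len(sample_ids))   # 5
print(follow_up_label)   # 2024-04-14
print(stem, extension)   # survey_2024 .csv
```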

Research Scenarios:

Use NumPy for numerical computation
Use Pandas for data processing
Use statsmodels for regression
Organize your own data cleaning function library

5.6 - Summary and Review

Content:

Function design best practices
Common errors and debugging tips
Module organization recommendations
Comprehensive exercises
Stata/R/Python comparison

Stata vs R vs Python Function Comparison

Defining Functions

Language	Syntax
Stata	`program define func_name`
R	`func_name <- function(args) { ... }`
Python	`def func_name(args): ...`

Example: Calculate BMI

Stata:

stata

program define calc_bmi
    args weight height
    gen bmi = `weight' / (`height'^2)
end

calc_bmi 70 1.75

R:

r

calc_bmi <- function(weight, height) {
  bmi <- weight / (height^2)
  return(bmi)
}

result <- calc_bmi(70, 1.75)

Python:

python

def calc_bmi(weight, height):
    bmi = weight / (height ** 2)
    return bmi

result = calc_bmi(70, 1.75)

How to Learn This Chapter?

Learning Path

Day 1 (3 hours): Function Basics

📖 Read 5.2 - Function Basics
💻 Define simple functions
📝 Understand return values and docstrings

Day 2 (3 hours): Function Arguments

📖 Read 5.3 - Function Arguments
💻 Practice positional/keyword/default parameters
📝 Understand *args and **kwargs

Day 3 (2 hours): Lambda Functions

📖 Read 5.4 - Lambda Functions
💻 Use Lambda for sorting and filtering
📝 Use Lambda in Pandas

Day 4 (3 hours): Modules and Packages

📖 Read 5.5 - Modules and Packages
💻 Import and use standard libraries
📝 Create your own modules

Day 5 (2 hours): Review and Comprehensive Application

📖 Complete 5.6 - Summary and Review
💻 Write a complete analysis function library
📝 Organize code into modules

Total Time: 13 hours (1-2 weeks)

Minimal Learning Path

If time is limited:

Must Learn (Core concepts, 8 hours):

✅ 5.2 - Function Basics (complete study)
✅ 5.3 - Function Arguments (positional/keyword/default)
✅ 5.5 - Modules and Packages (import standard libraries)

Optional (Advanced techniques):

📌 *args and **kwargs
📌 Lambda functions
📌 Creating your own packages

Learning Recommendations

Start from Requirements
- Discover repeated code → Extract into function
- Need configuration options → Use default parameters
- Code getting too long → Split into modules
Function Design Principles
- Single Responsibility: One function does one thing
- Meaningful Names: calculate_bmi() is better than func1()
- Write Docstrings: Explain functionality, parameters, return values
- Avoid Side Effects: Try not to modify global variables
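
The "Avoid Side Effects" principle can be sketched as follows. All names here (`threshold`, `flag_high_impure`, `flag_high`, `drop_missing`) are hypothetical and serve only to contrast the two styles.

```python
threshold = 60000  # module-level setting


def flag_high_impure(income):
    """Depends on the global 'threshold', which makes it harder to test and reuse."""
    return income > threshold


def flag_high(income, threshold=60000):
    """Everything the function needs comes in through its parameters."""
    return income > threshold


def drop_missing(values):
    """Return a new list without None entries; the input list is not modified."""
    return [v for v in values if v is not None]


raw = [50000, None, 75000]
clean = drop_missing(raw)

print(raw)    # [50000, None, 75000] (unchanged)
print(clean)  # [50000, 75000]
print(flag_high(70000), flag_high(70000, threshold=80000))  # True False
```

Returning a new list instead of editing `raw` in place means the original data stays available for other steps of the analysis.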

Practice Project
- Create a my_stats.py module containing:

python

# my_stats.py
import numpy as np

def mean(data):
    """Calculate mean"""
    return sum(data) / len(data)

def std(data):
    """Calculate population standard deviation (divides by n, not n - 1)"""
    m = mean(data)
    variance = sum((x - m) ** 2 for x in data) / len(data)
    return variance ** 0.5

def winsorize(data, lower=0.01, upper=0.99):
    """Winsorize: clip values to the [lower, upper] quantiles"""
    p_lower = np.quantile(data, lower)
    p_upper = np.quantile(data, upper)
    return np.clip(data, p_lower, p_upper)

Comparative Learning
- Reproduce Stata's program in Python
- Reproduce R's function() in Python
- Understand the philosophy behind syntax differences

Common Questions

Q: What's the difference between functions and variables? A: Variables store data, functions store operations. Variables are "nouns," functions are "verbs."

Q: When should I write a function? A: Follow the DRY principle (Don't Repeat Yourself). If a piece of code is used more than 3 times, it should be extracted into a function.
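
As a small illustration of the DRY principle, the sketch below standardizes three variables with one function instead of three copies of the same formula. The function name `standardize` and the sample values are made up for this example.

```python
def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return [(x - mean) / sd for x in values]


# One definition, reused for every variable
income_z = standardize([50000, 60000, 75000])
age_z = standardize([25, 35, 45])
score_z = standardize([60, 70, 95])

print([round(z, 4) for z in age_z])  # [-1.2247, 0.0, 1.2247]
```

If the formula ever needs to change, for example to divide by n - 1, there is only one place to edit.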

Q: Why do we need docstrings? A: In a few weeks, you'll forget how the function works. Docstrings are documentation for your future self (and collaborators).

Q: What's the difference between return and print()? A:

return: Returns value to caller, can be assigned to a variable
print(): Only displays on screen, doesn't return a value

python

def bad_function(x):
    print(x * 2)  # Only prints, doesn't return

def good_function(x):
    return x * 2  # Returns value

result = bad_function(5)  # result = None
result = good_function(5)  # result = 10

Q: When to use Lambda vs regular functions? A:

Lambda: Simple one-line expressions (sorting, mapping, filtering)
Regular function: Complex logic, multiple lines of code, needs docstring

Q: How to find useful third-party libraries? A:

Data analysis: numpy, pandas, statsmodels
Data visualization: matplotlib, seaborn, plotly
Machine learning: scikit-learn, xgboost, lightgbm
Text analysis: nltk, spaCy, transformers
Web scraping: requests, beautifulsoup4, scrapy

Next Steps

After completing this chapter, you will have mastered:

✅ Writing reusable functions
✅ Flexibly using parameter passing
✅ Importing and organizing modules
✅ Elevating code from "scripts" to "programs"

In Modules 6-7, we'll learn Pandas, the core library for Python data analysis, integrating all the concepts we've learned!

In Modules 8-9, we'll learn data visualization and advanced data processing techniques.

Congratulations! After completing the first 5 modules, you will have mastered Python's core syntax! Now it's time for practical data analysis!
