Module 6: Object-Oriented Programming Basics (OOP)
Understanding the Secret of df.method(): Why Data Science Needs OOP
Chapter Overview
If you've used Pandas, you're already using Object-Oriented Programming (OOP)! Every time you call df.head(), df.mean(), or model.fit(X, y), you're interacting with objects. This chapter will demystify OOP, help you understand the design philosophy of data science libraries, and teach you how to create your own classes.
Important Note: Social science students don't need to master OOP deeply, but understanding its basic concepts is essential for effectively using libraries like Pandas and Scikit-learn.
Learning Objectives
After completing this chapter, you will be able to:
- Understand the concepts of Classes and Objects
- Know why Pandas and Scikit-learn use OOP
- Understand the meaning of df.method() and df.attribute
- Create simple classes to organize code
- Use special methods (__init__, __str__, __len__)
- Compare object-oriented vs functional programming
- Build data analysis pipeline classes
Chapter Contents
01 - Introduction to Object-Oriented Programming
Core Question: Why should social science students learn OOP?
Core Content:
- You're already using OOP:
  ```python
  import pandas as pd

  df = pd.DataFrame({'age': [25, 30, 35]})
  result = df.mean()  # df is an object, mean() is a method
  ```
- OOP core concepts:
  - Object: A collection of data + methods
  - Class: A template/blueprint for objects
  - Method: A function belonging to an object
  - Attribute: Data belonging to an object
- Why we need OOP: Data and methods are bound together, making code more organized
- Understanding Pandas' OOP design:
  ```python
  df.head()    # Method call
  df.shape     # Attribute access
  df.to_csv()  # Method call
  ```
- Comparing Python (object-oriented) vs R (functional):
  - Python: df.mean()
  - R: mean(df$x)
- Creating your first class: Student class, Survey response class
- Choosing between object-oriented vs functional programming
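To make the attribute/method distinction concrete, here is a minimal sketch. `TinyColumn` is a hypothetical class written for illustration, not part of Pandas; it only mimics the pattern where stored data is read without parentheses and behavior is called with parentheses.

```python
class TinyColumn:
    """Hypothetical stand-in for a one-column table; not a Pandas class."""

    def __init__(self, values):
        self.values = list(values)      # attribute: data stored on the object
        self.n_rows = len(self.values)  # attribute: read without parentheses

    def mean(self):
        # method: a function attached to the object, called with parentheses
        return sum(self.values) / self.n_rows


ages = TinyColumn([25, 30, 35])
print(ages.n_rows)  # attribute access -> 3
print(ages.mean())  # method call -> 30.0
```

The same reading applies to Pandas: df.shape has no parentheses because it is data, while df.mean() has them because it runs a computation.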
Why It Matters?
- Understand how Pandas and Scikit-learn are used
- Know when to use classes vs functions
- Make code more maintainable and reusable
Practical Application:
```python
class SurveyResponse:
    def __init__(self, id, age, income):
        self.id = id
        self.age = age
        self.income = income

    def is_valid(self):
        return 18 <= self.age <= 100 and self.income >= 0

    def income_category(self):
        if self.income < 50000:
            return "Low Income"
        elif self.income < 100000:
            return "Middle Income"
        else:
            return "High Income"


resp = SurveyResponse(1001, 30, 75000)
print(resp.is_valid())         # True
print(resp.income_category())  # Middle Income
```

02 - Classes and Objects in Detail
Core Question: How to create and use classes?
Core Content:
- Complete class structure:
  ```python
  class ClassName:
      class_variable = "shared data"  # Class attribute

      def __init__(self, param):
          self.param = param  # Instance attribute

      def method(self):  # Instance method
          return self.param
  ```
- Instance attributes vs class attributes:
  - Instance attributes: Unique to each object (self.name)
  - Class attributes: Shared by all objects (ClassName.class_variable)
- Common special methods (magic methods):
  - __init__: Initializer, often called the constructor (runs when an object is created)
  - __str__: String representation (for print())
  - __repr__: Developer representation (for debugging)
  - __len__: Length (for len())
- @property decorator: Convert methods to attributes
  ```python
  @property
  def net_income(self):
      return self.income * 0.75

  # Usage: resp.net_income (no parentheses)
  ```
- Encapsulation: Public vs Private:
  - Public: self.balance
  - Private by convention: self._transactions (single underscore)
  - Name-mangled: self.__pin (double underscore; harder to reach from outside the class, but Python has no truly private attributes)
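The concepts in this list can be seen working together in one small class. The sketch below is illustrative only: `Survey`, its attributes, and the income figures are all made up for the example.

```python
class Survey:
    """Hypothetical example class illustrating the concepts listed above."""

    country = "Unspecified"  # class attribute: shared by every Survey

    def __init__(self, title, incomes):
        self.title = title             # public instance attribute
        self._incomes = list(incomes)  # single underscore: private by convention
        self.__token = "secret"        # double underscore: name-mangled

    def __len__(self):
        # Lets len(survey) report the number of responses
        return len(self._incomes)

    def __str__(self):
        # Used by print() and str()
        return f"{self.title} ({len(self)} responses)"

    def __repr__(self):
        # Used in the interactive console and in debuggers
        return f"Survey(title={self.title!r}, n={len(self)})"

    @property
    def mean_income(self):
        # Read like an attribute: survey.mean_income (no parentheses)
        return sum(self._incomes) / len(self._incomes)


s = Survey("Household Panel", [40000, 60000, 80000])
print(len(s))          # 3
print(s)               # Household Panel (3 responses)
print(repr(s))         # Survey(title='Household Panel', n=3)
print(s.mean_income)   # 60000.0
print(Survey.country)  # Unspecified

# Name mangling: the attribute is stored as _Survey__token, so it is
# harder to reach by accident but still accessible.
print(hasattr(s, "__token"))  # False
print(s._Survey__token)       # secret
```

The last two lines show why the double underscore is better described as name mangling than as privacy: the attribute is renamed, not locked away.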
Practical Case:
```python
class Student:
    school_name = "Peking University"  # Class attribute

    def __init__(self, id, name, major, gpa=0.0):
        self.id = id
        self.name = name
        self.major = major
        self.gpa = gpa
        self.courses = []

    def enroll_course(self, name, credits):
        self.courses.append({'name': name, 'credits': credits})

    def get_total_credits(self):
        return sum(c['credits'] for c in self.courses)

    def __str__(self):
        return f"{self.name} ({self.major}, GPA: {self.gpa})"


alice = Student(2024001, "Alice", "Economics", 3.8)
alice.enroll_course("Microeconomics", 4)
print(alice)  # Alice (Economics, GPA: 3.8)
```

03 - OOP in Data Science
Core Question: Why do data science libraries use OOP?
Core Content:
- Pandas' OOP design:
  - DataFrame and Series are both objects
  - Advantages of method chaining:
  ```python
  import numpy as np

  result = (
      df
      .query('age > 30')
      .assign(log_income=lambda x: np.log(x['income']))
      .sort_values('income')
      .reset_index(drop=True)
  )
  ```
- Scikit-learn's OOP design:
  - Unified API: fit() → predict()
  ```python
  from sklearn.linear_model import LinearRegression

  model = LinearRegression()
  model.fit(X, y)
  predictions = model.predict(X_test)
  print(model.coef_, model.intercept_)  # Access attributes
  ```
- Statsmodels' OOP design:
  ```python
  import statsmodels.formula.api as smf

  model = smf.ols('income ~ education + age', data=df)
  results = model.fit()
  print(results.summary(), results.rsquared)
  ```
- Creating your own data science classes:
  - Simple linear regression class (educational)
  - Data processing pipeline class
- OOP best practices:
  - Design chainable methods (return self)
  - Use attributes to store metadata (is_fitted, n_features)
  - Implement __repr__ for debugging
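As one possible shape for the educational linear regression class mentioned above, here is a minimal sketch in plain Python. It assumes a single predictor and uses made-up toy data. The class name `SimpleLinearRegression` is hypothetical, and the code only imitates Scikit-learn's fit()/predict() convention rather than reproducing its implementation.

```python
class SimpleLinearRegression:
    """Educational one-predictor least squares; not Scikit-learn's code."""

    def __init__(self):
        self.coef_ = None
        self.intercept_ = None
        self.is_fitted = False  # metadata stored as an attribute

    def fit(self, x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # slope = sum of cross-deviations / sum of squared x-deviations
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        self.coef_ = sxy / sxx
        self.intercept_ = mean_y - self.coef_ * mean_x
        self.is_fitted = True
        return self  # returning self allows model.fit(x, y).predict(...)

    def predict(self, x):
        if not self.is_fitted:
            raise RuntimeError("Call fit() before predict().")
        return [self.intercept_ + self.coef_ * xi for xi in x]

    def __repr__(self):
        return (f"SimpleLinearRegression(is_fitted={self.is_fitted}, "
                f"coef_={self.coef_}, intercept_={self.intercept_})")


# Toy data lying exactly on y = 2x + 1
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
model = SimpleLinearRegression().fit(x, y)
print(model.coef_, model.intercept_)  # 2.0 1.0
print(model.predict([5, 6]))          # [11.0, 13.0]
```

The sketch follows all three best practices from the list: fit() returns self so calls can be chained, is_fitted records state, and __repr__ gives a readable summary when debugging.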
Practical Case: Data Pipeline Class:
```python
class DataPipeline:
    def __init__(self, df):
        self.df = df.copy()
        self.steps = []

    def remove_missing(self):
        self.df = self.df.dropna()
        self.steps.append("remove_missing")
        return self  # Support method chaining

    def filter_age(self, min_age, max_age):
        self.df = self.df[(self.df['age'] >= min_age) &
                          (self.df['age'] <= max_age)]
        self.steps.append(f"filter_age({min_age}, {max_age})")
        return self

    def get_result(self):
        return self.df


# Method chaining
result = (
    DataPipeline(df)
    .remove_missing()
    .filter_age(18, 65)
    .get_result()
)
```

Object-Oriented vs Functional Programming
| Dimension | Functional Programming | Object-Oriented Programming |
|---|---|---|
| Organization | Functions + data separated | Data and methods bound together |
| Typical Languages | R, MATLAB | Python, Java |
| Use Cases | Simple scripts, data analysis | Large projects, library development |
| Code Style | mean(df$x) | df['x'].mean() |
Functional Style (R Style)
```python
# Data and functions separated
def calculate_tax(income, rate):
    return income * rate


def is_valid(age):
    return age >= 18


income = 75000
tax = calculate_tax(income, 0.25)
```

Object-Oriented Style (Python Style)
```python
# Data and methods bound together
class Respondent:
    def __init__(self, income, age):
        self.income = income
        self.age = age

    def calculate_tax(self, rate=0.25):
        return self.income * rate

    def is_valid(self):
        return self.age >= 18


resp = Respondent(75000, 30)
tax = resp.calculate_tax()
```

How to Study This Chapter?
Learning Roadmap
Day 1 (2 hours): OOP Introduction
- Read 01 - Introduction to Object-Oriented Programming
- Understand concepts of class, object, method, attribute
- Create your first class (Student or SurveyResponse)
Day 2 (3 hours): Classes and Objects in Detail
- Read 02 - Classes and Objects in Detail
- Learn special methods (__init__, __str__, __len__)
- Practice @property decorator
Day 3 (2 hours): OOP in Data Science
- Read 03 - OOP in Data Science
- Understand Pandas and Scikit-learn's OOP design
- Create a simple data pipeline class
Total Time: 7 hours (1 week)
Minimal Learning Path
For social science students, OOP is not a core skill. Priorities:
Must Learn (understand others' code, 4 hours):
- 01 - OOP Introduction (understand Pandas' OOP design)
- 03 - Data Science Applications (understand Scikit-learn usage)
- Know the difference between df.method() and df.attribute
Optional (write your own classes, 3 hours):
- 02 - Classes and Objects in Detail
- Special methods, @property
- Create your own classes
Can Skip (advanced topics):
- Class methods, static methods
- Inheritance, polymorphism
- Complex encapsulation design
Study Recommendations
Understanding Before Creating
- First understand how Pandas and Scikit-learn use OOP
- Then consider whether you need to create your own classes
- Most data analysis can be done with functions
From Usage to Creation
```python
# Step 1: Use others' classes
df = pd.DataFrame({'x': [1, 2, 3]})
df.mean()

# Step 2: Understand the principles
# DataFrame is a class, df is an object, mean() is a method

# Step 3: Create your own classes (optional)
class MyDataFrame:
    def __init__(self, data):
        self.data = data
```

When to Create Classes?
- Create a class when you need to encapsulate complex data + operations
- Create a class when you need reusable components (data pipelines, custom models)
- Create a class when a large project needs code organization
- Skip classes for simple scripts and one-off analysis (use functions)
Comparative Learning
- Python's OOP style: data and methods bound together
- R's functional style: data and functions separated
- Stata: almost no OOP (all commands)
Common Questions
Q: Do I have to learn OOP? Can I skip it? A: Not completely. You don't need to master it deeply, but without the basic concepts you won't be able to follow the Pandas and Scikit-learn documentation. Recommendation: understand what df.method() means; you don't have to write complex classes yourself.
Q: Why does Python use OOP while R uses functional style? A:
- Python is a general-purpose programming language, and OOP helps organize large projects
- R is a statistical language, and its function-based style stays closer to mathematical notation
- Both can accomplish data analysis, just different styles
Q: What's the difference between df.head() and head(df)? A:
- df.head(): OOP style, DataFrame object calls its own method
- head(df): Functional style, function takes DataFrame as parameter
- Python prefers the former, R prefers the latter
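The two call styles can be put side by side in plain Python. In this sketch the `head` function and the `Table` class are hypothetical and written only for illustration; neither is the Pandas or R implementation.

```python
rows = [{"age": 25}, {"age": 30}, {"age": 35}, {"age": 40}]


# Functional style: the data is passed in as an argument
def head(data, n=2):
    return data[:n]


# OOP style: the data lives inside the object that owns the method
class Table:
    def __init__(self, data):
        self.data = data

    def head(self, n=2):
        return self.data[:n]


print(head(rows))          # function takes the data as a parameter
print(Table(rows).head())  # object calls its own method
print(head(rows) == Table(rows).head())  # True
```

Both versions return the same rows; the difference is only where the data sits relative to the code that operates on it.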
Q: What is self? A: self represents the object itself. When you call resp.is_valid(), Python automatically passes resp to self, so the method can access the object's attributes.
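A short sketch can make the automatic passing of self visible. It uses a cut-down `Respondent` class with only an age attribute, an assumption made to keep the example small.

```python
class Respondent:
    def __init__(self, age):
        self.age = age

    def is_valid(self):
        return self.age >= 18


resp = Respondent(30)

# These two calls are equivalent: the first is shorthand for the second.
print(resp.is_valid())            # True
print(Respondent.is_valid(resp))  # True, with resp passed explicitly as self
```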
Q: When do I need to create my own classes? A:
- Need to: Build reusable data pipelines, encapsulate complex analysis logic, large projects
- Don't need to: Simple scripts, one-off analysis, exploratory data analysis
Q: Does OOP make code faster? A: No. OOP's advantages are in organization and maintainability, not performance. For data analysis, performance mainly depends on algorithms and libraries (NumPy, Pandas), not OOP.
Next Steps
After completing this chapter, you will have mastered:
- Basic OOP concepts (class, object, method, attribute)
- Understanding of Pandas and Scikit-learn's OOP design
- Knowing when to use classes vs functions
- Ability to create simple classes to organize code
In Module 7, we'll learn file I/O: how to read and write CSV, Excel, JSON, Stata, and other data files.
In Module 9, we'll dive deep into core data science libraries like NumPy, Pandas, and Matplotlib.
OOP is not a core skill for social science students; understanding is enough! Keep moving forward!