Module 6: Object-Oriented Programming Basics (OOP)

Understanding the Secret of df.method() — Why Data Science Needs OOP

Chapter Overview

If you've used Pandas, you're already using Object-Oriented Programming (OOP)! Every time you call df.head(), df.mean(), or model.fit(X, y), you're interacting with objects. This chapter will demystify OOP, help you understand the design philosophy of data science libraries, and teach you how to create your own classes.

Important Note: Social science students don't need to master OOP deeply, but understanding its basic concepts is essential for effectively using libraries like Pandas and Scikit-learn.

Learning Objectives

After completing this chapter, you will be able to:

Understand the concepts of Classes and Objects
Know why Pandas and Scikit-learn use OOP
Understand the meaning of df.method() and df.attribute
Create simple classes to organize code
Use special methods (__init__, __str__, __len__)
Compare object-oriented vs functional programming
Build data analysis pipeline classes

Chapter Contents

01 - Introduction to Object-Oriented Programming

Core Question: Why should social science students learn OOP?

Core Content:

You're already using OOP:

python

import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})
result = df.mean()  # df is an object, mean() is a method

OOP core concepts:
- Object: A collection of data + methods
- Class: A template/blueprint for objects
- Method: A function belonging to an object
- Attribute: Data belonging to an object

Why we need OOP: Binding data to the methods that operate on it keeps related code in one place and makes it easier to organize

Understanding Pandas' OOP design:

python

df.head()      # Method call
df.shape       # Attribute access
df.to_csv()    # Method call
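
The difference between a method call and attribute access can be sketched without Pandas. The `Table` class below is a hypothetical stand-in for a DataFrame, written only for illustration; it is not how Pandas is implemented.

```python
class Table:
    """A tiny stand-in for a DataFrame (hypothetical, for illustration only)."""

    def __init__(self, rows):
        self.rows = rows
        # Attribute: plain data stored on the object
        self.shape = (len(rows), len(rows[0]) if rows else 0)

    def head(self, n=2):
        # Method: a function attached to the object
        return self.rows[:n]


table = Table([[1, 2], [3, 4], [5, 6]])

print(table.shape)           # attribute access, no parentheses -> (3, 2)
print(table.head())          # method call, parentheses run the function -> [[1, 2], [3, 4]]
print(callable(table.head))  # True: without parentheses you get the method itself
```

A common beginner slip is writing `df.head` without parentheses, which returns the method object rather than the first rows.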

Comparing Python (object-oriented) vs R (functional):
- Python: df.mean()
- R: mean(df$x)
Creating your first class: Student class, Survey response class
Choosing between object-oriented vs functional programming

Why It Matters?

Understand how Pandas and Scikit-learn are used
Know when to use classes vs functions
Make code more maintainable and reusable

Practical Application:

python

class SurveyResponse:
    def __init__(self, id, age, income):
        self.id = id
        self.age = age
        self.income = income

    def is_valid(self):
        return 18 <= self.age <= 100 and self.income >= 0

    def income_category(self):
        if self.income < 50000:
            return "Low Income"
        elif self.income < 100000:
            return "Middle Income"
        else:
            return "High Income"

resp = SurveyResponse(1001, 30, 75000)
print(resp.is_valid())         # True
print(resp.income_category())  # Middle Income

02 - Classes and Objects in Detail

Core Question: How to create and use classes?

Core Content:

Complete class structure:

python

class ClassName:
    class_variable = "shared data"  # Class attribute

    def __init__(self, param):
        self.param = param      # Instance attribute

    def method(self):           # Instance method
        return self.param

Instance attributes vs class attributes:
- Instance attributes: Unique to each object (self.name)
- Class attributes: Shared by all objects (ClassName.variable)
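
A minimal sketch of the difference, assuming a hypothetical `Respondent` class with a shared `survey_wave` value (both names are invented for this illustration):

```python
class Respondent:
    survey_wave = 1              # class attribute: one value shared by all objects

    def __init__(self, age):
        self.age = age           # instance attribute: a separate value per object


a = Respondent(30)
b = Respondent(45)

print(a.age, b.age)                  # 30 45 (each object has its own age)
print(a.survey_wave, b.survey_wave)  # 1 1 (both read the shared class attribute)

# Changing the class attribute is visible through every instance
Respondent.survey_wave = 2
print(a.survey_wave, b.survey_wave)  # 2 2

# Assigning through an instance creates an instance attribute that shadows the class one
a.survey_wave = 99
print(a.survey_wave, b.survey_wave)  # 99 2
```
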
Common special methods (magic methods):
- __init__: Initializer, commonly called the constructor (runs when an object is created)
- __str__: String representation (for print())
- __repr__: Developer representation (for debugging)
- __len__: Length (for len())
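
The special methods listed above can be sketched together in one small class. `Survey` and its fields are hypothetical names chosen for this example.

```python
class Survey:
    """Hypothetical container for survey answers, used to illustrate special methods."""

    def __init__(self, title, answers):
        self.title = title
        self.answers = list(answers)

    def __str__(self):
        # Human-readable text, used by print() and str()
        return f"{self.title} ({len(self.answers)} answers)"

    def __repr__(self):
        # Unambiguous text for debugging, used by repr() and the interactive prompt
        return f"Survey(title={self.title!r}, answers={self.answers!r})"

    def __len__(self):
        # Lets the built-in len() work on the object
        return len(self.answers)


survey = Survey("Income survey", [52000, 75000, 120000])

print(survey)        # Income survey (3 answers)
print(repr(survey))  # Survey(title='Income survey', answers=[52000, 75000, 120000])
print(len(survey))   # 3
```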

@property decorator: Expose a method as a read-only, computed attribute

python

@property
def net_income(self):
    return self.income * 0.75

# Usage: resp.net_income (no parentheses)
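
To show the property in context, here is a minimal sketch that places it inside a class. The class is reduced to a single `income` field for brevity, and the 0.75 factor is taken from the snippet above.

```python
class SurveyResponse:
    def __init__(self, income):
        self.income = income

    @property
    def net_income(self):
        # Computed on every access from the current income
        return self.income * 0.75


resp = SurveyResponse(80000)
print(resp.net_income)    # 60000.0, read like an attribute

resp.income = 100000
print(resp.net_income)    # 75000.0, recomputed after income changed

try:
    resp.net_income = 1   # no setter is defined, so assignment fails
except AttributeError as err:
    print("read-only:", err)
```

Because the value is derived each time, it cannot drift out of sync with `income`, which is the main reason to prefer a property over storing a second attribute.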

Encapsulation: Public vs Private:
- Public: self.balance
- Convention private: self._transactions (single underscore)
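- Name-mangled: self.__pin (double underscore; Python stores it as _ClassName__pin, which discourages outside access but does not prevent it, so Python has no truly private attributes)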
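
A short sketch of how the three naming levels behave in practice. The `Account` class is a hypothetical example; only the attribute names come from the list above.

```python
class Account:
    """Hypothetical example class illustrating attribute naming conventions."""

    def __init__(self, balance, pin):
        self.balance = balance        # public
        self._transactions = []       # internal by convention only
        self.__pin = pin              # name-mangled to _Account__pin

    def check_pin(self, pin):
        return pin == self.__pin      # inside the class the short name works


acct = Account(100, "1234")

print(acct.balance)             # 100
print(acct._transactions)       # [] (still accessible; the underscore is only a signal)
print(hasattr(acct, "__pin"))   # False: the attribute is not stored under this name
print(acct._Account__pin)       # 1234 (reachable under the mangled name)
print(acct.check_pin("1234"))   # True
```

Double underscores are a guard against accidental name clashes rather than a security feature.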

Practical Case:

python

class Student:
    school_name = "Peking University"  # Class attribute

    def __init__(self, id, name, major, gpa=0.0):
        self.id = id
        self.name = name
        self.major = major
        self.gpa = gpa
        self.courses = []

    def enroll_course(self, name, credits):
        self.courses.append({'name': name, 'credits': credits})

    def get_total_credits(self):
        return sum(c['credits'] for c in self.courses)

    def __str__(self):
        return f"{self.name} ({self.major}, GPA: {self.gpa})"

alice = Student(2024001, "Alice", "Economics", 3.8)
alice.enroll_course("Microeconomics", 4)
print(alice)  # Alice (Economics, GPA: 3.8)

03 - OOP in Data Science

Core Question: Why do data science libraries use OOP?

Core Content:

Pandas' OOP design:

DataFrame and Series are both objects

Advantages of method chaining:

python

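import numpy as np

# df is assumed to be a DataFrame with numeric 'age' and 'income' columns
result = (df
    .query('age > 30')                                   # keep rows where age > 30
    .assign(log_income=lambda x: np.log(x['income']))    # add a derived column
    .sort_values('income')                               # order by income
    .reset_index(drop=True)                              # renumber rows from 0
)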

Scikit-learn's OOP design:

Unified API: fit() → predict()

python

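from sklearn.linear_model import LinearRegression

# X, y, and X_test are assumed to be prepared feature/target arrays
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X_test)
print(model.coef_, model.intercept_)  # Access fitted attributes (trailing underscore)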

Statsmodels' OOP design:

python

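import statsmodels.formula.api as smf

# df is assumed to have 'income', 'education', and 'age' columns
model = smf.ols('income ~ education + age', data=df)
results = model.fit()
print(results.summary(), results.rsquared)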

Creating your own data science classes:
- Simple linear regression class (educational)
- Data processing pipeline class
OOP best practices:
- Design chainable methods (return self)
- Use attributes to store metadata (is_fitted, n_features)
- Implement __repr__ for debugging
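
The practices above can be sketched in an educational simple linear regression class written in plain Python. The class name and the `is_fitted` flag are illustrative choices; `coef_` and `intercept_` imitate scikit-learn's naming, but this is not scikit-learn's implementation.

```python
class SimpleLinearRegression:
    """Educational one-variable least-squares fit (not a replacement for scikit-learn)."""

    def __init__(self):
        self.is_fitted = False    # metadata stored as an attribute
        self.coef_ = None
        self.intercept_ = None

    def fit(self, x, y):
        n = len(x)
        if n != len(y) or n < 2:
            raise ValueError("x and y must have the same length (at least 2)")
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            raise ValueError("x must contain at least two distinct values")
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        self.coef_ = sxy / sxx                          # slope
        self.intercept_ = mean_y - self.coef_ * mean_x
        self.is_fitted = True
        return self                                     # enables method chaining

    def predict(self, x):
        if not self.is_fitted:
            raise RuntimeError("Call fit() before predict()")
        return [self.intercept_ + self.coef_ * xi for xi in x]

    def __repr__(self):
        return (f"SimpleLinearRegression(is_fitted={self.is_fitted}, "
                f"coef_={self.coef_}, intercept_={self.intercept_})")


# Toy data that lies exactly on y = 2x + 1
model = SimpleLinearRegression().fit([1, 2, 3, 4], [3, 5, 7, 9])
print(model)                  # __repr__ shows the fitted state
print(model.predict([5, 6]))  # [11.0, 13.0]
```

Returning `self` from `fit()` is what allows the construct-and-fit call to be written on one line.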

Practical Case: Data Pipeline Class:

python

class DataPipeline:
    def __init__(self, df):
        self.df = df.copy()
        self.steps = []

    def remove_missing(self):
        self.df = self.df.dropna()
        self.steps.append("remove_missing")
        return self  # Support method chaining

    def filter_age(self, min_age, max_age):
        self.df = self.df[(self.df['age'] >= min_age) &
                          (self.df['age'] <= max_age)]
        self.steps.append(f"filter_age({min_age}, {max_age})")
        return self

    def get_result(self):
        return self.df

# Method chaining
result = (DataPipeline(df)
    .remove_missing()
    .filter_age(18, 65)
    .get_result()
)

Object-Oriented vs Functional Programming

Dimension	Functional Programming	Object-Oriented Programming
Organization	Functions + data separated	Data and methods bound together
Typical Languages	R, MATLAB	Python, Java
Use Cases	Simple scripts, data analysis	Large projects, library development
Code Style	`mean(df$x)`	`df['x'].mean()`

Functional Style (R Style)

python

# Data and functions separated
def calculate_tax(income, rate):
    return income * rate

def is_valid(age):
    return age >= 18

income = 75000
tax = calculate_tax(income, 0.25)

Object-Oriented Style (Python Style)

python

# Data and methods bound together
class Respondent:
    def __init__(self, income, age):
        self.income = income
        self.age = age

    def calculate_tax(self, rate=0.25):
        return self.income * rate

    def is_valid(self):
        return self.age >= 18

resp = Respondent(75000, 30)
tax = resp.calculate_tax()

How to Study This Chapter?

Learning Roadmap

Day 1 (2 hours): OOP Introduction

Read 01 - Introduction to Object-Oriented Programming
Understand concepts of class, object, method, attribute
Create your first class (Student or SurveyResponse)

Day 2 (3 hours): Classes and Objects in Detail

Read 02 - Classes and Objects in Detail
Learn special methods (__init__, __str__, __len__)
Practice @property decorator

Day 3 (2 hours): OOP in Data Science

Read 03 - OOP in Data Science
Understand Pandas and Scikit-learn's OOP design
Create a simple data pipeline class

Total Time: 7 hours (1 week)

Minimal Learning Path

For social science students, OOP is not a core skill. Priorities:

Must Learn (understand others' code, 4 hours):

01 - OOP Introduction (understand Pandas' OOP design)
03 - Data Science Applications (understand Scikit-learn usage)
Know the difference between df.method() and df.attribute

Optional (write your own classes, 3 hours):

02 - Classes and Objects in Detail
Special methods, @property
Create your own classes

Can Skip (advanced topics):

Class methods, static methods
Inheritance, polymorphism
Complex encapsulation design

Study Recommendations

Understanding Before Creating
- First understand how Pandas and Scikit-learn use OOP
- Then consider whether you need to create your own classes
- Most data analysis can be done with functions

From Usage to Creation

python

# Step 1: Use others' classes
df = pd.DataFrame({'x': [1, 2, 3]})
df.mean()

# Step 2: Understand the principles
# DataFrame is a class, df is an object, mean() is a method

# Step 3: Create your own classes (optional)
class MyDataFrame:
    def __init__(self, data):
        self.data = data

When to Create Classes?
- Create a class when you need to encapsulate complex data + operations
- Create a class when you need reusable components (data pipelines, custom models)
- Create a class when a large project needs code organization
- Stick with functions for simple scripts and one-off analysis

Comparative Learning
- Python's OOP style: data and methods bound together
- R's functional style: data and functions separated
- Stata: almost no OOP (all commands)

Common Questions

Q: Do I have to learn OOP? Can I skip it? A: You can't skip it completely. You don't need to master it deeply, but you must understand the basic concepts, or you won't be able to follow the Pandas and Scikit-learn documentation. Recommendation: Understand what df.method() means; you don't have to write complex classes yourself.

Q: Why does Python use OOP while R uses functional style? A:

Python is a general-purpose programming language, and OOP helps organize large projects
R is a statistical language, and its functional style is closer to mathematical notation
Both can accomplish data analysis; they just differ in style

Q: What's the difference between df.head() and head(df)? A:

df.head(): OOP style, DataFrame object calls its own method
head(df): Functional style, function takes DataFrame as parameter
Python prefers the former, R prefers the latter

Q: What is self? A: self represents the object itself. When you call resp.is_valid(), Python automatically passes resp to self, so the method can access the object's attributes.
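
A minimal sketch of what "automatically passes resp to self" means, using a reduced version of the Respondent class from earlier in this chapter:

```python
class Respondent:
    def __init__(self, age):
        self.age = age

    def is_valid(self):
        return self.age >= 18


resp = Respondent(30)

# These two calls are equivalent: the first is shorthand for the second
print(resp.is_valid())            # True
print(Respondent.is_valid(resp))  # True, with resp passed explicitly as self
```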

Q: When do I need to create my own classes? A:

Need to: Build reusable data pipelines, encapsulate complex analysis logic, large projects
Don't need to: Simple scripts, one-off analysis, exploratory data analysis

Q: Does OOP make code faster? A: No. OOP's advantages are in organization and maintainability, not performance. For data analysis, performance mainly depends on algorithms and libraries (NumPy, Pandas), not OOP.

Next Steps

After completing this chapter, you will:

Know the basic OOP concepts (class, object, method, attribute)
Understand Pandas and Scikit-learn's OOP design
Know when to use classes vs functions
Be able to create simple classes to organize code

In Module 7, we'll learn file I/O, how to read and write CSV, Excel, JSON, Stata, and other data files.

In Module 9, we'll dive deep into core data science libraries like NumPy, Pandas, and Matplotlib.

OOP is not a core skill for social science students; understanding is enough! Keep moving forward!
