
Module 6: Object-Oriented Programming Basics (OOP)

Understanding the Secret of df.method() — Why Data Science Needs OOP


Chapter Overview

If you've used Pandas or Scikit-learn, you're already using Object-Oriented Programming (OOP)! Every time you call df.head(), df.mean(), or model.fit(X, y), you're interacting with objects. This chapter demystifies OOP, explains the design philosophy behind data science libraries, and teaches you how to create your own classes.

Important Note: Social science students don't need to master OOP deeply, but understanding its basic concepts is essential for effectively using libraries like Pandas and Scikit-learn.


Learning Objectives

After completing this chapter, you will be able to:

  • Understand the concepts of Classes and Objects
  • Know why Pandas and Scikit-learn use OOP
  • Understand the meaning of df.method() and df.attribute
  • Create simple classes to organize code
  • Use special methods (__init__, __str__, __len__)
  • Compare object-oriented vs functional programming
  • Build data analysis pipeline classes

Chapter Contents

01 - Introduction to Object-Oriented Programming

Core Question: Why should social science students learn OOP?

Core Content:

  • You're already using OOP:
    python
    import pandas as pd

    df = pd.DataFrame({'age': [25, 30, 35]})
    result = df.mean()  # df is an object, mean() is a method
  • OOP core concepts:
    • Object: A collection of data + methods
    • Class: A template/blueprint for objects
    • Method: A function belonging to an object
    • Attribute: Data belonging to an object
  • Why we need OOP: Data and methods are bound together, making code more organized
  • Understanding Pandas' OOP design:
    python
    df.head()      # Method call: parentheses, performs an action
    df.shape       # Attribute access: no parentheses, reads stored data
    df.to_csv()    # Method call
  • Comparing Python (object-oriented) vs R (functional):
    • Python: df.mean()
    • R: mean(df$x)
  • Creating your first class: Student class, Survey response class
  • Choosing between object-oriented vs functional programming
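
The four terms above can be seen side by side in a few lines. This is a minimal sketch using a hypothetical Survey class (not part of Pandas or any library); it only illustrates that attributes are read without parentheses while methods are called with them.

```python
class Survey:
    """Hypothetical class, used only to illustrate the terminology."""

    def __init__(self, ages):
        self.ages = ages            # attribute: data stored on the object

    def mean_age(self):             # method: a function attached to the object
        return sum(self.ages) / len(self.ages)


survey = Survey([25, 30, 35])       # survey is an object (an instance of the class Survey)

print(survey.ages)                  # attribute access, no parentheses
print(survey.mean_age())            # method call, parentheses required -> 30.0
print(callable(survey.mean_age))    # True: a method can be called
print(callable(survey.ages))        # False: plain data cannot
```

The same reading applies to df.shape (attribute) and df.head() (method).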

Why It Matters?

  • Understand how Pandas and Scikit-learn are used
  • Know when to use classes vs functions
  • Make code more maintainable and reusable

Practical Application:

python
class SurveyResponse:
    def __init__(self, id, age, income):
        self.id = id
        self.age = age
        self.income = income

    def is_valid(self):
        return 18 <= self.age <= 100 and self.income >= 0

    def income_category(self):
        if self.income < 50000:
            return "Low Income"
        elif self.income < 100000:
            return "Middle Income"
        else:
            return "High Income"

resp = SurveyResponse(1001, 30, 75000)
print(resp.is_valid())         # True
print(resp.income_category())  # Middle Income

02 - Classes and Objects in Detail

Core Question: How do you create and use classes?

Core Content:

  • Complete class structure:
    python
    class ClassName:
        class_variable = "shared data"  # Class attribute
    
        def __init__(self, param):
            self.param = param      # Instance attribute
    
        def method(self):           # Instance method
            return self.param
  • Instance attributes vs class attributes:
    • Instance attributes: Unique to each object (self.name)
    • Class attributes: Shared by all objects (ClassName.variable)
  • Common special methods (magic methods):
    • __init__: Initializer, commonly called the constructor (runs when an object is created)
    • __str__: String representation (for print())
    • __repr__: Developer representation (for debugging)
    • __len__: Length (for len())
  • @property decorator: Expose a method as a read-only attribute
    python
    @property
    def net_income(self):
        return self.income * 0.75
    
    # Usage: resp.net_income (no parentheses)
  • Encapsulation: Public vs Private:
    • Public: self.balance
    • Convention private: self._transactions (single underscore)
    • Name-mangled: self.__pin (double underscore; stored as _ClassName__pin, which discourages outside access but does not strictly prevent it)
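
The points above can be combined in one small example. This is a sketch built around a hypothetical Account class; the names bank_name, is_overdrawn, and the sample values are assumptions chosen for illustration. It shows a shared class attribute, a property, two special methods, and how Python treats single- and double-underscore names.

```python
class Account:
    """Hypothetical bank account, used only to illustrate the bullets above."""

    bank_name = "Example Bank"          # class attribute, shared by all instances

    def __init__(self, owner, balance, pin):
        self.owner = owner              # public instance attribute
        self.balance = balance          # public instance attribute
        self._transactions = []         # private by convention only
        self.__pin = pin                # name-mangled to _Account__pin

    @property
    def is_overdrawn(self):             # read like an attribute: a.is_overdrawn
        return self.balance < 0

    def __len__(self):                  # assumption: length = number of transactions
        return len(self._transactions)

    def __repr__(self):
        return f"Account(owner={self.owner!r}, balance={self.balance})"


a = Account("Alice", 100, "1234")
b = Account("Bob", -20, "0000")

print(a.bank_name == b.bank_name)       # True: both see the same class attribute
print(a.is_overdrawn, b.is_overdrawn)   # False True
print(repr(a))                          # Account(owner='Alice', balance=100)
print(len(a))                           # 0
print(hasattr(a, "__pin"))              # False: the name was mangled
print(a._Account__pin)                  # 1234: still reachable, so not truly private
```

In practice the single-underscore convention is the more common choice; double underscores are mainly useful for avoiding name clashes in subclasses.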

Practical Case:

python
class Student:
    school_name = "Peking University"  # Class attribute

    def __init__(self, id, name, major, gpa=0.0):
        self.id = id
        self.name = name
        self.major = major
        self.gpa = gpa
        self.courses = []

    def enroll_course(self, name, credits):
        self.courses.append({'name': name, 'credits': credits})

    def get_total_credits(self):
        return sum(c['credits'] for c in self.courses)

    def __str__(self):
        return f"{self.name} ({self.major}, GPA: {self.gpa})"

alice = Student(2024001, "Alice", "Economics", 3.8)
alice.enroll_course("Microeconomics", 4)
print(alice)  # Alice (Economics, GPA: 3.8)
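
The Student class above defines only __str__. As a sketch of how the other special methods listed in this section behave, here is a compact variant with __repr__ and __len__ added; treating "length" as the number of enrolled courses is an assumption made for this example, not something the class requires.

```python
class Student:
    """Compact variant of the Student class above, with two extra special methods."""

    def __init__(self, id, name, major, gpa=0.0):
        self.id = id
        self.name = name
        self.major = major
        self.gpa = gpa
        self.courses = []

    def enroll_course(self, name, credits):
        self.courses.append({"name": name, "credits": credits})

    def __str__(self):      # used by print() and str()
        return f"{self.name} ({self.major}, GPA: {self.gpa})"

    def __repr__(self):     # used by the interactive prompt and inside containers
        return f"Student(id={self.id}, name={self.name!r})"

    def __len__(self):      # assumption: length means number of enrolled courses
        return len(self.courses)


alice = Student(2024001, "Alice", "Economics", 3.8)
alice.enroll_course("Microeconomics", 4)

print(str(alice))    # Alice (Economics, GPA: 3.8)
print([alice])       # [Student(id=2024001, name='Alice')]
print(len(alice))    # 1
```

Built-in functions such as len() and str() look function-style, but they delegate to the object's own special methods.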

03 - OOP in Data Science

Core Question: Why do data science libraries use OOP?

Core Content:

  • Pandas' OOP design:
    • DataFrame and Series are both objects
    • Advantages of method chaining:
      python
      import numpy as np

      # df is assumed to be a DataFrame with 'age' and 'income' columns
      result = (df
          .query('age > 30')
          .assign(log_income=lambda x: np.log(x['income']))
          .sort_values('income')
          .reset_index(drop=True)
      )
  • Scikit-learn's OOP design:
    • Unified API: fit() → predict()
    python
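    from sklearn.linear_model import LinearRegression

    # X, y, and X_test are assumed to be defined already
    model = LinearRegression()
    model.fit(X, y)
    predictions = model.predict(X_test)
    print(model.coef_, model.intercept_)  # Access attributes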
  • Statsmodels' OOP design:
    python
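    import statsmodels.formula.api as smf

    # df is assumed to have 'income', 'education', and 'age' columns
    model = smf.ols('income ~ education + age', data=df)
    results = model.fit()
    print(results.summary(), results.rsquared)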
  • Creating your own data science classes:
    • Simple linear regression class (educational)
    • Data processing pipeline class
  • OOP best practices:
    • Design chainable methods (return self)
    • Use attributes to store metadata (is_fitted, n_features)
    • Implement __repr__ for debugging
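
As a sketch of the "simple linear regression class (educational)" idea and the three best practices above, the following pure-Python class fits one predictor by ordinary least squares. The class name and the slope_ and intercept_ attribute names are assumptions that loosely imitate Scikit-learn's naming style; this is a teaching aid, not how any library implements regression.

```python
class SimpleLinearRegression:
    """Educational one-predictor OLS; not a replacement for a real library."""

    def __init__(self):
        self.is_fitted = False      # metadata stored as an attribute
        self.slope_ = None
        self.intercept_ = None

    def fit(self, x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # slope = covariance(x, y) / variance(x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        self.is_fitted = True
        return self                 # returning self makes the method chainable

    def predict(self, x):
        if not self.is_fitted:
            raise RuntimeError("Call fit() before predict().")
        return [self.intercept_ + self.slope_ * xi for xi in x]

    def __repr__(self):             # helpful when debugging
        return f"SimpleLinearRegression(is_fitted={self.is_fitted})"


# Toy data that follows y = 2x + 1 exactly
model = SimpleLinearRegression().fit([1, 2, 3, 4], [3, 5, 7, 9])

print(model)                           # SimpleLinearRegression(is_fitted=True)
print(model.slope_, model.intercept_)  # 2.0 1.0
print(model.predict([5]))              # [11.0]
```

Because fit() returns self, construction and fitting can be written on one line, and the is_fitted flag lets predict() fail with a clear message instead of a confusing error.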

Practical Case: Data Pipeline Class

python
class DataPipeline:
    def __init__(self, df):
        self.df = df.copy()
        self.steps = []

    def remove_missing(self):
        self.df = self.df.dropna()
        self.steps.append("remove_missing")
        return self  # Support method chaining

    def filter_age(self, min_age, max_age):
        self.df = self.df[(self.df['age'] >= min_age) &
                          (self.df['age'] <= max_age)]
        self.steps.append(f"filter_age({min_age}, {max_age})")
        return self

    def get_result(self):
        return self.df

# Method chaining (df is assumed to be an existing DataFrame with an 'age' column)
result = (DataPipeline(df)
    .remove_missing()
    .filter_age(18, 65)
    .get_result()
)

Object-Oriented vs Functional Programming

| Dimension | Functional Programming | Object-Oriented Programming |
|---|---|---|
| Organization | Functions and data kept separate | Data and methods bound together |
| Typical Languages | R, MATLAB | Python, Java |
| Use Cases | Simple scripts, data analysis | Large projects, library development |
| Code Style | mean(df$x) | df['x'].mean() |

Functional Style (R Style)

python
# Data and functions separated
def calculate_tax(income, rate):
    return income * rate

def is_valid(age):
    return age >= 18

income = 75000
tax = calculate_tax(income, 0.25)

Object-Oriented Style (Python Style)

python
# Data and methods bound together
class Respondent:
    def __init__(self, income, age):
        self.income = income
        self.age = age

    def calculate_tax(self, rate=0.25):
        return self.income * rate

    def is_valid(self):
        return self.age >= 18

resp = Respondent(75000, 30)
tax = resp.calculate_tax()

How to Study This Chapter?

Learning Roadmap

Day 1 (2 hours): OOP Introduction

  • Read 01 - Introduction to Object-Oriented Programming
  • Understand concepts of class, object, method, attribute
  • Create your first class (Student or SurveyResponse)

Day 2 (3 hours): Classes and Objects in Detail

  • Read 02 - Classes and Objects in Detail
  • Learn special methods (__init__, __str__, __len__)
  • Practice @property decorator

Day 3 (2 hours): OOP in Data Science

  • Read 03 - OOP in Data Science
  • Understand Pandas and Scikit-learn's OOP design
  • Create a simple data pipeline class

Total Time: 7 hours (1 week)

Minimal Learning Path

For social science students, OOP is not a core skill. Priorities:

Must Learn (understand others' code, 4 hours):

  • 01 - OOP Introduction (understand Pandas' OOP design)
  • 03 - Data Science Applications (understand Scikit-learn usage)
  • Know the difference between df.method() and df.attribute

Optional (write your own classes, 3 hours):

  • 02 - Classes and Objects in Detail
  • Special methods, @property
  • Create your own classes

Can Skip (advanced topics):

  • Class methods, static methods
  • Inheritance, polymorphism
  • Complex encapsulation design

Study Recommendations

  1. Understanding Before Creating

    • First understand how Pandas and Scikit-learn use OOP
    • Then consider whether you need to create your own classes
    • Most data analysis can be done with functions
  2. From Usage to Creation

    python
    # Step 1: Use others' classes
    df = pd.DataFrame({'x': [1, 2, 3]})
    df.mean()
    
    # Step 2: Understand the principles
    # DataFrame is a class, df is an object, mean() is a method
    
    # Step 3: Create your own classes (optional)
    class MyDataFrame:
        def __init__(self, data):
            self.data = data
  3. When to Create Classes?

    • Need to encapsulate complex data + operations
    • Need reusable components (data pipelines, custom models)
    • Large projects need code organization
    • Not needed for simple scripts or one-off analysis (use functions instead)
  4. Comparative Learning

    • Python's OOP style: data and methods bound together
    • R's functional style: data and functions separated
    • Stata: almost no OOP (work is done through commands)

Common Questions

Q: Do I have to learn OOP? Can I skip it? A: Not completely. You don't need to master it, but without the basic concepts the Pandas and Scikit-learn documentation will be hard to follow. Recommendation: make sure you understand what df.method() means; you don't have to write complex classes yourself.

Q: Why does Python use OOP while R uses functional style? A:

  • Python is a general-purpose programming language, and OOP suits large projects well
  • R is a statistical language, and its function-call style is closer to mathematical notation
  • Both can accomplish data analysis, just different styles

Q: What's the difference between df.head() and head(df)? A:

  • df.head(): OOP style, DataFrame object calls its own method
  • head(df): Functional style, function takes DataFrame as parameter
  • Python prefers the former; R prefers the latter

Q: What is self? A: self refers to the object the method was called on. When you call resp.is_valid(), Python automatically passes resp as the self argument, so the method can access that object's attributes.
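
A short sketch makes this concrete. Using a trimmed-down version of the Respondent class from earlier in this chapter, the two calls below do the same thing; the second form is rarely written in practice and is shown only to reveal where self comes from.

```python
class Respondent:
    def __init__(self, age):
        self.age = age

    def is_valid(self):
        return self.age >= 18


resp = Respondent(30)

# The usual form: Python fills in self with resp automatically
print(resp.is_valid())             # True

# The explicit form: resp is passed as self by hand
print(Respondent.is_valid(resp))   # True
```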

Q: When do I need to create my own classes? A:

  • Need to: Build reusable data pipelines, encapsulate complex analysis logic, large projects
  • Don't need to: Simple scripts, one-off analysis, exploratory data analysis

Q: Does OOP make code faster? A: No. OOP's advantages are in organization and maintainability, not performance. For data analysis, performance mainly depends on algorithms and libraries (NumPy, Pandas), not OOP.


Next Steps

After completing this chapter, you will have:

  • A grasp of basic OOP concepts (class, object, method, attribute)
  • An understanding of Pandas and Scikit-learn's OOP design
  • A sense of when to use classes vs functions
  • The ability to create simple classes to organize code

In Module 7, we'll learn file I/O: how to read and write CSV, Excel, JSON, Stata, and other data files.

In Module 9, we'll dive deep into core data science libraries like NumPy, Pandas, and Matplotlib.

OOP is not a core skill for social science students; understanding is enough! Keep moving forward!


Released under the MIT License. Content © Author.