Module 6 Summary and Review

Object-Oriented Programming Basics — Understanding Classes and Objects


Knowledge Summary

1. OOP Core Concepts

What is OOP?

  • Object-oriented programming (OOP) is a paradigm that bundles data and the methods that operate on that data into objects
  • Object: A collection of data + methods
  • Class: A template/blueprint for objects
  • Method: A function belonging to an object

Why do we need OOP?

  • Data and methods are naturally bound together
  • Code is more organized
  • Easier to reuse and maintain
  • Maps naturally onto real-world entities (students, surveys, accounts)

Core Terminology:

Term | Definition | Example
Class | Object template | class Student:
Object/Instance | Concrete instance of a class | alice = Student()
Attribute | Object's data | alice.name = "Alice"
Method | Object's function | alice.calculate_gpa()
self | Refers to the current object | self.name
Constructor | Initializes the object | __init__()
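
The terms in the table can be seen together in one small sketch. The `Student` class below is illustrative only: the `grades` list and the `add_grade` and `average` methods are assumed names for this example, not part of any library.

```python
class Student:                      # class: the template
    """A student with a name and a list of course grades."""

    def __init__(self, name):       # constructor: runs when an object is created
        self.name = name            # attribute: data stored on this object
        self.grades = []

    def add_grade(self, grade):     # method: a function that belongs to the object
        self.grades.append(grade)

    def average(self):
        # self refers to whichever object the method was called on
        return sum(self.grades) / len(self.grades) if self.grades else 0.0


alice = Student("Alice")            # object/instance: built from the class
bob = Student("Bob")                # a second, independent instance

alice.add_grade(90)
alice.add_grade(80)
bob.add_grade(70)

print(alice.name, alice.average())  # Alice 85.0
print(bob.name, bob.average())      # Bob 70.0
```

Each instance keeps its own `grades` list, which is why adding a grade to `alice` leaves `bob` unchanged.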

2. Basic Class Structure

python
class ClassName:
    """Class docstring"""

    # Class attribute (shared by all objects)
    class_variable = "shared"

    def __init__(self, param1, param2):
        """Constructor"""
        self.param1 = param1  # Instance attribute
        self.param2 = param2

    def instance_method(self):
        """Instance method"""
        return self.param1

    @classmethod
    def class_method(cls):
        """Class method"""
        return cls.class_variable

    @staticmethod
    def static_method():
        """Static method"""
        return "Does not depend on class or instance"

Three Method Types:

Method Type | First Parameter | Access Instance Attributes | Access Class Attributes | Use Case
Instance method | self | Yes | Yes | Most common; operates on object data
Class method | cls | No | Yes | Factory methods, alternative constructors
Static method | None | No | No | Utility functions
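
To make the three method types concrete, here is a minimal sketch of how each one is typically called. The `"name,year"` text format parsed by `from_string` and the year range checked by `is_valid_year` are assumptions made up for this example.

```python
class Survey:
    default_year = 2024  # class attribute

    def __init__(self, name, year):
        self.name = name
        self.year = year

    def label(self):
        # Instance method: reads instance data through self
        return f"{self.name} ({self.year})"

    @classmethod
    def from_string(cls, text):
        # Class method used as an alternative constructor.
        # Assumes the hypothetical format "name,year"; a missing year
        # falls back to the class attribute.
        name, _, year = text.partition(",")
        return cls(name.strip(), int(year) if year.strip() else cls.default_year)

    @staticmethod
    def is_valid_year(year):
        # Static method: a plain utility that needs neither self nor cls
        return 1900 <= year <= 2100


s1 = Survey("Income Survey", 2023)
s2 = Survey.from_string("Health Survey, 2022")
s3 = Survey.from_string("Housing Survey")

print(s1.label())                  # Income Survey (2023)
print(s2.label())                  # Health Survey (2022)
print(s3.label())                  # Housing Survey (2024)
print(Survey.is_valid_year(1850))  # False
```

Because `from_string` builds the object through `cls(...)` rather than `Survey(...)`, a subclass that inherits it would get instances of the subclass.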

3. Instance Attributes vs Class Attributes

python
class Survey:
    # Class attribute (shared by all objects)
    total_surveys = 0

    def __init__(self, name, year):
        # Instance attributes (unique to each object)
        self.name = name
        self.year = year
        Survey.total_surveys += 1  # Modify class attribute

# Usage
survey1 = Survey("Income Survey", 2024)
survey2 = Survey("Health Survey", 2024)

print(survey1.name)           # Income Survey (instance attribute)
print(Survey.total_surveys)   # 2 (class attribute)

Differences:

  • Instance attributes: Unique to each object, accessed via self.attr
  • Class attributes: Shared by all objects, accessed via ClassName.attr
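
One subtlety worth seeing in code: a class attribute can be read through an instance, but assigning through an instance creates a separate instance attribute that hides the class one. The sketch below uses a trimmed-down version of the `Survey` class above to show this.

```python
class Survey:
    total_surveys = 0  # class attribute

    def __init__(self, name):
        self.name = name
        Survey.total_surveys += 1


a = Survey("Income Survey")
b = Survey("Health Survey")

# Reading through an instance falls back to the class attribute
print(a.total_surveys, b.total_surveys, Survey.total_surveys)  # 2 2 2

# Assigning through an instance creates a new instance attribute on `a` only
a.total_surveys = 99
print(a.total_surveys)       # 99 (instance attribute now hides the class one)
print(b.total_surveys)       # 2
print(Survey.total_surveys)  # 2

# vars() shows what is actually stored on each object
print("total_surveys" in vars(a), "total_surveys" in vars(b))  # True False
```

This is why shared counters are usually updated through the class name, as in `Survey.total_surveys += 1`.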

4. Special Methods (Magic Methods)

Method | Purpose | Triggered By
__init__() | Constructor | obj = Class()
__str__() | String representation (user-friendly) | print(obj)
__repr__() | Developer representation | repr(obj)
__len__() | Length | len(obj)
__getitem__() | Index access | obj[key]
__eq__() | Equality comparison | obj1 == obj2

Example:

python
class Survey:
    def __init__(self, name):
        self.name = name
        self.responses = []

    def __str__(self):
        return f"Survey: {self.name} ({len(self.responses)} responses)"

    def __len__(self):
        return len(self.responses)

    def __getitem__(self, index):
        return self.responses[index]

# Usage
survey = Survey("Test")
survey.responses = [1, 2, 3]

print(survey)         # Survey: Test (3 responses)
print(len(survey))    # 3
print(survey[0])      # 1
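
The example above does not cover `__repr__` and `__eq__`, so here is a short sketch of both. The `Response` class and its fields are hypothetical names chosen for illustration.

```python
class Response:
    def __init__(self, respondent_id, income):
        self.respondent_id = respondent_id
        self.income = income

    def __repr__(self):
        # Developer-facing text; ideally looks like the call that rebuilds the object
        return f"Response(respondent_id={self.respondent_id!r}, income={self.income!r})"

    def __eq__(self, other):
        # Compare by value; defer to Python for unrelated types
        if not isinstance(other, Response):
            return NotImplemented
        return (self.respondent_id, self.income) == (other.respondent_id, other.income)


r1 = Response(1, 75000)
r2 = Response(1, 75000)
r3 = Response(2, 85000)

print(repr(r1))    # Response(respondent_id=1, income=75000)
print([r1, r3])    # containers use __repr__ for their elements
print(r1 == r2)    # True  (same values)
print(r1 == r3)    # False
print(r1 is r2)    # False (still two distinct objects)
```

Note that a class which defines `__eq__` without also defining `__hash__` becomes unhashable, so its instances cannot be used as dictionary keys or set members unless `__hash__` is added.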

5. Encapsulation: Public vs Private

python
class BankAccount:
    def __init__(self, balance):
        self.balance = balance       # Public attribute
        self._transactions = []      # Private by convention (single underscore)
        self.__pin = "1234"          # Name-mangled to _BankAccount__pin (double underscore)

    def deposit(self, amount):
        """Public method"""
        self.balance += amount
        self._log_transaction("deposit", amount)

    def _log_transaction(self, kind, amount):
        """Private method (by convention)"""
        # `kind` avoids shadowing the built-in type()
        self._transactions.append({'type': kind, 'amount': amount})

Naming Conventions:

  • name: Public (directly accessible)
  • _name: Private by convention (external access is discouraged but still possible)
  • __name: Name-mangled (Python rewrites it to _ClassName__name, which prevents accidental access and clashes in subclasses but is not truly private)
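
The three naming levels behave differently at runtime. The following sketch reuses a reduced `BankAccount` (the `check_pin` method is an assumed helper added for illustration) to show what is and is not reachable from outside the class.

```python
class BankAccount:
    def __init__(self, balance):
        self.balance = balance
        self._transactions = []
        self.__pin = "1234"   # stored on the object as _BankAccount__pin

    def check_pin(self, pin):
        # Inside the class body, self.__pin is mangled automatically
        return pin == self.__pin


account = BankAccount(100)

print(account.balance)         # 100
print(account._transactions)   # []  (works, but discouraged by convention)

# Outside the class the plain name does not exist on the object
try:
    account.__pin
except AttributeError as exc:
    print("AttributeError:", exc)

print(account._BankAccount__pin)  # 1234  (the mangled name is still reachable)
print(account.check_pin("1234"))  # True
```

Name mangling is a guard against accidents rather than a security feature, so sensitive values should not rely on it for protection.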

6. OOP in Data Science Applications

Pandas DataFrame:

python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

# Attributes
df.shape      # (3, 1)
df.columns    # Index(['age'])

# Methods
df.head()
df.mean()
df.to_csv('output.csv')

# Method chaining
result = (df
    .query('age > 25')
    .assign(age_squared=lambda x: x['age']**2)
    .sort_values('age')
)

Scikit-learn Models:

python
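import numpy as np
from sklearn.linear_model import LinearRegression

# Small illustrative dataset (values made up for this example)
X = np.array([[1], [2], [3], [4]])  # 2D: one row per sample, one column per feature
y = np.array([2, 4, 6, 8])
X_new = np.array([[5], [6]])

model = LinearRegression()  # Create object
model.fit(X, y)             # Train (method)
predictions = model.predict(X_new)  # Predict (method)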

# Access attributes
print(model.coef_)       # Coefficients
print(model.intercept_)  # Intercept

Python vs Stata vs R

Object-Oriented Comparison

Python (fully object-oriented):

python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df.mean()           # Method call
df.shape            # Attribute access

R (partially object-oriented):

r
df <- data.frame(x = c(1, 2, 3))
mean(df$x)          # Function call
dim(df)             # Function call

Stata (procedural):

stata
* Stata is mainly command-based
summarize income
generate log_income = log(income)
regress y x1 x2

Common Errors

1. Forgetting the self Parameter

python
# Wrong
class Student:
    def __init__(name, age):  # Forgot self: `name` receives the instance instead
        name = name  # Only rebinds a local variable; nothing is saved on the object

# Correct
class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age
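
It can help to see what the wrong version actually does when called. In this sketch the broken class is renamed `BrokenStudent` (a name chosen here for illustration) so the two failure modes can be observed.

```python
class BrokenStudent:
    def __init__(name, age):   # self is missing, so `name` receives the instance
        name = name            # rebinds a local variable; nothing is stored


# Two arguments plus the implicit instance make three, but __init__ accepts two
try:
    BrokenStudent("Alice", 20)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# One argument "works", yet the object ends up with no attributes at all
s = BrokenStudent(20)
print(vars(s))   # {}
```

The second case is the more dangerous one, because no error is raised until some later code tries to read `s.name`.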

2. Confusing Instance and Class Attributes

python
# Wrong
class Counter:
    count = 0  # Class attribute

    def increment(self):
        count += 1  # UnboundLocalError: count is treated as a local variable; use self.count or Counter.count

# Correct
class Counter:
    count = 0

    def increment(self):
        Counter.count += 1  # Or self.__class__.count += 1

3. Mutable Class Attributes Cause Unexpected Sharing

python
# Wrong
class Survey:
    responses = []  # Class attribute!

    def add_response(self, resp):
        self.responses.append(resp)  # All objects share the same list

# Correct
class Survey:
    def __init__(self):
        self.responses = []  # Instance attribute: each object gets its own list

    def add_response(self, resp):
        self.responses.append(resp)
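
The sharing problem is easiest to understand by running both versions side by side. In the sketch below the wrong version is renamed `SharedSurvey` (an illustrative name) so both classes can coexist.

```python
class SharedSurvey:
    responses = []              # one list, stored on the class

    def add_response(self, resp):
        self.responses.append(resp)   # mutates the shared list


class Survey:
    def __init__(self):
        self.responses = []     # a fresh list for every object

    def add_response(self, resp):
        self.responses.append(resp)


a, b = SharedSurvey(), SharedSurvey()
a.add_response("yes")
print(b.responses)                  # ['yes']  (b never received a response)
print(a.responses is b.responses)   # True

c, d = Survey(), Survey()
c.add_response("yes")
print(d.responses)                  # []
print(c.responses is d.responses)   # False
```

The issue only arises with mutable values such as lists and dicts; immutable class attributes like numbers or strings are safe to share.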

4. Forgetting to Implement __str__ Leads to Unfriendly Output

python
# Bad
class Student:
    def __init__(self, name):
        self.name = name

s = Student("Alice")
print(s)  # <__main__.Student object at 0x...>

# Good
class Student:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return f"Student(name='{self.name}')"

s = Student("Alice")
print(s)  # Student(name='Alice')

Best Practices

1. Use CapWords Naming for Classes

python
# Good
class StudentRecord:
    pass

class SurveyData:
    pass

# Bad
class student_record:
    pass

class surveydata:
    pass

2. Use snake_case Naming for Methods

python
class DataAnalyzer:
    def calculate_mean(self):  # ✓
        pass

    def CalculateMean(self):   # ✗
        pass

3. Use Docstrings

python
class Survey:
    """Survey class

    Manages survey data including adding responses, statistical analysis, etc.

    Attributes:
        name (str): Survey name
        year (int): Survey year
        responses (list): Response list
    """

    def __init__(self, name, year):
        self.name = name
        self.year = year
        self.responses = []
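
Docstrings are more than comments: Python stores them on the object, where tools such as `help()` can read them. A minimal sketch, using a shortened version of the class above with an assumed `add_response` method:

```python
import inspect


class Survey:
    """Survey class.

    Attributes:
        name (str): Survey name
        responses (list): Response list
    """

    def __init__(self, name):
        self.name = name
        self.responses = []

    def add_response(self, response):
        """Add one response to the survey."""
        self.responses.append(response)


# The raw docstring is available as __doc__
first_line = Survey.__doc__.splitlines()[0]
print(first_line)                            # Survey class.

# inspect.getdoc() returns the docstring with indentation cleaned up
method_doc = inspect.getdoc(Survey.add_response)
print(method_doc)                            # Add one response to the survey.
```

Calling `help(Survey)` in an interactive session prints the same text in a formatted layout.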

4. Support Method Chaining

python
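class DataPipeline:
    def __init__(self, data):
        self.data = list(data)  # Work on a copy of the input

    def remove_outliers(self):
        # Processing logic...
        return self  # Returning self is what makes chaining possible

    def standardize(self):
        # Processing logic...
        return self

    def filter_missing(self):
        # Processing logic...
        return self

# Method chaining
data = [1.2, 3.4, 2.2]  # Illustrative values
pipeline = (DataPipeline(data)
    .remove_outliers()
    .standardize()
    .filter_missing()
)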

Programming Exercises

Exercise 1: Student Grade Management System (Basic)

Difficulty: ⭐⭐ Time: 20 minutes

Create a Student class.

Requirements:

python
class Student:
    """Student class"""

    def __init__(self, student_id, name, major):
        pass

    def add_grade(self, course, grade):
        """Add a grade"""
        pass

    def get_gpa(self):
        """Calculate GPA (assuming 100-point scale, convert to 4.0 scale)"""
        pass

    def __str__(self):
        return f"Student: {self.name} ({self.major}), GPA: {self.get_gpa():.2f}"

# Test
alice = Student(2024001, "Alice Wang", "Economics")
alice.add_grade("Microeconomics", 85)
alice.add_grade("Econometrics", 90)
alice.add_grade("Statistics", 78)

print(alice)
print(f"GPA: {alice.get_gpa():.2f}")

✅ Reference Solution

python
class Student:
    """Student class"""

    def __init__(self, student_id, name, major):
        self.student_id = student_id
        self.name = name
        self.major = major
        self.grades = {}  # {course: grade}

    def add_grade(self, course, grade):
        """Add a grade"""
        if not (0 <= grade <= 100):
            raise ValueError("Grade must be between 0-100")
        self.grades[course] = grade

    def get_gpa(self):
        """Calculate GPA (100-point to 4.0 scale conversion)"""
        if not self.grades:
            return 0.0

        # Conversion rules: 90-100=4.0, 80-89=3.0, 70-79=2.0, 60-69=1.0, <60=0.0
        total_points = 0
        for grade in self.grades.values():
            if grade >= 90:
                total_points += 4.0
            elif grade >= 80:
                total_points += 3.0
            elif grade >= 70:
                total_points += 2.0
            elif grade >= 60:
                total_points += 1.0
            else:
                total_points += 0.0

        return total_points / len(self.grades)

    def get_average_score(self):
        """Calculate average score"""
        if not self.grades:
            return 0.0
        return sum(self.grades.values()) / len(self.grades)

    def __str__(self):
        return f"Student: {self.name} ({self.major}), GPA: {self.get_gpa():.2f}"

    def __repr__(self):
        return f"Student(id={self.student_id}, name='{self.name}', courses={len(self.grades)})"


# Test
alice = Student(2024001, "Alice Wang", "Economics")
alice.add_grade("Microeconomics", 85)
alice.add_grade("Econometrics", 90)
alice.add_grade("Statistics", 78)

print(alice)                           # Student: Alice Wang (Economics), GPA: 3.00
print(f"Average score: {alice.get_average_score():.1f}")  # 84.3
print(repr(alice))                      # Student(id=2024001, name='Alice Wang', courses=3)

Exercise 2: Survey Data Container (Basic)

Difficulty: ⭐⭐ Time: 25 minutes

python
class SurveyData:
    """Survey data management class"""

    def __init__(self, survey_name):
        pass

    def add_response(self, response):
        """Add a response"""
        pass

    def _validate(self, response):
        """Private method: validate data"""
        pass

    def get_average_income(self):
        """Calculate average income"""
        pass

    def filter_by_age(self, min_age, max_age):
        """Filter by age"""
        pass

    def __len__(self):
        return len(self.responses)

    def __str__(self):
        return f"{self.survey_name}: {len(self)} responses"

# Test
survey = SurveyData("2024 Income Survey")
survey.add_response({'id': 1, 'age': 30, 'income': 75000})
survey.add_response({'id': 2, 'age': 35, 'income': 85000})

print(survey)
print(f"Average income: ${survey.get_average_income():,.0f}")

✅ Reference Solution

python
class SurveyData:
    """Survey data management class"""

    def __init__(self, survey_name):
        self.survey_name = survey_name
        self.responses = []

    def add_response(self, response):
        """Add a response"""
        if self._validate(response):
            self.responses.append(response)
            return True
        else:
            print(f"⚠️ Invalid data: {response}")
            return False

    def _validate(self, response):
        """Private method: validate data"""
        required_fields = ['id', 'age', 'income']

        # Check required fields
        if not all(field in response for field in required_fields):
            return False

        # Validate age
        if not (0 < response['age'] < 120):
            return False

        # Validate income
        if response['income'] < 0:
            return False

        return True

    def get_average_income(self):
        """Calculate average income"""
        if not self.responses:
            return 0
        incomes = [r['income'] for r in self.responses]
        return sum(incomes) / len(incomes)

    def filter_by_age(self, min_age, max_age):
        """Filter by age"""
        return [r for r in self.responses
                if min_age <= r['age'] <= max_age]

    def get_income_stats(self):
        """Income statistics"""
        if not self.responses:
            return {}

        incomes = [r['income'] for r in self.responses]
        return {
            'mean': sum(incomes) / len(incomes),
            'min': min(incomes),
            'max': max(incomes),
            'count': len(incomes)
        }

    def __len__(self):
        return len(self.responses)

    def __str__(self):
        return f"{self.survey_name}: {len(self)} responses"

    def __getitem__(self, index):
        """Support index access"""
        return self.responses[index]


# Test
survey = SurveyData("2024 Income Survey")

# Add valid data
survey.add_response({'id': 1, 'age': 30, 'income': 75000})
survey.add_response({'id': 2, 'age': 35, 'income': 85000})
survey.add_response({'id': 3, 'age': 45, 'income': 95000})

# Add invalid data (will be rejected)
survey.add_response({'id': 4, 'age': -5, 'income': 50000})  # Invalid age
survey.add_response({'id': 5, 'age': 28})  # Missing income field

print(survey)  # 2024 Income Survey: 3 responses
print(f"Average income: ${survey.get_average_income():,.0f}")
print(f"Ages 30-40: {len(survey.filter_by_age(30, 40))} people")
print(f"First record: {survey[0]}")

stats = survey.get_income_stats()
print("\nIncome statistics:")
print(f"  Sample size: {stats['count']}")
print(f"  Average: ${stats['mean']:,.0f}")
print(f"  Range: ${stats['min']:,} - ${stats['max']:,}")

Exercise 3: Data Analysis Pipeline (Intermediate)

Difficulty: ⭐⭐⭐ Time: 35 minutes

Create a data processing pipeline that supports method chaining.

python
class DataPipeline:
    """Data processing pipeline"""

    def __init__(self, data):
        pass

    def filter_by(self, condition):
        """Filter by condition (any callable, such as a lambda)"""
        pass

    def transform(self, func):
        """Transform data"""
        pass

    def group_by(self, key):
        """Group by"""
        pass

    def get_result(self):
        """Get result"""
        pass

    def summary(self):
        """Processing summary"""
        pass

# Test
data = [
    {'id': 1, 'age': 25, 'income': 50000, 'city': 'Beijing'},
    {'id': 2, 'age': 35, 'income': 80000, 'city': 'Shanghai'},
    # ...
]

result = (DataPipeline(data)
    .filter_by(lambda x: x['age'] >= 30)
    .transform(lambda x: {**x, 'income_万元': x['income'] / 10000})
    .get_result()
)

✅ Reference Solution

python
class DataPipeline:
    """Data processing pipeline"""

    def __init__(self, data):
        self.original_data = data.copy()
        self.data = data.copy()
        self.steps = []

    def filter_by(self, condition):
        """Filter by condition"""
        self.data = [item for item in self.data if condition(item)]
        self.steps.append(f"filter_by (kept {len(self.data)} records)")
        return self

    def transform(self, func):
        """Transform data"""
        self.data = [func(item) for item in self.data]
        self.steps.append("transform")
        return self

    def remove_field(self, *fields):
        """Remove fields"""
        self.data = [{k: v for k, v in item.items() if k not in fields}
                     for item in self.data]
        self.steps.append(f"remove_field({', '.join(fields)})")
        return self

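    def add_field(self, field_name, func):
        """Add new field"""
        # Build new dicts instead of mutating items in place: data.copy() in
        # __init__ is shallow, so in-place changes would leak into the caller's data
        self.data = [{**item, field_name: func(item)} for item in self.data]
        self.steps.append(f"add_field('{field_name}')")
        return self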

    def sort_by(self, key, reverse=False):
        """Sort"""
        self.data = sorted(self.data, key=key, reverse=reverse)
        self.steps.append(f"sort_by (reverse={reverse})")
        return self

    def limit(self, n):
        """Limit number"""
        self.data = self.data[:n]
        self.steps.append(f"limit({n})")
        return self

    def group_by(self, key_func):
        """Group by"""
        groups = {}
        for item in self.data:
            group_key = key_func(item)
            if group_key not in groups:
                groups[group_key] = []
            groups[group_key].append(item)

        # Convert to grouped result format
        self.data = [
            {'group': k, 'items': v, 'count': len(v)}
            for k, v in groups.items()
        ]
        self.steps.append(f"group_by ({len(self.data)} groups)")
        return self

    def get_result(self):
        """Get result"""
        return self.data

    def summary(self):
        """Processing summary"""
        print("=" * 50)
        print("Data Processing Pipeline Summary")
        print("=" * 50)
        print(f"Original data: {len(self.original_data)} records")
        print(f"After processing: {len(self.data)} records")
        print("\nProcessing steps:")
        for i, step in enumerate(self.steps, 1):
            print(f"  {i}. {step}")
        print("=" * 50)

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return f"DataPipeline(records={len(self.data)}, steps={len(self.steps)})"


# Test
data = [
    {'id': 1, 'age': 25, 'income': 50000, 'city': 'Beijing', 'gender': 'F'},
    {'id': 2, 'age': 35, 'income': 80000, 'city': 'Shanghai', 'gender': 'M'},
    {'id': 3, 'age': 45, 'income': 120000, 'city': 'Beijing', 'gender': 'F'},
    {'id': 4, 'age': 28, 'income': 65000, 'city': 'Guangzhou', 'gender': 'M'},
    {'id': 5, 'age': 32, 'income': 95000, 'city': 'Shanghai', 'gender': 'F'},
    {'id': 6, 'age': 40, 'income': 110000, 'city': 'Beijing', 'gender': 'M'},
]

# Example 1: Basic pipeline
print("Example 1: Filter age >= 30, convert income to 10k units")
result1 = (DataPipeline(data)
    .filter_by(lambda x: x['age'] >= 30)
    .add_field('income_万元', lambda x: round(x['income'] / 10000, 2))
    .remove_field('gender')
    .sort_by(lambda x: x['income'], reverse=True)
    .get_result()
)

for r in result1:
    print(f"  ID{r['id']}: {r['age']} years old, {r['city']}, {r['income_万元']} (10k yuan)")

# Example 2: Group statistics
print("\nExample 2: Group by city")
pipeline2 = DataPipeline(data)
result2 = (pipeline2
    .filter_by(lambda x: x['age'] >= 25)
    .group_by(lambda x: x['city'])
    .get_result()
)

for group in result2:
    avg_income = sum(item['income'] for item in group['items']) / len(group['items'])
    print(f"  {group['group']:12s}: {group['count']} people, average income ${avg_income:,.0f}")

pipeline2.summary()

# Example 3: Top N
print("\nExample 3: Top 3 highest incomes")
result3 = (DataPipeline(data)
    .sort_by(lambda x: x['income'], reverse=True)
    .limit(3)
    .get_result()
)

for i, r in enumerate(result3, 1):
    print(f"  {i}. ID{r['id']}: {r['age']} years old, ${r['income']:,}")

Exercise 4: Simple Linear Regression Class (Advanced)

Difficulty: ⭐⭐⭐⭐ Time: 40 minutes

Implement a simple linear regression class, mimicking Scikit-learn's API design.

python
class SimpleLinearRegression:
    """Simple linear regression"""

    def __init__(self):
        pass

    def fit(self, X, y):
        """Fit model"""
        pass

    def predict(self, X):
        """Predict"""
        pass

    def score(self, X, y):
        """Calculate R²"""
        pass

    def __repr__(self):
        pass

# Test
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

model = SimpleLinearRegression()
model.fit(X, y)
print(model)  # Display slope and intercept

predictions = model.predict([6, 7, 8])
print(f"Predictions: {predictions}")

r2 = model.score(X, y)
print(f"R² = {r2:.3f}")

✅ Reference Solution

python
import numpy as np

class SimpleLinearRegression:
    """Simple linear regression (y = slope * x + intercept)"""

    def __init__(self):
        self.slope = None
        self.intercept = None
        self.is_fitted = False

    def fit(self, X, y):
        """Fit model

        Parameters:
            X: Independent variable (1D array)
            y: Dependent variable (1D array)

        Returns:
            self (supports method chaining)
        """
        X = np.array(X)
        y = np.array(y)

        if len(X) != len(y):
            raise ValueError("X and y must have the same length")

        # Calculate slope and intercept
        x_mean = X.mean()
        y_mean = y.mean()

        # slope = Σ((x - x̄)(y - ȳ)) / Σ((x - x̄)²)
        numerator = ((X - x_mean) * (y - y_mean)).sum()
        denominator = ((X - x_mean) ** 2).sum()

        if denominator == 0:
            raise ValueError("X has zero variance, cannot fit")

        self.slope = numerator / denominator
        self.intercept = y_mean - self.slope * x_mean
        self.is_fitted = True

        return self  # Support method chaining

    def predict(self, X):
        """Predict

        Parameters:
            X: Independent variable

        Returns:
            Array of predictions
        """
        if not self.is_fitted:
            raise ValueError("Model not trained, please call fit() first")

        X = np.array(X)
        return self.slope * X + self.intercept

    def score(self, X, y):
        """Calculate R² (coefficient of determination)

        R² = 1 - (SS_res / SS_tot)

        Parameters:
            X: Independent variable
            y: True values

        Returns:
            R² value (at most 1; closer to 1 is better, and negative when the fit is worse than predicting the mean of y)
        """
        y = np.array(y)
        y_pred = self.predict(X)

        # Residual sum of squares
        ss_res = ((y - y_pred) ** 2).sum()

        # Total sum of squares
        ss_tot = ((y - y.mean()) ** 2).sum()

        if ss_tot == 0:
            return 0.0

        return 1 - (ss_res / ss_tot)

    def get_residuals(self, X, y):
        """Calculate residuals"""
        y_pred = self.predict(X)
        return np.array(y) - y_pred

    def summary(self):
        """Print model summary"""
        if not self.is_fitted:
            print("Model not trained")
            return

        print("=" * 50)
        print("Simple Linear Regression Model Summary")
        print("=" * 50)
        print(f"Slope:     {self.slope:.4f}")
        print(f"Intercept: {self.intercept:.4f}")
        print(f"Equation: y = {self.slope:.4f}x + {self.intercept:.4f}")
        print("=" * 50)

    def __repr__(self):
        if not self.is_fitted:
            return "SimpleLinearRegression(unfitted)"
        return f"SimpleLinearRegression(slope={self.slope:.4f}, intercept={self.intercept:.4f})"

    def __str__(self):
        if not self.is_fitted:
            return "Untrained model"
        return f"y = {self.slope:.4f}x + {self.intercept:.4f}"


# Test
print("=" * 60)
print("Simple Linear Regression Test")
print("=" * 60)

# Data 1: Perfect linear relationship
print("\nTest 1: Perfect linear relationship (y = 2x)")
X1 = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]

model1 = SimpleLinearRegression()
model1.fit(X1, y1)
print(model1)
model1.summary()

predictions1 = model1.predict([6, 7, 8])
print(f"Predictions for x=[6,7,8]: {predictions1}")
print(f"R² = {model1.score(X1, y1):.4f}")

# Data 2: Linear relationship with noise
print("\nTest 2: Linear relationship with noise")
X2 = [1, 2, 3, 4, 5]
y2 = [2, 4, 5, 4, 5]

model2 = SimpleLinearRegression()
model2.fit(X2, y2)
print(model2)

predictions2 = model2.predict([6, 7, 8])
print(f"Predictions for x=[6,7,8]: {predictions2}")
print(f"R² = {model2.score(X2, y2):.4f}")

# Residual analysis
residuals = model2.get_residuals(X2, y2)
print(f"Residuals: {residuals}")

# Data 3: Income and years of education
print("\nTest 3: Income vs Years of Education")
education_years = [12, 14, 16, 18, 20]  # Years of education
income = [35000, 45000, 60000, 75000, 90000]  # Income

model3 = SimpleLinearRegression()
model3.fit(education_years, income)
model3.summary()

# Predict: Bachelor's (16 years), Master's (18 years), and PhD (20 years)
predictions3 = model3.predict([16, 18, 20])
print("\nPredicted income:")
print(f"  Bachelor's (16 years): ${predictions3[0]:,.0f}")
print(f"  Master's (18 years): ${predictions3[1]:,.0f}")
print(f"  PhD (20 years): ${predictions3[2]:,.0f}")
print(f"\nR² = {model3.score(education_years, income):.4f}")

print("\n" + "=" * 60)

Next Steps

After completing this chapter, you have mastered:

  • OOP core concepts (class, object, method, attribute)
  • Special methods (__init__, __str__, __len__, etc.)
  • Encapsulation (public/private)
  • OOP applications in data science

Congratulations on completing Module 6!

In Module 7, we'll learn file operations, including reading and writing CSV, Excel, Stata, and other data files.



Released under the MIT License. Content © Author.