Module 6 Summary and Review
Object-Oriented Programming Basics — Understanding Classes and Objects
Knowledge Summary
1. OOP Core Concepts
What is OOP?
- Object-oriented programming is a paradigm that organizes data and the methods that operate on that data together
- Object: A collection of data + methods
- Class: A template/blueprint for objects
- Method: A function belonging to an object
Why do we need OOP?
- Data and methods are naturally bound together
- Code is more organized
- Easier to reuse and maintain
- Aligns with real-world modeling
Core Terminology:
| Term | Definition | Example |
|---|---|---|
| Class | Object template | class Student: |
| Object/Instance | Concrete instance of a class | alice = Student() |
| Attribute | Object's data | alice.name = "Alice" |
| Method | Object's function | alice.calculate_gpa() |
| self | Refers to current object | self.name |
| Constructor | Initialize object | __init__() |
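The terms in the table can be seen together in one small sketch. The `grades` parameter and the averaging logic are assumptions made for illustration; the table itself only names `Student`, `alice`, `name`, and `calculate_gpa()`.

```python
class Student:
    """A class: the template from which Student objects are built."""

    def __init__(self, name, grades):
        # The constructor runs when Student(...) is called.
        # self refers to the object being created.
        self.name = name        # attribute
        self.grades = grades    # attribute (hypothetical, for illustration)

    def calculate_gpa(self):
        """A method: a function that works on this object's own data."""
        return sum(self.grades) / len(self.grades)


alice = Student("Alice", [3.7, 4.0, 3.3])   # object (instance)
bob = Student("Bob", [3.0, 3.5])            # a second, independent object

print(alice.name)             # attribute access
print(bob.calculate_gpa())    # method call -> 3.25
```

Each object carries its own data, so calling `calculate_gpa()` on `alice` and on `bob` uses different grades even though both share the same method definition.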
2. Basic Class Structure
```python
class ClassName:
    """Class docstring"""

    # Class attribute (shared by all objects)
    class_variable = "shared"

    def __init__(self, param1, param2):
        """Constructor"""
        self.param1 = param1  # Instance attribute
        self.param2 = param2

    def instance_method(self):
        """Instance method"""
        return self.param1

    @classmethod
    def class_method(cls):
        """Class method"""
        return cls.class_variable

    @staticmethod
    def static_method():
        """Static method"""
        return "Does not depend on class or instance"
```

Three Method Types:
| Method Type | First Parameter | Access Instance Attributes | Access Class Attributes | Use Case |
|---|---|---|---|---|
| Instance method | self | ✓ | ✓ | Most common, operate on object data |
| Class method | cls | ✗ | ✓ | Factory methods, alternative constructors |
| Static method | None | ✗ | ✗ | Utility functions |
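As a rough illustration of the use cases in the table, the sketch below uses a class method as an alternative constructor and a static method as a utility function. The names `from_string`, `is_valid_year`, and `label`, and the year range, are hypothetical choices for this example.

```python
class Survey:
    def __init__(self, name, year):
        self.name = name
        self.year = year

    def label(self):
        """Instance method: reads this object's data through self."""
        return f"{self.name} ({self.year})"

    @classmethod
    def from_string(cls, text):
        """Class method used as an alternative constructor.

        Parses a string such as "Income Survey, 2023".
        """
        name, year = text.split(",")
        return cls(name.strip(), int(year))

    @staticmethod
    def is_valid_year(year):
        """Static method: a helper that needs neither self nor cls."""
        return 1900 <= year <= 2100


s = Survey.from_string("Income Survey, 2023")
print(s.label())                    # Income Survey (2023)
print(Survey.is_valid_year(1850))   # False
```

Using `cls(...)` instead of `Survey(...)` inside `from_string` means a subclass that inherits the method gets an instance of the subclass back.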
3. Instance Attributes vs Class Attributes
```python
class Survey:
    # Class attribute (shared by all objects)
    total_surveys = 0

    def __init__(self, name, year):
        # Instance attributes (unique to each object)
        self.name = name
        self.year = year
        Survey.total_surveys += 1  # Modify class attribute

# Usage
survey1 = Survey("Income Survey", 2024)
survey2 = Survey("Health Survey", 2024)
print(survey1.name)          # Income Survey (instance attribute)
print(Survey.total_surveys)  # 2 (class attribute)
```

Differences:
- Instance attributes: Unique to each object, accessed via `self.attr`
- Class attributes: Shared by all objects, accessed via `ClassName.attr`
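One subtlety is worth a short sketch: a class attribute can be read through an instance, but assigning through an instance creates a separate instance attribute instead of changing the shared value. The example is a simplified variant of the `Survey` class, written for illustration.

```python
class Survey:
    total_surveys = 0  # class attribute

    def __init__(self, name):
        self.name = name
        Survey.total_surveys += 1


a = Survey("Income Survey")
b = Survey("Health Survey")

# Reading through an instance falls back to the class attribute.
print(a.total_surveys)       # 2

# Assigning through an instance creates a new instance attribute on `a`
# that shadows the class attribute; the class value is unchanged.
a.total_surveys = 99
print(a.total_surveys)       # 99
print(b.total_surveys)       # 2
print(Survey.total_surveys)  # 2
```

This is why updates to shared state are written as `Survey.total_surveys += 1` rather than `self.total_surveys += 1`.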
4. Special Methods (Magic Methods)
| Method | Purpose | Triggered By |
|---|---|---|
| `__init__()` | Constructor | `obj = Class()` |
| `__str__()` | String representation (user-friendly) | `print(obj)` |
| `__repr__()` | Developer representation | `repr(obj)` |
| `__len__()` | Length | `len(obj)` |
| `__getitem__()` | Index access | `obj[key]` |
| `__eq__()` | Equality comparison | `obj1 == obj2` |
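A minimal sketch of `__repr__` and `__eq__` from the table follows. The `Respondent` class and its fields are hypothetical and exist only for this illustration.

```python
class Respondent:
    def __init__(self, respondent_id, age):
        self.respondent_id = respondent_id
        self.age = age

    def __repr__(self):
        # Developer-facing text, ideally close to the code that rebuilds the object
        return f"Respondent(respondent_id={self.respondent_id!r}, age={self.age!r})"

    def __eq__(self, other):
        # Compare by value; defer to Python for unrelated types
        if not isinstance(other, Respondent):
            return NotImplemented
        return (self.respondent_id, self.age) == (other.respondent_id, other.age)


r1 = Respondent(1, 30)
r2 = Respondent(1, 30)

print(repr(r1))   # Respondent(respondent_id=1, age=30)
print(r1 == r2)   # True  (same values)
print(r1 is r2)   # False (two distinct objects)
print([r1])       # [Respondent(respondent_id=1, age=30)]
```

Without `__eq__`, `r1 == r2` would compare identity and return `False`. Note that a class defining `__eq__` without `__hash__` becomes unhashable, so its objects cannot be used as dictionary keys or set members unless `__hash__` is also defined.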
Example:
```python
class Survey:
    def __init__(self, name):
        self.name = name
        self.responses = []

    def __str__(self):
        return f"Survey: {self.name} ({len(self.responses)} responses)"

    def __len__(self):
        return len(self.responses)

    def __getitem__(self, index):
        return self.responses[index]

# Usage
survey = Survey("Test")
survey.responses = [1, 2, 3]
print(survey)       # Survey: Test (3 responses)
print(len(survey))  # 3
print(survey[0])    # 1
```

5. Encapsulation: Public vs Private
```python
class BankAccount:
    def __init__(self, balance):
        self.balance = balance   # Public attribute
        self._transactions = []  # Private by convention (single underscore)
        self.__pin = "1234"      # Name-mangled (double underscore)

    def deposit(self, amount):
        """Public method"""
        self.balance += amount
        self._log_transaction("deposit", amount)

    def _log_transaction(self, kind, amount):
        """Private method (convention)"""
        # `kind` avoids shadowing the built-in type()
        self._transactions.append({'type': kind, 'amount': amount})
```

Naming Conventions:
- `name`: Public (directly accessible)
- `_name`: Private by convention (external access is discouraged but still possible)
- `__name`: Name-mangled (Python rewrites it to `_ClassName__name`, which makes accidental external access harder but does not prevent it)
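The effect of the double underscore can be checked directly. The sketch below reuses a cut-down `BankAccount` and shows that the attribute is renamed rather than hidden.

```python
class BankAccount:
    def __init__(self, balance):
        self.balance = balance
        self._transactions = []
        self.__pin = "1234"   # stored on the object as _BankAccount__pin


account = BankAccount(100)
print(account.balance)        # 100
print(account._transactions)  # [] (allowed, but discouraged by convention)

try:
    account.__pin             # no attribute with this exact name exists
except AttributeError as err:
    print("AttributeError:", err)

# The mangled name is still reachable, so this is not real access control.
print(account._BankAccount__pin)  # 1234
```

Name mangling is mainly intended to avoid accidental name clashes in subclasses; it should not be relied on to protect sensitive data.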
6. OOP in Data Science Applications
Pandas DataFrame:
```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

# Attributes
df.shape    # (3, 1)
df.columns  # Index(['age'])

# Methods
df.head()
df.mean()
df.to_csv('output.csv')

# Method chaining
result = (df
    .query('age > 25')
    .assign(age_squared=lambda x: x['age']**2)
    .sort_values('age')
)
```

Scikit-learn Models:
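```python
from sklearn.linear_model import LinearRegression

# Toy data added for illustration so the snippet runs on its own (y = 2x)
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]
X_new = [[5], [6]]

model = LinearRegression()          # Create object
model.fit(X, y)                     # Train (method)
predictions = model.predict(X_new)  # Predict (method)

# Access attributes
print(model.coef_)       # Coefficients
print(model.intercept_)  # Intercept
```

Python vs Stata vs R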
Object-Oriented Comparison
Python (fully object-oriented):
```python
df = pd.DataFrame({'x': [1, 2, 3]})
df.mean()  # Method call
df.shape   # Attribute access
```

R (partially object-oriented):

```r
df <- data.frame(x = c(1, 2, 3))
mean(df$x)  # Function call
dim(df)     # Function call
```

Stata (procedural):

```stata
* Stata is mainly command-based
summarize income
generate log_income = log(income)
regress y x1 x2
```

Common Errors
1. Forgetting the self Parameter
```python
# Wrong
class Student:
    def __init__(name, age):  # Forgot self
        name = name           # Won't save to object

# Correct
class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age
```

2. Confusing Instance and Class Attributes
```python
# Wrong
class Counter:
    count = 0  # Class attribute

    def increment(self):
        count += 1  # UnboundLocalError: a bare name is a local variable, not Counter.count

# Correct
class Counter:
    count = 0

    def increment(self):
        Counter.count += 1  # Or self.__class__.count += 1
```

3. Mutable Class Attributes Cause Unexpected Sharing
```python
# Wrong
class Survey:
    responses = []  # Class attribute!

    def add_response(self, resp):
        self.responses.append(resp)  # All objects share the same list

# Correct
class Survey:
    def __init__(self):
        self.responses = []  # Instance attribute

    def add_response(self, resp):
        self.responses.append(resp)  # Each object has its own list
```

4. Forgetting to Implement __str__ Leads to Unfriendly Output
```python
# Bad
class Student:
    def __init__(self, name):
        self.name = name

s = Student("Alice")
print(s)  # <__main__.Student object at 0x...>

# Good
class Student:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return f"Student(name='{self.name}')"

s = Student("Alice")
print(s)  # Student(name='Alice')
```

Best Practices
1. Use CapWords Naming for Classes
```python
# Good
class StudentRecord:
    pass

class SurveyData:
    pass

# Bad
class student_record:
    pass

class surveydata:
    pass
```

2. Use snake_case Naming for Methods
```python
class DataAnalyzer:
    def calculate_mean(self):  # ✓
        pass

    def CalculateMean(self):   # ✗
        pass
```

3. Use Docstrings
```python
class Survey:
    """Survey class

    Manages survey data including adding responses, statistical analysis, etc.

    Attributes:
        name (str): Survey name
        year (int): Survey year
        responses (list): Response list
    """

    def __init__(self, name, year):
        self.name = name
        self.year = year
        self.responses = []
```

4. Support Method Chaining
```python
class DataPipeline:
    def __init__(self, data):
        self.data = list(data)  # Needed so DataPipeline(data) works

    def remove_outliers(self):
        # Processing logic...
        return self  # Return self

    def standardize(self):
        # Processing logic...
        return self

    def filter_missing(self):
        # Processing logic...
        return self

# Method chaining
data = [1.2, 3.4, 5.6]  # Illustrative input
pipeline = (DataPipeline(data)
    .remove_outliers()
    .standardize()
    .filter_missing()
)
```

Programming Exercises
Exercise 1: Student Grade Management System (Basic)
Difficulty: ⭐⭐ Time: 20 minutes
Create a Student class.
Requirements:
```python
class Student:
    """Student class"""

    def __init__(self, student_id, name, major):
        pass

    def add_grade(self, course, grade):
        """Add a grade"""
        pass

    def get_gpa(self):
        """Calculate GPA (assuming 100-point scale, convert to 4.0 scale)"""
        pass

    def __str__(self):
        return f"Student: {self.name} ({self.major}), GPA: {self.get_gpa():.2f}"

# Test
alice = Student(2024001, "Alice Wang", "Economics")
alice.add_grade("Microeconomics", 85)
alice.add_grade("Econometrics", 90)
alice.add_grade("Statistics", 78)
print(alice)
print(f"GPA: {alice.get_gpa():.2f}")
```

✅ Reference Solution
```python
class Student:
    """Student class"""

    def __init__(self, student_id, name, major):
        self.student_id = student_id
        self.name = name
        self.major = major
        self.grades = {}  # {course: grade}

    def add_grade(self, course, grade):
        """Add a grade"""
        if not (0 <= grade <= 100):
            raise ValueError("Grade must be between 0-100")
        self.grades[course] = grade

    def get_gpa(self):
        """Calculate GPA (100-point to 4.0 scale conversion)"""
        if not self.grades:
            return 0.0
        # Conversion rules: 90-100=4.0, 80-89=3.0, 70-79=2.0, 60-69=1.0, <60=0.0
        total_points = 0
        for grade in self.grades.values():
            if grade >= 90:
                total_points += 4.0
            elif grade >= 80:
                total_points += 3.0
            elif grade >= 70:
                total_points += 2.0
            elif grade >= 60:
                total_points += 1.0
            else:
                total_points += 0.0
        return total_points / len(self.grades)

    def get_average_score(self):
        """Calculate average score"""
        if not self.grades:
            return 0.0
        return sum(self.grades.values()) / len(self.grades)

    def __str__(self):
        return f"Student: {self.name} ({self.major}), GPA: {self.get_gpa():.2f}"

    def __repr__(self):
        return f"Student(id={self.student_id}, name='{self.name}', courses={len(self.grades)})"

# Test
alice = Student(2024001, "Alice Wang", "Economics")
alice.add_grade("Microeconomics", 85)
alice.add_grade("Econometrics", 90)
alice.add_grade("Statistics", 78)
print(alice)  # Student: Alice Wang (Economics), GPA: 3.00
print(f"Average score: {alice.get_average_score():.1f}")  # 84.3
print(repr(alice))  # Student(id=2024001, name='Alice Wang', courses=3)
```

Exercise 2: Survey Data Container (Basic)
Difficulty: ⭐⭐ Time: 25 minutes
```python
class SurveyData:
    """Survey data management class"""

    def __init__(self, survey_name):
        pass

    def add_response(self, response):
        """Add a response"""
        pass

    def _validate(self, response):
        """Private method: validate data"""
        pass

    def get_average_income(self):
        """Calculate average income"""
        pass

    def filter_by_age(self, min_age, max_age):
        """Filter by age"""
        pass

    def __len__(self):
        return len(self.responses)

    def __str__(self):
        return f"{self.survey_name}: {len(self)} responses"

# Test
survey = SurveyData("2024 Income Survey")
survey.add_response({'id': 1, 'age': 30, 'income': 75000})
survey.add_response({'id': 2, 'age': 35, 'income': 85000})
print(survey)
print(f"Average income: ${survey.get_average_income():,.0f}")
```

✅ Reference Solution
```python
class SurveyData:
    """Survey data management class"""

    def __init__(self, survey_name):
        self.survey_name = survey_name
        self.responses = []

    def add_response(self, response):
        """Add a response"""
        if self._validate(response):
            self.responses.append(response)
            return True
        else:
            print(f"⚠️ Invalid data: {response}")
            return False

    def _validate(self, response):
        """Private method: validate data"""
        required_fields = ['id', 'age', 'income']
        # Check required fields
        if not all(field in response for field in required_fields):
            return False
        # Validate age
        if not (0 < response['age'] < 120):
            return False
        # Validate income
        if response['income'] < 0:
            return False
        return True

    def get_average_income(self):
        """Calculate average income"""
        if not self.responses:
            return 0
        incomes = [r['income'] for r in self.responses]
        return sum(incomes) / len(incomes)

    def filter_by_age(self, min_age, max_age):
        """Filter by age"""
        return [r for r in self.responses
                if min_age <= r['age'] <= max_age]

    def get_income_stats(self):
        """Income statistics"""
        if not self.responses:
            return {}
        incomes = [r['income'] for r in self.responses]
        return {
            'mean': sum(incomes) / len(incomes),
            'min': min(incomes),
            'max': max(incomes),
            'count': len(incomes)
        }

    def __len__(self):
        return len(self.responses)

    def __str__(self):
        return f"{self.survey_name}: {len(self)} responses"

    def __getitem__(self, index):
        """Support index access"""
        return self.responses[index]

# Test
survey = SurveyData("2024 Income Survey")

# Add valid data
survey.add_response({'id': 1, 'age': 30, 'income': 75000})
survey.add_response({'id': 2, 'age': 35, 'income': 85000})
survey.add_response({'id': 3, 'age': 45, 'income': 95000})

# Add invalid data (will be rejected)
survey.add_response({'id': 4, 'age': -5, 'income': 50000})  # Invalid age
survey.add_response({'id': 5, 'age': 28})  # Missing income field

print(survey)  # 2024 Income Survey: 3 responses
print(f"Average income: ${survey.get_average_income():,.0f}")
print(f"Ages 30-40: {len(survey.filter_by_age(30, 40))} people")
print(f"First record: {survey[0]}")

stats = survey.get_income_stats()
print(f"\nIncome statistics:")
print(f"  Sample size: {stats['count']}")
print(f"  Average: ${stats['mean']:,.0f}")
print(f"  Range: ${stats['min']:,} - ${stats['max']:,}")
```

Exercise 3: Data Analysis Pipeline (Intermediate)
Difficulty: ⭐⭐⭐ Time: 35 minutes
Create a data processing pipeline that supports method chaining.
```python
class DataPipeline:
    """Data processing pipeline"""

    def __init__(self, data):
        pass

    def filter_by(self, condition):
        """Filter by condition, supports Lambda"""
        pass

    def transform(self, func):
        """Transform data"""
        pass

    def group_by(self, key):
        """Group by"""
        pass

    def get_result(self):
        """Get result"""
        pass

    def summary(self):
        """Processing summary"""
        pass

# Test
data = [
    {'id': 1, 'age': 25, 'income': 50000, 'city': 'Beijing'},
    {'id': 2, 'age': 35, 'income': 80000, 'city': 'Shanghai'},
    # ...
]
result = (DataPipeline(data)
    .filter_by(lambda x: x['age'] >= 30)
    .transform(lambda x: {**x, 'income_万元': x['income'] / 10000})
    .get_result()
)
```

✅ Reference Solution
```python
class DataPipeline:
    """Data processing pipeline"""

    def __init__(self, data):
        self.original_data = data.copy()
        self.data = data.copy()
        self.steps = []

    def filter_by(self, condition):
        """Filter by condition"""
        self.data = [item for item in self.data if condition(item)]
        self.steps.append(f"filter_by (kept {len(self.data)} records)")
        return self

    def transform(self, func):
        """Transform data"""
        self.data = [func(item) for item in self.data]
        self.steps.append("transform")
        return self

    def remove_field(self, *fields):
        """Remove fields"""
        self.data = [{k: v for k, v in item.items() if k not in fields}
                     for item in self.data]
        self.steps.append(f"remove_field({', '.join(fields)})")
        return self

    def add_field(self, field_name, func):
        """Add new field"""
        # Build new dicts: data.copy() is shallow, so assigning into each item
        # in place would also modify the caller's original records.
        self.data = [{**item, field_name: func(item)} for item in self.data]
        self.steps.append(f"add_field('{field_name}')")
        return self

    def sort_by(self, key, reverse=False):
        """Sort"""
        self.data = sorted(self.data, key=key, reverse=reverse)
        self.steps.append(f"sort_by (reverse={reverse})")
        return self

    def limit(self, n):
        """Limit number"""
        self.data = self.data[:n]
        self.steps.append(f"limit({n})")
        return self

    def group_by(self, key_func):
        """Group by"""
        groups = {}
        for item in self.data:
            group_key = key_func(item)
            if group_key not in groups:
                groups[group_key] = []
            groups[group_key].append(item)
        # Convert to grouped result format
        self.data = [
            {'group': k, 'items': v, 'count': len(v)}
            for k, v in groups.items()
        ]
        self.steps.append(f"group_by ({len(self.data)} groups)")
        return self

    def get_result(self):
        """Get result"""
        return self.data

    def summary(self):
        """Processing summary"""
        print("=" * 50)
        print("Data Processing Pipeline Summary")
        print("=" * 50)
        print(f"Original data: {len(self.original_data)} records")
        print(f"After processing: {len(self.data)} records")
        print("\nProcessing steps:")
        for i, step in enumerate(self.steps, 1):
            print(f"  {i}. {step}")
        print("=" * 50)

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return f"DataPipeline(records={len(self.data)}, steps={len(self.steps)})"

# Test
data = [
    {'id': 1, 'age': 25, 'income': 50000, 'city': 'Beijing', 'gender': 'F'},
    {'id': 2, 'age': 35, 'income': 80000, 'city': 'Shanghai', 'gender': 'M'},
    {'id': 3, 'age': 45, 'income': 120000, 'city': 'Beijing', 'gender': 'F'},
    {'id': 4, 'age': 28, 'income': 65000, 'city': 'Guangzhou', 'gender': 'M'},
    {'id': 5, 'age': 32, 'income': 95000, 'city': 'Shanghai', 'gender': 'F'},
    {'id': 6, 'age': 40, 'income': 110000, 'city': 'Beijing', 'gender': 'M'},
]

# Example 1: Basic pipeline
print("Example 1: Filter age >= 30, convert income to 10k units")
result1 = (DataPipeline(data)
    .filter_by(lambda x: x['age'] >= 30)
    .add_field('income_万元', lambda x: round(x['income'] / 10000, 2))
    .remove_field('gender')
    .sort_by(lambda x: x['income'], reverse=True)
    .get_result()
)
for r in result1:
    print(f"  ID{r['id']}: {r['age']} years old, {r['city']}, {r['income_万元']} 万元")

# Example 2: Group statistics
print("\nExample 2: Group by city")
pipeline2 = DataPipeline(data)
result2 = (pipeline2
    .filter_by(lambda x: x['age'] >= 25)
    .group_by(lambda x: x['city'])
    .get_result()
)
for group in result2:
    avg_income = sum(item['income'] for item in group['items']) / len(group['items'])
    print(f"  {group['group']:12s}: {group['count']} people, average income ${avg_income:,.0f}")
pipeline2.summary()

# Example 3: Top N
print("\nExample 3: Top 3 highest incomes")
result3 = (DataPipeline(data)
    .sort_by(lambda x: x['income'], reverse=True)
    .limit(3)
    .get_result()
)
for i, r in enumerate(result3, 1):
    print(f"  {i}. ID{r['id']}: {r['age']} years old, ${r['income']:,}")
```

Exercise 4: Simple Linear Regression Class (Advanced)
Difficulty: ⭐⭐⭐⭐ Time: 40 minutes
Implement a simple linear regression class, mimicking Scikit-learn's API design.
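For reference, the quantities that `fit` and `score` need are the usual least-squares formulas for one predictor. The symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$, and $\hat{y}_i$ are notation introduced here for the data points, their means, and the fitted values.

$$
\text{slope} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}
$$

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}},
\qquad
SS_{res} = \sum_i (y_i - \hat{y}_i)^2,
\qquad
SS_{tot} = \sum_i (y_i - \bar{y})^2
$$

As a hand check with the data X = [1, 2, 3, 4, 5] and y = [2, 4, 5, 4, 5]: $\bar{x} = 3$ and $\bar{y} = 4$, so slope $= 6 / 10 = 0.6$ and intercept $= 4 - 0.6 \cdot 3 = 2.2$. The fitted values are 2.8, 3.4, 4.0, 4.6, 5.2, which gives $SS_{res} = 2.4$, $SS_{tot} = 6$, and $R^2 = 1 - 2.4 / 6 = 0.6$.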
```python
class SimpleLinearRegression:
    """Simple linear regression"""

    def __init__(self):
        pass

    def fit(self, X, y):
        """Fit model"""
        pass

    def predict(self, X):
        """Predict"""
        pass

    def score(self, X, y):
        """Calculate R²"""
        pass

    def __repr__(self):
        pass

# Test
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
model = SimpleLinearRegression()
model.fit(X, y)
print(model)  # Display slope and intercept
predictions = model.predict([6, 7, 8])
print(f"Predictions: {predictions}")
r2 = model.score(X, y)
print(f"R² = {r2:.3f}")
```

✅ Reference Solution
```python
import numpy as np

class SimpleLinearRegression:
    """Simple linear regression (y = slope * x + intercept)"""

    def __init__(self):
        self.slope = None
        self.intercept = None
        self.is_fitted = False

    def fit(self, X, y):
        """Fit model

        Parameters:
            X: Independent variable (1D array)
            y: Dependent variable (1D array)

        Returns:
            self (supports method chaining)
        """
        X = np.array(X)
        y = np.array(y)
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        # Calculate slope and intercept
        x_mean = X.mean()
        y_mean = y.mean()
        # slope = Σ((x - x̄)(y - ȳ)) / Σ((x - x̄)²)
        numerator = ((X - x_mean) * (y - y_mean)).sum()
        denominator = ((X - x_mean) ** 2).sum()
        if denominator == 0:
            raise ValueError("X has zero variance, cannot fit")
        self.slope = numerator / denominator
        self.intercept = y_mean - self.slope * x_mean
        self.is_fitted = True
        return self  # Support method chaining

    def predict(self, X):
        """Predict

        Parameters:
            X: Independent variable

        Returns:
            Array of predictions
        """
        if not self.is_fitted:
            raise ValueError("Model not trained, please call fit() first")
        X = np.array(X)
        return self.slope * X + self.intercept

    def score(self, X, y):
        """Calculate R² (coefficient of determination)

        R² = 1 - (SS_res / SS_tot)

        Parameters:
            X: Independent variable
            y: True values

        Returns:
            R² value (at most 1, closer to 1 is better; between 0 and 1 on
            the training data, possibly negative on other data)
        """
        y = np.array(y)
        y_pred = self.predict(X)
        # Residual sum of squares
        ss_res = ((y - y_pred) ** 2).sum()
        # Total sum of squares
        ss_tot = ((y - y.mean()) ** 2).sum()
        if ss_tot == 0:
            return 0.0
        return 1 - (ss_res / ss_tot)

    def get_residuals(self, X, y):
        """Calculate residuals"""
        y_pred = self.predict(X)
        return np.array(y) - y_pred

    def summary(self):
        """Print model summary"""
        if not self.is_fitted:
            print("Model not trained")
            return
        print("=" * 50)
        print("Simple Linear Regression Model Summary")
        print("=" * 50)
        print(f"Slope: {self.slope:.4f}")
        print(f"Intercept: {self.intercept:.4f}")
        print(f"Equation: y = {self.slope:.4f}x + {self.intercept:.4f}")
        print("=" * 50)

    def __repr__(self):
        if not self.is_fitted:
            return "SimpleLinearRegression(unfitted)"
        return f"SimpleLinearRegression(slope={self.slope:.4f}, intercept={self.intercept:.4f})"

    def __str__(self):
        if not self.is_fitted:
            return "Untrained model"
        return f"y = {self.slope:.4f}x + {self.intercept:.4f}"

# Test
print("=" * 60)
print("Simple Linear Regression Test")
print("=" * 60)

# Data 1: Perfect linear relationship
print("\nTest 1: Perfect linear relationship (y = 2x)")
X1 = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
model1 = SimpleLinearRegression()
model1.fit(X1, y1)
print(model1)
model1.summary()
predictions1 = model1.predict([6, 7, 8])
print(f"Predictions for x=[6,7,8]: {predictions1}")
print(f"R² = {model1.score(X1, y1):.4f}")

# Data 2: Linear relationship with noise
print("\nTest 2: Linear relationship with noise")
X2 = [1, 2, 3, 4, 5]
y2 = [2, 4, 5, 4, 5]
model2 = SimpleLinearRegression()
model2.fit(X2, y2)
print(model2)
predictions2 = model2.predict([6, 7, 8])
print(f"Predictions for x=[6,7,8]: {predictions2}")
print(f"R² = {model2.score(X2, y2):.4f}")

# Residual analysis
residuals = model2.get_residuals(X2, y2)
print(f"Residuals: {residuals}")

# Data 3: Income and years of education
print("\nTest 3: Income vs Years of Education")
education_years = [12, 14, 16, 18, 20]  # Years of education
income = [35000, 45000, 60000, 75000, 90000]  # Income
model3 = SimpleLinearRegression()
model3.fit(education_years, income)
model3.summary()

# Predict: Bachelor's (16 years), Master's (18 years), and PhD (20 years)
predictions3 = model3.predict([16, 18, 20])
print(f"\nPredicted income:")
print(f"  Bachelor's (16 years): ${predictions3[0]:,.0f}")
print(f"  Master's (18 years): ${predictions3[1]:,.0f}")
print(f"  PhD (20 years): ${predictions3[2]:,.0f}")
print(f"\nR² = {model3.score(education_years, income):.4f}")
print("\n" + "=" * 60)
```

Next Steps
After completing this chapter, you have mastered:
- OOP core concepts (class, object, method, attribute)
- Special methods (`__init__`, `__str__`, `__len__`, etc.)
- Encapsulation (public/private)
- OOP applications in data science
Congratulations on completing Module 6!
In Module 7, we'll learn file operations, including reading and writing CSV, Excel, Stata, and other data files.