11.1 Best Practices and Professional Tools
From Writing Code to Writing Good Code — Code Standards, Debugging, Version Control
Module Overview
Writing code is easy; writing good code is hard. This module teaches essential skills for professional developers: code standards (PEP 8), efficient debugging, performance optimization, and Git version control. These skills will make your code more readable, maintainable, and professional—they're also foundational for team collaboration and open-source contributions.
Important Note: This module focuses on engineering practices. Students doing solo data analysis can be selective about what they learn. However, if you plan to collaborate with others, publish code, or care about code quality, this module is essential.
Learning Objectives
After completing this module, you will be able to:
- Follow Python code standards (PEP 8)
- Write readable, maintainable code
- Use advanced debugging techniques
- Perform code profiling and optimization
- Use Git for version control
- Manage projects on GitHub
- Collaborate with others on development
Module Contents
01 - Python Code Style
Core Question: What makes code "good code"?
Core Content:
- PEP 8: Python's Official Style Guide
- Naming conventions:

```python
# Good naming
student_age = 25
total_income = 50000
calculate_mean()

class StudentRecord:
    pass

# Poor naming
s_age = 25        # Too brief
totalIncome = 50  # camelCase (not Python style)
CalculateMean()   # Functions don't use PascalCase
```

- Indentation and spacing: 4 spaces, spaces around operators
- Line length: 79 characters maximum
- Import order: standard library → third-party → local modules
- Docstrings:

```python
def calculate_bmi(weight, height):
    """Calculate BMI index

    Parameters:
        weight (float): Weight in kilograms
        height (float): Height in meters

    Returns:
        float: BMI value

    Examples:
        >>> calculate_bmi(70, 1.75)
        22.86
    """
    return weight / (height ** 2)
```

- Comment Best Practices (see the sketch after this list):
- Explain "why" not "what"
- Comment complex logic
- Avoid obvious comments
- Code Formatting Tools:
- Black: automatic code formatting
- autopep8: automatic PEP 8 compliance
- isort: automatic import organization
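A minimal sketch of the commenting guidelines above; the variable and the surrounding logic are hypothetical:

```python
wave = 0  # hypothetical variable for illustration

# Obvious comment (restates *what* the code does) - avoid
# add 1 to wave
wave += 1

# Useful comment (explains *why*) - prefer
# Raw files number survey waves from 0, but the codebook numbers them
# from 1, so shift before merging on wave id.
wave += 1
```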
Why It Matters
- Improves readability: you'll thank yourself in 6 months
- Facilitates collaboration: unified style reduces friction
- Professional image: demonstrates code literacy
Practical Comparison:
```python
# Poor code
def f(x,y):
    if x>0:
        return x*y
    else:return 0

# Good code
def calculate_product(x, y):
    """Calculate product of two numbers (only when x is positive)"""
    if x > 0:
        return x * y
    else:
        return 0
```

02 - Debugging and Profiling
Core Question: How to make code faster and more stable?
Core Content:
- Advanced Debugging Techniques:
- Breakpoint debugging (IDE integration)
- Conditional breakpoints: pause only under specific conditions
- Watch variables: view values in real-time
- Debugging Pandas operations:

```python
# Chain operation debugging
df_clean = (
    df
    .pipe(lambda x: print(f"Original: {len(x)} rows") or x)
    .dropna()
    .pipe(lambda x: print(f"After dropna: {len(x)} rows") or x)
    .query('age >= 18')
    .pipe(lambda x: print(f"After filter: {len(x)} rows") or x)
)
```
- Performance Profiling:

```python
# Measure execution time
import time

start = time.time()
# Your code
end = time.time()
print(f"Elapsed: {end - start:.2f} seconds")

# Jupyter magic commands
%timeit df.apply(lambda x: x ** 2)
%prun slow_function()  # Detailed profiling
```

- Performance Optimization Techniques:
- Vectorization vs loops:

```python
# Slow (loop)
for i in range(len(df)):
    df.loc[i, 'squared'] = df.loc[i, 'value'] ** 2

# Fast (vectorized)
df['squared'] = df['value'] ** 2
```

- Use NumPy functions
- Avoid repeated calculations
- Use `.values` to convert to NumPy arrays (faster)
- Memory Optimization (see the sketch after this list):
- Choose appropriate data types (`int32` vs `int64`)
- Read large files in chunks (`chunksize`)
- Drop unnecessary columns
- Logging:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting data cleaning")
logger.warning("Found 100 missing values")
logger.error("File not found")
```
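A minimal sketch of the memory-optimization ideas above, assuming a large hypothetical `survey.csv` with `id`, `age`, and `income` columns; pick dtypes that match your own data:

```python
import pandas as pd

# Hypothetical file and columns; adjust to your dataset
chunks = pd.read_csv(
    "survey.csv",
    usecols=["id", "age", "income"],              # drop unnecessary columns at read time
    dtype={"age": "int32", "income": "float32"},  # smaller dtypes, less memory
    chunksize=100_000,                            # process the file chunk by chunk
)

# Keep only adult respondents from each chunk, then combine
adults = pd.concat(chunk[chunk["age"] >= 18] for chunk in chunks)
adults.info(memory_usage="deep")                  # check the resulting memory footprint
```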
Performance Comparison:
```python
# Slow method (loop): 10 seconds
result = []
for x in data:
    result.append(x ** 2)

# Fast method (NumPy): 0.01 seconds
result = np.array(data) ** 2
```

03 - Git Basics
Core Question: How to manage code versions and collaborate?
Core Content:
- What is Git?
- Version control system: records every code change
- Collaboration tool: lets multiple developers work on the same codebase at the same time
- GitHub: code hosting platform
- Basic Workflow:

```bash
# Initialize repository
git init

# Check status
git status

# Add files to staging area
git add script.py
git add .   # Add all files

# Commit
git commit -m "Add data cleaning script"

# View history
git log
```

- Branch Management:

```bash
# Create and switch to new branch
git checkout -b feature-analysis

# Switch branches
git checkout main

# Merge branches
git merge feature-analysis
```

- Remote Repository (GitHub):

```bash
# Add remote repository
git remote add origin https://github.com/username/repo.git

# Push to remote
git push -u origin main

# Pull updates
git pull

# Clone repository
git clone https://github.com/username/repo.git
```

- Collaboration Workflow:
- Fork project to your account
- Clone to local machine
- Create branch and make changes
- Commit and push
- Create Pull Request
- Ignore Files (.gitignore):
```
# Python
__pycache__/
*.pyc
.ipynb_checkpoints/

# Data files
*.csv
*.dta
data/

# Environment
venv/
.env
```
Why It Matters
- Never lose code again
- Rollback to any historical version
- Essential for team collaboration
- Showcase your projects (academic GitHub)
Practical Scenario:
```bash
# Scenario: Made changes but found errors, want to revert

# View history
git log --oneline
# a1b2c3d Add regression analysis
# e4f5g6h Complete data cleaning
# i7j8k9l Initial commit

# Rollback to data cleaning version
git checkout e4f5g6h

# To permanently rollback
git reset --hard e4f5g6h
```

Amateur vs Professional Code
| Dimension | Amateur Code | Professional Code |
|---|---|---|
| Naming | x, data1 | student_age, clean_survey_df |
| Comments | None or excessive | Moderate, explain "why" |
| Structure | Single file, no functions | Modular, functional |
| Error Handling | Let program crash | try-except graceful handling |
| Version Control | None | Git + GitHub |
| Testing | Manual testing | Automated tests |
| Documentation | None | README + Docstrings |
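To make the "Error Handling" row concrete, a minimal sketch of graceful handling instead of letting the program crash; the file name is hypothetical:

```python
import pandas as pd

try:
    df = pd.read_csv("survey_2024.csv")  # hypothetical input file
except FileNotFoundError:
    print("survey_2024.csv not found - check the data/ folder and rerun.")
    df = pd.DataFrame()  # fall back to an empty frame instead of crashing
```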
How to Learn This Module?
Learning Path
Day 1 (2 hours): Code Style
- Read 01 - Python Code Style
- Install Black formatting tool
- Refactor an old script to comply with PEP 8
Day 2 (3 hours): Debugging and Optimization
- Read 02 - Debugging and Profiling
- Learn profiling tools
- Optimize a slow script
Days 3-4 (6 hours): Git Basics
- Read 03 - Git Basics
- Install and configure Git
- Create GitHub account
- Upload existing project to GitHub
- Practice basic workflow
Total Time: 11 hours (1 week)
Minimal Learning Path
For individual data analysis, priorities are:
Must Learn (basic literacy, 3 hours):
- 01 - Code Style (naming, comments, docstrings)
- Basic debugging techniques
Important (team collaboration, 6 hours):
- 03 - Git Basics (init, add, commit, push)
- GitHub usage
Optional (advanced skills):
- Performance optimization
- Git branch management
- Unit testing
Learning Recommendations
Code Standards are Habits, Not Burdens
- Will feel cumbersome at first
- Use automatic formatting tools (Black)
- You'll thank yourself in 6 months
Git Has a Steep Learning Curve, But Worth It
- First 2 hours are most painful
- Simple once you master basic commands
- The 3 most important commands:

```bash
git add .
git commit -m "message"
git push
```
Start with Existing Projects
- Don't wait for the "perfect moment"
- Choose an existing script, upload to GitHub
- Learn by doing
Practice Project: Build Academic GitHub
```
my-research/
├── README.md              # Project description
├── requirements.txt       # Dependencies
├── .gitignore             # Ignore files
├── data/                  # Data folder (git ignore)
├── scripts/               # Analysis scripts
│   ├── 01_data_cleaning.py
│   ├── 02_descriptive_stats.py
│   └── 03_regression.py
├── outputs/               # Output results
│   ├── tables/
│   └── figures/
└── notebooks/             # Jupyter notebooks
    └── exploratory_analysis.ipynb
```
Common Questions
Q: Why follow code standards? My code works! A:
- In 6 months, you'll forget the logic
- Standard code is like "writing a letter to your future self"
- If collaborating with others, standards are foundational
Q: Git is too complex, can I skip it? A:
- You can learn just the basics (add, commit, push)
- But strongly recommended—benefits are huge:
- Never lose code again
- Rollback to any version
- GitHub is your "academic business card"
Q: Should I upload all code to GitHub? A:
- Upload: cleaned scripts, reproducible analyses
- Don't upload: raw data (privacy), API keys, unfinished code
- Use `.gitignore` to exclude sensitive files
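For keeping API keys out of a public repository, a minimal sketch that reads the key from an environment variable instead of hard-coding it; the variable name is hypothetical:

```python
import os

# Set MY_API_KEY in your shell (or an untracked .env file) rather than
# committing it to GitHub.
api_key = os.environ.get("MY_API_KEY")  # hypothetical variable name
if api_key is None:
    raise RuntimeError("Set MY_API_KEY before running this script.")
```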
Q: Is performance optimization important? My code is fast enough. A:
- Small data (< 100K rows): not important
- Large data (> 1M rows): very important
- Repeatedly run code: worth optimizing
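If you are unsure which case applies, a minimal sketch for timing one step with the standard library before deciding to optimize; the function and data size are placeholders:

```python
import time

def slow_step(data):
    # placeholder for the step you suspect is slow
    return [x ** 2 for x in data]

data = list(range(1_000_000))

start = time.perf_counter()
slow_step(data)
elapsed = time.perf_counter() - start
print(f"slow_step took {elapsed:.3f} s")  # optimize only if this time actually hurts
```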
Q: How to cite GitHub code in papers? A:
Code and data available at:
https://github.com/username/project-name
Or use Zenodo for DOI:
DOI: 10.5281/zenodo.1234567

Next Steps
After completing this module, you'll have mastered:
- Python code standards and best practices
- Efficient debugging and performance optimization
- Git version control and GitHub usage
- Professional developer workflow
Congratulations! You've completed all 11 modules of the Python Fundamentals Tutorial!
Next, you can:
- Deepen Pandas and data analysis skills (practice on real projects)
- Learn statistical modeling (regression analysis, causal inference)
- Explore machine learning and LLM applications
- Contribute to open-source projects, enhance skills
From zero to data analyst—you've taken a solid first step! Keep going!