
11.1 Best Practices and Professional Tools

From Writing Code to Writing Good Code — Code Standards, Debugging, Version Control


Module Overview

Writing code is easy; writing good code is hard. This module teaches essential skills for professional developers: code standards (PEP 8), efficient debugging, performance optimization, and Git version control. These skills will make your code more readable, maintainable, and professional—they're also foundational for team collaboration and open-source contributions.

Important Note: This module focuses on engineering practices. Students doing solo data analysis can pick the parts they need. However, if you plan to collaborate with others, publish code, or care about code quality, this module is essential.


Learning Objectives

After completing this module, you will be able to:

  • Follow Python code standards (PEP 8)
  • Write readable, maintainable code
  • Use advanced debugging techniques
  • Perform code profiling and optimization
  • Use Git for version control
  • Manage projects on GitHub
  • Collaborate with others on development

Module Contents

01 - Python Code Style

Core Question: What makes code "good code"?

Core Content:

  • PEP 8: Python's Official Style Guide
    • Naming conventions:
      python
      # Good naming
      student_age = 25
      total_income = 50000
      calculate_mean()
      class StudentRecord:
          pass
      
      # Poor naming
      s_age = 25          # Too brief
      totalIncome = 50    # camelCase (not Python style)
      CalculateMean()     # Functions don't use PascalCase
    • Indentation and spacing: 4 spaces, spaces around operators
    • Line length: 79 characters maximum (72 for docstrings and comments)
    • Import order: standard library → third-party → local modules
  • Docstrings:
    python
    def calculate_bmi(weight, height):
        """Calculate body mass index (BMI).
    
        Parameters:
            weight (float): Weight in kilograms
            height (float): Height in meters
    
        Returns:
            float: BMI value
    
        Examples:
            >>> round(calculate_bmi(70, 1.75), 2)
            22.86
        """
        return weight / (height ** 2)
  • Comment Best Practices:
    • Explain "why" not "what"
    • Comment complex logic
    • Avoid obvious comments
  • Code Formatting Tools:
    • Black: automatic code formatting
    • autopep8: automatic PEP 8 compliance
    • isort: automatic import organization
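
Docstring examples are only useful if they stay correct, and one way to check them is to run them. The sketch below uses the standard library's doctest module on a function shaped like calculate_bmi above. The example rounds the result because doctest compares printed output exactly. The helper name check_docstring is an illustrative choice, not part of any library.

```python
import doctest


def calculate_bmi(weight, height):
    """Calculate body mass index (BMI).

    Examples:
        >>> round(calculate_bmi(70, 1.75), 2)
        22.86
    """
    return weight / (height ** 2)


def check_docstring(func):
    """Run the '>>>' examples in func's docstring (hypothetical helper)."""
    # Parse the docstring into a DocTest and give the examples access to func.
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    # run() prints a report for any failing example and returns the counts.
    return doctest.DocTestRunner(verbose=False).run(test)


results = check_docstring(calculate_bmi)
print(f"{results.attempted} example(s) run, {results.failed} failed")
```

In a regular script, running python -m doctest -v your_script.py (file name is a placeholder) checks every docstring in the file without any helper.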

Why It Matters

  • Improves readability: you'll thank yourself in 6 months
  • Facilitates collaboration: unified style reduces friction
  • Professional image: demonstrates code literacy

Practical Comparison:

python
# Poor code
def f(x,y):
    if x>0:
        return x*y
    else:return 0

# Good code
def calculate_product(x, y):
    """Calculate product of two numbers (only when x is positive)"""
    if x > 0:
        return x * y
    else:
        return 0

02 - Debugging and Profiling

Core Question: How to make code faster and more stable?

Core Content:

  • Advanced Debugging Techniques:
    • Breakpoint debugging (IDE integration)
    • Conditional breakpoints: pause only under specific conditions
    • Watch variables: view values in real-time
    • Debugging Pandas operations:
      python
      # Chain operation debugging
      df_clean = (df
          .pipe(lambda x: print(f"Original: {len(x)} rows") or x)
          .dropna()
          .pipe(lambda x: print(f"After dropna: {len(x)} rows") or x)
          .query('age >= 18')
          .pipe(lambda x: print(f"After filter: {len(x)} rows") or x)
      )
  • Performance Profiling:
    python
    # Measure execution time
    import time
    
    start = time.perf_counter()  # monotonic, high-resolution timer
    # Your code
    end = time.perf_counter()
    print(f"Elapsed: {end - start:.2f} seconds")
    
    # Jupyter magic commands
    %timeit df.apply(lambda x: x ** 2)
    %prun slow_function()  # Detailed profiling
  • Performance Optimization Techniques:
    • Vectorization vs loops:
      python
      # Slow (loop)
      for i in range(len(df)):
          df.loc[i, 'squared'] = df.loc[i, 'value'] ** 2
      
      # Fast (vectorized)
      df['squared'] = df['value'] ** 2
    • Use NumPy functions
    • Avoid repeated calculations
    • Use .to_numpy() to convert to NumPy arrays (often faster for heavy numeric work; preferred over the older .values)
  • Memory Optimization:
    • Choose appropriate data types (int32 vs int64)
    • Read large files in chunks (chunksize)
    • Drop unnecessary columns
  • Logging:
    python
    import logging
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    logger.info("Starting data cleaning")
    logger.warning("Found 100 missing values")
    logger.error("File not found")
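
The "avoid repeated calculations" point above can be sketched with functools.lru_cache from the standard library, combined with the logging setup just shown. The lookup function and its rate values are made up for illustration; in practice the cached function would wrap a slow query, file read, or computation.

```python
import logging
from functools import lru_cache

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@lru_cache(maxsize=None)
def exchange_rate(currency):
    """Hypothetical expensive lookup (stands in for a slow query)."""
    logger.info("Computing rate for %s", currency)
    rates = {"USD": 1.0, "EUR": 1.1}  # made-up values for the demo
    return rates[currency]


incomes = [("USD", 50000), ("EUR", 40000), ("EUR", 30000)]

# The function body runs once per distinct currency; repeats hit the cache.
converted = [amount * exchange_rate(cur) for cur, amount in incomes]

info = exchange_rate.cache_info()
print(f"cache hits: {info.hits}, misses: {info.misses}")
```

The log shows one "Computing rate" line per distinct currency, which is a quick way to confirm that the repeated work is actually being skipped.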

Performance Comparison:

python
import numpy as np

# Timings are illustrative; actual values depend on data size and hardware.
# Slow method (loop): ~10 seconds
result = []
for x in data:
    result.append(x ** 2)

# Fast method (NumPy): ~0.01 seconds
result = np.array(data) ** 2
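
Timings like the ones quoted above vary with data size and hardware, so it helps to measure on your own machine. Outside Jupyter, the standard library's timeit and cProfile modules cover roughly what %timeit and %prun do. The sketch below profiles a hypothetical slow_function; it sticks to pure Python so it runs without extra packages.

```python
import cProfile
import io
import pstats
import timeit


def slow_function(n=20_000):
    """Hypothetical workload: sum of squares with an explicit loop."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total


# timeit: run the call repeatedly and keep the best total time.
best = min(timeit.repeat(slow_function, number=10, repeat=3))
print(f"Best of 3 runs (10 calls each): {best:.4f} s")

# cProfile: record how much time is spent inside each function.
profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# Format the five most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Taking the minimum of several repeats is a common choice because slower runs usually reflect interference from other processes rather than the code itself.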

03 - Git Basics

Core Question: How to manage code versions and collaborate?

Core Content:

  • What is Git?
    • Version control system: records every code change
    • Collaboration tool: lets multiple developers work on the same codebase simultaneously
    • GitHub: code hosting platform
  • Basic Workflow:
    bash
    # Initialize repository
    git init
    
    # Check status
    git status
    
    # Add files to staging area
    git add script.py
    git add .  # Add all files
    
    # Commit
    git commit -m "Add data cleaning script"
    
    # View history
    git log
  • Branch Management:
    bash
    # Create and switch to new branch
    git checkout -b feature-analysis
    
    # Switch branches
    git checkout main
    
    # Merge branches
    git merge feature-analysis
  • Remote Repository (GitHub):
    bash
    # Add remote repository
    git remote add origin https://github.com/username/repo.git
    
    # Push to remote
    git push -u origin main
    
    # Pull updates
    git pull
    
    # Clone repository
    git clone https://github.com/username/repo.git
  • Collaboration Workflow:
    1. Fork project to your account
    2. Clone to local machine
    3. Create branch and make changes
    4. Commit and push
    5. Create Pull Request
  • Ignore Files (.gitignore):
    # Python
    __pycache__/
    *.pyc
    .ipynb_checkpoints/
    
    # Data files
    *.csv
    *.dta
    data/
    
    # Environment
    venv/
    .env

Why It Matters

  • Never lose code again
  • Rollback to any historical version
  • Essential for team collaboration
  • Showcase your projects (academic GitHub)

Practical Scenario:

bash
# Scenario: Made changes but found errors, want to revert

# View history
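git log --oneline
# a1b2c3d Add regression analysis
# e4f5a6b Complete data cleaning
# c7d8e9f Initial commit

# Inspect the data cleaning version (detached HEAD; history is unchanged)
git checkout e4f5a6b

# Return to the latest commit on your branch
git checkout main

# To roll back permanently (discards later commits and uncommitted changes)
git reset --hard e4f5a6b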

Amateur vs Professional Code

| Dimension | Amateur Code | Professional Code |
|---|---|---|
| Naming | x, data1 | student_age, clean_survey_df |
| Comments | None or excessive | Moderate, explain "why" |
| Structure | Single file, no functions | Modular, functional |
| Error Handling | Let program crash | try-except graceful handling |
| Version Control | None | Git + GitHub |
| Testing | Manual testing | Automated tests |
| Documentation | None | README + Docstrings |
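
The "try-except graceful handling" entry in the comparison above can be sketched as follows. This is a minimal example using only the standard library; the function name load_rows and the file name are hypothetical, and whether to return an empty list or re-raise the error depends on what the calling code needs.

```python
import csv
import logging

logger = logging.getLogger(__name__)


def load_rows(path):
    """Return rows from a CSV file, or an empty list if the file is missing."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    except FileNotFoundError:
        # Expected failure: report it clearly instead of crashing with a traceback.
        logger.error("File not found: %s", path)
        return []


rows = load_rows("missing_survey_file.csv")  # hypothetical file name
print(f"Loaded {len(rows)} rows")
```

Catching only FileNotFoundError, rather than a bare except, keeps unexpected bugs visible while still handling the failure the script anticipates.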

How to Learn This Module?

Learning Path

Day 1 (2 hours): Code Style

  • Read 01 - Python Code Style
  • Install Black formatting tool
  • Refactor an old script to comply with PEP 8

Day 2 (3 hours): Debugging and Optimization

  • Read 02 - Debugging and Profiling
  • Learn profiling tools
  • Optimize a slow script

Days 3-4 (6 hours): Git Basics

  • Read 03 - Git Basics
  • Install and configure Git
  • Create GitHub account
  • Upload existing project to GitHub
  • Practice basic workflow

Total Time: 11 hours (1 week)

Minimal Learning Path

For individual data analysis, priorities are:

Must Learn (basic literacy, 3 hours):

  • 01 - Code Style (naming, comments, docstrings)
  • Basic debugging techniques

Important (team collaboration, 6 hours):

  • 03 - Git Basics (init, add, commit, push)
  • GitHub usage

Optional (advanced skills):

  • Performance optimization
  • Git branch management
  • Unit testing
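
For the optional unit testing item, a minimal sketch with the standard library's unittest module might look like this. It re-declares calculate_bmi from section 01 so the block runs on its own; the test class and method names are illustrative.

```python
import unittest


def calculate_bmi(weight, height):
    """Calculate body mass index (BMI)."""
    return weight / (height ** 2)


class TestCalculateBmi(unittest.TestCase):
    def test_known_value(self):
        # 70 kg at 1.75 m is about 22.857; compare to three decimal places.
        self.assertAlmostEqual(calculate_bmi(70, 1.75), 22.857, places=3)

    def test_zero_height_raises(self):
        # A height of zero leads to division by zero.
        with self.assertRaises(ZeroDivisionError):
            calculate_bmi(70, 0)


# Build and run the suite explicitly so this also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCalculateBmi)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(f"Ran {result.testsRun} tests, success: {result.wasSuccessful()}")
```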

Learning Recommendations

  1. Code Standards are Habits, Not Burdens

    • Will feel cumbersome at first
    • Use automatic formatting tools (Black)
    • You'll thank yourself in 6 months
  2. Git Has a Steep Learning Curve, But It's Worth It

    • First 2 hours are most painful
    • Simple once you master basic commands
    • 3 most important commands:
      bash
      git add .
      git commit -m "message"
      git push
  3. Start with Existing Projects

    • Don't wait for the "perfect moment"
    • Choose an existing script, upload to GitHub
    • Learn by doing
  4. Practice Project: Build Academic GitHub

    my-research/
    ├── README.md          # Project description
    ├── requirements.txt   # Dependencies
    ├── .gitignore         # Ignore files
    ├── data/             # Data folder (git ignore)
    ├── scripts/          # Analysis scripts
    │   ├── 01_data_cleaning.py
    │   ├── 02_descriptive_stats.py
    │   └── 03_regression.py
    ├── outputs/          # Output results
    │   ├── tables/
    │   └── figures/
    └── notebooks/        # Jupyter notebooks
        └── exploratory_analysis.ipynb
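
The layout above can be created by hand or scripted. The following sketch uses pathlib to create the same folders and empty placeholder files; the function name create_skeleton is hypothetical, and the demo writes into a temporary directory so nothing is left on disk.

```python
import tempfile
from pathlib import Path

# Folder and file names taken from the project layout shown above.
FOLDERS = ["data", "scripts", "outputs/tables", "outputs/figures", "notebooks"]
FILES = ["README.md", "requirements.txt", ".gitignore"]


def create_skeleton(root):
    """Create the project folders and empty top-level files under root."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root


# Demo in a temporary directory; pass a real path to create an actual project.
with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(Path(tmp) / "my-research")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

print(created)
```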

Common Questions

Q: Why follow code standards? My code works!
A:

  • In 6 months, you'll forget the logic
  • Standard code is like "writing a letter to your future self"
  • If collaborating with others, standards are foundational

Q: Git is too complex, can I skip it?
A:

  • You can get by with just the basics (add, commit, push)
  • Skipping it entirely is not recommended, because the benefits are large:
    • Never lose code again
    • Rollback to any version
    • GitHub is your "academic business card"

Q: Should I upload all code to GitHub?
A:

  • Upload: cleaned scripts, reproducible analyses
  • Don't upload: raw data (privacy), API keys, unfinished code
  • Use .gitignore to exclude sensitive files

Q: Is performance optimization important? My code is fast enough.
A:

  • Small data (< 100K rows): usually not important
  • Large data (> 1M rows): usually very important
  • Code that runs repeatedly: worth optimizing

Q: How to cite GitHub code in papers?
A:

Code and data available at:
https://github.com/username/project-name

Or use Zenodo for DOI:
DOI: 10.5281/zenodo.1234567

Next Steps

After completing this module, you'll have mastered:

  • Python code standards and best practices
  • Efficient debugging and performance optimization
  • Git version control and GitHub usage
  • Professional developer workflow

Congratulations! You've completed all 11 modules of the Python Fundamentals Tutorial!

Next, you can:

  1. Deepen Pandas and data analysis skills (practice on real projects)
  2. Learn statistical modeling (regression analysis, causal inference)
  3. Explore machine learning and LLM applications
  4. Contribute to open-source projects to sharpen your skills

From zero to data analyst—you've taken a solid first step! Keep going!


Released under the MIT License. Content © Author.