Module 7: File I/O

Essential Skills for Data Persistence — Reading and Writing Various Data File Formats


Module Overview

The first step in data analysis is reading data, and the final step is saving results. This module will teach you how to read and write various file formats in Python: text files, CSV, Excel, Stata (.dta), JSON, and more. Master file I/O, and you'll be able to seamlessly interface with any data source.

Important Note: This module covers the infrastructure of data analysis. While it may not be as exciting as regression analysis, it's an essential skill.


Learning Objectives

After completing this module, you will be able to:

  • Understand file paths and encoding issues
  • Read and write text files (.txt)
  • Process CSV files (the most common data format)
  • Read and write Excel files (.xlsx)
  • Seamlessly interface with Stata data (.dta)
  • Process JSON data (API and web data)
  • Batch process multiple files
  • Build data processing pipelines

Module Contents

01 - File I/O Basics

Core Question: How do you read and write text files?

Core Content:

  • Why do we need file operations?
    • Save data analysis results
    • Read external data
    • Generate reports and logs
    • Data backup
  • Basic file operations:
    • Write to file: open('file.txt', 'w')
    • Read from file: open('file.txt', 'r')
    • Append content: open('file.txt', 'a')
  • Context managers (with statement):
    python
    # Recommended (automatically closes file)
    with open('file.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Not recommended (requires manual closing)
    f = open('file.txt', 'r')
    content = f.read()
    f.close()
  • Path operations (pathlib):
    python
    from pathlib import Path
    
    data_dir = Path('data')
    raw_dir = data_dir / 'raw'  # Path concatenation
    raw_dir.mkdir(parents=True, exist_ok=True)  # Create directory
    
    for file in data_dir.glob('*.txt'):  # Iterate through files
        print(file)
  • Encoding issues:
    • Always pass encoding='utf-8' when reading or writing files that contain Chinese text
    • On Chinese-locale Windows the default encoding is GBK, which can produce garbled text when mixed with UTF-8 files (see the sketch after this list)
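
A minimal sketch of the three file modes and explicit encodings; notes.txt is a hypothetical file, and the GBK fallback is just one common way to cope with a file of unknown encoding:

python
from pathlib import Path

path = Path('notes.txt')  # hypothetical file for illustration

# 'w' truncates and writes, 'a' appends; both create the file if it doesn't exist
with open(path, 'w', encoding='utf-8') as f:
    f.write('first line\n')
with open(path, 'a', encoding='utf-8') as f:
    f.write('second line\n')

# Read with an explicit encoding; fall back to GBK if UTF-8 fails
try:
    content = path.read_text(encoding='utf-8')
except UnicodeDecodeError:
    content = path.read_text(encoding='gbk')
print(content)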

Practical Case:

python
# Case: Save survey results
respondents = [
    {'id': 1001, 'age': 25, 'income': 50000},
    {'id': 1002, 'age': 30, 'income': 75000}
]

with open('survey_results.txt', 'w', encoding='utf-8') as f:
    f.write("Survey Results\n")
    f.write("=" * 40 + "\n")
    for resp in respondents:
        line = f"ID:{resp['id']}, Age:{resp['age']}, Income:{resp['income']}\n"
        f.write(line)

02 - CSV and Excel File Processing

Core Question: How do you read and write tabular data?

Core Content:

  • CSV files (most commonly used):
    • Using the csv module (basic; a short sketch follows this list)
    • Using Pandas (recommended):
      python
      import pandas as pd
      
      # Read CSV
      df = pd.read_csv('survey.csv')
      
      # Write CSV (utf-8-sig prevents garbled text when the file is opened in Excel)
      df.to_csv('output.csv', index=False, encoding='utf-8-sig')
      
      # Common parameters
      df = pd.read_csv(
          'data.csv',
          sep=',',              # Separator
          header=0,             # First row is column names
          usecols=[0, 1, 3],   # Only read certain columns
          dtype={'age': int},   # Specify data types
          na_values=['NA', '']  # Missing value markers
      )
  • Excel files:
    • Install dependency: pip install openpyxl
    • Read a single worksheet:
      python
      df = pd.read_excel('survey.xlsx', sheet_name='Sheet1')
    • Read multiple worksheets:
      python
      all_sheets = pd.read_excel('survey.xlsx', sheet_name=None)
    • Write to multiple worksheets:
      python
      with pd.ExcelWriter('report.xlsx') as writer:
          df1.to_excel(writer, sheet_name='2023', index=False)
          df2.to_excel(writer, sheet_name='2024', index=False)
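
For comparison with Pandas, a minimal sketch of the standard-library csv module mentioned above, assuming a hypothetical survey.csv with columns id, age, and income:

python
import csv

# Read rows as dictionaries keyed by the header row
with open('survey.csv', 'r', encoding='utf-8', newline='') as f:
    rows = list(csv.DictReader(f))

# Write the rows back out; fieldnames fixes the column order
with open('survey_copy.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'age', 'income'])
    writer.writeheader()
    writer.writerows(rows)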

Practical Case:

python
# Case: Merge multiple CSV files
from pathlib import Path

data_dir = Path('surveys')
all_data = []

for csv_file in data_dir.glob('*.csv'):
    df = pd.read_csv(csv_file)
    df['source_file'] = csv_file.name  # Add source identifier
    all_data.append(df)

combined_df = pd.concat(all_data, ignore_index=True)
combined_df.to_csv('combined_survey.csv', index=False)

03 - Stata Data File Reading and Writing

Core Question: How do you seamlessly interface with Stata data?

Core Content:

  • Why do we need to read and write Stata files?
    • Many social science datasets are stored in .dta format
    • Preserve metadata like variable labels and value labels
    • Collaborate with Stata users
  • Reading Stata files:
    python
    import pandas as pd
    
    # Basic reading
    df = pd.read_stata('survey_data.dta')
    
    # Convert value labels (e.g., 1='Male', 2='Female') to categoricals
    # (convert_categoricals=True is the default)
    df = pd.read_stata('survey_data.dta', convert_categoricals=True)
  • Writing Stata files:
    python
    # Save in Stata 13 format (readable by Stata 13 and later)
    df.to_stata('output.dta', write_index=False, version=117)
    
    # Stata .dta version numbers:
    # 114 = Stata 10 and later
    # 117 = Stata 13 and later
    # 118 = Stata 14 and later (adds Unicode support)
    # 119 = Stata 15 and later (only needed for more than 32,767 variables)
  • Handling variable labels and value labels (a sketch for reading labels back follows this list):
    python
    # Add variable labels
    variable_labels = {
        'id': 'Respondent ID',
        'gender': 'Gender',
        'education': 'Education Level'
    }
    
    df.to_stata(
        'output.dta',
        write_index=False,
        variable_labels=variable_labels
    )
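
To inspect the labels stored in an existing .dta file, the StataReader returned by read_stata(iterator=True) exposes them directly; a minimal sketch, assuming the hypothetical survey_data.dta from above:

python
import pandas as pd

# iterator=True returns a StataReader instead of a DataFrame
with pd.read_stata('survey_data.dta', iterator=True) as reader:
    df = reader.read()                      # the data itself
    var_labels = reader.variable_labels()   # {variable name: label}
    val_labels = reader.value_labels()      # {label set name: {value: label}}

print(var_labels)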

Practical Case:

python
# Case: Stata to Python data flow
import pandas as pd
import numpy as np

# 1. Read Stata data
df = pd.read_stata('raw_survey.dta')
print(f"Original data: {len(df)} rows")

# 2. Data cleaning (Python)
df_clean = df[
    (df['age'] >= 18) &
    (df['age'] <= 100) &
    (df['income'] > 0)
].copy()

# 3. Generate new variables
df_clean['log_income'] = np.log(df_clean['income'])
df_clean['age_squared'] = df_clean['age'] ** 2

# 4. Save back to Stata format
df_clean.to_stata('clean_survey.dta', write_index=False)
print(f"After cleaning: {len(df_clean)} rows")

04 - JSON Data Processing

Core Question: How do you process web and API data?

Core Content:

  • What is JSON?
    • The standard format for modern web data
    • Structure similar to Python dictionaries
    • Widely used for APIs and configuration files
  • Basic operations:
    python
    import json
    
    # Python object → JSON string
    data = {'id': 1001, 'age': 30, 'income': 75000}
    json_str = json.dumps(data, indent=2, ensure_ascii=False)
    
    # JSON string → Python object
    json_str = '{"name": "Alice", "age": 25}'
    data = json.loads(json_str)
    
    # Read/write JSON files
    with open('data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    
    with open('data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
  • JSON and Pandas interconversion:
    python
    # Pandas → JSON
    df.to_json('data.json', orient='records', force_ascii=False, indent=2)
    
    # JSON → Pandas
    df = pd.read_json('data.json', orient='records')

Practical Case:

python
# Case: Fetch data from API
import requests
import pandas as pd

# 1. Get JSON data (raise_for_status aborts on HTTP error codes)
response = requests.get('https://api.example.com/data')
response.raise_for_status()
data = response.json()

# 2. Convert to DataFrame
df = pd.DataFrame(data['results'])

# 3. Save as CSV
df.to_csv('api_data.csv', index=False)
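
If the returned records are nested (dictionaries inside dictionaries), pd.json_normalize can flatten them before saving; a minimal sketch with made-up records:

python
import pandas as pd

# Hypothetical nested API records
records = [
    {'id': 1, 'profile': {'age': 25, 'city': 'Beijing'}},
    {'id': 2, 'profile': {'age': 30, 'city': 'Shanghai'}},
]

# Nested keys become dotted column names: profile.age, profile.city
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'profile.age', 'profile.city']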

File Format Comparison

| Format | Advantages | Disadvantages | Use Cases |
|--------|------------|---------------|-----------|
| TXT | Simple, universal | No structure | Logs, simple reports |
| CSV | Universal, lightweight, readable | No type information, no metadata | Data exchange, tabular data |
| Excel | Multiple worksheets, rich formatting | Large files, slow | Reports, for non-technical users |
| Stata (.dta) | Preserves labels, metadata | Only used in social sciences | Social science data, Stata collaboration |
| JSON | Structured, nested data | Large file size | API data, configuration files |
| Parquet | Efficient, good compression | Not human-readable | Big data, production environments |

Selection Guidelines

| Need | Recommended Format | Reason |
|------|--------------------|--------|
| Data exchange | CSV | Universal, supported by all tools |
| For advisor/colleagues | Excel | Good visualization, easy to edit |
| Stata collaboration | DTA | Preserves metadata |
| API data | JSON | Standard format |
| Big data (>1GB) | Parquet | Efficient, good compression |
| Intermediate results | CSV/Pickle | Fast, simple |
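
Parquet is recommended twice above but has no example in this module; a minimal sketch, assuming pyarrow is installed (pip install pyarrow):

python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002], 'income': [50000, 75000]})

# Columnar, compressed, and preserves dtypes; requires pyarrow or fastparquet
df.to_parquet('data.parquet', index=False)
df_back = pd.read_parquet('data.parquet')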

How to Study This Module?

Learning Path

Day 1 (2 hours): File I/O Basics

  • Read 01 - File I/O Basics
  • Practice reading and writing text files
  • Learn pathlib path operations

Day 2 (3 hours): CSV and Excel

  • Read 02 - CSV and Excel File Processing
  • Read and write CSV with Pandas
  • Practice merging multiple CSV files

Day 3 (2 hours): Stata Data

  • Read 03 - Stata Data File Reading and Writing
  • Read .dta files
  • Practice Stata ↔ Python data flow

Day 4 (2 hours): JSON Data

  • Read 04 - JSON Data Processing
  • Practice JSON reading and writing
  • JSON and Pandas interconversion

Total Time: 9 hours (1 week)

Minimal Learning Path

For social science students, priority order:

Must Learn (daily analysis, 5 hours):

  • 02 - CSV and Excel (most commonly used)
  • 03 - Stata Data Files (essential for social sciences)
  • Pandas reading/writing basics

Important (complete skillset, 4 hours):

  • 01 - File I/O Basics
  • 04 - JSON Data Processing

Optional (advanced topics):

  • Batch processing multiple files
  • Efficient formats like Parquet, HDF5
  • Database connections (SQL)

Study Recommendations

  1. Approach based on actual needs

    • Learn the method for whatever data format you have
    • CSV and Stata are the focus for social science students
    • JSON is only needed when using APIs
  2. Remember the core pattern

    python
    # Almost all formats follow this pattern:
    
    # Read
    data = pd.read_xxx('file.xxx')
    
    # Process
    data_clean = data[data['age'] > 18]
    
    # Save
    data_clean.to_xxx('output.xxx')
  3. Common errors and solutions

    • Garbled text: Specify encoding='utf-8' or encoding='utf-8-sig'
    • File not found: Check path, use absolute path or Path()
    • Missing values: Specify na_values=['NA', '', '-999']
    • Data types: Use dtype={'col': int} to specify type
  4. Practice project: create a data processing pipeline:

    python
    # data_pipeline.py
    
    import pandas as pd
    from pathlib import Path
    
    def process_survey(input_file, output_file):
        """Process survey data"""
        # 1. Read
        df = pd.read_stata(input_file)
        print(f"Original data: {len(df)} rows")
    
        # 2. Clean
        df_clean = df[
            (df['age'] >= 18) &
            (df['age'] <= 100) &
            (df['income'] > 0)
        ]
        print(f"After cleaning: {len(df_clean)} rows")
    
        # 3. Save
        df_clean.to_csv(output_file, index=False, encoding='utf-8-sig')
        print(f"Saved to: {output_file}")
    
    # Usage
    process_survey('raw_survey.dta', 'clean_survey.csv')

Common Questions

Q: Why is Chinese text garbled when reading CSV?
A: Encoding issue. Try:

python
# Method 1: UTF-8
df = pd.read_csv('data.csv', encoding='utf-8')

# Method 2: GBK (Chinese-locale Windows)
df = pd.read_csv('data.csv', encoding='gbk')

# Method 3: UTF-8 with BOM (common for files exported from Excel)
df = pd.read_csv('data.csv', encoding='utf-8-sig')

Q: Why is the saved CSV garbled when opened in Excel?
A: By default Excel opens CSV files with the system's legacy encoding (GBK on Chinese Windows) instead of UTF-8. Solution:

python
# Use utf-8-sig when saving (Excel-compatible UTF-8)
df.to_csv('output.csv', index=False, encoding='utf-8-sig')

Q: How to process large files (>1GB)?
A: Read in chunks:

python
import pandas as pd

# Process 10,000 rows at a time and append each cleaned chunk to one output file
first = True
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    processed = chunk[chunk['age'] > 18]
    processed.to_csv('filtered.csv', mode='w' if first else 'a',
                     header=first, index=False)
    first = False

Q: Stata or CSV, which is better?
A:

  • Stata (.dta): Preserves variable labels and value labels, suitable for social science data
  • CSV: Universal format, supported by all tools, but loses metadata
  • Recommendation: Use Stata for raw data, CSV for intermediate results

Q: Why recommend Pandas over the csv module?
A:

  • Pandas is more powerful: automatic type inference, missing-value handling, rich data operations
  • The csv module is lower-level and requires handling many details manually
  • For data analysis, Pandas is the first choice

Q: How to batch process multiple files?
A:

python
import pandas as pd
from pathlib import Path

data_dir = Path('surveys')
for file in data_dir.glob('*.csv'):
    df = pd.read_csv(file)
    # Process...
    df.to_csv(f'processed_{file.name}', index=False)

Next Steps

After completing this module, you will have mastered:

  • Reading and writing text, CSV, Excel, Stata, JSON files
  • Handling encoding and path issues
  • Batch processing multiple files
  • Building data processing pipelines

In Module 8, we will learn error handling and debugging to make code more robust and reliable.

In Module 9, we will dive deep into core data science libraries like NumPy, Pandas, and Matplotlib.

Master file I/O, and you can handle any data source! Keep going!

