Module 7: File I/O

Essential Skills for Data Persistence — Reading and Writing Various Data File Formats


Module Overview

The first step in data analysis is reading data, and the final step is saving results. This module will teach you how to read and write various file formats in Python: text files, CSV, Excel, Stata (.dta), JSON, and more. Master file I/O, and you'll be able to seamlessly interface with any data source.

Important Note: This module covers the infrastructure of data analysis. While it may not be as exciting as regression analysis, it's an essential skill.


Learning Objectives

After completing this module, you will be able to:

  • Understand file paths and encoding issues
  • Read and write text files (.txt)
  • Process CSV files (the most common data format)
  • Read and write Excel files (.xlsx)
  • Seamlessly interface with Stata data (.dta)
  • Process JSON data (API and web data)
  • Batch process multiple files
  • Build data processing pipelines

Module Contents

01 - File I/O Basics

Core Question: How do you read and write text files?

Core Content:

  • Why do we need file operations?
    • Save data analysis results
    • Read external data
    • Generate reports and logs
    • Data backup
  • Basic file operations:
    • Write to file: open('file.txt', 'w')
    • Read from file: open('file.txt', 'r')
    • Append content: open('file.txt', 'a')
  • Context managers (with statement):
    python
    # Recommended (automatically closes file)
    with open('file.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Not recommended (requires manual closing)
    f = open('file.txt', 'r')
    content = f.read()
    f.close()
  • Path operations (pathlib):
    python
    from pathlib import Path
    
    data_dir = Path('data')
    raw_dir = data_dir / 'raw'  # Path concatenation
    raw_dir.mkdir(parents=True, exist_ok=True)  # Create directory
    
    for file in data_dir.glob('*.txt'):  # Iterate through files
        print(file)
  • Encoding issues:
    • Always pass encoding='utf-8' when reading or writing files that contain Chinese text
    • On Chinese-locale Windows the default encoding is GBK, which can produce garbled text when mixed with UTF-8 files (see the sketch after this list)
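
A minimal sketch of the three file modes and explicit encodings; notes.txt is a hypothetical file, and the GBK fallback is just one common way to cope with a file of unknown encoding:

python
from pathlib import Path

path = Path('notes.txt')  # hypothetical file for illustration

# 'w' truncates and writes, 'a' appends; both create the file if it doesn't exist
with open(path, 'w', encoding='utf-8') as f:
    f.write('first line\n')
with open(path, 'a', encoding='utf-8') as f:
    f.write('second line\n')

# Read with an explicit encoding; fall back to GBK if UTF-8 fails
try:
    content = path.read_text(encoding='utf-8')
except UnicodeDecodeError:
    content = path.read_text(encoding='gbk')
print(content)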

Practical Case:

python
# Case: Save survey results
respondents = [
    {'id': 1001, 'age': 25, 'income': 50000},
    {'id': 1002, 'age': 30, 'income': 75000}
]

with open('survey_results.txt', 'w', encoding='utf-8') as f:
    f.write("Survey Results\n")
    f.write("=" * 40 + "\n")
    for resp in respondents:
        line = f"ID:{resp['id']}, Age:{resp['age']}, Income:{resp['income']}\n"
        f.write(line)

02 - CSV and Excel File Processing

Core Question: How do you read and write tabular data?

Core Content:

  • CSV files (most commonly used):
    • Using the csv module (basic; a short sketch follows this list)
    • Using Pandas (recommended):
      python
      import pandas as pd
      
      # Read CSV
      df = pd.read_csv('survey.csv')
      
      # Write CSV (utf-8-sig prevents garbled text when the file is opened in Excel)
      df.to_csv('output.csv', index=False, encoding='utf-8-sig')
      
      # Common parameters
      df = pd.read_csv(
          'data.csv',
          sep=',',              # Separator
          header=0,             # First row is column names
          usecols=[0, 1, 3],   # Only read certain columns
          dtype={'age': int},   # Specify data types
          na_values=['NA', '']  # Missing value markers
      )
  • Excel files:
    • Install dependency: pip install openpyxl
    • Read a single worksheet:
      python
      df = pd.read_excel('survey.xlsx', sheet_name='Sheet1')
    • Read multiple worksheets:
      python
      all_sheets = pd.read_excel('survey.xlsx', sheet_name=None)
    • Write to multiple worksheets:
      python
      with pd.ExcelWriter('report.xlsx') as writer:
          df1.to_excel(writer, sheet_name='2023', index=False)
          df2.to_excel(writer, sheet_name='2024', index=False)
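
For comparison with Pandas, a minimal sketch of the standard-library csv module mentioned above, assuming a hypothetical survey.csv with columns id, age, and income:

python
import csv

# Read rows as dictionaries keyed by the header row
with open('survey.csv', 'r', encoding='utf-8', newline='') as f:
    rows = list(csv.DictReader(f))

# Write the rows back out; fieldnames fixes the column order
with open('survey_copy.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'age', 'income'])
    writer.writeheader()
    writer.writerows(rows)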

Practical Case:

python
# Case: Merge multiple CSV files
from pathlib import Path

data_dir = Path('surveys')
all_data = []

for csv_file in data_dir.glob('*.csv'):
    df = pd.read_csv(csv_file)
    df['source_file'] = csv_file.name  # Add source identifier
    all_data.append(df)

combined_df = pd.concat(all_data, ignore_index=True)
combined_df.to_csv('combined_survey.csv', index=False)

03 - Stata Data File Reading and Writing

Core Question: How do you seamlessly interface with Stata data?

Core Content:

  • Why do we need to read and write Stata files?
    • Many social science datasets are stored in .dta format
    • Preserve metadata like variable labels and value labels
    • Collaborate with Stata users
  • Reading Stata files:
    python
    import pandas as pd
    
    # Basic reading
    df = pd.read_stata('survey_data.dta')
    
    # Convert value labels (e.g., 1='Male', 2='Female') to categoricals
    # (convert_categoricals=True is the default)
    df = pd.read_stata('survey_data.dta', convert_categoricals=True)
  • Writing Stata files:
    python
    # Save in Stata 13 format (readable by Stata 13 and later)
    df.to_stata('output.dta', write_index=False, version=117)
    
    # Stata .dta version numbers:
    # 114 = Stata 10 and later
    # 117 = Stata 13 and later
    # 118 = Stata 14 and later (adds Unicode support)
    # 119 = Stata 15 and later (only needed for more than 32,767 variables)
  • Handling variable labels and value labels (a sketch for reading labels back follows this list):
    python
    # Add variable labels
    variable_labels = {
        'id': 'Respondent ID',
        'gender': 'Gender',
        'education': 'Education Level'
    }
    
    df.to_stata(
        'output.dta',
        write_index=False,
        variable_labels=variable_labels
    )
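
To inspect the labels stored in an existing .dta file, the StataReader returned by read_stata(iterator=True) exposes them directly; a minimal sketch, assuming the hypothetical survey_data.dta from above:

python
import pandas as pd

# iterator=True returns a StataReader instead of a DataFrame
with pd.read_stata('survey_data.dta', iterator=True) as reader:
    df = reader.read()                      # the data itself
    var_labels = reader.variable_labels()   # {variable name: label}
    val_labels = reader.value_labels()      # {label set name: {value: label}}

print(var_labels)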

Practical Case:

python
# Case: Stata to Python data flow
import pandas as pd
import numpy as np

# 1. Read Stata data
df = pd.read_stata('raw_survey.dta')
print(f"Original data: {len(df)} rows")

# 2. Data cleaning (Python)
df_clean = df[
    (df['age'] >= 18) &
    (df['age'] <= 100) &
    (df['income'] > 0)
].copy()

# 3. Generate new variables
df_clean['log_income'] = np.log(df_clean['income'])
df_clean['age_squared'] = df_clean['age'] ** 2

# 4. Save back to Stata format
df_clean.to_stata('clean_survey.dta', write_index=False)
print(f"After cleaning: {len(df_clean)} rows")

04 - JSON Data Processing

Core Question: How do you process web and API data?

Core Content:

  • What is JSON?
    • The standard format for modern web data
    • Structure similar to Python dictionaries
    • Widely used for APIs and configuration files
  • Basic operations:
    python
    import json
    
    # Python object → JSON string
    data = {'id': 1001, 'age': 30, 'income': 75000}
    json_str = json.dumps(data, indent=2, ensure_ascii=False)
    
    # JSON string → Python object
    json_str = '{"name": "Alice", "age": 25}'
    data = json.loads(json_str)
    
    # Read/write JSON files
    with open('data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    
    with open('data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
  • JSON and Pandas interconversion:
    python
    # Pandas → JSON
    df.to_json('data.json', orient='records', force_ascii=False, indent=2)
    
    # JSON → Pandas
    df = pd.read_json('data.json', orient='records')

Practical Case:

python
# Case: Fetch data from API
import requests
import pandas as pd

# 1. Get JSON data (raise_for_status aborts on HTTP error codes)
response = requests.get('https://api.example.com/data')
response.raise_for_status()
data = response.json()

# 2. Convert to DataFrame
df = pd.DataFrame(data['results'])

# 3. Save as CSV
df.to_csv('api_data.csv', index=False)
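
If the returned records are nested (dictionaries inside dictionaries), pd.json_normalize can flatten them before saving; a minimal sketch with made-up records:

python
import pandas as pd

# Hypothetical nested API records
records = [
    {'id': 1, 'profile': {'age': 25, 'city': 'Beijing'}},
    {'id': 2, 'profile': {'age': 30, 'city': 'Shanghai'}},
]

# Nested keys become dotted column names: profile.age, profile.city
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'profile.age', 'profile.city']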

File Format Comparison

| Format | Advantages | Disadvantages | Use Cases |
|--------|------------|---------------|-----------|
| TXT | Simple, universal | No structure | Logs, simple reports |
| CSV | Universal, lightweight, readable | No type information, no metadata | Data exchange, tabular data |
| Excel | Multiple worksheets, rich formatting | Large files, slow | Reports, for non-technical users |
| Stata (.dta) | Preserves labels, metadata | Only used in social sciences | Social science data, Stata collaboration |
| JSON | Structured, nested data | Large file size | API data, configuration files |
| Parquet | Efficient, good compression | Not human-readable | Big data, production environments |

Selection Guidelines

| Need | Recommended Format | Reason |
|------|--------------------|--------|
| Data exchange | CSV | Universal, supported by all tools |
| For advisor/colleagues | Excel | Good visualization, easy to edit |
| Stata collaboration | DTA | Preserves metadata |
| API data | JSON | Standard format |
| Big data (>1GB) | Parquet | Efficient, good compression |
| Intermediate results | CSV/Pickle | Fast, simple |
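
Parquet is recommended twice above but has no example in this module; a minimal sketch, assuming pyarrow is installed (pip install pyarrow):

python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002], 'income': [50000, 75000]})

# Columnar, compressed, and preserves dtypes; requires pyarrow or fastparquet
df.to_parquet('data.parquet', index=False)
df_back = pd.read_parquet('data.parquet')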

How to Study This Module?

Learning Path

Day 1 (2 hours): File I/O Basics

  • Read 01 - File I/O Basics
  • Practice reading and writing text files
  • Learn pathlib path operations

Day 2 (3 hours): CSV and Excel

  • Read 02 - CSV and Excel File Processing
  • Read and write CSV with Pandas
  • Practice merging multiple CSV files

Day 3 (2 hours): Stata Data

  • Read 03 - Stata Data File Reading and Writing
  • Read .dta files
  • Practice Stata ↔ Python data flow

Day 4 (2 hours): JSON Data

  • Read 04 - JSON Data Processing
  • Practice JSON reading and writing
  • JSON and Pandas interconversion

Total Time: 9 hours (1 week)

Minimal Learning Path

For social science students, priority order:

Must Learn (daily analysis, 5 hours):

  • 02 - CSV and Excel (most commonly used)
  • 03 - Stata Data Files (essential for social sciences)
  • Pandas reading/writing basics

Important (complete skillset, 4 hours):

  • 01 - File I/O Basics
  • 04 - JSON Data Processing

Optional (advanced topics):

  • Batch processing multiple files
  • Efficient formats like Parquet, HDF5
  • Database connections (SQL)

Study Recommendations

  1. Approach based on actual needs

    • Learn the method for whatever data format you have
    • CSV and Stata are the focus for social science students
    • JSON is only needed when using APIs
  2. Remember the core pattern

    python
    # Almost all formats follow this pattern:
    
    # Read
    data = pd.read_xxx('file.xxx')
    
    # Process
    data_clean = data[data['age'] > 18]
    
    # Save
    data_clean.to_xxx('output.xxx')
  3. Common errors and solutions

    • Garbled text: Specify encoding='utf-8' or encoding='utf-8-sig'
    • File not found: Check path, use absolute path or Path()
    • Missing values: Specify na_values=['NA', '', '-999']
    • Data types: Use dtype={'col': int} to specify type
  4. Practice project: create a data processing pipeline:

    python
    # data_pipeline.py
    
    import pandas as pd
    from pathlib import Path
    
    def process_survey(input_file, output_file):
        """Process survey data"""
        # 1. Read
        df = pd.read_stata(input_file)
        print(f"Original data: {len(df)} rows")
    
        # 2. Clean
        df_clean = df[
            (df['age'] >= 18) &
            (df['age'] <= 100) &
            (df['income'] > 0)
        ]
        print(f"After cleaning: {len(df_clean)} rows")
    
        # 3. Save
        df_clean.to_csv(output_file, index=False, encoding='utf-8-sig')
        print(f"Saved to: {output_file}")
    
    # Usage
    process_survey('raw_survey.dta', 'clean_survey.csv')

Common Questions

Q: Why is Chinese text garbled when reading CSV?
A: Encoding issue. Try:

python
# Method 1: UTF-8
df = pd.read_csv('data.csv', encoding='utf-8')

# Method 2: GBK (Chinese-locale Windows)
df = pd.read_csv('data.csv', encoding='gbk')

# Method 3: UTF-8 with BOM (common for files exported from Excel)
df = pd.read_csv('data.csv', encoding='utf-8-sig')

Q: Why is the saved CSV garbled when opened in Excel?
A: By default Excel opens CSV files with the system's legacy encoding (GBK on Chinese Windows) instead of UTF-8. Solution:

python
# Use utf-8-sig when saving (Excel-compatible UTF-8)
df.to_csv('output.csv', index=False, encoding='utf-8-sig')

Q: How to process large files (>1GB)?
A: Read in chunks:

python
import pandas as pd

# Process 10,000 rows at a time and append each cleaned chunk to one output file
first = True
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    processed = chunk[chunk['age'] > 18]
    processed.to_csv('filtered.csv', mode='w' if first else 'a',
                     header=first, index=False)
    first = False

Q: Stata or CSV, which is better?
A:

  • Stata (.dta): Preserves variable labels and value labels, suitable for social science data
  • CSV: Universal format, supported by all tools, but loses metadata
  • Recommendation: Use Stata for raw data, CSV for intermediate results

Q: Why recommend Pandas over the csv module?
A:

  • Pandas is more powerful: automatic type inference, missing-value handling, rich data operations
  • The csv module is lower-level and requires handling many details manually
  • For data analysis, Pandas is the first choice

Q: How to batch process multiple files?
A:

python
import pandas as pd
from pathlib import Path

data_dir = Path('surveys')
for file in data_dir.glob('*.csv'):
    df = pd.read_csv(file)
    # Process...
    df.to_csv(f'processed_{file.name}', index=False)

Next Steps

After completing this module, you will have mastered:

  • Reading and writing text, CSV, Excel, Stata, JSON files
  • Handling encoding and path issues
  • Batch processing multiple files
  • Building data processing pipelines

In Module 8, we will learn error handling and debugging to make code more robust and reliable.

In Module 9, we will dive deep into core data science libraries like NumPy, Pandas, and Matplotlib.

Master file I/O, and you can handle any data source! Keep going!

