Module 7: File I/O
Essential Skills for Data Persistence — Reading and Writing Various Data File Formats
Module Overview
The first step in data analysis is reading data, and the final step is saving results. This module will teach you how to read and write various file formats in Python: text files, CSV, Excel, Stata (.dta), JSON, and more. Master file I/O, and you'll be able to seamlessly interface with any data source.
Important Note: This module covers the infrastructure of data analysis. While it may not be as exciting as regression analysis, it's an essential skill.
Learning Objectives
After completing this module, you will be able to:
- Understand file paths and encoding issues
- Read and write text files (`.txt`)
- Process CSV files (the most common data format)
- Read and write Excel files (`.xlsx`)
- Seamlessly interface with Stata data (`.dta`)
- Process JSON data (API and web data)
- Batch process multiple files
- Build data processing pipelines
Module Contents
01 - File I/O Basics
Core Question: How do you read and write text files?
Core Content:
- Why do we need file operations?
  - Save data analysis results
  - Read external data
  - Generate reports and logs
  - Data backup
- Basic file operations:
  - Write to file: `open('file.txt', 'w')`
  - Read from file: `open('file.txt', 'r')`
  - Append content: `open('file.txt', 'a')`
- Context managers (`with` statement):

```python
# Recommended (automatically closes the file)
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Not recommended (requires manual closing)
f = open('file.txt', 'r')
content = f.read()
f.close()
```

- Path operations (`pathlib`):

```python
from pathlib import Path

data_dir = Path('data')
raw_dir = data_dir / 'raw'                  # Path concatenation
raw_dir.mkdir(parents=True, exist_ok=True)  # Create directory

for file in data_dir.glob('*.txt'):         # Iterate through files
    print(file)
```

- Encoding issues (see the sketch below):
  - Chinese files must specify `encoding='utf-8'`
  - The Windows default encoding is `gbk`, which may cause garbled text
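When you are not sure which encoding a file uses, a pragmatic approach is to try `utf-8` first and fall back to `gbk`. A minimal sketch (the file name and helper function are hypothetical):

```python
def read_text(path):
    """Try UTF-8 first, then fall back to GBK (common on Chinese Windows)."""
    for enc in ('utf-8', 'gbk'):
        try:
            with open(path, 'r', encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # Wrong encoding; try the next one
    raise ValueError(f"Could not decode {path} with utf-8 or gbk")

content = read_text('survey_notes.txt')
```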
Practical Case:
```python
# Case: Save survey results
respondents = [
    {'id': 1001, 'age': 25, 'income': 50000},
    {'id': 1002, 'age': 30, 'income': 75000}
]

with open('survey_results.txt', 'w', encoding='utf-8') as f:
    f.write("Survey Results\n")
    f.write("=" * 40 + "\n")
    for resp in respondents:
        line = f"ID:{resp['id']}, Age:{resp['age']}, Income:{resp['income']}\n"
        f.write(line)
```

02 - CSV and Excel File Processing
Core Question: How do you read and write tabular data?
Core Content:
- CSV files (most commonly used):
  - Using the `csv` module (basic; see the sketch after this list)
  - Using Pandas (recommended):

```python
import pandas as pd

# Read CSV
df = pd.read_csv('survey.csv')

# Write CSV (avoid Excel garbled text)
df.to_csv('output.csv', index=False, encoding='utf-8-sig')

# Common parameters
df = pd.read_csv(
    'data.csv',
    sep=',',               # Separator
    header=0,              # First row is column names
    usecols=[0, 1, 3],     # Only read certain columns
    dtype={'age': int},    # Specify data types
    na_values=['NA', '']   # Missing value markers
)
```

- Excel files:
  - Install the dependency: `pip install openpyxl`
  - Read a single worksheet:

```python
df = pd.read_excel('survey.xlsx', sheet_name='Sheet1')
```

  - Read multiple worksheets:

```python
all_sheets = pd.read_excel('survey.xlsx', sheet_name=None)  # dict of DataFrames
```

  - Write to multiple worksheets:

```python
with pd.ExcelWriter('report.xlsx') as writer:
    df1.to_excel(writer, sheet_name='2023', index=False)
    df2.to_excel(writer, sheet_name='2024', index=False)
```
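As referenced above, the standard-library `csv` module works without any third-party dependencies; a minimal sketch (the file name and column names are hypothetical):

```python
import csv

# Read rows as dictionaries keyed by the header row
with open('survey.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    rows = [row for row in reader]  # Every value is a string; convert types yourself

# Write rows back out
with open('output.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'age', 'income'])
    writer.writeheader()
    writer.writerows(rows)
```

Unlike Pandas, the `csv` module returns everything as strings and does no missing-value handling, which is why Pandas is the recommended choice for analysis work.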
Practical Case:
```python
# Case: Merge multiple CSV files
import pandas as pd
from pathlib import Path

data_dir = Path('surveys')
all_data = []

for csv_file in data_dir.glob('*.csv'):
    df = pd.read_csv(csv_file)
    df['source_file'] = csv_file.name  # Add source identifier
    all_data.append(df)

combined_df = pd.concat(all_data, ignore_index=True)
combined_df.to_csv('combined_survey.csv', index=False)
```

03 - Stata Data File Reading and Writing
Core Question: How do you seamlessly interface with Stata data?
Core Content:
- Why do we need to read and write Stata files?
  - Many social science datasets are stored in `.dta` format
  - Preserve metadata like variable labels and value labels
  - Collaborate with Stata users
- Reading Stata files:

```python
import pandas as pd

# Basic reading
df = pd.read_stata('survey_data.dta')

# Preserve value labels (e.g., 1='Male', 2='Female')
df = pd.read_stata('survey_data.dta', convert_categoricals=True)
```

- Writing Stata files:

```python
# Save as Stata 13 format
df.to_stata('output.dta', write_index=False, version=117)

# Stata .dta format versions:
# 117 = Stata 13
# 118 = Stata 14 and later
# 119 = Stata 15/16 (only needed for more than 32,767 variables)
```

- Handling variable labels and value labels:

```python
# Add variable labels
variable_labels = {
    'id': 'Respondent ID',
    'gender': 'Gender',
    'education': 'Education Level'
}
df.to_stata(
    'output.dta',
    write_index=False,
    variable_labels=variable_labels
)
```
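Since preserving metadata is the main reason to use `.dta`, it also helps to know how to inspect it when reading. A minimal sketch using pandas' `StataReader` (file name and example labels are hypothetical):

```python
import pandas as pd

# iterator=True returns a StataReader, which exposes the metadata
with pd.read_stata('survey_data.dta', iterator=True) as reader:
    var_labels = reader.variable_labels()  # e.g., {'id': 'Respondent ID', ...}
    val_labels = reader.value_labels()     # e.g., {'gender': {1: 'Male', 2: 'Female'}}
    df = reader.read()                     # Then read the actual data
```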
Practical Case:
```python
# Case: Stata to Python data flow
import pandas as pd
import numpy as np

# 1. Read Stata data
df = pd.read_stata('raw_survey.dta')
print(f"Original data: {len(df)} rows")

# 2. Data cleaning (Python)
df_clean = df[
    (df['age'] >= 18) &
    (df['age'] <= 100) &
    (df['income'] > 0)
].copy()

# 3. Generate new variables
df_clean['log_income'] = np.log(df_clean['income'])
df_clean['age_squared'] = df_clean['age'] ** 2

# 4. Save back to Stata format
df_clean.to_stata('clean_survey.dta', write_index=False)
print(f"After cleaning: {len(df_clean)} rows")
```

04 - JSON Data Processing
Core Question: How do you process web and API data?
Core Content:
- What is JSON?
  - The standard format for modern web data
  - Structure similar to Python dictionaries
  - Widely used for APIs and configuration files
- Basic operations:

```python
import json

# Python object → JSON string
data = {'id': 1001, 'age': 30, 'income': 75000}
json_str = json.dumps(data, indent=2, ensure_ascii=False)

# JSON string → Python object
json_str = '{"name": "Alice", "age": 25}'
data = json.loads(json_str)

# Write a JSON file
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Read a JSON file
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
```

- JSON and Pandas interconversion:

```python
# Pandas → JSON
df.to_json('data.json', orient='records', force_ascii=False, indent=2)

# JSON → Pandas
df = pd.read_json('data.json', orient='records')
```
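Real API responses are often nested, which plain `pd.read_json` handles poorly. A minimal sketch using `pd.json_normalize` (the data is invented for illustration):

```python
import pandas as pd

# Nested records, as an API might return them
records = [
    {'id': 1001, 'profile': {'age': 25, 'city': 'Beijing'}},
    {'id': 1002, 'profile': {'age': 30, 'city': 'Shanghai'}}
]

# Flatten nested fields into dotted column names
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'profile.age', 'profile.city']
```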
Practical Case:
```python
# Case: Fetch data from API
import requests
import pandas as pd

# 1. Get JSON data
response = requests.get('https://api.example.com/data')
data = response.json()

# 2. Convert to DataFrame
df = pd.DataFrame(data['results'])

# 3. Save as CSV
df.to_csv('api_data.csv', index=False)
```

File Format Comparison
| Format | Advantages | Disadvantages | Use Cases |
|---|---|---|---|
| TXT | Simple, universal | No structure | Logs, simple reports |
| CSV | Universal, lightweight, readable | No type information, no metadata | Data exchange, tabular data |
| Excel | Multiple worksheets, rich formatting | Large files, slow | Reports, for non-technical users |
| Stata (.dta) | Preserves labels, metadata | Limited support outside social science tools | Social science data, Stata collaboration |
| JSON | Structured, nested data | Large file size | API data, configuration files |
| Parquet | Efficient, good compression | Not human-readable | Big data, production environments |
Selection Guidelines
| Need | Recommended Format | Reason |
|---|---|---|
| Data exchange | CSV | Universal, supported by all tools |
| For advisor/colleagues | Excel | Good visualization, easy to edit |
| Stata collaboration | DTA | Preserves metadata |
| API data | JSON | Standard format |
| Big data (>1GB) | Parquet | Efficient, good compression |
| Intermediate results | CSV/Pickle | Fast, simple |
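Parquet and Pickle appear in the tables above but have no example elsewhere in this module; a minimal sketch (Parquet assumes `pip install pyarrow`):

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002], 'income': [50000, 75000]})

# Parquet: compressed and typed — good for big data
df.to_parquet('data.parquet', index=False)
df2 = pd.read_parquet('data.parquet')

# Pickle: preserves the exact DataFrame (dtypes, index) — Python-only intermediate results
df.to_pickle('data.pkl')
df3 = pd.read_pickle('data.pkl')
```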
How to Study This Module?
Learning Path
Day 1 (2 hours): File I/O Basics
- Read 01 - File I/O Basics
- Practice reading and writing text files
- Learn pathlib path operations
Day 2 (3 hours): CSV and Excel
- Read 02 - CSV and Excel File Processing
- Read and write CSV with Pandas
- Practice merging multiple CSV files
Day 3 (2 hours): Stata Data
- Read 03 - Stata Data File Reading and Writing
- Read `.dta` files
- Practice Stata ↔ Python data flow
Day 4 (2 hours): JSON Data
- Read 04 - JSON Data Processing
- Practice JSON reading and writing
- JSON and Pandas interconversion
Total Time: 9 hours (1 week)
Minimal Learning Path
For social science students, priority order:
Must Learn (daily analysis, 5 hours):
- 02 - CSV and Excel (most commonly used)
- 03 - Stata Data Files (essential for social sciences)
- Pandas reading/writing basics
Important (complete skillset, 4 hours):
- 01 - File I/O Basics
- 04 - JSON Data Processing
Optional (advanced topics):
- Batch processing multiple files
- Efficient formats like Parquet, HDF5
- Database connections (SQL)
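For the database topic, pandas can read query results directly; a minimal sketch using the standard-library `sqlite3` (database file and table name are hypothetical):

```python
import sqlite3
import pandas as pd

# Connect to a local SQLite database
conn = sqlite3.connect('survey.db')

# Run a SQL query and get the result as a DataFrame
df = pd.read_sql('SELECT id, age, income FROM respondents WHERE age >= 18', conn)

# Write a DataFrame back to a database table
df.to_sql('clean_respondents', conn, if_exists='replace', index=False)
conn.close()
```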
Study Recommendations
Approach based on actual needs
- Learn the method for whatever data format you have
- CSV and Stata are the focus for social science students
- JSON is only needed when using APIs
Remember the core pattern
```python
# Almost all formats follow this pattern:
# Read
data = pd.read_xxx('file.xxx')
# Process
data_clean = data[data['age'] > 18]
# Save
data_clean.to_xxx('output.xxx')
```

Common errors and solutions
- Garbled text: specify `encoding='utf-8'` or `encoding='utf-8-sig'`
- File not found: check the path; use an absolute path or `Path()`
- Missing values: specify `na_values=['NA', '', '-999']`
- Data types: use `dtype={'col': int}` to specify the type
Practice project: Create a data processing pipeline:

```python
# data_pipeline.py
import pandas as pd

def process_survey(input_file, output_file):
    """Process survey data"""
    # 1. Read
    df = pd.read_stata(input_file)
    print(f"Original data: {len(df)} rows")

    # 2. Clean
    df_clean = df[
        (df['age'] >= 18) &
        (df['age'] <= 100) &
        (df['income'] > 0)
    ]
    print(f"After cleaning: {len(df_clean)} rows")

    # 3. Save
    df_clean.to_csv(output_file, index=False, encoding='utf-8-sig')
    print(f"Saved to: {output_file}")

# Usage
process_survey('raw_survey.dta', 'clean_survey.csv')
```
Common Questions
Q: Why is Chinese text garbled when reading CSV? A: Encoding issue. Try:
```python
# Method 1: UTF-8
df = pd.read_csv('data.csv', encoding='utf-8')

# Method 2: GBK (Windows Chinese)
df = pd.read_csv('data.csv', encoding='gbk')

# Method 3: UTF-8 with BOM (handles files exported by Excel)
df = pd.read_csv('data.csv', encoding='utf-8-sig')
```

Q: Why is the saved CSV garbled when opened in Excel? A: By default, Excel opens CSV files with the system's local encoding (GBK on Chinese Windows). Solution:
```python
# Use utf-8-sig when saving (Excel-compatible UTF-8 with BOM)
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
```

Q: How do you process large files (>1GB)? A: Read in chunks:
```python
# Read in chunks and process each piece
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    processed = chunk[chunk['age'] > 18]  # Process each chunk
    chunks.append(processed)              # Save or further process

result = pd.concat(chunks, ignore_index=True)
```

Q: Stata or CSV, which is better? A:
- Stata (.dta): Preserves variable labels and value labels, suitable for social science data
- CSV: Universal format, supported by all tools, but loses metadata
- Recommendation: Use Stata for raw data, CSV for intermediate results
Q: Why recommend Pandas over the csv module? A:
- Pandas is more powerful: automatic type inference, missing value handling, data operations
- csv module is more low-level: requires manually handling many details
- For data analysis, Pandas is the first choice
Q: How to batch process multiple files? A:
from pathlib import Path
data_dir = Path('surveys')
for file in data_dir.glob('*.csv'):
df = pd.read_csv(file)
# Process...
df.to_csv(f'processed_{file.name}', index=False)Next Steps
After completing this module, you will have mastered:
- Reading and writing text, CSV, Excel, Stata, JSON files
- Handling encoding and path issues
- Batch processing multiple files
- Building data processing pipelines
In Module 8, we will learn error handling and debugging to make code more robust and reliable.
In Module 9, we will dive deep into core data science libraries like NumPy, Pandas, and Matplotlib.
Master file I/O, and you can handle any data source! Keep going!