CSV and Excel File Processing
Common Formats for Data Analysis — Reading and Writing Tabular Data
CSV File Operations
Using the csv Module
python
import csv
# Write CSV
data = [
['Name', 'Age', 'Income'],
['Alice', 25, 50000],
['Bob', 30, 75000],
['Carol', 35, 85000]
]
with open('survey.csv', 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerows(data)
# Read CSV
with open('survey.csv', 'r', encoding='utf-8') as f:
reader = csv.reader(f)
for row in reader:
print(row)Using Pandas (Recommended)
python
import pandas as pd
# Read CSV
df = pd.read_csv('survey.csv')
print(df.head())
# Write CSV
df.to_csv('output.csv', index=False, encoding='utf-8-sig') # utf-8-sig prevents Excel garbling
# Common parameters
df = pd.read_csv(
'data.csv',
sep=',', # Separator
header=0, # First row is column names
names=['col1', 'col2'], # Custom column names
usecols=[0, 1, 3], # Only read certain columns
dtype={'age': int}, # Specify data types
na_values=['NA', ''] # Missing value markers
)Excel File Operations
Install Dependencies
bash
pip install openpyxl # For .xlsx
pip install xlrd # For old .xls formatReading Excel Files
python
import pandas as pd
# Read Excel
df = pd.read_excel('survey.xlsx', sheet_name='Sheet1')
# Read multiple worksheets
excel_file = pd.ExcelFile('survey.xlsx')
print(excel_file.sheet_names) # View all worksheets
df1 = pd.read_excel(excel_file, sheet_name='2023')
df2 = pd.read_excel(excel_file, sheet_name='2024')
# Read all worksheets into a dictionary
all_sheets = pd.read_excel('survey.xlsx', sheet_name=None)Writing to Excel Files
python
# Single worksheet
df.to_excel('output.xlsx', sheet_name='Results', index=False)
# Multiple worksheets
with pd.ExcelWriter('report.xlsx') as writer:
df1.to_excel(writer, sheet_name='2023', index=False)
df2.to_excel(writer, sheet_name='2024', index=False)Practical Cases
Case 1: Merge Multiple CSV Files
python
import pandas as pd
from pathlib import Path
# Read all CSV files in folder
data_dir = Path('surveys')
all_data = []
for csv_file in data_dir.glob('*.csv'):
df = pd.read_csv(csv_file)
df['source_file'] = csv_file.name # Add source identifier
all_data.append(df)
# Merge
combined_df = pd.concat(all_data, ignore_index=True)
combined_df.to_csv('combined_survey.csv', index=False)
print(f"Merged {len(all_data)} files, total {len(combined_df)} rows")Case 2: Excel Report Generation
python
import pandas as pd
# Prepare data
summary = pd.DataFrame({
'Category': ['Male', 'Female', 'Total'],
'Count': [450, 550, 1000],
'Avg_Income': [75000, 70000, 72500]
})
details = pd.DataFrame({
'ID': range(1, 11),
'Age': [25, 30, 35, 28, 32, 40, 27, 33, 38, 29],
'Income': [50000, 75000, 85000, 60000, 70000, 90000, 55000, 80000, 95000, 65000]
})
# Write to Excel (multiple worksheets)
with pd.ExcelWriter('analysis_report.xlsx') as writer:
summary.to_excel(writer, sheet_name='Summary', index=False)
details.to_excel(writer, sheet_name='Details', index=False)
print("Report generated: analysis_report.xlsx")Case 3: Data Cleaning Pipeline
python
import pandas as pd
# Read
df = pd.read_csv('raw_survey.csv')
# Clean
df_clean = (df
.dropna(subset=['age', 'income']) # Remove missing values
.query('18 <= age <= 100') # Filter age
.query('income > 0') # Filter income
.assign(income_log=lambda x: np.log(x['income'])) # New column
)
# Save
df_clean.to_csv('clean_survey.csv', index=False)
print(f"Original data: {len(df)} rows")
print(f"After cleaning: {len(df_clean)} rows")CSV vs Excel
| Feature | CSV | Excel |
|---|---|---|
| File size | Small | Large |
| Read/write speed | Fast | Slow |
| Data format | Text only | Supports formatting, formulas |
| Multiple worksheets | ✗ | ✓ |
| Cross-platform | ✓ | ⚠️ |
| Recommended scenarios | Big data, pure data | Reports, multiple tables |
Best Practices
1. CSV Encoding Issues
python
# For Windows Excel with Chinese text
df.to_csv('output.csv', encoding='utf-8-sig', index=False)
# For Mac/Linux
df.to_csv('output.csv', encoding='utf-8', index=False)2. Large File Processing
python
# Read large CSV in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
# Process each chunk
processed = chunk[chunk['age'] > 18]
chunks.append(processed)
result = pd.concat(chunks, ignore_index=True)3. Data Type Optimization
python
# Save memory
df = pd.read_csv('data.csv', dtype={
'id': 'int32',
'age': 'int8',
'income': 'float32',
'gender': 'category'
})Practice Exercises
python
# Exercise 1: CSV Data Analysis
# Read students.csv, containing: name, age, major, gpa
# Tasks:
# 1. Filter students with GPA >= 3.5
# 2. Group by major and calculate average GPA
# 3. Save results to high_performers.csv
# Exercise 2: Excel Multi-sheet Merge
# Read survey_2023.xlsx and survey_2024.xlsx
# Merge two years of data, add 'year' column
# Save to 'Combined' worksheet in new Excel fileNext Steps
Next section: Stata Data File Reading and Writing.
Keep going!