CSV and Excel File Processing

Common Formats for Data Analysis — Reading and Writing Tabular Data

CSV File Operations

Using the csv Module

python

import csv

# Write CSV
data = [
    ['Name', 'Age', 'Income'],
    ['Alice', 25, 50000],
    ['Bob', 30, 75000],
    ['Carol', 35, 85000]
]

with open('survey.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(data)

# Read CSV
with open('survey.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

Using Pandas (Recommended)

python

import pandas as pd

# Read CSV
df = pd.read_csv('survey.csv')
print(df.head())

# Write CSV
df.to_csv('output.csv', index=False, encoding='utf-8-sig')  # utf-8-sig prevents Excel garbling

# Common parameters
df = pd.read_csv(
    'data.csv',
    sep=',',              # Separator
    header=0,             # First row is column names
    names=['col1', 'col2'],  # Custom column names
    usecols=[0, 1, 3],   # Only read certain columns
    dtype={'age': int},   # Specify data types
    na_values=['NA', '']  # Missing value markers
)

Excel File Operations

Install Dependencies

bash

pip install openpyxl  # For .xlsx
pip install xlrd      # For old .xls format

Reading Excel Files

python

import pandas as pd

# Read Excel
df = pd.read_excel('survey.xlsx', sheet_name='Sheet1')

# Read multiple worksheets
excel_file = pd.ExcelFile('survey.xlsx')
print(excel_file.sheet_names)  # View all worksheets

df1 = pd.read_excel(excel_file, sheet_name='2023')
df2 = pd.read_excel(excel_file, sheet_name='2024')

# Read all worksheets into a dictionary
all_sheets = pd.read_excel('survey.xlsx', sheet_name=None)

Writing to Excel Files

python

# Single worksheet
df.to_excel('output.xlsx', sheet_name='Results', index=False)

# Multiple worksheets
with pd.ExcelWriter('report.xlsx') as writer:
    df1.to_excel(writer, sheet_name='2023', index=False)
    df2.to_excel(writer, sheet_name='2024', index=False)

Practical Cases

Case 1: Merge Multiple CSV Files

python

import pandas as pd
from pathlib import Path

# Read all CSV files in folder
data_dir = Path('surveys')
all_data = []

for csv_file in data_dir.glob('*.csv'):
    df = pd.read_csv(csv_file)
    df['source_file'] = csv_file.name  # Add source identifier
    all_data.append(df)

# Merge
combined_df = pd.concat(all_data, ignore_index=True)
combined_df.to_csv('combined_survey.csv', index=False)

print(f"Merged {len(all_data)} files, total {len(combined_df)} rows")

Case 2: Excel Report Generation

python

import pandas as pd

# Prepare data
summary = pd.DataFrame({
    'Category': ['Male', 'Female', 'Total'],
    'Count': [450, 550, 1000],
    'Avg_Income': [75000, 70000, 72500]
})

details = pd.DataFrame({
    'ID': range(1, 11),
    'Age': [25, 30, 35, 28, 32, 40, 27, 33, 38, 29],
    'Income': [50000, 75000, 85000, 60000, 70000, 90000, 55000, 80000, 95000, 65000]
})

# Write to Excel (multiple worksheets)
with pd.ExcelWriter('analysis_report.xlsx') as writer:
    summary.to_excel(writer, sheet_name='Summary', index=False)
    details.to_excel(writer, sheet_name='Details', index=False)

print("Report generated: analysis_report.xlsx")

Case 3: Data Cleaning Pipeline

python

import pandas as pd

# Read
df = pd.read_csv('raw_survey.csv')

# Clean
df_clean = (df
    .dropna(subset=['age', 'income'])  # Remove missing values
    .query('18 <= age <= 100')          # Filter age
    .query('income > 0')                # Filter income
    .assign(income_log=lambda x: np.log(x['income']))  # New column
)

# Save
df_clean.to_csv('clean_survey.csv', index=False)

print(f"Original data: {len(df)} rows")
print(f"After cleaning: {len(df_clean)} rows")

CSV vs Excel

Feature	CSV	Excel
File size	Small	Large
Read/write speed	Fast	Slow
Data format	Text only	Supports formatting, formulas
Multiple worksheets	✗	✓
Cross-platform	✓	⚠️
Recommended scenarios	Big data, pure data	Reports, multiple tables

Best Practices

1. CSV Encoding Issues

python

# For Windows Excel with Chinese text
df.to_csv('output.csv', encoding='utf-8-sig', index=False)

# For Mac/Linux
df.to_csv('output.csv', encoding='utf-8', index=False)

2. Large File Processing

python

# Read large CSV in chunks
chunk_size = 10000
chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk[chunk['age'] > 18]
    chunks.append(processed)

result = pd.concat(chunks, ignore_index=True)

3. Data Type Optimization

python

# Save memory
df = pd.read_csv('data.csv', dtype={
    'id': 'int32',
    'age': 'int8',
    'income': 'float32',
    'gender': 'category'
})

Practice Exercises

python

# Exercise 1: CSV Data Analysis
# Read students.csv, containing: name, age, major, gpa
# Tasks:
# 1. Filter students with GPA >= 3.5
# 2. Group by major and calculate average GPA
# 3. Save results to high_performers.csv

# Exercise 2: Excel Multi-sheet Merge
# Read survey_2023.xlsx and survey_2024.xlsx
# Merge two years of data, add 'year' column
# Save to 'Combined' worksheet in new Excel file

Next Steps

Next section: Stata Data File Reading and Writing.

Keep going!

CSV and Excel File Processing ​

CSV File Operations ​

Using the csv Module ​

Using Pandas (Recommended) ​

Excel File Operations ​

Install Dependencies ​

Reading Excel Files ​

Writing to Excel Files ​

Practical Cases ​

Case 1: Merge Multiple CSV Files ​

Case 2: Excel Report Generation ​

Case 3: Data Cleaning Pipeline ​

CSV vs Excel ​

Best Practices ​

1. CSV Encoding Issues ​

2. Large File Processing ​

3. Data Type Optimization ​

Practice Exercises ​

Next Steps ​

CSV and Excel File Processing

CSV File Operations

Using the csv Module

Using Pandas (Recommended)

Excel File Operations

Install Dependencies

Reading Excel Files

Writing to Excel Files

Practical Cases

Case 1: Merge Multiple CSV Files

Case 2: Excel Report Generation

Case 3: Data Cleaning Pipeline

CSV vs Excel

Best Practices

1. CSV Encoding Issues

2. Large File Processing

3. Data Type Optimization

Practice Exercises

Next Steps