
JSON Data Processing

The Standard Format for Modern Web Data


What is JSON?

JSON (JavaScript Object Notation) is a lightweight data interchange format.

Features:

  • Human-readable
  • Easy for machines to parse
  • Widely used for APIs and web data

A JSON object looks much like a Python dictionary, except that keys and strings must use double quotes:

json
{
  "name": "Alice",
  "age": 25,
  "major": "Economics"
}
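
The resemblance to Python dictionaries is close but not exact. The following is a minimal sketch of how the standard `json` module maps JSON literals onto Python values; the sample text and its field names are made up for illustration.

```python
import json

# Illustrative JSON text: double quotes, lowercase true/false, and null.
text = '{"name": "Alice", "employed": true, "spouse": null, "scores": [90, 85.5]}'

record = json.loads(text)

# JSON true/false/null become Python True/False/None; arrays become lists.
print(record["employed"])        # True
print(record["spouse"])          # None
print(type(record["scores"]))    # <class 'list'>

# Converting back restores the JSON spellings.
print(json.dumps({"employed": True, "spouse": None}))
# {"employed": true, "spouse": null}
```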

Basic Operations

1. Import Module

python
import json

2. Python Object → JSON String

python
import json

# Python dictionary
data = {
    'respondent_id': 1001,
    'age': 30,
    'income': 75000,
    'interests': ['economics', 'data science']
}

# Convert to JSON string
json_str = json.dumps(data, indent=2, ensure_ascii=False)
print(json_str)

Output:

json
{
  "respondent_id": 1001,
  "age": 30,
  "income": 75000,
  "interests": [
    "economics",
    "data science"
  ]
}

3. JSON String → Python Object

python
json_str = '{"name": "Alice", "age": 25, "income": 50000}'
data = json.loads(json_str)

print(data['name'])    # Alice
print(data['age'])     # 25
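
Input from files or users is not always valid JSON. One way to guard against that, sketched here with a hypothetical helper named `parse_record`, is to catch `json.JSONDecodeError`:

```python
import json

def parse_record(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position where parsing failed
        print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None

good = parse_record('{"name": "Alice", "age": 25}')
bad = parse_record("{'name': 'Alice'}")   # single quotes are not valid JSON

print(good)   # {'name': 'Alice', 'age': 25}
print(bad)    # None
```

Returning `None` is only one possible policy; in a data pipeline it may be better to log the bad input and re-raise.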

4. Reading and Writing JSON Files

python
import json

# Write to file
data = {'id': 1001, 'age': 30, 'income': 75000}
with open('respondent.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Read from file
with open('respondent.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(data)
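
A round trip through JSON does not always give back exactly what went in. The sketch below uses `dumps`/`loads` for brevity, but the same conversions apply to `dump`/`load` with files: non-string keys become strings and tuples become lists.

```python
import json

# Integer key and tuple value: both are legal in Python, neither exists in JSON.
original = {1001: ("economics", "data science")}

restored = json.loads(json.dumps(original))

print(restored)              # {'1001': ['economics', 'data science']}
print(restored == original)  # False
```

If the distinction matters for later analysis, convert keys and values back explicitly after loading.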

Practical Cases

Case 1: Save Survey Data

python
import json

survey_data = {
    'survey_name': '2024 Income Survey',
    'start_date': '2024-01-01',
    'end_date': '2024-12-31',
    'responses': [
        {'id': 1001, 'age': 30, 'income': 75000},
        {'id': 1002, 'age': 35, 'income': 85000},
        {'id': 1003, 'age': 28, 'income': 65000}
    ],
    'metadata': {
        'region': 'National',
        'sample_size': 1000,
        'response_rate': 0.85
    }
}

# Save
with open('survey_2024.json', 'w', encoding='utf-8') as f:
    json.dump(survey_data, f, indent=2, ensure_ascii=False)
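
Once saved, the file can be loaded back and analyzed like any other Python structure. The sketch below uses a trimmed-down copy of the survey data and writes to a temporary directory (an assumption made so the example leaves no files behind), then computes the mean income.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean

# Trimmed-down copy of the survey structure, for illustration only.
survey_data = {
    "survey_name": "2024 Income Survey",
    "responses": [
        {"id": 1001, "age": 30, "income": 75000},
        {"id": 1002, "age": 35, "income": 85000},
        {"id": 1003, "age": 28, "income": 65000},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "survey_2024.json"
    # Save, then read the same file back
    with open(path, "w", encoding="utf-8") as f:
        json.dump(survey_data, f, indent=2, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

incomes = [r["income"] for r in loaded["responses"]]
average_income = mean(incomes)
print(f"{loaded['survey_name']}: {len(incomes)} responses, mean income {average_income}")
```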

Case 2: Fetch JSON Data from API

python
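import requests

# Fetch data (example API); the timeout keeps the call from hanging indefinitely
response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # Raise an exception for 4xx/5xx responses

# Parse JSON
data = response.json()  # Roughly equivalent to json.loads(response.text)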

# Process data
for item in data['results']:
    print(f"{item['name']}: {item['value']}")
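
Real API responses often omit fields. The sketch below shows one defensive way to walk a payload shaped like the example above; `extract_values` is a hypothetical helper, and the sample payload stands in for the result of `response.json()` so the code runs without network access.

```python
import json

def extract_values(payload):
    """Collect (name, value) pairs, skipping items that lack either key."""
    pairs = []
    for item in payload.get("results", []):
        name = item.get("name")
        value = item.get("value")
        if name is None or value is None:
            continue  # incomplete item: skip rather than raise KeyError
        pairs.append((name, value))
    return pairs

# Stand-in for response.json(); the second item has no "value" field.
payload = json.loads('{"results": [{"name": "gdp", "value": 3.2}, {"name": "cpi"}]}')

print(extract_values(payload))   # [('gdp', 3.2)]
print(extract_values({}))        # []
```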

Case 3: Converting Between JSON and Pandas

python
import pandas as pd
import json

# Pandas → JSON
df = pd.DataFrame({
    'id': [1, 2, 3],
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})

# Method 1: Convert to JSON string
json_str = df.to_json(orient='records', force_ascii=False)

# Method 2: Save directly to file
df.to_json('data.json', orient='records', indent=2, force_ascii=False)

# JSON → Pandas
df = pd.read_json('data.json')
print(df)

The orient parameter controls the layout of the output (for a DataFrame the default is 'columns'):

python
df.to_json(orient='records')  # [{"col1": val1, "col2": val2}, ...]
df.to_json(orient='index')    # {"0": {"col1": val1}, "1": {...}}
df.to_json(orient='columns')  # {"col1": {"0": val1, "1": val2}, ...}

Complex JSON Processing

Nested JSON

python
import json

# Complex nested structure
data = {
    'survey': {
        'name': 'Income Survey',
        'metadata': {
            'year': 2024,
            'region': 'Beijing'
        }
    },
    'respondents': [
        {
            'id': 1001,
            'demographics': {
                'age': 30,
                'gender': 'Male'
            },
            'responses': {
                'income': 75000,
                'satisfaction': 4
            }
        }
    ]
}

# Access nested data
print(data['survey']['metadata']['year'])  # 2024
print(data['respondents'][0]['demographics']['age'])  # 30
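
Nested records are awkward to put into a table. One common approach, sketched here with a hypothetical `flatten` helper, is to collapse nested dictionaries into a single level with dotted keys. Lists are left untouched in this version.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the path so far as the prefix
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

respondent = {
    "id": 1001,
    "demographics": {"age": 30, "gender": "Male"},
    "responses": {"income": 75000, "satisfaction": 4},
}

row = flatten(respondent)
print(row)
# {'id': 1001, 'demographics.age': 30, 'demographics.gender': 'Male',
#  'responses.income': 75000, 'responses.satisfaction': 4}
```

Each flattened row can then be written to CSV or collected into a table.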

JSON Lines Format (One JSON Object per Line)

python
import json

# Write JSONL
respondents = [
    {'id': 1001, 'age': 30},
    {'id': 1002, 'age': 35},
    {'id': 1003, 'age': 28}
]

with open('data.jsonl', 'w', encoding='utf-8') as f:
    for resp in respondents:
        f.write(json.dumps(resp, ensure_ascii=False) + '\n')

# Read JSONL
data = []
with open('data.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():  # Skip blank lines, which would raise JSONDecodeError
            data.append(json.loads(line))

print(f"Read {len(data)} records")
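
Because each line is independent, JSONL can be processed one record at a time without loading the whole file into a list. A minimal sketch, assuming a hypothetical generator `iter_jsonl` and using `io.StringIO` in place of an open file:

```python
import io
import json

def iter_jsonl(lines):
    """Yield one parsed object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# io.StringIO stands in for a file handle; note the blank line in the middle.
source = io.StringIO(
    '{"id": 1001, "age": 30}\n'
    '{"id": 1002, "age": 35}\n'
    '\n'
    '{"id": 1003, "age": 28}\n'
)

total_age = 0
count = 0
for record in iter_jsonl(source):
    total_age += record["age"]
    count += 1

print(f"{count} records, mean age {total_age / count}")
```

The same generator accepts a real file object, for example `iter_jsonl(open('data.jsonl', encoding='utf-8'))`.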

Best Practices

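1. Handle Non-ASCII Text (Such as Chinese)

python
import json

data = {'city': '北京'}

# Preserve non-ASCII characters as-is
json.dumps(data, ensure_ascii=False)  # {"city": "北京"}

# Non-ASCII characters become \uXXXX escapes
json.dumps(data, ensure_ascii=True)   # {"city": "\u5317\u4eac"}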

2. Pretty-print Output

python
# Format with 2-space indentation
json.dumps(data, indent=2, ensure_ascii=False)
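
Two other `json.dumps` options are often useful alongside `indent`. The sketch below shows `sort_keys`, which gives a stable key order (handy for diffs and version control), and `separators`, which removes whitespace for compact output. The sample data is illustrative.

```python
import json

data = {"income": 75000, "age": 30, "id": 1001}

# Stable, human-friendly output: keys sorted alphabetically
stable = json.dumps(data, indent=2, sort_keys=True)

# Compact output: no spaces after commas or colons
compact = json.dumps(data, separators=(",", ":"))

print(stable)
print(compact)   # {"income":75000,"age":30,"id":1001}
```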

3. Handle Non-serializable Objects

python
from datetime import datetime
import json

# datetime objects cannot be serialized directly
data = {'date': datetime.now()}
# json.dumps(data)  # TypeError: Object of type datetime is not JSON serializable

# Custom serialization: json.dumps calls this for any object it cannot encode
def json_serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj).__name__} is not JSON serializable")

json_str = json.dumps(data, default=json_serializer)
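
Serializing a datetime this way is one-directional: when the JSON is loaded again, the value comes back as a plain string. A minimal sketch of a full round trip, assuming a fixed timestamp and a hypothetical `submitted_at` field so the result is reproducible:

```python
import json
from datetime import datetime

def json_serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj).__name__} is not JSON serializable")

original = {"id": 1001, "submitted_at": datetime(2024, 3, 15, 9, 30)}
json_str = json.dumps(original, default=json_serializer)
print(json_str)   # {"id": 1001, "submitted_at": "2024-03-15T09:30:00"}

# json.loads returns the timestamp as a string; convert it back explicitly.
restored = json.loads(json_str)
restored["submitted_at"] = datetime.fromisoformat(restored["submitted_at"])
print(restored == original)   # True
```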

JSON vs CSV

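| Feature     | JSON               | CSV          |
|-------------|--------------------|--------------|
| Structure   | Nested structure   | Flat table   |
| Readability | Good               | Better       |
| File size   | Larger             | Smaller      |
| Use cases   | API, configuration | Tabular data |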
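
The file-size difference comes from JSON repeating the field names in every record, while CSV states them once in the header. A small sketch that serializes the same illustrative rows both ways, in memory, and compares the lengths:

```python
import csv
import io
import json

rows = [
    {"id": 1001, "age": 30, "income": 75000},
    {"id": 1002, "age": 35, "income": 85000},
    {"id": 1003, "age": 28, "income": 65000},
]

# JSON: every record carries its own keys
json_text = json.dumps(rows)

# CSV: keys appear once, in the header row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "age", "income"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(f"JSON: {len(json_text)} characters, CSV: {len(csv_text)} characters")
```

For flat tabular data like this, CSV is the more compact choice; JSON becomes worthwhile once records need nesting.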

Practice Exercises

python
# Exercise 1: Configuration File
# Create a configuration JSON file containing:
# - database: {host, port, username}
# - analysis: {min_age, max_age, sample_size}
# Save and read

# Exercise 2: Data Conversion
# Read CSV file
# Convert to JSON format (one object per row)
# Save as both .json and .jsonl formats

Module 7 Summary

You have now mastered:

  • Text file reading and writing
  • CSV/Excel processing
  • Stata file reading and writing
  • JSON data processing

Next module: Exception handling

Keep going!

Released under the MIT License. Content © Author.