Skip to content

Pandas Basics

Core Tool for Social Science Data Analysis — Python's Version of Stata


What is Pandas?

Pandas = Python Data Analysis Library

Why must social science students learn Pandas?

  • Handles tabular data (similar to Excel/Stata)
  • Data cleaning, filtering, and merging
  • Group statistics and pivot tables
  • Time series analysis

Core Data Structures

1. Series (1-Dimensional)

python
import pandas as pd

# Like a single column of data
ages = pd.Series([25, 30, 35, 40], name='age')
print(ages)

2. DataFrame (2-Dimensional Table)

python
# Similar to Stata dataset or Excel spreadsheet
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'income': [50000, 75000, 85000]
})
print(df)

Creating DataFrames

python
import pandas as pd

# Method 1: From dictionary
df = pd.DataFrame({
    'respondent_id': [1001, 1002, 1003],
    'age': [25, 30, 35],
    'gender': ['M', 'F', 'M'],
    'income': [50000, 75000, 85000]
})

# Method 2: From list
data = [
    [1001, 25, 'M', 50000],
    [1002, 30, 'F', 75000],
    [1003, 35, 'M', 85000]
]
df = pd.DataFrame(data, columns=['id', 'age', 'gender', 'income'])

# Method 3: From CSV
df = pd.read_csv('survey.csv')

# Method 4: From Stata
df = pd.read_stata('survey.dta')

Viewing Data

python
df.head()       # First 5 rows
df.tail(10)     # Last 10 rows
df.shape        # (rows, columns)
df.columns      # Column names
df.dtypes       # Data types
df.info()       # Data overview
df.describe()   # Descriptive statistics

Selecting Data

Selecting Columns

python
# Single column (returns Series)
ages = df['age']

# Multiple columns (returns DataFrame)
subset = df[['name', 'age', 'income']]

Selecting Rows

python
# By position (iloc)
df.iloc[0]       # First row
df.iloc[0:3]     # First 3 rows
df.iloc[[0, 2]]  # Rows 1 and 3

# By label (loc)
df.loc[0]        # Row with index 0
df.loc[0:2]      # Rows with index 0 to 2

Conditional Filtering

python
# Single condition
df[df['age'] > 30]
df[df['gender'] == 'F']

# Multiple conditions
df[(df['age'] > 25) & (df['income'] > 60000)]
df[(df['gender'] == 'M') | (df['age'] < 30)]

# Query method (more concise)
df.query('age > 30 and income > 60000')

Modifying Data

Adding Columns

python
# Based on existing columns
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# Conditional assignment
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Middle-aged')

# Using np.where
df['high_income'] = np.where(df['income'] > 70000, 'Yes', 'No')

Modifying Values

python
# Modify entire column
df['income'] = df['income'] * 1.1  # Increase income by 10%

# Conditional modification
df.loc[df['age'] > 30, 'age_group'] = 'Senior'

Deleting

python
# Delete column
df.drop('age_squared', axis=1, inplace=True)
# Or
df = df.drop(columns=['age_squared'])

# Delete rows
df.drop([0, 2], axis=0, inplace=True)

Stata Operations vs Pandas

Reading Data

stata
* Stata
use "survey.dta", clear
python
# Pandas
df = pd.read_stata('survey.dta')

Viewing Data

stata
* Stata
browse
describe
summarize
python
# Pandas
df.head()
df.info()
df.describe()

Filtering

stata
* Stata
keep if age > 30
keep if gender == "M" & income > 60000
python
# Pandas
df = df[df['age'] > 30]
df = df[(df['gender'] == 'M') & (df['income'] > 60000)]

Creating New Variables

stata
* Stata
gen log_income = log(income)
gen high_earner = (income > 70000)
python
# Pandas
df['log_income'] = np.log(df['income'])
df['high_earner'] = df['income'] > 70000

Sorting

stata
* Stata
sort age
gsort -income  // Descending
python
# Pandas
df = df.sort_values('age')
df = df.sort_values('income', ascending=False)

Practical Example

Example: Complete Data Cleaning Workflow

python
import pandas as pd
import numpy as np

# 1. Read data
df = pd.read_csv('raw_survey.csv')
print(f"Original data: {len(df)} rows")

# 2. Remove missing values
df = df.dropna(subset=['age', 'income'])

# 3. Filter valid data
df = df[
    (df['age'] >= 18) &
    (df['age'] <= 100) &
    (df['income'] > 0) &
    (df['income'] < 1000000)
]

# 4. Create new variables
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2

# 5. Categorical variables
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle-aged', 'Senior']
)

df['income_quartile'] = pd.qcut(
    df['income'],
    q=4,
    labels=['Q1', 'Q2', 'Q3', 'Q4']
)

# 6. Descriptive statistics
print("\n=== Descriptive Statistics ===")
print(df.groupby('age_group')['income'].agg(['count', 'mean', 'std']))

# 7. Save
df.to_csv('clean_survey.csv', index=False)
print(f"\nCleaned data: {len(df)} rows")

Useful Techniques

Method Chaining

python
result = (df
    .query('age > 25')
    .assign(log_income=lambda x: np.log(x['income']))
    .sort_values('income', ascending=False)
    .head(10)
)

Batch Processing Columns

python
# Apply same function to multiple columns
numeric_cols = ['age', 'income', 'education_years']
df[numeric_cols] = df[numeric_cols].apply(lambda x: x / x.max())  # Normalize

Practice Exercises

python
# Data
df = pd.DataFrame({
    'id': range(1, 11),
    'age': [22, 35, 28, 45, 30, 50, 25, 38, 42, 29],
    'income': [45000, 75000, 55000, 90000, 60000, 95000, 50000, 80000, 85000, 58000],
    'gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']
})

# Tasks:
# 1. Filter ages 30-45
# 2. Create income groups (<60k, 60-80k, >80k)
# 3. Calculate average age by gender and income group
# 4. Save results

Next Steps

The next section will cover Advanced Pandas Operations (merging, reshaping, grouping).

Keep going!

Released under the MIT License. Content © Author.