Pandas Basics
Core Tool for Social Science Data Analysis — Python's Version of Stata
What is Pandas?
Pandas = Python Data Analysis Library
Why must social science students learn Pandas?
- Handles tabular data (similar to Excel/Stata)
- Data cleaning, filtering, and merging
- Group statistics and pivot tables
- Time series analysis
Core Data Structures
1. Series (1-Dimensional)
python
import pandas as pd
# Like a single column of data
ages = pd.Series([25, 30, 35, 40], name='age')
print(ages)2. DataFrame (2-Dimensional Table)
python
# Similar to Stata dataset or Excel spreadsheet
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol'],
'age': [25, 30, 35],
'income': [50000, 75000, 85000]
})
print(df)Creating DataFrames
python
import pandas as pd
# Method 1: From dictionary
df = pd.DataFrame({
'respondent_id': [1001, 1002, 1003],
'age': [25, 30, 35],
'gender': ['M', 'F', 'M'],
'income': [50000, 75000, 85000]
})
# Method 2: From list
data = [
[1001, 25, 'M', 50000],
[1002, 30, 'F', 75000],
[1003, 35, 'M', 85000]
]
df = pd.DataFrame(data, columns=['id', 'age', 'gender', 'income'])
# Method 3: From CSV
df = pd.read_csv('survey.csv')
# Method 4: From Stata
df = pd.read_stata('survey.dta')Viewing Data
python
df.head() # First 5 rows
df.tail(10) # Last 10 rows
df.shape # (rows, columns)
df.columns # Column names
df.dtypes # Data types
df.info() # Data overview
df.describe() # Descriptive statisticsSelecting Data
Selecting Columns
python
# Single column (returns Series)
ages = df['age']
# Multiple columns (returns DataFrame)
subset = df[['name', 'age', 'income']]Selecting Rows
python
# By position (iloc)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[[0, 2]] # Rows 1 and 3
# By label (loc)
df.loc[0] # Row with index 0
df.loc[0:2] # Rows with index 0 to 2Conditional Filtering
python
# Single condition
df[df['age'] > 30]
df[df['gender'] == 'F']
# Multiple conditions
df[(df['age'] > 25) & (df['income'] > 60000)]
df[(df['gender'] == 'M') | (df['age'] < 30)]
# Query method (more concise)
df.query('age > 30 and income > 60000')Modifying Data
Adding Columns
python
# Based on existing columns
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2
# Conditional assignment
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Middle-aged')
# Using np.where
df['high_income'] = np.where(df['income'] > 70000, 'Yes', 'No')Modifying Values
python
# Modify entire column
df['income'] = df['income'] * 1.1 # Increase income by 10%
# Conditional modification
df.loc[df['age'] > 30, 'age_group'] = 'Senior'Deleting
python
# Delete column
df.drop('age_squared', axis=1, inplace=True)
# Or
df = df.drop(columns=['age_squared'])
# Delete rows
df.drop([0, 2], axis=0, inplace=True)Stata Operations vs Pandas
Reading Data
stata
* Stata
use "survey.dta", clearpython
# Pandas
df = pd.read_stata('survey.dta')Viewing Data
stata
* Stata
browse
describe
summarizepython
# Pandas
df.head()
df.info()
df.describe()Filtering
stata
* Stata
keep if age > 30
keep if gender == "M" & income > 60000python
# Pandas
df = df[df['age'] > 30]
df = df[(df['gender'] == 'M') & (df['income'] > 60000)]Creating New Variables
stata
* Stata
gen log_income = log(income)
gen high_earner = (income > 70000)python
# Pandas
df['log_income'] = np.log(df['income'])
df['high_earner'] = df['income'] > 70000Sorting
stata
* Stata
sort age
gsort -income // Descendingpython
# Pandas
df = df.sort_values('age')
df = df.sort_values('income', ascending=False)Practical Example
Example: Complete Data Cleaning Workflow
python
import pandas as pd
import numpy as np
# 1. Read data
df = pd.read_csv('raw_survey.csv')
print(f"Original data: {len(df)} rows")
# 2. Remove missing values
df = df.dropna(subset=['age', 'income'])
# 3. Filter valid data
df = df[
(df['age'] >= 18) &
(df['age'] <= 100) &
(df['income'] > 0) &
(df['income'] < 1000000)
]
# 4. Create new variables
df['log_income'] = np.log(df['income'])
df['age_squared'] = df['age'] ** 2
# 5. Categorical variables
df['age_group'] = pd.cut(
df['age'],
bins=[0, 30, 50, 100],
labels=['Young', 'Middle-aged', 'Senior']
)
df['income_quartile'] = pd.qcut(
df['income'],
q=4,
labels=['Q1', 'Q2', 'Q3', 'Q4']
)
# 6. Descriptive statistics
print("\n=== Descriptive Statistics ===")
print(df.groupby('age_group')['income'].agg(['count', 'mean', 'std']))
# 7. Save
df.to_csv('clean_survey.csv', index=False)
print(f"\nCleaned data: {len(df)} rows")Useful Techniques
Method Chaining
python
result = (df
.query('age > 25')
.assign(log_income=lambda x: np.log(x['income']))
.sort_values('income', ascending=False)
.head(10)
)Batch Processing Columns
python
# Apply same function to multiple columns
numeric_cols = ['age', 'income', 'education_years']
df[numeric_cols] = df[numeric_cols].apply(lambda x: x / x.max()) # NormalizePractice Exercises
python
# Data
df = pd.DataFrame({
'id': range(1, 11),
'age': [22, 35, 28, 45, 30, 50, 25, 38, 42, 29],
'income': [45000, 75000, 55000, 90000, 60000, 95000, 50000, 80000, 85000, 58000],
'gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']
})
# Tasks:
# 1. Filter ages 30-45
# 2. Create income groups (<60k, 60-80k, >80k)
# 3. Calculate average age by gender and income group
# 4. Save resultsNext Steps
The next section will cover Advanced Pandas Operations (merging, reshaping, grouping).
Keep going!