Your First Python Program

From "Hello World" to Data Analysis — Experience Python in 5 Minutes

The Traditional First Program: Hello World

Stata Version

stata

display "Hello World"

R Version

print("Hello World")

Python Version

python

print("Hello World")

Output:

Hello World

A More Meaningful First Program: Data Analysis

Let's complete a full data analysis workflow with Python!

Scenario: Analyzing Student Survey Data

Suppose we have student survey data:

name	age	major	gpa	study_hours
Alice	20	Economics	3.8	25
Bob	22	Sociology	3.5	20
Carol	21	Political Science	3.9	30
David	23	Economics	3.2	15
Emma	20	Sociology	3.7	22

Complete Code (Ready to Run)

python

# Step 1: Create data
data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20],
    'major': ['Economics', 'Sociology', 'Political Science', 'Economics', 'Sociology'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7],
    'study_hours': [25, 20, 30, 15, 22]
}

# Step 2: Create DataFrame (similar to Stata's dataset)
import pandas as pd
df = pd.DataFrame(data)

# Step 3: View data
print("📊 Data Preview:")
print(df)

# Step 4: Descriptive statistics
print("\n📈 Descriptive Statistics:")
print(df[['age', 'gpa', 'study_hours']].describe())

# Step 5: Group statistics by major
print("\n🎓 Average GPA by Major:")
print(df.groupby('major')['gpa'].mean())

# Step 6: Simple visualization (GPA vs Study Hours)
import matplotlib.pyplot as plt
plt.scatter(df['study_hours'], df['gpa'])
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours')
plt.show()

Output:

📊 Data Preview:
    name  age               major  gpa  study_hours
0  Alice   20           Economics  3.8           25
1    Bob   22           Sociology  3.5           20
2  Carol   21  Political Science  3.9           30
3  David   23           Economics  3.2           15
4   Emma   20           Sociology  3.7           22

📈 Descriptive Statistics:
             age       gpa  study_hours
count   5.000000  5.000000     5.000000
mean   21.200000  3.620000    22.400000
std     1.303840  0.277489     5.594640
min    20.000000  3.200000    15.000000
25%    20.000000  3.500000    20.000000
50%    21.000000  3.700000    22.000000
75%    22.000000  3.800000    25.000000
max    23.000000  3.900000    30.000000

🎓 Average GPA by Major:
major
Economics            3.50
Political Science    3.90
Sociology            3.60
Name: gpa, dtype: float64

Code Explanation

1. Create Data (Dictionary)

python

data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20]
}

Understanding:

{} represents a dictionary
'name': [...] represents key-value pairs
Similar to R's list(name = c("Alice", "Bob", ...))
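To make the points above concrete, here is a minimal sketch that builds the same kind of dictionary and reads values back out of it. The variable names `ages` and `first_name` are only illustrative.

```python
# A dictionary maps keys (here: column names) to values (here: lists).
data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20],
}

# Look up a value by its key.
ages = data['age']
print(ages)

# Lists are indexed from 0, so [0] is the first element.
first_name = data['name'][0]
print(first_name)

# The keys become column names once the dictionary is passed to pandas.
print(list(data.keys()))
```

Each list should have the same length, because every key will become one column with one entry per row.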

2. Create DataFrame

python

import pandas as pd
df = pd.DataFrame(data)

Understanding:

import pandas as pd: Import Pandas library, abbreviated as pd
pd.DataFrame(): Create DataFrame (similar to Stata's dataset, R's data.frame)
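A quick way to confirm that the DataFrame was built as intended is to inspect its shape, column names, and column types. The sketch below uses a shortened three-student version of the data so it runs on its own.

```python
import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [20, 22, 21],
    'gpa': [3.8, 3.5, 3.9],
}
df = pd.DataFrame(data)

# (rows, columns): one row per student, one column per dictionary key
print(df.shape)

# Column names come from the dictionary keys
print(list(df.columns))

# Each column has its own type (text, integer, float)
print(df.dtypes)
```

This is roughly what `describe` in Stata or `str(df)` in R is used for: checking the structure before analyzing the values.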

3. View Data

python

print(df)

Comparison:

Stata: browse or list
R: print(df) or just df
Python: print(df) or df (in Jupyter)
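With larger datasets, printing everything is rarely useful. A small sketch of viewing only the first or last rows, using a two-column version of the student data (the names `first_rows` and `last_rows` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7],
})

# First two rows (roughly Stata's `list in 1/2`, R's `head(df, 2)`)
first_rows = df.head(2)
print(first_rows)

# Last two rows (R's `tail(df, 2)`)
last_rows = df.tail(2)
print(last_rows)
```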

4. Descriptive Statistics

python

df[['age', 'gpa', 'study_hours']].describe()

Comparison:

Stata: summarize age gpa study_hours
R: summary(df[c("age", "gpa", "study_hours")])
Python: df[['age', 'gpa', 'study_hours']].describe()
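One detail worth knowing when comparing results across tools: the `std` reported by `.describe()` is the sample standard deviation (dividing by n - 1), which matches Stata's `summarize` and R's `sd()`, whereas NumPy's `np.std` divides by n by default. A minimal sketch using the GPA column:

```python
import numpy as np
import pandas as pd

gpa = pd.Series([3.8, 3.5, 3.9, 3.2, 3.7])

# pandas: sample standard deviation (divides by n - 1)
sample_std = gpa.std()

# NumPy default: population standard deviation (divides by n)
population_std = np.std(gpa.to_numpy())

print(f"pandas .std(): {sample_std:.6f}")
print(f"NumPy np.std(): {population_std:.6f}")

# .describe() reports the same value as .std()
print(np.isclose(gpa.describe()['std'], sample_std))
```

Passing `ddof=1` to `np.std` makes NumPy agree with pandas.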

5. Group Statistics

python

df.groupby('major')['gpa'].mean()

Comparison:

Stata: tabstat gpa, by(major)
R: aggregate(gpa ~ major, data=df, FUN=mean)
Python: df.groupby('major')['gpa'].mean()
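The same pattern extends to several statistics at once, which is closer to what `tabstat` with a `stat()` option produces. A sketch using `.agg()` on the major and GPA columns (the name `by_major` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'major': ['Economics', 'Sociology', 'Political Science', 'Economics', 'Sociology'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7],
})

# One row per major, one column per statistic
by_major = df.groupby('major')['gpa'].agg(['mean', 'count'])
print(by_major)
```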

Visualization Example

Scatter Plot: GPA vs Study Hours

python

import matplotlib.pyplot as plt

plt.scatter(df['study_hours'], df['gpa'])
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours')
plt.show()

Compare to Stata:

stata

twoway scatter gpa study_hours, title("GPA vs Study Hours")

Compare to R:

plot(df$study_hours, df$gpa,
     xlab="Study Hours", ylab="GPA",
     main="GPA vs Study Hours")

Advanced Example: Adding Regression Line

python

import numpy as np
from scipy import stats

# Calculate regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['study_hours'], df['gpa'])
line = slope * df['study_hours'] + intercept

# Plot
plt.scatter(df['study_hours'], df['gpa'], label='Actual Data')
plt.plot(df['study_hours'], line, color='red', label=f'Regression Line (R²={r_value**2:.3f})')
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours with Regression Line')
plt.legend()
plt.show()

print(f"📊 Regression Results: GPA = {intercept:.3f} + {slope:.3f} * Study Hours")
print(f"   R² = {r_value**2:.3f}, p-value = {p_value:.4f}")

Output:

📊 Regression Results: GPA = 2.554 + 0.048 * Study Hours
   R² = 0.921, p-value = 0.0096
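For readers who want to see what `linregress` computes, the slope and intercept of a one-variable regression can be reproduced by hand: the slope is the covariance of x and y divided by the variance of x, and R² is the squared correlation. The sketch below uses only NumPy and the five-student data; the variable names are illustrative.

```python
import numpy as np

study_hours = np.array([25, 20, 30, 15, 22], dtype=float)
gpa = np.array([3.8, 3.5, 3.9, 3.2, 3.7])

# Deviations from the mean
x_dev = study_hours - study_hours.mean()
y_dev = gpa - gpa.mean()

# OLS with one regressor: slope = sum(dx * dy) / sum(dx^2)
slope = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
intercept = gpa.mean() - slope * study_hours.mean()

# R² equals the squared correlation between x and y
r_squared = np.corrcoef(study_hours, gpa)[0, 1] ** 2

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R² = {r_squared:.3f}")
```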

Complete Data Analysis Template

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# ========== 1. Data Loading ==========
# Method 1: Create from dictionary
data = {
    'variable1': [1, 2, 3, 4, 5],
    'variable2': [10, 20, 30, 40, 50],
    'category': ['A', 'B', 'A', 'B', 'A']  # grouping variable used in step 4
}
df = pd.DataFrame(data)

# Method 2: Load from CSV file (more common)
# df = pd.read_csv('data.csv')

# ========== 2. Data Cleaning ==========
df = df.dropna()  # Drop missing values
df = df[df['variable1'] > 0]  # Filter conditions

# ========== 3. Create New Variables ==========
df['log_var1'] = np.log(df['variable1'])
df['var1_squared'] = df['variable1'] ** 2

# ========== 4. Descriptive Statistics ==========
print(df.describe())
print(df.groupby('category')['variable1'].mean())

# ========== 5. Visualization ==========
plt.hist(df['variable1'], bins=10)
plt.title('Distribution of Variable 1')
plt.show()

# ========== 6. Statistical Analysis ==========
# Correlation coefficient
correlation = df['variable1'].corr(df['variable2'])
print(f"Correlation: {correlation:.3f}")

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['variable1'], df['variable2'])
print(f"Regression: y = {intercept:.2f} + {slope:.2f}x, R² = {r_value**2:.3f}")

# ========== 7. Save Results ==========
df.to_csv('output.csv', index=False)

Advanced Case: From Raw Data to Publication-Quality Analysis

Case: Analyzing Income Inequality with Simulated Data

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Generate simulated income distribution data (mimicking CPS data)
np.random.seed(42)
n = 5000

# Generate income by education level
education_levels = ['High School', 'Bachelor', 'Master', 'PhD']
education_weights = [0.4, 0.35, 0.20, 0.05]

data = {
    'person_id': range(1, n+1),
    'age': np.random.randint(22, 65, n),
    'education': np.random.choice(education_levels, n, p=education_weights),
    'experience': np.random.randint(0, 40, n),
    'female': np.random.choice([0, 1], n),
    'urban': np.random.choice([0, 1], n, p=[0.3, 0.7])
}

df = pd.DataFrame(data)

# Generate income based on characteristics (log-normal distribution)
education_premium = df['education'].map({
    'High School': 0,
    'Bachelor': 0.3,
    'Master': 0.5,
    'PhD': 0.7
})

log_income = (10.5 +
              education_premium +
              0.03 * df['age'] -
              0.0004 * df['age']**2 +
              0.02 * df['experience'] -
              0.15 * df['female'] +
              0.10 * df['urban'] +
              np.random.normal(0, 0.3, n))

df['income'] = np.exp(log_income)

# ========== 1. Data Quality Check ==========
print("📊 Data Quality Report")
print("=" * 50)
print(f"Total sample size: {len(df)}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"\nIncome distribution:")
print(df['income'].describe())

# ========== 2. Descriptive Statistics ==========
print("\n📈 Income Statistics by Education Level")
print("=" * 50)
summary = df.groupby('education')['income'].agg([
    ('Count', 'count'),
    ('Mean', 'mean'),
    ('Median', 'median'),
    ('Std', 'std'),
    ('P25', lambda x: x.quantile(0.25)),
    ('P75', lambda x: x.quantile(0.75))
]).round(0)
print(summary)

# ========== 3. Inequality Indicators ==========
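def gini_coefficient(x):
    """Calculate the Gini coefficient (0 = perfect equality)."""
    x = np.sort(np.asarray(x, dtype=float))  # ascending order
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Rank-based formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n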

gini = gini_coefficient(df['income'])
print(f"\n📊 Income Gini Coefficient: {gini:.3f}")

# Calculate income ratio between education groups
mean_income = df.groupby('education')['income'].mean()
college_premium = (mean_income['Bachelor'] / mean_income['High School'] - 1) * 100
print(f"College Premium (Bachelor vs High School): {college_premium:.1f}%")

# ========== 4. Regression Analysis ==========
import statsmodels.formula.api as smf

# OLS regression
# By default the alphabetically first level ('Bachelor') becomes the
# reference category, so set 'High School' as the baseline explicitly.
edu_term = "C(education, Treatment(reference='High School'))"
model = smf.ols(f'np.log(income) ~ {edu_term} + age + I(age**2) + experience + female + urban',
                data=df).fit()

print("\n📊 Regression Analysis Results")
print("=" * 50)
print(model.summary().tables[1])

# Extract key coefficients
edu_coef = model.params[f'{edu_term}[T.Bachelor]']
female_coef = model.params['female']

print(f"\nKey Findings:")
print(f"- A bachelor's degree raises income by {(np.exp(edu_coef)-1)*100:.1f}% relative to high school")
print(f"- Gender wage gap: {abs(female_coef)*100:.1f} log points")

# ========== 5. Data Visualization ==========
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Subplot 1: Income distribution (log scale)
axes[0, 0].hist(np.log(df['income']), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Log(Income)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Income Distribution (Log Scale)')

# Subplot 2: Income boxplot by education
df.boxplot(column='income', by='education', ax=axes[0, 1])
axes[0, 1].set_ylabel('Income ($)')
axes[0, 1].set_title('Income by Education Level')
axes[0, 1].get_figure().suptitle('')  # Remove default title

# Subplot 3: Age-income relationship
sns.scatterplot(data=df.sample(500), x='age', y='income',
                hue='education', alpha=0.6, ax=axes[1, 0])
axes[1, 0].set_ylabel('Income ($)')
axes[1, 0].set_title('Income vs Age by Education')

# Subplot 4: Gender wage gap
gender_income = df.groupby(['education', 'female'])['income'].mean().unstack()
gender_income.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_ylabel('Mean Income ($)')
axes[1, 1].set_title('Gender Pay Gap by Education')
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=45)
axes[1, 1].legend(['Male', 'Female'])

plt.tight_layout()
plt.savefig('income_analysis.png', dpi=300, bbox_inches='tight')
print("\n✅ Chart saved as 'income_analysis.png'")

# ========== 6. Export Results ==========
# Export regression results table
with open('regression_results.txt', 'w') as f:
    f.write(str(model.summary()))

# Export descriptive statistics
summary.to_csv('descriptive_stats.csv')

print("\n✅ Analysis complete! Generated files:")
print("  - regression_results.txt")
print("  - descriptive_stats.csv")
print("  - income_analysis.png")

What does this case demonstrate?

Data Quality Check: Report sample size, missing values, and the income distribution before any analysis
Descriptive Statistics: Calculate mean, median, and quantiles by group
Inequality Indicators: Calculate the Gini coefficient (Stata needs the user-written ineqdeco package for this)
Regression Analysis: Includes a quadratic term and a categorical variable
Publication-Quality Visualization: 4 subplots, 300 DPI output
Results Export: Tables and figures saved to files, ready for use in papers
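Before trusting a hand-written inequality measure, it helps to test it on inputs where the answer is known. The sketch below is one way to do that, assuming the standard rank-based Gini formula; the function name `gini` is illustrative. With identical incomes the coefficient should be 0, and when one of n people holds everything it should be (n - 1) / n.

```python
import numpy as np

def gini(x):
    """Gini coefficient via the rank-based formula on ascending values."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

equal = gini([10, 10, 10, 10])        # everyone earns the same
concentrated = gini([0, 0, 0, 100])   # one person earns everything

print(equal)         # 0.0
print(concentrated)  # 0.75, i.e. (n - 1) / n with n = 4
```

A negative result on real income data is a sign that the sort order or the formula is wrong.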

Practical Exercises

Exercise 1: Modify Data

Try modifying the student data above, adding a new student:

Name: Frank
Age: 24
Major: Economics
GPA: 3.6
Study Hours: 28

Hint: Use pd.concat() or df.loc[]

Answer:

python

# Method 1: Using pd.concat
new_student = pd.DataFrame({
    'name': ['Frank'],
    'age': [24],
    'major': ['Economics'],
    'gpa': [3.6],
    'study_hours': [28]
})
df = pd.concat([df, new_student], ignore_index=True)

# Method 2 (alternative): Using loc
# Run only one of the two methods; running both adds Frank twice.
df.loc[len(df)] = ['Frank', 24, 'Economics', 3.6, 28]

Exercise 2: New Analysis

Calculate:

Average study hours by major
Which students have GPA above 3.6?
Correlation coefficient between age and GPA

Answer:

python

# 1. Average study hours by major
print(df.groupby('major')['study_hours'].mean())

# 2. Students with GPA above 3.6
high_gpa = df[df['gpa'] > 3.6]
print(high_gpa[['name', 'gpa']])

# 3. Correlation between age and GPA
correlation = df['age'].corr(df['gpa'])
print(f"Correlation: {correlation:.3f}")

Exercise 3: Replicate Stata's tabstat

Use Python to replicate Stata's tabstat income education, by(gender) stat(mean sd min max n)

Answer:

python

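# Uses the simulated income DataFrame from the advanced case, where the
# gender variable is the 0/1 column 'female'.
# tabstat needs numeric variables, so education is first mapped to years of
# schooling (an illustrative mapping, not part of the original data).
years = {'High School': 12, 'Bachelor': 16, 'Master': 18, 'PhD': 21}
df['education_years'] = df['education'].map(years)

result = df.groupby('female').agg({
    'income': ['mean', 'std', 'min', 'max', 'count'],
    'education_years': ['mean', 'std', 'min', 'max', 'count']
})
print(result)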

Key Takeaways

Python Programming Philosophy

Everything is an object: df is an object, and .describe() is one of its methods
Method chaining: in df.groupby('major')['gpa'].mean(), each step operates on the result of the previous one
Import libraries: import pandas as pd is the standard convention
Indexing: df['column'] selects one column; df[['col1', 'col2']] selects several
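To see what method chaining means in practice, the sketch below computes the same group means twice: once with every intermediate object named, and once as a single chain extended with two more methods. The intermediate names (`grouped`, `gpa_by_group`) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'major': ['Economics', 'Sociology', 'Economics', 'Sociology'],
    'gpa': [3.8, 3.5, 3.2, 3.7],
})

# Step by step, naming every intermediate object
grouped = df.groupby('major')
gpa_by_group = grouped['gpa']
means_stepwise = gpa_by_group.mean()

# The same computation as one chain, then rounded and sorted
means_chained = (
    df.groupby('major')['gpa']
      .mean()
      .round(2)
      .sort_values(ascending=False)
)
print(means_chained)
```

Wrapping the chain in parentheses lets each method sit on its own line, which reads much like a `%>%` pipeline in R.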

Python vs Stata/R: Mindset Comparison

Aspect	Stata	R	Python
DataFrame	One dataset in memory at a time	Multiple, access columns with `$`	Multiple, access columns with `[]`
Function Call	`command varlist`	`function(data$var)`	`df['var'].method()`
Assignment	`gen`, `replace`	`<-` or `=`	`=`
Pipe Operations	Not supported	`%>%` (dplyr)	`.` (method chain)
Vectorization	Automatic	Automatic	Via NumPy/pandas (plain lists are not vectorized)
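The vectorization row is the one that most often surprises Stata and R users. A short sketch of the difference between a plain Python list and a NumPy array or pandas column:

```python
import numpy as np
import pandas as pd

hours_list = [25, 20, 30, 15, 22]

# Plain Python list: element-wise arithmetic needs a loop or comprehension
doubled_list = [h * 2 for h in hours_list]

# Careful: multiplying a list repeats it instead of doubling each element
repeated_list = hours_list * 2

# NumPy array: the operation applies to every element at once
hours_array = np.array(hours_list)
doubled_array = hours_array * 2

# pandas columns are built on NumPy, so they behave the same way
hours_series = pd.Series(hours_list)
doubled_series = hours_series * 2

print(doubled_list)
print(len(repeated_list))
print(doubled_array.tolist())
print(doubled_series.tolist())
```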

Best Practices

Code Organization: Use comments to separate modules (# ========== 1. Data Loading ==========)
Variable Naming: Use meaningful names (df_clean not df2)
Error Handling: Develop the habit of checking data quality (missing values, outliers)
Reproducibility: Set a random seed (np.random.seed(42))
Performance Optimization: For large files, read in chunks with pd.read_csv('data.csv', chunksize=1000)
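As a sketch of how chunked reading works: when `chunksize` is set, `pd.read_csv` returns an iterator of DataFrames rather than a single DataFrame, so statistics have to be accumulated across chunks. The example uses an in-memory string as a stand-in for a large file (an assumption made so the code runs on its own; a real file path would go in its place).

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk
csv_text = "income\n100\n200\n300\n400\n500\n"

total = 0.0
rows = 0

# Each chunk is a DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['income'].sum()
    rows += len(chunk)

mean_income = total / rows
print(rows, mean_income)
```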

From First Program to Production-Level Code

Beginner Version

python

import pandas as pd

df = pd.read_csv("data.csv")
print(df.mean(numeric_only=True))  # numeric_only skips text columns

Production Version

python

import pandas as pd
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_and_validate_data(filepath):
    """
    Load and validate data

    Parameters:
    -----------
    filepath : str
        Data file path

    Returns:
    --------
    pd.DataFrame
        Cleaned DataFrame
    """
    try:
        df = pd.read_csv(filepath)
        logger.info(f"Successfully loaded {len(df)} rows")

        # Data validation
        required_columns = ['income', 'education', 'age']
        missing_cols = set(required_columns) - set(df.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")

        # Drop missing values
        initial_rows = len(df)
        df = df.dropna(subset=required_columns)
        dropped_rows = initial_rows - len(df)
        if dropped_rows > 0:
            logger.warning(f"Dropped {dropped_rows} rows with missing data")

        return df

    except FileNotFoundError:
        logger.error(f"File not found: {filepath}")
        raise
    except Exception as e:
        logger.error(f"Error loading data: {str(e)}")
        raise

# Use function
df = load_and_validate_data("data.csv")

Differences:

Error handling (try-except)
Docstrings
Logging
Data validation
Function encapsulation
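The validation step can be tried without any file on disk. The sketch below isolates the column check in a small helper (`check_columns` is a hypothetical name, not part of the code above) and feeds it an in-memory CSV that lacks the age column, to show how the error surfaces.

```python
import io
import pandas as pd

def check_columns(df, required):
    """Raise ValueError if any required column is missing (hypothetical helper)."""
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return df

# In-memory stand-in for a CSV file that lacks the 'age' column
df_bad = pd.read_csv(io.StringIO("income,education\n50000,Bachelor\n"))

try:
    check_columns(df_bad, ['income', 'education', 'age'])
    message = "ok"
except ValueError as err:
    message = str(err)

print(message)
```

Failing early with a clear message is usually easier to debug than a KeyError raised deep inside the analysis.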

Next Steps

Congratulations on completing your first Python program! In the next module, we will:

Learn how to configure a Python development environment
Learn how to use Jupyter Notebook
Set up Python in VS Code

You have mastered:

Python basic syntax
Pandas DataFrame concepts
Descriptive statistics and groupby operations
Mental comparison with Stata/R

Ready for the next stage?
