Your First Python Program

From "Hello World" to Data Analysis — Experience Python in 5 Minutes


The Traditional First Program: Hello World

Stata Version

stata
display "Hello World"

R Version

r
print("Hello World")

Python Version

python
print("Hello World")

Output:

Hello World

A More Meaningful First Program: Data Analysis

Let's complete a full data analysis workflow with Python!

Scenario: Analyzing Student Survey Data

Suppose we have student survey data:

name    age  major              gpa  study_hours
Alice   20   Economics          3.8  25
Bob     22   Sociology          3.5  20
Carol   21   Political Science  3.9  30
David   23   Economics          3.2  15
Emma    20   Sociology          3.7  22

Complete Code (Ready to Run)

python
# Step 1: Create data
data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20],
    'major': ['Economics', 'Sociology', 'Political Science', 'Economics', 'Sociology'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7],
    'study_hours': [25, 20, 30, 15, 22]
}

# Step 2: Create DataFrame (similar to Stata's dataset)
import pandas as pd
df = pd.DataFrame(data)

# Step 3: View data
print("📊 Data Preview:")
print(df)

# Step 4: Descriptive statistics
print("\n📈 Descriptive Statistics:")
print(df[['age', 'gpa', 'study_hours']].describe())

# Step 5: Group statistics by major
print("\n🎓 Average GPA by Major:")
print(df.groupby('major')['gpa'].mean())

# Step 6: Simple visualization (GPA vs Study Hours)
import matplotlib.pyplot as plt
plt.scatter(df['study_hours'], df['gpa'])
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours')
plt.show()

Output:

📊 Data Preview:
    name  age               major  gpa  study_hours
0  Alice   20           Economics  3.8           25
1    Bob   22           Sociology  3.5           20
2  Carol   21  Political Science  3.9           30
3  David   23           Economics  3.2           15
4   Emma   20           Sociology  3.7           22

📈 Descriptive Statistics:
             age       gpa  study_hours
count   5.000000  5.000000     5.000000
mean   21.200000  3.620000    22.400000
std     1.303840  0.277489     5.594640
min    20.000000  3.200000    15.000000
25%    20.000000  3.500000    20.000000
50%    21.000000  3.700000    22.000000
75%    22.000000  3.800000    25.000000
max    23.000000  3.900000    30.000000

🎓 Average GPA by Major:
major
Economics            3.50
Political Science    3.90
Sociology            3.60
Name: gpa, dtype: float64

Code Explanation

1. Create Data (Dictionary)

python
data = {
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'age': [20, 22, 21, 23, 20]
}

Understanding:

  • {} represents a dictionary
  • 'name': [...] represents key-value pairs
  • Similar to R's list(name = c("Alice", "Bob", ...))
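To make the key-value idea concrete, here is a minimal sketch using a shortened version of the `data` dictionary. The variable names `columns` and `ages` are only for illustration.

```python
# A dictionary maps keys (column names) to values (lists of observations)
data = {
    'name': ['Alice', 'Bob'],
    'age': [20, 22]
}

columns = list(data.keys())   # the keys, in insertion order
ages = data['age']            # look up a value by its key

print(columns)
print(ages)
print(ages[0])                # lists are indexed from 0
```

Each key later becomes a column name, and each list becomes that column's values.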

2. Create DataFrame

python
import pandas as pd
df = pd.DataFrame(data)

Understanding:

  • import pandas as pd: Import the pandas library under the alias pd
  • pd.DataFrame(): Create a DataFrame (similar to Stata's dataset or R's data.frame)
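Once a DataFrame exists, a few attributes help confirm it was built as intended. The sketch below uses a small three-row frame made up for illustration; `n_rows`, `n_cols`, `column_names`, and `gpa_dtype` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'gpa': [3.8, 3.5, 3.9]
})

n_rows, n_cols = df.shape          # (rows, columns)
column_names = list(df.columns)    # dictionary keys become column names
gpa_dtype = str(df['gpa'].dtype)   # each column has a single data type

print(n_rows, n_cols)
print(column_names)
print(gpa_dtype)
```

This is roughly what `describe` in Stata or `str(df)` in R is used for.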

3. View Data

python
print(df)

Comparison:

  • Stata: browse or list
  • R: print(df) or just df
  • Python: print(df) or df (in Jupyter)

4. Descriptive Statistics

python
df[['age', 'gpa', 'study_hours']].describe()

Comparison:

  • Stata: summarize age gpa study_hours
  • R: summary(df[c("age", "gpa", "study_hours")])
  • Python: df[['age', 'gpa', 'study_hours']].describe()
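One detail matters when comparing numbers across tools: the `std` row of `describe()` is the sample standard deviation (dividing by n - 1), the same convention as Stata's `summarize` and R's `sd()`. A minimal sketch that checks this by hand on the GPA values from the example:

```python
import math
import pandas as pd

gpa = pd.Series([3.8, 3.5, 3.9, 3.2, 3.7])

# Manual sample standard deviation: divide by n - 1
mean = gpa.sum() / len(gpa)
manual_std = math.sqrt(((gpa - mean) ** 2).sum() / (len(gpa) - 1))

pandas_std = gpa.describe()['std']   # same value as gpa.std()

print(f"manual: {manual_std:.6f}")
print(f"pandas: {pandas_std:.6f}")
```

The two printed values should agree. To divide by n instead, `gpa.std(ddof=0)` can be used.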

5. Group Statistics

python
df.groupby('major')['gpa'].mean()

Comparison:

  • Stata: tabstat gpa, by(major)
  • R: aggregate(gpa ~ major, data=df, FUN=mean)
  • Python: df.groupby('major')['gpa'].mean()
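When more than one statistic per group is needed (as with Stata's `tabstat gpa, by(major) stat(n mean max)`), `agg` can compute several at once. A minimal sketch on the student data; the output column names `n`, `mean_gpa`, and `max_gpa` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'major': ['Economics', 'Sociology', 'Political Science', 'Economics', 'Sociology'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7]
})

# Named aggregation: output_column=(input_column, function)
by_major = df.groupby('major').agg(
    n=('gpa', 'count'),
    mean_gpa=('gpa', 'mean'),
    max_gpa=('gpa', 'max')
)
print(by_major)
```

The result is itself a DataFrame indexed by major, so it can be filtered, sorted, or exported like any other.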

Visualization Example

Scatter Plot: GPA vs Study Hours

python
import matplotlib.pyplot as plt

plt.scatter(df['study_hours'], df['gpa'])
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours')
plt.show()

Compare to Stata:

stata
twoway scatter gpa study_hours, title("GPA vs Study Hours")

Compare to R:

r
plot(df$study_hours, df$gpa,
     xlab="Study Hours", ylab="GPA",
     main="GPA vs Study Hours")

Advanced Example: Adding Regression Line

python
import numpy as np
from scipy import stats

# Calculate regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df['study_hours'], df['gpa'])
line = slope * df['study_hours'] + intercept

# Plot
plt.scatter(df['study_hours'], df['gpa'], label='Actual Data')
plt.plot(df['study_hours'], line, color='red', label=f'Regression Line (R²={r_value**2:.3f})')
plt.xlabel('Study Hours per Week')
plt.ylabel('GPA')
plt.title('GPA vs Study Hours with Regression Line')
plt.legend()
plt.show()

print(f"📊 Regression Results: GPA = {intercept:.3f} + {slope:.3f} * Study Hours")
print(f"   R² = {r_value**2:.3f}, p-value = {p_value:.4f}")

Output:

📊 Regression Results: GPA = 2.554 + 0.048 * Study Hours
   R² = 0.921, p-value = 0.0096
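For intuition about where the slope and intercept come from, the simple-regression formulas can be applied by hand: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The sketch below uses NumPy on the same five students and cross-checks against `np.polyfit` (used here instead of `scipy` only to keep the example light on dependencies).

```python
import numpy as np

study_hours = np.array([25, 20, 30, 15, 22], dtype=float)
gpa = np.array([3.8, 3.5, 3.9, 3.2, 3.7])

# Deviations from the means
dx = study_hours - study_hours.mean()
dy = gpa - gpa.mean()

slope = (dx * dy).sum() / (dx ** 2).sum()             # cov(x, y) / var(x)
intercept = gpa.mean() - slope * study_hours.mean()   # line passes through the means
r_squared = (dx * dy).sum() ** 2 / ((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against NumPy's least-squares fit of a degree-1 polynomial
fit_slope, fit_intercept = np.polyfit(study_hours, gpa, 1)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R²={r_squared:.3f}")
```

With only five observations the estimates are very imprecise, so the result is best read as a mechanical illustration rather than a finding.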

Complete Data Analysis Template

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# ========== 1. Data Loading ==========
# Method 1: Create from dictionary
data = {
    'variable1': [1, 2, 3, 4, 5],
    'variable2': [10, 20, 30, 40, 50],
    'category': ['A', 'B', 'A', 'B', 'A']  # grouping variable used in step 4
}
df = pd.DataFrame(data)

# Method 2: Load from CSV file (more common)
# df = pd.read_csv('data.csv')

# ========== 2. Data Cleaning ==========
df = df.dropna()  # Drop missing values
df = df[df['variable1'] > 0]  # Filter conditions

# ========== 3. Create New Variables ==========
df['log_var1'] = np.log(df['variable1'])
df['var1_squared'] = df['variable1'] ** 2

# ========== 4. Descriptive Statistics ==========
print(df.describe())
print(df.groupby('category')['variable1'].mean())

# ========== 5. Visualization ==========
plt.hist(df['variable1'], bins=10)
plt.title('Distribution of Variable 1')
plt.show()

# ========== 6. Statistical Analysis ==========
# Correlation coefficient
correlation = df['variable1'].corr(df['variable2'])
print(f"Correlation: {correlation:.3f}")

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['variable1'], df['variable2'])
print(f"Regression: y = {intercept:.2f} + {slope:.2f}x, R² = {r_value**2:.3f}")

# ========== 7. Save Results ==========
df.to_csv('output.csv', index=False)

Advanced Case: From Simulated Data to Publication-Quality Analysis

Case: Analyzing Income Inequality with Simulated Data

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Generate simulated income distribution data (mimicking CPS data)
np.random.seed(42)
n = 5000

# Generate income by education level
education_levels = ['High School', 'Bachelor', 'Master', 'PhD']
education_weights = [0.4, 0.35, 0.20, 0.05]

data = {
    'person_id': range(1, n+1),
    'age': np.random.randint(22, 65, n),
    'education': np.random.choice(education_levels, n, p=education_weights),
    'experience': np.random.randint(0, 40, n),
    'female': np.random.choice([0, 1], n),
    'urban': np.random.choice([0, 1], n, p=[0.3, 0.7])
}

df = pd.DataFrame(data)

# Generate income based on characteristics (log-normal distribution)
education_premium = df['education'].map({
    'High School': 0,
    'Bachelor': 0.3,
    'Master': 0.5,
    'PhD': 0.7
})

log_income = (10.5 +
              education_premium +
              0.03 * df['age'] -
              0.0004 * df['age']**2 +
              0.02 * df['experience'] -
              0.15 * df['female'] +
              0.10 * df['urban'] +
              np.random.normal(0, 0.3, n))

df['income'] = np.exp(log_income)

# ========== 1. Data Quality Check ==========
print("📊 Data Quality Report")
print("=" * 50)
print(f"Total sample size: {len(df)}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"\nIncome distribution:")
print(df['income'].describe())

# ========== 2. Descriptive Statistics ==========
print("\n📈 Income Statistics by Education Level")
print("=" * 50)
summary = df.groupby('education')['income'].agg([
    ('Count', 'count'),
    ('Mean', 'mean'),
    ('Median', 'median'),
    ('Std', 'std'),
    ('P25', lambda x: x.quantile(0.25)),
    ('P75', lambda x: x.quantile(0.75))
]).round(0)
print(summary)

# ========== 3. Inequality Indicators ==========
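def gini_coefficient(x):
    """Calculate Gini coefficient (0 = perfect equality)"""
    x = np.sort(x)  # ascending order
    n = len(x)
    # The weights (n - i + 0.5) are largest for the lowest incomes;
    # subtracting the weighted share from 1 gives a value in [0, 1)
    return 1 - (2 * np.sum((n - np.arange(1, n+1) + 0.5) * x)) / (n * np.sum(x))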

gini = gini_coefficient(df['income'])
print(f"\n📊 Income Gini Coefficient: {gini:.3f}")

# Calculate income ratio between education groups
mean_income = df.groupby('education')['income'].mean()
college_premium = (mean_income['Bachelor'] / mean_income['High School'] - 1) * 100
print(f"College Premium (Bachelor vs High School): {college_premium:.1f}%")

# ========== 4. Regression Analysis ==========
import statsmodels.formula.api as smf

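# OLS regression
# Order the education levels so that 'High School' is the reference category;
# with plain strings the alphabetically first level ('Bachelor') would be dropped
reg_df = df.assign(education=pd.Categorical(df['education'], categories=education_levels))
model = smf.ols('np.log(income) ~ C(education) + age + I(age**2) + experience + female + urban',
                data=reg_df).fit()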

print("\n📊 Regression Analysis Results")
print("=" * 50)
print(model.summary().tables[1])

# Extract key coefficients
edu_coef = model.params['C(education)[T.Bachelor]']
female_coef = model.params['female']

print(f"\nKey Findings:")
print(f"- College education increases income by {(np.exp(edu_coef)-1)*100:.1f}%")
print(f"- Gender wage gap: {abs(female_coef)*100:.1f}% (log points)")

# ========== 5. Data Visualization ==========
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Subplot 1: Income distribution (log scale)
axes[0, 0].hist(np.log(df['income']), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Log(Income)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Income Distribution (Log Scale)')

# Subplot 2: Income boxplot by education
df.boxplot(column='income', by='education', ax=axes[0, 1])
axes[0, 1].set_ylabel('Income ($)')
axes[0, 1].set_title('Income by Education Level')
axes[0, 1].get_figure().suptitle('')  # Remove default title

# Subplot 3: Age-income relationship
sns.scatterplot(data=df.sample(500), x='age', y='income',
                hue='education', alpha=0.6, ax=axes[1, 0])
axes[1, 0].set_ylabel('Income ($)')
axes[1, 0].set_title('Income vs Age by Education')

# Subplot 4: Gender wage gap
gender_income = df.groupby(['education', 'female'])['income'].mean().unstack()
gender_income.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_ylabel('Mean Income ($)')
axes[1, 1].set_title('Gender Pay Gap by Education')
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=45)
axes[1, 1].legend(['Male', 'Female'])

plt.tight_layout()
plt.savefig('income_analysis.png', dpi=300, bbox_inches='tight')
print("\n✅ Chart saved as 'income_analysis.png'")

# ========== 6. Export Results ==========
# Export regression results table
with open('regression_results.txt', 'w') as f:
    f.write(str(model.summary()))

# Export descriptive statistics
summary.to_csv('descriptive_stats.csv')

print("\n✅ Analysis complete! Generated files:")
print("  - regression_results.txt")
print("  - descriptive_stats.csv")
print("  - income_analysis.png")

What does this case demonstrate?

  1. Data Quality Check: Report sample size, missing values, and the income distribution before any analysis
  2. Descriptive Statistics: Calculate mean, median, quantiles by group
  3. Inequality Indicators: Calculate Gini coefficient (in Stata requires ineqdeco installation)
  4. Regression Analysis: Includes quadratic terms and categorical variables
  5. Publication-Quality Visualization: 4 subplots, 300 DPI output
  6. Results Export: Ready for use in papers
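A hand-written inequality measure is worth checking against cases where the answer is known in advance. The sketch below is a self-contained version of the Gini coefficient, written with the standard rank-based formula rather than the exact expression in the case code, and tested on two extreme distributions. The function name `gini` and the toy income lists are illustrative.

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    x = np.sort(np.asarray(values, dtype=float))   # ascending order
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Rank-based formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

equal = gini([100, 100, 100, 100])   # everyone earns the same
unequal = gini([0, 0, 0, 100])       # one person earns everything

print(f"equal incomes:   {equal:.2f}")
print(f"one earner only: {unequal:.2f}")
```

Equal incomes should give 0, and a single earner among n people should give (n - 1) / n, which is 0.75 here. A negative result on real data would point to a sign or sorting error.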

Practical Exercises

Exercise 1: Modify Data

Try modifying the student data above, adding a new student:

  • Name: Frank
  • Age: 24
  • Major: Economics
  • GPA: 3.6
  • Study Hours: 28

Hint: Use pd.concat() or df.loc[]

Answer:
python
# Method 1: Using pd.concat
new_student = pd.DataFrame({
    'name': ['Frank'],
    'age': [24],
    'major': ['Economics'],
    'gpa': [3.6],
    'study_hours': [28]
})
df = pd.concat([df, new_student], ignore_index=True)

# Method 2: Using loc (an alternative to Method 1; run only one of the two, or Frank is added twice)
df.loc[len(df)] = ['Frank', 24, 'Economics', 3.6, 28]

Exercise 2: New Analysis

Calculate:

  1. Average study hours by major
  2. Which students have GPA above 3.6?
  3. Correlation coefficient between age and GPA

Answer:
python
# 1. Average study hours by major
print(df.groupby('major')['study_hours'].mean())

# 2. Students with GPA above 3.6
high_gpa = df[df['gpa'] > 3.6]
print(high_gpa[['name', 'gpa']])

# 3. Correlation between age and GPA
correlation = df['age'].corr(df['gpa'])
print(f"Correlation: {correlation:.3f}")

Exercise 3: Replicate Stata's tabstat

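Use Python to replicate Stata's tabstat income age, by(female) stat(mean sd min max n), using the simulated income data from the advanced case.

Answer:
python
# tabstat works on numeric variables, so summarize income and age
# (education is a text column in the simulated data and has no mean)
result = df.groupby('female').agg({
    'income': ['mean', 'std', 'min', 'max', 'count'],
    'age': ['mean', 'std', 'min', 'max', 'count']
})
print(result)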

Key Takeaways

Python Programming Philosophy

  1. Python's core is objects: df is an object, .describe() is its method
  2. Method chaining: df.groupby('major')['gpa'].mean() is method chaining
  3. Import libraries: import pandas as pd is standard practice
  4. Indexing: df['column'] or df[['col1', 'col2']]
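A short sketch that combines these points: boolean indexing, column selection, and a method chain on the student data. The name `top_students` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Emma'],
    'gpa': [3.8, 3.5, 3.9, 3.2, 3.7],
    'study_hours': [25, 20, 30, 15, 22]
})

# Each step returns a new object, so the calls can be chained
top_students = (
    df[df['gpa'] > 3.6]                     # keep rows with GPA above 3.6
    .sort_values('gpa', ascending=False)    # highest GPA first
    .loc[:, ['name', 'gpa']]                # keep two columns
    .reset_index(drop=True)                 # renumber rows from 0
)
print(top_students)
```

Wrapping the chain in parentheses allows one step per line, which reads much like a `%>%` pipeline in R.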

Python vs Stata/R: Mindset Comparison

Aspect            Stata               R                                 Python
DataFrame         One global dataset  Multiple, access columns with $   Multiple, access columns with []
Function Call     command varlist     function(data$var)                df['var'].method()
Assignment        gen, replace        <- or =                           =
Pipe Operations   Not supported       %>% (dplyr)                       . (method chain)
Vectorization     Automatic           Automatic                         Need NumPy
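On the vectorization row: plain Python lists are not vectorized, but pandas columns are built on NumPy arrays, so arithmetic and NumPy functions apply to a whole column at once. A minimal sketch, with an explicit loop shown only for comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'study_hours': [25, 20, 30, 15, 22]})

# Vectorized: one expression applied to the whole column
df['hours_per_day'] = df['study_hours'] / 7
df['log_hours'] = np.log(df['study_hours'])

# Equivalent explicit loop over the values
loop_result = [h / 7 for h in df['study_hours']]

print(df)
```

The vectorized form corresponds to `gen hours_per_day = study_hours / 7` in Stata, and is usually both shorter and faster than the loop.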

Best Practices

  1. Code Organization: Use comments to separate modules (# ========== 1. Data Loading ==========)
  2. Variable Naming: Use meaningful names (df_clean not df2)
  3. Error Handling: Develop the habit of checking data quality (missing values, outliers)
  4. Reproducibility: Set random seed (np.random.seed(42))
  5. Performance Optimization: For big data use pd.read_csv('data.csv', chunksize=1000) to read the file in chunks
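To illustrate the chunked-reading tip: with `chunksize`, `pd.read_csv` returns an iterator of DataFrames instead of one large DataFrame, so statistics can be accumulated piece by piece. The sketch below uses a tiny in-memory CSV as a stand-in for a large file; the contents and the chunk size of 2 are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a large file on disk (hypothetical contents)
csv_text = "income\n100\n200\n300\n400\n500\n"

total = 0
n_rows = 0
# Each chunk is a DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['income'].sum()
    n_rows += len(chunk)

mean_income = total / n_rows
print(f"rows={n_rows}, mean income={mean_income:.1f}")
```

Only one chunk is held in memory at a time, which is the point of the technique. With a real file, the `io.StringIO(...)` argument would be replaced by the file path.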

From First Program to Production-Level Code

Beginner Version

python
import pandas as pd
df = pd.read_csv("data.csv")
print(df.mean(numeric_only=True))  # numeric_only skips text columns

Production Version

python
import pandas as pd
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_and_validate_data(filepath):
    """
    Load and validate data

    Parameters:
    -----------
    filepath : str
        Data file path

    Returns:
    --------
    pd.DataFrame
        Cleaned DataFrame
    """
    try:
        df = pd.read_csv(filepath)
        logger.info(f"Successfully loaded {len(df)} rows")

        # Data validation
        required_columns = ['income', 'education', 'age']
        missing_cols = set(required_columns) - set(df.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")

        # Drop missing values
        initial_rows = len(df)
        df = df.dropna(subset=required_columns)
        dropped_rows = initial_rows - len(df)
        if dropped_rows > 0:
            logger.warning(f"Dropped {dropped_rows} rows with missing data")

        return df

    except FileNotFoundError:
        logger.error(f"File not found: {filepath}")
        raise
    except Exception as e:
        logger.error(f"Error loading data: {str(e)}")
        raise

# Use function
df = load_and_validate_data("data.csv")

Differences:

  • Error handling (try-except)
  • Docstrings
  • Logging
  • Data validation
  • Function encapsulation
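The validation step can be tried without any file on disk. The sketch below isolates the column check in a small helper; `find_missing_columns` is a hypothetical name introduced here, not part of the code above or of pandas.

```python
import pandas as pd

def find_missing_columns(df, required_columns):
    """Return the required column names that df lacks (hypothetical helper)."""
    return sorted(set(required_columns) - set(df.columns))

# A frame that lacks the 'education' column
df = pd.DataFrame({'income': [50000, 62000], 'age': [30, 41]})

missing = find_missing_columns(df, ['income', 'education', 'age'])
print(missing)
```

Keeping the check in its own function makes it easy to test with small hand-made frames before running it on real data.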

Next Steps

Congratulations on completing your first Python program! In the next module, we will:

  1. Learn how to configure a Python development environment
  2. Learn how to use Jupyter Notebook
  3. Set up Python in VS Code

You have mastered:

  • Basic Python syntax
  • Pandas DataFrame concepts
  • Descriptive statistics and groupby operations
  • Mental comparison with Stata/R

Ready for the next stage?

Released under the MIT License. Content © Author.