4.1 Chapter Introduction (Python Statistical Toolkit Landscape)

From Descriptive Statistics to Causal Inference: Mastering the Python Statistical Ecosystem

Why Master Multiple Statistical Packages?

The Stata User's Confusion

Stata Users:

stata

* Everything is simple in Stata
regress wage education experience

Questions After Switching to Python:

Why are there so many packages? (statsmodels, scipy, linearmodels...)
Which package should I use? When should I use which?
Why do multiple implementations exist for the same functionality?

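Answer: Python is an ecosystem, not a single monolithic program. Each package is maintained separately and targets one kind of task, so overlap between packages is normal.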

Python Statistical Ecosystem Landscape

Core Statistical Package Comparison

Package	Positioning	Core Functionality	Use Cases
statsmodels	Statistical Modeling	OLS, GLM, Time Series, Diagnostics	Classical Statistical Analysis, Publication-Quality Output
scipy.stats	Scientific Computing	Probability Distributions, Hypothesis Testing, Descriptive Statistics	Rapid Statistical Tests, Univariate Analysis
linearmodels	Econometrics	Panel Data, Instrumental Variables, GMM	Panel Regression, Endogeneity Treatment
pingouin	User-Friendly Statistics	t-tests, ANOVA, Correlation, Power Analysis	Rapid Statistics, Readable Output
scikit-learn	Machine Learning	Predictive Models, Feature Engineering, Validation	Prediction Tasks, Machine Learning
PyMC	Bayesian Inference	MCMC, Bayesian Models	Bayesian Statistics, Uncertainty Quantification

Stata vs Python: Paradigm Differences

Dimension	Stata	Python
Philosophy	Integrated Software	Modular Ecosystem
Regression	`regress y x1 x2`	`sm.OLS(y, sm.add_constant(X)).fit()`
Output	Automatic Display	Requires `.summary()` Call
Extension	User-written ado files (e.g. from SSC)	Open-source packages (PyPI, conda-forge)
Learning Curve	Gentle	Steep but More Flexible
Cost	Commercial Software (Paid License)	Free and Open Source

Learning Roadmap

Section 1: Statsmodels — The Foundation of Python Statistical Analysis

Core Position: The closest Python equivalent to Stata's core estimation commands

Main Functionality:

python

import statsmodels.api as sm
import statsmodels.formula.api as smf

# 1. OLS Regression (sm.OLS does not add an intercept, so add one explicitly)
X = sm.add_constant(X)
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())  # Stata-style output

# 2. Formula Interface (R-style)
formula_model = smf.ols('wage ~ education + experience + C(region)', data=df).fit()

# 3. Generalized Linear Models (GLM)
glm_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# 4. Time Series
from statsmodels.tsa.arima.model import ARIMA
arima_model = ARIMA(df['sales'], order=(1, 1, 1)).fit()

# 5. Model Diagnostics (Breusch-Pagan test on the OLS fit, not the ARIMA fit)
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(ols_model.resid, ols_model.model.exog)

Output Characteristics:

Publication-quality tables (similar to Stata)
Detailed diagnostic statistics
Multiple fit metrics: AIC, BIC, R², etc.
Heteroskedasticity-robust standard errors

Section 2: SciPy.stats — Rapid Statistical Testing

Core Position: The Swiss Army Knife of Statistical Inference

Main Functionality:

python

from scipy import stats

# 1. t-test
t_stat, p_value = stats.ttest_ind(group1, group2)

# 2. Chi-square Test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# 3. Normality Test
statistic, p_value = stats.shapiro(data)

# 4. Correlation
corr, p_value = stats.pearsonr(x, y)

# 5. Distribution Fitting (returns the fitted parameters: mean and standard deviation)
mu, sigma = stats.norm.fit(data)

Applicable Scenarios:

Rapid hypothesis testing
Univariate analysis
Probability distribution operations
When complex output tables are not needed
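The calls above assume that `group1`, `group2`, and `contingency_table` already exist. A self-contained sketch with simulated inputs might look like the following; the group means, sample sizes, and counts are invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group1 = rng.normal(loc=50, scale=10, size=200)
group2 = rng.normal(loc=55, scale=10, size=200)

# Welch's t-test (equal_var=False) does not assume equal group variances
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

# Chi-square test of independence on a 2x2 table of counts
table = np.array([[30, 70], [50, 50]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)

# Fit a normal distribution by maximum likelihood: returns (loc, scale)
mu_hat, sigma_hat = stats.norm.fit(group1)

# Use the fitted distribution, e.g. the probability of a value above 70
p_above_70 = stats.norm.sf(70, loc=mu_hat, scale=sigma_hat)

print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
print(f"chi2 = {chi2:.3f}, dof = {dof}")
print(f"fitted mean = {mu_hat:.2f}, fitted sd = {sigma_hat:.2f}")
```

Each function returns plain numbers or small result objects rather than a formatted table, which is what makes scipy.stats convenient for quick checks inside a larger script.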

Section 3: LinearModels — Professional Econometrics Tool

Core Position: First Choice for Panel Data and Instrumental Variables

Main Functionality:

python

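import statsmodels.api as sm
from linearmodels.panel import PanelOLS, RandomEffects
from linearmodels.iv import IV2SLS, IVGMM

# 1. Panel Data (Fixed Effects)
# linearmodels expects a two-level MultiIndex of (entity, time);
# 'person_id' and 'year' are placeholder column names
panel = df.set_index(['person_id', 'year'])
# Regressors must vary within entity: time-invariant variables
# are absorbed by the entity effects
model = PanelOLS(
    dependent=panel['wage'],
    exog=panel[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

# 2. Instrumental Variables (2SLS)
# education is treated as endogenous and instrumented by father_education;
# the constant must be included in exog explicitly
model = IV2SLS(
    dependent=df['wage'],
    exog=sm.add_constant(df[['experience']]),
    endog=df[['education']],
    instruments=df[['father_education']]
).fit(cov_type='robust')

# 3. GMM (IVGMM takes the same arguments as IV2SLS)
model = IVGMM(...).fit()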

Advantages:

Designed specifically for panel data
Cluster-robust standard errors
Instrumental variable diagnostics (weak instrument tests)
GMM estimators for IV and system-of-equations models (IVGMM, IVSystemGMM)
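To make `entity_effects=True` less of a black box, here is a hand-rolled sketch of the within (fixed effects) estimator using only pandas and statsmodels. It is not a description of how linearmodels is implemented internally, only the textbook transformation on simulated data; all variable names and parameter values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_entities, n_periods = 50, 6
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(0, 2, n_entities)[entity]          # unobserved entity effect
x = 0.5 * alpha + rng.normal(0, 1, entity.size)      # regressor correlated with it
y = 1.0 * x + alpha + rng.normal(0, 1, entity.size)  # true slope is 1.0
df = pd.DataFrame({"entity": entity, "x": x, "y": y})

# Pooled OLS ignores the entity effect and is biased upward here
pooled = smf.ols("y ~ x", data=df).fit()

# Within estimator: subtract each entity's mean, then regress without an intercept
demeaned = df[["x", "y"]] - df.groupby("entity")[["x", "y"]].transform("mean")
within = smf.ols("y ~ x - 1", data=demeaned).fit()

# Dummy-variable (LSDV) regression gives the same slope
lsdv = smf.ols("y ~ x + C(entity)", data=df).fit()

print("pooled:", pooled.params["x"])
print("within:", within.params["x"])
print("LSDV:  ", lsdv.params["x"])
```

The within and LSDV slopes agree, but the standard errors from the demeaned regression do not account for the estimated entity means, which is one reason to use a dedicated panel estimator in practice.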

Section 4: Specialized Packages

Pingouin — User-Friendly Statistical Package

python

import pingouin as pg

# 1. t-test (clearer output)
pg.ttest(group1, group2, correction=True)

# 2. ANOVA
pg.anova(data=df, dv='score', between='group')

# 3. Power Analysis
pg.power_ttest(d=0.5, n=50, alpha=0.05)

# 4. Post-hoc Tests (pairwise_ttests was renamed to pairwise_tests in recent pingouin releases)
pg.pairwise_tests(data=df, dv='score', between='group')

Features:

Output as DataFrame (easy to process)
Automatic effect size calculation (Cohen's d, η²)
Built-in visualization features

Statsmodels.formula.api — R-Style Formulas

python

import statsmodels.formula.api as smf

# Formula interface (more intuitive)
model = smf.ols('log_wage ~ education + experience + I(experience**2) + C(region)',
                data=df).fit()

# Advantages:
# - Automatically adds intercept
# - Automatically handles categorical variables (C())
# - Supports transformations (I(), np.log())
# - Supports interactions (education:experience)
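One way to see what the formula interface does with each term is to inspect the coefficient labels after fitting. The sketch below uses simulated data with hypothetical columns, and combines a log transformation, a squared term, an interaction, and a categorical variable in a single formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 400
df = pd.DataFrame({
    "education": rng.integers(8, 21, n).astype(float),
    "experience": rng.uniform(0, 30, n),
    "region": rng.choice(["north", "south", "west"], n),
})
# Simulated log-linear wage equation (coefficients chosen arbitrarily)
df["wage"] = np.exp(
    1.0 + 0.08 * df["education"] + 0.03 * df["experience"]
    - 0.0005 * df["experience"] ** 2 + rng.normal(0, 0.3, n)
)

res = smf.ols(
    "np.log(wage) ~ education + experience + I(experience**2)"
    " + education:experience + C(region)",
    data=df,
).fit()

# The coefficient labels show how each formula term was expanded
print(list(res.params.index))
```

The three-level `region` variable expands into two dummy columns, with the first level in alphabetical order used as the reference category.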

Package Selection Decision Tree

Start
│
├─ Simple Hypothesis Testing? (t-test, chi-square test)
│  └─ Yes → scipy.stats or pingouin
│
├─ OLS Regression?
│  ├─ Need detailed diagnostics → statsmodels.api.OLS
│  ├─ Rapid prototyping → statsmodels.formula.api
│  └─ Prediction-focused → scikit-learn
│
├─ Panel Data?
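│  ├─ Fixed Effects → linearmodels.panel.PanelOLS
│  ├─ Random Effects → linearmodels.panel.RandomEffects
│  └─ Dynamic Panel (Arellano-Bond type) → not built into linearmodels; use Stata (xtabond) or a dedicated package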
│
├─ Instrumental Variables?
│  └─ linearmodels.IV2SLS
│
├─ Time Series?
│  ├─ ARIMA/SARIMA → statsmodels.tsa
│  └─ Complex Forecasting → prophet, neuralprophet
│
├─ GLM (Binary, Count)?
│  └─ statsmodels.GLM
│
└─ Bayesian Inference?
   └─ PyMC, ArviZ

Installation Guide

Basic Installation

bash

# Core statistical packages
pip install statsmodels scipy pandas

# Econometrics
pip install linearmodels

# User-friendly statistics
pip install pingouin

# Visualization
pip install matplotlib seaborn

# Complete data science stack (recommended)
conda install -c conda-forge statsmodels scipy pandas linearmodels pingouin

Version Requirements

python

import statsmodels
import scipy
import linearmodels

print(f"statsmodels: {statsmodels.__version__}")  # Recommended >= 0.14
print(f"scipy: {scipy.__version__}")              # Recommended >= 1.10
print(f"linearmodels: {linearmodels.__version__}")  # Recommended >= 5.0
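The check above fails with an `ImportError` if any of the packages is missing. A slightly more defensive sketch, using only the standard library, reads the installed versions from package metadata and compares them with the recommended minimums. The helper names `major_minor` and `check_versions` are made up for this example.

```python
import re
from importlib import metadata

# Recommended minimum versions from this section, as (major, minor) tuples
RECOMMENDED = {"statsmodels": (0, 14), "scipy": (1, 10), "linearmodels": (5, 0)}


def major_minor(version):
    """Return the leading (major, minor) integers of a version string."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        return (0, 0)
    return (int(match.group(1)), int(match.group(2) or 0))


def check_versions(required=RECOMMENDED):
    """Map each package name to a short status string."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = major_minor(installed) >= minimum
        report[name] = f"{installed} ({'ok' if ok else 'older than recommended'})"
    return report


for name, status in check_versions().items():
    print(f"{name}: {status}")
```

Comparing only the major and minor components is a simplification; it ignores patch levels and pre-release tags.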

Learning Objectives

After completing this chapter, you will be able to:

Capability Dimension	Specific Objectives
Tool Awareness	Understand the overall architecture of Python's statistical ecosystem
	Know when to use which package
Statsmodels	Master OLS, GLM, time series modeling
	Understand model diagnostics and robust standard errors
	Use formula interface for rapid modeling
SciPy.stats	Rapidly conduct various hypothesis tests
	Handle probability distributions
LinearModels	Perform panel data regression (fixed effects, random effects)
	Implement instrumental variable estimation (2SLS, GMM)
	Compute cluster-robust standard errors
Comprehensive Application	Complete workflow from data to publication
	Output publication-quality regression tables

Comparison with Stata/R

Stata → Python Mapping

Stata Command	Python Equivalent	Package
`regress y x1 x2`	`sm.OLS(y, sm.add_constant(X)).fit()`	statsmodels
`logit y x1 x2`	`sm.Logit(y, sm.add_constant(X)).fit()`	statsmodels
`xtreg y x, fe`	`PanelOLS(..., entity_effects=True).fit()`	linearmodels
`ivregress 2sls y (x1=z) x2`	`IV2SLS(...).fit()`	linearmodels
`arima y, ar(1) ma(1)`	`ARIMA(y, order=(1,0,1)).fit()`	statsmodels
`ttest x == 0`	`stats.ttest_1samp(x, 0)`	scipy.stats

R → Python Mapping

R Command	Python Equivalent	Package
`lm(y ~ x1 + x2)`	`smf.ols('y ~ x1 + x2', df).fit()`	statsmodels.formula
`glm(y ~ x, family=binomial)`	`sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()`	statsmodels
`t.test(x, y)`	`stats.ttest_ind(x, y, equal_var=False)` (R defaults to Welch's test)	scipy.stats
`cor.test(x, y)`	`stats.pearsonr(x, y)`	scipy.stats
`plm(y ~ x, effect='individual')`	`PanelOLS(..., entity_effects=True).fit()`	linearmodels

Learning Recommendations

DO (Recommended Practices)

Start with statsmodels: It's the foundation, similar to Stata
Understand package positioning: Each package has a specific purpose
Check official documentation: Python package documentation is very detailed
Compare with Stata/R: Find familiar mapping relationships
Practice first: Run example code for each package

DON'T (Avoid Pitfalls)

Don't use only one package: Flexibly choose the most appropriate tool
Don't memorize functions: Understanding package design philosophy is more important
Don't ignore versions: Statistical packages update frequently, check version compatibility
Don't blindly trust defaults: Check the standard-error type and degrees-of-freedom settings
Don't forget citations: Academic papers must cite packages and versions used
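The point about defaults is easy to demonstrate. In the sketch below, NumPy and pandas return different standard deviations for the same data because they divide by different denominators, and `scipy.stats.ttest_ind` assumes equal variances unless told otherwise. The numbers are a small made-up sample.

```python
import numpy as np
import pandas as pd
from scipy import stats

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Standard deviation: NumPy divides by n, pandas divides by n - 1
sd_numpy = np.std(values)                 # ddof=0 by default
sd_pandas = pd.Series(values).std()       # ddof=1 by default (as in Stata's summarize)
sd_numpy_sample = np.std(values, ddof=1)  # matches pandas once ddof is set

# Two-sample t-test: pooled variance by default, Welch only on request
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
pooled = stats.ttest_ind(a, b)                  # equal_var=True by default
welch = stats.ttest_ind(a, b, equal_var=False)  # R's t.test default

print(sd_numpy, sd_pandas, sd_numpy_sample)
print(pooled.statistic, welch.statistic)
```

Neither default is wrong, but results will not match across tools unless these settings are aligned.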

Recommended Resources

Official Documentation

Package	Documentation Link
Statsmodels	https://www.statsmodels.org/
SciPy	https://docs.scipy.org/doc/scipy/reference/stats.html
LinearModels	https://bashtage.github.io/linearmodels/
Pingouin	https://pingouin-stats.org/

Books

Seabold & Perktold (2010): "Statsmodels: Econometric and statistical modeling with Python"
Wooldridge (2020): Introductory Econometrics (7th) - Python examples
Bruce & Bruce (2020): Practical Statistics for Data Scientists (2nd)

Online Tutorials

QuantEcon: https://quantecon.org/ (Python tutorials for economists)
Python for Econometrics: Kevin Sheppard's lecture notes
Statsmodels Examples: Official example repository

Chapter Datasets

Dataset	Description	Source	Purpose
wage_panel.csv	Panel wage data	Simulated	linearmodels examples
treatment_iv.csv	Instrumental variable data	Simulated	IV2SLS examples
time_series.csv	Macroeconomic time series	FRED	ARIMA examples
survey_data.csv	Cross-sectional survey	Simulated	statsmodels examples

Ready?

Python's statistical ecosystem is powerful and flexible. Mastering it will give you:

Greater extensibility than Stata
Free and open-source tools (Stata requires a paid license)
Integration into the world's largest data science community
Preparation for machine learning and causal inference

Note: This chapter is not "introductory" level; it requires:

Familiarity with basic Python syntax
Understanding of basic regression analysis concepts
Completion of Modules 1-3

Let's begin exploring the Python statistical universe!

Chapter File List

module-4_Core libraries/
├── 4.1-Chapter Introduction.md           # This file
├── 4.2-Statsmodels Essentials.md         # Statsmodels core functionality
├── 4.3-Scipy and Linearmodels.md         # SciPy statistical inference and linearmodels
└── 4.4-Integrated Workflow.md            # Data to publication workflow

Estimated Learning Time: 20-24 hours
Difficulty Level: ⭐⭐⭐⭐ (Requires statistical background)
Practicality: ⭐⭐⭐⭐⭐ (Core skill)

Next Section: 4.2 - Statsmodels Essentials

Begin your Python statistical journey!
