4.1 Chapter Introduction (Python Statistical Toolkit Landscape)
From Descriptive Statistics to Causal Inference: Mastering the Python Statistical Ecosystem
Why Master Multiple Statistical Packages?
The Stata User's Confusion
Stata Users:
* Everything is simple in Stata
regress wage education experience
Questions After Switching to Python:
- Why are there so many packages? (statsmodels, scipy, linearmodels...)
- Which package should I use? When should I use which?
- Why do multiple implementations exist for the same functionality?
Answer: Python is an ecosystem of independent packages, not a single monolithic program
Python Statistical Ecosystem Landscape
Core Statistical Package Comparison
| Package | Positioning | Core Functionality | Use Cases |
|---|---|---|---|
| statsmodels | Statistical Modeling | OLS, GLM, Time Series, Diagnostics | Classical Statistical Analysis, Publication-Quality Output |
| scipy.stats | Scientific Computing | Probability Distributions, Hypothesis Testing, Descriptive Statistics | Rapid Statistical Tests, Univariate Analysis |
| linearmodels | Econometrics | Panel Data, Instrumental Variables, GMM | Panel Regression, Endogeneity Treatment |
| pingouin | User-Friendly Statistics | t-tests, ANOVA, Correlation, Power Analysis | Rapid Statistics, Readable Output |
| scikit-learn | Machine Learning | Predictive Models, Feature Engineering, Validation | Prediction Tasks, Machine Learning |
| PyMC | Bayesian Inference | MCMC, Bayesian Models | Bayesian Statistics, Uncertainty Quantification |
Stata vs Python: Paradigm Differences
| Dimension | Stata | Python |
|---|---|---|
| Philosophy | Integrated Software | Modular Ecosystem |
| Regression | regress y x1 x2 | sm.OLS(y, sm.add_constant(X)).fit() |
| Output | Automatic Display | Requires .summary() Call |
| Extension | User-written ado files | Open-source packages (PyPI, conda-forge) |
| Learning Curve | Gentle | Steep but More Flexible |
| Cost | Commercial Software (Expensive) | Completely Free |
Learning Roadmap
Section 1: Statsmodels — The Foundation of Python Statistical Analysis
Core Position: The closest Python equivalent to Stata's built-in estimation commands
Main Functionality:
import statsmodels.api as sm
import statsmodels.formula.api as smf
# 1. OLS Regression (sm.OLS does not add an intercept automatically)
X = sm.add_constant(X)
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary()) # Stata-style output
# 2. Formula Interface (R-style; the intercept is added automatically)
formula_model = smf.ols('wage ~ education + experience + C(region)', data=df).fit()
# 3. Generalized Linear Models (GLM)
glm_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
# 4. Time Series
from statsmodels.tsa.arima.model import ARIMA
arima_model = ARIMA(df['sales'], order=(1, 1, 1)).fit()
# 5. Model Diagnostics (Breusch-Pagan test on the OLS residuals, not the ARIMA fit)
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_model.resid, ols_model.model.exog)
Output Characteristics:
- Publication-quality tables (similar to Stata)
- Detailed diagnostic statistics
- Multiple fit metrics: AIC, BIC, R², etc.
- Heteroskedasticity-robust standard errors
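
As a minimal sketch of the last two points, the example below fits the same formula twice on simulated data (all names and numbers are assumptions for illustration): once with classical standard errors and once with `cov_type="HC1"`. The coefficients are identical; only the standard errors change. The Breusch-Pagan test is run on the classical fit to check whether the robust version is needed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data; the error spread grows with education by construction.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "education": rng.uniform(8, 20, n),
    "experience": rng.uniform(0, 30, n),
})
noise = rng.normal(0, 1, n) * 0.5 * df["education"]
df["wage"] = 2.0 + 1.0 * df["education"] + 0.3 * df["experience"] + noise

formula = "wage ~ education + experience"
classical = smf.ols(formula, data=df).fit()
robust = smf.ols(formula, data=df).fit(cov_type="HC1")

# Breusch-Pagan: a small p-value suggests heteroskedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    classical.resid, classical.model.exog
)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.3g}")

# Same point estimates, different standard errors.
print(pd.DataFrame({"classical_se": classical.bse, "robust_se": robust.bse}))
```

Which `cov_type` to use (HC0 through HC3, clustered, HAC) depends on the data; HC1 is chosen here only because it is a common choice, not because the chapter prescribes it.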
Section 2: SciPy.stats — Rapid Statistical Testing
Core Position: The Swiss Army Knife of Statistical Inference
Main Functionality:
from scipy import stats
# 1. t-test
t_stat, p_value = stats.ttest_ind(group1, group2) # assumes equal variances; pass equal_var=False for Welch's t-test
# 2. Chi-square Test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
# 3. Normality Test
statistic, p_value = stats.shapiro(data)
# 4. Correlation
corr, p_value = stats.pearsonr(x, y)
# 5. Distribution Fitting
loc, scale = stats.norm.fit(data) # returns parameter estimates, not a distribution object
Applicable Scenarios:
- Rapid hypothesis testing
- Univariate analysis
- Probability distribution operations
- When complex output tables are not needed
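
Two details are easy to miss when using scipy.stats for quick tests. `ttest_ind` assumes equal variances unless told otherwise, and `norm.fit` returns parameter estimates rather than a ready-to-use distribution. The sketch below illustrates both on simulated groups; the sample sizes and parameters are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Two simulated groups with different variances and sizes.
rng = np.random.default_rng(2)
group1 = rng.normal(10.0, 1.0, 40)
group2 = rng.normal(11.0, 3.0, 60)

# Default: pooled-variance (Student) t-test.
pooled = stats.ttest_ind(group1, group2)
# Welch's t-test does not assume equal variances.
welch = stats.ttest_ind(group1, group2, equal_var=False)
print(f"pooled p = {pooled.pvalue:.4f}, Welch p = {welch.pvalue:.4f}")

# norm.fit returns maximum-likelihood estimates (loc, scale).
loc, scale = stats.norm.fit(group1)

# "Freeze" the fitted distribution to query probabilities and quantiles.
fitted = stats.norm(loc=loc, scale=scale)
print(fitted.cdf(10.0), fitted.ppf(0.975))
```

When group variances or sizes differ noticeably, the two tests can give different p-values, so it is worth stating which one was used.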
Section 3: LinearModels — Professional Econometrics Tool
Core Position: First Choice for Panel Data and Instrumental Variables
Main Functionality:
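from linearmodels.panel import PanelOLS, RandomEffects
from linearmodels.iv import IV2SLS
from linearmodels.system import SUR
import statsmodels.api as sm
# 1. Panel Data (Fixed Effects)
# df must carry a two-level (entity, time) MultiIndex before it is passed to PanelOLS.
# Regressors that never vary within an entity are absorbed by the entity effects.
panel_model = PanelOLS(
    dependent=df['wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)
# 2. Instrumental Variables (2SLS)
# education is treated as endogenous and instrumented by father_education.
# linearmodels does not add a constant automatically.
iv_model = IV2SLS(
    dependent=df['wage'],
    exog=sm.add_constant(df[['experience']]),
    endog=df[['education']],
    instruments=df[['father_education']]
).fit(cov_type='robust')
# 3. System Estimation (SUR)
# GMM estimators are available as IVGMM (linearmodels.iv) and IVSystemGMM (linearmodels.system).
system_model = SUR(...).fit()
Advantages: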
- Designed specifically for panel data
- Cluster-robust standard errors
- Instrumental variable diagnostics (weak instrument tests)
- GMM estimators for IV models and multi-equation systems (IVGMM, IVSystemGMM)
Section 4: Specialized Packages
Pingouin — User-Friendly Statistical Package
import pingouin as pg
# 1. t-test (clearer output)
pg.ttest(group1, group2, correction=True)
# 2. ANOVA
pg.anova(data=df, dv='score', between='group')
# 3. Power Analysis
pg.power_ttest(d=0.5, n=50, alpha=0.05)
# 4. Post-hoc Tests
pg.pairwise_tests(data=df, dv='score', between='group') # called pairwise_ttests in older pingouin releases
Features:
- Output as DataFrame (easy to process)
- Automatic effect size calculation (Cohen's d, η²)
- Built-in visualization features
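
For readers who want to know what an effect size such as Cohen's d measures, here is a minimal hand computation using the textbook pooled-standard-deviation formula. It uses NumPy only and is not pingouin's implementation, whose conventions may differ in detail; the helper name `cohens_d` is made up for this sketch.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

group1 = np.array([2.0, 4.0, 6.0])  # mean 4, sample variance 4
group2 = np.array([1.0, 3.0, 5.0])  # mean 3, sample variance 4
print(cohens_d(group1, group2))     # 0.5
```

A d of 0.5 means the group means differ by half a pooled standard deviation, which is the same effect size passed to `pg.power_ttest` in the example above.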
Statsmodels.formula.api — R-Style Formulas
import statsmodels.formula.api as smf
# Formula interface (more intuitive)
model = smf.ols('log_wage ~ education + experience + I(experience**2) + C(region)',
data=df).fit()
# Advantages:
# - Automatically adds intercept
# - Automatically handles categorical variables (C())
# - Supports transformations (I(), np.log())
# - Supports interactions (education:experience)
Package Selection Decision Tree
Start
│
├─ Simple Hypothesis Testing? (t-test, chi-square test)
│ └─ Yes → scipy.stats or pingouin
│
├─ OLS Regression?
│ ├─ Need detailed diagnostics → statsmodels.OLS
│ ├─ Rapid prototyping → statsmodels.formula.api
│ └─ Prediction-focused → scikit-learn
│
├─ Panel Data?
│ ├─ Fixed/Random Effects → linearmodels PanelOLS / RandomEffects
│ └─ Dynamic Panel → no built-in linearmodels estimator; use Stata (xtabond) or a dedicated package
│
├─ Instrumental Variables?
│ └─ linearmodels.IV2SLS
│
├─ Time Series?
│ ├─ ARIMA/SARIMA → statsmodels.tsa
│ └─ Complex Forecasting → prophet, neuralprophet
│
├─ GLM (Binary, Count)?
│ └─ statsmodels.GLM
│
└─ Bayesian Inference?
└─ PyMC, ArviZ
Installation Guide
Basic Installation
# Core statistical packages
pip install statsmodels scipy pandas
# Econometrics
pip install linearmodels
# User-friendly statistics
pip install pingouin
# Visualization
pip install matplotlib seaborn
# Complete data science stack (recommended)
conda install -c conda-forge statsmodels scipy pandas linearmodels pingouin
Version Requirements
import statsmodels
import scipy
import linearmodels
print(f"statsmodels: {statsmodels.__version__}") # Recommended >= 0.14
print(f"scipy: {scipy.__version__}") # Recommended >= 1.10
print(f"linearmodels: {linearmodels.__version__}") # Recommended >= 5.0
Learning Objectives
After completing this chapter, you will be able to:
| Capability Dimension | Specific Objectives |
|---|---|
| Tool Awareness | Understand the overall architecture of Python's statistical ecosystem |
| | Know when to use which package |
| Statsmodels | Master OLS, GLM, time series modeling |
| | Understand model diagnostics and robust standard errors |
| | Use formula interface for rapid modeling |
| SciPy.stats | Rapidly conduct various hypothesis tests |
| | Handle probability distributions |
| LinearModels | Perform panel data regression (fixed effects, random effects) |
| | Implement instrumental variable estimation (2SLS, GMM) |
| | Compute cluster-robust standard errors |
| Comprehensive Application | Complete workflow from data to publication |
| | Output publication-quality regression tables |
Comparison with Stata/R
Stata → Python Mapping
| Stata Command | Python Equivalent | Package |
|---|---|---|
| regress y x1 x2 | sm.OLS(y, sm.add_constant(X)).fit() | statsmodels |
| logit y x1 x2 | sm.Logit(y, sm.add_constant(X)).fit() | statsmodels |
| xtreg y x, fe | PanelOLS(..., entity_effects=True).fit() | linearmodels |
| ivregress 2sls y (x1=z) x2 | IV2SLS(...).fit() | linearmodels |
| arima y, ar(1) ma(1) | ARIMA(y, order=(1,0,1)).fit() | statsmodels |
| ttest x == 0 | stats.ttest_1samp(x, 0) | scipy.stats |
R → Python Mapping
| R Command | Python Equivalent | Package |
|---|---|---|
| lm(y ~ x1 + x2) | smf.ols('y ~ x1 + x2', df).fit() | statsmodels.formula |
| glm(y ~ x, family=binomial) | sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit() | statsmodels |
| t.test(x, y) | stats.ttest_ind(x, y, equal_var=False) (R defaults to Welch's test) | scipy.stats |
| cor.test(x, y) | stats.pearsonr(x, y) | scipy.stats |
| plm(y ~ x, effect='individual') | PanelOLS(..., entity_effects=True).fit() | linearmodels |
Learning Recommendations
DO (Recommended Practices)
- Start with statsmodels: It's the foundation, similar to Stata
- Understand package positioning: Each package has a specific purpose
- Check official documentation: Python package documentation is very detailed
- Compare with Stata/R: Find familiar mapping relationships
- Practice first: Run example code for each package
DON'T (Avoid Pitfalls)
- Don't use only one package: Flexibly choose the most appropriate tool
- Don't memorize functions: Understanding package design philosophy is more important
- Don't ignore versions: Statistical packages update frequently, check version compatibility
- Don't blindly trust defaults: Check which standard-error type and degrees-of-freedom corrections each function applies
- Don't forget citations: Academic papers must cite packages and versions used
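
One low-effort way to act on the version and citation advice is to record the installed versions alongside your results. The helper below is a sketch using only the standard library; the function name `package_versions` and the package list are illustrative choices, not part of any of the packages discussed.

```python
from importlib.metadata import PackageNotFoundError, version

def package_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = package_versions(["statsmodels", "scipy", "linearmodels", "pingouin"])
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Pasting this output into a paper's replication notes or a project README makes it clear which releases produced the reported numbers.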
Recommended Resources
Official Documentation
| Package | Documentation Link |
|---|---|
| Statsmodels | https://www.statsmodels.org/ |
| SciPy | https://docs.scipy.org/doc/scipy/reference/stats.html |
| LinearModels | https://bashtage.github.io/linearmodels/ |
| Pingouin | https://pingouin-stats.org/ |
Books
- Seabold & Perktold (2010): "Statsmodels: Econometric and statistical modeling with Python" (conference paper, Proceedings of the 9th Python in Science Conference)
- Wooldridge (2020): Introductory Econometrics (7th) - Python examples
- Bruce & Bruce (2020): Practical Statistics for Data Scientists (2nd)
Online Tutorials
- QuantEcon: https://quantecon.org/ (Python tutorials for economists)
- Python for Econometrics: Kevin Sheppard's lecture notes
- Statsmodels Examples: Official example repository
Chapter Datasets
| Dataset | Description | Source | Purpose |
|---|---|---|---|
| wage_panel.csv | Panel wage data | Simulated | linearmodels examples |
| treatment_iv.csv | Instrumental variable data | Simulated | IV2SLS examples |
| time_series.csv | Macroeconomic time series | FRED | ARIMA examples |
| survey_data.csv | Cross-sectional survey | Simulated | statsmodels examples |
Ready?
Python's statistical ecosystem is powerful and flexible. Mastering it will give you:
- Greater extensibility than Stata
- Free and open-source tooling (Stata requires a paid license)
- Integration into the world's largest data science community
- Preparation for machine learning and causal inference
Note: This chapter is not "introductory" level; it requires:
- Familiarity with basic Python syntax
- Understanding of basic regression analysis concepts
- Completion of Modules 1-3
Let's begin exploring the Python statistical universe!
Chapter File List
module-4_Core libraries/
├── 4.1-Chapter Introduction.md # This file
├── 4.2-Statsmodels Essentials.md # Statsmodels core functionality
├── 4.3-Scipy and Linearmodels.md # SciPy statistical inference
└── 4.4-Integrated Workflow.md # Data to publication workflow
Estimated Learning Time: 20-24 hours
Difficulty Level: ⭐⭐⭐⭐ (Requires statistical background)
Practicality: ⭐⭐⭐⭐⭐ (Core skill)
Next Section: 4.2 - Statsmodels Essentials
Begin your Python statistical journey!