
4.1 Chapter Introduction (Python Statistical Toolkit Landscape)

From Descriptive Statistics to Causal Inference: Mastering the Python Statistical Ecosystem



Why Master Multiple Statistical Packages?

The Stata User's Confusion

Stata Users:

```stata
* Everything is simple in Stata
regress wage education experience
```

Questions After Switching to Python:

  • Why are there so many packages? (statsmodels, scipy, linearmodels...)
  • Which package should I use? When should I use which?
  • Why do multiple implementations exist for the same functionality?

Answer: Python is an ecosystem, not a single monolithic program


Python Statistical Ecosystem Landscape

Core Statistical Package Comparison

| Package | Positioning | Core Functionality | Use Cases |
| --- | --- | --- | --- |
| statsmodels | Statistical modeling | OLS, GLM, time series, diagnostics | Classical statistical analysis, publication-quality output |
| scipy.stats | Scientific computing | Probability distributions, hypothesis testing, descriptive statistics | Rapid statistical tests, univariate analysis |
| linearmodels | Econometrics | Panel data, instrumental variables, GMM | Panel regression, endogeneity treatment |
| pingouin | User-friendly statistics | t-tests, ANOVA, correlation, power analysis | Rapid statistics, readable output |
| scikit-learn | Machine learning | Predictive models, feature engineering, validation | Prediction tasks, machine learning |
| PyMC | Bayesian inference | MCMC, Bayesian models | Bayesian statistics, uncertainty quantification |

Stata vs Python: Paradigm Differences

| Dimension | Stata | Python |
| --- | --- | --- |
| Philosophy | Integrated software | Modular ecosystem |
| Regression | `regress y x1 x2` | `sm.OLS(y, X).fit()` |
| Output | Automatic display | Requires a `.summary()` call |
| Extension | Limited (ado files) | Unlimited (open-source packages) |
| Learning curve | Gentle | Steep but more flexible |
| Cost | Commercial software (expensive) | Completely free |
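
To make the "Regression" and "Output" rows concrete, here is a minimal sketch of translating `regress wage education experience` into statsmodels. Two differences trip up Stata users: the array interface does not add an intercept for you, and nothing is printed until you request the summary (the DataFrame `df` and its column names are illustrative):

```python
import statsmodels.api as sm

# Stata's regress adds the constant implicitly; statsmodels' array interface does not
X = sm.add_constant(df[['education', 'experience']])
results = sm.OLS(df['wage'], X).fit()

print(results.summary())  # output appears only when explicitly requested
```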

Learning Roadmap

Section 1: Statsmodels — The Foundation of Python Statistical Analysis

Core Position: Python's equivalent to Stata

Main Functionality:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 1. OLS Regression (X should already contain a constant; add one with sm.add_constant if needed)
model = sm.OLS(y, X).fit()
print(model.summary())  # Stata-style output

# 2. Formula Interface (R-style; the intercept is added automatically)
model = smf.ols('wage ~ education + experience + C(region)', data=df).fit()

# 3. Generalized Linear Models (GLM)
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# 4. Time Series
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(df['sales'], order=(1, 1, 1)).fit()

# 5. Model Diagnostics (run on an OLS fit, not the ARIMA results)
from statsmodels.stats.diagnostic import het_breuschpagan
ols_results = sm.OLS(y, X).fit()
bp_test = het_breuschpagan(ols_results.resid, ols_results.model.exog)
```

Output Characteristics:

  • Publication-quality tables (similar to Stata)
  • Detailed diagnostic statistics
  • Multiple fit metrics: AIC, BIC, R², etc.
  • Heteroskedasticity-robust standard errors
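
The robust standard errors listed above are requested at fit time rather than with a separate command. A minimal sketch, assuming the same `y`, `X`, `df`, and `sm` import as in the snippet above (`firm_id` is a hypothetical clustering variable):

```python
# 'HC1' roughly corresponds to Stata's `, robust`; 'HC3' is a common small-sample alternative
robust = sm.OLS(y, X).fit(cov_type='HC1')
print(robust.summary())

# Cluster-robust standard errors, clustered on a hypothetical firm identifier
clustered = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': df['firm_id']})
```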

Section 2: SciPy.stats — Rapid Statistical Testing

Core Position: The Swiss Army Knife of Statistical Inference

Main Functionality:

```python
from scipy import stats

# 1. t-test
t_stat, p_value = stats.ttest_ind(group1, group2)

# 2. Chi-square Test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# 3. Normality Test
statistic, p_value = stats.shapiro(data)

# 4. Correlation
corr, p_value = stats.pearsonr(x, y)

# 5. Distribution Fitting (returns the estimated parameters, e.g. (loc, scale) for the normal)
params = stats.norm.fit(data)
```

Applicable Scenarios:

  • Rapid hypothesis testing
  • Univariate analysis
  • Probability distribution operations
  • When complex output tables are not needed
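
The probability distribution operations mentioned above use scipy.stats distribution objects, which share a common pdf/cdf/ppf/rvs interface. A minimal sketch:

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)               # "frozen" standard normal distribution
density = dist.pdf(1.96)                        # density at x = 1.96
prob = dist.cdf(1.96)                           # P(X <= 1.96), about 0.975
quantile = dist.ppf(0.975)                      # inverse CDF, about 1.96
samples = dist.rvs(size=1000, random_state=42)  # random draws
```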

Section 3: LinearModels — Professional Econometrics Tool

Core Position: First Choice for Panel Data and Instrumental Variables

Main Functionality:

```python
from linearmodels.panel import PanelOLS, RandomEffects
from linearmodels.iv import IV2SLS

# 1. Panel Data (Fixed Effects)
# Note: df must be indexed by an (entity, time) MultiIndex
model = PanelOLS(
    dependent=df['wage'],
    exog=df[['education', 'experience']],
    entity_effects=True,
    time_effects=True
).fit(cov_type='clustered', cluster_entity=True)

# 2. Instrumental Variables (2SLS)
model = IV2SLS(
    dependent=df['wage'],
    exog=df[['education']],
    endog=df[['ability']],
    instruments=df[['father_education']]
).fit(cov_type='robust')

# 3. System estimation (SUR); GMM-based IV estimators are also available in linearmodels.iv (e.g. IVGMM)
from linearmodels.system import SUR
model = SUR(...).fit()
```

Advantages:

  • Designed specifically for panel data
  • Cluster-robust standard errors
  • Instrumental variable diagnostics (weak instrument tests)
  • System GMM support
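
As a sketch of the instrumental-variable diagnostics mentioned above, assuming the IV2SLS results from the snippet above are stored in a variable named `iv_results` (a hypothetical name): linearmodels exposes first-stage output and exogeneity tests directly on the results object.

```python
# First-stage regressions, including instrument-strength statistics
print(iv_results.first_stage)

# Tests of endogeneity for the instrumented regressor(s)
print(iv_results.durbin())
print(iv_results.wu_hausman())
```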

Section 4: Specialized Packages

Pingouin — User-Friendly Statistical Package

```python
import pingouin as pg

# 1. t-test (clearer output)
pg.ttest(group1, group2, correction=True)

# 2. ANOVA
pg.anova(data=df, dv='score', between='group')

# 3. Power Analysis
pg.power_ttest(d=0.5, n=50, alpha=0.05)

# 4. Post-hoc Tests
pg.pairwise_ttests(data=df, dv='score', between='group')
```

Features:

  • Output as DataFrame (easy to process)
  • Automatic effect size calculation (Cohen's d, η²)
  • Built-in visualization features
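
Because every pingouin routine returns a pandas DataFrame, results can be filtered and exported like any other table. A small sketch (the column names follow pingouin's conventions and may differ across versions):

```python
res = pg.ttest(group1, group2, correction=True)
print(res[['T', 'dof', 'p-val', 'cohen-d']])  # effect size is computed automatically
res.to_csv('ttest_result.csv')                # export like any DataFrame
```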

Statsmodels.formula.api — R-Style Formulas

```python
import statsmodels.formula.api as smf

# Formula interface (more intuitive)
model = smf.ols('log_wage ~ education + experience + I(experience**2) + C(region)',
                data=df).fit()

# Advantages:
# - Automatically adds intercept
# - Automatically handles categorical variables (C())
# - Supports transformations (I(), np.log())
# - Supports interactions (education:experience)
```

Package Selection Decision Tree

```
Start

├─ Simple Hypothesis Testing? (t-test, chi-square test)
│  └─ Yes → scipy.stats or pingouin

├─ OLS Regression?
│  ├─ Need detailed diagnostics → statsmodels.OLS
│  ├─ Rapid prototyping → statsmodels.formula.api
│  └─ Prediction-focused → scikit-learn

├─ Panel Data?
│  ├─ Fixed/Random Effects → linearmodels.PanelOLS
│  └─ Dynamic Panel → linearmodels (or Stata)

├─ Instrumental Variables?
│  └─ linearmodels.IV2SLS

├─ Time Series?
│  ├─ ARIMA/SARIMA → statsmodels.tsa
│  └─ Complex Forecasting → prophet, neuralprophet

├─ GLM (Binary, Count)?
│  └─ statsmodels.GLM

└─ Bayesian Inference?
   └─ PyMC, ArviZ
```

Installation Guide

Basic Installation

```bash
# Core statistical packages
pip install statsmodels scipy pandas

# Econometrics
pip install linearmodels

# User-friendly statistics
pip install pingouin

# Visualization
pip install matplotlib seaborn

# Complete data science stack (recommended)
conda install -c conda-forge statsmodels scipy pandas linearmodels pingouin
```

Version Requirements

```python
import statsmodels
import scipy
import linearmodels

print(f"statsmodels: {statsmodels.__version__}")    # Recommended >= 0.14
print(f"scipy: {scipy.__version__}")                # Recommended >= 1.10
print(f"linearmodels: {linearmodels.__version__}")  # Recommended >= 5.0
```

Learning Objectives

After completing this chapter, you will be able to:

| Capability Dimension | Specific Objectives |
| --- | --- |
| Tool awareness | Understand the overall architecture of Python's statistical ecosystem; know when to use which package |
| Statsmodels | Master OLS, GLM, and time series modeling; understand model diagnostics and robust standard errors; use the formula interface for rapid modeling |
| SciPy.stats | Rapidly conduct various hypothesis tests; handle probability distributions |
| LinearModels | Perform panel data regression (fixed effects, random effects); implement instrumental variable estimation (2SLS, GMM); compute cluster-robust standard errors |
| Comprehensive application | Complete the workflow from data to publication; output publication-quality regression tables |

Comparison with Stata/R

Stata → Python Mapping

| Stata Command | Python Equivalent | Package |
| --- | --- | --- |
| `regress y x1 x2` | `sm.OLS(y, X).fit()` | statsmodels |
| `logit y x1 x2` | `sm.Logit(y, X).fit()` | statsmodels |
| `xtreg y x, fe` | `PanelOLS(..., entity_effects=True).fit()` | linearmodels |
| `ivregress 2sls y (x1=z) x2` | `IV2SLS(...).fit()` | linearmodels |
| `arima y, ar(1) ma(1)` | `ARIMA(y, order=(1, 0, 1)).fit()` | statsmodels |
| `ttest x == 0` | `stats.ttest_1samp(x, 0)` | scipy.stats |

R → Python Mapping

| R Command | Python Equivalent | Package |
| --- | --- | --- |
| `lm(y ~ x1 + x2)` | `smf.ols('y ~ x1 + x2', df).fit()` | statsmodels.formula |
| `glm(y ~ x, family=binomial)` | `sm.GLM(y, X, family=sm.families.Binomial()).fit()` | statsmodels |
| `t.test(x, y)` | `stats.ttest_ind(x, y)` | scipy.stats |
| `cor.test(x, y)` | `stats.pearsonr(x, y)` | scipy.stats |
| `plm(y ~ x, effect='individual')` | `PanelOLS(..., entity_effects=True).fit()` | linearmodels |

Learning Recommendations

  1. Start with statsmodels: It's the foundation, similar to Stata
  2. Understand package positioning: Each package has a specific purpose
  3. Check official documentation: Python package documentation is very detailed
  4. Compare with Stata/R: Find familiar mapping relationships
  5. Practice first: Run example code for each package

DON'T (Avoid Pitfalls)

  1. Don't use only one package: Flexibly choose the most appropriate tool
  2. Don't memorize functions: Understanding package design philosophy is more important
  3. Don't ignore versions: Statistical packages update frequently, check version compatibility
  4. Don't blindly trust defaults: Check standard error, degrees of freedom settings
  5. Don't forget citations: Academic papers must cite packages and versions used

Official Documentation

| Package | Documentation Link |
| --- | --- |
| Statsmodels | https://www.statsmodels.org/ |
| SciPy | https://docs.scipy.org/doc/scipy/reference/stats.html |
| LinearModels | https://bashtage.github.io/linearmodels/ |
| Pingouin | https://pingouin-stats.org/ |

Books

  1. Seabold & Perktold (2010): "Statsmodels: Econometric and statistical modeling with Python"
  2. Wooldridge (2020): Introductory Econometrics (7th) - Python examples
  3. Bruce & Bruce (2020): Practical Statistics for Data Scientists (2nd)

Online Tutorials

  • QuantEcon: https://quantecon.org/ (Python tutorials for economists)
  • Python for Econometrics: Kevin Sheppard's lecture notes
  • Statsmodels Examples: Official example repository

Chapter Datasets

| Dataset | Description | Source | Purpose |
| --- | --- | --- | --- |
| wage_panel.csv | Panel wage data | Simulated | linearmodels examples |
| treatment_iv.csv | Instrumental variable data | Simulated | IV2SLS examples |
| time_series.csv | Macroeconomic time series | FRED | ARIMA examples |
| survey_data.csv | Cross-sectional survey | Simulated | statsmodels examples |
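
For the panel examples, wage_panel.csv will need the (entity, time) MultiIndex that linearmodels expects. A minimal loading sketch, with hypothetical column names:

```python
import pandas as pd

# 'person_id' and 'year' are assumed column names; adjust to the actual file
wage_panel = pd.read_csv('wage_panel.csv').set_index(['person_id', 'year'])
```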

Ready?

Python's statistical ecosystem is powerful and flexible. Mastering it will give you:

  • Greater extensibility than Stata
  • A completely free toolchain (Stata costs $1,000+)
  • Access to the world's largest data science community
  • A foundation for machine learning and causal inference

Note: This chapter is not "introductory" level; it requires:

  • Familiarity with basic Python syntax
  • Understanding of basic regression analysis concepts
  • Completion of Modules 1-3

Let's begin exploring the Python statistical universe!


Chapter File List

```
module-4_Core libraries/
├── 4.1-Chapter Introduction.md           # This file
├── 4.2-Statsmodels Essentials.md         # Statsmodels core functionality
├── 4.3-Scipy and Linearmodels.md         # SciPy statistical inference
└── 4.4-Integrated Workflow.md            # Data to publication workflow
```

Estimated Learning Time: 20-24 hours
Difficulty Level: ⭐⭐⭐⭐ (Requires statistical background)
Practicality: ⭐⭐⭐⭐⭐ (Core skill)


Next Section: 4.2 - Statsmodels Essentials

Begin your Python statistical journey!
