Module 10: Machine Learning and LLM APIs

From Scikit-learn to Large Language Models — Exploring Python's Cutting-Edge Applications


Chapter Overview

This chapter introduces you to Python's most exciting domains: machine learning and Large Language Model (LLM) APIs. You'll learn to build predictive models using Scikit-learn, understand deep learning fundamentals, and master practical skills for calling LLM APIs from OpenAI, Anthropic, DeepSeek, and more.

Important Note: Although this chapter covers advanced topics, these techniques have tremendous potential for social science research (text analysis, content classification, data annotation, etc.). We recommend studying this module after completing Module 9.


Learning Objectives

After completing this chapter, you will be able to:

  • Understand basic machine learning concepts
  • Build predictive models using Scikit-learn
  • Grasp deep learning and neural network fundamentals
  • Call OpenAI GPT APIs for text analysis
  • Use LLMs for sentiment analysis and text classification
  • Process text data in batches
  • Apply LLM technologies in research

Chapter Contents

10.2 - Scikit-learn Basics

Core Question: How do we do machine learning in Python?

Core Content:

  • What is Machine Learning?
    • Supervised learning: Regression (predicting values), Classification (predicting categories)
    • Unsupervised learning: Clustering, Dimensionality reduction (see the clustering sketch after this list)
    • Scikit-learn: Python's most popular machine learning library
  • Linear Regression Example:
    python
    from sklearn.linear_model import LinearRegression
    
    # Prepare data
    X = df[['education', 'age']]  # Features
    y = df['income']  # Target variable
    
    # Train model
    model = LinearRegression()
    model.fit(X, y)
    
    # View coefficients
    print(model.coef_)  # [5000, 1200]
    print(model.intercept_)  # 20000
    
    # Predict
    predictions = model.predict([[16, 30]])  # education=16, age=30
  • Classification Model (Logistic Regression):
    python
    from sklearn.linear_model import LogisticRegression
    
    # Predict high income
    X = df[['education', 'age']]
    y = (df['income'] > 80000).astype(int)  # 0 or 1
    
    model = LogisticRegression()
    model.fit(X, y)
    
    # Predict probability
    proba = model.predict_proba([[16, 30]])[0, 1]  # P(high income)
  • Model Evaluation:
    python
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error
    
    # Split training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Train
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE; newer scikit-learn also provides root_mean_squared_error
  • Comparing with Stata:
    • Stata: reg income education age
    • Python: LinearRegression().fit(X, y)
    • Difference: Scikit-learn focuses on prediction, while Stata focuses on inference
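
For the unsupervised side mentioned above, here is a minimal clustering sketch. It assumes the same df with numeric education, age, and income columns used in the earlier examples, and the choice of k=3 clusters is purely illustrative:

python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize features so no single variable dominates the distance metric
X_scaled = StandardScaler().fit_transform(df[['education', 'age', 'income']])

# Group observations into 3 clusters (k=3 is an assumption; pick k via theory or the elbow method)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

print(df['cluster'].value_counts())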

Why It Matters:

  • Machine learning is a powerful tool for prediction and classification
  • Social science applications: Predicting behavior, text classification, recommendation systems
  • Complements traditional econometrics

10.3 - Deep Learning Introduction

Core Question: What is deep learning?

Core Content:

  • Deep Learning vs Traditional Machine Learning:
    • Traditional ML: Manually designed features
    • Deep learning: Automatically learns features (neural networks)
  • Neural Network Basics:
    • Layers: Input layer, hidden layers, output layer
    • Activation functions: ReLU, Sigmoid
    • Backpropagation: How networks are trained (see the training-loop sketch after this list)
  • PyTorch Introduction:
    python
    import torch
    import torch.nn as nn
    
    # Define simple neural network
    class SimpleNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(2, 10)  # Input layer → Hidden layer
            self.fc2 = nn.Linear(10, 1)  # Hidden layer → Output layer
    
        def forward(self, x):
            x = torch.relu(self.fc1(x))
            x = self.fc2(x)
            return x
  • Hugging Face Transformers:
    • Pre-trained models: BERT, GPT, LLaMA
    • Use cases: Text classification, sentiment analysis, translation
    python
    from transformers import pipeline
    
    # Sentiment analysis
    classifier = pipeline("sentiment-analysis")
    result = classifier("This product is amazing!")
    # [{'label': 'POSITIVE', 'score': 0.99}]
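
To make the backpropagation bullet concrete, here is a minimal training-loop sketch for the SimpleNN class defined above. The synthetic data, loss function, learning rate, and epoch count are illustrative assumptions, not a recommended setup:

python
import torch
import torch.nn as nn

model = SimpleNN()                      # the network defined above
criterion = nn.MSELoss()                # squared-error loss for a regression target
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Synthetic data: 100 samples, 2 features, 1 target
X = torch.randn(100, 2)
y = torch.randn(100, 1)

for epoch in range(100):
    optimizer.zero_grad()          # reset accumulated gradients
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # backpropagation: compute gradients
    optimizer.step()               # update weights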

Social Science Applications:

  • Text sentiment analysis
  • News topic classification
  • Social media content detection
  • Image recognition (protest crowd estimation)

Important Note: Deep learning requires GPUs and large datasets. Social science students typically use pre-trained models rather than training from scratch.


10.4 - LLM APIs Quick Start

Core Question: How do we call GPT, Claude, and other large models?

Core Content:

  • Why Should Social Science Students Learn LLM APIs?
    • Text data analysis (news, social media, interviews)
    • Content classification and coding
    • Literature summarization and reviews
    • Survey design assistance
    • Data cleaning and annotation
  • OpenAI API:
    python
    from openai import OpenAI
    
    client = OpenAI(api_key='your-api-key')
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a data analysis assistant"},
            {"role": "user", "content": "Explain what regression analysis is"}
        ]
    )
    
    print(response.choices[0].message.content)
  • Anthropic Claude API:
    python
    import anthropic
    
    client = anthropic.Anthropic(api_key='your-api-key')
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Analyze the sentiment of this text"}
        ]
    )
  • DeepSeek API (Chinese, Affordable):
    python
    from openai import OpenAI  # DeepSeek is compatible with OpenAI SDK
    
    client = OpenAI(
        api_key='your-deepseek-key',
        base_url='https://api.deepseek.com'
    )
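
Because the endpoint is OpenAI-compatible, the call itself looks the same as the OpenAI example above. A minimal sketch, assuming the general-purpose model name deepseek-chat from DeepSeek's documentation:

python
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this abstract in one sentence."}]
)
print(response.choices[0].message.content)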

Practical Use Cases:

Case 1: Batch Sentiment Analysis:

python
def analyze_sentiment(text):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Sentiment analysis expert. Answer: positive, negative, or neutral"},
            {"role": "user", "content": f"Analyze: {text}"}
        ]
    )
    return response.choices[0].message.content

# Batch analyze reviews
reviews = df['comment'].tolist()
sentiments = [analyze_sentiment(r) for r in reviews]
df['sentiment'] = sentiments

Case 2: Text Classification:

python
def classify_news(text, categories):
    prompt = f"""
    Classify the following news into one of these categories: {', '.join(categories)}

    News: {text}

    Return only the category name.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

# Batch classification
categories = ['Politics', 'Economy', 'Society', 'Culture']
df['category'] = df['content'].apply(lambda x: classify_news(x, categories))

Case 3: Structured Data Extraction:

python
import json

def extract_info(text):
    prompt = f"""
    Extract information from the following text, return in JSON format:
    - name: Person's name
    - age: Age
    - occupation: Occupation

    Text: {text}

    Return only the JSON object, with no other text.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.choices[0].message.content)

# Batch extraction
infos = [extract_info(text) for text in df['biography']]
df_info = pd.DataFrame(infos)

Cost and Efficiency:

  • GPT-4o-mini: ~$0.001/1K tokens (affordable)
  • GPT-4: ~$0.03/1K tokens (expensive but more accurate)
  • DeepSeek: ~1/10 of GPT pricing (Chinese alternative)
  • Batch processing: Use ThreadPoolExecutor for concurrent calls (see the sketch below)
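
A minimal sketch of concurrent batch calls with ThreadPoolExecutor, reusing the analyze_sentiment function and df['comment'] column from Case 1. The worker count is an illustrative assumption and should respect your provider's rate limits:

python
from concurrent.futures import ThreadPoolExecutor

reviews = df['comment'].tolist()

# executor.map() preserves input order, so results line up with the original rows
with ThreadPoolExecutor(max_workers=5) as executor:
    sentiments = list(executor.map(analyze_sentiment, reviews))

df['sentiment'] = sentiments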

Traditional Econometrics vs Machine Learning vs LLM

| Dimension | Traditional Econometrics (Stata) | Machine Learning (Scikit-learn) | LLM (GPT/Claude) |
|---|---|---|---|
| Goal | Causal inference, explanation | Prediction | Text understanding, generation |
| Output | Coefficients, p-values | Predictions, accuracy | Text, classification, labels |
| Data Volume | Works with small samples | Requires medium-large samples | Zero-shot/few-shot learning |
| Interpretability | High | Medium | Low |
| Use Cases | Publishing papers | Predictive modeling | Text analysis, content generation |

Complementary Usage Strategy:

  1. Exploratory analysis: LLM for quick classification and annotation
  2. Predictive modeling: Machine learning for building predictors
  3. Causal inference: Traditional econometrics (DID, RDD, IV)
  4. Publishing papers: Traditional econometrics as primary, ML/LLM as supplementary

How to Study This Chapter?

Learning Roadmap

Days 1-2 (4 hours): Scikit-learn Basics

  • Read 10.2 - Scikit-learn Basics
  • Practice linear regression and logistic regression
  • Understand train/test set splitting

Day 3 (2 hours): Deep Learning Introduction

  • Read 10.3 - Deep Learning Introduction
  • Understand neural network concepts
  • Try Hugging Face pipelines

Days 4-5 (6 hours): LLM APIs

  • Read 10.4 - LLM APIs Quick Start
  • Register OpenAI/DeepSeek accounts
  • Implement text classification and sentiment analysis
  • Batch process real data

Total Time: 12 hours (1 week)

Minimal Learning Path

For social science students, priority order:

Must-Learn (Practical skills, 6 hours):

  • 10.4 - LLM APIs Quick Start (complete study)
  • Sentiment analysis and text classification
  • Batch processing techniques

Important (Broadening horizons, 4 hours):

  • 10.2 - Scikit-learn Basics (linear regression, logistic)
  • Understand prediction vs inference difference

Optional (Deep exploration):

  • 10.3 - Deep Learning Introduction
  • Fine-tuning pre-trained models
  • Prompt engineering techniques

Study Recommendations

  1. LLMs are "Super Assistants" for Social Science Students

    • Text classification: Manual coding of 1000 articles takes weeks, LLM takes hours
    • Sentiment analysis: Traditional methods require training models, LLM is ready to use
    • Data cleaning: LLM can understand unstructured text
  2. Note LLM Limitations

    • Cannot be used for causal inference (still need DID, IV, etc.)
    • May have biases (from training data)
    • Requires validation (don't blindly trust LLM output)
    • Best practice: LLM annotation + manual sampling validation (see the validation sketch after the practice project)
  3. Cost Control

    • Use GPT-3.5-Turbo during development (affordable)
    • Consider DeepSeek for production (cheaper)
    • Set max token limits for batch processing
    • Cache results from repeated requests
  4. Practice Project: News Sentiment Analysis Pipeline

    python
    import pandas as pd
    from openai import OpenAI
    from tqdm import tqdm
    
    client = OpenAI(api_key='your-key')
    
    def analyze_sentiment(text):
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                max_tokens=10,
                messages=[
                    {"role": "system", "content": "Sentiment analysis. Answer: positive/negative/neutral"},
                    {"role": "user", "content": text}
                ]
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            print(f"Error: {e}")
            return "unknown"
    
    # Read data
    df = pd.read_csv('news.csv')
    
    # Batch analysis (with progress bar)
    sentiments = []
    for text in tqdm(df['content']):
        sentiment = analyze_sentiment(text)
        sentiments.append(sentiment)
    
    df['sentiment'] = sentiments
    
    # Save results
    df.to_csv('news_with_sentiment.csv', index=False)
    
    # Statistics
    print(df['sentiment'].value_counts())
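
A minimal sketch of the recommended manual validation step for LLM annotations. The file names and the manual_label column are hypothetical; the idea is to hand-code a random sample (10-20%) and report the agreement rate with the LLM labels:

python
# Draw a 10% validation sample for manual coding
sample = df.sample(frac=0.1, random_state=42)
sample.to_csv('validation_sample.csv', index=False)

# ...after adding a hand-coded 'manual_label' column, reload and compare
validated = pd.read_csv('validation_sample_coded.csv')
agreement = (validated['sentiment'] == validated['manual_label']).mean()
print(f"LLM vs. manual agreement: {agreement:.1%}")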

Frequently Asked Questions

Q: Can machine learning replace traditional econometrics? A: No. They have different goals:

  • Machine learning: Prediction ("Will this user click on the ad?")
  • Traditional econometrics: Causal inference ("How many purchases did the ad cause?")
  • Publishing papers still primarily uses traditional econometrics

Q: Are LLM APIs expensive? A:

  • GPT-3.5-Turbo: ~$1-2 for 1000 short texts (very affordable)
  • GPT-4: ~$30 for 1000 texts (expensive)
  • DeepSeek: 10% of GPT pricing (Chinese alternative)
  • For academic research text volumes, cost is typically <$100

Q: Are LLM annotations accurate? Can they be used in papers? A:

  • Accuracy: Usually 80-95% (depends on task)
  • Best practice: LLM annotation + manual validation (sample 10-20%)
  • Top journal acceptance: Increasing number of papers use LLM-assisted annotation
  • Must disclose: Explain LLM usage and validation process in methodology section

Q: Do I need a GPU? A:

  • No (when calling APIs)
  • Only training large models requires GPUs
  • For social science students, 99% of use cases involve calling APIs rather than training models

Q: How to choose an LLM API? A:

  • Development/testing: GPT-3.5-Turbo (fast and affordable)
  • High quality needs: GPT-4 or Claude-3.5-Sonnet
  • Limited budget: DeepSeek (Chinese, affordable)
  • Chinese priority: DeepSeek, Tongyi Qianwen, ERNIE Bot

Next Steps

After completing this chapter, you will have mastered:

  • Building predictive models with Scikit-learn
  • Understanding deep learning and neural network basics
  • Calling OpenAI/Claude/DeepSeek APIs
  • Batch processing text data (sentiment analysis, classification)
  • Applying LLM technologies in social science research

In Module 11, we'll learn code standards, debugging, and Git version control to make your code more professional.

LLMs are transforming social science research! Master this skill to supercharge your research!

