Module 10: Machine Learning and LLM APIs
From Scikit-learn to Large Language Models — Exploring Python's Cutting-Edge Applications
Chapter Overview
This chapter introduces you to Python's most exciting domains: machine learning and Large Language Model (LLM) APIs. You'll learn to build predictive models using Scikit-learn, understand deep learning fundamentals, and master practical skills for calling LLM APIs from OpenAI, Anthropic, DeepSeek, and more.
Important Note: While this chapter covers advanced topics, it has tremendous potential for social science research (text analysis, content classification, data annotation, etc.). We recommend studying this module after completing Module 9.
Learning Objectives
After completing this chapter, you will be able to:
- Understand basic machine learning concepts
- Build predictive models using Scikit-learn
- Grasp deep learning and neural network fundamentals
- Call OpenAI GPT APIs for text analysis
- Use LLMs for sentiment analysis and text classification
- Process text data in batches
- Apply LLM technologies in research
Chapter Contents
10.2 - Scikit-learn Basics
Core Question: How do we do machine learning in Python?
Core Content:
- What is Machine Learning?
- Supervised learning: Regression (predicting values), Classification (predicting categories)
- Unsupervised learning: Clustering, Dimensionality reduction
- Scikit-learn: Python's most popular machine learning library
- Linear Regression Example:

```python
from sklearn.linear_model import LinearRegression

# Prepare data
X = df[['education', 'age']]  # Features
y = df['income']              # Target variable

# Train model
model = LinearRegression()
model.fit(X, y)

# View coefficients
print(model.coef_)       # e.g., [5000, 1200]
print(model.intercept_)  # e.g., 20000

# Predict: education=16, age=30
predictions = model.predict([[16, 30]])
```

- Classification Model (Logistic Regression):

```python
from sklearn.linear_model import LogisticRegression

# Predict high income
X = df[['education', 'age']]
y = (df['income'] > 80000).astype(int)  # 0 or 1

model = LogisticRegression()
model.fit(X, y)

# Predict probability: education=16, age=30
proba = model.predict_proba([[16, 30]])[0, 1]  # P(high income)
```

- Model Evaluation:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Split training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE; avoids the deprecated squared=False argument
```

- Comparing with Stata:
  - Stata: `reg income education age`
  - Python: `LinearRegression().fit(X, y)`
  - Difference: Scikit-learn focuses on prediction; Stata focuses on inference
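If you want Stata-style inference output in Python itself, the `statsmodels` library (not covered above) reproduces the familiar regression table with coefficients, standard errors, and p-values. A minimal sketch, assuming the same `df` as in the examples above:

```python
import statsmodels.formula.api as smf

# Stata-style OLS: prints a full regression table, like `reg income education age`
ols = smf.ols('income ~ education + age', data=df).fit()
print(ols.summary())
```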
Why It Matters:
- Machine learning is a powerful tool for prediction and classification
- Social science applications: Predicting behavior, text classification, recommendation systems
- Complements traditional econometrics
10.3 - Deep Learning Introduction
Core Question: What is deep learning?
Core Content:
- Deep Learning vs Traditional Machine Learning:
- Traditional ML: Manually designed features
- Deep learning: Automatically learns features (neural networks)
- Neural Network Basics:
- Layers: Input layer, hidden layers, output layer
- Activation functions: ReLU, Sigmoid
- Backpropagation: How to train networks
- PyTorch Introduction:

```python
import torch
import torch.nn as nn

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 10)  # Input layer → Hidden layer
        self.fc2 = nn.Linear(10, 1)  # Hidden layer → Output layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```
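The basics list above mentions backpropagation; here is a minimal training-loop sketch showing where it happens, assuming the `SimpleNN` class just defined and made-up placeholder tensors:

```python
import torch

model = SimpleNN()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Placeholder data: 100 samples, 2 features (illustrative only)
X_tensor = torch.randn(100, 2)
y_tensor = torch.randn(100, 1)

for epoch in range(100):
    optimizer.zero_grad()                        # Reset gradients
    loss = criterion(model(X_tensor), y_tensor)  # Forward pass + loss
    loss.backward()                              # Backpropagation: compute gradients
    optimizer.step()                             # Update weights
```

- Hugging Face Transformers: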
- Pre-trained models: BERT, GPT, LLaMA
- Use cases: Text classification, sentiment analysis, translation
```python
from transformers import pipeline

# Sentiment analysis with a pre-trained model
classifier = pipeline("sentiment-analysis")
result = classifier("This product is amazing!")
# [{'label': 'POSITIVE', 'score': 0.99}]
```
Social Science Applications:
- Text sentiment analysis
- News topic classification
- Social media content detection
- Image recognition (protest crowd estimation)
Important Note: Deep learning requires GPUs and large datasets. Social science students typically use pre-trained models rather than training from scratch.
10.4 - LLM APIs Quick Start
Core Question: How do we call GPT, Claude, and other large models?
Core Content:
- Why Should Social Science Students Learn LLM APIs?
- Text data analysis (news, social media, interviews)
- Content classification and coding
- Literature summarization and reviews
- Survey design assistance
- Data cleaning and annotation
- OpenAI API:

```python
from openai import OpenAI

client = OpenAI(api_key='your-api-key')

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a data analysis assistant"},
        {"role": "user", "content": "Explain what regression analysis is"}
    ]
)
print(response.choices[0].message.content)
```

- Anthropic Claude API:

```python
import anthropic

client = anthropic.Anthropic(api_key='your-api-key')

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Analyze the sentiment of this text"}
    ]
)
```

- DeepSeek API (Chinese, Affordable):

```python
from openai import OpenAI

# DeepSeek is compatible with the OpenAI SDK
client = OpenAI(
    api_key='your-deepseek-key',
    base_url='https://api.deepseek.com'
)
```
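Once the client points at DeepSeek's base URL, requests look exactly like OpenAI's. A minimal sketch; `deepseek-chat` is DeepSeek's general-purpose chat model name:

```python
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this paragraph in one sentence: ..."}]
)
print(response.choices[0].message.content)
```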
Practical Use Cases:
Case 1: Batch Sentiment Analysis:
```python
def analyze_sentiment(text):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Sentiment analysis expert. Answer: positive, negative, or neutral"},
            {"role": "user", "content": f"Analyze: {text}"}
        ]
    )
    return response.choices[0].message.content

# Batch analyze reviews
reviews = df['comment'].tolist()
sentiments = [analyze_sentiment(r) for r in reviews]
df['sentiment'] = sentiments
```

Case 2: Text Classification:
```python
def classify_news(text, categories):
    prompt = f"""
    Classify the following news into one of these categories: {', '.join(categories)}
    News: {text}
    Return only the category name.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

# Batch classification
categories = ['Politics', 'Economy', 'Society', 'Culture']
df['category'] = df['content'].apply(lambda x: classify_news(x, categories))
```

Case 3: Structured Data Extraction:
```python
import json

def extract_info(text):
    prompt = f"""
    Extract information from the following text, return in JSON format:
    - name: Person's name
    - age: Age
    - occupation: Occupation
    Text: {text}
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.choices[0].message.content)

# Batch extraction
infos = [extract_info(text) for text in df['biography']]
df_info = pd.DataFrame(infos)
```

Cost and Efficiency:
- GPT-4o-mini: ~$0.001/1K tokens (affordable)
- GPT-4: ~$0.03/1K tokens (expensive but more accurate)
- DeepSeek: ~1/10 of GPT pricing (Chinese alternative)
- Batch processing: Use `ThreadPoolExecutor` for concurrent calls (see the sketch below)
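As referenced in the list above, a minimal concurrency sketch using `ThreadPoolExecutor`, assuming the `analyze_sentiment` function from Case 1 (treat `max_workers` as a tunable guess; providers enforce rate limits):

```python
from concurrent.futures import ThreadPoolExecutor

reviews = df['comment'].tolist()

# Run API calls concurrently; executor.map preserves input order
with ThreadPoolExecutor(max_workers=8) as executor:
    sentiments = list(executor.map(analyze_sentiment, reviews))

df['sentiment'] = sentiments
```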
Traditional Econometrics vs Machine Learning vs LLM
| Dimension | Traditional Econometrics (Stata) | Machine Learning (Scikit-learn) | LLM (GPT/Claude) |
|---|---|---|---|
| Goal | Causal inference, explanation | Prediction | Text understanding, generation |
| Output | Coefficients, p-values | Predictions, accuracy | Text, classification, labels |
| Data Volume | Works with small samples | Requires medium-large samples | Zero-shot/few-shot learning |
| Interpretability | High | Medium | Low |
| Use Cases | Publishing papers | Predictive modeling | Text analysis, content generation |
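"Few-shot learning" in the table means steering the model with a handful of labeled examples inside the prompt instead of a training set. A hypothetical sketch:

```python
few_shot_prompt = """Classify each headline as Politics, Economy, Society, or Culture.

Headline: Central bank raises interest rates
Category: Economy

Headline: New museum exhibition opens downtown
Category: Culture

Headline: Parliament passes electoral reform bill
Category:"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}]
)
print(response.choices[0].message.content.strip())  # e.g., "Politics"
```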
Complementary Usage Strategy:
- Exploratory analysis: LLM for quick classification and annotation
- Predictive modeling: Machine learning for building predictors
- Causal inference: Traditional econometrics (DID, RDD, IV)
- Publishing papers: Traditional econometrics as primary, ML/LLM as supplementary
How to Study This Chapter?
Learning Roadmap
Days 1-2 (4 hours): Scikit-learn Basics
- Read 10.2 - Scikit-learn Basics
- Practice linear regression and logistic regression
- Understand train/test set splitting
Day 3 (2 hours): Deep Learning Introduction
- Read 10.3 - Deep Learning Introduction
- Understand neural network concepts
- Try Hugging Face pipelines
Days 4-5 (6 hours): LLM APIs
- Read 10.4 - LLM APIs Quick Start
- Register OpenAI/DeepSeek accounts
- Implement text classification and sentiment analysis
- Batch process real data
Total Time: 12 hours (1 week)
Minimal Learning Path
For social science students, priority order:
Must-Learn (Practical skills, 6 hours):
- 10.4 - LLM APIs Quick Start (complete study)
- Sentiment analysis and text classification
- Batch processing techniques
Important (Broadening horizons, 4 hours):
- 10.2 - Scikit-learn Basics (linear regression, logistic)
- Understand prediction vs inference difference
Optional (Deep exploration):
- 10.3 - Deep Learning Introduction
- Fine-tuning pre-trained models
- Prompt engineering techniques
Study Recommendations
LLMs are "Super Assistants" for Social Science Students
- Text classification: Manually coding 1,000 articles takes weeks; an LLM takes hours
- Sentiment analysis: Traditional methods require training a model; an LLM works out of the box
- Data cleaning: LLMs can understand unstructured text
Note LLM Limitations
- Cannot be used for causal inference (still need DID, IV, etc.)
- May have biases (from training data)
- Requires validation (don't blindly trust LLM output)
- Best practice: LLM annotation + manual sampling validation
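A minimal sketch of that workflow, assuming a DataFrame with an LLM-labeled `sentiment` column (the 10% sample and file names are illustrative):

```python
import pandas as pd

# Draw a random 10% sample for manual checking
sample = df.sample(frac=0.1, random_state=42)
sample.to_csv('validation_sample.csv', index=False)

# ... code the sample by hand, adding a `manual_label` column, then reload ...
checked = pd.read_csv('validation_sample_coded.csv')

# Agreement rate between LLM and human labels
agreement = (checked['sentiment'] == checked['manual_label']).mean()
print(f"LLM-human agreement: {agreement:.1%}")
```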
Cost Control
- Use GPT-3.5-Turbo during development (affordable)
- Consider DeepSeek for production (cheaper)
- Set max token limits for batch processing
- Cache results from repeated requests
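One simple way to cache repeated requests is an on-disk dictionary keyed by input text. A sketch under assumed names (the cache file is made up; `analyze_sentiment` is the function from Case 1):

```python
import json
import os

CACHE_FILE = 'llm_cache.json'  # hypothetical cache location
cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}

def analyze_sentiment_cached(text):
    # Call the API only for texts we have not seen before
    if text not in cache:
        cache[text] = analyze_sentiment(text)
        with open(CACHE_FILE, 'w') as f:
            json.dump(cache, f)
    return cache[text]
```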
Practice Project: News Sentiment Analysis Pipeline
```python
import pandas as pd
from openai import OpenAI
from tqdm import tqdm

client = OpenAI(api_key='your-key')

def analyze_sentiment(text):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            max_tokens=10,
            messages=[
                {"role": "system", "content": "Sentiment analysis. Answer: positive/negative/neutral"},
                {"role": "user", "content": text}
            ]
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error: {e}")
        return "unknown"

# Read data
df = pd.read_csv('news.csv')

# Batch analysis (with progress bar)
sentiments = []
for text in tqdm(df['content']):
    sentiment = analyze_sentiment(text)
    sentiments.append(sentiment)

df['sentiment'] = sentiments

# Save results
df.to_csv('news_with_sentiment.csv', index=False)

# Statistics
print(df['sentiment'].value_counts())
```
Frequently Asked Questions
Q: Can machine learning replace traditional econometrics? A: No. They have different goals:
- Machine learning: Prediction ("Will this user click on the ad?")
- Traditional econometrics: Causal inference ("How many purchases did the ad cause?")
- Publishing papers still primarily uses traditional econometrics
Q: Are LLM APIs expensive? A:
- GPT-3.5-Turbo: ~$1-2 for 1000 short texts (very affordable)
- GPT-4: ~$30 for 1000 texts (expensive)
- DeepSeek: 10% of GPT pricing (Chinese alternative)
- For academic research text volumes, cost is typically <$100
Q: Are LLM annotations accurate? Can they be used in papers? A:
- Accuracy: Usually 80-95% (depends on task)
- Best practice: LLM annotation + manual validation (sample 10-20%)
- Top journal acceptance: a growing number of published papers use LLM-assisted annotation
- Must disclose: Explain LLM usage and validation process in methodology section
Q: Do I need a GPU? A:
- No (when calling APIs)
- Only training large models requires GPUs
- For social science students, 99% of use cases involve calling APIs, not training models
Q: How to choose an LLM API? A:
- Development/testing: GPT-3.5-Turbo (fast and affordable)
- High quality needs: GPT-4 or Claude-3.5-Sonnet
- Limited budget: DeepSeek (Chinese, affordable)
- Chinese priority: DeepSeek, Tongyi Qianwen, ERNIE Bot
Next Steps
After completing this chapter, you will have mastered:
- Building predictive models with Scikit-learn
- Understanding deep learning and neural network basics
- Calling OpenAI/Claude/DeepSeek APIs
- Batch processing text data (sentiment analysis, classification)
- Applying LLM technologies in social science research
In Module 11, we'll learn code standards, debugging, and Git version control to make your code more professional.
LLMs are transforming social science research! Master this skill to supercharge your research!