Module 10: Machine Learning and LLM APIs

From Scikit-learn to Large Language Models — Exploring Python's Cutting-Edge Applications


Chapter Overview

This chapter introduces you to Python's most exciting domains: machine learning and Large Language Model (LLM) APIs. You'll learn to build predictive models using Scikit-learn, understand deep learning fundamentals, and master practical skills for calling LLM APIs from OpenAI, Anthropic, DeepSeek, and more.

Important Note: Although this chapter covers advanced topics, these techniques have tremendous potential for social science research (text analysis, content classification, data annotation, etc.). We recommend studying this module after completing Module 9.


Learning Objectives

After completing this chapter, you will be able to:

  • Understand basic machine learning concepts
  • Build predictive models using Scikit-learn
  • Grasp deep learning and neural network fundamentals
  • Call OpenAI GPT APIs for text analysis
  • Use LLMs for sentiment analysis and text classification
  • Process text data in batches
  • Apply LLM technologies in research

Chapter Contents

10.2 - Scikit-learn Basics

Core Question: How do we do machine learning in Python?

Core Content:

  • What is Machine Learning?
    • Supervised learning: Regression (predicting values), Classification (predicting categories)
    • Unsupervised learning: Clustering, Dimensionality reduction (see the clustering sketch after this list)
    • Scikit-learn: Python's most popular machine learning library
  • Linear Regression Example:
    python
    from sklearn.linear_model import LinearRegression
    
    # Prepare data
    X = df[['education', 'age']]  # Features
    y = df['income']  # Target variable
    
    # Train model
    model = LinearRegression()
    model.fit(X, y)
    
    # View coefficients
    print(model.coef_)  # [5000, 1200]
    print(model.intercept_)  # 20000
    
    # Predict
    predictions = model.predict([[16, 30]])  # education=16, age=30
  • Classification Model (Logistic Regression):
    python
    from sklearn.linear_model import LogisticRegression
    
    # Predict high income
    X = df[['education', 'age']]
    y = (df['income'] > 80000).astype(int)  # 0 or 1
    
    model = LogisticRegression()
    model.fit(X, y)
    
    # Predict probability
    proba = model.predict_proba([[16, 30]])[0, 1]  # P(high income)
  • Model Evaluation:
    python
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error
    
    # Split training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Train
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE; newer scikit-learn also provides root_mean_squared_error
  • Comparing with Stata:
    • Stata: reg income education age
    • Python: LinearRegression().fit(X, y)
    • Difference: Scikit-learn focuses on prediction, while Stata focuses on inference
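
For the unsupervised side mentioned above, here is a minimal clustering sketch. It assumes the same df with numeric education, age, and income columns used in the earlier examples, and the choice of k=3 clusters is purely illustrative:

python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize features so no single variable dominates the distance metric
X_scaled = StandardScaler().fit_transform(df[['education', 'age', 'income']])

# Group observations into 3 clusters (k=3 is an assumption; pick k via theory or the elbow method)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

print(df['cluster'].value_counts())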

Why It Matters:

  • Machine learning is a powerful tool for prediction and classification
  • Social science applications: Predicting behavior, text classification, recommendation systems
  • Complements traditional econometrics

10.3 - Deep Learning Introduction

Core Question: What is deep learning?

Core Content:

  • Deep Learning vs Traditional Machine Learning:
    • Traditional ML: Manually designed features
    • Deep learning: Automatically learns features (neural networks)
  • Neural Network Basics:
    • Layers: Input layer, hidden layers, output layer
    • Activation functions: ReLU, Sigmoid
    • Backpropagation: How networks are trained (see the training-loop sketch after this list)
  • PyTorch Introduction:
    python
    import torch
    import torch.nn as nn
    
    # Define simple neural network
    class SimpleNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(2, 10)  # Input layer → Hidden layer
            self.fc2 = nn.Linear(10, 1)  # Hidden layer → Output layer
    
        def forward(self, x):
            x = torch.relu(self.fc1(x))
            x = self.fc2(x)
            return x
  • Hugging Face Transformers:
    • Pre-trained models: BERT, GPT, LLaMA
    • Use cases: Text classification, sentiment analysis, translation
    python
    from transformers import pipeline
    
    # Sentiment analysis
    classifier = pipeline("sentiment-analysis")
    result = classifier("This product is amazing!")
    # [{'label': 'POSITIVE', 'score': 0.99}]
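
To make the backpropagation bullet concrete, here is a minimal training-loop sketch for the SimpleNN class defined above. The synthetic data, loss function, learning rate, and epoch count are illustrative assumptions, not a recommended setup:

python
import torch
import torch.nn as nn

model = SimpleNN()                      # the network defined above
criterion = nn.MSELoss()                # squared-error loss for a regression target
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Synthetic data: 100 samples, 2 features, 1 target
X = torch.randn(100, 2)
y = torch.randn(100, 1)

for epoch in range(100):
    optimizer.zero_grad()          # reset accumulated gradients
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # backpropagation: compute gradients
    optimizer.step()               # update weights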

Social Science Applications:

  • Text sentiment analysis
  • News topic classification
  • Social media content detection
  • Image recognition (protest crowd estimation)

Important Note: Deep learning requires GPUs and large datasets. Social science students typically use pre-trained models rather than training from scratch.


10.4 - LLM APIs Quick Start

Core Question: How do we call GPT, Claude, and other large models?

Core Content:

  • Why Should Social Science Students Learn LLM APIs?
    • Text data analysis (news, social media, interviews)
    • Content classification and coding
    • Literature summarization and reviews
    • Survey design assistance
    • Data cleaning and annotation
  • OpenAI API:
    python
    from openai import OpenAI
    
    client = OpenAI(api_key='your-api-key')
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a data analysis assistant"},
            {"role": "user", "content": "Explain what regression analysis is"}
        ]
    )
    
    print(response.choices[0].message.content)
  • Anthropic Claude API:
    python
    import anthropic
    
    client = anthropic.Anthropic(api_key='your-api-key')
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Analyze the sentiment of this text"}
        ]
    )
  • DeepSeek API (Chinese, Affordable):
    python
    from openai import OpenAI  # DeepSeek is compatible with OpenAI SDK
    
    client = OpenAI(
        api_key='your-deepseek-key',
        base_url='https://api.deepseek.com'
    )
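
Because the endpoint is OpenAI-compatible, the call itself looks the same as the OpenAI example above. A minimal sketch, assuming the general-purpose model name deepseek-chat from DeepSeek's documentation:

python
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this abstract in one sentence."}]
)
print(response.choices[0].message.content)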

Practical Use Cases:

Case 1: Batch Sentiment Analysis:

python
def analyze_sentiment(text):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Sentiment analysis expert. Answer: positive, negative, or neutral"},
            {"role": "user", "content": f"Analyze: {text}"}
        ]
    )
    return response.choices[0].message.content

# Batch analyze reviews
reviews = df['comment'].tolist()
sentiments = [analyze_sentiment(r) for r in reviews]
df['sentiment'] = sentiments

Case 2: Text Classification:

python
def classify_news(text, categories):
    prompt = f"""
    Classify the following news into one of these categories: {', '.join(categories)}

    News: {text}

    Return only the category name.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

# Batch classification
categories = ['Politics', 'Economy', 'Society', 'Culture']
df['category'] = df['content'].apply(lambda x: classify_news(x, categories))

Case 3: Structured Data Extraction:

python
import json

def extract_info(text):
    prompt = f"""
    Extract information from the following text, return in JSON format:
    - name: Person's name
    - age: Age
    - occupation: Occupation

    Text: {text}

    Return only the JSON object, with no other text.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.choices[0].message.content)

# Batch extraction
infos = [extract_info(text) for text in df['biography']]
df_info = pd.DataFrame(infos)

Cost and Efficiency:

  • GPT-4o-mini: ~$0.001/1K tokens (affordable)
  • GPT-4: ~$0.03/1K tokens (expensive but more accurate)
  • DeepSeek: ~1/10 of GPT pricing (Chinese alternative)
  • Batch processing: Use ThreadPoolExecutor for concurrent calls (see the sketch below)
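
A minimal sketch of concurrent batch calls with ThreadPoolExecutor, reusing the analyze_sentiment function and df['comment'] column from Case 1. The worker count is an illustrative assumption and should respect your provider's rate limits:

python
from concurrent.futures import ThreadPoolExecutor

reviews = df['comment'].tolist()

# executor.map() preserves input order, so results line up with the original rows
with ThreadPoolExecutor(max_workers=5) as executor:
    sentiments = list(executor.map(analyze_sentiment, reviews))

df['sentiment'] = sentiments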

Traditional Econometrics vs Machine Learning vs LLM

| Dimension | Traditional Econometrics (Stata) | Machine Learning (Scikit-learn) | LLM (GPT/Claude) |
|---|---|---|---|
| Goal | Causal inference, explanation | Prediction | Text understanding, generation |
| Output | Coefficients, p-values | Predictions, accuracy | Text, classification, labels |
| Data Volume | Works with small samples | Requires medium-large samples | Zero-shot/few-shot learning |
| Interpretability | High | Medium | Low |
| Use Cases | Publishing papers | Predictive modeling | Text analysis, content generation |

Complementary Usage Strategy:

  1. Exploratory analysis: LLM for quick classification and annotation
  2. Predictive modeling: Machine learning for building predictors
  3. Causal inference: Traditional econometrics (DID, RDD, IV)
  4. Publishing papers: Traditional econometrics as primary, ML/LLM as supplementary

How to Study This Chapter?

Learning Roadmap

Days 1-2 (4 hours): Scikit-learn Basics

  • Read 10.2 - Scikit-learn Basics
  • Practice linear regression and logistic regression
  • Understand train/test set splitting

Day 3 (2 hours): Deep Learning Introduction

  • Read 10.3 - Deep Learning Introduction
  • Understand neural network concepts
  • Try Hugging Face pipelines

Days 4-5 (6 hours): LLM APIs

  • Read 10.4 - LLM APIs Quick Start
  • Register OpenAI/DeepSeek accounts
  • Implement text classification and sentiment analysis
  • Batch process real data

Total Time: 12 hours (1 week)

Minimal Learning Path

For social science students, priority order:

Must-Learn (Practical skills, 6 hours):

  • 10.4 - LLM APIs Quick Start (complete study)
  • Sentiment analysis and text classification
  • Batch processing techniques

Important (Broadening horizons, 4 hours):

  • 10.2 - Scikit-learn Basics (linear regression, logistic)
  • Understand prediction vs inference difference

Optional (Deep exploration):

  • 10.3 - Deep Learning Introduction
  • Fine-tuning pre-trained models
  • Prompt engineering techniques

Study Recommendations

  1. LLMs are "Super Assistants" for Social Science Students

    • Text classification: Manual coding of 1000 articles takes weeks, LLM takes hours
    • Sentiment analysis: Traditional methods require training models, LLM is ready to use
    • Data cleaning: LLM can understand unstructured text
  2. Note LLM Limitations

    • Cannot be used for causal inference (still need DID, IV, etc.)
    • May have biases (from training data)
    • Requires validation (don't blindly trust LLM output)
    • Best practice: LLM annotation + manual sampling validation (see the validation sketch after the practice project)
  3. Cost Control

    • Use GPT-3.5-Turbo during development (affordable)
    • Consider DeepSeek for production (cheaper)
    • Set max token limits for batch processing
    • Cache results from repeated requests
  4. Practice Project: News Sentiment Analysis Pipeline

    python
    import pandas as pd
    from openai import OpenAI
    from tqdm import tqdm
    
    client = OpenAI(api_key='your-key')
    
    def analyze_sentiment(text):
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                max_tokens=10,
                messages=[
                    {"role": "system", "content": "Sentiment analysis. Answer: positive/negative/neutral"},
                    {"role": "user", "content": text}
                ]
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            print(f"Error: {e}")
            return "unknown"
    
    # Read data
    df = pd.read_csv('news.csv')
    
    # Batch analysis (with progress bar)
    sentiments = []
    for text in tqdm(df['content']):
        sentiment = analyze_sentiment(text)
        sentiments.append(sentiment)
    
    df['sentiment'] = sentiments
    
    # Save results
    df.to_csv('news_with_sentiment.csv', index=False)
    
    # Statistics
    print(df['sentiment'].value_counts())
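
A minimal sketch of the recommended manual validation step for LLM annotations. The file names and the manual_label column are hypothetical; the idea is to hand-code a random sample (10-20%) and report the agreement rate with the LLM labels:

python
# Draw a 10% validation sample for manual coding
sample = df.sample(frac=0.1, random_state=42)
sample.to_csv('validation_sample.csv', index=False)

# ...after adding a hand-coded 'manual_label' column, reload and compare
validated = pd.read_csv('validation_sample_coded.csv')
agreement = (validated['sentiment'] == validated['manual_label']).mean()
print(f"LLM vs. manual agreement: {agreement:.1%}")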

Frequently Asked Questions

Q: Can machine learning replace traditional econometrics? A: No. They have different goals:

  • Machine learning: Prediction ("Will this user click on the ad?")
  • Traditional econometrics: Causal inference ("How many purchases did the ad cause?")
  • Publishing papers still primarily uses traditional econometrics

Q: Are LLM APIs expensive? A:

  • GPT-3.5-Turbo: ~$1-2 for 1000 short texts (very affordable)
  • GPT-4: ~$30 for 1000 texts (expensive)
  • DeepSeek: 10% of GPT pricing (Chinese alternative)
  • For academic research text volumes, cost is typically <$100

Q: Are LLM annotations accurate? Can they be used in papers? A:

  • Accuracy: Usually 80-95% (depends on task)
  • Best practice: LLM annotation + manual validation (sample 10-20%)
  • Top journal acceptance: Increasing number of papers use LLM-assisted annotation
  • Must disclose: Explain LLM usage and validation process in methodology section

Q: Do I need a GPU? A:

  • No (when calling APIs)
  • Only training large models requires GPUs
  • For social science students, 99% of use cases involve calling APIs rather than training models

Q: How to choose an LLM API? A:

  • Development/testing: GPT-3.5-Turbo (fast and affordable)
  • High quality needs: GPT-4 or Claude-3.5-Sonnet
  • Limited budget: DeepSeek (Chinese, affordable)
  • Chinese priority: DeepSeek, Tongyi Qianwen, ERNIE Bot

Next Steps

After completing this chapter, you will have mastered:

  • Building predictive models with Scikit-learn
  • Understanding deep learning and neural network basics
  • Calling OpenAI/Claude/DeepSeek APIs
  • Batch processing text data (sentiment analysis, classification)
  • Applying LLM technologies in social science research

In Module 11, we'll learn code standards, debugging, and Git version control to make your code more professional.

LLMs are transforming social science research! Master this skill to supercharge your research!

