Jupyter Notebook Quick Start

Data Scientists' Favorite Interactive Programming Environment

What is Jupyter Notebook?

Jupyter Notebook is an interactive programming environment that allows you to:

Write code and see results immediately (similar to Stata's do-file editor + Results window)
Mix code, charts, and text explanations (similar to R Markdown)
Run in browser without complex configuration

Comparison with Other Environments

Environment	Pros	Cons	Use Cases
Jupyter Notebook	Interactive, visual, easy to share	Not for large projects	Data analysis, teaching, prototyping
VS Code	Powerful, good for large projects	Steep learning curve	Software development, large projects
Google Colab	Free GPU, cloud-based	Requires internet	Deep learning, collaboration
PyCharm	Professional IDE	Resource-intensive	Professional development

Recommendation for Social Science Students: Start with Jupyter Notebook, then move to VS Code once you are comfortable.

Quick Start: Three Ways to Use Jupyter

Method 1: Online (Easiest, Zero Configuration)

Use this website's built-in Python environment; no software installation is needed.

Method 2: Google Colab (Free + Cloud)

Visit colab.research.google.com
Sign in with Google account
Click "New Notebook"
Start coding!

Advantages:

Free to use (paid tiers are optional)
Free GPU access, subject to availability (useful for deep learning)
Direct access to Google Drive

Method 3: Local Installation (Recommended for Long-term Use)

Step 1: Install Anaconda

Anaconda is a scientific computing distribution for Python that includes Jupyter Notebook and the most common data analysis libraries.

Download: anaconda.com/download

Post-installation Check:

bash

# Run in Terminal (Mac/Linux) or Anaconda Prompt (Windows)
jupyter --version
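The same kind of check can also be made from inside Python. The sketch below is only an illustration: the helper name `check_packages` and the list of package names are assumptions, so adjust them to whatever your course actually uses.

```python
from importlib.util import find_spec

def check_packages(names):
    """Return {package name: True if it can be imported in this environment}."""
    return {name: find_spec(name) is not None for name in names}

# The package list is an assumption; edit it to match your own setup.
status = check_packages(["notebook", "pandas", "numpy", "matplotlib"])
for name, ok in status.items():
    print(f"{name:12s} {'found' if ok else 'MISSING'}")
```

If a package shows as MISSING, it is usually easiest to install it from the same terminal you used for the version check.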

Step 2: Launch Jupyter Notebook

bash

# Run in terminal
jupyter notebook

Your browser should open automatically at http://localhost:8888 (the default address).

Jupyter Notebook Basic Operations

1. Create New Notebook

Click "New" → "Python 3" in upper right
Notebook opens, default name "Untitled"
Click title to rename to "my_first_analysis"

2. Cell Types

Jupyter has two main cell types:

(1) Code Cell

python

# Write Python code in code cells
x = 10
y = 20
print(x + y)

Execution Methods:

Press Shift + Enter: Run and jump to next cell
Press Ctrl + Enter: Run but stay in current cell

Output:

30

(2) Markdown Cell (Text Explanations)

Switch method: Select the cell → Press Esc to enter command mode → Press the M key

markdown

# This is a Heading
This is plain text

**Bold** and *Italic*

- List item 1
- List item 2

3. Essential Shortcuts

Shortcut	Function
`Shift + Enter`	Run current cell, jump to next
`Ctrl + Enter`	Run current cell, stay
`A`	Insert new cell above
`B`	Insert new cell below
`D + D` (press D twice)	Delete current cell
`M`	Convert to Markdown cell
`Y`	Convert to code cell
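`Ctrl + S`	Save notebook

Note: The single-letter shortcuts (A, B, D + D, M, Y) only work in command mode. Press Esc first; otherwise the letters are typed into the cell.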

Hands-on: Complete Data Analysis with Jupyter

Example: Analyzing Student Grade Data

Cell 1: Import Libraries and Create Data

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create sample data
data = {
    'student_id': range(1, 21),
    'math_score': [85, 72, 90, 68, 88, 75, 92, 70, 85, 78,
                   95, 65, 88, 72, 90, 77, 85, 70, 92, 80],
    'study_hours': [10, 5, 12, 3, 11, 6, 13, 4, 10, 7,
                    14, 2, 11, 5, 12, 7, 10, 4, 13, 8],
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F',
               'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
}

df = pd.DataFrame(data)
df.head()  # Display first 5 rows

Cell 2: Descriptive Statistics

python

print("📊 Basic Statistics:")
print(df[['math_score', 'study_hours']].describe())

print("\n📈 Statistics by Gender:")
print(df.groupby('gender').agg({
    'math_score': ['mean', 'std'],
    'study_hours': ['mean', 'std']
}))

Cell 3: Visualization

python

# Optional: only needed if labels contain non-Latin (e.g. Chinese) characters
# plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # Mac
# plt.rcParams['font.sans-serif'] = ['SimHei']  # Windows

# Create one figure with two side-by-side subplots
plt.figure(figsize=(10, 5))

# Subplot 1: Score vs Study Hours
plt.subplot(1, 2, 1)
plt.scatter(df['study_hours'], df['math_score'], alpha=0.6)
plt.xlabel('Study Hours per Week')
plt.ylabel('Math Score')
plt.title('Score vs Study Hours')

# Subplot 2: Score Distribution
plt.subplot(1, 2, 2)
plt.hist(df['math_score'], bins=8, edgecolor='black', alpha=0.7)
plt.xlabel('Math Score')
plt.ylabel('Frequency')
plt.title('Score Distribution')

plt.tight_layout()
plt.show()

Cell 4: Regression Analysis

python

from scipy import stats

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(
    df['study_hours'], df['math_score']
)

print(f"📊 Regression Equation: Math Score = {intercept:.2f} + {slope:.2f} * Study Hours")
print(f"   R² = {r_value**2:.3f}")
print(f"   p-value = {p_value:.4f}")

# Interpretation (an association in this sample, not proof of causation)
print(f"   Each additional study hour is associated with a {slope:.2f}-point higher score")
if p_value < 0.05:
    print("   ✅ Result is significant (p < 0.05)")
else:
    print("   ❌ Result is not significant (p >= 0.05)")
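To see what the regression call computes, the slope, intercept, and R² can be reproduced with the textbook least-squares formulas. The sketch below uses only plain Python and the same sample values as Cell 1; it is meant as a cross-check, not as a replacement for `stats.linregress`.

```python
# Same study_hours / math_score values as the sample data in Cell 1
hours = [10, 5, 12, 3, 11, 6, 13, 4, 10, 7,
         14, 2, 11, 5, 12, 7, 10, 4, 13, 8]
scores = [85, 72, 90, 68, 88, 75, 92, 70, 85, 78,
          95, 65, 88, 72, 90, 77, 85, 70, 92, 80]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of squares and cross-products around the means
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))

slope = sxy / sxx                    # change in score per extra study hour
intercept = mean_y - slope * mean_x  # the fitted line passes through the means
r_squared = sxy ** 2 / (sxx * syy)   # squared Pearson correlation

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R²={r_squared:.3f}")
```

The printed values should agree with the regression output above up to rounding.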

Advanced Features of Jupyter

1. Magic Commands

python

# View all magic commands
%lsmagic

Common magic commands:

Command	Function
`%time`	Time a single line of code
`%%time`	Time entire cell execution
`%matplotlib inline`	Display charts in notebook
`%load file.py`	Load external Python file
`%who`	List all variables

Example:

python

%%time
# Calculate sum of 1 to 1000000 (range excludes its upper bound, hence 1000001)
total = sum(range(1, 1000001))
print(total)
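Magic commands only exist inside IPython and Jupyter. In a plain Python script, a comparable wall-clock measurement can be sketched with the standard library's `time.perf_counter`. This is a rough stand-in rather than a description of how `%%time` works internally, and it reports wall time only.

```python
import time

start = time.perf_counter()          # high-resolution wall-clock timer
total = sum(range(1, 1_000_001))     # sum of the integers 1 to 1,000,000
elapsed = time.perf_counter() - start

print(total)
print(f"Wall time: {elapsed * 1000:.2f} ms")
```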

2. Display Multiple Outputs

python

# Normally, only the last expression result is displayed
df.head()
df.tail()  # Only this will display

# Solution: Use display()
from IPython.display import display
display(df.head())
display(df.tail())  # Both will display

3. Styled DataFrames

python

# More beautiful DataFrame display
df.style.background_gradient(cmap='viridis', subset=['math_score'])

Jupyter Best Practices

1. Organization Structure Recommendations

python

# Cell 1: Import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Cell 2: Set parameters
plt.rcParams['figure.figsize'] = (10, 6)
pd.set_option('display.max_columns', None)

# Cell 3: Load data
df = pd.read_csv('data.csv')

# Cell 4-N: Step-by-step analysis
# Each cell does one thing

2. Add Markdown Explanations

markdown

# Data Analysis: Student Grade Research

## 1. Research Question
We want to know: Does study time affect math scores?

## 2. Data Source
Virtual student survey data (n=20)

## 3. Analysis Methods
- Descriptive statistics
- Correlation analysis
- Linear regression

3. Restart Kernel

If the code raises unexpected errors or variables hold stale values from earlier runs, restart the kernel:

Menu Bar → Kernel → Restart & Clear Output

4. Notebook Naming Convention

A Widely Used Naming Format:

01_data_cleaning.ipynb
02_exploratory_analysis.ipynb
03_regression_models.ipynb
04_robustness_checks.ipynb

Why Use Number Prefixes?

Ensures clear execution order
Facilitates team collaboration
Aligns with academic research workflow
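The ordering benefit is easy to check: with zero-padded prefixes, plain alphabetical sorting matches the intended execution order. The sketch below uses the four file names above; the two unpadded names at the end are made up to show what goes wrong without padding.

```python
names = [
    "03_regression_models.ipynb",
    "01_data_cleaning.ipynb",
    "04_robustness_checks.ipynb",
    "02_exploratory_analysis.ipynb",
]

# Zero-padded prefixes make alphabetical order match execution order
for name in sorted(names):
    print(name)

# Without padding, "10_..." sorts before "2_..." (hypothetical file names)
unpadded = sorted(["2_clean.ipynb", "10_appendix.ipynb"])
print(unpadded)
```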

5. Cell Atomicity Principle

❌ Wrong Approach (One cell does too much):

python

# Not recommended: All steps mixed together
df = pd.read_csv('data.csv')
df = df.dropna()
df['log_income'] = np.log(df['income'])
result = df.groupby('education')['log_income'].mean()
plt.bar(result.index, result.values)
plt.show()
model = smf.ols('log_income ~ education + age', data=df).fit()
print(model.summary())

✅ Correct Approach (One task per cell):

python

# Cell 1: Data loading
# (assumes pandas, numpy, matplotlib and `import statsmodels.formula.api as smf`
#  were imported in the notebook's first cell)
df = pd.read_csv('data.csv')
print(f"Loaded {len(df)} observations")

# Cell 2: Data cleaning
# Explicit copy so that adding columns later cannot trigger SettingWithCopyWarning
df_clean = df.dropna().copy()
print(f"Removed {len(df) - len(df_clean)} rows with missing values")

# Cell 3: Feature engineering
df_clean['log_income'] = np.log(df_clean['income'])

# Cell 4: Descriptive statistics
result = df_clean.groupby('education')['log_income'].mean()
display(result)

# Cell 5: Visualization
plt.bar(result.index, result.values)
plt.title('Mean Log Income by Education')
plt.show()

# Cell 6: Regression analysis
model = smf.ols('log_income ~ education + age', data=df_clean).fit()
print(model.summary())

6. Version Control Friendly Settings

python

# Add at notebook beginning
import warnings
warnings.filterwarnings('ignore')

# Fix random seed (ensure reproducibility)
np.random.seed(42)

# Clear outputs (prevent Git diff confusion)
# Use nbstripout: pip install nbstripout
# Setup: nbstripout --install
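Why do outputs clutter Git diffs? An .ipynb file is a JSON document, and each code cell stores its outputs and execution counter next to its source. The sketch below illustrates the idea of stripping them on a tiny hand-written notebook. It is a simplified illustration, not nbstripout's actual implementation, and the helper name `strip_outputs` is made up for this example.

```python
import json

def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))   # simple deep copy via JSON
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

# A tiny hand-written notebook in nbformat 4 layout (illustrative only)
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 3,
         "source": ["print(1 + 1)"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["2\n"]}]},
    ],
}

clean = strip_outputs(nb)
print(json.dumps(clean["cells"][1], indent=2))
```

After stripping, rerunning a notebook without editing its code no longer changes the file, so diffs show only real edits.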

Jupyter Workflow in Academic Research

Standard Research Project Structure

research_project/
├── data/
│   ├── raw/                    # Raw data (read-only)
│   │   └── survey_2023.csv
│   └── processed/              # Cleaned data
│       └── clean_survey.csv
├── notebooks/
│   ├── 01_data_cleaning.ipynb       # Data cleaning
│   ├── 02_descriptive_stats.ipynb   # Descriptive statistics
│   ├── 03_main_regression.ipynb     # Main regression
│   ├── 04_robustness.ipynb          # Robustness checks
│   └── 05_heterogeneity.ipynb       # Heterogeneity analysis
├── scripts/
│   └── helper_functions.py     # Reusable functions
├── outputs/
│   ├── figures/
│   │   └── fig1_scatter.png
│   └── tables/
│       └── tab1_summary.tex
├── README.md
└── requirements.txt
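The folder layout above can be created in one step. The sketch below is one possible way to do it with `pathlib`; the function name `create_skeleton` is an assumption, and the demo runs in a temporary directory so nothing is written to a real project.

```python
from pathlib import Path
import tempfile

# Folder names follow the layout shown above
SUBDIRS = [
    "data/raw",
    "data/processed",
    "notebooks",
    "scripts",
    "outputs/figures",
    "outputs/tables",
]

def create_skeleton(root):
    """Create the project folders plus empty README.md and requirements.txt."""
    root = Path(root)
    created = []
    for sub in SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)   # safe to rerun
        created.append(path)
    for filename in ("README.md", "requirements.txt"):
        (root / filename).touch(exist_ok=True)
    return created

# Demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "research_project"
    for path in create_skeleton(root):
        print(path.relative_to(root))
```

To set up a real project, call `create_skeleton("research_project")` from the directory where the project should live.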

Export LaTeX Tables from Notebook

python

# Export pandas table to LaTeX
summary_stats = df.describe()
latex_code = summary_stats.to_latex(
    caption='Descriptive Statistics',
    label='tab:desc_stats',
    float_format="%.2f"
)

# Save to file
with open('../outputs/tables/table1.tex', 'w') as f:
    f.write(latex_code)

print("Table saved to outputs/tables/table1.tex")

Export High-Resolution Figures from Notebook

python

# Set publication-quality figure parameters
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 12
plt.rcParams['font.family'] = 'serif'

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(df['education'], df['income'], alpha=0.6)
plt.xlabel('Years of Education')
plt.ylabel('Annual Income ($)')
plt.title('Education and Income Relationship')

# Save in multiple formats
plt.savefig('../outputs/figures/fig1_scatter.png', bbox_inches='tight', dpi=300)
plt.savefig('../outputs/figures/fig1_scatter.pdf', bbox_inches='tight')  # For papers
plt.show()

Jupyter Extensions and Plugins

JupyterLab Extensions (Recommended)

bash

# Install JupyterLab (the next-generation notebook interface)
pip install jupyterlab

# Launch JupyterLab
jupyter lab

Essential Extensions

bash

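# 1. Code Formatter (Black)
pip install jupyterlab-code-formatter black
# Only for old JupyterLab versions (before 3.0); newer versions need just the pip install:
# jupyter labextension install @ryantam626/jupyterlab_code_formatter

# 2. Variable Inspector
pip install lckr-jupyterlab-variableinspector

# 3. Table of Contents (built into JupyterLab 3.0 and later; install only on older versions)
pip install jupyterlab-toc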

# 4. Git Integration
pip install jupyterlab-git

Jupyter Notebook Extensions

bash

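# Install nbextensions (for the classic Notebook 6.x and earlier;
# Notebook 7 is built on JupyterLab and uses JupyterLab extensions instead)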
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

# After launch, enable in Nbextensions tab:
# - Table of Contents
# - ExecuteTime (show execution time)
# - Autopep8 (code formatting)
# - Variable Inspector

Advanced Tips

1. Parallel Computing (For Big Data)

python

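from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

def process_chunk(chunk):
    # Per-category sum and count for one chunk
    # (per-chunk means cannot simply be averaged across chunks)
    return chunk.groupby('category')['value'].agg(['sum', 'count'])

# Read large file in chunks
chunks = pd.read_csv('large_file.csv', chunksize=10000)

# Parallel processing
n_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=n_cores)(
    delayed(process_chunk)(chunk) for chunk in chunks
)

# Merge results: add up partial sums and counts, then compute the overall mean
totals = pd.concat(results).groupby(level=0).sum()
final_result = totals['sum'] / totals['count']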
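One caveat when merging per-chunk results: a mean of chunk means generally differs from the overall mean when chunks contain different numbers of rows for a category. The plain-Python sketch below uses made-up numbers for a single category to show the difference, and the sum-and-count approach that avoids it.

```python
# Two chunks of unequal size holding values for one category (made-up numbers)
chunk_a = [10, 20, 30, 40]   # chunk mean is 25
chunk_b = [100]              # chunk mean is 100

# Naive: average the per-chunk means (ignores how many rows each chunk had)
naive = (sum(chunk_a) / len(chunk_a) + sum(chunk_b) / len(chunk_b)) / 2

# Safer: carry (sum, count) per chunk, then divide once at the end
partials = [(sum(c), len(c)) for c in (chunk_a, chunk_b)]
total_sum = sum(s for s, _ in partials)
total_count = sum(n for _, n in partials)
overall = total_sum / total_count

print(naive, overall)   # 62.5 versus the true mean 40.0
```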

2. Progress Bar (For Long-Running Operations)

python

import time
from tqdm.notebook import tqdm

def some_function(i):
    # Placeholder for a time-consuming operation
    time.sleep(0.01)
    return i ** 2

# Display progress bar in loops
results = []
for i in tqdm(range(1000), desc="Processing"):
    result = some_function(i)
    results.append(result)

3. Interactive Visualization (Plotly)

python

import plotly.express as px

# Create interactive scatter plot
fig = px.scatter(df, x='education', y='income',
                 color='gender', size='age',
                 hover_data=['country'],
                 title='Income by Education (Interactive)')
fig.show()

# Advantage: Can zoom, hover to view data points

4. Automated Report Generation (Papermill)

bash

# Install papermill
pip install papermill

# Batch run notebooks (parameterized: the -p values override the defaults
# defined in the input notebook's cell tagged "parameters")
papermill input_template.ipynb output_2023.ipynb \
  -p year 2023 \
  -p country "USA"

5. Memory Monitoring

python

# Check variable memory usage
%whos

# Check DataFrame memory usage
df.info(memory_usage='deep')

# Delete unnecessary variables
del large_dataframe
import gc
gc.collect()

Debugging Techniques

1. Use IPython Debugger

python

# Insert a breakpoint just before the code you want to inspect
import pdb; pdb.set_trace()  # or simply breakpoint() in Python 3.7+

# Execution will pause here, can inspect variables
# Debug commands:
# - n (next): next line
# - c (continue): continue execution
# - q (quit): exit debugger
# - p variable: print variable value

2. Display Full Error Information

python

# Show detailed error stack
%xmode Verbose

# Restore default (IPython's default mode is Context)
%xmode Context

3. Time Performance Profiling

python

# Profile function performance
%prun df.groupby('category').agg({'value': ['mean', 'std']})

# Line-by-line profiling (requires line_profiler)
%load_ext line_profiler
%lprun -f my_function my_function(df)
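In a plain Python script, where `%prun` is not available, a similar report can be produced with the standard library's `cProfile` and `pstats` modules. The sketch below profiles a made-up function, `slow_sum`, which stands in for whatever code you want to measure.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Hypothetical workload: sum of squares computed in a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200_000)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report lists call counts and time per function, which is usually enough to tell which step deserves optimization.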

Jupyter vs Stata/R

Feature	Stata	R (RStudio)	Jupyter Notebook
Interactive Execution	✅	✅	✅
Embedded Charts	❌	✅ (R Markdown)	✅
Mix Text and Code	❌	✅ (R Markdown)	✅
Online Collaboration	❌	✅ (Posit Cloud, formerly RStudio Cloud)	✅ (Colab)
Learning Curve	Easy	Medium	Easy

Frequently Asked Questions

Q1: What's the difference between Jupyter Notebook and JupyterLab?

Answer:

Jupyter Notebook: Classic interface, concise
JupyterLab: Next-generation interface, more powerful (multi-tabs, terminal, file manager)

Recommendation: Beginners start with Notebook, switch to Lab when proficient

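Q2: How to share Jupyter Notebook?

Method 1: Export as HTML

File → Download as → HTML (in Notebook 7 and JupyterLab: File → Save and Export Notebook As → HTML)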

Method 2: Upload to GitHub

GitHub automatically renders .ipynb files

Method 3: Use nbviewer

Visit nbviewer.jupyter.org
Enter GitHub link

Q3: Why don't charts display?

Solution: Run the following in the first cell of the notebook

python

%matplotlib inline

Practical Exercises

Exercise 1: Create Your First Analysis Notebook

Create new notebook, name it "income_analysis"
Create the following data:

python

data = {
    'country': ['USA', 'China', 'India', 'Brazil', 'UK'],
    'gdp_per_capita': [65000, 12000, 2500, 9000, 45000],
    'population': [330, 1400, 1380, 213, 67]  # in millions
}

Calculate:
- Total GDP for each country (GDP per capita × population)
- Average GDP per capita
- Draw bar chart

Exercise 2: Use Markdown

Add Markdown cells to notebook containing:

Title
Research question
Data source description

Next Steps

In the next section, we will cover VS Code configuration. VS Code is a more full-featured development environment that is better suited to large projects.

Keep moving forward!
