Jupyter Notebook Quick Start

Data Scientists' Favorite Interactive Programming Environment


What is Jupyter Notebook?

Jupyter Notebook is an interactive programming environment that allows you to:

  • Write code and see results immediately (similar to Stata's do-file editor + Results window)
  • Mix code, charts, and text explanations (similar to R Markdown)
  • Run in a browser without complex configuration

Comparison with Other Environments

| Environment | Pros | Cons | Use Cases |
| --- | --- | --- | --- |
| Jupyter Notebook | Interactive, visual, easy to share | Not for large projects | Data analysis, teaching, prototyping |
| VS Code | Powerful, good for large projects | Steep learning curve | Software development, large projects |
| Google Colab | Free GPU, cloud-based | Requires internet | Deep learning, collaboration |
| PyCharm | Professional IDE | Resource-intensive | Professional development |

Recommendation for Social Science Students: Start with Jupyter Notebook, then learn VS Code once you are proficient.


Quick Start: Three Ways to Use Jupyter

Method 1: Online (Easiest, Zero Configuration)

Use this website's built-in Python environment; no software installation is needed.

Method 2: Google Colab (Free + Cloud)

  1. Visit colab.research.google.com
  2. Sign in with Google account
  3. Click "New Notebook"
  4. Start coding!

Advantages:

  • Completely free
  • Provides free GPU (suitable for deep learning)
  • Direct access to Google Drive

Method 3: Local Installation (Anaconda)

Step 1: Install Anaconda

Anaconda is a scientific computing distribution for Python that includes Jupyter Notebook and common libraries.

Download: anaconda.com/download

Post-installation Check:

bash
# Run in Terminal (Mac/Linux) or Anaconda Prompt (Windows)
jupyter --version
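
The same check can be done from inside Python. The following is a minimal sketch that reports whether the libraries used later in this tutorial can be imported; the `REQUIRED` list and the helper name `check_packages` are illustrative choices, not part of Anaconda or Jupyter.

```python
import importlib.util
import sys

# Libraries used later in this tutorial; adjust the list to your own needs
REQUIRED = ["pandas", "numpy", "matplotlib", "scipy"]

def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
print(f"Python {sys.version.split()[0]}")
for name, found in status.items():
    print(f"  {name}: {'installed' if found else 'MISSING'}")
```

Any package reported as missing can usually be added with `conda install` or `pip install` from the same terminal.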

Step 2: Launch Jupyter Notebook

bash
# Run in terminal
jupyter notebook

Your browser will automatically open http://localhost:8888


Jupyter Notebook Basic Operations

1. Create New Notebook

  1. Click "New" → "Python 3" in upper right
  2. Notebook opens, default name "Untitled"
  3. Click title to rename to "my_first_analysis"

2. Cell Types

Jupyter has two main cell types:

(1) Code Cell

python
# Write Python code in code cells
x = 10
y = 20
print(x + y)

Execution Methods:

  • Press Shift + Enter: Run and jump to next cell
  • Press Ctrl + Enter: Run but stay in current cell

Output:

30

(2) Markdown Cell (Text Explanations)

Switch method: Select cell → Press Esc to enter command mode → Press M key

markdown
# This is a Heading
This is plain text

**Bold** and *Italic*

- List item 1
- List item 2

3. Essential Shortcuts

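| Shortcut | Function |
| --- | --- |
| Shift + Enter | Run current cell, jump to next |
| Ctrl + Enter | Run current cell, stay |
| A | Insert new cell above |
| B | Insert new cell below |
| D + D (press D twice) | Delete current cell |
| M | Convert to Markdown cell |
| Y | Convert to code cell |
| Ctrl + S | Save notebook |

The single-letter shortcuts (A, B, D, M, Y) only work in command mode; press Esc first if the cursor is inside a cell.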

Hands-on: Complete Data Analysis with Jupyter

Example: Analyzing Student Grade Data

Cell 1: Import Libraries and Create Data

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create sample data
data = {
    'student_id': range(1, 21),
    'math_score': [85, 72, 90, 68, 88, 75, 92, 70, 85, 78,
                   95, 65, 88, 72, 90, 77, 85, 70, 92, 80],
    'study_hours': [10, 5, 12, 3, 11, 6, 13, 4, 10, 7,
                    14, 2, 11, 5, 12, 7, 10, 4, 13, 8],
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F',
               'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
}

df = pd.DataFrame(data)
df.head()  # Display first 5 rows

Cell 2: Descriptive Statistics

python
print("📊 Basic Statistics:")
print(df[['math_score', 'study_hours']].describe())

print("\n📈 Statistics by Gender:")
print(df.groupby('gender').agg({
    'math_score': ['mean', 'std'],
    'study_hours': ['mean', 'std']
}))

Cell 3: Visualization

python
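# Optional: choose a font that covers non-Latin characters
# (only needed if labels contain them; otherwise these lines can be skipped)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # Mac
# plt.rcParams['font.sans-serif'] = ['SimHei']  # Windows

# Create one figure holding two side-by-side plots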
plt.figure(figsize=(10, 5))

# Subplot 1: Score vs Study Hours
plt.subplot(1, 2, 1)
plt.scatter(df['study_hours'], df['math_score'], alpha=0.6)
plt.xlabel('Study Hours per Week')
plt.ylabel('Math Score')
plt.title('Score vs Study Hours')

# Subplot 2: Score Distribution
plt.subplot(1, 2, 2)
plt.hist(df['math_score'], bins=8, edgecolor='black', alpha=0.7)
plt.xlabel('Math Score')
plt.ylabel('Frequency')
plt.title('Score Distribution')

plt.tight_layout()
plt.show()

Cell 4: Regression Analysis

python
from scipy import stats

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(
    df['study_hours'], df['math_score']
)

print(f"📊 Regression Equation: Math Score = {intercept:.2f} + {slope:.2f} * Study Hours")
print(f"   R² = {r_value**2:.3f}")
print(f"   p-value = {p_value:.4f}")

# Interpretation: the slope is the estimated change in score per additional study hour
print(f"   Each additional study hour is associated with a {slope:.2f}-point change in score")
if p_value < 0.05:
    print("   ✅ Result is significant (p < 0.05)")
else:
    print("   ❌ Result is not significant (p >= 0.05)")
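
To make explicit what the regression computes, the slope and intercept of a simple linear regression can be reproduced with the closed-form least-squares formulas: slope = Sxy / Sxx and intercept = mean(y) - slope × mean(x). The sketch below uses plain Python on the same sample data; the helper name `ols_fit` is chosen here for illustration. Its results should agree with `stats.linregress` up to floating-point rounding.

```python
study_hours = [10, 5, 12, 3, 11, 6, 13, 4, 10, 7,
               14, 2, 11, 5, 12, 7, 10, 4, 13, 8]
math_score = [85, 72, 90, 68, 88, 75, 92, 70, 85, 78,
              95, 65, 88, 72, 90, 77, 85, 70, 92, 80]

def ols_fit(x, y):
    """Closed-form simple linear regression: returns (slope, intercept, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return slope, intercept, r_squared

slope, intercept, r_squared = ols_fit(study_hours, math_score)
print(f"Math Score = {intercept:.2f} + {slope:.2f} * Study Hours (R² = {r_squared:.3f})")
```

Note that a significant slope describes an association in the sample; on its own it does not establish that extra study hours cause higher scores.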

Advanced Features of Jupyter

1. Magic Commands

python
# View all magic commands
%lsmagic

Common magic commands:

| Command | Function |
| --- | --- |
| %time | Time a single line of code |
| %%time | Time entire cell execution |
| %matplotlib inline | Display charts in notebook |
| %load file.py | Load external Python file |
| %who | List all variables |

Example:

python
%%time
# Calculate sum of 1 to 1000000 (range excludes its upper bound, hence 1000001)
total = sum(range(1, 1000001))
print(total)
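
Magic commands only exist inside IPython and Jupyter. In a plain Python script, a rough equivalent of `%%time` can be sketched with the standard library's `time.perf_counter`, which measures wall-clock time only (`%%time` also reports CPU time). The closed-form check n(n+1)/2 is added here to confirm the sum.

```python
import time

n = 1_000_000

start = time.perf_counter()
total = sum(range(1, n + 1))      # sum of the integers 1..n
elapsed = time.perf_counter() - start

# Gauss's closed form gives the same value without looping
expected = n * (n + 1) // 2

print(f"total = {total}, elapsed = {elapsed:.4f} s")
print("matches closed form:", total == expected)
```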

2. Display Multiple Outputs

python
# Normally, only the last expression result is displayed
df.head()
df.tail()  # Only this will display

# Solution: Use display()
from IPython.display import display
display(df.head())
display(df.tail())  # Both will display

3. Styled DataFrames

python
# More beautiful DataFrame display
df.style.background_gradient(cmap='viridis', subset=['math_score'])

Jupyter Best Practices

1. Organization Structure Recommendations

python
# Cell 1: Import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Cell 2: Set parameters
plt.rcParams['figure.figsize'] = (10, 6)
pd.set_option('display.max_columns', None)

# Cell 3: Load data
df = pd.read_csv('data.csv')

# Cell 4-N: Step-by-step analysis
# Each cell does one thing

2. Add Markdown Explanations

markdown
# Data Analysis: Student Grade Research

## 1. Research Question
We want to know: Does study time affect math scores?

## 2. Data Source
Virtual student survey data (n=20)

## 3. Analysis Methods
- Descriptive statistics
- Correlation analysis
- Linear regression

3. Restart Kernel

If code raises confusing errors or variables hold stale values, restart the kernel:

Menu Bar → Kernel → Restart & Clear Output

4. Notebook Naming Convention

A Widely Used Naming Format:

01_data_cleaning.ipynb
02_exploratory_analysis.ipynb
03_regression_models.ipynb
04_robustness_checks.ipynb

Why Use Number Prefixes?

  • Ensures clear execution order
  • Facilitates team collaboration
  • Aligns with academic research workflow
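
The effect of the prefixes can be sketched in a few lines: sorting file names alphabetically recovers the intended execution order, provided the numbers are zero-padded. The example below creates empty placeholder files in a temporary folder purely for illustration.

```python
import tempfile
from pathlib import Path

# Notebook names in arbitrary order (taken from the list above)
names = [
    "03_regression_models.ipynb",
    "01_data_cleaning.ipynb",
    "04_robustness_checks.ipynb",
    "02_exploratory_analysis.ipynb",
]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in names:
        (folder / name).touch()           # empty placeholder files
    # Alphabetical sorting yields the execution order
    ordered = sorted(p.name for p in folder.glob("*.ipynb"))

for step, name in enumerate(ordered, start=1):
    print(step, name)

# Without zero-padding, "10_" sorts before "2_" because strings compare
# character by character
unpadded = sorted(["2_models.ipynb", "10_appendix.ipynb"])
print(unpadded)
```

This is why `01_`, `02_`, ... is preferable to `1_`, `2_`, ... once a project may exceed nine notebooks.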

5. Cell Atomicity Principle

❌ Wrong Approach (One cell does too much):

python
# Not recommended: All steps mixed together
df = pd.read_csv('data.csv')
df = df.dropna()
df['log_income'] = np.log(df['income'])
result = df.groupby('education')['log_income'].mean()
plt.bar(result.index, result.values)
plt.show()
model = smf.ols('log_income ~ education + age', data=df).fit()
print(model.summary())

✅ Correct Approach (One function per cell):

python
# Cell 1: Data loading
df = pd.read_csv('data.csv')
print(f"Loaded {len(df)} observations")

# Cell 2: Data cleaning
# .copy() makes df_clean independent of df, so adding columns later is safe
df_clean = df.dropna().copy()
print(f"Removed {len(df) - len(df_clean)} rows with missing values")

# Cell 3: Feature engineering
df_clean['log_income'] = np.log(df_clean['income'])

# Cell 4: Descriptive statistics
result = df_clean.groupby('education')['log_income'].mean()
display(result)

# Cell 5: Visualization
plt.bar(result.index, result.values)
plt.title('Mean Log Income by Education')
plt.show()

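# Cell 6: Regression analysis
# (smf comes from statsmodels; in practice, put this import in the first cell)
import statsmodels.formula.api as smf
model = smf.ols('log_income ~ education + age', data=df_clean).fit()
print(model.summary())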

6. Version Control Friendly Settings

python
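# Add at notebook beginning
import warnings
import numpy as np

# Suppress warnings for cleaner output (this can also hide real problems,
# so consider enabling it only once the analysis is stable)
warnings.filterwarnings('ignore')

# Fix random seed (ensure reproducibility)
np.random.seed(42)

# Clear outputs before committing (prevent Git diff confusion)
# Run these in a terminal inside the Git repository, not in a notebook cell:
#   pip install nbstripout
#   nbstripout --install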
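
To see why outputs clutter Git diffs, it helps to know that an .ipynb file is plain JSON in which every code cell stores its outputs and execution counter. The sketch below illustrates the idea of stripping them with the standard library only; it is not nbstripout's actual implementation, and the tiny notebook dict is hand-written for the example.

```python
import json

def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    stripped = json.loads(json.dumps(notebook))   # deep copy via JSON round-trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped

# A minimal notebook in nbformat 4 layout (illustrative content)
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

clean = strip_outputs(notebook)
print(json.dumps(clean["cells"][1], indent=1))
```

With outputs removed, re-running a notebook without changing its code no longer produces a diff.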

Jupyter Workflow in Academic Research

Standard Research Project Structure

research_project/
├── data/
│   ├── raw/                    # Raw data (read-only)
│   │   └── survey_2023.csv
│   └── processed/              # Cleaned data
│       └── clean_survey.csv
├── notebooks/
│   ├── 01_data_cleaning.ipynb       # Data cleaning
│   ├── 02_descriptive_stats.ipynb   # Descriptive statistics
│   ├── 03_main_regression.ipynb     # Main regression
│   ├── 04_robustness.ipynb          # Robustness checks
│   └── 05_heterogeneity.ipynb       # Heterogeneity analysis
├── scripts/
│   └── helper_functions.py     # Reusable functions
├── outputs/
│   ├── figures/
│   │   └── fig1_scatter.png
│   └── tables/
│       └── tab1_summary.tex
├── README.md
└── requirements.txt
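
The folder skeleton above can be created programmatically. The following is a minimal sketch using `pathlib`; the helper name `create_project` is hypothetical, and the demonstration builds the tree in a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path

# Sub-folders taken from the project structure shown above
SUBDIRS = [
    "data/raw",
    "data/processed",
    "notebooks",
    "scripts",
    "outputs/figures",
    "outputs/tables",
]

def create_project(root):
    """Create the folder skeleton plus empty README.md and requirements.txt."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    for filename in ("README.md", "requirements.txt"):
        (root / filename).touch(exist_ok=True)
    return root

# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "research_project")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

for path in created:
    print(path)
```

For a real project, calling `create_project("research_project")` in the desired parent folder would create the same layout on disk.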

Export LaTeX Tables from Notebook

python
# Export pandas table to LaTeX
summary_stats = df.describe()
latex_code = summary_stats.to_latex(
    caption='Descriptive Statistics',
    label='tab:desc_stats',
    float_format="%.2f"
)

# Save to file
with open('../outputs/tables/table1.tex', 'w') as f:
    f.write(latex_code)

print("Table saved to outputs/tables/table1.tex")

Export High-Resolution Figures from Notebook

python
# Set publication-quality figure parameters
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 12
plt.rcParams['font.family'] = 'serif'

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(df['education'], df['income'], alpha=0.6)
plt.xlabel('Years of Education')
plt.ylabel('Annual Income ($)')
plt.title('Education and Income Relationship')

# Save in multiple formats
plt.savefig('../outputs/figures/fig1_scatter.png', bbox_inches='tight', dpi=300)
plt.savefig('../outputs/figures/fig1_scatter.pdf', bbox_inches='tight')  # For papers
plt.show()

Jupyter Extensions and Plugins

bash
# Install JupyterLab (upgraded Notebook)
pip install jupyterlab

# Launch JupyterLab
jupyter lab

Essential Extensions

bash
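# 1. Code Formatter (Black)
pip install jupyterlab-code-formatter black
# The labextension step is typically only needed on older JupyterLab
# versions (before 3.0); newer versions load the extension from the pip package
jupyter labextension install @ryantam626/jupyterlab_code_formatter

# 2. Variable Inspector
pip install lckr-jupyterlab-variableinspector

# 3. Table of Contents
# (JupyterLab 3.0 and later include a table of contents panel by default,
#  so a separate install is usually only relevant for older versions)
pip install jupyterlab-toc

# 4. Git Integration
pip install jupyterlab-git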

Jupyter Notebook Extensions

bash
# Install nbextensions
# (these target the classic Notebook interface; Notebook 7 and later are
#  built on JupyterLab components and do not load them)
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

# After launch, enable in Nbextensions tab:
# - Table of Contents
# - ExecuteTime (show execution time)
# - Autopep8 (code formatting)
# - Variable Inspector

Advanced Tips

1. Parallel Computing (For Big Data)

python
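import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def process_chunk(chunk):
    # Return per-category sums and counts rather than means, so that
    # partial results from different chunks can be combined exactly
    return chunk.groupby('category')['value'].agg(['sum', 'count'])

# Read large file in chunks (returns an iterator of DataFrames)
chunks = pd.read_csv('large_file.csv', chunksize=10000)

# Parallel processing
n_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=n_cores)(
    delayed(process_chunk)(chunk) for chunk in chunks
)

# Merge results: add up partial sums and counts, then compute the mean once
totals = pd.concat(results).groupby(level=0).sum()
final_result = totals['sum'] / totals['count']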
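
A caveat when merging chunked aggregates: averaging per-chunk means only equals the overall mean if every chunk contributes the same number of rows to a group. A small sketch with made-up numbers shows the difference, and why carrying sums and counts through the merge gives the exact result.

```python
# Two chunks with made-up values for a single category
chunk_a = [10, 20, 30, 40]   # 4 observations, mean 25
chunk_b = [100]              # 1 observation, mean 100

def mean(values):
    return sum(values) / len(values)

# Naive merge: average the chunk means (weights both chunks equally)
mean_of_means = mean([mean(chunk_a), mean(chunk_b)])

# Exact merge: combine partial sums and counts, then divide once
total_sum = sum(chunk_a) + sum(chunk_b)
total_count = len(chunk_a) + len(chunk_b)
combined_mean = total_sum / total_count

# Reference value computed on all observations at once
true_mean = mean(chunk_a + chunk_b)

print("mean of means:", mean_of_means)   # 62.5
print("combined mean:", combined_mean)   # 40.0
print("true mean:    ", true_mean)       # 40.0
```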

2. Progress Bar (For Long-Running Operations)

python
from tqdm.notebook import tqdm

def some_function(i):
    # Placeholder for a time-consuming operation (replace with your own code)
    return i ** 2

# Display progress bar in loops
results = []
for i in tqdm(range(1000), desc="Processing"):
    result = some_function(i)
    results.append(result)

3. Interactive Visualization (Plotly)

python
import plotly.express as px

# Create interactive scatter plot
fig = px.scatter(df, x='education', y='income',
                 color='gender', size='age',
                 hover_data=['country'],
                 title='Income by Education (Interactive)')
fig.show()

# Advantage: Can zoom, hover to view data points

4. Automated Report Generation (Papermill)

bash
# Install papermill
pip install papermill

# Batch run notebooks (parameterized)
papermill input_template.ipynb output_2023.ipynb \
  -p year 2023 \
  -p country "USA"

5. Memory Monitoring

python
# List variables with their types and size information
%whos

# Check DataFrame memory usage
df.info(memory_usage='deep')

# Delete unnecessary variables (large_dataframe is a placeholder name)
del large_dataframe
import gc
gc.collect()
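
Besides deleting variables, memory can often be reduced by choosing more compact column types. The sketch below uses synthetic data, not the tutorial's dataset, to compare a repeated-string column before and after conversion to pandas' `category` dtype, measured with `memory_usage(deep=True)`.

```python
import pandas as pd

# Synthetic data: a string column with only two distinct values
df_demo = pd.DataFrame({
    "gender": ["M", "F"] * 5000,
    "score": list(range(10000)),
})

before = int(df_demo.memory_usage(deep=True).sum())

# A categorical column stores each distinct string once plus small integer codes
df_demo["gender"] = df_demo["gender"].astype("category")
after = int(df_demo.memory_usage(deep=True).sum())

print(f"before: {before:,} bytes")
print(f"after:  {after:,} bytes")
```

The saving depends on how often values repeat; for columns with mostly unique strings, the conversion may not help.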

Debugging Techniques

1. Use IPython Debugger

python
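# Insert a breakpoint just before the code you want to inspect
import pdb; pdb.set_trace()

# Execution will pause here, can inspect variables
# Debug commands:
# - n (next): next line
# - c (continue): continue execution
# - q (quit): exit debugger
# - p variable: print variable value

# Alternatively, run %debug in a new cell right after an exception
# to inspect the state at the point of failure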

2. Display Full Error Information

python
# Show detailed error stack
%xmode Verbose

# Restore default (IPython's default mode is Context)
%xmode Context

3. Time Performance Profiling

python
# Profile function performance
%prun df.groupby('category').agg({'value': ['mean', 'std']})

# Line-by-line profiling (requires line_profiler: pip install line_profiler)
# my_function is a placeholder for a function you have defined
%load_ext line_profiler
%lprun -f my_function my_function(df)

Jupyter vs Stata/R

| Feature | Stata | R (RStudio) | Jupyter Notebook |
| --- | --- | --- | --- |
| Interactive Execution | | | |
| Embedded Charts | | ✅ (R Markdown) | |
| Mix Text and Code | | ✅ (R Markdown) | |
| Online Collaboration | | ✅ (RStudio Cloud) | ✅ (Colab) |
| Learning Curve | Easy | Medium | Easy |

Frequently Asked Questions

Q1: What's the difference between Jupyter Notebook and JupyterLab?

Answer:

  • Jupyter Notebook: Classic interface, concise
  • JupyterLab: Next-generation interface, more powerful (multi-tabs, terminal, file manager)

Recommendation: Beginners can start with Notebook and switch to Lab once proficient.

Q2: How to share Jupyter Notebook?

Method 1: Export as HTML

  • File → Download as → HTML

Method 2: Upload to GitHub

  • GitHub automatically renders .ipynb files

Method 3: Use nbviewer

Q3: Why don't charts display?

Solution: Run this at the beginning of the notebook:

python
%matplotlib inline

Practical Exercises

Exercise 1: Create Your First Analysis Notebook

  1. Create new notebook, name it "income_analysis"
  2. Create the following data:
python
data = {
    'country': ['USA', 'China', 'India', 'Brazil', 'UK'],
    'gdp_per_capita': [65000, 12000, 2500, 9000, 45000],
    'population': [330, 1400, 1380, 213, 67]  # in millions
}
  3. Calculate:
    • Total GDP for each country (GDP per capita × population)
    • Average GDP per capita
    • Draw bar chart

Exercise 2: Use Markdown

Add Markdown cells to notebook containing:

  • Title
  • Research question
  • Data source description

Next Steps

In the next section, we will learn VS Code Configuration, a more professional development environment suitable for large projects.

Keep moving forward!

Released under the MIT License. Content © Author.