Jupyter Notebook Quick Start
Data Scientists' Favorite Interactive Programming Environment
What is Jupyter Notebook?
Jupyter Notebook is an interactive programming environment that allows you to:
- Write code and see results immediately (similar to Stata's do-file editor + Results window)
- Mix code, charts, and text explanations (similar to R Markdown)
- Run in browser without complex configuration
Comparison with Other Environments
| Environment | Pros | Cons | Use Cases |
|---|---|---|---|
| Jupyter Notebook | Interactive, visual, easy to share | Not for large projects | Data analysis, teaching, prototyping |
| VS Code | Powerful, good for large projects | Steep learning curve | Software development, large projects |
| Google Colab | Free GPU, cloud-based | Requires internet | Deep learning, collaboration |
| PyCharm | Professional IDE | Resource-intensive | Professional development |
Recommendation for Social Science Students: Start with Jupyter Notebook, then learn VS Code once you are proficient
Quick Start: Three Ways to Use Jupyter
Method 1: Online (Easiest, Zero Configuration)
Visit this website's Python environment; no software installation is needed!
Method 2: Google Colab (Free + Cloud)
- Visit colab.research.google.com
- Sign in with Google account
- Click "New Notebook"
- Start coding!
Advantages:
- Completely free
- Provides free GPU (suitable for deep learning)
- Direct access to Google Drive
Method 3: Local Installation (Recommended for Long-term Use)
Step 1: Install Anaconda
Anaconda is a scientific computing distribution for Python that includes Jupyter Notebook and common libraries.
Download: anaconda.com/download
Post-installation Check:
# Run in Terminal (Mac/Linux) or Anaconda Prompt (Windows)
jupyter --version
Step 2: Launch Jupyter Notebook
# Run in terminal
jupyter notebook
Browser will automatically open http://localhost:8888
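If `jupyter --version` fails or the browser does not open, a quick check from Python can show whether the launcher and the `notebook` package are visible to the interpreter in use. This is a minimal sketch using only the standard library; the function name `check_jupyter_setup` is illustrative.

```python
import importlib.util
import shutil
import sys


def check_jupyter_setup():
    """Report whether the Jupyter launcher and notebook package can be found."""
    return {
        # Full path of the `jupyter` command on PATH, or None if it is missing
        "jupyter_command": shutil.which("jupyter"),
        # True if the `notebook` package is importable from this interpreter
        "notebook_package": importlib.util.find_spec("notebook") is not None,
        # Interpreter running this check; helpful when several Pythons are installed
        "python_executable": sys.executable,
    }


report = check_jupyter_setup()
for name, value in report.items():
    print(f"{name}: {value}")
```

If `jupyter_command` is `None` while Anaconda is installed, the terminal is probably using a different Python than the one Anaconda provides.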
Jupyter Notebook Basic Operations
1. Create New Notebook
- Click "New" → "Python 3" in upper right
- Notebook opens, default name "Untitled"
- Click title to rename to "my_first_analysis"
2. Cell Types
Jupyter has two main cell types:
(1) Code Cell
# Write Python code in code cells
x = 10
y = 20
print(x + y)
Execution Methods:
- Press Shift + Enter: Run and jump to next cell
- Press Ctrl + Enter: Run but stay in current cell
Output:
30
(2) Markdown Cell (Text Explanations)
Switch method: Select the cell, press Esc to enter command mode, then press M
# This is a Heading
This is plain text
**Bold** and *Italic*
- List item 1
- List item 2
3. Essential Shortcuts
| Shortcut | Function |
|---|---|
| Shift + Enter | Run current cell, jump to next |
| Ctrl + Enter | Run current cell, stay |
| A | Insert new cell above |
| B | Insert new cell below |
| D + D (press D twice) | Delete current cell |
| M | Convert to Markdown cell |
| Y | Convert to code cell |
| Ctrl + S | Save notebook |
Hands-on: Complete Data Analysis with Jupyter
Example: Analyzing Student Grade Data
Cell 1: Import Libraries and Create Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create sample data
data = {
    'student_id': range(1, 21),
    'math_score': [85, 72, 90, 68, 88, 75, 92, 70, 85, 78,
                   95, 65, 88, 72, 90, 77, 85, 70, 92, 80],
    'study_hours': [10, 5, 12, 3, 11, 6, 13, 4, 10, 7,
                    14, 2, 11, 5, 12, 7, 10, 4, 13, 8],
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F',
               'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
}
df = pd.DataFrame(data)
df.head()  # Display first 5 rows
Cell 2: Descriptive Statistics
print("📊 Basic Statistics:")
print(df[['math_score', 'study_hours']].describe())
print("\n📈 Statistics by Gender:")
print(df.groupby('gender').agg({
    'math_score': ['mean', 'std'],
    'study_hours': ['mean', 'std']
}))
Cell 3: Visualization
# Set font to avoid character display issues
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS'] # Mac
# plt.rcParams['font.sans-serif'] = ['SimHei'] # Windows
# Scatter plot
plt.figure(figsize=(10, 5))
# Subplot 1: Score vs Study Hours
plt.subplot(1, 2, 1)
plt.scatter(df['study_hours'], df['math_score'], alpha=0.6)
plt.xlabel('Study Hours per Week')
plt.ylabel('Math Score')
plt.title('Score vs Study Hours')
# Subplot 2: Score Distribution
plt.subplot(1, 2, 2)
plt.hist(df['math_score'], bins=8, edgecolor='black', alpha=0.7)
plt.xlabel('Math Score')
plt.ylabel('Frequency')
plt.title('Score Distribution')
plt.tight_layout()
plt.show()
Cell 4: Regression Analysis
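The regression in this cell is an ordinary least-squares fit, so its slope, intercept, and R² can be cross-checked by hand. The sketch below is a minimal standard-library version that reuses the 20 observations from Cell 1; the function name `ols_fit` and the `_check` variable names are illustrative and not part of the original analysis.

```python
from statistics import mean

# Same values as the 'study_hours' and 'math_score' columns in Cell 1
study_hours = [10, 5, 12, 3, 11, 6, 13, 4, 10, 7,
               14, 2, 11, 5, 12, 7, 10, 4, 13, 8]
math_score = [85, 72, 90, 68, 88, 75, 92, 70, 85, 78,
              95, 65, 88, 72, 90, 77, 85, 70, 92, 80]


def ols_fit(x, y):
    """Return (slope, intercept, r_squared) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    slope = s_xy / s_xx                      # covariance over variance of x
    intercept = y_bar - slope * x_bar        # line passes through the two means
    r_squared = s_xy ** 2 / (s_xx * s_yy)    # squared Pearson correlation
    return slope, intercept, r_squared


slope_check, intercept_check, r2_check = ols_fit(study_hours, math_score)
print(f"Math Score = {intercept_check:.2f} + {slope_check:.2f} * Study Hours")
print(f"R² = {r2_check:.3f}")
```

The printed coefficients should agree with the `stats.linregress` output up to rounding; a mismatch usually means the two runs used different data.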
from scipy import stats
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(
    df['study_hours'], df['math_score']
)
print(f"📊 Regression Equation: Math Score = {intercept:.2f} + {slope:.2f} * Study Hours")
print(f" R² = {r_value**2:.3f}")
print(f" p-value = {p_value:.4f}")
# Interpretation: each additional study hour is associated with a score that is `slope` points higher
if p_value < 0.05:
    print(" ✅ Result is significant (p < 0.05)")
else:
    print(" ❌ Result is not significant (p >= 0.05)")
Advanced Features of Jupyter
1. Magic Commands
# View all magic commands
%lsmagic
Common magic commands:
| Command | Function |
|---|---|
| %time | Time a single line of code |
| %%time | Time entire cell execution |
| %matplotlib inline | Display charts in notebook |
| %load file.py | Load external Python file |
| %who | List all variables |
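Magic commands are an IPython feature, so they are not available in a plain `.py` script. As a rough standard-library counterpart to the timing magics, the sketch below uses `time.perf_counter` for a one-off measurement and `timeit.repeat` for repeated runs; the helper `sum_to_n` and the repeat counts are arbitrary choices for illustration.

```python
import time
import timeit


def sum_to_n(n):
    """Sum the integers from 1 to n inclusive."""
    return sum(range(1, n + 1))


# One-off wall-clock measurement, comparable to the wall time %time reports
start = time.perf_counter()
total = sum_to_n(1_000_000)
elapsed = time.perf_counter() - start
print(f"total={total}, elapsed={elapsed:.4f} s")

# Repeated measurement for a steadier estimate: 3 rounds of 10 calls each
runs = timeit.repeat(lambda: sum_to_n(100_000), repeat=3, number=10)
print(f"best round: {min(runs) / 10:.6f} s per call")
```

Taking the minimum over several rounds reduces the influence of other processes running on the same machine.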
Example:
%%time
# Calculate the sum of the integers from 1 to 1,000,000
total = sum(range(1, 1000001))
print(total)
2. Display Multiple Outputs
# Normally, only the last expression result is displayed
df.head()
df.tail() # Only this will display
# Solution: Use display()
from IPython.display import display
display(df.head())
display(df.tail()) # Both will display
3. Styled DataFrames
# More beautiful DataFrame display
df.style.background_gradient(cmap='viridis', subset=['math_score'])
Jupyter Best Practices
1. Organization Structure Recommendations
# Cell 1: Import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Cell 2: Set parameters
plt.rcParams['figure.figsize'] = (10, 6)
pd.set_option('display.max_columns', None)
# Cell 3: Load data
df = pd.read_csv('data.csv')
# Cell 4-N: Step-by-step analysis
# Each cell does one thing
2. Add Markdown Explanations
# Data Analysis: Student Grade Research
## 1. Research Question
We want to know: Does study time affect math scores?
## 2. Data Source
Virtual student survey data (n=20)
## 3. Analysis Methods
- Descriptive statistics
- Correlation analysis
- Linear regression
3. Restart Kernel
If code raises unexpected errors or variables hold stale values, restart the kernel:
Menu Bar → Kernel → Restart & Clear Output
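Variables get confused because the kernel keeps a single namespace alive while cells can be run in any order and any number of times. The sketch below simulates this with three "cells" stored as strings and executed in one shared dictionary. It is an illustration of the effect, not a description of how Jupyter is implemented, and the cell contents are made up.

```python
# Three notebook "cells", modeled as source strings (illustrative only)
cells = {
    1: "scores = [85, 72, 90]",
    2: "scores = [s + 5 for s in scores]   # add a 5-point curve",
    3: "average = sum(scores) / len(scores)",
}


def run(order):
    """Execute the cells in the given order inside one shared namespace."""
    namespace = {}  # plays the role of the kernel's memory
    for number in order:
        exec(cells[number], namespace)
    return namespace["average"]


clean_run = run([1, 2, 3])        # top to bottom, each cell once
rerun_cell_2 = run([1, 2, 2, 3])  # cell 2 accidentally executed twice

print(f"clean run: {clean_run:.2f}")
print(f"cell 2 run twice: {rerun_cell_2:.2f}")
```

The second result is 5 points higher even though the code on screen is identical. Restarting the kernel and running all cells from the top restores the clean result.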
4. Notebook Naming Convention
A Common Naming Format:
01_data_cleaning.ipynb
02_exploratory_analysis.ipynb
03_regression_models.ipynb
04_robustness_checks.ipynb
Why Use Number Prefixes?
- Ensures clear execution order
- Facilitates team collaboration
- Aligns with academic research workflow
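To make the "clear execution order" point concrete, the sketch below recovers the run order from the file names alone. The regular expression encodes one possible reading of the convention (two digits, an underscore, a lowercase snake_case name); that pattern and the function name `execution_order` are assumptions for illustration.

```python
import re

# File names from the convention above, deliberately shuffled
notebooks = [
    "03_regression_models.ipynb",
    "01_data_cleaning.ipynb",
    "04_robustness_checks.ipynb",
    "02_exploratory_analysis.ipynb",
]

# Assumed pattern: two digits, underscore, lowercase snake_case name, .ipynb
PATTERN = re.compile(r"^(\d{2})_[a-z0-9_]+\.ipynb$")


def execution_order(names):
    """Return names sorted by numeric prefix; raise if a name breaks the pattern."""
    keyed = []
    for name in names:
        match = PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected notebook name: {name}")
        keyed.append((int(match.group(1)), name))
    return [name for _, name in sorted(keyed)]


ordered = execution_order(notebooks)
print("\n".join(ordered))
```

Zero-padding the prefix matters: without it, a file browser would list `10_...` before `2_...` because names are compared character by character.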
5. Cell Atomicity Principle
❌ Wrong Approach (One cell does too much):
# Not recommended: All steps mixed together
df = pd.read_csv('data.csv')
df = df.dropna()
df['log_income'] = np.log(df['income'])
result = df.groupby('education')['log_income'].mean()
plt.bar(result.index, result.values)
plt.show()
model = smf.ols('log_income ~ education + age', data=df).fit()
print(model.summary())
✅ Correct Approach (One function per cell):
# Cell 1: Data loading
df = pd.read_csv('data.csv')
print(f"Loaded {len(df)} observations")
# Cell 2: Data cleaning
df_clean = df.dropna().copy()  # explicit copy, so adding a column in Cell 3 cannot trigger a pandas copy warning
print(f"Removed {len(df) - len(df_clean)} rows with missing values")
# Cell 3: Feature engineering
df_clean['log_income'] = np.log(df_clean['income'])
# Cell 4: Descriptive statistics
result = df_clean.groupby('education')['log_income'].mean()
display(result)
# Cell 5: Visualization
plt.bar(result.index, result.values)
plt.title('Mean Log Income by Education')
plt.show()
# Cell 6: Regression analysis
import statsmodels.formula.api as smf  # usually placed in the first import cell
model = smf.ols('log_income ~ education + age', data=df_clean).fit()
print(model.summary())
6. Version Control Friendly Settings
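Outputs confuse Git diffs because an `.ipynb` file is JSON that stores each cell's outputs and execution counter next to its source. The sketch below clears both by hand on a tiny inline notebook. It is a simplified illustration of the idea behind tools such as nbstripout, not their actual implementation, and the inline notebook is a made-up example in nbformat 4 layout.

```python
import json

# A tiny notebook in nbformat-4 layout, written inline for illustration
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy of the notebook with code-cell outputs and counters cleared."""
    cleaned = json.loads(json.dumps(nb))  # deep copy via a JSON round trip
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


stripped = strip_outputs(notebook)
print(json.dumps(stripped["cells"][1], indent=1))
```

After stripping, re-running a notebook without editing it no longer changes the file, so commits show only real changes to code and text.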
# Add at notebook beginning
import warnings
warnings.filterwarnings('ignore')
# Fix random seed (ensure reproducibility)
np.random.seed(42)
# Clear outputs (prevent Git diff confusion)
# Use nbstripout: pip install nbstripout
# Setup: nbstripout --install
Jupyter Workflow in Academic Research
Standard Research Project Structure
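The folder layout listed in this section can be created in one step rather than by hand. The sketch below is a minimal standard-library version; the function name `scaffold` is illustrative, and the demo writes into a temporary directory so that nothing is created in the working folder.

```python
from pathlib import Path
import tempfile

# Sub-directories taken from the layout in this section
SUBDIRS = [
    "data/raw",
    "data/processed",
    "notebooks",
    "scripts",
    "outputs/figures",
    "outputs/tables",
]


def scaffold(root):
    """Create the project folders plus empty README.md and requirements.txt."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    for filename in ("README.md", "requirements.txt"):
        (root / filename).touch(exist_ok=True)
    return root


# Demonstrated in a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold(Path(tmp) / "research_project")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print("\n".join(created))
```

To create a real project, call `scaffold("research_project")` from the directory where the project should live.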
research_project/
├── data/
│   ├── raw/                        # Raw data (read-only)
│   │   └── survey_2023.csv
│   └── processed/                  # Cleaned data
│       └── clean_survey.csv
├── notebooks/
│   ├── 01_data_cleaning.ipynb      # Data cleaning
│   ├── 02_descriptive_stats.ipynb  # Descriptive statistics
│   ├── 03_main_regression.ipynb    # Main regression
│   ├── 04_robustness.ipynb         # Robustness checks
│   └── 05_heterogeneity.ipynb      # Heterogeneity analysis
├── scripts/
│   └── helper_functions.py         # Reusable functions
├── outputs/
│   ├── figures/
│   │   └── fig1_scatter.png
│   └── tables/
│       └── tab1_summary.tex
├── README.md
└── requirements.txt
Export LaTeX Tables from Notebook
# Export pandas table to LaTeX
summary_stats = df.describe()
latex_code = summary_stats.to_latex(
    caption='Descriptive Statistics',
    label='tab:desc_stats',
    float_format="%.2f"
)
# Save to file
with open('../outputs/tables/table1.tex', 'w') as f:
    f.write(latex_code)
print("Table saved to outputs/tables/table1.tex")
Export High-Resolution Figures from Notebook
# Set publication-quality figure parameters
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 12
plt.rcParams['font.family'] = 'serif'
# Plot
plt.figure(figsize=(8, 6))
plt.scatter(df['education'], df['income'], alpha=0.6)
plt.xlabel('Years of Education')
plt.ylabel('Annual Income ($)')
plt.title('Education and Income Relationship')
# Save in multiple formats
plt.savefig('../outputs/figures/fig1_scatter.png', bbox_inches='tight', dpi=300)
plt.savefig('../outputs/figures/fig1_scatter.pdf', bbox_inches='tight') # For papers
plt.show()
Jupyter Extensions and Plugins
JupyterLab Extensions (Recommended)
# Install JupyterLab (upgraded Notebook)
pip install jupyterlab
# Launch JupyterLab
jupyter lab
Essential Extensions
# 1. Code Formatter (Black)
pip install jupyterlab-code-formatter black
# Only needed on JupyterLab 2 and earlier (JupyterLab 3+ picks up the pip package directly):
# jupyter labextension install @ryantam626/jupyterlab_code_formatter
# 2. Variable Inspector
pip install lckr-jupyterlab-variableinspector
# 3. Table of Contents (built into JupyterLab 3 and later; install separately only on older versions)
pip install jupyterlab-toc
# 4. Git Integration
pip install jupyterlab-git
Jupyter Notebook Extensions
# Install nbextensions (for the classic Notebook interface, version 6 and earlier)
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
# After launch, enable in Nbextensions tab:
# - Table of Contents
# - ExecuteTime (show execution time)
# - Autopep8 (code formatting)
# - Variable Inspector
Advanced Tips
1. Parallel Computing (For Big Data)
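One detail of the chunked pattern in this section deserves care: when chunks contain different numbers of rows per category, averaging the per-chunk means does not give the overall mean. A common remedy is to collect per-chunk sums and counts and divide at the end. The sketch below shows the difference on two small hypothetical chunks; the data and the helper name `summarize` are made up for illustration.

```python
import pandas as pd

# Hypothetical data split into two chunks of unequal size
chunk_a = pd.DataFrame({"category": ["x", "x", "x", "y"],
                        "value": [1.0, 2.0, 3.0, 10.0]})
chunk_b = pd.DataFrame({"category": ["x", "y"],
                        "value": [10.0, 20.0]})


def summarize(chunk):
    """Per-category sum and count, which can be added up across chunks."""
    return chunk.groupby("category")["value"].agg(["sum", "count"])


# Combine partial sums and counts, then divide once
partials = pd.concat([summarize(chunk_a), summarize(chunk_b)])
totals = partials.groupby(level=0).sum()
exact_mean = totals["sum"] / totals["count"]

# For comparison: averaging per-chunk means weights every chunk equally
mean_of_means = pd.concat(
    [c.groupby("category")["value"].mean() for c in (chunk_a, chunk_b)]
).groupby(level=0).mean()

print(exact_mean)
print(mean_of_means)
```

For category `x` the two approaches disagree, because the first chunk holds three of its four observations. The same `summarize` function could be passed to `delayed(...)` in place of a function that returns means.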
from joblib import Parallel, delayed
import multiprocessing
def process_chunk(chunk):
    # Process single data chunk
    return chunk.groupby('category')['value'].mean()
# Read large file in chunks
chunks = pd.read_csv('large_file.csv', chunksize=10000)
# Parallel processing
n_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=n_cores)(
    delayed(process_chunk)(chunk) for chunk in chunks
)
# Merge results (one row per category per chunk; these are per-chunk means, not overall means)
final_result = pd.concat(results)
2. Progress Bar (For Long-Running Operations)
from tqdm.notebook import tqdm
# Display progress bar in loops
results = []
for i in tqdm(range(1000), desc="Processing"):
    # Simulate time-consuming operation (some_function stands in for your own code)
    result = some_function(i)
    results.append(result)
3. Interactive Visualization (Plotly)
import plotly.express as px
# Create interactive scatter plot
fig = px.scatter(df, x='education', y='income',
                 color='gender', size='age',
                 hover_data=['country'],
                 title='Income by Education (Interactive)')
fig.show()
# Advantage: Can zoom, hover to view data points
4. Automated Report Generation (Papermill)
# Install papermill
pip install papermill
# Batch run notebooks (parameterized)
papermill input_template.ipynb output_2023.ipynb \
    -p year 2023 \
    -p country "USA"
5. Memory Monitoring
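The `del` plus `gc.collect()` step in this section frees memory only once no other name still refers to the object. The sketch below demonstrates this in CPython with a weak reference that observes the object without keeping it alive; the `Dataset` class is a hypothetical stand-in for a large DataFrame.

```python
import gc
import weakref


class Dataset:
    """Stand-in for a large object such as a DataFrame (hypothetical class)."""

    def __init__(self, n_rows):
        self.rows = list(range(n_rows))


large = Dataset(100_000)
backup = large                 # a second name bound to the same object
tracker = weakref.ref(large)   # observes the object without keeping it alive

del large                      # removes one name only
gc.collect()
alive_after_first_del = tracker() is not None

del backup                     # the last remaining reference is gone
gc.collect()
alive_after_second_del = tracker() is not None

print(alive_after_first_del, alive_after_second_del)
```

In a notebook, extra references can hide in places such as the output history (`Out`, `_`), so a displayed DataFrame may stay in memory after `del` until those references are cleared or the kernel is restarted.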
# Check variable memory usage
%whos
# Check DataFrame memory usage
df.info(memory_usage='deep')
# Delete unnecessary variables
del large_dataframe
import gc
gc.collect()
Debugging Techniques
1. Use IPython Debugger
# Insert breakpoint where error occurs
import pdb; pdb.set_trace()
# Execution will pause here, can inspect variables
# Debug commands:
# - n (next): next line
# - c (continue): continue execution
# - q (quit): exit debugger
# - p variable: print variable value
2. Display Full Error Information
# Show detailed error stack
%xmode Verbose
# Restore default
%xmode Context
3. Time Performance Profiling
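The profiling magics in this section only work inside IPython. For a plain script, the standard-library `cProfile` and `pstats` modules give a comparable report. The sketch below profiles a deliberately naive function; `slow_mean` and `analysis` are made-up names used only as something to measure.

```python
import cProfile
import io
import pstats


def slow_mean(values):
    """Deliberately naive mean, used only as something to profile."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def analysis():
    """Build some data and compute its mean."""
    data = list(range(200_000))
    return slow_mean(data)


profiler = cProfile.Profile()
profiler.enable()
result = analysis()
profiler.disable()

# Format the five most expensive entries, sorted by cumulative time
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the outermost slow call at the top, which is usually the quickest way to find where a long-running cell spends its time.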
# Profile function performance
%prun df.groupby('category').agg({'value': ['mean', 'std']})
# Line-by-line profiling (requires line_profiler)
%load_ext line_profiler
%lprun -f my_function my_function(df)
Jupyter vs Stata/R
| Feature | Stata | R (RStudio) | Jupyter Notebook |
|---|---|---|---|
| Interactive Execution | ✅ | ✅ | ✅ |
| Embedded Charts | ❌ | ✅ (R Markdown) | ✅ |
| Mix Text and Code | ❌ | ✅ (R Markdown) | ✅ |
| Online Collaboration | ❌ | ✅ (RStudio Cloud) | ✅ (Colab) |
| Learning Curve | Easy | Medium | Easy |
Frequently Asked Questions
Q1: What's the difference between Jupyter Notebook and JupyterLab?
Answer:
- Jupyter Notebook: Classic interface, concise
- JupyterLab: Next-generation interface, more powerful (multi-tabs, terminal, file manager)
Recommendation: Beginners start with Notebook, switch to Lab when proficient
Q2: How to share Jupyter Notebook?
Method 1: Export as HTML
- File → Download as → HTML
Method 2: Upload to GitHub
- GitHub automatically renders .ipynb files
Method 3: Use nbviewer
- Visit nbviewer.jupyter.org
- Enter GitHub link
Q3: Why don't charts display?
Solution: Run at notebook beginning
%matplotlib inline
Practical Exercises
Exercise 1: Create Your First Analysis Notebook
- Create new notebook, name it "income_analysis"
- Create the following data:
data = {
    'country': ['USA', 'China', 'India', 'Brazil', 'UK'],
    'gdp_per_capita': [65000, 12000, 2500, 9000, 45000],
    'population': [330, 1400, 1380, 213, 67]  # in millions
}
- Calculate:
- Total GDP for each country (GDP per capita × population)
- Average GDP per capita
- Draw bar chart
Exercise 2: Use Markdown
Add Markdown cells to notebook containing:
- Title
- Research question
- Data source description
Next Steps
In the next section, we will learn VS Code Configuration, a more professional development environment suitable for large projects.
Keep moving forward!