class: middle, title-slide .cols[ .col-2-3[ # Learning by Sampling and Optimization ## CDS DS 595 ### Siddharth Mishra-Sharma Mon, February 2, 2026 [smsharma.io/teaching/ds595-ai4science](https://smsharma.io/teaching/ds595-ai4science.html) ] .col-1-3[ .width-80[] .small[Arianna Rosenbluth, 1927–2020] ] ] .small[📄 [Notes](https://bu-ds595.github.io/course-materials-spring26/notes/02-mcmc-vi.pdf)] --- # Logistics 1. **Slides** (for this lecture) and **notes** (for the next few lectures) are up on the website 2. **Lab 2** is up: [bu-ds595/lab02-starter](https://github.com/bu-ds595/lab02-starter) - Message with a link to clone the assignment and get started coming later today 3. **Assignment 1** will be out Wednesday - Design your own MCMC algorithms! --- # Historical interlude... .cols[ .col-1-2[ .width-100[] ] .col-1-2[ ] ] -- .abs-right-half[ .width-90[] ] --- # Last time: Forward and inverse problems .center.width-50[] .center[ .eq-box[ $$p(\theta \mid x) = \frac{p(x \mid \theta) \, p(\theta)}{p(x)}$$ ] ] .center.small.muted[Bayes' theorem provides the recipe for inference.] --- # You have* the posterior. Now what? .cols[ .col-1-2[ We want to ask questions like: - What is the expected value of the parameter? $$\color{#CC3311}{\mathbb{E}[\theta \mid x] = \int \theta \, p(\theta \mid x) \, d\theta}$$ - What is a plausible range of parameter values? $$\color{#009988}{\Pr(\theta \in [a, b] \mid x) = \int_a^b p(\theta \mid x) \, d\theta}$$ ] .col-1-2.center[ .width-100[] ] ] .footnote[*can evaluate it at any $\theta$] --- # We can't do these integrals For simple models (like the Gaussian example last time), these integrals have closed-form solutions. For almost any interesting model, they don't. We can **evaluate** the posterior at any point, but we cannot **integrate** it analytically. .center.width-40[] .center.small[Grid evaluation with 20 points per dimension] --- # Two strategies .center.width-70[] 1. **MCMC:** Generate samples from the posterior, replace integrals with averages 2. 
**VI:** Approximate the posterior with a simpler distribution --- # Sampling is just averaging If we had samples from the posterior, we could replace integrals with averages. -- Given samples $\theta\_1, \theta\_2, \ldots, \theta\_N \sim p(\theta \mid x)$: $$\mathbb{E}[f(\theta) \mid x] = \int f(\theta) \, p(\theta \mid x) \, d\theta \;\approx\; \frac{1}{N} \sum\_{i=1}^N f(\theta\_i)$$ -- **But how do we get samples from a distribution we can only evaluate up to a normalizing constant?** We can compute $p(x \mid \theta) p(\theta)$ for any $\theta$, but we don't know $p(x)$. --- class: center, middle, section-slide # The Metropolis Algorithm --- # Exploring a probability landscape .center.width-40[] --- # The Metropolis algorithm .algorithm[ **1. Propose:** From current position $\theta$, propose a move to $\theta' \sim \mathcal{N}(\theta, \sigma^2 I)$ **2. Accept or reject:** Compute acceptance probability: $$\alpha = \min\left(1, \frac{p(\theta' \mid x)}{p(\theta \mid x)}\right)$$ - If new spot has higher probability: always accept - If lower: accept with probability $\alpha$ **3. Record:** If accepted, move to $\theta'$; otherwise stay at $\theta$. Record position. ] --- # The algorithm in action .center.width-90[] .small.muted.center[Accepted proposals (orange) move the chain; rejected proposals (red dashed) leave it in place. The histogram converges to the posterior.] --- # Why does this work? The acceptance ratio only involves $p(\theta' \mid x) / p(\theta \mid x)$. Since both share the same normalizing constant $p(x)$, it cancels: $$\frac{p(\theta' \mid x)}{p(\theta \mid x)} = \frac{p(x \mid \theta') \, p(\theta')}{p(x \mid \theta) \, p(\theta)}$$ .highlight[ We only need the **unnormalized posterior**—likelihood times prior. No normalizing constant needed! ] --- # Why does the acceptance ratio work? In equilibrium, probability flow between any two states balances out. 
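The whole algorithm is only a few lines of code. A minimal sketch (1-D, with a standard-Gaussian stand-in for the unnormalized log-posterior — illustrative, not the lab starter code) makes the balance easy to check empirically: the time spent at each $\theta$ reproduces the target's mean and spread.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prob(theta):
    # Stand-in unnormalized log-target: a standard Gaussian.
    return -0.5 * theta**2

def metropolis(log_prob, theta0, sigma, n_steps):
    theta = theta0
    samples = []
    for _ in range(n_steps):
        theta_prop = theta + sigma * rng.normal()           # 1. Propose
        log_alpha = log_prob(theta_prop) - log_prob(theta)  # 2. Accept or reject
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_prop
        samples.append(theta)                               # 3. Record (even on rejection)
    return np.array(samples)

samples = metropolis(log_prob, theta0=0.0, sigma=1.0, n_steps=20_000)
print(samples.mean(), samples.std())  # close to the target's 0 and 1
```

Working with log-probabilities (comparing $\log \alpha$ against $\log u$) avoids numerical underflow when the target is tiny.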
.center.width-60[] The net flow between A and B balances out: **time spent at each $\theta$ is proportional to $p(\theta \mid x)$**. --- # Convergence to the target .center.width-90[] No matter where we start, the chain converges to the **target distribution**. --- class: center, middle, section-slide # Tuning and Convergence --- # The step size tradeoff The proposal distribution has a free parameter: step size $\sigma$. .center.width-70[] .small.muted.center[(a) Too small: slow exploration. (b) Too large: most proposals rejected. (c) Well-tuned.] **Rule of thumb:** aim for ~25% acceptance rate in high dimensions. Why not 100%? --- # Autocorrelation MCMC samples are correlated—each sample is close to the previous one. .center.width-60[] **Effective sample size:** $N\_{\text{eff}} = N / \tau$, where $\tau$ is the timescale of correlations. --- # Why random walks struggle In high dimensions, probability concentrates in a thin shell: the **typical set**. To explore, we must traverse the shell. But random walks propose moves in **all directions**—most point off the shell. This results in small steps, slow exploration. .center.width-70[] --- class: center, middle, section-slide # Hamiltonian Monte Carlo --- # Using gradient information The posterior has structure—the gradient $\nabla \log p(\theta \mid x)$ tells us which way is "uphill." .center.width-70[] .small.muted.center[Left: Random walk proposes uniformly. Right: Gradient points toward higher probability; perpendicular directions follow contours.] --- # The physics picture .cols[ .col-1-2[ Height = $-\log p(\theta \mid x)$ Total energy is conserved: $$H = \underbrace{U(\theta)}\_{\text{potential}} + \underbrace{K(\rho)}\_{\text{kinetic}}$$ Ball speeds up in valleys (high prob), slows down on ridges (low prob). ] .col-1-2[ .center.width-100[] ] ] --- # The Hamiltonian Augment parameters $\theta$ with momentum $\rho$. 
The Hamiltonian (total energy) is: $$H(\theta, \rho) = \underbrace{-\log p(\theta \mid x)}\_{\text{potential energy}} + \underbrace{\frac{1}{2}\rho^\top \rho}\_{\text{kinetic energy}}$$ -- Hamilton's equations give the dynamics: $$\frac{d\theta}{dt} = \rho \quad \text{(position changes according to momentum)}$$ $$\frac{d\rho}{dt} = \nabla \log p(\theta \mid x) \quad \text{(momentum changes according to gradient)}$$ --- # HMC algorithm .cols[ .col-1-2[ **1. Sample momentum:** Draw $\rho \sim \mathcal{N}(0, I)$ **2. Leapfrog:** Integrate $\frac{d\theta}{dt} = \rho$, $\frac{d\rho}{dt} = \nabla \log p(\theta \mid x)$ for $L$ steps of size $\epsilon$ **3. Accept/reject:** Accept proposal with prob $\min(1, e^{-\Delta H})$ .small[ **Why ΔH?** We're doing Metropolis on the joint $(\theta, \rho)$ space: $$e^{-\Delta H} = \frac{p(\theta', \rho')}{p(\theta, \rho)}$$ If leapfrog were exact, $\Delta H = 0$ → always accept. The Metropolis step corrects for discretization error. ] ] .col-1-2[ .center.width-90[] ] ] --- # Random walk vs. HMC .center.width-70[] .small.muted.center[Left: Metropolis diffuses slowly. Right: HMC makes long, directed moves.] --- # HMC scaling HMC achieves autocorrelation time $\tau$ **nearly independent of dimension**. Compare to random walk: $\tau \propto d$. The cost is computing gradients. But **automatic differentiation** makes this trivial! .cols[ .col-2-3[
```python
import jax

def log_prob(theta):
    return log_likelihood(theta) + log_prior(theta)

# Automatic gradient - that's it!
grad_log_prob = jax.grad(log_prob)
```
] .col-1-3[ .center.width-60[] ] ] --- class: center, middle, section-slide # Diagnostics --- # Checking convergence We can never **prove** MCMC has converged, but failures can often be detected. .center.width-90[] .cols[ .col-1-3[ **Trace plots:** Should look like stationary noise, not trending. ] .col-1-3[ **$N\_{\text{eff}}$:** Effective sample size after accounting for correlation. 
] .col-1-3[ **$\hat{R}$:** Run multiple chains; check they agree. Want $\hat{R} < 1.01$. ] ] --- # Visualizing posteriors: corner plots .cols[ .col-1-2[ A grid showing: - 1D marginals (diagonal) - 2D marginals (off-diagonal) Reveals degeneracies, multimodality, and non-Gaussianity that summary statistics hide. ] .col-1-2[ .center.width-80[] ] ] .small.muted.center[Planck satellite: cosmological parameters from the cosmic microwave background.] --- class: center, middle, section-slide # Variational Inference --- # A different approach: turn inference into optimization MCMC gives exact samples but can be slow—especially for models with many parameters. **Variational inference (VI)** trades exactness for speed by approximating the posterior with a simpler distribution. .center.width-70[] --- # Measuring closeness: KL divergence The Kullback–Leibler divergence measures how different two distributions are: $$\text{KL}(q \| p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid x)} \, d\theta$$ -- .center.width-80[] .small.muted.center[When $q$ places mass where $p$ is nearly zero, KL is large.] --- # The geometry of VI .cols[ .col-1-2[ Choose a family $\mathcal{Q}$ of tractable distributions (e.g., Gaussians). **Goal:** Find $q^* = \arg\min\_{q \in \mathcal{Q}} \text{KL}(q \| p)$ **Problem:** Computing KL requires $p(x)$—the intractable integral we're trying to avoid! ] .col-1-2[ .center.width-100[] ] ] --- # What we can and can't compute ✓ **Can compute:** The joint $p(x, \theta) = p(x \mid \theta) p(\theta)$ for any $\theta$. ✗ **Cannot compute:** The marginal $p(x) = \int p(x, \theta) \, d\theta$. The ELBO lets us optimize using only the joint. 
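Concretely, we can estimate $\mathbb{E}\_q[\log p(x, \theta) - \log q(\theta)]$ with samples from $q$, touching only the joint. A minimal sketch on an illustrative 1-D conjugate-Gaussian model (not from the slides) — chosen because its evidence *is* tractable, so we can verify the bound numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0  # single observation (illustrative)

def log_joint(theta):
    # log p(x, theta) = log N(x; theta, 1) + log N(theta; 0, 1)
    return (-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi)
            - 0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi))

def elbo(mu, sigma, n=100_000):
    theta = mu + sigma * rng.normal(size=n)  # theta ~ q = N(mu, sigma^2)
    log_q = (-0.5 * ((theta - mu) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return np.mean(log_joint(theta) - log_q)

# Marginal likelihood is tractable here: x ~ N(0, 2)
log_evidence = -0.25 * x**2 - 0.5 * np.log(2 * np.pi * 2)

print(elbo(0.5, np.sqrt(0.5)))  # equals log p(x): q is the exact posterior, so KL = 0
print(elbo(0.0, 1.0))           # mismatched q: strictly below log p(x)
print(log_evidence)
```

For this model the exact posterior is $\mathcal{N}(x/2, 1/2)$; plugging it in as $q$ makes the integrand constant and the ELBO hit $\log p(x)$ exactly, while any other $q$ falls short by exactly $\text{KL}(q \| p)$.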
--- # The key identity Rearranging the KL divergence: .center[ .eq-box[ $\text{KL}(q \| p) = \log p(x) - \text{ELBO}(q)$ ] ] where we define the **Evidence Lower Bound**: $$\text{ELBO}(q) = \mathbb{E}\_q[\log p(x, \theta)] - \mathbb{E}\_q[\log q(\theta)]$$ Since $\text{KL} \geq 0$, we have $\text{ELBO}(q) \leq \log p(x)$ — hence the name "lower bound". --- # Two forces in the ELBO $$\text{ELBO}(q) = \underbrace{\mathbb{E}\_{q}[\log p(x \mid \theta)]}\_{\text{expected log-likelihood}} - \underbrace{\text{KL}(q(\theta) \| p(\theta))}\_{\text{divergence from prior}}$$ .center.width-60[] --- # Why the ELBO works Since $\log p(x)$ is constant with respect to $q$: .center[ .eq-box[ $$\arg\max\_q \text{ELBO}(q) = \arg\min\_q \text{KL}(q \| p)$$ ] ] Maximizing the ELBO is equivalent to minimizing the KL divergence. .highlight[ The ELBO only involves the joint $p(x, \theta)$—no normalizing constant needed! ] --- # Example: banana posterior Consider a curved, non-Gaussian posterior: $$p(\theta\_1, \theta\_2 \mid x) \propto \exp\left( -\frac{1}{2}\theta\_1^2 - \frac{1}{2}(\theta\_2 - \theta\_1^2)^2 \right)$$ -- Approximate with a **mean-field** Gaussian: $$q(\theta) = \mathcal{N}(\theta\_1; \mu\_1, \sigma\_1^2) \cdot \mathcal{N}(\theta\_2; \mu\_2, \sigma\_2^2)$$ A product of independent Gaussians, one per parameter. 
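The optimization itself is short. A minimal sketch, assuming JAX (as used earlier in the deck): plain gradient descent on a Monte Carlo estimate of the negative ELBO via the reparameterization trick, with illustrative step size, sample count, and iteration budget.

```python
import jax
import jax.numpy as jnp

def log_post(theta):
    # Unnormalized banana log-posterior from the slide.
    t1, t2 = theta
    return -0.5 * t1**2 - 0.5 * (t2 - t1**2) ** 2

def neg_elbo(params, key, n=256):
    # ELBO (up to the constant log p(x)) = E_q[log p] + H[q], estimated by
    # reparameterization: theta = mu + sigma * eps with eps ~ N(0, I).
    mu, log_sigma = params
    eps = jax.random.normal(key, (n, 2))
    theta = mu + jnp.exp(log_sigma) * eps
    entropy = jnp.sum(log_sigma + 0.5 * jnp.log(2 * jnp.pi * jnp.e))
    return -(jnp.mean(jax.vmap(log_post)(theta)) + entropy)

grad_fn = jax.jit(jax.grad(neg_elbo))
params = (jnp.zeros(2), jnp.zeros(2))  # (mu, log sigma), one pair per parameter
key = jax.random.PRNGKey(0)
for _ in range(3000):
    key, sub = jax.random.split(key)
    grads = grad_fn(params, sub)
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)

mu, log_sigma = params
print(mu, jnp.exp(log_sigma))
```

By symmetry $\mu\_1 \to 0$, and the fitted standard deviations land below the true marginal spreads ($1$ and $\sqrt{3}$) — the mean-field underestimation of uncertainty discussed under limitations.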
--- # VI on the banana .center.width-80[] --- # Limitations of (mean-field) VI .cols[ .col-1-2[ **Mean-field approximation:** - Forces independence between parameters - Typically **underestimates uncertainty** **Mode-seeking behavior:** - $\text{KL}(q \| p)$ penalizes mass where $p$ is small - Multimodal posteriors → concentrates on one mode .highlight[ More expressive families: full-covariance Gaussians, distributions parameterized by neural networks (e.g., normalizing flows) ] ] .col-1-2[ .center.width-100[] ] ] --- # Comparison | | **MCMC** | **VI** | |---|---|---| | Output | Samples | Approximate distribution | | Accuracy | Exact (asymptotically) | Limited by family $\mathcal{Q}$ | | Speed | Can be slow | Fast | | Multimodality | Can explore all modes | Mode-seeking | | Uncertainty | Well-calibrated | Often underestimates | | Scalability | $O(N)$ per sample | Amortizable | --- # Takeaways .cols[ .col-1-2[ .center[**MCMC**] .center.width-107[] - Generate samples → replace integrals with averages - HMC uses gradients to scale ] .col-1-2[ .center[**Variational Inference**] .center.width-60[] - Approximate posterior with simpler distribution - max ELBO = min KL ] ]
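--- # Appendix: leapfrog in code The leapfrog integrator behind HMC also fits in a few lines. A minimal NumPy sketch (the standard-Gaussian target and its hand-coded gradient are stand-ins for any differentiable log-posterior and its autodiff gradient):

```python
import numpy as np

def grad_log_prob(theta):
    # Stand-in: gradient of a standard-Gaussian log-target.
    return -theta

def leapfrog(theta, rho, eps, L):
    # Symplectic integration of d(theta)/dt = rho, d(rho)/dt = grad log p.
    rho = rho + 0.5 * eps * grad_log_prob(theta)   # opening half step in momentum
    for _ in range(L - 1):
        theta = theta + eps * rho                  # full step in position
        rho = rho + eps * grad_log_prob(theta)     # full step in momentum
    theta = theta + eps * rho
    rho = rho + 0.5 * eps * grad_log_prob(theta)   # closing half step in momentum
    return theta, rho

# Delta H stays small even over long trajectories, so the Metropolis
# correction min(1, exp(-Delta H)) accepts nearly every proposal.
theta, rho = np.array([3.0]), np.array([1.0])
H0 = 0.5 * theta @ theta + 0.5 * rho @ rho
theta_new, rho_new = leapfrog(theta, rho, eps=0.1, L=100)
H1 = 0.5 * theta_new @ theta_new + 0.5 * rho_new @ rho_new
print(abs(H1 - H0))  # small compared to H0
```

.small.muted[Note the energy check uses the quadratic potential $U(\theta) = \frac{1}{2}\theta^\top\theta$ matching the stand-in target.]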