class: middle, title-slide .cols[ .col-2-3[ # Learning by Sampling and Optimization ## CDS DS 595 ### Siddharth Mishra-Sharma Mon, February 2, 2026 [smsharma.io/teaching/ds595-ai4science](https://smsharma.io/teaching/ds595-ai4science.html) ] .col-1-3[ .width-80[] .small[Arianna Rosenbluth, 1927–2020] ] ] .small[📄 [Notes](https://bu-ds595.github.io/course-materials-spring26/notes/02-mcmc-vi.pdf)] --- # Logistics 1. **Slides** (for this lecture) and **notes** (for the next few lectures) are up on the website 2. **Lab 2** is up: [bu-ds595/lab02-starter](https://github.com/bu-ds595/lab02-starter) - Message with a link to clone the assignment and get started coming later today 3. **Assignment 1** will be out Wednesday - Design your own MCMC algorithms! --- # Historical interlude... .cols[ .col-1-2[ .width-100[] ] .col-1-2[ ] ] -- .abs-right-half[ .width-90[] ] --- # Last time: Forward and inverse problems .center.width-50[] .center[ .eq-box[ $$p(\theta \mid x) = \frac{p(x \mid \theta) \, p(\theta)}{p(x)}$$ ] ] .center.small.muted[Bayes' theorem provides the recipe for inference.] --- # You have* the posterior. Now what? .cols[ .col-1-2[ We want to ask questions like: - What is the expected value of the parameter? $$\color{#CC3311}{\mathbb{E}[\theta \mid x] = \int \theta \, p(\theta \mid x) \, d\theta}$$ - What is a plausible range of parameter values? $$\color{#009988}{\Pr(\theta \in [a, b] \mid x) = \int_a^b p(\theta \mid x) \, d\theta}$$ ] .col-1-2.center[ .width-100[] ] ] .footnote[*can evaluate it at any $\theta$] --- # We can't do these integrals For simple models (like the Gaussian example last time), these integrals have closed-form solutions. For almost any interesting model, they don't. We can **evaluate** the posterior at any point, but we cannot **integrate** it analytically. .center.width-40[] .center.small[Grid evaluation with 20 points per dimension] --- # Two strategies .center.width-70[] 1. **MCMC:** Generate samples from the posterior, replace integrals with averages 2. 
**VI:** Approximate the posterior with a simpler distribution --- # Sampling is just averaging If we had samples from the posterior, we could replace integrals with averages. -- Given samples $\theta\_1, \theta\_2, \ldots, \theta\_N \sim p(\theta \mid x)$: $$\mathbb{E}[f(\theta) \mid x] = \int f(\theta) \, p(\theta \mid x) \, d\theta \;\approx\; \frac{1}{N} \sum\_{i=1}^N f(\theta\_i)$$ -- **But how do we get samples from a distribution we can only evaluate up to a normalizing constant?** We can compute $p(x \mid \theta) p(\theta)$ for any $\theta$, but we don't know $p(x)$. --- class: center, middle, section-slide # The Metropolis Algorithm --- # Exploring a probability landscape .center.width-40[] --- # The Metropolis algorithm .algorithm[ **1. Propose:** From current position $\theta$, propose a move to $\theta' \sim \mathcal{N}(\theta, \sigma^2 I)$ **2. Accept or reject:** Compute acceptance probability: $$\alpha = \min\left(1, \frac{p(\theta' \mid x)}{p(\theta \mid x)}\right)$$ - If new spot has higher probability: always accept - If lower: accept with probability $\alpha$ **3. Record:** If accepted, move to $\theta'$; otherwise stay at $\theta$. Record position. ] --- # The algorithm in action .center.width-90[] .small.muted.center[Accepted proposals (orange) move the chain; rejected proposals (red dashed) leave it in place. The histogram converges to the posterior.] --- # Why does this work? The acceptance ratio only involves $p(\theta' \mid x) / p(\theta \mid x)$. Since both share the same normalizing constant $p(x)$, it cancels: $$\frac{p(\theta' \mid x)}{p(\theta \mid x)} = \frac{p(x \mid \theta') \, p(\theta')}{p(x \mid \theta) \, p(\theta)}$$ .highlight[ We only need the **unnormalized posterior**—likelihood times prior. No normalizing constant needed! ] --- # Why does the acceptance ratio work? In equilibrium, probability flow between any two states balances out. 
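The whole algorithm is only a few lines of code. A minimal sketch (1-D, with a standard-Gaussian stand-in for the unnormalized log-posterior — illustrative, not the lab starter code) makes the balance easy to check empirically: the time spent at each $\theta$ reproduces the target's mean and spread.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prob(theta):
    # Stand-in unnormalized log-target: a standard Gaussian.
    return -0.5 * theta**2

def metropolis(log_prob, theta0, sigma, n_steps):
    theta = theta0
    samples = []
    for _ in range(n_steps):
        theta_prop = theta + sigma * rng.normal()           # 1. Propose
        log_alpha = log_prob(theta_prop) - log_prob(theta)  # 2. Accept or reject
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_prop
        samples.append(theta)                               # 3. Record (even on rejection)
    return np.array(samples)

samples = metropolis(log_prob, theta0=0.0, sigma=1.0, n_steps=20_000)
print(samples.mean(), samples.std())  # close to the target's 0 and 1
```

Working with log-probabilities (comparing $\log \alpha$ against $\log u$) avoids numerical underflow when the target is tiny.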
.center.width-60[] The net flow between A and B balances out: **time spent at each $\theta$ is proportional to $p(\theta \mid x)$**. --- # Convergence to the target .center.width-90[] No matter where we start, the chain converges to the **target distribution**. --- class: center, middle, section-slide # Tuning and Convergence --- # The step size tradeoff The proposal distribution has a free parameter: step size $\sigma$. .center.width-70[] .small.muted.center[(a) Too small: slow exploration. (b) Too large: most proposals rejected. (c) Well-tuned.] **Rule of thumb:** aim for ~25% acceptance rate in high dimensions. Why not 100%? --- # Autocorrelation MCMC samples are correlated—each sample is close to the previous one. .center.width-60[] **Effective sample size:** $N\_{\text{eff}} = N / \tau$, where $\tau$ is the timescale of correlations. --- # Why random walks struggle In high dimensions, probability concentrates in a thin shell: the **typical set**. To explore, we must traverse the shell. But random walks propose moves in **all directions**—most point off the shell. This results in small steps, slow exploration. .center.width-70[] --- class: center, middle, section-slide # Hamiltonian Monte Carlo --- # Using gradient information The posterior has structure—the gradient $\nabla \log p(\theta \mid x)$ tells us which way is "uphill." .center.width-70[] .small.muted.center[Left: Random walk proposes uniformly. Right: Gradient points toward higher probability; perpendicular directions follow contours.] --- # The physics picture .cols[ .col-1-2[ Height = $-\log p(\theta \mid x)$ Total energy is conserved: $$H = \underbrace{U(\theta)}\_{\text{potential}} + \underbrace{K(\rho)}\_{\text{kinetic}}$$ Ball speeds up in valleys (high prob), slows down on ridges (low prob). ] .col-1-2[ .center.width-100[] ] ] --- # The Hamiltonian Augment parameters $\theta$ with momentum $\rho$. 
The Hamiltonian (total energy) is: $$H(\theta, \rho) = \underbrace{-\log p(\theta \mid x)}\_{\text{potential energy}} + \underbrace{\frac{1}{2}\rho^\top \rho}\_{\text{kinetic energy}}$$ -- Hamilton's equations give the dynamics: $$\frac{d\theta}{dt} = \rho \quad \text{(position changes according to momentum)}$$ $$\frac{d\rho}{dt} = \nabla \log p(\theta \mid x) \quad \text{(momentum changes according to gradient)}$$ --- # HMC algorithm .cols[ .col-1-2[ **1. Sample momentum:** Draw $\rho \sim \mathcal{N}(0, I)$ **2. Leapfrog:** Integrate $\frac{d\theta}{dt} = \rho$, $\frac{d\rho}{dt} = \nabla \log p(\theta \mid x)$ for $L$ steps of size $\epsilon$ **3. Accept/reject:** Accept proposal with prob $\min(1, e^{-\Delta H})$ .small[ **Why ΔH?** We're doing Metropolis on the joint $(\theta, \rho)$ space: $$e^{-\Delta H} = \frac{p(\theta', \rho')}{p(\theta, \rho)}$$ If leapfrog were exact, $\Delta H = 0$ → always accept. The Metropolis step corrects for discretization error. ] ] .col-1-2[ .center.width-90[] ] ] --- # Random walk vs. HMC .center.width-70[] .small.muted.center[Left: Metropolis diffuses slowly. Right: HMC makes long, directed moves.] --- # HMC scaling HMC achieves autocorrelation time $\tau$ **nearly independent of dimension**. Compare to random walk: $\tau \propto d$. The cost is computing gradients. But **automatic differentiation** makes this trivial! .cols[ .col-2-3[
```python
import jax

def log_prob(theta):
    return log_likelihood(theta) + log_prior(theta)

# Automatic gradient - that's it!
grad_log_prob = jax.grad(log_prob)
```
] .col-1-3[ .center.width-60[] ] ] --- class: center, middle, section-slide # Diagnostics --- # Checking convergence We can never **prove** MCMC has converged, but failures can often be detected. .center.width-90[] .cols[ .col-1-3[ **Trace plots:** Should look like stationary noise, not trending. ] .col-1-3[ **$N\_{\text{eff}}$:** Effective sample size after accounting for correlation. 
] .col-1-3[ **$\hat{R}$:** Run multiple chains; check they agree. Want $\hat{R} < 1.01$. ] ] --- # Visualizing posteriors: corner plots .cols[ .col-1-2[ A grid showing: - 1D marginals (diagonal) - 2D marginals (off-diagonal) Reveals degeneracies, multimodality, and non-Gaussianity that summary statistics hide. ] .col-1-2[ .center.width-80[] ] ] .small.muted.center[Planck satellite: cosmological parameters from the cosmic microwave background.] --- class: center, middle, section-slide # Variational Inference --- # A different approach: turn inference into optimization MCMC gives exact samples but can be slow—especially for models with many parameters. **Variational inference (VI)** trades exactness for speed by approximating the posterior with a simpler distribution. .center.width-70[] --- # Measuring closeness: KL divergence The Kullback–Leibler divergence measures how different two distributions are: $$\text{KL}(q \| p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid x)} \, d\theta$$ -- .center.width-80[] .small.muted.center[When $q$ places mass where $p$ is nearly zero, KL is large.] --- # The geometry of VI .cols[ .col-1-2[ Choose a family $\mathcal{Q}$ of tractable distributions (e.g., Gaussians). **Goal:** Find $q^* = \arg\min\_{q \in \mathcal{Q}} \text{KL}(q \| p)$ **Problem:** Computing KL requires $p(x)$—the intractable integral we're trying to avoid! ] .col-1-2[ .center.width-100[] ] ] --- # What we can and can't compute ✓ **Can compute:** The joint $p(x, \theta) = p(x \mid \theta) p(\theta)$ for any $\theta$. ✗ **Cannot compute:** The marginal $p(x) = \int p(x, \theta) \, d\theta$. The ELBO lets us optimize using only the joint. 
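Concretely, we can estimate $\mathbb{E}\_q[\log p(x, \theta) - \log q(\theta)]$ with samples from $q$, touching only the joint. A minimal sketch on an illustrative 1-D conjugate-Gaussian model (not from the slides) — chosen because its evidence *is* tractable, so we can verify the bound numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0  # single observation (illustrative)

def log_joint(theta):
    # log p(x, theta) = log N(x; theta, 1) + log N(theta; 0, 1)
    return (-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi)
            - 0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi))

def elbo(mu, sigma, n=100_000):
    theta = mu + sigma * rng.normal(size=n)  # theta ~ q = N(mu, sigma^2)
    log_q = (-0.5 * ((theta - mu) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return np.mean(log_joint(theta) - log_q)

# Marginal likelihood is tractable here: x ~ N(0, 2)
log_evidence = -0.25 * x**2 - 0.5 * np.log(2 * np.pi * 2)

print(elbo(0.5, np.sqrt(0.5)))  # equals log p(x): q is the exact posterior, so KL = 0
print(elbo(0.0, 1.0))           # mismatched q: strictly below log p(x)
print(log_evidence)
```

For this model the exact posterior is $\mathcal{N}(x/2, 1/2)$; plugging it in as $q$ makes the integrand constant and the ELBO hit $\log p(x)$ exactly, while any other $q$ falls short by exactly $\text{KL}(q \| p)$.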
--- # The key identity Rearranging the KL divergence: .center[ .eq-box[ $\text{KL}(q \| p) = \log p(x) - \text{ELBO}(q)$ ] ] where we define the **Evidence Lower Bound**: $$\text{ELBO}(q) = \mathbb{E}\_q[\log p(x, \theta)] - \mathbb{E}\_q[\log q(\theta)]$$ Since $\text{KL} \geq 0$, we have $\text{ELBO}(q) \leq \log p(x)$ — hence the name "lower bound". --- # Two forces in the ELBO $$\text{ELBO}(q) = \underbrace{\mathbb{E}\_{q}[\log p(x \mid \theta)]}\_{\text{expected log-likelihood}} - \underbrace{\text{KL}(q(\theta) \| p(\theta))}\_{\text{divergence from prior}}$$ .center.width-60[] --- # Why the ELBO works Since $\log p(x)$ is constant with respect to $q$: .center[ .eq-box[ $$\arg\max\_q \text{ELBO}(q) = \arg\min\_q \text{KL}(q \| p)$$ ] ] Maximizing the ELBO is equivalent to minimizing the KL divergence. .highlight[ The ELBO only involves the joint $p(x, \theta)$—no normalizing constant needed! ] --- # Example: banana posterior Consider a curved, non-Gaussian posterior: $$p(\theta\_1, \theta\_2 \mid x) \propto \exp\left( -\frac{1}{2}\theta\_1^2 - \frac{1}{2}(\theta\_2 - \theta\_1^2)^2 \right)$$ -- Approximate with a **mean-field** Gaussian: $$q(\theta) = \mathcal{N}(\theta\_1; \mu\_1, \sigma\_1^2) \cdot \mathcal{N}(\theta\_2; \mu\_2, \sigma\_2^2)$$ A product of independent Gaussians, one per parameter. 
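The optimization itself is short. A minimal sketch, assuming JAX (as used earlier in the deck): plain gradient descent on a Monte Carlo estimate of the negative ELBO via the reparameterization trick, with illustrative step size, sample count, and iteration budget.

```python
import jax
import jax.numpy as jnp

def log_post(theta):
    # Unnormalized banana log-posterior from the slide.
    t1, t2 = theta
    return -0.5 * t1**2 - 0.5 * (t2 - t1**2) ** 2

def neg_elbo(params, key, n=256):
    # ELBO (up to the constant log p(x)) = E_q[log p] + H[q], estimated by
    # reparameterization: theta = mu + sigma * eps with eps ~ N(0, I).
    mu, log_sigma = params
    eps = jax.random.normal(key, (n, 2))
    theta = mu + jnp.exp(log_sigma) * eps
    entropy = jnp.sum(log_sigma + 0.5 * jnp.log(2 * jnp.pi * jnp.e))
    return -(jnp.mean(jax.vmap(log_post)(theta)) + entropy)

grad_fn = jax.jit(jax.grad(neg_elbo))
params = (jnp.zeros(2), jnp.zeros(2))  # (mu, log sigma), one pair per parameter
key = jax.random.PRNGKey(0)
for _ in range(3000):
    key, sub = jax.random.split(key)
    grads = grad_fn(params, sub)
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)

mu, log_sigma = params
print(mu, jnp.exp(log_sigma))
```

By symmetry $\mu\_1 \to 0$, and the fitted standard deviations land below the true marginal spreads ($1$ and $\sqrt{3}$) — the mean-field underestimation of uncertainty discussed under limitations.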
--- # VI on the banana .center.width-80[] --- # Limitations of (mean-field) VI .cols[ .col-1-2[ **Mean-field approximation:** - Forces independence between parameters - Typically **underestimates uncertainty** **Mode-seeking behavior:** - $\text{KL}(q \| p)$ penalizes mass where $p$ is small - Multimodal posteriors → concentrates on one mode .highlight[ More expressive families: full-covariance Gaussians, distributions parameterized by neural networks (e.g., normalizing flows) ] ] .col-1-2[ .center.width-100[] ] ] --- # Comparison | | **MCMC** | **VI** | |---|---|---| | Output | Samples | Approximate distribution | | Accuracy | Exact (asymptotically) | Limited by family $\mathcal{Q}$ | | Speed | Can be slow | Fast | | Multimodality | Can explore all modes | Mode-seeking | | Uncertainty | Well-calibrated | Often underestimates | | Scalability | $O(N)$ per sample | Amortizable | --- # Takeaways .cols[ .col-1-2[ .center[**MCMC**] .center.width-107[] - Generate samples → replace integrals with averages - HMC uses gradients to scale ] .col-1-2[ .center[**Variational Inference**] .center.width-60[] - Approximate posterior with simpler distribution - max ELBO = min KL ] ]
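--- # Appendix: leapfrog in code The leapfrog integrator behind HMC also fits in a few lines. A minimal NumPy sketch (the standard-Gaussian target and its hand-coded gradient are stand-ins for any differentiable log-posterior and its autodiff gradient):

```python
import numpy as np

def grad_log_prob(theta):
    # Stand-in: gradient of a standard-Gaussian log-target.
    return -theta

def leapfrog(theta, rho, eps, L):
    # Symplectic integration of d(theta)/dt = rho, d(rho)/dt = grad log p.
    rho = rho + 0.5 * eps * grad_log_prob(theta)   # opening half step in momentum
    for _ in range(L - 1):
        theta = theta + eps * rho                  # full step in position
        rho = rho + eps * grad_log_prob(theta)     # full step in momentum
    theta = theta + eps * rho
    rho = rho + 0.5 * eps * grad_log_prob(theta)   # closing half step in momentum
    return theta, rho

# Delta H stays small even over long trajectories, so the Metropolis
# correction min(1, exp(-Delta H)) accepts nearly every proposal.
theta, rho = np.array([3.0]), np.array([1.0])
H0 = 0.5 * theta @ theta + 0.5 * rho @ rho
theta_new, rho_new = leapfrog(theta, rho, eps=0.1, L=100)
H1 = 0.5 * theta_new @ theta_new + 0.5 * rho_new @ rho_new
print(abs(H1 - H0))  # small compared to H0
```

.small.muted[Note the energy check uses the quadratic potential $U(\theta) = \frac{1}{2}\theta^\top\theta$ matching the stand-in target.]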