class: middle, title-slide .cols[ .col-2-3[ # Probabilistic Reasoning ## CDS DS 595 ### Siddharth Mishra-Sharma Wed, January 29, 2026 [smsharma.io/teaching/ds595-ai4science](https://smsharma.io/teaching/ds595-ai4science.html) ] .col-1-3[ .width-80[] ] ] .small[📄 [Notes](https://bu-ds595.github.io/course-materials-spring26/notes/01-probabilistic-reasoning.pdf)] --- # Why probabilistic reasoning? .center[] .small.muted.center[Binary black hole merger / Credit: LIGO] --- # Why probabilistic reasoning? .muted[We'd want to go from this...] .cols[ .col-1-2[ .center[] .center.small.muted[LIGO] ] .col-1-2[ .center[] .center.small.muted[Gravitational wave waveform] ] ] --- # Why probabilistic reasoning? .muted[... to this] .cols[ .col-1-2[ .center[] .center.small.muted[Sky localization] ] .col-1-2[ .center[] .center.small.muted[Masses, spins, distances, ...] ] ] --- # Why probabilistic reasoning? Discoveries have a high standard for confirmation! .center.width-50[] --- class: center, middle, section-slide # Notation and Probability Basics --- # Notation We'll use consistent notation throughout: | Symbol | Meaning | Examples | |--------|---------|----------| | $x$ | Observed data | GW strain signal, particle collisions | | $\theta$ | Parameters we want to learn | Black hole masses, Higgs mass | | $z$ | Latent variables | Detector calibration, noise parameters | -- **Probabilistic notation:** - $x \sim p(x)$ means "$x$ is drawn from distribution $p$" - $x \sim \mathcal{N}(\mu, \sigma^2)$ means "$x$ is Gaussian with mean $\mu$, variance $\sigma^2$" --- # Density vs. probability .cols[ .col-1-2[ For continuous variables, $p(x)$ is a **probability density** (pdf), not a probability. To get actual probabilities, integrate: $$P(a \le x \le b) = \int\_a^b p(x)\\,dx$$ ] .col-1-2[ .center[] ] ] --- # Conditional probability .cols[ .col-2-3[ The probability of $A$ given that we know $B$: $$p(A \mid B) = \frac{p(A, B)}{p(B)}$$ **Intuition:** conditioning on $B$ restricts our universe to outcomes where $B$ occurred. Within that restricted universe, $p(A \mid B)$ is the fraction where $A$ also holds. ] .col-1-3[ .center[] ] ] --- # Chain rule Any joint distribution factors as a product of conditionals: $$p(x\_1, x\_2, x\_3) = p(x\_1) \\, p(x\_2 \mid x\_1) \\, p(x\_3 \mid x\_1, x\_2)$$ -- More generally: $$p(x\_1, \ldots, x\_n) = \prod\_{i=1}^{n} p(x\_i \mid x\_1, \ldots, x\_{i-1})$$ .highlight[ This is how autoregressive models (like GPT) generate sequences: each token conditioned on all previous tokens. ] --- # Independence $A$ and $B$ are **independent** iff any of these equivalent statements hold: $$p(A \mid B) = p(A)$$ $$p(B \mid A) = p(B)$$ $$p(A, B) = p(A)p(B)$$ -- For independent observations: $$p(x\_1, x\_2, \ldots, x\_N \mid \theta) = \prod\_i p(x\_i \mid \theta)$$ --- # Marginalization .cols[ .col-1-2[ If we don't care about some variable, integrate it out: $$p(x) = \int p(x, z)\\, dz$$ .small.muted[Joint distribution $p(x,y)$ with marginals $p(x)$ (top) and $p(y)$ (right)] ] .col-1-2[ .center.width-90[] ] ] --- # Product rule Joint probabilities factor in two ways: $$p(x, z) = p(x \mid z) p(z) = p(z \mid x) p(x)$$ This symmetry gives us Bayes' theorem. 
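--- # Checking these rules in code

The product rule, marginalization, and the Bayes inversion that follows are easy to verify on a small discrete example. A minimal NumPy sketch, using a made-up $2 \times 2$ joint table purely for illustration:

```python
import numpy as np

# Made-up joint distribution p(A, B); rows index A, columns index B.
p_joint = np.array([[0.10, 0.30],
                    [0.20, 0.40]])

# Marginalization: sum out the variable we don't care about.
p_A = p_joint.sum(axis=1)  # p(A)
p_B = p_joint.sum(axis=0)  # p(B)

# Product rule: p(A | B) = p(A, B) / p(B), and p(B | A) = p(A, B) / p(A).
p_A_given_B = p_joint / p_B            # each column conditions on a value of B
p_B_given_A = p_joint / p_A[:, None]   # each row conditions on a value of A

# Bayes' theorem recovers p(A | B) from the other factorization.
assert np.allclose(p_B_given_A * p_A[:, None] / p_B, p_A_given_B)
```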
--- # Bayes' theorem .cols[ .col-2-3[ From the product rule: $p(A, B) = p(A \mid B) p(B) = p(B \mid A) p(A)$ .center[ .eq-box[ $$p(A \mid B) = \frac{p(B \mid A) \\, p(A)}{p(B)}$$ ] ] - $p(A)$ — **prior** belief about $A$ - $p(B)$ — probability of **evidence** $B$ - $p(A \mid B)$ — **posterior** belief about $A$ given evidence $B$ - $p(B \mid A)$ — **likelihood** of $B$ given $A$ ] .col-1-3[ .center[] .small.muted.center[Thomas Bayes (1701–1761)] ] ] --- # Why does this matter for science? In science, we can often model the **forward** direction but want to go **backward**: -- Bayes' theorem **inverts** the forward model: $$p(\text{cause} \mid \text{effect}) = \frac{p(\text{effect} \mid \text{cause}) \\, p(\text{cause})}{p(\text{effect})}$$ .highlight[ We build models (forward). Bayes lets us use them for inference (backward). ] --- # Bayes' theorem for inference .center.small[ .eq-box[ $$p(\theta \mid x) = \frac{p(x \mid \theta) \\, p(\theta)}{p(x)}$$ ] ] -- | Term | Name | Role | |------|------|------| | $p(\theta)$ | **Prior** | What we believed before | | $p(x \mid \theta)$ | **Likelihood** | How probable is data given $\theta$? | | $p(\theta \mid x)$ | **Posterior** | What we believe after seeing data | | $p(x)$ | **Evidence** | Normalizing constant | --- class: center, middle, section-slide # Forward vs. Inverse Problems --- # Forward vs. inverse .center.width-70[] .small.muted.center[Forward: sample $x \sim p(x \mid \theta)$. Inverse: characterize $p(\theta \mid x)$.] --- # Forward processes easy -- just sample $x \sim p(x \mid \theta)$ .center[] --- # But inference is hard... need to get $p(\theta \mid x)$ .center[] --- class: center, middle, section-slide # It's all about the likelihood function! .huge[$p(x \mid \theta)$] --- # Example: fitting a line .center.width-50[] .small.muted.center[Each measurement has Gaussian uncertainty $\sigma$. Gray dashed: true line.] --- # Example: fitting a line We measure $N$ data points $\\{x\_i\\}$ at known inputs $\\{t\_i\\}$. We believe the true relationship is $x = mt + c$, but measurements have noise. -- **The generative story:** 1. There are true parameters $\theta = (m, c)$ 2. For each input $t\_i$, the "true" value is $\mu\_i = m t\_i + c$ 3. We observe $x\_i = \mu\_i + \epsilon\_i$ where $\epsilon\_i \sim \mathcal{N}(0, \sigma^2)$ -- In probabilistic notation: $$x\_i \sim \mathcal{N}(m t\_i + c, \sigma^2)$$ --- # The likelihood $$p(x \mid \theta)$$ The likelihood encodes the **forward model**: given parameters, predict the data distribution. -- **Maximum likelihood:** Find parameters that make the data most probable: $$\hat{\theta}\_{\text{MLE}} = \arg\max\_\theta p(x \mid \theta) = \arg\max\_\theta \log p(x \mid \theta)$$ In practice, work with log-likelihoods. --- # MLE for line fitting $$\ell(m, c) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum\_{i=1}^N (x\_i - mt\_i - c)^2$$ The first term is constant in $(m, c)$. -- Maximizing $\ell$ is equivalent to minimizing: $$\sum\_i (x\_i - mt\_i - c)^2$$ .highlight[ Ordinary least squares is MLE for Gaussian noise. ] --- # MLE fit .center.width-50[] .small.muted.center[Orange: MLE fit. Gray dashed: true line.] 
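--- # MLE in code

The generative story above translates directly into a few lines of NumPy. This is a minimal sketch with made-up true values for $(m, c, \sigma)$; for Gaussian noise the MLE coincides with ordinary least squares, so `np.polyfit` gives it in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the generative story (true m, c, sigma chosen for illustration).
m_true, c_true, sigma = 2.0, 1.0, 0.5
t = np.linspace(0.0, 1.0, 20)
x = m_true * t + c_true + rng.normal(0.0, sigma, size=t.shape)

def log_likelihood(m, c):
    """Gaussian log-likelihood of the data under the line x = m*t + c."""
    resid = x - (m * t + c)
    return -0.5 * len(x) * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2

# For Gaussian noise, maximizing the log-likelihood is ordinary least squares.
m_hat, c_hat = np.polyfit(t, x, deg=1)
print(m_hat, c_hat, log_likelihood(m_hat, c_hat))
```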
--- # Likelihood ratios .cols[ .col-1-2[ Compare two hypotheses: $$\Lambda = \frac{p(x \mid \theta\_1)}{p(x \mid \theta\_0)}$$ .small.muted[Likelihood-ratio tests are the basis of frequentist $p$-values (but we won't go into details...)] ] .col-1-2[  ] ] --- class: center, middle, section-slide # The Posterior Distribution --- # Bayes' theorem revisited .center[ .eq-box[ $$\underbrace{p(\theta \mid x)}\_{\text{posterior}} = \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(x)}\_{\text{evidence}}}$$ ] ] .highlight[ The posterior is the **complete answer**. Unlike MLE (a point), the posterior is a distribution. ] --- # Visualizing Bayesian inference .center.width-80[] .small.muted.center[Prior (left) encodes initial beliefs. Likelihood (middle) peaks where data are probable. Posterior (right) combines both.] --- # Sequential updating When $x\_2$ arrives, yesterday's posterior becomes today's prior: $p(\theta \mid x\_1, x\_2) \propto p(x\_2 \mid \theta) \cdot p(\theta \mid x\_1)$ .center.width-80[] --- # Line fitting posterior With a flat prior on $(m, c)$, the posterior is analytic: $$p(m, c \mid \\{x\_i\\}, \\{t\_i\\}) \propto \exp\left( -\frac{1}{2\sigma^2} \sum\_{i=1}^N (x\_i - mt\_i - c)^2 \right)$$ --- # The full posterior .center.width-90[] --- # Parameter degeneracy .cols[ .col-1-2[ The joint posterior can reveal structure that marginals miss. .small.muted[Corner plot: marginals (diagonal) and joint posterior (off-diagonal) with 68% and 95% credible regions.] ] .col-1-2[ .width-80[] ] ] --- class: center, middle, section-slide # Model Comparison --- # The evidence and Bayes factors The marginal likelihood (or "evidence"): $$p(x \mid \mathcal{M}) = \int p(x \mid \theta, \mathcal{M}) \\, p(\theta \mid \mathcal{M})\\, d\theta$$ -- **Bayes factors** compare models: $$\text{BF}\_{12} = \frac{p(x \mid \mathcal{M}\_1)}{p(x \mid \mathcal{M}\_2)}$$ $\text{BF}\_{12} = 10$ means: "The data are 10 times more probable under model 1 than under model 2." --- # Linear vs. cubic .center.width-80[] .small.muted.center[(a) Both models fit reasonably. (b) Evidence favors the simpler linear model ($\Delta \log Z \approx 2.3$, BF $\approx 10$).] --- # Automatic Occam's razor .cols[ .col-1-2[ **Complex models** spread probability over many datasets → low probability for each. **Simple models** concentrate probability → high probability for the datasets they do predict well. We'll come back to this when discussing neural network inductive biases! ] .col-1-2[ .center.width-70[] ] ] .highlight[ Bayesian model comparison penalizes complexity automatically—no ad hoc penalties needed. ] --- class: center, middle, section-slide # ML Through a Probabilistic Lens --- # (Almost) Every ML task is about a probability distribution | Task | What you learn | |------|----------------| | Classification | $p(y \mid x)$, $y$ discrete | | Regression | $p(y \mid x)$, $y$ continuous | | Density estimation | $p(x)$ | | Inverse problem / inference | $p(\theta \mid x)$ | | Anomaly detection | Low $p(x)$ regions | | Generation | Sample $x \sim p(x)$ | --- # Loss functions are likelihoods Every loss function encodes probabilistic assumptions. As we saw, **MSE is just the negative log-likelihood of a Gaussian**. -- **Cross-entropy:** For classification, we model $p(y \mid x)$ as a Categorical distribution.
-- If the true label is $y = k$ and the model predicts probabilities $\hat{\pi}\_1, \ldots, \hat{\pi}\_K$: $$p(y = k) = \hat{\pi}\_k \quad \Rightarrow \quad \log p(y = k) = \log \hat{\pi}\_k$$ -- Negative log-likelihood over the dataset: $$\mathcal{L}\_{\text{CE}} = -\sum\_i \log \hat{\pi}\_{i, y\_i} = -\sum\_i \sum\_k y\_{ik} \log \hat{\pi}\_{ik}$$ where $\hat{\pi}\_{ik}$ is the model's predicted probability of class $k$ for sample $i$, and $y\_{ik} = 1$ if sample $i$ has label $k$ and $0$ otherwise. --- # Takeaways **The likelihood is a story** about how data came to be. .center.width-60[] **Bayes' theorem lets us invert:** $p(\theta \mid x) \propto p(x \mid \theta) p(\theta)$ --- # The computational challenge .cols[ .col-1-2[ Sounds easy, right? In practice, $x$, $z$, and $\theta$ can all be high-dimensional and horribly complicated, and the likelihood (forward model) can be practically intractable. We'll discuss a host of methods to wrangle these distributions! ] .col-1-2[ .center[.width-100[]] .center[.small[Grid evaluation with 20 points per dimension]] ] ]
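--- # Sketch: brute-force posterior on a grid

To make the scaling problem concrete, here is a minimal NumPy sketch that evaluates the line-fitting posterior on a grid (the toy data, grid ranges, and resolution are illustrative choices, not taken from the figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy line-fitting setup as before (values chosen for illustration).
m_true, c_true, sigma = 2.0, 1.0, 0.5
t = np.linspace(0.0, 1.0, 20)
x = m_true * t + c_true + rng.normal(0.0, sigma, size=t.shape)

# Grid over (m, c); with a flat prior, log posterior = log likelihood + const.
m_grid = np.linspace(0.0, 4.0, 200)
c_grid = np.linspace(-1.0, 3.0, 200)
M, C = np.meshgrid(m_grid, c_grid, indexing="ij")
resid = x - (M[..., None] * t + C[..., None])      # shape (200, 200, 20)
log_post = -0.5 * np.sum(resid**2, axis=-1) / sigma**2

# Normalize on the grid (subtract the max first for numerical stability).
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Marginals: sum over the other parameter's axis.
p_m = post.sum(axis=1)  # posterior over m on the m grid
p_c = post.sum(axis=0)  # posterior over c on the c grid

# A grid with k points per dimension costs k**d likelihood evaluations:
# fine in 2D, hopeless for the high-dimensional problems later in the course.
```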