class: middle, title-slide .cols[ .col-2-3[ # Probabilistic Reasoning ## CDS DS 595 ### Siddharth Mishra-Sharma Wed, January 29, 2026 [smsharma.io/teaching/ds595-ai4science](https://smsharma.io/teaching/ds595-ai4science.html) ] .col-1-3[ .width-80[] ] ] .small[📄 [Notes](https://bu-ds595.github.io/course-materials-spring26/notes/01-probabilistic-reasoning.pdf)] --- # Why probabilistic reasoning? .center[] .small.muted.center[Binary black hole merger / Credit: LIGO] --- # Why probabilistic reasoning? .muted[We'd want to go from this...] .cols[ .col-1-2[ .center[] .center.small.muted[LIGO] ] .col-1-2[ .center[] .center.small.muted[Gravitational wave waveform] ] ] --- # Why probabilistic reasoning? .muted[... to this] .cols[ .col-1-2[ .center[] .center.small.muted[Sky localization] ] .col-1-2[ .center[] .center.small.muted[Masses, spins, distances, ...] ] ] --- # Why probabilistic reasoning? Discoveries have a high standard for confirmation! .center.width-50[] --- class: center, middle, section-slide # Notation and Probability Basics --- # Notation We'll use consistent notation throughout: | Symbol | Meaning | Examples | |--------|---------|----------| | $x$ | Observed data | GW strain signal, particle collisions | | $\theta$ | Parameters we want to learn | Black hole masses, Higgs mass | | $z$ | Latent variables | Detector calibration, noise parameters | -- **Probabilistic notation:** - $x \sim p(x)$ means "$x$ is drawn from distribution $p$" - $x \sim \mathcal{N}(\mu, \sigma^2)$ means "$x$ is Gaussian with mean $\mu$, variance $\sigma^2$" --- # Density vs. probability .cols[ .col-1-2[ For continuous variables, $p(x)$ is a **probability density** (pdf), not a probability. To get actual probabilities, integrate: $$P(a \le x \le b) = \int\_a^b p(x)\\,dx$$ ] .col-1-2[ .center[] ] ] --- # Conditional probability .cols[ .col-2-3[ The probability of $A$ given that we know $B$: $$p(A \mid B) = \frac{p(A, B)}{p(B)}$$ **Intuition:** conditioning on $B$ restricts our universe to outcomes where $B$ occurred. Within that restricted universe, $p(A \mid B)$ is the fraction where $A$ also holds. ] .col-1-3[ .center[] ] ] --- # Chain rule Any joint distribution factors as a product of conditionals: $$p(x\_1, x\_2, x\_3) = p(x\_1) \\, p(x\_2 \mid x\_1) \\, p(x\_3 \mid x\_1, x\_2)$$ -- More generally: $$p(x\_1, \ldots, x\_n) = \prod\_{i=1}^{n} p(x\_i \mid x\_1, \ldots, x\_{i-1})$$ .highlight[ This is how autoregressive models (like GPT) generate sequences: each token conditioned on all previous tokens. ] --- # Independence $A$ and $B$ are **independent** iff any of these equivalent statements hold: $$p(A \mid B) = p(A)$$ $$p(B \mid A) = p(B)$$ $$p(A, B) = p(A)p(B)$$ -- For independent observations: $$p(x\_1, x\_2, \ldots, x\_N \mid \theta) = \prod\_i p(x\_i \mid \theta)$$ --- # Marginalization .cols[ .col-1-2[ If we don't care about some variable, integrate it out: $$p(x) = \int p(x, z)\\, dz$$ .small.muted[Joint distribution $p(x,y)$ with marginals $p(x)$ (top) and $p(y)$ (right)] ] .col-1-2[ .center.width-90[] ] ] --- # Product rule Joint probabilities factor in two ways: $$p(x, z) = p(x \mid z) p(z) = p(z \mid x) p(x)$$ This symmetry gives us Bayes' theorem. 
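--- # Checking these rules in code

The product rule, marginalization, and the Bayes inversion that follows are easy to verify on a small discrete example. A minimal NumPy sketch, using a made-up $2 \times 2$ joint table purely for illustration:

```python
import numpy as np

# Made-up joint distribution p(A, B); rows index A, columns index B.
p_joint = np.array([[0.10, 0.30],
                    [0.20, 0.40]])

# Marginalization: sum out the variable we don't care about.
p_A = p_joint.sum(axis=1)  # p(A)
p_B = p_joint.sum(axis=0)  # p(B)

# Product rule: p(A | B) = p(A, B) / p(B), and p(B | A) = p(A, B) / p(A).
p_A_given_B = p_joint / p_B            # each column conditions on a value of B
p_B_given_A = p_joint / p_A[:, None]   # each row conditions on a value of A

# Bayes' theorem recovers p(A | B) from the other factorization.
assert np.allclose(p_B_given_A * p_A[:, None] / p_B, p_A_given_B)
```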
--- # Bayes' theorem .cols[ .col-2-3[ From the product rule: $p(A, B) = p(A \mid B) p(B) = p(B \mid A) p(A)$ .center[ .eq-box[ $$p(A \mid B) = \frac{p(B \mid A) \\, p(A)}{p(B)}$$ ] ] - $p(A)$ — **prior** belief about $A$ - $p(B)$ — probability of **evidence** $B$ - $p(A \mid B)$ — **posterior** belief about $A$ given evidence $B$ - $p(B \mid A)$ — **likelihood** of $B$ given $A$ ] .col-1-3[ .center[] .small.muted.center[Thomas Bayes (1701–1761)] ] ] --- # Why does this matter for science? In science, we can often model the **forward** direction but want to go **backward**: -- Bayes' theorem **inverts** the forward model: $$p(\text{cause} \mid \text{effect}) = \frac{p(\text{effect} \mid \text{cause}) \\, p(\text{cause})}{p(\text{effect})}$$ .highlight[ We build models (forward). Bayes lets us use them for inference (backward). ] --- # Bayes' theorem for inference .center.small[ .eq-box[ $$p(\theta \mid x) = \frac{p(x \mid \theta) \\, p(\theta)}{p(x)}$$ ] ] -- | Term | Name | Role | |------|------|------| | $p(\theta)$ | **Prior** | What we believed before | | $p(x \mid \theta)$ | **Likelihood** | How probable is data given $\theta$? | | $p(\theta \mid x)$ | **Posterior** | What we believe after seeing data | | $p(x)$ | **Evidence** | Normalizing constant | --- class: center, middle, section-slide # Forward vs. Inverse Problems --- # Forward vs. inverse .center.width-70[] .small.muted.center[Forward: sample $x \sim p(x \mid \theta)$. Inverse: characterize $p(\theta \mid x)$.] --- # Forward processes easy -- just sample $x \sim p(x \mid \theta)$ .center[] --- # But inference is hard... need to get $p(\theta \mid x)$ .center[] --- class: center, middle, section-slide # It's all about the likelihood function! .huge[$p(x \mid \theta)$] --- # Example: fitting a line .center.width-50[] .small.muted.center[Each measurement has Gaussian uncertainty $\sigma$. Gray dashed: true line.] --- # Example: fitting a line We measure $N$ data points $\\{x\_i\\}$ at known inputs $\\{t\_i\\}$. We believe the true relationship is $x = mt + c$, but measurements have noise. -- **The generative story:** 1. There are true parameters $\theta = (m, c)$ 2. For each input $t\_i$, the "true" value is $\mu\_i = m t\_i + c$ 3. We observe $x\_i = \mu\_i + \epsilon\_i$ where $\epsilon\_i \sim \mathcal{N}(0, \sigma^2)$ -- In probabilistic notation: $$x\_i \sim \mathcal{N}(m t\_i + c, \sigma^2)$$ --- # The likelihood $$p(x \mid \theta)$$ The likelihood encodes the **forward model**: given parameters, predict the data distribution. -- **Maximum likelihood:** Find parameters that make the data most probable: $$\hat{\theta}\_{\text{MLE}} = \arg\max\_\theta p(x \mid \theta) = \arg\max\_\theta \log p(x \mid \theta)$$ In practice, work with log-likelihoods. --- # MLE for line fitting $$\ell(m, c) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum\_{i=1}^N (x\_i - mt\_i - c)^2$$ The first term is constant in $(m, c)$. -- Maximizing $\ell$ is equivalent to minimizing: $$\sum\_i (x\_i - mt\_i - c)^2$$ .highlight[ Ordinary least squares is MLE for Gaussian noise. ] --- # MLE fit .center.width-50[] .small.muted.center[Orange: MLE fit. Gray dashed: true line.] 
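--- # MLE in code

The generative story above translates directly into a few lines of NumPy. This is a minimal sketch with made-up true values for $(m, c, \sigma)$; for Gaussian noise the MLE coincides with ordinary least squares, so `np.polyfit` gives it in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the generative story (true m, c, sigma chosen for illustration).
m_true, c_true, sigma = 2.0, 1.0, 0.5
t = np.linspace(0.0, 1.0, 20)
x = m_true * t + c_true + rng.normal(0.0, sigma, size=t.shape)

def log_likelihood(m, c):
    """Gaussian log-likelihood of the data under the line x = m*t + c."""
    resid = x - (m * t + c)
    return -0.5 * len(x) * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2

# For Gaussian noise, maximizing the log-likelihood is ordinary least squares.
m_hat, c_hat = np.polyfit(t, x, deg=1)
print(m_hat, c_hat, log_likelihood(m_hat, c_hat))
```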
--- # Likelihood ratios .cols[ .col-1-2[ Compare two hypotheses: $$\Lambda = \frac{p(x \mid \theta\_1)}{p(x \mid \theta\_0)}$$ .small.muted[Likelihood-ratio tests are the basis of frequentist $p$-values (but we won't go into details...)] ] .col-1-2[  ] ] --- class: center, middle, section-slide # The Posterior Distribution --- # Bayes' theorem revisited .center[ .eq-box[ $$\underbrace{p(\theta \mid x)}\_{\text{posterior}} = \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(x)}\_{\text{evidence}}}$$ ] ] .highlight[ The posterior is the **complete answer**. Unlike MLE (a point), the posterior is a distribution. ] --- # Visualizing Bayesian inference .center.width-80[] .small.muted.center[Prior (left) encodes initial beliefs. Likelihood (middle) peaks where data are probable. Posterior (right) combines both.] --- # Sequential updating When $x\_2$ arrives, yesterday's posterior becomes today's prior: $p(\theta \mid x\_1, x\_2) \propto p(x\_2 \mid \theta) \cdot p(\theta \mid x\_1)$ .center.width-80[] --- # Line fitting posterior With a flat prior on $(m, c)$, the posterior is analytic: $$p(m, c \mid \\{x\_i\\}, \\{t\_i\\}) \propto \exp\left( -\frac{1}{2\sigma^2} \sum\_{i=1}^N (x\_i - mt\_i - c)^2 \right)$$ --- # The full posterior .center.width-90[] --- # Parameter degeneracy .cols[ .col-1-2[ The joint posterior can reveal structure that marginals miss. .small.muted[Corner plot: marginals (diagonal) and joint posterior (off-diagonal) with 68% and 95% credible regions.] ] .col-1-2[ .width-80[] ] ] --- class: center, middle, section-slide # Model Comparison --- # The evidence and Bayes factors The marginal likelihood (or "evidence"): $$p(x \mid \mathcal{M}) = \int p(x \mid \theta, \mathcal{M}) \\, p(\theta \mid \mathcal{M})\\, d\theta$$ -- **Bayes factors** compare models: $$\text{BF}\_{12} = \frac{p(x \mid \mathcal{M}\_1)}{p(x \mid \mathcal{M}\_2)}$$ $\text{BF}\_{12} = 10$ means: "The data are 10 times more probable under model 1 than under model 2." --- # Linear vs. cubic .center.width-80[] .small.muted.center[(a) Both models fit reasonably. (b) Evidence favors the simpler linear model ($\Delta \log Z \approx 2.3$, BF $\approx 10$).] --- # Automatic Occam's razor .cols[ .col-1-2[ **Complex models** spread probability over many datasets → low probability for each. **Simple models** concentrate probability → high probability for the datasets they do predict well. We'll come back to this when discussing neural network inductive biases! ] .col-1-2[ .center.width-70[] ] ] .highlight[ Bayesian model comparison penalizes complexity automatically—no ad hoc penalties needed. ] --- class: center, middle, section-slide # ML Through a Probabilistic Lens --- # (Almost) Every ML task is about a probability distribution | Task | What you learn | |------|----------------| | Classification | $p(y \mid x)$, $y$ discrete | | Regression | $p(y \mid x)$, $y$ continuous | | Density estimation | $p(x)$ | | Inverse problem / inference | $p(\theta \mid x)$ | | Anomaly detection | Low $p(x)$ regions | | Generation | Sample $x \sim p(x)$ | --- # Loss functions are likelihoods Every loss function encodes probabilistic assumptions. As we saw, **MSE is just the negative log-likelihood of a Gaussian**. -- **Cross-entropy:** For classification, we model $p(y \mid x)$ as a Categorical distribution.
-- If the true label is $y = k$ and the model predicts probabilities $\hat{\pi}\_1, \ldots, \hat{\pi}\_K$: $$p(y = k) = \hat{\pi}\_k \quad \Rightarrow \quad \log p(y = k) = \log \hat{\pi}\_k$$ -- Negative log-likelihood over the dataset: $$\mathcal{L}\_{\text{CE}} = -\sum\_i \log \hat{\pi}\_{i, y\_i} = -\sum\_i \sum\_k y\_{ik} \log \hat{\pi}\_{ik}$$ where $\hat{\pi}\_{ik}$ is the model's predicted probability of class $k$ for sample $i$, and $y\_{ik} = 1$ if sample $i$ has label $k$ and $0$ otherwise. --- # Takeaways **The likelihood is a story** about how data came to be. .center.width-60[] **Bayes' theorem lets us invert:** $p(\theta \mid x) \propto p(x \mid \theta) p(\theta)$ --- # The computational challenge .cols[ .col-1-2[ Sounds easy, right? In practice, $x$, $z$, and $\theta$ can all be high-dimensional and horribly complicated, and the likelihood (forward model) can be practically intractable. We'll discuss a host of methods to wrangle these distributions! ] .col-1-2[ .center[.width-100[]] .center[.small[Grid evaluation with 20 points per dimension]] ] ]
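--- # Sketch: brute-force posterior on a grid

To make the scaling problem concrete, here is a minimal NumPy sketch that evaluates the line-fitting posterior on a grid (the toy data, grid ranges, and resolution are illustrative choices, not taken from the figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy line-fitting setup as before (values chosen for illustration).
m_true, c_true, sigma = 2.0, 1.0, 0.5
t = np.linspace(0.0, 1.0, 20)
x = m_true * t + c_true + rng.normal(0.0, sigma, size=t.shape)

# Grid over (m, c); with a flat prior, log posterior = log likelihood + const.
m_grid = np.linspace(0.0, 4.0, 200)
c_grid = np.linspace(-1.0, 3.0, 200)
M, C = np.meshgrid(m_grid, c_grid, indexing="ij")
resid = x - (M[..., None] * t + C[..., None])      # shape (200, 200, 20)
log_post = -0.5 * np.sum(resid**2, axis=-1) / sigma**2

# Normalize on the grid (subtract the max first for numerical stability).
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Marginals: sum over the other parameter's axis.
p_m = post.sum(axis=1)  # posterior over m on the m grid
p_c = post.sum(axis=0)  # posterior over c on the c grid

# A grid with k points per dimension costs k**d likelihood evaluations:
# fine in 2D, hopeless for the high-dimensional problems later in the course.
```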