class: middle, title-slide # Convolutional Neural Networks ## CDS DS 595 ### Siddharth Mishra-Sharma [smsharma.io/teaching/ds595-ai4science](https://smsharma.io/teaching/ds595-ai4science.html) --- # Logistics 1. [Dispatches from the singularity](https://www.anthropic.com/engineering/building-c-compiler) 2. **Assignment 1:** due Feb 18 — make sure to get started if you haven't already 3. **Lab 3:** out this afternoon, due end of Wednesday 4. Office hours tomorrow, Tue 3–5pm, CDS 1528 --- # A brief history of CNNs .cols[ .col-1-2[ **1980 — Neocognitron** (Fukushima) Inspired by the visual cortex: - Local receptive fields (small patches) - Hierarchy of features - Translation invariance .center.width-100[] ] .col-1-2[ **Late 1980s–90s — LeNet** (LeCun et al.) Made it trainable with backprop. Result: LeNet-5 (1998) — handwritten digit recognition deployed at scale. .center.width-100[] ] ] --- # A brief history of CNNs (cont.) .cols[ .col-1-2[ **2000s — The long winter** .center.width-80[] SVMs and hand-crafted features dominated. CNNs were seen as too slow and data-hungry. ] .col-1-2[ **2012 — AlexNet** (Krizhevsky, Sutskever, Hinton) .center.width-100[] ] ] --- class: center, middle, section-slide # Part 1: Inductive Biases .small[The role of architecture in learning structure] --- # Recall: the MLP .cols[ .col-1-2[ .center.width-100[] ] .col-1-2[ $$h\_{\ell+1} = \sigma(W\_\ell \, h\_\ell + b\_\ell)$$ $$y = W\_L \, h\_L$$ Every input is connected to every output — **no structural assumptions** about the data. ] ] --- # The MNIST shift experiment .cols[ .col-1-2[ **Setup:** 1. Train MLP on MNIST (centered handwritten digits) 2. Test on same digits, shifted by a few pixels **Result:** .highlight-inline[Accuracy crashes from ~95% to ~50%.] The network learned *where* the digit is, not *what* it is. ] .col-1-2[ .center[  ] ] ] --- # Inductive biases .cols[ .col-1-2[ An **inductive bias** = assumption built into the model. Many hypotheses fit the training data.
The inductive bias determines which ones we **prefer**. The right inductive bias makes learning **much** easier. ] .col-1-2[ .center[  ] ] ] --- # Inductive biases by data type | Data type | Structure | Inductive bias | Architecture | |-----------|-----------|----------------|--------------| | Images | Grid, local correlations | Locality + weight sharing | CNN | | Sequences | Ordered, variable length | Position matters | RNN, Transformer | | Sets | Unordered collection | Order doesn't matter | DeepSets | | Graphs | Nodes + edges | Relabeling doesn't matter + edges | GNN | | 3D point clouds | Positions in space | Rotation/translation symmetry | Geometric neural nets | --- # Locality + weight sharing .cols[ .col-1-2[ **Locality:** nearby pixels are more related than distant ones — process small neighborhoods. **Weight sharing:** the same pattern (e.g. an edge) can appear anywhere — reuse the same detector at every position. Encoding both into the architecture gives us **convolution**. ] .col-1-2[ .center.width-100[] ] ] .highlight[ Convolution doesn't make the network more powerful—it makes it **less** powerful in a useful way. ] --- class: center, middle, section-slide # Part 2: 1D Convolution .small[Starting simple before moving to images] --- # 1D convolution .cols[ .col-1-2[ A **convolution** computes each output as a weighted sum of nearby inputs: .center[.eq-box[ $z\_i = \omega\_1 x\_{i-1} + \omega\_2 x\_i + \omega\_3 x\_{i+1}$ ]] The weights $\omega = [\omega\_1, \omega\_2, \omega\_3]$ are the **kernel** (or filter). Just 3 parameters — a fully connected layer over the same inputs would need $n \times n$. ] .col-1-2[ .center.width-60[] .small.muted.center[UDL Fig 10.2a: Inputs $x\_1, x\_2, x\_3$ are multiplied by kernel weights and summed to produce $z\_2$.] ] ] --- # Multiple channels One kernel detects one pattern. To detect many patterns: **run multiple kernels in parallel**. Each produces a **channel** (feature map). 
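The 1D convolution above is small enough to write out by hand. Here is a minimal pure-Python sketch (no framework; the function name `conv1d` and the example kernels are ours, and we follow the deep-learning convention of not flipping the kernel, i.e. cross-correlation):

```python
def conv1d(x, kernel, bias=0.0):
    """'Valid' 1D convolution (no padding): one weighted sum per window."""
    K = len(kernel)
    return [sum(w * xi for w, xi in zip(kernel, x[i:i + K])) + bias
            for i in range(len(x) - K + 1)]

# One kernel = one pattern detector; run several in parallel for channels.
x = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]   # a step "edge" in 1D
edge_kernel = [-1.0, 0.0, 1.0]       # responds to rising edges
blur_kernel = [1/3, 1/3, 1/3]        # responds to local averages

channels = [conv1d(x, k) for k in (edge_kernel, blur_kernel)]
# channels[0] == [1.0, 1.0, 0.0, -1.0]: peaks where the input rises.
# Parameter count matches the formula: C_in * C_out * K + C_out = 1*2*3 + 2 = 8.
```

Each kernel slides over the same input with the same weights at every position — weight sharing in three lines of code.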
.center.width-80[] With $C\_{in}$ input channels, kernel size $K$, and $C\_{out}$ output channels: $$\text{Parameters} = C\_{in} \times C\_{out} \times K + C\_{out}$$ --- # Empirical comparison: MNIST-1D .center.width-70[] .small[Conv net: **2,050 params, 17% error**. FC net: **59,065 params, 40% error**. 29× fewer parameters, much better generalization.] .highlight[ The convolutional structure encodes prior knowledge: **every position should be processed the same way**. The FC network has to discover this from data. ] --- class: center, middle, section-slide # Part 3: 2D Convolution .small[The building block of image processing] --- # Extending to 2D For images, the kernel slides in both directions: .center.width-80[] .small.muted.center[UDL Fig 10.9: The same 3×3 kernel slides across the image. Same weights, different position.] A 3×3 kernel: **9 parameters** (plus bias). A fully connected layer over a 224×224 image: **22 billion**. --- # RGB images: 3D kernels An RGB image has 3 channels. The kernel is now 3D: it spans all input channels. .center.width-80[] For $C\_{in}$ input channels, $C\_{out}$ output channels, $K \times K$ spatial kernel: $$\text{Parameters} = C\_{in} \times C\_{out} \times K \times K + C\_{out}$$ .small[A typical layer: $C\_{in} = 64$, $C\_{out} = 128$, $K = 3$ → **73,856 parameters**. Fully connected over the same 56×56×64 input: **~25 billion**.] --- # Receptive fields grow with depth The **receptive field** = region of input that affects a hidden unit. .center.width-80[] Layer 1 (kernel 3): sees 3 inputs. Layer 2: sees 5. Layer 3: sees 7... Deep networks integrate information from across the entire input—while keeping each layer's parameters small. --- # Pooling .cols[ .col-1-2[ A 3×3 kernel only sees 3×3 pixels. To reason about larger structures (objects, not just edges), we need to expand the receptive field. **Pooling** downsamples the spatial dimensions — fewer pixels, but each one summarizes a larger region. 
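A minimal sketch of 2×2 max pooling with stride 2 in plain Python (no framework; the function name is ours):

```python
def max_pool_2x2(x):
    """2x2 max pooling, stride 2: keep the largest value in each block."""
    H, W = len(x), len(x[0])
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, W - 1, 2)]
            for i in range(0, H - 1, 2)]

x = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 0, 5, 6],
     [1, 2, 7, 8]]
max_pool_2x2(x)  # [[4, 2], [2, 8]] — spatial dims halved, 4x fewer values
```
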
Zero learnable parameters. ] .col-1-2[ .center.width-90[] ] ] --- # Compositionality: the hierarchy of learned representations .cols[ .col-1-2[ Across layers, the network builds a **compositional hierarchy**: $$\text{edges} \to \text{textures} \to \text{parts} \to \text{objects}$$ Locality + weight sharing forces the network to build complex concepts by composing simple ones. .small.muted[First CONV layer (top right): oriented edges and color blobs. Second CONV layer (bottom right): higher-order combinations, less directly interpretable.] ] .col-1-2[ .center.width-40[] .center.width-40[] ] ] .footnote[Filters from [cs231n](https://cs231n.github.io/understanding-cnn/)] --- # The same features form across models .cols[ .col-1-2[ .center.width-90[] ] .col-1-2[ Four different architectures, trained independently — all learn curve detectors in similar orientations. This isn't a coincidence. The task (natural images) and the inductive bias (locality + weight sharing) together push different models toward the **same features**. .highlight[ If this "universality" holds broadly, studying features and circuits in one model teaches us about all models. 
] ] ] .footnote[Olah et al., ["Zoom In: An Introduction to Circuits"](https://distill.pub/2020/circuits/zoom-in), Distill (2020)] --- # A CNN in code ```python import flax.linen as nn class SimpleCNN(nn.Module): @nn.compact def __call__(self, x): # Conv: 32 filters with 3×3 kernels, then ReLU x = nn.relu(nn.Conv(32, (3,3))(x)) # Pool: take max of each 2×2 block, halving spatial dims x = nn.max_pool(x, (2,2), (2,2)) # Conv: 64 filters (more channels, smaller spatial) x = nn.relu(nn.Conv(64, (3,3))(x)) # Pool again x = nn.max_pool(x, (2,2), (2,2)) # Flatten: collapse spatial dims into a single vector x = x.reshape((x.shape[0], -1)) # Dense layers for classification x = nn.relu(nn.Dense(128)(x)) return nn.Dense(10)(x) ``` --- # Demo: training a CNN on MNIST **Notebook:** [`mnist_cnn.ipynb`](https://colab.research.google.com/github/bu-ds595/course-materials-spring26/blob/main/notebooks/mnist_cnn.ipynb) --- class: center, middle, section-slide # Part 4: CNN Architectures .small[The conv layer is a primitive — compose it in arbitrary ways] --- # ResNet: depth through residual connections .small.muted.right[He et al. (2015)] .center.width-80[] ResNets stack conv layers with residual connections ($h' = F(h) + h$). A strong, lightweight baseline for image tasks. --- # U-Net: dense prediction .small.muted.right[Ronneberger et al. (2015)] .center.width-70[] Encoder (downsample) + decoder (upsample) + **skip connections** that preserve fine detail. Output is the same spatial resolution as the input — one prediction per pixel. --- # U-Net in science .cols[ .col-1-2[ Whenever you need pixel-level predictions from grid data, U-Net is a strong baseline. It also shows up as a backbone in diffusion models (more later). 
- **Biomedical:** Cell segmentation, organ delineation, lesion detection - **Astronomy:** Galaxy deblending, source extraction, cosmic ray removal - **Climate:** Cloud segmentation, sea ice mapping ] .col-1-2[ .center.width-100[] .small.muted.center[HeLa cell segmentation (Ronneberger et al., 2015)] ] ] --- # CNNs in science Convolutions + pooling + dense head = strong baseline for any image-like data. .cols[ .col-1-2[ .center.width-100[] .small.muted.center[Galaxy morphology classification (GalaxyMNIST — Lab 3!)] ] .col-1-2[ .center.width-80[] .small.muted.center[Jet classification from calorimeter images (Komiske et al., 2016)] ] ] --- # CNNs vs. Transformers .small.muted.right[Liu et al. (2022), "A ConvNet for the 2020s"] .center.width-70[] **Vision Transformers** have largely replaced CNNs on large-scale benchmarks, but modernized ConvNets (ConvNeXt) remain competitive, particularly in data-limited regimes (common in science applications!). --- # Pretrained models — a preview .small.muted.right[Archit et al. (2025), "Segment Anything for Microscopy"] .cols[ .col-1-2[ In practice, you rarely train from scratch. .center.width-100[] .small.muted.center[Segment Anything (SAM): pretrained on 5.2M natural images.] ] .col-1-2[ .center.width-90[] .small.muted.center[Fine-tuned for microscopy segmentation (Archit et al., 2025).] ] ] --- # Not everything is an image .small.muted.right[Komiske et al. (2017)] .center.width-70[] Collider events can be binned into 2D histograms and fed to a CNN — but this discretizes continuous data onto an arbitrary grid. --- # Beyond grids .cols[ .col-1-2[ A molecule has no pixel grid. Atoms sit at arbitrary positions in 3D, connected by bonds. We can flatten it into a fixed-length vector (atom counts), but that throws away the bonding structure. ] .col-1-2[ .center.width-100[] ] ] --- # Next time **Graph Neural Networks:** When data has relational structure. .center.width-40[]