class: middle, title-slide # Convolutional Neural Networks ## CDS DS 595 ### Siddharth Mishra-Sharma [smsharma.io/teaching/ds595-ai4science](https://smsharma.io/teaching/ds595-ai4science.html) --- # Logistics 1. [Dispatches from the singularity](https://www.anthropic.com/engineering/building-c-compiler) 2. **Assignment 1:** due Feb 18 — make sure to get started if you haven't already 3. **Lab 3:** out this afternoon, due end of Wednesday 4. Office hours tomorrow, Tue 3–5pm, CDS 1528 --- # A brief history of CNNs .cols[ .col-1-2[ **1980 — Neocognitron** (Fukushima) Inspired by the visual cortex: - Local receptive fields (small patches) - Hierarchy of features - Translation invariance .center.width-100[] ] .col-1-2[ **Late 1980s–90s — LeNet** (LeCun et al.) Made it trainable with backprop. Result: LeNet-5 (1998) — handwritten digit recognition deployed at scale. .center.width-100[] ] ] --- # A brief history of CNNs (cont.) .cols[ .col-1-2[ **2000s — The long winter** .center.width-80[] SVMs and hand-crafted features dominated. CNNs were seen as too slow and data-hungry. ] .col-1-2[ **2012 — AlexNet** (Krizhevsky, Sutskever, Hinton) .center.width-100[] ] ] --- class: center, middle, section-slide # Part 1: Inductive Biases .small[The role of architecture in learning structure] --- # Recall: the MLP .cols[ .col-1-2[ .center.width-100[] ] .col-1-2[ $$h\_{\ell+1} = \sigma(W\_\ell \, h\_\ell + b\_\ell)$$ $$y = W\_L \, h\_L$$ Every input is connected to every output — **no structural assumptions** about the data. ] ] --- # The MNIST shift experiment .cols[ .col-1-2[ **Setup:** 1. Train MLP on MNIST (centered handwritten digits) 2. Test on same digits, shifted by a few pixels **Result:** .highlight-inline[Accuracy crashes from ~95% to ~50%.] The network learned *where* the digit is, not *what* it is. ] .col-1-2[ .center[  ] ] ] --- # Inductive biases .cols[ .col-1-2[ An **inductive bias** = assumption built into the model. Many hypotheses fit the training data.
The inductive bias determines which ones we **prefer**. The right inductive bias makes learning **much** easier. ] .col-1-2[ .center[  ] ] ] --- # Inductive biases by data type | Data type | Structure | Inductive bias | Architecture | |-----------|-----------|----------------|--------------| | Images | Grid, local correlations | Locality + weight sharing | CNN | | Sequences | Ordered, variable length | Position matters | RNN, Transformer | | Sets | Unordered collection | Order doesn't matter | DeepSets | | Graphs | Nodes + edges | Relabeling doesn't matter + edges | GNN | | 3D point clouds | Positions in space | Rotation/translation symmetry | Geometric neural nets | --- # Locality + weight sharing .cols[ .col-1-2[ **Locality:** nearby pixels are more related than distant ones — process small neighborhoods. **Weight sharing:** the same pattern (e.g. an edge) can appear anywhere — reuse the same detector at every position. Encoding both into the architecture gives us **convolution**. ] .col-1-2[ .center.width-100[] ] ] .highlight[ Convolution doesn't make the network more powerful—it makes it **less** powerful in a useful way. ] --- class: center, middle, section-slide # Part 2: 1D Convolution .small[Starting simple before moving to images] --- # 1D convolution .cols[ .col-1-2[ A **convolution** computes each output as a weighted sum of nearby inputs: .center[.eq-box[ $z\_i = \omega\_1 x\_{i-1} + \omega\_2 x\_i + \omega\_3 x\_{i+1}$ ]] The weights $\omega = [\omega\_1, \omega\_2, \omega\_3]$ are the **kernel** (or filter). Just 3 parameters — a fully connected layer over the same inputs would need $n \times n$. ] .col-1-2[ .center.width-60[] .small.muted.center[UDL Fig 10.2a: Inputs $x\_1, x\_2, x\_3$ are multiplied by kernel weights and summed to produce $z\_2$.] ] ] --- # Multiple channels One kernel detects one pattern. To detect many patterns: **run multiple kernels in parallel**. Each produces a **channel** (feature map). 
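The 1D convolution above is small enough to write out by hand. Here is a minimal pure-Python sketch (no framework; the function name `conv1d` and the example kernels are ours, and we follow the deep-learning convention of not flipping the kernel, i.e. cross-correlation):

```python
def conv1d(x, kernel, bias=0.0):
    """'Valid' 1D convolution (no padding): one weighted sum per window."""
    K = len(kernel)
    return [sum(w * xi for w, xi in zip(kernel, x[i:i + K])) + bias
            for i in range(len(x) - K + 1)]

# One kernel = one pattern detector; run several in parallel for channels.
x = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]   # a step "edge" in 1D
edge_kernel = [-1.0, 0.0, 1.0]       # responds to rising edges
blur_kernel = [1/3, 1/3, 1/3]        # responds to local averages

channels = [conv1d(x, k) for k in (edge_kernel, blur_kernel)]
# channels[0] == [1.0, 1.0, 0.0, -1.0]: peaks where the input rises.
# Parameter count matches the formula: C_in * C_out * K + C_out = 1*2*3 + 2 = 8.
```

Each kernel slides over the same input with the same weights at every position — weight sharing in three lines of code.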
.center.width-80[] With $C\_{in}$ input channels, kernel size $K$, and $C\_{out}$ output channels: $$\text{Parameters} = C\_{in} \times C\_{out} \times K + C\_{out}$$ --- # Empirical comparison: MNIST-1D .center.width-70[] .small[Conv net: **2,050 params, 17% error**. FC net: **59,065 params, 40% error**. 29× fewer parameters, much better generalization.] .highlight[ The convolutional structure encodes prior knowledge: **every position should be processed the same way**. The FC network has to discover this from data. ] --- class: center, middle, section-slide # Part 3: 2D Convolution .small[The building block of image processing] --- # Extending to 2D For images, the kernel slides in both directions: .center.width-80[] .small.muted.center[UDL Fig 10.9: The same 3×3 kernel slides across the image. Same weights, different position.] A 3×3 kernel: **9 parameters** (plus bias). A fully connected layer over a 224×224 image: **22 billion**. --- # RGB images: 3D kernels An RGB image has 3 channels. The kernel is now 3D: it spans all input channels. .center.width-80[] For $C\_{in}$ input channels, $C\_{out}$ output channels, $K \times K$ spatial kernel: $$\text{Parameters} = C\_{in} \times C\_{out} \times K \times K + C\_{out}$$ .small[A typical layer: $C\_{in} = 64$, $C\_{out} = 128$, $K = 3$ → **73,856 parameters**. Fully connected over the same 56×56×64 input: **~25 billion**.] --- # Receptive fields grow with depth The **receptive field** = region of input that affects a hidden unit. .center.width-80[] Layer 1 (kernel 3): sees 3 inputs. Layer 2: sees 5. Layer 3: sees 7... Deep networks integrate information from across the entire input—while keeping each layer's parameters small. --- # Pooling .cols[ .col-1-2[ A 3×3 kernel only sees 3×3 pixels. To reason about larger structures (objects, not just edges), we need to expand the receptive field. **Pooling** downsamples the spatial dimensions — fewer pixels, but each one summarizes a larger region. 
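A minimal sketch of 2×2 max pooling with stride 2 in plain Python (no framework; the function name is ours):

```python
def max_pool_2x2(x):
    """2x2 max pooling, stride 2: keep the largest value in each block."""
    H, W = len(x), len(x[0])
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, W - 1, 2)]
            for i in range(0, H - 1, 2)]

x = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 0, 5, 6],
     [1, 2, 7, 8]]
max_pool_2x2(x)  # [[4, 2], [2, 8]] — spatial dims halved, 4x fewer values
```
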
Zero learnable parameters. ] .col-1-2[ .center.width-90[] ] ] --- # Compositionality: the hierarchy of learned representations .cols[ .col-1-2[ Across layers, the network builds a **compositional hierarchy**: $$\text{edges} \to \text{textures} \to \text{parts} \to \text{objects}$$ Locality + weight sharing forces the network to build complex concepts by composing simple ones. .small.muted[First CONV layer (top right): oriented edges and color blobs. Second CONV layer (bottom right): higher-order combinations, less directly interpretable.] ] .col-1-2[ .center.width-40[] .center.width-40[] ] ] .footnote[Filters from [cs231n](https://cs231n.github.io/understanding-cnn/)] --- # The same features form across models .cols[ .col-1-2[ .center.width-90[] ] .col-1-2[ Four different architectures, trained independently — all learn curve detectors in similar orientations. This isn't a coincidence. The task (natural images) and the inductive bias (locality + weight sharing) together push different models toward the **same features**. .highlight[ If this "universality" holds broadly, studying features and circuits in one model teaches us about all models. 
] ] ] .footnote[Olah et al., ["Zoom In: An Introduction to Circuits"](https://distill.pub/2020/circuits/zoom-in), Distill (2020)] --- # A CNN in code ```python import flax.linen as nn class SimpleCNN(nn.Module): @nn.compact def __call__(self, x): # Conv: 32 filters with 3×3 kernels, then ReLU x = nn.relu(nn.Conv(32, (3,3))(x)) # Pool: take max of each 2×2 block, halving spatial dims x = nn.max_pool(x, (2,2), (2,2)) # Conv: 64 filters (more channels, smaller spatial) x = nn.relu(nn.Conv(64, (3,3))(x)) # Pool again x = nn.max_pool(x, (2,2), (2,2)) # Flatten: collapse spatial dims into a single vector x = x.reshape((x.shape[0], -1)) # Dense layers for classification x = nn.relu(nn.Dense(128)(x)) return nn.Dense(10)(x) ``` --- # Demo: training a CNN on MNIST **Notebook:** [`mnist_cnn.ipynb`](https://colab.research.google.com/github/bu-ds595/course-materials-spring26/blob/main/notebooks/mnist_cnn.ipynb) --- class: center, middle, section-slide # Part 4: CNN Architectures .small[The conv layer is a primitive — compose it in arbitrary ways] --- # ResNet: depth through residual connections .small.muted.right[He et al. (2015)] .center.width-80[] ResNets stack conv layers with residual connections ($h' = F(h) + h$). A strong, lightweight baseline for image tasks. --- # U-Net: dense prediction .small.muted.right[Ronneberger et al. (2015)] .center.width-70[] Encoder (downsample) + decoder (upsample) + **skip connections** that preserve fine detail. Output is the same spatial resolution as the input — one prediction per pixel. --- # U-Net in science .cols[ .col-1-2[ Whenever you need pixel-level predictions from grid data, U-Net is a strong baseline. It also shows up as a backbone in diffusion models (more later). 
- **Biomedical:** Cell segmentation, organ delineation, lesion detection - **Astronomy:** Galaxy deblending, source extraction, cosmic ray removal - **Climate:** Cloud segmentation, sea ice mapping ] .col-1-2[ .center.width-100[] .small.muted.center[HeLa cell segmentation (Ronneberger et al., 2015)] ] ] --- # CNNs in science Convolutions + pooling + dense head = strong baseline for any image-like data. .cols[ .col-1-2[ .center.width-100[] .small.muted.center[Galaxy morphology classification (GalaxyMNIST — Lab 3!)] ] .col-1-2[ .center.width-80[] .small.muted.center[Jet classification from calorimeter images (Komiske et al., 2016)] ] ] --- # CNNs vs. Transformers .small.muted.right[Liu et al. (2022), "A ConvNet for the 2020s"] .center.width-70[] **Vision Transformers** have largely replaced CNNs on large-scale benchmarks, but modernized ConvNets (ConvNeXt) remain competitive, particularly in data-limited regimes (common in science applications!). --- # Pretrained models — a preview .small.muted.right[Archit et al. (2025), "Segment Anything for Microscopy"] .cols[ .col-1-2[ In practice, you rarely train from scratch. .center.width-100[] .small.muted.center[Segment Anything (SAM): pretrained on 5.2M natural images.] ] .col-1-2[ .center.width-90[] .small.muted.center[Fine-tuned for microscopy segmentation (Archit et al., 2025).] ] ] --- # Not everything is an image .small.muted.right[Komiske et al. (2017)] .center.width-70[] Collider events can be binned into 2D histograms and fed to a CNN — but this discretizes continuous data onto an arbitrary grid. --- # Beyond grids .cols[ .col-1-2[ A molecule has no pixel grid. Atoms sit at arbitrary positions in 3D, connected by bonds. We can flatten it into a fixed-length vector (atom counts), but that throws away the bonding structure. ] .col-1-2[ .center.width-100[] ] ] --- # Next time **Graph Neural Networks:** When data has relational structure. .center.width-40[]