Pool — Local Downsampling: Keep Key Info, Reduce Data¶
Take max or average within a local window, shrinking the feature map.
tiny_pool ships 1-D and 2-D max-/avg-pool layers. Pooling is independent across channels and introduces no learnable parameters; during training, only MaxPool stores argmax positions for the backward pass.
Intuition¶
| Pool type | Value | Meaning | Effect |
|---|---|---|---|
| MaxPool | Max in window | "Strongest response here?" | Retains strongest features, shift invariance |
| AvgPool | Average in window | "Overall response here?" | Retains trend, smooths features |
Why Pool?¶
- Less compute: smaller feature maps mean less downstream work and fewer parameters in later Dense layers
- Larger receptive field: later convs see wider context
- Shift invariance: a slightly shifted input often selects the same max
Shape Change¶
```
Input [B, C, L] → Output [B, C, (L - kernel) / stride + 1]
```
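For example, kernel = 2 with stride = 2 maps [B, C, 32] to [B, C, 16].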
SHAPE CONVENTIONS¶
| Layer | Input | Output |
|---|---|---|
| MaxPool1D / AvgPool1D | [batch, channels, length] | [batch, channels, (L - K) / S + 1] |
| MaxPool2D / AvgPool2D | [batch, channels, height, width] | [batch, channels, (H - kH) / sH + 1, (W - kW) / sW + 1] |
The constructor's `stride` defaults to `-1`, a sentinel meaning "use `kernel_size`" (i.e. non-overlapping pooling).
MaxPool1D¶
```
class MaxPool1D : public Layer
{
public:
    explicit MaxPool1D(int kernel_size, int stride = -1);  // stride = -1 → use kernel_size
    Tensor forward (const Tensor &x) override;
    Tensor backward(const Tensor &grad_out) override;
};
```
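Minimal usage sketch (`x` is assumed to be a [B, C, L] Tensor with even L):

```
MaxPool1D pool(2);            // kernel 2; stride defaults to the kernel size
Tensor y = pool.forward(x);   // [B, C, L] → [B, C, L/2]
```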
Forward¶
The absolute argmax index is recorded in `mask_[b, c, l]` (training only).
Backward¶
`grad_out[b, c, l]` is written to `g[b, c, mask_[b, c, l]]` (gradient flows only into the max position).
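How the two passes fit together, as a minimal sketch over flat [B, C, L] buffers: raw vectors and an int mask stand in for the library's Tensor type and `mask_` member, so the indexing is illustrative rather than the actual implementation.

```
#include <cstddef>
#include <vector>

// Forward: y and mask must be sized B*C*OL; mask stores the absolute
// input index of each window's maximum.
void maxpool1d_forward(const std::vector<float>& x, std::vector<float>& y,
                       std::vector<int>& mask,
                       int B, int C, int L, int K, int S)
{
    const int OL = (L - K) / S + 1;
    for (int b = 0; b < B; ++b)
        for (int c = 0; c < C; ++c)
            for (int ol = 0; ol < OL; ++ol) {
                int best = (b * C + c) * L + ol * S;        // start of window
                for (int k = 1; k < K; ++k) {
                    const int idx = (b * C + c) * L + ol * S + k;
                    if (x[idx] > x[best]) best = idx;
                }
                const int out = (b * C + c) * OL + ol;
                y[out]    = x[best];
                mask[out] = best;                           // argmax for backward
            }
}

// Backward: winner-take-all scatter. grad_in must be zero-initialized
// with size B*C*L; only the recorded max positions receive gradient.
void maxpool1d_backward(const std::vector<float>& grad_out,
                        const std::vector<int>& mask,
                        std::vector<float>& grad_in)
{
    for (std::size_t i = 0; i < grad_out.size(); ++i)
        grad_in[mask[i]] += grad_out[i];
}
```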
AvgPool1D¶
Forward¶
Each output element is the mean of the K inputs in its window.
Backward¶
Distribute grad_out / K evenly to the K positions in the receptive field, as sketched below.
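The same idea over flat buffers (again illustrative, not the library's actual Tensor API): each output gradient is split evenly across the K input positions of its window.

```
#include <vector>

// grad_in must be zero-initialized with size B*C*L.
void avgpool1d_backward(const std::vector<float>& grad_out,
                        std::vector<float>& grad_in,
                        int B, int C, int L, int K, int S)
{
    const int OL = (L - K) / S + 1;
    for (int b = 0; b < B; ++b)
        for (int c = 0; c < C; ++c)
            for (int ol = 0; ol < OL; ++ol) {
                const float g = grad_out[(b * C + c) * OL + ol]
                                / static_cast<float>(K);
                for (int k = 0; k < K; ++k)
                    grad_in[(b * C + c) * L + ol * S + k] += g;  // uniform share
            }
}
```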
MaxPool2D¶
mask_ has shape [B, C, OH, OW * 2], packing (ih, iw) as two consecutive floats (saves a reshape). Backward unpacks the same layout and writes grad_out to the original max position.
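The packed layout, sketched with hypothetical helpers (the names are not from the library): `mask_row` points at the [OW * 2] row for a fixed (b, c, oh), and columns 2*ow and 2*ow + 1 hold the argmax row and column as floats.

```
// Pack the argmax coordinates (ih, iw) for output column ow.
inline void pack_argmax(float* mask_row, int ow, int ih, int iw)
{
    mask_row[2 * ow]     = static_cast<float>(ih);
    mask_row[2 * ow + 1] = static_cast<float>(iw);
}

// Unpack them again in backward to route the gradient.
inline void unpack_argmax(const float* mask_row, int ow, int& ih, int& iw)
{
    ih = static_cast<int>(mask_row[2 * ow]);
    iw = static_cast<int>(mask_row[2 * ow + 1]);
}
```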
AvgPool2D¶
Mirrors AvgPool1D: forward averages kH * kW elements; backward broadcasts grad_out / (kH * kW) back.
NOTES¶
- `stride` defaults to `kernel` → non-overlapping pooling (most common).
- MaxPool gradients are sparse: only the max position receives gradient (winner-take-all), often desirable for feature selection.
- AvgPool gradients are uniform: more stable but less selective; commonly used for global average pooling (see `GlobalAvgPool` under LAYERS/BASE).
- No padding: the implementation assumes `(L - K)` and `(H/W - kH/kW)` are divisible by `stride`. Callers must align shapes (a guard is sketched below); otherwise edge elements are silently dropped.
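A hypothetical caller-side guard, not part of the library: with no padding, the window must tile the input exactly or trailing elements are dropped.

```
#include <cassert>

inline void check_pool_shape(int L, int K, int S)
{
    assert(L >= K && (L - K) % S == 0 && "pool window does not tile input");
}
```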
CONV + POOL RECIPE¶
A standard CNN1D block looks like:
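One possible such stack, with shape comments for an input of [B, 1, 66]; Conv1D, ReLU, Flatten, and Dense names and constructor signatures are assumptions inferred from the surrounding text (only MaxPool1D's constructor is documented above).

```
Conv1D    conv1(1, 16, 3);    // assumed (in_ch, out_ch, kernel): [B, 1, 66] → [B, 16, 64]
ReLU      relu1;              // assumed layer name
MaxPool1D pool1(2);           // [B, 16, 64] → [B, 16, 32]
Conv1D    conv2(16, 32, 3);   // [B, 16, 32] → [B, 32, 30]
ReLU      relu2;
MaxPool1D pool2(2);           // [B, 32, 30] → [B, 32, 15]
Flatten   flatten;            // assumed: [B, 32, 15] → [B, 480]
Dense     head(480, 10);      // assumed (in, out)
```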
Each MaxPool1D(2) halves the sequence length, expanding the receptive field while shrinking the downstream Dense parameter count.
COMPUTE / MEMORY¶
- Complexity: O(B · C · OH · OW · kH · kW) (no matmul).
- Training memory (MaxPool only): `mask_` matches the output size in 1-D and is 2× the output size in 2-D.
- PSRAM: pool layers themselves carry no weights; activation memory depends on batch / channels / length.