Notes¶
tiny_norm now provides three normalization layers: LayerNorm, BatchNorm1D, and BatchNorm2D. BatchNorm layers default to inference mode, using `running_mean`/`running_var` so MCU deployment can directly match PC-trained checkpoints.
Norm — LayerNorm: Stabilizing Activations Layer by Layer
During training, the distribution of each layer's input keeps changing (Internal Covariate Shift). Normalization pulls it back to a stable range.
Intuition¶
LayerNorm¶
LayerNorm normalizes each sample independently: compute mean and variance across all features of that sample, then normalize.
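Written out (the standard LayerNorm form, with the symbols defined below):

\[
y_i = \gamma_i \,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
\]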
- \(\mu, \sigma^2\): mean and variance of the current sample (across feature dims)
- \(\gamma, \beta\): learnable scale and shift (network decides the best distribution post-norm)
Why Before Activation?¶
Norm is typically applied before Activation (Pre-Norm), constraining the numerical range before entering the activation's saturation zone.
When to use LayerNorm?
LayerNorm works regardless of batch size—even batch=1. Ideal for Attention layers and sequence models.
LayerNorm¶
LayerNorm normalizes along the last dimension (`feat`) without running statistics:

- Learnable params: `gamma`/`beta` of shape `[feat]`.
- Default `epsilon=1e-5`.
- Works with any rank as long as the last dim equals `feat`.
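As a reading of the rules above, here is a minimal reference sketch of the math (not tiny_norm's own code; the row-major `[rows, feat]` layout is an assumption for illustration):

```cpp
#include <cmath>

// Reference math only (not tiny_norm's implementation): LayerNorm over the
// last dimension of a row-major [rows, feat] buffer. No running statistics.
void layer_norm_ref(const float* x, float* y, int rows, int feat,
                    const float* gamma, const float* beta, float eps = 1e-5f) {
    for (int r = 0; r < rows; ++r) {
        const float* xr = x + r * feat;
        float* yr = y + r * feat;

        // Per-sample mean and variance across the feature dimension.
        float mean = 0.f;
        for (int i = 0; i < feat; ++i) mean += xr[i];
        mean /= feat;

        float var = 0.f;
        for (int i = 0; i < feat; ++i) {
            float d = xr[i] - mean;
            var += d * d;
        }
        var /= feat;

        // Normalize, then apply the learnable scale and shift.
        float inv_std = 1.f / std::sqrt(var + eps);
        for (int i = 0; i < feat; ++i)
            yr[i] = gamma[i] * (xr[i] - mean) * inv_std + beta[i];
    }
}
```

Any extra leading dimensions simply fold into `rows`, which is why any rank works as long as the last dim equals `feat`.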
BatchNorm1D (Dense / MLP)¶
- Input/output: `[batch, feat]`.
- Constructor: `BatchNorm1D(int feat, float momentum=0.1f, float epsilon=1e-5f)`.
- Training mode: computes per-feature batch `mu`/`var` and updates the running stats (see the sketch after this list):
    - `running_mean = (1-m) * running_mean + m * mu`
    - `running_var = (1-m) * running_var + m * var`
- Inference mode: fuses each feature to `scale`+`shift` constants.
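A minimal sketch of the training-mode update described above, assuming a row-major `[batch, feat]` layout and a biased batch variance (the library may use the unbiased estimate); illustrative only, not tiny_norm's implementation:

```cpp
#include <vector>

// Per-feature batch mean/var on a row-major [batch, feat] buffer, followed by
// the running-stat update from the list above (m = momentum).
void bn1d_update_running_stats(const float* x, int batch, int feat, float m,
                               std::vector<float>& running_mean,
                               std::vector<float>& running_var) {
    for (int f = 0; f < feat; ++f) {
        // Batch mean of feature f.
        float mu = 0.f;
        for (int b = 0; b < batch; ++b) mu += x[b * feat + f];
        mu /= batch;

        // Biased batch variance of feature f (assumption; see lead-in).
        float var = 0.f;
        for (int b = 0; b < batch; ++b) {
            float d = x[b * feat + f] - mu;
            var += d * d;
        }
        var /= batch;

        // running = (1 - m) * running + m * batch_stat
        running_mean[f] = (1.f - m) * running_mean[f] + m * mu;
        running_var[f]  = (1.f - m) * running_var[f]  + m * var;
    }
}
```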
BatchNorm2D (Conv outputs)¶
- Input/output: `[N,C,L]` (Conv1D) or `[N,C,H,W]` (Conv2D), same shape out.
- Constructor: `BatchNorm2D(int num_channels, float momentum=0.1f, float epsilon=1e-5f)`.
- Statistics are computed per channel over all non-channel axes (`N * spatial`).
- Inference mode also uses fused `running_mean`/`running_var`.
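Per-channel statistics over `N * spatial` can be read as the following illustrative sketch for a row-major `[N, C, H, W]` buffer (for `[N, C, L]`, set `spatial = L`); the layout and the biased variance are assumptions, not taken from the library source:

```cpp
#include <vector>

// Per-channel mean/var over N * spatial elements of a row-major
// [N, C, spatial] buffer (spatial = H*W for Conv2D, L for Conv1D).
void bn2d_channel_stats(const float* x, int N, int C, int spatial,
                        std::vector<float>& mean, std::vector<float>& var) {
    mean.assign(C, 0.f);
    var.assign(C, 0.f);
    const long count = static_cast<long>(N) * spatial;

    for (int c = 0; c < C; ++c) {
        double sum = 0.0, sum_sq = 0.0;
        for (int n = 0; n < N; ++n) {
            const float* xc = x + (static_cast<long>(n) * C + c) * spatial;
            for (int s = 0; s < spatial; ++s) {
                sum    += xc[s];
                sum_sq += static_cast<double>(xc[s]) * xc[s];
            }
        }
        mean[c] = static_cast<float>(sum / count);
        // var = E[x^2] - (E[x])^2
        var[c] = static_cast<float>(sum_sq / count - static_cast<double>(mean[c]) * mean[c]);
    }
}
```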
Training / Inference switch¶
Both BatchNorm1D and BatchNorm2D expose:
```cpp
bn->set_training(true);   // batch stats + running-stat update
bn->set_training(false);  // running stats only
```
At model level:
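The model-level snippet is not reproduced in this extract; assuming the model object simply forwards the flag to every layer it owns (the `set_training` name below mirrors the per-layer call and is an assumption, not confirmed API), it would look roughly like:

```cpp
// Hypothetical: assumes the model wrapper exposes the same toggle and
// forwards it to every BatchNorm layer it owns.
model->set_training(true);   // training: batch stats + running-stat updates
model->set_training(false);  // deployment: running stats only
```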
Practical guidance¶
- Deployment inference: load `gamma`/`beta`/`running_mean`/`running_var` from PC training and keep `training_mode=false` (see the fusion sketch after this list).
- Very small batches: prefer `LayerNorm` for more stable behavior.
- Updated demos: `example_mlp` now includes a `BatchNorm1D` demo; `example_cnn` includes a `BatchNorm2D` demo with explicit mode switching and running-stat inspection.
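For the deployment path, the fusion mentioned above reduces each feature (or channel) to two constants. A sketch of that precomputation using the standard BatchNorm algebra (illustrative only, not tiny_norm's source):

```cpp
#include <cmath>
#include <vector>

// Fold gamma/beta/running_mean/running_var into per-feature (or per-channel)
// scale/shift constants so inference is just y = scale * x + shift.
void fuse_bn_params(const std::vector<float>& gamma, const std::vector<float>& beta,
                    const std::vector<float>& running_mean, const std::vector<float>& running_var,
                    float eps, std::vector<float>& scale, std::vector<float>& shift) {
    const size_t n = gamma.size();
    scale.resize(n);
    shift.resize(n);
    for (size_t i = 0; i < n; ++i) {
        scale[i] = gamma[i] / std::sqrt(running_var[i] + eps);
        shift[i] = beta[i] - running_mean[i] * scale[i];
    }
}
```

At runtime the per-element work is then a single multiply-add per feature or channel.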