Notes¶
tiny_optimizer exposes two gradient-descent optimisers tuned for the ESP32-S3 memory budget: SGD (momentum + L2) and Adam (lite). Both consume a std::vector<ParamGroup> populated by the model layers.
Optimizer — Walking Downhill Along the Gradient
The gradient tells the optimizer which direction to go. The optimizer decides how big a step to take and how to take it.
Intuition¶
Gradient Descent = Blindfolded Hiking¶
Imagine you're blindfolded on a mountain, trying to reach the valley. At each step you feel which way the ground slopes (the gradient) and step downhill, i.e. against the gradient (update the parameters). That's gradient descent.
Three Optimizers¶
| Optimizer | Intuition | Characteristic | Best for |
|---|---|---|---|
| SGD | \(\text{param} = \text{param} - \text{lr} \cdot \text{grad}\) | Pure gradient descent | Simple problems, small data |
| SGD+Momentum | Adds "inertia": retains previous direction, gradient corrects it | Reduces oscillation | Default for most tasks |
| Adam | Per-parameter adaptive LR + momentum | Each param has its own step | Complex networks, hard to tune |
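As a quick worked example of the table (illustrative numbers, using the momentum form \(v \leftarrow \mu v + g\) described in the SGD section below): with gradient \(g = 2.0\) and lr \(= 0.1\),

\[
\text{SGD:}\;\; \Delta\theta = -0.1 \cdot 2.0 = -0.2
\qquad
\text{SGD+Momentum }(\mu = 0.9,\ v_{\text{prev}} = 1.0):\;\; v = 0.9 \cdot 1.0 + 2.0 = 2.9,\;\; \Delta\theta = -0.1 \cdot 2.9 = -0.29
\]

The momentum term keeps the update moving in the accumulated direction even when individual gradients are noisy.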
Hyperparameter tuning
- lr: too large → diverges; too small → extremely slow. Typical: 0.01 ~ 0.0001
- momentum: 0.9 is classic. Higher = more inertia
- weight_decay: L2 regularization, prevents overfitting
Learning rate is THE most important hyperparameter
Debug sequence: try lr=0.01. If loss doesn't decrease (or oscillates), try 0.001. If too slow, try 0.1.
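A minimal sketch of that debug sequence, assuming the `SGD(lr, momentum)` constructor shown later (the training steps are placeholders):

```cpp
// Illustrative lr sweep following the debug sequence above.
const float candidate_lrs[] = {0.01f, 0.001f, 0.1f};
for (float lr : candidate_lrs) {
    SGD opt(lr, 0.9f);
    // ... train for a few epochs and watch the loss curve:
    //     diverging / oscillating -> lr too large, flat -> lr too small ...
}
```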
ParamGroup¶
```cpp
struct ParamGroup
{
    Tensor *param; // weight / bias tensor
    Tensor *grad;  // matching gradient tensor
};
```
Each trainable layer (Dense, Conv1D, Conv2D, LayerNorm, Attention) overrides Layer::collect_params() and pushes its (weight, dweight), (bias, dbias) pairs onto a std::vector<ParamGroup>. Sequential::collect_params() collects from the whole network.
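A sketch of what such an override might look like. The member names (`weight_`, `dweight_`, ...) and the exact `collect_params` signature are assumptions; the point is simply pushing matching (param, grad) pairs:

```cpp
// Hypothetical Dense layer excerpt -- member names are illustrative.
void Dense::collect_params(std::vector<ParamGroup> &groups)
{
    groups.push_back({&weight_, &dweight_}); // weight tensor + its gradient
    groups.push_back({&bias_,   &dbias_});   // bias tensor + its gradient
}
```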
Optimizer Abstract Base¶
```cpp
class Optimizer
{
public:
    virtual void init(const std::vector<ParamGroup> &groups) = 0;
    virtual void step(std::vector<ParamGroup> &groups) = 0;
    virtual void zero_grad(std::vector<ParamGroup> &groups);
};
```
Required call order:

- Construct: `SGD opt(lr, mom)` or `Adam opt(lr, β1, β2, ε)`.
- Collect params: `model.collect_params(params)`.
- Init: `opt.init(params)`. Only here are the momentum / Adam moment buffers allocated to match each parameter's shape.
- Training loop: per batch, run `opt.zero_grad(params)` → forward → backward → `opt.step(params)`, as in the sketch below.
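A minimal sketch of that order, with placeholder data/loss helpers (only the optimiser calls mirror the documented API):

```cpp
// Sketch of the required call order; data and loss handling are placeholders.
SGD opt(0.01f, 0.9f);            // 1. construct: lr = 0.01, momentum = 0.9

std::vector<ParamGroup> params;
model.collect_params(params);    // 2. collect (param, grad) pairs from every layer

opt.init(params);                // 3. allocate one velocity buffer per parameter

for (int step = 0; step < num_steps; ++step) {        // 4. training loop
    opt.zero_grad(params);                            // clear gradient tensors
    Tensor logits = model.forward(x_batch);           // forward pass
    Tensor dloss  = compute_loss_grad(logits, y_batch); // placeholder loss-gradient helper
    model.backward(dloss);                            // backprop fills each ParamGroup::grad
    opt.step(params);                                 // apply the parameter update
}
```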
SGD with momentum & L2¶
Update:

\[
v \leftarrow \mu\,v + \bigl(g + \lambda\,\theta\bigr), \qquad
\theta \leftarrow \theta - \eta\,v
\]

where \(g\) is the gradient and \(\theta\) the parameter; the L2 term is folded into the gradient, as in PyTorch.
Params:

- `lr`: learning rate \(\eta\).
- `momentum`: \(\mu\); 0 falls back to vanilla SGD.
- `weight_decay`: L2 coefficient \(\lambda\).
init() allocates one velocity tensor per parameter; zero_grad() is provided by the base class.
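A per-element sketch of that update, assuming the formulation above plus illustrative member names (`velocity_`, `lr_`, `momentum_`, `weight_decay_`) and `Tensor::data()` / `size()` accessors; the real `SGD::step()` may differ in detail:

```cpp
// Illustrative SGD step: velocity_[i] is assumed to match groups[i].param in size.
void SGD::step(std::vector<ParamGroup> &groups)
{
    for (size_t i = 0; i < groups.size(); ++i) {
        float *p = groups[i].param->data();
        float *g = groups[i].grad->data();
        float *v = velocity_[i].data();
        for (size_t j = 0; j < groups[i].param->size(); ++j) {
            float grad = g[j] + weight_decay_ * p[j]; // L2 term
            v[j]  = momentum_ * v[j] + grad;          // velocity update
            p[j] -= lr_ * v[j];                       // parameter step
        }
    }
}
```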
Adam (lite)¶
```cpp
Adam(float lr = 1e-3f,
     float beta1 = 0.9f,
     float beta2 = 0.999f,
     float epsilon = 1e-8f,
     float weight_decay = 0.0f);
```
Per step, with gradient \(g\) (including the L2 term \(\lambda\,\theta\) when `weight_decay > 0`):

\[
m \leftarrow \beta_1 m + (1-\beta_1)\,g, \qquad
v \leftarrow \beta_2 v + (1-\beta_2)\,g^2
\]

Bias correction is applied to the LR (cheaper than per-element):

\[
\eta_t = \eta\,\frac{\sqrt{1-\beta_2^{\,t}}}{1-\beta_1^{\,t}}, \qquad
\theta \leftarrow \theta - \eta_t\,\frac{m}{\sqrt{v}+\epsilon}
\]
init() allocates m and v per parameter; step() increments the internal time step t_.
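A per-element sketch of that "lite" update with the bias correction folded into the learning rate (the member names `m_`, `v_`, `t_` and the `Tensor` accessors are assumptions):

```cpp
// Illustrative Adam (lite) step; requires <cmath> for std::sqrt / std::pow.
void Adam::step(std::vector<ParamGroup> &groups)
{
    ++t_;
    const float lr_t = lr_ * std::sqrt(1.0f - std::pow(beta2_, (float)t_))
                           / (1.0f - std::pow(beta1_, (float)t_));
    for (size_t i = 0; i < groups.size(); ++i) {
        float *p = groups[i].param->data();
        float *g = groups[i].grad->data();
        float *m = m_[i].data();
        float *v = v_[i].data();
        for (size_t j = 0; j < groups[i].param->size(); ++j) {
            float grad = g[j] + weight_decay_ * p[j];             // optional L2 term
            m[j] = beta1_ * m[j] + (1.0f - beta1_) * grad;        // first moment
            v[j] = beta2_ * v[j] + (1.0f - beta2_) * grad * grad; // second moment
            p[j] -= lr_t * m[j] / (std::sqrt(v[j]) + epsilon_);   // bias-corrected step
        }
    }
}
```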
Practical defaults
- For SHM, biomedical signals or other small/unstable datasets: Adam with defaults works.
- For sparse / large-batch training: SGD with `lr` ≈ 0.1 and `momentum = 0.9`.
- `weight_decay > 0` matches PyTorch L2 regularisation; do not over-decay biases (the implementation does decay them, but their magnitude is small).
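Typical constructions matching these defaults (assuming the `Adam` constructor shown above and an `SGD(lr, momentum)` constructor):

```cpp
Adam adam_opt;             // lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8
SGD  sgd_opt(0.1f, 0.9f);  // lr = 0.1, momentum = 0.9
```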
Memory / PSRAM impact¶
- SGD: +1 velocity tensor per parameter → ~2× memory.
- Adam: +2 moment tensors per parameter → ~3× memory.
If you place model weights in PSRAM, you typically want optimiser buffers in PSRAM too. Tensor defaults to TINY_AI_MALLOC; replace weight tensors with Tensor::from_data(psram_buf, ...) views when the budget is tight.
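A hedged sketch of the PSRAM route on ESP-IDF; `heap_caps_malloc` / `MALLOC_CAP_SPIRAM` are the standard ESP-IDF calls, while the exact `Tensor::from_data` argument list is an assumption based on the note above:

```cpp
#include "esp_heap_caps.h"

// Back a weight tensor with external PSRAM instead of internal RAM.
float *psram_buf = static_cast<float *>(
    heap_caps_malloc(rows * cols * sizeof(float), MALLOC_CAP_SPIRAM));
Tensor weight = Tensor::from_data(psram_buf, {rows, cols}); // shape argument assumed
```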
Trainer Integration¶
Trainer::ensure_params_collected() runs lazily on the first fit() call.
Per batch:
```cpp
optimizer_->zero_grad(params_);
auto logits = model_->forward(X_batch);
auto grad = loss_backward(logits, ..., loss_type_, y_batch);
model_->backward(grad);
optimizer_->step(params_);
```
Implementing a custom optimiser is just a matter of subclassing Optimizer and overriding init / step — no changes to layers or Trainer required.
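For example, a vanilla (no-momentum) SGD could be sketched like this; the `Tensor::data()` / `size()` accessors are assumptions:

```cpp
// Minimal custom optimiser: plain SGD, no per-parameter state.
class VanillaSGD : public Optimizer
{
public:
    explicit VanillaSGD(float lr) : lr_(lr) {}

    void init(const std::vector<ParamGroup> &) override {} // nothing to allocate

    void step(std::vector<ParamGroup> &groups) override
    {
        for (auto &g : groups) {
            float *p = g.param->data();
            float *d = g.grad->data();
            for (size_t j = 0; j < g.param->size(); ++j)
                p[j] -= lr_ * d[j]; // param = param - lr * grad
        }
    }

private:
    float lr_;
};
```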