Notes¶
tiny_optimizer exposes two gradient-descent optimisers tuned for the ESP32-S3 memory budget: SGD (momentum + L2) and Adam (lite). All optimisers consume a std::vector<ParamGroup> populated by the model layers.
ParamGroup¶
struct ParamGroup
{
    Tensor *param; // weight / bias tensor
    Tensor *grad;  // matching gradient tensor
};
Each trainable layer (Dense, Conv1D, Conv2D, LayerNorm, Attention) overrides Layer::collect_params() and pushes its (weight, dweight), (bias, dbias) pairs onto a std::vector<ParamGroup>. Sequential::collect_params() collects from the whole network.
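The collection pattern can be sketched as follows. This is a minimal illustration, not the library's code: `Tensor` here is a hypothetical stand-in holding a flat `std::vector<float>`, and `ToyDense` and `collect_all` are invented names standing in for a trainable layer and for what a `Sequential`-style container might do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the library's Tensor; only here so the sketch compiles.
struct Tensor
{
    std::vector<float> data;
};

struct ParamGroup
{
    Tensor *param; // weight / bias tensor
    Tensor *grad;  // matching gradient tensor
};

// Hypothetical dense layer; member names are illustrative.
struct ToyDense
{
    Tensor weight, dweight, bias, dbias;

    ToyDense(std::size_t in, std::size_t out)
    {
        weight.data.assign(in * out, 0.0f);
        dweight.data.assign(in * out, 0.0f);
        bias.data.assign(out, 0.0f);
        dbias.data.assign(out, 0.0f);
    }

    // Push (weight, dweight) and (bias, dbias) onto the shared list.
    void collect_params(std::vector<ParamGroup> &groups)
    {
        groups.push_back({&weight, &dweight});
        groups.push_back({&bias, &dbias});
    }
};

// Walk the layers in order and gather every pair into one vector.
inline std::vector<ParamGroup> collect_all(std::vector<ToyDense> &layers)
{
    std::vector<ParamGroup> groups;
    for (auto &layer : layers)
    {
        assert(layer.weight.data.size() == layer.dweight.data.size());
        layer.collect_params(groups);
    }
    return groups;
}
```

Because a ParamGroup stores raw pointers, the layers must outlive the vector and must not be moved after collection.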
Optimizer Abstract Base¶
class Optimizer
{
public:
    virtual void init(const std::vector<ParamGroup> &groups) = 0;
    virtual void step(std::vector<ParamGroup> &groups) = 0;
    virtual void zero_grad(std::vector<ParamGroup> &groups);
};
Required call order:
- Construct: SGD opt(lr, mom) or Adam opt(lr, β1, β2, ε).
- Collect params: model.collect_params(params).
- Init: opt.init(params). This is the only place where momentum / Adam moment buffers are allocated, each matching its parameter's shape.
- Training loop: per batch, run opt.zero_grad(params) → forward → backward → opt.step(params).
SGD with momentum & L2¶
Update:
Params:
- lr: learning rate \(\eta\).
- momentum: \(\mu\); 0 falls back to vanilla SGD.
- weight_decay: L2 coefficient \(\lambda\).
init() allocates one velocity tensor per parameter; zero_grad() is provided by the base class.
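A minimal sketch of one SGD step on a flat buffer, assuming the PyTorch-style formulation \(g' = g + \lambda\theta\), \(v \leftarrow \mu v + g'\), \(\theta \leftarrow \theta - \eta v\). The library may order or scale these terms differently (for example by folding \(\eta\) into the velocity). `sgd_step` is a hypothetical helper, not part of tiny_optimizer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One SGD step with momentum and L2 (assumed PyTorch-style convention):
//   g'    = g + weight_decay * theta
//   v     = momentum * v + g'
//   theta = theta - lr * v
inline void sgd_step(std::vector<float> &theta,
                     const std::vector<float> &grad,
                     std::vector<float> &velocity,
                     float lr, float momentum, float weight_decay)
{
    assert(theta.size() == grad.size());
    assert(theta.size() == velocity.size());
    for (std::size_t i = 0; i < theta.size(); ++i)
    {
        const float g = grad[i] + weight_decay * theta[i]; // L2 term
        velocity[i] = momentum * velocity[i] + g;          // with momentum == 0 this is just g
        theta[i] -= lr * velocity[i];
    }
}
```

With momentum set to 0 the velocity equals the current gradient, so the step reduces to vanilla SGD, as the parameter description states.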
Adam (lite)¶
Adam(float lr = 1e-3f,
float beta1 = 0.9f,
float beta2 = 0.999f,
float epsilon = 1e-8f,
float weight_decay = 0.0f);
Per step:
Bias correction is applied to the LR (cheaper than per-element):
init() allocates m and v per parameter; step() increments the internal time step t_.
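The step can be sketched as below, using the standard Adam moment updates with the bias correction folded into a scalar step size, \(\eta_t = \eta \sqrt{1-\beta_2^t} / (1-\beta_1^t)\). `AdamState`, `make_adam_state` and `adam_step` are hypothetical names operating on flat buffers. Treating `weight_decay` as an L2 term added to the gradient is an assumption, as is the exact placement of \(\epsilon\).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct AdamState
{
    std::vector<float> m; // first-moment estimate
    std::vector<float> v; // second-moment estimate
    int t = 0;            // time step, incremented once per step
};

// Mirrors what init() is described as doing: allocate m and v per parameter.
inline AdamState make_adam_state(std::size_t n)
{
    AdamState s;
    s.m.assign(n, 0.0f);
    s.v.assign(n, 0.0f);
    return s;
}

inline void adam_step(std::vector<float> &theta,
                      const std::vector<float> &grad,
                      AdamState &s,
                      float lr = 1e-3f, float beta1 = 0.9f, float beta2 = 0.999f,
                      float epsilon = 1e-8f, float weight_decay = 0.0f)
{
    assert(theta.size() == grad.size());
    assert(theta.size() == s.m.size() && theta.size() == s.v.size());
    ++s.t;
    const float t = static_cast<float>(s.t);
    // Bias correction applied once to the learning rate, not per element.
    const float lr_t = lr * std::sqrt(1.0f - std::pow(beta2, t)) / (1.0f - std::pow(beta1, t));
    for (std::size_t i = 0; i < theta.size(); ++i)
    {
        const float g = grad[i] + weight_decay * theta[i]; // assumed L2-in-gradient
        s.m[i] = beta1 * s.m[i] + (1.0f - beta1) * g;
        s.v[i] = beta2 * s.v[i] + (1.0f - beta2) * g * g;
        theta[i] -= lr_t * s.m[i] / (std::sqrt(s.v[i]) + epsilon);
    }
}
```

Computing the correction once per step costs two `pow` calls and one `sqrt`, instead of two extra divisions for every element.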
Practical defaults
- For SHM, biomedical signals or other small/unstable datasets: Adam with defaults works.
- For sparse / large-batch training: SGD with lr ~0.1 and momentum=0.9.
- weight_decay > 0 matches PyTorch L2 regularisation; do not over-decay biases (the implementation does decay them, but their magnitude is small).
Memory / PSRAM impact¶
- SGD: +1 velocity tensor per parameter → ~2× the weight memory (gradients not counted).
- Adam: +2 moment tensors per parameter → ~3× the weight memory (gradients not counted).
If you place model weights in PSRAM, you typically want optimiser buffers in PSRAM too. Tensor defaults to TINY_AI_MALLOC; replace weight tensors with Tensor::from_data(psram_buf, ...) views when the budget is tight.
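To budget for this, the multipliers above can be turned into a byte count. The sketch assumes 32-bit float elements and ignores gradient tensors and allocator overhead. `OptKind`, `optimizer_state_bytes` and `weights_plus_state_bytes` are hypothetical helpers, not library functions.

```cpp
#include <cstddef>

enum class OptKind { SGD, Adam };

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float elements");

// Extra bytes of optimiser state for n float parameters:
// SGD keeps one velocity buffer, Adam keeps two moment buffers.
constexpr std::size_t optimizer_state_bytes(std::size_t n_params, OptKind kind)
{
    return n_params * sizeof(float) * (kind == OptKind::Adam ? 2u : 1u);
}

// Weights plus optimiser state; gradient tensors are not counted.
constexpr std::size_t weights_plus_state_bytes(std::size_t n_params, OptKind kind)
{
    return n_params * sizeof(float) + optimizer_state_bytes(n_params, kind);
}
```

For a model with 10,000 parameters this gives 40,000 bytes of extra state for SGD and 80,000 bytes for Adam, on top of 40,000 bytes of weights.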
Trainer Integration¶
Trainer::ensure_params_collected() runs lazily on the first fit() call:
Per batch:
optimizer_->zero_grad(params_);
auto logits = model_->forward(X_batch);
auto grad = loss_backward(logits, ..., loss_type_, y_batch);
model_->backward(grad);
optimizer_->step(params_);
Implementing a custom optimiser is just a matter of subclassing Optimizer and overriding init / step — no changes to layers or Trainer required.
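As an illustration, here is a sketch of an AdaGrad optimiser written against the interface shown earlier. To keep it self-contained, `Tensor` is a hypothetical stand-in with a flat `std::vector<float>`, the `Optimizer` base is re-declared locally with a virtual destructor added, and the body of `zero_grad` is an assumption about what the base-class helper does. With the real library only the `AdaGrad` class would be needed, adapted to the actual Tensor API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins so the sketch compiles on its own.
struct Tensor { std::vector<float> data; };
struct ParamGroup { Tensor *param; Tensor *grad; };

class Optimizer
{
public:
    virtual ~Optimizer() = default;
    virtual void init(const std::vector<ParamGroup> &groups) = 0;
    virtual void step(std::vector<ParamGroup> &groups) = 0;
    // Assumed behaviour of the base-class helper: clear every gradient.
    virtual void zero_grad(std::vector<ParamGroup> &groups)
    {
        for (auto &g : groups)
            g.grad->data.assign(g.grad->data.size(), 0.0f);
    }
};

// AdaGrad: the per-element step shrinks with the accumulated squared gradient.
class AdaGrad : public Optimizer
{
public:
    explicit AdaGrad(float lr = 1e-2f, float epsilon = 1e-8f) : lr_(lr), eps_(epsilon) {}

    // Allocate one accumulator per parameter tensor, matching its size.
    void init(const std::vector<ParamGroup> &groups) override
    {
        accum_.clear();
        for (const auto &g : groups)
            accum_.emplace_back(g.param->data.size(), 0.0f);
    }

    void step(std::vector<ParamGroup> &groups) override
    {
        assert(accum_.size() == groups.size()); // init() must have run first
        for (std::size_t k = 0; k < groups.size(); ++k)
        {
            std::vector<float> &theta = groups[k].param->data;
            const std::vector<float> &grad = groups[k].grad->data;
            for (std::size_t i = 0; i < theta.size(); ++i)
            {
                accum_[k][i] += grad[i] * grad[i];
                theta[i] -= lr_ * grad[i] / (std::sqrt(accum_[k][i]) + eps_);
            }
        }
    }

private:
    float lr_;
    float eps_;
    std::vector<std::vector<float>> accum_; // running sum of squared gradients
};
```

The class only touches the `ParamGroup` list, so it slots into the same construct, collect, init, loop sequence as SGD and Adam.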