Dense — Fully-Connected: Every Output Sees Every Input¶
\(\mathbf{y} = \mathbf{x} W^\top + \mathbf{b}\) — a weighted sum at its core.
Notes¶
Dense is the fully-connected (linear) layer: \(\mathbf{y} = \mathbf{x} W^\top + \mathbf{b}\). It powers MLPs and classification heads. Weights are initialised with Xavier-uniform to keep activations well-scaled in deep stacks.
Intuition¶
Each Neuron = One Weighted Vote¶
One output neuron's job:
- Look at all input features
- Assign a weight to each (important features → higher weight)
- Weighted sum + bias
- Pass through activation to decide whether to "fire"
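A minimal sketch of that job, using plain std::vector floats rather than the library's Tensor (the function name and the ReLU choice are illustrative):
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One output neuron: weighted vote over all inputs, plus bias, then ReLU.
float neuron_fire(const std::vector<float> &x,
                  const std::vector<float> &w, // one row of W
                  float b)
{
    float z = b;
    for (std::size_t i = 0; i < x.size(); ++i)
        z += w[i] * x[i]; // important features carry larger |w[i]|
    return std::max(0.0f, z); // "fire" only when the vote is positive
}
```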
What the Weight Matrix Means¶
\(W\) has shape `[out_features, in_features]` — each row contains all the weights for one output neuron.
Why Xavier Initialization¶
If weights are too small → signal vanishes through layers. Too large → signal explodes.
Xavier uniform: \(W \sim U\left[-\sqrt{6/(n_{in}+n_{out})},\; \sqrt{6/(n_{in}+n_{out})}\right]\). This keeps activation variance consistent across layers, enabling stable signal flow through deep networks.
Dense = One Matrix Multiply¶
```cpp
Tensor output = matmul(input, weight.T()) + bias; // [B, out_features]
```
MATH¶
Input x has shape [batch, in_features], output y has shape [batch, out_features]:
\(\mathbf{y} = \mathbf{x} W^\top + \mathbf{b}\)
Weights have shape [out_features, in_features] (rows = output dim), bias [out_features].
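As a shape check, the same computation as an explicit loop (a reference sketch only; the library does this with a single matmul, as shown above):
```cpp
// Reference loop spelling out the shapes: x [B, in], W [out, in]
// (row r holds neuron r's weights), b [out], y [B, out].
void dense_forward_ref(const float *x, const float *W, const float *b,
                       float *y, int B, int in, int out)
{
    for (int n = 0; n < B; ++n)
        for (int r = 0; r < out; ++r) {
            float acc = b ? b[r] : 0.0f; // bias may be absent (use_bias=false)
            for (int c = 0; c < in; ++c)
                acc += x[n * in + c] * W[r * in + c];
            y[n * out + r] = acc;
        }
}
```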
Xavier-uniform init¶
\(W \sim U\left[-\sqrt{6/(n_{in}+n_{out})},\; \sqrt{6/(n_{in}+n_{out})}\right]\), with \(n_{in}\) = in_features and \(n_{out}\) = out_features. Bias is zero-initialised.
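A sketch of the fill over a raw float buffer, assuming std::mt19937 as the RNG (the library's own initialiser may differ):
```cpp
#include <cmath>
#include <random>

// Xavier-uniform: W ~ U[-a, a] with a = sqrt(6 / (n_in + n_out)).
void xavier_uniform_fill(float *W, int n_out, int n_in, unsigned seed = 42)
{
    const float a = std::sqrt(6.0f / static_cast<float>(n_in + n_out));
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-a, a);
    for (int i = 0; i < n_out * n_in; ++i)
        W[i] = dist(rng); // every element drawn i.i.d. from U[-a, a]
}
```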
CLASS DEFINITION¶
```cpp
class Dense : public Layer
{
public:
    Tensor weight; // [out_features, in_features]
    Tensor bias;   // [out_features] (empty when use_bias=false)
#if TINY_AI_TRAINING_ENABLED
    Tensor dweight;
    Tensor dbias;
#endif
    Dense(int in_features, int out_features, bool use_bias = true);
    Tensor forward(const Tensor &x) override; // [B, in_feat] → [B, out_feat]
    Tensor backward(const Tensor &grad_out) override;
    void collect_params(std::vector<ParamGroup> &groups) override;
    int in_features() const;
    int out_features() const;
};
```
BACKWARD¶
Input is cached in x_cache_ (a clone() of the forward input). The backward equations, with \(\delta\) = grad_out:
\(\partial L/\partial W = \delta^\top \mathbf{x}\), \(\quad \partial L/\partial \mathbf{b} = \textstyle\sum_{batch} \delta\), \(\quad \partial L/\partial \mathbf{x} = \delta\, W\).
dweight and dbias are accumulated; the optimiser's zero_grad() clears them at the start of every mini-batch.
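A sketch of how those equations could map onto the helpers used elsewhere on this page (matmul, T(), operator+); sum_rows is a hypothetical batch-sum helper, and the real method may accumulate in place:
```cpp
Tensor Dense::backward(const Tensor &grad_out)
{
    // dL/dW = grad_out^T · x : [out, B] x [B, in] -> [out, in]
    dweight = dweight + matmul(grad_out.T(), x_cache_);
    // dL/db = per-column sum of grad_out over the batch -> [out]
    if (use_bias_)
        dbias = dbias + sum_rows(grad_out); // sum_rows: hypothetical helper
    // dL/dx = grad_out · W : [B, out] x [out, in] -> [B, in]
    return matmul(grad_out, weight);
}
```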
PARAMETER COLLECTION¶
```cpp
void Dense::collect_params(std::vector<ParamGroup> &groups)
{
    groups.push_back({&weight, &dweight});
    if (use_bias_) groups.push_back({&bias, &dbias});
}
```
When use_bias = false, the bias tensor is empty and is not registered.
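What a consumer of these groups might look like, as a bare-bones SGD step; the param/grad field names and the data()/size() accessors on Tensor are assumptions for this sketch:
```cpp
// Gather every {parameter, gradient} pair the model registered...
std::vector<ParamGroup> groups;
model.collect_params(groups);

// ...then apply vanilla SGD: p <- p - lr * g (field names assumed).
const float lr = 0.01f;
for (auto &g : groups)
    for (int i = 0; i < g.param->size(); ++i)
        g.param->data()[i] -= lr * g.grad->data()[i];
```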
USAGE¶
```cpp
Dense fc1(F, 128);            // [B, F] → [B, 128]
Dense fc2(128, num_classes);  // [B, 128] → [B, num_classes]

Sequential m;
m.add(new Dense(F, 128));
m.add(new ActivationLayer(ActType::RELU));
m.add(new Dense(128, num_classes));
m.add(new ActivationLayer(ActType::SOFTMAX));
```
Or via the MLP convenience wrapper, which auto-inserts ReLU between hidden Dense layers and a final Softmax.
PERFORMANCE & MEMORY¶
- Param count: `F_in * F_out + F_out` with bias (worked example below).
- Complexity: forward is O(B * F_in * F_out); backward is of the same order.
- Memory: training roughly doubles the weight footprint (`dweight`) and the bias footprint (`dbias`).
- PSRAM: when the weight tensor (`F_in * F_out` floats) reaches ~64 KiB, store `weight` in PSRAM via `Tensor::from_data`.
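Worked example: Dense(64, 128) holds 64 × 128 + 128 = 8,320 parameters, i.e. roughly 32.5 KiB of fp32 weights before gradients are added.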
QUANTISATION HOOKS¶
- INT8 PTQ: `quantize_weights(weight, qp)` produces an `int8_t*`; then call `tiny_quant_dense_forward_int8` for fully-integer inference.
- FP8: `calibrate(weight, TINY_DTYPE_FP8_E4M3)` + `quantize(weight, buf, qp)` saves 4× storage; dequantise back to float at runtime.
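For intuition, the arithmetic behind a symmetric INT8 weight quantiser, as a self-contained sketch (not the actual quantize_weights signature):
```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric per-tensor INT8: pick the scale so the largest |w| maps to 127,
// then round each weight; dequantise with w ≈ q * scale.
float quantize_symmetric_int8(const float *w, int8_t *q, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; ++i)
        max_abs = std::max(max_abs, std::fabs(w[i]));
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; ++i)
        q[i] = static_cast<int8_t>(std::lround(w[i] / scale));
    return scale;
}
```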