FP8 — 8-bit Floating Point: E4M3FN & E5M2¶
OCP standard: E4M3FN for weights/activations, E5M2 for gradients.
Notes¶
tiny_fp8 provides a pure-software implementation of the OCP-spec 8-bit floating-point formats: E4M3FN for weights / activations and E5M2 for gradients. The ESP32-S3 has no FP8 ALU, so all values are stored as uint8_t and upcast to float32 for arithmetic.
Intuition¶
| Format | Exp | Mantissa | Range | Precision | Role |
|---|---|---|---|---|---|
| E4M3FN | 4 | 3 | \(\pm 448\) | High | Weights, activations |
| E5M2 | 5 | 2 | \(\pm 57344\) | Low | Gradients |
Pure software emulation. E4M3FN: higher precision; E5M2: larger dynamic range.
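As a quick sanity check on the ranges above, the maxima follow from the bit layouts in the overview table below: the largest E4M3FN code has exponent field 1111 (= 15, bias 7) and mantissa 110 (= 6); the largest finite E5M2 code has exponent field 11110 (= 30, bias 15; 11111 is reserved for ±Inf / NaN) and mantissa 11 (= 3):

\[
448 = 2^{15-7}\cdot\left(1+\tfrac{6}{8}\right) = 256\cdot 1.75,
\qquad
57344 = 2^{30-15}\cdot\left(1+\tfrac{3}{4}\right) = 32768\cdot 1.75
\]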
FORMAT OVERVIEW¶
| Format | Bit layout | Bias | Max value | Min normal | Special encodings |
|---|---|---|---|---|---|
| E4M3FN | S EEEE MMM | 7 | ±448.0 | 2⁻⁶ | NaN = 0x7F / 0xFF; no ±inf |
| E5M2 | S EEEEE MM | 15 | ±57344.0 | 2⁻¹⁴ | ±Inf = 0x7C / 0xFC; NaN = 0x7D-0x7F / 0xFD-0xFF |
E4M3FN trades ±inf (and all but one NaN code per sign) for seven extra large normal values per sign, raising the maximum from an IEEE-style 240 to 448; it is OCP's recommended format for weights / activations. E5M2 mirrors a subset of IEEE 754 (keeps ±inf / NaN) and is recommended for gradients.
E4M3FN¶
uint8_t fp32_to_fp8_e4m3(float val);
float fp8_e4m3_to_fp32(uint8_t fp8);
void fp32_to_fp8_e4m3_batch(const float *src, uint8_t *dst, int n);
void fp8_e4m3_to_fp32_batch(const uint8_t *src, float *dst, int n);
Encode flow:
- Extract sign / exponent / mantissa.
- If `exp > 8` → clamp to ±448 (encoding `0x7E`); if `exp < -9` → flush to ±0.
- Rebias: `new_exp = exp + 7`.
- If `new_exp <= 0` → subnormal: right-shift the mantissa by `(1 - new_exp)` extra bits; otherwise round-to-nearest-even to a 3-bit mantissa.
- If rounding overflows the mantissa, bump the exponent and re-check for overflow.
- Pack `S EEEE MMM`.
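A minimal C sketch of this flow, assuming the usual fp32 bit layout and a memcpy type-pun; it mirrors the steps above but is not the shipped implementation:

#include <stdint.h>
#include <string.h>
#include <math.h>

/* Illustrative sketch of the E4M3FN encode flow above (not the shipped code). */
uint8_t fp32_to_fp8_e4m3_sketch(float val)
{
    uint32_t bits;
    memcpy(&bits, &val, sizeof bits);                   /* reinterpret fp32 bits  */
    uint32_t sign = bits >> 31;
    int      exp  = (int)((bits >> 23) & 0xFF) - 127;   /* unbiased fp32 exponent */
    uint32_t mant = bits & 0x7FFFFFu;                   /* 23-bit fp32 mantissa   */

    if (isnan(val)) return (uint8_t)((sign << 7) | 0x7F);  /* NaN -> 0x7F / 0xFF  */
    if (exp > 8)    return (uint8_t)((sign << 7) | 0x7E);  /* clamp to ±448       */
    if (exp < -9)   return (uint8_t)(sign << 7);           /* flush to ±0         */

    int new_exp = exp + 7;            /* rebias (E4M3 bias = 7)                   */
    int shift   = 20;                 /* 23-bit -> 3-bit mantissa                 */
    if (new_exp <= 0) {               /* subnormal: restore hidden 1, shift extra */
        mant   |= 0x800000u;
        shift  += 1 - new_exp;
        new_exp = 0;
    }

    uint32_t m    = mant >> shift;                      /* round-to-nearest-even  */
    uint32_t rest = mant & ((1u << shift) - 1u);
    uint32_t half = 1u << (shift - 1);
    if (rest > half || (rest == half && (m & 1u))) m++;

    uint32_t code = ((uint32_t)new_exp << 3) + m;   /* mantissa carry bumps exp   */
    if (code > 0x7E) code = 0x7E;                   /* stay below the NaN pattern */
    return (uint8_t)((sign << 7) | code);
}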
Decode flow:
- `0x7F` / `0xFF` → NaN.
- `exp == 0` → subnormal: `val = (-1)^S · 2⁻⁶ · (mant / 8)`.
- Otherwise → normal: `val = (-1)^S · 2^(exp - 7) · (1 + mant / 8)`.
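The matching decode, again as a hedged sketch rather than the library source:

#include <stdint.h>
#include <math.h>

/* Illustrative E4M3FN decode following the rules above. */
float fp8_e4m3_to_fp32_sketch(uint8_t fp8)
{
    float s    = (fp8 & 0x80) ? -1.0f : 1.0f;
    int   exp  = (fp8 >> 3) & 0xF;
    int   mant = fp8 & 0x7;

    if ((fp8 & 0x7F) == 0x7F) return NAN;                  /* 0x7F / 0xFF -> NaN */
    if (exp == 0) return s * ldexpf((float)mant / 8.0f, -6);       /* subnormal  */
    return s * ldexpf(1.0f + (float)mant / 8.0f, exp - 7);         /* normal     */
}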
E5M2¶
uint8_t fp32_to_fp8_e5m2(float val);
float fp8_e5m2_to_fp32(uint8_t fp8);
void fp32_to_fp8_e5m2_batch(const float *src, uint8_t *dst, int n);
void fp8_e5m2_to_fp32_batch(const uint8_t *src, float *dst, int n);
Differences vs E4M3:
- Bias = 15, `new_exp = exp + 15`.
- Round-to-nearest-even to a 2-bit mantissa.
- Values beyond ±57344 encode as ±Inf.
- IEEE-style ±Inf / NaN are preserved.
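For completeness, a decode sketch under the same assumptions (bias 15, 2-bit mantissa, IEEE-style specials); illustrative, not the shipped code:

#include <stdint.h>
#include <math.h>

/* Illustrative E5M2 decode. */
float fp8_e5m2_to_fp32_sketch(uint8_t fp8)
{
    float s    = (fp8 & 0x80) ? -1.0f : 1.0f;
    int   exp  = (fp8 >> 2) & 0x1F;
    int   mant = fp8 & 0x3;

    if (exp == 0x1F) return mant ? NAN : s * INFINITY;  /* IEEE-style specials */
    if (exp == 0) return s * ldexpf((float)mant / 4.0f, -14);      /* subnormal */
    return s * ldexpf(1.0f + (float)mant / 4.0f, exp - 15);        /* normal    */
}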
FORMAT DISPATCH¶
uint8_t fp32_to_fp8(float val, tiny_dtype_t dtype); // pick E4M3 / E5M2
float fp8_to_fp32(uint8_t fp8, tiny_dtype_t dtype);
void fp32_to_fp8_batch(const float *src, uint8_t *dst, int n, tiny_dtype_t dtype);
void fp8_to_fp32_batch(const uint8_t *src, float *dst, int n, tiny_dtype_t dtype);
tiny::quantize / dequantize (in tiny_quant.hpp) auto-dispatch to these helpers based on params.dtype, so application code rarely needs to call the batch functions directly.
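For illustration, a scalar round trip through the dispatch helpers (TINY_DTYPE_FP8_E4M3 is the dtype tag used in the usage patterns below):

float   x = 3.14159f;
uint8_t q = fp32_to_fp8(x, TINY_DTYPE_FP8_E4M3);   // dispatches to the E4M3 encoder
float   y = fp8_to_fp32(q, TINY_DTYPE_FP8_E4M3);   // ≈ 3.25 after 3-bit-mantissa RNE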
USAGE PATTERNS¶
Weight compression¶
QuantParams qp = calibrate(weight, TINY_DTYPE_FP8_E4M3);
uint8_t *buf = (uint8_t *)TINY_AI_MALLOC(weight.size);
quantize(weight, buf, qp);
// ... persist to SPIFFS / NVS / deployment blob ...
// Reload + decompress
Tensor restored = Tensor::zeros_like(weight);
dequantize(buf, restored, qp);
example_cnn.cpp demonstrates the end-to-end 4× compression flow with error stats — see EXAMPLES/CNN.
Gradient communication / checkpointing¶
Compress gradients to E5M2 before stashing them in PSRAM:
QuantParams qp_g = calibrate(grad, TINY_DTYPE_FP8_E5M2);
uint8_t *gb = (uint8_t *)TINY_AI_MALLOC_PSRAM(grad.size);
quantize(grad, gb, qp_g);
Decompress back to fp32 when needed.
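Restoring mirrors the weight flow above; a sketch, where grad_restored is an illustrative name:

Tensor grad_restored = Tensor::zeros_like(grad);
dequantize(gb, grad_restored, qp_g);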
ACCURACY & TRADE-OFFS¶
- E4M3FN: relative quantization step ~⅛ (12.5%), so worst-case rounding error is ~6.25%; works well for sparsified weights / activations after ReLU.
- E5M2: relative quantization step ~¼ (25%), but the dynamic range is 128× larger, suiting the long-tailed distributions of gradients.
- Recommendation: pair with INT8 — when INT8 loses precision on layers with extreme dynamic range (typical for attention weights), switch to E4M3 instead.
SOFTWARE IMPLEMENTATION COST¶
ESP32-S3 has no FP8 ALU, so every quant / dequant call goes through pure C++ bit packing (with expf / powf). Recommendations:
- Treat FP8 as storage, not compute: decompress once, do math in fp32 (see the sketch after this list).
- Quantise / dequantise off the hot path (one-shot at weight load time).
- Batch functions are simple per-element loops; feel free to add coarser parallelism on top (FreeRTOS tasks, cache-friendly unrolling).
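A sketch of the "storage, not compute" pattern; fp8_weights and n are placeholders for a layer's persisted E4M3 buffer and its element count:

// `fp8_weights` / `n` are illustrative placeholders.
float *w = (float *)TINY_AI_MALLOC(n * sizeof(float));
fp8_e4m3_to_fp32_batch(fp8_weights, w, n);   // one-shot upcast at weight-load time
// ... all hot-path math now runs on `w` in fp32 ...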