Notes¶
tiny_quant_config.h centralises the dtype enum, the quantisation parameter struct, and the format-specific limits. Every INT / FP8 quantiser depends on these types, so this header is the foundation of the entire quant subsystem.
Quant Config — Data Type Enum & Quant Parameters Struct
Defines the tiny_dtype_t enum and the tiny_quant_params_t struct.
Reference¶
| Enumerator | Description | Bit width |
|---|---|---|
| TINY_DTYPE_FLOAT32 | 32-bit IEEE 754 float | 32 |
| TINY_DTYPE_INT16 | 16-bit signed integer | 16 |
| TINY_DTYPE_INT8 | 8-bit signed integer | 8 |
| TINY_DTYPE_FP8_E4M3 | FP8 (E4M3FN) | 8 |
| TINY_DTYPE_FP8_E5M2 | FP8 (E5M2) | 8 |
tiny_dtype_t¶
typedef enum
{
    TINY_DTYPE_FLOAT32  = 0, // 32-bit IEEE 754 float (native compute type)
    TINY_DTYPE_INT16    = 1, // signed 16-bit integer
    TINY_DTYPE_INT8     = 2, // signed 8-bit integer (most HW-friendly on ESP32-S3)
    TINY_DTYPE_FP8_E4M3 = 3, // 8-bit float E4M3FN: range ±448, weights/activations
    TINY_DTYPE_FP8_E5M2 = 4, // 8-bit float E5M2: range ±57344, gradients
} tiny_dtype_t;
tiny_quant_params_t¶
typedef struct
{
    tiny_dtype_t dtype;
    float scale;    // float_val = scale * (quant_val - zero_point)
    int zero_point; // 0 for symmetric / FP8
} tiny_quant_params_t;
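As a minimal sketch of how the struct is meant to be read, the helper below applies the dequantisation rule from the `scale` comment to a single INT8 value. The name `example_dequant_int8` is hypothetical and not part of tiny_ai; the two types are repeated from above only so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Repeated from tiny_quant_config.h so the sketch is self-contained. */
typedef enum
{
    TINY_DTYPE_FLOAT32  = 0,
    TINY_DTYPE_INT16    = 1,
    TINY_DTYPE_INT8     = 2,
    TINY_DTYPE_FP8_E4M3 = 3,
    TINY_DTYPE_FP8_E5M2 = 4,
} tiny_dtype_t;

typedef struct
{
    tiny_dtype_t dtype;
    float scale;
    int zero_point;
} tiny_quant_params_t;

/* Hypothetical helper (not a tiny_ai function):
 * float_val = scale * (quant_val - zero_point) for one INT8 value. */
static float example_dequant_int8(int8_t q, const tiny_quant_params_t *p)
{
    assert(p != NULL && p->dtype == TINY_DTYPE_INT8);
    /* Widen to int first so the subtraction cannot overflow int8_t. */
    return p->scale * (float)((int)q - p->zero_point);
}
```

With `scale = 0.5f` and `zero_point = 0`, the stored value 64 maps back to 32.0f; with `zero_point = 10` the same stored value maps to 27.0f.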
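tiny_ai defaults to symmetric quantisation (zero_point = 0), so the dequantisation rule reduces to:
\[ x \approx \text{scale} \cdot q \]
scale is derived from the tensor's maximum absolute value:
\[ \text{scale} = \frac{\max_i |x_i|}{Q_\text{max}} \]
with \( Q_\text{max} \) = 127 (INT8), 32767 (INT16), 448 (FP8 E4M3), 57344 (FP8 E5M2).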
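The scale rule can be sketched in a few lines of C. Both helper names below are hypothetical, not tiny_ai functions. Two choices are assumptions of this sketch rather than documented library behaviour: an all-zero tensor falls back to a scale of 1, and INT8 values saturate at ±127 so the range stays symmetric.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale = max|x| / q_max (symmetric quantisation). */
static float example_symmetric_scale(const float *x, size_t n, float q_max)
{
    assert(x != NULL && n > 0 && q_max > 0.0f);
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = (x[i] < 0.0f) ? -x[i] : x[i];
        if (a > max_abs) {
            max_abs = a;
        }
    }
    /* Assumption: avoid a zero scale for an all-zero tensor. */
    return (max_abs > 0.0f) ? (max_abs / q_max) : 1.0f;
}

/* Hypothetical helper: q = round(x / scale), saturated to [-127, 127]. */
static int8_t example_quantize_int8(float x, float scale)
{
    float r = x / scale;
    if (r > 127.0f) {
        r = 127.0f;
    }
    if (r < -127.0f) {
        r = -127.0f;
    }
    /* Round half away from zero, then truncate to an integer. */
    return (int8_t)((r >= 0.0f) ? (r + 0.5f) : (r - 0.5f));
}
```

For a tensor whose largest magnitude is 254.0 and \( Q_\text{max} \) = 127, the scale is 2.0, so 100.0 is stored as 50 and anything beyond ±254.0 saturates at ±127.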
FORMAT LIMITS¶
// FP8 E4M3FN (OCP spec): no ±inf, NaN = 0x7F / 0xFF
#define TINY_FP8_E4M3_MAX 448.0f
#define TINY_FP8_E4M3_MIN (-448.0f)
#define TINY_FP8_E4M3_NAN 0x7Fu
// FP8 E5M2: supports ±inf and NaN
#define TINY_FP8_E5M2_MAX 57344.0f
#define TINY_FP8_E5M2_MIN (-57344.0f)
#define TINY_FP8_E5M2_INF 0x7Cu
#define TINY_FP8_E5M2_NAN 0x7Fu
// INT8 / INT16 symmetric ranges
#define TINY_INT8_MAX 127
#define TINY_INT8_MIN (-128)
#define TINY_INT16_MAX 32767
#define TINY_INT16_MIN (-32768)
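The special-value encodings above can be checked at the bit level. The predicates below are an illustrative sketch with hypothetical names (they are not tiny_ai functions); the two macros are repeated from the header so the snippet compiles on its own. The bit patterns follow the comments above: E4M3FN has no infinities and uses 0x7F / 0xFF for NaN, while E5M2 uses an all-ones exponent for both infinity (mantissa 00) and NaN (mantissa non-zero).

```c
#include <stdbool.h>
#include <stdint.h>

/* Repeated from tiny_quant_config.h so the sketch is self-contained. */
#define TINY_FP8_E4M3_NAN 0x7Fu
#define TINY_FP8_E5M2_INF 0x7Cu

/* E4M3FN: NaN is exponent 1111 with mantissa 111; the sign bit is ignored. */
static bool example_e4m3_is_nan(uint8_t bits)
{
    return (bits & 0x7Fu) == TINY_FP8_E4M3_NAN;
}

/* E5M2: infinity is exponent 11111 with mantissa 00; the sign bit is ignored. */
static bool example_e5m2_is_inf(uint8_t bits)
{
    return (bits & 0x7Fu) == TINY_FP8_E5M2_INF;
}

/* E5M2: NaN is exponent 11111 with any non-zero mantissa. */
static bool example_e5m2_is_nan(uint8_t bits)
{
    return (bits & 0x7Cu) == 0x7Cu && (bits & 0x03u) != 0u;
}
```

Note that 0x7E in E4M3FN and 0x7B in E5M2 are the largest finite values (448 and 57344 respectively), so neither is classified as special.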
CHOOSING A dtype¶
| Scenario | Suggested dtype | Notes |
|---|---|---|
| Inference, memory-bound | INT8 | Works with the INT8 dense kernel; 4× smaller than FLOAT32 |
| High-precision inference / stats | INT16 | 2× smaller than FLOAT32, near-zero accuracy loss |
| Static weights needing FP range | FP8_E4M3 | 4× smaller than FLOAT32, wider dynamic range than INT8 |
| Gradients / backward intermediates | FP8_E5M2 | 4× smaller than FLOAT32; larger range (±57344) but lower precision than FP8_E4M3 |
| Training | FLOAT32 | Maximum numerical stability |
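The compression figures in the table follow directly from the bit widths listed in the Reference section. As a rough sketch, the hypothetical helpers below (not part of tiny_ai) compute the storage needed for a tensor of `n` elements; the enum is repeated from the header so the snippet compiles on its own.

```c
#include <stddef.h>

/* Repeated from tiny_quant_config.h so the sketch is self-contained. */
typedef enum
{
    TINY_DTYPE_FLOAT32  = 0,
    TINY_DTYPE_INT16    = 1,
    TINY_DTYPE_INT8     = 2,
    TINY_DTYPE_FP8_E4M3 = 3,
    TINY_DTYPE_FP8_E5M2 = 4,
} tiny_dtype_t;

/* Hypothetical helper: bit width per element, as in the Reference table. */
static size_t example_dtype_bits(tiny_dtype_t dtype)
{
    switch (dtype) {
    case TINY_DTYPE_FLOAT32:  return 32;
    case TINY_DTYPE_INT16:    return 16;
    case TINY_DTYPE_INT8:
    case TINY_DTYPE_FP8_E4M3:
    case TINY_DTYPE_FP8_E5M2: return 8;
    }
    return 0; /* unknown dtype */
}

/* Hypothetical helper: payload bytes for n elements (scale and
 * zero_point are stored separately and are not counted here). */
static size_t example_tensor_bytes(size_t n, tiny_dtype_t dtype)
{
    return n * example_dtype_bits(dtype) / 8;
}
```

A 1024-element tensor takes 4096 bytes as FLOAT32, 2048 bytes as INT16, and 1024 bytes as INT8 or either FP8 format, which is where the 2× and 4× figures come from.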
NAMESPACING¶
The types and constants in tiny_quant_config.h live in the global / extern "C" scope and are usable from both C and C++. tiny_quant.hpp adds the C++ tiny::QuantParams wrapper:
struct QuantParams
{
    tiny_dtype_t dtype;
    float scale;
    int zero_point;
    tiny_quant_params_t to_c() const { return { dtype, scale, zero_point }; }
};
to_c() converts the wrapper back into a tiny_quant_params_t, so C++ code can pass its parameters to the C API (e.g. tiny_quant_f32_to_int8).