Notes¶
Dataset — Data Management: Shuffle, Split, Mini-Batch¶
Organizes data for training. Does three things: shuffle, split, batch.
Dataset wraps an external float32 matrix + label array into a shuffleable, splittable, iterable training dataset. It only owns an index array — the underlying data stays a read-only view, which makes it natural to pin the matrix in read-only flash or PSRAM.
Intuition¶
Why Shuffle?¶
- Without shuffle: model learns sequential patterns, not classification
- Shuffled: each batch has balanced sample distribution
Why Validation Set?¶
- `split(0.8)` = 80% training, 20% validation
- Never use validation data for parameter updates
Mini-Batch¶
Mini-batches are the sweet spot between full-batch updates (memory-heavy) and purely stochastic updates (noisy). Typical batch sizes are 16, 32, or 64.
CLASS DEFINITION¶
class Dataset
{
public:
    Dataset(const float *X, const int *y,
            int n_samples, int n_features, int n_classes);
    Dataset();
    Dataset(const Dataset &);
    Dataset(Dataset &&) noexcept;
    Dataset &operator=(const Dataset &);
    Dataset &operator=(Dataset &&) noexcept;
    ~Dataset();

    void shuffle(uint32_t seed = 0);
    void reset();
    int next_batch(Tensor &X_batch, int *y_batch, int batch_size);
    void split(float test_ratio, Dataset &train_out, Dataset &test_out,
               uint32_t seed = 0) const;

    int n_samples() const;
    int n_features() const;
    int n_classes() const;
    Tensor to_tensor() const;
};
DATA CONTRACT¶
- `X` is a row-major `n_samples × n_features` float matrix owned by the caller (typically a `static const float[]` in `iris_data.hpp` / `signal_data.hpp`).
- `y` is an `n_samples`-long array of class indices.
- `Dataset` keeps view pointers to `X`/`y` and an `int *indices_` array; the destructor only frees `indices_`.
- Copying / moving a `Dataset` never copies the underlying data, only the index array.
shuffle / split¶
- `shuffle` uses an LCG + Fisher–Yates pass to permute `indices_`, then resets `cursor_`. When `seed == 0`, the default seed `1234567891u` is used.
- `split`: `n_test = round(n_samples * test_ratio)`, clamped to `[1, n_samples - 1]`.
- Copy + shuffle the index array; the first `n_train` indices go to `train_out`, the rest to `test_out`.
- Internally, `split` uses a private `Dataset(X, y, n, F, C, given_indices)` constructor so each subset owns an independent copy of the indices.
next_batch ITERATION¶
- Pulls `actual = min(batch_size, n_samples - cursor_)` samples starting from `indices_[cursor_]`.
- If `X_batch.size != actual * n_features`, the tensor is reallocated as `Tensor(actual, n_features)`.
- Copies each row from `X` into `X_batch`; copies `y_[idx]` into `y_batch[i]`.
- Returns `actual`. A return value of 0 indicates end-of-epoch.
Typical loop:
Dataset ds(X, y, N, F, C);
ds.shuffle(epoch);
ds.reset();

Tensor X_batch;
int *y_batch = (int *)TINY_AI_MALLOC(B * sizeof(int));

while (true)
{
    int actual = ds.next_batch(X_batch, y_batch, B);
    if (actual == 0) break;
    // forward / backward / step ...
}
Trainer::fit() already implements this loop.
to_tensor¶
Copies all currently-indexed samples into an `[n_samples, n_features]` tensor (deep copy). Handy for one-shot inference / `Sequential::accuracy`.
MEMORY BUDGET¶
- Self: `indices_` is `n_samples * sizeof(int)` — a few KB.
- Per-batch: `Tensor X_batch` is `B * F * 4` bytes, `y_batch` is `B * 4` — both reallocated on demand.
- After split: train + test each carry their own index copy but share `X`/`y`.
For typical ESP32-S3 IMU / vibration datasets (N ~ thousands, F ~ tens), the full Dataset overhead is in the single-digit KB range and lives comfortably in internal SRAM.