Linalg: Implementation

File Structure

linalg/
├── tiny_linalg.h        (36 lines, public API declarations)
└── tiny_linalg.c        (516 lines, implementation)

Dependencies: tiny_math_config.h → tiny_constants.h, plus <math.h> and <string.h>.

Design Pattern

Every function follows the same three-step pattern:

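1. Validate the inputs (null pointers, dimensions).
2. Dispatch to ESP-DSP (only when padding = 0, stride = 1, and the platform is ESP32).
3. Fall back to a generic C loop (all other cases).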

Example: `tiny_mat_add_f32`

tiny_error_t tiny_mat_add_f32(const float *input1, const float *input2,
                               float *output, int rows, int cols,
                               int padd1, int padd2, int padd_out,
                               int stride1, int stride2, int stride_out)
{
    // 1. Validate
    if (NULL == input1 || NULL == input2 || NULL == output)
        return TINY_ERR_MATH_NULL_POINTER;

    // 2. Platform dispatch
    #if MCU_PLATFORM_SELECTED == MCU_PLATFORM_ESP32
    if (padd1 == 0 && padd2 == 0 && padd_out == 0 &&
        stride1 == 1 && stride2 == 1 && stride_out == 1) {
        dspm_add_f32(input1, input2, output, rows, cols,
                     0, 0, 0, 1, 1, 1);
        return TINY_OK;
    }
    #endif

    // 3. Generic fallback: row-major with stride
    const int in1_step = cols + padd1;
    const int in2_step = cols + padd2;
    const int out_step = cols + padd_out;

    for (int row = 0; row < rows; row++) {
        int base_in1 = row * in1_step;
        int base_in2 = row * in2_step;
        int base_out = row * out_step;

        for (int col = 0; col < cols; col++) {
            int idx_in1 = base_in1 + col * stride1;
            int idx_in2 = base_in2 + col * stride2;
            int idx_out = base_out + col * stride_out;

            output[idx_out] = input1[idx_in1] + input2[idx_in2];
        }
    }
    return TINY_OK;
}

Memory Layout Model

TinyMath matrices are stored in row-major format, with optional padding at the end of each row:

Row-major (no padding):            Row-major (padding=2):
┌───┬───┬───┬───┐                  ┌───┬───┬───┬───┬───┬───┐
│ a │ b │ c │ d │ ← cols=4         │ a │ b │ c │ d │ - │ - │
├───┼───┼───┼───┤                  ├───┼───┼───┼───┼───┼───┤
│ e │ f │ g │ h │                  │ e │ f │ g │ h │ - │ - │
└───┴───┴───┴───┘                  └───┴───┴───┴───┴───┴───┘
cols=4, step=4                     cols=4, step=6

The storage index of element \((i, j)\) is:

index = i × (cols + padding) + j × stride
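The formula above can be sketched as a small helper. The name `tiny_index_sketch` is hypothetical and is not part of the TinyMath API; it only mirrors the index arithmetic used in the generic loop of `tiny_mat_add_f32`.

```c
#include <stddef.h>

/* Hypothetical helper (not part of TinyMath): flat index of element (i, j)
 * in a row-major buffer that has `padding` extra floats at the end of each
 * row and a column step of `stride`. */
static inline size_t tiny_index_sketch(int i, int j, int cols,
                                       int padding, int stride)
{
    return (size_t)i * (size_t)(cols + padding) + (size_t)j * (size_t)stride;
}
```

For the padded layout in the diagram (cols = 4, padding = 2, stride = 1), element (1, 2), labelled g, sits at index 1 × 6 + 2 = 8. Without padding it sits at index 1 × 4 + 2 = 6.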

Why is padding needed?

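Padding exists for compatibility with ESP-DSP and other SIMD-optimized libraries that require power-of-two row alignment for vectorized loads. In most TinySHM workflows (small matrices, n ≤ 50), padding = 0 is optimal.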
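To make the effect of padding concrete, here is a minimal sketch of element-wise addition where one input has padded rows and the other operands are dense. `mat_add_padded_sketch` is a hypothetical stand-in, not the library function: it assumes unit stride, and it returns a plain `int` (0 on success, -1 on invalid input) instead of `tiny_error_t`.

```c
#include <stddef.h>

/* Hypothetical sketch (not the TinyMath implementation): element-wise add
 * of two row-major matrices whose rows may carry trailing padding.
 * Unit stride is assumed. Returns 0 on success, -1 on invalid input. */
static int mat_add_padded_sketch(const float *a, const float *b, float *out,
                                 int rows, int cols,
                                 int padd_a, int padd_b, int padd_out)
{
    if (a == NULL || b == NULL || out == NULL)
        return -1;
    if (rows <= 0 || cols <= 0)
        return -1;

    for (int r = 0; r < rows; r++) {
        /* Each operand advances by its own row step (cols + padding). */
        const float *row_a = a + (size_t)r * (size_t)(cols + padd_a);
        const float *row_b = b + (size_t)r * (size_t)(cols + padd_b);
        float *row_out = out + (size_t)r * (size_t)(cols + padd_out);

        for (int c = 0; c < cols; c++)
            row_out[c] = row_a[c] + row_b[c];  /* padding cells are never read */
    }
    return 0;
}
```

With a 2 × 4 input stored at a row step of 6 (padding = 2), the loop skips the two trailing cells of every row, so whatever values they hold never reach the output.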

Platform Dispatch Mechanism

#if MCU_PLATFORM_SELECTED == MCU_PLATFORM_ESP32
    // ── Accelerated path ──
    // Requires contiguous storage (padding=0, stride=1)
    dspm_add_f32(...);       // ESP-DSP, 2-5x faster
#else
    // ── Generic C fallback ──
    for (...) output[i] = a[i] + b[i];  // portable, slower
#endif

The same pattern is used for sub, mult, addc, subc, and multc. The accelerated path is taken only when all matrices are dense (no padding, unit stride).
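As a rough illustration of how the density check gates the two paths, the sketch below applies it to subtraction. Both `is_dense_sketch` and `mat_sub_sketch` are hypothetical names, the return codes are plain integers rather than `tiny_error_t`, and the dense branch uses a flat loop where an ESP32 build would presumably hand off to ESP-DSP.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: true only when every operand is contiguous. */
static bool is_dense_sketch(int padd1, int padd2, int padd_out,
                            int stride1, int stride2, int stride_out)
{
    return padd1 == 0 && padd2 == 0 && padd_out == 0 &&
           stride1 == 1 && stride2 == 1 && stride_out == 1;
}

/* Hypothetical sketch of out = a - b. Returns 0 on success, -1 on bad input. */
static int mat_sub_sketch(const float *a, const float *b, float *out,
                          int rows, int cols,
                          int padd1, int padd2, int padd_out,
                          int stride1, int stride2, int stride_out)
{
    if (a == NULL || b == NULL || out == NULL || rows <= 0 || cols <= 0)
        return -1;

    if (is_dense_sketch(padd1, padd2, padd_out, stride1, stride2, stride_out)) {
        /* Dense case: the matrix is one flat array of rows * cols floats.
         * An accelerated library call would go here on a supported platform. */
        const size_t n = (size_t)rows * (size_t)cols;
        for (size_t k = 0; k < n; k++)
            out[k] = a[k] - b[k];
        return 0;
    }

    /* Generic case: same index arithmetic as the tiny_mat_add_f32 fallback. */
    for (int r = 0; r < rows; r++) {
        const size_t base1 = (size_t)r * (size_t)(cols + padd1);
        const size_t base2 = (size_t)r * (size_t)(cols + padd2);
        const size_t base_o = (size_t)r * (size_t)(cols + padd_out);
        for (int c = 0; c < cols; c++)
            out[base_o + (size_t)c * (size_t)stride_out] =
                a[base1 + (size_t)c * (size_t)stride1] -
                b[base2 + (size_t)c * (size_t)stride2];
    }
    return 0;
}
```

Keeping the check in one predicate makes it harder for the dense condition to drift between functions that share the pattern.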

Function Groups by Implementation Strategy

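Group	Validation	ESP-DSP path	Generic algorithm
add / sub / mult	Null pointers, dimensions	`dspm_add/sub/mul_f32`	Row-major loop with stride
addc / subc / multc	Null pointers, dimensions	`dsps_add/sub/mulc_f32`	Loop with a scalar operand
mult	Null pointers, dimensions	`dspm_mult_f32`	Triple nested loop, \(O(mnk)\)
mult_ex	Null pointers, dimensions	—	Triple nested loop with padding offsets
matvec	Null pointers, dimensions	`dsps_dotprod_f32` per row	Dot-product loop
transpose	Null pointers, dimensions	—	Cache-friendly row iteration
eye / zero / fill	Null pointers, dimensions	`memset`	Simple C loop
hasnan / hasinf	Null pointers	—	`isnan()` / `isinf()` scan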
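The "triple nested loop with padding offsets" entry can be sketched as follows. This is an assumed shape, not the library's `mult_ex` code: the function name, the parameter order, and the integer return codes are all hypothetical. It computes C = A · B for A of size m × n and B of size n × k, with each matrix using its own row step.

```c
#include <stddef.h>

/* Hypothetical sketch (not the TinyMath mult_ex implementation):
 * C (m x k) = A (m x n) * B (n x k), all row-major, each with its own
 * trailing row padding. Cost is O(m * n * k) multiply-adds.
 * C must not alias A or B. Returns 0 on success, -1 on bad input. */
static int mat_mult_padded_sketch(const float *A, const float *B, float *C,
                                  int m, int n, int k,
                                  int padd_a, int padd_b, int padd_c)
{
    if (A == NULL || B == NULL || C == NULL || m <= 0 || n <= 0 || k <= 0)
        return -1;

    const size_t a_step = (size_t)(n + padd_a);  /* floats per row of A */
    const size_t b_step = (size_t)(k + padd_b);  /* floats per row of B */
    const size_t c_step = (size_t)(k + padd_c);  /* floats per row of C */

    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            float acc = 0.0f;
            for (int p = 0; p < n; p++)
                acc += A[(size_t)i * a_step + (size_t)p] *
                       B[(size_t)p * b_step + (size_t)j];
            C[(size_t)i * c_step + (size_t)j] = acc;
        }
    }
    return 0;
}
```

Accumulating into a local `acc` and writing each output cell once keeps the padding cells of C untouched.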