SIMD and Auto-Vectorization
Vex has two related, but different, SIMD stories:
- small fixed-size arrays can be lowered directly to LLVM vector instructions
- tensor and graph workloads can flow through SIR, which has its own SIMD backend
The important constraint is that SIMD is not a blanket guarantee for every array expression. Some patterns inline cleanly, some lower to loop kernels, and some belong on the graph fn and Tensor<T> path instead.
See also:
The practical model
Use ordinary arrays when you want straightforward element-wise code:
fn add4(a: [i32; 4], b: [i32; 4]): [i32; 4] {
return a + b
}
fn dot8(a: [f32; 8], b: [f32; 8]): f32 {
return \+ (a * b)
}For small static arrays, the compiler can lower these operations to direct vector IR. The current inline threshold in the SIMD backend is 64 bytes with a power-of-two element count.
When data is dynamic, graph-shaped, or needs tensor-specific routing, use the graph fn and Tensor<T> path instead:
graph fn normalize(x: Tensor<f32>): Tensor<f32> {
let mag = Math.sqrt(x * x)
return x / mag
}What is solid today
The currently well-grounded pieces are:
- inline SIMD for small static arrays in compiler codegen
- vector comparisons and reductions for qualifying fixed-size arrays
- SIR SIMD backends for tensor and graph workloads
- mask operations in SIR such as
any,all,countBits,firstSet, andselect
What should be read conservatively
Treat these areas as advanced or still moving:
- generic tensor arithmetic like
Tensor<T> * Tensor<T>across arbitraryT - automatic coercion stories between
Span<T>andTensor<T>at every call boundary - assuming every dynamic array expression will become a single SIMD instruction
- assuming every matrix or signal-processing operator is equally mature on every backend
Small-array path
For fixed-size arrays, these patterns are the safest ones to expect the compiler to optimize well:
- element-wise arithmetic such as
+,-,*,/ - comparisons such as
==,!=,<,> - reductions such as
\+,\*,\<,\>
Example:
fn energy4(x: [f32; 4]): f32 {
return \+ (x * x)
}This is the part of the SIMD story backed by crates/vex-compiler/src/codegen_hir/expr/simd_small.
Tensor and graph path
When code is naturally tensor-oriented, prefer graph fn plus concrete tensor types such as Tensor<f32>.
That path lowers through SIR and then routes to CPU SIMD, GPU, or other backends based on shape and backend support. It is more powerful than the small-array fast path, but it is also where current limitations show up first.
Choosing the right abstraction
Use fixed arrays when:
- the shape is small and known at compile time
- the code is mostly arithmetic, comparison, or reduction
- you want predictable LLVM-level vectorization
Use Tensor<T> and graph fn when:
- the data shape is dynamic
- the computation is already graph-like
- you want SIR routing and backend dispatch
Verification
When SIMD behavior matters, inspect the generated LLVM or backend output instead of assuming fusion happened:
vex compile --emit-llvm file.vxGood signs for the small-array path are direct vector operations and vector reductions. For graph and tensor code, the more relevant check is whether the code lowered into the expected SIR/backend route.
// ❌ Avoid: Manual loops when operators work
fn sum_bad(data: [f64]): f64 {
let! total = 0.0
for x in data {
total = total + x // Unnecessary!
}
return total
}SIMD Operator Reference
| Operator | Description | Example |
|---|---|---|
\+ | Sum reduction | \+ [1,2,3] → 6 |
\* | Product reduction | \* [1,2,3] → 6 |
\< | Min reduction | \< [3,1,2] → 1 |
\> | Max reduction | \> [3,1,2] → 3 |
\& | AND reduction | \& [t,t,f] → false |
| | OR reduction | | [t,f,f] → true |
<? | Element-wise min | [1,5] <? [3,2] → [1,2] |
>? | Element-wise max | [1,5] >? [3,2] → [3,5] |
>< | Clamp | [1,5] >< (2,4) → [2,4] |
+| | Saturating add | 250u8 +| 10u8 → 255 |
-| | Saturating sub | 5u8 -| 10u8 → 0 |
<<< | Rotate left | x <<< 1 |
>>> | Rotate right | x >>> 1 |
<*> | Matrix multiply | a <*> b |
' | Transpose | matrix' |
Next Steps
- GPU Programming - Massively parallel compute
- FFI - Integrating with native libraries
- Memory Management - Efficient data handling