SIMD and Auto-Vectorization

Vex has two related but distinct SIMD paths:

  • small fixed-size arrays can be lowered directly to LLVM vector instructions
  • tensor and graph workloads can flow through SIR, which has its own SIMD backend

The important constraint is that SIMD is not a blanket guarantee for every array expression. Some patterns inline cleanly, some lower to loop kernels, and some belong on the graph fn and Tensor<T> path instead.

The practical model

Use ordinary arrays when you want straightforward element-wise code:

vex
fn add4(a: [i32; 4], b: [i32; 4]): [i32; 4] {
    return a + b
}

fn dot8(a: [f32; 8], b: [f32; 8]): f32 {
    return \+ (a * b)
}

For small static arrays, the compiler can lower these operations directly to vector IR. The SIMD backend currently applies a 64-byte inline threshold and requires a power-of-two element count.
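The size arithmetic behind that threshold can be sketched in a few lines of Python. This is only a model of the rule as stated in the text: the function names are hypothetical, the array size is taken to be element size times element count, and treating 64 bytes as an inclusive upper bound is an assumption rather than something the text confirms.

```python
# Hypothetical model of the inline rule described above; not compiler code.
INLINE_THRESHOLD_BYTES = 64  # figure from the text; inclusive bound is an assumption


def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two when exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0


def qualifies_for_inline_simd(elem_bytes: int, count: int) -> bool:
    # Total size is assumed to be elem_bytes * count, with no padding.
    total_bytes = elem_bytes * count
    return is_power_of_two(count) and total_bytes <= INLINE_THRESHOLD_BYTES


print(qualifies_for_inline_simd(4, 4))   # [i32; 4], 16 bytes  -> True
print(qualifies_for_inline_simd(4, 8))   # [f32; 8], 32 bytes  -> True
print(qualifies_for_inline_simd(4, 3))   # [f32; 3], count not a power of two -> False
print(qualifies_for_inline_simd(8, 16))  # [f64; 16], 128 bytes -> False
```

Under these assumptions, both `add4` and `dot8` above fall inside the inline range. Confirm the real cutoff against generated output rather than relying on this model.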

When data is dynamic, graph-shaped, or needs tensor-specific routing, use the graph fn and Tensor<T> path instead:

vex
graph fn normalize(x: Tensor<f32>): Tensor<f32> {
    let mag = Math.sqrt(x * x)
    return x / mag
}

What is solid today

The currently well-grounded pieces are:

  • inline SIMD for small static arrays in compiler codegen
  • vector comparisons and reductions for qualifying fixed-size arrays
  • SIR SIMD backends for tensor and graph workloads
  • mask operations in SIR such as any, all, countBits, firstSet, and select

What should be read conservatively

Treat these areas as advanced or still moving:

  • generic tensor arithmetic like Tensor<T> * Tensor<T> across arbitrary T
  • automatic coercion between Span<T> and Tensor<T> at every call boundary
  • whether a given dynamic array expression becomes a single SIMD instruction
  • whether a given matrix or signal-processing operator is equally mature on every backend

Small-array path

For fixed-size arrays, the compiler is most likely to vectorize these patterns:

  • element-wise arithmetic such as +, -, *, /
  • comparisons such as ==, !=, <, >
  • reductions such as \+, \*, \<, \>

Example:

vex
fn energy4(x: [f32; 4]): f32 {
    return \+ (x * x)
}
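As a scalar reference for what `\+ (x * x)` computes, the Python sketch below performs the element-wise multiply and the reduction as two explicit steps. It models only the arithmetic result. The helper names are hypothetical, and nothing here reflects how the Vex compiler actually emits vector code.

```python
# Scalar reference model of "element-wise multiply, then sum reduction".
# All names are illustrative; this is not Vex or compiler code.

def elementwise_mul(a, b):
    # Lane-wise product, the scalar analogue of `a * b` on fixed-size arrays.
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [x * y for x, y in zip(a, b)]


def reduce_sum(values):
    # Horizontal sum, the scalar analogue of the \+ reduction.
    total = 0.0
    for v in values:
        total += v
    return total


def energy4(x):
    # Mirrors `return \+ (x * x)` from the example above.
    return reduce_sum(elementwise_mul(x, x))


def dot(a, b):
    # Mirrors `return \+ (a * b)` from the dot8 example.
    return reduce_sum(elementwise_mul(a, b))


print(energy4([1.0, 2.0, 3.0, 4.0]))   # 1 + 4 + 9 + 16 = 30.0
print(dot([1.0, 2.0], [3.0, 4.0]))     # 3 + 8 = 11.0
```

A vectorized lowering would do the multiply across all lanes at once and then reduce. The two helper functions correspond to those two stages.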

This is the part of the SIMD story backed by crates/vex-compiler/src/codegen_hir/expr/simd_small.

Tensor and graph path

When code is naturally tensor-oriented, prefer graph fn plus concrete tensor types such as Tensor<f32>.

That path lowers through SIR and then routes to CPU SIMD, GPU, or other backends based on shape and backend support. It is more powerful than the small-array fast path, but it is also where current limitations show up first.

Choosing the right abstraction

Use fixed arrays when:

  • the shape is small and known at compile time
  • the code is mostly arithmetic, comparison, or reduction
  • you want predictable LLVM-level vectorization

Use Tensor<T> and graph fn when:

  • the data shape is dynamic
  • the computation is already graph-like
  • you want SIR routing and backend dispatch

Verification

When SIMD behavior matters, inspect the generated LLVM or backend output instead of assuming fusion happened:

bash
vex compile --emit-llvm file.vx

Good signs for the small-array path are direct vector operations and vector reductions. For graph and tensor code, the more relevant check is whether the code lowered into the expected SIR/backend route.
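One way to check for those signs is to scan the emitted LLVM IR text for vector types and vector-reduction intrinsics. The Python sketch below does that with two regular expressions. The function name is hypothetical, the sample IR is hand-written for illustration rather than actual Vex output, and how you capture the output of `--emit-llvm` (file or stdout) is left open because the text does not specify it.

```python
import re

# LLVM vector types look like <4 x float> or <8 x i32>; arrays use [ ] and are ignored.
VECTOR_TYPE = re.compile(r"<\d+ x [a-z0-9]+>")
# Horizontal reductions appear as calls to llvm.vector.reduce.* intrinsics.
REDUCTION = re.compile(r"llvm\.vector\.reduce\.[\w.]+")


def summarize_vector_ir(ir_text: str) -> dict:
    # Collect the distinct vector types and reduction intrinsics found in the IR.
    return {
        "vector_types": sorted(set(VECTOR_TYPE.findall(ir_text))),
        "reductions": sorted(set(REDUCTION.findall(ir_text))),
    }


# Hand-written sample resembling what a vectorized energy4 might lower to.
SAMPLE_IR = """
define float @energy4(<4 x float> %x) {
  %sq = fmul <4 x float> %x, %x
  %sum = call float @llvm.vector.reduce.fadd.v4f32(float -0.0, <4 x float> %sq)
  ret float %sum
}
"""

summary = summarize_vector_ir(SAMPLE_IR)
print(summary["vector_types"])  # ['<4 x float>']
print(summary["reductions"])    # ['llvm.vector.reduce.fadd.v4f32']
```

An empty result for a function you expected to vectorize is a hint to look closer, not proof of scalar code, since later optimization passes may still vectorize loops.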

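Where a reduction operator applies, prefer it to a hand-written accumulation loop. The loop below computes the same sum as \+ data:

vex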
// ❌ Avoid: Manual loops when operators work
fn sum_bad(data: [f64]): f64 {
    let! total = 0.0
    for x in data {
        total = total + x // Unnecessary!
    }
    return total
}

SIMD Operator Reference

Operator   Description         Example
\+         Sum reduction       \+ [1,2,3] → 6
\*         Product reduction   \* [1,2,3] → 6
\<         Min reduction       \< [3,1,2] → 1
\>         Max reduction       \> [3,1,2] → 3
\&         AND reduction       \& [t,t,f] → false
|          OR reduction        | [t,f,f] → true
<?         Element-wise min    [1,5] <? [3,2] → [1,2]
>?         Element-wise max    [1,5] >? [3,2] → [3,5]
><         Clamp               [1,5] >< (2,4) → [2,4]
+|         Saturating add      250u8 +| 10u8 → 255
-|         Saturating sub      5u8 -| 10u8 → 0
<<<        Rotate left         x <<< 1
>>>        Rotate right        x >>> 1
<*>        Matrix multiply     a <*> b
'          Transpose           matrix'
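The scalar meaning of most rows can be pinned down with a small Python reference model, checked against the example results in the table. This is a sketch of the arithmetic only: the function names are hypothetical, the 32-bit width used for the rotates is an assumption (the table does not state one), and matrix multiply and transpose are left out.

```python
import math

# Reference semantics for the table rows above; illustrative names, not Vex APIs.

def reduce_sum(xs):  return sum(xs)          # \+
def reduce_prod(xs): return math.prod(xs)    # \*
def reduce_min(xs):  return min(xs)          # \<
def reduce_max(xs):  return max(xs)          # \>
def reduce_and(xs):  return all(xs)          # \&
def reduce_or(xs):   return any(xs)          # OR reduction


def elementwise_min(a, b):                   # <?
    return [min(x, y) for x, y in zip(a, b)]


def elementwise_max(a, b):                   # >?
    return [max(x, y) for x, y in zip(a, b)]


def clamp(xs, lo, hi):                       # ><
    # Each element is limited to the closed range [lo, hi].
    return [max(lo, min(hi, x)) for x in xs]


def saturating_add_u8(a, b):                 # +|
    return min(a + b, 255)                   # u8 tops out at 255


def saturating_sub_u8(a, b):                 # -|
    return max(a - b, 0)                     # u8 bottoms out at 0


def rotate_left(x, n, bits=32):              # <<< (32-bit width is an assumption)
    mask = (1 << bits) - 1
    n %= bits
    x &= mask
    return ((x << n) | (x >> (bits - n))) & mask


def rotate_right(x, n, bits=32):             # >>> (32-bit width is an assumption)
    return rotate_left(x, bits - (n % bits), bits)


print(reduce_sum([1, 2, 3]), reduce_prod([1, 2, 3]))     # 6 6
print(elementwise_min([1, 5], [3, 2]))                   # [1, 2]
print(clamp([1, 5], 2, 4))                               # [2, 4]
print(saturating_add_u8(250, 10), saturating_sub_u8(5, 10))  # 255 0
print(hex(rotate_left(0x80000001, 1)))                   # 0x3
```

The saturating rows are the ones most worth checking by hand, since plain wrapping arithmetic on u8 would give 4 and 251 instead of 255 and 0.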

Released under the MIT License.