SIMD and Auto-Vectorization

Vex has two related but distinct SIMD paths:

small fixed-size arrays can be lowered directly to LLVM vector instructions
tensor and graph workloads can flow through SIR, which has its own SIMD backend

The important constraint is that SIMD is not a blanket guarantee for every array expression. Some patterns inline cleanly, some lower to loop kernels, and some belong on the graph fn and Tensor<T> path instead.

The practical model

Use ordinary arrays when you want straightforward element-wise code:

vex

fn add4(a: [i32; 4], b: [i32; 4]): [i32; 4] {
    return a + b
}

fn dot8(a: [f32; 8], b: [f32; 8]): f32 {
    return \+ (a * b)
}

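For small static arrays, the compiler can lower these operations directly to vector IR. The SIMD backend currently inlines an array only when it fits the 64-byte inline threshold and its element count is a power of two. Both examples above qualify: [i32; 4] occupies 16 bytes and [f32; 8] occupies 32 bytes.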
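The eligibility rule can be modeled in a few lines. The sketch below is written in Rust purely for illustration; the name `qualifies_for_inline_simd` is hypothetical, and treating the 64-byte threshold as inclusive is an assumption rather than something confirmed by the compiler source.

```rust
/// Hypothetical model of the small-array inline rule described above.
/// Assumption: the 64-byte threshold is inclusive.
const INLINE_THRESHOLD_BYTES: usize = 64;

fn qualifies_for_inline_simd(elem_size_bytes: usize, elem_count: usize) -> bool {
    // is_power_of_two() is false for 0, so empty arrays never qualify.
    if !elem_count.is_power_of_two() {
        return false;
    }
    // checked_mul guards against overflow for absurdly large inputs.
    match elem_size_bytes.checked_mul(elem_count) {
        Some(total_bytes) => total_bytes <= INLINE_THRESHOLD_BYTES,
        None => false,
    }
}
```

Under this model, `qualifies_for_inline_simd(4, 4)` and `qualifies_for_inline_simd(4, 8)` are true (the `add4` and `dot8` shapes), while a six-element array or a 128-byte array is rejected.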

When data is dynamic, graph-shaped, or needs tensor-specific routing, use the graph fn and Tensor<T> path instead:

vex

graph fn normalize(x: Tensor<f32>): Tensor<f32> {
    let mag = Math.sqrt(x * x)
    return x / mag
}

What is solid today

The currently well-grounded pieces are:

inline SIMD for small static arrays in compiler codegen
vector comparisons and reductions for qualifying fixed-size arrays
SIR SIMD backends for tensor and graph workloads
mask operations in SIR such as any, all, countBits, firstSet, and select
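For readers unfamiliar with mask operations, here is a scalar reference model in Rust. The operation names come from the list above, but the semantics shown are assumptions based on how such operations conventionally behave: in particular, returning `Option<usize>` from `first_set` and picking from the first operand where the mask is true in `select` are guesses, not documented SIR behavior.

```rust
/// Scalar reference model for mask operations over N lanes.
/// All semantics here are assumed, not taken from the SIR source.

/// True if at least one lane is set.
fn any<const N: usize>(mask: [bool; N]) -> bool {
    mask.iter().any(|&m| m)
}

/// True if every lane is set.
fn all<const N: usize>(mask: [bool; N]) -> bool {
    mask.iter().all(|&m| m)
}

/// Number of set lanes.
fn count_bits<const N: usize>(mask: [bool; N]) -> usize {
    mask.iter().filter(|&&m| m).count()
}

/// Index of the first set lane, or None if no lane is set.
fn first_set<const N: usize>(mask: [bool; N]) -> Option<usize> {
    mask.iter().position(|&m| m)
}

/// Per-lane choice: take a[i] where the mask is set, otherwise b[i].
fn select<T: Copy, const N: usize>(mask: [bool; N], a: [T; N], b: [T; N]) -> [T; N] {
    std::array::from_fn(|i| if mask[i] { a[i] } else { b[i] })
}
```

A real SIMD backend would compute each of these with a handful of vector instructions; the loops here only pin down the expected results.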

What should be read conservatively

Treat these areas as advanced or still in flux:

generic tensor arithmetic such as Tensor<T> * Tensor<T> across arbitrary T
automatic coercion between Span<T> and Tensor<T> at every call boundary
the expectation that every dynamic array expression becomes a single SIMD instruction
the expectation that every matrix or signal-processing operator is equally mature on every backend

Small-array path

For fixed-size arrays, the compiler is most likely to optimize these patterns well:

element-wise arithmetic such as +, -, *, /
comparisons such as ==, !=, <, >
reductions such as \+, \*, \<, \>

Example:

vex

fn energy4(x: [f32; 4]): f32 {
    return \+ (x * x)
}

This is the part of the SIMD story backed by crates/vex-compiler/src/codegen_hir/expr/simd_small.
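To make the expected results concrete, the following Rust sketch spells out what `energy4` and the min/max reductions compute, one lane at a time. It is a scalar reference for checking results, not a description of the code the Vex compiler emits; the helper names `reduce_min` and `reduce_max` are illustrative.

```rust
/// Scalar equivalent of the Vex `energy4` example: sum of squares.
fn energy4(x: [f32; 4]) -> f32 {
    let squared = x.map(|v| v * v); // element-wise multiply
    squared.iter().sum()            // sum reduction
}

/// Scalar equivalent of a min reduction over four lanes.
fn reduce_min(x: [i32; 4]) -> i32 {
    x.iter().copied().fold(i32::MAX, i32::min)
}

/// Scalar equivalent of a max reduction over four lanes.
fn reduce_max(x: [i32; 4]) -> i32 {
    x.iter().copied().fold(i32::MIN, i32::max)
}
```

For example, `energy4([1.0, 2.0, 3.0, 4.0])` is `30.0`, since 1 + 4 + 9 + 16 = 30.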

Tensor and graph path

When code is naturally tensor-oriented, prefer graph fn plus concrete tensor types such as Tensor<f32>.

That path lowers through SIR and then routes to CPU SIMD, GPU, or other backends based on shape and backend support. It is more powerful than the small-array fast path, but it is also where current limitations show up first.

Choosing the right abstraction

Use fixed arrays when:

the shape is small and known at compile time
the code is mostly arithmetic, comparison, or reduction
you want predictable LLVM-level vectorization

Use Tensor<T> and graph fn when:

the data shape is dynamic
the computation is already graph-like
you want SIR routing and backend dispatch

Verification

When SIMD behavior matters, inspect the generated LLVM or backend output instead of assuming fusion happened:

bash

vex compile --emit-llvm file.vx

Good signs for the small-array path are direct vector operations and vector reductions. For graph and tensor code, the more relevant check is whether the code lowered into the expected SIR/backend route.

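Also prefer the reduction operators over hand-written accumulation loops where they apply:

vex

// ❌ Avoid: a manual loop where a reduction operator already does the job
fn sum_bad(data: [f64]): f64 {
    let! total = 0.0
    for x in data {
        total = total + x // Unnecessary: `\+ data` expresses the same sum
    }
    return total
}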

SIMD Operator Reference

Operator	Description	Example
`\+`	Sum reduction	`\+ [1,2,3]` → `6`
`\*`	Product reduction	`\* [1,2,3]` → `6`
`\<`	Min reduction	`\< [3,1,2]` → `1`
`\>`	Max reduction	`\> [3,1,2]` → `3`
`\&`	AND reduction	`\& [t,t,f]` → `false`
`\|`	OR reduction	`\| [t,f,f]` → `true`
`<?`	Element-wise min	`[1,5] <? [3,2]` → `[1,2]`
`>?`	Element-wise max	`[1,5] >? [3,2]` → `[3,5]`
`><`	Clamp	`[1,5] >< (2,4)` → `[2,4]`
`+\|`	Saturating add	`250u8 +\| 10u8` → `255`
`-\|`	Saturating sub	`5u8 -\| 10u8` → `0`
`<<<`	Rotate left	`x <<< 1`
`>>>`	Rotate right	`x >>> 1`
`<*>`	Matrix multiply	`a <*> b`
`'`	Transpose	`matrix'`
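The element-wise, clamp, saturating, and rotate rows of the table have direct counterparts in the Rust standard library, which makes Rust a convenient way to double-check the expected values. The sketch below is only a semantic cross-check: the function names are illustrative, and the 32-bit width used for rotation is an assumption, since the table does not state the operand type of `x`.

```rust
/// Element-wise minimum of two 2-lane arrays.
fn elementwise_min(a: [i32; 2], b: [i32; 2]) -> [i32; 2] {
    std::array::from_fn(|i| a[i].min(b[i]))
}

/// Element-wise maximum of two 2-lane arrays.
fn elementwise_max(a: [i32; 2], b: [i32; 2]) -> [i32; 2] {
    std::array::from_fn(|i| a[i].max(b[i]))
}

/// Clamp every lane into [lo, hi]. i32::clamp panics if lo > hi.
fn clamp_lanes(x: [i32; 2], lo: i32, hi: i32) -> [i32; 2] {
    x.map(|v| v.clamp(lo, hi))
}

/// Saturating arithmetic sticks at the type's bounds instead of wrapping.
fn saturating_add_u8(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}

fn saturating_sub_u8(a: u8, b: u8) -> u8 {
    a.saturating_sub(b)
}

/// Rotation moves bits shifted out of one end back in at the other.
/// Assumption: a 32-bit operand.
fn rotate_left_u32(x: u32, n: u32) -> u32 {
    x.rotate_left(n)
}

fn rotate_right_u32(x: u32, n: u32) -> u32 {
    x.rotate_right(n)
}
```

With the table's inputs, `elementwise_min([1, 5], [3, 2])` gives `[1, 2]`, `clamp_lanes([1, 5], 2, 4)` gives `[2, 4]`, and `saturating_add_u8(250, 10)` gives `255`, matching the listed results.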

Next Steps

GPU Programming - Massively parallel compute
FFI - Integrating with native libraries
Memory Management - Efficient data handling
