Skip to content

Graph Functions

Graph functions compile entire function bodies as optimized compute graphs. The compiler automatically fuses operations, eliminates intermediate allocations, and dispatches to the optimal hardware — GPU or SIMD.

Quick Start

vex
// Add the `graph` keyword before your function name
graph fn dot(a: [f32; N], b: [f32; N]): f32 {
    return <+ (a * b);
}

fn main(): i32 {
    let x = [2.0f32; 100000];
    let y = [3.0f32; 100000];
    let result = dot(x, y);  // → GPU dispatch on Apple Silicon
    $println("dot = {}", result);
    return 0;
}

What happens behind the scenes:

  1. Compiler parses the entire graph body into a computation graph fn (SIR DAG)
  2. Fusion pass merges a * b + <+ into a single fused operation
  3. Dispatch policy checks data size and hardware:
    • 100K floats on Apple M2 → Metal GPU compute kernel
    • 100 floats → SIMD vectorized loop (AVX-512 / NEON)
  4. Zero intermediate arrays allocated

Why Graph Functions?

Without graph

vex
fn slow_normalize(x: [f32; N]): [f32; N] {
    let mean = <+ x / N;              // Loop 1 + temp scalar
    let centered = x - mean;           // Loop 2 + temp array [N]
    let sq = centered * centered;      // Loop 3 + temp array [N]
    let variance = <+ sq / N;         // Loop 4 + temp scalar
    return centered / sqrt(variance);  // Loop 5 + temp array [N]
}
// 5 loops, 3 temporary arrays, 5× cache misses

With graph

vex
graph fn normalize(x: [f32; N]): [f32; N] {
    let mean = <+ x / N;
    let centered = x - mean;
    let variance = <+ (centered * centered) / N;
    return centered / sqrt(variance);
}
// GPU: 2 kernels (fused reduce + fused elementwise), 0 temp arrays
// SIMD: 2 vectorized loops, 0 temp arrays, register-only intermediates

Supported Operations

Graph functions support all Vex array and tensor operations:

Array Operations

vex
graph fn example(a: [f32; N], b: [f32; N]): [f32; N] {
    let sum = a + b;           // elementwise add
    let prod = a * b;          // elementwise multiply
    let scaled = a * 2.0;      // scalar broadcast
    let total = <+ a;          // reduction (sum)
    let maximum = >?| a;       // reduction (max)
    return sum;
}

Matrix Operations

vex
graph fn matmul_chain(a: [f32; M, K], b: [f32; K, N], c: [f32; N, P]): [f32; M, P] {
    let ab = a <*> b;          // matrix multiply
    return ab <*> c;           // chained matmul
}

Math Functions

vex
graph fn activation(x: [f32; N]): [f32; N] {
    return exp(x) / (1.0 + exp(x));  // sigmoid, all fused
}

Control Flow

vex
graph fn relu(x: [f32; N]): [f32; N] {
    if x > 0.0 {
        return x;
    } else {
        return 0.0;
    }
}

graph fn transformer(x: [f32; N], layers: [[f32; N, N]; 12]): [f32; N] {
    let! out = x;
    for i in 0..12 {       // compile-time unrolled
        out = out <*> layers[i];
    }
    return out;
}

Nested Graph Functions

Graph functions can call other graph fn functions. The compiler inlines the callee's computation graph fn into the caller's, enabling cross-function fusion:

vex
graph fn softmax(x: [f32; N]): [f32; N] {
    let max_val = >?| x;
    let shifted = x - max_val;
    let exp_vals = exp(shifted);
    let sum_val = <+ exp_vals;
    return exp_vals / sum_val;
}

graph fn attention(q: [f32; N], k: [f32; N], v: [f32; N]): [f32; N] {
    let scores = q <*> k.T / sqrt(N);
    let weights = softmax(scores);  // ← inlined and fused!
    return weights <*> v;
}

Hardware Dispatch

The compiler automatically selects the best execution target:

Data SizeTargetWhy
< 2,048 elementsScalarNot worth vectorizing
2,048 – 50,000SIMD (AVX-512 / NEON)CPU cache-friendly
≥ 50,000 (Apple UMA)GPU (Metal)Zero-copy shared memory
≥ 1,000,000 (discrete)GPU (CUDA/ROCm)Amortizes PCIe transfer

Override with environment variables:

bash
VEX_NO_GPU=1 vex run app.vx        # Force SIMD
VEX_FORCE_GPU=1 vex run app.vx     # Force GPU

Analytical Fusion & DAG Compilation Pipeline

To eliminate intermediate allocations and memory bandwidth overhead, Vex compiles graph fn blocks using a dedicated Silicon IR (SIR) optimization pass.

The Lowering Flow

When compiling a data-parallel function, the compiler takes the following steps:

  1. HIR Generation: The source code is parsed into a High-level Intermediate Representation (HIR) tree.
  2. SIR Lowering: The compiler walks the HIR and lowers operations into a Silicon IR Directed Acyclic Graph (SIR DAG). Nodes represent parallel primitives (e.g. Input, Map, Reduce, MatMul).
  3. Fusion Analysis (optimize_sir): An optimization pass traces the DAG to find consecutive element-wise operations (such as Map and Reduce) that share identical shape constraints.
  4. Kernel Rebuilding: The compiler coalesces the matching nodes into a single FusedKernel or FusedReduce node.
Vex Source: <+ (a * b + c)

   ▼ [Lowering]
 HIR: Reduce(Sum, Binary(Add, Binary(Mul, a, b), c))

   ▼ [Silicon IR Representation]
 SIR Nodes: [Input(a), Input(b), Input(c)] -> Map(Mul) -> Map(Add) -> Reduce(Sum)

   ▼ [optimize_sir Pass]
 SIR Fused Nodes: FusedReduce(kernel = FusedKernel { t0 = a[i]*b[i]; t1 = t0+c[i]; }, op = Sum)

The FusedKernel Structure

A fused kernel encapsulates the entire chain of execution in a single struct:

  • Inputs: Reference pointers to external source tensors (e.g., a, b, c).
  • Ops Chain: A contiguous list of operations mapped as instructions (e.g., t0 = Mul(Input0, Input1), t1 = Add(t0, Input2)).
  • Target Type / Shape: Dimension rules that dictate iteration bounds.

This allows the backends to emit a single unified loop block (such as an LLVM vector loop on CPU or an MSL kernel on GPU), passing intermediates in register arrays instead of allocating heap memory.


Inline Fusion vs Graph Functions

FeatureInline Fusion <+ (a*b)Graph Function
ScopeSingle expressionEntire function body
FusionExpression-levelCross-statement
Control flowNoif/else, for loops
NestingNoGraph calls graph fn
Use caseQuick one-linersComplex algorithms

Both work together: inline fusion expressions inside graph fn functions get fused into the larger graph.

Debug

bash
VEX_FUSION_DEBUG=1 vex run app.vx

Output:

[GRAPH FN] dot: 3 SIR nodes → fused to 1 kernel
[GRAPH FN] Dispatch: SIMD (AVX-512, 8 lanes, unroll=2)
bash
VEX_GPU_DEBUG=1 vex run app.vx

Output:

[GRAPH FN] attention: 7 SIR nodes → 3 kernels (2 matmul + 1 fused)
[GRAPH FN] Dispatch: GPU Metal (Apple M2 Max)
[Metal] Kernel 1: matmul (256×256) → buffer_0
[Metal] Kernel 2: fused softmax → buffer_1
[Metal] Kernel 3: matmul → readback

Released under the MIT License.