unicode

The unicode module provides extensive tooling for querying Unicode character properties, converting case, and classifying tokens. It powers Vex's core compiler features (like lexing) and offers high-performance Unicode validation for applications.

Performance Story

Unlike many standard libraries that depend on massive, multi-megabyte C libraries (like ICU) for Unicode operations, Vex implements its own highly compressed, generated tables directly in Vex code. Operations use binary search to quickly traverse these tables, mapping characters to their properties or folded cases in a matter of nanoseconds.

When dealing with ASCII-range characters, unicode uses a zero-table fast path, achieving completely free (0ns/op) bounds checks via compiler inlining.

Metrics:

ASCII Letter/Number Check: 0.00 ns/op (Inlined branch)
Unicode Alphabetic Query: 1,216 ns/op (822k ops/s)
ASCII toUpper/toLower: 0.00 ns/op
Unicode Case Folding: 1,534 ns/op (650k ops/s)

Usage

You can import properties directly from unicode:

rust

import { 
    isAlphabetic, 
    isDigit, 
    isWhitespace, 
    isLowercase,
    isUppercase,
    category,
    toLower,
    toUpper
} from "unicode/char";

Character Categories

Query the exact General_Category using the category function, which maps back to the CAT_* constants.

rust

import { category, CAT_LU, CAT_LL } from "unicode";

let cp = 'A';
if category(cp) == CAT_LU {
    println("Uppercase Letter!");
}

Available category categories follow the Unicode Standard:

Letter: CAT_LU (Uppercase), CAT_LL (Lowercase), CAT_LT (Titlecase), CAT_LM (Modifier), CAT_LO (Other)
Number: CAT_ND (Decimal), CAT_NL (Letter), CAT_NO (Other)
Punctuation: CAT_PC, CAT_PD, CAT_PE, CAT_PF, CAT_PI, CAT_PO, CAT_PS
Symbol: CAT_SM (Math), CAT_SC (Currency), CAT_SK (Modifier), CAT_SO (Other)
Whitespace: CAT_ZS (Space), CAT_ZL (Line), CAT_ZP (Paragraph)
Control: CAT_CC (Control), CAT_CF (Format), CAT_CN (Unassigned), CAT_CO (Private)

Validation Functions

Vex provides simple boolean returns for common subsets of characters:

rust

let smiley = '😀';
isAlphabetic(smiley);  // false
isUppercase('H');      // true
isLowercase('h');      // true
isDigit('5');          // true
isWhitespace('\n');    // true
isPunctuation(',');    // true

Case Mapping

Use toLower and toUpper to do Simple Case Folding over a single char Code Point.

rust

let folded = toLower('Ü');
let upper = toUpper('ñ');

For full strings or byte buffers, look at string where the vectorizer speeds up bulk ASCII manipulation.

unicode ​

Performance Story ​

Usage ​

Character Categories ​

Validation Functions ​

Case Mapping ​

unicode

Performance Story

Usage

Character Categories

Validation Functions

Case Mapping