Helix MODELS & PATHS

One verified stack. Pick your model, pick your trust boundary.

Every path below runs through the same from-raw Helix toolchain — a compiler rebuildable from 299 hand-typed bytes, emitting the kernels that execute the model. What differs is the model size, the hardware path, and exactly where the trusted-once boundary sits. Every number on this page is transcribed from a committed fail-closed gate output (docs/HELIX_GPT2_DEMO_RUNBOOK.md + the gate logs it cites); where a number does not exist yet, the cell says not yet measured — never an estimate.

The paths, one by one

GPT-2 124M · GPU ATTESTED MVP
124M params · 12 layers · fp32 · greedy · 8 kovc-emitted PTX kernels
What it proves: the full end-to-end claim — the real, unchanged public model generating on the from-raw stack, parity-checked against an independent oracle, byte-identical across re-runs, bound into a signed attestation. The snappy one: a warm generation takes seconds, so this is the path to run live in a room.
argmax id 262 EXACT · 25/25 ids vs oracle · max-abs logit diff 2.59e-04 (logits ~130) · re-runs byte-identical (gen-ids sha 8a2595cd…) · warm generation: seconds
Trusted-once boundary: hand-auditable from hex0 → kovc → PTX; below PTX, NVIDIA's closed ptxas + GPU driver + the C CUDA-FFI launcher are trusted-once. gates: gpt2_gpu_mvp.sh · gpt2_demo_attest.sh
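The three checks quoted on this card (argmax match, id-for-id match, max-abs logit diff) plus the re-run digest can be sketched as below. This is a minimal illustration, not the code inside gpt2_gpu_mvp.sh: the names `parity_gate`, `ids_digest` and `GateFailure`, the tolerance argument, and the uint32 serialization used for the digest are all assumptions made for the example.

```python
import hashlib
import struct


class GateFailure(Exception):
    """Raised on any mismatch so the gate fails closed."""


def argmax(values):
    # Index of the largest logit; ties resolve to the lowest index.
    return max(range(len(values)), key=lambda i: values[i])


def max_abs_diff(a, b):
    # Largest element-wise absolute difference between two logit vectors.
    if len(a) != len(b):
        raise GateFailure("logit vectors differ in length")
    return max(abs(x - y) for x, y in zip(a, b))


def ids_digest(token_ids):
    # Assumed serialization (hypothetical): each id as little-endian uint32.
    packed = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(packed).hexdigest()


def parity_gate(stack_logits, oracle_logits, stack_ids, oracle_ids, tol):
    # tol is a hypothetical threshold; the page reports diffs, not the bar.
    if argmax(stack_logits) != argmax(oracle_logits):
        raise GateFailure("argmax mismatch")
    if list(stack_ids) != list(oracle_ids):
        raise GateFailure("generated ids differ from oracle")
    diff = max_abs_diff(stack_logits, oracle_logits)
    if not diff <= tol:
        # Anything not provably within tolerance counts as a failure.
        raise GateFailure(f"max-abs logit diff {diff} exceeds {tol}")
    return {
        "argmax": argmax(stack_logits),
        "ids_matched": len(stack_ids),
        "max_abs_diff": diff,
        "ids_sha256": ids_digest(stack_ids),
    }
```

Two runs are byte-identical in this sketch when `ids_digest` returns the same hex string for both; comparing digests rather than raw lists is what lets a short prefix such as 8a2595cd… stand in for the whole output.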
GPT-2-Large 774M · GPU SCALE PROOF
774M params · 36 layers · fp32 · greedy · the SAME 8 kernels, zero new ops
What it proves: the stack generalizes — a 6× bigger model runs through the exact same kernel set with only dimension changes (read from the model's config.json), and still matches the oracle token-for-token.
argmax id 262 · 25/25 ids vs oracle · max-abs logit diff 3.8e-05 · speed: not separately timed (same kernels)
Trusted-once boundary: same as 124M GPU — to PTX, then ptxas/driver trusted-once. gate: gpt2_scale.sh (evidence: scripts/scale_results.txt)
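"Only dimension changes, read from config.json" can be illustrated with a small loader. The sketch assumes the file follows the Hugging Face GPT-2 key names (`n_layer`, `n_embd`, `n_head`, `vocab_size`); the function `read_dims` and the inline sample are illustrative and are not taken from gpt2_scale.sh.

```python
import json

# Assumed key names: the Hugging Face GPT-2 config.json convention.
GPT2_KEYS = {
    "layers": "n_layer",
    "hidden": "n_embd",
    "heads": "n_head",
    "vocab": "vocab_size",
}


def read_dims(config_text):
    """Parse model dimensions, refusing to guess any missing value."""
    cfg = json.loads(config_text)
    missing = [key for key in GPT2_KEYS.values() if key not in cfg]
    if missing:
        raise KeyError(f"config.json is missing: {missing}")
    dims = {name: int(cfg[key]) for name, key in GPT2_KEYS.items()}
    if dims["hidden"] % dims["heads"] != 0:
        raise ValueError("hidden size is not divisible by head count")
    dims["head_dim"] = dims["hidden"] // dims["heads"]
    return dims


# Illustrative sample shaped like a GPT-2-Large config.
sample = '{"n_layer": 36, "n_embd": 1280, "n_head": 20, "vocab_size": 50257}'
print(read_dims(sample))
```

With the dimensions held in data rather than in code, moving from 12 to 36 layers changes loop bounds and buffer sizes but leaves the kernel set alone, which is the property the card claims.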
GPT-2-XL 1.5B · GPU FLAGSHIP · THE CHAT MODEL
1.5B params · 48 layers · fp32 · greedy · same 8 kernels · per-layer weight streaming
What it proves: the measured fp32 ceiling — 1.5 billion parameters on one 8 GB sm_86 box, served live over SSE with the served output certified equal to the offline oracle before exposure. This is the model behind the chat playground. It is intentionally slow: the pitch is trust, not speed.
argmax id 262 · 25/25 ids vs oracle · max-abs logit diff 4.4e-05 · served == offline oracle token-for-token · ≈ 9.8 s/token measured (195.5 s / 20 tok, serve gate)
Trusted-once boundary: same as 124M GPU — to PTX, then ptxas/driver trusted-once. gates: gpt2_scale.sh · helix_serve_gate.sh (scripts/_gate_run.log)
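The per-token figure on this card is plain division of the two serve-gate totals, and "certified equal before exposure" amounts to an exact list comparison that refuses to continue on the first difference. A minimal sketch follows; the function names are hypothetical and the real check lives in helix_serve_gate.sh.

```python
def certify_served(served_ids, oracle_ids):
    """Return True only if the served tokens equal the offline oracle's."""
    served, oracle = list(served_ids), list(oracle_ids)
    if len(served) != len(oracle):
        raise RuntimeError("served and oracle outputs differ in length")
    for position, (s, o) in enumerate(zip(served, oracle)):
        if s != o:
            raise RuntimeError(f"first mismatch at token {position}: {s} != {o}")
    return True


def seconds_per_token(total_seconds, tokens):
    # Wall-clock total divided by the number of generated tokens.
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return total_seconds / tokens


# Totals quoted on this card: 195.5 s for 20 tokens.
print(round(seconds_per_token(195.5, 20), 1))  # prints 9.8
```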
GPT-2 124M · CPU, no ptxas PUREST TRUST
124M params · 12 layers · fp32 · greedy · all arithmetic in kovc-compiled-from-raw Helix
What it proves: the deepest trust claim — the entire forward pass with no GPU vendor boundary at all: zero trusted arithmetic above the 299-byte seed (the shared host TCB — OS/gcc/libc/CPU — is disclosed, not hidden). ~130 s/token, slow by design: this path exists to remove ptxas from the story, not to serve traffic.
argmax id 262 == oracle · max-abs logit diff 2.75e-04 · block-0 hidden max-abs 1.144e-04 · token-for-token measured (full greedy run; too slow to gate routinely) · ≈ 130 s/token
Trusted-once boundary: none on the arithmetic side — no ptxas, no driver, no GPU. The disclosed residual is the shared host TCB (OS / kernel / gcc / libc / CPU + microcode) below the audited seed. gate: gpt2_cpu_parity.sh
SmolLM2-135M · GPU · Llama-arch NEW · VERIFIED (GATED)
135M params · 30 layers per its config (gates read dims from config.json) · RMSNorm · RoPE · SwiGLU · GQA · Apache-2.0 weights
What it proves: the same verified stack extends from 2019's GPT-2 family to the modern Llama architecture family — 3 new kovc-emitted kernels (rmsnorm, rope, silu_mul) + 5 reused, run with the same fail-closed gate machinery and the same independent-oracle parity bar. No change to the compiler: the self-host fixpoint is untouched. This is modern architecture, verifiably executed — not modern capability (135M is a small base model).
Gated results: layer-0 parity max-abs 3.2e-05; full-model last-row logits argmax-exact with max-abs 4.9e-05 over 49,152 logits; 20-token greedy generation token-for-token identical (25/25) to an independent numpy oracle that reads the original safetensors; corrupted-weights negative control correctly failed. Also served via the model switcher (token-for-token over HTTP). Speed is not yet measured (left honest — no benchmark was taken). Fail-closed gate: scripts/llama_model_gate.sh (G-L1/G-L2).
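The corrupted-weights negative control mentioned above guards against a parity check that would pass no matter what it is given. The idea can be sketched with a toy stand-in for the forward pass; everything here (`forward`, `corrupt`, `gate_with_negative_control`, the tolerance) is hypothetical and far simpler than scripts/llama_model_gate.sh.

```python
import random


class GateFailure(Exception):
    """Raised when either the positive check or the negative control fails."""


def forward(weights, x):
    # Toy stand-in for a model: a single dot product.
    return [sum(w * xi for w, xi in zip(weights, x))]


def parity_ok(candidate, reference, tol):
    return all(abs(c - r) <= tol for c, r in zip(candidate, reference))


def corrupt(weights, seed=0, delta=1.0):
    # Perturb one weight, chosen reproducibly from the seed.
    rng = random.Random(seed)
    bad = list(weights)
    bad[rng.randrange(len(bad))] += delta
    return bad


def gate_with_negative_control(weights, x, oracle_out, tol):
    # Positive check: the real weights must match the oracle.
    if not parity_ok(forward(weights, x), oracle_out, tol):
        raise GateFailure("parity check failed on the real weights")
    # Negative control: corrupted weights must NOT match the oracle.
    if parity_ok(forward(corrupt(weights), x), oracle_out, tol):
        raise GateFailure("negative control passed: check cannot see bad weights")
    return "PASS"
```

The second branch is the important one: a gate that reports success on corrupted weights has shown that its comparison is not sensitive to the thing it is supposed to protect.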
Trusted-once boundary: same as the GPU paths — verified to PTX, then ptxas/driver trusted-once; fp32. plan + results: docs/HELIX_LLAMA_PLAN.md
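For readers unfamiliar with the three new operations named on this card, here is a plain-Python reference of what rmsnorm, rope and silu_mul compute, in the spirit of an independent oracle. It is a sketch under stated assumptions: the `eps` and `theta` defaults are placeholders (the real values come from the model's config.json), the rotate-half pairing follows the common Llama-style convention, and none of this is the kovc-emitted kernel code.

```python
import math


def rmsnorm(x, weight, eps=1e-5):
    # Scale x by the reciprocal of its root-mean-square, then by a learned weight.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def _sigmoid(g):
    # Numerically stable logistic function (avoids overflow for large |g|).
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)


def silu_mul(gate, up):
    # SwiGLU inner step: silu(gate) * up, element-wise.
    return [g * _sigmoid(g) * u for g, u in zip(gate, up)]


def rope(vec, pos, theta=10000.0):
    # Rotary position embedding for one head vector of even length.
    # Assumed convention: element i is paired with element i + d/2.
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out
```

Two properties make these easy to sanity-check by hand: `rope` at position 0 returns its input unchanged and preserves vector length at every position, and `rmsnorm` with unit weights produces a vector whose mean square is 1 (up to `eps`).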

Measured, side by side

Same columns as the proof page's table, extended with the in-progress row. Every green pill cites a committed fail-closed gate; amber means the gates are still running.

| Model · path | Params · layers | Parity vs independent oracle | Max-abs logit diff | Measured speed | Trusted-once boundary |
|---|---|---|---|---|---|
| GPT-2 124M · GPU | 124M · 12 | argmax 262 EXACT; 25/25 ids | 2.59e-04 (logits ~130) | seconds (warm); gate: gpt2_gpu_mvp.sh | hand-auditable to PTX; ptxas/driver trusted-once |
| GPT-2-Large 774M · GPU | 774M · 36 | argmax 262; 25/25 ids | 3.8e-05 | not separately timed (same kernels); gate: gpt2_scale.sh | same as 124M GPU; zero new ops at scale |
| GPT-2-XL 1.5B · GPU (the chat model) | 1.5B · 48 | argmax 262; 25/25 ids; served == offline | 4.4e-05 | ≈ 9.8 s/token (195.5 s / 20 tok, serve gate) | same as 124M GPU; fits the 8 GB sm_86 box at fp32 |
| GPT-2 124M · CPU no-ptxas | 124M · 12 | argmax 262 == oracle; token-for-token (measured) | 2.75e-04 (block-0 hidden: 1.144e-04) | ≈ 130 s/token, slow by design | no GPU boundary at all; zero trusted arithmetic above the seed (shared host TCB disclosed) |
| SmolLM2-135M · GPU (Llama-arch, NEW) | 135M · 30 (per config) | token-for-token 25/25; argmax-exact | 4.9e-05 over 49,152 logits; 3.2e-05 (layer-0 parity) | not yet measured | same as the GPU paths; verified to PTX, then ptxas/driver trusted-once |

Sources: scripts/gpt2_gpu_mvp.sh, scripts/gpt2_scale.sh (+ committed PRIMARY-mode evidence in scripts/scale_results.txt), scripts/gpt2_cpu_parity.sh, scripts/helix_serve_gate.sh (scripts/_gate_run.log), scripts/llama_ops_parity.sh (SmolLM2 G-L0; full-model gates in progress — docs/HELIX_LLAMA_PLAN.md). fp32 everywhere; greedy decoding; the oracle is an independent numpy fp32 implementation of each model's spec. No cuBLAS/vendor comparisons are claimed anywhere — that is not what this stack is for.

"Trusted once" — what each boundary actually means

Verification is only honest if you can say exactly where it stops. These are the stops, in plain language (the full residuals list lives on the proof page).

GPU paths TO PTX

Everything from the 299 hand-typed bytes up to the PTX text of the kernels is rebuildable and hand-auditable (hex0 → seed → kovc → PTX). Below PTX, NVIDIA's closed ptxas assembler, the GPU driver, and the C CUDA-FFI launcher are trusted-once: audited at the interface, not rebuilt from source. That boundary is stated on every page — "complete to PTX, not to SASS."

CPU path NO VENDOR BOUNDARY

The no-ptxas path removes the GPU vendor entirely: every arithmetic operation in the forward pass runs in code compiled by the from-raw toolchain. Zero trusted arithmetic above the seed. The price is honesty's favorite number on this site: ≈ 130 s/token.

Both paths SHARED HOST TCB

Below the audited seed, the usual platform is still trusted: OS / kernel / gcc (for the bootstrap harness) / libc / binutils / loader / CPU + microcode / RAM. Disclosed, not hidden — the claim has always been about the compute stack above the seed. The independent oracle also shares each model's spec (not its code): it catches implementation bugs, not a shared misunderstanding of the architecture.