verifiable execution · bring your weights

Run the model you know. Verify every byte beneath it.

Helix is a verifiable execution layer for AI. Bring your weights: every layer and every kernel beneath them traces back to 299 hand-typed bytes, and the output is matched token-for-token against an independent reference. Audit instead of trust. Verifiability comes first, and performance will improve as Helix develops.

The record

What ran, and what matched.

Four real, unchanged public models — base completion models, not assistants — ran on kovc-emitted PTX kernels and were compared against an independent fp32 numpy reference.

models verified: 124M → 1.5B
greedy continuation: 25/25 ids, all four
byte-identical reruns: gen-ids sha256 8a2595cd…
hardware: 1× sm_86 · RTX 3070-class · 8 GB
precision: fp32 only
live GPT-2-XL: ≈ 10 s/token today, faster as it matures

Model | Architecture | Last-token argmax | Max-abs logit diff | Greedy continuation | PTX kernels
----- | ------------ | ----------------- | ------------------ | ------------------- | -----------
GPT-2 124M | GPT-2 · 2019 base completion model | id 262, exact | 2.59e-04 on logits of magnitude ~130 | token-for-token · 25/25 ids | 8
GPT-2-Large 774M | GPT-2 · 36 layers | id 262, exact | 3.8e-05 | token-for-token · 25/25 ids | same 8, zero new
GPT-2-XL 1.5B | GPT-2 · 48 layers | id 262, exact | 4.4e-05 | token-for-token · 25/25 ids | same 8, zero new
SmolLM2-135M | Llama arch · 2024 · GQA + RoPE + SwiGLU + RMSNorm · 30 layers | id 260, exact | 4.9e-05 over 49,152 logits | token-for-token · 25/25 ids | 3 new + 5 reused = 8
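
To put the largest figure in the table in context, the short calculation below expresses the GPT-2 124M max-abs difference (2.59e-04 on logits of magnitude ~130) as a relative error and as a count of fp32 spacing steps. It is only a reader's sanity check: the helper name `fp32_ulp` is made up for this sketch and is not part of Helix.

```python
import math

def fp32_ulp(x: float) -> float:
    """Spacing between adjacent fp32 values near x (normal range, x != 0)."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - 23)  # fp32 stores 23 explicit mantissa bits

max_abs_diff = 2.59e-04  # GPT-2 124M row in the table
logit_scale = 130.0      # "magnitude ~130" from the same row

relative = max_abs_diff / logit_scale
ulps = max_abs_diff / fp32_ulp(logit_scale)

print(f"relative error: {relative:.2e}")
print(f"fp32 spacing steps at magnitude {logit_scale:.0f}: {ulps:.1f}")
```

Read this way, the reported difference is a small number of last-place fp32 steps, which is the scale one would expect when two implementations sum the same terms in a different order.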

All three GPT-2 sizes run through the exact same 8 kovc-emitted PTX kernels — only the dimensions change, read from config.json. GPT-2-XL at 1.5B parameters fits the 8 GB sm_86 box at fp32 via per-layer weight streaming. Verified complete to PTX, not SASS. The live GPT-2-XL chat ran at ≈ 10 s/token (measured 195.5 s for 20 tokens in the gated serve run) — that is the verifiability flex, not a speed flex. It also won't stay there: throughput optimization and datacenter-class GPUs are next on the roadmap, so Helix will get faster as it develops.
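
The two numbers in this paragraph can be rechecked by hand. The sketch below redoes the arithmetic under stated assumptions: a nominal 1.5e9 parameters (the page's rounded figure), 4 bytes per fp32 weight, and a crude even split across the 48 layers that ignores embeddings. It says nothing about how Helix actually schedules its weight streaming.

```python
BYTES_PER_FP32 = 4

def fp32_weight_gib(params: float) -> float:
    """Size of the weights alone, in GiB, at 4 bytes per parameter."""
    return params * BYTES_PER_FP32 / 2**30

# Measured figures quoted in the text: 195.5 s for 20 tokens.
seconds_per_token = 195.5 / 20

# Assumption: nominal parameter count, rounded as on this page.
xl_weights_gib = fp32_weight_gib(1.5e9)

# Assumption: equal-sized layers; embeddings and activations are ignored.
per_layer_gib = xl_weights_gib / 48

print(f"{seconds_per_token:.3f} s/token")
print(f"GPT-2-XL fp32 weights: {xl_weights_gib:.2f} GiB")
print(f"rough per-layer slice: {per_layer_gib:.3f} GiB")
```

The weights alone take up most of an 8 GB card before any activations or logits are allocated, which is consistent with the page's choice to stream one layer at a time.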

The definition

What “verified” means here.

One precise claim, checked four ways — and quoted with the model's real output, unedited.

An independent reference, not a self-check

The oracle is an independent numpy implementation. It reads the original HuggingFace safetensors and computes its own forward pass — independent of Helix's importer and of the GPU path. It must agree with the Helix run token-for-token. Nothing in the comparison loops Helix's own output back as the standard.
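
The kind of comparison described here can be sketched in a few lines. The function names and the tolerance argument below are illustrative assumptions, not Helix's comparator; the point is only that the reference output is the standard and the candidate is judged against it, never the other way round.

```python
from typing import Sequence

def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    if len(a) != len(b):
        raise ValueError("logit vectors differ in length")
    return max(abs(x - y) for x, y in zip(a, b))

def logits_parity(reference: Sequence[float],
                  candidate: Sequence[float],
                  tol: float) -> bool:
    """Argmax must match exactly; every logit must be within tol."""
    return (argmax(reference) == argmax(candidate)
            and max_abs_diff(reference, candidate) <= tol)

def tokens_match(reference_ids: Sequence[int],
                 candidate_ids: Sequence[int]) -> bool:
    """Token-for-token: same ids, same order, same length."""
    return list(reference_ids) == list(candidate_ids)
```

For example, `logits_parity([1.0, 130.0, -3.5], [1.0001, 129.9999, -3.5], tol=2.59e-04)` holds, while swapping the top two logits in the candidate fails on argmax regardless of tolerance.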

Fail-closed gates

Every parity check is a gate that exits nonzero on mismatch — gates are never faked and never warn-and-continue. The SmolLM2 leg includes a corrupted-weights negative control that correctly failed: the comparator has teeth.
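
A fail-closed gate with a negative control can be pictured roughly as below. This is a toy under loud assumptions: the real negative control corrupts weights, whereas this sketch corrupts a token id as a stand-in, and the ids in `main` are placeholders rather than real model output.

```python
import sys
from typing import Sequence

def gate(reference_ids: Sequence[int], candidate_ids: Sequence[int]) -> int:
    """Return a process exit code: 0 only on an exact match."""
    if list(reference_ids) != list(candidate_ids):
        print("PARITY MISMATCH", file=sys.stderr)
        return 1  # fail closed: no warning-and-continue path exists
    return 0

def negative_control(reference_ids: Sequence[int],
                     candidate_ids: Sequence[int]) -> int:
    """Corrupt the candidate on purpose; succeed only if the gate then fails.

    Expects a non-empty candidate sequence.
    """
    corrupted = list(candidate_ids)
    corrupted[0] ^= 1  # flip the low bit of the first id (stand-in corruption)
    return 0 if gate(reference_ids, corrupted) != 0 else 1

def main() -> int:
    ids = [11, 22, 33]  # placeholder ids, not real model output
    return gate(ids, ids) or negative_control(ids, ids)
```

A script built this way would end with `sys.exit(main())`, so any mismatch, or a negative control that wrongly passes, surfaces as a nonzero exit status to whatever invoked it.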

Byte-identical reruns

The same run, executed twice, produces byte-identical generated token ids — sha256 8a2595cd… for the GPT-2 demo. Reproducibility is asserted, not assumed.
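
One way to assert byte-identical reruns is to hash a fixed serialization of the generated ids and compare digests. The serialization below (little-endian unsigned 32-bit integers) is an assumption made for this sketch; the page does not say how Helix encodes ids before hashing, so this will not reproduce the 8a2595cd… digest.

```python
import hashlib
import struct
from typing import Sequence

def ids_digest(token_ids: Sequence[int]) -> str:
    """SHA-256 over the ids packed as little-endian uint32 (assumed layout)."""
    payload = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(payload).hexdigest()

def assert_byte_identical(run_a: Sequence[int], run_b: Sequence[int]) -> str:
    """Return the shared digest, or abort with a nonzero exit on divergence."""
    digest_a, digest_b = ids_digest(run_a), ids_digest(run_b)
    if digest_a != digest_b:
        raise SystemExit(f"rerun diverged: {digest_a} != {digest_b}")
    return digest_a
```

Hashing the packed bytes rather than a printed list keeps the check independent of formatting, so two runs agree only if every id agrees.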

A signed attestation, bound to the from-raw root

One command, scripts/gpt2_demo_attest.sh, rebuilds the compiler from raw, runs GPT-2 through kovc-emitted kernels against the oracle, re-runs byte-identical, and writes a signed attestation binding the three from-raw anchors — seed 9837db12…, self-host fixpoint 0992dddd…, gcc-DDC K1 84363adb… — to the model run.
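
The idea of binding the anchors to a model run can be illustrated as follows. Everything structural here is an assumption: the record layout, the field names, and the use of HMAC-SHA256 as a stand-in for whatever signing scheme scripts/gpt2_demo_attest.sh actually uses. The anchor values are left truncated exactly as they appear on this page.

```python
import hashlib
import hmac
import json

# Truncated as shown on this page; the full digests are not reproduced here.
ANCHORS = {
    "seed": "9837db12…",
    "self_host_fixpoint": "0992dddd…",
    "gcc_ddc_k1": "84363adb…",
}

def _canonical(record: dict) -> bytes:
    """Stable byte encoding so signer and verifier hash the same payload."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")

def build_attestation(gen_ids_sha256: str, key: bytes) -> dict:
    """Bind the from-raw anchors and the run digest under one tag."""
    record = {"anchors": dict(ANCHORS), "gen_ids_sha256": gen_ids_sha256}
    tag = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return {"record": record, "hmac_sha256": tag}

def verify_attestation(attestation: dict, key: bytes) -> bool:
    """Recompute the tag; any change to anchors or digest invalidates it."""
    expected = hmac.new(key, _canonical(attestation["record"]),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, attestation["hmac_sha256"])
```

Because the anchors and the run digest sit inside the same signed payload, changing either one after the fact makes verification fail, which is the sense of "binding" used above.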

What the models actually said

GPT-2 124M, greedy: “The capital of France is the capital of the French Republic, and the capital of the French Republic is the capital of the French” — grammatical but repetitive. That is the real 2019 model under greedy decoding, not a bug, and quoting it unedited is the point.

GPT-2-XL 1.5B, greedy: “The capital of France is the city of Paris. It is the capital of France and the largest city in France. It is” — same stack, same 8 kernels, bigger model.

The modern-model leg

The architecture of today's open models.

SmolLM2-135M — gated 2026-06-09 — brings the Llama family onto the same verified stack.

GPT-2 proved the stack on a 2019 architecture. SmolLM2-135M proves it on a 2024 one: grouped-query attention (9 query / 3 kv heads), RoPE with theta 1e5, SwiGLU, RMSNorm, 30 layers, tied head, a 49,152-token vocabulary — the same architectural family as today's open models.
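
For readers unfamiliar with grouped-query attention, the 9 query / 3 kv head split means three query heads share each key/value head. The mapping below assumes the common Llama-style layout in which consecutive query heads form a group; it is an illustration, not code taken from Helix.

```python
N_QUERY_HEADS = 9  # from the text
N_KV_HEADS = 3     # from the text

def kv_head_for(query_head: int) -> int:
    """Key/value head that a given query head attends with (assumed layout)."""
    group_size = N_QUERY_HEADS // N_KV_HEADS
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(N_QUERY_HEADS)]
print(mapping)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The practical effect is that keys and values are computed and stored for 3 heads instead of 9, while the query side keeps all 9.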

Its full forward pass runs on the same stack and matches the independent oracle: layer-0 residual max-abs 3.2e-05; full-model logits argmax exact (id 260) with max-abs 4.9e-05 across all 49,152 logits; 20-token greedy continuation token-for-token, 25/25 ids.

It cost three new kernels (rmsnorm, rope, silu_mul) plus five reused, for eight again. And it required no compiler change: the self-host fixpoint 0992dddd… is untouched. The gate, scripts/llama_model_gate.sh, is fail-closed and took 72 seconds on its first run; its corrupted-weights negative control failed exactly as it should.
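
As a rough guide to what the three new kernels compute, here are plain-Python reference versions of the standard operations their names suggest. These are assumptions on several counts: the epsilon value, the half-rotation RoPE convention used by Llama-family models, and reading silu_mul as SiLU(gate) times up are not confirmed by this page, and the code runs in Python doubles rather than fp32.

```python
import math
from typing import List, Sequence

def rmsnorm(x: Sequence[float], weight: Sequence[float],
            eps: float = 1e-5) -> List[float]:
    """Scale x to unit root-mean-square, then apply a per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def silu_mul(gate: Sequence[float], up: Sequence[float]) -> List[float]:
    """SwiGLU inner step: SiLU(gate) * up, element-wise."""
    # sigmoid(g) written as 0.5 * (1 + tanh(g / 2)) to avoid exp overflow
    return [g * 0.5 * (1.0 + math.tanh(0.5 * g)) * u for g, u in zip(gate, up)]

def rope(x: Sequence[float], position: int, theta: float = 1e5) -> List[float]:
    """Rotate pairs (x[i], x[i + d/2]) by a position-dependent angle.

    Expects an even-length head vector.
    """
    dim = len(x)
    half = dim // 2
    out = list(x)
    for i in range(half):
        angle = position * theta ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out
```

Two quick properties are worth checking on any implementation of these: `rope` at position 0 leaves the vector unchanged and at any position preserves its length, and `rmsnorm` with unit weights returns a vector whose root-mean-square is 1 up to the epsilon term.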

The precise claim

Modern architecture, verifiably executed — not modern capability. SmolLM2-135M is a small base completion model; its greedy completions are repetitive base-model English. What's demonstrated is that the architecture family runs, verified, on a stack you can audit to the byte.

Reproduce it yourself

Two tiers. Be precise about which is which.

The trust core reproduces from the repo alone. The model demos additionally need artifacts that are deliberately not in the repo.

reproduce · tier A, then tier B
# Tier A — the trust core. Repo only; no weights, no oracle, no GPU.
git clone https://github.com/Questeria/helix && cd helix
bash scripts/reproduce_trust.sh      # ~1 minute · CPU-only · fail-closed

# Tier B — the model demos. Public weights + an independent oracle required.
bash scripts/gpt2_demo_attest.sh     # from-raw rebuild → GPT-2 vs oracle → signed attestation
bash scripts/llama_model_gate.sh     # SmolLM2-135M gate · first run 72 s · fail-closed
every gate exits nonzero on mismatch
anchors: 9837db12 · 0992dddd · 84363adb

Is anything on this site running live?

No. This website runs nothing and proves nothing by itself — it links to committed replays and gate records of real, captured runs. The replay is clearly labeled a replay.

Why ≈ 10 seconds per token for GPT-2-XL?

By design, for now. fp32 only, a single 8 GB sm_86 GPU, per-layer weight streaming, and kernels built first for auditability. Helix leads with trust, and the measured figure (195.5 s for 20 tokens in the gated serve run) is published as-is — but speed is an active roadmap focus: throughput optimization and datacenter-class hardware are next, and Helix will get faster as it develops.

Why is the 124M output so repetitive?

Because that's the real, unchanged 2019 base completion model under greedy decoding. A flashier output would have required changing the model or the decoding — and then it wouldn't be the model you know.

Does the oracle prove the model is "correct"?

It proves Helix's execution matches an independent implementation of the same spec. The oracle shares the model's spec, so it catches implementation bugs — not a shared misunderstanding. That residual is stated below, unprompted.

Honest residuals

Said before you ask.

The honesty is the pitch. These are the limits of the demo claim, stated unprompted so the claim stays precise.

  1. Complete to PTX, not SASS. Below PTX, the GPU path trusts NVIDIA's closed ptxas, the driver, and the C CUDA-FFI host launcher — trusted-once, and disclosed as such.
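  2. fp32 only. Parity is exact on argmax and on the token sequence, and within measured tolerance elsewhere: max-abs logit differences of 3.8e-05 to 2.59e-04 across the four models, and 3.2e-05 on the SmolLM2 layer-0 residual. No other precision is claimed.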
  3. Single GPU, sm_86. One RTX 3070-class card with 8 GB. Not multi-GPU, not a cluster.
  4. A demonstration at up to 1.5B parameters, not frontier scale. The point today is verifiability, not size or speed — though performance is an active roadmap focus and will improve as Helix develops.
  5. The oracle shares the model's spec. It is an independent implementation, not an independent specification. It catches implementation bugs; it cannot catch a misunderstanding both implementations share.
  6. RoPE cos/sin tables are trusted-once host data. Like the weights themselves, they enter the run as data, not as verified computation.
  7. Never claimed: beating cuBLAS, a “fully verified GPU”, or completeness down to GPU machine code.

All the way down

Audit instead of trust.

The kernels beneath these runs were emitted by a compiler that rebuilds from 299 hand-typed bytes — with one committed command, in about a minute, on your machine.