PREVIEW · MOCK TELEMETRY

Simulated internals for layout preview, not a real GPT-2 run. This page is showing mock telemetry; the live verified backend (C HTTP+SSE worker on an RTX 3070) EXISTS and is gated green by scripts/helix_serve_gate.sh. Start it and open ?source=sse to connect. Layer sweeps, kernel names and the per-layer op order are structurally faithful; only the numbers are mock. The preview animation is time-compressed: live XL really runs at ≈ 10 s/token (~3 min for 20 tokens). By design, the demo is about verifiability, not speed.

Live backend detected — the GPU worker is ready. Switch to LIVE

Helix — the verifiable execution layer

Watch a modern instruction-tuned chat model think on a stack you can rebuild from 299 bytes.

SmolLM2-360M-Instruct is the headline model here — a 2024 Llama-architecture chat model that really follows instructions through its own ChatML template (recorded reply to “What is the capital of France?” → “The capital of France is Paris.”, verified token-for-token 8/8 by the oracle). The older GPT-2-XL base model is switchable too, and the offline preview replays it. Either way the live thing being proven is the verified compute underneath — every layer and kernel comes from the from-raw Helix toolchain.

SmolLM2-360M-Instruct · 360M · 32 layers · 2024 Llama arch
8 kovc-emitted kernels
fp32 · greedy
GPT-2-XL (1.5B · 48 layers) switchable
live pacing is intentionally slow; the pitch is trust, not speed

Conversation: text completion · carry context · SmolLM2-360M-Instruct · fp32 · greedy

Give GPT-2 some text to continue

Pick a seed below or type your own. GPT-2-XL will continue it token-by-token while the 48 transformer layers and kovc kernels light up on the right. It is a base model — expect continuations, not answers.

token shade = p(chosen token): real data from the live logits, “what the model actually considered”; click a token for alternatives

Legend: <25% · 25–45% · 45–70% · 70–90% · ≥90%
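The shading rule can be sketched in a few lines. This is an illustration only, assuming the probability is a plain softmax over one step's logits and that the bucket edges are the ones named in the legend; the function names (softmax, shade_for, top_alternatives) are hypothetical and are not taken from the Helix code.

```python
import math

# Assumed bucket edges, read off the legend: <25%, 25-45%, 45-70%, 70-90%, >=90%.
SHADE_EDGES = [0.25, 0.45, 0.70, 0.90]
SHADE_LABELS = ["<25%", "25-45%", "45-70%", "70-90%", ">=90%"]

def softmax(logits):
    """Turn raw logits into probabilities (max-subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shade_for(logits, chosen):
    """Return (p(chosen token), legend bucket) for one generation step."""
    p = softmax(logits)[chosen]
    bucket = sum(1 for edge in SHADE_EDGES if p >= edge)  # count of edges passed
    return p, SHADE_LABELS[bucket]

def top_alternatives(logits, k=3):
    """The k most probable token ids with their probabilities, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]
```

Under greedy decoding the chosen token is always the first entry of top_alternatives, so the shade shows how decisively it won rather than whether it won.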

Conversation = repeated completion with carried context. Each turn re-sends the conversation so far as one completion prompt — the model itself is stateless between requests: a 2019 base completion model, not an assistant. The live server caps the prompt at ~320 tokens (--max-ctx); when the carried text would blow that budget, the oldest text is cut first and the page says so.
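A minimal sketch of the carry-and-trim behaviour described above, assuming the cut happens at token granularity and that 320 is the whole prompt budget. The real server's cut point and bookkeeping are not shown on this page, and fit_context is a hypothetical name.

```python
def fit_context(turn_tokens, max_ctx=320):
    """Flatten the turns (oldest first) and keep only the newest max_ctx tokens.

    Returns (kept_tokens, truncated); the flag lets the page tell the user
    that the oldest text was cut.
    """
    if max_ctx <= 0:
        raise ValueError("max_ctx must be positive")
    flat = [tok for turn in turn_tokens for tok in turn]
    truncated = len(flat) > max_ctx
    kept = flat[-max_ctx:] if truncated else flat  # drop from the front: oldest first
    return kept, truncated
```

For example, fit_context([[1, 2, 3], [4, 5]], max_ctx=4) keeps [2, 3, 4, 5] and reports that a cut happened. Because the model is stateless, the kept list is the entire prompt for the next request.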

n_tokens

Enter ↵ to run · Shift+Enter for a newline

GPU busy — one generation at a time (single-flight; the server keeps no queue, so the page just waits politely and retries).
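The wait-and-retry behaviour can be sketched as a small client-side loop. Everything here is an assumption for illustration: the Busy exception, the send callable and the retry counts are hypothetical, and the page's actual retry policy is not documented in this text.

```python
import time

class Busy(Exception):
    """Raised by the hypothetical send() when the server reports the GPU is busy."""

def run_when_free(send, prompt, retries=5, delay_s=1.0, sleep=time.sleep):
    """Call send(prompt); on Busy, wait delay_s and try again, up to `retries` attempts.

    The server keeps no queue, so the client owns all of the waiting.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return send(prompt)
        except Busy:
            if attempt == retries - 1:
                raise  # give up: surface the busy state to the caller
            sleep(delay_s)
```

Passing sleep as a parameter keeps the loop testable without real delays; a fixed delay is the simplest choice when only one generation can run at a time.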

What the model is doing idle

token embedding

each word-piece becomes a long list of numbers

layer — of —

the same two-step block, repeated layer after layer

attention

each word looks back at earlier words and decides what matters

MLP

each word is reworked on its own, through a wider scratch space

final norm

one last tidy-up of the numbers

next-token head

score every vocabulary piece — the highest score wins

next token

—

run a prompt and this line explains each step in plain language.

full expert op-stream →

proof & attestation →

Verified stack

Full trust chain →

Where it runs

Hand-auditable to PTX; ptxas / driver trusted-once. Forward-only.

Prompt → tokens idle

Python-free C tokenizer (gpt2_tok) T = 0

awaiting a prompt…

48 transformer layers

— / 48

per-layer ms

kovc-emitted kernels firing

no kernels launched yet

Tokens & throughput

tokens

—

tok/s

0.0

seconds

tokens (id · piece · logit) stream here…

This reply

token step 0 / 0 · layer-passes 0 / 0

Run summary

runs once a completion finishes…

Honest residuals: fp32-only · complete-to-PTX-not-SASS · single GPU (sm_86) · base-model-not-assistant · oracle-shares-spec · never-claimed-AGI. This is a demonstration of verifiable execution — not a claim of model quality, speed records, or full-GPU verification. No live parity verdict appears in this chat.

Full residuals & the real numbers →

You're about to watch a model actually compute

kernel

Why Helix — what the language actually does

GPU kernels are written in Helix itself

Autodiff is a language built-in

The toolchain rebuilds from 299 hand-typed bytes

Self-hosting, proven byte-identical — and Python-free

Models ship only through fail-closed parity gates

Optimised, then re-proven — never trusted on speed

Every gate proves it can fail — the negative control
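The last two ideas in this list, fail-closed parity and a negative control, can be illustrated with a toy gate over token ids. This is a sketch under assumptions, not the logic of scripts/helix_serve_gate.sh: the function names are hypothetical, and token ids are assumed to be integers.

```python
def parity_gate(candidate, oracle):
    """Fail closed: pass only on a complete, exact token-for-token match."""
    if not oracle or len(candidate) != len(oracle):
        return False  # empty or mismatched-length output never passes
    return all(c == o for c, o in zip(candidate, oracle))

def gate_with_negative_control(candidate, oracle):
    """Pass only if a deliberately corrupted run is rejected AND the real run matches."""
    corrupted = list(oracle)
    if corrupted:
        corrupted[0] = corrupted[0] + 1  # flip one token id so the control must fail
    control_ok = not parity_gate(corrupted, oracle)
    return control_ok and parity_gate(candidate, oracle)
```

The negative control guards against a gate that passes everything: if the corrupted copy were accepted, the combined gate would report failure even for a matching candidate.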

Saved chats

Generation settings

Seed (sampled runs)

Stop sequences

Max tokens

System prompt & persona

Power user — assistant prefill & few-shot

Logit lens — layer-by-layer view of one token

Model race — same prompt, every model

Tools — you run them, not the model

Calculator (hand-written parser — no eval)

Dates

Units

Memory pins

Ask the docs (RAG-lite — tiny bundled corpus)
