# WASM Interpreter Transformer
A hand-compiled transformer that executes WebAssembly bytecode via real forward passes. Every weight was set by a compiler — not by gradient descent. No training data, no loss function, no optimizer. Just linear algebra.
## What This Is
This is a complete WASM bytecode interpreter implemented as a transformer neural network. Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time — using real matrix multiplications, real attention, and real feed-forward network computations.
The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU gating. The attention heads retrieve operands and stack values through quadratic key matching. Stack depth is computed internally by a cumulative-sum attention head — no precomputed depth values are needed. The unembedding converts numeric results to token predictions via a quadratic scoring trick.
The transformer supports filesystem I/O (`open`, `read`, `write`, `close` across 4 file descriptors) and structured loops with `br_if` conditional branching, executing loops of up to 256 iterations via a Continuous Trace with Cycling Positional Encoding mechanism.

112/112 test programs pass with 100% accuracy.
## Architecture
| Parameter | Value |
|---|---|
| `d_model` | 100 |
| `n_layers` | 8 |
| `heads_per_layer` | [13, 1, 0, 8, 2, 1, 1, 4] |
| `total_heads` | 30 |
| `d_head` | 2 |
| `d_ffn` | 100 |
| `vocab_size` | 260 (256 byte tokens + 4 special) |
| FFN activation | SwiGLU (ReLU gate) |
| Attention | Hard-max + sum-mode + cross-attention |
| Total parameters | ~316K (all hand-compiled) |
Unlike standard transformers, each layer has a different number of attention heads (0 to 13), tailored to the specific computational role of that layer.
## How It Works

### The 8-Layer Pipeline
- **Layer 0 (13 heads): Opcode fetch.** 11 paired attention heads identify 25 opcodes by matching one-hot flags, plus 1 operand-retrieval head and 1 single-opcode head.
- **Layer 1 (1 head): Stack depth accumulation.** 1 sum-attention head computes cumulative stack depth as a running sum of push/pop deltas.
- **Layer 2 (0 heads): Depth squaring.** An FFN-only layer computes WRITE_DEPTH² for use as a quadratic key in later retrieval heads.
- **Layer 3 (8 heads): Bit retrieval.** 8 hard-max heads extract individual bits from the stack top and stack second for AND/OR operations.
- **Layer 4 (2 heads): Stack top/second retrieval + arithmetic FFN.** 2 heads retrieve the top two stack values; the FFN computes all arithmetic, comparison, and bitwise operations.
- **Layer 5 (1 head): Local variables.** 1 head finds the matching `local.tee`/`local.set`; the FFN gates the retrieved value.
- **Layer 6 (1 head): Linear memory.** 1 head finds the matching `i32.store` by address; the FFN gates the retrieved value.
- **Layer 7 (4 heads): Filesystem.** 4 cross-attention heads (one per file descriptor) retrieve bytes from an external filesystem key-value store by file offset.
### Sum-Attention (Cumulative Sums)
Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the sum-mode head accumulates all past value vectors.
Each instruction's value encodes its stack delta: +1 for pushes (e.g., i32.const), -1 for pops (e.g., i32.add), -2 for fd_write. The cumulative sum of all past deltas gives the current stack depth — exactly what's needed for the quadratic key trick to find the correct stack position.
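As a rough sketch (plain NumPy with a hypothetical four-instruction program, not the actual compiled weights), a sum-mode head can be modeled as causal attention whose weights are all 1 instead of a softmax, so its output is the running sum of the value vectors:

```python
import numpy as np

# Per-instruction stack deltas for a hypothetical program:
# i32.const (+1), i32.const (+1), i32.add (-1), i32.const (+1)
deltas = np.array([+1.0, +1.0, -1.0, +1.0])

# Sum-mode attention: position i attends to every position j <= i
# with weight 1 (no softmax, no hard-max), so the head's output at
# position i is the cumulative sum of all past stack deltas.
mask = np.tril(np.ones((len(deltas), len(deltas))))  # causal all-ones mask
depth = mask @ deltas

print(depth.tolist())  # stack depth after each instruction
```

The resulting depths (1, 2, 1, 2) are exactly what the quadratic key trick needs to address the correct stack slot.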
### Cross-Attention (Filesystem)
Layer 7 uses cross-attention heads that attend over an external key-value store representing file contents rather than the sequence's own tokens. Each of the 4 heads handles one file descriptor, using the current file offset as a query to retrieve the byte at that position. File contents are updated dynamically during execution as fd_write operations modify the filesystem.
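A minimal sketch of one such head (assumed shapes and a hypothetical file, not the compiled weights): the query encodes the current file offset, the external store contributes one key per byte position using the quadratic key construction, and a hard-max over the scores retrieves the byte at that offset:

```python
import numpy as np

file_bytes = np.array([72, 105, 33, 0])  # hypothetical file contents: "Hi!"
offsets = np.arange(len(file_bytes))

# Quadratic keys K_j = (2j, -j^2); query for offset i is Q = (i, 1).
# The score 2ij - j^2 = i^2 - (i - j)^2 peaks exactly at j = i.
K = np.stack([2.0 * offsets, -offsets.astype(float) ** 2], axis=1)

def read_byte(offset: int) -> int:
    q = np.array([float(offset), 1.0])
    scores = K @ q
    j = int(np.argmax(scores))  # hard-max attention: pick the best key
    return int(file_bytes[j])

print(read_byte(1))  # → 105
```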
### SwiGLU Gating
Each neuron has two weight vectors:
- Gate: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
- Value: Reads the computation inputs (stack top, stack second, operand, bits).
`output = max(0, gate · x) × (value · x)`
Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.
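A toy version of one such neuron (illustrative weights and feature layout, not the compiled ones) shows how the gate silences the neuron unless its opcode flag is set:

```python
import numpy as np

def neuron(x, gate_w, value_w):
    # SwiGLU-style gating with a ReLU gate: the neuron emits
    # (value . x) only when (gate . x) is positive.
    return max(0.0, gate_w @ x) * (value_w @ x)

# Hypothetical feature layout: [FETCH_ADD flag, stack top, stack second]
gate_w = np.array([1.0, 0.0, 0.0])   # fires only when i32.add is active
value_w = np.array([0.0, 1.0, 1.0])  # computes top + second

add_state = np.array([1.0, 5.0, 7.0])    # i32.add with 5 and 7 on the stack
other_state = np.array([0.0, 5.0, 7.0])  # any other opcode

print(neuron(add_state, gate_w, value_w))    # → 12.0
print(neuron(other_state, gate_w, value_w))  # → 0.0
```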
### Loop Execution

Loops use a Continuous Trace with Cycling Positional Encoding mechanism:

- `loop` and `end_loop` are structural markers (no-ops in the execution trace)
- `br_if` pops a condition from the stack; if non-zero, execution branches back to the start of the loop body
- Positional encodings cycle using `virtualIP % loopLength`, so the transformer sees correct instruction indices across iterations
- Maximum 256 iterations per loop; nested loops are supported
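The cycling can be sketched as follows (a simplified model assuming a single non-nested loop whose body occupies instruction indices `loop_start` through `loop_start + loop_length - 1`):

```python
def positional_index(virtual_ip: int, loop_start: int, loop_length: int) -> int:
    """Map a virtual instruction pointer (which keeps growing across
    iterations) back onto the static instruction index inside the loop
    body, so every iteration reuses the same positional encodings."""
    if virtual_ip < loop_start:
        return virtual_ip  # before the loop: identity mapping
    offset = (virtual_ip - loop_start) % loop_length
    return loop_start + offset  # cycle within the loop body

# A 3-instruction loop body starting at instruction index 2:
print([positional_index(ip, loop_start=2, loop_length=3) for ip in range(10)])
# → [0, 1, 2, 3, 4, 2, 3, 4, 2, 3]
```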
## Key Tricks
- **Multiplication via gating:** For `i32.mul`, the gate equals one operand while the value holds the other: `max(0, TOP) × SECOND = TOP × SECOND`.
- **Comparisons via ReLU pairs:** Two neurons with gates `(a-b)` and `(a-b-1)` create a step function that detects `a > b`.
- **Quadratic unembedding:** `logit(t) = 2t·R - t²` is a downward parabola peaking at `t = RESULT`.
- **Quadratic key trick:** `K = (2j, -j²)`, `Q = (i, 1)` → the dot product peaks at `j = i` for exact position matching.
- **Sum-attention for depth:** Instead of precomputing stack depth in the PE, one head sums all past stack deltas.
- **Dynamic filesystem cursors:** File read/write offsets are tracked inline during execution; no reference VM pre-run needed.
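Two of these tricks can be verified numerically. This sketch (assuming integer-valued results in a 256-token byte vocabulary) checks the quadratic unembedding and the ReLU-pair comparison:

```python
# Quadratic unembedding: logit(t) = 2*t*R - t^2 = R^2 - (t - R)^2,
# a downward parabola in t, maximized exactly at t = R, so the
# argmax over token logits decodes the numeric result.
def decode(result: int, vocab_size: int = 256) -> int:
    scores = [2 * t * result - t * t for t in range(vocab_size)]
    return scores.index(max(scores))

# Comparison via a ReLU pair: for integers, relu(a-b) - relu(a-b-1)
# is exactly 1 when a > b and 0 otherwise (a unit step function).
def gt(a: int, b: int) -> int:
    relu = lambda v: max(0, v)
    return relu(a - b) - relu(a - b - 1)

print(decode(42))          # → 42
print(gt(7, 3), gt(3, 7))  # → 1 0
```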
## Supported WASM Operations

### Arithmetic & Logic
`i32.const`, `i32.add`, `i32.sub`, `i32.mul`, `i32.and`, `i32.or`

### Comparisons
`i32.eq`, `i32.ne`, `i32.lt_s`, `i32.gt_s`, `i32.le_s`, `i32.ge_s`

### Memory & Variables
`i32.load`, `i32.store`, `local.get`, `local.set`, `local.tee`

### Filesystem I/O
`fd_open`, `fd_read`, `fd_write`, `fd_close` (4 file descriptors, 32 bytes per file)

### Control Flow
`loop`, `end_loop`, `br_if` (up to 256 iterations, nested loops supported)

### Output & Termination
`output`, `halt`
## Compliance Test Suite

112 tests across 12 categories, all passing at 100%:
| Category | Tests |
|---|---|
| Core arithmetic & logic | 24 |
| Comparisons | 16 |
| Memory & variables | 14 |
| Filesystem | 8 |
| Filesystem integration | 3 |
| Limits & bounds | 15 |
| Basic loops | 7 |
| Loop + arithmetic | 7 |
| Loop + locals/memory | 4 |
| Loop + filesystem | 6 |
| Loop edge cases | 4 |
| Combined & output | 4 |
## Positional Encoding Note
This model uses program-specific positional encodings computed by a compile-time analysis pass, but stack depth (WRITE_DEPTH, WRITE_DEPTH_SQ, BEFORE_DEPTH) is not part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1.
The remaining PE contains: instruction indices, local variable source locations, memory address mappings, and filesystem cursor overrides — structural metadata that a trained model would learn. No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.
For loops, positional encodings use virtual IP cycling (`instIdx % loopBodyLength`) so the transformer receives correct structural metadata across iterations without needing to know the iteration count in advance.
## How to Use

This model uses a custom architecture. To run it, use the reference implementation:

```python
# Load with safetensors
from safetensors.torch import load_file
import json

weights = load_file("model.safetensors")
config = json.load(open("config.json"))

# The model requires a custom forward-pass implementation
# with hard-max, sum-mode, and cross-attention.
# See config.json for head configurations per layer, and the
# reference TypeScript implementation for the complete specification.
```
A complete TypeScript reference implementation is available in the source repository.
## Live Demos
- Interactive WASM REPL — type WASM instructions line-by-line and watch the transformer execute them in real time
- Transformer X-Ray — step through execution and see every layer, head, and neuron activate
- Interactive Article Explorer — explore the concepts behind this model
- FFN Interpreter Slide Deck — 15-slide visual explanation of how the FFN interprets bytecode
## Inspiration
This model is inspired by "Can LLMs Be Computers?" by Percepta AI.
## License
MIT