WASM Interpreter Transformer

A hand-compiled transformer that executes WebAssembly bytecode via real forward passes. Every weight was set by a compiler — not by gradient descent. No training data, no loss function, no optimizer. Just linear algebra.

What This Is

This is a complete WASM bytecode interpreter implemented as a transformer neural network. Given a WASM program as input tokens, it autoregressively generates the execution trace — one output byte at a time — using real matrix multiplications, real attention, and real feed-forward network computations.

The FFN neurons implement arithmetic, comparisons, and bitwise logic through SwiGLU gating. The attention heads retrieve operands and stack values through quadratic key matching. Stack depth is computed internally by a cumulative-sum attention head — no precomputed depth values are needed. The unembedding converts numeric results to token predictions via a quadratic scoring trick.

The transformer supports filesystem I/O (open, read, write, close across 4 file descriptors) and structured loops with br_if conditional branching, executing loops up to 256 iterations via a Continuous Trace with Cycling Positional Encoding mechanism.

112/112 test programs pass with 100% accuracy.

Architecture

Parameter Value
d_model 100
n_layers 8
heads_per_layer [13, 1, 0, 8, 2, 1, 1, 4]
total_heads 30
d_head 2
d_ffn 100
vocab_size 260 (256 byte tokens + 4 special)
FFN activation SwiGLU (ReLU gate)
Attention Hard-max + sum-mode + cross-attention
Total parameters ~316K (all hand-compiled)

Unlike standard transformers, each layer has a different number of attention heads (0 to 13), tailored to the specific computational role of that layer.

How It Works

The 8-Layer Pipeline

  • Layer 0 (13 heads): Opcode fetch — 11 paired attention heads identify 25 opcodes by matching one-hot flags, plus 1 operand retrieval head and 1 single-opcode head
  • Layer 1 (1 head): Stack depth accumulation — 1 sum-attention head computes cumulative stack depth as a running sum of push/pop deltas
  • Layer 2 (0 heads): Depth squaring — FFN-only layer computes WRITE_DEPTH² for use as a quadratic key in later retrieval heads
  • Layer 3 (8 heads): Bit retrieval — 8 hard-max heads extract individual bits from stack top and stack second for AND/OR operations
  • Layer 4 (2 heads): Stack top/second retrieval + arithmetic FFN — 2 heads retrieve the top two stack values, FFN computes all arithmetic, comparison, and bitwise operations
  • Layer 5 (1 head): Local variables — 1 head finds the matching local.tee/local.set, FFN gates the retrieved value
  • Layer 6 (1 head): Linear memory — 1 head finds the matching i32.store by address, FFN gates the retrieved value
  • Layer 7 (4 heads): Filesystem — 4 cross-attention heads (one per file descriptor) retrieve bytes from an external filesystem key-value store by file offset

Sum-Attention (Cumulative Sums)

Layer 1 introduces a novel attention variant: instead of selecting the single best-matching key (hard-max), the sum-mode head accumulates all past value vectors.

Each instruction's value encodes its stack delta: +1 for pushes (e.g., i32.const), -1 for pops (e.g., i32.add), -2 for fd_write. The cumulative sum of all past deltas gives the current stack depth — exactly what's needed for the quadratic key trick to find the correct stack position.
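The cumulative-sum behavior described above can be checked with a minimal NumPy sketch. The trace below (two constants, an add, another constant, another add) is illustrative, not taken from the test suite; the point is only that summing all past value vectors with weight 1 yields a running stack depth.

```python
import numpy as np

# Hypothetical trace: i32.const, i32.const, i32.add, i32.const, i32.add
# Per-instruction stack deltas: +1 for pushes, -1 for a binary op.
stack_deltas = np.array([+1, +1, -1, +1, -1], dtype=np.float32)

# A sum-mode head at position i attends (weight 1) to every position <= i,
# so its output is the cumulative sum of the value stream: the stack depth.
depth_after = np.cumsum(stack_deltas)
print(depth_after.tolist())  # [1.0, 2.0, 1.0, 2.0, 1.0]
```

The same mechanism generalizes to deltas of -2 (e.g. fd_write) without any change to the head.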

Cross-Attention (Filesystem)

Layer 7 uses cross-attention heads that attend over an external key-value store representing file contents rather than the sequence's own tokens. Each of the 4 heads handles one file descriptor, using the current file offset as a query to retrieve the byte at that position. File contents are updated dynamically during execution as fd_write operations modify the filesystem.
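A hedged sketch of one such head in NumPy, with an illustrative store layout (the dict-of-arrays shape and function names here are assumptions, not the model's actual tensor layout). The query encodes the current file offset; keys and values come from the external store rather than the token sequence, and retrieval uses the quadratic-key score described under Key Tricks:

```python
import numpy as np

# External key-value store: one byte array per open file descriptor.
fd_store = {1: np.array([72, 101, 108, 108, 111], dtype=np.float32)}  # fd 1: "Hello"

def fd_read_byte(fd: int, offset: int) -> int:
    data = fd_store[fd]
    j = np.arange(len(data), dtype=np.float32)
    # Quadratic key trick: K_j = (2j, -j^2), Q = (offset, 1),
    # so the score 2*j*offset - j^2 peaks exactly at j == offset.
    scores = 2 * j * offset - j**2
    return int(data[int(np.argmax(scores))])  # hard-max retrieval

print(fd_read_byte(1, 4))  # 111 (the byte 'o')
```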

SwiGLU Gating

Each neuron has two weight vectors:

  • Gate: Reads the opcode flag (e.g., FETCH_ADD). Only fires when the correct operation is active.
  • Value: Reads the computation inputs (stack top, stack second, operand, bits).
output = max(0, gate · x) × (value · x)

Wrong opcode → gate = 0 → output = 0 (silenced). Right opcode → gate = 1 → value passes through.
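A minimal sketch of one such neuron. The three-feature layout (opcode flag, stack top, stack second) and the FETCH_ADD name are illustrative, not the model's real 100-dimensional residual layout:

```python
import numpy as np

def neuron(x, gate_w, value_w):
    # SwiGLU with ReLU gate: output = max(0, gate . x) * (value . x)
    return max(0.0, float(gate_w @ x)) * float(value_w @ x)

gate_w  = np.array([1.0, 0.0, 0.0])   # fires only when the FETCH_ADD flag is set
value_w = np.array([0.0, 1.0, 1.0])   # computes stack_top + stack_second

x_add   = np.array([1.0, 2.0, 3.0])   # i32.add with operands 2 and 3
x_other = np.array([0.0, 2.0, 3.0])   # any other opcode: gate = 0

print(neuron(x_add, gate_w, value_w))    # 5.0
print(neuron(x_other, gate_w, value_w))  # 0.0 (silenced)
```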

Loop Execution

Loops use a Continuous Trace with Cycling Positional Encoding mechanism:

  • loop and end_loop are structural markers (no-ops in the execution trace)
  • br_if pops a condition from the stack; if non-zero, execution branches back to the loop body start
  • Positional encodings cycle using virtualIP % loopLength so the transformer sees correct instruction indices across iterations
  • Maximum 256 iterations per loop; nested loops are supported
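The cycling positional-encoding rule can be sketched as follows. The concrete indices (`loop_start`, `body_len`) are illustrative; the idea is that trace positions keep growing during a loop while PE indices repeat over the loop body:

```python
loop_start = 2   # trace index where the loop body begins (illustrative)
body_len = 3     # number of instructions in the loop body (illustrative)

def pe_index(trace_pos: int) -> int:
    # Straight-line code before the loop maps 1:1.
    if trace_pos < loop_start:
        return trace_pos
    # Inside the loop, cycle: virtualIP % loopLength over the body.
    return loop_start + (trace_pos - loop_start) % body_len

print([pe_index(p) for p in range(8)])  # [0, 1, 2, 3, 4, 2, 3, 4]
```

Each iteration therefore sees the same structural metadata, with no need to know the iteration count in advance.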

Key Tricks

  • Multiplication via gating: For i32.mul, the gate equals one operand while the value holds the other. max(0, TOP) × SECOND = TOP × SECOND whenever TOP ≥ 0, since the ReLU gate passes non-negative operands unchanged.
  • Comparisons via ReLU pairs: Two neurons with gates (a-b) and (a-b-1) create a step function that detects a > b.
  • Quadratic unembedding: logit(t) = 2t·R - t² is a downward parabola peaking at t = RESULT.
  • Quadratic key trick: K = (2j, -j²), Q = (i, 1) → dot product peaks at j = i for exact position matching.
  • Sum-attention for depth: Instead of precomputing stack depth in PE, one head sums all past stack deltas.
  • Dynamic filesystem cursors: File read/write offsets are tracked inline during execution — no reference VM pre-run needed.
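Several of these tricks are easy to verify numerically. The sketch below uses toy values, not actual model weights:

```python
import numpy as np

# Multiplication via gating: gate reads TOP, value reads SECOND.
top, second = 6.0, 7.0
product = max(0.0, top) * second            # 42.0 (TOP must be non-negative)

# Comparison a > b via a ReLU pair with gates (a-b) and (a-b-1):
def gt(a: int, b: int) -> float:
    d = float(a - b)
    return max(0.0, d) - max(0.0, d - 1.0)  # 1.0 if a > b else 0.0 (integers)

# Quadratic key trick: K_j = (2j, -j^2), Q_i = (i, 1) -> score peaks at j = i.
i, j = 5, np.arange(16)
matched = int(np.argmax(2 * j * i - j**2))

# Quadratic unembedding: logit(t) = 2*t*R - t^2 is maximized at t = R.
R, t = 42, np.arange(256)
predicted = int(np.argmax(2 * t * R - t**2))

print(product, gt(3, 2), gt(2, 3), matched, predicted)
```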

Supported WASM Operations

Arithmetic & Logic

i32.const, i32.add, i32.sub, i32.mul, i32.and, i32.or

Comparisons

i32.eq, i32.ne, i32.lt_s, i32.gt_s, i32.le_s, i32.ge_s

Memory & Variables

i32.load, i32.store, local.get, local.set, local.tee

Filesystem I/O

fd_open, fd_read, fd_write, fd_close (4 file descriptors, 32 bytes per file)

Control Flow

loop, end_loop, br_if (up to 256 iterations, nested loops supported)

Output & Termination

output, halt

Compliance Test Suite

112 tests across the 12 categories below, all passing at 100%:

Category Tests
Core arithmetic & logic 24
Comparisons 16
Memory & variables 14
Filesystem 8
Filesystem integration 3
Limits & bounds 15
Basic loops 7
Loop + arithmetic 7
Loop + locals/memory 4
Loop + filesystem 6
Loop edge cases 4
Combined & output 4

Positional Encoding Note

This model uses program-specific positional encodings computed by a compile-time analysis pass, but stack depth (WRITE_DEPTH, WRITE_DEPTH_SQ, BEFORE_DEPTH) is not part of the PE — it's computed internally by the transformer's sum-attention head in Layer 1.

The remaining PE contains: instruction indices, local variable source locations, memory address mappings, and filesystem cursor overrides — structural metadata that a trained model would learn. No runtime values appear in the PEs; all actual computation happens in the transformer's forward passes.

For loops, positional encodings use virtual IP cycling (instIdx % loopBodyLength) so the transformer receives correct structural metadata across iterations without needing to know the iteration count in advance.

How to Use

This model uses a custom architecture. To run it, use the reference implementation:

# Load with safetensors
from safetensors.torch import load_file
import json

weights = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# The model requires a custom forward pass implementation
# with hard-max, sum-mode, and cross-attention.
# See config.json for head configurations per layer.
# See the reference TypeScript implementation for the complete specification.

A complete TypeScript reference implementation is available in the source repository.

Live Demos

  • Interactive WASM REPL — type WASM instructions line-by-line and watch the transformer execute them in real time
  • Transformer X-Ray — step through execution and see every layer, head, and neuron activate
  • Interactive Article Explorer — explore the concepts behind this model
  • FFN Interpreter Slide Deck — 15-slide visual explanation of how the FFN interprets bytecode

Inspiration

This model is inspired by "Can LLMs Be Computers?" by Percepta AI.

License

MIT
