DenseMem: FoldedMemory RAM Tech

Title: I built a KV cache compression protocol — 256x ratio, 0.9994 fidelity, running live on an RTX 4090


Hey r/LocalLLaMA,

I’ve been running a 72B model’s full KV cache in 640MB of DDR5 RAM on my RTX 4090 + Core i9. Wanted to share what I built.

DenseMem v0.2.0 — FoldedMemory Protocol

The problem: a 72B model at 32K context needs ~160GB of KV cache. That’s H100 territory. Most of us can’t touch it.
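Napkin math behind that number (layer count and hidden dim are assumptions for a dense 72B-class model with full multi-head attention; the exact total moves with precision, batch size, and whether the model uses GQA):

```python
# Napkin KV sizing; all architecture numbers here are assumptions,
# not taken from the DenseMem repo.
def kv_cache_bytes(n_layers, hidden_dim, seq_len, bytes_per_elem, batch=1):
    # 2x for keys and values; one hidden_dim vector per token per layer
    return 2 * n_layers * hidden_dim * seq_len * bytes_per_elem * batch

gb = kv_cache_bytes(n_layers=80, hidden_dim=8192, seq_len=32_768,
                    bytes_per_elem=2) / 1e9          # fp16
print(f"fp16, batch 1: {gb:.0f} GB")                 # ~86 GB
print(f"fp32 (or fp16, batch 2): {2 * gb:.0f} GB")   # ~172 GB, the ~160 GB ballpark
```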

The insight: KV cache activations aren’t random. They’re highly structured and correlated. SVD at rank=64 exploits that geometry. The compression is lossy in theory but in practice the fidelity holds at 0.9994 cosine similarity — because real transformer activations live in a low-dimensional subspace.
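Here's a toy version of the core move. Rank and shapes are illustrative, and the rank reduction alone only buys ~13x on this block shape; the shipped 256x presumably stacks INT8 and other tricks on top. None of this is the repo's code:

```python
import numpy as np

RANK = 64  # illustrative; matches the rank named above

def compress(block: np.ndarray):
    """Keep only the top-RANK singular directions of a (tokens, dim) block."""
    U, S, Vt = np.linalg.svd(block, full_matrices=False)
    return U[:, :RANK] * S[:RANK], Vt[:RANK, :]

def decompress(US: np.ndarray, Vt: np.ndarray) -> np.ndarray:
    return US @ Vt

def cosine_fidelity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Structured input (low-rank signal + small noise) survives almost intact...
structured = rng.normal(size=(4096, 64)) @ rng.normal(size=(64, 1024)) \
             + 0.01 * rng.normal(size=(4096, 1024))
print(cosine_fidelity(structured, decompress(*compress(structured))))  # ~0.999+

# ...while pure noise does not.
noise = rng.normal(size=(4096, 1024))
print(cosine_fidelity(noise, decompress(*compress(noise))))  # well below 1
```

The noise case is exactly why the negative control in the benchmark below matters: it shows the fidelity comes from structure in the activations, not from the method being trivially lossless.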

Live benchmark (RTX 4090 + Core i9 + DDR5):

  • Compression: 256x
  • Fidelity: 0.9994 cosine similarity
  • Negative control (random noise): 0.12 — confirms it’s exploiting structure, not luck
  • Avg fetch latency: 1.95 ms (timing harness sketched after this list)
  • Max fetch latency under load: 3.96 ms
  • Evictions: 2,944, all clean
  • Live test: 16,384 MB → 63.9 MB (≈256x)
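The latency harness is nothing exotic; a minimal stand-in (the real fetch() lives in the repo, and `bench_fetch` plus its arguments are invented for illustration):

```python
import statistics
import time

def bench_fetch(fetch, keys, warmup=100):
    """Time fetch(key) per call; returns (avg_ms, max_ms). Purely illustrative."""
    for k in keys[:warmup]:               # warm the hot tier first
        fetch(k)
    lat_ms = []
    for k in keys[warmup:]:
        t0 = time.perf_counter()
        fetch(k)
        lat_ms.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(lat_ms), max(lat_ms)

# usage (hypothetical): avg, worst = bench_fetch(cache.fetch, block_ids)
```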

Architecture:

  • Two-tier hierarchy: VRAM hot, DDR5 warm
  • Attention-weighted eviction: score = 0.5*attn + 0.3*recency + 0.2*freq
  • Prefetcher: layer lookahead + sequential token prediction
  • Two-method API: store() and fetch() (toy sketch below)
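A toy sketch of how the scorer and the two-method API fit together. Every class and field name here is invented for illustration; only store() and fetch() come from the actual protocol:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Entry:
    data: bytes
    attn: float = 0.0       # running attention weight for this block
    last_access: float = field(default_factory=time.monotonic)
    freq: int = 0

class FoldedCache:
    """Toy two-tier cache; internals are illustrative, not the shipped code."""
    def __init__(self, hot_capacity: int):
        self.hot: dict[str, Entry] = {}   # stands in for VRAM
        self.warm: dict[str, Entry] = {}  # stands in for DDR5
        self.hot_capacity = hot_capacity

    def _score(self, e: Entry, now: float) -> float:
        # attention-weighted eviction: 0.5*attn + 0.3*recency + 0.2*freq
        recency = 1.0 / (1.0 + now - e.last_access)
        freq = min(e.freq / 100.0, 1.0)   # crude normalization
        return 0.5 * e.attn + 0.3 * recency + 0.2 * freq

    def store(self, key: str, data: bytes, attn: float = 0.0) -> None:
        self.hot[key] = Entry(data, attn=attn)
        if len(self.hot) > self.hot_capacity:
            now = time.monotonic()
            victim = min(self.hot, key=lambda k: self._score(self.hot[k], now))
            self.warm[victim] = self.hot.pop(victim)  # demote, don't drop

    def fetch(self, key: str) -> bytes | None:
        e = self.hot.get(key) or self.warm.get(key)
        if e is None:
            return None
        e.freq += 1
        e.last_access = time.monotonic()
        return e.data
```

The lowest-scoring block gets demoted to warm rather than dropped, which is what keeps evictions "clean": nothing leaves the hierarchy, it just gets cheaper to hold.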

Current limitation:

Hit rate is 25% — my i9’s 2-channel DDR5 is the bottleneck (~38 GB/s). On Threadripper PRO 8-channel DDR5 (~224 GB/s) I’m projecting 65-75% hit rate with sub-2ms latency.
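Why bandwidth is the lever: a pure size/bandwidth lower bound (ignores DRAM latency, contention, and PCIe hops; the 64 MB working set is my rounding of the 63.9 MB figure above):

```python
def transfer_ms(size_mb: float, bandwidth_gb_s: float) -> float:
    # MB divided by GB/s conveniently comes out in milliseconds
    return size_mb / bandwidth_gb_s

print(transfer_ms(64, 38))    # ~1.7 ms  on 2-channel DDR5 (~38 GB/s)
print(transfer_ms(64, 224))   # ~0.29 ms on 8-channel DDR5 (~224 GB/s)
```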

Running live:

Qwen2.5-7B at 32K context on a single 4090. Every tick, KV blocks are PCA-compressed and quantized to INT8 before being paged out to DDR5 (minimal INT8 sketch below). Usable context went from 4K to 32K: an 8x expansion via DenseMem.
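A minimal stand-in for the INT8 step (symmetric per-tensor scaling here; illustrative only, the shipped pipeline may scale per-channel or per-block):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # symmetric per-tensor quantization; scheme assumed, not from the repo
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)  # avoid div-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.default_rng(1).normal(size=(256, 64)).astype(np.float32)
q, s = quantize_int8(x)
err = np.abs(x - dequantize_int8(q, s)).max()
print(q.nbytes, "bytes vs", x.nbytes, "| max abs err:", err)  # 4x smaller than fp32
```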

Cost:

Uncompressed 72B KV cache at 32K ctx: $32,000 in HBM3e.
FoldedMemory: $1.88 in DDR5.
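Implied pricing, so the comparison is reproducible (the $/GB figures are reverse-engineered from the totals above, not quoted prices):

```python
# Reverse-engineered $/GB assumptions, not actual quotes:
hbm3e_usd_per_gb = 200.0               # 160 GB  * $200.00/GB = $32,000
ddr5_usd_per_gb = 2.94                 # 0.64 GB * $2.94/GB  ~= $1.88
print(f"${160 * hbm3e_usd_per_gb:,.0f}")   # $32,000
print(f"${0.64 * ddr5_usd_per_gb:.2f}")    # $1.88
```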

GitHub: https://github.com/thorshammerztp-arch/densemem-protocol
Patent pending (US 64/045,595).

Happy to answer questions on the compression math, architecture, or benchmark methodology.


Built by a solo developer / Navy veteran on personal hardware. No funding.
