Feather Bench committed on
Commit 0f62ca9 · 1 Parent(s): 1057d9d

Add Benchmarks tab + headline banner — LongMemEval 0.693 visible on landing

Files changed (1)
  1. app.py +77 -0
app.py CHANGED
@@ -309,6 +309,16 @@ with gr.Blocks(
  <code>pip install feather-db</code>
  </p>
  </div>
+
+ <div style="background:linear-gradient(90deg,#eef2ff,#fef3c7);border:1px solid #c7d2fe;border-radius:8px;padding:0.85rem 1rem;margin-bottom:1rem">
+ <strong style="font-size:1.05rem">📊 Latest benchmark — LongMemEval (Apr 2026):</strong>
+ <code style="background:white;padding:2px 6px;border-radius:4px">Feather + GPT-4o = 0.693</code> ·
+ <code style="background:white;padding:2px 6px;border-radius:4px">Feather + Gemini-Flash = 0.657</code>
+ <span style="color:#6b7280">— beats the LongMemEval paper's full-context GPT-4o ceiling (0.640).
+ Reproducible on a $2.40 budget.</span>
+ <a href="https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md"
+ target="_blank" style="margin-left:0.4rem">Full report →</a>
+ </div>
  """)
 
  gr.Markdown("""
@@ -319,6 +329,73 @@ feature performance, competitor moves, community signals, strategy briefs, and u
 
  with gr.Tabs():
 
+ # ── Benchmarks ────────────────────────────────────────────────────────
+ with gr.TabItem("📊 Benchmarks"):
+ gr.Markdown("""
+ ## Feather DB v0.8.0 — Reproducible Benchmark Results
+
+ ### LongMemEval (Xu et al., 2024 / ICLR 2025)
+
+ 500-question end-to-end memory QA benchmark. Each question carries up to ~115K
+ tokens of chat history; system must ingest, retrieve, and answer correctly
+ across 5 memory ability axes.
+
+ | System | Variant | Answerer | Overall | Cost / run |
+ |---|---|---|---|---|
+ | **Feather DB v0.8.0 + decay** | **S** | **gpt-4o** | **0.693** | ~$8 |
+ | **Feather DB v0.8.0 + decay** | **S** | **gemini-2.5-flash** | **0.657** | ~$2.40 |
+ | Full-context GPT-4o (paper "ceiling") | S | gpt-4o + CoN | 0.640 | n/a |
+ | Zep (graphiti) | S | gpt-4o-mini | 0.638 | (vendor) |
+ | Full-context GPT-4o-mini | S | gpt-4o-mini | 0.554 | n/a |
+ | Naive vector RAG (paper) | S/M | gpt-4o | ~0.31 | n/a |
+
+ **Feather + GPT-4o (0.693) beats the LongMemEval paper's full-context GPT-4o
+ ceiling (0.640).** Our 10-snippet retrieval carries more useful signal to the
+ answerer than dumping the whole 115K-token haystack into a frontier model — at
+ ~40× lower input-token cost per query.
+
+ #### Per-axis (Feather + GPT-4o on _S_)
+
+ | Axis | Score |
+ |---|---|
+ | single-session-user | **1.000** *(perfect)* |
+ | single-session-assistant | 0.964 |
+ | single-session-preference | 0.767 |
+ | knowledge-update | 0.714 |
+ | multi-session-reasoning | 0.606 |
+ | temporal-reasoning | 0.477 |
+
+ ### ANN performance — SIFT1M (real data, 500K × 128-dim)
+
+ | ef | p50 latency | p99 latency | Recall@10 |
+ |---|---|---|---|
+ | 10 | 0.07 ms | 0.13 ms | 0.774 |
+ | **50 (default)** | **0.19 ms** | **0.23 ms** | **0.972** |
+ | 100 | 0.32 ms | 0.39 ms | 0.991 |
+ | 200 | 0.56 ms | 0.69 ms | 0.998 |
+
+ ### Reproduce
+
+ ```bash
+ pip install feather-db
+ git clone https://github.com/feather-store/feather && cd feather
+
+ python -m bench run longmemeval --dataset s --limit 0 \\
+ --embedder openai \\
+ --answerer-provider gemini --answerer-model gemini-2.5-flash \\
+ --judge llm --judge-provider gemini --judge-model gemini-2.0-flash \\
+ --decay-half-life 14 --decay-time-weight 0.4 --k 10
+ ```
+
+ ### Audit trail
+
+ Per-run JSON results — every number above is one of these files:
+ - **HuggingFace Dataset:** [Hawky-ai/feather-db-benchmarks](https://huggingface.co/datasets/Hawky-ai/feather-db-benchmarks)
+ - **GitHub:** [`bench/results/`](https://github.com/feather-store/feather/tree/master/bench/results)
+ - **Full report:** [`docs/benchmarks/longmemeval.md`](https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md)
+ - **arXiv paper:** [`docs/featherdb_paper.pdf`](https://github.com/feather-store/feather/blob/master/docs/featherdb_paper.pdf)
+ """)
+
  # ── Search ────────────────────────────────────────────────────────────
  with gr.TabItem("🔍 Semantic Search"):
  gr.Markdown("Find nodes by **meaning**, not keywords. Filtered by product or entity type.")
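
As a sanity check on the diff above, the two hunk headers should account for the reported `app.py +77 -0` stat. The sketch below parses the `@@ -old_start,old_len +new_start,new_len @@` headers and sums the net line change. The helper names (`parse_hunk`, `net_line_change`, `total_net_change`) are hypothetical and not part of the Feather repository. Because the commit removes no lines, the net change equals the number of added lines.

```python
import re

# Hunk headers copied from the diff above (trailing context text shortened).
HUNK_HEADERS = [
    "@@ -309,6 +309,16 @@ with gr.Blocks(",
    "@@ -319,6 +329,73 @@ feature performance",
]

# Unified-diff hunk header: @@ -old_start[,old_len] +new_start[,new_len] @@
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk(header: str) -> tuple[int, int, int, int]:
    """Return (old_start, old_len, new_start, new_len) for one hunk header."""
    match = HUNK_RE.match(header)
    if match is None:
        raise ValueError(f"not a unified-diff hunk header: {header!r}")
    old_start, old_len, new_start, new_len = match.groups()
    # A missing length means the hunk covers exactly one line.
    return (
        int(old_start),
        int(old_len) if old_len is not None else 1,
        int(new_start),
        int(new_len) if new_len is not None else 1,
    )


def net_line_change(header: str) -> int:
    """Lines in the new side minus lines in the old side for one hunk."""
    _, old_len, _, new_len = parse_hunk(header)
    return new_len - old_len


def total_net_change(headers: list[str]) -> int:
    """Sum the net line change over all hunks of a file."""
    return sum(net_line_change(h) for h in headers)


if __name__ == "__main__":
    for h in HUNK_HEADERS:
        print(h, "->", net_line_change(h))
    print("total:", total_net_change(HUNK_HEADERS))
```

The second hunk's new-side start (329) is its old-side start (319) shifted by the 10 lines the first hunk adds, which is a second consistency check on the headers.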