Feather Bench committed on
Commit ·
0f62ca9
1
Parent(s): 1057d9d
Add Benchmarks tab + headline banner — LongMemEval 0.693 visible on landing
app.py
CHANGED
|
@@ -309,6 +309,16 @@ with gr.Blocks(
|
|
| 309 |
<code>pip install feather-db</code>
|
| 310 |
</p>
|
| 311 |
</div>
|
| 312 |
""")
|
| 313 |
|
| 314 |
gr.Markdown("""
|
|
@@ -319,6 +329,73 @@ feature performance, competitor moves, community signals, strategy briefs, and u
|
|
| 319 |
|
| 320 |
with gr.Tabs():
|
| 321 |
|
| 322 |
# ── Search ────────────────────────────────────────────────────────────
|
| 323 |
with gr.TabItem("🔍 Semantic Search"):
|
| 324 |
gr.Markdown("Find nodes by **meaning**, not keywords. Filtered by product or entity type.")
|
|
|
|
| 309 |
<code>pip install feather-db</code>
|
| 310 |
</p>
|
| 311 |
</div>
|
| 312 |
+
|
| 313 |
+
<div style="background:linear-gradient(90deg,#eef2ff,#fef3c7);border:1px solid #c7d2fe;border-radius:8px;padding:0.85rem 1rem;margin-bottom:1rem">
|
| 314 |
+
<strong style="font-size:1.05rem">📊 Latest benchmark — LongMemEval (Apr 2026):</strong>
|
| 315 |
+
<code style="background:white;padding:2px 6px;border-radius:4px">Feather + GPT-4o = 0.693</code> ·
|
| 316 |
+
<code style="background:white;padding:2px 6px;border-radius:4px">Feather + Gemini-Flash = 0.657</code>
|
| 317 |
+
<span style="color:#6b7280">— beats the LongMemEval paper's full-context GPT-4o ceiling (0.640).
|
| 318 |
+
The Gemini-Flash run is reproducible on a ~$2.40 budget.</span>
|
| 319 |
+
<a href="https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md"
|
| 320 |
+
target="_blank" style="margin-left:0.4rem">Full report →</a>
|
| 321 |
+
</div>
|
| 322 |
""")
|
| 323 |
|
| 324 |
gr.Markdown("""
|
|
|
|
| 329 |
|
| 330 |
with gr.Tabs():
|
| 331 |
|
| 332 |
+
# ── Benchmarks ────────────────────────────────────────────────────────
|
| 333 |
+
with gr.TabItem("📊 Benchmarks"):
|
| 334 |
+
gr.Markdown("""
|
| 335 |
+
## Feather DB v0.8.0 — Reproducible Benchmark Results
|
| 336 |
+
|
| 337 |
+
### LongMemEval (Wu et al., 2024 / ICLR 2025)
|
| 338 |
+
|
| 339 |
+
500-question end-to-end memory QA benchmark. Each question carries up to ~115K
|
| 340 |
+
tokens of chat history; the system must ingest, retrieve, and answer correctly
|
| 341 |
+
across 5 memory ability axes.
|
| 342 |
+
|
| 343 |
+
| System | Variant | Answerer | Overall | Cost / run |
|
| 344 |
+
|---|---|---|---|---|
|
| 345 |
+
| **Feather DB v0.8.0 + decay** | **S** | **gpt-4o** | **0.693** | ~$8 |
|
| 346 |
+
| **Feather DB v0.8.0 + decay** | **S** | **gemini-2.5-flash** | **0.657** | ~$2.40 |
|
| 347 |
+
| Full-context GPT-4o (paper "ceiling") | S | gpt-4o + CoN | 0.640 | n/a |
|
| 348 |
+
| Zep (graphiti) | S | gpt-4o-mini | 0.638 | (vendor) |
|
| 349 |
+
| Full-context GPT-4o-mini | S | gpt-4o-mini | 0.554 | n/a |
|
| 350 |
+
| Naive vector RAG (paper) | S/M | gpt-4o | ~0.31 | n/a |
|
| 351 |
+
|
| 352 |
+
**Feather + GPT-4o (0.693) beats the LongMemEval paper's full-context GPT-4o
|
| 353 |
+
ceiling (0.640).** Our 10-snippet retrieval carries more useful signal to the
|
| 354 |
+
answerer than dumping the whole 115K-token haystack into a frontier model — at
|
| 355 |
+
~40× lower input-token cost per query.
|
| 356 |
+
|
| 357 |
+
#### Per-axis (Feather + GPT-4o on _S_)
|
| 358 |
+
|
| 359 |
+
| Axis | Score |
|
| 360 |
+
|---|---|
|
| 361 |
+
| single-session-user | **1.000** *(perfect)* |
|
| 362 |
+
| single-session-assistant | 0.964 |
|
| 363 |
+
| single-session-preference | 0.767 |
|
| 364 |
+
| knowledge-update | 0.714 |
|
| 365 |
+
| multi-session-reasoning | 0.606 |
|
| 366 |
+
| temporal-reasoning | 0.477 |
|
| 367 |
+
|
| 368 |
+
### ANN performance — SIFT1M (real data, 500K × 128-dim)
|
| 369 |
+
|
| 370 |
+
| ef | p50 latency | p99 latency | Recall@10 |
|
| 371 |
+
|---|---|---|---|
|
| 372 |
+
| 10 | 0.07 ms | 0.13 ms | 0.774 |
|
| 373 |
+
| **50 (default)** | **0.19 ms** | **0.23 ms** | **0.972** |
|
| 374 |
+
| 100 | 0.32 ms | 0.39 ms | 0.991 |
|
| 375 |
+
| 200 | 0.56 ms | 0.69 ms | 0.998 |
|
| 376 |
+
|
| 377 |
+
### Reproduce
|
| 378 |
+
|
| 379 |
+
```bash
|
| 380 |
+
pip install feather-db
|
| 381 |
+
git clone https://github.com/feather-store/feather && cd feather
|
| 382 |
+
|
| 383 |
+
python -m bench run longmemeval --dataset s --limit 0 \\
|
| 384 |
+
--embedder openai \\
|
| 385 |
+
--answerer-provider gemini --answerer-model gemini-2.5-flash \\
|
| 386 |
+
--judge llm --judge-provider gemini --judge-model gemini-2.0-flash \\
|
| 387 |
+
--decay-half-life 14 --decay-time-weight 0.4 --k 10
|
| 388 |
+
```
|
| 389 |
+
|
| 390 |
+
### Audit trail
|
| 391 |
+
|
| 392 |
+
Per-run JSON results — every number above is one of these files:
|
| 393 |
+
- **HuggingFace Dataset:** [Hawky-ai/feather-db-benchmarks](https://huggingface.co/datasets/Hawky-ai/feather-db-benchmarks)
|
| 394 |
+
- **GitHub:** [`bench/results/`](https://github.com/feather-store/feather/tree/master/bench/results)
|
| 395 |
+
- **Full report:** [`docs/benchmarks/longmemeval.md`](https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md)
|
| 396 |
+
- **arXiv paper:** [`docs/featherdb_paper.pdf`](https://github.com/feather-store/feather/blob/master/docs/featherdb_paper.pdf)
|
| 397 |
+
""")
|
| 398 |
+
|
| 399 |
# ── Search ────────────────────────────────────────────────────────────
|
| 400 |
with gr.TabItem("🔍 Semantic Search"):
|
| 401 |
gr.Markdown("Find nodes by **meaning**, not keywords. Filtered by product or entity type.")
|
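Note on the Recall@10 column in the SIFT1M table: the diff reports the metric without stating how it is computed. Below is a minimal sketch of the usual definition, assuming recall is measured against exact (brute-force) nearest neighbours and averaged over all queries. The function names `recall_at_k` and `mean_recall_at_k` are hypothetical and are not taken from the Feather bench code.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int = 10) -> float:
    """Fraction of the true top-k neighbour ids that appear in the retrieved top-k.

    Hypothetical helper for illustration; not part of the Feather bench code.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    # Sets avoid double-counting if the index returns a duplicate id.
    hits = len(set(retrieved[:k]) & truth)
    return hits / len(truth)


def mean_recall_at_k(
    all_retrieved: Sequence[Sequence[int]],
    all_truth: Sequence[Sequence[int]],
    k: int = 10,
) -> float:
    """Average recall@k over a batch of queries."""
    if len(all_retrieved) != len(all_truth) or not all_retrieved:
        raise ValueError("need one non-empty ground-truth list per query")
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_truth)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two of the four true neighbours were found.
    print(recall_at_k([1, 2, 3, 4], [1, 2, 9, 8], k=4))  # 0.5
```

Under this definition, a Recall@10 of 0.972 at `ef=50` would mean that, on average, about 9.7 of the 10 exact nearest neighbours come back per query. Whether the bench uses exactly this averaging is an assumption; the per-run JSON files in the audit trail are the authoritative source.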