Hawky-ai/feather-db

Feather Bench committed on Apr 26

Commit 0f62ca9 (1 parent: 1057d9d)

Add Benchmarks tab + headline banner — LongMemEval 0.693 visible on landing

Files changed (1)

app.py +77 -0

@@ -309,6 +309,16 @@ with gr.Blocks(
         <code>pip install feather-db</code>
       </p>
     </div>
+    <div style="background:linear-gradient(90deg,#eef2ff,#fef3c7);border:1px solid #c7d2fe;border-radius:8px;padding:0.85rem 1rem;margin-bottom:1rem">
+      <strong style="font-size:1.05rem">📊 Latest benchmark — LongMemEval (Apr 2026):</strong>
+      <code style="background:white;padding:2px 6px;border-radius:4px">Feather + GPT-4o = 0.693</code> ·
+      <code style="background:white;padding:2px 6px;border-radius:4px">Feather + Gemini-Flash = 0.657</code>
+      <span style="color:#6b7280">— beats the LongMemEval paper's full-context GPT-4o ceiling (0.640).
+      Reproducible on a $2.40 budget.</span>
+      <a href="https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md"
+         target="_blank" style="margin-left:0.4rem">Full report →</a>
+    </div>
     """)
     gr.Markdown("""
@@ -319,6 +329,73 @@ feature performance, competitor moves, community signals, strategy briefs, and u
     with gr.Tabs():
+        # ── Benchmarks ────────────────────────────────────────────────────────
+        with gr.TabItem("📊 Benchmarks"):
+            gr.Markdown("""
+## Feather DB v0.8.0 — Reproducible Benchmark Results
+### LongMemEval (Wu et al., 2024 / ICLR 2025)
+500-question end-to-end memory QA benchmark. Each question carries up to ~115K
+tokens of chat history; system must ingest, retrieve, and answer correctly
+across 5 memory ability axes.
+| System | Variant | Answerer | Overall | Cost / run |
+|---|---|---|---|---|
+| **Feather DB v0.8.0 + decay** | **S** | **gpt-4o** | **0.693** | ~$8 |
+| **Feather DB v0.8.0 + decay** | **S** | **gemini-2.5-flash** | **0.657** | ~$2.40 |
+| Full-context GPT-4o (paper "ceiling") | S | gpt-4o + CoN | 0.640 | n/a |
+| Zep (graphiti) | S | gpt-4o-mini | 0.638 | (vendor) |
+| Full-context GPT-4o-mini | S | gpt-4o-mini | 0.554 | n/a |
+| Naive vector RAG (paper) | S/M | gpt-4o | ~0.31 | n/a |
+**Feather + GPT-4o (0.693) beats the LongMemEval paper's full-context GPT-4o
+ceiling (0.640).** Our 10-snippet retrieval carries more useful signal to the
+answerer than dumping the whole 115K-token haystack into a frontier model — at
+~40× lower input-token cost per query.
+#### Per-axis (Feather + GPT-4o on _S_)
+| Axis | Score |
+|---|---|
+| single-session-user | **1.000** *(perfect)* |
+| single-session-assistant | 0.964 |
+| single-session-preference | 0.767 |
+| knowledge-update | 0.714 |
+| multi-session-reasoning | 0.606 |
+| temporal-reasoning | 0.477 |
+### ANN performance — SIFT1M (real data, 500K × 128-dim)
+| ef | p50 latency | p99 latency | Recall@10 |
+|---|---|---|---|
+| 10 | 0.07 ms | 0.13 ms | 0.774 |
+| **50 (default)** | **0.19 ms** | **0.23 ms** | **0.972** |
+| 100 | 0.32 ms | 0.39 ms | 0.991 |
+| 200 | 0.56 ms | 0.69 ms | 0.998 |
+### Reproduce
+```bash
+pip install feather-db
+git clone https://github.com/feather-store/feather && cd feather
+python -m bench run longmemeval --dataset s --limit 0 \\
+    --embedder openai \\
+    --answerer-provider gemini --answerer-model gemini-2.5-flash \\
+    --judge llm --judge-provider gemini --judge-model gemini-2.0-flash \\
+    --decay-half-life 14 --decay-time-weight 0.4 --k 10
+```
+### Audit trail
+Per-run JSON results — every number above is one of these files:
+- **HuggingFace Dataset:** [Hawky-ai/feather-db-benchmarks](https://huggingface.co/datasets/Hawky-ai/feather-db-benchmarks)
+- **GitHub:** [`bench/results/`](https://github.com/feather-store/feather/tree/master/bench/results)
+- **Full report:** [`docs/benchmarks/longmemeval.md`](https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md)
+- **arXiv paper:** [`docs/featherdb_paper.pdf`](https://github.com/feather-store/feather/blob/master/docs/featherdb_paper.pdf)
+""")
         # ── Search ────────────────────────────────────────────────────────────
         with gr.TabItem("🔍 Semantic Search"):
             gr.Markdown("Find nodes by **meaning**, not keywords. Filtered by product or entity type.")
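
Review note on the ANN table added in this commit: the diff does not show how Recall@10 or the p50/p99 latencies were computed, so the following is only a minimal sketch of the conventional definitions, not Feather's benchmark code. The function names `recall_at_k` and `percentile` and the sample data are hypothetical, and the nearest-rank percentile is one common convention among several.

```python
import math
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[int]],
                ground_truth: Sequence[Sequence[int]],
                k: int = 10) -> float:
    """Mean fraction of the exact top-k neighbours that the ANN top-k contains."""
    if len(retrieved) != len(ground_truth):
        raise ValueError("one result list per query is required")
    hits = 0
    for got, truth in zip(retrieved, ground_truth):
        # Count IDs that appear in both the ANN result and the exact result.
        hits += len(set(got[:k]) & set(truth[:k]))
    return hits / (k * len(ground_truth))


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile of latency samples, with q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical data: two queries, k=4, one true neighbour missed in the first.
truth = [[1, 2, 3, 4], [5, 6, 7, 8]]
found = [[1, 2, 9, 4], [5, 6, 7, 8]]
print(recall_at_k(found, truth, k=4))  # 7 of 8 neighbours found -> 0.875

latencies_ms = [0.18, 0.19, 0.19, 0.20, 0.23]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 0.19 0.23
```

Under these definitions, a row such as `ef = 50` would report the median and tail of per-query search times alongside the average overlap with brute-force neighbours. Whether the committed numbers follow exactly this convention would need to be checked against the JSON files listed in the audit trail.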