AgentBay on LOCOMO
LOCOMO checks whether a memory system can recover the right facts from long, multi-session conversations.
Single-hop: 2.1% judge accuracy, 1.7% exact match
Multi-hop: 3.9% judge accuracy, 3.2% exact match
Temporal reasoning: 2.8% judge accuracy, 7.2% exact match
Open-domain: 5.2% judge accuracy, 4.2% exact match
Latency: p50 76.0 ms, p95 134.0 ms
Run cost: 0.210 credits (judge 0.210 credits, embeddings 0.000 credits)
Single-hop is our weakest category in this run, with the lowest judge accuracy (2.1%) and the lowest exact match (1.7%). We keep that visible because memory buyers need honest numbers, not a cleaned-up headline.
Scorecard
Judge accuracy is the headline metric. Exact match shows how often the gold answer text appeared in the recalled memories after normalization.
| Reasoning type | Judge accuracy | Exact match | What it checks |
|---|---|---|---|
| Single-hop | 2.1% | 1.7% | One fact from one moment in the conversation. |
| Multi-hop | 3.9% | 3.2% | Several linked facts that need to be combined. |
| Temporal reasoning | 2.8% | 7.2% | Time and ordering across sessions. |
| Open-domain | 5.2% | 4.2% | Commonsense or world knowledge grounded in the dialogue. |
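To make the exact-match column concrete, here is a minimal sketch of one way such a check could work. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) and the function names are assumptions for illustration. The page does not spell out the normalization AgentBay's harness applies, so treat this as an approximation and not as the benchmark code.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def gold_in_memories(gold: str, memories: list[str]) -> bool:
    """True if the normalized gold answer appears inside any normalized memory."""
    target = normalize(gold)
    return bool(target) and any(target in normalize(m) for m in memories)


def exact_match_rate(items: list[tuple[str, list[str]]]) -> float:
    """Percentage of (gold, recalled_memories) pairs whose gold text was recalled."""
    if not items:
        return 0.0
    hits = sum(gold_in_memories(gold, memories) for gold, memories in items)
    return 100.0 * hits / len(items)
```

Because this is a plain substring test, a short gold answer can match inside a longer unrelated word. A stricter variant would compare whole tokens.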
Methodology
Memories written with: AgentBay store() without embeddings
Memories queried with: AgentBay recall() in hybrid search mode with the vector component disabled (keyword and tag retrieval only)
Judge model: gpt-4o-mini
Total run cost: 0.210 credits
Dataset size: 10 conversations and 1,986 questions
Reproducibility
This page is backed by the raw benchmark artifact in public/benchmarks/locomo-results.json.
Commit SHA: 8a27dabb55d1929bc8d95f1d44d910a7d2eb9971
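For readers who want to inspect the artifact themselves, the sketch below loads the file and lists its top-level keys. The artifact's schema is not documented on this page, so the code only assumes that the top level is a JSON object and invents no field names. The helper names are hypothetical.

```python
import json
from pathlib import Path


def load_artifact(path: str) -> dict:
    """Read a benchmark artifact and return the parsed JSON object."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        # Assumption: the artifact is a JSON object at the top level.
        raise ValueError("expected a JSON object at the top level")
    return data


def top_level_summary(data: dict) -> dict[str, str]:
    """Map each top-level key to the type name of its value (schema-agnostic)."""
    return {key: type(value).__name__ for key, value in data.items()}


if __name__ == "__main__":
    artifact = Path("public/benchmarks/locomo-results.json")
    if artifact.exists():
        for key, kind in top_level_summary(load_artifact(str(artifact))).items():
            print(f"{key}: {kind}")
    else:
        print(f"{artifact} not found; run this from the repository root")
```

Checking out the commit listed above before running it keeps the artifact and the code that produced it in sync.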
Run Notes
Vector search was disabled because VOYAGE_API_KEY was not set. This run used keyword and tag retrieval only.
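The fallback described in this note can be pictured as a simple environment check. The sketch below is illustrative only: `retrieval_modes` is a hypothetical helper, not part of AgentBay, and the only detail taken from this page is that a missing `VOYAGE_API_KEY` leaves keyword and tag retrieval as the active signals.

```python
import os
from typing import Mapping


def retrieval_modes(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the retrieval signals to use; vector search needs an embeddings key."""
    modes = ["keyword", "tag"]
    # Assumption: an empty or whitespace-only key counts as "not set".
    if env.get("VOYAGE_API_KEY", "").strip():
        modes.append("vector")
    return modes
```

Passing the environment in as a mapping makes the behavior easy to check without touching the real process environment, for example `retrieval_modes({})` versus `retrieval_modes({"VOYAGE_API_KEY": "..."})`.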
Want the internal retrieval benchmark too? The hub keeps the synthetic precision and recall test alongside LOCOMO.