Public Benchmark

AgentBay on LOCOMO

LOCOMO checks whether a memory system can recover the right facts from long, multi-session conversations.

Single-hop: 2.1% (exact match 1.7%)
Multi-hop: 3.9% (exact match 3.2%)
Temporal reasoning: 2.8% (exact match 7.2%)
Open-domain: 5.2% (exact match 4.2%)
Latency p50: 76.0 ms (p95 134.0 ms)
Run cost: 0.210 credits (judge 0.210 credits, embeddings 0.000 credits)

Single-hop is our weakest category in this run. We keep that visible because memory buyers need honest numbers, not a cleaned-up headline.
Scorecard

Judge accuracy is the headline metric. Exact match shows how often the gold answer text appeared in the recalled memories after normalization.

| Reasoning type | Judge accuracy | Exact match | What it checks |
| --- | --- | --- | --- |
| Single-hop | 2.1% | 1.7% | One fact from one moment in the conversation. |
| Multi-hop | 3.9% | 3.2% | Several linked facts that need to be combined. |
| Temporal reasoning | 2.8% | 7.2% | Time and ordering across sessions. |
| Open-domain | 5.2% | 4.2% | Commonsense or world knowledge grounded in the dialogue. |
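
The exact match column can be illustrated with a minimal sketch. The page does not spell out the normalization rules, so the ones below (lowercasing, stripping punctuation and English articles, collapsing whitespace) are an assumption borrowed from common QA evaluation practice, and the function names are hypothetical rather than part of AgentBay.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization; the benchmark's actual rules may differ.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold_answer: str, recalled_memories: list[str]) -> bool:
    """True if the normalized gold answer appears in any normalized memory."""
    gold = normalize(gold_answer)
    if not gold:
        return False
    return any(gold in normalize(memory) for memory in recalled_memories)


def exact_match_rate(examples: list[tuple[str, list[str]]]) -> float:
    """Share of questions (0.0 to 1.0) whose gold answer text was recalled."""
    if not examples:
        return 0.0
    hits = sum(exact_match(gold, memories) for gold, memories in examples)
    return hits / len(examples)
```

Because this check only tests whether the answer text is present in the recalled memories, it can come out higher or lower than judge accuracy for the same category. The substring test is deliberately simple and would also match inside longer words.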
Methodology
Memories written with: AgentBay store() without embeddings
Memories queried with: AgentBay recall() in hybrid search mode with the vector component disabled (keyword and tag retrieval only)
Judge model: gpt-4o-mini
Total run cost: 0.210 credits
Dataset size: 10 conversations and 1986 questions
Reproducibility

This page is backed by the raw benchmark artifact in public/benchmarks/locomo-results.json.

Commit SHA: 8a27dabb55d1929bc8d95f1d44d910a7d2eb9971
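
To inspect the raw artifact, a sketch like the following may help. The file path comes from this page, but the JSON schema is not documented here, so the code assumes nothing about field names and only lists the top-level keys and their types. Both function names are hypothetical.

```python
import json
from pathlib import Path


def load_results(path="public/benchmarks/locomo-results.json"):
    """Load the benchmark artifact from a checkout of the repository."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize_top_level(results):
    """Map each top-level key to the type name of its value.

    No schema is assumed; a non-object root is reported under "<root>".
    """
    if isinstance(results, dict):
        return {key: type(value).__name__ for key, value in results.items()}
    return {"<root>": type(results).__name__}
```

Running `summarize_top_level(load_results())` from the repository root at the commit above is one way to see what the artifact contains before comparing it with the scorecard.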

Run Notes

Vector search was disabled because VOYAGE_API_KEY was not set. This run used keyword and tag retrieval only.
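
The fallback described above can be sketched as follows. This is an illustration only, not AgentBay's actual code: the function name and the mode labels are hypothetical, and the only detail taken from this page is that a missing VOYAGE_API_KEY leaves keyword and tag retrieval as the active modes.

```python
import os


def select_retrieval_modes(env=None) -> list[str]:
    """Return the retrieval modes for a run (hypothetical helper).

    Keyword and tag retrieval are always on; vector search is added only
    when VOYAGE_API_KEY is set to a non-empty value.
    """
    env = os.environ if env is None else env
    modes = ["keyword", "tag"]
    if env.get("VOYAGE_API_KEY"):
        modes.append("vector")
    return modes
```

Passing a plain dict as `env` makes the behavior easy to check without touching the real environment, for example `select_retrieval_modes({})` for the configuration used in this run.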

Want the internal retrieval benchmark too? The hub keeps the synthetic precision and recall test alongside LOCOMO.
