Public Benchmark

AgentBay on LOCOMO

LOCOMO checks whether a memory system can recover the right facts from long, multi-session conversations.

Single-hop

2.1%

Exact match 1.7%

Multi-hop

3.9%

Exact match 3.2%

Temporal reasoning

2.8%

Exact match 7.2%

Open-domain

5.2%

Exact match 4.2%

Latency p50

76.0ms

p95 134.0ms

Run Cost

0.210 credits

Judge 0.210 credits · Embeddings 0.000 credits

Single-hop is our weakest category in this run, at 2.1% judge accuracy and 1.7% exact match. We keep that visible because memory buyers need honest numbers, not a cleaned-up headline.

Scorecard

Judge accuracy is the headline metric. Exact match shows how often the gold answer text appeared in the recalled memories after normalization.

Reasoning type	Judge accuracy	Exact match	What it checks
Single-hop	2.1%	1.7%	One fact from one moment in the conversation.
Multi-hop	3.9%	3.2%	Several linked facts that need to be combined.
Temporal reasoning	2.8%	7.2%	Time and ordering across sessions.
Open-domain	5.2%	4.2%	Commonsense or world knowledge grounded in the dialogue.
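
The exact match column can be illustrated with a short sketch. This is not the benchmark's scoring code: the normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) and the function names normalize, exact_match, and exact_match_rate are assumptions chosen for illustration. The published description only says the gold answer text is checked against the recalled memories "after normalization".

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization; the real benchmark may normalize differently.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold_answer: str, recalled_memories: list[str]) -> bool:
    """True if the normalized gold answer appears inside any normalized memory."""
    gold = normalize(gold_answer)
    if not gold:
        # An empty gold answer would trivially match everything, so reject it.
        return False
    return any(gold in normalize(memory) for memory in recalled_memories)


def exact_match_rate(examples) -> float:
    """Percentage of (gold_answer, recalled_memories) pairs that match."""
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(exact_match(gold, memories) for gold, memories in examples)
    return 100.0 * hits / len(examples)
```

Because this sketch uses a plain substring test, a short gold answer can match inside a longer unrelated word. A stricter variant could compare on token boundaries instead.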

Methodology

Memories written with

AgentBay store() without embeddings

Memories queried with

AgentBay recall() with hybrid search and no vector embeddings, so retrieval used keyword and tag matching only

Judge model

gpt-4o-mini

Total run cost

0.210 credits

Dataset size

10 conversations and 1986 questions

Reproducibility

This page is backed by the raw benchmark artifact in public/benchmarks/locomo-results.json.

Commit SHA: 8a27dabb55d1929bc8d95f1d44d910a7d2eb9971
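
To inspect the raw artifact, a minimal sketch like the one below may help. It assumes you have checked out the repository at the commit above and are running from its root. The artifact's schema is not documented on this page, so the helper summarize_artifact (a hypothetical name) only reports the top-level keys and their types rather than assuming any field names.

```python
import json
from pathlib import Path


def summarize_artifact(path):
    """Load a JSON file and describe its top-level shape without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Non-object roots (for example a list of records) are reported as one entry.
    return {"<root>": type(data).__name__}


if __name__ == "__main__":
    artifact = Path("public/benchmarks/locomo-results.json")
    if artifact.exists():
        for key, kind in summarize_artifact(artifact).items():
            print(f"{key}: {kind}")
    else:
        print(f"{artifact} not found; run this from the repository root")
```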

Run Notes

Vector search was disabled because VOYAGE_API_KEY was not set. This run used keyword and tag retrieval only.
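
Anyone rerunning the benchmark may want to confirm which retrieval mode will be used before starting. The sketch below is an illustration only, not AgentBay's code: the function retrieval_mode and its return labels are hypothetical, and the only detail taken from this page is that vector search depends on VOYAGE_API_KEY being set.

```python
import os


def retrieval_mode(env=None) -> str:
    """Report the retrieval mode implied by the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    # An unset or empty key is treated as "vector search unavailable".
    if env.get("VOYAGE_API_KEY"):
        return "keyword + tag + vector"
    return "keyword + tag only"


if __name__ == "__main__":
    print(f"Retrieval mode for this run: {retrieval_mode()}")
```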

Want the internal retrieval benchmark too? The hub keeps the synthetic precision and recall test alongside LOCOMO.