AgentBay Memory Search Benchmarks
We believe in showing real numbers, not marketing claims. These benchmarks run against 10,000+ realistic memory entries with 45 labeled queries across 9 search scenarios.
AgentBay automatically expands search queries with synonyms and identifier splits. Here is the measured impact vs baseline (no expansion).
Biggest gains: synonym queries (+20.0% recall) and conceptual queries (+20.0% recall).
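As an illustration of what heuristic expansion can look like, here is a minimal sketch using a hand-rolled synonym map plus camelCase/snake_case identifier splitting. The names and the synonym table are assumptions for illustration, not AgentBay's actual internals:

```typescript
// Illustrative synonym map (not AgentBay's real table).
const SYNONYMS: Record<string, string[]> = {
  auth: ["authentication", "authorization"],
  db: ["database"],
  perf: ["performance"],
};

// Split code identifiers: "getUserById" -> ["get", "user", "by", "id"],
// "user_id" -> ["user", "id"].
function splitIdentifier(term: string): string[] {
  return term
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .split(/[\s_.\-]+/)
    .map((t) => t.toLowerCase())
    .filter(Boolean);
}

// Expand a query into the original terms plus synonyms and identifier parts.
function expandQuery(query: string): string[] {
  const expanded = new Set<string>();
  for (const raw of query.split(/\s+/).filter(Boolean)) {
    const term = raw.toLowerCase();
    expanded.add(term);
    for (const syn of SYNONYMS[term] ?? []) expanded.add(syn);
    const parts = splitIdentifier(raw); // split before lowercasing
    if (parts.length > 1) parts.forEach((p) => expanded.add(p));
  }
  return [...expanded];
}
```

A heuristic like this runs in microseconds with no LLM call, which is one plausible reason the expanded scenarios above stay in the 30-55ms range.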
The table below shows results with query expansion enabled, 5 queries per scenario.
| Scenario | Precision@5 | Recall@5 | Latency | Description |
|---|---|---|---|---|
| Exact Lookup | 100.0% | 100.0% | 27.4ms | Direct name/term search |
| Error Lookup | 100.0% | 100.0% | 30.4ms | Error codes and stack traces |
| Multi-Word | 96.0% | 96.0% | 52.8ms | Specific multi-term queries |
| Synonym | 80.0% | 80.0% | 33.4ms | Abbreviated terms (auth, db, perf) |
| Misspelled | 80.0% | 80.0% | 30.2ms | Typos in query terms |
| Question | 80.0% | 80.0% | 53.0ms | Natural language questions |
| Conceptual | 72.0% | 72.0% | 55.3ms | Abstract questions about patterns |
| Code Identifier | 72.0% | 72.0% | 17.4ms | Function names, file paths |
| Broad/Vague | 60.0% | 60.0% | 21.3ms | Single-word vague queries |
Latencies above are local search only (tag + FTS strategies, no network calls). Production adds ~2-5ms for database queries, plus ~100-200ms if vector search or reranking is enabled.
Honest comparison with published competitor data. We only include numbers we can verify or that competitors have published themselves.
| Feature | AgentBay | Mem0 | Zep |
|---|---|---|---|
| Search Strategy | 4-strategy RRF fusion | Vector-only | Graph-only |
| Query Expansion | Yes (heuristic) | Yes (LLM-based) | No |
| Cross-Encoder Reranking | Yes (optional) | No | No |
| Entity Extraction | Yes (auto, 9 types) | Yes (graph memory) | Yes (knowledge graph) |
| Confidence Decay | Yes (4-tier half-life) | No | No |
| Poison Detection | Yes (20+ patterns) | No | No |
| Local Mode | Yes (SQLite + FastEmbed) | Partial | No |
| Published Benchmarks | This page | LOCOMO benchmark | None |
Note: Mem0 reports 26% higher accuracy than OpenAI Memory on the LOCOMO benchmark (their own testing). We have not yet run LOCOMO against AgentBay for a direct comparison. When we do, we will publish the results here regardless of outcome.
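The "4-strategy RRF fusion" row refers to Reciprocal Rank Fusion, a standard way to merge ranked lists from independent search strategies. A rough sketch (not AgentBay's actual implementation; the function name and the conventional k=60 constant are assumptions):

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank + 1)
// to an item's score, so items ranked highly by several strategies
// accumulate the largest fused scores. k dampens the impact of top ranks.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

The appeal of RRF is that it needs no score normalization: each strategy (tag, FTS, vector, alias) only has to produce an ordering, not comparable scores.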
Dataset: 10,000 synthetic memory entries generated from realistic templates covering 8 knowledge types (PATTERN, PITFALL, ARCHITECTURE, DEPENDENCY, PERFORMANCE, DECISION, CONTEXT, TEST_INSIGHT). Content includes real code patterns, stack traces, deployment configs, and architecture decisions.
Queries: 45 hand-crafted queries across 9 scenarios (exact lookup, conceptual, multi-word, synonym, misspelled, question, code identifier, error lookup, broad). Each query has labeled ground-truth: expected entry types, tags, and content keywords.
Metrics: Precision@k measures what fraction of returned results are relevant. Recall@k measures what fraction of all relevant results were found. Latency is wall-clock time for the search operation (no network, no database — pure algorithm speed).
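As a concrete sketch of these two metrics (the function names and the ID-set shape of the ground truth are illustrative, not the benchmark harness's actual API):

```typescript
// Precision@k: fraction of the top-k returned results that are relevant.
function precisionAtK(results: string[], relevant: Set<string>, k: number): number {
  const topK = results.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.has(id)).length;
  return hits / topK.length;
}

// Recall@k: fraction of all relevant results found in the top k.
function recallAtK(results: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = results.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}
```

Note that when a query has exactly k relevant entries, precision@k and recall@k share the same denominator and coincide.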
Limitations: This benchmark tests tag + FTS strategies only (no vector search, no alias matching, no database). Production performance will differ. Vector search adds ~100-200ms latency but significantly improves conceptual and semantic queries. Cross-encoder reranking is not measured here (requires Voyage AI API).
Reproducibility: Run the benchmarks yourself: `npx tsx scripts/benchmark/run-benchmarks.ts --count 10000`