AgentBay benchmark hub
We publish the buyer-facing benchmark and the internal search benchmark side by side so the numbers are inspectable.
Public
LOCOMO
LOCOMO is the public benchmark for long-context conversational memory. It is the benchmark that buyers familiar with agent memory already know.
Judge accuracy 2.8%
Exact match 3.2%
Latency p95 134.0ms
Internal
Memory search
This is the internal synthetic benchmark we use to track precision, recall, and latency across labeled memory-search scenarios.
Precision@5 81.8%
Recall@5 81.8%
Latency p95 10.0ms
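For readers less familiar with the metric names, Precision@5 and Recall@5 can be sketched as below. This is a minimal illustration using the standard textbook definitions, not the actual AgentBay scoring code; the function names, the memory IDs, and the assumption that result IDs are unique within a ranking are all hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k returned results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for memory_id in top_k if memory_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the labeled relevant results that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0  # convention chosen for this sketch: no labels means no recall
    top_k = ranked_ids[:k]
    hits = sum(1 for memory_id in top_k if memory_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical query: 5 results returned, 5 entries labeled relevant, 3 overlap.
ranked = ["m1", "m7", "m3", "m9", "m4"]
relevant = {"m1", "m2", "m3", "m4", "m8"}

p_at_5 = precision_at_k(ranked, relevant, k=5)
r_at_5 = recall_at_k(ranked, relevant, k=5)
print(p_at_5, r_at_5)  # 0.6 0.6
```

When a query has exactly k labeled relevant entries, the two metrics share a denominator and coincide. That would be consistent with the identical Precision@5 and Recall@5 figures reported here, but the page does not state the label count per query, so treat it as an inference.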
Memory search
Synthetic precision and recall benchmark across labeled search scenarios.
Precision@5 81.8%
Recall@5 81.8%
Latency p50 5.3ms
Latency p95 10.0ms
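The p50 and p95 latency figures are percentiles over per-query timings. A minimal sketch of one common way to compute them (the nearest-rank method) follows. The page does not say which percentile method the benchmark uses, so the method, the `percentile` helper, and the sample timings are all assumptions for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical per-query latencies in milliseconds (not the benchmark's data).
latencies_ms = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)  # 7.0 12.0
```

Interpolating methods such as the ones in `statistics.quantiles` give slightly different values on small samples, which matters when only a few dozen queries are timed.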
| Scenario | Precision@5 | Recall@5 | Latency |
|---|---|---|---|
| exact lookup | 100.0% | 100.0% | 4.5ms |
| multi word | 100.0% | 100.0% | 8.2ms |
| error lookup | 100.0% | 100.0% | 4.8ms |
| question | 84.0% | 84.0% | 8.4ms |
| synonym | 80.0% | 80.0% | 5.3ms |
| misspelled | 80.0% | 80.0% | 4.6ms |
| conceptual | 72.0% | 72.0% | 9.0ms |
| code identifier | 60.0% | 60.0% | 2.6ms |
| broad | 60.0% | 60.0% | 3.3ms |
Dataset size: 1,000 entries with 45 labeled queries.
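The headline 81.8% can be checked against the per-scenario rows: it matches the unweighted mean of the nine scenario scores. The sketch below reproduces that arithmetic from the table values. Whether the benchmark itself averages per scenario or per query is not stated; the two agree if the 45 queries are split evenly at five per scenario, which is an assumption.

```python
# Precision@5 per scenario, copied from the table above (Recall@5 is identical).
scenario_precision = {
    "exact lookup": 100.0,
    "multi word": 100.0,
    "error lookup": 100.0,
    "question": 84.0,
    "synonym": 80.0,
    "misspelled": 80.0,
    "conceptual": 72.0,
    "code identifier": 60.0,
    "broad": 60.0,
}

# Unweighted (macro) average: each scenario counts equally.
macro_precision = sum(scenario_precision.values()) / len(scenario_precision)
print(round(macro_precision, 1))  # 81.8
```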