AgentBay Memory Search Benchmarks
We believe in showing real numbers, not marketing claims. These benchmarks run against 10,000+ realistic memory entries with 45 labeled queries across 9 search scenarios.
AgentBay automatically expands search queries with synonyms and identifier splits. Here is the measured impact vs baseline (no expansion).
Biggest gains: synonym queries (+20.0% recall) and conceptual queries (+20.0% recall).
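As an illustration of what heuristic expansion can look like, here is a minimal sketch using a hand-rolled synonym map plus camelCase/snake_case identifier splitting. The names and the synonym table are assumptions for illustration, not AgentBay's actual internals:

```typescript
// Illustrative synonym map (not AgentBay's real table).
const SYNONYMS: Record<string, string[]> = {
  auth: ["authentication", "authorization"],
  db: ["database"],
  perf: ["performance"],
};

// Split code identifiers: "getUserById" -> ["get", "user", "by", "id"],
// "user_id" -> ["user", "id"].
function splitIdentifier(term: string): string[] {
  return term
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .split(/[\s_.\-]+/)
    .map((t) => t.toLowerCase())
    .filter(Boolean);
}

// Expand a query into the original terms plus synonyms and identifier parts.
function expandQuery(query: string): string[] {
  const expanded = new Set<string>();
  for (const raw of query.split(/\s+/).filter(Boolean)) {
    const term = raw.toLowerCase();
    expanded.add(term);
    for (const syn of SYNONYMS[term] ?? []) expanded.add(syn);
    const parts = splitIdentifier(raw); // split before lowercasing
    if (parts.length > 1) parts.forEach((p) => expanded.add(p));
  }
  return [...expanded];
}
```

A heuristic like this runs in microseconds with no LLM call, which is one plausible reason the expanded scenarios above stay in the 30-55ms range.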
The table below shows results with query expansion enabled, 5 queries per scenario.
| Scenario | Precision@5 | Recall@5 | Latency | Description |
|---|---|---|---|---|
| Exact Lookup | 100.0% | 100.0% | 27.4ms | Direct name/term search |
| Error Lookup | 100.0% | 100.0% | 30.4ms | Error codes and stack traces |
| Multi-Word | 96.0% | 96.0% | 52.8ms | Specific multi-term queries |
| Synonym | 80.0% | 80.0% | 33.4ms | Abbreviated terms (auth, db, perf) |
| Misspelled | 80.0% | 80.0% | 30.2ms | Typos in query terms |
| Question | 80.0% | 80.0% | 53.0ms | Natural language questions |
| Conceptual | 72.0% | 72.0% | 55.3ms | Abstract questions about patterns |
| Code Identifier | 72.0% | 72.0% | 17.4ms | Function names, file paths |
| Broad/Vague | 60.0% | 60.0% | 21.3ms | Single-word vague queries |
Latencies above are local search only (tag + FTS strategies, no network calls). Production adds ~2-5ms for database queries, plus ~100-200ms if vector search or reranking is enabled.
Honest comparison with published competitor data. We only include numbers we can verify or that competitors have published themselves.
| Feature | AgentBay | Mem0 | Zep |
|---|---|---|---|
| Search Strategy | 4-strategy RRF fusion | Vector-only | Graph-only |
| Query Expansion | Yes (heuristic) | Yes (LLM-based) | No |
| Cross-Encoder Reranking | Yes (optional) | No | No |
| Entity Extraction | Yes (auto, 9 types) | Yes (graph memory) | Yes (knowledge graph) |
| Confidence Decay | Yes (4-tier half-life) | No | No |
| Poison Detection | Yes (20+ patterns) | No | No |
| Local Mode | Yes (SQLite + FastEmbed) | Partial | No |
| Published Benchmarks | This page | LOCOMO benchmark | None |
Note: Mem0 reports 26% higher accuracy than OpenAI Memory on the LOCOMO benchmark (their own testing). We have not yet run LOCOMO against AgentBay for a direct comparison. When we do, we will publish the results here regardless of outcome.
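The "4-strategy RRF fusion" row refers to Reciprocal Rank Fusion, a standard way to merge ranked lists from independent search strategies. A rough sketch (not AgentBay's actual implementation; the function name and the conventional k=60 constant are assumptions):

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank + 1)
// to an item's score, so items ranked highly by several strategies
// accumulate the largest fused scores. k dampens the impact of top ranks.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

The appeal of RRF is that it needs no score normalization: each strategy (tag, FTS, vector, alias) only has to produce an ordering, not comparable scores.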
Dataset: 10,000 synthetic memory entries generated from realistic templates covering 8 knowledge types (PATTERN, PITFALL, ARCHITECTURE, DEPENDENCY, PERFORMANCE, DECISION, CONTEXT, TEST_INSIGHT). Content includes real code patterns, stack traces, deployment configs, and architecture decisions.
Queries: 45 hand-crafted queries across 9 scenarios (exact lookup, conceptual, multi-word, synonym, misspelled, question, code identifier, error lookup, broad). Each query has labeled ground-truth: expected entry types, tags, and content keywords.
Metrics: Precision@k measures what fraction of returned results are relevant. Recall@k measures what fraction of all relevant results were found. Latency is wall-clock time for the search operation (no network, no database — pure algorithm speed).
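As a concrete sketch of these two metrics (the function names and the ID-set shape of the ground truth are illustrative, not the benchmark harness's actual API):

```typescript
// Precision@k: fraction of the top-k returned results that are relevant.
function precisionAtK(results: string[], relevant: Set<string>, k: number): number {
  const topK = results.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.has(id)).length;
  return hits / topK.length;
}

// Recall@k: fraction of all relevant results found in the top k.
function recallAtK(results: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = results.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}
```

Note that when a query has exactly k relevant entries, precision@k and recall@k share the same denominator and coincide.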
Limitations: This benchmark tests tag + FTS strategies only (no vector search, no alias matching, no database). Production performance will differ. Vector search adds ~100-200ms latency but significantly improves conceptual and semantic queries. Cross-encoder reranking is not measured here (requires Voyage AI API).
Reproducibility: Run the benchmarks yourself: `npx tsx scripts/benchmark/run-benchmarks.ts --count 10000`