Transparent Benchmarks

AgentBay Memory Search Benchmarks

We believe in showing real numbers, not marketing claims. These benchmarks run against 10,000+ realistic memory entries with 45 labeled queries across 9 search scenarios.

- Precision@5: 82.2% (with query expansion)
- Recall@5: 82.2% (with query expansion)
- Latency p50: 34.1ms (median query time)
- Throughput: 24 qps (queries per second)
Query Expansion Impact

AgentBay automatically expands search queries with synonyms and identifier splits. Here is the measured impact vs baseline (no expansion).

- Precision@3: +4.4%
- Precision@5: +4.4%
- Recall@5: +4.4%

Biggest gains: synonym queries (+20.0% recall) and conceptual queries (+20.0% recall).
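Heuristic expansion of this kind (synonym lookup plus identifier splitting, no LLM calls) can be sketched in a few lines. The synonym table and function names below are illustrative, not AgentBay's actual implementation:

```typescript
// Illustrative synonym table; AgentBay's real table is larger.
const SYNONYMS: Record<string, string[]> = {
  auth: ["authentication", "authorization"],
  db: ["database"],
  perf: ["performance"],
};

// Split code identifiers: "getUserById" -> ["get", "user", "by", "id"],
// "user_id" -> ["user", "id"].
function splitIdentifier(term: string): string[] {
  return term
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .split(/[_\s]+/)
    .map((t) => t.toLowerCase())
    .filter((t) => t.length > 0);
}

// Expand a query into its original terms plus synonyms and identifier parts.
function expandQuery(query: string): string[] {
  const expanded = new Set<string>();
  for (const term of query.split(/\s+/).filter(Boolean)) {
    const lower = term.toLowerCase();
    expanded.add(lower);
    for (const syn of SYNONYMS[lower] ?? []) expanded.add(syn);
    for (const part of splitIdentifier(term)) expanded.add(part);
  }
  return [...expanded];
}
```

A query like `auth getUserById` would thus also match entries mentioning "authentication" or the word "user", which is where the synonym and code-identifier scenarios pick up recall.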

Performance by Search Scenario

With query expansion enabled. 5 queries per scenario.

| Scenario | Precision@5 | Recall@5 | Latency | Description |
| --- | --- | --- | --- | --- |
| Exact Lookup | 100.0% | 100.0% | 27.4ms | Direct name/term search |
| Error Lookup | 100.0% | 100.0% | 30.4ms | Error codes and stack traces |
| Multi-Word | 96.0% | 96.0% | 52.8ms | Specific multi-term queries |
| Synonym | 80.0% | 80.0% | 33.4ms | Abbreviated terms (auth, db, perf) |
| Misspelled | 80.0% | 80.0% | 30.2ms | Typos in query terms |
| Question | 80.0% | 80.0% | 53.0ms | Natural language questions |
| Conceptual | 72.0% | 72.0% | 55.3ms | Abstract questions about patterns |
| Code Identifier | 72.0% | 72.0% | 17.4ms | Function names, file paths |
| Broad/Vague | 60.0% | 60.0% | 21.3ms | Single-word vague queries |
Latency Distribution

Local search latency (tag + FTS strategies, no network calls). Production latency adds roughly 2-5ms for database queries, plus roughly 100-200ms if vector search or reranking is enabled.

- p50 (median): 34.1ms
- p95: 61.9ms
- p99: 85.1ms
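Percentiles like these can be read from raw latency samples with the nearest-rank method. The sketch below is illustrative; the sample data is made up, not the benchmark's actual measurements:

```typescript
// Nearest-rank percentile: sort the samples, then take the value at
// rank ceil(p/100 * n), 1-indexed.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical per-query latencies in milliseconds.
const latencies = [12, 18, 25, 27, 30, 34, 41, 52, 62, 85];
const p50 = percentile(latencies, 50); // median of the sample
const p99 = percentile(latencies, 99); // dominated by the slowest queries
```

The gap between p50 and p99 in the numbers above comes from the slower scenarios (conceptual, question) sitting in the tail.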
How We Compare

Honest comparison with published competitor data. We only include numbers we can verify or that competitors have published themselves.

| Feature | AgentBay | Mem0 | Zep |
| --- | --- | --- | --- |
| Search Strategy | 4-strategy RRF fusion | Vector-only | Graph-only |
| Query Expansion | Yes (heuristic) | Yes (LLM-based) | No |
| Cross-Encoder Reranking | Yes (optional) | No | No |
| Entity Extraction | Yes (auto, 9 types) | Yes (graph memory) | Yes (knowledge graph) |
| Confidence Decay | Yes (4-tier half-life) | No | No |
| Poison Detection | Yes (20+ patterns) | No | No |
| Local Mode | Yes (SQLite + FastEmbed) | Partial | No |
| Published Benchmarks | This page | LOCOMO benchmark | None |

Note: Mem0 reports 26% higher accuracy than OpenAI Memory on the LOCOMO benchmark (their own testing). We have not yet run LOCOMO against AgentBay for a direct comparison. When we do, we will publish the results here regardless of outcome.
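For readers unfamiliar with the fusion technique named in the table, here is a minimal sketch of reciprocal rank fusion (RRF), which merges rankings from independent search strategies. The strategy names and the conventional k = 60 constant are illustrative, not AgentBay's exact configuration:

```typescript
type Ranking = string[]; // result IDs in rank order, best first

// RRF score for a document: sum over strategies of 1 / (k + rank),
// with rank starting at 1. Documents ranked well by several strategies
// accumulate the highest scores.
function rrfFuse(rankings: Ranking[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: four strategies each return their own ranking.
const fused = rrfFuse([
  ["m1", "m2", "m3"], // tag match
  ["m2", "m1"],       // full-text search
  ["m2", "m4"],       // vector similarity
  ["m3", "m2"],       // alias match
]);
// fused[0] === "m2": it appears in all four rankings
```

Because RRF only needs ranks, not comparable scores, it can combine strategies with incompatible scoring scales (tag counts, FTS ranks, cosine similarities) without normalization.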

Methodology

Dataset: 10,000 synthetic memory entries generated from realistic templates covering 8 knowledge types (PATTERN, PITFALL, ARCHITECTURE, DEPENDENCY, PERFORMANCE, DECISION, CONTEXT, TEST_INSIGHT). Content includes real code patterns, stack traces, deployment configs, and architecture decisions.

Queries: 45 hand-crafted queries across 9 scenarios (exact lookup, conceptual, multi-word, synonym, misspelled, question, code identifier, error lookup, broad). Each query has labeled ground-truth: expected entry types, tags, and content keywords.

Metrics: Precision@k measures what fraction of the returned results are relevant. Recall@k measures what fraction of all relevant results were found. Latency is wall-clock time for the search operation alone (no network, no database; pure algorithm speed).
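The two metrics follow directly from those definitions; a minimal sketch (function names hypothetical):

```typescript
// Fraction of the top-k returned results that are relevant.
function precisionAtK(returned: string[], relevant: Set<string>, k: number): number {
  const topK = returned.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length ? hits / topK.length : 0;
}

// Fraction of all relevant results that appear in the top k.
function recallAtK(returned: string[], relevant: Set<string>, k: number): number {
  const hits = returned.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size ? hits / relevant.size : 0;
}

// Example: 4 of the top 5 results are relevant, out of 5 relevant entries total.
const relevant = new Set(["a", "b", "c", "d", "e"]);
const returned = ["a", "b", "x", "c", "d", "e"];
// precisionAtK(returned, relevant, 5) -> 0.8
// recallAtK(returned, relevant, 5) -> 0.8
```

Note that when the number of labeled relevant entries equals k, as in this example, Precision@k and Recall@k coincide, which is why the headline numbers above are both 82.2%.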

Limitations: This benchmark tests tag + FTS strategies only (no vector search, no alias matching, no database). Production performance will differ. Vector search adds ~100-200ms latency but significantly improves conceptual and semantic queries. Cross-encoder reranking is not measured here (requires Voyage AI API).

Reproducibility: Run the benchmarks yourself: `npx tsx scripts/benchmark/run-benchmarks.ts --count 10000`