Engineering

91 % of the accuracy at 1 % of the tokens — the Pareto position for AI memory

Alexander Bering
April 27, 2026 · 6 min read

There is one sentence from the v6 ZenBrain paper I know by heart:

ZenBrain reaches 91.3 % of long-context-oracle accuracy at 1/106ᵗʰ the per-query token budget.

That is the sentence that decides whether building a memory architecture is worth it at all — or whether we should just dump the entire conversation history into the context window every time and hope the LLM does the right thing.

Spoiler: it is worth it.

What a "long-context oracle" baseline actually is

LongMemEval-S Full-500 is a memory benchmark with 500 questions, each one bound to its own conversation context — about 494 conversation turns per question, averaging 105,577 tokens (median 105,744, max 107,740). It spans six categories: multi-session, temporal-reasoning, and knowledge-update questions, plus three single-session variants.

A "long-context oracle" baseline works like this: we take the entire conversation archive for each question — all 105,600 tokens — and stuff it into the context window of gpt-4o-mini. No memory system in between, no retrieval filter, no forgetting, no consolidation. The model reads the full history every time and answers from it.

That's the theoretical upper bound. You cannot get more information than that — it is all in context.
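
That baseline is deliberately simple. Here is a minimal sketch of what it looks like, assuming the official OpenAI Python SDK; the prompt wording and the `haystack`/`question` variables are illustrative stand-ins, not the paper's exact evaluation harness:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def oracle_answer(haystack: str, question: str) -> str:
    """Long-context oracle: the full ~105,600-token history goes straight
    into the prompt. No retrieval, no memory system in between."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the conversation history below."},
            {"role": "user",
             "content": f"Conversation history:\n{haystack}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```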

What ZenBrain does instead

ZenBrain receives the same 494 turns once at ingest, routes them through the 7-layer architecture (Working, Short-Term, Episodic, Semantic, Procedural, Core, Cross-Context), writes a knowledge graph with two-factor synaptic edges, runs sleep consolidation in idle time, and forgets along an Ebbinghaus curve. At query time we return the 5 most relevant memories (k = 5) — about 1,000 tokens including answer-prompt overhead.

That is 1/106 of what the oracle gets.
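
For orientation, the query path boils down to "embed, score, take the top 5". The sketch below is schematic rather than ZenBrain's actual code: the `Memory` record, the flat memory list, and the plain cosine scoring are stand-ins; the layer routing and synaptic boosts that make the real retrieval competitive are described further down.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_embedding: list[float], memories: list[Memory], k: int = 5) -> list[Memory]:
    """Rank stored memories against the query and keep the k best.
    Five short memories plus answer-prompt overhead lands at roughly
    1,000 tokens, versus ~105,600 for the full-haystack oracle."""
    ranked = sorted(memories,
                    key=lambda m: cosine(query_embedding, m.embedding),
                    reverse=True)
    return ranked[:k]
```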

The result (Table 8 in the paper)

| System | Accuracy (95 % CI) | Token budget per query |
|---|---|---|
| Long-Context Oracle (gpt-4o-mini, full haystack) | 52.2 % | ~105,600 |
| ZenBrain (k = 5, 7-layer memory) | 47.7 % (47.4 – 47.8) | ~1,000 |
| Letta (k = 5) | 42.8 % | ~1,000 |
| A-Mem (k = 5) | 35.4 % | ~1,000 |
| Mem0 (k = 5) | 31.8 % | ~1,000 |

ZenBrain reaches 47.7 / 52.2 = 91.3 % of oracle accuracy. The oracle wins by 4.5 percentage points. It burns 106× more tokens per query for that.

Among the k = 5 memory systems — all on the same token budget — ZenBrain dominates:

  • +4.9 pp over Letta
  • +12.3 pp over A-Mem
  • +15.9 pp over Mem0

These three gaps are statistically rock-solid (Bonferroni-corrected p ≤ 6.2 × 10⁻³¹ across three independent LLM judges: Sonnet 4.5, Opus 4.6, GPT-4o; effect sizes d ∈ [0.18, 0.52]). It isn't seed-luck: ZenBrain wins 12 of 12 head-to-head comparisons across all judges.
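
If you want to sanity-check numbers like these yourself, the mechanics are simple: per judge, compare two systems' per-question correctness vectors with a paired test on the questions where they disagree, then Bonferroni-correct across all comparisons. Below is a standard-library sketch assuming binary correctness vectors; the paper's exact test statistic and judge-aggregation protocol may differ.

```python
from math import comb

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Exact McNemar test on paired per-question correctness.
    Only questions where the two systems disagree carry information."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # A right, B wrong
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # B right, A wrong
    n = b + c
    k = min(b, c)
    # two-sided exact binomial p-value under H0: discordant pairs split 50/50
    p = 2 * sum(comb(n, i) * 0.5 ** n for i in range(k + 1))
    return min(p, 1.0)

def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]
```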

What this means in practice

The discussion "do I even need a memory system, or can I just dump the whole archive into the context window?" has a quantitative answer.

If you are willing to pay 106× more tokens per query, you gain 4.5 percentage points of accuracy. At one million queries, a typical pricing of ~0.15 €/1M input tokens for gpt-4o-mini, and a haystack of 105,600 tokens per query:

  • Long-context oracle cost: 1,000,000 × 105,600 × 0.15 / 10⁶ = €15,840
  • ZenBrain cost (k = 5, ~1,000 tokens): 1,000,000 × 1,000 × 0.15 / 10⁶ = €150

A €15,690 difference for 4.5 percentage points. Per million queries.
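
The same arithmetic as a tiny helper, in case you want to plug in your own query volume, haystack size, or price per million tokens:

```python
def input_cost_eur(queries: int, tokens_per_query: int,
                   eur_per_million_tokens: float = 0.15) -> float:
    """Input-token cost: queries × tokens per query × price per token."""
    return queries * tokens_per_query * eur_per_million_tokens / 1_000_000

oracle   = input_cost_eur(1_000_000, 105_600)  # 15840.0
zenbrain = input_cost_eur(1_000_000, 1_000)    # 150.0
print(f"Difference per million queries: €{oracle - zenbrain:,.0f}")  # €15,690
```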

And that ignores the latency cost (a 105 K-token prompt has a time-to-first-token of 5–15 seconds in practice; a 1 K-token prompt stays under 500 ms), the storage cost when caching (ZenBrain consolidates to roughly 48 % fewer tokens after a sleep cycle than naïve storage), and the fact that the oracle has no memory concept at all — it cannot learn across sessions, has no calibrated confidence, no forgetting, no GDPR-compliant deletion, no reconsolidation.

The oracle is a brute-force tool. ZenBrain is an architecture.

Where the Pareto position comes from

Four mechanisms carry the 91.3 % mark together:

  1. Layer routing. A query with temporal markers gets boosted by the episodic layer (w = 2.0); a procedural query goes to the procedural layer; an identity query hits Core Memory directly. On LoCoMo, layer routing alone delivers +181 % on temporal queries versus flat-dense storage.
  2. Two-Factor Synaptic Consolidation. Edges in the knowledge graph carry not just a weight w_ij but also a consolidation variance σ²_ij. Mature edges (σ² → 0) resist both decay and catastrophic overwriting — mathematically equivalent to Elastic Weight Consolidation. The importance I_ij = 1/σ²_ij is reused as a retrieval boost.
  3. Sleep consolidation. At idle, the simulation-selection loop runs: the system generates replay candidates from real episodes plus counterfactuals, scores them with TAG = 0.4·|δ_TD| + 0.35·R + 0.25·N, strengthens top candidates via LTP, and prunes weak ones via LTD (see the sketch after this list). Result: +37 % stability, −47.4 % storage.
  4. Predictive Memory Architecture (PMA). Reconsolidation opens a 10-minute window after retrieval where memories transition into one of four update modes (confirmed / selective_edit / integration / new_episode) depending on prediction error. TripleCopyMemory stores each event in three copies with divergent decay constants — the deep copy with logarithmic growth dominates after 7+ days and retains 91.2 % strength at 30 days versus near-zero for plain Ebbinghaus.
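
To make mechanism 3 concrete, here is a minimal sketch of one simulation-selection pass. Only the TAG weights (0.4 / 0.35 / 0.25) come from the paper; the LTP/LTD multipliers, the keep fraction, and the pruning floor are illustrative assumptions.

```python
def tag_score(td_error: float, reward: float, novelty: float) -> float:
    """TAG priority from the paper: 0.4·|δ_TD| + 0.35·R + 0.25·N."""
    return 0.4 * abs(td_error) + 0.35 * reward + 0.25 * novelty

def sleep_pass(candidates, keep_fraction=0.3, ltp=1.2, ltd=0.8, prune_floor=0.05):
    """One simulation-selection pass over replay candidates.
    `candidates` are (edge, td_error, reward, novelty) tuples, where `edge`
    is a dict carrying a synaptic weight under key "w". Top-ranked candidates
    are strengthened (LTP), the rest weakened (LTD), and edges that fall
    below the floor are pruned. All hyperparameters here are illustrative."""
    ranked = sorted(candidates,
                    key=lambda c: tag_score(c[1], c[2], c[3]),
                    reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    for edge, *_ in ranked[:cutoff]:
        edge["w"] *= ltp          # long-term potentiation: strengthen
    for edge, *_ in ranked[cutoff:]:
        edge["w"] *= ltd          # long-term depression: weaken
    # pruning: edges whose weight drops below the floor leave the graph
    return [edge for edge, *_ in ranked if edge["w"] >= prune_floor]
```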

None of these mechanisms produces the Pareto position on its own. The 15-algorithm ablation in the paper shows why: under moderate conditions most algorithms are cooperatively redundant — sleep is the only individually significant contribution. Under stress (60 days, decay = 0.25/day), 9 of 15 algorithms become individually critical (ΔQ down to −93.7 %). That isn't a bug; it's the design: the algorithms form a cooperative survival network.

The honest framing

We do not claim 91.3 % is universally enough. For safety-critical applications (diagnostics, legal, financial transactions) the missing 4.5 pp may be too expensive — long-context is the right answer there, and the 106× token premium is acceptable.

For the vast majority of knowledge-work applications — personal assistants, CRM, research agents, coding copilots, learning systems — the Pareto position at k = 5 is the economically correct choice. That is precisely the regime memory-based systems were invented for.

And among memory systems, ZenBrain is not narrowly ahead on LongMemEval-S Full-500. It leads by 4.9 / 12.3 / 15.9 percentage points over the next three competitors at the same token budget.

More memory is not the answer. Better memory is.

Reproducibility

All results come from docs/papers/results/g5-full500-nomic/ in the repo. The pipeline is deterministic — three retrieval seeds (42, 123, 456) for ZenBrain produce bit-identical aggregates. The reproduction command is in Appendix F.4 of the paper.