How We Built A-RAG: When Retrieval Learns to Think Before Searching

Alexander Bering
March 31, 2026 · 6 min read

The Problem with Standard RAG

Retrieval-Augmented Generation (RAG) has become the default pattern for giving LLMs access to external knowledge. The standard pipeline is straightforward:

  1. Embed the user's query into a vector
  2. Search a vector store for similar chunks
  3. Stuff the top-k results into the prompt
  4. Generate a response
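The four-step pipeline above can be sketched in a few lines. This is a generic illustration, not any specific framework's API: `embed`, `search`, and `generate` are injected stand-ins for whatever embedding model, vector store, and LLM a given stack uses.

```typescript
// Sketch of the standard RAG pipeline with injected dependencies.
async function standardRag(
  query: string,
  embed: (text: string) => Promise<number[]>,
  search: (vec: number[], k: number) => Promise<string[]>,
  generate: (prompt: string) => Promise<string>,
  k = 5,
): Promise<string> {
  const vec = await embed(query);      // 1. embed the query
  const chunks = await search(vec, k); // 2. top-k similarity search
  // 3. stuff the retrieved chunks into the prompt
  const prompt = `Context:\n${chunks.join("\n---\n")}\n\nQuestion: ${query}`;
  return generate(prompt);             // 4. generate a response
}
```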

This works for simple lookups. "What is our return policy?" Embed, search, done.

But what about: "Compare our Q1 revenue trends with the strategy outlined in last month's board deck, and identify contradictions."

That query needs multiple retrieval steps. It needs different data sources. It needs different retrieval strategies. It needs a plan.

Standard RAG would embed the entire question, search the vector store, and return whatever chunks are most similar to the full query string. The results would be a random mix of revenue data and strategy fragments, none of which answer the actual question.

The Evolution: From RAG to CRAG to A-RAG

Before explaining A-RAG, it helps to understand the progression:

Standard RAG (2023): Embed query, search vectors, return top-k. No quality assessment. No fallback if results are poor.

Self-RAG (Asai et al., 2023): Adds a critique step. After retrieval, the system assesses whether the results are sufficient; if confidence is low, it can reformulate and retry. This was a major improvement, but it still relies on a single retrieval strategy.

CRAG (Corrective RAG, Yan et al., 2024): Adds a quality gate that classifies retrieval results as correct, ambiguous, or incorrect, and routes to different actions accordingly.

A-RAG (our approach): A meta-agent that reasons about the optimal retrieval strategy before executing any search. It classifies the query type, selects appropriate retrieval interfaces, generates a multi-step plan with dependencies, and executes with quality gates at each step.

The key difference: A-RAG doesn't just correct bad retrieval. It prevents it by choosing the right strategy upfront.

How A-RAG Works

Step 1: Query Classification (Zero LLM Cost)

Every query is first classified using a heuristic classifier that requires no LLM call:

  • simple_lookup – Single fact retrieval ("What is X?")
  • multi_hop – Requires connecting information across documents ("How does X relate to Y?")
  • comparison – Needs data from multiple sources to compare ("Compare X and Y")
  • temporal – Involves time-based reasoning ("What changed since Q1?")
  • analytical – Requires synthesis and reasoning over multiple data points

The classifier uses keyword patterns, question structure, and entity count analysis. Simple queries skip the planning step entirely and go straight to vector search, adding zero overhead for easy questions. Only complex queries trigger the full planning pipeline.

This is important: a planning step that adds latency to every query, including simple ones, would be a net negative. A-RAG's heuristic classifier ensures the planning overhead only applies where it provides value.
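To make the idea concrete, a heuristic classifier along these lines might look like the following. The regex patterns and clause counting are illustrative assumptions for this sketch, not ZenAI's actual strategy-agent.ts logic.

```typescript
// Hypothetical zero-LLM-cost query classifier: keyword patterns plus
// a rough clause count, checked in priority order.
type QueryType =
  | "simple_lookup"
  | "multi_hop"
  | "comparison"
  | "temporal"
  | "analytical";

function classifyQuery(query: string): QueryType {
  const q = query.toLowerCase();
  // Comparison: explicit comparative keywords
  if (/\b(compare|versus|vs\.?|difference between)\b/.test(q)) return "comparison";
  // Temporal: time-based phrasing
  if (/\b(since|before|after|changed|last (week|month|quarter|year)|q[1-4])\b/.test(q)) return "temporal";
  // Multi-hop: relational phrasing between entities
  if (/\b(relate[sd]? to|connection between|how does .+ affect)\b/.test(q)) return "multi_hop";
  // Analytical: synthesis verbs, or many clauses in one question
  const clauses = q.split(/[,;]| and /).length;
  if (/\b(analyze|synthesize|identify|summarize|why)\b/.test(q) || clauses >= 3) return "analytical";
  // Default: single-fact lookup, which skips planning entirely
  return "simple_lookup";
}
```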

Step 2: Strategy Agent (LLM-Powered Planning)

For complex queries, a Claude-based strategy agent generates a retrieval plan as structured JSON:

{
  "steps": [
    { "interface": "semantic", "query": "Q1 revenue trends 2026", "depends_on": [] },
    { "interface": "keyword", "query": "board deck strategy Q1", "depends_on": [] },
    { "interface": "graph", "query": "revenue strategy contradictions", "depends_on": [0, 1] }
  ]
}

Five retrieval interfaces are available:

| Interface | How It Works | Best For |
|-----------|--------------|----------|
| keyword | BM25 full-text search | Exact terms, names, codes |
| semantic | Vector similarity (pgvector) | Conceptual similarity |
| chunk_read | Direct document chunk access | Known documents |
| graph | Knowledge graph traversal | Relationships, multi-hop |
| community | Graph community summaries | High-level themes |

Independent steps (empty depends_on) run in parallel; dependent steps wait for their prerequisites. Plans are capped at four steps to prevent runaway complexity.
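Dependency-aware execution of such a plan can be sketched as follows. `executePlan` and `execStep` are illustrative names, not ZenAI's iterative-retriever.ts; the sketch assumes prerequisites always appear earlier in the step list than the steps that depend on them, as in the plan JSON above.

```typescript
// Sketch of plan execution: steps with an empty depends_on start
// immediately and run concurrently; dependent steps await their
// prerequisites' results before executing.
interface PlanStep {
  interface: string;    // "semantic" | "keyword" | "graph" | ...
  query: string;
  depends_on: number[]; // indices of prerequisite steps
}

interface StepResult {
  step: number;
  chunks: string[];
}

async function executePlan(
  steps: PlanStep[],
  execStep: (s: PlanStep, deps: StepResult[]) => Promise<StepResult>,
): Promise<StepResult[]> {
  const pending: Promise<StepResult>[] = [];
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    pending[i] = (async () => {
      // An empty prerequisite list resolves immediately, so
      // independent steps effectively run in parallel.
      const deps = await Promise.all(step.depends_on.map((d) => pending[d]));
      return execStep(step, deps);
    })();
  }
  return Promise.all(pending);
}
```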

Step 3: Graph-Aware Query Expansion

When a retrieval step returns low-confidence results, A-RAG doesn't just retry the same query. It expands the query using the knowledge graph:

  1. Look up entities mentioned in the query
  2. Find related entities and relation types from the graph
  3. Append expansion terms to the original query
  4. Re-retrieve with the enriched query

This is the difference between searching for "revenue trends" and searching for "revenue trends, ARR, MRR, quarterly growth, board projections" โ€” the graph provides domain-specific context that improves recall.
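A minimal sketch of this expansion, with the knowledge graph reduced to an in-memory map from entities to related terms; this map and the naive substring entity spotting are stand-ins for ZenAI's actual graph API, which the post does not show.

```typescript
// Hypothetical graph-aware query expansion: spot entities in the
// query, pull their neighbors from the graph, and append a bounded
// number of new terms to the query.
type Graph = Map<string, string[]>; // entity -> related entity/relation terms

function expandQuery(query: string, graph: Graph, maxTerms = 5): string {
  const extra: string[] = [];
  for (const [entity, neighbors] of graph) {
    // Naive entity spotting: substring match against the query
    if (query.toLowerCase().includes(entity.toLowerCase())) {
      extra.push(...neighbors);
    }
  }
  // Deduplicate, drop terms already in the query, bound the expansion
  const unique = [...new Set(extra)]
    .filter((t) => !query.toLowerCase().includes(t.toLowerCase()))
    .slice(0, maxTerms);
  return unique.length ? `${query}, ${unique.join(", ")}` : query;
}
```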

Step 4: Quality Gates with Confidence Scoring

After each retrieval iteration, a 4-component confidence score is computed:

  • topScore – Best individual match quality (is the best result actually good?)
  • avgScore – Average across results (are results consistently relevant?)
  • variance – Consistency of results (low variance = focused results)
  • diversity – Coverage of different sources (are we seeing multiple perspectives?)

Three thresholds control the flow:

EARLY_EXIT = 0.8   → Stop, results are excellent
CONTINUE   = 0.5   → Proceed to next iteration
REFORMULATE < 0.5  → Expand query and retry

Iteration is capped at three to bound latency. In practice, most queries resolve in one or two iterations.
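A sketch of the score and the three-way gate: the post names the four components and the thresholds, but the equal weighting and the variance normalization below are assumptions made for illustration.

```typescript
// Hypothetical 4-component confidence score and quality gate.
const EARLY_EXIT = 0.8;
const CONTINUE = 0.5;

function confidence(scores: number[], sourceCount: number, totalSources: number): number {
  if (scores.length === 0) return 0; // empty retrieval = zero confidence
  const top = Math.max(...scores);
  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, s) => a + (s - avg) ** 2, 0) / scores.length;
  const diversity = totalSources > 0 ? sourceCount / totalSources : 0;
  // Equal weighting (an assumption); low variance contributes positively.
  return 0.25 * top + 0.25 * avg + 0.25 * (1 - Math.min(variance * 4, 1)) + 0.25 * diversity;
}

type GateAction = "early_exit" | "continue" | "reformulate";

function gate(conf: number): GateAction {
  if (conf >= EARLY_EXIT) return "early_exit";
  if (conf >= CONTINUE) return "continue";
  return "reformulate"; // triggers graph-aware query expansion
}
```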

The GraphRAG Foundation

A-RAG operates on top of a 3-layer graph architecture:

Layer 1 – Event Subgraph: Temporal interactions with timestamps and activity scoring. "User discussed project X with colleague Y on March 15." This layer powers temporal queries.

Layer 2 – Semantic Graph: Named entities, typed relations, community detection via the Louvain algorithm, and centrality metrics. This is the persistent knowledge structure.

Layer 3 – Community Summaries: Auto-generated cluster summaries for high-level queries. "What are the main themes in my research?" uses community summaries rather than individual facts.

Retrieval strategies are combined with learned weights: semantic 0.5, event 0.3, community 0.2. This means high-level questions leverage community summaries, while specific questions use the event subgraph or direct semantic search.
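The weighted combination can be expressed directly; `combinedScore` is an illustrative helper using the weights quoted above, not ZenAI's actual code.

```typescript
// Hypothetical fusion of per-strategy relevance scores using the
// learned weights from the post: semantic 0.5, event 0.3, community 0.2.
const WEIGHTS = { semantic: 0.5, event: 0.3, community: 0.2 } as const;

type StrategyScores = Partial<Record<keyof typeof WEIGHTS, number>>;

function combinedScore(perChunk: StrategyScores): number {
  let total = 0;
  for (const [strategy, weight] of Object.entries(WEIGHTS)) {
    // A strategy that did not score this chunk contributes zero.
    total += weight * (perChunk[strategy as keyof typeof WEIGHTS] ?? 0);
  }
  return total;
}
```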

Contextual Retrieval: Better Chunks

Before any retrieval happens, we enhance our chunks using Anthropic's Contextual Retrieval method. Each chunk gets a 1-2 sentence context prefix generated by Claude Haiku, explaining where the chunk appears in the source document and what it discusses.

This achieves 35-67% reduction in retrieval failures versus standard chunking, because the context helps disambiguate chunks that would otherwise be similar in vector space but different in meaning.
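A sketch of the contextualization step: the `generate` callback stands in for a Claude Haiku call, and the prompt wording is an approximation of Anthropic's published recipe, not a verbatim copy.

```typescript
// Hypothetical contextual-retrieval preprocessing: ask a small model
// to situate the chunk within its source document, then prepend that
// context before the chunk is embedded and indexed.
async function contextualizeChunk(
  document: string,
  chunk: string,
  generate: (prompt: string) => Promise<string>,
): Promise<string> {
  const prompt =
    `<document>\n${document}\n</document>\n` +
    `Here is a chunk from the document:\n<chunk>\n${chunk}\n</chunk>\n` +
    `Write 1-2 sentences situating this chunk within the document.`;
  const context = await generate(prompt);
  // The prefixed chunk is what gets embedded and BM25-indexed.
  return `${context.trim()}\n\n${chunk}`;
}
```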

Results in Practice

In ZenAI's production deployment, A-RAG handles all retrieval for the chat interface, processing thousands of queries across 4 context domains (personal, work, learning, creative). Key observations:

  • Simple queries (60-70% of traffic) bypass planning entirely, adding zero latency
  • Complex queries get structured plans that consistently outperform single-pass retrieval
  • Graph-aware expansion recovers failed retrievals that would have returned empty results
  • Quality gates prevent hallucination by detecting when retrieval is insufficient

References

  • Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511.
  • Yan, S., et al. (2024). Corrective Retrieval Augmented Generation. arXiv:2401.15884.
  • Anthropic (2024). Introducing Contextual Retrieval. anthropic.com/news/contextual-retrieval.

Implementation

A-RAG is implemented in ZenAI's backend:

  • backend/src/services/arag/strategy-agent.ts – Query classification + plan generation
  • backend/src/services/arag/iterative-retriever.ts – Plan execution with quality gates
  • backend/src/services/arag/strategy-evaluator.ts – Confidence scoring

The system is part of ZenAI, an open-source AI operating system with 9,228 passing tests.

Source: github.com/zensation-ai/zenbrain
Technical Reference: zensation.ai/technologie