The Problem with Standard RAG
Retrieval-Augmented Generation (RAG) has become the default pattern for giving LLMs access to external knowledge. The standard pipeline is straightforward:
- Embed the user's query into a vector
- Search a vector store for similar chunks
- Stuff the top-k results into the prompt
- Generate a response
This works for simple lookups. "What is our return policy?" Embed, search, done.
But what about: "Compare our Q1 revenue trends with the strategy outlined in last month's board deck, and identify contradictions."
That query needs multiple retrieval steps. It needs different data sources. It needs different retrieval strategies. It needs a plan.
Standard RAG would embed the entire question, search the vector store, and return whatever chunks are most similar to the full query string. The results would be a random mix of revenue data and strategy fragments, none of which answer the actual question.
The Evolution: From RAG to CRAG to A-RAG
Before explaining A-RAG, it helps to understand the progression:
Standard RAG (2023): Embed query, search vectors, return top-k. No quality assessment. No fallback if results are poor.
Self-RAG (Asai et al., 2023): Adds a critique step. After retrieval, the system assesses whether the results are sufficient. If confidence is low, it can reformulate and retry. This was a major improvement but still uses a single retrieval strategy.
CRAG (Corrective RAG, Yan et al., 2024): Adds a quality gate that classifies retrieval results as correct, ambiguous, or incorrect, and routes to different actions accordingly.
A-RAG (our approach): A meta-agent that reasons about the optimal retrieval strategy before executing any search. It classifies the query type, selects appropriate retrieval interfaces, generates a multi-step plan with dependencies, and executes with quality gates at each step.
The key difference: A-RAG doesn't just correct bad retrieval. It prevents it by choosing the right strategy upfront.
How A-RAG Works
Step 1: Query Classification (Zero LLM Cost)
Every query is first classified using a heuristic classifier that requires no LLM call:
- simple_lookup → Single fact retrieval ("What is X?")
- multi_hop → Requires connecting information across documents ("How does X relate to Y?")
- comparison → Needs data from multiple sources to compare ("Compare X and Y")
- temporal → Involves time-based reasoning ("What changed since Q1?")
- analytical → Requires synthesis and reasoning over multiple data points
The classifier uses keyword patterns, question structure, and entity count analysis. Simple queries skip the planning step entirely and go straight to vector search, with zero overhead for easy questions. Only complex queries trigger the full planning pipeline.
This is important: a planning step that adds latency to every query, including simple ones, would be a net negative. A-RAG's heuristic classifier ensures the planning overhead only applies where it provides value.
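To make the heuristic concrete, here is a minimal sketch of such a classifier. The patterns, thresholds, and function names are illustrative assumptions, not ZenAI's actual rules:

```typescript
// Hypothetical heuristic classifier: no LLM call, just patterns and counts.
type QueryType = "simple_lookup" | "multi_hop" | "comparison" | "temporal" | "analytical";

function classifyQuery(query: string): QueryType {
  const q = query.toLowerCase();
  // Keyword patterns, checked in priority order.
  if (/\bcompare\b|\bversus\b|\bvs\.?\b/.test(q)) return "comparison";
  if (/\bsince\b|\bchanged\b|\blast (month|quarter|year)\b|\bq[1-4]\b/.test(q)) return "temporal";
  if (/\brelate|\bconnection|\bimpact of\b|\bbecause of\b/.test(q)) return "multi_hop";
  // Entity-count heuristic: several capitalized terms or a long question
  // suggest synthesis across multiple data points.
  const entities = query.match(/\b[A-Z][a-z]+\b/g) ?? [];
  if (entities.length > 2 || q.split(/\s+/).length > 20) return "analytical";
  return "simple_lookup";
}
```

A real classifier would use more patterns per class, but the shape is the point: a few regex tests and counts decide in microseconds whether to invoke the planner.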
Step 2: Strategy Agent (LLM-Powered Planning)
For complex queries, a Claude-based strategy agent generates a retrieval plan as structured JSON:
```json
{
  "steps": [
    { "interface": "semantic", "query": "Q1 revenue trends 2026", "depends_on": [] },
    { "interface": "keyword", "query": "board deck strategy Q1", "depends_on": [] },
    { "interface": "graph", "query": "revenue strategy contradictions", "depends_on": [0, 1] }
  ]
}
```
Five retrieval interfaces are available:
| Interface | How It Works | Best For |
|-----------|--------------|----------|
| keyword | BM25 full-text search | Exact terms, names, codes |
| semantic | Vector similarity (pgvector) | Conceptual similarity |
| chunk_read | Direct document chunk access | Known documents |
| graph | Knowledge graph traversal | Relationships, multi-hop |
| community | Graph community summaries | High-level themes |
Independent steps (empty depends_on) run in parallel. Dependent steps wait for their prerequisites. The plan is bounded to a maximum of 4 steps to prevent runaway complexity.
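The execution loop can be sketched as follows. This is a simplified model of dependency-aware execution, with a hypothetical `retrieve` callback standing in for the actual interface dispatch:

```typescript
// Hypothetical plan executor: each round runs every step whose
// dependencies are already satisfied, in parallel.
interface PlanStep {
  interface: string;
  query: string;
  depends_on: number[];
}

async function executePlan(
  steps: PlanStep[],
  retrieve: (step: PlanStep, prior: string[][]) => Promise<string[]>,
): Promise<string[][]> {
  const results: string[][] = new Array(steps.length);
  const done = new Set<number>();
  while (done.size < steps.length) {
    // Collect all not-yet-run steps whose dependencies are complete.
    const ready = steps
      .map((s, i) => ({ s, i }))
      .filter(({ s, i }) => !done.has(i) && s.depends_on.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("cyclic or unsatisfiable dependencies");
    // Run the ready batch concurrently; dependent steps see prior results.
    await Promise.all(
      ready.map(async ({ s, i }) => {
        results[i] = await retrieve(s, s.depends_on.map((d) => results[d]));
      }),
    );
    ready.forEach(({ i }) => done.add(i));
  }
  return results;
}
```

For the example plan above, steps 0 and 1 run in the first parallel batch and step 2 runs in the second, with both earlier result sets available to it.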
Step 3: Graph-Aware Query Expansion
When a retrieval step returns low-confidence results, A-RAG doesn't just retry the same query. It expands the query using the knowledge graph:
- Look up entities mentioned in the query
- Find related entities and relation types from the graph
- Append expansion terms to the original query
- Re-retrieve with the enriched query
This is the difference between searching for "revenue trends" and searching for "revenue trends, ARR, MRR, quarterly growth, board projections": the graph provides domain-specific context that improves recall.
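The expansion loop above can be sketched like this. The graph interface here is hypothetical, reduced to the two lookups the steps require:

```typescript
// Sketch of graph-aware query expansion; the KnowledgeGraph API is an
// assumption, not ZenAI's actual graph service.
interface KnowledgeGraph {
  findEntities(text: string): string[];
  relatedEntities(entity: string, limit: number): string[];
}

function expandQuery(query: string, graph: KnowledgeGraph, maxTerms = 5): string {
  const expansions = new Set<string>();
  for (const entity of graph.findEntities(query)) {
    for (const related of graph.relatedEntities(entity, 3)) {
      // Skip terms already present in the query to avoid noise.
      if (!query.toLowerCase().includes(related.toLowerCase())) {
        expansions.add(related);
      }
      if (expansions.size >= maxTerms) break;
    }
  }
  // Append expansion terms and re-retrieve with the enriched query.
  return expansions.size > 0 ? `${query}, ${[...expansions].join(", ")}` : query;
}
```

Capping the number of expansion terms matters: unbounded expansion dilutes the query as much as it enriches it.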
Step 4: Quality Gates with Confidence Scoring
After each retrieval iteration, a 4-component confidence score is computed:
- topScore → Best individual match quality (is the best result actually good?)
- avgScore → Average across results (are results consistently relevant?)
- variance → Consistency of results (low variance = focused results)
- diversity → Coverage of different sources (are we seeing multiple perspectives?)
Three thresholds control the flow:
- EARLY_EXIT (≥ 0.8) → stop, results are excellent
- CONTINUE (≥ 0.5) → proceed to next iteration
- REFORMULATE (< 0.5) → expand query and retry
Maximum 3 iterations to bound latency. In practice, most queries resolve in 1-2 iterations.
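A minimal version of the scoring and gating logic might look like this. The component weights and the variance normalization are assumptions for illustration; only the four components and the three thresholds come from the design above:

```typescript
// Illustrative 4-component confidence score; weights are hypothetical.
function confidence(scores: number[], sourceIds: string[]): number {
  if (scores.length === 0) return 0;
  const top = Math.max(...scores);                                    // best match
  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;      // consistency
  const variance = scores.reduce((a, s) => a + (s - avg) ** 2, 0) / scores.length;
  const diversity = new Set(sourceIds).size / sourceIds.length;       // source coverage
  // High top/avg/diversity raise confidence; high variance lowers it.
  return 0.4 * top + 0.3 * avg + 0.2 * diversity + 0.1 * (1 - Math.min(variance * 4, 1));
}

type Action = "early_exit" | "continue" | "reformulate";

function gate(conf: number): Action {
  if (conf >= 0.8) return "early_exit";
  if (conf >= 0.5) return "continue";
  return "reformulate";
}
```

Three uniformly strong results from three distinct sources score well above the early-exit threshold, so the loop terminates after one iteration.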
The GraphRAG Foundation
A-RAG operates on top of a 3-layer graph architecture:
Layer 1 (Event Subgraph): Temporal interactions with timestamps and activity scoring. "User discussed project X with colleague Y on March 15." This layer powers temporal queries.
Layer 2 (Semantic Graph): Named entities, typed relations, community detection via the Louvain algorithm, and centrality metrics. This is the persistent knowledge structure.
Layer 3 (Community Summaries): Auto-generated cluster summaries for high-level queries. "What are the main themes in my research?" uses community summaries rather than individual facts.
Retrieval strategies are combined with learned weights: semantic 0.5, event 0.3, community 0.2. This means high-level questions leverage community summaries, while specific questions use the event subgraph or direct semantic search.
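Combining strategies with those weights can be sketched as a weighted score merge. The weights come from the text; the merge-by-sum logic and the `Hit` shape are assumptions:

```typescript
// Merge per-strategy result lists using the stated learned weights.
const WEIGHTS = { semantic: 0.5, event: 0.3, community: 0.2 } as const;

type Strategy = keyof typeof WEIGHTS;

interface Hit { id: string; score: number }

function mergeResults(perStrategy: Record<Strategy, Hit[]>): Hit[] {
  const combined = new Map<string, number>();
  for (const strategy of Object.keys(perStrategy) as Strategy[]) {
    for (const hit of perStrategy[strategy]) {
      // A chunk surfaced by multiple layers accumulates weighted score.
      combined.set(hit.id, (combined.get(hit.id) ?? 0) + WEIGHTS[strategy] * hit.score);
    }
  }
  return [...combined.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A side effect of summing is that results confirmed by multiple layers outrank results found by only one, which is usually the behavior you want from a hybrid retriever.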
Contextual Retrieval: Better Chunks
Before any retrieval happens, we enhance our chunks using Anthropic's Contextual Retrieval method. Each chunk gets a 1-2 sentence context prefix generated by Claude Haiku, explaining where the chunk appears in the source document and what it discusses.
This achieves 35-67% reduction in retrieval failures versus standard chunking, because the context helps disambiguate chunks that would otherwise be similar in vector space but different in meaning.
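The preprocessing step can be sketched as follows. The prompt wording is an assumption modeled on Anthropic's published description, and `callLlm` is a hypothetical stand-in for a real Claude Haiku call:

```typescript
// Sketch of Contextual Retrieval preprocessing: prefix each chunk with a
// short LLM-generated context before embedding/indexing it.
async function contextualizeChunk(
  documentText: string,
  chunkText: string,
  callLlm: (prompt: string) => Promise<string>,
): Promise<string> {
  const prompt =
    `<document>\n${documentText}\n</document>\n` +
    `Here is a chunk from the document:\n<chunk>\n${chunkText}\n</chunk>\n` +
    `Write 1-2 sentences situating this chunk within the overall document, ` +
    `for use in search retrieval. Answer with only the context.`;
  const context = await callLlm(prompt);
  // The prefix is embedded together with the chunk, disambiguating chunks
  // that would otherwise look alike in vector space.
  return `${context.trim()}\n\n${chunkText}`;
}
```

Since the whole document is sent with every chunk, prompt caching is what makes this affordable in practice.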
Results in Practice
In ZenAI's production deployment, A-RAG handles all retrieval for the chat interface, processing thousands of queries across 4 context domains (personal, work, learning, creative). Key observations:
- Simple queries (60-70% of traffic) bypass planning entirely, with zero added latency
- Complex queries get structured plans that consistently outperform single-pass retrieval
- Graph-aware expansion recovers failed retrievals that would have returned empty results
- Quality gates prevent hallucination by detecting when retrieval is insufficient
References
- Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511.
- Yan, S., et al. (2024). Corrective Retrieval Augmented Generation. arXiv:2401.15884.
- Anthropic (2024). Introducing Contextual Retrieval. anthropic.com/news/contextual-retrieval.
- Ye, J., Su, J., & Cao, Y. (2022). A Stochastic Shortest Path Algorithm for Optimizing Spaced Repetition Scheduling. KDD 2022.
Implementation
A-RAG is implemented in ZenAI's backend:
- backend/src/services/arag/strategy-agent.ts → Query classification + plan generation
- backend/src/services/arag/iterative-retriever.ts → Plan execution with quality gates
- backend/src/services/arag/strategy-evaluator.ts → Confidence scoring
The system is part of ZenAI, an open-source AI operating system with 9,228 passing tests.
Source: github.com/zensation-ai/zenbrain
Technical Reference: zensation.ai/technologie
