RAG is powerful. But like any powerful tool, the problem isn’t whether to use it — it’s when to use it.
I’ve seen too many agentic pipelines that blindly trigger a semantic search on every single user query. The user says “thanks” and the system dutifully fires off an embedding call, scans a 10-million-document vector store, retrieves the top-k chunks, and stuffs them into a context window. For nothing.
This isn’t just inefficient. At scale, it’s expensive — and it quietly degrades your system’s response time in ways that are hard to trace back to the root cause.
The core problem
Most orchestration tools today handle retrieval and generation as a coupled operation. You define a query engine, point it at your index, and the framework takes care of the rest — retrieval always happens, every time, no questions asked. That’s fine for demos. It’s not fine for production systems handling thousands of requests with diverse intent.
What’s missing is a lightweight routing layer that decides, before touching the vector store, whether retrieval is even warranted.
The 2-Step Gate Pattern
The architecture is straightforward:
           User Query
                │
                ▼
  ┌─────────────────────────────┐
  │ Gate Agent (LLM #1)         │  ← Cheap call, no retrieval
  │ "Does this need RAG?"       │
  └─────────────┬───────────────┘
                │
        ┌───────┴────────┐
        │                │
    NEEDS RAG      NO RAG NEEDED
        │                │
        ▼                ▼
  Vector Store     ┌─────────────────┐
 Semantic Search   │ Response Agent  │
        │          │ (LLM #2)        │
        └────────► └─────────────────┘
The first LLM call is intentionally lean — it carries only the system context describing what your knowledge base contains, and a single instruction: decide whether retrieval is necessary. Nothing else. No generation, no answering.
Here’s the pseudocode pattern:
# Step 1: Gate Agent (retrieval decision only)
gate_response = gate_agent.evaluate(
    knowledge_domain="Product catalog, company history, pricing data",
    user_query=user_query
)
# Step 2a: RAG path (retrieve, then generate)
if gate_response.requires_retrieval:
    retrieved_context = knowledge_store.semantic_search(user_query)
    final_response = response_agent.generate(
        query=user_query,
        context=retrieved_context
    )
# Step 2b: Direct path (generate without retrieval)
else:
    final_response = response_agent.generate(
        query=user_query,
        context=None
    )
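To make the flow above concrete, here is a minimal runnable sketch with stand-in components. Everything in it is hypothetical: `GateDecision`, `answer`, and the three stub functions are illustrative names rather than part of any framework, and the keyword check inside `stub_gate` is only a placeholder for the cheap LLM call. In a real system the stubs would wrap your LLM client and your vector store.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GateDecision:
    requires_retrieval: bool


def answer(
    user_query: str,
    gate: Callable[[str], GateDecision],
    search: Callable[[str], List[str]],
    generate: Callable[[str, Optional[List[str]]], str],
) -> str:
    """Run the two-step flow: gate first, retrieve only if the gate says so."""
    decision = gate(user_query)
    context = search(user_query) if decision.requires_retrieval else None
    return generate(user_query, context)


# --- Stand-ins so the sketch runs without an LLM or a vector store ---
search_calls: List[str] = []


def stub_gate(query: str) -> GateDecision:
    # Placeholder heuristic; a real gate would be a cheap LLM call.
    return GateDecision(requires_retrieval="price" in query.lower())


def stub_search(query: str) -> List[str]:
    search_calls.append(query)  # record that retrieval actually happened
    return ["Plan A costs $10/month."]


def stub_generate(query: str, context: Optional[List[str]]) -> str:
    return f"answer with {len(context)} chunk(s)" if context else "direct answer"


print(answer("thanks", stub_gate, stub_search, stub_generate))
print(answer("What is the price of Plan A?", stub_gate, stub_search, stub_generate))
print(search_calls)  # only the pricing question reached the search stub
```

Passing the three components in as callables keeps the routing logic testable on its own: you can assert that "thanks" never touches the search function without standing up an index.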
The gate agent’s system prompt is the critical piece. It needs to know precisely what lives in your knowledge base: not vaguely, but specifically enough to make a reliable yes/no decision. If your knowledge base contains company-specific data, product documentation, or domain knowledge that a general-purpose LLM wouldn’t know, that boundary should be explicitly described in the gate prompt.
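One way to spell out that boundary is to build the gate prompt from an explicit list of knowledge-base topics and force a one-word answer. The sketch below is an illustration under assumptions: the prompt wording, the RETRIEVE / DIRECT labels, and the `call_llm` parameter (any function taking a system prompt and a user message and returning text) are all hypothetical. Falling back to retrieval when the output can't be parsed is a design choice, not a requirement.

```python
from typing import Callable, Sequence

GATE_TEMPLATE = """You are a routing step. Do not answer the user.
The knowledge base contains ONLY the following:
{topics}
Reply with exactly one word:
RETRIEVE - the query needs facts from the knowledge base
DIRECT - the query can be handled without it"""


def build_gate_prompt(topics: Sequence[str]) -> str:
    """Render the system prompt with an explicit list of what the store holds."""
    bullet_list = "\n".join(f"- {t}" for t in topics)
    return GATE_TEMPLATE.format(topics=bullet_list)


def parse_gate_output(raw: str) -> bool:
    """Map the model's reply to a boolean; unknown output falls back to retrieval."""
    label = raw.strip().upper()
    if label.startswith("DIRECT"):
        return False
    # "RETRIEVE" and anything unexpected both take the safe path.
    return True


def requires_retrieval(
    user_query: str,
    topics: Sequence[str],
    call_llm: Callable[[str, str], str],  # hypothetical: (system_prompt, user_message) -> text
) -> bool:
    """Ask the gate model and convert its one-word reply into a decision."""
    return parse_gate_output(call_llm(build_gate_prompt(topics), user_query))


# Usage with a fake model that always answers DIRECT:
print(requires_retrieval("thanks", ["Product catalog", "Pricing data"], lambda s, u: "DIRECT"))
```

Defaulting to retrieval on ambiguous output trades a little cost for safety: a wasted search is usually cheaper than an answer that should have been grounded and wasn't.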
The latency and cost savings are the obvious win. But the more important benefit is architectural clarity. You’re separating two concerns that have no business being coupled: retrieval relevance assessment and answer generation. When these are split, each agent can be optimized, monitored, and replaced independently.
It also gives you a natural instrumentation point. Log every gate decision. Over time, you’ll learn the real distribution of your query types — and that data is genuinely valuable for refining your system prompt and catching edge cases.
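As a rough sketch of that instrumentation point, the snippet below emits one structured log line per gate decision and keeps a running tally. The field names, the `record_gate_decision` helper, and the in-memory `Counter` are my own assumptions for illustration; a production setup would more likely ship these events to whatever metrics or logging backend you already use.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("gate")
decision_counts: Counter = Counter()


def record_gate_decision(user_query: str, requires_retrieval: bool, latency_ms: float) -> dict:
    """Emit one structured log line per gate decision and update running counts."""
    event = {
        "event": "gate_decision",
        "query": user_query,
        "requires_retrieval": requires_retrieval,
        "latency_ms": round(latency_ms, 1),
    }
    logger.info(json.dumps(event))
    decision_counts["retrieve" if requires_retrieval else "direct"] += 1
    return event


def retrieval_rate() -> float:
    """Fraction of queries the gate has sent down the RAG path so far."""
    total = sum(decision_counts.values())
    return decision_counts["retrieve"] / total if total else 0.0


# Usage: two decisions, one per path.
record_gate_decision("thanks", False, 12.3)
record_gate_decision("What does Plan A cost?", True, 18.7)
print(retrieval_rate())
```

Reviewing the logged queries on each side of the decision is where the edge cases show up: direct-path queries that should have been grounded, and retrieval-path queries that returned nothing useful.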
What current orchestration tools get wrong
Frameworks like LlamaIndex, LangChain, and similar tools do an excellent job of combining retrieval and generation into a single ergonomic call. That’s their design goal and they achieve it well. But none of them — at least not yet, not natively — provide a first-class pattern for conditional retrieval based on semantic intent assessment. The router patterns that exist are mostly about choosing which tool or index to call, not about whether to call any of them at all.
This is a gap worth solving at the framework level. Until then, you wire it yourself.
One caveat worth mentioning
The gate call itself has a cost. For very high-volume systems with simple, predictable query patterns, a lightweight rule-based or embedding-distance pre-filter can sit before the gate LLM — catching the obvious cases (greetings, clarifications, out-of-scope chatter) at near-zero cost. The 2-step LLM pattern then handles everything in the gray zone. Layered filtering, not a single magic solution.
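Here is a minimal sketch of the rule-based layer, assuming a small hand-written pattern list. The `_SMALL_TALK` regex, `prefilter`, and `should_retrieve` are hypothetical names, and the patterns are deliberately conservative: anything that isn't an exact greeting or acknowledgement returns `None` and is deferred to the gate LLM. An embedding-distance check could slot into the same place, but it is left out here.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns for messages that never need the knowledge base.
_SMALL_TALK = re.compile(
    r"^\s*(hi|hello|hey|thanks|thank you|ok|okay|bye|goodbye)\b[\s!.,]*$",
    re.IGNORECASE,
)


def prefilter(user_query: str) -> Optional[bool]:
    """Return False for obvious no-retrieval messages, None when unsure."""
    if not user_query.strip() or _SMALL_TALK.match(user_query):
        return False
    return None  # gray zone: defer to the gate LLM


def should_retrieve(user_query: str, gate_llm: Callable[[str], bool]) -> bool:
    """Layered decision: near-free rule check first, gate LLM only if needed."""
    verdict = prefilter(user_query)
    return gate_llm(user_query) if verdict is None else verdict


# Usage with a fake gate that records when it is actually called:
gate_calls = []


def fake_gate(query: str) -> bool:
    gate_calls.append(query)
    return True


print(should_retrieve("Thanks!", fake_gate))
print(should_retrieve("What does Plan A cost?", fake_gate))
print(gate_calls)  # the greeting never reached the gate
```

Note that the pre-filter only ever short-circuits to "no retrieval". It never decides in favor of retrieval on its own, so a bad pattern can cost you a missed lookup but can't add search load.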
The pattern is simple. The discipline to apply it consistently is less so. But if you’re building RAG systems that you expect to run at any meaningful scale, the question shouldn’t be how to search your knowledge base — it should first be whether to.