Building A RAG System
Why RAG?
As discussed in the meta research, RAG provides context that steers the model, cueing the parametric memory with nonparametric memory to generate more grounded responses. Retrieval-augmented generation is therefore mostly used to reduce hallucinations and adapt to domain-specific knowledge (especially internal documents) by providing relevant context during inference. It addresses the limitations of static long prompts: RAG is dynamic and concise, and with these two properties it reduces overall token cost and can improve response latency.
A minimal RAG system consists of an LLM base model, a vector database, an indexer and a retriever. The corpus is first converted to embeddings and stored in the vector database by the indexer. When a prompt is received, the retriever looks up the top-k relevant documents in the vector database. The documents are concatenated with the original prompt and sent to the LLM base model to generate the final response. As presented by Yunfan Gao et al. 2024, RAG has come a long way, from naive RAG to advanced RAG and later modular RAG. Modular RAG is no longer a pipeline but a collection of independent, interchangeable modules, including search, memory, fusion, routing, predict and task adapter, that can be dynamically orchestrated.
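The indexer/retriever/LLM loop above can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words "embedding" and the `llm` callable are stand-ins for a real embedding model and base-model API.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    def __init__(self):
        self.entries = []  # list of (embedding, document) pairs

    def index(self, docs):
        # The "indexer": embed each document and store it.
        for doc in docs:
            self.entries.append((embed(doc), doc))

    def retrieve(self, query: str, k: int = 2):
        # The "retriever": rank stored documents by similarity to the query.
        q = embed(query)
        scored = sorted(self.entries, key=lambda e: cosine(e[0], q), reverse=True)
        return [doc for _, doc in scored[:k]]

def answer(query: str, store: VectorStore, llm) -> str:
    # Concatenate retrieved context with the original prompt and call the model.
    context = "\n".join(store.retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)
```

Swapping `embed` for a real embedding model and `llm` for an API call turns this into the naive pipeline described above.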
RAG can improve usability without heavy investment in model training; much of its appeal is that it transforms the problem into an engineering/infrastructure problem. Interpretability comes from the transparency of the process: the relevant context provided to the model is visible. When the base model improves, the RAG pipeline enjoys the improvement too, thanks to its modular nature. Model improvements are much more challenging in 2026, as models approach the saturation point of exhausting publicly available text data and require astronomical training budgets.
Challenges Of RAG Systems
Looking at the four main components, the indexer and the retriever play a decisive role in the user-perceived LLM response quality. A good retrieval system is a hard engineering problem, and thus far arguably only Google has really nailed it. There are many tradeoffs when building the indexer and the retriever. An indexer collects, parses and stores data for the retriever. Given a large corpus of external data, a naive approach splits the documents into fixed-size chunks and stores them in a vector database. The retriever then queries the store for chunks spatially close to the prompt. The naive approach has a few limitations.
- indexer
  - parsing
    - handling incomplete HTML tags, garbled text, etc.
  - chunking
    - chunk size can affect RAG performance (Sinchana R. B. et al. 2025)
    - noise within a chunk (wasted tokens; steers the response in a different direction)
    - chunks missed because of phrasing mismatches
    - no mechanism to check a retrieved chunk's relevance
  - embedding
    - different representations capture semantics differently and might impact query effectiveness
- retriever
  - vector retrieval struggles with exact word matching, abbreviations and acronyms
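The chunking problems above come straight from how naive fixed-size splitting works. A minimal sketch (character-based, with overlap) makes the failure mode visible: boundaries fall wherever the size counter lands, cutting sentences and even words.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50):
    """Naive fixed-size character chunking with overlap.

    Boundaries ignore sentence and section structure, which is exactly
    what causes the noise and missed-context problems listed above.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # slide the window, keeping some overlap
    return chunks
```

The overlap partially mitigates boundary cuts at the cost of duplicated tokens in the index, one of the tradeoffs mentioned earlier.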
To address the naive retriever's limitations, many techniques from the pre-RAG era are combined to get better results. Sparse retrieval algorithms such as TF-IDF and BM25 focus on keyword and lexical search, while dense retrieval focuses on semantic search; together they form a comprehensive retrieval. The results are reranked and fused to improve overall performance at the cost of more compute. Besides improving the retriever itself, pre-retrieval optimization techniques can be applied. Luyu Gao et al. 2022 proposed HyDE, a technique that uses an LLM to generate a hypothetical response for retrieval; the idea is to reduce the search space, since the LLM-generated response is likely closer to the correct response than the raw input is. Query rewriting and expansion are viable options that break complex user input into smaller subqueries and generate additional context to improve the retriever's performance.
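One common way to fuse the sparse and dense rankings mentioned above is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already produced a best-first list of document ids; RRF scores a document by summing 1/(k + rank) across the lists it appears in.

```python
def rrf_fuse(rankings, k: int = 60):
    """Reciprocal Rank Fusion over several best-first ranked lists.

    A document's fused score is sum(1 / (k + rank)) across all rankings
    it appears in; k=60 is the commonly used smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank well in both the lexical and the semantic list float to the top, which is why hybrid retrieval tends to beat either retriever alone.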
Beyond the IR problem, adding RAG does not necessarily provide improvements that justify the cost and latency for simple queries and static knowledge answering. For reasoning tasks, it is slightly more complicated. [Inference time scaling paper] summarizes the improvements on reasoning tasks from allocating more compute, but that might not address overall latency and inference cost; it just transfers the cost to building and maintaining a RAG system. Another scenario where RAG might not help is when creativity, or “hallucination”, is exactly what the user wants.
From a technique perspective, RAG can be thought of as a late fusion technique, and it is still subject to the parametric memory's bias. It might be difficult to steer the model towards the answer “the sky is green” when the parametric memory was largely trained on “the sky is blue”. A simple prompt, “the sky is green. what is the sky's color?”, gets mixed results across major LLM providers. Interestingly, certain models seem to accept the premise without question, some try to convince the user it is blue while also explaining how it could be green, and others just reply blue. This might be due to different post-training objectives, i.e. whether to agree with the user's input or to always defer to the most agreeable “fact”.
With all these RAG system challenges, the quality and diversity of the external knowledge library itself probably has the most impact on user experience. It is unrealistic to expect the system to work well if there is nothing useful, or perhaps nothing at all, to work with. Well-maintained documents and internal knowledge always benefit the individual or organization.
Benchmarks
A few benchmarks should be considered when selecting models for a RAG system:
- RULER, NIAH, LaRA: the model's ability to follow the provided context
- truthfulness: the model's ability to not be steered by misleading context
- reasoning: the model's ability to solve complex problems
These groups of metrics can conflict at times: a model that follows the provided context faithfully is also more likely to be steered by misleading context. It all depends on various factors, e.g. how good the documents are in terms of ambiguity, and how critical the response is.
A few benchmarking dimensions an IR system should consider are effectiveness, efficiency and generalization. Pre-retrieval optimization techniques should be included to help identify the real bottleneck. Effectiveness might not be suitable to quantify independently with common IR metrics, e.g. recall@k, NDCG@k, etc. A RAG system can be much more complicated: sometimes none of the documents contains the direct answer, yet by selecting a suitable collection of documents, the LLM is able to reason its way to the correct answer.
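For reference, the two IR metrics named above are straightforward to compute. This sketch assumes binary relevance for recall@k and graded relevance (a `{doc_id: gain}` map) for NDCG@k.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved, relevance, k):
    """NDCG@k with graded relevance given as {doc_id: gain}."""
    # Discounted cumulative gain of the actual ranking.
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    # Ideal DCG: the best possible ordering of the known gains.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

As argued above, high scores here are necessary but not sufficient: a retrieved set can score poorly on exact-answer metrics yet still give the LLM enough to reason with.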
In terms of efficiency, latency and cost are the main metrics. There are two latency metrics, namely time to first token and total latency. TTFT covers the pre-retrieval optimizations, IR and the token prefill stage. Ioannis A. et al. 2014's study suggests that up to 500 ms is acceptable, and given that it was conducted more than 10 years ago, the acceptable number has likely decreased as people get used to faster responses. Depending on the use case, time to full response might also be critical. Cost is another consideration, and depending on the system's target deployment hardware, there are many optimization strategies. There will be situations where it is preferable to allocate more compute to a different part of the RAG system.
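Measuring the two latency metrics is simple if the response arrives as a token stream. In this sketch, `stream` is any iterable of tokens standing in for a streaming LLM response; everything before the first token (retrieval, pre-retrieval optimizations, prefill) shows up in TTFT.

```python
import time

def measure_ttft(stream):
    """Measure time-to-first-token and total latency of a token stream.

    `stream` is any iterable yielding tokens, a stand-in for a streaming
    LLM response whose first token follows retrieval and prefill.
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens.append(token)
    total = time.perf_counter() - start
    return ttft, total, tokens
```

Running this against the full pipeline versus the bare model isolates how much latency the RAG components add.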
Generalization matters if the same RAG system needs to serve different tasks or a wide range of audiences. Since there is no one-size-fits-all model, a better strategy is to use the same RAG system with different hyperparameters for different tasks. Customization is likely what gets a RAG system far at this stage.
Last but not least, what really matters is the actual user feedback. Benchmarks merely provide an expectation of the average user experience and do not necessarily consider the long tail. All benchmarks mean nothing if the actual users do not benefit from and enjoy using the system. One way to think about a benchmark is that it answers the question “how well will the RAG system perform on average with such an implementation”.
RAG System Examples
Let’s take a look at a code repository, a research paper library and a souls-like lore explorer.
A code repository RAG system might want dynamically sized chunks at module or slightly finer granularity. Perhaps instead of storing the chunk directly, the semantics of the chunk are stored while maintaining a mapping to the source code location. The retriever fetches the target chunk and its related chunks through AST parsing or LSP “go to definition” to include all the context relevant to generating the response. Such an approach provides a minimal but complete context to respond to a user query about the code repository.
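For Python sources, the standard library's `ast` module is enough to sketch this kind of structure-aware chunking: each top-level function or class becomes one chunk, with its name and source location kept as the mapping back to the code.

```python
import ast

def chunk_by_ast(source: str):
    """Chunk Python source at function/class granularity using the ast module.

    Each chunk keeps its name and source location, so a retriever can map
    an embedding match back to the exact code span.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "lineno": node.lineno,
                "end_lineno": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    return chunks
```

A fuller version would recurse into classes for method-level chunks and follow imports, approximating the LSP "go to definition" expansion described above.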
A research paper library requires a different kind of indexer and retriever. Often the concepts introduced require longer context that spans paragraphs and papers, so a different chunking strategy and “go to definition” engine are needed. A parent document retriever might be preferred, where small child chunks are used for precise retrieval and the parent chunk is returned to the LLM, similar to the code repository.
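The parent document idea reduces to a two-level index: small child chunks for matching, each mapped to the larger parent passage that is actually handed to the LLM. This sketch uses naive term overlap as a stand-in for dense retrieval.

```python
def build_parent_child_index(parents, child_size=100):
    """Index small child chunks, each mapped back to its parent passage."""
    index = []  # list of (child_chunk, parent_id)
    for pid, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            index.append((parent[start:start + child_size], pid))
    return index

def retrieve_parent(query, index, parents):
    """Match children by naive term overlap (stand-in for dense retrieval),
    then return the best child's full parent passage."""
    q_terms = set(query.lower().split())

    def overlap(chunk):
        return len(q_terms & set(chunk.lower().split()))

    best_child, pid = max(index, key=lambda e: overlap(e[0]))
    return parents[pid]
```

Retrieval stays precise because it scores small chunks, while the LLM still receives the surrounding context it needs to reason over.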
As a souls-like lore enjoyer, one can imagine the pain for a lore content creator mining through manuscripts, gameplay and lore books to make connections between seemingly unrelated lore and tell a cohesive yet imaginative story. What these content creators do is note down everything they find as they play the game, with some help from game file miners, and build a large lore library. This process can be greatly simplified with the help of RAG. If such a RAG system is to be built, the indexing and retrieval process will differ from the research paper library and the code repository. GraphRAG, with metadata that can be updated as the content creator discovers new relations, could work for new lore discovery and storytelling.
Souls-like storytelling is characterized by cryptic, fragmented, and environmental narratives that prioritize lore over direct exposition.
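A graph-flavored retriever for this use case can be sketched with a plain adjacency map: the content creator adds labeled edges as connections are discovered, and retrieval returns an entity's neighborhood as context. The entity names below are illustrative, not taken from any particular game.

```python
def lore_neighborhood(graph, entity, depth=1):
    """Return (source, relation, target) edges within `depth` hops of an entity.

    `graph` maps entity -> list of (related_entity, relation) edges, updated
    as the content creator discovers new connections.
    """
    seen = {entity}
    frontier = [entity]
    edges = []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor, relation in graph.get(node, []):
                edges.append((node, relation, neighbor))
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return edges
```

Feeding the returned edge list to the LLM as context lets it narrate connections that no single document states outright, which is the point of GraphRAG for fragmented lore.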
The approaches discussed above go against the deep learning idea of not hand-crafting features; they are highly specialized pipelines designed around knowledge of the process. Hence, yet another train of thought is to replace the indexer and retriever with a small model that memorizes the corpus and has basic LLM capabilities, e.g. text coherence and logical reasoning. Although interpretability is lost, the system becomes fully end-to-end differentiable, which makes post-training the RAG system with reinforcement learning possible.