Retrieval-Augmented Generation — Architecture, Limits, and Practice

Type
Article
Published
2026-04-29
Aliases
RAG architecture, retrieval augmented generation, RAG pipelines
A 1937 photograph of Paul Otlet at his desk at the Mundaneum, surrounded by papers, drawers, and a card catalogue.
Paul Otlet at his desk at the Mundaneum, 1937. Otlet's universal-knowledge index held twelve million 3×5 cards, each cross-referenced via the Universal Decimal Classification. RAG reinvents the same problem with vectors instead of cards — and bumps into many of the same scaling limits. Source: Wikimedia Commons · public domain.
Summary

Retrieval-Augmented Generation is not a feature you bolt onto an LLM. It is a multi-stage architecture whose accuracy depends far more on how you retrieve than on how you generate — and which collapses past a few tens of thousands of documents unless you design for it.

Overview

Retrieval-Augmented Generation (RAG) is the dominant technique for grounding LLM outputs in domain-specific knowledge that the model was never trained on. The basic shape is familiar: take a question, retrieve relevant chunks from a corpus, paste them into the prompt, and let the model generate an answer over that augmented context.

The naive version of this pattern fits demos and toy projects. In production it breaks. Shraddha Bharuka is blunt about the source of the misconception: most teams think RAG means “add docs, retrieve, generate.” That works at small scale. At real-world scale, the system fails in ways that look like model failures but are actually retrieval failures — and the fixes live in the retrieval pipeline, not the prompt.

The wiki entry for Retrieval-Augmented Generation gives a one-paragraph definition. This article covers the architecture as practitioners actually build it, the failure modes that catch teams off guard, and the design choices that separate working pipelines from broken ones.

Key Concepts

RAG as a five-component pipeline

Vaishnavi frames a production RAG system as five interlocking components, each with its own failure modes:

  1. Extraction — pulling clean text out of PDFs, Slack exports, Notion pages, databases, and other source formats. Garbage in, garbage out: a corpus full of malformed extractions will degrade every downstream step regardless of how good the embeddings or model are.
  2. Embeddings — chunking strategy, embedding model choice, and the metadata attached to each chunk. Small changes (chunk size, overlap, whether you embed the section header) can shift retrieval quality by tens of percentage points.
  3. Vector database — speed and recall under realistic latency budgets. The choice of index (HNSW, IVF, ScaNN) is mostly invisible until you scale, at which point it dominates.
  4. Models — the LLM that generates the final answer, and the trade-off between open-weight models (cheaper, more private, often weaker) and frontier APIs.
  5. Evaluation — the most-skipped step. Bharuka and Vaishnavi both stress that without measurable retrieval quality, you cannot improve the pipeline; you can only guess.

The mistake teams make is treating these as five plug-and-play modules. They aren’t. Retrieval quality, chunk size, and reranker choice interact non-linearly. Changing the embedding model often invalidates the chunking strategy that worked with the previous one.
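
As a concrete reference point, here is a minimal sketch of the five components wired together. The embedding and generation steps are deliberately toy stand-ins (a hashed bag-of-words and a prompt template); a real pipeline would swap in a trained embedding model, an ANN index such as HNSW, and an LLM call. The structure and the seams between the stages are the point, not the implementations.

```python
# Minimal five-component RAG skeleton with toy stand-ins (illustrative only).
import hashlib
import numpy as np

DIM = 256  # dimensionality of the toy hashing embedder

def extract(raw_docs: list[str]) -> list[str]:
    # 1. Extraction: normalise whitespace; real pipelines handle PDF/HTML/Slack here.
    return [" ".join(d.split()) for d in raw_docs if d.strip()]

def embed(text: str) -> np.ndarray:
    # 2. Embeddings: hashed bag-of-words as a stand-in for a trained embedding model.
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class VectorIndex:
    # 3. Vector database: brute-force cosine search stands in for HNSW/IVF/ScaNN.
    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.matrix = np.stack([embed(c) for c in chunks])

    def search(self, query: str, k: int = 3) -> list[str]:
        scores = self.matrix @ embed(query)
        return [self.chunks[i] for i in np.argsort(-scores)[:k]]

def generate(question: str, context: list[str]) -> str:
    # 4. Models: placeholder for the LLM call; only the augmented prompt shape is shown.
    return f"Answer '{question}' using only:\n" + "\n".join(f"- {c}" for c in context)

def recall_at_k(index: VectorIndex, labelled: list[tuple[str, str]], k: int = 3) -> float:
    # 5. Evaluation: fraction of labelled queries whose gold chunk lands in the top-k.
    return sum(gold in index.search(q, k) for q, gold in labelled) / len(labelled)

chunks = extract(["The warranty period is 24 months.", "Returns are accepted within 30 days."])
index = VectorIndex(chunks)
print(generate("How long is the warranty?", index.search("How long is the warranty?")))
```

Each stand-in marks the seam where a production component plugs in — and, per the point above, swapping one of them (the embedder, say) usually forces a re-tune of its neighbours.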

The architecture zoo

Weaviate AI Database catalogues seven RAG architectures, each suited to different use cases:

  • Naive RAG — similarity search, augment, generate. Fast and simple. Accuracy degrades quickly with corpus size.
  • Retrieve-and-Rerank — adds a reranker model after initial retrieval to score and reorder candidates. Slower per query, dramatically more precise.
  • Multimodal RAG — retrieves and generates across text and images using multimodal embeddings. Used in product catalogues, medical imaging, visual document Q&A.
  • Graph RAG — retrieves over a knowledge graph rather than a flat vector store. Captures relationships that semantic similarity misses.
  • Hybrid RAG — combines dense (semantic) and sparse (keyword/BM25) retrieval, then merges the rankings (a rank-fusion sketch follows below).
  • Agentic RAG — an agent decides what to retrieve, re-queries when initial results are insufficient, and chains retrieval steps.
  • Hierarchical RAG — retrieves over a tree structure (sections within documents within domains), narrowing the search space before the final pass.

Vaishnavi’s later catalogue extends the taxonomy to 34 named techniques. The proliferation reflects that “RAG” describes a design space, not a single algorithm.
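
To make the hybrid variant concrete, the sketch below shows only the merge step, using reciprocal rank fusion (RRF), one common way to combine the two ranked lists. The document ids are invented for illustration, the k=60 constant is the usual RRF convention, and the dense and sparse retrievers themselves are assumed to exist upstream.

```python
# Reciprocal rank fusion: merge a dense (vector) ranking with a sparse (BM25) ranking.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; k=60 is the conventional RRF constant."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranked = ["doc7", "doc2", "doc9"]    # from the vector index
sparse_ranked = ["doc2", "doc4", "doc7"]   # from BM25 keyword search
print(reciprocal_rank_fusion([dense_ranked, sparse_ranked]))
# ['doc2', 'doc7', 'doc4', 'doc9'] — the documents both retrievers agree on rise to the top
```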

What “RAG eliminates hallucinations” actually delivers

A persistent vendor claim — repeated by Casetext, Thomson Reuters, and LexisNexis through 2023 — was that RAG had eliminated, avoided, or made impossible the hallucinations that plague general-purpose chatbots. The first preregistered empirical audit of these systems was Magesh, Surani, Dahl, Suzgun, Manning, and Ho (2024), the Stanford RegLab paper Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. The finding directly contradicts the marketing: Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI each hallucinated between 17% and 33% of the time on the test queries. RAG reduced hallucinations relative to GPT-4 alone, but did not come close to eliminating them.

The implication is structural rather than incremental. RAG redistributes the hallucination problem rather than solving it: when retrieval surfaces the wrong context, the model still generates a fluent answer that cites real-looking sources. For users — especially in high-stakes domains like legal research — this is more dangerous than a chatbot that obviously makes things up, because the failure mode looks like authority. The Stanford paper argues that the question of who is responsible for verifying AI output remains the central open problem for responsible deployment.

Theoretical limits of embedding-based retrieval

A complementary result, often confused with the Stanford paper in social-media summaries, is Weller, Boratko, Naim, and Lee (2025), On the Theoretical Limitations of Embedding-Based Retrieval, from Google DeepMind and Johns Hopkins. This paper proves a different point: there is a mathematical ceiling on how much information a fixed-dimensional embedding can represent, which puts a hard cap on the scale at which a single-vector retrieval system can work.

The intuition is the curse of dimensionality applied to retrieval. Every document is encoded as a vector in a high-dimensional embedding space — 768 to 3072 dimensions is typical. At small scale, similar documents cluster cleanly. As the corpus grows, the embedding space fills up. The number of top-k document subsets that any query can possibly distinguish is bounded by the embedding dimension: at 512 dimensions, retrieval saturates around hundreds of thousands of documents; at 4096 dimensions, the ceiling pushes into the hundreds of millions. Past those thresholds, no amount of training can recover the lost discriminating power, because the geometry simply does not have room.
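
The combinatorial point can be seen in a deliberately tiny, exaggerated setting. The sketch below is not the paper's proof technique, just an illustration of the geometry: six documents embedded in two dimensions, where most of the fifteen possible top-2 result sets can never be returned, no matter which query is asked.

```python
# Toy illustration: with 6 documents evenly spaced on a circle in d=2,
# sweep many query directions and count the distinct top-2 sets ever reachable.
from math import comb
import numpy as np

n_docs, k = 6, 2
angles = np.linspace(0, 2 * np.pi, n_docs, endpoint=False)
docs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (6, 2) unit vectors

reachable = set()
for theta in np.linspace(0, 2 * np.pi, 10_000, endpoint=False):
    q = np.array([np.cos(theta), np.sin(theta)])
    top2 = frozenset(np.argsort(-(docs @ q))[:k].tolist())
    reachable.add(top2)

print(len(reachable), "reachable top-2 sets out of", comb(n_docs, k), "possible")
# 6 reachable out of 15: only angularly adjacent pairs can ever be a top-2 result,
# because two dimensions do not have room to express the other combinations.
```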

This is the result that has been widely (and incorrectly) circulated as “Stanford’s semantic collapse paper” with claims of “87% precision drops at 50,000 documents.” The actual paper does not use the term “semantic collapse” and is more careful about thresholds, which depend on dimensionality and on the structure of the queries. The qualitative claim survives the misattribution: naive vector retrieval has a hard limit, and that limit is reached at scales real corpora hit. The fix is not better chunking or a fancier reranker but a shift in retrieval architecture: hierarchical retrieval that narrows the search space in stages, graph databases that exploit explicit relationships, and Hybrid Retrieval that combines semantic and keyword signals, so a query that fails one method can still succeed via the other. See Semantic Collapse for the concept entry, including the attribution history.

Memory efficiency at scale

The other scaling pressure is storage. Embeddings are dense float vectors — at 1024 dimensions and 4 bytes per float, every document costs 4KB just for its index entry. At a billion documents, that is terabytes of RAM-resident index.

Avi Chawla describes the technique that Perplexity, Azure search, and HubSpot’s AI assistant all use: binary or scalar quantisation that compresses each vector to a fraction of its original size — up to 32× memory reduction with manageable recall loss. The trade-off is a small precision drop on the initial retrieval, recovered by running an exact rerank over the top-k shortlist. The result is a system that scales to billions of documents on hardware that would otherwise top out in the millions.
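
A minimal sketch of the idea, assuming the 1024-dimension float32 setup from the paragraph above: quantise each vector to one bit per dimension, scan with Hamming distance, then rerank a small shortlist with the exact float vectors. The corpus size, shortlist size, and brute-force scan are all illustrative; a production system would pair the quantised codes with a real ANN index.

```python
# Binary quantisation with exact rerank (illustrative toy corpus).
import numpy as np

DIM, N_DOCS = 1024, 10_000  # the arithmetic scales linearly with corpus size
rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((N_DOCS, DIM)).astype(np.float32)

print(doc_vecs.nbytes / N_DOCS)           # 4096.0 — 4KB per document at full precision

# Quantise: keep only the sign of each dimension, packed to one bit (32x smaller).
doc_bits = np.packbits(doc_vecs > 0, axis=1)   # shape (N_DOCS, DIM // 8)
print(doc_vecs.nbytes / doc_bits.nbytes)       # 32.0

def search(query: np.ndarray, k: int = 10, shortlist: int = 200) -> np.ndarray:
    # Stage 1: cheap Hamming-distance scan over the packed binary codes.
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(doc_bits ^ q_bits, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[:shortlist]
    # Stage 2: exact float rerank over the shortlist recovers most of the lost recall.
    exact = doc_vecs[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]

top = search(rng.standard_normal(DIM).astype(np.float32))
```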

Practical applications

For legal and academic research, the practical implications are concrete:

  • Vendor “hallucination-free” claims should be treated as marketing. Magesh et al.’s 17–33% hallucination rates are a baseline for what to expect from current commercial legal RAG tools. Verification of every cited authority is non-optional, regardless of vendor claims.
  • Embedding-only retrieval has a ceiling. The Weller et al. theoretical limit means a research group’s full document corpus eventually outgrows what a single-vector index can discriminate. The exact threshold depends on embedding dimensionality, but the qualitative point — naive vector retrieval is not a long-run solution at scale — is stable.
  • PDF parsing is its own problem. LlamaIndex notes that PDFs were never designed to be machine-readable: text is stored as glyph shapes positioned at coordinates with no semantic structure, tables are just lines that happen to look tabular, and reading order is “pure guesswork.” Hybrid extraction (text plus vision models) is now the practical default.
  • Hybrid retrieval should be the baseline, not an optimisation. Keyword search recovers exact-match queries (case names, statute numbers, defined terms) that semantic search routinely misses.
  • Evaluation must be measured, not assumed. Without a labelled retrieval test set, no architectural change can be validated (a minimal metrics sketch follows this list).
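
On that last point, a labelled retrieval test set can be as small as a dictionary mapping queries to the chunk ids that answer them, scored with standard component-level metrics. The sketch below computes mean reciprocal rank (MRR) against any retriever exposed as a callable; the function names and the example labels are illustrative.

```python
# Scoring a retriever against a small labelled test set (MRR).
# `retrieve` is assumed to be any callable returning ranked chunk ids for a query.
from typing import Callable

def mean_reciprocal_rank(retrieve: Callable[[str], list[str]],
                         labels: dict[str, set[str]]) -> float:
    total = 0.0
    for query, gold in labels.items():
        ranked = retrieve(query)
        # reciprocal rank of the first relevant chunk, 0 if none was retrieved
        total += next((1.0 / (i + 1) for i, doc in enumerate(ranked) if doc in gold), 0.0)
    return total / len(labels)

labels = {
    "what is the notice period for termination?": {"contract_s12_c3"},
    "which statute governs data retention?": {"policy_s4_c1", "policy_s4_c2"},
}
# mean_reciprocal_rank(my_retriever, labels) -> one number to track before and
# after every chunking, embedding, or reranking change.
```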

Aakash Gupta and the wider RAG-introduction literature converge on the same advice for teams just starting: build naive RAG to learn the moving parts, then immediately layer in reranking and hybrid retrieval before scaling the corpus. The cost of retrofitting these later is much higher than building them in from the start.

Limitations and open questions

  • Hallucination is reduced, not eliminated. Magesh et al.’s 17–33% rates on commercial legal RAG tools should reset expectations: RAG hides the failure mode behind plausible citations, and verification responsibility falls on the user.
  • Evaluation is unsolved at the system level. Component-level metrics (recall@k, MRR) are well-understood. End-to-end answer quality under realistic queries remains hard to measure without expensive human annotation, which is why the Magesh et al. preregistered methodology was novel.
  • The retrieval-vs-finetuning trade-off keeps moving. As context windows grow and long-context models improve, the boundary between “retrieve into context” and “train into weights” shifts every few months.
  • Graph and hierarchical RAG are under-tooled. The vendor ecosystem is still dominated by flat vector stores; the architectures that scale past Weller et al.’s embedding-dimension ceiling are harder to assemble from off-the-shelf components.

Sources

Papers

  • Magesh, Surani, Dahl, Suzgun, Manning and Ho (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Stanford RegLab.
  • Weller, Boratko, Naim and Lee (2025). On the Theoretical Limitations of Embedding-Based Retrieval. Google DeepMind and Johns Hopkins.

Tweet archive

  • @weaviate_io — Seven-architecture taxonomy (Naive, Rerank, Multimodal, Graph, Hybrid, Agentic, Hierarchical)
  • @_vmlops — Five-component pipeline framing; evaluation as the skipped step
  • @BharukaShraddha — RAG as architecture, not feature; production failure modes
  • @_vmlops — 34-technique RAG catalogue
  • @_avichawla — 32× memory-efficient quantised indexing
  • @aakashgupta — Introductory RAG framing
  • @llama_index — PDF parsing as a hybrid text-plus-vision problem