Build a Document Q and A Tool With RAG From Scratch in 2026

How to ship a document Q&A system using retrieval-augmented generation, the four architectural decisions that matter, and the costs to expect

To build a document Q&A tool with RAG (retrieval-augmented generation) in 2026, you need four architectural pieces: a document ingestion pipeline that chunks and embeds your content, a vector database to store and search embeddings, a retrieval step that finds relevant chunks for each query, and an LLM that generates answers grounded in the retrieved chunks. The whole stack ships in 4 to 8 hours of focused work using modern tools (LangChain or LlamaIndex for orchestration, Pinecone or pgvector for storage, OpenAI or Voyage for embeddings, Claude or GPT for generation), and the resulting system answers questions about your documents with high accuracy and inline citations.

This piece walks through each of the four pieces, the architectural choices that affect quality and cost, the chunking strategy that produces good answers, and the common mistakes that turn RAG systems into hallucination factories rather than useful knowledge interfaces.

The patterns described here apply equally to internal knowledge bases, customer-facing documentation Q&A, and specialized domain assistants. The corpus changes; the architecture is the same.

Why RAG Matters in 2026

RAG is the dominant pattern for getting LLMs to answer questions about specific documents accurately. Without RAG, you either fine-tune a model on your documents (expensive, slow to update) or stuff everything into the context window (limited and expensive at scale). RAG splits the difference: cheap to update, scales to millions of documents, and produces answers grounded in source material.

The pattern has matured dramatically since 2023. The 2026 RAG stack is well-understood, the tools are mature, and the architectural patterns are settled. The remaining differentiator is in the details: how you chunk documents, which embedding model you use, how many chunks you retrieve per query, and how you prompt the LLM to use the chunks.

Key Takeaway

A 2025 LangChain benchmark across 50 RAG implementations found that the difference between the worst and best implementations was 4x in answer quality despite using similar tools. The differentiator was chunking strategy (semantic vs fixed-size), retrieval count (top-3 vs top-10), and prompt engineering (with vs without explicit grounding instructions). Same tools, dramatically different outcomes; the architecture choices matter more than the tool choices.

The pattern to copy is the way library reference systems work. A librarian does not memorize every book; they know how to find the relevant book for a question, then read the relevant section to answer. RAG does the same: the LLM does not memorize your documents; it knows how to find and read the relevant chunks for each question.

The Four Architectural Pieces

Each piece has a clear job and well-understood implementation patterns in 2026.

Piece 1, document ingestion. Take your documents (PDFs, markdown, web pages), split them into chunks, generate embeddings for each chunk, and store the embeddings with metadata. This is a one-time process that runs when documents are added or updated.
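The ingestion step described above can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in (a hashed bag of words) for a real embedding model, and `ChunkRecord` and `ingest` are names invented for this example. A real system would call an embedding API and write the records to a vector database instead of returning a list.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # toy embedding size, chosen only for this sketch


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class ChunkRecord:
    doc_id: str
    chunk_index: int
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


def ingest(doc_id: str, text: str, source: str, words_per_chunk: int = 50) -> list[ChunkRecord]:
    """Split a document into word-based chunks, embed each one, and attach metadata."""
    words = text.split()
    records = []
    for start in range(0, len(words), words_per_chunk):
        chunk_text = " ".join(words[start:start + words_per_chunk])
        records.append(ChunkRecord(
            doc_id=doc_id,
            chunk_index=len(records),
            text=chunk_text,
            embedding=toy_embed(chunk_text),
            metadata={"source": source},
        ))
    return records
```

Storing the source alongside each embedding is what later makes citations possible: the retrieval step can hand the LLM both the chunk text and where it came from.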

Piece 2, vector database. Store and search the embeddings. Pinecone, Weaviate, and Qdrant are dedicated vector DBs. pgvector turns Postgres into a vector DB. Supabase has pgvector built in. For most apps, pgvector is sufficient and dramatically simpler operationally than a dedicated vector DB.

Figure: The four-piece RAG architecture, shown as a four-stage pipeline. Stage 1, document ingestion: chunk and embed, runs once per document. Stage 2, vector database: store and search embeddings (pgvector, Pinecone, Qdrant). Stage 3, retrieval: find the top-k relevant chunks, runs per query. Stage 4, LLM generation: answer grounded in the chunks (Claude, GPT, Gemini).
Four pieces form the RAG architecture. Each one has a clear job and standard tools; the quality is in the details, not the tool choice.

Piece 3, retrieval. When a user asks a question, embed the question, search the vector database for the top-k most similar chunks, and return them. Top-5 to top-10 is the typical range; go lower to cut cost and keep the context focused, higher for broader coverage.
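As a rough sketch of what the retrieval step does, the snippet below ranks chunks by cosine similarity to the query embedding and keeps the top k. It is a brute-force, in-memory illustration; a vector database performs the same ranking with an index. The function names and the `(text, embedding)` tuple layout are assumptions made for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, chunks, k=5):
    """chunks is a list of (chunk_text, embedding) pairs.

    Returns up to k (score, chunk_text) pairs, most similar first.
    """
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Keeping the score next to each chunk makes it easy to experiment with a minimum-similarity cutoff in addition to a fixed k.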

Piece 4, LLM generation. Pass the retrieved chunks plus the user's question to an LLM with a prompt that instructs it to answer using only the chunks. The output is the answer plus optional citations to the source chunks, which gives users a way to verify the response against the original source material when accuracy matters.
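One way to write the grounding prompt is sketched below. The wording of the instructions and the `build_grounded_prompt` name are illustrative assumptions, not a prescribed template; the resulting string would be sent to whichever LLM you use.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that asks the model to answer only from numbered passages.

    Each chunk is a dict with 'text' and 'source' keys (a layout assumed for this sketch).
    """
    numbered = []
    for i, chunk in enumerate(chunks, start=1):
        numbered.append(f"[{i}] (source: {chunk['source']})\n{chunk['text']}")
    context = "\n\n".join(numbered)
    return (
        "Answer the question using only the numbered context passages below.\n"
        "Cite the passages you used in square brackets, for example [1].\n"
        "If the passages do not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the passages gives the model a compact way to cite, and the bracketed numbers can be mapped back to source documents when the answer is rendered.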

The Chunking Strategy That Works

Chunking is where most RAG implementations make their biggest quality decisions. Three patterns dominate in 2026.

Fixed-size chunking. Split documents into chunks of N tokens (typically 500 to 1000) with M tokens of overlap (typically 100 to 200). Simple, fast, and works for most use cases.
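A minimal sketch of fixed-size chunking with overlap follows. It works on a list of tokens that has already been produced; splitting on whitespace is a rough stand-in for a real tokenizer, and the `chunk_fixed_size` name is made up for this example.

```python
def chunk_fixed_size(tokens: list[str], chunk_size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of embedding some tokens twice.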

Semantic chunking. Split documents at semantic boundaries (paragraph breaks, heading changes, topic shifts). Produces better answers for documents with clear structure but is slower to ingest.
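The simplest form of this idea, splitting on structural boundaries, might look like the sketch below. It assumes markdown-style headings (lines starting with `#`) and blank lines between paragraphs, and it groups paragraphs under the most recent heading. Detecting topic shifts inside unstructured text needs more than this, so treat it as an illustration of the structural case only; `chunk_by_structure` is a hypothetical name.

```python
import re


def chunk_by_structure(text: str) -> list[dict]:
    """Group paragraphs under the most recent markdown-style heading.

    Returns a list of {'heading': str or None, 'text': str}. Headings with no
    body text before the next heading are dropped.
    """
    chunks = []
    current = {"heading": None, "paragraphs": []}
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        first_line, _, rest = block.partition("\n")
        if first_line.startswith("#"):
            # A heading closes the previous chunk and opens a new one.
            if current["paragraphs"]:
                chunks.append(current)
            current = {"heading": first_line.lstrip("#").strip(), "paragraphs": []}
            if rest.strip():
                current["paragraphs"].append(rest.strip())
        else:
            current["paragraphs"].append(block)
    if current["paragraphs"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n\n".join(c["paragraphs"])} for c in chunks]
```

Carrying the heading along with each chunk is useful metadata: it can be prepended to the chunk before embedding or shown in citations.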

Hybrid chunking. Use semantic chunking where structure exists, and fall back to fixed-size chunking for unstructured text. This combines the strengths of both and is what most production RAG systems use in 2026, especially when documents come from heterogeneous sources like a mix of PDFs, web pages, and markdown.
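A hedged sketch of the hybrid approach: split on paragraph boundaries first, and only fall back to overlapping fixed-size windows when a paragraph is too long. Word counts stand in for token counts here, and the function names and default sizes are assumptions for illustration.

```python
import re


def fixed_windows(words: list[str], size: int, overlap: int) -> list[str]:
    """Overlapping fixed-size windows over a list of words."""
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return windows


def chunk_hybrid(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Keep short paragraphs whole; split long ones into overlapping windows."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        words = paragraph.split()
        if not words:
            continue
        if len(words) <= max_words:
            chunks.append(" ".join(words))  # structural boundary is good enough
        else:
            chunks.extend(fixed_windows(words, max_words, overlap))  # fallback
    return chunks
```

The same pattern extends to headings or other structural markers: use the structure when it yields chunks of a reasonable size, and window the rest.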

The Cost Profile to Expect

RAG costs split between ingestion (one-time per document) and query (per user question). Both have predictable patterns.

Figure: RAG cost profile, shown as a split panel. Ingestion costs (one-time per document): embedding at roughly $0.02 per 1M tokens, storage at roughly $0.50 per GB per month. Query costs: query embedding at roughly $0.0001 per query, retrieval free or low, LLM generation at $0.01 to $0.10 per answer.
RAG costs split between one-time ingestion and per-query work. LLM generation dominates query costs; optimize there first.

Ingestion costs. Embedding 1 million tokens of documents costs roughly $0.02 with OpenAI text-embedding-3-small. Storage in pgvector or Pinecone is roughly $0.50 per GB per month. Total for most small apps: under $5 per month for the corpus.

Query costs. Each question requires one embedding (negligible cost), one vector search (free in pgvector, cheap in dedicated vector DBs), and one LLM generation (the expensive part: $0.01 to $0.10 per answer depending on model and length).

For most apps, LLM generation dominates costs. Focus optimization on choosing the right model (a smaller model such as Claude Haiku for simple questions, a larger one such as Claude Sonnet for complex ones) and on keeping the retrieved context concise (fewer chunks per query).
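To make the arithmetic concrete, here is a small back-of-the-envelope calculator using the rough figures quoted above ($0.02 per million embedded tokens, $0.50 per GB per month of storage, $0.01 to $0.10 per generated answer). The corpus size and query volume in the usage lines are made-up inputs, and actual prices vary by provider and change over time.

```python
EMBEDDING_USD_PER_MILLION_TOKENS = 0.02  # rough figure from the text above
STORAGE_USD_PER_GB_MONTH = 0.50          # rough figure from the text above


def ingestion_cost(corpus_tokens: int) -> float:
    """One-time cost of embedding the whole corpus, in USD."""
    return corpus_tokens / 1_000_000 * EMBEDDING_USD_PER_MILLION_TOKENS


def monthly_cost(queries_per_month: int, usd_per_answer: float, storage_gb: float) -> float:
    """Recurring monthly cost: LLM generation plus vector storage, in USD.

    Query embedding and vector search are treated as negligible, as in the text.
    """
    generation = queries_per_month * usd_per_answer
    storage = storage_gb * STORAGE_USD_PER_GB_MONTH
    return generation + storage


# Hypothetical app: 5M-token corpus, 1 GB of vectors, 1,000 questions per month.
one_time = ingestion_cost(5_000_000)
low = monthly_cost(1_000, usd_per_answer=0.01, storage_gb=1.0)
high = monthly_cost(1_000, usd_per_answer=0.10, storage_gb=1.0)
print(f"ingestion: ${one_time:.2f}, monthly: ${low:.2f} to ${high:.2f}")
```

With these inputs the one-time embedding cost is a few cents while generation runs to tens of dollars a month, which is why model choice and context size are the first levers to pull.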

Common Mistake

The most damaging RAG mistake is retrieving too many chunks per query and stuffing them all into the LLM context. More context does not produce better answers; it produces more confused answers because the LLM has to figure out which chunks are actually relevant. Top-3 to top-7 chunks is the sweet spot for most applications. If your retrieval is producing too few relevant chunks, fix the chunking or embedding quality rather than retrieving more.

The other mistake is treating RAG as a static system. Documents change, user questions evolve, and the optimal chunking and retrieval parameters drift over time. Plan for periodic re-indexing (when the document corpus changes substantially) and quarterly reviews of retrieval quality (looking at queries that produced bad answers and tuning the system).

A useful discipline is to maintain a small set of evaluation questions with known correct answers and run them against the RAG system after every meaningful change. The eval set takes maybe an hour to build initially and produces objective signal on whether your changes are improving or degrading quality. Without an eval set, RAG quality drifts silently and you do not notice until users complain.
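A bare-bones version of such an eval harness might look like the sketch below. It assumes each eval case lists keywords that a correct answer should contain, which is a crude but cheap scoring rule; `run_eval`, the case layout, and the stub answer function are all hypothetical. In practice `answer_fn` would call your full RAG pipeline.

```python
def run_eval(answer_fn, eval_set: list[dict]) -> tuple[float, list[dict]]:
    """Run every eval question through answer_fn and check for expected keywords.

    Each case is {'question': str, 'expected_keywords': [str, ...]}.
    Returns (pass_rate, failures), where failures lists the missing keywords.
    """
    failures = []
    for case in eval_set:
        answer = answer_fn(case["question"]).lower()
        missing = [kw for kw in case["expected_keywords"] if kw.lower() not in answer]
        if missing:
            failures.append({"question": case["question"], "missing": missing})
    pass_rate = (len(eval_set) - len(failures)) / len(eval_set) if eval_set else 0.0
    return pass_rate, failures


# Usage with a stub in place of the real RAG system.
def stub_answer(question: str) -> str:
    return "Refunds are processed within 14 days."


eval_set = [
    {"question": "How long do refunds take?", "expected_keywords": ["14 days"]},
    {"question": "Which plans include SSO?", "expected_keywords": ["enterprise"]},
]
rate, failed = run_eval(stub_answer, eval_set)
print(f"pass rate: {rate:.0%}, failed: {[f['question'] for f in failed]}")
```

Comparing the pass rate before and after a chunking or prompt change gives the objective signal described above, and the failure list points at which questions regressed.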

A related habit is to log queries that produced thumbs-down feedback (or any negative signal) and review them weekly. Most weeks surface a few questions where the retrieval missed the relevant chunks; fixing the chunking or adding the missing content closes the gap quickly and prevents the same questions from failing again.

What This Means For You

RAG is one of the most useful AI patterns to learn in 2026. The architecture is well-understood, the tools are mature, and the resulting systems are genuinely valuable for any product with substantial documentation or content.

  • If you're a founder: Add RAG to your product if you have any meaningful corpus of customer-facing documentation. The support ticket reduction and conversion lift are both real.
  • If you're changing careers: RAG is one of the most in-demand AI skills in 2026. Building two or three RAG projects produces a portfolio that gets interviews.
  • If you're a student: Build a RAG system over a personal document corpus (your notes, your bookmarks, your code). The hands-on experience teaches more than any course.

Pranay Joshi

20+ years building products at scale. VP of Product & Engineering, startup founder, and AI coach. Helping dreamers turn ideas into reality with vibe coding.

Written for Indie Hackers
