Intermediate · 25 min

RAG Architecture

The RAG pipeline consists of several key stages that work together to produce enhanced responses. Let’s explore each component and how they interact to create a system that’s greater than the sum of its parts.

The Complete RAG Pipeline

At a high level, data moves through the system like this:

Query → Embedding → Vector Search (Vector DB) → Top-k Docs → Augment (Context + Prompt) → LLM

Step-by-Step Breakdown

Steps:

  1. User submits a natural language query to the system
  2. The query is converted into a vector embedding using an embedding model
  3. Vector similarity search finds the most relevant documents in the knowledge base
  4. Top-k most similar documents are retrieved based on semantic similarity
  5. Retrieved documents are combined with the original query to create an augmented prompt
  6. The LLM generates a response using both the query and retrieved context
  7. The final response is returned to the user, grounded in retrieved documents

Core Components Explained

Let’s break down each component in detail:

1. Query Processing 🔍

What it does: Prepares the user’s question for the retrieval system

Key Functions:

  • Cleans and normalizes the input text
  • Removes stop words and special characters
  • May expand the query with synonyms
  • Prepares text for embedding

Example:

Input: "How do I reset my password???"
Processed: "reset password"
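
A minimal sketch of this preprocessing step in Python, using a tiny hand-rolled stop-word list purely for illustration (production systems typically rely on a library list, or skip stop-word removal entirely when the query is embedded as-is):

import re

# Tiny illustrative stop-word list; real systems usually use a library list
# (e.g. NLTK's) or skip this step when the raw query is embedded directly.
STOP_WORDS = {"how", "do", "i", "my", "the", "a", "an", "to", "is"}

def preprocess_query(query: str) -> str:
    """Lowercase, strip special characters, and drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", query.lower())   # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess_query("How do I reset my password???"))   # -> "reset password"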

2. Embedding Model 🧮

What it does: Transforms text into dense vector representations

Key Characteristics:

  • Converts text to high-dimensional vectors (768-1536 dimensions)
  • Captures semantic meaning, not just keywords
  • Similar concepts have similar vectors
  • Enables semantic search

Popular Models:

  • OpenAI text-embedding-ada-002
  • Sentence Transformers
  • Cohere Embed
  • Google’s Universal Sentence Encoder

Example:

Text: "machine learning"
Vector: [0.23, -0.45, 0.67, ..., 0.12] (1536 dimensions)
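
As a hedged sketch, here is the same idea with the Sentence Transformers library; the model name below is just a common small example and produces 384-dimensional vectors rather than the 1536 shown above:

from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is an illustrative choice; any of the models listed
# above can be swapped in, as long as documents and queries use the same one.
model = SentenceTransformer("all-MiniLM-L6-v2")

vector = model.encode("machine learning")
print(vector.shape)    # (384,)
print(vector[:5])      # first few components of the embedding

The only hard requirement is consistency: documents and queries must be embedded by the same model so their vectors live in the same space.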

3. Vector Database 💾

What it does: Stores document embeddings and enables fast similarity search

Key Features:

  • Specialized for vector similarity search
  • Handles millions of vectors efficiently
  • Supports various distance metrics
  • Enables real-time retrieval

Popular Options:

  • Pinecone
  • Weaviate
  • Qdrant
  • Chroma
  • FAISS

How it works:

  • Documents are pre-embedded and stored
  • Query vector is compared against all stored vectors
  • Returns most similar documents based on distance
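
A minimal sketch of that loop with FAISS, using random vectors as stand-ins for real document embeddings; the dimension (384) is an assumption that must match your embedding model:

import faiss
import numpy as np

dim = 384                                  # must match the embedding model
index = faiss.IndexFlatIP(dim)             # inner product over normalized vectors = cosine similarity

doc_vectors = np.random.rand(1000, dim).astype("float32")   # stand-in embeddings
faiss.normalize_L2(doc_vectors)
index.add(doc_vectors)                     # "pre-embed and store" step

query_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 5)   # 5 nearest documents
print(ids[0], scores[0])

Normalizing the vectors turns the inner-product index into a cosine-similarity search, which is the most common choice.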

4. Retrieval System 📚

What it does: Finds and ranks the most relevant documents

Retrieval Strategies:

  • Semantic Search: Based on vector similarity
  • Keyword Search: Traditional text matching
  • Hybrid Search: Combines both approaches
  • Re-ranking: Refines initial results

Parameters:

  • Top-k: Number of documents to retrieve (typically 3-10)
  • Similarity Threshold: Minimum relevance score
  • Diversity: Ensure varied results
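
As a sketch, a retrieval helper that applies the top-k and threshold parameters above; it assumes the FAISS index from the previous snippet plus a parallel documents list, and the 0.3 cutoff is an arbitrary placeholder:

def retrieve(query_vec, index, documents, top_k=5, min_score=0.3):
    """Return up to top_k documents whose similarity clears min_score."""
    scores, ids = index.search(query_vec, top_k)
    results = []
    for score, doc_id in zip(scores[0], ids[0]):
        if doc_id != -1 and score >= min_score:    # -1 means "no match"
            results.append({"id": int(doc_id),
                            "score": float(score),
                            "text": documents[doc_id]})
    return results

A hybrid setup would compute a keyword score (for example BM25) alongside the vector score and merge the two rankings before applying the cutoff; a re-ranker would then re-score just these few candidates with a heavier model.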

5. Context Augmentation 🔗

What it does: Combines retrieved documents with the query

Process:

  1. Takes top-k retrieved documents
  2. Formats them into a structured prompt
  3. Adds the original query
  4. Creates a complete prompt for the LLM

Example Prompt Template:

Context:
[Document 1: Password reset instructions...]
[Document 2: Account security guidelines...]
[Document 3: Two-factor authentication setup...]

Question: How do I reset my password?

Answer based on the context provided above:
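
Augmentation itself is plain string formatting. A small sketch, assuming the list of document dicts returned by the retrieve helper above:

def build_prompt(query: str, retrieved_docs: list) -> str:
    """Fill the template above with retrieved documents and the query."""
    context = "\n".join(
        f"[Document {i + 1}: {doc['text']}]"
        for i, doc in enumerate(retrieved_docs)
    )
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {query}\n\n"
        "Answer based on the context provided above:"
    )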

6. Language Model 🤖

What it does: Generates the final response using query and context

Key Capabilities:

  • Understands natural language
  • Synthesizes information from multiple sources
  • Generates coherent, contextual responses
  • Can cite specific sources

Popular Models:

  • GPT-4 / GPT-3.5
  • Claude
  • Llama 2
  • PaLM
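
A minimal generation sketch using the OpenAI Python client as one example; the model name is illustrative, and any of the models above can be swapped in via its own client or a local runtime:

from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def generate_answer(prompt: str) -> str:
    """Send the augmented prompt to a chat model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4",   # illustrative choice
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context and cite the documents you use."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content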

7. Response Formatting 📝

What it does: Presents the answer to the user

Enhancements:

  • Adds source citations
  • Formats for readability
  • Includes confidence scores
  • Provides follow-up suggestions
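
A trivial sketch of that last step, assuming the document dicts from the retrieve helper; confidence scores or follow-up suggestions would be appended in the same way:

def format_response(answer: str, retrieved_docs: list) -> str:
    """Attach a simple source list to the model's answer."""
    sources = ", ".join(f"Doc{doc['id']}" for doc in retrieved_docs)
    return f"{answer}\n\nSources: {sources}"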

Step-by-Step RAG Process

Let’s see the RAG pipeline in action with a concrete example:

RAG Pipeline in Action

  1. The user asks "What are the benefits of RAG?" and the query is received by the system
  2. The query is embedded into a 1536-dimensional vector using an embedding model
  3. The vector database performs a similarity search across millions of document embeddings
  4. The top 5 most relevant documents are retrieved based on cosine similarity scores
  5. The retrieved documents are combined with the query into a structured prompt
  6. The LLM generates a response grounded in the retrieved context, with source citations
  7. The final response is returned to the user: "RAG provides fresh information, reduces hallucinations, and enables source attribution. [Sources: Doc1, Doc2, Doc3]"

Data Flow Summary

Here’s how information flows through the RAG system:

User Query → Query Embedding (vector) → Vector Similarity Search → Top-k Documents Retrieved → Augmented Prompt (Context + Query) → LLM Generation → Grounded Response with Citations
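
Putting it together, the whole flow is a few lines of glue around the component sketches above; every name here (model, index, documents, and the helper functions) comes from those illustrative snippets rather than from any particular framework:

def answer_question(query: str) -> str:
    """End-to-end RAG flow assembled from the component sketches above."""
    cleaned = preprocess_query(query)                        # 1. query processing
    query_vec = model.encode([cleaned])                      # 2. embedding
    faiss.normalize_L2(query_vec)                            #    match the index metric
    docs = retrieve(query_vec, index, documents, top_k=5)    # 3-4. search + retrieval
    prompt = build_prompt(query, docs)                       # 5. context augmentation
    answer = generate_answer(prompt)                         # 6. LLM generation
    return format_response(answer, docs)                     # 7. response formatting

print(answer_question("What are the benefits of RAG?"))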

Key Architecture Insights

1. Two-Stage Process

  • Stage 1: Retrieval (finding relevant information)
  • Stage 2: Generation (creating the response)

2. Separation of Concerns

  • Retrieval handles finding information
  • LLM handles understanding and generation
  • Each component can be optimized independently

3. Scalability

  • Vector databases can handle millions of documents
  • Retrieval is fast (milliseconds)
  • Generation time depends on LLM

4. Flexibility

  • Can swap embedding models
  • Can change retrieval strategies
  • Can use different LLMs
  • Can update knowledge base without retraining

What’s Next?

Now that you understand the architecture, let’s dive deeper into the Retrieval Process on the next page. You’ll learn about vector embeddings and similarity search, and get hands-on experience building a RAG pipeline!
