The Generation Process
Once relevant documents are retrieved, they’re used to augment the LLM’s generation process. This is where RAG’s power becomes evident - the model can now generate responses grounded in actual, retrieved information.
Context Augmentation
The retrieved documents are formatted into a prompt template that the LLM can understand and use effectively.
Prompt Template Structure
Context:
[Document 1 content]
[Document 2 content]
[Document 3 content]
Question: [User's original query]
Instructions: Answer the question based on the context provided above.
If the context doesn't contain enough information, say so.
Cite the specific documents you used in your answer.
Answer:
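In code, this assembly step can be as simple as string formatting. The sketch below is illustrative rather than tied to any framework: the Document dataclass and build_prompt helper are assumed names, and the template mirrors the structure shown above.

from dataclasses import dataclass

@dataclass
class Document:
    title: str    # e.g. "Python 3.12 Release Notes"
    content: str  # the retrieved chunk text

PROMPT_TEMPLATE = """Context:
{context}

Question: {question}

Instructions: Answer the question based on the context provided above.
If the context doesn't contain enough information, say so.
Cite the specific documents you used in your answer.

Answer:"""

def build_prompt(docs: list[Document], question: str) -> str:
    """Number each retrieved document and fill the template."""
    context = "\n\n".join(
        f"Document {i}: {doc.content}" for i, doc in enumerate(docs, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)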
Example Augmented Prompt
Query: “What are the new features in Python 3.12?”
Retrieved Documents:
- Doc 1: Python 3.12 Release Notes (excerpt)
- Doc 2: Performance improvements in Python 3.12
- Doc 3: Breaking changes in Python 3.12
Augmented Prompt:
Context:
Document 1: Python 3.12 was released in October 2023. Key features include:
- Improved error messages with more precise locations
- Per-interpreter GIL for better subinterpreter isolation
- More flexible f-string syntax: nested quotes and multiline expressions (PEP 701)
Document 2: Python 3.12 shows significant performance improvements:
- Up to 5% faster execution compared to Python 3.11
- Optimized frame stack implementation
- Better memory management
Document 3: Breaking changes in Python 3.12:
- Removed deprecated modules: asynchat, asyncore, smtpd
- Changed behavior of some built-in functions
- Updated type hints syntax
Question: What are the new features in Python 3.12?
Answer based on the context provided above:
Grounded Generation
The LLM generates a response while being “grounded” in the retrieved context. This fundamentally changes how the model behaves.
How Grounding Works
Grounded vs Ungrounded Generation
- Ungrounded: the LLM relies only on its training data. It may hallucinate or provide outdated information, and offers no source attribution.
- Grounded: the LLM uses the retrieved context. Facts are based on actual documents, specific sources can be cited, and the result is more accurate and trustworthy.
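A minimal sketch of the grounded path, assuming a generic chat interface: call_llm stands in for whichever client you actually use (it is not a real library function), and the system message is one common way to push the model toward the retrieved context.

GROUNDING_SYSTEM_MESSAGE = (
    "Answer ONLY from the context provided in the user message. "
    "If the context does not contain the answer, say you don't have enough information. "
    "Cite the documents you used."
)

def generate_grounded_answer(context: str, question: str, call_llm) -> str:
    """call_llm is a placeholder: any function that accepts a list of
    {role, content} messages and returns the model's text response."""
    messages = [
        {"role": "system", "content": GROUNDING_SYSTEM_MESSAGE},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    return call_llm(messages)

The ungrounded path is the same call without the context block and without the system constraint, which is exactly where hallucination and stale answers creep in.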
Benefits of Grounding
1. Factual Accuracy ✓
- Responses are based on actual retrieved content
- Facts come from verified sources
- Less likely to make things up
2. Reduced Hallucination ✓
- Model is constrained by provided context
- Can’t invent information not in documents
- Admits when context doesn’t have the answer
3. Source Attribution ✓
- Responses can reference specific documents
- Users can verify information
- Builds trust through transparency
4. Confidence Calibration ✓
- Model can indicate uncertainty
- Can say “based on the provided context…”
- Distinguishes between what it knows and what’s in the context
RAG vs. Standard LLM: Side-by-Side Comparison
Let’s see the dramatic difference RAG makes with a few concrete examples:
Example 1: Recent Information
Query: “What are the new features in Python 3.12?”
Standard LLM vs RAG Response
- Standard LLM: “I don't have information about Python 3.12 as my training data only goes up to early 2023. Python 3.11 introduced features like...” ❌ Outdated; can't help with the current version.
- RAG System: “Python 3.12, released in October 2023, introduces improved error messages, a per-interpreter GIL, new f-string syntax, and up to 5% faster execution. [Sources: Python 3.12 Release Notes]” ✅ Current, accurate, cited.
Example 2: Domain-Specific Knowledge
Query: “What’s our company’s remote work policy?”
Standard LLM Response:
I don't have access to your company's specific policies.
Generally, remote work policies might include...
[Generic information that doesn't help]
❌ Can’t access company-specific information
RAG-Enhanced Response:
According to your company's HR Policy Document (updated March 2024):
- Employees can work remotely up to 3 days per week
- Must be in office on Tuesdays for team meetings
- Requires manager approval for full-time remote work
- Equipment provided: laptop, monitor, ergonomic chair
[Source: HR Policy Document, Section 4.2]
✅ Specific, accurate, company policy with citation
Example 3: Technical Documentation
Query: “How do I configure SSL for our application?”
Standard LLM Response:
To configure SSL, you typically need to:
1. Obtain an SSL certificate
2. Configure your web server
3. Update application settings
[Generic steps that may not match your setup]
❌ Generic advice, not specific to your application
RAG-Enhanced Response:
Based on your application's deployment guide:
1. Place SSL certificates in /etc/ssl/certs/
2. Update config.yaml:
   ssl:
     enabled: true
     cert_path: /etc/ssl/certs/app.crt
     key_path: /etc/ssl/private/app.key
3. Restart the application: sudo systemctl restart app
[Source: Deployment Guide v2.3, Section 7]
✅ Specific to your application, step-by-step, cited
Generation Best Practices
1. Prompt Engineering
Good Prompt Template:
Context: {retrieved_docs}
Question: {user_query}
Instructions:
- Answer based only on the provided context
- If unsure, say "I don't have enough information"
- Cite specific documents used
- Be concise and accurate
Answer:
Why it works:
- Clear instructions for the LLM
- Encourages source citation
- Handles uncertainty gracefully
- Focuses on accuracy
2. Context Window Management
Challenge: LLMs have a limited context window (anywhere from a few thousand to over a hundred thousand tokens, depending on the model), and the retrieved documents must share it with the instructions, the question, and the answer.
Solutions:
- Retrieve fewer, more relevant documents
- Summarize long documents before including
- Prioritize most relevant sections
- Use sliding window for long documents
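As a concrete illustration of the first two solutions, here is a simple token-budgeting sketch. The word-count heuristic in estimate_tokens is a rough stand-in for a real tokenizer (e.g. tiktoken), and the 3,000-token budget is an arbitrary example.

def estimate_tokens(text: str) -> int:
    """Rough heuristic (~1.3 tokens per word); swap in a real tokenizer for accuracy."""
    return int(len(text.split()) * 1.3)

def select_docs_for_budget(ranked_docs: list[str], budget: int = 3000) -> list[str]:
    """Keep the highest-ranked documents that fit the budget, leaving room in the
    context window for instructions, the question, and the answer."""
    selected, used = [], 0
    for doc in ranked_docs:  # assumed sorted by relevance, best first
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break
        selected.append(doc)
        used += cost
    return selected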
3. Citation Formatting
Inline Citations:
Python 3.12 introduces improved error messages [1] and
performance improvements of up to 5% [2].
Sources:
[1] Python 3.12 Release Notes
[2] Python 3.12 Performance Benchmarks
Document References:
According to the Python 3.12 Release Notes, key features include...
[Source: Python 3.12 Release Notes, October 2023]
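One way to wire this up is to number the retrieved documents when building the context and emit a matching source list. In this sketch the title field is an assumed piece of document metadata, not a requirement of any particular vector store.

def format_context_with_citations(docs: list[dict]) -> tuple[str, str]:
    """docs: list of {"title": ..., "content": ...} dicts (assumed metadata shape).
    Returns (context block with [n] markers, numbered source list)."""
    context_lines, source_lines = [], []
    for i, doc in enumerate(docs, start=1):
        context_lines.append(f"[{i}] {doc['content']}")
        source_lines.append(f"[{i}] {doc['title']}")
    return "\n\n".join(context_lines), "\n".join(source_lines)

The source list can be appended to the model's answer verbatim, or the model can be instructed to emit the [n] markers itself, with the list used to resolve them.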
4. Handling Insufficient Context
When the context doesn’t have the answer:
Bad Response:
[Makes up an answer anyway]
Good Response:
Based on the provided documents, I don't have enough
information to answer this question. The available
context covers X and Y, but doesn't address Z.
Would you like me to search for additional information?
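Detecting that fallback programmatically can be as crude as scanning the answer for a refusal phrase and widening the search. This is a deliberately naive sketch; search_more is a hypothetical callback into your retriever.

INSUFFICIENT_MARKERS = (
    "don't have enough information",
    "not enough information",
    "doesn't address",
)

def answer_or_fallback(response: str, question: str, search_more) -> str:
    """If the model signals insufficient context, try a broader retrieval pass.
    search_more is a hypothetical callback (e.g. relax filters, widen top-k)."""
    if any(marker in response.lower() for marker in INSUFFICIENT_MARKERS):
        extra_docs = search_more(question)
        if extra_docs:
            return response + "\n\n(Additional documents found; consider re-running the answer with them.)"
    return response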
Response Quality Metrics
How do we measure if generation is working well?
Key Metrics
1. Faithfulness
- Is the response grounded in the retrieved context?
- Does it cite actual information from documents?
- Measured by comparing response to source documents
2. Answer Relevance
- Does the response actually answer the question?
- Is it on-topic and helpful?
- Measured by semantic similarity to query
3. Context Relevance
- Were the retrieved documents actually relevant?
- Did they contain information needed to answer?
- Measured by human evaluation or LLM-as-judge
4. Completeness
- Does the response cover all aspects of the question?
- Is any important information missing?
- Measured against ground truth answers
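Faithfulness and context relevance usually need human review or an LLM-as-judge, but answer relevance has a cheap first-pass proxy: embedding similarity between the query and the answer. In this sketch, embed is a placeholder for whatever embedding function you already use for retrieval, and the 0.7 threshold is an arbitrary illustration.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(query: str, answer: str, embed) -> float:
    """embed is a placeholder for your embedding function (the same model used
    for retrieval works fine); higher similarity suggests an on-topic answer."""
    return cosine_similarity(embed(query), embed(answer))

# Example usage (0.7 is an arbitrary threshold):
# score = answer_relevance("What's new in Python 3.12?", generated_answer, embed)
# if score < 0.7:
#     print("Answer may be off-topic; consider re-retrieving or re-prompting.")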
Key Takeaways
Before moving to the final page, remember:
- Context augmentation combines retrieved docs with the query
- Grounded generation constrains the LLM to use provided context
- RAG dramatically outperforms standard LLMs for current/specific information
- Source citation builds trust and enables verification
- Prompt engineering is crucial for quality responses
What’s Next?
In the final page, you’ll test your knowledge with an interactive quiz and learn about next steps for implementing RAG in your own projects!