PORTFOLIO CASE STUDY • MEDICAL AI • RAG

RAG Medical Assistant

Grounded answers over the Merck Manual corpus.
By Zaheer Shaikh • Evaluation-first RAG engineering • Architecture + safety gates

A Retrieval-Augmented Generation system designed to produce medically grounded, citation-backed responses over the Merck Manual corpus, built with an evaluation-first methodology and iterative tuning (R1–R5) to improve groundedness, relevance, and faithfulness.

1. The Problem

Large language models generate fluent answers, but medical domains demand verifiable grounding, citation-backed claims, controlled hallucination behavior, and transparent evaluation.

The objective was not just to generate answers, but to build a measurable RAG system with structured evaluation loops.


2. System Architecture

Merck Manual Corpus
    ↓
Document Cleaning
    ↓
Chunking Strategy
    ↓
Embeddings
    ↓
Vector Store
    ↓
Retriever (Top-k)
    ↓
Generator (LLM)
    ↓
Grounded Response + Citations

Principles: separation of retrieval and generation; citation-first prompting; refusal when context is insufficient; evaluation-driven tuning loop.
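The retrieval half of the pipeline above can be sketched in a few lines. This is a minimal illustration, not the production code: it uses a bag-of-words stand-in for the sentence-transformers embeddings and an in-memory list in place of ChromaDB, and the names `chunk`, `embed`, and `retrieve` are illustrative.

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Split cleaned text into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Stand-in embedding: bag-of-words counts.
    The real system uses sentence-transformers vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Top-k similarity search over the chunk store
    (ChromaDB plays this role in the real system)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Illustrative placeholder passages, not actual Merck Manual text.
docs = [
    "Chapter 1: anemia overview and classification of anemias",
    "Iron deficiency anemia causes include chronic blood loss and poor dietary intake",
    "Chapter 3: treatment options and iron supplementation guidance",
]
top = retrieve("causes of iron deficiency anemia", docs, k=1)
```

Keeping retrieval behind a small interface like this is what makes the R1–R5 loop cheap: chunk size and k can change without touching the generator.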

[Figure] End-to-End RAG Pipeline: from corpus ingestion to citation-backed responses.
[Figure] Retrieval vs Generation Layers: context quality governs output quality.

3. Iterative Tuning Framework (R1 → R5)

Instead of a static implementation, the system was tuned across multiple controlled iterations. Each version was evaluated systematically.

Version | Change Focus             | Improvement Objective
R1      | Baseline RAG             | Establish baseline groundedness
R2      | Chunk size tuning        | Improve contextual completeness
R3      | Retriever k optimization | Improve relevance precision
R4      | Prompt refinement        | Reduce hallucinations
R5      | Final tuning             | Maximize grounded + faithful outputs
[Figure] Evaluation + Tuning Loop (R1–R5): configure → evaluate → measure → tune → repeat.
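The configure → evaluate → measure → tune loop reduces to a small search over retrieval settings. The sketch below is hypothetical: `evaluate` is a stub with made-up scores standing in for the real harness, and `tune` simply keeps the best-scoring configuration, which is the shape of the R2/R3 iterations (chunk size and k).

```python
from itertools import product

def evaluate(config):
    """Stub scorer. In the real harness this would run the RAG pipeline
    on a labelled query set and return a mean groundedness/faithfulness
    score; the numbers below are illustrative placeholders."""
    scores = {(256, 3): 0.71, (256, 5): 0.74, (512, 3): 0.78, (512, 5): 0.76}
    return scores[config]

def tune(chunk_sizes, ks):
    """Configure -> evaluate -> measure -> keep the best configuration."""
    best_cfg, best_score = None, -1.0
    for cfg in product(chunk_sizes, ks):
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In practice each `evaluate` call is expensive (a full pass over the query set), which is why the project runs a handful of deliberate iterations (R1–R5) rather than an exhaustive sweep.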

4. Evaluation Methodology

Unlike typical demos, this system includes structured evaluation along four axes: groundedness, relevance, faithfulness, and run-to-run consistency.

Key insight: prompt engineering alone is insufficient; retrieval configuration materially impacts hallucination control.
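To make a metric like groundedness concrete, here is one crude proxy: the share of content words in the answer that also appear in the retrieved context. This is an assumption about how such a metric could be computed, not the project's actual harness, which may use stronger judges; the function name and stopword list are illustrative.

```python
STOPWORDS = frozenset({"the", "a", "an", "is", "are", "of", "and", "in", "by"})

def groundedness(answer, context):
    """Fraction of content tokens in the answer that appear in the
    retrieved context. 1.0 = every claim token is attested; lower
    values flag possible hallucination. A deliberately crude proxy."""
    ans = [t for t in answer.lower().split() if t not in STOPWORDS]
    ctx = set(context.lower().split())
    if not ans:
        return 1.0
    return sum(t in ctx for t in ans) / len(ans)
```

Even a proxy this simple is enough to compare R1–R5 configurations against each other, which is the point of the loop: relative movement under a fixed metric, not absolute truth.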


5. Example Query Flow

Query: What are the causes of iron deficiency anemia?
Retrieval: Top-k relevant Merck Manual sections extracted.
Generation: Response generated strictly from retrieved context.
Output behavior:
• Grounded explanation
• Avoids speculative claims
• Refuses when context is insufficient
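The citation-first prompting and refusal behavior described above can be sketched as a prompt builder. The template wording and the `min_chunks` threshold are illustrative assumptions, not the project's exact prompt.

```python
def build_prompt(query, chunks, min_chunks=1):
    """Assemble a citation-first prompt; return None when retrieval is
    insufficient so the caller can emit a fixed refusal instead of
    letting the LLM speculate."""
    if len(chunks) < min_chunks:
        return None
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer strictly from the context below. Cite sources as [n]. "
        "If the context does not answer the question, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Routing the empty-retrieval case around the generator entirely, rather than asking the model to refuse, is the safer default for a medical assistant: the refusal is deterministic.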

6. Technical Stack

LLM: Mistral-7B-Instruct (inference optimized)

Embeddings: sentence-transformers

Vector Store: ChromaDB

Framework: LangChain

Runtime: Google Colab / Local GPU

Evaluation Loop: custom evaluation harness


7. Conclusion

This project demonstrates how Retrieval-Augmented Generation should be built: not as a demo, but as an evaluated, measurable, architected system.

Interested in safety-first RAG systems?

Let's build domain assistants where trust is non-negotiable.