We begin this tutorial by configuring an evaluation environment around the DeepEval framework to bring unit-testing rigor to our LLM applications. We build a RAG pipeline that bridges retrieval and generation, treats model outputs as testable artifacts, and uses LLM-as-a-judge metrics to quantify performance. Rather than relying on manual inspection, we construct a structured pipeline in which every query, retrieved context, and generated response is validated against a well-defined set of metrics. Check out the FULL CODES here.
import os
from getpass import getpass

print("🔧 Hardening environment (prevents common Colab/py3.12 numpy corruption)…")
!pip -q uninstall -y numpy || true
!pip -q install --no-cache-dir --force-reinstall "numpy==1.26.4"
!pip -q install -U deepeval openai scikit-learn pandas tqdm
print("✅ Packages installed.")

import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    GEval,
)
print("✅ Imports loaded successfully.")

OPENAI_API_KEY = getpass("🔑 Enter OPENAI_API_KEY (leave empty to run without OpenAI): ").strip()
openai_enabled = bool(OPENAI_API_KEY)
if openai_enabled:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
print(f"🔌 OpenAI enabled: {openai_enabled}")
We initialize our environment by stabilizing core dependencies and installing the deepeval framework to ensure a robust testing pipeline. Next, we import specialized metrics like Faithfulness and Contextual Recall while configuring our API credentials to enable automated, high-fidelity evaluation of our LLM responses. Check out the FULL CODES here.
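Because DeepEval is built for unit testing, the same ideas can also be expressed as a pytest-style test and run with the deepeval CLI. The snippet below is a minimal, optional sketch of that workflow (it assumes DeepEval's documented assert_test helper and a configured OPENAI_API_KEY); the rest of the tutorial does not depend on it.

# Optional sketch: DeepEval's pytest-style workflow (assumes assert_test and
# AnswerRelevancyMetric behave as documented, and that OPENAI_API_KEY is set).
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is DeepEval used for?",
        actual_output="DeepEval is a framework for unit testing LLM applications.",
    )
    # Fails the test if the judged relevancy score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.5)])

With the environment ready, we define the small knowledge base our RAG system will retrieve from.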
DOCS = [
    {
        "id": "doc_01",
        "title": "DeepEval Overview",
        "text": (
            "DeepEval is an open-source LLM evaluation framework for unit testing LLM apps. "
            "It supports LLM-as-a-judge metrics, custom metrics like G-Eval, and RAG metrics "
            "such as contextual precision and faithfulness."
        ),
    },
    {
        "id": "doc_02",
        "title": "RAG Evaluation: Why Faithfulness Matters",
        "text": (
            "Faithfulness checks whether the answer is supported by retrieved context. "
            "In RAG, hallucinations occur when the model states claims not grounded in context."
        ),
    },
    {
        "id": "doc_03",
        "title": "Contextual Precision",
        "text": (
            "Contextual precision evaluates how well retrieved chunks are ranked by relevance "
            "to a query. High precision means relevant chunks appear earlier in the ranked list."
        ),
    },
    {
        "id": "doc_04",
        "title": "Contextual Recall",
        "text": (
            "Contextual recall measures whether the retriever returns enough relevant context "
            "to answer the query. Low recall means key information was missed in retrieval."
        ),
    },
    {
        "id": "doc_05",
        "title": "Answer Relevancy",
        "text": (
            "Answer relevancy measures whether the generated answer addresses the user's query. "
            "Even grounded answers can be irrelevant if they don't respond to the question."
        ),
    },
    {
        "id": "doc_06",
        "title": "G-Eval (GEval) Custom Rubrics",
        "text": (
            "G-Eval lets you define evaluation criteria in natural language. "
            "It uses an LLM judge to score outputs against your rubric (e.g., correctness, tone, policy)."
        ),
    },
    {
        "id": "doc_07",
        "title": "What a DeepEval Test Case Contains",
        "text": (
            "A test case typically includes input (query), actual_output (model answer), "
            "expected_output (gold answer), and retrieval_context (ranked retrieved passages) for RAG."
        ),
    },
    {
        "id": "doc_08",
        "title": "Common Pitfall: Missing expected_output",
        "text": (
            "Some RAG metrics require expected_output in addition to input and retrieval_context. "
            "If expected_output is None, evaluation fails for metrics like contextual precision/recall."
        ),
    },
]
EVAL_QUERIES = [
    {
        "query": "What is DeepEval used for?",
        "expected": "DeepEval is used to evaluate and unit test LLM applications using metrics like LLM-as-a-judge, G-Eval, and RAG metrics.",
    },
    {
        "query": "What does faithfulness measure in a RAG system?",
        "expected": "Faithfulness measures whether the generated answer is supported by the retrieved context and avoids hallucinations not grounded in that context.",
    },
    {
        "query": "What does contextual precision mean?",
        "expected": "Contextual precision evaluates whether relevant retrieved chunks are ranked higher than irrelevant ones for a given query.",
    },
    {
        "query": "What does contextual recall mean in retrieval?",
        "expected": "Contextual recall measures whether the retriever returns enough relevant context to answer the query, capturing key missing information issues.",
    },
    {
        "query": "Why might an answer be relevant but still low quality in RAG?",
        "expected": "An answer can address the question (relevant) but still be low quality if it is not grounded in retrieved context or misses important details.",
    },
]
We define a structured knowledge base consisting of documentation snippets that serve as our ground-truth context for the RAG system. We also establish a set of evaluation queries and corresponding expected outputs to create a “gold dataset,” enabling us to assess how accurately our model retrieves information and generates grounded responses. Check out the FULL CODES here.
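The plain Python lists above are all the rest of the tutorial needs, but if we later want to manage the gold set with DeepEval's own dataset utilities, we could wrap it in Goldens. The sketch below is illustrative only; it assumes the EvaluationDataset and Golden classes exposed by deepeval.dataset in recent releases.

# Optional: store the gold queries as DeepEval Goldens (sketch; assumes the
# deepeval.dataset EvaluationDataset / Golden API of recent DeepEval releases).
from deepeval.dataset import EvaluationDataset, Golden

goldens = [
    Golden(input=item["query"], expected_output=item["expected"])
    for item in EVAL_QUERIES
]
dataset = EvaluationDataset(goldens=goldens)
print(f"📚 Dataset holds {len(dataset.goldens)} goldens.")

Next, we build the retriever that supplies context for each query.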
class TfidfRetriever:
    def __init__(self, docs):
        self.docs = docs
        self.texts = [f"{d['title']}\n{d['text']}" for d in docs]
        self.vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
        self.matrix = self.vectorizer.fit_transform(self.texts)

    def retrieve(self, query, k=4):
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.matrix).flatten()
        top_idx = np.argsort(-sims)[:k]
        results = []
        for i in top_idx:
            results.append(
                {
                    "id": self.docs[i]["id"],
                    "score": float(sims[i]),
                    "text": self.texts[i],
                }
            )
        return results

retriever = TfidfRetriever(DOCS)
We implement a custom TF-IDF Retriever class that transforms our documentation into a searchable vector space using bigram-aware TF-IDF vectorization. This allows us to perform cosine similarity searches against the knowledge base, ensuring we can programmatically fetch the top-k most relevant text chunks for any given query. Check out the FULL CODES here.
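As a quick sanity check of the retriever we just built, and using only the code above (no external services), we can inspect which documents surface for a sample query and with what cosine-similarity scores:

# Sanity check: which docs does the TF-IDF retriever rank highest for a sample query?
sample_hits = retriever.retrieve("What does faithfulness measure in RAG?", k=3)
for hit in sample_hits:
    print(f"{hit['id']}  score={hit['score']:.3f}")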
import re

def extractive_baseline_answer(query, retrieved_contexts):
    """
    Offline fallback: we create a short answer by extracting the most relevant sentences.
    This keeps the notebook runnable even without OpenAI.
    """
    joined = "\n".join(retrieved_contexts)
    sents = re.split(r"(?<=[.!?])\s+", joined)
    keywords = [w.lower() for w in re.findall(r"[a-zA-Z]{4,}", query)]
    scored = []
    for s in sents:
        s_l = s.lower()
        score = sum(1 for k in keywords if k in s_l)
        if len(s.strip()) > 20:
            scored.append((score, s.strip()))
    scored.sort(key=lambda x: (-x[0], -len(x[1])))
    best = [s for sc, s in scored[:3] if sc > 0]
    if not best:
        best = [s.strip() for s in sents[:2] if len(s.strip()) > 20]
    ans = " ".join(best).strip()
    if not ans:
        ans = "I could not find enough context to answer confidently."
    return ans
def openai_answer(query, retrieved_contexts, model="gpt-4.1-mini"):
    """
    Simple RAG prompt for demonstration. DeepEval metrics can still evaluate even if
    your generation prompt differs; the key is we store retrieval_context separately.
    """
    from openai import OpenAI

    client = OpenAI()
    context_block = "\n\n".join([f"[CTX {i+1}]\n{c}" for i, c in enumerate(retrieved_contexts)])
    prompt = f"""You are a concise technical assistant.
Use ONLY the provided context to answer the query. If the answer is not in context, say you don't know.
Query:
{query}
Context:
{context_block}
Answer:"""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

def rag_answer(query, retrieved_contexts):
    if openai_enabled:
        try:
            return openai_answer(query, retrieved_contexts)
        except Exception as e:
            print(f"⚠️ OpenAI generation failed, falling back to extractive baseline. Error: {e}")
            return extractive_baseline_answer(query, retrieved_contexts)
    else:
        return extractive_baseline_answer(query, retrieved_contexts)
We implement a hybrid answering mechanism that prioritizes high-fidelity generation via OpenAI while maintaining a keyword-based extractive baseline as a reliable fallback. By isolating the retrieval context from the final generation, we ensure our DeepEval test cases remain consistent regardless of whether the answer is synthesized by an LLM or extracted programmatically. Check out the FULL CODES here.
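Before constructing the full test set, we can smoke-test the pipeline end to end on a single query; thanks to the extractive fallback, this runs even without an OpenAI key:

# End-to-end smoke test: retrieve context for one query and generate an answer.
demo_query = "What is DeepEval used for?"
demo_context = [r["text"] for r in retriever.retrieve(demo_query, k=3)]
print(rag_answer(demo_query, demo_context))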
test_cases = []
K = 4

for item in tqdm(EVAL_QUERIES):
    q = item["query"]
    expected = item["expected"]
    retrieved = retriever.retrieve(q, k=K)
    retrieval_context = [r["text"] for r in retrieved]
    actual = rag_answer(q, retrieval_context)
    tc = LLMTestCase(
        input=q,
        actual_output=actual,
        expected_output=expected,
        retrieval_context=retrieval_context,
    )
    test_cases.append(tc)

print(f"✅ Built {len(test_cases)} LLMTestCase objects.")

metrics = [
    AnswerRelevancyMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
    FaithfulnessMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
    ContextualRelevancyMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
    ContextualPrecisionMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
    ContextualRecallMetric(threshold=0.5, model="gpt-4.1", include_reason=True, async_mode=True),
    GEval(
        name="RAG Correctness Rubric (GEval)",
        criteria=(
            "Score the answer for correctness and usefulness. "
            "The answer must directly address the query, must not invent facts not supported by context, "
            "and should be concise but complete."
        ),
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT,
        ],
        model="gpt-4.1",
        threshold=0.5,
        async_mode=True,
    ),
]
print("\n✅ Metrics configured.")

if not openai_enabled:
    print("\n⚠️ You did NOT provide an OpenAI API key.")
    print("DeepEval's LLM-as-a-judge metrics (AnswerRelevancy/Faithfulness/Contextual* and GEval) require an LLM judge.")
    print("Re-run this cell and provide OPENAI_API_KEY to run DeepEval metrics.")
    print("\n✅ However, your RAG pipeline + test case construction succeeded end-to-end.")
    rows = []
    for i, tc in enumerate(test_cases):
        rows.append({
            "id": i,
            "query": tc.input,
            "actual_output": tc.actual_output[:220] + ("…" if len(tc.actual_output) > 220 else ""),
            "expected_output": tc.expected_output[:220] + ("…" if len(tc.expected_output) > 220 else ""),
            "contexts": len(tc.retrieval_context or []),
        })
    display(pd.DataFrame(rows))
    raise SystemExit("Stopped before evaluation (no OpenAI key).")
We execute the RAG pipeline to generate LLMTestCase objects by pairing our retrieved context with model-generated answers and ground-truth expectations. We then configure a comprehensive suite of DeepEval metrics, including G-Eval and specialized RAG indicators, to evaluate the system’s performance using an LLM-as-a-judge approach. Check out the FULL CODES here.
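When a single case needs debugging, we do not have to run the whole suite: each DeepEval metric can also be scored on one test case at a time. The snippet below is a sketch that assumes the measure() method and the score/reason attributes exposed by recent DeepEval releases, plus an OpenAI key for the judge.

# Score one metric on one test case (sketch; assumes metric.measure(), .score, and
# .reason behave as in recent DeepEval releases, and that an OpenAI key is configured).
single_metric = FaithfulnessMetric(threshold=0.5, model="gpt-4.1", include_reason=True)
single_metric.measure(test_cases[0])
print("Faithfulness score:", single_metric.score)
print("Reason:", single_metric.reason)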
results = evaluate(test_cases=test_cases, metrics=metrics)

# Placeholder scaffold with one column per metric; the consolidated table below is
# built from df_base + df_metrics.
summary_rows = []
for idx, tc in enumerate(test_cases):
    row = {
        "case_id": idx,
        "query": tc.input,
        "actual_output": tc.actual_output[:200] + ("…" if len(tc.actual_output) > 200 else ""),
    }
    for m in metrics:
        row[m.__class__.__name__] = None
    summary_rows.append(row)

def try_extract_case_metrics(results_obj):
    """Best-effort extraction of per-case scores and reasons across DeepEval versions."""
    extracted = []
    candidates = []
    for attr in ["test_results", "results", "evaluations"]:
        if hasattr(results_obj, attr):
            candidates = getattr(results_obj, attr)
            break
    if not candidates and isinstance(results_obj, list):
        candidates = results_obj
    for case_i, case_result in enumerate(candidates or []):
        item = {"case_id": case_i}
        metrics_list = None
        for attr in ["metrics_data", "metrics", "metric_results"]:
            if hasattr(case_result, attr):
                metrics_list = getattr(case_result, attr)
                break
        if isinstance(metrics_list, dict):
            for k, v in metrics_list.items():
                item[f"{k}_score"] = getattr(v, "score", None) if v is not None else None
                item[f"{k}_reason"] = getattr(v, "reason", None) if v is not None else None
        else:
            for mr in metrics_list or []:
                name = getattr(mr, "name", None) or getattr(getattr(mr, "metric", None), "name", None)
                if not name:
                    name = mr.__class__.__name__
                item[f"{name}_score"] = getattr(mr, "score", None)
                item[f"{name}_reason"] = getattr(mr, "reason", None)
        extracted.append(item)
    return extracted

case_metrics = try_extract_case_metrics(results)

df_base = pd.DataFrame([{
    "case_id": i,
    "query": tc.input,
    "actual_output": tc.actual_output,
    "expected_output": tc.expected_output,
} for i, tc in enumerate(test_cases)])

df_metrics = pd.DataFrame(case_metrics) if case_metrics else pd.DataFrame([])
# Guard against an empty metrics frame so the merge cannot fail on a missing key.
df = df_base.merge(df_metrics, on="case_id", how="left") if not df_metrics.empty else df_base.copy()
score_cols = [c for c in df.columns if c.endswith("_score")]

compact = df[["case_id", "query"] + score_cols].copy()
print("\n📊 Compact score table:")
display(compact)

print("\n🧾 Full details (includes reasons):")
display(df)

print("\n✅ Done. Tip: if contextual precision/recall are low, improve retriever ranking/coverage; if faithfulness is low, tighten generation to only use context.")
We finalize the workflow by executing the evaluate function, which triggers the LLM-as-a-judge process to score each test case against our defined metrics. We then aggregate these scores and their corresponding qualitative reasoning into a centralized DataFrame, providing a granular view of where the RAG pipeline excels or requires further optimization in retrieval and generation.
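To compare these scores across runs, for example after changing the retriever's k or the generation prompt, one simple option is to persist the consolidated DataFrame; the filename below is illustrative, not part of the original notebook.

# Persist per-case scores and judge reasons for run-over-run comparison.
# "deepeval_rag_results.csv" is an illustrative filename, not from the original notebook.
df.to_csv("deepeval_rag_results.csv", index=False)
print("💾 Saved results to deepeval_rag_results.csv")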
Finally, we run the complete evaluation suite, in which DeepEval turns free-form model outputs into actionable data through metrics such as Faithfulness, Contextual Precision, and the G-Eval rubric. This systematic approach lets us pinpoint silent failures in retrieval and hallucinations in generation, with the judge's reasoning available to justify architectural changes. With these results, we move from experimental prototyping toward a production-ready RAG system backed by a verifiable, metric-driven safety net.
Check out the FULL CODES here.