How to Build AI Eval Pipelines with DeepEval

TL;DR

The Goal: Build an automated evaluation pipeline to score LLM and RAG outputs.
The Stack: DeepEval, Confident AI, OpenAI.
The Outcome: A fully observable CI/CD pipeline for tracking LLM quality, catching regressions, and optimizing prompts.

Why Build This?

Testing LLMs is fundamentally different from traditional software testing. You cannot rely on exact string matches when the output is non-deterministic. Without a proper evaluation framework, deploying AI into production means flying blind. You might break a core feature or introduce hallucinations with a simple prompt tweak, and you will not know until users complain.

By automating LLM evaluations with tools like DeepEval and tracking results in Confident AI, you gain the ability to catch regressions early. DeepEval acts as an "LLM-as-a-judge" framework, running locally to score responses against 50+ built-in metrics such as answer relevancy, contextual precision, and hallucination. When combined with Confident AI, you get a centralized dashboard for observability, dataset management, and production monitoring.

The Architecture

The architecture centers on DeepEval executing tests locally or within a CI/CD pipeline, and Confident AI serving as the centralized system of record.

Input Generation: You define an LLMTestCase containing the input prompt, actual output, expected output, and retrieval context.
Evaluation: DeepEval uses an evaluator LLM (such as GPT-4) to score the test case against the metrics you specify (e.g., FaithfulnessMetric).
Observability: Results are logged to the Confident AI cloud platform, where traces, metric scores, and metadata are aggregated into dashboards.
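
The three stages above can be sketched in plain Python. This is an illustrative outline only, not DeepEval's implementation: EvalCase, keyword_judge, and run_pipeline are hypothetical names, and the judge is a trivial word-overlap stub standing in for the evaluator LLM.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """Hypothetical stand-in for a test case (stage 1: input generation)."""
    input: str
    actual_output: str
    retrieval_context: list[str] = field(default_factory=list)


def keyword_judge(case: EvalCase) -> float:
    """Stub judge (stage 2): fraction of output words found in the context.

    A real pipeline would call an evaluator LLM here instead.
    """
    context_words = set(" ".join(case.retrieval_context).lower().split())
    output_words = case.actual_output.lower().split()
    if not output_words:
        return 0.0
    hits = sum(1 for word in output_words if word in context_words)
    return hits / len(output_words)


def run_pipeline(cases: list[EvalCase], threshold: float) -> list[dict]:
    """Score every case and collect records (stage 3: observability)."""
    records = []
    for case in cases:
        score = keyword_judge(case)
        records.append(
            {"input": case.input, "score": score, "passed": score >= threshold}
        )
    return records


records = run_pipeline(
    [
        EvalCase(
            input="Capital of France?",
            actual_output="paris is the capital",
            retrieval_context=["paris is the capital of france"],
        )
    ],
    threshold=0.7,
)
```

In a real setup the records list would be shipped to a dashboard rather than kept in memory; the point here is only the shape of the flow from test case to score to stored result.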

Step-by-Step Implementation

Setting Up DeepEval

First, install DeepEval and configure your API keys. Since DeepEval uses an LLM as a judge, you need an API key from a provider like OpenAI.

pip install -U deepeval
export OPENAI_API_KEY="your_openai_api_key_here"

Next, log in to Confident AI to enable cloud observability. This step is optional for local testing but crucial for production tracking.

deepeval login

You will be prompted to paste your Confident AI API Key, which automatically routes your local test runs to the cloud dashboard.
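
Before a run, it can help to fail fast when the judge's credentials are missing. A minimal sketch, assuming the OpenAI key is supplied through the OPENAI_API_KEY environment variable as shown above; missing_env_vars is a hypothetical helper and not part of DeepEval.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping so the check is easy to reason about.
missing = missing_env_vars(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": ""})
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Calling missing_env_vars(["OPENAI_API_KEY"]) without the env argument checks the real process environment instead.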

Creating Your First Test Case

Create a Python file named test_llm.py and define an LLMTestCase. This object represents a single evaluation scenario. In this example, we test for hallucination.

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase


def test_hallucination():
    # Define the test scenario
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France, known for its Eiffel Tower.",
        expected_output="Paris",
        context=["Paris is the capital and most populous city of France."]
    )

    # Lower is better for this metric: the threshold is the maximum passing score
    hallucination_metric = HallucinationMetric(threshold=0.7)

    # Run the assertion; it raises if the score exceeds the threshold
    assert_test(test_case, [hallucination_metric])

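Run this test via the CLI using deepeval test run test_llm.py. DeepEval will evaluate the actual_output against the context using the evaluator LLM. Note that HallucinationMetric is a lower-is-better metric: the score reflects how much of the supplied context the output contradicts, so the test passes when the score is at or below the threshold (0.7 here), not above it.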
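
The pass/fail rule for a lower-is-better metric can be sketched as follows. This assumes the hallucination score is the share of context entries that the output contradicts, which is how DeepEval's documentation describes it as far as we know; hallucination_score and passes are hypothetical helper names, and in the real metric the contradiction verdicts come from the evaluator LLM.

```python
def hallucination_score(contradicted: int, total_contexts: int) -> float:
    """Share of context entries the output contradicts (0.0 is best)."""
    if total_contexts <= 0:
        raise ValueError("at least one context entry is required")
    return contradicted / total_contexts


def passes(score: float, threshold: float) -> bool:
    """Lower is better, so the threshold acts as a maximum."""
    return score <= threshold


# One context entry, no contradictions: the test case passes.
clean = passes(hallucination_score(0, 1), threshold=0.7)

# One context entry, contradicted: the test case fails.
contradicted = passes(hallucination_score(1, 1), threshold=0.7)
```

Under this reading, a threshold of 0.7 is lenient: with several context entries, a sizeable share could be contradicted before the test fails, so a lower threshold may suit stricter gates.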

Evaluating RAG Applications

For Retrieval-Augmented Generation (RAG) systems, metrics like AnswerRelevancyMetric and ContextualPrecisionMetric are essential. Here is how you evaluate a RAG output.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)

rag_test_case = LLMTestCase(
    input="What are the benefits of using DeepEval?",
    actual_output="DeepEval is an open-source framework that helps evaluate LLM outputs and RAG applications.",
    retrieval_context=["DeepEval is an open-source LLM evaluation framework for testing and benchmarking LLM applications."],
    expected_output="DeepEval provides an open-source framework for evaluating LLMs."
)

evaluate(test_cases=[rag_test_case], metrics=[answer_relevancy_metric])

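This script evaluates whether the actual_output directly addresses the input prompt, independent of factual correctness. AnswerRelevancyMetric only needs the input and actual_output; the retrieval_context and expected_output fields are included so the same test case can be reused with retrieval-focused metrics such as ContextualPrecisionMetric.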
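
To give a feel for what a retrieval metric such as ContextualPrecisionMetric rewards, here is a sketch of a rank-weighted precision calculation. It follows the formula in DeepEval's documentation as we understand it; contextual_precision is a hypothetical function, and the per-chunk relevance verdicts, which the evaluator LLM would normally produce, are supplied by hand.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in retrieval order.

    relevance[k] is True when the chunk at rank k + 1 is relevant.
    Relevant chunks ranked near the top raise the score.
    """
    relevant_total = sum(relevance)
    if relevant_total == 0:
        return 0.0
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at this rank, counted only for relevant chunks
            weighted_sum += relevant_so_far / rank
    return weighted_sum / relevant_total


relevant_first = contextual_precision([True, False])
relevant_last = contextual_precision([False, True])
```

The same single relevant chunk scores higher when it is retrieved first than when it is retrieved second, which is why this kind of metric is sensitive to reranking changes.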

Creating Custom Metrics with G-Eval

If the built-in metrics do not cover your use case, you can define custom criteria using G-Eval.

from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    # evaluation_params takes LLMTestCaseParams enum members, not plain strings
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case_geval = LLMTestCase(
    input="Summarize AI ethics.",
    actual_output="AI ethics involves fairness, accountability, and transparency in systems.",
    expected_output="Key principles include fairness, accountability, transparency, and privacy.",
)

assert_test(test_case_geval, [correctness_metric])

G-Eval provides a programmatic interface to instruct the evaluator LLM on exactly how to grade the response.
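
Conceptually, the criteria and the selected evaluation parameters determine what the judge sees. The sketch below illustrates that idea only; it is not DeepEval's actual prompt or internals, and build_judge_prompt, the prompt wording, and the 0 to 10 scale are all assumptions made for illustration.

```python
def build_judge_prompt(criteria: str, case: dict, evaluation_params: list[str]) -> str:
    """Assemble a grading prompt from the criteria and the selected fields only."""
    lines = ["You are grading an LLM response.", f"Criteria: {criteria}", ""]
    for name in evaluation_params:
        # Fields not listed in evaluation_params are never shown to the judge
        lines.append(f"{name}: {case[name]}")
    lines.append("")
    lines.append("Return a score from 0 to 10 and a one-sentence reason.")
    return "\n".join(lines)


prompt = build_judge_prompt(
    criteria="Determine if the actual output is correct based on the expected output.",
    case={
        "input": "Summarize AI ethics.",
        "actual_output": "AI ethics involves fairness, accountability, and transparency in systems.",
        "expected_output": "Key principles include fairness, accountability, transparency, and privacy.",
    },
    evaluation_params=["actual_output", "expected_output"],
)
```

Because only the listed fields reach the judge, leaving a field out of the evaluation parameters means the criteria cannot meaningfully refer to it.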

The Results & Takeaways

Automated CI/CD Gates: You can now block deployments if the LLM fails critical evaluation thresholds.
Granular Observability: Confident AI provides complete traces of every LLM call, enabling rapid debugging of failed test cases.
Customizability: The G-Eval framework allows you to evaluate domain-specific criteria without writing complex parsing logic.

What you can build next: Integrate DeepEval into your GitHub Actions pipeline to automatically score pull requests that modify your system prompts or RAG retrieval logic.
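
A deployment gate of this kind boils down to comparing scores with thresholds and returning a non-zero exit code on failure. A minimal sketch, assuming higher-is-better metrics and scores already collected from an evaluation run; failed_metrics, gate, and the metric names are hypothetical.

```python
def failed_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names of metrics whose score is below the required minimum.

    A metric with no recorded score counts as a failure.
    """
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    )


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    failures = failed_metrics(scores, thresholds)
    for name in failures:
        print(f"FAIL {name}: {scores.get(name)} < {thresholds[name]}")
    return 1 if failures else 0


exit_code = gate(
    scores={"answer_relevancy": 0.82, "correctness": 0.55},
    thresholds={"answer_relevancy": 0.7, "correctness": 0.7},
)
# In a CI job, pass exit_code to sys.exit() so a failing gate fails the step.
```

Lower-is-better metrics such as hallucination would need the comparison reversed, so it is worth keeping the two kinds in separate threshold maps.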

Frequently Asked Questions

Why use DeepEval instead of raw API calls?

DeepEval abstracts away the complexity of parsing LLM responses, defining evaluation logic, and handling edge cases. It provides standardized metrics such as Hallucination and Faithfulness out of the box, each with a defined score and pass threshold, saving you from writing brittle regexes or custom grading prompts.

How do you handle API costs when running evaluations?

DeepEval uses LLMs to judge other LLMs, which incurs API costs. To manage this, teams typically use smaller, cheaper models for routine CI/CD checks and reserve flagship models like GPT-4 for evaluating golden datasets or highly critical metrics.
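
One way to encode that policy is a small routing function that picks the judge model per run. This is a sketch under stated assumptions: pick_judge_model and the run type labels are hypothetical, and the model names are examples rather than recommendations.

```python
CHEAP_MODEL = "gpt-4o-mini"  # example name for a smaller, cheaper judge
FLAGSHIP_MODEL = "gpt-4"     # example name for a flagship judge


def pick_judge_model(run_type: str, critical: bool = False) -> str:
    """Choose a judge model: flagship for golden datasets or critical metrics."""
    if critical or run_type == "golden":
        return FLAGSHIP_MODEL
    return CHEAP_MODEL


ci_model = pick_judge_model("ci")
golden_model = pick_judge_model("golden")
critical_ci_model = pick_judge_model("ci", critical=True)
```

The returned name could then be supplied to a metric's model argument, assuming the DeepEval version in use accepts a model name there.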

Need help building this?

If you want to implement advanced AI workflows like this but don't have the engineering bandwidth, Varnan Labs specializes in building custom AI pipelines. Book a strategy call to see if we can help.