Evaluating RAG Agents with RAGAS: From Datasets to Automated Experiments

Retrieval‑Augmented Generation (RAG) systems promise grounded, trustworthy answers—but only if we can measure how well they actually perform. Over the past few weeks, I deep‑dived into RAGAS, an evaluation framework purpose‑built for RAG pipelines.

In this post, I’ll walk through my hands‑on journey with RAGAS:

  • Creating and saving evaluation datasets from scratch

  • Understanding and applying core metrics like Faithfulness, Answer Relevancy, and Context Precision

  • Running experiments programmatically over datasets

  • Generating synthetic test sets using TestsetGenerator

  • Stitching everything together into repeatable evaluation loops

If you’re building RAG agents and exploring RAGAS for setting up an evaluation framework, this post is for you.



Why RAG Evaluation Is Different from Classic AI Use Cases

Unlike traditional NLP tasks, RAG systems have multiple moving parts:

  • Retrieval quality (Did we fetch the right context?)

  • Generation quality (Did the model answer correctly?)

  • Grounding (Did the answer rely on retrieved context or hallucinate?)

Classic metrics like BLEU or ROUGE don’t capture this well. What we need instead are LLM‑as‑judge metrics that understand:

  • Whether an answer is supported by the context

  • Whether it actually addresses the question

  • Whether the retrieved documents were ranked in order of relevance

This is exactly where RAGAS fits in.


What Is RAGAS and How It Supports Evaluation


Using the RAGAS (Retrieval-Augmented Generation Assessment) toolkit, developers can benchmark their agents against common datasets, compare performance across different retrieval and generation strategies, and identify areas for improvement.


RAGAS is an evaluation framework designed specifically for RAG pipelines. It standardizes:

  • Dataset formats for RAG evaluation

  • A set of well-defined metrics

  • Experiment orchestration over datasets

At a high level, RAGAS expects data in the form:

  • Question

  • Generated Answer

  • Retrieved Contexts

  • (Optionally) Ground‑truth answers

Once you have this, RAGAS can score your system across multiple dimensions.
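To make that shape concrete, here is a minimal sketch of an evaluation row as plain Python dicts. The field names (`user_input`, `response`, `retrieved_contexts`, `reference`) follow the sample convention used by recent RAGAS versions; the question and contexts are invented for illustration.

```python
# One evaluation row per RAG interaction. Field names follow the
# single-turn sample convention from recent RAGAS versions
# (user_input / response / retrieved_contexts / reference).
rows = [
    {
        "user_input": "What is the refund window for annual plans?",
        "response": "Annual plans can be refunded within 30 days of purchase.",
        "retrieved_contexts": [
            "Refund policy: annual subscriptions are refundable within 30 days.",
            "Monthly subscriptions are non-refundable.",
        ],
        # Optional ground truth, used by reference-based metrics.
        "reference": "Annual plans have a 30-day refund window.",
    },
]

def validate_row(row: dict) -> bool:
    """Check that a row carries the minimum fields RAGAS metrics need."""
    required = {"user_input", "response", "retrieved_contexts"}
    return required.issubset(row) and isinstance(row["retrieved_contexts"], list)
```

Rows shaped like this map directly onto a RAGAS dataset object once you hand them to the library.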



Step 1: Creating Evaluation Datasets from Scratch

My first step was learning how to build RAGAS datasets manually.

Each dataset row represents a single RAG interaction:

  • A user query

  • Context chunks retrieved by the retriever

  • The final answer generated by the LLM

Once structured, these rows are converted into a RAGAS Dataset object. The key benefit here is repeatability—you can:

  • Save datasets to disk

  • Reuse them across experiments

  • Compare multiple RAG versions on the same inputs

This immediately changed how I think about RAG debugging: evaluation became data‑driven rather than anecdotal.



Step 2: Understanding Core RAGAS Metrics

Before running experiments, I spent time understanding what each metric actually measures.

1. Faithfulness

Question: Is the answer supported by the retrieved context?

Faithfulness checks whether the generated answer can be traced back to the provided context chunks. If the model invents facts—even if they sound correct—this score drops.

This metric is critical for:

  • Enterprise RAG systems

  • Compliance and regulated domains

  • Any application where hallucinations are unacceptable
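RAGAS computes faithfulness with an LLM judge: it decomposes the answer into claims and verifies each one against the retrieved context. To make the arithmetic concrete, here is a deliberately crude token-overlap stand-in; the real metric uses an LLM, not string matching.

```python
def naive_faithfulness(answer: str, contexts: list[str]) -> float:
    """Toy stand-in for faithfulness: fraction of answer sentences whose
    words all appear somewhere in the retrieved contexts.
    RAGAS does this with an LLM judge over extracted claims instead."""
    context_words = set(" ".join(contexts).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        if words and words <= context_words:
            supported += 1
    return supported / len(sentences)
```

An invented sentence drags the score down even when the rest of the answer is grounded, which is exactly the behavior described above.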

2. Answer Relevancy

Question: Did the model answer the question that was asked?

Answer Relevancy evaluates how well the response addresses the user query, independent of grounding. A verbose but off‑topic answer scores poorly here.

This helped surface issues where:

  • Retrieval was fine

  • Generation drifted into unrelated explanations
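Under the hood, RAGAS scores answer relevancy by asking an LLM to generate questions the answer would satisfy, then averaging their embedding similarity to the original question. A sketch of that averaging step, with `embed` left as a stand-in for a real embedding model:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def answer_relevancy(question: str, generated_questions: list[str], embed) -> float:
    """RAGAS-style relevancy: questions an LLM derived from the answer are
    embedded and compared to the real question; off-topic answers yield
    generated questions that sit far from the original."""
    q_vec = embed(question)
    sims = [cosine(q_vec, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims) if sims else 0.0
```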

3. Context Precision

Question: How much of the retrieved context was actually useful?

Context Precision penalizes over‑retrieval. If you fetch many chunks but only a few are relevant, this metric exposes it.

This was especially useful when tuning:

  1. Top‑k values

  2. Chunk sizes

  3. Retriever scoring thresholds
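The idea behind context precision can be sketched in plain arithmetic: average the precision@k at each rank that holds a relevant chunk, which rewards relevant chunks appearing early. In RAGAS, each chunk's relevance is decided by an LLM judge; here it is just a boolean list.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-aware precision over retrieved chunks: average of precision@k
    at each position k holding a relevant chunk. Relevant chunks ranked
    early score high; a tail of irrelevant chunks drags the score down.
    (RAGAS decides each chunk's relevance with an LLM judge.)"""
    precisions = []
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Raising top-k without improving the retriever mostly appends `False` entries, which is why over-retrieval shows up so clearly in this metric.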

4. Extending RAGAS with Custom Metrics Using DiscreteMetric

While the built-in RAGAS metrics cover many common RAG failure modes, real-world systems often need domain-specific evaluation logic. This is where RAGAS's support for custom metrics, especially DiscreteMetric, becomes extremely powerful.

A generic metric may not fully capture:

  • Domain-specific correctness

  • Business rules

  • Output format constraints

  • Policy or safety requirements

  • Binary pass/fail conditions

When creating a DiscreteMetric, you typically define:

  1. The inputs the metric consumes (you are in charge of setting the prompt context here):

    • Question

    • Generated answer

    • Retrieved contexts

    • (Optionally) reference answers

  2. The evaluation prompt or instruction

    • Clear criteria for how the judgment should be made using the inputs defined above

    • Explicit definitions of the allowed outputs < I leveraged this to have the judge return its reasoning as part of the output as well, which we will see in the code snippet below >

  3. The discrete output space

    • Fixed labels (e.g., PASS/FAIL, Supported/Unsupported, Correct/Incorrect)

    • Or a small, well-defined set of categories

In the code below, I aimed for a Strict Grounding Score: a discrete set of ratings showing to what extent the RAG workflow abides by my defined criteria.
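The original snippet is not reproduced here, so the following is a sketch of the judge prompt and output parsing you would wrap in a DiscreteMetric. The label set and prompt wording are my assumptions, and the exact DiscreteMetric constructor arguments vary between RAGAS versions, so check the docs for your release.

```python
# Hypothetical label set for a Strict Grounding Score; swap in your own.
ALLOWED_LABELS = ["fully_grounded", "partially_grounded", "ungrounded"]

def build_grounding_prompt(question: str, answer: str, contexts: list[str]) -> str:
    """Judge prompt for a DiscreteMetric-style grounding check. The judge is
    told to return one allowed label plus its reasoning, as 'label | why'."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "You are a strict grounding judge.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Contexts:\n{context_block}\n"
        "Reply as '<label> | <reasoning>' where <label> is one of: "
        f"{', '.join(ALLOWED_LABELS)}."
    )

def parse_judgment(raw: str) -> tuple[str, str]:
    """Split the judge's reply into (label, reasoning); reject unknown labels
    so the metric stays discrete."""
    label, _, reasoning = raw.partition("|")
    label = label.strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"judge returned unexpected label: {label!r}")
    return label, reasoning.strip()
```

Keeping the reasoning alongside the label is what made the metric debuggable: a failing row tells you why it failed, not just that it did.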


Step 3: Running Experiments Over a Dataset

Once datasets and metrics were ready, the next step was experimentation.

Using RAGAS experiments, I was able to:

  • Loop over every row in a dataset

  • Apply multiple metrics in one run

  • Collect structured scores per example

Here is everything you need for a ready-to-use experiment workflow:

  1. Load a dataset from your test case CSV files < the ones you created; we will also create some with test data generation later in this blog >

  2. A retriever and a rag_agent callable for each experiment, so it can adapt to different datasets

  3. Write and run the experiment. When called over the dataset, RAGAS automatically iterates over each sample row in a loop. You can save your experiments in the root_dir < in our case ./eval, which was already defined during dataset creation >; RAGAS automatically saves the experiment under the <your defined root_dir>/experiments folder. So you will have an eval folder containing datasets and experiments folders.

Instead of eyeballing outputs, I now had experiments with scores I could compare.
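The per-row flow a RAGAS experiment performs can be sketched as a plain loop. The `retriever` and `rag_agent` names mirror the callables described above; the metric functions are stand-ins for whatever scorers you plug in.

```python
def run_experiment(dataset: list[dict], metrics: dict, retriever, rag_agent) -> list[dict]:
    """Mimic what a RAGAS experiment does per row: retrieve context,
    generate an answer, then apply every metric and collect scores."""
    results = []
    for row in dataset:
        contexts = retriever(row["user_input"])
        answer = rag_agent(row["user_input"], contexts)
        scores = {
            name: score_fn(row["user_input"], answer, contexts)
            for name, score_fn in metrics.items()
        }
        results.append({"user_input": row["user_input"], "response": answer, **scores})
    return results
```

In the real framework this loop, plus saving results under the experiments folder, is handled for you; the sketch just shows where your callables slot in.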


Step 4: Generating Synthetic Test Sets with TestsetGenerator

Before we start, I would like to share something I only learned by experimenting and did not find written anywhere.

RAGAS's test generation utility performs significantly better over a well-defined, node-based (or structured) document representation < for example, an md file with a title and metadata info >. The RAGAS documentation uses md files to achieve its results, so do not lose hope if the same approach doesn't work in exactly the same way for your PDF documents.

Typical PDF documents may not have a node-compatible structure: LangChain PDF document loaders add metadata, but the title and subject fields are usually blank, since most documents/pages do not contain this information unless it appears on the page explicitly.

So, as a solution (or more like a fallback), make sure to add headlines and summaries during the knowledge graph build, as you will see in my code snippet below.


RAGAS 0.4+ has evolved how it deals with document chunks: it provides a way to generate a knowledge graph ourselves that:

  • represents each chunk as a node

  • defines relationships between nodes

  • defines metadata for the nodes [ headlines, summaries, key phrases, etc. ]

To give you an idea of how it looks:
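This is an illustrative plain-dict picture only; RAGAS's own KnowledgeGraph and Node classes carry more fields, and the texts, ids, and similarity score below are invented. The point is the shape: every chunk becomes a node whose metadata carries the headlines and summaries that PDF loaders usually leave blank.

```python
# Illustrative node/relationship shapes only; RAGAS's own KnowledgeGraph
# classes differ. Headlines and summaries are the metadata that PDF
# loaders usually leave blank, so we attach them ourselves.
nodes = [
    {
        "id": "chunk-1",
        "text": "RAGAS standardizes datasets and metrics for RAG evaluation.",
        "metadata": {
            "headline": "What RAGAS standardizes",
            "summary": "Overview of RAGAS dataset and metric support.",
            "key_phrases": ["RAGAS", "RAG evaluation"],
        },
    },
    {
        "id": "chunk-2",
        "text": "Synthetic test sets can be generated from a knowledge graph.",
        "metadata": {
            "headline": "Synthetic test generation",
            "summary": "Test sets are derived from graph nodes.",
            "key_phrases": ["synthetic test set", "knowledge graph"],
        },
    },
]

# Relationships connect nodes; the score here is a placeholder value.
relationships = [
    {"source": "chunk-1", "target": "chunk-2", "type": "cosine_similarity", "score": 0.82},
]

def nodes_missing_metadata(nodes: list[dict]) -> list[str]:
    """Flag nodes whose headline or summary is blank -- these are the ones
    the test set generator struggles with on raw PDF chunks."""
    return [
        n["id"]
        for n in nodes
        if not n["metadata"].get("headline") or not n["metadata"].get("summary")
    ]
```

Running a check like `nodes_missing_metadata` before generation is a cheap way to confirm your PDF chunks got the headline/summary enrichment they need.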

I am not diving deeper into query synthesizers here; I have used SingleHopSpecificQuerySynthesizer, which generates headline/key-phrase-related test queries < you can read more about it here: https://docs.ragas.io/en/stable/howtos/applications/singlehop_testset_gen/?h=query#configuring-personas-for-query-generation >

Once generated, these test sets can be converted to a RAGAS dataset and plugged directly into the evaluator.


Step 5: Stitching Everything Together in a Loop

The final step was automation.

My evaluation loop looked roughly like this:

  1. Load or generate a dataset

  2. Run the RAG pipeline to produce answers

  3. Attach retrieved contexts and outputs

  4. Run RAGAS metrics across all rows

  5. Store results for comparison
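The five steps above can be sketched as one orchestration function. All four callables are stand-ins for your own components; nothing here is a RAGAS API.

```python
def evaluation_loop(load_dataset, rag_pipeline, score_rows, store):
    """End-to-end loop from the steps above: load data, run the pipeline,
    attach outputs, score every row, persist results. All four arguments
    are callables standing in for your own components."""
    dataset = load_dataset()                               # step 1
    for row in dataset:
        answer, contexts = rag_pipeline(row["user_input"])  # step 2
        row["response"] = answer                            # step 3
        row["retrieved_contexts"] = contexts
    results = score_rows(dataset)                           # step 4
    store(results)                                          # step 5
    return results
```

Swapping a retriever, prompt, or model then means changing only `rag_pipeline` and re-running the same call.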

With this setup, evaluating a new retriever, prompt, or model became a single function call.

This is where RAG evaluation stopped being a one‑off task and became part of the development workflow.


What to Do Next

  • Compare multiple retrievers < using the same RAGAS dataset >

  • Evaluation in Continuous Integration (CI) pipelines

  • Custom metrics for domain‑specific grounding

  • Visual dashboards on top of RAGAS outputs


⚠️ Cost Awareness When Using RAGAS

RAGAS relies heavily on LLM‑as‑a‑judge, which means evaluation cost scales quickly—especially with large documents, long contexts, and frequent runs.

To keep costs under control:

  • Run RAGAS selectively, not continuously

  • Keep evaluation datasets small but high‑signal

  • Be mindful of context length and document size

  • Prefer CI‑style evaluation over ad‑hoc experimentation

  • Use judge models that are at least as strong as your RAG agent's model

RAGAS is best treated as a quality gate, not a runtime metric.
