Using DeepEval with cognee
DeepEval is like a magnifying glass for your AI system’s performance. It breaks down outputs into measurable metrics, so you can pinpoint strengths and weaknesses.
It is an open-source framework designed to evaluate LLM outputs against specific criteria. Some of the key metrics include (a short configuration sketch follows the list):
- G-Eval: A versatile metric capable of evaluating various use cases with human-like accuracy.
- Prompt Alignment: Assesses how well the LLM’s output aligns with the given prompt.
- Faithfulness: Evaluates the factual accuracy of the LLM’s response.
- Answer Relevancy: Measures how relevant the LLM’s answer is to the posed question.
- Contextual Relevancy, Precision, and Recall: Evaluate the retrieved context of a RAG pipeline, checking that it is relevant to the question, ranked sensibly, and sufficient to support the expected answer.
- Tool Correctness and JSON Correctness: Ensure the LLM’s outputs are correctly formatted and utilize tools appropriately.
- RAGAS: A metric for evaluating Retrieval-Augmented Generation systems.
- Hallucination, Toxicity, Bias, and Summarization: Metrics that assess, respectively, the LLM’s tendency to generate false information, offensive content, and biased responses, and the quality of its summaries.
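Most of these metrics are exposed as classes that you instantiate once and reuse. Here is a minimal configuration sketch for a few of the single-turn metrics above; the thresholds are illustrative, and because these are LLM-as-a-judge metrics, DeepEval needs an LLM API key configured before they can run:

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric

# Each metric is configured once and reused across test cases.
# A test case passes a metric when its score meets or exceeds the threshold.
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)
hallucination = HallucinationMetric(threshold=0.5)

These metric objects are what you later pass to an evaluation run, as shown in the walkthrough further down.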
Additionally, DeepEval provides Conversational Metrics to evaluate entire interactions, such as:
- Conversational G-Eval
- Knowledge Retention
- Role Adherence
- Conversation Completeness
- Conversation Relevancy
Users can also develop custom evaluation metrics tailored to specific needs. Cognee’s evaluation framework defines LLM-as-a-judge metrics as described in Microsoft’s GraphRAG paper and offers them as options:
- Comprehensiveness: How much detail does the answer provide to cover all aspects and details of the question?
- Diversity: How varied and rich is the answer in providing different perspectives and insights on the question?
- Empowerment: How well does the answer help the reader understand and make informed judgements about the topic?
- Directness: How specifically and clearly does the answer address the question?
We also define Correctness as an additional custom metric option, sketched in code below.
- Correctness: Is the actual output factually correct based on the expected output?
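A Correctness-style judge can be expressed with DeepEval’s G-Eval, which takes natural-language criteria plus the test-case fields the judge should look at. The criteria wording below is a sketch rather than Cognee’s exact prompt:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# G-Eval scores a test case by asking a judge LLM to apply the criteria
# to the selected test-case fields.
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)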
F1 score and Exact Match (EM) score are also included in the framework as custom DeepEval metrics, following the implementation used in the official HotpotQA benchmark; a simplified version of both computations appears after the list below.
- EM score: the rate at which predicted strings exactly match their references, ignoring whitespace and capitalization.
- F1 score: the harmonic mean of precision and recall, computed over word-level overlap between the prediction and the reference.
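For reference, a simplified version of the two computations looks like this; the official HotpotQA script additionally strips punctuation and articles during normalization:

from collections import Counter

def normalize(text: str) -> str:
    # Simplified normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(normalize(prediction) == normalize(reference))

def f1(prediction: str, reference: str) -> float:
    # Word-level overlap between prediction and reference.
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)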
For an in-depth look at DeepEval’s capabilities, visit the DeepEval documentation.
Using DeepEval with Cognee
Let’s say you have a Q&A pipeline. You’ve fed it a dataset of user questions, and now you need to evaluate how well it performed. Here’s how DeepEval helps:
- Extract Your Outputs: After running the dataset through the Cognify pipeline, export the responses. These might be answers to questions like, “What’s the best way to secure a database?”
# Import paths follow cognee's layout and may differ between versions.
from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import read_query_prompt, render_prompt

# Async wrapper (the name is illustrative) so the snippet runs end to end.
async def answer_with_cognee(instance: dict) -> str:
    # get_context_with_cognee is a helper, defined elsewhere in the eval script,
    # that runs a cognee search for the question and returns the retrieved context.
    context = await get_context_with_cognee(instance)
    args = {
        "question": instance["question"],
        "context": context,
    }
    user_prompt = render_prompt("context_for_question.txt", args)
    system_prompt = read_query_prompt("answer_hotpot_using_cognee_search.txt")
    llm_client = get_llm_client()
    answer_prediction = await llm_client.acreate_structured_output(
        text_input=user_prompt,
        system_prompt=system_prompt,
        response_model=str,
    )
    return answer_prediction
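To evaluate a whole dataset, you run this once per question and keep the answers aligned with their instances. The loop below is a sketch that assumes the answer_with_cognee wrapper above and a list of HotpotQA-style instance dictionaries called instances:

import asyncio

async def collect_answers(instances: list[dict]) -> list[str]:
    # One pipeline call per instance; the answers list stays aligned with
    # instances, which the DeepEval test-case construction below relies on.
    answers = []
    for instance in instances:
        answers.append(await answer_with_cognee(instance))
    return answers

answers = asyncio.run(collect_answers(instances))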
- Set Up DeepEval: Load your outputs into DeepEval and configure the metrics you want to check, such as precision, recall, and relevance. The tool can also surface inconsistencies, so you know exactly where to tweak.
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Pair each dataset instance with the answer the pipeline produced for it.
test_cases = []
for instance, answer in zip(instances, answers):
    test_case = LLMTestCase(
        input=instance["question"], actual_output=answer, expected_output=instance["answer"]
    )
    test_cases.append(test_case)

# eval_metrics is the list of configured DeepEval metric objects (e.g. correctness, F1, EM).
eval_set = EvaluationDataset(test_cases=test_cases)
eval_results = eval_set.evaluate(eval_metrics)
- Interpret Metrics: DeepEval generates detailed reports, highlighting areas where the pipeline excels and where it needs improvement. For example, maybe it’s great at answering factual questions but struggles with subjective ones.
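If you want to drill into a single case rather than a full report, any configured metric can also be run directly against one test case; score, reason, and is_successful() are part of DeepEval’s metric interface, and the objects below are the ones sketched earlier:

# Run one metric against one test case and inspect the judge's verdict.
correctness_metric.measure(test_cases[0])
print(correctness_metric.score)            # score between 0 and 1
print(correctness_metric.reason)           # the judge LLM's explanation
print(correctness_metric.is_successful())  # True if the score clears the threshold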
Join the Conversation!
Join our community now to connect with professionals, share insights, and get your questions answered!