Evaluation

AI systems are only as good as their ability to deliver consistent, high-quality results. At cognee, we take evaluation seriously because it’s how we ensure our pipelines meet real-world demands. Whether it’s answering complex questions or understanding intricate code, our evaluation framework is designed to measure what matters.

To maintain this standard, cognee uses tools like DeepEval and Promptfoo, combined with datasets that stress-test performance across scenarios. Find detailed explanations in the following pages:

Why Evaluate?

When you build AI systems, it’s not just about getting them to work – it’s about making them reliable. Evaluation lets us measure system performance, identify areas for improvement, and ensure our solutions meet the highest standards of quality. Evaluations help you understand:

  • How accurate are the results?
  • Are they consistent across different inputs?
  • Do they align with the ground truth?

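To make these questions concrete, the snippet below shows how a single answer can be checked against its ground truth with DeepEval, one of the tools mentioned above. This is a minimal sketch rather than cognee’s actual harness: the question, answers, and threshold are illustrative, and the exact DeepEval interface may vary between versions.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Illustrative HotpotQA-style example: the question, the pipeline's answer,
# and the ground-truth answer it should align with.
test_case = LLMTestCase(
    input="Which magazine was started first, Arthur's Magazine or First for Women?",
    actual_output="Arthur's Magazine was started first.",
    expected_output="Arthur's Magazine",
)

# LLM-as-judge correctness metric: compares the actual output to the expected output.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output answers the question correctly "
             "with respect to the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5,  # illustrative pass/fail cutoff
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```
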
Cognee’s evaluation framework covers:

  • Cognify pipeline: Evaluates LLM answers on Q&A datasets to ensure accurate, context-aware answers. Currently supported datasets (such as HotpotQA and 2WikiMultihopQA) provide context for each question in the form of Wikipedia paragraphs. The Cognify pipeline builds a knowledge graph from the given context and uses retrieval results to provide structured information to the LLM. The evaluation framework supports multiple retrieval options (simple RAG, and cognee with or without summaries) as well as a plain, retrieval-free LLM call that serves as a baseline; a rough sketch of this comparison follows the list.
  • Codegraph pipeline: Uses the SWE-bench benchmark to assess an LLM that, given a codebase and an issue, is prompted to generate a patch that resolves the described problem. To evaluate a proposed patch, the generated patch is applied to the codebase with the Unix patch program, and the unit and system tests associated with the issue are then executed. The issue counts as resolved only if the patch applies cleanly and all of the tests pass. The Codegraph pipeline builds a knowledge graph from the code files, enabling the identification of intricate relationships within the codebase, and uses retrieval results to provide structured information to the LLM. A patch-and-test sketch appears below, after the comparison sketch.
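
To illustrate the Cognify-pipeline comparison described above, the sketch below runs each question through several configurations (a plain LLM baseline, simple RAG, and cognee with or without summaries) and scores every answer against the ground truth. All of the helper functions and the dummy data are hypothetical stand-ins; cognee’s real evaluation harness uses its own loaders, retrievers, and an LLM-judged correctness metric.

```python
from statistics import mean

# Everything below is a hypothetical sketch; cognee's real evaluation harness and
# APIs differ. The stub functions only mark where the real pieces would plug in.

CONFIGS = ["llm_baseline", "simple_rag", "cognee", "cognee_with_summaries"]

def load_qa_dataset(name: str) -> list[dict]:
    """Stand-in for loading HotpotQA / 2WikiMultihopQA style records."""
    return [{
        "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
        "context": "Arthur's Magazine (est. 1844) ... First for Women (est. 1989) ...",
        "answer": "Arthur's Magazine",
    }]

def answer_with(config: str, question: str, context: str) -> str:
    """Stand-in for answering with one configuration:
    a plain LLM baseline, simple RAG, or cognee with/without summaries."""
    return "Arthur's Magazine"  # dummy answer so the sketch runs end to end

def score_answer(prediction: str, ground_truth: str) -> float:
    """Stand-in scorer: exact match here; an LLM-judged correctness metric in practice."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def run_comparison(dataset_name: str = "hotpotqa") -> dict[str, float]:
    """Answer every question with every configuration and report the mean score per configuration."""
    dataset = load_qa_dataset(dataset_name)
    per_config = {config: [] for config in CONFIGS}
    for record in dataset:
        for config in CONFIGS:
            prediction = answer_with(config, record["question"], record["context"])
            per_config[config].append(score_answer(prediction, record["answer"]))
    return {config: mean(scores) for config, scores in per_config.items()}

if __name__ == "__main__":
    for config, score in run_comparison().items():
        print(f"{config}: {score:.2f}")
```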

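The resolution check used for the Codegraph pipeline can be sketched as follows: apply the generated patch with the Unix patch program, then run the tests associated with the issue, and count the attempt as resolved only if both steps succeed. The repository path, patch file, and test command are placeholders, and the real SWE-bench harness is considerably more involved.

```python
import subprocess

def patch_resolves_issue(repo_dir: str, patch_text: str, test_command: list[str]) -> bool:
    """Return True if the patch applies cleanly and the issue's tests pass."""
    # Step 1: apply the generated patch with the Unix `patch` program.
    apply_result = subprocess.run(
        ["patch", "-p1", "--forward"],
        input=patch_text,
        text=True,
        cwd=repo_dir,
    )
    if apply_result.returncode != 0:
        return False  # the patch did not apply cleanly

    # Step 2: run the unit and system tests associated with the issue.
    test_result = subprocess.run(test_command, cwd=repo_dir)
    return test_result.returncode == 0

# Placeholder usage: paths, patch file, and test selection are illustrative only.
if __name__ == "__main__":
    with open("generated.patch") as f:
        resolved = patch_resolves_issue(
            repo_dir="path/to/checked-out-repo",
            patch_text=f.read(),
            test_command=["python", "-m", "pytest", "tests/test_issue.py"],
        )
    print("resolved" if resolved else "not resolved")
```
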
Join the Conversation!

Join our community now to connect with professionals, share insights, and get your questions answered!