Evaluations in cognee
The AI Memory Challenge
Large Language Models (LLMs) face a fundamental limitation: they lack persistent memory. Each interaction essentially starts from scratch, creating a significant bottleneck when dealing with thousands of interactions or complex, multi-step reasoning tasks.
The performance of LLM applications isn’t just about selecting the right model or crafting clever prompts. The data pipeline—what happens before the prompt—plays a crucial role in enabling rich, context-aware interactions.
Benchmark Results
Our comparative analysis examined three distinct approaches to AI memory systems:
With our proprietary Dreamify tool enabled, the scores are even higher:
The results demonstrate cognee’s strong performance in key areas:
- Accuracy: Higher precision in multi-hop question answering
- Context Retention: Better preservation of semantic relationships
- Retrieval Speed: Efficient access to relevant information
Understanding the Metrics
Our evaluation framework uses several key metrics:
- Human Evaluation: Manual annotation of results for factual accuracy
- F1 Scores: Balance between precision and recall in predictions
- Exact Match (EM): Percentage of predictions matching ground truth exactly
- Correctness: LLM-based evaluation using Deepeval metrics
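To make the F1 and Exact Match metrics above concrete, here is a minimal sketch of how they are commonly computed for QA outputs at the token level (SQuAD/HotpotQA style). The exact normalization used in our pipeline may differ; treat this as illustrative only:

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and split into tokens; real implementations also strip
    # punctuation and articles ("a", "an", "the").
    return text.lower().split()

def exact_match(prediction: str, ground_truth: str) -> float:
    # 1.0 only if the normalized prediction matches the ground truth exactly.
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    # Harmonic mean of token-level precision and recall.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(ground_truth)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially matching answer scores 0 on EM but still earns partial F1 credit:
print(exact_match("Arthur's Magazine", "Arthur's Magazine"))   # 1.0
print(f1_score("the Arthur's Magazine", "Arthur's Magazine"))  # 0.8
```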
Evaluation Methodology
The evaluations shown here use the HotpotQA dataset, which specializes in questions that require reasoning across multiple documents. This lets us evaluate how well knowledge is integrated across sources and how logical connections are formed. Our main goal is to assess contextual understanding on the LLM side.
For systems to handle such queries effectively, they must:
- Identify relevant information across documents
- Build meaningful relationships between disparate data points
- Deliver structured, coherent context to the LLM
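As a rough illustration of these three steps, the sketch below assembles context for a multi-hop question. `search_documents`, `extract_relations`, and `llm` are hypothetical placeholders for whatever retrieval, graph, and generation components a given system provides; this is not cognee's actual API.

```python
def answer_multi_hop(question: str, llm, search_documents, extract_relations) -> str:
    # Step 1: identify relevant information across documents.
    passages = search_documents(question, top_k=5)

    # Step 2: build meaningful relationships between disparate data points,
    # e.g. (entity, relation, entity) triples that span different passages.
    triples = extract_relations(passages)

    # Step 3: deliver structured, coherent context to the LLM.
    context = "\n".join(
        ["Passages:"] + [p.text for p in passages] +
        ["Relations:"] + [f"{s} --{r}--> {o}" for s, r, o in triples]
    )
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# A HotpotQA-style question needs facts from two different articles, e.g.
# "Which magazine was started first, Arthur's Magazine or First for Women?"
```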
Our analysis included three common approaches:
- Chatbot Memory Systems (e.g., Mem0): Focus on direct context management
- Graph-based Systems (e.g., Graphiti): Emphasize relationship mapping
- cognee’s Approach: Combines semantic understanding with graph structures
Evaluation Framework
We developed an evaluation framework that can help run any benchmark with custom metrics. To learn more, check our guide section.
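Conceptually, such a run boils down to the loop below. The names here (`run_benchmark`, `answer_fn`) are hypothetical placeholders meant to show the shape of a benchmark run with pluggable metrics; refer to the guide section for the framework's actual API.

```python
from statistics import mean

def run_benchmark(questions, answer_fn, metrics):
    """Score a QA system on a benchmark with arbitrary metric callables.

    questions: iterable of {"question": str, "answer": str} items
    answer_fn: callable producing the system's answer for a question
    metrics:   mapping of metric name -> callable(prediction, gold) -> float
    """
    scores = {name: [] for name in metrics}
    for item in questions:
        prediction = answer_fn(item["question"])
        for name, metric in metrics.items():
            scores[name].append(metric(prediction, item["answer"]))
    return {name: mean(values) for name, values in scores.items()}

# Example: plug in the EM/F1 sketches from the metrics section above, or any
# custom callable such as an LLM-judged correctness score:
# results = run_benchmark(hotpot_items, my_system.answer,
#                         {"EM": exact_match, "F1": f1_score})
```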
Current Limitations
Current benchmarking approaches face several limitations:
- Metric Reliability: LLM-based evaluation metrics can show inconsistency
- Granularity Issues: Character-level F1 scores may not reflect semantic understanding (illustrated in the sketch after this list)
- Scaling Constraints: Human evaluation provides accuracy but doesn’t scale
- Dataset Scope: Standard datasets may not capture all real-world complexity
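The granularity issue is easy to reproduce: two answers can be semantically equivalent while sharing almost no characters or tokens, so a purely lexical F1 score penalizes a correct answer. A minimal, self-contained illustration (token-level for brevity):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    # Lexical overlap only -- no notion of meaning.
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# Semantically equivalent answers, zero lexical overlap:
print(token_f1("the author of Hamlet", "William Shakespeare"))  # 0.0
# An LLM-based correctness judge would accept both, but can itself be
# inconsistent across runs -- the metric reliability trade-off noted above.
```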
For detailed evaluation methodologies and complete results, visit our GitHub repository.
Join the Conversation!
Have questions about our evaluation approach? Join our community to discuss benchmarking and AI memory systems!