Evaluations in cognee
The AI Memory Challenge
Large Language Models (LLMs) face a fundamental limitation: they lack persistent memory. Each interaction essentially starts from scratch, creating a significant bottleneck when dealing with thousands of interactions or complex, multi-step reasoning tasks.
The performance of LLM applications isn’t just about selecting the right model or crafting clever prompts. The data pipeline—what happens before the prompt—plays a crucial role in enabling rich, context-aware interactions.
Benchmark Results
Our comparative analysis examined three distinct approaches to AI memory systems:
With our proprietary Dreamify tool enabled, the scores are even higher:
The results demonstrate cognee’s strong performance in key areas:
- Accuracy: Higher precision in multi-hop question answering
- Context Retention: Better preservation of semantic relationships
- Retrieval Speed: Efficient access to relevant information
Understanding the Metrics
Our evaluation framework uses several key metrics:
- Human Evaluation: Manual annotation of results for factual accuracy
- F1 Scores: Balance between precision and recall in predictions
- Exact Match (EM): Percentage of predictions matching ground truth exactly
- Correctness: LLM-based evaluation using Deepeval metrics
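To make the F1 and Exact Match metrics above concrete, here is a minimal sketch of how they are commonly computed for QA outputs at the token level (SQuAD/HotpotQA style). The exact normalization used in our pipeline may differ; treat this as illustrative only:

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and split into tokens; real implementations also strip
    # punctuation and articles ("a", "an", "the").
    return text.lower().split()

def exact_match(prediction: str, ground_truth: str) -> float:
    # 1.0 only if the normalized prediction matches the ground truth exactly.
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    # Harmonic mean of token-level precision and recall.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(ground_truth)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially matching answer scores 0 on EM but still earns partial F1 credit:
print(exact_match("Arthur's Magazine", "Arthur's Magazine"))   # 1.0
print(f1_score("the Arthur's Magazine", "Arthur's Magazine"))  # 0.8
```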
Evaluation Methodology
The evaluations shown here use the HotpotQA dataset, which specializes in questions that require reasoning across multiple documents. This lets us evaluate how well knowledge is integrated across sources and how logical connections are formed. Our main goal is to assess contextual understanding on the LLM side.
For systems to handle such queries effectively, they must:
- Identify relevant information across documents
- Build meaningful relationships between disparate data points
- Deliver structured, coherent context to the LLM
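As a rough illustration of these three steps, the sketch below assembles context for a multi-hop question. `search_documents`, `extract_relations`, and `llm` are hypothetical placeholders for whatever retrieval, graph, and generation components a given system provides; this is not cognee's actual API.

```python
def answer_multi_hop(question: str, llm, search_documents, extract_relations) -> str:
    # Step 1: identify relevant information across documents.
    passages = search_documents(question, top_k=5)

    # Step 2: build meaningful relationships between disparate data points,
    # e.g. (entity, relation, entity) triples that span different passages.
    triples = extract_relations(passages)

    # Step 3: deliver structured, coherent context to the LLM.
    context = "\n".join(
        ["Passages:"] + [p.text for p in passages] +
        ["Relations:"] + [f"{s} --{r}--> {o}" for s, r, o in triples]
    )
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# A HotpotQA-style question needs facts from two different articles, e.g.
# "Which magazine was started first, Arthur's Magazine or First for Women?"
```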
Our analysis included three common approaches:
- Chatbot Memory Systems (e.g., Mem0): Focus on direct context management
- Graph-based Systems (e.g., Graphiti): Emphasize relationship mapping
- cognee’s Approach: Combines semantic understanding with graph structures
Evaluation Framework
We developed an evaluation framework that can help run any benchmark with custom metrics. To learn more, check our guide section.
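Conceptually, such a run boils down to the loop below. The names here (`run_benchmark`, `answer_fn`) are hypothetical placeholders meant to show the shape of a benchmark run with pluggable metrics; refer to the guide section for the framework's actual API.

```python
from statistics import mean

def run_benchmark(questions, answer_fn, metrics):
    """Score a QA system on a benchmark with arbitrary metric callables.

    questions: iterable of {"question": str, "answer": str} items
    answer_fn: callable producing the system's answer for a question
    metrics:   mapping of metric name -> callable(prediction, gold) -> float
    """
    scores = {name: [] for name in metrics}
    for item in questions:
        prediction = answer_fn(item["question"])
        for name, metric in metrics.items():
            scores[name].append(metric(prediction, item["answer"]))
    return {name: mean(values) for name, values in scores.items()}

# Example: plug in the EM/F1 sketches from the metrics section above, or any
# custom callable such as an LLM-judged correctness score:
# results = run_benchmark(hotpot_items, my_system.answer,
#                         {"EM": exact_match, "F1": f1_score})
```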
Current Limitations
Current benchmarking approaches face several limitations:
- Metric Reliability: LLM-based evaluation metrics can show inconsistency
- Granularity Issues: Character-level F1 scores may not reflect semantic understanding (illustrated in the sketch after this list)
- Scaling Constraints: Human evaluation provides accuracy but doesn’t scale
- Dataset Scope: Standard datasets may not capture all real-world complexity
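The granularity issue is easy to reproduce: two answers can be semantically equivalent while sharing almost no characters or tokens, so a purely lexical F1 score penalizes a correct answer. A minimal, self-contained illustration (token-level for brevity):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    # Lexical overlap only -- no notion of meaning.
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# Semantically equivalent answers, zero lexical overlap:
print(token_f1("the author of Hamlet", "William Shakespeare"))  # 0.0
# An LLM-based correctness judge would accept both, but can itself be
# inconsistent across runs -- the metric reliability trade-off noted above.
```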
For detailed evaluation methodologies and complete results, visit our GitHub repository.
Join the Conversation!
Have questions about our evaluation approach? Join our community to discuss benchmarking and AI memory systems!