Evaluations in cognee

The AI Memory Challenge

Large Language Models (LLMs) face a fundamental limitation: they lack persistent memory. Each interaction essentially starts from scratch, creating a significant bottleneck when dealing with thousands of interactions or complex, multi-step reasoning tasks.

The performance of LLM applications isn’t just about selecting the right model or crafting clever prompts. The data pipeline—what happens before the prompt—plays a crucial role in enabling rich, context-aware interactions.

Benchmark Results

Our comparative analysis examined three distinct approaches to AI memory systems:

[Figure: Metrics comparison across the three approaches]

With our proprietary Dreamify tool enabled, the scores are even higher:

[Figure: Metrics comparison with Dreamify enabled]

The results demonstrate cognee’s strong performance in key areas:

  • Accuracy: Higher precision in multi-hop question answering
  • Context Retention: Better preservation of semantic relationships
  • Retrieval Speed: Efficient access to relevant information

Understanding the Metrics

Our evaluation framework uses several key metrics:

  • Human Evaluation: Manual annotation of results for factual accuracy
  • F1 Scores: Balance between precision and recall in predictions
  • Exact Match (EM): Percentage of predictions matching ground truth exactly
  • Correctness: LLM-based evaluation using Deepeval metrics
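
To make the EM and F1 definitions concrete, here is a minimal sketch of how they are typically computed for HotpotQA-style question answering, in the spirit of the standard SQuAD/HotpotQA scoring scripts; cognee's own scorer may differ in details such as normalization or whether tokens or characters are compared:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized prediction equals the normalized answer, else 0.0."""
    return float(normalize(prediction) == normalize(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1_score("Paris, France", "Paris"), 2))     # 0.67
```

The Correctness metric, in contrast, asks an LLM judge to compare the prediction with the expected answer. As a rough illustration, a Deepeval correctness check is usually set up along these lines; the exact criteria text and parameters used in our runs may differ, and an LLM API key is required to execute it:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# LLM-as-a-judge correctness metric; needs an LLM API key to actually run.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

test_case = LLMTestCase(
    input="Which magazine was started first, Arthur's Magazine or First for Women?",
    actual_output="Arthur's Magazine",
    expected_output="Arthur's Magazine",
)
correctness.measure(test_case)
print(correctness.score)
```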

Evaluation Methodology

The evaluations shown here use the HotpotQA dataset, which specializes in questions that require reasoning across multiple documents. This makes it a good test of knowledge integration across sources and of forming logical connections between facts. Our main goal is to assess how well each system supplies the LLM with the context it needs for that kind of reasoning.
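
For a sense of the raw data, HotpotQA can be inspected directly. Below is a minimal sketch using the Hugging Face `datasets` package; our own harness may load and preprocess the data differently:

```python
from datasets import load_dataset

# "distractor" setting: each question ships with 10 paragraphs, only some of them relevant.
hotpot = load_dataset("hotpot_qa", "distractor", split="validation")

sample = hotpot[0]
print(sample["question"])               # needs facts from more than one document
print(sample["answer"])                 # short ground-truth answer
print(sample["type"], sample["level"])  # e.g. "comparison" / "hard"
# Supporting facts span multiple titles, which is what makes the question multi-hop.
print(sorted(set(sample["supporting_facts"]["title"])))
```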

For systems to handle such queries effectively, they must:

  1. Identify relevant information across documents

  2. Build meaningful relationships between disparate data points

  3. Deliver structured, coherent context to the LLM
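
As a toy illustration of these three steps, the sketch below retrieves documents by keyword overlap, links them through shared capitalized terms, and assembles a structured context block. Everything in it (the corpus, the scorer, the linker) is deliberately simplistic and stands in for the semantic search and graph construction a real memory system would perform:

```python
from itertools import combinations

corpus = {
    "doc_a": "Scott Derrickson is an American director born in 1966.",
    "doc_b": "Ed Wood was an American filmmaker and actor.",
    "doc_c": "The Eiffel Tower is located in Paris.",
}

question = "Were Scott Derrickson and Ed Wood of the same nationality?"

# Step 1: identify relevant documents (keyword overlap as a stand-in for retrieval).
q_tokens = set(question.lower().rstrip("?").split())
relevant = {
    name: text
    for name, text in corpus.items()
    if len(q_tokens & set(text.lower().rstrip(".").split())) >= 2
}

# Step 2: build relationships (shared capitalized terms as a stand-in for entity linking).
def entities(text: str) -> set[str]:
    return {tok.strip(".,") for tok in text.split() if tok[0].isupper()}

edges = [
    (a, b, entities(corpus[a]) & entities(corpus[b]))
    for a, b in combinations(relevant, 2)
    if entities(corpus[a]) & entities(corpus[b])
]

# Step 3: deliver structured, coherent context to the LLM.
context = "\n".join(
    [f"[{name}] {text}" for name, text in relevant.items()]
    + [f"[link] {a} <-> {b} via {sorted(shared)}" for a, b, shared in edges]
)
print(context)  # this block would be prepended to the prompt sent to the LLM
```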

Our analysis included three common approaches:

  • Chatbot Memory Systems (e.g., Mem0): Focus on direct context management
  • Graph-based Systems (e.g., Graphiti): Emphasize relationship mapping
  • cognee’s Approach: Combines semantic understanding with graph structures

Evaluation Framework

We developed an evaluation framework that can run any benchmark with custom metrics. To learn more, see our guide section.
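
The framework's core idea is simple: answer every question with the system under test, then score each prediction with whatever metrics you plug in. A hedged sketch of that shape follows; the class and function names are illustrative, not cognee's actual evaluation API:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class QAItem:
    question: str
    ground_truth: str


# A metric maps (prediction, ground_truth) to a score in [0, 1].
Metric = Callable[[str, str], float]


def run_benchmark(
    items: List[QAItem],
    answer_fn: Callable[[str], str],   # the memory system under test
    metrics: Dict[str, Metric],        # e.g. {"EM": exact_match, "F1": f1_score}
) -> Dict[str, float]:
    """Answer every question, score each prediction, and average per metric."""
    scores: Dict[str, List[float]] = {name: [] for name in metrics}
    for item in items:
        prediction = answer_fn(item.question)
        for name, metric in metrics.items():
            scores[name].append(metric(prediction, item.ground_truth))
    return {name: mean(values) for name, values in scores.items()}
```

Swapping in Correctness, F1, or any custom metric is then just a matter of adding an entry to the metrics dictionary.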

Current Limitations

Current benchmarking approaches face several limitations:

  • Metric Reliability: LLM-based evaluation metrics can show inconsistency
  • Granularity Issues: Character-level F1 scores may not reflect semantic understanding
  • Scaling Constraints: Human evaluation provides accuracy but doesn’t scale
  • Dataset Scope: Standard datasets may not capture all real-world complexity

For detailed evaluation methodologies and complete results, visit our GitHub repository.

Join the Conversation!

Have questions about our evaluation approach? Join our community to discuss benchmarking and AI memory systems!