Descriptive Metrics for Graph Validation
The descriptive metrics functionality in Cognee provides key insights into the correctness and structure of the generated knowledge graph. These metrics help ensure the integrity of the graph, detect inconsistencies, and evaluate the efficiency of the Cognee pipeline.
Here’s how to generate descriptive metrics in Cognee:
Step 1: Prepare your documents
For this example, let’s use a short text.
text = """
Natural language processing (NLP) is an interdisciplinary
subfield of computer science and information retrieval.
"""
Step 2: Prune, Add, Cognify!
Before processing new data, clear any previously stored information.
import cognee  # the awaits below need an async context, such as a notebook cell

await cognee.prune.prune_data()
await cognee.prune.prune_system(metadata=True)
Add the documents (in this case, a single string) to Cognee.
await cognee.add(text)
Cognify! (Knowledge graph generation step)
- This step extracts insights, generates summaries, and creates connections.
- During this process, descriptive metrics are calculated in the background.
await cognee.cognify()
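Put together, the snippets from Steps 1 and 2 can be run as a single script. The following is a minimal sketch rather than an official entry point: `build_graph` is a hypothetical helper name, and the Cognee module is passed in as an argument (a choice made here so the function can be tried with a stand-in object). It calls only the four functions shown above.

```python
import asyncio

text = """
Natural language processing (NLP) is an interdisciplinary
subfield of computer science and information retrieval.
"""


async def build_graph(client, document):
    """Run the prune, add, cognify sequence from Step 2.

    `client` is expected to be the imported `cognee` module; it is a
    parameter only so this hypothetical helper is easy to exercise.
    """
    # Clear previously stored data and system metadata.
    await client.prune.prune_data()
    await client.prune.prune_system(metadata=True)
    # Register the document, then generate the knowledge graph.
    await client.add(document)
    await client.cognify()


# Typical use (requires a configured Cognee installation):
#   import cognee
#   asyncio.run(build_graph(cognee, text))
```

Wrapping the calls in a coroutine and starting it with `asyncio.run` is what makes the `await` statements valid outside a notebook.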
Step 3: Retrieve descriptive metrics from the database
- Open pgAdmin (or your database management tool) and refresh the database.
[Image: Refresh database]
- Find the graph_metrics table and retrieve the computed metrics.
[Image: Metrics in pgAdmin]
Note: Postgres and pgvector need to be configured in the environment variables.
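The same table can also be read from code instead of a GUI tool. The sketch below assumes only the table name `graph_metrics` from the step above; `fetch_graph_metrics` is a hypothetical helper, and the column names are read from the cursor because the table schema is not described here. It accepts any DB-API 2.0 connection, so the demo uses an in-memory SQLite table with illustrative values as a stand-in for the real Postgres database.

```python
import sqlite3


def fetch_graph_metrics(connection):
    """Return every row of graph_metrics as a list of dicts.

    Works with any DB-API 2.0 connection. Column names are taken from
    the cursor, so no particular table schema is assumed.
    """
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT * FROM graph_metrics")
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Stand-in demo: an in-memory SQLite table with two illustrative columns.
# A real Cognee setup would pass a PostgreSQL connection instead.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE graph_metrics (num_nodes INTEGER, num_edges INTEGER)")
demo.execute("INSERT INTO graph_metrics VALUES (12, 17)")
rows = fetch_graph_metrics(demo)
print(rows)  # [{'num_nodes': 12, 'num_edges': 17}]
```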
Metrics
Below is a list of descriptive metrics calculated for graph correctness and quality assessment. All calculations first convert the graph to an undirected graph, so edge direction is ignored in path-related metrics.
1. Token & Graph Size Metrics
- num_tokens: Total number of tokens in the input text.
- num_nodes: Total number of nodes (entities) in the knowledge graph.
- num_edges: Total number of edges (relationships) connecting nodes.
2. Connectivity & Density Metrics
- mean_degree: Average number of edges per node.
- edge_density: Ratio of existing edges to the maximum possible edges, measuring graph sparsity.
- num_connected_components: Number of connected components; higher values may indicate fragmentation.
- sizes_of_connected_components: Number of nodes in each connected component.
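To make the connectivity and density definitions concrete, here is a small pure-Python sketch. It is not Cognee's implementation: `connectivity_metrics` is a hypothetical function, mean degree is taken as 2m / n, and density uses the simple-graph formula 2m / (n(n - 1)), both of which are assumptions about how the stored values are computed.

```python
from collections import deque


def connectivity_metrics(nodes, edges):
    """Compute degree, density and component metrics for an undirected graph.

    `nodes` is an iterable of hashable ids and `edges` a list of (u, v) pairs.
    """
    nodes = list(nodes)
    n, m = len(nodes), len(edges)
    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Breadth-first search from every unvisited node finds the components.
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            current = queue.popleft()
            size += 1
            for nxt in neighbors[current] - seen:
                seen.add(nxt)
                queue.append(nxt)
        sizes.append(size)

    return {
        "mean_degree": 2 * m / n if n else 0.0,
        "edge_density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "num_connected_components": len(sizes),
        "sizes_of_connected_components": sorted(sizes, reverse=True),
    }


# Two components: a triangle (a, b, c) and a single edge (d, e).
example = connectivity_metrics(
    "abcde", [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
)
print(example)
```

For this five-node example the mean degree is 1.6, the density is 0.4, and the component sizes are 3 and 2.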
3. Structural Integrity Metrics
- num_selfloops: Number of self-loops (nodes connected to themselves); may indicate errors.
- diameter: The longest shortest path between any two nodes in the graph.
- avg_shortest_path_length: The average shortest path between all node pairs, indicating graph efficiency.
- avg_clustering: Average clustering coefficient, measuring the likelihood of nodes forming tightly connected groups.
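The structural metrics can be illustrated the same way. The sketch below is a plain-Python approximation under stated assumptions, not the code Cognee runs: `structural_metrics` is a hypothetical name, the graph is treated as undirected, self-loops are counted but left out of path and triangle calculations, and path metrics are returned as `None` when the graph is not connected.

```python
from collections import deque
from itertools import combinations


def structural_metrics(nodes, edges):
    """Self-loops, diameter, average shortest path and average clustering."""
    nodes = list(nodes)
    neighbors = {node: set() for node in nodes}
    num_selfloops = 0
    for u, v in edges:
        if u == v:
            num_selfloops += 1
            continue  # self-loops do not affect paths or triangles
        neighbors[u].add(v)
        neighbors[v].add(u)

    def bfs_distances(source):
        # Hop counts from `source` to every reachable node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in neighbors[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        return dist

    all_dist = [bfs_distances(node) for node in nodes]
    connected = all(len(d) == len(nodes) for d in all_dist)
    diameter = avg_path = None
    if connected and len(nodes) > 1:
        lengths = [d for dist in all_dist for d in dist.values() if d > 0]
        diameter = max(lengths)
        avg_path = sum(lengths) / len(lengths)

    # Local clustering: fraction of a node's neighbor pairs that are linked.
    coefficients = []
    for node in nodes:
        k = len(neighbors[node])
        if k < 2:
            coefficients.append(0.0)
            continue
        links = sum(
            1 for a, b in combinations(neighbors[node], 2) if b in neighbors[a]
        )
        coefficients.append(2 * links / (k * (k - 1)))

    return {
        "num_selfloops": num_selfloops,
        "diameter": diameter,
        "avg_shortest_path_length": avg_path,
        "avg_clustering": sum(coefficients) / len(coefficients) if coefficients else 0.0,
    }


# A triangle (a, b, c), a pendant node d attached to c, and a self-loop on d.
result = structural_metrics(
    "abcd", [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "d")]
)
print(result)
```

In this example the self-loop on `d` is reported once, the diameter is 2 (from `a` or `b` to `d`), and only `a` and `b` have a clustering coefficient of 1.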
Interpreting the Metrics
When analyzing these metrics, certain extreme values may indicate potential issues, e.g.:
- Number of input tokens in relation to graph size: A large input document that yields only a few nodes indicates a discrepancy.
- Mean degree: A value of 1 suggests an extremely sparse graph, while a value too close to num_nodes suggests excessive connectivity. Since multiple edges can exist between two nodes, the mean degree can exceed num_nodes, but an unusually high value should be investigated.
- Diameter: Extreme values such as 1 (a fully connected graph) or num_nodes - 1 (a path) clearly indicate issues.
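These checks are simple enough to automate. The following sketch turns the three rules above into a function; `flag_extreme_values` and its `max_tokens_per_node` threshold are hypothetical and not part of Cognee, and the token counts in the example are illustrative values only.

```python
def flag_extreme_values(metrics, max_tokens_per_node=200):
    """Return warnings for the extreme cases listed above.

    `metrics` is a dict keyed by the metric names on this page.
    `max_tokens_per_node` is a hypothetical threshold, not a Cognee default.
    """
    warnings = []
    n = metrics["num_nodes"]
    if n and metrics["num_tokens"] / n > max_tokens_per_node:
        warnings.append("few nodes for the amount of input text")
    if metrics["mean_degree"] <= 1:
        warnings.append("mean degree <= 1: graph is extremely sparse")
    elif metrics["mean_degree"] >= n - 1:
        warnings.append("mean degree near num_nodes: graph is very dense")
    if n > 2 and metrics["diameter"] == 1:
        warnings.append("diameter 1: every node is linked to every other")
    elif n > 2 and metrics["diameter"] == n - 1:
        warnings.append("diameter num_nodes - 1: graph is a single path")
    return warnings


# Illustrative inputs: a fully connected graph and a path, 50 nodes each.
dense = {"num_tokens": 1000, "num_nodes": 50, "mean_degree": 49, "diameter": 1}
path_like = {"num_tokens": 1000, "num_nodes": 50, "mean_degree": 1.96, "diameter": 49}
print(flag_extreme_values(dense))
print(flag_extreme_values(path_like))
```

The thresholds are deliberately conservative; in practice they would need tuning against graphs that are known to be good.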
The following table compares 50-node graphs that exhibit extreme values for diameter and mean degree with a well-formed one. The extreme cases show how widely structural properties can vary, from highly connected graphs with small diameters to sparse graphs with minimal connectivity. Such configurations are unlikely to correspond to real-world knowledge graphs, which typically balance connectivity and semantic structure to preserve meaningful relationships.
Metric | Too dense | Correct | Too sparse |
---|---|---|---|
Mean degree | 49 | 3.52 | 1.96 |
Diameter | 1 | 9 | 49 |
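The two extreme columns can be reproduced with a few lines of code, assuming the "too dense" case is a complete graph and the "too sparse" case is a path, as the diameter values suggest. The helper name `mean_degree_and_diameter` is hypothetical, and the "Correct" column is not reproduced because its underlying graph is not given.

```python
from collections import deque


def mean_degree_and_diameter(adjacency):
    """Return (mean degree, diameter) for a connected undirected graph."""
    n = len(adjacency)
    mean_degree = sum(len(nbrs) for nbrs in adjacency.values()) / n

    def eccentricity(source):
        # Longest BFS distance from `source` to any other node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        return max(dist.values())

    return mean_degree, max(eccentricity(node) for node in adjacency)


n = 50
# "Too dense": complete graph, every node linked to all 49 others.
complete = {i: {j for j in range(n) if j != i} for i in range(n)}
# "Too sparse": path graph, node i linked only to i - 1 and i + 1.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

print(mean_degree_and_diameter(complete))  # (49.0, 1)
print(mean_degree_and_diameter(path))      # (1.96, 49)
```

The path graph has 49 edges, so its mean degree is 2 × 49 / 50 = 1.96, which matches the table.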
Tracking Changes Over Time
Beyond analyzing static graphs, monitoring how these metrics evolve over time can reveal important insights about structural shifts. Unexpected deviations may indicate potential issues.
- Adding new data should increase the number of nodes.
- Adding information about existing entities should increase the number of edges and edge density.
- If new data is loosely related to existing data, the diameter may increase.
- If new data strengthens existing entity relationships, the diameter may decrease.
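One way to monitor these expectations is to store a metrics snapshot after each run and compare consecutive snapshots. The sketch below is illustrative only: `compare_snapshots`, its warning rules, and the snapshot values in the example are assumptions, not something Cognee provides.

```python
def compare_snapshots(before, after):
    """Describe how the metrics changed between two runs.

    Both arguments are dicts keyed by the metric names on this page.
    Returns (deltas, warnings); the warning rules are illustrative.
    """
    tracked = ("num_nodes", "num_edges", "edge_density", "diameter")
    deltas = {name: after[name] - before[name] for name in tracked}

    warnings = []
    if deltas["num_nodes"] < 0:
        warnings.append("num_nodes dropped after adding data")
    if deltas["num_edges"] < 0:
        warnings.append("num_edges dropped after adding data")
    if deltas["num_nodes"] == 0 and deltas["num_edges"] == 0:
        warnings.append("graph size unchanged: new data may not have been processed")
    return deltas, warnings


# Illustrative snapshots: loosely related data was added, so the graph
# grew and its diameter increased.
before = {"num_nodes": 10, "num_edges": 12, "edge_density": 0.27, "diameter": 4}
after = {"num_nodes": 14, "num_edges": 15, "edge_density": 0.16, "diameter": 6}
deltas, warnings = compare_snapshots(before, after)
print(deltas["num_nodes"], deltas["diameter"], warnings)  # 4 2 []
```

A growing diameter is not flagged here, because the list above treats it as an expected effect of loosely related data rather than an error.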