Code Assistants

CodeGraph: Enhancing Codebase Understanding with Graphs and LLMs

Scenario:

Modern software development often involves massive codebases spread across multiple repositories, teams, and services. Engineers and AI-based coding copilots struggle to maintain a clear mental model of how components interrelate. For instance:

  • Large Repositories: A large software project might span many GitHub repositories, together containing thousands of files.
  • Complex Dependencies: Services often call each other via APIs, share data models, or rely on specific configuration files. Finding the right function, class, or module can become tedious.
  • Evolving Code: As code evolves, comments get stale, architectural assumptions shift, and documentation becomes outdated, making it hard for coding copilots to reliably generate correct, context-aware suggestions.

Challenges:

  1. Fragmented Knowledge: It’s difficult to piece together the full dependency graph across repositories.
  2. Limited Context for LLMs: Large Language Models struggle to provide accurate code completions or refactoring suggestions when they lack a broader view of the project’s architecture.
  3. Time Lost: Developers spend significant time searching through repositories, reading documentation, and attempting to piece together the “big picture” of the codebase.

Solution: Creating a CodeGraph

A CodeGraph is a knowledge graph that models the codebase at multiple levels of granularity. It goes beyond just indexing code: it captures entities and relationships within and across repositories, as sketched after the list below.

  • Entities: Functions, classes, modules, services, configuration files, APIs, tests, CI/CD pipelines, and documentation pages.

  • Relationships: Who-calls-what (function call graphs), import dependencies, version histories, code ownership, and semantic links (e.g., “this module implements a particular design pattern” or “this API endpoint is deprecated and replaced by another”).
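To make these entity and relationship types concrete, here is a minimal, illustrative sketch in Python using Pydantic (which also appears in the construction steps below). The CodeEntity and Relationship classes, their fields, and the example IDs are hypothetical and are not cognee’s actual schema.

```python
# Illustrative only: a minimal Pydantic model of CodeGraph entities and edges.
from typing import Literal

from pydantic import BaseModel


class CodeEntity(BaseModel):
    """A single node in the CodeGraph: a function, class, module, service, ..."""
    id: str
    kind: Literal["function", "class", "module", "service", "config", "api", "test", "doc"]
    name: str
    repo: str
    path: str


class Relationship(BaseModel):
    """A directed edge between two CodeGraph entities."""
    source_id: str
    target_id: str
    kind: Literal["calls", "imports", "owned_by", "documented_in", "deprecated_by"]


# Example: FunctionParseUserInput is defined in Utils/InputProcessor.js
# and is called from SearchEngine.js.
parse_fn = CodeEntity(
    id="fn:parse_user_input", kind="function", name="FunctionParseUserInput",
    repo="search-service", path="Utils/InputProcessor.js",
)
search_module = CodeEntity(
    id="mod:search_engine", kind="module", name="SearchEngine",
    repo="search-service", path="SearchEngine.js",
)
calls = Relationship(source_id=search_module.id, target_id=parse_fn.id, kind="calls")
```

Keeping entities and relationships as explicit, typed records is what lets the graph answer questions such as “who calls what” without re-parsing source files each time.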

How we constructed a CodeGraph:

  • Build direct dependencies discovered through chained attribute access
  • Build direct dependencies mediated through __init__ imports
  • Define Pydantic data structures that describe a single knowledge node, shared by all node types
  • Create a knowledge graph from those nodes and their relationships
  • Write an in-memory retriever that loads the graph skeleton and extracts triplets (a minimal sketch of this step follows below)
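The last step is the easiest to picture in code. Below is a minimal, illustrative sketch of an in-memory graph skeleton and a retriever that extracts (subject, relation, object) triplets from it; it uses networkx for brevity and is not cognee’s actual retriever implementation.

```python
# Illustrative only: a tiny in-memory "retriever" that walks a graph skeleton
# and emits (subject, relation, object) triplets.
import networkx as nx


def extract_triplets(graph: nx.DiGraph) -> list[tuple[str, str, str]]:
    """Return every edge as a (source, relation, target) triplet."""
    return [
        (source, data.get("relation", "related_to"), target)
        for source, target, data in graph.edges(data=True)
    ]


# Build a toy skeleton: a module, a function, and a config file it depends on.
skeleton = nx.DiGraph()
skeleton.add_edge("SearchEngine.js", "FunctionParseUserInput", relation="calls")
skeleton.add_edge("FunctionParseUserInput", "InputSchema.json", relation="depends_on")

for triplet in extract_triplets(skeleton):
    print(triplet)
# ('SearchEngine.js', 'calls', 'FunctionParseUserInput')
# ('FunctionParseUserInput', 'depends_on', 'InputSchema.json')
```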

Here is an example graph generated with cognee (figure: code_assistants_graph_example).

Read more about our approach in our blog.

Enriching CodeGraph with LLMs: To make this knowledge even more actionable, we integrate Large Language Models that understand code semantics and developer documentation. The LLM can (see the sketch after this list):

  1. Ingest the Graph: The LLM has access to structured context from the CodeGraph, so when a developer asks, “Where is the function that parses user inputs for our search engine?”, the LLM can quickly locate that function by following the graph’s relationships rather than brute-forcing file searches.

  2. Provide Context-Rich Suggestions: When the coding copilot suggests a code snippet, it can reference related modules, highlight deprecations, or warn about known compatibility issues. For example, “You might want to call FunctionParseUserInput from Utils/InputProcessor.js. It’s used in SearchEngine.js and depends on InputSchema.json.”

  3. Explain Architectural Decisions: Developers can query the LLM about architectural choices: “Why does ServiceD depend on ServiceE?”. The LLM, using the CodeGraph, responds: “ServiceD calls ServiceE’s authentication endpoint to validate tokens, as documented in ServiceE/docs/auth.md.”

  4. Link to Documentation and Commit Histories: The LLM can connect a piece of code to its associated design docs, recent commit messages, or open pull requests. If a developer asks, “How has UserProfileAPI.js changed over the last quarter?” the LLM can summarize major refactoring steps, point to relevant issues that were closed, and link to architectural decision records.
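As a rough illustration of step 1, the sketch below shows how retrieved CodeGraph triplets can be serialized into the context of an LLM prompt. The triplet list, the prompt wording, and the build_prompt helper are all hypothetical; any LLM client can consume the resulting string.

```python
# Illustrative only: turning retrieved CodeGraph triplets into LLM prompt context.
def build_prompt(question: str, triplets: list[tuple[str, str, str]]) -> str:
    """Serialize graph context and the developer's question into one prompt."""
    context = "\n".join(f"{source} --{relation}--> {target}" for source, relation, target in triplets)
    return (
        "You are a coding copilot with access to a CodeGraph of the repository.\n"
        f"Known relationships:\n{context}\n\n"
        f"Developer question: {question}\n"
        "Answer using only the relationships above, and cite the files involved."
    )


triplets = [
    ("SearchEngine.js", "calls", "FunctionParseUserInput"),
    ("FunctionParseUserInput", "defined_in", "Utils/InputProcessor.js"),
    ("FunctionParseUserInput", "depends_on", "InputSchema.json"),
]
print(build_prompt("Where is the function that parses user inputs for our search engine?", triplets))
```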

Outcomes:

  • Improved Developer Productivity: Instead of wading through multiple repositories, developers get immediate, context-aware guidance, saving countless hours of manual searching and guesswork.

  • More Accurate Code Suggestions: Coding copilots armed with a CodeGraph context deliver more reliable and secure code completions, better refactoring strategies, and insightful recommendations.

  • Evolving with the Codebase: As repositories grow, the CodeGraph and the LLM continuously update. This ensures that as code evolves, the memory and context available to developers—and their automated assistants—stay fresh and relevant.

Run a Demo Yourself!

Curious about how this works with cognee? Try it out in our notebook here.
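If you prefer to start from a script rather than the notebook, the outline below shows roughly what running cognee’s code-graph pipeline over a local repository looks like. This is a minimal sketch assuming a run_code_graph_pipeline entry point; the import path and signature may differ between cognee versions, so follow the notebook for the authoritative steps.

```python
# Minimal sketch, assuming cognee exposes run_code_graph_pipeline as an async
# generator; verify the import path against your installed cognee version.
import asyncio

from cognee.api.v1.cognify.code_graph_pipeline import run_code_graph_pipeline


async def main() -> None:
    # Point the pipeline at a local clone of the repository you want to map.
    async for run_status in run_code_graph_pipeline("/path/to/your/repo"):
        print(run_status)


asyncio.run(main())
```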

Join the Conversation!

Have questions? Join our community now to connect with professionals, share insights, and get your questions answered!