Scrape web pages and feed the extracted content into cognee using ScrapeGraphAI. The cognee-community-tasks-scrapegraph package provides two async tasks: scrape_urls for extraction and scrape_and_add for end-to-end scrape-to-graph ingestion.

Why Use This Integration

  • Prompt-Based Extraction: Describe what you want in natural language — no CSS selectors or scraper maintenance
  • Single Function Pipeline: scrape_and_add scrapes, ingests via cognee.add, and runs cognify in one call
  • Structured Output: Optionally pass a Pydantic schema for domain-specific extraction
  • Source Attribution: Each scraped page is tagged with its origin URL in the knowledge graph
  • JavaScript Rendering: Handles JS-rendered pages and common bot protection

Installation

pip install cognee-community-tasks-scrapegraph cognee
Or with uv:
uv pip install cognee-community-tasks-scrapegraph cognee

Requirements

You need two API keys:
Variable       Description
LLM_API_KEY    OpenAI (or other LLM provider) API key used by cognee
SGAI_API_KEY   ScrapeGraphAI API key
export LLM_API_KEY="sk-..."
export SGAI_API_KEY="sgai-..."
See LLM Providers and Embedding Providers if you want to use a provider other than OpenAI.
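Before running any of the examples below, it can save a confusing stack trace to verify both variables are actually set. A minimal sketch (the helper name is ours, not part of either package):

```python
import os

REQUIRED_KEYS = ("LLM_API_KEY", "SGAI_API_KEY")

def missing_keys(env: dict) -> list:
    """Return the required variables that are absent or empty in the given mapping."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    # Prints an empty list when both keys are exported.
    print("missing:", missing_keys(dict(os.environ)))
```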

Quick Start

1. Scrape and Inspect

Use scrape_urls to verify what ScrapeGraphAI extracts before building a graph:
import asyncio
from cognee_community_tasks_scrapegraph import scrape_urls

async def main():
    results = await scrape_urls(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
    )

    for item in results:
        if item.get("error"):
            print(f"[!] {item['url']}: {item['error']}")
        else:
            print(f"\n=== {item['url']} ===")
            print(str(item["content"])[:500])

asyncio.run(main())
The user_prompt tells ScrapeGraphAI what to focus on when extracting content from each page.
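As the loop above shows, each result dict carries a url, content, and an error key on failure. If you want to retry failed pages before moving on to graph building, a small helper (ours, not part of the package) can separate the two:

```python
def partition_results(results):
    """Split scrape_urls output into (successes, failures) using the
    'error' key each result dict carries."""
    ok = [r for r in results if not r.get("error")]
    failed = [r for r in results if r.get("error")]
    return ok, failed
```

Pass the failed list's URLs back into scrape_urls for a second attempt, or log them and proceed with the successes.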

2. Build the Knowledge Graph

Use scrape_and_add to scrape, ingest, and cognify in one call:
import asyncio
import cognee
from cognee_community_tasks_scrapegraph import scrape_and_add

async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    await scrape_and_add(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
            "https://github.com/topoteretes/cognee",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
        dataset_name="cognee_research",
    )

    results = await cognee.search("What is Cognee and what problems does it solve?")
    for r in results:
        print(r)

asyncio.run(main())
The prune calls reset the local database. Skip them when building incrementally on top of an existing graph. See cognee.add and cognify for details on the ingestion pipeline.
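When building incrementally, you will usually want to avoid re-scraping pages that are already in the graph. As far as we know the package does not dedupe for you (an assumption worth verifying), so a sketch of tracking ingested URLs yourself:

```python
def urls_to_scrape(candidates, already_ingested):
    """Drop URLs that were ingested in a previous run (exact-match dedupe).
    Preserves the order of the remaining candidates."""
    seen = set(already_ingested)
    return [u for u in candidates if u not in seen]
```

Persist the ingested set between runs (a file or small table is enough), filter new candidates through it, then call scrape_and_add with the same dataset_name and without the prune calls.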

Structured Extraction

When you know the shape of the data you need, pass a Pydantic schema to ScrapeGraphAI’s smartscraper directly. This bypasses the integration’s scrape_urls and gives you full control over the output structure:
import asyncio
import os
import cognee
from pydantic import BaseModel, Field
from scrapegraph_py import Client

class ProductPage(BaseModel):
    name: str = Field(description="Product or company name")
    tagline: str = Field(description="One-line value proposition")
    features: list[str] = Field(description="List of key features or capabilities")
    pricing_model: str = Field(description="How the product is priced")
    target_audience: str = Field(description="Who the product is primarily aimed at")

async def scrape_structured(urls: list[str]) -> list[str]:
    client = Client(api_key=os.environ["SGAI_API_KEY"])
    texts = []
    try:
        for url in urls:
            response = client.smartscraper(
                website_url=url,
                user_prompt="Extract product information",
                output_schema=ProductPage,
            )
            result = response.get("result", {})
            text = f"""Source: {url}
Product: {result.get('name', 'N/A')}
Tagline: {result.get('tagline', 'N/A')}
Features: {', '.join(result.get('features', []))}
Pricing: {result.get('pricing_model', 'N/A')}
Audience: {result.get('target_audience', 'N/A')}"""
            texts.append(text)
    finally:
        client.close()
    return texts

async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    urls = ["https://cognee.ai", "https://scrapegraphai.com", "https://firecrawl.dev"]
    texts = await scrape_structured(urls)

    combined = "\n\n".join(texts)
    await cognee.add(combined, dataset_name="product_landscape")
    await cognee.cognify()

    results = await cognee.search("How do these products differ in their approach?")
    for r in results:
        print(r)

asyncio.run(main())

Querying the Graph

Once cognify completes, use cognee.search to query the graph. See Search for all available search types and parameters.

Use Cases

Scrape competitor product and pricing pages, build a knowledge graph, then query across all of them:
  1. Gather competitor URLs (product pages, pricing, docs)
  2. Use scrape_and_add with a prompt focused on pricing, features, and positioning
  3. Query with synthesis questions like “Which product is best for enterprise use cases?”
Scope queries to the dataset by passing datasets=["competitive_intel"] to cognee.search.

Scrape news sources on a schedule and add to the existing graph incrementally:
  1. Set up a list of news/blog URLs
  2. Run scrape_and_add daily (skip the prune calls to accumulate data)
  3. Query across the full timeline: “What are the biggest trends this week?”
The graph improves over time as entities and relationships accumulate.

Collect and correlate information from many sources:
  1. Scrape documentation, blog posts, and GitHub READMEs for a topic
  2. Build the graph with scrape_and_add
  3. Ask cross-source questions: “How does library X compare to library Y?”
Use structured extraction with Pydantic schemas for consistent input.

Map out an entire product category:
  1. Scrape product pages with a schema targeting name, features, pricing, and audience
  2. Ingest into cognee and cognify
  3. Query for patterns: “Which products target developers?” or “What pricing models are most common?”
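For quick pattern checks like the pricing-model question, structured results can also be tallied locally before (or alongside) graph queries. A sketch assuming result dicts shaped like the ProductPage schema above:

```python
from collections import Counter

def pricing_model_counts(products):
    """Tally pricing models across scraped product records
    (dicts shaped like the ProductPage schema)."""
    return Counter(p.get("pricing_model", "unknown") for p in products)
```

This is a sanity check on the scraped data, not a replacement for graph queries; cross-entity questions still belong in cognee.search.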