Scrape web pages and feed the extracted content into cognee using ScrapeGraphAI. The cognee-community-tasks-scrapegraph package provides two async tasks: scrape_urls for extraction and scrape_and_add for end-to-end scrape-to-graph ingestion.

Why Use This Integration

  • Prompt-Based Extraction: Describe what you want in natural language — no CSS selectors or scraper maintenance
  • Single Function Pipeline: scrape_and_add scrapes, ingests via cognee.add, and runs cognify in one call
  • Structured Output: Optionally pass a Pydantic schema for domain-specific extraction
  • Source Attribution: Each scraped page is tagged with its origin URL in the knowledge graph
  • JavaScript Rendering: Handles JS-rendered pages and common bot protection

Installation

pip install cognee-community-tasks-scrapegraph cognee
Or with uv:
uv pip install cognee-community-tasks-scrapegraph cognee

Requirements

You need two API keys:
Variable       Description
LLM_API_KEY    OpenAI (or other LLM provider) API key used by cognee
SGAI_API_KEY   ScrapeGraphAI API key
export LLM_API_KEY="sk-..."
export SGAI_API_KEY="sgai-..."
See LLM Providers and Embedding Providers if you want to use a provider other than OpenAI.
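Before running any of the examples below, it can save a confusing stack trace to verify both variables are actually set. A minimal sketch (the helper name is ours, not part of either package):

```python
import os

REQUIRED_KEYS = ("LLM_API_KEY", "SGAI_API_KEY")

def missing_keys(env: dict) -> list:
    """Return the required variables that are absent or empty in the given mapping."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    # Prints an empty list when both keys are exported.
    print("missing:", missing_keys(dict(os.environ)))
```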

Quick Start

1. Scrape and Inspect

Use scrape_urls to verify what ScrapeGraphAI extracts before building a graph:
import asyncio
from cognee_community_tasks_scrapegraph import scrape_urls

async def main():
    results = await scrape_urls(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
    )

    for item in results:
        if item.get("error"):
            print(f"[!] {item['url']}: {item['error']}")
        else:
            print(f"\n=== {item['url']} ===")
            print(str(item["content"])[:500])

asyncio.run(main())
The user_prompt tells ScrapeGraphAI what to focus on when extracting content from each page.
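As the loop above shows, each result dict carries a url, content, and an error key on failure. If you want to retry failed pages before moving on to graph building, a small helper (ours, not part of the package) can separate the two:

```python
def partition_results(results):
    """Split scrape_urls output into (successes, failures) using the
    'error' key each result dict carries."""
    ok = [r for r in results if not r.get("error")]
    failed = [r for r in results if r.get("error")]
    return ok, failed
```

Pass the failed list's URLs back into scrape_urls for a second attempt, or log them and proceed with the successes.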

2. Build the Knowledge Graph

Use scrape_and_add to scrape, ingest, and cognify in one call:
import asyncio
import cognee
from cognee_community_tasks_scrapegraph import scrape_and_add

async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    await scrape_and_add(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
            "https://github.com/topoteretes/cognee",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
        dataset_name="cognee_research",
    )

    results = await cognee.search("What is Cognee and what problems does it solve?")
    for r in results:
        print(r)

asyncio.run(main())
The prune calls reset the local database. Skip them when building incrementally on top of an existing graph. See cognee.add and cognify for details on the ingestion pipeline.
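When building incrementally, you will usually want to avoid re-scraping pages that are already in the graph. As far as we know the package does not dedupe for you (an assumption worth verifying), so a sketch of tracking ingested URLs yourself:

```python
def urls_to_scrape(candidates, already_ingested):
    """Drop URLs that were ingested in a previous run (exact-match dedupe).
    Preserves the order of the remaining candidates."""
    seen = set(already_ingested)
    return [u for u in candidates if u not in seen]
```

Persist the ingested set between runs (a file or small table is enough), filter new candidates through it, then call scrape_and_add with the same dataset_name and without the prune calls.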

Structured Extraction

When you know the shape of the data you need, pass a Pydantic schema to ScrapeGraphAI’s smartscraper directly. This bypasses the integration’s scrape_urls and gives you full control over the output structure:
import asyncio
import os
import cognee
from pydantic import BaseModel, Field
from scrapegraph_py import Client

class ProductPage(BaseModel):
    name: str = Field(description="Product or company name")
    tagline: str = Field(description="One-line value proposition")
    features: list[str] = Field(description="List of key features or capabilities")
    pricing_model: str = Field(description="How the product is priced")
    target_audience: str = Field(description="Who the product is primarily aimed at")

async def scrape_structured(urls: list[str]) -> list[str]:
    client = Client(api_key=os.environ["SGAI_API_KEY"])
    texts = []
    try:
        for url in urls:
            response = client.smartscraper(
                website_url=url,
                user_prompt="Extract product information",
                output_schema=ProductPage,
            )
            result = response.get("result", {})
            text = f"""Source: {url}
Product: {result.get('name', 'N/A')}
Tagline: {result.get('tagline', 'N/A')}
Features: {', '.join(result.get('features', []))}
Pricing: {result.get('pricing_model', 'N/A')}
Audience: {result.get('target_audience', 'N/A')}"""
            texts.append(text)
    finally:
        client.close()
    return texts

async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    urls = ["https://cognee.ai", "https://scrapegraphai.com", "https://firecrawl.dev"]
    texts = await scrape_structured(urls)

    combined = "\n\n".join(texts)
    await cognee.add(combined, dataset_name="product_landscape")
    await cognee.cognify()

    results = await cognee.search("How do these products differ in their approach?")
    for r in results:
        print(r)

asyncio.run(main())

Querying the Graph

Once cognify completes, use cognee.search to query the graph. See Search for all available search types and parameters.

Use Cases

Scrape competitor product and pricing pages, build a knowledge graph, then query across all of them:
  1. Gather competitor URLs (product pages, pricing, docs)
  2. Use scrape_and_add with a prompt focused on pricing, features, and positioning
  3. Query with synthesis questions like “Which product is best for enterprise use cases?”
Scope queries to the dataset by passing datasets=["competitive_intel"] to cognee.search.

Scrape news sources on a schedule and add to the existing graph incrementally:
  1. Set up a list of news/blog URLs
  2. Run scrape_and_add daily (skip the prune calls to accumulate data)
  3. Query across the full timeline: “What are the biggest trends this week?”
The graph improves over time as entities and relationships accumulate.

Collect and correlate information from many sources:
  1. Scrape documentation, blog posts, and GitHub READMEs for a topic
  2. Build the graph with scrape_and_add
  3. Ask cross-source questions: “How does library X compare to library Y?”
Use structured extraction with Pydantic schemas for consistent input.

Map out an entire product category:
  1. Scrape product pages with a schema targeting name, features, pricing, and audience
  2. Ingest into cognee and cognify
  3. Query for patterns: “Which products target developers?” or “What pricing models are most common?”
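For quick pattern checks like the pricing-model question, structured results can also be tallied locally before (or alongside) graph queries. A sketch assuming result dicts shaped like the ProductPage schema above:

```python
from collections import Counter

def pricing_model_counts(products):
    """Tally pricing models across scraped product records
    (dicts shaped like the ProductPage schema)."""
    return Counter(p.get("pricing_model", "unknown") for p in products)
```

This is a sanity check on the scraped data, not a replacement for graph queries; cross-entity questions still belong in cognee.search.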