# ScrapeGraphAI

Scrape web pages and feed the extracted content into cognee using [ScrapeGraphAI](https://scrapegraphai.com). The [`cognee-community-tasks-scrapegraph`](https://github.com/topoteretes/cognee-community/tree/main/packages/task/scrapegraph_tasks) package provides two async tasks: `scrape_urls` for extraction and `scrape_and_add` for end-to-end scrape-to-graph ingestion.

## Why Use This Integration

* **Prompt-Based Extraction**: Describe what you want in natural language, with no CSS selectors or scraper maintenance
* **Single-Function Pipeline**: `scrape_and_add` scrapes, ingests, and builds the graph in one call
* **Structured Output**: Optionally pass a Pydantic schema for domain-specific extraction
* **Source Attribution**: Each scraped page is tagged with its origin URL in the knowledge graph
* **JavaScript Rendering**: Handles JS-rendered pages and common bot protection

## Installation

```bash
pip install cognee-community-tasks-scrapegraph cognee
```

Or with uv:

```bash
uv pip install cognee-community-tasks-scrapegraph cognee
```

## Requirements

You need two API keys:

| Variable       | Description                                                  |
| -------------- | ------------------------------------------------------------ |
| `LLM_API_KEY`  | OpenAI (or other LLM provider) API key used by cognee        |
| `SGAI_API_KEY` | [ScrapeGraphAI](https://dashboard.scrapegraphai.com) API key |

```bash
export LLM_API_KEY="sk-..."
export SGAI_API_KEY="sgai-..."
```

<Info>
  See [LLM Providers](/setup-configuration/llm-providers) and [Embedding Providers](/setup-configuration/embedding-providers) if you want to use a provider other than OpenAI.
</Info>
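
Both keys are read from the environment, so a missing one may only surface once a request fails. A minimal preflight sketch, assuming the two variable names in the table above; `missing_api_keys` and `require_api_keys` are hypothetical helpers, not part of cognee or the ScrapeGraphAI package:

```python
import os
from typing import Mapping

# Variable names taken from the Requirements table.
REQUIRED_KEYS = ("LLM_API_KEY", "SGAI_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


def require_api_keys(env: Mapping[str, str] = os.environ) -> None:
    """Raise a clear error before any scraping or LLM call is attempted."""
    missing = missing_api_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_api_keys()` at the top of `main()` is one way to fail fast with a readable message.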

## Quick Start

### 1. Scrape and Inspect

Use `scrape_urls` to verify what ScrapeGraphAI extracts before building a graph:

```python
import asyncio
from cognee_community_tasks_scrapegraph import scrape_urls

async def main():
    results = await scrape_urls(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
    )

    for item in results:
        if item.get("error"):
            print(f"[!] {item['url']}: {item['error']}")
        else:
            print(f"\n=== {item['url']} ===")
            print(str(item["content"])[:500])

asyncio.run(main())
```

The `user_prompt` tells ScrapeGraphAI what to focus on when extracting content from each page.
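
The loop above reads each item as a dict with a `url` plus either `content` or `error`. Assuming that shape, a small hypothetical helper (`split_results`, not part of the package) can separate failures before you decide whether to build a graph. The sample list is hand-written for illustration, not real scraper output:

```python
def split_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate pages that scraped cleanly from pages that reported an error."""
    ok, failed = [], []
    for item in results:
        # A truthy "error" value marks a failed page, as in the loop above.
        (failed if item.get("error") else ok).append(item)
    return ok, failed


# Hand-written sample in the shape the Quick Start loop reads.
sample = [
    {"url": "https://cognee.ai", "content": "Example content"},
    {"url": "https://docs.cognee.ai", "error": "timeout"},
]

ok, failed = split_results(sample)
print(f"{len(ok)} scraped, {len(failed)} failed")
for item in failed:
    print(f"retry candidate: {item['url']}")
```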

### 2. Build the Knowledge Graph

Use `scrape_and_add` to scrape, ingest, and build the graph in one call:

```python
import asyncio
import cognee
from cognee_community_tasks_scrapegraph import scrape_and_add

async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    await scrape_and_add(
        urls=[
            "https://cognee.ai",
            "https://docs.cognee.ai",
            "https://github.com/topoteretes/cognee",
        ],
        user_prompt="Extract the product description, key features, and target use cases",
        dataset_name="cognee_research",
    )

    results = await cognee.recall(
        query_text="What is Cognee and what problems does it solve?",
        datasets=["cognee_research"],
    )
    for r in results:
        print(r)

asyncio.run(main())
```

<Info>
  The `prune` calls reset the local database. Skip them when building incrementally on top of an existing graph.
</Info>

## Structured Extraction

When you know the shape of the data you need, pass a Pydantic schema to ScrapeGraphAI's `smartscraper` directly. This bypasses the integration's `scrape_urls` and gives you full control over the output structure:

```python
import asyncio
import os
import cognee
from pydantic import BaseModel, Field
from scrapegraph_py import Client

class ProductPage(BaseModel):
    name: str = Field(description="Product or company name")
    tagline: str = Field(description="One-line value proposition")
    features: list[str] = Field(description="List of key features or capabilities")
    pricing_model: str = Field(description="How the product is priced")
    target_audience: str = Field(description="Who the product is primarily aimed at")

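async def scrape_structured(urls: list[str]) -> list[str]:
    """Scrape each URL against the ProductPage schema; return one text block per page."""
    client = Client(api_key=os.environ["SGAI_API_KEY"])
    texts = []
    try:
        for url in urls:
            # Client.smartscraper is synchronous; run it in a worker thread
            # so it does not block the event loop.
            response = await asyncio.to_thread(
                client.smartscraper,
                website_url=url,
                user_prompt="Extract product information",
                output_schema=ProductPage,
            )
            # Treat a missing or null result as empty so the fallbacks below apply.
            result = response.get("result") or {}
            text = f"""Source: {url}
Product: {result.get('name', 'N/A')}
Tagline: {result.get('tagline', 'N/A')}
Features: {', '.join(result.get('features') or [])}
Pricing: {result.get('pricing_model', 'N/A')}
Audience: {result.get('target_audience', 'N/A')}"""
            texts.append(text)
    finally:
        # Release the client's resources even if a request fails.
        client.close()
    return texts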

async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    urls = ["https://cognee.ai", "https://scrapegraphai.com", "https://firecrawl.dev"]
    texts = await scrape_structured(urls)

    combined = "\n\n".join(texts)
    await cognee.remember(combined, dataset_name="product_landscape")

    results = await cognee.recall(
        query_text="How do these products differ in their approach?",
        datasets=["product_landscape"],
    )
    for r in results:
        print(r)

asyncio.run(main())
```
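
The formatted text in the example falls back to `N/A` for absent fields, which can hide pages where extraction mostly failed. A minimal sketch of a completeness check, assuming the `result` payload is a dict keyed by the `ProductPage` field names; `missing_fields`, `is_usable`, and the `max_missing` threshold are hypothetical and not part of either library:

```python
from typing import Any, Iterable, Mapping, Optional

# Field names copied from the ProductPage schema above.
EXPECTED_FIELDS = ("name", "tagline", "features", "pricing_model", "target_audience")


def missing_fields(
    result: Optional[Mapping[str, Any]],
    expected: Iterable[str] = EXPECTED_FIELDS,
) -> list[str]:
    """Names of expected fields that are absent, None, or empty in result."""
    data = result or {}
    return [name for name in expected if data.get(name) in (None, "", [])]


def is_usable(result: Optional[Mapping[str, Any]], max_missing: int = 2) -> bool:
    """Illustrative threshold: accept a page if at most max_missing fields are empty."""
    return len(missing_fields(result)) <= max_missing
```

Skipping pages where `is_usable` returns `False` keeps mostly empty records out of the combined text; the right threshold depends on your schema.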

## Querying the Graph

Once the graph is built, use `cognee.recall(...)` to query it.
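
To ask several questions against one dataset, the call can be wrapped in a loop. The sketch below is a minimal example, assuming `cognee.recall` accepts the `query_text` and `datasets` keyword arguments used in the examples on this page; `ask_all` is a hypothetical helper that takes the recall function as a parameter so it can be tried with a stand-in:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def ask_all(
    recall: Callable[..., Awaitable[Any]],
    questions: list[str],
    dataset: str,
) -> dict[str, Any]:
    """Run each question against one dataset and map question -> results."""
    answers: dict[str, Any] = {}
    for question in questions:
        # Keyword arguments mirror the recall calls in the Quick Start.
        answers[question] = await recall(query_text=question, datasets=[dataset])
    return answers


async def main() -> None:
    import cognee  # imported here so ask_all can be exercised without cognee

    answers = await ask_all(
        cognee.recall,
        [
            "What is Cognee and what problems does it solve?",
            "What are its key features?",
        ],
        dataset="cognee_research",
    )
    for question, results in answers.items():
        print(f"\n# {question}")
        for r in results:
            print(r)

# To run against a graph you have already built: asyncio.run(main())
```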

## Use Cases

<AccordionGroup>
  <Accordion title="Competitive Intelligence">
    Scrape competitor product and pricing pages, build a knowledge graph, then query across all of them:

    1. Gather competitor URLs (product pages, pricing, docs)
    2. Use `scrape_and_add` with a prompt focused on pricing, features, and positioning
    3. Query with synthesis questions like "Which product is best for enterprise use cases?"

    Scope queries to the dataset with `datasets=["competitive_intel"]`.
  </Accordion>

  <Accordion title="News Monitoring">
    Scrape news sources on a schedule and add to the existing graph incrementally:

    1. Set up a list of news/blog URLs
    2. Run `scrape_and_add` daily (skip the `prune` calls to accumulate data)
    3. Query across the full timeline: "What are the biggest trends this week?"

    Because each run adds to the same dataset, entities and relationships accumulate and later queries can draw on a longer history.
  </Accordion>

  <Accordion title="Research Aggregation">
    Collect and correlate information from many sources:

    1. Scrape documentation, blog posts, and GitHub READMEs for a topic
    2. Build the graph with `scrape_and_add`
    3. Ask cross-source questions: "How does library X compare to library Y?"

    Use structured extraction with Pydantic schemas for consistent input.
  </Accordion>

  <Accordion title="Product Landscape Analysis">
    Map out an entire product category:

    1. Scrape product pages with a schema targeting name, features, pricing, and audience
    2. Ingest into cognee
    3. Query for patterns: "Which products target developers?" or "What pricing models are most common?"
  </Accordion>
</AccordionGroup>
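
For scheduled runs such as the News Monitoring workflow, it can help to remember which article URLs were already ingested so repeat runs do not scrape them again. A minimal sketch using a local JSON file; the file format and the helpers `load_seen`, `select_new`, and `save_seen` are assumptions for illustration, not features of the integration:

```python
import json
from pathlib import Path


def load_seen(path: Path) -> set[str]:
    """Read previously scraped URLs; an absent file means nothing was scraped yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def select_new(urls: list[str], seen: set[str]) -> list[str]:
    """Keep first occurrences of URLs not already in seen, preserving order."""
    fresh: list[str] = []
    for url in urls:
        if url not in seen and url not in fresh:
            fresh.append(url)
    return fresh


def save_seen(path: Path, seen: set[str]) -> None:
    """Persist the URL set as a sorted JSON list."""
    path.write_text(json.dumps(sorted(seen), indent=2), encoding="utf-8")
```

One possible flow is to pass `select_new(...)` to `scrape_and_add`, then call `save_seen` only after that call succeeds. Index pages that change daily would still need to be scraped on every run, so this filter fits individual article URLs best.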

<CardGroup cols={2}>
  <Card title="GitHub Repository" icon="github" href="https://github.com/topoteretes/cognee-community/tree/main/packages/task/scrapegraph_tasks">
    View source code and examples
  </Card>

  <Card title="Blog Post" icon="newspaper" href="https://scrapegraphai.com/blog/scrapegraphai-cognee">
    Read the full tutorial on the ScrapeGraphAI website
  </Card>
</CardGroup>
