The `cognee-community-tasks-scrapegraph` package provides two async tasks: `scrape_urls` for extraction and `scrape_and_add` for end-to-end scrape-to-graph ingestion.
Why Use This Integration
- Prompt-Based Extraction: Describe what you want in natural language — no CSS selectors or scraper maintenance
- Single-Function Pipeline: `scrape_and_add` scrapes, ingests via `cognee.add`, and runs `cognify` in one call
- Structured Output: Optionally pass a Pydantic schema for domain-specific extraction
- Source Attribution: Each scraped page is tagged with its origin URL in the knowledge graph
- JavaScript Rendering: Handles JS-rendered pages and common bot protection
Installation
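A typical install, assuming the package is published on PyPI under its repository name (check the project's README for the exact distribution name):

```shell
pip install cognee-community-tasks-scrapegraph
```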
Requirements
You need two API keys:

| Variable | Description |
|---|---|
| `LLM_API_KEY` | OpenAI (or other LLM provider) API key used by cognee |
| `SGAI_API_KEY` | ScrapeGraphAI API key |
See LLM Providers and Embedding Providers if you want to use a provider other than OpenAI.
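Both keys are read from the environment, for example:

```shell
export LLM_API_KEY="..."
export SGAI_API_KEY="..."
```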
Quick Start
1. Scrape and Inspect
Use `scrape_urls` to verify what ScrapeGraphAI extracts before building a graph:
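A minimal sketch — the import path `cognee_community_tasks_scrapegraph` and the `urls`/`user_prompt` parameters are assumptions; verify both against the package's README:

```python
import asyncio

URLS = ["https://example.com/pricing"]

async def main():
    # Hypothetical import path -- confirm against the package's README.
    from cognee_community_tasks_scrapegraph import scrape_urls

    # Requires SGAI_API_KEY in the environment.
    results = await scrape_urls(
        urls=URLS,
        user_prompt="Extract product names, prices, and feature lists",
    )
    # Inspect the raw extraction before committing anything to a graph.
    for url, content in zip(URLS, results):
        print(url, "->", content)

if __name__ == "__main__":
    asyncio.run(main())
```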
`user_prompt` tells ScrapeGraphAI what to focus on when extracting content from each page.
2. Build the Knowledge Graph
Use `scrape_and_add` to scrape, ingest, and cognify in one call:
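A sketch of a full run. The `cognee.prune` calls are cognee's standard reset API; the `scrape_and_add` import path and parameters are assumptions to check against the package's README:

```python
import asyncio

async def main():
    import cognee
    # Hypothetical import path -- confirm against the package's README.
    from cognee_community_tasks_scrapegraph import scrape_and_add

    # Optional: reset local state for a clean build.
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    # Scrape, ingest via cognee.add, and run cognify in one call.
    await scrape_and_add(
        urls=["https://example.com/docs", "https://example.com/pricing"],
        user_prompt="Extract the main concepts and product details",
    )

if __name__ == "__main__":
    asyncio.run(main())
```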
The `prune` calls reset the local database. Skip them when building incrementally on top of an existing graph. See cognee.add and cognify for details on the ingestion pipeline.
Structured Extraction
When you know the shape of the data you need, pass a Pydantic schema to ScrapeGraphAI's `smartscraper` directly. This bypasses the integration's `scrape_urls` and gives you full control over the output structure:
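A sketch using the `scrapegraph-py` SDK. The `Product` schema fields are illustrative, and the response shape (`response["result"]`) should be checked against the SDK's documentation:

```python
import os
from typing import List

from pydantic import BaseModel

class Product(BaseModel):
    """Illustrative schema -- shape it to your own domain."""
    name: str
    price: str
    features: List[str]

def scrape_structured(url: str) -> Product:
    # Requires SGAI_API_KEY in the environment.
    from scrapegraph_py import Client

    client = Client(api_key=os.environ["SGAI_API_KEY"])
    response = client.smartscraper(
        website_url=url,
        user_prompt="Extract the product's name, price, and feature list",
        output_schema=Product,
    )
    # Check the SDK docs for the exact response shape.
    return Product(**response["result"])

if __name__ == "__main__":
    print(scrape_structured("https://example.com/pricing"))
```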
Querying the Graph
Once `cognify` completes, use `cognee.search` to query the graph.
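For example (the `SearchType` import path may vary by cognee version; check the cognee docs):

```python
import asyncio

QUERY = "Which product is best for enterprise use cases?"

async def main():
    import cognee
    # Import path may differ across cognee versions.
    from cognee import SearchType

    results = await cognee.search(
        query_type=SearchType.GRAPH_COMPLETION,
        query_text=QUERY,
    )
    for result in results:
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```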
See Search for all available search types and parameters.
Use Cases
Competitive Intelligence
Scrape competitor product and pricing pages, build a knowledge graph, then query across all of them:
- Gather competitor URLs (product pages, pricing, docs)
- Use `scrape_and_add` with a prompt focused on pricing, features, and positioning
- Query with synthesis questions like “Which product is best for enterprise use cases?”
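The steps above might look like the following — assuming `scrape_and_add` forwards a `datasets` argument to `cognee.add` (confirm the task signature in the package's README):

```python
import asyncio

COMPETITOR_URLS = [
    "https://competitor-a.example.com/pricing",
    "https://competitor-b.example.com/pricing",
]

async def main():
    # Hypothetical import path -- confirm against the package's README.
    from cognee_community_tasks_scrapegraph import scrape_and_add

    await scrape_and_add(
        urls=COMPETITOR_URLS,
        user_prompt="Extract pricing tiers, key features, and market positioning",
        datasets=["competitive_intel"],  # keep this data in its own dataset
    )

if __name__ == "__main__":
    asyncio.run(main())
```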
Pass `datasets=["competitive_intel"]` to keep this data in its own dataset.
News Monitoring
Scrape news sources on a schedule and add to the existing graph incrementally:
- Set up a list of news/blog URLs
- Run `scrape_and_add` daily (skip the `prune` calls to accumulate data)
- Query across the full timeline: “What are the biggest trends this week?”
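An incremental run might look like this — no `prune` calls, so each invocation adds to the existing graph (import path and parameters are assumptions, as above):

```python
import asyncio

NEWS_URLS = [
    "https://news.example.com/tech",
    "https://blog.example.com/feed",
]

async def daily_run():
    # Hypothetical import path -- confirm against the package's README.
    from cognee_community_tasks_scrapegraph import scrape_and_add

    # No prune calls: each run accumulates on top of the existing graph.
    await scrape_and_add(
        urls=NEWS_URLS,
        user_prompt="Extract headlines, key events, and named entities",
    )

if __name__ == "__main__":
    # Trigger once a day from cron, a scheduler, or a CI job.
    asyncio.run(daily_run())
```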
Research Aggregation
Collect and correlate information from many sources:
- Scrape documentation, blog posts, and GitHub READMEs for a topic
- Build the graph with `scrape_and_add`
- Ask cross-source questions: “How does library X compare to library Y?”
Product Landscape Analysis
Map out an entire product category:
- Scrape product pages with a schema targeting name, features, pricing, and audience
- Ingest into cognee and cognify
- Query for patterns: “Which products target developers?” or “What pricing models are most common?”