The cognee-community-tasks-scrapegraph package provides two async tasks: scrape_urls for extraction and scrape_and_add for end-to-end scrape-to-graph ingestion.
Why Use This Integration
- Prompt-Based Extraction: Describe what you want in natural language — no CSS selectors or scraper maintenance
- Single Function Pipeline: scrape_and_add scrapes, ingests, and builds the graph in one call
- Structured Output: Optionally pass a Pydantic schema for domain-specific extraction
- Source Attribution: Each scraped page is tagged with its origin URL in the knowledge graph
- JavaScript Rendering: Handles JS-rendered pages and common bot protection
Installation
Requirements
You need two API keys:

| Variable | Description |
|---|---|
| LLM_API_KEY | OpenAI (or other LLM provider) API key used by cognee |
| SGAI_API_KEY | ScrapeGraphAI API key |
See LLM Providers and Embedding Providers if you want to use a provider other than OpenAI.
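Because both keys are read from the environment, it can help to fail fast when one is missing. The following is a minimal sketch using only the standard library; the helper name `missing_api_keys` is hypothetical and not part of cognee or the integration.

```python
import os

# The two variable names come from the requirements table above.
REQUIRED_KEYS = ("LLM_API_KEY", "SGAI_API_KEY")


def missing_api_keys(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Calling `missing_api_keys()` at the top of a script and aborting when the list is non-empty gives a clearer error than a failed request later on.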
Quick Start
1. Scrape and Inspect
Use scrape_urls to verify what ScrapeGraphAI extracts before building a graph:
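A minimal sketch of such an inspection run follows. It rests on assumptions: the import path `cognee_community_tasks_scrapegraph` is inferred from the package name, and the task is assumed to accept `urls` and `user_prompt` keyword arguments. The `scrape` parameter is a hypothetical hook for substituting a stub during testing, and the example URL and prompt are placeholders.

```python
import asyncio


async def inspect_pages(urls, user_prompt, scrape=None):
    """Scrape the given URLs and return the raw results without building a graph."""
    if scrape is None:
        # Assumed import path, inferred from the package name.
        from cognee_community_tasks_scrapegraph import scrape_urls as scrape
    # Assumption: the task accepts `urls` and `user_prompt` keyword arguments.
    return await scrape(urls=urls, user_prompt=user_prompt)


def main():
    # Needs SGAI_API_KEY and network access; defined here but not called.
    results = asyncio.run(
        inspect_pages(["https://example.com"], "Extract the page title and main points")
    )
    print(results)
```

The shape of the returned results is not assumed here; printing them is the quickest way to see what each page produced.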
user_prompt tells ScrapeGraphAI what to focus on when extracting content from each page.
2. Build the Knowledge Graph
Use scrape_and_add to scrape, ingest, and build the graph in one call:
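The call might look like the sketch below. It assumes the same inferred import path and `urls` / `user_prompt` keyword arguments as for scrape_urls; the `reset` flag and the `ingest` hook are hypothetical conveniences, not part of the integration. The two prune calls are cognee's `cognee.prune.prune_data()` and `cognee.prune.prune_system(metadata=True)`.

```python
import asyncio


async def build_graph(urls, user_prompt, reset=False, ingest=None):
    """Scrape, ingest, and build the graph; optionally wipe local state first."""
    if ingest is None:
        # Assumed import path, inferred from the package name.
        from cognee_community_tasks_scrapegraph import scrape_and_add as ingest
    if reset:
        import cognee

        # Reset the local database before ingesting.
        await cognee.prune.prune_data()
        await cognee.prune.prune_system(metadata=True)
    # Assumption: the task accepts `urls` and `user_prompt` keyword arguments.
    return await ingest(urls=urls, user_prompt=user_prompt)


def main():
    # Needs both API keys and network access; defined here but not called.
    asyncio.run(
        build_graph(["https://example.com"], "Extract the main content", reset=True)
    )
```
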
The prune calls reset the local database. Skip them when building incrementally on top of an existing graph.
Structured Extraction
When you know the shape of the data you need, pass a Pydantic schema to ScrapeGraphAI’s smartscraper directly. This bypasses the integration’s scrape_urls and gives you full control over the output structure:
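A sketch of this approach is shown below, under several assumptions: that the `scrapegraph_py` client exposes `smartscraper` with `website_url`, `user_prompt`, and `output_schema` parameters, and that the response is a dict whose `"result"` entry matches the schema. The `ProductInfo` schema, the prompt, and the `client` hook are hypothetical and exist only for illustration.

```python
import os
from typing import List

from pydantic import BaseModel


class ProductInfo(BaseModel):
    # Hypothetical schema; replace the fields with the shape you need.
    name: str
    pricing: str
    features: List[str] = []


def extract_product(url, client=None):
    """Ask ScrapeGraphAI for data matching ProductInfo and validate the reply."""
    if client is None:
        from scrapegraph_py import Client

        client = Client(api_key=os.environ["SGAI_API_KEY"])
    # Assumption: smartscraper accepts these keyword arguments.
    response = client.smartscraper(
        website_url=url,
        user_prompt="Extract the product name, pricing, and key features",
        output_schema=ProductInfo,
    )
    # Assumption: the extracted fields live under the "result" key.
    return ProductInfo(**response["result"])
```

Validating the reply against the same schema surfaces missing or mistyped fields before the data is ingested into cognee.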
Querying the Graph
Once the graph is built, use cognee.recall(...) to query it.
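As a minimal sketch, a query helper could look like this. The arguments to `cognee.recall` are elided in the text above, so passing the question as the first positional argument is an assumption; check the cognee documentation for the actual signature. The `recall` parameter is a hypothetical hook for substituting a stub.

```python
import asyncio


async def ask(question, recall=None):
    """Send a natural-language question to the graph and return the answer."""
    if recall is None:
        import cognee

        recall = cognee.recall
    # Assumption: recall takes the question as its first positional argument.
    return await recall(question)


def main():
    # Needs a built graph and LLM_API_KEY; defined here but not called.
    print(asyncio.run(ask("What does this product do?")))
```
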
Use Cases
Competitive Intelligence
Scrape competitor product and pricing pages, build a knowledge graph, then query across all of them:
- Gather competitor URLs (product pages, pricing, docs)
- Use scrape_and_add with a prompt focused on pricing, features, and positioning
- Query with synthesis questions like “Which product is best for enterprise use cases?”
datasets=["competitive_intel"].
News Monitoring
Scrape news sources on a schedule and add to the existing graph incrementally:
- Set up a list of news/blog URLs
- Run scrape_and_add daily (skip the prune calls to accumulate data)
- Query across the full timeline: “What are the biggest trends this week?”
Research Aggregation
Collect and correlate information from many sources:
- Scrape documentation, blog posts, and GitHub READMEs for a topic
- Build the graph with scrape_and_add
- Ask cross-source questions: “How does library X compare to library Y?”
Product Landscape Analysis
Map out an entire product category:
- Scrape product pages with a schema targeting name, features, pricing, and audience
- Ingest into cognee
- Query for patterns: “Which products target developers?” or “What pricing models are most common?”
GitHub Repository
View source code and examples
Blog Post
Read full tutorial on ScrapeGraph website