Multilingual Ingestion - Cognee Documentation

A minimal guide to enabling translation during ingestion. Cognee includes a built-in translation pipeline that detects languages and translates content before graph extraction, so non-English documents are indexed as English knowledge.

Before you start:

Complete Quickstart to understand basic operations
Ensure you have LLM Providers configured
Have non-English text or documents to process

What Translation Does

Detects language automatically using the langdetect library
Skips chunks already in the target language
Translates using one of three providers: llm (default), google, or azure
Stores original text alongside the translation in the knowledge graph

Configuration

Set these environment variables in your .env file:

# Provider: "llm" (default), "google", or "azure"
TRANSLATION_PROVIDER=llm

# Target language ISO 639-1 code (default: "en")
TARGET_LANGUAGE=en

# Minimum detection confidence to trigger translation (default: 0.8)
CONFIDENCE_THRESHOLD=0.8

The llm provider uses your existing LLM configuration — no additional keys needed.
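To make the defaults concrete, here is a minimal sketch of how these three variables could be read and validated. It is not Cognee's own configuration loader: `TranslationSettings` and `load_translation_settings` are hypothetical names, and the range check on the threshold is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Providers named in this guide.
VALID_PROVIDERS = {"llm", "google", "azure"}


@dataclass(frozen=True)
class TranslationSettings:
    """Hypothetical container mirroring the documented defaults."""
    provider: str = "llm"
    target_language: str = "en"
    confidence_threshold: float = 0.8


def load_translation_settings(
    env: Optional[Mapping[str, str]] = None,
) -> TranslationSettings:
    """Read the three variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env

    provider = env.get("TRANSLATION_PROVIDER", "llm").strip().lower()
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"Unknown TRANSLATION_PROVIDER: {provider!r}")

    threshold = float(env.get("CONFIDENCE_THRESHOLD", "0.8"))
    # Assumption: a confidence is a probability, so it must lie in [0, 1].
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("CONFIDENCE_THRESHOLD must be between 0 and 1")

    target = env.get("TARGET_LANGUAGE", "en").strip()
    return TranslationSettings(provider, target, threshold)
```

Calling `load_translation_settings()` with no arguments reads the process environment; passing a plain dict is convenient for checking a configuration before a long ingestion run.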

Using Translation in a Pipeline

Insert translate_content as a pipeline task between chunk extraction and graph building:

import asyncio
import os
import cognee
from cognee.infrastructure.llm import get_max_chunk_tokens
from cognee.tasks.documents import classify_documents, extract_chunks_from_documents
from cognee.shared.data_models import KnowledgeGraph
from cognee.tasks.translation import translate_content
from cognee.modules.pipelines import Task, run_pipeline
from cognee.tasks.graph import extract_graph_from_data
from cognee.tasks.storage import add_data_points

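# The translated text is already in data_chunks[].text, so this task clears
# the translation metadata attached to each chunk before graph extraction.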
async def drop_translation_metadata(data_chunks):
    for chunk in data_chunks:
        chunk.contains = None
    return data_chunks


async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
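    text_fr = "La mémoire artificielle permet aux agents IA de retenir des informations complexes."
    # Add the sample text to the dataset that the pipeline below processes.
    await cognee.add(text_fr, dataset_name="multilingual")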

    tasks = [
        Task(classify_documents),
        Task(extract_chunks_from_documents, max_chunk_size=get_max_chunk_tokens()),
        Task(translate_content, target_language="en", translation_provider="llm"),
        Task(drop_translation_metadata),
        Task(extract_graph_from_data, graph_model=KnowledgeGraph),
        Task(add_data_points),
    ]

    async for _ in run_pipeline(tasks=tasks, datasets=["multilingual"]):
        pass

    visualize_graph_path = os.path.join(
        os.path.dirname(__file__), ".artifacts", "multilingual.html"
    )
    await cognee.visualize_graph(visualize_graph_path)

asyncio.run(main())

translate_content mutates chunks in-place: chunk.text is replaced with the translation and the original is preserved in a TranslatedContent data point attached to the chunk.
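The in-place behavior described above can be pictured with a small stand-in model. This is only a sketch: `ChunkStub`, `TranslatedContentStub`, and `apply_translation` are hypothetical names, not Cognee classes, and storing the record in `contains` is an assumption inferred from the `drop_translation_metadata` task in the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranslatedContentStub:
    """Hypothetical stand-in for the record that keeps the original text."""
    original_text: str
    source_language: str
    target_language: str


@dataclass
class ChunkStub:
    """Hypothetical stand-in for a document chunk."""
    text: str
    contains: Optional[List[object]] = None


def apply_translation(
    chunk: ChunkStub,
    translated_text: str,
    source_language: str,
    target_language: str = "en",
) -> ChunkStub:
    """Replace chunk.text in place and attach the original as metadata."""
    record = TranslatedContentStub(chunk.text, source_language, target_language)
    chunk.text = translated_text
    chunk.contains = (chunk.contains or []) + [record]
    return chunk  # same object, mutated


chunk = ChunkStub("Bonjour le monde!")
apply_translation(chunk, "Hello world!", source_language="fr")
print(chunk.text)                       # Hello world!
print(chunk.contains[0].original_text)  # Bonjour le monde!
```

Under this model, a task that sets `chunk.contains = None` keeps the translated text but discards the attached original.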

Additional Information

How Language Detection Works

Cognee detects language per chunk with the langdetect library. Each chunk produced by the chunker is analyzed independently, so in a document that mixes languages every chunk is detected, and translated, on its own.

A chunk is translated only when both conditions hold: the detected language differs from TARGET_LANGUAGE, and the detection confidence is at least CONFIDENCE_THRESHOLD (default 0.8). Otherwise the chunk is left untouched and only tagged with LanguageMetadata. Chunks shorter than 10 characters skip detection entirely.

langdetect recognizes 55 languages:

af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw

Languages outside this set, for example Azerbaijani (az), cannot be detected. langdetect either misclassifies them as a related language (Azerbaijani is often read as Turkish, tr) or returns low confidence, so such chunks may be skipped or translated from the wrong source language. Translation is driven by the detected code, not the document's true language, so verify coverage before relying on it for an unsupported language.
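The decision rule above can be sketched in a few lines. This is an illustration, not Cognee's implementation: `Detection`, `detect_language`, and `should_translate` are hypothetical names, and measuring the 10-character cutoff with `len(text)` is an assumption. The langdetect calls (`detect_langs`, `DetectorFactory.seed`, `LangDetectException`) are that library's public API.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DETECTION_LENGTH = 10  # chunks shorter than this skip detection


@dataclass
class Detection:
    language: str      # langdetect code, e.g. "fr" or "zh-cn"
    confidence: float  # probability of the top candidate


def detect_language(text: str) -> Optional[Detection]:
    """Return the top langdetect guess, or None if detection is skipped."""
    if len(text) < MIN_DETECTION_LENGTH:
        return None

    # Imported lazily so the decision logic below works without langdetect.
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed
    try:
        best = detect_langs(text)[0]  # candidates are sorted by probability
    except LangDetectException:
        return None  # e.g. text with no usable features (digits, symbols)
    return Detection(best.lang, best.prob)


def should_translate(
    detection: Optional[Detection],
    target_language: str = "en",
    threshold: float = 0.8,
) -> bool:
    """Translate only if the language differs AND confidence is high enough."""
    if detection is None:
        return False
    return (
        detection.language != target_language
        and detection.confidence >= threshold
    )
```

Running `detect_language` on a sample of your own documents is a quick way to see whether a language outside the supported set is being misread as a neighbor before you ingest a full corpus.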

Translating Individual Strings

For one-off translation without a pipeline, use translate_text:

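import asyncio
from cognee.tasks.translation import translate_text

async def main():
    result = await translate_text("Bonjour le monde!", target_language="en")
    print(result.translated_text)   # e.g. "Hello world!"
    print(result.source_language)   # "fr"

# translate_text is a coroutine, so it needs an event loop when run as a script.
asyncio.run(main())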

Choosing a Provider

All three providers translate chunks that are not already in your TARGET_LANGUAGE. Pick based on cost, setup, and quality trade-offs:

Provider	Setup	Cost and performance	Best for
`llm` (default)	None — reuses your LLM config	Per-token LLM usage; higher quality, slower	Mixed/long-form documents where context-aware translation matters
`google`	Install `google-cloud-translate`, Google Cloud project	Per-character pricing; fast batch translation	High-volume ingestion across many languages
`azure`	Azure Cognitive Services key + region	Per-character pricing; fast batch translation	Enterprise deployments already on Azure

Supported languages: detection uses langdetect (~55 languages). The llm provider supports any language the underlying model handles. Google Translate and Azure Translator each support 130+ language codes, including locale-specific variants such as zh-CN and zh-TW. See the Google Cloud Translation language list and Azure Translator language list for the full set.

Set TRANSLATION_PROVIDER in .env to switch providers; no code changes are required.

Provider-Specific Setup

LLM Provider (default)

Uses your existing LLM — no extra configuration needed. Works with any provider configured via LLM_PROVIDER and LLM_API_KEY.

TRANSLATION_PROVIDER=llm

Google Cloud Translation

Requires the google-cloud-translate package and a Google Cloud project.

pip install google-cloud-translate

TRANSLATION_PROVIDER=google
GOOGLE_TRANSLATE_API_KEY=your_api_key
GOOGLE_PROJECT_ID=your_project_id

Azure Translator

Requires an Azure Cognitive Services resource.

TRANSLATION_PROVIDER=azure
AZURE_TRANSLATOR_KEY=your_key
AZURE_TRANSLATOR_REGION=eastus
# Endpoint defaults to https://api.cognitive.microsofttranslator.com
AZURE_TRANSLATOR_ENDPOINT=https://api.cognitive.microsofttranslator.com
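Before running a full ingestion, it can help to confirm that the key, region, and endpoint work at all. The sketch below builds a request for the public Azure Translator v3 REST API using only the standard library. It is independent of how Cognee calls Azure internally, and `build_translate_request` and `send` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, Tuple

DEFAULT_ENDPOINT = "https://api.cognitive.microsofttranslator.com"


def build_translate_request(
    text: str,
    target_language: str,
    key: str,
    region: str,
    endpoint: str = DEFAULT_ENDPOINT,
) -> Tuple[str, Dict[str, str], bytes]:
    """Build URL, headers, and JSON body for a v3 /translate call."""
    query = urllib.parse.urlencode({"api-version": "3.0", "to": target_language})
    url = f"{endpoint.rstrip('/')}/translate?{query}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
    }
    body = json.dumps([{"Text": text}]).encode("utf-8")
    return url, headers, body


def send(url: str, headers: Dict[str, str], body: bytes, timeout: float = 30.0):
    """POST the request and return the decoded JSON response."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To try it, pass the values of AZURE_TRANSLATOR_KEY and AZURE_TRANSLATOR_REGION to `build_translate_request` and hand the result to `send`. A successful response is a JSON list whose first item carries a `translations` array; an authentication error at this stage points at the key or region rather than at Cognee.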

Advanced Options

Variable	Default	Description
`TRANSLATION_BATCH_SIZE`	`10`	Chunks per translation batch
`TRANSLATION_MAX_RETRIES`	`3`	Retry attempts on failure
`TRANSLATION_TIMEOUT_SECONDS`	`30`	Request timeout
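The three options interact roughly as in the sketch below: texts are grouped into batches, each batch call is bounded by a timeout, and a failed batch is retried. This is a generic illustration, not Cognee's code. `translate_in_batches` and `translate_batch` are hypothetical names, the linear backoff is an added assumption, and the retry setting is assumed to count total attempts per batch.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], Awaitable[List[str]]],
    batch_size: int = 10,          # TRANSLATION_BATCH_SIZE
    max_attempts: int = 3,         # TRANSLATION_MAX_RETRIES (assumed: total attempts)
    timeout_seconds: float = 30.0, # TRANSLATION_TIMEOUT_SECONDS
    backoff_seconds: float = 0.5,  # assumption: not a documented option
) -> List[str]:
    """Translate texts batch by batch, retrying transient failures."""
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        for attempt in range(1, max_attempts + 1):
            try:
                translated = await asyncio.wait_for(
                    translate_batch(batch), timeout=timeout_seconds
                )
                results.extend(translated)
                break  # batch succeeded, move on to the next one
            except (asyncio.TimeoutError, ConnectionError):
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                await asyncio.sleep(backoff_seconds * attempt)
    return results
```

With the defaults, 25 chunks produce three provider calls (10, 10, and 5 chunks) when nothing fails. A larger batch size means fewer calls but more work lost when one batch times out.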

Custom Pipelines

Learn to build custom task pipelines

LLM Providers

Configure your LLM provider

Core Concepts

Understand knowledge graph fundamentals