# Multilingual Ingestion

> Translate non-English content before building the knowledge graph

A minimal guide to enabling translation during ingestion. Cognee includes a built-in translation pipeline that detects the language of each chunk and translates it before graph extraction, so documents in other languages are indexed in the target language (English by default).

**Before you start:**

* Complete [Quickstart](/getting-started/quickstart) to understand basic operations
* Ensure you have [LLM Providers](/setup-configuration/llm-providers) configured
* Have non-English text or documents to process

## What Translation Does

* Detects language automatically using the `langdetect` library
* Skips chunks already in the target language
* Translates using one of three providers: `llm` (default), `google`, or `azure`
* Stores original text alongside the translation in the knowledge graph

## Configuration

Set these environment variables in your `.env` file:

```dotenv theme={null}
# Provider: "llm" (default), "google", or "azure"
TRANSLATION_PROVIDER=llm

# Target language ISO 639-1 code (default: "en")
TARGET_LANGUAGE=en

# Minimum detection confidence to trigger translation (default: 0.8)
CONFIDENCE_THRESHOLD=0.8
```

The `llm` provider uses your existing [LLM configuration](/setup-configuration/llm-providers), so no additional keys are needed.
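
As a rough illustration of how these three settings fit together, the sketch below reads them from a mapping with the defaults listed above and rejects unknown providers. `TranslationSettings` and `load_translation_settings` are hypothetical names for this example, not part of Cognee's API, and the range check on the threshold is an assumption; Cognee reads these variables itself.

```python
import os
from dataclasses import dataclass

# Hypothetical helper for illustration only; Cognee reads these variables itself.
VALID_PROVIDERS = {"llm", "google", "azure"}


@dataclass(frozen=True)
class TranslationSettings:
    provider: str = "llm"
    target_language: str = "en"
    confidence_threshold: float = 0.8


def load_translation_settings(env=None) -> TranslationSettings:
    """Read the translation variables from a mapping, applying the documented defaults."""
    env = os.environ if env is None else env
    provider = env.get("TRANSLATION_PROVIDER", "llm").strip().lower()
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"Unknown TRANSLATION_PROVIDER: {provider!r}")
    threshold = float(env.get("CONFIDENCE_THRESHOLD", "0.8"))
    # Assumption: the threshold is a probability, so it must lie in [0, 1].
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("CONFIDENCE_THRESHOLD must be between 0 and 1")
    return TranslationSettings(
        provider=provider,
        target_language=env.get("TARGET_LANGUAGE", "en").strip(),
        confidence_threshold=threshold,
    )
```

Calling `load_translation_settings()` with no argument reads from `os.environ`; passing a plain dict makes the helper easy to exercise in tests.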

## Using Translation in a Pipeline

Insert `translate_content` as a pipeline task between chunk extraction and graph building:

```python theme={null}
import asyncio
import os
import cognee
from cognee.infrastructure.llm import get_max_chunk_tokens
from cognee.tasks.documents import classify_documents, extract_chunks_from_documents
from cognee.shared.data_models import KnowledgeGraph
from cognee.tasks.translation import translate_content
from cognee.modules.pipelines import Task, run_pipeline
from cognee.tasks.graph import extract_graph_from_data
from cognee.tasks.storage import add_data_points

# The translated text is already in chunk.text, so clear the attached
# translation metadata before graph extraction.
async def drop_translation_metadata(data_chunks):
    for chunk in data_chunks:
        chunk.contains = None
    return data_chunks


async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
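    text_fr = "La mémoire artificielle permet aux agents IA de retenir des informations complexes."
    # Register the sample text under the dataset the pipeline reads from.
    await cognee.add(text_fr, dataset_name="multilingual")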

    tasks = [
        Task(classify_documents),
        Task(extract_chunks_from_documents, max_chunk_size=get_max_chunk_tokens()),
        Task(translate_content, target_language="en", translation_provider="llm"),
        Task(drop_translation_metadata),
        Task(extract_graph_from_data, graph_model=KnowledgeGraph),
        Task(add_data_points),
    ]

    async for _ in run_pipeline(tasks=tasks, datasets=["multilingual"]):
        pass

    visualize_graph_path = os.path.join(
        os.path.dirname(__file__), ".artifacts", "multilingual.html"
    )
    await cognee.visualize_graph(visualize_graph_path)

asyncio.run(main())
```

<Note>
  `translate_content` mutates chunks in-place: `chunk.text` is replaced with the translation and the original is preserved in a `TranslatedContent` data point attached to the chunk.
</Note>

## Additional Information

<AccordionGroup>
  <Accordion title="How Language Detection Works">
    Cognee detects language **per chunk** with the [`langdetect`](https://pypi.org/project/langdetect/) library. Each chunk produced by the chunker is analyzed independently, so a document that mixes languages has every chunk detected — and translated — on its own.

    A chunk is translated only when **both** conditions hold: the detected language differs from `TARGET_LANGUAGE`, and the detection confidence is at least `CONFIDENCE_THRESHOLD` (default `0.8`). Otherwise the chunk is left untouched and only tagged with `LanguageMetadata`. Chunks shorter than 10 characters skip detection entirely.
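
    The two conditions and the length cutoff can be read as a small predicate. The sketch below illustrates the rule as described here and is not Cognee's implementation: `should_translate` and its parameter names are hypothetical, and measuring chunk length with `len(text)` is an assumption.

    ```python
    def should_translate(text, detected_language, confidence,
                         target_language="en", confidence_threshold=0.8,
                         min_length=10):
        """Return True when a chunk meets the translation conditions described above."""
        if len(text) < min_length:
            return False  # too short: detection is skipped entirely
        if detected_language == target_language:
            return False  # already in the target language
        # Translate only when detection is confident enough.
        return confidence >= confidence_threshold
    ```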

    `langdetect` recognizes 55 languages:

    `af`, `ar`, `bg`, `bn`, `ca`, `cs`, `cy`, `da`, `de`, `el`, `en`, `es`, `et`, `fa`, `fi`, `fr`, `gu`, `he`, `hi`, `hr`, `hu`, `id`, `it`, `ja`, `kn`, `ko`, `lt`, `lv`, `mk`, `ml`, `mr`, `ne`, `nl`, `no`, `pa`, `pl`, `pt`, `ro`, `ru`, `sk`, `sl`, `so`, `sq`, `sv`, `sw`, `ta`, `te`, `th`, `tl`, `tr`, `uk`, `ur`, `vi`, `zh-cn`, `zh-tw`

    Languages outside this set, for example Azerbaijani (`az`), cannot be detected. `langdetect` either misclassifies them as a related language (Azerbaijani is often read as Turkish, `tr`) or returns low confidence, so such chunks may be skipped or translated from the wrong source language. Translation is driven by the *detected* code, not the document's true language, so verify detection coverage before relying on the pipeline for an unsupported language.
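
    One way to check coverage up front is to compare the languages you expect in your corpus against the list above. The snippet is a minimal sketch; `LANGDETECT_CODES` simply copies that list, and `undetectable` is a hypothetical helper, not a Cognee or `langdetect` function.

    ```python
    # The 55 codes listed above, copied verbatim.
    LANGDETECT_CODES = frozenset(
        "af ar bg bn ca cs cy da de el en es et fa fi fr gu he hi hr hu id it ja "
        "kn ko lt lv mk ml mr ne nl no pa pl pt ro ru sk sl so sq sv sw ta te th "
        "tl tr uk ur vi zh-cn zh-tw".split()
    )


    def undetectable(expected_codes):
        """Return the expected language codes that fall outside the list above."""
        return sorted(
            code for code in expected_codes if code.lower() not in LANGDETECT_CODES
        )
    ```

    For a corpus expected to contain French, Azerbaijani, and Turkish, `undetectable(["fr", "az", "tr"])` flags only `az`.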
  </Accordion>

  <Accordion title="Translating Individual Strings">
    For one-off translation without a pipeline, use `translate_text`:

    ```python theme={null}
    import asyncio
    from cognee.tasks.translation import translate_text

    async def main():
        result = await translate_text("Bonjour le monde!", target_language="en")
        print(result.translated_text)   # "Hello world!"
        print(result.source_language)   # "fr"

    asyncio.run(main())
    ```
  </Accordion>

  <Accordion title="Choosing a Provider">
    All three providers translate chunks that are not already in your `TARGET_LANGUAGE`. Pick based on cost, setup, and quality trade-offs:

    | Provider        | Setup                                                               | Cost and trade-offs                           | Best for                                                          |
    | --------------- | ------------------------------------------------------------------- | --------------------------------------------- | ----------------------------------------------------------------- |
    | `llm` (default) | None — reuses your [LLM config](/setup-configuration/llm-providers) | Per-token LLM usage; higher quality, slower   | Mixed/long-form documents where context-aware translation matters |
    | `google`        | Install `google-cloud-translate`, Google Cloud project              | Per-character pricing; fast batch translation | High-volume ingestion across many languages                       |
    | `azure`         | Azure Cognitive Services key + region                               | Per-character pricing; fast batch translation | Enterprise deployments already on Azure                           |

    **Supported languages:** detection uses `langdetect` (55 languages). The `llm` provider supports any language the underlying model handles. Google Translate and Azure Translator each support 130+ language codes, including locale-specific variants such as `zh-CN` and `zh-TW`; see the [Google Cloud Translation language list](https://cloud.google.com/translate/docs/languages) and [Azure Translator language list](https://learn.microsoft.com/azure/ai-services/translator/language-support) for the full set.

    Set `TRANSLATION_PROVIDER` in `.env` to switch — no code changes required.
  </Accordion>

  <Accordion title="Provider-Specific Setup">
    <AccordionGroup>
      <Accordion title="LLM Provider (default)" defaultOpen>
        Uses your existing LLM — no extra configuration needed. Works with any provider configured via `LLM_PROVIDER` and `LLM_API_KEY`.

        ```dotenv theme={null}
        TRANSLATION_PROVIDER=llm
        ```
      </Accordion>

      <Accordion title="Google Cloud Translation">
        Requires the `google-cloud-translate` package and a Google Cloud project.

        ```bash theme={null}
        pip install google-cloud-translate
        ```

        ```dotenv theme={null}
        TRANSLATION_PROVIDER=google
        GOOGLE_TRANSLATE_API_KEY=your_api_key
        GOOGLE_PROJECT_ID=your_project_id
        ```
      </Accordion>

      <Accordion title="Azure Translator">
        Requires an Azure Cognitive Services resource.

        ```dotenv theme={null}
        TRANSLATION_PROVIDER=azure
        AZURE_TRANSLATOR_KEY=your_key
        AZURE_TRANSLATOR_REGION=eastus
        # Endpoint defaults to https://api.cognitive.microsofttranslator.com
        AZURE_TRANSLATOR_ENDPOINT=https://api.cognitive.microsofttranslator.com
        ```
      </Accordion>
    </AccordionGroup>
  </Accordion>

  <Accordion title="Advanced Options">
    | Variable                      | Default | Description                  |
    | ----------------------------- | ------- | ---------------------------- |
    | `TRANSLATION_BATCH_SIZE`      | `10`    | Chunks per translation batch |
    | `TRANSLATION_MAX_RETRIES`     | `3`     | Retry attempts on failure    |
    | `TRANSLATION_TIMEOUT_SECONDS` | `30`    | Request timeout              |
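
    To make the three settings concrete, here is a minimal sketch of how batching, retries, and a request timeout typically combine. It is not Cognee's implementation: `translate_in_batches` and the `translate_batch` callable are hypothetical, and the sketch assumes that `TRANSLATION_MAX_RETRIES` counts retries after the first attempt and that the timeout applies to each batch request.

    ```python
    import asyncio


    async def translate_in_batches(chunks, translate_batch, batch_size=10,
                                   max_retries=3, timeout_seconds=30):
        """Translate chunks batch by batch, retrying a failed or timed-out batch."""
        results = []
        for start in range(0, len(chunks), batch_size):
            batch = chunks[start:start + batch_size]
            for attempt in range(max_retries + 1):  # first try plus retries
                try:
                    translated = await asyncio.wait_for(
                        translate_batch(batch), timeout=timeout_seconds
                    )
                    results.extend(translated)
                    break
                except (asyncio.TimeoutError, ConnectionError):
                    if attempt == max_retries:
                        raise  # out of attempts: surface the last error
        return results
    ```

    A larger batch size means fewer requests but more work lost when a single batch fails, which is why the retry count and timeout are usually tuned together with it.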
  </Accordion>
</AccordionGroup>

<Columns cols={3}>
  <Card title="Custom Pipelines" icon="workflow" href="/guides/custom-tasks-pipelines">
    Learn to build custom task pipelines
  </Card>

  <Card title="LLM Providers" icon="cpu" href="/setup-configuration/llm-providers">
    Configure your LLM provider
  </Card>

  <Card title="Core Concepts" icon="brain" href="/core-concepts/overview">
    Understand knowledge graph fundamentals
  </Card>
</Columns>