
# DocumentIndexer

> Standalone document indexing class for advanced use cases.

## Overview

`DocumentIndexer` is the underlying class that powers `LangChat.index()`. Use it directly only when you need to index documents outside of a `LangChat` instance — for example, in a standalone indexing script that doesn't start the full chatbot.

For most use cases, use `lc.index()` instead.

```python theme={null}
from langchat.core.utils.document_indexer import DocumentIndexer
```

***

## Constructor

```python theme={null}
DocumentIndexer(
    pinecone_api_key: str,
    pinecone_index_name: str,
    openai_api_key: str,
    embedding_model: str = "text-embedding-3-large",
)
```

<ParamField path="pinecone_api_key" type="str" required>
  Pinecone API key.
</ParamField>

<ParamField path="pinecone_index_name" type="str" required>
  Pinecone index name.
</ParamField>

<ParamField path="openai_api_key" type="str" required>
  OpenAI API key for creating embeddings.
</ParamField>

<ParamField path="embedding_model" type="str" default="text-embedding-3-large">
  OpenAI embedding model.
</ParamField>
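
The constructor takes both API keys explicitly, so a standalone script has to supply them itself. The following is a minimal sketch of one way to do that, assuming the keys live in environment variables named `PINECONE_API_KEY` and `OPENAI_API_KEY`. `require_env` and `build_indexer` are hypothetical helpers written for this illustration; they are not part of LangChat.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named environment variables, failing early if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


def build_indexer(index_name: str):
    """Construct a DocumentIndexer from environment credentials (hypothetical helper)."""
    # Imported inside the function so require_env can be used and tested on its own.
    from langchat.core.utils.document_indexer import DocumentIndexer

    env = require_env("PINECONE_API_KEY", "OPENAI_API_KEY")
    return DocumentIndexer(
        pinecone_api_key=env["PINECONE_API_KEY"],
        pinecone_index_name=index_name,
        openai_api_key=env["OPENAI_API_KEY"],
        # embedding_model is omitted, so the documented default applies.
    )
```

Checking for the variables up front turns a missing key into a clear error message before any indexing starts.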

***

## Methods

### `load_and_index_documents()`

Load a single file, split it into chunks, and index the chunks in Pinecone.

```python theme={null}
def load_and_index_documents(
    self,
    file_path: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: str | None = None,
    prevent_duplicates: bool = True,
) -> dict
```

<ParamField path="file_path" type="str" required>
  Path to the document file.
</ParamField>

<ParamField path="chunk_size" type="int" default="1000">
  Characters per chunk.
</ParamField>

<ParamField path="chunk_overlap" type="int" default="200">
  Overlap between adjacent chunks, in characters.
</ParamField>

<ParamField path="namespace" type="str | None" default="None">
  Pinecone namespace to write the chunks to.
</ParamField>

<ParamField path="prevent_duplicates" type="bool" default="True">
  Skip chunks already in Pinecone (checked by content hash).
</ParamField>

**Returns:** `dict` with `chunks_indexed`, `chunks_skipped`, and metadata.
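
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a rough sketch of a fixed-width character splitter. It is an illustration only: `split_text` is a hypothetical function, and the splitter `DocumentIndexer` actually uses may break on separators such as paragraphs or sentences, so real chunk boundaries can differ.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Under this simplified model, each new chunk advances by `chunk_size - chunk_overlap` characters, so the defaults of 1000 and 200 would turn a 2,600-character text into three chunks. A larger overlap preserves more context across boundaries at the cost of more chunks to embed.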

***

### `load_and_index_multiple_documents()`

Index multiple files.

```python theme={null}
def load_and_index_multiple_documents(
    self,
    file_paths: list[str],
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: str | None = None,
    prevent_duplicates: bool = True,
) -> dict
```

Takes the same parameters as `load_and_index_documents()`, except that `file_paths` (a list of paths) replaces the single `file_path`.
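
When several files share passages, `prevent_duplicates` matters more. The sketch below shows the general idea of deduplication by content hash, under stated assumptions: SHA-256 is chosen only for illustration (the hash the library uses is not documented here), and an in-memory set stands in for the lookup against Pinecone. `content_hash` and `partition_new_chunks` are hypothetical names.

```python
import hashlib


def content_hash(chunk: str) -> str:
    """Return a stable hex digest of the chunk text (SHA-256 chosen for illustration)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def partition_new_chunks(chunks: list[str], seen: set[str]) -> tuple[list[str], int]:
    """Return the chunks whose hash is not in `seen`, plus the number skipped.

    `seen` is updated in place, so one set can be shared across several files.
    """
    fresh, skipped = [], 0
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest in seen:
            skipped += 1
            continue
        seen.add(digest)
        fresh.append(chunk)
    return fresh, skipped


seen: set[str] = set()
first, skipped_first = partition_new_chunks(["intro", "pricing"], seen)
second, skipped_second = partition_new_chunks(["pricing", "support"], seen)
print(len(first) + len(second), skipped_first + skipped_second)  # 3 1
```

Because the hash depends only on the chunk text, changing `chunk_size` or `chunk_overlap` between runs produces different chunks and therefore different hashes, so earlier chunks would not be recognized as duplicates in this model.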

***

## Standalone indexing script

Use `DocumentIndexer` directly when you want to index documents independently of the chatbot:

```python theme={null}
# standalone_indexer.py
import os
from langchat.core.utils.document_indexer import DocumentIndexer
from dotenv import load_dotenv

load_dotenv()

indexer = DocumentIndexer(
    pinecone_api_key=os.environ["PINECONE_API_KEY"],
    pinecone_index_name="my-index",
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# Index multiple documents
result = indexer.load_and_index_multiple_documents(
    file_paths=[
        "docs/manual.pdf",
        "docs/faq.md",
        "docs/policies.txt",
    ],
    chunk_size=1000,
    chunk_overlap=200,
    namespace="main",
    prevent_duplicates=True,
)

print(f"Indexed: {result['chunks_indexed']}")
print(f"Skipped: {result['chunks_skipped']}")
```
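
For larger document sets, listing every path by hand gets tedious. As a sketch, the file list could be built by walking a directory. `collect_paths` is a hypothetical helper, and its default suffixes are only the three extensions that appear in the script above; whether other formats are supported is not stated here.

```python
from pathlib import Path


def collect_paths(root: str, suffixes: tuple[str, ...] = (".pdf", ".md", ".txt")) -> list[str]:
    """Return the sorted paths of files under root whose extension is in suffixes."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")  # recurse into subdirectories
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

The result can be passed directly as `file_paths`, for example `file_paths=collect_paths("docs")`. Sorting keeps the order stable between runs, which makes the output easier to compare.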

<Note>
  `LangChat.index()` is a convenience wrapper around `DocumentIndexer` that reads credentials from environment variables automatically. Prefer it when you already have a `LangChat` instance.
</Note>
