# Document Indexing

> Load PDFs, text files, CSVs, and more into your Pinecone vector store.

## The index() method

`LangChat.index()` indexes one file, multiple files, or an entire directory into Pinecone:

```python
# Single file
result = lc.index("docs/manual.pdf")

# Multiple files
result = lc.index(["docs/faq.pdf", "docs/policies.txt", "data/products.csv"])

# Entire directory (recursive)
result = lc.index("docs/")

print(result["chunks_indexed"])  # chunks added
print(result["chunks_skipped"])  # duplicates skipped
```

***

## Supported file formats

LangChat uses [docsuite](https://pypi.org/project/docsuite/) for file loading, which supports:

| Format     | Extension       |
| ---------- | --------------- |
| PDF        | `.pdf`          |
| Plain text | `.txt`          |
| Markdown   | `.md`           |
| CSV        | `.csv`          |
| Word       | `.docx`         |
| PowerPoint | `.pptx`         |
| HTML       | `.html`, `.htm` |
| JSON       | `.json`         |

***

## Chunking

Documents are split into overlapping chunks before indexing. Configure chunk size and overlap:

```python
result = lc.index(
    "docs/",
    chunk_size=1000,     # characters per chunk (default: 1000)
    chunk_overlap=200,   # overlap between adjacent chunks (default: 200)
)
```

**Choosing chunk size:**

| Chunk size | Best for                    | Trade-off                              |
| ---------- | --------------------------- | -------------------------------------- |
| 500–800    | FAQs, short paragraphs      | More chunks, fewer tokens per query    |
| 1000–1500  | Documentation, articles     | Balanced                               |
| 2000+      | Long-form prose, legal text | Fewer chunks, richer context per chunk |

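Overlap reduces the chance that a sentence split at a chunk boundary loses its context, because the text around the boundary appears in both chunks. With a 200-character overlap on 1000-character chunks, roughly the last 200 characters of one chunk are repeated as the first 200 characters of the next.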
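
To make the overlap arithmetic concrete, here is a minimal sketch of fixed-size character chunking. It is an illustration only: `split_with_overlap` is a hypothetical helper, not LangChat's splitter, which may split on separators such as paragraphs or sentences rather than at exact character offsets.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = split_with_overlap(text)
print([len(c) for c in chunks])              # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])   # True
```

With the defaults, each chunk starts 800 characters after the previous one, so a 2500-character document yields three chunks rather than the two or three you might expect without overlap.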

***

## Duplicate prevention

By default, `index()` skips chunks it has already indexed. It detects duplicates by hashing each chunk's content and checking Pinecone metadata:

```python
# Safe to run repeatedly — won't re-index the same content
result = lc.index("docs/")
# On a repeat run over unchanged files, previously indexed chunks are
# counted in chunks_skipped instead of chunks_indexed
```

To force re-indexing (e.g., after updating documents):

```python
result = lc.index("docs/", prevent_duplicates=False)
```
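
The hash-based check described in this section can be pictured with a small sketch. Everything here is an assumption made for illustration: the choice of SHA-256, the helper names `chunk_hash` and `filter_new_chunks`, and the in-memory set that stands in for the hashes LangChat looks up in Pinecone metadata.

```python
import hashlib


def chunk_hash(chunk):
    """Return a stable SHA-256 hex digest of the chunk's text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def filter_new_chunks(chunks, seen_hashes):
    """Return (new_chunks, skipped_count), adding new hashes to seen_hashes."""
    new_chunks = []
    skipped = 0
    for chunk in chunks:
        digest = chunk_hash(chunk)
        if digest in seen_hashes:
            skipped += 1  # identical content was indexed before
            continue
        seen_hashes.add(digest)
        new_chunks.append(chunk)
    return new_chunks, skipped


seen = set()  # stands in for hashes already stored alongside the vectors
first, skipped_first = filter_new_chunks(["alpha", "beta"], seen)
second, skipped_second = filter_new_chunks(["alpha", "beta", "gamma"], seen)
print(len(first), skipped_first)    # 2 0
print(len(second), skipped_second)  # 1 2
```

Because the hash depends only on chunk content, editing a document changes the hashes of the affected chunks, so those chunks are treated as new while untouched chunks are skipped.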

***

## Namespaces

Use Pinecone namespaces to separate document collections:

```python
# Index different document types separately
lc.index("products/", namespace="products")
lc.index("support/", namespace="support")
lc.index("legal/", namespace="legal")
```

Namespaces allow a single Pinecone index to serve multiple use cases.

***

## Full example: build a knowledge base

```python
# build_kb.py
from langchat import LangChat
from langchat.providers import OpenAI, Pinecone, Supabase

LangChat.load_env()

lc = LangChat(
    llm=OpenAI("gpt-4o-mini"),
    vector_db=Pinecone("my-index"),
    db=Supabase(),
)

# Index all documents
paths = [
    "content/product-manual.pdf",
    "content/faq.md",
    "content/pricing.csv",
]

result = lc.index(
    paths,
    chunk_size=800,
    chunk_overlap=150,
    prevent_duplicates=True,
)

print(f"✓ Indexed {result['chunks_indexed']} chunks")
print(f"  Skipped {result['chunks_skipped']} duplicates")
```

Run it once to build the index, then run your chatbot normally.

***

## Return value

`index()` returns a dict with indexing statistics:

| Key               | Type   | Description                            |
| ----------------- | ------ | -------------------------------------- |
| `chunks_indexed`  | `int`  | Number of chunks added to Pinecone     |
| `chunks_skipped`  | `int`  | Number of duplicate chunks skipped     |
| `files_processed` | `int`  | Number of files successfully processed |
| `errors`          | `list` | Any files that failed to load          |
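
A small helper can turn this dict into a log line and surface failures. This is a sketch under stated assumptions: `summarize_index_result` is a hypothetical name, and because the shape of each `errors` entry is not specified here, entries are treated as opaque values and printed as-is.

```python
def summarize_index_result(result):
    """Build a one-line summary and collect failures from an index() result."""
    errors = result.get("errors", [])
    summary = (
        f"{result.get('files_processed', 0)} files processed, "
        f"{result.get('chunks_indexed', 0)} chunks indexed, "
        f"{result.get('chunks_skipped', 0)} skipped, "
        f"{len(errors)} errors"
    )
    return summary, errors


# Hand-written example dict using the keys from the table above.
summary, errors = summarize_index_result(
    {
        "chunks_indexed": 120,
        "chunks_skipped": 30,
        "files_processed": 3,
        "errors": ["docs/broken.pdf"],
    }
)
print(summary)
for entry in errors:
    print("failed:", entry)
```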

***

## Re-indexing after document updates

When documents change, re-index them with `prevent_duplicates=False` so that every chunk is written again, including chunks whose content did not change:

```python
# After updating docs/manual.pdf:
result = lc.index("docs/manual.pdf", prevent_duplicates=False)
```

Alternatively, delete the old vectors from Pinecone (for example, via the Pinecone dashboard) and re-index from scratch, which leaves no stale chunks behind.

***

## Performance tips

* Index documents **once** at setup time, not on every server start
* For large document collections (thousands of files), index in batches
* Keep `chunk_size` modest if chunk text is stored in vector metadata, since Pinecone limits metadata size per vector (40 KB)
* Monitor `errors` in the return value to catch files that failed to load
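
The batching and error-monitoring tips above can be combined in one loop. The sketch below is an assumption-laden illustration: `index_in_batches` is a hypothetical helper, the batch size of 100 is arbitrary, and the indexing function is passed in as `index_fn` so the same code works with `lc.index` or with a stub in tests.

```python
def index_in_batches(index_fn, paths, batch_size=100, **index_kwargs):
    """Call index_fn on successive slices of paths and sum the statistics."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    totals = {
        "chunks_indexed": 0,
        "chunks_skipped": 0,
        "files_processed": 0,
        "errors": [],
    }
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        result = index_fn(batch, **index_kwargs)
        for key in ("chunks_indexed", "chunks_skipped", "files_processed"):
            totals[key] += result.get(key, 0)
        totals["errors"].extend(result.get("errors", []))  # keep every failure
    return totals
```

A call might look like `totals = index_in_batches(lc.index, all_paths, batch_size=200, chunk_size=800)`, where `all_paths` is your own list of file paths; checking `totals["errors"]` afterwards shows failures from every batch in one place.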
