The index() method
LangChat.index() indexes one file, multiple files, or an entire directory into Pinecone:
Supported file formats
LangChat uses docsuite for file loading, which supports:
| Format | Extension |
|---|---|
| PDF | .pdf |
| Plain text | .txt |
| Markdown | .md |
| CSV | .csv |
| Word | .docx |
| PowerPoint | .pptx |
| HTML | .html, .htm |
| JSON | .json |
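
It can help to check a directory against the table above before indexing, so unsupported files are reported up front rather than surfacing later as load errors. The sketch below is not part of LangChat or docsuite; the names SUPPORTED_EXTENSIONS, is_supported, and partition_by_support are hypothetical, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

# Extensions taken from the supported-formats table above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".txt", ".md", ".csv", ".docx", ".pptx", ".html", ".htm", ".json",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the table above.

    Matching is case-insensitive here, which is an assumption about
    how the loader treats extensions such as ".PDF".
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_by_support(paths: list[str]) -> tuple[list[str], list[str]]:
    """Split paths into (supported, unsupported), preserving order."""
    supported = [p for p in paths if is_supported(p)]
    unsupported = [p for p in paths if not is_supported(p)]
    return supported, unsupported


ok, rejected = partition_by_support(["guide.md", "report.PDF", "logo.png", "README"])
print("will index:", ok)
print("will skip:", rejected)
```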
Chunking
Documents are split into overlapping chunks before indexing. Configure chunk size and overlap:
| Chunk size | Best for | Trade-off |
|---|---|---|
| 500–800 | FAQs, short paragraphs | More chunks, lower cost per query |
| 1000–1500 | Documentation, articles | Balanced |
| 2000+ | Long-form prose, legal text | Fewer chunks, richer context per chunk |
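
To make "overlapping chunks" concrete, here is a minimal sketch of a character-based splitter. It is an illustration only, not LangChat's actual splitter: the function name split_into_chunks and the parameter name chunk_overlap are assumptions, and the real implementation may measure size in tokens or split on sentence boundaries.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so a sentence cut
    at a chunk boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_into_chunks(sample, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(len(c), c)
```

A larger overlap preserves more context across boundaries but produces more chunks for the same document, which is the same trade-off the table describes for chunk size.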
Duplicate prevention
By default, index() skips chunks it has already indexed. It detects duplicates by hashing each chunk’s content and checking the hash against Pinecone metadata.
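
The idea behind content-hash deduplication can be sketched as follows. This is not LangChat's implementation: the choice of SHA-256 is an assumption, the helper names chunk_hash and filter_new_chunks are hypothetical, and an in-memory set stands in for the lookup against Pinecone metadata.

```python
import hashlib


def chunk_hash(chunk: str) -> str:
    """Return a stable hex digest of the chunk's content."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def filter_new_chunks(chunks: list[str], seen_hashes: set[str]) -> tuple[list[str], int]:
    """Return (chunks not seen before, number of duplicates skipped).

    seen_hashes is updated in place; it is a stand-in for the hashes
    that would be stored in the vector index's metadata.
    """
    new_chunks = []
    skipped = 0
    for chunk in chunks:
        digest = chunk_hash(chunk)
        if digest in seen_hashes:
            skipped += 1
            continue
        seen_hashes.add(digest)
        new_chunks.append(chunk)
    return new_chunks, skipped


seen: set[str] = set()
first_pass, first_skipped = filter_new_chunks(["alpha", "beta", "alpha"], seen)
second_pass, second_skipped = filter_new_chunks(["beta", "gamma"], seen)
print(first_pass, first_skipped)
print(second_pass, second_skipped)
```

Because the hash covers only the chunk's content, changing chunk size or overlap produces different chunks and therefore different hashes, so previously indexed text would no longer be recognized as a duplicate under this scheme.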
Namespaces
Use Pinecone namespaces to separate document collections.
Full example: build a knowledge base
Return value
index() returns a dict with indexing statistics:
| Key | Type | Description |
|---|---|---|
| chunks_indexed | int | Number of chunks added to Pinecone |
| chunks_skipped | int | Number of duplicate chunks skipped |
| files_processed | int | Number of files successfully processed |
| errors | list | Any files that failed to load |
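
A small helper can turn these statistics into a log line. The sketch below relies only on the four keys listed in the table; the function name summarize_index_result is hypothetical, the example values are made up, and the shape of each entry in errors is not documented here, so the entries are only counted.

```python
def summarize_index_result(result: dict) -> str:
    """Build a one-line summary from the dict returned by index()."""
    indexed = result.get("chunks_indexed", 0)
    skipped = result.get("chunks_skipped", 0)
    files = result.get("files_processed", 0)
    errors = result.get("errors", [])

    summary = f"{files} file(s) processed, {indexed} chunk(s) indexed, {skipped} skipped"
    if errors:
        # Entries are treated as opaque; only their count is reported.
        summary += f", {len(errors)} error(s)"
    return summary


# Illustrative values only, not real output.
example_result = {
    "chunks_indexed": 42,
    "chunks_skipped": 3,
    "files_processed": 5,
    "errors": ["broken.pdf"],
}
print(summarize_index_result(example_result))
```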
Re-indexing after document updates
When documents change, re-index them with prevent_duplicates=False to replace the old content.
Performance tips
- Index documents once at setup time, not on every server start
- For large document collections (thousands of files), index in batches
- Use a smaller chunk_size for large collections to stay within Pinecone’s metadata limits
- Monitor errors in the return value to catch files that failed to load
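
The batching and error-monitoring tips can be combined in one loop. The sketch below is a generic pattern rather than LangChat code: batched and index_in_batches are hypothetical helpers, and index_batch is a placeholder callable that is assumed to accept a list of paths and return a dict with the keys described under "Return value". How index() actually accepts multiple files is not shown here.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items with at most batch_size entries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def index_in_batches(
    paths: Sequence[str],
    batch_size: int,
    index_batch: Callable[[list[str]], dict],
) -> dict:
    """Call index_batch once per batch and add up the returned statistics."""
    totals = {"chunks_indexed": 0, "chunks_skipped": 0, "files_processed": 0, "errors": []}
    for batch in batched(paths, batch_size):
        result = index_batch(batch)
        for key in ("chunks_indexed", "chunks_skipped", "files_processed"):
            totals[key] += result.get(key, 0)
        totals["errors"].extend(result.get("errors", []))
    return totals


def fake_index_batch(batch: list[str]) -> dict:
    """Stand-in for a real indexing call; pretends every file yields 10 chunks."""
    return {
        "chunks_indexed": 10 * len(batch),
        "chunks_skipped": 0,
        "files_processed": len(batch),
        "errors": [],
    }


paths = [f"doc_{i}.md" for i in range(7)]
totals = index_in_batches(paths, batch_size=3, index_batch=fake_index_batch)
print(totals)
```

Running this as a one-off setup script, rather than at server start, keeps it in line with the first tip.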
