The index() method
LangChat.index() indexes one file, multiple files, or an entire directory into Pinecone:
# Single file
result = lc.index("docs/manual.pdf")
# Multiple files
result = lc.index(["docs/faq.pdf", "docs/policies.txt", "data/products.csv"])
# Entire directory (recursive)
result = lc.index("docs/")
print(result["chunks_indexed"]) # chunks added
print(result["chunks_skipped"]) # duplicates skipped
LangChat uses docsuite for file loading, which supports:
| Format | Extension |
|---|---|
| PDF | .pdf |
| Plain text | .txt |
| Markdown | .md |
| CSV | .csv |
| Word | .docx |
| PowerPoint | .pptx |
| HTML | .html, .htm |
| JSON | .json |
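When indexing a mixed directory, it can help to check up front which files have one of the extensions listed above. The following is a minimal sketch, not part of LangChat or docsuite: `SUPPORTED_EXTENSIONS`, `is_supported`, and `filter_supported` are hypothetical names, and the case-insensitive match is an assumption about how extensions are compared.

```python
from pathlib import PurePath

# Extensions copied from the table above (hypothetical constant, not a LangChat API).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".txt", ".md", ".csv", ".docx", ".pptx", ".html", ".htm", ".json",
}


def is_supported(path: str) -> bool:
    """Return True if the path's extension appears in the table above.

    Assumption: matching is case-insensitive, so "MANUAL.PDF" counts as a PDF.
    """
    return PurePath(path).suffix.lower() in SUPPORTED_EXTENSIONS


def filter_supported(paths: list[str]) -> list[str]:
    """Keep only the paths whose extension is in SUPPORTED_EXTENSIONS."""
    return [p for p in paths if is_supported(p)]


candidates = ["docs/manual.pdf", "docs/logo.png", "data/products.csv", "notes/README"]
print(filter_supported(candidates))  # ['docs/manual.pdf', 'data/products.csv']
```

The filtered list could then be passed to `lc.index(...)`, which accepts a list of paths as shown earlier.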
Chunking
Documents are split into overlapping chunks before indexing. Configure chunk size and overlap:
result = lc.index(
    "docs/",
    chunk_size=1000,    # characters per chunk (default: 1000)
    chunk_overlap=200,  # overlap between adjacent chunks (default: 200)
)
Choosing chunk size:
| Chunk size | Best for | Trade-off |
|---|---|---|
| 500–800 | FAQs, short paragraphs | More chunks, lower cost per query |
| 1000–1500 | Documentation, articles | Balanced |
| 2000+ | Long-form prose, legal text | Fewer chunks, richer context per chunk |
Overlap keeps text near a chunk boundary from losing its surrounding context. With a 200-character overlap on 1000-character chunks, roughly the last 200 characters of each chunk are repeated at the start of the next.
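To make the relationship between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration only: `chunk_text` is a hypothetical helper, and LangChat's actual splitter may break on separators such as paragraphs or sentences rather than at exact character offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2600))  # 2600 characters of dummy text
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                           # 3
print(chunks[0][-200:] == chunks[1][:200])   # True
```

With these settings each chunk starts 800 characters after the previous one, so a 2600-character document yields three chunks rather than the 2.6 a non-overlapping split would suggest.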
Duplicate prevention
By default, index() skips chunks it has already indexed. It detects duplicates by hashing each chunk’s content and checking Pinecone metadata:
# Safe to run repeatedly — won't re-index the same content
result = lc.index("docs/")
# chunks_skipped will increase on subsequent runs
To force re-indexing (e.g., after updating documents):
result = lc.index("docs/", prevent_duplicates=False)
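The content-hash check described above can be sketched as follows. This is an assumption-laden illustration, not LangChat's implementation: the hash algorithm (SHA-256 here), the helper names `chunk_hash` and `split_new_and_duplicate`, and the in-memory `set` standing in for hashes stored in Pinecone metadata are all hypothetical.

```python
import hashlib


def chunk_hash(chunk: str) -> str:
    """Return a stable SHA-256 hex digest of the chunk's text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def split_new_and_duplicate(chunks: list[str], seen_hashes: set[str]) -> tuple[list[str], int]:
    """Return (chunks not seen before, number of duplicates skipped).

    seen_hashes is updated in place so later calls remember earlier chunks.
    """
    new_chunks = []
    skipped = 0
    for chunk in chunks:
        digest = chunk_hash(chunk)
        if digest in seen_hashes:
            skipped += 1
        else:
            seen_hashes.add(digest)
            new_chunks.append(chunk)
    return new_chunks, skipped


seen: set[str] = set()  # stand-in for hashes already stored alongside the vectors

first_new, first_skipped = split_new_and_duplicate(["alpha", "beta"], seen)
second_new, second_skipped = split_new_and_duplicate(["alpha", "beta", "gamma"], seen)

print(len(first_new), first_skipped)    # 2 0
print(len(second_new), second_skipped)  # 1 2
```

Because the hash depends only on the chunk text, an edited chunk produces a new digest and is treated as new content, while its old version is not recognized as related to it.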
Namespaces
Use Pinecone namespaces to separate document collections:
# Index different document types separately
lc.index("products/", namespace="products")
lc.index("support/", namespace="support")
lc.index("legal/", namespace="legal")
Namespaces allow a single Pinecone index to serve multiple use cases.
Full example: build a knowledge base
# build_kb.py
import os
from langchat import LangChat
from langchat.providers import OpenAI, Pinecone, Supabase
LangChat.load_env()
lc = LangChat(
    llm=OpenAI("gpt-4o-mini"),
    vector_db=Pinecone("my-index"),
    db=Supabase(),
)
# Index all documents
paths = [
    "content/product-manual.pdf",
    "content/faq.md",
    "content/pricing.csv",
]
result = lc.index(
    paths,
    chunk_size=800,
    chunk_overlap=150,
    prevent_duplicates=True,
)
print(f"✓ Indexed {result['chunks_indexed']} chunks")
print(f" Skipped {result['chunks_skipped']} duplicates")
Run it once to build the index, then run your chatbot normally.
Return value
index() returns a dict with indexing statistics:
| Key | Type | Description |
|---|---|---|
| chunks_indexed | int | Number of chunks added to Pinecone |
| chunks_skipped | int | Number of duplicate chunks skipped |
| files_processed | int | Number of files successfully processed |
| errors | list | Any files that failed to load |
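A small helper can turn this dict into a readable report and surface failures. The sketch below is hypothetical (`summarize_index_result` is not a LangChat function), relies only on the four keys in the table, and treats each entry in `errors` as an opaque value, since its exact shape is not specified here. The sample dict is made-up data for illustration.

```python
def summarize_index_result(result: dict) -> str:
    """Format the statistics returned by index() as a short multi-line report."""
    indexed = result.get("chunks_indexed", 0)
    skipped = result.get("chunks_skipped", 0)
    files = result.get("files_processed", 0)
    errors = result.get("errors") or []

    lines = [f"{files} files processed: {indexed} chunks indexed, {skipped} skipped"]
    for err in errors:
        # The entry format is an assumption; it is printed as-is.
        lines.append(f"  failed: {err}")
    return "\n".join(lines)


# Made-up result for illustration; a real one would come from lc.index(...).
sample_result = {
    "chunks_indexed": 42,
    "chunks_skipped": 3,
    "files_processed": 2,
    "errors": ["docs/broken.pdf"],
}
print(summarize_index_result(sample_result))
```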
Re-indexing after document updates
When documents change, re-index them with prevent_duplicates=False to replace the old content:
# After updating docs/manual.pdf:
result = lc.index("docs/manual.pdf", prevent_duplicates=False)
Alternatively, delete the old vectors from Pinecone (for example, via the Pinecone dashboard) and re-index from scratch.
- Index documents once at setup time, not on every server start
- For large document collections (thousands of files), index in batches
- Use a smaller chunk_size for large collections to stay within Pinecone’s metadata limits
- Monitor errors in the return value to catch files that failed to load
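The batching and error-monitoring tips above could be combined along these lines. This is a sketch under stated assumptions: `batched` and `index_in_batches` are hypothetical helpers, the batch size is arbitrary, and `index_fn` is any callable that takes a list of paths and returns a dict with the keys from the "Return value" section (for example `lc.index`). A fake function stands in for it here so the example runs without LangChat installed.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each with at most batch_size entries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def index_in_batches(
    index_fn: Callable[[list[str]], dict],
    paths: Sequence[str],
    batch_size: int = 100,
) -> dict:
    """Call index_fn once per batch and add up the returned statistics."""
    totals = {"chunks_indexed": 0, "chunks_skipped": 0, "files_processed": 0, "errors": []}
    for batch in batched(paths, batch_size):
        result = index_fn(batch)
        for key in ("chunks_indexed", "chunks_skipped", "files_processed"):
            totals[key] += result.get(key, 0)
        totals["errors"].extend(result.get("errors") or [])
    return totals


def fake_index(batch: list[str]) -> dict:
    """Stand-in for lc.index: pretends every file yields 10 chunks, .bad files fail."""
    good = [p for p in batch if not p.endswith(".bad")]
    bad = [p for p in batch if p.endswith(".bad")]
    return {
        "chunks_indexed": 10 * len(good),
        "chunks_skipped": 0,
        "files_processed": len(good),
        "errors": bad,
    }


all_paths = [f"docs/file{i}.txt" for i in range(5)] + ["docs/corrupt.bad"]
totals = index_in_batches(fake_index, all_paths, batch_size=2)
print(totals["files_processed"], totals["chunks_indexed"], totals["errors"])
# 5 50 ['docs/corrupt.bad']
```

In a real script, `index_in_batches(lc.index, all_paths, batch_size=200)` would be the equivalent call; a non-empty `totals["errors"]` is the signal to inspect the failed files.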