The index() method

LangChat.index() indexes one file, multiple files, or an entire directory into Pinecone:
# Single file
result = lc.index("docs/manual.pdf")

# Multiple files
result = lc.index(["docs/faq.pdf", "docs/policies.txt", "data/products.csv"])

# Entire directory (recursive)
result = lc.index("docs/")

print(result["chunks_indexed"])  # chunks added
print(result["chunks_skipped"])  # duplicates skipped

Supported file formats

LangChat uses docsuite for file loading, which supports:
| Format     | Extension    |
|------------|--------------|
| PDF        | .pdf         |
| Plain text | .txt         |
| Markdown   | .md          |
| CSV        | .csv         |
| Word       | .docx        |
| PowerPoint | .pptx        |
| HTML       | .html, .htm  |
| JSON       | .json        |
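When pointing index() at a mixed directory, you can pre-filter to these extensions first. This is an optional sketch, not part of LangChat's API — the extension set is copied from the table above, and "docs/" is a placeholder path:

```python
from pathlib import Path

# Extensions docsuite supports, per the table above
SUPPORTED = {".pdf", ".txt", ".md", ".csv", ".docx", ".pptx", ".html", ".htm", ".json"}

def supported_files(root):
    # Recursively collect files whose extension is in the supported set
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )

# result = lc.index(supported_files("docs/"))
```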

Chunking

Documents are split into overlapping chunks before indexing. Configure chunk size and overlap:
result = lc.index(
    "docs/",
    chunk_size=1000,     # characters per chunk (default: 1000)
    chunk_overlap=200,   # overlap between adjacent chunks (default: 200)
)
Choosing chunk size:
| Chunk size | Best for                     | Trade-off                              |
|------------|------------------------------|----------------------------------------|
| 500–800    | FAQs, short paragraphs       | More chunks, lower cost per query      |
| 1000–1500  | Documentation, articles      | Balanced                               |
| 2000+      | Long-form prose, legal text  | Fewer chunks, richer context per chunk |
Overlap ensures sentences aren’t cut off at chunk boundaries. A 200-character overlap on 1000-character chunks means adjacent chunks share roughly their last/first 200 characters.
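The sliding-window scheme can be illustrated with a plain character-based chunker — a sketch of the idea only, not LangChat's actual splitter:

```python
def chunk(text, chunk_size=1000, chunk_overlap=200):
    # Each chunk starts (chunk_size - chunk_overlap) characters after the
    # previous one, so adjacent chunks share chunk_overlap characters
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk("x" * 2500)
# The last 200 characters of each chunk equal the first 200 of the next
```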

Duplicate prevention

By default, index() skips chunks it has already indexed. It detects duplicates by hashing each chunk’s content and checking Pinecone metadata:
# Safe to run repeatedly — won't re-index the same content
result = lc.index("docs/")
# chunks_skipped will increase on subsequent runs
To force re-indexing (e.g., after updating documents):
result = lc.index("docs/", prevent_duplicates=False)
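Content hashing of this kind can be sketched in a few lines. The use of SHA-256 and an in-memory set here is illustrative — the doc doesn't specify LangChat's exact hash function or how the digest is stored in Pinecone metadata:

```python
import hashlib

def content_hash(chunk_text: str) -> str:
    # Identical chunk text always produces the same digest,
    # so previously indexed content can be recognized
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

seen = set()

def should_index(chunk_text: str) -> bool:
    h = content_hash(chunk_text)
    if h in seen:
        return False  # duplicate: skip
    seen.add(h)
    return True
```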

Namespaces

Use Pinecone namespaces to separate document collections:
# Index different document types separately
lc.index("products/", namespace="products")
lc.index("support/", namespace="support")
lc.index("legal/", namespace="legal")
Namespaces allow a single Pinecone index to serve multiple use cases.

Full example: build a knowledge base

# build_kb.py
import os
from langchat import LangChat
from langchat.providers import OpenAI, Pinecone, Supabase

LangChat.load_env()

lc = LangChat(
    llm=OpenAI("gpt-4o-mini"),
    vector_db=Pinecone("my-index"),
    db=Supabase(),
)

# Index all documents
paths = [
    "content/product-manual.pdf",
    "content/faq.md",
    "content/pricing.csv",
]

result = lc.index(
    paths,
    chunk_size=800,
    chunk_overlap=150,
    prevent_duplicates=True,
)

print(f"✓ Indexed {result['chunks_indexed']} chunks")
print(f"  Skipped {result['chunks_skipped']} duplicates")
Run it once to build the index, then run your chatbot normally.

Return value

index() returns a dict with indexing statistics:
| Key             | Type | Description                            |
|-----------------|------|----------------------------------------|
| chunks_indexed  | int  | Number of chunks added to Pinecone     |
| chunks_skipped  | int  | Number of duplicate chunks skipped     |
| files_processed | int  | Number of files successfully processed |
| errors          | list | Files that failed to load              |
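The errors list makes failed loads easy to surface. A minimal sketch — the exact shape of each error entry isn't specified here, so entries are just formatted as-is:

```python
def summarize(result):
    # Summarize an index() result dict of the shape documented above
    lines = [
        f"indexed={result['chunks_indexed']} "
        f"skipped={result['chunks_skipped']} "
        f"files={result['files_processed']}"
    ]
    for err in result["errors"]:
        lines.append(f"failed to load: {err}")
    return "\n".join(lines)
```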

Re-indexing after document updates

When documents change, re-index them with prevent_duplicates=False to replace the old content:
# After updating docs/manual.pdf:
result = lc.index("docs/manual.pdf", prevent_duplicates=False)
Or delete the old vectors from Pinecone and re-index from scratch via the Pinecone dashboard.

Performance tips

  • Index documents once at setup time, not on every server start
  • For large document collections (thousands of files), index in batches
  • Use smaller chunk_size for large collections to stay within Pinecone’s metadata limits
  • Monitor errors in the return value to catch files that failed to load
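The batching tip can be sketched as a small helper — assuming lc is a configured LangChat instance; the batch size and directory name are placeholders:

```python
from pathlib import Path

def batches(root, batch_size=100):
    # Yield lists of file paths so each index() call stays small
    files = sorted(str(p) for p in Path(root).rglob("*") if p.is_file())
    for i in range(0, len(files), batch_size):
        yield files[i:i + batch_size]

# for batch in batches("docs/"):
#     lc.index(batch)
```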