What is Document Indexing?
Document indexing loads your documents, splits them into chunks, converts each chunk into a vector embedding, and stores the vectors in Pinecone so your chatbot can find and use them.
The Process:
Document (PDF/TXT/CSV)
↓
Load & Split into Chunks
↓
Convert to Embeddings
↓
Store in Pinecone
↓
Ready for Search!
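The steps above can be sketched in plain Python. This is an illustrative sketch with stand-in `embed` and `store` callables, not LangChat's internals:

```python
def index_document(text, chunk_size, chunk_overlap, embed, store):
    """Split text into overlapping chunks, embed each one, and store the vectors."""
    step = chunk_size - chunk_overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    for chunk in chunks:
        store(embed(chunk), chunk)  # vector plus original text go to the index
    return len(chunks)
```

In the real library, `embed` would be an embedding-model call and `store` a Pinecone upsert; here they are placeholders you supply.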
Quick Start
Index your first document:
from langchat import LangChat
from langchat.llm import OpenAI
from langchat.vector_db import Pinecone
from langchat.database import Supabase
# Setup
llm = OpenAI(api_key="sk-...", model="gpt-4o-mini")
vector_db = Pinecone(api_key="...", index_name="...")
db = Supabase(url="https://...", key="...")
ai = LangChat(llm=llm, vector_db=vector_db, db=db)
# Index document
result = ai.load_and_index_documents(
    file_path="my-document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)
print(f"✅ Indexed {result['chunks_indexed']} chunks")
That’s it! Your document is now searchable.
Chunk Size and Overlap
Chunk Size
The number of characters in each chunk:
# Small chunks (500) - Precise answers
chunk_size=500
# Medium chunks (1000) - RECOMMENDED
chunk_size=1000
# Large chunks (2000) - More context
chunk_size=2000
Chunk Overlap
The number of characters adjacent chunks share:
# Default: 200 (20% of 1000 chunk size)
chunk_overlap=200
Why overlap?
- Prevents losing context at boundaries
- Keeps related information together
- Improves search accuracy
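A quick way to see the effect in plain Python, using simple character-based splitting (not necessarily LangChat's exact splitter):

```python
text = " ".join(f"word{i}" for i in range(300))
chunk_size, chunk_overlap = 1000, 200
step = chunk_size - chunk_overlap

chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]

# The last 200 characters of chunk N reappear at the start of chunk N+1,
# so text cut at a boundary survives intact in at least one of the two chunks.
print(chunks[0][-chunk_overlap:] == chunks[1][:chunk_overlap])  # True
```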
Preventing Duplicates
LangChat automatically prevents duplicate documents:
# First time
result = ai.load_and_index_documents("document.pdf")
# Result: Indexed 50 chunks, Skipped 0
# Try again (safe!)
result = ai.load_and_index_documents("document.pdf")
# Result: Indexed 0 chunks, Skipped 50 ✅
Safe to run multiple times! Duplicates are automatically skipped.
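Deduplication like this is typically done by hashing chunk content into deterministic IDs. A minimal sketch of the idea (the helper names are hypothetical, not LangChat's actual implementation):

```python
import hashlib

def chunk_id(chunk: str) -> str:
    # Same text always produces the same ID, so re-indexing is detectable
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

def index_chunks(chunks, seen_ids):
    indexed = skipped = 0
    for chunk in chunks:
        cid = chunk_id(chunk)
        if cid in seen_ids:
            skipped += 1       # already in the index: skip
        else:
            seen_ids.add(cid)  # new content: index it
            indexed += 1
    return indexed, skipped
```

Running it twice on the same chunks indexes everything the first time and skips everything the second, which matches the behavior shown above.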
Multiple Documents
Index multiple files at once:
result = ai.load_and_index_multiple_documents(
    file_paths=[
        "doc1.pdf",
        "doc2.txt",
        "data.csv"
    ],
    chunk_size=1000,
    chunk_overlap=200
)
print(f"✅ Total chunks: {result['total_chunks_indexed']}")
print(f"⏭️ Skipped: {result['total_chunks_skipped']}")
Using Namespaces
Organize documents into groups:
# Education documents
ai.load_and_index_documents(
    file_path="universities.pdf",
    namespace="education"
)
# Travel documents
ai.load_and_index_documents(
    file_path="destinations.pdf",
    namespace="travel"
)
Best Practices
1. Choose the Right Chunk Size
# ✅ Good: Balanced
chunk_size=1000, chunk_overlap=200
# ❌ Too small: Loses context
chunk_size=200
# ❌ Too large: Less precise
chunk_size=5000
2. Always Use Duplicate Prevention
# ✅ Good: Prevents duplicates
prevent_duplicates=True # Default
# ❌ Bad: Can create duplicates
prevent_duplicates=False
3. Organize with Namespaces
# ✅ Good: Organized
namespace="product-docs"
namespace="support-articles"
# ❌ Bad: Everything mixed
# (No namespace)
Troubleshooting
No Chunks Indexed
Check:
- All chunks were duplicates? (check chunks_skipped)
- Document is empty?
- File path correct?
Unsupported File Type
Solution:
- Check file extension (PDF, TXT, CSV supported)
- Convert to supported format
Error Indexing
Check:
- Pinecone API key valid?
- Index name exists?
- Network connection?
Built with ❤️ by NeuroBrain