
What is Document Indexing?

Document indexing loads your documents, splits them into chunks, converts the chunks to vector embeddings, and stores those vectors in Pinecone so your chatbot can find and use them.

The process:

Document (PDF/TXT/CSV) → Load & Split into Chunks → Convert to Embeddings → Store in Pinecone → Ready for Search!
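The steps above can be sketched in a few lines of Python. Everything here (the character splitter, the toy embed function, the in-memory index) is an illustrative stand-in, not LangChat's internals:

```python
def split(text, chunk_size, chunk_overlap):
    """Step 2: split text into overlapping character chunks."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(chunk):
    """Step 3 stand-in: a real pipeline calls an embedding model here."""
    return [float(ord(ch)) for ch in chunk[:8]]

def index_document(text, chunk_size=1000, chunk_overlap=200):
    chunks = split(text, chunk_size, chunk_overlap)
    # Step 4 stand-in: pair each chunk with its vector ("store" it)
    return [(chunk, embed(chunk)) for chunk in chunks]

# A 1,900-character stand-in document yields 12 overlapping chunks.
index = index_document("some document text " * 100, chunk_size=200, chunk_overlap=40)
print(len(index))  # 12
```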

Quick Start

Index your first document:
from langchat import LangChat
from langchat.llm import OpenAI
from langchat.vector_db import Pinecone
from langchat.database import Supabase

# Setup
llm = OpenAI(api_key="sk-...", model="gpt-4o-mini")
vector_db = Pinecone(api_key="...", index_name="...")
db = Supabase(url="https://...", key="...")

ai = LangChat(llm=llm, vector_db=vector_db, db=db)

# Index document
result = ai.load_and_index_documents(
    file_path="my-document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

print(f"✅ Indexed {result['chunks_indexed']} chunks")
That’s it! Your document is now searchable.

Chunk Size and Overlap

Chunk Size

How many characters each chunk contains:
# Small chunks (500) - Precise answers
chunk_size=500

# Medium chunks (1000) - RECOMMENDED
chunk_size=1000

# Large chunks (2000) - More context
chunk_size=2000

Chunk Overlap

How much text is shared between chunks:
# Default: 200 (20% of 1000 chunk size)
chunk_overlap=200
Why overlap?
  • Prevents losing context at boundaries
  • Keeps related information together
  • Improves search accuracy
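The effect of overlap is easy to see with a simple character splitter (a sketch, not LangChat's actual splitter): the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.

```python
def split(text, chunk_size, chunk_overlap):
    """Split text into overlapping character chunks."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(2500))  # stand-in document text
chunks = split(doc, chunk_size=1000, chunk_overlap=200)

# The last 200 characters of chunk 0 reappear as the first 200 of chunk 1.
print(chunks[0][-200:] == chunks[1][:200])  # True
```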

Preventing Duplicates

LangChat automatically prevents duplicate documents:
# First time
result = ai.load_and_index_documents("document.pdf")
# Result: Indexed 50 chunks, Skipped 0

# Try again (safe!)
result = ai.load_and_index_documents("document.pdf")
# Result: Indexed 0 chunks, Skipped 50 ✅
Safe to run multiple times! Duplicates are automatically skipped.
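One common way to implement this kind of de-duplication is content hashing; the sketch below illustrates the idea (LangChat's actual mechanism may differ):

```python
import hashlib

seen = set()  # stand-in for hashes already recorded in the database

def index_chunks(chunks):
    """Index new chunks, skipping any whose content hash was seen before."""
    indexed = skipped = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        if digest in seen:
            skipped += 1      # already stored: skip, don't re-embed
        else:
            seen.add(digest)  # remember the hash, then store the chunk
            indexed += 1
    return indexed, skipped

chunks = ["chunk one", "chunk two"]
print(index_chunks(chunks))  # first run:  (2, 0)
print(index_chunks(chunks))  # second run: (0, 2) -- safe to repeat
```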

Multiple Documents

Index multiple files at once:
result = ai.load_and_index_multiple_documents(
    file_paths=[
        "doc1.pdf",
        "doc2.txt",
        "data.csv"
    ],
    chunk_size=1000,
    chunk_overlap=200
)

print(f"✅ Total chunks: {result['total_chunks_indexed']}")
print(f"⏭️  Skipped: {result['total_chunks_skipped']}")

Using Namespaces

Organize documents into groups:
# Education documents
ai.load_and_index_documents(
    file_path="universities.pdf",
    namespace="education"
)

# Travel documents
ai.load_and_index_documents(
    file_path="destinations.pdf",
    namespace="travel"
)

Best Practices

1. Choose the Right Chunk Size

# ✅ Good: Balanced
chunk_size=1000, chunk_overlap=200

# ❌ Too small: Loses context
chunk_size=200

# ❌ Too large: Less precise
chunk_size=5000

2. Always Use Duplicate Prevention

# ✅ Good: Prevents duplicates
prevent_duplicates=True  # Default

# ❌ Bad: Can create duplicates
prevent_duplicates=False

3. Organize with Namespaces

# ✅ Good: Organized
namespace="product-docs"
namespace="support-articles"

# ❌ Bad: Everything mixed
# (No namespace)

Troubleshooting

No Chunks Indexed

Check:
  • All chunks were duplicates? (check chunks_skipped)
  • Document is empty?
  • File path correct?

Unsupported File Type

Solution:
  • Check file extension (PDF, TXT, CSV supported)
  • Convert to supported format

Error Indexing

Check:
  • Pinecone API key valid?
  • Index name exists?
  • Network connection?
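The checks above can be rolled into a small defensive wrapper. This is a sketch: the pre-flight checks are ordinary Python, and the result field names (chunks_indexed, chunks_skipped) follow the examples earlier in this guide:

```python
import os

SUPPORTED = {".pdf", ".txt", ".csv"}

def safe_index(ai, file_path):
    """Run the common troubleshooting checks before and after indexing."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Check the file path: {file_path}")
    if os.path.splitext(file_path)[1].lower() not in SUPPORTED:
        raise ValueError("Unsupported file type (PDF, TXT, CSV supported)")
    result = ai.load_and_index_documents(file_path=file_path)
    if result["chunks_indexed"] == 0 and result.get("chunks_skipped", 0) > 0:
        print("All chunks were duplicates -- nothing new to index")
    return result
```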

Built with ❤️ by NeuroBrain