What is Document Indexing?

Document indexing is the process of loading your documents (PDFs, text files, CSVs, etc.), splitting them into smaller chunks, converting them to vectors (embeddings), and storing them in Pinecone so your chatbot can find and use them to answer questions. Think of it like creating a smart library for your chatbot:
  • 📚 Documents = Books in your library
  • ✂️ Splitting = Breaking books into chapters
  • 🔢 Embeddings = Creating a smart index card for each chapter
  • 🗄️ Pinecone = The digital library where everything is stored
Why is this important? Without indexed documents, your chatbot can only answer from its general knowledge. With indexed documents, it can answer specific questions about YOUR content!

How Document Indexing Works

Here’s the complete flow from document to searchable knowledge:
Your Document (PDF/TXT/CSV)
    ↓
📄 Load Document (docsuite)
    ↓
✂️ Split into Chunks (LangChain)
    ↓
🔢 Convert to Embeddings (OpenAI)
    ↓
🗄️ Store in Pinecone
    ↓
✅ Ready for Chatbot to Use!

Step-by-Step Breakdown

1. Loading Documents 📄

Your documents are loaded using docsuite, which automatically detects the file type:
  • PDF files → Extracts text from pages
  • TXT files → Reads plain text
  • CSV files → Converts rows to text
  • And more! (Word docs, Excel, etc.)

2. Splitting into Chunks ✂️

Large documents are split into smaller pieces (chunks) because:
  • Better Search: Smaller chunks = more precise matches
  • Context Management: Easier to find relevant sections
  • Token Limits: Fits within AI model limits
Example:
Original Document (10,000 words)
    ↓ Split into chunks of 1000 words
Chunk 1: Words 1-1000
Chunk 2: Words 801-1800 (200-word overlap with Chunk 1)
Chunk 3: Words 1601-2600
...
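
To see the chunking mechanics yourself, here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter. LangChat runs this step internally, so treat this as an illustration of the technique rather than its exact implementation:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # max characters per chunk
    chunk_overlap=200  # characters shared between adjacent chunks
)

with open("my-document.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"Created {len(chunks)} chunks")
print(chunks[0][:80])  # preview the first chunk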

3. Creating Embeddings 🔢

Each chunk is converted to a vector embedding: a list of numbers that represents the meaning of the text. Why embeddings?
  • Similar content = Similar numbers
  • Enables semantic search (finding meaning, not just keywords)
  • Makes search fast and accurate
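Here is a minimal sketch of what "similar content = similar numbers" means, using the openai Python client directly. LangChat creates embeddings for you; the two example sentences below are made up for illustration:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "How do I reset my password?",
        "Steps to recover your account login",
    ],
)
vec_a, vec_b = (item.embedding for item in resp.data)

# Cosine similarity: semantically similar texts score close to 1.0
dot = sum(x * y for x, y in zip(vec_a, vec_b))
norm_a = sum(x * x for x in vec_a) ** 0.5
norm_b = sum(x * x for x in vec_b) ** 0.5
print(f"Similarity: {dot / (norm_a * norm_b):.3f}")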

4. Storing in Pinecone 🗄️

All chunks with their embeddings are stored in Pinecone, a vector database that:
  • Stores millions of documents
  • Searches in milliseconds
  • Finds similar content instantly
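For intuition, here is a rough sketch of a single-chunk upsert with the official pinecone client. LangChat handles this for you; the ID format and metadata fields below are illustrative assumptions, not LangChat's actual storage schema:
from pinecone import Pinecone

pc = Pinecone(api_key="your-pinecone-key")
index = pc.Index("your-index-name")

chunk_text = "Example chunk text..."  # hypothetical chunk from step 2
embedding = [0.1] * 3072              # placeholder; a real vector comes from step 3

index.upsert(
    vectors=[{
        "id": "my-document.pdf#chunk-0",  # one unique ID per chunk
        "values": embedding,
        "metadata": {"text": chunk_text, "source": "my-document.pdf"},
    }]
)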

Quick Start: Index Your First Document

If you already have LangChat set up:
from langchat import LangChat, LangChatConfig

# Load your configuration
config = LangChatConfig.from_env()
langchat = LangChat(config=config)

# Index a document
result = langchat.load_and_index_documents(
    file_path="my-document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

print(f"✅ Indexed {result['chunks_indexed']} chunks")
print(f"⏭️  Skipped {result.get('chunks_skipped', 0)} duplicates")
That’s it! Your document is now searchable by your chatbot!

Alternative: Using DocumentIndexer (Standalone)

If you only need to index documents (no chatbot setup):
from langchat.utils.document_indexer import DocumentIndexer

# Initialize with just Pinecone and OpenAI keys
indexer = DocumentIndexer(
    pinecone_api_key="your-pinecone-key",
    pinecone_index_name="your-index-name",
    openai_api_key="your-openai-key"
)

# Index your document
result = indexer.load_and_index_documents(
    file_path="my-document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

print(f"✅ Successfully indexed {result['chunks_indexed']} chunks!")

Understanding Chunk Size and Overlap

Chunk Size

Chunk size = how many characters each piece contains.
# Small chunks (500 characters)
chunk_size=500
# Good for: Precise answers, short documents

# Medium chunks (1000 characters) - RECOMMENDED
chunk_size=1000
# Good for: Most documents, balanced precision/context

# Large chunks (2000 characters)
chunk_size=2000
# Good for: Long-form content, more context per chunk
Recommendations:
  • Short documents (< 10 pages): chunk_size=500
  • Medium documents (10-50 pages): chunk_size=1000
  • Long documents (> 50 pages): chunk_size=1500

Chunk Overlap

Chunk overlap = How much text is shared between adjacent chunks.
Document: "The quick brown fox jumps over the lazy dog."

Chunk 1: "The quick brown fox jumps"
Chunk 2: "fox jumps over the lazy"  ← "fox jumps" overlaps
Chunk 3: "over the lazy dog."
Why overlap?
  • Prevents losing context at chunk boundaries
  • Ensures related information stays together
  • Improves search accuracy
Recommendations:
  • Default: chunk_overlap=200 (20% of the default chunk_size=1000)
  • Small chunks (500): chunk_overlap=100
  • Large chunks (2000): chunk_overlap=400
Rule of thumb: Overlap should be 10-20% of chunk size.

Preventing Duplicates

LangChat automatically prevents duplicate documents from being indexed multiple times!

How It Works

  1. Hash Generation: Each chunk gets a unique hash (see the sketch below) based on:
    • File path
    • Chunk content
  2. Duplicate Check: Before indexing, checks if hash already exists
  3. Skip Duplicates: If found, skips indexing that chunk
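A minimal sketch of the hashing idea (LangChat's actual scheme is internal and may differ in detail):
import hashlib

def chunk_hash(file_path: str, chunk_text: str) -> str:
    # Same file path + same content always yields the same hash,
    # so re-indexed chunks can be detected and skipped.
    return hashlib.sha256(f"{file_path}::{chunk_text}".encode("utf-8")).hexdigest()

print(chunk_hash("document.pdf", "Chapter 1: Introduction..."))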

Example

# First time indexing
result = langchat.load_and_index_documents(
    file_path="document.pdf",
    prevent_duplicates=True  # Default: True
)
# Result: Indexed 50 chunks, Skipped 0 duplicates

# Try to index the same document again
result = langchat.load_and_index_documents(
    file_path="document.pdf",
    prevent_duplicates=True
)
# Result: Indexed 0 chunks, Skipped 50 duplicates ✅
Safe to run multiple times! You can re-index the same document without creating duplicates.

Disabling Duplicate Prevention

If you want to allow duplicates (not recommended):
result = langchat.load_and_index_documents(
    file_path="document.pdf",
    prevent_duplicates=False  # Allow duplicates
)

Indexing Multiple Documents

Batch Processing

Index multiple files at once:
# Using LangChat
result = langchat.load_and_index_multiple_documents(
    file_paths=[
        "document1.pdf",
        "document2.txt",
        "data.csv",
        "report.docx"
    ],
    chunk_size=1000,
    chunk_overlap=200
)

print(f"✅ Total chunks indexed: {result['total_chunks_indexed']}")
print(f"⏭️  Total duplicates skipped: {result['total_chunks_skipped']}")
print(f"📄 Files processed: {result['files_processed']}")
print(f"✅ Files succeeded: {result['files_succeeded']}")
print(f"❌ Files failed: {result['files_failed']}")

Processing Results

The result includes detailed information:
{
    "status": "completed",
    "total_chunks_indexed": 150,
    "total_chunks_skipped": 20,
    "files_processed": 4,
    "files_succeeded": 3,
    "files_failed": 1,
    "results": [
        {
            "file_path": "doc1.pdf",
            "status": "success",
            "chunks_indexed": 50,
            "chunks_skipped": 0
        },
        {
            "file_path": "doc2.txt",
            "status": "success",
            "chunks_indexed": 30,
            "chunks_skipped": 10
        },
        # ... more results
    ],
    "errors": [
        {
            "file_path": "bad-file.pdf",
            "error": "Unsupported file type"
        }
    ]
}
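Here is a short sketch of consuming these results to report per-file outcomes and surface failures (the file names are placeholders; the field names match the example above):
file_paths = ["doc1.pdf", "doc2.txt", "doc3.csv"]  # placeholder file names

result = langchat.load_and_index_multiple_documents(file_paths=file_paths)

# Per-file outcomes
for file_result in result["results"]:
    print(f"✅ {file_result['file_path']}: "
          f"{file_result['chunks_indexed']} indexed, "
          f"{file_result['chunks_skipped']} skipped")

# Surface any failures
for error in result.get("errors", []):
    print(f"❌ {error['file_path']}: {error['error']}")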

Using Namespaces

Namespaces let you organize documents into separate groups within the same Pinecone index.

Why Use Namespaces?

  • Organization: Separate different document types
  • Isolation: Keep different projects separate
  • Flexibility: Search within specific namespaces

Example

# Index documents for "education" domain
result = langchat.load_and_index_documents(
    file_path="universities.pdf",
    namespace="education"
)

# Index documents for "travel" domain
result = langchat.load_and_index_documents(
    file_path="destinations.pdf",
    namespace="travel"
)

# Both stored in same index, but separated by namespace
Tip: Use namespaces to organize documents by topic, project, or department!

Complete Example: Building a Knowledge Base

Here’s a complete example of building a knowledge base from scratch:
from langchat import LangChat, LangChatConfig
import os

# Step 1: Set up configuration
config = LangChatConfig.from_env()
langchat = LangChat(config=config)

# Step 2: Index your documents
documents = [
    "company-handbook.pdf",
    "product-catalog.pdf",
    "faq-document.txt",
    "pricing-guide.csv"
]

print("📚 Starting document indexing...")

for doc in documents:
    if os.path.exists(doc):
        print(f"\n📄 Processing: {doc}")
        result = langchat.load_and_index_documents(
            file_path=doc,
            chunk_size=1000,
            chunk_overlap=200,
            namespace="company-knowledge"
        )
        print(f"   ✅ Indexed: {result['chunks_indexed']} chunks")
        print(f"   ⏭️  Skipped: {result.get('chunks_skipped', 0)} duplicates")
    else:
        print(f"   ❌ File not found: {doc}")

print("\n🎉 Knowledge base ready!")

# Step 3: Test your chatbot (chat() is async, so run it via asyncio)
import asyncio

response = asyncio.run(langchat.chat(
    query="What are our company policies?",
    user_id="user123",
    domain="support"
))

Best Practices

1. Choose the Right Chunk Size

# ✅ Good: Balanced chunk size
chunk_size=1000, chunk_overlap=200

# ❌ Too small: Loses context
chunk_size=200, chunk_overlap=50

# ❌ Too large: Less precise search
chunk_size=5000, chunk_overlap=1000

2. Always Use Duplicate Prevention

# ✅ Good: Prevents duplicates
prevent_duplicates=True

# ❌ Bad: Can create duplicates
prevent_duplicates=False

3. Organize with Namespaces

# ✅ Good: Organized by topic
namespace="product-docs"
namespace="support-articles"
namespace="company-policies"

# ❌ Bad: Everything in default namespace
# (No namespace specified)

4. Process Documents in Batches

# ✅ Good: Batch processing
langchat.load_and_index_multiple_documents(
    file_paths=["doc1.pdf", "doc2.pdf", "doc3.pdf"]
)

# ❌ Less efficient: One at a time
for doc in docs:
    langchat.load_and_index_documents(file_path=doc)

5. Check Results

# ✅ Good: Check and handle results
result = langchat.load_and_index_documents(...)
if result['chunks_indexed'] == 0:
    print("⚠️  No chunks indexed - check for errors")

# ❌ Bad: Ignore results
langchat.load_and_index_documents(...)  # No error handling

Troubleshooting

Issue: “No chunks indexed”

Possible causes:
  • All chunks were duplicates (check chunks_skipped)
  • Document is empty
  • File path is incorrect
Solution:
result = langchat.load_and_index_documents(...)
print(f"Chunks indexed: {result['chunks_indexed']}")
print(f"Chunks skipped: {result.get('chunks_skipped', 0)}")
if result['chunks_indexed'] == 0 and result.get('chunks_skipped', 0) > 0:
    print("All chunks were duplicates - document already indexed!")

Issue: “Unsupported file type”

Solution:
  • Check file extension is supported (PDF, TXT, CSV, etc.)
  • Try converting to a supported format
  • Check docsuite documentation for supported formats

Issue: “Error indexing documents to Pinecone”

Possible causes:
  • Pinecone API key is invalid
  • Index name doesn’t exist
  • Network connection issues
  • Rate limiting
Solution:
# Verify Pinecone connection
from langchat.utils.document_indexer import DocumentIndexer

try:
    indexer = DocumentIndexer(
        pinecone_api_key="your-key",
        pinecone_index_name="your-index",
        openai_api_key="your-key"
    )
    print("✅ Pinecone connection successful")
except Exception as e:
    print(f"❌ Error: {e}")

Issue: “Too many chunks created”

Solution:
  • Increase chunk_size to create fewer, larger chunks
  • Check document size - very large documents create many chunks
# For large documents, use larger chunks
result = langchat.load_and_index_documents(
    file_path="very-large-document.pdf",
    chunk_size=2000,  # Larger chunks
    chunk_overlap=400
)

Advanced Topics

Custom Embedding Models

By default, LangChat uses text-embedding-3-large. You can customize the embedding model:
# Using DocumentIndexer
indexer = DocumentIndexer(
    pinecone_api_key="...",
    pinecone_index_name="...",
    openai_api_key="...",
    embedding_model="text-embedding-3-small"  # Faster, cheaper
)

Monitoring Indexing Progress

import time

def index_with_progress(file_path):
    start_time = time.time()
    
    result = langchat.load_and_index_documents(file_path)
    
    elapsed = time.time() - start_time
    chunks_per_second = result['chunks_indexed'] / elapsed
    
    print(f"⏱️  Time: {elapsed:.2f}s")
    print(f"⚡ Speed: {chunks_per_second:.1f} chunks/second")
    
    return result

Error Recovery

import time

def safe_index(file_path, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = langchat.load_and_index_documents(file_path)
            return result
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"⚠️  Attempt {attempt + 1} failed, retrying...")
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise e

Next Steps

Now that you’ve indexed your documents:
  1. Test Your Chatbot - Ask questions about your documents
  2. Customize Prompts - Make your chatbot respond better
  3. Vector Search Guide - Understand how search works
  4. API Reference - Full API documentation

Summary

  • Document indexing loads, splits, and stores documents in Pinecone
  • Chunk size controls how documents are split (1000 recommended)
  • Chunk overlap keeps context together (200 recommended)
  • Duplicate prevention stops re-indexing the same content
  • Namespaces organize documents by topic or project
  • Batch processing handles multiple files efficiently
You’re now ready to build a knowledge base for your AI chatbot! 🎉

Questions? Check the API Reference or Troubleshooting Guide!