Overview

DocumentIndexer is the class that powers LangChat.index(). Use it directly only when you need to index documents outside of a LangChat instance, for example in a standalone indexing script that doesn’t start the full chatbot. In most other cases, use lc.index() instead.
from langchat.core.utils.document_indexer import DocumentIndexer

Constructor

DocumentIndexer(
    pinecone_api_key: str,
    pinecone_index_name: str,
    openai_api_key: str,
    embedding_model: str = "text-embedding-3-large",
)
pinecone_api_key (str, required): Pinecone API key.
pinecone_index_name (str, required): Pinecone index name.
openai_api_key (str, required): OpenAI API key for creating embeddings.
embedding_model (str, default: "text-embedding-3-large"): OpenAI embedding model.
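
Because all three required arguments are typically secrets or deployment settings, it can help to validate them before constructing the indexer. The following is a minimal sketch, not part of LangChat: indexer_kwargs_from_env is a hypothetical helper, and PINECONE_INDEX_NAME is an assumed variable name (only PINECONE_API_KEY and OPENAI_API_KEY appear in the script on this page).

```python
import os
from typing import Mapping

# Maps constructor parameters to environment variable names.
# PINECONE_INDEX_NAME is an assumed name chosen for this sketch.
REQUIRED_VARS = {
    "pinecone_api_key": "PINECONE_API_KEY",
    "pinecone_index_name": "PINECONE_INDEX_NAME",
    "openai_api_key": "OPENAI_API_KEY",
}


def indexer_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the required DocumentIndexer arguments from environment variables.

    Hypothetical helper. Raises a single error naming every missing or empty
    variable instead of failing on the first KeyError.
    """
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {param: env[var] for param, var in REQUIRED_VARS.items()}
```

The returned dict can be unpacked into the constructor, for example DocumentIndexer(**indexer_kwargs_from_env()).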

Methods

load_and_index_documents()

Index a single file.
def load_and_index_documents(
    self,
    file_path: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: str | None = None,
    prevent_duplicates: bool = True,
) -> dict
file_path (str, required): Path to the document file.
chunk_size (int, default: 1000): Characters per chunk.
chunk_overlap (int, default: 200): Overlap between adjacent chunks.
namespace (str | None, default: None): Pinecone namespace.
prevent_duplicates (bool, default: True): Skip chunks already in Pinecone (checked by content hash).
Returns: dict with chunks_indexed, chunks_skipped, and metadata.
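
To make the interaction between chunk_size, chunk_overlap, and prevent_duplicates concrete, here is a rough illustration in plain Python. It is a sketch under stated assumptions, not the library's implementation: the splitter is a simple fixed-width sliding window, SHA-256 is an assumed hash function, and an in-memory set stands in for the lookup against Pinecone. The names split_text and index_chunks are hypothetical.

```python
import hashlib


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-width sliding window: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def index_chunks(chunks: list[str], seen_hashes: set[str]) -> dict[str, int]:
    """Count chunks as indexed or skipped based on a content hash.

    seen_hashes is a stand-in for hashes already stored in the index
    (assumption: the real check runs against Pinecone).
    """
    indexed = skipped = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            skipped += 1
        else:
            seen_hashes.add(digest)
            indexed += 1
    return {"chunks_indexed": indexed, "chunks_skipped": skipped}
```

Under this model, indexing the same file twice with prevent_duplicates enabled reports every chunk as skipped on the second run, which is why re-running an indexing script should be cheap.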

load_and_index_multiple_documents()

Index multiple files.
def load_and_index_multiple_documents(
    self,
    file_paths: list[str],
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: str | None = None,
    prevent_duplicates: bool = True,
) -> dict
Takes the same parameters as load_and_index_documents(), except that file_path is replaced by file_paths, a list of paths.
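
When the documents live in a directory tree, the file_paths list can be built programmatically rather than written by hand. The sketch below uses only the standard library; collect_file_paths is a hypothetical helper, and the extension set simply mirrors the example files on this page, since the formats DocumentIndexer supports are not listed here.

```python
from pathlib import Path

# Extensions taken from the example files on this page; adjust to the
# formats your installation actually supports (assumption, not a spec).
EXAMPLE_EXTENSIONS = {".pdf", ".md", ".txt"}


def collect_file_paths(root: str, extensions: set[str] = EXAMPLE_EXTENSIONS) -> list[str]:
    """Return sorted paths of files under root whose suffix is in extensions."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
```

The result can be passed directly as file_paths=collect_file_paths("docs").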

Standalone indexing script

Use DocumentIndexer directly when you want to index documents independently of the chatbot:
# standalone_indexer.py
import os
from langchat.core.utils.document_indexer import DocumentIndexer
from dotenv import load_dotenv

load_dotenv()

indexer = DocumentIndexer(
    pinecone_api_key=os.environ["PINECONE_API_KEY"],
    pinecone_index_name="my-index",
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# Index multiple documents
result = indexer.load_and_index_multiple_documents(
    file_paths=[
        "docs/manual.pdf",
        "docs/faq.md",
        "docs/policies.txt",
    ],
    chunk_size=1000,
    chunk_overlap=200,
    namespace="main",
    prevent_duplicates=True,
)

print(f"Indexed: {result['chunks_indexed']}")
print(f"Skipped: {result['chunks_skipped']}")
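
The script above hard-codes its file list and chunking settings. One possible way to make it reusable is to read them from the command line. This is a sketch only: parse_args and the flag names are illustrative choices, picked so that the parsed options line up with the keyword arguments of load_and_index_multiple_documents().

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse command-line options for a standalone indexing script."""
    parser = argparse.ArgumentParser(description="Index documents into Pinecone.")
    parser.add_argument("file_paths", nargs="+", help="Documents to index.")
    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--chunk-overlap", type=int, default=200)
    parser.add_argument("--namespace", default=None)
    # Duplicate prevention stays on unless explicitly disabled.
    parser.add_argument(
        "--allow-duplicates",
        dest="prevent_duplicates",
        action="store_false",
        help="Index chunks even if identical content is already stored.",
    )
    return parser.parse_args(argv)
```

Since the attribute names match the method's parameters, the call could then be written as indexer.load_and_index_multiple_documents(**vars(parse_args())).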
LangChat.index() is a convenience wrapper around DocumentIndexer that reads credentials from environment variables automatically. Prefer it when you already have a LangChat instance.