Overview

DocumentIndexer is the class that powers LangChat.index(). Use it directly only when you need to index documents outside a LangChat instance, for example in a standalone indexing script that doesn’t start the full chatbot. In most cases, call index() on your LangChat instance (lc.index()) instead. To use the class directly, import it:
from langchat.core.utils.document_indexer import DocumentIndexer

Constructor

DocumentIndexer(
    pinecone_api_key: str,
    pinecone_index_name: str,
    openai_api_key: str,
    embedding_model: str = "text-embedding-3-large",
)
pinecone_api_key (str, required): Pinecone API key.
pinecone_index_name (str, required): Pinecone index name.
openai_api_key (str, required): OpenAI API key for creating embeddings.
embedding_model (str, default: "text-embedding-3-large"): OpenAI embedding model.
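The constructor takes both API keys explicitly, so a standalone script benefits from failing early when one is missing. The helper below is a minimal sketch of that idea. `build_indexer_kwargs` and `REQUIRED_VARS` are hypothetical names that are not part of LangChat, and the environment variable names are taken from the standalone script on this page.

```python
import os
from typing import Mapping

# Variable names as used in the standalone script on this page.
REQUIRED_VARS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def build_indexer_kwargs(
    index_name: str,
    env: Mapping[str, str] = os.environ,
    embedding_model: str = "text-embedding-3-large",
) -> dict:
    """Collect DocumentIndexer constructor arguments (hypothetical helper).

    Raises RuntimeError that names every missing or empty variable.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "pinecone_api_key": env["PINECONE_API_KEY"],
        "pinecone_index_name": index_name,
        "openai_api_key": env["OPENAI_API_KEY"],
        "embedding_model": embedding_model,
    }
```

The result can then be unpacked into the constructor, for example `DocumentIndexer(**build_indexer_kwargs("my-index"))`.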

Methods

load_and_index_documents()

Index a single file.
def load_and_index_documents(
    self,
    file_path: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: str | None = None,
    prevent_duplicates: bool = True,
) -> dict
file_path (str, required): Path to the document file.
chunk_size (int, default: 1000): Characters per chunk.
chunk_overlap (int, default: 200): Overlap between adjacent chunks.
namespace (str | None, default: None): Pinecone namespace.
prevent_duplicates (bool, default: True): Skip chunks already in Pinecone (checked by content hash).
Returns: dict with chunks_indexed, chunks_skipped, and metadata.
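To make the interaction of chunk_size, chunk_overlap, and prevent_duplicates concrete, here is a small self-contained sketch. It is not LangChat's implementation: the naive character splitter, the SHA-256 hash, the in-memory `seen_hashes` set standing in for Pinecone, and the names `split_text` and `index_chunks` are all assumptions made for illustration. The actual splitter and hashing scheme may differ.

```python
import hashlib


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Naive character splitter: adjacent chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def index_chunks(chunks: list[str], seen_hashes: set[str], prevent_duplicates: bool = True) -> dict:
    """Count which chunks would be indexed or skipped, keyed by content hash.

    seen_hashes is an in-memory stand-in for hashes already stored in the index.
    """
    indexed = skipped = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if prevent_duplicates and digest in seen_hashes:
            skipped += 1
            continue
        seen_hashes.add(digest)
        indexed += 1
    return {"chunks_indexed": indexed, "chunks_skipped": skipped}
```

In this sketch, running the same chunks through `index_chunks` twice with a shared `seen_hashes` set indexes them the first time and skips all of them the second time, which mirrors re-running an indexing script over unchanged files.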

load_and_index_multiple_documents()

Index multiple files.
def load_and_index_multiple_documents(
    self,
    file_paths: list[str],
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: str | None = None,
    prevent_duplicates: bool = True,
) -> dict
Takes the same parameters as load_and_index_documents(), except that file_paths (a list of paths) replaces the single file_path.
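Since the method expects an explicit list of paths, a script that indexes a whole folder has to build that list first. The following is a minimal sketch using only the standard library. `collect_document_paths` is a hypothetical helper, and the suffix set is an assumption based on the file types in the example on this page, not a statement of which formats LangChat supports.

```python
from pathlib import Path

# Assumed suffixes, mirroring the example files on this page (.pdf, .md, .txt).
SUPPORTED_SUFFIXES = {".pdf", ".md", ".txt"}


def collect_document_paths(root: str, suffixes: set[str] = SUPPORTED_SUFFIXES) -> list[str]:
    """Recursively gather files under root whose suffix is in suffixes, sorted by path."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

The returned list can be passed as file_paths, for example `indexer.load_and_index_multiple_documents(file_paths=collect_document_paths("docs"))`.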

Standalone indexing script

Use DocumentIndexer directly when you want to index documents independently of the chatbot:
# standalone_indexer.py
import os
from langchat.core.utils.document_indexer import DocumentIndexer
from dotenv import load_dotenv

load_dotenv()

indexer = DocumentIndexer(
    pinecone_api_key=os.environ["PINECONE_API_KEY"],
    pinecone_index_name="my-index",
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# Index multiple documents
result = indexer.load_and_index_multiple_documents(
    file_paths=[
        "docs/manual.pdf",
        "docs/faq.md",
        "docs/policies.txt",
    ],
    chunk_size=1000,
    chunk_overlap=200,
    namespace="main",
    prevent_duplicates=True,
)

print(f"Indexed: {result['chunks_indexed']}")
print(f"Skipped: {result['chunks_skipped']}")
LangChat.index() is a convenience wrapper around DocumentIndexer that reads credentials from environment variables automatically. Prefer it when you already have a LangChat instance.