Class: DocumentIndexer
Standalone document loader and indexer for Pinecone. It requires only a Pinecone API key and an OpenAI API key (for embeddings), making it ideal for indexing documents without a full LangChat setup.

Constructor
- Your Pinecone API key. Get it from pinecone.io.
- Name of your Pinecone index. Must be pre-created in your Pinecone dashboard.
- Your OpenAI API key for generating embeddings. Get it from platform.openai.com.
- OpenAI embedding model to use. Options:
  - "text-embedding-3-large" (recommended, 3072 dimensions)
  - "text-embedding-3-small" (faster, 1536 dimensions)
  - "text-embedding-ada-002" (legacy, 1536 dimensions)
DocumentIndexer automatically connects to Pinecone and verifies the index is accessible on initialization.
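A minimal construction sketch. The `langchat` import path and the keyword names (`pinecone_api_key`, `index_name`, `openai_api_key`, `embedding_model`) are assumptions based on the parameter descriptions above, not confirmed API:

```python
from langchat import DocumentIndexer  # hypothetical import path

# Keyword names below are illustrative assumptions.
indexer = DocumentIndexer(
    pinecone_api_key="pc-...",                 # from pinecone.io
    index_name="my-docs",                      # must already exist in Pinecone
    openai_api_key="sk-...",                   # from platform.openai.com
    embedding_model="text-embedding-3-large",  # 3072 dimensions
)
# On initialization, the indexer connects to Pinecone and verifies
# that the "my-docs" index is accessible.
```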
Methods
load_and_index_documents()
Load documents from a file, split them into chunks, and index them to Pinecone.
Path to the document file. Supported formats:
- PDF files (.pdf)
- Text files (.txt)
- CSV files (.csv)
- And more (see the docsuite documentation)
Size of each text chunk in characters. Recommended values:
- Small documents: 500
- Medium documents: 1000 (default)
- Large documents: 1500-2000
Overlap between chunks in characters. Should be 10-20% of chunk_size. Default: 200 (20% of 1000).

Optional Pinecone namespace to store documents in. Use namespaces to organize documents by topic, project, or department.

If True, checks for existing documents before indexing to prevent duplicates. Uses a SHA256 hash of file path + content.

Returns a dict with the following keys:
- status (str): "success" or an error status
- chunks_indexed (int): Number of chunks successfully indexed
- chunks_skipped (int): Number of duplicate chunks skipped (if prevent_duplicates=True)
- documents_loaded (int): Number of documents loaded from the file
- file_path (str): Path to the indexed file
- namespace (str | None): Namespace used (if any)
- message (str, optional): Additional message (e.g., "No documents to index")
Raises:
- UnsupportedFileTypeError: If the file type is not supported by docsuite
- RuntimeError: If indexing to Pinecone fails
- ValueError: If required parameters are missing
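The duplicate check described above (a SHA256 hash of file path plus content) can be sketched in plain Python. This is a conceptual illustration of the idea, not the library's actual implementation:

```python
import hashlib

def dedup_key(file_path: str, content: bytes) -> str:
    """Build a stable fingerprint from a file's path and its content."""
    h = hashlib.sha256()
    h.update(file_path.encode("utf-8"))
    h.update(content)
    return h.hexdigest()

# The same path + content always produces the same key, so re-indexing
# an unchanged file can be detected and skipped.
key1 = dedup_key("docs/manual.pdf", b"chapter one")
key2 = dedup_key("docs/manual.pdf", b"chapter one")
key3 = dedup_key("docs/manual.pdf", b"chapter two")
assert key1 == key2
assert key1 != key3
```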
load_and_index_multiple_documents()
Load multiple documents, split them, and index them to Pinecone in batch.
List of file paths to load and index. All files are processed with the same chunk settings.

Size of each text chunk (same as load_and_index_documents).

Overlap between chunks (same as load_and_index_documents).

Optional namespace for all documents (same as load_and_index_documents).

Prevent duplicate indexing (same as load_and_index_documents).

Returns a dict with the following keys:
- status (str): "completed"
- total_chunks_indexed (int): Total chunks indexed across all files
- total_chunks_skipped (int): Total duplicate chunks skipped
- files_processed (int): Total number of files processed
- files_succeeded (int): Number of files successfully indexed
- files_failed (int): Number of files that failed
- results (List[Dict]): Detailed results for each file
- errors (List[Dict] | None): List of errors (if any)
Properties
index
Access the Pinecone index directly (advanced usage).
embeddings
Access the OpenAI embeddings model (advanced usage).
vector_store
Access the LangChain PineconeVectorStore (advanced usage).
Usage Examples
Basic Single Document Indexing
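A minimal end-to-end sketch, assuming the `langchat` import path and the constructor keyword names shown here (they are not confirmed by the reference above):

```python
from langchat import DocumentIndexer  # hypothetical import path

indexer = DocumentIndexer(
    pinecone_api_key="pc-...",   # keyword names are assumptions
    index_name="my-docs",
    openai_api_key="sk-...",
)

# Index a single PDF with the default chunk settings (1000/200).
result = indexer.load_and_index_documents("docs/handbook.pdf")
print(result["status"])           # "success" on a normal run
print(result["chunks_indexed"])   # number of chunks written to Pinecone
```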
Batch Document Indexing
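A batch sketch using the same assumed constructor; all files share one set of chunk settings:

```python
from langchat import DocumentIndexer  # hypothetical import path

indexer = DocumentIndexer(
    pinecone_api_key="pc-...",   # keyword names are assumptions
    index_name="my-docs",
    openai_api_key="sk-...",
)

result = indexer.load_and_index_multiple_documents(
    ["docs/handbook.pdf", "docs/faq.txt", "data/products.csv"],
)
print(result["status"])  # "completed"
print(f'{result["files_succeeded"]} of {result["files_processed"]} files indexed')
for file_result in result["results"]:
    print(file_result)   # per-file detail dicts
```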
Using Namespaces
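A namespace sketch (constructor keyword names assumed, as above). Namespaces keep unrelated document sets separate within a single Pinecone index:

```python
from langchat import DocumentIndexer  # hypothetical import path

indexer = DocumentIndexer(
    pinecone_api_key="pc-...",   # keyword names are assumptions
    index_name="company-docs",
    openai_api_key="sk-...",
)

# Keep HR and engineering material separate within one index.
indexer.load_and_index_documents("docs/hr_policy.pdf", namespace="hr")
indexer.load_and_index_documents("docs/api_guide.pdf", namespace="engineering")
```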
Error Handling
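A sketch catching the exceptions listed for load_and_index_documents(). Where UnsupportedFileTypeError is importable from is an assumption; it may be exported by langchat or by docsuite:

```python
from langchat import DocumentIndexer, UnsupportedFileTypeError  # import paths assumed

indexer = DocumentIndexer(
    pinecone_api_key="pc-...",   # keyword names are assumptions
    index_name="my-docs",
    openai_api_key="sk-...",
)

try:
    result = indexer.load_and_index_documents("media/diagram.bmp")
except UnsupportedFileTypeError:
    print("File type not supported by docsuite")
except RuntimeError as exc:
    print(f"Indexing to Pinecone failed: {exc}")
except ValueError as exc:
    print(f"Missing required parameter: {exc}")
```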
Custom Chunk Settings
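A sketch following the recommendations above: larger chunks for large documents, with overlap held at about 20% of chunk_size (constructor keyword names assumed):

```python
from langchat import DocumentIndexer  # hypothetical import path

indexer = DocumentIndexer(
    pinecone_api_key="pc-...",   # keyword names are assumptions
    index_name="my-docs",
    openai_api_key="sk-...",
)

# Larger chunks for a long report; overlap kept at 20% of chunk_size.
result = indexer.load_and_index_documents(
    "docs/annual_report.pdf",
    chunk_size=2000,
    chunk_overlap=400,   # 20% of 2000
)
```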
Disabling Duplicate Prevention
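A sketch with the duplicate check turned off (constructor keyword names assumed). This skips the SHA256 lookup, so re-running it on an already-indexed file may create duplicate vectors:

```python
from langchat import DocumentIndexer  # hypothetical import path

indexer = DocumentIndexer(
    pinecone_api_key="pc-...",   # keyword names are assumptions
    index_name="my-docs",
    openai_api_key="sk-...",
)

# Force re-indexing even if this file was indexed before.
result = indexer.load_and_index_documents(
    "docs/handbook.pdf",
    prevent_duplicates=False,
)
```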
Integration with LangChat
DocumentIndexer is used automatically by LangChat's load_and_index_documents() method:
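A sketch of that delegation; the LangChat import path and constructor arguments are assumptions (see the LangChat API documentation for the real signature):

```python
from langchat import LangChat  # hypothetical import path

# Constructor arguments elided; configure with the same
# Pinecone and OpenAI credentials as a standalone DocumentIndexer.
chat = LangChat(...)

# Internally this delegates to a DocumentIndexer instance.
chat.load_and_index_documents("docs/handbook.pdf")
```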
Advanced Usage
Custom Embedding Models
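A sketch selecting a non-default embedding model (constructor keyword names assumed). Note that the Pinecone index's dimension must match the model's output dimension:

```python
from langchat import DocumentIndexer  # hypothetical import path

indexer = DocumentIndexer(
    pinecone_api_key="pc-...",                 # keyword names are assumptions
    index_name="my-docs-small",                # index must be 1536-dimensional
    openai_api_key="sk-...",
    embedding_model="text-embedding-3-small",  # faster, 1536 dimensions
)
```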
Direct Index Access
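A sketch assuming the `index` property returns the Pinecone client's index object, on which standard Pinecone calls such as describe_index_stats() are available:

```python
# Inspect vector counts and namespaces via the raw Pinecone index.
stats = indexer.index.describe_index_stats()  # standard Pinecone client call
print(stats)
```

Here `indexer` is a DocumentIndexer constructed as in the examples above.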
Custom Vector Store Operations
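A sketch assuming the `vector_store` property returns a LangChain PineconeVectorStore, which supports the standard similarity_search() method:

```python
# Run a raw similarity search against the indexed chunks.
docs = indexer.vector_store.similarity_search("refund policy", k=3)
for doc in docs:
    print(doc.page_content[:100])  # preview each matching chunk
```

Here `indexer` is a DocumentIndexer constructed as in the examples above.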
Related Documentation
- Document Indexing Guide - Complete guide to document indexing
- Vector Search Guide - How vector search works
- LangChat API - Main LangChat API with document indexing
- Pinecone Adapter - Pinecone integration details
Questions? Check the Document Indexing Guide for detailed examples and best practices!