Class: DocumentIndexer

Standalone document loader and indexer for Pinecone. It requires only a Pinecone API key and an OpenAI API key (for embeddings), making it well suited to indexing documents without a full LangChat setup.

Constructor

DocumentIndexer(
    pinecone_api_key: str,
    pinecone_index_name: str,
    openai_api_key: str,
    embedding_model: str = "text-embedding-3-large"
)
Creates a new DocumentIndexer instance.
Parameters:
pinecone_api_key
str
required
Your Pinecone API key. Get it from pinecone.io.
pinecone_index_name
str
required
Name of your Pinecone index. Must be pre-created in your Pinecone dashboard.
openai_api_key
str
required
Your OpenAI API key for generating embeddings. Get it from platform.openai.com.
embedding_model
str
default:"text-embedding-3-large"
OpenAI embedding model to use. The model's output dimension must match the dimension of your Pinecone index. Options:
  • "text-embedding-3-large" (recommended, 3072 dimensions)
  • "text-embedding-3-small" (faster, 1536 dimensions)
  • "text-embedding-ada-002" (legacy, 1536 dimensions)
Example:
from langchat.utils.document_indexer import DocumentIndexer

# Initialize DocumentIndexer
indexer = DocumentIndexer(
    pinecone_api_key="pcsk-...",
    pinecone_index_name="my-index",
    openai_api_key="sk-...",
    embedding_model="text-embedding-3-large"
)
DocumentIndexer automatically connects to Pinecone and verifies the index is accessible on initialization.
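
If the index doesn't exist or a key is invalid, that check fails at construction time. Below is a minimal sketch of guarding initialization; the exact exception type DocumentIndexer raises is not documented here, so it catches broadly:
from langchat.utils.document_indexer import DocumentIndexer

try:
    indexer = DocumentIndexer(
        pinecone_api_key="pcsk-...",
        pinecone_index_name="my-index",
        openai_api_key="sk-..."
    )
except Exception as e:
    # A missing index or bad credentials surface here, before any indexing
    print(f"❌ Could not initialize DocumentIndexer: {e}")
    raise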

Methods

load_and_index_documents()

Load documents from a file, split them into chunks, and index them to Pinecone.
def load_and_index_documents(
    self,
    file_path: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: Optional[str] = None,
    prevent_duplicates: bool = True
) -> Dict
Parameters:
file_path
str
required
Path to the document file. Supports:
  • PDF files (.pdf)
  • Text files (.txt)
  • CSV files (.csv)
  • And more (see docsuite documentation)
chunk_size
int
default:"1000"
Size of each text chunk in characters. Recommended values:
  • Small documents: 500
  • Medium documents: 1000 (default)
  • Large documents: 1500-2000
chunk_overlap
int
default:"200"
Overlap between chunks in characters. Should be 10-20% of chunk_size. Default 200 (20% of 1000).
namespace
str | None
default:"None"
Optional Pinecone namespace to store documents in. Use namespaces to organize documents by topic, project, or department.
prevent_duplicates
bool
default:"True"
If True, checks for existing documents before indexing to prevent duplicates. Uses a SHA256 hash of file path + content (an illustrative sketch follows this method's Raises list below).
Returns: Dict with the following keys:
  • status (str): "success" on success, otherwise an error status
  • chunks_indexed (int): Number of chunks successfully indexed
  • chunks_skipped (int): Number of duplicate chunks skipped (if prevent_duplicates=True)
  • documents_loaded (int): Number of documents loaded from file
  • file_path (str): Path to the indexed file
  • namespace (str | None): Namespace used (if any)
  • message (str, optional): Additional message (e.g., "No documents to index")
Example:
from langchat.utils.document_indexer import DocumentIndexer

indexer = DocumentIndexer(
    pinecone_api_key="pcsk-...",
    pinecone_index_name="my-index",
    openai_api_key="sk-..."
)

# Index a single document
result = indexer.load_and_index_documents(
    file_path="document.pdf",
    chunk_size=1000,
    chunk_overlap=200,
    namespace="company-docs",
    prevent_duplicates=True
)

print(f"Indexed: {result['chunks_indexed']} chunks")
print(f"Skipped: {result['chunks_skipped']} duplicates")
Raises:
  • UnsupportedFileTypeError: If the file type is not supported by docsuite
  • RuntimeError: If indexing to Pinecone fails
  • ValueError: If required parameters are missing
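
For illustration, the duplicate check described under prevent_duplicates can be pictured as a content fingerprint. The sketch below mirrors the idea (SHA256 over file path + content) but is not DocumentIndexer's exact implementation:
import hashlib

def document_fingerprint(file_path: str) -> str:
    # Illustrative only: hash the path together with the raw file bytes,
    # so re-indexing the same file from the same location is detectable
    with open(file_path, "rb") as f:
        content = f.read()
    return hashlib.sha256(file_path.encode("utf-8") + content).hexdigest()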

load_and_index_multiple_documents()

Load multiple documents, split them, and index them to Pinecone in batch.
def load_and_index_multiple_documents(
    self,
    file_paths: List[str],
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    namespace: Optional[str] = None,
    prevent_duplicates: bool = True
) -> Dict
Parameters:
file_paths
List[str]
required
List of file paths to load and index. All files will be processed with the same chunk settings.
chunk_size
int
default:"1000"
Size of each text chunk (same as load_and_index_documents)
chunk_overlap
int
default:"200"
Overlap between chunks (same as load_and_index_documents)
namespace
str | None
default:"None"
Optional namespace for all documents (same as load_and_index_documents)
prevent_duplicates
bool
default:"True"
Prevent duplicate indexing (same as load_and_index_documents)
Returns: Dict with the following keys:
  • status (str): "completed"
  • total_chunks_indexed (int): Total chunks indexed across all files
  • total_chunks_skipped (int): Total duplicate chunks skipped
  • files_processed (int): Total number of files processed
  • files_succeeded (int): Number of files successfully indexed
  • files_failed (int): Number of files that failed
  • results (List[Dict]): Detailed results for each file
  • errors (List[Dict] | None): List of errors (if any)
Example:
# Index multiple documents
result = indexer.load_and_index_multiple_documents(
    file_paths=[
        "doc1.pdf",
        "doc2.txt",
        "data.csv"
    ],
    chunk_size=1000,
    chunk_overlap=200,
    namespace="project-docs"
)

print(f"Total chunks: {result['total_chunks_indexed']}")
print(f"Files succeeded: {result['files_succeeded']}")
print(f"Files failed: {result['files_failed']}")

# Check individual file results
for file_result in result['results']:
    print(f"{file_result['file_path']}: {file_result['status']}")

Properties

index

Access the Pinecone index directly (advanced usage).
index: pinecone.Index
Example:
# Access index stats
stats = indexer.index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")

embeddings

Access the OpenAI embeddings model (advanced usage).
embeddings: OpenAIEmbeddings
Example:
# Generate custom embedding
embedding = indexer.embeddings.embed_query("custom text")
print(f"Embedding dimension: {len(embedding)}")

vector_store

Access the LangChain PineconeVectorStore (advanced usage).
vector_store: PineconeVectorStore
Example:
# Use vector store directly
retriever = indexer.vector_store.as_retriever(search_kwargs={"k": 5})

Usage Examples

Basic Single Document Indexing

from langchat.utils.document_indexer import DocumentIndexer

# Initialize
indexer = DocumentIndexer(
    pinecone_api_key="pcsk-...",
    pinecone_index_name="my-index",
    openai_api_key="sk-..."
)

# Index document
result = indexer.load_and_index_documents("document.pdf")
print(f"✅ Indexed {result['chunks_indexed']} chunks")

Batch Document Indexing

# Index multiple documents
files = ["doc1.pdf", "doc2.txt", "doc3.csv"]

result = indexer.load_and_index_multiple_documents(
    file_paths=files,
    chunk_size=1000,
    chunk_overlap=200
)

print(f"✅ Total: {result['total_chunks_indexed']} chunks")
print(f"📄 Files: {result['files_succeeded']}/{result['files_processed']}")

Using Namespaces

# Organize documents by topic
indexer.load_and_index_documents(
    file_path="product-docs.pdf",
    namespace="products"
)

indexer.load_and_index_documents(
    file_path="support-articles.pdf",
    namespace="support"
)
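
Documents stored in a namespace are only returned by queries that target that same namespace. A minimal retrieval sketch via the vector_store property, assuming the langchain-pinecone similarity_search signature, which accepts a namespace keyword:
# Search only within the "products" namespace
docs = indexer.vector_store.similarity_search(
    "pricing tiers",
    k=5,
    namespace="products"
)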

Error Handling

from docsuite.exceptions import UnsupportedFileTypeError

try:
    result = indexer.load_and_index_documents("document.pdf")
    
    if result['chunks_indexed'] == 0:
        print("⚠️  No chunks indexed")
        if result.get('chunks_skipped', 0) > 0:
            print("All chunks were duplicates")
    
except UnsupportedFileTypeError as e:
    print(f"❌ Unsupported file type: {e}")
except RuntimeError as e:
    print(f"❌ Indexing failed: {e}")
except Exception as e:
    print(f"❌ Unexpected error: {e}")

Custom Chunk Settings

# For short documents
result = indexer.load_and_index_documents(
    file_path="short-doc.txt",
    chunk_size=500,
    chunk_overlap=100
)

# For long documents
result = indexer.load_and_index_documents(
    file_path="long-doc.pdf",
    chunk_size=2000,
    chunk_overlap=400
)

Disabling Duplicate Prevention

# Allow duplicates (not recommended)
result = indexer.load_and_index_documents(
    file_path="document.pdf",
    prevent_duplicates=False
)

Integration with LangChat

DocumentIndexer is automatically used by LangChat’s load_and_index_documents() method:
from langchat import LangChat, LangChatConfig

config = LangChatConfig.from_env()
langchat = LangChat(config=config)

# Uses DocumentIndexer internally
result = langchat.load_and_index_documents("document.pdf")

Advanced Usage

Custom Embedding Models

# Use smaller, faster model
indexer = DocumentIndexer(
    pinecone_api_key="...",
    pinecone_index_name="...",
    openai_api_key="...",
    embedding_model="text-embedding-3-small"  # Faster
)

Direct Index Access

# Query index directly
results = indexer.index.query(
    vector=[0.0] * 3072,  # Dummy vector (3072 dimensions matches text-embedding-3-large)
    top_k=5,
    filter={"source_file": {"$eq": "document.pdf"}},
    namespace="my-namespace"
)
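
The dummy vector above only exercises the metadata filter. For a real semantic lookup, embed the query text first using the embeddings property:
# Query with a real embedding instead of a dummy vector
query_vector = indexer.embeddings.embed_query("refund policy")
results = indexer.index.query(
    vector=query_vector,
    top_k=5,
    namespace="my-namespace"
)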

Custom Vector Store Operations

# Use LangChain retriever
retriever = indexer.vector_store.as_retriever(
    # namespace is passed through to the Pinecone search, not as a metadata filter
    search_kwargs={"k": 10, "namespace": "my-namespace"}
)

# Retrieve documents (invoke supersedes the deprecated get_relevant_documents)
docs = retriever.invoke("query text")

Questions? Check the Document Indexing Guide for detailed examples and best practices!