
Quick Optimizations

Use Faster Models

from langchat.llm import OpenAI
from langchat.vector_db import Pinecone

# Faster LLM
llm = OpenAI(api_key="sk-...", model="gpt-4o-mini")  # Fastest

# Faster embeddings
vector_db = Pinecone(
    api_key="...",
    index_name="...",
    embedding_model="text-embedding-3-small"  # Faster than large
)

Reduce Retrieval Count

# Get retriever with fewer documents
retriever = vector_db.get_retriever(k=5)  # Instead of k=10

Limit Chat History

from langchat import LangChat

ai = LangChat(
    llm=llm,
    vector_db=vector_db,
    db=db,  # database instance from your existing setup
    max_chat_history=10  # Less history = faster
)

Best Practices

1. Balance Speed and Quality

# Fast but less accurate
llm = OpenAI(model="gpt-4o-mini")
vector_db = Pinecone(embedding_model="text-embedding-3-small")
retriever = vector_db.get_retriever(k=3)

# Slower but more accurate
llm = OpenAI(model="gpt-4o")
vector_db = Pinecone(embedding_model="text-embedding-3-large")
retriever = vector_db.get_retriever(k=10)
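To see what this trade-off costs for your own data, you can time the same query against both configurations. The sketch below is only illustrative: it reuses the constructors shown above, varies only the model and embedding choices, assumes a db instance is already set up as in the earlier example, and uses placeholder keys, query, and user id.

import asyncio
import time

from langchat import LangChat
from langchat.llm import OpenAI
from langchat.vector_db import Pinecone

async def time_config(llm, vector_db, query):
    # Build a LangChat instance for this configuration and time a single request.
    # db is the database instance from your existing setup, as in the earlier example.
    ai = LangChat(llm=llm, vector_db=vector_db, db=db, max_chat_history=10)
    start = time.perf_counter()
    await ai.chat(query=query, user_id="benchmark-user")
    return time.perf_counter() - start

async def compare(query="..."):
    fast_db = Pinecone(api_key="...", index_name="...",
                       embedding_model="text-embedding-3-small")
    accurate_db = Pinecone(api_key="...", index_name="...",
                           embedding_model="text-embedding-3-large")
    fast = await time_config(OpenAI(api_key="sk-...", model="gpt-4o-mini"), fast_db, query)
    accurate = await time_config(OpenAI(api_key="sk-...", model="gpt-4o"), accurate_db, query)
    print(f"Fast config: {fast:.2f}s, accurate config: {accurate:.2f}s")

asyncio.run(compare())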

2. Use Multiple API Keys

# Distribute load
llm = OpenAI(api_keys=["key1", "key2", "key3"])

3. Optimize Reranking

from langchat.reranker import Flashrank

# Smaller model = faster
reranker = Flashrank(
    model_name="ms-marco-MiniLM-L-6-v2",  # Faster model
    top_n=3  # Fewer documents
)

Monitoring

Track performance in production:

import time

start = time.time()
result = await ai.chat(query="...", user_id="...")
elapsed = time.time() - start

print(f"Response time: {elapsed:.2f}s")

Built with ❤️ by NeuroBrain