
Response time breakdown

A typical LangChat chat() call involves:
| Step | Typical latency |
| --- | --- |
| Standalone question generation (LLM call) | 300–800ms |
| Pinecone embedding + search | 100–300ms |
| Flashrank reranking | 50–150ms |
| LLM response generation | 500–3000ms |
| Supabase save (background, non-blocking) | — |
| **Total** | ~1–4s |
The two LLM calls dominate. Everything else is fast.
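The budget above can be sanity-checked with a little arithmetic (the figures below are the rough per-step ranges from the table, not measurements):

```python
# Rough per-step latency ranges from the table above, in milliseconds.
steps = {
    "standalone question generation": (300, 800),
    "pinecone embedding + search": (100, 300),
    "flashrank reranking": (50, 150),
    "llm response generation": (500, 3000),
    # The Supabase save is non-blocking, so it adds nothing to response time.
}

best = sum(lo for lo, _ in steps.values())
worst = sum(hi for _, hi in steps.values())
print(f"total: ~{best / 1000:.2f}s-{worst / 1000:.2f}s")  # total: ~0.95s-4.25s

# The two LLM calls account for the bulk of the worst case:
llm_worst = (steps["standalone question generation"][1]
             + steps["llm response generation"][1])
print(f"LLM share of worst case: {llm_worst / worst:.0%}")
```

This is why the rest of this page focuses on the model choice and prompt size first: shaving the non-LLM steps can only recover a small fraction of the total.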

Use a faster model

The single highest-impact change. gpt-4o-mini is ~5× faster and ~20× cheaper than gpt-4o:
```python
# Slower and expensive
llm = OpenAI("gpt-4o")

# Faster and cheap — good for most use cases
llm = OpenAI("gpt-4o-mini")
```
For even faster responses at lower quality, try Mistral’s small models or Gemini Flash:
```python
llm = Gemini("gemini-2.0-flash")         # very fast
llm = Mistral("mistral-small-latest")    # fast and cheap
```

Reduce context size

Smaller prompts mean faster LLM calls and lower cost. Reduce the history window:
```python
lc = LangChat(
    llm=OpenAI("gpt-4o-mini"),
    vector_db=Pinecone("my-index"),
    db=Supabase(),
    max_chat_history=5,   # default: 20
)
```
Reduce the reranker's top_n:
```python
from langchat.adapters.reranker import FlashrankRerankAdapter

reranker = FlashrankRerankAdapter(top_n=2)   # default: 3
```
Use smaller chunks:
```python
result = lc.index("docs/", chunk_size=600, chunk_overlap=100)
```
Smaller chunks mean shorter context per retrieved document.
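To see why the history window matters, here is a back-of-the-envelope token estimate (the ~80 tokens per message is an illustrative assumption, not a LangChat figure):

```python
# Illustrative assumption: an average chat message is ~80 tokens.
TOKENS_PER_MESSAGE = 80

def history_tokens(max_chat_history: int) -> int:
    """Rough prompt tokens contributed by the history window."""
    return max_chat_history * TOKENS_PER_MESSAGE

default = history_tokens(20)   # default window
reduced = history_tokens(5)    # reduced window
print(f"history tokens per call: {default} -> {reduced} "
      f"({1 - reduced / default:.0%} fewer history tokens)")
```

The history is only one part of the prompt (alongside retrieved context and the system prompt), which is why the overall token savings land closer to ~30% than to the 75% reduction of the history slice itself.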

Use a smaller embedding model

Switch to text-embedding-3-small for faster, cheaper embeddings:
```python
from langchat.providers import Pinecone

vector_db = Pinecone("my-index", embedding_model="text-embedding-3-small")
```
You must re-create your Pinecone index (1536 dimensions) and re-index all documents when switching embedding models.

Concurrent users

LangChat’s chat() is async and non-blocking. Run multiple chats concurrently:
```python
import asyncio
from langchat import LangChat

lc = LangChat(...)

async def handle_concurrent_users():
    # These run concurrently
    results = await asyncio.gather(
        lc.chat(query="Question 1", user_id="alice"),
        lc.chat(query="Question 2", user_id="bob"),
        lc.chat(query="Question 3", user_id="carol"),
    )
    for r in results:
        print(r.text)
```
For the API server, use multiple uvicorn workers:
```shell
uvicorn server:app --workers 4 --host 0.0.0.0 --port 8000
```
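The speedup from overlapping calls is easy to demonstrate with a stub in place of chat() (the sleep below stands in for a real LLM round-trip; fake_chat is an illustrative stand-in, not a LangChat API):

```python
import asyncio
import time

async def fake_chat(query: str, user_id: str) -> str:
    # Stand-in for lc.chat(): simulate a ~0.1s LLM round-trip.
    await asyncio.sleep(0.1)
    return f"answer for {user_id}"

async def main() -> float:
    start = time.perf_counter()
    await asyncio.gather(
        fake_chat("Question 1", "alice"),
        fake_chat("Question 2", "bob"),
        fake_chat("Question 3", "carol"),
    )
    return time.perf_counter() - start

elapsed = asyncio.run(main())
# The three calls overlap, so wall time is ~0.1s rather than ~0.3s.
print(f"elapsed: {elapsed:.2f}s")
```

Because the waiting happens inside await, one event loop can serve many users; the uvicorn workers above add CPU parallelism on top of that.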

Session caching

Sessions are cached in memory. The first call for a user loads history from Supabase; subsequent calls use the in-memory cache. No extra configuration needed. After a server restart, the cache is empty — first calls incur a Supabase query. For large-scale deployments with many unique users, the history loading is fast (Supabase queries are indexed by user_id and platform).

Cost optimization

| Change | Savings |
| --- | --- |
| Switch gpt-4o → gpt-4o-mini | ~20× cheaper per query |
| Reduce max_chat_history 20 → 5 | ~30% fewer tokens |
| Reduce top_n 3 → 2 | ~15% fewer tokens |
| Switch to text-embedding-3-small | ~5× cheaper embeddings |
| Use Ollama for dev/testing | Free (runs locally) |
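A quick way to estimate what a model switch saves per query (the per-token prices and token counts below are illustrative placeholders; substitute your provider's current rates):

```python
# Illustrative prices in USD per 1M tokens -- check your provider's pricing page.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical RAG query: large prompt (context + history), short answer.
big = query_cost("gpt-4o", 3000, 500)
small = query_cost("gpt-4o-mini", 3000, 500)
print(f"per query: ${big:.4f} vs ${small:.4f} ({big / small:.0f}x cheaper)")
```

Multiply by your expected daily query volume to turn this into a monthly figure before committing to a model.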

Background operations

LangChat saves chat history and metrics to Supabase in background threads — these never block the response:
```python
# This returns immediately after getting the LLM response
# Supabase saving happens in the background
response = await lc.chat(query="Hello", user_id="alice")
```
This design means your users get responses as fast as the LLM allows, without waiting for database writes.
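The fire-and-forget pattern can be sketched with the standard library (this is an illustrative sketch, not LangChat's implementation; save_to_db is a stub standing in for a Supabase write):

```python
import asyncio
import time

def save_to_db(record: dict) -> None:
    # Stand-in for a Supabase write: slow, but never awaited inline.
    time.sleep(0.2)

async def chat(query: str) -> str:
    response = f"echo: {query}"          # pretend this came from the LLM
    loop = asyncio.get_running_loop()
    # Schedule the save on a worker thread and return without waiting for it.
    loop.run_in_executor(None, save_to_db, {"query": query, "response": response})
    return response

async def main() -> float:
    start = time.perf_counter()
    await chat("Hello")                  # returns immediately
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"responded in {elapsed * 1000:.0f}ms despite a 200ms database write")
```

The trade-off of any fire-and-forget design is that a failed write cannot surface as an error in the chat response, so the background path needs its own logging.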