## Response time breakdown
A typical LangChat `chat()` call involves:

| Step | Typical latency |
|---|---|
| Standalone question generation (LLM call) | 300–800ms |
| Pinecone embedding + search | 100–300ms |
| Flashrank reranking | 50–150ms |
| LLM response generation | 500–3000ms |
| Supabase save (background, non-blocking) | — |
| **Total** | **~1–4s** |
The two LLM calls dominate. Everything else is fast.
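A quick back-of-envelope check, using the midpoints of the latency ranges in the table above (illustrative estimates, not measurements), confirms how lopsided the budget is:

```python
# Midpoint latency estimates (ms) from the table above — illustrative only
steps = {
    "standalone_question_llm": 550,   # 300–800ms
    "pinecone_search": 200,           # 100–300ms
    "flashrank_rerank": 100,          # 50–150ms
    "response_llm": 1750,            # 500–3000ms
}
total = sum(steps.values())
llm_share = (steps["standalone_question_llm"] + steps["response_llm"]) / total
print(f"LLM calls: {llm_share:.0%} of {total}ms total")  # roughly 88%
```

Retrieval and reranking together are a rounding error next to the two model calls, which is why model choice and prompt size are the levers worth pulling first.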
## Use a faster model

The single highest-impact change. `gpt-4o-mini` is ~5× faster and ~20× cheaper than `gpt-4o`:

```python
# Slower and expensive
llm = OpenAI("gpt-4o")

# Faster and cheaper — good for most use cases
llm = OpenAI("gpt-4o-mini")
```

For even faster responses at some cost in quality, try Mistral's small models or Gemini Flash:

```python
llm = Gemini("gemini-2.0-flash")       # very fast
llm = Mistral("mistral-small-latest")  # fast and cheap
```
## Reduce context size
Smaller prompts = faster LLM calls + lower cost.
Reduce the history window:

```python
lc = LangChat(
    llm=OpenAI("gpt-4o-mini"),
    vector_db=Pinecone("my-index"),
    db=Supabase(),
    max_chat_history=5,  # default: 20
)
```
Reduce the reranker `top_n`:

```python
from langchat.adapters.reranker import FlashrankRerankAdapter

reranker = FlashrankRerankAdapter(top_n=2)  # default: 3
```
Use smaller chunks:

```python
result = lc.index("docs/", chunk_size=600, chunk_overlap=100)
```
Smaller chunks mean shorter context per retrieved document.
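To see how `chunk_size` and `chunk_overlap` interact, here is a naive fixed-size character splitter (an illustrative sketch, not LangChat's actual chunking implementation):

```python
def split_text(text: str, chunk_size: int = 600, chunk_overlap: int = 100):
    """Naive fixed-size splitter — illustrative, not LangChat's internals."""
    step = chunk_size - chunk_overlap  # each chunk starts `step` chars later
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2000
chunks = split_text(doc)
print(len(chunks), max(len(c) for c in chunks))  # 4 chunks, each ≤ 600 chars
```

Each retrieved chunk contributes at most `chunk_size` characters to the prompt, so halving the chunk size roughly halves the retrieval context the LLM must read.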
## Use a smaller embedding model

Switch to `text-embedding-3-small` for faster, cheaper embeddings:

```python
from langchat.providers import Pinecone

vector_db = Pinecone("my-index", embedding_model="text-embedding-3-small")
```

You must re-create your Pinecone index (1536 dimensions) and re-index all documents when switching embedding models.
Concurrent users
LangChat’s chat() is async and non-blocking. Run multiple chats concurrently:
import asyncio
from langchat import LangChat
lc = LangChat(...)
async def handle_concurrent_users():
# These run in parallel
results = await asyncio.gather(
lc.chat(query="Question 1", user_id="alice"),
lc.chat(query="Question 2", user_id="bob"),
lc.chat(query="Question 3", user_id="carol"),
)
for r in results:
print(r.text)
For the API server, use multiple uvicorn workers:

```shell
uvicorn server:app --workers 4 --host 0.0.0.0 --port 8000
```
## Session caching
Sessions are cached in memory. The first call for a user loads history from Supabase; subsequent calls use the in-memory cache. No extra configuration needed.
After a server restart the cache is empty, so each user's first call incurs a Supabase query. Even for large-scale deployments with many unique users, history loading stays fast (Supabase queries are indexed by `user_id` and `platform`).
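The pattern described above is a read-through cache: check memory first, fall back to the database on a miss. A minimal sketch (illustrative only, not LangChat's internals):

```python
class SessionCache:
    """Read-through cache — hit memory first, fall back to the database.
    Illustrative sketch, not LangChat's actual implementation."""

    def __init__(self, load_from_db):
        self._cache = {}
        self._load_from_db = load_from_db  # stand-in for a Supabase query

    def get_history(self, user_id: str):
        if user_id not in self._cache:          # first call: database hit
            self._cache[user_id] = self._load_from_db(user_id)
        return self._cache[user_id]             # later calls: memory only

db_calls = []

def fake_db_load(user_id):
    db_calls.append(user_id)  # record each simulated Supabase query
    return []

cache = SessionCache(fake_db_load)
cache.get_history("alice")
cache.get_history("alice")
print(db_calls)  # only one database query for two calls
```

This is why only a user's first message after a restart pays the database-load cost.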
## Cost optimization

| Change | Savings |
|---|---|
| Switch `gpt-4o` → `gpt-4o-mini` | ~20× cheaper per query |
| Reduce `max_chat_history` 20 → 5 | ~30% fewer tokens |
| Reduce `top_n` 3 → 2 | ~15% fewer tokens |
| Switch to `text-embedding-3-small` | ~5× cheaper embeddings |
| Use Ollama for dev/testing | Free (runs locally) |
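These changes stack. Assuming the savings in the table compose independently (a hedged estimate, not a measured figure), combining the first three gives:

```python
# Hedged estimate: assumes the savings in the table compose independently
model_factor = 1 / 20       # gpt-4o → gpt-4o-mini: ~20× cheaper per token
history_factor = 1 - 0.30   # max_chat_history 20 → 5: ~30% fewer tokens
top_n_factor = 1 - 0.15     # top_n 3 → 2: ~15% fewer tokens

relative_cost = model_factor * history_factor * top_n_factor
print(f"~{relative_cost:.1%} of the original per-query cost")  # ~3%
```

In other words, the combined changes can cut per-query LLM cost by well over an order of magnitude.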
## Background operations

LangChat saves chat history and metrics to Supabase in background threads, so these writes never block the response:

```python
# This returns immediately after the LLM response is ready;
# the Supabase save happens in the background
response = await lc.chat(query="Hello", user_id="alice")
```
This design means your users get responses as fast as the LLM allows, without waiting for database writes.
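The fire-and-forget pattern behind this looks roughly like the following (an illustrative sketch, not LangChat's internals; `save_to_db` stands in for the Supabase insert):

```python
import threading

saved = []
done = threading.Event()

def save_to_db(record):
    # Stand-in for the Supabase insert — illustrative only
    saved.append(record)
    done.set()

def respond(text: str) -> str:
    # Fire-and-forget: start the save in a daemon thread and
    # return the response without waiting for the write to finish
    threading.Thread(target=save_to_db, args=({"text": text},), daemon=True).start()
    return text

print(respond("Hello"))  # returns immediately; the save completes in the background
```

The caller's latency is bounded by the LLM alone; the database write completes on its own time.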