# Performance

> Optimize LangChat for lower latency, higher throughput, and lower cost.

## Response time breakdown

A typical LangChat `chat()` call involves:

| Step                                      | Typical latency |
| ----------------------------------------- | --------------- |
| Standalone question generation (LLM call) | 300–800ms       |
| Pinecone embedding + search               | 100–300ms       |
| Flashrank reranking                       | 50–150ms        |
| LLM response generation                   | 500–3000ms      |
| Supabase save (background, non-blocking)  | —               |
| **Total**                                 | **\~1–4s**      |

The two LLM calls dominate: together they account for roughly 800–3800ms of the total, while retrieval and reranking add at most about 450ms.
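The totals in the table can be checked with a few lines of arithmetic. The sketch below only sums the latency ranges listed above and reports what share of the total the two LLM calls represent. The dictionary keys are illustrative labels, not LangChat identifiers.

```python
# Latency ranges in milliseconds, copied from the table above.
# The keys are illustrative labels, not LangChat identifiers.
STEPS_MS = {
    "standalone_question_llm": (300, 800),
    "pinecone_embed_search": (100, 300),
    "flashrank_rerank": (50, 150),
    "response_llm": (500, 3000),
}
LLM_STEPS = ("standalone_question_llm", "response_llm")


def total_ms(steps, bound):
    """Sum the lower (bound=0) or upper (bound=1) end of every range."""
    return sum(rng[bound] for rng in steps.values())


def llm_share(steps, bound):
    """Fraction of the total that is spent in the two LLM calls."""
    llm = sum(steps[name][bound] for name in LLM_STEPS)
    return llm / total_ms(steps, bound)


for label, bound in (("best case", 0), ("worst case", 1)):
    print(f"{label}: {total_ms(STEPS_MS, bound)} ms, "
          f"LLM share {llm_share(STEPS_MS, bound):.0%}")
```

With the table's numbers the LLM calls make up roughly 84 to 89 percent of the end-to-end time, which is why the next sections focus on model choice and prompt size.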

***

## Use a faster model

Switching models is the single highest-impact change. `gpt-4o-mini` is \~5× faster and \~20× cheaper than `gpt-4o`:

```python theme={null}
# Slower and more expensive
llm = OpenAI("gpt-4o")

# Faster and cheaper; good for most use cases
llm = OpenAI("gpt-4o-mini")
```

If you can trade some answer quality for even lower latency, try Mistral's small models or Gemini Flash:

```python theme={null}
llm = Gemini("gemini-2.0-flash")         # very fast
llm = Mistral("mistral-small-latest")    # fast and cheap
```
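Model speed depends on prompt size, region, and provider load, so it is worth timing candidates on your own queries. A minimal sketch of such a comparison follows. `fake_model_call` is a stand-in that only sleeps, and the model names and delays are made up; in real use you would await `lc.chat(...)` on instances configured with different `llm` values.

```python
import asyncio
import time


async def timed(coro):
    """Await a coroutine and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def fake_model_call(delay_s: float) -> str:
    """Hypothetical stand-in for a chat call; it only sleeps."""
    await asyncio.sleep(delay_s)
    return "ok"


async def compare():
    timings = {}
    # Made-up names and delays, for illustration only
    for name, delay_s in (("slow-model", 0.10), ("fast-model", 0.01)):
        _, seconds = await timed(fake_model_call(delay_s))
        timings[name] = seconds
    return timings


timings = asyncio.run(compare())
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.0f} ms")
```

Timing several representative queries per model, rather than one, gives a more reliable picture because LLM latency varies from call to call.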

***

## Reduce context size

Smaller prompts mean faster LLM calls and lower cost.

**Reduce history window:**

```python theme={null}
lc = LangChat(
    llm=OpenAI("gpt-4o-mini"),
    vector_db=Pinecone("my-index"),
    db=Supabase(),
    max_chat_history=5,   # default: 20
)
```

**Reduce reranker `top_n`:**

```python theme={null}
from langchat.adapters.reranker import FlashrankRerankAdapter

reranker = FlashrankRerankAdapter(top_n=2)   # default: 3
```

**Use smaller chunks:**

```python theme={null}
result = lc.index("docs/", chunk_size=600, chunk_overlap=100)
```

Smaller chunks mean shorter context per retrieved document.
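To get a feel for how these three settings interact, a rough token budget can be sketched as below. Everything in it is an assumption for illustration: the function name, the four-characters-per-token rule of thumb, the average message length, the fixed overhead, and the reading of `chunk_size` as a character count. It is not how LangChat counts tokens internally.

```python
def estimate_prompt_tokens(
    max_chat_history: int,
    top_n: int,
    chunk_size_chars: int,
    avg_message_tokens: int = 60,      # assumed average length of one history message
    chars_per_token: float = 4.0,      # common rule of thumb for English text
    fixed_overhead_tokens: int = 200,  # assumed system prompt plus user question
) -> int:
    """Rough prompt size: history + retrieved chunks + fixed overhead."""
    history_tokens = max_chat_history * avg_message_tokens
    context_tokens = top_n * (chunk_size_chars / chars_per_token)
    return round(history_tokens + context_tokens + fixed_overhead_tokens)


# Defaults from this page (history 20, top_n 3) with a hypothetical 1000-char chunk
baseline = estimate_prompt_tokens(20, 3, 1000)
# Tuned values from the snippets above
tuned = estimate_prompt_tokens(5, 2, 600)
print(f"baseline: {baseline} tokens, tuned: {tuned} tokens")
```

The absolute numbers are only as good as the assumptions, but the structure shows why history length and retrieved context are the two levers: both scale the prompt linearly.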

***

## Use a smaller embedding model

Switch to `text-embedding-3-small` for faster, cheaper embeddings:

```python theme={null}
from langchat.providers import Pinecone

vector_db = Pinecone("my-index", embedding_model="text-embedding-3-small")
```

<Warning>
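  When switching embedding models, you must re-create your Pinecone index with the new model's dimension (1536 for `text-embedding-3-small`) and re-index all documents, because vectors produced by different models are not comparable.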
</Warning>
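Before re-indexing, it can help to confirm that the index dimension matches the model's output size. The helper below is a hypothetical sketch, not part of LangChat or the Pinecone SDK: it takes the dimension as a plain integer (which you would read from your Pinecone console or client) and compares it against the default output sizes of OpenAI's embedding models.

```python
# Default output dimensions of OpenAI embedding models.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_index_dimension(index_dimension: int, embedding_model: str) -> None:
    """Raise ValueError if the index cannot store this model's vectors."""
    expected = EMBEDDING_DIMS[embedding_model]
    if index_dimension != expected:
        raise ValueError(
            f"{embedding_model} produces {expected}-dimensional vectors, "
            f"but the index was created with dimension {index_dimension}"
        )


# Hypothetical value: pass the dimension your index actually reports
check_index_dimension(1536, "text-embedding-3-small")
print("index dimension matches")
```

A matching dimension only means the vectors fit. Documents embedded with the previous model still have to be re-indexed.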

***

## Concurrent users

LangChat's `chat()` is async and non-blocking. Run multiple chats concurrently:

```python theme={null}
import asyncio
from langchat import LangChat

lc = LangChat(...)

async def handle_concurrent_users():
    # gather() schedules all three chats concurrently on one event loop
    results = await asyncio.gather(
        lc.chat(query="Question 1", user_id="alice"),
        lc.chat(query="Question 2", user_id="bob"),
        lc.chat(query="Question 3", user_id="carol"),
    )
    for r in results:
        print(r.text)


asyncio.run(handle_concurrent_users())
```
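With many simultaneous users, an unbounded `gather()` can run into provider rate limits. One common pattern, sketched here as an assumption rather than a LangChat feature, is to cap the number of in-flight chats with `asyncio.Semaphore`. `fake_chat` is a stand-in for `lc.chat(...)` so the example runs without any external services, and `max_concurrent` is a hypothetical parameter name.

```python
import asyncio


async def fake_chat(query: str, user_id: str) -> str:
    """Stand-in for `lc.chat(...)`; it only sleeps to simulate latency."""
    await asyncio.sleep(0.05)
    return f"{user_id}: answer to {query!r}"


async def chat_all(requests, max_concurrent: int = 2):
    """Run every request, with at most `max_concurrent` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(query, user_id):
        async with semaphore:
            return await fake_chat(query=query, user_id=user_id)

    # return_exceptions=True returns failures as results
    # instead of raising the first one
    return await asyncio.gather(
        *(limited(q, u) for q, u in requests), return_exceptions=True
    )


requests = [("Question 1", "alice"), ("Question 2", "bob"), ("Question 3", "carol")]
results = asyncio.run(chat_all(requests))
for r in results:
    print(r)
```

`gather()` returns results in the order the requests were passed in, so each answer can be matched back to its user by position.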

For the API server, use multiple uvicorn workers:

```bash theme={null}
uvicorn server:app --workers 4 --host 0.0.0.0 --port 8000
```

***

## Session caching

Sessions are cached in memory. The first call for a user loads history from Supabase; subsequent calls use the in-memory cache. No extra configuration is needed.

After a server restart the cache is empty, so the first call for each user incurs a Supabase query. This load stays fast even in large deployments with many unique users, because the history queries are indexed by `user_id` and `platform`. Because the cache lives in process memory, each uvicorn worker keeps its own copy when you run several workers.
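The load-once behaviour described above can be pictured with a small sketch. This is not LangChat's implementation: `SessionCache` and `load_from_db` are hypothetical names, and the loader records calls in a list instead of querying Supabase.

```python
from typing import Callable

History = list[str]


class SessionCache:
    """Load a user's history once, then serve it from memory."""

    def __init__(self, load_history: Callable[[str, str], History]):
        self._load_history = load_history  # for example, a database query
        self._sessions: dict[tuple[str, str], History] = {}

    def get(self, user_id: str, platform: str) -> History:
        key = (user_id, platform)
        if key not in self._sessions:  # cache miss: go to the database
            self._sessions[key] = self._load_history(user_id, platform)
        return self._sessions[key]


db_queries = []


def load_from_db(user_id: str, platform: str) -> History:
    """Hypothetical loader that records each call instead of querying Supabase."""
    db_queries.append((user_id, platform))
    return []


cache = SessionCache(load_from_db)
cache.get("alice", "web")  # first call: loads from the database
cache.get("alice", "web")  # second call: served from memory
print(f"database queries so far: {len(db_queries)}")
```

A fresh `SessionCache` starts empty, which mirrors what happens after a restart: the first request per user pays for one query, and later requests do not.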

***

## Cost optimization

| Change                             | Savings                 |
| ---------------------------------- | ----------------------- |
| Switch `gpt-4o` → `gpt-4o-mini`    | \~20× cheaper per query |
| Reduce `max_chat_history` 20 → 5   | \~30% fewer tokens      |
| Reduce `top_n` 3 → 2               | \~15% fewer tokens      |
| Switch to `text-embedding-3-small` | \~5× cheaper embeddings |
| Use Ollama for dev/testing         | Free (runs locally)     |
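To turn the ratios in the table into a number for your own workload, per-query cost can be estimated from token counts and per-million-token prices. The function name and the token counts below are hypothetical, and the prices are placeholders chosen only to mirror the \~20× ratio in the table. They are not real list prices, so substitute your provider's current rates.

```python
def cost_per_query(prompt_tokens, completion_tokens,
                   input_price_per_m, output_price_per_m):
    """Cost in dollars of one LLM call, given prices per million tokens."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000


# Placeholder prices chosen only to mirror the ~20x ratio in the table above.
# They are NOT real list prices; substitute your provider's current rates.
LARGE = {"input_price_per_m": 2.00, "output_price_per_m": 8.00}
SMALL = {"input_price_per_m": 0.10, "output_price_per_m": 0.40}

# Hypothetical query: 2000 prompt tokens, 300 completion tokens
large_cost = cost_per_query(2000, 300, **LARGE)
small_cost = cost_per_query(2000, 300, **SMALL)
print(f"large: ${large_cost:.5f}  small: ${small_cost:.5f}  "
      f"ratio: {large_cost / small_cost:.0f}x")
```

Keep in mind that a `chat()` call makes two LLM calls (standalone question generation and response generation), so a full estimate adds the cost of both.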

***

## Background operations

LangChat saves chat history and metrics to Supabase in background threads, so these writes never block the response:

```python theme={null}
# This returns immediately after getting the LLM response
# Supabase saving happens in the background
response = await lc.chat(query="Hello", user_id="alice")
```

This design means your users get responses as fast as the LLM allows, without waiting for database writes.
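The general pattern can be illustrated with the standard library alone. The sketch below is an assumption about the shape of such a design, not LangChat's code: `save_to_db` is a stand-in that sleeps to simulate a slow write, and `chat` here is a toy function that returns the worker thread only so the demo can wait for it at the end.

```python
import asyncio
import threading
import time

saved = []


def save_to_db(record: dict) -> None:
    """Stand-in for a slow database write."""
    time.sleep(0.2)
    saved.append(record)


async def chat(query: str, user_id: str):
    answer = f"echo: {query}"  # stand-in for the LLM call
    record = {"user_id": user_id, "query": query, "answer": answer}
    writer = threading.Thread(target=save_to_db, args=(record,))
    writer.start()  # not joined here, so the caller is not kept waiting
    return answer, writer


async def main():
    start = time.perf_counter()
    answer, writer = await chat("Hello", "alice")
    elapsed = time.perf_counter() - start
    writer.join()  # only so the demo can show that the write landed
    return answer, elapsed


answer, elapsed = asyncio.run(main())
print(f"{answer!r} returned in {elapsed:.3f}s; saved records: {len(saved)}")
```

The response comes back well before the simulated 0.2-second write finishes. The trade-off of any fire-and-forget write is that a failure in the background cannot be reported through the response, so such failures need to be logged or monitored separately.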
