What it does
After Pinecone returns the top-k chunks by cosine similarity, the Flashrank reranker re-scores them using a cross-encoder model. Cross-encoders encode the query and each document jointly, giving a much more accurate relevance score than embedding similarity alone. The result: fewer but better chunks reach the LLM prompt.

Default configuration

LangChat uses Flashrank automatically; no setup is required.

Custom configuration

Pass a custom reranker to LangChat to change the model or top_n:
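The exact constructor signature depends on your LangChat version; the following is a minimal sketch that assumes a `reranker` keyword argument and Flashrank's `Ranker` class. The import path, argument names, and model name here are assumptions for illustration, not confirmed API:

```python
from flashrank import Ranker

from langchat import LangChat  # assumed import path

# Swap in a larger cross-encoder and keep more chunks.
# model_name and cache_dir values are illustrative.
reranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/tmp/flashrank")
chat = LangChat(reranker=reranker, top_n=5)
```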
Parameters
- Model: the Flashrank cross-encoder model to use. The default is a good balance of speed and accuracy.
- cache_dir: the directory where the model is cached after the first download.
- top_n: the number of reranked chunks to include in the LLM prompt.
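For reference, the model and cache settings map onto Flashrank's own `Ranker` constructor roughly as below. This is a sketch: the model name is illustrative, and the top_n cut is applied by LangChat after reranking rather than by Flashrank itself:

```python
from flashrank import Ranker, RerankRequest

# Choose the cross-encoder and where to cache it locally (values illustrative).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/tmp/flashrank")

request = RerankRequest(
    query="how does reranking work?",
    passages=[{"id": 1, "text": "Cross-encoders score query and document jointly."}],
)
results = ranker.rerank(request)  # each passage comes back with a relevance score
```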
Model download
The Flashrank model (about 100 MB) is downloaded automatically on first use. It's cached locally in cache_dir and reused on subsequent runs.
No API key or external service is required — reranking runs entirely locally.
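The reranking step itself is simple to sketch: score every (query, chunk) pair jointly, sort by score, and keep the top_n. The scoring function below is a toy word-overlap stand-in for the cross-encoder, purely to show the data flow:

```python
def rerank(query, chunks, score_fn, top_n=3):
    # Score each (query, chunk) pair jointly, as a cross-encoder would,
    # then keep only the top_n highest-scoring chunks for the LLM prompt.
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def overlap_score(query, chunk):
    # Toy stand-in for a cross-encoder: fraction of query words found in the chunk.
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split())) / len(query_words)

chunks = [
    "Flashrank runs locally with no API key.",
    "Pinecone stores vector embeddings.",
    "Cross-encoders score the query and document together.",
]
top = rerank("cross-encoder query document score", chunks, overlap_score, top_n=2)
```

A real cross-encoder replaces `overlap_score` with a forward pass over the concatenated query and chunk, but the select-sort-truncate flow is the same.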
Trade-offs
| top_n | LLM context quality | Token cost | Latency |
|---|---|---|---|
| 1–2 | Focused but may miss context | Lowest | Fastest |
| 3 | Good balance (default) | Moderate | Fast |
| 5–10 | Rich context | Higher | Slower |
If responses seem to lack supporting context, raise top_n to 5 or more.