
What it does

After Pinecone returns the top-k chunks by cosine similarity, the Flashrank reranker re-scores them using a cross-encoder model. Cross-encoders jointly encode the query and each document together, giving a much more accurate relevance score than the embedding similarity alone. Result: fewer but better chunks reach the LLM prompt.
Pinecone: 5 candidates (by cosine similarity)
        ↓
Flashrank: re-scores all 5
        ↓
Top 3 most relevant chunks → LLM prompt
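Conceptually, the rerank-then-truncate step is just "score every candidate against the query, sort, keep the best top_n". A minimal sketch in plain Python (the scoring function below is a toy stand-in for the cross-encoder, and none of these names are LangChat internals):

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Re-score retrieved chunks against the query and keep the best top_n."""
    # The cross-encoder sees each (query, chunk) pair jointly, unlike the
    # bi-encoder cosine similarity Pinecone used for the initial retrieval.
    scored = [(score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

# Toy scorer: fraction of query words that appear in the chunk
# (illustration only; the real model is far more accurate).
def toy_score(query: str, chunk: str) -> float:
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split())) / max(len(query_words), 1)

candidates = [
    "Flashrank re-scores retrieved chunks with a cross-encoder.",
    "Pinecone returns top-k chunks by cosine similarity.",
    "Reranking runs entirely locally, no API key needed.",
    "Unrelated text about something else entirely.",
    "The model is cached after the first download.",
]
best = rerank("how does flashrank rerank chunks", candidates, toy_score, top_n=3)
```

With `top_n=3`, only the three highest-scoring chunks survive into the prompt; the rest are discarded even though Pinecone retrieved them.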

Default configuration

LangChat uses Flashrank automatically — no setup required:
# This is what LangChat uses by default (you don't need to write this):
from langchat.adapters.reranker import FlashrankRerankAdapter

reranker = FlashrankRerankAdapter(
    model_name="ms-marco-MiniLM-L-12-v2",
    cache_dir="rerank_models",
    top_n=3,
)

Custom configuration

Pass a custom reranker to LangChat to change the model or top_n:
from langchat import LangChat
from langchat.providers import OpenAI, Pinecone, Supabase
from langchat.adapters.reranker import FlashrankRerankAdapter

reranker = FlashrankRerankAdapter(
    model_name="ms-marco-MiniLM-L-12-v2",
    cache_dir="rerank_models",
    top_n=5,   # pass more chunks to the LLM
)

lc = LangChat(
    llm=OpenAI("gpt-4o-mini"),
    vector_db=Pinecone("my-index"),
    db=Supabase(),
    reranker=reranker,
)

Parameters

model_name (str, default: "ms-marco-MiniLM-L-12-v2")
Flashrank cross-encoder model. The default is a good balance of speed and accuracy.

cache_dir (str, default: "rerank_models")
Directory where the model is cached after first download.

top_n (int, default: 3)
Number of reranked chunks to include in the LLM prompt.

Model download

The Flashrank model is downloaded automatically on first use (about 100MB). It’s cached locally in cache_dir and reused on subsequent runs. No API key or external service is required — reranking runs entirely locally.
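Because the cache is an ordinary local directory, you can check it before startup, e.g. to warn that a first-run download is about to happen. A minimal sketch, assuming the model is stored in a subdirectory of cache_dir named after the model (the exact layout is managed by Flashrank and may differ):

```python
from pathlib import Path

def model_is_cached(cache_dir: str, model_name: str) -> bool:
    """Return True if a non-empty cache entry for the model already exists."""
    # Assumed layout: cache_dir/<model_name>/ is created on first download.
    model_dir = Path(cache_dir) / model_name
    return model_dir.is_dir() and any(model_dir.iterdir())

# On a fresh machine this is False, so the ~100MB download runs once.
print(model_is_cached("rerank_models", "ms-marco-MiniLM-L-12-v2"))
```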

Trade-offs

top_n | LLM context quality          | Token cost | Latency
------|------------------------------|------------|--------
1–2   | Focused but may miss context | Lowest     | Fastest
3     | Good balance (default)       | Moderate   | Fast
5–10  | Rich context                 | Higher     | Slower
For long, complex documents where multiple chunks may be needed to answer a question, increase top_n to 5 or more.
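The token-cost column scales roughly linearly with top_n: the prompt grows by about top_n × average-chunk-size tokens. A back-of-the-envelope estimate (the 400-token chunk size is illustrative, not a LangChat default):

```python
def context_tokens(top_n: int, avg_chunk_tokens: int = 400) -> int:
    """Rough number of tokens the reranked chunks add to the prompt."""
    return top_n * avg_chunk_tokens

# Doubling top_n roughly doubles the context cost per request.
for n in (2, 3, 5, 10):
    print(f"top_n={n}: ~{context_tokens(n)} chunk tokens per prompt")
```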