What is RAG?

Retrieval-Augmented Generation (RAG) combines a vector search step with an LLM generation step:
  1. Retrieve — find document chunks most relevant to the user’s question
  2. Augment — inject those chunks into the LLM prompt as context
  3. Generate — the LLM answers based on the retrieved context, not just its training data
This lets your chatbot answer questions about your specific documents — product manuals, FAQs, legal documents, internal wikis — without retraining any model.
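The three steps can be sketched in a few lines of plain Python. Word overlap stands in for real embedding similarity here; nothing in this sketch is LangChat's actual implementation:

```python
# Toy sketch of the three RAG steps. Word overlap stands in for real
# embedding similarity; this is illustrative only.
def overlap(doc: str, query: str) -> int:
    # Count shared words, ignoring case and trailing punctuation.
    def words(s):
        return {w.strip(".,?!") for w in s.lower().split()}
    return len(words(doc) & words(query))

docs = [
    "Returns are accepted within 30 days of purchase.",
    "Shipping takes 3-5 business days.",
]
question = "How many days do I have to return a purchase?"

# 1. Retrieve: pick the chunk most similar to the question
context = max(docs, key=lambda d: overlap(d, question))

# 2. Augment: inject the retrieved chunk into the prompt
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# 3. Generate: `prompt` would now be sent to the LLM
print(prompt)
```

A real pipeline swaps `overlap` for embedding-vector similarity and sends the assembled prompt to an LLM, but the retrieve–augment–generate shape is the same.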

How LangChat’s pipeline works

User question
  ↓
Standalone question reformulation (via LLM)
  ↓
OpenAI Embeddings (text-embedding-3-large)
  ↓
Pinecone similarity search (top-k chunks)
  ↓
Flashrank reranking (top-3 from top-k)
  ↓
LLM prompt: context + history + question
  ↓
AI response

1. Standalone question reformulation

Before searching, LangChat uses the LLM to rewrite the user’s message as a standalone query. This resolves pronouns and references to earlier messages:
User: "What is our return policy?"
Bot: "You can return items within 30 days."
User: "What about damaged items?"

→ Reformulated: "What is Acme Corp's return policy for damaged items?"
The reformulated question is then embedded and searched.
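A reformulation prompt of this kind might be assembled as in the following sketch. The exact prompt LangChat sends is internal; this only illustrates the pattern:

```python
# Hypothetical reformulation prompt -- not LangChat's actual wording.
history = [
    ("user", "What is our return policy?"),
    ("assistant", "You can return items within 30 days."),
]
followup = "What about damaged items?"

transcript = "\n".join(f"{role}: {text}" for role, text in history)
reformulation_prompt = (
    "Given the conversation below, rewrite the final user message as a "
    "standalone question that can be understood without the history. "
    "Resolve all pronouns and references.\n\n"
    f"{transcript}\nuser: {followup}\n\n"
    "Standalone question:"
)
# This prompt is sent to the LLM; the standalone question it returns
# is what gets embedded and searched.
print(reformulation_prompt)
```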

2. Embedding

The question is embedded using OpenAI’s text-embedding-3-large model (3072 dimensions). The same model must be used when indexing documents — mixing models produces incorrect results.

3. Pinecone similarity search

LangChat queries Pinecone for the top-k most similar chunks (k=5 by default via the retriever). The similarity metric is cosine similarity.
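Cosine similarity compares the angle between two vectors and ignores their magnitude. A minimal reference implementation:

```python
# Cosine similarity: the metric used to compare the query vector
# against stored chunk vectors.
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```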

4. Flashrank reranking

The top-k Pinecone results are reranked by Flashrank, a fast cross-encoder model that more accurately scores relevance than cosine similarity alone. The default model is ms-marco-MiniLM-L-12-v2, keeping the top 3 results. Reranking improves answer quality significantly — especially for long documents where many chunks may be superficially similar but only a few are truly relevant.
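The rerank-then-truncate shape of this step can be sketched as below. `cross_encoder_score` is a crude word-overlap placeholder standing in for Flashrank's actual cross-encoder model; the function names are ours, not LangChat's:

```python
# Sketch of rerank-then-truncate: score each retrieved chunk against
# the query, keep only the top 3. The scoring function is a placeholder
# for a real cross-encoder such as ms-marco-MiniLM-L-12-v2.
def cross_encoder_score(query: str, chunk: str) -> float:
    # Placeholder: fraction of query words appearing in the chunk.
    query_words = query.lower().split()
    return sum(w in chunk.lower() for w in query_words) / len(query_words)

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    ranked = sorted(chunks, key=lambda c: cross_encoder_score(query, c),
                    reverse=True)
    return ranked[:top_n]

chunks = [  # imagine these came back from Pinecone top-k (k=5)
    "Damaged items can be returned within 60 days.",
    "Our headquarters are in Springfield.",
    "Returns require the original receipt.",
    "Gift cards are non-refundable.",
    "Shipping is free over $50.",
]
top3 = rerank("return policy for damaged items", chunks)
print(top3)
```

The cross-encoder sees the query and chunk together, so it can judge relevance far better than the word-overlap placeholder above or than cosine similarity between independently computed embeddings.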

Pinecone namespaces

Use namespaces to partition documents within a single index. Searches are scoped to the namespace you configure:
# Index documents in separate namespaces
lc.index("products/", namespace="products")
lc.index("policies/", namespace="policies")

# Search only the products namespace
# (retrieval is scoped to the namespace set at indexing time)
vector_db = Pinecone("my-index")
Namespaces are useful for:
  • Separating different clients in a multi-tenant app
  • Partitioning by language or region
  • Separating document types (e.g., products vs. policies)
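For the multi-tenant case, one simple convention (our own, not a LangChat API) is to derive the namespace from the tenant ID and use it consistently at indexing and query time:

```python
# Hypothetical helper: map a tenant ID to a stable Pinecone namespace.
# The naming scheme is illustrative, not part of LangChat.
def tenant_namespace(tenant_id: str) -> str:
    # Lowercase and prefix so namespaces stay predictable per tenant.
    return f"tenant-{tenant_id.lower()}"

# Used at indexing time, e.g.:
#   lc.index("docs/acme/", namespace=tenant_namespace("AcmeCorp"))
print(tenant_namespace("AcmeCorp"))  # tenant-acmecorp
```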

Changing the retriever depth

The default retriever fetches k=5 chunks before reranking. Fetching more candidates before reranking improves recall at the cost of latency; this depth is controlled by the PineconeVectorAdapter internals. For advanced customization, see Extending Adapters.

Embedding model choice

Model                  | Dimensions | Quality        | Cost
text-embedding-3-large | 3072       | Highest        | ~2× more than small
text-embedding-3-small | 1536       | Good           | Lower
text-embedding-ada-002 | 1536       | Older baseline | Similar to small
Configure the embedding model on the Pinecone provider:
from langchat.providers import Pinecone

vector_db = Pinecone("my-index", embedding_model="text-embedding-3-small")
You must use the same embedding model for both indexing and retrieval. If you change the model, re-index all documents with the new model in a fresh Pinecone index.

When there’s no relevant context

If Pinecone returns no relevant results (low similarity scores), the LLM still receives the prompt — but the {context} placeholder will be empty or contain low-quality chunks. This can lead to hallucinated answers. Best practices:
  • Always tell the model what to do when context is missing: “If the answer is not in the context, say you don’t know.”
  • Ensure documents are indexed before going live
  • Monitor queries that return empty context (visible in Supabase request_metrics)
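The first best practice can be baked into the system prompt itself. The wording below is an example, not LangChat's built-in prompt:

```python
# Illustrative system prompt that tells the model what to do when the
# retrieved context is empty or irrelevant.
SYSTEM_PROMPT = (
    "Answer the user's question using only the context below.\n"
    "If the answer is not in the context, say you don't know; "
    "do not invent an answer.\n\n"
    "Context:\n{context}"
)

def build_system_message(context_chunks: list[str]) -> str:
    # An empty chunk list yields an empty context section, which the
    # instruction above tells the model to treat as "I don't know".
    return SYSTEM_PROMPT.format(context="\n\n".join(context_chunks))

print(build_system_message([]))
```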