What is RAG?
Retrieval-Augmented Generation (RAG) combines a vector search step with an LLM generation step:
- Retrieve — find document chunks most relevant to the user’s question
- Augment — inject those chunks into the LLM prompt as context
- Generate — the LLM answers based on the retrieved context, not just its training data
This lets your chatbot answer questions about your specific documents — product manuals, FAQs, legal documents, internal wikis — without retraining any model.
How LangChat’s pipeline works
User question
│
▼
Standalone question reformulation (via LLM)
│
▼
OpenAI Embeddings (text-embedding-3-large)
│
▼
Pinecone similarity search (top-k chunks)
│
▼
Flashrank reranking (top-3 from top-k)
│
▼
LLM prompt: context + history + question
│
▼
AI response
1. Standalone question reformulation
Before searching, LangChat uses the LLM to rewrite the user’s message as a standalone query. This resolves pronouns and references to earlier messages:
User: "What is our return policy?"
Bot: "You can return items within 30 days."
User: "What about damaged items?"
→ Reformulated: "What is Acme Corp's return policy for damaged items?"
The reformulated question is then embedded and searched.
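As a rough illustration of this step, the sketch below assembles a reformulation prompt from the chat history. It is not LangChat's actual prompt: the function name `build_reformulation_prompt` and the prompt wording are assumptions made for the example, and the LLM call itself is left out.

```python
def build_reformulation_prompt(history, question):
    """Build a prompt asking an LLM to rewrite `question` as a standalone query.

    `history` is a list of (speaker, text) pairs, oldest first.
    Hypothetical helper; the wording is illustrative only.
    """
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return (
        "Rewrite the final user message as a standalone question. "
        "Resolve pronouns and references using the conversation.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final user message: {question}\n"
        "Standalone question:"
    )


history = [
    ("User", "What is our return policy?"),
    ("Bot", "You can return items within 30 days."),
]
prompt = build_reformulation_prompt(history, "What about damaged items?")
print(prompt)
```

The text the LLM returns for this prompt is what gets embedded, so the search sees the full intent rather than the bare follow-up.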
2. Embedding
The question is embedded using OpenAI’s text-embedding-3-large model (3072 dimensions). The same model must be used when indexing documents: vectors produced by different models are not comparable, so mixing models yields meaningless similarity scores.
3. Pinecone similarity search
LangChat queries Pinecone for the top-k most similar chunks (k=5 by default via the retriever). The similarity metric is cosine similarity.
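To make the top-k idea concrete, here is a minimal pure-Python sketch of cosine scoring and top-k selection. Pinecone performs this search at scale with its own index structures; the function names and the toy 3-dimensional vectors below are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=5):
    """Return the k (score, chunk_id) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunks.items()]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional vectors; real embeddings here have 3072 dimensions.
chunks = {
    "returns": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "warranty": [0.7, 0.3, 0.1],
}
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```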
4. Flashrank reranking
The top-k Pinecone results are reranked by Flashrank, a lightweight reranking library that runs cross-encoder models, which score relevance more accurately than cosine similarity alone. The default model is ms-marco-MiniLM-L-12-v2, and only the top 3 results are kept.
Reranking improves answer quality significantly — especially for long documents where many chunks may be superficially similar but only a few are truly relevant.
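The retrieve-then-rerank shape can be sketched as follows. This does not call Flashrank: `overlap_score` is a deliberately crude stand-in for a cross-encoder, and both function names and the sample chunks are assumptions for the example.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Rescore each candidate against the query and keep the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def overlap_score(query, text):
    """Stand-in scorer: fraction of query words that appear in the text.

    A real cross-encoder reads the query and the text together and
    outputs a learned relevance score; this is only a placeholder.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


# Five candidates, as if returned by the similarity search (toy data).
candidates = [
    "Shipping takes five business days.",
    "Damaged items can be returned for a full refund.",
    "Items can be returned within 30 days.",
    "Our office is closed on public holidays.",
    "Gift cards cannot be returned.",
]
best = rerank("can damaged items be returned", candidates, overlap_score)
print(best)
```

The key design point is that the second stage scores each (query, chunk) pair directly, which is more expensive per item and is therefore applied only to the small candidate set.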
Pinecone namespaces
Use namespaces to partition documents within a single index. Searches are scoped to the namespace you configure:
from langchat.providers import Pinecone

# Index documents in separate namespaces
# (lc is assumed to be an already-configured LangChat instance)
lc.index("products/", namespace="products")
lc.index("policies/", namespace="policies")

# Search only the products namespace
vector_db = Pinecone("my-index")
# (namespace is set at indexing time; retrieval uses the same namespace)
Namespaces are useful for:
- Separating different clients in a multi-tenant app
- Partitioning by language or region
- Separating document types (e.g., products vs. policies)
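The scoping behaviour can be pictured with a small in-memory stand-in. `ToyIndex` is a hypothetical class written for this example, not part of LangChat or Pinecone, and it uses a keyword match in place of vector similarity.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for a vector index partitioned by namespace."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def upsert(self, namespace, chunk):
        """Store a chunk in the given namespace."""
        self._partitions[namespace].append(chunk)

    def search(self, namespace, keyword):
        """Scan only the requested namespace; other partitions are ignored."""
        return [chunk for chunk in self._partitions.get(namespace, [])
                if keyword.lower() in chunk.lower()]


index = ToyIndex()
index.upsert("products", "The X100 blender has a two-year warranty.")
index.upsert("policies", "Warranty claims require proof of purchase.")

product_hits = index.search("products", "warranty")
print(product_hits)
```

Both chunks mention warranties, but a search in one namespace never sees the other, which is what makes namespaces suitable for tenant isolation.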
Changing the retriever depth
The default retriever fetches k=5 chunks before reranking. Fetching more candidates before reranking improves recall at the cost of latency.
The retriever depth is controlled by the PineconeVectorAdapter internals. For advanced customization, see Extending Adapters.
Embedding model choice
| Model | Dimensions | Quality | Cost |
|---|---|---|---|
| text-embedding-3-large | 3072 | Highest | ~2× more than small |
| text-embedding-3-small | 1536 | Good | Lower |
| text-embedding-ada-002 | 1536 | Older baseline | Similar to small |
Configure the embedding model on the Pinecone provider:
from langchat.providers import Pinecone
vector_db = Pinecone("my-index", embedding_model="text-embedding-3-small")
You must use the same embedding model for both indexing and retrieval. If you change the model, re-index all documents with the new model in a fresh Pinecone index.
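One way to enforce this rule is a small guard that runs before querying. The sketch below is an assumption-laden example: `check_compatible` is a hypothetical helper, and it presumes you record which model an index was built with. The dimensions come from the table above.

```python
# Output dimensions from the table above.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}


def check_compatible(index_model, query_model):
    """Raise ValueError unless both sides use the same embedding model."""
    if index_model != query_model:
        raise ValueError(
            f"Index was built with {index_model!r} "
            f"({MODEL_DIMENSIONS[index_model]} dims) but queries use "
            f"{query_model!r} ({MODEL_DIMENSIONS[query_model]} dims); "
            "re-index before switching models."
        )


check_compatible("text-embedding-3-small", "text-embedding-3-small")  # passes
```

The guard compares model names rather than dimensions on purpose: text-embedding-3-small and text-embedding-ada-002 both produce 1536-dimensional vectors, so a dimension check alone would let that mismatch through.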
When there’s no relevant context
If Pinecone returns no relevant results (low similarity scores), the LLM still receives the prompt — but the {context} placeholder will be empty or contain low-quality chunks. This can lead to hallucinated answers.
Best practices:
- Always tell the model what to do when context is missing: “If the answer is not in the context, say you don’t know.”
- Ensure documents are indexed before going live
- Monitor queries that return empty context (visible in Supabase request_metrics)
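The practices above can be combined into a prompt builder that drops weak chunks and always includes a fallback instruction. This is a sketch, not LangChat's prompt template: `build_prompt`, the `min_score` threshold of 0.5, and the placeholder text are all assumptions chosen for the example.

```python
FALLBACK_RULE = "If the answer is not in the context, say you don't know."


def build_prompt(question, scored_chunks, min_score=0.5):
    """Assemble a prompt from chunks whose score meets min_score.

    `scored_chunks` is a list of (score, text) pairs.
    Returns (prompt, has_context) so callers can log empty-context queries.
    The 0.5 default is arbitrary; tune it against your own data.
    """
    kept = [text for score, text in scored_chunks if score >= min_score]
    context = "\n\n".join(kept) if kept else "(no relevant context found)"
    prompt = (
        f"{FALLBACK_RULE}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return prompt, bool(kept)


prompt, has_context = build_prompt(
    "What is the return policy?",
    [(0.2, "Unrelated text.")],
)
print(has_context)
print(prompt)
```

Returning the `has_context` flag alongside the prompt gives the caller a simple hook for counting or logging queries that had nothing useful to retrieve.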