Using RAG-Fusion for Better Context
Explore the post-retrieval process of RAG models by understanding and implementing RAG-Fusion. Learn how generating multiple query variations and using reciprocal rank fusion refines retrieved documents for better context. Gain hands-on experience with LangChain to apply these techniques and generate more accurate responses.
Suppose we’re searching for information online. We type in our query, and the system returns a list of results. But are they truly the most relevant? Traditional ranking algorithms often prioritize factors like keyword matching, which can miss the deeper meaning of a search. This is where reranking comes in.
What is reranking?
Reranking is a two-stage retrieval process that improves the relevance of search results. Here’s how it works:
Initial retrieval: A primary system, like a search engine, retrieves a large pool of potentially relevant items based on keywords or other factors.
Refining the list: A reranking model, often powered by machine learning, analyzes each item in the pool and assigns a new score based on its true relevance to the user’s query. This score can consider factors like semantic similarity and user context.
Reordered results: Finally, the items are reordered based on their new scores, presenting the most relevant results at the top.
Types of reranking techniques
Several innovative techniques can be employed for reranking. Let’s explore two prominent approaches:
RAG-Fusion (Retrieval-Augmented Generation Fusion): This technique combines two models: a retriever that finds potentially relevant documents and a generative model that understands the query’s intent. RAG-Fusion leverages the strengths of both, often using a reranker to improve the final selection of documents for the generative model to process.
Cross-Encoder Reranking: Here, a separate model called a cross-encoder takes the query and each retrieved item as input. It then outputs a score indicating how well the item matches the user’s intent. This score reranks the initial list and presents the most semantically similar items at the top.
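To make the second approach concrete, here is a minimal sketch using the open-source sentence-transformers library. The model name, query, and candidate texts are illustrative placeholders, not part of this lesson's code:

```python
from sentence_transformers import CrossEncoder

# A hypothetical pretrained cross-encoder fine-tuned for passage reranking.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does reranking improve search results?"
candidates = [
    "Reranking reorders retrieved documents by semantic relevance to the query.",
    "The weather today is sunny with light winds.",
]

# Score each (query, document) pair; higher scores mean a better match.
scores = model.predict([(query, doc) for doc in candidates])

# Present the most semantically similar items first.
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```

Unlike an embedding model, the cross-encoder reads the query and the document together, which is slower but usually more accurate for the final ordering.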
What is RAG-Fusion?
RAG-Fusion combines retrieval (finding relevant documents) with generation (formulating queries). It leverages an LLM to create these query variations based on the user’s original question. Using an LLM, RAG-Fusion can capture the nuances of language and generate queries that effectively represent the user’s intent.
RAG-Fusion is a technique that builds on top of RAG models to improve search results, particularly in the context of chatbots. Here’s a breakdown of how it works:
Understanding the user’s intent: RAG-Fusion starts with a user query. Like RAG models, it aims to understand the true intent behind the question.
Generating multiple queries: RAG-Fusion goes beyond a single query. It uses the original query to create multiple variations, essentially rephrasing the question from different angles. This helps capture the nuances of the user’s intent.
Retrieval with embedding: The original and generated queries are converted into a numerical representation using embedding models. This allows for efficient searching within a document collection or knowledge base. Documents relevant to each query are retrieved.
Reciprocal rank fusion (RRF): RAG-Fusion then employs a reciprocal rank fusion (RRF) technique. RRF assigns scores based on how well retrieved documents match each query. Documents with high scores across multiple queries are likely to be more relevant to the user’s intent.
Fusing documents and scores: Finally, RAG-Fusion combines the retrieved documents and their corresponding scores. This provides a richer set of information that can be used to formulate a response.
Step-by-step implementation
Now, let’s dive into the code implementation of RAG-Fusion. The first five steps are similar to the Multi-Query technique; the libraries and modules introduced below are essential for the subsequent steps in the process.
1. Import necessary libraries
We’ll import the required modules from the installed libraries to implement RAG-Fusion:
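Since the lesson’s snippet isn’t reproduced here, the following is a plausible set of imports consistent with the steps that follow; exact package paths vary across LangChain releases (these assume the `langchain_community`/`langchain_openai` split), so adjust them to your installed versions:

```python
import os
from operator import itemgetter

# Document loading and splitting.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Vector store and embeddings.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Prompting, chat model, and output parsing.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
```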
2. Set up the LangSmith and OpenAI API keys
The following code snippet sets up your LangChain API key and OpenAI API key from environment variables. We’ll need valid API keys to interact with the LangChain and OpenAI language models:
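A sketch of the configuration described below, assuming `os` was imported in the previous step; the empty strings are placeholders you must replace with your own keys:

```python
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = ""  # replace with your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "RAG-Fusion"

os.environ["OPENAI_API_KEY"] = ""  # replace with your OpenAI API key
if not os.environ["OPENAI_API_KEY"]:
    raise ValueError("Please provide a valid OpenAI API key.")
```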
Code explanation:
- Lines 1–4: Set up the LangChain environment variables:
  - `LANGCHAIN_TRACING_V2`: Enables tracing for LangChain operations.
  - `LANGCHAIN_ENDPOINT`: Specifies the endpoint for the LangChain API.
  - `LANGCHAIN_API_KEY`: An empty string placeholder for the LangSmith API key. Replace it with your actual key.
  - `LANGCHAIN_PROJECT`: Sets the project name for LangChain operations to `'RAG-Fusion'`.
- Lines 6–8: Set up the OpenAI API key:
  - `OPENAI_API_KEY`: An empty string placeholder for the OpenAI API key. Replace it with your actual key.
  - Validation: Checks whether `OPENAI_API_KEY` is empty and raises a `ValueError` if it is, ensuring a valid API key is provided for authenticating OpenAI API requests.
3. Load and split documents
Now, let’s load the text documents we want to use for retrieval and split them:
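The lesson’s snippet isn’t shown here, so the following is a sketch consistent with the explanation below. The file names are hypothetical, and the chunk size and overlap are placeholder values since the lesson’s exact numbers aren’t reproduced:

```python
# Hypothetical file paths -- replace with your own documents.
loaders = [
    TextLoader("document1.txt"),
    TextLoader("document2.txt"),
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
```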
Code explanation:
- Lines 1–4: Loaders are defined to read text files using `TextLoader`, specifying the file paths of the documents to be loaded.
- Lines 6–8: An empty list `docs` is created, and a loop iterates over the loaders, loading the content of each document and extending the `docs` list with the loaded content.
- Lines 10–11: A `RecursiveCharacterTextSplitter` is initialized with a chunk size and an overlap (both in characters) between chunks. The splitter then processes the `docs` list, splitting each document into smaller chunks suitable for processing by large language models (LLMs).
4. Index documents
After splitting the text, we create a vector store to efficiently store and retrieve document chunks. Additionally, we generate embeddings for each chunk to capture its semantic meaning:
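A minimal sketch matching the explanation below, building on the `splits` list from the previous step:

```python
# Embed the chunks and index them in a Chroma vector store.
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
```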
Code explanation:
- Line 1: We use `Chroma` to create the vector store (`vectorstore`). Chroma is a library designed for efficient document storage and retrieval. We call `from_documents` to populate the store with our prepared text chunks (`splits`). To capture the meaning of each chunk, we generate embeddings using `OpenAIEmbeddings`. Embeddings are numerical representations that capture the semantic relationships between words in a text snippet. This allows the vector store to efficiently retrieve documents relevant to a user’s query by comparing the query’s embedding to the embeddings of the stored chunks.
- Line 2: We convert the vector store into a retriever using `as_retriever()`. This allows us to retrieve documents based on a query embedding.
5. RAG-Fusion: Query generation
The following code snippet dives into the core concept of RAG-Fusion: generating multiple query variations based on the user’s original question.
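The snippet isn’t reproduced here, so this is a sketch consistent with the explanation below; the exact prompt wording is an assumption, though it keeps the five-variation goal and the `"Original question: {question}"` output format the lesson describes:

```python
template = """You are a helpful assistant for a vector search engine.
Generate five different versions of the user's question to retrieve
relevant documents from a vector database. Provide the alternative
questions separated by newlines, starting with the line:
Original question: {question}"""

rag_fusion_prompt_template = ChatPromptTemplate.from_template(template)

generate_queries = (
    rag_fusion_prompt_template
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)
```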
Code explanation:
- Lines 1–5: We define a multi-line string variable `template` that acts as a prompt for the LLM (`ChatOpenAI` in this case). The prompt instructs the LLM to act as an assistant for a vector search engine. It provides the user’s question (`{question}`) and the goal of generating five alternative queries. These variations should capture different aspects of the user’s intent to improve retrieval. The prompt also specifies the desired output format: each variation on a new line, following the `"Original question: {question}"` format.
- Line 7: We use `ChatPromptTemplate.from_template(template)` to convert the string template into a structured `ChatPromptTemplate` object.
- Lines 9–14: We create a processing chain named `generate_queries`, which involves several steps:
  - The chain starts with the `rag_fusion_prompt_template` object.
  - It then uses `ChatOpenAI(temperature=0)`. `ChatOpenAI` refers to the OpenAI language model, and `temperature=0` ensures deterministic output (the same results for a given prompt every time).
  - The generated text from the LLM is parsed using `StrOutputParser()`, converting it from a potentially complex data structure into a plain string.
  - Finally, the `lambda` function splits the string into a list of individual query variations using the newline character (`\n`) as the delimiter.
6. Retrieval with reciprocal rank fusion (RRF)
In this step, we will retrieve documents based on the generated query variations and refine the results using Reciprocal Rank Fusion (RRF):
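A sketch matching the line-by-line explanation below. Since the lesson’s default for `k` isn’t shown, this sketch uses `k=60`, the constant proposed in the original RRF paper; the block also builds on `generate_queries` and `retriever` from the earlier steps:

```python
def reciprocal_rank_function(results, k=60):
    """Fuse multiple ranked lists of documents with reciprocal rank fusion.

    k is a smoothing constant; 60 follows the original RRF paper.
    """
    fused_scores = {}

    # Iterate over each ranked list of documents.
    for docs in results:
        # Track each document together with its rank in the list.
        for rank, doc in enumerate(docs):
            # Use the document's string form as a dictionary key.
            doc_str = str(doc)
            # Documents seen for the first time start at a score of 0.
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Current cumulative score for this document.
            previous_score = fused_scores[doc_str]
            # RRF formula: each appearance adds 1 / (rank + k).
            fused_scores[doc_str] = previous_score + 1 / (rank + k)

    # Sort by fused score, highest first.
    reranked_results = [
        (doc, score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    # Each tuple pairs a document with its fused score.
    return reranked_results

question = "What is LangSmith, and why do we need it?"
retrieval_chain = generate_queries | retriever.map() | reciprocal_rank_function
docs = retrieval_chain.invoke({"question": question})
len(docs)
```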
Code explanation:
Lines 1–29: Reciprocal rank fusion function

- Line 1: Define the function `reciprocal_rank_function`, which takes a list of lists (`results`) containing ranked documents and an optional parameter `k`.
- Line 6: Initialize an empty dictionary `fused_scores` to store the cumulative scores for each unique document encountered during the fusion process.
- Line 9: Begin iterating over each list of ranked documents (`docs`) in the `results` list.
- Line 11: For each list of documents, iterate through each document (`doc`) and its corresponding rank (`rank`) using the `enumerate` function.
- Line 13: Convert the document to a unique string identifier (`doc_str`) using `str(doc)`.
- Lines 15–16: Check if the document is already in the `fused_scores` dictionary. If not, add it with an initial score of 0.
- Line 18: Retrieve the current score of the document from the `fused_scores` dictionary.
- Line 20: Update the score of the document using the RRF formula: `1 / (rank + k)`. Add this value to the current score in the `fused_scores` dictionary.
- Lines 23–26: After processing all documents, sort the `fused_scores` dictionary by score in descending order. Store the sorted results as a list of tuples (`reranked_results`), each containing a document and its fused score.
- Line 29: Return the `reranked_results` list, providing the reranked documents and their fused scores.
- Line 31: Define the user question (`question`) as `"What is LangSmith, and why do we need it?"` to serve as an example for the retrieval process.
- Line 32: Create a `retrieval_chain` by combining several components using the pipe (`|`) operator:
  - `generate_queries`: Generates multiple query variations.
  - `retriever.map()`: Applies each query variation to the document retriever, retrieving a ranked list of documents for each query.
  - `reciprocal_rank_function`: Combines and reranks the retrieved documents based on their scores across all query variations.
- Line 33: Invoke the `retrieval_chain` with the user question provided in a dictionary (`{"question": question}`). This triggers the entire retrieval and RRF process, storing the resulting documents in the `docs` variable.
- Line 34: Calculate the number of documents retrieved and reranked by the `retrieval_chain` using `len(docs)`.
Educative Byte: Using RRF, we can leverage the insights from each query variation to produce a more comprehensive and relevant set of retrieved documents. Documents that consistently rank high across multiple query variations are likely more relevant to the user’s intent, even if they don’t perfectly match the exact keywords used in any query.
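To make this concrete, here is a self-contained toy run of RRF on three fake ranked lists (`doc_A` through `doc_D` are made-up identifiers; `k=60` follows the original RRF paper):

```python
def rrf(ranked_lists, k=60):
    """Toy reciprocal rank fusion over lists of document identifiers."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs):
            # Each appearance contributes 1 / (rank + k) to the fused score.
            scores[doc] = scores.get(doc, 0) + 1 / (rank + k)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rrf([
    ["doc_A", "doc_B", "doc_C"],   # results for query variation 1
    ["doc_B", "doc_A", "doc_D"],   # results for query variation 2
    ["doc_B", "doc_C", "doc_A"],   # results for query variation 3
])

# doc_B comes out on top: it leads two of the three lists,
# even though doc_A also appears in all three.
print([doc for doc, _ in ranked])  # ['doc_B', 'doc_A', 'doc_C', 'doc_D']
```

Notice that no single list alone determines the final order; consistent high ranks across variations are what push a document to the top.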
7. Run the RAG model
The following code snippet demonstrates answer generation with RAG (Retrieval-Augmented Generation). Using an LLM, it leverages the retrieved documents (context) to answer the user’s question:
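A sketch consistent with the explanation below; the exact prompt wording is an assumption, and the block builds on `retrieval_chain`, `question`, and the imports from earlier steps:

```python
template = """Answer the following question based on this context:
{context}

Question: {question}"""

prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(temperature=0)

final_rag_chain = (
    {"context": retrieval_chain,
     "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question": question})
```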
Code explanation:
- Lines 1–4: Define a multi-line string variable `template` that acts as a prompt for the language model (LLM). This template instructs the LLM to answer the user’s question (`{question}`) based on the provided context (`{context}`), which consists of the retrieved documents relevant to the question.
- Line 6: Convert the string template into a structured `ChatPromptTemplate` object using `ChatPromptTemplate.from_template(template)`.
- Line 8: Initialize the LLM using `ChatOpenAI` with a temperature of 0, ensuring deterministic output (the same results for a given prompt every time).
- Lines 10–16: Create a new processing chain named `final_rag_chain` using the pipe (`|`) operator, which involves several components:
  - Lines 11–12: A dictionary provides two inputs: `"context"` refers to the output of the `retrieval_chain`, containing the retrieved documents, and `"question"` retrieves the user’s question using the `itemgetter("question")` function from the `operator` module.
  - Line 13: The `prompt` object contains the answer generation template.
  - Line 14: The `llm` processes the prompt, context, and question to generate a response.
  - Line 15: `StrOutputParser()` converts the LLM output into a plain string.
- Line 18: Call `final_rag_chain.invoke({"question": question})`, providing the user’s question as input. This triggers the entire answer generation process using the retrieved documents (context) and the user’s question. After processing the context and question, the final response from the LLM is the generated answer.
Educative Byte: By combining retrieval with answer generation, RAG utilizes the retrieved documents to provide a more informed and comprehensive answer to the user’s query.
LangSmith
LangSmith is a platform for tracing, debugging, and evaluating LLM applications. We’ll use it to visualize and understand the inner workings of our queries.
We’ll understand how the language model processes and responds to our prompts by examining the sub-queries, inputs, and outputs:
Try it yourself
You can practice executing this code yourself in the Jupyter Notebook below: