Using RAG-Fusion for Better Context
Explore the post-retrieval process of RAG models by understanding and implementing RAG-Fusion. Learn how generating multiple query variations and using reciprocal rank fusion refines retrieved documents for better context. Gain hands-on experience with LangChain to apply these techniques and generate more accurate responses.
Suppose we’re searching for information online. We type in our query, and the system returns a list of results. But are they truly the most relevant? Traditional ranking algorithms often prioritize factors like keyword matching, which can miss the deeper meaning of a search. This is where reranking comes in.
What is reranking?
Reranking is a two-stage retrieval process that improves the relevance of search results. Here’s how it works:
Initial retrieval: A primary system, like a search engine, retrieves a large pool of potentially relevant items based on keywords or other factors.
Refining the list: A reranking model, often powered by machine learning, analyzes each item in the pool and assigns a new score based on its true relevance to the user’s query. This score can consider factors like semantic similarity and user context.
Reordered results: Finally, the items are reordered based on their new scores, presenting the most relevant results at the top.
Types of reranking techniques
Several innovative techniques can be employed for reranking. Let’s explore two prominent approaches:
RAG-Fusion (Retrieval-Augmented Generation Fusion): This technique combines two models: a retriever that finds potentially relevant documents and a generative model that understands the query’s intent. RAG-Fusion leverages the strengths of both, often using a reranker to improve the final selection of documents for the generative model to process.
Cross-Encoder Reranking: Here, a separate model called a cross-encoder takes the query and each retrieved item as input. It then outputs a score indicating how well the item matches the user’s intent. This score reranks the initial list and presents the most semantically similar items at the top.
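To make the second approach concrete, here is a minimal sketch using the open-source sentence-transformers library. The model name, query, and candidate texts are illustrative placeholders, not part of this lesson's code:

```python
from sentence_transformers import CrossEncoder

# A hypothetical pretrained cross-encoder fine-tuned for passage reranking.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does reranking improve search results?"
candidates = [
    "Reranking reorders retrieved documents by semantic relevance to the query.",
    "The weather today is sunny with light winds.",
]

# Score each (query, document) pair; higher scores mean a better match.
scores = model.predict([(query, doc) for doc in candidates])

# Present the most semantically similar items first.
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```

Unlike an embedding model, the cross-encoder reads the query and the document together, which is slower but usually more accurate for the final ordering.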
What is RAG-Fusion?
RAG-Fusion combines retrieval (finding relevant documents) with generation (formulating queries). It leverages an LLM to create these query variations based on the user’s original question. Using an LLM, RAG-Fusion can capture the nuances of language and generate queries that effectively represent the user’s intent.
RAG-Fusion is a technique that builds on top of RAG models to improve search results, particularly in the context of chatbots. Here’s a breakdown of how it works:
Understanding the user’s intent: RAG-Fusion starts with a user query. Like RAG models, it aims to understand the true intent behind the question.
Generating multiple queries: RAG-Fusion goes beyond a single query. It uses the original query to create multiple variations, essentially rephrasing the question from different angles. This helps capture the nuances of the user’s intent.
Retrieval with embedding: The original and generated queries are converted into a numerical representation using embedding models. This allows for efficient searching within a document collection or knowledge base. Documents relevant to each query are retrieved.
Reciprocal rank fusion (RRF): RAG-Fusion then employs a reciprocal rank fusion (RRF) technique. RRF assigns scores based on how well retrieved documents match each query. Documents with high scores across multiple queries are likely to be more relevant to the user’s intent.
Fusing documents and scores: Finally, RAG-Fusion combines the retrieved documents and their corresponding scores. This provides a richer set of information that can be used to formulate a response.
Step-by-step implementation
Now, let’s dive into the code implementation of RAG-Fusion. The first five steps are similar to the Multi-Query technique; the libraries and modules introduced below are essential for the subsequent steps in the process.
1. Import necessary libraries
We’ll import the required modules from the installed libraries to implement RAG-Fusion:
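Since the lesson’s snippet isn’t reproduced here, the following is a plausible set of imports consistent with the steps that follow; exact package paths vary across LangChain releases (these assume the `langchain_community`/`langchain_openai` split), so adjust them to your installed versions:

```python
import os
from operator import itemgetter

# Document loading and splitting.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Vector store and embeddings.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Prompting, chat model, and output parsing.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
```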
2. Set up the LangSmith and OpenAI API keys
The following code snippet sets up your LangChain API key and OpenAI API key from environment variables. We’ll need valid API keys to interact with the LangChain and OpenAI language models:
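A sketch of the configuration described below, assuming `os` was imported in the previous step; the empty strings are placeholders you must replace with your own keys:

```python
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = ""  # replace with your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "RAG-Fusion"

os.environ["OPENAI_API_KEY"] = ""  # replace with your OpenAI API key
if not os.environ["OPENAI_API_KEY"]:
    raise ValueError("Please provide a valid OpenAI API key.")
```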
Code explanation:
- Lines 1–4: Set up the LangChain environment variables:
  - `LANGCHAIN_TRACING_V2`: Enables tracing for LangChain operations.
  - `LANGCHAIN_ENDPOINT`: Specifies the endpoint for the LangChain API.
  - `LANGCHAIN_API_KEY`: An empty string placeholder for the LangSmith API key. Replace it with your actual key.
  - `LANGCHAIN_PROJECT`: Sets the project name for LangChain operations to `'RAG-Fusion'`.
- Lines 6–8: Set up the OpenAI API key:
  - `OPENAI_API_KEY`: An empty string placeholder for the OpenAI API key. Replace it with your actual key.
  - Validation: Checks whether `OPENAI_API_KEY` is empty and raises a `ValueError` if it is, ensuring a valid API key is provided for authenticating OpenAI API requests.
3. Load and split documents
Now, let’s load the text documents we want to use for retrieval and split them:
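The lesson’s snippet isn’t shown here, so the following is a sketch consistent with the explanation below. The file names are hypothetical, and the chunk size and overlap are placeholder values since the lesson’s exact numbers aren’t reproduced:

```python
# Hypothetical file paths -- replace with your own documents.
loaders = [
    TextLoader("document1.txt"),
    TextLoader("document2.txt"),
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
```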
Code explanation:
- Lines 1–4: Loaders are defined to read text files using `TextLoader`, specifying the file paths of the documents to be loaded.
- Lines 6–8: An empty list `docs` is created, and a loop iterates over the loaders, loading the content of each document and extending the `docs` list with the loaded content.
- Lines 10–11: A `RecursiveCharacterTextSplitter` is initialized with a chunk size and an overlap (both in characters) between chunks. The splitter then processes the `docs` list, splitting each document into smaller chunks suitable for processing by large language models (LLMs).
4. Index documents
After splitting the text, we create a vector store to efficiently store and retrieve document chunks. Additionally, we generate embeddings for each chunk to capture its semantic meaning:
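A minimal sketch matching the explanation below, building on the `splits` list from the previous step:

```python
# Embed the chunks and index them in a Chroma vector store.
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
```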
Code explanation:
- Line 1: We use `Chroma` to create the vector store (`vectorstore`). Chroma is a library designed for efficient document storage and retrieval. We call `from_documents` to populate the store with our prepared text chunks (`splits`). To capture the meaning of each chunk, we generate embeddings using `OpenAIEmbeddings`. Embeddings are numerical representations that capture the semantic relationships between words in a text snippet. This allows the vector store to efficiently retrieve documents relevant to a user’s query by comparing the query’s embedding to the embeddings of the stored chunks.
- Line 2: We convert the vector store into a retriever using `as_retriever()`. This allows us to retrieve documents based on a query embedding.
5. RAG-Fusion: Query generation
The following code snippet dives into the core concept of RAG-Fusion: generating multiple query variations based on the user’s original question.
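The snippet isn’t reproduced here, so this is a sketch consistent with the explanation below; the exact prompt wording is an assumption, though it keeps the five-variation goal and the `"Original question: {question}"` output format the lesson describes:

```python
template = """You are a helpful assistant for a vector search engine.
Generate five different versions of the user's question to retrieve
relevant documents from a vector database. Provide the alternative
questions separated by newlines, starting with the line:
Original question: {question}"""

rag_fusion_prompt_template = ChatPromptTemplate.from_template(template)

generate_queries = (
    rag_fusion_prompt_template
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)
```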
Code explanation:
- Lines 1–5: We define a multi-line string variable `template` that acts as a prompt for the LLM (`ChatOpenAI` in this case). The prompt instructs the LLM to act as an assistant for a vector search engine. It provides the user’s question (`{question}`) and the goal of generating five alternative queries. These variations should capture different aspects of the user’s intent to improve retrieval. The prompt also specifies the desired output format: each variation on a new line, following the `"Original question: {question}"` format.
- Line 7: We use `ChatPromptTemplate.from_template(template)` to convert the string template into a structured `ChatPromptTemplate` object.
- Lines 9–14: We create a processing chain named `generate_queries`, which involves several steps:
  - The chain starts with the `rag_fusion_prompt_template` object.
  - It then uses `ChatOpenAI(temperature=0)`. `ChatOpenAI` refers to the OpenAI language model, and `temperature=0` ensures deterministic output (the same results for a given prompt every time).
  - The generated text from the LLM is parsed using `StrOutputParser()`, converting it from a potentially complex data structure into a plain string.
  - Finally, the `lambda` function splits the string into a list of individual query variations using the newline character (`\n`) as the delimiter.
6. Retrieval with reciprocal rank fusion (RRF)
In this step, we will retrieve documents based on the generated query variations and refine the results using Reciprocal Rank Fusion (RRF):
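A sketch matching the line-by-line explanation below. Since the lesson’s default for `k` isn’t shown, this sketch uses `k=60`, the constant proposed in the original RRF paper; the block also builds on `generate_queries` and `retriever` from the earlier steps:

```python
def reciprocal_rank_function(results, k=60):
    """Fuse multiple ranked lists of documents with reciprocal rank fusion.

    k is a smoothing constant; 60 follows the original RRF paper.
    """
    fused_scores = {}

    # Iterate over each ranked list of documents.
    for docs in results:
        # Track each document together with its rank in the list.
        for rank, doc in enumerate(docs):
            # Use the document's string form as a dictionary key.
            doc_str = str(doc)
            # Documents seen for the first time start at a score of 0.
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Current cumulative score for this document.
            previous_score = fused_scores[doc_str]
            # RRF formula: each appearance adds 1 / (rank + k).
            fused_scores[doc_str] = previous_score + 1 / (rank + k)

    # Sort by fused score, highest first.
    reranked_results = [
        (doc, score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    # Each tuple pairs a document with its fused score.
    return reranked_results

question = "What is LangSmith, and why do we need it?"
retrieval_chain = generate_queries | retriever.map() | reciprocal_rank_function
docs = retrieval_chain.invoke({"question": question})
len(docs)
```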
Code explanation:
Lines 1–29: Reciprocal rank fusion function

- Line 1: Define the function `reciprocal_rank_function`, which takes a list of lists (`results`) containing ranked documents and an optional parameter `k`.
- Line 6: Initialize an empty dictionary `fused_scores` to store the cumulative scores for each unique document encountered during the fusion process.
- Line 9: Begin iterating over each list of ranked documents (`docs`) in the `results` list.
- Line 11: For each list of documents, iterate through each document (`doc`) and its corresponding rank (`rank`) using the `enumerate` function.
- Line 13: Convert the document to a unique string identifier (`doc_str`) using `str(doc)`.
- Lines 15–16: Check if the document is already in the `fused_scores` dictionary. If not, add it with an initial score of 0.
- Line 18: Retrieve the current score of the document from the `fused_scores` dictionary.
- Line 20: Update the score of the document using the RRF formula: `1 / (rank + k)`. Add this value to the current score in the `fused_scores` dictionary.
- Lines 23–26: After processing all documents, sort the `fused_scores` dictionary by score in descending order. Store the sorted results as a list of tuples (`reranked_results`), each containing a document and its fused score.
- Line 29: Return the `reranked_results` list, providing the reranked documents and their fused scores.
- Line 31: Define the user question (`question`) as `"What is LangSmith, and why do we need it?"` to serve as an example for the retrieval process.
- Line 32: Create a `retrieval_chain` by combining several components using the pipe (`|`) operator:
  - `generate_queries`: Generates multiple query variations.
  - `retriever.map()`: Applies each query variation to the document retriever, retrieving a ranked list of documents for each query.
  - `reciprocal_rank_function`: Combines and reranks the retrieved documents based on their scores across all query variations.
- Line 33: Invoke the `retrieval_chain` with the user question provided in a dictionary (`{"question": question}`). This triggers the entire retrieval and RRF process, storing the resulting documents in the `docs` variable.
- Line 34: Calculate the number of documents retrieved and reranked by the `retrieval_chain` using `len(docs)`.
Educative Byte: Using RRF, we can leverage the insights from each query variation to produce a more comprehensive and relevant set of retrieved documents. Documents that consistently rank high across multiple query variations are likely more relevant to the user’s intent, even if they don’t perfectly match the exact keywords used in any query.
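To make this concrete, here is a self-contained toy run of RRF on three fake ranked lists (`doc_A` through `doc_D` are made-up identifiers; `k=60` follows the original RRF paper):

```python
def rrf(ranked_lists, k=60):
    """Toy reciprocal rank fusion over lists of document identifiers."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs):
            # Each appearance contributes 1 / (rank + k) to the fused score.
            scores[doc] = scores.get(doc, 0) + 1 / (rank + k)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rrf([
    ["doc_A", "doc_B", "doc_C"],   # results for query variation 1
    ["doc_B", "doc_A", "doc_D"],   # results for query variation 2
    ["doc_B", "doc_C", "doc_A"],   # results for query variation 3
])

# doc_B comes out on top: it leads two of the three lists,
# even though doc_A also appears in all three.
print([doc for doc, _ in ranked])  # ['doc_B', 'doc_A', 'doc_C', 'doc_D']
```

Notice that no single list alone determines the final order; consistent high ranks across variations are what push a document to the top.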
7. Run the RAG model
The following code snippet demonstrates answer generation with RAG (Retrieval-Augmented Generation). Using an LLM, it leverages the retrieved documents (context) to answer the user’s question:
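A sketch consistent with the explanation below; the exact prompt wording is an assumption, and the block builds on `retrieval_chain`, `question`, and the imports from earlier steps:

```python
template = """Answer the following question based on this context:
{context}

Question: {question}"""

prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(temperature=0)

final_rag_chain = (
    {"context": retrieval_chain,
     "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question": question})
```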
Code explanation:
- Lines 1–4: Define a multi-line string variable `template` that acts as a prompt for the language model (LLM). This template instructs the LLM to answer the user’s question (`{question}`) based on the provided context (`{context}`), which consists of the retrieved documents relevant to the question.
- Line 6: Convert the string template into a structured `ChatPromptTemplate` object using `ChatPromptTemplate.from_template(template)`.
- Line 8: Initialize the LLM using `ChatOpenAI` with a temperature of 0, ensuring deterministic output (the same results for a given prompt every time).
- Lines 10–16: Create a new processing chain named `final_rag_chain` using the pipe (`|`) operator, which involves several components:
  - Lines 11–12: A dictionary provides two inputs: `"context"` refers to the output of the `retrieval_chain`, containing the retrieved documents, and `"question"` retrieves the user’s question using the `itemgetter("question")` function from the `operator` module.
  - Line 13: The `prompt` object contains the answer generation template.
  - Line 14: The `llm` processes the prompt, context, and question to generate a response.
  - Line 15: `StrOutputParser()` converts the LLM output into a plain string.
- Line 18: Call `final_rag_chain.invoke({"question": question})`, providing the user’s question as input. This triggers the entire answer generation process using the retrieved documents (context) and the user’s question. After processing the context and question, the final response from the LLM is the generated answer.
Educative Byte: By combining retrieval with answer generation, RAG utilizes the retrieved documents to provide a more informed and comprehensive answer to the user’s query.
LangSmith
LangSmith is a platform for tracing, debugging, and evaluating LLM applications. We’ll use it to visualize and understand the inner workings of our queries.
We’ll understand how the language model processes and responds to our prompts by examining the sub-queries, inputs, and outputs:
Try it yourself
You can practice executing this code yourself in the Jupyter Notebook below: