RAG Value Chain: Retrieval Strategies in Information Augmentation for Large Language Models

Abhinav Kimothi
7 min read · Jan 1, 2024

Perhaps the most critical step in the entire RAG value chain is searching and retrieving the relevant pieces of information (known as documents).

When the user enters a query or a prompt, this system (the Retriever) is responsible for accurately fetching the snippets of information used to respond to the query.

According to LangChain’s 2023 State of AI survey, Self Query, Contextual Compression, Multi-query and Time-weighted retrieval were among the six most used retrieval strategies.

Source : LangChain State of AI 2023

Note : For more context on RAG, please read one of my previous blogs

Popular Retrieval Methods

Similarity Search

The similarity search functionality of vector databases forms the backbone of a Retriever. Similarity is computed as the distance between the embedding vector of the input query and the embedding vectors of the documents.

Maximum Marginal Relevance

MMR addresses redundancy in retrieval. It scores each candidate document by how much new information it adds given the results already selected, reducing redundancy in the result set while maintaining relevance to the query.

Multi-query Retrieval

Multi-query Retrieval automates prompt tuning by using a language model to generate several diverse queries for a single user input, retrieving relevant documents for each generated query, and combining the results into a more comprehensive set. This approach aims to enhance retrieval performance by considering multiple perspectives on the same question.

Contextual compression

Sometimes the relevant information is buried in long documents alongside a lot of irrelevant text. Contextual Compression addresses this by compressing the retrieved documents down to only the parts that are relevant to the query.

Multi Vector Retrieval

Sometimes it makes sense to store more than one vector per document, e.g. a chapter, its summary and a few key quotes. Retrieval becomes more effective because the query can match against all the different types of information that have been embedded.
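
As a rough sketch, LangChain’s MultiVectorRetriever supports this pattern: here, short LLM-generated summaries are embedded while the full chunks are what get returned. The summarisation prompt, chunk sizes and collection name are my own assumptions; the file path and the openai_api_key variable are the same ones used in the examples further down.

import uuid
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema import Document
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# generate a short summary for each chunk (illustrative prompt) and tag it with its parent id
llm = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)
summaries = [
    Document(page_content=llm.predict(f"Summarise this text in two sentences:\n\n{d.page_content}"),
             metadata={"doc_id": doc_ids[i]})
    for i, d in enumerate(docs)
]

# the vector store holds the summary embeddings; the docstore holds the full chunks
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings(openai_api_key=openai_api_key))
store = InMemoryStore()
store.mset(list(zip(doc_ids, docs)))

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key="doc_id")
retriever.vectorstore.add_documents(summaries)

# the query is matched against the summaries, but the full chunks are returned
query = "What did Andrej say about LLM operating system?"
print(retriever.get_relevant_documents(query)[0].page_content)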

Parent Document Retrieval

In breaking down documents for retrieval, there’s a trade-off: small chunks capture meaning better in embeddings, but if they’re too short, context is lost. Parent Document Retrieval strikes a middle ground by embedding and indexing small chunks; during retrieval, it matches these chunks and then fetches the larger parent documents they came from using their parent IDs.
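
LangChain ships a ParentDocumentRetriever for exactly this pattern. The sketch below reuses the same sample file as the examples further down; the parent and child chunk sizes are illustrative assumptions.

from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# large chunks preserve context; small chunks embed more precisely
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma(collection_name="parents", embedding_function=embedding_function)
store = InMemoryStore()  # keeps the parent chunks, keyed by id

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)

# the small chunks are matched, but their parent chunks are returned
query = "What did Andrej say about LLM operating system?"
print(retriever.get_relevant_documents(query)[0].page_content)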

Self Query

A self-querying retriever is a system that can query itself. When you give it a question in natural language, it uses an LLM to turn that question into a structured query, which it then runs against its vector store. This way, it doesn’t just compare your question with the documents semantically; it can also filter on specific metadata fields mentioned in your question, making the search more efficient and accurate.
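
Here is a minimal sketch with LangChain’s SelfQueryRetriever (it needs the lark package installed). Because self-querying relies on metadata, the documents and their speaker/year fields below are hypothetical, and the field descriptions are my own assumptions.

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema import Document
from langchain.vectorstores import Chroma

# hypothetical documents with metadata the retriever can filter on
docs = [
    Document(page_content="Andrej describes the LLM as the kernel of a new kind of operating system.",
             metadata={"speaker": "Andrej", "year": 2023}),
    Document(page_content="An overview of classic operating system design.",
             metadata={"speaker": "Someone else", "year": 2015}),
]
db = Chroma.from_documents(docs, OpenAIEmbeddings(openai_api_key=openai_api_key))

# describe the content and metadata so the LLM can build a structured query
metadata_field_info = [
    AttributeInfo(name="speaker", description="Who gave the talk", type="string"),
    AttributeInfo(name="year", description="The year the talk was given", type="integer"),
]
llm = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)
retriever = SelfQueryRetriever.from_llm(
    llm, db, "Transcripts of talks about large language models", metadata_field_info
)

# the question is converted into a semantic query plus a metadata filter (e.g. year = 2023)
docs = retriever.get_relevant_documents("What did Andrej say about the LLM operating system in 2023?")
print(docs[0].page_content)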

Time-weighted Retrieval

This method supplements semantic similarity search with a time decay, giving more weight to documents that are fresher or more recently accessed than older ones.
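
A minimal sketch using LangChain’s TimeWeightedVectorStoreRetriever over an empty FAISS index. The decay_rate and the two toy documents are illustrative assumptions; a higher decay rate makes recency count for more.

import faiss
from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS

# empty FAISS index (1536 dimensions for OpenAI embeddings)
embedding_function = OpenAIEmbeddings(openai_api_key=openai_api_key)
index = faiss.IndexFlatL2(1536)
db = FAISS(embedding_function.embed_query, index, InMemoryDocstore({}), {})

# higher decay_rate -> freshness matters more; lower -> older documents keep their score longer
retriever = TimeWeightedVectorStoreRetriever(vectorstore=db, decay_rate=0.6, k=1)

# one document last accessed yesterday, one added just now
yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="Older notes on the LLM operating system",
                                  metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="Fresh notes on the LLM operating system")])

# semantic similarity is combined with a recency score, so the fresh document ranks first
print(retriever.get_relevant_documents("LLM operating system")[0].page_content)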

Ensemble Techniques

As the term suggests, multiple retrieval methods can be used in conjunction with each other. There are many ways of implementing ensemble techniques, and the use case will define the structure of the retriever.
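
For example, LangChain’s EnsembleRetriever can blend a keyword-based BM25Retriever (which needs the rank_bm25 package) with a dense vector retriever, combining the ranked lists using reciprocal rank fusion. The weights and k values in the sketch below are assumptions; the file path is the same one used in the examples further down.

from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load and chunk the document as in the examples below
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# a sparse (keyword) retriever and a dense (semantic) retriever
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vector_retriever = Chroma.from_documents(docs, embedding_function).as_retriever(search_kwargs={"k": 3})

# blend the two ranked result lists with equal weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

query = "What did Andrej say about LLM operating system?"
print(ensemble_retriever.get_relevant_documents(query)[0].page_content)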

Example : Similarity Search using LangChain

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using all-MiniLM-L6-v2
  • Storing the embeddings into Chromadb
  • Retrieving chunks using similarity_search
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did Andrej say about LLM operating system?"


# distance based search
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

Example : Similarity Vector Search

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using all-MiniLM-L6-v2
  • Storing the embeddings into Chromadb
  • Converting input query into a vector embedding
  • Retrieving chunks using similarity_search_by_vector
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did Andrej say about LLM operating system?"

# convert query to embedding
query_vector=embedding_function.embed_query(query)

# distance based search
docs = db.similarity_search_by_vector(query_vector)

# print results
print(docs[0].page_content)

Example : Maximum Marginal Relevance

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using OpenAI Embeddings
  • Storing the embeddings into Qdrant
  • Retrieving and ranking chunks using max_marginal_relevance_search
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)

# create the openai embedding function
embedding_function = OpenAIEmbeddings(openai_api_key=openai_api_key)

# load it into Qdrant
db = Qdrant.from_documents(docs, embedding_function, location=":memory:", collection_name="my_documents")

# query it
query = "What did Andrej say about LLM operating system?"

# max marginal relevance search
docs = db.max_marginal_relevance_search(query,k=2, fetch_k=10)

# print results
for i, doc in enumerate(docs):
    print(f"{i + 1}.", doc.page_content, "\n")

Example : Multi-query Retrieval

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using OpenAI Embeddings
  • Storing the embeddings into Qdrant
  • Set the LLM as ChatOpenAI (gpt-3.5)
  • Set up logging to see the query variations generated by the LLM
  • Use the MultiQueryRetriever & get_relevant_documents functions
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chat_models import ChatOpenAI

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# create the openai embedding function
embedding_function = OpenAIEmbeddings(openai_api_key=openai_api_key)

# load it into Qdrant
db = Qdrant.from_documents(docs, embedding_function, location=":memory:",collection_name="my_documents")

# query it
query = "What did Andrej say about LLM operating system?"

# set the LLM for multiquery
llm = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)


# Multiquery retrieval using OpenAI
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)

# set up logging to see the queries generated
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# retrieved documents
unique_docs = retriever_from_llm.get_relevant_documents(query=query)

# print results
for i, doc in enumerate(unique_docs):
    print(f"{i + 1}.", doc.page_content, "\n")

Example : Contextual compression

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using OpenAI Embeddings
  • Set up retriever as FAISS
  • Set the LLM as OpenAI
  • Use LLMChainExtractor as the compressor
  • Use the ContextualCompressionRetriever & get_relevant_documents functions
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# load and split text
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
# save as vector embeddings
retriever = FAISS.from_documents(
    docs, OpenAIEmbeddings(openai_api_key=openai_api_key)
).as_retriever()

# use a compressor
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever)

query = "What did Andrej say about LLM operating system?"

# retrieve docs
compressed_docs = compression_retriever.get_relevant_documents(query)

# print docs
for i, doc in enumerate(compressed_docs):
    print(f"{i + 1}.", doc.page_content, "\n")

The dynamic landscape of retrieval methods, as highlighted by LangChain’s 2023 State of AI survey, showcases the ongoing advancements in the field. By understanding and implementing these strategies, practitioners can navigate the intricacies of document retrieval, enabling the generation of more informed and contextually relevant responses.

