RAG Value Chain: Retrieval Strategies in Information Augmentation for Large Language Models

Abhinav Kimothi
7 min read · Jan 1, 2024

Perhaps the most critical step in the entire RAG value chain is searching and retrieving the relevant pieces of information (known as documents).

When the user enters a query or a prompt, this system (the Retriever) is responsible for accurately fetching the snippets of information used to respond to the query.

According to LangChain’s 2023 State of AI survey, Self Query, Contextual Compression, Multi-query and Time-weighted retrieval were among the six most used retrieval strategies.

Source : LangChain State of AI 2023

Note : For more context on RAG, please read one of my previous blogs

Popular Retrieval Methods

Similarity Search

The similarity search functionality of vector databases forms the backbone of a Retriever. Similarity is computed as the distance between the embedding vector of the input query and the embedding vectors of the documents.

Maximum Marginal Relevance

MMR addresses redundancy in retrieval. It scores each candidate document by how much new information it adds given the results already selected, reducing redundancy in the result set while maintaining relevance to the query.

Multi-query Retrieval

Multi-query Retrieval automates prompt tuning by using a language model to generate several diverse queries for a single user input, retrieving relevant documents for each generated query, and combining the results into a more comprehensive set. This approach aims to enhance retrieval performance by considering multiple perspectives on the same question.

Contextual compression

Sometimes the relevant information is buried in long documents alongside a lot of irrelevant text. Contextual Compression addresses this by compressing the retrieved documents down to only the parts that are relevant to the query.

Multi Vector Retrieval

Sometimes it makes sense to store more than one vector per document, e.g. a chapter, its summary and a few key quotes. Retrieval becomes more effective because the query can match against all the different types of information that have been embedded.
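
As a rough sketch, LangChain’s MultiVectorRetriever supports this pattern: here, short LLM-generated summaries are embedded while the full chunks are what get returned. The summarisation prompt, chunk sizes and collection name are my own assumptions; the file path and the openai_api_key variable are the same ones used in the examples further down.

import uuid
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema import Document
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# generate a short summary for each chunk (illustrative prompt) and tag it with its parent id
llm = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)
summaries = [
    Document(page_content=llm.predict(f"Summarise this text in two sentences:\n\n{d.page_content}"),
             metadata={"doc_id": doc_ids[i]})
    for i, d in enumerate(docs)
]

# the vector store holds the summary embeddings; the docstore holds the full chunks
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings(openai_api_key=openai_api_key))
store = InMemoryStore()
store.mset(list(zip(doc_ids, docs)))

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key="doc_id")
retriever.vectorstore.add_documents(summaries)

# the query is matched against the summaries, but the full chunks are returned
query = "What did Andrej say about LLM operating system?"
print(retriever.get_relevant_documents(query)[0].page_content)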

Parent Document Retrieval

In breaking down documents for retrieval, there’s a trade-off: small chunks capture meaning better in embeddings, but if they’re too short, context is lost. Parent Document Retrieval strikes a middle ground by embedding and indexing small chunks; during retrieval, it matches these chunks and then fetches the larger parent documents they came from using their parent IDs.
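
LangChain ships a ParentDocumentRetriever for exactly this pattern. The sketch below reuses the same sample file as the examples further down; the parent and child chunk sizes are illustrative assumptions.

from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# large chunks preserve context; small chunks embed more precisely
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma(collection_name="parents", embedding_function=embedding_function)
store = InMemoryStore()  # keeps the parent chunks, keyed by id

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)

# the small chunks are matched, but their parent chunks are returned
query = "What did Andrej say about LLM operating system?"
print(retriever.get_relevant_documents(query)[0].page_content)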

Self Query

A self-querying retriever is a system that can query itself. When you give it a question in natural language, it uses an LLM to turn that question into a structured query, which it then runs against its vector store. This way, it doesn’t just compare your question with the documents semantically; it can also filter on specific metadata fields mentioned in your question, making the search more efficient and accurate.
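
Here is a minimal sketch with LangChain’s SelfQueryRetriever (it needs the lark package installed). Because self-querying relies on metadata, the documents and their speaker/year fields below are hypothetical, and the field descriptions are my own assumptions.

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema import Document
from langchain.vectorstores import Chroma

# hypothetical documents with metadata the retriever can filter on
docs = [
    Document(page_content="Andrej describes the LLM as the kernel of a new kind of operating system.",
             metadata={"speaker": "Andrej", "year": 2023}),
    Document(page_content="An overview of classic operating system design.",
             metadata={"speaker": "Someone else", "year": 2015}),
]
db = Chroma.from_documents(docs, OpenAIEmbeddings(openai_api_key=openai_api_key))

# describe the content and metadata so the LLM can build a structured query
metadata_field_info = [
    AttributeInfo(name="speaker", description="Who gave the talk", type="string"),
    AttributeInfo(name="year", description="The year the talk was given", type="integer"),
]
llm = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)
retriever = SelfQueryRetriever.from_llm(
    llm, db, "Transcripts of talks about large language models", metadata_field_info
)

# the question is converted into a semantic query plus a metadata filter (e.g. year = 2023)
docs = retriever.get_relevant_documents("What did Andrej say about the LLM operating system in 2023?")
print(docs[0].page_content)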

Time-weighted Retrieval

This method supplements semantic similarity search with a time decay, giving more weight to documents that are fresher or more recently accessed than older ones.
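
A minimal sketch using LangChain’s TimeWeightedVectorStoreRetriever over an empty FAISS index. The decay_rate and the two toy documents are illustrative assumptions; a higher decay rate makes recency count for more.

import faiss
from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS

# empty FAISS index (1536 dimensions for OpenAI embeddings)
embedding_function = OpenAIEmbeddings(openai_api_key=openai_api_key)
index = faiss.IndexFlatL2(1536)
db = FAISS(embedding_function.embed_query, index, InMemoryDocstore({}), {})

# higher decay_rate -> freshness matters more; lower -> older documents keep their score longer
retriever = TimeWeightedVectorStoreRetriever(vectorstore=db, decay_rate=0.6, k=1)

# one document last accessed yesterday, one added just now
yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="Older notes on the LLM operating system",
                                  metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="Fresh notes on the LLM operating system")])

# semantic similarity is combined with a recency score, so the fresh document ranks first
print(retriever.get_relevant_documents("LLM operating system")[0].page_content)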

Ensemble Techniques

As the term suggests, multiple retrieval methods can be used in conjunction with each other. There are many ways of implementing ensemble techniques, and the use case will define the structure of the retriever.
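
For example, LangChain’s EnsembleRetriever can blend a keyword-based BM25Retriever (which needs the rank_bm25 package) with a dense vector retriever, combining the ranked lists using reciprocal rank fusion. The weights and k values in the sketch below are assumptions; the file path is the same one used in the examples further down.

from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load and chunk the document as in the examples below
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# a sparse (keyword) retriever and a dense (semantic) retriever
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vector_retriever = Chroma.from_documents(docs, embedding_function).as_retriever(search_kwargs={"k": 3})

# blend the two ranked result lists with equal weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

query = "What did Andrej say about LLM operating system?"
print(ensemble_retriever.get_relevant_documents(query)[0].page_content)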

Example : Similarity Search using LangChain

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using all-MiniLM-L6-v2
  • Storing the embeddings into Chromadb
  • Retrieving chunks using similarity_search
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did Andrej say about LLM operating system?"


# distance based search
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

Example : Similarity Vector Search

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using all-MiniLM-L6-v2
  • Storing the embeddings into Chromadb
  • Converting input query into a vector embedding
  • Retrieving chunks using similarity_search_by_vector
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did Andrej say about LLM operating system?"

# convert query to embedding
query_vector=embedding_function.embed_query(query)

# distance based search
docs = db.similarity_search_by_vector(query_vector)

# print results
print(docs[0].page_content)

Example : Maximum Marginal Relevance

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using OpenAI Embeddings
  • Storing the embeddings into Qdrant
  • Retrieving and ranking chunks using max_marginal_relevance_search
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)

# create the openai embedding function
embedding_function = OpenAIEmbeddings(openai_api_key=openai_api_key)

# load it into Qdrant
db = Qdrant.from_documents(docs, embedding_function, location=":memory:", collection_name="my_documents")

# query it
query = "What did Andrej say about LLM operating system?"

# max marginal relevance search
docs = db.max_marginal_relevance_search(query,k=2, fetch_k=10)

# print results
for i, doc in enumerate(docs):
    print(f"{i + 1}.", doc.page_content, "\n")

Example : Multi-query Retrieval

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using OpenAI Embeddings
  • Storing the embeddings into Qdrant
  • Set the LLM as ChatOpenAI (gpt-3.5)
  • Set up logging to see the query variations generated by the LLM
  • Use the MultiQueryRetriever & get_relevant_documents functions
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chat_models import ChatOpenAI

# load the document and split it into chunks
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# create the openai embedding function
embedding_function = OpenAIEmbeddings(openai_api_key=openai_api_key)

# load it into Qdrant
db = Qdrant.from_documents(docs, embedding_function, location=":memory:",collection_name="my_documents")

# query it
query = "What did Andrej say about LLM operating system?"

# set the LLM for multiquery
llm = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)


# Multiquery retrieval using OpenAI
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)

# set up logging to see the queries generated
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# retrieved documents
unique_docs = retriever_from_llm.get_relevant_documents(query=query)

# print results
for i, doc in enumerate(unique_docs):
    print(f"{i + 1}.", doc.page_content, "\n")

Example : Contextual compression

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using OpenAI Embeddings
  • Set up retriever as FAISS
  • Set the LLM as OpenAI
  • Use LLMChainExtractor as the compressor
  • Use the ContextualCompressionRetriever & get_relevant_documents functions
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# load and split text
loader = TextLoader('../Data/AK_BusyPersonIntroLLM.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
# save as vector embeddings
retriever = FAISS.from_documents(
    docs, OpenAIEmbeddings(openai_api_key=openai_api_key)
).as_retriever()

# use a compressor
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever)

query = "What did Andrej say about LLM operating system?"

# retrieve docs
compressed_docs = compression_retriever.get_relevant_documents(query)

# print docs
for i, doc in enumerate(compressed_docs):
    print(f"{i + 1}.", doc.page_content, "\n")

The dynamic landscape of retrieval methods, as highlighted by LangChain’s 2023 State of AI survey, showcases the ongoing advancements in the field. By understanding and implementing these strategies, practitioners can navigate the intricacies of document retrieval, enabling the generation of more informed and contextually relevant responses.

