Getting the Most from LLMs: Building a Knowledge Brain for Retrieval Augmented Generation

Abhinav Kimothi
14 min read · Dec 21, 2023
Source : Image by Author

The advancements in the LLM space have been mind-boggling. However, when it comes to using LLMs in real scenarios, we still grapple with the knowledge limitations and hallucinations of the LLMs.

Retrieval Augmented Generation becomes powerful as it provides additional memory and context, and increases the confidence in LLM responses.

The Curse of the LLMs

30th November 2022 will be remembered as a watershed moment in artificial intelligence. OpenAI released ChatGPT and the world was mesmerised. Interest in previously obscure terms like Generative AI and Large Language Models (LLMs) was unstoppable over the following 12 months.

Google Trends for Generative AI and LLMs

As usage exploded, so did expectations. Many users started treating ChatGPT as a source of information, like an alternative to Google. As a result, they also started encountering prominent weaknesses of the system. Setting aside concerns around copyright, privacy, security, the ability to do mathematical calculations and so on, people realised that there are two major limitations of Large Language Models.

A Knowledge Cut-off date

Training an LLM is an expensive and time-consuming process. LLMs are trained on massive amounts of data, and that data is therefore historical (or dated). For example, the latest GPT-4 model by OpenAI has knowledge only up to April 2023; information about any event after that date is not available to the model.

Hallucinations

Often, it was observed that LLMs provided responses that were factually incorrect. Despite being factually incorrect, the responses “sounded” extremely confident and legitimate. This characteristic of “lying with confidence” proved to be one of the biggest criticisms of ChatGPT and LLMs in general.

While the weaknesses of LLMs were being discussed, a parallel discourse around providing context to the models started. In essence, it meant creating a ChatGPT on proprietary data.

The Challenge

  • Make LLMs respond with up-to-date information
  • Make LLMs not respond with factually inaccurate information
  • Make LLMs aware of proprietary information

Providing Context

While model re-training, fine-tuning and reinforcement learning are options that address the aforementioned challenges, these approaches are time-consuming and costly. In the majority of use cases, these costs are prohibitive.

In May 2020, researchers, in the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, explored models that combine pre-trained parametric and non-parametric memory for language generation.

How does RAG help?

Unlimited Knowledge

The retriever of a RAG system can have access to external sources of information. Therefore, the LLM is not limited to its internal knowledge. The external sources can be proprietary documents and data, or even the internet.

Confidence in Responses

With the context (extra information that is retrieved) made available to the LLM, the confidence in LLM responses is increased.

  1. Context Awareness : Added information assists LLMs in generating responses that are accurate and contextually appropriate
  2. Source Citation : Access to sources of information improves the transparency of the LLM responses
  3. Reduced Hallucinations : RAG enabled LLM systems are observed to be less prone to hallucinations than the ones without RAG

RAG Architecture

In 2023, RAG became one of the most widely used techniques in the domain of Large Language Models.

Source : Image by Author

Step 1: User writes a prompt or a query that is passed to an orchestrator

Step 2: Orchestrator sends a search query to the retriever

Step 3: Retriever fetches the relevant information from the knowledge sources and sends it back

Step 4: Orchestrator augments the prompt with the context and sends it to the LLM

Step 5: LLM responds with the generated text, which is displayed to the user via the orchestrator
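To make the flow concrete, here is a minimal, framework-free sketch of the five steps. The retriever and llm objects and their methods are placeholders for whatever retrieval backend and model client you actually use, not a specific library’s API.

```python
# A sketch of the five RAG steps; retriever.search() and llm.generate() are
# hypothetical placeholders, not a specific library's API.
def rag_answer(query: str, retriever, llm) -> str:
    # Steps 1-3: fetch relevant context for the user's query
    context_chunks = retriever.search(query)
    context = "\n\n".join(context_chunks)

    # Step 4: augment the prompt with the retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

    # Step 5: the LLM generates a response, which is displayed to the user
    return llm.generate(prompt)
```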

Two pipelines become important in setting up a RAG system: the first sets up the knowledge sources for efficient search and retrieval, and the second carries out the five generation steps above.

Indexing Pipeline

Data for the knowledge source is ingested and indexed. This involves steps like splitting, creating embeddings and storing the data.

RAG Pipeline

This is the actual RAG process, which takes the user query at run time, retrieves the relevant data from the index, and then passes it to the model.

We will focus on the indexing pipeline in this blog.

Indexing Pipeline

The indexing pipeline sets up the knowledge source for the RAG system. It is generally considered an offline process. However, information can also be fetched in real time. It involves four primary steps.

Source : Image by Author
  1. Loading: This step involves extracting information from different knowledge sources and loading it into documents.
  2. Splitting: This step involves splitting documents into smaller manageable chunks. Smaller chunks are easier to search and to use in LLM context windows.
  3. Embedding: This step involves converting text documents into numerical vectors. ML models are mathematical models and therefore require numerical data.
  4. Storing: This step involves storing the embedding vectors. Vectors are typically stored in vector databases, which are best suited for similarity search.

Let’s look at all these one by one

Loading Data

As we’ve been discussing, the utility of RAG is to access data from all sorts of sources. These sources can be -

  • Websites & HTML pages
  • Documents like Word, PDF etc.
  • Code in Python, Java etc.
  • Data in JSON, CSV etc.
  • APIs
  • File Directories
  • Databases
  • And many more

The first step is to extract the information present in these source locations.

This is a good time to introduce two popular frameworks that are being used to develop LLM powered applications.

Source : Image by Author

LangChain

Use cases: Good for applications that need enhanced AI capabilities, like language understanding tasks and more sophisticated text generation

Features: Stands out for its versatility and adaptability in building robust applications with LLMs

Agents: Makes creating agents using large language models simple through their agents API

LlamaIndex

Use cases: Good for tasks that require text search and retrieval, like information retrieval or content discovery

Features: Excels in data indexing and language model enhancement

Connectors: Provides connectors to access data from databases, external APIs, or other datasets

Both frameworks are rapidly evolving and adding new capabilities every week. It’s not an either/or situation and you can use both together (or neither).

Both LangChain and LlamaIndex offer loader integrations with more than a hundred data sources, and the list keeps growing.

These Document Loaders are particularly helpful in quickly making connections and accessing information. For specific sources, custom loaders can also be developed.

It is worthwhile exploring documentation for both

  • Loading documents from a list of sources may turn out to be a complicated process. Make sure to plan for all the sources and loaders in advance.
  • More often than not, transformations/clean-ups to the loaded data will be required, like removing duplicate content, HTML parsing, etc. LangChain also provides a variety of document transformers.
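To make the loading step concrete, here is a minimal sketch using a few LangChain loaders. The file names and URL are placeholders, and PyPDFLoader additionally requires the pypdf package.

```python
from langchain.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader

documents = []
documents += TextLoader("notes.txt").load()                    # plain text file
documents += PyPDFLoader("annual_report.pdf").load()           # one Document per PDF page
documents += WebBaseLoader("https://example.com/post").load()  # HTML page

# Each Document carries the extracted text plus metadata about its source
print(len(documents), documents[0].metadata)
```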

Document Splitting

Once the data is loaded, the next step in the indexing pipeline is splitting the documents into manageable chunks. Why is splitting documents necessary? There are two reasons -

  1. Ease of Search: Large chunks of data are harder to search over. Splitting data into smaller chunks therefore helps in better indexation.
  2. Context Window Size: LLMs allow only a finite number of tokens in prompts and completions. The context therefore cannot be larger than what the context window permits.

Chunking Strategies

While splitting documents into chunks might sound like a simple concept, there are certain best practices that researchers have discovered. A few considerations may influence the overall chunking strategy.

Nature of Content: Consider whether you are working with lengthy documents, such as articles or books, or shorter content like tweets or instant messages. The answer will determine which model is better suited to your goal and, consequently, the appropriate chunking strategy.

Embedding Model being Used: We will discuss embeddings in detail in the next section, but the choice of embedding model also dictates the chunking strategy. Some models perform better with chunks of a specific length.

Expected Length and Complexity of User Queries: Determine whether user queries will be short and specific or long and complex. This factor will influence the approach to chunking the content, ensuring a closer correlation between the embedded query and the embedded chunks.

Application-Specific Requirements: The application use case, such as semantic search, question answering, summarization, or other purposes, will also determine how text should be chunked. If the results need to be fed into another language model with a token limit, it is crucial to factor this into your decision-making process.

Chunking Methods

Depending on the aforementioned considerations, a number of text splitters are available. At a broad level, text splitters operate in the following manner:

  • Divide the text into compact, semantically meaningful units, often sentences.
  • Merge these smaller units into larger chunks until a specific size is achieved, measured by a length function.
  • Upon reaching the predetermined size, treat that chunk as an independent segment of text. Thereafter, start creating a new text chunk with some degree of overlap to maintain contextual continuity between chunks.

Two areas to focus on, therefore, are -

  1. How is the text split?
  2. How is the chunk size measured?

A very common approach is to pre-determine the size of the text chunks.

Additionally, we can specify the overlap between chunks (Remember, overlap is preferred to maintain contextual continuity between chunks).

This approach is simple and cheap and is, therefore, widely used. Let’s look at some examples -

A. Split by Character

In this approach, the text is split based on a character and the chunk size is measured by the number of characters.
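A minimal sketch with LangChain’s CharacterTextSplitter (the separator, chunk size and overlap values here are illustrative, not recommendations):

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",      # split on paragraph breaks
    chunk_size=1000,       # chunk size measured in characters
    chunk_overlap=200,     # overlap to maintain contextual continuity
    length_function=len,
)
chunks = splitter.split_documents(documents)   # `documents` from the loading step
```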

B. Recursive Split by Character

A subtle variation on splitting by character is the recursive split. The only difference is that instead of a single character used for splitting, this technique uses a list of characters and tries to split hierarchically until the chunk sizes are small enough. This technique is generally recommended for generic text.
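A sketch of the recursive variant: by default it tries paragraph breaks, then line breaks, then spaces, and finally individual characters until each chunk fits within the size limit.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
```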

C. Split by Tokens

For those well versed in Large Language Models, tokens are not a new concept. All LLMs have a token limit in their respective context windows, which we cannot exceed. It is therefore a good idea to count tokens while creating chunks. All LLMs also have their own tokenizers.

  • Tiktoken Tokenizer: The tiktoken tokenizer was created by OpenAI for its family of models. Using this strategy, the split still happens based on characters; however, the length of the chunk is determined by the number of tokens.
  • Hugging Face Tokenizer: Hugging Face has become the go-to platform for anyone building apps using LLMs or other models. All models available via Hugging Face are accompanied by their tokenizers.
  • Other Tokenizers: Libraries like spaCy, NLTK and SentenceTransformers also provide splitters.
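As a sketch of the tiktoken-based strategy from the first bullet above, LangChain can keep splitting on characters while counting chunk length in tokens (requires the tiktoken package; cl100k_base is the encoding used by OpenAI’s recent models).

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,        # measured in tokens, not characters
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
```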

D. Specialized Chunking

Chunking often aims to keep text with common context together. With this in mind, we might want to specifically honour the structure of the document itself, for example HTML, Markdown, LaTeX or even code.
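Two structure-aware sketches with LangChain splitters: Markdown split along its headers, and Python code split along language-specific boundaries (the sample strings are placeholders).

```python
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
    Language,
)

# Split Markdown along its headers, keeping header values as metadata
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
)
md_chunks = md_splitter.split_text("# Title\nIntro...\n## Section\nDetails...")

# Split Python code along function/class boundaries where possible
py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=500, chunk_overlap=0
)
py_chunks = py_splitter.split_text("def foo():\n    return 42\n")
```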

Things to Keep in Mind

  • Ensure data quality by preprocessing it before determining the optimal chunk size. Examples include removing HTML tags or eliminating specific elements that contribute noise, particularly when data is sourced from the web.
  • Consider factors such as content nature (e.g., short messages or lengthy documents), embedding model characteristics, and capabilities like token limits in choosing chunk sizes. Aim for a balance between preserving context and maintaining accuracy.
  • Test different chunk sizes. Create embeddings for the chosen chunk sizes and store them in your index or indices. Run a series of queries to evaluate quality and compare the performance of different chunk sizes.

Embeddings

All Machine Learning/AI models work with numerical data. Before any operation can be performed, all text/image/audio/video data has to be transformed into a numerical representation. Embeddings are vector representations of data that capture meaningful relationships between entities. As a general definition, embeddings are data that has been transformed into n-dimensional matrices for use in deep learning computations. A word embedding is a vector representation of a word.

Source : Image by Author

The process of embedding transforms data (like text) into vectors and compresses the input information, resulting in an embedding space specific to the training data.
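A small illustration of that idea, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (used here purely as an example): sentences with similar meaning end up close together in the embedding space.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Quarterly revenue grew by 12%.",
]
vectors = model.encode(sentences)   # one 384-dimensional vector per sentence

# Cosine similarity: the two cat sentences score far higher with each other
# than either does with the unrelated revenue sentence.
print(util.cos_sim(vectors[0], vectors[1]))
print(util.cos_sim(vectors[0], vectors[2]))
```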

While we keep our discussion of embeddings limited to RAG applications and how to create embeddings for our data, a great resource to learn more about embeddings is the book What are Embeddings by Vicky Boykis.

The good news for anyone building RAG applications is that embeddings, once created, can also generalize to other tasks and domains through transfer learning (the ability to switch contexts), which is one of the reasons embeddings have exploded in popularity across machine learning applications.

Popular Embedding Models

MTEB Leaderboard @ HuggingFace

How to Choose Embeddings?

Ever since the release of ChatGPT and the advent of the aptly described LLM wars, there has also been a mad rush to develop embedding models. There are many evolving standards for evaluating LLMs and embeddings alike.

When building RAG-powered LLM apps, there is no right answer to “Which embedding model should I use?”. However, you may notice particular embeddings working better for specific use cases (like summarization, text generation, classification etc.).

OpenAI used to recommend different embedding models for different use cases. However, it now recommends ada v2 (text-embedding-ada-002) for all tasks.

MTEB Leaderboard at Hugging Face evaluates almost all available embedding models across seven use cases — Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity (STS) and Summarization.

Another important consideration is cost. With OpenAI models you can incur significant costs if you are working with a lot of documents. The cost of open source models will depend on the implementation.

Creating Embeddings

Once you’ve chosen your embedding model, there are several ways of creating the embeddings. Sometimes our friends LlamaIndex and LangChain come in pretty handy for converting documents (split into chunks) into vector embeddings. Other times, you can use a provider’s service directly or get the embeddings from Hugging Face.
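As a sketch, here are two ways of doing that through LangChain’s wrappers: a hosted OpenAI model (needs an OPENAI_API_KEY in the environment) and a local sentence-transformers model. The chunks variable is assumed to come from the splitting step.

```python
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings

texts = [chunk.page_content for chunk in chunks]

# Hosted embeddings via OpenAI (ada v2 by default at the time of writing)
openai_vectors = OpenAIEmbeddings().embed_documents(texts)

# Local embeddings via sentence-transformers
local_vectors = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
).embed_documents(texts)

print(len(openai_vectors[0]), len(local_vectors[0]))   # e.g. 1536 vs 384 dimensions
```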

Storing

We are at the last step of creating the indexing pipeline. We have loaded and split the data, and created the embeddings. Now, for us to be able to use the information repeatedly, we need to store it so that it can be accessed on demand. For this we use a special kind of database called the Vector Database.

What is a Vector Database?

For those familiar with databases, indexing is a data structure technique that allows users to quickly retrieve data from a database. Vector databases specialise in indexing and storing embeddings for fast retrieval and similarity search.

A stripped-down variant of a vector database is a vector index like FAISS (Facebook AI Similarity Search). It is the vector index that improves the search and retrieval of vector embeddings. Vector databases augment the indexing with typical database features like data management, metadata storage, scalability, integrations, security etc.
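To see the difference, here is a minimal sketch of a bare vector index using the faiss-cpu package, with random vectors standing in for real embeddings; none of the database features above are part of it.

```python
import faiss
import numpy as np

dim = 384                                  # dimensionality of the embeddings
index = faiss.IndexFlatL2(dim)             # exact L2 (Euclidean) index

embeddings = np.random.rand(1000, dim).astype("float32")
index.add(embeddings)                      # store the vectors

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)    # 5 nearest neighbours
print(ids)                                 # positions of the closest stored vectors
```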

In short, Vector Databases provide -

  • Scalable Embedding Storage.
  • Precise Similarity Search.
  • Fast Search Algorithms.

Popular Vector Databases

  • Facebook AI Similarity Search (FAISS) is a vector index, released as a library in 2017
  • Pinecone is one of the most popular managed vector databases for large-scale applications
  • Weaviate is an open-source vector database that stores both objects and vectors
  • Chroma is also an open-source vector database.

With the growth in demand for vector storage, it can be anticipated that all major database players will add vector indexing capabilities to their offerings.

How to choose a Vector Database?

All vector databases offer the same basic capabilities. Your choice should be influenced by the nuance of your use case matching with the value proposition of the database.

A few things to consider -

  • Balance search accuracy and query speed based on application needs. Prioritize accuracy for precision applications or speed for real-time systems.
  • Weigh increased flexibility vs potential performance impacts. More customization can add overhead and slow systems down.
  • Evaluate data durability and integrity requirements vs the need for fast query performance. Additional persistence safeguards can reduce speed.
  • Assess tradeoffs between local storage speed and access vs cloud storage benefits like security, redundancy and scalability.
  • Determine if tight integration control via direct libraries is required or if ease-of-use abstractions like APIs better suit your use case.
  • Compare advanced algorithm optimizations, query features, and indexing vs how much complexity your use case necessitates vs needs for simplicity.
  • Cost considerations: while you may incur a recurring cost with a fully managed solution, a self-hosted one might prove costlier if not managed well
Source : Image by Author

There are many more Vector DBs. For a comprehensive understanding of the pros and cons of each, this blog is highly recommended

Storing Embeddings in Vector DBs

To store the embeddings, LangChain and LlamaIndex can be used for quick prototyping. The more nuanced implementation will depend on the choice of the DB, use case, volume etc.

A. Example : FAISS from langchain.vectorstores

In this example, we complete our indexing pipeline for one document.

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using OpenAIEmbeddings
  • Storing the embeddings into FAISS vector index
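A sketch of what this pipeline might look like (the file name is a placeholder; it assumes the langchain, openai and faiss-cpu packages and an OPENAI_API_KEY in the environment):

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

docs = TextLoader("my_document.txt").load()                        # 1. load
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)                                            # 2. split
embeddings = OpenAIEmbeddings()                                    # 3. embed
db = FAISS.from_documents(chunks, embeddings)                      # 4. store / index

db.save_local("faiss_index")                                       # persist to disk
print(db.similarity_search("What is this document about?", k=2))   # quick sanity check
```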

B. Example : Chroma from langchain.vectorstores

  • Loading our text file using TextLoader,
  • Splitting the text into chunks using RecursiveCharacterTextSplitter,
  • Creating embeddings using all-MiniLM-L6-v2
  • Storing the embeddings into Chromadb
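And a sketch of the same pipeline with a local embedding model and Chroma (assumes the langchain, chromadb and sentence-transformers packages; the file name is again a placeholder):

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

docs = TextLoader("my_document.txt").load()                        # 1. load
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)                                            # 2. split
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)                                                                  # 3. embed
db = Chroma.from_documents(chunks, embeddings,
                           persist_directory="./chroma_db")        # 4. store

print(db.similarity_search("What is this document about?", k=2))
```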

Indexing Pipeline Recap

Loading :

  • A variety of data loaders from LangChain and LlamaIndex can be leveraged to load data from all sorts of sources.
  • Loading documents from a list of sources may turn out to be a complicated process. Make sure to plan for all the sources and loaders in advance.
  • More often than not, transformations/clean-ups to the loaded data will be required

Splitting :

  • Documents need to be split for ease of search and due to the limitations of LLM context windows
  • Chunking strategies are dependent on the use case, nature of content, embeddings, query length & complexity
  • Chunking methods determine how the text is split and how the chunks are measured

Embedding :

  • Embeddings are vector representations of data that capture meaningful relationships between entities
  • Some embeddings work better for some use cases

Storing :

  • Vector databases specialise in indexing and storing embeddings for fast retrieval and similarity search
  • Different vector databases present different benefits and can be used in accordance with the use case

By indexing relevant data from external sources, we can expand the “memory” available to LLMs. There’s so much potential to augment LLMs with external knowledge through RAG indexing! Connect with me if you want to continue the discussion. What real-world use cases could this enable? Let me know in the comments! ⤵️


Abhinav Kimothi

Co-founder and Head of AI @ Yarnit.app || Data Science, Analytics & AIML since 2007 || BITS-Pilani, ISB-Hyderabad || Ex-HSBC, Ex-Genpact, Ex-LTI || Theatre