DIY Search Engine: How LangChain SQL Agent Simplifies Data Extraction

Claudio Mazzoni
7 min read · Jun 17, 2023
Photo by Sneaky Elbow on Unsplash

The advent of large language models (LLMs), such as OpenAI’s GPT-3, has ushered in a new era of possibilities in the realm of natural language processing. One such use case is the ability to search for relevant data effectively.

At present, there’s a growing buzz around Vector Databases. Pinecone has already secured over $100 million in funding, and open-source competitors like ChromaDB and LanceDB are hot on their trail, revolutionizing the way we store and retrieve data.

Vector databases are a vast and complex topic, and discussing them in detail is beyond the scope of this article. While they offer numerous benefits, the reality is that their large-scale implementation is currently hindered by factors such as cost, infrastructure, and the requisite knowledge.

Novel development frameworks to the rescue:

Traditional database management, a complex process that typically requires a deep understanding of structured query languages, can be revolutionized with the application of LLMs. As promising as this approach may seem, however, there are several challenges associated with deploying these models at a production level. Some of these include:

Interpretability:

One of the major challenges with LLMs is their ‘black box’ nature, making it difficult to understand the decision-making processes that lead to specific outputs. This lack of interpretability can be particularly problematic when the model produces unexpected or incorrect outputs, as the root cause often remains a mystery.

Reliability:

LLMs, despite their impressive capabilities, are not entirely reliable. They have a propensity to generate misleading or completely false information, a phenomenon known as ‘hallucination’. This can result in inconsistencies that are particularly problematic in scenarios where accuracy is crucial, such as in knowledge retrieval.

Resource Intensive:

Training and running LLMs is a resource-intensive process. Models like GPT-4 require substantial computational resources, which can pose a barrier to entry for individuals or organizations with limited budgets. Furthermore, proprietary models like those from OpenAI charge based on the number of tokens processed, where tokens are the units of text the model works with. Sending and receiving a large number of tokens leads to higher costs and slower response times.

So, how do we leverage the latest advancements in LLMs to access our data in a fast, easy, and meaningful way? By building our own search engine, of course.

An innovative solution to these challenges:

LangChain. This framework harnesses the power of LLMs to create seamless, user-friendly interfaces to a wide range of tools and data sources. In this case, we’ll demonstrate its use for understanding and querying databases.

Simplifying Data Extraction with LangChain Agents

Retrieving data from a database is seldom a straightforward endeavor. Non-technical users often lack both the time and the knowledge to figure out complex queries that match their data needs. More often than not, data access becomes a static and time-consuming process when new requirements emerge.

With the advent of Large Language Models (LLMs) and Vector Databases, the concept of data retrieval has undergone a radical transformation. However, these new technologies bring their own set of challenges.

In response to these issues, LangChain Agents present an innovative solution that combines the sophistication of LLMs with traditional programming. This blend results in a tool that is not only interpretable and reliable but also efficient.

LangChain is a revolutionary framework that harnesses the capabilities of language models to build applications that can effectively interact with their environment and various data sources.

Certain applications require not just a fixed sequence of calls to LLMs or other tools, but possibly an indeterminate sequence dependent on user input. In these cases, an agent serves as the control center, equipped with a suite of tools. Based on the user’s input, the agent decides which tools, if any, to use.

Currently, there are two primary types of agents:

  1. Action Agents: These agents determine and execute actions one at a time in a sequential manner.
  2. Plan-and-Execute Agents: These agents first devise a plan of actions and then execute each action in the plan sequentially.

In this scenario, we will use the former type: specifically, the SQL Database Agent. This agent is designed to interact with SQL databases, from describing a table’s schema to retrieving data from queries and even recovering from errors. This means it can handle situations like querying non-existent tables or columns by identifying the closest valid alternative.

The data that we are going to use is a modified version of publicly available indices data. We will download it and store it in a database for our use.

Agents combine an LLM with a set of ‘Tools’ (in our case, the SQLDatabaseToolkit) and a ‘Prompt’: the initial instructions that define the workflow the agent should follow and when and how it should use the tools available to it. Finally, we feed the agent information about our data source, including table schemas, sample data, field descriptions, etc.

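As a minimal sketch of that wiring, assuming the downloaded index data sits in a local SQLite file named indices.db (the file name is an assumption for this example, as is reading the OpenAI API key from the environment), the setup might look like this:

```python
# Minimal sketch: wire up the LLM, the database connection, and the SQL toolkit.
# "indices.db" is a placeholder file name; OPENAI_API_KEY is read from the environment.
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain.agents.agent_toolkits import SQLDatabaseToolkit

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)  # language model the agent reasons with
db = SQLDatabase.from_uri("sqlite:///indices.db")            # connection, schema, and sample rows
toolkit = SQLDatabaseToolkit(db=db, llm=llm)                 # the SQL tools the agent is allowed to call
```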

LangChain’s default prompt for the SQL toolkit is designed to answer user’s questions by generating SQL queries, retrieving the final result, and converting that result into natural language. However, this approach is not always ideal. For example, you might have more data than the LLM can handle due to token limits, or the user may not need an immediate answer but rather the entire dataset for their own analysis or report.

So for our use case, we will first create our own custom prompt. The goal is to ensure that it only generates the SQL query, prevents common mistakes, and always includes two essential columns. We’ll achieve this by modifying the prompt template available in LangChain:
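As a rough illustration, a customized prefix along those lines might look like the sketch below. The wording is adapted from LangChain’s default SQL agent prompt, and the two column names (index_name and inception_date) are placeholders standing in for the essential columns in our table:

```python
# Sketch of a custom prompt prefix for the SQL agent.
# The column names (index_name, inception_date) are illustrative placeholders.
CUSTOM_PREFIX = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query that answers it.
Your final answer must be the SQL query itself, not the data it returns.
Never select all columns; only query the columns relevant to the question,
but always include the index_name and inception_date columns.
Double-check the query for common mistakes such as wrong column names,
missing quotes, or incorrect joins before returning it.
Never issue DML statements (INSERT, UPDATE, DELETE, DROP)."""
```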

This approach not only minimizes token usage but also enhances interpretability and reliability. The SQL queries generated are transparent (as they are produced as part of the Thought/Action process) and can be easily understood and validated by anyone with a basic understanding of SQL, addressing the ‘black box’ issue often associated with LLMs.

Next, we create our search engine class, which contains schema information, data samples, and high-level information about valuable fields. In our use case, the data comprises a list of publicly traded index funds along with some publicly available information about them.

The code itself is straightforward; the variables ‘llm’ and ‘db’ provide the SQL Toolkit with the necessary language model (in this case, gpt-3.5-turbo) and the database information. These are needed to create the agent along with the custom instructions.

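Piecing the earlier objects together, the class might look roughly like the sketch below. The class name and the top_k value are assumptions for this sketch; the schema information and sample rows reach the agent through the toolkit, and create_sql_agent accepts the custom prefix defined above:

```python
# Hedged sketch of the "search engine" class described above, built from the
# llm, toolkit, and CUSTOM_PREFIX defined earlier. Names are illustrative.
from langchain.agents import create_sql_agent

class IndexSearchEngine:
    """Turns natural-language questions about our index data into SQL queries."""

    def __init__(self, llm, toolkit, prefix=CUSTOM_PREFIX):
        # Schema details and sample rows reach the agent through the toolkit's tools.
        self.agent = create_sql_agent(
            llm=llm,
            toolkit=toolkit,
            prefix=prefix,   # custom instructions: return only the SQL query
            top_k=5,         # rows previewed while reasoning (the source of LIMIT 5 later)
            verbose=True,    # print the Thought/Action trace for interpretability
        )

    def ask(self, question: str) -> str:
        """Run the agent and return the SQL query it produces."""
        return self.agent.run(question)
```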

When we pass the question “What is the inception date of the Indexes that start with the letter ‘A’ and are ESG?” to the agent, it initiates a chain of thought and observation as per the prompt’s instructions. The agent then verifies that each inference is both syntactically correct and aligns with the question, finally returning what it believes to be the correct answer. In our case, this is a SQL query, as dictated by the prompt’s requirements:

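Using the sketch above, the interaction would look something like this (the exact query returned depends on the table schema):

```python
# Ask the agent the question from the example above and print the SQL it returns.
engine = IndexSearchEngine(llm, toolkit)
sql_query = engine.ask(
    "What is the inception date of the Indexes that start with "
    "the letter 'A' and are ESG?"
)
print(sql_query)  # typically a SELECT ... WHERE ... LIMIT 5 statement, as discussed below
```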

Congratulations, we have now generated a SQL query from a natural language question! But we’re not quite finished. As you may notice, the final output limits the results to 5 rows (‘LIMIT 5’). This is due to the agent’s behavior, as it limits the number of rows it checks against the database table to avoid reaching the token limit. The final output is the last query it tested, and thus the limit persists.

To address this and enhance the user experience, we will build a small Flask web app. This app will take user questions in a search bar, execute the agent against our database, parse the results to their correct format, and finally return the result as a table.
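A minimal version of that app might look like the sketch below, assuming the IndexSearchEngine instance from earlier, a results.html template, and a parse_and_run helper that is sketched a little further down; the route name and form field are illustrative choices rather than the exact implementation:

```python
# Minimal Flask sketch: take a question from the search bar, run the agent,
# clean up and execute the SQL, and render the rows as a table.
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def search():
    rows, columns = [], []
    if request.method == "POST":
        question = request.form["query"]            # text typed into the search bar
        sql_query = engine.ask(question)            # agent converts it into SQL
        rows, columns = parse_and_run(sql_query)    # strip the LIMIT and execute (see below)
    return render_template("results.html", rows=rows, columns=columns)

if __name__ == "__main__":
    app.run(debug=True)
```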

To wrap up the front end, we will create the HTML, CSS, and a witty, clever, and publishing-safe logo.

Similarities to other search engines are for satirical purposes only.

The app will take the content from the search bar and invoke the search engine we created using Agents. Then, the result will be parsed to remove the limits, and finally, we will execute the SQL query and feed the content to the webpage.
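The parsing step itself can be as simple as the sketch below, which strips a trailing LIMIT clause before running the query against the same SQLite file assumed earlier:

```python
# Remove the trailing LIMIT the agent adds while reasoning, then execute the query.
import re
import sqlite3

def parse_and_run(sql_query: str, db_path: str = "indices.db"):
    clean_sql = re.sub(r"\s+LIMIT\s+\d+\s*;?\s*$", "", sql_query, flags=re.IGNORECASE)
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute(clean_sql)
        columns = [description[0] for description in cursor.description]
        rows = cursor.fetchall()
    return rows, columns
```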


Voila! We now have a system that allows users to interact with their databases as if they were using a search engine like Google or Bing. Users can input their queries in natural language, and LangChain does the heavy lifting, extracting the required data swiftly and accurately.

The real beauty of this methodology and the tools used is their scalability and speed of data retrieval. If we need to increase the number of tables or alter the schemas, the changes are minimal and can theoretically be parameterized, eliminating the need for any code changes.

I invite you to experience the power of LangChain Agents firsthand. Harness this framework to unlock the potential of your own data. Whether you’re a seasoned developer, a data analyst, or a curious enthusiast, try experimenting with this approach and see the results for yourself. Not only will it change the way you interact with databases, but it will also open new avenues for how you think about data extraction and interpretation. So, go ahead, set up your own LangChain Agent, customize your prompts and see how you can transform your data retrieval process. The future of data interaction is here, and you’re a part of it.
