Exploring Generative Artificial Intelligence with PandasAI, OpenAI, and Comet LLM

Benny Ifeanyi Iheagwara
Published in Heartbeat · 7 min read · Feb 23, 2024


Photo by Richard Horvath on Unsplash

Python's Pandas library has long been the trusted companion of data analysts and data scientists. This library is known for its data manipulation, transformation, and wrangling capabilities.

But what if I told you there's a way to bring the power of Generative AI to the Pandas library?

While researching LLMs (large language models), I came across PandasAI, a library that adds Generative AI capabilities to Pandas. It changes how we work with data by letting us turn complex queries into simple conversations. This article will dive into PandasAI, explore its capabilities, and show how you can keep track of your work with Comet. We will also explore CometLLM.

Overview of PandasAI

PandasAI is a Python library that uses generative AI models to make Pandas conversational. It lets you use natural language to run Pandas operations, such as manipulating data frames and plotting graphs.

It is essential to know that while PandasAI is a game changer, it is designed to be used with Pandas and not replace it.

Tools to be Used

  • Jupyter Notebook or Google Colab for experimentation

Installation

To get started, we need to install PandasAI. Run the command below:

pip install pandasai

You might run into this error: "ERROR: Could not find a version that satisfies the requirement pandasai (from versions: none) ERROR: No matching distribution found for pandasai".

This is because PandasAI requires Python >=3.9, <4.0. To solve this, you need to upgrade your Python version.
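Before installing, it can help to confirm that your interpreter falls inside that range. The snippet below is a minimal sketch of such a check; `python_is_supported` is a hypothetical helper name, and the bounds are simply the ones quoted above.

```python
import sys


def python_is_supported(version=None, minimum=(3, 9), maximum=(4, 0)):
    """Return True if the version is >= minimum and < maximum."""
    if version is None:
        version = sys.version_info
    # Compare only the (major, minor) part of the version.
    current = tuple(version[:2])
    return minimum <= current < maximum


if python_is_supported():
    print("This Python version is within the range PandasAI asks for.")
else:
    print("Upgrade to Python 3.9 or newer before installing PandasAI.")
```

In Colab or Jupyter, you can run this in a cell before the install command to see which message applies to your runtime.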

Getting Started with PandasAI

PandasAI is powered by large language models (LLMs) and supports several of them, from OpenAI, Azure OpenAI, and Google PaLM to HuggingFace's Starcoder and Falcon models. These models give PandasAI its natural language query capabilities.

We will use the OpenAI LLM API wrapper in this tutorial to power PandasAI's generative AI capabilities. To follow along, set up an OpenAI account and generate an OpenAI API key from your account page. You should also set up billing since OpenAI API access is a paid service; I will show you how to monitor your billing later in the article.

Note: ******* represents your OpenAI API key, which you can find in your OpenAI account.
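A minimal sketch of this setup, written so the key never has to be pasted into the notebook. It assumes the PandasAI 1.x import path `pandasai.llm.OpenAI` with an `api_token` argument, and it assumes the key is stored in an `OPENAI_API_KEY` environment variable; `read_api_key` and `build_llm` are hypothetical helper names.

```python
import os


def read_api_key(env=None, name="OPENAI_API_KEY"):
    """Fetch the API key from the environment instead of hard-coding it."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key


def build_llm():
    """Create the OpenAI wrapper (assumes the PandasAI 1.x API)."""
    # Imported lazily so the helper above works without pandasai installed.
    from pandasai.llm import OpenAI

    return OpenAI(api_token=read_api_key())
```

With the variable set, `llm = build_llm()` gives you the object that PandasAI uses to talk to OpenAI.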

Our Dataset

I will use the house rent prediction dataset I found on Kaggle for this article. The dataset contains information like the number of bedrooms, kitchens, the rent, city, area, and furnishing status of over 4,000 houses, apartments, and flats in India.

We will now load our dataset using the Pandas library.
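Loading comes down to a single `pd.read_csv` call. The sketch below wraps it in a hypothetical `load_rent_data` helper and runs it on a tiny made-up sample so it works without the Kaggle download; the column names are assumptions based on the fields described above.

```python
import io

import pandas as pd


def load_rent_data(source):
    """Read the rent CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Made-up rows standing in for the real Kaggle file.
sample_csv = io.StringIO(
    "BHK,Rent,City,Furnishing Status\n"
    "2,10000,Kolkata,Unfurnished\n"
    "3,45000,Mumbai,Furnished\n"
)

df = load_rent_data(sample_csv)
print(df.shape)   # number of rows and columns
print(df.dtypes)  # column types inferred by Pandas
```

With the real dataset, you would pass the path of the downloaded CSV file to `load_rent_data` instead of the in-memory sample.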

Exploring Our Dataset with PandasAI and OpenAI

We can now explore the dataset with PandasAI's generative AI capabilities. Just as Pandas has dataframes, PandasAI has SmartDataframes.

SmartDataframe has the same properties as pd.DataFrame but with conversational features.
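Creating one is a matter of wrapping an existing data frame and telling PandasAI which LLM to use. This is a sketch that assumes the PandasAI 1.x API, where `SmartDataframe` accepts a `config` dictionary with `llm` and `verbose` keys; `build_config` and `make_smart_dataframe` are hypothetical helper names.

```python
def build_config(llm, verbose=False):
    """Bundle the settings passed to SmartDataframe."""
    return {"llm": llm, "verbose": verbose}


def make_smart_dataframe(df, llm):
    """Wrap a Pandas DataFrame (assumes the PandasAI 1.x API)."""
    # Imported lazily so build_config works without pandasai installed.
    from pandasai import SmartDataframe

    return SmartDataframe(df, config=build_config(llm))
```

In the notebook, `sdf = make_smart_dataframe(df, llm)` would produce the `sdf` object used in the rest of this article, where `df` is the loaded dataset and `llm` is the OpenAI wrapper.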

Now, let's explore PandasAI.

sdf.impute_missing_values()

This command will impute missing values in your data frame.

To ask your data questions using PandasAI, you use sdf.chat:

sdf.chat("who is the ideal tenant for a 3 bedroom in kolkata")
sdf.chat("Return the top 5 expensive city by rent")
# BHK is the number of bedrooms, hall, and kitchen
sdf.chat("In a table show me the average rent in the various cities each month and group this data by the BHK")
sdf.chat("In a table show me the number of Point of Contact in the various cities each month and group this data by the Point of Contact")
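Since PandasAI is meant to work alongside Pandas, it is worth cross-checking an answer with plain Pandas now and then. The sketch below shows what two of the questions above could look like as ordinary Pandas code. The rows are made up for illustration, and the column names `City`, `BHK`, and `Rent` are assumptions about the dataset's headers.

```python
import pandas as pd

# Made-up rows for illustration only; the real dataset has thousands.
df = pd.DataFrame(
    {
        "City": ["Mumbai", "Mumbai", "Delhi", "Kolkata", "Kolkata"],
        "BHK": [2, 3, 2, 2, 3],
        "Rent": [60000, 90000, 30000, 12000, 18000],
    }
)

# "Return the top 5 expensive city by rent"
top_cities = df.groupby("City")["Rent"].mean().nlargest(5)
print(top_cities)

# Average rent in each city, grouped by BHK
rent_by_city_bhk = df.pivot_table(
    index="City", columns="BHK", values="Rent", aggfunc="mean"
)
print(rent_by_city_bhk)
```

If the conversational answer and the Pandas result disagree, that is a hint to rephrase the question or inspect what the model generated.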

You can also generate charts with PandasAI:

sdf.chat("Plot a chart of the average rent by city")
sdf.chat("Create a line chart to show the trend of rent by city in the last few months")
sdf.plot_correlation_heatmap()
sdf.chat("Calculate the average cost of rent in Delhi")

Here is a GitHub gist of the code snippets.

Worried about the Billing? Count Your Tokens

OpenAI's API processes your prompts by breaking them down into tokens. You can think of tokens as pieces of words and characters. You should visit the official documentation to learn more.

You can analyze and count the number of tokens your prompt uses using the command below.
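One way to do this, sketched below, is with OpenAI's `tiktoken` package. This is an assumption about tooling rather than the only option: the model name `gpt-3.5-turbo` is a guess at what your account uses, `count_tokens` and `estimate_cost` are hypothetical helper names, and the prices in the example are placeholders, not OpenAI's actual rates.

```python
def count_tokens(text, model="gpt-3.5-turbo"):
    """Count the tokens in a piece of text (requires the tiktoken package)."""
    # Imported lazily; tiktoken fetches the encoding the first time it is used.
    import tiktoken

    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_price_per_1k, completion_price_per_1k):
    """Estimate a request's cost from token counts and per-1K-token prices."""
    prompt_cost = (prompt_tokens / 1000) * prompt_price_per_1k
    completion_cost = (completion_tokens / 1000) * completion_price_per_1k
    return prompt_cost + completion_cost


# Placeholder prices; check OpenAI's pricing page for the real ones.
print(estimate_cost(1000, 500, prompt_price_per_1k=0.5, completion_price_per_1k=1.5))
```

Keep in mind that the request PandasAI sends usually contains more than the question text itself, so a count based on the question alone is best treated as a lower bound.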

You can also check your usage in your OpenAI account.

Logging your Artifacts on Comet

Once done, you should log the dataset to Comet (Comet calls logged assets like this Artifacts). This way, as you build machine learning models, you can keep track of your code and artifacts automatically.

To get started, we will need to create a Comet account. This will enable us to log our Artifacts on the Comet platform through an "experiment." Let's log a new experiment with our house rent prediction dataset using the code below:

import comet_ml
from comet_ml import Artifact, Experiment
# Initialize Comet; this prompts for your API key
comet_ml.init()

Here, we imported and initialized the CometML library. Once you run this, you will be prompted to pass your Comet API key into Colab or Jupyter.

Next, we create an Experiment object by giving it a project name and the workspace it should belong to. You can find your available workspaces in your Comet account. After that, we create an Artifact instance by giving it a name and an artifact type, and we attach the dataset file with artifact.add(). Finally, we log the artifact to the experiment and end the experiment.

You can do that using the code below:
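The following is a minimal sketch of those steps, not a definitive implementation. It uses the `comet_ml` calls `Experiment`, `Artifact`, `artifact.add`, `experiment.log_artifact`, and `experiment.end`; the helper name `log_dataset_artifact`, the default artifact name, and the `"dataset"` artifact type are assumptions chosen for illustration.

```python
from pathlib import Path


def log_dataset_artifact(csv_path, project_name, workspace,
                         artifact_name="house-rent-dataset"):
    """Log a CSV file to Comet as a dataset Artifact."""
    path = Path(csv_path)
    # Fail early with a clear message before contacting Comet.
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")

    # Imported lazily so the path check works without comet_ml installed.
    from comet_ml import Artifact, Experiment

    experiment = Experiment(project_name=project_name, workspace=workspace)
    artifact = Artifact(name=artifact_name, artifact_type="dataset")
    artifact.add(str(path))            # attach the file to the artifact
    experiment.log_artifact(artifact)  # upload it with the experiment
    experiment.end()
```

You would call it with the path to your dataset file, your project name, and your workspace, after `comet_ml.init()` has set up your API key.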

Logging with CometLLM

However, since this is an LLM project, it is better to log our prompts using CometLLM.

CometLLM is a new suite of LLMOps tools designed to help you effortlessly track and visualize your LLM prompts and chains.

First, we will need to install the CometLLM library:

pip install comet_llm

Then, we will also need our API key, which we can get from our Comet account. Once you have that, we can log our prompts and their outputs to Comet.

We can test CometLLM by running the code below:

import comet_llm

comet_llm.log_prompt(
    prompt="What is your name?",
    output="My name is Benny Ifeanyi",
    api_key="YOUR_COMET_API_KEY",
    project="MY_Project_",
)

This is the output. You can view it on the Comet platform as well.

Now, let's log our prompts and their outputs. We can do that using the code below:

import comet_llm
import pandas as pd

# Define your questions
questions = [
    "Return the top 5 expensive city by rent"
]

# Create a list to store question-response pairs
question_response_pairs = []

# Ask each question and store the answer
for question in questions:
    # Generate the response using sdf.chat (sdf is the PandasAI SmartDataframe we created earlier)
    response = sdf.chat(question)

    # Store the question and the response (as text) in the list
    question_response_pairs.append({"question": question, "response": str(response)})

# Save the question-response pairs to a CSV file
csv_file_path = '/content/question_response_pairs.csv'
pairs_df = pd.DataFrame(question_response_pairs)
pairs_df.to_csv(csv_file_path, index=False)

with open(csv_file_path, 'r') as csv_file:
    csv_content = csv_file.read()

# Log the entire CSV content as the response to CometLLM
comet_llm.log_prompt(
    prompt="Question-Response Pairs",
    output=csv_content,
    api_key="YOUR_COMET_API_KEY",
    project="MY_Project_",
)

This is the output. You can also view it on the Comet Platform.
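The code above logs all the pairs as one CSV blob. If you would rather have one entry per question in the Comet UI, a possible variation is sketched below. `log_each_question` is a hypothetical helper, and the `fake_ask` and `fake_log_prompt` functions are stand-ins so the sketch runs without PandasAI or a Comet account.

```python
def log_each_question(questions, ask, log_prompt, project):
    """Ask every question and log each answer as its own prompt."""
    logged = []
    for question in questions:
        answer = str(ask(question))  # store answers as text
        log_prompt(prompt=question, output=answer, project=project)
        logged.append({"question": question, "response": answer})
    return logged


# Stand-ins for sdf.chat and comet_llm.log_prompt.
def fake_ask(question):
    return f"(answer to: {question})"


def fake_log_prompt(**kwargs):
    print("would log:", kwargs)


pairs = log_each_question(
    ["Return the top 5 expensive city by rent"],
    ask=fake_ask,
    log_prompt=fake_log_prompt,
    project="MY_Project_",
)
```

In the notebook, you would pass `sdf.chat` as `ask` and `comet_llm.log_prompt` as `log_prompt`. Because this sketch does not pass `api_key`, the key has to be configured beforehand, for example through the `COMET_API_KEY` environment variable.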

However, we can't log our dataset with CometLLM because this isn't supported at the moment.

What's next?

From here, we can do much more, like using the Comet UI to score our prompt outputs or to search for specific prompts. This is helpful when working with thousands of prompts.

All of this and much more is covered in this article: Organize Your Prompt Engineering with CometLLM.

Conclusion

PandasAI has brought generative AI capabilities to the Pandas library. You can explore these capabilities on your datasets and start exciting projects with other LLMs and integrations, such as Azure OpenAI and LangChain (which you can view in the GitHub gist below). Lastly, remember to use Comet to track, compare, and optimize your ML experiments as you work.

P.S. If you prefer to learn by code, check out this GitHub gist, which hosts a Google Colab notebook with all the code snippets.

Editor's Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We're committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don't sell ads.



Thoughts, theories, growth, and experiences. Finding my path as a Data Analyst 📊 and Technical Writer 🚀