Introduction

In the world of data analysis, extracting useful information from tabular data can be a difficult task. Conventional approaches typically require manual exploration and analysis of data, which can be requires a significant amount of effort, time, or workforce to complete. 

The emergence of advanced language models such as ChatGPT has introduced a promising and innovative approach to extracting useful information from tabular data. OpenAI announced an official API for ChatGPT which is powered by gpt-3.5-turbo, OpenAI’s most advanced language model.

This blog describes you the process of extracting useful information from tabular data using ChatGPT API.

Dataset

For all illustrations in this post, We will be utilizing the following data. These data are taken from the following: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams

Note that we have only taken into account 30 records from the dataset. 

Please see the data provided below, which will be used for the purpose of this blog.

genderrace parentallevelofeducationlunchtestpreparationcoursemathscorereadingscorewritingscore
femalegroup Bbachelorstandardnone727274
femalegroup Csome collegestandardcompleted699088
femalegroup Bmasterstandardnone929593
malegroup Aassociatefree/reducednone475744
malegroup Csome collegestandardnone767875
femalegroup Bassociatestandardnone718378
femalegroup Bsome collegestandardcompleted889592
malegroup Bsome collegefree/reducednone404339
malegroup Dhigh schoolfree/reducedcompleted646467
femalegroup Bhigh schoolfree/reducednone386050
malegroup Cassociatestandardnone585452
malegroup Dassociatestandardnone405243
femalegroup Bhigh schoolstandardnone658173
malegroup Asome collegestandardcompleted787270
femalegroup Amasterstandardnone505358
femalegroup Csome high schoolstandardnone697578
malegroup Chigh schoolstandardnone888986
femalegroup Bsome high schoolfree/reducednone183228
malegroup Cmasterfree/reducedcompleted464246
femalegroup Cassociatefree/reducednone545861
malegroup Dhigh schoolstandardnone666963
femalegroup Bsome collegefree/reducedcompleted657570
malegroup Dsome collegestandardnone445453
femalegroup Csome high schoolstandardnone697373
malegroup Dbachelorfree/reducedcompleted747180
malegroup
A
masterfree/reducednone737472
malegroup Bsome collegestandardnone695455
femalegroup Cbachelorstandardnone676975
malegroup Chigh schoolstandardnone707065

Data Extraction using ChatGPT API

ChatGPT relies solely on natural language processing (NLP) techniques to understand and extract information from tabular data. It can analyze the text-based input provided by the user, interpret the query, and generate a response based on the content of the tabular data.

Here’s an example of how you can utilize the ChatGPT API to extract information from tabular data:

Step 1: Prepare Input

We have stored our tabular data in a CSV file, you can read the CSV file using “Pandas” Python library and pass the data to the ChatGPT API for information extraction.

import pandas as pd

read_csv=pd.read_csv("Student.csv")
Step 2: Use the ChatGPT API

Before we begin utilizing the ChatGPT API, please make sure that you have installed OpenAI Python library in your system.

pip install openai

You can extract information by providing the tabular data and input text to the ChatGPT API. This can be done by reading the tabular data from a CSV file, preparing the input for the API, and passing it along with the input text. The API will then extract the relevant information from the data and provide it in the response.

import openai
import pandas as pd

openai.api_key = '<YOUR OPENAI API KEY>'

read_csv = pd.read_csv("Student.csv")


input_text='''What is the average math score for male students?'''

prompt = """Please regard the following data:\n {}. Answer the following question and please return only value: {}""".format(read_csv, input_text)

request = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[
        {"role": "user", "content": prompt},
    ]
)
result = request['choices'][0]['message']['content']

print("ChatGPT Response=>",result)  

The following is the response that was received from the ChatGPT API:

Now let’s review the responses to a few questions based on the extracted information.

QuestionChatGPT ResponseAnalysis
What is the gender of the student who scored 72 in math?femaleCorrect
Which race/ethnicity group does the student belong to who scored the highest in reading?group CIncorrect
It should be “group B”
Did the student who scored 47 in math complete the test preparation course?NoCorrect



What is the average math score for male students?
65.4Incorrect
It should be “62.2”
What is the writing score for the student who belongs to group C race/ethnicity and has a master’s degree?The writing score for the specified student is not given in the data.Incorrect
It should be “46”

We observed that ChatGPT is incapable of performing aggregations, such as summing or averaging the table entries, and occasionally struggles to respond to basic questions.

You can add your table and test it right away using ChatGPT playground without using the API. Please visit the below link
https://chat.openai.com/

SQL-based Data Extraction from Database using ChatGPT API

We can now utilize ChatGPT again, but this time not for direct table analysis. Instead, we can use ChatGPT to generate SQL statements for a database that contains the data. SQL is capable of handling filtering, sorting, aggregation, and summation logic, which can help in performing the required calculations accurately.
We utilized SQLite as the database engine and employed the sqlite3 Python library to interact with it.

Step 1: Create SQLite database and table

Here’s a Python code that creates a database and a table in SQLite:

import sqlite3

# Connect to SQLite database (this will create a new database file if it doesn't exist)
conn = sqlite3.connect("chatgpt.db")
cursor = conn.cursor()

# Create a table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS student (
        gender TEXT,
        race TEXT,
        parentallevelofeducation TEXT,
        lunch TEXT,
        testpreparationcourse TEXT,
        mathscore INTEGER,
        readingscore INTEGER,
        writingscore INTEGER
    )
""")

# Commit the transaction and close the connection
conn.commit()
conn.close()
Step 2: Adding Data to a Database

Here’s a Python code that read data from CSV file using pandas and insert it into a SQLite database:

import sqlite3
import pandas as pd

df=pd.read_csv("Student.csv")
# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')

# Insert DataFrame into SQLite database
df.to_sql('student', conn, if_exists='replace', index=False)

# Close database connection
conn.close()
Step 3: Use ChatGPT API

You can extract information by providing the database table name, its corresponding columns, and input text to the ChatGPT API.  ChatGPT API will generate an SQL query from the given input text and then use that query to retrieve data from the database.

Here’s an example of how you can extract information using the ChatGPT API by providing the database table name, its corresponding columns, and input text:

import sqlite3
import openai

# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')
cursor = conn.cursor()


openai.api_key = '<YOUR OPENAI API KEY>'

# Function to get table columns from SQLite database
def get_table_columns(table_name):
    cursor.execute("PRAGMA table_info({})".format(table_name))
    columns = cursor.fetchall()
    print(columns)
    return [column[1] for column in columns]


# Function to generate SQL query from input text using ChatGPT
def generate_sql_query(table_name,text,columns):
    prompt = """You are a ChatGPT language model that can generate SQL queries. Please provide a natural language input text, and I will generate the corresponding SQL query for you.The table name is {} and corresponding columns are {}.\nInput: {}\nSQL Query:""".format(table_name,columns,text)
    print(prompt)
    request = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[
            {"role": "user", "content": prompt},
        ]
    )
    sql_query = request['choices'][0]['message']['content']
    return sql_query    

# Function to execute SQL query on SQLite database
def execute_sql_query(query):
    cursor.execute(query)
    result = cursor.fetchall()
    return result


text="What is the average math score for male students?"

table_name = 'student'
columns = get_table_columns(table_name)
sql_query=generate_sql_query(table_name,text,columns)
print("Generated SQL query: ",sql_query)
if sql_query:
    result=execute_sql_query(sql_query)
    print("ChatGPT Response=>",result)
    
# Close database connection
cursor.close()
conn.close()

After sending the input text to the ChatGPT API, you will receive a response containing the generated SQL query. You can then use this query to retrieve the desired data from the database.

The response of this approach looks as below:

QuestionSQL query generated by ChatGPTAnswerAnalysis
What is the gender of the student who scored 72 in math?SELECT gender FROM student WHERE mathscore=72;femaleCorrect
Which race/ethnicity group does the student belong to who scored the highest in reading?SELECT race AS “Race/Ethnicity”, MAX(readingscore) AS “Highest Reading Score”
FROM student
GROUP BY race
ORDER BY “Highest Reading Score” DESC
LIMIT 1;
group BCorrect
Did the student who scored 47 in math complete the test preparation course?SELECT testpreparationcourse
FROM student
WHERE mathscore = 47
NoneCorrect
What is the average math score for male students?SELECT AVG(mathscore) FROM student WHERE gender = ‘male’62.2Correct
What is the writing score for the student who belongs to group C race/ethnicity and has a masters degree?SELECT writingscore FROM student WHERE race = ‘group C’ AND parentallevelofeducation = ‘master’46Correct

ChatGPT without SQL relies solely on NLP techniques to understand and generate responses based on tabular data, which may result in incorrect responses. However, when ChatGPT is combined with SQL capabilities, it can perform more advanced and flexible interactions with tabular data by directly executing SQL queries. This allows for more accurate and precise retrieval of information from databases, enhancing the overall capabilities of ChatGPT in dealing with tabular data.


Ready to harness the power of ChatGPT for your chatbot or NLP application? Partner with our expert developers to bring your dream product to life. Connect with us at letstalk@pragnakalp.com or share your requirements here to get started on your journey to success!

Categories: ChatGPT Data Extraction Natural Language Processing NLP Python

Leave a Reply

Your email address will not be published.

You may use these <abbr title="HyperText Markup Language">HTML</abbr> tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*