
Text to Exam Generator (NLP) Using Machine Learning

T2E (short for "text to exam") is a vocabulary exam generator that builds questions around the context in which each word is used in a sentence.

Mario World


In this article, I will take you through what it’s like to code your own AI for the first time at the age of 16. There will be a lot of tasks to complete. Are you ready to explore? Let’s begin!



Before It Began

When I started this project, I wanted to make something that I and the people around me, like teachers and friends, would use every day. I came up with the idea of a Natural Language Processing (NLP) AI program that could generate exam questions and answer choices based on Named Entity Recognition (who, what, where, when, why). See the attachment below.

A Named Entity Recognition question example from OpExams — Free question generator.

However, a company already does this exact task very well (OpExams, mentioned above), so I tried to think of something else, and I landed on this idea. The SAT has a vocabulary question type that asks for the correct definition of a word selected from a provided passage. See the attachment below.

Here is an example of an SAT vocabulary question from Kaplan.

Yeap! And that is exactly what I wanted to do.

This part of the flowchart covers the “Before It Began”, “Problem Statement”, and “Metrics and Baseline” sections.

Problem Statement

As someone with a lot of experience helping teachers create exams, I found that there is usually a dedicated section that tests students’ knowledge of vocabulary from a piece of text. With that experience in mind, I came up with the idea of an AI that generates exam questions testing students on vocabulary from a given text. What makes this AI special is that it carefully picks the vocabulary to test based on the student’s English level. For example, if you provide the AI with an article and select CEFR B2 as the student’s English level, the AI goes through every word in the article, keeps the lexical words whose in-context definition is listed at the B2 level, and then asks for the definition of each of those words. This way, the AI produces an exam whose difficulty matches the student more precisely, based on the definition in context, while also reducing the time the teacher has to spend creating the vocabulary section of the exam. All in all, the result is a better exam with more fairness, higher precision, and higher accuracy.

How Does It Work?

  1. The user provides the input by entering a piece of text (a passage or article) and the target English difficulty level (CEFR) into a form.
  2. The AI analyses the part of speech of every word and keeps only the lexical words; it then filters further to the words whose CEFR level matches the level the user selected in the input form.
  3. The AI generates questions asking for the definition of each vocabulary word that survives the filtering. The definition is based on the word’s context, and the AI provides both the correct and wrong choices so the teacher can assemble a complete vocabulary exam.

Preprocess Input Text

Before I started collecting data to fine-tune the model, I kicked this project off by thinking about how the user would interact with the product: what format of text a user (a normal human being) would type in, and how we could process and reformat it for the AI program to understand. So I came up with the workflow shown above. If you look closely, there is also a step in the flowchart where I have to select (pull out) just the lexical words from the entire user text input. To achieve that, I used spaCy’s Linguistic Features [3] to identify the part of speech (POS) of each word, and I let only words tagged NOUN, VERB, ADJ, or ADV pass through the filter and continue to the next step.
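To make that filtering step concrete, here is a minimal sketch of what the lexical-word filter can look like with spaCy. The model name (en_core_web_sm) and the function name are my own choices for illustration, not necessarily what the real project uses.

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Only these coarse POS tags count as "lexical words" in this project
LEXICAL_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def extract_lexical_words(text: str) -> list[tuple[str, str]]:
    """Return (word, POS) pairs for the lexical words in the input text."""
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc if token.pos_ in LEXICAL_POS]

print(extract_lexical_words("The committee published a surprisingly detailed report."))
# e.g. [('committee', 'NOUN'), ('published', 'VERB'), ('surprisingly', 'ADV'), ...]
```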

Finding the Best CEFR Dictionary

This is one of the toughest parts of creating my own machine learning program, because clean data is one of the most important ingredients. My program differs from other NLP projects in that it needs a CEFR dictionary to reference and look up vocabulary while it runs, so I had to find a good dictionary that is easy to use and well formatted. I first tried to scrape the information I wanted from a CEFR dictionary in .txt format that had been converted from a PDF file available on the internet. Unfortunately, it was unusable because the formatting was not consistent at all, resulting in output that was either inaccurate or incomplete. I then asked the university that had created that CEFR dictionary PDF for the data, and they replied to my email saying they couldn’t share it for legal reasons. After that, I asked my school, which runs an educational software company and therefore has the dictionary I need, but again I was turned down. I had almost decided to scrape every word’s definitions, part of speech, and CEFR level from a dictionary website when my mentor found a dictionary that is well formatted and perfect for this type of task, so I didn’t have to. We ended up combining two resources: the NLTK WordNet dictionary for the part of speech and the definitions, and a CSV file [4] that someone posted on GitHub for the CEFR-level labels. I merged them together and made it work.
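Here is a rough sketch of how the two resources can be merged at lookup time. The CSV path and column names are placeholders I made up for illustration; the real file from [4] may be laid out differently.

```python
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")

# Placeholder path and column names ("headword", "pos", "cefr");
# the actual CSV from [4] may use a different layout.
cefr = pd.read_csv("cefr_levels.csv")

# Map spaCy-style POS tags to WordNet POS constants
POS_MAP = {"NOUN": wn.NOUN, "VERB": wn.VERB, "ADJ": wn.ADJ, "ADV": wn.ADV}

def lookup(word: str, pos: str):
    """Return the word's CEFR label (if listed) and up to 4 WordNet definitions."""
    row = cefr[(cefr["headword"] == word.lower()) & (cefr["pos"] == pos.lower())]
    level = row["cefr"].iloc[0] if not row.empty else None
    definitions = [s.definition() for s in wn.synsets(word, pos=POS_MAP[pos])]
    return level, definitions[:4]  # the project caps each word at 4 senses

print(lookup("play", "VERB"))
```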

Data Collection and Cleaning

This step is about preparing the dataset we will train, test, and validate our machine learning model on. Drawing on his experience with this type of data, my mentor found a resource online that we could reshape and convert to fit our needs perfectly: it contains a ton of sentences with brackets around every lexical word, the correct definition of each word in the exact wording that NLTK WordNet uses, and labels we can train an NLI model on (I will explain what NLI is later in this article). This resource is called the SemCor Corpus [5] (we access it via NLTK’s SemcorCorpusReader [6]). The reformatted version of the dataset looks something like this.

It might look quite overwhelming, but this is what data science and computer engineering are about. Actually, it’s not complex at all once you slice the data into groups and break each group down. I have to say the data is of great quality because we converted it from messy raw data into a Python dictionary format that matches our task. By the way, we limited each word to just 4 definitions for ease of use and to keep things simple: some words have 10 meanings while others have fewer than 3, so we had to keep it consistent.
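For a concrete picture of where the raw material comes from, here is a minimal sketch of reading SemCor through NLTK and pairing each sense-tagged word with its WordNet definition. The full reformatting into our question dictionary is omitted, and the chunk handling is simplified.

```python
import nltk
from nltk.corpus import semcor
from nltk.corpus.reader.wordnet import Lemma

nltk.download("semcor")
nltk.download("wordnet")

# Take the first sense-tagged sentence and print each tagged word together
# with the WordNet definition of the sense it was annotated with.
tagged_sentence = semcor.tagged_sents(tag="sem")[0]

for chunk in tagged_sentence:
    label = chunk.label() if isinstance(chunk, nltk.Tree) else None
    if isinstance(label, Lemma):  # only sense-tagged content words carry a Lemma
        word = " ".join(chunk.leaves())
        print(word, "->", label.synset().definition())
```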

Exploratory Data Analysis

This is one of the fun parts, because we get to look into and analyze the data we have collected and cleaned. In short, these are the insights I gathered from examining the entire dataset of around 120,000 sentences that we will use to train, test, and validate our model.

The number of times each word appears across the entire dataset (train, test, and validation sets combined). This word cloud excludes the verb “to be” so we can actually see the vocabulary itself.
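For reference, here is a tiny sketch of the kind of frequency count behind that word cloud. The variable all_sentences is a stand-in for the full tokenized dataset, and the list of “to be” forms is my own approximation.

```python
from collections import Counter

# Forms of "to be" that are excluded from the word cloud
BE_FORMS = {"be", "is", "am", "are", "was", "were", "been", "being"}

def word_frequencies(all_sentences: list[list[str]]) -> Counter:
    """Count how often each word appears across all (tokenized) sentences."""
    counts = Counter()
    for tokens in all_sentences:
        counts.update(t.lower() for t in tokens if t.lower() not in BE_FORMS)
    return counts

# freqs = word_frequencies(all_sentences)   # all_sentences: ~120,000 tokenized sentences
# print(freqs.most_common(20))
```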

The technology behind this AI

This Natural Language Processing (NLP) program uses Zero-Shot Classification via Natural Language Inference (NLI) as its backbone. The approach was proposed by Yin et al. (2019) [7]. This is the link [8] to an article about Zero-Shot Classification in NLP.

Some people call Natural Language Inference (NLI) Textual Entailment, because it processes and analyzes the relationship between two sentences to determine whether they entail, are neutral toward, or contradict each other, using logic and cues such as conjunctions, prepositions, articles, repeated words, similar words or synonyms, and word order.

The model technology used in this program is called BART. BART stands for Bidirectional and Auto-Regressive Transformer and is used for processing human language at the sentence and text level. BART was developed by Facebook AI using unsupervised (self-supervised) learning and combines two main ideas: an encoder-decoder architecture and a masked language model (MLM)-style objective.

The encoder-decoder, as the name suggests, encodes and decodes the language: the encoder encodes the input, and the decoder uses the encoded representation to generate the output.

BART-MNLI visualization of this program, adapted from https://github.com/google-research/bert/issues/644

The technical process behind this BART-MNLI program

This is a description of the BART-MNLI visualization above. We start by inputting text that contains two sentences: a premise and a hypothesis.

  1. Tokenization — splitting a sentence into tokens; a token can be a whole word or a morpheme (a subword, i.e. part of a word).
  2. Text Encoder — passing the tokens through the Transformer model to capture the relationship of each word to its context (converted into 700 numbers).
  3. Dense Layer — passing the Text Encoder output through a fully connected deep neural network to analyse and extract linguistic features (converted into 3 numbers).
  4. Softmax Function — taking the Dense Layer output and converting it into probabilities that specify the relationship between the two sentences: Entailment (Label 2), Neutral (Label 1), or Contradiction (Label 0). Each probability ranges from 0 to 1 (i.e. 0% to 100%).
  5. Relationship Identification — selecting the label with the highest probability, i.e. the relationship the model is most confident about, as the conclusion.

In the end, we look at the probability attached to Label 2 (Entailment) to see how strongly the two sentences entail each other, and we select the premise-hypothesis pair (the question-answer pair) with the highest Label 2 probability as the correct answer to the question.
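To make steps 1 to 5 concrete, here is a minimal inference sketch using the Hugging Face transformers library with facebook/bart-large-mnli. The variable names are mine, and the production code is organized differently.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = 'What is the meaning of the word "computer" in this sentence "[Computer] is good"?'
hypothesis = "a machine for performing calculations automatically"

# Steps 1-2: tokenize the premise-hypothesis pair and encode it
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

# Steps 3-4: classification head + softmax over the 3 labels
with torch.no_grad():
    logits = model(**inputs).logits            # shape (1, 3)
probs = torch.softmax(logits, dim=-1)[0]       # [contradiction, neutral, entailment]

# Step 5: pick the most probable relationship
labels = ["contradiction (0)", "neutral (1)", "entailment (2)"]
print(labels[int(probs.argmax())], "entailment prob:", float(probs[2]))
```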

Build First Model

Before doing anything, I first split the data into 3 sets: a train set, a test set, and a validation set. Keep in mind that we fine-tune the model on the train set only, so the other sets stay unseen and can later be used for testing and validation. We first measure a baseline on the test set, then fine-tune on the train set, and afterwards run the program over the test set again to check how much the accuracy improved after training. A minimal sketch of the split is shown below.
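This sketch uses an 80/10/10 ratio and a placeholder examples list purely for illustration; the real split sizes and record format differ.

```python
from sklearn.model_selection import train_test_split

# `examples` stands in for the full list of premise-hypothesis-label records;
# a real run would load the ~120,000 reformatted SemCor sentences here.
examples = [{"premise": f"sentence {i}", "hypothesis": f"definition {i}", "label": i % 3}
            for i in range(1000)]

# Fine-tuning only ever sees train_set; test_set and validation_set stay unseen.
train_set, holdout = train_test_split(examples, test_size=0.2, random_state=42)
test_set, validation_set = train_test_split(holdout, test_size=0.5, random_state=42)

print(len(train_set), len(test_set), len(validation_set))  # 800 100 100
```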

The model we chose was Facebook Bart Large Mnli (MultiNLI) [9]. It is a pre-trained Natural Language Inference (NLI) model that finds the relationship between two sentences (a premise and a hypothesis). For example, given the sentence “[Computer] is good.” where we want to predict the meaning of the word “computer”, our premise sentence and hypothesis sentences are as follows.

Premise sentence:
- What is the meaning of the word “computer” in this sentence “[Computer] is good”?

Hypothesis sentence:
- a machine for performing calculations automatically
- an expert at calculation (or at operating calculating machines)

It then compares the two sentences and tells us whether they entail, contradict, or are neutral to each other through a label number, in this case: 2 (entailment), 1 (neutral), and 0 (contradiction). Since it supports Zero-Shot Classification out of the box, we technically don’t have to fine-tune it at all. But because our task adapts it in a slightly different way, the fine-tuning we did really brought the accuracy up, by around 25%, just from training on 10,000 sentences (premise-hypothesis-label pairs). It looks easy on the surface, but under the hood it is pretty complicated, especially because this was my first time fine-tuning a model. However, it was interesting enough to get me hooked and make me want to learn even more about machine learning and the mathematics involved, such as matrices and calculus, after this program ends. A hedged sketch of the fine-tuning setup follows.
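Here is a rough sketch of what fine-tuning facebook/bart-large-mnli on premise-hypothesis-label pairs can look like with the Hugging Face Trainer. The hyperparameters, field names, and dataset variables (reusing train_set and test_set from the split sketch above) are illustrative rather than our exact settings.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# train_set / test_set: lists of dicts with "premise", "hypothesis", "label"
# (see the split sketch above)
train_ds = Dataset.from_list(train_set)
eval_ds = Dataset.from_list(test_set)

def tokenize(batch):
    # Encode each premise-hypothesis pair for the NLI classifier
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=256)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="t2e-bart-mnli",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```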

Metrics and Baselines

  • We measure the accuracy of the AI by comparing the meaning of the vocabulary that the AI guesses (predicts) from the context with the correct definition (correct answer) manually labeled by a human; the more predictions match the human labels, the more accurate the model is (a minimal sketch of this check appears after the results below).
  • We are currently aiming for an accuracy of no less than 70%.
  • With the off-the-shelf Facebook Bart Large Mnli model that we started from, our baseline accuracy is 52.41%, as seen in the image below.
  • Our 2 best-performing models currently sit at accuracies of 75.21% and 76.07% (78.80% when tested on a larger test set of 500 sentences instead of the original 117), which is around 25% higher than our baseline, as you can see in the attachments below.
This is the highest accuracy achieved by fine-tuning the model on AWS SageMaker with the training data of 30,000 sentences between sentences 40,000 and 70,000. Tested on 117 sentences in the test set.
This is the highest accuracy achieved by fine-tuning the model on Google Colab with the training data of the first 10,000 sentences. Tested on 117 sentences in the test set.
And this is the 76.07% model again, but tested on a test set of 500 sentences, which is larger than the original test set of just 117 sentences.
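Here is a minimal sketch of that accuracy check: for each test question, we score every candidate definition and count how often the top-scoring one matches the human label. Both test_questions and score_entailment (a wrapper around the NLI call sketched earlier) are placeholder names.

```python
def evaluate_accuracy(test_questions, score_entailment) -> float:
    """test_questions: list of dicts with "premise", "choices", and "answer"
    (the index of the human-labeled correct definition).
    score_entailment(premise, hypothesis) -> entailment probability (float)."""
    correct = 0
    for q in test_questions:
        scores = [score_entailment(q["premise"], choice) for choice in q["choices"]]
        predicted = scores.index(max(scores))   # definition with the highest entailment prob.
        correct += int(predicted == q["answer"])
    return correct / len(test_questions)

# e.g. accuracy = evaluate_accuracy(test_questions, score_entailment)
```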

Error Analysis

This part is about analyzing the errors our program produces and then preventing it from making the same type of error in the production output. For example, if our model performs badly on vocabulary whose part of speech is VERB, we should make the program generate fewer questions about verbs and more about words with other parts of speech, such as nouns, adjectives, and adverbs. The Error Analysis step also helps us prevent the same errors from happening again in the future, which is essentially learning from the mistakes our AI makes.
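Here is a rough sketch of how the per-question results can be grouped by part of speech to spot weak categories. The column names and the tiny sample data are placeholders.

```python
import pandas as pd

# `results` stands in for the per-question evaluation log
results = pd.DataFrame({
    "pos":     ["NOUN", "VERB", "VERB", "ADJ", "ADV", "NOUN"],
    "correct": [1,      0,      0,      1,     1,     1],
})

# Accuracy per part of speech: low rows suggest generating fewer questions for that POS
per_pos_accuracy = results.groupby("pos")["correct"].mean().sort_values()
print(per_pos_accuracy)
```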

Here is what I discovered from doing some error analysis (test set size: 500 sentences).

Prototype Deployment

This step sounded so easy to me at first. Since I have strong front-end skills, especially in HTML and CSS, I thought I would code the entire front-end myself, but my mentor and the teaching assistant suggested I use a deployment tool called “Gradio” and host the program on Hugging Face Spaces, because it’s free and is a new and interesting service. I tried learning how to code the Gradio interface in Python, but it felt like learning a new programming language, which is quite a lot of work. So I thought: what if I use native HTML and CSS for the front-end and Flask in Python for the back-end, since I already have some background knowledge and experience with Flask? I proceeded with that solution, and in the end it worked.

The problem came when I had to upload the program to a server. Our first thought was that Hugging Face Spaces doesn’t support Python Flask as a back-end, but luckily you can host it with a Dockerfile, which opens the door to almost every back-end language, including Flask. After that, I spent a lot of time fixing unexpected bugs and patching possible file-related errors. With Flask, if every user’s output file has the same name and two or more users are on the server at the same time, one person could download or overwrite the other person’s file: if the first user runs the program and gets the output but is disconnected by an unstable internet connection, the second user can run the program and overwrite the first person’s file before they download it. We patched this issue by naming each output file after the user’s IP address instead of using the same name for everyone (see the sketch below).

Apart from that, a lot of fixes were needed to improve the user experience (UX), and the user interface (UI) had to be modified accordingly. This was the first time I used git add, git commit, and git push, but I got used to it very quickly because I pushed new versions of the program, with bug fixes and UI improvements, multiple times a day. ^^
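Here is a minimal Flask sketch of the file-naming-by-IP patch described above. The route, helper name, and file format are mine for illustration, not the production code.

```python
from flask import Flask, request, send_file

app = Flask(__name__)

def generate_exam_file(text: str, path: str) -> None:
    """Placeholder for the real exam-generation step; writes the output to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"Generated exam for input of {len(text)} characters\n")

@app.route("/generate", methods=["POST"])
def generate():
    # Name the output after the client's IP address so two users who run the
    # program at the same time never overwrite (or download) each other's file.
    client_ip = request.remote_addr or "unknown"
    output_path = f"output_{client_ip.replace('.', '_')}.txt"
    generate_exam_file(request.form.get("text", ""), output_path)
    return send_file(output_path, as_attachment=True)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=7860)  # Hugging Face Spaces expects port 7860
```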

This flowchart shows the entire process of making my AI project.

Sources

[1] Presidents | The White House
[2] PLAY | English meaning — Cambridge Dictionary
[3] Spacy Linguistic Features
[4] a CSV file containing CEFR-level labeling
[5] SemCor Corpus
[6] SemcorCorpusReader
[7] Yin et al. (2019), who proposed this approach to Zero-Shot NLP
[8] link to the article about this Zero-Shot Classification NLP
[9] Facebook Bart Large Mnli (MultiNLI)

Visualization tools

Flowchart created in FigJam (a product branch of Figma)
Chart Generator: https://charts.livegap.com/
Word Cloud Generator: https://www.freewordcloudgenerator.com/

Personal note

At the end of the day, I want to say that the AI Builders camp taught me how to make a real AI product from start to finish, and I want to say “thank you” to everyone who helped and supported me, no matter how big or small the help was. I would especially like to thank my mentor, who helped me for tens of hours, and my TA as well. I also learned and absorbed a lot about AI, and more precisely machine learning (ML), including how to train a model and the terminology around it. I also got a lot more comfortable working with huge amounts of data, picking up some of the skills of a data scientist along the way.

Credit: AI Builders, Thailand

Thank you for reading until the end,

Nutnornont Chamadol
