N-grams and How to Implement Them With the Python NLTK Library

Understanding and creating N-grams for Natural Language Processing (NLP) with Python NLTK library

Loyford Mwenda
Heartbeat



In Natural Language Processing (NLP), we train models to enable computers to understand text and spoken words in the same way humans can. Human language is filled with ambiguities like homonyms, homophones, sarcasm, idioms, metaphors, and grammar and usage exceptions, making it a hassle to train models that accurately determine a text’s intended meaning.

NLP encompasses many tasks, including:

  • Speech recognition.
  • Part of speech tagging.
  • Sentiment analysis.
  • Natural language generation.

Python provides the Natural Language Toolkit (NLTK), which is an open-source collection of libraries for performing NLP tasks.
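If you want to follow along, NLTK can be installed with pip install nltk; the tokenizer models and stopword lists used later in this article are downloaded separately. A minimal setup sketch:

import nltk

# One-time downloads for the resources this article relies on
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword lists used during preprocessing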

In this article, we will discuss N-grams, a way to help machines understand the meaning of words, and learn how to implement them using Python’s NLTK.

What are n-grams?

Language models often estimate the probability distribution over sequences of words. Because sequences can be arbitrarily long, modeling them directly is an enormous task. Instead, we assume that the probability of a word depends only on the N-1 words preceding it. This assumption is called the N-gram language model.

N-grams, therefore, are a type of statistical language model used in Natural Language Processing (NLP) to predict the probability of a sequence of words. An n-gram is a contiguous series of n items from a given text sample, where n is the number of items in the sequence.
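For example, under a bigram model (n = 2), the conditional probability of a word given its predecessor can be estimated directly from counts. Here is a minimal sketch on a toy corpus of our own (illustrative only, not from any particular dataset):

from collections import Counter
from nltk.util import ngrams

# A toy corpus; in practice this would be a large collection of text
tokens = "the market rose and the market fell".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

# P(w2 | w1) is approximated by count(w1, w2) / count(w1)
p = bigram_counts[('the', 'market')] / unigram_counts[('the',)]
print(p)  # 1.0: in this toy corpus, "market" always follows "the"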

Classification of n-grams

We have several classifications of n-grams, depending on the number that n represents. The most commonly used n-grams are:

  • An n-gram of size 1, n = 1, is a unigram.
  • An n-gram of size 2, n = 2, is a bigram.
  • An n-gram of size 3, n = 3, is a trigram.

An n-gram can be of any length, n, and different types of n-grams are suitable for different applications.

We can quickly and easily generate n-grams with the ngrams function available in the nltk.util module. Let’s see how the above n-grams look when generated from the following sentence:

“Natural Language Processing using N-grams is incredibly awesome.”

from nltk.util import ngrams

sentence = "Natural Language Processing using N-grams is incredibly awesome."

def generate_n_grams(sentence, n):
    # split the sentence on whitespace and build the n-gram tuples
    grams = ngrams(sentence.split(), n)
    return list(grams)

For unigrams, n = 1 splits the text into tokens of one word each:

# n = 1
generate_n_grams(sentence, 1)
# Results

('Natural',)
('Language',)
('Processing',)
('using',)
('N-grams',)
('is',)
('incredibly',)
('awesome.',)

For bigrams, n = 2 splits the text into overlapping tokens of two words:

# n = 2
generate_n_grams(sentence, 2)
# Results

('Natural', 'Language')
('Language', 'Processing')
('Processing', 'using')
('using', 'N-grams')
('N-grams', 'is')
('is', 'incredibly')
('incredibly', 'awesome.')

For trigrams, n = 3 splits the text into overlapping tokens of three words:

# n = 3
generate_n_grams(sentence, 3)
# Results

('Natural', 'Language', 'Processing')
('Language', 'Processing', 'using')
('Processing', 'using', 'N-grams')
('using', 'N-grams', 'is')
('N-grams', 'is', 'incredibly')
('is', 'incredibly', 'awesome.')

When n > 3, we refer to them as four-grams, five-grams, and so on.
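Our earlier helper handles these as well; we just pass the desired size:

# n = 4
generate_n_grams(sentence, 4)
# Results

('Natural', 'Language', 'Processing', 'using')
('Language', 'Processing', 'using', 'N-grams')
('Processing', 'using', 'N-grams', 'is')
('using', 'N-grams', 'is', 'incredibly')
('N-grams', 'is', 'incredibly', 'awesome.')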

Applications of n-grams in NLP

  • We can use them to create features from a text corpus for machine learning algorithms like SVM, Naive Bayes, etc. (see the sketch after this list).
  • They help power features like:
    — Autocorrect
    — Autocompletion of sentences
    — Text summarization
    — Speech recognition
    — Dictionary lookup
    — Text compression
    — Language identification, etc.
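To illustrate the first point, scikit-learn’s CountVectorizer can turn raw sentences into unigram and bigram count features that a classifier such as Naive Bayes can consume. A minimal sketch (the two-sentence corpus here is illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the market rose sharply",
    "the market fell sharply",
]

# ngram_range=(1, 2) extracts both unigrams and bigrams as features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)  # sparse matrix of n-gram counts

print(vectorizer.get_feature_names_out())
# ['fell' 'fell sharply' 'market' 'market fell' 'market rose' 'rose'
#  'rose sharply' 'sharply' 'the' 'the market']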

Let’s now grab a dataset, create n-grams from it, and showcase how we can use them in NLP.


Creating n-grams: a step-by-step approach

In this section, we will explore the step-by-step preparation of data for creating n-grams.

The dataset

First, we will need a dataset. We will use the Financial Sentiment Analysis data from Kaggle. The dataset combines two sub-datasets (FiQA and Financial PhraseBank) into one CSV file of financial sentences with sentiment labels.

Let’s load the required imports:

import re  # used for regex-based cleaning later on
import string  # contains sets of punctuation
import pandas as pd
import matplotlib.pyplot as plt  # used for the visualizations below
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import defaultdict
from sklearn.model_selection import train_test_split

Read the data:

data = pd.read_csv('Financial_Sentiment.csv')
print(data.info())
data.head()

We can see that we have two columns of data and 5,842 rows.

Let’s check for the sentiment types and their count in the dataset:

data['Sentiment'].unique(), data['Sentiment'].value_counts()

We have three classes of sentiments which are:

  • neutral
  • positive
  • negative

Check if the dataset contains null values:

data.isnull().sum()

Splitting the data into train and test sets

We split the data into an 80% training set and a 20% test set:

train_set, test_set = train_test_split(data, test_size=0.20, random_state=42)

train_set.shape, test_set.shape

# ((4673, 2), (1169, 2))

Preprocessing the data

Before we can create the n-grams, there are certain operations we need to perform on the data, which are vital when performing NLP.

In our case, we will perform the following operations:

  • Tokenization: Here, we break down the sentences into individual words, also called tokens. With these tokens, we can build a dictionary to represent all the words in a list.
  • Removing punctuation: We do not need punctuation in our tokens.
  • Lowercasing: We need to convert all the tokens into lowercase to avoid redundancy of words so that the model does not interpret words like market, Market, and MARKET as different words.
  • Removing stopwords: Stopwords are words that do not add much meaning to our model, like “the”, “is”, and “her.” These words act as noise, and thus we will remove them.

Please note that these operations are not required for every NLP task. In some tasks, like natural language generation, we may want to keep the stopwords and punctuation, but our focus here is on n-grams.

We can do other operations on NLP data, like stemming and lemmatization, which we will not cover here.

In the following code, we will write a function that performs the above operations on the sentences. It will also generate the n-grams themselves.

def generate_ngrams(sentence, ngram=1):
    # First, convert the sentence to lowercase
    sentence_lower = sentence.lower()

    # Replace everything except letters, digits, and whitespace with a space
    sentence = re.sub(r'[^a-zA-Z0-9\s]', ' ', sentence_lower)

    # Build the set of stopwords and punctuation to remove
    stop = set(stopwords.words('english') + list(string.punctuation))

    # Tokenize and drop stopwords; the print displays the cleaned tokens
    # (you may want to remove it before looping over a whole dataset)
    clean_words = [w for w in word_tokenize(sentence) if w not in stop]
    print(f"\n===Tokens:=== \n{clean_words}\n")

    # Generate the n-grams of the requested size by zipping shifted copies
    grams = zip(*[clean_words[i:] for i in range(ngram)])
    return [" ".join(gram) for gram in grams]

In the function, we pass in the sentence and ngram parameters. We assign a default value of 1 to the ngram parameter, which you can change to generate n-grams of your preferred size.

Let’s test the function:

# Generate n-grams of n = 4 from the text
text = 'Natural Language Processing (NLP) is an awesome task! Learn N-grams today!'
generate_ngrams(text, 4)

It works as expected. Now that our data is ready, let’s dive into creating the actual n-grams from it.

Generating Unigrams

We can create unigrams from each of the sentences, grouped by the three sentiment classes. We will:

  • Check the most frequently used words.
  • Visualize the most frequently used words for each category.

Get each word and generate a unigram where the sentiment is positive:

# Initialize a dictionary to store the words together with their counts
positiveWords = defaultdict(int)

# 1. Traverse the dataframe and pick sentences with positive sentiment
# 2. Preprocess each sentence with the generate_ngrams() function we created
# 3. Store the resulting words and their counts in the defaultdict
# 4. Convert the dictionary into a dataframe

for text in train_set[train_set['Sentiment']=='positive']['Sentence']:
    for word in generate_ngrams(text):
        positiveWords[word] += 1

df_positive = pd.DataFrame(sorted(positiveWords.items(), key=lambda x: x[1], reverse=True))

Repeat the above code to generate unigrams where sentiment is ‘negative’:

negativeWords = defaultdict(int)

for text in train_set[train_set['Sentiment']=='negative']['Sentence']:
    for word in generate_ngrams(text):
        negativeWords[word] += 1

df_negative = pd.DataFrame(sorted(negativeWords.items(), key=lambda x: x[1], reverse=True))

Get every word and generate a unigram where sentiment is ‘neutral’ (note that we first initialize the neutralWords dictionary):

neutralWords = defaultdict(int)

for text in train_set[train_set['Sentiment']=='neutral']['Sentence']:
    for word in generate_ngrams(text):
        neutralWords[word] += 1

df_neutral = pd.DataFrame(sorted(neutralWords.items(), key=lambda x: x[1], reverse=True))

Let’s visualize unigram counts:

x_positive = df_positive[0][:10]
y_positive = df_positive[1][:10]

x_negative = df_negative[0][:10]
y_negative = df_negative[1][:10]

x_neutral = df_neutral[0][:10]
y_neutral = df_neutral[1][:10]


fig, ax = plt.subplots(3, 1, figsize=(10, 7), layout='tight', dpi=100)

ax[0].bar(x_positive, y_positive, color='g')
ax[0].set_title('Top 10 words with positive sentiment')
ax[0].set_xlabel('Words in the positive df')
ax[0].set_ylabel('Count')

ax[1].bar(x_negative, y_negative, color='r')
ax[1].set_title('Top 10 words with negative sentiment')
ax[1].set_xlabel('Words in the negative df')
ax[1].set_ylabel('Count')

ax[2].bar(x_neutral, y_neutral, color='gray')
ax[2].set_title('Top 10 words with neutral sentiment')
ax[2].set_xlabel('Words in the neutral df')
ax[2].set_ylabel('Count')
plt.show()

Unigram counts.

Generating Bigrams

To create the bigrams, we invoke the generate_ngrams() function with the ngram parameter set to 2.

# Define new dictionaries for the bigram counts

positiveWords_bi=defaultdict(int)
negativeWords_bi=defaultdict(int)
neutralWords_bi=defaultdict(int)

Get words and generate bigrams where sentiment is ‘positive’:

# Creating positive bigrams

for text in train_set[train_set['Sentiment']=='positive']['Sentence']:
    for word in generate_ngrams(text, 2):
        positiveWords_bi[word] += 1

df_positive_bi = pd.DataFrame(sorted(positiveWords_bi.items(), key=lambda x: x[1], reverse=True))
df_positive_bi

Get words and generate bigrams where sentiment is ‘negative’:

# Creating negative bigrams

for text in train_set[train_set['Sentiment']=='negative']['Sentence']:
    for word in generate_ngrams(text, 2):
        negativeWords_bi[word] += 1

df_negative_bi = pd.DataFrame(sorted(negativeWords_bi.items(), key=lambda x: x[1], reverse=True))
df_negative_bi

Get words and generate bigrams where sentiment is ‘neutral’:

# Creating neutral bigrams

for text in train_set[train_set['Sentiment']=='neutral']['Sentence']:
    for word in generate_ngrams(text, 2):
        neutralWords_bi[word] += 1

df_neutral_bi = pd.DataFrame(sorted(neutralWords_bi.items(), key=lambda x: x[1], reverse=True))
df_neutral_bi

Visualizing:

x_positive_bi = df_positive_bi[0][:10]
y_positive_bi = df_positive_bi[1][:10]

x_negative_bi = df_negative_bi[0][:10]
y_negative_bi = df_negative_bi[1][:10]

x_neutral_bi = df_neutral_bi[0][:10]
y_neutral_bi = df_neutral_bi[1][:10]


fig, ax = plt.subplots(3, 1, figsize=(10, 7), layout='tight', dpi=100)

ax[0].bar(x_positive_bi, y_positive_bi, color='g', width=0.6)
ax[0].set_title('Top 10 words with positive sentiment (Bigrams)')
ax[0].set_xlabel('Words in the positive df')
ax[0].set_ylabel('Count')

ax[1].bar(x_negative_bi, y_negative_bi, color='r', width=0.6)
ax[1].set_title('Top 10 words with negative sentiment (Bigrams)')
ax[1].set_xlabel('Words in the negative df')
ax[1].set_ylabel('Count')

ax[2].bar(x_neutral_bi, y_neutral_bi, color='gray', width=0.6)
ax[2].set_title('Top 10 words with neutral sentiment (Bigrams)')
ax[2].set_xlabel('Words in the neutral df')
ax[2].set_ylabel('Count')

plt.show()

Bigram counts.

Generating Trigrams

To create the trigrams, we invoke the generate_ngrams() function with the ngram parameter set to 3.

positiveWords_tri = defaultdict(int)
negativeWords_tri = defaultdict(int)
neutralWords_tri = defaultdict(int)

Get words and generate trigrams where sentiment is ‘positive’:

# Creating positive trigrams

for text in train_set[train_set['Sentiment']=='positive']['Sentence']:
    for word in generate_ngrams(text, 3):
        positiveWords_tri[word] += 1

df_positive_tri = pd.DataFrame(sorted(positiveWords_tri.items(), key=lambda x: x[1], reverse=True))
df_positive_tri

Get words and generate trigrams where sentiment is ‘negative’:

for text in train_set[train_set['Sentiment']=='negative']['Sentence']:
    for word in generate_ngrams(text, 3):
        negativeWords_tri[word] += 1

df_negative_tri = pd.DataFrame(sorted(negativeWords_tri.items(), key=lambda x: x[1], reverse=True))
df_negative_tri

Get words and generate trigrams where sentiment is ‘neutral’:

# Creating neutral trigrams

for text in train_set[train_set['Sentiment']=='neutral']['Sentence']:
    for word in generate_ngrams(text, 3):
        neutralWords_tri[word] += 1

df_neutral_tri = pd.DataFrame(sorted(neutralWords_tri.items(), key=lambda x: x[1], reverse=True))
df_neutral_tri

Visualizing:

x_positive_tri = df_positive_tri[0][:10]
y_positive_tri = df_positive_tri[1][:10]

x_negative_tri = df_negative_tri[0][:10]
y_negative_tri = df_negative_tri[1][:10]

x_neutral_tri = df_neutral_tri[0][:10]
y_neutral_tri = df_neutral_tri[1][:10]


fig, ax = plt.subplots(3, 1, figsize=(12, 8), layout='tight', dpi=110)

ax[0].bar(x_positive_tri, y_positive_tri, color='g', width=0.6)
ax[0].set_title('Top 10 words with positive sentiment (Trigrams)')
ax[0].set_xlabel('Words in the positive df')
ax[0].set_ylabel('Count')

ax[1].bar(x_negative_tri, y_negative_tri, color='r', width=0.6)
ax[1].set_title('Top 10 words with negative sentiment (Trigrams)')
ax[1].set_xlabel('Words in the negative df')
ax[1].set_ylabel('Count')

ax[2].bar(x_neutral_tri, y_neutral_tri, color='gray', width=0.6)
ax[2].set_title('Top 10 words with neutral sentiment (Trigrams)')
ax[2].set_xlabel('Words in the neutral df')
ax[2].set_ylabel('Count')

plt.show()

Trigram counts.

Final Thoughts

In this article, we have learned the following:

  • What n-grams are.
  • The classification of n-grams, with examples of unigrams, bigrams, and trigrams.
  • How to implement n-grams of any size from a dataset with the NLTK Python library.
  • Why preprocessing the text is a crucial step in generating effective n-grams.
  • How to visualize the n-grams and get their counts based on sentiment.

Generally, n-grams with n > 1 perform better because they carry more information about a word's context, though larger values of n need more data to produce reliable counts.
