Guide to Sentiment Analysis using Natural Language Processing

Nikhil Raj 28 May, 2024 • 15 min read

Introduction

Sentiment analysis has become crucial in today’s digital age, enabling businesses to glean insights from vast amounts of textual data, including customer reviews, social media comments, and news articles. By utilizing natural language processing (NLP) techniques, sentiment analysis categorizes opinions as positive, negative, or neutral, providing valuable feedback on products, services, or brands. This analysis is powered by various algorithms such as Naive Bayes, Support Vector Machines (SVM), and Recurrent Neural Networks (RNN), which help in understanding the overall sentiment and emotional tone conveyed in the text, making it an indispensable tool for business intelligence and decision-making.

Learning Outcomes

  • Gain comprehensive knowledge about various sentiment analysis tools and their applications in analyzing textual data from different sources.
  • Develop the ability to classify sentiments into categories such as positive, negative, and neutral using advanced sentiment classification techniques.
  • Learn how to calculate sentiment scores and interpret their significance in understanding the overall sentiment conveyed in a given text.
  • Acquire the skills to analyze sentiment in tweets, recognizing the unique challenges and opportunities presented by social media data.
  • Explore how artificial intelligence (AI) enhances the accuracy and efficiency of sentiment analysis, and understand its role in automating the sentiment analysis process.
  • Develop the ability to assess customer sentiment from various textual data sources, providing insights into customer opinions and improving customer experience strategies.
  • Equip data scientists with the necessary tools and techniques to effectively conduct sentiment analysis and derive actionable insights from textual data.

This article was published as a part of the Data Science Blogathon.

What is Sentiment Analysis?

Sentiment analysis is a method that identifies the emotional state or sentiment behind a situation, often using NLP to analyze text data. Language serves as a mediator for human communication, and each statement carries a sentiment, which can be positive, negative, or neutral.

Suppose there is a fast-food chain company selling a variety of food items like burgers, pizza, sandwiches, and milkshakes. They have created a website where customers can order food and provide reviews.

For instance,

  • User Review 1: “I love this cheese sandwich, it’s so delicious,” is a positive review, indicating customer satisfaction.
  • User Review 2: “This chicken burger has a very bad taste,” is negative, highlighting an issue with the burger.
  • User Review 3: “I ordered this pizza today,” is neutral, not indicating the customer’s satisfaction.

By analyzing these reviews, the company can conclude that they need to focus on promoting their sandwiches and improving their burger quality to increase overall sales.
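To make the three labels concrete, here is a minimal sketch of a word-list scorer applied to the reviews above. The word lists and the `classify_review` helper are hypothetical and chosen only for this illustration; they are not the model built later in this article.

```python
import re

# Hypothetical word lists, chosen only to cover the three example reviews.
POSITIVE_WORDS = {"love", "delicious", "good", "great"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "hate"}

def classify_review(text):
    """Label a review by counting positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "I love this cheese sandwich, it's so delicious",
    "This chicken burger has a very bad taste",
    "I ordered this pizza today",
]
labels = [classify_review(r) for r in reviews]
for review, label in zip(reviews, labels):
    print(f"{label:>8}: {review}")
```

A fixed word list like this breaks down quickly on real reviews, which is one reason to train a model on labelled data instead.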

But a problem arises: there will be hundreds or thousands of user reviews for their products, and after a point it becomes nearly impossible to read through every review and draw a conclusion.

A Sentiment Analysis Model is crucial for identifying patterns in user reviews, as initial customer preferences may lead to a skewed perception of positive feedback. By processing a large corpus of user reviews, the model provides substantial evidence, allowing for more accurate conclusions than assumptions from a small sample of data.

We will explore the workings of a basic Sentiment Analysis model using NLP later in this article. Furthermore, principal sentiments like “positive” and “negative” can be broken down into more nuanced sub-sentiments such as “Happy,” “Love,” “Surprise,” “Sad,” “Fear,” and “Angry,” depending on specific business requirements.

Real-World Example

  • Historical Perspective: Initially, social media services like Facebook had only two reactions: “like” or no reaction (dislike).
  • Granular Sentiments: Over time, reactions evolved into more granular sentiments such as “like,” “love,” “sad,” and “angry.”
  • Enhanced Customer Experience: Companies promoting products on Facebook now receive more specific feedback, improving customer experience.
  • Targeted Customer Handling: Granular feedback allows companies to address customers with different sentiments (e.g., “sad” vs. “angry”) more effectively.
  • Need for Advanced Sentiment Analysis: Modern business requires more than the bare minimum; targeted strategies based on specific sentiments are essential.
  • Challenges of Natural Language: Human communication in natural language is complex and messy, making it challenging for machines to interpret.
  • Role of Natural Language Processing (NLP): NLP is needed to help computers understand human language, which includes various styles and sentiments.
  • Sentiment Analysis as a Sub-field of NLP: Sentiment Analysis uses machine learning techniques to identify and extract insights from textual data.

Types of Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that involves identifying and extracting the subjective information in an input text. This can be an opinion, a judgment, or a feeling about a particular topic or product. Here are the main types of sentiment analysis:

  • Fine-grained Sentiment Analysis: This goes beyond just positive, negative, or neutral. It involves very specific ratings, like a 5-star rating, for example.
  • Emotion detection: This aims to detect emotions like happiness, frustration, anger, sadness, etc. The biggest challenge here is being able to accurately identify these emotions in text.
  • Aspect-based Sentiment Analysis: This is generally used to understand specific aspects of a certain product or service. For example, in a review like “The battery life of this phone is great, but the screen is not very clear”, the sentiment towards the battery life is positive, but it’s negative towards the screen.
  • Multilingual sentiment analysis: This can be particularly challenging because the same word can convey different sentiments in different languages.
  • Intent Analysis: This goes a step further to understand the user’s intention behind a certain statement. For example, a statement like “I would need a car” might indicate a purchasing intent.

Sentiment analysis is a complex task because of the inherent ambiguity of human language. Sarcasm, for example, is especially difficult to detect. Consequently, the accuracy of sentiment analysis depends largely on the complexity of the task and the system’s ability to learn from large amounts of data.

Why Is Sentiment Analysis Important?

Sentiment analysis is important for several reasons:

  • Business Intelligence: It helps businesses understand how their customers feel about their products or services. This can guide improvements, address customer concerns, and enhance overall customer satisfaction.
  • Market Research: By analyzing public sentiment towards products, services, or brand mentions on social media, companies can gain insights into market trends and competitors.
  • Customer Service: Sentiment analysis can help identify negative reviews or feedback in real-time, allowing for quicker responses and problem resolution.
  • Product Analytics: It can be used to understand user feedback on various aspects of a product, helping drive product strategy and development.
  • Public Relations: Sentiment analysis can help monitor public sentiment towards a company or individual, enabling proactive management of public relations.
  • Politics and Public Policy: In politics, sentiment analysis is used to gauge public opinion towards policies or political entities, which can inform strategy and messaging.

Keep in mind that the goal of sentiment analysis using NLP isn’t simply to understand opinion but to use that understanding to achieve specific objectives. It’s a useful tool, but like any tool, its value comes from how it’s used.

Sentiment Analysis Challenges

Sentiment analysis, while powerful, comes with its own set of challenges:

  • Sarcasm and Irony: These linguistic features can completely reverse the sentiment of a statement. Detecting sarcasm and irony is a complex task even for humans, and it’s even more challenging for AI systems.
  • Contextual Understanding: The sentiment of certain words can change based on the context in which they’re used. For example, the word “sick” can have a negative connotation in a health-related context (“I’m feeling sick”) but can be positive in a different context (“That’s a sick beat!”).
  • Negations and Double Negatives: Phrases like “not bad” or “not unimpressive” can be difficult to interpret correctly because they require understanding of double negatives and other linguistic nuances.
  • Emojis and Slang: Text data, especially from social media, often contains emojis and slang. The sentiment of these can be hard to determine as their meanings can be subjective and vary across different cultures and communities.
  • Multilingual Sentiment Analysis: Sentiment analysis becomes significantly more difficult when applied to multiple languages. Direct translation might not carry the same sentiment, and cultural differences can further complicate the analysis.
  • Aspect-Based Sentiment Analysis: Determining sentiment towards specific aspects within a text can be challenging. For instance, a restaurant review might have a positive sentiment towards the food, but a negative sentiment towards the service.

These challenges highlight the complexity of human language and communication. Overcoming them requires advanced NLP techniques, deep learning models, and a large amount of diverse and well-labelled training data. Despite these challenges, sentiment analysis continues to be a rapidly evolving field with vast potential.
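The negation challenge listed above can be illustrated with a small sketch. The lexicon, the negator list, and the `polarity` helper are hypothetical; the rule used here (flip the polarity of the word directly after a negator) is only one simple heuristic, not a complete solution.

```python
import re

POSITIVE_WORDS = {"good", "great", "impressive"}   # hypothetical lexicon
NEGATIVE_WORDS = {"bad", "awful", "unimpressive"}  # hypothetical lexicon
NEGATORS = {"not", "no", "never"}

def polarity(text, handle_negation):
    """Sum word polarities; optionally flip the word right after a negator."""
    words = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, word in enumerate(words):
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        value = (word in POSITIVE_WORDS) - (word in NEGATIVE_WORDS)
        if handle_negation and i > 0 and words[i - 1] in NEGATORS:
            value = -value
        score += value
    return score

naive = polarity("The food was not bad", handle_negation=False)
flipped = polarity("The food was not bad", handle_negation=True)
print(naive, flipped)
```

Without the negation rule the sentence scores as negative; with it the score becomes positive. Sarcasm and longer-range negation would still defeat this heuristic.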

Applications of Sentiment Analysis

Sentiment Analysis has a wide range of applications across various domains. Here are some key applications:

  • Customer Feedback: Businesses use sentiment analysis to process customer feedback and reviews. This helps them understand customer satisfaction and preferences, and make data-driven decisions.
  • Social Media Monitoring: Brands monitor social media platforms to understand public sentiment about their products or services. This can help in reputation management and in identifying potential crises before they escalate.
  • Market Research: Sentiment analysis can be used to understand public opinion about a product or a political event. This can provide valuable insights for market research.
  • Product Analytics: Companies use sentiment analysis to gather insights from product reviews. This can guide product enhancements and innovations.
  • Healthcare: In healthcare, sentiment analysis can be used to understand patient experiences and feedback about treatments, doctors, or hospitals.
  • Finance: In the financial sector, sentiment analysis is used to gauge market sentiment. Traders and investors use this information to make informed decisions.
  • Politics: In politics, sentiment analysis is used to understand public opinion about certain policies or politicians. This can guide political campaigns and strategies.
  • Human Resources: HR departments use sentiment analysis to understand employee feedback and improve workplace culture.

Remember, these are just a few examples. The potential applications of sentiment analysis are vast and continue to grow with advancements in AI and machine learning technologies.

Step by Step procedure to Implement Sentiment Analysis

First, let’s import all the python libraries that we will use throughout the program.

Step1: Basic Python Libraries

  • Pandas – library for data analysis and data manipulation
  • Matplotlib – library used for data visualization
  • Seaborn – a library based on matplotlib and it provides a high-level interface for data visualization
  • WordCloud – library to visualize text data
  • re – provides functions to pre-process the strings as per the given regular expression
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import re

Step2: Natural Language Processing

  • nltk – Natural Language Toolkit is a collection of libraries for natural language processing
  • stopwords – a collection of common words that add little meaning to a sentence
  • WordNetLemmatizer – used to convert different forms of words into a single item but still keeping the context intact.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Step3: Scikit-Learn (Machine Learning Library for Python)

  • CountVectorizer – transform text to vectors
  • GridSearchCV – for hyperparameter tuning
  • RandomForestClassifier – machine learning algorithm for classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Step4: Evaluation Metrics

  • Accuracy Score: number of correctly classified instances / total number of instances
  • Precision Score: for a given class, the ratio of correctly predicted instances of that class over all instances predicted as that class
  • Recall Score: for a given class, the ratio of correctly predicted instances of that class over all actual instances of that class
  • Roc Curve: a plot of true positive rate against false positive rate
  • Classification Report: report of precision, recall and f1 score
  • Confusion Matrix: a table of predicted versus actual class counts, used to describe the performance of a classification model
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,classification_report
from scikitplot.metrics import plot_confusion_matrix

Step5: Evaluate Dataset

We will use the dataset which is available on Kaggle for sentiment analysis using NLP, which consists of a sentence and its respective sentiment as a target variable. This dataset contains 3 separate files named train.txt, test.txt and val.txt.

You can find the dataset here.

Now, we will read the training data and validation data. As the data is in text format, separated by semicolons and without column names, we will create the data frame with read_csv() and parameters as “delimiter” and “names”.

df_train = pd.read_csv("train.txt",delimiter=';',names=['text','label'])
df_val = pd.read_csv("val.txt",delimiter=';',names=['text','label'])

Next, we will concatenate these two data frames. Because we will be using cross-validation and already have a separate test dataset, we don’t need a separate validation set. We then reset the index to avoid duplicate index values.

df = pd.concat([df_train,df_val])
df.reset_index(inplace=True,drop=True)

We can view a sample of the contents of the dataset using the “sample” method of pandas, and check the number of records and features using the “shape” attribute.

print("Shape of the DataFrame:",df.shape)
print(df.sample(5))

Now, we will check for the various target labels in our dataset using seaborn.

[Figure: count plot of the target labels]
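The plotting code is not shown in the original, so here is a minimal sketch of the underlying tally using only the standard library. The sample lines are made up and merely follow the dataset's "text;label" layout. With the article's data frame, a call such as sns.countplot(x=df['label']) is one way to draw the same counts as a chart; that exact call is an assumption.

```python
from collections import Counter

# Made-up sample in the dataset's "text;label" layout; the real files are much larger.
sample_lines = [
    "i feel so happy today;joy",
    "i am afraid of the dark;fear",
    "i feel loved by my friends;love",
    "i am furious about the delay;anger",
    "i feel great about this;joy",
    "i feel lonely and sad;sadness",
]

# The label is everything after the last semicolon on each line.
labels = [line.rsplit(";", 1)[1] for line in sample_lines]
label_counts = Counter(labels)

# A text version of a count plot: one bar of '#' per label.
for label, count in label_counts.most_common():
    print(f"{label:<8} {'#' * count} ({count})")
```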

As we can see, there are 6 labels or targets in the dataset. We could build a multi-class classifier for Sentiment Analysis using NLP, but for the sake of simplicity we will merge these labels into two classes, i.e. Positive and Negative sentiment.

  • Positive Sentiment – “joy”, “love”, “surprise”
  • Negative Sentiment – “anger”, “sadness”, “fear”

Now, we will create a custom encoder to convert categorical target labels to numerical form, i.e. (0 and 1)

def custom_encoder(labels):
    # Map the six emotion labels onto two classes in place:
    # 1 = positive sentiment, 0 = negative sentiment
    labels.replace({"surprise": 1, "love": 1, "joy": 1,
                    "fear": 0, "anger": 0, "sadness": 0}, inplace=True)
custom_encoder(df['label'])
[Figure: distribution of the encoded target labels]

Now, we can see that our target has changed to 0 and 1, i.e. 0 for Negative and 1 for Positive, and the two classes are more or less balanced.

Step6: Data Pre-processing

Now, we will perform some pre-processing on the data before converting it into vectors and passing it to the machine learning model.

We will create a function for pre-processing of data.

  • First, we will iterate through each record and use a regular expression to replace every character that is not a letter with a space.
  • Then, we will convert the string to lowercase. Otherwise “Good” and “good” would be treated as different words, and two separate vector entries would be created for what is really the same word.
  • Then we will check for stopwords in the data and get rid of them. Stopwords are commonly used words in a sentence such as “the”, “an”, “to” etc. which do not add much value.
  • Then, we will perform lemmatization on each word, i.e. change the different forms of a word into a single item called a lemma.

A lemma is a base form of a word. For example, “run”, “running” and “runs” are all forms of the same lexeme, where the “run” is the lemma. Hence, we are converting all occurrences of the same lexeme to their respective lemma. And, then return a corpus of processed data.

But first, we will create an object of WordNetLemmatizer and then we will perform the transformation.

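# download the NLTK resources once (some NLTK versions may also need 'omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')
#object of WordNetLemmatizer
lm = WordNetLemmatizer()
# build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
def text_transformation(df_col):
    corpus = []
    for item in df_col:
        # keep letters only, lowercase, and split into words
        new_item = re.sub('[^a-zA-Z]',' ',str(item))
        new_item = new_item.lower()
        new_item = new_item.split()
        # drop stop words and lemmatize the remaining words
        new_item = [lm.lemmatize(word) for word in new_item if word not in stop_words]
        corpus.append(' '.join(new_item))
    return corpus
corpus = text_transformation(df['text'])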
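To see what the cleaning steps do to a single sentence, here is a minimal standard-library sketch. The `clean_text` helper and its tiny stop-word list are hypothetical, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Tiny hypothetical stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"i", "the", "a", "an", "to", "is", "it", "so", "and", "was"}

def clean_text(text):
    """Keep letters only, lowercase, split into words and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)

cleaned = clean_text("The pizza was GOOD, and I ordered it 2 times!")
print(cleaned)
```

Punctuation and the digit disappear, the capitalised words are lowercased, and only the content words remain.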

Now, we will create a Word Cloud. It is a data visualization technique that depicts text so that more frequent words appear larger than less frequent words. This gives us a little insight into how the data looks after being processed through all the steps so far.

# set the figure size (plt.rcParams avoids a separate rcParams import)
plt.rcParams['figure.figsize'] = 20,8
# join the processed documents into one space-separated string
word_cloud = " ".join(corpus)
wordcloud = WordCloud(width = 1000, height = 500,background_color ='white',min_font_size = 10).generate(word_cloud)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Output:

[Figure: word cloud of the processed corpus]

Step7: Bag of Words

Now, we will use the Bag of Words (BOW) model, which represents text as a bag of words: grammar and word order are ignored, and what matters is multiplicity, i.e. the number of times each word occurs in a document.

Basically, it describes the total occurrence of words within a document.

Scikit-Learn provides a neat way of performing the bag of words technique using CountVectorizer.

Now, we will convert the text data into vectors, by fitting and transforming the corpus that we have created.

cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(corpus)
X = traindata
y = df.label

We set ngram_range to (1,2), which means the vectorizer extracts both unigrams (single words) and bigrams (pairs of adjacent words).

An n-gram is a sequence of ‘n’ consecutive words in a sentence. The ‘ngram_range’ parameter lets us give importance to combinations of words, since “social media” has a different meaning than “social” and “media” separately.

We can experiment with the value of the ngram_range parameter and select the option which gives better results.
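As a rough picture of what counting with ngram_range=(1,2) means, the sketch below counts unigrams and bigrams with the standard library. The `ngram_counts` helper is hypothetical and splits on whitespace only; CountVectorizer applies its own tokenization rules, so its vocabulary can differ.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count every n-gram of whitespace-separated words for n in the given range."""
    words = text.split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        # slide a window of n words across the sentence
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts

counts = ngram_counts("social media love social media")
for term, count in sorted(counts.items()):
    print(term, count)
```

The bigram "social media" gets its own count next to the separate counts for "social" and "media", which is how word combinations become features.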

Now comes the machine learning model creation part and in this project, I’m going to use Random Forest Classifier, and we will tune the hyperparameters using GridSearchCV.

GridSearchCV() will take the following parameters.

  • Estimator or model: RandomForestClassifier in our case
  • parameter: dictionary of hyperparameter names and their values
  • cv: signifies cross-validation folds
  • return_train_score: returns the training scores of the various models
  • n_jobs: no. of jobs to run in parallel (“-1” signifies that all CPU cores will be used, which reduces the training time drastically)

First, we will create a dictionary, “parameters”, which will contain the values of different hyperparameters.

We will pass this as a parameter to GridSearchCV to train our random forest classifier model using all possible combinations of these parameters to find the best model.

parameters = {'max_features': ('sqrt','log2'),  # 'auto' is no longer accepted by recent scikit-learn releases
             'n_estimators': [500, 1000, 1500],
             'max_depth': [5, 10, None],
             'min_samples_split': [5, 10, 15],
             'min_samples_leaf': [1, 2, 5, 10],
             'bootstrap': [True, False]}
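Before running the search, it helps to know how many models it will train. The sketch below counts the combinations for a grid with the same number of options per hyperparameter as the dictionary above; the fold count of 5 is taken from the cv=5 setting used in this article.

```python
from itertools import product

# Same grid shape as above: 2 x 3 x 3 x 3 x 4 x 2 options.
parameters = {
    "max_features": ["sqrt", "log2"],
    "n_estimators": [500, 1000, 1500],
    "max_depth": [5, 10, None],
    "min_samples_split": [5, 10, 15],
    "min_samples_leaf": [1, 2, 5, 10],
    "bootstrap": [True, False],
}

# Every combination of one value per hyperparameter is one candidate model.
combinations = list(product(*parameters.values()))
n_candidates = len(combinations)
cv_folds = 5
n_fits = n_candidates * cv_folds  # each candidate is trained once per fold
print(n_candidates, n_fits)
```

With forests of up to 1500 trees, this many fits can take a long time, so a smaller grid is a reasonable starting point. By default, GridSearchCV also refits the best candidate once on the full data.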

Now, we will fit the data into the grid search and view the best parameter using the “best_params_” attribute of GridSearchCV.

grid_search = GridSearchCV(RandomForestClassifier(),parameters,cv=5,return_train_score=True,n_jobs=-1)
grid_search.fit(X,y)
grid_search.best_params_

Output:

[Figure: best parameters returned by the grid search]

And then, we can view all the models and their respective parameters, mean test score and rank, as GridSearchCV stores all the results in the cv_results_ attribute.

# loop over every candidate instead of hard-coding the number of combinations
for i in range(len(grid_search.cv_results_['params'])):
    print('Parameters: ',grid_search.cv_results_['params'][i])
    print('Mean Test Score: ',grid_search.cv_results_['mean_test_score'][i])
    print('Rank: ',grid_search.cv_results_['rank_test_score'][i])

Output: (a sample of the output)

[Figure: sample of the printed grid search results]

Now, we will choose the best parameters obtained from GridSearchCV and create a final random forest classifier model and then train our new model.

# pass the best hyperparameters found by the grid search straight to the final model
rfc = RandomForestClassifier(**grid_search.best_params_)
rfc.fit(X,y)

Step8: Test Data Transformation

Now, we will read the test data and perform the same transformations we did on training data and finally evaluate the model on its predictions.

test_df = pd.read_csv('test.txt',delimiter=';',names=['text','label'])
X_test,y_test = test_df.text,test_df.label
#encode the labels into two classes , 0 and 1 (custom_encoder modifies y_test in place)
custom_encoder(y_test)
#pre-processing of text
test_corpus = text_transformation(X_test)
#convert text data into vectors
testdata = cv.transform(test_corpus)
#predict the target
predictions = rfc.predict(testdata)

Step9: Model Evaluation

We will evaluate our model using various metrics such as Accuracy Score, Precision Score, Recall Score and Confusion Matrix, and create a ROC curve to visualize how our model performed.

plt.rcParams['figure.figsize'] = 10,5
plot_confusion_matrix(y_test,predictions)
acc_score = accuracy_score(y_test,predictions)
pre_score = precision_score(y_test,predictions)
rec_score = recall_score(y_test,predictions)
print('Accuracy_score: ',acc_score)
print('Precision_score: ',pre_score)
print('Recall_score: ',rec_score)
print("-"*50)
cr = classification_report(y_test,predictions)
print(cr)

Output:

[Figure: printed scores and classification report]

Confusion Matrix:

[Figure: confusion matrix]

Step10: Roc Curve

We will find the probability of each class using the predict_proba() method of Random Forest Classifier and then we will plot the ROC curve.

predictions_probability = rfc.predict_proba(testdata)
fpr,tpr,thresholds = roc_curve(y_test,predictions_probability[:,1])
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],linestyle='--')  # diagonal reference line for a random classifier
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
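To show where the points on such a curve come from, here is a minimal sketch that computes them by hand. The `roc_points` helper, the toy labels and the toy scores are made up for illustration; each distinct score is used as a threshold, and an instance is predicted positive when its score is at or above the threshold.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct score used as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        points.append((fp / negatives, tp / positives))
    return points

y_true = [0, 0, 1, 1]            # toy ground truth
scores = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities of class 1
points = roc_points(y_true, scores)
print(points)
```

Lowering the threshold moves the point up and to the right: more true positives are caught, at the cost of more false positives.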

As we can see, our model performed very well in classifying the sentiments, with accuracy, precision and recall of approximately 96%. The ROC curve and confusion matrix look good as well, which means that our model classifies the labels accurately, with few errors.

Now, we will check for custom input as well and let our model identify the sentiment of the input statement.

Predict for Custom Input:

def expression_check(prediction_input):
    if prediction_input == 0:
        print("Input statement has Negative Sentiment.")
    elif prediction_input == 1:
        print("Input statement has Positive Sentiment.")
    else:
        print("Invalid Statement.")
# function to take the input statement and perform the same transformations we did earlier
def sentiment_predictor(input_text):
    # input_text is a list containing one statement
    processed = text_transformation(input_text)
    transformed_input = cv.transform(processed)
    prediction = rfc.predict(transformed_input)
    # predict returns an array, so check its single element
    expression_check(prediction[0])
input1 = ["Sometimes I just want to punch someone in the face."]
input2 = ["I bought a new phone and it's so good."]
sentiment_predictor(input1)
sentiment_predictor(input2)

Output:

[Figure: predicted sentiment for the two inputs]

As we can see, our model accurately classified the sentiments behind the two sentences.

Conclusion

Sentiment analysis using NLP stands as a powerful tool in deciphering the complex landscape of human emotions embedded within textual data. By leveraging various techniques and methodologies such as text analysis and lexicon-based approaches, analysts can extract valuable insights, ranging from consumer preferences to political sentiment, thereby informing decision-making processes across diverse domains. The polarity of sentiments identified helps in evaluating brand reputation and other significant use cases. As we conclude this journey through sentiment analysis, it becomes evident that its significance transcends industries, offering a lens through which we can better comprehend and navigate the digital realm.

Key Takeaways

  • The study highlighted the importance of sentiment analysis in various applications, such as customer support and survey responses, demonstrating how it can provide valuable insights.
  • Effective techniques were employed to accurately identify and process negative words and neutral sentiment, crucial for precise sentiment classification.
  • Utilizing open source tools facilitated the development and implementation of efficient sentiment analysis models, ensuring accessibility and adaptability.
  • A Random Forest classifier tuned with GridSearchCV was used to predict sentiment labels, showing that classical machine learning models can handle sentiment analysis tasks effectively.
  • Understanding the semantic context was emphasized as a critical factor in accurately interpreting sentiment from unstructured data.
  • The automation of sentiment analysis processes proved beneficial, enhancing efficiency and scalability in analyzing large datasets.
  • Practical case studies demonstrated the real-world applicability and benefits of sentiment analysis, particularly in improving customer support and interpreting survey responses.

Frequently Asked Questions

Q1. How does sentiment analysis work?

A. Sentiment analysis uses natural language processing to determine the emotional tone behind a body of text, such as online reviews or social media posts.

Q2. How do positive words affect sentiment analysis?

A. Positive words contribute to identifying favorable sentiments in texts, helping in categorizing feedback as positive in sentiment analysis.

Q3. How is sentiment analysis applied to online reviews?

A. Sentiment analysis processes online reviews to gauge customer satisfaction and sentiment, providing insights into overall consumer opinions.

Q4. What is a rule-based approach in sentiment analysis?

A. A rule-based approach in sentiment analysis uses predefined linguistic rules to classify text sentiment, relying on lists of positive and negative words.

Q5. What does categorizing mean in the context of sentiment analysis?

A. Categorizing in sentiment analysis refers to the process of classifying text into categories like positive, negative, or neutral based on the sentiment expressed.

Q6. What is an API in sentiment analysis?

A. An API (Application Programming Interface) in sentiment analysis allows developers to integrate sentiment analysis algorithms into their applications, facilitating automated sentiment classification.

Q7. What are sentiment analysis algorithms?

A. Sentiment analysis algorithms are computational methods used to evaluate and categorize sentiment in text data, employing techniques from machine learning and natural language processing.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
