                         CSCI – 755 – ARTIFICIAL INTELLIGENCE

TERM PAPER REPORT

                        SENTIMENT ANALYSIS IN NLP

                     A STATE OF THE ART REVIEW

 

BAALA CHANDHREN NATARAJAN

STUDENT ID: 1135780

ABSTRACT

Social media is the dominant platform today, and almost everybody is involved
in it in one way or another. Machine learning is another flourishing field,
and the two combine powerfully. For example, when somebody wants to try a new
restaurant, watch a movie, take an online course or join a college, the first
thing they do is search Google to see whether it has good reviews. One way to
summarize such reviews automatically is sentiment analysis, in which the
emotion expressed in a text is captured and classified.

Sentiment analysis refers to the class of computational and natural language
processing techniques used to identify, extract or characterize subjective
information, such as opinions, expressed in a given piece of text. The main
purpose of sentiment analysis is to classify a writer's attitude towards
various topics into positive, negative or neutral categories. Sentiment
analysis has many applications in different domains including, but not limited
to, business intelligence, politics and sociology. Data such as movie reviews,
web postings, tweets and videos offer significant opportunities to study and
analyze human opinions and sentiments.

There are many scenarios in which sentiment analysis is used. One such
scenario is rating a movie (on a scale of 1-10) based on its online reviews.
Movie review analysis is one of the most popular fields for analyzing public
sentiment. Sentiment analysis faces many challenges, such as cleaning and
pre-processing the data. Only if this is done properly can the classifier be
trained well and reach its best accuracy. There are many packages and tools
for data cleaning, and in NLTK resources such as the stop word list can be
imported from the nltk.corpus package.

 

INTRODUCTION

In the current era, data is used for many purposes, and analyzing data to
draw conclusions has become common practice. Large companies such as Google,
Twitter and Facebook give data to engineers to analyze and extract opinions
from. Based on those opinions, companies improve their efficiency or change
their strategy. For instance, in the early 2000s companies handed out
feedback forms that users filled in and employees analyzed manually. Now the
feedback form is given online, the feedback is used as a dataset, and an
algorithm analyzes it and draws a conclusion. This process is commonly called
“sentiment analysis”. Sentiment analysis [1] examines the sentiment expressed
in the data. Based on its polarity, the data can be classified as positive,
negative or neutral. It is a subdomain of natural language processing [2], in
which a machine tries to understand human language.

In this report, we use movie reviews from websites such as IMDB, Fandango,
Metacritic and Amazon as our dataset to find out whether a movie is good or
bad based on the reviews given by different users. Each movie has a minimum
of 30 reviews and a maximum of 200 reviews. Based on the polarity of its
reviews, each movie is rated on a scale of 1 to 10. Our initial step is to
clean the data by tokenizing, removing the stop words and applying
part-of-speech tagging. Once the data is cleaned, the text can be classified
using SentiWordNet. SentiWordNet [3] is an extension of the WordNet database
that helps to classify words as positive, negative or neutral.

About 450 movies are used for training and about 250 movies for testing the
classifier, which is a Support Vector Machine (SVM). SVM [4] is a supervised
machine learning algorithm that is extensively used for classifying images
and text. This classifier is effective because it finds a hyperplane that
divides the data into two classes. In our case, the negative and the positive
reviews form the two classes. SVM is known for its high accuracy.
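
The rating step mentioned above, turning the polarity of a movie's reviews
into a rating from 1 to 10, can be sketched roughly as follows. The report
does not state the exact mapping, so the linear mapping from the share of
positive reviews and the function name rate_movie are assumptions made purely
for illustration.

```python
def rate_movie(review_labels):
    """Map 0/1 review labels (1 = positive) to a rating between 1 and 10.

    Assumption: the rating grows linearly with the share of positive reviews.
    """
    if not review_labels:
        raise ValueError("at least one review is required")
    positive_share = sum(review_labels) / len(review_labels)
    # 0% positive reviews -> 1.0, 100% positive reviews -> 10.0
    return round(1 + 9 * positive_share, 1)


print(rate_movie([1, 1, 0, 0]))  # half of the reviews are positive -> 5.5
```

Any monotonic mapping would serve the same purpose; the linear one is simply
the easiest to read.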

 

RELATED WORK – LITERATURE SURVEY

There has been tremendous progress in natural language processing, especially
in opinion mining, which is another term for sentiment analysis. The field
improves continually as new techniques are discovered.

Sahu and Ahuja [5] compared different machine learning classifiers and found
that Random Forest produced the highest accuracy, at 88.95%. They designed an
algorithm that uses SentiWordNet to classify movie reviews as positive,
negative or neutral.

Sharma and Mishra [6] compared different lexical resources, namely WordNet,
SentiWordNet and Opinion Lexicon, and reported results for each. They
designed a GUI that takes a movie review as input and classifies it as
positive, negative or neutral. They also handled negative words in a positive
review with a custom algorithm called negation handling: all the words are
checked and, if there is a negation word, the polarity of the sentence is
multiplied by -1. Synonyms were handled by merging words that have the same
meaning. For example, “movie”, “film” and “picture” all mean the same, and
whenever any of these words is present it is replaced by “movie”.
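
The negation and synonym handling described above can be sketched in a few
lines. This is not the implementation from the cited paper: the word lists and
the function names are hypothetical and only illustrate the idea of flipping
the sign of a polarity and mapping synonyms to one canonical word.

```python
# Hypothetical word lists for illustration; not taken from the cited paper.
NEGATION_WORDS = {"not", "no", "never"}
SYNONYMS = {"film": "movie", "picture": "movie"}


def normalize_synonyms(tokens):
    """Replace each token by its canonical synonym, if one is defined."""
    return [SYNONYMS.get(token, token) for token in tokens]


def apply_negation(tokens, polarity):
    """Flip the sign of a sentence polarity when a negation word is present."""
    if any(token in NEGATION_WORDS for token in tokens):
        return -polarity
    return polarity


tokens = normalize_synonyms("this film is not good".split())
print(tokens)                       # ['this', 'movie', 'is', 'not', 'good']
print(apply_negation(tokens, 0.6))  # -0.6
```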

Nagamma et al. [7] predicted the box office collection of a movie based on
its online reviews. They used TF-IDF (Term Frequency – Inverse Document
Frequency) features for sentiment classification. The polarity of a document
was calculated by dividing the number of adjectives that occurred in the
document by the total number of adjectives in the document. Based on that,
the box office collection was predicted using a Support Vector Machine, which
produced an accuracy of 89%.

Nagarjuna et al. [8] used laptop reviews to find out whether a laptop is a
good model or not. Features such as screen resolution, processing speed and
weight were extracted. They addressed anaphora resolution, which is one of
the challenges faced in sentiment analysis, using part-of-speech tagging and
SentiWordNet, and finally used an SVM classifier to reach an accuracy of 88%.

Sentiment analysis has also been done with a neural network model [9], using
Keras and the IMDB dataset. Keras is used to load the dataset in a format
suitable for a neural network. A one-dimensional convolutional neural network
was designed and produced an accuracy of 88.3%.

 

BACKGROUND

The main aim of sentiment analysis is to analyze the data and classify it as
positive, negative or neutral. This takes several steps, the first of which
is to examine and understand the data.

 

1. DATA PREPROCESSING

The data should be cleaned before it is used for training, a step called data
preprocessing. This involves several operations. In our project, we used the
Natural Language Toolkit (NLTK) [10], which has built-in packages for
preprocessing. Word tokenizing splits the text of the dataset into individual
words. Stop word removal drops common function words, such as articles and
prepositions, from the dataset. The remaining steps are converting the data
to lowercase, removing punctuation and finally applying part-of-speech
tagging, which identifies whether a word is a noun, verb, adjective or adverb.
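
The cleaning steps above can be sketched without any corpus downloads as
follows. The tiny stop word list and the whitespace tokenizer are assumptions
made to keep the example self-contained; with NLTK, word_tokenize,
stopwords.words("english") and pos_tag would take their place, and
part-of-speech tagging is left out here.

```python
import re

# Tiny illustrative stop word list (an assumption); NLTK ships a longer one.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "of", "and", "in", "this"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    tokens = text.split()                # naive whitespace tokenizer
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The casting is top notch, and the ending is nice."))
# ['casting', 'top', 'notch', 'ending', 'nice']
```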

 

2. FEATURE EXTRACTION

Each word is considered a feature. With the SentiWordNet [11] package, every
word gets a score ranging from 0 to 1, with negative words having a low score
and positive words having a high score. Based on these scores, the polarity
of a sentence can be determined, and the data can be predicted to be positive
or negative.

 

3. TRAINING THE DATA MODEL

SVM is a supervised classification algorithm, which means the dataset must be
labelled for the algorithm to learn the classification. The SVM implementation
comes from scikit-learn (“sklearn”) [12], a Python library that is separate
from NLTK, and its svm package can be imported from there. The training
dataset must be labelled, and the classifier is trained on it. For the test
dataset, the labels are withheld from the classifier, so that we can measure
how well it classifies data it has not seen.
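
A minimal sketch of this training step, assuming scikit-learn is installed, is
shown below. The four training reviews and their labels are made up for
illustration, and LinearSVC with TF-IDF features is one reasonable choice
rather than the exact configuration used in this report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled training data (made up): 1 = positive, 0 = negative.
train_reviews = [
    "great story and good acting",
    "good casting and nice ending",
    "boring plot and bad acting",
    "bad story and terrible ending",
]
train_labels = [1, 1, 0, 0]

# TF-IDF features feed a linear support vector classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_reviews, train_labels)

# Test reviews are passed without labels; the model predicts them.
test_reviews = ["nice story and great casting", "terrible plot and boring ending"]
predictions = model.predict(test_reviews)
print(predictions)
```

The predicted labels can then be compared with the withheld test labels to
measure how well the classifier generalizes.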

  

 

Diagram 1        Diagram 2

 

PROPOSED METHOD

The proposed method is to compute a score for each sentence from the
extracted features. Based on that score, we can conclude whether the sentence
leans towards the positive or the negative side: if the score is above 0.5
the data is positive, and if it is below 0.5 the data is negative.
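
A rough sketch of this scoring rule follows. The per-word scores are
hypothetical stand-ins for lexicon scores, averaging is an assumed way of
combining them, and a score of exactly 0.5, which the rule above leaves open,
is treated as negative here.

```python
# Hypothetical per-word scores in [0, 1]; real scores would come from a lexicon.
WORD_SCORES = {"good": 0.9, "nice": 0.8, "boring": 0.2, "bad": 0.1}


def sentence_score(tokens, default=0.5):
    """Average the per-word scores; unknown words count as neutral (0.5)."""
    if not tokens:
        return default
    return sum(WORD_SCORES.get(token, default) for token in tokens) / len(tokens)


def classify(score, threshold=0.5):
    """Label a score above the threshold as positive, otherwise negative."""
    return "positive" if score > threshold else "negative"


score = sentence_score(["ending", "nice", "casting", "good"])
print(classify(score))  # positive
```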

1. WORD TOKENIZING

Tokenizing is the process of breaking a sentence into its words, which are
called tokens. This way, the features in the data can be analyzed.

2. STOPWORDS

Stop words are the most common words occurring in the data, such as articles,
prepositions and pronouns. They carry little sentiment and are removed. The
stop word list can be imported from the nltk.corpus package.

 

3. PART OF SPEECH TAGGING

This step tags each word with its part of speech, that is, whether it is a
noun, verb, adjective, adverb and so on. The tags are needed when
SentiWordNet is applied to the words.

4. SENTIWORDNET

SentiWordNet is a sentiment lexicon built on WordNet. In the Natural Language
Toolkit it is available as a corpus that can be imported from nltk.corpus,
once its data has been downloaded. Its synsets give the score of each word:
we tag the word with its part of speech, look up the matching synset, and it
gives us a score.
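
The lookup can be sketched as follows. SentiWordNet gives each synset a
positive, a negative and an objective score that sum to 1. Folding the first
two into a single value between 0 and 1, and taking only the first synset of a
word, are assumptions made for this sketch. The function word_score needs NLTK
with the wordnet and sentiwordnet corpora downloaded, so it is defined but not
called here.

```python
def to_unit_score(pos_score, neg_score):
    """Map SentiWordNet positive/negative scores to one value in [0, 1].

    Assumption: 0.5 is neutral, higher is more positive, lower more negative.
    """
    return 0.5 + (pos_score - neg_score) / 2


def word_score(word, wordnet_pos):
    """Score a word from its first SentiWordNet synset for the given tag.

    wordnet_pos is a WordNet tag: 'n', 'v', 'a' or 'r'.
    """
    from nltk.corpus import sentiwordnet as swn  # imported lazily

    synsets = list(swn.senti_synsets(word, wordnet_pos))
    if not synsets:
        return 0.5  # unknown words are treated as neutral
    first = synsets[0]
    return to_unit_score(first.pos_score(), first.neg_score())


print(to_unit_score(0.75, 0.0))  # 0.875
```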

 

5. ALGORITHM

The first algorithm computes the sentiment score of the data in the dataset.

 

The second algorithm performs the classification.

 

6. TF-IDF VECTORIZATION

Term Frequency – Inverse Document Frequency (TF-IDF) [13] weighs a word by
how often it occurs in a document, offset by how many documents in the
dataset contain it. It is widely used in text mining and information
retrieval. The TF-IDF value of a word increases the more often it appears in
a document and decreases the more documents it appears in. A TF-IDF
vectorizer can be imported from the sklearn.feature_extraction.text package
of scikit-learn. Once the features are extracted, they can be used for
training the classifier. For example, let us consider the following sentences:

It is a windy day today.

It is going to rain today.

In both sentences, the stop words are removed and only the features are kept:
“windy”, “day” and “today” from the first sentence and “going”, “rain” and
“today” from the second. TF-IDF then calculates the term frequency, that is,
the number of times each term occurs, and how relevant the term is. “today”
occurs twice, once in each sentence, so it is less distinctive than “windy”
or “rain” and receives a lower weight.
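
The calculation for the two example sentences can be worked through by hand
with the textbook formula, term frequency times the logarithm of the inverse
document frequency. This is a sketch of the basic definition only:
scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each vector
by default, so its numbers differ. The function name tf_idf is chosen for
illustration.

```python
import math

# The two example sentences after stop word removal.
documents = [
    ["windy", "day", "today"],
    ["going", "rain", "today"],
]


def tf_idf(term, document, corpus):
    """Term frequency times log inverse document frequency.

    The term must occur in at least one document of the corpus.
    """
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df)
    return tf * idf


print(tf_idf("windy", documents[0], documents))  # about 0.231
print(tf_idf("today", documents[0], documents))  # 0.0
```

Because “today” appears in both sentences, its inverse document frequency is
log(2/2) = 0, while “windy” appears in only one and keeps a positive weight.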

 

EXPERIMENT

The data was obtained online, where the reviews of about 600 movies had been
collected. We modified the dataset by converting the text files into one CSV
file and labelled each review 0 or 1, where 0 indicates a negative review and
1 a positive review.

The classifier was evaluated on several measures: precision, recall and
finally accuracy.

All of these can be obtained from the sklearn.metrics package by importing
“classification_report”, which calculates precision and recall and produces
the results.

Precision is the number of true positives divided by the sum of true
positives and false positives.

Recall is the number of true positives divided by the sum of true positives
and false negatives.

Accuracy is the fraction of all reviews that the classifier labels correctly.
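
The three definitions above translate directly into code. The sketch below
computes them from scratch for binary labels. The label lists are made up for
illustration and are not results from this experiment, and the function name
evaluate is hypothetical.

```python
def evaluate(true_labels, predicted_labels):
    """Return precision, recall and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Made-up labels for illustration.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
print(evaluate(true_labels, predicted))  # (0.75, 0.75, 0.75)
```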

The current experiment gave us an accuracy of 87%, so most of the data was
classified correctly.

For example, let us check the sentiment score for the following review of the
movie “3 Idiots”.

“there is novelty in the story. originality
is also there. execution is above average. the ending is nice. but the first
half is just like that. technical values are good. casting is top notch and
production values are good. some changes in the execution might have given a
chance for being a trade mark movie. but anyway its good.”

The sentiment score for this review is 0.9, and therefore we can conclude
that it is a positive review.

 

CONCLUSION

Sentiment analysis is a vast field, and there are new things to learn every
day. It is a sophisticated way to gauge opinion about almost anything
reviewed online. The Natural Language Toolkit is an effective tool for
sentiment analysis, and the results we obtained with NLTK were accurate. Even
a human can sometimes make a mistake in classifying the data, but once the
model is trained with the right parameters, most of the data is classified
correctly, and an accuracy of 95-98% may be achievable. It all depends on how
well the model is trained.

Machine learning is an improving field that develops on a daily basis. It can
be applied to many domains such as finance, education and entertainment.

The Support Vector Machine is one of the best classification algorithms
because it can classify data very accurately. It finds a hyperplane that
separates the data into two classes. Other algorithms, such as Naïve Bayes
and Random Forest, can be used for the same purpose. Among them, SVM produces
a high accuracy when trained properly.

Overall, opinion mining is one of the improving fields in technology, and a
lot of research is currently going on in it. When machine learning and
opinion mining are combined, predictions can be made.

REFERENCES

[1] Lexalytics. [Online]. Available: https://www.lexalytics.com/technology/sentiment

[2] Matt Kiser. Introduction to Natural Language Processing (NLP). 2016. [Online]. Available: http://blog.algorithmia.com/introduction-natural-language-processing-nlp/

[3] Fabio Benedetti. [Online]. Available: https://www.slideshare.net/faigg/tutotial-of-sentiment-analysis

[4] Support Vector Machine. [Online]. Available: https://en.wikipedia.org/wiki/Support_vector_machine

[5] Tirath Prasad Sahu and Sanjeev Ahuja, “Sentiment analysis of movie reviews: A study on feature selection & classification algorithms,” International Conference on Microelectronics, Computing and Communications (MicroCom), 2016.

[6] Pallavi Sharma and Nidhi Mishra, “Feature level sentiment analysis on movie reviews,” 2nd International Conference on Next Generation Computing Technologies (NGCT), 2016.

[7] P. Nagamma, H. R. Pruthvi, K. K. Nisha and N. H. Shwetha, “An Improved Sentiment Analysis Of Online Movie Reviews Based On Clustering For Box-Office Prediction,” International Conference on Computing, Communication & Automation, 2015.

[8] D. V. Nagarjuna Devi, Chinta Kishore Kumar and Siriki Prasad, “A Feature Based Approach for Sentiment Analysis by Using Support Vector Machine,” IEEE 6th International Conference on Advanced Computing, 2016.

[9] Jason Brownlee. Predict Sentiment From Movie Reviews Using Deep Learning. 2016. [Online]. Available: http://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/

[10] Natural Language Toolkit. [Online]. Available: http://www.nltk.org/

[11] Source Code for nltk.corpus.reader.sentiwordnet. [Online]. Available: http://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html

[12] Support Vector Machines. [Online]. Available: http://scikit-learn.org/stable/modules/svm.html

[13] Machine Learning :: Text feature extraction (tf-idf) – Part I. [Online].