Sentiment Analysis: An Intro
Posted on August 15, 2016 in misc
Sentiment Analysis:
The whole idea of text mining is to gain insights from textual data. Sentiment analysis is one of the important applications in the area of text mining. It tries to identify whether the opinion expressed in a text is positive, negative, or neutral towards a given topic.
For example,
- I am happy about my promotion
- I feel sad ever since I heard the news
Both of the above sentences express an opinion about something. In the first sentence the writer is "happy" about the promotion. In the second sentence the "news" made the writer "sad". Happy is a positive reaction and sad is a negative reaction. This is how we observe opinions in text. Neutral sentences, on the other hand, express neither a positive nor a negative opinion. The sentences below are examples of neutral sentences:
- Terry is back from work
- I saw Captain America part 1 and 2.
A Simple Use Case:
Suppose a new movie has been released and a firm wants to analyse the viewers' opinions about the movie. One obvious place where people express their opinions is the web. There are numerous websites where people express their sentiments about movies, such as IMDB.com, Rotten Tomatoes, etc. However, the number of comments or reviews that these sites receive is huge, and it is extremely hard to go through each review manually to analyse how reviewers feel about a movie.
NOTE: Of course, some sites also offer ratings on a scale of five or ten alongside the comments. However, a good data analysis would often consider both the reviewers' comments and their ratings to analyse the reviewers' opinions.
Opinion mining is not limited to reviews alone. It can be used on any corpus, such as blogs, tweets, books, comments, etc. There are numerous methods to extract opinion from text. We are going to discuss one of the simplest approaches to sentiment analysis.
Lexicon-Based Approach:
A lexicon is nothing but a dictionary. The lexicon-based approach is the simplest baseline approach for sentiment analysis of a corpus (a corpus is any body of text, such as blogs, documents, books, tweets, etc.). In this approach, as the name implies, we have a dictionary of words, and each word has a predefined score which we call the polarity of the word. We use the polarities of the individual words in a document or sentence to determine whether it expresses a positive, negative, or neutral opinion.
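To make the idea concrete, here is a toy sketch; the miniature word list and scores below are made up for illustration and do not come from any real lexicon:
# a made-up miniature lexicon: word -> polarity score
toy_lexicon = {"happy": 0.8, "good": 0.6, "sad": -0.7, "terrible": -0.9}

sentence = ["i", "am", "happy"]
# sum the scores of the words we know; unknown words count as 0 (neutral)
score = sum(toy_lexicon.get(word, 0.0) for word in sentence)
print "polarity: ", score  # > 0 positive, < 0 negative, 0 neutral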
Luckily for us, there are some publicly available lexicons at our disposal. "SentiWordNet" is one such lexicon specifically meant for opinion mining. And to make things even easier, Python's Natural Language Toolkit (NLTK) offers functionality to use the SentiWordNet lexicon.
You can install nltk using pip:
pip install nltk
You can import the nltk package in python as shown below:
import nltk
Python's NLTK is a very big library: apart from its own functionality, it has a downloader that gives you access to many corpora, datasets, etc. We will use the SentiWordNet lexicon, which is a dictionary of words with their polarity values. You can download the lexicon using NLTK's downloader:
# download sentiwordnet lexicon and import it
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
How to Obtain the Polarities of a Word:
print "pos score: ", swn.senti_synsests("happy", 'a')[0].pos_score()
print "neg score: ", swn.senti_synsets("happy", 'a')[0].neg_score()
Neutral Score:
print "neutral score: ", swn.senti_synsets("happy", 'a')[0].obj_score()
Sentence-Level Polarities:
Now that we know how to obtain the polarity of a word, we can apply the same at the sentence level. A sentence is usually a group of words that conveys some meaning and has a subject and a verb associated with the subject.
The idea is to calculate the polarities of the individual words in a sentence and then compound them to determine the polarity of the entire sentence.
For example, consider the positive sentence below:
- "I am happy."
You can simply tell by looking at the above sentence that it is positive in nature. But the same can be derived computationally using the SentiWordNet lexicon. Let's see the polarity values for each word in the above sentence:
print swn.senti_synsets("i", 'n')[0]
print swn.senti_synsets("am", 'v')[0]
print swn.senti_synsets("happy","a")[0]
If you observe the printed scores above (0.0 for "i", 0.25 for "am", and 0.875 for "happy"), the combined positive score is:
0.0 + 0.25 + 0.875 = 1.125
The combined negative score is:
0.0 + 0.125 + 0.0 = 0.125
Since the positive score is greater than the negative score, we conclude that the sentence expresses a positive opinion. If the negative score were greater than the positive score, the conclusion would be that the polarity of the sentence is negative.
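A minimal sketch of this compounding step might look like the following; the (word, tag) pairs are hard-coded here, and we will see how to produce them automatically in the next two sections:
# (word, SentiWordNet POS tag) pairs for "i am happy"
words = [("i", 'n'), ("am", 'v'), ("happy", 'a')]

pos_total = 0.0
neg_total = 0.0
for word, tag in words:
    synset = swn.senti_synsets(word, tag)[0]  # first sense of the word
    pos_total += synset.pos_score()
    neg_total += synset.neg_score()

print "positive: ", pos_total, " negative: ", neg_total
if pos_total > neg_total:
    print "the sentence is positive"
elif neg_total > pos_total:
    print "the sentence is negative"
else:
    print "the sentence is neutral"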
Tokenization:
Tokenization is the process of generating a list of the words present in a sentence, i.e. the sentence "i am happy" is converted into the list ["i", "am", "happy"]. You can use NLTK's word_tokenize function:
tokens = nltk.tokenize.word_tokenize("i am happy") # returns a list of the words in the sentence.
print "tokens: ", tokens
POS Tagging:
Now we have to determine the part of speech for each word in the sentence. We need the part of speech of a word to obtain its polarity.
In the previous example sentence "I am happy", each word is associated with a part of speech: "I" is a noun, "am" is a verb, and "happy" is an adjective.
We need to specify the POS (part of speech) of the word in the function swn.senti_synsets to obtain the polarity scores for that word. Let's see how we can obtain the parts of speech using NLTK:
from nltk.tag import pos_tag
pos_tag(tokens)  # you may need to run nltk.download('averaged_perceptron_tagger') once
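With our example tokens, the tagger returns something like:
[('i', 'NN'), ('am', 'VBP'), ('happy', 'JJ')]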
In the output above:
- NN - noun
- VBP - verb
- JJ - adjective
In sentiwordnet:
- Nouns are tagged as 'n'.
- Verbs are tagged as 'v'.
- Adjectives are tagged as 'a'.
This is exactly what we did before to obtain the polarity of each word in the sentence:
print swn.senti_synsets("i", 'n')[0]
print swn.senti_synsets("am", 'v')[0]
print swn.senti_synsets("happy","a")[0]
NOTE: we are indexing with [0] because senti_synsets returns a list of senses (synsets) of the word. Each sense has its own scores; here we simply take the first, most common sense.
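To automate this step, we can map the Penn Treebank tags that pos_tag returns onto the single-letter tags SentiWordNet expects. Below is a minimal sketch; the helper name penn_to_swn is our own, and it only covers the three tag families we care about here:
def penn_to_swn(penn_tag):
    # map Penn Treebank tags to SentiWordNet's single-letter tags
    if penn_tag.startswith('NN'):
        return 'n'  # nouns
    if penn_tag.startswith('VB'):
        return 'v'  # verbs
    if penn_tag.startswith('JJ'):
        return 'a'  # adjectives
    return None     # anything else: skip the word

for word, tag in pos_tag(tokens):
    print word, penn_to_swn(tag)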
Stopwords:
When calculating polarities for the words in a sentence, we need not consider all of the words. Some words do not carry any weight and can be ignored: they have no effect on the polarity of a sentence. Such words are called stopwords. We can simply ignore the stopwords in a sentence and calculate the polarity of the remaining words. NLTK comes with a list of stopwords.
from nltk.corpus import stopwords  # import stopwords; you may need nltk.download('stopwords') once
stop = stopwords.words('english')  # initialize English stopwords
new_sentence = []
for word in tokens:
    if word not in stop:
        new_sentence.append(word)
print "The sentence has been reduced from :", tokens, " : to : ", new_sentence
If we remove the stopwords, we just have to calculate the polarity for the word "happy" alone in our example. And for bigger, more complex sentences you would be calculating the polarity for fewer words.
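Putting all of the pieces together, here is a minimal sketch of a sentence-level polarity function; sentence_polarity is our own hypothetical helper, not a function from NLTK, and it reuses the stop list and the penn_to_swn helper sketched above:
def sentence_polarity(sentence):
    # tokenize, drop stopwords, POS-tag, and compound the word polarities
    tokens = nltk.tokenize.word_tokenize(sentence)
    tokens = [word for word in tokens if word not in stop]
    pos_total = 0.0
    neg_total = 0.0
    for word, tag in pos_tag(tokens):
        swn_tag = penn_to_swn(tag)
        if swn_tag is None:
            continue  # a part of speech we do not handle
        synsets = swn.senti_synsets(word, swn_tag)
        if not synsets:
            continue  # word not in the lexicon: treat its score as zero
        pos_total += synsets[0].pos_score()
        neg_total += synsets[0].neg_score()
    return pos_total - neg_total  # > 0 positive, < 0 negative, 0 neutral

print "polarity of 'i am happy': ", sentence_polarity("i am happy")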
Things to Do:
In this chapter you have seen how to calculate the polarity of a sentence. Now write a program to calculate the polarity of all the tweets that you extracted and preprocessed in the previous chapters. Your program should also include the features below:
a. Tweets have hashtags. Remove the hashtags and then find the polarity of each tweet.
b. There might be words that are not present in the SentiWordNet lexicon. The program should handle these cases by giving a zero score to such words.
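As a starting point for feature (a), hashtags can be stripped with a regular expression before scoring; the pattern below is a simple assumption and real tweets may need more cleanup. Feature (b) is the same empty-result check used in the sentence_polarity sketch above:
import re

tweet = "loved the movie #mustwatch #weekend"  # a made-up example tweet
cleaned = re.sub(r'#\w+', '', tweet)  # drop hashtag tokens
print "polarity of the tweet: ", sentence_polarity(cleaned)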