Using NLTK’s Built-in Classifiers for Sentiment Analysis¶

NLTK also offers a few built-in classifiers that are suitable for various types of analyses, including sentiment analysis. The trick is to figure out which properties of our dataset are useful in classifying each piece of data into our desired categories. Let's start with a warm-up.

Loading Libraries and Data¶

In [1]:
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from random import shuffle
import random
from nltk.tokenize import word_tokenize 
import numpy as np

nltk.download(["stopwords", "vader_lexicon", "punkt"])
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yanyuchen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/yanyuchen/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/yanyuchen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[1]:
True
In [2]:
url = 'https://raw.githubusercontent.com/yanyuchen/sentiment-analysis/main/data/all_reviews.csv'
df = pd.read_csv(url)[['review', 'rating']].dropna()
df = df.reset_index(drop=True)
df.head()
Out[2]:
review rating
0 My family took the tour ( BUY TICKETS IN ADVAN... 5.0
1 This is a must stop if you are in San Fran!!! ... 5.0
2 I did not expect to enjoy the tour as much as ... 5.0
3 San Francisco is completely unsafe. We bought ... 1.0
4 I had a 13-hour layover in San Francisco And I... 4.0

Training and Using a Classifier¶

Common words, known in NLP as stop words, can have a negative effect on our analysis because they occur so frequently in the text while carrying little information about sentiment. We drop them before fitting our models.

In [3]:
stopwords = set(nltk.corpus.stopwords.words("english"))
stopwords.update(nltk.corpus.stopwords.words("spanish")) 
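
As a quick illustration (not part of the original cells) of what stop-word filtering does, using the stopwords set built above and word_tokenize imported earlier; the sample sentence is made up:

sample = "The tour was great and the guide was very friendly"
print([w for w in word_tokenize(sample) if w.lower() not in stopwords])
# the content words survive, e.g. ['tour', 'great', 'guide', 'friendly']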
In [4]:
act = df.rating > 3
features = [(df.review[i], act[i]) for i in range(len(act))]
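
Since the target is binary (rating above 3 or not), it helps to know the class balance: a majority-class guesser sets the baseline any classifier has to beat. A quick check (a sketch, not run in the original notebook):

(df.rating > 3).value_counts(normalize=True)  # share of positive vs. negative reviews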

Training the classifier involves randomly splitting the feature set so that one portion can be used for training and the other for evaluation. For reproducibility, we fix the random seed.

In [5]:
random.seed(220)
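
For reference, the same 80/20 split could also be done with scikit-learn's train_test_split and a fixed random_state (a sketch, not used in this notebook):

from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(features, test_size=0.2, random_state=220)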

We tokenize each review string into words and keep only the words that are not in the stop-word list when building the feature dictionaries for our model.

In [6]:
train_count = round(len(features) * 0.8)
shuffle(features)
train = features[:train_count]
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]) if word.lower() not in stopwords)
#t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
# speed up:
t = [({word: False for word in all_words} | {word: True for word in word_tokenize(passage[0]) if word.lower() not in stopwords},
      label) for passage, label in train]
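
For intuition, this is the kind of word-presence mapping that nltk.NaiveBayesClassifier consumes: one dictionary per document, with every vocabulary word mapped to True or False. A toy illustration with a made-up vocabulary, not taken from the data:

toy_vocab = {"great", "tour", "unsafe", "guide"}
toy_review = "Great tour and a friendly guide"
toy_tokens = {w.lower() for w in word_tokenize(toy_review)}
toy_features = {w: (w in toy_tokens) for w in toy_vocab}
# 'great', 'tour', and 'guide' map to True; 'unsafe' maps to False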
In [10]:
classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features(15)
Most Informative Features
                       9 = True            False : True   =     11.0 : 1.0
                       $ = True            False : True   =      6.6 : 1.0
                       4 = True            False : True   =      6.6 : 1.0
                       8 = True            False : True   =      6.6 : 1.0
                      `` = True            False : True   =      6.6 : 1.0
                       Н = True            False : True   =      6.6 : 1.0
                       ア = True            False : True   =      6.6 : 1.0
                       子 = True            False : True   =      6.6 : 1.0
                       전 = True            False : True   =      6.6 : 1.0
                       È = True            False : True   =      4.7 : 1.0
                       c = True            False : True   =      4.0 : 1.0
                       Е = True            False : True   =      4.0 : 1.0
                       П = True            False : True   =      4.0 : 1.0
                       샌 = True            False : True   =      3.7 : 1.0
                       v = True            False : True   =      3.0 : 1.0
In [11]:
nltk.classify.accuracy(classifier, t)
Out[11]:
0.8696296296296296

The result shows that our fitted model performs better on the training set than the previous naive approach. However, it is not obvious why these particular tokens, mostly single digits, punctuation, and non-Latin characters, end up as the most informative features.
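A likely explanation, worth flagging: in the cell that builds t, the list comprehension already unpacks each (review, label) pair, so passage is the review string and passage[0] is just its first character. word_tokenize(passage[0]) therefore marks only that single character as present, which matches the digits, punctuation, and non-Latin characters dominating the list above. A sketch of the corrected feature construction (not re-run here, so the outputs shown in this notebook still reflect the original cell; the same passage[0] pattern appears in the test-set cell below):

t_fixed = [({word: False for word in all_words}
            | {word: True for word in word_tokenize(passage) if word.lower() not in stopwords},
            label)
           for passage, label in train]

With that caveat noted, let's see how the model performs on the testing set.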

Evaluating on the Testing Set¶

In [20]:
test = features[train_count:]
s = [{word: False for word in all_words} | {word: True for word in word_tokenize(passage[0]) if word.lower() not in stopwords}
     for passage, _ in test]
act = [label for _, label in test]  # actual labels for the test split
In [13]:
pred = classifier.classify_many(s)
In [8]:
def accuracy(pred, act):
    """Return the fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(pred, act)) / len(act)
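
For reference, scikit-learn's accuracy_score computes the same quantity; a one-line equivalent (not used in this notebook):

from sklearn.metrics import accuracy_score
accuracy_score(act, pred)  # fraction of positions where pred matches act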
In [22]:
accuracy(pred, act)
Out[22]:
0.875

The performance on the testing set is similar to that on the training set, which indicates the model is not overfitting and suggests that it also generalizes better than the naive approach. Before we end this section, let's make predictions on the whole dataset.

In [9]:
review = df.review.tolist()  # keep reviews in the original row order so predictions line up with df
all_w = [{word: False for word in all_words} | {word: True for word in word_tokenize(passage) if word.lower() not in stopwords}
     for passage in review]
In [62]:
pred_all = classifier.classify_many(all_w)
pd.DataFrame(pred_all).to_csv('nltk.NaiveBayesClassifier_pred.csv', index=False)
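
As a quick sanity check (a sketch, not executed in the original run), the full-dataset predictions could be scored against the actual labels with the helper defined above:

accuracy(pred_all, list(df.rating > 3))  # fraction of reviews labeled consistently with their ratings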

Comparing Additional Classifiers with Scikit-learn¶

NLTK provides a wrapper class, SklearnClassifier, that can use most classifiers from the popular machine learning framework scikit-learn. Many of the classifiers scikit-learn provides can be instantiated quickly because their defaults often work well. Since NLTK lets us plug scikit-learn classifiers directly into its own classifier interface, training and classification use the same methods we have already seen.
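
This is the wrapping pattern the loop below follows, shown for a single estimator (a minimal sketch using the training features t and test features s built earlier):

from sklearn.naive_bayes import BernoulliNB

wrapped = nltk.classify.SklearnClassifier(BernoulliNB())
wrapped.train(t)                  # same feature-dict training data as before
preds = wrapped.classify_many(s)  # same prediction interface as NLTK's own classifiers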

In [24]:
from sklearn.naive_bayes import (BernoulliNB, ComplementNB, MultinomialNB,)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
In [25]:
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "ComplementNB": ComplementNB(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
}
In [27]:
accuracy_list = []
pred_list = []

for name, sklearn_classifier in classifiers.items():
    classifier = nltk.classify.SklearnClassifier(sklearn_classifier)
    classifier.train(t)
    
    # training set
    accur = nltk.classify.accuracy(classifier, t)
    accuracy_list.append(accur)
    
    # testing set
    pred = classifier.classify_many(s)
    pred_list.append(accuracy(pred, act))
    
    # whole dataset
    pred_all = classifier.classify_many(all_w)
    pd.DataFrame(pred_all).to_csv(F'{name}_pred.csv', index=False)
    
    print(F"{name} finish")
BernoulliNB finish
ComplementNB finish
MultinomialNB finish
LogisticRegression finish
DecisionTreeClassifier finish
RandomForestClassifier finish
In [36]:
pd.DataFrame(accuracy_list, columns = ['accuracy'], index = list(classifiers.keys())).T
Out[36]:
BernoulliNB ComplementNB MultinomialNB LogisticRegression DecisionTreeClassifier RandomForestClassifier
accuracy 0.868418 0.468687 0.868418 0.868418 0.86963 0.86963

The results show that all the scikit-learn classifiers reach similar training accuracy except for ComplementNB, and that accuracy is comparable to NLTK's NaiveBayesClassifier.

Evaluating on the Testing Set¶

In [37]:
pd.DataFrame(pred_list, columns = ['accuracy'], index = list(classifiers.keys())).T
Out[37]:
BernoulliNB ComplementNB MultinomialNB LogisticRegression DecisionTreeClassifier RandomForestClassifier
accuracy 0.875 0.473599 0.875 0.875 0.873922 0.875539

The performance on the testing set is again similar to that on the training set, indicating no overfitting and suggesting that, with the exception of ComplementNB, these classifiers generalize as well as NLTK's NaiveBayesClassifier.