NLTK also offers a few built-in classifiers that are suitable for various types of analyses, including sentiment analysis. The trick is to figure out which properties of our dataset are useful for classifying each piece of data into our desired categories. Let's start with a warm-up.
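As a quick reminder of the interface, NLTK's NaiveBayesClassifier trains on (feature dictionary, label) pairs, where each feature is typically a word-presence flag. Here is a minimal sketch with made-up data, purely to show the expected format:

import nltk

# Toy training pairs: each is (feature dict, label). The words and labels
# here are invented only to illustrate the input format.
toy_train = [
    ({"great": True, "awful": False}, True),
    ({"great": False, "awful": True}, False),
]
toy_clf = nltk.NaiveBayesClassifier.train(toy_train)
print(toy_clf.classify({"great": True, "awful": False}))  # True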
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from random import shuffle
import random
from nltk.tokenize import word_tokenize
import numpy as np
nltk.download(["stopwords", "vader_lexicon", "punkt"])
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yanyuchen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/yanyuchen/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/yanyuchen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
url = 'https://raw.githubusercontent.com/yanyuchen/sentiment-analysis/main/data/all_reviews.csv'
df = pd.read_csv(url)[['review', 'rating']].dropna()
df = df.reset_index(drop=True)
df.head()
| | review | rating |
| --- | --- | --- |
| 0 | My family took the tour ( BUY TICKETS IN ADVAN... | 5.0 |
| 1 | This is a must stop if you are in San Fran!!! ... | 5.0 |
| 2 | I did not expect to enjoy the tour as much as ... | 5.0 |
| 3 | San Francisco is completely unsafe. We bought ... | 1.0 |
| 4 | I had a 13-hour layover in San Francisco And I... | 4.0 |
Common words, called stop words, may have a negative effect on our analysis because they occur so often in the text, so we drop them before fitting our models.
stopwords = set(nltk.corpus.stopwords.words("english"))
stopwords.update(nltk.corpus.stopwords.words("spanish"))
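For instance, filtering a short made-up sentence against this stop-word list keeps only the content words:

sample = "The tour was amazing and the guide was very friendly"
tokens = [w for w in word_tokenize(sample) if w.lower() not in stopwords]
print(tokens)  # ['tour', 'amazing', 'guide', 'friendly']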
act = df.rating > 3  # label: True for ratings above 3 (positive), False otherwise
features = [(df.review[i], act[i]) for i in range(len(act))]
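Each entry in features pairs a raw review string with a boolean label, where True marks a positive review (rating above 3). For example:

text, label = features[0]
print(label)     # True, since the first review has a 5.0 rating
print(text[:50]) # the first 50 characters of the review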
Training the classifier involves randomly splitting the feature set so that one portion can be used for training and the other for evaluation. For reproducibility, we fix our random seed.
random.seed(220)
We split each review string into individual words and fit our model on the words that are not in the stop-word list.
train_count = round(len(features) * 0.8)
shuffle(features)
train = features[:train_count]
all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]) if word.lower() not in stopwords)
# t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
# Faster equivalent of the line above: initialize every vocabulary word to
# False, then mark the words that actually appear in the review as True.
t = [({word: False for word in all_words}
      | {word: True for word in word_tokenize(passage[0]) if word.lower() not in stopwords},
      label) for passage, label in train]
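Note that the dict-union operator | used above requires Python 3.9 or newer. On older interpreters, an equivalent construction (a sketch of the same logic) is:

t = [({**{word: False for word in all_words},
       **{word: True for word in word_tokenize(passage[0]) if word.lower() not in stopwords}},
      label) for passage, label in train]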
classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features(15)
Most Informative Features
                       9 = True            False : True   =     11.0 : 1.0
                       $ = True            False : True   =      6.6 : 1.0
                       4 = True            False : True   =      6.6 : 1.0
                       8 = True            False : True   =      6.6 : 1.0
                      `` = True            False : True   =      6.6 : 1.0
                       Н = True            False : True   =      6.6 : 1.0
                       ア = True            False : True   =      6.6 : 1.0
                       子 = True            False : True   =      6.6 : 1.0
                       전 = True            False : True   =      6.6 : 1.0
                       È = True            False : True   =      4.7 : 1.0
                       c = True            False : True   =      4.0 : 1.0
                       Е = True            False : True   =      4.0 : 1.0
                       П = True            False : True   =      4.0 : 1.0
                       샌 = True            False : True   =      3.7 : 1.0
                       v = True            False : True   =      3.0 : 1.0
nltk.classify.accuracy(classifier, t)
0.8696296296296296
The result shows that our fitted model performs better than the previous naive approach on the training set. However, it is not obvious why these particular tokens end up as the most informative features. Next, let's see the model's performance on the testing set.
test = features[train_count:]
s = [{word: False for word in all_words} | {word: True for word in word_tokenize(passage[0]) if word.lower() not in stopwords}
for passage, _ in test]
act = [label for _, label in test]  # true labels of the testing set
pred = classifier.classify_many(s)
def accuracy(pred, act):
    return sum([pred[i] == act[i] for i in range(len(act))]) / len(act)
accuracy(pred, act)
0.875
The performance on the testing set is similar to that on the training set, which indicates that no overfitting occurred and suggests that the model's ability to generalize also outperforms the naive approach. Before we end this section, let's make predictions on the whole dataset.
review = [review for review in df.review] # preserve the order
all_w = [{word: False for word in all_words} | {word: True for word in word_tokenize(passage) if word.lower() not in stopwords}
for passage in review]
pred_all = classifier.classify_many(all_w)
pd.DataFrame(pred_all).to_csv('nltk.NaiveBayesClassifier_pred.csv', index=False)
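Before moving on to scikit-learn, note that the trained classifier can also score a single unseen review. The helper below (make_features, a name introduced here for illustration) mirrors the feature construction used for training, and the review text is made up:

def make_features(text):
    # Start with every vocabulary word set to False, then mark the words
    # that appear in the text as True (the same scheme used for training).
    present = {word: True for word in word_tokenize(text) if word.lower() not in stopwords}
    return {word: False for word in all_words} | present

print(classifier.classify(make_features("The tour was wonderful and worth every penny")))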
NLTK provides a class that can use most classifiers from the popular machine learning framework, scikit-learn. Many of the classifiers that scikit-learn provides can be instantiated quickly since they have defaults that often work well. Since NLTK allows us to integrate scikit-learn classifiers directly into its own classifier class, the training and classification processes will use the same methods we’ve already seen.
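As a minimal sketch of the wrapper pattern, using the feature sets t built earlier:

from sklearn.naive_bayes import BernoulliNB
from nltk.classify import SklearnClassifier

wrapped = SklearnClassifier(BernoulliNB())
wrapped.train(t)  # same train/classify interface as nltk.NaiveBayesClassifier
print(nltk.classify.accuracy(wrapped, t))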
from sklearn.naive_bayes import (BernoulliNB, ComplementNB, MultinomialNB,)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
classifiers = {
"BernoulliNB": BernoulliNB(),
"ComplementNB": ComplementNB(),
"MultinomialNB": MultinomialNB(),
"LogisticRegression": LogisticRegression(),
"DecisionTreeClassifier": DecisionTreeClassifier(),
"RandomForestClassifier": RandomForestClassifier(),
}
accuracy_list = []
pred_list = []
for name, sklearn_classifier in classifiers.items():
    classifier = nltk.classify.SklearnClassifier(sklearn_classifier)
    classifier.train(t)
    # training set
    accur = nltk.classify.accuracy(classifier, t)
    accuracy_list.append(accur)
    # testing set
    pred = classifier.classify_many(s)
    pred_list.append(accuracy(pred, act))
    # whole dataset
    pred_all = classifier.classify_many(all_w)
    pd.DataFrame(pred_all).to_csv(f'{name}_pred.csv', index=False)
    print(f"{name} finish")
BernoulliNB finish
ComplementNB finish
MultinomialNB finish
LogisticRegression finish
DecisionTreeClassifier finish
RandomForestClassifier finish
pd.DataFrame(accuracy_list, columns = ['accuracy'], index = list(classifiers.keys())).T
| | BernoulliNB | ComplementNB | MultinomialNB | LogisticRegression | DecisionTreeClassifier | RandomForestClassifier |
| --- | --- | --- | --- | --- | --- | --- |
| accuracy | 0.868418 | 0.468687 | 0.868418 | 0.868418 | 0.86963 | 0.86963 |
The result shows that all scikit-learn classifiers perform similarly except for ComplementNB, and their accuracy is comparable to that of NLTK's NaiveBayesClassifier.
pd.DataFrame(pred_list, columns = ['accuracy'], index = list(classifiers.keys())).T
| | BernoulliNB | ComplementNB | MultinomialNB | LogisticRegression | DecisionTreeClassifier | RandomForestClassifier |
| --- | --- | --- | --- | --- | --- | --- |
| accuracy | 0.875 | 0.473599 | 0.875 | 0.875 | 0.873922 | 0.875539 |
The performance on the testing set is also similar to that on the training set, indicating that no overfitting occurred and suggesting that, with the exception of ComplementNB, these classifiers generalize as well as NLTK's NaiveBayesClassifier.
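As a side note, the two accuracy lists can be combined into a single frame for an at-a-glance comparison of training and testing performance (a small sketch):

summary = pd.DataFrame(
    {"train": accuracy_list, "test": pred_list},
    index=list(classifiers.keys()),
)
print(summary)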