Our study explored user sentiment toward various locations by analyzing reviews from three popular review sites: TripAdvisor, Google Maps, and Yelp. We collected the data through a combination of web scraping and API access, then applied natural language processing (NLP) techniques to normalize the review text and compute word frequencies.
We began by collecting reviews of different locations from the three sites, which gave us a diverse dataset spanning both positive and negative reviews. We then used NLP techniques to preprocess the text and extract useful information from it.
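The preprocessing step can be illustrated with a minimal sketch. It assumes a simple pipeline of lowercasing, tokenizing on letters, and dropping stop words; the stop-word list and the function names `normalize` and `word_frequencies` are hypothetical and are not taken from the study's actual code.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real pipeline would
# likely use a fuller list (for example, one shipped with an NLP library).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "it", "of", "to", "in", "for", "with"}


def normalize(review):
    """Lowercase a review, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def word_frequencies(reviews):
    """Count normalized tokens across a collection of reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(normalize(review))
    return counts


if __name__ == "__main__":
    sample = ["Great place, and the museum was GREAT!", "The place is too crowded."]
    print(word_frequencies(sample).most_common(3))
```

Keeping normalization separate from counting makes it easy to swap in a different tokenizer or stop-word list without touching the frequency logic.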
One of the first things we observed was that certain terms appeared frequently at every rating level, including "place," "great," "museum," and "San Francisco." This suggests that users tend to describe the general characteristics of a location regardless of whether their experience was positive or negative.
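One way to find terms that are frequent at every rating level is to count words per rating and intersect the results. The sketch below assumes reviews arrive as (rating, text) pairs and treats a word as "frequent" when it occurs at least `min_count` times within a rating level; both the data layout and the threshold are illustrative assumptions, not the study's method.

```python
from collections import Counter


def words_common_to_all_ratings(rated_reviews, min_count=2):
    """Return words that occur at least min_count times in every rating level."""
    counts_by_rating = {}
    for rating, text in rated_reviews:
        counts_by_rating.setdefault(rating, Counter()).update(text.lower().split())
    if not counts_by_rating:
        return set()
    # One set of frequent words per rating level, then keep only the overlap.
    frequent_sets = [
        {word for word, count in counter.items() if count >= min_count}
        for counter in counts_by_rating.values()
    ]
    return set.intersection(*frequent_sets)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from the study's dataset.
    sample = [
        (5, "great museum great place"),
        (5, "lovely place"),
        (3, "nice place"),
        (3, "place okay"),
        (1, "crowded place"),
        (1, "place too noisy for kids"),
    ]
    print(words_common_to_all_ratings(sample))
```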
However, we also noticed that reviews with lower ratings tended to mention negative aspects related to visiting with children. This suggests that some locations may not be suitable for families with kids or may not offer sufficient facilities for them. This information can be valuable for businesses in the tourism industry to improve their services and cater to the needs of families.
To perform sentiment analysis on our dataset, we compared several classifiers: NLTK's VADER and NaiveBayesClassifier, and scikit-learn's BernoulliNB, ComplementNB, MultinomialNB, DecisionTreeClassifier, LogisticRegression, and RandomForestClassifier. We evaluated each classifier's accuracy, compatibility with other classifiers, and time efficiency to determine the best option for our dataset.
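A comparison of this kind might look like the following sketch for the scikit-learn classifiers. It assumes bag-of-words features from `CountVectorizer` and measures accuracy and wall-clock time for fitting plus scoring; the function name `compare_classifiers`, the hyperparameters, and the toy data are hypothetical. The NLTK classifiers and the "compatibility" criterion are not covered here.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(train_texts, train_labels, test_texts, test_labels):
    """Return {classifier name: (accuracy, seconds to fit and score)}."""
    # Assumption: simple word-count features; the study's features may differ.
    vectorizer = CountVectorizer()
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)

    candidates = {
        "BernoulliNB": BernoulliNB(),
        "ComplementNB": ComplementNB(),
        "MultinomialNB": MultinomialNB(),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    }

    results = {}
    for name, classifier in candidates.items():
        start = time.perf_counter()
        classifier.fit(x_train, train_labels)
        accuracy = classifier.score(x_test, test_labels)
        results[name] = (accuracy, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    # Hypothetical toy data; labels are "pos" or "neg".
    train_texts = ["great place", "lovely museum", "terrible service", "too crowded and noisy"]
    train_labels = ["pos", "pos", "neg", "neg"]
    test_texts = ["great museum", "terrible and noisy"]
    test_labels = ["pos", "neg"]
    for name, (accuracy, seconds) in compare_classifiers(
        train_texts, train_labels, test_texts, test_labels
    ).items():
        print(f"{name}: accuracy={accuracy:.2f}, time={seconds:.4f}s")
```

On a dataset this small the timings are not meaningful; the point is only that every classifier is trained and scored on identical features so the numbers are comparable.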
Our results showed that all scikit-learn classifiers performed similarly in terms of accuracy. However, DecisionTreeClassifier was the best option overall due to its compatibility with other classifiers and its time efficiency. This classifier is also easy to interpret and can be useful for predicting the sentiment of reviews in real-time applications.
Overall, our study highlights the value of analyzing reviews from multiple sources and applying NLP techniques for sentiment analysis. The text of a review reflects, to some degree, the rating that accompanies it. These insights can help businesses in the tourism industry better understand their customers, make more informed decisions, and improve their services accordingly.