Since we use three data sources, there is a need to do data wrangling to combine the information of three review data together for our analysis. In this section, we are going to obtain a merged dataset that contains all the information of the reviews and rating across three different platforms.
import pandas as pd
import numpy as np
def load_data(filename, platform):
data = pd.read_csv(f'/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/{filename}')
data['platform'] = [platform]*(data.shape[0])
return data
# load in location information data
tplaces = pd.read_csv('/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/SF_places.csv')
gplaces = pd.read_csv('/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/new_google_places.csv')
yplaces = pd.read_csv('/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/yelp_business.csv')
# load in reviews data
treviews = load_data('tripadvisor_reviews.csv', platform='TripAdvisor')
greviews = load_data('new_google_reviews.csv', platform='Google')
yreviews = load_data('yelp_reviews.csv', platform='Yelp')
print(treviews.info())
print()
print(greviews.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7500 entries, 0 to 7499 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 attraction 7500 non-null object 1 username 7500 non-null object 2 city 7500 non-null object 3 country 7500 non-null object 4 contribution 7500 non-null object 5 title 7500 non-null object 6 month 7500 non-null object 7 year 7500 non-null object 8 review 7416 non-null object 9 rating 7500 non-null float64 10 platform 7500 non-null object dtypes: float64(1), object(10) memory usage: 644.7+ KB None <class 'pandas.core.frame.DataFrame'> RangeIndex: 2361 entries, 0 to 2360 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 attraction 2361 non-null object 1 username 2357 non-null object 2 contributions 2361 non-null object 3 time 2360 non-null object 4 review 2361 non-null object 5 rating 1715 non-null float64 6 platform 2361 non-null object dtypes: float64(1), object(6) memory usage: 129.2+ KB None
We merge the reviews and bussiness together by their business ID.
yreviews.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 150 entries, 0 to 149 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 150 non-null object 1 url 150 non-null object 2 text 150 non-null object 3 rating 150 non-null int64 4 time_created 150 non-null object 5 user 150 non-null object 6 business_id 150 non-null object 7 platform 150 non-null object dtypes: int64(1), object(7) memory usage: 9.5+ KB
yplaces.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50 entries, 0 to 49 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 50 non-null object 1 alias 50 non-null object 2 name 50 non-null object 3 image_url 50 non-null object 4 is_closed 50 non-null bool 5 url 50 non-null object 6 review_count 50 non-null int64 7 categories 50 non-null object 8 rating 50 non-null float64 9 coordinates 50 non-null object 10 transactions 50 non-null object 11 price 46 non-null object 12 location 50 non-null object 13 phone 49 non-null float64 14 display_phone 49 non-null object 15 distance 50 non-null float64 dtypes: bool(1), float64(3), int64(1), object(11) memory usage: 6.0+ KB
ymerge = pd.merge(yreviews, yplaces, left_on='business_id', right_on='id')
ymerge.head(5)
id_x | url_x | text | rating_x | time_created | user | business_id | platform | id_y | alias | ... | review_count | categories | rating_y | coordinates | transactions | price | location | phone | display_phone | distance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | XeBZmAzdhswpWiQYe6sTcw | https://www.yelp.com/biz/bi-rite-creamery-san-... | Got the Earl Gray + Pina colada and it was SO ... | 5 | 2023-03-12 11:26:29 | {'id': 'J_iJFYQp5y8fwcz-BdUEDQ', 'profile_url'... | wGl_DyNxSv8KUtYgiuLhmA | Yelp | wGl_DyNxSv8KUtYgiuLhmA | bi-rite-creamery-san-francisco | ... | 10031 | [{'alias': 'icecream', 'title': 'Ice Cream & F... | 4.5 | {'latitude': 37.761591, 'longitude': -122.425717} | ['delivery'] | $$ | {'address1': '3692 18th St', 'address2': None,... | 1.415627e+10 | (415) 626-5600 | 946.386739 |
1 | FuF8jvLV0olpgWbNQ2jHZg | https://www.yelp.com/biz/bi-rite-creamery-san-... | My review is not on the ice cream. The ice cre... | 1 | 2023-03-19 16:50:01 | {'id': 'T_yEM-V-vmbODnGqUGi--g', 'profile_url'... | wGl_DyNxSv8KUtYgiuLhmA | Yelp | wGl_DyNxSv8KUtYgiuLhmA | bi-rite-creamery-san-francisco | ... | 10031 | [{'alias': 'icecream', 'title': 'Ice Cream & F... | 4.5 | {'latitude': 37.761591, 'longitude': -122.425717} | ['delivery'] | $$ | {'address1': '3692 18th St', 'address2': None,... | 1.415627e+10 | (415) 626-5600 | 946.386739 |
2 | ruuEFbw8S0wQKZ9UN0-8tg | https://www.yelp.com/biz/bi-rite-creamery-san-... | 2/25/2023 - A delicious end to our date night ... | 5 | 2023-02-26 01:34:44 | {'id': '84oPkNdCcisrtOmYK_ACwA', 'profile_url'... | wGl_DyNxSv8KUtYgiuLhmA | Yelp | wGl_DyNxSv8KUtYgiuLhmA | bi-rite-creamery-san-francisco | ... | 10031 | [{'alias': 'icecream', 'title': 'Ice Cream & F... | 4.5 | {'latitude': 37.761591, 'longitude': -122.425717} | ['delivery'] | $$ | {'address1': '3692 18th St', 'address2': None,... | 1.415627e+10 | (415) 626-5600 | 946.386739 |
3 | eBDHqdq65uP6JAMs24q3tQ | https://www.yelp.com/biz/brendas-french-soul-f... | Many French toast shops do not handle it prope... | 5 | 2023-03-03 22:55:42 | {'id': '5TpUy6HRIDhH3JvQlm8LBA', 'profile_url'... | lJAGnYzku5zSaLnQ_T6_GQ | Yelp | lJAGnYzku5zSaLnQ_T6_GQ | brendas-french-soul-food-san-francisco-6 | ... | 11992 | [{'alias': 'breakfast_brunch', 'title': 'Break... | 4.0 | {'latitude': 37.78291531984934, 'longitude': -... | ['delivery'] | $$ | {'address1': '652 Polk St', 'address2': '', 'a... | 1.415346e+10 | (415) 345-8100 | 2893.406622 |
4 | IXYJkV13UIjG21CMMMy8kQ | https://www.yelp.com/biz/brendas-french-soul-f... | First and foremost, let's talk about Brenda's ... | 5 | 2023-03-09 18:43:35 | {'id': 'qizLZcjOtLwk9v-pjMn2sg', 'profile_url'... | lJAGnYzku5zSaLnQ_T6_GQ | Yelp | lJAGnYzku5zSaLnQ_T6_GQ | brendas-french-soul-food-san-francisco-6 | ... | 11992 | [{'alias': 'breakfast_brunch', 'title': 'Break... | 4.0 | {'latitude': 37.78291531984934, 'longitude': -... | ['delivery'] | $$ | {'address1': '652 Polk St', 'address2': '', 'a... | 1.415346e+10 | (415) 345-8100 | 2893.406622 |
5 rows × 24 columns
y_keep_columns = ['platform','text','rating_x', 'time_created', 'name', 'coordinates']
yreviews_new = ymerge.loc[:, y_keep_columns]
yreviews_new = yreviews_new.rename(columns={'name': 'attraction',
'text': 'review',
'rating_x': 'rating',
'time_created': 'date'
})
yreviews_new.head(5)
platform | review | rating | date | attraction | coordinates | |
---|---|---|---|---|---|---|
0 | Yelp | Got the Earl Gray + Pina colada and it was SO ... | 5 | 2023-03-12 11:26:29 | Bi-Rite Creamery | {'latitude': 37.761591, 'longitude': -122.425717} |
1 | Yelp | My review is not on the ice cream. The ice cre... | 1 | 2023-03-19 16:50:01 | Bi-Rite Creamery | {'latitude': 37.761591, 'longitude': -122.425717} |
2 | Yelp | 2/25/2023 - A delicious end to our date night ... | 5 | 2023-02-26 01:34:44 | Bi-Rite Creamery | {'latitude': 37.761591, 'longitude': -122.425717} |
3 | Yelp | Many French toast shops do not handle it prope... | 5 | 2023-03-03 22:55:42 | Brenda's French Soul Food | {'latitude': 37.78291531984934, 'longitude': -... |
4 | Yelp | First and foremost, let's talk about Brenda's ... | 5 | 2023-03-09 18:43:35 | Brenda's French Soul Food | {'latitude': 37.78291531984934, 'longitude': -... |
merge1 = pd.merge(treviews,greviews, how='outer')
all_reviews = pd.merge(merge1,yreviews_new, how='outer')
all_reviews.head(3)
attraction | username | city | country | contribution | title | month | year | review | rating | platform | contributions | time | date | coordinates | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alcatraz Island | Amber N W | None | None | 5 | FUN FOR ALL AGES | Mar | 2023 | My family took the tour ( BUY TICKETS IN ADVAN... | 5.0 | TripAdvisor | NaN | NaN | NaN | NaN |
1 | Alcatraz Island | Gord P | None | None | 16 | great tour | Feb | 2023 | This is a must stop if you are in San Fran!!! ... | 5.0 | TripAdvisor | NaN | NaN | NaN | NaN |
2 | Alcatraz Island | Jess | None | None | 4 | Must See | Mar | 2023 | I did not expect to enjoy the tour as much as ... | 5.0 | TripAdvisor | NaN | NaN | NaN | NaN |
all_reviews['time2'] = all_reviews['month'] + ' ' + all_reviews['year'].astype(str)
all_reviews['time2'].fillna('Jan 1900', inplace=True)
all_reviews.loc[pd.to_datetime(all_reviews['time2'], format='%b %Y', errors='coerce').isnull(), 'time2'] = 'Jan 1900'
all_reviews['time2'] = pd.to_datetime(all_reviews['time2'], format='%b %Y').dt.strftime('%b, %Y')
all_reviews['time_merged'] = all_reviews['time2'].combine_first(all_reviews['time'])
all_reviews['time3'] = pd.to_datetime(all_reviews['date']).dt.strftime('%b, %Y')
all_reviews['time_merged2'] = all_reviews['time_merged'].combine_first(all_reviews['time3'])
all_reviews = all_reviews.rename(columns={'time_merged2': 'datecolumn' })
all_keep_columns = ['platform','attraction','review', 'rating','datecolumn', 'coordinates']
all_reviews = all_reviews.loc[:, all_keep_columns]
all_reviews['datecolumn'] = all_reviews['datecolumn'].replace('Jan, 1900', np.nan)
all_reviews.head(5)
platform | attraction | review | rating | datecolumn | coordinates | |
---|---|---|---|---|---|---|
0 | TripAdvisor | Alcatraz Island | My family took the tour ( BUY TICKETS IN ADVAN... | 5.0 | Mar, 2023 | NaN |
1 | TripAdvisor | Alcatraz Island | This is a must stop if you are in San Fran!!! ... | 5.0 | Feb, 2023 | NaN |
2 | TripAdvisor | Alcatraz Island | I did not expect to enjoy the tour as much as ... | 5.0 | Mar, 2023 | NaN |
3 | TripAdvisor | Alcatraz Island | San Francisco is completely unsafe. We bought ... | 1.0 | Mar, 2023 | NaN |
4 | TripAdvisor | Alcatraz Island | I had a 13-hour layover in San Francisco And I... | 4.0 | Mar, 2023 | NaN |