Data Wrangling¶

Since we use three data sources, there is a need to do data wrangling to combine the information of three review data together for our analysis. In this section, we are going to obtain a merged dataset that contains all the information of the reviews and rating across three different platforms.

Loading Libraries and Three Datasets¶

In [4]:
import pandas as pd
import numpy as np
In [5]:
def load_data(filename, platform):
    data = pd.read_csv(f'/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/{filename}')
    data['platform'] = [platform]*(data.shape[0])
    return data

# load in location information data
tplaces = pd.read_csv('/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/SF_places.csv')
gplaces = pd.read_csv('/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/new_google_places.csv')
yplaces = pd.read_csv('/Users/alliewu/Desktop/DataScience_Projects/STA220_sentiment-analysis/data/yelp_business.csv')

# load in reviews data
treviews = load_data('tripadvisor_reviews.csv', platform='TripAdvisor')
greviews = load_data('new_google_reviews.csv', platform='Google')
yreviews = load_data('yelp_reviews.csv', platform='Yelp')

Examination on TripAdvisor and Google Maps¶

In [6]:
print(treviews.info())
print()
print(greviews.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   attraction    7500 non-null   object 
 1   username      7500 non-null   object 
 2   city          7500 non-null   object 
 3   country       7500 non-null   object 
 4   contribution  7500 non-null   object 
 5   title         7500 non-null   object 
 6   month         7500 non-null   object 
 7   year          7500 non-null   object 
 8   review        7416 non-null   object 
 9   rating        7500 non-null   float64
 10  platform      7500 non-null   object 
dtypes: float64(1), object(10)
memory usage: 644.7+ KB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2361 entries, 0 to 2360
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   attraction     2361 non-null   object 
 1   username       2357 non-null   object 
 2   contributions  2361 non-null   object 
 3   time           2360 non-null   object 
 4   review         2361 non-null   object 
 5   rating         1715 non-null   float64
 6   platform       2361 non-null   object 
dtypes: float64(1), object(6)
memory usage: 129.2+ KB
None

Examination on Yelp¶

We merge the reviews and bussiness together by their business ID.

In [7]:
yreviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            150 non-null    object
 1   url           150 non-null    object
 2   text          150 non-null    object
 3   rating        150 non-null    int64 
 4   time_created  150 non-null    object
 5   user          150 non-null    object
 6   business_id   150 non-null    object
 7   platform      150 non-null    object
dtypes: int64(1), object(7)
memory usage: 9.5+ KB
In [8]:
yplaces.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             50 non-null     object 
 1   alias          50 non-null     object 
 2   name           50 non-null     object 
 3   image_url      50 non-null     object 
 4   is_closed      50 non-null     bool   
 5   url            50 non-null     object 
 6   review_count   50 non-null     int64  
 7   categories     50 non-null     object 
 8   rating         50 non-null     float64
 9   coordinates    50 non-null     object 
 10  transactions   50 non-null     object 
 11  price          46 non-null     object 
 12  location       50 non-null     object 
 13  phone          49 non-null     float64
 14  display_phone  49 non-null     object 
 15  distance       50 non-null     float64
dtypes: bool(1), float64(3), int64(1), object(11)
memory usage: 6.0+ KB
In [9]:
ymerge = pd.merge(yreviews, yplaces, left_on='business_id', right_on='id')
ymerge.head(5)
Out[9]:
id_x url_x text rating_x time_created user business_id platform id_y alias ... review_count categories rating_y coordinates transactions price location phone display_phone distance
0 XeBZmAzdhswpWiQYe6sTcw https://www.yelp.com/biz/bi-rite-creamery-san-... Got the Earl Gray + Pina colada and it was SO ... 5 2023-03-12 11:26:29 {'id': 'J_iJFYQp5y8fwcz-BdUEDQ', 'profile_url'... wGl_DyNxSv8KUtYgiuLhmA Yelp wGl_DyNxSv8KUtYgiuLhmA bi-rite-creamery-san-francisco ... 10031 [{'alias': 'icecream', 'title': 'Ice Cream & F... 4.5 {'latitude': 37.761591, 'longitude': -122.425717} ['delivery'] $$ {'address1': '3692 18th St', 'address2': None,... 1.415627e+10 (415) 626-5600 946.386739
1 FuF8jvLV0olpgWbNQ2jHZg https://www.yelp.com/biz/bi-rite-creamery-san-... My review is not on the ice cream. The ice cre... 1 2023-03-19 16:50:01 {'id': 'T_yEM-V-vmbODnGqUGi--g', 'profile_url'... wGl_DyNxSv8KUtYgiuLhmA Yelp wGl_DyNxSv8KUtYgiuLhmA bi-rite-creamery-san-francisco ... 10031 [{'alias': 'icecream', 'title': 'Ice Cream & F... 4.5 {'latitude': 37.761591, 'longitude': -122.425717} ['delivery'] $$ {'address1': '3692 18th St', 'address2': None,... 1.415627e+10 (415) 626-5600 946.386739
2 ruuEFbw8S0wQKZ9UN0-8tg https://www.yelp.com/biz/bi-rite-creamery-san-... 2/25/2023 - A delicious end to our date night ... 5 2023-02-26 01:34:44 {'id': '84oPkNdCcisrtOmYK_ACwA', 'profile_url'... wGl_DyNxSv8KUtYgiuLhmA Yelp wGl_DyNxSv8KUtYgiuLhmA bi-rite-creamery-san-francisco ... 10031 [{'alias': 'icecream', 'title': 'Ice Cream & F... 4.5 {'latitude': 37.761591, 'longitude': -122.425717} ['delivery'] $$ {'address1': '3692 18th St', 'address2': None,... 1.415627e+10 (415) 626-5600 946.386739
3 eBDHqdq65uP6JAMs24q3tQ https://www.yelp.com/biz/brendas-french-soul-f... Many French toast shops do not handle it prope... 5 2023-03-03 22:55:42 {'id': '5TpUy6HRIDhH3JvQlm8LBA', 'profile_url'... lJAGnYzku5zSaLnQ_T6_GQ Yelp lJAGnYzku5zSaLnQ_T6_GQ brendas-french-soul-food-san-francisco-6 ... 11992 [{'alias': 'breakfast_brunch', 'title': 'Break... 4.0 {'latitude': 37.78291531984934, 'longitude': -... ['delivery'] $$ {'address1': '652 Polk St', 'address2': '', 'a... 1.415346e+10 (415) 345-8100 2893.406622
4 IXYJkV13UIjG21CMMMy8kQ https://www.yelp.com/biz/brendas-french-soul-f... First and foremost, let's talk about Brenda's ... 5 2023-03-09 18:43:35 {'id': 'qizLZcjOtLwk9v-pjMn2sg', 'profile_url'... lJAGnYzku5zSaLnQ_T6_GQ Yelp lJAGnYzku5zSaLnQ_T6_GQ brendas-french-soul-food-san-francisco-6 ... 11992 [{'alias': 'breakfast_brunch', 'title': 'Break... 4.0 {'latitude': 37.78291531984934, 'longitude': -... ['delivery'] $$ {'address1': '652 Polk St', 'address2': '', 'a... 1.415346e+10 (415) 345-8100 2893.406622

5 rows × 24 columns

In [10]:
y_keep_columns = ['platform','text','rating_x', 'time_created', 'name', 'coordinates']
yreviews_new = ymerge.loc[:, y_keep_columns]
yreviews_new = yreviews_new.rename(columns={'name': 'attraction', 
                            'text': 'review',
                            'rating_x': 'rating',
                            'time_created': 'date'
                            })
yreviews_new.head(5)
Out[10]:
platform review rating date attraction coordinates
0 Yelp Got the Earl Gray + Pina colada and it was SO ... 5 2023-03-12 11:26:29 Bi-Rite Creamery {'latitude': 37.761591, 'longitude': -122.425717}
1 Yelp My review is not on the ice cream. The ice cre... 1 2023-03-19 16:50:01 Bi-Rite Creamery {'latitude': 37.761591, 'longitude': -122.425717}
2 Yelp 2/25/2023 - A delicious end to our date night ... 5 2023-02-26 01:34:44 Bi-Rite Creamery {'latitude': 37.761591, 'longitude': -122.425717}
3 Yelp Many French toast shops do not handle it prope... 5 2023-03-03 22:55:42 Brenda's French Soul Food {'latitude': 37.78291531984934, 'longitude': -...
4 Yelp First and foremost, let's talk about Brenda's ... 5 2023-03-09 18:43:35 Brenda's French Soul Food {'latitude': 37.78291531984934, 'longitude': -...

Merge Them All Together¶

In [11]:
merge1 = pd.merge(treviews,greviews, how='outer')
all_reviews = pd.merge(merge1,yreviews_new, how='outer')
all_reviews.head(3)
Out[11]:
attraction username city country contribution title month year review rating platform contributions time date coordinates
0 Alcatraz Island Amber N W None None 5 FUN FOR ALL AGES Mar 2023 My family took the tour ( BUY TICKETS IN ADVAN... 5.0 TripAdvisor NaN NaN NaN NaN
1 Alcatraz Island Gord P None None 16 great tour Feb 2023 This is a must stop if you are in San Fran!!! ... 5.0 TripAdvisor NaN NaN NaN NaN
2 Alcatraz Island Jess None None 4 Must See Mar 2023 I did not expect to enjoy the tour as much as ... 5.0 TripAdvisor NaN NaN NaN NaN
In [12]:
all_reviews['time2'] = all_reviews['month'] + ' ' + all_reviews['year'].astype(str)
all_reviews['time2'].fillna('Jan 1900', inplace=True)
all_reviews.loc[pd.to_datetime(all_reviews['time2'], format='%b %Y', errors='coerce').isnull(), 'time2'] = 'Jan 1900'
all_reviews['time2'] = pd.to_datetime(all_reviews['time2'], format='%b %Y').dt.strftime('%b, %Y')
all_reviews['time_merged'] = all_reviews['time2'].combine_first(all_reviews['time'])

all_reviews['time3'] = pd.to_datetime(all_reviews['date']).dt.strftime('%b, %Y')
all_reviews['time_merged2'] = all_reviews['time_merged'].combine_first(all_reviews['time3'])
In [13]:
all_reviews = all_reviews.rename(columns={'time_merged2': 'datecolumn' })
all_keep_columns = ['platform','attraction','review', 'rating','datecolumn', 'coordinates']
all_reviews = all_reviews.loc[:, all_keep_columns]
all_reviews['datecolumn'] = all_reviews['datecolumn'].replace('Jan, 1900', np.nan)
all_reviews.head(5)
Out[13]:
platform attraction review rating datecolumn coordinates
0 TripAdvisor Alcatraz Island My family took the tour ( BUY TICKETS IN ADVAN... 5.0 Mar, 2023 NaN
1 TripAdvisor Alcatraz Island This is a must stop if you are in San Fran!!! ... 5.0 Feb, 2023 NaN
2 TripAdvisor Alcatraz Island I did not expect to enjoy the tour as much as ... 5.0 Mar, 2023 NaN
3 TripAdvisor Alcatraz Island San Francisco is completely unsafe. We bought ... 1.0 Mar, 2023 NaN
4 TripAdvisor Alcatraz Island I had a 13-hour layover in San Francisco And I... 4.0 Mar, 2023 NaN