Yelp Data Analysis

Erwan Delorme, Louis Kremer, Benjamin Ooi, Anna Tcirkina

Overview

The project will analyse data from the reviews and recommendations website Yelp. Using various data science tools, the team will attempt to gain valuable insights into the restaurant industry and the online review environment provided by Yelp.

Data

Original data: https://www.kaggle.com/yelp-dataset/yelp-dataset

The dataset is hosted on Kaggle and was published by Yelp itself. It has a total size of 10 GB and is stored in five JSON files. The team will focus on the business and reviews files. The business file contains 27 variables, including location data, business attributes (price range, parking, etc.) and opening times. The reviews file contains, most importantly, text reviews. Our analysis will focus on text reviews concerning Las Vegas, to cut down on storage size.
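The Las Vegas restriction described above could be sketched as follows. This is only a minimal illustration, not the team's import script: it assumes each file holds one JSON object per line and that business records carry a "city" field, and the function name filter_by_city and the sample records are made up for the example.

```python
import json
from typing import Iterable, Iterator


def filter_by_city(lines: Iterable[str], city: str = "Las Vegas") -> Iterator[dict]:
    """Yield the records whose "city" field matches `city` (case-insensitive).

    Assumes one JSON object per line; the "city" field name is an assumption.
    """
    target = city.strip().lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if str(record.get("city", "")).strip().lower() == target:
            yield record


# Made-up records standing in for a few lines of the business file.
sample_lines = [
    '{"business_id": "b1", "city": "Las Vegas", "stars": 4.5}',
    '{"business_id": "b2", "city": "Phoenix", "stars": 3.0}',
    '{"business_id": "b3", "city": "las vegas ", "stars": 2.5}',
]

vegas = list(filter_by_city(sample_lines))
vegas_ids = {b["business_id"] for b in vegas}
```

Because the function reads line by line, an open file handle can be passed in place of sample_lines, so the 10 GB of data never has to sit in memory at once. The resulting set of business IDs could then be used to keep only the matching reviews.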

Objectives

1. Design a sorting algorithm that will rank restaurants from best to worst (per cuisine, location, type or others) (Ben)

2. Try to understand how data relates to one another (clustering) with a particular emphasis on location data (Erwan)

3. Text processing: try to understand how review text influences ratings (Anna)

4. Predict the likelihood of food poisoning (Louis)
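As an illustration of the first objective, one possible scoring rule is a Bayesian average, which pulls the rating of a restaurant with few reviews towards the overall mean so that two 5-star reviews do not outrank hundreds of 4.5-star reviews. This is a sketch of one candidate approach, not the method the team has settled on. The Restaurant fields, the prior_weight value and the sample data are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Restaurant(NamedTuple):
    name: str
    stars: float        # average star rating
    review_count: int   # number of reviews behind that average


def bayesian_score(r: Restaurant, prior_mean: float, prior_weight: float = 10.0) -> float:
    """Weighted mean of the restaurant's rating and the overall mean.

    prior_weight acts like a number of "virtual" reviews at the overall mean.
    """
    return (prior_weight * prior_mean + r.review_count * r.stars) / (
        prior_weight + r.review_count
    )


def rank(restaurants: List[Restaurant], prior_weight: float = 10.0) -> List[Restaurant]:
    """Return the restaurants sorted from best to worst by Bayesian score."""
    total_reviews = sum(r.review_count for r in restaurants)
    if total_reviews == 0:
        return list(restaurants)  # nothing to rank on
    prior_mean = sum(r.stars * r.review_count for r in restaurants) / total_reviews
    return sorted(
        restaurants,
        key=lambda r: bayesian_score(r, prior_mean, prior_weight),
        reverse=True,
    )


# Made-up sample: A has a perfect rating but only two reviews.
sample = [
    Restaurant("A", 5.0, 2),
    Restaurant("B", 4.5, 400),
    Restaurant("C", 3.0, 50),
]
ranked_names = [r.name for r in rank(sample)]
```

Ranking per cuisine, location or type would amount to applying rank to each group separately, with the prior mean computed within that group.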

Practical Information

GitHub repository

  • The project is structured using four files (JSON import, individual CSV import & packages, cleaning, models & visualisations)

  • The team will meet at least biweekly, and Louis will check in with individual team members on a weekly basis

  • Data storage: every team member will store the data in CSV format on their local computer and create their own individual upload path (GitHub limits file size to 50 MB, and 1 GB under LFS)
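One way to handle the individual paths without editing shared code is to read the data folder from an environment variable. The sketch below is only a suggestion: the variable name YELP_DATA_DIR, the default folder "data" and the file name business.csv are hypothetical.

```python
import os
from pathlib import Path


def data_path(filename: str, env_var: str = "YELP_DATA_DIR") -> Path:
    """Build the path to a local data file.

    The base folder comes from the environment variable `env_var`
    (hypothetical name); it falls back to a relative "data" folder.
    """
    base = Path(os.environ.get(env_var, "data"))
    return base / filename


# Each team member sets YELP_DATA_DIR once on their own machine;
# the shared scripts then only ever call data_path(...).
business_csv = data_path("business.csv")
```

With this pattern the CSV files stay outside the repository, and the local folder can additionally be listed in .gitignore so that it is never committed by accident.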

09/03/2021