Using Simple Data Science to Find Hit Locations in NYC

Phillip Kim
7 min read · Apr 2, 2021

Introduction

While traveling in New York, my friend and I noticed that Manhattan has far fewer Korean BBQ restaurants than other ethnic restaurants, such as Italian or Japanese.

Some of the Korean BBQ restaurants we visited, such as Jongro BBQ on 32nd Street, seemed very busy and highly profitable, with fully booked tables and people waiting in line for over an hour to be seated.

However, we saw few Korean BBQ restaurants, if any, in other parts of Manhattan. Our aim here is to find the best possible location for a Korean BBQ restaurant: a spot where population traffic is dense but few or no Korean restaurants are around.

In addition, we want to avoid areas with other types of BBQ restaurants, as they could become serious competitors.

Data Description

First, we need the locations of all Korean BBQ restaurants and of restaurants that serve similar food. This can be done with the Foursquare API using keywords such as BBQ and Restaurant. We will show the dataset in the Methodology section.

Second, we need to estimate population traffic at these locations. We can approximate it using rating counts and tip counts. We will manipulate the location data and the venue data, then merge them to get the final dataset.

Methodology

We will merge the location and venue data and preprocess the merged dataset. Once preprocessing is done, we will apply a 75th-percentile condition to the rating, rating-count, and tip-count columns to determine which areas are hot spots.
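Before walking through the full pipeline, here is a toy sketch of the 75th-percentile rule (with made-up numbers, not the real Foursquare data): a venue is flagged a hit only if it is at or above the 75th percentile on every metric.

```python
import numpy as np
import pandas as pd

# Hypothetical venues with made-up metrics, for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Rating": [9.1, 6.5, 9.3, 7.0],
    "Number_of_Rated": [500, 40, 610, 90],
    "Tip_Count": [120, 10, 150, 30],
})

metrics = ["Rating", "Number_of_Rated", "Tip_Count"]
thresholds = {m: np.percentile(toy[m], 75) for m in metrics}

# A venue counts as a "Hit" only when it clears the 75th percentile on all metrics
is_hit = np.logical_and.reduce([toy[m] >= thresholds[m] for m in metrics])
toy["Potential_Spot_Flag"] = np.where(is_hit, "Hit", "Miss")
print(toy[["name", "Potential_Spot_Flag"]])
```

With these numbers only venue C clears all three thresholds; the same AND-of-conditions logic drives the real analysis below.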

# Import libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json
from geopy.geocoders import Nominatim
import requests
from pandas import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

# Foursquare credentials (replace with your own)
CLIENT_ID = 'your Foursquare ID'
CLIENT_SECRET = 'your Foursquare Secret'
ACCESS_TOKEN = 'your Foursquare Access Token'
VERSION = '20180604'
LIMIT = 1000

We first want to find where KBBQ and other BBQ restaurants are located in Manhattan.

address = '22 W 32nd St, New York, NY'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

search_query = 'BBQ'
radius = 10000
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, latitude, longitude, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
dataframe = json_normalize(venues)

filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
dataframe_filtered.sort_values(by='distance', ascending=True, inplace=True)
dataframe_filtered = dataframe_filtered[dataframe_filtered['categories'].str.contains('Restaurant', na=False, regex=False)]
dataframe_filtered
A dataset containing all data samples that have BBQ
# Generate a map centered around Jongro BBQ
venues_map = folium.Map(location=[latitude, longitude], zoom_start=13)

# Add a red circle marker to represent Jongro BBQ
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Jongro BBQ',
    fill=True,
    fill_color='red',
    fill_opacity=0.6
).add_to(venues_map)

# Add the BBQ restaurants as blue circle markers
for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# Display the map
venues_map
Locations of all BBQ restaurants

The map shows that BBQ restaurants are concentrated in Midtown Manhattan. We should look for places in either lower or upper Manhattan.

Restaurants in Lower Manhattan

address = '138 Lafayette St, New York, NY 10013'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

search_query = 'Restaurant'
radius = 2000
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, latitude, longitude, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
dataframe = json_normalize(venues)

filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
dataframe_filtered.sort_values(by='distance', ascending=True, inplace=True)
dataframe_filtered = dataframe_filtered[dataframe_filtered['categories'].str.contains('Restaurant', na=False, regex=False)]
total_data = dataframe_filtered.copy()

Restaurants in Upper Manhattan

address = '1544 Madison Ave, New York, NY 10029'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

search_query = 'Restaurant'
radius = 2000
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, latitude, longitude, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues = results['response']['venues']
dataframe = json_normalize(venues)

filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
dataframe_filtered.sort_values(by='distance', ascending=True, inplace=True)
dataframe_filtered = dataframe_filtered[dataframe_filtered['categories'].str.contains('Restaurant', na=False, regex=False)]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
total_data = pd.concat([total_data, dataframe_filtered])

We merge the lower and upper Manhattan datasets.

total_data = total_data.reset_index(drop=True)
total_data
A merged dataset containing lower and upper Manhattan data samples

We merge the location and venue data here.

def rating_count_extractor(venue_id):
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&oauth_token={}&v={}'.format(
        venue_id, CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN, VERSION)
    result = requests.get(url).json()
    try:
        num_rated = result['response']['venue']['ratingSignals']
        rating = result['response']['venue']['rating']
    except KeyError:
        num_rated = 0
        rating = None
    try:
        tip_count = result['response']['venue']['tips']['count']
    except KeyError:
        tip_count = 0
    data_list = [venue_id, num_rated, rating, tip_count]
    data_columns = ['Venue_ID', 'Number_of_Rated', 'Rating', 'Tip_Count']
    return pd.DataFrame([data_list], columns=data_columns)

venue_ids = total_data['id'].tolist()
total_df = []
for venue_id in venue_ids:
    total_df.append(rating_count_extractor(str(venue_id)))
total_df = pd.concat(total_df, ignore_index=True)

df_merge = total_data.merge(total_df, left_on='id', right_on='Venue_ID', how='left')
df_merge.dropna(subset=['Rating'], inplace=True)

As explained at the beginning of this section, we apply the 75th percentile to the total number of people who rated, the ratings, and the tip comment counts to label each venue a hit or a miss. The rationale for choosing these three features as indicators is that we want to know which venues have been visited by many people and rated well. If a venue is rated well, people tend to come back.

clean_list = ['name', 'categories', 'lat', 'lng', 'Number_of_Rated', 'Rating', 'Tip_Count']
df_merge_clean = df_merge[clean_list].copy()
sfperc_num_rated = np.percentile(df_merge_clean['Number_of_Rated'], 75)
sfperc_rating = np.percentile(df_merge_clean['Rating'], 75)
sfperc_tip_count = np.percentile(df_merge_clean['Tip_Count'], 75)
conditions = [((df_merge_clean['Number_of_Rated'] >= sfperc_num_rated) & (df_merge_clean['Rating'] >= sfperc_rating) & (df_merge_clean['Tip_Count'] >= sfperc_tip_count)),
              ((df_merge_clean['Number_of_Rated'] < sfperc_num_rated) | (df_merge_clean['Rating'] < sfperc_rating))]
values = ['Hit', 'Miss']
# default='Miss' covers rows matching neither condition (e.g. high rating but low tip count)
df_merge_clean['Potential_Spot_Flag'] = np.select(conditions, values, default='Miss')
df_merge_clean.dropna(inplace=True)
df_merge_final = df_merge_clean[df_merge_clean['Potential_Spot_Flag'] == 'Hit']
df_merge_final
A merged and cleaned dataset showing all potential hit locations based on 75th percentile analysis

We can clearly see three potential hit areas on the map below.

venues_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Add the hit restaurants as blue circle markers
for lat, lng, label in zip(df_merge_final.lat, df_merge_final.lng, df_merge_final.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# Display the map
venues_map
A map showing all three potential hit areas

Results

After visualizing the data, we found three potential hit areas. Of these, lower Manhattan seems the best opportunity for opening a KBBQ restaurant, as it has very high population traffic and no KBBQ or any other type of BBQ restaurant present. Let us see the map below.

# Keep the six hit venues clustered in lower Manhattan
df_merge_final = df_merge_final.iloc[[0, 1, 2, 3, 4, 5], :]

venues_map = folium.Map(location=[40.7128, -74.0060], zoom_start=14)

# Add the hit restaurants as blue circle markers
for lat, lng, label in zip(df_merge_final.lat, df_merge_final.lng, df_merge_final.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# Highlight the candidate area with a green circle placed near the mean coordinates
avg_lat = np.mean(df_merge_final.lat)
std_lat = np.std(df_merge_final.lat)
avg_lng = np.mean(df_merge_final.lng)
std_lng = np.std(df_merge_final.lng)
folium.CircleMarker(
    [avg_lat + 0.2 * std_lat, avg_lng + 0.6 * std_lng],
    radius=120,
    color='green',
    fill=True,
    fill_color='green',
    fill_opacity=0.2
).add_to(venues_map)
venues_map

Discussion

In this project, I refrained from using any existing machine learning algorithms, for a few reasons. The first was the sheer amount of data: you can see from the Methodology section that we do not have enough data to train and test a model. In addition, unsupervised learning usually requires far more data than supervised learning, so for this project I stayed with data manipulation and analysis. Choosing the 75th percentile was a purely subjective choice, which has most certainly introduced bias into my analysis. Despite that bias, based on the feature metrics of the surrounding restaurants, the area picked for the KBBQ will be populated by many people and will offer plenty of patronage opportunities if the food is done right.
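To make that subjectivity concrete, here is a small sketch (on simulated scores, not this project's data) of how strongly the number of flagged venues depends on which percentile cutoff we pick:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(5, 10, size=200)  # hypothetical venue ratings, for illustration only

# Count how many venues clear the cutoff at different percentile choices
hits = {}
for p in (50, 75, 90):
    cutoff = np.percentile(scores, p)
    hits[p] = int((scores >= cutoff).sum())
    print(f"{p}th-percentile cutoff = {cutoff:.2f} -> {hits[p]} venues flagged")
```

Moving the cutoff from the 50th to the 90th percentile shrinks the candidate set several-fold, so the final "hit" list is quite sensitive to this one knob.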

Conclusion

We set out with what seemed a very broad idea: KBBQ looks very profitable, so can we find the best location for a venue in Manhattan? Coupled with location and venue data, this broad idea became a very specific goal. Even without a large amount of data or fancy machine learning algorithms, just by employing simple data preprocessing, statistical methods, and data visualization, we were able to pinpoint an area where a potential investor in the food industry could make a profit running one or more KBBQ restaurants.

Thank you for reading!
