This is a learning project that I would like to share here. This project deals with a business problem and gives a solution that helps in making important decisions for the business. This post follows a sequence of steps so that it will be easy to read and understand the entire project.
As per the title, this project deals with Hyderabad City, India. Before going straight into the problem, it is a good idea to provide a formal introduction to the city. So let’s get to know about Hyderabad city.
Hyderabad is the capital city of the Indian state of Telangana. It is located in the northern part of South India. The city has an estimated population of 9.7 million as of 2021 and is the sixth most populous metropolitan area in India. It is also the fifth-largest urban economy in India. The mix of local and migrant communities has led to a distinctive culture, and the city has emerged as a foremost centre of oriental culture. Many arts and crafts, such as painting, jewellery, literature, and clothing, are still prominent in Hyderabad city.
The Telugu film industry in the city is the second-largest film production industry in the country. The city has emerged as a pharmaceuticals and biotechnology hub in India. The formation of HITEC City, dedicated to information technology, has encouraged multinational companies like Google, Amazon, Apple, Facebook, and Microsoft to set up their operations in Hyderabad city.
Like India as a whole, Hyderabad city is known for its great history, diverse culture, and food. The city has different cuisines and is listed as a UNESCO creative city of gastronomy. The city is famous for its popular dish, “Biryani”. Many restaurants serve different cuisines. The city has a rich food culture dating back to the Nizam and Mughal empires. Some of the cuisines the city offers are Arabic, Turkish, Iranian, and native Telugu cuisines.
Now, let’s get straight to the business problem.
The Business Problem:
The business problem here is that a businessperson wants to open a restaurant in Hyderabad but is not sure where to start. Our main goal is to solve this problem so that he can decide where in the city to open his restaurant.
Hyderabad city is one of the best locations in the country to open a restaurant. The city offers diverse cuisines and a rich food culture. Setting up a restaurant in the city can be profitable. The diversity of people and their food preferences provides a very good opportunity to compete in the food business. There are many neighbourhoods within the city, and choosing an optimal one to open a restaurant is a difficult decision. It is the job of a data scientist to gather information on all neighbourhoods in the city and present it to the stakeholders so that they can make a business decision on where to open a restaurant. For this, data is required. Let’s see what data this project needs and how to get it.
Data sources and data cleaning:
For this project, Hyderabad city neighbourhood names, their respective latitude and longitude coordinates, and the data of nearby venues for each neighbourhood are required. The following are the data sources for our project.
Get neighbourhood names:
A quick Google search turned up a Wikipedia page that provides information on the neighbourhoods of Hyderabad city. Below is the link to the Wikipedia webpage.
To get the data, let’s web-scrape this Wikipedia webpage. For web scraping, the Beautiful Soup Python package was used. Inspecting the webpage shows that the required data is in the “<li>” tags.
The Python requests library was used to fetch the page from the URL, and a Beautiful Soup object was created from the page content. The fetched data was then filtered to obtain the required data, which was finally stored in a pandas data frame.
The web scraping was done as shown in the figures below.
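As a rough sketch of the scraping step described above, the snippet below separates fetching from parsing. The function names and the "Neighborhood" column name are assumptions made for illustration, and the sample HTML is a tiny stand-in for the real page. On the real Wikipedia page, the "<li>" tags also include navigation items, so additional filtering would be needed.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def extract_neighbourhoods(html: str) -> pd.DataFrame:
    """Collect the text of every <li> tag into a one-column data frame."""
    soup = BeautifulSoup(html, "html.parser")
    names = [li.get_text(strip=True) for li in soup.find_all("li")]
    # Drop empty items and duplicates while keeping the page order.
    names = list(dict.fromkeys(name for name in names if name))
    return pd.DataFrame({"Neighborhood": names})


def scrape_neighbourhoods(url: str) -> pd.DataFrame:
    """Fetch a page and parse it; the URL is supplied by the caller."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return extract_neighbourhoods(response.text)


# Tiny stand-in for the real page, used only to demonstrate the parsing step.
sample_html = "<ul><li>Ameerpet</li><li>Begumpet</li><li>Ameerpet</li></ul>"
sample_df = extract_neighbourhoods(sample_html)
print(sample_df)
```

Keeping the parsing in its own function makes it easy to check against a saved copy of the page without hitting the network each time.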
Now that we have the neighbourhood data, we need the location coordinates for all neighbourhoods.
Get coordinates for neighbourhoods:
The latitude and longitude coordinates of the neighbourhoods were not available on the Wikipedia page. To obtain the coordinates, the Nominatim API from openstreetmap.org was used. A request is sent to the API, and the response contains the coordinates for the requested address in JSON format. Using this API, coordinates for the neighbourhoods of Hyderabad city were obtained. This is shown below.
The coordinates are filtered out from the response and stored in a pandas data frame as shown below.
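A minimal sketch of such a request is given here, assuming the public Nominatim search endpoint and its JSON output, where each result carries "lat" and "lon" as strings. The function names and the User-Agent string are placeholders, and the sample response is hand-written to show the shape of the data, not real output.

```python
from typing import Optional, Tuple

import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def parse_coordinates(results: list) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) from a Nominatim JSON result list, or None if it is empty."""
    if not results:
        return None
    first = results[0]
    return float(first["lat"]), float(first["lon"])


def geocode(address: str) -> Optional[Tuple[float, float]]:
    """Send one search request and return the first match, if any."""
    params = {"q": address, "format": "json", "limit": 1}
    # Nominatim's usage policy asks for an identifying User-Agent; this is a placeholder.
    headers = {"User-Agent": "hyderabad-neighbourhoods-example"}
    response = requests.get(NOMINATIM_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return parse_coordinates(response.json())


# Offline demonstration with a hand-written response in Nominatim's shape.
sample_response = [{"lat": "17.4", "lon": "78.5", "display_name": "placeholder"}]
sample_coords = parse_coordinates(sample_response)
missing_coords = parse_coordinates([])  # what an unknown address would produce
```

Returning None for an empty result makes it straightforward to collect the neighbourhoods that still need coordinates from another source.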
Data Cleaning: The challenge with coordinates data:
Using the Nominatim API, coordinates could not be obtained for 14 neighbourhoods. This is mainly because:
1. For some neighbourhoods, the API doesn’t provide coordinates, for unknown reasons.
2. A few neighbourhood names on the Wikipedia webpage are misspelt. This was found by manually searching for coordinates on Google.
This challenge was overcome by searching for the coordinates of the remaining neighbourhoods using the website mentioned below.
Latitude and Longitude Finder
This website provides latitude and longitude coordinates for a requested address or place. It is free, but it allows only a limited number of searches per day.
Finally, after getting coordinates for all the neighbourhoods, the data is stored in a pandas data frame. Now our required data is in two data frames: one for neighbourhood names and the other for coordinates. These two data frames are merged into the single data frame required for our project.
After merging the two data frames, the final data frame obtained was as shown below.
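One way to do such a merge with pandas is sketched here. The column names are assumptions, and the rows are placeholders, not project data. A left join keeps every neighbourhood name, so any row that still lacks coordinates is easy to spot.

```python
import pandas as pd

# Placeholder rows standing in for the scraped names and the fetched coordinates.
names_df = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
coords_df = pd.DataFrame(
    {
        "Neighborhood": ["A", "C"],
        "Latitude": [17.40, 17.45],
        "Longitude": [78.50, 78.48],
    }
)

# Left join: keep all names, attach coordinates where they exist.
hyd_data = names_df.merge(coords_df, on="Neighborhood", how="left")

# Neighbourhoods whose coordinates still have to be looked up manually.
missing = hyd_data[hyd_data["Latitude"].isna()]["Neighborhood"].tolist()
print(hyd_data)
print("Missing coordinates:", missing)
```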
Geopy library to get coordinates of Hyderabad city:
Geopy is a Python library that can be used to fetch the coordinates of an address. This library was used to get the coordinates of Hyderabad city itself. These coordinates will help plot the map of Hyderabad city using Python’s folium visualization library. Below is an example.
Neighborhood location data using Foursquare API:
To get the nearby venues of a neighbourhood, the Foursquare API was used. The Foursquare API provides location data for an address, including diverse information about venues, users, photos, check-ins, geo-tagging, etc.
This API was used in this project to get nearby venue details of a neighbourhood. To get data from the API, a search query is to be sent to the Foursquare API. The response from the API contains the requested data in JSON format.
The API requires credentials like CLIENT_ID, CLIENT_SECRET, and VERSION to get data. These can be obtained by creating an account on the Foursquare website. The response is filtered out to get our required data. Finally, the data is stored in a data frame.
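The request and the filtering step might look roughly like the sketch below. It assumes the legacy Foursquare v2 "explore" endpoint and its response layout; the post does not say which endpoint was used, and newer Foursquare API versions differ. The radius, limit, function names, and output column names are all illustrative assumptions, and the sample payload is hand-written.

```python
import pandas as pd
import requests

# Assumed legacy v2 endpoint; not confirmed by the post.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def fetch_venues(lat, lng, client_id, client_secret, version, radius=500, limit=100):
    """Request venues around a coordinate; radius and limit are assumed defaults."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    response = requests.get(EXPLORE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def venues_to_frame(neighbourhood, payload):
    """Keep only the venue name, location, and first category from the response."""
    items = payload["response"]["groups"][0]["items"]
    rows = [
        {
            "Neighborhood": neighbourhood,
            "Venue": item["venue"]["name"],
            "Venue Latitude": item["venue"]["location"]["lat"],
            "Venue Longitude": item["venue"]["location"]["lng"],
            "Venue Category": item["venue"]["categories"][0]["name"],
        }
        for item in items
    ]
    return pd.DataFrame(rows)


# Hand-written payload in the assumed response shape, for an offline check.
sample_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {
                        "venue": {
                            "name": "Example Venue",
                            "location": {"lat": 17.4, "lng": 78.5},
                            "categories": [{"name": "Indian Restaurant"}],
                        }
                    }
                ]
            }
        ]
    }
}
sample_frame = venues_to_frame("A", sample_payload)
```

Calling `fetch_venues` once per neighbourhood and concatenating the resulting frames would give a table like hyderabad_venues.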
We have the required data for our project. Let’s see the statistical analysis of the data. There are a total of 244 records in our final data frame named hyd_data. Therefore, we have 244 neighbourhoods and their coordinates in this data frame.
We also have a second data frame, named hyderabad_venues, that contains the venues of each neighbourhood. Below is a screenshot of the data frame.
There are 1154 records in the venues data frame, with 186 unique venue categories. We are interested in the neighbourhoods with restaurants. Let’s sort the data frame and check the top 10 most common venues in each neighbourhood, so that we get an idea of which neighbourhoods have restaurants. This is achieved by grouping the rows by neighbourhood and taking the mean of the frequency of occurrence of each venue category. Finally, we name this data frame sorted_venues; it looks as shown below.
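The grouping step can be sketched with pandas as follows. The rows are placeholders, the column names are assumptions, and only the top 2 venues are kept here so the output stays small (the project uses the top 10).

```python
import pandas as pd

# Placeholder venue rows; the real hyderabad_venues frame has 1154 records.
hyderabad_venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B", "B", "B"],
        "Venue Category": ["Restaurant", "Restaurant", "ATM", "ATM", "ATM", "Cafe"],
    }
)

# One column per category, 1.0 where the venue belongs to that category.
onehot = pd.get_dummies(hyderabad_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", hyderabad_venues["Neighborhood"])

# The per-neighbourhood mean of each column is the category's relative frequency.
grouped = onehot.groupby("Neighborhood").mean()


def top_venues(row: pd.Series, n: int) -> list:
    """Return the n category names with the highest frequency in this row."""
    return row.sort_values(ascending=False).index[:n].tolist()


top_n = 2  # the project uses 10
sorted_venues = pd.DataFrame(
    [top_venues(grouped.loc[name], top_n) for name in grouped.index],
    index=grouped.index,
    columns=[f"Most Common Venue {i + 1}" for i in range(top_n)],
).reset_index()
print(sorted_venues)
```

The `grouped` frame of frequencies is the numeric table a clustering algorithm can work on, while `sorted_venues` is the readable summary.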
Machine Learning Algorithm:
Now that we have the sorted data frame, we will use the unsupervised machine learning algorithm “K Means Clustering” to cluster the neighbourhoods. This clustering algorithm forms clusters of neighbourhoods with similar most common venues. This helps us get an idea of the most common venue in a cluster.
K Means Algorithm:
The K Means algorithm is a clustering algorithm, and one of the most popular and widely used. The algorithm takes a data set of items as input and categorizes those items into groups called clusters. The algorithm is explained below.
1. Randomly pick k data points from the data set as the initial cluster centroids. Here k is the number of clusters.
2. Calculate the distance between each data point and every cluster centroid. Each data point is assigned to its nearest centroid, which forms the clusters.
3. For each cluster, calculate the mean position of the data points assigned to it. The centroid is moved to this mean position.
Steps 2 and 3 are repeated until the centroids no longer move to a different mean position. At the end of this iterative process, we will have our clusters.
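The steps above can be sketched from scratch with NumPy. This is only an illustration of the procedure on six made-up points, not necessarily the implementation the project ran; the function name, the seed, and the iteration cap are assumptions.

```python
import numpy as np


def k_means(points, k, max_iter=100, seed=0):
    """Plain K Means: returns (labels, centroids) for a 2-D array of points."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array(
            [
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ]
        )
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of made-up points.
points = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the initial centroids are random, which group receives label 0 and which receives label 1 depends on the seed; only the grouping itself is meaningful.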
Choosing the number of clusters:
In some problems, we may have no idea how many clusters to choose. To find the optimum number of clusters (k), we use the elbow method. In the elbow method, we select a predefined range of values for k. We run the K Means algorithm for each k in this range and calculate the sum of squared distances of the data points to their cluster centroid. If we plot the sum of squared distances against the number of clusters (k), we get a plot like the one in the figure below.
The plot is shaped like an elbow, hence the name elbow plot. In the elbow plot, we choose the k at which the line bends sharply. In the above plot, the line bends at k = 3. Therefore, the optimum value for k is 3.
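A small sketch of the elbow method using scikit-learn's KMeans is given here, whose `inertia_` attribute is the sum of squared distances to the nearest centroid. The data is synthetic (three made-up blobs), so the numbers only illustrate the shape of the curve and are not the project's results; the range of k and the random seeds are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight blobs, generated only for illustration.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
data = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centres])

k_values = range(1, 8)
sse = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the sum of squared distances of points to their centroid.
    sse.append(model.inertia_)

for k, value in zip(k_values, sse):
    print(k, round(value, 2))
```

Plotting `sse` against `k_values` with any plotting library gives the elbow plot; on this synthetic data the curve flattens after k = 3 because the data was built from three blobs.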
Plotting map of Hyderabad city with neighbourhoods superimposed on top:
Using the folium library, the Hyderabad city map is created with the neighbourhoods superimposed on top of it, as shown below.
The above map is plotted using the data in the hyd_data data frame. Here we can see all the neighbourhoods of the city, marked with blue circle markers.
Finally, we have the following required things to provide a solution to our business problem.
1. The neighbourhood names and their coordinates are in the hyd_data data frame.
2. We also have the data of the top 10 venues of each neighbourhood in the sorted_venues data frame.
3. By using the K Means clustering machine learning algorithm, the data in the sorted_venues data frame is segmented into 3 cluster groups.
Using the above 3 data items, we form the final_data data frame. The final_data data frame contains the Neighborhood name, its latitude and longitude, the cluster label (which tells us the cluster each neighbourhood belongs to), and the top 10 most common venues of each neighbourhood. The final_data data frame looks as shown in the picture below.
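Combining the three items might look like the sketch below. All values are placeholders, the column names are assumptions, and the cluster labels are hard-coded where the real project would use the output of the clustering step. An inner join is used here on the assumption that neighbourhoods without any venue data should be dropped; the post does not say how such rows were handled.

```python
import pandas as pd

# Placeholder inputs standing in for the project's data frames.
hyd_data = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C"],
        "Latitude": [17.40, 17.42, 17.45],
        "Longitude": [78.50, 78.49, 78.48],
    }
)
sorted_venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "C"],
        "Most Common Venue 1": ["Restaurant", "ATM"],
    }
)
# One label per row of sorted_venues, e.g. taken from a fitted K Means model.
cluster_labels = [2, 0]

clustered = sorted_venues.copy()
clustered.insert(1, "Cluster Label", cluster_labels)

# Inner join: neighbourhoods with no venue data (here "B") are left out.
final_data = hyd_data.merge(clustered, on="Neighborhood", how="inner")
print(final_data)
```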
Plotting the Results using folium:
By plotting the results, we get a better idea of the clusters and the neighbourhoods. For plotting, we use the folium library, which is very useful for creating map visualizations of location data. It also has zoom functionality that lets one zoom in on the map and explore the areas within it. The clusters in the plot are denoted with coloured circle markers. The three clusters are marked in red, blue, and green circles. The final plot result is shown below.
In the map, the red circle markers denote cluster 1, the blue circle markers denote cluster 2, and the green circle markers denote cluster 3. These clusters are created based on the mean of the frequency of occurrence of similar venue categories in the neighbourhoods. We can verify the results of the map by taking a look at the clustered final data frame. The clustered final data frame is nothing but the final data frame split cluster-wise. As we have 3 clusters in our result, the final data frame is split into 3 data frames, one per cluster. Let’s take a look at them.
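Splitting a frame cluster-wise can be sketched with a pandas groupby. The rows and column names below are placeholders; with the real final_data, each resulting frame corresponds to one of the cluster data frames that follow.

```python
import pandas as pd

# Placeholder rows standing in for final_data.
final_data = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D"],
        "Cluster Label": [0, 1, 0, 2],
        "Most Common Venue 1": ["ATM", "Indian Restaurant", "ATM", "Cafe"],
    }
)

# One data frame per cluster label.
cluster_frames = {
    int(label): frame.reset_index(drop=True)
    for label, frame in final_data.groupby("Cluster Label")
}

# The most frequent top venue within each cluster, as a quick summary.
top_venue_per_cluster = {
    label: frame["Most Common Venue 1"].mode().iloc[0]
    for label, frame in cluster_frames.items()
}
print(top_venue_per_cluster)
```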
Cluster 1 data frame:
Cluster 2 data frame:
Cluster 3 data frame:
Discussion on results:
From the pictures of all three cluster data frames, we can make some observations. In the first cluster, restaurants are not the most common venue; they appear as the 3rd and 9th most common venues. From this observation, we can say that either the competition or the demand for restaurants is lower in that cluster. Digging further into why restaurants are not the most common venues in the first cluster may give a clearer idea of whether or not to open a restaurant there.
In the second cluster, restaurants are the 4th and 9th most common venues, and the most common venue is ATMs. The situation is similar to the first cluster, so the same type of analysis can be applied here.
The third cluster is really interesting. Restaurants are the most common venues in this cluster. This means there is a lot of demand and competition for restaurants in Cluster 3. So, opening a restaurant in this cluster is good for business. But it also depends on other factors such as competition, the cost of opening a restaurant in the cluster, and so on. The final decision will be taken by the stakeholders.
In this project, our business problem is choosing a neighbourhood in which to open a restaurant in Hyderabad city, India. For this problem, we first need the neighbourhoods of Hyderabad and their location data. The required data was acquired from Wikipedia and by using the Nominatim API. The venue data of each neighbourhood was obtained using the Foursquare API. The K Means clustering algorithm was applied to the data and grouped it into 3 clusters. Each cluster was created by grouping the neighbourhoods on the mean of the frequency of occurrence of each venue category. The final result was labelled and segmented into 3 clusters, and the data was plotted using the folium library. The final data sets were then analysed. The resulting observations and comments should help the stakeholders decide where to open a restaurant.
Finally, if you read my project, thanks for your time, and if you have any suggestions, advice, or ideas please do comment.