Comparing the neighborhoods of Hyderabad City, India

This is a learning project that I would like to share here. It tackles a business problem and proposes a solution that supports an important business decision. The post follows the project step by step so that it is easy to read and understand.

The Introduction:

Hyderabad is the capital of the Indian state of Telangana. It is located in the northern part of South India. The city has an estimated population of 9.7 million as of 2021 and is the sixth most populous metropolitan area in India. It is also the fifth-largest urban economy in India. The mix of local and migrant communities has produced a distinctive culture, and the city has emerged as the foremost center of oriental culture. Many arts and crafts, such as painting, jewelry, literature, and clothing, are still prominent in Hyderabad.

The Telugu film industry based in the city is the second-largest film production industry in the country. The city has also emerged as a pharmaceuticals and biotechnology hub. The formation of HITEC City, dedicated to information technology, has encouraged multinational companies like Google, Amazon, Apple, Facebook, and Microsoft to set up operations in Hyderabad.

Like India as a whole, Hyderabad is known for its great history, diverse culture, and food. The city is listed as a UNESCO Creative City of Gastronomy and is famous for its signature dish, “Biryani”. Its rich food culture dates back to the Nizams and the Mughal empire, and its many restaurants serve Arabic, Turkish, Iranian, and native Telugu cuisines, among others.

Now, let’s get straight to the business problem.

The Business Problem:

Hyderabad is one of the best locations in the country to open a restaurant. The city offers many cuisines, and the diversity of its people and their food preferences creates a good opportunity to compete in the food business, so a restaurant here can be profitable. However, the city has many neighborhoods, and choosing the best one for a new restaurant is a difficult decision. It is the job of a data scientist to gather information on all the neighborhoods and present it to the stakeholders so that they can decide where to open the restaurant. This requires data. Let’s see what data the project needs and how to get it.

Data sources and data cleaning:

Get neighborhood names:

A quick Google search turned up a Wikipedia page that lists the neighborhoods of Hyderabad. The link to the page is below.

https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Hyderabad

To get the data, let’s scrape this Wikipedia page. The Beautiful Soup Python package was used for web scraping. Inspecting the page shows that the required data sits in the “<li>” tags.

The Python requests library was used to fetch the page from the URL, and a Beautiful Soup object was created from the response. The parsed content was then filtered to extract the neighborhood names, which were finally stored in a pandas data frame.

The figures below show the web-scraping steps.

Web-scraping the Wikipedia page
Using a Beautiful Soup object on the data
Scraping the required data from the page
Storing the data in a data frame
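The scraping steps can be sketched roughly as follows. This is a minimal sketch and not the original notebook code: the function names and the User-Agent string are hypothetical, and taking every “<li>” tag is a simplification, since the real page also contains navigation and footer list items that would need to be filtered out.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Hyderabad"


def fetch_html(url=URL):
    """Download the raw HTML of the page (hypothetical helper, needs network)."""
    headers = {"User-Agent": "hyderabad-neighborhoods-demo/0.1"}  # placeholder value
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text


def parse_neighborhoods(html):
    """Collect the text of every <li> tag into a one-column data frame."""
    soup = BeautifulSoup(html, "html.parser")
    names = [li.get_text(strip=True) for li in soup.find_all("li")]
    # Drop empty entries and duplicates while keeping the page order.
    names = list(dict.fromkeys(name for name in names if name))
    return pd.DataFrame({"Neighborhood": names})


# Small inline sample so the parsing step can be tried without network access.
sample_html = "<ul><li>Ameerpet</li><li>Begumpet</li><li></li><li>Ameerpet</li></ul>"
sample_df = parse_neighborhoods(sample_html)
print(sample_df)
```

On the live page, the same parser would be called as parse_neighborhoods(fetch_html()), followed by whatever filtering is needed to keep only the neighborhood entries.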

Now that we have the neighborhood data, we need the location coordinates for all neighborhoods.

Get coordinates for neighborhoods:

The latitude and longitude coordinates of the neighborhoods were not available on the Wikipedia page. To obtain them, the Nominatim API of openstreetmap.org was used. A request containing an address is sent to the API, and the response contains the coordinates for that address in JSON format. Using this API, coordinates were obtained for the neighborhoods of Hyderabad. This is shown below.

response from the nominatim API

The coordinates are extracted from the response and stored in a pandas data frame, as shown below.

Latitude and Longitude coordinates
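A request to the Nominatim search endpoint might look like the sketch below. It is an illustration under stated assumptions rather than the project's actual code: the helper names, the query suffix “, Hyderabad, India”, and the User-Agent value are all hypothetical, and the sample response uses made-up coordinates.

```python
import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def parse_coordinates(results):
    """Return (lat, lon) as floats from a Nominatim JSON result list, or None if empty."""
    if not results:
        return None
    first = results[0]
    # Nominatim returns the coordinates as strings, so convert them.
    return float(first["lat"]), float(first["lon"])


def fetch_coordinates(neighborhood):
    """Query Nominatim for one neighborhood (makes a network request)."""
    params = {"q": f"{neighborhood}, Hyderabad, India", "format": "json", "limit": 1}
    headers = {"User-Agent": "hyderabad-neighborhoods-demo/0.1"}  # placeholder value
    response = requests.get(NOMINATIM_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return parse_coordinates(response.json())


# Offline illustration with a response shaped like the API output (made-up values).
fake_response = [{"lat": "17.4000", "lon": "78.5000", "display_name": "Example"}]
found = parse_coordinates(fake_response)
not_found = parse_coordinates([])
print(found, not_found)
```

Returning None for an empty result list makes it easy to collect the neighborhoods that still need a manual lookup.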

Data Cleaning: The challenge with coordinates data:

1. For some neighborhoods, the API did not return coordinates, for unknown reasons.

2. A few neighborhood names on the Wikipedia page are misspelled. This was discovered by manually searching for their coordinates on Google.

This challenge was overcome by looking up the coordinates of the remaining neighborhoods on a free website that returns the latitude and longitude of a requested address or place. The site allows only a limited number of searches per day.

Finally, after coordinates were obtained for all the neighborhoods, they were stored in a pandas data frame. The required data now sits in two data frames: one for neighborhood names and the other for coordinates. These two data frames are merged into the single data frame needed for the project.

The final data frame obtained after merging is shown below.

Neighborhoods data frame with coordinates
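The merge step could look something like this. The sketch assumes both data frames share a “Neighborhood” column, which the post does not state explicitly, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows purely for illustration; real values come from the earlier steps.
names_df = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
coords_df = pd.DataFrame(
    {
        "Neighborhood": ["A", "B"],
        "Latitude": [17.40, 17.45],
        "Longitude": [78.50, 78.45],
    }
)

# A left merge keeps every neighborhood, so missing coordinates show up as NaN.
hyd_data = names_df.merge(coords_df, on="Neighborhood", how="left")
missing = hyd_data[hyd_data["Latitude"].isna()]["Neighborhood"].tolist()
print(hyd_data)
print("Still missing coordinates:", missing)
```

Listing the rows with NaN coordinates is a quick way to see which neighborhoods still need a manual lookup before moving on.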

Geopy library to get coordinates of Hyderabad city:

using geopy library
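As a rough sketch, the city coordinates can be looked up with geopy's Nominatim geocoder. The helper names and the user_agent value are hypothetical, and the geopy import is placed inside the factory function only so that the lookup helper can be tried with a stand-in geocoder.

```python
def make_geolocator(user_agent="hyderabad_explorer"):
    """Create a geopy Nominatim geocoder (the user_agent is a placeholder)."""
    from geopy.geocoders import Nominatim  # imported lazily; requires geopy

    return Nominatim(user_agent=user_agent)


def city_coordinates(address, geolocator):
    """Return (latitude, longitude) for an address, or None if nothing is found."""
    location = geolocator.geocode(address)
    if location is None:
        return None
    return location.latitude, location.longitude
```

A call such as city_coordinates("Hyderabad, Telangana, India", make_geolocator()) would then return the pair used to center the map; it makes a network request, so it is not run here.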

Neighborhood location data using Foursquare API:

The Foursquare API was used in this project to get details of the venues near each neighborhood. To get data, a search query is sent to the API, and the response contains the requested data in JSON format.

response from Foursquare API

The API requires parameters such as CLIENT_ID, CLIENT_SECRET, and VERSION. The credentials can be obtained by creating an account on the Foursquare website. The response is filtered to extract the required data, which is finally stored in a data frame.

venues data using Foursquare API
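The sketch below shows one way the request parameters and the response parsing could be organized. It assumes the legacy Foursquare v2 venues/explore endpoint and its response layout, which the post does not name; the radius, limit, version date, column names, and sample payload are illustrative guesses, not values from the project.

```python
import pandas as pd

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # legacy v2 endpoint, assumed

# Placeholders: real values come from a Foursquare developer account.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20180605"  # API version date, assumed for illustration


def explore_params(lat, lng, radius=500, limit=100):
    """Build the query parameters for one neighborhood (radius in meters)."""
    return {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def venues_from_response(neighborhood, payload):
    """Flatten an explore-style JSON payload into one row per venue."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        rows.append(
            {
                "Neighborhood": neighborhood,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": venue["categories"][0]["name"],
            }
        )
    return pd.DataFrame(rows)


# Made-up payload with the assumed structure, so the parser can be tried offline.
sample_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {
                        "venue": {
                            "name": "Example Cafe",
                            "location": {"lat": 17.4, "lng": 78.5},
                            "categories": [{"name": "Cafe"}],
                        }
                    }
                ]
            }
        ]
    }
}
sample_venues = venues_from_response("A", sample_payload)
print(sample_venues)
```

With the requests library, the parameters would be passed as requests.get(EXPLORE_URL, params=explore_params(lat, lng)), and the per-neighborhood frames concatenated into hyderabad_venues.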

Methodology:

We now have a second data frame, named hyderabad_venues, that contains the venues of each neighborhood. Below is a screenshot of the data frame.

venues data frame

There are 1154 records in the venues data frame, covering 186 unique venue categories. We are interested in the neighborhoods with restaurants, so let’s find the top 10 most common venues in each neighborhood to get an idea of which neighborhoods have them. This is done by grouping the rows by neighborhood and taking the mean of the frequency of occurrence of each venue category. The result is stored in the sorted_venues data frame, shown below.

Top 10 venues in each neighborhood
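The grouping step can be illustrated on a tiny made-up sample. The column names and the helper function are assumptions; the idea is to one-hot encode the venue categories, average them per neighborhood to get frequencies, and then pick the highest-frequency categories in each row.

```python
import pandas as pd

# Tiny made-up sample standing in for hyderabad_venues.
hyderabad_venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B", "B", "B"],
        "Venue Category": ["Restaurant", "Restaurant", "ATM", "ATM", "ATM", "Cafe"],
    }
)

# One-hot encode the categories, then average per neighborhood to get frequencies.
onehot = pd.get_dummies(hyderabad_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", hyderabad_venues["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean().reset_index()


def top_venues(row, n):
    """Return the n category names with the highest frequency in one row."""
    frequencies = row.drop("Neighborhood").astype(float)
    return frequencies.sort_values(ascending=False).index[:n].tolist()


TOP_N = 2  # the project uses 10; 2 keeps the sample readable
columns = [f"Most Common Venue {i + 1}" for i in range(TOP_N)]
sorted_venues = pd.DataFrame(
    [top_venues(row, TOP_N) for _, row in grouped.iterrows()], columns=columns
)
sorted_venues.insert(0, "Neighborhood", grouped["Neighborhood"])
print(sorted_venues)
```

Setting TOP_N to 10 on the real data gives one row per neighborhood with its ten most frequent venue categories.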

Machine Learning Algorithm:

K Means Algorithm:

K Means is a clustering algorithm and one of the most popular and widely used. It takes a data set of items as input and partitions those items into groups called clusters. The algorithm is explained below.

1. Randomly pick ‘k’ data points from the data set as the initial cluster centroids. Here ‘k’ is the number of clusters.

2. Calculate the distance between every data point and every cluster centroid. Each data point is assigned to the centroid nearest to it, and the points assigned to the same centroid form a cluster.

3. Calculate the mean of the data points in each cluster and move that cluster’s centroid to this mean position.

Steps 2 and 3 are repeated until the centroids no longer move. At the end of this iterative process, we have our clusters.
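To make the steps concrete, here is a minimal NumPy sketch of the algorithm. It is for illustration only: the function name, the iteration cap, and the sample points are made up, and a real project would normally rely on a library implementation instead.

```python
import numpy as np


def k_means(points, k, max_iter=100, seed=0):
    """Plain K Means: returns (centroids, labels) for an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid with no points stays where it is).
        new_centroids = np.array(
            [
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ]
        )
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated made-up groups of 2-D points.
data = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
centroids, labels = k_means(data, k=2)
print(labels)
```

Which group ends up with label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.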

Choosing the number of clusters:

In some problems, we may have no idea how to choose ‘k’, the number of clusters. To find a good value, we use the elbow method. We select a predefined range of values for k, run the K Means algorithm for each value, and calculate the sum of squared distances of the data points to their cluster centroid. Plotting the sum of squared distances against the number of clusters (k) gives the plot shown below.

Elbow Plot

The plot is shaped like an elbow, hence the name elbow plot. We choose the k at which the curve bends, that is, where adding more clusters stops reducing the sum of squared distances by much. In the above plot, the bend is at k = 3, so the chosen value for k is 3.
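The elbow computation can be sketched with scikit-learn, whose KMeans estimator exposes the sum of squared distances as inertia_. The data here is synthetic with three obvious groups and only stands in for the venue frequencies; the range of k and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data with three obvious groups, standing in for the venue frequencies.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

k_values = range(1, 8)
sse = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    # inertia_ is the sum of squared distances of points to their nearest centroid.
    sse.append(model.inertia_)

for k, value in zip(k_values, sse):
    print(k, round(value, 2))
```

Plotting sse against k_values, for example with matplotlib, gives the elbow plot; on this synthetic data the curve flattens after k = 3.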

Plotting map of Hyderabad city with neighborhoods superimposed on top:

Using the folium library, a map of Hyderabad is created with the neighborhoods superimposed on top of it, as shown below.

Hyderabad Map with neighborhoods

The above map is plotted using the data in the hyd_data data frame. All the neighborhoods of the city are shown as blue circle markers.

Results:

1. We have the neighborhood names and their coordinates in the hyd_data data frame.

2. We also have the top 10 venues of each neighborhood in the sorted_venues data frame.

3. Using the K Means clustering algorithm, the data in the sorted_venues data frame is segmented into 3 clusters.

Using the above three items, we form the final_data data frame. It contains each neighborhood’s name, its latitude and longitude, its cluster label (the cluster the neighborhood belongs to), and its top 10 most common venues. The final_data data frame is shown in the picture below.

The Final Data Frame
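Putting the three pieces together might look like the following sketch. The inputs are made up, the “Cluster Label” column name is an assumption, and the inner merge is one possible choice for handling neighborhoods that have no venue data.

```python
import pandas as pd

# Made-up inputs standing in for hyd_data, sorted_venues, and the K Means labels.
hyd_data = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C"],
        "Latitude": [17.40, 17.45, 17.50],
        "Longitude": [78.50, 78.45, 78.40],
    }
)
sorted_venues = pd.DataFrame(
    {"Neighborhood": ["A", "B"], "Most Common Venue 1": ["Restaurant", "ATM"]}
)
cluster_labels = [0, 1]  # one label per row of sorted_venues

# Attach the cluster label next to the neighborhood name.
labelled = sorted_venues.copy()
labelled.insert(1, "Cluster Label", cluster_labels)

# An inner merge drops neighborhoods for which no venues were returned (here "C").
final_data = hyd_data.merge(labelled, on="Neighborhood", how="inner")
print(final_data)
```

The labels must be in the same row order as the data that was clustered, which is why they are inserted into sorted_venues before the merge rather than after it.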

Plotting the Results using folium:

Cluster Map

In the map, red circle markers denote cluster 1, blue circle markers denote cluster 2, and green circle markers denote cluster 3. The clusters are based on the mean frequency of occurrence of venue categories in each neighborhood, so neighborhoods with similar venue profiles end up together. We can verify the map by looking at the clustered final data frame, which is simply the final data frame split by cluster. As we have 3 clusters, it is split into 3 data frames. Let’s take a look at them.

Cluster 1 data frame:

cluster 1 data frame

Cluster 2 data frame:

cluster 2 data frame

Cluster 3 data frame:

cluster 3 data frame

Discussion on results:

In the second cluster, restaurants are the 4th and 9th most common venues, but the most common venue is ATMs. The situation is similar to the first cluster, so the same type of analysis can be applied here.

The third cluster is really interesting: restaurants are the most common venue in this cluster. This suggests there is a lot of demand, and also a lot of competition, for restaurants in cluster 3. Opening a restaurant in this cluster therefore looks good for business, but the outcome also depends on other factors such as the level of competition and the cost of opening a restaurant there. The final decision will be taken by the stakeholders.

Conclusion:

Finally, if you have read through my project, thank you for your time. If you have any suggestions, advice, or ideas, please do comment.

Veda Swaroop

I am a postgraduate in Electrical Engineering. I love science, technology, Linux, and computers in general. Machine learning and deep learning enthusiast.