# Population Distribution In World Cities

# 1. Introduction

The World Cities Database is a small data set from Kaggle that provides an up-to-date list of the world’s cities and towns. I use it for the Udacity Data Scientist course because it is simple yet comprehensive enough to support interesting analysis.

With this database, I am interested in the following questions:

- Is population evenly distributed in world cities?
- If population is not evenly distributed, what kinds of cities attract the most people?
- Cities in different countries may show different patterns of population distribution. For example, in some countries people may prefer to live in a very few cities, while other countries see their population spread more evenly among their cities. How unequally are people distributed in different countries?
- Can we predict a city’s population from the variables in the database?

Answering these questions well can have direct business value. For example, when a multinational company plans to set up a new office in a country, city population is one of the many factors that determine where the office is located. If we know that people in a certain country are evenly distributed across its cities, then many cities can be candidates. A city’s population may also be unknown for some reason. In that case, we need a simple way to estimate it from basic information about the city, such as its administrative level, longitude and latitude.

Before answering these four questions, we first perform an exploratory data analysis to get a general idea of what the database contains. After that, we analyze each question in depth using either descriptive statistics or machine learning techniques. Finally, we draw our conclusions.

# 2. Data Exploration

Let’s first explore this data set. The database is composed of 11 columns containing city information such as its name, latitude and longitude, population, the country it belongs to, the name of its highest-level administrative region, whether or not it is a capital of some level, and so on. Let’s look at one record to understand its contents:

`{'admin_name': {4568: 'Brussels-Capital Region'}, 'capital': {4568: 'primary'}, 'city': {4568: 'Brussels'}, 'city_ascii': {4568: 'Brussels'}, 'country': {4568: 'Belgium'}, 'id': {4568: 1056469830}, 'iso2': {4568: 'BE'}, 'iso3': {4568: 'BEL'}, 'lat': {4568: 50.8333}, 'lng': {4568: 4.3333}, 'population': {4568: 1743000.0}}`

Then, let’s get an overview of the whole database using Pandas’ `info()` function:

`<class 'pandas.core.frame.DataFrame'>`

RangeIndex: 12959 entries, 0 to 12958. Data columns (total 11 columns):

| Column | Non-null count | Dtype |
| --- | --- | --- |
| city | 12959 | object |
| city_ascii | 12959 | object |
| lat | 12959 | float64 |
| lng | 12959 | float64 |
| country | 12959 | object |
| iso2 | 12928 | object |
| iso3 | 12959 | object |
| admin_name | 12750 | object |
| capital | 5180 | object |
| population | 11292 | float64 |
| id | 12959 | int64 |

dtypes: float64(3), int64(1), object(7). Memory usage: 1.1+ MB

# 3. Population distribution

# 3.1 Is population evenly distributed in world cities?

The database provides precise population information for the cities. My expectation is that population is not evenly distributed among them. From the news as well as my personal experience, people are eager to live in big cities because big cities provide more opportunities. Does the data support this?

Let’s first visualize it.

The visualization seems to support our assumption. On the map, the larger a city’s population, the larger its dot. It is clear that cities with huge populations exist all over the world, and that these super cities are surrounded by cities with smaller populations. The histogram of city populations confirms this:

As we can see from the histogram, cities with small populations are the majority.
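One way to check this skew without a plot is to count cities per order of magnitude of population. The sketch below is only an illustration: the helper `magnitude_histogram` and the sample populations are hypothetical and not taken from the data set.

```python
import math
from collections import Counter


def magnitude_histogram(populations):
    """Count cities per power-of-ten bucket: 10^k <= population < 10^(k+1)."""
    counts = Counter()
    for p in populations:
        # Skip missing, NaN, zero or negative values.
        if p is None or not (p > 0):
            continue
        counts[int(math.floor(math.log10(p)))] += 1
    return dict(sorted(counts.items()))


# Hypothetical populations, for illustration only.
sample = [1_200, 3_500, 8_000, 15_000, 22_000, 40_000,
          75_000, 310_000, 1_743_000, 12_000_000]

hist = magnitude_histogram(sample)
for exponent, n_cities in hist.items():
    print(f"10^{exponent} to 10^{exponent + 1}: {n_cities} cities")
```

Bucketing on a logarithmic scale keeps the few very large cities from squeezing all the small ones into a single bar, which is what tends to happen with equal-width bins.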

# 3.2 Do all countries follow the same pattern of unbalanced city populations?

I ask this question because I do not like the idea of living in mass-population cities. It is true that they may provide more opportunities, but they also bring many problems such as traffic jams, pollution and health issues. Can we find countries where most people live in moderate-population cities?

To answer this question, we first need a measure of how a country’s population is distributed among its cities. The measure I select is the Gini coefficient, which is derived from the population histogram. If people are evenly distributed across all the cities of a country, its Gini coefficient is 0. The more uneven the distribution, the larger the Gini coefficient.

When calculating each country’s Gini coefficient, we deliberately ignore countries that have only one city in the database, because the comparison does not make sense for a single city. As a result, around 20% of the countries are excluded from our discussion.
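The article does not show how the Gini coefficient was computed, so the following is only a sketch of one common formulation, calculated directly from the sorted city populations of each country. The function names `gini` and `gini_by_country`, the `min_cities` parameter and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(v for v in values if v is not None and v >= 0)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def gini_by_country(rows, min_cities=2):
    """rows: iterable of (country, population) pairs.

    Countries with fewer than `min_cities` known populations are skipped.
    """
    groups = defaultdict(list)
    for country, population in rows:
        if population is not None:
            groups[country].append(population)
    return {c: gini(p) for c, p in groups.items() if len(p) >= min_cities}


# Hypothetical toy rows, not real data.
rows = [("A", 100), ("A", 100), ("A", 100),   # perfectly even
        ("B", 10), ("B", 10), ("B", 980),     # dominated by one city
        ("C", 500)]                           # single city, excluded

scores = gini_by_country(rows)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Sorting the resulting dictionary by value, as in `ranked`, gives the kind of country ranking discussed in this section.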

Unfortunately, the histogram of Gini coefficients shows that unbalanced population distribution prevails in most countries.

We then rank all the countries by their Gini coefficients. Canada, Australia, Angola, Bolivia and Paraguay are the five countries where the population distribution is most unbalanced, while Palau, Cabo Verde, Trinidad And Tobago, Mauritius and Slovenia are the five countries where it is most balanced.

# 3.3 Which factors determine a city’s population?

Looking at the columns in the database, we may ask ourselves: which columns play a critical role in determining a city’s population? Let’s re-check the columns we have.

`Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3', 'admin_name', 'capital', 'population', 'id'], dtype='object')`

Among all the columns, our intuition tells us that `capital`, `lat` and `lng` may determine a city's population:

- `capital`: the administration level of the city. Our intuition is that the higher the city's administration level is, the larger the population we can expect.
- `lat` and `lng`: the latitude and longitude, respectively. They may play a role in determining the city's population. A simple example is China, where cities in the east have large populations while cities in the west have small populations.

The above analysis is from our intuition, but will data in the database support it? In order to answer this question, we decide to set up a multi-variable linear regression model. The steps of setting up the model include:

**(1) Data cleaning**

The purpose of data cleaning is to identify incorrect values and then fix them. In our case, we use a very simple data cleaning method: remove every row that contains at least one NaN value.
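As a minimal sketch of this cleaning step, `DataFrame.dropna()` with its default arguments removes every row that has at least one missing value. The small data frame below is hypothetical and only mimics a few of the data set's columns; `clean_df` follows the name used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic some of the data set's columns.
df = pd.DataFrame({
    "city": ["Brussels", "Townville", "Hamlet", "Portside"],
    "lat": [50.8333, 10.0, -5.0, 33.0],
    "lng": [4.3333, 20.0, 30.0, -70.0],
    "capital": ["primary", "admin", np.nan, "minor"],
    "population": [1743000.0, 50000.0, 1200.0, np.nan],
})

# Drop every row containing at least one NaN, then renumber the index.
clean_df = df.dropna().reset_index(drop=True)

print(f"kept {len(clean_df)} of {len(df)} rows")
```

Note the cost of this simple approach: according to the `info()` output, `capital` is non-null in only 5180 of the 12959 rows, so more than half of the cities are discarded.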

**(2) Transform categorical variables into numerical variables**

If we check the data types of the cleaned data frame `clean_df`, we will notice that the `capital` column is not numerical. It is in essence a categorical variable and should be transformed into a numerical form so that we can use it in the linear regression model.

We use dummy coding for the transformation, which replaces the variable with several 0/1 variables, one for each level of the original variable. We think this is a reasonable method in the population estimation context because, after the transformation, the `capital` variable becomes three variables: `capital_admin`, `capital_minor` and `capital_primary`. These variables carry important information about the factors that determine population.
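Dummy coding of this kind can be sketched with `pandas.get_dummies`. The rows below are hypothetical; the column names produced by the `capital` prefix match the three variables named above.

```python
import pandas as pd

# Hypothetical cleaned rows; only the column names follow the data set.
clean_df = pd.DataFrame({
    "lat": [50.8333, 10.0, -5.0],
    "lng": [4.3333, 20.0, 30.0],
    "capital": ["primary", "admin", "minor"],
    "population": [1743000.0, 50000.0, 12000.0],
})

# Replace `capital` with one 0/1 column per level.
encoded_df = pd.get_dummies(
    clean_df, columns=["capital"], prefix="capital", dtype=int
)

print(encoded_df.columns.tolist())
```

In every row exactly one of the three dummy columns is 1. When the model also fits an intercept, this makes the three columns redundant with it, so a common alternative is to pass `drop_first=True` and treat the dropped level as the baseline.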

**(3) Independent variable normalization**

Normalization is needed for the variables `lat` and `lng`. Since latitude ranges from -90 to 90 and longitude ranges from -180 to 180, we use the following formulas to calculate the normalized `lat` and `lng` variables:

`lat_normlized = (lat+90.0)/180`

`lng_normlized = (lng+180)/360`

In the literature, this normalization method is often called min-max normalization.
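The two formulas translate directly into code. In this short sketch the helper names `normalize_lat` and `normalize_lng` are hypothetical, while the result names follow the spelling used in the formulas above; the coordinates are those of the Brussels record shown earlier.

```python
def normalize_lat(lat):
    """Map latitude from [-90, 90] to [0, 1]."""
    return (lat + 90.0) / 180.0


def normalize_lng(lng):
    """Map longitude from [-180, 180] to [0, 1]."""
    return (lng + 180.0) / 360.0


# Brussels, from the sample record in the data exploration section.
lat_normlized = normalize_lat(50.8333)
lng_normlized = normalize_lng(4.3333)

print(lat_normlized, lng_normlized)
```

Here the minimum and maximum are the theoretical bounds of the coordinates rather than the values observed in the data, so the result does not depend on which cities happen to be in the database.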

**(4) Linear regression model setup**

We decide to use the `LinearRegression` model provided by the `sklearn` library to build the linear regression model.
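A minimal sketch of the fitting step is shown below. It runs on synthetic data, because the article's actual feature matrix is not reproduced here: the sample size, the noise level and the "true" effect of `capital_primary` are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the normalized coordinates (assumed values).
lat = rng.uniform(0.0, 1.0, n)
lng = rng.uniform(0.0, 1.0, n)

# Random administration level: 0 = admin, 1 = minor, 2 = primary.
level = rng.integers(0, 3, n)
dummies = np.eye(3)[level]  # one-hot rows, one column per level

feature_names = ["lat_normlized", "lng_normlized",
                 "capital_admin", "capital_minor", "capital_primary"]
X = np.column_stack([lat, lng, dummies])

# Invented relationship: primary capitals get 2,000,000 extra inhabitants.
y = (100_000 + 50_000 * lat + 2_000_000 * dummies[:, 2]
     + rng.normal(0.0, 10_000, n))

model = LinearRegression()
model.fit(X, y)

coefs = dict(zip(feature_names, model.coef_))
for name, value in coefs.items():
    print(f"{name}: {value:,.0f}")
```

Because the three dummy columns always sum to one and the model also fits an intercept, only the differences between the dummy coefficients are uniquely determined. It is therefore safer to compare them with each other than to interpret the sign of any single one.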

**(5) Linear regression analysis**

Now let’s analyze the linear regression result. First, let’s check the model’s root mean square error (RMSE) and R2 statistics.
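Both statistics can be computed with `sklearn.metrics`. The sketch below uses small hypothetical arrays instead of the article's predictions, only to show the calls involved.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# RMSE is the square root of the mean squared error.
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

# R2 is one minus the ratio of residual to total sum of squares.
r2 = float(r2_score(y_true, y_pred))

print(f"RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```

RMSE is expressed in the same unit as the target (here, number of inhabitants), whereas R2 is unitless and reports the share of variance the model explains.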

From the statistics, it seems that `lat`, `lng`, `capital_admin`, `capital_minor` and `capital_primary` are not enough to predict a city's population: the R2 value is 0.0947, which is low for a linear regression model. Even so, we can still observe that the city's administration level plays a role in determining the population. The coefficient for `capital_primary` is a large positive number, while the coefficients for both `capital_admin` and `capital_minor` are negative. Moreover, the coefficient for `capital_minor` has a larger absolute value than that for `capital_admin`. This suggests that the higher a city's administration level is, the larger its population tends to be.

# 4. Conclusion

Take-away messages from the analysis are as follows:

- Population distribution is unbalanced all over the world in general, and this trend can also be observed in individual countries.
- In general, people who choose to live in cities prefer to live in a very few mass-population ones.
- It is very difficult to predict a city’s population based only on the information provided in the database. We tried to predict the population from the city’s latitude, longitude and administration level using a linear least squares model. The model’s predictions are not convincing, but we still found that population is related to the city’s administration level.