Counting people is hard


My most-played quizzes all involve the largest cities of Europe. Until recently, they all had the same problem: a lack of accurate data. In fact, a lot of the data points were arbitrary. I pulled data together from different sources, removed any outliers and continued until I had a satisfactory distribution of cities across all countries in Europe. It makes for a great quiz, but it doesn't feel right to do this. Quizzes here are supposed to be backed by facts, not by opinions.

Solving that problem proved quite challenging. I admire the folks at citypopulation.de, which is the go-to source for all city-related data on JetPunk, for putting it all together. There are wildly diverging sources all over the place, each with different figures. Finding a consistent method that gives you reliable data is not an easy task. Even they have to constantly revise their method and redraw boundaries to include more or fewer people.

The problem: illustrated by Eurostat

The source I used previously was Eurostat, which is compiled by the European Union based on data from its member states. It uses the concept of the Large Urban Zone, which is different from the administrative area typically used in census data. That's useful, because it's usually a better reflection of how many people live in and around an urban area.

Unfortunately, while the data is reliable, it is not accurate. The figures vary wildly between member states. They typically overestimate how large an urban area is, sometimes by a huge amount. For example, Rzeszów, Poland is listed as having 506,965 people in its urban area, but in order to get to that number, you need to include all administrative areas within a radius of at least 40 km. That's way too much to be a functional definition of an urban area.

That's the problem with Europe: under the surface, there are 50 countries each doing their own thing.

Creating a method: The good

For me, the most important thing a method should do is line up with the major city data on citypopulation.de. They have a very useful list of all cities with more than 1 million people, and any method I use should produce figures consistent with theirs. That's necessary because there are only 78 cities in Europe with more than 1 million people on their list. I still need another 122 to complete my quiz.

Now the fortunate thing is that citypopulation.de is German, and Germany is in Europe. And they do extend their method to German cities below 1 million. The best part is that they also list how they established that data: "The agglomerations consist of urban areas with at least 1,000 inhabitants [...] and a general maximum distance of 1 km between them (in exceptions a distance of 2 km is allowed)."

Similar data is also available for the United Kingdom and Switzerland, and it seems to align perfectly with the major city data.

Extending a method: The bad

Unfortunately, that's where the easy methods end. When I get to the rest of Europe, there is no clean urban agglomeration data available.

France has some nice data on urban agglomerations, but when you look at the layout on the map, you start to notice some problems. For example, the urban agglomeration of Le Mans has a hole in it around the village of Mulsanne. Urban agglomerations should not have holes; those holes should be part of the agglomeration. So I've got some work to do. The work is simple enough: just add the population of the hole to the total.

I can apply this same method to any dataset with urban area data. If there are urban areas that are close enough to the city, I should add them to the total. After all, that's part of the method citypopulation.de uses. Urban area data is a lot more common than urban agglomeration data, so this method now works for most countries, and it's consistent with the data I already have.
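To make the "close enough" rule concrete, here is a minimal Python sketch of how I think of it. It is only an illustration of the citypopulation.de rule quoted above (at least 1,000 inhabitants, at most 1 km between areas), not their actual implementation. The function name, the place names, the populations and the gap distances are all made up for the example; in practice the gaps would have to be measured on a map.

```python
from collections import deque


def agglomeration_population(core, areas, gaps, min_pop=1000, max_gap_km=1.0):
    """Sum the population of every urban area chained to `core`.

    areas: dict mapping area name -> population
    gaps:  dict mapping frozenset({a, b}) -> edge-to-edge distance in km
           (hypothetical values, e.g. measured by hand on a map)
    """
    seen = {core}
    queue = deque([core])
    while queue:
        current = queue.popleft()
        for pair, gap in gaps.items():
            # Only follow links that touch the current area and are short enough.
            if current not in pair or gap > max_gap_km:
                continue
            (other,) = pair - {current}
            # Skip areas below the population threshold or already counted.
            if other not in seen and areas[other] >= min_pop:
                seen.add(other)
                queue.append(other)
    return sum(areas[name] for name in seen), seen


# Made-up example data, not real figures.
areas = {"City": 150_000, "Suburb A": 12_000, "Suburb B": 4_000,
         "Hamlet": 600, "Far Town": 9_000}
gaps = {
    frozenset({"City", "Suburb A"}): 0.4,
    frozenset({"Suburb A", "Suburb B"}): 0.9,
    frozenset({"City", "Hamlet"}): 0.3,
    frozenset({"City", "Far Town"}): 5.0,
}

total, members = agglomeration_population("City", areas, gaps)
print(total, sorted(members))
```

Note that Suburb B is counted even though it only touches Suburb A: areas are chained together, so being within 1 km of any area already in the agglomeration is enough. Passing max_gap_km=2.0 would mimic the exception mentioned in the quote.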

Sometimes the urban area data is a little out of date. The data for Albania happens to be 9 years old, so I had to correct for that by multiplying by the expected growth rate. Not too difficult, but extra work nonetheless. I can infer that citypopulation.de does this as well, because their data on major cities updates every year even when no new census data becomes available.
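As a rough illustration of that correction, here is a small Python sketch. It assumes the growth compounds yearly at a constant rate, which is my simplification for the example and not necessarily what citypopulation.de does. The function name and the numbers are placeholders, not real Albanian figures.

```python
def project_population(population, annual_growth_rate, years):
    """Project an old population figure forward.

    Assumes a constant yearly growth rate that compounds,
    e.g. 0.01 means 1% growth per year (negative for decline).
    """
    return round(population * (1 + annual_growth_rate) ** years)


# Placeholder numbers: 100,000 people, 1% yearly growth, 9-year-old data.
projected = project_population(100_000, 0.01, 9)
print(projected)  # 109369
```

A shrinking city works the same way with a negative rate.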

Making guesses: The ugly

Of course there are always countries that don't play nice. Italy and the Netherlands caused me some headaches. Italy only has data on administrative areas, and the Netherlands has such a narrow definition of urban area that it was unusable. There was only one option left: grab the latest census data, visualize the area on Google Maps and start counting. This was further complicated by my inability to read Italian (Bulgaria was an even bigger challenge for this reason). Nevertheless, after sorting through some Excel sheets and doing some number-crunching, I got results that I liked and that had at least some basis in facts.

I sincerely hope that at some point a dataset becomes available that is easier to use.

Oct 1, 2020
I COMPLETELY agree with you. There is really no reliable data source, hence it's also really difficult to make an accurate dataset for a quiz. No matter what site you use, there are always flaws. There is, however, a website that contains many datasets in many forms that are freely available to use: the United Nations. Unfortunately, the UN doesn't have enough datasets, but they do have quite a lot of historical data, which is good when making quizzes relating to past years. This is also why it's hard to define the largest city in the world, or in any individual country.
Oct 13, 2020
The UN also has some data other than population figures, though.