Geography, clusters, and clustering

“What is a city, anyway?” asked CityMetric in 2014. It’s a very important question to us. Wards, local authorities, combined authorities, primary urban areas, local enterprise partnerships, statistical regions, nations. They’re just some of the geographies that we deal with when using official statistics within the UK. The history and emotion captured in each of their definitions make comparisons difficult.

Things get even more complicated when we talk with governments and communities. The definition of a place changes enormously depending on who you speak with and the context of the discussion. It’s great that people are as passionate about places as we are, but it makes things complicated.

We deal with this complexity in two ways: simplification and starting from the beginning.

Simplification: The French way.

In many countries, geography is simpler than in the UK. For example, France has at various points in recent centuries thrown away old geographical definitions and created new, simpler ones. The organisation of the country into communes, agglomerations, metropoles, départements, and régions makes most data analysis, and the discussions that arise from it, simpler than in the UK.

Starting from the beginning: clustering

In the absence of similarly good geographies in the UK we’ve developed an alternative: ignore existing geographies completely and use clustering algorithms to recreate them. The video below shows what that looks like in practice, with an example from our UK Tech Innovation index 2.

With this approach, we don’t care which city an event happens in, which local enterprise partnership a business is in, or which statistical region a university publishes papers in. In fact, we throw away all of that data and, for each industry area that we’re investigating, we just consider the precise location of each event, business, or university and the links between them.

The strength of a link between two entities can be as simple as how close they are on a map. It might also be the proportion of attendees that two events have in common, or a marker that one business supplies another, has won the same innovation grant, or has staff who’ve worked at the other business.
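As a rough illustration of two of these link types, here is a minimal Python sketch. It is not The Data City’s actual implementation: the function names, the 10 km distance scale, and the choice of the Jaccard index as the “proportion of common attendees” are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def proximity_strength(a, b, scale_km=10.0):
    """Hypothetical scoring: 1.0 at the same spot, decaying towards 0 with distance."""
    return math.exp(-haversine_km(a, b) / scale_km)


def shared_attendee_strength(attendees_a, attendees_b):
    """Share of attendees common to both events (Jaccard index, an assumed choice)."""
    a, b = set(attendees_a), set(attendees_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Both functions return a value between 0 and 1, so different kinds of link could in principle be compared or combined. How they should be weighted against each other is a separate design decision that this sketch leaves open.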

These links are why there is a graph database at the core of The Data City. The importance of the links is clear in the fact that the graph database has many times more links than entities.
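To make the idea of clustering on links concrete, here is a small self-contained sketch. The post does not say which clustering algorithm is used, so this swaps in the simplest stand-in: keep only links above a strength threshold and treat each connected group as a cluster. The entity names, link strengths, and the 0.5 threshold are made up for illustration.

```python
from collections import defaultdict


def cluster_by_links(entities, links, min_strength=0.5):
    """Group entities joined by links of at least `min_strength` (connected components)."""
    parent = {e: e for e in entities}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, strength in links:
        if strength >= min_strength:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for e in entities:
        groups[find(e)].add(e)
    # Largest clusters first; ties broken alphabetically for stable output.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Hypothetical data: five entities, six weighted links (more links than entities).
entities = ["A", "B", "C", "D", "E"]
links = [
    ("A", "B", 0.9), ("B", "C", 0.7), ("A", "C", 0.6),
    ("C", "D", 0.2), ("D", "E", 0.8), ("A", "D", 0.1),
]
clusters = cluster_by_links(entities, links)
```

With this toy data the weak links between C and D and between A and D are ignored, so the result is two clusters: A, B, C and D, E. Note that no city boundary appears anywhere in the input.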

The importance of clusters and the problem with naming them

Defining clusters in this way is valuable to our users, who are increasingly looking at data on clusters as an alternative to data about poorly and variably-defined cities. This is especially true in the UK, where clusters like the West Midlands automotive industry are spread across many cities and towns. The UK Government’s Industrial Strategy says:

We must promote growth through fostering clusters and connectivity across cities, towns, and surrounding areas. — Industrial Strategy White Paper, UK Government, 2017

There is a tricky problem with clusters though. What do we call them?

Where clusters are fully contained within a city, or the city limits are expanded to contain the whole cluster, it’s easy: name the cluster after the city.

But many clusters extend well beyond a city, or spread across many cities. Some don’t include a city at all. In some cases clusters name themselves. Heavy industry in the Black Country in the UK and IT in Silicon Valley in California are good examples. But most of the time industrial clusters are unnamed.

Since we define clusters automatically, we also need an automatic way of naming clusters. In UK Tech Innovation Index 1 we just used the name of the largest city in the cluster. In UK Tech Innovation Index 2 we’re combining the names of large settlements within the cluster boundary. What do you think?

Should we call it Newcastle? Newcastle, Sunderland, Middlesbrough? Or split the cluster and push each place down our rankings?
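The two naming rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind either index: the function names, the `min_size` cut-off for a “large” settlement, the cap of three names, and the size figures are all hypothetical.

```python
def name_by_largest(settlements):
    """Index 1 style: use the name of the largest settlement in the cluster."""
    return max(settlements, key=lambda s: s[1])[0]


def name_by_large_settlements(settlements, min_size=50, max_names=3):
    """Index 2 style: join the names of the large settlements, biggest first."""
    large = sorted(
        (s for s in settlements if s[1] >= min_size),
        key=lambda s: -s[1],
    )
    if not large:
        # No settlement passes the cut-off, so fall back to the largest one.
        return name_by_largest(settlements)
    return ", ".join(name for name, _ in large[:max_names])


# Sizes are made-up illustrative numbers, not official statistics.
settlements = [
    ("Sunderland", 70),
    ("Newcastle", 120),
    ("Middlesbrough", 55),
    ("Durham", 12),
]

short_name = name_by_largest(settlements)
long_name = name_by_large_settlements(settlements)
```

With these invented sizes, the first rule gives “Newcastle” and the second gives “Newcastle, Sunderland, Middlesbrough”. The cut-off and the cap on the number of names decide how unwieldy the combined name becomes, which is exactly the trade-off the question above is asking about.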
