Geography, clusters and clustering.

"What is a city, anyway?" asked CityMetric in 2014. It’s a very important question to us.

Wards, local authorities, combined authorities, primary urban areas, local enterprise partnerships, statistical regions, nations. They’re just some of the geographies that we deal with when using official statistics within the UK. The history and emotion captured in each of their definitions makes comparisons difficult.

Things get even more complicated when we talk with governments and communities. The definition of a place changes enormously depending on who you speak with and the context of the discussion. It’s great that people are as passionate about places as we are, but it makes things complicated.

We deal with this complexity in two ways: simplification and starting from the beginning.

Simplification: The French way.

In many countries, geography is simpler than in the UK. France, for example, has at various points in recent centuries thrown away old geographical definitions and created new, simpler ones. The organisation of the country into communes, agglomérations, métropoles, départements, and régions makes most data analysis, and the discussions that arise from it, far more straightforward.

Starting from the beginning: clustering

In the absence of similarly good geographies in the UK, we’ve developed an alternative: ignore existing geographies completely and use clustering algorithms to recreate them. The video below shows what that looks like in practice, with an example from our UK Tech Innovation Index 2.

With this approach, we don’t care which city an event happens in, which local enterprise partnership a business is in, or which statistical region a university publishes papers in. In fact, we throw away all of that data and, for each industry area that we’re investigating, we just consider the precise location of each event, business, or university and the links between them.

The strength of a link between two entities can be as simple as how close they are on a map. It might be the proportion of common attendees across two events, a marker that one business supplies another, that two businesses have won a common innovation grant, or that one has staff who’ve worked at the other.

These links are why there is a graph database at the core of The Data City. Their importance is clear from the fact that the graph database contains many times more links than entities.
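To make this concrete, here is a minimal sketch of the approach, not our production pipeline: entities become nodes in a graph, proximity and shared attendance become weighted links, and a standard community detection algorithm finds the clusters. The entity names, coordinates, and link weights are invented for illustration.

```python
# A toy version of link-based clustering: build a weighted graph of entities
# and let community detection recover the clusters, ignoring all existing
# geographies. All data here is made up.
import math
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical entities with precise locations (lat, lon).
entities = {
    "event_a": (53.80, -1.55),       # around Leeds
    "business_b": (53.79, -1.54),
    "university_c": (54.97, -1.61),  # around Newcastle
    "business_d": (54.90, -1.38),    # around Sunderland
}

G = nx.Graph()
G.add_nodes_from(entities)

# The simplest link strength: how close two entities are on a map.
for a, (lat1, lon1) in entities.items():
    for b, (lat2, lon2) in entities.items():
        if a < b:
            km = math.hypot(lat1 - lat2, lon1 - lon2) * 111  # rough degrees-to-km
            if km < 50:  # only link nearby entities
                G.add_edge(a, b, weight=1 / (1 + km))

# Richer links, such as the proportion of common attendees across two events,
# simply add or strengthen edges in the same graph.
G.add_edge("event_a", "business_b", weight=0.8)

for i, members in enumerate(greedy_modularity_communities(G, weight="weight")):
    print(f"cluster {i}: {sorted(members)}")
```

In production the links come from many more signals, but the principle is the same: the clusters emerge from the data rather than from any official boundary.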

The importance of clusters and the problem with naming them

Defining clusters in this way is valuable to our users, who are increasingly looking at data on clusters as an alternative to data about poorly and variably-defined cities. This is especially true in the UK, where clusters like the West Midlands automotive industry are spread across many cities and towns. The UK Government’s Industrial Strategy agrees:

“We must promote growth through fostering clusters and connectivity across cities, towns, and surrounding areas.” (Industrial Strategy White Paper, UK Government, 2017)

There is a tricky problem with clusters though. What do we call them?

Where clusters are fully-contained within a city, or the city limits are expanded to contain the whole cluster, it’s easy. Name the cluster after the city.

But many clusters extend well beyond a city, or spread across many cities. Some don’t include a city at all. In some cases clusters name themselves. Heavy industry in the Black Country in the UK and IT in Silicon Valley in California are good examples. But most of the time industrial clusters are unnamed.

Since we define clusters automatically, we also need an automatic way of naming clusters. In UK Tech Innovation Index 1 we just used the name of the largest city in the cluster. In UK Tech Innovation Index 2 we’re combining the names of large settlements within the cluster boundary. What do you think?

Should we call it Newcastle? Newcastle, Sunderland, Middlesbrough? Or split the cluster and push each place down our rankings?
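Here is one hedged sketch of how that combined naming could work, assuming we already know which settlements fall inside a cluster boundary and their populations. The thresholds and population figures below are illustrative only, not the ones we use.

```python
# A toy cluster-naming rule: combine the names of the largest settlements
# inside the cluster boundary, largest first.
def name_cluster(settlements, max_names=3, min_share=0.1):
    """settlements: list of (name, population) pairs inside the boundary.

    Settlements holding at least min_share of the cluster's population are
    included, up to max_names of them.
    """
    total = sum(pop for _, pop in settlements)
    ranked = sorted(settlements, key=lambda s: s[1], reverse=True)
    names = [name for name, pop in ranked if pop / total >= min_share]
    return ", ".join(names[:max_names])

# Invented populations for the cluster discussed above.
north_east = [("Newcastle", 300_000), ("Sunderland", 175_000),
              ("Middlesbrough", 140_000), ("Durham", 48_000)]
print(name_cluster(north_east))  # -> Newcastle, Sunderland, Middlesbrough
```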

Industrial sector classification using machine learning.

One of the most important features of The Data City workflow is the classification of businesses. This matters because of the limitations of SIC codes. SIC codes record the activities that every business in the UK declares, but they are poorly-suited to large companies that span many industrial sectors and to technology companies whose small niches of operation change frequently.

To better classify businesses in the UK we use machine-learning.

In a previous post we showed where machine-learning fits into our data processing pipeline. In this post I'll explain more about how that works. It's simpler than it sounds.


We start with the list of UK companies available as open data from Companies House. Searching for each company name on the internet usually finds a website for the business and we can collect every website for analysis. The difficult part is deciding what a company does.
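The collection step itself is simple. Here is a minimal sketch, assuming the website URL for each company has already been found by search; a real scraper would also need politeness delays, retries, and robots.txt handling. The company name and URL are placeholders.

```python
# Fetch a company website and extract its visible text for classification.
import requests
from bs4 import BeautifulSoup

def fetch_website_text(url):
    """Download a page and return its human-readable text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-visible content
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

companies = {"Example Widgets Ltd": "https://example.com"}
corpus = {name: fetch_website_text(url) for name, url in companies.items()}
```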

We could try to classify each business manually; it is usually quite easy for a human expert to tell what a business does just by looking at its website. We call this manual classification.

The obvious problem with manual classification is that with over a million UK businesses, it takes far too long. By the time industries are classified, new industrial sectors have sprung up.

The solution is to get a computer to classify industrial sectors.

Advances in machine-learning in the past decade mean that this is much easier than it once was. Unsupervised machine-learning algorithms can cluster websites into groups by similarity quite quickly.

An unsupervised machine-learning algorithm can identify similarities between websites and group them. It cannot usually name the groups, and will require guidance as to how many groupings to create.
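A minimal sketch of that unsupervised step, using TF-IDF vectors and k-means from scikit-learn as stand-ins for whatever runs in production. The website texts are invented, and note that the number of groups has to be supplied by hand:

```python
# Cluster websites into groups by similarity. The algorithm groups the sites
# but cannot name the groups.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

site_texts = [
    "smart sensors connected devices telemetry",
    "connected devices platform telemetry firmware",
    "artisan bakery sourdough bread pastries",
    "fresh bread pastries cakes wedding orders",
]

vectors = TfidfVectorizer().fit_transform(site_texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: two unnamed groups
```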

The problem with unsupervised classification is that the results lack context about what is being grouped. The algorithm might group companies by how optimistic their webpages sound, or by whether they use WordPress or Squarespace for hosting. Classification by industrial sector can easily be hidden. Even where it is visible, a grouping without a human-understandable name is nearly useless.

The solution to this problem is to manually classify a small number of websites and businesses, and then use this to train the machine-learning algorithm so that it can spot similar businesses. Since the groupings are named manually, they are meaningful to experts.

A supervised machine-learning algorithm uses a collection of websites that have been manually classified to learn how to classify a much larger set of websites. "Learning" can be as simple as setting parameters in an otherwise unsupervised algorithm. It can be much more complicated and iterative in cases where deep learning techniques are used.
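And a matching sketch of the supervised step, again with scikit-learn as a stand-in. A handful of manually classified websites train a model that then labels everything else; the class names are meaningful because human experts chose them. All texts and labels are invented.

```python
# Train on a small manually classified set, then classify the rest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled_texts = [
    "smart sensors connected devices telemetry",
    "connected devices platform telemetry firmware",
    "artisan bakery sourdough bread pastries",
    "fresh bread pastries cakes wedding orders",
]
labels = ["iot", "iot", "food", "food"]  # names chosen by human experts

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labelled_texts, labels)

# The trained model now classifies the much larger unlabelled set.
print(model.predict(["connected sensors analytics dashboard"]))  # likely ['iot']
```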

The power of this classification technique is increased enormously because of our approach to open data. When we released the first IoT UK Nation dataset as open data we received suggestions that some companies were missing. We received other feedback that some companies on the list were not in fact involved in IoT at all.

This was not a surprise to us; our machine-learning based approach isn't perfect and never will be. By incorporating this feedback, our classification model recalculated its predictions overnight. It added the companies that were missing, plus many similar companies that had previously been excluded from the list of IoT businesses. It also removed the companies that were wrongly classified as involved in IoT, along with some similar companies that were wrongly included even though no-one had explicitly alerted us to them.

In this way, by sharing our outputs openly and by continually checking our classifications, we improve the classification model and thus the quality of our classifications over time.
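In code, the feedback loop can be as simple as appending the corrections to the training set and retraining. This sketch continues from the supervised example above, with invented corrections:

```python
# User feedback becomes new training examples; overnight retraining then
# updates every prediction, moving similar businesses too.
corrections = [
    # a company users reported as missing from the IoT list
    ("industrial sensor gateways for factories", "iot"),
    # a company users reported as wrongly listed under IoT
    ("wholesale bread and pastry supplier", "food"),
]

for text, label in corrections:
    labelled_texts.append(text)
    labels.append(label)

model.fit(labelled_texts, labels)  # retrain on the corrected set
```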

It's a bit more complicated than that.

The diagrams and the explanation we've given above are simple, but they manage to cover the important parts about how our industrial classification system works. Additional complexities are added to support more languages, more types of industry, and to try and keep the industrial classifications reasonably stable over time.

The classification system is continually learning as more websites are scraped, and more manual classifications are added.

The diagrams above suggest that only the contents of the websites are used by our classification system, but we actually use much more information than this. We use company tweets, links from the company website, and links to the company website from elsewhere on the web, including LinkedIn and open lists of companies and public grant winners. These are all pieces of information that a human expert might miss, but that a computer can include when deciding how to classify a business.
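One simple way to fold these extra signals in is to stitch them into a single document per company before classification. A sketch with invented field names and content:

```python
# Merge website text, tweets and inbound link text into one document that the
# classifier can consume alongside, or instead of, the website text alone.
def company_document(company):
    parts = [
        company.get("website_text", ""),
        " ".join(company.get("tweets", [])),
        " ".join(company.get("inbound_link_text", [])),  # e.g. LinkedIn, grant lists
    ]
    return " ".join(p for p in parts if p)

acme = {
    "website_text": "connected devices platform",
    "tweets": ["Excited to demo our new sensor network at the trade show"],
    "inbound_link_text": ["Innovation grant winner: IoT hardware"],
}
print(company_document(acme))
```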

The final additional complexity for this blog post is how we use our classification model to classify more than just businesses. We use the same system to classify events from services like Open Tech Calendar and Eventbrite, as well as patents and scientific paper abstracts. In this way we have a single learned ontology of industrial classification that we can use to combine data from many data sources, each with errors and uncertainties, to understand the evolution of industrial clusters.
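Because events, patents, and paper abstracts are all just text, the same trained model can score them all. A sketch reusing the model from the supervised example above, with invented texts:

```python
# One learned ontology applied to several kinds of source.
event_description = "Hands-on workshop: building connected sensors and devices"
patent_abstract = "A method for telemetry between networked connected devices"

for text in (event_description, patent_abstract):
    print(model.predict([text])[0], "<-", text[:40])  # both likely 'iot'
```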

Our classifications are currently pretty good, but we know that they can get better. The most exciting part of The Data City is that the more data we collect, the more questions we answer, and the more manual corrections we make, the better our classifications get.

Simple, powerful, always on. Listening to users.

Since our first work in 2015 we've been talking with our users and listening to what they want. The three biggest asks we get are:

  • Simple. Three visuals with the biggest messages beat thirty visuals with every message, every time. Our users always ask us to focus on simplicity above volume.
  • Powerful. The potential to access raw data and ask new questions is valued. Most of the time it won't be used, but it should be possible.
  • Always on. Our users don't want reports. They're too long, and they're out of date by the time they read them.

This is why we build simple, small, powerful tools powered by data that is accessible to people who want to look more deeply. We answer specific questions that users ask, and we keep on giving them the answer in as simple a way as possible, powered by data that improves all the time.

Simplicity is hard. It takes attention to detail in design and a willingness to throw away good work that's not quite good enough. It's also very hard to define and judge, but we'll know we've failed if we ever produce a report with a figure like this in it.

The 138-page International comparative performance of the UK research base, 2016 is fantastic, but it's not the kind of thing we'll be emulating. Especially not this figure.

Personal data and The Data City

At The Data City we handle hundreds of gigabytes of data. The Information Commissioner’s Office (ICO) is clear in its data protection principles: if data contains identifiable names, it is personal data. Our data contains names.

Because we work with personal data, The Data City’s two data-controlling partner organisations, Bloom and imactivate, are registered with the ICO. We have processes to disclose and correct any personal data that we hold.

We take handling personal data correctly very seriously. In part, because it is the law and a condition of us doing business. But mostly because we don’t want to, or need to, reduce the privacy of anyone through our work.

By considering data protection early, we protect people’s privacy without reducing the power of our tools or the quality of our work. This is a passion for us. All of our co-founders also work at or with The Open Data Institute Leeds, pushing the idea that you can be good at business and good with data at the same time.

Sensitive data, private data, and public availability

The most important protections in The Data Protection Act and the ICO’s guidance on it concern two types of personal data: private data and sensitive data.

In summary, private data is data collected without the person it refers to expecting it to be made public. Sensitive data includes information such as race, criminal record, and health. Both are well-defined in the ICO’s Key Definitions.

Most of the most complex concerns around data protection do not apply to us, because we don’t work with private or sensitive data. In part this is because the data we collect is always publicly available.

When we collect the names of authors on scientific papers and patents, the data was already publicly available. When we collect the names of directors of companies, the data was already publicly available. And when we scrape company websites to understand what they do, the content was publicly available too. Public availability of data makes its collection and storage less likely to breach privacy: if it is public knowledge that a person is the director of a publicly listed company, holding that personal data does not compromise their privacy.

But just because all of the personal data that we use is publicly available doesn’t mean that we can ignore ICO guidance around private and sensitive personal data.

The ICO is clear that the nature of publicly available data can change when it is aggregated or linked with other data, and that making data public does not imply unlimited consent for its re-use. While each piece of data that we hold meets the ICO’s test of being publicly available and neither private nor sensitive, in combination it sometimes doesn’t.

As an example, if you attend a public event on Eventbrite you can see the other attendees. That is largely uncontentious. But if we collected data on every public event a person attended over several years and released it in a single dataset, it would no longer meet the ICO’s test of being non-sensitive and non-private.

There are many more areas where data that is publicly available may still be private or sensitive and deserve treatment accordingly. The ICO’s early work on “personal information online” considers social media posts in more detail and is a good guide to thinking about these challenges.

GDPR, Big Data, and what we do

Big data, machine-learning, and AI are all poorly-defined and mostly quite new. We probably do all three, and both public understanding and legislation are still evolving, and evolving quickly.

The ICO’s guidance for working with big data is comprehensive (especially the full 114-page version) and informs much of our approach. Although the graph database at the centre of The Data City contains only publicly-available personal data, we treat it as if it contained highly-sensitive and private data. It is stored in a secure location and has highly restricted access controls.

Our public tools are powered by highly-aggregated data releases, almost always available under an open license. Examples are The IoT UK Nation Database on Data Mill North and The UK Tech Innovation Index on GitHub.

Personal data is removed from our central graph database through aggregation before any release. In the public datasets there is no personal data at all.
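A minimal sketch of what that aggregation looks like, using pandas with invented column names and figures. Named directors never leave the graph database; only counts do:

```python
# Aggregate company-level records with personal data into counts that are
# safe to publish openly.
import pandas as pd

companies = pd.DataFrame({
    "company": ["Acme Sensors", "Widget IoT", "Bread Co"],
    "director": ["A. Person", "B. Person", "C. Person"],  # personal data
    "region": ["Leeds", "Leeds", "Newcastle"],
    "sector": ["iot", "iot", "food"],
})

public_release = (companies
                  .groupby(["region", "sector"])
                  .size()
                  .reset_index(name="company_count"))
print(public_release)  # no names, no personal data
```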

Licenses and privacy

In our next blog post we’ll explain how a similar approach to data management applies to managing licenses. Collecting and combining data with different licenses into a final dataset that can be released as open data doesn’t just happen by accident.

The Data City on tour

In my last blog post I introduced The Data City and what we’re building at the moment. Here I introduce our international strategy and what we’re doing later this year and in early 2018.

Because The Data City doesn’t rely heavily on formal national statistics we are able to expand to new countries easily. Language and subtle differences in economic structure are the only large barriers.

We know this because we’ve tested it. We already have large parts of The Data City working in Ireland and Scotland where national statistics are different to those in England & Wales. Our method is mostly unaffected. We can today provide comparable assessments of industrial strengths and potential for innovation in small niches of technology in Dublin, Belfast, Glasgow, Cardiff, and Leeds.

In December we are expanding to France. We’ll be starting in Leeds’ twin city of Lille, producing a version of The Data City for the Lille City Region. At the same time we’ll be updating our version of The Data City for Leeds so that both tools can be presented to both cities at the same time.

We’ve chosen France for three big reasons,

  1. France’s cities and national government have embraced open data, so we can easily access everything we need to expand our tool. We can publish our results easily too, on great data portals run by our friends at OpenDataSoft.
  2. French cities have strong regional governments and business groups, both of which have the power and money to invest in innovation.
  3. Paris is one of the world’s leading cities for artificial intelligence. Places like Station F host both start-ups and big companies like Microsoft.

It also helps that France is so close, that we speak French, and that Lille is so close to Belgium and The Netherlands.

We’ve already done a lot. Our early work has been featured by étalab, the French government’s digital services team. Our workflow for analysing scientific papers worked without change, since most papers today are published in English. And we’ve already shared a lot of the additional methods we’ve developed to use French datasets and compare them with English & Welsh ones.

Our work on IoT UK Nation provides a fantastic basis on which to compare the UK and France. Famously, France is strong in the Internet of Things, with nearly a third of exhibitors at this year’s CES in Las Vegas hailing from the Hexagon. Our initial work suggests that this excellence is widely spread and often deeply linked with local industries.

In Rennes, IoT businesses are linked with Orange and Télécom Bretagne, a leading university and research institute. In Toulouse, IoT businesses are linked to Airbus and Ariane, world-leading aviation and aerospace companies.

What we see in France is similar to what we’ve found in the UK. One example is the West Midlands, where the automotive industry plays host to world-leading companies in IoT that slip beneath the radar of many policy experts and investors.

IoT businesses in Toulouse. We are building the same, but better, for Lille.

We still have lots of work to do in France. For a start, all our machine-learning needs translating. Internet of Things is probably objets connectés, but we need to teach a machine that, and we’ll need French tech people to help us.
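One low-tech way to start that teaching, sketched below under heavy assumptions: seed each English label with hand-translated French keywords and treat matching sites as candidate training examples for French experts to check. The translations and site text are invented.

```python
# Bootstrap French training examples from hand-translated seed keywords.
seed_keywords = {
    "iot": ["objets connectés", "capteurs", "domotique"],
}

french_sites = {
    "https://example.fr": "Plateforme d'objets connectés et de capteurs industriels",
}

candidates = []
for url, text in french_sites.items():
    for label, keywords in seed_keywords.items():
        if any(kw in text.lower() for kw in keywords):
            candidates.append((url, text, label))  # to be checked by experts

print(candidates)
```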

If you’re in Lille, or if you can come to Lille, and you can help, please come and see us in December at EuraTechnologies and take us to a nice Flanders bar afterwards. The first drink’s on me.

Introducing The Data City

Our UK Tech Innovation Index gathered evidence of existing innovation, and of the potential for innovation, in seven areas of technology. Our IoT UK Nation Database identified companies, organisations, and people in a specific area of technology: the Internet of Things.

The Data City combines these two techniques and adds more data. We’ve got new data on patents, data from more events services, and data on both demand and supply of skills.

Here’s a diagram.

How data is used within The Data City.

The Data City is always on, collecting new data every minute of every day, so that our answers are always improving. It improves in three further ways,

  1. We add new data-sources. Recently we added house price changes.
  2. Our machine-learning finds new entities (locations, companies, institutions, technologies) that it can use to link data.
  3. We ask our model new questions.

This is where our customers come in, and it’s why our business model is unique.

We’re tired of writing reports that few people read and that are out of date by the time they’re published. So we build tools that answer questions and keep answering them into the future. And instead of selling the same intelligence over and over, we use all of our data to answer all of our questions. This means that the more customers we have, the better the answers to everyone’s questions. And we share many of our answers to individual questions as open data, so that as more people ask more questions, everyone gets more value.

We know that some customers are wary of our radically open approach and we understand that some will not want to share everything. That’s okay. Talk to us and we’ll figure something out.