One of the most important features of The Data City workflow is the classification of businesses. This is important because of the limitations of SIC codes. SIC codes define what activities every business in the UK performs, but they are poorly suited to large companies that span many industrial sectors and technology companies whose small niches of operation change frequently.
To better classify businesses in the UK we use machine-learning. In a previous post we showed where machine-learning fits into our data processing pipeline. In this post I’ll explain more about how that works. It’s simpler than it sounds.
We start with the list of UK companies available as open data from Companies House. Searching for each company name on the internet usually finds a website for the business and we can collect every website for analysis. The difficult part is deciding what a company does.
We could try to classify each business manually; it is usually quite easy to tell what a business does just from its website. We call this manual classification.
The obvious problem with manual classification is that with over a million UK businesses, it takes far too long. By the time industries are classified, new industrial sectors have sprung up.
The solution is to get a computer to classify industrial sectors.
Advances in machine-learning in the past decade mean that this is much easier than it once was. Unsupervised machine-learning algorithms can cluster websites into groups of similar sites quite quickly.
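To make the idea concrete, here is a minimal sketch of unsupervised clustering in Python. It is an illustration rather than the production pipeline: the website snippets are made up, and the `leader_cluster` function, its `threshold` value and the plain word-count features are all assumptions chosen to keep the example short. A real system would more likely use a dedicated clustering library.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count its words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def leader_cluster(texts, threshold=0.3):
    """Put each text in the first cluster whose first member is similar
    enough, otherwise start a new cluster. Returns lists of indices."""
    vectors = [bag_of_words(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # nothing similar yet: new cluster
    return clusters

# Made-up website snippets, for illustration only.
websites = [
    "smart sensors for connected homes and buildings",
    "connected sensors and smart meters for buildings",
    "artisan bakery selling bread and cakes",
    "fresh bread cakes and pastries from our bakery",
]
clusters = leader_cluster(websites)
print(clusters)
```

The output is just groups of indices, here `[[0, 1], [2, 3]]`. Nothing in it says which group is which.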
The problem with unsupervised classification is that the results lack context about what is being grouped. Techniques might group companies by how optimistic their webpage is, or whether they use WordPress or Squarespace for hosting. The grouping by industrial sector that we want can easily be hidden among these. Even when it is visible, a grouping without human-understandable names is nearly useless.
The solution to this problem is to manually classify a small number of websites and businesses, and then use this to train the machine-learning algorithm so that it can spot similar businesses. Since the groupings are named manually, they are meaningful to experts.
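A minimal sketch of this supervised step, assuming a nearest-centroid classifier over word counts. That technique is a stand-in picked for readability, not a statement of which algorithm The Data City uses. The labels, the example texts and the `train` and `classify` helpers are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def bag_of_words(text):
    """Lower-case the text and count its words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labelled):
    """Sum the word counts of every example that shares a label."""
    centroids = defaultdict(Counter)
    for text, label in labelled:
        centroids[label].update(bag_of_words(text))
    return dict(centroids)

def classify(model, text):
    """Return the label whose summed word counts are most similar."""
    vec = bag_of_words(text)
    return max(model, key=lambda label: cosine(vec, model[label]))

# A handful of manually classified (made-up) websites.
labelled = [
    ("smart sensors for connected homes", "IoT"),
    ("connected devices and smart meters", "IoT"),
    ("artisan bakery selling bread and cakes", "Food"),
    ("fresh pastries and cakes baked daily", "Food"),
]
model = train(labelled)
prediction = classify(model, "wireless sensors for smart factories")
print(prediction)  # IoT
```

Because the labels come from the manual examples, every prediction carries a name that a person chose.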
The power of this classification technique is increased enormously because of our approach to open data. When we released the first IoT UK Nation dataset as open data we received suggestions that some companies were missing. We received other feedback that some companies on the list were not in fact involved in IoT at all.
This was not a surprise to us: our machine-learning based approach isn’t perfect and never will be. Once we incorporated this feedback, our classification algorithm recalculated its predictions overnight. The new classification model added those companies that were missing. It also added many similar companies that were previously excluded from the list of IoT businesses. The new classification model removed those companies that were wrongly classified as involved in IoT. It also removed some similar companies that were wrongly classified as involved in IoT even though no-one had explicitly alerted us to them.
In this way, by sharing our outputs openly and by continually checking our classifications, we improve the classification model and thus the quality of our classifications over time.
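The feedback loop can be sketched in a few lines. This is a toy under stated assumptions: the same hypothetical nearest-centroid classifier over word counts stands in for the real model, and the company descriptions are invented. One reader-reported company is added as a manual label, the model is retrained, and a similar company that nobody reported changes class as well.

```python
import math
import re
from collections import Counter, defaultdict

def bag_of_words(text):
    """Lower-case the text and count its words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labelled):
    """Build one summed word-count vector per label from {text: label}."""
    centroids = defaultdict(Counter)
    for text, label in labelled.items():
        centroids[label].update(bag_of_words(text))
    return dict(centroids)

def classify(model, text):
    """Return the label whose summed word counts are most similar."""
    vec = bag_of_words(text)
    return max(model, key=lambda label: cosine(vec, model[label]))

# Made-up manual classifications.
labelled = {
    "smart connected sensors": "IoT",
    "haulage vehicles logistics": "not IoT",
}
reported = "fleet telematics tracking vehicles"     # a reader says this is missing
similar = "telematics tracking for haulage fleets"  # nobody reported this one

model = train(labelled)
before = (classify(model, reported), classify(model, similar))

labelled[reported] = "IoT"  # incorporate the feedback as a manual label
model = train(labelled)     # the overnight recalculation, in miniature
after = (classify(model, reported), classify(model, similar))

print(before)  # ('not IoT', 'not IoT')
print(after)   # ('IoT', 'IoT')
```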
It’s a bit more complicated than that
The diagrams and the explanation we’ve given above are simple, but they manage to cover the important parts of how our industrial classification system works. Additional complexities are added to support more languages, more types of industry, and to try to keep the industrial classifications reasonably stable over time.
The classification system is continually learning as more websites are scraped, and more manual classifications are added.
The diagrams above suggest that only the contents of the websites are used by our classification system, but we actually use much more information than this. We use company tweets, links from the company website, and links to the company website from elsewhere on the web including LinkedIn and open lists of companies and public grant winners. These are all pieces of information that a human expert might miss, but that a computer can include when deciding how to classify a business.
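One simple way to picture this is as a single feature vector built from several sources. The sketch below is an assumption about how that could look, not a description of the actual system: the source names, the weights in `SOURCE_WEIGHTS` and the `combined_features` helper are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical weights: how much each source counts relative to website text.
SOURCE_WEIGHTS = {"website": 1.0, "tweets": 0.5, "inbound_links": 2.0}

def bag_of_words(text):
    """Lower-case the text and count its words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def combined_features(sources, weights=SOURCE_WEIGHTS):
    """Merge per-source word counts into one weighted feature vector."""
    features = Counter()
    for name, text in sources.items():
        weight = weights.get(name, 0.0)  # sources without a weight add nothing
        for word, count in bag_of_words(text).items():
            features[word] += weight * count
    return features

# Made-up text gathered for one company.
company = {
    "website": "we build connected sensors",
    "tweets": "new sensors shipping today",
    "inbound_links": "iot sensors supplier",
}
features = combined_features(company)
print(features["sensors"])  # 3.5 = 1.0 + 0.5 + 2.0
```

The combined vector can then be fed to whatever classifier is in use, so a word that appears only in links to the company still influences the result.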
The final additional complexity for this blog post is how we use our classification model to classify more than just businesses. We use the same system to classify events from services like Open Tech Calendar and Eventbrite, as well as patents and scientific paper abstracts. In this way we have a single learned ontology of industrial classification that we can use to combine data from many data sources, each with errors and uncertainties, to understand the evolution of industrial clusters.
Our classifications are currently pretty good, but we know that they can get better. The most exciting part of The Data City is that the more data we collect, the more questions we answer, and the more manual corrections we make, the better our classifications get.