At The Data City we handle hundreds of gigabytes of data. The Information Commissioner’s Office (ICO) is clear in its data protection principles: if data contains identifiable names, it is personal data. Our data contains names.
Because we work with personal data, The Data City’s two data controlling partner organisations, Bloom and imactivate, are registered with the ICO. We have processes to disclose and correct any personal data that we hold.
We take the correct handling of personal data very seriously. In part this is because it is the law and a condition of us doing business. But mostly it is because we don’t want to, or need to, reduce anyone’s privacy through our work.
By considering data protection early, we protect people’s privacy without reducing the power of our tools or the quality of our work. This is a passion for us. All of our co-founders also work at or with The Open Data Institute Leeds, pushing the idea that you can be good at business and good with data at the same time.
Sensitive data, private data, and public availability
The most important protections in The Data Protection Act and the ICO’s guidance on it concern two types of personal data: private data and sensitive data.
In summary, private data is data that is collected without the person it refers to expecting it to be made public. Sensitive data includes information such as race, criminal record, and health. It is well-defined in the ICO’s Key Definitions.
Most of the more complex data protection concerns do not apply to us because we don’t work with private or sensitive data. In part this is because the data we collect is always publicly available.
When we collect the names of authors on scientific papers and patents, the data was already publicly available. When we collect the names of directors in companies, the data was already publicly available. And when we scrape company websites to understand what they do, the content was publicly available too. Public availability of data makes its collection and storage less likely to breach privacy: if it is public knowledge that a person is the director of a publicly listed company, holding that personal data does not compromise privacy.
But just because all of the personal data that we use is publicly available doesn’t mean that we can ignore ICO guidance around private and sensitive personal data.
The ICO is clear that the nature of publicly available data can change when it is aggregated or linked with other data, and that making data public does not imply unlimited consent for its re-use. While each piece of data that we hold meets the ICO’s test of being publicly available and neither private nor sensitive, in combination the data sometimes does not.
As an example, if you attend a public event on Eventbrite you can see other attendees. That is largely uncontentious. But if we collected data on every public event that a person attended over several years and released it in a single dataset, it would no longer meet the ICO’s test of being non-sensitive and non-private.
There are many more areas where data being publicly available does not stop it from also being private or sensitive and deserving to be treated accordingly. The ICO’s early work on “personal information online” considers social media posts in more detail and is a good guide to thinking about these challenges.
GDPR, Big Data, and what we do
Big data, machine learning, and AI are all poorly defined and mostly quite new. We probably do all three, and both public understanding and legislation are still evolving, and evolving quickly.
The ICO’s guidance for working with big data is comprehensive (especially the full 114-page version) and informs much of our approach. Although the graph database at the centre of The Data City contains only publicly available personal data, we treat it as if it contained highly sensitive and private data. It is stored in a secure location and has highly restricted access controls.
Our public tools are powered by highly-aggregated data releases, almost always available under an open license. Examples are The IoT UK Nation Database on Data Mill North and The UK Tech Innovation Index on GitHub.
Through aggregation, personal data is removed before anything drawn from our central graph database is released. In the public datasets there is no personal data at all.
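To make the idea of removing personal data through aggregation concrete, here is a minimal sketch in Python. It is not our actual pipeline: the record fields (name, city, sector), the aggregate function, and the min_count threshold are all hypothetical and chosen only for illustration. Dropping small groups is an extra safeguard we have assumed for the example, so that a count cannot point at a single individual.

```python
from collections import Counter

# Hypothetical person-level records; names, fields, and values are made up.
records = [
    {"name": "A. Example", "city": "Leeds", "sector": "IoT"},
    {"name": "B. Example", "city": "Leeds", "sector": "IoT"},
    {"name": "C. Example", "city": "Leeds", "sector": "IoT"},
    {"name": "D. Example", "city": "York", "sector": "IoT"},
]


def aggregate(records, keys=("city", "sector"), min_count=1):
    """Count records per group and return rows with no person-level fields.

    min_count is an assumed safeguard: groups smaller than it are dropped.
    """
    # Only the grouping keys are read, so names never reach the output.
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [
        {**dict(zip(keys, group)), "people": n}
        for group, n in sorted(counts.items())
        if n >= min_count
    ]


public_rows = aggregate(records, min_count=2)
print(public_rows)
```

In this sketch the released rows hold only a city, a sector, and a count. The single York record is dropped because it falls below the assumed threshold of two.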
Licenses and privacy
In our next blog post we’ll explain how a similar approach to data management applies to managing licenses. Collecting and combining data with different licenses into a final dataset that can be released as open data doesn’t just happen by accident.