Avoiding The Dragons of Open Data

Published 30th September 2020

In popular culture, “here be dragons” (hic sunt dracones in the original Latin) means dangerous or unexplored territory and is often thought to be an imitation of the medieval custom of putting pictures of dragons or other mythical beasts on uncharted or blank areas of maps where danger was thought to exist.

In reality, whilst such illustrations were relatively common in early maps, the phrase itself is only known to appear on one surviving map, the Hunt-Lenox Globe, which is dated between 1503 and 1507 and which today resides in the Rare Books Division of the New York Public Library.

How does a Latin phrase on a medieval globe have anything to do with Kamma or to do with open data? To badly paraphrase George Orwell, all open data is open, but some is more open than others. So, to start with, we should probably define what open data is.

The Open Definition summarises open data as follows :-

Open data is data that can be freely used, re-used and redistributed by anyone – subject only, at the most, to the requirement to attribute and share-alike.

The full definition goes into a lot more detail, but the three most important points are :-

Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
Re-use and Redistribution: the data must be provided under terms that permit re-use and redistribution including the intermixing with other datasets.
Universal Participation: everyone must be able to use, re-use and redistribute – there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.

At Kamma, we make use of a lot of open data, and we’ve recently codified the way in which we assess an open data set: is this the right data for us to use, or are there hidden dragons or other dangers that suggest it isn’t? Our open data guidelines look like this …

Data Sources and Currency

Is this data produced by or on behalf of an official organisation or body?

On the plus side, “official” data can often be more authoritative but the flip side of this is that it may not be updated regularly.

Is this data produced by an open data and/or open source community?

Community data is usually regularly updated, often in near real time. Community data is not necessarily any less accurate than officially produced data.

Is this data regularly updated?

Has the data been produced once and then left alone? It’s rare that data doesn’t need to be updated. The frequency will vary considerably according to what the data represents.

Data Formats

Is the data released in a format that allows us to read it and consume it easily?

As a general rule, data that is released in a text format, such as CSV or GeoJSON, is preferable to data that is released in a binary form. Data, regardless of licensing, that is only available via an API is less attractive as it means you can only query the API for specific items of data rather than downloading the entire data set, which is usually preferable.
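As a rough illustration of why text formats are convenient, the sketch below reads CSV and GeoJSON using nothing more than Python’s standard library. The sample data and the function names are made up for this example and are not taken from any real data set.

```python
import csv
import io
import json

# Hypothetical sample data, inlined so the sketch is self-contained.
CSV_TEXT = "id,name\n1,Alpha\n2,Beta\n"
GEOJSON_TEXT = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"id": "1", "name": "Alpha"},
            "geometry": {"type": "Point", "coordinates": [-0.1, 51.5]},
        }
    ],
})


def read_csv_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_geojson_features(text):
    """Parse GeoJSON text and return its list of features."""
    collection = json.loads(text)
    return collection.get("features", [])


rows = read_csv_rows(CSV_TEXT)
features = read_geojson_features(GEOJSON_TEXT)
print(len(rows), "CSV rows;", len(features), "GeoJSON features")
```

With a real download you would open the file rather than an inline string, but the parsing step stays the same.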

Does the data require specialist or proprietary software to read it?

Some binary forms of data require other software or software libraries to read them. If the data is in an open or openly documented format, such as Esri’s Shapefile format, then libraries should be available to read and write it. Conversely, if the data is in a format which requires you to license additional software applications or libraries, there are cost implications as well as the real risk that these libraries could be unsupported in the future.

Is the data available online for easy download?

Can you easily acquire the data and its updates, or do you have to order a specific time-limited URL or physical download? The latter approach makes it hard to automate the updating of your data sets, leaving a process that is more manual and time-consuming.

Are updates to the data incremental or do they include everything?

When a data set is updated does the update wrap up all previous updates or does it just contain records which have been added, changed or deleted since the last release?
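The difference matters when you come to load a release. A minimal sketch, assuming records are held in a dictionary keyed by identifier and that an incremental release arrives as separate sets of added, changed and deleted records (the function names and the sample records are hypothetical):

```python
def apply_full_release(_current, release):
    """A full release replaces everything held so far."""
    return dict(release)


def apply_incremental_release(current, added, changed, deleted):
    """An incremental release is applied on top of what is already held."""
    updated = dict(current)
    updated.update(added)
    updated.update(changed)
    for record_id in deleted:
        # Ignore deletions for records we never held.
        updated.pop(record_id, None)
    return updated


held = {"A1": "Old Mill", "A2": "Town Hall"}
held = apply_incremental_release(
    held,
    added={"A3": "Library"},
    changed={"A1": "Old Mill House"},
    deleted=["A2"],
)
print(held)
```

Incremental releases have to be applied in order and none can be skipped, whereas a full release can be loaded at any time without reference to what came before.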

Data Contents

Does the data contain stable and consistent identifiers for each record?

Does each record in the data set have a unique identifier? Does this identifier remain the same across releases, or is it only unique in the context of a single release? You should consider how to manage data sets that do not have stable identifiers, or that have no identifiers at all.
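One way to answer these questions for a candidate data set is to compare the identifiers in two consecutive releases. The sketch below assumes each record is a dictionary with an "id" field; the field name and the sample releases are illustrative only.

```python
from collections import Counter


def duplicate_ids(records, id_field="id"):
    """Return identifiers that appear more than once within one release."""
    counts = Counter(record[id_field] for record in records)
    return sorted(i for i, n in counts.items() if n > 1)


def compare_ids(previous, latest, id_field="id"):
    """Summarise which identifiers were kept, dropped or introduced."""
    previous_ids = {record[id_field] for record in previous}
    latest_ids = {record[id_field] for record in latest}
    return {
        "kept": sorted(previous_ids & latest_ids),
        "dropped": sorted(previous_ids - latest_ids),
        "new": sorted(latest_ids - previous_ids),
    }


release_1 = [{"id": "A1"}, {"id": "A2"}, {"id": "A3"}]
release_2 = [{"id": "A1"}, {"id": "A3"}, {"id": "A4"}, {"id": "A4"}]

print(duplicate_ids(release_2))
print(compare_ids(release_1, release_2))
```

If very few identifiers are kept between releases even though the contents look much the same, that is a hint that the identifiers are regenerated on each release rather than being stable.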

Is the data consistent and documented?

Even the simplest of data needs supporting documentation for each field and each field’s data type.
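Where such documentation exists, it can be turned into a simple automated check. The sketch below assumes a hand-written mapping from field name to documented type; the field names are invented for illustration and do not come from any particular data set.

```python
# Hypothetical field documentation: field name -> converter for its documented type.
SCHEMA = {"id": str, "bedrooms": int, "floor_area_m2": float}


def check_record(record, schema=SCHEMA):
    """Return a list of problems found in one record; empty means it matches."""
    problems = []
    for field, convert in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        try:
            convert(record[field])
        except (TypeError, ValueError):
            problems.append(f"bad value for {field}: {record[field]!r}")
    return problems


print(check_record({"id": "A1", "bedrooms": "3", "floor_area_m2": "82.5"}))
print(check_record({"id": "A2", "bedrooms": "three"}))
```

Running a check like this over a whole release gives a quick feel for how consistent the data is with its own documentation.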

Can this data easily hold hands with our existing data?

A lot of data sets can be linked together if you know that an identifier in one data set is equivalent to an identifier in another. This can help you link data sets, allow more insight to be gleaned than from a single data set, and also suggest other data sets which may be of value.
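Linking by equivalent identifiers can be sketched as a lookup from one data set’s identifier to the other’s. Everything here is hypothetical: the field names property_ref and building_id, the lookup table and the sample records are placeholders for whatever the real data sets provide.

```python
def link_records(properties, buildings, ref_to_building_id):
    """Pair each property with its building via a lookup of equivalent identifiers."""
    buildings_by_id = {b["building_id"]: b for b in buildings}
    linked = []
    unmatched = []
    for prop in properties:
        building_id = ref_to_building_id.get(prop["property_ref"])
        building = buildings_by_id.get(building_id)
        if building is None:
            unmatched.append(prop["property_ref"])
        else:
            linked.append({"property": prop, "building": building})
    return linked, unmatched


properties = [
    {"property_ref": "P-1", "postcode": "AB1 2CD"},
    {"property_ref": "P-2", "postcode": "EF3 4GH"},
]
buildings = [{"building_id": "B-100", "storeys": 2}]
lookup = {"P-1": "B-100"}  # property_ref -> building_id

linked, unmatched = link_records(properties, buildings, lookup)
print(len(linked), "linked;", "unmatched:", unmatched)
```

Note that the two data sets stay in separate structures and are only cross-referenced at query time. Tracking the unmatched records is worthwhile too, as a high count suggests the identifiers are not as equivalent as hoped.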

Data Licensing

Is the data formally licensed under an open data licence?

Not all open data licences are equal. For example, open data produced by the UK Government must now be licensed under the Open Government Licence which places very few limitations on use. But some open data licences are more onerous in their restrictions. It’s important to consider the business implications based on the requirements of an open licence and how you plan to make use of the data.

Does the licence allow commercial use?

Some open licences disallow commercial use under any conditions, which may preclude using the data. Others allow commercial use, but only under a formal, paid-for licence. That doesn’t mean you can’t use the data per se, but you should weigh the cost and the licensing conditions, which can be more restrictive under a formal scheme, against the value the data can add.

Does the licence have an attribution clause?

Attributing the data means that you need to credit your use of the data in some manner. At Kamma we list the data source and the licence as part of the About section of our website.

Does the licence have a share-alike clause?

A share-alike clause can be more problematic in an open licence than an attribution clause, as it can mean that if you co-mingle the data with your own data, the resultant data set must be released under the same licence and made publicly available. If this isn’t a viable proposition for you, you can still keep the data sets apart in their own silos and cross-reference them depending on your needs.

Does the licence permit a derived work to be produced?

In addition to other licensing terms, some data licences, both open and proprietary, forbid co-mingling and producing a derived data set. Generally, that makes the data challenging to use, though as mentioned above, you may still be able to use it if you keep it separated from all other data sets.

In summary, while by no means complete or comprehensive, the questions above have allowed Kamma to quickly and easily triage whether an open data set is right for us to use, as well as to realise the immense value and benefits that well-produced open data has to offer. While there is no one-size-fits-all approach to open data, asking the same questions consistently when looking at an open data set has allowed us to navigate and avoid the perils and hazards of the dragons of open data.
