Build Things Together

Bill Mills

Truly Open Data

July 06, 2015 | 4 Minute Read

There is a lot to be optimistic about in the conversation around open data. More and more journals and funders are demanding data be published as a condition of submission or granting, creating much-needed pressure from the top; infrastructure like Git Large File Storage on GitHub, AWS’s open data project, Figshare, and Dryad is making it feasible to host data on the web; and projects like Dat, Synapse and the EcoData Retriever are breaking ground on what it means to version and share data. It is certainly not time to hang the ‘Mission Accomplished’ banner on the open data battleship quite yet, but really good progress is nonetheless being made. There is still one big challenge, however, that open data is lagging on: usability.

Lily Hay Newman said it well at Slate a couple of weeks ago: “Open-source scientific data is grossly underutilized and kind of a mess.” Which is not even a little surprising; I recently wrote here about the challenges inherent in using other people’s code - other people’s data, however, is much, much worse. With great stamina, you can pick through my code and look up what every piece of syntax does; but if I hand you a CSV with no labels, no units and no context, forget it - that data will never be usable, and a little piece of the promise of the open data movement dies on the headerless table.

While we race to slap as much data up on the web as possible, we need to think about how to make that data legible, and not simply accessible. Access is only step one; legibility also demands that a third party can understand that data well enough to reuse it. I have two recommendations.

Data Decoding

All datasets should either come with or point to a piece of open-source software that can be used to ingest that data, delivering it to a usable in-memory form in at least one programming language. For example, at the International Quality Controlled Ocean Database, we include a Python class that will unpack the highly inscrutable WOD data format into an object that has well-documented methods to return all the information we want about the deep blue sea. At a high level, requirements for a decoder should include the following (a rough sketch appears after the list):

  • Self containment. The decoder should be encapsulated as a class, function, module or some other appropriate structure that does not require a user to massage or understand the underlying code.
  • Thorough documentation. Decoder docs should make it clear, as succinctly as possible, how to extract data from the resulting object, and where the formal definitions of those data reside (viz. a pointer to the schema.org-like index described below).
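
To make that shape concrete, here is a minimal sketch in Python. It is not the IQuOD class mentioned above; the format, column names, class name and methods are all invented for illustration, standing in for whatever a real dataset actually publishes.

    import csv
    import io

    class TemperatureProfileDecoder:
        """Decode a made-up CSV profile format into a usable in-memory object.

        Expected columns: depth_m, temperature_degC. The column names and the
        format itself are assumptions for this sketch; a real decoder would
        point at the published schema for its dataset.
        """

        def __init__(self, raw_text):
            # Parse once, up front; callers never touch the raw encoding.
            self._rows = list(csv.DictReader(io.StringIO(raw_text)))

        def depths(self):
            """Return depths in metres, as floats, in file order."""
            return [float(row["depth_m"]) for row in self._rows]

        def temperatures(self):
            """Return temperatures in degrees Celsius, as floats, in file order."""
            return [float(row["temperature_degC"]) for row in self._rows]

    raw = "depth_m,temperature_degC\n0,18.2\n10,17.9\n50,12.4\n"
    profile = TemperatureProfileDecoder(raw)
    print(profile.depths())        # [0.0, 10.0, 50.0]
    print(profile.temperatures())  # [18.2, 17.9, 12.4]

The point is not the parsing, which is trivial here, but the shape: one self-contained class, documented methods, and units stated where the data comes out.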

Data Peer Review

The single most valuable thing that an organized scientific publishing apparatus provides for papers is a system of peer review. As we take it upon ourselves to publish data, we need to think about how to set the same standards of review in this new arena. I see the data reviewer as having at least two responsibilities:

  • Curating standard data ontologies. If we are going to distribute data, we need a standard language to describe it, or else we have no chance of combining or reusing that data. Each discipline will of course need to develop its own descriptions of data; part of the data reviewer’s job must be to examine the proposed dataset’s usage of those standard descriptions, ensuring it uses previously defined descriptors correctly and appropriately wherever possible and does not create new descriptors unnecessarily. Furthermore, these descriptors should be registered centrally in a machine-readable format, as schema.org does for JSON-LD contexts.
  • Ensuring encoding format compliance. There are many, many different encoding formats out there; an open dataset needs to adhere to one of them, and not invent new ones without good reason. In addition to ensuring data passes a linter at the structural level, encoding compliance must check that each piece of data matches the format or data type described in the schema recommended above (no starting numbers with apostrophes to trick GUI autoformatting, please); a rough sketch of both of these checks follows this list.
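
As a toy illustration of how those two reviewer responsibilities might fit together, here is a small Python sketch: a machine-readable descriptor registry standing in for the schema.org-like index above, and a function that flags records using unregistered descriptors, wrong types, or out-of-range values. Every name and rule in it is an assumption made for the example, not an existing standard.

    # Hypothetical descriptor registry; a real one would be curated per discipline.
    DESCRIPTORS = {
        "depth_m": {"type": float, "unit": "metre", "minimum": 0.0},
        "temperature_degC": {"type": float, "unit": "degree Celsius"},
    }

    def check_record(record):
        """Return a list of compliance problems found in one data record."""
        problems = []
        for field, value in record.items():
            spec = DESCRIPTORS.get(field)
            if spec is None:
                problems.append("unknown descriptor: " + field)
                continue
            if not isinstance(value, spec["type"]):
                problems.append(field + ": wrong type " + type(value).__name__)
                continue
            if "minimum" in spec and value < spec["minimum"]:
                problems.append(field + ": value below minimum")
        return problems

    print(check_record({"depth_m": 10.0, "temperature_degC": 17.9}))  # []
    print(check_record({"depth_m": -5.0, "temp": "17.9"}))            # two problems

A reviewer - or better yet, a continuous-integration job - could run every record in a submission through a check like this before a human ever looks at it.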

There are probably tons of other strategies we can tack on here to ensure the legibility and usability of open data; please, let me know your ideas in the comments. One of the biggest challenges will be to convene the conversations necessary to sketch out those first ontologies - this is a conversation we need to be having at our conferences and in our scientific societies, and it can be done. Open data deserves every ounce of the enthusiasm, effort and funding it is getting - but as we figure out this new paradigm, we’ve got to remember that nothing is truly open unless it is usable, too.