Build Things Together

Bill Mills

Consuming Open Data

New Parser for the World Ocean Database

July 27, 2015 | 3 Minute Read

Last week brought some great announcements in open data: the OpenfMRI project, which seeks in part to establish a standard data format for functional MRI brain scans, reported progress, and a paper by K. Ramirez et al. describing a path to standardizing and distributing soil biodiversity data was accepted for publication. Both projects emphasize the distribution infrastructure for their data, but both also advance the conversation around standardizing data formats in their fields, which is one of the key elements of creating truly open data that I discussed here earlier this month.

Standardized data formats are worth the trouble because they are easy to read; my other recommendation in my last post was to distribute parsers for these formats, in order to take advantage of this ease of consumption. My colleagues working on the International Quality Controlled Ocean Database and I have done just that for the World Ocean Database format in our new Python package on PyPI, wodpy.


Here’s some World Ocean Database data:

C41303567064US5112031934 8 744210374426193562-17227140 6110101201013011182205814
01118220291601118220291901024721 8STOCS85A3 41032151032165-500632175-50023218273

Like I said, easy to read, right? If this is a bit opaque, there’s a 170-page PDF that explains how to unpack it. This is the sort of slog that can really ruin the open data party; to avoid all this pain, grab our new parser from PyPI:

sudo pip install wodpy

throw that block of data into a file data.dat, and give this Python script a try:

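from wodpy import wod

# Open the raw WOD-format text and parse the first profile in it;
# the with-block closes the file when we are done.
with open("data.dat") as fid:
    profile = wod.WodProfile(fid)

# Latitude at which the profile was measured.
print(profile.latitude())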

Et voilà: the script prints the latitude at which this measurement was taken. This data and toy demo are in this repo; the README in the main wodpy repo describes all the methods the WodProfile object provides for extracting usable information from the terse format it came in. Big thanks to Simon Good of the UK Met Office for hammering out the first version of this class in our AutoQC project.

wodpy is very alpha right now, and this is my first attempt at a serious package on PyPI. I’d appreciate feedback from everyone on basic operation: does the toy demo linked above work for you? Does the package actually download and install? I pulled this together in cafes and airports while on vacation, so I haven’t had access to a million different OSes, Python versions, etc.; whether you’re interested in WOD data or not, just letting me know when the demo falls over is a big help.

For those who actually use WOD-format data, what do you think? How can we make this as useful and convenient as possible for your research? There’s information encoded in WOD data that isn’t currently returned by a method on this class; it’ll all get there eventually, but if you call out your favorites, I’ll send them to the top of the list.

By wrapping up our decoding in something like the WodProfile class, which has easy-to-understand methods for pulling the information we actually want out of open data, I think we can lower the barriers to consumption and get closer to what the open data movement aspires to. Beyond the World Ocean Database, I hope we can make wodpy a strong example of a modular parser that gets the boring munging out of the way, so we can get down to science faster.
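To make that pattern concrete, here is a minimal sketch of the idea. It is not wodpy's actual implementation: the ToyProfile class, its fixed-width layout, and the sample record are all made up for illustration and are far simpler than the real WOD format. The point is only that the column arithmetic lives in one place, and everyone else calls methods with obvious names.

```python
import io


class ToyProfile(object):
    """Wrap one record of a made-up fixed-width format behind named methods.

    Hypothetical layout (this is NOT the WOD format):
      columns 0-3    year
      columns 4-5    month
      columns 6-7    day
      columns 8-13   latitude, signed, in hundredths of a degree
      columns 14-20  longitude, signed, in hundredths of a degree
    """

    def __init__(self, fid):
        # Read a single record from an open file-like object.
        self._raw = fid.readline().rstrip("\n")

    def year(self):
        return int(self._raw[0:4])

    def month(self):
        return int(self._raw[4:6])

    def day(self):
        return int(self._raw[6:8])

    def latitude(self):
        # Stored as an integer number of hundredths of a degree.
        return int(self._raw[8:14]) / 100.0

    def longitude(self):
        return int(self._raw[14:21]) / 100.0


# A made-up record in the toy layout, served from memory instead of a file.
record = io.StringIO(u"20150727 +4850 -12345\n")
profile = ToyProfile(record)
print(profile.latitude())
```

Nobody calling profile.latitude() needs to know which columns hold the latitude or how it is scaled, and that is the whole trick: if the layout changes, only the class has to change.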