Build Things Together

Bill Mills

Consuming Open Data

New Parser for the World Ocean Database

July 27, 2015 | 3 Minute Read

Last week had some great announcements in open data: progress was reported in the OpenfMRI project, seeking in part to establish a standard data format for functional MRI brain scans, and K. Ramirez et al’s paper was accepted for publication that describes a path to standardizing and distributing soil biodiversity data. Both these projects emphasize the distribution infrastructure for their data, but both also advance the conversation around standardizing data formats in their fields - one of the key elements of creating truly open data that I discussed here earlier this month.

Standardized data formats are worth the trouble because they are easy to read; my other recommendation in my last post was to distribute parsers for these formats, in order to take advantage of this ease of consumption. My colleagues working on the International Quality Controlled Ocean Database and I have done just that for the World Ocean Database format in our new Python package on PyPI, wodpy.

wodpy

Here’s some World Ocean Database data:

C41303567064US5112031934 8 744210374426193562-17227140 6110101201013011182205814
01118220291601118220291901024721 8STOCS85A3 41032151032165-500632175-50023218273
18117709500110134401427143303931722076210220602291107291110329977020133023846181
24421800132207614110217330103192220521322011216442103723077095001101818115508527
20012110000133312500021011060022022068002272214830228442684000230770421200000191
15507911800121100001333125000151105002103302270022022068002274411816302284426840
00230770426500000191155069459001211000013331250001511050021033011300220220680022
73319043022844268400023077042620000019116601596680012110000133312500021022016002
17110100220220680022733112830228442684000230770435700000181155088803001211000013
33125000210220160022022068002273311283022844268400023077042120000019115508880300
12110000133312500015110200210330535002202206800227441428030228442684000230770421
20000019115508880300121100001333125000152204300210220320022022068002273312563022
84426840002307704212000001911550853710012110000133312500015110200210220160022022
06800227331128302284426840002307704212000001100003328960044230900033267500222650
03312050033281000220100033289500442309000332670002227100331123003328100022025002
22900044231910033286200222900033115400332810002205000342-12300442324100332728003
32117003312560033280500 

Like I said, easy to read, right? If this is a bit opaque, there’s a 170 page pdf that explains how to unpack this. This is the sort of slog that can really ruin the open data party; to avoid all this pain, grab our new parser off of PyPI:

sudo pip install wodpy

throw that block of data into a file data.dat, and give this Python script a try:

from wodpy import wod

file = open("data.dat")
profile = wod.WodProfile(file)

print profile.latitude()

et voila: the latitude this measurement was taken at is returned. This data and toy demo are in this repo; the README in the main wodpy repo describes all the methods the WodProfile object provides for extracting usable information from the terse format it came in. Big thanks to Simon Good out of the UK Met Office for hammering out the first version of this class in our AutoQC project.

wodpy is very alpha right now, and this is my first attempt at a serious package on PyPI. I’d appreciate feedback from everyone on basic operation - does the toy demo linked above work for you? Does the package actually check out and install? I pulled this together in cafes and airports while on vacation, so I haven’t had access to a million different OS’es, python versions, etc; whether you’re interested in WOD data or not, just letting me know when the demo falls over is a big help.

For those who actually use WOD-format data, what do you think? How can we make this as useful and convenient as possible for your research? There’s information encoded in WOD data that isn’t currently returned by a method on this class; it’ll all get there eventually, but if someone were to call out their favorites, I’ll send them to the top of the list.

By wrapping up our decoding in a something like the WodProfile class, that has easy-to-understand methods for pulling the information we actually want out of open data, I think we can lower the barriers to consumption and get closer to what the open data movement aspires to; beyond the World Ocean Database, I hope we can make wodpy a strong example of a modular parser that gets the boring munging out of the way, so we can get down to science faster.