Build Things Together

Bill Mills

Next Steps

August 17, 2015 | 6 Minute Read

Today was my last day at the Mozilla Science Lab, after a year as their first Community Manager. It was a privilege to be part of Kaitlin Thaney’s team during this time; the Science Lab started a long list of global-scale conversations in open science that are organizing thought and catalyzing action among a panoply of scientific stakeholders. These conversations, and the things I learned from Mozilla and the Science Lab team, are of huge value to the future of science, and I encourage you to find a way to learn from them, too. I can articulate now what I could only sense before, and that clarity is so liberating.

After much reflection, I decided to take that clarity and return to active research in particle and nuclear physics. Folding these ideas back into the day-to-day practice of doing science requires involvement, access, and presence: you have to dig into the details and follow the threads of implementation that differ from field to field and lab to lab. Following these threads to their esoteric extremes is going to be necessary to get open science over the finish line. But it’s not the Science Lab’s role to dive down every rabbit hole; that task is left to us.

Open Nuclear Physics

In the short term, I’m excited to head back to TRIUMF as a research software developer, to look into extending some work I did there as a postdoc. I have a lot more to say about the emerging field of research software development, but I’ll save that for its own post later this week. Right now, I want to dig into what a first pass at open nuclear physics can look like, at TRIUMF and more broadly.

There are a ton of skills and ideas that surround open science, but I’d like to start by focusing on a few cardinal features: open access, open source, and open data.

Open Access

My favorite particle physics paper ever wasn’t about the Higgs, or even dark matter - it was about the tremendous citation advantage that high energy physics, nuclear’s next-door neighbor, has enjoyed thanks to publishing on the arXiv for the last twenty years. Combine this with the fact that thousands of Physical Review C papers already have preprints on the arXiv, setting a pretty clear precedent, and that Journal of Physics G actively encourages the upload of peer-reviewed nuclear physics manuscripts to the arXiv, and the practical barriers to open access publishing in nuclear physics are pretty much zero. Furthermore, essentially all experimental facilities in nuclear physics are run by governing bodies that can impose values and criteria on the experiments proposed to them. As a colloquial practice, we’re most of the way there already; I’d like to see nuclear labs go the last mile and mandate preprint submission to the arXiv for all papers written on experiments conducted using their hardware, as part of the experimental proposal and review process.
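As an aside on just how low the practical barrier is: the arXiv exposes a public query API, so surveying the preprint landscape in a subfield takes only a few lines of code. A minimal sketch using only the Python standard library - the category and result count below are illustrative choices, nothing more:

```python
# Sketch: list recent experimental nuclear physics preprints via the public
# arXiv API. The category (nucl-ex) and max_results are illustrative choices.
import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

query = urllib.parse.urlencode({
    "search_query": "cat:nucl-ex",
    "start": 0,
    "max_results": 10,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})
with urllib.request.urlopen("http://export.arxiv.org/api/query?" + query) as response:
    feed = ET.fromstring(response.read())

# Each Atom <entry> in the response is one preprint.
for entry in feed.findall(ATOM + "entry"):
    title = entry.find(ATOM + "title").text.strip()
    published = entry.find(ATOM + "published").text
    print(published, "|", title)
```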

Open Source

I have too many things to say about scientific open source to cram them all into this blog post - I’ll be discussing some more in-depth strategies later this week. One thing to address early, as code takes a leading role in the doing of science, is software citation. In order to start creating a knowable landscape of what’s out there in terms of open source code, we need to start reporting that code in context with the science it supports; that way, we can begin to defeat the software discoverability problem simply by staying abreast of the scientific literature, and thereby take better advantage of the improvements to productivity and code quality that open source collaboration brings. Furthermore, a standard of software citation is necessary for the growing field of scientific software development to define and integrate itself within the existing academic and scientific establishment.
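To make that concrete: until a standard shakes out, one pragmatic stopgap is to cite an archived, versioned release of the code alongside the paper, DOI and all. A sketch of what such a reference might look like in BibTeX - every field below (authors, title, version, DOI) is a made-up placeholder:

```bibtex
% Hypothetical software citation; every field is a placeholder.
% @misc is a common stopgap, since most BibTeX styles have no @software type.
@misc{grifsort_1_0_2,
  author       = {Researcher, A. and Developer, B.},
  title        = {grif-sort: sorting and calibration tools for GRIFFIN data},
  year         = {2015},
  note         = {Software, version 1.0.2},
  howpublished = {\url{https://doi.org/10.5281/zenodo.0000000}}
}
```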

A lively working group on software citation is amassing observations and thoughts on the practice over at FORCE11; our colleagues in astronomy are actively investigating software citation standards in a conversation that could serve as a useful precedent in nuclear physics. Furthermore, the last time I was at TRIUMF, I introduced the GRIFFIN Collaboration to open source software development on GitHub, a practice they have retained to this day, and one which paves the way for stamping our software with DOIs, thanks to the wonderful service provided by Zenodo and GitHub. Therefore, I would like to see DOI minting and citation adopted as a standard practice on GRIFFIN, following an investigation of a sensible format. All the pieces are on the table, and demand for this recognition is mounting; all that’s needed is a largely cost-free willingness to participate.
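For repositories using the GitHub-Zenodo bridge, the metadata attached to each minted DOI can be pinned down by committing a .zenodo.json file to the repository root. A minimal sketch - the title, names, and license here are placeholders, and the exact field set is Zenodo’s to define, not mine:

```json
{
  "title": "grif-sort: sorting and calibration tools for GRIFFIN data",
  "upload_type": "software",
  "description": "Analysis software for the GRIFFIN spectrometer at TRIUMF.",
  "creators": [
    {"name": "Researcher, A.", "affiliation": "TRIUMF"},
    {"name": "Developer, B.", "affiliation": "TRIUMF"}
  ],
  "license": "MIT",
  "keywords": ["nuclear physics", "GRIFFIN", "gamma-ray spectroscopy"]
}
```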

Open Data

Open data presents perhaps the greatest challenges for researchers: fearing the scoop, struggling with privacy concerns, and making data understandable to third parties. And yet, like open access, there is a well-documented citation advantage to publishing data. GRIFFIN is in a good position to address all of the concerns raised above:

  • GRIFFIN data is, as a matter of policy, deleted after five years - it’s just too big to keep around. If you’re deleting your data, you have certainly forgone all possibility of publishing anything further on it.
  • Like all data in nuclear and particle physics, GRIFFIN data obeys a strict, machine-generated, standardized format. If this data is released under an open license, I will personally write and distribute an appropriate parser to make sure the data is easily readable by third parties; a sketch of what such a parser might look like follows this list.
  • Metadata needed to make sense of GRIFFIN data, particularly post-sorting, is relatively simple to communicate: beam and target composition, detector geometry and efficiency, and trigger state and performance should be enough to understand the sorted data.
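To give a flavor of what an “appropriate parser” might involve, here is a minimal sketch. The record layout - a small fixed-size header followed by a payload - is invented for illustration only; a real parser would implement whatever format the GRIFFIN DAQ actually writes.

```python
# Sketch of a third-party event parser. The record layout here is invented
# for illustration; a real parser would implement the DAQ's published format.
import struct
import sys
from collections import namedtuple

Event = namedtuple("Event", ["timestamp", "detector_id", "payload"])

# Hypothetical little-endian header: payload length (uint32),
# timestamp (uint64), detector ID (uint16).
HEADER = struct.Struct("<IQH")

def read_events(path):
    """Yield Events from a raw data file one at a time - files are huge."""
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of file
            length, timestamp, detector_id = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length:
                raise IOError("truncated event at end of file")
            yield Event(timestamp, detector_id, payload)

if __name__ == "__main__":
    for i, event in enumerate(read_events(sys.argv[1])):
        print(i, event.timestamp, event.detector_id, len(event.payload))
```

Streaming one event at a time, rather than loading whole files into memory, is the design choice that matters at these data volumes.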

What’s more, our colleagues over at the LHC have recently adopted an open data program, releasing data after several years accompanied by enough software infrastructure and context to make sense of it - this program can serve as a precedent and a model for GRIFFIN. The only hurdle left, then, is the scale of GRIFFIN data, which can be produced at up to 200 TB a week. While it’s not immediately clear how to distribute data at this scale without placing an unrealistic burden on the lab, one thing we can do today is assign an open license to GRIFFIN data after five years. As it stands, this policy is completely academic; as noted above, the other thing that happens after five years is that GRIFFIN data is deleted to make space on our servers. Nevertheless, baking in a CC0 (or similar) license at this point opens the door to distribution of the data by third parties or by means not currently available to us, without costing the lab anything in terms of resources or opportunity, and helps position TRIUMF as a leader in open data policy.
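To put that scale in numbers, a back-of-the-envelope estimate - the 200 TB a week figure is from above, while the 10 Gbit/s outbound link is purely an assumed example:

```python
# Back-of-the-envelope scale of GRIFFIN open data. The 200 TB/week figure is
# from the text above; the 10 Gbit/s outbound link is an assumed example.
TB = 1e12  # bytes

weekly_bytes = 200 * TB
yearly_bytes = weekly_bytes * 52
print(f"yearly volume: {yearly_bytes / 1e15:.1f} PB")  # ~10.4 PB/year

link_bits_per_s = 10e9  # assumed dedicated 10 Gbit/s link
link_bytes_per_week = link_bits_per_s / 8 * 86400 * 7
print(f"such a link moves {link_bytes_per_week / TB:.0f} TB/week")  # ~756 TB

# One fast link could in principle keep pace with production, but serving
# many downstream users - or any archive of past years - multiplies the cost.
```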

Sounds easy, right?

Wrong. I’m sure there will be nuances and speed bumps, and digging into those details is exactly what I was talking about in the introduction to this post; please let me know in the comments your criticisms and ideas on how we can meet the challenges laid out above. I’m looking forward to seeing what progress we can make on these fronts at TRIUMF, and to digging deeper into the role of scientific software development - more on that quest later this week, and on these initiatives as they evolve.