Science and Source Code

Last year I came across the issue of reproducible science: the question of how best to ensure that published science can be reproduced by other people, whether because you want to fight fraud or simply be sure that there's something really real happening. I'm particularly interested in the segment of this debate that deals with computer-driven science, its data and its source code. Partly that's because I very much believe in Free as in Freedom, and partly it's because of where I work: a lot of the people I work with are doing computer-driven scientific research, and dealing with source code (theirs and others') is very much part of my job.

So it was interesting to read a blog post from Iddo Friedberg called "Can we make accountable research software?" Friedberg is an actual scientist and everything (and an acquaintance of my boss, which was a surprise to me). He wrote about two aspects of scientific programming that I hadn't realized before:

As a result, code written by scientists (much of it, in his words, "pipeline hacks") is almost never meant to be robust, or used by others, or broadly applicable. Preparing that code for a paper is even more work than writing the paper itself. But just dumping all the code is not an option, either: "No one has the time or energy to wade through a lab's paper- and magnetic-history trail. Plus, few labs will allow it: there is always the next project in the lab's notebooks and meetings, and no one likes to be scooped." And even if you get past all that, you're still facing trouble. What if:

Friedberg's solution: add an incentive for researchers to provide tested, portable code. The Bioinformatics Testing Consortium, of which he's a member, will affix gold stars to papers whose code volunteer reviewers will smoke-test, file bugs against, and verify against a sample dataset. Eventually, the gold star will signify a paper that's particularly cool, everyone will want one, and we'll have all the code we need.
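(To make that concrete: here's roughly what I imagine a volunteer reviewer's smoke test looking like. This is a minimal sketch in Python, and the names -- pipeline.py, sample_input.fasta, expected_output.tsv -- are all made up; the point is "does it run at all, and does it reproduce the sample results the authors shipped?")

    # smoke_test.py -- what a reviewer's smoke test might look like.
    # All filenames here are hypothetical stand-ins for whatever a paper ships.
    import subprocess
    import sys
    from pathlib import Path

    def main():
        # Run the authors' pipeline on the sample dataset they provided.
        result = subprocess.run(
            [sys.executable, "pipeline.py", "sample_input.fasta",
             "-o", "observed_output.tsv"],
            capture_output=True, text=True,
        )

        # Test 1: does it run at all?
        if result.returncode != 0:
            sys.exit("pipeline.py crashed:\n" + result.stderr)

        # Test 2: does it reproduce the expected results for the sample data?
        observed = Path("observed_output.tsv").read_text()
        expected = Path("expected_output.tsv").read_text()
        if observed != expected:
            sys.exit("output doesn't match the sample dataset's expected results")

        print("smoke test passed")

    if __name__ == "__main__":
        main()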

However, even then he's not sure that all code needs to be released. He writes in a follow-up post:

If the Methods section of the paper contains the description and equations necessary for replication of research, that should be enough in many cases, perhaps accompanied by code release post-acceptance. Exceptions do apply. One notable exception would be if the paper is mostly a methods paper, where the software -- not just the algorithm -- is key.

[snip]

Another exception would be the paper Titus Brown and Jonathan Eisen wrote about: where the software is so central and novel that not peer-reviewing it along with the paper makes the assessment of the paper's findings impossible.

(More on Titus Brown and the paper he references ahead.)

There were a lot of replies, some of which were in the Twitter conversation that prompted the post in the first place (yes, these replies TRAVELED THROUGH TIME): things like, "If it's not good enough to make public, why is it good enough to base publications on?" and "how many of those [pipeline] hacks have bugs that change results?"

Then there was this comment from someone who goes by "foo":

I'm a vanilla computer scientist by training, and have developed a passion for bioinformatics and computational biology after I've already spent over a decade working as a software developer and -- to make things even worse -- an IT security auditor. Since security and reliability are two sides of the same coin, I've spent years learning about all the subtle ways software can fail.

[snip]

During my time working in computational biology/bioinformatics groups, I've had a chance to look at some of the code in use there, and boy, can I confirm what you said about being horrified. Poor documentation, software behaving erratically (and silently so!) unless you supply it with exactly the right input, which is of course also poorly documented, memory corruption bugs that will crash the program (sucks if the read mapper you're using crashes after three days of running, so you have to spend time to somehow identify the bug and do the alignment over, or switch to a different read mapper in the hope of being luckier with that), or a Perl/Python-based toolchain that will crash on this one piece of oddly formatted input, and on and on. Worst of all, I've seen bugs that are silent, but corrupt parts of the output data, or lead to invalid results in a non-obvious way.

I was horrified then because I kept thinking "How on earth do people get reliable and reproducible results working like this?" And now I'm not sure whether things somehow work out fine (strength in numbers?) or whether they actually don't, and nobody really notices.

The commenter goes on to explain how one lab he worked at hired a scientific programmer to take care of this. It might seem extravagant, but it lets the biologists do biology again. (I'm reminded of my first sysadmin job, when I was hired by a programmer who wanted to get back to programming instead of babysitting machines.) foo writes: "It's also noteworthy that having technical assistants in a biology lab is fairly common -- which seems to be a matter of the perception of 'best practice' in a certain discipline." Touché!

Deepak Singh had two points:

Meanwhile, Greg Wilson got some great snark in:

I might agree that careful specification isn't needed for research programming, but error checking and testing definitely are. In fact, if we've learned anything from the agile movement in the last 15 years, it's that the more improvisatory your development process is, the more important careful craftsmanship is as well -- unless, of course, you don't care whether your programs are producing correct answers or not.

[snip]

[Rapid prototyping rather than careful, deliberate development] is equally true of software developed by agile teams. What saves them from [code that is difficult to distribute or maintain] is developers' willingness to refactor relentlessly, which depends in turn on management's willingness to allow time for that. Developers also have to have some idea of what good software looks like, i.e., of what they ought to be refactoring to. Given those things, I think reusability and reproducibility would be a lot more tractable.
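Wilson's "error checking and testing definitely are" doesn't have to mean heavyweight process, either. Here's the kind of minimal test I have in mind -- the gc_content function and its numbers are invented for illustration, but the pattern (assert on tiny inputs where you know the right answer, and on the edge cases that otherwise skew results silently) is the whole point:

    # test_gc_content.py -- a made-up example of testing one pipeline function
    # with pytest. The function is hypothetical; the pattern is what matters.
    import pytest

    def gc_content(seq):
        """Fraction of G/C bases in a DNA sequence."""
        if not seq:
            raise ValueError("empty sequence")
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / len(seq)

    def test_known_answers():
        # Tiny inputs where the right answer is obvious.
        assert gc_content("GGCC") == 1.0
        assert gc_content("ATAT") == 0.0
        assert gc_content("ATGC") == 0.5

    def test_lowercase_input():
        # The "oddly formatted input" case that would otherwise pass silently.
        assert gc_content("atgc") == 0.5

    def test_empty_input_fails_loudly():
        # Better a crash than a silent wrong number.
        with pytest.raises(ValueError):
            gc_content("")

Run that with pytest and you've got the beginnings of the safety net that makes the relentless refactoring he mentions possible.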

Kevin Karplus doubted that the Bioinformatics Testing Consortium would do much:

(He also writes that the volunteers who are careful software developers are not the main problem -- which I think misses the point, since the job of reviewer isn't meant to be a punishment for causing segfaults.)

He worries that providing the code makes it easy to forget that proper verification of computational methods comes from an independent re-implementation of the method:

I fear that the push to have highly polished distributable code for all publications will result in a lot less scientific validation of methods by reimplementation, and more "ritual magic" invocation of code that no one understands. I've seen this already with code like DSSP, which almost all protein structure people use for identifying protein secondary structure with almost no understanding of what DSSP really does nor exactly how it defines H-bonds. It does a good enough job of identifying secondary structure, so no one thinks about the problems.

C. Titus Brown jumped in at that point. Using the example of a software paper published in Science without the code being released, he pointed out that saying "just re-implement it independently" glosses over a lot of hard work with little reward:

[...] we'd love to use their approach. But, at least at the moment, we'd have to reimplement the interesting part of it from scratch, which will take both a solid reimplementation effort as well as guesswork, to figure out parameters and resolve unclear algorithmic choices. If we do reimplement it from scratch, we'll probably find that it works really well (in which case Iverson et al. get to claim that they invented the technique and we're derivative) or we'll find that it works badly (in which case Iverson et al. can claim that we implemented it badly). It's hard to see this working out well for us, and it's hard to see it working out poorly for Iverson et al.

But he also insisted that the code matters to science. To quote at length:

All too often, biologists and bioinformaticians spend time hunting for the magic combination of parameters that gives them a good result, where "good result" is defined as "a result that matches expectations, but with unknown robustness to changes in parameters and data." (I blame the hypothesis-driven fascista for the attitude that a result matching expectations is a good thing.) I hardly need to explain why parameter search is a problem, I hope; read this fascinating @simplystats blog post for some interesting ideas on how to deal with the search for parameters that lead to a "good result". But often the results you achieve are only a small part of the content of a paper -- methods, computational and otherwise, are also important. This is in part because people need to be able to (in theory) reproduce your paper, and in larger part because progress in biology is driven by new techniques and technology. If the methods aren't published in detail, you're short-changing the future. As noted above, this may be an excellent strategy for any given lab, but it's hardly conducive to advancing science. After all, if the methods and technology are both robust and applicable to more than your system, other people will use them -- often in ways you never thought of.

[snip]

What's the bottom line? Publish your methods, which include your source code and your parameters, and discuss your controls and evaluation in detail. Otherwise, you're doing anecdotal science.
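(One aside on "publish your parameters", because it's the cheapest part of that to get right: write the exact parameters, and ideally the code version, next to every set of results you generate. A minimal sketch of what I mean, with invented pipeline arguments, might look like this.)

    # run_pipeline.py -- a sketch of recording the exact parameters of a run.
    # The arguments (--k, --min-coverage) are invented; the point is that
    # every batch of results ends up with a run_parameters.json beside it.
    import argparse
    import json
    import subprocess
    from datetime import datetime, timezone

    def main():
        parser = argparse.ArgumentParser(description="hypothetical pipeline")
        parser.add_argument("--k", type=int, default=31)
        parser.add_argument("--min-coverage", type=float, default=2.0)
        parser.add_argument("--input", required=True)
        args = parser.parse_args()

        # Record what actually ran, so the numbers in the paper can be traced
        # back to one exact invocation.
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "parameters": vars(args),
            "git_commit": subprocess.run(
                ["git", "rev-parse", "HEAD"],
                capture_output=True, text=True,
            ).stdout.strip(),
        }
        with open("run_parameters.json", "w") as fh:
            json.dump(record, fh, indent=2)

        # ... the actual analysis goes here ...

    if __name__ == "__main__":
        main()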


I told you that story so I could tell you this one.


I want to point something out: Friedberg et al. are talking past each other because they're conflating a number of separate questions:

  1. When do I need to provide code? Should I have to provide code for a paper as part of the review process, or is it enough to make it freely available after publication, or is it even needed in the first place?

  2. If I provide it for review, how will I ensure that the reviewers (pressed for time, of unknown expertise, running code on unknown platforms) will even be able to compile this code and satisfy its dependencies, let alone actually see the results I saw?

  3. If I make the code available to the public afterward, what obligations do I have to clean it up, or to provide support? And how will I pay for it?

Let's take those in order, keeping in mind that I'm just a simple country sysadmin and not a scientist.

When do I need to provide code? At the very least, when the paper's published. Better yet, provide it for review -- even if it takes a gold-star badge to make that happen. There are too many examples of code being crucial to catching errors or fraud; let's not start carving out exceptions to this rule.

I should point out here that my boss (another real actual scientist and all), when I mentioned this whole discussion in a lab meeting, took issue with the idea that this was a job for reviewers. He says the important thing is to have the code available when published, so that other people can replicate it. He's a lot more likely to know than I am what the proper role of a reviewer is, so I'll trust him on that one. But I still think the earlier you provide it, the better.

(Another take entirely: Michael Eisen, one of the co-founders of the Public Library of Science, says the order is all wrong, and we should review after publication, not before. He's written this before, in the wonderfully-titled post "Peer review is f***ed up, let's fix it".)

How do I make sure the code works for reviewers? Good question, and a hard one -- but it's one we have answers for.

First, this is the same damn problem that autotools, CPAN, pip and all the rest have been trying to fix. Yes, there are lots of shortcomings in these tools and these approaches, but these are known problems with at least half-working solutions. This is not new!
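For a Python-based pipeline, that half-working solution can be as small as a setup.py that pins the versions you actually ran against, so a reviewer doing "pip install ." in a fresh virtualenv gets the same libraries you had. A minimal sketch -- the package name, dependencies and entry point are all invented for illustration:

    # setup.py -- a minimal sketch; every name and version here is made up.
    # With this in the repository, "pip install ." in a clean virtualenv
    # pulls in the same (pinned) libraries the authors ran against.
    from setuptools import setup, find_packages

    setup(
        name="ourlab-pipeline",              # hypothetical package name
        version="0.1.0",
        packages=find_packages(),
        install_requires=[
            # Pin the versions you actually used, so "works on my machine"
            # and "works on the reviewer's machine" mean the same thing.
            "numpy==1.7.1",
            "biopython==1.61",
        ],
        entry_points={
            "console_scripts": [
                # hypothetical command-line entry point
                "run-pipeline=pipeline.cli:main",
            ],
        },
    )

It won't solve compilers or system libraries, but that's exactly the gap the VM approach below papers over.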

Second, this is what VMs and appliances are good at. The ENCODE project used exactly this approach and provided a VM with all the tools (!) to run the analysis. Yes, it's another layer of complexity (which platform? which player? how to easily encapsulate a working pipeline?); no, you don't get a working 5000-node Hadoop cluster. But it works, and it works on anyone's machine.

What obligation do I have to maintain or improve the code? No more than you can, or want to, provide.

Look: inherent in the question is the assumption that you, the author, will get hordes of people banging on your door, asking why your software doesn't compile on Ubuntu Scratchy Coelacanth, or why it crashes when you send it input from /dev/null, or how come the man page is out of date. But for most Free Software projects of any description, that day never comes. They're used by a small handful of people, and a smaller handful than that actually work on them...until they no longer want to, and the software dies, is broken down by soil bacteria, returns to humus and is recycled into water snails. (That's where new Linux distros come from, btw.)

Some do become widely used. And the people who want, and are able, to fix these things do so because they need it to be done. (Or the project gets taken over by the Apache Software Foundation, which is an excellent fate.) But these are the exception. To worry about becoming one of them is like a teenage band being reluctant to play their first gig because they're worried about losing their privacy when they become celebrities.

In conclusion:

...Fuck it, I hate conclusions (another rant). Just publish the damned code, and the earlier, the better.