Packaging science

15 Apr 2014

So the other day I was asked to help get a bioinformatics tool working. Tarball was up on Sourceforge, so it shouldn't be a problem, right? Right. Download, skim the instructions, run "make" and we're done. Case closed!

Only I had to look. Which was a mistake. Because inside the tarball was another tarball. It was GNU coreutils, version 8.22. Which was dutifully compiled and built as part of the toolchain. It was committed about 18 months ago because:

this will create a new sort that is used by chrysalis to run sort in parallel speedup on hour system running a 13g dataset was from 46min to 6min runtime

That is a significant speedup. Yes. And sure, it's newer than the version in the last Ubuntu LTS (8.13), and way newer than the version in CentOS 5 (5.97). But that is a tarball, even if it is only 8 MB, in the Subversion repo for a project that was published in Nature Protocols. Why in hell wasn't it written up as a dependency in the README? So yeah, I got angry: "I think I'm gonna submit a patch with an Ubuntu ISO in it, see if they accept it."

I'm struggling with what to write here. This is bad practice, yes, but what constructive, helpful alternative do I have to offer? The scientists I work with are brilliant, smart people who do amazing research, but their knowledge of proper (add scare quotes if you like) development practice is sorely lacking. It's not their fault, and folks like Software Carpentry are doing the angel's work to get them up to speed. But riddle me this: if you're trying to get a tool into the hands of a pretty new Linux user -- one who's going to base the next 18 months of their work on how well your tool works -- how do you handle this sort of thing?

Mark it in the README? That's great if they've got a sysadmin, and Lord knows they should...but there are many that don't, or it's the grad student in the corner doing the work and they're more focussed on their thesis. (That's not a criticism.)
Throw an error? Maybe, or maybe a warning if it's less than version umptysquat. That gets into all sorts of fun version parsing problems.
Distribute a VM? Maybe -- but read C. Titus Brown's comments on this. Plus, if we wince at the idea of telling a newbie "Just go get it installed", imagine our faces when we tell them "just go get the VM and run it." Ditto Docker, Vagrant or whatever new hotness we cool kids are using these days.
Ports tree? Now we're getting somewhere. All we need to do is have a portable, customizable, easily-extended ports tree that works for lots of different Linux distros and maybe Unices. Hear that sound? That's the NetBSD ports tree committer berserkers coming for your brains. Because that work is HARD, and they are damned unappreciated.
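
For what it's worth, here is roughly what the "throw a warning" option looks like, as a minimal Python sketch rather than anything the tool in question actually does. The names (parse_version, installed_sort_version, check_sort) are made up for illustration, and the MINIMUM threshold is an assumption; it just reuses the 8.22 from the bundled tarball. The point is the parsing problem: compare "8.6" and "8.13" as strings or floats and you get the wrong answer, so the versions have to become tuples of integers first.

```python
import re
import shutil
import subprocess

# Hypothetical threshold for illustration; substitute whatever release
# your tool actually needs.
MINIMUM = (8, 22)


def parse_version(banner):
    """Pull the first dotted number out of text like 'sort (GNU coreutils) 8.22'."""
    match = re.search(r"(\d+(?:\.\d+)+)", banner)
    if match is None:
        return None
    # Integers, not strings: (8, 6) < (8, 13), but "8.6" > "8.13".
    return tuple(int(part) for part in match.group(1).split("."))


def installed_sort_version():
    """Return the version tuple of the `sort` on PATH, or None if unknown."""
    if shutil.which("sort") is None:
        return None
    try:
        result = subprocess.run(
            ["sort", "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0 or not result.stdout:
        return None
    return parse_version(result.stdout.splitlines()[0])


def check_sort(minimum=MINIMUM):
    """Return a one-line status message; warn instead of failing outright."""
    found = installed_sort_version()
    wanted = ".".join(str(part) for part in minimum)
    if found is None:
        return "warning: could not work out which sort is installed"
    have = ".".join(str(part) for part in found)
    if found < minimum:
        return f"warning: sort {have} found, {wanted} or newer recommended"
    return f"ok: sort {have}"


if __name__ == "__main__":
    print(check_sort())
```

Even this much only handles the friendly cases. A sort that doesn't understand --version, or that prints its banner in some other shape, falls through to the "could not work out" branch, which is exactly the kind of fun I was talking about.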

We have no good alternative to offer. I can be snotty all I want (confession: IT'S SO MUCH FUN) but the truth is this is a hard problem, and people who just want to get shit done are doing it the best they can because they just want to get shit done. We have -- are -- failing them. And I don't know what to do.

2 Comments

From: Wout Mertens
16 April 2014 20:34:47

You don't want ports, you want Nix :-) http://nixos.org/nix/

From: Victor
19 April 2014 00:23:23

There are some researchers working on improving reproducibility of results. The general idea is to have a way to describe every element of a scientific workflow (software, dependencies), and to keep the raw data AND potentially intermediate artifacts as well as the results. In addition to someone else being able to reproduce your results, there's a practical payoff: if you used 3 tools that transformed your data and fed intermediate results into each other, X -> Y -> Z, and you realized that you'd got some parameters in Y wrong, you could re-use the original output from X when you ran Y again instead of starting from scratch. Rolling this from scratch would take a LOT of discipline, tool knowledge and disk space. Making this approach generally useful will probably require agreed-upon standards for how to describe scientific workflows, and some tools that implement those standards. Looking at the current state of software tools in, say, bioinformatics, it kind of feels like a pipe dream. But people are still trying. :) Example of an attempt in this general area: http://pegasus.isi.edu/.
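
The re-use idea in Victor's comment can be sketched in a few lines of Python. This is a toy under loud assumptions, not how Pegasus or any real workflow system does it: cached_step, run_pipeline and the three step functions are all hypothetical, the data has to be JSON-serializable so it can be hashed, and the "artifacts" are just pickle files in a directory. Each step's cache key covers its name, its parameters and its input, so changing a parameter in Y re-runs Y and Z but leaves X's stored output alone.

```python
import hashlib
import json
import pickle
from pathlib import Path

calls = []  # records which steps actually ran, for demonstration only


def cached_step(cache_dir, name, func, upstream, params):
    """Run func(upstream, **params), re-using a stored result when possible."""
    # The key changes whenever the step name, its parameters or its input do.
    key_material = json.dumps([name, params, upstream], sort_keys=True)
    key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{name}-{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = func(upstream, **params)
    path.write_bytes(pickle.dumps(result))
    return result


def step_x(data, scale):
    calls.append("X")
    return [value * scale for value in data]


def step_y(data, offset):
    calls.append("Y")
    return [value + offset for value in data]


def step_z(data):
    calls.append("Z")
    return sum(data)


def run_pipeline(cache_dir, raw, offset):
    """X -> Y -> Z, with every intermediate result kept in cache_dir."""
    x_out = cached_step(cache_dir, "X", step_x, raw, {"scale": 2})
    y_out = cached_step(cache_dir, "Y", step_y, x_out, {"offset": offset})
    return cached_step(cache_dir, "Z", step_z, y_out, {})
```

Run run_pipeline twice against the same directory with a different offset and only Y and Z execute the second time. The discipline and disk space Victor mentions are visible even here: nothing ever cleans the cache, and nothing notices if step_x itself changes.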
