Hadoop and samtools

I've been asked to revisit Hadoop at $WORK. About a year ago I got a small cluster (3 nodes) working and was plugging away at Myrna...but then our need for Myrna disappeared, and Hadoop was left fallow. The need this time around seems more permanent.

So far I'm trying to get a simple streaming job working. The initial script is pretty simple:

```
samtools view input.bam | cut -f 3 | uniq -c | sed 's/^ *//' | sort -k1,1nr > output.txt
```

This breaks down to a mapper (the `samtools view | cut` half, emitting one reference name per alignment) and a reducer (the `uniq | sed | sort` half, producing the sorted counts). Invoked Hadoop-style, that should be:

```
hstream -input input.bam \
    -file mapper.sh -mapper "mapper.sh" \
    -file reducer.sh -reducer "reducer.sh" \
    -output output.txt
```
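For reference, here's my best guess at the contents of the two scripts (the filenames match the invocation above, but what's inside them is an assumption on my part), sketched as shell functions so each half can be tried in isolation:

```shell
#!/bin/sh
# Presumed contents of the two streaming scripts, one function per script.

# mapper.sh: emit the reference name (column 3) of every alignment.
# Under streaming the records arrive on stdin, so samtools reads "-"
# rather than a fixed input.bam; samtools must be on each task node's PATH.
mapper() {
    samtools view - | cut -f 3
}

# reducer.sh: count runs of identical reference names, strip uniq's
# leading padding, and sort by count, descending.  uniq -c only needs
# its input grouped, which holds for a coordinate-sorted BAM.
reducer() {
    uniq -c | sed 's/^ *//' | sort -k1,1nr
}
```

The reducer half can be checked without Hadoop or samtools at all: `printf 'chr1\nchr1\nchr2\n' | reducer` prints `2 chr1` and then `1 chr2`.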


Running the mapper.sh/reducer.sh files works fine; the problem is that
under Hadoop, it fails:

```
2012-11-06 12:07:30,106 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2012-11-06 12:07:30,110 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2012-11-06 12:07:30,111 WARN org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
```


I'm unsure right now if that's [this error][3] or something else I've
done wrong.  Oh well, it'll be fun to turn on debugging and see what's
going on under the hood...
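For what it's worth, one classic way to earn that exception from PipeMapRed is a mapper that exits before draining its stdin, so the framework's next write hits a closed pipe. The failure mode is easy to reproduce in plain shell (a simulation of the condition, not the actual Hadoop code path): `seq` plays the framework writing records, `head` plays a mapper that quits early.

```shell
#!/bin/sh
# Simulate PipeMapRed feeding records to a mapper that exits early.
# seq keeps writing after head has consumed one line and exited, so
# seq's next write fails with EPIPE -- the same condition the Java
# side surfaces as "java.io.IOException: Broken pipe".
seq 1 100000 | head -n 1
```

This prints `1` and nothing else; the writer dies quietly of SIGPIPE.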

...unless, of course, I'm wasting my time.  A quick search
turned up a number of Hadoop-based bioinformatics tools
([Biodoop][4], [Seqpiq][5] and [Hadoop-BAM][6]), and I'm sure there
are a crapton more.

Other chores:

* Duplicating pythonbrew/modules work on another server since our
  cluster is busy
* Migrating our mail server to a VM
* Setting up print accounting with PyKota (latest challenge:
  dealing with usernames that aren't in our LDAP tree)
* Accumulated paperwork
* Renewing lapsed support on a Very Important Server

Oh well, at least I'm registered for [LISA][7].  Woohoo!