Hadoop and samtools
06 Nov 2012
I've been asked to revisit Hadoop at $WORK. About a year ago I got a small cluster (3 nodes) working and was plugging away at Myrna... but then our need for Myrna disappeared, and Hadoop was left fallow. The need this time around seems more permanent.
So far I'm trying to get a simple streaming job working. The initial script is pretty simple:
samtools view input.bam | cut -f 3 | uniq -c | sed 's/^[ \t]*//' | sort -k1,1nr > output.txt
This breaks down to:
- mapper.sh: samtools view
- reducer.sh: "cut -f 3 | uniq -c | sed ... | sort ..."
which, invoked Hadoop-style, should be:
```
hstream -input input.bam \
    -file mapper.sh -mapper "mapper.sh" \
    -file reducer.sh -reducer "reducer.sh" \
    -output output.txt
```
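For concreteness, here is a rough sketch of the reducer half of that split, wrapped in a shell function so it can be tried on a few hand-made SAM-style lines without samtools or a cluster. The function name `count_refs` and the sample records are made up for illustration; the pipeline itself is the tail end of the one-liner above, with `[[:space:]]` used in the sed step so the leading padding from `uniq -c` is stripped whether it is spaces or tabs.

```shell
#!/bin/sh
# Sketch of what reducer.sh does, wrapped in a function (count_refs is a
# made-up name). Reads SAM-style text on stdin, prints "count reference".
#
# Steps: take column 3 (the reference name), count adjacent duplicates,
# strip the padding uniq -c puts before the count, sort highest count first.
# uniq -c only merges adjacent lines, so this assumes records arrive
# grouped by reference (e.g. from a coordinate-sorted BAM).
count_refs() {
    cut -f 3 | uniq -c | sed 's/^[[:space:]]*//' | sort -k1,1nr
}

# Three fake records: one read on chr1, two on chr2.
printf 'r1\t0\tchr1\nr2\t0\tchr2\nr3\t16\tchr2\n' | count_refs
```

With real data the same function would sit at the end of `samtools view input.bam | count_refs`.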
Running mapper.sh and reducer.sh by hand works fine; the problem is that
under Hadoop, the job fails:
    2012-11-06 12:07:30,106 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
    2012-11-06 12:07:30,110 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
    2012-11-06 12:07:30,111 WARN org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
I'm unsure right now if that's [this error][3] or something else I've
done wrong. Oh well, it'll be fun to turn on debugging and see what's
going on under the hood...
...unless, of course, I'm wasting my time. A quick search
turned up a number of Hadoop-based bioinformatics tools
([Biodoop][4], [SeqPig][5] and [Hadoop-BAM][6]), and I'm sure there
are a crapton more.
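Back to the broken pipe for a moment: on the Java side that error generally means the child process closed its stdin or exited before Hadoop finished writing to it, and streaming hands the mapper its input on stdin rather than as a file argument. One cheap check before turning on debugging is to run the scripts the same way locally. The helper below is only a sketch of that idea; `simulate_streaming` is a made-up name, and a plain `sort` is a rough stand-in for what Hadoop does between the map and reduce steps.

```shell
#!/bin/sh
# Hypothetical helper: approximate a single-mapper, single-reducer streaming
# run locally. Input arrives on stdin, as it does for a streaming mapper.
simulate_streaming() {
    mapper=$1
    reducer=$2
    # Streaming sorts mapper output by key before the reduce step;
    # a byte-order sort of whole lines is a rough stand-in for that.
    "$mapper" | LC_ALL=C sort | "$reducer"
}

# Usage with the scripts from this post (paths assumed):
#   simulate_streaming ./mapper.sh ./reducer.sh < input.bam
```

If the mapper never reads its stdin, this should make that visible without a cluster. Note also that the sort reorders records by their leading field, so a reducer relying on `uniq -c` may no longer see references grouped together; the real shuffle has the same effect.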
Other chores:
* Duplicating pythonbrew/modules work on another server since our
cluster is busy
* Migrating our mail server to a VM
* Setting up print accounting with PyKota (latest challenge:
dealing with usernames that aren't in our LDAP tree)
* Accumulated paperwork
* Renewing lapsed support on a Very Important Server
Oh well, at least I'm registered for [LISA][7]. Woohoo!