RMS – Run My Samples

RMS is a “cluster scripting language” and execution engine that makes it easy to create computational pipelines and run them across a compute cluster.  The software takes a templated RMS script plus spreadsheet data files, generates commands by using the spreadsheet data to fill in the templates (e.g., once for each file, each sample, or each trio), and runs them on the current computer or across a cluster.
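The fill-in-the-template idea can be illustrated with a small Python sketch.  This is not RMS syntax or RMS code: the column names, the sample paths, and the use of Python's string.Template placeholders are all assumptions made for illustration, standing in for the real RMS template elements.

```python
import csv
import io
from string import Template

# Hypothetical spreadsheet data: one row per sample (column names are made up).
SHEET = """sample,fastq
T3565,T3565/reads.fastq
T3701,T3701/reads.fastq
"""

# Hypothetical command template.  The ${name} placeholders are Python's
# string.Template syntax, used here only as a stand-in for RMS template elements.
COMMAND = Template("wc -l ${fastq} > ${sample}.count")


def generate_commands(sheet_text, template):
    """Return one filled-in command string per spreadsheet row."""
    rows = csv.DictReader(io.StringIO(sheet_text))
    return [template.substitute(row) for row in rows]


if __name__ == "__main__":
    for command in generate_commands(SHEET, COMMAND):
        print(command)
```

Each spreadsheet row yields one command, which is the same shape as the "433 rows in, 693 commands out" summary RMS prints in the cluster example further down, although a real pipeline has several steps per row or group of rows.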

Calling it a language overstates the case a bit.  It really consists of a little extra syntax to organize the Bash, Perl, Python or R scripts of a pipeline, along with fill-in-the-blank “template elements” that are replaced when the commands to be executed are generated.  You write the pipeline steps in any combination of the four languages (whatever best fits each step’s implementation), and the syntax is designed so that language-specific editors ignore the RMS elements of the scripts, allowing you to use your favorite editor tools for pipeline development.

The following links provide initial examples and documentation on creating and running RMS scripts.  More links will be developed in the near future.

Using RMS on a Cluster

Executing RMS scripts across a cluster works differently than the typical execution of cluster jobs.  Traditionally, you create a job, perform the batch submission using qsub (or equivalent), and then keep checking back to see if it has completed (or get an email when it is completed).  RMS changes this, in that you execute the RMS script on the cluster login nodes, and have the main controlling program execute there until it is complete.  RMS handles all of the details of working with the cluster, so you are able to interact with it much like any shell command.

Note to Sysadmins:  The main RMS process just coordinates the execution across the cluster.  Scripts with over 10,000 commands have been run this way, and the main RMS program used only 50-100 CPU seconds in total over the 4-5 days those 10,000 commands took to execute.

By default, executing an RMS script will look something like this (this is an example GATK whole genome run of 8 genomes):

[user@login-0-1]$ rms -n highcore:3 ~/pipelines/gatkGenome.rms T3565* T370*
Sheet Input:  433 rows, 4 columns.
Commands:  693 commands to be executed.
[Wed Jan 07, 10:28pm]:Pipeline execution starting.
[Thu Jan 08, 8:33am]:     markDuplicatesPerLane[8]: 0q,4r,0f,4c     realignerTargetCreatorPerLane[8]: 0q,4r,0f,0c

The first line is the shell command running RMS, telling it to use up to 3 nodes of the “highcore” queue, running the RMS script “~/pipelines/gatkGenome.rms” and naming the sample directories containing the fastq files for 8 genomes.

The first two lines of the output summarize the rows and columns of the spreadsheet input and the number of commands to be executed.  The next two lines give information about the progress of the computation as it runs (each time a command starts or finishes, the last progress line is updated).  The timestamp lets you know whether the overall computation is making progress, and the rest of that last line tells you which steps of the pipeline are being executed (in this case the “markDuplicatesPerLane” and “realignerTargetCreatorPerLane” steps).  In that text, the number in brackets is the number of commands of that step that need to be executed, and the other numbers are the number of commands that are ‘q’ueued, ‘r’unning, ‘f’ailed and ‘c’ompleted for that step.
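To make the progress-line layout concrete, here is a small Python sketch that parses a line like the one above into per-step counts.  The regular expression is inferred only from the single example line shown here, so treat it as an assumption about the format rather than a specification; the function name and dictionary keys are hypothetical.

```python
import re

# Matches "stepName[total]: Nq,Nr,Nf,Nc", as seen in the example progress line.
STEP_PATTERN = re.compile(r"(\w+)\[(\d+)\]:\s*(\d+)q,(\d+)r,(\d+)f,(\d+)c")


def parse_progress(line):
    """Return {step name: counts} for every step found on a progress line."""
    steps = {}
    for name, total, queued, running, failed, completed in STEP_PATTERN.findall(line):
        steps[name] = {
            "total": int(total),
            "queued": int(queued),
            "running": int(running),
            "failed": int(failed),
            "completed": int(completed),
        }
    return steps


EXAMPLE = (
    "[Thu Jan 08, 8:33am]:     markDuplicatesPerLane[8]: 0q,4r,0f,4c"
    "     realignerTargetCreatorPerLane[8]: 0q,4r,0f,0c"
)

if __name__ == "__main__":
    for step, counts in parse_progress(EXAMPLE).items():
        print(step, counts)
```

Note that the four counts need not add up to the bracketed total: in the example, realignerTargetCreatorPerLane shows 8 commands but only 4 are accounted for, presumably because the rest have not yet been queued.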

TIP:  You are strongly encouraged to use the GNU “screen” command in conjunction with this software.  To make the software behave more like a normal shell command, interrupting the main RMS process (such as by hitting Ctrl-C at the terminal) automatically kills all of the actively running jobs on the cluster.  Combined with RMS showing you the last stderr lines of any command that fails, this makes it MUCH easier to develop and debug cluster computations.  But it also means that, unlike cluster batch submissions, if you log out before the computation is complete, it will die at that point.  Running the RMS script inside a screen session allows you to log out of the cluster while the computation is still running.
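The "controller stays in the foreground and cleans up on Ctrl-C" behavior can be sketched in Python as follows.  This is not RMS's implementation: local subprocesses stand in for cluster jobs, and the function name, the polling interval, and the optional on_poll callback are all hypothetical choices made for the illustration.

```python
import subprocess
import sys
import time


def run_until_done(commands, poll_seconds=0.1, on_poll=None):
    """Start every command and wait for all of them to finish.

    On Ctrl-C (KeyboardInterrupt), terminate every still-running child
    and return None.  Otherwise return the list of exit codes.
    """
    children = [subprocess.Popen(command) for command in commands]
    try:
        while any(child.poll() is None for child in children):
            if on_poll is not None:
                on_poll()  # e.g. print a progress line
            time.sleep(poll_seconds)
    except KeyboardInterrupt:
        for child in children:
            if child.poll() is None:
                child.terminate()
        for child in children:
            child.wait()  # reap the terminated children
        return None
    return [child.returncode for child in children]


if __name__ == "__main__":
    # Two trivial stand-in "jobs" run with the current Python interpreter.
    jobs = [[sys.executable, "-c", "print('job %d done')" % i] for i in (1, 2)]
    print(run_until_done(jobs))
```

Because the controlling process owns its jobs in this way, ending the terminal session ends the computation too, which is why running it inside a screen session matters.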

Installation

The software will be downloadable from GitHub.  It is written in Python and has no dependencies, so simply add its directory to your PATH environment variable and the software should be ready to go.  The hello1 and hello2 commands from the Hello World example can be used to test both your access to the software and the software’s access to the cluster.

Cluster Configuration

While the software can be used just by setting the PATH, a configuration file describing your cluster will likely be necessary before RMS can use it.  By default, RMS assumes your cluster uses the PBS/Torque manager, and has a single queue called “default”.
