Yale has a number of high-performance compute clusters, and ruddle is the new cluster that will soon be available for the analysis of YCGA data. A description of the cluster and more information can be found on the Yale Center for Computing Resources pages. This page contains tips, advice and software tool descriptions for people who want to use the cluster for their analyses, as well as our current best practice procedures for WES and WGS sequencing (with RNA-seq, ChIP-seq and others coming soon).
Whole-Exome and Whole-Genome Sequencing Analysis
The bulk of the sequencing generated by the YCGA is whole exome sequencing, with whole genome sequencing as a small but growing portion. Our current best practice for analyzing these data follows the GATK best practice guidelines, using BWA-MEM and Picard MarkDuplicates for the initial processing of the sequencing reads and Annovar (plus additional annotations) for annotating the called variants. We are working to complete downstream analysis scripts (de novo calling, burden analysis, somatic variant calling), and can prioritize those by request.
- gatkExome.rms – Exome alignment and variant calling
- gatkGenome.rms – Whole genome alignment and variant calling
- vcfAnnotate – Annotating WES or WGS called variants
- denovoFilter – Filtering trios for de novo mutations
- somaticExome.rms – Somatic variant calling of tumor-normal exomes
- somaticGenome.rms – Somatic variant calling of tumor-normal whole genomes [requested]
- geneBurden – Gene burden analysis [in development]
- cnvExome.rms – CNV analysis of exome data [requested]
There are several utility tools that can be useful in completing WES and WGS analysis, which are available on bulldogn. Several of these tools originated from development at Yale by Murim Choi, and have been updated to integrate with the rest of the current toolset.
- plotReads – Revised p08PlotReads tool for visualizing BAM file alignments
- pcaHapmap – Revised pEigenstrat tool for PCA analysis of ancestry
- kinship – Kinship analysis
- bamstrip – Stripping extra tags from BAM files to reduce BAM size
- bam2cram – Converting BAM files into CRAM files
- bam2splitfastq – Converting WGS BAM files into split fastq files for reanalysis by gatkGenome.rms
- fastq2splitfastq – Converting reads from single-file WGS fastq files into split fastq files for reanalysis by gatkGenome.rms
On any system, the best way to get oriented is to find out (1) where the files are (and where you should do your work), and (2) where the existing software is. With this system, it also helps to know (3) how to use the cluster properly and (4) how to access and use the YCGA sequencing run data. These pages assume you are familiar with the Linux command line and know about environment variables like PATH and the .bashrc and .bash_profile configuration scripts. If you are not, read through some tutorials first, because you will need that knowledge to use the cluster.
- Files and Disk Areas
- General and Bioinformatic Software
- Using and Monitoring the Cluster
- ycgaFastq – Accessing YCGA Sequencing Run Data
Also, the following links provide more tips and advice on how to effectively work on the cluster when performing bioinformatic analyses.
- Using Disk Space Efficiently
- CRAM and Quip – Compressing BAM and FASTQ files
- Running Larger Compute Jobs Efficiently
Accessing and Using the Knightlab Software
There are a number of tools and pipelines that are provided and supported on ruddle for bioinformatic analysis, and they are described below. In order to use the tools, a little bit of configuration is required. First, the following directory needs to be added to your PATH environment variable (it contains all of the executables described below):
Second, run the following command to configure the RMS software for the cluster queues on ruddle:
cp /home/bioinfo/software/knightlab/rmsrc.ruddle ~/.rmsrc
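Putting both steps together, the one-time setup might look like the sketch below. The `KNIGHTLAB_BIN` path is a placeholder, not the documented location — substitute the actual knightlab bin directory on ruddle.

```shell
# One-time setup sketch. KNIGHTLAB_BIN is a placeholder path --
# substitute the actual knightlab bin directory on ruddle.
KNIGHTLAB_BIN=/path/to/knightlab/bin

# Add the executables to PATH for future logins, then reload the profile.
echo "export PATH=\"$KNIGHTLAB_BIN:\$PATH\"" >> ~/.bashrc
source ~/.bashrc

# Install the RMS configuration for the ruddle queues (from the step above).
cp /home/bioinfo/software/knightlab/rmsrc.ruddle ~/.rmsrc
```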
Many of the knightlab pipelines use the RMS software for their execution. RMS is bioinformatics pipeline software and a “scripting language” for quickly creating pipelines that can run across the cluster. You don’t have to know how RMS works in order to use the pipeline tools, but it is helpful to know what the output of those pipelines will look like while they are running. For more details, see http://rms.readthedocs.org.
Unlike other cluster workflow tools (like simplequeue), RMS pipelines are run directly from the command-line on the login nodes, just as you would normally run any other command. RMS automatically spawns off all of the computation jobs onto the cluster, and reports progress information about the currently running commands, as shown here:
[ruddle@ruddle2 testruddle]$ ~/knightlab/bin_Mar2016.ruddle/gatkExome.rms -n 5 -19 NA12878/
Input: 10 rows, 3 columns
Input: 1 row, 2 columns
Commands: 46 commands to be executed.
[Mon Feb 29, 3:14pm]: Pipeline execution starting.
[Mon Feb 29, 3:19pm]: markDupsAndIndelRealign: 0q,1r,0f,0c
When it is running, it shows each pipeline step that has actively running commands (in this case, the markDupsAndIndelRealign step), with counts of how many are ‘q’ueued, ‘r’unning, ‘f’ailed and ‘c’ompleted.
My recommendation is to use the screen command if your computation will take a while (this will let it run uninterrupted for as long as it needs to).
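For example, a typical screen workflow for a long-running pipeline might look like the following (the session name `exome-run` is arbitrary):

```shell
# Start a named screen session on the login node.
screen -S exome-run

# Inside the session, launch the pipeline as you normally would,
# then detach with Ctrl-a d; the pipeline keeps running after you
# log out.

# Later, reattach to check on progress:
screen -r exome-run

# List your sessions if you forget the name:
screen -ls
```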
By default, RMS will submit its jobs to the “default” queue, expanding and contracting the number of nodes it uses as the computation requires. You can limit that using the “-n” option of either RMS or the pipelines themselves. Here, I gave the “-n 5” option to have it use no more than 5 nodes (though at this point it is using only 1 node, because that is all the computation currently requires).