gatkExome.rms – GATK Exome Pipeline

The gatkExome.rms pipeline runs the “best practices” GATK pipeline for aligning reads to the human genome and calling variants from exome data. It can handle the individual calling of single samples, or the joint calling of thousands of samples. It is able to align and call variants on either hg19 or the new hg38 human reference. And, it can run all of your samples in one batch, or incrementally as the data for samples is generated.

Basic Command-Line Format

The basic command-line for the pipeline is the following:

gatkExome.rms { -19 | -38 } sampleDir...

where either the “-19” or “-38 option is required, to tell the pipeline which human reference to use.  The “sampleDir…” arguments are one or more sample directories, as created by the ycgaFastq software.

The pipeline will align the reads using BWA MEM, mark duplicates using Picard and then run the GATK pipeline for indel realignment, quality score recalibration and variant calling. For each sample, it will create a final BAM file and GVCF file in the sample directory, then perform a joint calling of the samples (by default), creating an “exome_calls.vcf” file in the current working directory. The pipeline also creates an “exome_metrics.txt” file containing basic per-sample metrics for the samples (similar to, but not the same as, the Picard CalculateHsMetrics output).

The pipeline can used incrementally over time, running with some samples, then adding more samples and rerunning it). It will only compute on the new sequence data, and will not recompute existing bam or GVCF files. However, the joint variant calling and metric generation will only be computed if the “exome_calls.vcf” and “exome_metrics.txt” files do not exist, so you will need to delete or move those two files if you are rerunning the pipeline.

[Note: The ycgaFastq program acts in a similar incremental fashion, only adding new sequence data which does not exist in the sample directories, so for a larger project, you can maintain the single directory of all of the samples, adding new samples and sequencing data to it, and then using the gatkExome.rms pipeline to incrementally compute the results as more samples are added.]

Complete Command-Line Format

The pipeline has additional options to configure the execution of the pipeline, as well as two additional commands to handle larger datasets. The full command-line format is the following:

gatkExome.rms { -19 | -38 } [-v2] [-nj] [-hf] [-n #] sampleDir...
gatkExomeSample.rms { -19 | -38 } [-v2] [-n #] sampleDir...
gatkExomeCall.rms { -19 | -38 } [-nj] [-hf] [-n #] sampleDir...

The two additional commands, gatkExomeSample.rms and gatkExomeCall.rms, run the first and second halves of the pipeline, respectively.  gatkExomeSample.rms does all of the per-sample processing, starting with FASTQ files and generating the BAM and GVCF files for each sample.  This is useful to perform the bulk of the computation for partial datasets, without generating any variant calls on incomplete data.

gatkExomeCall.rms performs the variant calling for the samples, starting from the GVCF files.  Unlike the other two commands, it does not require the presence of the initial FASTQ files, and only requires the GVCF files to exist in each sample directory.  Also, it implements the CombineGVCFs batching strategy for very large datasets.  The gatkExome.rms cannot joint call more than ~1000 samples, while the gatkExomeCall.rms combines batches of 200 GVCFs together, and then joint calls on the combined GVCF’s, allowing it to joint call thousands of samples.

The full set of options for these commands are the following:

  • -19 – Use hg19 as the reference.  One of -19 and -38 must be given.  [Note:  Technically, this is the hs37d5 reference with decoy sequences, as recommended by GATK best practices.]
  • -38 – Use hg38 as the reference.  One of -19 and -38 must be given.
  • -v2 – Use the original V2 bed file as the target regions for calling and for metrics generation.  See below for details on the bed files used.
  • -nj – Do not perform joint calling, instead do variant calling of each sample separately.  If given, the VCF file for a sample will be generated in the sample directory, and will be called “sampleName_calls.vcf” where “sampleName” is the name of the sample.
  • -hf – Force the use of GATK’s hard filtering for the variant calling.  By default, if more than 3 samples are joint called, variant quality score recalibration is used for filtering the variant calls.  This option forces hard filtering for those joint calls.
  • -n # – Limit the number of compute nodes used in the computation to this number.  This is passed to the RMS software (it is the -n option of RMS), and so the format of “#” can be that of the RMS software (see the RMS documentation for more details).

Target/Metrics Bed Files

TBD

Methods

 

Using External Data

The gatkExome.rms pipeline is setup to use the directories generated by the ycgaFastq software, but the pipeline can be used with data that was not generated by the YCGA.  In this case, the sample directory structure must be created manually in order to match what the gatkExome.rms pipeline expects.

The structure that is generated by ycgaFastq is a directory for each sample, with the directory name equal to the sample name, containing a sub-directory named “Unaligned” which holds the FASTQ files.  The filenames given to the FASTQ files must be in a particular format, so that the pipeline can match the paired-end files together when aligning to the reference.  The format that gatkExome.rms expects is the following:

prefix_R#_###.fastq.(gz|qp)

where

  • “prefix” can be any text, and is typically the flowcell/lane information that uniquifies the fastq files from files from other runs (so that the FASTQ files from multiple runs can be written into the same directory and used together when running the pipeline).
  • “R#”  must be either “R1” or “R2” to identify the first and second halves of the paired-end read files. Single-end reads are permitted, but must be named as “R1” (with no matching file named “R2”).
  • “###” is a three digit number that uniquifies the files when multiple FASTQ files are generated for the same flowcell/lane/sample (as occurs in the Illumina sequencer output, where new files are generated for every 8 million reads).
  • “(gz|qp)” is the suffix of the filename, and can be either “gz” or “qp” for gzip or quip compression.  The pipeline can handle either gzip or quip compressed files, in any combination.  (Quip is a fastq-format aware compression tool that generates files 40% smaller than gzip’ed fastq files.)

An example of a directory for the NA12878 sample would be the following:

NA12878/
    Unaligned/
         NA12878_AH14UGADXX_L002_R1_001.fastq.gz
         NA12878_AH14UGADXX_L002_R1_002.fastq.gz
         NA12878_AH14UGADXX_L002_R1_003.fastq.gz
         NA12878_AH14UGADXX_L002_R2_001.fastq.gz
         NA12878_AH14UGADXX_L002_R2_002.fastq.gz
         NA12878_AH14UGADXX_L002_R2_003.fastq.gz

You can also use external GVCF files with the gatkExomeCall.rms pipeline. In that case, the only structure required is separate sample directories, with the directory names equal to the sample names, and each sample directory should contain a “sampleName.g.vcf” or “sampleName.g.vcf.gz” file (where “sampleName” is the name of the sample). No other files are required to run this pipeline.

Comments are closed.