Graphs, Alignments, Variants and Annotations, pt. 2

Instead of heading towards the more theoretical graph design, the day after writing part 1 of what is turning out to be a series, I focused on concrete software changes that might answer the first question I posed in the previous post (“If you recast the problem as that of calling ‘annotated variants’, can you speed up the current pipelines?”), and that is what this post will be about.  The more theoretical graph design will be coming, but I’m still thinking through Heng Li’s update describing previous work on graph algorithms, as well as a recent slide deck he presented (the picture of where he is coming from is becoming clearer…).

I also decided to do something “foolish” for this part of the work.  On a regular basis, people lament that they want better bioinformatics tools, and that all they get are new tools (most of which may or may not be better, because they have only been run on the datasets needed to publish the papers).  That leads others to reiterate all of the reasons why academic bioinformatics software is written the way it is (papers and citations are all-important, it is difficult to publish updated software, and there is no funding for maintenance or improvement).

All of which is true, and is my situation right now, but since I’m still in my honeymoon period here at Yale, let’s go make some “better” bioinformatics tools and see what happens.

The Results

The tool I chose is (no surprise) BWA MEM, and to pique your interest (or really, just to have you not stop reading when I talk about all that boring algorithm/software stuff), I’ll start with the results.  I took the current development head of BWA, made the changes described below, and then ran it on the NA12878 exome and genome datasets also used by Blue Collar Bioinformatics in comparing variant calling pipelines.


The first two columns are the “run times” of the standard bwa mem on the exome and genome data.  The middle two columns are the “run times” of the updated BWA, given the exome’s BED file of target regions.  The right two columns are the “run times” of the updated BWA, given a BED file of all of the “flagged” SNPs in dbSNP138 (these are the clinically actionable variants).

And, by “run time”, what I’m reporting here are the interval times that BWA MEM reports as it processes each batch of reads, which are very consistent across the batches.  The total running time is linear in this interval time and the size of the dataset (since I’m still in development mode, as you’ll see below, quick but representative times work better…when development is done, I’ll do timings and results for the whole process).

So, the total running times are not quite as significant as they appear in that chart for WGS datasets, since you have to multiply by the difference in data size (the recent 30x HiSeq X Ten dataset for NA12878 is approx. 10 times larger than this exome dataset):


An improvement for both exomes and genomes, yes, but can we do better?  The answer to that is also yes, I believe, but it will take answering the questions at the end of the post and the completion of the software development.

The Methods

The set of changes to bwa was on the simpler side, without any substantial algorithmic changes.  An option was added to bwa mem to pass a BED file of target intervals to the algorithm, and checks were added to stop the alignment process when a read will not align in one of those intervals.

The given intervals are expanded by 300 bp on each side, to ensure that there are few to no boundary effects for the paired-end reads mapping to the actual intervals.  The data structure used for doing the interval checks is discussed in the next section, because it may also be used to speed up the variant calling process later in the pipeline.
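The actual changes are in bwa’s C code; as a minimal illustrative sketch of this interval handling (the function names are mine, not bwa’s), you can expand each BED interval by the 300 bp pad, merge any overlaps, and then test a candidate alignment span against the merged list with a binary search:

```python
import bisect

PAD = 300  # flank added to each target interval, per the post

def expand_and_merge(intervals, pad=PAD):
    """Expand each 0-based half-open (start, end) interval by pad bp and merge overlaps."""
    merged = []
    for start, end in sorted(intervals):
        start, end = max(0, start - pad), end + pad
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def overlaps_target(merged, starts, start, end):
    """True if [start, end) overlaps any merged interval (starts = [s for s, _ in merged])."""
    i = bisect.bisect_left(starts, end)  # index just past the last interval starting before `end`
    return i > 0 and merged[i - 1][1] > start
```

Because the merged intervals are sorted and non-overlapping, checking the single interval that starts closest before the span’s end is sufficient, so each check is O(log n).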

In the main algorithm, checks were added at the following points (if you don’t know what SMEMs, MEMs or chains are, either read the BWA and BWA MEM papers…or just think of SMEMs and MEMs as the initial exact matches and chains as the larger, potentially gapped, matches):

  • At the first SMEM, if it covers the entire trimmed region (which means that there is an exact match to the “whole” read)
  • After the generation of MEMs
  • After the generation of chains
  • After the generation of the final alignment

Note:  For the first check to be effective, bases at the 5’ and 3’ ends with quality score less than 5 are “trimmed” for the identification of SMEMs and MEMs (the full read is used for the rest of the algorithm and in the output).  Without this trimming step, the test to see if the whole read exactly matches the reference rarely succeeds, and the overall run time doubles.
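That trimming step can be sketched as follows (the quality cutoff of 5 comes from the post; the function name is mine):

```python
QUAL_CUTOFF = 5  # bases below this quality score are trimmed at the ends, per the post

def trim_bounds(quals, cutoff=QUAL_CUTOFF):
    """Return (lo, hi) so that quals[lo:hi] drops the low-quality 5' and 3' ends.

    Only the SMEM/MEM identification uses the trimmed span; the full
    read is still used for the rest of the algorithm and the output.
    """
    lo, hi = 0, len(quals)
    while lo < hi and quals[lo] < cutoff:
        lo += 1
    while hi > lo and quals[hi - 1] < cutoff:
        hi -= 1
    return lo, hi
```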

If a read fails a check (i.e., all of its SMEMs, MEMs, chains or alignments are outside the target intervals), the computation stops for that read at that point, and no output is generated for it.  Also, reads are dropped from the output if either they are unaligned or the pair has a mapping quality score of 0.

Note: This last condition reflects a long-standing belief on my part that true repeat reads have no place in a variant calling application, because either the read exactly matches the multiple genome locations (in which case, there are no variants), or any sequence difference that might lead to a variant call cannot be attributed to a specific genome location, and so the resulting variant is likely to be an error.  And, as Erik Garrison (lead author of freebayes) notes near the bottom of a response about trimming recommendations, he is currently filtering out these pairs in the pipeline he runs.  But, I would like to hear from someone who does use these reads in variant calling.

Next Steps

Okay, so we’ve gotten a decent speedup of the algorithm for variant calling of exome regions, disease gene panels or local regions like the clinically actionable variants.  But, can it be better?

The reason I think it could is that there is now a hotspot in the computation, specifically the calculation of the initial SMEM for each read.  The “run time” when you short-circuit the computation just before that step is ~1 (instead of 7-10 when given BED files).  So, it looks like up to 50-70% of the current computation could be eliminated for WGS datasets, if we can make a very quick check for off-target reads.

My question is whether you can flag sections of the BWT index to signal that a read is aligning somewhere else in the genome, and short-circuit the SMEM computation at that point.  Or, can you construct another data structure that can quickly determine that a read does not align to any of the target intervals, without computing the full SMEM?  I believe the answer to that is yes, but it is not clear to me how to do it now.  Any ideas from the BWT algorithm experts?  If you haven’t seen it yet, the methods described above, and the answer to this question, are applicable to any of the current mapping tools (Bowtie, MOSAIK, Novoalign, …) and should provide a similar speedup.

Moving over to the other major computation: variant calling.  (With this solution, the sorting, duplicate removal and indexing steps for WGS datasets will be on par with WES datasets, since only the reads aligning in the target intervals make it into the BAM file.)  My thought here is that, for exomes, the caller computes over 40-60 million bases of the reference, but only ends up with 40-50 thousand variants.  Can we, prior to variant calling, subset the target intervals to only those that have enough reads differing from the reference, and give the variant caller just that subset?

I’m of two minds about how to do that.  Initially, my thought was to build it into the aligner: since we are seeing all of the reads and their alignments there, we can piggyback this computation without any additional I/O.  Hence, the data structure I’m using right now to mark the target intervals in the modified bwa is an array of uint8_t’s, one for each base of the reference, and the software parses the CIGAR strings of the output read alignments and counts the differences at each reference position.  You can then scan the counts to determine which subset of regions have enough differences to make it worthwhile to run the variant caller on them (and more importantly, skip the regions without enough differences).
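As an illustrative sketch of that counting step (not the C code in the modified bwa): the helper below assumes an extended CIGAR that uses =/X operations, since a plain M operation does not distinguish match from mismatch — with M-style CIGARs you would need the MD tag to locate the mismatches.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def add_read_diffs(counts, pos, cigar):
    """Bump per-base difference counters for one alignment.

    counts is a bytearray over the reference (the uint8_t-per-base array
    from the post); pos is the 0-based leftmost reference position of
    the alignment.  Assumes an extended CIGAR using =/X for
    match/mismatch; this is a sketch, not bwa's implementation.
    """
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "=M":                 # aligned bases, no difference recorded
            ref += length
        elif op == "X":                # mismatch: one difference per reference base
            for i in range(ref, ref + length):
                counts[i] = min(255, counts[i] + 1)
            ref += length
        elif op == "I":                # insertion: count it at the current reference position
            counts[ref] = min(255, counts[ref] + 1)
        elif op == "D":                # deletion: one difference per deleted reference base
            for i in range(ref, ref + length):
                counts[i] = min(255, counts[i] + 1)
            ref += length
        elif op == "N":                # reference skip: consumes reference, no difference
            ref += length
        # S, H and P operations consume no reference bases
```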

Note:  Yes, I know that a 1-byte counter for each reference position is 3 gigabytes of memory for human references, and is overkill.  The real data structure will be much more efficient, once we figure out what the actual algorithm is going to be.  (Initially, I partitioned the reference into 10 base windows, and kept counts for each window.  But, when I started thinking about how to compute the difference regions, the window-based counts didn’t fit well.)

My current thought is to run a 10bp sliding window to identify those windows with either > 1 difference at a reference position or more than 6 columns with 1 difference.  But, these are preliminary thoughts right now, since the focus was on the alignment step.
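Those preliminary thresholds can be sketched as follows (the helper name is mine, and I’m reading “columns with 1 difference” as columns with at least one difference; overlapping window starts would then be merged into intervals before handing them to the variant caller):

```python
WINDOW = 10  # window width from the post

def candidate_window_starts(counts, window=WINDOW):
    """Yield start positions of 10 bp windows worth passing to the caller.

    A window qualifies if some position saw more than one difference, or
    more than 6 positions each saw a difference.  Both thresholds come
    from the post and are preliminary.
    """
    for start in range(len(counts) - window + 1):
        win = counts[start:start + window]
        if max(win) > 1 or sum(1 for c in win if c >= 1) > 6:
            yield start
```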

But, as I started writing this last section, I couldn’t come up with any reason why this target interval recalculation can’t be a separate step of the pipeline as a whole (possibly between duplicate removal and variant calling), creating a trimmed interval list to give to the variant caller.  Scanning a BAM file once is not that computationally intensive.  Has anyone done one of these, or do the variant calling algorithms already do that kind of filtering right now?  This is the next speedup step, after the alignment computation has been optimized.

And then, finally, after the above implementations are complete, we need to check the alignments in the target intervals and the final variant calls to make sure we’re not dropping any relevant reads or calls because of the speedups.  The next post on the software improvements will report on all of this.

But, I think my next post in the series will turn back to the graph design question, and think through in detail what the graph format needs to supply to the downstream tools (and so what needs to be in there).  Until then…

Posted in Bioinformatic Scribblings | 4 Comments

Graphs, Alignments, Variants and Annotations, pt. 1

Note:  This post, and following posts, were triggered by the recent post by Heng Li, describing his proposed GFA format for assembly graph data.  After a twitter request for comments, my first tweet was that this was the right person to do the format.  My second tweet, after reading the post, was that the format “was a miss…”  Heng rightly asked me to explain, and this begins that explanation.

In recent years, there has been a growing trend toward developing a standard file format for assemblies, specifically a graph-based format (since all of the assemblers’ internal data structures hold a graph of one sort or another). A number of years ago, the AMOS project made the attempt; two to three years ago, it was FASTG; and now there is this GFA format proposal.

When the FASTG format was really being talked about at several conferences two years ago, where by luck or by happenstance most of the lead authors of the major NGS assemblers were present, I came really close to commenting on the format, but refrained. The main reason is that I didn’t see anything wrong with the format itself…it does a very good job capturing the structure that nearly all of the assemblers build, and is a usable format for organizing the assemblies’ data. The thoughts in my mind were all about who was doing the design, and, in a related way, why the FASTA format that we all know and love (or don’t love) is called the FASTA format.

The FASTG design was being done by the assembler writers, as a way of standardizing their output and saying to the rest of the community “here is how our data is really structured, and it’s more than just a set of contig sequences or scaffold sequences”. A perfectly legitimate thing to do, from people I respect and admire, and the result could easily be used as the standard format for assembly results. But, what I was thinking at the time was “Who is going to use it?”.

A Sidebar about FASTA

My actual thought was an attempt to be pithy. “The FASTA format isn’t called the FASTA format because Bill Pearson wrote a basecaller and this was his output format. He wrote the downstream tool, and this format was his way of trying to deal with all of the basecallers out there (each with its own format).” As with most pithy statements, it contains a core of truth, but isn’t quite accurate.

There are two main reasons you still know about the FASTA format today. The first is that this “query sequence” format provided a simple way to describe sets of sequences. But, my guess (and I don’t know this for sure) is that the real reason they needed it was not the various Sanger read chromatogram formats, but all of the protein sequence formats that existed at the time (formats using 1-letter alphabets, 3-letter alphabets, or describing proteins by their 3-D structure). Since the FASTP/FASTA suite handled both protein and DNA comparisons, this was a way to simplify the description of protein and DNA sequences into what was needed for sequence comparison.

The second reason you know the name is that the FASTP/FASTA software was the standard bioinformatic alignment tool for sequence database searching (pre-BLAST). The software (from Lipman and Pearson and Pearson and Lipman) implemented the “k-tuple” matching concept of Wilbur and Lipman (i.e., the fundamental k-mer matching that we know today), along with a dynamically growing banded alignment algorithm, where the region of the alignment is progressively extended until the most significant alignment is found (i.e., computing the alignment of a larger region of the dynamic programming graph would likely not find a better alignment). That combination provided more sensitive results, at a much faster speed, than any other software out there. And, all in a package that was easy to install and use.
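For the curious, the k-tuple seeding idea at the heart of that software can be sketched in a few lines (the choice of k, the function names, and the details are illustrative, not the original implementation):

```python
from collections import defaultdict

def ktuple_index(ref, k):
    """Map every length-k substring of ref to its list of start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def seed_hits(query, index, k):
    """Yield (query_pos, ref_pos) exact k-tuple matches -- the seeds that
    the banded alignment stage would then extend and score."""
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            yield i, j
```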

My Reaction to the GFA Proposal

But, back on point… My main reaction to the FASTG format was that the format really needs to come from the downstream tool makers, not from the assembler writers themselves. So, when I saw that Heng Li (for those of you who don’t know, lead author of MAQ, SAMtools and BWA) had posted a proposed format for graph assembly results, my first reaction was that, finally, a downstream tool author was proposing a format, and I was hoping for something along the lines of “Given the diploid nature of large genomes like human, and the clear need for a graph-based organization of the alternative structural variations of any eukaryotic genome “reference”, here is the file format I need to produce the fastest and most accurate sequence alignments in BWA”.

But, that was not the objective of the GFA format, and hence my second tweet about it. My thoughts over the last several days have turned to the questions of what file format should be developed, and can the lessons of what made FASTA (the format and the software) successful be applied here?

My Thought

The answer that I came up with is that of finding “annotated variants” for large gene panel, WES or WGS datasets. The world of variant calling has transitioned from the 1000 Genome days of discovering and cataloging human variants to the “clinical” sequencing days of identifying functionally relevant variations that can explain (or are associated with) specific diseases or phenotypes.

But, the variant calling pipelines have largely remained the same. You align all of your reads against the whole genome, remove duplicates, possibly do quality score recalibration, then call variants, annotate those variants and filter/tier the annotated calls. The computation time is significant, and as Heng Li rightly notes in his answer to a SeqAnswers thread, “Long running time is just the price of getting a lot of data. There are faster ways, but they are less accurate in general. Learn to parallelize. You will need that all the time in the NGS era.”

So, here’s my thought. If we recast the problem as one of aligning and variant finding for only specific locations/regions of the genome (i.e., the clinically actionable variants, a disease’s related genes, the exome, the evolutionarily conserved locations in the genome), could we compute those more quickly than the current method? What would the reference file format need to be to support this kind of computation? If this does work, would that lead to the ability to tier the regions of the genome, so that an answer for a tier permits you to stop the computation (i.e., where computing wider tiers of the genome would not lead to an improved set of annotated variants)?

I believe the answer to these questions (except possibly the last one) is yes, and the followup parts of this thread will explore these questions in more detail.

But What About…?

What does this have to do with an assembly graph file format like FASTG or GFA? My answer to that is threefold.

First, don’t think that the real “reference” for an annotated variant caller is just going to be a FASTA file plus a VCF or BED file.  Above and beyond SNPs and local indels, individual human genomes contain hundreds of different structural variants compared to any reference sequence you can create.  To represent the full complexity of our DNA sequences across the population (in order to map them as well as possible), you are going to need a full graph-like structure to describe how certain exons and genes relate to other regions of the genome.  For what that is, see part three of my answer below.

Second, there is a fallacy that has grown around variant calling pipelines (and is currently gaining traction about de novo assemblers) that their computational components can be treated like Unix commands, like linkable boxes that can be organized and pipelined in pretty much any combination. This is a fallacy because the algorithmic problems are fundamentally hard. The algorithmic components of these pipelines make assumptions and shortcuts in order to compute their results in a reasonable amount of time. When those assumptions and shortcuts are not complementary with each other (work together to address each other’s shortcomings), those pipelines will not work well. And, simply trying all combinations to see “what works best” will create winners, but they won’t be robust across a wide range of datasets, and you’ll always be wondering why it works well for some datasets but falls down for others. There is no shortcut to creating a cohesive set of algorithms developed within the context of the properties of the other algorithms in the pipeline. So, while a FASTG or GFA format is a nice standard to have, it won’t significantly improve de novo assemblies, and you will just have to wait for a new assembler to “dazzle” you.

And finally, third, to quote Indiana Jones in Raiders of the Lost Ark, “I don’t know, I’m making this up as I go along…” Part two is coming, let’s see where it goes…

Posted in Bioinformatic Scribblings | 3 Comments

The Site is Live

The website for the Knight Lab is active and live today.  Now, all we need is some content…here, content, come here…

Posted in News and Updates | Leave a comment