Note: This post, and the following posts, were triggered by the recent post by Heng Li describing his proposed GFA format for assembly graph data. After a Twitter request for comments, my first tweet was that he was the right person to do the format. My second tweet, after reading the post, was that the format “was a miss…” Heng rightly asked me to explain, and this begins that explanation.
In recent years, there has been a growing trend toward developing a standard file format for assemblies, specifically a graph-based format (since all of the assemblers’ internal data structures hold a graph of one sort or another). A number of years ago it was the AMOS project that made the attempt; two to three years ago it was FASTG; and now comes this GFA format proposal.
When the FASTG format was really being talked about at several conferences two years ago (where, by luck or by happenstance, most of the lead authors of the major NGS assemblers were present), I came really close to commenting on the format, but refrained. The main reason is that I didn’t see anything wrong with the format itself…it does a very good job capturing the structure that nearly all of the assemblers build, and is a usable format for organizing the assemblies’ data. The thoughts in my mind were all about who was doing the design, and, in a related way, why the FASTA format that we all know and love (or don’t love) is called the FASTA format.
The FASTG design was being done by the assembler writers, as a way of standardizing their output and saying to the rest of the community “here is how our data is really structured, and it’s more than just a set of contig sequences or scaffold sequences”. A perfectly legitimate thing to do, from people I respect and admire, and the result could easily be used as the standard format for assembly results. But, what I was thinking at the time was “Who is going to use it?”.
A Sidebar about FASTA
My actual thought was an attempt to be pithy. “The FASTA format isn’t called the FASTA format because Bill Pearson wrote a basecaller and this was his output format. He wrote the downstream tool, and this format was his way of trying to deal with all of the basecallers out there (each with its own format).” As with most pithy statements, it contains a core of truth, but isn’t quite accurate.
There are two main reasons you still know about the FASTA format today. The first is that this “query sequence” format provided a simple way to describe sets of sequences. But, my guess (and I don’t know this for sure) is that the real reason they needed it was not various Sanger read chromatogram formats, but all of the protein sequence formats that existed at the time (formats using 1-letter alphabets, 3-letter alphabets, or describing proteins by their 3-D structure). Since the FASTP/FASTA suite handled both protein and DNA comparisons, this was a way to simplify the description of protein and DNA sequences into what was needed for sequence comparison.
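That simplification is essentially the whole format: a header line beginning with “>” followed by sequence lines. A minimal parser (my own sketch here, not any official implementation) fits in a few lines:

```python
def parse_fasta(text):
    """Minimal FASTA parser: '>' introduces a header line; the record
    name is the first word after '>'; subsequent non-blank lines are
    sequence, concatenated until the next header."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None and line.strip():
            records[name].append(line.strip())
    return {n: "".join(parts) for n, parts in records.items()}

print(parse_fasta(">seq1 an example read\nACGT\nTTAA\n>seq2\nGGCC\n"))
```

The fact that a complete parser is this short is exactly why the format displaced the more elaborate sequence formats of the day.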
The second reason you know the name is that the FASTP/FASTA software was the standard bioinformatic alignment tool for sequence database searching (pre-BLAST). The software (from Lipman and Pearson and Pearson and Lipman) implemented the “k-tuple” matching concept of Wilbur and Lipman (i.e., the fundamental k-mer matching that we know today), along with a dynamically growing banded alignment algorithm, where the region of the alignment is progressively extended until the most significant alignment is found (i.e., computing the alignment of a larger region of the dynamic programming graph would likely not find a better alignment). That combination provided more sensitive results, at a much faster speed, than any other software out there. And, all in a package that was easy to install and use.
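The k-tuple idea is easy to sketch (a toy illustration of the seeding step only, written by me for this post; it is not the FASTA code, and it omits the banded extension entirely): index every k-mer of the target sequence, then look up the query’s k-mers to produce seed matches that an aligner would extend.

```python
from collections import defaultdict

def ktuple_seeds(query, target, k=3):
    """Find exact k-mer ("k-tuple") matches between query and target.

    Returns (query_pos, target_pos) pairs: the seed hits that a
    Wilbur-Lipman-style search would then extend into alignments.
    """
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            seeds.append((j, i))
    return seeds

# Four overlapping 4-mers of the query hit the target in a diagonal run.
print(ktuple_seeds("GATTACA", "TTGATTACAGG", k=4))
```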
My Reaction to the GFA Proposal
But, back on point… My main reaction to the FASTG format was that the format really needs to come from the downstream tool makers, not from the assembler writers themselves. So, when I saw that Heng Li (for those of you who don’t know, the lead author of MAQ, SAMtools and BWA) had posted a proposed format for graph assembly results, my first reaction was that, finally, a downstream tool author was proposing a format, and I was hoping for something along the lines of “Given the diploid nature of large genomes like human, and the clear need for a graph-based organization of the alternative structural variations of any eukaryotic genome ‘reference’, here is the file format I need to produce the fastest and most accurate sequence alignments in BWA”.
But, that was not the objective of the GFA format, and hence my second tweet about it. My thoughts over the last several days have turned to the questions of what file format should be developed, and can the lessons of what made FASTA (the format and the software) successful be applied here?
The answer that I came up with is that of finding “annotated variants” for large gene panel, WES or WGS datasets. The world of variant calling has transitioned from the 1000 Genomes days of discovering and cataloging human variants to the “clinical” sequencing days of identifying functionally relevant variations that can explain (or are associated with) specific diseases or phenotypes.
But, the variant calling pipelines have largely remained the same. You align all of your reads against the whole genome, remove duplicates, possibly do quality score recalibration, then call variants, annotate those variants and filter/tier the annotated calls. The computation time is significant, and as Heng Li rightly notes in his answer to a SeqAnswers thread, “Long running time is just the price of getting a lot of data. There are faster ways, but they are less accurate in general. Learn to parallelize. You will need that all the time in the NGS era.”
So, here’s my thought. If we recast the problem as one of aligning reads and finding variants at only specific locations/regions of the genome (i.e., the clinically actionable variants, a disease’s related genes, the exome, the evolutionarily conserved locations in the genome), could we compute those more quickly than the current method? What would the reference file format need to be to support this kind of computation? If this does work, would that lead to the ability to tier the regions of the genome, so that an answer at one tier permits you to stop the computation (i.e., where computing wider tiers of the genome would not lead to an improved set of annotated variants)?
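To make the tiering idea concrete, here is a toy sketch (the tier names, chromosome, and coordinates are all invented for illustration; this is not a proposed format): regions are grouped into nested tiers, and a position is resolved against the narrowest tier first, widening only when needed.

```python
# Toy tiered-region lookup (all names and coordinates invented).
# Tier 1 = clinically actionable sites, tier 2 = a disease gene panel,
# tier 3 = the whole chromosome; a query stops at the first tier hit.

TIERS = [
    ("actionable", [("chr7", 55_150_000, 55_160_000)]),
    ("gene_panel", [("chr7", 55_000_000, 55_300_000)]),
    ("chromosome", [("chr7", 0, 159_000_000)]),
]

def first_tier(chrom, pos):
    """Return the name of the narrowest tier containing (chrom, pos),
    or None if the position is outside every tier."""
    for name, regions in TIERS:
        for c, start, end in regions:
            if c == chrom and start <= pos < end:
                return name
    return None

print(first_tier("chr7", 55_155_000))  # resolved by the narrowest tier
```

The open question in the post is exactly whether an aligner/caller restricted this way can stop early at a tier boundary without losing accuracy; the lookup itself is the trivial part.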
I believe the answer to these questions (except possibly the last one) is yes, and the followup parts of this thread will explore these questions in more detail.
But What About…?
What does this have to do with an assembly graph file format like FASTG or GFA? My answer to that is threefold.
First, don’t think that the real “reference” for an annotated variant caller is just going to be a FASTA file plus a VCF or BED file. Above and beyond SNPs and local indels, individual human genomes contain hundreds of different structural variants compared to any reference sequence you can create. To represent the full complexity of our DNA sequences across the population (in order to map them as well as possible), you are going to need a full graph-like structure to describe how certain exons and genes relate to other regions of the genome. For what that is, see part three of my answer below.
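As a toy illustration of what “graph-like” means here (a minimal sketch of my own, with invented segment names and sequences; it is not GFA, FASTG, or any proposed format): segments of reference sequence become nodes, and alternative structural alleles become branching edges, so one graph encodes several haplotypes at once.

```python
# Minimal sequence-graph sketch (segment names/sequences invented):
# a reference path REF1 -> MID_REF -> REF2, plus a deletion allele
# that skips MID_REF, so the graph spells out two haplotypes.

segments = {
    "REF1": "ACGTACGT",
    "MID_REF": "TTTT",   # present only on the reference haplotype
    "REF2": "GGCCGGCC",
}
links = {
    "REF1": ["MID_REF", "REF2"],  # the branch: keep or skip MID_REF
    "MID_REF": ["REF2"],
    "REF2": [],
}

def haplotypes(node, prefix=""):
    """Enumerate every sequence spelled by a path from `node` to a sink."""
    seq = prefix + segments[node]
    if not links[node]:
        yield seq
    for nxt in links[node]:
        yield from haplotypes(nxt, seq)

print(sorted(haplotypes("REF1")))  # two haplotypes from one graph
```

A mapper aware of such a structure can place a read on the deletion allele directly, instead of forcing it onto a linear reference and calling the difference later.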
Second, there is a fallacy that has grown around variant calling pipelines (and is currently gaining traction about de novo assemblers) that their computational components can be treated like Unix commands, like linkable boxes that can be organized and pipelined in pretty much any combination. This is a fallacy because the algorithmic problems are fundamentally hard. The algorithmic components of these pipelines make assumptions and shortcuts in order to compute their results in a reasonable amount of time. When those assumptions and shortcuts are not complementary (that is, do not work together to address each other’s shortcomings), those pipelines will not work well. And, simply trying all combinations to see “what works best” will create winners, but they won’t be robust across a wide range of datasets, and you’ll always be wondering why a pipeline works well for some datasets but falls down for others. There is no shortcut to creating a cohesive set of algorithms developed within the context of the properties of the other algorithms in the pipeline. So, while a FASTG or GFA format is a nice standard to have, it won’t significantly improve de novo assemblies, and you will just have to wait for a new assembler to “dazzle” you.
And finally, third, to quote Indiana Jones in Raiders of the Lost Ark, “I don’t know, I’m making this up as I go along…” Part two is coming, let’s see where it goes…