RNA-seq expression estimates need not take longer than a cup of coffee
The quantification of gene or isoform abundance is a fundamental step in many transcriptome analysis tasks, such as determining differential expression between biological samples. Yet estimating isoform abundance from a large set of RNA-seq reads remains a computationally intensive task, owing in large part to the necessity of read mapping. To address this problem directly, we developed Sailfish, a software tool that implements a novel, alignment-free algorithm for estimating isoform abundances directly from a set of reference sequences and RNA-seq reads. Rather than working at the read level, Sailfish uses the k-mer as its fundamental unit of transcript coverage. This alternative, lightweight approach allows Sailfish to dispense with many of the complexities of read mapping while remaining robust to sequencing errors. By replacing read mapping with intelligent k-mer indexing and counting, Sailfish is able to quantify isoform abundance orders of magnitude faster than existing tools. For example, Sailfish processes a set of 150 million reads in about 15 minutes, where existing tools take over 6 hours.
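To make the idea of k-mer indexing and counting concrete, here is a minimal Python sketch. It is a toy illustration, not Sailfish's actual implementation: the function names (`kmers`, `index_transcripts`, `count_read_kmers`), the tiny sequences, and the choice of k = 4 are all assumptions made for this example.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def index_transcripts(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it.

    Hypothetical helper: a plain dict stands in for a real k-mer index.
    """
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def count_read_kmers(reads, index, k):
    """Count read k-mers, keeping only those present in the index."""
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in index:
                counts[kmer] += 1
    return counts


# Toy data (assumed for illustration only).
transcripts = {"tx1": "ACGTACGGA", "tx2": "TTGACGGAC"}
reads = ["ACGTACGG", "GACGGA", "TACGNA"]  # the last read has a bad base, N

index = index_transcripts(transcripts, k=4)
counts = count_read_kmers(reads, index, k=4)

for kmer, n in sorted(counts.items()):
    print(kmer, n, sorted(index[kmer]))
```

In this sketch, the k-mers that overlap the erroneous base in the last read do not match the index and are dropped, while the error-free k-mer from the same read is still counted. K-mers shared by both toy transcripts (here `ACGG` and `CGGA`) are ambiguous on their own; assigning such counts to isoforms is the job of a separate estimation step.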
This increase in speed is obtained without sacrificing accuracy. Sailfish implements an efficient, accelerated expectation-maximization algorithm for quantifying isoform abundance that produces high-quality results and is capable of correcting numerous types of systematic bias that are known to occur in RNA-seq experiments. In the paper, we demonstrate that, on both real and synthetic data, Sailfish is as accurate as existing read mapping-based tools such as eXpress and Cufflinks.

Contact

The Sailfish software is developed by Rob Patro and Carl Kingsford at the Lane Center for Computational Biology at Carnegie Mellon University, in collaboration with Steve Mount at the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park.
We are interested in hearing about others' experiences using Sailfish on different data types, the primary variables being the quality of the transcriptome annotation and the extent (and range) of divergence between the reference transcriptome and the sample. To this end, we have set up a users' group, sailfish-users, on googlegroups.com (short URL http://ongen.us/SForum).
Please submit any bug reports through our GitHub Issue Tracker.
Citation

If you make use of Sailfish, please cite:
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms
Rob Patro, Stephen M. Mount, and Carl Kingsford
manuscript submitted (2013)
- CCF-1256087, CCF-1053918, EF-0849899
- 1R21HG006913, 1R21AI085376
- Alfred P. Sloan Foundation
- Sloan Research Fellowship to Carl Kingsford
This material is based upon work supported by the National Science Foundation under Grant Numbers EF-0849899, IIS-0812111, and CCF-1053918. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.