Querying a short-read database for a transcript of interest is a fundamental problem in biology. Yet such queries are computationally intensive and scale linearly with the size of the data being searched. This leads to a computational bottleneck in which large databases of sequencing reads are compiled but never investigated systematically. To address this problem, we developed the Sequence Bloom Tree (SBT) data structure to facilitate searching short-read expression experiments for transcripts of interest. Rather than naively exploring every file in a database, the SBT prunes, with high probability, the files that do not contain the query. It thus scales linearly with the number of experiments containing the query rather than with the total size of the experiment set.
The SBT is built upon the Bloom filter implementation in the Jellyfish library, and the default settings are designed as a reasonable compromise between speed, storage cost, and accuracy. In the paper, we demonstrate that the SBT can search multi-terabyte databases substantially faster than any existing tool, with reasonable accuracy and negligible storage costs both on disk and in RAM.
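The pruning idea described above can be illustrated with a small toy model. The following is a minimal sketch in Python and not the SBT implementation: the names (`make_leaf`, `build`, `query`), the constants `K`, `M`, and `THETA`, the SHA-256 based hash, and the simple pairwise tree construction are all assumptions made for illustration. Each leaf holds a Bloom filter of the k-mers in one experiment, each internal node holds the bitwise OR of its children, and a query descends into a subtree only if enough of its k-mers appear in that subtree's filter.

```python
import hashlib

K = 5        # k-mer length (toy value chosen for this sketch)
M = 1 << 16  # bits per Bloom filter (toy value chosen for this sketch)
THETA = 0.8  # fraction of query k-mers that must be present to keep a node


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_index(kmer, m=M):
    """Map a k-mer to one bit position (a single hash function, for simplicity)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


class Node:
    def __init__(self, bits, name=None, left=None, right=None):
        self.bits = bits    # Python int used as a bit vector
        self.name = name    # experiment name for leaves, None for internal nodes
        self.left = left
        self.right = right


def make_leaf(name, reads):
    """Build a leaf whose filter contains every k-mer of every read."""
    bits = 0
    for read in reads:
        for km in kmers(read):
            bits |= 1 << bit_index(km)
    return Node(bits, name=name)


def build(leaves):
    """Pair nodes level by level; each parent filter is the OR of its children."""
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(Node(a.bits | b.bits, left=a, right=b))
        if len(level) % 2:
            nxt.append(level[-1])  # carry an unpaired node up one level
        level = nxt
    return level[0]


def query(node, query_kmers, theta=THETA):
    """Return names of leaves whose filters contain >= theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if (node.bits >> bit_index(km)) & 1)
    if hits < theta * len(query_kmers):
        return []  # prune: nothing below this node can pass the threshold
    if node.name is not None:
        return [node.name]
    return query(node.left, query_kmers, theta) + query(node.right, query_kmers, theta)


# Hypothetical toy data: four "experiments", two of which contain the query.
experiments = {
    "exp_A": ["ACGTACGTGGA", "TTGACCA"],
    "exp_B": ["GGGCCCTTTAAA"],
    "exp_C": ["ACGTACGTGGA"],
    "exp_D": ["CCCCCCCCCC"],
}
root = build([make_leaf(name, reads) for name, reads in experiments.items()])
matches = query(root, kmers("ACGTACGTGG"))
print(matches)
```

Because a parent filter is the union of its children, a subtree that fails the threshold at its root cannot contain a passing leaf, which is what allows whole groups of experiments to be skipped. Bloom filters can report false positives but never false negatives, so an experiment that truly contains the query k-mers is always returned.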
Download SBT Source on Github
To install using the binary:
Download SBT Linux Binary [beta v0.3.5]
Download SBT User Manual [beta v0.3.5]
SBT Example Files
Download Example Compressed SBT Index: All the necessary files to load and query a compressed SBT of 2652 experiments. [176 GB Download, 200 GB unpacked].
Download Example SBT Leaves: The compressed leaf Bloom filters for 2652 SRR experiments. [50 GB Download, 63 GB unpacked].
Download Example SBT Uncompressed Leaves: The uncompressed leaf Bloom filters for 2652 SRR experiments. [68 GB Download, 618 GB unpacked].
Download the latest version of SBT
Decompress the tarball: tar xzf sbt-binary-0.3.1tar.gz
Install gcc (Version 4.9.1 or later)
Install Jellyfish (Version 2.2.0 or later)
Install SDSL (SDSL-lite)
Download the latest version of SBT using Github
Solomon, Brad and Carl Kingsford. Fast search of thousands of short-read sequencing experiments. Nature Biotechnology. 2016. doi: 10.1038/nbt.3442
A list of the experiments included in that paper is here: srr-list.txt