SBT: Large-Scale Search of Short Read Databases


Overview

Querying a short-read database for a transcript of interest is a fundamental problem in biology. Yet such queries are computationally intensive and scale linearly with the size of the data being searched. This creates a computational bottleneck in which large databases of sequencing reads are compiled but never investigated systematically. To address this problem, we developed the Sequence Bloom Tree (SBT), a data structure for searching short-read expression experiments for transcripts of interest. Rather than naively exploring every file in a database, the SBT prunes files that, with high probability, do not contain the query. Its search time therefore scales linearly with the number of experiments containing the query rather than with the total size of the experiment set.
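The pruning idea can be illustrated with a toy sketch. The Python below is not the SBT implementation. It assumes a binary tree in which each leaf holds a Bloom filter of one experiment's k-mers, each internal node holds the union of its children's filters, and a query descends only into subtrees whose filter contains at least a fraction theta of the query's k-mers. All names (Bloom, Node, leaf, parent, query), the k-mer length, the filter size, and theta are hypothetical choices made for illustration.

```python
import hashlib

K = 4  # toy k-mer length (assumption, not an SBT default)


def kmers(seq, k=K):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class Bloom:
    """Tiny single-hash Bloom filter stored as a Python int bitmask."""

    def __init__(self, size=1024, bits=0):
        self.size = size
        self.bits = bits

    def _pos(self, item):
        digest = hashlib.sha1(item.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        self.bits |= 1 << self._pos(item)

    def __contains__(self, item):
        return bool((self.bits >> self._pos(item)) & 1)

    def union(self, other):
        return Bloom(self.size, self.bits | other.bits)


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom = bloom
        self.name = name
        self.children = list(children)


def leaf(name, reads):
    """Build a leaf whose filter holds every k-mer of one experiment's reads."""
    bloom = Bloom()
    for read in reads:
        for kmer in kmers(read):
            bloom.add(kmer)
    return Node(bloom, name=name)


def parent(left, right):
    """Build an internal node whose filter is the union of its children's."""
    return Node(left.bloom.union(right.bloom), children=[left, right])


def query(node, query_kmers, theta=0.8):
    """Return names of leaves whose filter holds >= theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: nothing below this node can pass the threshold
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


# Two hypothetical experiments; only exp_a contains the query sequence.
exp_a = leaf("exp_a", ["ACGTACGTAC"])
exp_b = leaf("exp_b", ["TTTTGGGGCC"])
root = parent(exp_a, exp_b)
hits = query(root, kmers("ACGTACGT"))
print(hits)
```

Because a Bloom filter has no false negatives, exp_a is always reported here. Whether exp_b also appears depends on hash collisions, which is why results are correct only with high probability.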

The SBT is built on the Bloom filter implementation in the Jellyfish library, and its default settings are designed as a reasonable compromise between speed, storage cost, and accuracy. In the paper, we demonstrate that the SBT can search multi-terabyte databases substantially faster than any existing tool, with reasonable accuracy and negligible storage costs in both disk space and RAM.
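To make the trade-off between storage cost and accuracy concrete, the sketch below evaluates the standard textbook approximation for a Bloom filter's false positive rate, p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, n the number of inserted items, and k the number of hash functions. The function name and the numbers used are illustrative assumptions and are not SBT's default settings.

```python
import math


def bloom_fpr(m_bits, n_items, k_hashes=1):
    """Approximate false positive rate of a Bloom filter.

    Standard approximation: (1 - exp(-k * n / m)) ** k.
    """
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Illustrative values only: doubling the filter size lowers the error rate.
for m in (1_000_000, 2_000_000, 4_000_000):
    print(m, bloom_fpr(m, n_items=500_000))
```

Larger filters cost more storage but give fewer false positives, which is the kind of compromise the default settings aim to balance.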

Downloads

Download SBT Source on GitHub

Download SBT Linux Binary [beta v0.3.5]

Download SBT User Manual [beta v0.3.5]

The pre-release version of our latest tool, the Split Sequence Bloom Tree (SSBT), can be found below. Manuscript coming soon!
Download SSBT Source on GitHub

SBT Example Files

Download Example Compressed SBT Index - All the files needed to load and query a compressed SBT over 2652 experiments. [176 GB download, 200 GB unpacked]
Download Example SBT Leaves - The compressed leaf Bloom filters for 2652 SRR experiments. [50 GB download, 63 GB unpacked]
Download Example SBT Uncompressed Leaves - The uncompressed leaf Bloom filters for 2652 SRR experiments. [68 GB download, 618 GB unpacked]

Installation Instructions

To install using the binary:
  1. Download the latest version of SBT

  2. Decompress the tarball: tar xzf sbt-binary-0.3.5.tar.gz

To install using the source:
  1. Install gcc (Version 4.9.1 or later)

  2. Install Jellyfish (Version 2.2.0 or later)

  3. Install SDSL (SDSL-lite)

  4. Download the latest version of SBT from GitHub

  5. Compile:
    cd bloomtree/src
    make

Contact

The software is developed by Brad Solomon and Carl Kingsford in the Computational Biology Department at Carnegie Mellon University.

Citation

If you use SBT, please cite:

Solomon, Brad and Carl Kingsford.
Fast search of thousands of short-read sequencing experiments. Nature Biotechnology, 2016. doi: 10.1038/nbt.3442

A list of the experiments included in that paper is here: srr-list.txt