Kpath ‐ statistical reference-based compression for short reads

About Path Encoding

Path encoding is a technique for compressing short-read sequence files. It uses a reference (any gzipped multi-FASTA file) to build a statistical model of the sequences, which is adaptively updated during compression.

The path encoding software is written in Go, and is open source.

If you use this software, please cite:

Carl Kingsford and Rob Patro. Reference-based compression of short-read sequences using path encoding. Bioinformatics (2015) 31 (12): 1920-1928

Here are the transcripts used as a reference in that paper (92Mb).

Installation & Requirements

Download Latest Version: Version 0.6.3

Binaries for Mac OS X and Linux are available in the above-linked .tar.gz file. If neither works on your system, you can build the software from the sources.

To install using the binary:

  1. Download the latest version of path encode from above.
  2. Decompress the tarball: tar xzf kpath-0.6.3.tar.gz
  3. Copy the kpath-0.6.3-XXX binary for your system to a location in your path (for easy access), and optionally rename it to kpath.

To install using the source:

  1. Install Go, version 1.2, 1.3, or 1.4 (at least 1.3 recommended) [this is easy].
  2. Download the latest version of path encode from above.
  3. Decompress the tarball: tar xzf kpath-0.6.3.tar.gz
  4. Copy the "src" directory and its subdirectories into the "src" subdirectory of your GOPATH workspace.

    Alternatively, make the directory that tar created the root of your Go workspace: export GOPATH=/path/to/kpath-0.6.3/

  5. cd src/kingsford/kpath
  6. go build
  7. Copy the kpath executable that is created to a location in your path (for easy access)

Usage

To compress:

kpath encode -ref=REF -reads=IN.fastq -out=OUT
where REF is the path to a gzipped multi-FASTA file containing your reference sequences (e.g. a set of transcripts, genomes, or chromosomes); IN.fastq is the FASTQ file you want to compress; and OUT is the prefix of the output files in which the compressed data are stored.

kpath will create OUT.enc, OUT.bittree, OUT.counts, OUT.flipped, and OUT.ns. Only the first three files (.enc, .bittree, .counts) are needed to decompress the sequences. If you don't care about Ns or the orientation of the reads, you can delete one or both of .ns and .flipped.
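Before archiving or decoding, it can be handy to confirm which of these files are present for a given prefix. The following is a minimal shell sketch, not part of kpath itself; the function name kpath_check_outputs is hypothetical, and the only thing it assumes is the file naming described above.

```shell
# Check the files written by "kpath encode -out=PREFIX".
# Returns 0 only if the three files needed for decoding all exist;
# the optional .ns and .flipped files only produce a note on stderr.
kpath_check_outputs() {
    prefix=$1
    rc=0
    for ext in enc bittree counts; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing required file: $prefix.$ext" >&2
            rc=1
        fi
    done
    [ -f "$prefix.ns" ] ||
        echo "note: $prefix.ns absent; Ns will not be restored" >&2
    [ -f "$prefix.flipped" ] ||
        echo "note: $prefix.flipped absent; original orientation will not be restored" >&2
    return "$rc"
}
```

For example, kpath_check_outputs OUT && echo "OUT can be decoded".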

To decompress:

kpath decode -ref=REF -reads=OUT -out=RECOVERED.fasta
where OUT is the basename of the files to decompress (the same OUT used when encoding) and REF is the same reference used when encoding. This will write the sequences in FASTA format to RECOVERED.fasta. If the OUT.ns file is present, the Ns in the original reads will be recovered. If the OUT.flipped file is present, the reads will be put in their original orientation. If either of these files is missing, the corresponding step will be skipped.

The reads will NOT be in the same order as in the original file.
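Because the order changes, a round trip cannot be checked with a plain diff; the sequences have to be compared as an unordered collection. The following is a minimal shell sketch of such a check, not part of kpath. The helper names are hypothetical, and it assumes a 4-line-per-record FASTQ input, a FASTA output with one sequence line per read, and that both OUT.ns and OUT.flipped were present when decoding.

```shell
# Print the sequence line of every 4-line FASTQ record.
fastq_seqs() {
    awk 'NR % 4 == 2' "$1"
}

# Print every non-header line of a FASTA file
# (assumes each read occupies a single line).
fasta_seqs() {
    grep -v '^>' "$1"
}

# Succeed if the FASTQ file and the FASTA file hold the same reads,
# ignoring order. Sorting both sides makes the comparison order-free
# while still counting duplicate reads.
same_read_set() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
    fastq_seqs "$1" | LC_ALL=C sort > "$tmp_a"
    fasta_seqs "$2" | LC_ALL=C sort > "$tmp_b"
    cmp -s "$tmp_a" "$tmp_b"
    rc=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}
```

For example, same_read_set IN.fastq RECOVERED.fasta && echo "round trip OK".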

Other Options

-k=16: The context length k. Values of k both smaller and larger than the default generally result in worse compression, but a smaller k can use fewer resources.

-fasta=true: Write the decoded reads in FASTA format. Use -fasta=false to write out the sequences one per line, without FASTA headers.

-p=10: The maximum number of threads to use. Change this value to allow kpath to use more or fewer threads.

-flip=true: if true, reverse complement reads as needed. Use -flip=false to skip writing out the file that records which reads were reverse complemented.
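For readers unfamiliar with the term used by -flip: the reverse complement of a read is the read written backwards with each base replaced by its pairing base (A with T, C with G). The following shell sketch only illustrates that operation and is not taken from kpath's source; revcomp is a hypothetical helper name, and it relies on the common rev and tr utilities.

```shell
# Print the reverse complement of a DNA sequence.
# Characters other than A, C, G, T (such as N) are left unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}
```

For example, revcomp AACGT prints ACGTT.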