Studies in computational genomics increasingly rely on analyzing huge quantities of sequences from massive sequencing experiments. Many of the most resource-intensive steps in these analyses, such as read alignment, sequence assembly, and sequence database search, can be accelerated by a sequence sketch that enables faster identification of sequence similarities and overlaps. Better-designed sequence sketches in turn reduce the runtime and storage requirements of these analyses.
We focus on the design and analysis of sequence sketching methods, with an emphasis on minimizers, a family of sequence sketches that are easy to use and enjoy strong guarantees. We provide a rigorous analysis of existing minimizer-family sketches and propose improved minimizer designs under two settings: with and without knowledge of a reference sequence. Together, these works refine and improve existing approaches to sequence sketching, and open new lines of research that could further aid the management and analysis of massive sequencing experiments.
Carl Kingsford (Chair / CMU)
Hosein Mohimani (CMU)
Guillaume Marcais (CMU)
Maria Chikina (PITT)
Ron Shamir (Tel Aviv University)