Associate Professor of Computer Science and Engineering,
Associate Professor of Biochemistry and Molecular Biology; Director, Center for Computational Biology and Bioinformatics
Pennsylvania State University
Data structures to represent sets of k-mers
The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying k-mer sets has emerged as a shared underlying component and there have been many specialized data structures for their representation.
In this talk, I will describe the applications of k-mer sets in bioinformatics and motivate the need for specialized data structures. I will give an overview of known approaches and lower bounds. Then, I will present a de Bruijn graph unitig-based representation, based on the bcalm2 algorithm. Finally, I will describe a data structure for representing sets of k-mer sets, called the HowDe Sequence Bloom Tree.