Joint CMU-Pitt Ph.D. Program in Computational Biology Seminar

  • Newell-Simon Hall
  • Mauldin Auditorium 1305
  • PAUL MEDVEDEV
  • Associate Professor of Computer Science and Engineering,
  • Associate Professor of Biochemistry and Molecular Biology; Director, Center for Computational Biology and Bioinformatics
  • Pennsylvania State University
Seminars

Data structures to represent sets of k-mers

The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying k-mer sets has emerged as a shared underlying component and there have been many specialized data structures for their representation.

In this talk, I will describe the applications of k-mer sets in bioinformatics and motivate the need for specialized data structures. I will give an overview of known approaches and lower bounds. Then, I will present a de Bruijn graph unitig-based representation, based on the bcalm2 algorithm. Finally, I will describe a data structure for representing sets of k-mer sets, called the HowDe Sequence Bloom Tree.

For More Information, Please Contact: 
Keywords: