15-853: Algorithms in the Real World (Guy Blelloch)

next up previous
Topic 8: Pattern Matching in Computational Biology

[ Topics | Scribe Notes | Readings | Text Books | Links ]


Topic Outline

  • Introduction
  • DNA, proteins, and sequencing
  • Global alignment
  • The longest common subsequence (LCS) and minimum edit distance problems
  • The general alignment problem
  • Cost matrices -- how derived in biology
  • Recursive solution with memoizing
  • Dynamic programming solution
  • O(n+m) space solution (Hirshberg)
  • O(n d) time solution (Ukkonen)
  • Other alignment problems
  • Local alignments
  • Alignment with gaps
  • Affine gap model and algorithm (Gotoh)
  • Multiple alignment
  • Sequence analysis tools and search engines
  • BLAST
  • FASTA
  • Applications of alignment
  • Protein sequencing
  • Evolutionary (phylogenic) trees
  • Scribe Notes

  • VLSI 3 + Pattern Matching 1 (draft) (Noah Treuhaft)
  • Pattern Matching 2 (draft) (Felix Wu)
  • Readings

  • Lecture notes 2, 3, 4, and 6 from the class Algorithms in Molecular Biology by Richard Karp, Larry Ruzzo, and Martin Tompa.
    We will not be covering the exact same material, but there will be significant overlap.
  • Recommended Text Books

    Suggested
  • Dan Gusfield. Algorithms on String, Trees, and Sequences. Cambridge University Press, 1997.
    This is an excellent book and the definitive source for combinatorial algorithms on strings. It does not go into much detail on the biology side and is somewhat narrow in the overall computational biology. For example it has little on Gene prediction, finding signals in DNA, on proteomics. In general it has very little on statistical based techniques.
  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
    This is a good book on probabilistic methods for sequence analysis, but a little weak on the computer science side. It is complamentary to Gusfield's book.
  • Pavel A. Pevzner. Computational Molecular Biology (An Algorithmic Approach) MIT Press, 2000.
    This book is is broader than either the Gusfield or Durbin book on their own, but not as deep in either discrete string matching algorithms (compared to Gusfield), or statistical methods (compared to Durbin). If you just want just one book, this might be the right one.
  • Other
  • M.S. Waterman. Introduction to Computational Biology: Maps, Sequences, Genomes. Chapman & Hall, 1995.
  • Joćo Meidanis & Joćo Carlos Setubal. Introduction to Computational Molecular Biology. PWS Publishing Company, Boston, 1996.
  • Arthur M. Lesk. Computational molecular biology: sources and methods for sequence analysis. Oxford University Press, 1988.
  • Graham A Stephen String searching algorithms World Scientific, 1994.
    The technical report this book is based on is available online.
  • Further Readings and Links

  • General information
  • Primer on Molecular Genetics from the U.S. Department of Energy. A nice overview in the general topic of molecular genetics.
  • Alignments, the basis for sequence comparison. A nice overview by Peter Rijk.
  • Protein Sequence Alignment and Database Scanning by Geoffere Barton. Yet another good overview paper on the subject (also available in postscript)
  • Combinatorial Methods for DNA Mapping and Sequencing. Forward from a special issue of the Journal of Computational Biology.
  • Exact String Matching Algorithms. An overview by Christian Charras and Thierry Lecroq. Has some nice applets.
  • The Biology Workbench. Access to a collection of biological sequence and structure analysis tools set up at NCSA. You need to register to use it.
  • A guide to sequence searching. Brief description of BLAST, FAST and similar techniques.
  • Yahoo: Home > Science > Biology > Molecular Biology > Computational Biology Actually a pretty worthless page.
  • Courses on computational biology
  • Algorithms for Molecular Biology
    (Dalit Naor and Ron Shamir at Tel-Aviv University).
  • Representations and Algorithms for Computational Molecular Biology
    (Russ Altman at Stanford). This has more of a biological bent---the instructors are biologists.
  • Algorithmic Fundamentals for Computational Molecular Biology
    (Gene Myers at U. Arizona).
  • Computational Biology
    (Larry Ruzzo, and Martin Tompa at U. Washington).
  • Introduction to Computational Molecular Biology
    (Bonnie Berger and Mona Singh at MIT)
  • 3-D Structure in Chemistry and Molecular Biology
    (Klara Kedem, Paul Chew, Jon Kleinberg, and Dan Huttenlocher at Cornell). This course mostly covers three dimensional matching of proteins, and other molecules.
  • 3-D Structure in Chemistry and Molecular Biology
    (Bruce Donald at Dartmouth). This is similar to the previous course (obviously, by name). Bruce Donald was at Cornell before Dartmouth.
  • Computational Biology and Chemistry
    (Sung-Hou Kim and John Canny at UC Berkeley).
  • Principles of Computational Biology
    (Steven Salzberg at Johns Hopkins)
  • Bibliographies
  • A bibliography on Computational Gene recognition.
  • Implementations
  • Sequence analysis software
  • An overview and rating of various sequence matching tools (and other biomolecular research tools).
  • A FPGA based implementation of sequence matching. Claims to be 80x faster than a 300MHz Alpha.
  • Overview of the FASTA algorithm.
  • Human Genome Project
  • Home page
  • A Primer on Molecular Genetics
  • Links to otheruseful information.
  • Evolutionary (phylogenic) trees
  • Lectures 17, 18, and 19 from the course Algorithms in Molecular Biology at U. Washington. Notes are quite short but this is a good introduction.
  • Chapter 17 from D. Gusfield, Algorithms on String, Trees, and Sequences. Computer theory viewpoint.
  • Chapter 11 from D. Hillis, C. Moritz, and B. Mable (Eds.), Molecular systematics. Biology viewpoint.
  • A glossary on evolution from UBC.
  • A collection of programs for constructin evolutionary trees from U. Washington.
  • Longest Common Subsequence
  • A nice description of the LCS problem and algorithms from a course by David Eppstein.

  • Back to the Algorithms in the Real World home page.
    Guy Blelloch, guyb@cs.cmu.edu.