Machine Learning Challenges in Location Proteomics

Robert Murphy (Carnegie Mellon University)

Professor, Biological Sciences and Biomedical Engineering
Faculty Member, Center for Automated Learning and Discovery
Director, Merck Computational Biology and Chemistry Program


  Efforts in the burgeoning field of proteomics seek to characterize all expressed proteins in several cell types. Methods for describing proteins in terms of their sequence and structure are well advanced, and systematic schemes for describing protein functions (e.g., Enzyme Commission (E.C.) numbers) have been devised. However, there is currently no systematic means of describing another important aspect of proteins: their subcellular location. Just as nucleotide and protein sequence databases have fueled a revolution in biological and clinical research, protein subcellular location also needs to be entered into databases in a way that lends itself to querying by pattern similarity. This will be critical to the new field of systems biology, which attempts to describe and model all aspects of biological systems. To this end, we have created databases containing 2D and 3D images of the patterns of all major subcellular organelles and structures, and have developed sets of intensity- and rotation-invariant features to describe the patterns. We have validated the ability of these features to describe the protein patterns adequately by training and testing classifiers that can recognize all of the patterns with over 92% accuracy on single images and over 99% accuracy on sets of images. Comparison with human classification reveals that the automated system is capable of resolving patterns not distinguishable by humans. We have coined the term "location proteomics" to refer to the automated, comprehensive, and systematic analysis of subcellular location. This talk will briefly review this work and focus on computational issues raised by extending it to time series images, multiple cell types, intact tissues, and unknown proteins. I will also discuss the use of our subcellular analysis methods to improve information extraction from multimedia documents, such as articles in online journals, and the interfacing of location proteomics with systems modeling efforts.

Charles Rosenberg
Last modified: Tue Nov 11 18:03:03 EST 2003