NSF Awards $9 Million to Researchers Using Language Technology Tools To Better Understand the Structure and Function of Proteins in Human Cells

Byron Spice | Thursday, September 26, 2002

PITTSBURGH - The National Science Foundation has made a $9 million, five-year grant to a collaboration of Carnegie Mellon University computer scientists, University of Pittsburgh and Massachusetts Institute of Technology biological chemists, and others from Boston University and the National Canadian Research Council to advance a new field called computational biolinguistics.

Computational biolinguistics, which draws on computational tools including statistical language modeling, machine learning methods and high-level language processing, will allow scientists to better understand how proteins work inside cells.

Just as sequences of letters in a language fall into patterns that make them understandable, sequences of amino acids in proteins can be read to understand their structure, dynamics and function. Sequences of amino acids and their constituents can be thought of as syllables or words that have particular properties.
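To make the language analogy concrete, the short Python sketch below treats a protein as a string of one-letter amino acid codes and counts its overlapping "words" (n-grams), much as a statistical language model counts word sequences in text. It is an illustration only, not the project's software: the function name `ngram_counts` and the toy sequence are hypothetical and were chosen for this example.

```python
from collections import Counter


def ngram_counts(sequence: str, n: int) -> Counter:
    """Count overlapping n-grams (runs of n residues) in a one-letter amino acid string."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical toy sequence for illustration; it is not a real protein.
toy_sequence = "MKTAYIAKQRMKTAYLLG"

# Count three-residue "words" and list the most frequent ones.
counts = ngram_counts(toy_sequence, 3)
for word, count in counts.most_common(3):
    print(word, count)
```

Counts like these are the raw material for n-gram language models, which estimate how likely one "word" is to follow another. Fragments that are unusually frequent in one organism's sequences could then be examined as candidate signatures.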

A deeper understanding of the relationship between protein structure, dynamics and function can help to extract information hidden in the gene sequences of genomes, which may, in turn, help develop drugs to fight disease. Today, there is great societal demand to understand and treat degenerative diseases, many of which are based on defective triggers for protein shape and interactions.

The project's principal investigators are Raj Reddy, Carnegie Mellon's Simon University professor of computer science and robotics, and Judith Klein-Seetharaman, assistant professor of pharmacology at the University of Pittsburgh Medical School, who also holds an appointment at Carnegie Mellon's Language Technologies Institute (LTI).

"The Human Genome Project and related genome sequencing efforts have provided a wealth of data, which has stirred great hopes for increasing our understanding and treating of disease or for mimicking nature's inventions in nanomachine design," said Klein-Seetharaman. "But the precise relationship between a primary sequence and the structure, dynamics and function of the encoded proteins is one of the most fundamental unanswered questions in biology.

"The computational biolinguistics project promises to provide novel views and approaches to solving these challenges that would not be obvious without thinking in terms of the analogy between language and biology."

The team will use computer tools and methods developed for working statistically with human language to better understand the function of proteins in human cells and those of other organisms.

Carnegie Mellon will be the central site for the computational biolinguistics project. Its scientists will supply all of the necessary computational and language modeling technologies. Other partners will provide the bulk of biological and proteomic research and the laboratories where experimental work will take place.

There is also an industrial component to the project. The MathWorks, Inc., of Natick, Mass., will work with Carnegie Mellon scientists to enhance its MATLAB mathematical software to better support computational biolinguistics research. Medstory, Inc., of Burlingame, Calif., which deals with drug innovation informatics, will focus on the clinical and drug development relevance of computational discoveries made under this program.

Reddy and Klein-Seetharaman, together with Language Technologies Institute Director and computer science Professor Jaime Carbonell and LTI associate professors Ronald Rosenfeld and Yiming Yang, have been doing preliminary work in computational biolinguistics for nearly two years. By applying statistical language modeling technologies to genome sequences, they have been able to detect protein fragment signatures from pathogens.

The computational biolinguistics grant is one of more than 300 announced by the National Science Foundation as part of its Information Technology Research (ITR) program. This year, NSF awarded a total of $144 million in new grants under the program.

For more information on the computational biolinguistics project, see: http://www-2.cs.cmu.edu/~blmt/

For a searchable database of NSF's ITR awards, see: http://www.itr.nsf.gov

For More Information

Byron Spice | 412-268-9068 | bspice@cs.cmu.edu