In the 1990s, it took $10 billion and 10 years to sequence the human genome. A few years later, it took only $50 million and a fraction of the time to sequence the comparably sized chimpanzee genome. In 2007, a similar-sized project cost $1 million and took less than a year. NIH has recently set a goal of $1,000 per human genome. As affordable, routine genome sequencing is quickly becoming a reality, the amount of sequenced DNA will continue to grow exponentially for at least another decade or two.

A similar trend took place in speech and language technologies in the 1980s and 1990s: available text grew from the 1 million words of the Brown Corpus in the 1970s to more than 1 trillion English words on the web by 2007. This resulted in dramatically improved capabilities in virtually all language technologies, including Automatic Speech Recognition, Machine Translation, and Information Retrieval. Many other areas of Artificial Intelligence saw similar advances from comparable increases in data availability.

The lessons we learned from this period were:

  1. Exponential growth in data enables qualitatively improved prediction capabilities.
  2. More data require new methods and tools designed to handle the increased quantities.

The current state of sequencing technology is very reminiscent of the historical development of speech and language technologies. Existing tools for modeling bio-sequences were not designed for the 10^10 nucleotides/amino acids available today, nor for the 10^12 - 10^15 nts/AAs anticipated in the foreseeable future. The development of new algorithms, visualization tools, and predictive models will change not only medical practice, via personalized diagnostics and treatment, but also the nature and pace of biomedical investigation, drug discovery, and public health decision making.
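To make the scale gap concrete, here is a back-of-the-envelope sketch (the 2-bit nucleotide packing and 1-byte amino-acid encoding are illustrative assumptions, not a description of any particular tool):

```python
# Rough storage estimates for bio-sequence corpora at the scales mentioned above.
# Assumes nucleotides packed at 2 bits (0.25 bytes) each and amino acids stored
# at 1 byte each; these are common but not the only possible encodings.

SCALES = {
    "available today (~10^10 residues)": 10**10,
    "near-term (~10^12 residues)": 10**12,
    "anticipated (~10^15 residues)": 10**15,
}

def human(n_bytes: float) -> str:
    """Format a byte count with a binary-prefix unit."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        if n_bytes < 1024:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024
    return f"{n_bytes:.1f} EiB"

for label, n in SCALES.items():
    print(f"{label}: {human(n * 0.25)} as packed DNA, {human(float(n))} as protein")
```

Under these assumptions, the corpora grow from a few gigabytes today to hundreds of terabytes (packed DNA) or roughly a petabyte (protein) at the anticipated scale, which is one reason tools built for single-genome workloads do not carry over directly.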

Given large amounts of data, both medical practice and biomedical research can be revolutionized (i.e., accelerated and optimized).

Nowhere is this more evident than in vaccine and antiviral drug design, particularly for the RNA viruses and retroviruses that are of great public health significance. Because these viruses mutate so rapidly, each isolate is virtually unique, and every naturally occurring strain is likely to differ from its parent and siblings by at least one nucleotide. Sequencing large numbers of isolates with high-throughput platforms such as those that have recently come online yields a wealth of data comparable to the human or chimpanzee genome in total number of nucleotides, but markedly different in character: many copies of a short genome rather than one copy of a very long genome. Thus an entirely different set of tools is needed from those developed to handle traditional genomics datasets.
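To illustrate the shape of such data, many nearly identical copies of a short genome rather than one long one, here is a minimal sketch; the isolate sequences are invented toy examples and the function is ours, not an existing tool:

```python
from collections import Counter

def position_frequencies(aligned_isolates):
    """Per-column nucleotide frequencies across many aligned copies of a
    short viral genome (all sequences assumed to be the same length)."""
    length = len(aligned_isolates[0])
    profiles = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_isolates)
        total = sum(counts.values())
        profiles.append({nt: c / total for nt, c in counts.items()})
    return profiles

# Toy example: four isolates of a 10-nt "genome", each differing from
# every other by at least one nucleotide, as described above.
isolates = [
    "ACGTACGTAC",
    "ACGTACGTAT",
    "ACGAACGTAC",
    "ACGTACGGAC",
]

for i, profile in enumerate(position_frequencies(isolates)):
    if len(profile) > 1:  # more than one nucleotide observed at this column
        print(f"variable position {i}: {profile}")
```

Per-position profiles of this kind, rather than a single assembled reference, are the natural summary when the dataset is thousands of variants of the same short genome.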

In Project GATTACA, we tap into this historic opportunity and develop algorithms, visualization tools, and predictive models geared toward very large amounts of parallel bio-sequence data (DNA, RNA, and protein).

Currently, we focus on RNA viruses and other fast-evolving pathogens. We build large-scale computational models of important viral proteins (such as HIV's Env and RT, or influenza's hemagglutinin and neuraminidase). We then develop algorithms to infer molecular correlates of important viral properties such as drug resistance, pathogenicity, antigenicity, immunogenicity, virulence, infectivity, and neutralizability. Our algorithms make concrete predictions that can be verified experimentally. We also build models of viral molecular evolution, attempting to anticipate the future behavior and properties of the pathogen. In parallel, we design and build visualization and interactive exploration tools for very large bio-sequence alignments.
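As a simple illustration of how a molecular correlate might be scored (a sketch only: the peptide fragments, alignment position, and resistance labels below are invented, and this is not the project's actual inference algorithm), a single alignment column can be tested for association with a binary drug-resistance phenotype using Fisher's exact test:

```python
from scipy.stats import fisher_exact

def score_position(aligned_aa_seqs, labels, pos, residue):
    """Association between carrying `residue` at alignment column `pos`
    and a binary phenotype label (e.g., drug resistance)."""
    has_res = [seq[pos] == residue for seq in aligned_aa_seqs]
    # 2x2 contingency table: (residue present/absent) x (resistant/sensitive)
    a = sum(h and y for h, y in zip(has_res, labels))          # present, resistant
    b = sum(h and not y for h, y in zip(has_res, labels))      # present, sensitive
    c = sum(not h and y for h, y in zip(has_res, labels))      # absent, resistant
    d = sum(not h and not y for h, y in zip(has_res, labels))  # absent, sensitive
    odds_ratio, p_value = fisher_exact([[a, b], [c, d]])
    return odds_ratio, p_value

# Toy data: six short RT-like peptide fragments with invented resistance labels.
seqs = ["MVKLI", "MVKLI", "MTKLI", "MTKLI", "MTKLI", "MVKLI"]
labels = [False, False, True, True, True, False]  # True = drug-resistant

print(score_position(seqs, labels, pos=1, residue="T"))
```

In practice such a scan runs over every column and residue, and the raw scores must be corrected for multiple testing and for the phylogenetic relatedness of the isolates before being treated as candidate correlates.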