PITTSBURGH -- A team of researchers at Carnegie Mellon University has received a three-year, $646,000 grant from the National Science Foundation to develop computational methods that will quickly identify key regions of the human genome that can be traced to prehistoric times. These regions can then be used to reconstruct human genetic histories. Ultimately the new tools, which draw from the latest techniques in population genetics, theoretical computer science and operations research, will help researchers address basic questions about human evolution and identify regions of the genome involved with diseases like cancer, diabetes and mental illness.
Humans are 99.9 percent identical at the genetic level, and the key to understanding the diversity of the human species is buried in the 0.1 percent that makes us genetically different from one another. But sorting through the genome to identify and analyze these variations is a computational nightmare.
"Computer analysis of these genetic variations allows us to infer how human populations have evolved over thousands of years. Given our current computational tools, though, we could not complete this task in our lifetimes even if we had every computer in the world working on the problem," said Russell Schwartz, an assistant professor of biological sciences and principal investigator on the project. "We will instead tackle those portions of it that can be solved with confidence given current limitations, while simultaneously pushing the limits of established tools as far as possible through novel algorithm development."
The most common genetic variations occur as single nucleotide polymorphisms (SNPs), single mutations in one of the four chemical bases that make up DNA. Each human genome is made of more than six billion of these bases. Researchers have identified many of the predicted 10 million SNPs in the human genome, but understanding how these variations have accumulated over the course of human history and how they became distributed in human populations is a computational challenge.
Schwartz and co-principal investigators Computer Science Professor Guy Blelloch and R. Ravi, professor of operations research and computer science at the Tepper School of Business, are creating new computational techniques to identify patterns of SNPs that are common in human populations, patterns that indicate ancient relationships shared among humans today. According to the researchers, developing these tools is critical to finding genes that cause disorders like diabetes or heart disease.
To help develop these tools, Carnegie Mellon researchers will analyze data gathered by the International HapMap Project. This research consortium is mapping variations in the human genome to find genes that could help diagnose disease susceptibility and design targeted medicines in the future. The "Hap" is short for haplotypes, or sets of associated SNPs along a segment of the genome that have been conserved throughout human genetic history. Researchers created an initial HapMap, a map of shared blocks of SNPs, by analyzing DNA in blood samples collected from people in Nigeria, Japan, China and the United States (with ancestry from northern and western Europe).
Sorting through millions of SNPs to identify haplotypes is even more computationally challenging because of recombination, a shuffling of genetic material between chromosomes that occurs when sperm and egg cells are produced. Because recombination events accumulate over the course of many generations, they complicate efforts to identify shared ancestry between different people or different regions of the genome. Finding the haplotypes, which have undergone little or no recombination in the recent past, would help scientists identify and trace the ancestral lineages of specific genes across populations.
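To make the idea of a region with "little or no recombination" concrete, the sketch below uses the textbook four-gamete test, not the Carnegie Mellon team's own method. It assumes each SNP has exactly two alleles coded "0" and "1" and that each site mutated only once; under those assumptions, seeing all four allele combinations (00, 01, 10, 11) at a pair of sites implies that recombination occurred between them. The function names and the toy haplotypes are hypothetical.

```python
def four_gamete_violation(haplotypes, i, j):
    """True if sites i and j together show all four allele pairs.

    Assumes biallelic sites coded '0'/'1' and at most one mutation per site.
    """
    seen = {(h[i], h[j]) for h in haplotypes}
    return len(seen) == 4


def recombination_free_blocks(haplotypes):
    """Greedily cut the sites, left to right, into runs of consecutive sites
    in which no pair of sites fails the four-gamete test."""
    n_sites = len(haplotypes[0])
    blocks, start = [], 0
    for j in range(1, n_sites):
        # Start a new block as soon as site j conflicts with the current one.
        if any(four_gamete_violation(haplotypes, i, j) for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, n_sites - 1))
    return blocks


# Toy data (made up for illustration): six haplotypes typed at five SNP sites.
toy_haplotypes = ["00000", "10000", "11011", "11111", "00010", "11000"]
blocks = recombination_free_blocks(toy_haplotypes)
print(blocks)
```

On this toy input, sites 0 and 3 show all four combinations, so the scan closes a first block at sites 0 to 2 and opens a second at sites 3 to 4. The greedy left-to-right cut is only one simple way to partition sites; it is not guaranteed to match the block boundaries a more careful method would choose.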
Schwartz and his colleagues are attempting to find haplotypes with more precision than current techniques by using a new method for partitioning DNA into small segments they call "haplotype motifs." These motifs frequently occur across human populations. Already, their approach has identified ancient haplotype patterns consistent with current evidence about human evolution. For example, the team used their algorithms to analyze data from the HapMap to confirm evidence of ancient haplotype patterns predating the divergence of Chinese and Japanese populations, as well as some patterns predating European and Asian population divergence.
The team is also simultaneously developing novel algorithms to infer phylogenies (family trees) of pieces of the human genome that have not been touched by recombination.
"We are applying new methods from theoretical computer science to create phylogenies that are guaranteed to be the best possible, given the SNP data available to us and our understanding of how the observed patterns of SNPs were created," Schwartz said.
At present, these phylogenies are generally inferred by approximate, or heuristic, methods that do not always make the best possible inferences from the available data, according to Schwartz. The team is developing optimal methods for this task and a related extension where the genome pieces may have limited mutation. These new methods draw from a variety of techniques ranging from graph theory to mathematical programming.
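As a rough illustration of what a phylogeny for a recombination-free region looks like in the simplest setting, the sketch below applies the classical "perfect phylogeny" condition. It is a textbook technique, not the optimal methods the team is developing. It assumes alleles are coded "0" (ancestral) and "1" (mutated) and that every site mutated exactly once. Under those assumptions a tree exists only if, for every pair of sites, the sets of haplotypes carrying the mutation are either disjoint or nested. All names and the toy data are hypothetical.

```python
def carriers(haplotypes, site):
    """Indices of the haplotypes that carry the mutated allele '1' at a site."""
    return frozenset(r for r, h in enumerate(haplotypes) if h[site] == "1")


def perfect_phylogeny_paths(haplotypes):
    """Return, for each haplotype, its mutations ordered from root to leaf,
    or None if the data admit no perfect phylogeny.

    Assumes '0' is the ancestral allele and each site mutated once.
    """
    n_sites = len(haplotypes[0])
    sets = [carriers(haplotypes, s) for s in range(n_sites)]
    for a in range(n_sites):
        for b in range(a + 1, n_sites):
            x, y = sets[a], sets[b]
            # Overlapping carrier sets must be nested, otherwise no tree fits.
            if (x & y) and not (x <= y or y <= x):
                return None
    paths = []
    for h in haplotypes:
        derived = [s for s in range(n_sites) if h[s] == "1"]
        # Mutations shared by more haplotypes sit closer to the root.
        derived.sort(key=lambda s: (-len(sets[s]), s))
        paths.append(derived)
    return paths


# Toy data (made up for illustration): four haplotypes typed at three SNP sites.
tree_paths = perfect_phylogeny_paths(["100", "110", "001", "000"])
conflict = perfect_phylogeny_paths(["00", "10", "01", "11"])
print(tree_paths, conflict)
```

In the first toy input, site 0 is shared by two haplotypes and site 1 arises beneath it, while site 2 sits on a separate branch. The second input fails the condition and returns None. Real SNP data often violate these assumptions, which is the case the "limited mutation" extension mentioned above is meant to address.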
"Both new analyses will together provide us with a partial history of the human genome anddetailed information about specific genetic regions where such information can be inferred withconfidence," he said.
The grant will also allow the team to develop new course material in the areas of algorithms and computational biology, and provide undergraduate and graduate student research opportunities at the boundaries of quantitative and biological research.