August 2, 2006
Carnegie Mellon University Receives NSF Grant To Develop New
Computational Techniques for Unraveling Human Genetic History
PITTSBURGH—A team of researchers at Carnegie Mellon University has received a three-year,
$646,000 grant from the National Science Foundation to develop computational methods that will
quickly identify key regions of the human genome that can be traced to prehistoric times. These regions
can then be used to reconstruct human genetic histories. Ultimately, the new tools, which draw from the
latest techniques in population genetics, theoretical computer science and operations research, will help
researchers address basic questions about human evolution and identify regions of the genome involved
in diseases like cancer, diabetes and mental illness.
Humans are 99.9 percent identical at the genetic level, and the key to understanding the
diversity of the human species is buried in the 0.1 percent that makes us genetically different from one
another. But sorting through the genome to identify and analyze these variations is a computational
nightmare.
“Computer analysis of these genetic variations allows us to infer how human populations have
evolved over thousands of years. Given our current computational tools, though, we could not complete this
task in our lifetimes even if we had every computer in the world working on the problem,” said Russell
Schwartz, an assistant professor of biological sciences and principal investigator on the project. “We will
instead tackle those portions of it that can be solved with confidence given current limitations, while
simultaneously pushing the limits of established tools as far as possible through novel algorithm
development.”
The most common genetic variations occur as single nucleotide polymorphisms (SNPs), single
mutations in one of the four chemical bases that make up DNA. Each human genome is made of more
than six billion of these bases. Researchers have identified many of the predicted 10 million SNPs in the
human genome, but understanding how these variations have accumulated over the course of human
history and how they became distributed in human populations is a computational challenge.
Schwartz and co-principal investigators Guy Blelloch, professor of computer science, and R. Ravi,
professor of operations research and computer science at the Tepper School of Business, are creating new
computational techniques to identify patterns of SNPs that are common in human populations — patterns
that indicate ancient relationships shared among humans today. According to the researchers, developing
these tools is critical to finding genes that cause disorders like diabetes or heart disease.
To help develop these tools, Carnegie Mellon researchers will analyze data gathered by the
International HapMap Project. This research consortium is mapping variations in the human genome to
find genes that could help diagnose disease susceptibility and design targeted medicines in the future. The
“Hap” is short for haplotypes, or sets of associated SNPs along a segment of the genome that have been
conserved throughout human genetic history. Researchers created an initial HapMap — a map of shared
blocks of SNPs — by analyzing DNA in blood samples collected from people in Nigeria, Japan, China
and the United States (with ancestry from northern and western Europe).
Sorting through millions of SNPs to identify haplotypes is even more computationally
challenging because of recombination, a shuffling of genetic material between chromosomes that occurs
when sperm and egg cells are produced. Because recombination events accumulate over the course of
many generations, they complicate efforts to identify shared ancestry between different people or
different regions of the genome. Finding the haplotypes, which have undergone little or no
recombination in the recent past, would help scientists identify and trace the ancestral lineages of specific
genes across populations.
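As a rough illustration of how recombination can be detected in SNP data, the sketch below applies the four-gamete test, a standard textbook check that is not necessarily the method the Carnegie Mellon team uses. It assumes each SNP has two alleles coded "0" and "1" and that each site mutated only once in history; under those assumptions, a pair of sites showing all four allele combinations points to a recombination event between them. The function name and the toy haplotypes are hypothetical.

```python
from itertools import combinations


def four_gamete_conflicts(haplotypes):
    """Return pairs of SNP positions at which all four allele combinations occur.

    haplotypes: list of equal-length strings over the alphabet {"0", "1"},
    one string per sampled chromosome (hypothetical toy encoding).
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")

    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct allele pairs observed at sites i and j.
        gametes = {(h[i], h[j]) for h in haplotypes}
        # Seeing 00, 01, 10 and 11 together cannot be explained by
        # single mutations alone, so the pair is flagged.
        if len(gametes) == 4:
            conflicts.append((i, j))
    return conflicts


# Made-up sample of five chromosomes typed at four SNPs.
sample = ["0000", "0011", "1100", "1111", "0110"]
print(four_gamete_conflicts(sample))
```

Runs of adjacent sites with no flagged pairs are candidates for segments that have undergone little or no recombination, which is the kind of region the paragraph above describes.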
Schwartz and his colleagues are attempting to find haplotypes with more precision than current
techniques by using a new method for partitioning DNA into small segments they call “haplotype
motifs.” These motifs frequently occur across human populations. Already, their approach has identified
ancient haplotype patterns consistent with current evidence about human evolution. For example, the
team used their algorithms to analyze data from the HapMap to confirm evidence of ancient haplotype
patterns predating the divergence of Chinese and Japanese populations, as well as some patterns predating
European and Asian population divergence.
The team is also developing novel algorithms to infer phylogenies (family trees) of pieces of
the human genome that have not been touched by recombination.
“We are applying new methods from theoretical computer science to create phylogenies that
are guaranteed to be the best possible, given the SNP data available to us and our understanding of how
the observed patterns of SNPs were created,” Schwartz said.
At present, these phylogenies are generally inferred by approximate, or heuristic, methods that
do not always make the best possible inferences from the available data, according to Schwartz. The team
is developing optimal methods for this task and a related extension where the genome pieces may have
limited mutation. These new methods draw from a variety of techniques ranging from graph theory to
mathematical programming.
“Both new analyses will together provide us with a partial history of the human genome and
detailed information about specific genetic regions where such information can be inferred with
confidence,” he said.
The grant will also allow the team to develop new course material in the areas of algorithms
and computational biology, and provide undergraduate and graduate student research opportunities at
the boundaries of quantitative and biological research.
Contact:
Amy Pavlak
412.268.8619
Lauren Ward
412.268.7761