Director: Ross Hardison
The Center for Comparative Genomics and Bioinformatics (CCGB) was established within the Institute for Genomics, Proteomics and Bioinformatics of the Huck Institutes of the Life Sciences. Its mission is to bring together laboratories applying bioinformatic and experimental approaches to find functional sequences within genomic DNA and to assign function to proteins. These projects effectively harvest the physiologically important parts of the bountiful genomic sequences currently being determined.
Determination of the genomic DNA sequences of many organisms, ranging from humans to microbes, has revolutionized the life sciences. The scope of studies has expanded so that effects can be examined globally, across all genes and proteins. Knowledge of the complete set of genes and proteins should allow investigators to organize this information in a more useful and understandable way, much as the periodic table is an organizing structure for much of chemistry. These studies should lead to major new insights, not only into how organisms evolve and develop, but also into how those developmental processes can be altered in pathological or beneficial ways. Such payoffs are much more likely if genomic sequence analysis is truly comprehensive in its identification of functional elements, also known as the functional annotation of genomes. As of April 2003, the human genome sequence has reached a critical milestone, a finished reference sequence. For the scope and resolution of most studies, the raw DNA sequence is now known for all human chromosomes. Importantly, fairly complete draft sequences are available for the mouse and rat genomes, and more vertebrate genomes will be completed soon. These are landmark accomplishments by the genome sequencing community.
However, critical issues such as the following must be resolved before this wealth of genomic DNA information can be comprehensively harvested and organized. (1) For none of these genomes are all the genes known at the present time, despite over a decade of insightful work from many investigators. (2) Predicted and verified proteins encoded by the genome catalyze cellular reactions and/or make critical structures within and between cells. However, these cellular functions are known for only a minority of the proteins. (3) Genes coding for proteins comprise at most about 2% of mammalian genomes. Other functional sequences, such as those that regulate the expression of genes, are dispersed among the non-coding DNA sequences.
Members of the CCGB are working on all three of these issues. Comparative analyses of DNA and protein sequences are at the heart of many of these approaches. Dr. Anton Nekrutenko (Biochemistry and Molecular Biology) and Dr. Kateryna Makova (Biology) are two young investigators applying novel, informative methods for better gene identification. They use alignments of human and mouse DNA sequences to detect key signatures of many protein-coding DNA sequences, such as predicted non-synonymous sites changing more slowly than predicted synonymous sites. Such studies, coupled with independent analyses from other investigators around the world, should in the near future provide a much more complete set of genes in humans and other mammals. In addition, Dr. Nekrutenko maintains a public web server (ETOPE) that makes this analysis available to anyone, for any genomic DNA sequence.
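The signature described above (non-synonymous sites changing more slowly than synonymous sites) can be illustrated with a minimal Python sketch. This is not the method used by ETOPE or by the investigators named here; it is a crude, codon-level count over an already-aligned, in-frame pair of coding sequences, and it does not normalize by the number of synonymous and non-synonymous sites as a proper rate estimate would. The function name `count_codon_changes` and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a, seq_b):
    """Count differing codons that keep vs. change the encoded amino acid.

    Assumes both sequences are aligned, in frame, and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    Returns (synonymous, nonsynonymous) codon-difference counts.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")

    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue  # no change at this codon
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: not counted
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical toy sequences, not real human or mouse data.
toy_human = "ATGGCTCTGAAA"
toy_mouse = "ATGGCCTTGAGA"
print(count_codon_changes(toy_human, toy_mouse))
```

For the toy pair, two codons differ silently (GCT/GCC and CTG/TTG) and one changes the amino acid (AAA/AGA), so the call returns `(2, 1)`. An excess of silent over amino-acid-changing differences is the kind of pattern expected in a genuine protein-coding region.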
Effectively predicting the function of proteins is a major challenge, one that would have seemed impossible a few years ago. Indeed, predicting a protein structure from a given amino acid sequence is a long-standing problem in protein structure and enzymology. Dr. Arthur Lesk (Biochemistry and Molecular Biology) is one of the world leaders in these areas. His current research combines his expertise in modeling protein structure and molecular graphics with the rapidly expanding knowledge of three-dimensional protein structures to develop novel methods for predicting protein function. This cutting-edge research also utilizes comparative approaches, including three-dimensional structural comparisons instead of the primary structure comparisons that are the basis of other projects in the CCGB.
Just as aligned genome sequences provide fundamental information for predicting protein-coding genes, the alignments are central to the approaches used in the CCGB for finding functional non-coding genomic DNA sequences. Dr. Webb Miller (Computer Science and Engineering and Biology) has pioneered dynamic programming methods for aligning long genomic DNA sequences. For the past 14 years, he has collaborated with Dr. Ross Hardison (Biochemistry and Molecular Biology) in developing software for this problem and in designing experiments to test computer-generated predictions of sequences regulating gene expression. For the past two years, Dr. Francesca Chiaromonte (Statistics) has been part of this collaboration, bringing state-of-the-art statistical methods to the analysis of the alignments, always with the aim of identifying functional DNA sequences by hallmarks of purifying selection (slower rates of change than neutral DNA) or by statistically robust measures of similarity to patterns in known regulatory sequences. (See International Mouse Genome Sequencing Consortium, Nature 420:520-562, 2002.) Full utilization of these whole-genome interspecies alignments requires that they be recorded in a database along with current annotations of the human and mouse genomes (e.g. genes, expression patterns, transcription factor binding sites, repetitive DNA sequences). An initial version of such a database, called GALA, is now available.
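To give a sense of what dynamic programming alignment involves, here is a minimal Python sketch of the textbook global alignment recurrence (Needleman-Wunsch) with a linear gap penalty. It is not the software developed in the CCGB, which has to handle genome-scale sequences far more efficiently than this quadratic-memory version. The function name `global_align` and the scoring values are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two short sequences by dynamic programming.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    Scoring parameters are illustrative defaults, not tuned values.
    """
    n, m = len(a), len(b)

    def substitution(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two bases
                score[i - 1][j] + gap,                      # gap in b
                score[i][j - 1] + gap,                      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical toy sequences.
total, top, bottom = global_align("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Once two sequences are aligned in this way, columns can be compared position by position, which is the starting point for measures such as percent identity or for looking for regions that change more slowly than neutral DNA.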
Members of the Center are developing a number of computational tools and databases. These include PipMaker, PipTools, MultiPipMaker, GALA, dbERGE II, HbVar, ETOPE, and many others. These products of bioinformatics research at Penn State are used by thousands of researchers worldwide. One goal of the Center is to integrate these tools into a single user-friendly web portal that would enable researchers from around the world to perform multiple genome analysis tasks without leaving the CCGB website. Creation of such a resource is critically important for establishing Penn State as a leading institution in genomic and bioinformatic research.
The bioinformatic predictions can only be verified by experimental tests, and Dr. Hardison's lab is now testing DNA sequences predicted to regulate erythroid gene expression in mammals. Additional genomic DNA sequencing is needed for some approaches, such as the sequencing done by Dr. Makova to measure the amount of intra-species polymorphism. The CCGB is designed to develop bioinformatic tools and predictions in concert with experimental tests to continually improve the efficacy of the bioinformatics.