Segmenting the human genome based on states of neutral genetic divergence


In addition to a significant contribution to our understanding of the intricacies of mutagenesis, this study provides a powerful platform for mining biomedical data, which we make publicly available through the University of California, Santa Cruz Genome Browser and the Galaxy portal. The divergence states we characterize serve as local background to benchmark signals used in computational algorithms for prediction of noncoding functional elements and in screening variants from cancer and other disease-affected genomes.

Many studies have demonstrated that divergence levels generated by different mutation types vary and covary across the human genome. To improve our still-incomplete understanding of the mechanistic basis of this phenomenon, we analyze several mutation types simultaneously, anchoring their variation to specific regions of the genome. Using hidden Markov models on insertion, deletion, nucleotide substitution, and microsatellite divergence estimates inferred from human-orangutan alignments of neutrally evolving genomic sequences, we segment the human genome into regions corresponding to different divergence states, each uniquely characterized by a specific combination of divergence levels. We then parse the mutagenic contributions of various biochemical processes by associating divergence states with a broad range of genomic landscape features. We find that high divergence states inhabit guanine- and cytosine (GC)-rich, highly recombining subtelomeric regions; low divergence states cover inner parts of autosomes; chromosome X forms its own state with lowest divergence; and a state of elevated microsatellite mutability is interspersed across the genome. These general trends are mirrored in human diversity data from the 1000 Genomes Project, and departures from them highlight the evolutionary history of primate chromosomes.
We also find that genes and noncoding functional marks [annotations from the Encyclopedia of DNA Elements (ENCODE)] are concentrated in high divergence states. Our results provide a powerful tool for biomedical data analysis: segmentations can be used to screen personal genome variants, including those associated with cancer and other diseases, and to improve computational predictions of noncoding functional elements.
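
To make the segmentation idea concrete, the following is a minimal sketch of how a hidden Markov model can assign genomic windows to divergence states. It is not the model fitted in this study: it assumes a two-state HMM with diagonal Gaussian emissions and fixed parameters, decoded with the Viterbi algorithm. The function names, the window values, and all parameters are hypothetical toy choices made only for illustration.

```python
import numpy as np

def viterbi_segment(obs, means, variances, trans, start):
    """Most likely state path for an HMM with diagonal Gaussian emissions.

    obs: (T, D) divergence estimates, one row per genomic window
    means, variances: (K, D) per-state emission parameters
    trans: (K, K) transition probabilities; start: (K,) initial probabilities
    """
    T, K = obs.shape[0], means.shape[0]
    # Log emission density of every window under every state, shape (T, K).
    diff = obs[:, None, :] - means[None, :, :]
    log_emit = -0.5 * np.sum(
        diff**2 / variances + np.log(2 * np.pi * variances), axis=2
    )
    log_trans = np.log(trans)
    score = np.log(start) + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # cand[i, j]: move from state i to j
        back[t] = np.argmax(cand, axis=0)  # best predecessor of each state
        score = cand[back[t], np.arange(K)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def to_segments(path):
    """Collapse a per-window path into (start, end_exclusive, state) runs."""
    segments, run_start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[run_start]:
            segments.append((run_start, i, int(path[run_start])))
            run_start = i
    return segments

# Hypothetical toy parameters (NOT estimates from the paper).
# Columns: insertion, deletion, substitution, microsatellite divergence.
means = np.array([[0.01, 0.01, 0.03, 0.02],   # state 0: "low divergence"
                  [0.03, 0.03, 0.06, 0.05]])  # state 1: "high divergence"
variances = np.full((2, 4), 1e-4)
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
start = np.array([0.5, 0.5])

# Six toy windows: three resembling state 0, then three resembling state 1.
obs = np.vstack([means[0]] * 3 + [means[1]] * 3)

segments = to_segments(viterbi_segment(obs, means, variances, trans, start))
print(segments)  # [(0, 3, 0), (3, 6, 1)]
```

In a real analysis the emission and transition parameters would be estimated from the data rather than fixed by hand, and the number of states would be chosen by model selection.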

Proceedings of the National Academy of Sciences