Description

This track plots the level of evolutionary conservation along the genome, as estimated from multiple alignments of the human (hg16), mouse (mm3), and rat (rn3) genomes. The conservation score shown here is based on a phylogenetic hidden Markov model (phylo-HMM).

Methods

A phylo-HMM is a probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2003, Siepel and Haussler 2005). A phylo-HMM can be thought of as a machine that generates a multiple alignment, in the same way that an ordinary hidden Markov model (HMM) generates an individual sequence. The states of an ordinary HMM are associated with simple multinomial probability distributions, whereas the states of a phylo-HMM are associated with more complex distributions defined by probabilistic phylogenetic models. These distributions can capture differences in the rates and patterns of nucleotide substitution observed in different types of genomic regions (e.g., coding or noncoding regions, conserved or nonconserved regions).
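The "machine that generates a multiple alignment" view can be illustrated with a small simulation. The following is only a toy sketch under simplifying assumptions that are not part of the track's actual method: it uses the Jukes-Cantor (JC69) substitution model, a star-shaped tree in which each species descends independently from the root, made-up branch lengths, and a state-switching rule controlled by a single parameter `lam`. All function and parameter names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def jc69_same(t):
    """JC69 probability that a base is unchanged after a branch of length t."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)

def evolve(base, t, rng):
    """Evolve one base along a branch of length t under JC69."""
    if rng.random() < jc69_same(t):
        return base
    return rng.choice([b for b in BASES if b != base])

def generate_alignment(n_cols, rates, lam, branches, seed=0):
    """Return (state_path, alignment); alignment maps species name -> string.

    rates[i] scales every branch length while the hidden state is i.
    """
    rng = random.Random(seed)
    k = len(rates)
    state = rng.randrange(k)  # assumption: uniform initial state
    path = []
    rows = {name: [] for name in branches}
    for col in range(n_cols):
        # Stay in the current state with probability lam, else redraw uniformly.
        if col > 0 and rng.random() >= lam:
            state = rng.randrange(k)
        path.append(state)
        root = rng.choice(BASES)  # JC69 has uniform base frequencies
        for name, length in branches.items():
            rows[name].append(evolve(root, rates[state] * length, rng))
    return path, {name: "".join(chars) for name, chars in rows.items()}

# Made-up branch lengths (substitutions per site) on a star tree.
branches = {"human": 0.1, "mouse": 0.3, "rat": 0.3}
path, alignment = generate_alignment(60, [0.2, 1.0, 3.0], 0.9, branches)
for name, row in alignment.items():
    print(f"{name:>5} {row}")
```

Runs of columns generated in the slow state tend to show identical bases across species, while runs generated in the fast state show more mismatches.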

To compute a conservation score, we use a k-state phylo-HMM whose k associated phylogenetic models differ only in overall evolutionary rate (Felsenstein and Churchill 1996, Yang 1995). (In the picture at right, k = 3, but in practice we use k = 10.) First, a phylogenetic model is estimated globally, using the discrete gamma model for rate variation (Yang 1994); then a scaled version of the estimated model is associated with each state in the phylo-HMM (see picture). Each state i has its own "rate constant," r_i, by which all branch lengths in the globally estimated model are multiplied. The transition probabilities between states allow for autocorrelation of substitution rates, i.e., for adjacent sites to tend to exhibit similar overall substitution rates. A single parameter lambda describes the degree of autocorrelation and defines all transition probabilities (see picture). Here, we have estimated the rate constants from the data, similarly to Yang (1995) (see Siepel and Haussler 2003), but have treated lambda as a tuning parameter. For the conservation score, we use the posterior probability that each site was "generated" by the state having the smallest rate constant. Because of the way the rate categories are defined, the plotted values can be thought of as approximately representing the posterior probability that each site is among the 10% most conserved sites in the data set (allowing for autocorrelation of substitution rates).
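The posterior probability described above can be computed with the standard forward-backward algorithm for HMMs. The sketch below is illustrative only and rests on assumptions not stated in the text: the transition probability from state i to state j is taken to be lambda * [i = j] + (1 - lambda) / k, the initial state distribution is uniform, and the per-column emission probabilities P(column | state) are supplied by the caller. Function names are hypothetical.

```python
def transition_matrix(k, lam):
    """Assumed form: a[i][j] = lam * [i == j] + (1 - lam) / k."""
    return [[lam * (i == j) + (1.0 - lam) / k for j in range(k)]
            for i in range(k)]

def posterior_slowest(emissions, lam):
    """Posterior probability of state 0 (the slowest rate) at each column.

    emissions[t][i] is P(column t | state i). Uses forward-backward with
    per-column rescaling to avoid numerical underflow.
    """
    n, k = len(emissions), len(emissions[0])
    a = transition_matrix(k, lam)

    # Forward pass: fwd[t][j] is proportional to P(columns 0..t, state j at t).
    fwd = []
    for t in range(n):
        if t == 0:
            cur = [emissions[0][j] / k for j in range(k)]  # uniform start
        else:
            cur = [emissions[t][j] * sum(fwd[t - 1][i] * a[i][j] for i in range(k))
                   for j in range(k)]
        total = sum(cur)
        fwd.append([c / total for c in cur])

    # Backward pass: bwd[t][i] is proportional to P(columns t+1.. | state i at t).
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        cur = [sum(a[i][j] * emissions[t + 1][j] * bwd[t + 1][j] for j in range(k))
               for i in range(k)]
        total = sum(cur)
        bwd[t] = [c / total for c in cur]

    # Combine and normalize per column.
    post = []
    for t in range(n):
        w = [fwd[t][i] * bwd[t][i] for i in range(k)]
        post.append(w[0] / sum(w))
    return post

# Two-state toy input: the middle column is uninformative on its own.
toy = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]
print(posterior_slowest(toy, 0.0))  # no autocorrelation
print(posterior_slowest(toy, 0.9))  # neighbors pull the middle column up
```

With lambda = 0 each column is scored independently, so an uninformative column receives the prior probability 1/k. With a larger lambda, the same column borrows evidence from its neighbors, which is the autocorrelation effect described above.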

In this case, the general reversible (REV) substitution model was used in parameter estimation, and lambda was set to 0.9. Alignment gaps were treated as missing data, which sometimes produces undesirably high posterior probabilities in gappy regions of the alignment. We are investigating several ways to improve the handling of alignment gaps.
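To make "gaps treated as missing data" concrete, the sketch below computes the likelihood of a single alignment column with Felsenstein's pruning algorithm, giving a gapped leaf a partial-likelihood vector of all ones so that it is summed out. This is a simplified illustration, not the track's implementation: it substitutes the Jukes-Cantor (JC69) model for REV, assumes the topology (human, (mouse, rat)), and uses made-up branch lengths. All names are hypothetical.

```python
import math

BASES = "ACGT"

def jc69(t):
    """4x4 JC69 substitution probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def leaf_vector(ch):
    """Partial likelihoods at a leaf; a gap ('-') is missing data: all ones."""
    if ch == "-":
        return [1.0] * 4
    return [1.0 if b == ch else 0.0 for b in BASES]

def up(partial, t):
    """Propagate a child's partial likelihoods across a branch of length t."""
    p = jc69(t)
    return [sum(p[x][y] * partial[y] for y in range(4)) for x in range(4)]

def column_likelihood(human, mouse, rat, rate, lengths):
    """P(column | tree with every branch length multiplied by rate)."""
    t = {name: rate * length for name, length in lengths.items()}
    m = up(leaf_vector(mouse), t["mouse"])
    r = up(leaf_vector(rat), t["rat"])
    rodent = [m[x] * r[x] for x in range(4)]  # mouse-rat ancestor
    h = up(leaf_vector(human), t["human"])
    g = up(rodent, t["rodent"])
    return sum(0.25 * h[x] * g[x] for x in range(4))  # uniform root frequencies

# Made-up branch lengths (substitutions per site).
lengths = {"human": 0.1, "rodent": 0.2, "mouse": 0.05, "rat": 0.05}
for column in ("AAA", "AA-", "A--"):
    slow = column_likelihood(*column, rate=0.2, lengths=lengths)
    fast = column_likelihood(*column, rate=2.0, lengths=lengths)
    print(column, round(slow, 4), round(fast, 4))
```

In this toy setting a column with a single aligned base has the same likelihood under every rate, so its posterior is determined entirely by neighboring columns through the transition probabilities. This may be one way gappy regions end up with high scores, although the text does not say so explicitly.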

Credits

This track was created with tree estimation and phylo-HMM software by Adam Siepel, and plotting software ("wiggle track") by Hiram Clawson.

References

J. Felsenstein and G. A. Churchill. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104, 1996.

A. Siepel and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. In Proc. 7th Annual Int'l Conf. on Research in Computational Molecular Biology (RECOMB 2003), pages 277-286, 2003.

A. Siepel and D. Haussler. Phylogenetic hidden Markov models. In R. Nielsen, ed., Statistical Methods in Molecular Evolution, Springer, 2005.

Z. Yang. A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005, 1995.

Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314, 1994.