Description

This track shows a measure of evolutionary conservation in $organism, chimp, mouse, rat, and chicken based on a phylogenetic hidden Markov model (phylo-HMM). The following multiz alignments were used to generate the annotation:

$organism $date ($db)
chimpanzee Nov. 2003 (panTro1)
mouse Feb. 2003 (mm3)
rat Jun. 2003 (rn3)
chicken Feb. 2004 (galGal2)

In "full" visibility mode, this track displays pairwise alignments of chimp, mouse, rat, and chicken, each aligned to the $organism genome. The pairwise alignments are displayed in the standard UCSC browser "dense" mode using a greyscale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display; however, this does not remove them from the conservation score display.

When zoomed in to the base-display level, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the $organism sequence at those alignment positions relative to the longest non-$organism sequence. If there is sufficient space in the display, the size of the gap is shown. Otherwise, a "*" is displayed if the gap size is a multiple of 3, and a "+" is displayed if it is not. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment.
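The gap-labelling rule described above can be sketched as a small function. This is only an illustrative reading of the rule, not the browser's actual code: the name gap_label and the parameter chars_available (how many characters fit at that position on the Gaps line) are assumptions introduced for this example.

```python
def gap_label(gap_size, chars_available):
    """Return the text shown on the Gaps line for one gap (illustrative only).

    gap_size        -- length of the gap in bases
    chars_available -- hypothetical measure of the display space at that position
    """
    if gap_size <= 0:
        return ""  # no gap, nothing to show
    digits = str(gap_size)
    if len(digits) <= chars_available:
        return digits  # enough room: show the actual size
    # Not enough room: "*" marks a multiple of 3, "+" marks any other size.
    return "*" if gap_size % 3 == 0 else "+"
```

For example, a 12-base gap that has room for only one character would be drawn as "*", while a 10-base gap in the same space would be drawn as "+".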

This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options.

Methods

Best-in-genome blastz pairwise alignments of $organism-mouse and $organism-rat were multiply aligned using a program called humor (HUman-MOuse-Rat), which is a special variant of the Multiz program. Multiz was used first to align the humor results with reciprocal best $organism-chimp alignments, and then to align the $organism-chimp-mouse-rat multiple alignment with best-in-genome blastz $organism-chicken alignments. The resulting $organism-chimp-mouse-rat-chicken multiple alignments were then assigned conservation scores by phylo-HMM.

A phylo-HMM is a probabilistic model that describes both the process of DNA substitution at each site in a genome, and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2003, Siepel and Haussler 2004). A phylo-HMM can be thought of as a machine that generates a multiple alignment, in the same way that an ordinary hidden Markov model (HMM) generates an individual sequence. While the states of an ordinary HMM are associated with simple multinomial probability distributions, the states of a phylo-HMM are associated with more complex distributions defined by probabilistic phylogenetic models. These distributions can capture differences in the rates and patterns of nucleotide substitution observed in different types of genomic regions (e.g., coding or noncoding regions, conserved or nonconserved regions).

To compute a conservation score, we use a k-state phylo-HMM whose k associated phylogenetic models differ only in overall evolutionary rate (Felsenstein and Churchill 1996, Yang 1995). In the image at right, there are three states (k = 3), S₁, S₂, and S₃, but in practice we use k = 10. A phylogenetic model is first estimated globally, using the discrete gamma model for rate variation (Yang 1994); a scaled version of the estimated model is then associated with each state of the phylo-HMM. Each state i has its own "rate constant", r_i, by which all branch lengths in the globally estimated model are multiplied. The transition probabilities between states allow for autocorrelation of substitution rates, i.e., for adjacent sites to tend to exhibit similar overall substitution rates. A single parameter, lambda, describes the degree of autocorrelation and defines all transition probabilities. Here, we have estimated the rate constants from the data, in a manner similar to Yang (1995) (see Siepel and Haussler 2003), but have treated lambda as a tuning parameter. For the conservation score, we use the posterior probability that each site was "generated" by the state having the smallest rate constant. Because of the way the rate categories are defined, the plotted values can be thought of as approximately representing the posterior probability that each site is among the 10% most conserved sites in the data set (allowing for autocorrelation of substitution rates).
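The posterior computation described above can be sketched with the standard forward-backward algorithm. The code below is a simplified illustration, not the software used to build this track. It assumes that the likelihood of each alignment column under each state has already been computed, that the starting distribution over states is uniform, and that the transition probabilities take the form lambda * [i = j] + (1 - lambda) / k, which is one common way for a single autocorrelation parameter to define all transitions. The function names are hypothetical.

```python
def transition_matrix(k, lam):
    """Assumed form: keep the current state with weight lam, otherwise
    redraw the state uniformly from all k rate categories."""
    return [[lam * (i == j) + (1.0 - lam) / k for j in range(k)]
            for i in range(k)]


def posterior_of_state(emissions, lam, state=0):
    """Posterior probability of `state` at every site.

    emissions[t][i] is the likelihood of alignment column t under state i.
    By the convention used here, state 0 has the smallest rate constant.
    """
    n, k = len(emissions), len(emissions[0])
    A = transition_matrix(k, lam)

    # Forward pass, rescaled at each site to avoid numerical underflow.
    fwd, prev = [], None
    for t in range(n):
        if t == 0:
            row = [emissions[0][j] / k for j in range(k)]  # uniform start
        else:
            row = [emissions[t][j] * sum(prev[i] * A[i][j] for i in range(k))
                   for j in range(k)]
        total = sum(row)
        prev = [x / total for x in row]
        fwd.append(prev)

    # Backward pass, also rescaled at each site.
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [sum(A[i][j] * emissions[t + 1][j] * bwd[t + 1][j]
                   for j in range(k)) for i in range(k)]
        total = sum(row)
        bwd[t] = [x / total for x in row]

    # Combine both passes and normalize per site.
    post = []
    for t in range(n):
        weights = [fwd[t][i] * bwd[t][i] for i in range(k)]
        post.append(weights[state] / sum(weights))
    return post
```

With lam = 0 the sites are independent and each score depends only on its own column. As lam approaches 1, a column that is uninformative on its own inherits the score of its neighbors, which is the smoothing effect that the autocorrelation parameter is meant to provide.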

In this case, the general reversible (REV) substitution model was used in parameter estimation, and lambda was set to 0.9. Alignment gaps were treated as missing data, which sometimes has the effect of producing undesirably high posterior probabilities in gappy regions of the alignment. We are looking at several possible ways of improving the handling of alignment gaps.
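The effect of treating gaps as missing data can be illustrated with a deliberately reduced example. The sketch below uses only two species and the Jukes-Cantor (JC69) substitution model in place of the REV model actually used, and all function names are hypothetical. A gap is handled by summing over all four possible bases, which is what "missing data" means in a phylogenetic likelihood calculation.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """JC69 probability of base x changing to base y along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def column_likelihood(x, y, branch_length, rate):
    """Likelihood of a two-species alignment column under a rate-scaled model.

    '-' marks a gap, treated as missing data (summed over all four bases).
    `rate` plays the role of the rate constant multiplying the branch length.
    """
    t = rate * branch_length
    xs = BASES if x == "-" else x
    ys = BASES if y == "-" else y
    return sum(0.25 * jc69_prob(a, b, t) for a in xs for b in ys)
```

In this toy setting, a column such as ("A", "-") has likelihood 0.25 under every rate constant, so it carries no information about the rate category. The score at such a site is then determined by neighboring sites through the autocorrelation parameter, which is one way that gappy regions can end up with high posterior probabilities.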

Credits

This track was created at UCSC using the following programs:

Blastz and multiz from Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group.
AxtBest, axtChain, chainNet, netSyntenic, and netClass developed by Jim Kent at UCSC.
Tree estimation and phylo-HMM software by Adam Siepel at Cornell University.
"Wiggle track" plotting software by Hiram Clawson at UCSC.

The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community.

References

Phylo-HMMs and phastCons:

Felsenstein, J. and Churchill, G.A. A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13, 93-104 (1996).

Siepel, A. and Haussler, D. Phylogenetic hidden Markov models. In R. Nielsen, ed., Statistical Methods in Molecular Evolution, pp. 325-351, Springer, New York (2005).

Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R.K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-1050 (2005).

Yang, Z. A space-time process model for the evolution of DNA sequences. Genetics 139, 993-1005 (1995).

Chain/Net:

Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003).

Multiz:

Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F.A., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., Haussler, D., and Miller, W. Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner. Genome Res. 14(4), 708-15 (2004).

Blastz:

Chiaromonte, F., Yap, V.B., and Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002).

Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003).

Phylogenetic Tree:

Murphy, W.J., et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294(5550), 2348-51 (2001).