Description

This track displays different measurements of conservation based on the Threaded Blockset Aligner (TBA) multiple sequence alignments of ENCODE regions shown in the TBA Alignment track. Three programs generated the conservation scoring used to create this track: binCons (binomial-based conservation method), phastCons (phylogenetic hidden Markov model method), and GERP (Genomic Evolutionary Rate Profiling). A related track, TBA Elements, shows multi-species conserved sequences (MCSs) based on the conservation measurements displayed in this track.

For details on the conservation scores generated by each program, refer to the individual Methods subsections.

Display Conventions and Configuration

The subtracks within this composite annotation track, which show data from the binCons, phastCons and GERP programs, may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of the subtracks. A subtrack may be hidden from view by checking the box to the left of the track name in the list. For more information about the graphical configuration options, click the Graph configuration help link.

Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different conservation scoring methods. See the Methods section for display information specific to each subtrack.

Methods

The methods used to create the TBA alignments in the ENCODE regions are described in the TBA Alignment track description.

BinCons

The binCons score is based on the cumulative binomial probability of detecting the observed number of identical bases (or greater) in sliding 25 bp windows (moving one bp at a time) between the reference sequence and each other species, given the neutral rate at four-fold degenerate (4D) sites. Neutral rates are calculated separately at each targeted region. For targets with no gene annotations, the average percent identity across all alignable sequence was used instead to weight the individual species binomial scores (this latter weighting scheme was found to closely match the 4D weights).
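The window-level probability described above can be illustrated with a minimal sketch. It is not the binCons implementation: the function names are hypothetical, `p_neutral` stands in for a per-species probability of identity derived from the neutral rate, and treating every non-identical character (including gaps) as a mismatch is an assumption.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def window_pvalues(ref, other, p_neutral, window=25):
    """Slide a window one base at a time over a pairwise alignment.

    ref, other: aligned sequences of equal length (assumed inputs).
    p_neutral:  assumed probability that a neutral site is identical.
    Returns one tail probability per window start position.
    """
    matches = [a == b for a, b in zip(ref, other)]
    pvals = []
    for start in range(len(matches) - window + 1):
        k = sum(matches[start:start + window])  # identical bases in window
        pvals.append(binomial_tail(k, window, p_neutral))
    return pvals
```

For example, `window_pvalues(ref, other, 0.5)` on two identical 30 bp sequences yields six windows, each with probability 0.5 ** 25.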

The negative log of these P-values was then averaged across all human-referenced pairwise combinations, and each base was assigned the score of the highest-scoring 25 bp window overlapping it. This track plots a ranked percentile score normalized between 0 and 1 across all ENCODE regions, such that the top 5% most conserved sequence across all ENCODE regions has a score of 0.95 or greater (the top 10% has a score of 0.9 or greater, and so on).
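The averaging and per-base maximum can be sketched as follows. This is an illustration under stated assumptions, not the binCons code: the function name is hypothetical, the log base (10) is an assumption because the text says only "negative log", and the species weighting mentioned earlier is omitted.

```python
from math import log10


def per_base_scores(pvals_by_species, seq_len, window=25):
    """Average -log10(P) over species, then take the best window per base.

    pvals_by_species: one list of window P-values per species, each of
                      length seq_len - window + 1 (assumed input layout).
    Requires seq_len >= window.
    """
    n_windows = seq_len - window + 1
    n_species = len(pvals_by_species)
    # Unweighted mean of the negative log P-values for each window.
    window_scores = [
        sum(-log10(pvals[w]) for pvals in pvals_by_species) / n_species
        for w in range(n_windows)
    ]
    base_scores = []
    for pos in range(seq_len):
        first = max(0, pos - window + 1)  # earliest window covering pos
        last = min(pos, n_windows - 1)    # latest window covering pos
        base_scores.append(max(window_scores[first:last + 1]))
    return base_scores
```

Each base inherits the score of the strongest window that covers it, so a single highly conserved window raises the score of all 25 bases inside it.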

BinCons scores were normalized to represent a percentile to the power of 10. For example, scores representing the top 1 percent most conserved sequence, 99th percentile, have a score greater than or equal to 0.99^10 = 0.904. Transforming scores to the power of 10 was done for visual purposes only, in order to accentuate and distinguish the peaks of more highly conserved regions.
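The arithmetic of this transformation is easy to check with a short sketch. The helper names are hypothetical, and the tie handling in the percentile ranking (fraction of scores less than or equal to each score) is an assumption rather than something the track description specifies.

```python
from bisect import bisect_right


def percentile_ranks(raw_scores):
    """Return, for each score, the fraction of all scores <= that score."""
    ordered = sorted(raw_scores)
    n = len(ordered)
    return [bisect_right(ordered, s) / n for s in raw_scores]


def display_score(percentile, power=10):
    """Raise a percentile in [0, 1] to the given power for display."""
    return percentile ** power
```

With these definitions, `display_score(0.99)` is approximately 0.904, matching the example above, while `display_score(0.5)` falls below 0.001, which shows how strongly the transformation compresses everything except the top of the distribution.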

More details on binCons can be found in Margulies et al. (2003) cited below.

PhastCons

The phastCons program predicts conserved elements and produces base-by-base conservation scores using a two-state phylogenetic hidden Markov model. The model consists of a state for conserved regions and a state for nonconserved regions, each of which is associated with a phylogenetic model. These two models are identical except that the branch lengths of the conserved phylogeny are multiplied by a scaling parameter rho (0 < rho < 1).

For determining the conservation for the ENCODE TBA alignments, the nonconserved model was estimated from four-fold degenerate coding sites within the ENCODE regions using the program phyloFit. The parameter rho was then estimated by maximum likelihood, conditional on the nonconserved model, using the EM algorithm implemented in phastCons. Parameter estimation was based on a single large alignment, constructed by concatenating the alignments for all conserved regions.

PhastCons was run with the options --expected-lengths 15 and --target-coverage 0.05 to obtain the desired level of "smoothing" and a final coverage by conserved elements of 5%.

The conservation score at each base is the posterior probability that the base was generated by the conserved state of the phylo-HMM. It can be interpreted as the probability that the base is in a conserved element, given the assumptions of the model and the estimated parameters. Scores range from 0 to 1, with higher scores corresponding to higher levels of conservation.
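A posterior of this kind is typically computed with the forward-backward algorithm. The sketch below is a generic two-state version, not the phastCons implementation: the per-column likelihoods `emis_c` and `emis_n` would in practice come from the conserved and nonconserved phylogenetic models but are supplied here as plain numbers, and the names `stay_c`, `stay_n`, and `init_c` are hypothetical transition and start parameters.

```python
def posterior_conserved(emis_c, emis_n, stay_c, stay_n, init_c=0.5):
    """Posterior probability of the conserved state at each column.

    emis_c[i], emis_n[i]: likelihood of alignment column i under the
        conserved / nonconserved model (assumed to be precomputed).
    stay_c, stay_n: probability of remaining in each state.
    """
    n = len(emis_c)
    emis = list(zip(emis_c, emis_n))  # state 0 = conserved, 1 = nonconserved
    trans = [[stay_c, 1 - stay_c], [1 - stay_n, stay_n]]
    init = [init_c, 1 - init_c]

    # Forward pass, rescaled at every column to avoid underflow.
    fwd, prev = [], None
    for i in range(n):
        if i == 0:
            cur = [init[s] * emis[0][s] for s in (0, 1)]
        else:
            cur = [emis[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        total = sum(cur)
        prev = [x / total for x in cur]
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emis[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        bwd[i] = [x / total for x in cur]

    # Combine and normalize; the scaling constants cancel out.
    post = []
    for f, b in zip(fwd, bwd):
        joint_c, joint_n = f[0] * b[0], f[1] * b[1]
        post.append(joint_c / (joint_c + joint_n))
    return post
```

Because each column's posterior depends on its neighbors through the transition probabilities, an isolated column that fits the conserved model well is pulled toward its surroundings, which is the "smoothing" effect mentioned above.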

More details on phastCons can be found in Siepel et al. (2005) cited below.

GERP

The GERP score is the expected substitution rate divided by the observed substitution rate at a particular human base. Scores are estimated on a column-by-column basis using multiple sequence alignments of mammalian genomic DNA generated by MLAGAN. The scores range from 0 to 3; those greater than 3 are clipped to 3. The expected and observed rates are both calculated on a phylogenetic tree using the same fixed topology. The branch lengths of the expected tree are based on the average substitutions at neutral sites. The branch lengths of the observed tree, which is calculated separately for each human base, are based on the substitutions seen at the column of the multiple alignment at that base. Species that have gaps at a particular column are not considered in the scoring for that column.
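The ratio and clipping step can be written as a small sketch. The function name is hypothetical, the two rates are assumed to have been estimated already from the expected and observed trees, and returning the cap when the observed rate is zero is an assumption made here to avoid division by zero.

```python
def gerp_score(expected_rate, observed_rate, cap=3.0):
    """Expected over observed substitution rate, clipped to the cap."""
    if observed_rate <= 0:
        # Assumption: a column with no observed substitutions gets the cap.
        return cap
    return min(expected_rate / observed_rate, cap)
```

For instance, a column with an expected rate of 2.0 and an observed rate of 1.0 scores 2.0, while an expected rate of 6.0 against the same observed rate is clipped to 3.0.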

Higher scores correspond to human bases in alignment columns with higher degrees of similarity, i.e. bases that have evolved slowly, some of which have been under purifying selection. The opposite holds true for swiftly evolving (low similarity) columns.

Scores are deterministic, given a maximum-likelihood model of nucleotide substitution, species topology, neutral tree, and alignment.

Credits

BinCons was developed by Elliott Margulies of the Eric Green lab at NHGRI.

PhastCons was developed by Adam Siepel in the Haussler lab at UCSC.

GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford).

TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group.

The data for this track were generated by Elliott Margulies, with assistance from Adam Siepel.

References

Blanchette, M., Kent, W.J., Reimer, C., Elnitski, L., Smit, A., Roskin, K., Baertsch, R., Rosenbloom, K.R., Clawson, H. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708-15 (2004).

Cooper, G.M., Stone, E.A., Asimenos, G., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S. and Sidow, A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901-13 (2005).

Margulies, E.H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. and Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507-18 (2003).

Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-50 (2005).