Description

This track displays multi-species conserved sequences (MCSs) derived from binCons, phastCons, and Genomic Evolutionary Rate Profiling (GERP) conservation scoring of Threaded Blockset Aligner (TBA) multiple sequence alignments in the ENCODE regions. The combined-methods subtracks show the unions and intersections of the conserved elements produced by the three conservation methods.

The multiple sequence alignments may be viewed in the TBA Alignment track. Another related track, TBA Cons, shows the conservation scoring. The descriptions accompanying these tracks detail the methods used to create the alignments and conservation scores.

Display Conventions and Configuration

The locations of conserved elements are indicated by blocks in the graphical display. This composite annotation track consists of several subtracks that show conserved elements derived by the three methods listed above, as well as both unions and intersections of the sets of conserved and non-coding conserved elements. To show only selected subtracks, uncheck the boxes next to the tracks you wish to hide.

The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page.

Display characteristics specific to certain subtracks are described in the respective Methods sections below.

Methods

BinCons-based Elements

For each ENCODE target, a conservation score threshold was chosen to match the number of conserved bases predicted by phastCons, an alternative method for measuring conservation. PhastCons has been found to be slightly more reliable for predicting the expected fraction of conserved sequence in each target. Clusters of bases that exceeded the given conservation score threshold were designated as MCSs. The minimum length of an MCS is 25 bases. Strict cutoffs were used: a single base falling below the conservation score threshold splits an MCS into two distinct regions.
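The thresholding rule described above can be illustrated with a minimal Python sketch. It is not the binCons implementation: the function name call_mcs, the per-base score list, and the half-open coordinates are assumptions made for illustration, and a base scoring exactly at the threshold is assumed to count as conserved.

```python
def call_mcs(scores, threshold, min_length=25):
    """Return half-open (start, end) runs in which every base scores >= threshold.

    Illustrative only: `scores` is a hypothetical list of per-base
    conservation scores for one ENCODE target.
    """
    elements = []
    run_start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start = i  # a new run of conserved bases begins here
        else:
            # Strict cutoff: a single sub-threshold base ends the current run.
            if run_start is not None and i - run_start >= min_length:
                elements.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the target.
    if run_start is not None and len(scores) - run_start >= min_length:
        elements.append((run_start, len(scores)))
    return elements
```

In this sketch, a 10-base run is discarded because it is shorter than 25 bases, and one low-scoring base between two long runs yields two separate elements rather than one.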

PhastCons-based Elements

The predicted MCSs are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM, i.e., maximal segments in which the maximum-likelihood (Viterbi) path remains in the conserved state.
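As a rough illustration of what "maximal segments" means here, the Python sketch below extracts them from an already computed state path. The Viterbi decoding itself is not shown, and the function name conserved_segments and the state labels "C" (conserved) and "N" (non-conserved) are hypothetical.

```python
from itertools import groupby


def conserved_segments(path, conserved_state="C"):
    """Return half-open (start, end) maximal runs of the conserved state.

    `path` is a hypothetical per-column state path, for example the output
    of Viterbi decoding, given as a string or list of state labels.
    """
    segments = []
    pos = 0
    for state, run in groupby(path):
        length = sum(1 for _ in run)  # number of consecutive columns in this state
        if state == conserved_state:
            segments.append((pos, pos + length))
        pos += length
    return segments
```

Each returned interval cannot be extended in either direction without leaving the conserved state, which is what makes it maximal.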

GERP-based Elements

GERP elements are scored according to the inferred intensity of purifying selection and are measured as "rejected substitutions" (RSs). RSs capture the magnitude of difference between the number of "observed" substitutions (estimated using maximum likelihood) and the number that would be "expected" under a neutral model of evolution. The RS is displayed as part of the item name. Items with higher RSs are displayed in a darker shade of blue. The score shown on the details page, which has been scaled by 300 for display purposes, is generally not as accurate as the RS count that is part of the item name.

"Constrained elements" are identified as those groups of consecutive human bases that have an observed rate of evolution that is smaller than the expected rate. These groups of columns are merged if they are less than a few nucleotides apart and are scored according to the sum of the site-by-site difference between observed and expected rates (RS).

Permutations of the actual alignments were analyzed, and the "constrained elements" identified in these permuted alignments were treated as "false positives". Subsequently, an RS threshold was picked such that the total length of "false positive" constrained elements (identified in the permuted alignments) was less than 5% of the length of constrained elements identified in the actual alignment. Thus, all annotated constrained elements are significant at better than 95% confidence, and the total fraction of the ENCODE regions annotated as constrained is 5-7%.
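The GERP steps described above (finding constrained columns, merging nearby groups, scoring by RS, and choosing an RS threshold from permuted alignments) might be sketched in Python as follows. This is not the GERP code. The function names, the max_gap value of 3, the choice to sum RS over every column of a merged element, and the search for the smallest qualifying threshold are all assumptions; the source says only that groups "less than a few nucleotides apart" are merged.

```python
def constrained_elements(observed, expected, max_gap=3):
    """Return (start, end, rs) elements from per-column substitution rates.

    `observed` and `expected` are hypothetical equal-length lists.
    `max_gap` is an assumed value, not one given in the source.
    """
    # Step 1: runs of consecutive columns with observed rate < expected rate.
    runs, start = [], None
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        if obs < exp:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(observed)))

    # Step 2: merge runs separated by fewer than max_gap columns.
    merged = []
    for run_start, run_end in runs:
        if merged and run_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))

    # Step 3: RS = sum of (expected - observed) over the element's columns
    # (assumption: bridged gap columns are included in the sum).
    return [(s, e, sum(expected[i] - observed[i] for i in range(s, e)))
            for s, e in merged]


def pick_rs_threshold(real, permuted, max_fp_fraction=0.05):
    """Return the smallest RS cutoff at which the total length of permuted
    ("false positive") elements is below max_fp_fraction of the real total."""
    candidates = sorted({rs for _, _, rs in real} | {rs for _, _, rs in permuted})
    for cutoff in candidates:
        real_len = sum(e - s for s, e, rs in real if rs >= cutoff)
        fp_len = sum(e - s for s, e, rs in permuted if rs >= cutoff)
        if real_len > 0 and fp_len < max_fp_fraction * real_len:
            return cutoff
    return None  # no cutoff satisfies the criterion
```

A typical use of this sketch would run constrained_elements on both the actual and the permuted alignments, then pass both result lists to pick_rs_threshold and keep only the actual elements whose RS meets the returned cutoff.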

PhastCons/BinCons/GERP Union/Intersection of Conserved Elements

These subtracks were produced by creating unions and intersections of the constrained element data detected by binCons, phastCons, and GERP on TBA alignments. In these annotations, "non-coding" is defined as those regions not overlapping with CDS regions in any of the following UCSC gene tables: refFlat, knownGene, mgcGenes, vegaGene, or ensGene.
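To make the union and intersection operations concrete, here is a minimal Python sketch over half-open (start, end) intervals on a single sequence. The function names and the coordinates in the usage note are hypothetical, the actual tooling used to build these subtracks is not described in the source, and the non-coding filter is not shown.

```python
def union(*interval_sets):
    """Merge any number of interval lists into sorted, non-overlapping intervals."""
    merged = []
    for start, end in sorted(iv for s in interval_sets for iv in s):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Return the regions covered by both interval lists."""
    a, b = union(a), union(b)  # normalize to sorted, non-overlapping input
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

For example, with made-up element sets bincons = [(10, 50), (100, 150)], phastcons = [(30, 60), (120, 130)], and gerp = [(40, 45), (125, 200)], union(bincons, phastcons, gerp) gives [(10, 60), (100, 200)], and intersect(intersect(bincons, phastcons), gerp) gives [(40, 45), (125, 130)].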

Credits

BinCons and phastCons MCS data were contributed by Elliott Margulies in the Eric Green lab at NHGRI, with assistance from Adam Siepel of UCSC.

GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts. of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State) and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford).

TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group.

References

See the TBA Alignment and TBA Cons tracks for references.