Description

This track displays human-centric multiple sequence alignments in the ENCODE regions for the 28 vertebrates included in the September 2005 ENCODE MSA freeze, based on comparative sequence data generated for the ENCODE project as well as whole-genome assemblies residing at UCSC, as listed:

The alignments in this track were generated using the LAGAN Alignment Toolkit. The Genome Browser companion tracks, MLAGAN Cons and MLAGAN Elements, display conservation scoring and conserved elements for these alignments based on various conservation methods.

Display Conventions and Configuration

In full display mode, this track shows pairwise alignments of each species aligned to the human genome. In dense mode, the alignments are depicted using a gray-scale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display.

When zoomed in to the base-display level, the track shows the base composition of each alignment. The numbers and symbols on the "Gaps" line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, a "*" is displayed when the gap size is a multiple of 3, and a "+" otherwise. To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.
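The rule for the "Gaps" line can be summarized in a few lines of Python. This is only an illustrative sketch of the rule as described above, not the Genome Browser's own code; the function name `gap_label` and the `chars_available` parameter (standing in for "sufficient space in the display") are assumptions made for the example.

```python
def gap_label(gap_len: int, chars_available: int) -> str:
    """Return the text for one gap on the "Gaps" line (illustrative only).

    gap_len         -- length of the gap in the human sequence, in bases
    chars_available -- hypothetical number of character cells free at
                       this position in the display
    """
    if gap_len <= 0:
        return ""  # no gap, nothing to draw
    text = str(gap_len)
    if len(text) <= chars_available:
        return text  # enough room: show the actual gap size
    # Not enough room: "*" marks a multiple of 3, "+" anything else.
    return "*" if gap_len % 3 == 0 else "+"


if __name__ == "__main__":
    for length, room in [(3, 1), (12, 2), (12, 1), (10, 1)]:
        print(length, room, repr(gap_label(length, room)))
```

Gaps whose length is a multiple of 3 are marked separately because they preserve the reading frame when they fall in coding sequence.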

Methods

MLAGAN alignments were produced by a pipeline specifically designed for ENCODE. First, WU-BLAST was used to find local similarities (anchors) between the human sequence and the sequence of every other species. Then, Shuffle-LAGAN was used to calculate the highest-scoring human-monotonic chain of these local similarities (according to a scoring scheme that penalized evolutionary rearrangements) and, with the help of a utility called SuperMap, to produce a map of orthologous segments in increasing human coordinates. This map was used to undo the genomic rearrangements of the other sequence and convert it to a form that was directly alignable to the human sequence. The new humanized sequences, together with the human sequence, were then multiply aligned using MLAGAN. The resulting alignments were subsequently refined using MUSCLE, which processed small non-overlapping alignment windows and realigned them iteratively, keeping the refined alignment only if it had a better sum-of-pairs score than the original. Finally, a pairwise refinement round was performed, during which pieces that had very low identity (in the induced pairwise alignments between human and each species) were removed from the alignment.
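The "keep the refined window only if its sum-of-pairs score improves" step can be illustrated with a small Python sketch. It is not the pipeline's or MUSCLE's actual code: the function names and the scoring values (match +1, mismatch -1, residue-versus-gap -2, gap-versus-gap 0) are assumptions chosen for the example, and the real scoring scheme is not given here.

```python
from itertools import combinations

# Assumed toy scores; the real pipeline's scoring scheme is not specified here.
MATCH, MISMATCH, GAP = 1, -1, -2


def sum_of_pairs(rows):
    """Sum the pairwise column scores over every pair of aligned rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap against gap contributes nothing
            if a == "-" or b == "-":
                total += GAP
            elif a.upper() == b.upper():
                total += MATCH
            else:
                total += MISMATCH
    return total


def keep_better(original, refined):
    """Return the refined window only if it scores strictly higher."""
    if sum_of_pairs(refined) > sum_of_pairs(original):
        return refined
    return original


if __name__ == "__main__":
    original = ["AC-GT", "A-CGT"]  # same sequences, poorly aligned
    refined = ["ACGT", "ACGT"]     # realigned window
    print(sum_of_pairs(original), sum_of_pairs(refined))
    print(keep_better(original, refined))
```

Because a window is replaced only when its score strictly improves, a refinement pass of this kind can never lower the score of the alignment under the chosen scoring scheme.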

Credits

The MLAGAN alignments were generated by George Asimenos from Stanford's ENCODE group.

Shuffle-LAGAN, SuperMap and MLAGAN were written by Mike Brudno.

MUSCLE was authored by Bob Edgar.

WU-BLAST was provided by the Gish lab at the School of Medicine, Washington University in St. Louis.

The phylogenetic tree is based on Murphy et al. (2001).

References

Brudno M, Do C, Cooper G, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13(4):721-31.

Brudno M, Malde S, Poliakov A, Do C, Couronne O, Dubchak I, Batzoglou S. Global alignment: finding rearrangements during alignment. Bioinformatics. 2003;19(Suppl. 1):i54-i62.

Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792-7.

Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294(5550):2348-51.