Description

This track displays human-centric multiple sequence alignments and conserved elements in the ENCODE regions for the 36 vertebrates included in the December 2007 ENCODE MSA freeze. The alignments in this track were generated using the Threaded Blockset Aligner (TBA). The conservation subtracks display conserved elements generated by two methods: BinCons, a binomial-based method that calculates a conservation score in sliding windows with normalization for phylogenetic bias, and Chai Cons, a DNA structure-informed constraint detection algorithm that uses hydroxyl radical cleavage patterns as a measure of DNA structure.

The multiple alignments are based on comparative sequence data generated for the ENCODE project from NIH Intramural Sequencing Center (NISC) as well as whole-genome assemblies residing at UCSC, as listed:

Organism	Species	Version
Human	Homo sapiens	UCSC hg18
Armadillo	Dasypus novemcinctus	NISC
Baboon	Papio anubis	NISC
Bat (rfbat)	Rhinolophus ferrumequinum	NISC
Bat (sbbat)	Myotis lucifugus	NISC
Cat	Felis catus	NISC
Chicken	Gallus gallus	UCSC galGal3
Chimpanzee	Pan troglodytes	UCSC panTro2
Colobus Monkey	Colobus guereza	NISC
Cow	Bos taurus	UCSC bosTau3
Dog	Canis familiaris	UCSC canFam2
Dusky titi	Callicebus moloch	NISC
Elephant	Loxodonta africana	NISC
Flying Fox	Pteropus vampyrus	NISC
Galago	Otolemur garnettii	NISC
Gibbon	Nomascus leucogenys leucogenys	NISC
Guinea pig	Cavia porcellus	NISC
Hedgehog	Atelerix albiventris	NISC
Horse	Equus caballus	NISC
Macaque	Macaca mulatta	UCSC rheMac2
Marmoset	Callithrix jacchus	NISC
Mouse	Mus musculus	UCSC mm9
Mouse Lemur	Microcebus murinus	NISC
Opossum	Monodelphis domestica	UCSC monDom4
Orangutan	Pongo abelii	UCSC ponAbe2
Owl Monkey	Aotus nancymaae	NISC
Platypus	Ornithorhynchus anatinus	NISC
Rabbit	Oryctolagus cuniculus	NISC
Rat	Rattus norvegicus	UCSC rn4
Rock hyrax	Procavia capensis	NISC
Shrew	Sorex araneus	NISC
Squirrel monkey	Saimiri boliviensis boliviensis	NISC
Squirrel	Spermophilus tridecemlineatus	NISC
Tenrec	Echinops telfairi	NISC
Tree shrew	Tupaia belangeri	NISC
Vervet monkey	Chlorocebus aethiops	NISC

Display Conventions and Configuration

In full display mode, this track shows pairwise alignments of each species aligned to the human genome. In dense mode, the alignments are depicted using a gray-scale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display. To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.

Gap Annotation

The Display chains between alignments configuration option enables display of gaps between alignment blocks in the pairwise alignments in a manner similar to the Chain track display. The following conventions are used:

Single line: no bases in the aligned species. Possibly due to a lineage-specific insertion between the aligned blocks in the human genome or a lineage-specific deletion between the aligned blocks in the aligning species.
Double line: aligning species has one or more unalignable bases in the gap region. Possibly due to excessive evolutionary distance between species or independent indels in the region between the aligned blocks in both species.
Pale yellow coloring: aligning species has Ns in the gap region. Reflects uncertainty in the relationship between the DNA of both species, due to lack of sequence in relevant portions of the aligning species.

Genomic Breaks

Discontinuities in the genomic context (chromosome, scaffold or region) of the aligned DNA in the aligning species are shown as follows:

Vertical blue bar: represents a discontinuity that persists indefinitely on either side, e.g. a large region of DNA on either side of the bar comes from a different chromosome in the aligned species due to a large scale rearrangement.
Green square brackets: enclose shorter alignments consisting of DNA from one genomic context in the aligned species nested inside a larger chain of alignments from a different genomic context. The alignment within the brackets may represent a short misalignment, a lineage-specific insertion of a transposon in the human genome that aligns to a paralogous copy somewhere else in the aligned species, or another similar occurrence.

Base Level

When zoomed in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+".
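The Gaps-line rule described above can be sketched in a few lines of Python. This is only an illustration of the stated convention, not the browser's actual code: the function name `gap_label` and the `width` parameter (the number of characters available at that position) are hypothetical.

```python
def gap_label(gap_size: int, width: int) -> str:
    """Return the text shown on the Gaps line for one alignment position.

    gap_size: length of the gap in the reference sequence (0 means no gap).
    width:    hypothetical number of characters available in the display.
    """
    if gap_size <= 0:
        return ""  # no gap at this position, nothing to draw
    text = str(gap_size)
    if len(text) <= width:
        return text  # enough room: show the gap size itself
    # Not enough room: "*" marks a multiple of 3, "+" marks any other size.
    return "*" if gap_size % 3 == 0 else "+"
```

A gap whose size is a multiple of 3 keeps downstream codons in frame, which is presumably why the display distinguishes it from other gap sizes.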

Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes:

No codon translation: the gene annotation is not used; the bases are displayed without translation.
Use default species reading frames for translation: the annotations from the genome displayed in the Default species for translation pull-down menu are used to translate all the aligned species present in the alignment.
Use reading frames for species if available, otherwise no translation: codon translation is performed only for those species where the region is annotated as protein coding.
Use reading frames for species if available, otherwise use default species: codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation.
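The four modes differ only in which species' annotation supplies the reading frame for a given aligned species. The sketch below restates that decision in Python; the mode keys, the function name `frame_source`, and the `annotated` set are hypothetical names chosen for illustration and are not identifiers used by the browser.

```python
from typing import Optional


def frame_source(mode: str, species: str, default_species: str,
                 annotated: set[str]) -> Optional[str]:
    """Return the species whose gene annotation provides reading frames,
    or None when the bases are shown without translation.

    annotated: species that have protein-coding annotation over the region.
    """
    if mode == "none":
        return None                      # no codon translation
    if mode == "default":
        return default_species           # default frames for every species
    if mode == "species_or_none":
        return species if species in annotated else None
    if mode == "species_or_default":
        return species if species in annotated else default_species
    raise ValueError(f"unknown mode: {mode}")
```

For example, with human as the default species and only human and mouse annotated, `frame_source("species_or_default", "cow", "human", {"human", "mouse"})` falls back to the human annotation.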

Codon translation uses the following gene tracks as the basis for translation, depending on the species chosen. Species listed in the row labeled "None" do not have species-specific reading frames for gene translation.

Gene Track	Species
Gencode Genes	human
UCSC Genes	mouse
Known Genes	rat
RefSeq Genes	chimp
Ensembl Genes	rhesus, opossum
None	the remaining 30 species

Methods

TBA

TBA was used to align sequences in the December 2007 ENCODE sequence data freeze. Multiple alignments were seeded from a series of combinatorial pairwise blastz alignments (not referenced to any one species). The specific combinations were determined by the species guide tree. The resulting multiple alignments were projected onto the human reference sequence.

BinCons

The binCons score is based on the cumulative binomial probability of detecting the observed number of identical bases (or greater) in sliding 25 bp windows (moving one bp at a time) between the reference sequence and each other species, given the neutral rate at four-fold degenerate (4D) sites. Neutral rates are calculated separately at each targeted region. For targets with no gene annotations, the average percent identity across all alignable sequence was used instead to weight the individual species binomial scores; this latter weighting scheme was found to closely match 4D weights. Clusters of bases that exceeded the given conservation score threshold were designated as conserved elements. The minimum length of a conserved element is 25 bases. Strict cutoffs were used: if even one base fell below the conservation score threshold, the element was split into two distinct regions. Regions reported here exceed a 5% False Discovery Rate threshold, using a window size of 7 bases. More details on binCons can be found in Margulies et al. (2003), cited below.
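As a rough illustration of the two ideas in this paragraph, the Python sketch below computes an upper-tail binomial probability for each sliding window of a pairwise alignment and then calls elements from a per-base score track using a strict cutoff. It is not the binCons implementation: the function names are hypothetical, `p_neutral` stands in for the neutral identity rate and is assumed to be supplied by the caller, and the per-species weighting, the mapping from window probabilities to per-base scores, and the False Discovery Rate step are all omitted.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def window_tail_probs(ref: str, other: str, p_neutral: float,
                      window: int = 25) -> list[float]:
    """Tail probability of the observed identities in each sliding window.

    ref and other are equal-length rows of a pairwise alignment; a gap
    character ("-") never counts as an identity.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    matches = [int(a == b and a != "-")
               for a, b in zip(ref.upper(), other.upper())]
    probs = []
    for start in range(len(matches) - window + 1):   # step of one base
        k = sum(matches[start:start + window])
        probs.append(binomial_upper_tail(k, window, p_neutral))
    return probs


def call_elements(scores: list[float], threshold: float,
                  min_len: int = 25) -> list[tuple[int, int]]:
    """Half-open (start, end) runs in which every score exceeds threshold.

    A single base at or below the threshold ends the current run (strict
    cutoff); runs shorter than min_len are discarded.
    """
    elements, start = [], None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                elements.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        elements.append((start, len(scores)))
    return elements
```

A small tail probability means the window contains more identical bases than the neutral rate would predict, which is the signal a conservation score of this kind is built on.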

Chai

Chai is a DNA structure-informed evolutionary conservation algorithm that works in a manner analogous to the primary sequence-based binCons. Instead of computing the binomial probability of observed base substitutions between species, Chai calculates the difference between DNA structural profiles as a measure of similarity. Single-nucleotide resolution structure profiles for genomic DNA are predicted using the algorithm described in Greenbaum et al. (2007), cited below. Regions reported here exceed a 5% False Discovery Rate threshold.
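The idea of comparing structural profiles rather than bases can be illustrated with a toy example. The source does not state which difference measure or window size Chai uses, so the mean absolute difference and the default 25-position window below are assumptions made purely for illustration, and `windowed_profile_difference` is a hypothetical name. The profiles are taken to be per-nucleotide numeric values (for example, predicted hydroxyl radical cleavage intensities) for two aligned sequences.

```python
def windowed_profile_difference(profile_a: list[float],
                                profile_b: list[float],
                                window: int = 25) -> list[float]:
    """Mean absolute difference between two profiles in sliding windows.

    Smaller values mean the two structural profiles are more similar.
    The difference measure and window size are illustrative assumptions.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have equal length")
    diffs = [abs(a - b) for a, b in zip(profile_a, profile_b)]
    return [sum(diffs[i:i + window]) / window
            for i in range(len(diffs) - window + 1)]
```

Two sequences with different bases can still yield similar profiles, which is why a structure-based measure can flag regions that a purely sequence-based score does not.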

Credits

The TBA multiple alignments were created by Gayle McEwen & Elliott Margulies of NHGRI.

BinCons was developed by Elliott Margulies (Margulies et al. 2003).

Chai was developed by Steve Parker & Tom Tullius (Boston University), Elliott Margulies (NHGRI) and Loren Hansen (NCBI).

The programs Blastz and TBA, which were used to generate the alignments, were provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group.

The phylogenetic tree is based on Murphy et al. (2001).

References

Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15.

Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002;:115-26.

Greenbaum JA, Pang B, Tullius TD. Construction of a genome-scale structural map at single-nucleotide resolution. Genome Res. 2007 Jun;17(6):947-53.

Margulies EH, Blanchette M, NISC Comparative Sequencing Program, Haussler D, Green ED. Identification and characterization of multi-species conserved sequences. Genome Res. 2003 Dec;13(12):2507-18.

Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001 Dec 14;294(5550):2348-51.

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7.
