Description

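This track shows full sets of gene predictions covering all 44 ENCODE regions originally submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005. The following gene predictions are included, each described in the Methods section: AceView, DOGFISH-C, Ensembl, ExonHunter, Exogean, Fgenesh Pseudogenes, Fgenesh++, GeneID-U12, GeneMark, JIGSAW, Pairagon/N-SCAN, SGP2-U12, SPIDA and Twinscan-MARS.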

The EGASP Partial companion track shows original gene prediction submissions for a partial set of the 44 ENCODE regions; the EGASP Update track shows updated versions of the submitted predictions. These annotations were originally produced using the hg17 assembly.

Display Conventions and Configuration

Data for each gene prediction method within this composite annotation track are displayed in a separate subtrack. See the top of the track description page for configuration options allowing display of selected subsets of gene predictions. To remove a subtrack from the display, uncheck the appropriate box.

The individual subtracks within this annotation follow the display conventions for gene prediction tracks. Display characteristics specific to individual subtracks are described in the Methods section. The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature.

Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods.

Methods

AceView

These annotations were generated using AceView. All mRNAs and cDNAs available in GenBank, excluding RefSeq NM_ records, were co-aligned on the Gencode regions. The results were then examined and filtered to resemble the Havana annotations. Havana's very restrictive approach to CDS annotation was not reproduced, due to a lack of experimental data.

DOGFISH-C

Candidate splice sites and coding starts/stops were evaluated using DNA alignments between the human assembly and seven other vertebrate species (UCSC multiz alignments, adding the frog and removing the chimp). Genes (single transcripts only) were then predicted using dynamic programming.

Ensembl

The Ensembl annotation includes two types of predictions: protein-coding genes (the Ensembl Gene Predictions subtrack) and pseudogenes of protein-coding genes (the Ensembl Pseudogene Predictions subtrack). The pseudogene subtrack is not intended as a comprehensive annotation of pseudogenes; rather, it is an attempt to identify and label those gene predictions made by the Ensembl pipeline that have pseudogene characteristics. Exons that lie partially outside the ENCODE region are not included in the data set. The "Alternate Name" field on the subtrack details page shows the Ensembl ID for the selected gene or transcript.
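The rule that exons extending past a region boundary are dropped can be sketched as a simple interval filter. The coordinate convention (0-based, half-open, as in BED files), the function name and the sample numbers below are assumptions made for illustration; they are not taken from the Ensembl pipeline.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 0-based and half-open


def exons_within_region(exons: List[Exon], region_start: int, region_end: int) -> List[Exon]:
    """Keep only exons that lie entirely inside [region_start, region_end)."""
    return [
        (start, end)
        for start, end in exons
        if start >= region_start and end <= region_end
    ]


# Hypothetical exons around a hypothetical region spanning 100..500.
sample_exons = [(90, 120), (150, 200), (480, 520)]
kept_exons = exons_within_region(sample_exons, 100, 500)
```

Here the first and last exons straddle a region boundary, so only the middle exon is kept.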

ExonHunter

ExonHunter is a comprehensive gene-finder based on hidden Markov models (HMMs) allowing the use of a variety of additional sources of information (ESTs, proteins, genome-genome comparisons).

Exogean

Exogean annotates protein-coding genes by combining mRNA and cross-species protein alignments in directed acyclic colored multigraphs, in which nodes represent biological objects and edges represent human expertise. Additional predictions and methods for this subtrack are available in the EGASP Update track.

Fgenesh Pseudogenes

Fgenesh is an HMM gene structure prediction program. This data set shows predictions of potential pseudogenes.

Fgenesh++

These gene predictions were generated by Fgenesh++, a gene-finding program that uses both HMMs and protein similarity to find genes in a completely automated manner.

GeneID-U12

The GeneID-U12 gene prediction set was generated by a single-genome ab initio method, using a version of GeneID modified to detect U12-dependent introns (both GT-AG and AT-AC subtypes) when present. This modified version of GeneID uses matrices for U12 donor, acceptor and branch sites constructed from examples of published U12 intron splice junctions (both experimentally confirmed and expressed-sequence-validated predictions). Two GeneID-U12 subtracks are included: GeneID Gene Predictions and GeneID U12 Intron Predictions. The U12 splice sites for features in the U12 Intron Predictions track are displayed on the track details pages. Additional predictions and methods for this subtrack are available in the EGASP Update track.
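As a rough illustration of how a position-specific matrix built from example splice junctions can score a candidate site, the sketch below constructs a log-odds matrix from a few aligned sequences and sums the per-position scores of a candidate. The toy sequences, the pseudocount, the uniform background and the function names are all assumptions made for illustration; they are not GeneID's actual matrices or code.

```python
import math

BASES = "ACGT"


def build_log_odds_matrix(examples, pseudocount=1.0, background=0.25):
    """Build one log2-odds column per position from equal-length example sites."""
    length = len(examples[0])
    if any(len(seq) != length for seq in examples):
        raise ValueError("all example sites must have the same length")
    matrix = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for seq in examples:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2((counts[base] / total) / background) for base in BASES}
        )
    return matrix


def score_site(matrix, candidate):
    """Sum the per-position log-odds scores of a candidate site."""
    if len(candidate) != len(matrix):
        raise ValueError("candidate length must match the matrix length")
    return sum(column[base] for column, base in zip(matrix, candidate))


# Hypothetical toy sites, not real U12 donor data.
toy_sites = ["ATATCC", "GTATCC", "ATATCC", "GTATCT"]
toy_matrix = build_log_odds_matrix(toy_sites)
```

A candidate resembling the examples, such as `score_site(toy_matrix, "ATATCC")`, scores above zero, while a dissimilar one such as `"CCGGAA"` scores below zero.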

GeneMark

The eukaryotic version of the GeneMark.hmm gene prediction program (release 2.2) uses an HMM with explicit state durations, also known as a hidden semi-Markov model (HSMM). The model includes hidden states for initial, internal and terminal exons, introns, intergenic regions and single-exon genes. It also includes "border" states: the start site (initiation codon), the stop site (termination codons), and the donor and acceptor splice sites. Sequences of all protein-coding regions were modeled by three-periodic inhomogeneous Markov chains; sequences of non-coding regions were modeled by homogeneous Markov chains. Nucleotide sequences corresponding to the site states were modeled by position-specific inhomogeneous Markov chains. Parameters of the gene models were derived from a set of genes obtained by mapping cDNAs to genomic DNA. To reflect variations in the G+C composition of the genome, the gene model parameters were estimated separately for three G+C content ranges.
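The difference between a three-periodic inhomogeneous Markov chain and a homogeneous one can be sketched in a few lines: the periodic chain keeps a separate transition table for each codon position, while the homogeneous chain uses a single table everywhere. The sketch below assumes first-order chains, toy training sequences and a simple pseudocount, and it ignores the probability of the first base; none of these choices are taken from GeneMark.hmm itself.

```python
import math

BASES = "ACGT"


def train_periodic_chain(seqs, period=3, pseudocount=1.0):
    """Estimate first-order transition tables, one per phase.

    period=3 gives a three-periodic inhomogeneous chain; period=1 gives an
    ordinary homogeneous chain. Training sequences are assumed to start in
    phase 0 (that is, in frame).
    """
    counts = [
        {prev: {nxt: pseudocount for nxt in BASES} for prev in BASES}
        for _ in range(period)
    ]
    for seq in seqs:
        for i in range(1, len(seq)):
            counts[i % period][seq[i - 1]][seq[i]] += 1
    tables = []
    for phase_counts in counts:
        table = {}
        for prev, row in phase_counts.items():
            total = sum(row.values())
            table[prev] = {nxt: c / total for nxt, c in row.items()}
        tables.append(table)
    return tables


def log_likelihood(tables, seq, start_phase=0):
    """Log-probability of the transitions in seq (first base is ignored)."""
    period = len(tables)
    total = 0.0
    for i in range(1, len(seq)):
        phase = (start_phase + i) % period
        total += math.log(tables[phase][seq[i - 1]][seq[i]])
    return total


def coding_log_odds(coding_tables, noncoding_tables, seq, start_phase=0):
    """Positive values favor the periodic (coding) model."""
    return log_likelihood(coding_tables, seq, start_phase) - log_likelihood(
        noncoding_tables, seq
    )


# Hypothetical toy training data, chosen only to make the contrast obvious.
coding_tables = train_periodic_chain(["GCTGCTGCTGCT"], period=3)
noncoding_tables = train_periodic_chain(["ATATATATATAT"], period=1)
```

With these toy tables, `coding_log_odds(coding_tables, noncoding_tables, "GCTGCT")` is positive and the same call on `"ATATAT"` is negative. A real gene finder would combine such scores with the duration and site models described above.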

JIGSAW

JIGSAW uses the output from gene-finders, splice-site prediction programs and sequence alignments to predict gene models. Annotation data downloaded from the UCSC Genome Browser and TIGR gene-finder output were used as input for these predictions. JIGSAW predicts both partial and complete genes. Additional predictions and methods for this subtrack are available in the EGASP Update track.

Pairagon/N-SCAN

The pairHMM-based alignment program, Pairagon, was used to align high-quality mRNA sequences to the ENCODE regions. These were supplemented with N-SCAN EST predictions which are displayed in the Pairgn/NSCAN-E subtrack, and extended further with additional transcripts from the Brent Lab to produce the predictions displayed as the Pairgn/NSCAN-E/+ subtrack. The NSCAN subtrack contains only predictions from the N-SCAN program.

SGP2-U12

The SGP2-U12 gene prediction set was generated by a dual-genome method (SGP2), using a version of GeneID modified to detect U12-dependent introns (both AT-AC and GT-AG subtypes) when present. SGP2 uses similarity (tblastx) to mouse genomic sequence syntenic to the ENCODE regions (Oct. 2004 MSA freeze). This modified version of GeneID uses matrices for U12 donor, acceptor and branch sites constructed from examples of published U12 intron splice junctions (both experimentally confirmed and expressed-sequence-validated predictions). Two SGP2-U12 subtracks are included: SGP2 Gene Predictions and SGP2 U12 Intron Predictions. The U12 splice sites for features in the U12 Intron Predictions track are displayed on the track details pages. Additional predictions and methods for this subtrack are available in the EGASP Update track.

SPIDA

This exon-only prediction set was produced using SPIDA (Substitution Periodicity Index and Domain Analysis). Exons derived by mapping ESTs to the genome were validated by seeking periodic substitution patterns in the aligned informant DNA sequences. First, all available ESTs were mapped to the genome using Exonerate. The resulting transcript structures were "flattened" to remove redundancy. Each exon of the flattened transcripts was subjected to SPI analysis, which involves identifying periodicity in the pattern of mutations occurring between the human and an informant species DNA sequence (the informant sequences and their TBA alignments were provided by Elliott Margulies). SPI was calculated for all available human-informant pairs, both for whole exons and in a sliding 48 bp window. For an exon to be accepted, SPI analysis requires that a threshold level of periodicity be identified in at least two of the informant species. If the exon is accepted, SPI provides the correct frame for its translation. The accepted exon was then used as a starting point for extending the ORF of the flattened transcript from which it came. This gave a full or partial CDS; different exons may give different CDSs. The CDSs were translated and searched for domains using hmmpfam and Pfam_fs. Only transcripts with a domain hit with an E-value below 1.0 were retained. Heuristics were applied to the retained CDSs to identify problems with the transcript structure, particularly frame-shifts. Many transcripts may identify the same exon, but only a single instance of each exon has been retained.
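The idea behind the periodicity test can be sketched as follows: in a coding exon, substitutions between human and an informant species tend to concentrate at one codon position, so counting mismatches by position modulo 3 reveals both the periodicity and the likely frame. The scoring formula, the 0.6 threshold, the function names and the toy sequences below are assumptions made for illustration and are not the published SPI definition; the sketch also assumes ungapped alignments and omits the sliding 48 bp window.

```python
def mismatch_phase_counts(human, informant):
    """Count substitutions at each alignment position modulo 3."""
    if len(human) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    counts = [0, 0, 0]
    for i, (h, o) in enumerate(zip(human, informant)):
        if h != o:
            counts[i % 3] += 1
    return counts


def periodicity_score(human, informant):
    """Return (score, phase) for one human-informant pair.

    score is the share of substitutions falling on the most substituted
    phase: about 1/3 means no periodicity, 1.0 means fully periodic.
    phase is that position modulo 3, or None if there are no substitutions.
    """
    counts = mismatch_phase_counts(human, informant)
    total = sum(counts)
    if total == 0:
        return 0.0, None
    phase = max(range(3), key=lambda p: counts[p])
    return counts[phase] / total, phase


def exon_accepted(human, informants, threshold=0.6, min_informants=2):
    """Accept the exon if enough informants reach the (hypothetical) threshold."""
    passing = sum(
        1 for informant in informants
        if periodicity_score(human, informant)[0] >= threshold
    )
    return passing >= min_informants


# Hypothetical toy alignment: every substitution falls on the third position.
toy_human = "ATGGCTGCAGCC"
toy_informants = ["ATGGCCGCTGCG", "ATGGCAGCGGCC"]
```

In this toy case both informants show fully periodic substitutions at phase 2, so `exon_accepted(toy_human, toy_informants)` returns `True`, while passing only one informant returns `False`.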

Twinscan-MARS

This gene prediction set was produced by a version of Twinscan that employs multiple pairwise genome comparisons to identify protein-coding genes (including alternative splices) using nucleotide homology information. No expression or protein data were used.

Credits

The following individuals and institutions provided the data for the subtracks in this annotation: