Description

This track shows updated versions of gene predictions submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005. The following gene predictions are included: Augustus, Exogean, FGenesh++, GeneID-U12, Jigsaw, SGP2-U12 and Yale Pseudogenes.

The original EGASP submissions are displayed in the companion tracks, EGASP Full and EGASP Partial.

Display Conventions and Configuration

Data for each gene prediction method within this composite annotation track are displayed in separate subtracks. See the top of the track description page for a complete list of the subtracks available for this annotation. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide.

The individual subtracks within this annotation follow the display conventions for gene prediction tracks. Display characteristics specific to individual subtracks are described in the Methods section. The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature.

Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods.

Methods

Augustus

Augustus uses a generalized hidden Markov model (GHMM) that models coding and non-coding sequence, splice sites, the branch point region, the translation start and end, and the lengths of exons and introns. This version has been trained on a set of 1284 human genes. The track contains four sets of predictions: ab initio, EST and protein-based, mouse homology-based, and combined predictions that use both the EST/protein evidence and the mouse homology evidence as additional input to Augustus.

The EST and protein evidence was generated by aligning sequences from the dbEST and nr databases to the ENCODE region using wublastn and wublastx. The resulting alignments were used to generate hints about putative splice sites, exons, coding regions, introns, translation start and translation stop.

The mouse homology evidence was generated by aligning pairs of human and mouse genomic sequences using the program DIALIGN. Regions conserved at the peptide level were used to generate hints about coding regions.

Exogean

Exogean produces alternative transcripts by combining mRNA and cross-species sequence alignments using heuristic rules. The program implements a generic framework based on directed acyclic colored multigraphs (DACMs). In Exogean, DACM nodes represent biological objects (mRNA or protein HSPs/transcripts) and multiple edges between nodes represent known relationships between these objects derived from human expertise. Exogean DACMs are successively built and reduced, leading to increasingly complex objects. This process enables the production of alternative transcripts from initial HSPs.

FGenesh++

FGenesh++ predictions are based on hidden Markov models and protein similarity to the NR database. For more information, see the reference below.

GeneID-U12

GeneID is a program with a hierarchical design that predicts genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight arrays (PWAs). Next, exons are built from the sites. Each exon is scored as the sum of the scores of its defining sites plus the log-likelihood ratio of a Markov model for coding DNA. Finally, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. The modified version of GeneID used to generate the predictions in this track incorporates models for U12-dependent splice signals in addition to U2 splice signals.
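The exon scoring and assembly steps described above can be illustrated with a minimal Python sketch. It is not GeneID's actual implementation: the Exon class, its field names, and the assemble_gene function are hypothetical, the site scores and coding log-likelihood ratios are assumed to be precomputed, and the assembly uses a simple weighted-interval dynamic program that ignores reading-frame and splice-type compatibility.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    """Hypothetical candidate exon; all names here are illustrative."""
    start: int           # 0-based, inclusive
    end: int             # 0-based, exclusive
    site_scores: tuple   # scores of the defining sites (e.g. acceptor, donor)
    coding_llr: float    # log-likelihood ratio from a coding Markov model

    @property
    def score(self) -> float:
        # Exon score = sum of the defining site scores + coding log-likelihood ratio.
        return sum(self.site_scores) + self.coding_llr


def assemble_gene(exons):
    """Return (total_score, chain) for the non-overlapping exon chain
    with the maximal summed exon score."""
    exons = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in exons]
    # best[i] holds the best (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, exon in enumerate(exons):
        # Count earlier exons that end at or before this exon's start.
        j = bisect_right(ends, exon.start, 0, i)
        take_score = best[j][0] + exon.score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [exon]))
        else:
            best.append(best[i])
    return best[-1]


candidates = [
    Exon(0, 100, (1.0, 2.0), 3.0),
    Exon(50, 150, (1.0, 1.0), 10.0),
    Exon(150, 200, (0.5, 0.5), 1.0),
    Exon(100, 200, (2.0, 2.0), 1.0),
]
total, chain = assemble_gene(candidates)
```

In this toy input, the second and third candidates are compatible and together outscore every other non-overlapping combination, so they form the assembled structure. A real gene finder would also enforce frame consistency between adjacent exons.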

The GeneID subtrack shows all GeneID genes. Only U12 introns and their flanking exons are displayed in the GeneID U12 subtrack. Exons flanking predicted U12-dependent introns are assigned a type attribute reflecting their splice sites, displayed on the details page of the GeneID U12 subtrack as the "Alternate Name" of the item composed of the intron plus flanking exons.

Jigsaw

Jigsaw is a gene prediction program that determines genes based on target genomic sequence and output from a gene structure annotation database. Data downloaded from UCSC's annotation database is used as input and includes the following tracks of evidence: Known Genes, Ensembl, RefSeq, GeneID, Genscan, SGP, Twinscan, Human mRNAs, TIGR Gene Index, UniGene, Most Conserved Elements and Non-human RefSeq Genes. GlimmerHMM and GeneZilla, two open source ab initio gene-finding programs based on GHMMs, are also used.

SGP2-U12

To predict genes in a genomic query, SGP2 combines GeneID predictions with tblastx comparisons of the genomic query against other genomic sequences. This modified version of SGP2 uses models for U12-dependent splice signals in addition to U2 splice signals. The reference genomic sequence for this data set is the Oct. 2004 release of mouse sequence syntenic to ENCODE regions.

The SGP2 and SGP2 U12 tracks follow the same display conventions as the GeneID and GeneID U12 subtracks described above.

Yale Pseudogenes

For this analysis, pseudogenes were defined as genomic sequences similar to known human genes and with various disablements (premature stop codons or frameshifts) in their "putative" protein-coding regions.

The protein sequences of known human genes (as annotated by Ensembl) were used to search for similar nongenic sequences in ENCODE regions. The matching sequences were assessed as disabled copies of genes based on the occurrences of premature stop codons or frameshifts. The intron-exon structure of the functional gene was further used to infer whether a pseudogene was duplicated or processed (a duplicated pseudogene keeps the intron-exon structure of its parent functional gene). Small pseudogene sequences were labeled as fragments or other types.
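As a rough illustration of what checking for disablements can look like, the following Python sketch flags in-frame premature stop codons and frameshifting alignment gaps. It is not the Yale pipeline: the function names are hypothetical, the input is assumed to be a nucleotide sequence already placed in the parent gene's reading frame, and the gap lengths are assumed to come from an alignment against the parent coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds: str):
    """Return 0-based codon indices of in-frame stop codons that occur
    before the final complete codon of the sequence."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    return [
        i for i in range(n_codons - 1)
        if cds[3 * i:3 * i + 3] in STOP_CODONS
    ]


def frameshift_gaps(gap_lengths):
    """Return the alignment gap lengths that shift the reading frame,
    i.e. those that are not a multiple of three."""
    return [g for g in gap_lengths if g % 3 != 0]


def is_disabled(cds: str, gap_lengths) -> bool:
    """A candidate counts as disabled if it has at least one premature
    stop codon or at least one frameshifting gap."""
    return bool(premature_stops(cds)) or bool(frameshift_gaps(gap_lengths))


stops = premature_stops("ATGTAAGGGTGA")   # codons: ATG TAA GGG TGA
shifts = frameshift_gaps([3, 1, 6, 2])
```

Here the second codon (index 1) is a premature stop, and the gaps of length 1 and 2 would each shift the frame, while gaps of length 3 and 6 would not.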

All pseudogenes in this track were manually curated. In the browser, the track details page shows the pseudogene type.

Credits

Augustus was written by Mario Stanke at the Department of Bioinformatics of the University of Göttingen in Germany.

Exogean was developed by Sarah Djebali and Hugues Roest Crollius from the Dyogen Lab, Ecole Normale Supérieure (Paris, France) and Franck Delaplace from the Laboratoire de Méthodes Informatiques (LaMI) (Evry, France).

The FGenesh++ gene predictions were provided by Victor Solovyev of Softberry Inc.

The GeneID-U12 and SGP2-U12 programs were developed by the Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM) in Barcelona. The version of GeneID on which GeneID-U12 is based (geneid_v1.2) was written by Enrique Blanco and Roderic Guigó. The parameter files were constructed by Genis Parra and Francisco Camara. Additional contributions were made by Josep F. Abril, Moises Burset and Xavier Messeguer. Modifications to GeneID that allow for the prediction of U12-dependent splice sites and incorporation of U12 introns into gene models were made by Tyler Alioto.

Jigsaw was developed at The Institute for Genomic Research (TIGR) by Jonathan Allen and Steven Salzberg, with computational gene-finder contributions from Mihaela Pertea and William Majoros. Continued maintenance and development of Jigsaw will be provided by the Salzberg group at the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park.

The Yale Pseudogenes were generated by the pseudogene annotation group of Mark Gerstein at Yale University.

References

Augustus

Stanke, M. Gene prediction with a hidden Markov model. Ph.D. thesis, Universität Göttingen, Germany (2004).

Stanke, M. and Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19(Suppl. 2), ii215-ii225 (2003).

Stanke, M., Steinkamp, R., Waack, S. and Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucl. Acids Res., 32, W309-W312 (2004).

FGenesh++

Solovyev V.V. "Statistical approaches in Eukaryotic gene prediction". In Handbook of Statistical Genetics (eds. Balding D. et al.) (John Wiley & Sons, Inc., 2001). p. 83-127.

GeneID

Blanco, E., Parra, G. and Guigó, R. "Using geneid to identify genes". In Current Protocols in Bioinformatics, Unit 4.3. (ed. Baxevanis, A.D.) (John Wiley & Sons, Inc., 2002).

Guigó, R. Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol. 5(4), 681-702 (1998).

Guigó, R., Knudsen, S., Drake, N. and Smith, T. Prediction of gene structure. J Mol Biol. 226(1), 141-57 (1992).

Parra, G., Blanco, E. and Guigó, R. GeneID in Drosophila. Genome Research 10(4), 511-515 (2000).

Jigsaw

Allen, J.E., Pertea, M. and Salzberg, S.L. Computational gene prediction using multiple sources of evidence. Genome Res., 14(1), 142-8 (2004).

Allen, J.E. and Salzberg, S.L. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18), 3596-3603 (2005).

SGP2

Guigó, R., Dermitzakis, E.T., Agarwal, P., Ponting, C.P., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003).

Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W. and Guigó, R. Comparative gene prediction in human and mouse. Genome Res. 13(1), 108-17 (2003).