Description

The GENCODE Genes track (version 10, Nov 2011) shows high-quality manual annotations merged with evidence-based automated annotations across the entire human genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA manual annotation and Ensembl automatic annotation. Priority is given to the manually curated HAVANA annotation, using predicted Ensembl annotations when there are no corresponding manual annotations. The annotation was carried out on genome assembly GRCh37 (hg19).

NOTE: The UCSC Genome Browser uses the NC_001807 mitochondrial genome sequence (chrM), whereas GENCODE annotates the NC_012920 mitochondrial sequence. As a result, the GENCODE mitochondrial annotations are not available in the UCSC Genome Browser. These annotations are available for download in the GENCODE GTF files.
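For readers who want the mitochondrial annotations from a downloaded GTF file, the following is a minimal Python sketch of one way to pull them out. It assumes the standard nine-column, tab-separated GTF layout and that the mitochondrial records use the sequence name "chrM"; the function name `records_on` and the sample lines are hypothetical and serve only as illustration.

```python
from typing import Iterable, Iterator, List


def records_on(lines: Iterable[str], seqname: str = "chrM") -> Iterator[List[str]]:
    """Yield the nine tab-separated fields of each GTF record on `seqname`.

    Assumption: the sequence name in column 1 is "chrM" for mitochondrial
    records; adjust `seqname` if the file uses a different label.
    """
    for line in lines:
        if line.startswith("#"):
            # Skip header and comment lines.
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[0] == seqname:
            yield fields


# Illustrative in-memory input (made-up records, not real GENCODE data).
sample = [
    "##description: example header\n",
    'chr1\tHAVANA\tgene\t1\t10\t.\t+\t.\tgene_id "A";\n',
    'chrM\tENSEMBL\tgene\t5\t20\t.\t+\t.\tgene_id "B";\n',
]
mito = list(records_on(sample))
```

For a real file, the same function can be fed an open file handle, for example one returned by `gzip.open(path, "rt")` when the download is compressed.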

Display Conventions and Configuration

This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide.

Views available on this track are:
Genes
The gene annotations in this view are divided into three subtracks:
2-way
PolyA
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria:
Coloring for the gene annotations is based on the annotation type:

Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.

Methods

The GENCODE project aims to annotate all evidence-based gene features at high accuracy on the human reference sequence by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006).

GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus.

Non-coding transcript categorization: Non-coding transcripts are categorized using their biotype and the following criteria:

Transcription Support Level (TSL): It is important that users understand how to assess the transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, others are poorly supported and should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl.

The mRNA and EST alignments are compared to the GENCODE transcripts, and each transcript is scored according to how well the alignments match it over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support for a GENCODE transcript annotation being expressed in human. Human transcript sequences from the International Nucleotide Sequence Database Collaboration databases (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl and BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments.

Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed, and these genes are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and that the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and are not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ.
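The multi-exon criteria above can be illustrated with a short Python sketch. This is not the GENCODE pipeline. The function names, the (start, end) exon representation, and the reading of "no unannotated exons" as "the evidence alignment has exactly the same intron chain as the annotation" are assumptions made for illustration. Tolerance for small insertions and deletions is not modelled.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in half-open coordinates


def introns(exons: List[Interval]) -> List[Interval]:
    """Return the introns lying between consecutive exons."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def alignment_supports(annotation: List[Interval], alignment: List[Interval]) -> bool:
    """Check one evidence alignment against one annotated transcript.

    Assumed rule: every intron boundary must match exactly and the
    alignment must not contain extra introns, while the outer start and
    end positions are free to differ. Single-exon annotations are not
    evaluated and are reported as unsupported here.
    """
    annotated = introns(annotation)
    if not annotated:
        return False
    return annotated == introns(alignment)


annotation = [(100, 200), (300, 400), (500, 600)]
# Same introns, different transcript start and end: supported.
same_chain = alignment_supports(annotation, [(150, 200), (300, 400), (500, 550)])
# One intron boundary shifted by a single base: not supported.
shifted = alignment_supports(annotation, [(100, 201), (300, 400), (500, 600)])
# Extra upstream exon that the annotation lacks: not supported.
extra_exon = alignment_supports(
    annotation, [(10, 50), (100, 200), (300, 400), (500, 600)]
)
```

Comparing intron chains rather than exon coordinates is what lets the transcript start and end differ while still requiring exact splice junctions.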

The following categories are assigned to each of the evaluated annotations:

Release Notes

GENCODE version 10 corresponds to Ensembl 65 (December 2011) and Vega 45 (October 2011). Also see: GENCODE project.

Verification

Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing. Those experiments can be found at GEO:

See Harrow et al. (2006) for information on verification techniques.

Credits

This GENCODE release is the result of a collaborative effort among the following laboratories (contact: GENCODE at the Sanger Institute):

Lab/Institution: Contributors
GENCODE Principal Investigator: Tim Hubbard
HAVANA manual annotation group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK: Adam Frankish, Jose Manuel Gonzalez, Mike Kay, Alexandra Bignell, Gloria Despacio-Reyes, Garaub Mukherjee, Gary Sanders, Veronika Boychenko, Jennifer Harrow
Genome Bioinformatics Lab (CRG), Barcelona, Spain: Thomas Derrien, Tyler Alioto, Andrea Tanzer, Roderic Guigó
Genome Bioinformatics, University of California Santa Cruz (UCSC), USA: Rachel Harte, Mark Diekhans, Robert Baertsch, David Haussler
Comp. Genomics Lab, Washington University St. Louis (WUSTL), USA: Jeltje van Baren, Charlie Comstock, David Lu, Michael Brent
Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard, USA: Mike Lin, Manolis Kellis
Computational Biology and Bioinformatics, Yale University (Yale), USA: Philip Cayting, Suganthi Balasubramanian, Baikang Pei, Cristina Sisu, Mark Gerstein
Center for Integrative Genomics, University of Lausanne, Switzerland: Cedric Howald, Alexandre Reymond
Ensembl genebuild group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK: Steve Searle, Bronwen Aken, Amonida Zadissa, Daniel Barrell
Structural Computational Biology Group, Centro Nacional de Investigaciones Oncologicas (CNIO), Madrid, Spain: José Manuel Rodríguez, Michael Tress, Alfonso Valencia

References

Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S et al. Ensembl 2011. Nucleic Acids Res. 2011;39 Database issue:D800-6.

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Data Release Policy

GENCODE data are available for use without restrictions. The full data release policy for ENCODE is available here.