Description

The GENCODE Genes track (version 4, May 2010) shows high-quality manual annotations merged with evidence-based automated annotations across the entire human genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA and ENSEMBL. Priority is given to the manually curated HAVANA annotation, using predicted ENSEMBL annotations when there are no corresponding manual annotations. The annotation was carried out on genome assembly GRCh37 (hg19).

Display Conventions and Configuration

The annotations are divided into separate tracks based on source and confidence level. The GENCODE project recommends that annotations from levels 1 and 2 be used for in-depth analysis in "finished" regions, and that levels 1, 2 and 3 combined be used for any method that analyzes the entire genome and requires a full gene set.
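The recommendation above amounts to a small selection rule. The following Python sketch is illustrative only: the function names and the record layout (a dict with a `level` key) are assumptions made for this example, not part of the GENCODE distribution or the browser's table schema.

```python
def levels_for_analysis(genome_wide):
    """Return the GENCODE levels recommended for an analysis.

    genome_wide=True  -> levels 1, 2 and 3 (full gene set)
    genome_wide=False -> levels 1 and 2 (in-depth work in "finished" regions)
    """
    return {1, 2, 3} if genome_wide else {1, 2}


def filter_by_level(annotations, genome_wide):
    """Keep only annotation records whose 'level' is recommended.

    `annotations` is a hypothetical list of dicts, each with a 'level' key.
    """
    wanted = levels_for_analysis(genome_wide)
    return [a for a in annotations if a["level"] in wanted]
```

For example, calling `filter_by_level(records, genome_wide=False)` would drop level 3 (automated) records and keep validated and manually annotated ones.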

Level 1: validated

This level comprises pseudogene loci that were predicted independently by the analysis pipelines from Yale and UCSC and by HAVANA manual annotation (WTSI), together with transcripts that were verified by the GENCODE experimental pipeline.

Level 2: manual annotation

This level consists of HAVANA manual annotation from WTSI. The following regions are considered "fully annotated", although they will still be updated: chromosomes 1, 2, 3, 4, 6, 9, 10, 13, 20, 21, 22, X and Y; the ENCODE pilot regions; and chr11:2353995-3878750.
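As a rough illustration, the list of "fully annotated" regions can be turned into a membership test. This is a minimal Python sketch under stated assumptions: the `chrN` naming, the function name, and the treatment of the chr11 coordinates as 1-based inclusive are choices made for this example. The ENCODE pilot regions are omitted because their coordinates are not given here.

```python
# Chromosomes listed in the text as "fully annotated" (UCSC-style names assumed).
FULLY_ANNOTATED_CHROMS = {
    "chr1", "chr2", "chr3", "chr4", "chr6", "chr9", "chr10",
    "chr13", "chr20", "chr21", "chr22", "chrX", "chrY",
}

# Extra interval from the text: chr11:2353995-3878750.
# Treated here as 1-based, inclusive (an assumption).
# ENCODE pilot regions are NOT included; add them if coordinates are available.
EXTRA_REGIONS = [("chr11", 2353995, 3878750)]


def is_fully_annotated(chrom, start, end):
    """Return True if [start, end] lies wholly inside a listed region."""
    if chrom in FULLY_ANNOTATED_CHROMS:
        return True
    return any(
        chrom == c and s <= start and end <= e
        for c, s, e in EXTRA_REGIONS
    )
```

A feature that only partially overlaps the chr11 interval is reported as not fully annotated; whether that is the right policy depends on the analysis.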

Level 3: automated annotation

This level consists of ENSEMBL annotation in regions where no HAVANA annotation or additional isoforms can be found.

NOTE: We try to synchronize the release cycles for GENCODE, Havana and Ensembl. This GENCODE version 4 corresponds to Ensembl 58 and Vega 38. Also see: GENCODE project.

The gene annotations are colored based on the annotation type and the confidence level. See the table below for the color key, as well as more detail about the transcript and feature types.

Class	Color	Description	Transcript Types (see Vega Transcript Types)
Validated_coding	Dark Orange	Level 1 Validated: coding regions	protein_coding
Validated_processed	Light Orange	Level 1 Validated: processed	processed_transcript
Validated_processed_pseudogene	Dark Pink	Level 1 Validated: processed pseudogenes	processed_pseudogene, processed_transcript, transcribed_processed_pseudogene
Validated_unprocessed_pseudogene	Medium Pink	Level 1 Validated: unprocessed pseudogenes	transcribed_unprocessed_pseudogene, unprocessed_pseudogene
Validated_pseudogene	Light Pink	Level 1 Validated: pseudogenes	IG_pseudogene, polymorphic_pseudogene, pseudogene, retrotransposed, unitary_pseudogene
Havana_coding	Dark Orange	Level 2 Manual annotation: coding	IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene, protein_coding
Havana_nonsense	Medium Orange	Level 2 Manual annotation: nonsense	nonsense_mediated_decay
Havana_non_coding	Light Orange	Level 2 Manual annotation: non-coding	ambiguous_orf, antisense, non_coding, processed_transcript, retained_intron
Havana_polyA	Black	Level 2 Manual annotation: polyA	polyA_signal, polyA_site, pseudo_polyA
Havana_processed_pseudogene	Dark Pink	Level 2 Manual annotation: processed pseudogene	processed_pseudogene, transcribed_processed_pseudogene
Havana_unprocessed_pseudogene	Medium Pink	Level 2 Manual annotation: unprocessed pseudogene	transcribed_unprocessed_pseudogene, unprocessed_pseudogene
Havana_pseudogene	Light Pink	Level 2 Manual annotation: pseudogene	IG_pseudogene, TR_pseudogene, polymorphic_pseudogene, pseudogene, retrotransposed, unitary_pseudogene
Havana_TEC	Grey	Level 2 Manual annotation: TEC	TEC, artifact
Ensembl_coding	Dark Red	Level 3 Automated annotation: coding	IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene, protein_coding
Ensembl_non_coding	Light Orange	Level 3 Automated annotation: non-coding	antisense, non_coding, processed_transcript, retained_intron
Ensembl_pseudogene	Dark Pink	Level 3 Automated annotation: pseudogene	IG_pseudogene, miRNA_pseudogene, misc_RNA_pseudogene, pseudogene, retrotransposed, unitary_pseudogene
Ensembl_processed_pseudogene	Medium Pink	Level 3 Automated annotation: processed pseudogene	processed_pseudogene
Ensembl_unprocessed_pseudogene	Light Pink	Level 3 Automated annotation: unprocessed pseudogene	unprocessed_pseudogene
Ensembl_RNA	Light Red	Level 3 Automated annotation: RNA transcripts	Mt_rRNA, Mt_tRNA, Mt_tRNA_pseudogene, miRNA, misc_RNA, rRNA, rRNA_pseudogene, scRNA_pseudogene, snRNA, snRNA_pseudogene, snoRNA, snoRNA_pseudogene, tRNA_pseudogene, tRNAscan
2way_pseudogene	Dark Purple	Level 3 Automated annotation: pseudogenes	pseudogenes
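In the table above, the prefix of each class name tracks its confidence level. A minimal Python sketch of that mapping follows; the dictionary and function names are assumptions for illustration and do not correspond to any browser or GENCODE API.

```python
# Class-name prefix -> GENCODE level, as read off the table above.
LEVEL_BY_PREFIX = {
    "Validated": 1,  # Level 1: validated
    "Havana": 2,     # Level 2: manual annotation
    "Ensembl": 3,    # Level 3: automated annotation
    "2way": 3,       # 2way_pseudogene is listed under Level 3
}


def level_of_class(class_name):
    """Return the level for a class name such as 'Havana_coding'.

    Raises KeyError for a prefix that does not appear in the table.
    """
    prefix = class_name.split("_", 1)[0]
    return LEVEL_BY_PREFIX[prefix]
```

For instance, `level_of_class("Havana_TEC")` gives 2 and `level_of_class("2way_pseudogene")` gives 3.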

This track uses filtering by category to select subsets of transcripts and has additional advanced features. Help with these features can be found here.

Methods

We aim to annotate all evidence-based gene features at high accuracy on the human reference sequence. This includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. We integrate computational approaches (including comparative methods), manual annotation and targeted experimental verification.

For a detailed description of the methods and references used, see Harrow et al. (2006).

Verification

See Harrow et al. (2006) for information on verification techniques.

Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing. Those experiments can be found at GEO:

GSE34797:[E-MTAB-684] - Batch IV is based on chromosomes 3, 4 and 5 annotations from GENCODE 4 (January 2010).
GSE34820:[E-MTAB-737] - Batch V is based on annotations from GENCODE 6 (November 2010).
GSE34821:[E-MTAB-831] - Batch VI is based on annotations from GENCODE 6 (November 2010) as well as transcript models predicted by the Ensembl Genebuild group based on the Illumina Human BodyMap 2.0 data.

Credits

This GENCODE release is the result of a collaborative effort among the following laboratories (contact: GENCODE at the Sanger Institute):

Lab/Institution	Contributors
HAVANA annotation group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK	Adam Frankish, James Gilbert, Jennifer Harrow, Felix Kokocinski, Stephen Trevanion, Tim Hubbard (GENCODE Principal Investigator)
Genome Bioinformatics Lab (CRG), Barcelona, Spain	Thomas Derrien, Tyler Alioto, Andrea Tanzer, Roderic Guigó
Genome Bioinformatics, University of California Santa Cruz (UCSC), USA	Rachel Harte, Mark Diekhans, Robert Baertsch, David Haussler
Comp. Genomics Lab, Washington University St. Louis (WUSTL), USA	Jeltje van Baren, Charlie Comstock, David Lu, Michael Brent
Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard, USA	Mike Lin, Manolis Kellis
Computational Biology and Bioinformatics, Yale University (Yale), USA	Philip Cayting, Suganthi Balasubramanian, Baikang Pei, Cristina Sisu, Mark Gerstein
Center for Integrative Genomics, University of Lausanne, Switzerland	Cedric Howald, Alexandre Reymond
ENSEMBL genebuild group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK	Steve Searle, Bronwen Aken, Amonida Zadissa, Daniel Barrell
Structural Computational Biology Group, Centro Nacional de Investigaciones Oncológicas (CNIO), Madrid, Spain	José Manuel Rodríguez, Michael Tress, Alfonso Valencia

References

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Flicek et al. Ensembl 2011. Nucleic Acids Res. 2011;39(Database issue):D800-D806.

Data Release Policy

GENCODE data are available for use without restrictions. The full data release policy for ENCODE is available here.
