Description

The Gencode Genes track (v3.1, March 2007) shows high-quality manual annotations in the ENCODE regions generated by the GENCODE project.

The gene annotations are colored based on the HAVANA annotation type. See the table below for the color key, as well as more detail about the transcript and feature types. The GENCODE project recommends that the annotations with known and validated transcripts, i.e., the types Known and Novel_CDS (which are colored dark green in the track display), be used as the reference gene annotation.

The v3.1 release includes the following updates and enhancements to v2.2 (Oct. 2005):

Apart from the usual additions and corrections, 69 loci (consisting of 132 transcripts) were re-annotated based on Rapid Amplification of cDNA Ends (RACE), array, and sequencing analyses performed within the Affymetrix/GENCODE collaboration (see the Methods section below, also Denoeud et al., 2007 and The ENCODE Project Consortium, 2007).
The polymorphic gene type was added.
PolyA features were added.
A bug affecting frames of CDSs with missing start or stop codons was fixed.
The experimental validation data contained in the Gencode Introns track from the previous release were integrated into the Gencode Genes track by annotators using the Human and Vertebrate Analysis and Annotation manual curation process (HAVANA).
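
The CDS frame fix listed above concerns the frame value attached to each coding segment. The following is a minimal sketch of how such frames are commonly derived, assuming a GTF-style convention in which the frame is the number of bases (0, 1, or 2) to skip at the start of a segment before the first complete codon. The function name and the example lengths are hypothetical; the source does not describe the bug itself, so this only illustrates why a CDS with a missing start codon may need a non-zero frame on its first segment.

```python
def cds_frames(segment_lengths, first_frame=0):
    """Compute the frame of each CDS segment, listed in 5' to 3' order.

    first_frame is 0 for a CDS that begins with a start codon; a CDS
    whose start codon is missing may begin mid-codon (frame 1 or 2).
    This convention is an assumption, not taken from the track description.
    """
    frames = []
    frame = first_frame
    for length in segment_lengths:
        frames.append(frame)
        # Bases left over after the last complete codon of this segment
        # determine how many bases the next segment must skip.
        frame = (3 - (length - frame) % 3) % 3
    return frames


# Hypothetical segment lengths, for illustration only.
complete = cds_frames([100, 50, 30])                  # CDS with a start codon
partial = cds_frames([100, 50, 30], first_frame=1)    # start codon missing
```

With a complete CDS the first segment always has frame 0; when the start codon is missing, the initial frame has to be supplied from the alignment evidence and every downstream frame shifts accordingly.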

Type	Color	Description
Known	dark green	Known protein-coding genes (i.e., referenced in Entrez Gene)
Novel_CDS	dark green	Have an open reading frame (ORF) and are identical, or have homology, to cDNAs or proteins but do not fall into the above category. These can be known in the sense that they are represented by mRNA sequences in the public databases, but they are not yet represented in Entrez Gene or have not received an official gene name. They can also be novel in that they are not yet represented by an mRNA sequence in human.
Novel_transcript	light green	Similar to Novel_CDS; however, cannot be assigned an unambiguous ORF.
Putative	light green	Are identical, or have homology, to spliced ESTs, but are devoid of significant ORF and polyA features. These are generally short (two or three exon) genes or gene fragments.
TEC	light green	(To Experimentally Confirm) Single-exon objects (supported by multiple unspliced ESTs with polyA sites and signals).
Polymorphic	purple	Have functional transcripts in one haplotype and "pseudo" (non-functional) transcripts in another.
Processed_pseudogene	blue	Pseudogenes that lack introns and are thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
Unprocessed_pseudogene	blue	Pseudogenes that can contain introns, as they are produced by gene duplication.
Artifact	grey	Transcript evidence and/or its translation is equivocal. Usually these arise from high-throughput cDNA sequencing projects that submit automatic annotation, sometimes resulting in erroneous CDSs in what turns out to be, for example, 3' UTRs. In addition, HAVANA has extended this category to include cDNAs with non-canonical splice sites due to deletion/sequencing errors.
PolyA_signal	brown	Polyadenylation signal
PolyA_site	orange	Polyadenylation site
Pseudo_polyA	pink	"Pseudo"-polyadenylation signal detected in the sequence of a processed pseudogene. Warning: Pseudo_polyA features and processed_pseudogenes generally don't overlap. The reason is that pseudogene annotations are based solely on protein evidence, whereas pseudo_polyA signals are identified from transcript evidence; as they are found at the end of the 3' UTR, they can lie several kb downstream of the 3' end of the pseudogene.
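
The color key and the recommendation on reference annotations can be expressed as a small lookup. This is a minimal sketch: the color names come from the table above (exact RGB values are not given here), while the record layout and the function name are hypothetical.

```python
# Color key from the table above (names only; RGB values are not stated).
TYPE_COLORS = {
    "Known": "dark green",
    "Novel_CDS": "dark green",
    "Novel_transcript": "light green",
    "Putative": "light green",
    "TEC": "light green",
    "Polymorphic": "purple",
    "Processed_pseudogene": "blue",
    "Unprocessed_pseudogene": "blue",
    "Artifact": "grey",
    "PolyA_signal": "brown",
    "PolyA_site": "orange",
    "Pseudo_polyA": "pink",
}

# Types recommended as the reference gene annotation.
REFERENCE_TYPES = {"Known", "Novel_CDS"}


def reference_annotations(records):
    """Keep only records whose type is Known or Novel_CDS.

    `records` is a hypothetical iterable of (name, type) pairs.
    """
    return [(name, typ) for name, typ in records if typ in REFERENCE_TYPES]


# Made-up transcript names, for illustration only.
example = [("tx1", "Known"), ("tx2", "Putative"), ("tx3", "Novel_CDS")]
selected = reference_annotations(example)
```

Filtering on the type rather than on the display color avoids confusing the reference set with other categories that might share a color in a different display configuration.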

The current full set of GENCODE annotations is available for download here.

Methods

For a detailed description of the methods and references used, see Harrow et al., 2006 and Denoeud et al., 2007.

5' RACE/array experiments

A combination of 5' RACE and high-density tiling microarrays was used to empirically annotate 5' transcription start sites (TSSs) and internal exons of all 410 annotated protein-coding loci across the 44 ENCODE regions (Oct. 2005 GENCODE freeze). The 5' RACE reactions were performed with oligonucleotides mapping to a coding exon common to most of the transcripts of a protein-coding gene locus annotated by GENCODE (Oct. 2005 freeze) on polyA+ RNA from twelve adult human tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta) and three cell lines: GM06990 (lymphoblastoid), HL60 (acute promyelocytic leukemia), and HeLaS3 (cervix carcinoma).

The RACE reactions were then hybridized to 20-nucleotide-resolution Affymetrix tiling arrays covering the non-repeated regions of the 44 ENCODE regions. The resulting "RACEfrags" (array-detected fragments of RACE products) were assessed for novelty by comparing their genome coordinates to those of GENCODE-annotated exons. Connectivity between novel RACEfrags and their respective index exon was further investigated by RT-PCR, cloning and sequencing. The resulting cDNA sequences (deposited in GenBank under accession numbers DQ655905-DQ656069 and EF070113-EF070122) were then fed into the HAVANA annotation pipeline as mRNA evidence (see "HAVANA manual annotations" below).
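
The novelty assessment described above amounts to an interval comparison between RACEfrags and annotated exons. The sketch below illustrates the idea under the assumption that a RACEfrag counts as novel when it overlaps no annotated exon on the same chromosome; the exact criterion used in the study is not stated here, and the function names and coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def novel_racefrags(racefrags, exons):
    """Return RACEfrags that overlap no annotated exon on the same chromosome.

    Both inputs are lists of (chrom, start, end) tuples. The "no overlap"
    rule is an assumption made for this illustration.
    """
    exons_by_chrom = {}
    for chrom, start, end in exons:
        exons_by_chrom.setdefault(chrom, []).append((start, end))

    novel = []
    for chrom, start, end in racefrags:
        hit = any(
            overlaps(start, end, ex_start, ex_end)
            for ex_start, ex_end in exons_by_chrom.get(chrom, [])
        )
        if not hit:
            novel.append((chrom, start, end))
    return novel


# Made-up coordinates, for illustration only.
exons = [("chr21", 100, 200), ("chr21", 500, 650)]
racefrags = [("chr21", 150, 220), ("chr21", 300, 340), ("chr22", 100, 200)]
novel = novel_racefrags(racefrags, exons)
```

A linear scan per RACEfrag is adequate for regions of this size; an interval tree or a sorted sweep would be the usual choice for genome-wide data.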

HAVANA manual annotations

The HAVANA process was used to produce these annotations.

Before the manual annotation process begins, an automated analysis pipeline for similarity searches and ab initio predictions is run on a computer farm, and its results are stored in an Ensembl MySQL database using a modified Ensembl analysis pipeline system. All searches and prediction algorithms, except CpG island prediction (see cpgreport in the EMBOSS application suite), are run on repeat-masked sequence. RepeatMasker is used to mask interspersed repeats, followed by Tandem Repeats Finder to mask tandem repeats.

Nucleotide sequence databases are searched with wuBLASTN, and significant hits are re-aligned to the unmasked genomic sequence using est2genome. The UniProt protein database is searched with wuBLASTX, and the accession numbers of significant hits are looked up in the Pfam database. The hidden Markov models for Pfam protein domains are aligned against the genomic sequence using Genewise to provide annotation of protein domains.

Several ab initio prediction algorithms are also run: Genscan and Fgenesh for genes, tRNAscan to find tRNA genes, and Eponine TSS to predict transcription start sites.

Once the automated analysis is complete, the annotator uses a Perl/Tk-based graphical interface, "otterlace", developed in-house at the Wellcome Trust Sanger Institute, to edit annotation data held in a separate MySQL database system. The interface displays a rich, interactive graphical view of the genomic region, showing features such as database matches, gene predictions, and transcripts created by the annotators. Gapped alignments of nucleotide and protein BLAST hits to the genomic sequence are viewed and explored using the "Blixem" alignment viewer.

Additionally, the "Dotter" dot plot tool is used to show the pairwise alignments of unmasked sequence, thus revealing the location of exons that are occasionally missed by the automated blast searches because of their small size and/or match to repeat-masked sequence.
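
To illustrate why a dot plot of unmasked sequence can expose a short exon, here is a minimal sketch of a dot plot built from exact word matches. It is not Dotter's algorithm, which is not described here; the function name, the word size, and the toy sequences are all assumptions made for illustration.

```python
def dot_plot_matches(seq_a, seq_b, word=4):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word].

    Each pair is one dot; a run of dots along a diagonal indicates a
    locally aligned stretch, such as a short exon matching a cDNA.
    """
    # Index every word of seq_b by its start position.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)

    # Look up every word of seq_a in that index.
    matches = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            matches.append((i, j))
    return matches


# Toy sequences: a short "exon" (ACGTGC) embedded in flanking sequence.
genomic = "TTTTACGTGCAAAA"
cdna = "ACGTGC"
matches = dot_plot_matches(genomic, cdna, word=4)
```

The three consecutive dots on one diagonal mark the embedded stretch. Because nothing is masked or filtered by a significance threshold, a match this short remains visible, which is the property the text attributes to the dot plot view.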

The interface provides a number of tools that the annotator uses to build genes and edit annotations: adding transcripts, exon coordinates, translation regions, gene names and descriptions, remarks, and polyadenylation signals and sites.

Verification

See Harrow et al., 2006 for information on verification techniques.

Credits

This GENCODE release is the result of a collaborative effort among the following laboratories:

Lab/Institution	Contributors
HAVANA annotation group, Wellcome Trust Sanger Institute, Hinxton, UK	Adam Frankish, Jonathan Mudge, James Gilbert, Tim Hubbard, Jennifer Harrow
Genome Bioinformatics Lab CRG, Barcelona, Spain	France Denoeud, Julien Lagarde, Sylvain Foissac, Robert Castelo, Roderic Guigó (GENCODE Principal Investigator)
Department of Genetic Medicine and Development, University of Geneva, Switzerland	Catherine Ucla, Carine Wyss, Caroline Manzano, Colette Rossier, Stylianos E. Antonarakis
Center for Integrative Genomics, University of Lausanne, Switzerland	Jacqueline Chrast, Charlotte N. Henrichsen, Alexandre Reymond
Affymetrix, Inc., Santa Clara, CA, USA	Philipp Kapranov, Thomas R. Gingeras

References

Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J et al. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007 Jun;17(6):746-59.

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007 Jun 14;447(7146):799-816.
