The Gencode Genes track (v3.1, March 2007) shows high-quality manual annotations in the ENCODE regions generated by the GENCODE project.
The gene annotations are colored based on the HAVANA annotation type. See the table below for the color key, as well as more detail about the transcript and feature types. The GENCODE project recommends that the annotations with known and validated transcripts (i.e., the types Known and Novel_CDS, which are colored dark green in the track display) be used as the reference gene annotation.
The v3.1 release includes the following updates and enhancements to v2.2 (Oct. 2005):
Type | Color | Description |
---|---|---|
Known | dark green | Known protein-coding genes (i.e., referenced in Entrez Gene) |
Novel_CDS | dark green | Have an open reading frame (ORF) and are identical, or have homology, to cDNAs or proteins but do not fall into the above category. These can be known in the sense that they are represented by mRNA sequences in the public databases, but they are not yet represented in Entrez Gene or have not received an official gene name. They can also be novel in that they are not yet represented by an mRNA sequence in human. |
Novel_transcript | light green | Similar to Novel_CDS; however, cannot be assigned an unambiguous ORF. |
Putative | light green | Are identical, or have homology, to spliced ESTs, but are devoid of significant ORF and polyA features. These are generally short (two or three exon) genes or gene fragments. |
TEC | light green | (To Experimentally Confirm) Single-exon objects (supported by multiple unspliced ESTs with polyA sites and signals). |
Polymorphic | purple | Have functional transcripts in one haplotype and "pseudo" (non-functional) transcripts in another. |
Processed_pseudogene | blue | Pseudogenes that lack introns and are thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome. |
Unprocessed_pseudogene | blue | Pseudogenes that can contain introns, as they are produced by gene duplication. |
Artifact | grey | Transcript evidence and/or its translation equivocal. Usually these arise from high-throughput cDNA sequencing projects that submit automatic annotation, sometimes resulting in erroneous CDSs in what turns out to be, for example, 3' UTRs. In addition, HAVANA has extended this category to include cDNAs with non-canonical splice sites due to deletion/sequencing errors. |
PolyA_signal | brown | Polyadenylation signal |
PolyA_site | orange | Polyadenylation site |
Pseudo_polyA | pink | "Pseudo"-polyadenylation signal detected in the sequence of a processed pseudogene. Warning: Pseudo_polyA features and processed_pseudogenes generally don't overlap. The reason is that pseudogene annotations are based solely on protein evidence, whereas pseudo_polyA signals are identified from transcript evidence; as they are found at the end of the 3' UTR, they can lie several kb downstream of the 3' end of the pseudogene. |
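
The color key and the reference-set recommendation above can be expressed as a small lookup, sketched here in Python. The record layout (dictionaries with `name` and `type` keys), the sample records, and the names `TYPE_COLORS`, `REFERENCE_TYPES`, and `reference_annotations` are illustrative assumptions, not part of the track's actual file format or tooling.

```python
# Color key transcribed from the table above.
# All identifiers here are hypothetical, chosen for illustration.
TYPE_COLORS = {
    "Known": "dark green",
    "Novel_CDS": "dark green",
    "Novel_transcript": "light green",
    "Putative": "light green",
    "TEC": "light green",
    "Polymorphic": "purple",
    "Processed_pseudogene": "blue",
    "Unprocessed_pseudogene": "blue",
    "Artifact": "grey",
    "PolyA_signal": "brown",
    "PolyA_site": "orange",
    "Pseudo_polyA": "pink",
}

# Types recommended for use as the reference gene annotation.
REFERENCE_TYPES = {"Known", "Novel_CDS"}


def reference_annotations(records):
    """Keep only records whose type is in the recommended reference set."""
    return [r for r in records if r["type"] in REFERENCE_TYPES]


# Made-up example records (assumed layout: name + HAVANA type).
records = [
    {"name": "tx1", "type": "Known"},
    {"name": "tx2", "type": "Putative"},
    {"name": "tx3", "type": "Novel_CDS"},
    {"name": "tx4", "type": "Processed_pseudogene"},
]

for r in reference_annotations(records):
    print(r["name"], r["type"], TYPE_COLORS[r["type"]])
```

Both reference types map to dark green, so filtering by type and filtering by display color select the same records.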
The current full set of GENCODE annotations is available for download here.
For a detailed description of the methods and references used, see Harrow et al., 2006 and Denoeud et al., 2007.
A combination of 5' RACE and high-density tiling microarrays was used to empirically annotate 5' transcription start sites (TSSs) and internal exons of all 410 annotated protein-coding loci across the 44 ENCODE regions (Oct. 2005 GENCODE freeze). The 5' RACE reactions were performed with oligonucleotides mapping to a coding exon common to most of the transcripts of a protein-coding gene locus annotated by GENCODE (Oct. 2005 freeze) on polyA+ RNA from twelve adult human tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta) and three cell lines (GM06990 (lymphoblastoid), HL60 (acute promyelocytic leukemia) and HeLaS3 (cervix carcinoma)).
The RACE reactions were then hybridized to 20-nucleotide resolution Affymetrix tiling arrays covering the non-repeated regions of the 44 ENCODE regions. The resulting "RACEfrags" (array-detected fragments of RACE products) were assessed for novelty by comparing their genome coordinates to those of GENCODE-annotated exons. Connectivity between novel RACEfrags and their respective index exon was further investigated by RT-PCR, cloning and sequencing. The resulting cDNA sequences (deposited in GenBank under accession numbers DQ655905-DQ656069 and EF070113-EF070122) were then fed into the HAVANA annotation pipeline as mRNA evidence (see "HAVANA manual annotations" below).
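The novelty assessment described above amounts to an interval comparison between RACEfrags and annotated exons. The following Python sketch shows one way this could look, assuming that a RACEfrag counts as novel when it overlaps no annotated exon on the same chromosome; the exact criterion used by the authors is not stated here, and the coordinates and function names are made up for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def novel_racefrags(racefrags, exons):
    """Return RACEfrags that overlap no annotated exon on the same chromosome.

    Both inputs are lists of (chrom, start, end) tuples (assumed layout).
    """
    novel = []
    for chrom, start, end in racefrags:
        hit = any(
            c == chrom and overlaps(start, end, s, e)
            for c, s, e in exons
        )
        if not hit:
            novel.append((chrom, start, end))
    return novel


# Hypothetical coordinates, for illustration only.
exons = [("chr21", 1000, 1200), ("chr21", 3000, 3150)]
racefrags = [
    ("chr21", 1100, 1180),  # inside an annotated exon
    ("chr21", 2000, 2080),  # between annotated exons
    ("chr22", 1100, 1180),  # same coordinates, different chromosome
]

print(novel_racefrags(racefrags, exons))
```

The linear scan is adequate for a few hundred loci; a sorted index or interval tree would be the usual choice for genome-wide input.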
The HAVANA process was used to produce these annotations.
Before the manual annotation process begins, an automated analysis pipeline for similarity searches and ab initio predictions is run on a computer farm, and the results are stored in an Ensembl MySQL database using a modified Ensembl analysis pipeline system. All searches and prediction algorithms, except CpG island prediction (see cpgreport in the EMBOSS application suite), are run on repeat-masked sequence. RepeatMasker is used to mask interspersed repeats, followed by Tandem Repeats Finder to mask tandem repeats.
Nucleotide sequence databases are searched with wuBLASTN, and significant hits are re-aligned to the unmasked genomic sequence using est2genome. The UniProt protein database is searched with wuBLASTX, and the accession numbers of significant hits are looked up in the Pfam database. The hidden Markov models for Pfam protein domains are aligned against the genomic sequence using Genewise to provide annotation of protein domains.
Several ab initio prediction algorithms are also run: Genscan and Fgenesh for genes, tRNAscan to find tRNA genes, and Eponine TSS to predict transcription start sites.
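To illustrate the two-stage masking described above, here is a minimal Python sketch that hard-masks repeat intervals with `N`. It does not invoke RepeatMasker or Tandem Repeats Finder; the intervals are hypothetical stand-ins for their output, and `hard_mask` is an illustrative helper rather than part of the Ensembl pipeline.

```python
def hard_mask(sequence, intervals, mask_char="N"):
    """Replace every base inside the half-open intervals with mask_char."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


# Toy sequence with a run of T's standing in for a tandem repeat.
seq = "ACGTACGT" + "T" * 8 + "ACGT"

# Hypothetical repeat calls: interspersed repeats first, tandem repeats second.
interspersed = [(2, 5)]
tandem = [(8, 16)]

masked = hard_mask(hard_mask(seq, interspersed), tandem)
print(seq)
print(masked)
```

Searches would then run on `masked`, while a step such as CpG island prediction, which the text says is exempt from masking, would take `seq`.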
Once the automated analysis is complete, the annotator uses a Perl/Tk-based graphical interface, "otterlace", developed in-house at the Wellcome Trust Sanger Institute to edit annotation data held in a separate MySQL database system. The interface displays a rich, interactive graphical view of the genomic region, showing features such as database matches, gene predictions, and transcripts created by the annotators. Gapped alignments of nucleotide and protein BLAST hits to the genomic sequence are viewed and explored using the "Blixem" alignment viewer.
Additionally, the "Dotter" dot plot tool is used to show the pairwise alignments of unmasked sequence, thus revealing the location of exons that are occasionally missed by the automated BLAST searches because of their small size and/or match to repeat-masked sequence.
The interface provides a number of tools that the annotator uses to build genes and edit annotations: adding transcripts, exon coordinates, translation regions, gene names and descriptions, remarks and polyadenylation signals and sites.
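The idea behind a dot plot can be shown with a short Python sketch. This is a simplified exact word-match dot plot, not Dotter's actual algorithm; the function name, word size, and toy sequences are assumptions made for illustration.

```python
def dot_plot(seq_a, seq_b, word=4):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    # Index every word of seq_b by its start positions.
    positions = {}
    for j in range(len(seq_b) - word + 1):
        positions.setdefault(seq_b[j:j + word], []).append(j)

    # Look up each word of seq_a; each hit is one dot.
    dots = []
    for i in range(len(seq_a) - word + 1):
        for j in positions.get(seq_a[i:i + word], []):
            dots.append((i, j))
    return dots


# Toy example: a short "exon" embedded in unmasked genomic sequence.
genomic = "TTTTGATTACATTTT"
cdna = "GATTACA"

for i, j in dot_plot(genomic, cdna):
    print(f"genomic {i} matches cDNA {j}")
```

Consecutive dots on a diagonal (here `i - j` is constant) mark a collinear match, which is how a small exon shows up visually even when a database search did not report it.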
See Harrow et al., 2006 for information on verification techniques.
This GENCODE release is the result of a collaborative effort among
the following laboratories:
Lab/Institution | Contributors |
---|---|
HAVANA annotation group, Wellcome Trust Sanger Institute, Hinxton, UK | Adam Frankish, Jonathan Mudge, James Gilbert, Tim Hubbard, Jennifer Harrow |
Genome Bioinformatics Lab CRG, Barcelona, Spain | France Denoeud, Julien Lagarde, Sylvain Foissac, Robert Castelo, Roderic Guigó (GENCODE Principal Investigator) |
Department of Genetic Medicine and Development, University of Geneva, Switzerland | Catherine Ucla, Carine Wyss, Caroline Manzano, Colette Rossier, Stylianos E. Antonarakis |
Center for Integrative Genomics, University of Lausanne, Switzerland | Jacqueline Chrast, Charlotte N. Henrichsen, Alexandre Reymond |
Affymetrix, Inc., Santa Clara, CA, USA | Philipp Kapranov, Thomas R. Gingeras |
Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J et al. Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007 Jun;17(6):746-59.
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.
ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007 Jun 14;447(7146):799-816.