Description

The Gencode Gene track shows high-quality manual annotations in the ENCODE regions generated by the GENCODE project. A companion track, Gencode Introns, shows experimental gene structure validations for these annotations.

The gene annotations are colored based on the Havana annotation type. Known and validated transcripts are colored dark green, putative and unconfirmed are light green, pseudogenes are blue, and artifacts are grey. The transcript types are defined in more detail in the accompanying table.

The GENCODE project recommends using the annotations with known and validated transcripts as the reference annotation; that is, the types Known, Novel_CDS, Novel_transcript_gencode_conf, and Putative_gencode_conf, which are colored dark green in the track display.

Type                           Color        Description
Known                          dark green   Known protein coding genes (referenced in Entrez Gene, NCBI)
Novel_CDS                      dark green   Novel protein coding genes annotated by Havana (not referenced in Entrez Gene, NCBI)
Novel_transcript_gencode_conf  dark green   Novel transcripts annotated by Havana (no ORF assigned) with at least one junction validated by RT-PCR
Putative_gencode_conf          dark green   Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) with at least one junction validated by RT-PCR
Novel_transcript               light green  Novel transcripts annotated by Havana (no ORF assigned) not validated by RT-PCR
Putative                       light green  Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) not validated by RT-PCR
TEC                            light green  Single exon objects (supported by multiple ESTs with polyA sites and signals) undergoing experimental validation/extension
Processed_pseudogene           blue         Pseudogenes arising via retrotransposition (exon structure of parent gene lost)
Unprocessed_pseudogene         blue         Pseudogenes arising via gene duplication (exon structure of parent gene retained)
Artifact                       grey         Transcript evidence and/or its translation equivocal
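
The type-to-color scheme and the recommended reference set can be expressed as a small lookup. The following is a minimal Python sketch, not part of the GENCODE or UCSC tooling: the names TYPE_COLORS, REFERENCE_TYPES, color_for and reference_subset, and the example records, are hypothetical and introduced only for illustration. The type names and colors are transcribed from the table.

```python
# Type -> display color, transcribed from the table in this description.
TYPE_COLORS = {
    "Known": "dark green",
    "Novel_CDS": "dark green",
    "Novel_transcript_gencode_conf": "dark green",
    "Putative_gencode_conf": "dark green",
    "Novel_transcript": "light green",
    "Putative": "light green",
    "TEC": "light green",
    "Processed_pseudogene": "blue",
    "Unprocessed_pseudogene": "blue",
    "Artifact": "grey",
}

# Types recommended as the reference annotation (the dark green ones).
REFERENCE_TYPES = {
    "Known",
    "Novel_CDS",
    "Novel_transcript_gencode_conf",
    "Putative_gencode_conf",
}


def color_for(havana_type):
    """Return the display color for a Havana annotation type."""
    return TYPE_COLORS[havana_type]


def reference_subset(records):
    """Keep only (name, type) records whose type is in the recommended set."""
    return [rec for rec in records if rec[1] in REFERENCE_TYPES]


# Hypothetical (name, type) records, for illustration only.
records = [
    ("tx1", "Known"),
    ("tx2", "Putative"),
    ("tx3", "Processed_pseudogene"),
    ("tx4", "Novel_CDS"),
]

if __name__ == "__main__":
    for name, havana_type in reference_subset(records):
        print(name, havana_type, color_for(havana_type))
```

Here the reference set coincides with the types colored dark green, so filtering by color or by type name would select the same records.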

Methods

The Human and Vertebrate Analysis and Annotation manual curation process (HAVANA) was used to produce these annotations.

Finished genomic sequence was analyzed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases, as well as a series of ab initio gene predictions. Nucleotide sequence databases were searched with WUBLASTN, and significant hits were realigned to the unmasked genomic sequence by EST2GENOME. WUBLASTX was used to search the UniProt protein database, and the accession numbers of significant hits were retrieved from the Pfam database. Hidden Markov models for Pfam protein domains were aligned against the genomic sequence using Genewise to provide annotation of protein domains.

A number of ab initio prediction algorithms were also run: Genscan and Fgenesh for genes, tRNAscan to find tRNA genes, and Eponine TSS for transcription start site predictions.

The annotators used the AceDB-based Otterlace interface to create and edit gene objects, which were then stored in a local database named Otter. Where predicted transcript structures from Ensembl are available, they can be viewed from within the Otterlace interface and may be used as starting templates for gene curation. Annotation in the Otter database is submitted to the EMBL/GenBank/DDBJ nucleotide database.

Verification

The gene objects selected for verification came from various computational prediction methods and HAVANA annotations.

RT-PCR and RACE experiments were performed on these gene objects, using a variety of human tissues, to confirm their structure. Human cDNAs from 24 different tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta, skin, peripheral blood leucocytes, bone marrow, fetal brain, fetal liver, fetal kidney, fetal heart, fetal lung, thymus, pancreas, mammary gland, prostate) were synthesized using 12 poly(A)+ RNAs from Origene, eight from Clemente Associates/Quantum Magnetics and four from BD Biosciences as described in [Reymond et al., 2002a,b]. The relative amount of each cDNA was normalized by quantitative PCR using SYBR Green as the intercalating dye and an ABI Prism 7700 Sequence Detection System.

Predicted human gene junctions were assayed experimentally by RT-PCR as previously described and modified [Reymond, 2002b; Mouse Genome Sequencing Consortium, 2002; Guigo, 2003].

Similar amounts of Homo sapiens cDNAs were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/µl primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50°C; the next 30 cycles were carried out at an annealing temperature of 50°C. Amplimers were separated on "Ready to Run" precast gels (Pharmacia) and sequenced. RACE experiments were performed with the BD SMART RACE cDNA Amplification Kit following the manufacturer's instructions (BD Biosciences).
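
The touchdown PCR program described above can be illustrated as a per-cycle list of annealing temperatures. This is a rough Python sketch under a stated assumption: the text gives only the endpoints (60 to 50°C over the first ten cycles, then 30 cycles at 50°C), so a uniform decrement of 1°C per cycle is assumed here and may not match the program actually used. The function name touchdown_schedule and its parameters are hypothetical.

```python
def touchdown_schedule(start_c=60.0, end_c=50.0,
                       touchdown_cycles=10, plateau_cycles=30):
    """Return the annealing temperature (in degrees C) for each PCR cycle.

    ASSUMPTION: the temperature drops by a constant step each touchdown
    cycle, reaching end_c on the first plateau cycle. The source states
    only the start and end temperatures, not the step size.
    """
    step = (start_c - end_c) / touchdown_cycles
    # Touchdown phase: 60, 59, ..., 51 with the default arguments.
    ramp = [start_c - step * i for i in range(touchdown_cycles)]
    # Plateau phase: constant annealing temperature.
    plateau = [end_c] * plateau_cycles
    return ramp + plateau


if __name__ == "__main__":
    schedule = touchdown_schedule()
    for cycle, temp in enumerate(schedule, start=1):
        print(f"cycle {cycle:2d}: anneal at {temp:.1f} C")
```

With the default arguments the schedule covers 40 cycles in total: ten touchdown cycles followed by 30 cycles at the final annealing temperature.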

Credits

A complete list of the people who participated in the GENCODE project is available from the GENCODE project.

References

Ashurst, J.L. et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33 (Database Issue), D459-65 (2005).

Guigo, R. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003).

Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915), 520-62 (2002).

Reymond, A. et al. Human chromosome 21 gene expression atlas in the mouse. Nature 420(6915), 582-6 (2002).

Reymond, A. et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79(6), 824-32 (2002).