Description

This track shows the pseudogenes located in ENCODE regions generated by five different methods—Yale Pipeline, GenCode manual annotation, two different UCSC methods, and Gene Identification Signature (GIS)—as well as a consensus pseudogenes subtrack based on the pseudogenes from all five methods. Datasets are displayed in separate subtracks within the annotation and are individually described below.

The annotations are colored as follows:

Type                     Color        Description
Processed_pseudogene     pink         Pseudogenes arising via retrotransposition (exon structure of parent gene lost)
Unprocessed_pseudogene   blue         Pseudogenes arising via gene duplication (exon structure of parent gene retained)
Pseudogene_fragment      light blue   Pseudogene sequences that are single-exon and cannot be confidently assigned to either the processed or the duplicated category
Undefined                gray


Consensus Pseudogenes

Description

This subtrack shows pseudogenes derived from a consensus of the five methods listed above. In the pseudogene.org data freeze dated 6 Jan. 2006, 201 consensus pseudogenes were found. Here, pseudogenes are defined as genomic sequences that are similar to known genes but exhibit various inactivating disablements (e.g. premature stop codons or frameshifts) in their putative protein-coding regions and are flagged as either recently-processed or non-processed.

Methods

The pseudogene sets were processed as follows:

Verification of the Consensus Pseudogenes

All pseudogenes in the list have been extensively curated by Adam Frankish and Jennifer Harrow at the Wellcome Trust Sanger Institute.

References

More information about this data set is available from pseudogene.org/ENCODE.


Havana-Gencode Annotated Pseudogenes and Immunoglobulin Segments

Description

This track shows pseudogenes annotated by the HAVANA group at the Wellcome Trust Sanger Institute. Pseudogenes have homology to protein sequences but generally have a disrupted CDS. For all annotated pseudogenes, an active homologous gene (the parent) can be identified elsewhere in the genome. Pseudogenes are classified as processed or unprocessed.

Methods

Prior to manual annotation, finished sequence is submitted to an automated analysis pipeline for similarity searches and ab initio gene predictions. The searches are run on a computer farm and the results are stored in an Ensembl MySQL database using the Ensembl analysis pipeline system (Searle et al., 2004; Harrow et al., 2006).

A pseudogene is annotated where the total length of the protein homology to the genomic sequence is >20% of the length of the parent protein or >100 aa in length, whichever is shorter. If a gene structure has an ORF but has lost the structure of the parent gene, a pseudogene is annotated provided there is no evidence of transcription from the pseudogene locus. When an open but truncated reading frame is present, other evidence (for example, a 3' genomic polyA tract) is used to allow classification as a pseudogene. When a parent gene has only a single coding exon (e.g. olfactory receptors), a small 5' or 3' truncation of the CDS at the pseudogene locus (compared to other family members) is sufficient to confirm pseudogene status, provided that the literature and/or the expert community predict that the truncation significantly affects secondary structure.
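The length criterion above can be sketched in a few lines of Python. This is only an illustration under one reading of the rule, namely that the homology must exceed the smaller of the two thresholds (20% of the parent protein length, or 100 aa). The function names are hypothetical and are not part of the HAVANA pipeline.

```python
def min_homology_length(parent_len_aa: int) -> float:
    """Return the smaller of the two thresholds: 20% of the parent length or 100 aa.

    Assumption: "whichever is shorter" means the lower threshold applies.
    """
    return min(0.20 * parent_len_aa, 100.0)


def meets_length_criterion(homology_len_aa: int, parent_len_aa: int) -> bool:
    """True if the total protein homology length exceeds the applicable threshold."""
    return homology_len_aa > min_homology_length(parent_len_aa)


# Short parent protein (200 aa): the 20% rule applies, so the threshold is 40 aa.
short_parent_ok = meets_length_criterion(50, 200)

# Long parent protein (1000 aa): the 100 aa cap applies.
long_parent_too_short = meets_length_criterion(90, 1000)
long_parent_ok = meets_length_criterion(101, 1000)
```

For parent proteins longer than 500 aa the 100 aa cap is the operative threshold; for shorter parents the 20% rule is. The remaining criteria (ORF status, transcription evidence, polyA tracts) rely on manual judgment and are not modeled here.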

Processed and unprocessed pseudogenes are distinguished on the basis of structure and genomic context. Processed pseudogenes, which arise via retrotransposition, lose the intron-exon structure of the parent gene, often have an A-rich tract indicative of the insertion site at their 3' end, and are flanked by genomic sequence that differs from that flanking the parent gene. Unprocessed pseudogenes, which arise via gene duplication, share both the intron-exon structure and the flanking genomic sequence with the parent gene. Transcribed pseudogenes are indicated by the annotation of a pseudogene and a transcript variant alongside each other.

References

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, et al. GENCODE: Producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Searle SM, Gilbert J, Iyer V, Clamp M. The otter annotation system. Genome Res. 2004 May;14(5):963-70.


Yale Pseudogenes

Description

This subtrack shows pseudogenes in the ENCODE regions identified by the Yale Pseudogene Pipeline. In this analysis, pseudogenes are defined as genomic sequences that are similar to known genes with various inactivating disablements (e.g. premature stop codons or frameshifts) in their putative protein-coding regions. Pseudogenes are flagged as recently processed, recently duplicated, or of uncertain origin (either ancient fragments or resulting from a single-exon parent).

Methods

Verification of Yale Pseudogenes

All pseudogenes in the list have been manually checked.

References

Zhang Z, Harrison PM, Liu Y, Gerstein M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 2003 Dec;13(12):2541-58.

Zheng D, Zhang Z, Harrison PM, Karro J, Carriero N, Gerstein M. Integrated pseudogene annotation for human chromosome 22: evidence for transcription. J Mol Biol. 2005 May 27;349(1):27-45.


UCSC Retrogene Predictions

Description

The Retrogene subtrack shows processed mRNAs that have been inserted back into the genome since the mouse/human split. Retrogenes can be functional genes that have acquired a promoter from a neighboring gene, non-functional pseudogenes, or transcribed pseudogenes.

Methods

The "type" field has four possible values:

These features can be downloaded from the table pseudoGeneLink in many formats using the Table Browser option on the menu bar.

References

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA. 2003 Sep 30;100(20):11484-9.

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7.


UCSC Pseudogene Predictions

Methods


GIS-PET Pseudogene Predictions

Description

This subtrack shows retrotransposed pseudogenes predicted by multiple mapped GIS-PETs (gene identification signature paired-end ditags) collected from two different cancer cell lines, HCT116 and MCF7. A total of 49 non-redundant processed pseudogenes predicted in the ENCODE regions are presented in this dataset. Each pseudogene is labeled with an ID of the format AAA-GISPgene-XX, where "AAA" indicates the parental gene name, "GISPgene" denotes a GIS pseudogene, and "XX" is the unique ID for each pseudogene.
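As an illustration of the ID format, the following Python sketch splits a label into its parental gene name and unique ID. The function name and the example IDs are made up for demonstration and are not taken from the dataset. The sketch also assumes that a parental gene name may itself contain hyphens while the unique ID does not.

```python
import re

# Greedy match on the parent name, so hyphenated gene names stay intact.
_GIS_ID = re.compile(r"(?P<parent>.+)-GISPgene-(?P<uid>[^-]+)")


def parse_gis_id(label: str) -> tuple[str, str]:
    """Split an 'AAA-GISPgene-XX' label into (parental gene name, unique ID)."""
    match = _GIS_ID.fullmatch(label)
    if match is None:
        raise ValueError(f"not a GIS pseudogene ID: {label!r}")
    return match.group("parent"), match.group("uid")


# Hypothetical labels, for illustration only.
simple = parse_gis_id("RPL21-GISPgene-12")
hyphenated = parse_gis_id("HLA-A-GISPgene-3")
```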

Methods

PETs were generated from full-length transcripts and computationally mapped onto the human genome to demarcate the transcript start and end positions. PETs that mapped to multiple genome locations were grouped into PET-based gene families, each comprising a parent gene and its pseudogenes. A representative member (the shortest PET as defined by genomic coordinates) was selected from each family. This representative PET was aligned to the hg17 genome in order to identify all putative pseudogenes at the whole-genome level. All hits with an identity >=70% and coverage >=50% within ENCODE regions were reported. In this context, "coverage" refers to alignment coverage of the query sequence, i.e. a measure of how complete the predicted pseudogene is relative to the query sequence.
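The reporting thresholds can be sketched as a simple filter. This is a minimal Python illustration, not the GIS group's code: the `Hit` record, its field names, and the example values are hypothetical, and coverage is assumed to be the fraction of query bases included in the alignment. The restriction to ENCODE regions is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a representative PET to the genome (hypothetical record)."""
    name: str
    identity_pct: float          # percent identity over the aligned portion
    aligned_query_bases: int     # query bases included in the alignment
    query_len: int               # total length of the query sequence

    @property
    def coverage_pct(self) -> float:
        # Assumed definition: share of the query covered by the alignment.
        return 100.0 * self.aligned_query_bases / self.query_len


def passes_thresholds(hit: Hit, min_identity: float = 70.0,
                      min_coverage: float = 50.0) -> bool:
    """Keep hits with identity >= 70% and query coverage >= 50%."""
    return hit.identity_pct >= min_identity and hit.coverage_pct >= min_coverage


# Made-up hits, for illustration only.
hits = [
    Hit("hit1", 92.0, 800, 1000),   # passes both thresholds
    Hit("hit2", 65.0, 900, 1000),   # identity too low
    Hit("hit3", 70.0, 500, 1000),   # exactly on both thresholds, kept
    Hit("hit4", 85.0, 400, 1000),   # coverage too low
]
kept = [h.name for h in hits if passes_thresholds(h)]
```

Both comparisons are inclusive, matching the ">=" thresholds stated above.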

Verification of GIS-PET Pseudogene Predictions

Pseudogenes were verified by manual examination.

Credits

These data were generated by the ENCODE Pseudogene Annotation group: Jennifer Harrow, Wei Chia-Lin, Siew Woh Choo, Adam Frankish, Robert Baertsch, France Denoeud, Deyou Zheng, Yontao Lu, Alexandre Reymond, Roderic Guigo Serra, Tom Gingeras, Suganthi Balasubramanian and Mark Gerstein.