Description

This track shows the pseudogenes located in ENCODE regions generated by five different methods—Yale Pipeline, GenCode manual annotation, two different UCSC methods, and Gene Identification Signature (GIS)—as well as a consensus pseudogenes subtrack based on the pseudogenes from all five methods. Datasets are displayed in separate subtracks within the annotation and are individually described below.

The annotations are colored as follows:

Type	Color	Description
Processed_pseudogene	pink	Pseudogenes arising via retrotransposition (exon structure of parent gene lost)
Unprocessed_pseudogene	blue	Pseudogenes arising via gene duplication (exon structure of parent gene retained)
Pseudogene_fragment	light blue	Pseudogene sequences that are single-exon and cannot be confidently assigned to either the processed or the duplicated category
Undefined	gray

Consensus Pseudogenes

Description

This subtrack shows pseudogenes derived from a consensus of the five methods listed above. In the pseudogene.org data freeze dated 6 Jan. 2006, 201 consensus pseudogenes were found. Here, pseudogenes are defined as genomic sequences that are similar to known genes but exhibit various inactivating disablements (e.g. premature stop codons or frameshifts) in their putative protein-coding regions and are flagged as either recently-processed or non-processed.

Methods

The pseudogene sets were processed as follows:

Step I: The four data sets were filtered to remove pseudogenes that overlap with current Gencode coding exons/loci. Pseudogenes overlapping with introns or noncoding genes were kept. Subsequent filtering of pseudogene sets, excluding the Havana set, removed pseudogenes overlapping with exons of UCSC Known Genes.
Step II: A union of the pseudogenes from each filtered set was created. If a pseudogenic region was annotated by more than one group, the lowest starting coordinate and highest ending coordinate were used as the boundaries.
Step III: A parent protein for each pseudogene in the union was assigned using a protein set from UniProt. Pseudogenes without a matching protein were excluded.
Step IV: Each pseudogene was realigned to its parent protein.
Step V: The consensus list of pseudogenes was updated with boundaries derived from the alignment in Step IV.
Step VI: The consensus list of pseudogenes was updated with the assigned parent proteins and new classifications (processed or non-processed).
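
The union in Step II can be sketched as an interval merge. This is a minimal illustration rather than the group's actual code: the function name, the (chrom, start, end) tuple layout, and the half-open coordinate convention are assumptions made for the example.

```python
from collections import defaultdict

def merge_annotations(annotations):
    """Merge overlapping (chrom, start, end) intervals pooled from all sets.

    Each merged region spans the lowest start and the highest end of the
    annotations that overlap it, as described in Step II.
    Coordinates are assumed to be half-open (start inclusive, end exclusive).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in annotations:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start < current_end:
                # Overlaps the region being built: extend its end if needed.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        if current_start is not None:
            merged.append((chrom, current_start, current_end))
    return merged
```

Sorting by start coordinate means a single pass per chromosome is enough, because any interval that overlaps the region being built must start before that region ends.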

Verification of the Consensus Pseudogenes

All pseudogenes in the list have been extensively curated by Adam Frankish and Jennifer Harrow at the Wellcome Trust Sanger Institute.

References

More information about this data set is available from pseudogene.org/ENCODE.

Havana-Gencode Annotated Pseudogenes and Immunoglobulin Segments

Description

This track shows pseudogenes annotated by the HAVANA group at the Wellcome Trust Sanger Institute. Pseudogenes have homology to protein sequences but generally have a disrupted CDS. For all annotated pseudogenes, an active homologous gene (the parent) can be identified elsewhere in the genome. Pseudogenes are classified as processed or unprocessed.

Methods

Prior to manual annotation, finished sequence is submitted to an automated analysis pipeline for similarity searches and ab initio gene predictions. The searches are run on a computer farm and stored in an Ensembl MySQL database using the Ensembl analysis pipeline system (Searle et al., 2004; Harrow et al., 2006).

A pseudogene is annotated where the total length of the protein homology to the genomic sequence is >20% of the length of the parent protein or >100 aa in length, whichever is shorter. If a gene structure has an ORF but has lost the structure of the parent gene, a pseudogene is annotated provided there is no evidence of transcription from the pseudogene locus. When an open but truncated reading frame is present, other evidence is used (for example, 3' genomic polyA tract) to allow classification as a pseudogene. When a parent gene has only a single coding exon (e.g. olfactory receptors), a small 5' or 3' truncation to the CDS at the pseudogene locus (compared to other family members) is sufficient to confirm pseudogene status where the truncation is predicted to significantly affect secondary structure by the literature and/or expert community.
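
The length criterion in the first sentence above can be written as a short check. This sketch covers only that criterion, not the ORF, transcription, or truncation rules, which rely on manual judgment; the function and parameter names are illustrative assumptions.

```python
def meets_length_criterion(homology_aa, parent_aa):
    """Return True if the protein homology is long enough to annotate.

    The threshold is the smaller of 20% of the parent protein length
    and 100 amino acids; the homology must exceed it.
    """
    threshold = min(0.20 * parent_aa, 100)
    return homology_aa > threshold
```

For a 200 aa parent the threshold is 40 aa, while for any parent longer than 500 aa the fixed 100 aa limit applies.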

Processed and unprocessed pseudogenes are distinguished on the basis of structure and genomic context. Processed pseudogenes, which arise via retrotransposition, lose the intron-exon structure of the parent gene, often have an A-rich tract indicative of the insertion site at their 3' end, and are flanked by different genomic sequence to the parent gene. Unprocessed pseudogenes, which arise via gene duplication, share both the intron-exon structure and flanking genomic sequence with the parent gene. Transcribed pseudogenes are indicated by the annotation of a pseudogene and transcript variant alongside each other.

References

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, et al. GENCODE: Producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Searle SM, Gilbert J, Iyer V, Clamp M. The otter annotation system. Genome Res. 2004 May;14(5):963-70.

Yale Pseudogenes

Description

This subtrack shows pseudogenes in the ENCODE regions identified by the Yale Pseudogene Pipeline. In this analysis, pseudogenes are defined as genomic sequences that are similar to known genes with various inactivating disablements (e.g. premature stop codons or frameshifts) in their putative protein-coding regions. Pseudogenes are flagged as recently processed, recently duplicated, or of uncertain origin (either ancient fragments or resulting from a single-exon parent).

Methods

Step I: Repeat-masked human genome sequence was used as the target for a six-frame TBLASTN where the query was the nonredundant human proteome set (European Bioinformatics Institute). Only high-quality human protein sequences from SWISS-PROT and TrEMBL were used, because this set included processed or duplicated pseudogenes.
Step II: BLAST hits that had a significant overlap with annotated multiple-exon Ensembl genes were removed from consideration.
Step III: The set of BLAST hits was reduced by selecting hits in decreasing significance level and removing matches that overlapped by more than 10 amino acids or 30 bp with a picked match.
Step IV: Adjacent matches on a chromosome were merged together if they were thought to belong to the same pseudogene locus. Merged matches were extended on both sides to include the length of the query protein to which they matched along with an extra 30 bp buffer on either side.
Step V: The FASTA program was used to re-align these extended hits to the genome. Redundant hits were removed and hits with gaps greater than 60 bp were split into two alignments.
Step VI: Alignments with possible artifactual frameshifts or stop codons introduced by the alignment process were closely inspected.
Step VII: False positives (E-value greater than 10^-10 or amino acid sequence identity of less than 40%) and sequences matching protein queries containing repeats or low-complexity regions were removed. Potential functional genes were also removed. These were defined as having no frameshift disruptions, less than 95% sequence identity to the query protein, and translatable to a protein sequence longer than 95% of the length of the query protein.
Step VIII: The remaining putative pseudogene sequences were classified based on several criteria. The intron-exon structure of the functional gene was further used to infer whether a pseudogene was recently duplicated or processed. A duplicated pseudogene retains the intron-exon structure of its parent functional gene, whereas a processed pseudogene shows evidence that this structure has been spliced out. Those sequences where the insertions were 50% or more repeats (as detected by RepeatMasker) are "Disrupted" processed pseudogenes. Small pseudogene sequences that cannot be confidently assigned to either the processed or duplicated category may be ancient fragments. Further details can be found in the references below.
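
The hit reduction in Step III is a greedy selection, which can be sketched as follows. This is an illustration under stated assumptions, not the Yale pipeline's code: hits are represented as dictionaries with hypothetical keys (name, chrom, start, end, evalue), overlap is measured on genomic coordinates in bp, and a lower E-value is treated as more significant.

```python
def select_hits(hits, max_overlap_bp=30):
    """Greedily keep hits in decreasing significance (increasing E-value).

    A hit is discarded if it overlaps an already picked hit on the same
    chromosome by more than max_overlap_bp bases.
    """
    picked = []
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        conflict = any(
            hit["chrom"] == p["chrom"]
            and min(hit["end"], p["end"]) - max(hit["start"], p["start"]) > max_overlap_bp
            for p in picked
        )
        if not conflict:
            picked.append(hit)
    return picked
```

Because the most significant hit is always picked first, a weaker hit can never displace a stronger one that covers the same region.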

Verification of Yale Pseudogenes

All pseudogenes in the list have been manually checked.

References

Zhang Z, Harrison PM, Liu Y, Gerstein M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 2003 Dec;13(12):2541-58.

Zheng D, Zhang Z, Harrison PM, Karro J, Carriero N, Gerstein M. Integrated pseudogene annotation for human chromosome 22: evidence for transcription. J Mol Biol. 2005 May 27;349(1):27-45.

UCSC Retrogene Predictions

Description

The Retrogene subtrack shows processed mRNAs that have been inserted back into the genome since the mouse/human split. Retrogenes can be functional genes that have acquired a promoter from a neighboring gene, non-functional pseudogenes, or transcribed pseudogenes.

Methods

Step I: All GenBank mRNAs for a particular species were aligned to the genome using blastz.
Step II: mRNAs that aligned twice in the genome (once with introns and once without introns) were initially screened.
Step III: A series of features were scored to determine candidates for retrotransposition events. These features included position and length of the polyA tail, degree of synteny with mouse, coverage of repetitive elements, number of exons that can still be aligned to the retrogene, and degree of divergence from the parent gene. Retrogenes are classified using a threshold score function that is a linear combination of this set of features. Retrogenes in the final set have a score greater than 425, a threshold chosen from a ROC plot against the Vega annotated pseudogenes.
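
The classification in Step III has the general shape of a weighted sum compared against a cutoff. In the sketch below only the threshold of 425 comes from the text; the feature names and weights are hypothetical placeholders chosen for illustration and are not the values used to build this subtrack.

```python
SCORE_THRESHOLD = 425  # cutoff stated in the text

# Hypothetical weights, for illustration only; the real weights are not
# given in this description.
EXAMPLE_WEIGHTS = {
    "polya_length": 2,
    "synteny_break": 300,
    "exons_aligned": 50,
    "repeat_coverage": -100,
}

def retrogene_score(features, weights):
    """Linear combination of feature values and their weights."""
    return sum(weights[name] * features[name] for name in weights)

def is_retrogene(features, weights, threshold=SCORE_THRESHOLD):
    """A candidate is kept only if its score exceeds the threshold."""
    return retrogene_score(features, weights) > threshold
```

With these placeholder weights, a candidate with a 20 bp polyA tail, a synteny break, three aligned exons, and 50% repeat coverage scores 440 and would be kept.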

The "type" field has four possible values:

singleExon: the parent gene is a single exon gene
mrna: the parent gene is a spliced mRNA that has no annotation in NCBI refSeq, UCSC knownGene or Mammalian Gene Collection (MGC)
annotated: the parent gene has been annotated by one of refSeq, knownGene or MGC
expressed: an mRNA overlaps the retrogene, indicating probable transcription

These features can be downloaded from the table pseudoGeneLink in many formats using the Table Browser option on the menubar.

References

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA. 2003 Sep 30;100(20):11484-9.

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7.

UCSC Pseudogene Predictions

Methods

Step I: A set of pre-aligned human known genes was mapped across the human genome through the human Blastz Self Alignment using HomoMap (homologous mapping method). The fragments identified by HomoMap are homologs of genes from the Known Genes set.
Step II: Each homologous fragment was compared with its known reference gene and a set of features was then collected. The features included sequence identity, Ka/Ks ratio (nonsynonymous substitution per codon vs. synonymous substitution per codon), splice sites, and the number of premature stop codons. These homologous fragments are either genes or pseudogenes.
Step III: Homologous fragments that overlapped known reference genes were labeled as positive samples; those overlapping known pseudogenes were labeled as negative samples.
Step IV: These positive and negative sets were used to train support vector machines (SVMs) to separate coding fragments from pseudo fragments. The trained SVMs were used to classify all homologous fragments into potential coding elements or potential pseudo elements.
Step V: Finally, a heuristic filter was used to correct some misclassified fragments and to generate the final potential pseudogene set.

GIS-PET Pseudogene Predictions

Description

This subtrack shows retrotransposed pseudogenes predicted by multiple mapped GIS-PETs (gene identification signature paired-end ditags) collected from two different cancer cell lines, HCT116 and MCF7. A total of 49 non-redundant processed pseudogenes predicted in the ENCODE regions are presented in this dataset. Each pseudogene is labeled with an ID of the format AAA-GISPgene-XX, where "AAA" indicates the parental gene name, "GISPgene" is the GIS pseudogene, and "XX" is the unique ID for each pseudogene.
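
An ID in the AAA-GISPgene-XX format can be split into its parts as sketched below. The function name is an assumption, and the example assumes that the unique ID contains no hyphen; splitting from the right keeps parental gene names that themselves contain hyphens intact.

```python
def parse_gis_id(identifier):
    """Split an AAA-GISPgene-XX label into (parent_gene, unique_id)."""
    parts = identifier.rsplit("-", 2)
    if len(parts) != 3 or parts[1] != "GISPgene":
        raise ValueError("not a GIS pseudogene ID: " + identifier)
    parent_gene, _, unique_id = parts
    return parent_gene, unique_id
```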

Methods

PETs were generated from full-length transcripts and computationally mapped onto the human genome to demarcate the transcript start and end positions. The PETs that mapped to multiple genome locations were grouped into PET-based gene families that include parent gene and pseudogenes. A representative member, the shortest PET as defined by genomic coordinates, was selected from each family. This representative PET was aligned to the hg17 genome in order to identify all the putative pseudogenes at the whole genome level. All hits with an identity >=70% and coverage >=50% within ENCODE regions were reported. In this context, "coverage" refers to alignment coverage of the query sequence, i.e. a measure of how complete the predicted pseudogene is relative to the query sequence.
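
The representative selection and the reporting thresholds above can be sketched as follows. The function names and tuple layout are assumptions for illustration. Coverage follows the definition in the text (aligned query bases relative to query length); identity is assumed here to be matching bases over aligned query bases, which the text does not spell out.

```python
def representative_pet(family):
    """Pick the PET with the shortest genomic span from a family.

    family: list of (chrom, start, end) tuples.
    """
    return min(family, key=lambda pet: pet[2] - pet[1])

def passes_thresholds(matches, aligned_query_bases, query_length,
                      min_identity=70.0, min_coverage=50.0):
    """Apply the identity >=70% and coverage >=50% reporting cutoffs."""
    identity = 100.0 * matches / aligned_query_bases
    coverage = 100.0 * aligned_query_bases / query_length
    return identity >= min_identity and coverage >= min_coverage
```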

Verification of GIS-PET Pseudogene Predictions

Pseudogenes were verified by manual examination.

Credits

These data were generated by the ENCODE Pseudogene Annotation group: Jennifer Harrow, Wei Chia-Lin, Siew Woh Choo, Adam Frankish, Robert Baertsch, France Denoeud, Deyou Zheng, Yontao Lu, Alexandre Reymond, Roderic Guigo Serra, Tom Gingeras, Suganthi Balasubramanian and Mark Gerstein.