Description

The potential pseudo fragment track shows genomic fragments that are likely to be pseudogenes. Compared with its homologous known gene in the same genome, a potential pseudo fragment has pseudogene-like features such as frameshifts, premature stop codons, and non-conserved splice sites. The structure of a potential pseudo fragment shown in the browser is based on the gene structure of its homologous known gene.

Methods

A set of known human genes, already aligned to the genome, was mapped across the human genome through the Human Blastz Self Alignment. This approach, called the homologous mapping method, identifies fragments that are homologs of known genes. We used known genes from the Consensus CDS (CCDS), RefSeq, and MGC gene sets, which are currently the most reliable collections of genes. The genomic locations of these genes are available in the UCSC Genome Browser.
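The mapping step can be pictured as projecting each exon of a known gene through the self-alignment onto its homologous region. The sketch below is only an illustration of that idea, not the track's actual implementation: the `Block` layout, the `map_interval` helper, and the coordinates are hypothetical, and it assumes ungapped, same-strand alignment blocks with 0-based half-open coordinates.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """One ungapped, same-strand alignment block (hypothetical layout)."""
    src_start: int  # 0-based start on the known-gene side
    dst_start: int  # 0-based start on the homologous side
    size: int       # block length in bases


def map_interval(start: int, end: int, blocks: List[Block]) -> List[Tuple[int, int]]:
    """Project the half-open interval [start, end) through the blocks.

    Returns the pieces of the interval that are covered by an alignment
    block, translated into coordinates on the homologous side.
    """
    mapped = []
    for b in blocks:
        # Clip the interval to the part of it that this block covers.
        lo = max(start, b.src_start)
        hi = min(end, b.src_start + b.size)
        if lo < hi:
            offset = b.dst_start - b.src_start
            mapped.append((lo + offset, hi + offset))
    return mapped


# Made-up example: one exon spanning two alignment blocks.
blocks = [Block(100, 5000, 50), Block(160, 5055, 40)]
exon_pieces = map_interval(120, 180, blocks)
print(exon_pieces)
```

Bases of the exon that fall between blocks (here, positions 150 to 159) have no counterpart in the homologous fragment, so they are simply absent from the result.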

Each homologous fragment belongs to either a real gene or a pseudogene. To decide which, we compared each fragment with its known gene and collected a set of features, such as sequence identity, dN/dS ratio, splice sites, and number of premature stop codons. We took homologous fragments overlapping known genes as positive samples and those overlapping known pseudogenes as negative samples, then used these samples to train Support Vector Machines (SVMs) to separate coding fragments from pseudo fragments. The trained SVMs were used to classify homologous fragments as potential coding elements or potential pseudo elements. Finally, a heuristic filter was applied to correct some misclassified fragments and to determine whether each homologous fragment was likely to be a pseudogene.
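As a rough illustration of the kind of features described above, the following sketch computes sequence identity, a frameshift flag, and a premature stop codon count from a pairwise alignment of a known gene's coding sequence and its homologous fragment. The function names, the gapped-string input format, and the example sequences are assumptions made for this sketch; the features actually used by the track and the way they were computed may differ. The dN/dS ratio, splice-site comparison, SVM training, and heuristic filter are not shown.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def sequence_identity(gene_aln: str, frag_aln: str) -> float:
    """Fraction of aligned (non-gap) columns with identical bases."""
    if len(gene_aln) != len(frag_aln):
        raise ValueError("aligned rows must have the same length")
    pairs = [
        (g, f)
        for g, f in zip(gene_aln.upper(), frag_aln.upper())
        if g != "-" and f != "-"
    ]
    if not pairs:
        return 0.0
    return sum(g == f for g, f in pairs) / len(pairs)


def has_frameshift(gene_aln: str, frag_aln: str) -> bool:
    """True if any gap run in either row has a length not divisible by 3."""
    for row in (gene_aln, frag_aln):
        for run in re.finditer(r"-+", row):
            if len(run.group()) % 3 != 0:
                return True
    return False


def count_premature_stops(frag_cds: str) -> int:
    """Count in-frame stop codons before the last complete codon."""
    seq = frag_cds.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(c in STOP_CODONS for c in codons[:-1])


# Made-up example: the fragment has one substitution and a 1-base deletion.
gene_aln = "ATGGCCTGGAAATAA"
frag_aln = "ATGGCCTGAAA-TAA"

features = {
    "identity": sequence_identity(gene_aln, frag_aln),
    "frameshift": has_frameshift(gene_aln, frag_aln),
    "premature_stops": count_premature_stops(frag_aln.replace("-", "")),
}
print(features)
```

A feature vector of this kind, computed for fragments overlapping known genes and known pseudogenes, is the sort of input a classifier such as an SVM could be trained on.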

References

Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003).

Pruitt, K.D., Tatusova, T., and Maglott, D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucl. Acids Res. 33(1), D501-D504 (2005).

Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-107 (2003).

Credits

The homologous mapping method and this browser track were developed by Yontao Lu at UCSC.