Description

The ENCODE project has revealed the functional elements of segments of the human genome in unprecedented detail. However, clearly distinguishing transcripts destined for translation into protein from those that serve purely regulatory roles remains difficult. The standard approach is to measure the proteins, if any, produced from transcripts via mass spectrometry-based proteogenomic mapping. In this process, chromatographically fractionated peptides are fed into a tandem mass spectrometer (MS/MS). The series of fragment masses produced in MS/MS creates a signature that can then be used to identify the peptide from a protein or DNA sequence database. For proteogenomic mapping, this identifying spectrum is mapped directly back to its most likely encoding locus on a genome sequence (Giddings, et al. 2003). This allows direct verification of protein-encoding transcripts.

The proteogenomic track displays mass spectrometry data that have been matched to the genomic sequence for selected cell lines, using a workflow and software specifically designed for this purpose.

The proteogenomic tracks can be used to identify which parts of the genome are translated into proteins, to verify which transcripts discovered by ENCODE are protein-encoding, and to reveal new genes and/or splice variants of genes. Of particular interest may be their ability to reveal the translation of small open reading frames (ORFs), antisense transcripts, or sites annotated as introns that encode proteins.

Display Conventions and Configuration

The display for this track shows peptide mappings as contiguous, rectangular items. These items are rendered in grayscale according to the score, with darker items representing higher-confidence peptide mappings. The name of each item is the amino acid sequence of the peptide. If a period (.) appears at the end of a name, it signifies a stop codon.
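The naming and shading conventions above can be illustrated with a minimal sketch. It assumes scores follow the usual BED range of 0 to 1000 and that shading is linear; the function names are hypothetical, and this is not the browser's actual rendering code.

```python
def parse_item_name(name: str) -> tuple[str, bool]:
    """Split a track item name into (peptide sequence, ends_at_stop_codon).

    A trailing period in the name signifies a stop codon.
    """
    if name.endswith("."):
        return name[:-1], True
    return name, False


def gray_level(score: float, max_score: float = 1000.0) -> int:
    """Map a score to an 8-bit gray value (0 = black, 255 = white).

    Assumption: scores lie in 0..max_score and shading is linear, so that
    higher-confidence (higher-scoring) items come out darker.
    """
    clamped = max(0.0, min(float(score), max_score))
    return round(255 * (1 - clamped / max_score))


if __name__ == "__main__":
    peptide, has_stop = parse_item_name("MAGKTELR.")
    print(peptide, has_stop, gray_level(900))
```

Under these assumptions, an item with a higher score always receives a gray value at least as dark as one with a lower score.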

In addition to the displayed genomic coordinates, several additional fields are available for each track item.

Methods

ENCODE cell lines K562 and GM12878 were used for large-scale proteomic analysis. Cell lines were cultured according to standard ENCODE cell culture protocols, and in-gel digestion was completed according to the standard protocol (Shevchenko, et al. 2006).

The proteolytic enzyme trypsin was used to digest the proteins into short peptides suitable for MS/MS analysis. Trypsin is a common protease that typically cleaves proteins after arginine (Arg) or lysine (Lys). The metadata parameter enzyme specifies the protease used for digestion. Because digestion is not fully efficient, trypsin does not cleave at every Arg or Lys, so some peptides contain an uncleaved Arg/Lys site. The number of such missed cleavages allowed in the search is given by the metadata parameter miscleavages. Tandem mass spectrometry (RPLC-MS/MS) analysis was then performed on an Eksigent Ultra-LTQ Orbitrap system.
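To illustrate how the number of allowed missed cleavages affects the set of candidate peptides, the following is a minimal in silico digestion sketch. It assumes the simplified rule "cleave after every K or R" and ignores other sequence-context effects; the function name and example sequence are hypothetical and are not taken from the GFS or Peppy software.

```python
def tryptic_peptides(protein: str, missed_cleavages: int = 2) -> list[str]:
    """Return candidate tryptic peptides with up to `missed_cleavages`
    uncleaved Arg/Lys sites (simplified rule: cleave after every K or R)."""
    # Step 1: fully digest the sequence into pieces ending in K or R.
    pieces: list[str] = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece without K/R

    # Step 2: join adjacent pieces to model missed cleavages.
    peptides: list[str] = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


if __name__ == "__main__":
    for allowed in (0, 1, 2):
        print(allowed, tryptic_peptides("MAGKTELRSW", allowed))
```

With zero missed cleavages the example sequence yields only the fully digested pieces; each additional allowed missed cleavage adds the longer peptides that span one more uncleaved site.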

We performed proteogenomic mapping (Jaffe, et al. 2004) against the whole human genome sequence (UCSC hg19), allowing two missed cleavages, using the genome fingerprint scanning (GFS) program (Giddings, et al. 2003) and the newly developed Peppy (http://www.peppyresearch.com/). We used HMM_Score (Khatun, et al. 2008) to match MS/MS spectra to their corresponding genome sequences. An E-value was calculated for each match; it estimates the number of results at the given score level that would be expected by random chance. We then empirically derived the false discovery rate (FDR) for a given E-value using a decoy database search. Only matches within the 5% FDR (E-value < 0.01) are included in the track. Results at 10% FDR (E-value < 0.05) are available on the Downloads page as Raw Signal.
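The idea of deriving an FDR from a decoy search can be sketched as follows. This is a minimal illustration, assuming the common estimator FDR = (decoy matches at or below a threshold) / (target matches at or below that threshold); the exact estimator used for this track is not stated here, and the function names and E-values below are made up for the example.

```python
def fdr_at_threshold(target_evalues, decoy_evalues, threshold):
    """Estimate the FDR among target matches with E-value <= threshold.

    Assumed estimator: decoy count divided by target count at the threshold.
    """
    n_target = sum(e <= threshold for e in target_evalues)
    n_decoy = sum(e <= threshold for e in decoy_evalues)
    return n_decoy / n_target if n_target else 0.0


def loosest_threshold(target_evalues, decoy_evalues, max_fdr):
    """Return the largest observed target E-value whose estimated FDR
    stays at or below max_fdr, or None if no threshold qualifies."""
    best = None
    for candidate in sorted(set(target_evalues)):
        if fdr_at_threshold(target_evalues, decoy_evalues, candidate) <= max_fdr:
            best = candidate
    return best


if __name__ == "__main__":
    # Illustrative E-values only; not data from this track.
    targets = [0.001, 0.002, 0.005, 0.008, 0.02, 0.04]
    decoys = [0.03, 0.2]
    print(loosest_threshold(targets, decoys, max_fdr=0.05))
```

In practice the threshold found this way is what ties an E-value cutoff to a stated FDR level, as with the 5% and 10% levels quoted above.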

Release Notes

This is Release 2 (July 2012). It contains a total of seven Proteogenomics experiments, plus one additional experiment available by download only. Unlike other ENCODE data, these data are archived not at GEO but at Proteome Commons. The first 32 digits of the Tranche Hash for each data set are stored as the labExpId.

Credits

Proteogenomic mapping: Dr. Jainab Khatun, Brian Risk, Mustaque Ahamed, Christopher Maier, Dr. John Wrobel and Dennis Crenshaw (Giddings Lab).

Proteomic analysis: Drs. Yanbao Yu and Ling Xie (Chen Lab).

Main Contact: Jainab Khatun

References

Giddings MC, Shah AA, Gesteland R, Moore B. Genome-based peptide fingerprint scanning. Proc Natl Acad Sci U S A. 2003 Jan 7;100(1):20-5.

Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004 Jan;4(1):59-77.

Khatun J, Hamlett E, Giddings MC. Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics. 2008 Mar 1;24(5):674-81.

Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann M. In-gel digestion for mass spectrometric characterization of proteins and proteomes. Nat Protoc. 2006;1(6):2856-60.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.