Description

The ENCODE project has revealed the functional elements of segments of the human genome in unprecedented detail. However, the ability to distinguish between transcripts designated for translation into protein and those that serve purely regulatory roles remains elusive. A standard means to determine if translation is occurring is to measure protein produced by transcripts via mass spectrometry-based proteogenomic mapping. In this process, proteins are digested to peptides using a protease such as trypsin, and these peptides are chromatographically fractionated and fed into a tandem mass spectrometer (MS/MS). This process creates a signature series of fragment masses which can be scanned against the theoretical translation and proteolytic digest of an entire genome to identify the genomic origins of sample proteins (Giddings et al., 2003).

This proteogenomic track displays mass spectrometry data that have been matched to genomic sequences for selected cell lines, using a workflow and software specifically designed for this purpose. The track can be used to identify which parts of the genome are translated into proteins, to verify which transcripts discovered by other ENCODE experiments are protein-coding, and to reveal new genes, splice variants, and proteins with post-translational modifications (PTMs). Of particular interest is the possibility of uncovering the translation of small open reading frames (ORFs), antisense transcripts, or protein-coding regions that were previously annotated as introns.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Color differences among the views are arbitrary. They provide a visual cue for distinguishing between the different cell types and compartments. Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.

This track shows peptide mappings as contiguous rectangular items rendered in grayscale according to their score, with darker items representing higher-confidence peptide mappings. The name of each item is the amino acid sequence of the peptide; a period (.) at the end of a name signifies a stop codon.

Peptide Genome and GENCODE Mapping(Filtered): Peptide mapping results based on hg19 and GENCODE annotation for mass-spectrometry-based proteomics experiments, filtered to a false discovery rate (FDR) below 5%. Specific field descriptions can be found below.
Modified Peptide Genome and GENCODE Mapping(Filtered): Modified peptide mapping results based on hg19 and GENCODE annotation for mass-spectrometry-based proteomics experiments, filtered to a false discovery rate (FDR) below 5%.

Unfiltered views are available on the Downloads page.

Fields specific to Proteogenomic tracks include:

The Item name is the peptide sequence. For peptides with post-translational modifications (PTMs), a number representing the integer portion of the PTM mass is appended to the sequence. The peptide sequence appears as a short label beside the main Genome Browser display window, depending on the view configuration.
The Score determines the shade of each displayed item and is derived from the rawScore (see below) given by the proteomics peptide mapping software Peppy. It is computed as [(rawScore - rawScore at 10% FDR cutoff) / (rawScore at near 0% FDR cutoff - rawScore at 10% FDR cutoff)] * 1000, and is then converted to an integer. Raw scores above the near-0% FDR threshold receive a score of 1000 (best), while those below the 10% FDR threshold receive a score of 0 (worst).
The rawScore is given by Peppy and is expressed as the negative log10 of the p-value, which reflects the confidence of the match between a peptide and a spectrum. On the item details pages, rawScore is labeled: Raw score for a peptide/spectrum match.
The spectrumId is an identifier of the spectrum associated with the peptide mapping and can be used to track the original spectrum. On the item details pages, spectrumId is labeled: An identifier of the spectrum associated with the peptide mapping.
The peptideRank is the rank of a peptide/spectrum match among the different peptides matched to the same spectrum. A spectrum can be chimeric (containing more than one peptide), so it can be mapped to two or more distinct peptides. Here, only the top-scoring match is reported. If more than one peptide tied for the top score, all of the tied peptides were included and all of their matches have a peptideRank of 1. On the item details pages, peptideRank is labeled: Rank of the peptide/spectrum match, for spectrum matching to different peptides.
The peptideRepeatCount indicates the number of places in the genome where the peptide is encoded for a peptide/spectrum match. It reflects how unique or how prevalent the peptide mapping is in the genome. Peptides mapped to only a few genomic locations have a low peptideRepeatCount, whereas peptides mapped to highly duplicated regions have a high peptideRepeatCount. Peptides with a peptideRepeatCount greater than 10 were removed from the track (this field is for regular peptides only). On the item details pages, peptideRepeatCount is labeled: Indicates the number of places in the genome where the peptide is encoded for a peptide/spectrum match.
The modificationMass reflects the additional molecular weight for each modified peptide matched to a spectrum (this field is for PTM peptides only). On the item details pages, modificationMass is labeled: Reflects the additional molecular weight for each modified peptide matched to a spectrum.
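
The Score formula given in the field list above can be illustrated with a short Python sketch. This is not code from Peppy or the track pipeline: the function name, the parameter names, and the threshold values in the usage note are hypothetical, and truncation is assumed for the integer conversion because the description does not say whether the value is truncated or rounded.

```python
def browser_score(raw_score: float, raw_at_fdr10: float, raw_at_fdr0: float) -> int:
    """Scale a rawScore to the 0-1000 range used for grayscale shading.

    raw_at_fdr10: rawScore at the 10% FDR cutoff (maps to 0)
    raw_at_fdr0:  rawScore at the near-0% FDR cutoff (maps to 1000)
    """
    if raw_at_fdr0 <= raw_at_fdr10:
        raise ValueError("the near-0% FDR cutoff must exceed the 10% FDR cutoff")
    scaled = (raw_score - raw_at_fdr10) / (raw_at_fdr0 - raw_at_fdr10) * 1000
    # Clamp to [0, 1000], then truncate (assumption: truncation, not rounding).
    return int(min(1000.0, max(0.0, scaled)))
```

With hypothetical cutoffs of 10 (10% FDR) and 20 (near 0% FDR), a rawScore of 15 maps to 500, a rawScore of 25 is capped at 1000, and a rawScore of 5 is floored at 0.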

Methods

ENCODE cell lines K562, GM12878, H1-hESC and H1-neurons were used for this large-scale proteomic analysis. Cell lines were cultured according to standard ENCODE cell culture protocols and tryptic peptides were prepared using in-gel digestion (Shevchenko et al., 2006), FASP (Wisniewski et al., 2009; Manza et al., 2005) or MudPIT (Washburn et al., 2001) protocols as indicated for each sample. Tandem mass spectrometry (RPLC-MS/MS) analysis was then performed on an Eksigent Ultra-LTQ Orbitrap system or a Q Exactive system (Thermo Scientific) as indicated.* The number of arginine or lysine sites missed by the trypsin enzyme is indicated by the metadata parameter miscleavages.
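
To make the miscleavages parameter concrete, the following Python sketch performs a simplified in silico tryptic digest and reports, for each peptide, how many internal lysine (K) or arginine (R) sites were left uncleaved. It is an illustration only and not the digestion code used by Peppy: the function name is hypothetical, and the cleavage rule is reduced to "cut after every K or R", ignoring exceptions that real digestion tools may apply.

```python
def tryptic_peptides(protein: str, max_missed: int = 1) -> list[tuple[str, int]]:
    """Return (peptide, miscleavages) pairs for a simplified tryptic digest."""
    # Step 1: cut after every K or R to get the fully cleaved fragments.
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder

    # Step 2: join up to max_missed + 1 consecutive fragments.
    # Joining n + 1 fragments leaves n internal K/R sites uncleaved.
    peptides = []
    for first in range(len(fragments)):
        for missed in range(max_missed + 1):
            last = first + missed
            if last >= len(fragments):
                break
            peptides.append(("".join(fragments[first:last + 1]), missed))
    return peptides
```

For the made-up sequence "AAKCCRDD" with one allowed missed cleavage, the sketch yields AAK (0), AAKCCR (1), CCR (0), CCRDD (1) and DD (0), where the number in parentheses is the miscleavage count.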

We performed proteogenomic mapping (Jaffe et al., 2004) using Peppy software on an in silico translation and proteolytic digestion of the whole human genome (UCSC hg19) and on the GENCODE database of translated protein-coding transcripts, with up to one missed cleavage. GENCODE V11 was used for the H1-hESC (FASP protocol), K562, and GM12878 datasets, and GENCODE V10 for the H1-hESC (MudPIT protocol) and H1-neurons datasets. GENCODE V11 was used for the initial database searches; GENCODE V10 was later found to be the preferred version and replaced V11 in the analyses of the later datasets. Peppy's embedded algorithm matches the MS/MS spectra to peptides and outputs a matching score, and the peptides are then mapped back to their corresponding genomic sequences. The peptide/spectrum matches (PSMs) from the hg19 genome and GENCODE searches were compared, and the higher-scoring PSM from the two searches was reported. If the scores from both searches were equal, both PSMs were reported. The GENCODE search found additional peptide matches that were not found in the hg19 genome search, some of which span splice junctions. Overall, cross-comparing and combining the results of both database searches resulted in greater coverage.
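
The comparison of the two searches can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the Psm class, its field names, and the best_psms function are hypothetical, and scores are compared with exact equality, which a real implementation might relax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Psm:
    spectrum_id: str   # identifier of the MS/MS spectrum
    peptide: str       # matched peptide sequence
    raw_score: float   # matching score reported by the search
    source: str        # "genome" or "gencode" (hypothetical labels)


def best_psms(genome: list[Psm], gencode: list[Psm]) -> dict[str, list[Psm]]:
    """Keep the higher-scoring PSM per spectrum; keep both on a tie."""
    best: dict[str, list[Psm]] = {}
    for psm in [*genome, *gencode]:
        current = best.get(psm.spectrum_id)
        if current is None or psm.raw_score > current[0].raw_score:
            best[psm.spectrum_id] = [psm]      # new top score for this spectrum
        elif psm.raw_score == current[0].raw_score:
            current.append(psm)                # tie: report both matches
    return best
```

A spectrum matched only in the GENCODE search, for example a peptide spanning a splice junction, simply passes through, which is how combining both searches can increase coverage.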

For both the hg19 genome and GENCODE database searches, a blind search for post-translational modifications (PTMs) was performed using Peppy software. In a blind PTM search, when Peppy matches an MS/MS spectrum to a peptide, the peptide is reported as having a PTM if the matching score increases after the molecular weight (MW) of a potential PTM is added. In the output of both the hg19 genome and GENCODE searches, some spectra were matched to peptides with PTMs and others were matched to regular peptides, i.e., peptides without PTMs. Once the best-ranking PSMs were identified from either search, the regular peptides and peptides with PTMs were displayed in separate tracks.

For each data set, a reverse database search was also performed using all spectra to calculate the false discovery rate (FDR) (Elias et al., 2007). Only those matches with an FDR below 5% were included in this track. The unfiltered results, consisting of peptide matches with an FDR below 10%, are available for download.
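
The idea behind a reverse-database (decoy) FDR estimate can be sketched in Python. The exact formula used for this track is not stated here, so the sketch assumes the common estimate for separate target and decoy searches: the number of decoy matches at or above a score threshold divided by the number of target matches at or above it. All function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def estimate_fdr(target_scores: Sequence[float],
                 decoy_scores: Sequence[float],
                 threshold: float) -> float:
    """Estimate FDR at a score threshold as decoy hits / target hits."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0


def lowest_threshold_for_fdr(target_scores: Sequence[float],
                             decoy_scores: Sequence[float],
                             max_fdr: float = 0.05) -> Optional[float]:
    """Return the lowest observed target score whose estimated FDR is <= max_fdr."""
    for threshold in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, threshold) <= max_fdr:
            return threshold
    return None  # no threshold satisfies the requested FDR
```

With made-up target scores 30, 25, 20, 15, 10, 5 and decoy scores 12, 6, 4, the estimated FDR is 0.2 at a threshold of 10 and 0 at a threshold of 15, so 15 is the lowest threshold that satisfies a 5% limit.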

*H1-hESC (FASP protocol), K562 and GM12878 samples were analyzed on the Eksigent Ultra LTQ Orbitrap system (Thermo Scientific), whereas H1-hESC (MudPIT protocol) and H1-neurons samples were analyzed on the Q Exactive system (Thermo Scientific).

Release Notes

This is Release 1 of this track (Sept 2012). Unlike other ENCODE data, these data are archived at Proteome Commons rather than at GEO. The first 32 digits of the Tranche Hash for each data set are stored as the labExpId.

Credits

Proteogenomic mapping: Dr. John Wrobel, Dr. Jainab Khatun, Mr. Brian Risk, and Mr. David Thomas (Giddings Lab).

Proteomic analysis: Dr. Yanbao Yu, Dr. Harsha Gunawardena, Dr. Ling Xie and Ms. Li Wang (Chen Lab).

Main Contact: John Wrobel

References

Giddings MC, Shah AA, Gesteland R, Moore B. Genome-based peptide fingerprint scanning. Proc Natl Acad Sci U S A. 2003 Jan 7;100(1):20-5.

Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann M. In-gel digestion for mass spectrometric characterization of proteins and proteomes. Nat Protoc. 2006;1(6):2856-60.

Wisniewski JR, Zougman A, Nagaraj N, Mann M. Universal sample preparation method for proteome analysis. Nat Methods. 2009 May;6(5):359-62.

Manza LL, Stamer SL, Ham AJ, Codreanu SG, Liebler DC. Sample preparation and digestion for proteomic analyses using spin filters. Proteomics. 2005 May;5(7):1742-5.

Washburn MP, Wolters D, Yates JR 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol. 2001 Mar;19(3):242-7.

Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004 Jan;4(1):59-77.

Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007 Mar;4(3):207-14.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column above. The full data release policy for ENCODE is available here.