This track depicts high-throughput sequencing of long RNAs (>200 nt) from RNA samples from tissues or subcellular compartments of ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome.
Libraries were prepared with the standard Illumina paired-end kit, with the sole exception that a "tagged" random hexamer was used to prime first-strand synthesis: 5′-ACTGTAGGN6-3′. The addition of this tag is what permits us to make strand assignments for the reads. The sequence of the tag is reported at the 5′ end of the read. Asymmetric PCR can place the tag on either the first or the second read, depending on which strand was used as a template. Strand assignments are made by looking for the tag at the 5′ end of either read 1 or read 2. Because read 1 is physically linked to read 2, a tag on one end allows strand assignments to be made for both ends. We noted during analysis that the tags are generally 5′ truncated. We only "strand" reads that contain ACTGTAGG, CTGTAGG, TGTAGG, or GTAGG. Between 63% and 68% of reads could be stranded in these libraries. It is possible to cull additional stranded reads that contain non-templated TAGG, AGG, GG, or G sequences at their 5′ end. The insert size distribution peaks between 200 and 250 nucleotides.
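The tag lookup described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline that produced the track: the function names `find_tag` and `tagged_end` are hypothetical, reads are assumed to be uppercase strings, and the handling of pairs that carry a tag on both mates (left unstranded here) is an assumption, since the text does not describe that case. The sketch reports only which mate carries the tag; the text does not state how that maps to a genomic strand, so no such mapping is attempted.

```python
# Accepted 5' tag variants, longest first, as listed in the text.
STRAND_TAGS = ("ACTGTAGG", "CTGTAGG", "TGTAGG", "GTAGG")


def find_tag(read):
    """Return the accepted tag found at the 5' end of a read, or None."""
    for tag in STRAND_TAGS:
        if read.startswith(tag):
            return tag
    return None


def tagged_end(read1, read2):
    """Report which mate of a pair carries the tag: 1, 2, or None.

    Because the two mates are physically linked, a tag on either mate is
    enough to strand the whole pair. Pairs tagged on both mates are left
    unstranded here (assumption of this sketch).
    """
    tag1, tag2 = find_tag(read1), find_tag(read2)
    if tag1 and not tag2:
        return 1
    if tag2 and not tag1:
        return 2
    return None
```

The shorter non-templated prefixes (TAGG, AGG, GG, G) are deliberately left out of `STRAND_TAGS`, matching the stricter rule stated in the text.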
Oligo-dT selected poly-A+ RNA was RiboMinus-treated according to the manufacturer's protocol (Invitrogen). The RNA was treated with tobacco alkaline pyrophosphatase to eliminate any 5′ cap structures and hydrolyzed to ~200 bases via alkaline hydrolysis. The 3′ end was repaired using calf intestinal alkaline phosphatase, and poly-A polymerase was used to catalyze the addition of Cs to the 3′ end. The 5′ end was phosphorylated using T4 PNK, and an RNA linker was ligated onto the 5′ end. Reverse transcription was carried out using a poly-G oligo with a defined 5′ extension. The inserts were then amplified using oligos targeting the 5′ linker and poly-G extension. This cloning protocol generated stranded reads that were read from the 5′ ends of the inserts. The library was sequenced on a Solexa platform for a total of 36 cycles; however, the reads underwent post-processing, resulting in trimming of their 3′ ends. Consequently, the mapped read lengths are variable.
Tags were removed from the 5′ ends of the reads according to their lengths, and strand assignments were made. The reads were then trimmed from their 3′ ends to a final length of 50 nucleotides and mapped using Nexalign, a program developed by Timo Lassmann (RIKEN). We allowed up to 2 mismatches across the entire read length and report only reads that mapped to a single, unique locus in the assembled hg18 genome.
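As a rough illustration of the two trimming steps, the Python sketch below removes a tag of known length from the 5′ end and then trims the 3′ end to 50 nucleotides. The function name `strip_tag_and_trim` and its parameters are hypothetical; the sketch assumes the tag length has already been determined and that reads shorter than the final length are simply returned as they are after tag removal.

```python
FINAL_LENGTH = 50  # final read length stated in the text


def strip_tag_and_trim(read, tag_length, final_length=FINAL_LENGTH):
    """Remove tag_length bases from the 5' end, then trim the 3' end.

    tag_length is the length of the (possibly 5'-truncated) tag found on
    this read, or 0 for the untagged mate of a pair.
    """
    insert = read[tag_length:]     # drop the tag from the 5' end
    return insert[:final_length]   # keep at most final_length bases
```

For example, a read that starts with the full 8-nt tag followed by 60 bases of insert would come out as the first 50 bases of that insert.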
Reads were mapped to the human (hg18, March 2006) assembly using Nexalign, with only uniquely mapping (one locus), exactly matching (no mismatches) aligned reads reported in the processed files, as follows:
Verification was done by comparison of referential data generated from 8 individual sequencing lanes (Illumina technology).
The K562 cytosol alignments are exactly the same data as Release 1, but the alignments are now formatted in the bed14 format described below. These data have the string submittedDataVersion="V2 - file format change" in their metadata and the table names are appended with the string "V2".
The alignments in this track are provided in bigBed format. Each record is in bed 14 format, with the first 12 fields described here. The final two fields are the two paired sequences; for single alignments, the 13th field is the sequence and the 14th field is a single N.
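A record in this layout could be read along the following lines. This is a sketch under stated assumptions: it expects a tab-separated text line (for example, output of the UCSC `bigBedToBed` utility, since bigBed itself is a binary format), it assumes the first 12 columns follow the standard BED12 order, and the names `Bed14Record` and `parse_bed14` are hypothetical. Only a few of the 12 standard fields are kept, for brevity.

```python
from typing import NamedTuple, Optional


class Bed14Record(NamedTuple):
    chrom: str
    start: int            # 0-based start (BED convention)
    end: int
    name: str
    strand: str
    seq1: str             # 13th field: first (or only) read sequence
    seq2: Optional[str]   # 14th field: second read, None if it was "N"


def parse_bed14(line):
    """Parse one tab-separated bed 14 line into a Bed14Record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 14:
        raise ValueError("expected 14 fields, got %d" % len(fields))
    seq2 = fields[13]
    return Bed14Record(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        strand=fields[5],
        seq1=fields[12],
        # A single "N" marks a single (unpaired) alignment.
        seq2=None if seq2 == "N" else seq2,
    )
```

Mapping the placeholder N to `None` lets downstream code tell paired from single alignments with a simple `record.seq2 is None` check.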
These data were generated and analyzed by the transcriptome group at Cold Spring Harbor Laboratory and the Center for Genomic Regulation (Barcelona), who are participants in the ENCODE Transcriptome Group.
Credits: Carrie A. Davis, Jorg Drenkow, Huaien Wang, Alex Dobin and Tom Gingeras
Contacts: Carrie Davis and Tom Gingeras (CSHL).
Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.