Description

This track depicts NextGen sequencing information for RNAs between the sizes of 20-200 nt isolated from RNA samples from tissues or sub cellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome.

This cloning protocol generates directional libraries that are read from the 5′ ends of the inserts, which should largely correspond to the 5′ ends of the mature RNAs. The libraries were sequenced on a Solexa platform for a total of 36, 50 or 76 cycles however the reads undergo post-processing resulting in trimming of their 3′ ends. Consequently, the mapped read lengths are variable.

Display Conventions and Configuration

To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide.

Color differences among the views are arbitrary. They provide a visual cue for distinguishing between the different cell types and compartments.

Transfrags
Small RNA reads were assembled into "Transfrags" by merging reads with one or more overlapping nucleotides. In order to minimize ambiguity from reads that have the potential to map to multiple genomic loci, only the uniquely mapping reads were used to generate Transfrags. The BED6+ format of the transfrag files are created from "intervals-to-contigs" Galaxy tool written by Assaf Gordon in the Hannon lab at CSHL. A complete description of the columns in this format can be found here. The Transfrags view includes all transfrags before filtering.
Raw Signals
The Raw Signal views show the density of aligned tags on the plus and minus strands.
Alignments
The Alignments view shows reads mapped to the genome and indicates where bases may mismatch. Every mapped read is displayed, i.e. uncollapsed. The track may be set so that the color of each alignment denotes the mapping frequency of the corresponding read: pink for reads mapping to a single genomic location, pinkish-purple for reads mapping to exactly two genomic locations and purple for reads mapping to more than two possible locations. The alignment file follows the standard SAM format of Bowtie output with the following additions: the mapping frequency for each read is indicated in the BAM file using the tag 'YG' and the alignment color is specified in RGB space using the tag 'YC'. The 'XA' tag is Bowtie tag that means the aligned read belongs to strata. See the Bowtie Manual for more information about the SAM Bowtie output (including other tags) and the SAM Format Specification for more information on the SAM/BAM file format. Mapping quality is not available for this track and so, in accordance with the SAM Format Specification, a score of 255 is used. Also, as described in the Methods section below, the sequence quality scores are not available and so are all displayed as 40.

Methods

Small RNAs between 20-200 nt were ribominus treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and high abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5′ cap structures.

Poly-A Polymerase was used to catalyze the addition of C's to the 3′ end. The 5′ ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5′ end. Reverse transcription was carried out using a poly-G oligo with a defined 5′ extension. The inserts were then amplified using oligos targeting the 5′ linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50 or 76 cycles. Initially, one lane was run. If an appreciable number of mappable reads were obtained, additional lanes were run. Sequence reads underwent quality filtration using Illumina standard pipeline (GERALD).

The Illumina reads were initially trimmed to discard any bases following a quality score less than or equal to 20 and converted into FASTA format, thereby discarding quality information for the rest of the pipeline. As a result, the sequence quality scores in the BAM output are all displayed as "40" to indicate no quality information. The read lengths may exceed the insert sizes and consequently introduce 3′ adapter sequence into the 3′ end of the reads. The 3′ sequencing adapter was removed from the reads using a custom clipper program (available at http://hannonlab.cshl.edu/fastx_toolkit/), which aligned the adapter sequence to the short-reads using up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. Terminal C nucleotides introduced at the 3' end of the RNA via the cloning procedure are also trimmed. The trimmed portions were collapsed into identical reads, their count noted and aligned to the human genome (version hg19, using the gender build appropriate to the sample in question - female/male) using Bowtie (Langmead B. et al). The alignment parameter allowed 0, 1, or 2 mismatches iteratively. We report reads that mapped 20 or fewer times.

Discrepancies between hg18 and hg19 versions of CSHL small RNA data: The alignment pipeline for the CSHL small RNA data was updated upon the release of the human genome version hg19, resulting in a few noteworthy discrepancies with the hg18 dataset. First, mapping was conducted with the open-source Bowtie algorithm (http://bowtie-bio.sourceforge.net/index.shtml) rather than the custom NexAlign software. As each algorithm uses different strategies to perform alignments, the mapping results may vary even in genomic regions that do not differ between builds. The read processing pipeline also varies slightly, in that we no longer retain information regarding whether a read was 'clipped' off adapter sequence.

Verification

Comparison of referential data generated from 8 individual sequencing lanes (Illumina technology).

Credits

Hannon lab members: Katalin Fejes-Toth, Vihra Sotirova, Gordon Assaf, Jon Preall

And members of the Gingeras and Guigo labs.

Contact: Jonathan Preall at Cold Spring Harbor Labs

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.