Description

The CSHL small RNA track depicts short total RNA sequencing data from ENCODE tissues or sub-cellular compartments of ENCODE cell lines. The protocol used to generate these data produced directional reads from the 5' end of short RNAs (< 200 nt). Libraries were sequenced using an Illumina GAIIx. These data were generated by Cold Spring Harbor Laboratory as part of the ENCODE Consortium. The ENCODE project seeks to identify and characterize all functional elements in the human genome. In many cases, Cap Analysis of Gene Expression (CAGE, see the RIKEN CAGE Loc track), Long RNA-seq (> 200 nt, see the CSHL Long RNA-seq track) and Paired-End di-TAG-RNA (PET-RNA, see the GIS RNA PET track) datasets are available from the same biological replicates.

Many of the datasets produced by the Hannon lab (Generation 0 datasets) in Release 1 have been replaced by newly generated data from the Gingeras lab in Release 2. Of all Generation 0 datasets, only data from K562 and Prostate tissue are still displayed. All Generation 0 datasets are still available for download.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Color differences among the views provide a visual cue for distinguishing between the different cell types and compartments.

This track contains the following views:

Contigs
The Contigs are BED format files representing blocks of overlapping mapped reads from pooled biological replicates. The corresponding number of mapped reads, the RPKM value, and the non-parametric IDR (npIDR) are reported for each contig.
Raw Signal
The Raw Signal view shows the density of mapped reads on the plus and minus strands (wiggle format).
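
As an illustration of what a strand-specific read density is, the following is a minimal sketch and not the pipeline used for this track. It assumes reads are given as (chrom, start, end, strand) tuples with 0-based, half-open coordinates (a convention borrowed from BED), and it reports plain counts; whether the displayed signal is normalized is not stated here.

```python
from collections import defaultdict


def strand_coverage(reads):
    """Count how many reads cover each base, separately for each strand.

    `reads` is an iterable of (chrom, start, end, strand) tuples using
    0-based, half-open coordinates (an assumption of this sketch).
    Returns {(chrom, strand): {position: depth}}.
    """
    coverage = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, strand in reads:
        for pos in range(start, end):
            coverage[(chrom, strand)][pos] += 1
    return coverage


# Toy reads, invented for illustration only.
reads = [
    ("chr1", 100, 105, "+"),
    ("chr1", 103, 108, "+"),
    ("chr1", 104, 106, "-"),
]
cov = strand_coverage(reads)
```

Keeping the two strands in separate tables mirrors the plus and minus signal subtracks: a read on the minus strand never adds to the plus-strand density.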

Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.

Downloadable Files

The following files can be found on the Downloads Page.
Alignments
The Alignments contain individual reads mapped to the genome and indicate where bases may mismatch. Every mapped read is displayed. The Alignments files follow the standard SAM format. See the SAM Format Specification for more information on the SAM/BAM file format.
GENCODE V7
For each GENCODE V7 exon, the raw read count on the plus or minus strand, reads per million mapped reads (RPM) and the non-parametric IDR (npIDR) are reported in GFF format.
Transfrags
These data are now available for download only and have been replaced by Contigs. Small RNA reads were assembled into transcribed fragments, "Transfrags," by merging reads with one or more overlapping nucleotides. Only uniquely mapping reads were used to generate Transfrags. The BED6+ transfrag files were created with the "intervals-to-contigs" Galaxy tool written by Assaf Gordon in the Hannon lab at CSHL. The transfrag data were not filtered.
Fastq
Raw sequence information is available in fastq format.
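
The Transfrags rule described above (merge reads that share one or more nucleotides) can be sketched as a simple interval merge. This is an illustrative stand-in, not the "intervals-to-contigs" tool itself. It assumes 0-based, half-open coordinates and merges within a strand only; the strand handling is an assumption based on the directional protocol.

```python
def merge_overlapping(reads):
    """Merge reads that share at least one nucleotide on the same strand.

    `reads` holds (chrom, start, end, strand) tuples, 0-based half-open.
    Returns a list of (chrom, start, end, strand, read_count) blocks.
    """
    merged = []
    ordered = sorted(reads, key=lambda r: (r[0], r[3], r[1], r[2]))
    for chrom, start, end, strand in ordered:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[3] == strand and start < last[2]:
            # One or more shared bases: extend the open block.
            merged[-1] = (chrom, last[1], max(last[2], end), strand, last[4] + 1)
        else:
            merged.append((chrom, start, end, strand, 1))
    return merged


# Toy reads, invented for illustration only.
reads = [
    ("chr1", 100, 120, "+"),
    ("chr1", 119, 140, "+"),  # shares base 119 with the first read
    ("chr1", 140, 160, "+"),  # only abuts the block, so it starts a new one
    ("chr1", 110, 130, "-"),  # opposite strand, kept separate
]
blocks = merge_overlapping(reads)
```

With half-open coordinates, the test `start < last_end` is what requires at least one shared base; reads that merely touch end to start are not merged.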

Methods

Experimental Procedures

Cells were grown according to the approved ENCODE cell culture protocols. Short RNAs of 20-200 nt were isolated from total RNA using a Qiagen RNeasy kit (Qiagen #74204) according to the manufacturer's protocol. Purified small RNAs were treated with Ribominus (Invitrogen #A10837-08) according to the manufacturer's protocol to remove ribosomal RNA. The 5' structures were removed by treatment with Tobacco Alkaline Pyrophosphatase to allow for ligation of a 5' linker. RNA fragments were poly-adenylated using poly-A polymerase (or poly-cytidylated, in the case of Generation 0 data) and a 5' linker was ligated using T4 RNA ligase. An anchored oligo-dT was used to prime the reverse transcriptase reaction. cDNA was amplified using universal PCR primers introduced during the reverse transcriptase reaction. The resulting libraries were gel purified and used in cluster generation on an Illumina GAIIx. The libraries were sequenced as a single read from the 5' end of the inserts for a total of 36 cycles.

Complete protocols are available on the Downloads Page.

Data Processing and Analysis

Data from the Gingeras and Guigo labs were preprocessed to remove experimentally derived poly-A tails and Illumina 3' linkers from raw reads. The best alignment to the Illumina 3' linker for each read was determined. If the number of mismatches in the alignment was less than 20% of the aligned length, the read was clipped from the first aligned base. Pre-processed reads were mapped using the STAR algorithm. For a description of STAR, the source code and mapping parameters used, see the STAR project website. Reads mapping 10 or fewer times are reported in the Raw Signal and Alignments files.
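
The linker-clipping rule above can be sketched as follows. This is a simplified illustration, not the labs' preprocessing code. It assumes an ungapped alignment of the start of the linker against the 3' portion of the read, defines "best" as the lowest mismatch fraction with ties going to the longer alignment, and adds a hypothetical `min_overlap` parameter to avoid clipping on trivially short matches. The linker sequence in the example is a made-up placeholder, not the actual Illumina 3' linker.

```python
def clip_linker(read, linker, max_mismatch_frac=0.20, min_overlap=5):
    """Clip a 3' linker from `read` using an ungapped prefix alignment.

    For every read offset, the start of `linker` is compared with the read
    from that offset onward. The best alignment has the lowest mismatch
    fraction (ties go to the longer alignment). The read is clipped at the
    first aligned base only if mismatches are below `max_mismatch_frac`.
    `min_overlap` is a hypothetical guard that is not part of the source.
    """
    best = None  # (mismatch fraction, -aligned length, offset)
    for offset in range(len(read)):
        aligned = min(len(read) - offset, len(linker))
        if aligned < min_overlap:
            break  # later offsets only give shorter alignments
        window = read[offset:offset + aligned]
        mismatches = sum(1 for a, b in zip(window, linker) if a != b)
        candidate = (mismatches / aligned, -aligned, offset)
        if best is None or candidate < best:
            best = candidate
    if best is not None and best[0] < max_mismatch_frac:
        return read[:best[2]]
    return read


# Placeholder linker and reads, invented for illustration only.
LINKER = "ACGTTGCAAGGC"
insert = "TTGGCCAATTGGCCAATT"
clipped = clip_linker(insert + LINKER[:8], LINKER)
untouched = clip_linker("T" * 18, LINKER)
```

The strict `<` comparison follows the wording "less than 20% of the aligned length"; an alignment with exactly 20% mismatches is left unclipped in this sketch.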

A mapped read was discarded if it fell into any of the following categories: 1) it contained five or more consecutive A's, 2) it was less than 16 nt in length, 3) it mapped to more than one genomic position (multiply-mapped reads), or 4) it mapped upstream of a genomically encoded poly-A sequence. The remaining reads were used both to call contigs and to produce expression values over GENCODE V7 exons. Contigs were generated from overlapping reads in pooled biological replicates.
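
The four discard rules above can be expressed as a small filter. This is a minimal sketch under stated assumptions, not the production filter: the `MappedRead` record and its field names are hypothetical, and the fourth rule depends on the genome sequence, so it is represented here by a flag that is assumed to have been computed elsewhere.

```python
import re
from dataclasses import dataclass

POLY_A_RUN = re.compile(r"A{5,}")  # five or more consecutive A's


@dataclass
class MappedRead:
    """Hypothetical record; field names are illustrative only."""
    sequence: str
    num_hits: int  # number of genomic positions the read maps to
    upstream_of_genomic_polya: bool  # assumed precomputed from the genome


def discard_reason(read):
    """Return the first rule that rejects the read, or None to keep it."""
    if POLY_A_RUN.search(read.sequence):
        return "poly-A run"
    if len(read.sequence) < 16:
        return "too short"
    if read.num_hits > 1:
        return "multiply mapped"
    if read.upstream_of_genomic_polya:
        return "upstream of genomic poly-A"
    return None


# Toy reads, invented for illustration only.
kept = discard_reason(MappedRead("ACGTACGTACGTACGTAC", 1, False))
poly_a = discard_reason(MappedRead("ACGTAAAAAGTACGTACG", 1, False))
short = discard_reason(MappedRead("ACGTACGTACGT", 1, False))
multi = discard_reason(MappedRead("ACGTACGTACGTACGTAC", 3, False))
```

Returning a reason rather than a bare boolean makes it easy to tally how many reads each rule removes.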

Generation 0 data

Reads were trimmed to discard any bases following a quality score less than or equal to 20 and converted into FASTA format, thereby discarding quality information for the rest of the pipeline. As a result, the sequence quality scores in the BAM output are all displayed as "40" to indicate no quality information. Because read lengths may exceed insert sizes, 3' adapter sequence can appear at the 3' end of the reads. The 3' sequencing adapter was removed from the reads using a custom clipper program (available at http://hannonlab.cshl.edu/fastx_toolkit/), which aligned the adapter sequence to the short reads using up to 2 mismatches and no indels. Regions that aligned were clipped off from the read. Terminal C nucleotides introduced at the 3' end of the RNA via the cloning procedure were also trimmed. Reads were aligned to the human genome (version hg19, using the gender build appropriate to the sample in question - female/male) using Bowtie (Langmead B et al., 2009). Reads that mapped 20 or fewer times with 2 or fewer mismatches were reported. See Release Notes for more information on Generation 0 datasets.
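
The Generation 0 read preparation steps can be sketched as three small functions. This is a simplified stand-in and not the clipper from the fastx_toolkit. Assumptions made here: quality scores are already decoded to integers, the low-quality base itself is dropped along with everything after it, the adapter search takes the first ungapped match with at most 2 mismatches, and `min_overlap` is a hypothetical parameter. The adapter sequence in the example is a made-up placeholder.

```python
def trim_low_quality(seq, quals, threshold=20):
    """Cut the read at the first base whose quality is <= threshold.

    Dropping the low-quality base itself is an assumption of this sketch.
    """
    for i, q in enumerate(quals):
        if q <= threshold:
            return seq[:i]
    return seq


def clip_adapter(seq, adapter, max_mismatches=2, min_overlap=8):
    """Clip from the first ungapped adapter match with few mismatches."""
    for offset in range(len(seq) - min_overlap + 1):
        window = seq[offset:offset + len(adapter)]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= max_mismatches:
            return seq[:offset]
    return seq


def strip_terminal_c(seq):
    """Remove C's at the 3' end (genuine terminal C's are lost as well)."""
    return seq.rstrip("C")


def preprocess(seq, quals, adapter):
    """Apply the three steps in the order described in the text."""
    seq = trim_low_quality(seq, quals)
    seq = clip_adapter(seq, adapter)
    return strip_terminal_c(seq)


# Placeholder adapter and read, invented for illustration only.
ADAPTER = "GGTTAACCGGTT"
raw = "ATATATATATATATATAT" + "CCC" + ADAPTER[:8]
cleaned = preprocess(raw, [30] * len(raw), ADAPTER)
```

Because quality trimming runs first, a read whose adapter falls after a low-quality base never reaches the adapter search with that region intact.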

Verification

The mapped data were visually inspected to verify that the majority of the reads mapped to the 5' ends of annotated small RNA classes.

Release Notes

Update to Release 2 (July 2012): Alignments are no longer being displayed. They are now available as downloads only.

This is Release 2 (February 2012) of CSHL Small RNA-seq with new data from the Gingeras lab. It includes ten new cell lines (A549, AG04450, BJ, H1-hESC, HeLa-S3, HepG2, HUVEC, MCF-7, NHEK, and SK-N-SH_RA), a new displayed view (Contigs), and two new download files (GENCODE Predicted Exons and a Protocol Document).

Release 1 contained data produced by the Hannon lab. These data were remapped from hg18 and are labeled in this release as Generation 0 because the older data had no replicates. When both Generation 0 and new data are available, only the new data are displayed; the older data are available for download only. Of the original eleven experiments displayed in Release 1, only two (prostate and K562 polysome) are still displayed. The new data for this track were created with a different process and have standard replicate numbers. The replicate labeling in the genome browser view is a counter indicating the total number of replicates submitted. The producing lab has replicate numbers that correspond to their internal bio-replicate numbering. Where these two numbering systems conflict, both are listed in the long label of the specific track.

Discrepancies between hg18 and hg19 versions of Generation 0 CSHL small RNA data: The alignment pipeline for the CSHL small RNA data was updated upon the release of the human genome version hg19, resulting in a few noteworthy discrepancies with the hg18 dataset. First, mapping was conducted with the open-source Bowtie algorithm (http://bowtie-bio.sourceforge.net/index.shtml) rather than the custom NexAlign software. As each algorithm uses different strategies to perform alignments, the mapping results may vary even in genomic regions that do not differ between builds. Second, the read processing pipeline varies slightly in that we no longer retain information regarding whether an adapter sequence was clipped from a read.

Credits

Hannon lab members: Katalin Fejes-Toth, Vihra Sotirova, Assaf Gordon, Jon Preall

Gingeras and Guigo laboratories: Carrie A. Davis, Lei-Hoon See, Wei Lin

Contacts:

Jonathan Preall (Generation 0 Data from Hannon Lab)
Carrie Davis (experimental)
Alex Dobin (computational)
Wei Lin (computational)
Tom Gingeras (primary investigator)

References

Affymetrix ENCODE Transcriptome Project, Cold Spring Harbor Laboratory ENCODE Transcriptome Project. Post-transcriptional processing generates a diversity of 5'-modified long and short RNAs. Nature. 2009 Feb 19;457(7232):1028-32.

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column in the track configuration page and the download page. The full data release policy for ENCODE is available here.