Description

We interrogated the transcribed loci in 420 selected ENCODE regions using RACE sequencing. We analyzed annotated known gene regions but focused primarily on novel transcriptionally active regions (TARs) previously identified by high-density oligonucleotide tiling arrays, and on random regions that were thought not to be transcribed.

Blue products: sequences detected by 5' RACE; red products: sequences detected by 3' RACE.

Methods

Target selection

Most of the analyzed regions were selected in the Chromosome 22 ENCODE region, with additional targets in the Chromosome 11 and 21 ENCODE regions. Except for a few regions used for test purposes, the exon and novel TAR primer regions were selected from known exons and novel TARs that were detected as expressed (in a cell-type-specific manner) by transcriptional tiling array experiments. The non-transcribed primer regions were selected in a tiled fashion from regions that are neither known exons nor novel TARs.

Primer design

We designed four primers for each targeted region, which could be an exon of a known gene, a TAR (transcriptionally active region), or a region previously shown to be untranscribed. Two Gene Specific Primers (GSP1, GSP2) and two Nested Gene Specific Primers (NGSP1, NGSP2), covering both the plus and minus strands, were selected for each targeted region using a modified Primer3 program.

5'-RACE and 3'-RACE (Rapid Amplification of cDNA Ends) experiments and end sequencing

Human NB4 cell line total RNA, HeLa S3 polyA+ RNA, placenta total RNA and polyA+ RNA (Ambion, TX, USA) were used for cDNA amplification with the SMART RACE™ kit (Clontech, CA, USA) according to the manufacturer's instructions. 5'-RACE-Ready cDNA and 3'-RACE-Ready cDNA were synthesized using PowerScript Reverse Transcriptase and SMARTII A oligo (5'-AAGCAGTGGTATCAACGCAGAGTACGCGGG-3'), 5'-CDS primer A [5'-(T)25VN-3' (N = A, C, G, or T; V = A, G, or C)], or 3'-CDS primer A [5'-AAGCAGTGGTATCAACGCAGAGTAC(T)30VN-3' (N = A, C, G, or T; V = A, G, or C)]. A total of 1 µg RNA was used in a final volume of 10 µl Reverse Transcription (RT) reaction (100 ng/µl). An RT reaction without reverse transcriptase was used as a negative control to detect genomic DNA contamination.

RACE was followed by PCR amplification using UPM (universal primer A) and Gene Specific Primers (GSP1 or GSP2) on both strands of the genome. 0.5 µl of the RT reaction above was used in a 50 µl PCR reaction with the Advantage® 2 PCR Enzyme System (Clontech, CA, USA). Nested PCRs were performed using Nested Universal Primer A (NUP 5'-AAGCAGTGGTATCAACGCAGAGT-3') and Nested Gene Specific Primers (NGSP1 or NGSP2), with 1 µl of RACE PCR product in a 50 µl reaction. The PCR program was 5 cycles of 94℃ for 30 seconds and 72℃ for 3 minutes; then 5 cycles of 94℃ for 30 seconds, 70℃ for 30 seconds and 72℃ for 3 minutes; followed by 25 cycles of 94℃ for 30 seconds and 68℃ for 30 seconds; concluded by an extension cycle of 72℃ for 3 minutes.

Nested PCR products were end sequenced using NGSP1 or NGSP2. The RACE sequences have been submitted to the GenBank database (accession numbers: EW712308-EW712635). File accessions are expected to be available in the near future.

Mapping RACE sequence to the genome

We first used the command-line BLAT alignment tool (with default parameters for DNA-to-DNA alignment) to compare all RACE sequence reads to the human genome assembly (hg18, Mar. 2006), and then evaluated the "fitness scores" of the BLAT output matches with the following formulas:
sizeDif = abs((tEnd - tStart) - (qEnd - qStart)) + abs(qSize - (qEnd - qStart))
insertFactor = qNumInsert + tNumInsert
total = matches + repMatches + misMatches
badness = (1000 * misMatches + insertFactor + 3 * log(1 + sizeDif)) / total
fitness = 100 - badness * 0.1

where parameters such as tEnd have the same meanings as in the BLAT documentation. The fitness score is based on the "percent identity score" in the UCSC Genome Browser and includes an additional penalty on small overall matches. Once the fitness scores had been computed for one RACE experiment, a distribution of the scores was derived from the BLAT matches located on the "correct" chromosomes, and only those "correct" matches with scores above a certain threshold (in the provided bed file, the threshold is 0) were kept as "valid" products and correspondingly as "valid" transcripts.
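The fitness formulas above can be sketched in Python as follows. This is only an illustrative sketch, not the code used to produce the track: the function names fitness_score and is_valid_match are hypothetical, the logarithm is assumed to be the natural log, and returning 0.0 for an alignment with no aligned bases is an assumption made only to avoid a division by zero.

```python
import math


def fitness_score(matches, rep_matches, mis_matches,
                  q_num_insert, t_num_insert,
                  q_size, q_start, q_end, t_start, t_end):
    """Compute the fitness score of one BLAT match from its PSL fields."""
    # Penalty for length disagreement between query and target spans,
    # plus a penalty for the part of the query that did not align.
    size_dif = (abs((t_end - t_start) - (q_end - q_start))
                + abs(q_size - (q_end - q_start)))
    insert_factor = q_num_insert + t_num_insert
    total = matches + rep_matches + mis_matches
    if total == 0:
        return 0.0  # assumption: treat an empty alignment as worst case
    # Natural log is assumed here; the text does not state the base.
    badness = (1000 * mis_matches + insert_factor
               + 3 * math.log(1 + size_dif)) / total
    return 100 - badness * 0.1


def is_valid_match(match_chrom, expected_chrom, score, threshold=0.0):
    """Keep a match only if it lies on the expected ("correct")
    chromosome and its fitness score is above the threshold."""
    return match_chrom == expected_chrom and score > threshold


# A full-length, gap-free, mismatch-free alignment of a 500 bp read.
perfect = fitness_score(500, 0, 0, 0, 0, 500, 0, 500, 1000, 1500)
# A full-length alignment of a 100 bp read with 10 mismatches.
noisy = fitness_score(90, 0, 10, 0, 0, 100, 0, 100, 0, 100)
print(perfect, noisy)
```

Under these assumptions a perfect full-length match scores 100, and each percent of mismatched bases lowers the score by one point; the size and insert terms contribute only small additional penalties.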

Consensus splice site analyses

For BLAT matches with multiple blocks, the corresponding splice sites in the transcripts were further examined as follows: a splice site is defined as a consensus site if and only if a "GT-AG" pattern (or "GC-AG"/"AT-AC", which appear much less often) can be observed within windows of eight nucleotides at its two ends (e.g. for a splice site starting at chromosome position i and ending at j, the windows are [i - 3, i + 5) and [j - 5, j + 3)). An overall consensus score was then assigned to each transcript according to the proportion of consensus splice sites among all its splicing events. We also used this consensus splice site criterion to filter out mirroring antisense transcripts caused by experimental artifacts. In the provided bed file, each reported sequence either is a single block or contains at least one consensus splice site.
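The window test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the names is_consensus_site and consensus_score are hypothetical, coordinates are assumed to be 0-based with half-open windows, the genomic sequence is assumed to be available as a plus-strand string, and minus-strand transcripts (which would need reverse-complemented patterns) are not handled.

```python
# Donor/acceptor dinucleotide pairs named in the text (plus strand).
CONSENSUS_PAIRS = [("GT", "AG"), ("GC", "AG"), ("AT", "AC")]


def is_consensus_site(seq, i, j):
    """Return True if a consensus donor/acceptor pair is found in the
    windows [i - 3, i + 5) and [j - 5, j + 3) around a splice site
    that starts at position i and ends at position j."""
    donor_window = seq[max(0, i - 3):i + 5].upper()
    acceptor_window = seq[max(0, j - 5):j + 3].upper()
    return any(donor in donor_window and acceptor in acceptor_window
               for donor, acceptor in CONSENSUS_PAIRS)


def consensus_score(seq, splice_sites):
    """Proportion of consensus splice sites among all splicing events;
    None for single-block matches, which have no splice sites."""
    if not splice_sites:
        return None
    hits = sum(is_consensus_site(seq, i, j) for i, j in splice_sites)
    return hits / len(splice_sites)


# Toy sequence: 10 flanking bases, an intron GT...AG, 10 flanking bases.
toy = "A" * 10 + "GT" + "C" * 10 + "AG" + "A" * 10
print(is_consensus_site(toy, 10, 24))
print(consensus_score(toy, [(10, 24), (0, 5)]))
```

Because the dinucleotides only need to occur somewhere inside each eight-nucleotide window, the test tolerates small offsets in the aligned block boundaries; a stricter check would require them at the exact intron ends.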

Conclusion

We conclude that RACE sequencing is an efficient, sensitive and highly accurate method for characterizing the transcriptome of specific cell/tissue types. Using this method, it appears that much of the genome is represented in polyA+ RNA. Moreover, a fraction of the novel RNAs are capable of encoding protein and are likely to be functional.

References

Systematic Analysis of Transcribed Loci in ENCODE Regions using RACE Sequencing Reveals Extensive Transcription in the Human Genome
Jia Qian Wu, Jiang Du, Joel Rozowsky, Zhengdong Zhang, Alexander E. Urban, Sherman Weissman, Mark Gerstein, and Michael Snyder (2007) submitted.