Description

This track shows AceView gene models constructed from cDNA by Danielle and Jean Thierry-Mieg at NCBI, using their AceView program.

AceView is unique in that it defines genes genome-wide using only, but exhaustively, the experimental cDNA sequences from the species itself. The analysis exploits sophisticated cDNA-to-genome co-alignment algorithms and the quality of the genome sequence to provide a comprehensive and non-redundant representation of the GenBank, dbEST, GSS, Trace and RefSeq cDNA sequences. The next release, later in 2011, will also include the data deposited in SRA (or an equivalent public repository) as part of the SEQC collaborative project led by Leming Shi of the FDA. That project involves high-throughput RNA sequences provided by Helicos, Illumina, LifeTech SOLiD and Roche 454, which greatly refine and enrich the gene models.

In a way, the AceView transcripts represent a fully annotated non-redundant "nr" view of the public RNAs, minus cloning artefacts, contaminations and bad quality sequences. AceView transcripts currently represent a tenfold compaction relative to the raw data, with minimal loss of sequence information.

87% of the public RNA sequences are coalesced into AceView alternative transcripts and genes, thereby identifying close to twice as many main genes as there are "known genes" in both human and mouse. 18% to 25% of the spliced genes appear non-coding, in mouse and human respectively. Alternative transcripts are prominent in both species. The typical human gene produces on average eight distinct alternatively spliced forms from three promoters and with three non-overlapping terminal exons. It has on average three cassette exons and four internal donor or acceptor sites. The AceView site further proposes a thorough biological annotation of the reconstructed genes, including association to diseases and tissue specificity of the alternative transcripts.

AceView combines respect for the experimental data with extensive quality control. Evaluated in the ENCODE regions, AceView transcripts are close to indistinguishable from the manually curated Gencode reference genes (see Thierry-Mieg, 2006, or compare the two tracks in the Genome Browser), but over the entire genome the number of transcripts exceeds Havana/Vega by a factor of three and RefSeq by a factor of six.

Display Conventions and Configuration

This track follows the display conventions for gene tracks. All gene models displayed at UCSC are in the "cDNA-supported" class and are displayed in pink.

The track description page offers the following filter and configuration options:

Click the "AceView Gene Summary" on an individual transcript's details page to access the gene on the NCBI AceView website.

Methods

The millions of cDNA sequences available from the public databases (GenBank, dbEST, GSS, Traces, etc.) are aligned cooperatively on the genome sequence, taking care to keep the paired 5' and 3' reads from single clones associated in the same transcript. Useful information about tissue, stage, publications, isolation procedure and so on is gathered. AceView alignments on the genome use knowledge of sequencing errors gained from analyzing sequencing traces and cooperative refinements. They are usually obtained over the entire length of the EST or mRNA (on average 98.8% aligned with 0.2% mismatches for mRNAs, or 95.5% aligned with 1.4% mismatches for ESTs).

Multiple alignments are evaluated and each sequence is stringently kept only in its best position genome-wide. Fewer than 1% of the mRNAs and fewer than 2% of the ESTs will ultimately be aligned in more than one gene, usually among the ~1% of genes that are closely repeated.
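The best-position rule above can be sketched as follows. This is only an illustration, not the AceView implementation: the function name keep_best_positions, the (accession, locus, score) tuples and the numeric score itself are assumptions, since this page does not describe how alignments are scored. Exact ties are all retained, which is one way a sequence could end up aligned in more than one gene.

```python
from collections import defaultdict


def keep_best_positions(hits):
    """Keep, for each accession, only its top-scoring locus (all loci on ties).

    hits: iterable of (accession, locus, score) tuples. The score is a
    hypothetical alignment quality measure; higher is better.
    """
    by_accession = defaultdict(list)
    for accession, locus, score in hits:
        by_accession[accession].append((locus, score))

    kept = {}
    for accession, candidates in by_accession.items():
        top = max(score for _, score in candidates)
        # Retain every locus that reaches the top score, in input order.
        kept[accession] = [locus for locus, score in candidates if score == top]
    return kept


# Made-up example data: one unambiguous accession, one exact tie.
example_hits = [
    ("AK1", "chr1:100", 980),
    ("AK1", "chr7:500", 910),
    ("BC2", "chr2:50", 700),
    ("BC2", "chr2:9000", 700),
]
best = keep_best_positions(example_hits)
```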

The cDNA sequences are then processed and cleaned: the vectors and polyA are clipped, the reads presumably submitted on the wrong strand are flipped, and the small insertion or deletion polymorphisms are identified. Possible cDNA clone rearrangements or anomalous alignments are flagged and filtered, much as a manual curator would, so as not to lose unique valuable information while avoiding pollution of the database with poorly supported anomalous data.
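Two of the cleaning operations named above, polyA clipping and strand flipping, can be illustrated with a minimal sketch. The function names and the min_run threshold are assumptions made for this example; the page does not state how AceView detects a polyA tail or a wrong-strand submission.

```python
# Complement table for upper- and lower-case nucleotides.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def clip_polya(seq, min_run=8):
    """Remove a trailing run of A's if it is at least min_run long.

    min_run is a hypothetical threshold chosen for illustration only.
    """
    stripped = seq.rstrip("Aa")
    if len(seq) - len(stripped) >= min_run:
        return stripped
    return seq


def flip_strand(seq):
    """Return the reverse complement, as for a read submitted on the wrong strand."""
    return seq.translate(_COMPLEMENT)[::-1]
```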

Unfortunately, cDNA libraries are still far from saturation, because until the advent of high-throughput sequencing, cDNA sequences were difficult to obtain. Yet they are the cleanest and most reliable information to define the molecular genes. For this reason, a single good-quality cDNA sequence, aligned with standard introns on the genome, is considered sufficient evidence for a given spliced mRNA fragment. In contrast, un-spliced alignments could reflect genomic contamination of cDNA libraries, and non-coding single exon genes are reported only if they are supported by six or more accessions. The numerous single exon TARs (transcriptionally active regions) supported by five or fewer cDNAs belong to what is termed ‘the cloud’ (not displayed on the UCSC Genome Browser, but annotated in AceView and downloadable separately from the ftp site).
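The evidence thresholds in this paragraph amount to a small decision rule, sketched below. The Model class and is_reported function are hypothetical names. As a simplification, every unspliced model is treated under the six-accession rule that the text states for non-coding single exon genes; the text does not say how coding single exon genes are handled.

```python
from dataclasses import dataclass

# "six or more accessions" for non-coding single exon genes, per the text.
MIN_UNSPLICED_SUPPORT = 6


@dataclass
class Model:
    name: str
    spliced: bool      # aligned with at least one standard intron
    accessions: int    # number of supporting cDNA accessions


def is_reported(model):
    """Return True if the model meets the evidence threshold, False if it
    would fall into 'the cloud' (or has no support at all)."""
    if model.accessions < 1:
        return False
    if model.spliced:
        # A single good-quality spliced cDNA is sufficient evidence.
        return True
    return model.accessions >= MIN_UNSPLICED_SUPPORT
```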

The cDNA sequences are clustered into a minimal number of alternative transcript variants, preferring partial transcripts to artificially completed ones. Sequences are concatenated by simple contact, but the combinatorics are avoided by allowing each cDNA accession to contribute to a single alternative variant, preferably one where it merges silently without bringing any new sequence information. As a result, all shorter reads compatible with a full-length mRNA will be absorbed in that transcript and will not be used to extend other incompatible transcripts.
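The absorption behaviour described above, where shorter reads merge silently into a compatible longer transcript and each accession contributes to exactly one variant, can be sketched in a much reduced form. This is not the AceView clustering algorithm: it ignores exon ends and extension by contact, considers only spliced reads, and defines "compatible" as "the read's intron chain is a contiguous part of the variant's intron chain". All names are hypothetical.

```python
def is_contiguous_subchain(small, big):
    """True if the tuple small occurs as a contiguous slice of the tuple big."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def absorb_reads(reads):
    """Assign each spliced read to exactly one variant.

    reads: dict mapping accession -> tuple of introns, each intron being a
    (donor, acceptor) coordinate pair in genomic order.
    Returns a list of variants: {"chain": tuple, "accessions": list}.
    """
    variants = []
    # Longest intron chains first, so full-length mRNAs seed the variants.
    ordered = sorted(reads.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for accession, chain in ordered:
        if not chain:
            continue  # unspliced reads are outside this sketch
        for variant in variants:
            if is_contiguous_subchain(chain, variant["chain"]):
                # The read adds no new information: absorb it silently.
                variant["accessions"].append(accession)
                break
        else:
            variants.append({"chain": chain, "accessions": [accession]})
    return variants


# Made-up example: EST1 is absorbed by mRNA1, EST2 uses a different acceptor.
example_reads = {
    "mRNA1": ((200, 300), (400, 500), (600, 700)),
    "EST1": ((400, 500), (600, 700)),
    "EST2": ((200, 300), (450, 500)),
}
example_variants = absorb_reads(example_reads)
```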

About 70% of the variants, clearly identified on the Acembly site, have their entire protein coding region supported by a single cDNA; the others may be illicit concatenations that may be split and associated differently when more data become available. The main sequence of the transcript used in the annotation is that of the footprint of the transcript on the genome, which is of better quality than the mRNAs: this procedure corrects up to 2% of sequencing errors. Single-base insertions, deletions, transitions and transversions are shown graphically in the mRNA view, where frequent SNPs become evident.

Putative protein-coding regions are predicted from the mRNA sequence and annotated using BlastP, PFAM, Psort2, and comparison to AceView proteins from other species. The best proteins are scored (see the AceView Overview on the Acembly site) and transcripts are proposed to be protein-coding or non-coding.

Expression, cDNA support, tissue specificity, sequences of alternative transcripts, introns and exons, alternative promoters, alternative exons and alternative polyadenylation sites are evaluated and annotated in rich tables on the Acembly web site.

The reconstructed alternative transcripts are then grouped into genes if they share at least one exact intron boundary or if they have substantial sequence overlap (80% of the sequence of one included in the other). Coding and non-coding genes are defined, and genes in antisense are flagged.
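The grouping rule above can be sketched with a union-find over pairs of transcripts. This is an illustration under stated assumptions, not the AceView code: all transcripts are taken to lie on the same chromosome and strand, exons are half-open (start, end) intervals in genomic order, "exact intron boundary" is read as a shared donor or acceptor coordinate, and the 80% criterion is computed on exonic bases relative to the shorter transcript. All function names are hypothetical.

```python
from itertools import combinations


def intron_boundaries(exons):
    """Donor and acceptor coordinates implied by a sorted exon list."""
    bounds = set()
    for (_, exon_end), (next_start, _) in zip(exons, exons[1:]):
        bounds.add(("donor", exon_end))
        bounds.add(("acceptor", next_start))
    return bounds


def exonic_length(exons):
    return sum(end - start for start, end in exons)


def exonic_overlap(a, b):
    """Number of exonic bases shared by two transcripts."""
    return sum(max(0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)


def same_gene(a, b, min_fraction=0.8):
    if intron_boundaries(a) & intron_boundaries(b):
        return True
    shorter = min(exonic_length(a), exonic_length(b))
    return shorter > 0 and exonic_overlap(a, b) >= min_fraction * shorter


def group_into_genes(transcripts):
    """transcripts: dict name -> exon list. Returns a list of sets of names."""
    parent = {name: name for name in transcripts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(transcripts, 2):
        if same_gene(transcripts[a], transcripts[b]):
            parent[find(a)] = find(b)

    genes = {}
    for name in transcripts:
        genes.setdefault(find(name), set()).add(name)
    return list(genes.values())


# Made-up example: t1/t2 share an intron, t3/t4 overlap, t5 stands alone.
example_transcripts = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(150, 200), (300, 350)],
    "t3": [(1000, 1100)],
    "t4": [(1010, 1090)],
    "t5": [(5000, 5100)],
}
example_genes = group_into_genes(example_transcripts)
```

Because membership is transitive in a union-find, two transcripts that share nothing directly still end up in the same gene when a third transcript links them, which matches the way the text groups transcripts rather than comparing them only pairwise.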

AceView genes are matched by molecular contact to Entrez genes and named according to the Entrez Gene nomenclature. For novel genes not in Entrez, AceView creates new gene names that are maintained from release to release until the genes receive an official or Entrez gene name.

Knowledge about each gene is annotated when there is PubMed support. Selected functional annotations are gathered from other sources, including Entrez. Candidate tested disease associations are extracted directly from PubMed, as well as from OMIM and GAD. Finally, lists of the most closely related genes by function, pathway, protein complex, GO annotation, disease, cellular localization or all criteria taken together are proposed, to stimulate research and development.

Credits

Thanks to Danielle and Jean Thierry-Mieg at NCBI for providing this track for human, worm and mouse.

References

Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006;7 Suppl 1:S12.1-14.

AceView web site: http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly