Description

ECgene (gene prediction by EST clustering) predicts genes by combining genome-based EST clustering and a transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The position of splice sites, i.e. exon-intron boundaries, in the genome map is used as the critical information throughout the procedure. Sequences that share any splice sites in the genomic alignment are grouped together to define an EST cluster. Transcript assembly, which is based on graph theory, produces gene models together with their clone evidence; this step is essentially equivalent to sub-clustering the sequences according to splice variants.

For more detailed information, see the ECgene website.

Methods

The following is a brief summary of the ECgene algorithm:

Genomic alignment of mRNA and ESTs: Input sequences were aligned against the genome using the Blat program developed by Jim Kent. Blat alignments were corrected for valid splice sites, and the SIM4 program was used for suspicious alignments if necessary.
Sequences that share more than one splice site were grouped together to define an EST cluster, in a manner similar to the genome-based version of the UniGene algorithm.
The exon-connectivity in each cluster was represented as a directed acyclic graph (DAG). Distinct paths along exons were obtained by the depth-first-search (DFS) method. They correspond to possible gene models encompassing all alternative splicing events.
EST sequences in each cluster were sub-clustered further according to the compatibility of each splice variant with gene structure, and they can be regarded as clone evidence for the corresponding isoform. Gene models without sufficient evidence were discarded at this stage. The presence of polyA tails, detected from careful analysis of genomic alignment of mRNA and EST sequences, was specifically used to determine the gene boundary.
Finally, unspliced sequences were added without altering the exon-intron boundaries of existing gene models.
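
The clustering and path-enumeration steps above can be illustrated with a minimal Python sketch. This is not the ECgene implementation: the function names, the input layout (a dictionary mapping a sequence name to its set of splice-site positions, and a dictionary mapping an exon to its downstream exons), and the single-linkage grouping are all assumptions made for illustration. The threshold of two shared sites reflects the "more than one splice site" wording and is exposed as a parameter.

```python
from collections import defaultdict


def cluster_by_splice_sites(alignments, min_shared=2):
    """Group sequences that share at least `min_shared` splice sites.

    `alignments` is a hypothetical input: {sequence name: set of splice-site
    positions}. Real positions would also carry chromosome and strand.
    Grouping is single-linkage, implemented with a small union-find.
    """
    parent = {name: name for name in alignments}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(alignments)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if len(alignments[a] & alignments[b]) >= min_shared:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for name in names:
        clusters[find(name)].add(name)
    return sorted(sorted(members) for members in clusters.values())


def enumerate_exon_paths(edges):
    """List every source-to-sink path in an exon-connectivity DAG by DFS.

    `edges` is a hypothetical input: {exon: [exons directly downstream]}.
    The graph is assumed to be acyclic. Each returned path is one candidate
    gene model.
    """
    targets = {v for successors in edges.values() for v in successors}
    sources = sorted(set(edges) - targets)  # exons with no incoming edge
    paths = []

    def dfs(exon, path):
        successors = edges.get(exon, [])
        if not successors:  # sink exon: the path is complete
            paths.append(path)
            return
        for nxt in successors:
            dfs(nxt, path + [nxt])

    for source in sources:
        dfs(source, [source])
    return paths


if __name__ == "__main__":
    ests = {
        "est1": {100, 200, 300},
        "est2": {200, 300},
        "est3": {300, 400, 500},
        "est4": {400, 500},
        "est5": {900},
    }
    print(cluster_by_splice_sites(ests))
    print(enumerate_exon_paths({"E1": ["E2", "E3"], "E2": ["E4"], "E3": ["E4"]}))
```

In this toy input, est1 and est3 share only one site and therefore stay in separate clusters. Note that the number of paths through a DAG can grow exponentially with the number of alternative exons, which is one reason gene models without sufficient evidence need to be discarded afterwards.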

Coding potential of gene models: Peptide sequences are only available for those gene models judged to have good coding potential. ORF and CDS are determined based on the number of exons, the ORF length, presence of the start codon (Met), and the CDS length. ORFs (defined as the region between two adjacent stop codons) were classified into four groups:

spliced ORFs with Met
spliced ORFs without Met
single-exon ORFs with Met
single-exon ORFs without Met

Initially, the first group was searched for the ORF with the longest CDS. Coding sequences were accepted if they were longer than 30 amino acids (93 bp) or if they were identical to one of the SwissProt proteins, excluding fragmented entries. If such an ORF could not be identified in the first group, the other groups were examined sequentially for the presence of an ORF using the same criteria. Genes lacking an apparent ORF were defined as non-coding RNA genes.
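
The group-by-group search described above can be sketched as follows. This is an illustrative reading of the text, not the ECgene code: the `Orf` record, the `pick_cds` function, and the `known_proteins` set (standing in for non-fragment SwissProt sequences) are hypothetical names. Two points are assumptions: a "spliced" ORF is taken to mean one spanning more than one exon, and within a group the longest ORF that passes the acceptance test is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    peptide: str   # translated CDS, one letter per amino acid
    spliced: bool  # assumed meaning: the ORF spans more than one exon
    has_met: bool  # the CDS begins with a Met start codon


MIN_AA = 30  # CDS must be longer than this many amino acids

# Search order from the text: (spliced, has_met)
GROUP_ORDER = [(True, True), (True, False), (False, True), (False, False)]


def pick_cds(orfs, known_proteins=frozenset()):
    """Return the chosen ORF, or None for a putative non-coding gene model.

    `known_proteins` is a hypothetical stand-in for the set of SwissProt
    protein sequences with fragmented entries already removed.
    """
    for spliced, has_met in GROUP_ORDER:
        acceptable = [
            orf for orf in orfs
            if orf.spliced == spliced
            and orf.has_met == has_met
            and (len(orf.peptide) > MIN_AA or orf.peptide in known_proteins)
        ]
        if acceptable:
            # Longest CDS wins within the first group that has a valid ORF.
            return max(acceptable, key=lambda orf: len(orf.peptide))
    return None


if __name__ == "__main__":
    candidates = [
        Orf("M" + "A" * 10, spliced=True, has_met=True),   # too short
        Orf("A" * 40, spliced=True, has_met=False),        # accepted
        Orf("M" + "A" * 50, spliced=False, has_met=True),  # lower priority
    ]
    print(pick_cds(candidates))
```

Because the groups are visited in priority order, a 40-residue spliced ORF without Met is preferred here over a longer single-exon ORF with Met; the longer ORF would only be considered if no spliced ORF passed the test.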

Credits

This algorithm and the predictions for this track were developed by Professor Sanghyuk Lee's Lab of Bioinformatics at Ewha Womans University, Seoul, Korea.