Description
ECgene (gene prediction by EST clustering) predicts genes by combining
genome-based EST clustering and a transcript assembly procedure in a coherent
and consistent fashion. In particular, ECgene takes alternative splicing events
into consideration. The positions of splice sites (exon-intron boundaries) in
the genome map are the key information used throughout the procedure.
Sequences that share any splice sites in the genomic alignment are grouped
together to define an EST cluster. Transcript assembly, which is based on graph
theory, produces gene models together with their clone evidence; this is
essentially equivalent to sub-clustering the sequences according to splice
variants.
For more detailed information, see the ECgene website.
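The grouping of sequences by shared splice sites can be illustrated with a small union-find sketch. This is not the ECgene implementation; the function name, the input layout (a mapping from sequence ID to a set of splice-site tuples) and the `min_shared` parameter are assumptions made for illustration only. The default of one shared site follows the wording above, while `min_shared=2` corresponds to the "more than one splice site" criterion given in the Methods summary.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_splice_sites(alignments, min_shared=1):
    """Group sequences that share at least `min_shared` splice sites.

    alignments: dict mapping a sequence ID to a set of splice sites, each
    written here as a (chrom, strand, position) tuple (hypothetical layout).
    Returns a sorted list of clusters, each a sorted list of sequence IDs.
    """
    parent = {seq_id: seq_id for seq_id in alignments}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index sequences by splice site so only co-occurring pairs are compared.
    by_site = defaultdict(set)
    for seq_id, sites in alignments.items():
        for site in sites:
            by_site[site].add(seq_id)

    # Count how many splice sites each pair of sequences has in common.
    shared = defaultdict(int)
    for seq_ids in by_site.values():
        for a, b in combinations(sorted(seq_ids), 2):
            shared[(a, b)] += 1

    # Merge pairs that meet the threshold; merging is transitive.
    for (a, b), count in shared.items():
        if count >= min_shared:
            union(a, b)

    clusters = defaultdict(set)
    for seq_id in alignments:
        clusters[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in clusters.values())
```

Because merging is transitive, two ESTs with no splice site in common still end up in the same cluster when a third sequence, such as a full-length mRNA, links them.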
Methods
The following is a brief summary of the ECgene algorithm:
- Genomic alignment of mRNA and ESTs: Input sequences were aligned against the
  genome using the Blat program developed by Jim Kent. Blat alignments were
  corrected for valid splice sites, and the SIM4 program was used for suspicious
  alignments if necessary.
- Sequences that share more than one splice site were grouped together to define
  an EST cluster in a similar manner to the genome-based version of the UniGene
  algorithm.
- The exon-connectivity in each cluster was represented as a directed acyclic
  graph (DAG). Distinct paths along exons were obtained by the
  depth-first-search (DFS) method. They correspond to possible gene models
  encompassing all alternative splicing events.
- EST sequences in each cluster were sub-clustered further according to the
  compatibility of each splice variant with gene structure, and they can be
  regarded as clone evidence for the corresponding isoform. Gene models without
  sufficient evidence were discarded at this stage. The presence of polyA tails,
  detected from careful analysis of genomic alignment of mRNA and EST sequences,
  was specifically used to determine the gene boundary.
- Finally, unspliced sequences were added without altering the exon-intron
  boundaries of existing gene models.
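The path enumeration step in the summary above can be sketched as a depth-first search over an exon graph. This is a minimal illustration and not the ECgene code: the adjacency-dict representation, the exon labels and the function name are all hypothetical, and the evidence filtering and polyA handling described in the summary are left out.

```python
def enumerate_exon_paths(edges):
    """List every path from a first exon to a last exon in an exon DAG.

    edges: dict mapping an exon label to the list of exons that directly
    follow it in at least one aligned sequence (hypothetical layout).
    The graph is assumed to be acyclic, as stated in the summary.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    # First exons are nodes with no incoming edge.
    has_incoming = {t for targets in edges.values() for t in targets}
    sources = sorted(nodes - has_incoming)

    paths = []

    def dfs(node, path):
        path.append(node)
        successors = edges.get(node, [])
        if not successors:
            # A node with no outgoing edge ends a candidate gene model.
            paths.append(list(path))
        for nxt in successors:
            dfs(nxt, path)
        path.pop()  # backtrack before exploring the next branch

    for source in sources:
        dfs(source, [])
    return paths


# Exon E2 is skipped in one splice variant.
example = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": ["E4"]}
print(enumerate_exon_paths(example))
# [['E1', 'E2', 'E3', 'E4'], ['E1', 'E3', 'E4']]
```

The number of paths can grow quickly with the number of alternative exons, which is one reason a filter such as the evidence requirement described above is useful in practice.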
Coding potential of gene models:
Peptide sequences are only available for those gene models judged to have good
coding potential. ORF and CDS are determined based on the number of exons, the
ORF length, presence of the start codon (Met), and the CDS length. ORFs
(defined as the region between two adjacent stop codons) were classified into
four groups:
- spliced ORFs with Met
- spliced ORFs without Met
- single-exon ORFs with Met
- single-exon ORFs without Met
Initially, the first group was searched for the ORF with the longest CDS.
Coding sequences were accepted if they were longer than 30 amino acids (93 bp)
or if they were identical to a SwissProt protein, excluding fragmented
entries. If no such ORF could be identified in the first group, the other
groups were examined sequentially for the presence of an ORF using the same
criteria. Genes lacking an apparent ORF were defined as non-coding RNA genes.
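The group-by-group search can be sketched as follows. This is a simplified illustration under several assumptions that do not come from the source: only the three forward frames of a transcript sequence are scanned, the ends of the sequence are treated as ORF boundaries, the CDS is taken to run from the first ATG in an ORF (or from the ORF start when there is none), a CDS counts as "spliced" when an exon junction (given in transcript coordinates) falls inside it, and the SwissProt identity check is omitted. All function and parameter names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def _describe(codons, junctions):
    """Summarize one ORF given as a list of (position, codon) pairs."""
    met_index = next((k for k, (_, c) in enumerate(codons) if c == "ATG"), None)
    cds = codons if met_index is None else codons[met_index:]
    start, end = cds[0][0], cds[-1][0] + 3
    spliced = any(start < j < end for j in junctions)
    # Groups: 0 spliced with Met, 1 spliced without Met,
    #         2 single-exon with Met, 3 single-exon without Met.
    group = (0 if spliced else 2) + (0 if met_index is not None else 1)
    return {"group": group, "start": start, "end": end, "aa_length": len(cds)}


def find_orfs(seq, junctions):
    """Collect ORFs (regions between adjacent in-frame stops) in 3 frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        codons = [(i, seq[i:i + 3]) for i in range(frame, len(seq) - 2, 3)]
        current = []
        # The (None, None) sentinel closes the last ORF of the frame.
        for i, codon in codons + [(None, None)]:
            if codon is None or codon in STOP_CODONS:
                if current:
                    orfs.append(_describe(current, junctions))
                current = []
            else:
                current.append((i, codon))
    return orfs


def select_cds(seq, junctions, min_aa=30):
    """Return the longest CDS of the first group that passes the length cut."""
    orfs = find_orfs(seq, junctions)
    for group in range(4):
        candidates = [o for o in orfs if o["group"] == group]
        if not candidates:
            continue
        best = max(candidates, key=lambda o: o["aa_length"])
        if best["aa_length"] > min_aa:
            return best
    return None  # no acceptable ORF: treated as non-coding in this sketch
```

For example, `select_cds("ATG" + "GCT" * 40 + "TAA", junctions=[60])` returns a group 0 entry covering positions 0 to 123, while the same sequence with `junctions=[]` falls through to group 2.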
Credits
This algorithm and the predictions for this track were developed by Professor
Sanghyuk Lee's Lab of Bioinformatics at Ewha Womans University, Seoul, Korea.