Usage:

mcast [options] <motifs> <database>

Description:

MCAST searches a sequence database for statistically significant clusters of non-overlapping "hits" to the motifs in a query.

A "hit" is a sequence position that is sufficiently similar to a motif in the query. To be a hit, the p-value of the motif alignment score must be less than the significance threshold, pthresh (see option --p-thresh, below). The motif is aligned to the sequence position without gaps. To compute the p-value of a motif alignment score, MCAST assumes that the sequences in the database were generated by a 0-order Markov process; see option --bgfile, below. With DNA sequences, MCAST searches for hits on both strands: each sequence as given in the database and its reverse complement.
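
As an illustration of the two-strand search described above, the following minimal Python sketch produces both orientations of a DNA sequence. The function names are hypothetical, and the sketch assumes unambiguous bases only (A, C, G, T); it is not taken from MCAST itself.

```python
# Complement table for unambiguous DNA bases (an assumption of this sketch;
# ambiguity codes such as N are left unchanged by str.translate).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strands(seq: str):
    """Yield (strand, sequence) pairs for the two orientations to scan."""
    yield "+", seq
    yield "-", reverse_complement(seq)
```

A scanner built on this would look for hits in each yielded sequence and map positions found on the "-" strand back to the original coordinates.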

A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match using the --max-gap option. (Two hits separated by more than the maximum allowed gap will be reported in separate matches.)
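
The gap rule can be sketched in Python as follows. This only illustrates how a maximum gap separates hits into different clusters; it is not MCAST's search procedure, which looks for the highest-scoring matches. The hit representation (inclusive start and end coordinates) and the definition of the gap as the number of positions strictly between two hits are assumptions of the sketch.

```python
from typing import List, Tuple

# Hypothetical hit representation: (start, end), inclusive coordinates.
Hit = Tuple[int, int]


def group_hits(hits: List[Hit], max_gap: int) -> List[List[Hit]]:
    """Group non-overlapping hits so that consecutive hits in a group
    are separated by at most max_gap positions."""
    clusters: List[List[Hit]] = []
    for hit in sorted(hits):
        if clusters:
            previous_end = clusters[-1][-1][1]
            gap = hit[0] - previous_end - 1  # positions between the two hits
            if gap <= max_gap:
                clusters[-1].append(hit)
                continue
        # First hit, or gap too large: start a new cluster.
        clusters.append([hit])
    return clusters


example = group_hits([(1, 10), (15, 24), (100, 109)], max_gap=50)
```

In this example the first two hits are 4 positions apart and share a cluster, while the third hit lies 75 positions beyond the second and therefore starts a separate cluster.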

MCAST searches for all of the matches between the query and the sequences in the database. Each match is assigned an E-value, and matches whose E-values fall below the E-value threshold are printed in order of increasing E-value (see option --e-thresh, below).

The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is

S = -log2(p/pthresh),

where the significance threshold pthresh may be specified by the user. The total score of a match is the sum of the p-scores of the hits making up the match. MCAST finds the matches with the maximum match scores.
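
The p-score formula and the match score can be written out as a short Python sketch. The function names are hypothetical, and the input check reflects the requirement stated above that a hit's p-value be less than pthresh.

```python
import math
from typing import Iterable


def p_score(p: float, pthresh: float) -> float:
    """Return S = -log2(p / pthresh) for a hit with p-value p."""
    if not 0.0 < p < pthresh:
        raise ValueError("a hit must have 0 < p < pthresh")
    return -math.log2(p / pthresh)


def match_score(p_values: Iterable[float], pthresh: float) -> float:
    """Return the total score of a match: the sum of its hits' p-scores."""
    return sum(p_score(p, pthresh) for p in p_values)
```

For instance, with pthresh = 0.0005, a hit with p = 0.000125 is four times more significant than the threshold and receives a p-score of about 2, and a hit with p = 0.00025 receives a p-score of about 1, so a match made of these two hits scores about 3. Because p is always below pthresh for a hit, every p-score is positive.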

MCAST can compute E-values only if at least 100 matches are found. If there are too few sequences in the database, or if certain other options are made too stringent (see Options, below), too few matches may exist for E-values to be computed. In this case, all matches are printed, the results are sorted by match score, and the E-value column is set to "NaN". This limitation can be overcome by specifying the --synth and --bgfile options. When those options are set, synthetic sequences are generated from the provided background model and used to estimate E-values.
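
The reporting rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not MCAST's implementation: estimate_evalues is a hypothetical callable standing in for the E-value estimation, and the fallback ordering is assumed to be by decreasing match score.

```python
import math
from typing import Callable, List, Sequence, Tuple

MIN_MATCHES = 100  # minimum number of matches needed to compute E-values


def report(
    scores: Sequence[float],
    e_thresh: float,
    estimate_evalues: Callable[[Sequence[float]], Sequence[float]],
) -> List[Tuple[float, float]]:
    """Return (score, E-value) pairs in the order they would be printed."""
    if len(scores) < MIN_MATCHES:
        # Too few matches: print everything, best score first, E-value NaN.
        return [(s, math.nan) for s in sorted(scores, reverse=True)]
    pairs = zip(scores, estimate_evalues(scores))
    kept = [(s, e) for s, e in pairs if e < e_thresh]
    # Enough matches: print those under the threshold by increasing E-value.
    return sorted(kept, key=lambda pair: pair[1])
```

Passing the estimator in as an argument keeps the sketch independent of how E-values are actually derived, which is covered in the paper cited below.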

A full description of the algorithm is found in:

Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics (Proceedings of the European Conference on Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.

Input:

Output:

Options: