mcast - a motif search tool

Usage:

mcast [options] <motifs> <database>

Description:

MCAST searches a sequence database for statistically significant clusters of non-overlapping "hits" to the motifs in a query.

A "hit" is a sequence position that is sufficiently similar to a motif in the query. To be a hit, the p-value of the motif alignment score must be less than the significance threshold, pthresh (see option --p-thresh, below). The motif is aligned to the sequence position without gaps. To compute the p-value of a motif alignment score, MCAST assumes that the sequences in the database were generated by a 0-order Markov process; see option --bgfile, below. With DNA sequences, MCAST searches for hits on both the sequences given in the database and their reverse complements.
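
As a rough illustration of the two ideas in the paragraph above, the Python sketch below shows what a 0-order Markov background implies (every letter is drawn independently from fixed frequencies, so a sequence's probability is the product of its letter frequencies) and how a reverse complement is formed. The function names and the uniform background are assumptions made for this example; this is not MCAST's own code, and it does not reproduce how MCAST derives p-values from alignment scores.

```python
import math

# Hypothetical uniform background; real frequencies come from --bgfile.
UNIFORM_DNA = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def background_probability(sequence, background):
    """Probability of a sequence under a 0-order Markov model.

    Each letter is independent of its neighbours, so the probability
    is simply the product of the per-letter background frequencies.
    """
    return math.prod(background[letter] for letter in sequence)


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA sequence."""
    complement = sequence.translate(str.maketrans("ACGT", "TGCA"))
    return complement[::-1]
```

For example, background_probability("ACGT", UNIFORM_DNA) multiplies 0.25 four times, and reverse_complement("AACG") yields "CGTT".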

A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match using the --max-gap option. (Two hits separated by more than the maximum allowed gap will be reported in separate matches.)
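
The maximum-gap rule can be sketched in Python as follows. This is only an illustration under stated assumptions: hits are given as inclusive (start, end) coordinates on one sequence, the gap is taken to be the number of positions between the end of one hit and the start of the next, and the function name group_hits is hypothetical. MCAST itself selects matches by maximizing the match score, so this simple left-to-right grouping shows the effect of --max-gap, not the actual search algorithm.

```python
def group_hits(hits, max_gap=50):
    """Group non-overlapping hits into clusters using a maximum gap.

    hits    -- iterable of (start, end) pairs, inclusive coordinates
    max_gap -- largest allowed distance between consecutive hits
               (50 is the documented default of --max-gap)
    """
    matches = []
    for start, end in sorted(hits):
        if matches:
            previous_end = matches[-1][-1][1]
            # Assumed gap definition: positions strictly between the hits.
            gap = start - previous_end - 1
            if gap <= max_gap:
                matches[-1].append((start, end))
                continue
        # First hit, or gap too large: start a new cluster.
        matches.append([(start, end)])
    return matches
```

With hits at (10, 19), (30, 39) and (200, 209) and the default max_gap, the first two hits fall into one cluster and the third forms its own.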

MCAST searches for all of the matches between the query and the sequences in the database. Each match is assigned an E-value, and matches whose E-value is below the E-value threshold are printed in order of increasing E-value (see option --e-thresh, below).

The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is

S = -log₂(p/pthresh),

where the significance threshold pthresh may be specified by the user. The total score of a match is the sum of the p-scores of the hits making up the match. MCAST finds the matches with the maximum match scores.
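
The scoring formula above can be written out as a short Python sketch. The function names are hypothetical, and the default of 5e-4 is the documented default of --p-thresh; the code only evaluates the formula and does not search for matches.

```python
import math


def p_score(p_value, p_thresh=5e-4):
    """Return S = -log2(p / pthresh) for a single hit.

    A hit must have 0 < p < pthresh, so the score is always positive.
    """
    if not 0.0 < p_value < p_thresh:
        raise ValueError("a hit requires 0 < p_value < p_thresh")
    return -math.log2(p_value / p_thresh)


def match_score(hit_p_values, p_thresh=5e-4):
    """Total score of a match: the sum of the p-scores of its hits."""
    return sum(p_score(p, p_thresh) for p in hit_p_values)
```

For instance, a hit whose p-value is one quarter of pthresh has a p-score of 2, and a hit at one half of pthresh has a p-score of 1, so a match made of those two hits scores 3.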

In order for E-values to be computed by MCAST, at least 100 matches must be found. If there are too few sequences in the database, or if certain other options are made too stringent (see Options, below), too few matches may exist for E-values to be computed. In this case, the results are sorted by match score, the E-value column is set to "NaN" and all matches are printed. This limitation can be overcome by specifying the --synth and --bgfile options. When those options are set, synthetic sequences will be generated from the provided background model, and used to estimate E-values.

A full description of the algorithm is found in:

Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics (Proceedings of the European Conference on Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.

Input:

<motifs> is a list of motifs, in MEME format.
<database> is a collection of sequences in FASTA format.

Output:

An HTML file containing a table listing the high-scoring matches.

Options:

--bgfile <bfile> - Read background frequencies from <bfile>. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.
--bgweight <weight> - Add <weight> times the background frequency to the corresponding letter counts in each motif when converting them to position-specific scoring matrices. The default value is 4.0.
--e-thresh <ev> - Only print results with E-values less than <ev>. The default threshold is 10.0.
--max-gap <max-gap> - The value of <max-gap> specifies the longest distance allowed between two hits in a match. Hits separated by more than <max-gap> will be placed in different matches. The default value is 50. Note: Large values of <max-gap> combined with large values of pthresh may prevent MCAST from computing E-values.
--o <dir name> - Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name> - Specifies the output directory. If the directory already exists, the contents will be overwritten.
--p-thresh <pv> - Only motif occurrences with p-values less than <pv> will be considered in computing the match score. The default value is 5e-4.
--synth - Create synthetic sequences for estimating E-values. This is useful with small input databases where not enough match scores are found to estimate E-values. The --bgfile option must also be set when using this option.
--text - Output is plain text rather than HTML.
--transfac - The input motif file is assumed to be in TRANSFAC format and is converted to MEME format before being used.
--verbosity 1|2|3|4 - Set the verbosity of status reports to standard error. The default level is 2.
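
Example (illustrative only): the pseudocount step described under --bgweight can be sketched in Python as below. The addition of <weight> times the background frequency to each letter count follows the option description; the final conversion to base-2 log-odds scores is an assumption made for this sketch, as are the function name and the data layout, and may differ from what MCAST does internally.

```python
import math


def counts_to_pssm(counts, background, bgweight=4.0):
    """Convert motif letter counts to a scoring matrix with pseudocounts.

    counts     -- list of dicts (one per motif column), letter -> count
    background -- dict, letter -> background frequency
    bgweight   -- pseudocount weight (4.0 is the documented default)
    """
    bg_total = sum(background.values())
    pssm = []
    for column in counts:
        # Each letter gains bgweight * background[letter] extra counts.
        total = sum(column.values()) + bgweight * bg_total
        scores = {}
        for letter, bg in background.items():
            freq = (column.get(letter, 0.0) + bgweight * bg) / total
            # Assumed scoring: log-odds of motif versus background.
            scores[letter] = math.log2(freq / bg)
        pssm.append(scores)
    return pssm
```

Because every letter receives a positive pseudocount, letters that never occur in a motif column still get a finite (negative) score instead of minus infinity.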