Usage:
mcast [options] <motifs> <database>
Description:
MCAST searches a sequence database for
statistically significant clusters of non-overlapping "hits" to the
motifs in a query.
A "hit" is a sequence position that is sufficiently similar to a
motif in the query. To be a hit, the p-value of the motif
alignment score must be less than the significance threshold,
pthresh (see option --p-thresh, below). The alignment of the motif
and the sequence position is done without gaps. To compute
the p-value of a motif alignment score, MCAST assumes that the
sequences in the database were generated by a 0-order Markov
process; see option --bgfile, below. With DNA sequences, MCAST
searches for hits on both the sequences given in the database and
their reverse complements.
A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match using the --max-gap option. (Two hits separated by more than the maximum allowed gap will be reported in separate matches.)
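The maximum-gap rule can be illustrated with a short Python sketch. It shows only how hits might be split into separate matches by distance and is not MCAST's actual search procedure. The function name group_hits, the (start, end) coordinate tuples, and the definition of distance as the number of positions between the end of one hit and the start of the next are all assumptions made for illustration.

```python
def group_hits(hits, max_gap=50):
    """Group non-overlapping hits into matches using a maximum-gap rule.

    hits: list of (start, end) inclusive coordinates on one sequence.
    Assumption: the distance between two hits is the number of
    positions lying strictly between them.
    """
    matches = []
    for start, end in sorted(hits):
        if matches:
            prev_end = matches[-1][-1][1]
            gap = start - prev_end - 1
            if gap <= max_gap:
                # Close enough: extend the current match.
                matches[-1].append((start, end))
                continue
        # First hit, or too far from the previous hit: start a new match.
        matches.append([(start, end)])
    return matches


example = group_hits([(1, 10), (30, 40), (200, 210)], max_gap=50)
```

In this example the first two hits are 19 positions apart and fall into one match, while the third hit is 159 positions from the second and starts a new match.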
MCAST searches for all of the matches between the
query and the sequences in the database. Each match is assigned an
E-value, and matches with E-values below the E-value
threshold are printed in order of increasing E-value (see
option --e-thresh, below).
The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is
S = -log2(p/pthresh),
where the significance threshold pthresh may be specified by
the user. The total score of a match is the sum of the p-scores of
the hits making up the match. MCAST finds the matches
with the maximum match scores.
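As a worked illustration of the p-score formula, the following Python sketch computes S for individual hits and sums the p-scores into a match score. The function names p_score and match_score are hypothetical, and the default pthresh of 5e-4 is taken from the --p-thresh option described under Options.

```python
import math


def p_score(p, pthresh=5e-4):
    """Return S = -log2(p / pthresh) for a hit with p-value p."""
    if not 0 < p < pthresh:
        # A position only counts as a hit if its p-value is below pthresh.
        raise ValueError("a hit must satisfy 0 < p < pthresh")
    return -math.log2(p / pthresh)


def match_score(hit_pvalues, pthresh=5e-4):
    """Total score of a match: the sum of the p-scores of its hits."""
    return sum(p_score(p, pthresh) for p in hit_pvalues)


# Two hits with p-values of pthresh/4 and pthresh/2.
total = match_score([1.25e-4, 2.5e-4])
```

With pthresh = 5e-4, a hit with p = 1.25e-4 has p/pthresh = 1/4 and so scores 2 bits, and a hit with p = 2.5e-4 scores 1 bit, for a match score of about 3. Because every hit has p below pthresh, every p-score is positive.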
In order for MCAST to compute E-values, at least 100 matches
must be found. If there
are too few sequences in the database, or if certain other options
are made too stringent (see Options, below), too few matches may
exist for E-values to be computed. In this case, the results
are sorted by match score, the E-value column is set to
"NaN", and all matches are printed.
This limitation can be overcome by specifying the --synth and --bgfile options.
When those options are set, synthetic sequences will
be generated from the provided background model, and used to estimate
E-values.
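The reporting behavior described above can be summarized in a small Python sketch. It is a reading of the text, not MCAST's code: the function name select_matches and the dictionary keys "score" and "evalue" are hypothetical, and sorting by match score in descending order is an assumption, since the text does not state the direction.

```python
import math

MIN_MATCHES_FOR_EVALUES = 100  # minimum number of matches stated in the text


def select_matches(matches, e_thresh=10.0):
    """Choose and order the matches to print.

    matches: list of dicts with a "score" key and, when E-values
    could be computed, an "evalue" key (hypothetical layout).
    """
    if len(matches) < MIN_MATCHES_FOR_EVALUES:
        # Too few matches: print everything, sorted by score
        # (descending order assumed), with the E-value set to NaN.
        ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
        return [{**m, "evalue": math.nan} for m in ranked]
    # Otherwise keep matches below the threshold, by increasing E-value.
    kept = [m for m in matches if m["evalue"] < e_thresh]
    return sorted(kept, key=lambda m: m["evalue"])
```

The default e_thresh of 10.0 mirrors the --e-thresh default listed under Options.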
A full description of the algorithm is found in:
Input:
- <motifs> is a list of motifs, in MEME format.
- <database> is a collection of sequences in FASTA format.
Output:
- An HTML file containing a table listing the high-scoring matches.
Options:
--bgfile <bfile>
- Read background frequencies from <bfile>. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.
--bgweight <weight>
- Add <weight> times the background frequency to the corresponding letter counts in each motif when converting them to position-specific scoring matrices. The default value is 4.0.
--e-thresh <ev>
- Only print results with E-values less than <ev>. The default threshold is 10.0.
--max-gap <max-gap>
- The value of <max-gap> specifies the longest distance allowed between two hits in a match. Hits separated by more than <max-gap> will be placed in different matches. The default value is 50. Note: Large values of <max-gap> combined with large values of pthresh may prevent MCAST from computing E-values.
--o <dir name>
- Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name>
- Specifies the output directory. If the directory already exists, the contents will be overwritten.
--p-thresh <pv>
- Only motif occurrences with p-values less than <pv> will be considered in computing the match score. The default value is 5e-4.
--synth
- Create synthetic sequences for estimating E-values. This is useful with small input databases where not enough match scores are found to estimate E-values. The --bgfile option must also be set when using this option.
--text
- Output is plain text rather than HTML.
--transfac
- The input motif file is assumed to be in TRANSFAC format and is converted to MEME format before being used.
--verbosity 1|2|3|4
- Set the verbosity of status reports to standard error. The default level is 2.