AMA - a motif search tool

Usage:

ama [options] <motif file> <sequence file> [<background file>]

Description:

The name AMA stands for "Average Motif Affinity". The program scores a set of DNA sequences given a DNA-binding motif, treating each position in the sequence as a possible binding event. The score is calculated by averaging the likelihood ratio scores of all feasible binding events on the given sequence and on its reverse complement strand. The binding strength at each potential site is defined as the likelihood ratio of the site under the motif versus under a zero-order background model provided by the user.
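The averaging described above can be illustrated with a minimal sketch. It is not AMA's implementation: the function names, the motif representation (a list of per-position probability dictionaries) and the plain unweighted mean over all windows on both strands are assumptions made for illustration, and ambiguous bases are not handled.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def site_odds(site, motif, background):
    """Likelihood ratio of one site: motif model vs. 0-order background."""
    odds = 1.0
    for column, base in zip(motif, site):
        odds *= column[base] / background[base]
    return odds


def average_odds(seq, motif, background, both_strands=True):
    """Average the site likelihood ratios over every motif-width window.

    Assumption: a plain mean over all windows of the forward strand and,
    unless both_strands is False, of the reverse complement as well.
    """
    width = len(motif)
    strands = [seq]
    if both_strands:
        strands.append(reverse_complement(seq))
    scores = [
        site_odds(strand[i:i + width], motif, background)
        for strand in strands
        for i in range(len(strand) - width + 1)
    ]
    # A sequence shorter than the motif has no feasible site.
    return sum(scores) / len(scores) if scores else 0.0
```

For a width-1 motif that always emits A and a uniform background, each A scores 4 and every other base scores 0, so the sequence AT averages to 2 over both strands.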

By default, AMA reports the average motif affinity score. It can also report p-values, which are estimated analytically using the given zero-order background model or using the GC-content of each sequence.

AMA can also compute the sequence-dependent likelihood ratio score used by Clover. The denominator of this score depends on the sequence being scored, and is the likelihood of the site under a Markov model derived from the sequence itself. Unlike Clover, AMA also allows higher-order sequence-derived Markov models (see --sdbg option below).
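As a rough illustration of a sequence-derived Markov model, the sketch below estimates order-n conditional base probabilities from a single sequence. The function name, the add-one pseudocount and the returned lookup function are hypothetical; how AMA or Clover actually estimate and smooth these probabilities is not specified here.

```python
from collections import Counter


def sequence_markov_model(seq, order, pseudo=1.0, alphabet="ACGT"):
    """Estimate P(base | previous `order` bases) from one sequence.

    Assumption: simple counting with an additive pseudocount, so that
    contexts never seen in the sequence still get a probability.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(order, len(seq)):
        context = seq[i - order:i]  # empty string when order == 0
        context_counts[context] += 1
        transition_counts[(context, seq[i])] += 1

    def prob(context, base):
        numerator = transition_counts[(context, base)] + pseudo
        denominator = context_counts[context] + pseudo * len(alphabet)
        return numerator / denominator

    return prob
```

With order 0 the context is the empty string and the model reduces to smoothed base frequencies of the sequence; the denominator of the likelihood ratio for a site would then be the product of these probabilities along the site.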

If the input file contains more than one motif, the motifs will be processed consecutively.

Full details are given in the supplement to the GOMO paper:

Fabian A. Buske, Mikael Bóden, Denis C. Bauer and Timothy L. Bailey, "Assigning roles to DNA regulatory motifs using comparative genomics", Bioinformatics, 26(7):860-866, 2010.

Input:

<motif file> is a file containing a list of motifs in MEME format.
<sequence file> is a collection of sequences in FASTA format.
[<background file>] is a 0-order Markov model in background model format, such as that produced by fasta-get-markov.

Output:

AMA writes to standard output, unless you specify one of --o or --oc, in which case the --o-format option (if given) is ignored and separate files containing each output format are written to the named directory. The available output formats are gff and CisML.

gff output has the format:

 <sequence_name> ama sequence 1 <sequence_length> <sequence_score> <sequence_p-value> . . .
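A line in this format can be read with a few lines of code. The sketch below is only an illustration: the function name and the returned keys are made up, and it assumes that fields are separated by whitespace, that sequence names contain no spaces, and that a non-numeric score or p-value should be reported as None.

```python
def parse_ama_gff_line(line):
    """Split one AMA gff line into its first seven fields."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected at least 7 fields, got %d" % len(fields))
    name, source, feature, start, end, score, pvalue = fields[:7]

    def to_float(text):
        # Assumption: placeholders such as "." mean "no value".
        try:
            return float(text)
        except ValueError:
            return None

    return {
        "sequence_name": name,
        "source": source,            # "ama" in the format shown above
        "feature": feature,          # "sequence" in the format shown above
        "start": int(start),
        "sequence_length": int(end),
        "score": to_float(score),
        "pvalue": to_float(pvalue),
    }
```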

Options:

--sdbg <n> - Use a sequence-dependent Markov model of order <n> when computing likelihood ratios. A different sequence-dependent Markov model is computed for each sequence in the input and used to compute the likelihood ratio of all sites in that sequence. This option overrides --pvalues, --gcbins, and --rma; <background file> is required unless --sdbg is specified.
--motif <id> - Use only the motif identified by <id>. This option may be repeated.
--motif-pseudo <float> - A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency (default=0.1).
--norc - Do not score the reverse complement DNA strand. Both strands are scored by default.
--scoring avg-odds|max-odds - Indicates whether the average or the maximum likelihood ratio (odds) score should be calculated (default avg-odds). If max-odds is chosen, no p-value will be printed.
--rma - Scale the motif affinity score by the maximum achievable score for each motif. This is termed the Relative Motif Affinity score. This allows for direct comparison between different motifs. By default, affinity scores are not scaled.
--pvalues - Print the p-value of the average odds score in the output file. The p-value for a score is normally computed (but see --gcbins) assuming the sequences were each generated by the 0-order Markov model specified by the background file frequencies. By default, no p-value will be printed. This option is ignored if max-odds scoring is used.
--gcbins <bins> - Compensate p-values for the GC content of each sequence independently. This is done by computing the score distributions for a range of GC values. Using 41 bins (recommended) computes distributions at intervals of 2.5% GC content. The computation assumes that the ratios of G to C and A to T are both equal to 1. This assumption will fail if a sequence contains far more of a letter than its complement. This option sets the --pvalues option. By default, uncompensated p-values are printed. This option is ignored if max-odds scoring is used.
--cs - Enables combining of sequences with the same identifier by taking the average score and the Sidak corrected p-value: 1−(1−α)^(1/n). Different sequences with the same identifier are used in GOMO databases if one gene in the reference species has more than one homologous gene in the related species (one-to-many relationship). By default, sequences are processed independently of each other.
--o-format gff|cisml - Output file format (default cisml).
--o <dir name> - Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name> - Specifies the output directory. If the directory already exists, the contents will be overwritten.
--verbosity 1|2|3|4 - Set the verbosity of status reports to standard error. The default level is 2.
--max-seq-length <max> - Set the maximum length allowed for input sequences. By default the maximum allowed length is 250000000.
--last <max> - Use only the scores of (up to) the last <max> sequence positions to compute the AMA score. If the sequence is shorter than this value, the entire sequence is scored. If the motif is longer than this value, it will not be scored.
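The arithmetic behind --motif-pseudo, --rma and --gcbins can be sketched as follows. This is one reading of the option descriptions above, not AMA's code. All function names are hypothetical. The sketch assumes that GC bins are spaced evenly from 0% to 100% inclusive (which gives 2.5% steps for 41 bins) and that the "maximum achievable score" is the largest possible likelihood ratio of a single site.

```python
def apply_pseudocount(counts, background, pseudo=0.1):
    """Turn one motif column of counts into frequencies.

    Each count receives pseudo * background[base], as described for
    --motif-pseudo, and the column is then normalized to sum to 1.
    """
    adjusted = {b: counts[b] + pseudo * background[b] for b in counts}
    total = sum(adjusted.values())
    return {b: value / total for b, value in adjusted.items()}


def max_site_odds(motif, background):
    """Largest likelihood ratio any single site can reach for this motif.

    Assumption: dividing a score by this value is what scaling "by the
    maximum achievable score" (--rma) refers to.
    """
    best = 1.0
    for column in motif:
        best *= max(column[b] / background[b] for b in column)
    return best


def gc_bin_backgrounds(bins=41):
    """Return (gc_fraction, background) pairs, one per GC bin.

    Uses the stated assumption that G:C and A:T ratios are both 1,
    so C = G = gc / 2 and A = T = (1 - gc) / 2.
    """
    result = []
    for i in range(bins):
        gc = i / (bins - 1)  # 0.0, 0.025, ..., 1.0 for 41 bins
        at_half = (1.0 - gc) / 2.0
        gc_half = gc / 2.0
        result.append((gc, {"A": at_half, "C": gc_half,
                            "G": gc_half, "T": at_half}))
    return result
```

Under these assumptions, a sequence would be matched to the bin whose GC fraction is closest to its own, and its p-value would be taken from the score distribution computed for that bin's background.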