motiph

motiph

Usage:
motiph [options] <alignment file> <tree file> <motif file>

Description:
Scan a DNA multiple sequence alignment for occurrences of given motifs, taking into account the phylogenetic tree relating the sequences. Note: any sequences containing nothing but gaps are removed from the alignment.

The algorithm is as follows:
      A = a multiple alignment
      T = a phylogenetic tree
      M = a motif frequency matrix
      B = the background frequencies of the four bases in A
      U = a background evolutionary model with equilibrium frequencies B

      // Build evolutionary models.
      for j in positions of M {
        E[j] = an evolutionary model with equilibrium frequencies 
          from the jth position of M
      }

      // Build matrix of all possible scores
      for i in all possible alignment columns {
        for j in positions of M {
          foreground = site_likelihood(E[j], A[:][i], T)
          background = site_likelihood(U, A[:][i] T)
          score_matrix[i][j] = log_2(foreground / background)
        }    
      }
      // Calculate p-values of all possible scores for motif sized windows
      // windows in the alignment.
      pvalues = get_pv_lookup(score_matrix, B)

      // Calculate score for each motif sized window in the alignment.
      for i in columns of A {
        score = 0
        for j in positions of M {
          index = calculate the index of A[:][i + j] in the array
            of all possible alignment columns
          score = score + score_matrix[index][j]
        }    
        print pvalues[score]
      }
      
The core of the algorithm is a routine (site_likelihood) for scoring a particular column of the multiple alignment using a given evolutionary model and a given phylogenetic tree. The alignment site provides the observed nucleotides at the base of the tree, and we sum over the unobserved nucleotides in the rest of the tree, conditioning on the equilibrium distribution from the evolutionary model at the root of the tree (Felsenstein Pruning Algorithm). The tree must be a maximum likelihood tree, of the kind generated by DNAml from Phylip or by FastDNAml. Branch lengths in the tree are converted to conditional probabilities using the specified evolutionary model.
Input:

<alignment file> is a DNA multiple alignment in ClustalW format. Alternatively, if the --list option is used, this file may contain a list of alignment files. In this case each of the alignment files will be scanned for each of the motifs.

<tree file> is a phylogenetic tree in Phylip Newick format. This tree may contain additional species not represented in the alignment.

<motif file> is a list of motifs, in MEME format.

Output:

The output is in GFF format with the forward and reverse scores listed as separate features. The scores are the p-values for the observed alignment columns in the motif sized windows.

Options:

--bg <float> - The mutation rate for sites in the background model. The default value is 1.

--bgfile <bfile> - Read background frequencies from <bfile>. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.

--column-freqs [simulated|empirical] - The way to compute the frequencies of all possible columns.
simulated - Use the evolutionary model to compute the frequency of each possible column of letters.
empirical - Count the numbers of each column in the input multiple alignments. All alignments using the same (sub)set of species are counted together. Frequencies are computed by dividing each by the total counts for that (sub)set of species.
The default is simulated. These frequencies are used for computing p-values for scores. If simulated is used, the accuracy of the p-values depends strongly on the accuracy of the evolutionary model.

--fg <float> - The mutation rate for sites in the foreground model(s). The default value is 1.

--gap <method> - Specifies the gap handling strategy. Allowed values for method are:
skip Skip those sites where any position in the alignment window contains a gap. This is the default gap handling strategy.
fixed Sites containing gaps are assigned a fixed score, specified by --gap-cost.
wildcard The gap character matches any base, and the score is the product of the corresponding probabilities.
minimum The gap character is assigned the score corresponding to the least likely letter at the given position.
model Use model-specific gap handling. Currently, the only model that supports this is f81_gap.

--gap-cost <float> - Specifies the costs for gaps when using the fixed gap handling strategy. Default is 0.0.

--hb - Use the Halpern-Bruno modification to the evolutionary model.

--list - Treat the second required input as a list of alignments, rather than a single alignment.

--max-stored-scores <max> - Set the maximum number of scores that will be stored. Precise calculation of q-values depends on having a complete list of scores. However, keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped, and approximate q-values will be calculated. By default the maximum number of stored matches is 100,000.

--model [single|average|jc|k2|f81|f84|hky|tn] - The evolutionary model to use. The available models are:
single - score first sequence: compute standard log-odds score of first sequence in the alignment; ignores tree but does NOT remove gaps.
average - compute average of standard log-odds score of aligned sites.
jc - Jukes-Cantor: equilibrium base frequencies are all 1/4; the only free parameter is the mutation rate.
k2 - Kimura 2-parameter: equilibrium base frequencies are all 1/4; the free parameters are the mutation rate and the transition/transversion rate ratio.
f81 - Felsenstein 1981: equilibrium base frequencies are taken from the alignment; the only free parameter is the mutation rate.
f84 - Felsenstein 1984: equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate and the transition/transversion rate ratio. The ratio of purine-purine to pyrimidine->pyrimidine transitions is assumed to be 1.
hky - Hasegawa-Kishino-Yano: equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate and the transition/transversion rate ratio. The ratio of purine-purine to pyrimidine-pyrimidine transitions is assumed to be equal to the ratio of purines to pyrimidines.
tn - Tamura-Nei: equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate, the transition/transversion rate ratio, and the ratio of purine-purine transitions to pyrimidine-pyrimidine transitions.
The default model is f81. A description of the f81 model is available in chapter 13 of Statistical Methods in Bioinformatics by Ewens and Grant. The other models are described in chapters 9 and 13 of Inferring Phylogenies by Felsenstein.

--motif <id> - Use only the motif identified by <id>. This option may be repeated.

--no-qvalue - Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995). By default, q-values are computed.

--norc - Do not score the reverse complement DNA strand. Both strands are scored by default.

--o <dir name> - Specifies the output directory. If the directory already exists, the contents will not be overwritten.

--oc <dir name> - Specifies the output directory. If the directory already exists, the contents will be overwritten.

--output-pthresh <float> - The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. Using the --output-pthresh option will set the q-value threshold to 1.0. The default p-value threshold is 1e-4.

--output-qthresh <float> - The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. Using the --output-qthresh option will set the p-value threshold to 1.0. The default q-value threshold is 1.0.

--pseudocounts <float> - A pseudocount to be added to each count in the motif matrix, weighted by the background frequencies for the nucleotides (Dirichlet prior), before converting the motif to probabilities. The default value is 0.1.

--pur-pyr <float> - The ratio of the purine transition rate to pyrimidine transition rate. This parameter is used by the Tamura-nei model. The default value is 1.0.

--seed <n> - Seed for random number generator.

--text Limits output to plain text sent to standard out. For FIMO, the text output is unsorted, and q-values are not reported. This mode allows the program to search an arbitrarily large database, because results are not stored in memory.

--transition-transversion <float> - The ratio of the transition rate to the transversion rate. This parameter is used by the Kimura 2-parameter, F84, HKY, and Tamura-nei models. The default value is 0.5.

--verbosity 1|2|3|4 - Set the verbosity of status reports to standard error. The default level is 2.

Warning messages: None Bugs:
F84 and Tamura-Nei models fail trying to take the log of a negative number.

Allow local realignment, a la Monkey.

The --motif option should allow multiple motifs to be selected from the motif file.

Print motif consensus as part of feature properties.
Allow reading of a column frequency file; this requires writing a stand-alone to create such a file and establishing a format such as:
         # species_list_1
         AAA f_AAA
         ...
         TTT f_TTT
         # species_list_2
         AAAAA f_AAAAA
         ...
         TTTTT f_TTTTT
         ...