mhmmscan -- not currently supported

Usage: mhmmscan [options] <HMM file> <FASTA file>

Description:

mhmmscan searches a sequence database using a Meta-MEME motif-based hidden Markov model (HMM) of the kind produced by mhmm. This program is similar to mhmms, except that mhmmscan

Each sequence-vs-model match is assigned an E-value, and matches with E-values below a user-specified threshold are printed in order of increasing E-value.

mhmmscan has two modes of computing match scores:

In p-value score mode, the search can be thought of as consisting of three steps:

  1. find motif matches ("hits") in each sequence with p-values less than the user-specified p-value threshold,
  2. coalesce hits found in each sequence into "matches", where hits separated by more than maxgap positions are always separated into distinct matches, and
  3. report matches with E-values less than threshold.

The three parameters p-value threshold, maxgap and threshold are described in more detail under "Options:", below.
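
The first two steps of p-value score mode can be sketched in a few lines of Python. This is only an illustration, not the mhmmscan implementation: the function name, the (start, end, p-value) hit tuples, and the definition of the gap as the number of positions between the end of one hit and the start of the next are all assumptions. The sketch also merges every pair of hits that lies within maxgap, whereas the text above only guarantees that hits farther apart than maxgap are split.

```python
def coalesce_hits(hits, p_thresh, maxgap):
    """Group hits from one sequence into matches (hypothetical helper).

    hits: iterable of (start, end, pvalue) tuples with inclusive positions.
    Returns a list of matches, each a list of hits in sequence order.
    """
    # Step 1: keep only hits whose p-value is below the threshold.
    kept = sorted(h for h in hits if h[2] < p_thresh)

    # Step 2: start a new match whenever the gap exceeds maxgap.
    matches = []
    last_end = None
    for start, end, pvalue in kept:
        if matches and start - last_end - 1 <= maxgap:
            matches[-1].append((start, end, pvalue))
            last_end = max(last_end, end)
        else:
            matches.append([(start, end, pvalue)])
            last_end = end
    return matches
```

For example, with a p-value threshold of 1e-3 and maxgap of 50, hits at positions 1-10 and 30-39 fall into one match, while a hit at 200-209 starts a second match.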

In log-odds score mode, the search can be thought of as consisting of two steps:

  1. find local matches between the model and each sequence that maximize the log-odds score and exceed minscore, and
  2. report matches with E-values less than threshold.

The threshold parameter is described in more detail under "Options:", below.

In order for E-values to be computed by mhmmscan, at least 100 matches must be found. If there are too few sequences in the database, or if certain other options are made too stringent (see "Options:", below), too few matches may exist for E-values to be computed. In this case, all matches are printed, sorted by match score, and the E-value column is set to "NaN".
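
The reporting rule can be illustrated with a short Python sketch. It is not taken from the mhmmscan source: the function name and the (sequence ID, score, E-value) tuples are assumptions, and the E-values are treated as already computed, whereas mhmmscan estimates them itself from the matches it finds.

```python
import math

# Minimum number of matches that the text above says is needed for E-values.
MIN_MATCHES = 100


def report(matches, threshold):
    """Order matches for printing (hypothetical helper).

    matches: list of (seq_id, score, evalue) tuples.
    """
    if len(matches) >= MIN_MATCHES:
        # Enough matches: apply the E-value threshold, smallest E-value first.
        kept = [m for m in matches if m[2] < threshold]
        return sorted(kept, key=lambda m: m[2])
    # Too few matches: print everything, E-value shown as NaN,
    # sorted by decreasing match score.
    no_evalue = [(seq_id, score, math.nan) for seq_id, score, _ in matches]
    return sorted(no_evalue, key=lambda m: m[1], reverse=True)
```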

Input:

Output:

The mhmmscan/MCAST output has up to three sections containing your search results:

  1. Database Search Results,
  2. Alignments, and
  3. Motif Diagrams.

All three sections are always present in MCAST output. The last two sections will not be present in mhmmscan output unless the -fancy option was specified.

The results in all three sections are sorted by increasing E-value if possible, or by decreasing match score if E-values could not be computed.

DATABASE SEARCH RESULTS

The "Database Search Results" section consists of lines of the following form:

<ID> <E-value> <Score> <Hits> <Span> <Start> <End> <Length> <Description>

These fields contain, for each match found,

ALIGNMENTS

Each alignment lists the sequence identifier, match E-value and log-odds score along the left. On the right, it shows the alignment of the match with the sequence in groups of four segments. An example segment from an alignment is given below, followed by a description of what each line of the segment means. (The example shows p-value score mode. The row of p-values would be replaced by log-odds scores in log-odds score mode. If '--motif-scoring' is not on, the row of p-values or scores is absent.)

hb_P1_element
1.5e-07
55.02
                             2.4e-04                           2.4e-04            1.3e-04
                             *_____+3__*                       *____-2__*         *___+1_*
                             TTTTTTATGCG.......................TTTTATGACT.........CTAATCCG..................................
                              TTTTTAT+ +                       TTTTAT A T         +TAATC+G
          220 CGGAACATTAAAATGATTTTTATTTCTATGCTAAATCTGTTGTATTTACTTTTATAAATTTAATGTGTTTAATCTGTTCACATTTTTAAATACTTCGTATGCTATCNNNN     329 
      


MOTIF DIAGRAMS

The motif diagrams section shows the matches in schematic format. For each match, in the right two columns, it shows the sequence identifier and the match E-value. On the left, it shows the positions and spacings of the hits making up the match. Hits are labeled with numbers corresponding to the order in which the motifs were given in the query. A plus or minus sign preceding a hit indicates that the hit occurs on the given strand (+) or on the reverse complement (-) of the DNA sequence in the database.

LOG-ODDS SCORES

The log-odds scores for each motif column are created using prior information on the letters appearing in alignment columns. The prior information is the target frequencies [Karlin,S. and Altschul,S.F., PNAS USA, 87, 2264-2268] implicit in a scoring matrix. Meta-MEME can read a user-specified scoring matrix (in the same format as used by the BLAST family of programs) from a file, or it can generate a PAM matrix. By default, PAM 250 is used for proteins, and PAM 1 is used for DNA. For DNA, the "PAM 1" frequency matrix is

              .990 .002 .006 .002
              .002 .990 .002 .006
              .006 .002 .990 .002
              .002 .006 .002 .990
      

Meta-MEME calculates the target frequencies q_ij = p_i p_j exp(L s_ij) from the scoring matrix s_ij and the background letter frequencies p_i by finding the value of L that makes the q_ij sum to one. These target frequencies are then used to create pseudo-frequencies to be added to the emission frequencies of the column, following the approach of [Henikoff,S. and Henikoff,J.G., JMB, 243, 574-578]. The pseudo-frequency for the ith letter is computed as

g_i = sum over j in alphabet of (f_j q_ij / p_j).

The pseudo-frequencies, g_i, are then combined with the emission frequencies, f_i, to give frequency estimates

Q_i = (alpha f_i + beta g_i) / (alpha + beta).

Finally, the log-odds score for a letter in the motif column is computed by dividing by the background frequency of the letter and taking the logarithm,

S_i = log(Q_i / p_i).

In general, alpha should be proportional to the amount of independent information in the emission frequencies. We have set it to the constant 20. The parameter beta is arbitrary and controls the relative importance of prior information. We set it to the constant 10.
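
The calculation described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Meta-MEME's code: the function names are hypothetical, L is found by simple bisection (the text does not say which root-finding method Meta-MEME uses), the natural logarithm is used because the text does not give the base, and the +5/-4 scoring matrix in the usage note is a stand-in chosen for simplicity rather than the default PAM matrix.

```python
import math


def solve_lambda(s, p, iters=200):
    """Find L > 0 with sum_ij p_i p_j exp(L s_ij) = 1 by bisection."""
    n = len(p)
    expected = sum(p[i] * p[j] * s[i][j] for i in range(n) for j in range(n))
    if expected >= 0 or max(max(row) for row in s) <= 0:
        # A positive root exists only for a negative expected score
        # and at least one positive score.
        raise ValueError("scoring matrix has no positive solution for L")

    def excess(L):
        return sum(p[i] * p[j] * math.exp(L * s[i][j])
                   for i in range(n) for j in range(n)) - 1.0

    hi = 1.0
    while excess(hi) <= 0.0:      # grow until the sum exceeds one
        hi *= 2.0
    lo = hi
    while excess(lo) >= 0.0:      # shrink until the sum drops below one
        lo /= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def column_log_odds(f, s, p, alpha=20.0, beta=10.0):
    """Return (L, Q, S) for one motif column with emission frequencies f."""
    n = len(p)
    L = solve_lambda(s, p)
    # Target frequencies q_ij = p_i p_j exp(L s_ij).
    q = [[p[i] * p[j] * math.exp(L * s[i][j]) for j in range(n)]
         for i in range(n)]
    # Pseudo-frequencies g_i = sum_j f_j q_ij / p_j.
    g = [sum(f[j] * q[i][j] / p[j] for j in range(n)) for i in range(n)]
    # Frequency estimates Q_i and log-odds scores S_i (natural log assumed).
    Q = [(alpha * f[i] + beta * g[i]) / (alpha + beta) for i in range(n)]
    S = [math.log(Q[i] / p[i]) for i in range(n)]
    return L, Q, S
```

For a DNA column that contains only the first letter, f = [1, 0, 0, 0], with uniform background frequencies and a match/mismatch score of +5/-4, the sketch gives a positive score for that letter and negative scores for the other three, and the estimates Q_i still sum to one.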

Our method is essentially that used in PSI-BLAST [Altschul,S.F. et al., NAR, 25:17, 3389-3402], but without

  1. sequence weighting, and
  2. scaling for amount of independent information (alpha).

To do 1) and 2) correctly would require having and using alignment information rather than emission frequencies as the starting point.

Options:

Advanced Options:

The following five options are automatically set when you specify the '-max-gap' option. If you do not use '-max-gap', you can set these options individually.

The following option is automatically invoked when you specify '-p-thresh'. You can also set it when you do not want p-value scoring but do want to prevent partial matches to motifs.

The following options can be used in both p-value and log-odds score modes to control how the emission probabilities in the HMM are converted into log-odds scores.

Bugs: None known.

Author: William Stafford Noble and Timothy Bailey.