FIMO - a motif search tool

Usage:

fimo [options] <motifs> <database>

Description:

The name FIMO stands for "Find Individual Motif Occurrences." The program searches a sequence database for occurrences of known motifs, treating each motif independently. It uses a dynamic programming algorithm to convert log-odds scores (in bits) into p-values, assuming a zero-order background model, and reports all motif occurrences whose p-values are smaller than the output p-value threshold. The threshold can be set using the --output-pthresh option; the default is 1e-4. The p-value of each motif occurrence is converted to a q-value following the method of Benjamini and Hochberg (the "q-value" is defined as the minimal false discovery rate at which a given motif occurrence is deemed significant). If a motif has the strand feature set to +/- (rather than +), then FIMO will search both strands for occurrences.
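
The score-to-p-value conversion described above can be illustrated with a small sketch. This is not FIMO's source code: the function names, the dictionary-based matrix layout, and the scale factor used to round scores to integers are all assumptions made for the example. It builds the exact distribution of total scores one motif column at a time under a zero-order background, then reads the p-value off the upper tail.

```python
from collections import defaultdict

def score_site(pssm, site):
    """Sum the log-odds entry for each letter of a candidate site."""
    return sum(column[letter] for column, letter in zip(pssm, site))

def score_distribution(pssm, background, scale=100):
    """Probability of every achievable (integer-scaled) total score.

    pssm:       list of columns, each a dict letter -> log-odds score in bits
    background: dict letter -> zero-order background frequency
    scale:      hypothetical factor used to round scores onto an integer grid
    """
    dist = {0: 1.0}  # before any column is scored, the total is 0
    for column in pssm:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for letter, freq in background.items():
                # Zero-order model: letters are independent, so probabilities multiply.
                nxt[total + round(column[letter] * scale)] += prob * freq
        dist = dict(nxt)
    return dist

def pvalue(dist, score, scale=100):
    """P(random site scores at least `score`) under the background model."""
    threshold = round(score * scale)
    return sum(prob for total, prob in dist.items() if total >= threshold)
```

For a two-column toy motif in which A scores 1 bit and every other letter scores -1 bit, with a uniform background, the site AA scores 2 bits and has a p-value of 1/16.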

The parameter --max-stored-scores sets the maximum number of motif occurrences that will be retained in memory. It defaults to 100,000. If the number of matches found reaches this maximum, FIMO will discard the least significant 50% of the stored matches, and any new match that is less significant than the retained matches will also be discarded.
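
The trimming behavior can be pictured with the following sketch. The class and attribute names are hypothetical and the data structure is chosen for brevity rather than speed; FIMO's internal implementation may differ.

```python
class MatchStore:
    """Keep at most `max_stored` matches, dropping the weakest half when full."""

    def __init__(self, max_stored):
        self.max_stored = max_stored
        self.matches = []    # list of (pvalue, record) pairs
        self.cutoff = None   # largest p-value still accepted after a trim

    def add(self, pvalue, record):
        # After a trim, anything less significant than the survivors is rejected.
        if self.cutoff is not None and pvalue > self.cutoff:
            return False
        self.matches.append((pvalue, record))
        if len(self.matches) >= self.max_stored:
            # Sort by significance and keep the better half.
            self.matches.sort(key=lambda match: match[0])
            keep = max(1, len(self.matches) // 2)
            self.matches = self.matches[:keep]
            self.cutoff = self.matches[-1][0]
        return True
```

Once a trim has happened the list of scores is no longer complete, which is why q-values computed from it are only approximate.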

FIMO can make use of position specific priors (PSP) to improve its identification of true motif occurrences. To take advantage of PSP in FIMO you must provide two command line options. The --psp option is used to set the name of a MEME PSP file, and the --prior-dist option is used to set the name of a file containing the binned distribution of priors.

Input:

<motifs> is the name of a file containing a list of motifs, in MEME format.
<database> is the name of a file containing a collection of sequences in FASTA format. The character - can be used to indicate that the sequence data should be read from standard input. This can only be used if the motif file contains a single motif.

The FASTA header lines are used as the source of sequence names. The sequence name is the string following the initial '>' up to the first white space character. If the sequence name is of the form text:number-number, the text portion will be used as the sequence name. The numbers will be used as genomic coordinates, and the first number will be used as the coordinate of the first position of the sequence. In all other cases the coordinate of the first position of the sequence is taken as 1.
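
The header rule above can be sketched as follows. The function name and the regular expression are assumptions for illustration; they show one reading of the rule, not FIMO's actual parser.

```python
import re

# Matches names of the form text:number-number, e.g. chr1:1000-2000.
_COORD_NAME = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)$")

def parse_header(header_line):
    """Return (sequence_name, coordinate_of_first_position) for a FASTA header."""
    text = header_line[1:] if header_line.startswith(">") else header_line
    tokens = text.split()
    # The name runs from '>' up to the first white space character.
    name = tokens[0] if tokens else ""
    match = _COORD_NAME.match(name)
    if match:
        return match.group("name"), int(match.group("start"))
    # Any other name form: the first position has coordinate 1.
    return name, 1
```

For example, the header ">chr1:1000-2000 some description" yields the name chr1 with first coordinate 1000, while ">seq42 some description" yields seq42 with first coordinate 1.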

Output:

FIMO will create a directory, named fimo_out by default. Any existing output files in the directory will be overwritten. The directory will contain:

An XML file named fimo.xml using the CisML schema.
An HTML file named fimo.html
A plain text file named fimo.text
A plain text file in GFF format named fimo.gff
A plain text file in wiggle track format named fimo.wig

The default output directory can be overridden using the --o or --oc options, which are described below.

The --text option will limit output to plain text sent to the standard output.

The HTML and plain text output contain the following columns:

The motif identifier
The sequence identifier
The start position of the motif occurrence (closed, 1-based coordinates)
The end position of the motif occurrence (closed, 1-based coordinates). If the start position is larger than the end position, the motif occurrence is on the reverse strand.
The score for the motif occurrence. The score is computed by summing the appropriate entries from each column of the position-dependent scoring matrix that represents the motif.
The p-value of the motif occurrence. The p-value is the probability of a random sequence of the same length as the motif matching that position of the sequence with a score at least as good.
The q-value of the motif occurrence. The q-value is the estimated false discovery rate if the occurrence is accepted as significant. See Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc. Natl Acad. Sci. USA (2003) 100:9440–9445.
The sequence matched to the motif.

The HTML and plain text output is sorted by increasing p-value.
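
As an illustration of the q-value column, the sketch below applies the standard Benjamini and Hochberg step-up adjustment to a list of p-values. The function name and the optional num_tests argument are assumptions for the example; in particular, how FIMO counts the total number of tests is not specified here.

```python
def bh_qvalues(pvalues, num_tests=None):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    num_tests is a hypothetical override for the total number of tests;
    it defaults to the number of p-values supplied.
    """
    n = num_tests if num_tests is not None else len(pvalues)
    # Indices of the p-values, from most to least significant.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    qvalues = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the least significant rank down, so that each q-value is the
    # smallest p * n / rank seen at that rank or any larger rank.
    for rank in range(len(order), 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues
```

Because of the running minimum, q-values never decrease as p-values increase, so sorting by p-value also sorts by q-value.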

The wiggle track output contains the following entries:

A track line containing:

The track type
The motif name
The source of the track (FIMO)

A step size line containing:

The sequence name
The width of the motif

A data line containing:

The start position of the motif occurrence (closed, 1-based coordinates)
The motif occurrence score, which is -log(p-value)

The wiggle track output is sorted by motif, sequence name, and position.
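
The layout above resembles the variableStep form of the UCSC wiggle format. The sketch below writes one track under that assumption. The exact header fields FIMO emits may differ, and the base of the logarithm is not stated in this document, so the use of base 10 here is an assumption, as is the function name.

```python
import math

def wiggle_lines(motif_name, motif_width, matches):
    """Format matches for one motif as wiggle text lines.

    matches: iterable of (sequence_name, start, pvalue) tuples.
    """
    # Track line: track type, motif name, and source of the track.
    lines = [f'track type=wiggle_0 name="{motif_name}" description="FIMO"']
    current = None
    # Within a motif, entries are ordered by sequence name, then position.
    for seq, start, pval in sorted(matches, key=lambda m: (m[0], m[1])):
        if seq != current:
            # Step size line: sequence name and motif width.
            lines.append(f"variableStep chrom={seq} span={motif_width}")
            current = seq
        # Data line: start position and -log(p-value); base 10 is assumed.
        lines.append(f"{start} {-math.log10(pval):g}")
    return lines
```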

Options:

--bgfile <bfile> - Read background frequencies from <bfile>. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.
--max-seq-length <max> - Set the maximum length allowed for input sequences. By default the maximum allowed length is 250000000.
--max-stored-scores <max> - Set the maximum number of scores that will be stored. Precise calculation of q-values depends on having a complete list of scores. However, keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped, and approximate q-values will be calculated. By default the maximum number of stored matches is 100,000.
--motif <id> - Use only the motif identified by <id>. This option may be repeated.
--motif-pseudo <float> - A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency (default=0.1).
--no-qvalue - Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995). By default, q-values are computed.
--norc - Do not score the reverse complement DNA strand. Both strands are scored by default.
--o <dir name> - Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name> - Specifies the output directory. If the directory already exists, the contents will be overwritten.
--output-pthresh <float> - The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. Using the --output-pthresh option will set the q-value threshold to 1.0. The default p-value threshold is 1e-4.
--output-qthresh <float> - The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. Using the --output-qthresh option will set the p-value threshold to 1.0. The default q-value threshold is 1.0.
--psp <file> - File containing position specific priors (PSP) in MEME PSP format.
--prior-dist <file> - File containing binned distribution of priors. This file can be generated from a MEME PSP format file using the compute-prior-dist utility.
--text - Limits output to plain text sent to standard output. For FIMO, the text output is unsorted, and q-values are not reported. This mode allows the program to search an arbitrarily large database, because results are not stored in memory.
--verbosity 1|2|3|4 - Set the verbosity of status reports to standard error. The default level is 2.
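
When FIMO is driven from a script, the options above can be assembled into an argument list. The helper below is a hypothetical convenience wrapper, not part of FIMO. It uses only option names listed in this document. Since each threshold option resets the other, it refuses to accept both at once, which is a design choice of the sketch rather than documented FIMO behavior.

```python
def build_fimo_command(motifs, database, pthresh=None, qthresh=None,
                       out_dir=None, overwrite=False, norc=False,
                       motif_ids=()):
    """Return an argument list for running fimo, e.g. via subprocess.run."""
    if pthresh is not None and qthresh is not None:
        # Each threshold option sets the other threshold to 1.0.
        raise ValueError("set either pthresh or qthresh, not both")
    cmd = ["fimo"]
    if pthresh is not None:
        cmd += ["--output-pthresh", str(pthresh)]
    if qthresh is not None:
        cmd += ["--output-qthresh", str(qthresh)]
    if out_dir is not None:
        # --oc overwrites an existing directory; --o does not.
        cmd += ["--oc" if overwrite else "--o", out_dir]
    if norc:
        cmd.append("--norc")
    for motif_id in motif_ids:
        cmd += ["--motif", motif_id]   # the option may be repeated
    cmd += [motifs, database]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a system where fimo is installed; the file names supplied by the caller are placeholders.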