GOMO - a Gene Ontology association tool for motifs

Usage:
gomo [options] <go-term database> <scoring file>+
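
The usage line above can also be assembled programmatically. The following is a minimal Python sketch, not part of GOMO itself: the helper name build_gomo_command and its keyword arguments are hypothetical, and it only emits flags that are listed under Options in this document.

```python
def build_gomo_command(go_db, scoring_files, out_dir=None, overwrite=False,
                       text=False, motifs=(), shuffle_scores=None,
                       q_threshold=None, use_gene_scores=False):
    """Build an argument list following: gomo [options] <go-term database> <scoring file>+"""
    scoring_files = list(scoring_files)
    if not scoring_files:
        raise ValueError("at least one scoring file is required")
    cmd = ["gomo"]
    if text:
        # --text writes to standard output and creates no directory.
        cmd.append("--text")
    elif out_dir is not None:
        # --oc overwrites existing contents, --o does not.
        cmd += ["--oc" if overwrite else "--o", out_dir]
    for motif_id in motifs:
        # --motif may be repeated.
        cmd += ["--motif", motif_id]
    if shuffle_scores is not None:
        cmd += ["--shuffle_scores", str(shuffle_scores)]
    if q_threshold is not None:
        cmd += ["--t", str(q_threshold)]
    if use_gene_scores:
        cmd.append("--gs")
    # Positional arguments come last: database first, then scoring files.
    cmd.append(go_db)
    cmd += scoring_files
    return cmd
```

The resulting list could be passed to subprocess.run(cmd, check=True), assuming the gomo executable is on the PATH. The file names are placeholders chosen by the caller.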

Description:
The name GOMO stands for "Gene Ontology for Motifs." The program searches a set of ranked genes for enriched GO terms associated with high-ranking genes. The genes can be ranked, for example, by applying a motif scoring algorithm to their upstream sequences. The p-value for each GO-term is computed empirically by shuffling the gene identifiers in the ranking (ensuring consistency across species) to generate scores under the null hypothesis. The q-values are then derived from these p-values following the method of Benjamini and Hochberg (where "q-value" is defined as the minimal false discovery rate at which a given GO-term is deemed significant). The program reports all GO terms that receive q-values smaller than a specified threshold, outputting a GOMO score with empirically calculated p-values and q-values for each.
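
The two statistical steps described above can be illustrated with a short Python sketch. It is not GOMO's actual implementation: the function names are hypothetical, the add-one correction in the empirical p-value is an assumption, and so is the convention that larger null scores count as more extreme. The q-value step is the standard Benjamini-Hochberg adjustment.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    Assumption: an add-one correction keeps the estimate above zero.
    """
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as p_values."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the
    # minimum adjusted p-value over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

A GO term would then be reported when its q-value falls below the chosen threshold.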

Input:

<go-term database> is a collection of GO terms mapped to the sequences in the scoring file. Databases are provided by the web services and are formatted using a simple TSV format:
"GO-term" "Sequence identifiers separated by tabulator"
The exception to this rule is the first line, which instead contains the URL used to look up the gene IDs. The URL has ampersands (&) replaced with &amp; and the place for the gene ID marked by the token "!!GENEID!!".
<scoring file> is an XML file which contains, for each motif, the sequences and their scores. The XML file uses the CisML schema. When scoring data is available for multiple related species, GOMO can take multiple scoring files in which the true sequence identifiers have been mapped to their orthologs in the reference species for which the GO-term database was supplied.
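
The GO-term database layout described above could be read with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, blank lines are skipped, and the first line is taken to be the lookup URL with its ampersands escaped, as described.

```python
import html


def parse_go_database(lines):
    """Return (url_template, {go_term: [sequence identifiers]})."""
    it = iter(lines)
    # First line: lookup URL with escaped ampersands, which are restored here.
    url_template = html.unescape(next(it).rstrip("\n"))
    terms = {}
    for line in it:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # First field is the GO term, the rest are sequence identifiers.
        terms[fields[0]] = fields[1:]
    return url_template, terms


def gene_url(url_template, gene_id):
    """Fill the !!GENEID!! placeholder with a concrete gene identifier."""
    return url_template.replace("!!GENEID!!", gene_id)
```

In practice the lines would come from an open file handle. The URL and identifiers in any test data are placeholders, not real database entries.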

Output:

GOMO will create a directory, named gomo_out by default. Any existing output files in the directory will be overwritten. The directory will contain:

An XML file named gomo.xml providing the results in a machine readable format.
An HTML file named gomo.html providing the results in a human readable format.

The default output directory can be overridden using the --o or --oc options which are described below.

Additionally, the user can override the creation of files altogether by specifying the --text option, which writes to standard output in a tab-separated values format:
"Motif Identifier" "GO Term Identifier" "GOMO Score" "p-value" "q-value"

By default GOMO calculates the ranksum statistics on the p-values of each gene given in the CisML input file. Using the --gs option switches the focus from the p-values to the scores. If any sequence lacks a p-value, GOMO aborts the calculations. The same happens when any gene in the CisML file lacks a score attribute and --gs is active.
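
The five-column --text output described above lends itself to simple post-processing. The following Python sketch uses hypothetical names (GomoResult, parse_gomo_text). Whether GOMO emits a header or comment lines is not stated here, so as an assumption any line whose numeric columns fail to parse is skipped.

```python
from typing import NamedTuple


class GomoResult(NamedTuple):
    motif_id: str
    go_term: str
    score: float
    p_value: float
    q_value: float


def parse_gomo_text(lines):
    """Parse tab-separated --text output into a list of GomoResult records."""
    results = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # blank or incomplete line
        try:
            score, p_value, q_value = (float(x) for x in fields[2:5])
        except ValueError:
            continue  # assumption: non-numeric rows (e.g. a header) are skipped
        results.append(GomoResult(fields[0], fields[1], score, p_value, q_value))
    return results
```

The lines could be read from a file that captured standard output, or from sys.stdin when GOMO is piped into the script.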

Options:

--o <dir name> - Specifies the output directory. If the directory already exists, the contents will not be overwritten.
--oc <dir name> - Specifies the output directory. If the directory already exists, the contents will be overwritten.
--dag <go dag file> - Path to the optional Gene Ontology DAG to be used for identifying the most specific terms in the GOMO XML output so they can be highlighted in the HTML output.
--text - Output in tab separated values format to standard out. Will not create an output directory or files.
--motif <id> - Use only the motif identified by <id>. This option may be repeated.
--shuffle_scores <n> - Number of times to shuffle the sequence-to-score assignment and use the shuffled scores to generate empirical p-values.
--score_E_thresh <n> - Threshold applied to the gene score E-values; all E-values above it are set to the maximum in order to reduce the impact of noise. As a result, all genes with E-values above the threshold obtain the same rank in the ranksum statistics. The threshold is ignored when gene scores are used (--gs).
--t <n> - Threshold on the q-values above which results are not considered significant and are therefore not reported. Defaults to 0.05. To show all results use a value of 1.0.
--min_gene_count <n> - Filter out GO-terms which are annotated with fewer than <n> genes. Defaults to 1, which shows all results.
--gs - Indicates that gene scores contained in the CisML file should be used for the calculations. The default is to use the gene p-values.
--nostatus - Suppresses the process information.
--verbosity 1|2|3|4 - Set the verbosity of status reports to standard error. The default level is 2.
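
The effect of --score_E_thresh on the ranking can be illustrated with a small Python sketch. It is not GOMO's code: the function names are hypothetical, clamping to the largest observed value is an assumed reading of "set to the maximum", and giving tied values their average rank is also an assumption.

```python
def clamp_and_rank(values, threshold=None):
    """Return 1-based ranks (smallest value first); ties share the average rank."""
    if threshold is not None and values:
        # Assumption: values above the threshold are raised to the observed maximum.
        max_value = max(values)
        values = [max_value if v > threshold else v for v in values]
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of equal values starting at pos.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mid_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mid_rank
        pos = end + 1
    return ranks


def rank_sum(ranks, member_indices):
    """Sum the ranks of the genes annotated with one GO term."""
    return sum(ranks[i] for i in member_indices)
```

With a threshold of 1.0, the E-values 5.0 and 50.0 in the list [0.001, 5.0, 0.2, 50.0] both become 50.0 and share a rank, so noise among weak matches no longer influences the rank sum.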