psp-gen

Usage: psp-gen -pos <filename> -neg <filename> [options]

Short description:

psp-gen is used to allow MEME to perform discriminative motif discovery—to find motifs overrepresented in one set of sequences compared to in another set. It takes two files as input—the sequence file to be input to MEME, (the "positive" file) and a "negative" sequence file of sequences believed not to contain the same motifs as in the "positive" file. psp-gen creates a file for use by MEME that encapsulates information about probable discriminative motifs. psp-gen records its chosen motif width in the file, and MEME is able to adjust the data when it tries different motif widths.

Input:

You should choose the negative set as a contrast to the positive set: similar in most respects, but differing in the features that you would like to associate with motifs.

Output:

The output contains an entry for every sequence in the original data, in a FASTA-like format. The original FASTA name lines are retained, with additional annotation indicating the motif width and the scaling applied to raw scores.

Only the name and the motif width are interpreted by MEME; the scaling information is recored as documentation of the run of psp-gen that generate the PSP file.

Instead of sequence data, the output contains a number for each position in the sequence, representing the probability that a motif starts at that position. For example:

>A2ACQ1 3 scaledmin = 0.1 scaledmax = 0.9
0.001514752 0.001571433 0.00130429 0.002221764 0.00194399

starts with a name line indicating that the original sequence is called "A2ACQ1", psp-gen found the "best" motif scores with a width of 3, and scores were scaled to 0.1..0.9, before converting them to probabilities. The next line contains the first 5 positional prior values for sequence A2ACQ1.

Options:

Options are all case-specific:

Examples:

In each example, the sequence file is the same as the postive set used for psp-gen. While there is no check that you have done this because MEME runs as a separate program (other than that the sequence names need to match, as do the number of sites), using a positive set different to the sequence set MEME searches is unlikely to be useful.

Generate a discriminative prior for a DNA sequence set (pos-dna.fasta) likely to contain TF binding sites, given another DNA sequence set unlikely to contain TF binding sites (neg-dna.fasta):
psp-gen -pos pos-dna.fasta -neg neg-dna.fasta > dna.psp

Use these position-specific priors in MEME, searching both strands:
meme pos-dna.fasta -psp dna.psp -oc meme_out_dna -revcomp

Generate a discriminative prior for a protein sequence set (pos-prot.fasta) associated with a specific function, given another protein sequence set unlikely to relate to that function (neg-prot.fasta):
psp-gen -pos pos-prot.fasta -neg neg-prot.fasta -alpha prot -triples -maxrange > prot.psp

Use these position-specific priors in MEME:
meme pos-prot.fasta -psp prot.psp -oc meme_out_prot

View usage information:
psp-gen -h

Detailed description:

The psp-gen program generates a position-specific prior (PSP) in a data file, which can be used as an additional input to MEME to bias the search to sites more likely to contain a motif (or motifs) of interest. Without using a PSP, MEME assigns the same prior probability to all sites. A PSP is generated by psp-gen using two sequence sets, each in FASTA format:

The basic principle of a discriminative prior is to start from the question:

What fraction of all instances of a w-wide subsequence (or w-mer) across both sequence sets occur in the positive set?

The intuition is that any w-mer that is common in the positive set but not the negative set could form part of motif of interest, based on the way the two sets are chosen.

PSP calculation starts from the following equation based on the D-prior described in the Narlikar et al. RECOMB 2007 paper "Nucleosome Occupancy Information Improves de novo Motif Discovery". We count the number of instances Xi,j(w) of the subsequence w wide starting at position j in sequence Xi of the positive set, and similarly count Yi,j(w) in the negative set, then calculate for each site:

Di,j = Xi,j(w)
Xi,j(w) + Yi,j(w)

As pseudocounts, we add 1 to the enumerator, and 1 + |X|/|Y| to the denominator. The purpose of the pseudocounts is to avoid giving a high score to infrequent subsequences that are not (or rarely) found in the negative set. Once we have scored all sites for a given value of w, we scale scores to a range that can be set on the command line, with the default range [0.1,0.9].

We repeat scoring for the whole range of motif widths under consideration, and choose a "best" width using one of two methods: either we choose the value of w that maximizes the score over all sequences, or the value of w that maximizes the difference between the maximum and minimum score over all sequences. We also allow an option of using spaced triples instead of whole words to match subsequences. See Variations below for more details on triples and choosing the "best" width.

Once we have scored every available site and chosen a specific value of w, we convert the scores to probabilities, based on the following proportionalities, where Zi = j means that the site for a motif in sequence Xi is found at location j (with j = 0 signifying there is no site in the given sequence) and Di,j is the score for sequence Xi at the site starting at location j, numbered from 1:

P(Zi = 0) 1
and
P(Zi =j) Di,j
1-Di,j

We convert these proportionalities to a PSP by normalizing them to add up to 1 over each sequence.

Running with MEME:

Since MEME runs as a separate program, there is no way to check that the positive set is the same as the sequence set given to MEME. There are however checks that every name in the PSP file matches a name in the MEME sequence set, and that the number of sites in the PSP file for a particular sequence name matches the number of sites in the sequence file seen by MEME. While it does not make sense to use a completely different sequence set for PSP generation and for running MEME (with the resulting PSP file), MEME can use a PSP file that does not provide probabilities for every sequence. Any sequences not in the PSP file are given a uniform prior probability. MEME documentation includes more detail on how PSPs are used.

Variations:

For each variation, "default" means that you do not need to specify a command-line switch to turn on that setting.

Performance scalability:

When you use exact-word (w-mer) search, psp-gen run time scales linearly in each of |X|, |Y| and maxw-minw. Formally, time is O((|X| + |Y|)(wmax - wmin)). For triples, psp-gen scales linearly in each of |X| and |Y|, and quadratically in wmax, so run time is O((|X| + |Y|)wmax2).

Extra memory in both cases scales linearly with |X| + |Y|.

Since triples have a higher-order term in time complexity, we predict run time scalability in more detail than for exact word match. Adding in constants and a linear term, time complexity for triples is approximated by (where W = wmax-2, the number of variations in width used by the algorithm):

t = (k1|X| + k2|Y|)W2 + k3(|X| + |Y|)W

On a real machine (iMac, 3.06GHz Intel Core 2 Duo, 6MB L2 cache, 4GB RAM) the following is a good predictor of the upper bound on run time for likely values of |X|, |Y| and range of widths. Scale any results according to the speed of the actual machine on which you intend to run psp-gen. The formula reflects runs on randomly generated data; real data will likely result in faster runs because random data defeats some optimizations in psp-gen.

t=(4.37x10-6|X|+ 2.46x10-06|Y|)W2 + 6.25x10-06(|X| + |Y|)W s

For example, for |X| = |Y| = 60,000 and wmax = 50 (i.e., W = 48), estimated run time = 980s, the same in this case as measured run time. The following command line reproduces this run:

time psp-gen -neg pos60k.fasta -pos neg60k.fasta -alpha prot -minw 3 -maxw 50 -arbitraryOK -triples > priors60kmaxw50.psp

Known problems:


Author: Philip Machanick (p.machanick@imb.uq.edu.au)