Module: a locus on DNA consisting of a single occurrence of a motif, or a cluster of motifs fulfilling a clustering criterion. A module should be considered a single functional regulatory unit.

Module::Functionality

The probability that a given module is functional, in the absence of any other information, is P(F|S), where F is the boolean "is functional" and S is the sequence of the candidate locus. Currently, this should just be as follows (up to the constant factor P(F)):

    prod_m(P(S_m|F)) / prod_m(P(S_m))

    P(S_m|F) := PWM score

or, possibly:

    P(S_m|F) / P(S_m) := PWM score

If the latter, then we need only combine all the PWM scores.

    P(F|S) = P(S|F) * P(F) / P(S)
    P(S|F) = prod_m(P(S_m|F))
    P(S)   = prod_m(P(S_m))
    P(F)   = unknown, but fixed across the run

where S_m denotes the sequences of the motifs in the module and S_b denotes the sequences of all other, intercalating sequence. Note that prod_b(P(S_b)) appears in both P(S|F) and P(S) and so cancels, which is why it is omitted above.

ModuleConservation::Similarity

A score measuring the similarity of two modules. This will be a measure of the sequence similarities of the individual motifs and, if the module is a cluster of motifs, some measure of the similarity in relative order and orientation.

How could one interpret the idea that two very different modules could have the same probability of being functional? If we assume that "probability of being functional" is the only conserved quantity (which seems to be an unreasonable assumption), we would ignore the transitional properties of the evolutionary process. For example, even in the face of perfect conservation of function, and thus perfect maintenance of the overall probability and abundance of modules, the evolutionary time elapsed between orthologous regions will give rise to different similarities in the module sequences themselves. This measure of similarity is a metric independent of the measure of probability of function.

But is there any reason to suspect that maintenance of function should correlate well with similarity?
Yes, in general an argument can be made that there is a transition cost to mutations: even a mutation that appears to preserve function carries some risk of altering it. Therefore, the more preserved regulatory regions are more likely to be real than those not preserved. So, we need yet another score for the similarity of modules. It should be hierarchical, built on the similarity of sites and the similarity in orientation and relative order.

The evolutionary path from one regulatory sequence to another

Evolution is capable of introducing point mutations, insertions and deletions to change one DNA sequence into another. In comparing two DNA sequences with a presumed common ancestor (a common presumption), we use the symmetry argument that point mutations (A->B versus B->A) and insertions/deletions are time-symmetric phenomena. Without loss of generality, evolution of two such sequences from a common ancestor may be viewed as evolution of one into the other; the assessment of a shortest or most likely path from one to the other should be the same regardless of the direction taken.

Aside from evolutionary distance, the continuous preservation of regulatory functionality is also presumed. Every snapshot sequence on the path from one sequence to the other is presumed to contain a functional regulatory sequence. The measure of functionality is the probability model we have, and a good evolutionary path should give every snapshot a "good" probability, at least above a certain threshold. So, a second measure in selecting a best evolutionary path, and thus an evolutionary distance between two sequences, should take account of the probabilities of these snapshots. Is there a simple way to approximate this?

Issues in Module Similarity:

1. Since two modules will have 'extra' sites, the relevant comparison might better involve just the 'core' sites. On the other hand, maybe the similarity score should scale with the number of sites.
After all, the user is just setting a minimum number of each kind of site, and this seems to imply the more sites the better.

2. Find the best 1-to-1 mapping (bijection) of sites in one module with sites in the other. This will in general leave certain sites out of each module, but it is guaranteed to cover at least the core sites, since both modules are guaranteed to contain them.

2a. What is the 'best' bijection? It is the one with the best score, where the score is a sum of terms reflecting evolutionary changes. The changes include point mutation, site flipping, and site movement. Relative rearrangement is subsumed under site movement and given no special consideration.

2b. There is the complication that this approach doesn't seem to be normalized in a proper sense. That is, take a set of orthologous pairs of modules between two species and plot the Module Similarities between them. Now take a set of more tightly arranged, orthologously paired modules that in general have the same evolutionary distance. Their similarity scores will in general have a different distribution. You must normalize them based on the overall distribution. This principle holds for several other types of scores as well.

Calculating a similarity score for a given site-to-site mapping of two modules

Four components to the score:

1. permutation score(mapping){

For the motif similarity score, we want to measure the similarity of a pair of same-typed motifs, such that the similarities between two pairs may be used together in a consistent fashion. To be more precise, if we've specified that a module consist of at least 1 A site and at least 1 B site, then we want the following condition to hold:

    module X : A A A B
    module X': A A A B
    module Y : A A B B
    module Y': A A B B

Suppose site type A is a 10-mer, while B is a 6-mer. Suppose further that the same level of conservation exists between the ortholog pair X,X' and the pair Y,Y', i.e.
that the A sites and B sites have had the same opportunity and rate of mutation per base, have been subject to the same evolutionary time, and have faced the same magnitude of opposing forces of conservation and mutation. Then we need a similarity score whose distribution will not vary with the length of the site.

Now the question is, will it vary with the overall conservation of the site? Take the case of measuring conservation of the ttx-3 12-mer site and of the 14-mer site. The 14-mer site really is just the 12-mer with two very poorly conserved bases tacked on to the end. If we take the percentage of conserved bases as the measure of similarity, then the 14-mer will overall be less conserved than the 12-mer, even though the 'real' quantity of interest, functional conservation, is equivalent between the two comparisons.

Why not just use the percent difference in score as the indicator? Score does indeed measure conservation in the 'fitness' of the sites, but it would be blind to any change in nucleotide that is considered equivalent in its contribution to the fitness of the site. This at least eliminates the systematic effect of non-conserved bases in the sites. But it results in a loss of information about how many functionally important, yet iso-fit, nucleotides were conserved or not.

We need a scoring procedure that is in between. How about the per-base absolute value of the score difference, averaged over the whole length of the site? The simplest way to achieve this would be to calculate the sequence of scores along the series of point mutations from the query to the target sequence. This is the shortest evolutionary path from one to the other, and so has the nice property of being opportunistic. Another benefit of this approach is that it can be performed using existing functions of the motifs, or of the nuc_trie<> class, namely the score() function.
True, the consensus and collection scores will all return the same value, and so there will be a similarity value of zero for these, but that is consistent with the concept of 'don't care' for consensus bases. Not only do you not care which of 2 (or 3, or 4) bases is used, you also don't care whether they are preserved across putatively orthologous sites.
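
A minimal sketch of the per-base score-difference idea above, written in Python rather than against the actual motif or nuc_trie<> classes. Everything named here is hypothetical: the additive score() stand-in, TOY_PWM and its weights, and the left-to-right order in which mismatches are resolved. With an additive score the order does not change the result; with a different score() it could. A result of zero means the score never changed along the path.

```python
from typing import Dict, List

# Hypothetical position weight matrix: one {base: weight} dict per position.
PWM = List[Dict[str, float]]

# Toy 4-mer motif (made-up weights, for illustration only).
TOY_PWM: PWM = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},  # strongly prefers A
    {"A": 1.0, "C": 1.0, "G": -2.0, "T": -2.0},   # A and C are iso-fit
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},     # non-conserved position
    {"A": -0.5, "C": -0.5, "G": 1.5, "T": -0.5},  # prefers G
]


def score(pwm: PWM, site: str) -> float:
    """Additive stand-in for the motif score() function named in the notes."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, site))


def mutation_path(query: str, target: str) -> List[str]:
    """Snapshots from query to target, changing one mismatched base per step.

    Mismatches are resolved left to right; that order is an assumption.
    """
    if len(query) != len(target):
        raise ValueError("sites must have equal length")
    snapshot = list(query)
    path = [query]
    for i, base in enumerate(target):
        if snapshot[i] != base:
            snapshot[i] = base
            path.append("".join(snapshot))
    return path


def per_base_score_change(pwm: PWM, query: str, target: str) -> float:
    """Sum of |score change| along the path, averaged over the site length."""
    scores = [score(pwm, s) for s in mutation_path(query, target)]
    total = sum(abs(b - a) for a, b in zip(scores, scores[1:]))
    return total / len(query)


if __name__ == "__main__":
    # Two changes that both hit scored positions.
    print(per_base_score_change(TOY_PWM, "AAAG", "CAAT"))
    # Two changes that are iso-fit or at a non-conserved position.
    print(per_base_score_change(TOY_PWM, "AAAG", "ACTG"))
```

In this toy case, changes between equally weighted bases contribute nothing, which matches the 'don't care' behaviour described for consensus bases, and the poorly conserved position no longer drags the measure down the way percent identity would.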