- add correlation coefficient as a criteria: Adam Siepel 2005/03/02 On another front, it occurred to me that the correlation coefficient sometimes used in gene prediction stats could be another useful thing to report. For each inFile record, you could give a correlation coefficient based on all overlapping selectFile records. This would give you one number saying something about both directions of coverage and about the degree of "consistency" we were talking about the other day. For example, you could project the intronEsts into a bed of nonoverlapping features using featureBits, then run overlapSelect -cc (or similar), to get a cc number for each selectFile, which could then go in a database like the one I'm building. I think, when computing the cc, you might want to limit yourself to the range of the inFile record. That would make sense for my application at least, where my predictions are fragments. In other cases, you might want to compute the cc for the smallest interval including both the inFile record and all overlapping selectFile records. It looks to me like the number you'd compute for a given interval would be cc = (cN - ab) / sqrt(ab(N-a)(N-b)) where a is the number of "bits" (e.g., bases in exons) in the inFile record, b is the total number of bits in all overlapping selectFile records (within the interval), c is the number of bits in both the inFile and the selectFile records, and N is the length of the interval. For example, if you had a 1000 base interval with 100 bases within predicted exons, 150 bases of supporting EST evidence, and an overlap of 90 bases, then N = 1000, a = 100, b = 150, c = 90, and cc = 0.70. This number is defined as long as 0 < a,b < N. It will always be true that a > 0 (otherwise you don't have an inFile record). If a > 0 and b = 0, then you'd have 0/0 but you could just report 0. If a = N and b <= N (also possible), then c = b and you'd also have 0/0. You could report b/N in this case. The symmetric thing could be done if b = N and a <= N. I suppose an alternative would be to report a, b, c, and N for each inFile record. Then the cc or some alternative could easily be computed with an awk script. - add featureBits type of feature specifications (e.g. :intron)