Hi Mark - here's a session with tracks representing most of the steps I've worked on so far for the known genes III stuff.

http://hgwdev-kent.cse.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Jimkent&hgS_otherUserSessionName=jkg

The tracks with long labels that start with JKG are from the latest run of the pipeline. These are near the bottom. The other tracks are for reference.

I'm very happy with the merging/breaking step, which is handled by txPslToBed. This cleans up the blat alignments remarkably well. The algorithm is pretty simple:
1) Merge together alignment blocks separated by 5 bases or less on both genomic and mRNA side.
2) Break up alignments at longer gaps unless they are introns, meaning they have either gc/ag or gt/ag ends, and have no gap in mRNA side, and a gap of at least 30 on genomic side. If you can extend or subtract a single base on genomic side to get a good intron, go ahead and do it.
3) Throw out pieces smaller than 18 bases.
The breaking up that happens here isn't irrevocable. The next phase can end up merging them back together.

I'm pretty happy with the next step, which creates graphs out of transcripts that overlap at the exon level on the same strand. This is based largely on some code I started back in 2000, abandoned when the assembly got intense. Chuck took it over. I fixed a few bugs, and made it handle unspliced transcripts better. There are still a few small wrinkles to work out, but I think I'll work on some other steps first since this one is in reasonable shape. The algorithm is roughly:
1) Make a list of all unique exon boundaries from the previous phase. Splice sites are considered "hard" boundaries. Starts and ends are soft boundaries.
2) Create edges between the boundaries corresponding to exons and introns. If a transcript is broken up at the previous phase, and either the break is less than 70 kb, or it's a RefSeq transcript, then an intron is also added between the broken up pieces.
This will be a "soft edged" intron.
3) Snap nearby (within 5 bases) soft boundaries to the closest hard boundary. (Start exon boundaries are only snapped to other start exon boundaries, not to end boundaries. Similarly, end boundaries only snap to end boundaries.)
4) For edges that are half hard and half soft (the softness not already snapped away), where the hard boundary participates in other edges that are hard on both sides, snap the soft side to match the most similar edge with two hard sides.
5) For edges that are half hard that can't yet be snapped, merge all soft ends to a single point that is a consensus fairly strongly favoring larger transcripts.
6) Merge together all overlapping edges that are soft on both sides. Make the edge boundaries equal to the median value of all the merged boundaries.

I'm pretty happy with the orthologous exon/intron finder. This has a little bit of Chuck's orthoSplice left inside it, but it's much smaller than orthoSplice. The new program, txSplice, goes like so:
1) For each edge (intron or exon) in the splice graph, use the nets and chains to find the corresponding region of the mouse. On the mouse this is called the "mapped interval."
2) Search splice graphs on the mouse for edges of the same type that overlap the mapped interval.
3) If the mapped interval boundaries exactly match the hard boundaries of the overlapping edge, output the original human edge, since it is an "edge with orthology."

The next step is to create a graph with additional weights beyond those from the transcripts that initially built it. Weights from exoniphy exons and the orthologous edges are added. You can't see this in the browser, but there are text files with all the evidence.

The next step is to trim the graph of edges lacking sufficient weight. This is in the browser.

The final step is to run ExonWalk on the graph. I don't think I'm running ExonWalk quite right. Perhaps I need to double up the evidence or something.
It seems to skip some of the edges that are supported by only a single mRNA (but do have orthology support). I've put up an earlier version of ExonWalk run on Chuck's orthoSplice output, which does better. On the other hand, there are still a few cases where even the orthoSplice/ExonWalk combo is not behaving as desired. The location in the session is one of them. The input splice forms are not all represented in the ExonWalk output, and the output includes combinations of exons never seen in the input. As a consequence of all this I'm probably going to write my own walking phase starting tomorrow.

I haven't done the CDS bits yet beyond establishing some test cases that validate that blat seems to be good enough for the transcript/protein alignments. I'm not sure what troubles you were having with it, Mark. Possibly it was because it was protein/genome blat, not protein/transcript blat, you were using. The protein/genome mapping is of course much harder, and likely small exons are missed and codons split by splicing aren't handled well.

Anyway, have fun with the code review, Mark! I'll be happy to follow this up with a whiteboard session on Wednesday, maybe at the Genecats meeting if you think it's of general interest.

-Jim
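
For reference, the three merging/breaking rules listed for txPslToBed near the top can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual txPslToBed code: the names merge_and_break and is_intron are hypothetical, blocks are assumed to be sorted, non-overlapping, plus-strand (q_start, q_end, t_start, t_end) tuples with half-open coordinates, the genome is a plain string, piece size is assumed to be the total genomic bases in the piece, and the single-base nudge from rule 2 is left out for brevity.

```python
def is_intron(genome, start, end, q_gap, min_intron=30):
    """Return True if genome[start:end] qualifies as an intron:
    no gap on the mRNA side, at least min_intron genomic bases,
    and gt/ag or gc/ag ends."""
    if q_gap != 0 or end - start < min_intron:
        return False
    donor = genome[start:start + 2].lower()
    acceptor = genome[end - 2:end].lower()
    return donor in ("gt", "gc") and acceptor == "ag"


def merge_and_break(blocks, genome, max_merge_gap=5, min_intron=30, min_piece=18):
    """Apply the merge / break / size-filter rules to one alignment.

    Returns a list of pieces; each piece is a list of exon blocks
    whose intervening genomic gaps all qualify as introns.
    """
    if not blocks:
        return []
    pieces = []
    current = [list(blocks[0])]
    for q_start, q_end, t_start, t_end in blocks[1:]:
        last = current[-1]
        q_gap = q_start - last[1]
        t_gap = t_start - last[3]
        if q_gap <= max_merge_gap and t_gap <= max_merge_gap:
            # Rule 1: small gap on both sides, so extend the previous block.
            last[1], last[3] = q_end, t_end
        elif is_intron(genome, last[3], t_start, q_gap, min_intron):
            # Rule 2 (intron case): keep in the same piece as a new exon.
            current.append([q_start, q_end, t_start, t_end])
        else:
            # Rule 2 (non-intron gap): break the alignment here.
            pieces.append(current)
            current = [[q_start, q_end, t_start, t_end]]
    pieces.append(current)
    # Rule 3: drop pieces with fewer than min_piece genomic bases (assumed measure).
    return [
        [tuple(b) for b in piece]
        for piece in pieces
        if sum(b[3] - b[2] for b in piece) >= min_piece
    ]
```

Given an alignment whose first two blocks sit 2 bases apart, followed by a 40-base gt/ag gap and then a 50-base gap without splice signals, this sketch merges the first two blocks, keeps the gt/ag gap as an intron, breaks at the last gap, and discards the trailing 10-base piece.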