Hi Mark - here's a session with tracks representing most of the steps
I've worked on so far for the known genes III stuff.

http://hgwdev-kent.cse.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Jimkent&hgS_otherUserSessionName=jkg

The tracks with long labels that start with JKG are from the latest run of
the pipeline.  These are near the bottom.  The other tracks are for
reference.

I'm very happy with the merging/breaking step, which is handled
by txPslToBed.  This cleans up the blat alignments remarkably well.
The algorithm is pretty simple:
1) Merge together alignment blocks separated by 5 bases or less
     on both genomic and mRNA side.
2) Break up alignments at longer gaps unless they are introns, meaning
     they have either gc/ag or gt/ag ends, and have no gap in mRNA side,
     and a gap of at least 30 on genomic side.  If you can extend or subtract
     a single base on genomic side to get a good intron, go ahead and do it.
3) Throw out pieces smaller than 18 bases.
The breaking up that happens here isn't irrevocable.  The next phase can
end up merging them back together.

I'm pretty happy with the next step, which creates graphs out of transcripts
that overlap at the exon level on the same strand.  This is based largely
on some code I started back in 2000, abandoned when the assembly got
intense. Chuck took it over.  I fixed a few bugs, and made it handle unspliced
transcripts better.   There are still a few small wrinkles to work out, but I
think I'll work on some other steps first since this one is in reasonable shape.
The algorithm is roughly:
1) Make a list of all unique exon boundaries from the previous phase.
    Splice sites are considered "hard" boundaries.  Starts and ends
    are soft boundaries.
2) Create edges between the boundaries corresponding to exons and introns.
     If a transcript is broken up at the previous phase, and either the
     break is less than 70 kb, or it's a RefSeq transcript, then an intron is
     also added between the broken up pieces. This will be a "soft edged"
     intron.
3) Snap nearby (within 5 bases) soft boundaries to the closest hard
    boundary.  (Start exon boundaries are only snapped to other start exon
    boundaries, not to end boundaries. Similarly end boundaries only snap
    to end boundaries.)
4) For edges that are half hard, and half soft (the softness not already snapped away),
     and the hard boundary participates in other edges that are hard on both sides,
     snap the soft side to match the most similar edge with two hard sides.
5) For edges that are half hard that can't yet be snapped, merge all soft ends
     to a single point that is a consensus favoring fairly strongly larger transcripts.
6) Merge together all overlapping edges that are soft on both sides.  Make
     the edge boundaries equal to the median value of all the merged boundaries.

I'm pretty happy with the orthologous exon/intron finder.  This has a little bit
of Chuck's orthoSplice left inside it, but it's much smaller than
orthoSplice. The new program, txSplice, goes like so:
1) For each edge (intron or exon) in the splice graph, use the nets and chains
     to find the corresponding region of the mouse.  On the mouse this called
     the "mapped interval"
2) Search splice graphs on the mouse for edges of the same type that overlap
    the mapped interval.
3) If the mapped interval boundaries match exactly the hard boundaries
     of the overlapping edge output the original human edge since it
     is an "edge with orthology."

The next step is to create a graph with additional weights beyond those
from the transcripts that initially built it.  Weights from exoniphy exons and
the orthologous edges are added.  You can't see this in the browser, but
there are text files with all the evidence.

The next step is to trim the graph of edges lacking sufficient weight.  This
is in the browser.

The final step is to run ExonWalk on the graph.  I don't think I"m running
exonWalk quite right.  Perhaps I need to double up the evidence or
something.  It seems to skip some of the edges, which are supported
by only a single mRNA (but do have orthology support).  I've put up
an earlier version of ExonWalk run on Chuck's orthoSplice output,
that does better.   On the other hand there's still a few cases where
even the orthoSplice/ExonWalk combo is not behaving as desired.
The location in the session is one of them.  The input splice forms
are not all represented in the ExonWalk output, and the output includes
combinations of exons never seen in the input.   As a consequence of
all this I'm probably going to write my own walking phase starting tomorrow.

I haven't done the CDS bits yet beyond establishing some test cases
that validate that blat seems to be good enough for the transcript/protein
alignments.  I'm not sure what troubles you were having with it Mark.
Possibly it was because it was protein/genome blat, not protein/transcript
blat you were using.  The protein/genome mapping is of course much
harder, and likely small exons are missed and codons split by splicing
aren't handled well.

Anyway, have fun with the code review Mark!  I'll be happy to follow
this up with a whiteboard session on Wednesday, maybe at the
Genecats meeting if you think it's of general interest.

-Jim