This track displays the mouse reference alignment produced by
Scott Schwartz and
Webb Miller
of the Penn State Bioinformatics Group.
Methods
The reference alignments are an attempt to reduce the matches between
the human genome sequence (specifically, the Golden Path assembly from the
University of California at Santa Cruz) and the mouse whole-genome shotgun
data to canonical alignments in which each position in the human sequence is
aligned to at most one position in a mouse sequence. For human chromosome 22,
this set of alignments was computed by aligning all (trimmed and
masked) mouse reads with the human sequence using the blastz program.
The following are the main steps in our current strategy for producing
reference alignments from a collection of (potentially overlapping) blastz
alignments.
- Regions that align with a very large number of mouse reads are
identified. Currently, this means intervals in which each human position
is aligned to at least 50 mouse reads. These highly duplicated
segments are excluded from further analysis.
- Alignments are trimmed whenever there exists another alignment with a
higher percent nucleotide identity in the overlap. For instance, suppose
alignment A covers human positions w-z, and B covers x-y, with w < x < y <
z. If on interval x-y, B has a higher percent identity than does A, then A
is broken into an alignment from w to x-1 and an alignment from y+1 to z,
with the segment from x to y being discarded in favor of B.
- An effort is made to determine which human segments are aligned to more
than one region of the mouse genome (moderately duplicated).
Whenever alignments are seen to have an appreciable overlap, the two
portions of mouse reads are compared. If they have at least 95% nucleotide
identity, they are treated as coming from the same part of the mouse genome
(allowing 5% sequencing error between the reads); otherwise the region is
labeled as duplicated.
- The final processing steps are to identify regions, not annotated as
exons, that are particularly well conserved (highly conserved).
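The first three steps above can be illustrated with a minimal Python sketch. This is not the Penn State pipeline itself: the function and class names (highly_duplicated, Alignment, trim_nested, same_mouse_region) are hypothetical, alignments are simplified to ungapped per-position match flags, and the comparison of mouse read portions assumes two equal-length strings compared base by base. Only the thresholds (50 reads, 95% identity) and the w-z / x-y trimming rule come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


def highly_duplicated(read_spans: List[Tuple[int, int]],
                      min_depth: int = 50) -> List[Tuple[int, int]]:
    """Step 1: maximal human intervals in which every position is covered by
    at least min_depth read alignments. Spans are inclusive (start, end)."""
    change = defaultdict(int)
    for start, end in read_spans:
        change[start] += 1      # depth rises at the first covered position
        change[end + 1] -= 1    # and drops just past the last one
    regions, depth, open_at = [], 0, None
    for pos in sorted(change):
        depth += change[pos]
        if depth >= min_depth and open_at is None:
            open_at = pos
        elif depth < min_depth and open_at is not None:
            regions.append((open_at, pos - 1))
            open_at = None
    return regions


@dataclass
class Alignment:
    """Simplified (ungapped) alignment; an assumption made for illustration."""
    start: int            # first human position covered (inclusive)
    matches: List[bool]   # matches[i]: human position start + i matches the mouse base

    @property
    def end(self) -> int:
        return self.start + len(self.matches) - 1

    def piece(self, lo: int, hi: int) -> "Alignment":
        """Sub-alignment over human positions lo..hi (inclusive)."""
        return Alignment(lo, self.matches[lo - self.start: hi - self.start + 1])

    def identity(self, lo: int, hi: int) -> float:
        """Fraction of matching positions on lo..hi."""
        window = self.piece(lo, hi).matches
        return sum(window) / len(window)


def trim_nested(a: Alignment, b: Alignment) -> List[Alignment]:
    """Step 2, for the nested case in the text: A covers w-z, B covers x-y,
    with w < x < y < z. If B has the higher identity on x-y, A is split into
    w..x-1 and y+1..z; otherwise A is returned unchanged."""
    if not (a.start < b.start < b.end < a.end):
        raise ValueError("expected b strictly inside a (w < x < y < z)")
    if b.identity(b.start, b.end) > a.identity(b.start, b.end):
        return [a.piece(a.start, b.start - 1), a.piece(b.end + 1, a.end)]
    return [a]


def same_mouse_region(read1: str, read2: str, min_identity: float = 0.95) -> bool:
    """Step 3: treat two overlapping read portions as the same mouse locus
    if they are at least min_identity identical (ungapped comparison)."""
    if len(read1) != len(read2) or not read1:
        raise ValueError("expected two equal-length, non-empty read portions")
    same = sum(x == y for x, y in zip(read1, read2))
    return same / len(read1) >= min_identity
```

For example, with A covering human positions 100-109 and a better-matching B covering 104-106, trim_nested returns the two flanking pieces of A, 100-103 and 107-109. Real blastz alignments contain gaps and can overlap in ways other than strict nesting, so a production version would need to work on alignment columns rather than these simplified match flags.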
Credits
This track is produced from mouse sequence data provided by the
Mouse Genome Sequencing Consortium.