This track displays the mouse reference alignment produced by Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group.

Methods

The reference alignments are an attempt to reduce the matches between the human genome sequence (specifically, the Golden Path assembly from the University of California at Santa Cruz) and the mouse whole-genome shotgun data to canonical alignments in which each position in the human sequence is aligned to at most one position in a mouse sequence. For human chromosome 22, this set of alignments was computed by aligning all (trimmed and masked) mouse reads with the human sequence using the blastz program.

The following are the main steps in our current strategy for producing reference alignments from a collection of (potentially overlapping) blastz alignments.

  1. Regions that align with a very large number of mouse reads are identified. Currently, this means intervals in which each human position is aligned to at least 50 mouse reads. These highly duplicated segments are excluded from further analysis.
  2. Alignments are trimmed whenever there exists another alignment with a higher percent nucleotide identity in the overlap. For instance, suppose alignment A covers human positions w-z, and B covers x-y, with w < x < y < z. If on interval x-y, B has a higher percent identity than does A, then A is broken into an alignment from w to x-1 and an alignment from y+1 to z, with the segment from x to y being discarded in favor of B.
  3. An effort is made to determine which human segments are aligned to more than one region of the mouse genome (moderately duplicated). Whenever alignments are seen to have an appreciable overlap, the two portions of mouse reads are compared. If they have at least 95% nucleotide identity, they are treated as coming from the same part of the mouse genome (allowing 5% sequencing error between the reads); otherwise the region is labeled as duplicated.
  4. The final processing step is to identify regions, not annotated as exons, that are particularly well conserved (highly conserved).
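
The first three steps above can be illustrated with a minimal Python sketch. This is not the Penn State pipeline itself. The function names, the inclusive (start, end) coordinate convention, the assumption that each alignment comes from a distinct mouse read, and the gap-free comparison of equal-length read portions are all hypothetical simplifications chosen for illustration. Only the thresholds (50 reads, 95% identity) and the trimming rule come from the description above.

```python
from typing import List, Tuple

# Inclusive human coordinates (start, end); this convention is an assumption.
Interval = Tuple[int, int]


def high_coverage_intervals(alignments: List[Interval],
                            min_depth: int = 50) -> List[Interval]:
    """Step 1 sketch: maximal intervals in which every human position is
    covered by at least min_depth alignments (one alignment per read assumed)."""
    events = []
    for start, end in alignments:
        events.append((start, 1))      # coverage rises at start
        events.append((end + 1, -1))   # coverage falls just past end
    events.sort()

    result = []
    depth = 0
    region_start = None
    i = 0
    while i < len(events):
        pos = events[i][0]
        # Apply every event at this position before testing the depth.
        while i < len(events) and events[i][0] == pos:
            depth += events[i][1]
            i += 1
        if depth >= min_depth and region_start is None:
            region_start = pos
        elif depth < min_depth and region_start is not None:
            result.append((region_start, pos - 1))
            region_start = None
    return result


def trim_against(a: Interval, b: Interval,
                 identity_a: float, identity_b: float) -> List[Interval]:
    """Step 2 sketch: a = (w, z) and b = (x, y) with w < x < y < z.
    identity_a and identity_b are percent identities measured on x..y.
    If b is better there, a is split into (w, x-1) and (y+1, z)."""
    w, z = a
    x, y = b
    if not (w < x < y < z):
        raise ValueError("this sketch only handles b strictly inside a")
    if identity_b > identity_a:
        return [(w, x - 1), (y + 1, z)]
    return [a]


def same_mouse_region(portion1: str, portion2: str,
                      threshold: float = 0.95) -> bool:
    """Step 3 sketch: compare two overlapping read portions position by
    position (gap-free, equal length assumed). At or above the threshold
    they are treated as the same part of the mouse genome; below it the
    region would be labeled as duplicated."""
    if len(portion1) != len(portion2) or not portion1:
        raise ValueError("portions must be non-empty and of equal length")
    matches = sum(1 for c1, c2 in zip(portion1, portion2) if c1 == c2)
    return matches / len(portion1) >= threshold
```

For example, trim_against((100, 500), (200, 300), 0.80, 0.90) splits the first alignment into (100, 199) and (301, 500), matching the w to x-1 and y+1 to z rule in step 2.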

Credits

This track is produced from mouse sequence data provided by the Mouse Genome Sequencing Consortium.