Description

We have generated a first version of alignments for the ENCODE regions. These alignments are in beta stage; future versions will incorporate newer alignment techniques that are currently under research and development in our group. We think that this early set of alignments will be useful for those who are eager to perform analysis on aligned data. For each region, we have used a human-centric methodology that comprises the following steps:

In the first step, the sequence of each species is "rearranged" so that it is orthologously collinear with respect to the human sequence. In other words, the sequence of each species is mapped to the human sequence: first, a human-monotonic map is created based on local similarities between the two sequences (using an algorithm based on Shuffle-LAGAN), and then a new sequence is produced for the species by gluing together different pieces of the original sequence according to the mapping. The mapping allows for arbitrary rearrangements, such as inversions, translocations, and duplications. As a result, a new FASTA file is created for each species (other than human), containing a sequence that is directly alignable to human using standard global alignment techniques.
In the second step, a multiple global alignment is created for every region using MLAGAN.
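
The "gluing" part of the first step can be illustrated with a minimal Python sketch. This is not the actual Shuffle-LAGAN-based implementation; the function names and the (start, end, strand) piece format are assumptions made for illustration only, with 1-based inclusive coordinates and pieces already ordered by their position on the human sequence.

```python
# Illustrative sketch only: the names and the piece format are hypothetical,
# not taken from the actual rearrangement software.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def glue(original, pieces):
    """Build a rearranged sequence from pieces of the original one.

    pieces: list of (start, end, strand) tuples with 1-based inclusive
    coordinates into `original`, listed in human-monotonic order.
    Pieces on the '-' strand are reverse complemented before gluing.
    """
    parts = []
    for start, end, strand in pieces:
        chunk = original[start - 1:end]
        parts.append(reverse_complement(chunk) if strand == "-" else chunk)
    return "".join(parts)


# Toy example: the middle piece is an inversion, the last a translocation.
original = "AAACCCGGGTTT"
pieces = [(1, 3, "+"), (7, 9, "-"), (4, 6, "+")]
rearranged = glue(original, pieces)
print(rearranged)
```

Note that bases 10-12 of the toy original sequence do not appear in any piece, so they are absent from the result, just as unmapped parts of a real sequence are discarded.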

Moreover, for each region we report how the sequences have been rearranged, so that people who want to do comparative analysis on the alignment can later map the coordinates of the rearranged sequences back to the original ones. For each region, a subdirectory named "rearrangements" contains a compressed tar archive with .info files. Each .info file contains the map between the original and the rearranged sequence of one species. For example, here is the .info file for the galago species, region ENr133:

  galago 2 165283 244112      1  58088  0 33 + 0     73  58160
  galago 2 244113 348408  58089 174393 85  0 + 0  58044 174348
  galago 2 353199 357763 174394 180830  0 69 - 0 175801 182237
  galago 2 357764 369080 180831 195593 18  0 + 0 182613 197375
  galago 2 369569 498714 195594 301899  4  0 + 0 197246 303551

The info file tries to follow the conventions of AVID's draft sequence info file format. The first field contains the species name; the last two fields contain the species' coordinates in the original sequence, and the third and fourth fields contain the human coordinates. For example, in the first line of the example, the part of galago's sequence from the 73rd to the 58,160th base is mapped to the respective part (from 165,283 to 244,112) of the human sequence. The file is always sorted according to human coordinates, since it is a human-monotonic map. Fields five and six contain the coordinates of the rearranged sequence that is created from the map; in the example, the first 58,088 bases of the rearranged galago sequence are copied from positions 73 - 58,160 of the original sequence. The ninth field contains a sign that distinguishes positive strands (+) from negative ones (-). In the example, positions 175,801 to 182,237 of the original sequence are reverse complemented and then placed at positions 174,394 to 180,830 of the rearranged sequence. The remaining fields are not relevant here. Notice that not all of the original sequence is present in the rearranged one; the algorithm may discard parts of the original sequence that could not be mapped to any place in the human sequence.
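
As an illustration of the field layout described above, here is a minimal Python sketch that parses .info lines and maps a position in the rearranged sequence back to the original sequence. The names InfoRecord, parse_info_line and rearranged_to_original are hypothetical helpers, not part of any released tool. The sketch assumes whitespace-separated fields and 1-based inclusive coordinates, which is consistent with the example but not otherwise guaranteed.

```python
from collections import namedtuple

# Hypothetical record type; only the fields explained in the text are kept.
InfoRecord = namedtuple(
    "InfoRecord",
    "species human_start human_end rearr_start rearr_end strand orig_start orig_end",
)


def parse_info_line(line):
    """Parse one whitespace-separated .info line into an InfoRecord."""
    f = line.split()
    return InfoRecord(
        species=f[0],
        human_start=int(f[2]), human_end=int(f[3]),    # fields 3 and 4
        rearr_start=int(f[4]), rearr_end=int(f[5]),    # fields 5 and 6
        strand=f[8],                                   # field 9
        orig_start=int(f[10]), orig_end=int(f[11]),    # last two fields
    )


def rearranged_to_original(records, pos):
    """Map a 1-based position in the rearranged sequence to the original.

    Returns (original_position, strand), or None if pos is not covered.
    For '-' pieces the order is reversed, because the piece was reverse
    complemented before being copied into the rearranged sequence.
    """
    for r in records:
        if r.rearr_start <= pos <= r.rearr_end:
            offset = pos - r.rearr_start
            if r.strand == "+":
                return r.orig_start + offset, "+"
            return r.orig_end - offset, "-"
    return None


EXAMPLE = """
  galago 2 165283 244112      1  58088  0 33 + 0     73  58160
  galago 2 244113 348408  58089 174393 85  0 + 0  58044 174348
  galago 2 353199 357763 174394 180830  0 69 - 0 175801 182237
  galago 2 357764 369080 180831 195593 18  0 + 0 182613 197375
  galago 2 369569 498714 195594 301899  4  0 + 0 197246 303551
"""
records = [parse_info_line(l) for l in EXAMPLE.splitlines() if l.strip()]

print(rearranged_to_original(records, 1))       # first base of the first piece
print(rearranged_to_original(records, 174394))  # first base of the inverted piece
```

With the ENr133 galago example, position 1 of the rearranged sequence maps to original position 73 on the + strand, and position 174,394 maps to original position 182,237 on the - strand, i.e. the last base of the inverted piece.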

The info files have also been drawn into linear plots. Here is the linear plot of the previous example:

[Figure: ENr133 galago linear plot]

The first grey horizontal line represents the galago sequence, and the second line represents the human sequence. Black arrows are drawn in rearranged regions (showing the direction within the strand) and grey lines cross the two regions to indicate that they were linked. The same info file can also be represented in a pseudo-dotplot, like the following:

[Figure: ENr133 galago dotplot]

The horizontal axis represents the human sequence, and the vertical axis represents the other species' sequence. A black line is drawn to indicate a rearranged piece, and grey dotted lines indicate its boundaries on the human-monotonic axis. The figure looks like a dotplot, but keep in mind that it is actually a rearrangement visualization; it does not directly depict any local alignment hits or aligned regions, as usual dotplots do. Also notice that in all plots, the line that represents the sequence of the other species always ends at the position of the furthest rearrangement (rather than the position of the last nucleotide in the sequence).

Linear plots and dotplots are available for all regions, and they are located in the "rearrangements" subdirectory, in PNG format.

The actual data location is: http://ai.stanford.edu/~asimenos/beta_encode_Nov_2004/data/

Finally, notice that we have used "rat" instead of "ratB" in these alignments. Also, our rearrangement algorithm was not fine-tuned for each region, and so it may behave differently from region to region. It seems that a few of the regions (especially the randomly picked ones) contain sequences from other species that show very weak homology or too many repeated elements. For that reason, we decided to exclude the marmoset sequence from region ENr132 and the cow sequence from region ENr213. Lastly, in some regions, the original FASTA files of some of the species contained more than one sequence, usually from two or more different chromosomes of the species; in this case we concatenated the sequences into a single one before feeding them to the rearrangement algorithm. Therefore, in such cases, the MAF and .info coordinates (and the first horizontal line in the linear plot or the vertical axis in the dotplot) refer to the concatenated input file. Thus, extra care should be taken if one needs to map these coordinates back to the original sequences.
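
For the concatenated case, a coordinate can be mapped back to an individual input sequence once the lengths of the original FASTA records are known. The following Python sketch shows one way to do this. It rests on an assumption that should be verified against the actual files: that the records were joined in their original file order with no separator characters between them. The function name locate_in_concatenation is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_in_concatenation(lengths, pos):
    """Map a 1-based position in a concatenated sequence to its source record.

    lengths: lengths of the original FASTA records, in concatenation order
             (ASSUMPTION: joined in file order, with no separators).
    Returns (record_index, position_in_record): record_index is 0-based,
    position_in_record is 1-based.
    """
    if not lengths:
        raise ValueError("no records given")
    ends = list(accumulate(lengths))  # 1-based end of each record
    if not 1 <= pos <= ends[-1]:
        raise ValueError("position outside the concatenated sequence")
    idx = bisect_right(ends, pos - 1)
    start_offset = ends[idx - 1] if idx else 0
    return idx, pos - start_offset


# Toy example: three records of lengths 100, 50 and 30.
print(locate_in_concatenation([100, 50, 30], 101))
```

In the toy example, position 101 of the concatenated sequence is the first base of the second record.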

Feel free to download the alignments or browse through some interesting plots! Be sure to email me any comments!