Descriptions of the Algorithms

The IMR Algorithm

IMR: Iterative reads-Mapping and Re-assembly

IMR is a programme that iteratively assembles the short reads generated by Illumina sequencers. It should handle other short-read data too, but has not yet been tested on other platforms. It was designed to reassemble homozygous genomes (e.g. inbred strains or haploid organisms) for which a reference genome is available that is sufficiently similar to the genome of the sequenced sample that most reads can be aligned using a standard short-read mapper such as Stampy, MAQ, SMALT or BWA. However, we are extending it to work with heterozygous genomes. The novel aspects of IMR are the way in which, starting from the reference sequence, it iteratively mutates that sequence towards the sequence of the sample, and the algorithms used for variant calling. These are described below.

Iterative realignment has a potential advantage over single-pass alignment for describing complex loci. Briefly, at each iteration, reads are aligned to the current version of a consensus sequence for the genome, high-confidence SNPs and indels are called, and these are incorporated into a new consensus. This process is repeated until additional rounds of iteration produce few (or only alternating) changes in the consensus sequence.
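The step of incorporating called variants into a new consensus can be illustrated with a minimal Python sketch. It is not IMR's own code: the `Variant` tuple layout (0-based position, reference allele, alternative allele, with indels anchored on the preceding base) and the function name `apply_variants` are assumptions made for illustration, and the variants are assumed not to overlap.

```python
from typing import List, Tuple

# Hypothetical variant record (an assumption, not IMR's format):
# (0-based position, reference allele, alternative allele).
Variant = Tuple[int, str, str]


def apply_variants(consensus: str, variants: List[Variant]) -> str:
    """Return a new consensus with the given non-overlapping variants applied."""
    seq = consensus
    # Work from right to left so that an indel does not shift the
    # coordinates of variants that have yet to be applied.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


# One SNP (C>T), one 1-bp deletion (TA>T) and one 2-bp insertion (G>GTT).
calls = [(1, "C", "T"), (3, "TA", "T"), (6, "G", "GTT")]
print(apply_variants("ACGTACGT", calls))  # prints ATGTCGTTT
```

Applying the variants from the highest coordinate downwards is a design choice that avoids having to track coordinate offsets while the sequence changes length.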

For assembling Arabidopsis thaliana accessions, we use the TAIR10 reference sequence as the consensus for the first iteration, and align reads using Stampy (Lunter). For other inbred genomes, such as inbred strains of mice or rats, substitute the appropriate reference genome (mm9 or rn3.4). We treat the process as converged when the number of additional variants accepted in an iteration is less than 2% of the number of variants detected in the first iteration; we have found that this occurs after about five iterations. At that point, the majority of the remaining variants are unresolvable “heterozygotes” or cycle between alleles in successive iterations. These ambiguous positions can arise for several reasons, including repetitive read mappings that cannot be resolved, copy number variation, and residual heterozygosity in the genome.
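The stopping rule described above can be sketched as a small driver loop. This is an illustration under stated assumptions, not IMR's implementation: `run_round` is a hypothetical stand-in for one full round of mapping, variant calling and consensus update, and the per-round count of accepted variants is used as the baseline for the 2% threshold. The `max_iterations` cap and the early exit when no variants are accepted are additions for safety.

```python
from typing import Callable, List, Tuple


def iterate_until_converged(
    reference: str,
    run_round: Callable[[str], Tuple[str, int]],
    threshold: float = 0.02,
    max_iterations: int = 10,
) -> Tuple[str, List[int]]:
    """Repeat rounds until few additional variants are accepted.

    run_round (hypothetical) takes the current consensus and returns
    (new consensus, number of variants accepted in that round).
    """
    consensus = reference
    counts: List[int] = []
    for _ in range(max_iterations):
        consensus, n_accepted = run_round(consensus)
        counts.append(n_accepted)
        # Converged: nothing was accepted, or this round accepted fewer than
        # `threshold` times the number of variants from the first round.
        if n_accepted == 0 or (len(counts) > 1 and n_accepted < threshold * counts[0]):
            break
    return consensus, counts


# Toy usage: a fake round that appends a marker and reports made-up counts.
fake_counts = iter([1000, 200, 50, 30, 15, 5])
final, history = iterate_until_converged("ACGT", lambda c: (c + "x", next(fake_counts)))
print(history)  # prints [1000, 200, 50, 30, 15]
```

In the toy run the loop stops at the fifth round, the first in which the count (15) falls below 2% of the first-round count (20).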

Variant calling uses two algorithms: