IMR/DENOM short read genome assembler

This page contains resources relating to the IMR/DENOM package for assembling genomes from Illumina short read sequence data. The package was used to assemble 18 accessions of A. thaliana, as described in our paper Multiple reference genomes and transcriptomes for Arabidopsis thaliana, Nature 2011. IMR v0.1.0 has also been used in Mouse genomic variation and its effect on phenotypes and gene regulation, Nature 2011.

IMR/DENOM comprises three independent programs (IMR, DENOM and MCMERGE), devised and written by Xiangchao Gan. The programs can be run separately or as a pipeline.

Descriptions of the algorithms are available.

LICENCE

Free for non-profit research purposes. Please contact authors otherwise. The program itself may not be modified in any way and no redistribution is allowed.

No condition is made or to be implied, nor is any warranty given or to be implied, as to the accuracy of IMR/DENOM, or that it will be suitable for any particular purpose or for use under any specific conditions, or that the content or use of IMR/DENOM will not constitute or result in infringement of third-party rights.

Authors

Download

Currently the software is downloadable as precompiled Linux binaries. Version 0.4.1 is now available. Please send bug reports and comments to the authors. We also provide a step-by-step example of how we assembled the Bur-0 accession (download). Note that the data is about 3.3G and will probably take a long time to download. The commands used are listed in the example.

Installation

Make sure all the downloaded binaries are in a directory on your executable path.

IMR/DENOM needs samtools, picard and BWA. You may also need SOAPdenovo if you want to run de novo assembly. Users usually install these themselves, but you can install all of them privately for IMR/DENOM by simply running: bootstrap in your IMR/DENOM folder.

You may also need to install the read mapper of your choice. We suggest Stampy, which outperformed all the other mappers in our internal tests. If you use it, you need to install it yourself.


Running IMR/DENOM

    Prepare Project Description File

    A typical project might involve the assembly of more than one library with different insert sizes. The description file tells how to interpret the input files and group them together. Please read the details carefully. A simple example example1 is here and a more complicated example with multiple libraries example2 is also provided. Users can revise them accordingly.

    Single command mode

     imrdenom <proj_description_file> 
    will finish all steps of the assembly for you. However, we strongly encourage running IMR, DENOM and MCMERGE separately, for better parameter tuning and a lower computational burden.
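As a rough illustration of running the three programs separately, the sketch below only builds the command lines documented on this page (imr easyrun, denom soapinteface, mcmerge easyrun) and prints them; it does not run anything. The function name pipeline_commands and the file names imr.bam, merged.sdi, TAIR.fa and sim.bam are hypothetical, and the intermediate file locations are taken from the IMR and DENOM output sections further down this page.

```python
import shlex
from pathlib import PurePosixPath


def pipeline_commands(description_file, project_folder, reference, simbam,
                      imr_bam="imr.bam", merged_sdi="merged.sdi"):
    """Build (not run) the three command lines for a step-by-step assembly.

    project_folder is the output folder named in the project description file.
    imr_bam and merged_sdi are hypothetical output names chosen for this sketch.
    """
    project = PurePosixPath(project_folder)
    denom_dir = project / "soapassembly"
    return [
        # Step 1: IMR, writing the bam file that MCMERGE needs via --outbam.
        ["imr", "easyrun", "--outbam", imr_bam, description_file],
        # Step 2: DENOM through its SOAPdenovo interface.
        ["denom", "soapinteface", description_file],
        # Step 3: MCMERGE on the IMR and DENOM results plus the simulation bam.
        ["mcmerge", "easyrun", "--outfile", merged_sdi,
         reference,
         str(project / "pro" / "result_imr.sdi"),
         str(denom_dir / "soap4denom.sdi"),
         imr_bam,
         str(denom_dir / "soap4denom.bam"),
         simbam],
    ]


for command in pipeline_commands("example.t", "myproject", "TAIR.fa", "sim.bam"):
    print(shlex.join(command))
```

Printing the commands first makes it easy to check the paths and to tune the options of each step before launching it by hand.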

Output

    IMR/DENOM produces a single .sdi file, in sdi format, containing all detected variants. It is written to the subfolder mcmerge/ of your project folder, which is defined in your project description file. If needed, the fasta file can be generated easily from the .sdi file using imr getgenome.

The sdi (Snps, Deletions and Insertions) file format

    Each line of an sdi file consists of the columns chromosome, position, length, reference base, consensus base, quality value [\*0-9] (* or numeric value from 0-255).

    • chromosome The same name as the chromosome id in reference fasta file
    • position 1-based leftmost position
    • length the length difference of the changed sequence against reference (0 for SNPs, negative for deletions, positive for insertions)
    • reference base [-A-Z]+ (regular expression range)
    • consensus base [-A-Z]+ (regular expression range); IUPAC codes are used for heterozygous sites
    • quality value * means no value is available; [0-9]+ gives the quality of this variant. It is not necessarily a Phred quality.
    Some examples are:
    Chr1    723      0      C           T       2
    Chr1    2719    -4      TGCA        -       1
    Chr1    6786     1      -           T       1
    Chr1    16786   -4      AGGCA       T       1
    
    The columns after the 6th are optional. In the IMR/DENOM output, the 7th to 9th columns are: HMQ coverage (the coverage from reads with mapq >= 30), SNP Phred score, and HMQ consensus base (the consensus base when considering only reads with mapq >= 30).
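The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the IMR/DENOM package: the names SdiRecord, parse_sdi_line, classify and length_is_consistent are hypothetical, and treating "-" as an empty allele when checking the length column is an assumption inferred from the four examples above.

```python
from typing import NamedTuple, Optional


class SdiRecord(NamedTuple):
    chrom: str
    pos: int              # 1-based leftmost position
    length: int           # 0 for SNPs, negative for deletions, positive for insertions
    ref: str
    cons: str
    qual: Optional[int]   # None when the file holds "*"
    extra: tuple          # optional columns after the 6th, kept as strings


def parse_sdi_line(line):
    """Split one whitespace-separated sdi line into its six mandatory columns."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    chrom, pos, length, ref, cons, qual = fields[:6]
    return SdiRecord(chrom, int(pos), int(length), ref, cons,
                     None if qual == "*" else int(qual), tuple(fields[6:]))


def classify(rec):
    """Name the variant type from the sign of the length column."""
    if rec.length == 0:
        return "SNP"
    return "deletion" if rec.length < 0 else "insertion"


def _allele_length(allele):
    # Assumption: "-" stands for an empty allele.
    return 0 if allele == "-" else len(allele)


def length_is_consistent(rec):
    """Check that length equals consensus length minus reference length."""
    return _allele_length(rec.cons) - _allele_length(rec.ref) == rec.length


examples = [
    "Chr1    723      0      C           T       2",
    "Chr1    2719    -4      TGCA        -       1",
    "Chr1    6786     1      -           T       1",
    "Chr1    16786   -4      AGGCA       T       1",
]
for line in examples:
    rec = parse_sdi_line(line)
    print(rec.chrom, rec.pos, classify(rec), length_is_consistent(rec))
```

The optional columns are kept as raw strings in extra, since their meaning depends on the program that wrote the file.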



IMR

IMR Installation

Make sure the downloaded binary file imr is on your executable path. In what follows, we denote the directory containing the binary as $IMR_FOLDER.

The IMR-DENOM pipeline requires additional packages to be installed in order to perform read mapping and de novo assembly.

Download and install your read mapper of choice:

(Support for other read mappers such as Bowtie, soap2 and zoom will be added soon.)

Make sure the read mapper is on your executable path. (Note: if Stampy is used, you also need to install BWA and a correct version of python.)

Download picard and install it. You need to tell IMR where picard is located on your system, either by setting the environment variable PICARD_PATH or by downloading the picard binary files to $IMR_FOLDER/external/. Note: picard needs java, so you also need to make sure the correct java version is installed on your system.

Download samtools and install it.



Running IMR

IMR uses the project description text file mentioned above. Once it is available, the default way to run IMR is:

imr easyrun example.t
To use bwa as the mapper rather than the default stampy:
imr easyrun  -m bwa example.t
To use IMR to align all reads to the reference without iteration, creating a single bam file for visualization or other analysis:
imr easyrun --imrnocall example.t
To use IMR to call variants from an existing bam file, without iterations:
imr imrcall  [options] {ref} {bamfile} [region...]
Other options of imr easyrun:
  --help                  produce help message
  -o [ --outputfile ] arg set the output sdi file
  -e [ --outbam ] arg     output the new bam file
  -f [ --format ] arg     file format used for preprocessing
  --imrnocall             Only map reads and merge bam files, no variant call
  --imrkeepdup            For the merged bam files, do not remove duplicates
  --imrstartfrommap       Start to map raw reads. It can reuse the previous 
                          finished part.
  --imrstartfromcall      Start from variant calling, no mapping or merging
  --mergeall              Merge all reads-group together then deal 
                          with(remove/keep) duplicate
  -m [ --mapper ] arg     the name of program used for mapping: 
                          bwa/maq/stampy/smalt [=stampy]
  --iterations arg        The number of rounds of iterations [=5]
  --iterstartfrom arg     Start Iteration from which round [=1]
  -p [ --threads ] arg    Maximum processors used, can be set in configure file
                          too [=4]
  -q [ --qual ] arg       fastq File format used: sanger,solexa,solexaold,usepr

IMR Parallel Computation Support

The alignment and analysis of next generation sequencing data are time-consuming. Even a common multi-core or multi-processor PC can benefit from IMR's parallel computation support by aligning several lanes simultaneously.

Multi-threading is enabled by setting the threads variable in the project description file. For example,

 threads 4
will align at most 4 lanes at one time.

Error Recovery

When an error occurs, such as loss of power or of access to network storage, it is unnecessary to rerun everything from scratch. Instead, IMR can be restarted. For example:

imr easyrun -q usepre --iterstartfrom 2  example.t
will rerun IMR from the second iteration.
imr easyrun -q usepre --iterstartfrom 2  --imrstartfromcall example.t
will rerun IMR from the second iteration, starting from variant calling.

Output

IMR produces three types of output files:

  1. A series of updated reference fasta files representing the state of the reference sequence at each iteration. For simplicity, we hereafter assume that the project folder, set by outputfolder in the project description file, is $sequencing_project. The new reference files can be found in the folder $sequencing_project/. Their names take the form newref_*.fa, with * running from A to Z:
    newref_A.fa      (The changed reference after the first iteration)
    newref_B.fa       (The changed reference after the second iteration)
    newref_C.fa       (The changed reference after the third iteration)
    ....	                                                    
    
  2. a series of bam files representing the reads aligned to the genome sequences in each iteration.
    $sequencing_project/A/($project_basename).bam (reads aligned to the original reference)
    $sequencing_project/B/($project_basename)_B.bam (reads aligned to the newref_A.fa)
    $sequencing_project/C/($project_basename)_C.bam (reads aligned to the newref_B.fa)
    ...
    

    If --outbam is set, the specified file will be the same as $sequencing_project/A/($project_basename)_A.bam, which is often used by MCMERGE or other variant calling algorithms.

  3. The sequence differences (SNPs and INDELs) between the original reference and the genome investigated (final iterated reference). All variants are available in a single sdi file, $sequencing_project/pro/result_imr.sdi. If --outputfile is set, the specified file will also be created.
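The letter-based naming of the per-iteration files can be captured in a few helper functions. This is a sketch with hypothetical names (iteration_letter, updated_reference, reference_used_in, last_completed_iteration). In particular, treating the presence of a newref_*.fa file as proof that the iteration finished is an assumption made for illustration, not documented behaviour.

```python
import re
import string
from pathlib import PurePosixPath


def iteration_letter(n):
    """Iteration 1 is labelled A, iteration 2 is B, and so on up to Z."""
    if not 1 <= n <= 26:
        raise ValueError("iterations are labelled A to Z")
    return string.ascii_uppercase[n - 1]


def updated_reference(project_folder, n):
    """Path of the reference written after iteration n."""
    return str(PurePosixPath(project_folder) / f"newref_{iteration_letter(n)}.fa")


def reference_used_in(project_folder, n, original_reference):
    """Reference the reads are aligned to in iteration n."""
    if n == 1:
        return original_reference
    return updated_reference(project_folder, n - 1)


def last_completed_iteration(filenames):
    """Highest iteration with a newref_<letter>.fa file present, or 0 if none.

    Assumption: a file that exists belongs to a finished iteration.
    """
    done = 0
    for name in filenames:
        match = re.fullmatch(r"newref_([A-Z])\.fa", name)
        if match:
            done = max(done, string.ascii_uppercase.index(match.group(1)) + 1)
    return done


print(reference_used_in("myproject", 3, "TAIR.fa"))
print(last_completed_iteration(["newref_A.fa", "newref_B.fa", "notes.txt"]))
```

A helper of this kind could be used to pick a value for --iterstartfrom after an interrupted run, after checking by hand that the last reference file is complete.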


DENOM

DENOM aligns contigs obtained from de novo assembly to a reference genome and calls variants (i.e. differences between the contigs and the reference). In principle it can handle short read data too, but this has not yet been tested extensively. It is designed to reassemble homozygous genomes, e.g. inbred strains or haploid organisms, where a reference genome is available that is sufficiently similar to the genome of the assembled sample.

DENOM is not designed to replace de novo assembly algorithms. On the contrary, it is designed to enhance them. Current de novo assemblers usually produce a large number of contigs (which may be scaffolded together to a limited extent), rather than complete chromosome sequences. DENOM is designed to produce complete chromosome sequences from these contigs.

DENOM is also complementary to IMR, in the sense that it can be used to integrate de novo contigs with the output of IMR. In assembling Arabidopsis thaliana, we have found that their combination improves on both IMR and DENOM applied to the original reference genomes, especially in repetitive regions. Therefore we strongly suggest running both DENOM and IMR and then merging their results using MCMERGE. DENOM itself, however, is independent.

INSTALLING DENOM

BWA and SAMTOOLS must be installed on your system. Make sure they are on your executable path.

Running DENOM

    Option 1: Running DENOM through the interface to SOAPdenovo

    Please install SOAPdenovo v12.04+ first.

     denom soapinteface <descriptionfile>
    
    The description file is exactly the same one used for IMR.

    Output : DENOM creates the following files:

    $sequencing_project/soapassembly/soap4denom.contig { SOAPdenovo output}
    $sequencing_project/soapassembly/soap4denom.bam
    $sequencing_project/soapassembly/soap4denom.sdi
    
    The output file soap4denom.sdi, in sdi format, contains all variants called by DENOM. It will be used by MCMERGE, together with the BAM file soap4denom.bam.

    Warning: SOAPdenovo usually requires a huge amount of memory (20G is needed for Arabidopsis with ~30x coverage), so we strongly suggest contacting your system administrator before running it. At WTCHG, a dedicated server is used to run this job.


    Option 2: Running DENOM when an assembled contig file is available

    Before running, it is necessary to assemble contigs using a de novo assembler. DENOM can directly use the results from SOAPdenovo, ABySS or Velvet, with SOAPdenovo strongly suggested. When a FASTA format file of contigs is available, you can run DENOM using the command below.

         denom easyrun <ref.fa>  <contig.fa> <out.bam> <out.sdi> 
    easyrun runs the following steps behind the scenes:
         denom fasiege         prepare the fasta file for mapping 
         denom premap          preliminary mapping using bwa 
         denom varcall         call the variants
    

    Output : DENOM creates an output file <out.sdi>, using the sdi format, and a BAM file <out.bam>.



MCMERGE

MCMERGE merges the variants called from different algorithms/solutions. Currently, it is tuned to merge the outputs from IMR and DENOM, but the algorithm can be easily extended.
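Before merging, it can be useful to see how far two call sets agree. The sketch below simply compares two sdi files by set operations on (chromosome, position, reference base, consensus base). It is not the MCMERGE algorithm, which also uses the bam files and the simulation data; the names variant_keys and compare_call_sets and the sample lines are hypothetical.

```python
def variant_keys(sdi_lines):
    """Collect (chromosome, position, reference base, consensus base) keys."""
    keys = set()
    for line in sdi_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        chrom, pos, _length, ref, cons = fields[:5]
        keys.add((chrom, int(pos), ref, cons))
    return keys


def compare_call_sets(imr_lines, denom_lines):
    """Split the variants into shared, IMR-only and DENOM-only sets."""
    imr = variant_keys(imr_lines)
    denom = variant_keys(denom_lines)
    return {
        "both": imr & denom,
        "imr_only": imr - denom,
        "denom_only": denom - imr,
    }


# Hypothetical sample data, in the sdi layout described on this page.
imr_calls = [
    "Chr1 723 0 C T 2",
    "Chr1 2719 -4 TGCA - 1",
]
denom_calls = [
    "Chr1 723 0 C T 1",
    "Chr1 6786 1 - T 1",
]
summary = compare_call_sets(imr_calls, denom_calls)
for name, variants in summary.items():
    print(name, sorted(variants))
```

The quality column is ignored in the comparison, so the same variant reported with different quality values counts as shared.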

Running MCMERGE

        mcmerge easyrun [options]  <ref>  <imr.sdi>  <denom.sdi>  <imrbam>  <denombam>  <simbam>
    
    options:
       --help                    produce help message
       -o [ --outfile ] arg      output file
       -p [ --process ] arg (=4) number of cores used
       -t [ --tmpdir ] arg (=./) directory for temporary files
    

    The input files imr.sdi and imrbam are produced by IMR, and denom.sdi and denombam are produced by DENOM. In addition, MCMERGE uses a simulation input file, simbam, to identify regions of the reference likely to produce unreliable results because of the mapping algorithm, repetitive regions or errors in the reference genome sequence. Ideally, simbam should be computed taking into account the number of reads and the read-length distributions in the original fastq files (including multiple libraries), using IMR to align all the simulated reads to the reference with the same parameters. In practice, we have found that this needs to be computed only once, as unreliable loci are quite stable. Therefore we recommend that simbam be downloaded directly from simulation bamfile. At the moment, simbam files for Arabidopsis, mouse and rat are available.

    You can also generate your own simbam file by running the script sim_imrdenom.pl with the reference genome as its parameter. It will generate two simulated read files and a project description file. Then run imr easyrun --imrnocall [-m bwa] pro_descript. You will get a bam file; that is the simbam you need.


    Get the assembled genome

       mcmerge getgenome [options] <ref> <last.sdi>
    
    options:
       --help                    produce help message
       -o [ --outfile ] arg      set the output file
       -p [ --process ] arg (=4) set the number of processors used
       -t [ --tmpdir ] arg (=./) set the directory for temporary files
    
    MCMERGE uses multi-threading, with the number of threads/cores controlled by the --process option.

Output

For the command mcmerge easyrun, the output is a file in sdi format. You can set the filename either with --outfile or by redirecting the console output (i.e. >filename).

For the command mcmerge getgenome, the output is a fasta file containing the sequences of all chromosomes. You can set the filename either with --outfile or by redirecting the console output (i.e. >filename).
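To make the relationship between an sdi file and the resulting sequence concrete, here is a minimal sketch of applying sdi records to one reference sequence. It is not the getgenome implementation. The function name apply_variants is hypothetical, "-" is assumed to stand for an empty allele, and an insertion with reference base "-" is assumed to be placed immediately before the base at the given position; the actual convention used by IMR/DENOM may differ. Heterozygous IUPAC codes are copied into the sequence unchanged.

```python
def apply_variants(reference_sequence, variants):
    """Return the sequence obtained by applying sdi-style variants.

    variants is a list of (position, reference_base, consensus_base) tuples
    for a single chromosome, with 1-based positions and no overlaps.
    """
    sequence = reference_sequence
    # Work from the right-hand end so earlier coordinates stay valid.
    for pos, ref, cons in sorted(variants, reverse=True):
        ref_allele = "" if ref == "-" else ref      # assumption: "-" is empty
        cons_allele = "" if cons == "-" else cons
        start = pos - 1                             # convert to 0-based index
        end = start + len(ref_allele)
        if sequence[start:end] != ref_allele:
            raise ValueError(f"reference mismatch at position {pos}")
        sequence = sequence[:start] + cons_allele + sequence[end:]
    return sequence


toy_reference = "ACGTACGT"
toy_variants = [
    (2, "C", "T"),    # SNP
    (4, "TA", "-"),   # deletion of two bases
    (8, "-", "GG"),   # insertion, placed before position 8 (assumed convention)
]
print(apply_variants(toy_reference, toy_variants))
```

Checking the reference base before each change catches sdi files that were produced against a different reference.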