This page contains resources relating to the IMR/DENOM package for assembling genomes from Illumina short-read sequence data. This package was used to assemble 18 accessions of A. thaliana, as described in our paper "Multiple reference genomes and transcriptomes for Arabidopsis thaliana", Nature 2011. IMR v0.1.0 has also been used in "Mouse genomic variation and its effect on phenotypes and gene regulation", Nature 2011.
IMR/DENOM comprises three independent programs (IMR, DENOM and MCMERGE), devised and written by Xiangchao Gan. The programs can be run separately or as a pipeline.
Descriptions of the algorithms are available.
Free for non-profit research purposes. Please contact authors otherwise. The program itself may not be modified in any way and no redistribution is allowed.
No condition is made or to be implied, nor is any warranty given or to be implied, as to the accuracy of IMR/DENOM, or that it will be suitable for any particular purpose or for use under any specific conditions, or that the content or use of IMR/DENOM will not constitute or result in infringement of third-party rights.
Authors
Currently the software is downloadable as precompiled Linux binaries. Version 0.4.1 is now available. Please send bug reports and comments to the authors. We also provide a step-by-step example of how we assembled the Bur-0 accession (download). Be aware that the data is about 3.3 GB and will probably take a long time to download. The commands used are listed in the example.
Make sure all the downloaded binaries are in a directory on your executable path.
IMR/DENOM needs samtools, picard and BWA. You will also need SOAPdenovo if you want to run de novo assembly. Users usually install these themselves, but you can install all of them privately for IMR/DENOM by simply running:
bootstrap
in your IMR/DENOM folder.
Users may also need to install the read mapper of their choice. We suggest Stampy, which outperformed all the other mappers in our internal tests. If you use it, you need to install it yourself.
A typical project might involve the assembly of more than one library with different insert sizes. The project description file tells IMR/DENOM how to interpret the input files and how to group them together. Please read the details carefully. A simple example (example1) is available, and a more complicated example with multiple libraries (example2) is also provided. Users can adapt them to their own projects.
imrdenom <proj_description_file>
will run all steps of the assembly for you. However, we strongly encourage users to run IMR, DENOM and MCMERGE separately, for better parameter tuning and a lower computational burden.
imr getgenome
with .sdi file if needed.
Each line of an sdi file consists of the columns chromosome, position, length, reference base, consensus base and quality value (* or a numeric value from 0 to 255).
Chr1 723 0 C T 2
Chr1 2719 -4 TGCA - 1
Chr1 6786 1 - T 1
Chr1 16786 -4 AGGCA T 1
The columns after the 6th are optional. In the IMR/DENOM output, columns 7 to 9 are: HMQ coverage (the coverage from reads with mapq >= 30), SNP Phred score, and HMQ consensus base (the consensus base when considering only reads with mapq >= 30).
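As an illustration of the column layout described above, here is a minimal Python sketch of an sdi reader. It is not part of IMR/DENOM. The names SdiRecord, parse_sdi_line and length_is_consistent are hypothetical, and the sketch assumes that columns are separated by whitespace. The check that the length column equals the consensus length minus the reference length (with "-" counted as empty) is inferred from the four example rows only, not from a format specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SdiRecord:
    chrom: str
    pos: int
    length: int              # signed length change, e.g. -4 in the example rows
    ref: str                 # reference bases, "-" when there are none
    cons: str                # consensus bases, "-" when there are none
    quality: Optional[int]   # None when the column holds "*"
    extra: List[str]         # optional columns 7 onward, left unparsed


def parse_sdi_line(line: str) -> SdiRecord:
    """Parse one sdi line (assumption: whitespace-separated columns)."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least 6 columns, got %d" % len(fields))
    chrom, pos, length, ref, cons, qual = fields[:6]
    quality = None if qual == "*" else int(qual)
    return SdiRecord(chrom, int(pos), int(length), ref, cons, quality, fields[6:])


def length_is_consistent(rec: SdiRecord) -> bool:
    """Check length == len(consensus) - len(reference), counting "-" as empty.

    This relation is inferred from the example rows, not from a specification.
    """
    ref_len = 0 if rec.ref == "-" else len(rec.ref)
    cons_len = 0 if rec.cons == "-" else len(rec.cons)
    return rec.length == cons_len - ref_len


if __name__ == "__main__":
    for row in ["Chr1 723 0 C T 2", "Chr1 2719 -4 TGCA - 1"]:
        rec = parse_sdi_line(row)
        print(rec.chrom, rec.pos, rec.length, length_is_consistent(rec))
```

Keeping the optional columns as raw strings avoids guessing their types when a file comes from a tool other than IMR/DENOM.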
Make sure the downloaded binary file imr
is on your
executable path. In what follows, we denote the directory containing
the binary as $IMR_FOLDER.
The IMR/DENOM pipeline requires additional packages to be installed in order to perform read mapping and de novo assembly.
Download and install your read mapper of choice:
(Support for other mappers such as Bowtie, soap2 and zoom is planned.) Make sure the read mapper is on your executable path. (Note: if Stampy is used, you also need to install BWA and a correct version of Python.)
Download picard and install it. You need to tell IMR where picard is located on your system, either by setting the environment variable PICARD_PATH or by downloading the picard binary files to $IMR_FOLDER/external/. Note: picard needs Java, so you also need to make sure the correct Java version is installed on your system.
Download samtools and install it.
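Before running IMR it can be useful to confirm that the dependencies listed above are actually visible. The following Python sketch is one way to do that and is not shipped with IMR/DENOM. The helper names find_tools and picard_dir are hypothetical. The executable names samtools, bwa and java are the usual ones for those packages. Treating PICARD_PATH as a directory, and checking it before $IMR_FOLDER/external/, are assumptions made for this sketch; the text above does not say which location IMR prefers.

```python
import os
import shutil
from typing import Dict, List, Mapping, Optional


def find_tools(names: List[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def picard_dir(env: Mapping[str, str], imr_folder: str) -> Optional[str]:
    """Return a directory that may hold picard, or None.

    Assumption: PICARD_PATH names a directory and is checked first,
    then <imr_folder>/external is used as a fallback.
    """
    candidate = env.get("PICARD_PATH")
    if candidate and os.path.isdir(candidate):
        return candidate
    fallback = os.path.join(imr_folder, "external")
    if os.path.isdir(fallback):
        return fallback
    return None


if __name__ == "__main__":
    # IMR_FOLDER is read from the environment here purely for illustration.
    imr_folder = os.environ.get("IMR_FOLDER", ".")
    for tool, path in find_tools(["imr", "samtools", "bwa", "java"]).items():
        print("%-10s %s" % (tool, path or "NOT FOUND on executable path"))
    print("picard     %s" % (picard_dir(os.environ, imr_folder) or "NOT FOUND"))
```

If you use Stampy or another mapper, add its executable name to the list passed to find_tools.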
IMR uses the project description text file described above. Once it is available, the default way to run IMR is:
imr easyrun example.t
To use bwa as the mapper rather than the default stampy:
imr easyrun -m bwa example.t
To use IMR to align all reads to the reference without iteration, creating a single bam file for visualization or other analysis:
imr easyrun --imrnocall example.t
To use IMR to call variants from an existing bam file, without iterations:
imr imrcall [options] {ref} {bamfile} [region...]
Other options of imr easyrun:
--help                     produce help message
-o [ --outputfile ] arg    set the output sdi file
-e [ --outbam ] arg        output the new bam file
-f [ --format ] arg        file format used for preprocessing
--imrnocall                Only map reads and merge bam files, no variant call
--imrkeepdup               For the merged bam files, do not remove duplicates
--imrstartfrommap          Start to map raw reads. It can reuse the previous finished part.
--imrstartfromcall         Start from variant calling, no mapping or merging
--mergeall                 Merge all reads-group together then deal with(remove/keep) duplicate
-m [ --mapper ] arg        the name of program used for mapping: bwa/maq/stampy/smalt [=stampy]
--iterations arg           The number of rounds of iterations [=5]
--iterstartfrom arg        Start Iteration from which round [=1]
-p [ --threads ] arg       Maximum processors used, can be set in configure file too [=4]
-q [ --qual ] arg          fastq File format used: sanger,solexa,solexaold,usepr
The alignment and analysis of next generation sequencing data are time-consuming. Even a common multi-core or multi-processor PC can benefit from IMR's parallel computation support by aligning several lanes simultaneously. Multi-threading is used as follows:
maxreads XX
in the project description file), and then align them simultaneously. This optimization can also be cancelled by setting
imr easyrun -q nopre
on the command line or in the project description file using the parameter
prepara -q nopre
By setting the threads variable in the project description file. For example,
threads 4
will align at most 4 lanes at one time.
When an error occurs, such as loss of power or of access to network storage, it is unnecessary to rerun everything from scratch. Instead, IMR can be restarted. For example:
imr easyrun -q usepre --iterstartfrom 2 example.t
will rerun IMR from the second iteration.
imr easyrun -q usepre --iterstartfrom 2 --imrstartfromcall example.t
will rerun IMR from the second iteration, starting from variant calling.
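When restarts are scripted, for example from a job scheduler wrapper, the two restart commands above can be assembled programmatically. The Python sketch below only builds the argument list and does not run anything. The function name imr_restart_command is hypothetical. It uses only the flags shown in the examples above (-q usepre, --iterstartfrom, --imrstartfromcall).

```python
from typing import List


def imr_restart_command(description_file: str,
                        start_iteration: int,
                        from_call: bool = False) -> List[str]:
    """Build the argument list for restarting IMR at a given iteration.

    from_call=True adds --imrstartfromcall, i.e. restart from variant
    calling instead of from mapping.
    """
    if start_iteration < 1:
        raise ValueError("iterations are numbered from 1")
    cmd = ["imr", "easyrun", "-q", "usepre",
           "--iterstartfrom", str(start_iteration)]
    if from_call:
        cmd.append("--imrstartfromcall")
    cmd.append(description_file)
    return cmd


if __name__ == "__main__":
    print(" ".join(imr_restart_command("example.t", 2)))
    print(" ".join(imr_restart_command("example.t", 2, from_call=True)))
```

The returned list can be passed directly to subprocess.run if you want the script to launch IMR itself.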
IMR produces three types of output files:
outputfolder
in the project description is $sequencing_project. The new reference files can be found under the folder $sequencing_project/. Their names take the form newref_*.fa, with * running from A to Z:
newref_A.fa (the changed reference after the first iteration)
newref_B.fa (the changed reference after the second iteration)
newref_C.fa (the changed reference after the third iteration)
...
$sequencing_project/A/($project_basename).bam (reads aligned to the original reference)
$sequencing_project/B/($project_basename)_B.bam (reads aligned to newref_A.fa)
$sequencing_project/C/($project_basename)_C.bam (reads aligned to newref_B.fa)
...
If --outbam
is set, the specified file will be the same as $sequencing_project/A/($project_basename)_A.bam, which is often used by MCMERGE or other variant-calling algorithms.
If --outputfile
is set, the specified file will also be created.
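The lettering scheme for the per-iteration outputs can be summarised in a few lines of Python. This sketch is for illustration only. The function names are hypothetical, and it encodes just what is stated above: iteration 1 maps to A, iteration 2 to B and so on up to Z, and the reads of each iteration after the first are aligned to the reference produced by the previous iteration.

```python
import string


def iteration_letter(iteration: int) -> str:
    """Letter used for a 1-based iteration number: 1 -> A, 2 -> B, ... 26 -> Z."""
    if not 1 <= iteration <= 26:
        raise ValueError("iteration must be between 1 and 26")
    return string.ascii_uppercase[iteration - 1]


def new_reference_name(iteration: int) -> str:
    """File name of the changed reference written after this iteration."""
    return "newref_%s.fa" % iteration_letter(iteration)


def alignment_target(iteration: int) -> str:
    """Reference that the reads of this iteration are aligned to."""
    if iteration == 1:
        return "original reference"
    return new_reference_name(iteration - 1)


if __name__ == "__main__":
    for i in range(1, 4):
        print("iteration %d: folder %s, aligned to %s, writes %s"
              % (i, iteration_letter(i), alignment_target(i), new_reference_name(i)))
```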
DENOM aligns contigs obtained from de novo assembly to a reference genome and calls variants (i.e. differences between the contigs and the reference). In principle it can handle short-read data too, but this has not yet been tested extensively. It is designed to reassemble homozygous genomes, e.g. inbred strains or haploid organisms, where a reference genome is available that is sufficiently similar to the genome of the assembled sample.
DENOM is not designed to replace de novo assembly algorithms. On the contrary, it is designed to enhance them. Current de novo assemblers usually produce a large number of contigs (which may be scaffolded together to a limited extent), rather than complete chromosome sequences. Producing the latter is what DENOM is designed for.
DENOM is also complementary to IMR, in the sense that it can be used to integrate de novo contigs with the output of IMR. In assembling Arabidopsis thaliana, we have found that their combination improves on both IMR and DENOM applied to the original reference genomes, especially in repetitive regions. Therefore we strongly suggest running both DENOM and IMR and then merging their results using MCMERGE. DENOM can nevertheless be run independently.
BWA and SAMTOOLS must be installed on your system. Make sure they are on your executable path.
Please install SOAPdenovo v12.04+ first.
denom soapinteface <descriptionfile>
The contigfile is exactly the same one used for IMR.
Output: DENOM creates the following files:
$sequencing_project/soapassembly/soap4denom.contig {SOAPdenovo output}
$sequencing_project/soapassembly/soap4denom.bam
$sequencing_project/soapassembly/soap4denom.sdi
The output file soap4denom.sdi, which uses the sdi format and contains all variants called by DENOM, and the BAM file soap4denom.bam will be used by MCMERGE.
Warning: Since SOAPdenovo usually needs a huge amount of memory (20 GB for Arabidopsis with ~30x coverage), we strongly suggest that you contact your system administrator before running this. At WTCHG, a dedicated server is used to run this job.
Before running, it is necessary to assemble contigs using a de novo assembler. DENOM can directly use the result from SOAPdenovo, ABySS or Velvet, with SOAPdenovo strongly suggested. When a FASTA format file of contigs is available, you can run DENOM using the command below.
denom easyrun <ref.fa> <contig.fa> <out.bam> <out.sdi>
The steps behind easyrun:
denom fasiege    prepare the fasta file for mapping
denom premap     preliminary mapping using bwa
denom varcall    call the variants
Output: DENOM creates an output file <out.sdi>, using the sdi format, and a BAM file <out.bam>.
MCMERGE merges the variants called from different algorithms/solutions. Currently, it is tuned to merge the outputs from IMR and DENOM, but the algorithm can be easily extended.
mcmerge easyrun [options] <ref> <imr.sdi> <denom.sdi> <imrbam> <denombam> <simbam>
options:
--help                      produce help message
-o [ --outfile ] arg        output file
-p [ --process ] arg (=4)   number of cores used
-t [ --tmpdir ] arg (=./)   directory for temporary files
The input files imr.sdi and imrbam are produced by IMR, and denom.sdi and denombam are produced by DENOM. In addition, MCMERGE uses a simulation input file, simbam, to identify regions of the reference likely to produce unreliable results, caused by the mapping algorithm, repetitive regions or errors in the reference genome sequence.
Ideally, simbam should be computed taking into account the number of reads and the read-length distributions in the original fastq files (including multiple libraries), using IMR to align all the simulated reads to the reference with the same parameters. In practice, we have found that it need only be computed once, as unreliable loci are quite stable. Therefore we recommend that simbam be directly downloaded from simulation bamfile. At the moment, simbam files for Arabidopsis, mouse and rat are available.
You can also generate your own simbam file by running the script sim_imrdenom.pl
with the reference genome as its parameter. It will generate two simulated read files and a project description file. Then run
imr easyrun --imrnocall [-m bwa] pro_descript
You will get a bam file. That is the simbam you need.
mcmerge getgenome [options] <ref> <last.sdi>
options:
--help                      produce help message
-o [ --outfile ] arg        set the output file
-p [ --process ] arg (=4)   set the number of processors used
-t [ --tmpdir ] arg (=./)   set the directory for temporary files
MCMERGE uses multi-threading, with the number of threads/cores controlled by setting the --process option.
For the command mcmerge easyrun, the output is a file in sdi format. You can set the filename either with --outfile or by redirecting the console output (i.e. >filename).
For the command mcmerge getgenome, the output is a fasta file containing sequences for all chromosomes. You can set the filename either with --outfile or by redirecting the console output (i.e. >filename).