
Imputation and Association Analysis in Arabidopsis thaliana MAGIC recombinant inbred lines

Beta Version

This page contains resources relating to the MAGIC population of recombinant inbred lines of Arabidopsis thaliana, with a focus on data and software for genetic association analysis. The MAGIC lines are described in our paper A Multiparent Advanced Generation Inter-Cross to Fine-Map Quantitative Traits in Arabidopsis thaliana.

Each MAGIC genome is a mosaic of the 19 founder genomes. In our PLoS Genetics paper we genotyped the lines at 1200 SNPs spaced about 100 kb apart in order to infer these mosaics probabilistically. To date over 700 lines have been genotyped.

703 of these lines were originally genotyped at 1260 SNPs, and the data and R analysis software for QTL mapping using the original data are available here.

We have now resequenced almost 500 of these lines at low coverage in order to obtain about 500k SNPs for each line. The mosaics are then inferred by a dynamic programming algorithm, akin to finding the Viterbi path of a hidden Markov model. Importantly, the breakpoints in the mosaics can be mapped at high precision, usually to within 1 kb. These genome sequences can now be used for association analysis in the MAGIC lines. To do this, we exploit the genome sequences of the 19 founders of the MAGIC population. These genomes are described in our paper Multiple reference genomes and transcriptomes for Arabidopsis thaliana (Nature, 2011).
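The dynamic programming idea described above can be illustrated with a small Python sketch. It is not the code of the reconstruction program: the function name reconstruct_mosaic, the unit cost per mismatching SNP and the fixed switch penalty are all assumptions chosen for illustration. The sketch finds the lowest-cost path through the founders, where staying on a founder is free and changing founder costs a penalty.

```python
def reconstruct_mosaic(observed, founders, switch_penalty=2.0):
    """Return the lowest-cost founder label for every site.

    observed: list of called alleles, one per SNP site (None = no call).
    founders: dict mapping founder name -> list of alleles at the same sites.
    Cost model (an assumption for this sketch): 1 per mismatching site,
    switch_penalty for each change of founder along the chromosome.
    """
    names = list(founders)
    if not observed:
        return []

    def mismatch(name, i):
        call = observed[i]
        return 0.0 if call is None or founders[name][i] == call else 1.0

    # cost[f] = best cost of any path that ends on founder f at the current site
    cost = {f: mismatch(f, 0) for f in names}
    back = []  # back[i-1][f] = founder used at site i-1 on the best path to f at site i
    for i in range(1, len(observed)):
        best_prev = min(names, key=lambda f: cost[f])
        new_cost, pointers = {}, {}
        for f in names:
            stay = cost[f]
            switch = cost[best_prev] + switch_penalty
            if stay <= switch:
                new_cost[f], pointers[f] = stay + mismatch(f, i), f
            else:
                new_cost[f], pointers[f] = switch + mismatch(f, i), best_prev
        back.append(pointers)
        cost = new_cost

    # Trace back from the cheapest final state
    state = min(names, key=lambda f: cost[f])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy data with two founders that differ at every site (invented for illustration)
    toy = {"col-0": list("AAAAAA"), "ler-0": list("CCCCCC")}
    print(reconstruct_mosaic(list("AAACCC"), toy))
    # ['col-0', 'col-0', 'col-0', 'ler-0', 'ler-0', 'ler-0']
```

With a penalty of 2, a single discordant call is treated as an error rather than a breakpoint, whereas a run of discordant calls produces a change of founder. A larger penalty gives fewer, longer mosaic segments.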

We have written two linked programs in C: reconstruction, to infer breakpoints from low-coverage sequence data and to impute the genomes of the MAGIC lines, and genome_scan, to perform association mapping on the imputed genomes. These programs are of wider utility than just Arabidopsis MAGIC lines: the imputation code will work on any population (not necessarily inbred) whose genomes are mosaics of a set of known founders.

We provide the raw SNP calls used to impute the genomes along with the imputed MAGIC genomes. If you just want to perform association mapping then there is no need to download the SNP calls.

Getting and Installing the Software

  1. Download the package magic_src_v4.0.tar.gz. Extract the archive using the command tar xvf magic_src_v4.0.tar.gz. This will create a subdirectory ./SRC.
  2. To compile the two programs, type cd ./SRC; ./compile.csh
  3. Finally, add the full path to the SRC directory to your $PATH environment variable.

Getting the low-coverage SNP data

  1. Create a directory, say ./MAGIC, to hold the data and cd to it.
  2. Issue the command wget -r --no-directories 'http://mtweb.cs.ucl.ac.uk/mus/www/POOLING/ARABIDOPSIS/FOUNDER/MAGIC.NEW/'
  3. The data will be downloaded to the current directory. Each MAGIC line is represented by a separate text file named after the corresponding MAGIC line (e.g. MAGIC.100). Don't put any other files in the MAGIC directory - this directory is searched by the reconstruction program and every file in it is read into the program.
  4. Each file contains many rows, one per SNP, each with four columns giving the chromosome, TAIR10 bp position, reference allele and called allele, e.g.
    Chr1 303 T C
    Chr1 331 A T
    Chr1 341 T T
    Chr1 346 C C
    Chr1 425 C C
    Chr1 429 G G
      
  5. Download a text file transposons.gff containing the positions of transposons in the TAIR10 genome - these regions are used to mask out potentially unreliable regions during the imputation.
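As a quick check on the downloaded data, the four-column SNP call format described in step 4 can be read with a few lines of Python. This is a minimal sketch and not part of the MAGIC software: the names SnpCall, parse_snp_calls and non_reference_calls are hypothetical, and it assumes the columns are separated by whitespace.

```python
from typing import Iterable, Iterator, List, NamedTuple


class SnpCall(NamedTuple):
    chrom: str   # chromosome, e.g. "Chr1"
    pos: int     # TAIR10 bp position
    ref: str     # reference allele
    called: str  # allele called in this MAGIC line


def parse_snp_calls(lines: Iterable[str]) -> Iterator[SnpCall]:
    """Yield one SnpCall per row; rows without exactly four fields are skipped."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        chrom, pos, ref, called = fields
        yield SnpCall(chrom, int(pos), ref, called)


def non_reference_calls(calls: Iterable[SnpCall]) -> List[SnpCall]:
    """Keep only the sites where the called allele differs from the reference."""
    return [c for c in calls if c.called != c.ref]


if __name__ == "__main__":
    # The example rows shown in step 4
    example = """Chr1 303 T C
Chr1 331 A T
Chr1 341 T T
Chr1 346 C C
Chr1 425 C C
Chr1 429 G G"""
    calls = list(parse_snp_calls(example.splitlines()))
    for call in non_reference_calls(calls):
        print(call.chrom, call.pos, call.ref, "->", call.called)
```

To read a real file, pass an open file handle instead of the example lines, for instance parse_snp_calls(open("MAGIC.100")).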

Getting the variants in the 19 MAGIC founder genomes

  1. Over 3 million sequence variants segregate in the 19 MAGIC founders. They are tabulated in chromosome-specific files that can be downloaded from here.
  2. These files contain the sequence variants in the 19 founder genomes, aligned against a common coordinate system called the pseudogenome (essentially the coordinate system of the multiple alignment of all the genomes, so that each genome can be derived from it by deletion or substitution, but never by insertion). The columns are:
    chr the chromosome
    pse.bp the bp coordinate in the pseudogenome coordinate system. In general pse.bp > bp
    bp the bp coordinate against the Col-0 reference. If the number is of the form "N.5" then the allele should be inserted after reference coordinate N
    nalleles the number of distinct alleles at the site
    maf the MAJOR allele frequency (this is more informative than the minor allele frequency when there are more than two alleles)
    The remaining columns give the alleles for the 19 genomes in the order: bur-0 can-0 col-0 ct-1 edi-0 hi-0 kn-0 ler-0 mt-0 no-0 oy-0 po-0 rsch-4 sf-2 tsu-0 wil-2 ws-0 wu-0 zu-0. Note that some variants are not simple SNPs or indels, but comprise complex imbalanced substitutions.
  3. Download these files and put them in a directory, e.g. ./VARIANTS.TABLES
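The column layout of the variant tables described in step 2 can be sketched in Python as follows. This is an illustration only: the function name parse_variant_row is hypothetical, the rows in the example are invented, and the sketch assumes whitespace-separated columns, an integer pse.bp, and one non-empty token per founder allele. Check these assumptions against the real files, and skip any header line, before relying on it.

```python
# Founder order as listed in the column description above
FOUNDERS = ("bur-0 can-0 col-0 ct-1 edi-0 hi-0 kn-0 ler-0 mt-0 no-0 "
            "oy-0 po-0 rsch-4 sf-2 tsu-0 wil-2 ws-0 wu-0 zu-0").split()


def parse_variant_row(line):
    """Split one variant-table row into named fields (assumed layout, see lead-in)."""
    fields = line.split()
    if len(fields) != 5 + len(FOUNDERS):
        raise ValueError("expected %d columns, got %d" % (5 + len(FOUNDERS), len(fields)))
    chrom, pse_bp, bp, nalleles, maf = fields[:5]
    # A bp of the form "N.5" marks an allele inserted after reference coordinate N
    whole, _, frac = bp.partition(".")
    return {
        "chr": chrom,
        "pse.bp": int(pse_bp),
        "ref_coord": int(whole),
        "is_insertion": frac == "5",
        "nalleles": int(nalleles),
        "maf": float(maf),  # MAJOR allele frequency
        "alleles": dict(zip(FOUNDERS, fields[5:])),
    }


if __name__ == "__main__":
    # Invented row: a SNP where only zu-0 carries the non-reference allele
    row = "Chr1 310 303 2 0.947 " + " ".join(["T"] * 18 + ["C"])
    variant = parse_variant_row(row)
    print(variant["ref_coord"], variant["alleles"]["col-0"], variant["alleles"]["zu-0"])
```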

Creating the imputed MAGIC genome mosaics

  1. Run the program reconstruction to create the genome mosaics and impute the MAGIC genomes, like this:
    reconstruction  -a ./VARIANTS.TABLES -m ./transposons.gff -c ./MAGIC/ 
    Here, ./VARIANTS.TABLES is the directory containing the variant calls for the 19 genomes, transposons.gff is the file of regions to mask out, and ./MAGIC is the directory of SNP calls in the MAGIC genomes.
  2. The complete set of command-line options for reconstruction is:
    -d assume genomes are diploid (i.e. can be heterozygous rather than inbred)
    -a path to the directory containing the variant tables, which must have file names of the form chr*.alleles.txt
    -c path to the directory containing the SNP calls for the MAGIC lines
    -m name of the mask file (in GFF format) containing unreliable repetitive regions of the genome to be masked out
    -o output directory: all output files are written here, default is the current directory ./
    -p the penalty for changing state in the reconstruction algorithm
    -w optional directory to write filtered SNP calls (if not set then the calls are not written)
    -z name of output file containing the imputed mosaics (defaults to mosaic.txt)
    -s name of species. Sets the number and names of chromosomes. Can be "arabidopsis", "human", "mouse", "rice". Alternatively, if you set it to be an integer such as "1", then the analysis assumes a species with that given number of chromosomes. You must have already created corresponding chromosome-specific variant tables for each such chromosome.
  3. The output files produced by the program are:
    mosaic.txt A text summary of the mosaic structure of all imputed MAGIC lines. Not required by genome_scan but can be used for generating visual representations of the mosaics.
    chr*.imputed.txt Text files containing the imputed genotypes at all 3 million variable sites, divided into chromosomes. Alleles are recoded as integers with 0 indicating missing data. Not used by genome_scan, but useful for manual investigation of the data.
    chr*.imputed.Data Binary files containing the imputed genotypes. These files are used by genome_scan
    chr*.uncompressed.Data Uncompressed binary files, similar to above
    chr*.haplotype.Data Binary haplotype files containing the predicted mosaics, used for haplotype association analysis.
    chr*alleles.Data Binary versions of the variant table files, for use by genome_scan
    chr*.sdp.txt Text files containing the allelic distribution patterns. Not used.

Getting the Imputed MAGIC Genomes

Instead of creating the imputed data from the SNPs, you can download the files needed for genetic association using the command

wget -r --no-directories 'http://mtweb.cs.ucl.ac.uk/mus/www/POOLING/ARABIDOPSIS/FOUNDER/chr*'
 

Plots of genome mosaics

The R script segments.R may be used to generate PDFs of the genome mosaics produced by the reconstruction program. Use the R command:


segment.plot( file="mosaic.txt", pdffile="segments.pdf", genome=FALSE)

Performing Association Analysis