Low-coverage Sequencing, Imputation and Association in Arabidopsis MAGIC lines

This page contains resources relating to the MAGIC population of recombinant inbred lines of Arabidopsis thaliana, with a focus on data and software to perform genetic association. The MAGIC lines are described in our paper A Multiparent Advanced Generation Inter-Cross to Fine-Map Quantitative Traits in Arabidopsis thaliana.

Each MAGIC genome is a mosaic of the 19 founder genomes. In our PLoS Genetics paper we genotyped the lines at 1260 SNPs spaced about 100kb apart in order to infer these mosaics probabilistically. To date over 700 lines have been genotyped.

703 of these lines were originally genotyped at 1260 SNPs, and the data and R analysis software for QTL mapping using the original data are available here.

We have now resequenced almost 500 of these lines at low coverage in order to obtain about 500k SNPs for each line. The mosaics are then inferred by a dynamic programming algorithm, akin to the Viterbi path from a hidden Markov model. Importantly, the breakpoints in the mosaics can be mapped at high precision, usually to within 1kb. These genome sequences can now be used for association analysis in the MAGIC lines. To do this, we exploit the genome sequences of the 19 founders of the MAGIC population. These genomes are described in our paper Multiple reference genomes and transcriptomes for Arabidopsis thaliana (Nature, 2011).

We have written two linked programs in C: reconstruction, which infers breakpoints from low-coverage sequence data and imputes the genomes of the MAGIC lines, and genome_scan, which performs association mapping on the imputed genomes. These programs are of wider utility than just Arabidopsis MAGIC lines: the imputation code will work on any population (not necessarily inbred) whose genomes are mosaics of a set of known founders.

We provide the raw SNP calls used to impute the genomes along with the imputed MAGIC genomes. If you just want to perform association mapping then there is no need to download the SNP calls.

Getting and Installing the Software

Download the package magic_src_v4.0.tar.gz. Extract the archive using the command tar xvf magic_src_v4.0.tar.gz. This will create a subdirectory ./SRC.
To compile the two programs, type cd ./SRC; ./compile.csh
Finally add the full path to the SRC directory to your $PATH environment variable.

Getting the low-coverage SNP data

Create a directory, say ./MAGIC, to hold the data and cd to it.
Issue the command >wget -r --no-directories 'http://mtweb.cs.ucl.ac.uk/mus/www/POOLING/ARABIDOPSIS/FOUNDER/MAGIC.NEW/'
The data will be downloaded to the current directory. Each MAGIC line is represented by a separate text file with name equal to the corresponding MAGIC line (e.g. MAGIC.100). Don't put any other files in the MAGIC directory - this directory is searched by the reconstruction program and any file in it is read into the program.
Each file contains many rows, one per SNP, each with four columns, denoting the chromosome, TAIR10 bp position, reference allele and called allele, e.g.
```
Chr1 303 T C
Chr1 331 A T
Chr1 341 T T
Chr1 346 C C
Chr1 425 C C
Chr1 429 G G
  
```
Download a text file transposons.gff containing the positions of transposons in the TAIR10 genome - these regions are used to mask out potentially unreliable regions during the imputation.
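
The four-column SNP call format and the GFF mask described above can be read with a few lines of Python. This is only an illustrative sketch, not part of the MAGIC package: the function names are hypothetical, the GFF row in the example is made up, and it assumes that chromosome names in the mask file match those in the SNP call files. It uses the standard GFF convention of 1-based, inclusive start and end coordinates in columns 4 and 5. Whether reconstruction applies the mask in exactly this way is not stated here.

```python
import io


def read_snp_calls(handle):
    """Yield (chromosome, position, reference_allele, called_allele) tuples."""
    for line in handle:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        chrom, pos, ref, called = fields
        yield chrom, int(pos), ref, called


def read_gff_intervals(handle):
    """Return {chromosome: [(start, end), ...]} from tab-delimited GFF rows."""
    intervals = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        # GFF columns 4 and 5 are the 1-based, inclusive start and end.
        intervals.setdefault(fields[0], []).append((int(fields[3]), int(fields[4])))
    return intervals


def unmasked_calls(calls, intervals):
    """Yield only the calls that fall outside every masked interval."""
    for chrom, pos, ref, called in calls:
        if any(start <= pos <= end for start, end in intervals.get(chrom, [])):
            continue
        yield chrom, pos, ref, called


# Three rows from the example above and one invented mask interval.
calls_text = "Chr1 303 T C\nChr1 331 A T\nChr1 341 T T\n"
gff_text = "Chr1\tTAIR10\ttransposable_element\t330\t340\t.\t+\t.\tID=example\n"

kept = list(unmasked_calls(read_snp_calls(io.StringIO(calls_text)),
                           read_gff_intervals(io.StringIO(gff_text))))
# Calls where the called allele differs from the reference allele.
non_reference = [call for call in kept if call[2] != call[3]]
```

The linear scan over intervals is fine for a quick check; for whole-genome data a sorted interval list with binary search would be a more sensible choice.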

Getting the variants in the 19 MAGIC founder genomes

Over 3 million sequence variants segregate in the 19 MAGIC founders. They are tabulated in chromosome-specific files that can be downloaded from here.

These files contain the sequence variants in the 19 founder genomes, aligned against a common coordinate system, called the pseudogenome (essentially the coordinate system of the multiple alignment of all the genomes, so that each genome can be derived from it by deletion or substitution, but never by insertion). This is equivalent to the concept of a Pan-Genome in other analyses. The columns are:

chr	the chromosome
pse.bp	the bp coordinate in the pseudo genome coordinate. In general pse.bp > bp
bp	the bp coordinate against the Col-0 reference. If the number is of the form "N.5" then the allele should be inserted after reference coord N
nalleles	the number of distinct alleles at the site
maf	the MAJOR allele frequency (this is more informative than the minor allele freq when there are more than two alleles)

The remaining columns give the alleles for the 19 genomes in the order: bur-0 can-0 col-0 ct-1 edi-0 hi-0 kn-0 ler-0 mt-0 no-0 oy-0 po-0 rsch-4 sf-2 tsu-0 wil-2 ws-0 wu-0 zu-0 Note that some variants are not simple SNPs or indels, but comprise complex imbalanced substitutions.

Download these files and put them in a directory, e.g. ./VARIANTS.TABLES
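
To make the column layout concrete, the sketch below parses one row of a variant table into a dictionary and recomputes the major allele frequency from the 19 founder alleles. It is an assumption-laden illustration rather than code from the MAGIC package: the function names are hypothetical, the example row is invented, the row is assumed to be already split into fields, and the way missing alleles are encoded in the real files is not covered.

```python
from collections import Counter

# Founder order as given in the column description above.
FOUNDERS = ("bur-0 can-0 col-0 ct-1 edi-0 hi-0 kn-0 ler-0 mt-0 no-0 "
            "oy-0 po-0 rsch-4 sf-2 tsu-0 wil-2 ws-0 wu-0 zu-0").split()


def parse_variant_row(fields):
    """Turn one split row (5 leading columns + 19 alleles) into a dict."""
    if len(fields) != 5 + len(FOUNDERS):
        raise ValueError("expected 5 leading columns plus 19 founder alleles")
    chrom, pse_bp, bp, nalleles, maf = fields[:5]
    bp_value = float(bp)
    return {
        "chr": chrom,
        "pse.bp": int(pse_bp),
        "bp": bp_value,
        # A coordinate of the form N.5 marks an allele inserted after N.
        "is_insertion": bp_value != int(bp_value),
        "nalleles": int(nalleles),
        "maf": float(maf),
        "alleles": dict(zip(FOUNDERS, fields[5:])),
    }


def major_allele_frequency(alleles):
    """Frequency of the most common allele among the supplied founders."""
    counts = Counter(alleles)
    return counts.most_common(1)[0][1] / len(alleles)


# Invented example: 17 founders carry T and 2 carry C.
row = ["Chr1", "310", "303", "2", "0.8947"] + ["T"] * 17 + ["C"] * 2
variant = parse_variant_row(row)
frequency = major_allele_frequency(list(variant["alleles"].values()))
```

Storing bp as a float keeps the "N.5" insertion convention intact; int(variant["bp"]) then gives the reference coordinate N after which the allele is inserted.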

Creating the imputed MAGIC genome mosaics

Run the program reconstruction to create the genome mosaics and impute the MAGIC genomes, like this:
```
reconstruction  -a ./VARIANTS.TABLES -m ./transposons.gff -c ./MAGIC/ 
```
Here, ./VARIANTS.TABLES is the directory containing the variant calls for the 19 founder genomes, transposons.gff is the file of regions to mask out, and ./MAGIC is the directory of SNP calls in the MAGIC genomes.

The complete set of command-line options for reconstruction is:

‑d	assume genomes are diploid (ie can be heterozygous rather than inbred)
‑a	path to the directory containing the variant tables, which must have file names of the form chr*.alleles.txt
‑c	path to the directory containing the SNP calls for the MAGIC lines
‑m	name of the mask file (in GFF format) containing unreliable repetitive regions of the genome to be masked out
‑o	output directory: all output files are written here, default is the current directory ./
‑p	the penalty for changing state in the reconstruction algorithm
‑w	optional directory to write filtered SNP calls (if not set then the calls are not written)
‑z	name of output file containing the imputed mosaics (defaults to mosaic.txt)
‑s	name of species. Sets the number and names of chromosomes. Can be "arabidopsis", "human", "mouse", "rice". Alternatively, if you set it to be an integer such as "1", then the analysis assumes a species with that given number of chromosomes. You must have already created corresponding chromosome-specific variant tables for each such chromosome.
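
The role of the state-change penalty (the -p switch) can be illustrated with a toy dynamic program. The sketch below is not the algorithm implemented in reconstruction; it is a minimal Viterbi-style example under simplifying assumptions (an inbred line, a unit cost per mismatching call, no cost for missing calls), and the function name reconstruct_mosaic and its parameters are hypothetical. It finds the sequence of founders that minimises mismatches plus the penalty times the number of breakpoints.

```python
def reconstruct_mosaic(founder_alleles, observed, penalty=5.0, mismatch_cost=1.0):
    """Return the lowest-cost founder index at each site.

    founder_alleles[f][i] is the allele of founder f at site i;
    observed[i] is the called allele at site i, or None if missing.
    """
    n_founders = len(founder_alleles)
    n_sites = len(observed)
    if n_sites == 0:
        return []

    def emit(f, i):
        # Missing calls carry no information, so they cost nothing.
        if observed[i] is None or founder_alleles[f][i] == observed[i]:
            return 0.0
        return mismatch_cost

    cost = [emit(f, 0) for f in range(n_founders)]
    back = []  # back[i - 1][f] = best predecessor of founder f at site i
    for i in range(1, n_sites):
        best_prev = min(range(n_founders), key=lambda f: cost[f])
        new_cost, pointers = [], []
        for f in range(n_founders):
            stay = cost[f]
            switch = cost[best_prev] + penalty
            if stay <= switch:
                pointers.append(f)
                new_cost.append(stay + emit(f, i))
            else:
                pointers.append(best_prev)
                new_cost.append(switch + emit(f, i))
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final state.
    state = min(range(n_founders), key=lambda f: cost[f])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


founders = ["AAAAAAAA", "CCCCCCCC"]
true_breakpoint = reconstruct_mosaic(founders, list("AAAACCCC"), penalty=1.5)
single_error = reconstruct_mosaic(founders, list("AAACAAAA"), penalty=1.5)
```

In this toy setting a run of four calls matching the second founder is enough to justify a breakpoint, whereas a single discordant call is cheaper to explain as an error than as two switches. A larger penalty gives fewer, longer segments.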

The output files produced by the program are:

mosaic.txt	A text summary of the mosaic structure of all imputed MAGIC lines. Not required by genome_scan but can be used for generating visual representations of the mosaics.
chr*.imputed.txt	Text files containing the imputed genotypes at all 3 million variable sites, divided into chromosomes. Alleles are recoded as integers with 0 indicating missing data. Not used by genome_scan, but useful for manual investigation of the data.
chr*.imputed.Data	Binary files containing the imputed genotypes. These files are used by genome_scan
chr*.uncompressed.Data	uncompressed binary files, similar to above
chr*.haplotype.Data	Binary haplotype files containing the predicted mosaics, used for haplotype association analysis.
chr*alleles.Data	Binary versions of the variant table files, for use by genome_scan
chr*.sdp.txt	Text files containing the allelic distribution patterns. Not used.

Getting the Imputed MAGIC Genomes

Instead of creating the imputed data from the SNPs, you can download the files needed for genetic association using the command

wget -r --no-directories
      'http://mtweb.cs.ucl.ac.uk/mus/www/POOLING/ARABIDOPSIS/FOUNDER/chr*'

Plots of genome mosaics

The R script segments.R may be used to generate PDFs of the genome mosaics generated by the reconstruction program. Use the R command:


segment.plot( file="mosaic.txt", pdffile="segments.pdf", genome=FALSE)

Performing Association Analysis

genome_scan performs genome-wide association analysis of a quantitative trait using the imputed genotypes or haplotypes generated by reconstruction. It is very fast, taking under a minute to scan all 3 million sites; it can perform stepwise multi-locus regression and permutation tests to evaluate genome-wide significance. It cannot handle covariates, so remove the effects of any covariates first by working with residuals. Example:
```
genome_scan -f phenotypes.txt -p days.to.bolt -n 1000
```
will perform a genome scan on the phenotype days.to.bolt from the phenotype file phenotypes.txt, with 1000 permutations.
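
Because genome_scan cannot take covariates, the phenotype has to be replaced by residuals before the scan. The following is a minimal sketch of that step for a single numeric covariate, using ordinary least squares in plain Python. The function name residualise is hypothetical, missing values are represented by None, and with several covariates or factors a proper regression package would be the better tool.

```python
def residualise(phenotype, covariate):
    """Return phenotype minus its least-squares fit on one covariate.

    Entries where either value is None are excluded from the fit and
    returned as None.
    """
    pairs = [(y, x) for y, x in zip(phenotype, covariate)
             if y is not None and x is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete observations")
    mean_y = sum(y for y, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    sxx = sum((x - mean_x) ** 2 for _, x in pairs)
    if sxx == 0:
        raise ValueError("covariate is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for y, x in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [None if y is None or x is None else y - (intercept + slope * x)
            for y, x in zip(phenotype, covariate)]


# Invented values: the third line has a missing phenotype.
days = [1.0, 2.0, None, 4.0, 3.0]
batch_effect = [0.0, 1.0, 5.0, 2.0, 3.0]
adjusted = residualise(days, batch_effect)
```

The residuals can then be written to the phenotype file as a new column, with None written as NA, and scanned in place of the raw trait.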

The command line options for genome_scan are:

‑a	directory containing the binary versions of the allele data, defaults to ./
‑d	directory containing the text-format imputed data (usually the same as -a, defaults to ./)
‑f	path to a tab-delimited text file containing the phenotype data. The file consists of columns, with the first row containing the column names. One column must be named 'SUBJECT.NAME' and contain the MAGIC line ids in the form MAGIC.N, where N is an integer. The other columns are the phenotype values. Use NA for missing data.
‑p	The name of the phenotype to scan. This must be one of the columns in the phenotype file
‑w	Output directory (default is ./)
‑t	threshold log10 pvalue for reporting potential associations, default is 4.
‑h	flag to perform haplotype analysis rather than an imputed variant analysis. Defaults to off
‑c	use uncompressed Binaries rather than compressed. Ignore this, it's a debugging switch
‑n	number of permutations to perform, to determine genomewide significance. Defaults to 0 (no permutations).
‑H	print a help message
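
A phenotype file in the layout required by the -f switch can be produced with Python's csv module. This is a small sketch rather than a supplied tool: the function name write_phenotypes is hypothetical and the values are invented. It writes a tab-delimited table with a SUBJECT.NAME column of MAGIC.N ids and NA for missing data.

```python
import csv
import io


def write_phenotypes(handle, phenotype_name, values):
    """Write {line_number: value or None} as a tab-delimited phenotype table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["SUBJECT.NAME", phenotype_name])
    for line_number in sorted(values):
        value = values[line_number]
        writer.writerow([f"MAGIC.{line_number}", "NA" if value is None else value])


# Invented example with one missing measurement.
buffer = io.StringIO()
write_phenotypes(buffer, "days.to.bolt", {100: 21.5, 7: None})
table_text = buffer.getvalue()
```

To write to disk instead of a buffer, open the target with open("phenotypes.txt", "w", newline="") and pass that handle.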

genome_scan produces the following output files in the output directory specified by the -w switch. The files vary depending on whether a haplotype analysis (using the -h switch) or variable site analysis (the default, in which each of the 3 million or so variable sites is tested for association) was performed. Here "phenotype" is replaced by the name of the phenotype specified by the -p switch:

variable site analysis:
phenotype.annotated.txt	The -log10 P-values and founder alleles at all variable sites with logp>threshold. The default threshold (specified by the -t switch) is 4, so this file includes all variants that might conceivably be interesting - the genomewide threshold for Normally distributed data is typically about logp=6
phenotype.logP.txt	logp values at all sites with logp>threshold
phenotype.gscan.txt	The logp values of all variants exceeding the threshold, formatted for upload into gscandb
phenotype.perms.txt	permutation results for non-haplotype analysis
phenotype.mult.txt	result of forward selection for multiple QTLs
phenotype.lars.txt	ignore this file
haplotype analysis:
phenotype.haplotype.gscan.txt	if the -h switch is set, then this file containing the haplotype association values is produced
phenotype.haplotype.logP.txt	haplotype logP values at intervals with logp > threshold
phenotype.perms_hap.txt	permutation results for haplotype analysis
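
Permutation results are commonly summarised by comparing the observed logp with the distribution of the maximum logp seen in each permuted scan. The sketch below shows that general idea only. The layout of phenotype.perms.txt is not described here, so the code takes a plain list of per-permutation maxima as input, which is an assumption, and the function names are hypothetical rather than part of genome_scan.

```python
import math


def genomewide_threshold(permuted_max_logp, alpha=0.05):
    """Empirical (1 - alpha) quantile of the per-permutation maximum logp."""
    ordered = sorted(permuted_max_logp)
    k = math.ceil((1 - alpha) * len(ordered)) - 1
    return ordered[max(k, 0)]


def empirical_pvalue(observed_logp, permuted_max_logp):
    """Share of permutations whose maximum is at least the observed logp.

    One is added to numerator and denominator so the estimate is never zero.
    """
    exceed = sum(1 for m in permuted_max_logp if m >= observed_logp)
    return (exceed + 1) / (len(permuted_max_logp) + 1)


# Invented maxima from four permuted scans.
maxima = [1.0, 2.0, 3.0, 4.0]
threshold = genomewide_threshold(maxima, alpha=0.5)
p_value = empirical_pvalue(3.5, maxima)
# A logp of 6 corresponds to a nominal P-value of 10 ** -6.
nominal_p = 10 ** -6
```

With only a few permutations the threshold is very coarse; the 1000 permutations used in the example command give a much finer estimate.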