This page contains resources relating to the MAGIC population of recombinant inbred lines of Arabidopsis thaliana, with a focus on data and software for genetic association analysis. The MAGIC lines are described in our paper A Multiparent Advanced Generation Inter-Cross to Fine-Map Quantitative Traits in Arabidopsis thaliana.
Each MAGIC genome is a mosaic of the 19 founder genomes. In our PLoS Genetics paper we genotyped the lines at roughly 1200 SNPs spaced about 100kb apart in order to infer these mosaics probabilistically. To date over 700 lines have been genotyped.
703 of these lines were originally genotyped at 1260 SNPs, and the data and R analysis software for QTL mapping using the original data are available here.
We have now resequenced almost 500 of these lines at low coverage in order to obtain about 500k SNPs for each line. The mosaics are then inferred by a dynamic programming algorithm, akin to finding the Viterbi path of a hidden Markov model. Importantly, the breakpoints in the mosaics can be mapped at high precision, usually to within 1kb. These genome sequences can now be used for association analysis in the MAGIC lines. To do this, we exploit the genome sequences of the 19 founders of the MAGIC population. These genomes are described in our paper Multiple reference genomes and transcriptomes for Arabidopsis thaliana (Nature, 2011).
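As a rough illustration of this kind of dynamic programme, the Python sketch below finds the cheapest path through a table of per-site costs, charging a fixed penalty each time the path changes founder. It is not the code of the reconstruction program: the function names, the `mismatch` cost table and the scoring scheme are all assumptions made for the example.

```python
def infer_mosaic(mismatch, penalty):
    """Return the lowest-cost founder index for each site.

    mismatch[i][k] is a hypothetical cost of assigning site i to founder k
    (for example 1 if the observed call disagrees with founder k, else 0).
    penalty is the cost charged whenever the path switches founder.
    """
    n_sites = len(mismatch)
    n_founders = len(mismatch[0])
    cost = list(mismatch[0])  # best cost of a path ending in each founder
    back = []                 # back[i-1][k]: predecessor of founder k at site i

    for i in range(1, n_sites):
        # The cheapest state at the previous site is the only useful
        # origin for a switch, because the penalty is the same for all.
        best_prev = min(range(n_founders), key=lambda k: cost[k])
        new_cost, pointers = [], []
        for k in range(n_founders):
            stay = cost[k]
            switch = cost[best_prev] + penalty
            if stay <= switch:
                new_cost.append(stay + mismatch[i][k])
                pointers.append(k)
            else:
                new_cost.append(switch + mismatch[i][k])
                pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final state.
    state = min(range(n_founders), key=lambda k: cost[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def breakpoints(path):
    """Indices i where the founder changes between site i-1 and site i."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

A larger penalty gives fewer, longer segments, so a single discordant call is absorbed rather than producing two spurious breakpoints.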
We have written two linked programs in C: reconstruction, which infers breakpoints from low-coverage sequence data and imputes the genomes of the MAGIC lines, and genome_scan, which performs association mapping on the imputed genomes. These programs are of wider utility than just Arabidopsis MAGIC lines: the imputation code will work on any population (not necessarily inbred) whose genomes are mosaics of a set of known founders.
We provide the raw SNP calls used to impute the genomes along with the imputed MAGIC genomes. If you just want to perform association mapping then there is no need to download the SNP calls.
Chr1 303 T C
Chr1 331 A T
Chr1 341 T T
Chr1 346 C C
Chr1 425 C C
Chr1 429 G G
chr | the chromosome |
pse.bp | the bp coordinate in the pseudo-genome coordinate system. In general pse.bp > bp |
bp | the bp coordinate against the Col-0 reference. If the number is of the form "N.5" then the allele should be inserted after reference coord N |
nalleles | the number of distinct alleles at the site |
maf | the MAJOR allele frequency (this is more informative than the minor allele freq when there are more than two alleles) |
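To make the nalleles, maf and "N.5" conventions concrete, here is a small Python sketch. The helper names and the input layout (one allele string per founder) are hypothetical and are not taken from the variant table files; missing data is not handled.

```python
from collections import Counter


def summarise_site(founder_alleles):
    """Return (nalleles, major allele frequency) for one variable site.

    founder_alleles holds one allele string per founder (assumed layout).
    """
    counts = Counter(founder_alleles)
    nalleles = len(counts)
    major_freq = max(counts.values()) / len(founder_alleles)
    return nalleles, major_freq


def insertion_anchor(bp):
    """If bp has the form N.5, return N (the allele is inserted after
    reference coordinate N); otherwise return None."""
    value = float(bp)
    whole = int(value)
    return whole if value - whole == 0.5 else None
```

With three alleles at a site, for example 15 founders carrying A, 3 carrying T and 1 carrying G, the major allele frequency is 15/19. That one number shows how dominant the commonest allele is, which a minor allele frequency cannot do unambiguously at a multi-allelic site.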
reconstruction -a ./VARIANTS.TABLES -m ./transposons.gff -c ./MAGIC/
Here, ./VARIANTS.TABLES is the directory containing the variant calls for the 19 founder genomes, transposons.gff is the file of regions to mask out, and ./MAGIC is the directory of SNP calls in the MAGIC genomes.
-d | assume genomes are diploid (i.e. can be heterozygous rather than inbred) |
-a | path to the directory containing the variant tables, which must have file names of the form chr*.alleles.txt |
-c | path to the directory containing the SNP calls for the MAGIC lines |
-m | name of the mask file (in GFF format) containing unreliable repetitive regions of the genome to be masked out |
-o | output directory: all output files are written here, default is the current directory ./ |
-p | the penalty for changing state in the reconstruction algorithm |
-w | optional directory to write filtered SNP calls (if not set then the calls are not written) |
-z | name of output file containing the imputed mosaics (defaults to mosaic.txt) |
-s | name of species. Sets the number and names of chromosomes. Can be "arabidopsis", "human", "mouse", "rice". Alternatively, if you set it to an integer such as "1", then the analysis assumes a species with that number of chromosomes. You must have already created the corresponding chromosome-specific variant table for each such chromosome. |
mosaic.txt | A text summary of the mosaic structure of all imputed MAGIC lines. Not required by genome_scan but can be used for generating visual representations of the mosaics. |
chr*.imputed.txt | Text files containing the imputed genotypes at all 3 million variable sites, divided into chromosomes. Alleles are recoded as integers with 0 indicating missing data. Not used by genome_scan, but useful for manual investigation of the data. |
chr*.imputed.Data | Binary files containing the imputed genotypes. These files are used by genome_scan |
chr*.uncompressed.Data | uncompressed binary files, similar to above |
chr*.haplotype.Data | Binary haplotype files containing the predicted mosaics, used for haplotype association analysis. |
chr*alleles.Data | Binary versions of the variant table files, for use by genome_scan |
chr*.sdp.txt | Text files containing the allelic distribution patterns. Not used. |
Instead of creating the imputed data from the SNPs, you can download the files needed for genetic association using the command
wget -r --no-directories 'http://mtweb.cs.ucl.ac.uk/mus/www/POOLING/ARABIDOPSIS/FOUNDER/chr*'
The R script segments.R may be used to generate PDFs of the genome mosaics generated by the reconstruction program. Use the R command:
segment.plot( file="mosaic.txt", pdffile="segments.pdf", genome=FALSE)
genome_scan -f phenotypes.txt -p days.to.bolt -n 1000
will perform a genome scan on the phenotype days.to.bolt from the phenotype file phenotypes.txt, with 1000 permutations.
-a | directory containing the binary versions of the allele data, defaults to ./ |
-d | directory containing the text-format imputed data (usually the same as -a, defaults to ./) |
-f | path to a tab-delimited text file containing the phenotype data. The file must be organised in columns, with the first row containing the column names. One column must be named 'SUBJECT.NAME' and contain the MAGIC line IDs in the form MAGIC.N, where N is an integer. The other columns are the phenotype values. Use NA for missing data. |
-p | The name of the phenotype to scan. This must be one of the columns in the phenotype file |
-w | Output directory (default is ./) |
-t | -log10 P-value threshold for reporting potential associations, default is 4. |
-h | flag to perform haplotype analysis rather than an imputed variant analysis. Defaults to off |
-c | use uncompressed binaries rather than compressed. Ignore this, it's a debugging switch |
-n | number of permutations to perform, to determine genome-wide significance. Defaults to 0 (no permutations). |
-H | print a help message |
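Because the phenotype file has a few strict requirements, it can be worth checking it before starting a scan. The Python sketch below is one possible check. The function name is hypothetical, and the assumption that every non-NA phenotype value is numeric is ours, not something stated for genome_scan.

```python
import csv
import io
import re

MAGIC_ID = re.compile(r"^MAGIC\.\d+$")


def check_phenotype_table(text, phenotype):
    """Return a list of problems found in tab-delimited phenotype text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]

    problems = []
    if "SUBJECT.NAME" not in header:
        problems.append("no column named 'SUBJECT.NAME'")
    if phenotype not in header:
        problems.append(f"no column named '{phenotype}'")
    if problems:
        return problems

    id_col = header.index("SUBJECT.NAME")
    ph_col = header.index(phenotype)
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        if not MAGIC_ID.match(row[id_col]):
            problems.append(f"line {line_no}: bad line id '{row[id_col]}'")
        value = row[ph_col]
        if value != "NA":
            try:
                float(value)  # assumption: phenotypes are numeric
            except ValueError:
                problems.append(f"line {line_no}: bad value '{value}'")
    return problems
```

An empty list means the table passed these particular checks. To check a file on disk, read it into a string first and pass the name of the column you intend to give to -p.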
variable site analysis: | |
phenotype.annotated.txt | The -log10 P-values and founder alleles at all variable sites with logp>threshold. The default threshold (specified by the -t switch) is 4, so this file includes all variants that might conceivably be interesting; the genome-wide threshold for Normally distributed data is typically about logp=6 |
phenotype.logP.txt | logp values at all sites with logp>threshold |
phenotype.gscan.txt | The logp values of all variants exceeding the threshold, formatted for upload into gscandb |
phenotype.perms.txt | permutation results for non-haplotype analysis |
phenotype.mult.txt | result of forward selection for multiple QTLs |
phenotype.lars.txt | ignore this file |
haplotype analysis: | |
phenotype.haplotype.gscan.txt | if the -h switch is set, then this file containing the haplotype association values is produced |
phenotype.haplotype.logP.txt | haplotype logP values at intervals with logp > threshold |
phenotype.perms_hap.txt | permutation results for haplotype analysis |
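For readers unfamiliar with permutation-based genome-wide thresholds, the Python sketch below shows one common general approach: shuffle the phenotype across lines, record the largest association statistic anywhere in the genome, and take an upper quantile of those maxima. It is not a description of what genome_scan computes. The squared correlation used as the statistic, the numeric genotype coding and all function names are assumptions chosen to keep the example self-contained.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def max_statistic(genotypes, phenotype):
    """Largest statistic over all sites; genotypes[s][i] is line i at site s."""
    return max(r_squared(site, phenotype) for site in genotypes)


def permutation_threshold(genotypes, phenotype, n_perm=1000, alpha=0.05, seed=0):
    """Upper-alpha quantile of the genome-wide maximum under permutation."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        maxima.append(max_statistic(genotypes, shuffled))
    maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[index]
```

Taking the maximum over sites within each permutation is what makes the resulting threshold genome-wide rather than per-site, since it accounts for the number of correlated tests performed in a scan.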