19 genomes of Arabidopsis thaliana
This page contains resources relating to the 19 genomes project,
in which the genome sequences, transcriptomes and protein annotations
of 19 accessions of the plant Arabidopsis thaliana are
described. These genomes are the founders of the MAGIC genetic
reference population of recombinant inbred lines, and contribute to the 1001 Arabidopsis genomes project.
These genomes are described in our paper Mutiple reference genomes
and transcriptomes for Arabidopsis thaliana Nature 2011.
the IMR-DENOM software used to assemble the genomes is now
available. The newest version is 0.4.1.
Genetic differences between Arabidopsis thaliana accessions underlie
the plant’s extensive phenotypic variation, and until now these have
been interpreted largely in the context of the annotated reference
accession Col-0. Here we report the sequencing, assembly and
annotation of the genomes of 18 natural A. thaliana accessions, and
their transcriptomes. When assessed on the basis of the reference
annotation, one-third of protein-coding genes are predicted to be
disrupted in at least one accession. However, re-annotation of each
genome revealed that alternative gene models often restore coding
potential. Gene expression in seedlings differed for nearly half of
expressed genes and was frequently associated with cis variants within
5 kilobases, as were intron retention alternative splicing
events. Sequence and expression variation is most pronounced in genes
that respond to the biotic environment. Our data further promote
evolutionary and functional studies in A. thaliana, especially the
MAGIC genetic reference population descended from these accessions.
Details of Accessions Sequenced
The accessions sequenced are:
Accession |
Origin |
AIMS Stock Centre # |
Bur-0 |
Ireland |
CS6643 |
Can-0 |
Canary Isles |
CS6660 |
Ct-1 |
Italy |
CS6674 |
Edi-0 |
Scotland |
CS6688 |
Hi-0 |
Netherlands |
CS6736 |
Kn-0 |
Lithuania |
CS6762 |
Ler-0 |
Poland, formerly Germany |
CS20 |
Mt-0 |
Libya |
CS1380 |
No-0 |
Germany |
CS6805 |
Oy-0 |
Norway |
CS6824 |
Po-0 |
Germany |
CS6839 |
Rsch-4 |
Russia |
CS6850 |
Sf-2 |
Spain |
CS6857 |
Tsu-0 |
Japan |
CS6874 |
Wil-2 |
Russia |
CS6889 |
Ws-0 |
Russia |
CS6891 |
Wu-0 |
Germany |
CS6897 |
Zu-0 |
Germany |
CS6902 |
More details of the sequencing (libraries, read lengths, yields, coverage etc) are available here.
Collaborators
The project is a collaboration between
- Richard Mott (currently UCL, UK) and Xiangchao Gan (currently MPIPZ, Germany). Responsible for genome
sequencing and assembly. Funded by the BBSRC grant BB/F022697/1
"Resequencing Arabidopsis thaliana" and BB/D016029/2. All sequencing
was performed by the Genomics Core at WTCHG (David Buck and
colleagues), except for one library of Mt-0 (GATC Biotech Gmbh) and
one library of Ws-0 (The Sainsbury Laboratory, Eric Kemen). Rune
Lyngsoe (Dept Stats, University of Oxford) developed the ancestral
recombination graph analysis. Nicholas Harbered and Eric Kamen
(Plant Sciences, Oxford) confirmed indel data.
- University of Utah, USA (Richard Clark and Josh Steffen), Transcriptome sequencing and genome analysis. Funded by USA National Science Foundation (NSF) Arabidopsis 2010 award no. 0929262, and NSF Major Research Instrumentation award no. 0820985.
- Max Planck Institute, Tubingen, Germany (Gunnar Raesch, Oliver Steagle, Jonas Behr), Reannotation and association mapping.
- Kansas State University (Christopher Toomajian and Katie Hildebrand), Data analysis. Funded by USA National Science Foundation (NSF) Arabidopsis 2010 award no. 0929262.
- University of Bath, UK (Paula Kover), Provision of genomic DNA
and data analysis. Funded by BBSRC grants BB/F022697/1, BB/D016029/2.
Data Available for Download
NOTE: These data were revised and expanded on 5th September 2011. If you
downloaded any annotation data prior to that date please check if
there is a more recent version. The genome sequences and lists of
variants are unaffected.
- Genome Assemblies and Genomic DNA sequence reads
- fasta sequences of the 18 accessions' genomes. Lower confidence "uncovered regions" are in lower case. These are generally repetitive regions that were uncovered when the reads were re-mapped to the final assemblies. These regions therefore may be deleted or may correspond to places where reads could map to more than one locus. Tables of uncovered regions are available here.
- BAM files of reads aligned to the inferred genome sequence of each accession. All libraries for a a given accession are combined into one BAM. This includes publicly available single-end Bur-0 data from the Weigel lab, as wells as paired-end Bur-0 data generated by us. These BAMS comprise all the reads used for the assembly process.
- BAM files of individual libraries reads aligned to TAIR10 reference assembly (does not include public Bur-0 data). These reads comprise all the data generated by us for this project. These data are also available from the EBI SRA under accession number ERP000565. For most accessions there are two BAM files: phase 1 corresponds to at least one GA-II lane of 36bp paired end reads with ~200bp inserts. phase 2 (the files with names matching PII) comprise a GA-II lane of 51bp paired end reads from 400 bp inserts. The exceptions are (i) Hi-0 : In place of phase II two single-end 51bp libraries (amplified and non-amplified) were run. (ii) Mt-0: phase I comprised 36bp paired end data with ~400 bp inserts performed by GATC Biotech Gmbh; phase 2 comprised 51bp mate pair with 1.5kb inserts run at WTCHG. (iii) Ws-0: An additional lane of 36bp single end was performed at the Sainsbury Laboratory (Ws_0_ws_tsl.bam).
- Lists of sequence variants relative to TAIR10 (SDI format).
- Variant tables of all accessions in chromosome-specific files.
- Compact genome coordinates and mapping tools
- We provide a coordinate
mapping database, providing comparative genome coordinates of
all strains, helping to map between the different coordinate system.
See README for details and example python scripts
for usage.
- Transcriptomes
- BAM files of seedling RNAseq reads aligned to each genome assembly.
- Fastq files of the same data (unmapped).
- Reference Col-0 annotations
- Repeat analysis. All regions in which a 50mer centred at that location maps to more than one location.
- Deleted regions
- TAIR 10 annotation for reference
- Genome annotation and analysis
- GBROWSE database of annotated genomes. Tracks:
- Gene Prediction Experiments -> De novo gene annotations by mGene.ngs
- Gene Prediction Experiments -> Consolidated gene annotations (TAIR10+mGene.ngs)
- Gene Prediction Experiments -> Novel gene annotations (mGene.ngs+cufflinks)
- files of annotatations (GFF3, GTF and other formats)
- denovo annotations predicted using mGene.ngs and the seedling RNA-seq data, independently of the TAIR10 annotations. This directory contains de novo gene annotations for the 19 accessions (*.accession_name.*). We provide
two formats (gene finder format version 3: *.gff3 and matlab/octave data files: *.mat) in two
coordinate systems (reference genome: *.Col-0.{gff3,mat} and strain coordinate system: rest).
accession_name
- consolidated annotations, after integration with TAIR10:
- If the TAIR10 gene structure is disrupted in a splice site, then we use the gene prediction
(independent of whether there is full RNA-seq evidence of the transcript.) Note that in this case
the TAIR 10 gene structure is not valid anymore.
- We include predicted transcripts (on any accession) for which each intron is RNA-seq confirmed or
it was previously annotated in TAIR 10 as intron. We exclude transcripts merging two annotated
genes.
- We do not include the remaining predicted transcripts, but use the mapped, TAIR10 annotation
with valid transcript structure instead.
After a selection of transcripts, we determine the maximum open reading frame on the accessions
genomes. If necessary we extend the annotated 5' and 3' ends, to identify an appropriate coding
region. The classification into non-coding RNA, protein coding gene, pseudo gene and transposable
element gene is inherited from the TAIR 10 annotation to the consolidated annotations.
- Novel gene predictions by mGene.ngs and cufflinks, nucleotide and amino acid sequences.
- Protein sequences of all
predicted transcripts (FASTA).
- coding DNA sequences of all predicted
transcripts (FASTA).
- RNA sequences of all predicted
transcripts (FASTA).
- Gene expression and genetic mapping
- GBROWSE database of
annotated genomes. Tracks:
- Expressed genes
- Differentially expressed genes
- Gene expression fold change
- Heritability
- Different types of cis eQTLs (nucleotide variant, structural,
copy number variation)
- Gene expression summary table.
- Summary information for all analyzed genes, including newly
predicted genes.
- Information includes gene type, gene expression variability,
copy number variation, different types of (cis) polymorphisms,
impact of population structure. See README for details.
- Raw gene expression data tables
- Raw counts, RKPM values and p-values for variability for
every strains.
- Information includes estimates for alternative filters,
p-values for significant gene expression and p-values for
statistical tests. See README for details.
- Alternative splicing (AS) summary table.
- Summary information for alternative splicing events.
- Information includes type of AS event, differential
alternative splicing and genetic regulation of AS. See README for details.