The GSCANDB software
This page describes the GSCANDB software:
- A Linux server running instances of MySQL and Apache, or their equivalents on another OS.
- A MySQL account with privileges to create, drop and alter databases.
- Java (JDK 1.6) and Ant must be installed on your system to create and populate the gscan database.
- Download and unzip the file: gscandb_v1.zip
- Download and unzip the biomart files: biomart.zip
- Go to the gscandb_v1 directory
- Ask your web server administrator to create a web alias for the gscandb_v1 directory.
- Configure the database.properties file (located in the etc sub-directory of gscandb_v1) by replacing xxx with your username and password.
If required, edit some or all of the following parameter settings, to suit your environment:
tmpdir=/tmp/www-data/ This is the temporary directory for images generated by GScan
- Create the database and build the tables by issuing the command
from within the gscandb_v1 directory.
The Genome Scan Viewer
Option B: Update and add data to the database.
- Check that the web server works by pointing your browser at the URL:
- This should generate a web page with the header 'Genome Scan Viewer', similar to that on our GSCANDB web site http://gscan.well.ox.ac.uk/gs/wwwqtl.cgi, except that the scrolling lists and pulldown menus will be empty.
The database will need to be populated with your data. We provide some example data from our mouse QTL mapping experiment in the compressed archive gscandb.examples.zip. Download and unpack it, preferably into a directory other than the gscandb directory. The directory contains comma-separated files, whose format and content are described further in the Input files section below.
GSCANDB can be populated using different arguments, depending on whether it is being populated for the first time or whether data is being updated or added to the database.
sh loadData.sh dir=/data/infiles marker=marker.csv
is the same as writing:
sh loadData.sh marker=/data/infiles/marker.csv
Arguments marker, sample, genotype and gscan indicate the type of the input file. See section Input files for further description of the input files.
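The dir shorthand shown above amounts to joining a directory onto each file name. The following Python sketch illustrates that equivalence only; it is not how loadData.sh is implemented. The helper name resolve_args is hypothetical, and the assumption that dir applies to every file-type argument is ours.

```python
import posixpath

# File-type arguments named in the text above.
FILE_ARGS = ("marker", "sample", "genotype", "gscan")


def resolve_args(args):
    """Expand a dir=... argument into full paths for the file-type arguments.

    Hypothetical helper: it mirrors the documented equivalence between
    'dir=/data/infiles marker=marker.csv' and 'marker=/data/infiles/marker.csv'.
    """
    pairs = dict(a.split("=", 1) for a in args if "=" in a)
    flags = [a for a in args if "=" not in a]  # e.g. update, append
    base = pairs.pop("dir", None)
    resolved = []
    for key, value in pairs.items():
        if base is not None and key in FILE_ARGS:
            # An absolute file path is left unchanged by posixpath.join.
            value = posixpath.join(base, value)
        resolved.append(f"{key}={value}")
    return resolved + flags


if __name__ == "__main__":
    print(resolve_args(["dir=/data/infiles", "marker=marker.csv"]))
```

Both forms of the command resolve to the same argument list, which is the point of the shorthand.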
sh loadData.sh dir=/gscan/gscandb/csvInfiles marker=marker.csv update
Option C: Add data to the database.
Faster than option B, but requires the infile to contain only new, unique identifiers (they cannot already be in the database).
sh loadData.sh dir=/gscan/gscandb/csvInfiles marker=marker.csv append
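Because append requires identifiers that are not yet in the database, it can help to check an infile before loading it. The sketch below is one possible pre-check and is not part of GSCANDB. It assumes that the identifier is the name column and that you have obtained the set of existing names yourself (for example from a database query). The function name find_conflicts and the sample values are hypothetical.

```python
import csv
import io


def find_conflicts(csv_text, existing_names, key="name"):
    """Return identifiers that repeat within the file or are already known.

    csv_text       -- contents of an infile, including its header row
    existing_names -- set of identifiers already present in the database
                      (assumed to be supplied by the caller)
    """
    seen = set()
    conflicts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[key]
        if name in seen or name in existing_names:
            conflicts.append(name)
        seen.add(name)
    return conflicts


if __name__ == "__main__":
    # Hypothetical marker rows; the marker names and the 'SNP' type are made up.
    sample = (
        "name,marker_type,leftseq,rightseq,alias\n"
        "m1,SNP,,,\n"
        "m2,SNP,,,\n"
        "m1,SNP,,,\n"
    )
    print(find_conflicts(sample, existing_names={"m2"}))
```

An empty result suggests the file is safe to load with append; otherwise the update option may be the better choice.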
To see all the available command line options for populating the database, issue the command:
sh loadData.sh help
See section Examples for more examples on how to populate gscandb using loadData.sh.
All the infiles should be comma separated.
The headers for the csv files are as follows. Null fields should be entered as ",,". Fields in bold cannot be null.
marker.csv containing basic marker information
marker_mapping.csv containing positions of the markers on genome builds
sample.csv containing information about samples (individuals with genotypes)
genotype.csv containing the genotypes of the markers on the samples
hapmap.csv containing haplotype map information for the markers
files named Biochem.ALP.chr*.scan containing genome-scan data for one phenotype across 20 chromosomes in a special format described below.
threshold.csv containing significance threshold information for genome scans
| TABLE NAME || FIELD NAMES |
| genome_build || name, date, species, comments, ensembl, ensembldb, ensemblspecies, liftover |
| phenotype || name, description, public_name |
| population || name, species, size, comments |
| marker || name, marker_type, leftseq, rightseq, alias |
| marker_mapping || marker, genome_build, chromosome, bp_position, strand, cm |
| trait_locus || name, population, genome_scan, subscan_label, phenotype, marker1, marker2, species, chromosome, start_bp, end_bp, threshold, score, peak, label, comment, url |
| sample || name, gender, notes |
| genotype || marker, sample, genotype |
| hapblock || genome_build, chromosome, marker_start, marker_end, info |
| chromosome || name, genome_build, length |
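To illustrate the rule that null fields are entered as ",,", here is a minimal Python sketch that writes a marker file with empty fields. It assumes the header row uses the marker field names in the order listed in the table. The function name write_marker_csv and the sample values are hypothetical.

```python
import csv
import io

# Field names for the marker table, as listed in the table above.
MARKER_FIELDS = ["name", "marker_type", "leftseq", "rightseq", "alias"]


def write_marker_csv(rows):
    """Return CSV text for the given marker rows; missing values become empty fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=MARKER_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # The csv module writes None as an empty string, giving ',,' for null fields.
        writer.writerow({field: row.get(field) for field in MARKER_FIELDS})
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical marker with no flanking sequences.
    print(write_marker_csv([{"name": "m1", "marker_type": "SNP", "alias": "m1_alt"}]))
```

Here the two missing sequence fields appear as consecutive commas between the marker type and the alias.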
Most of the fields in the tables are self-explanatory, but in detail:
In GSCANDB a genome scan is associated with a mapping population, a phenotype and a genome build. Each scan contains one or more named subscans. A subscan is a series of quantitative measurements along the genome, where each measurement is associated with a marker or marker interval. The subscan mechanism is useful for storing different analyses of the same underlying data. For example, we analyse all our phenotypes in at least four ways, looking for singlepoint additive and dominance effects and multipoint additive and dominance effects. Note that the marker order of the data depends on the genome build, and is therefore defined by uploading marker_mapping files. Although genome scan files may contain positional information, this is ignored.
Genome Scan input files have two accepted formats:
- genome_build contains basic information about a genome assembly against which the genome scan data are to be plotted and the genome annotations matched. At present the only required information is the name of the genome build; the other ensembl-related fields are unused but will be used in later releases. The liftover field is used to lift genome scans defined on one genome build onto another build (described below).
- phenotype describes a phenotype. The most important field other than the name is public_name, which is the text displayed in the interface. If the option "public.version" => 1 is set in qtlOptions.pl, then only phenotypes with public names are displayed. This mechanism allows public and development versions to be run off the same database.
- population defines a mapping population.
- marker contains information about a marker. Note that in GSCANDB a marker is a unique location on the genome.
- marker_mapping contains the locations of markers on genome builds.
- trait_locus contains information about QTL. A QTL can be defined in two ways.
- If the QTL is associated with a genome subscan, i.e. it corresponds to some region in the scan that is likely to contain a functional variant, then both genome_scan and subscan_label must be defined appropriately. The range of the QTL is defined in terms of markers (marker1 to marker2), rather than by base-pair position (chromosome, start_bp, end_bp). The advantage of the former method is that it is possible to lift over between genome builds.
- species defines species.
- sample defines individuals with genotypes.
- genotype defines genotypes, which connect a sample to a marker.
- hapblock contains special information about the locations of haplotype blocks.
- chromosome defines chromosome lengths.
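The advantage of defining a QTL by markers rather than by base-pair position can be sketched as follows. This is an illustration of the idea only, not GSCANDB's implementation. The class MarkerMapping mirrors some of the marker_mapping fields, while the function qtl_range, the build names and the positions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkerMapping:
    """One row of marker_mapping (subset of fields)."""
    marker: str
    genome_build: str
    chromosome: str
    bp_position: int


def qtl_range(marker1, marker2, build, mappings):
    """Resolve a marker-defined QTL to (chromosome, start_bp, end_bp) on one build."""
    positions = {m.marker: m for m in mappings if m.genome_build == build}
    a, b = positions[marker1], positions[marker2]
    if a.chromosome != b.chromosome:
        raise ValueError("flanking markers map to different chromosomes")
    start, end = sorted((a.bp_position, b.bp_position))
    return a.chromosome, start, end


if __name__ == "__main__":
    # Hypothetical mappings of the same two markers on two builds.
    mappings = [
        MarkerMapping("m1", "buildA", "4", 1000),
        MarkerMapping("m2", "buildA", "4", 5000),
        MarkerMapping("m1", "buildB", "4", 1200),
        MarkerMapping("m2", "buildB", "4", 5300),
    ]
    print(qtl_range("m1", "m2", "buildA", mappings))
    print(qtl_range("m1", "m2", "buildB", mappings))
```

The same marker pair yields a valid interval on either build, whereas a QTL stored only as start_bp and end_bp would be tied to the build it was defined on.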
Examples will be shown here
Last modified: Tue Oct 13 12:01:24 BST 2009