The GSCANDB software

This page describes the GSCANDB software:

System requirements
Download, configuration and installation
The Genome Scan Viewer
Uploading data
Input files
Examples

System Requirements

A Linux server running instances of MySQL and Apache, or their equivalents on another operating system.
A MySQL account with privileges to create, drop and alter databases.
Java (JDK 1.6) and Ant must be installed on your system; they are used to create and populate the gscan database.

Download, configuration and installation

Download and unzip the file: gscandb_v1.zip
Download and unzip the biomart files: biomart.zip
Go to the gscandb_v1 directory
```
cd gscandb_v1
```
Ask your web server administrator to create a web alias for the gscandb_v1 directory.

Configure the database.properties file (located in the gscandb_v1 sub-directory called etc) by replacing xxx with your MySQL username and password:

```
username=xxx
password=xxx
```
If required, edit some or all of the following parameter settings, to suit your environment:
```
server=localhost
systemURL=http://localhost
baseURL=http://localhost/yourGScanWebAlias
# Temporary directory for images generated by GScan
tmpdir=/tmp/www-data/
biomart=/yourPathToTheBioMartDirectory/biomart/lib
```
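Before running ant, it can help to confirm that no placeholder values remain in database.properties. The following is a minimal sketch in Python, assuming the file consists of plain key=value lines as shown above. The function names parse_properties and unset_keys are illustrative and are not part of GSCANDB.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def unset_keys(props, required=("username", "password"), placeholder="xxx"):
    """Return the required keys that are missing or still set to the placeholder."""
    return [key for key in required if props.get(key, placeholder) == placeholder]
```

For example, reading etc/database.properties from within the gscandb_v1 directory and passing its text through parse_properties and then unset_keys returns an empty list once both the username and password have been filled in.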

Create the database and build the tables by issuing the command
```
 ant
```
from within the gscandb_v1 directory.

The Genome Scan Viewer

Check that the web server works by pointing your browser at the URL:
```
http://localhost/yourGScanWebAlias/wwwqtl.cgi
```
This should generate a web page with the header 'Genome Scan Viewer', similar to that on our GSCANDB web site http://gscan.well.ox.ac.uk/gs/wwwqtl.cgi, except that the scrolling lists and pull-down menus will be empty.
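The same check can be scripted. This is a hedged sketch using only the Python standard library; it assumes the page text contains the literal header 'Genome Scan Viewer' and that the URL follows the baseURL pattern above. The function names are illustrative.

```python
import urllib.request


def looks_like_viewer(html):
    """Return True if the page text contains the expected header."""
    return "Genome Scan Viewer" in html


def check_viewer(url, timeout=10):
    """Fetch the URL and report whether it looks like the Genome Scan Viewer page."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return looks_like_viewer(html)
```

Calling check_viewer("http://localhost/yourGScanWebAlias/wwwqtl.cgi") should return True on a working installation; an exception from urlopen usually indicates that the web alias or the server itself is not set up.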

Input files

```
marker.csv
```
containing basic marker information
```
marker_mapping.csv
```
containing positions of the markers on genome builds
```
sample.csv
```
containing information about samples (individuals with genotypes)
```
genotype.csv
```
containing the genotypes of the markers on the samples
```
hapmap.csv
```
containing haplotype map information for the markers
files named
```
Biochem.ALP.chr*.scan
```
containing genome-scan data for one phenotype across 20 chromosomes in a special format described below.
```
threshold.csv
```
containing significance threshold information for genome scans

The headers for the csv files are as follows. Null fields should be entered as ",,". Fields in bold cannot be null.

TABLE NAME	FIELD NAMES
genome_build	name, date, species, comments, ensembl, ensembldb, ensemblspecies, liftover
phenotype	name, description, public_name
population	name, species, size, comments
marker	name, marker_type, leftseq, rightseq, alias
marker_mapping	marker, genome_build, chromosome, bp_position, strand, cm
trait_locus	name, population, genome_scan, subscan_label, phenotype, marker1, marker2, species, chromosome, start_bp, end_bp, threshold, score, peak, label, comment, url
sample	name, gender, notes
genotype	marker, sample, genotype
hapblock	genome_build, chromosome, marker_start, marker_end, info
chromosome	name, genome_build, length
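As an illustration of the null-field convention, the sketch below writes a small marker.csv-style table with Python's csv module, which emits None as an empty field. It assumes the header line is simply the comma-separated field names from the table above. The marker name and values are hypothetical.

```python
import csv
import io


def write_csv(handle, header, rows):
    """Write a header and rows; None values become empty fields."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


header = ["name", "marker_type", "leftseq", "rightseq", "alias"]
# Hypothetical marker with no flanking sequences recorded.
rows = [["rs0000001", "SNP", None, None, "m1"]]

buffer = io.StringIO()
write_csv(buffer, header, rows)
print(buffer.getvalue())
```

The two null fields in the data row appear as consecutive commas (rs0000001,SNP,,,m1). To write a real file, pass an open file handle instead of the in-memory buffer.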

Most of the fields in the tables are self-explanatory, but in detail:

genome_build contains basic information about a genome assembly against which the genome scan data are to be plotted and the genome annotations matched. At present the only information which is required is the name of the genome build; the other ensembl-related fields are unused but will be used in later releases. The liftover field is used to lift genome scans defined on one genome build onto another build (described below).
phenotype describes a phenotype. The most important field other than the name is public_name, which is the text displayed in the interface. If the option "public.version" => 1 is set in qtlOptions.pl then only phenotypes with public names are displayed. This mechanism is used so that public and development versions can be run off the same database.
population defines a mapping population.
marker contains information about a marker. Note that in GSCANDB a marker is a unique location on the genome.
marker_mapping contains the locations of markers on genome builds.
trait_locus contains information about QTL. If the QTL is associated with a genome subscan, i.e. it corresponds to some region in the scan that is likely to contain a functional variant, then both genome_scan and subscan_label must be defined appropriately.
The range of a QTL can be defined in two ways: in terms of markers (marker1 to marker2), or by base-pair position (chromosome, start_bp, end_bp). The advantage of the former method is that it is possible to lift over between genome builds.
species defines species.
sample defines individuals with genotypes.
genotype defines genotypes, each of which connects a sample to a marker.
hapblock contains special information about the locations of haplotype blocks.
chromosome defines chromosome lengths.

In GSCANDB a genome scan is associated with a mapping population, phenotype and genome build. Each scan contains one or more named subscans. A subscan is a series of quantitative measurements along the genome, where each measurement is associated with a marker or marker interval. The subscan mechanism is useful for storing different analyses of the same underlying data. For example, we analyse all our phenotypes in at least four ways, looking for singlepoint additive and dominance effects and multipoint additive and dominance effects. Note that the marker order of the data depends on the genome build, and is therefore defined by uploading marker_mapping files. Although genome scan files may contain positional information, this is ignored. Genome scan input files have two accepted formats:

tabular files are tab-separated tables with a header line. Additional information (about population, genome build, phenotype, subscan label) is provided by command-line arguments. We recommend you use this format. It was designed initially for uploading microarray expression data, but can be used for other singlepoint data as well. The header line of the file must contain exactly one column named either "marker", "Transcript" or "gene", and another column containing the scan data, whose name matches the command-line argument -colname. Each row of the file corresponds to data for a specific marker/probe/transcript/gene.
scan files have a header section with key-value data defining the phenotype, the mapping population, the genome build, the unit of measurement, the type of plot ("interval" or "point") and the formula used in the analysis (or any other piece of text), followed by a data section containing a table of space-separated scan data. Both single-point and multi-point data are supported. A minimal example is given here. The data section starts after the line BEGIN_SCAN_DATA and ends before the line END_SCAN_DATA. The first row gives the column names. We use this format to upload files produced by our QTL-mapping pipeline written in R.
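To make the layout of the scan-file data section concrete, here is a minimal Python sketch that extracts it. It relies only on what is stated above: the BEGIN_SCAN_DATA and END_SCAN_DATA marker lines, a first row of column names, and whitespace-separated fields. The header section is skipped rather than parsed, because its exact key-value syntax is not described here. The function name read_scan_data is illustrative.

```python
def read_scan_data(lines):
    """Return (column_names, rows) from the data section of a scan file.

    Each row is a dict mapping column name to the field's text value.
    """
    columns, rows, in_data = None, [], False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN_SCAN_DATA":
            in_data = True
        elif line == "END_SCAN_DATA":
            break
        elif in_data and line:
            fields = line.split()
            if columns is None:
                columns = fields  # first row of the section names the columns
            else:
                rows.append(dict(zip(columns, fields)))
    return columns, rows
```

Values are kept as strings so that the caller can decide how to convert scores; passing an open file handle as lines works because the function only iterates over it.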
To upload subscans named "additive" and "full" for the phenotype "EMO", population "HS" and genome build "34" from a genome scan file called "EMO.txt" containing columns named "additive" and "full", one would type
```
sh loadData.sh gscan=EMO.txt pop=HS build=34 pheno=EMO labels=additive,full
```
The command-line argument -colname specifies a comma-separated list of column names to upload, corresponding to the subscan names.
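For the tabular format, the header rules described above can be checked before uploading. The sketch below is an assumption-laden illustration in Python: it treats column names as case-sensitive and expects the caller to supply the same column names that will be given on the command line. The function name check_tabular_header is hypothetical.

```python
ID_COLUMNS = ("marker", "Transcript", "gene")


def check_tabular_header(header_line, colnames):
    """Return a list of problems found in a tab-separated header line."""
    fields = header_line.rstrip("\n").split("\t")
    problems = []
    # Exactly one identifier column is allowed.
    id_hits = [field for field in fields if field in ID_COLUMNS]
    if len(id_hits) != 1:
        problems.append(
            "expected exactly one of %s, found %d"
            % (", ".join(ID_COLUMNS), len(id_hits))
        )
    # Every requested scan column must be present in the header.
    for name in colnames:
        if name not in fields:
            problems.append("missing scan column: %s" % name)
    return problems
```

An empty list means the header satisfies both rules; otherwise each entry describes one problem to fix in the file.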

Examples

Examples will be shown here

Thorhildur Juliusdottir

Last modified: Tue Oct 13 12:01:24 BST 2009