#####################################################
#                                                   #
#      file specifications for glfTools/glfv3       #
#                                                   #
#####################################################

### Introduction ###

This file holds information on the various file formats used by the glfTools package.

### Contents ###

1: GLFv3 Formats
  1.1: GLFv3 Binary Format
  1.2: GLFv3 Text Dump Format
2: Genotype Formats
  2.1: Text Genotype Format
  2.2: Binary Genotype Format
3: pos File Format
4: maq SNP Format

### 1: GLFv3 Formats ###

## 1.1: GLFv3 Binary Format ##

The Genotype Likelihood Format (version 3) stores SNP and indel information, including genotype likelihoods and other quality information across many sites, in an efficient binary format. The GLFv3 binary format specification can be found in Appendix A to the SAM Format Specification (http://samtools.sourceforge.net/SAM1.pdf).

## 1.2: GLFv3 Text Dump Format ##

The GLFv3 text dump format gives .glf file information in a human-readable form. The format is:

  chrom pos r d rmQ min_lk AA AC AG AT CC CG CT GG GT TT

where chrom is the chromosome name, pos is the position on the chromosome, r is the reference base, d is the read depth, rmQ is the RMS mapping quality, min_lk is the negative log-likelihood of the highest-likelihood genotype, and AA-TT are the negative log-likelihoods of the genotypes given the read data (normalised such that the lowest negative log-likelihood is 0).

Note that this text format consists of 4 tab-separated fields: chrom, pos, site information and likelihood values. The last two fields are each broken down further into fixed-width, space-separated values (i.e. %3d %3d %3d etc).

### 2: Genotype Formats ###

## 2.1: Text Genotype Format ##

This file consists of a number of lines in the following form:

  chrom pos XX

where chrom is the chromosome name (a character string), pos is the position in the chromosome (an int) and XX is the genotype (two characters from {A,C,T,G,N}).
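
As an illustration only (this is not glfTools code), a line in the primary form described above could be parsed roughly as in the following Python sketch. The names GenotypeCall and parse_genotype_line are hypothetical, and splitting on arbitrary whitespace is an assumption, since the field separator is not specified here.

```python
from typing import NamedTuple

# Characters allowed in a two-character genotype, per the description above.
VALID_BASES = set("ACGTN")


class GenotypeCall(NamedTuple):
    """One 'chrom pos XX' entry (hypothetical container, for illustration)."""
    chrom: str
    pos: int
    genotype: str


def parse_genotype_line(line: str) -> GenotypeCall:
    """Parse a single 'chrom pos XX' line.

    Assumption: fields are separated by any run of whitespace.
    Raises ValueError if the line does not match the expected shape.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    chrom, pos_text, genotype = fields
    if len(genotype) != 2 or not set(genotype) <= VALID_BASES:
        raise ValueError(f"genotype must be two characters from ACGTN: {genotype!r}")
    # int() raises ValueError itself if the position is not an integer.
    return GenotypeCall(chrom, int(pos_text), genotype)


if __name__ == "__main__":
    print(parse_genotype_line("chr1 1234 AG"))
```

The sketch rejects malformed lines rather than guessing, so a caller can decide whether to skip or abort on bad input.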
There is a secondary form:

  chrom pos X

where character X is the genotype as a single IUPAC ambiguity code.

All entries on the same chromosome must be contiguous and sorted by position, though the chromosomes themselves need not be sorted.

There is a combined version of this format, for storing genotype calls for a number of individuals; this consists of a number of text genotype files concatenated together, each with a title line containing a single field identifying the individual. This is the output produced by bin2hapmap.

## 2.2: Binary Genotype Format ##

The binary file contains genotype calls for a number of positions. It consists of:

  A) a dictionary of chromosome names chromDict:
       Dictionary magic number "DICT"    (char[4])
       Number of chromosomes N           (int)
       N x chromosome strings:
         string length len               (int)
         chromosome name s               (char[len])
  B) number of sites n                   (int)
  C) array of chromosomes chrArray       (int[n])
  D) array of positions posArray         (int[n])
  E) individual genotype data (read until EOF):
       length of sample name len         (int)
       sample name                       (char[len])
       genotype list gList               (int[n])

chrArray gives the chromosomes as keys to the dictionary chromDict, and posArray[i] gives the co-ordinate of site i on the chromosome given by chrArray[i]. The entries of gList use the conversion code {0:AA,1:AC,2:AG,3:AT,4:CC,5:CG,6:CT,7:GG,8:GT,9:TT}, with any other number taken to be "NN".

### 3: pos File Format ###

A text file, where each entry corresponds to a chromosome position. Each entry has the format:

  chrom pos N N M

where chrom is the chromosome name (a character string), pos is the position in the chromosome (an int), and the remaining three fields (N, N and M) are not currently used. The entries can be in any order, and chromosomes can be mixed up: the code groups the entries by chromosome and sorts them itself.

### 4: maq SNP Format ###

This is a text format for SNP calls, taken from the program maq. Each entry gives genotype calls and call-quality information for a particular site.
The format is:

  chrom pos r b q d n rms_mapQ nqs b2 q2 b3

Where:

  chrom     chromosome name
  pos       position
  r         reference base
  b         consensus base
  q         Phred-like consensus quality
  d         read depth
  n         the average number of hits of reads covering this position (not currently supported, set to 0.00)
  rms_mapQ  the RMS mapping quality of the reads covering the position
  nqs       the minimum consensus quality in the 3bp flanking regions at each side of the site (6bp in total)
  b2        the second best call
  q2        log likelihood ratio of the second best and the third best call
  b3        the third best call
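
As an illustration only (this is not glfTools or maq code), the twelve fields of a maq SNP entry could be read into a small record as in the following Python sketch. The names MaqSnp and parse_maq_snp_line are hypothetical. Splitting on arbitrary whitespace and treating q, d, rms_mapQ, nqs and q2 as integers are assumptions; only n is shown in this file with a decimal value (0.00).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaqSnp:
    """One maq SNP entry (hypothetical container, for illustration)."""
    chrom: str       # chromosome name
    pos: int         # position
    ref: str         # r: reference base
    cons: str        # b: consensus base
    cons_qual: int   # q: Phred-like consensus quality (assumed integer)
    depth: int       # d: read depth
    avg_hits: float  # n: average number of hits (0.00 when unsupported)
    rms_map_q: int   # rms_mapQ (assumed integer)
    nqs: int         # minimum consensus quality in the flanking regions
    second: str      # b2: second best call
    llr: int         # q2: log likelihood ratio of 2nd vs 3rd best call
    third: str       # b3: third best call


def parse_maq_snp_line(line: str) -> MaqSnp:
    """Parse one entry; assumes whitespace-separated fields."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}: {line!r}")
    return MaqSnp(
        chrom=f[0], pos=int(f[1]), ref=f[2], cons=f[3],
        cons_qual=int(f[4]), depth=int(f[5]), avg_hits=float(f[6]),
        rms_map_q=int(f[7]), nqs=int(f[8]),
        second=f[9], llr=int(f[10]), third=f[11],
    )


if __name__ == "__main__":
    # Made-up example values, not taken from real maq output.
    print(parse_maq_snp_line("chr1 1000 A R 45 12 0.00 60 40 G 30 N"))
```

Keeping the record immutable (frozen=True) is a design choice for the sketch: parsed calls can then be used as dictionary keys or stored in sets without risk of accidental modification.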