#####################################################
#						    #
# usage information for glfTools/glfv3 	            #
#						    #
#####################################################

### Contents ###

1: General Information
2: Usage Information
	2.1: hapmap2bin
	2.2: hapmap2dict
	2.3: bin2hapmap
	2.4: glf dump
	2.5: glf extract
	2.6: glf stats
	2.7: glf soloPrior
	2.8: glf snpCall
	2.9: glf subCall
	2.10: glf checkGenotype
	2.11: glf import


### 1: General Information ###

The glfTools package contains a number of utilities for handling the Genotype Likelihood Format version 3 (GLFv3). GLFv3 is used as a pileup output by the samtools package, and is a compact and informative way of storing likelihood and quality information for large-scale sequence data.

The package contains the binary "glf", which has a number of subcommands for reading and manipulating .glf files, calling genotypes from them, and checking genotypes. Also included is the program hapmap2bin, which converts text genotype files to the compact binary format used for genotype checking by the glf subcommand checkGenotype.

Detailed usage information is included below, and information on the file formats used by the package is given in file_specs.txt.


### 2: Usage Information ###


## 2.1: hapmap2bin ##

usage: hapmap2bin [-i] [-d DICT_FILE] [-f FILE_LIST] [genotype_file1.snp genotype_file2.snp ...] > genotype_file.bin

hapmap2bin takes in a number of text genotype files, and outputs a single combined binary genotype file to stdout. The sample names are taken to be the file locations of the text genotype files.

The text genotype files can either be given directly on the command line, or in a text file containing a list of text genotype files. hapmap2bin can work as a stand-alone command; however, if the text genotype files have divergent SNP lists, some sites may be left out. To avoid this, generate a binary dictionary position file using hapmap2dict, and give it to hapmap2bin using the option "-d".

See file_specs.txt for information on the input and output files.

Options:
	-i	read secondary text genotype files (using single character IUPAC ambiguity codes for genotypes)
	-d DICT_FILE	read the chromosome dictionary and position array from a binary file
	-f FILE_LIST	read in a list of files to convert to binary, rather than taking them from the command line
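
The dictionary-first workflow described above can be sketched as a small shell function. This is an illustration only: the function name build_genotype_bin and the intermediate file names are hypothetical, and the file list is assumed to hold one path per line (see file_specs.txt for the actual format). Only the -f and -d options documented in this file are used.

```shell
# Hypothetical helper: combine text genotype files into one binary file,
# building a position dictionary first so that no sites are left out.
# usage: build_genotype_bin OUT_PREFIX file1.snp [file2.snp ...]
build_genotype_bin() {
    prefix=$1
    shift
    # Assumption: the file list holds one input path per line.
    printf '%s\n' "$@" > "$prefix.list" || return 1
    # Step 1: record every site on every chromosome in a dictionary.
    hapmap2dict -f "$prefix.list" > "$prefix.dict.bin" || return 1
    # Step 2: convert to binary, using the dictionary.
    hapmap2bin -d "$prefix.dict.bin" -f "$prefix.list" > "$prefix.bin"
}
```

For example, "build_genotype_bin all sampleA.snp sampleB.snp" would write all.list, all.dict.bin and all.bin. Since sample names are taken from the file locations, the paths passed in here become the sample names.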


## 2.2: hapmap2dict ##

usage: hapmap2dict [-f FILELIST] [genotype_file1.snp genotype_file2.snp ...] > dictionary_file.bin

hapmap2dict generates a binary dictionary position file from a number of text genotype files. This dictionary contains the position of every site on every chromosome in the text genotype files given, and should be generated when dealing with a number of genotype files with non-identical SNP lists (and is essential for non-identical chromosome lists). The dictionary can be given as input to hapmap2bin, to ensure all sites are included in the resulting binary genotype file.

See file_specs.txt for information on the input and output files.

Options:
	-f FILELIST	read a list of files from a text file


## 2.3: bin2hapmap ##

usage: bin2hapmap genotype_file.bin > genotype_file.snp

bin2hapmap converts from a binary genotype file to a multi-individual text genotype file. See file_specs.txt for information on the input and output files.


## 2.4: glf dump ##

usage: glf dump < infile.glf > outfile.txt

glf dump converts from the GLFv3 binary format to a human-readable GLFv3 text dump format.

See file_specs.txt for information on the GLFv3 text dump format.


## 2.5: glf extract ##

usage: glf extract [-file site_file.pos] [-name CHR_NAME] [-start x] [-end y] < infile.glf > outfile.glf

glf extract extracts a subset of sites from a glf file. Either a pos file with sites to be extracted, or a chromosome name to extract, must be given. If a chromosome name is given, start and end co-ordinates can also be provided. Note that glf extract currently removes indel information from a glf file, extracting only SNP information.

See file_specs.txt for information on the pos file format.

Options:
	-file site_file.pos	a pos file of sites to extract
	-name CHR_NAME	the name of a chromosome to extract
	-start x	the first position on chromosome CHR_NAME to extract 
	-end y	the last position on chromosome CHR_NAME to extract
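
As a sketch of region extraction using only the options listed above, the following wraps glf extract in a shell function. The function name extract_region and the example chromosome name are hypothetical.

```shell
# Hypothetical helper: extract positions START..END of one chromosome.
# usage: extract_region CHR_NAME START END infile.glf outfile.glf
extract_region() {
    # Note: indel information is not carried over to the output file.
    glf extract -name "$1" -start "$2" -end "$3" < "$4" > "$5"
}
```

For example, "extract_region chr20 1000000 2000000 sample.glf sample.chr20.glf", where chr20 stands in for whatever chromosome name appears in the glf file.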


## 2.6: glf stats ##

usage: glf stats < infile.glf

glf stats prints out the length of each chromosome in infile.glf, as well as depth, mapping quality and reference likelihood histograms.


## 2.7: glf soloPrior ##

usage: glf soloPrior [-theta THETA] [-het HETSUPPRESS] < infile.glf > outfile.glf

glf soloPrior adjusts the genotype likelihoods based on prior information.

It reweights the likelihoods by the probability of difference from the reference, assuming the reference is a correct haplotype from the same population, given a population-scaled mutation rate THETA. It also reduces all heterozygous (het) likelihoods by HETSUPPRESS; for example, set it to 50 to model haploid chromosomes.

Options:
	-theta THETA	population-scaled mutation rate, default is 0.001
	-het HETSUPPRESS	a penalty applied to het likelihoods, default is 0


## 2.8: glf snpCall ##

usage: glf snpCall < infile.glf > outfile.snp

glf snpCall calls all non-homozygous reference bases from infile.glf, and outputs them in the maq SNP format. Note that this does not apply any priors; these should be applied beforehand using glf soloPrior.

See file_specs.txt for information on the maq SNP format.
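
Because snpCall applies no priors itself, a typical run would chain soloPrior and snpCall. A minimal sketch follows, using an intermediate file between the two steps; the function name call_snps_with_prior is hypothetical, and only the documented -theta option is used.

```shell
# Hypothetical helper: apply the soloPrior reweighting, then call SNPs.
# usage: call_snps_with_prior THETA infile.glf outfile.snp
call_snps_with_prior() {
    tmp=$(mktemp) || return 1
    # Step 1: reweight the likelihoods into a temporary glf file.
    glf soloPrior -theta "$1" < "$2" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Step 2: call SNPs from the reweighted likelihoods.
    glf snpCall < "$tmp" > "$3"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

For example, "call_snps_with_prior 0.001 sample.glf sample.snp" uses the default value of THETA explicitly.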


## 2.9: glf subCall ##

usage: glf subCall -file sites_file.pos < infile.glf > outfile.snp

glf subCall calls all sites given in the pos file sites_file.pos using genotype likelihood data from infile.glf, and outputs them in the maq SNP format. Note that this does not apply any priors; these should be applied beforehand using glf soloPrior.

See file_specs.txt for information on the maq SNP format and the pos file format.

Options:
	-file sites_file.pos	a pos file containing the sites to be called


## 2.10: glf checkGenotype ##

usage: glf checkGenotype genotype_file.bin infile.glf

glf checkGenotype calculates the likelihood of the genotype information in infile.glf for each genotype in the binary genotype file genotype_file.bin. It prints the likelihood entropy for infile.glf, and then the negative log-likelihood and number of sites with non-missing data for each genotype, listed in order of decreasing likelihood.
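
When several glf files need to be checked against the same binary genotype file, the command can be looped. The sketch below assumes nothing beyond the usage line above; the function name check_all and the "==" separator lines are hypothetical additions for readability.

```shell
# Hypothetical helper: check several glf files against one binary genotype file.
# usage: check_all genotype_file.bin sample1.glf [sample2.glf ...]
check_all() {
    bin=$1
    shift
    for glf_file in "$@"; do
        # Label each report with the glf file it belongs to.
        echo "== $glf_file =="
        glf checkGenotype "$bin" "$glf_file" || return 1
    done
}
```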

## 2.11: glf import ##

usage: glf import [-Q] glf_textdump.txt > out.glf

glf import converts a text-dumped glf (such as produced by glf dump) into a binary glf.

glf import can also be used to convert a multisample QCall-format text file into a number of glf files (one per sample) using the -Q argument. This can thus be used to convert between multisample bcf files and glf files using bcftools and glfTools, e.g.:

> bcftools view -Q testfile.bcf > testfile.qcall
> glf import -Q testfile.qcall

A glf file will be produced for each sample in the working directory, each named {FILENAME}.glf.
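
Since glf import reads the text produced by glf dump, the two can be combined into a round trip, for instance to inspect or edit a glf file as text. A minimal sketch, with a hypothetical function name glf_roundtrip:

```shell
# Hypothetical helper: dump a glf to text, then rebuild a binary glf from it.
# usage: glf_roundtrip infile.glf dump.txt rebuilt.glf
glf_roundtrip() {
    # Step 1: binary glf -> human-readable text dump.
    glf dump < "$1" > "$2" || return 1
    # Step 2: text dump -> binary glf.
    glf import "$2" > "$3"
}
```

The text dump is kept on disk so it can be examined between the two steps.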