This directory contains applications for stand-alone use,
built specifically for a Linux 64-bit machine.

For help on the bigBed and bigWig applications see:
http://genome.ucsc.edu/goldenPath/help/bigBed.html
http://genome.ucsc.edu/goldenPath/help/bigWig.html

View the file 'FOOTER' to see the usage statement for
each of the applications.
Name                  Last modified      Size  Description
Parent Directory                            -
FOOTER                05-Apr-2011 17:00   61K
bedClip               05-Apr-2011 16:59  235K
bedExtendRanges       05-Apr-2011 16:59  2.6M
bedGraphToBigWig      05-Apr-2011 16:59  243K
bedItemOverlapCount   05-Apr-2011 16:59  2.6M
bedSort               05-Apr-2011 16:59  203K
bedToBigBed           05-Apr-2011 16:59  314K
bigBedInfo            05-Apr-2011 16:59  248K
bigBedSummary         05-Apr-2011 16:59  248K
bigBedToBed           05-Apr-2011 16:59  247K
bigWigInfo            05-Apr-2011 16:59  239K
bigWigSummary         05-Apr-2011 16:59  238K
bigWigToBedGraph      05-Apr-2011 16:59  238K
bigWigToWig           05-Apr-2011 16:59  238K
blat/                 06-Apr-2011 15:45     -
faCount               05-Apr-2011 16:59  160K
faFrag                05-Apr-2011 16:59  157K
faOneRecord           05-Apr-2011 16:59  135K
faPolyASizes          05-Apr-2011 16:59  157K
faRandomize           05-Apr-2011 16:59  157K
faSize                05-Apr-2011 16:59  160K
faSomeRecords         05-Apr-2011 16:59  139K
faToNib               05-Apr-2011 16:59  163K
faToTwoBit            05-Apr-2011 16:59  245K
fetchChromSizes       05-Apr-2011 16:59  2.6K
genePredToGtf         05-Apr-2011 16:59  2.6M
gff3ToGenePred        05-Apr-2011 17:00  2.7M
gtfToGenePred         05-Apr-2011 17:00  2.6M
hgWiggle              05-Apr-2011 17:00  2.7M
htmlCheck             05-Apr-2011 16:59  227K
liftOver              05-Apr-2011 16:59  2.6M
liftOverMerge         05-Apr-2011 16:59  207K
liftUp                05-Apr-2011 16:59  2.7M
mafSpeciesSubset      05-Apr-2011 16:59  159K
mafsInRegion          05-Apr-2011 16:59  219K
nibFrag               05-Apr-2011 16:59  165K
overlapSelect         05-Apr-2011 17:00  2.7M
paraFetch             05-Apr-2011 16:59  202K
paraSync              05-Apr-2011 16:59  202K
pslCDnaFilter         05-Apr-2011 16:59  220K
pslPretty             05-Apr-2011 16:59  1.2M
pslReps               05-Apr-2011 16:59  718K
pslSort               05-Apr-2011 16:59  719K
sizeof                05-Apr-2011 16:59  5.3K
stringify             05-Apr-2011 16:59  139K
textHistogram         05-Apr-2011 16:59  146K
twoBitInfo            05-Apr-2011 16:59  238K
twoBitToFa            05-Apr-2011 16:59  305K
validateFiles         05-Apr-2011 16:59  2.7M
wigCorrelate          05-Apr-2011 16:59  259K
wigToBigWig           05-Apr-2011 16:59  889K
================================================================
======== bedClip ====================================
================================================================
bedClip - Remove lines from bed file that refer to off-chromosome places.
usage:
   bedClip input.bed chrom.sizes output.bed
options:
   -verbose=2 - set to get list of lines clipped and why

================================================================
======== bedExtendRanges ====================================
================================================================
bedExtendRanges - extend length of entries in bed 6+ data to be at least the
given length, taking strand directionality into account.

usage:
   bedExtendRanges database length file(s)

options:
   -host       mysql host
   -user       mysql user
   -password   mysql password
   -tab        Separate by tabs rather than space
   -verbose=N - verbose level for extra information to STDERR

example:

   bedExtendRanges hg18 250 stdin

   bedExtendRanges -user=genome -host=genome-mysql.cse.ucsc.edu hg18 250 stdin

will transform:
    chr1 500 525 . 100 +
    chr1 1000 1025 . 100 -
to:
    chr1 500 750 . 100 +
    chr1 775 1025 . 100 -

================================================================
======== bedGraphToBigWig ====================================
================================================================
bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig.
usage:
   bedGraphToBigWig in.bedGraph chrom.sizes out.bw
where in.bedGraph is a four column file in the format:
      <chrom> <start> <end> <value>
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bw is the output indexed big wig file.
The input bedGraph file must be sorted, use the unix sort command:
  sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph
options:
   -blockSize=N - Number of items to bundle in r-tree.  Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
   -unc - If set, do not use compression.
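
The sort requirement for bedGraphToBigWig can be sketched in Python. This is only an illustration of the ordering (chromosome name as text, then start as a number); it is not part of the UCSC tools, and the function name sort_bedgraph and the sample values are hypothetical. Comparing start as text would place 1000 before 200, which is why the start column is converted to an integer.

```python
def sort_bedgraph(lines):
    """Return bedGraph lines ordered by chromosome name, then numeric start.

    Assumption: each line has four whitespace-separated columns,
    <chrom> <start> <end> <value>, as in the usage statement above.
    """
    def key(line):
        chrom, start, _end, _value = line.split()[:4]
        # int(start) gives a numeric comparison, so 200 sorts before 1000
        return (chrom, int(start))
    return sorted(lines, key=key)


# Hypothetical sample data, deliberately out of order
unsorted = [
    "chr2 100 200 1.5",
    "chr1 1000 1100 0.25",
    "chr1 200 300 3.0",
]

for line in sort_bedgraph(unsorted):
    print(line)
```

Python compares the chromosome names by code point, which corresponds to sorting in the C locale; other locales may order names differently.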
================================================================
======== bedItemOverlapCount ====================================
================================================================
bedItemOverlapCount - count number of times a base is overlapped by the
items in a bed file.  Output is bedGraph 4 to stdout.
usage:
 sort bedFile.bed | bedItemOverlapCount [options] <database> stdin
To create a bigWig file from this data to use in a custom track:
 sort bedFile.bed | bedItemOverlapCount [options] <database> stdin \
         > bedFile.bedGraph
 bedGraphToBigWig bedFile.bedGraph chrom.sizes bedFile.bw
   where the chrom.sizes is obtained with the script: fetchChromSizes
   See also:
 http://genome-test.cse.ucsc.edu/~kent/src/unzipped/utils/userApps/fetchChromSizes
options:
   -zero      add blocks with zero count, normally these are omitted
   -bed12     expect bed12 and count based on blocks
              Without this option, only the first three fields are used.
   -max       if counts per base overflows set to max (4294967295) instead
              of exiting
   -outBounds output min/max to stderr
   -chromSize=sizefile Read chrom sizes from file instead of database
              sizefile contains two white space separated fields per line:
              chrom name and size
   -host=hostname mysql host used to get chrom sizes
   -user=username mysql user
   -password=password mysql password
Notes:
 * You may want to separate your + and - strand
   items before sending into this program as it only looks at
   the chrom, start and end columns of the bed file.
 * Program requires a <database> connection to lookup chrom sizes for a sanity
   check of the incoming data.  Even when the -chromSize argument is used
   the <database> must be present, but it will not be used.
 * The bed file *must* be sorted by chrom
 * Maximum count per base is 4294967295.
   Recompile with new unitSize to increase this

================================================================
======== bedSort ====================================
================================================================
bedSort - Sort a .bed file by chrom,chromStart
usage:
   bedSort in.bed out.bed
in.bed and out.bed may be the same.

================================================================
======== bedToBigBed ====================================
================================================================
bedToBigBed v. 4 - Convert bed file to bigBed.
usage:
   bedToBigBed in.bed chrom.sizes out.bb
Where in.bed is in one of the ascii bed formats, but not including track lines
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bb is the output indexed big bed file.
The in.bed file must be sorted by chromosome,start,
  to sort a bed file, use the unix sort command:
     sort -k1,1 -k2,2n unsorted.bed > sorted.bed

options:
   -blockSize=N - Number of items to bundle in r-tree.  Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 512
   -bedFields=N - Number of fields that fit standard bed definition.  If undefined
                  assumes all fields in bed are defined.
   -as=fields.as - If have non-standard fields, it's great to put a definition
                   of each field in a row in AutoSql format here.
   -unc - If set, do not use compression.

================================================================
======== bigBedInfo ====================================
================================================================
bigBedInfo - Show information about a bigBed file.
usage:
   bigBedInfo file.bb
options:
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -chroms - list all chromosomes and their sizes
   -zooms - list all zoom levels and their sizes
   -as - get autoSql spec

================================================================
======== bigBedSummary ====================================
================================================================
bigBedSummary - Extract summary information from a bigBed file.
usage:
   bigBedSummary file.bb chrom start end dataPoints
Get summary data from bigBed for indicated region, broken into
dataPoints equal parts.  (Use dataPoints=1 for simple summary.)
options:
   -type=X where X is one of:
         coverage - % of region that is covered (default)
         mean - average depth of covered regions
         min - minimum depth of covered regions
         max - maximum depth of covered regions
   -fields - print out information on fields in file.
      If fields option is used, the chrom, start, end, dataPoints
      parameters may be omitted
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
======== bigBedToBed ====================================
================================================================
bigBedToBed - Convert from bigBed to ascii bed format.
usage:
   bigBedToBed input.bb output.bed
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -maxItems=N - if set, restrict output to first N items
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
======== bigWigInfo ====================================
================================================================
bigWigInfo - Print out information about bigWig file.
usage:
   bigWigInfo file.bw
options:
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -chroms - list all chromosomes and their sizes
   -zooms - list all zoom levels and their sizes
   -minMax - list the min and max on a single line

================================================================
======== bigWigSummary ====================================
================================================================
bigWigSummary - Extract summary information from a bigWig file.
usage:
   bigWigSummary file.bigWig chrom start end dataPoints
Get summary data from bigWig for indicated region, broken into
dataPoints equal parts.  (Use dataPoints=1 for simple summary.)

NOTE:  start and end coordinates are in BED format (0-based)

options:
   -type=X where X is one of:
         mean - average value in region (default)
         min - minimum value in region
         max - maximum value in region
         std - standard deviation in region
         coverage - % of region that is covered
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
======== bigWigToBedGraph ====================================
================================================================
bigWigToBedGraph - Convert from bigWig to bedGraph format.
usage:
   bigWigToBedGraph in.bigWig out.bedGraph
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
======== bigWigToWig ====================================
================================================================
bigWigToWig - Convert bigWig to wig.  This will keep more of the same structure of the
original wig than bigWigToBedGraph does, but still will break up large stepped sections
into smaller ones.
usage:
   bigWigToWig in.bigWig out.wig
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restrict output to only that under end
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

================================================================
======== blat ====================================
================================================================
blat - Standalone BLAT v. 34x10 fast sequence search command line tool
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa, .nib or .2bit file,
   or a list of these files, one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
               an external file.  This will increase the speed
               by a factor of 40 in many cases, but is not required
   output.psl is where to put the output.
   Subranges of nib and .2bit files may be specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the second form, a sequence id of file:start-end will be used.
options:
   -t=type     Database type.  Type is one of:
                 dna - DNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
               The default is dna
   -q=type     Query type.  Type is one of:
                 dna - DNA sequence
                 rna - RNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
                 rnax - DNA sequence translated in three frames to protein
               The default is dna
   -prot       Synonymous with -t=prot -q=prot
   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to
               the tileSize
   -tileSize=N sets the size of match that triggers an alignment.
               Usually between 8 and 12
               Default is 11 for DNA and 5 for protein.
   -stepSize=N spacing between tiles. Default is tileSize.
   -oneOff=N   If set to 1 this allows one mismatch in tile and still
               triggers an alignment.  Default is 0.
   -minMatch=N sets the number of tile matches.
               Usually set from 2 to 4
               Default is 2 for nucleotide, 1 for protein.
   -minScore=N sets minimum score.  This is the matches minus the
               mismatches minus some sort of gap penalty.  Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
               90 for nucleotide searches, 25 for protein or translated
               protein searches.
   -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually
               set from 0 to 3.  Default is 2. Only relevant for minMatch > 1.
   -noHead     suppress .psl header (so it's just a tab-separated file)
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N sets the number of repetitions of a tile allowed before
               it is marked as overused.  Typically this is 256 for tileSize
               12, 1024 for tile size 11, 4096 for tile size 10.
               Default is 1024.  Typically only comes into play with makeOoc.
               Also affected by stepSize. When stepSize is halved repMatch is
               doubled to compensate.
   -mask=type  Mask out repeats.  Alignments won't be started in masked region
               but may extend through it in nucleotide searches.  Masked areas
               are ignored entirely in protein or translated searches. Types are
                 lower - mask out lower cased sequence
                 upper - mask out upper cased sequence
                 out   - mask according to database.out RepeatMasker .out file
                 file.out - mask database according to RepeatMasker file.out
   -qMask=type Mask out repeats in query sequence.  Similar to -mask above but
               for query rather than target sequence.
   -repeats=type Type is same as mask types above.  Repeat bases will not be
               masked in any way, but matches in repeat areas will be reported
               separately from matches in other areas in the psl output.
   -minRepDivergence=NN - minimum percent divergence of repeats to allow
               them to be unmasked.  Default is 15.  Only relevant for
               masking using RepeatMasker .out files.
   -dots=N     Output dot every N sequences to show program's progress
   -trimT      Trim leading poly-T
   -noTrimA    Don't trim trailing poly-A
   -trimHardA  Remove poly-A tail from qSize as well as alignments in
               psl output
   -fastMap    Run for fast DNA/DNA remapping - not allowing introns,
               requiring high %ID. Query sizes must not exceed 5000.
   -out=type   Controls output file format.  Type is one of:
                   psl - Default.  Tab separated format, no sequence
                   pslx - Tab separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8- NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -fine       For high quality mRNAs look harder for small initial and
               terminal exons.  Not recommended for ESTs
   -maxIntron=N  Sets maximum intron size. Default is 750000
   -extendThroughN - Allows extension of alignment through large blocks of N's

================================================================
======== faCount ====================================
================================================================
faCount - count base statistics and CpGs in FA files.
usage:
   faCount file(s).fa
     -summary  show only summary statistics
     -dinuc    include statistics on dinucleotide frequencies
     -strands  count bases on both strands

================================================================
======== faFrag ====================================
================================================================
faFrag - Extract a piece of DNA from a .fa file.
usage:
   faFrag in.fa start end out.fa
options:
   -mixed - preserve mixed-case in FASTA file

================================================================
======== faOneRecord ====================================
================================================================
faOneRecord - Extract a single record from a .FA file
usage:
   faOneRecord in.fa recordName

================================================================
======== faPolyASizes ====================================
================================================================
faPolyASizes - get poly A sizes
usage:
   faPolyASizes in.fa out.tab

output file has four columns:
   id seqSize tailPolyASize headPolyTSize

options:

================================================================
======== faRandomize ====================================
================================================================
faRandomize - Program to create random fasta records using
same base frequency as seen in original fasta records.
Use optional -seed flag to specify seed for random number generator.
usage:
  faRandomize in.fa randomized.fa

================================================================
======== faSize ====================================
================================================================
faSize - print total base count in fa files.
usage:
   faSize file(s).fa
Command flags
   -detailed        outputs name and size of each record
                    has the side effect of printing nothing else
   -tab             output statistics in a tab separated format

================================================================
======== faSomeRecords ====================================
================================================================
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.
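
The selection that faSomeRecords performs, with and without -exclude, can be sketched in Python. This is a minimal illustration and not the tool's implementation. It assumes a record's name is the first word after '>' on its header line, and the function name some_records and the sample records are hypothetical.

```python
def some_records(fasta_lines, names, exclude=False):
    """Keep FASTA records whose name is in names (or not in names if exclude).

    Assumption: the record name is the first whitespace-delimited word
    after '>' on the header line.
    """
    wanted = set(names)
    kept = []
    keep = False  # lines before the first header are dropped
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            record_name = fields[0] if fields else ""
            # keep listed records normally, unlisted ones when excluding
            keep = (record_name in wanted) != exclude
        if keep:
            kept.append(line)
    return kept


# Hypothetical sample input
fasta = [">seq1 first record", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA"]

print(some_records(fasta, ["seq1", "seq3"]))
print(some_records(fasta, ["seq1", "seq3"], exclude=True))
```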
================================================================
======== faToNib ====================================
================================================================
faToNib - Convert from .fa to .nib format
usage:
   faToNib [options] in.fa out.nib
options:
   -softMask - create nib that soft-masks lower case sequence
   -hardMask - create nib that hard-masks lower case sequence

================================================================
======== faToTwoBit ====================================
================================================================
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
   faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
   -noMask       - Ignore lower-case masking in fa file.
   -stripVersion - Strip off version number after . for genbank accessions.
   -ignoreDups   - only convert first sequence if there are duplicates

================================================================
======== fetchChromSizes ====================================
================================================================
usage: fetchChromSizes <db> > <db>.chrom.sizes
   used to fetch chrom.sizes information from UCSC for the given <db>
<db> - name of UCSC database, e.g.: hg18, mm9, etc ...

This script expects to find one of the following commands:
   wget, mysql, or ftp in order to fetch information from UCSC.
Route the output to the file <db>.chrom.sizes as indicated above.

Example:   fetchChromSizes hg18 > hg18.chrom.sizes

================================================================
======== genePredToGtf ====================================
================================================================
genePredToGtf - Convert genePred table or file to gtf.
usage:
   genePredToGtf database genePredTable output.gtf

If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr - Add 5UTR and 3UTR features
   -honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
    codon records
   -source=src set source name to use
   -addComments - Add comments before each set of transcript records.
    allows for easier visual inspection
Note: use a refFlat table or extended genePred table or file to include
the gene_name attribute in the output.  This will not work with a refFlat
table dump file. If you are using a genePred file that starts with a numeric
bin column, drop it using the UNIX cut command:
    cut -f 2- in.gp | genePredToGtf file stdin out.gp

================================================================
======== gfClient ====================================
================================================================
gfClient v. 34x10 - A client for the genomic finding program that produces a .psl file
usage:
   gfClient host port seqDir in.fa out.psl
where
   host is the name of the machine running the gfServer
   port is the same as you started the gfServer with
   seqDir is the path of the .nib or .2bit files relative to the current dir
       (note these are needed by the client as well as the server)
   in.fa is a fasta format file.  May contain multiple records
   out.psl where to put the output
options:
   -t=type     Database type.  Type is one of:
                 dna - DNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
               The default is dna
   -q=type     Query type.  Type is one of:
                 dna - DNA sequence
                 rna - RNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
                 rnax - DNA sequence translated in three frames to protein
   -prot       Synonymous with -t=prot -q=prot
   -dots=N   Output a dot every N query sequences
   -nohead   Suppresses psl five line header
   -minScore=N sets minimum score.  This is twice the matches minus the
               mismatches minus some sort of gap penalty.  Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent).
               Default is 90 for nucleotide searches, 25 for protein or
               translated protein searches.
   -out=type   Controls output file format.  Type is one of:
                   psl - Default.  Tab separated format without actual sequence
                   pslx - Tab separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8- NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -maxIntron=N  Sets maximum intron size. Default is 750000

================================================================
======== gfServer ====================================
================================================================
gfServer v 34x10 - Make a server to quickly find where DNA occurs in genome.
To set up a server:
   gfServer start host port file(s)
   Where the files are in .nib or .2bit format
To remove a server:
   gfServer stop host port
To query a server with DNA sequence:
   gfServer query host port probe.fa
To query a server with protein sequence:
   gfServer protQuery host port probe.fa
To query a server with translated dna sequence:
   gfServer transQuery host port probe.fa
To query server with PCR primers
   gfServer pcr host port fPrimer rPrimer maxDistance
To process one probe fa file against a .nib format genome (not starting server):
   gfServer direct probe.fa file(s).nib
To test pcr without starting server:
   gfServer pcrDirect fPrimer rPrimer file(s).nib
To figure out usage level
   gfServer status host port
To get input file list
   gfServer files host port
Options:
   -tileSize=N size of n-mers to index.  Default is 11 for nucleotides, 4 for
               proteins (or translated nucleotides).
   -stepSize=N spacing between tiles. Default is tileSize.
   -minMatch=N Number of n-mer matches that trigger detailed alignment
               Default is 2 for nucleotides, 3 for proteins.
   -maxGap=N   Number of insertions or deletions allowed between n-mers.
               Default is 2 for nucleotides, 0 for proteins.
   -trans  Translate database to protein in 6 frames.  Note: it is best
           to run this on RepeatMasked data in this case.
   -log=logFile keep a log file that records server requests.
   -seqLog    Include sequences in log file (not logged with -syslog)
   -ipLog     Include user's IP in log file (not logged with -syslog)
   -syslog    Log to syslog
   -logFacility=facility log to the specified syslog facility - default local0.
   -mask      Use masking from nib file.
   -repMatch=N Number of occurrences of a tile (nmer) that trigger repeat masking
               the tile. Default is 1024.
   -maxDnaHits=N Maximum number of hits for a dna query that are sent from the
               server. Default is 100.
   -maxTransHits=N Maximum number of hits for a translated query that are sent
               from the server. Default is 200.
   -maxNtSize=N Maximum size of untranslated DNA query sequence
               Default is 40000
   -maxAsSize=N Maximum size of protein or translated DNA queries
               Default is 8000
   -canStop If set then a quit message will actually take down the
            server

================================================================
======== gff3ToGenePred ====================================
================================================================
gff3ToGenePred - convert a GFF3 file to a genePred file
usage:
   gff3ToGenePred inGff3 outGp
options:
  -maxParseErrors=50 - Maximum number of parsing errors before aborting.
   A negative value will allow an unlimited number of errors.  Default is 50.
  -maxConverErrors=50 - Maximum number of conversion errors before aborting.
   A negative value will allow an unlimited number of errors.  Default is 50.
  -honorStartStopCodons - only set CDS start/stop status to complete if there are
   corresponding start_stop codon records
This converts:
  - top-level gene records with mRNA records
  - top-level mRNA records
  - mRNA records can contain exon and CDS, or only CDS, or only
    exon for non-coding.
The first step is to parse the GFF3 file; up to 50 errors are reported before
aborting.
If the GFF3 file is successfully parsed, it is converted to gene
annotations.  Up to 50 conversion errors are reported before aborting.

Input file must conform to the GFF3 specification:
   http://www.sequenceontology.org/gff3.shtml

================================================================
======== gtfToGenePred ====================================
================================================================
gtfToGenePred - convert a GTF file to a genePred
usage:
   gtfToGenePred gtf genePred

options:
     -genePredExt - create an extended genePred, including frame
      information and gene name
     -allErrors - skip groups with errors rather than aborting.
      Useful for getting information about as many errors as possible.
     -infoOut=file - write a file with information on each transcript
     -sourcePrefix=pre - only process entries where the source name has the
      specified prefix.  May be repeated.
     -impliedStopAfterCds - implied stop codon is after CDS
     -simple    - just check column validity, not hierarchy, resulting genePred may be damaged
     -geneNameAsName2 - if specified, use gene_name for the name2 field
      instead of gene_id.
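
The coordinate change involved in going from GTF to genePred can be sketched in Python. This is a simplified illustration, not how gtfToGenePred is implemented: it reads only exon records, groups them by transcript_id, and writes the ten basic genePred columns. GTF coordinates are 1-based and inclusive, while genePred starts are 0-based, so each exon start is reduced by one. Setting cdsStart and cdsEnd both to txEnd for a transcript with no CDS records is an assumption made for this sketch, and the function name and sample lines are hypothetical.

```python
import re
from collections import defaultdict


def gtf_exons_to_genepred(gtf_lines):
    """Build basic genePred rows from the exon records of GTF lines."""
    exons = defaultdict(list)   # transcript_id -> [(start0, end), ...]
    location = {}               # transcript_id -> (chrom, strand)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', f[8])
        if match is None:
            continue
        tid = match.group(1)
        location[tid] = (f[0], f[6])
        # GTF start is 1-based inclusive; genePred start is 0-based
        exons[tid].append((int(f[3]) - 1, int(f[4])))

    rows = []
    for tid, blocks in exons.items():
        blocks.sort()
        chrom, strand = location[tid]
        tx_start = blocks[0][0]
        tx_end = max(end for _start, end in blocks)
        starts = "".join(f"{start}," for start, _end in blocks)
        ends = "".join(f"{end}," for _start, end in blocks)
        # Assumption: no CDS information, so cdsStart = cdsEnd = txEnd
        cols = [tid, chrom, strand, tx_start, tx_end, tx_end, tx_end,
                len(blocks), starts, ends]
        rows.append("\t".join(str(c) for c in cols))
    return rows


# Hypothetical two-exon transcript
gtf = [
    'chr1\tsrc\texon\t101\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t301\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]

for row in gtf_exons_to_genepred(gtf):
    print(row)
```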
================================================================
======== hgWiggle ====================================
================================================================
hgWiggle - fetch wiggle data from data base or file
usage:
   hgWiggle [options] <track names ...>
options:
   -db=<database> - use specified database
   -chr=chrN - examine data only on chrN
   -chrom=chrN - same as -chr option above
   -position=[chrN:]start-end - examine data in window start-end (1-relative)
             (the chrN: is optional)
   -chromLst=<file> - file with list of chroms to examine
   -doAscii - perform the default ascii output, in addition to other outputs
            - Any of the other -do outputs turn off the default ascii output
   -rawDataOut - output just the data values, nothing else
   -htmlOut - output stats or histogram in HTML instead of plain text
   -doStats - perform stats measurement, default output text, see -htmlOut
   -doBed - output bed format
   -lift=<D> - lift ascii output positions by D (0 default)
   -bedFile=<file> - constrain output to ranges specified in bed <file>
   -dataConstraint='DC' - where DC is one of < = >= <= == != 'in range'
   -ll=<F> - lowerLimit compare data values to F (float) (all but 'in range')
   -ul=<F> - upperLimit compare data values to F (float)
             (need both ll and ul when 'in range')

   -help - display more examples and extra options (to stderr)

   When no database is specified, track names will refer to .wig files

   example using the file chrM.wig:
      hgWiggle chrM
   example using the database table hg17.gc5Base:
      hgWiggle -chr=chrM -db=hg17 gc5Base

================================================================
======== htmlCheck ====================================
================================================================
htmlCheck - Do a little reading and verification of html file
usage:
   htmlCheck how url
where how is:
   ok - just check for 200 return.
        Print error message and exit -1 if no 200
   getAll - read the url (header and html) and print to stdout
   getHeader - read the header and print to stdout
   getCookies - print list of cookies
   getHtml - print the html, but not the header to stdout
   getForms - print the form structure to stdout
   getVars - print the form variables to stdout
   getLinks - print links
   getTags - print out just the tags
   checkLinks - check links in page
   checkLinks2 - check links in page and all subpages in same host
             (Just one level of recursion)
   checkLocalLinks - check local links in page
   checkLocalLinks2 - check local links in page and connected local pages
             (Just one level of recursion)
   submit - submit first form in page if any using 'GET' method
   validate - do some basic validations including TABLE/TR/TD nesting
options:
   cookies=cookie.txt - Cookies is a two column file
           containing <cookieName><space><value><newLine>
note: url will need to be in quotes if it contains an ampersand.

================================================================
======== liftOver ====================================
================================================================
liftOver - Move annotations from one assembly to another
usage:
   liftOver oldFile map.chain newFile unMapped
oldFile and newFile are in bed format by default, but can be in GFF and
maybe eventually others with the appropriate flags below.
The map.chain file has the old genome as the target and the new genome
as the query.

***********************************************************************
WARNING: liftOver was only designed to work between different
         assemblies of the same organism, it may not do what you want
         if you are lifting between different organisms.
***********************************************************************

options:
   -minMatch=0.N Minimum ratio of bases that must remap. Default 0.95
   -gff  File is in gff/gtf format.  Note that the gff lines are converted
         separately.
         It would be good to have a separate check after this
         that the lines that make up a gene model still make a plausible gene
         after liftOver
   -genePred - File is in genePred format
   -sample - File is in sample format
   -bedPlus=N - File is bed N+ format
   -positions - File is in browser "position" format
   -hasBin - File has bin value (used only with -bedPlus)
   -tab - Separate by tabs rather than space (used only with -bedPlus)
   -pslT - File is in psl format, map target side only
   -minBlocks=0.N Minimum ratio of alignment blocks or exons that must map
                  (default 1.00)
   -fudgeThick    (bed 12 or 12+ only) If thickStart/thickEnd is not mapped,
                  use the closest mapped base.  Recommended if using
                  -minBlocks.
   -multiple               Allow multiple output regions
   -minChainT, -minChainQ  Minimum chain size in target/query, when mapping
                           to multiple output regions (default 0, 0)
   -minSizeT               deprecated synonym for -minChainT (ENCODE compat.)
   -minSizeQ               Min matching region size in query with -multiple.
   -chainTable             Used with -multiple, format is db.tablename,
                               to extend chains from net (preserves dups)
   -errorHelp              Explain error messages

================================================================
======== liftOverMerge ====================================
================================================================
liftOverMerge - Merge multiple regions in BED 5 files
                   generated by liftOver -multiple
usage:
   liftOverMerge oldFile newFile
options:
   -mergeGap=N    Max size of gap to merge regions (default 0)

================================================================
======== liftUp ====================================
================================================================
liftUp - change coordinates of .psl, .agp, .gap, .gl, .out, .gff, .gtf .bscore
.tab .gdup .axt .chain .net, genePred, .wab, .bed, or .bed8 files to parent
coordinate system.
usage:
   liftUp [-type=.xxx] destFile liftSpec how sourceFile(s)
The optional -type parameter tells what type of files to lift.
If omitted, the type is inferred from the suffix of destFile.
Type is one of the suffixes described above.
DestFile will contain the merged and lifted source files,
with the coordinates translated as per liftSpec.  LiftSpec
is tab-delimited with each line of the form:
   offset oldName oldSize newName newSize
LiftSpec may optionally have a sixth column specifying + or - strand,
but strand is not supported for all input types.

The 'how' parameter controls what the program will do with
items which are not in the liftSpec.  It must be one of:
   carry - Items not in liftSpec are carried to dest without translation
   drop  - Items not in liftSpec are silently dropped from dest
   warn  - Items not in liftSpec are dropped.  A warning is issued
   error - Items not in liftSpec generate an error
If the destination is a .agp file then a 'large inserts' file
also needs to be included in the command line:
   liftUp dest.agp liftSpec how inserts sourceFile(s)
This file describes where large inserts due to heterochromatin
should be added. Use /dev/null and set -gapsize if there is no inserts file.
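The liftSpec line format and the 'how' parameter described above can be illustrated with a minimal Python sketch. It is not liftUp's implementation: the function names parse_lift_spec and lift_interval and the contig names are hypothetical, only the simple case of adding the offset is modelled, and the optional sixth strand column is ignored.

```python
import sys

def parse_lift_spec(lines):
    """Map oldName -> (offset, newName) from tab-delimited liftSpec lines.

    Each line has the form: offset oldName oldSize newName newSize
    (an optional sixth strand column is ignored in this sketch).
    """
    spec = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        offset, old_name, _old_size, new_name, _new_size = fields[:5]
        spec[old_name] = (int(offset), new_name)
    return spec

def lift_interval(spec, name, start, end, how="warn"):
    """Translate (name, start, end) to parent coordinates.

    Returns the lifted tuple, the unchanged tuple for how='carry',
    or None when the item is dropped ('drop' or 'warn').
    """
    if name in spec:
        offset, new_name = spec[name]
        return (new_name, start + offset, end + offset)
    if how == "carry":
        return (name, start, end)
    if how == "drop":
        return None
    if how == "warn":
        print(f"warning: {name} not in liftSpec", file=sys.stderr)
        return None
    if how == "error":
        raise KeyError(f"{name} not in liftSpec")
    raise ValueError(f"unknown how: {how}")

# Hypothetical contig ctgA placed at offset 1000 on chr1.
spec = parse_lift_spec(["1000\tctgA\t500\tchr1\t5000\n"])
print(lift_interval(spec, "ctgA", 10, 20))            # ('chr1', 1010, 1020)
print(lift_interval(spec, "ctgB", 10, 20, "carry"))   # ('ctgB', 10, 20)
```

The sketch keeps only the offset and new name from each liftSpec line, since those are all the plus-strand translation needs; a fuller version would also use oldSize when the strand column is present.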
options:
   -nohead  No header written for .psl files
   -dots=N  Output a dot every N lines processed
   -pslQ  Lift query (rather than target) side of psl
   -axtQ  Lift query (rather than target) side of axt
   -chainQ  Lift query (rather than target) side of chain
   -netQ  Lift query (rather than target) side of net
   -wabaQ  Lift query (rather than target) side of waba alignment
           (waba lifts only work with query side at this time)
   -nosort  Don't sort bed, gff, or gdup files, to save memory
   -gapsize  change contig gapsize from default
   -ignoreVersions - Ignore NCBI-style version number in sequence ids of
           input files
   -extGenePred  lift extended genePred

================================================================
======== mafSpeciesSubset ====================================
================================================================
mafSpeciesSubset - Extract a maf that just has a subset of species.
usage:
   mafSpeciesSubset in.maf species.lst out.maf
Where:
    in.maf is a file where the sequence sources are either simple species
           names, or species.something.  In practice the part before the
           dot is usually a genome database name rather than a species.
    species.lst is a file with a list of species to keep
    out.maf is the output.  It will have columns that are all - or . in
           the reduced species set removed, as well as the lines
           representing species not in species.lst removed.
options:
    -keepFirst - If set, keep the first 'a' line in a maf no matter what.
                 Useful for mafFrag results where we use this for the gene name

================================================================
======== mafsInRegion ====================================
================================================================
mafsInRegion - Extract MAFS in a genomic region
usage:
    mafsInRegion regions.bed out.maf|outDir in.maf(s)
options:
    -outDir - output separate files named by bed name field to outDir
    -keepInitialGaps - keep alignment columns at the beginning and end of
                       a block that are gapped in all species

================================================================
======== nibFrag ====================================
================================================================
nibFrag - Extract part of a nib file as .fa (all bases/gaps lower case by default)
usage:
   nibFrag [options] file.nib start end strand out.fa
where strand is + (plus) or m (minus)
options:
   -masked - use lower case characters for bases meant to be masked out
   -hardMasked - use upper case for not masked-out and 'N' characters for
                 masked-out bases
   -upper - use upper case characters for all bases
   -name=name Use given name after '>' in output sequence
   -dbHeader=db Add full database info to the header, with or without
                -name option
   -tbaHeader=db Format header for compatibility with tba, takes database
                 name as argument

================================================================
======== overlapSelect ====================================
================================================================
wrong # args: overlapSelect [options] selectFile inFile outFile

Select records based on overlapping chromosome ranges.  The ranges are
specified in the selectFile, with each block specifying a range.
Records are copied from the inFile to outFile based on the selection
criteria.  Selection is based on blocks or exons rather than entire
range.
Options starting with -select* apply to selectFile and those starting
with -in* apply to inFile.

Options:
  -selectFmt=fmt - specify selectFile format:
          psl - PSL format (default for *.psl files).
          pslq - PSL format, using query instead of target
          genePred - genePred format (default for *.gp or *.genePred files).
          bed - BED format (default for *.bed files).
                If BED doesn't have blocks, the bed range is used.
          chain - chain file format (default for *.chain files)
          chainq - chain file format, using query instead of target
  -selectCoordCols=spec - selectFile is tab-separated with coordinates
       as described by spec, which is one of:
            o chromCol - chrom in this column followed by start and end.
            o chromCol,startCol,endCol,strandCol,name - chrom, start,
              end, and strand in specified columns. Columns can be omitted
              from the end or left empty to not specify.
          NOTE: column numbers are zero-based
  -selectCds - Use only CDS in the selectFile
  -selectRange - Use entire range instead of blocks from records in
          the selectFile.
  -inFmt=fmt - specify inFile format, same values as -selectFmt.
  -inCoordCols=spec - inFile is tab-separated with coordinates specified by
      spec, in format described above.
  -inCds - Use only CDS in the inFile
  -inRange - Use entire range instead of blocks of records in the inFile.
  -nonOverlapping - select non-overlapping instead of overlapping records
  -strand - must be on the same strand to be considered overlapping
  -oppositeStrand - must be on the opposite strand to be considered
          overlapping
  -excludeSelf - don't compare records with the same coordinates and name.
      Warning: using only one of -inCds or -selectCds will result in
      different coordinates for the same record.
  -idMatch - only select overlapping records if they have the same id
  -aggregate - instead of computing overlap based on individual select
          entries, compute it based on the total number of inFile bases
          overlapped by selectFile records. -overlapSimilarity and
          -mergeOutput will not work with this option.
  -overlapThreshold=0.0 - minimum fraction of an inFile record that
      must be overlapped by a single select record to be considered
      overlapping.  Note that this is only coverage by a single select
      record, not total coverage.
  -overlapThresholdCeil=1.1 - select only inFile records with less than
      this amount of overlap with a single record, provided they are
      selected by other criteria.
  -overlapSimilarity=0.0 - minimum fraction of inFile and select records
      that must overlap.  Note that this is only coverage by a single
      select record, and this is bidirectional: inFile and selectFile must
      overlap by this amount.  A value of 1.0 will select identical records
      (or CDS if both CDS options are specified).  Not currently supported
      with -aggregate.
  -overlapSimilarityCeil=1.1 - select only inFile records with less than
      this amount of similarity with a single record, provided they are
      selected by other criteria.
  -overlapBases=-1 - minimum number of bases of overlap, < 0 disables.
  -statsOutput - output overlap statistics instead of selected records.
      If no overlap criteria are specified, all overlapping entries are
      reported; otherwise only the pairs passing the criteria are
      reported. This results in a tab-separated file with the columns:
         inId selectId inOverlap selectOverlap overBases
      Where inOverlap is the fraction of the inFile record overlapped by
      the selectFile record and selectOverlap is the fraction of the
      select record overlapped by inFile records.  With -aggregate, output
      is:
         inId inOverlap inOverBases inBases
  -statsOutputAll - like -statsOutput, however output all inFile records,
      including those that are not overlapped.
  -statsOutputBoth - like -statsOutput, however output all selectFile and
      inFile records, including those that are not overlapped.
  -mergeOutput - output file will be a merge of the input file with the
      selectFile records that selected it.  The format is
         inRec<tab>selectRec.
      If multiple select records hit, inRec is repeated. This will increase
      the memory required.
      Not supported with -nonOverlapping or -aggregate.
  -idOutput - output a tab-separated file of pairs of
         inId selectId
      with -aggregate, only a single column of inId is written
  -dropped=file - output rows that were dropped to this file.
  -verbose=n - verbose > 1 prints some details.

================================================================
======== paraFetch ====================================
================================================================
paraFetch - try to fetch url with multiple connections
usage:
   paraFetch N R URL outPath
   where N is the number of connections to use
         R is the number of retries

================================================================
======== paraSync ====================================
================================================================
paraSync 1.0
paraSync - uses paraFetch to recursively mirror url to given path
usage:
   paraSync {options} N R URL outPath
   where N is the number of connections to use
         R is the number of retries
Options:
   -A='ext1,ext2'  means accept only files with ext1 or ext2

================================================================
======== pslCDnaFilter ====================================
================================================================
wrong # of args: pslCDnaFilter [options] inPsl outPsl

Filter cDNA alignments in psl format.  Filtering criteria are
comparative, selecting near best in genome alignments for each given
cDNA, and non-comparative, based only on the quality of an individual
alignment.

WARNING: comparative filters require that the input is sorted by
query name.  The command: 'sort -k 10,10' will do the trick.

Each alignment is assigned a score that is based on identity and
weighted towards longer alignments and those with introns.  This can
do either global or local best-in-genome selection.  Local near best
in genome keeps fragments of an mRNA that align in discontinuous
locations from other fragments.  It is useful for unfinished genomes.
Global near best in genome keeps alignments based on overall score.

Options:
   -algoHelp - print message describing the filtering algorithm.
   -localNearBest=-1.0 - local near best in genome filtering,
    keeping alignments within this fraction of the top score for
    each aligned portion of the mRNA. A value of zero keeps only
    the best for each fragment. A value of -1.0 disables (default).
   -globalNearBest=-1.0 - global near best in genome filtering,
    keeping alignments within this fraction of the top score.  A value
    of zero keeps only the best alignment.  A value of -1.0 disables
    (default).
   -ignoreNs - don't include Ns (repeat masked) while calculating the
    score and coverage. That is, treat them as unaligned rather than
    mismatches.  Ns are still counted as mismatches when calculating
    the identity.
   -ignoreIntrons - don't favor apparent introns when scoring.
   -minId=0.0 - only keep alignments with at least this fraction
    identity.
   -minCover=0.0 - minimum fraction of query that must be aligned.
    If -polyASizes is specified and the query is in the file, the
    poly-A is not included in coverage calculation.
   -minSpan=0.0 - keep only alignments whose target lengths are at
    least this fraction of the longest alignment passing the other
    filters.  This can be useful for removing possible retroposed genes.
   -minQSize=0 - drop queries shorter than this size
   -minAlnSize=0 - minimum number of aligned bases.  This includes
    repeats, but excludes poly-A/poly-T bases if available.
   -minNonRepSize=0 - Minimum number of matching bases that are not
    repeats.  This does not include mismatches.
    Must use -repeats on BLAT if doing unmasked alignments.
   -maxRepMatch=1.0 - Maximum fraction of matching bases
    that are repeats.  Must use -repeats on BLAT if doing
    unmasked alignments.
   -maxAligns=-1 - maximum number of alignments for a given query. If
    exceeded, then alignments are sorted by score and only this number
    will be saved.
    A value of -1 disables (default)
   -polyASizes=file - tab-separated file with information about poly-A
    tails and poly-T heads.  The format is that output by faPolyASizes:
        id seqSize tailPolyASize headPolyTSize
   -usePolyTHead - if a poly-T head was detected and is longer than the
    poly-A tail, it is used when calculating coverage instead of the
    poly-A tail.
   -bestOverlap - filter overlapping alignments, keeping the best of
    alignments that are similar.  This is designed to be used with
    overlapping, windowed alignments, where one alignment might be
    truncated.  Does not discard ones with weird overlap unless
    -filterWeirdOverlapped is specified.
   -hapRegions=psl - PSL format alignments of each haplotype
    pseudo-chromosome to the corresponding reference chromosome region.
    This is used to map alignments between regions.
   -dropped=psl - save psls that were dropped to this file.
   -weirdOverlapped=psl - output weirdly overlapping PSLs to
    this file.
   -filterWeirdOverlapped - Filter weirdly overlapped alignments, keeping
    the single highest scoring one or an arbitrary one if multiple with
    the same high score.
   -alignStats=file - output the per-alignment statistics to this file
   -hapRefMapped=psl - output PSLs of haplotype to reference chromosome
    cDNA alignments mappings (for debugging purposes).
   -hapRefCDnaAlns=psl - output PSLs of haplotype cDNA to reference cDNA
    alignments (for debugging purposes).
   -alnIdQNameMode - add internal assigned alignment numbers to cDNA names
    on output.  Useful for debugging, as they are included in the verbose
    tracing as [#1], etc.  Will make a mess of normal production usage.
   -noValidate - don't run pslCheck validation.
   -verbose=1 - 0: quiet
                1: output stats
                2: list problem alignment (weird or invalid)
                3: list dropped alignments and reason for dropping
                4: list kept psl and info
                5: info about all PSLs

The default options don't do any filtering. If no filtering criteria are
specified, all PSLs will be passed through, except those that are
internally inconsistent.
THE INPUT MUST BE SORTED BY QUERY for the comparative filters.

================================================================
======== pslPretty ====================================
================================================================
pslPretty - Convert PSL to human readable output
usage:
   pslPretty in.psl target.lst query.lst pretty.out
options:
   -axt - save in something like Scott Schwartz's axt format
          Note gaps in both sequences are still allowed in the
          output which not all axt readers will expect
   -dot=N Put out a dot every N records
   -long - Don't abbreviate long inserts
   -check=fileName - Output alignment checks to filename
It's a really good idea if the psl file is sorted by target
if it contains multiple targets.  Otherwise this will be
very very slow.   The target and query lists can either be
fasta, 2bit or nib files, or a list of fasta, 2bit and/or nib files
one per line

================================================================
======== pslReps ====================================
================================================================
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
    pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
options:
    -nohead don't add PSL header
    -ignoreSize Will not weigh in favor of larger alignments so much
    -noIntrons Will not penalize for not having introns when calculating
              size factor
    -singleHit  Takes single best hit, not splitting into parts
    -minCover=0.N minimum coverage to output.  Default is 0.
    -ignoreNs Ignore 'N's when calculating minCover.
    -minAli=0.N minimum alignment ratio
               default is 0.93
    -nearTop=0.N how much can deviate from top and be taken
               default is 0.01
    -minNearTopSize=N  Minimum size of alignment that is near top
               for alignment to be kept.  Default 30.
    -coverQSizes=file Tab-separated file with effective query sizes.
               When used with -minCover, this allows polyAs
               to be excluded from the coverage calculation

================================================================
======== pslSort ====================================
================================================================
pslSort - merge and sort psCluster .psl output files
usage:
  pslSort dirs[1|2] outFile tempDir inDir(s)
This will sort all of the .psl files in the directories
inDirs in two stages - first into temporary files in tempDir
and second into outFile.  The device on tempDir needs to have
enough space (typically 15-20 gigabytes if processing whole genome)
  pslSort g2g[1|2] outFile tempDir inDir(s)
This will sort a genome to genome alignment, reflecting the
alignments across the diagonal.
Adding 1 or 2 after the dirs or g2g will limit the program to only
the first or second pass respectively of the sort
Options:
   -nohead - do not write psl header.
   -verbose=N Set verbosity level, higher for more output. Default 1

================================================================
======== sizeof ====================================
================================================================
type            bytes   bits
char            1       8
unsigned char   1       8
short int       2       16
u short int     2       16
int             4       32
unsigned        4       32
long            8       64
unsigned long   8       64
long long       8       64
u long long     8       64
size_t          8       64
void *          8       64
float           4       32
double          8       64
long double     16      128
LITTLE ENDIAN machine detected
byte order: normal order: 0x12345678 in memory: 0x78563412

================================================================
======== stringify ====================================
================================================================
stringify - Convert file to C strings
usage:
   stringify [options] in.txt
A stringified version of in.txt will be printed to standard output.

Options:
  -var=varname - create a variable with the specified name containing
                 the string.
  -static - create the variable as a string array.

================================================================
======== textHistogram ====================================
================================================================
textHistogram - Make a histogram in ascii
usage:
   textHistogram [options] inFile
Where inFile contains one number per line.
options:
   -binSize=N - Size of bins, default 1
   -maxBinCount=N - Maximum # of bins, default 25
   -minVal=N - Minimum value to put in histogram, default 0
   -log - Do log transformation before plotting
   -noStar - Don't draw asterisks
   -col=N - Which column to use. Default 1
   -aveCol=N - A second column to average over. The averages
             will be output in place of counts of primary column.
   -real - Data input are real values (default is integer)
   -autoScale=N - autoscale to N # of bins
   -probValues - show prob-Values (density and cum.distr.) (sets -noStar too)
   -freq - show frequencies instead of counts
   -skip=N - skip N lines before starting, default 0

================================================================
======== twoBitInfo ====================================
================================================================
twoBitInfo - get information about sequences in a .2bit file
usage:
   twoBitInfo input.2bit output.tab
options:
   -nBed   instead of seq sizes, output BED records that define
           areas with N's in sequence
   -noNs   outputs the length of each sequence, but does not count Ns
Output file has the columns:
   seqName size

The 2bit file may be specified in the form path:seq or path:seq1,seq2,seqN...
so that information is returned only on the requested sequence(s).
If the form path:seq:start-end is used, start-end is ignored.
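The path:seq forms accepted above can be sketched in Python. The helper split_two_bit_spec and the file name hg19.2bit are hypothetical and not part of twoBitInfo; the sketch assumes the path itself contains no colon, and it discards any start-end range, matching the statement that twoBitInfo ignores it.

```python
def split_two_bit_spec(spec):
    """Split 'path[:seq1,seq2,...[:start-end]]' into (path, [seq names]).

    Assumption: the path contains no ':' character.
    """
    parts = spec.split(":")
    path = parts[0]
    if len(parts) == 1:
        return path, []          # no restriction: all sequences
    seqs = [s for s in parts[1].split(",") if s]
    # parts[2], if present, is a start-end range; it is ignored here.
    return path, seqs

print(split_two_bit_spec("hg19.2bit"))              # ('hg19.2bit', [])
print(split_two_bit_spec("hg19.2bit:chr1,chr2"))    # ('hg19.2bit', ['chr1', 'chr2'])
print(split_two_bit_spec("hg19.2bit:chrM:0-100"))   # ('hg19.2bit', ['chrM'])
```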
================================================================
======== twoBitToFa ====================================
================================================================
twoBitToFa - Convert all or part of .2bit file to fasta
usage:
   twoBitToFa input.2bit output.fa
options:
   -seq=name - restrict this to just one sequence
   -start=X  - start at given position in sequence (zero-based)
   -end=X - end at given position in sequence (non-inclusive)
   -seqList=file - file containing list of the desired sequence names
                    in the format seqSpec[:start-end], e.g. chr1 or chr1:0-189
                    where coordinates are half-open zero-based, i.e. [start,end)
   -noMask - convert sequence to all upper case
   -bpt=index.bpt - use bpt index instead of built in one
   -bed=input.bed - grab sequences specified by input.bed. Will exclude introns

Sequence and range may also be specified as part of the input
file name using the syntax:
      /path/input.2bit:name
   or
      /path/input.2bit:name:start-end

================================================================
======== validateFiles ====================================
================================================================
validateFiles - Validate format of different track input files
                Program exits with non-zero status if any errors detected
                  otherwise exits with zero status
                Use filename 'stdin' to read from stdin
                Files can be in .gz, .bz2, .zip, .Z format and are
                  automatically decompressed
                Multiple input files of the same type can be listed
                Error messages are written to stderr
                OK or failing file lines can be optionally written to stdout
usage:
   validateFiles -type=FILE_TYPE file1 [file2 [...]]
options:
   -type=(a value from the list below)
         tagAlign|pairedTagAlign|broadPeak|narrowPeak|gappedPeak|bedGraph
                    : see http://genomewiki.cse.ucsc.edu/EncodeDCC/index.php/File_Formats
         fasta      : Fasta files (only one line of sequence, and no quality scores)
         fastq      : Fasta with quality scores (see http://maq.sourceforge.net/fastq.shtml)
         csfasta
                    : Colorspace fasta (implies -colorSpace) (see link below)
         csqual     : Colorspace quality (see link below)
                      (see http://marketing.appliedbiosystems.com/mk/submit/SOLID_KNOWLEDGE_RD?_JS=T&rd=dm)
         BAM        : Binary Alignment/Map
                      (see http://samtools.sourceforge.net/SAM1.pdf)
         bigWig     : Big Wig
                      (see http://genome.ucsc.edu/goldenPath/help/bigWig.html)
   -chromDb=db                  Specify DB containing chromInfo table to validate chrom names
                                  and sizes
   -chromInfo=file.txt          Specify chromInfo file to validate chrom names and sizes
   -colorSpace                  Sequences include colorspace values [0-3] (can be used
                                  with formats such as tagAlign and pairedTagAlign)
   -zeroSizeOk                  For BED-type positional data, allow rows with start==end
                                  otherwise require strictly start < end
   -genome=path/to/hg18.2bit    Validate tagAlign or pairedTagAlign sequences match genome
                                  in .2bit file
   -mismatches=n                Maximum number of mismatches in sequence (or read pair) if
                                  validating tagAlign or pairedTagAlign files
   -mismatchTotalQuality=n      Maximum total quality score at mismatching positions
   -matchFirst=n                only check the first N bases of the sequence
   -mmPerPair                   Check that neither member of the pair exceeds the mismatch
                                  count if validating pairedTagAlign files
                                  (default is the total for the pair)
   -mmCheckOneInN=n             Check mismatches in only one in 'n' lines (default=1, all)
   -nMatch                      N's do not count as a mismatch
   -privateData                 Private data so empty sequence is tolerated
   -printOkLines                Print lines which pass validation to stdout
   -quick[=N]                   Just test the first N lines of each file (default 1000)
   -printFailLines              Print lines which fail validation to stdout
   -isSort                      input is sorted by chrom
   -version                     Print version
   -allowOther                  allow chromosomes that aren't native in BAM
   -allowBadLength              allow chromosomes that have the wrong length in BAM
   -complementMinus             complement the query sequence on the minus strand (for testing BAM)
   -doReport                    output report in filename.report
   -showBadAlign                show non-compliant alignments
   -bamPercent=N.N              percentage of BAM alignments that must be compliant
   -allowErrors=N
                                number of errors allowed to still pass (default 0)
   -maxErrors=N                 Maximum lines with errors to report in one file before
                                  stopping (default 10)

================================================================
======== wigCorrelate ====================================
================================================================
wigCorrelate - Produce a table that correlates all pairs of wigs.
usage:
   wigCorrelate one.wig two.wig ... n.wig
This works on bigWig as well as wig files.
The output is to stdout
options:
   -clampMax=N - values larger than this are clipped to this value

================================================================
======== wigToBigWig ====================================
================================================================
wigToBigWig v 4 - Convert ascii format wig file (in fixedStep, variableStep
or bedGraph format) to binary big wig format.
usage:
   wigToBigWig in.wig chrom.sizes out.bw
Where in.wig is in one of the ascii wiggle formats, but not including track lines
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bw is the output indexed big wig file.
options:
   -blockSize=N - Number of items to bundle in r-tree.  Default 256
   -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
   -clip - If set just issue warning messages rather than dying if wig
                  file contains items off end of chromosome.
   -unc - If set, do not use compression.

================================================================
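The two-column chrom.sizes format and the "items off end of chromosome" condition mentioned for -clip can be illustrated with a short Python sketch. This is only a pre-check one might run on bedGraph-style input, not how wigToBigWig is implemented; the function names read_chrom_sizes and off_end_items and the sample data are hypothetical.

```python
def read_chrom_sizes(lines):
    """Parse two-column '<chromosome name> <size in bases>' lines."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                     # skip blank or malformed lines
        sizes[fields[0]] = int(fields[1])
    return sizes

def off_end_items(bedgraph_lines, sizes):
    """Return (chrom, start, end) for bedGraph records past the chromosome end.

    Only four-column 'chrom start end value' lines are examined, and only
    chromosomes listed in sizes are checked.
    """
    bad = []
    for line in bedgraph_lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        size = sizes.get(chrom)
        if size is not None and end > size:
            bad.append((chrom, start, end))
    return bad

sizes = read_chrom_sizes(["chr1\t1000\n"])
items = ["chr1\t0\t100\t1.5\n", "chr1\t900\t1100\t2.0\n"]
print(off_end_items(items, sizes))   # [('chr1', 900, 1100)]
```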