This directory contains applications for stand-alone use,
built specifically for a Linux 64-bit machine.
For help on the bigBed and bigWig applications see:
http://genome.ucsc.edu/goldenPath/help/bigBed.html
http://genome.ucsc.edu/goldenPath/help/bigWig.html
View the file 'FOOTER' to see the usage statement for
each of the applications.
Name Last modified Size Description
FOOTER 05-Apr-2011 17:00 61K
bedClip 05-Apr-2011 16:59 235K
bedExtendRanges 05-Apr-2011 16:59 2.6M
bedGraphToBigWig 05-Apr-2011 16:59 243K
bedItemOverlapCount 05-Apr-2011 16:59 2.6M
bedSort 05-Apr-2011 16:59 203K
bedToBigBed 05-Apr-2011 16:59 314K
bigBedInfo 05-Apr-2011 16:59 248K
bigBedSummary 05-Apr-2011 16:59 248K
bigBedToBed 05-Apr-2011 16:59 247K
bigWigInfo 05-Apr-2011 16:59 239K
bigWigSummary 05-Apr-2011 16:59 238K
bigWigToBedGraph 05-Apr-2011 16:59 238K
bigWigToWig 05-Apr-2011 16:59 238K
blat/ 06-Apr-2011 15:45 -
faCount 05-Apr-2011 16:59 160K
faFrag 05-Apr-2011 16:59 157K
faOneRecord 05-Apr-2011 16:59 135K
faPolyASizes 05-Apr-2011 16:59 157K
faRandomize 05-Apr-2011 16:59 157K
faSize 05-Apr-2011 16:59 160K
faSomeRecords 05-Apr-2011 16:59 139K
faToNib 05-Apr-2011 16:59 163K
faToTwoBit 05-Apr-2011 16:59 245K
fetchChromSizes 05-Apr-2011 16:59 2.6K
genePredToGtf 05-Apr-2011 16:59 2.6M
gff3ToGenePred 05-Apr-2011 17:00 2.7M
gtfToGenePred 05-Apr-2011 17:00 2.6M
hgWiggle 05-Apr-2011 17:00 2.7M
htmlCheck 05-Apr-2011 16:59 227K
liftOver 05-Apr-2011 16:59 2.6M
liftOverMerge 05-Apr-2011 16:59 207K
liftUp 05-Apr-2011 16:59 2.7M
mafSpeciesSubset 05-Apr-2011 16:59 159K
mafsInRegion 05-Apr-2011 16:59 219K
nibFrag 05-Apr-2011 16:59 165K
overlapSelect 05-Apr-2011 17:00 2.7M
paraFetch 05-Apr-2011 16:59 202K
paraSync 05-Apr-2011 16:59 202K
pslCDnaFilter 05-Apr-2011 16:59 220K
pslPretty 05-Apr-2011 16:59 1.2M
pslReps 05-Apr-2011 16:59 718K
pslSort 05-Apr-2011 16:59 719K
sizeof 05-Apr-2011 16:59 5.3K
stringify 05-Apr-2011 16:59 139K
textHistogram 05-Apr-2011 16:59 146K
twoBitInfo 05-Apr-2011 16:59 238K
twoBitToFa 05-Apr-2011 16:59 305K
validateFiles 05-Apr-2011 16:59 2.7M
wigCorrelate 05-Apr-2011 16:59 259K
wigToBigWig 05-Apr-2011 16:59 889K
================================================================
======== bedClip ====================================
================================================================
bedClip - Remove lines from bed file that refer to off-chromosome places.
usage:
bedClip input.bed chrom.sizes output.bed
options:
-verbose=2 - set to get list of lines clipped and why
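As a rough illustration of the filtering described above, the following Python sketch keeps only the BED lines whose interval lies inside the bounds given in chrom.sizes. It is not bedClip's actual implementation: the function name clip_bed_lines and the exact rejection rules (unknown chromosome, negative start, end past the chromosome size, start not below end) are assumptions made for illustration.

```python
def clip_bed_lines(bed_lines, chrom_sizes):
    """Return the BED lines that fall entirely inside a known chromosome.

    bed_lines   -- iterable of whitespace-separated BED lines (3+ columns)
    chrom_sizes -- dict mapping chromosome name to its size in bases
    """
    kept = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # assumption: blank or malformed lines are skipped
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        size = chrom_sizes.get(chrom)
        # Assumed rule: the chromosome must be known and the 0-based
        # half-open interval must satisfy 0 <= start < end <= size.
        if size is not None and 0 <= start < end <= size:
            kept.append(line)
    return kept
```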
================================================================
======== bedExtendRanges ====================================
================================================================
bedExtendRanges - extend length of entries in bed 6+ data to be at least the given length,
taking strand directionality into account.
usage:
bedExtendRanges database length file(s)
options:
-host mysql host
-user mysql user
-password mysql password
-tab Separate by tabs rather than space
-verbose=N - verbose level for extra information to STDERR
example:
bedExtendRanges hg18 250 stdin
bedExtendRanges -user=genome -host=genome-mysql.cse.ucsc.edu hg18 250 stdin
will transform:
chr1 500 525 . 100 +
chr1 1000 1025 . 100 -
to:
chr1 500 750 . 100 +
chr1 775 1025 . 100 -
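The transformation in the example above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming a hypothetical helper named extend_range; it clamps the start at 0 but does not clamp the end at the chromosome size, which the real program presumably handles using the sizes it looks up in the database.

```python
def extend_range(start, end, strand, length):
    """Extend a BED interval to at least `length` bases, respecting strand."""
    if end - start >= length:
        return start, end  # already long enough: leave unchanged
    if strand == "-":
        # Minus strand: keep the end fixed and move the start back,
        # clamped at position 0 (an assumption of this sketch).
        return max(0, end - length), end
    # Plus strand: keep the start fixed and move the end forward.
    return start, start + length


# The two records from the example, with length 250.
print(extend_range(500, 525, "+", 250))
print(extend_range(1000, 1025, "-", 250))
```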
================================================================
======== bedGraphToBigWig ====================================
================================================================
bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig.
usage:
bedGraphToBigWig in.bedGraph chrom.sizes out.bw
where in.bedGraph is a four column file in the format:
<chrom> <start> <end> <value>
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bw is the output indexed big wig file.
The input bedGraph file must be sorted by chromosome and then numerically by start; use the unix sort command:
sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph
options:
-blockSize=N - Number of items to bundle in r-tree. Default 256
-itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
-unc - If set, do not use compression.
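Before converting, it can be useful to confirm that a bedGraph file really is ordered by chromosome name and then by numeric start. The sketch below is one way to check this in Python; the function name is_sorted_bedgraph is hypothetical, and the comparison of chromosome names is by code point, which is assumed to match what sort produces in the C locale.

```python
def is_sorted_bedgraph(lines):
    """Return True if lines are ordered by chromosome, then numeric start."""
    previous = None
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # assumption: skip blank or short lines
        # Compare the start as an integer so that 20 sorts before 100.
        key = (fields[0], int(fields[1]))
        if previous is not None and key < previous:
            return False
        previous = key
    return True
```

Comparing the start column as text instead of as a number would put 100 before 20, which is the usual symptom of a missing numeric modifier in the sort key.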
================================================================
======== bedItemOverlapCount ====================================
================================================================
bedItemOverlapCount - count number of times a base is overlapped by the
items in a bed file. Output is bedGraph 4 to stdout.
usage:
sort bedFile.bed | bedItemOverlapCount [options] <database> stdin
To create a bigWig file from this data to use in a custom track:
sort bedFile.bed | bedItemOverlapCount [options] <database> stdin \
> bedFile.bedGraph
bedGraphToBigWig bedFile.bedGraph chrom.sizes bedFile.bw
where the chrom.sizes is obtained with the script: fetchChromSizes
See also:
http://genome-test.cse.ucsc.edu/~kent/src/unzipped/utils/userApps/fetchChromSizes
options:
-zero add blocks with zero count, normally these are omitted
-bed12 expect bed12 and count based on blocks
Without this option, only the first three fields are used.
-max if counts per base overflows set to max (4294967295) instead of exiting
-outBounds output min/max to stderr
-chromSize=sizefile Read chrom sizes from file instead of database
sizefile contains two white space separated fields per line:
chrom name and size
-host=hostname mysql host used to get chrom sizes
-user=username mysql user
-password=password mysql password
Notes:
* You may want to separate your + and - strand
items before sending into this program as it only looks at
the chrom, start and end columns of the bed file.
* Program requires a <database> connection to lookup chrom sizes for a sanity
check of the incoming data. Even when the -chromSize argument is used
the <database> must be present, but it will not be used.
* The bed file *must* be sorted by chrom
* Maximum count per base is 4294967295. Recompile with new unitSize to increase this
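The per-base counting described above can be sketched with a simple sweep over item boundaries. This is an illustration for a single chromosome only, not the program's own code: the function name overlap_counts is hypothetical, zero-count runs are omitted (as without -zero), and adjacent runs with equal depth are merged, which is an assumption about the output layout.

```python
def overlap_counts(intervals):
    """Return bedGraph-style (start, end, count) runs for one chromosome.

    intervals -- iterable of (start, end) pairs, 0-based half-open
    """
    # Record +1 where an item begins and -1 where it ends.
    deltas = {}
    for start, end in intervals:
        deltas[start] = deltas.get(start, 0) + 1
        deltas[end] = deltas.get(end, 0) - 1

    runs = []
    depth = 0
    previous = None
    for position in sorted(deltas):
        if previous is not None and depth > 0:
            if runs and runs[-1][1] == previous and runs[-1][2] == depth:
                # Same depth as the adjacent run before it: extend that run.
                runs[-1] = (runs[-1][0], position, depth)
            else:
                runs.append((previous, position, depth))
        depth += deltas[position]
        previous = position
    return runs
```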
================================================================
======== bedSort ====================================
================================================================
bedSort - Sort a .bed file by chrom,chromStart
usage:
bedSort in.bed out.bed
in.bed and out.bed may be the same.
================================================================
======== bedToBigBed ====================================
================================================================
bedToBigBed v. 4 - Convert bed file to bigBed.
usage:
bedToBigBed in.bed chrom.sizes out.bb
Where in.bed is in one of the ascii bed formats, but not including track lines
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bb is the output indexed big bed file.
The in.bed file must be sorted by chromosome,start,
to sort a bed file, use the unix sort command:
sort -k1,1 -k2,2n unsorted.bed > sorted.bed
options:
-blockSize=N - Number of items to bundle in r-tree. Default 256
-itemsPerSlot=N - Number of data points bundled at lowest level. Default 512
-bedFields=N - Number of fields that fit standard bed definition. If undefined
assumes all fields in bed are defined.
-as=fields.as - If the file has non-standard fields, put a definition
of each field, one per row, in AutoSql format here.
-unc - If set, do not use compression.
================================================================
======== bigBedInfo ====================================
================================================================
bigBedInfo - Show information about a bigBed file.
usage:
bigBedInfo file.bb
options:
-udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
-chroms - list all chromosomes and their sizes
-zooms - list all zoom levels and their sizes
-as - get autoSql spec
================================================================
======== bigBedSummary ====================================
================================================================
bigBedSummary - Extract summary information from a bigBed file.
usage:
bigBedSummary file.bb chrom start end dataPoints
Get summary data from bigBed for indicated region, broken into
dataPoints equal parts. (Use dataPoints=1 for simple summary.)
options:
-type=X where X is one of:
coverage - % of region that is covered (default)
mean - average depth of covered regions
min - minimum depth of covered regions
max - maximum depth of covered regions
-fields - print out information on fields in file.
If fields option is used, the chrom, start, end, dataPoints
parameters may be omitted
-udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
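To make the idea of a region "broken into dataPoints equal parts" concrete, here is a small Python sketch that computes coverage per part from a plain list of intervals. It is an assumption-laden illustration: the name coverage_summary is hypothetical, values are returned as fractions between 0 and 1, and the real program works from the indexed bigBed file, so its numbers may differ.

```python
def coverage_summary(items, start, end, data_points):
    """Fraction of each equal-width part covered by at least one item.

    items -- list of (item_start, item_end) pairs, 0-based half-open
    """
    width = end - start
    results = []
    for i in range(data_points):
        bin_start = start + (width * i) // data_points
        bin_end = start + (width * (i + 1)) // data_points
        covered = 0
        cursor = bin_start  # bases before the cursor are already counted
        for item_start, item_end in sorted(items):
            s = max(item_start, cursor)
            e = min(item_end, bin_end)
            if e > s:
                covered += e - s
                cursor = e
        size = bin_end - bin_start
        results.append(covered / size if size > 0 else 0.0)
    return results
```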
================================================================
======== bigBedToBed ====================================
================================================================
bigBedToBed - Convert from bigBed to ascii bed format.
usage:
bigBedToBed input.bb output.bed
options:
-chrom=chr1 - if set restrict output to given chromosome
-start=N - if set, restrict output to only that over start
-end=N - if set, restrict output to only that under end
-maxItems=N - if set, restrict output to first N items
-udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
================================================================
======== bigWigInfo ====================================
================================================================
bigWigInfo - Print out information about bigWig file.
usage:
bigWigInfo file.bw
options:
-udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
-chroms - list all chromosomes and their sizes
-zooms - list all zoom levels and their sizes
-minMax - list the min and max on a single line
================================================================
======== bigWigSummary ====================================
================================================================
bigWigSummary - Extract summary information from a bigWig file.
usage:
bigWigSummary file.bigWig chrom start end dataPoints
Get summary data from bigWig for indicated region, broken into
dataPoints equal parts. (Use dataPoints=1 for simple summary.)
NOTE: start and end coordinates are in BED format (0-based)
options:
-type=X where X is one of:
mean - average value in region (default)
min - minimum value in region
max - maximum value in region
std - standard deviation in region
coverage - % of region that is covered
-udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
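Because start and end are BED-style (0-based start, end not included), a browser-style position such as chr21:1001-2000 (1-based, end included) has to be shifted before it is passed on the command line. The helper below is a minimal sketch of that conversion; the name position_to_bed is hypothetical.

```python
def position_to_bed(position):
    """Convert 'chrom:first-last' (1-based, inclusive) to BED coordinates."""
    chrom, _, span = position.rpartition(":")
    first, _, last = span.replace(",", "").partition("-")
    # Only the start shifts: 1-based inclusive [first, last] covers the
    # same bases as 0-based half-open [first - 1, last).
    return chrom, int(first) - 1, int(last)


chrom, start, end = position_to_bed("chr21:1,001-2,000")
print(chrom, start, end)
```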
================================================================
======== bigWigToBedGraph ====================================
================================================================
bigWigToBedGraph - Convert from bigWig to bedGraph format.
usage:
bigWigToBedGraph in.bigWig out.bedGraph
options:
-chrom=chr1 - if set restrict output to given chromosome
-start=N - if set, restrict output to only that over start
-end=N - if set, restrict output to only that under end
-udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
================================================================
======== bigWigToWig ====================================
================================================================
bigWigToWig - Convert bigWig to wig. This will keep more of the same structure of the
original wig than bigWigToBedGraph does, but still will break up large stepped sections
into smaller ones.
usage:
bigWigToWig in.bigWig out.wig
options:
-chrom=chr1 - if set restrict output to given chromosome
-start=N - if set, restrict output to only that over start
-end=N - if set, restrict output to only that under end
-udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
================================================================
======== blat ====================================
================================================================
blat - Standalone BLAT v. 34x10 fast sequence search command line tool
usage:
blat database query [-ooc=11.ooc] output.psl
where:
database and query are each either a .fa , .nib or .2bit file,
or a list of these files, one file name per line.
-ooc=11.ooc tells the program to load over-occurring 11-mers from
an external file. This will increase the speed
by a factor of 40 in many cases, but is not required
output.psl is where to put the output.
Subranges of .nib and .2bit files may be specified using the syntax:
/path/file.nib:seqid:start-end
or
/path/file.2bit:seqid:start-end
or
/path/file.nib:start-end
With the last form (no seqid), a sequence id of file:start-end will be used.
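The subrange syntax can be taken apart as shown in this Python sketch. It only illustrates the notation and is not how blat parses its arguments; the function name parse_subrange is hypothetical, and it assumes that the file path itself contains no colon.

```python
def parse_subrange(spec):
    """Split 'file:seqid:start-end' or 'file:start-end' into its parts.

    Returns (path, seqid, start, end); seqid is None when it is not given.
    """
    parts = spec.split(":")
    if len(parts) == 3:
        path, seqid, span = parts
    elif len(parts) == 2:
        path, span = parts
        seqid = None
    else:
        raise ValueError("expected file:seqid:start-end or file:start-end")
    start, end = span.split("-")
    return path, seqid, int(start), int(end)
```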
options:
-t=type Database type. Type is one of:
dna - DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
The default is dna
-q=type Query type. Type is one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
The default is dna
-prot Synonymous with -t=prot -q=prot
-ooc=N.ooc Use overused tile file N.ooc. N should correspond to
the tileSize
-tileSize=N sets the size of match that triggers an alignment.
Usually between 8 and 12
Default is 11 for DNA and 5 for protein.
-stepSize=N spacing between tiles. Default is tileSize.
-oneOff=N If set to 1 this allows one mismatch in tile and still
triggers an alignment. Default is 0.
-minMatch=N sets the number of tile matches. Usually set from 2 to 4
Default is 2 for nucleotide, 1 for protein.
-minScore=N sets minimum score. This is the matches minus the
mismatches minus some sort of gap penalty. Default is 30
-minIdentity=N Sets minimum sequence identity (in percent). Default is
90 for nucleotide searches, 25 for protein or translated
protein searches.
-maxGap=N sets the size of maximum gap between tiles in a clump. Usually
set from 0 to 3. Default is 2. Only relevant for minMatch > 1.
-noHead suppress .psl header (so it's just a tab-separated file)
-makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
-repMatch=N sets the number of repetitions of a tile allowed before
it is marked as overused. Typically this is 256 for tileSize
12, 1024 for tile size 11, 4096 for tile size 10.
Default is 1024. Typically only comes into play with makeOoc.
Also affected by stepSize. When stepSize is halved repMatch is
doubled to compensate.
-mask=type Mask out repeats. Alignments won't be started in masked region
but may extend through it in nucleotide searches. Masked areas
are ignored entirely in protein or translated searches. Types are
lower - mask out lower cased sequence
upper - mask out upper cased sequence
out - mask according to database.out RepeatMasker .out file
file.out - mask database according to RepeatMasker file.out
-qMask=type Mask out repeats in query sequence. Similar to -mask above but
for query rather than target sequence.
-repeats=type Type is same as mask types above. Repeat bases will not be
masked in any way, but matches in repeat areas will be reported
separately from matches in other areas in the psl output.
-minRepDivergence=NN - minimum percent divergence of repeats to allow
them to be unmasked. Default is 15. Only relevant for
masking using RepeatMasker .out files.
-dots=N Output dot every N sequences to show program's progress
-trimT Trim leading poly-T
-noTrimA Don't trim trailing poly-A
-trimHardA Remove poly-A tail from qSize as well as alignments in
psl output
-fastMap Run for fast DNA/DNA remapping - not allowing introns,
requiring high %ID. Query sizes must not exceed 5000.
-out=type Controls output file format. Type is one of:
psl - Default. Tab separated format, no sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8- NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-fine For high quality mRNAs look harder for small initial and
terminal exons. Not recommended for ESTs
-maxIntron=N Sets maximum intron size. Default is 750000
-extendThroughN - Allows extension of alignment through large blocks of N's
================================================================
======== faCount ====================================
================================================================
faCount - count base statistics and CpGs in FA files.
usage:
faCount file(s).fa
-summary show only summary statistics
-dinuc include statistics on dinucleotide frequencies
-strands count bases on both strands
================================================================
======== faFrag ====================================
================================================================
faFrag - Extract a piece of DNA from a .fa file.
usage:
faFrag in.fa start end out.fa
options:
-mixed - preserve mixed-case in FASTA file
================================================================
======== faOneRecord ====================================
================================================================
faOneRecord - Extract a single record from a .FA file
usage:
faOneRecord in.fa recordName
================================================================
======== faPolyASizes ====================================
================================================================
faPolyASizes - get poly A sizes
usage:
faPolyASizes in.fa out.tab
output file has four columns:
id seqSize tailPolyASize headPolyTSize
options:
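The four output columns can be illustrated with a short Python sketch that measures an exact run of A at the end of a sequence and an exact run of T at its beginning. The function name poly_a_sizes is hypothetical, and counting only uninterrupted runs is an assumption; the real program may tolerate occasional other bases inside a tail.

```python
def poly_a_sizes(seq):
    """Return (seqSize, tailPolyASize, headPolyTSize) for one sequence."""
    s = seq.upper()
    # Length removed by stripping trailing A's is the poly-A tail size.
    tail_a = len(s) - len(s.rstrip("A"))
    # Length removed by stripping leading T's is the poly-T head size.
    head_t = len(s) - len(s.lstrip("T"))
    return len(s), tail_a, head_t
```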
================================================================
======== faRandomize ====================================
================================================================
faRandomize - Program to create random fasta records using
same base frequency as seen in original fasta records.
Use optional -seed flag to specify seed for random number
generator.
usage:
faRandomize in.fa randomized.fa
================================================================
======== faSize ====================================
================================================================
faSize - print total base count in fa files.
usage:
faSize file(s).fa
Command flags
-detailed outputs name and size of each record
has the side effect of printing nothing else
-tab output statistics in a tab separated format
================================================================
======== faSomeRecords ====================================
================================================================
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.
================================================================
======== faToNib ====================================
================================================================
faToNib - Convert from .fa to .nib format
usage:
faToNib [options] in.fa out.nib
options:
-softMask - create nib that soft-masks lower case sequence
-hardMask - create nib that hard-masks lower case sequence
================================================================
======== faToTwoBit ====================================
================================================================
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
-noMask - Ignore lower-case masking in fa file.
-stripVersion - Strip off version number after . for genbank accessions.
-ignoreDups - only convert first sequence if there are duplicates
================================================================
======== fetchChromSizes ====================================
================================================================
usage: fetchChromSizes <db> > <db>.chrom.sizes
used to fetch chrom.sizes information from UCSC for the given <db>
<db> - name of UCSC database, e.g.: hg18, mm9, etc ...
This script expects to find one of the following commands:
wget, mysql, or ftp in order to fetch information from UCSC.
Route the output to the file <db>.chrom.sizes as indicated above.
Example: fetchChromSizes hg18 > hg18.chrom.sizes
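The resulting chrom.sizes file has two whitespace-separated columns, chromosome name and size in bases, so it is easy to load from other scripts. A minimal Python sketch, using a hypothetical helper named read_chrom_sizes:

```python
def read_chrom_sizes(lines):
    """Parse chrom.sizes lines into a dict of chromosome name -> size."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        sizes[fields[0]] = int(fields[1])
    return sizes
```

With a real file, an open file object can be passed directly, for example read_chrom_sizes(open("hg18.chrom.sizes")).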
================================================================
======== genePredToGtf ====================================
================================================================
genePredToGtf - Convert genePred table or file to gtf.
usage:
genePredToGtf database genePredTable output.gtf
If database is 'file' then genePredTable is interpreted as a file
rather than a table in database.
options:
-utr - Add 5UTR and 3UTR features
-honorCdsStat - use cdsStartStat/cdsEndStat when defining start/end
codon records
-source=src set source name to use
-addComments - Add comments before each set of transcript records.
allows for easier visual inspection
Note: use a refFlat table or extended genePred table or file to include
the gene_name attribute in the output. This will not work with a refFlat
table dump file. If you are using a genePred file that starts with a numeric
bin column, drop it using the UNIX cut command:
cut -f 2- in.gp | genePredToGtf file stdin out.gtf
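For files where only some rows might carry a bin column, the same clean-up can be done in Python. This sketch uses a hypothetical function drop_bin_column and assumes that a purely numeric first field is always a bin value and never a gene name.

```python
def drop_bin_column(line):
    """Remove a leading numeric bin column from a tab-separated genePred line."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: a first field made only of digits is the bin column.
    if fields and fields[0].isdigit():
        fields = fields[1:]
    return "\t".join(fields)
```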
================================================================
======== gfClient ====================================
================================================================
gfClient v. 34x10 - A client for the genomic finding program that produces a .psl file
usage:
gfClient host port seqDir in.fa out.psl
where
host is the name of the machine running the gfServer
port is the same as you started the gfServer with
seqDir is the path of the .nib or .2bit files relative to the current dir
(note these are needed by the client as well as the server)
in.fa is a fasta format file. May contain multiple records
out.psl where to put the output
options:
-t=type Database type. Type is one of:
dna - DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
The default is dna
-q=type Query type. Type is one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
-prot Synonymous with -t=prot -q=prot
-dots=N Output a dot every N query sequences
-nohead Suppresses psl five line header
-minScore=N sets minimum score. This is twice the matches minus the
mismatches minus some sort of gap penalty. Default is 30
-minIdentity=N Sets minimum sequence identity (in percent). Default is
90 for nucleotide searches, 25 for protein or translated
protein searches.
-out=type Controls output file format. Type is one of:
psl - Default. Tab separated format without actual sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8- NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-maxIntron=N Sets maximum intron size. Default is 750000
================================================================
======== gfServer ====================================
================================================================
gfServer v 34x10 - Make a server to quickly find where DNA occurs in genome.
To set up a server:
gfServer start host port file(s)
Where the files are in .nib or .2bit format
To remove a server:
gfServer stop host port
To query a server with DNA sequence:
gfServer query host port probe.fa
To query a server with protein sequence:
gfServer protQuery host port probe.fa
To query a server with translated dna sequence:
gfServer transQuery host port probe.fa
To query server with PCR primers
gfServer pcr host port fPrimer rPrimer maxDistance
To process one probe fa file against a .nib format genome (not starting server):
gfServer direct probe.fa file(s).nib
To test pcr without starting server:
gfServer pcrDirect fPrimer rPrimer file(s).nib
To figure out usage level
gfServer status host port
To get input file list
gfServer files host port
Options:
-tileSize=N size of n-mers to index. Default is 11 for nucleotides, 4 for
proteins (or translated nucleotides).
-stepSize=N spacing between tiles. Default is tileSize.
-minMatch=N Number of n-mer matches that trigger detailed alignment
Default is 2 for nucleotides, 3 for proteins.
-maxGap=N Number of insertions or deletions allowed between n-mers.
Default is 2 for nucleotides, 0 for proteins.
-trans Translate database to protein in 6 frames. Note: it is best
to run this on RepeatMasked data in this case.
-log=logFile keep a log file that records server requests.
-seqLog Include sequences in log file (not logged with -syslog)
-ipLog Include user's IP in log file (not logged with -syslog)
-syslog Log to syslog
-logFacility=facility log to the specified syslog facility - default local0.
-mask Use masking from nib file.
-repMatch=N Number of occurrences of a tile (nmer) that trigger repeat masking the tile.
Default is 1024.
-maxDnaHits=N Maximum number of hits for a dna query that are sent from the server.
Default is 100.
-maxTransHits=N Maximum number of hits for a translated query that are sent from the server.
Default is 200.
-maxNtSize=N Maximum size of untranslated DNA query sequence
Default is 40000
-maxAaSize=N Maximum size of protein or translated DNA queries
Default is 8000
-canStop If set then a quit message will actually take down the
server
================================================================
======== gff3ToGenePred ====================================
================================================================
gff3ToGenePred - convert a GFF3 file to a genePred file
usage:
gff3ToGenePred inGff3 outGp
options:
-maxParseErrors=50 - Maximum number of parsing errors before aborting. A negative
value will allow an unlimited number of errors. Default is 50.
-maxConvertErrors=50 - Maximum number of conversion errors before aborting. A negative
value will allow an unlimited number of errors. Default is 50.
-honorStartStopCodons - only set CDS start/stop status to complete if there are
corresponding start_stop codon records
This converts:
- top-level gene records with mRNA records
- top-level mRNA records
- mRNA records can contain exon and CDS, or only CDS, or only
exon for non-coding.
The first step is to parse the GFF3 file; up to 50 errors are reported before
aborting. If the GFF3 file is successfully parsed, it is converted to gene
annotation. Up to 50 conversion errors are reported before aborting.
Input file must conform to the GFF3 specification:
http://www.sequenceontology.org/gff3.shtml
================================================================
======== gtfToGenePred ====================================
================================================================
gtfToGenePred - convert a GTF file to a genePred
usage:
gtfToGenePred gtf genePred
options:
-genePredExt - create an extended genePred, including frame
information and gene name
-allErrors - skip groups with errors rather than aborting.
Useful for getting information about as many errors as possible.
-infoOut=file - write a file with information on each transcript
-sourcePrefix=pre - only process entries where the source name has the
specified prefix. May be repeated.
-impliedStopAfterCds - implied stop codon is after the CDS
-simple - just check column validity, not hierarchy, resulting genePred may be damaged
-geneNameAsName2 - if specified, use gene_name for the name2 field
instead of gene_id.
================================================================
======== hgWiggle ====================================
================================================================
hgWiggle - fetch wiggle data from database or file
usage:
hgWiggle [options] <track names ...>
options:
-db=<database> - use specified database
-chr=chrN - examine data only on chrN
-chrom=chrN - same as -chr option above
-position=[chrN:]start-end - examine data in window start-end (1-relative)
(the chrN: is optional)
-chromLst=<file> - file with list of chroms to examine
-doAscii - perform the default ascii output, in addition to other outputs
- Any of the other -do outputs turn off the default ascii output
-rawDataOut - output just the data values, nothing else
-htmlOut - output stats or histogram in HTML instead of plain text
-doStats - perform stats measurement, default output text, see -htmlOut
-doBed - output bed format
-lift=<D> - lift ascii output positions by D (0 default)
-bedFile=<file> - constrain output to ranges specified in bed <file>
-dataConstraint='DC' - where DC is one of < = >= <= == != 'in range'
-ll=<F> - lowerLimit compare data values to F (float) (all but 'in range')
-ul=<F> - upperLimit compare data values to F (float)
(need both ll and ul when 'in range')
-help - display more examples and extra options (to stderr)
When no database is specified, track names will refer to .wig files
example using the file chrM.wig:
hgWiggle chrM
example using the database table hg17.gc5Base:
hgWiggle -chr=chrM -db=hg17 gc5Base
================================================================
======== htmlCheck ====================================
================================================================
htmlCheck - Do a little reading and verification of html file
usage:
htmlCheck how url
where how is:
ok - just check for 200 return. Print error message and exit -1 if no 200
getAll - read the url (header and html) and print to stdout
getHeader - read the header and print to stdout
getCookies - print list of cookies
getHtml - print the html, but not the header to stdout
getForms - print the form structure to stdout
getVars - print the form variables to stdout
getLinks - print links
getTags - print out just the tags
checkLinks - check links in page
checkLinks2 - check links in page and all subpages in same host
(Just one level of recursion)
checkLocalLinks - check local links in page
checkLocalLinks2 - check local links in page and connected local pages
(Just one level of recursion)
submit - submit first form in page if any using 'GET' method
validate - do some basic validations including TABLE/TR/TD nesting
options:
cookies=cookie.txt - Cookies is a two column file
containing <cookieName><space><value><newLine>
note: url will need to be in quotes if it contains an ampersand.
================================================================
======== liftOver ====================================
================================================================
liftOver - Move annotations from one assembly to another
usage:
liftOver oldFile map.chain newFile unMapped
oldFile and newFile are in bed format by default, but can be in GFF and
maybe eventually others with the appropriate flags below.
The map.chain file has the old genome as the target and the new genome
as the query.
***********************************************************************
WARNING: liftOver was only designed to work between different
assemblies of the same organism, it may not do what you want
if you are lifting between different organisms.
***********************************************************************
options:
-minMatch=0.N Minimum ratio of bases that must remap. Default 0.95
-gff File is in gff/gtf format. Note that the gff lines are converted
separately. It would be good to have a separate check after this
that the lines that make up a gene model still make a plausible gene
after liftOver
-genePred - File is in genePred format
-sample - File is in sample format
-bedPlus=N - File is bed N+ format
-positions - File is in browser "position" format
-hasBin - File has bin value (used only with -bedPlus)
-tab - Separate by tabs rather than space (used only with -bedPlus)
-pslT - File is in psl format, map target side only
-minBlocks=0.N Minimum ratio of alignment blocks or exons that must map
(default 1.00)
-fudgeThick (bed 12 or 12+ only) If thickStart/thickEnd is not mapped,
use the closest mapped base. Recommended if using
-minBlocks.
-multiple Allow multiple output regions
-minChainT, -minChainQ Minimum chain size in target/query, when mapping
to multiple output regions (default 0, 0)
-minSizeT deprecated synonym for -minChainT (ENCODE compat.)
-minSizeQ Min matching region size in query with -multiple.
-chainTable Used with -multiple, format is db.tablename,
to extend chains from net (preserves dups)
-errorHelp Explain error messages
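The usage line above can be assembled programmatically. What follows is a minimal Python sketch, not part of liftOver itself: the helper name liftover_argv and the file names are hypothetical, and only the -minMatch and -gff flags documented above are used. It only builds the argument list; actually running it assumes a liftOver binary on the PATH.

```python
def liftover_argv(old_file, chain, new_file, unmapped, min_match=None, gff=False):
    """Build an argument list for: liftOver oldFile map.chain newFile unMapped."""
    argv = ["liftOver"]
    if min_match is not None:
        # -minMatch=0.N: minimum ratio of bases that must remap
        argv.append(f"-minMatch={min_match}")
    if gff:
        # -gff: input is in gff/gtf format
        argv.append("-gff")
    argv += [old_file, chain, new_file, unmapped]
    return argv


argv = liftover_argv("old.bed", "map.chain", "new.bed", "unMapped.bed", min_match=0.9)
print(" ".join(argv))
```

The list can be passed to subprocess.run(argv, check=True) once the binary and input files exist.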
================================================================
======== liftOverMerge ====================================
================================================================
liftOverMerge - Merge multiple regions in BED 5 files
generated by liftOver -multiple
usage:
liftOverMerge oldFile newFile
options:
-mergeGap=N Max size of gap to merge regions (default 0)
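The effect of -mergeGap can be illustrated with a small interval merge. This is a sketch under assumptions, not the program's actual code: it treats regions as plain (chrom, start, end) tuples and ignores the other BED 5 fields, and the function name merge_regions is hypothetical.

```python
def merge_regions(regions, merge_gap=0):
    """Merge (chrom, start, end) regions whose separating gap is <= merge_gap."""
    merged = []
    for chrom, start, end in sorted(regions):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= merge_gap:
            # Close enough to the previous region: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


regions = [("chr1", 100, 200), ("chr1", 205, 300), ("chr1", 400, 500)]
print(merge_regions(regions, merge_gap=10))
```

With the default gap of 0, only regions that touch or overlap are merged in this sketch.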
================================================================
======== liftUp ====================================
================================================================
liftUp - change coordinates of .psl, .agp, .gap, .gl, .out, .gff, .gtf, .bscore,
.tab, .gdup, .axt, .chain, .net, genePred, .wab, .bed, or .bed8 files to parent
coordinate system.
usage:
liftUp [-type=.xxx] destFile liftSpec how sourceFile(s)
The optional -type parameter tells what type of files to lift.
If omitted, the type is inferred from the suffix of destFile.
Type is one of the suffixes described above.
DestFile will contain the merged and lifted source files,
with the coordinates translated as per liftSpec. LiftSpec
is tab-delimited with each line of the form:
offset oldName oldSize newName newSize
LiftSpec may optionally have a sixth column specifying + or - strand,
but strand is not supported for all input types.
The 'how' parameter controls what the program will do with
items which are not in the liftSpec. It must be one of:
carry - Items not in liftSpec are carried to dest without translation
drop - Items not in liftSpec are silently dropped from dest
warn - Items not in liftSpec are dropped. A warning is issued
error - Items not in liftSpec generate an error
If the destination is a .agp file then a 'large inserts' file
also needs to be included in the command line:
liftUp dest.agp liftSpec how inserts sourceFile(s)
This file describes where large inserts due to heterochromatin
should be added. Use /dev/null and set -gapsize if there is no inserts file.
options:
-nohead No header written for .psl files
-dots=N Output a dot every N lines processed
-pslQ Lift query (rather than target) side of psl
-axtQ Lift query (rather than target) side of axt
-chainQ Lift query (rather than target) side of chain
-netQ Lift query (rather than target) side of net
-wabaQ Lift query (rather than target) side of waba alignment
(waba lifts only work with query side at this time)
-nosort Don't sort bed, gff, or gdup files, to save memory
-gapsize change contig gapsize from default
-ignoreVersions - Ignore NCBI-style version number in sequence ids of input files
-extGenePred lift extended genePred
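The liftSpec format and the 'how' parameter described above can be sketched in Python. This is an illustration under assumptions, not liftUp's implementation: it handles plus-strand ranges only, ignores the optional sixth strand column, and the function names are hypothetical.

```python
import sys


def load_lift_spec(lines):
    """Parse tab-delimited lines: offset oldName oldSize newName newSize."""
    spec = {}
    for line in lines:
        offset, old_name, _old_size, new_name, new_size = line.rstrip("\n").split("\t")[:5]
        spec[old_name] = (int(offset), new_name, int(new_size))
    return spec


def lift(spec, name, start, end, how="carry"):
    """Translate one plus-strand range; 'how' covers names missing from the liftSpec."""
    if name in spec:
        offset, new_name, _new_size = spec[name]
        return new_name, start + offset, end + offset
    if how == "carry":
        return name, start, end          # carried without translation
    if how == "drop":
        return None                      # silently dropped
    if how == "warn":
        print(f"warning: {name} not in liftSpec", file=sys.stderr)
        return None
    if how == "error":
        raise KeyError(f"{name} not in liftSpec")
    raise ValueError(f"unknown how: {how}")


spec = load_lift_spec(["1000\tctg1\t500\tchr1\t100000\n"])
print(lift(spec, "ctg1", 10, 20))
```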
================================================================
======== mafSpeciesSubset ====================================
================================================================
mafSpeciesSubset - Extract a maf that just has a subset of species.
usage:
mafSpeciesSubset in.maf species.lst out.maf
Where:
in.maf is a file where the sequence sources are either simple species
names, or species.something. In practice the part before the dot is
usually a genome database name rather than a species name.
species.lst is a file with a list of species to keep
out.maf is the output. It will have columns that are all - or . in
the reduced species set removed, as well as the lines representing
species not in species.lst removed.
options:
-keepFirst - If set, keep the first 'a' line in a maf no matter what
Useful for mafFrag results where we use this for the gene name
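The species filtering step can be sketched as follows. This is a partial illustration under assumptions: it only drops 's' lines whose source prefix is not in the species list, and does not remove the all-gap columns that the real program also removes. The function name is hypothetical.

```python
def keep_species_lines(maf_lines, species):
    """Keep non-'s' lines, plus 's' lines whose source prefix is in species."""
    kept = []
    for line in maf_lines:
        fields = line.split()
        if fields and fields[0] == "s":
            # Source is either "species" or "species.something".
            prefix = fields[1].split(".", 1)[0]
            if prefix not in species:
                continue
        kept.append(line)
    return kept


maf = [
    "a score=10",
    "s hg19.chr1 0 4 + 100 ACGT",
    "s mm9.chr2 0 4 + 100 ACGT",
]
for line in keep_species_lines(maf, {"hg19"}):
    print(line)
```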
================================================================
======== mafsInRegion ====================================
================================================================
mafsInRegion - Extract MAFS in a genomic region
usage:
mafsInRegion regions.bed out.maf|outDir in.maf(s)
options:
-outDir - output separate files named by bed name field to outDir
-keepInitialGaps - keep alignment columns at the beginning and end of a block that are gapped in all species
================================================================
======== nibFrag ====================================
================================================================
nibFrag - Extract part of a nib file as .fa (all bases/gaps lower case by default)
usage:
nibFrag [options] file.nib start end strand out.fa
where strand is + (plus) or m (minus)
options:
-masked - use lower case characters for bases meant to be masked out
-hardMasked - use upper case for not masked-out and 'N' characters for masked-out bases
-upper - use upper case characters for all bases
-name=name Use given name after '>' in output sequence
-dbHeader=db Add full database info to the header, with or without -name option
-tbaHeader=db Format header for compatibility with tba, takes database name as argument
================================================================
======== overlapSelect ====================================
================================================================
wrong # args: overlapSelect [options] selectFile inFile outFile
Select records based on overlapping chromosome ranges. The ranges are
specified in the selectFile, with each block specifying a range.
Records are copied from the inFile to outFile based on the selection
criteria. Selection is based on blocks or exons rather than entire
range.
Options starting with -select* apply to selectFile and those starting
with -in* apply to inFile.
Options:
-selectFmt=fmt - specify selectFile format:
psl - PSL format (default for *.psl files).
pslq - PSL format, using query instead of target
genePred - genePred format (default for *.gp or
*.genePred files).
bed - BED format (default for *.bed files).
If BED doesn't have blocks, the bed range is used.
chain - chain file format (default from .chain files)
chainq - chain file format, using query instead of target
-selectCoordCols=spec - selectFile is tab-separated with coordinates
as described by spec, which is one of:
o chromCol - chrom in this column followed by start and end.
o chromCol,startCol,endCol,strandCol,name - chrom, start, end, and
strand in specified columns. Columns can be omitted from the end
or left empty to not specify.
NOTE: column numbers are zero-based
-selectCds - Use only CDS in the selectFile
-selectRange - Use entire range instead of blocks from records in
the selectFile.
-inFmt=fmt - specify inFile format, same values as -selectFmt.
-inCoordCols=spec - inFile is tab-separated with coordinates specified by
spec, in format described above.
-inCds - Use only CDS in the inFile
-inRange - Use entire range instead of blocks of records in the inFile.
-nonOverlapping - select non-overlapping instead of overlapping records
-strand - must be on the same strand to be considered overlapping
-oppositeStrand - must be on the opposite strand to be considered overlapping
-excludeSelf - don't compare records with the same coordinates and name.
Warning: using only one of -inCds or -selectCds will result in different
coordinates for the same record.
-idMatch - only select overlapping records if they have the same id
-aggregate - instead of computing overlap based on individual select entries,
compute it based on the total number of inFile bases overlapped by selectFile
records. -overlapSimilarity and -mergeOutput will not work with
this option.
-overlapThreshold=0.0 - minimum fraction of an inFile record that
must be overlapped by a single select record to be considered
overlapping. Note that this is only coverage by a single select
record, not total coverage.
-overlapThresholdCeil=1.1 - select only inFile records with less than
this amount of overlap with a single record, provided they are selected
by other criteria.
-overlapSimilarity=0.0 - minimum fraction of inFile and select records that
must overlap. Note that this is only coverage by a single select record and
this is bidirectional: inFile and selectFile must overlap by this
amount. A value of 1.0 will select identical records (or CDS if
both CDS options are specified). Not currently supported with
-aggregate.
-overlapSimilarityCeil=1.1 - select only inFile records with less than this
amount of similarity with a single record, provided they are selected by
other criteria.
-overlapBases=-1 - minimum number of bases of overlap, < 0 disables.
-statsOutput - output overlap statistics instead of selected records.
If no overlap criteria are specified, all overlapping entries are
reported. Otherwise only the pairs passing the criteria are
reported. This results in a tab-separated file with the columns:
inId selectId inOverlap selectOverlap overBases
Where inOverlap is the fraction of the inFile record overlapped by
the selectFile record and selectOverlap is the fraction of the
select record overlapped by inFile records. With -aggregate, output
is:
inId inOverlap inOverBases inBases
-statsOutputAll - like -statsOutput, however output all inFile records,
including those that are not overlapped.
-statsOutputBoth - like -statsOutput, however output all selectFile and
inFile records, including those that are not overlapped.
-mergeOutput - output file will be a merge of the input file with the
selectFile records that selected it. The format is
inRec<tab>selectRec.
if multiple select records hit, inRec is repeated. This will increase
the memory required. Not supported with -nonOverlapping or -aggregate.
-idOutput - output a tab-separated file of pairs of
inId selectId
with -aggregate, only a single column of inId is written
-dropped=file - output rows that were dropped to this file.
-verbose=n - verbose > 1 prints some details.
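The inOverlap, selectOverlap and overBases columns of -statsOutput, and the -overlapThreshold test, can be sketched for the simplest case. This assumes single-block records given as half-open (start, end) ranges on the same chromosome; the real program works per block or exon, and the function names are hypothetical.

```python
def overlap_stats(in_rec, select_rec):
    """Return (inOverlap, selectOverlap, overBases) for two half-open ranges."""
    in_start, in_end = in_rec
    sel_start, sel_end = select_rec
    over_bases = max(0, min(in_end, sel_end) - max(in_start, sel_start))
    in_overlap = over_bases / (in_end - in_start)
    select_overlap = over_bases / (sel_end - sel_start)
    return in_overlap, select_overlap, over_bases


def passes_threshold(in_rec, select_rec, overlap_threshold=0.0):
    """True if a single select record covers enough of the inFile record."""
    in_overlap, _select_overlap, over_bases = overlap_stats(in_rec, select_rec)
    return over_bases > 0 and in_overlap >= overlap_threshold


stats = overlap_stats((100, 200), (150, 350))
print(stats)
```

Here 50 bases overlap, which is half of the 100-base inFile record and a quarter of the 200-base select record.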
================================================================
======== paraFetch ====================================
================================================================
paraFetch - try to fetch url with multiple connections
usage:
paraFetch N R URL outPath
where N is the number of connections to use
R is the number of retries
================================================================
======== paraSync ====================================
================================================================
paraSync 1.0
paraSync - uses paraFetch to recursively mirror url to given path
usage:
paraSync {options} N R URL outPath
where N is the number of connections to use
R is the number of retries
Options:
-A='ext1,ext2' means accept only files with ext1 or ext2
================================================================
======== pslCDnaFilter ====================================
================================================================
wrong # of args: pslCDnaFilter [options] inPsl outPsl
Filter cDNA alignments in psl format. Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.
WARNING: comparative filters require that the input is sorted by
query name. The command: 'sort -k 10,10' will do the trick.
Each alignment is assigned a score that is based on identity and
weighted towards longer alignments and those with introns. This
can do either global or local best-in-genome selection. Local
near best in genome keeps fragments of an mRNA that align in
discontinuous locations from other fragments. It is useful for
unfinished genomes. Global near best in genome keeps alignments
based on overall score.
Options:
-algoHelp - print message describing the filtering algorithm.
-localNearBest=-1.0 - local near best in genome filtering,
keeping alignments within this fraction of the top score for
each aligned portion of the mRNA. A value of zero keeps only
the best for each fragment. A value of -1.0 disables
(default).
-globalNearBest=-1.0 - global near best in genome filtering,
keeping alignments within this fraction of the top score. A
value of zero keeps only the best alignment. A value of -1.0
disables (default).
-ignoreNs - don't include Ns (repeat masked) while calculating the
score and coverage. That is treat them as unaligned rather than
mismatches. Ns are still counted as mismatches when calculating
the identity.
-ignoreIntrons - don't favor apparent introns when scoring.
-minId=0.0 - only keep alignments with at least this fraction
identity.
-minCover=0.0 - minimum fraction of query that must be
aligned. If -polyASizes is specified and the query is in
the file, the poly-A is not included in coverage
calculation.
-minSpan=0.0 - keep only alignments whose target length is
at least this fraction of the longest alignment passing the
other filters. This can be useful for removing possible
retroposed genes.
-minQSize=0 - drop queries shorter than this size
-minAlnSize=0 - minimum number of aligned bases. This includes
repeats, but excludes poly-A/poly-T bases if available.
-minNonRepSize=0 - Minimum number of matching bases that are not repeats.
This does not include mismatches.
Must use -repeats on BLAT if doing unmasked alignments.
-maxRepMatch=1.0 - Maximum fraction of matching bases
that are repeats. Must use -repeats on BLAT if doing
unmasked alignments.
-maxAligns=-1 - maximum number of alignments for a given query. If
exceeded, then alignments are sorted by score and only this number
will be saved. A value of -1 disables (default)
-polyASizes=file - tab-separated file with information about
poly-A tails and poly-T heads. Format is outputted by
faPolyASizes:
id seqSize tailPolyASize headPolyTSize
-usePolyTHead - if a poly-T head was detected and is longer
than the poly-A tail, it is used when calculating coverage
instead of the poly-A tail.
-bestOverlap - filter overlapping alignments, keeping the best of
alignments that are similar. This is designed to be used with
overlapping, windowed alignments, where one alignment might be truncated.
Does not discard ones with weird overlap unless -filterWeirdOverlapped
is specified.
-hapRegions=psl - PSL format alignments of each haplotype pseudo-chromosome
to the corresponding reference chromosome region. This is used to map
alignments between regions.
-dropped=psl - save psls that were dropped to this file.
-weirdOverlapped=psl - output weirdly overlapping PSLs to
this file.
-filterWeirdOverlapped - Filter weirdly overlapped alignments, keeping
the single highest scoring one or an arbitrary one if multiple with
the same high score.
-alignStats=file - output the per-alignment statistics to this file
-hapRefMapped=psl - output PSLs of haplotype to reference chromosome
cDNA alignments mappings (for debugging purposes).
-hapRefCDnaAlns=psl - output PSLs of haplotype cDNA to reference cDNA
alignments (for debugging purposes).
-alnIdQNameMode - add internal assigned alignment numbers to cDNA names
on output. Useful for debugging, as they are included in the verbose
tracing as [#1], etc. Will make a mess of normal production usage.
-noValidate - don't run pslCheck validation.
-verbose=1 - 0: quiet
1: output stats
2: list problem alignment (weird or invalid)
3: list dropped alignments and reason for dropping
4: list kept psl and info
5: info about all PSLs
The default options don't do any filtering. If no filtering
criteria are specified, all PSLs will be passed through, except
those that are internally inconsistent.
THE INPUT MUST BE SORTED BY QUERY for the comparative filters.
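The sort requirement can also be met without the sort command. The sketch below is an assumption-laden equivalent of 'sort -k 10,10' in Python: it expects headerless, tab-separated PSL lines, where the 10th column is the query name. The helper make_row builds hypothetical placeholder rows purely for the demonstration.

```python
def sort_by_query(psl_lines):
    """Sort headerless PSL lines by qName, the 10th tab-separated column."""
    return sorted(psl_lines, key=lambda line: line.split("\t")[9])


def make_row(q_name):
    # Hypothetical placeholder row: 21 PSL columns, only qName filled in.
    fields = ["0"] * 21
    fields[9] = q_name
    return "\t".join(fields)


rows = [make_row("NM_002"), make_row("NM_001"), make_row("NM_002")]
sorted_names = [line.split("\t")[9] for line in sort_by_query(rows)]
print(sorted_names)
```

Python compares strings by code point while sort uses the current locale, so the exact order may differ, but alignments for the same query end up adjacent either way.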
================================================================
======== pslPretty ====================================
================================================================
pslPretty - Convert PSL to human readable output
usage:
pslPretty in.psl target.lst query.lst pretty.out
options:
-axt - save in something like Scott Schwartz's axt format
Note gaps in both sequences are still allowed in the
output which not all axt readers will expect
-dot=N Put out a dot every N records
-long - Don't abbreviate long inserts
-check=fileName - Output alignment checks to filename
It's a really good idea if the psl file is sorted by target
if it contains multiple targets. Otherwise this will be
very very slow. The target and query lists can either be
fasta, 2bit or nib files, or a list of fasta, 2bit and/or nib files
one per line
================================================================
======== pslReps ====================================
================================================================
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
options:
-nohead don't add PSL header
-ignoreSize Will not weigh in favor of larger alignments so much
-noIntrons Will not penalize for not having introns when calculating
size factor
-singleHit Takes single best hit, not splitting into parts
-minCover=0.N minimum coverage to output. Default is 0.
-ignoreNs Ignore 'N's when calculating minCover.
-minAli=0.N minimum alignment ratio
default is 0.93
-nearTop=0.N how much can deviate from top and be taken
default is 0.01
-minNearTopSize=N Minimum size of alignment that is near top
for alignment to be kept. Default 30.
-coverQSizes=file Tab-separated file with effective query sizes.
When used with -minCover, this allows polyAs
to be excluded from the coverage calculation
================================================================
======== pslSort ====================================
================================================================
pslSort - merge and sort psCluster .psl output files
usage:
pslSort dirs[1|2] outFile tempDir inDir(s)
This will sort all of the .psl files in the directories
inDirs in two stages - first into temporary files in tempDir
and second into outFile. The device on tempDir needs to have
enough space (typically 15-20 gigabytes if processing whole genome)
pslSort g2g[1|2] outFile tempDir inDir(s)
This will sort a genome to genome alignment, reflecting the
alignments across the diagonal.
Adding 1 or 2 after the dirs or g2g will limit the program to
only the first or second pass respectively of the sort
Options:
-nohead - do not write psl header
-verbose=N Set verbosity level, higher for more output. Default 1
================================================================
======== sizeof ====================================
================================================================
type bytes bits
char 1 8
unsigned char 1 8
short int 2 16
u short int 2 16
int 4 32
unsigned 4 32
long 8 64
unsigned long 8 64
long long 8 64
u long long 8 64
size_t 8 64
void * 8 64
float 4 32
double 8 64
long double 16 128
LITTLE ENDIAN machine detected
byte order: normal order: 0x12345678 in memory: 0x78563412
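The byte order line above can be reproduced with Python's struct module. This is only an illustration of little-endian versus big-endian layout for the same 32-bit value; it does not depend on the sizeof program.

```python
import struct
import sys

value = 0x12345678
little = struct.pack("<I", value)  # little-endian layout of a 32-bit unsigned int
big = struct.pack(">I", value)     # big-endian layout, the "normal order"

print("host byte order:", sys.byteorder)
print("little-endian bytes:", little.hex())
print("big-endian bytes:", big.hex())
```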
================================================================
======== stringify ====================================
================================================================
stringify - Convert file to C strings
usage:
stringify [options] in.txt
A stringified version of in.txt will be printed to standard output.
Options:
-var=varname - create a variable with the specified name containing
the string.
-static - create the variable as a string array.
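The idea of converting a file to C strings can be sketched in Python. This is an assumption about the general approach, not the program's exact output format: each input line becomes one quoted literal with backslashes and double quotes escaped, and the declaration form used for the hypothetical var argument is a guess. The -static option is not modelled.

```python
def stringify(text, var=None):
    """Render text as C string literals, one literal per input line."""
    literals = []
    for line in text.splitlines():
        escaped = line.replace("\\", "\\\\").replace('"', '\\"')
        literals.append(f'"{escaped}\\n"')
    body = "\n".join(literals)
    if var is None:
        return body
    # Assumed declaration form; the real tool's layout may differ.
    return f"char *{var} =\n{body};"


print(stringify('say "hi"\nbye\n', var="msg"))
```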
================================================================
======== textHistogram ====================================
================================================================
textHistogram - Make a histogram in ascii
usage:
textHistogram [options] inFile
Where inFile contains one number per line.
options:
-binSize=N - Size of bins, default 1
-maxBinCount=N - Maximum # of bins, default 25
-minVal=N - Minimum value to put in histogram, default 0
-log - Do log transformation before plotting
-noStar - Don't draw asterisks
-col=N - Which column to use. Default 1
-aveCol=N - A second column to average over. The averages
will be output in place of counts of primary column.
-real - Data input are real values (default is integer)
-autoScale=N - autoscale to N # of bins
-probValues - show prob-Values (density and cum.distr.) (sets -noStar too)
-freq - show frequencies instead of counts
-skip=N - skip N lines before starting, default 0
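How -binSize, -maxBinCount and -minVal interact can be sketched as a simple binning function. This is an illustration under assumptions: values falling outside the bin range are ignored here, which may not match how the real program treats them, and the function name is hypothetical.

```python
def text_histogram(values, bin_size=1, max_bin_count=25, min_val=0):
    """Count values into fixed-width bins starting at min_val."""
    counts = [0] * max_bin_count
    for value in values:
        index = int((value - min_val) // bin_size)
        if 0 <= index < max_bin_count:
            counts[index] += 1
    return counts


counts = text_histogram([0, 1, 1, 2, 2, 2, 7], bin_size=1, max_bin_count=5)
for index, count in enumerate(counts):
    if count:
        # bin start, a row of asterisks, and the count
        print(f"{index}\t{'*' * count}\t{count}")
```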
================================================================
======== twoBitInfo ====================================
================================================================
twoBitInfo - get information about sequences in a .2bit file
usage:
twoBitInfo input.2bit output.tab
options:
-nBed instead of seq sizes, output BED records that define
areas with N's in sequence
-noNs outputs the length of each sequence, but does not count Ns
Output file has the columns:
seqName size
The 2bit file may be specified in the form path:seq or path:seq1,seq2,seqN...
so that information is returned only on the requested sequence(s).
If the form path:seq:start-end is used, start-end is ignored.
================================================================
======== twoBitToFa ====================================
================================================================
twoBitToFa - Convert all or part of .2bit file to fasta
usage:
twoBitToFa input.2bit output.fa
options:
-seq=name - restrict this to just one sequence
-start=X - start at given position in sequence (zero-based)
-end=X - end at given position in sequence (non-inclusive)
-seqList=file - file containing list of the desired sequence names
in the format seqSpec[:start-end], e.g. chr1 or chr1:0-189
where coordinates are half-open zero-based, i.e. [start,end)
-noMask - convert sequence to all upper case
-bpt=index.bpt - use bpt index instead of built in one
-bed=input.bed - grab sequences specified by input.bed. Will exclude introns
Sequence and range may also be specified as part of the input
file name using the syntax:
/path/input.2bit:name
or
/path/input.2bit:name:start-end
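The file name syntax above can be parsed as follows. This is a minimal sketch with a hypothetical function name; it assumes the path itself contains no colon and treats coordinates as half-open and zero-based, as described for -seqList.

```python
def parse_two_bit_spec(spec):
    """Split '/path/input.2bit[:name[:start-end]]' into its parts."""
    parts = spec.split(":")
    path = parts[0]
    name = parts[1] if len(parts) > 1 else None
    start = end = None
    if len(parts) > 2:
        start_text, end_text = parts[2].split("-")
        start, end = int(start_text), int(end_text)
    return path, name, start, end


path, name, start, end = parse_two_bit_spec("/path/input.2bit:chr1:0-189")
# Half-open [start, end): the number of bases is end - start.
print(path, name, start, end, end - start)
```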
================================================================
======== validateFiles ====================================
================================================================
validateFiles - Validate format of different track input files
Program exits with non-zero status if any errors detected
otherwise exits with zero status
Use filename 'stdin' to read from stdin
Files can be in .gz, .bz2, .zip, .Z format and are
automatically decompressed
Multiple input files of the same type can be listed
Error messages are written to stderr
OK or failing file lines can be optionally written to stdout
usage:
validateFiles -type=FILE_TYPE file1 [file2 [...]]
options:
-type=(a value from the list below)
tagAlign|pairedTagAlign|broadPeak|narrowPeak|gappedPeak|bedGraph
: see http://genomewiki.cse.ucsc.edu/EncodeDCC/index.php/File_Formats
fasta : Fasta files (only one line of sequence, and no quality scores)
fastq : Fasta with quality scores (see http://maq.sourceforge.net/fastq.shtml)
csfasta : Colorspace fasta (implies -colorSpace) (see link below)
csqual : Colorspace quality (see link below)
(see http://marketing.appliedbiosystems.com/mk/submit/SOLID_KNOWLEDGE_RD?_JS=T&rd=dm)
BAM : Binary Alignment/Map
(see http://samtools.sourceforge.net/SAM1.pdf)
bigWig : Big Wig
(see http://genome.ucsc.edu/goldenPath/help/bigWig.html)
-chromDb=db Specify DB containing chromInfo table to validate chrom names
and sizes
-chromInfo=file.txt Specify chromInfo file to validate chrom names and sizes
-colorSpace Sequences include colorspace values [0-3] (can be used
with formats such as tagAlign and pairedTagAlign)
-zeroSizeOk For BED-type positional data, allow rows with start==end
otherwise require strictly start < end
-genome=path/to/hg18.2bit Validate tagAlign or pairedTagAlign sequences match genome
in .2bit file
-mismatches=n Maximum number of mismatches in sequence (or read pair) if
validating tagAlign or pairedTagAlign files
-mismatchTotalQuality=n Maximum total quality score at mismatching positions
-matchFirst=n only check the first N bases of the sequence
-mmPerPair Check that neither read of a pair exceeds the mismatch count when validating
pairedTagAlign files (default is the total for the pair)
-mmCheckOneInN=n Check mismatches in only one in 'n' lines (default=1, all)
-nMatch N's do not count as a mismatch
-privateData Private data so empty sequence is tolerated
-printOkLines Print lines which pass validation to stdout
-quick[=N] Just test the first N lines of each file (default 1000)
-printFailLines Print lines which fail validation to stdout
-isSort input is sorted by chrom
-version Print version
-allowOther allow chromosomes that aren't native in BAMs
-allowBadLength allow chromosomes that have the wrong length
in BAM
-complementMinus complement the query sequence on the minus strand (for testing BAM)
-doReport output report in filename.report
-showBadAlign show non-compliant alignments
-bamPercent=N.N percentage of BAM alignments that must be compliant
-allowErrors=N number of errors allowed to still pass (default 0)
-maxErrors=N Maximum lines with errors to report in one file before
stopping (default 10)
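The -zeroSizeOk rule and the zero or non-zero exit status can be sketched together. This is an illustration under assumptions, not the validator's code: rows are reduced to (start, end) pairs, -allowErrors is modelled as a simple error budget, and the function names are hypothetical.

```python
def bed_range_ok(start, end, zero_size_ok=False):
    """Require start < end, or start <= end when zero-size rows are allowed."""
    if zero_size_ok:
        return start <= end
    return start < end


def validate_rows(rows, zero_size_ok=False, allow_errors=0):
    """Return an exit status: 0 if errors stay within allow_errors, else 1."""
    errors = sum(1 for start, end in rows if not bed_range_ok(start, end, zero_size_ok))
    return 0 if errors <= allow_errors else 1


rows = [(0, 10), (5, 5)]
print(validate_rows(rows), validate_rows(rows, zero_size_ok=True))
```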
================================================================
======== wigCorrelate ====================================
================================================================
wigCorrelate - Produce a table that correlates all pairs of wigs.
usage:
wigCorrelate one.wig two.wig ... n.wig
This works on bigWig as well as wig files.
The output is to stdout
options:
-clampMax=N - values larger than this are clipped to this value
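The effect of -clampMax on a pairwise correlation can be sketched as below. This assumes a plain Pearson correlation over two value lists that are already aligned position by position; how the real program aligns and weights wig data is not shown here, and the function names are hypothetical.

```python
import math


def clamp(values, clamp_max=None):
    """Clip values larger than clamp_max to clamp_max."""
    if clamp_max is None:
        return list(values)
    return [min(value, clamp_max) for value in values]


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


one = [1, 2, 3, 100]
two = [2, 4, 6, 8]
print(pearson(one, two), pearson(clamp(one, 4), clamp(two, 4)))
```

Clamping limits the influence of a single extreme value such as the 100 in the first list.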
================================================================
======== wigToBigWig ====================================
================================================================
wigToBigWig v 4 - Convert ascii format wig file (in fixedStep, variableStep
or bedGraph format) to binary big wig format.
usage:
wigToBigWig in.wig chrom.sizes out.bw
Where in.wig is in one of the ascii wiggle formats, but not including track lines
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bw is the output indexed big wig file.
options:
-blockSize=N - Number of items to bundle in r-tree. Default 256
-itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
-clip - If set just issue warning messages rather than dying if wig
file contains items off end of chromosome.
-unc - If set, do not use compression.
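The two-column chrom.sizes file and the kind of check that -clip relates to can be sketched as follows. This is an illustration under assumptions: it only reports bedGraph-style items that extend past the end of their chromosome and makes no claim about how the real program handles them. The function names and sample values are hypothetical.

```python
def load_chrom_sizes(lines):
    """Parse two-column lines: <chromosome name> <size in bases>."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def off_end_items(items, sizes):
    """Return (chrom, start, end, value) items that run past the chromosome end."""
    return [item for item in items if item[2] > sizes[item[0]]]


sizes = load_chrom_sizes(["chr1\t1000\n", "chr2\t500\n"])
items = [("chr1", 0, 100, 1.5), ("chr2", 450, 600, 2.0)]
print(off_end_items(items, sizes))
```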
================================================================