INSTALLING IN-SILICO PCR Installing In-Silico PCR involves four major steps: 1) Creating sequence databases. 2) Running the gfServer program to create in-memory indexes of the databases. (If you are just using the command-line gfPcr program rather than the web-driver webPcr, this is as far as you need to go.) 3) Editing the webPcr.cfg file to tell it what machine and port the gfServer(s) are running on, and optionally customizing the webPcr appearance to users. 4) Copying the webPcr executable and webPcr.cfg to a directory where the web server can execute webPcr as a CGI. CREATING SEQUENCE DATABASES You create databases with the program faToTwoBit. Typically you'll create a separate database for each genome you are indexing. Each database can contain up to four billion bases of sequence in an unlimited number of records. The databases for webPcr and webBlat are identical. The input to faToTwoBit is one or more fasta format files each of which can contain multiple records. If the sequence contains repeat sequences, as is the case with vertebrates and many plants, the repeat sequences can be represented in lower case and the other sequence in upper case. The gfServer program can be configured to ignore the repeat sequences. The output of faToTwoBit is a file which is designed for fast random access and efficient storage. The output files store four bases per byte. They use a small amount of additional space to store the case of the DNA and to keep track of runs of N's in the input. Non-N ambiguity codes such as Y and U in the input sequence will be converted to N. Here's how a typical installation might create a mouse and a human genome database: cd /data/genomes mkdir twoBit faToTwoBit human/hg16/*.fa twoBit/hg16.2bit faToTwoBit mouse/mm4/*.fa twoBit/mm4.2bit There's no need to put all of the databases in the same directory, but it can simplify bookkeeping. The databases can also be in the .nib format which was used with blat and gfClient/gfServer until recently. The .nib format only packed 2 bases per byte, and could only handle one record per nib file. Recent versions of blat and related programs can use .2bit files as well. CREATING IN-MEMORY INDICES WITH GFSERVER The gfServer for webPcr uses approximately 1/2 a byte of RAM per base of the sequence that is indexed. The gfServer is memory intensive but typically does not require a lot of CPU power. The gfServer can be on the same machine as the web server for installations just two mammalian sized genomes or less. Many installations will work better with the gfServers located on other machines. A typical installation might go: ssh bigRamMachine cd /data/genomes/twoBit gfServer -stepSize=5 start bigRamMachine 17779 hg16.2bit & It will take approximately half an hour for the indexing to complete on the human genome. The gfServer needs to be run in the same directory as the 2bit file (or nib files). When the indexing is complete gfServer writes: "Server ready for queries!" Note that the -stepSize=5 option is not required for blat, but it will make blat more sensitive. Two gfServer options are useful for trading off between speed and sensitivity: tileSize and stepSize. TileSize is the size of the DNA word indexed. By default it is 11. StepSize is the spacing between indexed words, which by default is equal to tileSize. The PCR programs can find primers with a minimum perfect 3' match of tileSize + stepSize - 1. With tileSize=11 and stepSize=5, the minimum perfect match is 15. In general if the gfServer index is to be shared with blat, we don't recommend going smaller than tileSize=9 stepSize=4. The option -mask will for repeat-rich genome such as human make the index use about less space and run faster. Generally if the gfServer is shared with blat you don't want to use the -mask, since many 3' UTRs do include repeat sequences. However since it is rarely fruitful to have a PCR primer in a repeat, if the gfServer is dedicated to PCR it makes sense to use -mask. EDITING THE WEBPCR.CFG FILE The webPcr.cfg file tells the webPcr program where to look for gfServers and for sequence. The basic format of the .cfg file is line oriented with the first word of the line being a command. Blank lines and lines starting with # are ignored. The webBlat.cfg and webPcr.cfg files are similar. The webBlat.cfg commands are: gfServer - defines host and port a gfServer is running on, the associated sequence directory, and the name of the database to display in the webPcr web page. background - defines the background image if any to display on web page company - defines company name to display on web page The background and company commands are optional. The webPcr.cfg file must have at least one valid gfServer line. Here is a webPcr.cfg file that you might find at a typical installation: company Awesome Research Amalgamated background /images/dnaPaper.jpg gfServer bigRamMachine 17779 /data/genomes/2bit/hg16.2bit Human Genome gfServer bigRamMachine 17781 /data/genomes/2bit/mm4.2bit Mouse Genome # The bacto group in Jersey is ok with us using their server too gfServer bactoGfServer.awesomeresearch.com 17779 /bacto/genomes/2bit/bacteria.2bit Bacteria PUTTING WEBPCR WHERE THE WEB SERVER CAN EXECUTE IT The details of this step vary highly from web server to web server. On a typical Apache installation it might be: ssh webServer cd kent/webPcr cp webPcr /usr/local/apache/cgi-bin cp webPcr.cfg /usr/local/apache/cgi-bin assuming that you've put the executable and config file in kent/webPcr. On OS-X, instead of /usr/local/apache/cgi-bin typically you'll copy stuff to /LibraryWebServer/CGI-Executables. Unless you are administering your own computer you will likely need to ask your local system administrators for help with this step.