INSTALLING WEBBLAT Installing A Web-Based Blat Server involves four major steps: 1) Creating sequence databases. 2) Running the gfServer program to create in-memory indexes of the databases. 3) Editing the webBlat.cfg file to tell it what machine and port the gfServer(s) are running on, and optionally customizing the webBlat appearance to users. 4) Copying the webBlat executable and webBlat.cfg to a directory where the web server can execute webBlat as a CGI. CREATING SEQUENCE DATABASES You create databases with the program faToTwoBit. Typically you'll create a separate database for each genome you are indexing. Each database can contain up to four billion bases of sequence in an unlimited number of records. The databases for webPcr and webBlat are identical. The input to faToTwoBit is one or more fasta format files each of which can contain multiple records. If the sequence contains repeat sequences, as is the case with vertebrates and many plants, the repeat sequences can be represented in lower case and the other sequence in upper case. The gfServer program can be configured to ignore the repeat sequences. The output of faToTwoBit is a file which is designed for fast random access and efficient storage. The output files store four bases per byte. They use a small amount of additional space to store the case of the DNA and to keep track of runs of N's in the input. Non-N ambiguity codes such as Y and U in the input sequence will be converted to N. Here's how a typical installation might create a mouse and a human genome database: cd /data/genomes mkdir twoBit faToTwoBit human/hg16/*.fa twoBit/hg16.2bit faToTwoBit mouse/mm4/*.fa twoBit/mm4.2bit There's no need to put all of the databases in the same directory, but it can simplify bookkeeping. The databases can also be in the .nib format which was used with blat and gfClient/gfServer until recently. The .nib format only packed 2 bases per byte, and could only handle one record per nib file. Recent versions of blat and related programs can use .2bit files as well. CREATING IN-MEMORY INDICES WITH GFSERVER The gfServer program creates an in-memory index of a nucleotide sequence database. The index can either be for translated or untranslated searches. Translated indexes enable protein-based blat queries and use approximately two bytes per unmasked base in the database. Untranslated indexes are used nucleotide-based blat queries as well as for In-silico PCR. An index for normal blat uses approximately 1/4 byte per base. For blat on smaller (primer-sized) queries or for In-silico PCR a more thorough index that requires 1/2 byte per base is recommended. The gfServer is memory intensive but typically doesn not require a lot of CPU power. Memory permitting multiple gfServers can be run on the same machine. A typical installation might go: ssh bigRamMachine cd /data/genomes/twoBit gfServer start bigRamMachine 17779 hg16.2bit & gfServer -trans -mask start bigRamMachine 17778 hg16.2bit & the -trans flag makes a translated index. It will take approximately 15 minutes to build an untranslated index, and 45 minutes to build a translate index. To build an untranslated index to be shared with In-silico PCR do gfServer -stepSize=5 bigRamMachine 17779 hg16.2bit & This index will be slightly more sensitive, noticeably so for small query sequences, with blat. EDITING THE WEBBLAT.CFG FILE The webBlat.cfg file tells the webBlat program where to look for gfServers and for sequence. The basic format of the .cfg file is line oriented with the first word of the line being a command. Blank lines and lines starting with # are ignored. The webBlat.cfg and webPcr.cfg files are similar. The webBlat.cfg commands are: gfServer - defines host and port a (untranslated) gfServer is running on, the associated sequence directory, and the name of the database to display in the webPcr web page. gfServerTrans - defines location of a translated server. background - defines the background image if any to display on web page company - defines company name to display on web page tempDir - where to put temporary files. This path is relative to where the web server executes CGI scripts. It is good to remove files that haven't been accessed for 24 hours from this directory periodically, via a cron job or similar mechanism. The background and company commands are optional. The webBlat.cfg file must have at least one valid gfServer or gfServerTrans line, and a tempDir line. Here is a webBlat.cfg file that you might find at a typical installation: company Awesome Research Amalgamated background /images/dnaPaper.jpg gfServerTrans bigRamMachine 17778 /data/genomes/2bit Human Genome gfServer bigRamMachine 17779 /data/genomes/2bit Human Genome gfServerTrans mouseServer 17780 /data/genomes/2bit Mouse Genome gfServer mouseServer 17781 /data/genomes/2bit Mouse Genome tempDir ../tmp PUTTING WEBBLAT WHERE THE WEB SERVER CAN EXECUTE IT The details of this step vary highly from web server to web server. On a typical Apache installation it might be: ssh webServer cd kent/webBlat cp webBlat /usr/local/apache/cgi-bin cp webBlat.cfg /usr/local/apache/cgi-bin assuming that you've put the executable and config file in kent/webBlat. On OS-X, instead of /usr/local/apache/cgi-bin typically you'll copy stuff to /LibraryWebServer/CGI-Executables. Unless you are administering your own computer you will likely need to ask your local system administrators for help with this step.