GenBank/RefSeq Genome Setup

This page describes the process of setting up the GenBank/RefSeq update process for a new genome or new assembly. It assumes that the automated download and build are already in place in /hive/data/outside/genbank/.

Recent changes

The working directory for initial alignments is now removed unless -keep is specified.
gbBlat no longer needs to be modified to specify the ooc file. It is now specified in the genbank.conf file.
Many changes were made to the genbank.conf file.

Initial Alignment

Building the initial alignments is a long process, involving gigabytes of disk space and over a day of cluster time. Do not be surprised if it does not go smoothly. Manual intervention may be required to correct problems.

Make sure ssh is configured to not require a passphrase.
In this document, $db refers to the database being aligned. Substitute the actual database name (e.g. hg15).
If this is the first time this organism has been aligned, some source files need to be edited. The genbank update code is under kent/src/hg/makeDb/genbank/
- A mapping between the prefix used in the UCSC database names (e.g. the hg of hg13) and the organism names used in GenBank needs to be defined. This is done by editing genbank/src/lib/gbGenome.c and rebuilding the programs. It may be necessary to define multiple organism name mappings. A list of organisms in GenBank/RefSeq, along with the count of cDNAs, is in:
```
           /hive/data/outside/genbank/data/organism.lst
           
```
- cd to the top of the genbank source (kent/src/hg/makeDb/genbank/)
- make to test if the source builds
- make install-server to update /hive/data/outside/genbank/.
- Once these changes are debugged and committed, ask markd to update the round-robin code.
Edit kent/src/hg/makeDb/genbank/etc/genbank.conf to configure this database. You must set:
- $db.serverGenome
- $db.clusterGenome
- $db.lift
You may want to set some database load options to override the defaults. If you have a pseudo chromosome with unplaced sequence, be sure to specify $db.align.maxGapChrs and a lift file.
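Before committing, it can help to confirm that the three required settings are present. The following is a minimal sketch, not part of the genbank tools: the function name missing_conf_keys is hypothetical, and it assumes that settings appear one per line in the form "$db.name = value".

```shell
# Print each required genbank.conf setting that is missing for a database.
# Hypothetical helper; assumes one setting per line, written "<db>.<name> = <value>".
missing_conf_keys() {
    db=$1
    conf=$2
    for key in serverGenome clusterGenome lift; do
        # Look for "<db>.<key>" at the start of a line, followed by "=".
        if ! grep -q "^[[:space:]]*${db}\.${key}[[:space:]]*=" "$conf"; then
            echo "${db}.${key}"
        fi
    done
}
```

Running missing_conf_keys hg15 etc/genbank.conf prints nothing when all three settings are defined for hg15.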
Commit your changes and then go to the top of the genbank source tree and update the installed genbank etc files with:
```
           make etc-update
           
```
ssh fileServer
Where fileServer is the NFS server with /hive/data/outside/genbank/.
cd /hive/data/outside/genbank
This directory is $gbRoot.
nice bin/gbAlignStep -initial $db&

This will run the entire alignment process. The -initial option sets defaults for several parameters for an initial alignment and prevents this alignment from blocking the automatic daily alignments.
Warning: gbAlignStep and the other GenBank programs do not currently accept options after the positional arguments (i.e. the databases).

All output is saved in the log file.
If your organism has xeno ESTs enabled, it's a good idea to start out by aligning and loading just the mRNAs, as this will go much faster. Two options control what is aligned:
- -srcDb=name - Restrict the source database to either genbank or refseq.
- -type=name - Restrict the type of sequence processed to either mrna or est.
Note that since the alignment process only aligns what needs to be aligned, no option is required when doing the ESTs after an initial mRNA alignment.
If anything fails, a subset of the tasks done by the gbAlignStep script can be rerun after correcting the problem. This is done using the -continue=subtask option, where subtask is one of:
- copy - Continue with copying to the iserver; this skips extracting the sequences to align.
- run - Continue with the parasol blat run.
- finish - Finish the alignments, doing lifting and filtering.
If the parasol alignment run fails, it can be continued using parasol directly, followed by a gbAlignStep run with -continue=finish. If parasol loses track of the jobs, one can use the parasol recover command to generate a new jobs file with the jobs that have not completed.
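Because options are not accepted after the database name, one way to avoid ordering mistakes is to assemble the command line with a small wrapper. The sketch below only prints the command rather than running it. The function name gb_align_cmd is hypothetical, and repeating -initial on a continued run is an assumption, not something this page confirms.

```shell
# Print (do not run) a gbAlignStep command with all options placed
# before the database name. Hypothetical helper for illustration.
gb_align_cmd() {
    db=$1
    shift
    # "$*" joins the remaining arguments (the options) with spaces.
    echo "nice bin/gbAlignStep $* $db"
}

# First pass: align only the mRNAs.
gb_align_cmd hg15 -initial -type=mrna

# After correcting a failure, resume at the finish subtask
# (keeping -initial here is an assumption).
gb_align_cmd hg15 -initial -continue=finish
```

The printed lines can be reviewed and then pasted into a shell in $gbRoot.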

Initial Database Load

nice bin/gbDbLoadStep -drop -initialLoad $db
- The -drop option drops any existing GenBank or RefSeq tables before loading.
- If an initial load was done using only the mRNAs, it will most likely be much faster to drop all of the GenBank tables and load with the -initialLoad option when loading the ESTs.
After an initial review of the loaded alignments:
- Enable daily alignment of the databases by adding the databases to etc/align.dbs.
- Enable database update of hgwdev by adding the databases to etc/hgwdev.dbs.
- make etc-update

Realigning Tracks

It may be necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straightforward when a genome database is initially being built. It's more complex if one has to sync up multiple systems.

If automated alignment or update has been enabled for the database, disable it by editing $gbRoot/etc/align.dbs.
Make sure an automated alignment isn't currently running.
To trigger a realignment, one needs to remove the related files for some partition of the data for all updates. These live under either the genbank or refseq alignment directories, for example:
- data/aligned/genbank.139.0/hg16/
- data/aligned/refseq.139.0/hg16/
To realign native RefSeq mRNAs for hg16, one would remove:
- data/aligned/refseq.139.0/hg16/*/mrna.native.*
To realign xeno GenBank ESTs for hg16, one would remove:
- data/aligned/genbank.139.0/hg16/*/est.*.xeno.*
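Since removing the wrong files forces unnecessary realignment, it may be worth listing the matches before deleting anything. The following is a minimal sketch under the directory layout shown above. The function name list_realign_files is hypothetical, and it relies on the -mindepth and -maxdepth options available in GNU and BSD find.

```shell
# List alignment files matching a pattern in every update directory of
# one release and database. Hypothetical helper; it only lists files.
# Arguments: gbRoot, release directory (e.g. refseq.139.0),
#            database (e.g. hg16), file pattern (e.g. 'mrna.native.*').
list_realign_files() {
    gb_root=$1
    release=$2
    db=$3
    pattern=$4
    # Depth 2 below the database directory corresponds to <db>/*/<file>.
    find "$gb_root/data/aligned/$release/$db" \
        -mindepth 2 -maxdepth 2 -name "$pattern" | sort
}
```

For example, list_realign_files /hive/data/outside/genbank refseq.139.0 hg16 'mrna.native.*' prints the native RefSeq mRNA files for hg16. Once the list looks right, the same files can be removed.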
Do an initial alignment as described above, restricting with -srcDb and -type.
Reload the database with the partition of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported; use -drop and -initialLoad instead.
- nice bin/gbDbLoadStep -reload -srcDb=genbank -type=mrna $db