GenBank/RefSeq Genome Setup

This page describes the process of setting up the GenBank/RefSeq update process for a new genome or new assembly. It is assumed that the automated download and build are already in place in /hive/data/outside/genbank/.

Recent changes

  1. The working directory for initial alignments is now removed unless -keep is specified.
  2. gbBlat no longer needs to be modified to specify the ooc file. It is now specified in the genbank.conf file.
  3. Many changes to the genbank.conf file.

Initial Alignment

Building the initial alignments is a long process, involving gigabytes of disk space and over a day of cluster time. Do not be surprised if it does not go smoothly. Manual intervention may be required to correct problems.
  1. Make sure ssh is configured to not require a passphrase.
  2. In this document, $db refers to the database being aligned. Substitute the actual database name (e.g. hg15).
  3. If this is the first time this organism has been aligned, some source files need to be edited. The genbank update code is under kent/src/hg/makeDb/genbank/.
  4. Edit kent/src/hg/makeDb/genbank/etc/genbank.conf to configure this database. Must set: You may want to set some database load options to override the defaults. If you have a pseudo chromosome with unplaced sequence, be sure to specify $db.align.maxGapChrs and a lift file.

    Commit your changes and then go to the top of the genbank source tree and update the installed genbank etc files with:

               make etc-update
               
  5. ssh fileServer
    Where fileServer is the NFS server with /hive/data/outside/genbank/.
  6. cd /hive/data/outside/genbank
    This directory is $gbRoot.
  7. nice bin/gbAlignStep -initial $db&

    This will run the entire alignment process. The -initial option sets defaults for several parameters appropriate for an initial alignment and prevents this alignment from blocking the automatic daily alignments.

    Warning: gbAlignStep and the other GenBank programs do not currently accept options after the positional arguments (i.e. the databases).

    All output is saved in the log file.

    If your organism has xeno ESTs enabled, it's a good idea to start out by aligning and loading just the mRNAs, as this will go much faster. Two options control what is aligned:

    Note that since the alignment process only aligns what needs to be aligned, no option is required when doing the ESTs after an initial mRNA alignment.

    If anything fails, a subset of the tasks done by the gbAlignStep script can be rerun after correcting the problem. This is done using the -continue=subtask option, where subtask is either

    If the parasol alignment run fails, it can be continued using parasol directly, followed by a gbAlignStep run with -continue=finish. If parasol loses track of the jobs, one can use the parasol recover command to generate a new jobs file with the jobs that have not completed.
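
Two of the steps above are easy to get wrong: ssh must work without a passphrase prompt, and options must come before the database name. The following is a minimal sketch of helpers for both, not part of the GenBank tools. The function names check_ssh_noprompt and gb_align_cmd are hypothetical, and the only gbAlignStep option used is -initial, as given in the steps above.

```shell
# Sketch only: these helper names are hypothetical, not part of the GenBank code.

# Succeed only if ssh to the given host works without any prompt.
# BatchMode=yes makes ssh fail rather than ask for a passphrase or password.
check_ssh_noprompt() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Print a gbAlignStep command line with every option placed before the
# database name, since options after the positional arguments are not accepted.
# Usage: gb_align_cmd db [options...]
gb_align_cmd() {
    db="$1"
    shift
    # "$@" now holds only the options; the database is appended last.
    set -- nice bin/gbAlignStep "$@" "$db"
    echo "$*"
}

# Show the command that would be run for hg15.
gb_align_cmd hg15 -initial
```

Running check_ssh_noprompt fileServer before starting would catch a passphrase problem early, rather than partway through a day-long run. The command builder only prints the command line; it does not run anything.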

Initial Database Load

  1. nice bin/gbDbLoadStep -drop -initialLoad $db
  2. After an initial review of the loaded alignments:

Realigning Tracks

It may be necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straightforward when a genome database is initially being built. It's more complex if one has to sync up multiple systems.
  1. If automated alignment or update has been enabled for the database, disable it by editing $gbRoot/etc/align.dbs.
  2. Make sure an automated alignment isn't currently running.
  3. To trigger a realignment, one needs to remove the related files for some partition of the data for all updates. These live under either the genbank or refseq alignment directories, for example: To realign native RefSeq mRNAs for hg16, one would remove: To realign xeno GenBank ESTs for hg16, one would remove:
  4. Do an initial alignment as described above, restricting with -srcDb and -type.
  5. Reload the database with the partition of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported; use -drop and -initialLoad instead.
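
Step 1 above can be double-checked before any files are removed. The following is a rough sketch that assumes align.dbs lists one database name per line; that format is an assumption and should be verified against the real file. The function name db_listed is hypothetical, and the demonstration uses a throwaway file rather than $gbRoot/etc/align.dbs.

```shell
# Sketch only: assumes one database name per line in the file (unverified).

# Succeed if the database name appears as a complete line in the file.
# -F treats the name as a fixed string, -x requires a whole-line match,
# -q suppresses output so only the exit status is used.
db_listed() {
    grep -qxF -- "$2" "$1"
}

# Demonstration on a temporary file standing in for align.dbs.
tmp=$(mktemp)
printf 'hg16\nmm4\n' > "$tmp"
if db_listed "$tmp" hg16; then
    echo "hg16 is still enabled; disable it before realigning"
fi
rm -f "$tmp"
```

The whole-line match keeps a name such as hg1 from matching hg16. If the real file carries comments or extra columns, the match would need to be adjusted accordingly.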