GenBank/RefSeq Update Deployment

This page describes how the GenBank/RefSeq update process is deployed.

This is a proposed setup, not currently implemented. The following system setup is required:

Create a user genbank on the cluster, the round-robin servers, and the GBDB server (hgnfs1). Enable sudo to genbank for markd.
There is currently sufficient disk space on /cluster/store5/ for GenBank files and alignments for human, mouse and rat. However, disk space should be monitored and may need to be increased.
Set up an rsync server on eieio, accessible from the GBDB server.
Set up the /somewhere/genbank/ directory on the GBDB server, owned by genbank, preferably on the same file system as /gbdb/ (but not under /gbdb/). NFS-export it and mount it on the round-robin servers as /genbank/. It should also be available as /genbank/ on the GBDB server.

Download/Processing/Alignment (build)

These three steps are collectively known as the build phase.
The GenBank root directory is currently at:
/cluster/store5/genbank/
Estimates of disk space requirements:
- download/ - 50-75 GB, depending on how many previous releases are maintained. Once a new release is downloaded and processed (quarterly), old downloaded files can be archived.
- processed/ - 25-50 GB. Processed files must be maintained as long as some database is using sequences from them.
- aligned/ - ~3 GB per release per genome assembly.
- Cluster-accessible temporary work space - ~2 GB.
Note that these replace data currently kept in other locations; however, the downloads now include the HTG sequences, which add several gigabytes of data.
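As a rough check on these figures, the estimates can be summed for a given number of releases and assemblies. The following Python sketch is illustrative only; the function name and the release and assembly counts in the example are assumptions, not part of the proposed setup.

```python
# Rough disk budget from the per-directory estimates above (all sizes in GB).
DOWNLOAD_GB = (50, 75)     # download/ (low, high)
PROCESSED_GB = (25, 50)    # processed/ (low, high)
ALIGNED_GB = 3             # aligned/, per release per genome assembly
WORK_GB = 2                # cluster-accessible temporary work space


def estimate_disk_gb(num_releases, num_assemblies):
    """Return a (low, high) estimate of total disk use in GB."""
    aligned = ALIGNED_GB * num_releases * num_assemblies
    low = DOWNLOAD_GB[0] + PROCESSED_GB[0] + aligned + WORK_GB
    high = DOWNLOAD_GB[1] + PROCESSED_GB[1] + aligned + WORK_GB
    return low, high


if __name__ == "__main__":
    # Example counts are assumptions: 2 releases kept, 3 assemblies aligned.
    low, high = estimate_disk_gb(num_releases=2, num_assemblies=3)
    print(f"estimated total: {low}-{high} GB")
```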
The download, processing, and alignment steps run on the GenBank build server, which should have the following attributes:
- Should have the GenBank root directory on a local file system.
- Should have at least two CPUs.
- Must be able to rsh to kkr1u00 and kk.
kkstore is probably the best candidate.
A dedicated user, genbank, allows multiple people to manage the jobs.
A cron job will start the process daily at 1am.

Round-Robin Database Update

In order to update the databases on the round-robin servers, each server must have access to the processed/ and aligned/ directories. FASTA files under the processed/ directory must be copied into the /gbdb/genbank/ directory. Since these directories are large, they are maintained on the GBDB server for access by the round-robin servers.
- The GBDB server exports a /genbank/ directory to the round-robin servers, which contains the processed/ and aligned/ directories.
- If possible, the /gbdb/ and /genbank/ directories should be on the same physical file system on the GBDB server. This way, the FASTA files under the /gbdb/ directory can be hard links to the ones under the processed/ directory, saving significant disk space. If this is not possible, the FASTA files will be copied.
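The hard-link-or-copy behavior described above could look roughly like the following Python sketch. The function name link_or_copy is hypothetical, and the actual update tools may handle this differently. It tries a hard link first and falls back to a copy only when the two paths are on different file systems.

```python
import errno
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst, or copy it if they are on different file systems.

    Returns "link" or "copy" to report which was done.
    """
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    if os.path.exists(dst):
        os.remove(dst)  # replace any earlier link or copy
    try:
        os.link(src, dst)
        return "link"
    except OSError as err:
        # EXDEV means src and dst are on different file systems.
        if err.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)
        return "copy"
```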
A process running on the GBDB server must be able to rsync files from the GenBank root on the cluster.
A cron job on the GBDB server polls (with rsync) the GenBank build server to determine whether new alignments are ready. When they are, it performs the following steps:
- Copy new processed/ and aligned/ files to the /genbank/ hierarchy in two passes: one to get the data files, and a second to get the index files.
- Update the /gbdb/genbank/ hierarchy with the new FASTA files. If /genbank/ and /gbdb/ are on the same file system, these will be hard links.
- Flag copy as complete.
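The two-pass copy and completion flag might be sketched as follows in Python. This is only an outline under stated assumptions: the index file pattern (*.idx), the flag file name (copy.done), and the function name are hypothetical placeholders, and the run parameter exists only so the rsync calls can be replaced when testing. The rsync options used (-a and --exclude) are standard.

```python
import os
import subprocess


def two_pass_copy(src, dst, index_pattern="*.idx", flag_name="copy.done",
                  run=subprocess.check_call):
    """Copy src to dst in two rsync passes, then flag the copy as complete.

    src is an rsync source, dst an existing local directory.
    index_pattern and flag_name are hypothetical placeholders.
    """
    flag = os.path.join(dst, flag_name)
    # Clear any flag left by an earlier copy so readers never see a
    # "complete" marker while new files are still arriving.
    if os.path.exists(flag):
        os.remove(flag)
    # Pass 1: data files only (everything except index files).
    run(["rsync", "-a", "--exclude=" + index_pattern, src, dst])
    # Pass 2: the remaining index files; rsync skips files already copied.
    run(["rsync", "-a", src, dst])
    # Reached only if both passes succeeded (check_call raises on failure).
    with open(flag, "w") as fh:
        fh.write("done\n")
    return flag
```

A call might look like two_pass_copy("buildhost::genbank/", "/genbank/"), where the rsync host and module name are placeholders for whatever the rsync server actually exports.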
Each round-robin server periodically examines the /genbank/ directory to see if a copy has completed. When one has, it performs the following steps:
- Run the database update step to update the tables.
- Run gbSanity to verify the update.
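The check on each round-robin server could be structured along these lines. This Python sketch assumes the completed copy is signaled by a flag file and that each server records which copy it last loaded in a local stamp file; both file conventions and the function names are hypothetical. The update and gbSanity steps are passed in as callables, since their command lines are not specified here.

```python
import os


def last_loaded(stamp_path):
    """Return the flag timestamp (ns) recorded at the last update, or -1."""
    try:
        with open(stamp_path) as fh:
            return int(fh.read().strip())
    except FileNotFoundError:
        return -1


def update_if_copy_complete(flag_path, stamp_path, run_update, run_sanity):
    """Run the update and verification if a new completed copy is present.

    Returns True if an update ran, False otherwise.
    """
    if not os.path.exists(flag_path):
        return False  # no completed copy yet
    flag_ns = os.stat(flag_path).st_mtime_ns
    if flag_ns <= last_loaded(stamp_path):
        return False  # this copy has already been loaded
    run_update()      # database update step
    run_sanity()      # verification, e.g. by invoking gbSanity
    # Record the copy only after both steps finish without raising.
    with open(stamp_path, "w") as fh:
        fh.write(f"{flag_ns}\n")
    return True
```

Because the stamp is written last, a failed update or verification is retried on the next poll.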