GenBank/RefSeq Update Deployment
This page describes how the GenBank/RefSeq update process is deployed.
This is a proposed setup, not currently implemented. The following
system setup is required:
- Create a user genbank on the cluster, the round-robin servers, and
the GBDB server (hgnfs1). Enable sudo to genbank for markd.
- There is currently sufficient disk space on /cluster/store5/ for
GenBank files and alignments for human, mouse and rat. However, disk
space should be monitored and may need to be increased.
- Set up an rsync server on eieio, accessible from the GBDB server.
- Set up the /somewhere/genbank/ directory on the GBDB server, owned by
genbank, preferably on the same filesystem as /gbdb/ (but not under
/gbdb/). NFS export it and mount it on the round-robin servers as
/genbank/. It should also be available as /genbank/ on the GBDB server.
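
Whether two directories share a filesystem can be checked by comparing their device numbers. The following is a minimal sketch, not part of the proposed setup: the helper name same_filesystem is hypothetical, and the paths in the example are the mount points named above. Matching device numbers are a necessary condition for hard links between the two trees, though some mount configurations can still refuse the link.

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device number (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # /gbdb and /genbank are the mount points named in the setup list;
    # this check is meant to be run on the GBDB server.
    paths = ("/gbdb", "/genbank")
    missing = [p for p in paths if not os.path.isdir(p)]
    if missing:
        print("not present on this host:", ", ".join(missing))
    else:
        print("same filesystem:", same_filesystem(*paths))
```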
Download/Processing/Alignment (build)
- These three steps are collectively known as the build phase.
- The GenBank root directory is currently at /cluster/store5/genbank/
- Estimates of disk space requirements:
  - download/ - 50-75 GB, depending on how many previous releases are
    maintained. Once a new release is downloaded and processed
    (quarterly), old downloaded files can be archived.
  - processed/ - 25-50 GB. Processed files must be maintained as long
    as some database is using sequences from them.
  - aligned/ - ~3 GB per release per genome assembly.
  - Cluster-accessible temporary work space - ~2 GB.
  Note that these replace data currently kept in other locations;
  however, the downloads now include the HTG sequences, which add
  several gigabytes of data.
- The download, processing, and alignment steps run on the GenBank
  build server, which should have the following attributes:
  - Should have the GenBank root directories as a local filesystem.
  - Should have at least two CPUs.
  - Must be able to rsh to kkr1u00 and kk.
  kkstore is probably the best candidate.
- A dedicated user, genbank, allows multiple people to manage the jobs.
- A cron job will start the process daily at 1am.
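
The disk space estimates listed above can be combined into a rough total for planning. This is only an illustrative sketch: the function name and its two parameters (number of releases kept in aligned/ and number of genome assemblies) are assumptions, and the figures are the ranges quoted in the list, not measurements.

```python
# Estimated sizes in gigabytes, taken from the list above.
DOWNLOAD_GB = (50, 75)        # download/ (low, high)
PROCESSED_GB = (25, 50)       # processed/ (low, high)
ALIGNED_GB_PER_UNIT = 3       # aligned/, per release per genome assembly
WORK_SPACE_GB = 2             # cluster-accessible temporary work space


def estimate_disk_gb(releases, assemblies):
    """Return a (low, high) estimate in GB for the GenBank root.

    releases and assemblies are hypothetical planning inputs: how many
    releases are kept under aligned/ and for how many genome assemblies.
    """
    aligned = ALIGNED_GB_PER_UNIT * releases * assemblies
    low = DOWNLOAD_GB[0] + PROCESSED_GB[0] + aligned + WORK_SPACE_GB
    high = DOWNLOAD_GB[1] + PROCESSED_GB[1] + aligned + WORK_SPACE_GB
    return low, high


if __name__ == "__main__":
    # Assumed scenario: two releases kept, one assembly each for
    # human, mouse and rat.
    low, high = estimate_disk_gb(releases=2, assemblies=3)
    print(f"estimated disk usage: {low}-{high} GB")
```

Since aligned/ grows with both the number of releases and the number of assemblies, it is the term most worth watching when monitoring disk space.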
Round-Robin Database Update
- In order to update the databases on the round-robin servers, each
  server must have access to the processed/ and aligned/ directories.
  FASTA files under the processed/ directory must be copied into the
  /gbdb/genbank/ directory. Since these directories are large, they
  are maintained on the GBDB server for access by the round-robin
  servers.
  - The GBDB server exports a /genbank/ directory to the round-robin
    servers, which contains the processed/ and aligned/ directories.
  - If possible, the /gbdb/ and /genbank/ directories should be on the
    same physical file system on the GBDB server. This way, the FASTA
    files under the /gbdb/ directory can be hard links to the ones
    under the processed/ directory, saving significant disk space. If
    this is not possible, the FASTA files will be copied.
  - A process running on the GBDB server must be able to rsync files
    from the GenBank root on the cluster.
- A cron job on the GBDB server polls (with rsync) the GenBank build
  server to determine if new alignments are ready.
  - Copy new processed/ and aligned/ files to the /genbank/ hierarchy
    in two passes: one to get the data files, and a second to get the
    index files.
  - Update the /gbdb/genbank/ hierarchy with the new FASTA files. If
    /genbank/ and /gbdb/ are on the same file system, these will be
    hard links.
  - Flag the copy as complete.
- Each round-robin server periodically examines the /genbank/
  directory to see if a copy has completed.
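
The link-or-copy update of /gbdb/genbank/ and the completion flag described above could be sketched as follows. This is an illustration under stated assumptions, not the implemented mechanism: the flag file name copy.done, the .fa suffix used to recognize FASTA files, the stamp string, and all function names are hypothetical. The rsync transfer itself is left out.

```python
import os
import shutil

FLAG_NAME = "copy.done"  # hypothetical flag file name under /genbank/


def link_or_copy(src, dst):
    """Hard-link src to dst, falling back to a copy if linking fails
    (for example, when the two paths are on different file systems).
    Returns "link" or "copy"."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if os.path.exists(dst):
        os.remove(dst)
    try:
        os.link(src, dst)
        return "link"
    except OSError:
        shutil.copy2(src, dst)
        return "copy"


def publish_fasta(processed_dir, gbdb_dir, suffix=".fa"):
    """Mirror FASTA files from processed_dir into gbdb_dir, keeping the
    relative layout. The suffix is an assumption about file naming."""
    results = {}
    for root, _dirs, files in os.walk(processed_dir):
        for name in files:
            if name.endswith(suffix):
                src = os.path.join(root, name)
                rel = os.path.relpath(src, processed_dir)
                results[rel] = link_or_copy(src, os.path.join(gbdb_dir, rel))
    return results


def flag_complete(genbank_dir, stamp):
    """Write the flag via a temporary file and a rename, so a reader
    never sees a partially written flag."""
    tmp = os.path.join(genbank_dir, FLAG_NAME + ".tmp")
    with open(tmp, "w") as fh:
        fh.write(stamp + "\n")
    os.replace(tmp, os.path.join(genbank_dir, FLAG_NAME))


def new_copy_available(genbank_dir, last_loaded):
    """Round-robin side: return the flag's stamp if it differs from the
    stamp this server last loaded, otherwise None."""
    flag = os.path.join(genbank_dir, FLAG_NAME)
    if not os.path.exists(flag):
        return None
    with open(flag) as fh:
        stamp = fh.read().strip()
    return stamp if stamp != last_loaded else None
```

On the GBDB server, the cron job would call publish_fasta and then flag_complete after both rsync passes finish. Each round-robin server would call new_copy_available with the stamp it last loaded and start a database update only when a new stamp is returned.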