hgwdev), and then the database is
pushed to the round-robin servers. For incremental updates, the process is
run on all servers.
gbStatus) is used to keep track of the current
version of the GenBank data that is loaded in the database. While some of
this information is redundant with the mrna table, this data is
used only by the update. This table is updated last, so that it can be
used to recover from failures during an update.
The columns of the table are:

acc - GenBank accession
version - GenBank version number
modDate - last modified date
type - the type of the entry: EST or mRNA
srcDb - source database: GenBank or RefSeq
gbSeq - id in gbSeq table
numAligns - number of alignments of the accession in the appropriate track
seqRelease - release version where the sequence was obtained
seqUpdate - update where the sequence was obtained (date or full)
metaRelease - release version where the metadata was obtained
metaUpdate - update where the metadata was obtained (date or full)
extRelease - release version containing the external file
extUpdate - update containing the external file (date or full)
time - time that this entry was inserted

This algorithm was designed to update the database to the latest
information about a sequence, without regard to which release and update
contains the data. However, checking every partition proved to be costly,
requiring scanning the gbStatus and seq tables. The gbLoaded
table was created as an optimization. This table contains the releases,
updates and partitions that have been loaded. For a given partition, the
updates containing gbidx or alidx files are
compared to the gbLoaded table. If all of the updates are
loaded, there is no need to do any more checking. This saves loading the
alidx files and querying the gbStatus and
gbSeq tables.
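
The shortcut described above amounts to a set comparison. The following is a minimal sketch, not the loader's actual code: `UpdateKey` and `partition_needs_check` are hypothetical names, and the release and update values are made up for illustration.

```python
from typing import NamedTuple, Set


class UpdateKey(NamedTuple):
    """One update of one release (hypothetical structure for this sketch)."""
    release: str  # release version (illustrative value below)
    update: str   # update date, or "full"


def partition_needs_check(available: Set[UpdateKey], loaded: Set[UpdateKey]) -> bool:
    """Return True if any update that has index files is not recorded as loaded."""
    return bool(available - loaded)


# Updates for which index files exist on disk (made-up values).
available = {UpdateKey("130.0", "full"), UpdateKey("130.0", "2002.07.01")}
# Updates already recorded as loaded (made-up values).
loaded = {UpdateKey("130.0", "full")}
```

Only when `partition_needs_check` returns true would the more expensive per-accession comparison against the status and sequence tables be needed.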
The columns of the table are:

srcDb - source database: GenBank or RefSeq
type - the type of the entry: EST or mRNA
loadRelease - release version
loadUpdate - update date or full
accPrefix - accession prefix for ESTs
time - time entry was added

If the load process crashes, new sequences may be in the other tables,
but not in the gbStatus table. To detect this, we check the
new sequences against the gbSeq table. These are orphaned
sequences that must first be removed before loading.

Accession states: new, seqChanged, metaChanged, deleted, orphaned
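
The orphan check can be pictured as a set intersection. This is a sketch under the assumption that the accessions found in the gbSeq table are available as an in-memory set; `find_orphans` is a hypothetical helper, not a function from the loader.

```python
from typing import Iterable, Set


def find_orphans(new_accs: Iterable[str], gb_seq_accs: Set[str]) -> Set[str]:
    """Return accessions that look new (absent from gbStatus) yet already
    have a row in gbSeq, which indicates a previously failed load."""
    return {acc for acc in new_accs if acc in gb_seq_accs}
```

Any accession returned here would have its stale rows removed before being loaded again as a new entry.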
To minimize the memory required for the update, one partition of the data is loaded at a time. The partitions are the RefSeq mRNAs, the GenBank mRNAs, and the GenBank ESTs split on the first two letters of the accession.
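
As an illustration of the partitioning rule, the sketch below derives a partition key from the source database, entry type, and accession. The function names and the tuple layout are assumptions made for this example; the accessions in the usage note are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (srcDb, type, accPrefix); accPrefix is None except for GenBank ESTs.
Partition = Tuple[str, str, Optional[str]]


def partition_key(src_db: str, seq_type: str, acc: str) -> Partition:
    """GenBank ESTs are split on the first two letters of the accession;
    RefSeq mRNAs and GenBank mRNAs each form a single partition."""
    if src_db == "GenBank" and seq_type == "EST":
        return (src_db, seq_type, acc[:2])
    return (src_db, seq_type, None)


def group_by_partition(
    entries: Iterable[Tuple[str, str, str]]
) -> Dict[Partition, List[str]]:
    """Group (srcDb, type, acc) triples by partition key."""
    groups: Dict[Partition, List[str]] = defaultdict(list)
    for src_db, seq_type, acc in entries:
        groups[partition_key(src_db, seq_type, acc)].append(acc)
    return dict(groups)
```

For example, `partition_key("GenBank", "EST", "BG000001")` gives `("GenBank", "EST", "BG")`, while every RefSeq mRNA maps to `("RefSeq", "mRNA", None)`. Grouping everything in memory is only for demonstration; the point of partitioning is that each group can be processed on its own.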
gbidx files with the gbLoaded, skipping the partition if there are no missing updates.
etc/ignore.idx from all relevant tables. Doing this upfront prevents a lot of complexity in other code.
gbStatus table to contents of the processed/ and aligned/ directories (gbIndex), classifying each accession as:
gbStatus.
gbSeq table to see if it contains any of the new entries, which become orphaned. These are sequences from a failed load.
mrna table.
refSeqStatus and refLink tables.
gbSeq table.
gbSeq table. Add new accessions to the gbSeq table. This must be the first table updated so that orphans can be detected.
gbSeq table. Add new accessions to the gbSeq table.
refSeqStatus and refLink tables and add new accessions to these tables.
author, library, etc). No attempt is made to remove entries that will no longer be referenced.
gbSeq table. Add new accessions to the gbSeq table.
mrna table. Add new accessions to the mrna table.
gbSeq and gbExtFile tables for all relChanged entries.
gbStatus table. Add new accessions to the table.
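
The classification step that compares the gbStatus table with the on-disk index can be sketched as follows. This is an illustration only. It assumes that a differing version marks an entry as seqChanged and that a differing modification date marks it as metaChanged; the text does not spell out these rules. The `Entry` class, the `unchanged` label, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    acc: str
    version: int
    mod_date: str  # date kept as a string, compared only for equality


def classify(index_entry: Optional[Entry], status_entry: Optional[Entry]) -> str:
    """Classify one accession from its index entry and its gbStatus row."""
    if index_entry is None and status_entry is None:
        raise ValueError("accession is in neither the index nor gbStatus")
    if status_entry is None:
        return "new"          # in the index but not yet in gbStatus
    if index_entry is None:
        return "deleted"      # in gbStatus but gone from the index
    if index_entry.version != status_entry.version:
        return "seqChanged"   # assumption: version bump means new sequence
    if index_entry.mod_date != status_entry.mod_date:
        return "metaChanged"  # assumption: same version, newer metadata
    return "unchanged"        # hypothetical label, not a state from the text


def classify_partition(
    index: Dict[str, Entry], status: Dict[str, Entry]
) -> Dict[str, str]:
    """Classify every accession seen in either the index or gbStatus."""
    return {
        acc: classify(index.get(acc), status.get(acc))
        for acc in index.keys() | status.keys()
    }
```

Entries classified as new would then be checked against the gbSeq table, and any that are already present there would be reclassified as orphaned before the table updates begin.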