hgwdev
), and then the database
pushed to the round-robin servers. For incremental updates, the process is
run on all servers.
gbStatus
) is used to keep track of the current
version of the GenBank data that is loaded in the database. While some of
this information is redundant with the mrna table, this data is
used only by the update. This table is updated last, so that it can be
used to recover from failures during update.
The columns of the table are:
acc - GenBank accession
version - GenBank version number
modDate - last modified date
type - the type of the entry: EST or mRNA
srcDb - source database: GenBank or RefSeq
gbSeq - id in gbSeq table
numAligns - number of alignments of the accession in the appropriate track
seqRelease - release version where the sequence was obtained
seqUpdate - update where the sequence was obtained (date or full)
metaRelease - release version where the metadata was obtained
metaUpdate - update where the metadata was obtained (date or full)
extRelease - release version containing the external file
extUpdate - update containing the external file (date or full)
time - time that this entry was inserted

This algorithm was designed to update the database to the latest
information about a sequence, without regard to which release and update
contains the data. However, checking every partition proved to be costly,
requiring scanning the gbStatus and seq tables. The gbLoaded
table was created as an optimization. This table contains the releases,
updates, and partitions that have been loaded. For a given partition, the
updates containing gbidx or alidx files are
compared to the gbLoaded table. If all of the updates are
loaded, there is no need to do any more checking. This saves loading the
alidx files and querying the gbStatus and gbSeq tables.
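The gbLoaded check described above amounts to a set-membership test. The following is a minimal sketch, not the loader's actual code: the function name `partition_needs_check`, the tuple layout, and the sample release and update strings are assumptions made for illustration; only the column names come from the text.

```python
from typing import Iterable, NamedTuple, Set


class LoadedKey(NamedTuple):
    """One row of a gbLoaded-style table (column names follow the text)."""
    srcDb: str        # "GenBank" or "RefSeq"
    type: str         # "EST" or "mRNA"
    loadRelease: str  # release version
    loadUpdate: str   # update date or "full"
    accPrefix: str    # accession prefix for ESTs, "" otherwise


def partition_needs_check(available: Iterable[LoadedKey],
                          loaded: Set[LoadedKey]) -> bool:
    """Return True if any available update is missing from gbLoaded."""
    return any(key not in loaded for key in available)


# Hypothetical sample data: the full release is loaded, one daily update is not.
loaded = {LoadedKey("GenBank", "mRNA", "130.0", "full", "")}
available = [
    LoadedKey("GenBank", "mRNA", "130.0", "full", ""),
    LoadedKey("GenBank", "mRNA", "130.0", "2002.07.01", ""),
]
print(partition_needs_check(available, loaded))  # prints True
```

Only when this test returns true would the more expensive comparison against the gbStatus and gbSeq tables be needed for that partition.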
The columns of the table are:
srcDb - source database: GenBank or RefSeq
type - the type of the entry: EST or mRNA
loadRelease - release version
loadUpdate - update date or full
accPrefix - accession prefix for ESTs
time - time entry was added

If the load process crashes, new sequences may be in the other tables,
but not in the gbStatus table. To detect this, we check the
new sequences against the gbSeq table. These are orphaned
sequences that must first be removed before loading.

new, seqChanged, metaChanged, deleted, orphaned
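The orphan check can be pictured as an intersection between the accessions classified as new and the accessions already present in gbSeq. This is a minimal sketch under that assumption; `find_orphans` and the sample accessions are hypothetical names and data, not part of the loader.

```python
from typing import Iterable, List, Set


def find_orphans(new_accs: Iterable[str], gbseq_accs: Set[str]) -> List[str]:
    """Return accessions that look new (absent from gbStatus) but
    already have a gbSeq row, i.e. leftovers from a failed load."""
    return sorted(acc for acc in new_accs if acc in gbseq_accs)


# Hypothetical sample data
new_accs = ["AB000001", "AB000002", "AB000003"]
gbseq_accs = {"AB000002", "AB000099"}
print(find_orphans(new_accs, gbseq_accs))  # prints ['AB000002']
```

Any accession returned here would have its rows removed before the normal load of new entries proceeds.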
To minimize the memory required for the update, one partition of the data is loaded at a time. The partitions are the RefSeq mRNAs, the GenBank mRNAs, and the GenBank ESTs split on the first two letters of the accession.
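The partitioning rule can be sketched as a function that maps each entry to a partition key. This is an illustration only: `partition_key`, `group_by_partition`, the record layout, and the sample accessions are assumptions, not names from the loader.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (srcDb, type, accession)
Key = Tuple[str, str, str]     # (srcDb, type, accPrefix)


def partition_key(src_db: str, seq_type: str, acc: str) -> Key:
    """Map an entry to its partition; only GenBank ESTs are split by prefix."""
    if src_db == "GenBank" and seq_type == "EST":
        return (src_db, seq_type, acc[:2])
    return (src_db, seq_type, "")


def group_by_partition(records: Iterable[Record]) -> Dict[Key, List[str]]:
    """Group accessions so that one partition can be processed at a time."""
    parts: Dict[Key, List[str]] = defaultdict(list)
    for src_db, seq_type, acc in records:
        parts[partition_key(src_db, seq_type, acc)].append(acc)
    return dict(parts)


# Hypothetical sample data
records = [
    ("RefSeq", "mRNA", "NM_000014"),
    ("GenBank", "mRNA", "AB000001"),
    ("GenBank", "EST", "BE000001"),
    ("GenBank", "EST", "BE000002"),
    ("GenBank", "EST", "AA000001"),
]
for key, accs in group_by_partition(records).items():
    print(key, accs)
```

Splitting only the ESTs reflects the text: the EST set is the one large enough to need subdividing, so the mRNA partitions carry an empty prefix here.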
Compare the gbidx files with the gbLoaded table, skipping the partition
if there are no missing updates.
etc/ignore.idx from all relevant tables. Doing this upfront prevents a lot of complexity in other code.
Compare the gbStatus table to the contents of the processed/ and aligned/ directories (gbIndex), classifying each accession as:
gbStatus.
Check the gbSeq table to see if it contains any of the new entries, which become orphaned. These are sequences from a failed load.
mrna table.
refSeqStatus and refLink tables.
gbSeq table.
gbSeq table. Add new accessions to the gbSeq table. This must be the first table update so that orphans can be detected.
gbSeq table. Add new accessions to the gbSeq table.
refSeqStatus and refLink tables and add new accessions to these tables.
author, library, etc.). No attempt is made to remove entries that will no longer be referenced.
gbSeq table. Add new accessions to the gbSeq table.
mrna table. Add new accessions to the mrna table.
gbSeq and gbExtFile tables for all relChanged entries.
gbStatus table. Add new accessions to the table.
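The classification step in the list above can be sketched as a comparison of two accession-keyed maps, one built from gbStatus and one from the index of the processed/ and aligned/ directories. The rules used here are assumptions for illustration, since the text does not state them: a version difference is treated as seqChanged and a modDate difference as metaChanged. `Entry`, `classify`, and the sample data are hypothetical.

```python
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    version: int  # GenBank version number
    modDate: str  # last modified date


def classify(status: Dict[str, Entry], index: Dict[str, Entry]) -> Dict[str, str]:
    """Return a state for each accession that differs between gbStatus
    and the index; unchanged accessions are omitted."""
    states: Dict[str, str] = {}
    for acc, idx in index.items():
        cur = status.get(acc)
        if cur is None:
            states[acc] = "new"
        elif idx.version != cur.version:   # assumed rule
            states[acc] = "seqChanged"
        elif idx.modDate != cur.modDate:   # assumed rule
            states[acc] = "metaChanged"
    for acc in status:
        if acc not in index:
            states[acc] = "deleted"
    return states


# Hypothetical sample data
status = {
    "AB000001": Entry(1, "2002-01-01"),
    "AB000002": Entry(1, "2002-01-01"),
    "AB000003": Entry(1, "2002-01-01"),
    "AB000004": Entry(1, "2002-01-01"),
}
index = {
    "AB000001": Entry(1, "2002-01-01"),  # unchanged
    "AB000002": Entry(2, "2002-03-01"),  # new version
    "AB000003": Entry(1, "2002-03-01"),  # same version, newer modDate
    "AB000005": Entry(1, "2002-03-01"),  # not yet in gbStatus
}
print(classify(status, index))
```

Because gbStatus is updated last, an accession reported as new by this comparison may already have rows in other tables after a crash, which is why the orphan check against gbSeq follows the classification.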