GenBank/RefSeq Data Database Update Step

This step is done for each browser database (species and assembly) on each database server. When building a new database, this process is run on the master database (hgwdev), and then the database is pushed to the round-robin servers. For incremental updates, the process is run on all servers.

GenBank Status Table

This table (gbStatus) keeps track of the current version of the GenBank data that is loaded in the database. While some of this information is redundant with the mrna table, this data is used only by the update. This table is updated last, so that it can be used to recover from failures during the update.

The columns of the table are:

GenBank Loaded Table

The update algorithm was designed to bring the database up to the latest information about a sequence, without regard to which release and update contains the data. However, checking every partition proved to be costly, requiring scans of the gbStatus and gbSeq tables. The gbLoaded table was created as an optimization. This table contains the releases, updates, and partitions that have been loaded. For a given partition, the updates containing gbidx or alidx files are compared to the gbLoaded table. If all of the updates are loaded, there is no need to do any more checking. This saves loading the alidx files and querying the gbStatus and gbSeq tables.
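The gbLoaded shortcut can be sketched as a simple set-membership test. This is a minimal illustration only, not the loader's actual code: the function name, the (release, update, partition) tuple layout, and the sample values are all assumptions made for the example.

```python
def partition_is_current(partition_updates, gb_loaded):
    """Return True if every update for a partition is already recorded.

    partition_updates: iterable of (release, update, partition) tuples for
        the updates that contain gbidx or alidx files (hypothetical layout).
    gb_loaded: set of (release, update, partition) tuples standing in for
        the rows of the gbLoaded table.
    """
    return all(entry in gb_loaded for entry in partition_updates)


# Stand-in for the gbLoaded table; the values are made up for illustration.
gb_loaded = {
    ("rel1", "full", "mrna"),
    ("rel1", "daily.1", "mrna"),
}

already_loaded = [("rel1", "full", "mrna"), ("rel1", "daily.1", "mrna")]
has_new_update = already_loaded + [("rel1", "daily.2", "mrna")]

# Nothing further to check: every update is in gbLoaded.
skip_partition = partition_is_current(already_loaded, gb_loaded)

# One update is missing, so the full gbStatus/gbSeq comparison is needed.
needs_check = not partition_is_current(has_new_update, gb_loaded)
```

Only when the test fails does the more expensive work (loading the alidx files and querying gbStatus and gbSeq) have to be done for that partition.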

The columns of the table are:

Algorithm

Table updates are done in the following steps. The process is designed so that the update can be restarted from any point after a crash, and so that stale data is not displayed if the update aborts. There is a window during which a sequence that has changed will not be in the database.

If the load process crashes, new sequences may be in the other tables, but not in the gbStatus table. To detect this, we check the new sequences against the gbSeq table. Any that are found there are orphaned sequences, which must be removed before loading. The sequence categories are: new, seqChanged, metaChanged, deleted, and orphaned.
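The orphan check can be sketched as follows. This is an illustration under stated assumptions, not the loader's implementation: the function name is hypothetical, the gbSeq table is represented by a plain set of accessions, and the accessions themselves are made up.

```python
def split_orphans(new_accessions, gb_seq_accessions):
    """Separate apparently new accessions into orphans and clean new ones.

    new_accessions: accessions that have no entry in gbStatus and so look
        new to the update.
    gb_seq_accessions: accessions currently present in the gbSeq table.

    An accession that looks new but already has a gbSeq row was left behind
    by an interrupted load, because gbStatus is written last.
    """
    new_accessions = set(new_accessions)
    orphans = new_accessions & set(gb_seq_accessions)
    clean = new_accessions - orphans
    return orphans, clean


# Made-up accessions: the second one was partially loaded before a crash.
orphans, clean = split_orphans(
    ["AA000001", "AA000002"],
    {"AA000002", "AA000099"},
)
# The orphans would be deleted from the other tables first, and then both
# groups would be loaded as new sequences.
```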

To minimize the memory required for the update, one partition of the data is loaded at a time. Partitions are the RefSeq mRNAs, the GenBank mRNAs, and the GenBank ESTs split on the first two letters of the accession.
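A partition key along these lines could be computed as shown below. This is a sketch only. It assumes that just the GenBank ESTs are subdivided by accession prefix, which is one reading of the description above; the function names, the "refseq"/"genbank" and "mrna"/"est" labels, and the sample accessions are all hypothetical.

```python
from collections import defaultdict


def partition_key(src_db, seq_type, accession):
    """Return the partition a sequence belongs to.

    Assumption: only GenBank ESTs are split further, using the first two
    letters of the accession; the mRNA sets are each a single partition.
    """
    if src_db == "genbank" and seq_type == "est":
        return (src_db, seq_type, accession[:2])
    return (src_db, seq_type, None)


def group_by_partition(records):
    """Group (src_db, seq_type, accession) records by partition key."""
    groups = defaultdict(list)
    for src_db, seq_type, accession in records:
        groups[partition_key(src_db, seq_type, accession)].append(accession)
    return dict(groups)


# Made-up records for illustration.
records = [
    ("refseq", "mrna", "NM_000001"),
    ("genbank", "mrna", "AB000001"),
    ("genbank", "est", "AA000001"),
    ("genbank", "est", "AA000002"),
    ("genbank", "est", "BG000001"),
]
partitions = group_by_partition(records)
```

Processing the groups one at a time keeps only a single partition's worth of sequence information in memory during the update.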