GenBank/RefSeq Data Database Update Step

This step is done for each browser database (species and assembly) on each database server. When building a new database, this process is run on the master database (hgwdev), and then the database is pushed to the round-robin servers. For incremental updates, the process is run on all servers.

GenBank Status Table

This table (gbStatus) is used to keep track of the current version of the GenBank data that is loaded in the database. While some of this information is redundant with the mrna table, this data is used only by the update process. This table is updated last, so that it can be used to recover from failures during an update.

The columns of the table are:

acc - GenBank accession
version - GenBank version number.
modDate - last modified date.
type - the type of the entry: EST or mRNA
srcDb - source database: GenBank or RefSeq
gbSeq - id in gbSeq table.
numAligns - number of alignments of the accession in the appropriate track
seqRelease - release version where the sequence was obtained
seqUpdate - update where the sequence was obtained (date or full)
metaRelease - release version where the metadata was obtained
metaUpdate - update where the metadata was obtained (date or full)
extRelease - release version containing the external file
extUpdate - update containing the external file (date or full)
time - time that this entry was inserted

GenBank Loaded Table

This algorithm was designed to update the database to the latest information about a sequence, without regard to which release and update contains the data. However, checking every partition proved to be costly, requiring scanning the gbStatus and seq tables. The gbLoaded table was created as an optimization. This table contains the releases, updates, and partitions that have been loaded. For a given partition, the updates containing gbidx or alidx files are compared to the gbLoaded table. If all of the updates are loaded, there is no need to do any more checking. This saves loading the alidx files and querying the gbStatus and gbSeq tables.

The columns of the table are:

srcDb - source database: GenBank or RefSeq
type - the type of the entry: EST or mRNA
loadRelease - release version
loadUpdate - update date or full
accPrefix - accession prefix for ESTs
time - time entry was added
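
The gbLoaded check described above amounts to a subset test. The following is a minimal sketch in Python, not the actual loader code: the names LoadedKey and partition_needs_check are hypothetical, and it assumes that the updates available for a partition and the rows of gbLoaded have both already been read into sets.

```python
from typing import NamedTuple


class LoadedKey(NamedTuple):
    """One gbLoaded row without the time column (hypothetical type)."""
    src_db: str        # source database: "GenBank" or "RefSeq"
    entry_type: str    # "EST" or "mRNA"
    load_release: str  # release version
    load_update: str   # update date, or "full"
    acc_prefix: str    # accession prefix for ESTs; "" otherwise (assumption)


def partition_needs_check(available: set[LoadedKey],
                          loaded: set[LoadedKey]) -> bool:
    """Return True if some available update is not recorded in gbLoaded.

    When every available update is already loaded, the partition can be
    skipped without reading index files or querying gbStatus and gbSeq.
    """
    return not available <= loaded
```

A partition is skipped only when partition_needs_check returns False; any missing update forces the full comparison.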

Algorithm

Table updates are done in the following steps. This is designed to allow restarting the update process from any point after a crash. It prevents display of stale data if the update process aborts. There is a window during which a sequence that has changed will not be in the database.

If the load process crashes, new sequences may be in the other tables, but not in the gbStatus table. To detect this, we check the new sequences against the gbSeq table. Any that are found there are orphaned sequences, which must be removed before loading. The accession states handled are: new, seqChanged, metaChanged, deleted, and orphaned.
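
Orphan detection can be sketched as a set intersection. This is an illustration only: find_orphans is a hypothetical name, and it assumes the accessions classified as new and the accessions present in gbSeq are available as sets of strings.

```python
def find_orphans(new_accs: set[str], gb_seq_accs: set[str]) -> set[str]:
    """Return accessions that look new but already have a gbSeq row.

    An accession is "new" when it is absent from gbStatus. Because
    gbStatus is updated last, a new accession that is nevertheless in
    gbSeq must be left over from a load that failed part way through.
    """
    return new_accs & gb_seq_accs
```

The orphans are removed from the other tables first and are then loaded again as ordinary new accessions.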

To minimize the memory required for the update, one partition of the data is loaded at a time. The partitions are the RefSeq mRNAs, the GenBank mRNAs, and the GenBank ESTs split on the first two letters of the accession.
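
The partitioning rule can be sketched as follows. The function names and the (srcDb, type, accession) tuple layout are assumptions made for illustration; only the rule itself (ESTs split on the first two letters, mRNAs kept whole) comes from the description above.

```python
from collections import defaultdict


def partition_key(src_db: str, entry_type: str, acc: str) -> tuple[str, str, str]:
    """Return the (srcDb, type, accPrefix) partition for one accession."""
    if src_db == "GenBank" and entry_type == "EST":
        # GenBank ESTs are split on the first two letters of the accession.
        return (src_db, entry_type, acc[:2])
    # RefSeq mRNAs and GenBank mRNAs each form a single partition.
    return (src_db, entry_type, "")


def group_by_partition(entries):
    """Group (src_db, entry_type, acc) tuples by partition key."""
    parts = defaultdict(list)
    for src_db, entry_type, acc in entries:
        parts[partition_key(src_db, entry_type, acc)].append(acc)
    return dict(parts)
```

Processing one key of the resulting mapping at a time keeps only that partition's accessions in memory.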

For each partition of the GenBank and RefSeq data:
- Determine if this partition could have any data to load by comparing the processed gbidx files with the gbLoaded table, skipping the partition if there are no missing updates.
- Delete any accession in etc/ignore.idx from all relevant tables. Doing this up front prevents a lot of complexity in other code.
- Compare accession versions stored in the gbStatus table to the contents of the processed/ and aligned/ directories (gbIndex), classifying each accession as:
  - new - Accession is not in gbStatus.
  - seqChanged - The sequence and metadata changed.
  - metaChanged - The metadata changed.
  - extChanged - The release containing the external sequence files has changed and the entry has not changed. This is used to migrate fasta file references to the latest release, to allow cleanup of older releases.
  - deleted - The accession is not in the gbIndex.
- Check the gbSeq table to see if it contains any of the new entries, which become orphaned. These are sequences from a failed load.
- Remove seqChanged, deleted, and orphaned from alignment and orientation information tables.
- Remove deleted and orphaned accessions from the mrna table.
- If this is RefSeq, remove deleted and orphaned accessions from the refSeqStatus and refLink tables.
- Remove deleted and orphaned accessions from the gbSeq table.
- Update rows for seqChanged and metaChanged accessions in the gbSeq table. Add new accessions to the gbSeq table. This must be the first table update so that orphans can be detected.
- If this is RefSeq, update rows for seqChanged and metaChanged accessions in the refSeqStatus and refLink tables and add new accessions to these tables.
- Add new strings to the unique string tables (author, library, etc.). No attempt is made to remove entries that will no longer be referenced.
- Update rows for seqChanged and metaChanged accessions in the mrna table. Add new accessions to the mrna table.
- Add new and seqChanged rows to the alignment and orientation information tables.
- Update sequence FASTA references in the gbSeq and gbExtFile tables for all extChanged entries.
- Update rows for seqChanged and metaChanged accessions in the gbStatus table. Add new accessions to the table.
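
The classification step in the list above can be sketched as below. This is a simplified illustration, not the loader's actual logic. The Entry type and function names are hypothetical. The sketch also assumes that a sequence change shows up as a different version, a metadata change as a different modDate, and an external file change as a different extRelease; the real comparison may use other fields.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """Fields compared between gbStatus and gbIndex (hypothetical type)."""
    version: int      # GenBank version number
    mod_date: str     # last modified date
    ext_release: str  # release containing the external file


def classify(status: Optional[Entry], index: Optional[Entry]) -> Optional[str]:
    """Classify one accession; None means it is unchanged or unknown."""
    if index is None:
        # Not in the gbIndex: deleted if the database still has it.
        return "deleted" if status is not None else None
    if status is None:
        return "new"
    if index.version != status.version:        # assumption: version => sequence
        return "seqChanged"
    if index.mod_date != status.mod_date:      # assumption: modDate => metadata
        return "metaChanged"
    if index.ext_release != status.ext_release:
        return "extChanged"
    return None


def classify_partition(gb_status: dict[str, Entry],
                       gb_index: dict[str, Entry]) -> dict[str, str]:
    """Map each accession needing work in a partition to its state."""
    states = {}
    for acc in gb_status.keys() | gb_index.keys():
        state = classify(gb_status.get(acc), gb_index.get(acc))
        if state is not None:
            states[acc] = state
    return states
```

Accessions classified as new would then be checked against gbSeq to find orphans before any table is modified.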