GenBank/RefSeq Annoying Issues
- The entire GenBank directory is replaced when a new version is
released. Daily releases are relative to that release.
- GenBank daily releases don't indicate deleted entries.
- GenBank daily filenames don't include a year, so daily files between
the beginning of the year and the next release (probably Jan 15th) will
not sort correctly by name alone.
- RefSeq updates its cumulative files daily as well as having separate
daily files. There is no concept of a release.
- RefSeq deleted entries are still in the older daily releases. There
are no daily records indicating when an entry has been deleted.
- Occasionally, there are incorrect GenBank entries that break
assumptions in this code. These are skipped by listing them by
accession (acc) in data/ignore.idx.
- MySQL ISAM tables don't support foreign keys. Using auto_increment
for id columns was a problem because mysqlimport would reset the
numbers (or at least not insert zero).
- Want to use disk files rather than a database to track GenBank
repository files. This is faster when we need to look at all entries and
makes setup and loading multiple database servers easier. It was also
easier to implement. However, this proved to be a problem for ESTs, which
require a large amount of memory to handle. To reduce the memory
required, ESTs are partitioned by the first two letters of the accession.
- Don't handle realigning sequences (say, to take advantage of
changes to the aligner).
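
The EST partitioning mentioned above can be illustrated with a minimal
sketch. It is not the code used here. It only assumes that the partition
key is the first two characters of the accession, as the note states. The
function names and the sample accessions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_key(accession: str) -> str:
    """Return the partition name: the first two characters of the accession."""
    return accession[:2]


def partition_accessions(accessions: Iterable[str]) -> Dict[str, List[str]]:
    """Group accessions by two-letter prefix so each group can be handled alone."""
    partitions: Dict[str, List[str]] = defaultdict(list)
    for acc in accessions:
        partitions[partition_key(acc)].append(acc)
    return dict(partitions)


# Hypothetical accessions, for illustration only.
example = ["AA000001", "AA000002", "BE123456", "AI987654"]
for prefix, accs in sorted(partition_accessions(example).items()):
    print(prefix, accs)
```

Processing one prefix group at a time bounds peak memory by the size of
the largest partition rather than by the size of the whole EST set.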