GenBank/RefSeq Data Processing Step

The data processing step extracts data from the downloaded GenBank files into a format that is ready for import into the database.

Run the gbProcessStep script, which:
- Examines each full and daily download file to determine which processed files need to be created. For each set of source files, it checks whether the mrna.md5 and est.*.md5 files exist in the appropriate processed/ directory.
- For each missing *.md5 file, runs the gbProcessSeqs script, which:
  - Parses the flat files with gbToFaRa into data files that are used to update the browser databases. An index file (*.gbidx) is created to locate each sequence and version. All species remain grouped together; splitting by species at this step would generate a very large number of small files.
  - Checksums (md5) the data files. The checksum file serves as an indicator that the task completed successfully.
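
The "checksum file as completion marker" idea above can be sketched in Python. This is only an illustration, not the code in gbProcessStep or gbProcessSeqs: the function names, the layout of the .md5 file, and the temporary-file rename are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def missing_checksums(processed_dir, expected_names):
    """Return the base names (e.g. "mrna") that have no NAME.md5 file yet.

    expected_names is a hypothetical list supplied by the caller; how the
    real script derives it from the download files is not shown here.
    """
    processed = Path(processed_dir)
    return [name for name in expected_names
            if not (processed / f"{name}.md5").exists()]


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(processed_dir, name, data_files):
    """Checksum the data files and create NAME.md5 as the last action.

    The file is written under a temporary name and then renamed, so a
    NAME.md5 file only ever exists once all checksums were computed.
    The "digest  filename" line layout is an assumption for this sketch.
    """
    processed = Path(processed_dir)
    lines = [f"{file_md5(data_file)}  {Path(data_file).name}\n"
             for data_file in data_files]
    tmp_path = processed / f"{name}.md5.tmp"
    tmp_path.write_text("".join(lines))
    tmp_path.replace(processed / f"{name}.md5")
```

Because the marker is created only after every data file has been checksummed, a run that is interrupted part-way leaves no *.md5 file behind, and the next run picks the same task up again.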

A GenBank index file is a tab-separated file in the format:

    acc version moddate organism
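
As a rough illustration, a file in this four-column format could be read with a few lines of Python. The names GbIndexEntry and read_gbidx are hypothetical, every field is kept as a string, and the sketch assumes the file has no header or comment lines.

```python
from typing import Iterable, Iterator, NamedTuple


class GbIndexEntry(NamedTuple):
    """One row of a *.gbidx file; all fields are kept as plain strings."""
    acc: str
    version: str
    moddate: str
    organism: str


def read_gbidx(lines: Iterable[str]) -> Iterator[GbIndexEntry]:
    """Yield one entry per non-empty, tab-separated line.

    Splitting on tabs only keeps organism names that contain spaces
    in a single field.
    """
    for line_num, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_num}: expected 4 columns, got {len(fields)}")
        yield GbIndexEntry(*fields)
```

An open file handle can be passed directly as the lines argument, since iterating over it yields one line at a time.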

The name of the file is either mrna.gbidx or est.*.gbidx, and it is associated with the *.ra or *.fa files of the same name. The columns are: