GenBank/RefSeq Data Processing Step
The data processing step extracts data from the downloaded GenBank files
into a format that is ready for import into the database.
Algorithm
- Run the gbProcessStep script, which:
  - Examines the full and daily download files to determine which
    processed files need to be created. For each set of source files, it
    checks whether the mrna.md5 and est.*.md5 files exist in the
    appropriate processed/ directory.
  - For each missing *.md5 file, runs the gbProcessSeqs script, which:
    - Parses the flat-files with gbToFaRa into data files that are used
      to update the browser databases. An index file (*.gbidx) is created
      to locate each sequence and version. All species remain grouped
      together; splitting by species at this step would generate a very
      large number of small files.
    - Checksums (md5) the data files. The checksum file serves as an
      indicator that the task completed successfully.
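The completion check described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the gbProcessStep or gbProcessSeqs implementation: the function names needs_processing and write_checksum_file are hypothetical, and the "digest, two spaces, file name" line layout of the checksum file is assumed for the example.

```python
import hashlib
from pathlib import Path


def needs_processing(processed_dir: Path, prefix: str) -> bool:
    """Return True when the checksum file that marks completion is absent.

    prefix is a data-file base name such as "mrna" or "est.aa".
    """
    return not (processed_dir / f"{prefix}.md5").exists()


def write_checksum_file(processed_dir: Path, prefix: str) -> Path:
    """Checksum the data files for prefix, then create <prefix>.md5.

    The line layout written here is an assumption made for this sketch.
    """
    md5_path = processed_dir / f"{prefix}.md5"
    tmp_path = processed_dir / f"{prefix}.md5.tmp"
    lines = []
    for data_file in sorted(processed_dir.glob(f"{prefix}.*")):
        # Skip the checksum file itself and any leftover temporary file.
        if data_file in (md5_path, tmp_path):
            continue
        digest = hashlib.md5(data_file.read_bytes()).hexdigest()
        lines.append(f"{digest}  {data_file.name}")
    # Write under a temporary name and rename afterwards, so that a
    # half-written checksum file is never mistaken for a finished task.
    tmp_path.write_text("\n".join(lines) + "\n")
    tmp_path.replace(md5_path)
    return md5_path
```

Creating the checksum file as the last action is what allows its presence alone to signal that the data files next to it are complete.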
- $gbRoot/data/processed/ - data extracted from the NCBI flat-files
  - genbank.${ver}/
    - full/
      - mrna.ra.gz - meta-data for mRNAs
      - mrna.fa - fasta sequence data
      - mrna.gbidx - index file
      - mrna.md5 - checksums of all mRNA files
      - est.aa.ra.gz - files for EST accessions starting with AA (case insensitive)
      - est.aa.fa, est.aa.gbidx, est.aa.cksum
      - est.ab.ra.gz, est.ab.fa, est.ab.gbidx, est.ab.cksum
      - ...
    - daily.${date}/
      - mrna.ra.gz, ...
      - est.aa.ra, ...
      - ...
  - refseq.${ver}/
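The naming scheme in this layout can be illustrated with a short Python sketch. It assumes that every EST file is keyed by the first two characters of the accession, lowercased; the listing only shows the aa and ab cases, so the general rule is an assumption. The helper names est_file_prefix and processed_dir, the root directory /gb, and the release number 130.0 are hypothetical values chosen for the example.

```python
from pathlib import PurePosixPath


def est_file_prefix(accession: str) -> str:
    """Return the EST base name, e.g. "est.aa", for an accession.

    Assumes partitioning by the first two characters, case insensitive.
    """
    if len(accession) < 2:
        raise ValueError(f"accession too short: {accession!r}")
    return "est." + accession[:2].lower()


def processed_dir(gb_root: str, source: str, version: str, update: str) -> PurePosixPath:
    """Build $gbRoot/data/processed/<source>.<version>/<update>."""
    return PurePosixPath(gb_root) / "data" / "processed" / f"{source}.{version}" / update


print(est_file_prefix("AA000001"))  # est.aa
print(est_file_prefix("ab000002"))  # est.ab
print(processed_dir("/gb", "genbank", "130.0", "full") / "mrna.fa")
# /gb/data/processed/genbank.130.0/full/mrna.fa
```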
GenBank index file
A GenBank index file is a tab-separated file in the format:
acc version moddate organism
The file is named either mrna.gbidx or est.*.gbidx and is associated
with the *.ra and *.fa files of the same base name. The columns are:
acc - GenBank or RefSeq accession
version - Version number, not including the
accession
moddate - Modification date, in YYYY-MM-DD format (for example, 2002-08-22)
organism - Organism name
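A reader for this format might look like the following Python sketch. It is an illustration only: the names GbIdxEntry and read_gbidx are hypothetical, the sample record is invented, and skipping a first row whose first field is the literal text "acc" is a defensive assumption in case a file carries the column names as a header. The modification date is kept as an uninterpreted string.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class GbIdxEntry(NamedTuple):
    acc: str        # GenBank or RefSeq accession
    version: int    # version number, without the accession
    moddate: str    # modification date, left as text
    organism: str   # organism name


def read_gbidx(handle: TextIO) -> Iterator[GbIdxEntry]:
    """Yield one GbIdxEntry per tab-separated line of an index file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Ignore blank lines and a possible column-name header row.
        if not row or row[0] == "acc":
            continue
        acc, version, moddate, organism = row
        yield GbIdxEntry(acc, int(version), moddate, organism)


# Invented sample data, used only to exercise the reader.
sample = "AB000001\t2\t2002-08-08\tHomo sapiens\n"
for entry in read_gbidx(io.StringIO(sample)):
    print(entry.acc, entry.version, entry.organism)
```

Because the organism column can contain spaces, splitting on tabs rather than on arbitrary whitespace keeps names such as "Homo sapiens" in a single field.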