GenBank/RefSeq Data Processing Step
The data processing step extracts data from the downloaded GenBank files
into a format that is ready for import into the database.
Algorithm
- Run the gbProcessStep script, which:
  - Examines each full and daily download file to determine which processed files need to be created. For each set of source files, it checks whether the mrna.md5 and est.*.md5 files exist in the appropriate processed/ directory.
  - For each missing *.md5 file, runs the gbProcessSeqs script, which:
    - Parses the flat-files with gbToFaRa into data files that are used to update the browser databases. An index file (*.gbidx) is created to locate each sequence and version. All species remain grouped together; splitting by species at this step would generate a very large number of small files.
    - Checksums (md5) the data files. The checksum file serves as an indicator that the task completed successfully.
- $gbRoot/data/processed/ - data extracted from the NCBI flat-files
  - genbank.${ver}/
    - full/
        mrna.ra.gz - meta-data for mRNAs
        mrna.fa - fasta sequence data
        mrna.gbidx - index file
        mrna.md5 - checksums of all mRNA files
        est.aa.ra.gz - files for EST accessions starting with AA (case insensitive).
        est.aa.fa, est.aa.gbidx, est.aa.cksum
        est.ab.ra.gz, est.ab.fa, est.ab.gbidx, est.ab.cksum
      - ...
    - daily.${date}/
        mrna.ra.gz, ...
        est.aa.ra, ...
    - ...
  - refseq.${ver}/
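The "missing *.md5 means work to do" convention described above can be sketched in Python. This is only an illustration, not the actual gbProcessStep or gbProcessSeqs code: the function names, the md5sum-style line layout inside the checksum file, and the temporary-file rename are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_processing(update_dir: Path, prefix: str) -> bool:
    """A partition such as 'mrna' or 'est.aa' is pending if its .md5 file is absent."""
    return not (update_dir / f"{prefix}.md5").exists()


def write_checksums(update_dir: Path, prefix: str) -> Path:
    """Checksum the data files of one partition and write the .md5 file last."""
    data_files = sorted(
        p for p in update_dir.glob(f"{prefix}.*")
        if p.suffix not in {".md5", ".tmp"}
    )
    # Assumption: write to a temporary name and rename, so that the .md5
    # file only becomes visible once every data file has been checksummed.
    tmp_path = update_dir / f"{prefix}.md5.tmp"
    with tmp_path.open("w") as out:
        for p in data_files:
            # Assumption: "digest  filename" lines, as md5sum prints them.
            out.write(f"{md5_of_file(p)}  {p.name}\n")
    final_path = update_dir / f"{prefix}.md5"
    tmp_path.replace(final_path)
    return final_path
```

Because the checksum file is created only after the data files exist, a run that is interrupted part-way leaves no *.md5 file behind, and the partition is picked up again on the next pass.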
GenBank index file
A GenBank index file is a tab-separated file in the format:
acc version moddate organism
The name of the file is either mrna.gbidx or est.*.gbidx, and it is associated with the *.ra and *.fa files of the same name. The columns are:
acc - GenBank or RefSeq accession
version - Version number, not including the accession
moddate - Modification date, in YYYY-MM-DD format (for example, 2002-08-22)
organism - Organism name
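A reader for this four-column layout might look like the following Python sketch. The names GbIdxEntry and read_gbidx are hypothetical, the skipping of blank lines and lines starting with "#" is an assumption rather than documented behavior, and the sample rows are invented for illustration. The moddate column is kept as a plain string so the sketch makes no claim about its exact format.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class GbIdxEntry(NamedTuple):
    acc: str       # GenBank or RefSeq accession
    version: int   # version number, without the accession
    moddate: str   # modification date, kept as the raw string
    organism: str  # organism name

    def acc_version(self) -> str:
        """Join accession and version in the usual acc.version form."""
        return f"{self.acc}.{self.version}"


def read_gbidx(handle: TextIO) -> Iterator[GbIdxEntry]:
    """Yield one entry per tab-separated line of a *.gbidx file."""
    for line in handle:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' comment lines carry no data.
        if not line or line.startswith("#"):
            continue
        acc, version, moddate, organism = line.split("\t")
        yield GbIdxEntry(acc, int(version), moddate, organism)


# Invented sample rows, not taken from a real index file.
sample = (
    "AB000001\t2\t2002-08-08\tHomo sapiens\n"
    "AB000002\t1\t2003-03-03\tMus musculus\n"
)
entries = list(read_gbidx(io.StringIO(sample)))
```

Splitting on tabs only, rather than on any whitespace, matters here because organism names contain spaces.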