GenBank/RefSeq Alignment Step
This step is done for each browser database (species and assembly) that is
being updated.
Algorithm
- Select the most current GenBank (and corresponding RefSeq) full
releases from the
processed/
directory.
-
If there is not a corresponding
aligned/.../full
directory for the release:
- If there is a previous aligned release for this database, find
the aligned sequences that have not changed between that release and
the new full release. Save these in a temporary directory.
- Copy new and changed sequences to temporary fasta files for
alignment.
-
For each
processed/.../daily.${ver}/
directory that does not have a completed alignment/
directory:
- Copy new and changed sequences to temporary fasta files for
alignment.
- If the number of sequences requiring alignment exceeds a
configured threshold, send e-mail requesting an alignment on the big
cluster and stop the automated process. Normally, this should only be
required when a new database is built.
- If the number of sequences is below the threshold, run BLAT on the
mini-cluster, using the
parasol make
facility.
-
Process the completed alignments:
- Combine the new alignments with those migrated from previous
releases, if any are pending.
- Build an index file.
- Checksum the files.
- Run sanity checks on the alignments: check that changed
sequences continue to align, at least in most cases, and that the
number of aligned sequences increases.
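The selection and dispatch steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the representation of a release as a mapping from accession to version, and the use of a version comparison to detect a changed sequence are all assumptions made for the example.

```python
def partition_sequences(prev_aligned, new_release):
    """Split a new release into alignments to migrate and sequences to align.

    Both arguments map accession -> version (a hypothetical representation).
    A sequence counts as unchanged when its accession appears in the
    previous aligned release with the same version (an assumption).
    """
    unchanged = {}
    to_align = {}
    for acc, version in new_release.items():
        if prev_aligned.get(acc) == version:
            unchanged[acc] = version   # existing alignment can be migrated
        else:
            to_align[acc] = version    # new or changed: needs alignment
    return unchanged, to_align


def choose_alignment_route(num_to_align, threshold):
    """Pick where the alignment should run.

    Above the threshold the automated process stops and a big-cluster run
    is requested; otherwise BLAT runs on the mini-cluster. Treating a count
    equal to the threshold as a mini-cluster job is an assumption.
    """
    if num_to_align > threshold:
        return "request-big-cluster"
    return "mini-cluster"
```

For example, with a previous release `{"AB000001": 1, "AB000002": 1}` and a new release `{"AB000001": 1, "AB000002": 2}`, only `AB000002` would be written out for alignment.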
-
$gbRoot/data/aligned/
- aligned files
-
genbank.${ver}/
-
${db}/
- alignments for this genome
database (e.g. hg12).
-
full/
- Alignments corresponding to the full
release. This is a combination of alignments migrated from
previous releases and new alignments.
mrna.native.psl.gz, mrna.native.oi.gz, mrna.native.alidx, mrna.native.md5,
est.aa.native.psl.gz, est.aa.native.oi.gz, est.aa.native.alidx, est.aa.native.md5,
mrna.xeno.psl.gz, mrna.xeno.oi.gz, mrna.xeno.alidx, mrna.xeno.md5,
est.aa.xeno.psl.gz, est.aa.xeno.oi.gz, est.aa.xeno.alidx, est.aa.xeno.md5
-
daily.${date}/
- Alignments for
sequences that were new or modified in the daily update.
-
refseq.${ver}/
- BLAT alignments for RefSeq, using the
same structure as for GenBank, but with only native mRNAs.
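To make the layout above concrete, here is a small sketch that builds the file paths expected in a `full/` directory. The function name and its parameters are hypothetical, and the release version used in the usage note is a made-up placeholder; only the directory structure and file suffixes come from the listing above.

```python
from pathlib import PurePosixPath

# File suffixes listed for each alignment set in the layout above.
EXTENSIONS = ("psl.gz", "oi.gz", "alidx", "md5")


def full_release_files(gb_root, source, ver, db, prefixes):
    """Return the paths expected under aligned/<source>.<ver>/<db>/full/.

    `source` is "genbank" or "refseq"; `prefixes` are the alignment set
    names, e.g. "mrna.native" (hypothetical helper, for illustration only).
    """
    base = (PurePosixPath(gb_root) / "data" / "aligned"
            / f"{source}.{ver}" / db / "full")
    return [base / f"{prefix}.{ext}"
            for prefix in prefixes
            for ext in EXTENSIONS]
```

Calling `full_release_files("/gb", "genbank", "130.0", "hg12", ["mrna.native", "est.aa.native", "mrna.xeno", "est.aa.xeno"])` yields the sixteen GenBank paths. For RefSeq, passing only `["mrna.native"]` reflects the native-mRNA-only layout.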
Index file
Two alignment index files are always created for the corresponding
processed.gbidx file, one for native and one for xeno, even if there are no
sequences to align. This makes it easy to check whether the alignment has
completed. The file is tab-separated, in the format:
acc version numaligns
The name of the file is either mrna.alidx
or
est.*.alidx
and is associated with the *.psl
file of the same name. The columns are:
acc
- GenBank or RefSeq accession
version
- Version number, not including the
accession
numAligns
- Count of the number of alignments for this
accession.
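As an illustration of the index format, the following sketch reads the three tab-separated columns into typed records. The names `AlignIndexEntry` and `read_alidx` are hypothetical, and the sketch assumes the file has no header row and that both the version and the alignment count are plain integers; neither is confirmed by the description above.

```python
from typing import NamedTuple


class AlignIndexEntry(NamedTuple):
    acc: str          # GenBank or RefSeq accession
    version: int      # version number, without the accession
    num_aligns: int   # number of alignments for this accession


def read_alidx(lines):
    """Parse alignment index lines (acc, version, numAligns) into entries.

    `lines` is any iterable of text lines, such as an open file.
    Blank lines are skipped; a header row is assumed to be absent.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        acc, version, num_aligns = line.split("\t")
        entries.append(AlignIndexEntry(acc, int(version), int(num_aligns)))
    return entries
```

A caller could then, for instance, collect the accessions with `num_aligns == 0` to see which sequences in the release did not align.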