This file describes how we made the browser database on the Oct 7th freeze.

CREATING DATABASE AND STORING mRNA/EST SEQUENCE AND AUXILIARY INFO

o - Create the database.
     - ssh cc94
     - Enter mysql via:
         mysql -u hguser -phguserstuff
     - At mysql prompt type:
         create database hg5;
         quit
     - make a semi-permanent alias:
         alias hg5 mysql -u hguser -phguserstuff -A hg5
o - Make sure there is at least 5 gig free on cc94:/var/mysql
o - Store the mRNA (non-alignment) info in database.
     - hgLoadRna new hg5
     - hgLoadRna add hg5 /projects/compbio/experiments/hg/h/mrna.120/mrna.fa ~/hg/h/mrna.120/mrna.ra
     - hgLoadRna add hg5 /projects/compbio/experiments/hg/h/mrna.120/est.fa ~/hg/h/mrna.120/est.ra
    The last line will take quite some time to complete.
    It will count up to about 2,000,000 before it is done.

REPEAT MASKING

o - Lift up the repeat masker coordinates as so:
     - ssh kks00
     - cd ~/gs/oo.23
     - edit jkStuff/liftOut.sh and update ooGreedy version number
     - source jkStuff/liftOut.sh
o - Copy over RepeatMasking for chromosomes 21 and 22:
     - cp /projects/cc/hg/gs.4/oo.18/21/chr21.fa.out ~/gs/oo.23/21
o - Load RepeatMasker output into database:
     - cd /projects/hg/gs.5/oo.23
     - hgLoadOut hg5 ?/*.fa.out ??/*.fa.out
       (Ignore the "Strange perc. field" warnings. Maybe mention them
       to Arian someday.)

This will keep things going until the sensitive RepeatMasker run comes in
from Victor or elsewhere. When it does:

o - Unzip and process Victor's stuff with doMerge.sh and victor_merge_rm_out1.pl.
o - Copy in Victor's repeat masking runs over the contig runs in ~/oo/*/ctg*
o - Lift these up as so:
     - ssh kks00
     - cd ~/gs/oo.23
     - source jkStuff/liftOut2.sh
o - Load RepeatMasker output into database again:
     - ssh cc94
     - cd /projects/hg/gs.5/oo.23
     - hgLoadOut hg5 ?/*.fa.out ??/*.fa.out
       (Ignore the "Strange perc. field" warnings. Maybe mention them
       to Arian someday.)
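The hgLoadOut arguments "?/*.fa.out ??/*.fa.out" rely on shell globbing: "?" matches exactly one character, so the two patterns together pick up the chromosome directories with one-character names and those with two-character names (such as 21 and 22). As a rough illustration only, the Python sketch below reproduces that expansion so the file list can be inspected before loading. The function name chrom_out_files is hypothetical and is not part of the original procedure.

```python
import glob
import os


def chrom_out_files(assembly_dir):
    """Return files matching ?/*.fa.out and ??/*.fa.out under assembly_dir.

    Mirrors the order of the shell expansion: every match for the
    one-character directory pattern first, then every match for the
    two-character directory pattern, each group sorted by name.
    """
    matches = []
    for pattern in ("?/*.fa.out", "??/*.fa.out"):
        found = glob.glob(os.path.join(assembly_dir, pattern))
        matches.extend(sorted(found))
    return matches
```

Called as chrom_out_files("/projects/hg/gs.5/oo.23"), this would list the per-chromosome RepeatMasker files while leaving out anything deeper in the tree, such as the per-contig output.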
STORING O+O SEQUENCE AND ASSEMBLY INFORMATION

o - Create packed chromosome sequence files and put in database
     - hg5 < ~/src/hg/lib/chromInfo.sql
     - cd /projects/hg/gs.5/oo.23
     - hgNibSeq hg5 /projects/hg/gs.5/oo.23/nib ?/chr*.fa ??/chr*.fa
o - Store o+o info in database.
     - cd /projects/hg/gs.5/oo.23
     - jkStuff/liftAgp.sh
     - jkStuff/liftGl.sh ooGreedy.93.gl
     - cat ?/ctg*/nt ??/ctg*/nt > nt.all
     - hgGoldGapGl hg5 /projects/hg/gs.5 oo.23
     - cd /projects/hg/gs.5
     - hgClonePos hg5 oo.23 ffa/sequence.inf /projects/hg/gs.5
       (Ignore warnings about missing clones - these are in chromosomes 21 and 22)
     - hgCtgPos hg5 oo.23
o - Make and load GC percent table
     - login to cc94
     - cd /projects/hg/gs.5/oo.23/bed
     - mkdir gcPercent
     - cd gcPercent
     - hg5 < ~/src/hg/lib/gcPercent.sql
     - hgGcPercent hg5 ../../nib

MAKING AND STORING mRNA AND EST ALIGNMENTS

o - Generate mRNA and EST alignments as so:
     - cdnaOnOoJobs /projects/hg/gs.5/oo.23 /projects/hg/gs.5/oo.23/jkStuff/postPsl refseq mrna est
     - cd /projects/hg/gs.5/oo.23/jkStuff/postPsl
     - edit in/mrna in/est in/refseq to change directory h/mrna to h/mrna
     - edit all.con to remove the big contig on chromosome 6 (ctg15907)
     - split into 5 files.
     - submit each to condor
     - run psLayout to generate ctg15907.refseq.psl, ctg15907.mrna.psl and
       ctg15907.est.psl on kks00 because it has a gigabyte of memory
       (need 400 meg).
     - Wait 2 days or so for last stragglers.
o - Process EST and mRNA alignments into near best in genome. Since some of
    the EST files are over 2 gig, you need to do this on an Alpha.
     - ssh beta
     - cd /projects/hg/gs.5/oo.23
     - cat */lift/*.lft > jkStuff/liftAll.lft
     - cd /projects/hg/gs.5/oo.23/jkStuff/postPsl/psl
     - mkdir /projects/hg/gs.5/oo.23/psl
     - mkdir /projects/hg/gs.5/oo.23/psl/mrnaRaw
     - ln */*.mrna.psl *.mrna.psl !$
     - cd /projects/hg/gs.5/oo.23/psl
     - rm /scratch/jk/*
     - pslSort dirs mrnaGoodRaw.psl /scratch/jk mrnaRaw
     - pslReps mrnaGoodRaw.psl mrnaBestRaw.psl mrnaBestRaw.psr
     - liftUp all_mrna.psl ../jkStuff/liftAll.lft carry mrnaBestRaw.psl
     - pslSortAcc nohead mrna /scratch/jk all_mrna.psl
     - check mrna dir looks good.
     - rm -r mrnaRaw mrnaGoodRaw.psl mrnaBestRaw.psl mrnaBestRaw.psr
    Repeat this for the ESTs. You may be able to do this with the script
    ~/oo/jkStuff/postPslNearBest.sh. It's written but not tested.
o - Load mRNA alignments into database.
     - ssh cc94
     - cd /projects/hg/gs.5/oo.23/psl/mrna
     - foreach i (*.psl)
           mv $i $i:r_mrna.psl
       end
     - hgLoadPsl hg5 *.psl
     - cd ..
     - remove header lines from all_mrna.psl
     - hgLoadPsl hg5 all_mrna.psl
o - Load EST alignments into database.
     - ssh cc94
     - cd /projects/hg/gs.5/oo.23/psl/est
     - foreach i (*.psl)
           mv $i $i:r_est.psl
       end
     - hgLoadPsl hg5 *.psl
     - cd ..
     - remove header lines from all_est.psl
     - hgLoadPsl hg5 all_est.psl
o - Create subset of ESTs with introns and load into database.
     - ssh kks00
     - cd ~/oo
     - source jkStuff/makeIntronEst.sh
     - ssh cc94
     - cd ~/oo/psl/intronEst
     - hgLoadPsl hg5 *.psl

FAKING DATA FROM PREVIOUS VERSION
(This is just a stopgap until the proper track arrives. It rescues about
97% of the data. Just an experiment, not really followed through on.)
o - Rescuing STS track:
     - log onto cc94
     - mkdir ~/oo/rescue
     - cd !$
     - mkdir sts
     - cd sts
     - bedDown hg3 mapGenethon sts.fa sts.tab
     - echo ~/oo/sts.fa > fa.lst
     - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g
     - log onto cc01
     - cd ~/oo/rescue/sts
     - split all.con into 3 parts and condor_submit each part
     - wait for assembly to finish
     - cd psl
     - mkdir all
     - ln ?/*.psl ??/*.psl *.psl all
     - pslSort dirs raw.psl temp all
     - pslReps raw.psl contig.psl /dev/null
     - rm raw.psl
     - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
     - rm contig.psl
     - mv chrom.psl ../convert.psl

LOADING IN EXTERNAL FILES

o - Make external submission directory
     - mkdir /projects/hg/gs.5/oo.23/bed
o - Load chromosome bands:
     - login to cc94
     - cd /projects/hg/gs.5/oo.23/bed
     - mkdir cytoBands
     - cp /projects/cc/hg/mapplots/data/tracks/oo.23/cytobands.bed cytoBands
     - hg5 < ~/src/hg/lib/cytoBand.sql
     - Enter database with "hg5" command.
     - At mysql> prompt type in:
         load data local infile 'cytobands.bed' into table cytoBand;
     - At mysql> prompt type quit
o - Load STSs
     - login to cc94
     - cd ~kent/src/hg/lib
     - hg5 < stsMarker.sql
     - cd ~/oo/bed
     - mkdir stsMarker
     - cd stsMarker
     - cp /projects/cc/hg/mapplots/data/tracks/oo.23/stsMarker.bed stsMarker
     - Enter database with "hg5" command.
     - At mysql> prompt type in:
         load data local infile 'stsMarker.bed' into table stsMarker;
     - At mysql> prompt type quit
     - Ignore warnings from extra fields (7611 of them).
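The "extra fields" warnings in the STS load come from rows that carry more tab-separated fields than the table has columns; MySQL's load data ignores the surplus fields and counts a warning for each such row. A minimal sketch, assuming plain tab-separated input, for counting those rows ahead of time so the number can be compared with the warning count. The function name and the expected_columns parameter are hypothetical; the column count of the stsMarker table is not stated in this file and has to be supplied by the caller.

```python
def count_rows_with_extra_fields(lines, expected_columns):
    """Count tab-separated rows with more fields than expected_columns.

    `lines` is any iterable of text lines (an open file works).
    Blank lines are skipped rather than counted.
    """
    extra = 0
    for line in lines:
        row = line.rstrip("\n")
        if not row:
            continue  # ignore empty lines
        if len(row.split("\t")) > expected_columns:
            extra += 1
    return extra
```

For example, opening stsMarker.bed and passing the file object together with the table's column count gives a figure to set against the 7611 warnings noted above.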
o - Load rnaGene table
     - login to cc94
     - cd ~kent/src/hg/lib
     - hg5 < rnaGene.sql
     - cd /projects/hg/gs.5/oo.23/bed
     - mkdir rnaGene
     - cd rnaGene
     - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo23.gff.gz
     - gunzip *.gz
     - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo23.gff
     - hgRnaGenes hg5 chrom.gff
o - Load cpgIslands
     - login to cc94
     - cd /projects/hg/gs.5/oo.23/bed
     - mkdir cpgIsland
     - cd cpgIsland
     - hg5 < ~kent/src/hg/lib/cpgIsland.sql
     - wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/cpg_oct.oo23.tar
     - tar -xf cpg*.tar
     - awk -f filter.awk */ctg*/*.cpg > contig.bed
     - liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed
     - Enter database with "hg5" command.
     - At mysql> prompt type in:
         load data local infile 'cpgIsland.bed' into table cpgIsland
o - Load exoFish table
     - login to cc94
     - cd /projects/hg/gs.5/oo.23/bed
     - mkdir exoFish
     - cd exoFish
     - hg5 < ~kent/src/hg/lib/exoFish.sql
     - Put email attachment from Olivier Jaillon (ojaaillon@genoscope.cns.fr)
       into /projects/hg/gs.5/oo.23/bed/exoFish/oo23.ecore1.4jimkent
     - cp oo23.ecore1.4jimkent foo
     - Substitute "chr" for "CHR" in that file by
         subs -e foo >/dev/null
     - Add dummy name column and convert to tab separated with
         awk -f filter.awk foo > exoFish.bed
     - Enter database with "hg5" command.
     - At mysql> prompt type in:
         load data local infile 'exoFish.bed' into table exoFish
o - Load SNPs into database.
     - ssh cc94
     - cd ~/oo/bed
     - mkdir snp
     - cd snp
     - Download SNPs from ftp://ncbi.nlm.nih.gov/pub/sherry/dbSNPoo23.txt.gz
     - Unpack.
     - grep RANDOM db*.txt > snpTsc.txt
     - grep MIXED db*.txt >> snpTsc.txt
     - grep BAC_OVERLAP db*.txt > snpNih.txt
     - grep OTHER db*.txt >> snpNih.txt
     - awk -f filter snpTsc.txt > snpTsc.contig.bed
     - awk -f filter snpNih.txt > snpNih.contig.bed
     - liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
     - liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
     - hg5 < ~kent/src/hg/lib/snpTsc.sql
     - hg5 < ~kent/src/hg/lib/snpNih.sql
     - Start up mySQL with the command 'hg5'. At the prompt do:
         load data local infile 'snpTsc.bed' into table snpTsc;
         load data local infile 'snpNih.bed' into table snpNih;
o - Mouse synteny track
     - login to cc94.
     - cd ~/kent/src/hg/lib
     - hg5 < mouseSyn.sql
     - mkdir ~/oo/bed/mouseSyn
     - cd ~/oo/bed/mouseSyn
     - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attachment as
       mouseSyn.txt
     - awk -f format.awk *.txt > mouseSyn.bed
     - delete first line of mouseSyn.bed
     - Enter database with "hg5" command.
     - At mysql> prompt type in:
         load data local infile 'mouseSyn.bed' into table mouseSyn
o - Load Genie known genes and associated peptides
     - login to cc94
     - cd ~kent/src/hg/lib
     - hg5 < knownInfo.sql
     - cd /projects/hg/gs.5/oo.23/bed
     - wget https://www.neomorphic.com/public/genesets/files/affymetrix-2000-12-26.tgz
     - gtar -zxf affy*.tgz
     - mv affymetrix-2000-12-26 affy.1
     - mv affy*.tgz affy.1
     - cd affy.1
     - cat */ctg*/ctg*.nr-knowns.gtf > knownContigs.gtf
     - liftUp knownChrom.gtf ../../jkStuff/liftAll.lft warn knownContigs.gtf
     - ldHgGene hg5 genieKnown knownChrom.gtf
     - hgKnownGene hg5 knownChrom.gtf
     - cat */ctg*/ctg*.nr-knowns.aa > known.aa
     - hgPepPred hg5 genie known.aa
o - Create knownMore table. (Need to do this after knownInfo.)
     - ssh cc94
     - cd /projects/hg/gs.5/oo.23/bed
     - mkdir knownMore
     - cd knownMore
     - wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref
     - wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc
     - wget www.ncbi.nlm.nih.gov/Omim/Index/genetable.html
     - copy genetable.html to omimIds
     - Remove first few (header) lines from omimIds
     - subs -e omimIds
       (This will perform the substitutions in subs.in, which will convert
       it to a nice tab-separated file with two columns: name, OMIM id.)
     - Remove the line containing the single word 'mhc' from omimIds
       (mhc isn't a gene, it's a bunch of genes, no wonder it's confused!)
     - wget http://www.gene.ucl.ac.uk/public-files/nomen/nomeids.txt
     - fixCr nomeids.txt
     - Put a # at the start of the first line of nomeids.txt
     - wget http://www.gene.ucl.ac.uk/public-files/nomen/nomenclature.txt
     - fixCr nomenclature.txt
     - hg5 < ~kent/src/hg/lib/knownMore.sql
     - hgKnownMore hg5 loc2ref mim2loc omimIds nomeids.txt
o - Load Genie predicted genes and associated peptides.
     - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
     - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
     - ldHgGene hg5 genieAlt predChrom.gtf
     - cat */ctg*/ctg*.affymetrix.aa > pred.aa
     - hgPepPred hg5 genie pred.aa
     - hg5
         mysql> delete from genieAlt where name like 'RS.%';
         mysql> delete from genieAlt where name like 'C.%';
o - Load Softberry genes and associated peptides:
     - cd ~/oo/bed
     - mkdir softberry
     - cd softberry
     - get ftp://www.softberry.com/pub/GP_SCOCT/*
     - ldHgGene hg5 softberryGene chr*.sgp.gff
     - hgPepPred hg5 softberry *.pro
o - Load Ensembl genes:
     - cd ~/oo/bed
     - mkdir ensembl
     - wget ftp://ftp.sanger.ac.uk/pub/birney/genome/ensembl_oo23_october.gtf.gz
     - gunzip *.gz
     - liftUp chromRaw.gtf ../../jkStuff/liftAll.lft
     - grep -v start_codon chromRaw.gtf | grep -v stop_codon > chrom.gtf
     - ldHgGene hg5 ensGene chrom.gtf
o - Load simpleRepeats
     - login to cc94
     - cd ~kent/src/hg/lib
     - hg5 < simpleRepeat.sql
     - mkdir ~/oo/bed/simpleRepeat
     - cd ~/oo/bed/simpleRepeat
     - download
       *.bed from ftp.hgsc.bcm.tcm.edu/pub/users/GP-uploads/SimpleRepeats/Sept05
     - catUncomment chr*.bed > all.bed
     - Enter database with "hg5" command.
     - At mysql> prompt type in:
         load data local infile 'all.bed' into table simpleRepeat
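The catUncomment step above merges the per-chromosome simple repeat files into all.bed while dropping comment lines. The sketch below shows the idea in Python, on the assumption that a comment is any line starting with "#"; the tool's exact rules are not described in this file, and the function name cat_uncomment is hypothetical.

```python
def cat_uncomment(line_sources):
    """Yield lines from each source in order, skipping '#' comment lines.

    `line_sources` is an iterable of line iterables, for example a list
    of open file objects, one per chr*.bed file.
    """
    for source in line_sources:
        for line in source:
            if line.startswith("#"):
                continue  # assumed comment convention: leading '#'
            yield line
```

Writing the yielded lines to all.bed would give a single file that can be loaded with the load data statement above.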