This file describes how we made the browser database on the Dec 12, 2000 freeze. CREATING DATABASE AND STORING mRNA/EST SEQUENCE AND AUXILIARRY INFO o - ln -s /projects/hg2/gs.6/oo.27 ~/oo o - Create the database. - ssh cc94 - Enter mysql via: mysql -u hguser -phguserstuff - At mysql prompt type: create database hg6; quit - make a semi-permanent alias: alias hg6 mysql -u hguser -phguserstuff -A hg6 o - Make sure there is at least 5 gig free on cc94:/usr/local/mysql o - Store the mRNA (non-alignment) info in database. - hgLoadRna new hg6 - hgLoadRna add hg6 /projects/hg2/mrna.122/mrna.fa /projects/hg2/mrna.122/mrna.ra - hgLoadRna add hg6 /projects/hg2/mrna.122/est.fa /projects/hg2/mrna.122/est.ra The last line will take quite some time to complete. It will count up to about 3,200,000 before it is done. REPEAT MASKING o - Lift up the repeat masker coordinates as so: - ssh kks00 - cd ~/gs/oo.27 - edit jkStuff/liftOut.sh and update ooGreedy version number - source jkStuff/liftOut.sh o - Copy over RepeatMasking for chromosomes 21 and 22: - cp /projects/cc/hg/gs.4/oo.18/21/chr21.fa.out ~/gs/oo.27/21 o - Load RepeatMasker output into database: - cd /projects/hg2/gs.6/oo.27 - hgLoadOut hg6 ?/*.fa.out ??/*.fa.out (Ignore the "Strange perc. field warnings. Maybe mention them to Arian someday.) This will keep things going until the sensitive RepeatMasker run comes in from Victor or elsewhere. When it does: o - Unzip and process Victor's stuff with doMerge.sh and victor_merge_rm_out1.pl. o - Copy in Victor's repeat masking runs over the contig runs in ~/oo/*/ctg* o - Lift these up as so: - ssh kks00 - cd ~/gs/oo.27 - source jkStuff/liftOut2.sh o - Load RepeatMasker output into database again: - ssh cc94 - cd /projects/hg2/gs.6/oo.27 - hgLoadOut hg6 ?/*.fa.out ??/*.fa.out (Ignore the "Strange perc. field warnings. Maybe mention them to Arian someday.) STORING O+O SEQUENCE AND ASSEMBLY INFORMATION o - Create packed chromosome sequence files and put in database - hg6 < ~/src/hg/lib/chromInfo.sql - cd /projects/hg2/gs.6/oo.27 - hgNibSeq hg6 /projects/hg2/gs.6/oo.27/nib ?/chr*.fa ??/chr*.fa o - Store o+o info in database. - cd /projects/hg2/gs.6/oo.27 - jkStuff/liftAgp.sh - jkStuff/liftGl.sh ooGreedy.99.gl - hgGoldGapGl hg6 /projects/hg2/gs.6 oo.27 - cd /projects/hg2/gs.6 - hgClonePos hg6 oo.27 ffa/sequence.inf /projects/hg2/gs.6 (Ignore warnings about missing clones - these are in chromosomes 21 and 22) (Had to patch sequence.inf and fin/fa from gs.5 to take care of missing AL133398 and AL031034. - hgCtgPos hg6 oo.27 o - Make and load GC percent table - login to cc94 - cd /projects/hg2/gs.6/oo.27/bed - mkdir gcPercent - cd gcPercent - hg6 < ~/src/hg/lib/gcPercent.sql - hgGcPercent hg6 ../../nib MAKING AND STORING mRNA AND EST ALIGNMENTS o - Generate mRNA and EST alignments as so: - cdnaOnOoJobs /projects/hg2/gs.6/oo.27 /projects/hg2/gs.6/oo.27/jkStuff/postPsl refseq mrna est - cd /projects/hg2/gs.6/oo.27/jkStuff/postPsl - edit in/mrna in/est in/refseq to change directory h/mrna to h/mrna - edit all.con to remove the big contig on chromosome 6 (ctg15907) - split into 5 files. - submit each to condor - run psLayout to generate ctg15907.refseq.psl, ctg15907.mrna.psl and ctg15907.est.psl on kks00 because it has a gigabyte of memory (need 400 meg). - Wait 2 days or so for last stragglers. - Check errors by inspecting cat err/* | more - Do ls -1 log | wc cat log/* | grep return | grep "value 0" | wc ls -1 psl/*.psl psl/*/*.psl | wc and make sure two counts agree (except last should be two higher because of ctg15907). o - Process EST and mRNA alignments into near best in genome. First the mRNAs: - cd /projects/hg2/gs.6/oo.27 - cd /projects/hg2/gs.6/oo.27/jkStuff/postPsl/psl - mkdir /projects/hg2/gs.6/oo.27/psl - mkdir /projects/hg2/gs.6/oo.27/psl/mrnaRaw - ln */*.mrna.psl *.mrna.psl !$ - mkdir /projects/hg2/gs.6/oo.27/psl/estRaw - ln */*.est.psl *.est.psl !$ - cd /projects/hg2/gs.6/oo.27/psl - pslSort dirs mrnaGoodRaw.psl temp mrnaRaw - pslReps mrnaGoodRaw.psl mrnaBestRaw.psl mrnaBestRaw.psr - liftUp all_mrna.psl ../jkStuff/liftAll.lft carry mrnaBestRaw.psl - pslSortAcc nohead mrna temp all_mrna.psl - check mrna dir looks good. - rm -r mrnaRaw mrnaGoodRaw.psl mrnaBestRaw.psl mrnaBestRaw.psr (You might be able to do the EST version of this with jkStuff/postEst.sh). Since some of the EST files are over 2 gig, you need to do some of this on an Alpha. However since some of the files are too large for the per-process memory on an Alpha, you'll have to do some on kks00 as well. To do this run the following script - ssh kks00 - cd ~/oo - source jkStuff/postEst.sh o - Load mRNA alignments into database. - ssh cc94 cd /projects/hg2/gs.6/oo.27/psl/mrna foreach i (*.psl) mv $i $i:r_mrna.psl end hgLoadPsl hg6 *.psl cd .. - remove header lines from all_mrna.psl - hgLoadPsl hg6 all_mrna.psl o - Load EST alignments into database. - ssh cc94 cd /projects/hg2/gs.6/oo.27/psl/est foreach i (*.psl) mv $i $i:r_est.psl end hgLoadPsl hg6 *.psl cd .. - remove header lines from all_est.psl - hgLoadPsl hg6 all_est.psl o - Create subset of ESTs with introns and load into database. - ssh kks00 cd ~/oo source jkStuff/makeIntronEst.sh - ssh cc94 cd ~/oo/psl/intronEst hgLoadPsl hg6 *.psl SIMPLE REPEAT TRACK o - Execute the following (which will take about 8 hours). mkdir ~/oo/bed/simpleRepeat cd ~/oo/nib foreach i (*.nib) echo processing $i trfBig $i ~/oo/bed/simpleRepeat/$i -bed end cd ~/oo/bed/simpleRepeat rm *.nib cat c*.bed > all.bed Then log onto cc94 and cd ~/oo/bed/simpleRepeat hg6 < ~/src/hg/lib/simpleRepeat.sql At the mysql> prompt type load data local infile 'all.bed' into table simpleRepeat quit PRODUCING MOUSE ALIGNMENTS o - Make sure that contig/ctg*.fa.masked files exist already. o - Mask simple repeats in addition to normal repeats with: mkdir ~/oo/jkStuff/trfCon cd ~/oo/allctgs /bin/ls -1 | grep -v '\.' > ~/oo/jkStuff/trfCon/ctg.lst cd ~/oo/jkStuff/trfCon mkdir err mkdir log mkdir out edit ~/oo/jkStuff/trf.gsub to update gs and oo version gensub ctg.lst ~/oo/jkStuff/trf.gsub mv gensub.out trf.con condor_submit trf.con wait for this to finish. o - Load up the cluster with masked copies of the genome. ssh kks00 cd ~/oo/allctgs zip -j -1 ../ctgFaTrf.zip */ctg*.fa.trf ssh cc01 cd ~/oo ccCp ctgFaTrf.zip /var/tmp/hg/gs.6/oo.27/ctgFaTrf.zip Do the following to unpack cd ~/cluster edit parOne to read cd /var/tmp/hg/gs.6/oo.27 mkdir tanMaskNib cd tanMaskNib unzip ../ctgFaTrf.zip and then parAll Double check that all data is in place. o - Do mouse/human alignments. ssh cc01 cd ~/mouse/bigCon mkdir err mkdir out mkdir log mkdir psl ls -1S /var/tmp/hg/gs.6/oo.27/tanMaskNib/*.fa.trf > human.lst ls -1S /projects/hg3/mouse/ncbiTrace/fasta.mus_* > mouse.lst gensub2 mouse.lst human.lst aliMouse.gsub aliMouse.con break aliMouse.con into ten files and submit a couple a day. o - Check alignments as so: ssh kks00 cd ~/mouse/bigCon gensub2 mouse.lst human.lst fixMouse.gsub fixMouse.sh chmod a+x fixMouse.sh fixMouse.sh | tee fixMouse.out o - If any output in fixMouse.out then resubmit jobs as so: ssh cc01 cd ~/mouse/bigCon mv log log.old; mkdir log mv err err.old; mkdir err mv out out.old; mkdir out condor_submit fix.con o - Sort alignments as so and get rid of pile-ups due to contamination: cd ~/mouse/bigCon pslCat -dir -check psl | liftUp -type=.psl stdout ~/oo.27/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin o - Get rid of big pile-ups due to contamination as so: cd chrom foreach i (*.psl) echo $i mv $i xxx pslUnpile -maxPile=150 xxx $i rm xxx end o - Rename to correspond with tables as so and load into database: foreach i (*.psl) set r = $i:r mv $i ${r}_blatMouse.psl end o - Remove long redundant bits from read names by making a file called subs.in with the following line: gnl|ti^ti and running the command cd ~/mouse/bigCon/chrom subs -e -c ^ *.psl > /dev/null o - load into database as so: ssh cc94 cd ~/mouse/bigCon/chrom hgLoadPsl hg6 *.psl hgLoadRna addSeq '-abbr=gnl|' hg6 /projects/hg3/mouse/ncbiTrace/fasta.mus* These last two will take quite some time. Perhaps an hour each. o - Build fa file with only mouse traces that are part of alignments: ssh kks00 cd ~/mouse/vsOo29 subsetTraces ~/mouse/ncbiTrimmed blatMouse aliTrace.fa '-abbr=gnl|' '-tracePat=*.fa' '=pslPat=*.psl' o - Load mouse trace ancillary information into the mouseTraceInfo table. This parses the trace id and template id out of the NCBI trace FASTA files. The name of the table as well as the database is specified: hgTraceInfo hg6 mouseTraceInfo /projects/hg3/mouse/ncbiTrace/fasta.mus* PRODUCING KNOWN GENES (RefSeq) o - Go to the Entrez browser at http://www.ncbi.nlm.nih.gov/Entrez/ and select Nucleotide in the search box. Go to Limits and select 'exclude all of the above', Molecule: mRNA, Only from: RefSeq, and then put "Homo sapiens" [organism] in the search box and hit go. Tweak things so can save it to /projects/hg3/refseq.123/refseq.gb in GenBank flat file format. o - Download ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2acc and mim2loc to /projects/hg3/refseq.123 o - Similarly download refSeq proteins in fasta format to refSeq.pep o - Align mRNAs as so: ssh kks00 cd /projects/hg3/refSeq gbToFaRa mrna.fil refSeq.fa refSeq.ra refSeq.ta refseq.gb gfClient cc 17779 ~/oo/nib refSeq.fa rsVsOo27Raw.psl (This will take about 4 hours) pslReps rsVsOo27Raw.psl rsVsOo27Mixed.psl /dev/null -sizeMatters -minAli=0.97 -nearTop=0.005 -minCover=0.2 pslSortAcc head chrom temp rsVsOo27Mixed.psl pslCat chrom/*.psl > rsVsOo27.psl grep NM_ rsVsOo27.psl > known.oo27.psl o - Produce refGene, refPep, refMrna, and refLink tables as so: ssh cc94 cd /projects/hg3/refseq.123 hgRefSeqMrna hg6 refSeq.fa refSeq.ra known.oo27.psl loc2ref refSeqPep.fa mim2loc LOADING IN EXTERNAL FILES o - Load chromosome bands: - login to cc94 - cd /projects/hg2/gs.6/oo.27/bed - mkdir cytoBands - cp /projects/cc/hg/mapplots/data/tracks/oo.27/cytobands.bed cytoBands - hg6 < ~/src/hg/lib/cytoBand.sql - Enter database with "hg6" command. - At mysql> prompt type in: load data local infile 'cytobands.bed' into table cytoBand; - At mysql> prompt type quit o - Load STSs - login to cc94 - cd ~/oo/bed - hg6 < ~/src/hg/lib/stsMarker.sql - mkdir stsMarker - cd stsMarker - cp /projects/cc/hg/mapplots/data/tracks/oo.27/stsMarker.bed stsMarker - Enter database with "hg6" command. - At mysql> prompt type in: load data local infile 'stsMarker.bed' into table stsMarker; - At mysql> prompt type quit - Ignore warnings from extra fields (7611 of them). o - Load SNPs into database. - ssh cc94 - cd ~/oo/bed - mkdir snp - cd snp - Download SNPs from ftp://ncbi.nlm.nih.gov/pub/sherry/dbSNPoo27.txt.gz - Unpack. grep RANDOM db*.txt > snpTsc.txt grep MIXED db*.txt >> snpTsc.txt grep BAC_OVERLAP db*.txt > snpNih.txt grep OTHER db*.txt >> snpNih.txt awk -f filter.awk snpTsc.txt > snpTsc.contig.bed awk -f filter.awk snpNih.txt > snpNih.contig.bed liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed hg6 < ~kent/src/hg/lib/snpTsc.sql hg6 < ~kent/src/hg/lib/snpNih.sql - Start up mySQL with the command 'hg6'. At the prompt do: load data local infile 'snpTsc.bed' into table snpTsc; load data local infile 'snpNihbed' into table snpNih; o - Load cpgIslands - login to cc94 - cd /projects/hg2/gs.6/oo.27/bed - mkdir cpgIsland - cd cpgIsland - hg6 < ~kent/src/hg/lib/cpgIsland.sql - wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/cpg_dec12.oo27.tar - tar -xf cpg*.tar - awk -f filter.awk */ctg*/*.cpg > contig.bed - liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed - Enter database with "hg6" command. - At mysql> prompt type in: load data local infile 'cpgIsland.bed' into table cpgIsland o - Load rnaGene table - login to cc94 - cd ~kent/src/hg/lib - hg6 < rnaGene.sql - cd /projects/hg2/gs.6/oo.27/bed - mkdir rnaGene - cd rnaGene - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz - gunzip *.gz - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff - hgRnaGenes hg6 chrom.gff o - Mouse synteny track - login to cc94. - cd ~/kent/src/hg/lib - hg6 < mouseSyn.sql - mkdir ~/oo/bed/mouseSyn - cd ~/oo/bed/mouseSyn - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attatchment as mouseSyn.txt - awk -f format.awk *.txt > mouseSyn.bed - delete first line of mouseSyn.bed - Enter database with "hg6" command. - At mysql> prompt type in: load data local infile 'mouseSyn.bed' into table mouseSyn o - Load Ensembl genes: cd ~/oo/bed mkdir ensembl wget ftp://ftp.ensembl.org/current/data/gtf/ens100_genes_gtf_dump.tar.gz mkdir contig cd contig gtar -zxf ../*.gz cd .. liftUp chromRaw.gtf ../../jkStuff/liftAll.lft warn contig/*.gtf grep -v start_codon chromRaw.gtf | grep -v stop_codon > chrom.gtf ldHgGene hg6 ensGene chrom.gtf o - Load Ensembl peptides: (poke around ensembl to get their peptide files as ensembl.pep) (substitute ENST for ENSP in ensemble.pep with subs) hgPepPred hg6 ensPep ensembl.pep o - Load Softberry genes and associated peptides: cd ~/oo/bed mkdir softberry cd softberry get ftp://www.softberry.com/pub/GP_SCDEC/* ldHgGene hg6 softberryGene chr*.gff hgPepPred hg6 softberry *.pro hgSoftberryHom hg6 *.pro o - Load Softberry promoters cd ~/oo/bed mkdir softPromoter cd softPromoter move Softberry_prometers.tar.gz from Victor's email attachment into this directory. gtar -zxf *.gz hgSoftPromoter hg6 *.prom o - Load exoFish table - login to cc94 - cd /projects/hg2/gs.6/oo.27/bed - mkdir exoFish - cd exoFish - hg6 < ~kent/src/hg/lib/exoFish.sql - Put email attatchment from Olivier Jaillon (ojaaillon@genoscope.cns.fr) into /projects/hg2/gs.6/oo.27/bed/exoFish/oo27.ecore1.4jimkent - cp oo27.ecore1.4jimkent foo - Substitute "chr" for "CHR" in that file by subs -e foo >/dev/null - Copy in filter.awk from previous version's bed/exoFish dir. - Add dummy name column and convert to tab separated with awk -f filter.awk foo > exoFish.bed - Enter database with "hg6" command. - At mysql> prompt type in: load data local infile 'exoFish.bed' into table exoFish >>> Done to here ~~~ o - Load Genie predicted genes and associated peptides. - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf - ldHgGene hg6 genieAlt predChrom.gtf - cat */ctg*/ctg*.affymetrix.aa > pred.aa - hgPepPred hg6 genie pred.aa - hg6 mysql> delete * from genieAlt where name like 'RS.%'; mysql> delete * from genieAlt where name like 'C.%'; FAKING DATA FROM PREVIOUS VERSION (This is just for until proper track arrives. Rescues about 97% of data Just an experiment, not really followed through on). o - Rescuing STS track: - log onto cc94 - mkdir ~/oo/rescue - cd !$ - mkdir sts - cd sts - bedDown hg3 mapGenethon sts.fa sts.tab - echo ~/oo/sts.fa > fa.lst - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g - log onto cc01 - cc ~/oo/rescue/sts - split all.con into 3 parts and condor_submit each part - wait for assembly to finish - cd psl - mkdir all - ln ?/*.psl ??/*.psl *.psl all - pslSort dirs raw.psl temp all - pslReps raw.psl contig.psl /dev/null - rm raw.psl - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl - rm contig.psl - mv chrom.psl ../convert.psl