This file describes how we made the browser database on NCBI build 28 (Dec 22, 2001 freeze) GETTING FRESH mRNA AND EST SEQUENCE FROM GENBANK. This will create a genbank.127 directory containing compressed GenBank flat files and a mrna.127 containing unpacked sequence info and auxiliarry info in a relatively easy to parse (.ra) format. o - Point your browser to ftp://ncbi.nlm.nih.gov/genbank and look at the README. Figure out the current release number (which is 127). o - Consider deleting one of the older genbank releases. It's good to at least keep one previous release though. o - Where there is space make a new genbank directory. Create a symbolic link to it: mkdir /whereverYouPutIt/genbank.127 ln -s /whereverYouPutIt/genbank.127 ~/genbank o - cd ~/genbank o - ftp ncbi.nlm.nih.gov (do anonymous log-in). Then do the following commands inside ftp: cd genbank prompt mget gbpri* gbrod* gbv* gbsts* gbest* gbmam* gbinv* This will take at least 2 hours. o - Log onto server and change to your genbank directory. o - cd /cluster/store1 o - mkdir mrna.127 o - cd genbank.127 o - gunzip -c gbpri*.gz | gbToFaRa ~/hg/h/mrna.fil ../mrna.127/mrna.fa ../mrna.127/mrna.ra ../mrna.127/mrna.ta stdin o - gunzip -c gbest*.gz | gbToFaRa ~/hg/h/mrna.fil ../mrna.127/est.fa ../mrna.127/est.ra ../mrna.127/est.ta stdin CREATING DATABASE AND STORING mRNA/EST SEQUENCE AND AUXILIARRY INFO (done) o - ln -s /cluster/store1/gs.11/build28 ~/oo o - Create the database. - ssh hgwdev - Enter mysql via: mysql -u hgcat -pbigsecret - At mysql prompt type: create database hg10; quit - make a semi-permanent read-only alias: alias hg10 mysql -u hguser -phguserstuff -A hg10 o - Make sure there is at least 5 gig free on hgwdev:/usr/local/mysql o - Store the mRNA (non-alignment) info in database. hgLoadRna new hg10 hgLoadRna add hg10 /cluster/store1/mrna.127/mrna.fa /cluster/store1/mrna.127/mrna.ra hgLoadRna add hg10 /cluster/store1/mrna.127/est.fa /cluster/store1/mrna.127/est.ra The last line will take quite some time to complete. It will count up to about 3,800,000 before it is done. REPEAT MASKING (done) For the NCBI assembly we repeat mask on the sensitive mode setting. Patrick Gavin did this, and perhaps will fill in the details later. Overall the process was to break the NT contigs into pieces of 1000000 bases each with cd NTXXX mkdir split cd split faSplit size ../NTXXX.fa 1000000 NTXXX -maxN=1000001 -lift=../NTXXX.lft and then arrange to do a cluster run that generates NTXXX_X.fa.out files in the split directory. After this lift up the .out files to whole contig coordinates with liftUp NTXXX.fa.out NTXXX.lft warn split/*.fa.out Then lift these up to chromosome coordinates with cd ~/oo tcsh jkStuff/liftOut2.sh Then load the .out files into the database with: ssh hgwdev cd ~/oo hgLoadOut hg10 ?/*.fa.out ??/*.fa.out STORING O+O SEQUENCE AND ASSEMBLY INFORMATION (done) Create packed chromosome sequence files ssh kkstore cd ~/oo tcsh jkStuff/makeNib.sh Load chromosome sequence info into database ssh hgwdev mysql -A -u hgcat -pbigsecret hg10 < ~/src/hg/lib/chromInfo.sql cd ~/oo hgNibSeq -preMadeNib hg10 /cluster/store1/gs.11/build28/nib ?/chr*.fa ??/chr*.fa Store o+o info in database. cd /cluster/store1/gs.11/build28 jkStuff/liftGl.sh contig.gl hgGoldGapGl hg10 /cluster/store1/gs.11 build28 cd /cluster/store1/gs.11 hgClonePos hg10 build28 ffa/sequence.inf /cluster/store1/gs.11 (Ignore warnings about missing clones - these are in chromosomes 21 and 22) hgCtgPos hg10 build28 Make and load GC percent table ssh hgwdev cd /cluster/store1/gs.11/build28/bed mkdir gcPercent cd gcPercent mysql -A -u hgcat -pbigsecret hg10 < ~/src/hg/lib/gcPercent.sql hgGcPercent hg10 ../../nib MAKING AND STORING mRNA AND EST ALIGNMENTS (done) o - Load up the local disks of the cluster with refSeq.fa, mrna.fa and est.fa from /cluster/store1/mrna.127 into /var/tmp/hg/h/mrna o - Use BLAT to generate refSeq, mRNA and EST alignments as so: Make sure that /scratch/hg/gs.11/build28/contigs is loaded with NT_*.fa and pushed to the cluster nodes. cd ~/oo/bed foreach i (refSeq mrna est) mkdir $i cd $i echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst ls -1 /scratch/hg/mrna.127/$i.fa > mrna.lst mkdir psl gensub2 genome.lst mrna.lst gsub spec jabba make hut spec jabba push hut end check on progress with jabba check hut in mrna, est, and refSeq directories. o - Process refSeq mRNA and EST alignments into near best in genome. cd ~/oo/bed cd refSeq pslSort dirs raw.psl /cluster/fast1/temp psl pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 raw.psl contig.psl /dev/null liftUp -nohead all_refSeq.psl ../../jkStuff/liftAll.lft carry contig.psl pslSortAcc nohead chrom /cluster/fast1/temp all_refSeq.psl cd .. cd mrna pslSort dirs raw.psl /cluster/fast1/temp psl pslReps -minAli=0.96 -nearTop=0.01 raw.psl contig.psl /dev/null liftUp -nohead all_mrna.psl ../../jkStuff/liftAll.lft carry contig.psl pslSortAcc nohead chrom /cluster/fast1/temp chrom.psl cd .. cd est pslSort dirs raw.psl /cluster/fast1/temp psl pslReps -minAli=0.93 -nearTop=0.01 raw.psl contig.psl /dev/null liftUp -nohead all_est.psl ../../jkStuff/liftAll.lft carry contig.psl pslSortAcc nohead chrom /cluster/fast1/temp all_est.psl cd .. o - Load refSeq alignments into database ssh hgwdev cd /cluster/store1/gs.11/build28/bed/refSeq pslCat -dir chrom > refSeqAli.psl hgLoadPsl hg10 -tNameIx refSeqAli.psl o - Load mRNA alignments into database. ssh hgwdev cd /cluster/store1/gs.11/build28/bed/mrna/chrom foreach i (*.psl) mv $i $i:r_mrna.psl end hgLoadPsl hg10 *.psl cd .. hgLoadPsl hg10 all_mrna.psl -nobin o - Load EST alignments into database. ssh hgwdev cd /cluster/store1/gs.11/build28/bed/est/chrom foreach i (*.psl) mv $i $i:r_est.psl end hgLoadPsl hg10 *.psl cd .. hgLoadPsl hg10 all_est.psl -nobin o - Create subset of ESTs with introns and load into database. - ssh kkstore cd ~/oo tcsh jkStuff/makeIntronEst.sh - ssh hgwdev cd ~/oo/bed/est/intronEst hgLoadPsl hg10 *.psl o - Put orientation info on ESTs into database: ssh kkstore cd ~/oo/bed/est pslSortAcc nohead contig /cluster/fast1/temp/contig.psl mkdir /scratch/hg/gs.11/build28/bed cp -r contig /scratch/hg/gs.11/build28/bed/est sudo /cluster/install/utilities/updateLocal cd ~/oo/bed mkdir estOrientInfo cd estOrientInfo mkdir ei ls -1 /scratch/hg/gs.11/build28/bed/est > psl.lst Get onto kk and cd to ~/oo/bed/estOrientInfo. Copy in gsub from the previous version and edit it to say where things are located in scratch on this version. Then: gensub2 gsub psl.lst single spec para create spec para try para push check until done, or use 'para shove' o - Create rnaCluster table ssh hgwdev cd ~/oo mkdir -p bed/rnaCluster/chrom foreach i (? ??) cd $i foreach j (chr*.fa) set c = $j:r echo clusterRna hg10 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c clusterRna hg10 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c end cd .. end cd bed/rnaCluster hgLoadBed hg10 rnaCluster chrom/*.bed PRODUCING KNOWN GENES (done) o - Download everything from ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ into /cluster/store1/mrna.127/refSeq. o - Unpack this into fa files and get extra info with: cd /cluster/store1/mrna.127/refSeq gunzip hs.gbff gunzip hs.faa.gz gbToFaRa ~/hg/h/mrna.fil ../refSeq.fa ../refSeq.ra ../refSeq.ta o - Align refSeq.fa to genome as described under mRNA/EST alignments above. o - Get extra info from NCBI and produce refGene table as so: ssh hgwdev cd ~/oo/bed/refSeq wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc o - Similarly download refSeq proteins in fasta format to refSeq.pep o - Align these by processes described under mRNA/EST alignments above. o - Produce refGene, refPep, refMrna, and refLink tables as so: ssh hgwdev cd ~/oo/bed/refSeq ln -s /cluster/store1/mrna.127 mrna hgRefSeqMrna hg10 mrna/refSeq.fa mrna/refSeq.ra all_refSeq.psl loc2ref mrna/refSeq/hs.faa mim2loc o - Add Jackson labs info cd ~/oo/bed mkdir jaxOrtholog cd jaxOrtholog ftp ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt awk -f filter.awk *.rpt > jaxOrtholog.tab Load this into mysql with something like: mysql -u hgcat -pBIGSECRET hg10 < ~/src/hg/lib/jaxOrtholog.sql mysql -u hgcat -pBIGSECRET -A hg10 and at the mysql> prompt load data local infile 'jaxOrtholog.tab' into table jaxOrtholog; o - Add RefSeq status info (done 6/19/02) hgRefSeqStatus hg10 loc2ref SIMPLE REPEAT TRACK (done) o - Create cluster jabba job like so: ssh kkstore cd ~/oo/bed mkdir simpleRepeat cd simpleRepeat cp ~/lastOo/bed/simpleRepeat/gsub mkdir trf echo single > single.lst echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst gensub2 genome.lst single.lst gsub spec jabba make hut spec jabba push hut Do jabba push huts until it is done. This one is getting erratic errors, requiring multiple runs on some jobs currently. I'm not sure if the problem lies in the cluster, in trfBig, or trf. Jorge suggested it might have been because some of the local disks were full. liftUp simpleRepeat.bed ~/oo/jkStuff/liftAll.lft warn trf/*.bed o - Load this into the database as so ssh hgwdev cd ~/oo/bed/simpleRepeat hgLoadBed hg10 simpleRepeat simpleRepeat.bed -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql PRODUCING GENSCAN PREDICTIONS (done) o - Produce contig genscan.gtf genscan.pep and genscanExtra.bed files like so: ssh roar cd ~/build28 source jkStuff/gsBig.sh & Wait about 3 1/2 days for these to finish. started 11:00 am Oct 5. finished by 9:00 pm Oct 8. o - Convert these to chromosome level files as so: cd ~/build28 mkdir bed/genscan liftUp bed/genscan/genscan.gtf jkStuff/liftAll.lft warn ?/ctg*/genscan.gtf ??/ctg*/genscan.gtf liftUp bed/genscan/genscanSubopt.bed jkStuff/liftAll.lft warn ?/ctg*/genscanSub.bed ??/ctg*/genscanSub.bed cat ?/ctg*/genscan.pep ??/ctg*/genscan.pep > bed/genscan/genscan.pep o - Load into the database as so: ssh hgwdev cd ~/build28/bed/genscan ldHgGene hg10 genscan genscan.gtf hgPepPred hg10 generic genscanPep genscan.pep hgLoadBed hg10 genscanSubopt genscanSubopt.bed PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS (done) Make sure that the NT*.fa files are lower-case repeat masked, and that the simpleRepeat track is made. Then put doubly (simple & interspersed) repeat mask files onto /cluster local drive as so. ssh kkstore tcsh mkdir /scratch/hg/gs.11/build28/trfFa cd ~/oo foreach i (? ??) cd $i foreach j (NT*) maskOutFa $j/$j.fa ../bed/simpleRepeat/trf/$j.bed -softAdd /scratch/hg/gs.11/build28/trfFa/$j.fa.trf echo done $i/$j end cd .. end Then ask admins to do a binrsync. TWINSCAN GENE PREDICTIONS (done 06/10/02) mkdir -p ~/oo/bed/twinscan cd ~/oo/bed/twinscan wget http://genes.cs.wustl.edu/human/Gtf.tgz wget http://genes.cs.wustl.edu/human/Ptx.tgz gunzip -c Gtf.tgz | tar xvf - gunzip -c Ptx.tgz | tar xvf - ldHgGene hg10 twinscan *.gtf -exon=CDS - pare down to id (and remove 0-length files) : foreach f (chr*.ptx) perl -wpe 's/^\>.*\s+source_id\s*\=\s*(\S+).*$/\>$1/;' < \ $f > $f:r-fixed.fa if (-z $f:r-fixed.fa) then rm $f:r-fixed.fa endif end hgPepPred hg10 generic twinscanPep chr*-fixed.fa CREATE GOLDEN TRIANGLE (todo) Make sure that rnaCluster table is in place. Then extract Affy expression info into a form suitable for Eisen's clustering program with: cd ~/oo/bed mkdir triangle cd triangle eisenInput hg10 affyHg10.txt Transfer this to Windows and do k-means clustering with k=200 with cluster. Transfer results file back to ~/oo/bed/triangle/affyCluster_K_G200.kgg. Then do promoSeqFromCluster hg10 1000 affyCluster_K_G200.kgg kg200.unmasked kg200.lift Then RepeatMask the .fa file inkg200.unmasked, and copy masked versions to kg200. Then cat kg200/*.fa > all1000.fa and set up cluster Improbizer run to do 100 controls for every real run on each - putting the output in imp.200.1000.e. When improbizer run is done make a file summarizing the runs as so: cd imp.200.1000.e motifSig ../imp.200.1000.e.iri ../kg200 motif control* get rid of insignificant motifs with: cd .. awk '{if ($2 > $3) print; }' imp.200.1000.e.iri > sig.200.1000.e.iri turn rest into just dnaMotifs with iriToDnaMotif sig.200.1000.e.iri motif.200.1000.e.txt Extract all promoters with featureBits hg10 rnaCluster:upstream:1000 -bed=upstream1000.bed -fa=upstream1000.fa Locate motifs on all promoters with dnaMotifFind motif.200.1000.e.txt upstream1000.fa hits.200.1000.e.txt -rc -markov=2 liftPromoHits upstream1000.bed hits.200.1000.e.txt triangle.bed DOING HUMAN/MOUSE ALIGMENTS (todo) o - Start with the mouse assembly in 1 Mb chunks lower-case repeat and tandem-repeat masked in /scratch/hg/mm1/trfFa, and the human assembly in contigs similiarly masked in /scratch/hg/gs.11/build28/trfFa Then ssh kkstore cd ~/oo/bed mkdir blatMus cd blatMus ls -1 /scratch/hg/mm1/trfFa/*.fa.trf > mouseAll mkdir mm cd mm splitFile ../mouseAll 50 mm cd .. ls -1 mm/* > mouse.lst Then bundle up the human into pieces of less than 12 meg mostly by ls -lS /scratch/hg/gs.11/build28/trfFa/*.fa.trf > bigHuman edit this file and move all of the lines less than 3 meg into the file smallHuman. Then do awk '{printf("/scratch/hg/gs.11/build28/trfFa/%s\n", $9);}' bigHuman > bigH awk '{printf("/scratch/hg/gs.11/build28/trfFa/%s\n", $9);}' smallHuman > smallH mkdir hs cd hs splitFile ../bigH 1 big rm big177 splitFile ../smallH 4 small rm small 705 cd .. ls -1 hs/* > human.lst (The rm's probably indicate splitFile needs a fix - they are zero length). Finally generate the job list with gensub2 human.lst mouse.lst gsub spec o - Do the cluster run as so ssh kk cd ~/oo/bed/blatMus mkdir psl para create spec para try and then do para push/check/push/check/shove etc. o - Sort alignments as so ssh kkstore cd ~/oo/bed/blatMus pslCat -dir -check psl | liftUp -type=.psl stdout ../../jkStuff/liftAll.lft warn stdin | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin -pslQ | pslSortAcc nohead chromPile /cluster/store2/temp stdin o - Get rid of big pile-ups due to contamination as so: mkdir chrom cd chromPile foreach i (*.psl) echo $i pslUnpile -maxPile=250 $i ../chrom/$i end o - Rename to correspond with tables as so and load into database: ssh hgwdev cd ~/oo/bed/blatMus/chrom foreach i (*.psl) set r = $i:r mv $i ${r}_blatMus.psl end hgLoadPsl hg10 *.psl o - load sequence into database as so: ssh kks00 faSplit about /projects/hg3/mouse/arachne.3/whole/Unplaced.mfa 1200000000 /projects/hg3/mouse/arachne.3/whole/unplaced ssh hgwdev hgLoadRna addSeq '-abbr=gnl|' hg10 /projects/hg3/mouse/arachne.3/whole/unpla*.fa hgLoadRna addSeq '-abbr=con' hg10 /projects/hg3/mouse/arachne.3/whole/SET*.mfa This will take quite some time. Perhaps an hour . o - Produce 'best in genome' filtered version: ssh kks00 cd ~/mouse/vsOo33 pslSort dirs blatMouseAll.psl temp blatMouse pslReps blatMouseAll.psl bestMouseAll.psl /dev/null -singleHit -minCover=0.3 -minIdentity=0.1 pslSortAcc nohead bestMouse temp bestMouseAll.psl cd bestMouse foreach i (*.psl) set r = $i:r mv $i ${r}_bestMouse.psl end o - Load best in genome into database as so: ssh hgwdev cd ~/mouse/vsOo33/bestMouse hgLoadPsl hg10 *.psl PRODUCING CROSS_SPECIES mRNA ALIGMENTS (done for mRNA, not for EST) Here you align vertebrate mRNAs against the masked genome on the cluster you set up during the previous step. o - Make sure that gbpri, gbmam, gbrod, and gbvert are downloaded from Genbank into /cluster/store1/genbank.127 o - Process these out of genbank flat files as so: ssh kkstore cd /cluster/store1/genbank.127 mkdir ../mrna.127 gunzip -c gbpri*.gz gbmam*.gz gbrod*.gz gbvrt*.gz gbinv*.gz | gbToFaRa ~/hg/h/xenoRna.fil ../mrna.127/xenoRna.fa ../mrna.127/xenoRna.ra ../mrna.127/xenoRna.ta stdin cd ../mrna.127 faSplit sequence xenoRna.fa 2 xenoRna ssh kks00 cd /projects/hg2 mkdir mrna.127 cp /cluster/store1/mrna.127/xenoRna.* mrna.127 Set up cluster run. First make sure genome is in kks00:/scratch/hg/gs.8/build28/contigTrf in RepeatMasked + trf form. (This is probably done already in mouse alignment stage). Also make sure /scratch/hg/mrna.127 is loaded with xenoRna.fa Then do: ssh kkstore cd /cluster/store1/gs.8/build28/oo/bed mkdir xenoMrna cd xenoMrna mkdir psl ls -1S /scratch/hg/gs.11/build28/trfFa > human.lst ls -1S /scratch/hg/mrna.127/xenoRna?*.fa > mrna.lst cp ~/lastOo/bed/xenoMrna/gsub . gensub2 human.lst mrna.lst gsub spec jabba make hut spec jabba push hut Do jabba check hut until the run is done, doing jabba push hut if necessary on occassion. Sort xeno mRNA alignments as so: ssh kkstore cd ~/oo/bed/xenoMrna pslSort dirs raw.psl /cluster/store2/temp psl pslReps raw.psl cooked.psl /dev/null -minAli=0.25 liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl pslSortAcc nohead chrom /cluster/store2/temp chrom.psl pslCat -dir chrom > xenoMrna.psl rm -r chrom raw.psl cooked.psl chrom.psl Load into database as so: ssh hgwdev cd ~/oo/bed/xenoMrna hgLoadPsl hg10 xenoMrna.psl -tNameIx cd /cluster/store1/mrna.127 hgLoadRna add hg10 /cluster/store1/mrna.127/xenoRna.fa xenoRna.ra Similarly do xenoEst aligments: ssh kkstore cd /cluster/store1/gs.8/build28/oo/bed mkdir xenoEst cd xenoEst mkdir psl ls -1S /scratch/hg/gs.11/build28/trfFa/*.fa > human.lst ls -1S /scratch/hg/mrna.127/xenoEst?*.fa > mrna.lst cp ~/lastOo/bed/xenoEst/gsub . gensub2 human.lst mrna.lst gsub spec jabba make hut spec jabba shove hut Sort xenoEst alignments: ssh kkstore cd ~/oo/bed/xenoEst pslSort dirs raw.psl /cluster/store2/temp psl pslReps raw.psl cooked.psl /dev/null -minAli=0.10 liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl pslSortAcc nohead chrom /cluster/store2/temp chrom.psl pslCat -dir chrom > xenoEst.psl rm -r chrom raw.psl cooked.psl chrom.psl Load into database as so: ssh hgwdev cd ~/oo/bed/xenoEst hgLoadPsl hg10 xenoEst.psl -tNameIx cd /cluster/store1/mrna.127 foreach i (xenoEst*.fa) echo processing $i hgLoadRna add hg10 /cluster/store1/mrna.127/$i $i:r.ra end PRODUCING FISH ALIGNMENTS (todo) o - Do fish/human alignments. ssh kk cd ~/mm/bed mkdir blatFish cd blatFish mkdir psl ls -1S /scratch/hg/fish > fish.lst ls -1S /scratch/hg/build28/trfFa > human.lst Copy over gsub from previous version and edit paths to point to current assembly. gensub2 human.lst fish.lst gsub spec para create spec para try Make sure jobs are going ok with para check. Then para push wait about 2 hours and do another para push do para checks and if necessary para pushes until done or use para shove. o - Sort alignments as so pslCat -dir psl | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin o - Copy to hgwdev:/scratch. Rename to correspond with tables as so and load into database: ssh hgwdev cd ~/oo/bed/blatFish/chrom foreach i (*.psl) set r = $i:r mv $i ${r}_blatFish.psl end hgLoadPsl hg10 *.psl TIGR GENE INDEX (todo) o - Download ftp://ftp.tigr.org/private/HGI_ren/*aug.tgz into ~/oo.29/bed/tgi and then execute the following commands: cd ~/oo.29/bed/tgi mv cattleTCs_aug.tgz cowTCs_aug.tgz foreach i (mouse cow human pig rat) mkdir $i cd $i gtar -zxf ../${i}*.tgz gawk -v animal=$i -f ../filter.awk * > ../$i.gff cd .. end mv human.gff human.bak sed s/THCs/TCs/ human.bak > human.gff ldHgGene -exon=TCs hg7 tigrGeneIndex *.gff LOAD STS MAP (done) - login to hgwdev cd ~/oo/bed hg10 < ~/src/hg/lib/stsMap.sql mkdir stsMap cd stsMap bedSort /projects/cc/hg/mapplots/data/tracks/build28/stsMap.bed stsMap.bed - Enter database with "hg10" command. - At mysql> prompt type in: load data local infile 'stsMap.bed' into table stsMap; - At mysql> prompt type quit LOAD CHROMOSOME BANDS (done) - login to hgwdev cd /cluster/store1/gs.11/build28/bed mkdir cytoBands cp /projects/cc/hg/mapplots/data/tracks/build28/cytobands.bed cytoBands hg10 < ~/src/hg/lib/cytoBand.sql Enter database with "hg10" command. - At mysql> prompt type in: load data local infile 'cytobands.bed' into table cytoBand; - At mysql> prompt type quit LOAD MOUSEREF TRACK (todo) First copy in data from kkstore to ~/oo/bed/mouseRef. Then substitute 'genome' for the appropriate chromosome in each of the alignment files. Finally do: hgRefAlign webb hg10 mouseRef *.alignments LOAD AVID MOUSE TRACK (todo) ssh cc98 cd ~/oo/bed mkdir avidMouse cd avidMouse wget http://pipeline.lbl.gov/tableCS-LBNL.txt hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed hgLoadBed avidRepeat avidRepeat.bed hgLoadBed avidUnique avidUnique.bed LOAD SNPS (todo) - ssh hgwdev - cd ~/oo/bed - mkdir snp - cd snp - Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.oo33.gz - Unpack. grep RANDOM gp.oo33 > snpTsc.txt grep MIXED gp.oo33 >> snpTsc.txt grep BAC_OVERLAP gp.oo33 > snpNih.txt grep OTHER gp.oo33 >> snpNih.txt awk -f filter.awk snpTsc.txt > snpTsc.contig.bed awk -f filter.awk snpNih.txt > snpNih.contig.bed liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed hgLoadBed hg10 snpTsc snpTsc.bed hgLoadBed hg10 snpNih snpNih.bed LOAD CPGISLANDS (done) - login to hgwdev - cd /cluster/store1/gs.11/build28/bed - mkdir cpgIsland - cd cpgIsland - hg10 < ~kent/src/hg/lib/cpgIsland.sql - wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/Freezes/cpg_dec2001.tar.gz - tar -xf cpg*.tar - copy filter.awk from a previous release, e.g. ~kent/oo.33/bed/cpgIsland - awk -f filter.awk */*/NT_*/*.cpg > contig.bed - liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed - Enter database with "hg10" command. - At mysql> prompt type in: load data local infile 'cpgIsland.bed' into table cpgIsland LOAD ENSEMBL GENES (done) cd ~/oo/bed mkdir ensembl cd ensembl Place the gtf dump file, xxx.gtf.gtz (for ensembl gene predictions) here. (e.g. wget http://www.sanger.ac.uk/~birney/all_april_ctg.gtf.gz) gunzip xxx.gtf.gtz Starting from Hg10, ensembl seems to have done their own lifting, so just collect all chromosome .gtf files. cat ?.gtf ??.gtf >all.gtf Add "chr" to front of each line to make it compatible with ldHgGene addchr all.gtf > ensembl.gtf rm all.gtf ldHgGene hg10 ensGene ensembl.gtf o - Load Ensembl peptides: (poke around ensembl to get their peptide files as ensembl.pep) wget ftp://ftp.ensembl.org/pub/current_human/data/fasta/pep/*s.pep.fa.gz gunzip Homo_sapiens.pep.fa.gz Substitute ENST for ENSP in ensemble.pep with subs edit subs.in to read: ENSP|ENST subs -e Homo_sapiens.pep.fa >/dev/null hgPepPred hg10 generic ensPep ensembl.pep LOAD SANGER22 GENES cd ~/oo/bed mkdir sanger22 cd sanger22 not sure where these files were downloaded from grep -v Pseudogene Chr22*.genes.gff | hgSanger22 hg10 stdin Chr22*.cds.gff *.genes.dna *.cds.pep 0 | ldHgGene hg10 sanger22pseudo stdin Note: this creates sanger22extras, but doesn't currently create a correct sanger22 table, which are replaced in the next steps sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \ | ldHgGene hg10 sanger22 stdin sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \ | ldHgGene hg10 sanger22pseudo stdin hgPepPred hg10 generic sanger22pep *.pep LOAD SANGER 20 GENES (todo) First download files from James Gilbert's email to ~/oo/bed/sanger20 and go to that directory while logged onto hgwdev. Then: grep -v Pseudogene chr_20*.gtf | ldHgGene hg10 sanger20 stdin hgSanger20 hg10 *.gtf *.info LOAD RNAGENES (todo) - login to hgwdev - cd ~kent/src/hg/lib - hg10 < rnaGene.sql - cd /cluster/store1/gs.11/build28/bed - mkdir rnaGene - cd rnaGene - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz - gunzip *.gz - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff - hgRnaGenes hg10 chrom.gff LOAD EXOFISH (todo) - login to hgwdev - cd /cluster/store1/gs.11/build28/bed - mkdir exoFish - cd exoFish - hg10 < ~kent/src/hg/lib/exoFish.sql - Put email attatchment from Olivier Jaillon (ojaaillon@genoscope.cns.fr) into /cluster/store1/gs.11/build28/bed/exoFish/all_maping_ecore - awk -f filter.awk all_maping_ecore > exoFish.bed - hgLoadBed hg10 exoFish exoFish.bed LOAD MOUSE SYNTENY (done) - login to hgwdev. - cd ~/kent/src/hg/lib - hg10 < mouseSyn.sql - mkdir ~/oo/bed/mouseSyn - cd ~/oo/bed/mouseSyn - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attatchment as mouseSyn.txt - awk -f format.awk *.txt > mouseSyn.bed - delete first line of mouseSyn.bed - Enter database with "hg10" command. - At mysql> prompt type in: load data local infile 'mouseSyn.bed' into table mouseSyn LOAD GENIE (todo) - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf - ldHgGene hg10 genieAlt predChrom.gtf - cat */ctg*/ctg*.affymetrix.aa > pred.aa - hgPepPred hg10 genie pred.aa - hg10 mysql> delete * from genieAlt where name like 'RS.%'; mysql> delete * from genieAlt where name like 'C.%'; LOAD SOFTBERRY GENES (done) - ln -s /cluster/store1/gs.11/build28/ ~/oo - cd ~/oo/bed - mkdir softberry - cd softberry - wget ftp://www.softberry.com/pub/SC_HUM_DEC01/soft_fg_hum_dec01.tar.gz - mkdir output - Run the fixFormat.pl routine in ~/oo/bed/softberry like so: ./fixFormat.pl output - This will stick all the properly converted .gff and .pro files in the directory named "output". - cd output ldHgGene hg10 softberryGene chr*.gff hgPepPred hg10 softberry *.pro hgSoftberryHom hg10 *.pro - RT 283 fixed 7/19/02: cd ~/hg10/bed/softberry/output foreach f (chr*.gff) set f2 = $f:r-fixed.gff ../removeBogusExons.pl $f > $f2 end ldHgGene hg10 softberryGene chr*-fixed.gff LOAD GENEID GENES (done) mkdir ~/oo/bed/geneid cd ~/oo/bed/geneid mkdir download cd download Now download *.gtf and *.prot from http://www1.imim.es/genepredictions/H.sapiens/golden_path_20011222/geneid_v1.1/ cd .. ldHgGene hg10 geneid download/*.gtf -exon=CDS hgPepPred hg10 generic geneidPep download/*.prot SGP GENE PREDICTIONS mkdir ~/oo/bed/sgp cd ~/oo/bed/sgp mkdir download cd download Now download *.gtf and *.prot from http://www1.imim.es/genepredictions/H.sapiens/golden_path_20011222/SGP/ Get rid of the extra .N in the transcripts with subs. cd .. cp ~/oo/bed/geneid/subs.in . subs -e download/*.gtf > /dev/null ldHgGene hg10 sgpGene download/*.gtf -exon=CDS hgPepPred hg10 generic sgpPep download/*.prot LOAD ACEMBLY - Get acembly*gene.gff from Jean and Michelle Thierry-Miegs and place in ~/oo/bed/acembly - cd ~/oo/bed/acembly - Replace c_chr with chr in acembly*.gff - Use /cluster/store1/gs.11/build28/mattStuff/filterGff.pl to do the above and to concat the results into the the aceChrom.gff gile. Read the script to see what it does. It's tiny and simple. - Use /cluster/store1/gs.11/build28/mattStuff/filterProteins.pl to do the below steps. - Remove G_t*_ and likewise G_t2_, etc., from acemblyproteins.*.fasta and concatenate all the protein.fasta files into a single acembly.pep file - Load into database as so: ldHgGene hg10 acembly aceChrom.gff hgPepPred hg10 generic acemblyPep acembly.pep LOAD GENOMIC DUPES (todo) o - Load genomic dupes ssh hgwdev cd ~/oo/bed mkdir genomicDups cd genomicDups wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip unzip *.zip awk -f filter.awk oo33_dups_for_kent > genomicDups.bed mysql -u hgcat -pbigSECRET hg10 < ~/src/hg/lib/genomicDups.sql hgLoadBed hg10 -oldTable genomicDups genomicDupes.bed FAKING DATA FROM PREVIOUS VERSION (This is just for until proper track arrives. Rescues about 97% of data Just an experiment, not really followed through on). o - Rescuing STS track: - log onto hgwdev - mkdir ~/oo/rescue - cd !$ - mkdir sts - cd sts - bedDown hg3 mapGenethon sts.fa sts.tab - echo ~/oo/sts.fa > fa.lst - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g - log onto cc01 - cc ~/oo/rescue/sts - split all.con into 3 parts and condor_submit each part - wait for assembly to finish - cd psl - mkdir all - ln ?/*.psl ??/*.psl *.psl all - pslSort dirs raw.psl temp all - pslReps raw.psl contig.psl /dev/null - rm raw.psl - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl - rm contig.psl - mv chrom.psl ../convert.psl LOADING MOUSE MM2 BLASTZ ALIGNMENTS FROM PENN STATE: (markd) - loading both blastz alignments and reference (single coverage) alignments - in xAli format, which includes sequence - done in a tmp dir and intermediate files discarded - create psl files for each per-contig lav file set sc="" set tbl="blastzMm2" foreach chrdir (/cluster/store1/gs.11/build28/bed/blastz.mm2.2002-04-14/lav/chr*) set chr=$chrdir:t set outdir=${tbl}.psl/$chr mkdir -p $outdir foreach lav ($chrdir/*.lav${sc}) lavToPsl -target-strand=+ $lav $outdir/$lav:t:r.psl end end - Make temporary per-contig mouse nib files for use by pslToXa (which doesn't currenty support FASTA) mkdir -p mm.contig.nib foreach fa (/cluster/store2/mm.2002.02/mm2/*/chr*/chr*.fa) set nib=mm.contig.nib/$fa:r:t.nib faToNib $fa $nib end - Convert to per-chromsome files, sort, and add sequence mkdir -p ${tbl}.xa foreach chrdir (${tbl}.psl/*) set chr=$chrdir:t pslCat -check -nohead -ext=.psl -dir ${tbl}.psl/$chr \ | sort -k 15n -k 16n \ | pslToXa stdin ${tbl}.xa/${chr}_${tbl}.xa mm.contig.nib /cluster/store1/gs.11/build28/nib end - repeat both loops, this time doing the single-coverage alignment: set sc=".sc" set tbl="blastzMm2Sc - Load tables - alter table chr19_blastzMm2 max_rows=10000000 avg_row_length=1200; cd blastzMm2.xa hgLoadPsl -xa hg10 *.xa cd blastzMm2.xa hgLoadPsl -xa hg10 *.xa - Load aligned ancient repeats, from /cluster/store1/gs.11/build28/bed/blastz.mm2.2002-04-14/aar Ryan create: /cluster/store1/gs.11/build28/bed/blastz.mm2.2002-04-14/aar/xali - Loaded into aarMm2, delete repeats with < 50 aligned positions - The original data contained may duplicated alignments, so did a sort -u | sort -k 15n -k 16n on each PSL before loading. - Need to filter for #aligned < 50 - Append mm2 nib info to mouseChrom table with mm2. prefix - cd /cluster/store2/mm.2002.02/mm2/ - hgNibSeq -preMadeNib -table=mouseChrom -chromPrefix=mm2. -append hg10 /cluster/store2/mm.2002.02/mm2/nib ?/chr*.fa ??/chr*.fa