This file describes how we made the browser database on the mouse
genome, November 2001 build.

PROCESSING MOUSE MRNA FROM GENBANK INTO DATABASE. (done)

o - Create the mouseRNA and mouseEST fasta files.
    ssh kkstore
    cd /cluster/store1/genbank.127
    gunzip -c gbrod*.gz | gbToFaRa /projects/compbio/experiments/hg/h/mouseRna.fil ../mrna.127/mouseRna.fa ../mrna.127/mouseRna.ra ../mrna.127/mouseRna.ta stdin
    gunzip -c gbest*.gz | gbToFaRa /projects/compbio/experiments/hg/h/mouseRna.fil ../mrna.127/mouseEst.fa ../mrna.127/mouseEst.ra ../mrna.127/mouseEst.ta stdin

CREATING DATABASE AND STORING mRNA/EST SEQUENCE AND AUXILIARY INFO (done)

o - Create the database.
     - ssh hgwdev
     - Enter mysql via:
           mysql -u hgcat -pbigsecret
     - At mysql prompt type:
           create database mm1;
           quit
     - make a semi-permanent read-only alias:
           alias mm1 "mysql -u hguser -phguserstuff -A mm1"
o - Use df to make sure there is at least 5 gig free on
    hgwdev:/usr/local/mysql
o - Store the mRNA (non-alignment) info in database.
    hgLoadRna new mm1
    hgLoadRna add mm1 /cluster/store1/mrna.127/mouseRna.fa /cluster/store1/mrna.127/mouseRna.ra
    hgLoadRna add mm1 /cluster/store1/mrna.127/mouseEst.fa /cluster/store1/mrna.127/mouseEst.ra
    The last line will take quite some time to complete. It will count up
    to about 3,800,000 before it is done.
    NOTE: I had to patch the extFile table a little after this. Time to
    add a type option to hgLoadRna perhaps. -jk

BREAK UP THE MOUSE SEQUENCE INTO 1 MB CHUNKS AT NON_BRIDGED CONTIGS (done)

o - This version of the mouse sequence data is in
    /cluster/store2/mm.2001.11/mm1/assembly
     - The files are mouse.oo.03.agp and mouse.oo.03.agp.fasta
o - cd into your CVS source tree under kent/src/hg/splitFaIntoContigs
     - Type make
     - Run
           splitFaIntoContigs /cluster/store2/mm.2001.11/mm1/assembly/mouse.oo.03.agp /cluster/store2/mm.2001.11/mm1/assembly/mouse.oo.03.agp.fasta /cluster/store2/mm.2001.11/mm1 1000000
     - This will split the mouse sequence into approx.
       1Mbase supercontigs between non-bridged clone contigs and drop the
       resulting dir structure in /cluster/store2/mm.2001.11/mm1.
     - The resulting dir structure will include 1 dir for each chromosome,
       each of which has a set of subdirectories, one subdir per
       supercontig.

COPY THE MOUSE SEQUENCE DATA TO THE CLUSTER (done)

o - ssh kkstore
o - Copy the mouseRNA and mouseEst data to the cluster
     - cp /cluster/store1/mrna.127/mouseRna.fa /scratch/hg/mrna.127
     - cp /cluster/store1/mrna.127/mouseEst.fa /scratch/hg/mrna.127
o - Copy the mouse sequence supercontigs to the cluster
     - mkdir /scratch/hg/mm1/
     - mkdir /scratch/hg/mm1/contigs
     - cp /cluster/store2/mm.2001.11/mm1/*/chr*/chr*.fa /scratch/hg/mm1/contigs

REPEAT MASKING (done)

Repeat mask on sensitive settings by doing
    ssh kk
    cd ~/mm/bed
    mkdir rmskS
    cd rmskS
    cp ~/lastMm/bed/rmskS/rmskS.csh .
    edit rmskS.csh to update paths to current mouse assembly
    cp ~/lastMm/bed/rmskS/template .
    ls -1 /scratch/hg/mm1/contigs/*.fa > genome.lst
    gensub2 genome.lst single template spec
    mkdir out
    para create spec
then do para try and check things out. If it looks good then para push
or para shove until finished.

Next copy these into the chromosome/contig directories by updating
the paths in jkStuff/cpRmskS.sh, and then sourcing it.

Then lift these up to chromosome coordinates with
    cd ~/mm
    tcsh jkStuff/liftOut2.sh
Then load the .out files into the database with:
    ssh hgwdev
    cd ~/mm
    hgLoadOut mm1 ?/*.fa.out ??/*.fa.out

STORING O+O SEQUENCE AND ASSEMBLY INFORMATION (done)

Create packed chromosome sequence files
    ssh kkstore
    cd ~/oo
    tcsh jkStuff/makeNib.sh

Load chromosome sequence info into database
    ssh hgwdev
    mysql -A -u hgcat -pbigsecret mm1 < ~/src/hg/lib/chromInfo.sql
    cd ~/mm
    hgNibSeq -preMadeNib mm1 /cluster/store2/mm.2001.11/mm1/nib ?/chr*.fa ??/chr*.fa NA*/chr*.fa

Store o+o info in database.
    cd /cluster/store2/mm.2001.11/mm1
    hgGoldGapGl mm1 /cluster/store2/mm.2001.11 mm1 -noGl

Make and load GC percent table
    ssh hgwdev
    cd /cluster/store2/mm.2001.11/mm1/bed
    mkdir gcPercent
    cd gcPercent
    mysql -A -u hgcat -pbigsecret mm1 < ~/src/hg/lib/gcPercent.sql
    hgGcPercent mm1 ../../nib

MAKING AND STORING mRNA AND EST ALIGNMENTS (done)

o - Load up the local disks of the cluster with refSeq.fa, mrna.fa and
    est.fa from /cluster/store1/mrna.127 into /var/tmp/hg/h/mrna

o - Use BLAT to generate refSeq, mRNA and EST alignments as so:
    Make sure that /scratch/hg/gs.11/build28/contigs is loaded
    with NT_*.fa and pushed to the cluster nodes.
    The following cshell script needs updating.
        cd ~/mm/bed
        foreach i (refSeq mrna est)
            mkdir $i
            cd $i
            echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst
            ls -1 /scratch/hg/mrna.127/$i.fa > mrna.lst
            mkdir psl
            gensub2 genome.lst mrna.lst gsub spec
            jabba make hut spec
            jabba push hut
            cd ..
        end
    check on progress with jabba check hut in mrna, est, and refSeq
    directories.

o - Process refSeq mRNA and EST alignments into near best in genome.
    cd ~/mm/bed

    cd refSeq
    pslSort dirs raw.psl /cluster/fast1/temp psl
    pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 raw.psl contig.psl /dev/null
    liftUp -nohead all_refSeq.psl ../../jkStuff/liftAll.lft carry contig.psl
    pslSortAcc nohead chrom /cluster/fast1/temp all_refSeq.psl
    cd ..

    cd mrna
    pslSort dirs raw.psl /cluster/fast1/temp psl
    pslReps -minAli=0.98 -sizeMatters -nearTop=0.005 raw.psl contig.psl /dev/null
    liftUp -nohead all_mrna.psl ../../jkStuff/liftAll.lft carry contig.psl
    pslSortAcc nohead chrom /cluster/fast1/temp all_mrna.psl
    cd ..

    cd est
    pslSort dirs raw.psl /cluster/fast1/temp psl
    pslReps -minAli=0.98 -sizeMatters -nearTop=0.005 raw.psl contig.psl /dev/null
    liftUp -nohead all_est.psl ../../jkStuff/liftAll.lft carry contig.psl
    pslSortAcc nohead chrom /cluster/fast1/temp all_est.psl
    cd ..

o - Load mRNA alignments into database.
    ssh hgwdev
    cd /cluster/store2/mm.2001.11/mm1/bed/mrna/chrom
    foreach i (*.psl)
        mv $i $i:r_mrna.psl
    end
    hgLoadPsl mm1 *.psl
    cd ..
    hgLoadPsl mm1 all_mrna.psl -nobin

o - Load EST alignments into database.
    ssh hgwdev
    cd /cluster/store2/mm.2001.11/mm1/bed/est/chrom
    foreach i (*.psl)
        mv $i $i:r_est.psl
    end
    hgLoadPsl mm1 *.psl
    cd ..
    hgLoadPsl mm1 all_est.psl -nobin

o - Create subset of ESTs with introns and load into database.
     - ssh kkstore
           cd ~/mm
           tcsh jkStuff/makeIntronEst.sh
     - ssh hgwdev
           cd ~/mm/bed/est/intronEst
           hgLoadPsl mm1 *.psl

PRODUCING KNOWN GENES (done)

o - Download everything from
    ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
    into /cluster/store1/mrna.127/refSeq.
o - Unpack this into fa files and get extra info with:
    cd /cluster/store1/mrna.127/refSeq
    gunzip mouse.gbff.gz
    gunzip mouse.faa.gz
o - Get extra info from NCBI and produce refGene table as so:
    ssh hgwdev
    cd ~/mm/bed/refSeq
    wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref
    wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc
o - Similarly download refSeq proteins in fasta format to refSeq.pep
o - Align these by processes described under mRNA/EST alignments above.
o - Produce refGene, refPep, refMrna, and refLink tables as so:
    ssh hgwdev
    cd ~/mm/bed/refSeq
    ln -s /cluster/store1/mrna.127 mrna
    hgRefSeqMrna mm1 mrna/mouseRefSeq.fa mrna/mouseRefSeq.ra all_refSeq.psl loc2ref mrna/refSeq/mouse.faa mim2loc
o - Add RefSeq status info (done 6/19/02)
    hgRefSeqStatus mm1 loc2ref

SIMPLE REPEAT TRACK (done)

o - Create cluster parasol job like so:
    ssh kk
    cd ~/mm/bed
    mkdir simpleRepeat
    cd simpleRepeat
    cp ~/lastOo/bed/simpleRepeat/gsub .
    mkdir trf
    echo single > single.lst
    echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst
    gensub2 genome.lst single.lst gsub spec
    para make spec
    para push
    This job had some problems this time, combined with the cluster going
    bizarre. We ended up running it serially, which only takes about
    8 hours anyway.
    liftUp simpleRepeat.bed ~/mm/jkStuff/liftAll.lft warn trf/*.bed

o - Load this into the database as so
    ssh hgwdev
    cd ~/mm/bed/simpleRepeat
    hgLoadBed mm1 simpleRepeat simpleRepeat.bed -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql

PRODUCING GENSCAN PREDICTIONS (done)

o - Produce contig genscan.gtf genscan.pep and genscanExtra.bed files like so:
    First make sure you have appropriate set up, permissions, etc. and
    you have tried using Parasol to submit and finished a set of jobs
    successfully.

    Log into kk
        cd ~/mm
        cd bed/genscan
    Make 3 subdirectories for genscan to put their output files in
        mkdir gtf pep subopt
    Generate a list file, genome.list, of all the contigs
        ls /scratch/hg/mm1/contigs/*.fa >genome.list
    Create a dummy file, single, containing just 1 single line of any text.
    Create template file, gsub, for gensub2. For example (3 lines file):
        #LOOP
        /cluster/home/fanhsu/bin/i386/gsBig {check in line+ $(path1)} {check out line gtf/$(root1).gtf} -trans={check out line pep/$(root1).pep} -subopt={check out line subopt/$(root1).bed} -exe=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/genscan -par=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/HumanIso.smat -tmp=/tmp -window=2400000
        #ENDLOOP
    Generate job list file, jobList, for Parasol
        gensub2 genome.list single gsub jobList
    First issue the following Parasol command:
        para create jobList
    Run the following command, which will try first 10 jobs from jobList
        para try
    Check if these 10 jobs run OK by
        para check
    If they have problems, debug and fix your program, template file,
    commands, etc. and try again. If they are OK, then issue the following
    command, which will ask Parasol to start all the jobs (around 3000 jobs).
        para push
    Issue either one of the following two commands to check the
    status of the cluster and your jobs, until they are done.
        parasol status
        para check
    If any job fails to complete, study the problem and ask Jim to help
    if necessary.
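
    Before lifting, it may help to confirm that every contig in
    genome.list produced the three output files named in the gsub
    template (gtf/<root>.gtf, pep/<root>.pep, subopt/<root>.bed). The
    following is only a sketch, written in POSIX sh rather than csh. The
    function name missing_genscan_outputs is hypothetical, it assumes
    $(root1) is the contig file name without directory or .fa extension,
    and it only tests that the files exist, not that they are complete.
    para check remains the real status report.

```shell
#!/bin/sh
# Sketch only: list expected genscan output files that are missing.
# missing_genscan_outputs is an illustrative helper, not a UCSC tool.
# Run it from the directory that holds gtf/, pep/ and subopt/.

missing_genscan_outputs() {
    list=$1                           # e.g. genome.list, one contig .fa path per line
    while IFS= read -r fa; do
        [ -n "$fa" ] || continue      # skip blank lines
        # Assumption: this matches $(root1) in the gsub template.
        root=$(basename "$fa" .fa)
        for f in "gtf/$root.gtf" "pep/$root.pep" "subopt/$root.bed"; do
            [ -e "$f" ] || echo "$f"  # print each output that is not there
        done
    done < "$list"
}
```

    Typical use would be "missing_genscan_outputs genome.list" from
    ~/mm/bed/genscan; no output means all expected files are present.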

o - Convert these to chromosome level files as so:
    cd ~/mm
    cd bed/genscan
    liftUp genscan.gtf ../../jkStuff/liftAll.lft warn gtf/*.gtf
    liftUp genscanSubopt.bed ../../jkStuff/liftAll.lft warn subopt/*.bed
    cat pep/*.pep > genscan.pep

o - Load into the database as so:
    ssh hgwdev
    cd ~/mm/bed/genscan
    ldHgGene mm1 genscan genscan.gtf
    hgPepPred mm1 generic genscanPep genscan.pep
    hgLoadBed mm1 genscanSubopt genscanSubopt.bed

PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS (done)

Make sure that the NT*.fa files are lower-case repeat masked, and that
the simpleRepeat track is made. Then put doubly (simple & interspersed)
repeat mask files onto /cluster local drive as so.
    ssh kkstore
    mkdir /scratch/hg/mm1/trfFa
    cd ~/mm
    foreach i (? ?? NA_*)
        cd $i
        foreach j (chr${i}_*)
            if (-d $j) then
                maskOutFa $j/$j.fa ../bed/simpleRepeat/trf/$j.bed -softAdd /scratch/hg/mm1/trfFa/$j.fa.trf
                echo done $i/$j
            endif
        end
        cd ..
    end
Then ask admins to do a binrsync.

DOING HUMAN/MOUSE ALIGNMENTS (todo)

o - Download the lower-case-masked assembly and put it in
    kkstore:/cluster/store1/a2ms.

o - Mask simple repeats in addition to normal repeats with:
    mkdir ~/mm/jkStuff/trfCon
    cd ~/mm/allctgs
    /bin/ls -1 | grep -v '\.' > ~/mm/jkStuff/trfCon/ctg.lst
    cd ~/mm/jkStuff/trfCon
    mkdir err log out
    edit ~/mm/jkStuff/trf.gsub to update gs and mm version
    gensub ctg.lst ~/mm/jkStuff/trf.gsub
    mv gensub.out trf.con
    condor_submit trf.con
    wait for this to finish. Check as so
        cd ~/mm
        source jkStuff/checkTrf.sh
    there should be no output.

o - Copy the RepeatMasked and trf masked genome to
    kkstore:/scratch/hg/gs.11/build28/contigTrf, and ask
    Jorge and all to binrsync to propagate the data
    across the new cluster.

o - Download the assembled mouse genome in lower-case
    masked form to /cluster/store1/arachne.3/whole.
    Execute the script splitAndCopy.csh to chop it
    into roughly 50M pieces in arachne.3/parts

o - Set up the jabba job to do the alignment as so:
    ssh kkstore
    cd /cluster/store2/mm.2001.11/mm1
    mkdir blatMouse.phusion
    cd blatMouse.phusion
    ls -1S /scratch/hg/gs.3/build28/contigTrf/* > human.lst
    ls -1 /cluster/store1/arachne.3/parts/* > mouse.lst
    Make a file 'gsub' with the following three lines in it
        #LOOP
        /cluster/home/kent/bin/i386/blat -q=dnax -t=dnax {check in line+ $(path2)} {check in line+ $(path1)} {check out line+ psl/$(root2)_$(root1).psl} -minScore=20 -minIdentity=20 -tileSize=4 -minMatch=2 -oneOff=0 -ooc={check in exists /scratch/hg/h/4.pooc} -qMask=lower -mask=lower
        #ENDLOOP
    Process this into a jabba file and launch the first set
    of jobs (10,000 out of 70,000) as so:
        gensub2 mouse.lst human.lst gsub spec
        jabba make hut spec
        jabba push hut
    Do a 'jabba check hut' after about 20 minutes and make sure
    everything is right. After that make a little script that
    does a "jabba push hut" followed by a "sleep 30" about 50
    times. Interrupt the script when you see jabba push say it's
    not pushing anything.

o - Sort alignments as so
    ssh kkstore
    cd /cluster/store2/mm.2001.11/mm1/blatMouse
    pslCat -dir -check psl | liftUp -type=.psl stdout ../liftAll.lft warn stdin | pslSortAcc nohead chrom /cluster/store2/temp stdin

o - Get rid of big pile-ups due to contamination as so:
    cd chrom
    foreach i (*.psl)
        echo $i
        mv $i xxx
        pslUnpile -maxPile=600 xxx $i
        rm xxx
    end

o - Remove long redundant bits from read names by making a file
    called subs.in with the following lines:
        gnl|ti^ti
        contig_^tig_
    and running the commands
        cd ~/mouse/vsOo33/blatMouse.phusion/chrom
        subs -e -c ^ *.psl > /dev/null

o - Copy over to network where database is:
    ssh kks00
    cd ~/mm/bed
    mkdir blatMouse
    mkdir blatMouse/ph.chrom600
    cd !$
    cp /cluster/store2/mm.2001.11/mm1/blatMouse.phusion/chrom/*.psl .
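
    The "little script" for the repeated "jabba push hut" / "sleep 30"
    cycle mentioned in the jabba step above could look roughly like the
    following. This is only a sketch, written in POSIX sh rather than
    csh. The names push_loop, PUSH_CMD, PUSH_SLEEP and STOP_PATTERN are
    hypothetical. The exact message jabba prints when it has nothing
    left to push is not recorded in this file, so STOP_PATTERN is left
    empty by default and the loop then simply runs the requested number
    of rounds; interrupt it by hand as described above.

```shell
#!/bin/sh
# Sketch only: repeat a push command with a pause between rounds.
# push_loop, PUSH_CMD, PUSH_SLEEP and STOP_PATTERN are illustrative names,
# not part of jabba.

PUSH_CMD=${PUSH_CMD:-"jabba push hut"}   # command taken from the step above
PUSH_SLEEP=${PUSH_SLEEP:-30}             # seconds to wait between pushes
STOP_PATTERN=${STOP_PATTERN:-""}         # assumption: text meaning "nothing pushed"

push_loop() {
    rounds=$1
    n=0
    while [ "$n" -lt "$rounds" ]; do
        n=$((n + 1))
        out=$($PUSH_CMD 2>&1)            # unquoted on purpose: split into words
        echo "round $n: $out"
        # Stop early only if the user supplied a pattern and it matches.
        if [ -n "$STOP_PATTERN" ] && printf '%s\n' "$out" | grep -q "$STOP_PATTERN"; then
            echo "stopping after round $n"
            return 0
        fi
        sleep "$PUSH_SLEEP"
    done
}
```

    After sourcing the file, "push_loop 50" gives the 50 rounds
    suggested above. Setting STOP_PATTERN first, once the wording of the
    jabba message has been checked by hand, lets the loop end by itself.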

o - Rename to correspond with tables as so and load into database:
    ssh hgwdev
    cd ~/mm/bed/blatMouse/ph.chrom600
    foreach i (*.psl)
        set r = $i:r
        mv $i ${r}_blatMouse.psl
    end
    hgLoadPsl mm1 *.psl

o - load sequence into database as so:
    ssh kks00
    faSplit about /projects/hg3/mouse/arachne.3/whole/Unplaced.mfa 1200000000 /projects/hg3/mouse/arachne.3/whole/unplaced
    ssh hgwdev
    hgLoadRna addSeq '-abbr=gnl|' mm1 /projects/hg3/mouse/arachne.3/whole/unpla*.fa
    hgLoadRna addSeq '-abbr=con' mm1 /projects/hg3/mouse/arachne.3/whole/SET*.mfa
    This will take quite some time. Perhaps an hour.

o - Produce 'best in genome' filtered version:
    ssh kks00
    cd ~/mouse/vsOo33
    pslSort dirs blatMouseAll.psl temp blatMouse
    pslReps blatMouseAll.psl bestMouseAll.psl /dev/null -singleHit -minCover=0.3 -minIdentity=0.1
    pslSortAcc nohead bestMouse temp bestMouseAll.psl
    cd bestMouse
    foreach i (*.psl)
        set r = $i:r
        mv $i ${r}_bestMouse.psl
    end

o - Load best in genome into database as so:
    ssh hgwdev
    cd ~/mouse/vsOo33/bestMouse
    hgLoadPsl mm1 *.psl

PRODUCING CROSS_SPECIES mRNA ALIGNMENTS (todo)

Here you align vertebrate mRNAs against the masked genome on the
cluster you set up during the previous step.

o - Make sure that gbpri, gbmam, gbrod, gbvrt, and gbinv are downloaded
    from Genbank into /cluster/store1/genbank.127

o - Process these out of genbank flat files as so:
    ssh kkstore
    cd /cluster/store1/genbank.127
    mkdir ../mrna.127
    gunzip -c gbpri*.gz gbmam*.gz gbrod*.gz gbvrt*.gz gbinv*.gz | gbToFaRa ~/hg/h/xenoRna.fil ../mrna.127/xenoRna.fa ../mrna.127/xenoRna.ra ../mrna.127/xenoRna.ta stdin
    cd ../mrna.127
    faSplit sequence xenoRna.fa 2 xenoRna
    ssh kks00
    cd /projects/hg2
    mkdir mrna.127
    cp /cluster/store1/mrna.127/xenoRna.* mrna.127

Set up cluster run. First make sure genome is in
kks00:/scratch/hg/gs.11/build28/contigTrf in RepeatMasked + trf form.
(This is probably done already in mouse alignment stage).
Also make sure /scratch/hg/mrna.127 is loaded with xenoRna.fa
Then do:
    ssh kkstore
    cd /cluster/store2/mm.2001.11/mm1/mm/bed
    mkdir xenoMrna
    cd xenoMrna
    mkdir psl
    ls -1S /scratch/hg/gs.11/build28/trfFa > human.lst
    ls -1S /scratch/hg/mrna.127/xenoRna?*.fa > mrna.lst
    cp ~/lastOo/bed/xenoMrna/gsub .
    gensub2 human.lst mrna.lst gsub spec
    jabba make hut spec
    jabba push hut
Do jabba check hut until the run is done, doing jabba push hut if
necessary on occasion.

Sort xeno mRNA alignments as so:
    ssh kkstore
    cd ~/mm/bed/xenoMrna
    pslSort dirs raw.psl /cluster/store2/temp psl
    pslReps raw.psl cooked.psl /dev/null -minAli=0.25
    liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
    pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
    pslCat -dir chrom > xenoMrna.psl
    rm -r chrom raw.psl cooked.psl chrom.psl

Load into database as so:
    ssh hgwdev
    cd ~/mm/bed/xenoMrna
    hgLoadPsl mm1 xenoMrna.psl -tNameIx
    cd /cluster/store1/mrna.127
    hgLoadRna add mm1 /cluster/store1/mrna.127/xenoRna.fa xenoRna.ra

Similarly do xenoEst alignments:
    ssh kkstore
    cd /cluster/store2/mm.2001.11/mm1/mm/bed
    mkdir xenoEst
    cd xenoEst
    mkdir psl
    ls -1S /scratch/hg/gs.11/build28/trfFa/*.fa > human.lst
    ls -1S /scratch/hg/mrna.127/xenoEst?*.fa > mrna.lst
    cp ~/lastOo/bed/xenoEst/gsub .
    gensub2 human.lst mrna.lst gsub spec
    jabba make hut spec
    jabba shove hut

Sort xenoEst alignments:
    ssh kkstore
    cd ~/mm/bed/xenoEst
    pslSort dirs raw.psl /cluster/store2/temp psl
    pslReps raw.psl cooked.psl /dev/null -minAli=0.10
    liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
    pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
    pslCat -dir chrom > xenoEst.psl
    rm -r chrom raw.psl cooked.psl chrom.psl

Load into database as so:
    ssh hgwdev
    cd ~/mm/bed/xenoEst
    hgLoadPsl mm1 xenoEst.psl -tNameIx
    cd /cluster/store1/mrna.127
    foreach i (xenoEst*.fa)
        echo processing $i
        hgLoadRna add mm1 /cluster/store1/mrna.127/$i $i:r.ra
    end

PRODUCING FISH ALIGNMENTS (in progress)

o - Download sequence from ...
    and put it on the cluster local disk at /scratch/hg/fish

o - Do fish/mouse alignments.
    ssh kk
    cd ~/mm/bed
    mkdir blatFish
    cd blatFish
    mkdir psl
    ls -1S /scratch/hg/fish > fish.lst
    ls -1S /scratch/hg/mm1/trfFa > mouse.lst
    Copy over gsub from previous version and edit paths to
    point to current assembly.
        gensub2 mouse.lst fish.lst gsub spec
        para create spec
        para try
    Make sure jobs are going ok with para check. Then
        para push
    wait about 2 hours and do another
        para push
    do para checks and if necessary para pushes until done
    or use para shove.

o - Sort alignments as so
    pslCat -dir psl | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin

o - Copy to hgwdev:/scratch. Rename to correspond with tables as so and
    load into database:
    ssh hgwdev
    cd ~/mm/bed/blatFish/chrom
    foreach i (*.psl)
        set r = $i:r
        mv $i ${r}_blatFish.psl
    end
    hgLoadPsl mm1 *.psl

TIGR GENE INDEX (todo)

o - Download ftp://ftp.tigr.org/private/HGI_ren/*aug.tgz into
    ~/oo.29/bed/tgi and then execute the following commands:
    cd ~/oo.29/bed/tgi
    mv cattleTCs_aug.tgz cowTCs_aug.tgz
    foreach i (mouse cow human pig rat)
        mkdir $i
        cd $i
        gtar -zxf ../${i}*.tgz
        gawk -v animal=$i -f ../filter.awk * > ../$i.gff
        cd ..
    end
    mv human.gff human.bak
    sed s/THCs/TCs/ human.bak > human.gff
    ldHgGene -exon=TCs hg7 tigrGeneIndex *.gff

LOAD STS MAP (todo)

 - login to hgwdev
       cd ~/mm/bed
       mm1 < ~/src/hg/lib/stsMap.sql
       mkdir stsMap
       cd stsMap
       bedSort /projects/cc/hg/mapplots/data/tracks/build28/stsMap.bed stsMap.bed
 - Enter database with "mm1" command.
 - At mysql> prompt type in:
       load data local infile 'stsMap.bed' into table stsMap;
 - At mysql> prompt type
       quit

LOAD CHROMOSOME BANDS (todo)

 - login to hgwdev
       cd /cluster/store2/mm.2001.11/mm1/bed
       mkdir cytoBands
       cp /projects/cc/hg/mapplots/data/tracks/build28/cytobands.bed cytoBands
       mm1 < ~/src/hg/lib/cytoBand.sql
 - Enter database with "mm1" command.
 - At mysql> prompt type in:
       load data local infile 'cytobands.bed' into table cytoBand;
 - At mysql> prompt type
       quit

LOAD MOUSEREF TRACK (todo)

First copy in data from kkstore to ~/mm/bed/mouseRef.
Then substitute 'genome' for the appropriate chromosome
in each of the alignment files. Finally do:
    hgRefAlign webb mm1 mouseRef *.alignments

LOAD AVID MOUSE TRACK (todo)

    ssh cc98
    cd ~/mm/bed
    mkdir avidMouse
    cd avidMouse
    wget http://pipeline.lbl.gov/tableCS-LBNL.txt
    hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
    hgLoadBed mm1 avidRepeat avidRepeat.bed
    hgLoadBed mm1 avidUnique avidUnique.bed

LOAD SNPS (todo)

 - ssh hgwdev
 - cd ~/mm/bed
 - mkdir snp
 - cd snp
 - Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.oo33.gz
 - Unpack.
       grep RANDOM gp.oo33 > snpTsc.txt
       grep MIXED gp.oo33 >> snpTsc.txt
       grep BAC_OVERLAP gp.oo33 > snpNih.txt
       grep OTHER gp.oo33 >> snpNih.txt
       awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
       awk -f filter.awk snpNih.txt > snpNih.contig.bed
       liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
       liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
       hgLoadBed mm1 snpTsc snpTsc.bed
       hgLoadBed mm1 snpNih snpNih.bed

LOAD CPG ISLANDS (todo)

 - login to hgwdev
 - cd /cluster/store2/mm.2001.11/mm1/bed
 - mkdir cpgIsland
 - cd cpgIsland
 - mm1 < ~kent/src/hg/lib/cpgIsland.sql
 - wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/cpg_aug.oo33.tar
 - tar -xf cpg*.tar
 - awk -f filter.awk */ctg*/*.cpg > contig.bed
 - liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed
 - Enter database with "mm1" command.
 - At mysql> prompt type in:
       load data local infile 'cpgIsland.bed' into table cpgIsland;

LOAD ENSEMBL GENES (todo)

    cd ~/mm/bed
    mkdir ensembl
    cd ensembl
    wget http://www.sanger.ac.uk/~birney/all_april_ctg.gtf.gz
    gunzip all*.gz
    liftUp ensembl110.gtf ~/mm/jkStuff/liftAll.lft warn all*.gtf
    ldHgGene mm1 ensGene en*.gtf

o - Load Ensembl peptides:
    (poke around ensembl to get their peptide files as ensembl.pep)
    (substitute ENST for ENSP in ensembl.pep with subs)
    wget ftp://ftp.ensembl.org/pub/current/data/fasta/pep/ensembl.pep.gz
    gunzip ensembl.pep.gz
    edit subs.in to read: ENSP|ENST
    subs -e ensembl.pep
    hgPepPred mm1 generic ensPep ensembl.pep

LOAD SANGER22 GENES (todo)

 - cd ~/mm/bed
 - mkdir sanger22
 - cd sanger22
 - not sure where these files were downloaded from
 - grep -v Pseudogene Chr22*.genes.gff | hgSanger22 mm1 stdin Chr22*.cds.gff *.genes.dna *.cds.pep 0 | ldHgGene mm1 sanger22pseudo stdin
 - Note: this creates sanger22extras, but doesn't currently create
   a correct sanger22 table; that table is replaced in the next steps
 - sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
       | ldHgGene mm1 sanger22 stdin
 - sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
       | ldHgGene mm1 sanger22pseudo stdin
 - hgPepPred mm1 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)

First download files from James Gilbert's email to ~/mm/bed/sanger20 and
go to that directory while logged onto hgwdev.
Then:
    grep -v Pseudogene chr_20*.gtf | ldHgGene mm1 sanger20 stdin
    hgSanger20 mm1 *.gtf *.info

LOAD RNAGENES (todo)

 - login to hgwdev
 - cd ~kent/src/hg/lib
 - mm1 < rnaGene.sql
 - cd /cluster/store2/mm.2001.11/mm1/bed
 - mkdir rnaGene
 - cd rnaGene
 - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz
 - gunzip *.gz
 - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff
 - hgRnaGenes mm1 chrom.gff

LOAD EXOFISH (todo)

 - login to hgwdev
 - cd /cluster/store2/mm.2001.11/mm1/bed
 - mkdir exoFish
 - cd exoFish
 - mm1 < ~kent/src/hg/lib/exoFish.sql
 - Put email attachment from Olivier Jaillon (ojaaillon@genoscope.cns.fr)
   into /cluster/store2/mm.2001.11/mm1/bed/exoFish/all_maping_ecore
 - awk -f filter.awk all_maping_ecore > exoFish.bed
 - hgLoadBed mm1 exoFish exoFish.bed

LOAD MOUSE SYNTENY (todo)

 - login to hgwdev.
 - cd ~/kent/src/hg/lib
 - mm1 < mouseSyn.sql
 - mkdir ~/mm/bed/mouseSyn
 - cd ~/mm/bed/mouseSyn
 - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attachment as
   mouseSyn.txt
 - awk -f format.awk *.txt > mouseSyn.bed
 - delete first line of mouseSyn.bed
 - Enter database with "mm1" command.
 - At mysql> prompt type in:
       load data local infile 'mouseSyn.bed' into table mouseSyn;

LOAD GENIE (todo)

 - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
 - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
 - ldHgGene mm1 genieAlt predChrom.gtf
 - cat */ctg*/ctg*.affymetrix.aa > pred.aa
 - hgPepPred mm1 genie pred.aa
 - mm1
       mysql> delete from genieAlt where name like 'RS.%';
       mysql> delete from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (done)

 - ln -s /cluster/store2/mm.2001.11/mm1 ~/mm
 - cd ~/mm/bed
 - mkdir softberry
 - cd softberry
 - get ftp://www.softberry.com/pub/SC_MOU_NOV01/softb_mou_genes_nov01.tar.gz
 - mkdir output
 - Run the fixFormat.pl routine in ~/mm/bed/softberry like so:
       ./fixFormat.pl output
 - This will stick all the properly converted .gff and .pro files in
   the directory named "output".
 - cd output
       ldHgGene mm1 softberryGene chr*.gff
       hgPepPred mm1 softberry *.pro
       hgSoftberryHom mm1 *.pro

LOAD ACEMBLY (todo)

 - Get acembly*gene.gff from Jean and Michelle Thierry-Miegs and
   place in ~/mm/bed/acembly
 - Replace c_chr with chr in acembly*.gff
 - Replace G_t1_chr with chr and likewise
   G_t2_chr with chr, etc.
 - cd ~/mm/bed/acembly
 - # The step directly below is not necessary since the files were already lifted
   # liftUp ./aceChrom.gff /cluster/store2/mm.2001.11/mm1/jkStuff/liftHs.lft warn acemblygenes*.gff
 - Use /cluster/store2/mm.2001.11/mm1/mattStuff/filterFiles.pl to prepend
   "chr" to the start of every line in the gene.gff files and to concat
   them into the aceChrom.gff file. Read the script to see what it does.
   It's tiny and simple.
 - Concatenate all the protein.fasta files into a single acembly.pep file
 - Load into database as so:
       ldHgGene mm1 acembly aceChrom.gff
       hgPepPred mm1 generic acemblyPep acembly.pep

LOAD GENOMIC DUPES (todo)

o - Load genomic dupes
    ssh hgwdev
    cd ~/mm/bed
    mkdir genomicDups
    cd genomicDups
    wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
    unzip *.zip
    awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
    mysql -u hgcat -pbigsecret mm1 < ~/src/hg/lib/genomicDups.sql
    hgLoadBed mm1 -oldTable genomicDups genomicDups.bed

FAKING DATA FROM PREVIOUS VERSION
(This is just for until the proper track arrives. Rescues about
97% of the data. Just an experiment, not really followed through on.)

o - Rescuing STS track:
     - log onto hgwdev
     - mkdir ~/mm/rescue
     - cd !$
     - mkdir sts
     - cd sts
     - bedDown hg3 mapGenethon sts.fa sts.tab
     - echo ~/mm/sts.fa > fa.lst
     - pslOoJobs ~/mm ~/mm/rescue/sts/fa.lst ~/mm/rescue/sts g2g
     - log onto cc01
     - cd ~/mm/rescue/sts
     - split all.con into 3 parts and condor_submit each part
     - wait for assembly to finish
     - cd psl
     - mkdir all
     - ln ?/*.psl ??/*.psl *.psl all
     - pslSort dirs raw.psl temp all
     - pslReps raw.psl contig.psl /dev/null
     - rm raw.psl
     - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
     - rm contig.psl
     - mv chrom.psl ../convert.psl