This file describes how we made the zoo browser database for each
organism, specified by $ORG.

DOWNLOAD THE ZOO $ORG DATA AND SET UP A LOCAL DIRECTORY STRUCTURE

o - See the doc in /cluster/store2/zoo/zoo.readme for download
    instructions, then download the data. The following should be put
    in a script, but it is not yet. For each organism $ORG in the
    organism list do:
        cd /cluster/store2/zoo
        mkdir zoo$ORG1
        cd zoo$ORG1
        mkdir 1
    (This fakes the chromosome directory structure so we can fake a
    browser track with 1 chromosome.)
        cd 1
        mkdir ctgCFTR
        cp /cluster/store2/zoo/rawData/CFTRseqs/$ORG_CFTR.fasta.masked ctgCFTR
        ln -s ctgCFTR/$ORG_CFTR.fasta.masked ./chr1.fa
        cd ..
        cd /cluster/store2/zoo/zoo$ORG1
        mkdir bed

IMPORTANT STEP - FAKE THE CHR1 CHROMOSOME IN THE FA FILES

o NOTE: The string $ORG_CFTR needs to be changed to chr1 in all the
  .fa files in the zoo$ORG1/1 directory. For example
  zoo/zooMouse1/1/chr1.fa has the string mouse_CFTR in it. This needs
  to be changed to chr1 globally in the file. I did this by hand in
  each .fa file (all 12) to get them to work. In vi, you can do a
  global search and replace by typing:
        ESC
        :%s/mouse_CFTR/chr1/g
  The above example is for the mouse.

PROCESSING $ORG MRNA FROM GENBANK INTO DATABASE (done)

o - Create the $ORGRNA and $ORGEST fasta files. This is usually
    already done.
        ssh kkstore
        cd /cluster/store1/genbank.127
        gunzip -c gbrod*.gz | gbToFaRa /projects/compbio/experiments/hg/h/anyRna.fil mrna.fa mrna.ra mrna.ta -byOrganism=org stdin
        gunzip -c gbest*.gz | gbToFaRa /projects/compbio/experiments/hg/h/anyRna.fil est.fa est.ra est.ta -byOrganism=org stdin

CREATING DATABASE (done)

o - ln -s /cluster/store2/zoo ~/zoo
  - Make a semi-permanent read-only alias:
        alias zoo$ORG1 "mysql -u hguser -phguserstuff -A zoo$ORG1"
o - Make sure there is at least 20 gig free on hgwdev:/usr/local/mysql.
o - Run the zoo/scripts/makeZoo.pl script to create the databases and
    store the mRNA (non-alignment) info in the database. The script
    will do the following for each organism:
        hgLoadRna new zoo$ORG1
        hgLoadRna add -type=mRNA zoo$ORG1 /cluster/store1/mrna.128/org/$ORG-GENUS_SPECIES/mrna.fa /cluster/store1/mrna.128/org/$ORG-GENUS_SPECIES/mrna.ra
    If EST data is available it will also do:
        hgLoadRna add -type=EST zoo$ORG1 /cluster/store1/mrna.128/org/$ORG-GENUS_SPECIES/est.fa /cluster/store1/mrna.128/org/$ORG-GENUS_SPECIES/est.ra
    The last line will take quite some time to complete.
o - Add the trackDb table to the database. The script will also do
    this.

STORING ZOO SEQUENCE AND ASSEMBLY INFORMATION (done)

o The makeZoo.pl script will also do the following things:
    - Create packed chromosome sequence files
    - Load chromosome sequence info into database
    - Store zoo info in database
    - Make and load GC percent table

REPEAT MASKING (done)

We don't use the cluster for this because we have such a small input
data set.

o - Repeat mask on sensitive settings. The makeZoo.pl script will
    generate the files needed to repeat mask and put them in
    zoo/work/rmskS. Then just ssh to kk, cd to the zoo/work/rmskS
    directory, and do a 'para push'.
o - When repeat masking is done, load the .out files by running the
    makeZoo.pl script again.
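    If you ever need to load the .out files by hand instead of
    rerunning makeZoo.pl, a minimal sketch (assuming hgLoadOut's usual
    database-then-files argument order; the actual .out location
    depends on where the rmskS jobs wrote their output):
        ssh hgwdev
        cd /cluster/store2/zoo/zoo$ORG1
        # load RepeatMasker .out output into this database's rmsk table
        hgLoadOut zoo$ORG1 1/ctgCFTR/*.out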
MAKING AND STORING mRNA AND EST ALIGNMENTS (done)

o - Use BLAT to generate refSeq, mRNA and EST alignments as so: on kk,
    cd to zoo/work/blat and do a para push. Check on progress using
    para check.
o - Process refSeq, mRNA and EST alignments into near best in genome.
    The below is done by the makeZoo.pl script. Note: No lifting up is
    needed because the fragments are already in chromosome
    coordinates.
        cd ~/zoo/bed
        cd refSeq
        pslSort dirs raw.psl /tmp psl
        pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 raw.psl contig.psl /dev/null
        mv contig.psl all_refSeq.psl
        pslSortAcc nohead chrom /tmp all_refSeq.psl
        cd ..
        cd mrna
        pslSort dirs raw.psl /tmp psl
        pslReps -minAli=0.98 -sizeMatters -nearTop=0.005 raw.psl contig.psl /dev/null
        mv contig.psl all_mrna.psl
        pslSortAcc nohead chrom /tmp all_mrna.psl
        cd ..
        cd est
        pslSort dirs raw.psl /tmp psl
        pslReps -minAli=0.98 -sizeMatters -nearTop=0.005 raw.psl contig.psl /dev/null
        mv contig.psl all_est.psl
        pslSortAcc nohead chrom /tmp all_est.psl
        cd ..
o - Load mRNA alignments into database.
        ssh hgwdev
        cd /cluster/store2/zoo/zoo$ORG1/bed/mrna/chrom
        ../../scripts/mrnaLoadPsl.csh
o - Load EST alignments into database.
        ssh hgwdev
        cd /cluster/store2/zoo/zoo$ORG1/bed/est/chrom
        ../../scripts/estLoadPsl.csh
o - Create subset of ESTs with introns and load into database.
      - ssh kkstore
        cd ~/zoo/zoo$ORG1
        cp /cluster/store1/gs.11/build28/jkStuff/makeIntronEst.sh bed/scripts
        tcsh bed/scripts/makeIntronEst.sh
      - ssh hgwdev
        cd ~/zoo/zoo$ORG1/bed/est/intronEst
        hgLoadPsl zoo$ORG1 *.psl
    The above is all done for you in the makeZoo.pl script.

PRODUCING KNOWN GENES (todo)

o - Download everything from
    ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ into
    /cluster/store1/mrna.129/refSeq.
o - Unpack this into fa files and get extra info with:
        cd /cluster/store1/mrna.129/refSeq
        gunzip $ORG.gbff.gz
        gunzip $ORG.faa.gz
    The above 2 files need to be symlinked from
    /cluster/store1/mrna.129/refSeq/$ORG/$ORG.fa to
    /cluster/store1/mrna.128/org/$ORG/refSeq.fa and refSeq.ra
    respectively.
o - Get extra info from NCBI and produce the refGene table as so:
        ssh hgwdev
        cd ~/mm/bed/refSeq
        wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref
        wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc
o - Similarly download refSeq proteins in fasta format to refSeq.pep.
o - Align these by the processes described under mRNA/EST alignments
    above.
o - Produce refGene, refPep, refMrna, and refLink tables as so:
        ssh hgwdev
        cd ~/mm/bed/refSeq
        ln -s /cluster/store1/mrna.129 mrna
        hgRefSeqMrna mm1 mrna/$ORGRefSeq.fa mrna/$ORGRefSeq.ra all_refSeq.psl loc2ref mrna/refSeq/$ORG.faa mim2loc
o - Add Jackson labs info:
        cd ~/mm/bed
        mkdir jaxOrtholog
        cd jaxOrtholog
        ftp ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt
        awk -f filter.awk *.rpt > jaxOrtholog.tab
    Load this into mysql with something like:
        mysql -u hgcat -pBIGSECRET mm1 < ~/src/hg/lib/jaxOrtholog.sql
        mysql -u hgcat -pBIGSECRET -A mm1
    and at the mysql> prompt:
        load data local infile 'jaxOrtholog.tab' into table jaxOrtholog;

CREATING PAIRWISE ALIGNMENTS (todo)

- This has been tested but needs to be customized to your specific
  needs. The makeZoo.pl script has comments on how to make pairwise
  alignments. The basic algorithm is as follows: Echo all the organism
  names to one input file. Echo all the mrna, est, and refSeq fasta
  file names to another input file. Then run gensub2 to generate the
  cross-product of all sequences aligned against all mrnas and ests.
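  In outline, that cross-product setup might look like this (a sketch
  only; the list-file paths and psl naming here are illustrative, not
  the script's actual values):
        # one organism chromosome per line
        ls -1 /cluster/store2/zoo/zoo*/1/chr1.fa > seq.lst
        # the mrna/est fasta files to align against
        ls -1 /cluster/store1/mrna.128/org/*/mrna.fa > rna.lst
  with a 3-line gsub template (blat flags omitted; the real ones are
  in scripts/blat-gsub):
        #LOOP
        blat {check in line+ $(path1)} {check in line+ $(path2)} {check out line+ psl/$(root1)_$(root2).psl}
        #ENDLOOP
  and then:
        gensub2 seq.lst rna.lst gsub spec
        para create spec
        para push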
  Look at scripts/blat-gsub for the gsub I used to test this. For each
  organism it creates files called chr1_$X_mrna.psl and chr1_$X_est.psl,
  where $X is the organism whose mRNAs and ESTs this organism's
  sequence was aligned against using BLAT. For example, for the
  zooMouse/1/chr1.fa sequence aligned against the cat mRNA we get
  zooMouse/bed/mrna/psl/chr1_Felis_catus_mrna.psl. The above logic is
  already commented out in the makeZoo.pl script. It just has to be
  selectively uncommented.

SIMPLE REPEAT TRACK (todo)

o - Create a cluster parasol job like so:
        ssh kk
        cd ~/mm/bed
        mkdir simpleRepeat
        cd simpleRepeat
        cp ~/lastOo/bed/simpleRepeat/gsub .
        mkdir trf
        echo single > single.lst
        echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst
        gensub2 genome.lst single.lst gsub spec
        para make spec
        para push
    This job had some problems this time, combined with the cluster
    going bizarre. We ended up running it serially, which only takes
    about 8 hours anyway.
        liftUp simpleRepeat.bed ~/mm/jkStuff/liftAll.lft warn trf/*.bed
o - Load this into the database as so:
        ssh hgwdev
        cd ~/mm/bed/simpleRepeat
        hgLoadBed mm1 simpleRepeat simpleRepeat.bed -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql

PRODUCING GENSCAN PREDICTIONS (todo)

o - Produce contig genscan.gtf, genscan.pep and genscanExtra.bed files
    like so. First make sure you have the appropriate setup,
    permissions, etc., and that you have tried using Parasol to submit
    and finish a set of jobs successfully.
        Log into kk
        cd ~/mm
        cd bed/genscan
    Make 3 subdirectories for genscan to put its output files in:
        mkdir gtf pep subopt
    Generate a list file, genome.list, of all the contigs:
        ls /scratch/hg/mm1/contigs/*.fa > genome.list
    Create a dummy file, single, containing just 1 line of any text.
    Create a template file, gsub, for gensub2. For example (3-line
    file):
        #LOOP
        /cluster/home/fanhsu/bin/i386/gsBig {check in line+ $(path1)} {check out line gtf/$(root1).gtf} -trans={check out line pep/$(root1).pep} -subopt={check out line subopt/$(root1).bed} -exe=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/genscan -par=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/HumanIso.smat -tmp=/tmp -window=2400000
        #ENDLOOP
    Generate the job list file, jobList, for Parasol:
        gensub2 genome.list single gsub jobList
    First issue the following Parasol command:
        para create jobList
    Run the following command, which will try the first 10 jobs from
    jobList:
        para try
    Check whether these 10 jobs ran OK:
        para check
    If they have problems, debug and fix your program, template file,
    commands, etc. and try again. If they are OK, then issue the
    following command, which will ask Parasol to start all the jobs
    (around 3000 jobs):
        para push
    Issue either one of the following two commands to check the status
    of the cluster and your jobs, until they are done:
        parasol status
        para check
    If any job fails to complete, study the problem and ask Jim for
    help if necessary.
o - Convert these to chromosome level files as so:
        cd ~/mm
        cd bed/genscan
        liftUp genscan.gtf ../../jkStuff/liftAll.lft warn gtf/*.gtf
        liftUp genscanSubopt.bed ../../jkStuff/liftAll.lft warn subopt/*.bed
        cat pep/*.pep > genscan.pep
o - Load into the database as so:
        ssh hgwdev
        cd ~/mm/bed/genscan
        ldHgGene mm1 genscan genscan.gtf
        hgPepPred mm1 generic genscanPep genscan.pep
        hgLoadBed mm1 genscanSubopt genscanSubopt.bed
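    A rough sanity check after loading, comparing the number of gene
    ids in the gtf with what landed in the table (a sketch; assumes
    gsBig writes standard gtf with gene_id as the first attribute, and
    that the read-only hguser account from the CREATING DATABASE
    section can also read mm1):
        # distinct gene ids in the gtf ($10 is the gene_id value)
        awk '{print $10}' genscan.gtf | sort -u | wc -l
        # ... versus rows actually loaded
        echo 'select count(*) from genscan;' | mysql -u hguser -phguserstuff -A mm1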
PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS

Make sure that the NT*.fa files are lower-case repeat masked, and that
the simpleRepeat track is made. Then put doubly (simple &
interspersed) repeat masked files onto the cluster local drive as so:
        ssh kkstore
        mkdir /scratch/hg/mm1/trfFa
        cd ~/mm
        foreach i (? ?? NA_*)
            cd $i
            foreach j (chr${i}_*)
                if (-d $j) then
                    maskOutFa $j/$j.fa ../bed/simpleRepeat/trf/$j.bed -softAdd /scratch/hg/mm1/trfFa/$j.fa.trf
                    echo done $i/$j
                endif
            end
            cd ..
        end
Then ask the admins to do a binrsync.

DOING HUMAN/$ORG ALIGNMENTS (todo)

o - Download the lower-case-masked assembly and put it in
    kkstore:/cluster/store1/a2ms.
o - Mask simple repeats in addition to normal repeats with:
        mkdir ~/mm/jkStuff/trfCon
        cd ~/mm/allctgs
        /bin/ls -1 | grep -v '\.' > ~/mm/jkStuff/trfCon/ctg.lst
        cd ~/mm/jkStuff/trfCon
        mkdir err log out
    Edit ~/mm/jkStuff/trf.gsub to update the gs and mm versions, then:
        gensub ctg.lst ~/mm/jkStuff/trf.gsub
        mv gensub.out trf.con
        condor_submit trf.con
    Wait for this to finish. Check as so:
        cd ~/mm
        source jkStuff/checkTrf.sh
    There should be no output.
o - Copy the RepeatMasked and trf masked genome to
    kkstore:/scratch/hg/gs.11/build28/contigTrf, and ask Jorge and all
    to binrsync to propagate the data across the new cluster.
o - Download the assembled $ORG genome in lower-case masked form to
    /cluster/store1/arachne.3/whole. Execute the script
    splitAndCopy.csh to chop it into roughly 50M pieces in
    arachne.3/parts.
o - Set up the jabba job to do the alignment as so:
        ssh kkstore
        cd /cluster/store2/mm.2001.11/mm1
        mkdir blat$ORG.phusion
        cd blat$ORG.phusion
        ls -1S /scratch/hg/gs.3/build28/contigTrf/* > human.lst
        ls -1 /cluster/store1/arachne.3/parts/* > $ORG.lst
    Make a file 'gsub' with the following three lines in it:
        #LOOP
        /cluster/home/kent/bin/i386/blat -q=dnax -t=dnax {check in line+ $(path2)} {check in line+ $(path1)} {check out line+ psl/$(root2)_$(root1).psl} -minScore=20 -minIdentity=20 -tileSize=4 -minMatch=2 -oneOff=0 -ooc={check in exists /scratch/hg/h/4.pooc} -qMask=lower -mask=lower
        #ENDLOOP
    Process this into a jabba file and launch the first set of jobs
    (10,000 out of 70,000) as so:
        gensub2 $ORG.lst human.lst gsub spec
        jabba make hut spec
        jabba push hut
    Do a 'jabba check hut' after about 20 minutes and make sure
    everything is right. After that make a little script that does a
    "jabba push hut" followed by a "sleep 30" about 50 times (see the
    sketch after this section). Interrupt the script when you see
    jabba push say it's not pushing anything.
o - Sort alignments as so:
        ssh kkstore
        cd /cluster/store2/mm.2001.11/mm1/blat$ORG
        pslCat -dir -check psl | liftUp -type=.psl stdout ../liftAll.lft warn stdin | pslSortAcc nohead chrom /cluster/store2/temp stdin
o - Get rid of big pile-ups due to contamination as so:
        cd chrom
        foreach i (*.psl)
            echo $i
            mv $i xxx
            pslUnpile -maxPile=600 xxx $i
            rm xxx
        end
o - Remove long redundant bits from read names by making a file called
    subs.in with the following lines:
        gnl|ti^ti
        contig_^tig_
    and running the commands:
        cd ~/$ORG/vsOo33/blat$ORG.phusion/chrom
        subs -e -c ^ *.psl > /dev/null
o - Copy over to the network where the database is:
        ssh kks00
        cd ~/mm/bed
        mkdir blat$ORG
        mkdir blat$ORG/ph.chrom600
        cd !$
        cp /cluster/store2/mm.2001.11/mm1/blat$ORG.phusion/chrom/*.psl .
o - Rename to correspond with tables as so and load into database:
        ssh hgwdev
        cd ~/mm/bed/blat$ORG/ph.chrom600
        foreach i (*.psl)
            set r = $i:r
            mv $i ${r}_blat$ORG.psl
        end
        hgLoadPsl mm1 *.psl
o - Load sequence into database as so:
        ssh kks00
        faSplit about /projects/hg3/$ORG/arachne.3/whole/Unplaced.mfa 1200000000 /projects/hg3/$ORG/arachne.3/whole/unplaced
        ssh hgwdev
        hgLoadRna addSeq '-abbr=gnl|' mm1 /projects/hg3/$ORG/arachne.3/whole/unpla*.fa
        hgLoadRna addSeq '-abbr=con' mm1 /projects/hg3/$ORG/arachne.3/whole/SET*.mfa
    This will take quite some time. Perhaps an hour.
o - Produce 'best in genome' filtered version:
        ssh kks00
        cd ~/$ORG/vsOo33
        pslSort dirs blat$ORGAll.psl temp blat$ORG
        pslReps blat$ORGAll.psl best$ORGAll.psl /dev/null -singleHit -minCover=0.3 -minIdentity=0.1
        pslSortAcc nohead best$ORG temp best$ORGAll.psl
        cd best$ORG
        foreach i (*.psl)
            set r = $i:r
            mv $i ${r}_best$ORG.psl
        end
o - Load best in genome into database as so:
        ssh hgwdev
        cd ~/$ORG/vsOo33/best$ORG
        hgLoadPsl mm1 *.psl
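The "little script" mentioned above might look something like this (a
sketch; 'hut' is the jabba batch name used in this section):
        #!/bin/csh
        # push-loop: nudge jabba ~50 times, half a minute apart
        set i = 0
        while ($i < 50)
            jabba push hut
            sleep 30
            @ i = $i + 1
        end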
PRODUCING CROSS-SPECIES mRNA ALIGNMENTS (done)

This is done by the makeZoo.pl script. It sets up the BLAT job, which
has to be run by hand on kk. After that the script will load the xeno
mRNA into the db. The cross-species alignments are done a little
differently here: instead of aligning each species against all other
species, I have aligned each species in the group of 12 against the
other 11. For example, mouse is aligned against the other 11 species
in the list. See the spec file in zoo/work/blat.

PRODUCING FISH ALIGNMENTS (todo)

o - Download sequence from ... and put it in
    /projects/hg3/fish/seq15jun2001/bqcnstn_0106151510.fa:
        ssh kks00
        ln -s /projects/hg3/fish ~/fish
    Split this into multiple files and compress the original with:
        cd ~/fish/seq15jun2001
        faSplit sequence bq* 100 fish
        compress bq*
    Copy over to kkstore:/cluster/store2/fish/seq15jun2001.
o - Do fish/human alignments.
        ssh kkstore
        cd /cluster/store2/fish
        mkdir vsOo33
        cd vsOo33
        mkdir psl
        ls -1S /var/tmp/hg/gs.11/build28/tanMaskNib/*.fa.trf > human.lst
        ls -1S /projects/hg3/fish/seq15jun2001/*.fa > fish.lst
    Copy over gsub from the previous version and edit paths to point
    to the current assembly.
        gensub2 fish.lst human.lst gsub spec
        jabba make hut spec
        jabba try hut
    Make sure jobs are going OK. Then:
        jabba push hut
    Wait about 2 hours and do another jabba push hut. Do a jabba check
    hut, and if necessary another push hut.
o - Sort alignments as so:
        pslCat -dir psl | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin
o - Copy to hgwdev:/scratch. Rename to correspond with tables as so
    and load into database:
        ssh hgwdev
        cd ~/fish/vsOo33/chrom
        foreach i (*.psl)
            set r = $i:r
            mv $i ${r}_blatFish.psl
        end
        hgLoadPsl mm1 *.psl

TIGR GENE INDEX (todo)

o - Download ftp://ftp.tigr.org/private/HGI_ren/*aug.tgz into
    ~/oo.29/bed/tgi and then execute the following commands:
        cd ~/oo.29/bed/tgi
        mv cattleTCs_aug.tgz cowTCs_aug.tgz
        foreach i ($ORG cow human pig rat)
            mkdir $i
            cd $i
            gtar -zxf ../${i}*.tgz
            gawk -v animal=$i -f ../filter.awk * > ../$i.gff
            cd ..
        end
        mv human.gff human.bak
        sed s/THCs/TCs/ human.bak > human.gff
        ldHgGene -exon=TCs hg7 tigrGeneIndex *.gff

LOAD STS MAP

- Log in to hgwdev.
        cd ~/mm/bed
        mm1 < ~/src/hg/lib/stsMap.sql
        mkdir stsMap
        cd stsMap
        bedSort /projects/cc/hg/mapplots/data/tracks/build28/stsMap.bed stsMap.bed
- Enter the database with the "mm1" command.
- At the mysql> prompt type in:
        load data local infile 'stsMap.bed' into table stsMap;
- At the mysql> prompt type quit.

LOAD CHROMOSOME BANDS

- Log in to hgwdev.
        cd /cluster/store2/mm.2001.11/mm1/bed
        mkdir cytoBands
        cp /projects/cc/hg/mapplots/data/tracks/build28/cytobands.bed cytoBands
        cd cytoBands
        mm1 < ~/src/hg/lib/cytoBand.sql
- Enter the database with the "mm1" command.
- At the mysql> prompt type in:
        load data local infile 'cytobands.bed' into table cytoBand;
- At the mysql> prompt type quit.
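The bare "mm1" used in the load steps above is presumably a mysql
alias in the same style as the zoo$ORG1 alias from the CREATING
DATABASE section, but pointing at the mm1 database with the
write-capable account used elsewhere in this doc, e.g.:
        alias mm1 "mysql -u hgcat -pBIGSECRET -A mm1"
With that assumption, the interactive load steps collapse to
one-liners:
        echo "load data local infile 'stsMap.bed' into table stsMap;" | mm1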
LOAD $ORGREF TRACK (todo)

First copy in the data from kkstore to ~/mm/bed/$ORGRef. Then
substitute 'genome' for the appropriate chromosome in each of the
alignment files. Finally do:
        hgRefAlign webb mm1 $ORGRef *.alignments

LOAD AVID $ORG TRACK (todo)

        ssh cc98
        cd ~/mm/bed
        mkdir avid$ORG
        cd avid$ORG
        wget http://pipeline.lbl.gov/tableCS-LBNL.txt
        hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
        hgLoadBed mm1 avidRepeat avidRepeat.bed
        hgLoadBed mm1 avidUnique avidUnique.bed

LOAD SNPS (todo)

- ssh hgwdev
- cd ~/mm/bed
- mkdir snp
- cd snp
- Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.oo33.gz
- Unpack, then:
        grep RANDOM gp.oo33 > snpTsc.txt
        grep MIXED gp.oo33 >> snpTsc.txt
        grep BAC_OVERLAP gp.oo33 > snpNih.txt
        grep OTHER gp.oo33 >> snpNih.txt
        awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
        awk -f filter.awk snpNih.txt > snpNih.contig.bed
        liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
        liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
        hgLoadBed mm1 snpTsc snpTsc.bed
        hgLoadBed mm1 snpNih snpNih.bed

LOAD CPG ISLANDS (todo)

- Log in to hgwdev.
- cd /cluster/store2/mm.2001.11/mm1/bed
- mkdir cpgIsland
- cd cpgIsland
- mm1 < ~kent/src/hg/lib/cpgIsland.sql
- wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/cpg_aug.oo33.tar
- tar -xf cpg*.tar
- awk -f filter.awk */ctg*/*.cpg > contig.bed
- liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed
- Enter the database with the "mm1" command.
- At the mysql> prompt type in:
        load data local infile 'cpgIsland.bed' into table cpgIsland;

LOAD ENSEMBL GENES (todo)

        cd ~/mm/bed
        mkdir ensembl
        cd ensembl
        wget http://www.sanger.ac.uk/~birney/all_april_ctg.gtf.gz
        gunzip all*.gz
        liftUp ensembl110.gtf ~/mm/jkStuff/liftAll.lft warn all*.gtf
        ldHgGene mm1 ensGene en*.gtf
o - Load Ensembl peptides:
    (Poke around Ensembl to get their peptide files as ensembl.pep.)
    (Substitute ENST for ENSP in ensembl.pep with subs.)
        wget ftp://ftp.ensembl.org/pub/current/data/fasta/pep/ensembl.pep.gz
        gunzip ensembl.pep.gz
    Edit subs.in to read:
        ENSP|ENST
    then:
        subs -e ensembl.pep
        hgPepPred mm1 generic ensPep ensembl.pep

LOAD SANGER22 GENES (todo)

- cd ~/mm/bed
- mkdir sanger22
- cd sanger22
- Not sure where these files were downloaded from.
- grep -v Pseudogene Chr22*.genes.gff | hgSanger22 mm1 stdin Chr22*.cds.gff *.genes.dna *.cds.pep 0 | ldHgGene mm1 sanger22pseudo stdin
- Note: this creates sanger22extras, but doesn't currently create a
  correct sanger22 table; that is replaced in the next steps.
- sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
        | ldHgGene mm1 sanger22 stdin
- sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
        | ldHgGene mm1 sanger22pseudo stdin
- hgPepPred mm1 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)

First download files from James Gilbert's email to ~/mm/bed/sanger20
and go to that directory while logged onto hgwdev. Then:
        grep -v Pseudogene chr_20*.gtf | ldHgGene mm1 sanger20 stdin
        hgSanger20 mm1 *.gtf *.info

LOAD RNAGENES (todo)

- Log in to hgwdev.
- cd ~kent/src/hg/lib
- mm1 < rnaGene.sql
- cd /cluster/store2/mm.2001.11/mm1/bed
- mkdir rnaGene
- cd rnaGene
- Download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz
- gunzip *.gz
- liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff
- hgRnaGenes mm1 chrom.gff

LOAD EXOFISH (todo)

- Log in to hgwdev.
- cd /cluster/store2/mm.2001.11/mm1/bed
- mkdir exoFish
- cd exoFish
- mm1 < ~kent/src/hg/lib/exoFish.sql
- Put the email attachment from Olivier Jaillon
  (ojaaillon@genoscope.cns.fr) into
  /cluster/store2/mm.2001.11/mm1/bed/exoFish/all_maping_ecore
- awk -f filter.awk all_maping_ecore > exoFish.bed
- hgLoadBed mm1 exoFish exoFish.bed
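The various filter.awk scripts referenced above are source-specific
and are not reproduced here; each one just reshapes the incoming
columns into BED lines. As a purely hypothetical illustration (the
real column positions depend on each source's format):
        # hypothetical filter.awk: emit chrom, chromStart, chromEnd, name
        BEGIN { OFS = "\t" }
        { print $1, $2, $3, $4 }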
LOAD $ORG SYNTENY (todo)

- Log in to hgwdev.
- cd ~/kent/src/hg/lib
- mm1 < $ORGSyn.sql
- mkdir ~/mm/bed/$ORGSyn
- cd ~/mm/bed/$ORGSyn
- Put Dianna Church's (church@ncbi.nlm.nih.gov) email attachment as
  $ORGSyn.txt
- awk -f format.awk *.txt > $ORGSyn.bed
- Delete the first line of $ORGSyn.bed.
- Enter the database with the "mm1" command.
- At the mysql> prompt type in:
        load data local infile '$ORGSyn.bed' into table $ORGSyn;

LOAD GENIE (todo)

- cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
- liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
- ldHgGene mm1 genieAlt predChrom.gtf
- cat */ctg*/ctg*.affymetrix.aa > pred.aa
- hgPepPred mm1 genie pred.aa
- mm1
        mysql> delete from genieAlt where name like 'RS.%';
        mysql> delete from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (todo)

- ln -s /cluster/store2/mm.2001.11/mm1 ~/mm
- cd ~/mm/bed
- mkdir softberry
- cd softberry
- Get ftp://www.softberry.com/pub/SC_MOU_NOV01/softb_mou_genes_nov01.tar.gz
- mkdir output
- Run the fixFormat.pl routine in ~/mm/bed/softberry like so:
        ./fixFormat.pl output
  This will stick all the properly converted .gff and .pro files in
  the directory named "output".
- cd output
        ldHgGene mm1 softberryGene chr*.gff
        hgPepPred mm1 softberry *.pro
        hgSoftberryHom mm1 *.pro

LOAD ACEMBLY

- Get acembly*gene.gff from Jean and Michelle Thierry-Miegs and place
  in ~/mm/bed/acembly
- Replace c_chr with chr in acembly*.gff
- Replace G_t1_chr with chr and likewise G_t2_chr with chr, etc.
- cd ~/mm/bed/acembly
- # The step directly below is not necessary since the files were
  # already lifted:
  # liftUp ./aceChrom.gff /cluster/store2/mm.2001.11/mm1/jkStuff/liftHs.lft warn acemblygenes*.gff
- Use /cluster/store2/mm.2001.11/mm1/mattStuff/filterFiles.pl to
  prepend "chr" to the start of every line in the gene.gff files and
  to concat them into the aceChrom.gff file. Read the script to see
  what it does. It's tiny and simple.
- Concatenate all the protein.fasta files into a single acembly.pep
  file.
- Load into the database as so:
        ldHgGene mm1 acembly aceChrom.gff
        hgPepPred mm1 generic acemblyPep acembly.pep

LOAD GENOMIC DUPES (todo)

o - Load genomic dupes:
        ssh hgwdev
        cd ~/mm/bed
        mkdir genomicDups
        cd genomicDups
        wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
        unzip *.zip
        awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
        mysql -u hgcat -pbigSECRET mm1 < ~/src/hg/lib/genomicDups.sql
        hgLoadBed mm1 -oldTable genomicDups genomicDups.bed

FAKING DATA FROM PREVIOUS VERSION
(This is just for until the proper track arrives. It rescues about 97%
of the data. Just an experiment, not really followed through on.)

o - Rescuing STS track:
        - Log onto hgwdev.
        - mkdir ~/mm/rescue
        - cd !$
        - mkdir sts
        - cd sts
        - bedDown hg3 mapGenethon sts.fa sts.tab
        - echo ~/mm/sts.fa > fa.lst
        - pslOoJobs ~/mm ~/mm/rescue/sts/fa.lst ~/mm/rescue/sts g2g
        - Log onto cc01.
        - cd ~/mm/rescue/sts
        - Split all.con into 3 parts and condor_submit each part.
        - Wait for the assembly to finish.
        - cd psl
        - mkdir all
        - ln ?/*.psl ??/*.psl *.psl all
        - pslSort dirs raw.psl temp all
        - pslReps raw.psl contig.psl /dev/null
        - rm raw.psl
        - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
        - rm contig.psl
        - mv chrom.psl ../convert.psl
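At any point you can check what has actually been loaded into a given
database; a small sketch, reusing the read-only account from the
CREATING DATABASE section (database names here are illustrative):
        foreach db (zoo$ORG1 mm1)
            echo "=== $db ==="
            echo "show tables;" | mysql -u hguser -phguserstuff -A $db
        end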