This file describes how we made the browser database on NCBI build 29
(April, 2002 freeze)

(The numbered stuff was brought in from
/cluster/store1/gs.12/build29/build.ncbi.doc)

HOW TO BUILD AN ASSEMBLY FROM NCBI FILES
----------------------------------------
NOTE: It is best to run most of this stuff on kkstore since it is not
averse to handling files > 2Gb

1) Download seq_contig.md, ncbi_buildXX.agp, contig_overlaps.agp and
   contig fa file into directory.
2) Unpack contig fa file into ../ffa/ncbi_buildXX.fa
#2.1) Extract Hs to NT conversion from .fa files to convert .agp file
#     (NOT NEEDED ANYMORE)
#      /cluster/bin/scripts/extractHs ncbi_buildXX.fa
#2.2) Create allcontig.agp.buildXX file (NOT NEEDED ANYMORE)
#      /cluster/bin/scripts/convertHsAgp hs.to.nt > allcontig.agp.buildXX
2.3) Sanity check things with
       ~kent/bin/i386/checkYbr ncbi_buildXX.agp ../ffa/ncbi_buildXX.fa seq_contig.md
     Report any errors back to Richa and Greg at NCBI.
3) Convert fa files into UCSC style fa files and place in "contigs" directory
     mkdir contigs
     /cluster/bin/i386/faNcbiToUcsc -split -ntLast ncbi_buildXX.fa contigs
4) Create lift files (this will create chromosome directory structure) and
   inserts file
     /cluster/bin/scripts/createNcbiLifts seq_contig.md .
5) Create contig agp files (will create contig directory structure)
     /cluster/bin/scripts/createNcbiCtgAgp seq_contig.md ncbi_buildXX.agp .
5.1) Create contig gl files
     ~kent/bin/i386/agpToGl contig_overlaps.agp . -md=seq_contig.md
6) Create chromosome agp files
     /cluster/bin/scripts/createNcbiChrAgp .
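As a supplement to the checkYbr sanity check in step 2.3, the agp file can be
given a quick structural once-over: an AGP line has 9 tab-separated fields for
a component and 8 for a gap. This awk one-liner is our own sketch, not part of
the NCBI scripts, and the sample data below stands in for the real
ncbi_buildXX.agp:

```shell
# Quick structural check of an AGP file: every line should have 8 (gap)
# or 9 (component) tab-separated fields; a clean file prints 0.
printf 'chr1\t1\t100\t1\tW\tAC000001.1\t1\t100\t+\n' >  sample.agp
printf 'chr1\t101\t200\t2\tN\t100\tcontig\tno\n'     >> sample.agp
awk -F'\t' 'NF < 8 || NF > 9 { bad++ } END { print bad + 0 }' sample.agp
```

Run against the real agp, any nonzero count is worth reporting along with
whatever checkYbr turns up.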
6.1) Copy over jkStuff
     mkdir jkStuff
     cp ../../gs.11/build28/jkStuff/*.sh jkStuff
     cp ../../gs.11/build28/jkStuff/*.csh jkStuff
     cp ../../gs.11/build28/jkStuff/*.gsub jkStuff
6.2) Patch in size of chromosome Y into Y/lift/ordered.lft by grabbing it
     from the last line of Y/chrY.agp
6.3) Create chromosome gl files
     jkStuff/liftGl.sh contig.gl
7) Distribute contig .fa and .out files to appropriate directory (assumes
   all files are in "contigs" directory).
     /cluster/bin/scripts/distNcbiCtgFa contigs .
8) Reverse complement NT contig fa files that are flipped in the assembly
   (uses faRc program)
     /cluster/bin/scripts/revCompNcbiCtgFa seq_contig.md .
9) Generate RepeatMasked files for contigs (Patrick)
   For the NCBI assembly we repeat mask on the sensitive mode setting.
     cd ~/oo
     /cluster/bin/scripts/RMfa RMJobs */NT_*/*.fa
   Log into kk
     cd ~/oo
     para create RMJobs
     para try
   Make sure jobs don't die right away, then
     para push
10) Lift up RepeatMask .out files to chromosome coordinates via
     tcsh jkStuff/liftOut2.sh
11) Generate contig and chromosome level masked and unmasked files via:
     tcsh jkStuff/chrFa.sh
     tcsh jkStuff/makeFaMasked.sh
12) Copy all contig and chrom fa files to /scratch on kkstore to get ready
    for cluster jobs, and ask to propagate to nodes
     /cluster/bin/scripts/cpNcbiFaScratch .
13) Create jkStuff/ncbi.lft for lifting stuff built w/NCBI assembly.
    Note: this ncbi.lift will not lift floating contigs to chr_random
    coords, but it will show the strand orientation of the floating
    contigs (grep for '|').
     mdToNcbiLift seq_contig_randoms.md jkStuff/ncbi.lft

CREATING DATABASE (DONE)

o - ln -s /cluster/store1/gs.12/build29 ~/oo
o - Make sure there is at least 5 gig free on hgwdev:/usr/local/mysql
o - Create the database.
    - ssh hgwdev
    - Enter mysql via: hgsql
    - At mysql prompt type:
        create database hg11;
        quit
    - make a semi-permanent read-only alias:
        alias hg11 mysql -u hguser -phguserstuff -A hg11
o - Tell the hgCentral database about it.
    Log onto genome-centdb and enter mysql via
        mysql -u root -pbigSecret hgCentral
    At the mysql prompt type:
        insert into dbDb values("hg11", "Human April 2002",
            "/cluster/store1/gs.12/build29/nib", "Human", "USP18");
o - Create the trackDb table as so
        cd ~/src/hg/makeDb/hgTrackDb
    Edit makefile to add hg11 after hg10 and do
        make update
        cvs commit makefile

LOAD REPEAT MASKS (DONE 7/10/02)

Load the RepeatMasker .out files into the database with:
    hgLoadOut hg11 ?/*.fa.out ??/*.fa.out

STORING O+O SEQUENCE AND ASSEMBLY INFORMATION (DONE)

Create packed chromosome sequence files
    ssh kkstore
    cd ~/oo
    tcsh jkStuff/makeNib.sh

Load chromosome sequence info into database
    ssh hgwdev
    hgsql hg11 < ~/src/hg/lib/chromInfo.sql
    cd ~/oo
    hgNibSeq -preMadeNib hg11 /cluster/store1/gs.12/build29/nib ?/chr*.fa ??/chr*.fa

Store o+o info in database.
    cd /cluster/store1/gs.12/build29
    jkStuff/liftGl.sh contig.gl
    hgGoldGapGl hg11 /cluster/store1/gs.12 build29
    cd /cluster/store1/gs.12
    hgClonePos hg11 build29 ffa/sequence.inf /cluster/store1/gs.12 -maxErr=3
    (Ignore warnings about missing clones - these are in chromosomes 21 and 22)
    hgCtgPos hg11 build29

Make and load GC percent table
    ssh hgwdev
    cd ~/oo
    mkdir -p bed/gcPercent
    cd bed/gcPercent
    hgsql hg11 < ~/src/hg/lib/gcPercent.sql
    hgGcPercent hg11 ../../nib

SIMPLE REPEAT TRACK (DONE)

o - Create cluster parasol job like so:
    ssh kk
    cd ~/oo/bed
    mkdir simpleRepeat
    cd simpleRepeat
    cp /cluster/store1/gs11.build28/bed/simpleRepeat/gsub ./gsub
    mkdir trf
    ls -1S /scratch/hg/gs.12/build29/contig/*.fa > genome.lst
    gensub2 genome.lst single gsub spec
    para create spec
    para try
    para check
    para push
    liftUp simpleRepeat.bed ~/oo/jkStuff/liftAll.lft warn trf/*.bed
o - Load this into the database as so
    ssh hgwdev
    cd ~/oo/bed/simpleRepeat
    hgLoadBed hg11 simpleRepeat simpleRepeat.bed \
        -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql

PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS (DONE)

Make sure that the NT*.fa files are lower-case repeat masked.
Do something much like the simpleRepeat track, but only masking out stuff
with a period of 12 or less as so:
    ssh kk
    cd ~/oo/bed
    mkdir trfMask
    cd trfMask
    # I couldn't find a valid gsub according to these instructions so I used
    # the one from /cluster/store1/gs.12/build29.bad/bed/trfMask
    # instead of doing -> cp ~/lastOo/bed/trfMask/gsub ./gsub
    mkdir trf
    ls -1S /scratch/hg/gs.12/build29/contig/*.fa > genome.lst
    gensub2 genome.lst single gsub spec
    para create spec
    para try
    para check
    para push

When that is done do:
    ssh kkstore
    mkdir /scratch/hg/gs.12/build29/trfFa
    cd ~/oo
    NOTE: Below is a tcsh script
    foreach i (? ??)
        cd $i
        foreach j (NT*)
            maskOutFa $j/$j.fa ../bed/trfMask/trf/$j.bed -softAdd \
                /scratch/hg/gs.12/build29/trfFa/$j.fa.trf
            echo done $i/$j
        end
        cd ..
    end
Then ask admins to do a binrsync.  DONE

GETTING FRESH mRNA AND EST SEQUENCE FROM GENBANK. (DONE)

This will create a genbank.129 directory containing compressed GenBank flat
files and a mrna.129 containing unpacked sequence info and auxiliary info
in a relatively easy to parse (.ra) format.

o - Point your browser to ftp://ncbi.nlm.nih.gov/genbank and look at the
    README.  Figure out the current release number (which is 129).
o - Consider deleting one of the older genbank releases.  It's good to at
    least keep one previous release though.
o - Where there is space make a new genbank directory.  Create a symbolic
    link to it:
      mkdir /cluster/store1/genbank.129
      ln -s /cluster/store1/genbank.129 ~/genbank
o - cd ~/genbank
o - ftp ncbi.nlm.nih.gov (do anonymous log-in).  Then do the following
    commands inside ftp:
      cd genbank
      prompt
      mget gbpri* gbrod* gbv* gbsts* gbest* gbmam* gbinv*
    This will take at least 2 hours.
o - Log onto server and change to your genbank directory.
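The nested tcsh loop in the trfFa step above walks the one- and two-character
chromosome directories with `foreach i (? ??)`. For anyone working in plain
sh, the same glob idiom looks like this (a sketch on throwaway demo
directories, with the maskOutFa body omitted):

```shell
# sh equivalent of tcsh's `foreach i (? ??)`: visit one- and
# two-character chromosome directories (1-9, X, Y, then 10-22).
# Demo directories stand in for the real build tree.
mkdir -p demo/1 demo/10 demo/X
cd demo
for i in ? ??; do
    [ -d "$i" ] || continue    # skip anything the glob matched that isn't a dir
    echo "chrom dir: $i"       # real loop would run maskOutFa per NT contig here
done
cd ..
```

Note the glob expands `?` before `??`, so single-character names come first.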
o - cd /cluster/store1
o - mkdir mrna.129
o - cd mrna.129
o - gunzip -c /cluster/store1/genbank.129/gbpri*.gz | \
      gbToFaRa ~kent/hg/h/mrna.fil mrna.fa mrna.ra mrna.ta stdin
o - gunzip -c /cluster/store1/genbank.129/gbpri*.gz | \
      gbToFaRa ~kent/hg/h/mrna.fil mrna.fa mrna.ra mrna.ta stdin -byOrganism=org
o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | \
      gbToFaRa ~kent/hg/h/mrna.fil est.fa est.ra est.ta stdin
o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | \
      gbToFaRa ~kent/hg/h/mrna.fil est.fa est.ra est.ta stdin -byOrganism=org
o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | \
      gbToFaRa ~kent/hg/h/xenoRna.fil xenoEst.fa xenoEst.ra xenoEst.ta stdin
o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | \
      gbToFaRa ~kent/hg/h/xenoRna.fil xenoEst.fa xenoEst.ra xenoEst.ta stdin -byOrganism=org
o - gunzip -c /cluster/store1/genbank.129/gbpri*.gz \
      /cluster/store1/genbank.129/gbmam*.gz /cluster/store1/genbank.129/gbrod*.gz \
      /cluster/store1/genbank.129/gbvrt*.gz /cluster/store1/genbank.129/gbinv*.gz | \
      gbToFaRa ~kent/hg/h/xenoRna.fil xenoRna.fa xenoRna.ra xenoRna.ta stdin -byOrganism=org
o - cd /cluster/store1/genbank.129
o - gunzip -c gbpri*.gz gbmam*.gz gbrod*.gz gbvrt*.gz gbinv*.gz | \
      gbToFaRa ~kent/hg/h/xenoRna.fil ../mrna.129/xenoRna.fa \
      ../mrna.129/xenoRna.ra ../mrna.129/xenoRna.ta stdin

STORING mRNA/EST SEQUENCE AND AUXILIARY INFO (DONE)

o - Store the mRNA (non-alignment) info in database.
      hgLoadRna new hg11
      hgLoadRna add hg11 /cluster/store1/mrna.129/mrna.fa /cluster/store1/mrna.129/mrna.ra
      hgLoadRna add hg11 /cluster/store1/mrna.129/est.fa /cluster/store1/mrna.129/est.ra
    The last line will take quite some time to complete.  It will count up
    to about 3,800,000 before it is done.

MAKING AND STORING mRNA AND EST ALIGNMENTS (DONE)

o - Load up the local disks of the cluster with refSeq.fa, mrna.fa and est.fa
    Copy the above 3 files from /cluster/store1/mrna.129 into /scratch/hg/h/mrna
    Request the admins to do a binrsync to the cluster.
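Several cluster steps in this doc build job lists with `ls -1S` (e.g. the
simpleRepeat and trfMask genome.lst files). The point of `-S` is that the
biggest, longest-running inputs get scheduled first; a quick demonstration on
made-up dummy files:

```shell
# `ls -1S` lists one name per line, largest file first - so the biggest
# contigs (the longest-running cluster jobs) start before the small ones.
# Dummy files stand in for the real contig fa's.
mkdir -p sizes.demo
printf 'AC' > sizes.demo/small.fa
awk 'BEGIN { for (i = 0; i < 100; i++) printf "A" }' > sizes.demo/big.fa
( cd sizes.demo && ls -1S )
```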
DONE

o - Use BLAT to generate refSeq, mRNA and EST alignments as so:
    Make sure that /scratch/hg/gs.12/build29/contigs is loaded with NT_*.fa
    and pushed to the cluster nodes.
      ssh kkstore
      cd ~/oo/bed
      foreach i (refSeq mrna est)
          mkdir -p $i
          cd $i
          cp ~kent/lastOo/bed/$i/gsub .
          echo /scratch/hg/gs.12/build29/contig/*.fa | wordLine stdin > genome.lst
          ls -1 /scratch/hg/h/mrna/$i.fa > mrna.lst
          mkdir -p psl
          gensub2 genome.lst mrna.lst gsub spec
          para create spec
          cd ..
      end
    DONE
    Now, by hand cd to the mrna, refSeq, and est directories respectively
    and run a para push and para check in each one.  DONE

o - Process refSeq mRNA and EST alignments into near best in genome.
      cd ~/oo/bed
      cd refSeq
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 \
          raw.psl contig.psl /dev/null
      liftUp -nohead all_refSeq.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_refSeq.psl
      cd ..
    DONE
      cd mrna
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minAli=0.96 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_mrna.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_mrna.psl
      cd ..
    DONE
      cd est
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minAli=0.93 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_est.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_est.psl
      cd ..
    DONE

o - Load refSeq alignments into database  DONE
      ssh hgwdev
      cd /cluster/store1/gs.12/build29/bed/refSeq
      pslCat -dir chrom > refSeqAli.psl
      hgLoadPsl hg11 -tNameIx refSeqAli.psl

o - Load mRNA alignments into database.  DONE
      ssh hgwdev
      cd /cluster/store1/gs.12/build29/bed/mrna/chrom
      foreach i (*.psl)
          mv $i $i:r_mrna.psl
      end
      hgLoadPsl hg11 *.psl
      cd ..
      hgLoadPsl hg11 all_mrna.psl -nobin

o - Load EST alignments into database.
DONE
      ssh hgwdev
      cd /cluster/store1/gs.12/build29/bed/est/chrom
      foreach i (*.psl)
          mv $i $i:r_est.psl
      end
      hgLoadPsl hg11 *.psl
      cd ..
      hgLoadPsl hg11 all_est.psl -nobin

o - Create subset of ESTs with introns and load into database.  DONE
    - ssh kkstore
      cd ~/oo
      tcsh jkStuff/makeIntronEst.sh
    - ssh hgwdev
      cd ~/oo/bed/est/intronEst
      hgLoadPsl hg11 *.psl

o - Put orientation info on ESTs into database:
      ssh kkstore
      cd ~/oo/bed/est
      pslSortAcc nohead contig /cluster/fast1/temp contig.psl
      mkdir /scratch/hg/gs.12/build29/bed
      cp -r contig /scratch/hg/gs.12/build29/bed/est
      sudo /cluster/install/utilities/updateLocal
      cd ~/oo/bed
      mkdir estOrientInfo
      cd estOrientInfo
      mkdir ei
      ls -1 /scratch/hg/gs.12/build29/bed/est/*.psl > psl.lst
    Now ssh to kk and cd to ~/oo/bed/estOrientInfo.  Copy in gsub from the
    previous version and edit it to say where things are located in scratch
    on this version.  Then:
      gensub2 psl.lst single gsub spec
      para create spec
      para try
      para push
    Check until done, or use 'para shove'.  When the cluster run is done do:
      liftUp estOrientInfo.bed ~/oo/jkStuff/liftAll.lft warn ei/*.tab
      hgLoadBed hg11 estOrientInfo estOrientInfo.bed \
          -sqlTable=$HOME/src/hg/lib/estOrientInfo.sql
    DONE

o - Create rnaCluster table
      ssh hgwdev
      cd ~/oo
      mkdir -p bed/rnaCluster/chrom
      foreach i (? ??)
          cd $i
          foreach j (chr*.fa)
              set c = $j:r
              echo clusterRna hg11 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c
              clusterRna hg11 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c
          end
          cd ..
      end
      cd bed/rnaCluster
      hgLoadBed hg11 rnaCluster chrom/*.bed
    DONE

PRODUCING KNOWN GENES (DONE)

o - Download everything from ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
    into /cluster/store1/mrna.129/refSeq.  DONE
o - Unpack this into fa files and get extra info with:
      cd /cluster/store1/mrna.129/refSeq
      gunzip hs.gbff.gz
      gunzip hs.faa.gz
      gbToFaRa ~/hg/h/mrna.fil ../refSeq.fa ../refSeq.ra ../refSeq.ta hs.gbff
    DONE
o - Align refSeq.fa to genome as described under mRNA/EST alignments above.
DONE

o - Get extra info from NCBI and produce refGene table as so:
      ssh hgwdev
      cd ~/oo/bed
      mkdir refSeq
      cd refSeq
      wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref   DONE
      wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc   DONE
o - Similarly download refSeq proteins in fasta format to refSeq.pep -
    I believe this is hs.faa
o - RefSeq should have already been aligned to the genome by processes
    described under mRNA/EST alignments above.
o - Produce refGene, refPep, refMrna, and refLink tables as so:
      ssh hgwdev
      cd ~/oo/bed/refSeq
      ln -s /cluster/store1/mrna.129 mrna
      hgRefSeqMrna hg11 mrna/refSeq.fa mrna/refSeq.ra all_refSeq.psl \
          loc2ref mrna/refSeq/hs.faa mim2loc
    DONE
o - Add Jackson labs info  DONE
      cd ~/oo/bed
      mkdir jaxOrtholog
      cd jaxOrtholog
      ftp ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt
      awk -f filter.awk *.rpt > jaxOrtholog.tab
    Load this into mysql with something like:
      mysql -u hgcat -pBIGSECRET hg11 < ~/src/hg/lib/jaxOrtholog.sql
      mysql -u hgcat -pBIGSECRET -A hg11
    and at the mysql> prompt
      load data local infile 'jaxOrtholog.tab' into table jaxOrtholog;
o - Add RefSeq status info (DONE 6/19/02)
      hgRefSeqStatus hg11 loc2ref

PRODUCING GENSCAN PREDICTIONS (done)

o - Produce contig genscan.gtf genscan.pep and genscanExtra.bed files
    like so:
    Load up the cluster with hard-masked contigs in
    /cluster/store1/gs.12/build29/bed/genscan/mContigs
    (For hg11, the .masked files were not saved during repeat masking.  So
    the contig (.fa) files in /cluster/store1/gs.12/build29/? and ?? were
    processed to convert all lower case bases into N and named as
    *.fa.masked and placed under genscan/mContigs.)
    Log into kkr1u00 (not kk!).  kkr1u00 is the driver node for the small
    cluster (kkr2u00-kkr8u00).  Genscan has problems running on the big
    cluster, due to limited memory and swap space on each processing node.
      cd ~/oo
      cd bed/genscan
    Make 3 subdirectories for genscan to put its output files in
      mkdir gtf pep subopt
    Generate a list file, genome.list, of all the contigs
      ls -1S ./mContigs/*.masked > genome.list
    Edit genome.list to remove jobs on "*.fa.masked" files that are pure Ns
    due to heterochromatin (unsequencable stuff); these would cause genscan
    to run forever.
    Create a template file, gsub, for gensub2.  For example (3-line file):
      #LOOP
      /cluster/home/fanhsu/bin/i386/gsBig {check in line+ $(path1)} {check out line gtf/$(root1).gtf} -trans={check out line pep/$(root1).pep} -subopt={check out line subopt/$(root1).bed} -exe=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/genscan -par=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/HumanIso.smat -tmp=/tmp -window=2400000
      #ENDLOOP
    Create a file containing a single line.
      echo single > single
    Generate the job list file, jobList, for Parasol
      gensub2 genome.list single gsub jobList
    First issue the following Parasol command:
      para create jobList
    Run the following command, which will try the first 10 jobs from jobList
      para try
    Check whether these 10 jobs ran OK with
      para check
    If they have problems, debug and fix your program, template file,
    commands, etc. and try again.  If they are OK, then issue the following
    command, which will ask Parasol to start all the remaining jobs.  For
    hg11, there were 2043 jobs in total.
      para push
    Issue either one of the following two commands to check the status of
    the cluster and your jobs, until they are done.
      parasol status
      para check

o - Convert these to chromosome level files as so:
      cd ~/oo
      cd bed/genscan
      liftUp genscan.gtf ../../jkStuff/liftAll.lft warn gtf/*.gtf
      liftUp genscanSubopt.bed ../../jkStuff/liftAll.lft warn subopt/*.bed
      cat pep/*.pep > genscan.pep

o - Load into the database as so:
      ssh hgwdev
      cd ~/oo/bed/genscan
      ldHgGene hg11 genscan genscan.gtf
      hgPepPred hg11 generic genscanPep genscan.pep
      hgLoadBed hg11 genscanSubopt genscanSubopt.bed

CREATE GOLDEN TRIANGLE (todo)

Make sure that rnaCluster table is in place.  Then extract Affy expression
info into a form suitable for Eisen's clustering program with:
      cd ~/oo/bed
      mkdir triangle
      cd triangle
      eisenInput hg11 affyHg10.txt
Transfer this to Windows and do k-means clustering with k=200 with cluster.
Transfer results file back to ~/oo/bed/triangle/affyCluster_K_G200.kgg.
Then do
      promoSeqFromCluster hg11 1000 affyCluster_K_G200.kgg kg200.unmasked
Then RepeatMask the .fa files in kg200.unmasked, and copy masked versions
to kg200.  Then
      cat kg200/*.fa > all1000.fa
and set up a cluster Improbizer run to do 100 controls for every real run
on each - putting the output in imp.200.1000.e.  When the Improbizer run
is done make a file summarizing the runs as so:
      cd imp.200.1000.e
      motifSig ../imp.200.1000.e.iri ../kg200 motif control*
Get rid of insignificant motifs with:
      cd ..
      awk '{if ($2 > $3) print; }' imp.200.1000.e.iri > sig.200.1000.e.iri
Turn the rest into just dnaMotifs with
      iriToDnaMotif sig.200.1000.e.iri motif.200.1000.e.txt
Extract all promoters with
      featureBits hg11 rnaCluster:upstream:1000 -bed=upstream1000.bed \
          -fa=upstream1000.fa
Locate motifs on all promoters with
      dnaMotifFind motif.200.1000.e.txt upstream1000.fa hits.200.1000.e.txt \
          -rc -markov=2
      liftPromoHits upstream1000.bed hits.200.1000.e.txt triangle.bed

CREATE STS/FISH/BACENDS/CYTOBANDS DIRECTORY STRUCTURE AND SETUP (done)

o - Create directory structure to hold information for these tracks
      cd /projects/hg2/booch/psl/
      mkdir gs.12
      mkdir gs.12/build29
      mkdir gs.12/build29/sts
      mkdir gs.12/build29/primers
      mkdir gs.12/build29/bacends
      mkdir gs.12/build29/fish
      mkdir gs.12/build29/cytobands
o - Copy in Makefiles from previous assembly
      cp gs.11/build28/Makefile gs.12/build29
      cp gs.11/build28/sts/Makefile gs.12/build29/sts
      cp gs.11/build28/primers/Makefile gs.12/build29/primers
      cp gs.11/build28/bacends/Makefile gs.12/build29/bacends
      cp gs.11/build28/fish/Makefile gs.12/build29/fish
      cp gs.11/build28/cytobands/Makefile gs.12/build29/cytobands
o - Update all Makefiles with latest OOVERS and GSVERS
o - Create accession_info file
      make accession_info.rdb

UPDATE STS INFORMATION (done)

o - Download and unpack updated information from dbSTS:
    In a web browser, go to ftp://ftp.ncbi.nih.gov/repository/dbSTS/.
    Download dbSTS.sts, dbSTS.aliases, and dbSTS.FASTA.dailydump.Z to
    /projects/hg2/booch/psl/update
    Unpack dbSTS.FASTA.dailydump.Z:
      gunzip dbSTS.FASTA.dailydump.Z
o - Create updated files (takes a while - ~1.5 days right now)
      cd /projects/hg2/booch/psl/update
      make update
o - Make new directory for this info and move files there
      ssh kks00
      mkdir /cluster/store1/sts.#   (# = next number not used)
      cp all.STS.fa /cluster/store1/sts.#
      cp all.primers /cluster/store1/sts.#
      cp all.primers.fa /cluster/store1/sts.#

STS ALIGNMENTS (done)
(alignments done without RepeatMasking, so start ASAP!)
o - Create full sequence alignments
      ssh kk
      cd /cluster/home/booch/sts
    - update Makefile with latest OOVERS and GSVERS
    - update stsMarkers.lst with latest location of all.STS.fa (from above)
      make new.assembly
      make jobList.scratch   (if contig files propagated to nodes)
        - or -
      make jobList.disk      (if contig files not propagated)
      para create jobList
      para push   (or para try/para check if you want to make sure it runs)
      make stsMarkers.psl
o - Copy files to final destination and remove
      ssh kks00
      make copy.assembly
      make clean.assembly

o - Create primer alignments
      ssh kk
      cd /cluster/home/booch/primers
    - update Makefile with latest OOVERS and GSVERS
    - update primers.lst with latest location of all.primers.fa (from above)
      make new.assembly
      make jobList.scratch   (if contig files propagated to nodes)
        - or -
      make jobList.disk      (if contig files not propagated)
      para create jobList
      para push   (or para try/para check if you want to make sure it runs)
      make primers.psl
o - Copy files to final destination and remove
      ssh kks00
      make copy.assembly
      make clean.assembly

CREATE AND LOAD STS MARKERS TRACK (done)

o - Create final version of sts sequence placements
      ssh kks00
      cd /projects/hg2/booch/psl/gs.12/build29/sts
      make stsMarkers.final
o - Create final version of primers placements
      cd /projects/hg2/booch/psl/gs.12/build29/primers
      make primers.final
o - Create bed file
      cd /projects/hg2/booch/psl/gs.12/build29
      make stsMap.bed
o - Create database tables
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_sts_primer.sql
      mysql -uhgcat -pXXXXXXX < all_sts_seq.sql
      mysql -uhgcat -pXXXXXXX < stsAlias.sql
      mysql -uhgcat -pXXXXXXX < stsInfo.sql
      mysql -uhgcat -pXXXXXXX < stsMap.sql
o - Load the tables
      load /projects/hg2/booch/psl/gs.12/build29/sts/stsMarkers.psl.filter.lifted
        into all_sts_seq
      load /projects/hg2/booch/psl/gs.12/build29/primers/primers.psl.filter.lifted
        into all_sts_primer
      load /projects/hg2/booch/psl/gs.12/build29/stsAlias.bed
        into stsAlias
      load /projects/hg2/booch/psl/gs.12/build29/stsInfo.bed
        into stsInfo
      load /projects/hg2/booch/psl/gs.12/build29/stsMap.bed
        into stsMap
o - Load the sequences (change sts.# to match correct location)
      hgLoadRna addSeq hg11 /cluster/store1/sts.2/all.STS.fa
      hgLoadRna addSeq hg11 /cluster/store1/sts.2/all.primers.fa

BACEND SEQUENCE ALIGNMENTS (done)
(alignments done without RepeatMasking, so start ASAP!)

o - Create full sequence alignments
      ssh kk
      cd /cluster/home/booch/bacends
    - update Makefile with latest OOVERS and GSVERS
    - update bacEnds.lst with latest location of BACends.fa (doesn't
      usually change)
      make new
      make jobList.scratch   (if contig files propagated to nodes)
        - or -
      make jobList.disk      (if contig files not propagated)
      para create jobList
      para push   (or para try/para check if you want to make sure it runs)
      make bacEnds.psl
o - Copy files to final destination and remove
      ssh kks00
      make copy.assembly
      make clean.assembly

BACEND PAIRS TRACK

o - Update Makefile with location of pairs files, if necessary
      cd /projects/hg2/booch/psl/gs.12/build29/bacends
      edit Makefile (PAIRS=....)
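The PAIRS= edit above can also be done non-interactively with sed. A minimal
sketch - the Makefile contents and the new path here are made-up placeholders,
not the real pairs location:

```shell
# Repoint the Makefile's PAIRS variable without opening an editor.
# Makefile.demo and both paths are hypothetical stand-ins.
printf 'PAIRS=/old/location/pairs.txt\nall:\n' > Makefile.demo
sed 's|^PAIRS=.*|PAIRS=/path/to/new/pairs.txt|' Makefile.demo > Makefile.demo.tmp
mv Makefile.demo.tmp Makefile.demo
grep '^PAIRS=' Makefile.demo
```

Writing to a temp file and mv'ing avoids depending on a sed that supports
in-place editing.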
o - Create bed file
      ssh kks00
      cd /projects/hg2/booch/psl/gs.12/build29/bacends
      make bacEndPairs.bed
o - Create database tables
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_bacends.sql
      mysql -uhgcat -pXXXXXXX < bacEndPairs.sql
o - Load the tables
      load /projects/hg2/booch/psl/gs.12/build29/bacends/bacEnds.psl.filter.lifted
        into all_bacends
      load /projects/hg2/booch/psl/gs.12/build29/bacends/bacEndPairs.bed
        into bacEndPairs
o - Load the sequences (change bacends.# to match correct location)
      hgLoadRna addSeq hg11 /cluster/store1/bacends.2/BACends.fa

UPDATE FISH CLONES INFORMATION

o - Download the latest info from NCBI
    point browser at
      http://www.ncbi.nlm.nih.gov/genome/cyto/cytobac.cgi?CHR=all&VERBOSE=ctg
    change "Show details on sequence-tag" to "yes"
    change "Download or Display" to "Download table for UNIX"
    press Submit
    - save as /projects/hg2/booch/psl/fish/hbrc/hbrc.YYYYMMDD.table
o - Format file just downloaded
      cd /projects/hg2/booch/psl/fish/
      make HBRC
o - Copy it to the new freeze location
      cp /projects/hg2/booch/psl/fish/all.fish.format \
         /projects/hg2/booch/psl/gs.12/build29/fish/

CREATE AND LOAD FISH CLONES TRACK
(must be done after STS markers track and BAC end pairs track)

o - Extract the file with clone positions from database
      ssh hgwdev
      mysql -uhgcat -pXXXXXXXX hg11
      mysql> select * into outfile "/tmp/booch/clonePos.txt" from clonePos;
      mysql> quit
      mv /tmp/booch/clonePos.txt /projects/hg2/booch/psl/gs.12/build29/fish
o - Create bed file
      cd /projects/hg2/booch/psl/gs.12/build29/fish
      make bed
o - Create database table
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < cytoBand.sql
o - Load the table
      load /projects/hg2/booch/psl/gs.12/build29/cytobands/cytobands.bed
        into cytoBand

CREATE CHROMOSOME REPORTS

CREATE STS MAP COMPARISON PLOTS

DOING HUMAN/MOUSE ALIGNMENTS (todo)

o - Start with the mouse assembly in 1 Mb chunks, lower-case repeat and
    tandem-repeat masked, on kkstore by copying files there in the
    following way.
    Mouse contigs:
      mkdir /scratch/hg/mm2/rmsk
      cp bed/rmsk/out/* /scratch/hg/mm2/rmsk
      cp -R /cluster/store2/mm.2002.02/mm2/trfFa/ /scratch/hg/mm2/
    Human contigs:
      mkdir /scratch/hg/gs.12/build29/rmsk
      cp /cluster/store1/gs.12/build29/?/*/*.out /scratch/hg/gs.12/build29/rmsk
      cp /cluster/store1/gs.12/build29/??/*/*.out /scratch/hg/gs.12/build29/rmsk
      cp -R /cluster/store2/gs.12/build29/bed/trfFa /scratch/hg/gs.12/build29/trfFa
    Then
      ssh kkstore
      cd ~/oo/bed
      mkdir blatMus
      cd blatMus
      ls -1 /scratch/hg/mm2/trfFa/*.fa.trf > mouseAll
      mkdir mm
      cd mm
      splitFile ../mouseAll 50 mm
      cd ..
      ls -1 mm/* > mouse.lst
    Then bundle up the human into pieces of less than 12 meg mostly by
      ls -lhS /scratch/hg/gs.12/build29/trfFa/*.fa.trf > bigHuman
    Edit this file and move all of the lines less than 3 meg into the file
    smallHuman.  Then do
      awk '{printf("%s\n", $9);}' bigHuman > bigH
      awk '{printf("%s\n", $9);}' smallHuman > smallH
      mkdir hs
      cd hs
      splitFile ../bigH 1 big
      rm big32    # just an empty file that splitFile erroneously created
      splitFile ../smallH 4 small
      rm small504
      cd ..
      ls -1 hs/* > human.lst
    (The rm commands above indicate that splitFile needs a fix - the
    removed files are zero length.)
    Copy the old gsub here
      cp /cluster/store1/gs.11/build28/bed/blatMouse/gsub .
    Finally generate the job list with
      gensub2 human.lst mouse.lst gsub spec

o - Do the cluster run as so
      ssh kk
      cd ~/oo/bed/blatMus
      mkdir psl
      para create spec
      para try
    and then do para push/check/push/check/shove etc.
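The blatMus step above has to rm the spurious zero-length chunks that
splitFile leaves behind (big32, small504). Until splitFile is fixed, a generic
cleanup avoids memorizing the chunk names. This sketch uses the standard
`split` and `find` tools, not splitFile itself, on a made-up list file:

```shell
# Split a list into fixed-size chunks, then delete any empty leftover
# chunk instead of rm'ing it by name.
mkdir -p hs.demo
printf 'fileA\nfileB\nfileC\n' > listH
split -l 2 listH hs.demo/big        # like: splitFile ../bigH 2 big
: > hs.demo/bigZZ                   # simulate the spurious empty chunk
find hs.demo -type f -size 0 -exec rm {} +
ls hs.demo
```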
o - Sort alignments as so
      ssh kkstore
      cd ~/oo/bed/blatMus
      pslCat -dir -check psl \
        | liftUp -type=.psl stdout ../../jkStuff/liftAll.lft warn stdin \
        | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin -pslQ \
        | pslSortAcc nohead chromPile /cluster/store2/temp stdin

o - Get rid of big pile-ups due to contamination as so:
      mkdir chrom
      cd chromPile
      foreach i (*.psl)
          echo $i
          pslUnpile -maxPile=250 $i ../chrom/$i
      end

o - Rename to correspond with tables as so and load into database:
      ssh hgwdev
      cd ~/oo/bed/blatMus/chrom
      foreach i (*.psl)
          set r = $i:r
          mv $i ${r}_blatMus.psl
      end
      hgLoadPsl hg11 *.psl

o - Load sequence into database as so:
      ssh kks00
      faSplit about /projects/hg3/mouse/arachne.3/whole/Unplaced.mfa \
          1200000000 /projects/hg3/mouse/arachne.3/whole/unplaced
      ssh hgwdev
      hgLoadRna addSeq '-abbr=gnl|' hg11 /projects/hg3/mouse/arachne.3/whole/unpla*.fa
      hgLoadRna addSeq '-abbr=con' hg11 /projects/hg3/mouse/arachne.3/whole/SET*.mfa
    This will take quite some time.  Perhaps an hour.

o - Produce 'best in genome' filtered version:
      ssh kks00
      cd ~/mouse/vsOo33
      pslSort dirs blatMouseAll.psl temp blatMouse
      pslReps blatMouseAll.psl bestMouseAll.psl /dev/null -singleHit \
          -minCover=0.3 -minIdentity=0.1
      pslSortAcc nohead bestMouse temp bestMouseAll.psl
      cd bestMouse
      foreach i (*.psl)
          set r = $i:r
          mv $i ${r}_bestMouse.psl
      end

o - Load best in genome into database as so:
      ssh hgwdev
      cd ~/mouse/vsOo33/bestMouse
      hgLoadPsl hg11 *.psl

PRODUCING CROSS_SPECIES mRNA ALIGNMENTS  DONE

Here you align vertebrate mRNAs against the masked genome on the cluster
you set up during the previous step.
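The `$i:r` rename loops above use csh's root modifier to strip the extension.
The same renames in plain sh use `${f%.psl}`; a sketch on throwaway files:

```shell
# Rename chrN.psl -> chrN_blatMus.psl, as the csh `:r` loop does.
mkdir -p chrom.demo
cd chrom.demo
touch chr1.psl chr2.psl
for f in *.psl; do
    mv "$f" "${f%.psl}_blatMus.psl"    # ${f%.psl} == csh's $f:r
done
ls
cd ..
```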
o - Make sure that gbpri, gbmam, gbrod, and gbvert are downloaded from
    Genbank into /cluster/store1/genbank.129  DONE

o - Process these out of genbank flat files as so:
      ssh kkstore
      cd /cluster/store1/genbank.129
      cd ../mrna.129
      faSplit sequence xenoRna.fa 2 xenoRna
      ssh kks00
      cd /scratch/hg
      mkdir mrna.129
      cp /cluster/store1/mrna.129/xenoRna*.* mrna.129
    Request binrsync of /scratch/hg/mrna.129 from the admins.

    Set up cluster run.  First make sure genome is in
    kks00:/scratch/hg/gs.12/build29/contig/trf in RepeatMasked + trf form.
    (This is probably done already in mouse alignment stage.)  Also make
    sure /scratch/hg/mrna.129 is loaded with xenoRna.fa.  Then do:
      ssh kkstore
      cd /cluster/store1/gs.12/build29/bed
      mkdir xenoMrna
      cd xenoMrna
      mkdir psl
      ls -1S /scratch/hg/gs.12/build29/trfFa/*.fa.trf > human.lst
      ls -1S /scratch/hg/mrna.129/xenoRna?*.fa > mrna.lst
      cp ~kent/lastOo/bed/xenoMrna/gsub .
      gensub2 human.lst mrna.lst gsub spec
      para create spec
      para try
      para check
      para push
    Do para check until the run is done, doing para push if necessary on
    occasion.
Sort xeno mRNA alignments as so:
      ssh kkstore
      cd ~/oo/bed/xenoMrna
      pslSort dirs raw.psl /cluster/store2/temp psl
      pslReps raw.psl cooked.psl /dev/null -minAli=0.25
      liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
      pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
      pslCat -dir chrom > xenoMrna.psl
      rm -r chrom raw.psl cooked.psl chrom.psl
    DONE

Load into database as so:
      ssh hgwdev
      cd ~/oo/bed/xenoMrna
      hgLoadPsl hg11 xenoMrna.psl -tNameIx
      cd /cluster/store1/mrna.129
      hgLoadRna add hg11 /cluster/store1/mrna.129/xenoRna.fa \
          /cluster/store1/mrna.129/xenoRna.ra
    DONE

Similarly do xenoEst alignments:
    Prepare the est data:
      cd /cluster/store1/mrna.129
      faSplit sequence xenoEst.fa 16 xenoEst
      ssh kkstore
      cd /cluster/store1/gs.12/build29/bed
      mkdir xenoEst
      cd xenoEst
      mkdir psl
      ls -1S /scratch/hg/gs.12/build29/trfFa/*.fa.trf > human.lst
      cp /cluster/store1/mrna.129/xenoEst?*.fa /scratch/hg/mrna.129
      ls -1S /scratch/hg/mrna.129/xenoEst?*.fa > mrna.lst
      cp ~kent/lastOo/bed/xenoEst/gsub .
    Request a binrsync from the admins of kkstore's /scratch/hg/mrna.129.
    When done, do:
      gensub2 human.lst mrna.lst gsub spec
      para create spec
      para push
    DONE

Sort xenoEst alignments:
      ssh kkstore
      cd ~/oo/bed/xenoEst
      pslSort dirs raw.psl /cluster/store2/temp psl
      pslReps raw.psl cooked.psl /dev/null -minAli=0.10
      liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
      pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
      pslCat -dir chrom > xenoEst.psl
      rm -r chrom raw.psl cooked.psl chrom.psl

Load into database as so:
      ssh hgwdev
      cd ~/oo/bed/xenoEst
      hgLoadPsl hg11 xenoEst.psl -tNameIx
      cd /cluster/store1/mrna.129
      hgLoadRna add hg11 /cluster/store1/mrna.129/xenoEst.fa \
          /cluster/store1/mrna.129/xenoEst.ra
    DONE

PRODUCING FISH ALIGNMENTS (DONE)

o - Do fish/human alignments.
      ssh kk
      cd ~/oo/bed
      mkdir blatFish
      cd blatFish
      mkdir psl
      ls -1S /scratch/hg/fish/*.fa > fish.lst
      ls -1S /scratch/hg/gs.12/build29/trfFa/*.fa.trf > human.lst
    Copy over gsub from previous version and edit paths to point to
    current assembly.
      gensub2 human.lst fish.lst gsub spec
      para create spec
    DONE
      para try
    Make sure jobs are going ok with para check.  Then
      para push
    Wait about 2 hours and do another para push.  Do para checks and, if
    necessary, para pushes until done - or use para shove.

o - Sort alignments as so
      pslCat -dir psl \
        | liftUp -type=.psl stdout ~/oo/jkStuff/liftAll.lft warn stdin \
        | pslSortAcc nohead chrom temp stdin

o - Copy to hgwdev:/scratch.  Rename to correspond with tables as so and
    load into database:
      ssh hgwdev
      cd ~/oo/bed/blatFish/chrom
      foreach i (*.psl)
          set r = $i:r
          mv $i ${r}_blatFish.psl
      end
      hgLoadPsl hg11 *.psl
    Now load the fish sequence data
      hgLoadRna addSeq hg11 /projects/hg3/fish/tet6/tet*.fa
    DONE

TIGR GENE INDEX (done 7/1/02, re-load w/new data 7/30/02)

      mkdir -p ~/hg11/bed/tigr
      cd ~/hg11/bed/tigr
      # wget ftp://ftp.tigr.org/private/HGI_ren/TGI_track_HumanGenome_build29.tgz
      wget ftp://ftp.tigr.org/private/HGI_ren/TGI_track_HumanGenome_build29_corrected.tgz
      gunzip -c TGI*.tgz | tar xvf -
      foreach f (*cattle*)
          set f1 = `echo $f | sed -e 's/cattle/cow/g'`
          mv $f $f1
      end
      foreach o (mouse cow human pig rat)
          setenv O $o
          foreach f (chr*_$o*s)
              tail +2 $f | perl -wpe 's/THC/TC/; s/(TH?C\d+)/$ENV{O}_$1/;' > $f.gff
          end
      end
      ldHgGene -exon=TC hg11 tigrGeneIndex *.gff

LOAD STS MAP (todo)
DONE BY TERRY I BELIEVE - HE WILL UPDATE THIS

    - login to hgwdev
      cd ~/oo/bed
      hg11 < ~/src/hg/lib/stsMap.sql
      mkdir stsMap
      cd stsMap
      bedSort /projects/cc/hg/mapplots/data/tracks/build29/stsMap.bed stsMap.bed
    - Enter database with "hg11" command.
- At mysql> prompt type in:
    load data local infile 'stsMap.bed' into table stsMap;
- At mysql> prompt type quit

LOAD CHROMOSOME BANDS (todo)
ALSO DONE BY TERRY I BELIEVE
- login to hgwdev
    cd /cluster/store1/gs.12/build29/bed
    mkdir cytoBands
    cp /projects/cc/hg/mapplots/data/tracks/oo.29/cytobands.bed cytoBands
    cd cytoBands
    hg11 < ~/src/hg/lib/cytoBand.sql
- Enter database with "hg11" command.
- At mysql> prompt type in:
    load data local infile 'cytobands.bed' into table cytoBand;
- At mysql> prompt type quit

LOAD MOUSEREF TRACK (todo)
First copy in data from kkstore to ~/oo/bed/mouseRef.  Then substitute
'genome' for the appropriate chromosome in each of the alignment files.
Finally do:
    hgRefAlign webb hg11 mouseRef *.alignments

LOAD AVID MOUSE TRACK (todo)
    ssh cc98
    cd ~/oo/bed
    mkdir avidMouse
    cd avidMouse
    wget http://pipeline.lbl.gov/tableCS-LBNL.txt
    hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
    hgLoadBed hg11 avidRepeat avidRepeat.bed
    hgLoadBed hg11 avidUnique avidUnique.bed

LOAD SNPS (Done; Daryl Thomas May 28, 2002)
    ssh hgwdev
    cd ~/oo/bed
    mkdir snp
    cd snp
- Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.ncbi.b29.gz
- Unpack.
    ln -s ../../seq_contig.md .
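The mouseRef substitution step is prose only.  Assuming the intent is to replace the literal word 'genome' with each file's chromosome name, a hedged sketch (file names and contents below are invented for illustration) could be a per-file sed pass:

```shell
# Hypothetical sketch of the 'genome' substitution; the demo directory and
# its one-line alignment file are fabricated stand-ins.
mkdir -p demo && printf 'genome 100 200\n' > demo/chr1.alignments
cd demo
for f in chr*.alignments; do
    chrom=${f%.alignments}                 # e.g. chr1 from chr1.alignments
    sed "s/genome/$chrom/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
head -1 chr1.alignments
# prints: chr1 100 200
cd .. && rm -r demo
```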
    calcFlipSnpPos seq_contig.md gp.ncbi.b29 gp.ncbi.b29.flipped
    mv gp.ncbi.b29 gp.ncbi.b29.original
    gzip gp.ncbi.b29.original
    grep RANDOM      gp.ncbi.b29.flipped >  snpTsc.txt
    grep MIXED       gp.ncbi.b29.flipped >> snpTsc.txt
    grep BAC_OVERLAP gp.ncbi.b29.flipped >  snpNih.txt
    grep OTHER       gp.ncbi.b29.flipped >> snpNih.txt
    awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
    awk -f filter.awk snpNih.txt > snpNih.contig.bed
    liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
    liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
    hgLoadBed hg11 snpTsc snpTsc.bed
    hgLoadBed hg11 snpNih snpNih.bed
- gzip all of the big files

LOAD CPGISLANDS (done 7/18/02)
- login to hgwdev
    mkdir -p ~/hg11/cpgIsland
    cd ~/hg11/cpgIsland
- Asif Chinwalla emailed the data in an attachment; it was unpacked into
  ~/hg11/cpgIsland
- copy filter.awk from a previous release, e.g. ~kent/oo.33/bed/cpgIsland,
  to cpg_apr2002.masked
    awk -f filter.awk */*.cpg > cpgIsland.bed
    hgLoadBed hg11 cpgIsland -tab -noBin \
      -sqlTable=$HOME/kent/src/hg/lib/cpgIsland.sql cpgIsland.bed

LOAD ENSEMBL GENES (done 7/9/02)
    mkdir -p ~/hg11/bed/ensembl
    cd ~/hg11/bed/ensembl
    # wget complains about a Redirection loop, but GET handles it (?):
    GET http://www.ebi.ac.uk/~stabenau/human_29_gtf.gz > human_29_gtf.gtf.gz
    # add "chr" to the chrom ids:
    gunzip -c human_29_gtf.gtf.gz | \
      perl -w -p -e 's/^(\w)/chr$1/' > human_29_gtf-fixed.gtf
    ldHgGene hg11 ensGene human_29_gtf-fixed.gtf
    # Load Ensembl peptides, replacing ">ENSP" with ">ENST":
    wget ftp://ftp.ensembl.org/pub/current_human/data/fasta/pep/Homo_sapiens.pep.all.fa.gz
    gunzip -c Homo_sapiens.pep.all.fa.gz | sed -e 's/^>ENSP/>ENST/' \
      > ensembl.pep
    hgPepPred hg11 generic ensPep ensembl.pep

LOAD SANGER22 GENES
    cd ~/oo/bed
    mkdir sanger22
    cd sanger22
(not sure where these files were downloaded from)
    grep -v Pseudogene Chr22*.genes.gff | hgSanger22 hg11 stdin Chr22*.cds.gff *.genes.dna *.cds.pep 0 | ldHgGene hg11 sanger22pseudo stdin
Note: this creates sanger22extras, but doesn't
currently create a correct sanger22 table; that is replaced in the next steps:
    sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
      | ldHgGene hg11 sanger22 stdin
    sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
      | ldHgGene hg11 sanger22pseudo stdin
    hgPepPred hg11 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)
First download files from James Gilbert's email to ~/oo/bed/sanger20 and go
to that directory while logged onto hgwdev.  Then:
    grep -v Pseudogene chr_20*.gtf | ldHgGene hg11 sanger20 stdin
    hgSanger20 hg11 *.gtf *.info

LOAD RNAGENES (todo)
- login to hgwdev
- cd ~kent/src/hg/lib
- hg11 < rnaGene.sql
- cd /cluster/store1/gs.12/build29/bed
- mkdir rnaGene
- cd rnaGene
- download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz
- gunzip *.gz
- liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff
- hgRnaGenes hg11 chrom.gff

LOAD EXOFISH (todo)
- login to hgwdev
- cd /cluster/store1/gs.12/build29/bed
- mkdir exoFish
- cd exoFish
- hg11 < ~kent/src/hg/lib/exoFish.sql
- Put the email attachment from Olivier Jaillon (ojaaillon@genoscope.cns.fr)
  into /cluster/store1/gs.12/build29/bed/exoFish/all_maping_ecore
- awk -f filter.awk all_maping_ecore > exoFish.bed
- hgLoadBed hg11 exoFish exoFish.bed

LOAD MOUSE SYNTENY (todo)
- login to hgwdev
- cd ~/kent/src/hg/lib
- hg11 < mouseSyn.sql
- mkdir ~/oo/bed/mouseSyn
- cd ~/oo/bed/mouseSyn
- Save Dianna Church's (church@ncbi.nlm.nih.gov) email attachment as
  mouseSyn.txt
- awk -f format.awk *.txt > mouseSyn.bed
- delete the first line of mouseSyn.bed
- Enter database with "hg11" command.
- At mysql> prompt type in:
    load data local infile 'mouseSyn.bed' into table mouseSyn;

LOAD GENIE (todo)
- cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
- liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
- ldHgGene hg11 genieAlt predChrom.gtf
- cat */ctg*/ctg*.affymetrix.aa > pred.aa
- hgPepPred hg11 genie pred.aa
- hg11
    mysql> delete from genieAlt where name like 'RS.%';
    mysql> delete from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (DONE 8/8/02)
    ln -s /cluster/store1/gs.12/build29/ ~/hg11
    mkdir -p ~/hg11/bed/softberry
    cd ~/hg11/bed/softberry
    GET ftp://www.softberry.com/pub/sc_fgenesh_ap02/sb_fgenesh_ap02.tar.gz \
      > sb_fgenesh_ap02.tar.gz
    gunzip -c sb_fgenesh_ap02.tar.gz | tar xvf -
    cd sb_fgenesh_ap02
    ssh hgwdev
    cd ~/hg11/bed/softberry/sb_fgenesh_ap02
    ldHgGene hg11 softberryGene chr*.gff
    hgPepPred hg11 softberry *.pro
    hgSoftberryHom hg11 *.pro

LOAD GENEID GENES (todo)
    mkdir ~/oo/bed/geneid
    cd ~/oo/bed/geneid
    mkdir download
    cd download
Now download *.gtf and *.prot from
http://www1.imim.es/genepredictions/H.sapiens/golden_path_20011222/geneid_v1.1/
    cd ..
    ldHgGene hg11 geneid download/*.gtf -exon=CDS
    hgPepPred hg11 generic geneidPep download/*.prot

LOAD ACEMBLY (DONE 05/31/02)
    mkdir -p ~/oo/bed/acembly
    cd ~/oo/bed/acembly
- Get acembly*gene.gff from Jean and Danielle Thierry-Mieg:
    wget ftp://ftp.ncbi.nih.gov/repository/acedb/ncbi_29.human.genes/acembly.ncbi_29.genes.gff.tar.gz
    wget ftp://ftp.ncbi.nih.gov/repository/acedb/ncbi_29.human.genes/acembly.ncbi_29.genes.proteins.fasta.tar.gz
    gunzip -c acembly.ncbi_29.genes.gff.tar.gz | tar xvf -
    gunzip -c acembly.ncbi_29.genes.proteins.fasta.tar.gz | tar xvf -
    cd acembly.ncbi_29.genes.gff
- Strip out floating-contig features (lines with *|NT_??????
as the chr ID), and add the 'chr' prefix to all chr nums:
    foreach f (acemblygenes.*.gff)
      egrep -v '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' $f | \
        perl -wpe 's/^(\w)/chr$1/' > $f:r-fixed.gff
    end
- Save just the floating-contig features to different files for lifting,
  and lift up the floating-contig features to chr*_random coords:
    foreach c (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y Un)
      egrep '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' acemblygenes.$c.gff | \
        perl -wpe 's/^(\w+)\|(\w+)/$1\/$2/' > $c-random-ctg.gff
      liftUp $c-random-lifted.gff ../../../$c/lift/random.lft warn $c-random-ctg.gff
    end
    cd ../acembly.ncbi_29.genes.proteins.fasta
- Remove G_t*_ prefixes from acemblyproteins.*.fasta:
    foreach f (acemblyproteins.*.fasta)
      perl -wpe 's/^\>G_t[\da-zA-Z]+_/\>/' $f > $f:r-fixed.fasta
    end
- Load into database as so:
    cd ..
    ldHgGene hg11 acembly acembly.ncbi_29.genes.gff/*-fixed.gff acembly.ncbi_29.genes.gff/*-lifted.gff
    hgPepPred hg11 generic acemblyPep acembly.ncbi_29.genes.proteins.fasta/*-fixed.fasta

LOAD GENOMIC DUPES (todo)
o - Load genomic dupes
    ssh hgwdev
    cd ~/oo/bed
    mkdir genomicDups
    cd genomicDups
    wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
    unzip *.zip
    awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
    mysql -u hgcat -pbigSECRET hg11 < ~/src/hg/lib/genomicDups.sql
    hgLoadBed hg11 -oldTable genomicDups genomicDups.bed

FAKING DATA FROM PREVIOUS VERSION
(This is just for until the proper track arrives.  Rescues about 97% of the
data.  Just an experiment, not really followed through on.)
o - Rescuing STS track:
- log onto hgwdev
- mkdir ~/oo/rescue
- cd !$
- mkdir sts
- cd sts
- bedDown hg3 mapGenethon sts.fa sts.tab
- echo ~/oo/sts.fa > fa.lst
- pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g
- log onto cc01
- cd ~/oo/rescue/sts
- split all.con into 3 parts and condor_submit each part
- wait for assembly to finish
- cd psl
- mkdir all
- ln ?/*.psl ??/*.psl *.psl all
- pslSort dirs raw.psl temp all
- pslReps raw.psl contig.psl /dev/null
- rm raw.psl
- liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
- rm contig.psl
- mv chrom.psl ../convert.psl

LOADING MOUSE MM2 BLASTZ ALIGNMENTS FROM PENN STATE (markd)
- loading both blastz alignments and reference (single coverage) alignments
- in xAli format, which includes sequence
- done in a tmp dir; intermediate files discarded
- create psl files for each per-contig lav file:
    set sc=""
    set tbl="blastzMm2"
    foreach chrdir (/cluster/store1/gs.12/build29/bed/blastz.mm2.2002-04-14/lav/chr*)
      set chr=$chrdir:t
      set outdir=lav-psl${sc}/$chr
      mkdir -p $outdir
      foreach lav ($chrdir/*.lav${sc})
        lavToPsl -target-strand=+ $lav $outdir/$lav:t:r.psl
      end
    end
- Convert to per-chromosome files, sort, and add sequence:
    mkdir -p lav-xa${sc}
    foreach chrdir (lav-psl${sc}/*)
      set chr=$chrdir:t
      pslCat -check -nohead -ext=.psl -dir lav-psl${sc}/$chr \
        | liftUp -type=.psl -pslQ -nohead stdout /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin \
        | sort -k 15n -k 16n \
        | pslToXa stdin lav-xa${sc}/${chr}_${tbl}.xa /cluster/store2/mm.2002.02/mm2/nib /cluster/store1/gs.12/build29/nib
    end
- repeat both loops, this time doing the single-coverage alignments:
    set sc=".sc"
    set tbl="blastzMm2Sc"
- Load tables:
    cd lav-xa
    hgLoadPsl -xa hg11 *.xa
    cd ../lav-xa.sc
    hgLoadPsl -xa hg11 *.xa
- Load aligned ancient repeats from
  /cluster/store1/gs.12/build29/bed/blastz.mm2.2002-04-14/aar
  Ryan created /cluster/store1/gs.12/build29/bed/blastz.mm2.2002-04-14/aar/xali
- Loaded into aarMm2

MITOCHONDRIAL DNA PSEUDO-CHROMOSOME - DONE
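The "split all.con into 3 parts" step in the STS rescue is prose only; one hedged way to do it with the standard split utility is sketched below.  The stand-in file and its contents are fabricated (the real all.con is a condor job list), and the piece sizes are illustrative:

```shell
# Fabricated stand-in for all.con so the sketch is self-contained.
seq 1 9 > all.con.demo
# Split into 3 roughly equal pieces: .aa, .ab, .ac
lines=$(wc -l < all.con.demo)
split -l $(( (lines + 2) / 3 )) all.con.demo all.con.demo.
ls all.con.demo.a?
rm all.con.demo all.con.demo.a?
```

Each piece would then be handed to condor_submit as in the original step.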
Download the fasta file from http://www.gen.emory.edu/MITOMAP/mitomapRCRS.fasta
and put it in /cluster/store1/mrna.129.
    ssh hgwdev
    cd ~/oo
    mkdir M
    cp /cluster/store1/mrna.129/mitomapRCRS.fasta M/chrM.fa
Edit jkStuff/makeNib.sh to make sure it also has the "M" directory in its
file list.
    tcsh jkStuff/makeNib.sh
    hgNibSeq -preMadeNib hg11 /cluster/store1/gs.12/build29/nib ?/chr*.fa ??/chr*.fa

LOAD Ingo Ebersberger's chimp BLAT alignments - DONE
    cd ~/oo
    mkdir bed/chimpBlat
    cd bed/chimpBlat
    #!/bin/sh
    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
    do
      wget http://email.eva.mpg.de/~ebersber/custom_track_chimp/MPI-sg_apr02/chr${i}_gp_F01Apr02.psl
    done
Remove the first line from each psl file; it is junk.
    pslCat *.psl > chimpBlat.psl
    hgLoadPsl hg11 chimpBlat.psl

MAKING THE DOWNLOADABLE DATABASE FILES - DONE
    mkdir /usr/local/apache/htdocs/goldenPath/05apr2002
    mkdir /usr/local/apache/htdocs/goldenPath/05apr2002/chromosomes
    mkdir /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
    mkdir /usr/local/apache/htdocs/goldenPath/05apr2002/database
o Zip up the chromosomes individually:
    ssh kkstore
(we use kkstore because no NFS traffic via kkstore = faster data transfer)
    cd ~/oo
In tcsh run this script:
    foreach i (*/chr*.fa)
      echo zip $i:r.zip $i
      zip $i:r.zip $i
    end
Then do:
    ssh hgwdev
    mv */chr*.zip /usr/local/apache/htdocs/goldenPath/05apr2002/chromosomes
Request that the admins push this to hgwbeta.
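The "remove the first line from each psl file" instruction in the chimp BLAT section can be scripted; this is a hedged sketch with invented demo file names, not the command actually used at the time:

```shell
# Two tiny stand-in .psl files, each with a junk first line.
mkdir -p psl_demo && cd psl_demo
printf 'junk\ndata row 1\n' > chr1_demo.psl
printf 'junk\ndata row 2\n' > chr2_demo.psl
for f in *.psl; do
    tail -n +2 "$f" > "$f.tmp" && mv "$f.tmp" "$f"   # drop the first line
done
wc -l *.psl    # each file is now down to its single data row
cd .. && rm -r psl_demo
```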
o Make the big zips
- Make database.zip
    ssh hgwbeta
    cd /usr/local/apache/htdocs/goldenPath/05apr2002/database
    zip ../bigZips/database.zip *
    ssh hgwdev
    cd ~/oo
- Make chromAgp.zip
    zip chromAgp.zip */chr*.agp
    mv chromAgp.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
- Make chromFa.zip
    zip chromFa.zip */chr*.fa
    mv chromFa.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
- Make chromOut.zip
    zip chromOut.zip */chr*.out
    mv chromOut.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
- Make contigAgp.zip
    zip contigAgp.zip */*/*.agp
    mv contigAgp.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
- Make contigFa.zip
    zip contigFa.zip */*/*.fa
    mv contigFa.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
- Make contigOut.zip
    zip contigOut.zip */*/*.out
    mv contigOut.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
- Make liftAll.zip
    zip liftAll.zip jkStuff/liftAll.lft
    mv liftAll.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
- Make mrna.zip
    zip mrna.zip /cluster/store1/mrna.129/mrna.fa
    mv mrna.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
o Dump the database
    ssh hgwbeta
We dump the database on hgwbeta in order to dump only the most accurate
database state.  There is one trick here: mysqldump runs as the mysql user,
so the directory you dump to must be writable by that user.  Here's what
to do:
    cd /var/tmp
    mkdir hg11-dump
    chmod 777 hg11-dump   (since you aren't root this is quickest)
    cd hg11-dump
    mysqldump --user=hguser --password=hguserstuff --all --tab=. hg11
That directory will quickly fill with .sql and .txt files.  When it is
done do:
    cd /var/tmp/hg11-dump
    gzip *.txt
    mv * /usr/local/apache/htdocs/goldenPath/05apr2002/database