This file describes how we made the browser database on NCBI build 30
(July, 2002 freeze).

[For importing GTF tracks, use /projects/compbio/bin/validate_gtf.pl]

(The numbered stuff was brought in from
/cluster/store3/gs.13/build30/build.ncbi.doc)

HOW TO BUILD AN ASSEMBLY FROM NCBI FILES
----------------------------------------
NOTE: It is best to run most of this stuff on kkstore since it is not
averse to handling files > 2Gb.

0) Make gs.XX directory, gs.XX/buildXX directory, and gs.XX/ffa directory.
   Make a symbolic link from /cluster/store1 to this location:
      cd /cluster/store1
      ln -s (actual location)/gs.13 ./gs.13
   Make a symbolic link from your home directory to the build dir:
      ln -s /cluster/store1/gs.13/build30 ~/oo

1) Download seq_contig.md, ncbi_buildXX.agp, contig_overlaps.agp and the
   contig fa file into the gs.XX/buildXX/ directory.
   *** For build30, the files were split into reference.agp/reference.fa
   (main O&O), DR51.agp/DR51.fa, and DR52.agp/DR52.fa (alternate versions
   of the MHC region).  These were concatenated to get ncbi_build30.agp
   and ncbi_build30.fa (a sketch of this file prep appears after step 13).

2) Move and unpack the contig fa file into ../ffa/ncbi_buildXX.fa

2.3) Sanity check things with (in this directory):
      ~kent/bin/i386/checkYbr ncbi_buildXX.agp ../ffa/ncbi_buildXX.fa \
        seq_contig.md
     Report any errors back to Richa and Greg at NCBI.

3) Convert fa files into UCSC style fa files and place in the "contigs"
   directory inside the gs.XX/buildXX directory:
      mkdir contigs
      /cluster/bin/i386/faNcbiToUcsc -split -ntLast ../ffa/ncbi_buildXX.fa \
        contigs
   Note: 7/23/02 ASH: edited chrM.fa header to ">chrM" not
   ">gi|17981852|ref|NC_001807.4|"

3.1) Make a fake chrM contig:
      cd ~/oo
      mkdir M
     Copy in chrM.fa, chrM.agp and chrM.gl from the previous version.
      mkdir M/NT_999999
      cp chrM.fa NT_999999/NT_999999.fa

4) Create lift files (this will create the chromosome directory structure)
   and inserts file:
      /cluster/bin/scripts/createNcbiLifts seq_contig.md .

5) Create contig agp files (will create contig directory structure):
      /cluster/bin/scripts/createNcbiCtgAgp seq_contig.md ncbi_buildXX.agp .

5.1) Create contig gl files:
      ~kent/bin/i386/agpToGl contig_overlaps.agp . -md=seq_contig.md

6) Create chromosome agp files:
      /cluster/bin/scripts/createNcbiChrAgp .

6.1) Copy over jkStuff:
      mkdir jkStuff
      cp /cluster/store1/gs.12/build29/jkStuff/*.sh jkStuff
      cp /cluster/store1/gs.12/build29/jkStuff/*.csh jkStuff
      cp /cluster/store1/gs.12/build29/jkStuff/*.gsub jkStuff

6.2) Patch the size of chromosome Y into Y/lift/ordered.lft by grabbing it
     from the last line of Y/chrY.agp (not needed for build30).

6.3) Create chromosome gl files:
      jkStuff/liftGl.sh contig.gl

7) Distribute contig .fa to the appropriate directories (assumes all files
   are in the "contigs" directory):
      /cluster/bin/scripts/distNcbiCtgFa contigs .

8) Reverse complement the NT contig fa files that are flipped in the
   assembly (uses the faRc program):
      /cluster/bin/scripts/revCompNcbiCtgFa seq_contig.md .

   (NOTE: STS placements may be done at this point, before repeat masking,
   using the .fa's on NFS for QC analysis - all other placements should be
   done after repeat masking and distributing to cluster nodes.)

9) Split contigs, run RepeatMasker, lift results.
   Notes:
   * If there is a new version of RepeatMasker, build it and ask the admins
     to binrsync it (kkstore:/scratch/hg/RepeatMasker/*).
   * Contigs (*/NT_*/NT_*.fa) are split into 500kb chunks to make
     RepeatMasker runs manageable on the cluster ==> results need lifting.
   * For the NCBI assembly we repeat mask on the sensitive mode setting
     (RepeatMasker -s).
   * Note: for build30 / hg12, RepeatMasker was run in quick mode
     (/cluster/bin/scripts/RMLocalQuick) first, and the .out files were
     saved to .out.quick before re-running with RMLocalSens.

   #- Split contigs into 500kb chunks:
    cd ~/oo
    foreach d ( ?{,?}/NT_* )
      cd $d
      set contig = $d:t
      faSplit size $contig.fa 500000 ${contig}_ -lift=$contig.lft \
        -maxN=500000
      cd ../..
    end

   #- Make the run directory and job list:
    cd ~/oo
    mkdir RMRun
    rm -f RMRun/RMJobs
    touch RMRun/RMJobs
    foreach d ( ?{,?}/NT_* )
      foreach f ( /cluster/store3/gs.13/build30/$d/NT_*_*.fa )
        set f = $f:t
        echo /cluster/bin/scripts/RMLocalSens \
          /cluster/store3/gs.13/build30/$d $f \
          '{'check out line+ /cluster/store3/gs.13/build30/$d/$f.out'}' \
          >> RMRun/RMJobs
      end
    end

   #- Do the run:
    ssh kk
    cd ~/oo/RMRun
    para create RMJobs
    para try, para check, para check, para push, para check,...

   #- Lift up the split-contig .out's to contig-level .out's:
    cd ~/oo
    foreach d ( ?{,?}/NT_* )
      cd $d
      set contig = $d:t
      liftUp $contig.fa.out $contig.lft warn ${contig}_*.fa.out > /dev/null
      cd ../..
    end

10) Lift up RepeatMasker .out files to chromosome coordinates via:
     cd ~/oo
     tcsh jkStuff/liftOut2.sh

10.1) Validate the RepeatMasking by randomly selecting a few NT_*.fa files,
      manually repeat masking them and matching the .out files against the
      related part of the chromosome-level .out files.  For example:
       ssh kk
       cd ~/oo
      Pick several values of $chr and $nt and run these commands:
       set chr = ?
       set nt = NT_??????
       mv $chr/$nt/$nt.fa.out $chr/$nt/$nt.fa.out.bak
       /scratch/hg/RepeatMasker/RepeatMasker -s $chr/$nt/$nt.fa
       rm $chr/$nt/$nt.fa.{masked,cat,cut,stderr,tbl}
      Compare each $chr/$nt/$nt.fa.out against the original and against the
      appropriate part of $chr/chr$chr.fa.out (use the coords for $nt given
      in seq_contig.md).
       mv $chr/$nt/$nt.fa.out.bak $chr/$nt/$nt.fa.out
      For build 30, the following were checked:
       1/NT_004321, Y/NT_025975, 11/NT_033237

11) Generate contig and chromosome level masked and unmasked files via:
     tcsh jkStuff/chrFa.sh
     tcsh jkStuff/makeFaMasked.sh

12) Copy all contig and chrom fa files to /scratch on kkstore to get ready
    for cluster jobs, and ask to propagate to the nodes:
     ssh kkstore
     cd ~/oo
     /cluster/bin/scripts/cpNcbiFaScratch . /scratch/hg/gs.13/build30
    Build 30 re-do only:
     cd /scratch/hg/gs.13/build30/; mv contig contig.0729

13) Create jkStuff/ncbi.lft for lifting stuff built w/NCBI assembly.
    Note: this ncbi.lft will not lift floating contigs to chr_random
    coords, but it will show the strand orientation of the floating
    contigs (grep for '|').
     mdToNcbiLift seq_contig.md jkStuff/ncbi.lft
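Sketch for steps 0-2, referenced from step 1 above.  The exact commands
were not recorded in this doc, so the gunzip/cat steps and file names here
are assumptions based on the step descriptions, not the original run:

     # make the build and ffa directories (step 0):
     mkdir -p /cluster/store3/gs.13/build30 /cluster/store3/gs.13/ffa
     cd /cluster/store3/gs.13/build30
     # concatenate the main O&O with the two alternate MHC haplotypes
     # (step 1):
     cat reference.agp DR51.agp DR52.agp > ncbi_build30.agp
     # build the combined fa directly in ../ffa (step 2):
     cat reference.fa DR51.fa DR52.fa > ../ffa/ncbi_build30.fa

If the NCBI files arrived gzipped, a gunzip of each piece would precede
the cats.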
CREATING DATABASE (DONE)

o - ln -s /cluster/store1/gs.13/build30 ~/oo
    NOTE: /cluster/store1/gs.13/ is a symlink to /cluster/store3/gs.13

o - Make sure there is at least 5 gig free on hgwdev:/usr/local/mysql

o - Create the database.
    - ssh hgwdev
    - Enter mysql as the mysql root user.
    - At the mysql prompt type:
        create database hg12;
        quit
    - make a semi-permanent read-only alias:
        alias hg12 mysql -u hguser -phguserstuff -A hg12

o - Tell the hgCentral database about it.
    Log onto genome-centdb and enter mysql via:
        mysql -u root -pbigSecret hgCentral
    At the mysql prompt type:
        insert into dbDb values("hg12", "Human July 2002",
          "/cluster/store1/gs.13/build30/nib", "Human", "USP18", 1);

o - Create the trackDb table as so:
      cd ~/src/hg/makeDb/hgTrackDb
    Edit that makefile to add hg12 after hg11 and do:
      make update
      cvs commit makefile

LOAD REPEAT MASKS (DONE 7/29/02)

Load the RepeatMasker .out files into the database with:
    cd ~/oo
    hgLoadOut hg12 ?/*.fa.out ??/*.fa.out

EXTRACT LINEAGE-SPECIFIC REPEATS (ARIAN SMIT's scripts) (DONE 11/4/02)

    ssh kkstore
    mkdir -p ~/hg12/bed/linSpecRep
    cd ~/hg12/bed/linSpecRep
    foreach f (~/hg12/*/*.out)
      ln -sf $f .
    end
    /cluster/bin/scripts/primateSpecificRepeats.pl *.out
    /cluster/bin/scripts/perl-rename 's/(\.fa|\.nib)//' *.out.*spec
    /cluster/bin/scripts/perl-rename 's/\.(rod|prim)spec/.spec/' *.out.*spec
    rm *.out
    rm -rf /scratch/hg/gs.13/build30/linSpecRep
    cd ..
    cp -R linSpecRep /scratch/hg/gs.13/build30
    # Ask cluster-admin@cse.ucsc.edu to binrsync /scratch/hg to clusters

STORING O+O SEQUENCE AND ASSEMBLY INFORMATION (DONE 7/12/02)

Create packed chromosome sequence files:
    ssh kkstore
    cd ~/oo
    tcsh jkStuff/makeNib.sh

Load chromosome sequence info into the database and save chrom sizes:
    ssh hgwdev
    hgsql hg12 < ~/src/hg/lib/chromInfo.sql
    cd ~/oo
    hgNibSeq -preMadeNib hg12 /cluster/store1/gs.13/build30/nib ?{,?}/chr*.fa
    mysql -u hguser -phguserstuff -N \
      -e "select chrom,size from chromInfo" hg12 > chrom.sizes

Store o+o info in database.  DONE 8/13/02
    Note: for build30, Terry specially requested these files from NCBI:
        finished.finf draft.finf predraft.finf extras.finf
        finished.ffa.gz draft.ffa.gz predraft.ffa.gz extras.ffa.gz
    For future builds, we should try to modify hgClonePos to just use
    *.finf and not the *.ffa files.  Patrick unpacked the *.ffa.gz into
    gs.13/{fin,draft,predraft,extras}/fa/* using
    /cluster/bin/scripts/unPackffa .

    cd /cluster/store1/gs.13/build30
    jkStuff/liftGl.sh contig.gl
    hgGoldGapGl hg12 /cluster/store1/gs.13 build30
    cd /cluster/store1/gs.13
    hgClonePos hg12 build30 ffa/sequence.inf /cluster/store1/gs.13 -maxErr=3
    #(Ignore warnings about missing clones - these are in chromosomes 21
    # and 22)
    hgCtgPos hg12 build30

Make and load GC percent table.  DONE 7/12/02
    ssh hgwdev
    cd ~/oo
    mkdir -p bed/gcPercent
    cd bed/gcPercent
    hgsql hg12 < ~/src/hg/lib/gcPercent.sql
    hgGcPercent hg12 ../../nib

GETTING FRESH mRNA, EST, REFSEQ SEQUENCE FROM GENBANK.  (DONE 7/29/02)

This will create a genbank.130 directory containing compressed GenBank
flat files and an mrna.130 directory containing unpacked sequence info and
auxiliary info in a relatively easy to parse (.ra) format.

o - Point your browser to ftp://ftp.ncbi.nih.gov/genbank and look at
    README.genbank.  Figure out the current release number (which is 130).

o - Consider deleting one of the older genbank releases.  It's good to at
    least keep one previous release though.

o - Where there is space, make a new genbank directory and create a
    symbolic link to it:
      mkdir /cluster/store1/genbank.130
      ln -s /cluster/store1/genbank.130 ~/genbank
      cd ~/genbank

o - ftp ftp.ncbi.nih.gov (do anonymous log-in).  Then do the following
    commands inside ftp:
      cd genbank
      prompt
      mget gbpri* gbrod* gbv* gbsts* gbest* gbmam* gbinv*
    This will take at least 2 hours.

o - Make the refSeq subdir and download files:
      mkdir -p /cluster/store1/mrna.130/refSeq
      cd /cluster/store1/mrna.130/refSeq

o - ftp ftp.ncbi.nih.gov (do anonymous log-in).
    Then do the following commands inside ftp:
      cd refseq/H_sapiens/mRNA_Prot
      prompt
      mget hs.*.gz

o - Unpack this into fa files and get extra info with:
      cd /cluster/store1/mrna.130/refSeq
      gunzip -c hs.gbff.gz | \
        gbToFaRa ~kent/hg/h/mrna.fil ../refSeq.fa ../refSeq.ra ../refSeq.ta \
        stdin

o - Log onto the server and change to your genbank directory:
      mkdir -p /cluster/store1/mrna.130
      cd /cluster/store1/mrna.130
      gunzip -c /cluster/store1/genbank.130/gbpri*.gz | \
        gbToFaRa ~kent/hg/h/mrna.fil mrna.fa mrna.ra mrna.ta stdin
      gunzip -c /cluster/store1/genbank.130/gbest*.gz | \
        gbToFaRa ~kent/hg/h/mrna.fil est.fa est.ra est.ta stdin
      gunzip -c /cluster/store1/genbank.130/gbest*.gz | \
        gbToFaRa ~kent/hg/h/xenoRna.fil xenoEst.fa xenoEst.ra xenoEst.ta \
        stdin
      cd /cluster/store1/genbank.130
      gunzip -c gbpri*.gz gbmam*.gz gbrod*.gz gbv*.gz gbinv*.gz | \
        gbToFaRa ~kent/hg/h/xenoRna.fil ../mrna.130/xenoRna.fa \
        ../mrna.130/xenoRna.ra ../mrna.130/xenoRna.ta stdin

STORING mRNA/EST SEQUENCE AND AUXILIARY INFO (DONE 7/29/02)

o - Store the mRNA (non-alignment) info in the database:
      hgLoadRna new hg12
      hgLoadRna add hg12 /cluster/store1/mrna.130/mrna.fa \
        /cluster/store1/mrna.130/mrna.ra
      hgLoadRna add hg12 /cluster/store1/mrna.130/est.fa \
        /cluster/store1/mrna.130/est.ra
      hgLoadRna add -type=refSeq hg12 /cluster/store1/mrna.130/refSeq.fa \
        /cluster/store1/mrna.130/refSeq.ra
    The est line will take quite some time to complete.

MAKING AND STORING mRNA AND EST ALIGNMENTS (DONE w/ mrna.130)

o - Load up the local disks of the cluster with refSeq.fa, mrna.fa and
    est.fa: copy the above 3 files from /cluster/store1/mrna.130 into
    kkstore:/scratch/hg/mrna.130, then request the admins to do a binrsync
    to the cluster.

o - Use BLAT to generate refSeq, mRNA and EST alignments as so:
    Make sure that /scratch/hg/gs.13/build30/contig/ is loaded with NT_*.fa
    and pushed to the cluster nodes.
      ssh kk
      mkdir -p /cluster/store1/gs.13/build30/bed
      cd /cluster/store1/gs.13/build30/bed
    Using the bash shell do:
      for i in 'refSeq' 'mrna' 'est'
      do
        mkdir -p $i
        cd $i
        cp ~kent/lastOo/bed/$i/gsub .
        ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
        ls -1 /scratch/hg/mrna.130/$i/$i.fa > mrna.lst
        mkdir -p psl
        # Note: build30/bed/refSeq directory not writeable, so I had to
        # create a bed/refSeq/psl and change the gsub
        mkdir -p /cluster/store1/gs.13/build30/bed/refSeq/psl
        gensub2 genome.lst mrna.lst gsub spec
        para create spec
        cd ..
      done
    Now, by hand, cd to the mrna, refSeq, and est directories respectively
    and run a para push and para check in each one.

o - Process refSeq, mRNA and EST alignments into near best in genome:
      cd ~/oo/bed
      cd refSeq
      pslSort dirs raw.psl /tmp psl
      pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 \
        raw.psl contig.psl /dev/null
      liftUp -nohead all_refSeq.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /tmp all_refSeq.psl
      cd ..
      cd mrna
      pslSort dirs raw.psl /tmp psl
      pslReps -minAli=0.96 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_mrna.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /tmp all_mrna.psl
      cd ..
      cd est
      pslSort dirs raw.psl /cluster/store3/tmp psl
      pslReps -minAli=0.93 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_est.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/store3/tmp all_est.psl
      cd ..
o - Load refSeq alignments into database:
      ssh hgwdev
      cd /cluster/store1/gs.13/build30/bed/refSeq
      pslCat -dir chrom > refSeqAli.psl
      hgLoadPsl hg12 -tNameIx refSeqAli.psl

o - Load mRNA alignments into database:
      ssh hgwdev
      cd /cluster/store1/gs.13/build30/bed/mrna/chrom
    In tcsh:
      rm *_mrna.psl
      foreach i (*.psl)
        mv $i $i:r_mrna.psl
      end
      hgLoadPsl hg12 *.psl
      cd ..
      hgLoadPsl hg12 all_mrna.psl -nobin

o - Load EST alignments into database:
      ssh hgwdev
      cd /cluster/store1/gs.13/build30/bed/est/chrom
    In tcsh do:
      rm *_est.psl
      foreach i (*.psl)
        mv $i $i:r_est.psl
      end
      hgLoadPsl hg12 *.psl
      cd ..
      hgLoadPsl hg12 all_est.psl -nobin

o - Create subset of ESTs with introns and load into database:
    - ssh kkstore
      cd ~/oo
      tcsh jkStuff/makeIntronEst.sh
    - ssh hgwdev
      cd ~/oo/bed/est/intronEst
      hgLoadPsl hg12 *.psl

o - Put orientation info on ESTs and mRNAs into database:
    Note: the cluster run requires /scratch/.../trfFa.0730/ to be in place,
    so this step should be run after "PREPARING SEQUENCE FOR CROSS SPECIES
    ALIGNMENTS" below.
      ssh kk
      cd ~/oo/bed/est
      pslSortAcc nohead contig /cluster/store3/tmp contig.psl
      cd ~/oo/bed/mrna
      pslSortAcc nohead contig /cluster/store3/tmp contig.psl
      ssh kkstore
      mkdir -p /scratch/hg/gs.13/build30/bed
      cp -r ~/oo/bed/est/contig /scratch/hg/gs.13/build30/bed/est
      cp -r ~/oo/bed/mrna/contig /scratch/hg/gs.13/build30/bed/mrna
    Ask admins to binrsync /scratch/hg/gs.13/build30/bed/* to the cluster.
      ssh kk
      foreach d (est mrna)
        mkdir -p ~/oo/bed/${d}OrientInfo/oi
        cd ~/oo/bed/${d}OrientInfo
        ls -1 /scratch/hg/gs.13/build30/bed/${d}/*.psl > psl.lst
        cp ~/hg11/bed/${d}OrientInfo/gsub .
      end
    Edit ~/oo/bed/${d}OrientInfo/gsub to point to the correct paths.
    For each of ~/oo/bed/{est,mrna}OrientInfo, cd there and do this:
      gensub2 psl.lst single gsub spec
      para create spec
      para try
      para check
      para push
    Check until done, or use 'para shove'.
    When the cluster run is done do:
      foreach d (est mrna)
        cd ~/oo/bed/${d}OrientInfo
        liftUp ${d}OrientInfo.bed ~/oo/jkStuff/liftAll.lft warn oi/*.tab
        hgLoadBed hg12 ${d}OrientInfo ${d}OrientInfo.bed \
          -sqlTable=$HOME/src/hg/lib/${d}OrientInfo.sql > /dev/null
      end

o - Create rnaCluster table (depends on {est,mrna}OrientInfo above):
      ssh hgwdev
      cd ~/oo
      mkdir -p ~/oo/bed/rnaCluster/chrom
      foreach i (? ??)
        cd $i
        foreach j (chr*.fa)
          set c = $j:r
          set f = ../bed/rnaCluster/chrom/$c.bed
          echo clusterRna hg12 /dev/null $f -chrom=$c
          clusterRna hg12 /dev/null $f -chrom=$c
        end
        cd ..
      end
      cd bed/rnaCluster
      hgLoadBed hg12 rnaCluster chrom/*.bed > /dev/null

PRODUCING KNOWN GENES (DONE for 130)

o - Get extra info from NCBI and produce refGene table as so:
      ssh hgwdev
      cd ~/oo/bed
      mkdir refSeq
      cd refSeq
      # Note: downloaded these to refSeq (refSeq dir perms)
      wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref
      wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/mim2loc

o - Similarly download refSeq proteins in fasta format to refSeq.pep -
    I believe this is hs.faa.  I have changed the name to hs.prot.fa.
    (A sketch of this download follows below.)

o - RefSeq should have already been aligned to the genome by the processes
    described under mRNA/EST alignments above.
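A plausible expansion of the refSeq protein download above - the FTP path
is an assumption based on the mRNA_Prot directory used earlier, and the
final location is inferred from the hgRefSeqMrna command below; none of
this is copied from the original run:

      mkdir -p /cluster/store1/mrna.130/refSeq/human
      cd /cluster/store1/mrna.130/refSeq/human
      wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/hs.faa.gz
      gunzip hs.faa.gz
      mv hs.faa hs.prot.fa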
o - Produce refGene, refPep, refMrna, and refLink tables as so:
      ssh hgwdev
      cd ~/oo/bed/refSeq
      ln -s /cluster/store1/mrna.130 mrna
      #
      # NOTE: If hgRefSeqMrna is losing refLink's protAcc field, I think
      # it's due to a format-change issue with .-suffixes in mrna acc's
      #
      hgRefSeqMrna hg12 mrna/refSeq.fa mrna/refSeq.ra all_refSeq.psl \
        loc2ref mrna/refSeq/human/hs.prot.fa mim2loc

o - Add RefSeq status info:
      hgRefSeqStatus hg12 loc2ref

o - Add Jackson labs info:
      cd ~/oo/bed
      mkdir jaxOrtholog
      cd jaxOrtholog
      wget ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt
      cp /cluster/store1/gs.12/build29/bed/jaxOrtholog/filter.awk .
      awk -f filter.awk *.rpt > jaxOrtholog.tab
    Drop (just in case), create and load the table like this:
      echo 'drop table jaxOrtholog;' | hgsql hg12
      hgsql hg12 < ~/src/hg/lib/jaxOrtholog.sql
      echo "load data local infile '"`pwd`"/jaxOrtholog.tab' into table jaxOrtholog;" \
        | hgsql hg12

REFFLAT and GENEBANDS

o - Create precomputed join of refFlat and refGene:
      echo 'CREATE TABLE refFlat (KEY geneName (geneName), KEY name (name),
        KEY chrom (chrom)) SELECT refLink.name as geneName, refGene.*
        FROM refLink,refGene WHERE refLink.mrnaAcc = refGene.name' \
        | hgsql hg12

o - Create precomputed geneBands table:
      ssh hgwdev
      hgGeneBands hg12 geneBands.txt
      hgsql hg12
      mysql> load data local infile 'geneBands.txt' into table geneBands;
      mysql> quit
      rm geneBands.txt

SIMPLE REPEAT TRACK (DONE)

o - Create cluster parasol job like so:
      ssh kk
      mkdir -p ~/oo/bed/simpleRepeat
      cd ~/oo/bed/simpleRepeat
      cp /cluster/store1/gs.12/build29/bed/simpleRepeat/gsub ./gsub
      mkdir trf
    Ask the admins to push /scratch/hg/gs.13/build30/ to the cluster.
      ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
      gensub2 genome.lst single gsub spec
      para create spec
      para try
      para check
      para push
      liftUp simpleRepeat.bed ~/oo/jkStuff/liftAll.lft warn trf/*.bed

o - Load this into the database as so:
      ssh hgwdev
      cd ~/oo/bed/simpleRepeat
      hgLoadBed hg12 simpleRepeat simpleRepeat.bed \
        -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql

PRODUCING GENSCAN PREDICTIONS (DONE 7/31/02)

      mkdir -p ~/oo/bed/genscan
      cd ~/oo/bed/genscan

o - Produce contig genscan.gtf, genscan.pep and genscanExtra.bed files
    like so:
    Put hard-masked contigs in
    /cluster/store1/gs.13/build30/bed/genscan/mContigs.
    (For hg12, the .masked files were not saved during repeat masking, so
    the contig (.fa) files in /cluster/store1/gs.13/build30/? and ?? were
    processed to convert all lower case bases into N, named *.fa.masked,
    and placed under genscan/mContigs.)
      mkdir -p ~/oo/bed/genscan/mContigs
      cd ~/oo/bed/genscan/mContigs
      foreach f (/cluster/store3/gs.13/build30/?/*/NT_??????.fa \
                 /cluster/store3/gs.13/build30/??/*/NT_??????.fa)
        set m = $f:t.masked
        tr 'abcdghkmnrstvwy' 'NNNNNNNNNNNNNNN' < $f > $m
      end

    Log into kkr1u00 (not kk!).  kkr1u00 is the driver node for the small
    cluster (kkr2u00-kkr8u00; genscan has problems running on the big
    cluster due to limited memory and swap space on each processing node).
      cd ~/oo/bed/genscan
    Make 3 subdirectories for genscan to put its output files in:
      mkdir -p gtf pep subopt
    Generate a list file, genome.list, of all the contigs *that do not have
    pure Ns* (due to heterochromatin, unsequencable stuff), which would
    cause genscan to run forever:
      rm -f genome.list
      touch genome.list
      foreach f ( `ls -1S ./mContigs/*.masked` )
        egrep '[ACGT]' $f > /dev/null
        if ($status == 0) echo $f >> genome.list
      end
    Create template file, gsub, for gensub2.  For example (3 lines file):
      #LOOP
      /cluster/home/fanhsu/bin/i386/gsBig {check in line+ $(path1)} {check out line gtf/$(root1).gtf} -trans={check out line pep/$(root1).pep} -subopt={check out line subopt/$(root1).bed} -exe=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/genscan -par=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/HumanIso.smat -tmp=/tmp -window=2400000
      #ENDLOOP

    Create a file containing a single line:
      echo single > single
    Generate job list file, jobList, for Parasol:
      gensub2 genome.list single gsub jobList
    First issue the following Parasol command:
      para create jobList
    Run the following command, which will try the first 10 jobs from
    jobList:
      para try
    Check whether those 10 jobs ran OK with:
      para check
    If they have problems, debug and fix your program, template file,
    commands, etc. and try again.  If they are OK, issue the following
    command, which asks Parasol to start all the remaining jobs (for hg12,
    there were 1396 jobs in total):
      para push
    Issue either of the following two commands to check the status of the
    cluster and your jobs, until they are done:
      parasol status
      para check
    If there were out-of-memory problems (run "para problems"), then re-run
    those jobs by hand but change the -window arg from 2400000 to 1200000.
    In gs.13/build30, this was the job for mContigs/NT_011519.fa.masked .

o - Convert these to chromosome level files as so:
      cd ~/oo/bed/genscan
      liftUp genscan.gtf ../../jkStuff/liftAll.lft warn gtf/NT*.gtf
      liftUp genscanSubopt.bed ../../jkStuff/liftAll.lft warn subopt/NT*.bed \
        > /dev/null
      cat pep/*.pep > genscan.pep

o - Load into the database as so:
      ssh hgwdev
      cd ~/oo/bed/genscan
      ldHgGene hg12 genscan genscan.gtf
      hgPepPred hg12 generic genscanPep genscan.pep
      hgLoadBed hg12 genscanSubopt genscanSubopt.bed > /dev/null

PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS (DONE 7/30/02)

Make sure that the NT*.fa files are lower-case repeat masked.  Do something
much like the simpleRepeat track, but only masking out stuff with a period
of 12 or less, as so:
      ssh kk
      mkdir -p ~/oo/bed/trfMask
      cd ~/oo/bed/trfMask
      cp /cluster/store1/gs.12/build29/bed/trfMask/gsub .
      mkdir trf
      ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
      gensub2 genome.lst single gsub spec
      para create spec
      para try
      para check
      para push

When that is done do:
      ssh kkstore
      mkdir /scratch/hg/gs.13/build30/trfFa.0730
      cd ~/oo
    NOTE: below is a tcsh script.
      foreach i (? ??)
        cd $i
        foreach j (NT*)
          maskOutFa $j/$j.fa ../bed/trfMask/trf/$j.bed -softAdd \
            /scratch/hg/gs.13/build30/trfFa.0730/$j.fa.trf
          echo done $i/$j
        end
        cd ..
      end
Then ask the admins to do a binrsync.

PREPARING POST-TRF CHROM-LEVEL MIXED NIBs for mouse blastz (DONE 11/6/02)

    # lift trfMask output to chrom-level... this is a pain because all
    # trf output was put in the same dir.  maybe next time around, we
    # can preserve chrom dir structure...
    ssh kkstore
    cd ~/oo
    foreach c (?{,?})
      if (-e $c/lift/ordered.lst) then
        set ntlist = ()
        foreach n (`cat $c/lift/ordered.lst`)
          set ntlist = ($ntlist bed/trfMask/trf/$n.bed)
        end
        liftUp $c/chr$c.trf.bed jkStuff/liftAll.lft warn $ntlist
      endif
      if (-e $c/lift/random.lst) then
        set ntlist = ()
        foreach n (`cat $c/lift/random.lst`)
          set ntlist = ($ntlist bed/trfMask/trf/$n.bed)
        end
        liftUp $c/chr${c}_random.trf.bed jkStuff/liftAll.lft warn $ntlist
      endif
    end
    # make trf-masked chrom-level .fa
    foreach c (?{,?})
      cd $c
      if (-e chr$c.trf.bed) then
        echo masking $c...
        cp chr$c.fa chr$c.trf.fa
        maskOutFa -softAdd chr$c.trf.fa chr$c.trf.bed chr$c.trf.fa
      endif
      if (-e chr${c}_random.trf.bed) then
        echo masking ${c}_random...
        cp chr${c}_random.fa chr${c}_random.trf.fa
        maskOutFa -softAdd chr${c}_random.trf.fa chr${c}_random.trf.bed \
          chr${c}_random.trf.fa
      endif
      cd ..
    end
    # make nib
    mkdir trfMixedNib
    foreach c (?{,?})
      if (-e $c/chr$c.trf.fa) then
        faToNib -softMask $c/chr$c.trf.fa trfMixedNib/chr$c.nib
      endif
      if (-e $c/chr${c}_random.trf.fa) then
        faToNib -softMask $c/chr${c}_random.trf.fa \
          trfMixedNib/chr${c}_random.nib
      endif
    end
    rm -rf /scratch/hg/gs.13/build30/chromTrfMixedNib
    cp -pR trfMixedNib /scratch/hg/gs.13/build30/chromTrfMixedNib

CREATE GOLDEN TRIANGLE (todo)

Make sure that the rnaCluster table is in place.  Then extract Affy
expression info into a form suitable for Eisen's clustering program with:
    cd ~/oo/bed
    mkdir triangle
    cd triangle
    eisenInput hg12 affyHg10.txt
Transfer this to Windows and do k-means clustering with k=200 with Cluster.
Transfer the results file back to ~/oo/bed/triangle/affyCluster_K_G200.kgg.
Then do:
    promoSeqFromCluster hg12 1000 affyCluster_K_G200.kgg kg200.unmasked
Then RepeatMask the .fa files in kg200.unmasked, and copy masked versions
to kg200.  Then cat kg200/*.fa > all1000.fa and set up a cluster Improbizer
run to do 100 controls for every real run on each - putting the output in
imp.200.1000.e.  When the Improbizer run is done, make a file summarizing
the runs as so:
    cd imp.200.1000.e
    motifSig ../imp.200.1000.e.iri ../kg200 motif control*
Get rid of insignificant motifs with:
    cd ..
    awk '{if ($2 > $3) print; }' imp.200.1000.e.iri > sig.200.1000.e.iri
Turn the rest into just dnaMotifs with:
    iriToDnaMotif sig.200.1000.e.iri motif.200.1000.e.txt
Extract all promoters with:
    featureBits hg12 rnaCluster:upstream:1000 -bed=upstream1000.bed \
      -fa=upstream1000.fa
Locate motifs on all promoters with:
    dnaMotifFind motif.200.1000.e.txt upstream1000.fa hits.200.1000.e.txt \
      -rc -markov=2
    liftPromoHits upstream1000.bed hits.200.1000.e.txt triangle.bed

CREATE STS/FISH/BACENDS/CYTOBANDS DIRECTORY STRUCTURE AND SETUP (DONE)

o - Create directory structure to hold information for these tracks:
      cd /projects/hg2/booch/psl/
    Change the Makefile parameters for OOVERS, GSVERS, PREVGS, PREVOO, then:
      make new

o - Update all Makefiles with the latest OOVERS and GSVERS, DATABASE, and
    locations of .fa files (see the sketch below).

o - Create accession_info file:
      make accession_info.rdb

UPDATE STS INFORMATION (DONE)

o - Download and unpack updated information from dbSTS:
    In a web browser, go to ftp://ftp.ncbi.nih.gov/repository/dbSTS/.
    Download dbSTS.sts, dbSTS.aliases, and dbSTS.FASTA.dailydump.Z to
    /projects/hg2/booch/psl/update.
    Unpack dbSTS.FASTA.dailydump.Z:
      gunzip dbSTS.FASTA.dailydump.Z

o - Create updated files:
      cd /projects/hg2/booch/psl/update
    Edit the Makefile to the latest sts.X version from PREV (currently
    sts.4), then:
      make update

o - Make a new directory for this info and move files there:
      ssh kks00
      mkdir /cluster/store1/sts.5
      cp all.STS.fa /cluster/store1/sts.5
      cp all.primers /cluster/store1/sts.5
      cp all.primers.fa /cluster/store1/sts.5

o - Copy new files to cluster:
      ssh kkstore
      cd /cluster/store1/sts.5
      cp /cluster/store1/sts.5/*.* /scratch/hg/STS
    Ask for propagation from the sysadmins.
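The booch Makefiles themselves are not reproduced in this doc; the
alignment sections below all begin with "update Makefile with latest
OOVERS and GSVERS".  A hypothetical sketch of the variables being edited -
the names come from the setup steps above, the values are inferred for
this build, and the real Makefiles may differ:

      # top of a booch alignment Makefile for this build (assumed)
      OOVERS   = build30   # current O+O assembly
      GSVERS   = gs.13     # current genome sequence freeze
      PREVOO   = build29   # previous assembly, for carrying files forward
      PREVGS   = gs.12
      DATABASE = hg12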
STS ALIGNMENTS (DONE)
(alignments done without RepeatMasking, so start ASAP!)

o - Create full sequence alignments:
      ssh kk
      cd /cluster/home/booch/sts
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList.scratch (if contig files propagated to nodes)
        - or -
      make jobList.disk (if contig files not propagated)
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make stsMarkers.psl

o - Copy files to final destination and remove originals:
      ssh kks00
      make copy.assembly
      make clean.assembly

o - Create primer alignments:
      ssh kk
      cd /cluster/home/booch/primers
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList.scratch (if contig files propagated to nodes)
        - or -
      make jobList.disk (if contig files not propagated)
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make primers.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

o - Create ePCR alignments:
      ssh kk
      cd /cluster/home/booch/epcr
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList.scratch (if contig files propagated to nodes)
        - or -
      make jobList.disk (if contig files not propagated)
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make primers.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

CREATE AND LOAD STS MARKERS TRACK (DONE)

o - Copy in current stsInfo2.bed and stsAlias.bed files:
      cd /projects/hg2/booch/psl/gs.13/build30
      cp ../update/stsInfo2.bed .
      cp ../update/stsAlias.bed .

o - Create final version of sts sequence placements:
      ssh kks00
      cd /projects/hg2/booch/psl/gs.13/build30/sts
      make stsMarkers.final

o - Create final version of primers placements:
      cd /projects/hg2/booch/psl/gs.13/build30/primers
      cp /cluster/store1/sts.5/all.primers .
      make primers.final

o - Create bed file:
      cd /projects/hg2/booch/psl/gs.13/build30
      make stsMap.bed

o - Create database tables:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_sts_primer.sql
      mysql -uhgcat -pXXXXXXX < all_sts_seq.sql
      mysql -uhgcat -pXXXXXXX < stsAlias.sql
      mysql -uhgcat -pXXXXXXX < stsInfo2.sql
      mysql -uhgcat -pXXXXXXX < stsMap.sql

o - Load the tables:
    load /projects/hg2/booch/psl/gs.13/build30/sts/stsMarkers.psl.filter.lifted
      into all_sts_seq
    load /projects/hg2/booch/psl/gs.13/build30/primers/primers.psl.filter.lifted
      into all_sts_primer
    load /projects/hg2/booch/psl/gs.13/build30/stsAlias.bed into stsAlias
    load /projects/hg2/booch/psl/gs.13/build30/stsInfo2.bed into stsInfo2
    load /projects/hg2/booch/psl/gs.13/build30/stsMap.bed into stsMap

o - Load the sequences (change sts.# to match correct location):
      hgLoadRna addSeq hg12 /cluster/store1/sts.5/all.STS.fa
      hgLoadRna addSeq hg12 /cluster/store1/sts.5/all.primers.fa

BACEND SEQUENCE ALIGNMENTS (DONE)
(alignments done without RepeatMasking, so start ASAP!)

o - Create full sequence alignments:
      ssh kk
      cd /cluster/home/booch/bacends
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make bacEnds.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

BACEND PAIRS TRACK (DONE)

o - Update Makefile with location of pairs files, if necessary:
      cd /projects/hg2/booch/psl/gs.13/build30/bacends
      edit Makefile (PAIRS=....)
o - Create bed file:
      ssh kks00
      cd /projects/hg2/booch/psl/gs.13/build30/bacends
      make bacEndPairs.bed

o - Create database tables:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_bacends.sql
      mysql -uhgcat -pXXXXXXX < bacEndPairs.sql

o - Load the tables:
    load /projects/hg2/booch/psl/gs.13/build30/bacends/bacEnds.psl.filter.lifted
      into all_bacends
    load /projects/hg2/booch/psl/gs.13/build30/bacends/bacEndPairs.bed
      into bacEndPairs

o - Load the sequences (change bacends.# to match correct location):
      hgLoadRna addSeq hg12 /cluster/store1/bacends.2/BACends.fa

FOSEND SEQUENCE ALIGNMENTS (DONE)

o - Create full sequence alignments:
      ssh kk
      cd /cluster/home/booch/fosends
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make fosEnds.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

FOSEND PAIRS TRACK (DONE)

o - Update Makefile with location of pairs files, if necessary:
      cd /projects/hg2/booch/psl/gs.13/build30/fosends

o - Create bed file:
      ssh kks00
      cd /projects/hg2/booch/psl/gs.13/build30/fosends
      make fosEndPairs.bed

o - Create database tables:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_fosends.sql
      mysql -uhgcat -pXXXXXXX < fosEndPairs.sql

o - Load the tables:
    load /projects/hg2/booch/psl/gs.13/build30/fosends/fosEnds.psl.filter.lifted
      into all_fosends
    load /projects/hg2/booch/psl/gs.13/build30/fosends/fosEndPairs.bed
      into fosEndPairs

o - Load the sequences (change fosends.# to match correct location):
      hgLoadRna addSeq hg12 /cluster/store1/fosends.1/fosEnds.fa

UPDATE FISH CLONES INFORMATION (DONE)

o - Download the latest info from NCBI:
    Point a browser at
    http://www.ncbi.nlm.nih.gov/genome/cyto/cytobac.cgi?CHR=all&VERBOSE=ctg
    Change "Show details on sequence-tag" to "yes".
    Change "Download or Display" to "Download table for UNIX".
    Press Submit - save as
    /projects/hg2/booch/psl/fish/hbrc/hbrc.YYYYMMDD.table

o - Format the file just downloaded:
      cd /projects/hg2/booch/psl/fish/
      make HBRC

o - Copy it to the new freeze location:
      cp /projects/hg2/booch/psl/fish/all.fish.format \
        /projects/hg2/booch/psl/gs.13/build30/fish/

CREATE AND LOAD FISH CLONES TRACK (DONE)
(must be done after STS markers track and BAC end pairs track)

o - Extract the file with clone positions from the database:
      ssh hgwdev
      mysql -uhgcat -pXXXXXXXX hg12
      mysql> select * into outfile "/tmp/booch/clonePos.txt" from clonePos;
      mysql> quit
      mv /tmp/booch/clonePos.txt /projects/hg2/booch/psl/gs.13/build30/fish

o - Create bed file:
      cd /projects/hg2/booch/psl/gs.13/build30/fish
      make bed

o - Create database table:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < fishClones.sql

o - Load the table:
    load /projects/hg2/booch/psl/gs.13/build30/fish/fishClones.bed
      into fishClones

CREATE AND LOAD CHROMOSOME BANDS TRACK (DONE)
(must be done after FISH Clones track)

o - Create bed file:
      ssh hgwdev
      make setBands.txt
      make cytobands.pct.ranges
      make predict

o - Create database table:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < cytoBand.sql

o - Load the table:
    load /projects/hg2/booch/psl/gs.13/build30/cytobands/cytobands.bed
      into cytoBand

CREATE CHROMOSOME REPORTS (DONE)

CREATE STS MAP COMPARISON PLOTS (DONE)

HUMAN/MOUSE BLAT ALIGNMENTS (DONE 8/22/02)

    # Process the trfFa files (contigs, lower-case repeat and tandem-repeat
    # masked) into about 500 files containing records of <= 20kb each.
    # First, make .unlft (unlifted) versions of all mouse contig .agp's:
    ssh kkstore
    cd ~/mm2
    foreach ctgAgp (?{,?}/chr*/chr?{,?}_?{,?}.agp)
      ~/kent/src/hg/splitFaIntoContigs/deLiftAgp.pl jkStuff/liftAll.lft \
        $ctgAgp > $ctgAgp.unlft
    end
    # Now use the unlifted contig .agp's to further split the
    # (super-)contigs into smaller "sub"-contigs (still at contig
    # boundaries):
    foreach ctgAgp (?{,?}/chr*/chr?{,?}_?{,?}.agp.unlft)
      set ctg=$ctgAgp:t:r:r
      splitFaIntoContigs $ctgAgp trfFa/$ctg.fa.trf trfFaSplit -nSize=15000
    end
    # Create a lift file for all sub-contigs.
    cat trfFaSplit/*/lift/ordered.lft > trfFaSplit/allSubContigs.lft
    # Since splitFaIntoContigs enforces a min/approximate size and we need
    # to enforce a max size, use faSplit on sub-contigs.  Build up a list
    # file naming all the split sub-contigs.
    set splitSubDir=trfFaSplit/splitSubs
    mkdir -p $splitSubDir
    rm -f $splitSubDir/splitSubs.lst
    touch $splitSubDir/splitSubs.lst
    foreach ctgDir (trfFaSplit/?{,?}_?{,?})
      foreach subCtgFa ($ctgDir/chr*/chr*.fa)
        set subCtg=$subCtgFa:t:r
        faSplit size $subCtgFa 20000 $splitSubDir/${subCtg}_ \
          -lift=$splitSubDir/$subCtg.lft -maxN=20000
        foreach ss ($splitSubDir/${subCtg}_*)
          echo $ss >> $splitSubDir/splitSubs.lst
        end
      end
    end
    # Create a lift file for all split sub-contigs.
    # or not -- too many files for cat.  Create per-chunk lift files below.
    # Divide the list of split sub-contigs into ~500 chunks:
    splitFile $splitSubDir/splitSubs.lst 350 $splitSubDir/splitSubs_
    # cat the split-sub-contig .fa's into multi-record chunk .fa's
    # for para job generation.  Make a lift file for each chunk.
    set chunkDir = trfFaSplit/chunks
    mkdir -p $chunkDir
    rm -f /tmp/makeLft.log
    touch /tmp/makeLft.log
    foreach chunkLst ($splitSubDir/splitSubs_*)
      set chunkNum=`echo $chunkLst | sed -e 's/.*_//g'`
      set chunkLft = $chunkDir/chunk_$chunkNum.lft
      rm -f $chunkLft
      touch $chunkLft
      set lastSubCtg = ""
      foreach splitSubFa (`cat $chunkLst`)
        set splitSub = $splitSubFa:r
        set subCtg = `echo $splitSub | perl -wpe 's/(chr\w+_\d+_\d+)_\d+/$1/'`
        if ("$subCtg" != "$lastSubCtg") then
          echo "subCtg changed from $lastSubCtg to $subCtg; catting $subCtg.lft onto $chunkLft" \
            >> /tmp/makeLft.log
          cat $subCtg.lft >> $chunkLft
          set lastSubCtg = $subCtg
        endif
      end
      cat `cat $chunkLst` > $chunkDir/chunk_$chunkNum.fa
    end
    # Put those files on cluster nodes' /scratch:
    mkdir /scratch/hg/mm2/splitContigChunks
    cp -p $chunkDir/chunk_* /scratch/hg/mm2/splitContigChunks
    # Ask sysadmins for an updateLocal/binrsync

    # Now we're ready to set up the cluster run!
    mkdir -p ~/oo/bed/blatMus
    cd ~/oo/bed/blatMus
    # Use the mouse multi-record chunks created above:
    ls -1 /scratch/hg/mm2/splitContigChunks/chunk_*.fa > mouse.lst
    # Then bundle up the human into pieces of less than 12 meg mostly:
    rm -f bigH smallH
    foreach f (/scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf)
      set size = `ls -l $f | awk '{print $5;}'`
      if ($size < 13000000) then
        echo $f >> smallH
      else
        echo $f >> bigH
      endif
    end
    mkdir hs
    cd hs
    splitFile ../bigH 1 big
    # Note, bigXX is just an empty file that the splitFile program
    # erroneously created.  Remove the last one:
    rm bigXX
    splitFile ../smallH 4 small
    rm smallXXX
    cd ..
    ls -1 hs/* > human.lst
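The gsub template and job-submission commands for this run were garbled in
this copy; only a fragment survives below.  By analogy with the other blat
cluster runs in this doc, the template was presumably 3 lines pairing each
human piece with a mouse chunk - the blat flags and the gensub2 argument
order here are assumptions, not the original settings:

      #LOOP
      /cluster/bin/i386/blat -q=dna -t=dna -mask=lower -qMask=lower {check in line+ $(path1)} {check in line+ $(path2)} {check out line+ psl/$(root1)_$(root2).psl}
      #ENDLOOP

followed by the usual cluster run:

      gensub2 human.lst mouse.lst gsub spec
      para create spec
      para try, para check, para push, para check, ...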
    cat > gsub <>& /tmp/lft.log
    end

    # Then the kinds of lifting we can do all at once:
    # mouse sub-contigs to mouse contigs,
    # mouse contigs to mouse chrom,
    # human contigs to human chrom:
    pslCat -dir -check pslLft \
      | liftUp -type=.psl -pslQ stdout ~/mm2/trfFaSplit/allSubContigs.lft \
          warn stdin \
      | liftUp -type=.psl -pslQ stdout ~/mm2/jkStuff/liftAll.lft warn stdin \
      | liftUp -type=.psl stdout ../../jkStuff/liftAll.lft warn stdin \
      | pslSortAcc nohead chromPile /cluster/store2/temp stdin

    # Get rid of big pile-ups due to contamination as so:
    mkdir chrom
    cd chromPile
    foreach i (*.psl)
      echo $i
      pslUnpile -maxPile=250 $i ../chrom/${i:r}_blatMus.psl
    end

    # Load into database:
    ssh hgwdev
    cd ~/oo/bed/blatMus/chrom
    hgLoadPsl hg12 *.psl

PRODUCING CROSS_SPECIES mRNA ALIGNMENTS (DONE 8/7/02)

Here you align vertebrate mRNAs against the masked genome on the cluster
you set up during the previous step.

o - Make sure that gbpri, gbmam, gbrod, and gbvert are downloaded from
    GenBank into /cluster/store1/genbank.130 (in the GETTING FRESH mRNA AND
    EST SEQUENCE FROM GENBANK step).

o - Process these out of the genbank flat files as so:
      ssh kkstore
      cd /cluster/store1/mrna.130
      faSplit sequence xenoRna.fa 2 xenoRna
      mkdir -p /scratch/hg/mrna.130
      cp /cluster/store1/mrna.130/xenoRna*.* /scratch/hg/mrna.130
    Request a binrsync of /scratch/hg/mrna.130 from the admins.

    Set up the cluster run.  First make sure the genome is in
    kkstore:/scratch/hg/gs.13/build30/trfFa.0730 in RepeatMasked + trf form
    (this is probably done already in the mouse alignment stage).  Also
    make sure /scratch/hg/mrna.130 is loaded with xenoRna.fa.  Then do:
      ssh kkstore
      mkdir -p ~/oo/bed/xenoMrna
      cd ~/oo/bed/xenoMrna
      mkdir psl
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      ls -1S /scratch/hg/mrna.130/xenoRna?*.fa > mrna.lst
      cp ~/hg11/bed/xenoMrna/gsub .
      gensub2 human.lst mrna.lst gsub spec
      para create spec
      ssh kk
      cd ~/oo/bed/xenoMrna
      para try
      para check
      para push
    Do para check until the run is done, doing para push if necessary on
    occasion.

    Sort xeno mRNA alignments as so:
      ssh kkstore
      cd ~/oo/bed/xenoMrna
      pslSort dirs raw.psl /cluster/store2/temp psl
      pslReps raw.psl cooked.psl /dev/null -minAli=0.25
      liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
      pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
      pslCat -dir chrom > xenoMrna.psl
      rm -r chrom raw.psl cooked.psl chrom.psl

    Load into database as so:
      ssh hgwdev
      cd ~/oo/bed/xenoMrna
      hgLoadPsl hg12 xenoMrna.psl -tNameIx
      cd /cluster/store1/mrna.130
      hgLoadRna add hg12 xenoRna.fa xenoRna.ra

    Similarly do xenoEst alignments.  Prepare the est data:
      cd /cluster/store1/mrna.130
      faSplit sequence xenoEst.fa 16 xenoEst
      ssh kkstore
      cd /cluster/store1/gs.13/build30/bed
      mkdir xenoEst
      cd xenoEst
      mkdir psl
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      cp /cluster/store1/mrna.130/xenoEst?*.fa /scratch/hg/mrna.130
      ls -1S /scratch/hg/mrna.130/xenoEst?*.fa > mrna.lst
      cp ~/hg11/bed/xenoEst/gsub .
    Request a binrsync from the admins of kkstore's /scratch/hg/mrna.130.
    When done, do:
      gensub2 human.lst mrna.lst gsub spec
      para create spec
      para push

    Sort xenoEst alignments:
      ssh kkstore
      cd ~/oo/bed/xenoEst
      pslSort dirs raw.psl /cluster/store2/temp psl
      pslReps raw.psl cooked.psl /dev/null -minAli=0.10
      liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
      pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
      pslCat -dir chrom > xenoEst.psl
      rm -r chrom raw.psl cooked.psl chrom.psl

    Load into database as so:
      ssh hgwdev
      cd ~/oo/bed/xenoEst
      hgLoadPsl hg12 xenoEst.psl -tNameIx
      cd /cluster/store1/mrna.130
      hgLoadRna add hg12 xenoEst.fa xenoEst.ra

PRODUCING FISH ALIGNMENTS (DONE 08/05/02)

o - Do fish/human alignments:
      ssh kk
      cd ~/oo/bed
      mkdir blatFish
      cd blatFish
      mkdir psl
      ls -1S /scratch/hg/fish/*.fa > fish.lst
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      # Copy over gsub from previous version.
      gensub2 human.lst fish.lst gsub spec
      para create spec
      para try
    Make sure the jobs are going OK with para check.  Then:
      para push
    Wait about 2 hours and do another para push; do para checks and, if
    necessary, para pushes until done - or use para shove.

o - Sort alignments as so:
      pslCat -dir psl | \
        liftUp -type=.psl stdout ~/oo/jkStuff/liftAll.lft warn stdin | \
        pslSortAcc nohead chrom temp stdin

o - Rename to correspond with tables as so and load into database:
      ssh hgwdev
      cd ~/oo/bed/blatFish/chrom
      rm -f chr*_blatFish.psl
      foreach i (*.psl)
        set r = $i:r
        mv $i ${r}_blatFish.psl
      end
      hgLoadPsl hg12 *.psl
    Now load the fish sequence data:
      hgLoadRna addSeq hg12 /cluster/store3/tetFish/tet*.fa

PRODUCING FUGU ALIGNMENTS (DONE 12/09/02)

o - Do fugu/human alignments:
      ssh kk
      cd ~/oo/bed
      mkdir blatFugu
      cd blatFugu
      mkdir psl
      foreach f (~/hg12/?{,?}/NT_??????/NT_??????.fa)
        set c=$f:t
        mkdir -p psl/$c
      end
      ls -1S /scratch/hg/fugu/split500/*.fa > fugu.lst
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      # Copy over gsub from previous version.
      gensub2 human.lst fugu.lst gsub spec
      para create spec
      para try
    Make sure the jobs are going OK with para check.  Then:
      para push -maxJobs=10000
    Wait about 2 hours and do another para push; do para checks and, if
    necessary, para pushes until done - or use para shove.
o - Sort alignments as so:
      ssh eieio
      cd ~/oo/bed/blatFugu
      pslCat -dir psl/NT_??????.fa | \
        liftUp -type=.psl stdout ~/oo/jkStuff/liftAll.lft warn stdin | \
        pslSortAcc nohead chrom temp stdin

o - Rename to correspond with tables as so and load into database:
      ssh hgwdev
      cd ~/oo/bed/blatFugu/chrom
      rm -f chr*_blatFugu.psl
      foreach i (chr?{,?}{,_random}.psl)
        set r = $i:r
        mv $i ${r}_blatFugu.psl
      end
      hgLoadPsl hg12 *.psl
    Make fugu symlink:
      cd /gbdb/hg12
      mkdir fugu
      cd fugu
      ln -s /cluster/store3/fuguSeq/fugu_v3_mask.fasta \
        /gbdb/hg12/fugu/fugu_v3_mask.fasta
    Now load the Fugu sequence data:
      hgLoadRna addSeq hg12 /gbdb/hg12/fugu/fugu_v3_mask.fasta

TIGR GENE INDEX (DONE 12/18/02)

    mkdir -p ~/hg12/bed/tigr
    cd ~/hg12/bed/tigr
    wget ftp://ftp.tigr.org/private/HGI_ren/TGI_track_HumanGenome_build30.tgz
    gunzip -c TGI*.tgz | tar xvf -
    foreach f (*cattle*)
      set f1 = `echo $f | sed -e 's/cattle/cow/g'`
      mv $f $f1
    end
    foreach o (mouse cow human pig rat)
      setenv O $o
      foreach f (chr*_$o*s)
        tail +2 $f | perl -wpe 's/THC/TC/; s/(TH?C\d+)/$ENV{O}_$1/;' \
          > $f.gff
      end
    end
    ldHgGene -exon=TC hg12 tigrGeneIndex *.gff

LOAD STS MAP (todo)
TODO BY TERRY I BELIEVE - HE WILL UPDATE THIS

    - login to hgwdev
      cd ~/oo/bed
      hg12 < ~/src/hg/lib/stsMap.sql
      mkdir stsMap
      cd stsMap
      bedSort /projects/cc/hg/mapplots/data/tracks/build30/stsMap.bed \
        stsMap.bed
    - Enter the database with the "hg12" command.
    - At the mysql> prompt type in:
        load data local infile 'stsMap.bed' into table stsMap;
    - At the mysql> prompt type quit.

LOAD CHROMOSOME BANDS (todo)
ALSO TODO BY TERRY I BELIEVE

    - login to hgwdev
      cd /cluster/store1/gs.13/build30/bed
      mkdir cytoBands
      cp /projects/cc/hg/mapplots/data/tracks/oo.29/cytobands.bed cytoBands
      cd cytoBands
      hg12 < ~/src/hg/lib/cytoBand.sql
    - Enter the database with the "hg12" command.
    - At the mysql> prompt type in:
        load data local infile 'cytobands.bed' into table cytoBand;
    - At the mysql> prompt type quit.

LOAD MOUSEREF TRACK (todo)

First copy in data from kkstore to ~/oo/bed/mouseRef.  Then substitute
'genome' for the appropriate chromosome in each of the alignment files
(see the sketch below).  Finally do:
    hgRefAlign webb hg12 mouseRef *.alignments
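A minimal sketch of the 'genome' substitution described above, assuming
one alignment file per chromosome named like chr1.alignments (the file
naming is a guess; the real files came from kkstore):

      foreach f (chr*.alignments)
        # replace the literal word 'genome' with this file's chrom name
        set c = $f:r
        sed -e "s/genome/$c/g" $f > $f.tmp
        mv $f.tmp $f
      end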
LOAD AVID MOUSE TRACK (todo)

    ssh cc98
    cd ~/oo/bed
    mkdir avidMouse
    cd avidMouse
    wget http://pipeline.lbl.gov/tableCS-LBNL.txt
    hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
    hgLoadBed hg12 avidRepeat avidRepeat.bed
    hgLoadBed hg12 avidUnique avidUnique.bed

LOAD SNPS (Done; Daryl Thomas July 29, 2002)

    ssh hgwdev
    cd ~/oo/bed
    mkdir snp
    cd snp
    - Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.ncbi.b29.gz
    - Unpack.
      ln -s ../../seq_contig.md .
      calcFlipSnpPos seq_contig.md gp.ncbi.b30 gp.ncbi.b30.flipped
      mv gp.ncbi.b30 gp.ncbi.b30.original
      gzip gp.ncbi.b30.original
      grep RANDOM gp.ncbi.b30.flipped > snpTsc.txt
      grep MIXED gp.ncbi.b30.flipped >> snpTsc.txt
      grep BAC_OVERLAP gp.ncbi.b30.flipped > snpNih.txt
      grep OTHER gp.ncbi.b30.flipped >> snpNih.txt
      awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
      awk -f filter.awk snpNih.txt > snpNih.contig.bed
      liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
      liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
      hgLoadBed hg12 snpTsc snpTsc.bed
      hgLoadBed hg12 snpNih snpNih.bed
    - gzip all of the big files

LOAD CPGISLANDS (DONE 8/9/02)

    ssh kkstore
    mkdir -p ~/oo/bed/cpgIsland
    cd ~/oo/bed/cpgIsland
    - Build the software emailed from Asif Chinwalla
      (achinwal@watson.wustl.edu): copy the tar file to the current
      directory, then:
        tar xvf cpg_dist.tar
        cd cpg_dist
        gcc readseq.c cpg_lh.c -o cpglh.exe
        cd ..
    - cpglh.exe requires hard-masked (N) .fa's.
    - Execute the following loop in tcsh to hard-mask chr*.fa and run
      cpglh:
      foreach c (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 \
                 X Y Un M)
        if (-e ../../$c/chr$c.fa) then
          tr '[a-z]' 'N' < ../../$c/chr$c.fa | sed -e 's/^>NNN/>chr/' \
            > chr$c.fa
          echo masked chr$c.
          ./cpg_dist/cpglh.exe chr$c.fa > chr$c.fa.cpg
          echo Done with chr$c.
        else
          echo ../../$c/chr$c.fa does not exist.
        endif
        if (-e ../../$c/chr${c}_random.fa) then
          tr '[a-z]' 'N' < ../../$c/chr${c}_random.fa | \
            sed -e 's/^>NNN/>chr/' > chr${c}_random.fa
          echo masked chr${c}_random.
          ./cpg_dist/cpglh.exe chr${c}_random.fa > chr${c}_random.fa.cpg
          echo Done with chr${c}_random.
        endif
      end
      rm chr*.fa
    - Copy filter.awk from a previous release:
      cp ~/hg11/bed/cpgIsland/filter.awk .
      awk -f filter.awk chr*.cpg > cpgIsland.bed
      ssh hgwdev
      cd ~/oo/bed/cpgIsland
      hgLoadBed hg12 cpgIsland -tab -noBin \
        -sqlTable=$HOME/kent/src/hg/lib/cpgIsland.sql cpgIsland.bed

LOAD ENSEMBL GENES (Done by Matt 9/20/02)

    cd ~/oo/bed
    mkdir ensembl
    cd ensembl
    Get the ensembl gene data as below:
      GET http://www.ebi.ac.uk/~stabenau/human_8_30.gtf.gz > ensGene.gz
    (The above may only be a temporary location.)
    Get the ensembl protein data from
    http://www.ensembl.org/Homo_sapiens/martview
    Add "chr" to the front of each line in the gene data gtf file to make
    it compatible with ldHgGene:
      ~matt/bin/addchr.pl ensGene.gtf ensembl.gtf
      ldHgGene hg12 ensGene ensembl.gtf

o - Load Ensembl peptides:
    Get them from ensembl as above in the gene section.
    Substitute ENST for ENSP in ensPep with the program called subs;
    edit subs.in to read: ENSP|ENST
      subs -e ensPep.fa > /dev/null
    Run fixPep.pl:
      fixPep.pl ensPep.fa ensembl.pep
      hgPepPred hg12 generic ensPep ensembl.pep

LOAD SANGER 22 Pseudogenes (from hg10 - Done by Robert)

    cd ~/hg12/bed/sanger22
    cp ~/hg10/bed/sanger22/Chr22.3.lx.pseudogene.gff .
    Replace ^chr22 with hg10:chr22 in Chr22.3.lx.pseudogene.gff.
    liftUp -type=.gff pseudo.gff hg12.lft Chr22.3.lx.pseudogene.gff
    ldHgGene hg12 sanger22pseudo pseudo.gff

LOAD SANGER22 GENES (DONE 9/27/02 by MATT)

    cd ~/oo/bed
    mkdir sanger22
    cd sanger22
    (not sure where these files were downloaded from)
    grep -v Pseudogene Chr22*.genes.gff | hgSanger22 hg12 stdin \
      Chr22*.cds.gff *.genes.dna *.cds.pep 0 | \
      ldHgGene hg12 sanger22pseudo stdin
    Note: this creates sanger22extras, but doesn't currently create a
    correct sanger22 table, which is replaced in the next steps:
    sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
      | ldHgGene hg12 sanger22 stdin
    sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
      | ldHgGene hg12 sanger22pseudo stdin
    hgPepPred hg12 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)

    # First download files from James Gilbert's email to ~/oo/bed/sanger20
    # and go to that directory while logged onto hgwdev.  Then:
    grep -v Pseudogene chr_20*.gtf | ldHgGene hg12 sanger20 stdin
    hgSanger20 hg12 *.gtf *.info

LOAD RNAGENES (DONE 9/10/02)

    ssh hgwdev
    mkdir -p ~/hg12/bed/rnaGene
    cd ~/hg12/bed/rnaGene
    wget ftp://ftp.genetics.wustl.edu/pub/eddy/pickup/ncRNA-chr7-20020621.gff.gz
    gunzip -c ncRNA-chr7-20020621.gff.gz \
      | grep -v "^#" \
      | perl -wpe 's/chrom(\d+)\.\w+\.fsa/chr$1/' \
      > chr7_ncrna.gff

    # NOTE: just for build30, chr7 NCBI coords differ slightly from
    # chr7 WUSTL coords.
    LaDeana Hillier's instructions for translation:
    #  ours                  ncbi build30
    #  ----                  -------------
    #  1->16379450         = 1->16379450
    #  nothing             = 16379451 to 16379650 (because ncbi bp 16379451
    #                        to 16379650 are identical to ncbi 16379651 to
    #                        16379850)
    #  16379451->157432593 = 16379651->157432793
    # The following fix should not be required for future builds!
    perl -we 'while (<>) { \
        @words = split("\t"); \
        if ($words[3] > 16379450) { $words[3] += 200; } \
        if ($words[4] > 16379450) { $words[4] += 200; } \
        print join("\t", @words); \
      } \
      ' chr7_ncrna.gff \
      > chr7_ncrna-fixed.gff
    echo 'drop table hgRnaGene;' | hgsql hg12
    hgsql hg12 < ~/kent/src/hg/lib/rnaGene.sql
    hgRnaGenes hg12 chr7_ncrna-fixed.gff

LOAD EXOFISH (todo)

    - login to hgwdev
    - cd /cluster/store1/gs.13/build30/bed
    - mkdir exoFish
    - cd exoFish
    - hg12 < ~kent/src/hg/lib/exoFish.sql
    - Put the email attachment from Olivier Jaillon
      (ojaaillon@genoscope.cns.fr) into
      /cluster/store1/gs.13/build30/bed/exoFish/all_maping_ecore
    - awk -f filter.awk all_maping_ecore > exoFish.bed
    - hgLoadBed hg12 exoFish exoFish.bed

LOAD MOUSE SYNTENY (DONE 8/22/02)

    ssh hgwdev
    mkdir -p ~/oo/bed/mouseSyn
    cd ~/oo/bed/mouseSyn
    # Saved Michael Kamal's email attachment:
    #   allDirectedSegmentsBySize300.txt
    # Process the .txt file (minus header) into a bed 6 + file:
    grep -v "^#" allDirectedSegmentsBySize300.txt \
      | awk '($6 > $5) {printf "%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\t%s\n", $4, $5-1, $6, $1, 999, $7, $2-1, $3, $8;} \
             ($5 > $6) {printf "%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\t%s\n", $4, $6-1, $5, $1, 999, $7, $2-1, $3, $8;}' \
      > mouseSynWhd.bed
    hgLoadBed -noBin -sqlTable=$HOME/kent/src/hg/lib/mouseSynWhd.sql \
      hg12 mouseSynWhd mouseSynWhd.bed

LOAD GENIE (todo)

    - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
    - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
    - ldHgGene hg12 genieAlt predChrom.gtf
    - cat */ctg*/ctg*.affymetrix.aa > pred.aa
    - hgPepPred hg12 genie pred.aa
    - hg12
      mysql> delete from genieAlt where name like 'RS.%';
      mysql> delete from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (DONE 9/12/02)

    - ln -s /cluster/store1/gs.13/build30/ ~/oo
    - cd ~/oo/bed
    - mkdir softberry
    - cd softberry
    - wget ftp://www.softberry.com/pub/sc_fgenesh_jun02/sb_fgenesh_jun02.tar.gz
      ldHgGene hg12 softberryGene chr*.gff
      hgPepPred hg12 softberry *.pro
      hgSoftberryHom hg12 *.pro

LOAD GENEID GENES (DONE 11/20/02)

    mkdir ~/oo/bed/geneid
    cd ~/oo/bed/geneid
    mkdir download
    cd download
    # download .gff and prot files for each chrom (and _random):
    foreach f (~/oo/?{,?}/*.fa.out)
      set c=$f:t:r:r
      wget http://www1.imim.es/genepredictions/H.sapiens/golden_path_20020628/geneid_v1.1/$c.gff
      wget http://www1.imim.es/genepredictions/H.sapiens/golden_path_20020628/geneid_v1.1/$c.prot
    end
    # This time around, their "gff" uses screwy keywords.  Clean up:
    # Also replace gi|17981852|ref|NC_001807.4| with chrM:
    foreach f (chr*.gff)
      perl -wpi.bak -e 's/(First|Terminal|Internal|Single)/CDS/' $f
    end
    foreach f (chr*.{gff,prot})
      perl -wpi.bak -e 's/gi\|17981852\|ref\|NC_001807.4\|/chrM/g' $f
    end
    cd ..
    ldHgGene hg12 geneid download/*.gff -exon=CDS
    hgPepPred hg12 generic geneidPep download/*.prot

LOAD ACEMBLY (DONE 9/10/02)

    mkdir -p ~/oo/bed/acembly
    cd ~/oo/bed/acembly
    - Get acembly*gene.gff from Jean and Danielle Thierry-Mieg:
    wget ftp://ftp.ncbi.nlm.nih.gov/repository/acedb/ncbi_30.human.genes/acembly.ncbi_30.genes.gff.tar.gz
    wget ftp://ftp.ncbi.nlm.nih.gov/repository/acedb/ncbi_30.human.genes/acembly.ncbi_30.genes.proteins.fasta.tar.gz
    gunzip -c acembly.ncbi_30.genes.gff.tar.gz | tar xvf -
    gunzip -c acembly.ncbi_30.genes.proteins.fasta.tar.gz | tar xvf -
    cd acembly.ncbi_30.genes.gff
    #- Strip out floating-contig features (lines with *|NT_?????? as the
    #  chr ID), and add the 'chr' prefix to all chr nums:
    foreach f (acemblygenes.*.gff)
      egrep -v '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' $f | \
        perl -wpe 's/^(\w)/chr$1/' > $f:r-fixed.gff
    end
    #- Save just the floating-contig features to different files for
    #- lifting, and lift up the floating-contig features to chr*_random
    #- coords:
    foreach c ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 \
                X Y Un M)
      egrep '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' \
        acemblygenes.$c.gff | \
        perl -wpe 's/^(\w+)\|(\w+)/$1\/$2/' > $c-random-ctg.gff
      if (-e ../../../$c/lift/random.lft) then
        liftUp $c-random-lifted.gff ../../../$c/lift/random.lft warn \
          $c-random-ctg.gff
      endif
    end
    cd ../acembly.ncbi_30.genes.proteins.fasta
    #- Remove G_t*_ prefixes from acemblyproteins.*.fasta:
    foreach f (acemblyproteins.*.fasta)
      perl -wpe 's/^\>G_t[\da-zA-Z]+_/\>/' $f > $f:r-fixed.fasta
    end
    #- Load into database as so:
    cd ..
    ldHgGene hg12 acembly acembly.ncbi_30.genes.gff/*-fixed.gff \
      acembly.ncbi_30.genes.gff/*-lifted.gff
    hgPepPred hg12 generic acemblyPep \
      acembly.ncbi_30.genes.proteins.fasta/*-fixed.fasta

LOAD GENOMIC DUPES (todo)

o - Load genomic dupes:
      ssh hgwdev
      cd ~/oo/bed
      mkdir genomicDups
      cd genomicDups
      wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
      unzip *.zip
      awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
      mysql -u hgcat -pbigSECRET hg12 < ~/src/hg/lib/genomicDups.sql
      hgLoadBed hg12 -oldTable genomicDups genomicDups.bed

LOAD NCI60 (Done: Chuck Sugnet 7/24/02)

o - # ssh hgwdev
    cd /projects/cc/hg/mapplots/data/NCI60/dross_arrays_nci60/
    mkdir hg12
    cd hg12
    findStanAlignments hg12 ../BC2.txt.ns \
      ../../image/cumulative_plates.011204.list.human hg12.image.psl \
      >& hg12.image.log
    cp ../experimentOrder.txt ./
    sed -e 's/ / \.\.\//g' < experimentOrder.txt > epo.txt
    stanToBedAndExpRecs hg12.image.good.psl hg12.nci60.exp hg12.nci60.bed \
      `cat epo.txt`
    hg12S -A < ../../scripts/nci60.sql
    echo "load data local infile 'hg12.nci60.bed' into table nci60" | \
      hg12S -A
    mkdir /cluster/store3/gs.13/build30/bed/nci60
    mv hg12.nci60.bed /cluster/store3/gs.13/build30/bed/nci60
    rm *.psl

LOAD AFFYRATIO [GNF] (Done: Chuck Sugnet 7/24/02)

o - # ssh hgwdev
    cd /cluster/store1/sugnet/
    mkdir gs.13
    mkdir gs.13/build30
    mkdir gs.13/build30/affyGnf
    cd gs.13/build30/affyGnf
    cp /projects/compbiodata/microarray/affyGnf/sequences/HG-U95Av2_target ./
    ls -1 /cluster/store3/gs.13/build30/trfFa.0730/ > allctg.lst
    echo "/cluster/store1/sugnet/gs.13/build30/affyGnf/HG-U95Av2_target" \
      > affy.lst
    echo '#LOOP\n/cluster/bin/i386/blat -mask=lower -minIdentity=95 -ooc=/cluster/store3/gs.13/build30/jkStuff/post.refCheck.old/11.ooc /cluster/store3/gs.13/build30/trfFa.0730/$(path1) $(path2) {check out line+ psl/$(root1)_$(root2).psl}\n#ENDLOOP' > template.sub
    gensub2 allctg.lst affy.lst template.sub para.spec
    # ssh kkr1u00
    para create para.spec
    para try
    para check
    para push
    # exit kkr1u00
    pslSort dirs hg12.affy.psl tmp psl >& pslSort.log
    liftUp hg12.affy.lifted.psl \
      /cluster/store3/gs.13/build30/jkStuff/liftAll.lft warn hg12.affy.psl
    pslAffySelect seqIdent=.95 basePct=.95 in=hg12.affy.lifted.psl \
      out=hg12.affy.pAffySelect.95.95.psl
    affyPslAndAtlasToBed hg12.affy.pAffySelect.95.95.psl \
      /projects/compbiodata/microarray/affyGnf/human_atlas_U95_gnf.noquotes.txt \
      affyRatio.bed affyRatio.exr >& affyPslAndAtlasToBed.log
    hg12S -A

CREATE SAGE TRACK

    perl -e 'while(<>){chomp($_);@p=split/\t/,$_; print "$p[2]\t$p[3]\t$p[0]\n"}' \
      < SAGEmap_tag_ug-rel_Hs | sort | sed -e 's/ /_/g' \
      > SAGEmap_ug_tag-rel_Hs
    cd -
    createSageSummary ../map/Hs/NlaIII/SAGEmap_ug_tag-rel_Hs \
      tagExpArrays.tab sageSummary.sage
    # Create the uniGene alignments:
    #   /cluster/store1/sugnet/gs.13/build30/uniGene/hg12.uniGene.lifted.pslReps.psl
    addAveMedScoreToPsls \
      /cluster/store1/sugnet/gs.13/build30/uniGene/hg12.uniGene.lifted.pslReps.psl \
      sageSummary.sage uniGene.pslWScores.psl
    /cluster/home/kent/bin/i386/hgLoadBed hg12 uniGene_2 uniGene.wscores.bed
    hg12S -A < ~/kk/jk/hg/lib/sage.sql
    echo "load data local infile 'sageSummary.sage' into table sage" | \
      hg12S -A
    cd ../info
    ../../scripts/parseRecords.pl ../extr/expList.tab > sageExp.tab
    hg12S -A < ~/kk/jk/hg/lib/sageExp.sql
    echo "load data local infile 'sageExp.tab' into table sageExp" | \
      hg12S -A
    Update kent/src/hg/makeDb/hgTrackDb/hgRoot/uniGene_2.html with the
    current uniGene date.

MAKE UNIGENE ALIGNMENTS (Done: Chuck Sugnet 7/26/02)

o - cd /projects/cc/hg/sugnet/uniGene
    ftp ftp.ncbi.nih.gov
      user: anonymous
      password: email
      cd repository/UniGene/
      prompt
      mget Hs.info Hs.seq.uniq.gz Hs.data.gz
      exit
    # Cut out the unigene build number and append it to uniGene:
    mkdir uniGene.`perl -e '$t=<>;$t=~/\#(\d+)/;print "$1\n";' < Hs.info`
    # new uniGene directory = uniGene.153
    mv Hs.* uniGene.153
    cd uniGene.153
    gunzip Hs.seq.uniq.gz
    gunzip Hs.data.gz
    ../countSeqsInCluster.pl Hs.data counts.tab
    ../parseUnigene.pl Hs.seq.uniq Hs.seq.uniq.simpleHeader.fa \
      leftoverData.tab
    # ssh kkstore
    cp /projects/cc/hg/sugnet/uniGene/uniGene.153/Hs.seq.uniq.simpleHeader.fa \
      /scratch/hg/gs.13/build30/uniGene
    # email cluster-admin to push /scratch/hg/gs.13/build30/uniGene to
    # the cluster
    cd /cluster/store1/sugnet/gs.13/build30
    mkdir uniGene
    cd uniGene
    ls -1 /cluster/store3/gs.13/build30/trfFa.0730/ > allctg.lst
    echo "/scratch/hg/gs.13/build30/uniGene/Hs.seq.uniq.simpleHeader.fa" \
      > uniGene.lst
    echo '#LOOP\n/cluster/bin/i386/blat -mask=lower -minIdentity=95 -ooc=/scratch/hg/h/11.ooc /cluster/store3/gs.13/build30/trfFa.0730/$(path1) $(path2) {check out line+ psl/$(root1)_$(root2).psl}\n#ENDLOOP' > template.sub
    gensub2 allctg.lst uniGene.lst template.sub para.spec
    # ssh kk
    para create para.spec
    mkdir psl
    para try
    para check
    para push
    pslSort dirs hg12.uniGene.psl tmp psl >& pslSort.log
    liftUp hg12.uniGene.lifted.psl \
      /cluster/store3/gs.13/build30/jkStuff/liftAll.lft carry hg12.uniGene.psl
    pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 \
      hg12.uniGene.lifted.psl hg12.uniGene.lifted.pslReps.psl /dev/null
    # exit kk and use hg12.uniGene.lifted.pslReps.psl for building the
    # SAGE track.

FAKING DATA FROM PREVIOUS VERSION
(This is just for use until the proper track arrives.  Rescues about 97%
of the data.  Just an experiment, not really followed through on.)
FAKING DATA FROM PREVIOUS VERSION
(This is just for use until the proper track arrives.  Rescues about 97% of
the data.  Just an experiment, not really followed through on.)

o - Rescuing STS track:
    - log onto hgwdev
    - mkdir ~/oo/rescue
    - cd !$
    - mkdir sts
    - cd sts
    - bedDown hg3 mapGenethon sts.fa sts.tab
    - echo ~/oo/sts.fa > fa.lst
    - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g
    - log onto cc01
    - cd ~/oo/rescue/sts
    - split all.con into 3 parts and condor_submit each part
    - wait for assembly to finish
    - cd psl
    - mkdir all
    - ln ?/*.psl ??/*.psl *.psl all
    - pslSort dirs raw.psl temp all
    - pslReps raw.psl contig.psl /dev/null
    - rm raw.psl
    - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
    - rm contig.psl
    - mv chrom.psl ../convert.psl

LOADING MOUSE MM2 BLASTZ ALIGNMENTS FROM PENN STATE: (DONE 10/30/02)
(DONE 8/18/02 with 08-15-ASH run)

    # create psl files for each per-contig lav file
    ssh kkstore
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base
    set tbl="blastzMm2"
    foreach chrdir ($base/lav/chr*)
      set chr=$chrdir:t
      set outdir=lav-psl/$chr
      mkdir -p $outdir
      foreach lav ($chrdir/*.lav)
        lavToPsl -target-strand=+ $lav $outdir/$lav:t:r.psl
      end
    end

    # Substitute scratch/... path with chrom name:
    foreach f (lav-psl/*/*.psl)
      perl -wpi -e 's@/?scratch/hg/gs.13/build30/[\w\.-_]+/+(chr.+)\.nib:\d+-\d+@$1@; s@/?scratch/hg/mm2/[\w\.-_]+/+(chr.+)\.nib:\d+-\d+@$1@; s@:___@@g;' $f
    end

    # Convert to per-chromosome files, sort, and add sequence
    # kkstore's /tmp might not have enough space; try kk, kkr1u00 etc.
    mkdir -p lav-xa
    foreach chrdir (lav-psl/*)
      set chr=$chrdir:t
      pslCat -check -nohead -ext=.psl -dir lav-psl/$chr \
        | sort -k 15n -k 16n -T /cluster/store2/temp \
        | liftUp -type=.psl -pslQ -nohead stdout \
            /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin \
        > lav-xa/${chr}_${tbl}.psl
    end

    # Load tables
    ssh hgwdev
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base/lav-xa
    hgLoadPsl hg12 *_${tbl}.psl

MAKING THE BLASTZBESTMOUSE TRACK FROM PENN STATE MM2 AXT FILES (DONE 9/3/02)

    # When Scott Schwartz is done generating .axt's for the blastz mm2
    # alignments (takes longer than the .lav used for blastzMm2 above):

    # Create tSizes (human chrom sizes) and qSizes (mouse) for axtToPsl.
    # In mysql:
      use hg12;
      select chrom,size from chromInfo;
      use mm2;
      select chrom,size from chromInfo;
    # Edit the results of the first select into a tab-separated tSizes; edit
    # the results of the second select into a tab-separated qSizes.
    # (a non-interactive hgsql alternative is sketched below, just before the
    #  table loads)

    # Consolidate AXT files to chrom level, sort, pick best, make psl.
    ssh kkstore
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base
    mkdir -p axtChrom axtBest pslBest
    foreach chrdir (lav/chr*)
      set chr=$chrdir:t
      gunzip -c $chrdir/*.axt.gz | \
        liftUp -type=.axt -axtQ axtChrom/$chr.lifted.axt \
          /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin
      axtBest axtChrom/$chr.lifted.axt $chr axtBest/$chr.axt -minScore=300
      axtToPsl axtBest/$chr.axt tSizes qSizes pslBest/${chr}_blastzBestMouse.psl
      echo Done with $chr.
    end

    # If a chromosome has so many alignments that axtBest runs out of mem,
    # run axtBest in 2 passes to reduce size of the input to final axtBest:
    foreach chrdir (lav/chr19)
      set chr=$chrdir:t
      foreach a ($chrdir/*.axt.gz)
        gunzip $a
        axtBest $a:r $chr $a:r:r.axtBest
        gzip $a:r $a:r:r.axtBest
      end
      gunzip -c $chrdir/*.axtBest.gz | \
        liftUp -type=.axt -axtQ axtChrom/$chr.lifted.axt \
          /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin
      axtBest axtChrom/$chr.lifted.axt $chr axtBest/$chr.axt
      axtToPsl axtBest/$chr.axt tSizes qSizes pslBest/${chr}_blastzBestMouse.psl
      echo Done with $chr.
    end
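    As mentioned above, the size files can also be produced non-interactively.
    A sketch, assuming hgsql passes the usual mysql -N (skip column names) and
    -e (execute statement) options through to mysql:
      hgsql -N -e 'select chrom,size from chromInfo' hg12 > tSizes
      hgsql -N -e 'select chrom,size from chromInfo' mm2 > qSizes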
    # Load tables
    ssh hgwdev
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base/pslBest
    hgLoadPsl hg12 *.psl

    # Make /gbdb links and add them to the axtInfo table:
    mkdir -p /gbdb/hg12/axtBestMm2
    cd /gbdb/hg12/axtBestMm2
    foreach f ($base/axtBest/chr*.axt)
      ln -s $f .
    end
    cd $base/axtBest
    rm -f axtInfoInserts.sql
    touch axtInfoInserts.sql
    foreach f (/gbdb/hg12/axtBestMm2/chr*.axt)
      set chr=$f:t:r
      echo "INSERT INTO axtInfo VALUES ('mm2','Blastz Best in Genome','$chr','$f');" \
        >> axtInfoInserts.sql
    end
    hgsql hg12 < ~/kent/src/hg/lib/axtInfo.sql
    hgsql hg12 < axtInfoInserts.sql

EXTRACTING DYNAMIC MASKING LOCATIONS FROM BLASTZ LAV (TODO)

    # Dynamic masking = splicing out of lineage-specific repeats during
    # first phase of blastz alignments
    ssh kkstore
    cd ~/oo/blastz-whatever-path
    set tbl=blastzDynMaskMm2
    # Dig ranges out of lavs; collapse overlapping ranges.
    rm -rf $tbl.bed
    touch $tbl.bed
    foreach c (lav/chr*)
      set chr=$c:t
      awk -v c=$chr '/ x/ {print c "\t" $2-1 "\t" $3}' $c/*.lav \
        | sort -n -k 2,2 | uniq \
        | perl -we 'while (<>) { \
            chomp; \
            @words = split(/\t/); \
            $chrom = $words[0]; \
            $start = $words[1]; \
            $end   = $words[2]; \
            if (! defined $prevChrom) { \
              $prevChrom = $chrom; \
              $rStart = $start; \
              $rEnd = $end; \
            } \
            if (($start > $rEnd) || ($chrom ne $prevChrom)) { \
              print "$prevChrom\t$rStart\t$rEnd\n"; \
              $rStart = $start; \
              $rEnd = $end; \
            } \
            elsif ($end > $rEnd) { $rEnd = $end; } \
            $prevChrom = $chrom; \
          } \
          print "$prevChrom\t$rStart\t$rEnd\n" \
            if (defined $prevChrom); \
          ' \
        >> $tbl.bed
    end

    # load table
    ssh hgwdev
    cd ~/oo/blastz-whatever-path
    set tbl=blastzDynMaskMm2
    hgLoadBed -noBin hg12 $tbl $tbl.bed

MAKING THE AXTTIGHT FROM AXTBEST (DONE 10/30/02)

    # After creating axtBest alignments above, use subsetAxt to get axtTight:
    ssh kkstore
    cd ~/oo/blastz.mm2.2002-08-24/axtBest
    mkdir -p ../axtTight
    foreach i (*.axt)
      subsetAxt $i ../axtTight/$i \
        ~kent/src/hg/mouseStuff/subsetAxt/coding.mat 3400
    end

    # translate to psl
    cd ../axtTight
    mkdir -p ../pslTight
    foreach i (*.axt)
      set c = $i:r
      axtToPsl $i ../tSizes ../qSizes ../pslTight/${c}_blastzTightMouse.psl
    end

    # Load tables into database
    ssh hgwdev
    cd ~/oo/blastz.mm2.2002-08-24/pslTight
    hgLoadPsl hg12 chr*_blastzTightMouse.psl

MITOCHONDRIAL DNA PSEUDO-CHROMOSOME - TODO

    Download the fasta file from
      http://www.gen.emory.edu/MITOMAP/mitomapRCRS.fasta
    Put it in /cluster/store1/mrna.130
    ssh hgwdev
    cd ~/oo
    mkdir M
    cp /cluster/store1/mrna.130/mitomapRCRS.fasta M/chrM.fa
    Edit jkStuff/makeNib.sh to make sure it also has the "M" directory in its
    file list.
    tcsh jkStuff/makeNib.sh
    hgNibSeq -preMadeNib hg12 /cluster/store1/gs.13/build30/nib ?/chr*.fa ??/chr*.fa

TWINSCAN GENE PREDICTIONS (DONE 8/22/02)

    mkdir -p ~/oo/bed/twinscan
    cd ~/oo/bed/twinscan
    wget http://genes.cs.wustl.edu/NCBI30/gtf/gtf.tgz
    wget http://genes.cs.wustl.edu/NCBI30/ptx/ptx.tgz
    gunzip -c gtf.tgz | tar xvf -
    gunzip -c ptx.tgz | tar xvf -
    ldHgGene hg12 twinscan chr*.gtf -exon=CDS
    - pare the protein FASTA headers down to just the id:
    foreach c (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y)
      perl -wpe 's/^\>.*\s+source_id\s*\=\s*(\S+).*$/\>$1/;' < \
        chr$c.ptx > chr$c-fixed.fa
    end
    hgPepPred hg12 generic twinscanPep chr*-fixed.fa

LOAD CHIMP DATA
o Load Ingo Ebersberger's chimp BLAT alignments
    cd ~/oo
    mkdir bed/chimpBlat
    cd bed/chimpBlat
    #!/bin/sh
    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
    do
      wget http://email.eva.mpg.de/~ebersber/custom_track_chimp/MPI-sg_jun02/chr${i}_gp_F25Jun02.psl
    done
    Remove the first line from each psl file to prepare them for pslCat, using
    the fixFile.sh shell script.
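    fixFile.sh itself is not reproduced in this doc; a minimal sketch of what
    it presumably does (drop the first line of each psl in the current
    directory, using the portable tail -n +2 idiom):
      #!/bin/sh
      # fixFile.sh (sketch): strip the one-line header from every psl
      # so pslCat can concatenate them cleanly
      for f in *.psl
      do
        tail -n +2 $f > $f.tmp && mv $f.tmp $f
      done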
    ./fixFile.sh
    pslCat *.psl > chimpBlat.psl
    hgLoadPsl hg12 chimpBlat.psl

o Load the chimp BAC data
    #!/bin/sh
    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
    do
      wget http://email.eva.mpg.de/~ebersber/custom_track_chimp/Riken-be_jun02/chr${i}_gp_F25Jun02.psl
    done
    Remove the first line from each psl file to prepare them for pslCat, using
    the fixFile.sh shell script.
    ./fixFile.sh
    pslCat *.psl > chimpBac.psl
    hgLoadPsl hg12 chimpBac.psl

MAKING THE DOWNLOADABLE DATABASE FILES - DONE

    ssh hgwdev
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002/chromosomes
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002/database

o Zip up the chromosomes individually
    ssh kkstore
    # (we use kkstore because no NFS traffic via kkstore = faster data transfer)
    cd ~/oo
    # In tcsh run this script:
    tcsh
    foreach i (?{,?}/chr*.fa)
      echo zip $i:r.zip $i
      zip $i:r.zip $i
    end
    Then do:
    ssh hgwdev
    cd ~/oo
    mv ?{,?}/chr*.zip /usr/local/apache/htdocs/goldenPath/28jun2002/chromosomes
    Request that the admins push this to hgwbeta.

o Make the big zips
    ssh kkstore
    cd ~/oo
    zip chromAgp.zip ?{,?}/chr*.agp
    zip chromFa.zip ?{,?}/chr*.fa
    zip chromOut.zip ?{,?}/chr*.out
    zip contigAgp.zip ?{,?}/NT_??????/NT_??????.agp
    zip contigFa.zip ?{,?}/NT_??????/NT_??????.fa
    zip contigOut.zip ?{,?}/NT_??????/NT_??????.fa.out
    zip liftAll.zip jkStuff/liftAll.lft
    zip mrna.zip /cluster/store1/mrna.130/mrna.fa

    ssh hgwdev
    cd ~/oo
    # Move all the zips to the web server dirs
    mv chromAgp.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv chromFa.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv chromOut.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv contigAgp.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv contigFa.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv contigOut.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv liftAll.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv mrna.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    Request that the admins push all this to hgwbeta.
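    Before requesting the push, an optional integrity check of the zips
    (a sketch; unzip -t tests archive CRCs without extracting).  In tcsh on
    hgwdev:
      cd /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
      foreach z (*.zip)
        unzip -t $z > /dev/null && echo OK: $z
      end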
o Dump the database - DON'T DO THIS; IT IS HERE FOR REFERENCE ONLY.  IT IS
  DONE AUTOMATICALLY BY A PAUL T. SCRIPT ON THE PRODUCTION MACHINES.
    ssh hgwbeta
    We dump the database on hgwbeta in order to dump only the most accurate
    database state.  There is one trick here: mysqldump runs as the mysql
    user, so the directory you dump into must be writable by that user.
    Here's what to do:
    cd /var/tmp
    mkdir hg12-dump
    chmod 777 hg12-dump  (since you aren't root this is quickest)
    cd hg12-dump
    mysqldump --user=hguser --password=hguserstuff --all --tab=. hg12
    Then that directory will quickly fill with .sql and .txt files.
    When it is done, do:
    cd /var/tmp/hg12-dump
    gzip *.txt
    mv * /usr/local/apache/htdocs/goldenPath/28jun2002/database

- Make database.zip
    ssh hgwbeta
    cd /usr/local/apache/htdocs/goldenPath/28jun2002/database
    zip ../bigZips/database.zip *

SGP GENE PREDICTIONS (DONE 01/31/03)

    mkdir -p ~/hg12/bed/sgp/download
    cd ~/hg12/bed/sgp/download
    foreach f (~/hg12/?{,?}/chr?{,?}{,_random}.fa)
      set chr = $f:t:r
      wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/$chr.gtf
      wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/$chr.prot
    end
    wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/chrUn.gtf -O chrUn_random.gtf
    wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/chrUn.prot -O chrUn_random.prot
    # Add missing .1 to protein id's
    foreach f (*.prot)
      perl -wpe 's/^(>chr\w+)$/$1.1/' $f > $f:r-fixed.prot
    end
    cd ..
    ldHgGene hg12 sgpGene download/*.gtf -exon=CDS
    hgPepPred hg12 generic sgpPep download/*-fixed.prot

ALIGNED ANCIENT REPEATS FROM MOUSE BLASTZ

    cd ~/oo/bed
    mkdir aarMm2
    cd aarMm2
    set mmdir=../../blastz.mm2.2002-08-01
    foreach aar ($mmdir/aar/*.aar.gz)
      zcat $aar | aarToAxt | axtToPsl stdin $mmdir/H.len $mmdir/M.len stdout \
        | liftUp -type=.psl -nohead -pslQ stdout $mmdir/liftAllMm2.lft warn stdin \
        > chr$aar:t:r:r_aarMm2.psl
    end
    hgLoadPsl hg12 *.psl

ALIGNMENT COUNTS FOR WIGGLE TRACK

    # this needs to be updated to reflect the full process.
    - Generate BED table of AARs used to select regions:
    cat ../bed/aarMm2/*.psl \
      | awk 'BEGIN{OFS="\t"} {print $14,$16,$17,"aar"}' > aarMm2.bed
    - Generate background counts with windows that contain 6kb of counts, with
      a maximum window size of 512kb, sliding the windows by 1kb:
    foreach axt (../../blastz.mm2.2002-08-01/axtBest/chr*.axt)
      set chr=$axt:t:r
      set tab=$chr.6kb-aar.cnts    (??? need better name ???)
      hgCountAlign -selectBed=aarMm2.bed -winSize=512000 -winSlide=1000 \
        -fixedNumCounts=6000 -countCoords $axt $tab
    end
    - Generate counts for AARs with 50b windows, slide by 5b:
    foreach axt (../../blastz.mm2.2002-08-01/axtBest/chr*.axt)
      set chr=$axt:t:r
      set tab=$chr.50b-aar.cnts    (??? need better name ???)
      hgCountAlign -selectBed=aarMm2.bed -winSize=50 -winSlide=5 $axt $tab
    end
    - Generate counts for all with 50b windows, slide by 5b:
    foreach axt (../../blastz.mm2.2002-08-01/axtBest/chr*.axt)
      set chr=$axt:t:r
      set tab=$chr.50b.cnts    (??? need better name ???)
      hgCountAlign -winSize=50 -winSlide=5 $axt $tab
    end

MAKING AND STORING mRNA AND EST ALIGNMENTS (DONE w/ mrna.130)

o - ssh to kkstore
    mkdir -p /cluster/store1/gs.13/build30/bed/refFull
    cd /cluster/store1/gs.13/build30/bed/refFull
    Download the sequence:
    wget ftp://blue3.ims.u-tokyo.ac.jp/pub/db/hgc/dbtss/ref-full.fa.gz
    mv ref-full.fa.gz dbtss.fa.gz
    Gunzip it and split the ref-full fasta into about 50 pieces:
    gunzip dbtss.fa.gz
    faSplit sequence dbtss.fa 50 splitDbtss
    mkdir /scratch/hg/dbtss
    cp splitDbtss* /scratch/hg/dbtss/
    ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
    ls -1S /scratch/hg/dbtss/split*.fa > dbtss.lst

o - Request the admins to do a binrsync to the cluster of /scratch/hg/dbtss

o - Use BLAT to generate dbtss alignments as so:
    Make sure that /scratch/hg/gs.13/build30/contig/ is loaded with NT_*.fa
    and pushed to the cluster nodes.
    ssh kk
    cd /cluster/store1/gs.13/build30/bed/dbtss
    mkdir -p psl
    # run mkdirs.sh script to create subdirs in the psl directory
    # in order to modularize the blat job.
    gensub2 genome.lst dbtss.lst gsub spec
    para create spec
    Now run a para try/push and para check in each one.
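    An optional filesystem-level completeness check after the run (a sketch;
    every line of spec is one job, and each job writes exactly one psl under a
    psl/ subdir, so the two counts below should match):
      wc -l < spec
      find psl -name '*.psl' | wc -l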
o - Process dbtss alignments into near best in genome.
    cd ~/oo/bed
    cd dbtss
    pslSort dirs raw.psl /tmp psl/*
    pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 \
      raw.psl contig.psl /dev/null
    liftUp -nohead all_dbtss.psl ../../jkStuff/liftAll.lft carry contig.psl
    pslSortAcc nohead chrom /tmp all_dbtss.psl

o - Load dbtss alignments into database
    ssh hgwdev
    cd /cluster/store1/gs.13/build30/bed/dbtss
    pslCat -dir chrom > dbtssAli.psl
    hgLoadPsl hg12 -tNameIx dbtssAli.psl

LOAD SLAM GENES

    cd /cluster/store3/gs.13/build30/bed
    mkdir slam
    cd slam
    wget http://bio.math.berkeley.edu/slam/mouse/gff/UCSC/hsCDS.gff.gz
    wget http://bio.math.berkeley.edu/slam/mouse/gff/UCSC/hsCNS.gff.gz
    gunzip *
    ldHgGene -exon=CDS hg12 slam hsCDS.gff
    mv genePred.tab genePred.hg12
    awk '{print $1,$4,$5,$10,$12}' hsCNS.gff > hsCNS.bed
    sed -e 's/;//g' -e 's/"//g' hsCNS.bed > hsCNS.bed.2
    sort -n -k 5,5 hsCNS.bed.2 > hsCNS.bed.sort
    examine head and tail of sorted file for range of scores
    rm hsCNS.bed.sort
    size.pl < hsCNS.bed.2 > hsCNS.bed.2.size
    sort -n -k 2,2 hsCNS.bed.2.size > hsCNS.bed.2.size.sort
    examine head and tail of sorted file for range of sizes
    rm hsCNS.bed.2.size.sort
    expand.pl < hsCNS.bed.2 > hsCNS.bed.2.expand
    hgLoadBed -tab hg12 slamNonCoding hsCNS.bed.2.expand

CREATING THE humMusL SAMPLE TRACK (a.k.a WIGGLE TRACK)
------------------------------------------------------
o - refer to the script at src/hg/sampleTracks/makeHg12Mm2.doc

LIFTOVER CHAIN TO HG15 (2004-04-12 kate)
----------------------------------------
    # blat alignments with 3K split of hg15 as query
    # NOTE: the split is doc'ed in makeHg15.doc
    ssh eieio
    mkdir -p /cluster/bluearc/hg12
    cp -R /cluster/data/hg12/mixedNib /cluster/bluearc/hg12

    ssh kk
    cd /cluster/data/hg12/bed
    mkdir -p blat.hg15.2004-04-12
    ln -s blat.hg15.2004-04-12 blat.hg15
    cd blat.hg15
    mkdir raw psl run
    cd run
    echo '#LOOP' > gsub
    echo 'blat $(path1) $(path2) {check out line+ ../raw/$(root1)_$(root2).psl} -tileSize=11 -ooc=/cluster/bluearc/hg/h/11.ooc -minScore=100 -minIdentity=98 -fastMap' >> gsub
    echo '#ENDLOOP' >> gsub
    # query
    ls -1S /iscratch/i/gs.16/build33/liftOver/split/*.fa > new.lst
    # target
    ls -1S /cluster/bluearc/hg12/mixedNib/*.nib > old.lst
    gensub2 old.lst new.lst gsub spec
    para create spec
    para try
    para push

    # lift results
    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg15
    cd raw
    cat > liftup.csh << 'EOF'
set liftDir = /cluster/bluearc/hg/gs.16/build33/liftOver/lift
foreach i (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M)
    echo chr$i
    liftUp -pslQ ../psl/chr$i.psl $liftDir/chr$i.lft warn chr*_chr$i.psl
    echo done $i
end
'EOF'
    csh liftup.csh >&! liftup.log &
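    Optionally confirm that the lift produced a non-empty psl for each
    chromosome before chaining (a sketch; prints only zero-length files):
      ls -l ../psl/chr*.psl | awk '$5 == 0 {print "empty:", $NF}'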
    # create alignment chains
    ssh kk
    cd /cluster/data/hg12/bed/blat.hg15
    mkdir chainRun chainRaw chain
    cd chainRun
    echo '#LOOP' > gsub
    echo 'axtChain -psl $(path1) /cluster/bluearc/hg12/mixedNib /scratch/hg/gs.16/build33/chromTrfMixedNib {check out line+ ../chainRaw/$(root1).chain}' >> gsub
    echo '#ENDLOOP' >> gsub
    ls -1S ../psl/*.psl > in.lst
    gensub2 in.lst single gsub spec
    para create spec
    para try
    para push

    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg15
    chainMergeSort chainRaw/*.chain | chainSplit chain stdin
    mkdir net
    cd chain
    foreach i (*.chain)
      chainNet $i /cluster/data/hg12/chrom.sizes \
        /cluster/data/hg15/chrom.sizes ../net/$i:r.net /dev/null
      echo done $i
    end
    mkdir ../over
    cat > subset.csh << 'EOF'
foreach i (*.chain)
    echo $i:r
    netChainSubset ../net/$i:r.net $i ../over/$i
    echo done $i
end
'EOF'
    csh subset.csh >&! subset.log &
    cat ../over/*.chain > ../over.chain
    mkdir -p /cluster/data/hg12/bed/bedOver
    cp ../over.chain /cluster/data/hg12/bed/bedOver/hg12ToHg15.over.chain

    # save to download area
    ssh hgwdev
    cd /usr/local/apache/htdocs/goldenPath/hg12
    mkdir -p liftOver
    cp /cluster/data/hg12/bed/bedOver/hg12ToHg15.over.chain liftOver
    gzip liftOver/hg12ToHg15.over.chain

    # load into database
    mkdir -p /gbdb/hg12/liftOver
    ln -s /cluster/data/hg12/bed/bedOver/hg12ToHg15.over.chain \
      /gbdb/hg12/liftOver
    hgAddLiftOverChain hg12 hg15

LIFTOVER CHAIN TO HG16 (2004-04-13 kate)
----------------------------------------
    # blat alignments with 3K split of hg16 as query
    # NOTE: the split is doc'ed in makeHg16.doc
    ssh kk
    cd /cluster/data/hg12/bed
    mkdir -p blat.hg16.2004-04-13
    ln -s blat.hg16.2004-04-13 blat.hg16
    cd blat.hg16
    mkdir raw psl run
    cd run
    echo '#LOOP' > gsub
    echo 'blat $(path1) $(path2) {check out line+ ../raw/$(root1)_$(root2).psl} -tileSize=11 -ooc=/cluster/bluearc/hg/h/11.ooc -minScore=100 -minIdentity=98 -fastMap' >> gsub
    echo '#ENDLOOP' >> gsub
    # query
    ls -1S /iscratch/i/gs.17/build34/liftOver/split/*.fa > new.lst
    # target
    ls -1S /cluster/bluearc/hg12/mixedNib/*.nib > old.lst
    gensub2 old.lst new.lst gsub spec
    para create spec
    para try
    para push

    # lift results
    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg16
    cd raw
    cat > liftup.csh << 'EOF'
set liftDir = /cluster/bluearc/hg/gs.17/build34/liftOver/lift
foreach i (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M)
    echo chr$i
    liftUp -pslQ ../psl/chr$i.psl $liftDir/chr$i.lft warn chr*_chr$i.psl
    echo done $i
end
'EOF'
    csh liftup.csh >&! liftup.log &

    # create alignment chains
    ssh kk
    cd /cluster/data/hg12/bed/blat.hg16
    mkdir chainRun chainRaw chain
    cd chainRun
    echo '#LOOP' > gsub
    echo 'axtChain -psl $(path1) /cluster/bluearc/hg12/mixedNib /scratch/hg/gs.17/build34/bothMaskedNibs {check out line+ ../chainRaw/$(root1).chain}' >> gsub
    echo '#ENDLOOP' >> gsub
    ls -1S ../psl/*.psl > in.lst
    gensub2 in.lst single gsub spec
    para create spec
    para try
    para push

    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg16
    chainMergeSort chainRaw/*.chain | chainSplit chain stdin
    mkdir net
    cd chain
    cat > chain.csh << 'EOF'
foreach i (*.chain)
    chainNet $i /cluster/data/hg12/chrom.sizes \
        /cluster/data/hg16/chrom.sizes ../net/$i:r.net /dev/null
    echo done $i
end
'EOF'
    csh chain.csh >&! chain.log &
    mkdir ../over
    cat > subset.csh << 'EOF'
foreach i (*.chain)
    echo $i:r
    netChainSubset ../net/$i:r.net $i ../over/$i
    echo done $i
end
'EOF'
    csh subset.csh >&! subset.log &
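    A quick per-chrom look at how many chains survived netChainSubset
    (a sketch; chain records begin with a "chain" header line):
      grep -c '^chain' ../over/*.chain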
    cat ../over/*.chain > ../over.chain
    mkdir -p /cluster/data/hg12/bed/bedOver
    cp ../over.chain /cluster/data/hg12/bed/bedOver/hg12ToHg16.over.chain

    # save to download area
    ssh hgwdev
    cd /usr/local/apache/htdocs/goldenPath/hg12
    mkdir -p liftOver
    cp /cluster/data/hg12/bed/bedOver/hg12ToHg16.over.chain liftOver
    gzip liftOver/hg12ToHg16.over.chain

    # load into database
    mkdir -p /gbdb/hg12/liftOver
    ln -s /cluster/data/hg12/bed/bedOver/hg12ToHg16.over.chain \
      /gbdb/hg12/liftOver
    hgAddLiftOverChain hg12 hg16

LIFTOVER CHAIN TO HG13 (2003-04-14 daryl ?, 2004-04-15 kate)
------------------------------------------------------------
    cp /cluster/data/hg13/bedOver/over.chain \
      /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain

    # save to download area
    ssh hgwdev
    cd /usr/local/apache/htdocs/goldenPath/hg12
    mkdir -p liftOver
    cp /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain liftOver
    gzip liftOver/hg12ToHg13.over.chain

    # load into database
    mkdir -p /gbdb/hg12/liftOver
    ln -s /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain \
      /gbdb/hg12/liftOver
    hgAddLiftOverChain hg12 hg13
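    An optional smoke test of a finished chain with liftOver (a sketch; the
    BED coordinates below are arbitrary, hypothetical values):
      printf 'chr7\t100000\t100100\n' > /tmp/test.hg12.bed
      liftOver /tmp/test.hg12.bed \
        /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain \
        /tmp/test.hg13.bed /tmp/test.unmapped
      cat /tmp/test.hg13.bed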