This file describes how we made the browser database on NCBI build 30
(July, 2002 freeze).

[For importing GTF tracks, use /projects/compbio/bin/validate_gtf.pl]

(The numbered stuff was brought in from
/cluster/store3/gs.13/build30/build.ncbi.doc)

HOW TO BUILD AN ASSEMBLY FROM NCBI FILES
----------------------------------------
NOTE: It is best to run most of this stuff on kkstore since it is not
averse to handling files > 2Gb.

0) Make gs.XX directory, gs.XX/buildXX directory, and gs.XX/ffa directory.
   Make a symbolic link from /cluster/store1 to this location:
      cd /cluster/store1
      ln -s (actual location)/gs.13 ./gs.13
   Make a symbolic link from your home directory to the build dir:
      ln -s /cluster/store1/gs.13/build30 ~/oo

1) Download seq_contig.md, ncbi_buildXX.agp, contig_overlaps.agp and the
   contig fa file into the gs.XX/buildXX/ directory.
   *** For build30, the files were split into reference.agp/reference.fa
   (main O&O), DR51.agp/DR51.fa, and DR52.agp/DR52.fa (alternate versions
   of the MHC region).  These were concatenated to get ncbi_build30.agp
   and ncbi_build30.fa (a sketch of this file prep appears after step 13).

2) Move and unpack the contig fa file into ../ffa/ncbi_buildXX.fa

2.3) Sanity check things with (in this directory):
      ~kent/bin/i386/checkYbr ncbi_buildXX.agp ../ffa/ncbi_buildXX.fa \
        seq_contig.md
     Report any errors back to Richa and Greg at NCBI.

3) Convert fa files into UCSC style fa files and place in the "contigs"
   directory inside the gs.XX/buildXX directory:
      mkdir contigs
      /cluster/bin/i386/faNcbiToUcsc -split -ntLast ../ffa/ncbi_buildXX.fa \
        contigs
   Note: 7/23/02 ASH: edited chrM.fa header to ">chrM" not
   ">gi|17981852|ref|NC_001807.4|"

3.1) Make a fake chrM contig:
      cd ~/oo
      mkdir M
     Copy in chrM.fa, chrM.agp and chrM.gl from the previous version.
      mkdir M/NT_999999
      cp chrM.fa NT_999999/NT_999999.fa

4) Create lift files (this will create the chromosome directory structure)
   and inserts file:
      /cluster/bin/scripts/createNcbiLifts seq_contig.md .

5) Create contig agp files (will create contig directory structure):
      /cluster/bin/scripts/createNcbiCtgAgp seq_contig.md ncbi_buildXX.agp .

5.1) Create contig gl files:
      ~kent/bin/i386/agpToGl contig_overlaps.agp . -md=seq_contig.md

6) Create chromosome agp files:
      /cluster/bin/scripts/createNcbiChrAgp .

6.1) Copy over jkStuff:
      mkdir jkStuff
      cp /cluster/store1/gs.12/build29/jkStuff/*.sh jkStuff
      cp /cluster/store1/gs.12/build29/jkStuff/*.csh jkStuff
      cp /cluster/store1/gs.12/build29/jkStuff/*.gsub jkStuff

6.2) Patch the size of chromosome Y into Y/lift/ordered.lft by grabbing it
     from the last line of Y/chrY.agp (not needed for build30).

6.3) Create chromosome gl files:
      jkStuff/liftGl.sh contig.gl

7) Distribute contig .fa to the appropriate directories (assumes all files
   are in the "contigs" directory):
      /cluster/bin/scripts/distNcbiCtgFa contigs .

8) Reverse complement the NT contig fa files that are flipped in the
   assembly (uses the faRc program):
      /cluster/bin/scripts/revCompNcbiCtgFa seq_contig.md .

   (NOTE: STS placements may be done at this point, before repeat masking,
   using the .fa's on NFS for QC analysis - all other placements should be
   done after repeat masking and distributing to cluster nodes.)

9) Split contigs, run RepeatMasker, lift results.
   Notes:
   * If there is a new version of RepeatMasker, build it and ask the admins
     to binrsync it (kkstore:/scratch/hg/RepeatMasker/*).
   * Contigs (*/NT_*/NT_*.fa) are split into 500kb chunks to make
     RepeatMasker runs manageable on the cluster ==> results need lifting.
   * For the NCBI assembly we repeat mask on the sensitive mode setting
     (RepeatMasker -s).
   * Note: for build30 / hg12, RepeatMasker was run in quick mode
     (/cluster/bin/scripts/RMLocalQuick) first, and the .out files were
     saved to .out.quick before re-running with RMLocalSens.

   #- Split contigs into 500kb chunks:
    cd ~/oo
    foreach d ( ?{,?}/NT_* )
      cd $d
      set contig = $d:t
      faSplit size $contig.fa 500000 ${contig}_ -lift=$contig.lft \
        -maxN=500000
      cd ../..
    end

   #- Make the run directory and job list:
    cd ~/oo
    mkdir RMRun
    rm -f RMRun/RMJobs
    touch RMRun/RMJobs
    foreach d ( ?{,?}/NT_* )
      foreach f ( /cluster/store3/gs.13/build30/$d/NT_*_*.fa )
        set f = $f:t
        echo /cluster/bin/scripts/RMLocalSens \
          /cluster/store3/gs.13/build30/$d $f \
          '{'check out line+ /cluster/store3/gs.13/build30/$d/$f.out'}' \
          >> RMRun/RMJobs
      end
    end

   #- Do the run:
    ssh kk
    cd ~/oo/RMRun
    para create RMJobs
    para try, para check, para check, para push, para check,...

   #- Lift up the split-contig .out's to contig-level .out's:
    cd ~/oo
    foreach d ( ?{,?}/NT_* )
      cd $d
      set contig = $d:t
      liftUp $contig.fa.out $contig.lft warn ${contig}_*.fa.out > /dev/null
      cd ../..
    end

10) Lift up RepeatMasker .out files to chromosome coordinates via:
     cd ~/oo
     tcsh jkStuff/liftOut2.sh

10.1) Validate the RepeatMasking by randomly selecting a few NT_*.fa files,
      manually repeat masking them and matching the .out files against the
      related part of the chromosome-level .out files.  For example:
       ssh kk
       cd ~/oo
      Pick several values of $chr and $nt and run these commands:
       set chr = ?
       set nt = NT_??????
       mv $chr/$nt/$nt.fa.out $chr/$nt/$nt.fa.out.bak
       /scratch/hg/RepeatMasker/RepeatMasker -s $chr/$nt/$nt.fa
       rm $chr/$nt/$nt.fa.{masked,cat,cut,stderr,tbl}
      Compare each $chr/$nt/$nt.fa.out against the original and against the
      appropriate part of $chr/chr$chr.fa.out (use the coords for $nt given
      in seq_contig.md).
       mv $chr/$nt/$nt.fa.out.bak $chr/$nt/$nt.fa.out
      For build 30, the following were checked:
       1/NT_004321, Y/NT_025975, 11/NT_033237

11) Generate contig and chromosome level masked and unmasked files via:
     tcsh jkStuff/chrFa.sh
     tcsh jkStuff/makeFaMasked.sh

12) Copy all contig and chrom fa files to /scratch on kkstore to get ready
    for cluster jobs, and ask to propagate to the nodes:
     ssh kkstore
     cd ~/oo
     /cluster/bin/scripts/cpNcbiFaScratch . /scratch/hg/gs.13/build30
    Build 30 re-do only:
     cd /scratch/hg/gs.13/build30/; mv contig contig.0729

13) Create jkStuff/ncbi.lft for lifting stuff built w/NCBI assembly.
    Note: this ncbi.lft will not lift floating contigs to chr_random
    coords, but it will show the strand orientation of the floating
    contigs (grep for '|').
     mdToNcbiLift seq_contig.md jkStuff/ncbi.lft
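Sketch for steps 0-2, referenced from step 1 above.  The exact commands
were not recorded in this doc, so the gunzip/cat steps and file names here
are assumptions based on the step descriptions, not the original run:

     # make the build and ffa directories (step 0):
     mkdir -p /cluster/store3/gs.13/build30 /cluster/store3/gs.13/ffa
     cd /cluster/store3/gs.13/build30
     # concatenate the main O&O with the two alternate MHC haplotypes
     # (step 1):
     cat reference.agp DR51.agp DR52.agp > ncbi_build30.agp
     # build the combined fa directly in ../ffa (step 2):
     cat reference.fa DR51.fa DR52.fa > ../ffa/ncbi_build30.fa

If the NCBI files arrived gzipped, a gunzip of each piece would precede
the cats.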
CREATING DATABASE (DONE)

o - ln -s /cluster/store1/gs.13/build30 ~/oo
    NOTE: /cluster/store1/gs.13/ is a symlink to /cluster/store3/gs.13

o - Make sure there is at least 5 gig free on hgwdev:/usr/local/mysql

o - Create the database.
    - ssh hgwdev
    - Enter mysql as the mysql root user.
    - At the mysql prompt type:
        create database hg12;
        quit
    - make a semi-permanent read-only alias:
        alias hg12 mysql -u hguser -phguserstuff -A hg12

o - Tell the hgCentral database about it.
    Log onto genome-centdb and enter mysql via:
        mysql -u root -pbigSecret hgCentral
    At the mysql prompt type:
        insert into dbDb values("hg12", "Human July 2002",
          "/cluster/store1/gs.13/build30/nib", "Human", "USP18", 1);

o - Create the trackDb table as so:
      cd ~/src/hg/makeDb/hgTrackDb
    Edit that makefile to add hg12 after hg11 and do:
      make update
      cvs commit makefile

LOAD REPEAT MASKS (DONE 7/29/02)

Load the RepeatMasker .out files into the database with:
    cd ~/oo
    hgLoadOut hg12 ?/*.fa.out ??/*.fa.out

EXTRACT LINEAGE-SPECIFIC REPEATS (ARIAN SMIT's scripts) (DONE 11/4/02)

    ssh kkstore
    mkdir -p ~/hg12/bed/linSpecRep
    cd ~/hg12/bed/linSpecRep
    foreach f (~/hg12/*/*.out)
      ln -sf $f .
    end
    /cluster/bin/scripts/primateSpecificRepeats.pl *.out
    /cluster/bin/scripts/perl-rename 's/(\.fa|\.nib)//' *.out.*spec
    /cluster/bin/scripts/perl-rename 's/\.(rod|prim)spec/.spec/' *.out.*spec
    rm *.out
    rm -rf /scratch/hg/gs.13/build30/linSpecRep
    cd ..
    cp -R linSpecRep /scratch/hg/gs.13/build30
    # Ask cluster-admin@cse.ucsc.edu to binrsync /scratch/hg to clusters

STORING O+O SEQUENCE AND ASSEMBLY INFORMATION (DONE 7/12/02)

Create packed chromosome sequence files:
    ssh kkstore
    cd ~/oo
    tcsh jkStuff/makeNib.sh

Load chromosome sequence info into the database and save chrom sizes:
    ssh hgwdev
    hgsql hg12 < ~/src/hg/lib/chromInfo.sql
    cd ~/oo
    hgNibSeq -preMadeNib hg12 /cluster/store1/gs.13/build30/nib ?{,?}/chr*.fa
    mysql -u hguser -phguserstuff -N \
      -e "select chrom,size from chromInfo" hg12 > chrom.sizes

Store o+o info in database.  DONE 8/13/02
    Note: for build30, Terry specially requested these files from NCBI:
        finished.finf draft.finf predraft.finf extras.finf
        finished.ffa.gz draft.ffa.gz predraft.ffa.gz extras.ffa.gz
    For future builds, we should try to modify hgClonePos to just use
    *.finf and not the *.ffa files.  Patrick unpacked the *.ffa.gz into
    gs.13/{fin,draft,predraft,extras}/fa/* using
    /cluster/bin/scripts/unPackffa .

    cd /cluster/store1/gs.13/build30
    jkStuff/liftGl.sh contig.gl
    hgGoldGapGl hg12 /cluster/store1/gs.13 build30
    cd /cluster/store1/gs.13
    hgClonePos hg12 build30 ffa/sequence.inf /cluster/store1/gs.13 -maxErr=3
    #(Ignore warnings about missing clones - these are in chromosomes 21
    # and 22)
    hgCtgPos hg12 build30

Make and load GC percent table.  DONE 7/12/02
    ssh hgwdev
    cd ~/oo
    mkdir -p bed/gcPercent
    cd bed/gcPercent
    hgsql hg12 < ~/src/hg/lib/gcPercent.sql
    hgGcPercent hg12 ../../nib

GETTING FRESH mRNA, EST, REFSEQ SEQUENCE FROM GENBANK.  (DONE 7/29/02)

This will create a genbank.130 directory containing compressed GenBank
flat files and an mrna.130 directory containing unpacked sequence info and
auxiliary info in a relatively easy to parse (.ra) format.

o - Point your browser to ftp://ftp.ncbi.nih.gov/genbank and look at
    README.genbank.  Figure out the current release number (which is 130).

o - Consider deleting one of the older genbank releases.  It's good to at
    least keep one previous release though.

o - Where there is space, make a new genbank directory and create a
    symbolic link to it:
      mkdir /cluster/store1/genbank.130
      ln -s /cluster/store1/genbank.130 ~/genbank
      cd ~/genbank

o - ftp ftp.ncbi.nih.gov (do anonymous log-in).  Then do the following
    commands inside ftp:
      cd genbank
      prompt
      mget gbpri* gbrod* gbv* gbsts* gbest* gbmam* gbinv*
    This will take at least 2 hours.

o - Make the refSeq subdir and download files:
      mkdir -p /cluster/store1/mrna.130/refSeq
      cd /cluster/store1/mrna.130/refSeq

o - ftp ftp.ncbi.nih.gov (do anonymous log-in).
    Then do the following commands inside ftp:
      cd refseq/H_sapiens/mRNA_Prot
      prompt
      mget hs.*.gz

o - Unpack this into fa files and get extra info with:
      cd /cluster/store1/mrna.130/refSeq
      gunzip -c hs.gbff.gz | \
        gbToFaRa ~kent/hg/h/mrna.fil ../refSeq.fa ../refSeq.ra ../refSeq.ta \
        stdin

o - Log onto the server and change to your genbank directory:
      mkdir -p /cluster/store1/mrna.130
      cd /cluster/store1/mrna.130
      gunzip -c /cluster/store1/genbank.130/gbpri*.gz | \
        gbToFaRa ~kent/hg/h/mrna.fil mrna.fa mrna.ra mrna.ta stdin
      gunzip -c /cluster/store1/genbank.130/gbest*.gz | \
        gbToFaRa ~kent/hg/h/mrna.fil est.fa est.ra est.ta stdin
      gunzip -c /cluster/store1/genbank.130/gbest*.gz | \
        gbToFaRa ~kent/hg/h/xenoRna.fil xenoEst.fa xenoEst.ra xenoEst.ta \
        stdin
      cd /cluster/store1/genbank.130
      gunzip -c gbpri*.gz gbmam*.gz gbrod*.gz gbv*.gz gbinv*.gz | \
        gbToFaRa ~kent/hg/h/xenoRna.fil ../mrna.130/xenoRna.fa \
        ../mrna.130/xenoRna.ra ../mrna.130/xenoRna.ta stdin

STORING mRNA/EST SEQUENCE AND AUXILIARY INFO (DONE 7/29/02)

o - Store the mRNA (non-alignment) info in the database:
      hgLoadRna new hg12
      hgLoadRna add hg12 /cluster/store1/mrna.130/mrna.fa \
        /cluster/store1/mrna.130/mrna.ra
      hgLoadRna add hg12 /cluster/store1/mrna.130/est.fa \
        /cluster/store1/mrna.130/est.ra
      hgLoadRna add -type=refSeq hg12 /cluster/store1/mrna.130/refSeq.fa \
        /cluster/store1/mrna.130/refSeq.ra
    The est line will take quite some time to complete.

MAKING AND STORING mRNA AND EST ALIGNMENTS (DONE w/ mrna.130)

o - Load up the local disks of the cluster with refSeq.fa, mrna.fa and
    est.fa: copy the above 3 files from /cluster/store1/mrna.130 into
    kkstore:/scratch/hg/mrna.130, then request the admins to do a binrsync
    to the cluster.

o - Use BLAT to generate refSeq, mRNA and EST alignments as so:
    Make sure that /scratch/hg/gs.13/build30/contig/ is loaded with NT_*.fa
    and pushed to the cluster nodes.
      ssh kk
      mkdir -p /cluster/store1/gs.13/build30/bed
      cd /cluster/store1/gs.13/build30/bed
    Using the bash shell do:
      for i in 'refSeq' 'mrna' 'est'
      do
        mkdir -p $i
        cd $i
        cp ~kent/lastOo/bed/$i/gsub .
        ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
        ls -1 /scratch/hg/mrna.130/$i/$i.fa > mrna.lst
        mkdir -p psl
        # Note: build30/bed/refSeq directory not writeable, so I had to
        # create a bed/refSeq/psl and change the gsub
        mkdir -p /cluster/store1/gs.13/build30/bed/refSeq/psl
        gensub2 genome.lst mrna.lst gsub spec
        para create spec
        cd ..
      done
    Now, by hand, cd to the mrna, refSeq, and est directories respectively
    and run a para push and para check in each one.

o - Process refSeq, mRNA and EST alignments into near best in genome:
      cd ~/oo/bed
      cd refSeq
      pslSort dirs raw.psl /tmp psl
      pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 \
        raw.psl contig.psl /dev/null
      liftUp -nohead all_refSeq.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /tmp all_refSeq.psl
      cd ..
      cd mrna
      pslSort dirs raw.psl /tmp psl
      pslReps -minAli=0.96 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_mrna.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /tmp all_mrna.psl
      cd ..
      cd est
      pslSort dirs raw.psl /cluster/store3/tmp psl
      pslReps -minAli=0.93 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_est.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/store3/tmp all_est.psl
      cd ..
o - Load refSeq alignments into database:
      ssh hgwdev
      cd /cluster/store1/gs.13/build30/bed/refSeq
      pslCat -dir chrom > refSeqAli.psl
      hgLoadPsl hg12 -tNameIx refSeqAli.psl

o - Load mRNA alignments into database:
      ssh hgwdev
      cd /cluster/store1/gs.13/build30/bed/mrna/chrom
    In tcsh:
      rm *_mrna.psl
      foreach i (*.psl)
        mv $i $i:r_mrna.psl
      end
      hgLoadPsl hg12 *.psl
      cd ..
      hgLoadPsl hg12 all_mrna.psl -nobin

o - Load EST alignments into database:
      ssh hgwdev
      cd /cluster/store1/gs.13/build30/bed/est/chrom
    In tcsh do:
      rm *_est.psl
      foreach i (*.psl)
        mv $i $i:r_est.psl
      end
      hgLoadPsl hg12 *.psl
      cd ..
      hgLoadPsl hg12 all_est.psl -nobin

o - Create subset of ESTs with introns and load into database:
    - ssh kkstore
      cd ~/oo
      tcsh jkStuff/makeIntronEst.sh
    - ssh hgwdev
      cd ~/oo/bed/est/intronEst
      hgLoadPsl hg12 *.psl

o - Put orientation info on ESTs and mRNAs into database:
    Note: the cluster run requires /scratch/.../trfFa.0730/ to be in place,
    so this step should be run after "PREPARING SEQUENCE FOR CROSS SPECIES
    ALIGNMENTS" below.
      ssh kk
      cd ~/oo/bed/est
      pslSortAcc nohead contig /cluster/store3/tmp contig.psl
      cd ~/oo/bed/mrna
      pslSortAcc nohead contig /cluster/store3/tmp contig.psl
      ssh kkstore
      mkdir -p /scratch/hg/gs.13/build30/bed
      cp -r ~/oo/bed/est/contig /scratch/hg/gs.13/build30/bed/est
      cp -r ~/oo/bed/mrna/contig /scratch/hg/gs.13/build30/bed/mrna
    Ask admins to binrsync /scratch/hg/gs.13/build30/bed/* to the cluster.
      ssh kk
      foreach d (est mrna)
        mkdir -p ~/oo/bed/${d}OrientInfo/oi
        cd ~/oo/bed/${d}OrientInfo
        ls -1 /scratch/hg/gs.13/build30/bed/${d}/*.psl > psl.lst
        cp ~/hg11/bed/${d}OrientInfo/gsub .
      end
    Edit ~/oo/bed/${d}OrientInfo/gsub to point to the correct paths.
    For each of ~/oo/bed/{est,mrna}OrientInfo, cd there and do this:
      gensub2 psl.lst single gsub spec
      para create spec
      para try
      para check
      para push
    Check until done, or use 'para shove'.
    When the cluster run is done do:
      foreach d (est mrna)
        cd ~/oo/bed/${d}OrientInfo
        liftUp ${d}OrientInfo.bed ~/oo/jkStuff/liftAll.lft warn oi/*.tab
        hgLoadBed hg12 ${d}OrientInfo ${d}OrientInfo.bed \
          -sqlTable=$HOME/src/hg/lib/${d}OrientInfo.sql > /dev/null
      end

o - Create rnaCluster table (depends on {est,mrna}OrientInfo above):
      ssh hgwdev
      cd ~/oo
      mkdir -p ~/oo/bed/rnaCluster/chrom
      foreach i (? ??)
        cd $i
        foreach j (chr*.fa)
          set c = $j:r
          set f = ../bed/rnaCluster/chrom/$c.bed
          echo clusterRna hg12 /dev/null $f -chrom=$c
          clusterRna hg12 /dev/null $f -chrom=$c
        end
        cd ..
      end
      cd bed/rnaCluster
      hgLoadBed hg12 rnaCluster chrom/*.bed > /dev/null

PRODUCING KNOWN GENES (DONE for 130)

o - Get extra info from NCBI and produce refGene table as so:
      ssh hgwdev
      cd ~/oo/bed
      mkdir refSeq
      cd refSeq
      # Note: downloaded these to refSeq (refSeq dir perms)
      wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref
      wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/mim2loc

o - Similarly download refSeq proteins in fasta format to refSeq.pep -
    I believe this is hs.faa.  I have changed the name to hs.prot.fa.
    (A sketch of this download follows below.)

o - RefSeq should have already been aligned to the genome by the processes
    described under mRNA/EST alignments above.
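A plausible expansion of the refSeq protein download above - the FTP path
is an assumption based on the mRNA_Prot directory used earlier, and the
final location is inferred from the hgRefSeqMrna command below; none of
this is copied from the original run:

      mkdir -p /cluster/store1/mrna.130/refSeq/human
      cd /cluster/store1/mrna.130/refSeq/human
      wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/hs.faa.gz
      gunzip hs.faa.gz
      mv hs.faa hs.prot.fa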
o - Produce refGene, refPep, refMrna, and refLink tables as so:
      ssh hgwdev
      cd ~/oo/bed/refSeq
      ln -s /cluster/store1/mrna.130 mrna
      #
      # NOTE: If hgRefSeqMrna is losing refLink's protAcc field, I think
      # it's due to a format-change issue with .-suffixes in mrna acc's
      #
      hgRefSeqMrna hg12 mrna/refSeq.fa mrna/refSeq.ra all_refSeq.psl \
        loc2ref mrna/refSeq/human/hs.prot.fa mim2loc

o - Add RefSeq status info:
      hgRefSeqStatus hg12 loc2ref

o - Add Jackson labs info:
      cd ~/oo/bed
      mkdir jaxOrtholog
      cd jaxOrtholog
      wget ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt
      cp /cluster/store1/gs.12/build29/bed/jaxOrtholog/filter.awk .
      awk -f filter.awk *.rpt > jaxOrtholog.tab
    Drop (just in case), create and load the table like this:
      echo 'drop table jaxOrtholog;' | hgsql hg12
      hgsql hg12 < ~/src/hg/lib/jaxOrtholog.sql
      echo "load data local infile '"`pwd`"/jaxOrtholog.tab' into table jaxOrtholog;" \
        | hgsql hg12

REFFLAT and GENEBANDS

o - Create precomputed join of refFlat and refGene:
      echo 'CREATE TABLE refFlat (KEY geneName (geneName), KEY name (name),
        KEY chrom (chrom)) SELECT refLink.name as geneName, refGene.*
        FROM refLink,refGene WHERE refLink.mrnaAcc = refGene.name' \
        | hgsql hg12

o - Create precomputed geneBands table:
      ssh hgwdev
      hgGeneBands hg12 geneBands.txt
      hgsql hg12
      mysql> load data local infile 'geneBands.txt' into table geneBands;
      mysql> quit
      rm geneBands.txt

SIMPLE REPEAT TRACK (DONE)

o - Create cluster parasol job like so:
      ssh kk
      mkdir -p ~/oo/bed/simpleRepeat
      cd ~/oo/bed/simpleRepeat
      cp /cluster/store1/gs.12/build29/bed/simpleRepeat/gsub ./gsub
      mkdir trf
    Ask the admins to push /scratch/hg/gs.13/build30/ to the cluster.
      ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
      gensub2 genome.lst single gsub spec
      para create spec
      para try
      para check
      para push
      liftUp simpleRepeat.bed ~/oo/jkStuff/liftAll.lft warn trf/*.bed

o - Load this into the database as so:
      ssh hgwdev
      cd ~/oo/bed/simpleRepeat
      hgLoadBed hg12 simpleRepeat simpleRepeat.bed \
        -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql

PRODUCING GENSCAN PREDICTIONS (DONE 7/31/02)

      mkdir -p ~/oo/bed/genscan
      cd ~/oo/bed/genscan

o - Produce contig genscan.gtf, genscan.pep and genscanExtra.bed files
    like so:
    Put hard-masked contigs in
    /cluster/store1/gs.13/build30/bed/genscan/mContigs.
    (For hg12, the .masked files were not saved during repeat masking, so
    the contig (.fa) files in /cluster/store1/gs.13/build30/? and ?? were
    processed to convert all lower case bases into N, named *.fa.masked,
    and placed under genscan/mContigs.)
      mkdir -p ~/oo/bed/genscan/mContigs
      cd ~/oo/bed/genscan/mContigs
      foreach f (/cluster/store3/gs.13/build30/?/*/NT_??????.fa \
                 /cluster/store3/gs.13/build30/??/*/NT_??????.fa)
        set m = $f:t.masked
        tr 'abcdghkmnrstvwy' 'NNNNNNNNNNNNNNN' < $f > $m
      end

    Log into kkr1u00 (not kk!).  kkr1u00 is the driver node for the small
    cluster (kkr2u00-kkr8u00; genscan has problems running on the big
    cluster due to limited memory and swap space on each processing node).
      cd ~/oo/bed/genscan
    Make 3 subdirectories for genscan to put its output files in:
      mkdir -p gtf pep subopt
    Generate a list file, genome.list, of all the contigs *that do not have
    pure Ns* (due to heterochromatin, unsequencable stuff), which would
    cause genscan to run forever:
      rm -f genome.list
      touch genome.list
      foreach f ( `ls -1S ./mContigs/*.masked` )
        egrep '[ACGT]' $f > /dev/null
        if ($status == 0) echo $f >> genome.list
      end
    Create template file, gsub, for gensub2.  For example (3 lines file):
      #LOOP
      /cluster/home/fanhsu/bin/i386/gsBig {check in line+ $(path1)} {check out line gtf/$(root1).gtf} -trans={check out line pep/$(root1).pep} -subopt={check out line subopt/$(root1).bed} -exe=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/genscan -par=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/HumanIso.smat -tmp=/tmp -window=2400000
      #ENDLOOP

    Create a file containing a single line:
      echo single > single
    Generate job list file, jobList, for Parasol:
      gensub2 genome.list single gsub jobList
    First issue the following Parasol command:
      para create jobList
    Run the following command, which will try the first 10 jobs from
    jobList:
      para try
    Check whether those 10 jobs ran OK with:
      para check
    If they have problems, debug and fix your program, template file,
    commands, etc. and try again.  If they are OK, issue the following
    command, which asks Parasol to start all the remaining jobs (for hg12,
    there were 1396 jobs in total):
      para push
    Issue either of the following two commands to check the status of the
    cluster and your jobs, until they are done:
      parasol status
      para check
    If there were out-of-memory problems (run "para problems"), then re-run
    those jobs by hand but change the -window arg from 2400000 to 1200000.
    In gs.13/build30, this was the job for mContigs/NT_011519.fa.masked .

o - Convert these to chromosome level files as so:
      cd ~/oo/bed/genscan
      liftUp genscan.gtf ../../jkStuff/liftAll.lft warn gtf/NT*.gtf
      liftUp genscanSubopt.bed ../../jkStuff/liftAll.lft warn subopt/NT*.bed \
        > /dev/null
      cat pep/*.pep > genscan.pep

o - Load into the database as so:
      ssh hgwdev
      cd ~/oo/bed/genscan
      ldHgGene hg12 genscan genscan.gtf
      hgPepPred hg12 generic genscanPep genscan.pep
      hgLoadBed hg12 genscanSubopt genscanSubopt.bed > /dev/null

PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS (DONE 7/30/02)

Make sure that the NT*.fa files are lower-case repeat masked.  Do something
much like the simpleRepeat track, but only masking out stuff with a period
of 12 or less, as so:
      ssh kk
      mkdir -p ~/oo/bed/trfMask
      cd ~/oo/bed/trfMask
      cp /cluster/store1/gs.12/build29/bed/trfMask/gsub .
      mkdir trf
      ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
      gensub2 genome.lst single gsub spec
      para create spec
      para try
      para check
      para push

When that is done do:
      ssh kkstore
      mkdir /scratch/hg/gs.13/build30/trfFa.0730
      cd ~/oo
    NOTE: below is a tcsh script.
      foreach i (? ??)
        cd $i
        foreach j (NT*)
          maskOutFa $j/$j.fa ../bed/trfMask/trf/$j.bed -softAdd \
            /scratch/hg/gs.13/build30/trfFa.0730/$j.fa.trf
          echo done $i/$j
        end
        cd ..
      end
Then ask the admins to do a binrsync.

PREPARING POST-TRF CHROM-LEVEL MIXED NIBs for mouse blastz (DONE 11/6/02)

    # lift trfMask output to chrom-level... this is a pain because all
    # trf output was put in the same dir.  maybe next time around, we
    # can preserve chrom dir structure...
    ssh kkstore
    cd ~/oo
    foreach c (?{,?})
      if (-e $c/lift/ordered.lst) then
        set ntlist = ()
        foreach n (`cat $c/lift/ordered.lst`)
          set ntlist = ($ntlist bed/trfMask/trf/$n.bed)
        end
        liftUp $c/chr$c.trf.bed jkStuff/liftAll.lft warn $ntlist
      endif
      if (-e $c/lift/random.lst) then
        set ntlist = ()
        foreach n (`cat $c/lift/random.lst`)
          set ntlist = ($ntlist bed/trfMask/trf/$n.bed)
        end
        liftUp $c/chr${c}_random.trf.bed jkStuff/liftAll.lft warn $ntlist
      endif
    end
    # make trf-masked chrom-level .fa
    foreach c (?{,?})
      cd $c
      if (-e chr$c.trf.bed) then
        echo masking $c...
        cp chr$c.fa chr$c.trf.fa
        maskOutFa -softAdd chr$c.trf.fa chr$c.trf.bed chr$c.trf.fa
      endif
      if (-e chr${c}_random.trf.bed) then
        echo masking ${c}_random...
        cp chr${c}_random.fa chr${c}_random.trf.fa
        maskOutFa -softAdd chr${c}_random.trf.fa chr${c}_random.trf.bed \
          chr${c}_random.trf.fa
      endif
      cd ..
    end
    # make nib
    mkdir trfMixedNib
    foreach c (?{,?})
      if (-e $c/chr$c.trf.fa) then
        faToNib -softMask $c/chr$c.trf.fa trfMixedNib/chr$c.nib
      endif
      if (-e $c/chr${c}_random.trf.fa) then
        faToNib -softMask $c/chr${c}_random.trf.fa \
          trfMixedNib/chr${c}_random.nib
      endif
    end
    rm -rf /scratch/hg/gs.13/build30/chromTrfMixedNib
    cp -pR trfMixedNib /scratch/hg/gs.13/build30/chromTrfMixedNib

CREATE GOLDEN TRIANGLE (todo)

Make sure that the rnaCluster table is in place.  Then extract Affy
expression info into a form suitable for Eisen's clustering program with:
    cd ~/oo/bed
    mkdir triangle
    cd triangle
    eisenInput hg12 affyHg10.txt
Transfer this to Windows and do k-means clustering with k=200 with Cluster.
Transfer the results file back to ~/oo/bed/triangle/affyCluster_K_G200.kgg.
Then do:
    promoSeqFromCluster hg12 1000 affyCluster_K_G200.kgg kg200.unmasked
Then RepeatMask the .fa files in kg200.unmasked, and copy masked versions
to kg200.  Then cat kg200/*.fa > all1000.fa and set up a cluster Improbizer
run to do 100 controls for every real run on each - putting the output in
imp.200.1000.e.  When the Improbizer run is done, make a file summarizing
the runs as so:
    cd imp.200.1000.e
    motifSig ../imp.200.1000.e.iri ../kg200 motif control*
Get rid of insignificant motifs with:
    cd ..
    awk '{if ($2 > $3) print; }' imp.200.1000.e.iri > sig.200.1000.e.iri
Turn the rest into just dnaMotifs with:
    iriToDnaMotif sig.200.1000.e.iri motif.200.1000.e.txt
Extract all promoters with:
    featureBits hg12 rnaCluster:upstream:1000 -bed=upstream1000.bed \
      -fa=upstream1000.fa
Locate motifs on all promoters with:
    dnaMotifFind motif.200.1000.e.txt upstream1000.fa hits.200.1000.e.txt \
      -rc -markov=2
    liftPromoHits upstream1000.bed hits.200.1000.e.txt triangle.bed

CREATE STS/FISH/BACENDS/CYTOBANDS DIRECTORY STRUCTURE AND SETUP (DONE)

o - Create directory structure to hold information for these tracks:
      cd /projects/hg2/booch/psl/
    Change the Makefile parameters for OOVERS, GSVERS, PREVGS, PREVOO, then:
      make new

o - Update all Makefiles with the latest OOVERS and GSVERS, DATABASE, and
    locations of .fa files (see the sketch below).

o - Create accession_info file:
      make accession_info.rdb

UPDATE STS INFORMATION (DONE)

o - Download and unpack updated information from dbSTS:
    In a web browser, go to ftp://ftp.ncbi.nih.gov/repository/dbSTS/.
    Download dbSTS.sts, dbSTS.aliases, and dbSTS.FASTA.dailydump.Z to
    /projects/hg2/booch/psl/update.
    Unpack dbSTS.FASTA.dailydump.Z:
      gunzip dbSTS.FASTA.dailydump.Z

o - Create updated files:
      cd /projects/hg2/booch/psl/update
    Edit the Makefile to the latest sts.X version from PREV (currently
    sts.4), then:
      make update

o - Make a new directory for this info and move files there:
      ssh kks00
      mkdir /cluster/store1/sts.5
      cp all.STS.fa /cluster/store1/sts.5
      cp all.primers /cluster/store1/sts.5
      cp all.primers.fa /cluster/store1/sts.5

o - Copy new files to cluster:
      ssh kkstore
      cd /cluster/store1/sts.5
      cp /cluster/store1/sts.5/*.* /scratch/hg/STS
    Ask for propagation from the sysadmins.
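The booch Makefiles themselves are not reproduced in this doc; the
alignment sections below all begin with "update Makefile with latest
OOVERS and GSVERS".  A hypothetical sketch of the variables being edited -
the names come from the setup steps above, the values are inferred for
this build, and the real Makefiles may differ:

      # top of a booch alignment Makefile for this build (assumed)
      OOVERS   = build30   # current O+O assembly
      GSVERS   = gs.13     # current genome sequence freeze
      PREVOO   = build29   # previous assembly, for carrying files forward
      PREVGS   = gs.12
      DATABASE = hg12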
STS ALIGNMENTS (DONE)
(alignments done without RepeatMasking, so start ASAP!)

o - Create full sequence alignments:
      ssh kk
      cd /cluster/home/booch/sts
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList.scratch (if contig files propagated to nodes)
        - or -
      make jobList.disk (if contig files not propagated)
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make stsMarkers.psl

o - Copy files to final destination and remove originals:
      ssh kks00
      make copy.assembly
      make clean.assembly

o - Create primer alignments:
      ssh kk
      cd /cluster/home/booch/primers
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList.scratch (if contig files propagated to nodes)
        - or -
      make jobList.disk (if contig files not propagated)
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make primers.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

o - Create ePCR alignments:
      ssh kk
      cd /cluster/home/booch/epcr
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList.scratch (if contig files propagated to nodes)
        - or -
      make jobList.disk (if contig files not propagated)
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make primers.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

CREATE AND LOAD STS MARKERS TRACK (DONE)

o - Copy in current stsInfo2.bed and stsAlias.bed files:
      cd /projects/hg2/booch/psl/gs.13/build30
      cp ../update/stsInfo2.bed .
      cp ../update/stsAlias.bed .

o - Create final version of sts sequence placements:
      ssh kks00
      cd /projects/hg2/booch/psl/gs.13/build30/sts
      make stsMarkers.final

o - Create final version of primers placements:
      cd /projects/hg2/booch/psl/gs.13/build30/primers
      cp /cluster/store1/sts.5/all.primers .
      make primers.final

o - Create bed file:
      cd /projects/hg2/booch/psl/gs.13/build30
      make stsMap.bed

o - Create database tables:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_sts_primer.sql
      mysql -uhgcat -pXXXXXXX < all_sts_seq.sql
      mysql -uhgcat -pXXXXXXX < stsAlias.sql
      mysql -uhgcat -pXXXXXXX < stsInfo2.sql
      mysql -uhgcat -pXXXXXXX < stsMap.sql

o - Load the tables:
    load /projects/hg2/booch/psl/gs.13/build30/sts/stsMarkers.psl.filter.lifted
      into all_sts_seq
    load /projects/hg2/booch/psl/gs.13/build30/primers/primers.psl.filter.lifted
      into all_sts_primer
    load /projects/hg2/booch/psl/gs.13/build30/stsAlias.bed into stsAlias
    load /projects/hg2/booch/psl/gs.13/build30/stsInfo2.bed into stsInfo2
    load /projects/hg2/booch/psl/gs.13/build30/stsMap.bed into stsMap

o - Load the sequences (change sts.# to match correct location):
      hgLoadRna addSeq hg12 /cluster/store1/sts.5/all.STS.fa
      hgLoadRna addSeq hg12 /cluster/store1/sts.5/all.primers.fa

BACEND SEQUENCE ALIGNMENTS (DONE)
(alignments done without RepeatMasking, so start ASAP!)

o - Create full sequence alignments:
      ssh kk
      cd /cluster/home/booch/bacends
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make bacEnds.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

BACEND PAIRS TRACK (DONE)

o - Update Makefile with location of pairs files, if necessary:
      cd /projects/hg2/booch/psl/gs.13/build30/bacends
      edit Makefile (PAIRS=....)
o - Create bed file:
      ssh kks00
      cd /projects/hg2/booch/psl/gs.13/build30/bacends
      make bacEndPairs.bed

o - Create database tables:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_bacends.sql
      mysql -uhgcat -pXXXXXXX < bacEndPairs.sql

o - Load the tables:
    load /projects/hg2/booch/psl/gs.13/build30/bacends/bacEnds.psl.filter.lifted
      into all_bacends
    load /projects/hg2/booch/psl/gs.13/build30/bacends/bacEndPairs.bed
      into bacEndPairs

o - Load the sequences (change bacends.# to match correct location):
      hgLoadRna addSeq hg12 /cluster/store1/bacends.2/BACends.fa

FOSEND SEQUENCE ALIGNMENTS (DONE)

o - Create full sequence alignments:
      ssh kk
      cd /cluster/home/booch/fosends
    Update the Makefile with latest OOVERS and GSVERS, then:
      make new
      make jobList
      para create jobList
      para push (or para try/para check if you want to make sure it runs)
      make fosEnds.psl

o - Copy files to final destination and remove:
      ssh kks00
      make copy.assembly
      make clean.assembly

FOSEND PAIRS TRACK (DONE)

o - Update Makefile with location of pairs files, if necessary:
      cd /projects/hg2/booch/psl/gs.13/build30/fosends

o - Create bed file:
      ssh kks00
      cd /projects/hg2/booch/psl/gs.13/build30/fosends
      make fosEndPairs.bed

o - Create database tables:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < all_fosends.sql
      mysql -uhgcat -pXXXXXXX < fosEndPairs.sql

o - Load the tables:
    load /projects/hg2/booch/psl/gs.13/build30/fosends/fosEnds.psl.filter.lifted
      into all_fosends
    load /projects/hg2/booch/psl/gs.13/build30/fosends/fosEndPairs.bed
      into fosEndPairs

o - Load the sequences (change fosends.# to match correct location):
      hgLoadRna addSeq hg12 /cluster/store1/fosends.1/fosEnds.fa

UPDATE FISH CLONES INFORMATION (DONE)

o - Download the latest info from NCBI:
    Point a browser at
    http://www.ncbi.nlm.nih.gov/genome/cyto/cytobac.cgi?CHR=all&VERBOSE=ctg
    Change "Show details on sequence-tag" to "yes".
    Change "Download or Display" to "Download table for UNIX".
    Press Submit - save as
    /projects/hg2/booch/psl/fish/hbrc/hbrc.YYYYMMDD.table

o - Format the file just downloaded:
      cd /projects/hg2/booch/psl/fish/
      make HBRC

o - Copy it to the new freeze location:
      cp /projects/hg2/booch/psl/fish/all.fish.format \
        /projects/hg2/booch/psl/gs.13/build30/fish/

CREATE AND LOAD FISH CLONES TRACK (DONE)
(must be done after STS markers track and BAC end pairs track)

o - Extract the file with clone positions from the database:
      ssh hgwdev
      mysql -uhgcat -pXXXXXXXX hg12
      mysql> select * into outfile "/tmp/booch/clonePos.txt" from clonePos;
      mysql> quit
      mv /tmp/booch/clonePos.txt /projects/hg2/booch/psl/gs.13/build30/fish

o - Create bed file:
      cd /projects/hg2/booch/psl/gs.13/build30/fish
      make bed

o - Create database table:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < fishClones.sql

o - Load the table:
    load /projects/hg2/booch/psl/gs.13/build30/fish/fishClones.bed
      into fishClones

CREATE AND LOAD CHROMOSOME BANDS TRACK (DONE)
(must be done after FISH Clones track)

o - Create bed file:
      ssh hgwdev
      make setBands.txt
      make cytobands.pct.ranges
      make predict

o - Create database table:
      ssh hgwdev
      cd /projects/hg2/booch/psl/tables
      mysql -uhgcat -pXXXXXXX < cytoBand.sql

o - Load the table:
    load /projects/hg2/booch/psl/gs.13/build30/cytobands/cytobands.bed
      into cytoBand

CREATE CHROMOSOME REPORTS (DONE)

CREATE STS MAP COMPARISON PLOTS (DONE)

HUMAN/MOUSE BLAT ALIGNMENTS (DONE 8/22/02)

    # Process the trfFa files (contigs, lower-case repeat and tandem-repeat
    # masked) into about 500 files containing records of <= 20kb each.
    # First, make .unlft (unlifted) versions of all mouse contig .agp's:
    ssh kkstore
    cd ~/mm2
    foreach ctgAgp (?{,?}/chr*/chr?{,?}_?{,?}.agp)
      ~/kent/src/hg/splitFaIntoContigs/deLiftAgp.pl jkStuff/liftAll.lft \
        $ctgAgp > $ctgAgp.unlft
    end
    # Now use the unlifted contig .agp's to further split the
    # (super-)contigs into smaller "sub"-contigs (still at contig
    # boundaries):
    foreach ctgAgp (?{,?}/chr*/chr?{,?}_?{,?}.agp.unlft)
      set ctg=$ctgAgp:t:r:r
      splitFaIntoContigs $ctgAgp trfFa/$ctg.fa.trf trfFaSplit -nSize=15000
    end
    # Create a lift file for all sub-contigs.
    cat trfFaSplit/*/lift/ordered.lft > trfFaSplit/allSubContigs.lft
    # Since splitFaIntoContigs enforces a min/approximate size and we need
    # to enforce a max size, use faSplit on sub-contigs.  Build up a list
    # file naming all the split sub-contigs.
    set splitSubDir=trfFaSplit/splitSubs
    mkdir -p $splitSubDir
    rm -f $splitSubDir/splitSubs.lst
    touch $splitSubDir/splitSubs.lst
    foreach ctgDir (trfFaSplit/?{,?}_?{,?})
      foreach subCtgFa ($ctgDir/chr*/chr*.fa)
        set subCtg=$subCtgFa:t:r
        faSplit size $subCtgFa 20000 $splitSubDir/${subCtg}_ \
          -lift=$splitSubDir/$subCtg.lft -maxN=20000
        foreach ss ($splitSubDir/${subCtg}_*)
          echo $ss >> $splitSubDir/splitSubs.lst
        end
      end
    end
    # Create a lift file for all split sub-contigs.
    # or not -- too many files for cat.  Create per-chunk lift files below.
    # Divide the list of split sub-contigs into ~500 chunks:
    splitFile $splitSubDir/splitSubs.lst 350 $splitSubDir/splitSubs_
    # cat the split-sub-contig .fa's into multi-record chunk .fa's
    # for para job generation.  Make a lift file for each chunk.
    set chunkDir = trfFaSplit/chunks
    mkdir -p $chunkDir
    rm -f /tmp/makeLft.log
    touch /tmp/makeLft.log
    foreach chunkLst ($splitSubDir/splitSubs_*)
      set chunkNum=`echo $chunkLst | sed -e 's/.*_//g'`
      set chunkLft = $chunkDir/chunk_$chunkNum.lft
      rm -f $chunkLft
      touch $chunkLft
      set lastSubCtg = ""
      foreach splitSubFa (`cat $chunkLst`)
        set splitSub = $splitSubFa:r
        set subCtg = `echo $splitSub | perl -wpe 's/(chr\w+_\d+_\d+)_\d+/$1/'`
        if ("$subCtg" != "$lastSubCtg") then
          echo "subCtg changed from $lastSubCtg to $subCtg; catting $subCtg.lft onto $chunkLft" \
            >> /tmp/makeLft.log
          cat $subCtg.lft >> $chunkLft
          set lastSubCtg = $subCtg
        endif
      end
      cat `cat $chunkLst` > $chunkDir/chunk_$chunkNum.fa
    end
    # Put those files on cluster nodes' /scratch:
    mkdir /scratch/hg/mm2/splitContigChunks
    cp -p $chunkDir/chunk_* /scratch/hg/mm2/splitContigChunks
    # Ask sysadmins for an updateLocal/binrsync

    # Now we're ready to set up the cluster run!
    mkdir -p ~/oo/bed/blatMus
    cd ~/oo/bed/blatMus
    # Use the mouse multi-record chunks created above:
    ls -1 /scratch/hg/mm2/splitContigChunks/chunk_*.fa > mouse.lst
    # Then bundle up the human into pieces of less than 12 meg mostly:
    rm -f bigH smallH
    foreach f (/scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf)
      set size = `ls -l $f | awk '{print $5;}'`
      if ($size < 13000000) then
        echo $f >> smallH
      else
        echo $f >> bigH
      endif
    end
    mkdir hs
    cd hs
    splitFile ../bigH 1 big
    # Note, bigXX is just an empty file that the splitFile program
    # erroneously created.  Remove the last one:
    rm bigXX
    splitFile ../smallH 4 small
    rm smallXXX
    cd ..
    ls -1 hs/* > human.lst
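The gsub template and job-submission commands for this run were garbled in
this copy; only a fragment survives below.  By analogy with the other blat
cluster runs in this doc, the template was presumably 3 lines pairing each
human piece with a mouse chunk - the blat flags and the gensub2 argument
order here are assumptions, not the original settings:

      #LOOP
      /cluster/bin/i386/blat -q=dna -t=dna -mask=lower -qMask=lower {check in line+ $(path1)} {check in line+ $(path2)} {check out line+ psl/$(root1)_$(root2).psl}
      #ENDLOOP

followed by the usual cluster run:

      gensub2 human.lst mouse.lst gsub spec
      para create spec
      para try, para check, para push, para check, ...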
    cat > gsub <>& /tmp/lft.log
    end

    # Then the kinds of lifting we can do all at once:
    # mouse sub-contigs to mouse contigs,
    # mouse contigs to mouse chrom,
    # human contigs to human chrom:
    pslCat -dir -check pslLft \
      | liftUp -type=.psl -pslQ stdout ~/mm2/trfFaSplit/allSubContigs.lft \
          warn stdin \
      | liftUp -type=.psl -pslQ stdout ~/mm2/jkStuff/liftAll.lft warn stdin \
      | liftUp -type=.psl stdout ../../jkStuff/liftAll.lft warn stdin \
      | pslSortAcc nohead chromPile /cluster/store2/temp stdin

    # Get rid of big pile-ups due to contamination as so:
    mkdir chrom
    cd chromPile
    foreach i (*.psl)
      echo $i
      pslUnpile -maxPile=250 $i ../chrom/${i:r}_blatMus.psl
    end

    # Load into database:
    ssh hgwdev
    cd ~/oo/bed/blatMus/chrom
    hgLoadPsl hg12 *.psl

PRODUCING CROSS_SPECIES mRNA ALIGNMENTS (DONE 8/7/02)

Here you align vertebrate mRNAs against the masked genome on the cluster
you set up during the previous step.

o - Make sure that gbpri, gbmam, gbrod, and gbvert are downloaded from
    GenBank into /cluster/store1/genbank.130 (in the GETTING FRESH mRNA AND
    EST SEQUENCE FROM GENBANK step).

o - Process these out of the genbank flat files as so:
      ssh kkstore
      cd /cluster/store1/mrna.130
      faSplit sequence xenoRna.fa 2 xenoRna
      mkdir -p /scratch/hg/mrna.130
      cp /cluster/store1/mrna.130/xenoRna*.* /scratch/hg/mrna.130
    Request a binrsync of /scratch/hg/mrna.130 from the admins.

    Set up the cluster run.  First make sure the genome is in
    kkstore:/scratch/hg/gs.13/build30/trfFa.0730 in RepeatMasked + trf form
    (this is probably done already in the mouse alignment stage).  Also
    make sure /scratch/hg/mrna.130 is loaded with xenoRna.fa.  Then do:
      ssh kkstore
      mkdir -p ~/oo/bed/xenoMrna
      cd ~/oo/bed/xenoMrna
      mkdir psl
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      ls -1S /scratch/hg/mrna.130/xenoRna?*.fa > mrna.lst
      cp ~/hg11/bed/xenoMrna/gsub .
      gensub2 human.lst mrna.lst gsub spec
      para create spec
      ssh kk
      cd ~/oo/bed/xenoMrna
      para try
      para check
      para push
    Do para check until the run is done, doing para push if necessary on
    occasion.

    Sort xeno mRNA alignments as so:
      ssh kkstore
      cd ~/oo/bed/xenoMrna
      pslSort dirs raw.psl /cluster/store2/temp psl
      pslReps raw.psl cooked.psl /dev/null -minAli=0.25
      liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
      pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
      pslCat -dir chrom > xenoMrna.psl
      rm -r chrom raw.psl cooked.psl chrom.psl

    Load into database as so:
      ssh hgwdev
      cd ~/oo/bed/xenoMrna
      hgLoadPsl hg12 xenoMrna.psl -tNameIx
      cd /cluster/store1/mrna.130
      hgLoadRna add hg12 xenoRna.fa xenoRna.ra

    Similarly do xenoEst alignments.  Prepare the est data:
      cd /cluster/store1/mrna.130
      faSplit sequence xenoEst.fa 16 xenoEst
      ssh kkstore
      cd /cluster/store1/gs.13/build30/bed
      mkdir xenoEst
      cd xenoEst
      mkdir psl
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      cp /cluster/store1/mrna.130/xenoEst?*.fa /scratch/hg/mrna.130
      ls -1S /scratch/hg/mrna.130/xenoEst?*.fa > mrna.lst
      cp ~/hg11/bed/xenoEst/gsub .
    Request a binrsync from the admins of kkstore's /scratch/hg/mrna.130.
    When done, do:
      gensub2 human.lst mrna.lst gsub spec
      para create spec
      para push

    Sort xenoEst alignments:
      ssh kkstore
      cd ~/oo/bed/xenoEst
      pslSort dirs raw.psl /cluster/store2/temp psl
      pslReps raw.psl cooked.psl /dev/null -minAli=0.10
      liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
      pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
      pslCat -dir chrom > xenoEst.psl
      rm -r chrom raw.psl cooked.psl chrom.psl

    Load into database as so:
      ssh hgwdev
      cd ~/oo/bed/xenoEst
      hgLoadPsl hg12 xenoEst.psl -tNameIx
      cd /cluster/store1/mrna.130
      hgLoadRna add hg12 xenoEst.fa xenoEst.ra

PRODUCING FISH ALIGNMENTS (DONE 08/05/02)

o - Do fish/human alignments:
      ssh kk
      cd ~/oo/bed
      mkdir blatFish
      cd blatFish
      mkdir psl
      ls -1S /scratch/hg/fish/*.fa > fish.lst
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      # Copy over gsub from previous version.
      gensub2 human.lst fish.lst gsub spec
      para create spec
      para try
    Make sure the jobs are going OK with para check.  Then:
      para push
    Wait about 2 hours and do another para push; do para checks and, if
    necessary, para pushes until done - or use para shove.

o - Sort alignments as so:
      pslCat -dir psl | \
        liftUp -type=.psl stdout ~/oo/jkStuff/liftAll.lft warn stdin | \
        pslSortAcc nohead chrom temp stdin

o - Rename to correspond with tables as so and load into database:
      ssh hgwdev
      cd ~/oo/bed/blatFish/chrom
      rm -f chr*_blatFish.psl
      foreach i (*.psl)
        set r = $i:r
        mv $i ${r}_blatFish.psl
      end
      hgLoadPsl hg12 *.psl
    Now load the fish sequence data:
      hgLoadRna addSeq hg12 /cluster/store3/tetFish/tet*.fa

PRODUCING FUGU ALIGNMENTS (DONE 12/09/02)

o - Do fugu/human alignments:
      ssh kk
      cd ~/oo/bed
      mkdir blatFugu
      cd blatFugu
      mkdir psl
      foreach f (~/hg12/?{,?}/NT_??????/NT_??????.fa)
        set c=$f:t
        mkdir -p psl/$c
      end
      ls -1S /scratch/hg/fugu/split500/*.fa > fugu.lst
      ls -1S /scratch/hg/gs.13/build30/trfFa.0730/*.fa.trf > human.lst
      # Copy over gsub from previous version.
      gensub2 human.lst fugu.lst gsub spec
      para create spec
      para try
    Make sure the jobs are going OK with para check.  Then:
      para push -maxJobs=10000
    Wait about 2 hours and do another para push; do para checks and, if
    necessary, para pushes until done - or use para shove.
o - Sort alignments as so:
      ssh eieio
      cd ~/oo/bed/blatFugu
      pslCat -dir psl/NT_??????.fa | \
        liftUp -type=.psl stdout ~/oo/jkStuff/liftAll.lft warn stdin | \
        pslSortAcc nohead chrom temp stdin

o - Rename to correspond with tables as so and load into database:
      ssh hgwdev
      cd ~/oo/bed/blatFugu/chrom
      rm -f chr*_blatFugu.psl
      foreach i (chr?{,?}{,_random}.psl)
        set r = $i:r
        mv $i ${r}_blatFugu.psl
      end
      hgLoadPsl hg12 *.psl
    Make fugu symlink:
      cd /gbdb/hg12
      mkdir fugu
      cd fugu
      ln -s /cluster/store3/fuguSeq/fugu_v3_mask.fasta \
        /gbdb/hg12/fugu/fugu_v3_mask.fasta
    Now load the Fugu sequence data:
      hgLoadRna addSeq hg12 /gbdb/hg12/fugu/fugu_v3_mask.fasta

TIGR GENE INDEX (DONE 12/18/02)

    mkdir -p ~/hg12/bed/tigr
    cd ~/hg12/bed/tigr
    wget ftp://ftp.tigr.org/private/HGI_ren/TGI_track_HumanGenome_build30.tgz
    gunzip -c TGI*.tgz | tar xvf -
    foreach f (*cattle*)
      set f1 = `echo $f | sed -e 's/cattle/cow/g'`
      mv $f $f1
    end
    foreach o (mouse cow human pig rat)
      setenv O $o
      foreach f (chr*_$o*s)
        tail +2 $f | perl -wpe 's/THC/TC/; s/(TH?C\d+)/$ENV{O}_$1/;' \
          > $f.gff
      end
    end
    ldHgGene -exon=TC hg12 tigrGeneIndex *.gff

LOAD STS MAP (todo)
TODO BY TERRY I BELIEVE - HE WILL UPDATE THIS

    - login to hgwdev
      cd ~/oo/bed
      hg12 < ~/src/hg/lib/stsMap.sql
      mkdir stsMap
      cd stsMap
      bedSort /projects/cc/hg/mapplots/data/tracks/build30/stsMap.bed \
        stsMap.bed
    - Enter the database with the "hg12" command.
    - At the mysql> prompt type in:
        load data local infile 'stsMap.bed' into table stsMap;
    - At the mysql> prompt type quit.

LOAD CHROMOSOME BANDS (todo)
ALSO TODO BY TERRY I BELIEVE

    - login to hgwdev
      cd /cluster/store1/gs.13/build30/bed
      mkdir cytoBands
      cp /projects/cc/hg/mapplots/data/tracks/oo.29/cytobands.bed cytoBands
      cd cytoBands
      hg12 < ~/src/hg/lib/cytoBand.sql
    - Enter the database with the "hg12" command.
    - At the mysql> prompt type in:
        load data local infile 'cytobands.bed' into table cytoBand;
    - At the mysql> prompt type quit.

LOAD MOUSEREF TRACK (todo)

First copy in data from kkstore to ~/oo/bed/mouseRef.  Then substitute
'genome' for the appropriate chromosome in each of the alignment files
(see the sketch below).  Finally do:
    hgRefAlign webb hg12 mouseRef *.alignments
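A minimal sketch of the 'genome' substitution described above, assuming
one alignment file per chromosome named like chr1.alignments (the file
naming is a guess; the real files came from kkstore):

      foreach f (chr*.alignments)
        # replace the literal word 'genome' with this file's chrom name
        set c = $f:r
        sed -e "s/genome/$c/g" $f > $f.tmp
        mv $f.tmp $f
      end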
LOAD AVID MOUSE TRACK (todo)

    ssh cc98
    cd ~/oo/bed
    mkdir avidMouse
    cd avidMouse
    wget http://pipeline.lbl.gov/tableCS-LBNL.txt
    hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
    hgLoadBed hg12 avidRepeat avidRepeat.bed
    hgLoadBed hg12 avidUnique avidUnique.bed

LOAD SNPS (Done; Daryl Thomas July 29, 2002)

    ssh hgwdev
    cd ~/oo/bed
    mkdir snp
    cd snp
    - Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.ncbi.b29.gz
    - Unpack.
      ln -s ../../seq_contig.md .
      calcFlipSnpPos seq_contig.md gp.ncbi.b30 gp.ncbi.b30.flipped
      mv gp.ncbi.b30 gp.ncbi.b30.original
      gzip gp.ncbi.b30.original
      grep RANDOM gp.ncbi.b30.flipped > snpTsc.txt
      grep MIXED gp.ncbi.b30.flipped >> snpTsc.txt
      grep BAC_OVERLAP gp.ncbi.b30.flipped > snpNih.txt
      grep OTHER gp.ncbi.b30.flipped >> snpNih.txt
      awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
      awk -f filter.awk snpNih.txt > snpNih.contig.bed
      liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
      liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
      hgLoadBed hg12 snpTsc snpTsc.bed
      hgLoadBed hg12 snpNih snpNih.bed
    - gzip all of the big files

LOAD CPGISLANDS (DONE 8/9/02)

    ssh kkstore
    mkdir -p ~/oo/bed/cpgIsland
    cd ~/oo/bed/cpgIsland
    - Build the software emailed from Asif Chinwalla
      (achinwal@watson.wustl.edu): copy the tar file to the current
      directory, then:
        tar xvf cpg_dist.tar
        cd cpg_dist
        gcc readseq.c cpg_lh.c -o cpglh.exe
        cd ..
    - cpglh.exe requires hard-masked (N) .fa's.
    - Execute the following loop in tcsh to hard-mask chr*.fa and run
      cpglh:
      foreach c (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 \
                 X Y Un M)
        if (-e ../../$c/chr$c.fa) then
          tr '[a-z]' 'N' < ../../$c/chr$c.fa | sed -e 's/^>NNN/>chr/' \
            > chr$c.fa
          echo masked chr$c.
          ./cpg_dist/cpglh.exe chr$c.fa > chr$c.fa.cpg
          echo Done with chr$c.
        else
          echo ../../$c/chr$c.fa does not exist.
        endif
        if (-e ../../$c/chr${c}_random.fa) then
          tr '[a-z]' 'N' < ../../$c/chr${c}_random.fa | \
            sed -e 's/^>NNN/>chr/' > chr${c}_random.fa
          echo masked chr${c}_random.
          ./cpg_dist/cpglh.exe chr${c}_random.fa > chr${c}_random.fa.cpg
          echo Done with chr${c}_random.
        endif
      end
      rm chr*.fa
    - Copy filter.awk from a previous release:
      cp ~/hg11/bed/cpgIsland/filter.awk .
      awk -f filter.awk chr*.cpg > cpgIsland.bed
      ssh hgwdev
      cd ~/oo/bed/cpgIsland
      hgLoadBed hg12 cpgIsland -tab -noBin \
        -sqlTable=$HOME/kent/src/hg/lib/cpgIsland.sql cpgIsland.bed

LOAD ENSEMBL GENES (Done by Matt 9/20/02)

    cd ~/oo/bed
    mkdir ensembl
    cd ensembl
    Get the ensembl gene data as below:
      GET http://www.ebi.ac.uk/~stabenau/human_8_30.gtf.gz > ensGene.gz
    (The above may only be a temporary location.)
    Get the ensembl protein data from
    http://www.ensembl.org/Homo_sapiens/martview
    Add "chr" to the front of each line in the gene data gtf file to make
    it compatible with ldHgGene:
      ~matt/bin/addchr.pl ensGene.gtf ensembl.gtf
      ldHgGene hg12 ensGene ensembl.gtf

o - Load Ensembl peptides:
    Get them from ensembl as above in the gene section.
    Substitute ENST for ENSP in ensPep with the program called subs;
    edit subs.in to read: ENSP|ENST
      subs -e ensPep.fa > /dev/null
    Run fixPep.pl:
      fixPep.pl ensPep.fa ensembl.pep
      hgPepPred hg12 generic ensPep ensembl.pep

LOAD SANGER 22 Pseudogenes (from hg10 - Done by Robert)

    cd ~/hg12/bed/sanger22
    cp ~/hg10/bed/sanger22/Chr22.3.lx.pseudogene.gff .
    Replace ^chr22 with hg10:chr22 in Chr22.3.lx.pseudogene.gff.
    liftUp -type=.gff pseudo.gff hg12.lft Chr22.3.lx.pseudogene.gff
    ldHgGene hg12 sanger22pseudo pseudo.gff

LOAD SANGER22 GENES (DONE 9/27/02 by MATT)

    cd ~/oo/bed
    mkdir sanger22
    cd sanger22
    (not sure where these files were downloaded from)
    grep -v Pseudogene Chr22*.genes.gff | hgSanger22 hg12 stdin \
      Chr22*.cds.gff *.genes.dna *.cds.pep 0 | \
      ldHgGene hg12 sanger22pseudo stdin
    Note: this creates sanger22extras, but doesn't currently create a
    correct sanger22 table, which is replaced in the next steps:
    sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
      | ldHgGene hg12 sanger22 stdin
    sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
      | ldHgGene hg12 sanger22pseudo stdin
    hgPepPred hg12 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)

    # First download files from James Gilbert's email to ~/oo/bed/sanger20
    # and go to that directory while logged onto hgwdev.  Then:
    grep -v Pseudogene chr_20*.gtf | ldHgGene hg12 sanger20 stdin
    hgSanger20 hg12 *.gtf *.info

LOAD RNAGENES (DONE 9/10/02)

    ssh hgwdev
    mkdir -p ~/hg12/bed/rnaGene
    cd ~/hg12/bed/rnaGene
    wget ftp://ftp.genetics.wustl.edu/pub/eddy/pickup/ncRNA-chr7-20020621.gff.gz
    gunzip -c ncRNA-chr7-20020621.gff.gz \
      | grep -v "^#" \
      | perl -wpe 's/chrom(\d+)\.\w+\.fsa/chr$1/' \
      > chr7_ncrna.gff

    # NOTE: just for build30, chr7 NCBI coords differ slightly from
    # chr7 WUSTL coords.
    LaDeana Hillier's instructions for translation:
    #  ours                  ncbi build30
    #  ----                  -------------
    #  1->16379450         = 1->16379450
    #  nothing             = 16379451 to 16379650 (because ncbi bp 16379451
    #                        to 16379650 are identical to ncbi 16379651 to
    #                        16379850)
    #  16379451->157432593 = 16379651->157432793
    # The following fix should not be required for future builds!
    perl -we 'while (<>) { \
        @words = split("\t"); \
        if ($words[3] > 16379450) { $words[3] += 200; } \
        if ($words[4] > 16379450) { $words[4] += 200; } \
        print join("\t", @words); \
      } \
      ' chr7_ncrna.gff \
      > chr7_ncrna-fixed.gff
    echo 'drop table hgRnaGene;' | hgsql hg12
    hgsql hg12 < ~/kent/src/hg/lib/rnaGene.sql
    hgRnaGenes hg12 chr7_ncrna-fixed.gff

LOAD EXOFISH (todo)

    - login to hgwdev
    - cd /cluster/store1/gs.13/build30/bed
    - mkdir exoFish
    - cd exoFish
    - hg12 < ~kent/src/hg/lib/exoFish.sql
    - Put the email attachment from Olivier Jaillon
      (ojaaillon@genoscope.cns.fr) into
      /cluster/store1/gs.13/build30/bed/exoFish/all_maping_ecore
    - awk -f filter.awk all_maping_ecore > exoFish.bed
    - hgLoadBed hg12 exoFish exoFish.bed

LOAD MOUSE SYNTENY (DONE 8/22/02)

    ssh hgwdev
    mkdir -p ~/oo/bed/mouseSyn
    cd ~/oo/bed/mouseSyn
    # Saved Michael Kamal's email attachment:
    #   allDirectedSegmentsBySize300.txt
    # Process the .txt file (minus header) into a bed 6 + file:
    grep -v "^#" allDirectedSegmentsBySize300.txt \
      | awk '($6 > $5) {printf "%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\t%s\n", $4, $5-1, $6, $1, 999, $7, $2-1, $3, $8;} \
             ($5 > $6) {printf "%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\t%s\n", $4, $6-1, $5, $1, 999, $7, $2-1, $3, $8;}' \
      > mouseSynWhd.bed
    hgLoadBed -noBin -sqlTable=$HOME/kent/src/hg/lib/mouseSynWhd.sql \
      hg12 mouseSynWhd mouseSynWhd.bed

LOAD GENIE (todo)

    - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
    - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
    - ldHgGene hg12 genieAlt predChrom.gtf
    - cat */ctg*/ctg*.affymetrix.aa > pred.aa
    - hgPepPred hg12 genie pred.aa
    - hg12
      mysql> delete from genieAlt where name like 'RS.%';
      mysql> delete from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (DONE 9/12/02)

    - ln -s /cluster/store1/gs.13/build30/ ~/oo
    - cd ~/oo/bed
    - mkdir softberry
    - cd softberry
    - wget ftp://www.softberry.com/pub/sc_fgenesh_jun02/sb_fgenesh_jun02.tar.gz
      ldHgGene hg12 softberryGene chr*.gff
      hgPepPred hg12 softberry *.pro
      hgSoftberryHom hg12 *.pro

LOAD GENEID GENES (DONE 11/20/02)

    mkdir ~/oo/bed/geneid
    cd ~/oo/bed/geneid
    mkdir download
    cd download
    # download .gff and prot files for each chrom (and _random):
    foreach f (~/oo/?{,?}/*.fa.out)
      set c=$f:t:r:r
      wget http://www1.imim.es/genepredictions/H.sapiens/golden_path_20020628/geneid_v1.1/$c.gff
      wget http://www1.imim.es/genepredictions/H.sapiens/golden_path_20020628/geneid_v1.1/$c.prot
    end
    # This time around, their "gff" uses screwy keywords.  Clean up:
    # Also replace gi|17981852|ref|NC_001807.4| with chrM:
    foreach f (chr*.gff)
      perl -wpi.bak -e 's/(First|Terminal|Internal|Single)/CDS/' $f
    end
    foreach f (chr*.{gff,prot})
      perl -wpi.bak -e 's/gi\|17981852\|ref\|NC_001807.4\|/chrM/g' $f
    end
    cd ..
    ldHgGene hg12 geneid download/*.gff -exon=CDS
    hgPepPred hg12 generic geneidPep download/*.prot

LOAD ACEMBLY (DONE 9/10/02)

    mkdir -p ~/oo/bed/acembly
    cd ~/oo/bed/acembly
    - Get acembly*gene.gff from Jean and Danielle Thierry-Mieg:
    wget ftp://ftp.ncbi.nlm.nih.gov/repository/acedb/ncbi_30.human.genes/acembly.ncbi_30.genes.gff.tar.gz
    wget ftp://ftp.ncbi.nlm.nih.gov/repository/acedb/ncbi_30.human.genes/acembly.ncbi_30.genes.proteins.fasta.tar.gz
    gunzip -c acembly.ncbi_30.genes.gff.tar.gz | tar xvf -
    gunzip -c acembly.ncbi_30.genes.proteins.fasta.tar.gz | tar xvf -
    cd acembly.ncbi_30.genes.gff
    #- Strip out floating-contig features (lines with *|NT_?????? as the
    #  chr ID), and add the 'chr' prefix to all chr nums:
    foreach f (acemblygenes.*.gff)
      egrep -v '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' $f | \
        perl -wpe 's/^(\w)/chr$1/' > $f:r-fixed.gff
    end
    #- Save just the floating-contig features to different files for
    #- lifting, and lift up the floating-contig features to chr*_random
    #- coords:
    foreach c ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 \
                X Y Un M)
      egrep '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' \
        acemblygenes.$c.gff | \
        perl -wpe 's/^(\w+)\|(\w+)/$1\/$2/' > $c-random-ctg.gff
      if (-e ../../../$c/lift/random.lft) then
        liftUp $c-random-lifted.gff ../../../$c/lift/random.lft warn \
          $c-random-ctg.gff
      endif
    end
    cd ../acembly.ncbi_30.genes.proteins.fasta
    #- Remove G_t*_ prefixes from acemblyproteins.*.fasta:
    foreach f (acemblyproteins.*.fasta)
      perl -wpe 's/^\>G_t[\da-zA-Z]+_/\>/' $f > $f:r-fixed.fasta
    end
    #- Load into database as so:
    cd ..
    ldHgGene hg12 acembly acembly.ncbi_30.genes.gff/*-fixed.gff \
      acembly.ncbi_30.genes.gff/*-lifted.gff
    hgPepPred hg12 generic acemblyPep \
      acembly.ncbi_30.genes.proteins.fasta/*-fixed.fasta

LOAD GENOMIC DUPES (todo)

o - Load genomic dupes:
      ssh hgwdev
      cd ~/oo/bed
      mkdir genomicDups
      cd genomicDups
      wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
      unzip *.zip
      awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
      mysql -u hgcat -pbigSECRET hg12 < ~/src/hg/lib/genomicDups.sql
      hgLoadBed hg12 -oldTable genomicDups genomicDups.bed

LOAD NCI60 (Done: Chuck Sugnet 7/24/02)

o - # ssh hgwdev
    cd /projects/cc/hg/mapplots/data/NCI60/dross_arrays_nci60/
    mkdir hg12
    cd hg12
    findStanAlignments hg12 ../BC2.txt.ns \
      ../../image/cumulative_plates.011204.list.human hg12.image.psl \
      >& hg12.image.log
    cp ../experimentOrder.txt ./
    sed -e 's/ / \.\.\//g' < experimentOrder.txt > epo.txt
    stanToBedAndExpRecs hg12.image.good.psl hg12.nci60.exp hg12.nci60.bed \
      `cat epo.txt`
    hg12S -A < ../../scripts/nci60.sql
    echo "load data local infile 'hg12.nci60.bed' into table nci60" | \
      hg12S -A
    mkdir /cluster/store3/gs.13/build30/bed/nci60
    mv hg12.nci60.bed /cluster/store3/gs.13/build30/bed/nci60
    rm *.psl

LOAD AFFYRATIO [GNF] (Done: Chuck Sugnet 7/24/02)

o - # ssh hgwdev
    cd /cluster/store1/sugnet/
    mkdir gs.13
    mkdir gs.13/build30
    mkdir gs.13/build30/affyGnf
    cd gs.13/build30/affyGnf
    cp /projects/compbiodata/microarray/affyGnf/sequences/HG-U95Av2_target ./
    ls -1 /cluster/store3/gs.13/build30/trfFa.0730/ > allctg.lst
    echo "/cluster/store1/sugnet/gs.13/build30/affyGnf/HG-U95Av2_target" \
      > affy.lst
    echo '#LOOP\n/cluster/bin/i386/blat -mask=lower -minIdentity=95 -ooc=/cluster/store3/gs.13/build30/jkStuff/post.refCheck.old/11.ooc /cluster/store3/gs.13/build30/trfFa.0730/$(path1) $(path2) {check out line+ psl/$(root1)_$(root2).psl}\n#ENDLOOP' > template.sub
    gensub2 allctg.lst affy.lst template.sub para.spec
    # ssh kkr1u00
    para create para.spec
    para try
    para check
    para push
    # exit kkr1u00
    pslSort dirs hg12.affy.psl tmp psl >& pslSort.log
    liftUp hg12.affy.lifted.psl \
      /cluster/store3/gs.13/build30/jkStuff/liftAll.lft warn hg12.affy.psl
    pslAffySelect seqIdent=.95 basePct=.95 in=hg12.affy.lifted.psl \
      out=hg12.affy.pAffySelect.95.95.psl
    affyPslAndAtlasToBed hg12.affy.pAffySelect.95.95.psl \
      /projects/compbiodata/microarray/affyGnf/human_atlas_U95_gnf.noquotes.txt \
      affyRatio.bed affyRatio.exr >& affyPslAndAtlasToBed.log
    hg12S -A

CREATE SAGE TRACK

    perl -e 'while(<>){chomp($_);@p=split/\t/,$_; print "$p[2]\t$p[3]\t$p[0]\n"}' \
      < SAGEmap_tag_ug-rel_Hs | sort | sed -e 's/ /_/g' \
      > SAGEmap_ug_tag-rel_Hs
    cd -
    createSageSummary ../map/Hs/NlaIII/SAGEmap_ug_tag-rel_Hs \
      tagExpArrays.tab sageSummary.sage
    # Create the uniGene alignments:
    #   /cluster/store1/sugnet/gs.13/build30/uniGene/hg12.uniGene.lifted.pslReps.psl
    addAveMedScoreToPsls \
      /cluster/store1/sugnet/gs.13/build30/uniGene/hg12.uniGene.lifted.pslReps.psl \
      sageSummary.sage uniGene.pslWScores.psl
    /cluster/home/kent/bin/i386/hgLoadBed hg12 uniGene_2 uniGene.wscores.bed
    hg12S -A < ~/kk/jk/hg/lib/sage.sql
    echo "load data local infile 'sageSummary.sage' into table sage" | \
      hg12S -A
    cd ../info
    ../../scripts/parseRecords.pl ../extr/expList.tab > sageExp.tab
    hg12S -A < ~/kk/jk/hg/lib/sageExp.sql
    echo "load data local infile 'sageExp.tab' into table sageExp" | \
      hg12S -A
    Update kent/src/hg/makeDb/hgTrackDb/hgRoot/uniGene_2.html with the
    current uniGene date.

MAKE UNIGENE ALIGNMENTS (Done: Chuck Sugnet 7/26/02)

o - cd /projects/cc/hg/sugnet/uniGene
    ftp ftp.ncbi.nih.gov
      user: anonymous
      password: email
      cd repository/UniGene/
      prompt
      mget Hs.info Hs.seq.uniq.gz Hs.data.gz
      exit
    # Cut out the unigene build number and append it to uniGene:
    mkdir uniGene.`perl -e '$t=<>;$t=~/\#(\d+)/;print "$1\n";' < Hs.info`
    # new uniGene directory = uniGene.153
    mv Hs.* uniGene.153
    cd uniGene.153
    gunzip Hs.seq.uniq.gz
    gunzip Hs.data.gz
    ../countSeqsInCluster.pl Hs.data counts.tab
    ../parseUnigene.pl Hs.seq.uniq Hs.seq.uniq.simpleHeader.fa \
      leftoverData.tab
    # ssh kkstore
    cp /projects/cc/hg/sugnet/uniGene/uniGene.153/Hs.seq.uniq.simpleHeader.fa \
      /scratch/hg/gs.13/build30/uniGene
    # email cluster-admin to push /scratch/hg/gs.13/build30/uniGene to
    # the cluster
    cd /cluster/store1/sugnet/gs.13/build30
    mkdir uniGene
    cd uniGene
    ls -1 /cluster/store3/gs.13/build30/trfFa.0730/ > allctg.lst
    echo "/scratch/hg/gs.13/build30/uniGene/Hs.seq.uniq.simpleHeader.fa" \
      > uniGene.lst
    echo '#LOOP\n/cluster/bin/i386/blat -mask=lower -minIdentity=95 -ooc=/scratch/hg/h/11.ooc /cluster/store3/gs.13/build30/trfFa.0730/$(path1) $(path2) {check out line+ psl/$(root1)_$(root2).psl}\n#ENDLOOP' > template.sub
    gensub2 allctg.lst uniGene.lst template.sub para.spec
    # ssh kk
    para create para.spec
    mkdir psl
    para try
    para check
    para push
    pslSort dirs hg12.uniGene.psl tmp psl >& pslSort.log
    liftUp hg12.uniGene.lifted.psl \
      /cluster/store3/gs.13/build30/jkStuff/liftAll.lft carry hg12.uniGene.psl
    pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 \
      hg12.uniGene.lifted.psl hg12.uniGene.lifted.pslReps.psl /dev/null
    # exit kk and use hg12.uniGene.lifted.pslReps.psl for building the
    # SAGE track.

FAKING DATA FROM PREVIOUS VERSION
(This is just for use until the proper track arrives.  Rescues about 97%
of the data.  Just an experiment, not really followed through on.)
FAKING DATA FROM PREVIOUS VERSION
(This is just for use until the proper track arrives.  Rescues about 97% of
the data.  Just an experiment, not really followed through on.)

o - Rescuing STS track:
    - log onto hgwdev
    - mkdir ~/oo/rescue
    - cd !$
    - mkdir sts
    - cd sts
    - bedDown hg3 mapGenethon sts.fa sts.tab
    - echo ~/oo/sts.fa > fa.lst
    - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g
    - log onto cc01
    - cd ~/oo/rescue/sts
    - split all.con into 3 parts and condor_submit each part
    - wait for assembly to finish
    - cd psl
    - mkdir all
    - ln ?/*.psl ??/*.psl *.psl all
    - pslSort dirs raw.psl temp all
    - pslReps raw.psl contig.psl /dev/null
    - rm raw.psl
    - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
    - rm contig.psl
    - mv chrom.psl ../convert.psl

LOADING MOUSE MM2 BLASTZ ALIGNMENTS FROM PENN STATE: (DONE 10/30/02)
(DONE 8/18/02 with 08-15-ASH run)

    # create psl files for each per-contig lav file
    ssh kkstore
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base
    set tbl="blastzMm2"
    foreach chrdir ($base/lav/chr*)
      set chr=$chrdir:t
      set outdir=lav-psl/$chr
      mkdir -p $outdir
      foreach lav ($chrdir/*.lav)
        lavToPsl -target-strand=+ $lav $outdir/$lav:t:r.psl
      end
    end

    # Substitute scratch/... path with chrom name:
    foreach f (lav-psl/*/*.psl)
      perl -wpi -e 's@/?scratch/hg/gs.13/build30/[\w\.-_]+/+(chr.+)\.nib:\d+-\d+@$1@; s@/?scratch/hg/mm2/[\w\.-_]+/+(chr.+)\.nib:\d+-\d+@$1@; s@:___@@g;' $f
    end

    # Convert to per-chromosome files, sort, and add sequence
    # kkstore's /tmp might not have enough space; try kk, kkr1u00 etc.
    mkdir -p lav-xa
    foreach chrdir (lav-psl/*)
      set chr=$chrdir:t
      pslCat -check -nohead -ext=.psl -dir lav-psl/$chr \
        | sort -k 15n -k 16n -T /cluster/store2/temp \
        | liftUp -type=.psl -pslQ -nohead stdout \
            /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin \
        > lav-xa/${chr}_${tbl}.psl
    end

    # Load tables
    ssh hgwdev
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base/lav-xa
    hgLoadPsl hg12 *_${tbl}.psl

MAKING THE BLASTZBESTMOUSE TRACK FROM PENN STATE MM2 AXT FILES (DONE 9/3/02)

    # When Scott Schwartz is done generating .axt's for the blastz mm2
    # alignments (takes longer than the .lav used for blastzMm2 above):

    # Create tSizes (human chrom sizes) and qSizes (mouse) for axtToPsl.
    # In mysql:
      use hg12;
      select chrom,size from chromInfo;
      use mm2;
      select chrom,size from chromInfo;
    # Edit the results of the first select into a tab-separated tSizes; edit
    # the results of the second select into a tab-separated qSizes.
    # (a non-interactive hgsql alternative is sketched below, just before the
    #  table loads)

    # Consolidate AXT files to chrom level, sort, pick best, make psl.
    ssh kkstore
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base
    mkdir -p axtChrom axtBest pslBest
    foreach chrdir (lav/chr*)
      set chr=$chrdir:t
      gunzip -c $chrdir/*.axt.gz | \
        liftUp -type=.axt -axtQ axtChrom/$chr.lifted.axt \
          /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin
      axtBest axtChrom/$chr.lifted.axt $chr axtBest/$chr.axt -minScore=300
      axtToPsl axtBest/$chr.axt tSizes qSizes pslBest/${chr}_blastzBestMouse.psl
      echo Done with $chr.
    end

    # If a chromosome has so many alignments that axtBest runs out of mem,
    # run axtBest in 2 passes to reduce size of the input to final axtBest:
    foreach chrdir (lav/chr19)
      set chr=$chrdir:t
      foreach a ($chrdir/*.axt.gz)
        gunzip $a
        axtBest $a:r $chr $a:r:r.axtBest
        gzip $a:r $a:r:r.axtBest
      end
      gunzip -c $chrdir/*.axtBest.gz | \
        liftUp -type=.axt -axtQ axtChrom/$chr.lifted.axt \
          /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin
      axtBest axtChrom/$chr.lifted.axt $chr axtBest/$chr.axt
      axtToPsl axtBest/$chr.axt tSizes qSizes pslBest/${chr}_blastzBestMouse.psl
      echo Done with $chr.
    end
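    As mentioned above, the size files can also be produced non-interactively.
    A sketch, assuming hgsql passes the usual mysql -N (skip column names) and
    -e (execute statement) options through to mysql:
      hgsql -N -e 'select chrom,size from chromInfo' hg12 > tSizes
      hgsql -N -e 'select chrom,size from chromInfo' mm2 > qSizes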
    # Load tables
    ssh hgwdev
    set base="/cluster/store3/gs.13/build30/blastz.mm2.2002-08-24"
    cd $base/pslBest
    hgLoadPsl hg12 *.psl

    # Make /gbdb links and add them to the axtInfo table:
    mkdir -p /gbdb/hg12/axtBestMm2
    cd /gbdb/hg12/axtBestMm2
    foreach f ($base/axtBest/chr*.axt)
      ln -s $f .
    end
    cd $base/axtBest
    rm -f axtInfoInserts.sql
    touch axtInfoInserts.sql
    foreach f (/gbdb/hg12/axtBestMm2/chr*.axt)
      set chr=$f:t:r
      echo "INSERT INTO axtInfo VALUES ('mm2','Blastz Best in Genome','$chr','$f');" \
        >> axtInfoInserts.sql
    end
    hgsql hg12 < ~/kent/src/hg/lib/axtInfo.sql
    hgsql hg12 < axtInfoInserts.sql

EXTRACTING DYNAMIC MASKING LOCATIONS FROM BLASTZ LAV (TODO)

    # Dynamic masking = splicing out of lineage-specific repeats during
    # first phase of blastz alignments
    ssh kkstore
    cd ~/oo/blastz-whatever-path
    set tbl=blastzDynMaskMm2
    # Dig ranges out of lavs; collapse overlapping ranges.
    rm -rf $tbl.bed
    touch $tbl.bed
    foreach c (lav/chr*)
      set chr=$c:t
      awk -v c=$chr '/ x/ {print c "\t" $2-1 "\t" $3}' $c/*.lav \
        | sort -n -k 2,2 | uniq \
        | perl -we 'while (<>) { \
            chomp; \
            @words = split(/\t/); \
            $chrom = $words[0]; \
            $start = $words[1]; \
            $end   = $words[2]; \
            if (! defined $prevChrom) { \
              $prevChrom = $chrom; \
              $rStart = $start; \
              $rEnd = $end; \
            } \
            if (($start > $rEnd) || ($chrom ne $prevChrom)) { \
              print "$prevChrom\t$rStart\t$rEnd\n"; \
              $rStart = $start; \
              $rEnd = $end; \
            } \
            elsif ($end > $rEnd) { $rEnd = $end; } \
            $prevChrom = $chrom; \
          } \
          print "$prevChrom\t$rStart\t$rEnd\n" \
            if (defined $prevChrom); \
          ' \
        >> $tbl.bed
    end

    # load table
    ssh hgwdev
    cd ~/oo/blastz-whatever-path
    set tbl=blastzDynMaskMm2
    hgLoadBed -noBin hg12 $tbl $tbl.bed

MAKING THE AXTTIGHT FROM AXTBEST (DONE 10/30/02)

    # After creating axtBest alignments above, use subsetAxt to get axtTight:
    ssh kkstore
    cd ~/oo/blastz.mm2.2002-08-24/axtBest
    mkdir -p ../axtTight
    foreach i (*.axt)
      subsetAxt $i ../axtTight/$i \
        ~kent/src/hg/mouseStuff/subsetAxt/coding.mat 3400
    end

    # translate to psl
    cd ../axtTight
    mkdir -p ../pslTight
    foreach i (*.axt)
      set c = $i:r
      axtToPsl $i ../tSizes ../qSizes ../pslTight/${c}_blastzTightMouse.psl
    end

    # Load tables into database
    ssh hgwdev
    cd ~/oo/blastz.mm2.2002-08-24/pslTight
    hgLoadPsl hg12 chr*_blastzTightMouse.psl

MITOCHONDRIAL DNA PSEUDO-CHROMOSOME - TODO

    Download the fasta file from
      http://www.gen.emory.edu/MITOMAP/mitomapRCRS.fasta
    Put it in /cluster/store1/mrna.130
    ssh hgwdev
    cd ~/oo
    mkdir M
    cp /cluster/store1/mrna.130/mitomapRCRS.fasta M/chrM.fa
    Edit jkStuff/makeNib.sh to make sure it also has the "M" directory in its
    file list.
    tcsh jkStuff/makeNib.sh
    hgNibSeq -preMadeNib hg12 /cluster/store1/gs.13/build30/nib ?/chr*.fa ??/chr*.fa

TWINSCAN GENE PREDICTIONS (DONE 8/22/02)

    mkdir -p ~/oo/bed/twinscan
    cd ~/oo/bed/twinscan
    wget http://genes.cs.wustl.edu/NCBI30/gtf/gtf.tgz
    wget http://genes.cs.wustl.edu/NCBI30/ptx/ptx.tgz
    gunzip -c gtf.tgz | tar xvf -
    gunzip -c ptx.tgz | tar xvf -
    ldHgGene hg12 twinscan chr*.gtf -exon=CDS
    - pare the protein FASTA headers down to just the id:
    foreach c (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y)
      perl -wpe 's/^\>.*\s+source_id\s*\=\s*(\S+).*$/\>$1/;' < \
        chr$c.ptx > chr$c-fixed.fa
    end
    hgPepPred hg12 generic twinscanPep chr*-fixed.fa

LOAD CHIMP DATA
o Load Ingo Ebersberger's chimp BLAT alignments
    cd ~/oo
    mkdir bed/chimpBlat
    cd bed/chimpBlat
    #!/bin/sh
    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
    do
      wget http://email.eva.mpg.de/~ebersber/custom_track_chimp/MPI-sg_jun02/chr${i}_gp_F25Jun02.psl
    done
    Remove the first line from each psl file to prepare them for pslCat, using
    the fixFile.sh shell script.
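    fixFile.sh itself is not reproduced in this doc; a minimal sketch of what
    it presumably does (drop the first line of each psl in the current
    directory, using the portable tail -n +2 idiom):
      #!/bin/sh
      # fixFile.sh (sketch): strip the one-line header from every psl
      # so pslCat can concatenate them cleanly
      for f in *.psl
      do
        tail -n +2 $f > $f.tmp && mv $f.tmp $f
      done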
    ./fixFile.sh
    pslCat *.psl > chimpBlat.psl
    hgLoadPsl hg12 chimpBlat.psl

o Load the chimp BAC data
    #!/bin/sh
    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
    do
      wget http://email.eva.mpg.de/~ebersber/custom_track_chimp/Riken-be_jun02/chr${i}_gp_F25Jun02.psl
    done
    Remove the first line from each psl file to prepare them for pslCat, using
    the fixFile.sh shell script.
    ./fixFile.sh
    pslCat *.psl > chimpBac.psl
    hgLoadPsl hg12 chimpBac.psl

MAKING THE DOWNLOADABLE DATABASE FILES - DONE

    ssh hgwdev
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002/chromosomes
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mkdir /usr/local/apache/htdocs/goldenPath/28jun2002/database

o Zip up the chromosomes individually
    ssh kkstore
    # (we use kkstore because no NFS traffic via kkstore = faster data transfer)
    cd ~/oo
    # In tcsh run this script:
    tcsh
    foreach i (?{,?}/chr*.fa)
      echo zip $i:r.zip $i
      zip $i:r.zip $i
    end
    Then do:
    ssh hgwdev
    cd ~/oo
    mv ?{,?}/chr*.zip /usr/local/apache/htdocs/goldenPath/28jun2002/chromosomes
    Request that the admins push this to hgwbeta.

o Make the big zips
    ssh kkstore
    cd ~/oo
    zip chromAgp.zip ?{,?}/chr*.agp
    zip chromFa.zip ?{,?}/chr*.fa
    zip chromOut.zip ?{,?}/chr*.out
    zip contigAgp.zip ?{,?}/NT_??????/NT_??????.agp
    zip contigFa.zip ?{,?}/NT_??????/NT_??????.fa
    zip contigOut.zip ?{,?}/NT_??????/NT_??????.fa.out
    zip liftAll.zip jkStuff/liftAll.lft
    zip mrna.zip /cluster/store1/mrna.130/mrna.fa

    ssh hgwdev
    cd ~/oo
    # Move all the zips to the web server dirs
    mv chromAgp.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv chromFa.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv chromOut.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv contigAgp.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv contigFa.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv contigOut.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv liftAll.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    mv mrna.zip /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
    Request that the admins push all this to hgwbeta.
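    Before requesting the push, an optional integrity check of the zips
    (a sketch; unzip -t tests archive CRCs without extracting).  In tcsh on
    hgwdev:
      cd /usr/local/apache/htdocs/goldenPath/28jun2002/bigZips
      foreach z (*.zip)
        unzip -t $z > /dev/null && echo OK: $z
      end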
o Dump the database - DON'T DO THIS; IT IS HERE FOR REFERENCE ONLY.  IT IS
  DONE AUTOMATICALLY BY A PAUL T. SCRIPT ON THE PRODUCTION MACHINES.
    ssh hgwbeta
    We dump the database on hgwbeta in order to dump only the most accurate
    database state.  There is one trick here: mysqldump runs as the mysql
    user, so the directory you dump into must be writable by that user.
    Here's what to do:
    cd /var/tmp
    mkdir hg12-dump
    chmod 777 hg12-dump  (since you aren't root this is quickest)
    cd hg12-dump
    mysqldump --user=hguser --password=hguserstuff --all --tab=. hg12
    Then that directory will quickly fill with .sql and .txt files.
    When it is done, do:
    cd /var/tmp/hg12-dump
    gzip *.txt
    mv * /usr/local/apache/htdocs/goldenPath/28jun2002/database

- Make database.zip
    ssh hgwbeta
    cd /usr/local/apache/htdocs/goldenPath/28jun2002/database
    zip ../bigZips/database.zip *

SGP GENE PREDICTIONS (DONE 01/31/03)

    mkdir -p ~/hg12/bed/sgp/download
    cd ~/hg12/bed/sgp/download
    foreach f (~/hg12/?{,?}/chr?{,?}{,_random}.fa)
      set chr = $f:t:r
      wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/$chr.gtf
      wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/$chr.prot
    end
    wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/chrUn.gtf -O chrUn_random.gtf
    wget http://genome.imim.es/genepredictions/H.sapiens/golden_path_20020628/SGP/chrUn.prot -O chrUn_random.prot
    # Add missing .1 to protein id's
    foreach f (*.prot)
      perl -wpe 's/^(>chr\w+)$/$1.1/' $f > $f:r-fixed.prot
    end
    cd ..
    ldHgGene hg12 sgpGene download/*.gtf -exon=CDS
    hgPepPred hg12 generic sgpPep download/*-fixed.prot

ALIGNED ANCIENT REPEATS FROM MOUSE BLASTZ

    cd ~/oo/bed
    mkdir aarMm2
    cd aarMm2
    set mmdir=../../blastz.mm2.2002-08-01
    foreach aar ($mmdir/aar/*.aar.gz)
      zcat $aar | aarToAxt | axtToPsl stdin $mmdir/H.len $mmdir/M.len stdout \
        | liftUp -type=.psl -nohead -pslQ stdout $mmdir/liftAllMm2.lft warn stdin \
        > chr$aar:t:r:r_aarMm2.psl
    end
    hgLoadPsl hg12 *.psl

ALIGNMENT COUNTS FOR WIGGLE TRACK

    # this needs to be updated to reflect the full process.
    - Generate BED table of AARs used to select regions:
    cat ../bed/aarMm2/*.psl \
      | awk 'BEGIN{OFS="\t"} {print $14,$16,$17,"aar"}' > aarMm2.bed
    - Generate background counts with windows that contain 6kb of counts, with
      a maximum window size of 512kb, sliding the windows by 1kb:
    foreach axt (../../blastz.mm2.2002-08-01/axtBest/chr*.axt)
      set chr=$axt:t:r
      set tab=$chr.6kb-aar.cnts    (??? need better name ???)
      hgCountAlign -selectBed=aarMm2.bed -winSize=512000 -winSlide=1000 \
        -fixedNumCounts=6000 -countCoords $axt $tab
    end
    - Generate counts for AARs with 50b windows, slide by 5b:
    foreach axt (../../blastz.mm2.2002-08-01/axtBest/chr*.axt)
      set chr=$axt:t:r
      set tab=$chr.50b-aar.cnts    (??? need better name ???)
      hgCountAlign -selectBed=aarMm2.bed -winSize=50 -winSlide=5 $axt $tab
    end
    - Generate counts for all with 50b windows, slide by 5b:
    foreach axt (../../blastz.mm2.2002-08-01/axtBest/chr*.axt)
      set chr=$axt:t:r
      set tab=$chr.50b.cnts    (??? need better name ???)
      hgCountAlign -winSize=50 -winSlide=5 $axt $tab
    end

MAKING AND STORING mRNA AND EST ALIGNMENTS (DONE w/ mrna.130)

o - ssh to kkstore
    mkdir -p /cluster/store1/gs.13/build30/bed/refFull
    cd /cluster/store1/gs.13/build30/bed/refFull
    Download the sequence:
    wget ftp://blue3.ims.u-tokyo.ac.jp/pub/db/hgc/dbtss/ref-full.fa.gz
    mv ref-full.fa.gz dbtss.fa.gz
    Gunzip it and split the ref-full fasta into about 50 pieces:
    gunzip dbtss.fa.gz
    faSplit sequence dbtss.fa 50 splitDbtss
    mkdir /scratch/hg/dbtss
    cp splitDbtss* /scratch/hg/dbtss/
    ls -1S /scratch/hg/gs.13/build30/contig.0729/*.fa > genome.lst
    ls -1S /scratch/hg/dbtss/split*.fa > dbtss.lst

o - Request the admins to do a binrsync to the cluster of /scratch/hg/dbtss

o - Use BLAT to generate dbtss alignments as so:
    Make sure that /scratch/hg/gs.13/build30/contig/ is loaded with NT_*.fa
    and pushed to the cluster nodes.
    ssh kk
    cd /cluster/store1/gs.13/build30/bed/dbtss
    mkdir -p psl
    # run mkdirs.sh script to create subdirs in the psl directory
    # in order to modularize the blat job.
    gensub2 genome.lst dbtss.lst gsub spec
    para create spec
    Now run a para try/push and para check in each one.
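    An optional filesystem-level completeness check after the run (a sketch;
    every line of spec is one job, and each job writes exactly one psl under a
    psl/ subdir, so the two counts below should match):
      wc -l < spec
      find psl -name '*.psl' | wc -l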
o - Process dbtss alignments into near best in genome.
    cd ~/oo/bed
    cd dbtss
    pslSort dirs raw.psl /tmp psl/*
    pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 \
      raw.psl contig.psl /dev/null
    liftUp -nohead all_dbtss.psl ../../jkStuff/liftAll.lft carry contig.psl
    pslSortAcc nohead chrom /tmp all_dbtss.psl

o - Load dbtss alignments into database
    ssh hgwdev
    cd /cluster/store1/gs.13/build30/bed/dbtss
    pslCat -dir chrom > dbtssAli.psl
    hgLoadPsl hg12 -tNameIx dbtssAli.psl

LOAD SLAM GENES

    cd /cluster/store3/gs.13/build30/bed
    mkdir slam
    cd slam
    wget http://bio.math.berkeley.edu/slam/mouse/gff/UCSC/hsCDS.gff.gz
    wget http://bio.math.berkeley.edu/slam/mouse/gff/UCSC/hsCNS.gff.gz
    gunzip *
    ldHgGene -exon=CDS hg12 slam hsCDS.gff
    mv genePred.tab genePred.hg12
    awk '{print $1,$4,$5,$10,$12}' hsCNS.gff > hsCNS.bed
    sed -e 's/;//g' -e 's/"//g' hsCNS.bed > hsCNS.bed.2
    sort -n -k 5,5 hsCNS.bed.2 > hsCNS.bed.sort
    examine head and tail of sorted file for range of scores
    rm hsCNS.bed.sort
    size.pl < hsCNS.bed.2 > hsCNS.bed.2.size
    sort -n -k 2,2 hsCNS.bed.2.size > hsCNS.bed.2.size.sort
    examine head and tail of sorted file for range of sizes
    rm hsCNS.bed.2.size.sort
    expand.pl < hsCNS.bed.2 > hsCNS.bed.2.expand
    hgLoadBed -tab hg12 slamNonCoding hsCNS.bed.2.expand

CREATING THE humMusL SAMPLE TRACK (a.k.a WIGGLE TRACK)
------------------------------------------------------
o - refer to the script at src/hg/sampleTracks/makeHg12Mm2.doc

LIFTOVER CHAIN TO HG15 (2004-04-12 kate)
----------------------------------------
    # blat alignments with 3K split of hg15 as query
    # NOTE: the split is doc'ed in makeHg15.doc
    ssh eieio
    mkdir -p /cluster/bluearc/hg12
    cp -R /cluster/data/hg12/mixedNib /cluster/bluearc/hg12

    ssh kk
    cd /cluster/data/hg12/bed
    mkdir -p blat.hg15.2004-04-12
    ln -s blat.hg15.2004-04-12 blat.hg15
    cd blat.hg15
    mkdir raw psl run
    cd run
    echo '#LOOP' > gsub
    echo 'blat $(path1) $(path2) {check out line+ ../raw/$(root1)_$(root2).psl} -tileSize=11 -ooc=/cluster/bluearc/hg/h/11.ooc -minScore=100 -minIdentity=98 -fastMap' >> gsub
    echo '#ENDLOOP' >> gsub
    # query
    ls -1S /iscratch/i/gs.16/build33/liftOver/split/*.fa > new.lst
    # target
    ls -1S /cluster/bluearc/hg12/mixedNib/*.nib > old.lst
    gensub2 old.lst new.lst gsub spec
    para create spec
    para try
    para push

    # lift results
    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg15
    cd raw
    cat > liftup.csh << 'EOF'
set liftDir = /cluster/bluearc/hg/gs.16/build33/liftOver/lift
foreach i (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M)
    echo chr$i
    liftUp -pslQ ../psl/chr$i.psl $liftDir/chr$i.lft warn chr*_chr$i.psl
    echo done $i
end
'EOF'
    csh liftup.csh >&! liftup.log &
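    Optionally confirm that the lift produced a non-empty psl for each
    chromosome before chaining (a sketch; prints only zero-length files):
      ls -l ../psl/chr*.psl | awk '$5 == 0 {print "empty:", $NF}'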
    # create alignment chains
    ssh kk
    cd /cluster/data/hg12/bed/blat.hg15
    mkdir chainRun chainRaw chain
    cd chainRun
    echo '#LOOP' > gsub
    echo 'axtChain -psl $(path1) /cluster/bluearc/hg12/mixedNib /scratch/hg/gs.16/build33/chromTrfMixedNib {check out line+ ../chainRaw/$(root1).chain}' >> gsub
    echo '#ENDLOOP' >> gsub
    ls -1S ../psl/*.psl > in.lst
    gensub2 in.lst single gsub spec
    para create spec
    para try
    para push

    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg15
    chainMergeSort chainRaw/*.chain | chainSplit chain stdin
    mkdir net
    cd chain
    foreach i (*.chain)
      chainNet $i /cluster/data/hg12/chrom.sizes \
        /cluster/data/hg15/chrom.sizes ../net/$i:r.net /dev/null
      echo done $i
    end
    mkdir ../over
    cat > subset.csh << 'EOF'
foreach i (*.chain)
    echo $i:r
    netChainSubset ../net/$i:r.net $i ../over/$i
    echo done $i
end
'EOF'
    csh subset.csh >&! subset.log &
    cat ../over/*.chain > ../over.chain
    mkdir -p /cluster/data/hg12/bed/bedOver
    cp ../over.chain /cluster/data/hg12/bed/bedOver/hg12ToHg15.over.chain

    # save to download area
    ssh hgwdev
    cd /usr/local/apache/htdocs/goldenPath/hg12
    mkdir -p liftOver
    cp /cluster/data/hg12/bed/bedOver/hg12ToHg15.over.chain liftOver
    gzip liftOver/hg12ToHg15.over.chain

    # load into database
    mkdir -p /gbdb/hg12/liftOver
    ln -s /cluster/data/hg12/bed/bedOver/hg12ToHg15.over.chain \
      /gbdb/hg12/liftOver
    hgAddLiftOverChain hg12 hg15

LIFTOVER CHAIN TO HG16 (2004-04-13 kate)
----------------------------------------
    # blat alignments with 3K split of hg16 as query
    # NOTE: the split is doc'ed in makeHg16.doc
    ssh kk
    cd /cluster/data/hg12/bed
    mkdir -p blat.hg16.2004-04-13
    ln -s blat.hg16.2004-04-13 blat.hg16
    cd blat.hg16
    mkdir raw psl run
    cd run
    echo '#LOOP' > gsub
    echo 'blat $(path1) $(path2) {check out line+ ../raw/$(root1)_$(root2).psl} -tileSize=11 -ooc=/cluster/bluearc/hg/h/11.ooc -minScore=100 -minIdentity=98 -fastMap' >> gsub
    echo '#ENDLOOP' >> gsub
    # query
    ls -1S /iscratch/i/gs.17/build34/liftOver/split/*.fa > new.lst
    # target
    ls -1S /cluster/bluearc/hg12/mixedNib/*.nib > old.lst
    gensub2 old.lst new.lst gsub spec
    para create spec
    para try
    para push

    # lift results
    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg16
    cd raw
    cat > liftup.csh << 'EOF'
set liftDir = /cluster/bluearc/hg/gs.17/build34/liftOver/lift
foreach i (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M)
    echo chr$i
    liftUp -pslQ ../psl/chr$i.psl $liftDir/chr$i.lft warn chr*_chr$i.psl
    echo done $i
end
'EOF'
    csh liftup.csh >&! liftup.log &

    # create alignment chains
    ssh kk
    cd /cluster/data/hg12/bed/blat.hg16
    mkdir chainRun chainRaw chain
    cd chainRun
    echo '#LOOP' > gsub
    echo 'axtChain -psl $(path1) /cluster/bluearc/hg12/mixedNib /scratch/hg/gs.17/build34/bothMaskedNibs {check out line+ ../chainRaw/$(root1).chain}' >> gsub
    echo '#ENDLOOP' >> gsub
    ls -1S ../psl/*.psl > in.lst
    gensub2 in.lst single gsub spec
    para create spec
    para try
    para push

    ssh eieio
    cd /cluster/data/hg12/bed/blat.hg16
    chainMergeSort chainRaw/*.chain | chainSplit chain stdin
    mkdir net
    cd chain
    cat > chain.csh << 'EOF'
foreach i (*.chain)
    chainNet $i /cluster/data/hg12/chrom.sizes \
        /cluster/data/hg16/chrom.sizes ../net/$i:r.net /dev/null
    echo done $i
end
'EOF'
    csh chain.csh >&! chain.log &
    mkdir ../over
    cat > subset.csh << 'EOF'
foreach i (*.chain)
    echo $i:r
    netChainSubset ../net/$i:r.net $i ../over/$i
    echo done $i
end
'EOF'
    csh subset.csh >&! subset.log &
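    A quick per-chrom look at how many chains survived netChainSubset
    (a sketch; chain records begin with a "chain" header line):
      grep -c '^chain' ../over/*.chain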
    cat ../over/*.chain > ../over.chain
    mkdir -p /cluster/data/hg12/bed/bedOver
    cp ../over.chain /cluster/data/hg12/bed/bedOver/hg12ToHg16.over.chain

    # save to download area
    ssh hgwdev
    cd /usr/local/apache/htdocs/goldenPath/hg12
    mkdir -p liftOver
    cp /cluster/data/hg12/bed/bedOver/hg12ToHg16.over.chain liftOver
    gzip liftOver/hg12ToHg16.over.chain

    # load into database
    mkdir -p /gbdb/hg12/liftOver
    ln -s /cluster/data/hg12/bed/bedOver/hg12ToHg16.over.chain \
      /gbdb/hg12/liftOver
    hgAddLiftOverChain hg12 hg16

LIFTOVER CHAIN TO HG13 (2003-04-14 daryl ?, 2004-04-15 kate)
------------------------------------------------------------
    cp /cluster/data/hg13/bedOver/over.chain \
      /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain

    # save to download area
    ssh hgwdev
    cd /usr/local/apache/htdocs/goldenPath/hg12
    mkdir -p liftOver
    cp /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain liftOver
    gzip liftOver/hg12ToHg13.over.chain

    # load into database
    mkdir -p /gbdb/hg12/liftOver
    ln -s /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain \
      /gbdb/hg12/liftOver
    hgAddLiftOverChain hg12 hg13
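    An optional smoke test of a finished chain with liftOver (a sketch; the
    BED coordinates below are arbitrary, hypothetical values):
      printf 'chr7\t100000\t100100\n' > /tmp/test.hg12.bed
      liftOver /tmp/test.hg12.bed \
        /cluster/data/hg12/bed/bedOver/hg12ToHg13.over.chain \
        /tmp/test.hg13.bed /tmp/test.unmapped
      cat /tmp/test.hg13.bed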