The sp092903 and proteins092903 databases need to be built first from
SWISS-PROT, TrEMBL, and TrEMBL-NEW, using spToDb and other programs
(see /cluster/store4/fan/pb/buildProteins092903.doc for details).

o Create a working subdirectory mm3, make a symbolic link, and go there
      mkdir /cluster/store4/fan/pb/mm3
      cd /cluster/store4/fan/pb/mm3
      ln -s /cluster/store4/fan/pb/mm3 ~/mm3

o Build mm3Temp database by:
      create database mm3Temp;
  Get mm3Temp.sql for table definitions
      dumpdbdef hg16Temp >mm3Temp.sql
  Create tables in mm3Temp:
      mysql -u hgcat -p$HGPSWD -A mm3Temp " mrna.fa > mrna.lis

o Process LocusLink data to generate the mrnaRefseq table
  - create a subdirectory 100603 under ~fan/data/ll and cd there
  - get the latest LocusLink data:
      wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref
      wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc
      wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/mim2loc
  - copy the files over to ~/mm3
      cp -p *loc* ~/mm3
  - load the LocusLink data into these two tables using mysql
      LOAD DATA local INFILE 'loc2acc' into table mm3Temp.locus2Acc0;
      LOAD DATA local INFILE 'loc2ref' into table mm3Temp.locus2Ref0;
  - run hgMrnaRefseq to generate mrnaRefseq.tab
      hgMrnaRefseq mm3
  - create table mm3.mrnaRefseq:
      CREATE TABLE mrnaRefseq (
        mrna varchar(40) NOT NULL default '',
        refseq varchar(40) NOT NULL default '',
        KEY mrna (mrna),
        KEY refseq (refseq)
      ) TYPE=MyISAM;
  - load data into all appropriate genome databases
      LOAD DATA local INFILE 'mrnaRefseq.tab' into table mm3.mrnaRefseq;

o Generate the FASTA format protein sequence file
      kgGetPep 092903 > mrnaPep.fa

o Run pslReps to get tighter mRNAs
      pslReps -minCover=0.40 -sizeMatters -minAli=0.97 -nearTop=0.002 all_mrna.psl tight_mrna.psl /dev/null

o Run hgKgMrna to build "refGene" tables in mm3Temp database
      hgKgMrna mm3Temp mrna.fa mrna.ra tight_mrna.psl loc2ref mrnaPep.fa mim2loc proteins092903 >hgKgMrna.out 2>hgKgMrna.err

o Create the mrnaGene table in the mm3Temp DB by running mrnaGene.sql at the mysql prompt
  Load mrnaGene data into the table
      LOAD DATA local INFILE 'refGene.tab' into table mm3Temp.mrnaGene;
  mm3Temp.mrnaGene is needed by spm6.
  Create KG related tables in mm3
      mysql -u hgcat -p$HGPSWD -A mm3 < kgRelated.sql
      LOAD DATA local INFILE 'refMrna.tab' into table mm3Temp.refMrna;
  Load pep and mrna data into the knownGenePep and knownGeneMrna tables
      LOAD DATA local INFILE 'refPep.tab' into table mm3.knownGenePep;
      LOAD DATA local INFILE 'refMrna.tab' into table mm3.knownGeneMrna;

o Run spm3 to generate the proteinMrna.tab and protein.lis files
      spm3 092903 mm3
  Create table spMrna in mm3Temp and load proteinMrna.tab into mm3Temp.spMrna.
      load data local infile "proteinMrna.tab" into table mm3Temp.spMrna;

o Run kgBestMrna
  Create a subdirectory kgBestMrna
      cd kgBestMrna
      cp -p ../protein.lis .
      kgBestMrna 092903 mm3 2>kgBestMrna.err >kgBestMrna.out2
  The log file of best picks will be generated by kgBestMrna and stored
  at kgBestMrna.out. This may take a day and a half to finish!
  The output file is best.lis.
      cp -p best.lis ..
  This step could be broken into 2 or 3 pieces and run in parallel
  to leverage hgwdev's 4 CPUs.

o Create the spMrna table in mm3 by copying and pasting spMrna.sql at the mysql prompt.
  Load the data by:
      LOAD DATA local INFILE 'best.lis' into table mm3.spMrna;

o Run spm6 to generate sorted.lis and knownGene0.tab for further duplicates processing
      spm6 092903 mm3
  Create table knownGene0 in mm3Temp.
  Load knownGene0.tab into the knownGene0 table in mm3Temp
      LOAD DATA local INFILE 'knownGene0.tab' into table mm3Temp.knownGene0;

o Run spm7 to perform duplicates processing
      spm7 092903 mm3 > spm7.out

o Create the knownGene and dupSpMrna tables in mm3 using knownGene.sql and dupSpMrna.sql
      LOAD DATA local INFILE 'knownGene.tab' into table mm3.knownGene;
      LOAD DATA local INFILE 'duplicate.tab' into table mm3.dupSpMrna;

o Collect DNA based RefSeq data to create dnaGene.tab and dnaLink.tab
      dnaGene mm3 proteins092903

o Create table knownGeneLink in mm3
      LOAD DATA local INFILE 'dnaLink.tab' into table mm3.knownGeneLink;

o Load the data into tables:
      LOAD DATA local INFILE 'dnaGene.tab' into table mm3.knownGene;

o Remove invalid KG entries in the knownGenePep and knownGeneMrna tables:
      rmKGPepMrna mm3 092903
  First, use mysql to delete the old knownGenePep and knownGeneMrna table entries:
      use mm3
      delete from mm3.knownGenePep;
      delete from mm3.knownGeneMrna;
  Then load in the new filtered data:
      LOAD DATA local INFILE 'knownGenePep.tab' into table mm3.knownGenePep;
      LOAD DATA local INFILE 'knownGeneMrna.tab' into table mm3.knownGeneMrna;

o Use the Genome Browser to check if the "Known Gene" track is functioning correctly.

o Now create alias tables to facilitate hgFind.
  First create tables of kgXref, kgAlias and kgProtAlias in mm3, using
      kgXref.sql
      kgAlias.sql
      kgProtAlias.sql

o Build kgXref table
  Generate xref .tab file for KG
      kgXref mm3 proteins092903
  Load it into mySQL
      load data local infile "kgXref.tab" into table mm3.kgXref;

o Build gene aliases
  Generate aliases from hugo, etc
      kgAliasM mm3 proteins092903
  Generate gene aliases from SWISS-PROT data
      kgAliasP mm3 /cluster/store5/swissprot/092903/build/sprot.dat sp.lis
      kgAliasP mm3 /cluster/store5/swissprot/092903/build/trembl.dat tr.lis
      kgAliasP mm3 /cluster/store5/swissprot/092903/build/trembl_new.dat new.lis
      cat sp.lis tr.lis new.lis |sort|uniq >kgAliasP.tab
      rm sp.lis tr.lis new.lis
  Generate gene aliases from RefSeq data
      kgAliasRefseq mm3
  Concatenate all 3 files
      cat kgAliasM.tab kgAliasRefseq.tab kgAliasP.tab|sort|uniq > kgAlias.tab
  Load it into mySQL table
      load data local infile "kgAlias.tab" into table mm3.kgAlias;

o Build protein aliases
  Generate protein aliases
      kgProtAlias mm3 proteins092903
  Generate protein aliases from NCBI data
      kgProtAliasNCBI mm3
  Concatenate both files
      cat kgProtAliasNCBI.tab kgProtAlias.tab|sort|uniq > kgProtAliasBoth.tab
  Load it into mySQL tables
      load data local infile "kgProtAliasBoth.tab" into table mm3.kgProtAlias;

o Create KEGG pathway related tables
  Go to KEGG web site at:
      http://www.genome.ad.jp/dbget-bin/www_bfind?pathway
  Search "mmu". Cut and paste the resulting list, e.g.:
      1. path:mmu00010 Glycolysis / Gluconeogenesis - Mus musculus
      2. path:mmu00020 Citrate cycle (TCA cycle) - Mus musculus
      3. path:mmu00030 Pentose phosphate pathway - Mus musculus
      4. path:mmu00040 Pentose and glucuronate interconversions - Mus musculus
      ...
  Save it as mmu.lis
  Create the keggList database table in mm3Temp:
      CREATE TABLE keggList (
        locusID varchar(40) NOT NULL default '',
        mapID varchar(40) NOT NULL default '',
        description varchar(255) NOT NULL default '',
        KEY (locusID),
        KEY (mapID)
      ) TYPE=MyISAM;
  Run the Perl program getKeggList.pl under hg/hgKegg
      getKeggList.pl mmu > keggList.tab
  Load into the table keggList:
      load data local infile "keggList.tab" into table mm3Temp.keggList;
  Run hgKegg to generate the .tab files:
      hgKegg mm3
  which will create two files, keggPathway.tab and keggMapDesc.tab.
  Create the following two tables in mm3 by:
      CREATE TABLE keggMapDesc (
        mapID varchar(40) NOT NULL default '',
        description varchar(255) NOT NULL default '',
        KEY (mapID)
      ) TYPE=MyISAM;
      CREATE TABLE keggPathway (
        kgID varchar(40) NOT NULL default '',
        locusID varchar(40) NOT NULL default '',
        mapID varchar(40) NOT NULL default '',
        KEY (kgID),
        KEY (locusID),
        KEY (mapID)
      ) TYPE=MyISAM;
  Load the two tables:
      load data local infile "keggPathway.tab" into table mm3.keggPathway;
      load data local infile "keggMapDesc.tab" into table mm3.keggMapDesc;

o Create CGAP related tables
  FTP to ftp://ftp1.nci.nih.gov/pub/CGAP and get Mm_GeneData.dat.
  Run hgCGAP to generate parsed .tab files.
      hgCGAP Mm_GeneData.dat
      cat *SEQ*.tab *SYM*.tab *ALI*.tab |sort|uniq >cgapAlias.tab
  Load data into tables:
      load data local infile "cgapBIOCARTA.tab" into table mm3.cgapBiocPathway;
      load data local infile "cgapBIOCARTAdesc.tab" into table mm3.cgapBiocDesc;
      load data local infile "cgapAlias.tab" into table mm3.cgapAlias;
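
  The cat|sort|uniq merge used for cgapAlias.tab (and earlier for kgAlias.tab
  and kgProtAliasBoth.tab) can be tried on toy data before running it on the
  real files. The sketch below is illustrative only: the demo*.tab file names
  and their contents are made up, it works in a scratch directory, and it
  uses sort -u, which should give the same result as sort|uniq.

```shell
#!/bin/sh
# Illustrative sketch only: the demo*.tab names and rows are invented,
# not real hgCGAP output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three small tab-separated files standing in for the per-source alias files.
printf 'Mm.1\tAbc1\nMm.2\tDef2\n' > demoSEQ.tab
printf 'Mm.2\tDef2\nMm.3\tGhi3\n' > demoSYM.tab
printf 'Mm.1\tAbc1\nMm.4\tJkl4\n' > demoALI.tab

# Same glob-and-merge pattern as the cgapAlias.tab step.
# LC_ALL=C keeps the sort order byte-wise and independent of the locale.
cat *SEQ*.tab *SYM*.tab *ALI*.tab | LC_ALL=C sort -u > merged.tab

# Sanity check before loading: count rows and confirm no duplicate lines remain.
total=$(wc -l < merged.tab | tr -d ' ')
dups=$(LC_ALL=C sort merged.tab | uniq -d | wc -l | tr -d ' ')
echo "rows=$total duplicates=$dups"
```

  Comparing the row count reported here with a "select count(*)" on the
  loaded table is one way to notice rows dropped during the load.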