Hi Galt - this directory /hive/groups/encode/encode3/testSets/manifest1 has a test set for the manifest file. The input is manifest.txt and the output should be called "validated.txt" and include the following extra columns: 1) md5 - the usual md5sum of the file 2) size - in bytes 3) modified - last modified time. I would not be adverse to a simple seconds since 1970 format, but if you have other simple good ideas please let me know. 4) validation key/error. If the file validates this should be V plus a number calculated as so: (sum-of-ascii-values-in-MD5 + file-size)%10000 so it should be something like V18209. We'd just use this as a lightweight security mechanism to make sure they don't fake the validation. If it's an error, instead please put just ERROR there. Hopefully the validate program will tell user about the error in std err in more detail. Could you design the program so that it could read in a previous version of validated.txt, and just check the ones where the file date or size has changed, and for the unchanged ones just use what is in validated.txt rather than rerunning? ------------------------- [hgwdev:manifest1> ll total 112 -rw-rw-r-- 1 kent genecats 3505 Feb 4 17:56 files.txt -rw-rw-r-- 1 kent genecats 74 Feb 4 17:32 genomes.txt -rw-rw-r-- 1 kent genecats 88 Feb 4 18:32 header drwxrwsr-x 2 kent genecats 8192 Feb 4 17:42 hg19 -rw-rw-r-- 1 kent genecats 136 Feb 4 17:33 hub.txt -rw-rw-r-- 1 kent genecats 6136 Feb 4 18:34 manifest.txt drwxrwsr-x 2 kent genecats 8192 Feb 4 17:45 mm9 ------------------------- [hgwdev:manifest1> find . ./hg19 ./hg19/wgEncodeRikenCageH1hescCellLongnonpolyaAln.bam ./hg19/wgEncodeRikenCageH1hescCellLongnonpolyaAln.bam.bai ./hg19/wgEncodeRikenCageH1hescCellLongnonpolyaMinusClusters.bigWig ./hg19/wgEncodeRikenCageH1hescCellLongnonpolyaPlusClusters.bigWig ./hg19/wgEncodeRikenCageH1hescCellLongnonpolyaRawData.fastq.gz ./hg19/wgEncodeRikenCageH1hescCellPamAln.bam ./hg19/wgEncodeRikenCageH1hescCellPamAln.bam.bai ./hg19/wgEncodeRikenCageH1hescCellPamMinusSignal.bigWig ./hg19/wgEncodeRikenCageH1hescCellPamPlusSignal.bigWig ./hg19/wgEncodeRikenCageH1hescCellPamRawData.fastq.gz ./hg19/wgEncodeRikenCageH1hescCellPamTssGencV7.gtf.gz ./hg19/wgEncodeRikenCageH1hescCellPamTssHmm.bedRnaElements.gz ./hg19/wgEncodeRikenCageH1hescCellPamTssHmmV2.bedRnaElements.gz ./hg19/wgEncodeRikenCageH1hescCellPapAlnRep1.bam ./hg19/wgEncodeRikenCageH1hescCellPapAlnRep1.bam.bai ./hg19/wgEncodeRikenCageH1hescCellPapAlnRep2.bam ./hg19/wgEncodeRikenCageH1hescCellPapAlnRep2.bam.bai ./hg19/wgEncodeRikenCageH1hescCellPapMinusClustersRep1.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapMinusClustersRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapMinusSignalRep1.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapMinusSignalRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapPlusClustersRep1.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapPlusClustersRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapPlusSignalRep1.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapPlusSignalRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCellPapRawDataRep1.fastq.gz ./hg19/wgEncodeRikenCageH1hescCellPapRawDataRep2.fastq.gz ./hg19/wgEncodeRikenCageH1hescCellPapTssGencV7.gtf.gz ./hg19/wgEncodeRikenCageH1hescCellPapTssHmm.bedRnaElements.gz ./hg19/wgEncodeRikenCageH1hescCytosolPapAlnRep2.bam ./hg19/wgEncodeRikenCageH1hescCytosolPapAlnRep2.bam.bai ./hg19/wgEncodeRikenCageH1hescCytosolPapMinusClustersRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCytosolPapMinusSignalRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCytosolPapPlusClustersRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCytosolPapPlusSignalRep2.bigWig ./hg19/wgEncodeRikenCageH1hescCytosolPapRawDataRep2.fastq.gz ./hg19/wgEncodeRikenCageH1hescCytosolPapTssHmm.bedRnaElements.gz ./hg19/wgEncodeRikenCageH1hescNucleusPapAlnRep2.bam ./hg19/wgEncodeRikenCageH1hescNucleusPapAlnRep2.bam.bai ./hg19/wgEncodeRikenCageH1hescNucleusPapMinusClustersRep2.bigWig ./hg19/wgEncodeRikenCageH1hescNucleusPapMinusSignalRep2.bigWig ./hg19/wgEncodeRikenCageH1hescNucleusPapPlusClustersRep2.bigWig ./hg19/wgEncodeRikenCageH1hescNucleusPapPlusSignalRep2.bigWig ./hg19/wgEncodeRikenCageH1hescNucleusPapRawDataRep2.fastq.gz ./hg19/wgEncodeRikenCageH1hescNucleusPapTssHmm.bedRnaElements.gz ./hg19/wgEncodeUncBsuProtGencH1hescCellFaspmPepMapGcFt.bigBed ./hg19/wgEncodeUncBsuProtGencH1hescCellFaspmPepMapGcUnFt.bigBed ./hg19/wgEncodeUncBsuProtGencH1hescCellFasppepMapGcFt.bigBed ./hg19/wgEncodeUncBsuProtGencH1hescCellFasppepMapGcUnFt.bigBed ./hg19/wgEncodeUncBsuProtGencH1hescCellMudpitmPepMapGcFt.bigBed ./hg19/wgEncodeUncBsuProtGencH1hescCellMudpitmPepMapGcUnFt.bigBed ./hg19/wgEncodeUncBsuProtGencH1hescCellMudpitpepMapGcFt.bigBed ./hg19/wgEncodeUncBsuProtGencH1hescCellMudpitpepMapGcUnFt.bigBed ./mm9 ./mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xAlnRep1.bam ./mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xAlnRep1.bam.bai ./mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xPkRep1.narrowPeak.gz ./mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xRawDataRep1.fastq.tgz ./mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xSigRep1.bigWig ./genomes.txt ./hub.txt ./files.txt ./manifest.txt ./header Looks like most of these are symlinks. -------------------- manifest.txt looks like this: #file_name format experiment replicate output_type biosample target localization update hg19/wgEncodeRikenCageH1hescCellLongnonpolyaAln.bam bam exp1 pooled alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellLongnonpolyaAln.bam.bai bam.bai exp1 pooled alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellLongnonpolyaMinusClusters.bigWig bigWig exp1 pooled clusters H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellLongnonpolyaPlusClusters.bigWig bigWig exp1 pooled clusters H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellLongnonpolyaRawData.fastq.gz fastq.gz exp1 pooled reads H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamAln.bam bam exp1 pooled alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamAln.bam.bai bam.bai exp1 pooled alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamMinusSignal.bigWig bigWig exp1 pooled signal H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamPlusSignal.bigWig bigWig exp1 pooled signal H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamRawData.fastq.gz fastq.gz exp1 pooled reads H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamTssGencV7.gtf.gz gtf.gz exp1 pooled gencodeModels H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamTssHmm.bedRnaElements.gz bedRnaElements.gz exp1 pooled hmmModels H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPamTssHmmV2.bedRnaElements.gz bedRnaElements.gz exp1 pooled hmmModels H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapAlnRep1.bam bam exp1 1 alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapAlnRep1.bam.bai bam.bai exp1 1 alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapAlnRep2.bam bam exp1 2 alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapAlnRep2.bam.bai bam.bai exp1 2 alignment H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapMinusClustersRep1.bigWig bigWig exp1 1 clusters H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapMinusClustersRep2.bigWig bigWig exp1 2 clusters H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapMinusSignalRep1.bigWig bigWig exp1 1 signal H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapMinusSignalRep2.bigWig bigWig exp1 2 signal H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapPlusClustersRep1.bigWig bigWig exp1 1 clusters H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapPlusClustersRep2.bigWig bigWig exp1 2 clusters H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapPlusSignalRep1.bigWig bigWig exp1 1 signal H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapPlusSignalRep2.bigWig bigWig exp1 2 signal H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapRawDataRep1.fastq.gz fastq.gz exp1 1 reads H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapRawDataRep2.fastq.gz fastq.gz exp1 2 reads H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapTssGencV7.gtf.gz gtf.gz exp1 pooled gencodeModels H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCellPapTssHmm.bedRnaElements.gz bedRnaElements.gz exp1 pooled hmmModels H1hesc n/a cell no hg19/wgEncodeRikenCageH1hescCytosolPapAlnRep2.bam bam exp2 2 alignment H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescCytosolPapAlnRep2.bam.bai bam.bai exp2 2 alignment H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescCytosolPapMinusClustersRep2.bigWig bigWig exp2 2 clusters H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescCytosolPapMinusSignalRep2.bigWig bigWig exp2 2 signal H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescCytosolPapPlusClustersRep2.bigWig bigWig exp2 2 clusters H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescCytosolPapPlusSignalRep2.bigWig bigWig exp2 2 signal H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescCytosolPapRawDataRep2.fastq.gz fastq.gz exp2 2 reads H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescCytosolPapTssHmm.bedRnaElements.gz bedRnaElements.gz exp2 pooled hmmModels H1hesc n/a cytosol no hg19/wgEncodeRikenCageH1hescNucleusPapAlnRep2.bam bam exp3 2 alignment H1hesc n/a nucleus no hg19/wgEncodeRikenCageH1hescNucleusPapAlnRep2.bam.bai bam.bai exp3 2 alignment H1hesc n/a nucleus no hg19/wgEncodeRikenCageH1hescNucleusPapMinusClustersRep2.bigWig bigWig exp3 2 clusters H1hesc n/a nucleus no hg19/wgEncodeRikenCageH1hescNucleusPapMinusSignalRep2.bigWig bigWig exp3 2 signal H1hesc n/a nucleus no hg19/wgEncodeRikenCageH1hescNucleusPapPlusClustersRep2.bigWig bigWig exp3 2 clusters H1hesc n/a nucleus no hg19/wgEncodeRikenCageH1hescNucleusPapPlusSignalRep2.bigWig bigWig exp3 2 signal H1hesc n/a nucleus no hg19/wgEncodeRikenCageH1hescNucleusPapRawDataRep2.fastq.gz fastq.gz exp3 2 reads H1hesc n/a nucleus no hg19/wgEncodeRikenCageH1hescNucleusPapTssHmm.bedRnaElements.gz bedRnaElements.gz exp3 pooled hmmModels H1hesc n/a nucleus no hg19/wgEncodeUncBsuProtGencH1hescCellFaspmPepMapGcFt.bigBed bigBed exp4 pooled pmPepMapFt H1hesc n/a n/a no hg19/wgEncodeUncBsuProtGencH1hescCellFaspmPepMapGcUnFt.bigBed bigBed exp4 pooled pmPepMapUnFt H1hesc n/a n/a no hg19/wgEncodeUncBsuProtGencH1hescCellFasppepMapGcFt.bigBed bigBed exp4 pooled PepMapFt H1hesc n/a n/a no hg19/wgEncodeUncBsuProtGencH1hescCellFasppepMapGcUnFt.bigBed bigBed exp4 pooled PepMapUnFt H1hesc n/a n/a no hg19/wgEncodeUncBsuProtGencH1hescCellMudpitmPepMapGcFt.bigBed bigBed exp4 pooled mMudMapFt H1hesc n/a n/a no hg19/wgEncodeUncBsuProtGencH1hescCellMudpitmPepMapGcUnFt.bigBed bigBed exp4 pooled mMudMapUnFt H1hesc n/a n/a no hg19/wgEncodeUncBsuProtGencH1hescCellMudpitpepMapGcFt.bigBed bigBed exp4 pooled MudMapFt H1hesc n/a n/a no hg19/wgEncodeUncBsuProtGencH1hescCellMudpitpepMapGcUnFt.bigBed bigBed exp4 pooled MudMapUnFt H1hesc n/a n/a no mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xAlnRep1.bam bam exp5 1 alignment H1hesc Nrsf n/a no mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xAlnRep1.bam.bai bam.bai exp5 1 alignment C2c12 Nrsf n/a no mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xPkRep1.narrowPeak.gz narrowPeak.gz exp5 1 peaks C2c12 Nrsf n/a no mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xRawDataRep1.fastq.tgz fastq.tgz exp5 1 reads C2c12 Nrsf n/a no mm9/wgEncodeCaltechTfbsC2c12NrsfFCntrl32bE2p60hPcr2xSigRep1.bigWig bigWig exp5 1 signal C2c12 Nrsf n/a no ----------------- I assume that simply hashing the filename should be enough to make it unique. I need to first read in the validated.txt file if it is there, and then use that with the manifest.txt. -- if the file_name stored in the validated hash matches and it has a valid code rather than error, and the file size and date are the same too, then accept it and do not re-run it. How important is it to either sort it or to output the new validated.txt? No, just use the natural order that is there. These are tab-separated values. In theory, the order of the columns in validated.txt could be different than in manifest.txt. Should that be detected, or does it just not matter? Where and when would be decide the allowed types? Should we begin an encode3 directory under hg? I guess so. hash of fieldname to column#. The manifestValidator should probably check if the value is an integer where appropriate? Make that be a hard-wired table at first? Do I want to make it a hash of hashes? #file_name format experiment replicate output_type biosample target localization update ---- md5_sum size modified valid_key Some of the code could be like the bed-parser which does tab input? Probably want to validate the manifest columns. create another hash for the inputs that simply maps record# to values. one way wants to preserve the original record order. one way wants to match on the file_name. do we parse all the fields, or simply keep the entire row together as a string? It can easily be made to check that the other columns exist and have the same values in validate.txt, otherwise, it can re-run the validator and the md5sum which are time-intensive operations. store the field-names list as an slList? also hash field-names with position index? do I need this for sure? store the records as hashes, keep an slList of the hashes to preserve the original record order. as well as keeping a hash-of-file_name to them also. always still have to check the fileExists, fileSize and modified. ------------ It would probably be better NOT to clobber the validate.txt until the whole program finishes running? So use something like newValidate.txt and then rename it at the end? -- Do we need to detect duplicate fieldnames? ------------------- There is 11 GB of stuff there. [hgwdev:manifest1> du -Lh --apparent-size . 7.4G ./hg19 2.7G ./mm9 11G .