ENCODE Project at NHGRI ENCODE Resources and Frequently Asked Questions

ENCODE Frequently Asked Questions:

  • How do I display ENCODE data from GEO in the genome browser?
  • Which cell types are used by ENCODE?
  • Where can I find the ENCODE growth protocol for a specific cell type?
  • Has transcription factor xxx been mapped by ENCODE?
  • How do I find overlaps between my own ChIP-seq regions and available ENCODE transcription factors?
  • What is the difference between a file xxx and the related file xxxV2?
  • How do I extract information about an ENCODE experiment from the filename?
  • How do I learn more about different ENCODE file formats?
  • What is the definition of "score" in ENCODE tables?
  • How do I download ENCODE histone data in BED format?
  • What does the Name column represent for DNase clustered BED files?
  • How do I find the meaning of a column of a BED file?
  • Is there a service providing ENCODE data on a hard drive?
  • Where can I find ENCODE papers?
  • Can I convert WIG files into a variableStep format to use with SitePro?
  • What does xxx mean in a file in hgdownload/encodeDCC/hg19/wgEncode*?
  • Which cell protocols were used in my track of interest?

    Questions and feedback welcome.

    GEO DATA

    Question:
    How do I display ENCODE data from GEO in the genome browser?

    Response:
    Please avoid loading GEO data as a custom track! Nearly all ENCODE data at GEO are already hosted as tracks on the UCSC browser, so load the existing corresponding track instead.

    Take note of the GEO sample accession (GSM) number, for example GSM999240, and enter it into the Track Search tool, which is accessible by clicking Search on the left side of the ENCODE portal page. Alternatively, use the Advanced Track Search page and select "GEO sample accession" from the pull-down menu that displays "Cell, tissue or DNA sample". Check the box next to the track returned by the search and click the "View in Browser" button.

    If you have data that is not already in the browser, we recommend converting your BED files to bigBed format. You can download our source tools for converting from BED to bigBed (as described in the previous link) or use the tools at the Galaxy website. For questions regarding Galaxy, please contact the Galaxy team directly.
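
    As a rough sketch of the preparation step before conversion: the bedToBigBed utility expects its input sorted by chromosome and then by start position, with no "track" or "browser" lines. The helper names and file names below are illustrative assumptions, not part of the UCSC tools. The command list follows the usual "bedToBigBed in.bed chrom.sizes out.bb" argument order and is only built here, not executed.

```python
def bed_sort_key(line):
    """Sort key: chromosome name (plain string order), then numeric start."""
    fields = line.split()
    return (fields[0], int(fields[1]))


def sort_bed_lines(lines):
    """Drop track/browser/comment lines and sort the remaining BED lines."""
    data = [
        ln for ln in lines
        if ln.strip() and not ln.startswith(("track", "browser", "#"))
    ]
    return sorted(data, key=bed_sort_key)


def bed_to_bigbed_command(sorted_bed, chrom_sizes, out_bb):
    """Build (but do not run) a bedToBigBed command line.

    All three paths are hypothetical; chrom_sizes is a two-column
    file of chromosome names and lengths for the assembly in use.
    """
    return ["bedToBigBed", sorted_bed, chrom_sizes, out_bb]
```

    The resulting list could be passed to subprocess.run once bedToBigBed is installed and a chromosome sizes file for the assembly is available.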

    ENCODE CELL TYPES

    Question:
    Which cell types are used by ENCODE?

    Response:
    On the left side of the ENCODE portal page under Human and Mouse are links to the cell types used in all ENCODE experiments.

    ENCODE PROTOCOLS

    Question:
    Where can I find the ENCODE growth protocol for a specific cell type? For example RCC 7860?

    Response:
    To find a specific protocol, for example for human RCC 7860 cells, navigate from the ENCODE portal to the Human Cell Types page. In the "Documents" column for RCC 7860, click the link to see the growth protocol. The link is named after the lab that provided the document, in this case "Crawford".

    Another path to ENCODE protocols is from the link http://genome.ucsc.edu/ENCODE/protocols/. Navigate to the cell protocols and then human directories to find the link to the same RCC 7860 protocol file as linked on the above Human Cell Types page.

    If you have further questions about a protocol contact the lab that registered the protocol.

    TRANSCRIPTION FACTORS

    Question:
    Has transcription factor xxx been mapped by ENCODE?

    Response:
    A quick way to view the list of transcription factors mapped by ENCODE is to view the ChIP-seq matrix for either human or mouse. Targets are listed horizontally across the top, indicating available mapped transcription factor data. Clicking on the green highlighted boxes will bring you to experiment data specific to the corresponding cell type and target.

    Another option is to use the Track Search or File Search tools and search the "Antibody or target protein" field to see whether the desired transcription factor is listed.

    MAPPING A CUSTOM TRACK TO TRANSCRIPTION FACTORS

    Question:
    How do I find overlaps between my own ChIP-seq regions and available ENCODE transcription factors?

    Response:
    Using the Table Browser tool, you can add your ChIP-seq regions as a custom track and then use the "intersection" feature to intersect your custom track with the Txn Factor ChIP track table listed under the Regulation group. Note that your custom track should contain ChIP-seq regions in BED format; for more information, visit our custom tracks page.

    If you are unfamiliar with the Table Browser, please refer to our help page and the section on intersecting data.
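
    For readers who want to check the logic offline, the kind of intersection described above can be sketched in a few lines. This is a minimal illustration, not the Table Browser's implementation: it assumes both inputs are BED-style lines (0-based, half-open coordinates) and reports each of your regions that shares at least one base with any region in the other set. The function names are hypothetical.

```python
from collections import defaultdict


def parse_bed(lines):
    """Group (start, end) pairs by chromosome, skipping header lines."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        by_chrom[fields[0]].append((int(fields[1]), int(fields[2])))
    return by_chrom


def overlapping_regions(mine, encode):
    """Return (chrom, start, end) for each of my regions that overlaps
    any ENCODE region on the same chromosome by at least one base."""
    hits = []
    for chrom, regions in mine.items():
        others = encode.get(chrom, [])
        for start, end in regions:
            # Half-open intervals overlap when each starts before the other ends.
            if any(start < o_end and o_start < end for o_start, o_end in others):
                hits.append((chrom, start, end))
    return hits
```

    The nested scan is quadratic per chromosome, which is fine for small region lists; large files would call for sorted sweeps or an interval index.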

    FILES NAMED xxxV2

    Question:
    What is the difference between a file xxx and the related file xxxV2? Why is the xxx file not displayed in the browser?

    Response:
    For files with names like xxxV2, the "V2" often indicates a second version that replaces earlier versions, which are therefore not displayed in the browser. Replaced or revoked files are still available for download, but they are marked as "replaced" or "revoked" in the related metadata file named "files.txt" in the corresponding download directory.

    ENCODE METADATA AND FILENAMES

    Question:
    How do I extract information about an ENCODE experiment from the filename?

    Response:
    Do not try it! Although ENCODE filenames embed some metadata, they cannot be relied upon to be unique. Use the file's metadata instead, which you can access in the following places:

      By opening the "files.txt" metadata file located in each track's corresponding download directory.
      By clicking the blue down-arrow next to each subtrack listed on a track's Track Settings page.
      By using Track Search or File Search to filter files by metadata.
      By using the Table Browser tool and setting "Group" to "All Tables" and selecting the "metaDb" table. By clicking the "describe table schema" button you can learn more about the metaDb table.
      By using the public MySQL database to query the metaDb table for each database.
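
    The first option in the list above, reading "files.txt", lends itself to a short script. The sketch below assumes each line holds a filename, a tab, and then semicolon-separated key=value pairs; check this layout against the actual file in your download directory before relying on it. The function names and the sample line in the usage note are made up for illustration.

```python
def parse_files_txt_line(line):
    """Split one files.txt line into (filename, {key: value})."""
    name, _, meta = line.rstrip("\n").partition("\t")
    pairs = {}
    for item in meta.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:  # ignore fragments without an '=' sign
            pairs[key] = value
    return name, pairs


def is_superseded(meta):
    """True if the metadata marks the file as replaced or revoked.

    The key is matched case-insensitively because the exact
    capitalisation of the status key is an assumption here.
    """
    lowered = {k.lower(): v for k, v in meta.items()}
    status = lowered.get("objstatus", "").lower()
    return status.startswith(("replaced", "revoked"))
```

    For a hypothetical line such as "exampleRep1.bed9<TAB>cell=AG04449; objstatus=replaced", parse_files_txt_line returns the filename with a metadata dictionary, and is_superseded reports True for that dictionary.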

    ENCODE FILE FORMATS

    Question:
    How do I learn more about different ENCODE file formats? For example what is the difference between a file.bed and a file.bed9 in the ENCODE methylation data?

    Response:
    Clicking the File Formats link on the ENCODE portal page brings up a list of the file formats used in ENCODE. In addition, every ENCODE file has metadata recorded in the "files.txt" file on the related downloads page. For the example in question, the files.txt file on the ENCODE methylation download page has an entry for the bed9 file wgEncodeHaibMethylRrbsAg04449UwstamgrowprotSitesRep1.bed9 that reads 'objstatus=replaced'. This metadata indicates that the bed9 file was preliminary data that has since been replaced. A similar note in the automatically displayed README file states: "WARNING - Revoked and replaced data files may be present in this directory."

    ENCODE SCORE DEFINITION

    Question:
    What is the definition of "score" in ENCODE tables?

    Response:
    The score (between 0 and 1000) determines how darkly an item is displayed in the browser, with 1000 being black. The darkness of an item's box is proportional to the maximum signal strength observed in any cell line.

    To find out exactly how score has been calculated, contact the lab that created the data determining signal strength. There are often several links to authors' labs in the credits section for each track at the bottom of a track's description page.

    ENCODE DATA IN BED FORMAT

    Question:
    How do I download ENCODE histone data in BED format? From the Table Browser I can select to download the file in BED format, but I am limited to just a few thousand lines. When I looked in the ENCODE Downloads directory I could only find the path to a bigWig file, for example wgEncodeBroadHistoneGm12878H3k27acStdSig for human build hg19.

    Response:
    The ENCODE BED files you are looking for are the 'peak calls', which are in the extended broadPeak or narrowPeak formats, described here. For example, for the data mentioned (H3K27ac histone mark in GM12878 cells), there is a BED representation in the file "wgEncodeBroadHistoneGm12878H3k27acStdPk.broadPeak.gz". In the File Search tool, you can use the setting "Data Format: Peaks Broad" to narrow your results to only these types of files.
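
    If a downstream tool insists on plain six-column BED, a downloaded broadPeak file can be trimmed with a short script. This sketch assumes the file is gzip-compressed, tab-separated text whose first six columns are the standard BED6 fields (chrom, chromStart, chromEnd, name, score, strand), as in the broadPeak format. The function name and paths are illustrative.

```python
import gzip


def broadpeak_to_bed6(gz_path, bed_path):
    """Copy the first six columns of a .broadPeak.gz file to a BED file.

    Returns the number of records written. Lines with fewer than
    six tab-separated fields are skipped.
    """
    count = 0
    with gzip.open(gz_path, "rt") as src, open(bed_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue
            dst.write("\t".join(fields[:6]) + "\n")
            count += 1
    return count
```

    The signalValue, pValue, and qValue columns are dropped by this conversion, so keep the original file if you need them later.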

    ENCODE BED FILE FORMAT

    Question:
    What does the name column represent for DNase clustered BED files? I downloaded the ENCODE BED file wgEncodeRegDnaseClustered.bed from the DNase footprinting assay. However, I am having trouble understanding the 4th column in this file. Usually this column, as I understand from the file format FAQ page, is assigned to name.

    Response:
    For the DNase cluster BED files, the name field represents the number of items in the cluster. To find out more information about each cluster, you can click on the item in the browser image and it will take you to a details page that will list all of the items in the cluster and the cell lines. Here is an example details page for a DNase item on chromosome 21. There are 58 items in this cluster and you can see the name value is 58.

    ENCODE ChIP-seq BED FILES

    Question:
    How do I find the meaning of a column of a BED file? I have downloaded ENCODE ChIP-seq BED files that have the following format:

    chr21 9825311 9827738 . 1000 . 4.51792 256.60845 261.34671 1809

    What is the meaning of the information from the fourth field forward?

    Response:
    ENCODE has a number of ENCODE-specific formats. ENCODE ChIP-seq files are typically stored in the ENCODE narrowPeak format. This format extends BED6 to include fields for signalValue, two measurements of statistical significance (pValue and qValue), and the offset of a single-base 'point source' peak within the region. The dots stand in for name and strand, which are not applicable here.
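
    To make the column meanings concrete, the example line from the question can be read into named fields. This is a minimal sketch that assumes the ten-column narrowPeak layout described above; the class and function names are illustrative.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    chrom_start: int     # 0-based start of the region
    chrom_end: int       # end of the region (not included)
    name: str            # "." when no name is assigned
    score: int           # 0-1000, used for display shading
    strand: str          # "." when strand is not applicable
    signal_value: float  # measurement of enrichment for the region
    p_value: float       # statistical significance (-1 if not assigned)
    q_value: float       # significance after multiple-test correction (-1 if not assigned)
    peak: int            # offset of the point-source peak from chrom_start


def parse_narrowpeak(line):
    """Parse one whitespace-separated narrowPeak line."""
    f = line.split()
    return NarrowPeak(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        float(f[6]), float(f[7]), float(f[8]), int(f[9]),
    )


def summit_position(record):
    """Genomic coordinate of the point-source peak (chrom_start + offset)."""
    return record.chrom_start + record.peak
```

    For the line in the question, parse_narrowpeak gives signal_value 4.51792, p_value 256.60845, q_value 261.34671 and peak 1809, so summit_position returns 9825311 + 1809 = 9827120.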

    DOWNLOAD ALL ENCODE DATA

    Question:
    Is there a service providing ENCODE data on a hard drive? What is the total data volume? We have been trying FTP, but it takes too much bandwidth and time.

    Response:
    The total volume of ENCODE data is greater than 31 TB. Unfortunately, it is not possible to obtain a copy on disk. However, we are working on a protocol that should give our users much faster download rates. It is on our list of things to do, but we cannot give a time estimate for when this tool will be available.

    ENCODE PAPERS

    Question:
    Where can I find ENCODE papers? I would like a list of the principal ENCODE Papers, can you send a link to a list of a core 30 papers detailing ENCODE's results?

    Response:
    On the left side of the ENCODE portal page you will find a link titled Publications, which provides access to lists of ENCODE-funded Publications and Publications from non-ENCODE Authors.

    CONVERTING WIG FILES TO VARIABLESTEP

    Question:
    Can I convert WIG files into a variableStep format to use with SitePro? I am trying to use a tool called SitePro within Cistrome. This tool uses WIG and BED files to compute score profiles on the BED regions. I have downloaded, through Cistrome/Galaxy, the ENCODE WIG files which have BED-like structure:

    chr1 3002700 3002800 0.17

    However, this WIG file's BED-like structure is not accepted by SitePro. Is there a way to format the WIG files as variableStep rather than BED-like?

    Response:
    The Genome Browser cannot convert between these formats directly, but you can convert them with a script. There is an example script in our genomewiki, here.
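
    The following is a minimal sketch of such a conversion, written independently of the genomewiki script and not tested against SitePro. It assumes four-column, BED-style input lines (chrom, 0-based start, end, value) like the one in the question, and emits wiggle variableStep text, where positions are 1-based and a new declaration line is needed whenever the chromosome or span changes. The function name is illustrative.

```python
def bedgraph_to_variable_step(lines):
    """Convert BED-like signal lines to a list of variableStep lines."""
    out = []
    current = None  # (chrom, span) of the declaration currently in effect
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        span = int(end) - int(start)
        if (chrom, span) != current:
            out.append(f"variableStep chrom={chrom} span={span}")
            current = (chrom, span)
        # BED-style starts are 0-based; wiggle positions are 1-based.
        out.append(f"{int(start) + 1} {value}")
    return out
```

    For the example line "chr1 3002700 3002800 0.17" this produces the declaration "variableStep chrom=chr1 span=100" followed by "3002701 0.17".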

    UNIQUE ENCODE DATA DETAILS

    Question:
    What does xxx mean in a file in hgdownload/encodeDCC/hg19/wgEncode*? For example, downloadable files in the wgEncodeCaltechRnaSeq/ directory have a gene_id format like gene_id "GM12878-rep1.1045777", where the first part is the cell type. Do you know what the last number, 1045777, means?

    Response:
    At the top of the page for each download directory you visit, a README.txt file is displayed automatically. It provides a link to a user interface that lets you filter files by cell type and other parameters, and that includes additional information such as release status, restriction dates, track description, methods, and metadata that can answer such questions.

    For example, the README.txt file displayed at the top of the page in the Caltech RNA-seq directory contains the following link: "http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeCaltechRnaSeq"

    After navigating to the Caltech RNA-seq Downloadable Files page at the above link, you can scroll to the bottom (or click the "Description" button in the top right corner) and read the Track Description's Methods section. The "Data Processing and Analysis" section explains that the numbers in gene_id, "GM12878-rep1.####", are de novo identifiers output by the Cufflinks software. At the very bottom of the page is a "Credits" section where contacts are listed. Please send any remaining process-specific questions about the data to the appropriate contact listed there.
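
    If you need to separate the parts of such an identifier in a script, a small pattern match is enough. The sketch below assumes the attribute looks exactly like the example in the question (cell type, "-rep", replicate number, a dot, then the numeric identifier); other directories may use different conventions, so treat the pattern and the function name as illustrative.

```python
import re

# Matches e.g. gene_id "GM12878-rep1.1045777"
GENE_ID_RE = re.compile(
    r'gene_id "(?P<cell>[^"]+?)-rep(?P<rep>\d+)\.(?P<num>\d+)"'
)


def split_gene_id(attributes):
    """Return (cell, replicate, identifier) or None if the pattern is absent.

    The identifier is kept as a string because it is an opaque
    label assigned by the assembly software, not a quantity.
    """
    match = GENE_ID_RE.search(attributes)
    if match is None:
        return None
    return match.group("cell"), int(match.group("rep")), match.group("num")
```

    Applied to the attribute text gene_id "GM12878-rep1.1045777", this returns the cell type GM12878, replicate 1, and the identifier string 1045777.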

    ENCODE PROTOCOLS

    Question:
    Which cell protocols were used in my track of interest? Did the Open Chromatin ENCODE tracks use standard ENCODE cell protocols?

    Response:
    Standard growth protocols were used for all ENCODE experiments, including the Open Chromatin ENCODE tracks. A directory of all ENCODE protocols is available here: http://genome.ucsc.edu/ENCODE/protocols/.