Description

These tracks display evidence of open chromatin in multiple cell types from the Duke/UNC/UT-Austin/EBI ENCODE group. Open chromatin was identified using two independent and complementary methods: DNaseI hypersensitivity (HS) and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE), combined with chromatin immunoprecipitation (ChIP) for select regulatory factors. Each method was verified by two detection platforms: Illumina (formerly Solexa) sequencing by synthesis, and high-resolution 1% ENCODE tiled microarrays supplied by NimbleGen.

DNaseI HS data: DNaseI is an enzyme that has long been used to map general chromatin accessibility, and DNaseI "hyperaccessibility" or "hypersensitivity" is a feature of active cis-regulatory sequences. The use of this method has led to the discovery of functional regulatory elements that include enhancers, silencers, insulators, promoters, locus control regions and novel elements. DNaseI hypersensitivity signifies chromatin accessibility following binding of trans-acting factors in place of a canonical nucleosome.

FAIRE data: FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) is a method to isolate and identify nucleosome-depleted regions of the genome. FAIRE was initially discovered in yeast and subsequently shown to identify active regulatory elements in human cells (Giresi et al., 2007). Although less well-characterized than DNase, FAIRE also appears to identify functional regulatory elements that include enhancers, silencers, insulators, promoters, locus control regions and novel elements. DNA fragments isolated by FAIRE are 100-200 bp in length, with the average length being 140 bp.

ChIP data: ChIP (Chromatin Immunoprecipitation) is a method to identify the specific location of proteins that are directly or indirectly bound to genomic DNA. By identifying the binding location of sequence-specific transcription factors, general transcription machinery components, and chromatin factors, ChIP can help in the functional annotation of the open chromatin regions identified by DNaseI HS mapping and FAIRE.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. Chromatin data displayed here represents a continuum of signal intensities. The Crawford lab recommends setting the "Data view scaling: auto-scale" option when viewing signal data in full mode. In general, for each experiment in each of the cell types, the Open Chromatin tracks contain the following views:

Peaks: Regions of enriched signal in either DNaseI HS, FAIRE, or ChIP experiments. Peaks were called based on signals created using F-Seq, a software program developed at Duke (Boyle et al., 2008b). Significant regions were determined by performing ROC analysis of sequence data using data from the 1% ENCODE arrays, and determining a cut-off value at approximately the 95% sensitivity level. The solid vertical line in the peak represents the point with highest signal. ENCODE Peaks tables contain a p-value for statistical significance. For these data, this was determined by fitting the data to a gamma distribution.
Signal (F-Seq Density): Density graph (wiggle) of signal enrichment calculated using F-Seq for the combined set of sequences from all replicates. F-Seq employs Parzen kernel density estimation to create base pair scores (Boyle et al., 2008b). This method does not look at fixed-length windows but rather weights contributions of nearby sequences in proportion to their distance from that base. It only considers sequences aligned four or fewer times in the genome, and uses an alignability background model to correct for regions where sequences cannot be aligned. For the K562, HepG2 and HeLa-S3 cell types, which have abnormal karyotypes, a model to correct for amplifications and deletions was also used. No control data were used in the creation of these annotations.
Signal (Base Overlap): An alternative version of the Signal (F-Seq Density) track annotation that provides a higher resolution view of the raw sequence data. This track also includes the combined set of sequences from all replicates. Each aligned read is extended in the following way: for DNase, the read is extended 5 bp in both directions from its 5' aligned end, where DNase cut the DNA; for FAIRE and ChIP, the sequence is extended to a fragment length of 134 bp from the 5' aligned end, representing the approximate average fragment length. The score at each base pair represents the number of extended fragments that overlap the base pair.
Alignments: Mappings of short reads to the genome (currently only available for download).
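
The Base Overlap scoring described above can be illustrated with a short Python sketch. This is not the code used to build the track; the function names, the 0-based coordinates, and the choice to count the cut base itself in the DNase window are assumptions made for illustration only.

```python
from collections import Counter

def extend_read(five_prime, strand, assay):
    """Return the half-open interval [start, end) covered by one extended read.

    five_prime: 0-based position of the read's 5' aligned end.
    strand: '+' or '-'.
    assay: 'DNase', 'FAIRE' or 'ChIP'.
    """
    if assay == "DNase":
        # 5 bp on either side of the cut site. Counting the cut base itself
        # (an 11 bp window) is an assumption, not stated in the description.
        return max(0, five_prime - 5), five_prime + 6
    # FAIRE / ChIP: a 134 bp fragment that starts at the 5' end and runs
    # toward the 3' end of the read's strand.
    if strand == "+":
        return five_prime, five_prime + 134
    return max(0, five_prime - 133), five_prime + 1

def base_overlap(reads, assay):
    """Count extended fragments overlapping each base; returns {position: count}."""
    coverage = Counter()
    for five_prime, strand in reads:
        start, end = extend_read(five_prime, strand, assay)
        for pos in range(start, end):
            coverage[pos] += 1
    return coverage
```

For example, base_overlap([(100, "+"), (300, "-")], "FAIRE") gives a score of 2 for every base where the two 134 bp fragments overlap and 1 elsewhere along either fragment.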

Additional data that were used to generate these tracks are located in the ENCODE Mapability track:

Uniqueness: The Duke uniqueness tracks were used to identify regions of unique sequence for different tag lengths. The tracks also identify regions where high-throughput sequence tags cannot be mapped.
Excluded Regions: The Duke excluded regions track was used to identify problematic regions for short sequence tag signal detection (such as satellites and rRNA genes). These regions of the genome were excluded from the Open Chromatin tracks.

Methods

Cells were grown according to the approved ENCODE cell culture protocols.

DNaseI hypersensitive sites were isolated using methods called DNase-seq or DNase-chip (Boyle et al., 2008a; Crawford et al., 2006). Briefly, cells were lysed with NP40, and intact nuclei were digested with optimal levels of DNaseI enzyme. DNaseI digested ends were captured from three different DNase concentrations, and material was sequenced using Illumina (Solexa) sequencing. DNase-seq data were verified using material that was hybridized to NimbleGen Human ENCODE tiling arrays (1% of the genome). Multiple independent growths (replicates) were compared to verify the reproducibility of the data. A more detailed protocol is available here.

FAIRE was performed (Giresi et al., 2007) by cross-linking proteins to DNA using 1% formaldehyde solution, and the complex was sheared using sonication. Phenol/chloroform extractions were performed to remove DNA fragments cross-linked to protein. The DNA recovered in the aqueous phase was hybridized to NimbleGen Human ENCODE tiling arrays (1% of the genome) and sequenced using a Solexa sequencing system. The ENCODE array data were used to verify the accuracy of the sequencing data, and multiple independent growths (replicates) were compared to assess the reproducibility of the data. A more detailed protocol is available here. Also see Giresi et al., 2009.

To perform ChIP, proteins were cross-linked to DNA in vivo using 1% formaldehyde solution (Bhinge et al., 2007; ENCODE Project Consortium, 2007). Cross-linked chromatin was sheared by sonication and immunoprecipitated using a specific antibody against the protein of interest. After reversal of the cross-links, the immunoprecipitated DNA was used to identify the genomic location of transcription factor binding. This was accomplished by Solexa sequencing of the ends of the immunoprecipitated DNA (ChIP-seq), as well as labeling and hybridization of the immunoprecipitated DNA to NimbleGen Human ENCODE tiling arrays (1% of the genome) along with the input DNA as reference (ChIP-chip). The ENCODE array data were used to verify the accuracy of the sequencing data, and multiple independent growths (replicates) were compared to assess the reproducibility of the data. A more detailed protocol is available here.

ENCODE Array data were normalized using the Tukey biweight normalization, and peaks were called using ChIPOTle (Buck et al., 2005) at multiple levels of significance. Size-matched regions devoid of any significant signal were also created, to serve as non-peaks in the ROC analysis.
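
The array-derived peaks and matched non-peak regions make it possible to pick a sequence-based score cutoff by ROC analysis. The following Python sketch shows one way such a cutoff could be chosen at a target sensitivity. It is an illustration under stated assumptions, not the groups' actual procedure: it assumes each region has been reduced to a single sequence-based score (for example, its highest signal value), and the function names are hypothetical.

```python
import math

def cutoff_for_sensitivity(peak_scores, target_sensitivity=0.95):
    """Largest cutoff that still recovers at least the target fraction of array peaks."""
    if not peak_scores:
        raise ValueError("need at least one peak score")
    ranked = sorted(peak_scores, reverse=True)
    # Number of array peaks that must score at or above the cutoff.
    # The small epsilon guards against floating-point round-up in the product.
    needed = max(1, math.ceil(target_sensitivity * len(ranked) - 1e-9))
    return ranked[needed - 1]

def roc_point(peak_scores, non_peak_scores, cutoff):
    """Return (false positive rate, true positive rate) when calling score >= cutoff."""
    tpr = sum(s >= cutoff for s in peak_scores) / len(peak_scores)
    fpr = sum(s >= cutoff for s in non_peak_scores) / len(non_peak_scores)
    return fpr, tpr
```

Evaluating roc_point over a range of cutoffs traces out the ROC curve; cutoff_for_sensitivity picks the single operating point on it.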

Sequences from each experiment were aligned to the genome using Maq (Li et al., 2008) and those that aligned to 4 or fewer locations were retained. Other sequences were also filtered based on their alignment to problematic regions (such as satellites and rRNA genes). The resulting digital signal was converted to a continuous wiggle track using F-Seq that employs Parzen kernel density estimation to create base pair scores (Boyle et al., 2008b). Discrete DNase HS, FAIRE, and ChIP sites (peaks) were identified from DNase/FAIRE/ChIP-seq using F-Seq by setting a Parzen cutoff based on ROC curve analysis using peaks and non-peaks identified from DNase/FAIRE/ChIP-chip using NimbleGen Human ENCODE tiling arrays (1% of the genome).
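
As an illustration of how kernel density estimation turns discrete read positions into a continuous per-base signal, here is a minimal Python sketch. It is not F-Seq itself: the Gaussian kernel, the bandwidth value, the truncation at three bandwidths, and the function name are all assumptions chosen for the example, and the background and alignability corrections described on this page are omitted.

```python
import math

def kernel_density(read_positions, region_start, region_end, bandwidth=100.0):
    """Per-base Gaussian kernel density over the half-open region [region_start, region_end)."""
    scores = [0.0] * (region_end - region_start)
    if not read_positions:
        return scores
    # Normalizing constant so that each read contributes a total weight of 1/n.
    norm = 1.0 / (len(read_positions) * bandwidth * math.sqrt(2.0 * math.pi))
    window = int(3 * bandwidth)  # ignore contributions beyond 3 bandwidths
    for pos in read_positions:
        lo = max(region_start, pos - window)
        hi = min(region_end, pos + window + 1)
        for x in range(lo, hi):
            z = (x - pos) / bandwidth
            # Nearby bases receive more weight than distant ones.
            scores[x - region_start] += norm * math.exp(-0.5 * z * z)
    return scores
```

Unlike a fixed-window count, every base receives a weighted contribution from each nearby read, so the resulting signal is smooth and peaks where reads cluster.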

Input data were generated for GM12878, K562, HeLa-S3, HepG2, and HUVEC. These were used directly to create a control/background model used by F-Seq when generating signal annotations, and subsequently peaks, for these cell lines. These models are meant to correct for sequencing biases, alignment artifacts, and copy number changes in these cell lines. Input data are not being generated directly for other cell lines. Instead, a general background model was derived from the five Input data sets. This should provide corrections for sequencing biases and alignment artifacts, but not for cell-type-specific copy number changes.

Release Notes

This is Release 2 (Oct 2009) of this track, which includes new experimental data as well as the changes described below. The affected database tables and files include 'V2' in the name, and metadata is marked with "submittedDataVersion=V2", followed by the reason for replacement. Specific changes are:

Signals were replaced for all previously released experiments, due to improvements in the background model.

New Peaks were called for all previously released experiments. Not only had the background model improved, but the method for determining peaks was changed from relying on ROC AUC analysis to set peak thresholds to fitting all peaks from each dataset to a gamma distribution, and setting a threshold based on a p-value. For DNaseHS and CTCF, this p-value threshold is 0.05. For c-Myc, this is 0.01 with the exception of GM12878 where it is 0.001. In addition, the background models used when generating the signal tracks for each were modified slightly.
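
The gamma-based thresholding described above can be sketched in Python as follows. This is only an illustration: the release notes do not say how the gamma distribution was fitted, so the method-of-moments fit used here is an assumption, as are the function names. The tail probability is computed with a simple series that is adequate for moderate scores and thresholds such as 0.05 or 0.01, but not for extremely small p-values.

```python
import math

def fit_gamma_moments(values):
    """Method-of-moments gamma fit (an assumed fitting method); returns (shape, scale)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean * mean / var, var / mean

def gamma_sf(x, shape, scale):
    """Upper-tail probability P(X >= x) for a gamma distribution."""
    if x <= 0:
        return 1.0
    t = x / scale
    # Series expansion of the regularized lower incomplete gamma function.
    term = 1.0
    total = 1.0
    n = 0
    while abs(term) > 1e-15 * abs(total) and n < 100000:
        n += 1
        term *= t / (shape + n)
        total += term
    log_prefix = shape * math.log(t) - t - math.lgamma(shape + 1.0)
    lower = math.exp(log_prefix) * total
    return max(0.0, 1.0 - lower)

def significant_peaks(peak_scores, p_threshold=0.05):
    """Return (score, p-value) pairs for peaks at or below the p-value threshold."""
    shape, scale = fit_gamma_moments(peak_scores)
    kept = []
    for score in peak_scores:
        p_value = gamma_sf(score, shape, scale)
        if p_value <= p_threshold:
            kept.append((score, p_value))
    return kept
```

With this approach the threshold is expressed as a p-value, so the same procedure can be applied with 0.05, 0.01 or 0.001 depending on the experiment.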

For HepG2, a cell line derived from a male individual, the pseudoautosomal regions (PARs) on chrX and chrY were incorrectly annotated. This has been corrected, and only the PARs on chrX are currently annotated.

The following additional changes were made in Release 2:

DNaseHS, GM12878: New sequences were generated for replicates 1 and 3.
DNaseHS, K562: Replicate 3 was discarded due to poor quality data.
FAIRE, GM12878: New sequences were generated for replicates 1 and 2.
FAIRE, K562: New sequences were generated for replicates 2 and 3. Replicate 1 was discarded due to poor quality data.
FAIRE, HeLaS3: New sequences were generated for replicates 1 and 3. Replicate 2 was discarded due to poor quality data.
FAIRE, HepG2: New sequences were generated for replicates 1, 2 and 3.

Previous versions of these files are available for download from the FTP site.

Credits

These data and annotations were created by a collaboration of multiple institutions (contact: tsfurey@duke.edu):

Duke University's Institute for Genome Sciences & Policy (IGSP): Alan Boyle, Lingyun Song, Terry Furey, and Greg Crawford
University of North Carolina at Chapel Hill: Paul Giresi and Jason Lieb
University of Texas at Austin: Zheng Liu, Ryan McDaniell, Bum-Kyu Lee, and Vishy Iyer
European Bioinformatics Institute: Paul Flicek, Damian Keefe, and Ewan Birney
University of Cambridge, Department of Oncology and CR-UK Cambridge Research Institute (CRI): Stefan Graf

We thank NHGRI for ENCODE funding support.

References

Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Res. 2007 Jun;17(6):910-6.

Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008 Jan 25;132(2):311-22.

Boyle AP, Guinney J, Crawford GE, Furey TS. F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics. 2008 Nov 1;24(21):2537-8.

Buck MJ, Nobel AB, Lieb JD. ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biol. 2005;6(11):R97.

Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS. DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nat Methods. 2006 Jul;3(7):503-9.

Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31.

The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007 Jun 14;447(7146):799-816.

Giresi PG, Kim J, McDaniell RM, Iyer VR, Lieb JD. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 2007 Jun;17(6):877-85.

Giresi PG, Lieb JD. Isolation of active regulatory elements from eukaryotic chromatin using FAIRE (Formaldehyde Assisted Isolation of Regulatory Elements). Methods. 2009 Jul;48(3):233-9.

Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008 Nov;18(11):1851-8.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.
