Description

This track contains the location and score of transcription factor binding sites conserved in the human/mouse/rat alignment. A binding site is considered to be conserved across the alignment if its score meets the threshold score for its binding matrix in all 3 species. The score and threshold are computed with the Transfac Matrix Database (v7.0) created by Biobase. The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites.

In the graphical display, each box represents one conserved putative tfbs. Clicking on a box brings up detailed information on the binding site, namely its Transfac I.D., a link to its Transfac Matrix (free registration with Transfac required), its location in the human genome (chromosome, start, end, and strand), its length in bases, its raw score, and its Z score.

All binding factors that are known to bind to the particular binding matrix of the binding site are listed along with their species, SwissProt ID, and a link to that factor's page on the UCSC Protein Browser if such an entry exists.

Methods

The Transfac Matrix Database (v.7.0) contains position-weight matrices for 398 transcription factor binding sites, as characterized through experimental results in the scientific literature. Only binding matrices for known transcription factors in human, mouse, or rat were used for this track (258 of the 398). A typical (in this case ficticious) matrix (call it mat) will look something like:

        A      C      G      T
01     15     15     15     15      N
02     20     10     15     15      N
03      0      0     60      0      G
04     60      0      0      0      A
05      0      0      0     60      T

The above matrix specifies the results of 60 (the sum of each row) experiments. In the experiments, the first position of the binding site was A 15 times, C 15 times, G 15 times, and T 15 times (and so on for each position.) The consensus sequence of the above binding site as characterized by the matrix is NNGAT. The format of the consensus sequence is the deduced consensus in the IUPAC 15-letter code.

In the general case, the goal is to find all matches to a matrix of length n that are conserved across n_s sequences. For this example, n=5 and n_s=3 (human, mouse, and rat.) Denote the multispecies alignment s, such that s_ji is the nucleotide at position j of species i. Also, define an n_s x 4 background matrix (call it back) giving the background frequencies of each nucleotide in each species. A sliding window (of length n) calculates the "species score" for each species at each position:

From this, a log-odds score is calculated for each species (normalizing by the length of the matrix and the number of species in the alignment):

These scores are then summed for all species, yielding a final log-odds score for the current position:

Note that the log-odds score of each species must exceed the threshold for that species. The threshold is calculated for each species such that the only hits that will be reported will have a Z score (to be discussed later) of 1.64 or higher in each species (corresponding to a p-value of 0.05). Next, the maximum and minimum possible log-odds scores are computed and summed across all species for the given binding matrix:

These are then used to normalize the final, raw log-odds score so that its range is between 0 and 1:

Next, the best raw score for each binding matrix is calculated for the 5,000 base upstream region of each human RefSeq gene (taken from the RefGene table for hg19.) The mean and standard deviation for each binding matrix are then calculated across all RefSeq genes. These are then used to create the threshold for each binding matrix, namely, 1.64 standard deviations above the mean. Tfloc is then run with this threshold on each chromosome for the 3-way multiz alignments. Finally, a Z score is calculated for each binding site hit h to matrix m according to the following formula:

This final Z score can be interpreted as the number of standard deviations above the mean raw score for that binding matrix across the upstream regions of all RefSeq genes. The default Z score cutoff for display in the browser is 2.33 (corresponding to a p-value of 0.01.) This cutoff can be adjusted at the top of this page.

After all hits have been recorded genome-wide, one final filtering step is performed. Due to the inherant redundancy of the Transfac database, several binding sites that all bind the same factor often appear together. For example, consider the following binding sites:

585     chr1    4021    4042    V$$MEF2_02       875     -       2.83
585     chr1    4021    4042    V$$MEF2_03       917     -       3.38
585     chr1    4021    4042    V$$MEF2_04       844     -       3.45
585     chr1    4022    4037    V$$HMEF2_Q6      810     -       2.34
585     chr1    4022    4037    V$$MEF2_01       802     -       2.47
585     chr1    4022    4038    V$$RSRFC4_Q2     875     -       2.65
585     chr1    4022    4039    V$$AMEF2_Q6      823     -       2.44
585     chr1    4023    4038    V$$RSRFC4_01     878     +       2.53
585     chr1    4024    4035    V$$MEF2_Q6_01    913     +       2.41
585     chr1    4024    4039    V$$MMEF2_Q6      861     -       2.39

These 10 overlapping binding sites bind a total of 19 factors. However, of these 19 factors, only 7 of them are unique. Many of the above binding sites are redundant (they add no additional factors). In fact, the first 3 binding sites all bind the same two factors (namely, aMEF-2 and MEF-2A). These ten binding sites can therefore be filtered down to the following four binding sites, without any loss of information (in terms of transcription factors). The final table entry then has the following four lines, since these four binding sites account for all 7 of the unique factors:

585     chr1    4021    4042    V$$MEF2_04       844     -       3.45
585     chr1    4022    4038    V$$RSRFC4_Q2     875     -       2.65
585     chr1    4024    4035    V$$MEF2_Q6_01    913     +       2.41
585     chr1    4024    4039    V$$MMEF2_Q6      861     -       2.39

In the event that multiple binding sites bind the same factors, the site with the highest Z score is chosen. Only binding sites which overlap each other and whose start positions are within 5 bases of each other are considered for merging.

It should be noted that the positions of many of these conserved binding sites coincide with known exons and other highly conserved regions. Regions such as these are more likely to contain false positive matches, as the high sequence identity across the alignment increases the likelihood of a short motif that looks like a binding site to be conserved. Conversely, matches found in introns and intergenic regions are more likely to be real binding sites, since these regions are mostly poorly conserved.

These data were obtained by running the program tfloc (Transcription Factor binding site LOCater) on multiz46way alignments, restricting only to the July 2007 (mm9) mouse genome assembly, the November 2004 rat assembly (rn4), and the February 2009 human genome assembly (hg19). Transcription factor information was culled from the Transfac Factor database, version 7.0.

Table Format

The format of the tfbsConsSites sql table is shown above. The columns are (from left to right): bin, chromosome, from, to, binding matrix ID, raw score, strand, and Z score.

To get the corresponding transcription factor information for a given binding matrix, use the table tfbsConsFactors. The format of the tfbsConsFactors sql table is:

V$$MYOD_01       M00001  mouse   MyoD    P10085
V$$E47_01        M00002  human   E47     N
V$$CMYB_01       M00004  mouse   c-Myb   P06876
V$$AP4_01        M00005  human   AP-4    Q01664
V$$MEF2_01       M00006  mouse   aMEF-2  Q60929
V$$MEF2_01       M00006  rat     MEF-2   N
V$$MEF2_01       M00006  human   MEF-2A  Q02078
V$$ELK1_01       M00007  human   Elk-1   P19419
V$$SP1_01        M00008  human   Sp1     P08047
V$$EVI1_06       M00011  mouse   Evi-1   P14404

The columns are (from left to right): transfac binding matrix id, transfac binding matrix accession number, transcription factor species, transcription factor name, SwissProt accesssion number. When no factor species, name, or id information exists in the transfac factor database for a binding matrix, an 'N' appears in the corresponding column(s). Notice also that if more than one transcription factor is known for one binding matrix, each occurs on its own line, so multiple lines can exist for one binding matrix.

Credits

These data were generated using the Transfac Matrix and Factor databases created by Biobase.

The tfloc program was developed at The Pennsylvania State University (with numerous updates done at UCSC) by Matt Weirauch.

This track was created by Matt Weirauch and Brian Raney at The University of California at Santa Cruz.