Description

This track contains the location and score of transcription factor binding sites conserved in the human/mouse/rat alignment. A binding site is considered to be conserved across the alignment if its score meets the threshold score for that binding site in all 3 species. The score and threshold are computed with the Transfac Matrix Database (v4.0) created by Biobase. The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites.

In the graphical display, each box represents one conserved tfbs. The darker the box, the better the match of the binding site. Clicking on a box brings up detailed information on the binding site, namely its Transfac I.D., a link to its Transfac Matrix (free registration with Transfac required), its location in the human genome (chromosome, start, end, and strand), its length in bases, and its score.

All binding factors that are known to bind to the particular binding site are listed along with their species, SwissProt ID, and a link to that factor's page on the UCSC Protein Browser if such an entry exists.

Methods

A binding site is considered to be conserved across the alignment if its score meets the threshold score for that binding site at exactly the same position in the alignment in all 3 species. If there is no orthologous sequence in the mouse or the rat, no prediction is made. The following is a brief discussion of the scoring and threshold system used for these data.

The Transfac Matrix Database contains position-weight matrices for 336 transcription factor binding sites, as characterized through experimental results in the scientific literature. A typical (in this case ficticious) matrix will look something like:

        A      C      G      T
01     15     15     15     15      N
02     20     10     15     15      N
03      0      0     60      0      G
04     60      0      0      0      A
05      0      0      0     60      T

The above matrix specifies the results of 60 (the sum of each row) experiments. In the experiments, the first position of the binding site was A 15 times, C 15 times, G 15 times, and T 15 times (and so on for each position.) The consensus sequence of the above binding site as characterized by the matrix is NNGAT. The format of the consensus sequence is the deduced consensus in the IUPAC 15-letter code.

The score of a segment of DNA is computed in relation to a matrix as follows:

score = SUM over each position in the matrix of
matrix[position][nucleotide_in_segment_at_this_position].

For example, the sequence "CCGAT" would have a score of: 15 + 10 + 60 + 60 + 60 = 205 for the above matrix. A score in relation to a matrix of length n can be computed for every DNA segment of length n.

The threshold for a binding site is computed from its Transfac Matrix Database entry as follows:

          St = Smin + ((Smax - Smin) * C)
                                                                               
where     St is the target threshold score
          Smin is the minimum possible score
          Smax is the maximum possible score
          C is the cutoff value used by the scoring function

For example, the above matrix has a minimum score of 15 + 10 + 0 + 0 + 0 = 25 and a maximum score of 15 + 20+ 60 + 60 + 60 = 215. Using a cutoff value of 0.85 (the value used for this track), the threshold for the above matrix is:

25 + ((215 - 25) * 0.85) = 186.5

As such the sequence "CCGAT" from above would be recorded as a hit with a cutoff value of 0.85, since its score (215) exceeds the threshold for this particular binding site (186.5.)

The final score reported is the minimum cutoff value that the position would have been recorded as a hit (multiplied by 1000.) The final score of the above example is therefore:

((Score - Smin) / (Smax - Smin)) * 1000 = (205 - 25) / (215 - 25)) = 0.947 * 1000 = 947.

Therefore, the final score for the sequence "CCGAT" would be 947. Although the scores of all three species in the alignment must exceed the threshold, the only final score that is reported for this track is the final score of the binding site in the human sequence.

It should be noted that the positions of many of these conserved binding sites coincide with known exons and other highly conserved regions. Regions such as these are more likely to contain false positive matches, as the high sequence identity across the alignment increases the likelihood of a short motif that looks like a binding site to be conserved. Conversely, matches found in introns and intergenic regions are more likely to be real binding sites, since these regions are mostly poorly conserved.

These data were obtained by running the program tfloc (Transcription Factor binding site LOCater) on multiz humor25 alignments of the Feb. 2003 mouse draft assembly (mm3) and the Jan. 2003 rat assembly (rn2) to the Apr. 2003 human genome assembly (hg15.) Tfloc was run on the subset of the Transfac Matrix Database containing human, mouse, and rat related binding sites (164 total.) Transcription factor information was culled from the Transfac Factor database.

Credits

These data were generated using the Transfac Matrix and Factor databases created by Biobase.

The tfloc program was developed at The Pennsylvania State University by Matt Weirauch.

This track was created by Matt Weirauch and Brian Raney at The University of California at Santa Cruz.