Next: A Paradox: How can Up: Consensus Sequence Zen Previous: Abstract.

How to be sure to make a mistake.

Genes are controlled by proteins that bind to specific spots on the DNA sequence. Molecular biologists often represent the patterns at these spots by using a consensus sequence. For example, after aligning some binding sites so that they match each other, one position might contain 70% adenine, 10% cytosine, 10% guanine, and 10% thymine. The consensus at that position is the most frequent base, 'A'. This is the simplest (and possibly the most commonly applied) approach, but there are alternatives [Day & McMorris, 1992]. Various kinds of consensus sequence commonly found in the molecular biology literature will be considered here; the controversy over the use of consensus trees in phylogenetic inference [Barrett et al., 1991, Nelson, 1993, Barrett et al., 1993, de Queiroz, 1993] will not be covered.
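The most-frequent-base rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `consensus` and the toy alignment are assumptions made for the example, not data or code from the literature cited here, and ties are broken arbitrarily by first appearance.

```python
from collections import Counter

def consensus(sites):
    """Return the most frequent base in each column of aligned sites.

    Hypothetical helper: assumes the sites are already aligned and
    all have the same length.
    """
    if len({len(s) for s in sites}) != 1:
        raise ValueError("aligned sites must all have the same length")
    result = []
    for column in zip(*sites):
        counts = Counter(column)
        # most_common(1) returns [(base, count)] for the top base;
        # ties go to whichever base was seen first in the column.
        result.append(counts.most_common(1)[0][0])
    return "".join(result)

# Toy alignment invented for illustration (not real binding-site data).
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT", "TATACT"]
print(consensus(toy_sites))  # TATAAT
```

Note that none of the variation in the toy alignment survives in the result: a column that is 80% T and a column that is 100% T both come out as T.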

The main difficulty with using consensus sequences is that they present distorted pictures of binding sites. To locate new binding sites, a consensus sequence is compared to various locations in a sequence and the number of matches is tallied. A difficulty arises because a position that is always an 'A' in the original set is treated the same as a position that is only 70% A. If we record such a position simply as A and then use it to look for additional binding sites, we will find a mismatch there for 30% of the acceptable sequences. This problem is compounded across the entire binding site, which may be 20 or even 40 bases long [Schneider, 1996, Zheng et al., 1999]. For example, a commonly cited consensus sequence is TATAAT [Lewin, 1997], which represents the -10 region of bacterial promoters originally discovered by David Pribnow (1975). The most prominent bases at three of its six positions are only 49%, 58%, and 54% respectively [Lisser & Margalit, 1993]. If one demands that a site have all of the consensus bases, one finds only 14 TATAAT sequences out of 291 sequences in the database (under 5%). To deal with this, people often count mismatches, but it is not obvious from the simple consensus which bases are allowed to be more variable. Sometimes variations such as allowing C or G are indicated but, again, the degree of allowed variation is lost. It is not surprising, then, that consensus sequences frequently fail to identify binding sites or that they predict sites where there are none.
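The search procedure and the way the errors compound can be sketched as follows. The names `count_mismatches`, `scan_for_sites`, and `full_match_fraction` are hypothetical, the sequence is a toy string invented for the example, and `full_match_fraction` assumes that positions vary independently, which real binding sites need not do.

```python
import math

def count_mismatches(consensus, window):
    """Count positions where the window differs from the consensus."""
    return sum(1 for c, w in zip(consensus, window) if c != w)

def scan_for_sites(sequence, consensus, max_mismatches=0):
    """Return (position, window, mismatches) for every window that has
    at most max_mismatches differences from the consensus."""
    n = len(consensus)
    hits = []
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        m = count_mismatches(consensus, window)
        if m <= max_mismatches:
            hits.append((i, window, m))
    return hits

def full_match_fraction(top_base_frequencies):
    """Fraction of sites expected to match the consensus at every listed
    position, ASSUMING the positions are independent."""
    return math.prod(top_base_frequencies)

toy_sequence = "GGTATAATCCTACAATGG"  # invented for illustration
print(scan_for_sites(toy_sequence, "TATAAT"))
# [(2, 'TATAAT', 0)]
print(scan_for_sites(toy_sequence, "TATAAT", max_mismatches=1))
# [(2, 'TATAAT', 0), (10, 'TACAAT', 1)]

# Twenty positions that are each 70% consensus base:
print(full_match_fraction([0.7] * 20))
```

The mismatch count treats every position alike: a mismatch at a position that is invariant in the known sites costs exactly as much as one at a position that is only 49% conserved, which is the loss of information described above.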

Consensus sequences have other serious problems, many of which are revealed by using information theory to measure the amount of conservation in bits. In a set of aligned binding sites, a DNA position that is always an A stays that way during evolution because the molecule that binds to it always selects A from the four possible bases [Schneider, 2000]. Such a selection can be made with a minimum of two yes-no questions: 'Is it in the set A or T?' and 'Is it in the set A or C?', so the selection takes two bits of information, one to answer each question. Likewise, a position that is either A or T requires only one yes-no question (the other one being ignored), so it has one bit of sequence conservation. The late Claude Shannon figured out how to consistently measure the average information when the frequencies are not so simple [Shannon, 1948, Schneider et al., 1986, Schneider, 1995]. One can plot the sequence conservation across all positions in the set of aligned binding sites. This continuous quantitative measure often follows a sine wave, reflecting the binding of a protein to one face of helically twisting B-form DNA [Papp et al., 1993, Schneider, 1996, Schneider, 2001]. This subtle effect cannot be seen by using consensus sequences.
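As a rough illustration of the measure, the conservation at one position can be computed as 2 bits minus the Shannon uncertainty of the observed base frequencies. The sketch below is a simplification under stated assumptions: the names `conservation_bits` and `conservation_profile` are hypothetical, the maximum of 2 bits assumes four equally likely bases beforehand, and no correction for small sample sizes is applied.

```python
import math
from collections import Counter

def conservation_bits(column):
    """Sequence conservation (bits) of one aligned position.

    Simplified sketch: 2 bits minus the Shannon uncertainty of the
    observed base frequencies, with no small-sample correction.
    """
    counts = Counter(column)
    total = len(column)
    uncertainty = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
    return 2.0 - uncertainty

def conservation_profile(sites):
    """Conservation at every position of equal-length aligned sites."""
    return [conservation_bits(column) for column in zip(*sites)]

print(conservation_bits("AAAA"))  # 2.0 (always A: two questions)
print(conservation_bits("ATAT"))  # 1.0 (A or T: one question)
# 70% A, 10% each of C, G, T: roughly 0.64 bits
print(round(conservation_bits("AAAAAAACGT"), 2))
```

A position that is always A scores 2 bits and one that is 70% A scores about 0.64 bits, yet both appear as 'A' in a consensus sequence. Plotting the output of `conservation_profile` against position gives the kind of curve described above.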


Tom Schneider 2002-12-05