Modeling splice site and transcription factor binding site variation by information theory.
P.K. Rogan (1), S.R. Svojanovsky (1), I. Hurwitz (1), T.D. Schneider (2), J.S. Leeder (1).
1) Children's Mercy Hospital, Kansas City, MO; 2) National Cancer Institute, Frederick, MD.
We have validated information theory-based models for human acceptor and donor splice sites and NF-κB
heterodimer binding sites. The average information describes the range
of variation in sites having a common function, whereas the information
content of a single site (Ri) measures its conservation
within a family of binding sites. The strengths of different sites can
be directly compared based on their respective Ri values, since Ri
is related to the free energy of binding. The splice site models
comprise a set of automatically curated donor (n=111,772) and acceptor
(n=108,079) sites from all known genes in the human genome draft
sequence. These comprehensive models accurately predict the effects of
mutations, polymorphisms and cryptic splicing, including variants that
reduce but do not abolish splicing and often produce milder clinical
phenotypes. The NF-κB model was derived
initially from previously known strong sites and then iteratively
refined by incorporating binding sites predicted from the initial model
and validated by EMSA studies. The NF-κB
model accurately rank-orders the strengths of known binding sites in
competitor EMSA assays, and distinguishes promoters of genes regulated
by NF-κB from those in which transcription
is not known to be induced. The model was validated by detecting known
(and previously unrecognized) sites in promoters of each of 13 genes
regulated by NF-κB that were excluded from
the initial model. The most sensitive and specific information
theory-based models are based on sites spanning a wide range of binding
affinities. A CCAAT-box protein binding site model (n=175) based on the
TRANSFAC database accurately predicted a ≥1.4-fold increase in binding strength due to a G>A substitution in the promoter of the Aγ-globin
gene that results in HPFH. Many other transcription factor binding
sites collated in TRANSFAC are biased towards strong binding sites.
More representative models will be required to detect weaker binding
sites and to reliably assess the effects of mutations. Supported by PHS
R01 ES10885-02 and the Merck Genome Research Foundation.
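
The individual information measure (Ri) and the average information discussed above can be illustrated with a minimal Python sketch. It is not the authors' software. It assumes an equiprobable background (2 bits per position), omits the small-sample correction used in published information-theory models, and uses a hypothetical four-site toy alignment in place of real splice or NF-κB sites. The function names and the `pseudocount` parameter are illustrative. The `min_fold_change` helper further assumes that a difference of ΔRi bits corresponds to at least a 2^ΔRi-fold difference in binding strength, which is one common reading of the statement that Ri is related to the free energy of binding.

```python
import math
from collections import Counter

BASES = "ACGT"


def weight_matrix(sites, pseudocount=0.0):
    """Build a per-position weight table Riw(b, l) = 2 + log2 f(b, l), in bits.

    Assumptions (not from the abstract): equiprobable background and no
    small-sample correction. A base never seen at a position scores -inf
    unless a pseudocount is supplied.
    """
    if not sites or len({len(s) for s in sites}) != 1:
        raise ValueError("sites must be a non-empty list of equal-length sequences")
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / (n + 4 * pseudocount)
            column[base] = 2.0 + math.log2(freq) if freq > 0 else float("-inf")
        matrix.append(column)
    return matrix


def individual_information(site, matrix):
    """Ri of one site: the sum of its per-position weights, in bits."""
    if len(site) != len(matrix):
        raise ValueError("site length does not match the model")
    return sum(column[base] for base, column in zip(site, matrix))


def average_information(sites, matrix):
    """Mean Ri over a set of sites (the average information of the family)."""
    return sum(individual_information(s, matrix) for s in sites) / len(sites)


def min_fold_change(ri_reference, ri_variant):
    """Assumed lower bound on the fold change in binding strength: 2 ** delta_Ri."""
    return 2.0 ** (ri_variant - ri_reference)


# Hypothetical toy alignment, for illustration only.
toy_sites = ["ACGT", "ACGA", "ACGT", "ACCT"]
matrix = weight_matrix(toy_sites)

print(round(individual_information("ACGT", matrix), 2))  # 7.17 (consensus site)
print(round(individual_information("ACCT", matrix), 2))  # 5.58 (weaker site)
print(round(average_information(toy_sites, matrix), 2))  # 6.38
```

Under the same assumption, a ≥1.4-fold increase in binding strength corresponds to a gain of at least log2(1.4), about 0.49 bits, in Ri. Sites with a strongly negative Ri would not be expected to bind, which is why a model built only from strong sites tends to underestimate weaker ones.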