REVEALING THE CONFORMATIONAL AND PHYSICO-CHEMICAL DNA FEATURES APPLICABLE FOR PREDICTING THE ACTIVITY OF THE FUNCTIONAL DNA SITES

Mikhail Ponomarenko, Nikolay Kolchanov, Julia Ponomarenko, Anatoly Frolov,
Olga Podkolodnaya, Denis Vorobiev, Nikolay Podkolodny^#, G. Christian Overton^&

Institute of Cytology and Genetics, Novosibirsk, Russia;

^#Institute of Computational Mathematics and Mathematical
Geophysics, Novosibirsk, Russia;
^&Center for Bioinformatics, University of Pennsylvania,
Philadelphia, USA

ABSTRACT

We suggest a new approach to predicting the activity of DNA functional
sites that focuses on the interpretability of the prediction in terms of a probable
molecular mechanism of site functioning. The biological novelty of the method lies in
the use of physico-chemical and conformational DNA properties, which gives the obtained
activity predictions a clear mechanistic interpretation. For each DNA feature analyzed,
the mean value of a given DNA property over a given region containing the site is
calculated and studied. This approach has allowed us to create a distributed and
intelligent database, ACTIVITY, for functional site activity prediction. Currently, this
database contains descriptions of over 240 experiments, over 30 conformational and
physico-chemical properties, the DNA features identified as applicable for predicting the
site activities, and the C-code programs predicting the activity of these sites from
their sequences. ACTIVITY is available at the URL /mgs/systems/activity/.

INTRODUCTION

Mulligan (1984) was the first to predict the activity K_bk₂
of E. coli promoters from a homology score. Using multiple regression on weight
matrices, Stormo (1986) predicted the DNA site activities for su2 suppression and
2-aminopurine-induced mutations, and the operator-binding activity of the Mnt repressor in
Salmonella phage P22 (Fields, 1997). Berg and von Hippel (1988) established a
statistical-mechanical theory describing the sequence dependence of DNA/protein
interactions and applied it to predict the activities of CRP- and Cro-binding sites and E.
coli promoters. Jonsson (1993) introduced neural networks to predict E. coli
promoter strength. Neural networks were also applied to predicting the INR and TATA box
activities (Kraus, 1996).

In our previous work, we introduced the system ACTIVITY
(Ponomarenko, 1997a), which generates programs that predict site activities using weighted
oligonucleotide concentrations, and then applied this system to predict the consensus
site maximizing the affinity between DNA and the TBP protein (Ponomarenko, 1997b). This
consensus was found to be similar to the well-known Bucher consensus (1990) of the
TATA box. Nevertheless, the probable molecular mechanism by which DNA functional sites
operate remains obscure when expressed in terms of oligonucleotide concentrations
(Ponomarenko, 1997a,b), weight matrices (Stormo, 1986), or neural networks (Jonsson,
1993).

With this background, we developed the distributed and intelligent
database ACTIVITY on the activities of DNA functional sites. The database ACTIVITY
comprises: (1) the database of experimental data on site sequences with known
activities; (2) the database of conformational and physico-chemical DNA properties;
(3) the knowledge base of DNA features, that is, the mean values of the DNA properties
that are significant for predicting the site activity; (4) the library of programs
predicting the site activities. ACTIVITY is available at the URL /mgs/systems/activity/.

METHODS

We suggest a linear regression for predicting the activities of
DNA functional sites. The biological novelty of the method lies in the use of
physico-chemical and conformational DNA properties, which gives the obtained activity
predictions a clear interpretation in terms of a probable molecular mechanism of site
functioning. Mean values of conformational or physico-chemical DNA properties are called
“DNA features” in this work. They are the easiest to interpret with respect to the
molecular mechanism defining the activity value of a given site, but it is hard to guess in
advance which DNA feature is responsible for the site activity. That is why we suggest
generating and testing as many DNA features as the computer can afford, as introduced by
Hajek and Havranek (1978). In our previous papers (Ponomarenko, 1997a, b), this
“generating and testing” approach (Hajek, 1978) was successfully applied to
reveal the oligonucleotide concentrations applicable to predicting the activity of a given
functional site from its sequence. However, the concentrations proved hard to interpret in
terms of a probable molecular mechanism of site functioning. For this reason, this
paper focuses on the conformational and physico-chemical DNA properties.

The core idea of the linear regression implies that the site activity F
is determined by simultaneous action of two types of the site features X: obligatory and
facultative. The obligatory features of a given site are invariant for all sequences of
this site and determine its basal activity. Consensus is a typical obligatory feature. The
facultative features of a given site are individual in terms of their “number, size, and
location” for each sequence of the site and modulate the site activity with respect to
the basal level. Hence, within the framework of the linear regression, the activity of the
site with sequence S is described by the following equation:

F(S) = F_0 + \sum_{k} F_k X_k(S), (1)

where F₀ is the basal activity level of the sites studied;
X_k(S) is the value of the k-th facultative feature of the sequence S; and F_k
is the contribution of the feature X_k to the site activity F.
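As a minimal illustration of equation (1), the sketch below evaluates the regression for one sequence whose feature values are already known. The function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def predict_activity(basal, contributions, features):
    """Evaluate F(S) = F0 + sum_k F_k * X_k(S) for one sequence."""
    if len(contributions) != len(features):
        raise ValueError("one contribution F_k is needed per feature X_k")
    return basal + sum(f_k * x_k for f_k, x_k in zip(contributions, features))


# Hypothetical numbers, chosen only to show the arithmetic.
activity = predict_activity(1.0, [0.5, -2.0], [4.0, 0.25])
print(activity)  # 1.0 + 0.5*4.0 - 2.0*0.25 = 2.5
```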

Local conformational DNA heterogeneities dependent on the nucleotide
context play an important role in DNA-protein interactions, which essentially determine
the site activity. That is why the prediction of site activity takes into account the DNA
conformational properties describing the mutual orientation and locations of base pairs.
Also, we used the earlier published values of physico-chemical properties averaged for the
known X-ray structures including melting temperature, persistent length, entropy, etc.
These properties determine the molecular dynamics of DNA sites during their functioning.
Currently, 38 conformational and physico-chemical properties are utilized in prediction of
the site activities. Thus, the sequence of the site S can be characterized by the mean
value of the q-th property R_q averaged over the region between positions a and
b:

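X_{q,a,b}(S) = \frac{1}{b-a} \sum_{i=a}^{b-1} R_q(s_i s_{i+1}), (2)

where s_i s_{i+1} is the dinucleotide step at position i of the sequence S.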
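The averaging in equation (2) can be sketched as follows, assuming that each property is tabulated per dinucleotide step and that positions are counted from zero. The function name is hypothetical, and the table values are placeholders, not measured DNA parameters.

```python
def mean_property(sequence, table, a, b):
    """Average a dinucleotide property over steps a .. b-1 (0-based) of sequence."""
    if not 0 <= a < b <= len(sequence) - 1:
        raise ValueError("region must satisfy 0 <= a < b <= len(sequence) - 1")
    steps = [sequence[i:i + 2] for i in range(a, b)]
    return sum(table[step] for step in steps) / len(steps)


# Placeholder values for illustration only; they are NOT measured DNA parameters.
toy_table = {"TA": 1.0, "AT": 3.0, "AA": 2.0}
# Steps TA, AT, TA, AA, AA give (1 + 3 + 1 + 2 + 2) / 5.
print(mean_property("TATAAA", toy_table, 0, 5))
```

A real property table would hold one value for each of the 16 dinucleotide steps.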

It should be emphasized that before starting the analysis, we knew very
little about which B-DNA physico-chemical and conformational features X_q,a,b
would be most important for the activity of a given site under study. The only available
data were certain sequences with known activities. With this in mind, the artificial
intelligence principle of impartiality is applicable: when the information is
insufficient, the more hypotheses are generated and tested, the more correct is the
result, and therefore no preference may be given to any hypothesis before it is tested
(Hajek, 1978). In this paper, each hypothesis is the assumption that a conformational or
physico-chemical feature calculated by equation (2) is significant for the activity of
the site examined. Thus, we test, one by one, all the possible variants of conformational
or physico-chemical properties R_q, exhausting all the possible regions (a, b)
within the site examined. In this way, for a fixed “q, a, b”, the conformational or
physico-chemical feature X_q,a,b(S_n) is calculated by equation (2)
for each sequence S_n with the known activity F_n. The total number of
the DNA features X_q,a,b is about 10⁵. When such a large
number of hypotheses is generated and tested, excluding insignificant hypotheses that
pass by chance becomes crucial. In this paper, we suggest addressing
this problem within the framework of utility theory for decision making (Fishburn, 1970)
and Zadeh’s fuzzy logic (Zadeh, 1965), as follows.

Let us calculate a fixed feature X_q,a,b(S_n) for
each sequence S_n with the known activity F_n by equation (2). If
the resulting pairs {X_q,a,b(S_n), F_n} meet all the
necessary conditions of linear regression (equation 1) applicability, then the
activity F is predictable for an arbitrary sequence S via the feature X_q,a,b(S).
To test these conditions, a simple regression is
optimized for the pairs {X_q,a,b(S_n), F_n}:

F_q,a,b(S_n) = f₀ + f₁ × X_q,a,b(S_n), (3)

where f₀ and f₁ are the regression coefficients
optimized for the pairs {X_q,a,b(S_n), F_n}.
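The text does not name the optimization criterion for equation (3). The sketch below assumes ordinary least squares, the usual choice for a simple regression, and the function name is hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Return (f0, f1) minimising sum((f0 + f1*x - y)**2) over the pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    f1 = sxy / sxx
    f0 = mean_y - f1 * mean_x
    return f0, f1


# Points lying exactly on y = 1 + 2x, used only to check the arithmetic.
f0, f1 = fit_simple_regression([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(f0, f1)  # 1.0 2.0
```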

To ensure the reliability of the regression between the X_q,a,b(S_n)
and F_n values, 22 conditions of regression analysis are tested: the presence of
linear, sign, and rank correlations between the predicted F_q,a,b(S_n)
and the experimental activities F_n; the equality of the distributions of these
values; the Gaussian distribution of their deviations (F_q,a,b(S_n)-F_n);
etc. For each of the 22 conditions, the significance level p_r at
which the r-th condition is met is estimated. Following Zadeh’s fuzzy logic (Zadeh, 1965),
each estimate p_r is mapped onto a uniform scale, the so-called
“partial utility of the usage of the feature X_q,a,b to predict the activity
F”, as follows:

(4)

The highest partial utility, u_r=1, is assigned to the feature
X_q,a,b if the r-th condition is met at significance p_r<0.01. The
utility is the lowest, u_r=-1, if the r-th condition is not met (p_r>0.1).
An intermediate partial utility, -1<u_r<1, is assigned to a
feature X_q,a,b that meets the r-th condition with an intermediate significance,
0.01<p_r<0.1 (u_r<0 if p_r>0.05, u_r=0
if p_r=0.05, and u_r>0 if p_r<0.05).
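Only the anchor points of the partial utility are stated above (1 below 0.01, 0 at 0.05, and -1 above 0.1). The sketch below joins them with straight lines. This piecewise-linear interpolation is an assumption made for illustration, not necessarily the form of equation (4), and the function name is hypothetical.

```python
def partial_utility(p):
    """Map a significance level p to a utility in [-1, 1].

    The anchor points follow the text; the straight-line interpolation
    between them is an illustrative assumption.
    """
    if p <= 0.01:
        return 1.0
    if p >= 0.1:
        return -1.0
    if p <= 0.05:
        return (0.05 - p) / (0.05 - 0.01)  # falls from 1 to 0
    return -(p - 0.05) / (0.1 - 0.05)      # falls from 0 to -1


for p in (0.001, 0.03, 0.05, 0.075, 0.5):
    print(p, partial_utility(p))
```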

In the utility theory for decision making (Fishburn, 1970), the
averaging of all the 22 partial utilities gives the integral utility of the usage of the
feature X_q,a,b to predict the activity F:

U(X_{q,a,b}, F) = \frac{1}{22} \sum_{r=1}^{22} u_r. (5)

Only the linearly independent features X_q,a,b with the
highest positive utilities are selected:

U(X_q,a,b,F) > 0. (6)
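The averaging of equation (5) and the threshold of equation (6) can be sketched as below. The function names and the feature keys are hypothetical, the example uses four partial utilities per feature instead of 22 for brevity, and the additional filter for linear independence is not shown.

```python
def integral_utility(partial_utilities):
    """Average the partial utilities u_r into one score U."""
    if not partial_utilities:
        raise ValueError("at least one partial utility is required")
    return sum(partial_utilities) / len(partial_utilities)


def select_features(utilities_by_feature):
    """Keep features with U > 0, best first. Input: {feature: [u_1, ..., u_R]}."""
    scored = {f: integral_utility(u) for f, u in utilities_by_feature.items()}
    kept = [(f, u) for f, u in scored.items() if u > 0]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical (property, a, b) keys with made-up partial utilities.
example = {
    ("Twist", 1, 12): [1.0, 1.0, -1.0, 0.5],   # U = 0.375, kept
    ("Tilt", 0, 5): [-1.0, -1.0, 1.0, 0.0],    # U = -0.25, rejected
}
print(select_features(example))
```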

The utility U(X_q,a,b,F) is positive if the feature X_q,a,b
meets more than half of the 22 conditions of linear regression applicability. The
probability of selecting by chance a feature X with a positive utility U(X, F)>0 from 10⁵
features was estimated approximately by the binomial criterion:

. (7)

Equation (7) shows that each conformational or physico-chemical B-DNA
feature X_q,a,b selected by equation (6) significantly satisfies the conditions of
linear regression applicability for predicting the site activity.

ALGORITHM

We used a simple combinatorial algorithm, shown schematically in Fig.
1. Its essence is the following. All the 10⁵ possible features X_q,a,b(S_n) for
all the available site sequences S_n with known activities F_n are
calculated by equation (2), and hence all the 10⁵ utilities U(X, F)
are estimated by equations (3), (4), and (5). When all U(X_q,a,b, F)<0, the
algorithm terminates and no features are selected. If some U(X_q,a,b, F)>0, all
the possible linearly independent features {X_k} with the highest positive utilities
{U(X_k, F)>0} are selected; the linear regression (1) for predicting the site activity is
derived; and the C-code program for this prediction is generated (Ponomarenko, 1997a) and
stored in the database ACTIVITY.
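The exhaustive enumeration at the heart of this combinatorial algorithm can be sketched as below, assuming that a hypothesis is any property q combined with any region a < b inside the site. The function name is hypothetical.

```python
from itertools import combinations


def enumerate_hypotheses(property_names, site_length):
    """Yield every (q, a, b): property q and region a < b within the site."""
    for q in property_names:
        for a, b in combinations(range(site_length), 2):
            yield q, a, b


hypotheses = list(enumerate_hypotheses(["Twist", "Tilt"], 5))
print(len(hypotheses))  # 2 properties * C(5, 2) regions = 20
```

Purely as an arithmetic illustration, 38 properties and a window of 73 positions would give 38 × 2628 = 99,864 hypotheses, the same order as the 10⁵ quoted above; the window length actually used is not stated here.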

This algorithm has been implemented with the Borland C compiler on the IBM PC
platform to develop the distributed and intelligent database ACTIVITY shown schematically
in Fig. 2. It contains three databases in the SRS query language format (Etzold, 1993),
the computer system generating programs for predicting site activities (Ponomarenko,
1997a), and the library of the executable code of these programs predicting activities of
DNA functional sites from their sequences. The database ACTIVITY is WWW-available at the URL /mgs/system/activity/.

RESULTS AND DISCUSSION

The most important novelty of ACTIVITY is the database of DNA site
activities. Currently, it describes 248 samples, exemplified in Table 1. Among them are
promoters and binding sites for different E. coli regulatory proteins, TATA boxes
and binding sites for various eukaryotic transcription factors, mutation hotspots, and
many others. The quantitative values characterizing specific site activities include the
association and dissociation rates, affinity, lifetime of the DNA/protein complexes,
transcription activity, mutation and cutting frequencies, etc. The database format is
exemplified in Fig. 3 by the transcription activity of the mouse alphaA-crystallin gene
promoter with the PE1B/TATA box region (-33, +3) relative to the transcription start (Sax,
1995).

ACTIVITY also contains the database of conformational and
physico-chemical properties of B-DNA. The current version of the database comprises
over 30 properties; some of them are listed in Table 2. As an example, the SRS-based
format (Etzold, 1993) of the physico-chemical property “Probability to be contacting
nucleosome core” in the database is shown in Fig. 4.

ACTIVITY also cites all the compiled experimental data on the
functional DNA site activities and the conformational and physico-chemical DNA properties
in a special database, which currently contains over 140 references.

These data on site activities and DNA properties are the input for
the computer system that generates programs predicting the site activity. The system was
developed earlier for weighted oligonucleotide concentrations (Ponomarenko, 1997a) and has
been modified for the conformational and physico-chemical DNA features X_q,a,b, which are
the mean values of the respective DNA properties R_q averaged over a given site region
(a, b) by formula (2). The system output for the initial data on the transcription
activity of the mouse alphaA-crystallin gene promoter containing the PE1B region near the
TATA box (see the experimental data shown in Fig. 3) is demonstrated in Fig. 5. This
output is stored in the knowledge base containing the DNA features significant for
predicting activity (see the scheme in Fig. 2). “Probability to be contacting
nucleosome core”, Pnucl, appeared to be the most significant physico-chemical property of
the alphaA-crystallin gene promoter PE1B/TATA box region; its values for each of the 16
possible dinucleotide steps are shown in Fig. 4. The mean value of this property averaged
over the region, which is the significant B-DNA feature, correlates negatively with
transcription activity (Fig. 6a). This negative correlation indicates that the tighter the
interaction of the promoter with nucleosomes, the lower the transcription activity.
This result is consistent both with the experimental data showing that nucleosome
displacement from a promoter precedes TBP/TATA binding (Godde, 1995; Edmondson, 1996)
and with our previous results (Ponomarenko, 1997c) that nucleosome binding sites and
basal promoters differ essentially in their B-DNA helical conformations, namely in their
mean Twist angles (maximal and minimal, respectively). The analysis has also
demonstrated that such conformational properties as the major groove width distance, DIST
(Fig. 6b), and the angle Tilt (Fig. 6c) are important for the transcription activity.
Using the mean values of these DNA properties, the linear regression (1) predicting the
transcription activity of the alphaA-crystallin gene promoter was derived (Table 3):

F = -39 - 0.1 × Pnucl + 12 × DIST - Tilt. (8)

In Fig. 6d, the linear correlation coefficient r=0.90 shows the
significant agreement between the experimental transcription activity and the activity
predicted by equation 8.
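For illustration, equation (8) can be evaluated directly once the three mean values for a promoter variant are known. In the sketch below the function name is hypothetical, and the inputs are arbitrary values inside the ranges of Table 2, not measurements from the experiment.

```python
def predict_transcription_activity(pnucl, dist, tilt):
    """Equation (8): F = -39 - 0.1*Pnucl + 12*DIST - Tilt."""
    return -39.0 - 0.1 * pnucl + 12.0 * dist - tilt


# Illustrative inputs within the Table 2 ranges, not values from the experiment.
# -39 - 1.0 + 39.6 + 1.0 is approximately 0.6
print(predict_transcription_activity(pnucl=10.0, dist=3.3, tilt=-1.0))
```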

Several dozen DNA functional sites analyzed by ACTIVITY are
listed in Table 3 and Fig. 7 to demonstrate the universality of the linear regression (1).
For all these example sites, the significant physico-chemical and conformational
features have been identified and the linear regressions predicting the site activities
have been derived. Let us consider these examples in detail.

Analysis of the sequences with known DNA bending in the TBP/TATA
complex (Starr, 1995) has shown that the bending increases with the inclination (Fig. 7a).
Similar results were obtained by the X-ray analysis of the TBP/TATA complexes (Juo, 1996).
DNA bending in these complexes results from intercalation of four phenylalanines of the
TBP between adjacent base pairs on the side of the minor groove (Juo, 1996). The
Inclination describes the rotation angle of a pair of bases along the short axis of this
pair; the increase in the angle widens the minor groove (Dickerson, 1989), thereby
facilitating the intercalation of phenylalanines in the minor groove and, hence, DNA
bending.

Fig. 7b illustrates the negative correlation between the B-helical twist angle
and the promoter affinity for the upstream stimulating transcription factor USF (Bendall,
1994) (Table 3: r=-0.896, p<10^-5). The twist also correlates negatively
(r=-0.766, p<10^-3) with the activity of the binding site of another transcription
factor, YY1 (Fig. 7c). These two negative correlations independently indicate that
a low twist may be an important characteristic of a possible molecular mechanism of
transcription initiation on eukaryotic promoters. Indeed, this is consistent with our
earlier result (Ponomarenko, 1997c) that the lowest twist is a significant DNA feature
of all known eukaryotic promoters.

Finally, Fig. 7d demonstrates that even an exotic DNA functional site
activity, such as the mutability induced by 2-aminopurine (Coulondre, 1978), increases
with the DNA melting temperature in the vicinity of the hotspots (r=0.90, p<10^-5).
This physico-chemical correlation agrees with the commonly accepted view
(Mhaskar, 1984) that 2-aminopurine-induced mutability results from repair errors that
are more frequent to the left of G:C base pairs, which exhibit the highest DNA melting
temperature, than to the left of A:T base pairs, which have the lowest melting temperature.
Very close estimates (r=0.865 and r=0.860, respectively) were obtained earlier using weight
matrices (Stormo, 1986) and the method of oligonucleotide concentrations (Ponomarenko,
1997a), but these contextual correlations did not unambiguously indicate repair
errors dependent on DNA melting temperature as a possible molecular mechanism of the DNA
mutability.

CONCLUSION

Summing up, we would like to underline that the linear regression
(equation 1) derived for predicting site activities can be informative in a wide range of
molecular biological studies. Importantly, ACTIVITY does not require a huge body
of initial experimental data. Further development of our approach will focus on
accumulating experimental data on functional DNA site activity, because there is as yet
no other database for this field of intense research. Also, we are going to
extend the database of conformational and physico-chemical properties of B-helical DNA
and to complement the linear regression model of the molecular mechanisms responsible for
functional DNA site activity with more complex and informative non-linear models
that account for the interrelation of the significant DNA features of a given site during
its functioning. In this way, our final goal is to strengthen our earlier approach to
simulating DNA sequences of a given functional site that maximize the site activity,
which is now based on heuristic molecular mechanisms of site functioning
(Ponomarenko, 1997b), by using the significant conformational and physico-chemical features
of the site to describe its functioning on a sounder basis.

ACTIVITY is available on the Web at the URL http://wwwmgs.bionet.nsc.ru/mgs/systems/activity/.

Acknowledgments

We are grateful to Ms. Galina Chirikova for help in translation. This
work was supported by NIH Grant 2-R01-RR04026-08A2, the Russian National Human Genome
Project, the Russian Ministry of Science and Technical Policy, the Siberian Branch of the
Russian Academy of Sciences (IGSBRAS-97N13), and the Russian Foundation for Basic Research
(96-04-50006, 97-07-90309, 97-04-49740, 98-07-90126).

REFERENCES
Bendall, A.J., and Molloy, P.L., (1994) Base preferences for DNA
binding by the bHLH-Zip protein USF: effects of MgCl2 on specificity and comparison with
binding of Myc family members. Nucleic Acids Res., 22, 2801-2810.
Berg, O.G., and von Hippel, P.H., (1988) Selection of DNA binding sites
by regulatory proteins. II. The binding specificity of cyclic AMP receptor protein to
recognition sites. J. Mol. Biol., 200, 709-723.
Boyd, D.C., et al., (1995) Functional redundancy of promoter elements
ensures efficient transcription of the human 7SK gene in vivo. J. Mol. Biol., 253,
677-690.
Burset, M., and Guigo, R., (1996) Evaluation of gene structure
prediction programs. Genomics, 34, 353-367.
Chiang, L.W., and Howe, M.M., (1993) Mutational analysis of a
C-dependent late promoter of bacteriophage Mu. Genetics, 135, 619-629.
Coulondre, C., et al., (1978) Molecular basis of base substitution
hotspots in Escherichia coli. Nature, 274, 775-780.
Dickerson, R.E., et al., (1989) EMBO Workshop, EMBO J., 8,
1-4
Edmondson, D.G., and Roth, S.Y., (1996) Chromatin and transcription. FASEB
J., 10, 1173-1182.
Etzold, T., and Argos, P., (1993) SRS - an indexing and retrieval tool
for flat file data libraries. Comput. Appl. Biosci., 9, 49-57.
Fickett, J.W., and Hatzigeorgiou, A.G., (1997) Eukaryotic promoter
recognition. Genome Res., 7, 861-878.
Fields, D.S., He, Y., Al-Uzri, A.Y., and Stormo, G.D. (1997)
Quantitative specificity of the Mnt repressor. J. Mol. Biol., 271, 178-194.
Fishburn, P.C., (1970) Utility Theory for Decision Making, New York:
John Wiley & Sons.
Gartenberg, M.R., and Crothers, D.M., (1988) DNA sequence determinants
of CAP-induced bending and protein binding affinity. Nature, 333, 824-829.
Godde, J.S., Nakatani, Y., and Wolffe, A.P., (1995) The amino-terminal
tails of the core histones and the translational position of the TATA box determine
TBP/TFIIA association with nucleosomal DNA. Nucleic Acids Res., 23,
4557-4564.
Gorin, A.A., Zhurkin, V.B., and Olson, W.K., (1995) B-DNA twisting
correlates with base-pair morphology. J. Mol. Biol., 247, 34-48.
Hajek, P., and Havranek, T., (1978). Mechanizing hypothesis formation -
Mathematical foundations for a general theory. Heidelberg, Springer Verlag.
Hogan, M.E., and Austin, R.H., (1987) Importance of DNA stiffness in
protein-DNA binding specificity. Nature, 329, 263-266.
Hyde-DeRuyscher, R., Jennings, E., Shenk, T., (1995) DNA binding sites
for the transcriptional activator/repressor. Nucleic Acids Res., 23,
4457-4465
Jonsson, J., et al. (1993) Quantitative sequence-activity models
(QSAM)-tools for sequence design. Nucleic Acids Res., 21, 733-739.
Juo, Z.S., Chiu, T.K., et al. (1996) How proteins recognize the TATA
box. J. Mol. Biol., 261, 239-254.
Karas, H., Knuppel, R., Schulz, W., Sklenar, H., Wingender, E., (1996)
Combining structural analysis of DNA with search routines for the detection of
transcription regulatory elements. Comput Appl Biosci., 12, 441-446.
Kim, J.G., Takeda, Y., Matthews, B.W., Anderson, W.F., (1987) Kinetic
studies on Cro repressor-operator DNA interaction. J. Mol. Biol., 196,
149-158.
Kraus, R.J., et al. (1996) Experimentally determined weight matrix
definitions of the initiator and TBP binding site elements of promoters. Nucleic Acids
Res., 24, 1531-1539.
Kretsovali, A., and Papamatheakis, J., (1995) A novel IL-4 responsive
element of the E alpha MHC class II promoter that binds to an inducible factor. Nucleic
Acids Res., 23, 2919-2928.
Mhaskar, D.N., and Goodman, M.F., (1984) On the molecular basis of
transition mutations. Frequency of forming 2-aminopurine-cytosine base mispairs in the G X
C----A X T mutational pathway by T4 DNA polymerase in vitro. J. Biol. Chem., 259,
11713-11717.
Mulligan, M.E., et al. (1984) Escherichia coli promoter sequences
predict in vitro RNA polymerase selectivity. Nucleic Acids Res. 12, 789-800.
Ponomarenko, M.P., Kolchanova, A.N., and Kolchanov, N.A.,
(1997a) Generating programs for predicting the activity of functional sites. J. Comput.
Biol., 4, 83-90.
Ponomarenko, M.P., Savinkova, L.K., et al. (1997b) Modeling TATA-box
sequences in eukaryotic genes. Mol Biol (Mosk)., 31, 726-732.
Ponomarenko, M.P., Ponomarenko, J.V., et al. (1997c) Computer analysis
of conformational features of the eukaryotic TATA-box DNA promotors. Mol Biol (Mosk).,
31, 733-740.
Shpigelman, E.S., et al. (1993) CURVATURE: software for the analysis of
curved DNA. Comput. Appl. Biosci., 9, 435-440.
Sax, C.M., Cvekl, A., et al., (1995) Lens-specific activity of the mouse
alphaA-crystallin promoter in the absence of a TATA box: functional and protein binding
analysis of the mouse alpha A-crystallin PE1 region. Nucleic Acids Res., 23,
442-451.
Starr, D.B., Hoopes, B.C., and Hawley, D.K., (1995) DNA bending is an
important component of site-specific recognition by the TATA binding protein. J. Mol.
Biol., 250, 434-446.
Stormo, G.D., Schneider, T.D., and Gold, L., (1986) Quantitative
analysis of the relationship between nucleotide sequence and functional activity. Nucleic
Acids Res., 14, 6661-6679.
Sugimoto, N., Nakano, S., Yoneyama, M., and Honda, K., (1996) Improved
thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes.
Nucleic Acids Res., 24, 4501-4505.
Suzuki, M., Yagi, N., and Finch, J.T., (1996) Role of base-backbone and
base-base interactions in alternating DNA conformations. FEBS Lett., 397,
148-152.
Zadeh, L.A., (1965) Fuzzy sets. Information and Control., 8,
338-353.

Table 1. Examples of the sites with known activities available in the database
ACTIVITY

Site		Activity				Reference
Name	DNA	Quantitative character	Sc	Min	Max
Cro-binding site	Nat	Association rate const	ln	19.1	19.9	Kim, 1987
CRP-binding site	Nat	CRP/DNA affinity	ln	-3.2	3.2	Gartenberg, 1988
E. coli promoter	Mut	Promoter strength	-log	0.26	2.1	Jonsson, 1993
C-protein-binding site	Mut	Transcription activity	ln	-6.2	1.8	Chiang, 1993
TATA box	Mut	TBP/DNA lifetime	m	1	185	Starr, 1995
TATA box	Mut	Bend, DNA/TBP comp	°	33	106	Starr, 1995
Transcription signal INR	Mut	INR/DNA affinity	ln	-4.6	1.3	Kraus, 1996
Transcription signal OCT-1	Mut	Transcription activity	ln	-2.3	0.63	Boyd, 1995
Transcription signal YY1	Syn	Repressing activity	ln	2.2	0.00	Hyde-DeRuyscher, 1995
Transcription signal USF	Syn	USF/DNA affinity	ln	3.8	100	Bendall, 1994
PE1B/TATA box	Mut	Transcription activity	ln	-1.4	1.4	Sax, 1995
Transcription signal IL-1	Mut	Transcription activity	ln	-1.9	4.1	Kretsovali, 1995
2AP-induced mutation	Nat	Mutation frequency	ln	0.0	5.6	Coulondre, 1978

Nat, natural; Mut, mutant; Syn, synthetic; m, minute; Sc, scale; 2AP, 2-aminopurine.

Table 2. Examples of the DNA properties available in the database ACTIVITY

Property name	Unit	Min	Max	Reference
Twist	°	31.1	41.4	Karas, 1996
Propeller	°	-17.3	-6.7	Gorin, 1995
Tip	°	-1.64	6.7	Karas, 1996
Inclination	°	-1.43	1.43	Karas, 1996
Tilt	°	-2.6	0.6	Gorin, 1995
Bend	°	2.16	6.74	Karas, 1996
Wedge	°	1.1	8.4	Shpigelman, 1993
Direction	°	-154	180	Shpigelman, 1993
Roll	°	-6.2	6.2	Suzuki, 1996
Rise	Angstrom	3.16	4.08	Karas, 1996
Slide	Angstrom	-0.4	1.6	Suzuki, 1996
Minor groove width (width)	Angstrom	4.62	6.40	Karas, 1996
Minor groove depth (depth)	Angstrom	8.79	9.11	Karas, 1996
Minor groove width size (size)	Angstrom	2.7	4.7	Gorin, 1995
Minor groove width distance (dist)	Angstrom	2.79	4.24	Gorin, 1995
Major groove width (WIDTH)	Angstrom	12.1	15.5	Karas, 1996
Major groove depth (DEPTH)	Angstrom	8.45	9.60	Karas, 1996
Major groove size (SIZE)	Angstrom	3.26	4.70	Gorin, 1995
Major groove distance (DIST)	Angstrom	3.02	3.81	Gorin, 1995
Clash strength	r.u.	0.00	2.53	Gorin, 1995
Bending mobility to minor groove	r.u.	1.02	1.27	Gartenberg, 1988
Bending mobility to major groove	r.u.	0.99	1.18	Gartenberg, 1988
Persistent length	bp	20	130	Hogan, 1987
Melting temperature	°C	36.7	136.1	Hogan, 1987
Probability to be contacting nucleosome core	%	1	18	Hogan, 1987
Enthalpy change	kcal/mol	-11.8	-5.6	Sugimoto, 1996
Entropy change	cal/mol/K	-28.4	-15.2	Sugimoto, 1996
Free energy change	kcal/mol	-2.8	-0.9	Sugimoto, 1996

r.u., relative unit

Table 3. Examples of the functional DNA sites analyzed by the system ACTIVITY

Site				DNA feature found			Significance
Name	Position #1	n	Activity, F	X_k	Region	Property	U	r	p
PE1B TATA box (Sax, 1995)	Transcription start	11	Transcription activity of alphaA-crystallin	X₁	-32; -25	Pnucl	0.36	-0.77	10^-2
				X₂	-29; -19	DIST	0.41	0.86	10^-3
				X₃	-31; -25	Tilt	0.38	-0.78	10^-2
				F=-39-0.1*X₁+12*X₂-X₃				0.90	10^-4
TATA box (mutant) (Starr, 1995)	TATA box start	9	DNA bending in TBP/TATA	X₁	0; 9	Inclination	0.19	0.76	0.05
				F=120.15+70.32*X₁				0.76	0.05
USF-binding site (Bendall, 1994)	Synthetic DNA start	14	USF/DNA affinity	X₁	11; 15	Depth	0.22	-0.78	10^-3
				X₂	11; 20	Twist	0.23	-0.86	10^-4
				F=170-16.3*X₁-0.7*X₂				0.91	10^-5
YY1-binding site (Hyde-DeRuyscher, 1995)	Site start	21	Transcription repression	X₁	1; 12	Twist	0.27	-0.76	10^-2
				F=47.97-1.37*X₁				0.76	10^-2
2AP-induced mutation (Coulondre, 1978)	Mutation point	26	Mutation frequency	X₁	-1; 2	Tmelt	0.20	0.90	10^-5
				F=-8.5568+0.1585*X₁				0.90	10^-5

Notes: n, total number of the site variants; X_k, feature selected; U,
utility; r, linear correlation coefficient; p, significance of the linear correlation
coefficient; Pnucl, probability to be contacting nucleosome core; Tmelt, melting
temperature; depth, minor groove depth; width, minor groove width; WIDTH, major groove
width; DIST, major groove width distance; and F=F₀+Σ_i F_i X_i, the linear
regression (1) derived for predicting the site activity.

Fig. 1. Algorithm for generating the C-code program predicting the
activity of a given site.

Fig. 2. Scheme of the distributed and intelligent database ACTIVITY.
