LINEAR-ADDITIVE APPROXIMATION FOR PREDICTING ACTIVITY OF THE FUNCTIONAL DNA AND RNA SITES

N.A. Kolchanov, M.P. Ponomarenko, J.V. Ponomarenko, A.S. Frolov, N.L. Podkolodny#

Institute of Cytology and Genetics, Novosibirsk, Russia, 630090; kol@bionet.nsc.ru; #)Institute of Computational Mathematics and Mathematical Geophysics, Novosibirsk, Russia, 630090

ABSTRACT

We suggest a linear-additive approximation for predicting the activity of functional DNA and RNA sites. Its novelties are (i) taking into account the physico-chemical and conformational properties of DNA and RNA and (ii) generating and testing, one by one, a large body of hypotheses of the form “Is a given DNA/RNA feature responsible for a definite site activity?”. The features considered are calculated from conformational and physico-chemical properties and from the oligonucleotide content of a given site. Such features are easily interpreted but hard to guess in advance. That is why we suggest generating and testing for significance as many of these features as the computer can afford. This linear-additive approximation has been implemented to create the distributed and intelligent database ACTIVITY for functional site activities. Currently, about 60 experiments are described in this database. In addition, the database compiles 40 conformational and physico-chemical properties. There is also a knowledge base of the features found significant for predicting the site activities. It is linked to a library of computer programs implementing these predictions. ACTIVITY is WWW-available in real-time mode at http://www.bionet.nsc.ru/SRCG/Activity/.

INTRODUCTION

As is known, every molecular process in a cell, such as replication, transcription, splicing, and translation, is controlled by a definite set of functional sites. Thousands of such sites are presently known. Each site has a certain location and activity value. The methods for recognizing sites rely on the data on site location stored in the EMBL Data Library, the GenBank Database and other compilations [1-4]. There are hundreds of methods in this area of intense research (for review, see [5]). Among them, the consensus [6], neural networks [7] and weight matrix [8] are widely used. The methods recognizing site locations are applied to annotate genomic DNA [9-12]. Functional sites of the same type can differ in activity by several orders of magnitude. Table 1 shows that synthetic ssDNAs differ 350-fold in affinity for the RecA filament [13], mutant E. coli operators vary over two orders of magnitude in affinity for the Cro repressor [14], and so on [15-17].

Mulligan et al. [18] were the first to predict the activity Kbk2 of E. coli promoters through a homology score. Using multiple regression, Stormo et al. [19] optimized weight matrices to predict the site activities for su2-suppression and 2-aminopurine induced mutations. Berg and von Hippel [20, 21] generalized all the data within the framework of a statistical-mechanical theory and applied it to predict the activities of the CRP- and Cro-binding sites and of E. coli promoters. As for RNA, the weight matrix predicting the activities of the E. coli ribosome-binding site has also been optimized [22]. Jonsson et al. [23] introduced neural networks to predict the E. coli promoter strength. Neural networks have made it possible to predict the activities of the INR-binding site and TATA box [24]. A system generating programs to predict site activities has been created [20] and applied to predict the consensus site maximizing affinity between DNA and TBP protein [25]. This consensus was found to be similar to Bucher’s consensus [1] of the TATA box.

This large body of data on site activity appears to be as informative for the development of prediction methods as the widely used data on site locations. Therefore, the prediction of site activity remains a challenging problem of computational biology. However, there are no databases of site activities and, thus, no information sources from which to predict them. Nevertheless, there are hundreds of “sequence → activity” experimental data sets. With the introduction of the SRS query language [27], compiling data on site activities became feasible.

With this background, we developed the distributed and intelligent database ACTIVITY for the activities of the functional sites in DNA and RNA. It has the following: (i) the database for the experimental data on site sequences with known activities, (ii) the database for the conformational and physico-chemical properties of DNA and RNA, (iii) the knowledge base for the features significant for predicting the site activities, and (iv) the library for the programs predicting the site activities. ACTIVITY is WWW-available in “real-time” mode (http://www.bionet.nsc.ru/SRCG/Activity/).

METHOD

In this paper, we suggest a linear-additive approximation for predicting activities of the functional DNA and RNA sites. Its biological novelty is taking into account physico-chemical and conformational DNA properties in addition to the widely used contextual ones. Its computational novelty is using decision making theory [28] and Zadeh’s fuzzy logic [29] to generate and test a large body of hypotheses such as “Is this DNA property responsible for a given site activity?”. Fig.1 demonstrates that (a) the concentration of the tetranucleotide VUKK within the SV40 pre-RNA cleavage point is responsible for the 3’processing efficiency, and (b) the major groove width of DNA is responsible for the Cro/DNA affinity. DNA features of this type are easily interpreted but hard to guess in advance.

That is why we suggest generating and testing as many DNA features as the computer can afford.

The core idea of the linear-additive approximation is that the site activity, F, is determined by the simultaneous action of site features, X’s, of two types: obligatory and facultative. The obligatory features of a given site are invariant for all sequences of this site and determine its basal activity. A consensus is a typical obligatory feature. The facultative features of a given site are individual in terms of their “number, size and location” for each sequence of the site. They modulate the site activity with respect to the basal level. Hence, within the framework of the linear-additive approximation, the activity of the site with sequence S can be calculated by the following equation:

$$F(S) = F_0(S) + \sum_{k=1}^{K} F_k \, X_k(S), \qquad (1)$$

where F0(S) equals the basal activity of sites of this type when the sequence S has their obligatory features and equals “0” otherwise; Xk(S) are the values of the facultative features of the sequence S; Fk are the contributions of the facultative features Xk to the site activity F.
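As a minimal sketch of how formula (1) can be evaluated, the C function below sums the basal term and the weighted facultative features. The function name, the argument layout and the `has_obligatory` flag are illustrative assumptions, not part of ACTIVITY.

```c
#include <stddef.h>

/* Linear-additive approximation, formula (1): F = F0 + sum_k Fk * Xk.
 * has_obligatory : 1 if the sequence carries the obligatory features,
 *                  0 otherwise (the basal term F0 is then taken as 0).
 * f0             : basal activity of this site type.
 * fk, xk         : K contributions and the K facultative feature values.
 * All names here are hypothetical and chosen for illustration only. */
double linear_additive_activity(int has_obligatory, double f0,
                                const double *fk, const double *xk, size_t K)
{
    double F = has_obligatory ? f0 : 0.0;
    for (size_t k = 0; k < K; k++)
        F += fk[k] * xk[k];
    return F;
}
```

For example, with the rounded E. coli promoter coefficients listed in Table 4 (F0 = 0.3, F1 = 0.6, F2 = 0.0008), the call takes `fk = {0.6, 0.0008}` and the two feature values computed for the sequence of interest.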

Three types of facultative features are considered, namely, statistical, conformational, and physico-chemical. The weighted concentration of oligonucleotides Z={z1, ..., zj, ..., zm} of length m is used as a statistical feature of the nucleotide context of the sequence S={s1, ..., si, ..., sL} of length L:

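$$X_{Z,m,w}(S) = \sum_{i=1}^{L-m+1} w(i)\, \delta_Z(i), \qquad (2)$$

here, the so-called “δ-function” is used, which takes the value "1" or "0" at each position i of the sequence S depending on the presence or absence of the oligonucleotide Z at this position:

$$\delta_Z(i) = \begin{cases} 1, & \text{if } s_{i+j-1} \in z_j \text{ for all } j = 1, \dots, m; \\ 0, & \text{otherwise}, \end{cases}$$

where: s_i ∈ {A, T, G, C}; z_j ∈ {A, T, G, C, W=A/T, R=A/G, M=A/C, K=T/G, Y=T/C, S=G/C, B=T/G/C, V=A/G/C, H=A/T/C, D=A/T/G, N=A/T/G/C}; m << L.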
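The weighted concentration of an oligonucleotide written in the 15-letter code can be sketched in C as follows. This is a hedged illustration: the function names are hypothetical, positions are zero-based, the weights are passed as a plain array with one entry per window start, and the sum is left unnormalized, which is an assumption based on the C-code listing of Fig. 7. Only the DNA letters A, T, G, C are handled.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if nucleotide s (A, T, G or C) is covered by the code letter z. */
static int code_matches(char z, char s)
{
    const char *set;
    switch (z) {
    case 'A': set = "A";    break;
    case 'T': set = "T";    break;
    case 'G': set = "G";    break;
    case 'C': set = "C";    break;
    case 'W': set = "AT";   break;
    case 'R': set = "AG";   break;
    case 'M': set = "AC";   break;
    case 'K': set = "TG";   break;
    case 'Y': set = "TC";   break;
    case 'S': set = "GC";   break;
    case 'B': set = "TGC";  break;
    case 'V': set = "AGC";  break;
    case 'H': set = "ATC";  break;
    case 'D': set = "ATG";  break;
    case 'N': set = "ATGC"; break;
    default:  return 0;      /* unknown code letter never matches */
    }
    return s != '\0' && strchr(set, s) != NULL;
}

/* Sum of w[i] over all window starts i at which the oligonucleotide
 * matches the sequence; w must hold at least L - m + 1 weights. */
double weighted_concentration(const char *seq, const char *oligo,
                              const double *w)
{
    size_t L = strlen(seq), m = strlen(oligo);
    double X = 0.0;
    if (m == 0 || m > L)
        return 0.0;
    for (size_t i = 0; i + m <= L; i++) {
        int hit = 1;
        for (size_t j = 0; j < m && hit; j++)
            hit = code_matches(oligo[j], seq[i + j]);
        if (hit)
            X += w[i];
    }
    return X;
}
```

For the sequence AGAACC and the trinucleotide ASM, the windows AGA and ACC match, so the result is the sum of the weights at window starts 0 and 3.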

The basic element of the facultative feature description is the position-effect function w(i). This function makes it possible to take into account the fact that the same oligonucleotide contributes differently to the site activity depending on its location. The function w(i) is determined by a simple rule: the more important the position is for the site function, the higher its assigned weight w(i). The total number of weight functions w(i) used in the activity prediction is 180. The weight functions given in Fig.2 model the highest effect on the site activity of (a) the narrow region within the right half of the site, (b) its central part, (c) its terminal regions, and (d) the left terminus of the site.

As is known, local conformational DNA heterogeneities that depend on the nucleotide context play an important role in DNA-protein interactions, which essentially determine the site activity. That is why the prediction of site activity takes into account the DNA conformational properties describing the mutual orientation and location of base pairs. The values of these parameters averaged over the known X-ray structures are used. The following physico-chemical properties are also used: the melting temperature, persistence length, entropy, and others. These properties determine the conformational dynamics of DNA sites during their functioning. About 40 conformational and physico-chemical properties are utilized in the prediction of site activities.

Thus, the sequence of the site S can be characterized by the mean value of the q-th conformational or physico-chemical property Pq averaged for the region between positions a and b:

$$X_{q,a,b}(S) = \frac{1}{b-a} \sum_{i=a}^{b-1} P_q(s_i s_{i+1}). \qquad (3)$$
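The averaging of a dinucleotide property over a region can be sketched in C as below. This is an illustration under stated assumptions: the function names are hypothetical, positions a and b are zero-based indices into the sequence, the property table is indexed in the order AA, AT, AG, AC, TA, ..., CC used in the Fig. 7 listing, and the mean is taken over the b-a dinucleotide steps of the region.

```c
#include <stddef.h>
#include <string.h>

/* Map a nucleotide to 0..3 in the order A, T, G, C; -1 for anything else. */
static int nuc_index(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'T': return 1;
    case 'G': return 2;
    case 'C': return 3;
    default:  return -1;
    }
}

/* Mean of a dinucleotide property over the steps (i, i+1), a <= i < b.
 * prop[4*first + second] holds the value for the dinucleotide.
 * Returns 0 and stores the mean in *out, or -1 on invalid input. */
int mean_property(const char *seq, size_t a, size_t b,
                  const double prop[16], double *out)
{
    size_t len = strlen(seq);
    double sum = 0.0;
    if (a >= b || b >= len)
        return -1;
    for (size_t i = a; i < b; i++) {
        int first = nuc_index(seq[i]);
        int second = nuc_index(seq[i + 1]);
        if (first < 0 || second < 0)
            return -1;
        sum += prop[4 * first + second];
    }
    *out = sum / (double)(b - a);
    return 0;
}
```

With the “Direction” values of the Fig. 7 listing, the sequence AAT over the region (0; 2) averages the steps AA (-154) and AT (0).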

It should be emphasized that before starting the analysis, we knew very little about the statistical, physico-chemical, or conformational features that were most important for the activity of the examined site. The only available data were the sequences with the known activities. With this in mind, the artificial intelligence principle of impartiality is applicable: “when the information is insufficient, the more hypotheses have been generated and tested, the more correct is the result, and no preference, therefore, might be given to any hypothesis before its testing”. In this paper, each hypothesis is the assumption that a statistical, conformational, or physico-chemical feature calculated by formulae (2) or (3) is significant for the activity of the examined site.

For this reason, in the analysis of statistical features, we test one by one all the possible variants of the oligonucleotide Z, varying (a) its length m from 1 to M; (b) its nucleotide composition in the 15 single-letter codes; and (c) all available position-effect functions w(i). Thus, the weighted concentration XZ,m,w(Sn) is calculated by formula (2) for each fixed combination “Z, m, w” and for each sequence Sn with known activity Fn. Hence, the total number of statistical features generated and tested is about 10^7. Similarly, all the available conformational or physico-chemical properties Pq and all the possible regions (a, b) within the examined site are considered one by one. In this way, for a fixed “q, a, b”, the conformational or physico-chemical feature Xq,a,b(Sn) is calculated by formula (3) for each sequence Sn with known activity Fn. The total number of these features is about 10^5.
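The order of magnitude of these counts can be checked with a short C sketch. The maximum oligonucleotide length M = 4 and the site length of 68 positions are assumptions made here for illustration (the text does not state them); the 15-letter code, the 180 weight functions and the 40 properties are taken from the text. The function names are hypothetical.

```c
/* Number of statistical hypotheses: every oligonucleotide of length
 * 1..max_len over the given alphabet, combined with every weight function. */
unsigned long long count_statistical_features(unsigned alphabet,
                                              unsigned max_len,
                                              unsigned n_weights)
{
    unsigned long long total = 0, variants = 1;
    for (unsigned m = 1; m <= max_len; m++) {
        variants *= alphabet;   /* alphabet^m oligonucleotides of length m */
        total += variants;
    }
    return total * n_weights;
}

/* Number of property hypotheses: every property combined with every
 * region (a, b), a < b, inside a site of site_len positions. */
unsigned long long count_region_features(unsigned n_properties,
                                         unsigned site_len)
{
    unsigned long long regions =
        (unsigned long long)site_len * (site_len - 1) / 2;
    return regions * n_properties;
}
```

Under these assumptions, `count_statistical_features(15, 4, 180)` gives 9,763,200 and `count_region_features(40, 68)` gives 91,120, in line with the stated orders of 10^7 and 10^5.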

When so large a number of statistical, conformational or physico-chemical features are generated and tested for significance for a given site activity, the problem of an insignificant feature being chosen by chance becomes crucial. In this paper, we suggest addressing this problem within the framework of decision making theory [28] and Zadeh’s fuzzy logic [29] in the following way.

Let us calculate by formula (2) a fixed statistical feature Xzmw(Sn) for each sequence Sn with the known activity Fn. If the resulting pairs {Xzmw(Sn), Fn} meet all the conditions of applicability of the linear regression (formula 1), then the activity F is predictable for an arbitrary sequence S via the feature Xzmw(S). To test these conditions, a simple regression is optimized for the pairs {Xzmw(Sn), Fn}:

$$F_{Zmw}(S_n) = f_0 + f_1 \, X_{Zmw}(S_n); \qquad (4)$$

where: f0 and f1 are the regression coefficients optimized for the pairs {Xzmw(Sn), Fn} [30].
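A minimal sketch of fitting the coefficients f0 and f1 is given below. It assumes ordinary least squares, which is the standard way to optimize a simple regression; the text cites [30] for the procedure and does not spell it out, so this choice and the function name are assumptions.

```c
#include <stddef.h>

/* Ordinary least-squares fit of y = f0 + f1 * x over n pairs.
 * Returns 0 on success, -1 if fewer than two points are given
 * or all x values are identical (the slope is then undefined). */
int fit_simple_regression(const double *x, const double *y, size_t n,
                          double *f0, double *f1)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    if (n < 2)
        return -1;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double d = (double)n * sxx - sx * sx;
    if (d == 0.0)
        return -1;
    *f1 = ((double)n * sxy - sx * sy) / d;   /* slope */
    *f0 = (sy - *f1 * sx) / (double)n;       /* intercept */
    return 0;
}
```

Here x[i] plays the role of the feature value Xzmw(Sn) and y[i] the role of the experimental activity Fn.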

To ensure the reliability of the regression between the Xzmw(Sn) and Fn values, 22 conditions of regression analysis are tested, namely, the presence of linear, sign, and rank correlations between the predicted Fzmw(Sn) and experimental Fn activities; the equality of the distributions of these values; the Gaussian distribution of their deviations (Fzmw(Sn)-Fn); and so on. When testing each of the 22 conditions, the significance level αr at which the r-th condition is met is estimated (where 1 ≤ r ≤ 22). Within the framework of Zadeh’s fuzzy logic [29], each estimate αr is transformed onto a uniform scale, the so-called “partial utility of the usage of the feature XZmw to predict the activity F”, as follows:

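$$u_r = \begin{cases} 1, & \alpha_r < 0.01; \\ \text{an intermediate value in } [-1, 1], & 0.01 \le \alpha_r \le 0.1; \\ -1, & \alpha_r > 0.1. \end{cases} \qquad (5)$$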

The highest partial utility, ur=1, is assigned to the feature Xzmw if the r-th condition is met at the significance αr < 0.01. The utility is lowest, ur=-1, if the r-th condition is not met (αr > 0.1). An intermediate partial utility ur ∈ [-1, 1] is assigned to a feature Xzmw that meets the r-th condition with an intermediate αr ∈ [0.01, 0.1]. Within the framework of decision making theory [28], averaging all the 22 partial utilities gives the integral utility of the usage of the feature XZmw to predict the activity F:

$$U(X_{Zmw}, F) = \frac{1}{22} \sum_{r=1}^{22} u_r. \qquad (6)$$
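The two-step utility calculation can be sketched in C as follows. The end values (+1 below 0.01, -1 above 0.1) and the averaging come from the text; the linear interpolation used between 0.01 and 0.1 is an assumption made only for this sketch, since the text states no more than that the value is intermediate. The function names are hypothetical.

```c
#include <stddef.h>

/* Partial utility of one regression condition met at significance alpha.
 * ASSUMPTION: linear interpolation between +1 (alpha = 0.01) and
 * -1 (alpha = 0.1); the source only fixes the two end values. */
double partial_utility(double alpha)
{
    if (alpha < 0.01)
        return 1.0;
    if (alpha > 0.1)
        return -1.0;
    return 1.0 - 2.0 * (alpha - 0.01) / (0.1 - 0.01);
}

/* Integral utility: the mean of the partial utilities over n conditions
 * (n = 22 in the text). Returns 0 for an empty list. */
double integral_utility(const double *alpha, size_t n)
{
    double sum = 0.0;
    if (n == 0)
        return 0.0;
    for (size_t r = 0; r < n; r++)
        sum += partial_utility(alpha[r]);
    return sum / (double)n;
}
```

A feature therefore gets a positive integral utility only when the conditions it meets outweigh those it fails.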

Only the linearly independent features XZ,m,w with the highest positive utility are selected:

. (7)

To have a positive utility U(XZ,m,w,F), the statistical feature XZ,m,w needs to meet at least half of the 22 conditions of linear regression applicability. Thus, the probability that a feature X with positive utility U(X, F) is selected by chance from 10^7 features can be estimated with the binomial distribution:

. (8)

Formula (8) shows that each statistical feature XZ,m,w selected by formula (7) significantly meets the conditions of linear regression applicability for predicting the site activity. The same holds for the conformational and physico-chemical features Xq,a,b. That is why the selection can proceed by generating and testing the features one by one.

ALGORITHM

The simple combinatorial algorithm used is shown schematically in Fig.3. It works as follows: all the possible features X(Sn) for all the available site sequences Sn with known activities Fn are calculated by formulae (2) and (3), and all the corresponding utilities U(X, F) are estimated by formulae (4), (5) and (6). When all U(X, F) ≤ 0, the algorithm terminates without selecting any features; in contrast, when some U(X, F) > 0, all the possible linearly independent features {Xk}, 1 ≤ k ≤ K, with the highest positive utilities U(Xk, F) > 0 are selected. Based on the selected features {Xk}, 1 ≤ k ≤ K, the linear-additive approximation (formula 1) predicting the site activity is derived, and, finally, the C-code implementing this prediction is generated and stored [25].
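The selection step can be sketched as a greedy loop in C. This is an illustration only: the text does not say how linear independence is tested, so the sketch assumes that a candidate is rejected when its squared correlation with an already selected feature exceeds a threshold `max_r2` (which must be below 1). All names are hypothetical.

```c
#include <stddef.h>

/* Squared Pearson correlation between two feature vectors of length n.
 * A constant vector is reported as fully dependent (1.0). */
static double corr_squared(const double *x, const double *y, size_t n)
{
    double mx = 0.0, my = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= (double)n;
    my /= (double)n;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mx, dy = y[i] - my;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    if (sxx == 0.0 || syy == 0.0)
        return 1.0;
    return (sxy * sxy) / (sxx * syy);
}

/* Greedy selection: repeatedly take the candidate with the highest
 * positive utility that is not too correlated with those already chosen.
 * feat[c] points to the n_seq values of candidate c.
 * Returns the number of indices written to selected[]. */
size_t select_features(const double *const *feat, const double *utility,
                       size_t n_feat, size_t n_seq, double max_r2,
                       size_t *selected, size_t max_selected)
{
    size_t count = 0;
    while (count < max_selected) {
        size_t best = n_feat;               /* n_feat marks "none found" */
        for (size_t c = 0; c < n_feat; c++) {
            int ok = utility[c] > 0.0;
            for (size_t k = 0; ok && k < count; k++)
                ok = corr_squared(feat[c], feat[selected[k]], n_seq) <= max_r2;
            if (ok && (best == n_feat || utility[c] > utility[best]))
                best = c;
        }
        if (best == n_feat)
            break;                          /* no admissible candidate left */
        selected[count++] = best;
    }
    return count;
}
```

If every utility is non-positive, the function returns 0, which corresponds to the algorithm terminating without selected features.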

This algorithm has been implemented with the Borland C compiler on the IBM PC platform to develop the distributed and intelligent database ACTIVITY, presented schematically in Fig.4. It contains the following: three databases, a computer system generating programs to predict the site activity, and a library of the computer programs predicting activities of the sites. ACTIVITY is WWW-available in “real-time” mode through the URL “http://www.bionet.nsc.ru/SRCG/Activity/”.

RESULT AND DISCUSSION

The most important unit of ACTIVITY is the database of DNA and RNA site activities. Currently, it describes more than 70 samples, exemplified in Table 2. Among them are promoters and binding sites for different E. coli regulatory proteins, TATA boxes and binding sites for various eukaryotic transcription factors, translation starts, splicing and 3’processing sites, mutation hotspots and many others. The parameters characterizing specific site activities include the association and dissociation rates, affinity, lifetime of the complexes, product concentrations controlled by these sites, transcription and translation efficiencies, mutation and cutting frequencies, etc. Fig.5 shows the database format, using as an example the E. coli promoter strength in “-log[Pbla]” units [23].

ACTIVITY also has a database of conformational and physico-chemical properties of DNA. The current version of the database contains about 40 properties, some of which are listed in Table 3. As an example, Fig.6 shows how the B-helical angle “Direction” is presented in the database.

These two databases provide the input for the computer system that generates programs predicting the site activity. For the initial data on the E. coli promoter strength (Fig.5), the ACTIVITY output is shown in Fig.7. This output is stored in the knowledge base of the features significant for predicting the site activity (see the scheme in Fig.4). Fig.8 illustrates what this knowledge means. The concentration of the trinucleotide ASM weighted by the function w(i) given in Fig.2a correlates significantly with the promoter strength (Fig.8a). The function w(i) assigns the highest weights to the region (-1; 11). This means that the trinucleotide ASM near the transcription start makes the highest contribution to the promoter strength. The Direction averaged over the region (-5; 15) also correlates with the promoter strength (Fig.8b). Based on these two features, the linear-additive approximation (1) for predicting the strength is derived (Table 4). Fig.8c compares the experimental and predicted E. coli promoter strengths. The linear correlation coefficient r=0.91 indicates significant agreement between experiment and prediction.

Table 4 also presents a number of eukaryotic promoter sites analyzed by ACTIVITY to demonstrate the universality of the linear-additive approximation (1) in this field of intense research. For all these sites, the significant statistical, physico-chemical, or conformational features have been identified and the linear-additive approximations predicting the site activities have been derived.

Analysis of the mouse αA-crystallin gene promoter (Fig.9a) showed that the best physico-chemical feature of its PE1B region near the TATA box is the probability of contact with the nucleosome core. This feature correlates negatively with the transcription activity, which means that the tighter the interaction of the promoter with nucleosomes, the lower the transcription activity. This result is consistent with the experimental data showing that nucleosome displacement from a promoter precedes TBP/TATA binding [46, 47]. The analysis has also demonstrated that such conformational features as the major groove dist (Fig.9b) and the Tilt (Fig.9c) are of great importance for the transcription activity. Using these three features, the linear-additive approximation predicting the transcription activity was derived (Table 4) and tested (Fig.9d).

Analysis of the sequences with known DNA bending in the TBP/TATA complex demonstrated that the bending increases with the inclination (Fig.10a). Similar results were obtained by the X-ray analysis of the TBP/TATA complexes. DNA bending in these complexes was shown to result from intercalation of four phenylalanine residues of the TBP between a pair of adjacent bases on the side of the minor groove [48]. Inclination describes the rotation angle of a pair of bases along the short axis of the pair, and the increase in the angle widens the minor groove [49], thereby facilitating the intercalation of phenylalanines on the minor groove and DNA bending.

The linear-additive approximation (1) can also be applied to synthetic analogues of sites and their mutational variants. A study of synthetic analogues of the TATA boxes with known TBP affinity revealed two significant features: (i) the weighted concentration of the dinucleotide TV, contributing primarily to the TBP affinity in the center of the site (Fig.2b); and (ii) the weighted concentration of the dinucleotide WR, contributing chiefly to the affinity at the site termini (Fig.2c). The linear-additive approximation predicting the TBP/DNA affinity was derived using these two statistical features (Table 4). The agreement between the experimental and predicted affinities is shown in Fig.10b.

Fig.11 demonstrates that, for an arbitrary site, any conformational (a, b), physico-chemical (c), and also statistical (d) DNA and RNA features can prove significant for predicting the site activity.

Summing up, we would like to underline that the linear-additive approximation (formula 1) derived for predicting site activities can be helpful in a wide range of investigations in molecular biology. Importantly, ACTIVITY does not require a large body of initial experimental data. It is completely automated and WWW-available (http://www.bionet.nsc.ru/SRCG/Activity/).

This work was supported by grants from the Russian Foundation for Basic Research.

REFERENCES

1. P. Bucher, J. Mol. Biol., 212, 563 (1990)

2. I. Ioshikhes and E.N. Trifonov, Nucleic Acids Res. 21, 4857 (1993)

3. J.D. Helmann, Nucleic Acids Res. 23, 2351 (1995)

4. E. Wingender, A.E. Kel, et al., Nucleic Acids Res., 25, 265 (1997)

5. M.S. Gelfand, J. Comput. Biol., 2, 87 (1995)

6. S. Karlin and V. Brendel, Science, 257, 39 (1992)

7. E.C. Uberbacher, Y. Xu, and R.J. Mural, Methods Enzymol., 266, 259 (1996)

8. Q.K. Chen, G.Z. Hertz, and G.D. Stormo, CABIOS, 13, 29 (1997)

9. J.W. Fickett, Trends Genet., 12, 316 (1996)

10. R. Guigo and J.W. Fickett, J. Mol. Biol., 253, 51 (1995).

11. V.V. Solovyev, A.A. Salamov, and C.B. Lawrence, Nucleic Acids Res., 22, 5156 (1994).

12. E.E. Snyder and G.D. Stormo, J. Mol. Biol., 248, 1 (1995).

13. A.V. Mazin and S.C. Kowalczykowski, Proc. Natl. Acad. Sci. U.S.A., 93, 10673 (1996)

14. J.G. Kim, Y. Takeda, B.W. Matthews, and W.F. Anderson, J. Mol. Biol., 196, 149 (1987)

15. C. Coulondre, J.H. Miller, P.J. Farabaugh, and W. Gilbert, Nature, 274, 775 (1978)

16. A. Gil and N.J. Proudfoot, Cell, 49, 399 (1987)

17. A.A. Sokolenko, I.I. Sadomirsky, and L.K. Savinkova, Mol. Biol. (Msk), 30, 279 (1996).

18. M.E. Mulligan, D.K. Hawley, et al., Nucleic Acids Res., 12, 789 (1984)

19. G.D. Stormo, T.D. Schneider, and L. Gold, Nucleic Acids Res., 14, 6661 (1986)

20. O.G. Berg and P.H. von Hippel, J. Mol. Biol., 193, 723 (1987)

21. O.G. Berg and P.H. von Hippel, J. Mol. Biol., 200, 709 (1988)

22. D. Barrick, K. Villanueba, et al., Nucleic Acids Res., 22, 1287 (1994)

23. J. Jonsson, T. Norberg, et al., Nucleic Acids Res., 21, 733 (1993)

24. R.J. Kraus, E.E. Murray, et al., Nucleic Acids Res., 24, 1531 (1996)

25. M.P. Ponomarenko, A.N. Kolchanova, and N.A. Kolchanov, J. Comput. Biol., 4, 83 (1997)

26. M.P. Ponomarenko, L.K. Savinkova, et al., Mol. Biol (Msk), 31, 726 (1997)

27. T. Etzold and P. Argos, CABIOS, 9, 49 (1993)

28. P.C. Fishburn, Utility theory for decision making, New York, John Wiley & Sons (1970)

29. L.A. Zadeh, Information and Control, 8, 338 (1965)

30. E. Förster and B. Rönz, Methoden der Korrelations- und Regressionsanalyse, Berlin, Verlag Die Wirtschaft (1979)

31. M.R. Gartenberg and D.M. Crothers, Nature, 333, 824 (1988)

32. L.W. Chiang and M.M. Howe, Genetics, 135, 619 (1993)

33. D.B. Starr, B.C. Hoopes, and D.K. Hawley, J. Mol. Biol., 250, 434 (1995)

34. D. Boyd et al., J. Mol. Biol., 253, 677 (1995)

35. A.J. Bendall and P.L. Molloy, Nucleic Acids Res., 22, 2801 (1994)

36. C.M. Sax, A. Cvelk, et al., Nucleic Acids Res., 23, 442 (1995)

37. A. Kretsovali, and J. Papamatheakis, Nucleic Acids Res., 23, 2919 (1995)

38. M. McDevitt et al., EMBO J., 5, 2907 (1986)

39. C.F. Lesser and C. Guthrie, Genetics, 131, 851 (1993)

40. H. Karas, R. Knuppel, W. Schulz, H. Sklenar, and E. Wingender, CABIOS, 12, 441 (1996)

41. A.A. Gorin, V.B. Zhurkin, and W.K. Olson, J. Mol. Biol., 247, 34 (1995)

42. E.S. Shpigelman, E.N. Trifonov, and A. Bolshoy, CABIOS, 9, 435 (1993)

43. M. Suzuki, N. Yagi, and J.T. Finch, FEBS Lett., 397, 148 (1996)

44. M.E. Hogan and R.H. Austin, Nature, 329, 263 (1987)

45. N. Sugimoto, S. Nakano, M. Yoneyama, and K. Honda, Nucleic Acids Res., 24, 4501 (1996)

46. D.G. Edmondson and S.Y. Roth, FASEB J., 10, 1173 (1996)

47. J.S. Godde, Y. Nakatani, and A.P. Wolffe, Nucleic Acids Res., 23, 4557 (1995)

48. Z.S. Juo, T.K. Chiu, et al., J. Mol. Biol. 261, 239 (1996).

49. EMBO Workshop, EMBO J., 8, 1 (1989)

Table 1. The sites of the same type can differ in activity by several orders of magnitude

Site name                                    Site sequence                     Activity

DNA/RecA-filament affinity in E. coli [13]   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC  350
                                             AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  1

Cro/DNA-affinity in E. coli [14]             TAATGTGAGTTAGCTCACTCAT            91
                                             TAATGTAAGTTAGCTCACTCAT            1

2-aminopurine induced mutations C->T [15]    CGCGTGGTGAACCAGGCCAGCCACG         51
                                             ACCACCATCAAACAGGATTTTCGCC         1

3’processing pre-mRNA in SV40 [16]           UUCUACCGGAUCGUUGUGUUCGAGG         13
                                             UUCUACCGGAUCGUUUUGGUCGAGG         1

TBP/TATA-affinity in yeast [17]              GGGGCTATAAAAGGGGGTGG              7
                                             GTACCTATGGGTCTGCTGGT              1

 

Table 2. Examples of the sites with known activities available in the database ACTIVITY

Site name                        Sequences   Activity parameter          Scale    Min    Max    Ref.

Cro-binding site                 Natural     Association rate constant   ln       19.1   19.9   [14]
CRP-binding site                 Natural     Affinity CRP/DNA            ln       -3.2   3.2    [31]
E. coli promoter                 Mutant      Promoter strength           -log     0.26   2.1    [23]
C-protein binding site           Mutant      Transcription activity      ln       -6.2   1.8    [32]
TATA box                         Mutant      TBP/DNA lifetime            minute   1      185    [33]
TATA box                         Mutant      Bend, DNA/TBP complex       degree   33     106    [33]
Transcription signal INR         Mutant      Affinity INR/DNA            ln       -4.6   1.3    [24]
Transcription signal OCT-1       Mutant      Transcription activity      ln       -2.3   0.63   [34]
Transcription signal USF         Synthetic   Affinity USF/DNA            ln       3.8    100    [35]
PE1B box (adjacent TATA box)     Mutant      Transcription activity      ln       -1.4   1.4    [36]
Transcription signal IL-1        Mutant      Transcription activity      ln       -1.9   4.1    [37]
Pre-mRNA 3’cleavage site         Mutant      Cleavage efficiency         %        3      289    [38]
Pre-mRNA donor splice site       Mutant      Cleavage efficiency         %        18     100    [39]
E. coli ribosome-binding site    Synthetic   Translation activity        ln       0.0    8.06   [22]
2-aminopurine induced mutation   Natural     Mutation frequency          ln       0.0    5.6    [15]

 

 

Table 3. Examples of the DNA properties available in the database ACTIVITY

Property name                                  Unit        Min     Max     Ref.

Conformational:
Twist                                          Degree      31.1    41.4    [40]
Propeller                                      Degree      -17.3   -6.7    [41]
Tip                                            Degree      -1.64   6.7     [40]
Inclination                                    Degree      -1.43   1.43    [40]
Tilt                                           Degree      -2.6    0.6     [41]
Bend                                           Degree      2.16    6.74    [40]
Wedge                                          Degree      1.1     8.4     [42]
Direction                                      Degree      -154    180     [42]
Roll                                           Degree      -6.2    6.2     [43]
Rise                                           Angstrom    3.16    4.08    [40]
Slide                                          Angstrom    -0.4    1.6     [43]
Minor groove width (width)                     Angstrom    4.62    6.40    [40]
Minor groove depth (depth)                     Angstrom    8.79    9.11    [40]
Minor groove width size (size)                 Angstrom    2.7     4.7     [41]
Minor groove width distance (dist)             Angstrom    2.79    4.24    [41]
Major groove width (WIDTH)                     Angstrom    12.1    15.5    [40]
Major groove depth (DEPTH)                     Angstrom    8.45    9.60    [40]
Major groove size (SIZE)                       Angstrom    3.26    4.70    [41]
Major groove distance (DIST)                   Angstrom    3.02    3.81    [41]

Physico-chemical:
Clash strength                                 f           0.00    2.53    [41]
Bending mobility to minor groove               m           1.02    1.27    [31]
Bending mobility to major groove               m           0.99    1.18    [31]
Persistent length                              nl          20      130     [44]
Melting temperature                            °C          36.7    136.1   [44]
Probability to be contacting nucleosome core   %           1       18      [44]
Enthalpy change                                kcal/mol    -11.8   -5.6    [45]
Entropy change                                 cal/mol/K   -28.4   -15.2   [45]
Free energy change                             kcal/mol    -2.8    -0.9    [45]

 

Table 4. Examples of the functional DNA and RNA sites analyzed by the system ACTIVITY

 

Each block gives the site name, the position “1”, the number of variants n, and the activity F, followed by the selected features Xk (region, property, utility U, correlation r, significance α) and the derived approximation.

E. coli promoters [23]; position “1”: transcription start; n = 27; activity F: strength
  Xk    Region      Property     U      r       α
  X1    Fig.2a      [ASM]        0.59   0.86    10^-2
  X2    -5; 15      Direction    0.50   0.71    10^-2
  F = 0.3 + 0.6×X1 + 0.0008×X2          0.91    10^-4

PE1B region adjacent to the TATA box of the αA-crystallin promoter [36]; position “1”: transcription start; n = 10; activity F: transcription activity
  Xk    Region      Property     U      r       α
  X1    -32; -25    Pnucl        0.36   -0.77   10^-2
  X2    -29; -19    DIST         0.41   0.86    10^-3
  X3    -31; -25    Tilt         0.38   -0.78   10^-2
  F = -39 - 0.1×X1 + 12×X2 - X3         0.90    10^-4

TATA boxes (synthetic) [17]; position “1”: synthetic DNA start; n = 19; activity F: affinity to yeast TBP
  Xk    Region      Property     U      r       α
  X1    Fig.2b      [TV]         0.35   0.73    10^-2
  X2    Fig.2c      [WR]         0.41   0.76    10^-2
  F = 14.5 + 2.5×X1 + 0.9×X2            0.77    10^-2

TATA boxes (mutant) [33]; position “1”: TATA box start; n = 9; activity F: DNA bending TBP/TATA
  Xk    Region      Property     U      r       α
  X1    0; 9        Inclination  0.19   0.76    0.05
  F = 120.15 + 70.32×X1                 0.76    0.05

USF binding site (synthetic) [35]; position “1”: synthetic DNA start; n = 14; activity F: affinity USF/DNA
  Xk    Region      Property     U      r       α
  X1    11; 15      depth        0.22   -0.78   10^-3
  X2    11; 20      Twist        0.23   -0.86   10^-4
  F = 170 - 16.3×X1 - 0.7×X2            0.91    10^-5

CRP-binding site [31]; position “1”: center of the consensus repeat; n = 10; activity F: affinity CRP/DNA
  Xk    Region      Property     U      r       α
  X1    -15; 14     Rise         0.15   -0.86   10^-2
  X2    -17; 12     width        0.06   0.78    10^-2
  F = 190 - 66.8×X1 + 7.5×X2            0.87    10^-2

2-aminopurine induced mutations C->T [15]; position “1”: mutation point; n = 26; activity F: mutation frequency
  Xk    Region      Property     U      r       α
  X1    -1; 2       Tmelt        0.20   0.90    10^-5
  F = -8.5568 + 0.1585×X1               0.90    10^-5

ssDNA/RecA-filament (synthetic) [13]; position “1”: synthetic DNA start; n = 15; activity F: DNA/RecA affinity
  Xk    Region      Property     U      r       α
  X1    Fig.2d      [DRV]        0.27   -0.89   10^-5
  F = 0.54 - 1.03×X1                    0.89    10^-5

SV40 pre-mRNA 3’processing site [16]; position “1”: RNA cutting point; n = 16; activity F: cutting frequency
  Xk    Region      Property     U      r       α
  X1    Fig.2a      [VUKK]       0.24   0.76    10^-4
  F = -301.72 + 216.16×X1               0.76    10^-4

Cro-binding site [14]; position “1”: consensus start; n = 7; activity F: affinity Cro/DNA
  Xk    Region      Property     U      r       α
  X1    1; 16       width        0.55   0.97    10^-3
  X2    6; 19       Roll         0.44   0.90    10^-3
  X3    6; 19       Rise         0.41   0.92    10^-2
  F = -72 + 4×X1 + X2 + 13×X3           0.99    10^-5

Notes: n, the total number of the site variants; Xk, the selected context-dependent feature; U, utility value; r, linear correlation coefficient; α, significance of the linear correlation coefficient value; [Z], the concentration of the oligonucleotide Z weighted with the weight function w(i) given in Fig.2 (formula 2); Pnucl, probability to be contacting with nucleosome core; Tmelt, melting temperature; depth, minor groove depth; width, minor groove width; WIDTH, major groove width; DIST, major groove dist; F = F0 + Σ(k=1..K) Fk × Xk, linear-additive approximation (formula 1) predicting the site activity.

 

FIGURE LEGENDS

Fig. 1. The tetranucleotide VUKK concentration is responsible for the SV40 pre-mRNA 3’processing efficiency (a); and the major groove width is responsible for the Cro/DNA affinity (b).

Fig. 2. Examples of the weight functions w(i) modeling the highest effect of oligonucleotides located within the site 3’-half (a), central part (b), termini (c) and near 5’-terminus (d) on site activity.

Fig. 3. Algorithm for generating the C-code programs to predict site activities (where: the indexes “f g l ” are either the indexes “Z,m,w” or the indexes “q,a,b” respectively in formulae (2) or (3)).

Fig. 4. A scheme of the distributed and intelligent database ACTIVITY.

Fig. 5. The description of experimental data on the E. coli promoter strength [23] within ACTIVITY: MI, entity identifier; MN, sample name; OG, genome region; OS, species; FF, site; AN, activity name; AU, activity unit; SC, variant; SQ, sequence; SA, activity value.

Fig. 6. The description of the conformational property “Direction” [42] within ACTIVITY: MI, entity identifier; MN, property type; MD, molecule; ML, step; PN, property name; PM, identifying method; PU, property unit; DINUCLEOTIDE, property values.

Fig. 7. The ACTIVITY result for the E. coli promoter strength. Fields: MI, entity identifier; MN, sample name; CF, feature type; PV, property/oligonucleotide; AB, region; UT, utility; LC, linear correlation coefficient; C-CODE, the C code of the computer program calculating the feature.

Fig. 8. An interpretation of the ACTIVITY result for the E. coli promoter strength: (a) the trinucleotide ASM concentration correlates with the promoter strength; (b) the Direction correlates with the strength (r=0.71); (c) the agreement between the experiment and prediction.

Fig. 9. The ACTIVITY result for the transcription activity of the mouse αA-crystallin gene promoter containing the PE1B region near the TATA box [36]: (a) the probability of contact with the nucleosome core correlates negatively with the transcription activity; (b) the major groove dist correlates positively with the transcription activity; (c) the tilt correlates negatively with the transcription activity; (d) the agreement between the experimental and predicted data (Table 4).

Fig. 10. The ACTIVITY result of the TATA boxes: (a) the agreement between the experimental [17] and predicted TBP/DNA affinity; (b) the DNA bend within the TBP/TATA complex [33] correlates with the inclination.

Fig. 11. Examples of ACTIVITY-results: (a) the USF/DNA affinity [35] correlates with the twist; (b) the CRP/DNA affinity [31] correlates with the rise; (c) the frequency of the mutation induced by 2-aminopurine [15] correlates with the DNA melting temperature; (d) the ssDNA/RecA-filament affinity [13] correlates with the trinucleotide DRV concentration weighted by the weight function w(i) given in Fig.2d.


MI K0000001
MN E. coli promoter strength in terms of -log[Pbla] units
CF Statistical FEATURE
PV ASM
AB -49 18
UT 0.589
LC 0.860
C-CODE
double WeightASM_for_EcPbla (char *s){
  double X; char *seq; int i,k, SiteLength=68;
  /* one weight per trinucleotide window: 68 positions give 66 windows */
  double Weight5P0 [66]={
  /* -49 -48 -47 -46 -45 -44 -43 */
  0.100, 0.100, 0.100, 0.100, 0.100, 0.100, 0.100,
  ......................................................
  /* 11 12 13 14 15 16 */
  0.525, 0.356, 0.207, 0.143, 0.103, 0.100 };
  seq=&s[0]; if(strlen(seq)<SiteLength+1)return(-1001.);
  /* ASM: 'A', then S (G or C), then M (A or C) at three consecutive positions */
  for (i=0, X=0.;i<SiteLength-2;i++) {
    if(seq[i ]=='A')
    if(seq[i+1]=='G' || seq[i+1]=='C')
    if(seq[i+2]=='A' || seq[i+2]=='C') X+=Weight5P0[i]; }
  return(X);}
//
CF Conformational FEATURE
PV Direction
AB -5 15
UT 0.502
LC 0.710
C-CODE
double Direction_for_EcPbla (char *s){
  double X; char *seq; int i,k, SiteLength=21;
  double DinucPar[16]={
  /* AA AT AG AC TA TT TG TC */
  -154., 0., 2., 143., 0., 154., 64.,-120.,
  /* GA GT GG GC CA CT CG CC */
  120.,-143., 57., 180., -64., -2., 0., -57. };
  seq=&s[0]; if(strlen(seq)<SiteLength+1)return(-1001.);
  /* average the dinucleotide value over the SiteLength-1 steps of the region */
  for (i=0, X=0.;i<SiteLength-1;i++) {
    switch (seq[i ]) { case 'A': k= 0; break;
    ....................................................
    default : return(-1002.); }
    switch (seq[i+1]) { case 'A': k+=0; break;
    ....................................................
    default : return(-1003.); }
    if (k > 15) return(-1004.); X+=DinucPar[k]; }
  return (X/(double)(SiteLength-1));}
//
CF PREDICTION ACTIVITY
LC 0.910
C-CODE
double EcPbla_by_WeightASM_Direction (char *s){
  extern double WeightASM_for_EcPbla (char *);
  extern double Direction_for_EcPbla (char *);
  double x1,x2; char *seq; int s1=0, s2=45, SiteLength=68;
  seq=&s[0]; if(strlen(seq)<SiteLength+1)return(-1001.);
  /* negative return values below -999 are error codes passed up from the feature functions */
  seq=&s[s1]; x1=WeightASM_for_EcPbla (seq); if(x1<-999.)return(x1);
  seq=&s[s2]; x2=Direction_for_EcPbla (seq); if(x2<-999.)return(x2);
  return (0.307547 + 0.576596*x1 + 0.000799*x2);}
//
