Mean Recognition: a Systemic Approach Increasing the Accuracy of the Functional Site Recognition for the Genomic DNA Annotation

Mikhail P. Ponomarenko, Anatoly S. Frolov, Julia V. Ponomarenko, Olga A. Podkolodnaya, Denis V. Vorobyev, Nikolay A. Kolchanov

Laboratory of Theoretical Genetics, Institute of Cytology and Genetics, Novosibirsk, 630090, Russia; pon@bionet.nsc.ru

and

G. Christian Overton, Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104-6145, USA; coverton@cbil.humgen.upenn.edu

ABSTRACT

The Central Limit Theorem-based approach for increasing the accuracy of the recognition of a given functional site in an arbitrary DNA sequence has been suggested. It implies the averaging of a huge number of partial recognitions of the site analyzed into the “mean recognition” of this site. To design this huge number of the partial recognition procedures within the framework of the consensus and frequency matrix formalisms, a great number of novel computable oligonucleotide alphabets have been used. On this basis, the activated and intelligent database SAMPLES has been created; it contains the sets of DNA functional sites identified experimentally, multiply aligned, and processed to be recognized in an arbitrary DNA sequence by their consensuses and frequency matrixes. The novel opprotunities of SAMPLES was demonstrated using GATA-1, C/EBP and NF-1 transcription factor binding sites. All the control results are shown and discussed. SAMPLES is available at http://wwwmgs.bionet.nsc.ru/

Keywords: pattern recognition, activated database, Central Limit Theorem, human genome annotation, program generation, I and II type error rates.

1. Introduction

Recognition of functional sites in nucleotide sequences is one of the key episodes in genomic DNA annotation [1]. A huge number of methods have been so far developed to address the problem (for review, see [2]). The most widely used are the consensus and matrix methods [3-7] based on the evolutionary conservative nucleotides of functional sites. Recent evaluations of the accuracies of these methods for annotation of long genomic DNA fragments [8, 9] have, on the one hand, demonstrated a drastic progress in recognition of unknown genes and regulatory regions encoded in the genomic DNA and, on the other hand, the demand for considerable increase in the accuracy of the recognizing procedures for the functional sites in the actual application of genomic DNA annotation [8, 9].

Basing on the above bulk of intelligence, we are suggesting a systemic approach aiding the increase in the accuracy of a given functional site recognition in the course of genomic DNA annotation. It implies the averaging of a huge number of partial recognitions of

the site analyzed into the “mean recognition” of this site. The consensuses and frequency matrixes in 20 novel computable

alphabets have been used, and the activated database SAMPLES on the DNA site sequences identified, multiply aligned, and transformed into the C-code recognition programs has been created. The procedure was tested using the GATA-1, C/EBP and NF-1 transcription factor binding sites. SAMPLES is available at http://wwwmgs.bionet.nsc.ru/.

Figure 1. The scheme of the database SAMPLES.

2. SYSTEM AND METHODS

The scheme of the system suggested is in the Fig.1. Its key module is the C-code generator for the computer programs recognizing a given functional site; initializing data for this module is experimentally identified and multiply aligned DNA sequences.

DNA is a linear molecule containing monomers A, T, G, and C. The site consensus is the set of the letters positioned with frequencies higher than the threshold f₀.

Fig.2 exemplifies the simplest C-code of the programs recognizing the GATA-1 site by (a) its consensus, “GATAAAGG”, and (b) its frequency matrix.

Fig. 2b, d demonstrate how the simplest C-code programs can be easily generalized for generating a huge number of (c) consensuses and (d) frequency matrices using the 20 alphabets listed in Table 1. Using the database GibbsAlign of the multiply aligned DNA sequences of a given functional site (Fig. 1), the system SAMPLES can generate all the possible consensuses and frequency matrices {f_k} applicable to recognition of this site in an arbitrary sequence (1? k? K).

The simplest simultaneous usage of all these partial recognitions is averaging their values {f_k(S)} over the region S of an arbitrary DNA sequence:

F_K(S) = S _k=1,K f_k(S)/K. (1)

where the recognition values f_k(S) are normalized as:

S _n=1,N f_k(Site_n)/N = 1; (1a)

S _n=1,N f_k(Rand_n)/N =-1; (1b)

and the below decision rule is implemented:

IF {F_K(S)>0}, THEN {S is “this site”}. (1c).

By Central Limit Theorem, the mean recognition F_K Eq. (1) with high K is Gaussian-distributed and its variation K^-1/2-fold reduced. In contrast, the partial recognitions are often (a) discrete for consensuses and (b) multiple modal for frequency matrices (Fig. 3).

3. RESULTS AND DISCUSSIONS

The mean recognition suggested has been used to process the experimentally identified and multiply aligned DNA sequences of GATA-1, C/EBP, and NF-1 transcription factor binding sites (the total number of sequences analyzed was 102, 62 and 99, respectively). All the data are available in the databases SAMPLES and GibbsAlign, http://wwwmgs.bionet.nsc.ru/ (Fig. 1).

Each of these three sequence sets was randomly divided into two non-overlapping 50%-subsets, training and control. In the training sets, all the possible consensuses and frequency matrices were identified to generate the programs recognizing these sites. Their C-codes are stored in the database ConsFreq and executable in MeanRec, http://wwwmgs.bionet.nsc.ru/, (see in Fig.1).

Using the control subsets and 1000 random DNA sequences, each of the partial recognition procedures was tested. The complete test results are shown in Table 2 (lines 1-38). Similarly, their mean recognitions have been tested (Table 2). For more detail on these control test results, see the database ConsFreq at http://wwwmgs.bionet.nsc.ru/.

Note that the partial recognitions differ from one another in their means and standard deviations, I and II type error rates of both the site and random DNA sequences. Essentially, it is impossible to predict what partial recognition would be the best for an arbitrary site; in contrast, the mean-recognition appears the best for each of the three site tested. Fig. 4 illustrates that the mean recognition can decrease both the I and II type errors, a₁ and a₂, with respect to the frequency matrix. This is demonstrated for (a) GATA-1 and (b) C/EBP transcription factor binding sites.

To investigate this phenomenon, we studied the alteration of the statistical properties of the standard deviation of the GATA-1 mean recognition F_K, with the growth of the total number K of the GATA-1 averaged partial recognition procedures. In this test, for each value K, ten different combinations of the partial recognition procedures {f_k} was randomly chosen and their standard deviations for 51 control sequences of the GATA-1 site and 1000 random DNA sequences were calculated and averaged. Two variants of this analysis were carried out: with and without the GATA-1 consensus and frequency matrix used traditionally for the GATA-1 recognition.

The results obtained are presented in Fig. 5 for (a) the GATA-1 control subset and (b) for the random DNA (bold line, when with traditional recognitions; broken line, without). In case of the GATA-1 sites (Fig. 5a), the standard deviation value is approximately constant at any K values when the traditional recognitions are used (Fig.5a, bold line). This means that the GATA-1 sequences analyzed have been optimized by their preliminary alignment to create these traditional recognition procedures [11]. When the traditional recognition procedures were not involved (Fig.5a, broken line), the standard deviation value is approximately K^-1/2-fold decreasing with the K value (as is established by the Central Limit Theorem) until its alignment-dependent level reached. Essentially, the Central Limit Theorem-established decreases are for the random DNA sequences in both variants, with and without the traditional recognition procedures (Fig.5b). These results pinpoint that the mean recognition F_K Eq. (1) suggested in this work is increasing the accuracy of a given functional site recognition through the K^-1/2-fold decrease of the standard deviation of the non-site sequences which is responsible for the II type error, a₂.

4. CONCLUSION

In this work, we introduce the idea of simultaneous involvement of as many procedures recognizing a functional site in an arbitrary DNA sequence as we can design. The simplest implementation of this idea is averaging the all partial recognizing procedures available. This has been named the “mean recognition”. Unexpectedly, the analysis of the mean recognition shows that its statistical properties are asymptotically described by the Central Limit Theorem. Essentially, this theorem establishes that the mean recognition F_K should became Gaussian-distributed and its standard deviation K^-1/2-fold decreased with the total number K of the partial recognitions averaged. We have observed this essential decrease (Fig. 5). This means that the mean recognition functioning should be predictable by the Central Limit Theorem even when all its partial recognitions will be heuristic with their unpredictable functioning. Thus, the mean recognition is the systemic approach increasing the accuracy of the functional site recognition for the genomic DNA annotation.

Further development of the mean recognition approach will focus on the increase in the total number K>>100 of the averaged partial recognitions {f_k} through involvement of additional methods, such as Information Content, Perceptron, Neural Network, etc. Various weighting of the partial recognitions within the mean recognition will be also studied. Finally, it is interesting to implement the Central Limit Theorem to design the mean recognitions for increasing the accuracy of the coding potentials of exons, the non-coding potentials of introns, and the regulatory potentials of promoters [12-14]. All these will be performed within the framework of our Web-integrated system GeneExpress [15] available at the URL: http://wwwmgs.bionet.nsc.ru/.

ACKNOWLEDGMENT

We are grateful to Ms. Galina Chirikova for the help in translation. This work was granted by NIH 2-R01-RR04026-08A2; Russian Human Genome Project; Russian Foundation for Basic Research 97-04-49740, 96-04-50006, 97-07-90309, 98-07-90126; SB RAS IG-97N13 and Program for the Young Scientists Support.

REFERENCES

[1] Fickett, J.W. “Finding genes by computer: the state of the art.” Trends Genet., V. 12, 1996, pp. 316-320.

[2] Gelfand, M.S. “Prediction of function in DNA sequence analysis.” J. Comput. Biol., V. 2, 1995, pp. 87-115.

[3] Bucher, P. “Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences.” J. Mol. Biol., V. 212, 1990, pp. 563-578.

[4] Karlin, S. and Brendel, V. “Chance and statistical significance in protein and DNA sequence analysis.” Science, V. 257, 1992, pp. 39-49.

[5] Quandt, K., Frech, K., et al. “MatInd and MatInspector - New fast and sensitive tools for detection of consensus matches in nucleotide sequence data.” Nucl. Acids Res., V. 23, 1995, pp 4878-4884.

[6] Uberbacher, E.C., Xu, Y., and Mural, R.J. “Discovering and understanding genes in human DNA sequence using GRAIL.” Methods Enzymol., V. 266, 1996, pp. 259-281.

[7] Chen, Q.K., Hertz, G.Z., and Stormo, G.D. “PromFD 1.0: a computer program that predicts eukaryotic pol II promoters using strings and IMD matrices.” Comput. Appl. Biosci., V. 13, 1997, pp. 29-35.

[8] Fickett, J.W. and Hatzigeorgiou, A.G. “Eukaryotic promoter recognition.” Genome Res., V. 7, 1997, pp. 861-878.

[9] Burset, M. and Guigo, R. “Evaluation of gene structure prediction programs.” Genomics, V. 34, 1996, pp. 353-367.

[10] Kondrakhin, Yu.V., Shamin, V.V., and Kolchanov, N.A. “Construction of a generalized consensus matrix for recognition of vertebrate pre-mRNA 3'-terminal processing sites”. Comput. Appl. Biosci., V. 10, 1994, pp. 597-603.

[11] Lawrence, C. “Toward the unification of sequence and structural data for identification of structural and functional constraints.” Comput. Chem., V. 18, 1994, pp. 255-258.

[12] Solovyev, V.V., Salamov, A.A., and Lawrence, C.B. “Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames.” Nucl. Acids Res., V. 22, 1994, pp. 5156-5163.

[13] Guigo, R. and Fickett, J.W. “Distinctive sequence features in protein coding genic non-coding, and intergenic human DNA.” J. Mol. Biol., V. 253, 1995, pp. 51-60.

[14] Kondrakhin, Y.V., Kel, A.E., Kolchanov, N.A., Romashchenko, A.G., and Milanesi, L. “Eukaryotic promoter recognition by binding sites for transcription factors.” Comput. Appl. Biosci., V. 11, 1995, pp. 477-488.

[15] Kolchanov N. A., Ponomarenko M. P., et al., “GeneExpress: a computer system for description, analysis, and recognition of regulatory sequences of the eukaryotic genome”. ISMB, V. 6, 1998, (accepted).