Asymptotic Behavior of k-Word Matches Between two Uniformly Distributed Sequences

Published online by Cambridge University Press:  14 July 2016

M. R. Kantorovitz*
Affiliation:
Australian National University and University of Illinois
H. S. Booth∗∗
Affiliation:
Australian National University
C. J. Burden∗∗∗
Affiliation:
Australian National University
S. R. Wilson∗∗∗
Affiliation:
Australian National University
* Postal address: Department of Mathematics, University of Illinois, Urbana, IL 61801, USA. Email address: ruth@math.uiuc.edu
∗∗ H. S. Booth died 26 May 2005.
∗∗∗ Postal address: Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia.

Abstract

Given two sequences of length n over a finite alphabet A of size |A| = d, the D2 statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics for EST sequence database searches. Under the assumption of independent and identically distributed letters in the sequences, Lippert, Huang and Waterman (2002) raised questions about the asymptotic behavior of D2 when the alphabet is uniformly distributed. They expressed a concern that the commonly assumed normality may create errors in estimating significance. In this paper we answer those questions. Using Stein's method, we show that, for large enough k, the D2 statistic is approximately normal as n gets large. When k = 1, we prove that, for large enough d, the D2 statistic is approximately normal as n gets large. We also give a formula for the variance of D2 in the uniform case.
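
As an illustration of the definition above, the following is a minimal sketch of how D2 might be computed for two strings. It assumes overlapping, non-wrapping windows, so each sequence of length n contributes n - k + 1 words; the paper's exact indexing convention may differ. The function names (kmer_counts, d2, expected_d2_uniform) are hypothetical and chosen only for this example. Under independent, uniformly distributed letters, each pair of positions matches with probability d^(-k), so by linearity of expectation the mean under this convention is (n - k + 1)^2 / d^k.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-letter word in seq using overlapping, non-wrapping windows."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Number of position pairs (i, j) whose k-words in seq_a and seq_b agree.

    Summing count_a(w) * count_b(w) over words w gives the same total as
    comparing every window of seq_a with every window of seq_b.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(c * counts_b[w] for w, c in counts_a.items())


def expected_d2_uniform(n: int, k: int, d: int) -> float:
    """Mean of D2 for i.i.d. uniform letters over d symbols.

    Assumes the non-wrapping convention above: (n - k + 1)^2 position
    pairs, each matching with probability d^(-k).
    """
    return (n - k + 1) ** 2 / d ** k


if __name__ == "__main__":
    a = "ACGTACGT"
    b = "TTACGTAA"
    print(d2(a, b, k=3))
    print(expected_d2_uniform(n=len(a), k=3, d=4))
```

Counting words first and then multiplying the counts avoids the quadratic comparison of all position pairs, which matters for database-scale searches of the kind mentioned in the abstract.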

Type
Research Article
Copyright
Copyright © Applied Probability Trust 2007 

References

[1] Barbour, A. and Chryssaphinou, O. (2001). Compound Poisson approximation: a user's guide. Ann. Appl. Prob. 11, 964–1002.
[2] Billingsley, P. (1995). Probability and Measure, 3rd edn. John Wiley, New York.
[3] Burke, J., Davison, D. and Hide, W. (1999). d2 cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 9, 1135–1142.
[4] Carpenter, J. E., Christoffels, A., Weinbach, Y. and Hide, W. A. (2002). Assessment of the parallelization approach of d2 cluster for high-performance sequence clustering. J. Comput. Chem. 23, 755–757.
[5] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Prob. 3, 534–545.
[6] Christoffels, A. et al. (2001). STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 29, 234–238.
[7] Dembo, A. and Rinott, Y. (1996). Some examples of normal approximations by Stein's method. In Random Discrete Structures (IMA Vol. Math. Appl. 76), Springer, New York, pp. 25–44.
[8] Johnson, N. L. and Kotz, S. (1970). Distributions in Statistics. Continuous Univariate Distributions. 1. Houghton Mifflin Co., Boston, MA.
[9] Lippert, R. A., Huang, H. and Waterman, M. S. (2002). Distributional regimes for the number of k-word matches between two random sequences. Proc. Nat. Acad. Sci. USA 99, 13980–13989.
[10] Miller, R. T. et al. (1999). A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 9, 1143–1155.
[11] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Molec. Biol. 147, 195–197.
[12] Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. Sixth Berkeley Symp. Math. Statist. Prob., Vol. II, University of California Press, Berkeley, pp. 583–602.
[13] Stein, C. (1986). Approximate Computation of Expectations. Institute of Mathematical Statistics, Hayward, CA.
[14] Vinga, S. and Almeida, J. S. (2003). Alignment-free sequence comparison – a review. Bioinformatics 19, 513–523.
[15] Waterman, M. S. (1995). Introduction to Computational Biology. Chapman & Hall, New York.
[16] Zhang, Y. X. et al. (2002). Genome shuffling leads to rapid phenotypic improvement in bacteria. Nature 415, 644–646.