Hostname: page-component-cd9895bd7-jn8rn Total loading time: 0 Render date: 2024-12-29T03:15:45.687Z Has data issue: false hasContentIssue false

A Probabilistic Analysis of a String Editing Problem and its Variations

Published online by Cambridge University Press:  12 September 2008

Guy Louchard
Affiliation:
Laboratoire d'Informatique Théorique, Université Libre de Bruxelles, B-1050 Brussels, Belgium
Wojciech Szpankowski
Affiliation:
Department of Computer Science, Purdue University, W. Lafayette, IN 47907, USA

Abstract

We consider a string editing problem in a probabilistic framework. This problem is of considerable interest to many facets of science, most notably molecular biology and computer science. A string editing transforms one string into another by performing a series of weighted edit operations of overall maximum (minimum) cost. The problem is equivalent to finding an optimal path in a weighted grid graph. In this paper we provide several results regarding a typical behaviour of such a path. In particular, we observe that the optimal path (i.e. edit distance) is almost surely (a.s.) equal to αn for large n where α is a constant and n is the sum of lengths of both strings. More importantly, we show that the edit distance is well concentrated around its average value. In the so called independent model in which all weights (in the associated grid graph) are statistically independent, we derive some bounds for the constant α. As a by-product of our results, we also present a precise estimate of the number of alignments between two strings. To prove these findings we use techniques of random walks, diffusion limiting processes, generating functions, and the method of bounded difference.

Type
Research Article
Copyright
Copyright © Cambridge University Press 1995

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

[1]Apostolico, A. and Guerra, C. (1987) The longest common subsequence problem revisited. Algorithmica 2 315336.CrossRefGoogle Scholar
[2]Apostolico, A., Atallah, M., Larmore, L. and McFaddin, S. (1990) Efficient parallel algorithms for string editing and related problems. SIAM J. Comput. 19 968988.Google Scholar
[3]Arratia, R., Gordon, L. and Waterman, M. (1986) An extreme value theory for sequence matching. Annals of Statistics 14 971993.Google Scholar
[4]Arratia, R., Gordon, L. and Waterman, M. (1990) The Erdős-Rényi Law in distribution, for coin tossing and sequence matching. Annals of Statistics 18 539570.Google Scholar
[5]Arratia, R. and Waterman, M. (1985) Critical phenomena in sequence matching. Annals of Probability 13 12361249.CrossRefGoogle Scholar
[6]Arratia, R. and Waterman, M. (1989) The Erdős-Rényi strong law for pattern matching with a given proportion of mismatches. Annals of Probability 17 11521169.Google Scholar
[7]Arratia, R. and Waterman, M. (1994) A phase transition for the score in matching random sequences allowing deletions. Annals of Applied Probability 4 200225.Google Scholar
[8]Atallah, M., Jacquet, P. and Szpankowski, W. (1993) A probabilistic analysis of a pattern matching problem. Random Structures & Algorithms 4 191213.Google Scholar
[9]Billingsley, P. (1968) Convergence of Probability Measures, John Wiley.Google Scholar
[10]Bucklew, J. (1990) Large Deviation Techniques in Decision, Simulation, and Estimation, John Wiley.Google Scholar
[11]Dembo, A. and Karlin, S. (1992) Poisson approximations for r-scan processes. Annals of Applied Probability 2 329357.CrossRefGoogle Scholar
[12]Galil, Z. and Park, K. (1990) An improved algorithm for approximate string matching. SIAM J. Computing 19 989999.Google Scholar
[13]Chang, W. and Lampe, J. (1992) Theoretical and empirical comparisons of approximate string matching algorithms. Proc. Combinatorial Pattern Matching Tucson, AZ, 172181.Google Scholar
[14]Chvátal, V. and Sankoff, D. (1975) Longest common subsequence of two random sequences. J. Appl. Prob. 12 306315.Google Scholar
[15]Griggs, J., Halton, P. and Waterman, M. (1986) Sequence alignments with matched sections. SIAM J. Alg. Disc. Meth. 7 604608.CrossRefGoogle Scholar
[16]Griggs, J., Halton, P., Odlyzko, A. and Waterman, M. (1990) On the number of alignments of k sequences. Graphs and Combinatorics 6 133146.CrossRefGoogle Scholar
[17]Feller, W. (1970) An Introduction to Probability Theory and its Applications, Vol. 1, John Wiley.Google Scholar
[18]Feller, W. (1971) An Introduction to Probability Theory and its Applications, Vol. 2, John Wiley.Google Scholar
[19]Greene, D. H. and Knuth, D. E. (1981) Mathematics for the Analysis of Algorithms, Birkhauser.Google Scholar
[20]Iglehart, D. L. (1974) Weak convergence in applied probability. Stock Proc. Appl. 2 211241.Google Scholar
[21]Karlin, S. and Dembo, A. (1992) Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24 113140.CrossRefGoogle Scholar
[22]Karlin, S. and Ost, F. (1987) Counts of long aligned word matches among random letter sequences. Adv. Appl. Prob. 19 293351.CrossRefGoogle Scholar
[23]Kingman, J. F. C. (1976) Subadditive processes, in Ecole d'Eté de Probabilities de Saint-Flour V-1975, Lecture Notes in Mathematics, 539, Springer-Verlag.Google Scholar
[24]Laquer, H. T. (1981) Asymptotic limits for a two-dimensional recursion. Stud. Appl. Math. 64 271277.Google Scholar
[25]Liggett, T. (1985) Interacting Particle Systems, Springer-Verlag.CrossRefGoogle Scholar
[26]Louchard, G. (1987) Random walks, Gaussian processes and list structures. Theor. Comp. Sci. 53 99124.Google Scholar
[27]Louchard, G., Schott, R. and Randrianarimanana, B. (1992) Dynamic algorithms in D.E. Knuth's model: a probabilistic analysis. Theor. Comp. Sci. 93 201225.Google Scholar
[28]Louchard, G. and Szpankowski, W. (1993) A Probabilistic Analysis of a String Edit Problem, INRIA TR 1814, December 1992; revised Purdue University, CSD-TR-93–078, 1993.Google Scholar
[29]McDiarmid, C. (1989) On the method of bounded differences, in Surveys in Combinatorics, Siemons, J. (Ed.), vol 141, pp. 148188, London Mathematical Society Lecture Notes Series, Cambridge University Press.Google Scholar
[30]Myeres, E. (1986) An O(ND) difference algorithm and its variations. Algorithmica 1 251266.Google Scholar
[31]Newman, C. (1992) Chain Lengths in Certain Random Directed Graphs. Random Structures & Algorithms 3 243254.Google Scholar
[32]Pevzner, P. and Waterman, M. (1992) Matrix longest common subsequence problem, duality and Hilbert bases. Proc. Combinatorial Pattern Matching Tucson, AZ, 7787.Google Scholar
[33]Sankoff, D. and Kruskal, J. (Eds.) (1983) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley.Google Scholar
[34]Ukkonen, E. (1980) Finding approximate patterns in strings. J. Algorithms 1 359373.Google Scholar
[35]Waterman, M., Gordon, L. and Arratia, R. (1987) Phase transitions in sequence matches and nucleic acid structures. Proc. Natl. Acad. Sci. USA 84 12391242.Google Scholar
[36]Waterman, M. (Ed.) (1991) Mathematical Methods for DNA Sequences, CRC Press.Google Scholar