Distribution of Clump Statistics for a Collection of Words

Published online by Cambridge University Press:  14 July 2016

Donald E. K. Martin*
Affiliation:
North Carolina State University
Deidra A. Coleman*
Affiliation:
North Carolina State University
* Postal address: Department of Statistics, North Carolina State University, 4272 SAS Hall, Raleigh, NC 27695-8203, USA.

Abstract

We give an efficient method based on minimal deterministic finite automata for computing the exact distribution of the number of occurrences and coverage of clumps (maximal sets of overlapping words) of a collection of words. In addition, we compute probabilities for the number of h-clumps, word groupings where gaps of a maximal length h between occurrences of words are allowed. The method facilitates the computation of p-values for testing procedures. A word is allowed to contain other words of the collection, making the computation more general, but also more difficult. The underlying sequence is assumed to be Markovian of an arbitrary order.
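
To make the quantities in the abstract concrete, the following Python sketch counts the clumps of a collection of words in one sequence, together with their coverage, and then obtains the exact distribution of the clump count by brute-force enumeration. It is not the automaton-based method described above. It assumes an i.i.d. (order-zero Markov) model, treats two occurrences as overlapping when they share at least one position, and takes coverage to mean the number of positions lying inside some clump. The function names `clump_stats` and `clump_count_distribution` are illustrative only, and h-clumps are not handled.

```python
from collections import Counter
from itertools import product


def clump_stats(text, words):
    """Return (number of clumps, coverage) for `words` in `text`.

    Assumption: a clump is a maximal set of occurrences chained together
    by overlaps, where overlapping means sharing at least one position.
    """
    # Half-open intervals [start, end) for every occurrence of every word.
    intervals = sorted(
        (i, i + len(w))
        for w in set(words)
        for i in range(len(text) - len(w) + 1)
        if text.startswith(w, i)
    )
    n_clumps = 0
    coverage = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is not None and start < cur_end:
            # Shares a position with the current clump: extend it.
            cur_end = max(cur_end, end)
        else:
            # Close the previous clump (if any) and open a new one.
            if cur_end is not None:
                coverage += cur_end - cur_start
            n_clumps += 1
            cur_start, cur_end = start, end
    if cur_end is not None:
        coverage += cur_end - cur_start
    return n_clumps, coverage


def clump_count_distribution(n, words, probs):
    """Exact distribution of the clump count in an i.i.d. sequence of length n.

    `probs` maps each letter to its probability. Every sequence is
    enumerated, so the cost grows as len(probs) ** n.
    """
    dist = Counter()
    for letters in product(probs, repeat=n):
        p = 1.0
        for c in letters:
            p *= probs[c]
        k, _ = clump_stats("".join(letters), words)
        dist[k] += p
    return dict(dist)


if __name__ == "__main__":
    print(clump_stats("ABABAXXAB", ["ABA", "AB"]))
    print(clump_count_distribution(3, ["AA"], {"A": 0.5, "B": 0.5}))
```

In the first call the word "AB" is contained in "ABA", the nested case the abstract allows: the five occurrences form two clumps covering seven positions. Enumeration is only practical for very short sequences, which is why an automaton-based computation matters for realistic lengths.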

Type
Research Papers
Copyright
Copyright © Applied Probability Trust 2011 
