Compound Poisson approximation of word counts in DNA sequences

Published online by Cambridge University Press:  15 August 2002

Sophie Schbath*
Affiliation:
Institut National de la Recherche Agronomique, France

Abstract

Identifying words with unexpected frequencies is an important problem in the analysis of long DNA sequences. To solve it, we need an approximation of the distribution of the number of occurrences N(W) of a word W. Modeling DNA sequences with m-order Markov chains, we use the Chen-Stein method to obtain Poisson approximations for two different counts. We approximate the “declumped” count of W by a Poisson variable and the number of occurrences N(W) by a compound Poisson variable. Combinatorial results are used to solve the general case of overlapping words and to calculate the parameters of these distributions.
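To make the two counts concrete, here is a minimal Python sketch. It is an illustration only, not the paper's method: it assumes that a "clump" is a maximal run of occurrences in which each occurrence overlaps the previous one, and that the declumped count is the number of such clumps. The function names `occurrence_positions` and `count_clumps` are hypothetical. The sketch only computes the observed counts; it does not estimate the Markov model or the Poisson parameters.

```python
def occurrence_positions(seq: str, word: str) -> list[int]:
    """Return the start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        # Advance by one character only, so overlapping occurrences are found.
        start = seq.find(word, start + 1)
    return positions


def count_clumps(positions: list[int], word_len: int) -> int:
    """Count clumps, assuming a clump is a maximal run of overlapping occurrences.

    Two consecutive occurrences overlap when their start positions differ
    by less than the word length.
    """
    clumps = 0
    prev = None
    for p in positions:
        if prev is None or p - prev >= word_len:
            clumps += 1  # this occurrence does not overlap the previous one
        prev = p
    return clumps
```

For example, with the sequence `"AAAACGAAAT"` and the word `"AAA"`, the occurrences start at positions 0, 1 and 6, so N(W) = 3, while the first two occurrences overlap and form a single clump, giving a declumped count of 2.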

Type
Research Article
Copyright
© EDP Sciences, SMAI, 1997
