1 Introduction
A sequence of outcomes $X_1,X_2,\ldots $ over a finite alphabet is drawn sequentially from an unknown stochastic source P. At each moment a finite prefix $X_1^n=(X_1,X_2,\ldots ,X_n)$ is available, and the forecaster has to predict the next outcome using this information. The task may take one of the two following forms. In the first scenario, the forecaster simply makes a guess about the next outcome, and the forecaster’s performance is assessed by comparing the guess with the actual outcome. This scenario satisfies the weak prequential principle of Dawid [Reference Dawid and Vovk12]. In the second scenario, we allow the forecaster to be uncertain, namely, we ask them to assign a probability value to each of the possible outcomes. These values may be interpreted as estimates of the conditional probabilities $P(X_{n+1}|X_1^n)$. Various criteria of success may be chosen here, such as the quadratic difference of distributions or the Kullback–Leibler divergence. The key aspect of both problems is that we assume limited knowledge about the true probabilities governing the process that we want to forecast. Thus, an admissible solution should achieve optimal results for an arbitrary process from some general class. For clarity, the term “universal predictor” will be used to denote a solution of the outcome-guessing problem, while a solution of the probability-estimation problem will be referred to as a “universal estimator,” “universal measure,” or “universal code,” depending on the exact meaning.
The accumulated literature on universal coding and universal prediction is vast, even when we restrict ourselves to interactions between coding and prediction (see, e.g., [Reference Algoet1, Reference Fortnow and Lutz20, Reference Kalnishkan, Vyugin and Vovk30, Reference Ryabko44, Reference Ryabko45, Reference Solomonoff50, Reference Suzuki53]). To begin, it is known that for a fixed stochastic source P, the optimal prediction is given by the predictor induced by P, i.e., the informed scheme which predicts the outcome with the largest conditional probability $P(X_{n+1}|X_1^n)$ [Reference Algoet1]. In particular, we may expect that a good universal estimator should induce a good universal predictor. That being said, the devil is in the details, such as what is meant by a “good” universal code, measure, estimator, or predictor.
In this paper, we will assume that the unknown stochastic source P lies in the class of stationary ergodic measures. Moreover, we are concerned with measures which are universal in the information-theoretic sense of universal coding, i.e., the rate of Kullback–Leibler divergence between the estimate and the true measure P vanishes for any stationary ergodic measure. As for universal predictors, we require that the rate of correct guesses equals the respective rate for the predictor induced by measure P. In this setting, a universal measure need not belong to the class of stationary ergodic measures and can be computable, which makes the problem eminently practical. Our framework should be contrasted with universal prediction à la Solomonoff for left-c.e. semimeasures, where the universal semimeasure belongs to the class and is not computable [Reference Solomonoff50]. In general, the existence of a universal measure for an arbitrary class of probability measures can be linked to separability of the considered class [Reference Ryabko48].
Now we can ask whether a universal measure in the above sense of universal coding induces a universal predictor. Curiously, this simple question has not been unambiguously answered in the literature (see [Reference Morvai and Weiss37] for a recent survey), although a host of related propositions was compiled by Suzuki [Reference Suzuki53] and Ryabko [Reference Ryabko44, Reference Ryabko45] (see also [Reference Ryabko, Astola and Malyutov46]). It was shown by Ryabko [Reference Ryabko45] (see also [Reference Ryabko, Astola and Malyutov46]) that the expected value of the average absolute difference between the conditional probability for a universal measure and the true value $P(X_{n+1}|X_1^n)$ converges to zero for any stationary ergodic measure P. Ryabko [Reference Ryabko45] also showed that there exists a universal measure that induces a universal predictor. As we argue in this paper, this result does not solve the general problem.
Complementing the works [Reference Ryabko44, Reference Ryabko45, Reference Suzuki53], in this paper we will show that any universal measure R in the sense of universal coding that additionally satisfies a uniform bound
does indeed induce a universal predictor. Along the way, we will use the Breiman ergodic theorem [Reference Breiman7] and the Azuma inequality for martingales with bounded increments [Reference Azuma3], which is the source of condition (1). It is left open whether this condition is necessary. Fortunately, condition (1) is satisfied by reasonable universal measures such as the Prediction by Partial Matching (PPM) measure [Reference Cleary and Witten10, Reference Ryabko44, Reference Ryabko47], as we also show in this paper. It may be interesting to exhibit universal measures for which this condition fails. There is a large gap between bound (1) and the respective bound for the PPM measure, which calls for further research.
To add more weight and to make the problem interesting from a computational perspective, we consider this topic in the context of algorithmic randomness and we seek effective versions of probabilistic statements. Effectivization is meant as the research program of reformulating almost sure statements into respective statements about algorithmically random points, i.e., algorithmically random infinite sequences. Any plausible class of random points is of measure one (see [Reference Downey and Hirschfeldt18]), and the effective versions of theorems substitute the phrase “almost surely” with “on all algorithmically random points.” Usually, randomness in the Martin-Löf sense is the desired goal [Reference Martin-Löf35]. In many cases, the standard proofs are already constructive; effectivization of other theorems calls for developing new proofs; and sometimes the effective versions are simply false.
In this paper, we show that algorithmic randomness theory is mature enough to make the theory of universal coding and prediction for stationary ergodic sources effective in the Martin-Löf sense. The main keys to this success are: the framework for randomness with respect to uncomputable measures by Reimann and Slaman [Reference Reimann42, Reference Reimann and Slaman43], the effective Birkhoff ergodic theorem [Reference Bienvenu, Day, Hoyrup, Mezhirov and Shen6, Reference Franklin, Greenberg, Miller and Ng21, Reference V’yugin55], an effective version of Breiman’s ergodic theorem [Reference Breiman7], and an effective Azuma theorem, which follows from the Azuma inequality [Reference Azuma3] and a result of Solovay (unpublished; see [Reference Downey and Hirschfeldt18]) that we call here the effective Borel–Cantelli lemma. As a little surprise, there is also a negative result concerning universal forward estimators—Theorem 3.13. Not everything can be made effective.
The organization of the paper is as follows. In Section 2, we discuss preliminaries: notation (Section 2.1), stationary and ergodic measures (Section 2.2), algorithmic randomness (Section 2.3), and some known effectivizations (Section 2.4). Section 3 contains the main results concerning universal coding (Section 3.1), universal prediction (Section 3.2), universal predictors induced by universal backward estimators (Section 3.3) and by universal codes (Section 3.4), as well as the PPM measure (Section 3.5), which constitutes a simple example of a universal code and a universal predictor.
2 Preliminaries
In this section we familiarize the readers with our notation, we recall the concepts of stationary and ergodic measures, we discuss various sorts of algorithmic randomness, and we recall known facts from the effectivization program.
2.1 Notation
Throughout this paper, we consider the standard measurable space $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ of two-sided infinite sequences over a finite alphabet $\mathbb {X}=\left \{ a_1,\dots ,a_D \right \}$ , where $D\ge 2$ . (Occasionally, we also use the space of one-sided infinite sequences $(\mathbb {X}^{\mathbb {N}},\mathcal {X}^{\mathbb {N}})$ .) The points of the space are (infinite) sequences $x=(x_i)_{i\in \mathbb {Z}}\in \mathbb {X}^{\mathbb {Z}}$ . We also write (finite) strings as $x_j^k=(x_i)_{j\le i\le k}$ , where $x_j^{j-1}=\lambda $ is the empty string. By $\mathbb {X}^*=\bigcup _{n\ge 0}\mathbb {X}^n$ we denote the set of strings of arbitrary length, including the singleton $\mathbb {X}^0=\left \{ \lambda \right \}$ . We use random variables $X_k((x_i)_{i\in \mathbb {Z}}):=x_k$ . Having these, the $\sigma $ -field $\mathcal {X}^{\mathbb {Z}}$ is generated by the cylinder sets $(X_{-|\sigma |+1}^{|\tau |}=\sigma \tau )$ for all $\sigma ,\tau \in \mathbb {X}^*$ . We tacitly assume that P and R denote probability measures on $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ . For any probability measure P, we use the shorthand notations $P(x_1^n):=P(X_1^n=x_1^n)$ and $P(x_j^n|x_1^{j-1}):=P(X_j^n=x_j^n|X_1^{j-1}=x_1^{j-1})$ . Notation $\log x$ denotes the binary logarithm, whereas $\ln x$ is the natural logarithm.
2.2 Stationary and ergodic measures
Let us denote the measurable shift operation $T((x_i)_{i\in \mathbb {Z}}):=(x_{i+1})_{i\in \mathbb {Z}}$ for two-sided infinite sequences ${(x_i)_{i\in \mathbb {Z}}\in \mathbb {X}^{\mathbb {Z}}}$ .
Definition 2.1 Stationary measures
A probability measure P on $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ is called stationary if $P(T^{-1}(A))=P(A)$ for all events $A\in \mathcal {X}^{\mathbb {Z}}$ .
Definition 2.2 Ergodic measures
A probability measure P on $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ is called ergodic if for each event $A\in \mathcal {X}^{\mathbb {Z}}$ such that $T^{-1}(A)=A$ we have either $P(A)=1$ or $P(A)=0$ .
The class of stationary ergodic probability measures has various nice properties guaranteed by the collection of fundamental results called ergodic theorems. Typically, stationary ergodic measures are not computable (e.g., consider independent biased coin tosses with a common uncomputable bias), but computable universal coding and computable universal prediction schemes can nevertheless achieve the optimal rates for them, as will be explained in Section 3.
2.3 Sorts of randomness
Now let us discuss some computability notions. In the following, computably enumerable is abbreviated as c.e. Given a real r, the set $\left \{ q\in \mathbb {Q}:q<r \right \}$ is called the left cut of r. A real function f with arguments in a countable set is called computable or left-c.e., respectively, if the left cuts of $f(\sigma )$ are uniformly computable or c.e. given an enumeration of $\sigma $ . For an infinite sequence $s\in \mathbb {X}^{\mathbb {Z}}$ , we say that real functions f are s-computable or s-left-c.e. if they are computable or left-c.e. with oracle s. Similarly, for a real function f taking arguments in $\mathbb {X}^{\mathbb {Z}}$ , we will say that f is s-computable or s-left-c.e. if the left cuts of $f(x)$ are uniformly computable or c.e. with oracles $x\oplus s:=(\ldots ,x_{-1},s_{-1},x_0,s_0,x_1,s_1,\ldots )$ . In effect, this induces s-computable and s-left-c.e. random variables and stochastic processes on $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ , where the values of an s-computable (s-left-c.e.) variable on a point x are $(x\oplus s)$ -computable ($(x\oplus s)$ -left-c.e.) uniformly in x and the values of an s-computable (s-left-c.e.) process $X_{i}$ (with natural or integer index i) are s-computable (s-left-c.e.).
For stationary ergodic measures, we need a definition of algorithmically random points with respect to an arbitrary, i.e., not necessarily computable probability measure on $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ . A simple definition thereof was proposed by Reimann [Reference Reimann42] and Reimann and Slaman [Reference Reimann and Slaman43]. This definition is equivalent to earlier approaches by Levin [Reference Levin32–Reference Levin34] and Gács [Reference Gács23] as shown by Day and Miller [Reference Day and Miller13] and we will use it since it leads to straightforward generalizations of the results in Section 2.4. The definition is based on measure representations. Let $\mathcal {P}(\mathbb {X}^{\mathbb {Z}})$ be the space of probability measures on $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ . A measure $P\in \mathcal {P}(\mathbb {X}^{\mathbb {Z}})$ is called s-computable if real function $(\sigma ,\tau )\mapsto P(X_{-|\sigma |+1}^{|\tau |}=\sigma \tau )$ is s-computable. Similarly, a representation function is a function $\rho :\mathbb {X}^{\mathbb {Z}}\rightarrow \mathcal {P}(\mathbb {X}^{\mathbb {Z}})$ such that real function $(\sigma ,\tau ,s)\mapsto \rho (s)(X_{-|\sigma |+1}^{|\tau |}\,{=}\,\sigma \tau )$ is computable. Subsequently, we say that an infinite sequence $s\in \mathbb {X}^{\mathbb {Z}}$ is a representation of measure P if there exists a representation function $\rho $ such that $\rho (s)=P$ . We note that any measure P is s-computable for any representation s of P.
We will consider two important sorts of algorithmically random points: Martin-Löf or 1-random points and weakly 2-random points with respect to an arbitrary stationary ergodic measure P on $(\mathbb {X}^{\mathbb {Z}},\mathcal {X}^{\mathbb {Z}})$ . Note that the following notions are typically defined for one-sided infinite sequences over the binary alphabet and computable measures P. In the following parts of this paper, let an infinite sequence $s\in \mathbb {X}^{\mathbb {Z}}$ be a representation of measure P.
Definition 2.3. A collection of events $U_1,U_2,\ldots \in \mathcal {X}^{\mathbb {Z}}$ is called uniformly s-c.e. if and only if there is a collection of sets $V_1,V_2,\ldots \subset \mathbb {X}^*\times \mathbb {X}^*$ such that
and sets $V_1,V_2,\ldots $ are uniformly s-c.e.
Definition 2.4 Martin-Löf test
A uniformly s-c.e. collection of events $U_1,U_2,\ldots \in \mathcal {X}^{\mathbb {Z}}$ is called a Martin-Löf $(s,P)$ -test if $P(U_n)\leq 2^{-n}$ for every $n\in \mathbb {N}$ .
Definition 2.5 Martin-Löf or 1-randomness
A point $x\in \mathbb {X}^{\mathbb {Z}}$ is called Martin-Löf $(s,P)$ -random or $1$ - $(s,P)$ -random if for each Martin-Löf $(s,P)$ -test $U_1,U_2,\ldots $ we have $x\not \in \bigcap _{i\ge 1} U_i$ . A point is called Martin-Löf P-random or $1$ -P-random if it is $1$ - $(s,P)$ -random for some representation s of P.
Subsequently, an event $C\in \mathcal {X}^{\mathbb {Z}}$ is called a $\Sigma ^0_2(s)$ event if there exists a uniformly s-c.e. sequence of events $U_1,U_2,\ldots $ such that $\mathbb {X}^{\mathbb {Z}}\setminus C=\bigcap _{i\ge 1} U_i$ .
Definition 2.6 Weak 2-randomness
A point $x\in \mathbb {X}^{\mathbb {Z}}$ is called weakly $2$ - $(s,P)$ -random if x is contained in every $\Sigma ^0_2(s)$ event C such that $P(C)=1$ . A point is called weakly $2$ -P-random if it is weakly $2$ - $(s,P)$ -random for some representation s of P.
The sets of weakly $2$ -random points are strictly smaller than the respective sets of $1$ -random points (see [Reference Downey and Hirschfeldt18]).
In general, there is a whole hierarchy of algorithmically random points, such as (weakly) n-random points, where n runs over natural numbers. For our purposes, however, only $1$ -random points and weakly $2$ -random points matter since the following proposition sets the baseline for effectivization:
Proposition 2.7 Folklore
Let $Y_1,Y_2,\ldots $ be a sequence of uniformly s-computable random variables. If the limit $\lim _{n\to \infty }Y_n$ exists P-almost surely, then it exists on all weakly $2$ - $(s,P)$ -random points.
The above proposition is obvious since the set of points on which the limit $\lim _{n\to \infty }Y_n$ exists is a $\Sigma ^0_2(s)$ event. The effectivization program aims to strengthen the above claim to $1$ -P-random points (or even weaker notions such as Schnorr randomness), but this need not always be feasible. In particular, one can observe the following:
Proposition 2.8 Folklore
Let P be a non-atomic computable measure on $\mathbb {X}^{\mathbb {N}}$ . Then there exists a computable function $f:\mathbb {X}^*\rightarrow \{0,1\}$ such that the limit $\lim _{n\to \infty }f(X_1^n)$ exists and is equal to zero P-almost surely but is undefined on exactly one point, which is $1$ -P-random.
This fact is a simple consequence of the existence of $\Delta ^0_2$ $1$ -P-random sequences (for a computable P) and may be also interpreted in terms of learning theory (cf. [Reference Osherson and Weinstein41] and the upcoming paper [Reference Steifer52]).
2.4 Known effectivizations
Many probabilistic theorems have been effectivized so far. Usually they were stated for computable measures but their generalizations for uncomputable measures follow easily by relativization, i.e., putting a representation s of measure P into the oracle. In this section, we list several known effectivizations of almost sure theorems which we will use further.
As shown by Solovay (unpublished; see [Reference Downey and Hirschfeldt18]), we have this effective version of the Borel–Cantelli lemma:
Proposition 2.9 Effective Borel–Cantelli lemma
Let P be a probability measure. If a uniformly s-c.e. sequence of events $U_1,U_2,\ldots \in \mathcal {X}^{\mathbb {Z}}$ satisfies $\sum _{n=1}^\infty P(U_n)<\infty $ then $\sum _{n=1}^\infty \mathbf {1}{\left \{ x\in U_n \right \}}<\infty $ on each $1$ - $(s,P)$ -random point x.
From Proposition 2.9 (effective Borel–Cantelli lemma), the effective version of the Barron lemma [Reference Barron5, Theorem 3.1] follows:
Proposition 2.10 Effective Barron lemma
For any probability measure P and any s-computable probability measure R, on $1$ - $(s,P)$ -random points we have
In the following, we make an easy but important observation—probabilities conditioned on an infinite past are well defined on random points. First, we need to recall the notion of a martingale process and prove an effective version of Doob’s martingale convergence theorem.
Definition 2.11 Martingale process
A process $(X_i)_{i\in \mathbb {N}}$ is called a martingale process relative to the sequence of $\sigma $ -algebras $\mathcal {F}_1\subset \mathcal {F}_2\subset \cdots $ (called a filtration) if the following conditions hold:
1. $X_n$ are $\mathcal {F}_n$ -measurable for all n;
2. $\operatorname {\mathrm {\textbf {E}}}(|X_n|)<\infty $ for all n;
3. $\operatorname {\mathrm {\textbf {E}}}(X_{n+1}|\mathcal {F}_n)=X_n$ for all n almost surely.
The proof of Doob’s martingale convergence can be easily made effective. This was already observed by Takahashi [Reference Takahashi54], who stated the effective martingale convergence for a specific filtration generated by cylinders $X_1^n$ . The following upcrossing inequality can be used to define a test which enforces convergence.
Proposition 2.12 Doob upcrossing inequality
Let $(X_i)_{i\in \mathbb {N}}$ be a martingale process, let $C_n$ be the random variable denoting the number of upcrossings of the interval $[a,b]$ (with $a,b\in \mathbb {R}$ ) by time n, and suppose that $\sup _n\operatorname {\mathrm {\textbf {E}}}(|X_n|)<\infty $ . Then for each n, we have
Proposition 2.13 Effective Doob martingale convergence
Let $(X_i)_{i\in \mathbb {N}}$ be a uniformly s-computable martingale process with $\sup _n\operatorname {\mathrm {\textbf {E}}}(|X_n|)<\infty $ . Then the limit $\lim _{n\to \infty } X_n$ exists and is finite on each $1$ - $(s,P)$ -random point.
Proof Suppose that process $(X_i)_{i\in \mathbb {N}}$ does not converge on some random point x. Then there exist rational $a,b$ such that the number of upcrossings of the interval $[a,b]$ by $X_i(x)$ is infinite. Let $C_n$ be the random variable denoting the number of upcrossings of interval $[a,b]$ by the process $(X_i)_{i\in \mathbb {N}}$ by the time n. Let $C_{\infty }$ denote $\sup _n C_n$ and let $f:\mathbb {N}\rightarrow \mathbb {N}$ be a monotonic function. Consider a collection of sets $U_1,U_2,\ldots $ such that for all $i>0$
By Proposition 2.12 (Doob upcrossing inequality) and the Markov inequality, we have
Note that if f grows sufficiently fast, then $\sum ^{\infty }_{i=1}P(U_i)$ converges. Moreover, the collection of sets $U_1,U_2,\ldots $ is uniformly s-c.e. It follows by Proposition 2.9 (effective Borel–Cantelli lemma) that $C_{\infty }(x)<\infty $ for every $1$ - $(s,P)$ -random point x, which is a contradiction.
It remains to observe that the limit of $(X_i)_{i\in \mathbb {N}}$ is finite. This follows easily if one considers the collection of sets $V_1,V_2,\ldots $ with
which are uniformly s-c.e. By the Markov inequality and the monotone convergence theorem, we have $P(V_i)\leq 2^{-i}\sup _{n}\operatorname {\mathrm {\textbf {E}}}(X_n)$ . We apply Proposition 2.9 to conclude that $X_n$ are bounded on every $1$ - $(s,P)$ -random point.
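To make the notion of an upcrossing concrete, the following short Python sketch (our illustration, not part of the original argument) counts the upcrossings of an interval $[a,b]$ by a finite trajectory; the test constructed in the proof above bounds exactly this quantity on random points.

```python
def upcrossings(xs, a, b):
    """Count upcrossings of [a, b] by the finite trajectory xs (assumes a < b)."""
    count, below = 0, False
    for x in xs:
        if x <= a:
            below = True              # the trajectory has dipped to a or lower
        elif x >= b and below:
            count += 1                # ...and has now risen to b or higher: one upcrossing
            below = False
    return count

# Example: a trajectory oscillating across [0, 1] has two upcrossings.
print(upcrossings([-1, 2, 0.5, -0.3, 1.5, 0.7], a=0, b=1))  # -> 2
```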
Random variables $P(x_0|X_{-n}^{-1})$ for $n\ge 1$ form a uniformly s-computable martingale process with respect to the filtration generated by cylinder sets $X_{-n}^{-1}$ for any representation s of P. Thus, applying the effective Doob martingale convergence, we obtain in particular an effective version of the Lévy law. In this work, our attention is limited to the following form.
Proposition 2.14 Effective Lévy law
On $1$ -P-random points there exist limits
Now let us proceed to a celebrated result of the algorithmic randomness theory, which is the effective Birkhoff ergodic theorem [Reference Bienvenu, Day, Hoyrup, Mezhirov and Shen6, Reference Franklin, Greenberg, Miller and Ng21, Reference Hoyrup and Rojas28, Reference Hoyrup and Rojas29, Reference Nandakumar39, Reference V’yugin55]. In the following, $\operatorname {\mathrm {\textbf {E}}} X:=\int X dP$ stands for the expectation of a random variable X with respect to measure P.
Proposition 2.15 Effective Birkhoff ergodic theorem [Reference Bienvenu, Day, Hoyrup, Mezhirov and Shen6, Theorem 10]
For a stationary ergodic probability measure P and an s-left-c.e. real random variable G such that $G\ge 0$ and $\operatorname {\mathrm {\textbf {E}}} G<\infty $ , on $1$ - $(s,P)$ -random points we have
We note in passing that if a point is not $1$ -random for a computable P then (4) fails on this point for some computable real random variable G and some computable transformation T [Reference Franklin and Towsner22].
The proof of the next proposition is an easy application of Proposition 2.15 and properties of left-c.e. functions.
Proposition 2.16 Effective Breiman ergodic theorem [Reference Steifer51]
For a stationary ergodic probability measure P and uniformly s-computable real random variables $(G_i)_{i\ge 0}$ such that $G_n\ge 0$ , $\operatorname {\mathrm {\textbf {E}}} \sup _n G_n<\infty $ , and the limit $\lim _{n\to \infty } G_n$ exists P-almost surely, on $1$ - $(s,P)$ -random points we have
Proof Let $H_k:=\sup _{t>k}G_t\ge 0$ . Then $G_t\le H_k$ for all $t>k$ and consequently,
Observe that the supremum $H_k$ of uniformly s-computable functions $G_{k+1},G_{k+2},\ldots $ is s-left-c.e. Indeed, to enumerate the left cut of the supremum $H_k(x)$ we simultaneously enumerate the left cuts of $G_{k+1}(x),G_{k+2}(x),\ldots $ . This is possible since every s-computable function is also s-left-c.e. Moreover, we are considering only countably many functions, and hence we can guarantee that an element of each left cut appears in the enumeration at least once.
Now, since the random variables $H_k$ are s-left-c.e. for all $k\ge 0$ , by Proposition 2.15 (effective Birkhoff ergodic theorem), on $1$ - $(s,P)$ -random points we have
Since $H_k\ge 0$ and $\operatorname {\mathrm {\textbf {E}}} \sup _k H_k<\infty $ , by the dominated convergence theorem,
Thus,
For the converse inequality, consider a natural number M and put random variables $\bar H_k:=M-\inf _{t>k}\min \left \{ G_t,M \right \}\in [0,M]$ . We observe that $\bar H_k$ are also s-left-c.e. since $G_t$ are uniformly s-computable by the hypothesis. By Proposition 2.15 (effective Birkhoff ergodic theorem), on $1$ - $(s,P)$ -random points we have
Since $0\le \bar H_k\le M$ , by the dominated convergence theorem,
Hence, regrouping the terms we obtain
where the last transition follows by the monotone convergence. By (9) and (12) we derive the claim.
The almost sure versions of Propositions 2.15 and 2.16 concern random variables which need not be nonnegative [Reference Breiman7].
An important result for universal prediction is the Azuma inequality [Reference Azuma3], whose following corollary will be used in Sections 3.2 and 3.4.
Theorem 2.17 Effective Azuma theorem
For a probability measure P and uniformly s-computable real random variables $(Z_n)_{n\ge 1}$ such that $Z_n=g(X_1^n,s)$ and $\left | Z_n \right |\le \epsilon _n\sqrt {n/\ln n}$ with $\lim _{n\to \infty } \epsilon _n=0$ , on $1$ - $(s,P)$ -random points we have
Proof Define
The process $(Y_n)_{n\ge 1}$ is a martingale with respect to the process $(X_n)_{n\ge 1}$ with increments bounded by the inequality
By the Azuma inequality [Reference Azuma3] for any $\epsilon>0$ we obtain
where
Since $\alpha _n\to \infty $ , we have $\sum _{n=1}^\infty P(\left | Y_n \right |\ge n\epsilon )<\infty $ and by Proposition 2.9 (effective Borel–Cantelli lemma), we obtain (13) on $1$ - $(s,P)$ -random points.
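The quantitative content of the Azuma inequality behind this proof can be illustrated by a small simulation. In the sketch below (ours; the martingale, the constants n, $\epsilon $ , c, and the number of trials are arbitrary choices), the empirical frequency of the event $\left | Y_n \right |\ge n\epsilon $ is compared with the standard Azuma–Hoeffding tail $2\exp \left (-n\epsilon ^2/(2c^2)\right )$ for increments bounded by c.

```python
import math
import random

n, eps, c, trials = 1_000, 0.1, 1.0, 5_000
exceed = 0
for _ in range(trials):
    # A simple martingale: a sum of independent, zero-mean increments bounded by c.
    y = sum(random.choice((-c, c)) for _ in range(n))
    if abs(y) >= n * eps:
        exceed += 1

empirical = exceed / trials
azuma_bound = 2 * math.exp(-n * eps ** 2 / (2 * c ** 2))
print(empirical, azuma_bound)  # the empirical frequency should stay below the bound
```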
3 Main results
This section contains results concerning effective universal coding and prediction, predictors induced by universal measures, and some examples of universal measures and universal predictors.
3.1 Universal coding
Let us begin our considerations with the problem of universal measures, which is related to the problem of universal coding. Suppose that we want to losslessly compress a typical sequence generated by a stationary probability measure P. We can reasonably ask what the lower limit of such compression is, i.e., what the minimal ratio of the encoded string length to the original string length is. In information theory, it is well known that the greatest lower bound of such ratios is given by the entropy rate of measure P. For a stationary probability measure P, we denote its entropy rate as
which exists for any stationary probability measure.
The entropy rate has the interpretation of the minimal asymptotic rate of lossless encoding of sequences emitted by measure P in various senses: in expectation, almost surely, or on algorithmically random points, where the last interpretation will be pursued in this subsection.
To furnish some theoretical background for universal coding, let us recall the Kraft inequality $\sum _{w\in A} 2^{-\left | w \right |}\le 1$ , which holds for any prefix-free subset of strings $A\subset \left \{ 0,1 \right \}^*$ . The Kraft inequality implies in particular that lossless compression procedures, called prefix-free codes, can be mapped one-to-one to semi-measures. In particular, if we are seeking a universal code, i.e., a prefix-free code $w\mapsto C(w)\in \left \{ 0,1 \right \}^*$ which is optimal for some class of stochastic sources P, we can equivalently seek a universal semi-measure of the form $w\mapsto R(w):=2^{-\left | C(w) \right |}$ . (A similar correspondence holds also for uniquely decodable codes [Reference McMillan36].) Consequently, the problem of universal coding will be solved if we exhibit a semi-measure R such that
for some points that are typical of P.
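To make the code-to-semi-measure correspondence concrete, here is a minimal Python sketch of ours; the four-symbol prefix-free code below is a hypothetical toy example, not one used in the paper.

```python
# Toy prefix-free code over {0,1} and the induced semi-measure R(w) := 2^{-|C(w)|}.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def is_prefix_free(codewords):
    """Check that no codeword is a prefix of another (adjacent check on sorted words)."""
    cws = sorted(codewords)
    return all(not cws[i + 1].startswith(cws[i]) for i in range(len(cws) - 1))

kraft_sum = sum(2 ** -len(c) for c in code.values())
semi_measure = {w: 2 ** -len(c) for w, c in code.items()}

assert is_prefix_free(code.values())
assert kraft_sum <= 1                 # the Kraft inequality
print(kraft_sum, semi_measure)        # 1.0 {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
```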
As is well established in information theory, some initial insight into the problem of universal coding or universal measures is given by the Shannon–McMillan–Breiman (SMB) theorem, which states that the function $\frac {1}{n}\left [ -\log P(X_1^n) \right ]$ tends P-almost surely to the entropy rate $h_{P}$ . The classical proofs of this result were given by Algoet and Cover [Reference Algoet and Cover2] and Chung [Reference Chung9]. An effective version of the SMB theorem was presented by Hochman [Reference Hochman26] and Hoyrup [Reference Hoyrup27] (cf. [Reference Nakamura38, Reference V’yugin55] for related partial and weaker results).
Theorem 3.1 Effective SMB theorem [Reference Hochman26, Reference Hoyrup27]
For a stationary ergodic probability measure P, on $1$ -P-random points we have
The essential idea of Hoyrup’s proof, which is a bit more complicated, can be retold using tools developed in Section 2.4. Observe first that we have
Moreover, we have the uniform bound
(see [Reference Smorodinsky49, Lemma 4.26])—invoked by Hoyrup as well. Consequently, the effective SMB theorem follows by Proposition 2.16 (effective Breiman ergodic theorem) and Proposition 2.14 (effective Lévy law). In contrast, the reasoning by Hoyrup was more casuistic and his effective version of the Breiman ergodic theorem is weaker than the one proven here.
We note in passing that it could also be interesting to check whether one can effectivize the textbook sandwich proof of the SMB theorem by Algoet and Cover [Reference Algoet and Cover2] using the decomposition of conditionally algorithmically random sequences by Takahashi [Reference Takahashi54]. However, this step would require some novel theoretical considerations about conditional algorithmic randomness for uncomputable measures. We mention this only to point out a possible direction for future research.
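As a plain, non-effective illustration of the SMB convergence discussed above, one can simulate an i.i.d. source, for which the entropy rate equals the entropy of the marginal distribution, and watch $\frac {1}{n}\left [ -\log P(X_1^n) \right ]$ approach $h_{P}$ ; the bias and the sample size in this sketch of ours are arbitrary.

```python
import math
import random

theta = 0.3                       # i.i.d. Bernoulli(theta) source; h_P = eta(theta)
h_P = -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))

n = 100_000
x = [1 if random.random() < theta else 0 for _ in range(n)]
# -log P(x_1^n) for the i.i.d. measure, accumulated symbol by symbol.
neg_log_prob = sum(-math.log2(theta if xi else 1 - theta) for xi in x)

print(neg_log_prob / n, h_P)      # the two numbers should be close (SMB/AEP)
```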
As a direct consequence of the effective SMB theorem and Proposition 2.10 (effective Barron lemma), we obtain this effectivization of another well-known almost sure statement:
Theorem 3.2 Effective source coding
For any stationary ergodic measure P and any s-computable probability measure R, on $1$ - $(s,P)$ -random points we have
In the almost sure setting, relationship (23) holds P-almost surely for any stationary ergodic measure P and any (not necessarily computable) probability measure R.
Now we can define universal measures.
Definition 3.3 Universal measure
A computable (not necessarily stationary) probability measure R is called (weakly) n-universal if for any stationary ergodic probability measure P, on (weakly) n-P-random points we have
In the almost sure setting, we say that a probability measure R is almost surely universal if (24) holds P-almost surely for any stationary ergodic probability measure P. By Proposition 2.7, there are only two practically interesting cases of computable universal measures: weakly $2$ -universal ones and $1$ -universal ones, since every computable almost surely universal probability measure is automatically weakly $2$ -universal. We stress that we impose computability of (weakly) n-universal measures by definition since it simplifies statements of some theorems. This should be contrasted with universal prediction à la Solomonoff for left-c.e. semimeasures where the universal element belongs to the class and is not computable [Reference Solomonoff50].
Computable almost surely universal measures exist if the alphabet $\mathbb {X}$ is finite. An important example of an almost surely universal and, as we will see in Section 3.5, also $1$ -universal measure is the Prediction by Partial Matching (PPM) measure [Reference Cleary and Witten10, Reference Ryabko44, Reference Ryabko47]. As we have mentioned, universal measures are closely related to the problem of universal coding (data compression), and more examples of universal measures can be constructed from universal codes, for instance those given in [Reference Charikar, Lehman, Lehman, Liu, Panigrahy, Prabhakaran, Sahai and Shelat8, Reference Dębowski14, Reference Kieffer and Yang31, Reference Ziv and Lempel56], using the normalization by Ryabko [Reference Ryabko45]. This normalization is not completely straightforward, since we need to turn semi-measures into probability measures.
3.2 Universal prediction
Universal prediction is a problem similar to universal coding. In this problem, we also seek a single procedure that would be optimal within a class of probabilistic sources, but we apply a different loss function, namely, the error rate being the density of incorrect guesses of the next outcome given the previous ones. In spite of this difference, we will try to state the problem of universal prediction analogously to universal coding. A predictor is an arbitrary total function $f:\mathbb {X}^*\rightarrow \mathbb {X}$ . The predictor induced by a probability measure P will be defined as
where $\operatorname *{\mbox {arg max}}_{x\in \mathbb {X}} g(x):=\min \left \{ a\in \mathbb {X}: g(a)\ge g(x) \text { for all }x\in \mathbb {X} \right \}$ for the total order $a_1<\cdots <a_D$ on $\mathbb {X}=\left \{ a_1,\dots ,a_D \right \}$ . Moreover, for a stationary measure P, we define the unpredictability rate
It is natural to ask whether the unpredictability rate can be related to the entropy rate. The Fano inequality [Reference Fano19], a classical result of information theory, and its converse [Reference Dębowski16], both brought independently to computability theory by Fortnow and Lutz [Reference Fortnow and Lutz20], yield the following bound:
Theorem 3.4. For a stationary measure P over a D-element alphabet,
where $\eta (p):=-p\log p-(1-p)\log (1-p)$ .
Moreover, Fortnow and Lutz [Reference Fortnow and Lutz20] established some stronger inequalities, sandwich-bounding the unpredictability of an arbitrary sequence in terms of its effective dimension. The effective dimension turns out to be a generalization of the entropy rate to arbitrary sequences [Reference Hoyrup27], which are not necessarily random with respect to stationary ergodic measures.
In the less general framework of stationary ergodic measures, using the Azuma theorem, we can show that no predictor can beat the induced predictor and that the error rate committed by the latter equals the unpredictability rate $u_{P}$ . The following theorem concerning error rates effectivizes a well-known almost sure statement (the proof in the almost sure setting is available in [Reference Algoet1]).
Theorem 3.5 Effective source prediction
For any stationary ergodic measure P and any s-computable predictor f, on $1$ - $(s,P)$ -random points we have
Moreover, if the induced predictor $f_{\kern-1pt P}$ is s-computable then (28) holds with the equality for $f=f_{\kern-1pt P}$ .
Proof Let measure P be stationary ergodic. In view of Theorem 2.17 (effective Azuma theorem), for any s-computable predictor f, on $1$ - $(s,P)$ -random points we have
Moreover, we have
Subsequently, we observe that limits $\lim _{n\to \infty } P(x_0|X_{-n}^{-1})$ exist on $1$ - $(s,P)$ -random points by Proposition 2.14 (effective Lévy law). Thus by Proposition 2.16 (effective Breiman ergodic theorem) and the dominated convergence, on $1$ - $(s,P)$ -random points we obtain
Hence inequality (28) follows by (29)–(31). Similarly, the equality in (28) for $f=f_{\kern-1pt P}$ follows by noticing that inequality (30) turns into an equality in this case.
In the almost sure setting, relationship (28) holds P-almost surely for any stationary ergodic measure P and any (not necessarily computable) predictor f.
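The content of Theorem 3.5 can also be illustrated empirically. The sketch below (ours, with an arbitrarily chosen transition matrix) simulates a two-state Markov source, for which $P(X_{n+1}|X_1^n)=P(X_{n+1}|X_n)$ , and compares the error rate of the induced predictor with that of a constant predictor; the former should not exceed the latter.

```python
import random

# Arbitrary ergodic Markov chain on {0, 1}: trans[a][b] = P(X_{n+1}=b | X_n=a).
trans = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}

def induced_predictor(prev):
    """Predict the least symbol of maximal conditional probability given X_n = prev."""
    row = trans[prev]
    m = max(row.values())
    return min(b for b, p in row.items() if p == m)

n, state = 200_000, 0
errors_induced = errors_constant = 0
for _ in range(n):
    guess_induced, guess_constant = induced_predictor(state), 0
    state = 0 if random.random() < trans[state][0] else 1   # draw the next symbol
    errors_induced += (guess_induced != state)
    errors_constant += (guess_constant != state)

print(errors_induced / n, errors_constant / n)  # the induced error rate should be the smaller one
```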
We can see that there may be a problem with the effectivization of relationship (28), caused by the induced predictor $f_{\kern-1pt P}$ possibly not being s-computable for certain representations s of measure P—since testing the equality of two real numbers cannot, in general, be done in finite time. However, the probabilities $P(X_{i+1}\neq f_{\kern-1pt P}(X_1^i)|X_1^i)$ are always s-computable. Thus, we can try to define universal predictors in the following way.
Definition 3.6 Universal predictor
A computable predictor f is called (weakly) n-universal if for any stationary ergodic probability measure P, on (weakly) n-P-random points we have
In the almost sure setting, we say that a predictor f is almost surely universal if (32) holds P-almost surely for any stationary ergodic probability measure P. Almost surely universal predictors exist if the alphabet $\mathbb {X}$ is finite [Reference Algoet1, Reference Bailey4, Reference Györfi, Lugosi, Dror, L’Ecuyer and Szidarovszky24, Reference Györfi, Lugosi and Morvai25, Reference Ornstein40]. In [Reference Steifer51] it was proved that the almost sure predictor by [Reference Györfi, Lugosi and Morvai25] is also $1$ -universal.
3.3 Predictors induced by backward estimators
The almost surely universal predictors by [Reference Algoet1, Reference Bailey4, Reference Györfi, Lugosi, Dror, L’Ecuyer and Szidarovszky24, Reference Györfi, Lugosi and Morvai25, Reference Ornstein40] were constructed without reference to universal measures. Nevertheless, these constructions are all based on estimation of conditional probabilities. For a stationary ergodic process one can consider two separate problems: backward and forward estimation. The first problem is naturally connected to prediction. We want to estimate the conditional probability of the $(n+1)$ -th symbol given the first n symbols. Is it possible that our estimates converge to the true value as we increase n? To be precise, we ask whether there exists a probability measure R such that for every stationary ergodic measure P we have P-almost surely
It was shown by Bailey [Reference Bailey4] that this is not possible. As we are about to see, we can get something a bit weaker, namely, convergence of Cesàro averages. But to get there, it will be helpful to consider a slightly different problem.
Suppose again that we want to estimate a conditional probability, but now the symbol that we are interested in is fixed and we are looking further and further into the past. In this scenario, we want to estimate the conditional probability $P(x_0|X_{-\infty }^{-1})$ and we ask whether increasing the knowledge of the past can help us achieve a perfect estimate. Precisely, we ask if there exists a probability measure R such that for every stationary ergodic measure P we have P-almost surely
It was famously shown by Ornstein that such estimators exist. (Ornstein proved this for binary-valued processes but the technique can be generalized to finite-valued processes.)
Theorem 3.7 Ornstein theorem [Reference Ornstein40]
Let the alphabet be finite. There exists a computable measure R such that for every stationary ergodic measure P we have P-almost surely that
Definition 3.8. We call a measure R an almost surely universal backward estimator when it satisfies condition (35) P-almost surely for every stationary ergodic measure P, whereas it is called a (weakly) n-universal backward estimator if R is computable and convergence (35) holds on all respective (weakly) n-P-random points.
One can come up with a naive idea: What if we take a universal backward estimator and use it in a forward fashion? Surprisingly, this simple trick gives us almost everything we can get, i.e., a forward estimator that converges to the conditional probability on average. Bailey [Reference Bailey4] showed that for an almost surely universal backward estimator R and for every stationary ergodic measure P we have P-almost surely
The proof of this fact is a direct application of the Breiman ergodic theorem. Since we have a stronger effective version of the Breiman theorem (Proposition 2.16), we can strengthen Bailey’s result to an effective version as well. It turns out that even if we take a backward estimator that is good only almost surely (possibly failing on some random points), the respective result for forward estimation still holds in the strong sense—on every $1$ -P-random point.
Theorem 3.9 Effective Bailey theorem
Let R be a computable almost surely universal backward estimator. For every stationary ergodic measure P, on $1$ -P-random points we have (36).
Proof Let R be a computable almost surely universal backward estimator. Fix an $x\in \mathbb {X}$ . By Proposition 2.14 (effective Lévy law), for every stationary ergodic probability measure P we have P-almost surely
Note that the bound $0\le \left | R(x|X^{-1}_{-n})-P(x|X_{-n}^{-1}) \right |\le 1$ holds uniformly. Moreover, the variables $R(x|X^{-1}_{-n})-P(x|X_{-n}^{-1})$ are uniformly s-computable for any representation s of P. Hence, we can apply Proposition 2.16 (effective Breiman ergodic theorem) to obtain
on $1$ -P-random points. The claim follows immediately.
Definition 3.10. We call a measure R an almost surely universal forward estimator when it satisfies condition (36) P-almost surely for every stationary ergodic measure P, whereas it is called a (weakly) n-universal forward estimator if R is computable and convergence (36) holds on all respective (weakly) n-P-random points.
One can expect that the predictor $f_R$ induced by a universal forward estimator R in the sense of Definition 3.10 is also universal in the sense of Definition 3.6. This is indeed true. To show this fact, we will first prove a certain inequality for induced predictors, which generalizes the result from [Reference Devroye, Györfi and Lugosi17, Theorem 2.2] for binary classifiers. This particular observation seems to be new.
Proposition 3.11 Prediction inequality
Let p and q be two probability distributions over a countable alphabet $\mathbb {X}$ . For $x_p=\operatorname *{\mbox {arg max}}_{x\in \mathbb {X}} p(x)$ and $x_q=\operatorname *{\mbox {arg max}}_{x\in \mathbb {X}} q(x)$ , we have inequality
Proof Without loss of generality, assume $x_p\neq x_q$ . By the definition of $x_p$ and $x_q$ , we have $p(x_p)-p(x_q)\ge 0$ and $q(x_q)-q(x_p)\ge 0$ . Hence we obtain
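For a quick numerical sanity check of Proposition 3.11 (our illustration), assume that the claimed inequality reads $p(x_p)-p(x_q)\le \sum _{x\in \mathbb {X}}\left | p(x)-q(x) \right |$ , which is consistent with the derivation sketched above; the Python sketch below verifies this form on randomly drawn distributions, using the paper's least-symbol tie-breaking.

```python
import random

def argmax_least(dist):
    """Least symbol attaining the maximal probability (tie-breaking as in the paper)."""
    m = max(dist.values())
    return min(s for s, p in dist.items() if p == m)

def random_dist(symbols):
    w = [random.random() for _ in symbols]
    t = sum(w)
    return {s: v / t for s, v in zip(symbols, w)}

symbols = ["a", "b", "c", "d"]
for _ in range(10_000):
    p, q = random_dist(symbols), random_dist(symbols)
    x_p, x_q = argmax_least(p), argmax_least(q)
    excess_error = p[x_p] - p[x_q]          # extra error of predicting x_q instead of x_p under p
    total_variation_sum = sum(abs(p[s] - q[s]) for s in symbols)
    assert excess_error <= total_variation_sum + 1e-12
print("inequality held on all sampled pairs")
```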
Now we can show a general result about universal predictors induced by forward estimators of conditional probabilities.
Theorem 3.12 Effective induced prediction I
For a $1$ -universal forward estimator R, the induced predictor $f_R$ is $1$ -universal if $f_R$ is computable.
Proof Let R be $1$ -universal forward estimator. By the definition, for every stationary ergodic measure P and all $1$ -P-random points
Consequently, combining this with Proposition 3.11 (prediction inequality) yields on $1$ -P-random points
Now, we notice that by (29), we have on $1$ -P-random points
Combining the three above observations completes the proof.
Interestingly, it suffices for a measure to be a computable almost surely universal backward estimator to yield a $1$ -universal forward estimator and, consequently, a $1$ -universal predictor. In contrast, we can easily see that a computable almost surely universal forward estimator does not necessarily induce a $1$ -universal predictor.
Theorem 3.13. There exists a computable almost surely universal forward estimator R such that the induced predictor $f_{R}$ is not $1$ -universal.
Proof Let us take $\mathbb {X}=\left \{ 0,1 \right \}$ and restrict ourselves to the one-sided space $\mathbb {X}^{\mathbb {N}}$ without loss of generality. Fix a computable almost surely universal forward estimator Q. Let $P_0$ be the computable measure of a Bernoulli($\theta $ ) process, i.e., $P_0(x_1^n)=\prod _{i=1}^n\theta ^{x_i}(1-\theta )^{1-x_i}$ , where $\theta>1/2$ is rational. Observe that by Proposition 2.8 there exists a point $y\in \mathbb {X}^{\mathbb {N}}$ which is $1$ - $P_0$ -random and a computable function $g:\mathbb {X}^*\rightarrow \left \{ 0,1 \right \}$ such that $P_0(A)=0$ and $A=\left \{ y \right \}$ for the event
In other words, there is a computable method to single out some $1$ - $P_0$ -random point y out of the set of sequences $\mathbb {X}^{\mathbb {N}}$ . In particular, we can use function g to spoil measure Q on that point y while preserving the property of an almost surely universal forward estimator. We will denote the spoilt version of measure Q by R. Conditional distributions $R(X_{m+1}|X_1^m)$ will differ from $Q(X_{m+1}|X_1^m)$ for infinitely many m on point y and for finitely many m elsewhere.
Let $K(x_1^n):=\#\left \{ i\le n:g(x_1^i)=1 \right \}$ . The construction of measure R proceeds by induction on the string length together with an auxiliary counter U. We let $R(x_1):=Q(x_1)$ and $U(x_1):=0$ . Suppose that $R(x_1^n)$ and $U(x_1^n)$ are defined but $R(x_1^{n+1})$ is not. If $U(x_1^n)\ge K(x_1^n)$ then we put $R(x_{n+1}|x_1^n):=Q(x_{n+1}|x_1^n)$ and $U(x_1^{n+1}):=U(x_1^n)$ . Else, if $U(x_1^n)< K(x_1^n)$ then we put $R(x_{n+1}^{n+N}|x_1^n):=\prod _{i=n+1}^{n+N}\theta ^{1-x_i}(1-\theta )^{x_i}$ (reversed compared to the definition of $P_0$ !) and $U(x_1^{n+N}):=K(x_1^n)$ , where N is the smallest number such that
Such a number N exists since $P_0(X_{i+1}\neq f_R(x_1^{i})|X_1^i=x_1^{i})> 1-\theta $ . This completes the construction of R.
The sets of $1$ -P-random sequences are disjoint for distinct stationary ergodic P by Proposition 2.15 (effective Birkhoff ergodic theorem). Hence $K(X_1^n)$ is bounded P-almost surely for any stationary ergodic P. Consequently, since $U(X_1^n)$ is non-decreasing, P-almost surely there exists a random number $M<\infty $ such that for all $m>M$ we have $R(X_{m+1}|X_1^m)=Q(X_{m+1}|X_1^m)$ . Hence R inherits the property of an almost surely universal forward estimator from Q.
Now let us inspect what happens on y. Since $K(X_1^n)$ is unbounded on y, by the construction of R we obtain that on y the inequality $U(X_1^n)<K(X_1^n)$ holds infinitely often and
Hence predictor $f_{R}$ is not $1$ -universal.
3.4 Predictors induced by universal measures
Following the work of Ryabko [Reference Ryabko45] (see also [Reference Ryabko, Astola and Malyutov46]), we can ask the natural question whether predictors induced by some universal measures in the sense of Definition 3.3, such as the PPM measure [Reference Cleary and Witten10, Reference Ryabko44, Reference Ryabko47] to be discussed in Section 3.5, are also universal. Ryabko came close to demonstrating the analogous implication in the almost sure setting but did not provide a complete proof. He showed the following proposition:
Theorem 3.14 Theorem 3.3 in [Reference Ryabko44]
Let R be an almost surely universal measure and P be a stationary ergodic measure. We have that
At first glance, condition (48) may seem close to condition (36), i.e., the universal forward estimator, which—as we have shown in Theorem 3.12—implies universality of the induced predictor. However, this average-case result is too weak for our needs as we seek the almost-sure and effective version thereof. If we tried to derive universality of the induced predictor directly from (48), there would be two problems on the way (in the following, $Y_n\ge 0$ stands for the expression under the expectation): firstly, $\lim _{n\to \infty } \operatorname {\mathrm {\textbf {E}}} Y_n=0$ does not necessarily imply $\operatorname {\mathrm {\textbf {E}}}\lim _{n\to \infty } Y_n=0$ since the limit may not exist almost surely and, secondly, if $\operatorname {\mathrm {\textbf {E}}}\lim _{n\to \infty } Y_n=0$ then $\lim _{n\to \infty } Y_n=0$ holds almost surely but this equality may fail on some $1$ -random points.
In this section, we will show that every $1$ -universal measure satisfying the relatively mild condition (1), which holds for the PPM measure, is a $1$ -universal forward estimator and hence, in light of the previous section, induces a $1$ -universal predictor. We do not know yet whether this condition is necessary. We will circumvent Theorem 3.14 by applying Proposition 2.16 (effective Breiman ergodic theorem) and Theorem 2.17 (effective Azuma theorem). The first stage of our preparations consists of two statements which can be called the effective conditional SMB theorem and the effective conditional universality.
Proposition 3.15 Effective conditional SMB theorem
Let the alphabet be finite and let P be a stationary ergodic probability measure. On $1$ -P-random points we have
Proof Let us write the conditional entropy
We have $0\le W_i\le \log D$ with D being the cardinality of the alphabet. Moreover, by Proposition 2.14 (effective Lévy law), on $1$ -P-random points there exists the limit
Hence by Proposition 2.16 (effective Breiman ergodic theorem), on $1$ - $(s,P)$ -random points
since $\operatorname {\mathrm {\textbf {E}}}\left [ -\log P(X_0|X_{-\infty }^{-1}) \right ]=\lim _{n\to \infty }\operatorname {\mathrm {\textbf {E}}}\left [ -\log P(X_1^n) \right ]/n=h_{P}$ .
Proposition 3.16 Effective conditional universality
Let the alphabet be finite and let P be a stationary ergodic probability measure. If measure R is $1$ -universal and satisfies
then on $1$ -P-random points we have
Proof Let us write the conditional pointwise entropy $Z_i:=-\log R(X_{i+1}|X_1^i)$ . Now suppose that measure R is $1$ -universal and satisfies (53). Then by Theorem 2.17 (effective Azuma theorem), on $1$ -P-random points we obtain
which is the claim of Proposition 3.16.
In the second stage of our preparations, we recall the famous Pinsker inequality used by Ryabko [Reference Ryabko45] to prove Theorem 3.14.
Proposition 3.17 Pinsker inequality [Reference Csiszár and Körner11]
Let p and q be probability distributions over a countable alphabet $\mathbb {X}$ . We have
Now we can show the main result of this section, namely, that every universal measure which satisfies a mild condition induces a universal predictor.
Theorem 3.18 Effective induced prediction II
If measure R is $1$ -universal and satisfies (53) then it is a $1$ -universal forward estimator.
Proof Let R be a $1$ -universal measure satisfying (53) and let P be a stationary ergodic measure. By Propositions 3.15 (effective conditional SMB theorem) and 3.16 (effective conditional universality), on $1$ -P-random points we obtain
Hence by Proposition 3.17 (Pinsker inequality), we derive on $1$ -P-random points
Subsequently, the Cauchy–Schwarz inequality $\operatorname {\mathrm {\textbf {E}}} Y^2\ge (\operatorname {\mathrm {\textbf {E}}} Y)^2$ yields on $1$ -P-random points
Consequently, R is a $1$ -universal forward estimator.
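As a side note, the Pinsker step used above can be checked numerically. One standard form of the inequality, with the Kullback–Leibler divergence taken in bits, reads $\left ( \sum _{x}\left | p(x)-q(x) \right | \right )^2\le 2\ln 2\cdot \sum _{x}p(x)\log \frac {p(x)}{q(x)}$ ; the display in Proposition 3.17 may be normalized differently. The Python sketch below (ours) verifies this form on randomly drawn distributions.

```python
import math
import random

def random_dist(k):
    w = [random.random() + 1e-9 for _ in range(k)]
    t = sum(w)
    return [v / t for v in w]

for _ in range(10_000):
    p, q = random_dist(5), random_dist(5)
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    kl_bits = sum(a * math.log2(a / b) for a, b in zip(p, q))
    # Pinsker inequality (KL in bits): ||p - q||_1^2 <= 2 ln 2 * D(p || q).
    assert l1 ** 2 <= 2 * math.log(2) * kl_bits + 1e-9
print("Pinsker inequality held on all sampled pairs")
```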
Combining Theorems 3.18 and 3.12, we obtain that the predictor $f_R$ is $1$ -universal provided that measure R is $1$ -universal, satisfies condition (53), and the predictor $f_R$ is itself computable. Condition (53) does not seem to have been discussed in the literature on universal prediction.
3.5 PPM measure
In this section, we will discuss the Prediction by Partial Matching (PPM) measure. The PPM measure comes in several flavors and was discovered gradually. Cleary and Witten [Reference Cleary and Witten10] coined the name PPM, which we prefer since it is more distinctive, and considered the adaptive Markov approximations $\operatorname {\mathrm {PPM}}_k$ defined roughly in Equation (63). Later, Ryabko [Reference Ryabko44, Reference Ryabko47] considered the infinite series $\operatorname {\mathrm {PPM}}$ defined in Equation (64), called it the measure R, and proved that it is a universal measure. More precisely, Ryabko used the Krichevsky–Trofimov smoothing ($+1/2$) rather than the Laplace smoothing ($+1$) applied in (63). This difference does not affect universality. As we will show now, the series $\operatorname {\mathrm {PPM}}$ provides an example of a $1$ -universal measure that satisfies condition (53) and thus yields a natural $1$ -universal predictor.
Upon first reading, the definition of the PPM measure may appear cumbersome, but it is roughly a Bayesian mixture of all Markov chains of all orders. Its universality can then be motivated by the fact that Markov chains with rational transition probabilities are both countable and dense in the class of stationary ergodic measures [Reference Ryabko48]. Our specific definition of the $\operatorname {\mathrm {PPM}}$ measure is as follows.
Definition 3.19 PPM measure
Let the alphabet be $\mathbb {X}=\left \{ a_1,\ldots ,a_D \right \}$ , where $D\ge 2$ .
Define the frequency of a substring $w_1^k$ in a string $x_1^n$ as
Adapting the definitions by [Reference Cleary and Witten10, Reference Dębowski15, Reference Ryabko44, Reference Ryabko47], the PPM measure of order $k\ge 0$ is defined as
Subsequently, we define the total PPM measure
The infinite series (64) is computable since $\operatorname {\mathrm {PPM}}_k(x_1^n)=D^{-n}$ for $k\ge n-1$ . The almost sure universality of the total PPM measure follows by the Stirling approximation and the Birkhoff ergodic theorem (see [Reference Dębowski15, Reference Ryabko44, Reference Ryabko47]). Since the Birkhoff ergodic theorem can be effectivized for $1$ -random points in the form of Proposition 2.15, we obtain in turn this effectivization.
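The construction can also be illustrated in code. The Python sketch below (ours) implements one common realization of the scheme just described: an adaptive order-k Markov approximation with Laplace ($+1$) smoothing, mixed over all orders with weights $2^{-k-1}$ . These particular smoothing constants, weights, and the toy binary alphabet are our illustrative choices; Equations (63) and (64) fix the exact form used in the paper.

```python
from fractions import Fraction

ALPHABET = "ab"          # a toy alphabet; D = 2
D = len(ALPHABET)

def ppm_k(x, k):
    """Adaptive order-k Markov measure of the string x with Laplace (+1) smoothing.

    The first k symbols get probability 1/D each; afterwards the conditional
    probability of x[i] given the preceding k symbols is estimated from the
    counts seen so far in x[:i].
    """
    prob = Fraction(1)
    context_counts, full_counts = {}, {}
    for i, symbol in enumerate(x):
        if i < k:
            prob *= Fraction(1, D)
        else:
            ctx = x[i - k:i]
            numer = full_counts.get(ctx + symbol, 0) + 1
            denom = context_counts.get(ctx, 0) + D
            prob *= Fraction(numer, denom)
            full_counts[ctx + symbol] = full_counts.get(ctx + symbol, 0) + 1
            context_counts[ctx] = context_counts.get(ctx, 0) + 1
    return prob

def ppm(x):
    """Mixture over all orders; orders k >= len(x) - 1 contribute the uniform D^{-n} term."""
    n = len(x)
    total = sum(Fraction(1, 2 ** (k + 1)) * ppm_k(x, k) for k in range(n - 1))
    total += Fraction(1, 2 ** (n - 1)) * Fraction(1, D ** n)   # tail of the series
    return total

x = "abaababaabab"
print(float(ppm(x)), float(Fraction(1, D ** len(x))))  # on this fairly regular string, PPM exceeds D^{-n}
```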
Theorem 3.20 Effective PPM universality; cf. [Reference Ryabko44]
The $\operatorname {\mathrm {PPM}}$ measure is $1$ -universal.
Proof As we have mentioned, computability of the PPM measure follows since series (64) can be truncated with the constant term $\operatorname {\mathrm {PPM}}_k(x_1^n)=D^{-n}$ for $k\ge n-1$ and thus values $\operatorname {\mathrm {PPM}}(x_1^n)$ are rational.
To show $1$ -universality of the PPM measure, we first observe that
In contrast, the empirical (conditional) entropy of string $x_1^n$ of order $k\ge 0$ is defined as
Using the Stirling approximation for the factorial function, the PPM measure of order $k\ge 0$ can be related to the empirical entropy. In particular, by Theorem A4 in [Reference Dębowski15], we have the bound
Subsequently, by Proposition 2.15 (effective Birkhoff ergodic theorem), on $1$ -P-random points we have
Then by (67),
Since
then
on $1$ -P-random points, whereas the reverse inequality for the lower limit follows by Proposition 2.10 (effective Barron lemma) and Theorem 3.1 (effective SMB theorem).
Finally, we can show that the predictor $f_{\operatorname {\mathrm {PPM}}}$ induced by the PPM measure is $1$ -universal. First, let us state the following bounds explicitly:
Theorem 3.21 PPM bounds
We have
Proof Observe that $\operatorname {\mathrm {PPM}}_k(x_1^n)=D^{-n}$ for $k\ge n-1$ . Hence by (70), we obtain claim (72). The derivation of claim (73) is slightly longer. First, by the definition of $\operatorname {\mathrm {PPM}}_k$ , we have
Now let us denote
We have $G\le n-1$ , since $\operatorname {\mathrm {PPM}}_k(x_1^n)=D^{-n}$ for $k\ge n-1$ . Moreover, we have a bound reverse to (70), namely
Combining the above with (70) yields
Now comes the main theorem.
Theorem 3.22. The predictor $f_{\operatorname {\mathrm {PPM}}}$ is $1$ -universal.
Proof Computability of the predictor $f_{\operatorname {\mathrm {PPM}}}$ follows since the values $\operatorname {\mathrm {PPM}}(x_1^n)$ are rational, so the least symbol among those having the maximal conditional probability can be computed in finite time. Consequently, the claim follows by Theorems 3.18, 3.20, and 3.21.
We think that $1$ -universality of the predictor $f_{\operatorname {\mathrm {PPM}}}$ is quite expected and intuitive. But as we can see, the PPM measure satisfies condition (53) with a large margin. It is an open question whether there are $1$ -universal measures for which the conditional probabilities $R(x_{n+1}|x_1^n)$ converge to zero much faster than for the PPM measure but which still induce $1$ -universal predictors. It would be interesting to find such measures. Perhaps they have some other desirable properties, also from a practical point of view.
Acknowledgments
The authors are grateful to the anonymous reviewers of unaccepted earlier conference versions of the paper, who provided very stimulating and encouraging feedback. Additional improvements to the paper were inspired by participants of the Kolmogorov seminar in Moscow at which this work was presented by the first author. Finally, the authors express their gratitude to Dariusz Kalociński for his comments and proofreading. Both authors declare an equal contribution to the paper. This work was supported by the National Science Centre Poland grant no. 2018/31/B/HS1/04018.