1 Introduction
The notion of randomness is at the very core of fundamental ideas of philosophy and science. As such it comes with its own package of puzzles and enigmas. For example, suppose you are faced with some experimental data and you want to learn about the underlying phenomenon—is it deterministic (say, we observe the infinite sequence of zeros $0,0,\ldots $ ) or is it random (e.g., the outcomes of a fair and unbiased coin tossing)? Does it even make any sense to say that an individual object (i.e., an infinite sequence of bits) is random?
Computability theory gives us some tools to deal with this problem. For example, we could say that the sequence is random if we cannot predict it well enough using any effective procedure, or we could argue that the random sequences are exactly those that are incompressible. Algorithmic randomness theory studies various answers formulated exactly from that point of view. It is now one of the most active and fruitful branches of modern computability theory, drawing the attention of researchers from mathematical logic as well as from the foundations of probability theory. The cornerstone of this theory is the notion of randomness proposed in 1966 by Martin-Löf [Reference Martin-Löf15]. Roughly speaking, a sequence is random in the Martin-Löf sense if it does not have any effectively rare property, i.e., a property of measure zero that could be tested in a sufficiently effective way. Here, the effectiveness is explicated by means of computability. Since then, many other notions of randomness have been introduced and studied, constituting an infinite hierarchy of concepts. Furthermore, it was soon observed that the same notions of randomness may be characterized using independent paradigms such as compressibility and betting strategies.
As we have already noted, computability theory provides some perspective on what is an effective procedure and what is not. As it happens, the notion of effectiveness is relevant to other areas of philosophical and scientific investigations. Consider the problem of learning. Can we apply the perspective of computability here as well? Gold [Reference Gold11] and Putnam [Reference Putnam21] thought so. Again, let us see an example. We task a student (often called an agent or a learner) with a learning problem such as: are all dogs green? We supply the student with data and examples, in this case, examples of dogs. Such data may be represented mathematically, e.g., by an infinite binary sequence (one means a green dog, zero means a different color). Each time a new example is given, the student makes a guess—yes or no. One of the answers is the correct one, the one we want to hear. We could expect the student to stop making mistakes at some point. Is there a computable method which, if followed, leads to such an outcome? Now, the dogs are easy (say yes as long as you see only green dogs) but of course, it is not hard to come up with more difficult tasks. This framework—called algorithmic learning theory—may serve as a model for various scenarios, e.g., binary classification problems or the choice of a true physical theory. Sometimes, it may be impossible to stop making mistakes at all. In such a case, a liberal teacher may come up with weaker criteria of success (such as giving the correct answer infinitely many times).
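To make the guessing scenario concrete, here is a toy Python sketch (an illustration added for this presentation, not part of the original argument) of the strategy just described for the question about green dogs:

def green_dog_learner(prefix):
    # answer "yes" (1) to "are all dogs green?" as long as every example seen so far
    # is a green dog (bit 1); switch to "no" (0) forever after the first counterexample
    return 1 if all(bit == 1 for bit in prefix) else 0

data = [1, 1, 1, 0, 1, 1]  # the fourth observed dog is not green
guesses = [green_dog_learner(data[:n + 1]) for n in range(len(data))]
print(guesses)  # [1, 1, 1, 0, 0, 0]: after finitely many mistakes the answers stabilize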
Another task that fits well in the learning-theoretic framework is that of detection of rare properties, i.e., deciding whether a given binary sequence belongs to some set of measure zero. This is basically something we would expect of randomness, namely, that a set of random outcomes does not have any rare properties that could be recognized in an effective way. This connection between algorithmic randomness and algorithmic learning was explored by Osherson and Weinstein [Reference Osherson and Weinstein18]. They provided new characterizations of the classes of weakly 1-random and weakly 2-random sequences, both of which are readily interpretable in terms of learning and recognition. In a more recent work, Zaffora Blando [Reference Zaffora Blando27] described slightly more involved characterizations of Martin-Löf randomness and Schnorr randomness.
All these definitions may be interpreted in the following manner—a sequence is algorithmically random if and only if no computable agent recognizes the sequence as possessing some rare property. The difference between these characterizations boils down to what criterion of success is assumed. For example, a sequence x is weakly 1-random if and only if there is no computable agent which gives the negative answer infinitely many times with probability one, yet gives the negative answer only finitely many times on prefixes of x.
This note consists of two parts. In the first, I investigate some criteria of success based on the asymptotic density of affirmative answers, answering a question asked by Zaffora Blando [Reference Zaffora Blando27]. Along the way, I give novel criteria corresponding to the notions of weak 1-randomness and weak 2-randomness.
In the second part, I argue that learning-theoretic characterizations of randomness may be reinterpreted in terms of the effectivization of probabilistic theorems (and vice versa). Suppose that $\mu $ is a computable probability measure on infinite sequences and that we have defined a notion of randomness with respect to $\mu $; the resulting class of $\mu $-random sequences has $\mu $-measure one. In modern probability theory, many results are stated in the following form: $\mu (\{\omega \in {2^{\mathbb {N}}}:\phi (\omega )\})=1$,
where $\phi $ is some formula—often stating a pointwise convergence. The above is usually stated as “$\phi (\omega )$ for $\mu $-almost every $\omega $.” In computable measure theory, we seek effective versions of such theorems, that is, we want to know whether $\phi (\omega )$ holds for every $\mu $-random $\omega $.
Roughly speaking, the difference between non-effective and effective theorems is somewhat similar to the difference between the sentences “all cats but one are black” and “Fluffy is the only non-black cat.” You may try to formulate effective theorems for various notions of randomness. For historical reasons, much of the attention has been given to Martin-Löf randomness. We already know effective versions of many textbook results, e.g., the law of the iterated logarithm [Reference Van Lambalgen26], Doob’s martingale convergence theorem [Reference Takahashi24] or even Birkhoff’s ergodic theorem [Reference Bienvenu, Day, Hoyrup, Mezhirov and Shen3, Reference Franklin, Greenberg, Miller and Ng9, Reference V’yugin25]. In some cases the standard proofs are already constructive and the effectivization follows after simple modifications, but this is not always the case. Moreover, negative results also exist. For instance, in the context of Solomonoff induction [Reference Solomonoff22, Reference Solomonoff23], Lattimore and Hutter [Reference Lattimore and Hutter14] discovered that no universal mixture (of lower semicomputable semimeasures) converges on all Martin-Löf random sequences. This result motivated Milovanov [Reference Milovanov16] to find a new universal induction method which does converge on all Martin-Löf random sequences. Finally, some attention has also been given to effectivization with respect to Schnorr randomness (e.g., [Reference Freer, Nies and Stephan10, Reference Pathak, Rojas and Simpson19]).
It is actually a folklore result that there exists a computable sequence of functions converging almost surely which fails to converge on some Martin-Löf random sequence. We can interpret this fact in the learning-theoretic framework. At the same time, we can straightforwardly translate the results formulated in the learning-theoretic context into statements about convergence on random sequences. My attention here will focus on the convergence in Cesàro averages.
2 Preliminaries
Before moving to the main results, we introduce some notational conventions and provide some preliminary definitions. The set of all finite words over the binary alphabet $\{0,1\}$ is denoted by ${2^{<\mathbb {N}}}$, while the set of all one-sided infinite sequences is denoted by ${2^{\mathbb {N}}}$. By convention, bits are indexed from $0$. Given a word or a sequence x, we let $x_i$ denote the $(i+1)$-th bit and, given $i\leq j$, we let $x_i^j$ denote the subword $x_ix_{i+1}\ldots x_j$ of x consisting of all the bits of x from $x_i$ to $x_j$. The empty word is denoted by $\square $. Given a word w, $|w|$ stands for its length. We write $x\preceq y$ to say that x is a prefix of y. We use $\#(A)$ to denote the cardinality of a set A.
2.1 Effective reals
Computably enumerable is abbreviated as c.e. A real r is called computable (lower semicomputable) if the left cut of r, i.e., $\{q\in \mathbb {Q}:q< r\}$, is computable (c.e.). A function $f: {2^{<\mathbb {N}}}\rightarrow \mathbb {R}$ is called computable (lower semicomputable) if its values $f(\sigma )$ are computable (lower semicomputable) uniformly in $\sigma \in {2^{<\mathbb {N}}}$. A real is upper semicomputable if its negative is lower semicomputable. In a similar manner, we can define $\Delta ^0_2$ reals as those for which the left cut is a $\Delta ^0_2$ set.
These reals have a natural characterization in terms of densities. We say that a sequence x has density r if $\lim _{n\to \infty }\frac {\#\{i<n:x_i=1\}}{n}=r$. We say that the density of x is undefined if this limit does not exist.
As it happens, $\Delta ^0_2$ reals are exactly the densities of computable sequences.
Theorem 2.1 [Reference Jockusch and Schupp12]
A real in the unit interval is $\Delta ^0_2$ if and only if it is the density of a computable sequence.
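As a small numerical illustration (added here; the sequence is just an example and is in fact computable, hence trivially of $\Delta ^0_2$ density), the prefix densities of a simple computable sequence converge to its density $1/3$:

def x(i):
    # a computable sequence: x_i = 1 exactly when i is divisible by 3
    return 1 if i % 3 == 0 else 0

def prefix_density(n):
    # fraction of ones among the first n bits
    return sum(x(i) for i in range(n)) / n

print([round(prefix_density(n), 4) for n in (10, 100, 10000)])  # tends to 1/3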
We can go further and define computability of functions from infinite sequences into reals. A function $f: {2^{\mathbb {N}}}\rightarrow \mathbb {R}$ is computable if there is a Turing functional $\Phi $ which, given oracle x, computes the left cut of $f(x)$.
2.2 Probability measures
We are dealing with the binary stochastic process $X=X_0,X_1,\ldots $. The symbol X is introduced to compress notation. For instance, given some formula $\phi $ and a measure $\mu $ we will often write $\mu (\phi (X))$ instead of $\mu (\{x\in {2^{\mathbb {N}}}:\phi (x)\})$. X obeys the same notational conventions as sequences, e.g., $X_j^i$ denotes the random variables $X_j,X_{j+1},\ldots ,X_i$ and so on. Special attention is given to the uniform measure $\lambda $ on the Cantor space of infinite binary sequences. This measure satisfies $\lambda (\{x\in {2^{\mathbb {N}}}:x_0^{|\sigma |-1}=\sigma \})=2^{-|\sigma |}$ for all nonempty $\sigma \in {2^{<\mathbb {N}}}$. Given a word $\sigma \in {2^{<\mathbb {N}}}$ we define the cylinder set $ {[\kern-1.7pt[ \sigma ]\kern-1.7pt] }$ as the set $\{x\in {2^{\mathbb {N}}}:x_0^{|\sigma |-1}=\sigma \}$. Similarly, if V is a set of words, then $ {[\kern-1.7pt[ V ]\kern-1.7pt] }=\cup _{\sigma \in V} {[\kern-1.7pt[ \sigma ]\kern-1.7pt] }$. Unless it is stated otherwise, $\mu $ denotes an arbitrary computable probability measure, i.e., a probability measure such that there exists a computable function $f: {2^{<\mathbb {N}}}\times \mathbb {N}\rightarrow \mathbb {Q}$ with $|f(\sigma ,n)-\mu ( {[\kern-1.7pt[ \sigma ]\kern-1.7pt] })|<2^{-n}$.
A measure $\mu $ is called continuous if $\mu (\{x\})=0$ for every $x\in {2^{\mathbb {N}}}$. The reader is encouraged to consult [Reference Billingsley5] for an introduction to modern measure-theoretic probability theory.
2.3 Learnable sequences
From the learning-theoretic perspective, $\Delta ^0_2$ sequences from the arithmetical hierarchy are of special interest. These sequences are sometimes called learnable—this name is justified by the following results.
Theorem 2.2 [Reference Gold11], [Reference Putnam21]
A sequence $x\in {2^{\mathbb {N}}}$ is $\Delta ^0_2$ iff there exists a total computable $g:\mathbb {N}^2\rightarrow \{0,1\}$ such that for all $i\in \mathbb {N}$ we have $\lim _{t\to \infty }g(i,t)=x_i$.
We can interpret the function g as a learner which makes guesses about the true value of $x_i$. A sequence is learnable if, at some point, the answers stabilize on the correct values. However, we might also give a slightly different learning-theoretic characterization. In the second scenario, which will be explained in detail in Section 2.5, the learner reads fragments of a sequence and tries to find out whether the sequence has some property.
Proposition 2.3 (folklore?). For every $x\in {2^{\mathbb {N}}}$ which is $\Delta ^0_2$, there exists a computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that $\#\{i:f(x_0^i)=1\}$ is infinite, while $\#\{i:f(y_0^i)=1\}$ is finite for every $y\neq x$.
Proof. Let $x\in {2^{\mathbb {N}}}$ be $\Delta ^0_2$. By Theorem 2.2 there exists a computable function $g:\mathbb {N}^2\rightarrow \{0,1\}$ such that for all $i\in \mathbb {N}$ we have $\lim _{t\to \infty }g(i,t)=1$ iff $x_i=1$ and $\lim _{t\to \infty }g(i,t)=0$ iff $x_i=0$. We define f by induction. We also define an auxiliary function u. Let $f(\square )=0$ and $u(\square )=0$. Suppose that for some $\sigma $ we have already defined $f(\sigma )$ and $u(\sigma )$ and we want to define $f(\sigma b)$ (with $b\in \{0,1\}$). Consider the word $w=g(0,|\sigma |)g(1,|\sigma |)\ldots g(|\sigma |,|\sigma |)$. Let k be the length of the longest prefix of w which is also a prefix of $\sigma b$. If $k>u(\sigma )$, let $f(\sigma b)=1$. Otherwise, we let $f(\sigma b)=0$. Finally, set $u(\sigma b)= \max (\{k,u(\sigma )\})$. It remains to observe that for each n, there exists m such that $g(a,b)=x_a$ for all $a<n$ and $b>m$. Hence, along the prefixes of x the agreement length k exceeds the running maximum u infinitely often, so f answers $1$ on infinitely many prefixes of x. On the other hand, if $y\neq x$ and d is the first position where y differs from x, then on all sufficiently long prefixes of y the agreement length k is at most d, so f answers $1$ on only finitely many prefixes of y.
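The construction above is effective and can be rendered directly in code. The following Python sketch (an illustration; the guesser g and the target sequence are toy stand-ins for an actual $\Delta ^0_2$ sequence) implements the functions f and u and shows that f answers 1 on infinitely many prefixes of the target and on only finitely many prefixes of another sequence:

TARGET = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0] * 10  # toy target sequence (computable, hence Delta^0_2)

def g(i, t):
    # toy limit-computable guesser: wrong while t <= i, correct afterwards,
    # so the limit of g(i, t) over t is TARGET[i]
    return (1 - TARGET[i]) if t <= i else TARGET[i]

def longest_common_prefix(u, v):
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k

def f_and_u(word):
    # returns (f(word), u(word)) following the inductive definition in the proof
    if len(word) == 0:
        return 0, 0
    sigma = word[:-1]
    _, u_sigma = f_and_u(sigma)
    w = [g(i, len(sigma)) for i in range(len(sigma) + 1)]  # current guess word
    k = longest_common_prefix(w, word)
    return (1 if k > u_sigma else 0), max(k, u_sigma)

def f(word):
    return f_and_u(word)[0]

print([n for n in range(1, 100) if f(TARGET[:n]) == 1])  # positive answers keep coming on prefixes of TARGET
other = [0] * 100
print([n for n in range(1, 100) if f(other[:n]) == 1])   # only finitely many positive answers elsewhere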
2.4 Martin-Löf randomness
Several equivalent definitions of Martin-Löf randomness—referred to as 1-randomness here—are now known. We start with the definition by tests. A reader interested in learning more about algorithmic randomness theory is referred to [Reference Downey and Hirschfeldt8].
Definition 2.4. A collection $U_0,U_1,\ldots $ of sets of sequences is uniformly c.e. if and only if there is a collection $V_0,V_1,\ldots \subset {2^{<\mathbb {N}}}$ such that $U_i= {[\kern-1.7pt[ V_i ]\kern-1.7pt] }$ for every $i\in \mathbb {N}$ and $V_0,V_1,\ldots $ are uniformly c.e.
Definition 2.5 (Martin-Löf $\mu $ -test)
A uniformly c.e. sequence $U_0,U_1,\ldots $ of sets of sequences is called a Martin-Löf $\mu $ -test if there exists a computable f such that $\lim _{n\to \infty }f(n)=0$ and $\mu (U_n)\leq f(n)$ for every $n\in \mathbb {N}$ .
Definition 2.6 (Martin-Löf $\mu $ -randomness)
A sequence $x\in {2^{\mathbb {N}}}$ is called 1-random with respect to $\mu $ (or 1- $\mu $-random) if there is no Martin-Löf $\mu $-test $U_0,U_1,\ldots $ such that $x\in \bigcap _{n\in \mathbb {N}}U_n$.
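For a toy example of Definitions 2.4–2.6 (added for illustration), take $V_n=\{0^{n+1}\}$; the resulting test witnesses that the all-zeros sequence is not 1-random with respect to $\lambda $. A quick numerical check in Python:

def V(n):
    # the n-th level of the test: the single word consisting of n+1 zeros
    return {"0" * (n + 1)}

def lam(prefix_free_words):
    # lambda-measure of the union of cylinders generated by a prefix-free set of words
    return sum(2.0 ** (-len(w)) for w in prefix_free_words)

print([lam(V(n)) for n in range(5)])  # [0.5, 0.25, 0.125, 0.0625, 0.03125], bounded by 2^-n
# the only sequence lying in every U_n = [[V_n]] is 000..., so it fails this test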
When dealing with sequences random with respect to some arbitrary computable measure $\mu $ , we will usually refer to these simply as 1-random sequences.
The following is a folklore result.
Proposition 2.7 (folklore). There exists a $\Delta^0_2\ \lambda $ -random sequence.
As in the case of the arithmetical hierarchy, we can define a hierarchy of complexities of classes. A set $C\subseteq {2^{\mathbb {N}}}$ is called a $\Sigma ^0_n$ class if there exists a computable relation R such that for all $x\in {2^{\mathbb {N}}}$ we have $x\in C$ if and only if $\exists {i_1}\forall {i_2}\ldots \exists i_n R(x_0^{i_1},x_0^{i_2},\ldots ,x_0^{i_n})$ when n is odd and $\exists {i_1}\forall {i_2}\ldots \forall i_n R(x_0^{i_1},x_0^{i_2},\ldots ,x_0^{i_n})$ when n is even. Now, it is possible to give the definition of weak n-randomness.
Definition 2.8 (weak n-randomness). A sequence is called weakly n- $\mu $ -random if it is contained in every $\Sigma ^0_n$ class of $\mu $ -measure one.
As in the case of 1-randomness, if we are dealing with an arbitrary computable measure $\mu $ we omit $\mu $ when referring to weak n- $\mu $ -randomness.
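For a concrete instance of Definition 2.8 (an illustration added here, not taken from the original text): for any fixed word $w\in {2^{<\mathbb {N}}}$, the class $\{x\in {2^{\mathbb {N}}}:\exists n\ x_n^{n+|w|-1}=w\}$ is a $\Sigma ^0_1$ class of $\lambda $-measure one, so every weakly 1-$\lambda $-random sequence contains every finite word as a subword.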
2.5 Learning and randomness
A learning-theoretic characterization of two notions of weak n-randomness was discovered by Osherson and Weinstein. The function f in the following statements formalizes the notion of a computable agent—also called a learner—who tries to learn from the prefixes of a sequence whether the sequence possesses some rare property or not. The agent reads the bits of a sequence and gives a positive or a negative answer after reading each bit. A positive answer is interpreted as a sign of belief that the given sequence manifests a certain pattern or property we want to detect. The procedure is constrained by the requirement of computability.
It is assumed that a purely random sequence should not have any non-trivial rare properties that could be detected by such a learner. What remains is the choice of a criterion of success for such an agent. One idea is to ask for only finitely many negative answers. By Theorem 2.9, this criterion may be used to define the class of weakly $1$-random sequences.
Theorem 2.9 (Osherson-Weinstein [Reference Osherson and Weinstein18])
A sequence x is weakly 1-random if and only if there is no computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
A weaker criterion—by Theorem 2.10 corresponding to weak 2-randomness—is given by the requirement of infinitely many positive answers.
Theorem 2.10 (Osherson-Weinstein [Reference Osherson and Weinstein18])
A sequence x is weakly 2-random if and only if there is no computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
Finally, a recent theorem by Zaffora Blando [Reference Zaffora Blando27] gives a learning-theoretic characterization of Martin-Löf randomness.
Theorem 2.11 (Zaffora Blando [Reference Zaffora Blando27])
A sequence x is 1-random if and only if there is no computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and for all $n\in \mathbb {N}$
3 Density of answers
The learning-theoretic characterizations of Martin-Löf randomness and Schnorr randomness were obtained by Zaffora Blando [Reference Zaffora Blando27] by taking a notion of success present in the Osherson–Weinstein characterization of weak 2-randomness (infinitely many positive answers) and tweaking the measure-theoretic condition in the definition. Naturally, one may wonder whether a similar goal could be achieved by tweaking the success notion instead. To this end, Zaffora Blando [Reference Zaffora Blando27] asked about the notions of randomness arising when we enrich the learning-theoretic approach with conditions imposed on the density of positive answers. In particular, she formulated the following problem.
Problem 3.12. Consider a class $\mathcal {L}\mathcal {D}$ of all sequences $x\in {2^{\mathbb {N}}}$ such that there is no computable function g satisfying both of the following:
and
Does $\mathcal {LD}$ correspond to any known notion of algorithmic randomness?
Such a notion of success has a natural interpretation, namely, we allow the learner to make mistakes, but if we look at the average answer, it approaches one. In other words, as time passes, the frequency of negative answers becomes negligible.
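As an illustration of this success criterion (a synthetic sketch, not one of the learners from the cited results), consider a hypothetical learner that answers negatively only on prefixes whose length is a power of two; it gives infinitely many negative answers, yet the density of its positive answers tends to one:

def positive_answer_density(f, x, n):
    # fraction of positive answers of f on the first n prefixes of x
    return sum(f(x[: i + 1]) for i in range(n)) / n

f = lambda w: 0 if (len(w) & (len(w) - 1)) == 0 else 1  # negative exactly at lengths 1, 2, 4, 8, ...
x = [0, 1] * 500
print([round(positive_answer_density(f, x, n), 3) for n in (10, 100, 1000)])  # [0.6, 0.93, 0.99]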
An immediate corollary of Theorems 2.9 and 2.10 is that $\mathcal {L}\mathcal {D}$ is located between weak 2-randomness and weak 1-randomness. We are going to strengthen this by proving that $\mathcal {LD}$ is, in fact, equal to the class of weakly 2-random sequences.
Theorem 3.13. A sequence x is weakly 2-random if and only if there is no computable function $g: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
Proof. ( $\Leftarrow $ ) We prove this implication by contraposition. Suppose that x is not weakly 2-random. By Theorem 2.10, there is a computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
We will construct a computable function $g: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$. Let $g(\square )=1$. Suppose that for some w we have already defined $g(v)$ for all $v\preceq w$ but $g(\tau )$ is not yet defined for any proper extension $\tau $ of w. Let k be the number of times f gives the positive answer on some prefix of w, i.e.,
For all $\tau \in 2^{i}$ , where $i\leq k$ we let $g(w\tau )=1$ . This completes the construction of the computable function g. Now, observe that for every $y\in {2^{\mathbb {N}}}$
if and only if there are infinitely many n such that $f(y_0^n)=1$ . In fact, if there are exactly k indexes n such that $f(y_0^n)=1$ , then
Finally, we may also conclude that
( $\Rightarrow $ ) This implication follows immediately from Theorem 2.10.
Now, for completeness, we observe that a similar characterization—based on the density of positive answers—may be given for weak 1-randomness.
Theorem 3.14. A sequence $x\in {2^{\mathbb {N}}}$ is weakly 1-random if and only if there is no computable function $g: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
Proof. ( $\Rightarrow $ ) Fix $x\in {2^{\mathbb {N}}}$ . Suppose that there is a computable function $g: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
Consider $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that for every $w\in {2^{<\mathbb {N}}}$ we let $f(w)=1$ if and only if $\frac {\sum ^{|w|-1}_{i=0}g(w_0^i)}{|w|}>1/2$. Observe that if the ratio of positive answers given by g converges to $0$ on some sequence, then f gives the negative answer on infinitely many prefixes of that sequence. This happens with probability $1$. On the other hand, since the ratio of positive answers given by g on the prefixes of x converges to one, we have $\#\{i:f(x_0^i)=0\}<\infty $. Consequently, x is not weakly 1-random.
( $\Leftarrow $ ) Suppose that x is not weakly 1-random. By Theorem 2.9, there exists a computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
Let $k=\#\{i:f(x_0^i)=0\}$ . We construct a computable function g. Let $g(\square )=0$ and suppose that for some $w\in {2^{<\mathbb {N}}}$ we have already defined $g(v)$ for all proper prefixes v of w and we want to define $g(w)$ . Let $m=\#\{i<|w|:f(w_0^i)=0\}$ . We let $g(w)=1$ if and only if $m\leq k$ . Otherwise, set $g(w)=0$ .
Observe that for every sequence y if there are no more than k indexes i such that $f(y_0^i)=0$ then $g(y_0^j)=1$ for all indexes j. This is true for $y=x$ .
On the other hand, $g(X_0^j)=0$ for all but finitely many indexes j when $\#\{i:f(X_0^i)=0\}=\infty $ . This happens almost surely.
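For readability, the two constructions used in this proof can be rendered schematically in Python (an illustration only; learners are represented as functions on lists of bits, and k is the finite constant fixed in the proof):

def f_from_g(g):
    # forward direction: answer 1 iff the running average of g's answers exceeds 1/2
    def f(w):
        if len(w) == 0:
            return 0
        return 1 if sum(g(w[: i + 1]) for i in range(len(w))) / len(w) > 0.5 else 0
    return f

def g_from_f(f, k):
    # backward direction: answer 1 as long as f has given at most k negative answers so far
    def g(w):
        if len(w) == 0:
            return 0
        negatives = sum(1 - f(w[: i + 1]) for i in range(len(w)))
        return 1 if negatives <= k else 0
    return g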
As it happens, the values appearing as the limits in the last theorem (i.e., one and zero) may be replaced by arbitrary $\Delta ^0_2$ reals. Note that this result does not seem to have a straightforward learning-theoretic interpretation and is given as a technical curiosity.
Theorem 3.15. Let a and b be $\Delta ^0_2$ reals in the unit interval (with $a\neq b$ ). A sequence $x\in {2^{\mathbb {N}}}$ is weakly 1-random if and only if there is no computable function $g: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and
Proof. Fix two $\Delta ^0_2$ reals a and b in the unit interval (with $a\neq b$). By Theorem 2.1 there exist computable $u,v\in {2^{\mathbb {N}}}$ such that a is the density of u and b is the density of v. Suppose that x is not weakly $1$-random and let f be a computable function witnessing this (in the sense of Theorem 2.9). Let $k=\#\{i:f(x_0^i)=0\}$. We construct a computable function g. Let $g(\square )=u_0$ and suppose that for some $w\in {2^{<\mathbb {N}}}$ we have already defined $g(\sigma )$ for all proper prefixes $\sigma $ of w and we want to define $g(w)$. Let $m=\#\{i<|w|:f(w_0^i)=0\}$. Check if $m\leq k$. If so, let $g(w)=u_{|w|}$. Otherwise, set $g(w)=v_{|w|}$.
Observe that for every sequence $\omega $, if there are no more than k indexes i such that $f(\omega _0^i)=0$, then the density of $g(\omega _0^0)g(\omega _0^1)\ldots $ equals the density of u, i.e., it is equal to a. This happens for $\omega =x$. Otherwise, from some point on g copies the bits of v, so this density equals b; this happens with probability one. The implication in the other direction is analogous to the one in the proof of Theorem 3.14.
4 Convergence on random sequences
Combining Propositions 2.3 and 2.7 gives the following folklore result as a corollary.
Proposition 4.16 (folklore). There exists a 1- $\lambda$-random sequence x and a computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that $\#\{i:f(x_0^i)=1\}$ is infinite, while $\#\{i:f(y_0^i)=1\}$ is finite for every $y\neq x$.
In other words, there are random sequences which are uniquely recognizable by a computable agent, in a certain relaxed sense.
With that in mind, we turn our attention to the problem of convergence on random sequences mentioned in the introduction. We are interested in doing statistical inference based on a finite but increasing amount of data, i.e., we want to study functions which take prefixes of increasing length and output estimates of some parameter (e.g., the entropy rate). This may involve such tasks as hypothesis testing or inductive learning in the form of estimation of the conditional probabilities.
Think of a computable function g which converges to some random variable Y almost surely and on every random sequence, i.e., for every random sequence x we have
Take the function f from Proposition 4.16 and consider a function h defined by $h(w)=f(w)+g(w)$ for all $w\in {2^{<\mathbb {N}}}$ . It follows that h converges to Y almost surely but it fails to converge to Y on some $\lambda $ -random sequence. On the other hand, if Y is computable, then it is a folklore observation that the convergence of g to Y on every weakly 2-random sequence follows from the convergence with probability one. Indeed, we have:
Proposition 4.17 (folklore). Let $g: {2^{<\mathbb {N}}}\rightarrow \mathbb {R}^{\geq 0}$ be a computable function such that $\mu $ -almost surely
If Y is a computable random variable, then this happens on every weakly $2 $ -random sequence x.
Proof. This is a simple consequence of the fact that avoiding a computable limit with an error bounded from below by a rational is a $\Pi ^0_2$ property. To be precise, for every $i\in \mathbb {N}$ the following is a $\Pi ^0_2$ class:
By the assumption, this is a class of measure zero and so no weakly 2-random sequence belongs to it.
Now, suppose we have two computable functions $h_1,h_2: {2^{<\mathbb {N}}}\rightarrow \mathbb {Q}^{\geq 0}$ such that almost surely $\lim _{n\to \infty }h_1(X_0^n)=\lim _{n\to \infty }h_2(X_0^n)$ . Such a pair corresponds to an infinite family of computable learning functions $f_1,f_2,\ldots $ defined by
where $z\in {2^{\mathbb {N}}}$ is arbitrary. It can be immediately observed that each such function gives only finitely many positive answers on a weakly 2-random sequence (by Theorem 2.10).
Furthermore, Theorem 2.11 may be reinterpreted in the following form.
Theorem 4.18. A sequence $x\in {2^{\mathbb {N}}}$ is $1$ -random if and only if for every $m\in \mathbb {N}$ and any pair of computable functions $h_1,h_2: {2^{<\mathbb {N}}}\rightarrow \mathbb {R}^{\geq 0}$ satisfying for all $n\in \mathbb {N}$
we have
Proof. For the first implication, simply observe that given $m\in \mathbb {N}$ the sequence $U_1,U_2,\ldots $ defined by
is a $\mu $ -test. If a sequence x is such that it is not true that
then for sufficiently large m we have $x\in \bigcap _{n>0}U_n$ , so x is not 1-random.
The second implication follows directly from Theorem 2.11. If y is not 1-random then there is a learning function f which witnesses this. Now, setting $h_1(\sigma )=f(\sigma )$ and $h_2(\sigma )=0$ for all $\sigma \in {2^{<\mathbb {N}}}$ gives what is needed.
In a way, these two interpretative frameworks, i.e., the detection of rare properties and the convergence of estimators, are closely connected. On the one hand, take an appropriate learning function and add it to an estimator; this renders the estimator bad on some random sequences. On the other hand, take a pair of estimators and monitor their difference, and you will get a learning function. In that case, the asymptotic behavior of the learning function imitates that of the estimators.
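For instance, a pair of estimators may be turned into a family of learning functions along these lines (a sketch; the particular threshold $2^{-m}$ is one natural choice used here for illustration and is not claimed to be the exact definition used earlier):

def learner_from_estimators(h1, h2, m):
    # positive answer iff the two estimates currently differ by more than 2^-m
    def f(w):
        return 1 if abs(h1(w) - h2(w)) > 2.0 ** (-m) else 0
    return f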
So far, we have considered pointwise convergence of the estimators. This is a relatively strong property. Indeed, many inductive schemes do not satisfy pointwise convergence and are optimal only in terms of some weaker criterion of success. Specifically, mathematicians and statisticians have studied, with great attention, convergence in Cesàro averages. Given a function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ we say that f converges to Y on x in Cesàro averages if
Such a form of convergence is very natural and well studied in summability theory (cf. [Reference Peyerimhoff20]). There are plenty of natural examples of infinite sequences of reals which do not converge pointwise, but converge in Cesàro averages. In statistics, this may be pictured by the following scenario. We want to estimate a property of the underlying process (such as the entropy rate). As new data comes, we make a new estimate. It is often assumed that more data means a better estimate, but this is not always the case. Suppose that an unlikely (but of positive measure) outcome causes a large error in the estimation. This will happen rarely (as the event in question has small probability) but, nevertheless, it will happen infinitely often. In many cases, such a problem may be alleviated by simply averaging all the estimates made so far and using the average as a new estimator. For instance, consider the problem of forward conditional measure estimation for stationary ergodic processes. It was shown by Bailey [Reference Bailey2] that pointwise estimators do not exist in this case, but there are known estimators which converge almost surely in Cesàro averages.
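A synthetic numerical illustration (not an estimator from the cited works): the running estimates below produce a large error at every power of two, so they do not converge pointwise, yet their Cesàro averages settle near the true value:

estimates = [100.0 if (n & (n - 1)) == 0 else 1.0 for n in range(1, 2 ** 14 + 1)]  # rare large spikes
cesaro, s = [], 0.0
for n, e in enumerate(estimates, start=1):
    s += e
    cesaro.append(s / n)
print(estimates[2 ** 13 - 1], round(cesaro[-1], 3))  # a spike of 100.0 vs. an average close to 1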
Unsurprisingly, there are computable estimators which converge in Cesàro averages to some random variable $\mu $-almost surely but fail to do so on some random point. This prompts a question—under what conditions is convergence (pointwise or in Cesàro averages) on all 1-random sequences guaranteed? In particular, we might be interested in conditions stated in purely probabilistic terms. A partial answer to this is given by the effective version of Breiman’s ergodic theorem. We state it in a specialized form below but, first, an additional comment is required. For a binary alphabet, a measure $\mu $ is stationary if for every $k\in \mathbb {N}$ and every word $\sigma $ of length $k+1$ the probability $\mu (X_i^{i+k}=\sigma )$ does not depend on i. By the Kolmogorov extension theorem (cf. [Reference Billingsley5]), a stationary measure on the space of sequences from ${2^{\mathbb {N}}}$ may be uniquely extended to a stationary measure on the space of two-sided infinite sequences (elements of $2^{\mathbb {Z}}$). Similarly, the canonical process $X_0,X_1,\ldots $ is uniquely extended to the process $\ldots X_{-1},X_0,X_1,\ldots $
Theorem 4.19. Let $g:{2^{<\mathbb{N}}}\rightarrow\mathbb {R}^+$ be a computable function with $\lim_{n\to\infty}g(X_0^n)$ existing almost surely and $\mathbb{E}_{\lambda}(\sup_i|g(X_0^i)|)<\infty$ . Then for every $\lambda $-random sequence $\omega \in {2^{\mathbb {N}}}$,
and
Proof. The result follows from the effective Birkhoff’s ergodic theorem [Reference Bienvenu, Day, Hoyrup, Mezhirov and Shen3, Reference Franklin, Greenberg, Miller and Ng9, Reference V’yugin25] using the proof of Breiman [Reference Breiman6]. For details see, e.g., [Reference Dębowski and Steifer7].
Note that the uniform measure $\lambda $ in Theorem 4.19 may be substituted by an arbitrary stationary ergodic computable measure. To keep the presentation simple, we choose not to introduce this class of measures in detail here. The curious reader is referred to [Reference Billingsley5].
While the learning-theoretic definitions of [Reference Osherson and Weinstein18] and later of [Reference Zaffora Blando27] correspond to the problem of pointwise convergence, the density-based characterizations of the type discussed in this work are easily interpreted in terms of convergence in Cesàro averages. To this end, we show yet another learning-theoretic characterization of weak 2-randomness. Here, we consider learning functions whose asymptotic frequency of positive answers equals zero almost surely. The theorem states that if the average of the initial answers does not converge to zero on some sequence, then this sequence is not weakly 2-random.
Theorem 4.20. The sequence $x\in {2^{\mathbb {N}}}$ is weakly 2-random if and only if there is no computable function f such that
while
Proof. ( $\Leftarrow $ ) Suppose that $x\in {2^{\mathbb {N}}}$ is not weakly 2-random. Let g be a function witnessing this in the sense of Theorem 2.10.
Fix a rational number $\delta>0$. We now construct the function f by induction on the length of words. Let $f(\square )=0$. Suppose that we have already defined $f(\sigma )$ for some word $\sigma $ and we want to define $f(\sigma 0)$ and $f(\sigma 1)$. Let $u(\sigma )=\#\{v:v\preceq \sigma \wedge g(v)=1\}$. If $u(\sigma )=u(\sigma _0^{|\sigma |-2})$, we let $f(\sigma 0)=f(\sigma 1)=0$. Otherwise, compute the least n such that
Let $f(\sigma w)=1$ for every $w\in 2^i$ with $i\leq n$. It remains to observe that if g says $1$ on only finitely many prefixes of a sequence, then so does f. Consequently, the average of answers given by f on the prefixes of such a sequence converges to $0$. On the other hand, if g says $1$ on infinitely many prefixes, then the average of answers given by f is larger than $\delta $ infinitely many times (and so, it does not converge to $0$). The rest follows from the properties of g.
( $\Rightarrow $ ) Let f be a computable function such that
and
Fix a rational $\delta>0$ such that
for infinitely many n. We define a computable function g as follows. For each $w\in {2^{<\mathbb {N}}}$ let $g(w)=1$ if and only if
By Theorem 2.10, x is not weakly 2-random.
Furthermore, a stronger version of Proposition 4.16 follows from the previous considerations.
Proposition 4.21. There exists a $\lambda $-random sequence x and a computable function $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and for all $y\neq x$
As a consequence, even if an estimator converges to some value in the pointwise fashion almost surely, it may happen that it fails to converge in Cesàro averages on some random point. This is true even under the assumption that the expected value of the estimator is bounded. If the estimator gives finitely many nonzero answers almost surely, then the expected value of the limit of answers is zero. In particular, we have the following.
Corollary 4.22. There exists a computable function $g: {2^{<\mathbb {N}}}\rightarrow \mathbb {R}^+$ such that $\lim_{n\to\infty}g(X_0^n)$ exists almost surely and $\mathbb {E}_{\lambda }(\sup _i|g(X_0^i)|)<\infty $ and for some random sequence x
5 One additional remark and a question
Let me end this note with a brief remark about universal inductive schemes. In the introduction, I gave the following motivation for algorithmic randomness. We perform some experiment and we want to know whether a sequence of observations comes from a random process. To this end, we take a computable probability measure, say, produced by a Turing machine with index k. Then we fix a notion of algorithmic randomness and, finally, we start saying some nontrivial things about properties that a nice sequence of random outcomes should have. But to be honest, it requires a great deal of knowledge to guess that we should be looking at outputs of the k-th Turing machine and not of the $17$-th machine, or at some other possible measure. More often than not, we do not have that kind of knowledge. And from a certain philosophical point of view, it may not matter whether something is random with respect to a given probability measure—rather, we may want to simply know whether it is random at all. If so, then perhaps our true goal is not a notion of randomness with respect to a fixed measure but something more general—randomness with respect to a class of measures. This may be done to some extent, as shown in [Reference Bienvenu, Gács, Hoyrup, Rojas and Shen4]. One way to specialize this into a formal question is as follows.
Problem 5.23. Is there a natural class $\mathcal {C}$ of measures with a non-trivial learning-theoretic definition of randomness with respect to $\mathcal {C}$ ? For instance, is there a class of measures $\mathcal {C}$ such that a sequence $x\in {2^{\mathbb {N}}}$ is 1-random with respect to some measure from $\mathcal {C}$ if and only if there is no computable $f: {2^{<\mathbb {N}}}\rightarrow \{0,1\}$ such that
and for every measure $\mu $ from $\mathcal {C}$ and every $n\in \mathbb {N}$
The difference between this and the learning-theoretic version of Martin-Löf randomness lies in the measure-theoretic condition, namely, here we ask about recognizing properties that are rare not only from the perspective of one measure but universally, for every measure in the class $\mathcal {C}$. I conjecture that such a learning-theoretic characterization is possible for the class of computable stationary ergodic measures. My guess is motivated by the known existence of inductive schemes for this class.
Inductive schemes which presuppose only minimal knowledge about the underlying probability measure are the holy grails of learning theory, statistics, philosophy of science, etc. For example, various nonparametric schemes for empirical inference that are universal in the class of stationary ergodic processes are known; e.g., Ornstein showed the existence of a universal backward estimator of conditional probability [Reference Ornstein17] and Algoet studied universal procedures for sequential decisions [Reference Algoet1]. In general, these schemes achieve optimal performance almost surely on any measure satisfying some general properties (hence, they are called universal).
Universality (e.g., with respect to some class of computable measures) is a strong property. One could wonder if it is strong enough to guarantee convergence on all relevant random sequences. The learning functions from Propositions 4.16 and 4.21 manifest their unusual behavior on exactly one 1-random sequence. The uniform measure $\lambda $ is continuous, hence we can disturb the convergence on a single sequence without worrying about the behavior on a set of full measure. This is true for every continuous measure $\mu $—anything that happens on a singleton happens with $\mu $-probability zero. Finally, recall the following theorem.
Theorem 5.24 (Kautz [Reference Kautz13])
If $\mu $ is a computable measure and for some $x\in {2^{\mathbb {N}}}$ we have $\mu (\{x\})>0$ then x is computable.
Consequently, the behavior of the estimator on a unique $\lambda $ -random point is irrelevant to the probabilistic properties of the estimator such as universality. In other words, universality with respect to some class of computable measures does not guarantee convergence on every 1-random sequence.
Acknowledgements
The author is grateful to Łukasz Dębowski and Dariusz Kalociński for their advice. This work was supported by the National Science Centre Poland grant no. 2018/31/B/HS1/04018.