
A confirmation of a conjecture on Feldman’s two-armed bandit problem

Published online by Cambridge University Press:  26 May 2023

Zengjing Chen*
Affiliation:
Shandong University
Yiwei Lin*
Affiliation:
Shandong University
Jichen Zhang*
Affiliation:
Shandong University
*Postal address: School of Mathematics, Shandong University, Jinan 250100, China.

Abstract

The myopic strategy is one of the most important strategies when studying bandit problems. In 2018, Nouiehed and Ross put forward a conjecture about Feldman’s bandit problem (J. Appl. Prob. (2018) 55, 318–324). They proposed that for Bernoulli two-armed bandit problems, the myopic strategy stochastically maximizes the number of wins. In this paper we consider the two-armed bandit problem with more general distributions and utility functions. We confirm this conjecture by proving a stronger result: if the agent playing the bandit has a general utility function, the myopic strategy is still optimal if and only if this utility function satisfies reasonable conditions.

Type
Original Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

The bandit problem is a well-known problem in sequential control under conditions of incomplete information. It involves sequential selections from several options referred to as arms of the bandit. The payoffs of these arms are characterized by parameters which are typically unknown. Agents should learn from past information when deciding which arm to select next, with the aim of maximizing the total payoffs.

This problem can be traced back to Thompson’s work [Reference Thompson34] related to medical trials. Now it is widely studied and frequently applied as a theoretical framework for many other sequential statistical decision problems in market pricing, medical research, and engineering, which are characterized by the trade-off between exploration and exploitation (see e.g. [Reference Garbe and Glazebrook15], [Reference Kim and Lim23], and [Reference Rusmevichientong, Shen and Shmoys32]).

Here we focus on the two-armed bandit problem, which was studied by Feldman [Reference Feldman14]. For a given pair $(F_1,F_2)$ of distributions on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, consider two experiments X and Y (called the X-arm and the Y-arm), whose distributions under the two hypotheses $H_1$ and $H_2$ are as follows:

(1.1) \begin{equation}\begin{array}{r@{\quad}c@{\quad}c@{\quad}c}&\;&\;X&\;Y\\ (\xi_0)&\;H_1\colon &\;F_1&\;F_2\\ (1-\xi_0)&\;H_2\colon &\;F_2&\;F_1, \end{array}\end{equation}

where $\xi_0$ is the a priori probability that $H_1$ is true. In trial i, either the X-arm or Y-arm is selected to generate a random variable $X_i$ or $Y_i$ which describes the payoff, and $\xi_i$ is the posterior probability of $H_1$ being true after i trials. The aim is to find the optimal strategy that maximizes the total expected payoffs.

Among the many notable strategies, such as the myopic strategy, the Gittins strategy, and the play-the-winner strategy, the myopic strategy is undoubtedly one of the most appealing. With this strategy, in each trial agents select the arm with the greater immediate expected payoff, i.e. they play each time as though there were but one trial remaining. Mathematically, let $\mathbb{E}_\mathbb{P}[ \cdot\mid H_1]$ denote the expectation under hypothesis $H_1$. If

(1.2) \begin{equation}\mathbb{E}_\mathbb{P}[X_1\mid H_1]\geq \mathbb{E}_\mathbb{P}[Y_1\mid H_1],\end{equation}

which is equivalent to

\begin{equation*}\int_{\mathbb R} x{\mathrm{d}} F_1(x)\ge \int_{\mathbb R} x{\mathrm{d}} F_2(x).\end{equation*}

then agents select the X-arm in trial i when $\xi_{i-1}\ge \frac{1}{2}$, and the Y-arm otherwise.
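In computational terms this selection rule is elementary. The following minimal sketch (Python; the function names are hypothetical, and m1, m2 denote the means of $F_1$ and $F_2$) makes the reduction to the threshold $\xi_{i-1}\ge \frac{1}{2}$ explicit:

def one_step_expected_payoffs(xi, m1, m2):
    """Expected immediate payoff of each arm when the prior probability of H1 is xi.

    Under H1 the X-arm has mean m1 and the Y-arm has mean m2; under H2 the roles
    are swapped, so the mixture means are as below (assumption: m1 >= m2, i.e. (1.2) holds).
    """
    return xi * m1 + (1 - xi) * m2, xi * m2 + (1 - xi) * m1

def myopic_choice(xi, m1, m2):
    """Play the arm with the larger immediate expected payoff.

    Since ex - ey = (2*xi - 1)*(m1 - m2), this returns 'X' exactly when xi >= 1/2
    whenever m1 >= m2, which is the myopic rule described above.
    """
    ex, ey = one_step_expected_payoffs(xi, m1, m2)
    return 'X' if ex >= ey else 'Y'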

When the myopic strategy is optimal, the optimal strategy is time-invariant, i.e. it does not depend on the number of trials remaining, and hence it can be easily implemented. Unfortunately, the myopic strategy is not optimal in general, mainly because at each time it only considers the payoff of the next trial, whereas to maximize the total payoff all the remaining trials should be taken into account. Kelley [Reference Kelley22] and Berry and Fristedt [Reference Berry and Fristedt9] gave counterexamples in which the myopic strategy is not optimal. Determining under what conditions the myopic strategy is optimal remains an open question, and its optimality has long attracted attention. Bradt et al. [Reference Bradt, Johnson and Karlin10] considered model (1.1) when $F_1$ and $F_2$ are Bernoulli with expectations $\alpha$ and $\beta$ respectively. They showed that the myopic strategy is optimal when $\alpha+\beta=1$ . Further, they conjectured that the myopic strategy is also optimal when $\alpha+\beta\neq 1$ , and verified this conjecture for $n\le 8$ . For model (1.1), Feldman [Reference Feldman14] showed that the myopic strategy is optimal for an arbitrary number of trials. Kelley [Reference Kelley22] considered Bernoulli payoffs and asymmetric hypotheses and gave a necessary and sufficient condition for the myopic strategy to be optimal. Rodman [Reference Rodman31] extended Feldman’s result to a multi-armed setting.

The search for conditions under which the myopic strategy is optimal is not over; the problem remains open in more general settings. Nouiehed and Ross [Reference Nouiehed and Ross30] studied a bandit problem with Bernoulli arms and posed the conjecture that the myopic strategy also maximizes the probability that at least k wins occur in the first n trials, for all k and n. They proved this conjecture for $k=1$ and $k=n$ in an n-armed bandit, and for $k=n-1$ in the two-armed case, that is, model (1.1) with Bernoulli payoffs.

Why has the question in Nouiehed and Ross’s conjecture not been raised for almost 60 years? Nouiehed and Ross [Reference Nouiehed and Ross30] explained that this was because in Feldman [Reference Feldman14] and similar studies, it was the number of times the better arm was chosen that was maximized, not the total payoff. Although the two approaches are equivalent in [Reference Feldman14], they are quite different when we want to study a more general utility function.

This opens up a whole new horizon for us to study this issue. Let x be the total payoff, equal to the sum of the generated values. All works mentioned above considered the utility function $\varphi (x)=x$ (e.g. [Reference Berry and Fristedt9], [Reference Feldman14], and [Reference Kelley22]) or $\varphi(x)=I_{[k,+\infty)}(x)$ (e.g. [Reference Nouiehed and Ross30]). So a natural question is what conditions can guarantee the optimality of the myopic strategy for general utility functions.

In this paper we focus on the optimal strategy for the most typical case of two-armed bandit problems (model (1.1)), proposed in the seminal paper of Feldman [Reference Feldman14]. With a general utility function under consideration, we obtain a necessary and sufficient condition for the optimality of the myopic strategy. As an application, we resolve Nouiehed and Ross’s conjecture in the two-armed case.

We consider a situation in which the agent playing model (1.1) has a utility function $\varphi$, starts with an initial fund x, and uses the strategy $\textsf{M}^n$ : in trial i, play the X-arm if $\xi_{i-1}\ge \frac{1}{2}$ , and the Y-arm otherwise. The innovative aspects of the results obtained in this paper are as follows: first, we take $F_1$ and $F_2$ to be general distribution functions, continuous or not, rather than Bernoulli distributions; second, we consider general utility functions that are no longer linear. This invalidates Feldman’s proof method and brings additional difficulties. We shall show that $\textsf{M}^n$ maximizes the expected utility of n trials if and only if the utility function $\varphi$ and the distributions $F_1$ and $F_2$ satisfy

(1.3) \begin{align}\mathbb{E}_\mathbb{P}[\varphi(u+X_1)\mid H_1]\ge \mathbb{E}_\mathbb{P}[\varphi(u+Y_1)\mid H_1]\quad \text{for any $u\in\mathbb{R}$.}\end{align}

Condition (1.3) means that no matter how much money the agent already has, if only one trial is to be played, playing the arm with distribution $F_1$ is always better than playing the arm with $F_2$ . In the case that $\varphi(x)=x$ , condition (1.3) coincides with condition (1.2).
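For illustration, condition (1.3) can be checked numerically for candidate distributions and utilities. The sketch below (Python, with a hypothetical example: $F_1=N(1,1)$, $F_2=N(0,1)$, and the increasing bounded utility $\varphi=\tanh$) estimates both sides of (1.3) by Monte Carlo over a finite grid of u; this is only a sanity check, not a proof.

import numpy as np

rng = np.random.default_rng(0)

def check_condition(sample_f1, sample_f2, phi, u_grid, n_samples=100_000):
    """Monte Carlo sanity check of condition (1.3):
    E[phi(u + X_1) | H_1] >= E[phi(u + Y_1) | H_1] for every u on the grid.
    Under H_1 the X-arm has distribution F_1 and the Y-arm has F_2."""
    x = sample_f1(n_samples)  # payoffs from the F_1-arm under H_1
    y = sample_f2(n_samples)  # payoffs from the F_2-arm under H_1
    return all(phi(u + x).mean() >= phi(u + y).mean() for u in u_grid)

# Hypothetical example: F_1 = N(1, 1), F_2 = N(0, 1), increasing bounded utility.
print(check_condition(lambda n: rng.normal(1.0, 1.0, n),
                      lambda n: rng.normal(0.0, 1.0, n),
                      np.tanh,
                      u_grid=np.linspace(-5.0, 5.0, 21)))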

Interestingly, if we choose the utility function in condition (1.3) to be the indicator function $\varphi(x)=I_{[k,+\infty)}(x)$ and the initial fund $u=0$, we can prove Nouiehed and Ross’s conjecture for the two-armed case immediately.

The structure of the paper is as follows. In Section 2 we review several important strategies and compare them with the results of this paper. In Section 3 we describe the two-armed bandit problem and some basic properties. In Section 4 we first introduce a dynamic programming property of the optimal expected utility and then prove the main result. Finally, as a corollary, we derive the validity of Nouiehed and Ross’s conjecture in the two-armed bandit case.

2. Related literature

After years of research, many excellent strategies have been produced. Here we review some common strategies and illustrate how the results of this paper relate to and differ from these strategies.

In the multi-armed bandit literature, a celebrated result is the so-called Gittins index strategy introduced by Gittins and Jones [Reference Gittins and Jones17]. Unlike the model studied in this paper, the Gittins index strategy is used in a model with independent arms and an infinite number of trials, and the aim is to maximize the sum of geometrically discounted payoffs. This strategy assigns to each arm an index as a function of its current state, and then activates the arm with the largest index value (breaking ties arbitrarily). It optimizes the infinite-horizon expected discounted payoff. If the number of trials N is finite, the Gittins index strategy can be used to approximate the optimal strategy when N is large. If more than one arm can change its state at each stage, the problem is called ‘restless’. Whittle [Reference Whittle37] proposed an index rule to solve the restless problem. This index is not necessarily optimal, but Whittle conjectured that it admits a form of asymptotic optimality as both the number of arms and the number of arms allocated in each period grow to infinity in a fixed proportion, which was eventually proved by Weber and Weiss [Reference Weber and Weiss36] under some technical assumptions. The restless multi-armed bandit model can be applied in many areas, such as clinical trials, sensor management, and capacity management in healthcare; see [Reference Ahmad, Liu, Javidi, Zhao and Krishnamachari3], [Reference Deo, Iravani, Jiang, Smilowitz and Samuelson13], [Reference Gittins, Glazebrook and Weber19], [Reference Lee, Lavieri and Volk27], [Reference Mahajan and Teneketzis28], and [Reference Washburn35].

A major drawback of the Gittins index and the Whittle index is that they are both difficult to calculate; the current fastest algorithms can only compute the index in cubic time [Reference Gast, Gaujal and Khun16, Reference Niño-Mora29]. In contrast, the myopic strategy is much easier to compute. A second issue with the Gittins index is that the arms must have independent parameters and the payoffs must be geometrically discounted. If these conditions are not met, as in the model studied in this paper, the Gittins index strategy is no longer optimal [Reference Berry and Fristedt9, Reference Gittins and Wang18]. It is therefore necessary to determine when the myopic strategy is optimal for the model studied in this paper.

Another important strategy is the UCB strategy. Lai and Robbins [Reference Lai and Robbins25] laid out the theory of asymptotically optimal allocation and were the first to actually use the term ‘upper confidence bound’ (UCB). The UCB strategy is also applied to models with independent arms. Initially the UCB strategy was used for models with an infinite number of trials, but with modifications the UCB can also be used for models with a finite number of trials. When using the UCB strategy, each arm is assigned a UCB for its mean reward, and the arm with the largest bound is to be played. The bound is not the conventional upper limit for a confidence interval. The design of the confidence bound has been successively improved [Reference Agrawal2, Reference Audibert and Bubeck4Reference Auer, Cesa-Bianchi and Fischer6, Reference Cappé, Garivier, Maillard, Munos and Stoltz11, Reference Honda and Takemura20, Reference Lai26]. Among these, the kl-UCB strategy [Reference Cappé, Garivier, Maillard, Munos and Stoltz11] and Bayes-UCB strategy [Reference Kaufmann21] are asymptotically optimal for exponential family bandit models. The improved UCB strategy is now easy to compute but is still limited to models with independent arms.

The $\epsilon$ -greedy strategy [Reference Kuleshov and Precup24] is widely used because it is very simple. At each round $t = 1, 2,\dots$ the agent selects the arm with the highest empirical mean with probability $1-\epsilon$ , and selects a random arm with probability $\epsilon$ . It has poor asymptotic behavior, because it continues to explore long after the optimal arm has become apparent. Another common strategy in the case of Bernoulli arms is play-the-winner: arm a is played at time $t +1$ if it resulted in a success at time t, and if a failure is observed at time t, the next arm is either chosen at random or the arms are cycled through deterministically. Play-the-winner can be nearly optimal when the best arm has a very high success rate, but does not perform well in most cases [Reference Berry and Fristedt9, Reference Scott33].
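For comparison, here is a minimal sketch of these two heuristics for Bernoulli arms (Python; hypothetical function names, not part of the model analysed in this paper):

import random

def epsilon_greedy(counts, sums, eps=0.1):
    """Select the arm with the highest empirical mean with probability 1 - eps,
    and a uniformly random arm with probability eps; counts[a] and sums[a] are
    the number of pulls and the total payoff of arm a so far."""
    if random.random() < eps or all(c == 0 for c in counts):
        return random.randrange(len(counts))
    means = [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]
    return max(range(len(means)), key=means.__getitem__)

def play_the_winner(last_arm, last_payoff, n_arms):
    """Stay on the same arm after a success; move to the next arm after a failure."""
    return last_arm if last_payoff == 1 else (last_arm + 1) % n_arms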

There are also many variants of the bandit problem, such as the adversarial multi-armed bandit with the EXP3 strategy [Reference Auer, Cesa-Bianchi, Freund and Schapire7] and online linear optimization with the FTL strategy [Reference Abernethy, Hazan and Rakhlin1], but these are beyond the scope of this paper.

In fact, compared with the various classical models mentioned above, the main feature of the model in this paper is the introduction of the agent’s utility function. All of the above strategies consider only linear utility or discounted returns, but agents may have nonlinear utility functions. For example, in Nouiehed and Ross’s conjecture, the utility of an agent no longer grows after a certain amount of gain. In this case, the performance of strategies such as the Gittins index or UCB is significantly reduced.

After the introduction of a general utility function, the bandit problem becomes very complicated. To simplify the model and demonstrate the ideas of this paper clearly, we study the two-armed bandit model. Nouiehed and Ross’s conjecture for the multi-armed bandit model can also be studied with the method of this paper.

The multi-armed bandit model in Nouiehed and Ross’s conjecture can be further generalized. For example, when the arms have independent parameters and the discounting is geometric, Banks and Sundaram [Reference Banks and Sundaram8] proved that the myopic strategy is equivalent to the Gittins index strategy for linear utility functions. When the agent has a general utility function, does the Gittins index still exist, and can the myopic strategy still be optimal under certain conditions? Exploring these questions will lead us to a better understanding of the nature of the bandit problem. Chen, Epstein, and Zhang [Reference Chen, Epstein and Zhang12] studied a multi-armed bandit problem where the agent is loss-averse, in particular risk-averse in the domain of gains and risk-loving in the domain of losses, and they established the corresponding asymptotically optimal strategy. This is an important advance in this line of research.

3. Preliminaries

Let us start with the description of the two-armed bandit model (1.1) and the strategies.

Consider the bandit model in equation (1.1). Let $\{X_i\}_{i\geq1}$ be a sequence of random variables, where $X_i$ describes the payoff of trial i from the X-arm, and let $\{Y_i\}_{i\geq1}$ be the corresponding sequence for the Y-arm; $\{(X_i,Y_i)\}_{i\ge 1}$ are independent under each hypothesis. We define $\mathcal{F}_i\;:\!=\; \sigma\{(X_1,Y_1),\ldots,(X_{i},Y_i)\}$ , which represents all the information that could be obtained up to trial i.

Remark 3.1. Note that in practice, regardless of the strategy chosen, the information obtained after the ith trial is a proper subset of $\mathcal{F}_i$ , because for experiment j, the agent can only observe one of $X_j$ and $Y_j$ .

We call this model a $(\xi_0,n,x)$ -bandit if there are n trials to be played with initial fund x and prior probability $\xi_0$ . In the following discussion, the distributions $F_1$ and $F_2$ of the arms are assumed to be continuous with densities $f_1$ and $f_2$ , respectively.

Remark 3.2. The same results still hold when the distributions of arms are discrete, e.g. Bernoulli. We only need to modify the calculation of expectations in this case.

For each $i\geq1$ , let $\theta_i$ be an $\mathcal{F}_{i-1}$ -measurable random variable taking values in $\{0,1\}$ , where $\theta_i=1$ means the X-arm is selected for observation in trial i and $\theta_i=0$ means the Y-arm is selected for observation in trial i. The payoff that an agent receives using $\theta_i$ in trial i is

\begin{equation*}Z_{i}^{\theta}\;:\!=\; \theta_{i}X_{i}+(1-\theta_{i})Y_{i}.\end{equation*}

For a $(\xi_0,n,x)$ -bandit and $\theta=\{\theta_1,\ldots,\theta_n\}$ , if $\theta_i\in \sigma(Z_1^\theta,\ldots,Z_{i-1}^\theta)\subset \mathcal{F}_{i-1}$ , then we call $\theta$ a strategy. The set of strategies for a $(\xi_0,n,x)$ -bandit is denoted by $\Theta_n$ .
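In simulation terms, a strategy can be represented as a list of decision functions of the previously observed payoffs; a minimal sketch (Python, a hypothetical representation rather than the paper’s notation) is shown below.

def play(strategy, x_payoffs, y_payoffs):
    """Run a strategy on one realisation of {(X_i, Y_i)}.

    strategy[i] maps the previously observed payoffs (Z_1, ..., Z_i) to 1
    (play the X-arm) or 0 (play the Y-arm), so each decision is measurable
    with respect to sigma(Z_1, ..., Z_i), as the definition of a strategy requires.
    Returns the realised payoffs Z_1, ..., Z_n.
    """
    z = []
    for i, decide in enumerate(strategy):
        theta = decide(tuple(z))      # the decision depends only on past payoffs
        z.append(x_payoffs[i] if theta == 1 else y_payoffs[i])
    return z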

For a $(\xi_0,n,x)$ -bandit and a suitable measurable function $\varphi$ , the expected utility obtained by using strategy $\theta$ is denoted by

\begin{align*} W(\xi_0,n,x,\theta)=\mathbb{E}_\mathbb{P}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr],\end{align*}

where $\varphi$ is called a utility function.

For each strategy $\theta\in\Theta_n$ , let $\{\xi_i^\theta\}_{i\ge1}$ be the sequence of the posterior probabilities that hypothesis $H_1$ is true after i trials. The posterior probability $\xi^\theta_1$ after trial 1 with payoff s is calculated by

(3.1) \begin{equation}\xi_1^\theta(s)=\begin{cases}\dfrac{\xi_0 f_1(s)}{\xi_0 f_1(s)+(1-\xi_0)f_2(s)} & \text{if $\theta_1=1$, i.e. the X-arm is selected,}\\[5pt] \dfrac{\xi_0 f_2(s)}{\xi_0 f_2(s)+(1-\xi_0)f_1(s)} & \text{if $\theta_1=0$, i.e. the Y-arm is selected.}\end{cases}\end{equation}

We can easily see that, for any fixed s, $\xi_1^\theta(s)$ is increasing in $\xi_0$ . When the posterior probability $\xi_i^\theta$ is known and the payoff of the $(i+1)$th trial is s, we have the recursive formula

(3.2) \begin{equation}\xi_{i+1}^\theta(s)=\begin{cases}\dfrac{\xi_i^\theta f_1(s)}{\xi_i^\theta f_1(s)+(1-\xi_i^\theta)f_2(s)} & \text{if $\theta_{i+1}=1$, i.e. the X-arm is selected,}\\[5pt] \dfrac{\xi_i^\theta f_2(s)}{\xi_i^\theta f_2(s)+(1-\xi_i^\theta)f_1(s)} & \text{if $\theta_{i+1}=0$, i.e. the Y-arm is selected.}\end{cases}\end{equation}
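A sketch of recursion (3.2), and of the myopic play driven by it, for densities $f_1$ and $f_2$ supplied as functions (Python; hypothetical names, and in the Bernoulli case the densities become probability mass functions):

def posterior_update(xi, payoff, arm, f1, f2):
    """One step of recursion (3.2): update the posterior probability of H_1
    after observing `payoff` on the selected arm (arm=1 for X, arm=0 for Y)."""
    if arm == 1:
        num = xi * f1(payoff)
        den = num + (1 - xi) * f2(payoff)
    else:
        num = xi * f2(payoff)
        den = num + (1 - xi) * f1(payoff)
    return num / den

def run_myopic(xi0, x_payoffs, y_payoffs, f1, f2):
    """Play the myopic strategy M^n on one realisation of {(X_i, Y_i)}:
    in trial i play the X-arm iff the current posterior is at least 1/2."""
    xi, total = xi0, 0.0
    for xp, yp in zip(x_payoffs, y_payoffs):
        arm = 1 if xi >= 0.5 else 0
        payoff = xp if arm == 1 else yp
        total += payoff
        xi = posterior_update(xi, payoff, arm, f1, f2)
    return total, xi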

Now we propose the following two-armed bandit problem.

Problem (TAB). For a $(\xi_0,n,x)$ -bandit and a utility function $\varphi$ , find some strategy in $\Theta_n$ to achieve the maximal expected utility

(3.3) \begin{equation}V(\xi_0,n,x)\;:\!=\; \sup_{\theta\in \Theta_n} W(\xi_0,n,x,\theta)=\sup_{\theta\in\Theta_n} \mathbb{E}_\mathbb{P}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr].\end{equation}

Note that the expected utility $\mathbb{E}_\mathbb{P}[\!\cdot\!]$ depends on hypothesis $H_1$ , $H_2$ , and $\xi_0$ . In fact,

\begin{align*}&\mathbb{E}_\mathbb{P} \Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr] =\xi_0 \mathbb{E}_\mathbb{P}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\mid H_1\Biggr]+(1-\xi_0)\mathbb{E}_\mathbb{P}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\mid H_2\Biggr],\end{align*}

where $\mathbb{E}_\mathbb{P}[\cdot\mid H_i]$ is the expectation under hypothesis $H_i$ $(i=1,2).$

To simplify the notation, we write $\mathbb{E}_\mathbb{P}[\cdot\mid H_1]$ as $\mathbb{E}_1[\!\cdot\!]$ , $\mathbb{E}_\mathbb{P}[\cdot\mid H_2]$ as $\mathbb{E}_2[\!\cdot\!]$ , and $\mathbb{E}_\mathbb{P}[\!\cdot\!]$ as $\mathbb{E}_{\xi_0}[\!\cdot\!]$ . Then the expected utility can be written as

\begin{align*} W(\xi_0,n,x,\theta)=\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr],\end{align*}

where

\begin{equation*} \mathbb{E}_{\xi_0} \Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr]=\xi_0 \mathbb{E}_1\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr]+(1-\xi_0)\mathbb{E}_2\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr].\end{equation*}

Hence equality (3.3) can be written as follows:

\begin{align*} V(\xi_0,n,x)=\sup_{\theta\in \Theta_n} W(\xi_0,n,x,\theta)=\sup_{\theta\in \Theta_n} \mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr].\end{align*}
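For Bernoulli arms, this expected utility can be computed exactly by conditioning on each hypothesis and enumerating the payoff paths, which makes the decomposition above concrete; a sketch (Python, assuming the strategy is given as a list of decision functions of past payoffs as in the earlier sketch; alpha, beta, and phi are hypothetical parameters):

from itertools import product

def expected_utility(xi0, n, x, strategy, alpha, beta, phi):
    """W(xi0, n, x, theta) for Bernoulli(alpha)/Bernoulli(beta) arms:
    xi0 * E_1[phi(x + sum Z_i)] + (1 - xi0) * E_2[phi(x + sum Z_i)].
    strategy[i] maps the observed payoffs (Z_1, ..., Z_i) to 1 (X-arm) or 0 (Y-arm)."""
    def expect(p_x, p_y):                            # success probabilities of the two arms
        total = 0.0
        for path in product([0, 1], repeat=2 * n):   # all (X_1..X_n, Y_1..Y_n) outcomes
            xs, ys = path[:n], path[n:]
            prob = 1.0
            for a, b in zip(xs, ys):
                prob *= (p_x if a else 1 - p_x) * (p_y if b else 1 - p_y)
            z = []
            for i in range(n):
                z.append(xs[i] if strategy[i](tuple(z)) == 1 else ys[i])
            total += prob * phi(x + sum(z))
        return total
    # under H_1 the X-arm is Bernoulli(alpha); under H_2 the roles are swapped
    return xi0 * expect(alpha, beta) + (1 - xi0) * expect(beta, alpha)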

Consider a strategy $\textsf{M}^n$ : in trial i, play the X-arm if $\xi_{i-1}\ge \frac{1}{2}$ , or the Y-arm otherwise. Our main result is to find conditions under which $\textsf{M}^n$ solves Problem (TAB).

The following lemma shows that when calculating the expected utility, we can temporarily fix the payoff of the first trial, calculate the expected utility as a function of this first payoff, and then take the expectation, treating the first payoff as a random variable. This is an extension of equation (2) in Feldman [Reference Feldman14].

Lemma 3.1. For each integer $n\ge 2$ and strategy $\theta=\{\theta_1,\ldots,\theta_n\}\in\Theta_n $ , we have

(3.4) \begin{align}\mathbb{E}_{\xi_0} \Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr]=\mathbb{E}_{\xi_0}\bigl[h\bigl(x,Z_1^\theta\bigr)\bigr]\quad \textit{for all $x\in \mathbb R$,}\end{align}

where

\begin{align*} h(x,u)=\mathbb{E}_{\xi_1^\theta(u)}\Biggl[\varphi\Biggl(x+u+\sum_{i=2}^{n} Z_i^{\theta[u]}\Biggr)\Biggr],\end{align*}

$\xi_{1}^\theta$ is defined by (3.1), and $\theta[u]$ is the strategy obtained from $\theta$ by fixing the payoff of the first trial to be u.

Remark 3.3. $\mathbb{E}_{\xi_1^\theta(u)}[\!\cdot\!]$ is the expected utility $\mathbb{E}_{\xi_0}[\!\cdot\!]$ with $\xi_0$ replaced by $\xi_1^\theta(u)$ . In integral form, equation (3.4) is

\begin{align*}&\mathbb{E}_{\xi_0} \Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^\theta\Biggr)\Biggr] =\int_{\mathbb R} \mathbb{E}_{\xi_1^\theta(u)}\Biggl[\varphi\Biggl(x+u+\sum_{i=2}^{n} Z_i^{\theta[u]}\Biggr)\Biggr](\xi_0f_{1}(u)+(1-\xi_0)f_{2}(u))\,{\mathrm{d}} u.\end{align*}

Remark 3.4. Here $h\bigl(x,Z^\theta_1\bigr)$ can be seen as a conditional expectation of $\varphi\bigl(x+\sum_{i=1}^{n} Z_i^\theta\bigr)$ given $\mathcal{F}_1$ , and this lemma shows that it has the same expectation as $\varphi\bigl(x+\sum_{i=1}^{n} Z_i^\theta\bigr)$ . This is in fact the tower property for conditional expectations, so we omit the proof.

We can see that the form of h(x, u) is very similar to the expected utility of the $(\xi_1^\theta(u),n-1,x+u)$ -bandit with some strategy. Indeed, there is a strategy under which h(x, u) equals the expected utility of the $(\xi_1^\theta(u),n-1,x+u)$ -bandit.

Lemma 3.2. For any strategy $\theta\in \Theta_n$ , let

\begin{align*} h(x,u)=\mathbb{E}_{\xi_1^\theta(u)}\Biggl[\varphi\Biggl(x+u+\sum_{i=2}^{n} Z_i^{\theta[u]}\Biggr)\Biggr]. \end{align*}

Then, for any u, there exists a strategy $\rho\in \Theta_{n-1}$ , such that the value of h(x,u) is equal to the expected utility of the $(\xi_1^\theta(u),n-1,x+u)$ -bandit with strategy $\rho$ , that is,

\begin{align*} h(x,u)&=\mathbb{E}_{\xi_1^\theta(u)}\Biggl[\varphi\Biggl(x+u+\sum_{i=2}^{n} Z_i^{\theta[u]}\Biggr)\Biggr]\notag \\[5pt] &=\mathbb{E}_{\xi_1^\theta(u)}\Biggl[\varphi\Biggl(x+u+\sum_{i=1}^{n-1} Z_i^{\rho}\Biggr)\Biggr]\notag \\[5pt] &=W(\xi_1^\theta(u),n-1,x+u,\rho).\end{align*}

Proof. We know that $\theta[u]$ is obtained by fixing the payoff u of the first trial, so for any $\theta'_{\!\!i}\in \theta[u]$ it is $\sigma(X_2,Y_2,\ldots,X_{i-1},Y_{i-1})$ -measurable. Then there are measurable functions $\pi_i$ , $i\ge 2$ , such that

\begin{align*}\theta'_{\!\!i}=\pi_i(X_2,Y_2,\ldots,X_{i-1},Y_{i-1}),\quad i\ge 2.\end{align*}

Define a new strategy $\rho\in\Theta_{n-1}$ by

\begin{align*}\rho_i=\pi_{i+1}(X_1,Y_1,\ldots,X_{i-1},Y_{i-1}),\quad 1\le i\le n-1.\end{align*}

The fact that $\rho_{i+1}\in \sigma(Z_1^\rho,\ldots,Z_i^\rho)$ can be easily verified.

By the definition of $\rho$ , we know that $\rho_i$ has the same distribution as $\theta'_{i+1}$ under both hypotheses, so $Z_{i+1}^{\theta[u]}$ and $Z_i^{\rho}$ have the same distribution. Using this fact, we can easily verify that

\begin{align*}\mathbb{E}_{\xi_1^\theta(u)}\Biggl[\varphi\Biggl(x+u+\sum_{i=2}^{n} Z_i^{\theta[u]}\Biggr)\Biggr]=\mathbb{E}_{\xi_1^\theta(u)}\Biggl[\varphi\Biggl(x+u+\sum_{i=1}^{n-1} Z_i^{\rho}\Biggr)\Biggr].\end{align*}

Now we introduce the dynamic programming property of the expected utility, which plays an important role in the subsequent arguments. Similar results are found in many works on bandit problems, but only for the case of $\varphi(x)=x$ (e.g. [Reference Berry and Fristedt9], [Reference Feldman14]). Our result extends the classical ones.

Theorem 3.1. For each $\xi_0\in [0,1]$ , $n\geq1$ and $x\in\mathbb{R}$ , consider the $(\xi_0,n,x)$ -bandit. An optimal strategy $\theta^{[n]}\in \Theta_n$ exists. Moreover, there are measurable functions

\begin{align*}\pi^{[n]}_i\;:\;[0,1]\times\mathbb{R}\times\mathbb{R}^{2i-2}\mapsto \{0,1\},\quad 1\le i\le n, \end{align*}

such that for any $\xi_0\in [0,1]$ and $x\in\mathbb{R}$ , the optimal strategy $\theta^{[n]}$ for the $(\xi_0,n,x)$ -bandit satisfies

\begin{equation*} \begin{aligned}\theta^{[n]}_1&=\pi^{[n]}_1(\xi_0,x),\\[5pt] \theta^{[n]}_i&=\pi^{[n]}_i(\xi_0,x,X_1,Y_1,\ldots,X_{i-1},Y_{i-1}),\quad i\geq2.\end{aligned}\end{equation*}

Further, the optimal expected utility satisfies the following dynamic programming property:

\begin{align*} V(\xi_0,n,x)&= \sup_{\theta\in \Theta_n} \mathbb{E}_{\xi_0}\bigl[V\bigl(\xi_1^\theta,n-1,x+Z^\theta_1\bigr)\bigr] \notag \\[5pt] &= \max \biggl\{ \mathbb{E}_{\xi_0}\biggl[V\biggl(\dfrac{\xi_0f_1(X_1) }{\xi_0 f_1(X_1)+(1-\xi_0)f_2(X_1)},n-1,x+X_1\biggr)\biggr],\notag \\[5pt] &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mathbb{E}_{\xi_0}\biggl[V\biggl(\dfrac{\xi_0f_2(Y_1) }{\xi_0 f_2(Y_1)+(1-\xi_0)f_1(Y_1)},n-1,x+Y_1\biggr)\biggr] \biggr\}.\end{align*}

Theorem 3.1 is standard for a dynamic programming problem with a finite action set and finite horizon, so we omit the proof.
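For Bernoulli arms (with $\alpha,\beta\in(0,1)$, so that all posterior denominators are positive), the recursion in Theorem 3.1 can be evaluated directly; a sketch (Python, hypothetical function names):

def optimal_value(xi0, n, x, alpha, beta, phi):
    """V(xi0, n, x) for Bernoulli(alpha)/Bernoulli(beta) arms, computed with
    the dynamic programming recursion of Theorem 3.1; phi is the utility."""
    f1 = {1: alpha, 0: 1 - alpha}   # mass function of F_1 = Bernoulli(alpha)
    f2 = {1: beta, 0: 1 - beta}     # mass function of F_2 = Bernoulli(beta)

    def value(xi, m, fund):
        if m == 0:
            return phi(fund)
        # playing the X-arm: payoff u occurs with probability xi*f1(u) + (1-xi)*f2(u)
        vx = sum((xi * f1[u] + (1 - xi) * f2[u]) *
                 value(xi * f1[u] / (xi * f1[u] + (1 - xi) * f2[u]), m - 1, fund + u)
                 for u in (0, 1))
        # playing the Y-arm: payoff u occurs with probability xi*f2(u) + (1-xi)*f1(u)
        vy = sum((xi * f2[u] + (1 - xi) * f1[u]) *
                 value(xi * f2[u] / (xi * f2[u] + (1 - xi) * f1[u]), m - 1, fund + u)
                 for u in (0, 1))
        return max(vx, vy)

    return value(xi0, n, x)

Comparing optimal_value with the expected utility of $\textsf{M}^n$ on small instances gives a quick numerical illustration of the results of the next section.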

4. Main results

Now we study the specific form of the optimal strategy for the finite-horizon two-armed bandit (1.1) with utility function $\varphi$ . Obviously, different utility functions may lead to different optimal strategies, but we will show that when a reasonable condition is satisfied, the optimal strategy is independent of the specific form of $\varphi$ .

Recall that the myopic strategy for $(\xi_0,n,x)$ -bandits is $\textsf{M}^n=\{\textsf{m}^n_1,\ldots,\textsf{m}^n_n\}$ : in trial i, play the X-arm if the posterior probability $\xi_{i-1}^{\textsf{M}^n}\ge \frac{1}{2}$ , or the Y-arm if $\xi_{i-1}^{\textsf{M}^n}<\frac{1}{2}$ . Note that $\textsf{M}^n\in\Theta_n$ . In fact, $\textsf{M}^n$ can be denoted by

(4.1) \begin{align}\textsf{M}^n&=\bigl\{\textsf{m}^n_1,\ldots,\textsf{m}^n_n\bigr\}\notag \\[5pt] &=\bigl\{g\bigl(\xi_0\bigr),g\bigl(\xi_1^{\textsf{M}^n}\bigr),\ldots,g\bigl(\xi_{n-1}^{\textsf{M}^n}\bigr)\bigr\}\in\Theta_n,\end{align}

where $g(x)=1$ if $x\geq\frac{1}{2}$ , and $g(x)=0$ if $x<\frac{1}{2}$ . From the definitions of $\xi_{i-1}^{\textsf{M}^n}$ and $\textsf{m}^n_i$ , we know that they are both independent of n and x. Hence we can write $\xi_{i-1}^{\textsf{M}^n}$ for short as $\xi_{i-1}^{\textsf{M}}$ , and write $\textsf{m}^n_i$ as $\textsf{m}_i$ . Now the myopic strategy $\textsf{M}^n$ is denoted by

\begin{align*}\textsf{M}^n&=\{\textsf{m}_1,\ldots,\textsf{m}_n\}\notag \\[5pt] &=\bigl\{g\bigl(\xi_0\bigr),g\bigl(\xi_1^{\textsf{M}}\bigr),\ldots,g\bigl(\xi_{n-1}^{\textsf{M}}\bigr)\bigr\}.\end{align*}

Next we give a condition on $\varphi$ that is necessary and sufficient for $\textsf{M}^n$ to be the optimal strategy.

Theorem 4.1. For any integer $n\ge 1$ , the myopic strategy $\textsf{M}^n$ is the optimal strategy of the $(\xi_0,n,x)$ -bandit for all $ \xi_0\in[0,1]$ , $ x\in \mathbb{R}$ , if and only if

(I) \begin{equation}\mathbb{E}_1[\varphi(u+X_1)]\ge \mathbb{E}_1[\varphi(u+Y_1)] \quad \textit{for all $ u\in \mathbb{R}$.}\end{equation}

Remark 4.1. When $\varphi(x)=x$ , condition (I) is in fact

\begin{align*} \mathbb{E}_1[X_1]\ge \mathbb{E}_1[Y_1],\end{align*}

and Theorem 4.1 in this case is exactly Theorem 2.1 of Feldman [Reference Feldman14].

Remark 4.2. When the two distributions $F_1$ , $F_2$ are Bernoulli distributions, say $\operatorname{Bernoulli}(\alpha)$ and $\operatorname{Bernoulli}(\beta)$ , $\alpha,\beta\in(0,1)$ , then condition (I) is written as

\begin{equation*}(\varphi(x+1)-\varphi(x))(\alpha-\beta)\ge 0\quad \text{for any $x\in \mathbb{R}$}.\end{equation*}

If $\varphi(x)$ is an increasing function of x, and $\alpha\ge \beta$ , then condition (I) holds.
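This reduction is immediate: under $H_1$, $X_1$ is $\operatorname{Bernoulli}(\alpha)$ and $Y_1$ is $\operatorname{Bernoulli}(\beta)$, so for any $u\in\mathbb{R}$,

\begin{align*}\mathbb{E}_1[\varphi(u+X_1)]-\mathbb{E}_1[\varphi(u+Y_1)]&=\bigl(\alpha\varphi(u+1)+(1-\alpha)\varphi(u)\bigr)-\bigl(\beta\varphi(u+1)+(1-\beta)\varphi(u)\bigr)\\[5pt] &=(\varphi(u+1)-\varphi(u))(\alpha-\beta),\end{align*}

and condition (I) holds exactly when this product is non-negative for every argument.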

Remark 4.3. Note that condition (I) here is a necessary and sufficient condition to make $\textsf{M}^n$ the optimal strategy of the $(\xi_0,n,x)$ -bandit model for any $x\in \mathbb{R}$ and any $\xi_0\in[0,1]$ . However, when condition (I) is not satisfied, it is still possible that there is a specific triple $(\overline{\xi_0},\overline{n},\overline{x})$ that makes $\textsf{M}^{\overline{n}}$ the optimal strategy of the $(\overline{\xi_0},\overline{n},\overline{x})$ -bandit, but the optimality of $\textsf{M}^n$ does not hold for general $(\xi_0,n,x)$ triples.

To achieve our goal, we need to formulate some properties of the expected utility of $\textsf{M}^n$ , which can be considered as extensions of the properties of $R_N$ and of Lemma 2.1 in Feldman’s paper [Reference Feldman14].

Lemma 4.1. For each $n\in\mathbb{Z}^+$ , we have the following.

(1) The expected utility with strategy $\textsf{M}^n$ is symmetric about $\xi_0=\frac{1}{2}$ , that is,

(4.2) \begin{align}W(\xi_0,n,x,\textsf{M}^n)=W(1-\xi_0,n,x,\textsf{M}^n)\quad \textit{for all $\xi_0\in [0,1]$, $x\in\mathbb{R}$.}\end{align}

(2) Let us define the strategies

\begin{align*} \textsf{L}^n=\bigl\{1,g\bigl(\xi_1^{\textsf{L}^n}\bigr),\ldots,g\bigl(\xi_{n-1}^{\textsf{L}^n}\bigr)\bigr\}\quad\textit{and}\quad \textsf{R}^n=\bigl\{0,g\bigl(\xi_1^{\textsf{R}^n}\bigr),\ldots,g\bigl(\xi_{n-1}^{\textsf{R}^n}\bigr)\bigr\}, \end{align*}

where $g(\!\cdot\!)$ is the function defined in (4.1). Then

(4.3) \begin{align}W(\xi_0,n,x,\textsf{L}^n)=W(1-\xi_0,n,x,\textsf{R}^n)\quad \textit{for all $\xi_0\in [0,1]$, $x\in\mathbb{R}$.}\end{align}

Proof. An important feature of model (1.1) is symmetry. Swapping the X-arm with the Y-arm and exchanging hypothesis $H_1$ with $H_2$ , we can obtain equation (4.3). Equation (4.2) can be obtained from (4.3) and the definition of $\textsf{M}^n$ . Lemma 4.1 can also be proved using mathematical induction and Lemmas 3.1 and 3.2.

Lemma 4.2. Consider the following strategies.

For $n=2$ , let $\textsf{U}^2=\{1,0\}$ and $\textsf{V}^2=\{0,1\}$ .

For each $n\geq3$ , let

\begin{align*}\textsf{U}^n&=\bigl\{1,0,g\bigl(\xi_2^{\textsf{U}^n}\bigr),\ldots,g\bigl(\xi_{n-1}^{\textsf{U}^n}\bigr)\bigr\},\\[5pt] \textsf{V}^n&=\bigl\{0,1,g\bigl(\xi_2^{\textsf{V}^n}\bigr),\ldots,g\bigl(\xi_{n-1}^{\textsf{V}^n}\bigr)\bigr\},\end{align*}

where $g(\!\cdot\!)$ is the function defined in (4.1). Then, for each $n\ge 2$ , $\xi_0\in[0,1]$ and $x\in\mathbb{R}$ , the expected utilities obtained by using $\textsf{U}^n$ and $\textsf{V}^n$ satisfy the relation

\begin{align*} W(\xi_0,n,x,\textsf{U}^n)=W(\xi_0,n,x,\textsf{V}^n).\end{align*}

Proof. For $n=2$ the result is obvious, so we only need to consider the case $n\ge 3$ . Let $\xi_2^{\textsf{U}^n}(u,s)$ be the posterior probability given the first payoff u and second payoff s, using strategy $\textsf{U}^n$ , and let $\xi_2^{\textsf{V}^n}(s,u)$ be the posterior probability given the first payoff s and second payoff u, using strategy $\textsf{V}^n$ . According to (3.2), we have

(4.4) \begin{align}\xi_2^{\textsf{U}^n}(u,s)=\dfrac{\xi_0 f_1(u)f_2(s)}{\xi_0f_1(u)f_2(s)+(1-\xi_0)f_2(u)f_1(s)}=\xi_2^{\textsf{V}^n}(s,u).\end{align}

Let the conditional strategy $\textsf{U}^n[u,s]$ denote the strategy $\textsf{U}^n$ given the first payoff u and second payoff s, and let the conditional strategy $\textsf{V}^n[s,u]$ denote the strategy $\textsf{V}^n$ given the first payoff s and second payoff u. Then, by (4.4) and the definitions of $\textsf{U}^n[u,s]$ and $\textsf{V}^n[s,u]$ , we can easily get that these two conditional strategies are the same for $i\ge 3$ trials.

Using the same techniques as in Lemma 3.1, we obtain

\begin{align*}&W(\xi_0,n,x,\textsf{U}^n)\notag \\[5pt] &\quad =\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{U}^n}\Biggr)\Biggr]\notag\\[5pt] &\quad =\int_{\mathbb R}\int_{\mathbb R} \mathbb{E}_{\xi_2^{\textsf{U}^n}\!(u,s)\!}\Biggl[\varphi\Biggl(x+u+s+\sum_{j=3}^{n} Z_j^{\textsf{U}^n[u,s]}\Biggr)\Biggr] (\xi_0f_{1}(u)f_{2}(s)+(1-\xi_0)f_{2}(u)f_{1}(s))\,{\mathrm{d}} s\,{\mathrm{d}} u \end{align*}

and

\begin{align*} &W(\xi_0,n,x,\textsf{V}^n)\notag \\[5pt] &\quad =\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{V}^n}\Biggr)\Biggr]\notag\\[5pt] &\quad =\int_{\mathbb R}\int_{\mathbb R} \mathbb{E}_{\xi_2^{\textsf{V}^n}\!(s,u)\!}\Biggl[\varphi\Biggl(\!x+s+u+\sum_{j=3}^{n} Z_j^{\textsf{V}^n[s,u]}\Biggr)\Biggr](\xi_0f_{2}(s)f_{1}(u)+(1-\xi_0)f_{1}(s)f_{2}(u))\,{\mathrm{d}} u\,{\mathrm{d}} s.\end{align*}

Then the desired result is obtained by using (4.4) and the fact that $\textsf{U}^n[u,s]=\textsf{V}^n[s,u]$ for $i\ge 3$ trials.

For each $n\in\mathbb{Z}^+$ , let $\textsf{L}^n$ and $\textsf{R}^n$ be the strategies defined in Lemma 4.1. For $x\in\mathbb{R}$ , $\xi_0\in[0,1]$ , define the difference of expected utilities by

\begin{align*}\Delta_n(x,\xi_0)&\;:\!=\; W(\xi_0,n,x,\textsf{L}^n)-W(\xi_0,n,x,\textsf{R}^n)\\[5pt] &=\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{L}^n}\Biggr)\Biggr] - \mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{R}^n}\Biggr)\Biggr].\end{align*}

By Lemma 4.1 we can obtain $\Delta_n(x,\xi_0)=-\Delta_n(x,1-\xi_0)$ and $\Delta_n(x,0.5)=0$ .

In the case $\varphi(x)=x$ , there is an important recurrence formula for $\Delta_n(x,\xi_0)$ ; see equations (12), (13), and (14) in [Reference Feldman14] and equation (7.1.1) in [Reference Berry and Fristedt9]. The following lemma shows that this recurrence formula still holds for a general utility function.

Lemma 4.3. For $n\ge 2$ and any $x\in\mathbb{R}$ , $\xi_0\in[0,1]$ , we have

\begin{align*}\Delta_n(x,\xi_0)& =\int_{\mathbb R} I_{\bigl\{\xi_1^X(u)\ge 0.5\bigr\}}\Delta_{n-1}\bigl(x+u,\xi_1^X(u)\bigr)\bigl(\xi_0f_1(u)+(1-\xi_0)f_2(u)\bigr)\,{\mathrm{d}} u \notag \\[5pt] &\quad +\int_{\mathbb R} I_{\bigl\{\xi_1^Y(u)<0.5\bigr\}}\Delta_{n-1}\bigl(x +u,\xi_1^Y(u)\bigr)\bigl(\xi_0f_2(u) +(1-\xi_0)f_1(u) \bigr)\,{\mathrm{d}} u,\end{align*}

where

\begin{align*} \xi_1^X(u)=\frac{\xi_0 f_1(u) }{\xi_0 f_1(u) +(1-\xi_0)f_2(u)}\quad\textit{and}\quad\xi_1^Y(u)=\frac{\xi_0 f_2(u) }{\xi_0 f_2(u) +(1-\xi_0)f_1(u)}.\end{align*}

Proof. Consider the strategies $\textsf{U}^n,\textsf{V}^n$ defined in Lemma 4.2. Using Lemmas 3.1 and 3.2, we can compute the following two differences:

\begin{align*}&W(\xi_0,n,x,\textsf{L}^n)-W(\xi_0,n,x,\textsf{U}^n) \notag \\[5pt] &\quad =\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{L}^n}\Biggr)\Biggr] - \mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{U}^n}\Biggr)\Biggr] \notag\\[5pt] &\quad =\int_{\mathbb R} I_{\bigl\{\xi_1^X(u)\ge 0.5\bigr\}}\Delta_{n-1}\bigl(x+u,\xi_1^X(u)\bigr)\bigl(\xi_0f_1(u)+(1-\xi_0)f_2(u)\bigr)\,{\mathrm{d}} u,\\[5pt] &W(\xi_0,n,x,\textsf{V}^n)-W(\xi_0,n,x,\textsf{R}^n) \notag\\[5pt] &\quad =\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{V}^n}\Biggr)\Biggr] - \mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{n} Z_i^{\textsf{R}^n}\Biggr)\Biggr] \notag\\[5pt] &\quad =\int_{\mathbb R} I_{\bigl\{\xi_1^Y(u)<0.5\bigr\}}\Delta_{n-1}\bigl(x +u,\xi_1^Y(u)\bigr)\bigl(\xi_0f_2(u) +(1-\xi_0)f_1(u) \bigr)\,{\mathrm{d}} u.\end{align*}

The computational details are similar to the proofs of Lemmas 4.1 and 4.2, so we omit them. The desired formula is obtained by adding the above two equations and using Lemma 4.2.
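For Bernoulli arms the integrals in this recurrence become two-point sums, so $\Delta_n$ can be evaluated directly; a sketch (Python, hypothetical function names) that also gives a quick numerical check of the monotonicity in $\xi_0$ established in Corollary 4.1 below:

def delta(n, x, xi, alpha, beta, phi):
    """Delta_n(x, xi) = W(xi, n, x, L^n) - W(xi, n, x, R^n) for Bernoulli arms,
    computed with the recurrence of Lemma 4.3 (integrals replaced by sums)."""
    f1 = {1: alpha, 0: 1 - alpha}
    f2 = {1: beta, 0: 1 - beta}
    if n == 1:
        # Delta_1(x, xi) = E_xi[phi(x + X_1)] - E_xi[phi(x + Y_1)]
        return (2 * xi - 1) * sum(phi(x + u) * (f1[u] - f2[u]) for u in (0, 1))
    total = 0.0
    for u in (0, 1):
        xi_x = xi * f1[u] / (xi * f1[u] + (1 - xi) * f2[u])   # posterior after playing X
        xi_y = xi * f2[u] / (xi * f2[u] + (1 - xi) * f1[u])   # posterior after playing Y
        if xi_x >= 0.5:
            total += delta(n - 1, x + u, xi_x, alpha, beta, phi) * (xi * f1[u] + (1 - xi) * f2[u])
        if xi_y < 0.5:
            total += delta(n - 1, x + u, xi_y, alpha, beta, phi) * (xi * f2[u] + (1 - xi) * f1[u])
    return total

# e.g. [delta(3, 0.0, xi, 0.7, 0.4, lambda s: min(s, 2)) for xi in (0.1, 0.3, 0.5, 0.7, 0.9)]
# should be nondecreasing, since this increasing utility satisfies condition (I).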

The key to achieving our main theorem is to prove that for any fixed $x\in\mathbb{R}$ and $n\in\mathbb{Z}^+$ , the above difference $\Delta_n(x,\xi_0)$ is an increasing function of $\xi_0$ . To prove this assertion, we use a method similar to that used by Rodman [Reference Rodman31].

Define functions $D_n(x,t_X,t_Y)$ , $n=1,2,\dots$ , for every tuple $(x,t_X,t_Y)$ of numbers, such that $x\in\mathbb{R}$ and $t_X,t_Y\ge 0$ :

\begin{equation*}D_n(x,t_X,t_Y)= \left\{\begin{aligned}&(t_X+t_Y)\Delta_n\biggl(x,\frac{t_X}{t_X+t_Y}\biggr)& &\text{if $t_X+t_Y>0$,}\\[5pt] &0&&\text{otherwise.}\end{aligned}\right.\end{equation*}

From this definition we obtain immediately that

\begin{align*} D_n(x,t_X,t_Y)=-D_n(x,t_Y,t_X)\quad\text{and}\quad D_n(x,t_X,t_Y)=0\quad\text{if $t_X=t_Y$.}\end{align*}

Lemma 4.4. If condition (I) holds, then $D_n(x,t_X,t_Y)$ is an increasing function of $t_X$ , when $x,t_Y$ are kept fixed.

Proof. The proof is by induction. When $n=1$ ,

\begin{equation*}D_1(x,t_X,t_Y)=(t_X-t_Y)\int_{\mathbb R} \varphi(x+u)(f_1(u)-f_2(u))\,{\mathrm{d}} u\end{equation*}

is clearly an increasing function of $t_X$ when $x,t_Y$ are kept fixed.

Now suppose Lemma 4.4 is proved for $n=k$ . From Lemma 4.3 we know that

\begin{align*}D_{k+1}(x,t_X,t_Y)& =\int_{\mathbb R} I_{\{t_X f_1(u)\ge t_Y f_2(u)\}}D_k(x+u,t_X f_1(u), t_Y f_2(u))\,{\mathrm{d}} u \notag \\[5pt] &\quad +\int_{\mathbb R} I_{\{t_X f_2(u)< t_Y f_1(u)\}}D_k(x+u,t_X f_2(u), t_Y f_1(u))\,{\mathrm{d}} u.\end{align*}

When $u,x,t_Y$ are fixed,

\begin{align*} D_k(x+u,t_X f_1(u), t_Y f_2(u))\quad\text{and}\quad D_k(x+u,t_X f_2(u), t_Y f_1(u))\end{align*}

are increasing functions of $t_X$ . Moreover,

\begin{align*}& D_k(x+u,t_X f_1(u), t_Y f_2(u))\ge 0\quad \text{when $t_X f_1(u)\ge t_Y f_2(u)$,}\\[5pt] & D_k(x+u,t_X f_2(u), t_Y f_1(u))\le 0\quad \text{when $t_X f_2(u)< t_Y f_1(u)$.}\end{align*}

So the two integrands in the above two integrals are both increasing functions of $t_X$ .

Hence we obtain that $D_{k+1}(x,t_X,t_Y)$ is an increasing function of $t_X$ when $x,t_Y$ are fixed, and complete the proof.

Now let $t_X=\xi_0$ , $t_Y=1-\xi_0$ . Then

\begin{align*} D_n(x,t_X,t_Y)=\Delta_n(x,\xi_0).\end{align*}

By Lemma 4.4 and $D_n(x,t_X,t_Y)=-D_n(x,t_Y,t_X)$ , we get the desired fact.

Corollary 4.1. For any fixed $x\in\mathbb{R}$ and $n\in\mathbb{Z}^+$ , $\Delta_n(x,\xi_0)$ is an increasing function of $\xi_0$ .

Now we are ready to prove Theorem 4.1.

Proof of Theorem 4.1. First, we assume that condition (I) holds and prove the optimality of $\textsf{M}^n$ . This follows from Theorem 3.1 and Corollary 4.1; we argue by mathematical induction.

When $n=1$ , by Corollary 4.1,

\begin{align*}\Delta_{1}(x,\xi_0)=\mathbb{E}_{\xi_0} [\varphi(x+X_{1})]-\mathbb{E}_{\xi_{0}}[\varphi(x+Y_{1})]\end{align*}

is an increasing function of $\xi_0$ for any fixed x. Since $\Delta_{1}(x,\frac{1}{2})=0$ , the optimal strategy chooses X first if $\xi_0\ge \frac{1}{2}$ and Y first if $\xi_0<\frac{1}{2}$ . This means that $\textsf{M}^1$ is the optimal strategy of the $(\xi_0,1,x)$ -bandit, for any $x\in\mathbb{R}$ and $\xi_0\in[0,1]$ .

Assume that for fixed $k\ge 1$ , the myopic strategy $\textsf{M}^k$ is optimal for $(\xi_0,k,x)$ -bandits, for any $x\in\mathbb{R}$ and $\xi_0\in[0,1]$ .

Now consider a bandit problem with $k+1$ trials. Note that by the definitions of $\textsf{M}^{k+1}$ , $\textsf{L}^{k+1}$ , and $\textsf{R}^{k+1}$ , we have

\begin{align*} W(\xi_0,k+1,x,\textsf{M}^{k+1})=\begin{cases}W(\xi_0,k+1,x,\textsf{L}^{k+1}) & \text{if $\xi_{0}\geq\frac{1}{2}$,}\\[5pt] W(\xi_0,k+1,x,\textsf{R}^{k+1}) & \text{otherwise.}\end{cases}\end{align*}

By Lemmas 3.1 and 3.2 we know that

\begin{align*}W(\xi_0,k+1,x,\textsf{L}^{k+1})& =\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{k+1} Z_i^{\textsf{L}^{k+1}}\Biggr)\Biggr]\notag\\[5pt] & = \mathbb{E}_{\xi_{0}}\biggl[W\biggl(\dfrac{\xi_0f_1(X_1)}{\xi_0f_1(X_1)+(1-\xi_0)f_2(X_1)},k,x+X_1,\textsf{M}^{k}\biggr)\biggr],\\[5pt] W(\xi_0,k+1,x,\textsf{R}^{k+1})& =\mathbb{E}_{\xi_0}\Biggl[\varphi\Biggl(x+\sum_{i=1}^{k+1} Z_i^{\textsf{R}^{k+1}}\Biggr)\Biggr]\notag\\[5pt] & = \mathbb{E}_{\xi_{0}}\biggl[W\biggl(\dfrac{\xi_0f_2(Y_1)}{\xi_0f_2(Y_1)+(1-\xi_0)f_1(Y_1)},k,x+Y_1,\textsf{M}^{k}\biggr)\biggr].\end{align*}

By Corollary 4.1, $\Delta_{k+1}(x,\xi_0)$ is an increasing function of $\xi_0$ , $\Delta_{k+1}(x,\frac{1}{2})=0$ . By the induction hypothesis, $\textsf{M}^k$ is the optimal strategy of $\bigl(\frac{\xi_0 f_1(u) }{\xi_0 f_1(u) +(1-\xi_0)f_2(u)},k,x+u\bigr)$ -bandits and $\bigl(\frac{\xi_0 f_2(u) }{\xi_0 f_2(u) +(1-\xi_0)f_1(u)},k,x+u\bigr)$ -bandits, for $ u\in\mathbb{R}$ . Then we have

\begin{align*}&W\bigl(\xi_0,k+1,x,\textsf{M}^{k+1}\bigr) \notag \\[5pt] &\quad =\max\bigl\{W\bigl(\xi_0,k+1,x,\textsf{L}^{k+1}\bigr),W\bigl(\xi_0,k+1,x,\textsf{R}^{k+1}\bigr)\bigr\} \notag\\[5pt] &\quad =\max\biggl\{\mathbb{E}_{\xi_{0}}\biggl[V\biggl(\dfrac{\xi_0f_1(X_1)}{\xi_0f_1(X_1)+(1-\xi_0)f_2(X_1)},k,x+X_1\biggr)\biggr], \notag\\[5pt] &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mathbb{E}_{\xi_{0}}\biggl[V\biggl(\dfrac{\xi_0f_2(Y_1)}{\xi_0f_2(Y_1)+(1-\xi_0)f_1(Y_1)},k,x+Y_1\biggr)\biggr]\biggr\}.\end{align*}

Therefore, by Theorem 3.1,

\begin{align*}W(\xi_0,k+1,x,\textsf{M}^{k+1})=V(\xi_0,k+1,x).\end{align*}

Hence $\textsf{M}^{k+1}$ is the optimal strategy of $(\xi_0,k+1,x)$ -bandits, for any $x\in\mathbb{R}$ and $\xi_0\in[0,1]$ . The first part of the main theorem is proved.

Now we prove that if the strategy $\textsf{M}^n$ is optimal for $(\xi_0,n,x)$ -bandits, for any integer $n\ge 1$ , for all $ \xi_0\in[0,1]$ , $ x\in \mathbb{R}$ , then condition (I) holds.

Indeed, we only need to consider the case when $n=1$ . For any fixed $x\in\mathbb{R}$ , we have

\begin{align*}\mathbb{E}_{\xi_0}[\varphi(x+X_1)]&=\xi_0\int_{\mathbb R}\varphi(x+u)f_1(u)\,{\mathrm{d}} u+(1-\xi_0)\int_{\mathbb R} \varphi(x+u)f_2(u)\,{\mathrm{d}} u,\\[5pt] \mathbb{E}_{\xi_0}[\varphi(x+Y_1)]&=\xi_0\int_{\mathbb R}\varphi(x+u)f_2(u)\,{\mathrm{d}} u+(1-\xi_0)\int_{\mathbb R} \varphi(x+u)f_1(u)\,{\mathrm{d}} u.\end{align*}

Since the strategy $\textsf{M}^1$ , which chooses X if and only if $\xi_0\ge \frac{1}{2}$ , is optimal, we know that

\begin{align*}\mathbb{E}_{\xi_0}[\varphi(x+X_1)]-\mathbb{E}_{\xi_0}[\varphi(x+Y_1)]\ge 0\quad \text{if $\xi_0\ge \frac{1}{2}$.}\end{align*}

Then, for any fixed $x\in\mathbb{R}$ , we have

\begin{align*}(2\xi_0-1)\biggl[ \int_{\mathbb R}\varphi(x+u)f_1(u)\,{\mathrm{d}} u-\int_{\mathbb R}\varphi(x+u)f_2(u)\,{\mathrm{d}} u\biggr]\ge 0\quad \text{if $\xi_0\ge \frac{1}{2}$,}\end{align*}

which leads to

\begin{align*}\int_{\mathbb R}\varphi(x+u)f_1(u)\,{\mathrm{d}} u-\int_{\mathbb R}\varphi(x+u)f_2(u)\,{\mathrm{d}} u\ge 0.\end{align*}

This is clearly condition (I).

Theorem 4.1 gives a reasonable condition on $\varphi$ that is necessary and sufficient for the myopic strategy $\textsf{M}^n$ to be the optimal strategy. With Theorem 4.1 in hand, we immediately obtain the following two corollaries. The first corollary is the same as the result of Feldman [Reference Feldman14], obtained by applying Theorem 4.1 to the case $\varphi(x)=x$ . The second corollary answers Nouiehed and Ross’s conjecture for the two-armed case.

Corollary 4.2. (Feldman.) The myopic strategy $\textsf{M}^n$ is optimal for $(\xi_0,n,x)$ -bandits with utility function $\varphi(x)=x$ , for any integer $n\ge 1$ , for all $ \xi_0\in[0,1]$ , $ x\in \mathbb{R}$ , if and only if

\begin{align*}\mathbb{E}_\mathbb{P}[X_1\mid H_1]\ge \mathbb{E}_\mathbb{P}[Y_1\mid H_1].\end{align*}

Corollary 4.3. (Nouiehed and Ross’s conjecture.) Consider the two-armed Bernoulli bandit. Let the distributions $F_1$ and $F_2$ be $\operatorname{Bernoulli}(\alpha)$ and $\operatorname{Bernoulli}(\beta)$ with $\alpha>\beta$ . Under hypothesis $H_1$ , experiment X obeys $\operatorname{Bernoulli}(\alpha)$ and Y obeys $\operatorname{Bernoulli}(\beta)$ ; under hypothesis $H_2$ , X obeys $\operatorname{Bernoulli}(\beta)$ and Y obeys $\operatorname{Bernoulli}(\alpha)$ . Then, for any fixed positive integers k and n, Feldman’s strategy $\textsf{M}^n$ maximizes $\mathbb{P}(S_n\ge k)=\mathbb{E}_\mathbb{P}[I_A]$ , where $S_n=\sum_{i=1}^{n}Z_i^\theta$ denotes the number of wins in the first n trials and $I_A$ is the indicator function of the set $A=\{S_n\ge k\}$ .

Proof. Note that the proof of Theorem 4.1 also holds when the distributions are Bernoulli; in this case we only need to modify the calculation of expectations.

For any fixed k, let the utility function be $\varphi(x)=I_{[k,+\infty)}(x)$ . By Remark 4.2, this $\varphi$ satisfies condition (I), so Theorem 4.1 applies, and hence $\textsf{M}^n$ maximizes $\mathbb{P}(S_n\ge k)=\mathbb{E}_\mathbb{P}[I_A]=\mathbb{E}_{\xi_0}[\varphi(S_n)]$ .
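For small n the corollary can also be verified exhaustively. The sketch below (Python, illustrative code with hypothetical parameter values) enumerates every deterministic history-dependent strategy of a Bernoulli $(\xi_0,n,0)$-bandit, computes $\mathbb{P}(S_n\ge k)$ for each, and compares the maximum with the value achieved by $\textsf{M}^n$.

from itertools import product

def win_probability(xi0, n, k, strategy, alpha, beta):
    """P(S_n >= k) on model (1.1) with Bernoulli(alpha)/Bernoulli(beta) arms,
    for a deterministic strategy given as a dict mapping each history of
    observed payoffs (a tuple in {0,1}^i, i < n) to 1 (X-arm) or 0 (Y-arm)."""
    def prob(p_x, p_y):                        # success probabilities of the X- and Y-arm
        total = 0.0
        for xs in product([0, 1], repeat=n):
            for ys in product([0, 1], repeat=n):
                p = 1.0
                for a, b in zip(xs, ys):
                    p *= (p_x if a else 1 - p_x) * (p_y if b else 1 - p_y)
                z = []
                for i in range(n):
                    z.append(xs[i] if strategy[tuple(z)] == 1 else ys[i])
                if sum(z) >= k:
                    total += p
        return total
    return xi0 * prob(alpha, beta) + (1 - xi0) * prob(beta, alpha)

def myopic_strategy(xi0, n, alpha, beta):
    """M^n as a dict: along every history, update the posterior of H_1 and
    play the X-arm iff it is at least 1/2."""
    f1, f2 = {1: alpha, 0: 1 - alpha}, {1: beta, 0: 1 - beta}
    strat = {}
    def build(hist, xi):
        if len(hist) == n:
            return
        arm = 1 if xi >= 0.5 else 0
        strat[hist] = arm
        for u in (0, 1):
            if arm == 1:
                nxt = xi * f1[u] / (xi * f1[u] + (1 - xi) * f2[u])
            else:
                nxt = xi * f2[u] / (xi * f2[u] + (1 - xi) * f1[u])
            build(hist + (u,), nxt)
    build((), xi0)
    return strat

# Exhaustive check of a small instance (hypothetical parameter values).
xi0, n, k, alpha, beta = 0.6, 3, 2, 0.7, 0.4
histories = [h for i in range(n) for h in product([0, 1], repeat=i)]
best = max(win_probability(xi0, n, k, dict(zip(histories, c)), alpha, beta)
           for c in product([0, 1], repeat=len(histories)))
myopic = win_probability(xi0, n, k, myopic_strategy(xi0, n, alpha, beta), alpha, beta)
print(best, myopic)   # the two values coincide (up to rounding) by Corollary 4.3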

Remark 4.4. An anonymous reviewer reminded us that if the utility $\varphi$ is increasing, then Nouiehed and Ross’s conjecture for model (1.1) is in fact equivalent to proving that the myopic strategy maximizes $\mathbb{E}_{\xi_0}[\varphi(x+S_n)]$ .

Acknowledgements

We sincerely thank Professor Jordan Stoyanov for his valuable suggestions and amendments to this paper, and the anonymous reviewers for their help in improving it.

Funding information

Chen gratefully acknowledges the support of the National Key R&D Program of China (grant 2018YFA0703900), Shandong Provincial Natural Science Foundation, China (grant ZR2019ZD41), and Taishan Scholars Project. Lin gratefully acknowledges the support of the National Science Foundation of China (grant 12101367).

Competing interests

No competing interests arose during the preparation or publication of this article.

References

Abernethy, J., Hazan, E. and Rakhlin, A. (2008). Competing in the dark: an efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory (COLT 2008), pp. 263–274.
Agrawal, R. (1995). Sample mean based index policies with $O(\log n)$ regret for the multi-armed bandit problem. Adv. Appl. Prob. 27, 1054–1078.
Ahmad, S. H. A., Liu, M., Javidi, T., Zhao, Q. and Krishnamachari, B. (2009). Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inform. Theory 55, 4040–4050.
Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11, 2785–2836.
Audibert, J.-Y., Munos, R. and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410, 1876–1902.
Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47, 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R. (1995). Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 322–331. IEEE Computer Society Press.
Banks, J. S. and Sundaram, R. K. (1992). A class of bandit problems yielding myopic optimal strategies. J. Appl. Prob. 29, 625–632.
Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Springer, Netherlands.
Bradt, R. N., Johnson, S. M. and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27, 1060–1074.
Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41, 1516–1541.
Chen, Z., Epstein, L. G. and Zhang, G. (2021). A central limit theorem, loss aversion and multi-armed bandits. Available at arXiv:2106.05472.
Deo, S., Iravani, S., Jiang, T., Smilowitz, K. and Samuelson, S. (2013). Improving health outcomes through better capacity allocation in a community-based chronic care model. Operat. Res. 61, 1277–1294.
Feldman, D. (1962). Contributions to the ‘two-armed bandit’ problem. Ann. Math. Statist. 33, 847–856.
Garbe, R. and Glazebrook, K. D. (1998). Stochastic scheduling with priority classes. Math. Operat. Res. 23, 119–144.
Gast, N., Gaujal, B. and Khun, K. (2022). Testing indexability and computing Whittle and Gittins index in subcubic time. Available at arXiv:2203.05207.
Gittins, J. and Jones, D. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics, ed. J. Gani, pp. 241–266. North-Holland, Amsterdam.
Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. Ann. Statist. 20, 1625–1636.
Gittins, J., Glazebrook, K. and Weber, R. (2011). Multi-Armed Bandit Allocation Indices. John Wiley.
Honda, J. and Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In 23rd Conference on Learning Theory (COLT 2010), pp. 67–79.
Kaufmann, E. (2018). On Bayesian index policies for sequential resource allocation. Ann. Statist. 46, 842–865.
Kelley, T. A. (1974). A note on the Bernoulli two-armed bandit problem. Ann. Statist. 2, 1056–1062.
Kim, M. J. and Lim, A. E. (2015). Robust multiarmed bandit problems. Manag. Sci. 62, 264–285.
Kuleshov, V. and Precup, D. (2014). Algorithms for multi-armed bandit problems. Available at arXiv:1402.6028.
Lai, T. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22.
Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15, 1091–1114.
Lee, E., Lavieri, M. S. and Volk, M. (2019). Optimal screening for hepatocellular carcinoma: a restless bandit model. Manuf. Serv. Oper. Manag. 21, 198–212.
Mahajan, A. and Teneketzis, D. (2008). Multi-armed bandit problems. In Foundations and Applications of Sensor Management, ed. A. O. Hero et al., pp. 121–151. Springer, Boston.
Niño-Mora, J. (2007). A $(2/3)n^3$ fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS J. Computing 19, 596–606.
Nouiehed, M. and Ross, S. M. (2018). A conjecture on the Feldman bandit problem. J. Appl. Prob. 55, 318–324.
Rodman, L. (1978). On the many-armed bandit problem. Ann. Prob. 6, 491–498.
Rusmevichientong, P., Shen, Z.-J. M. and Shmoys, D. B. (2010). Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operat. Res. 58, 1666–1680.
Scott, S. L. (2010). A modern Bayesian look at the multi-armed bandit. Appl. Stoch. Models Business Industry 26, 639–658.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294.
Washburn, R. B. (2008). Application of multi-armed bandits to sensor management. In Foundations and Applications of Sensor Management, ed. A. O. Hero et al., pp. 153–175. Springer, Boston.
Weber, R. R. and Weiss, G. (1990). On an index policy for restless bandits. J. Appl. Prob. 27, 637–648.
Whittle, P. (1988). Restless bandits: activity allocation in a changing world. J. Appl. Prob. 25, 287–298.