Published online by Cambridge University Press: 27 July 2009
We consider a bandit problem with infinitely many Bernoulli arms whose unknown parameters are i.i.d. We present two policies that maximize the almost sure average reward over an infinite horizon. Neither policy ever returns to a previously observed arm after switching to a new one or retains information from discarded arms, and runs of failures indicate the selection of a new arm. The first policy is nonstationary and requires no information about the distribution of the Bernoulli parameter. The second is stationary and requires only partial information; its optimality is established via renewal theory. We also develop ε-optimal stationary policies that require no information about the distribution of the unknown parameter and discuss universally optimal stationary policies.