
Some memoryless bandit policies

Published online by Cambridge University Press: 14 July 2016

Erol A. Peköz
Affiliation: Boston University
Postal address: School of Management, Boston University, Boston, MA 02215, USA. Email address: pekoz@bu.edu

Abstract

We consider a multiarmed bandit problem where each arm, when pulled, generates independent and identically distributed nonnegative rewards from some unknown distribution. The goal is to maximize the long-run average reward per pull, under the restriction that any previously learned information is forgotten whenever a switch between arms is made. We present several policies and a peculiarity surrounding them.
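The abstract does not spell out the policies themselves, but the "memoryless" constraint it describes can be made concrete with a small simulation sketch: the policy may keep statistics only about the arm it is currently pulling, and must discard everything it has learned the moment it switches. The switching rule below (leave an arm once its running mean falls under a fixed threshold) and the names `simulate`, `threshold`, and the example arms are illustrative assumptions, not the policies from the paper.

```python
import random

def simulate(arm_distributions, threshold=0.5, pulls=100_000, seed=0):
    """Long-run average reward per pull under a hypothetical memoryless policy.

    The policy tracks statistics for the CURRENT arm only; on a switch,
    all learned information is forgotten, as the abstract requires.
    """
    rng = random.Random(seed)
    total = 0.0
    arm = rng.randrange(len(arm_distributions))
    arm_sum, arm_count = 0.0, 0          # statistics for the current arm only
    for _ in range(pulls):
        reward = arm_distributions[arm](rng)  # i.i.d. nonnegative reward
        total += reward
        arm_sum += reward
        arm_count += 1
        if arm_sum / arm_count < threshold:
            # Assumed switching rule: abandon the arm and forget everything.
            arm = rng.randrange(len(arm_distributions))
            arm_sum, arm_count = 0.0, 0
    return total / pulls                 # long-run average reward per pull

# Example: two arms with Bernoulli rewards of unknown (to the policy) means.
arms = [lambda r: float(r.random() < 0.3),   # mean 0.3
        lambda r: float(r.random() < 0.7)]   # mean 0.7
print(simulate(arms))
```

Under these assumptions the simulation tends to settle on the better arm, since the running mean of the inferior arm falls below the threshold more often; the paper's actual policies and the "peculiarity" surrounding them are developed in the full text.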

Type
Short Communications
Copyright
Copyright © Applied Probability Trust 2003 

