
Plan-based reward shaping for multi-agent reinforcement learning

Published online by Cambridge University Press:  11 February 2016

Sam Devlin
Affiliation: Department of Computer Science, University of York, York, YO10 5GH, England
e-mail: sam.devlin@york.ac.uk

Daniel Kudenko
Affiliation: Department of Computer Science, University of York, York, YO10 5GH, England
e-mail: daniel.kudenko@york.ac.uk

Abstract

Recent theoretical results have justified the use of potential-based reward shaping as a way to improve the performance of multi-agent reinforcement learning (MARL). However, the question remains of how to generate a useful potential function.
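For readers unfamiliar with the technique, potential-based reward shaping adds F(s, s') = γΦ(s') − Φ(s) to the environment reward, where Φ is a potential function over states; this form leaves the underlying task's optimal behaviour unchanged. The Python sketch below is a minimal single-learner illustration of how such a shaped reward could enter a tabular Q-learning update; the placeholder potential function, hyperparameters, and update code are assumptions for illustration, not the paper's implementation.

from collections import defaultdict

GAMMA = 0.99   # discount factor
ALPHA = 0.1    # learning rate

def potential(state):
    """Placeholder potential function; in plan-based shaping this would
    reflect how far the state has progressed along a high-level plan."""
    return 0.0

def shaped_reward(reward, state, next_state):
    # Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s),
    # added on top of the environment reward.
    return reward + GAMMA * potential(next_state) - potential(state)

Q = defaultdict(float)  # tabular action values, keyed by (state, action)

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step driven by the shaped reward."""
    target = shaped_reward(reward, state, next_state) + \
             GAMMA * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])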

Previous research demonstrated the use of STRIPS operator knowledge to automatically generate a potential function for single-agent reinforcement learning. Following up on this work, we investigate the use of STRIPS planning knowledge in the context of MARL.
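In plan-based shaping, the potential of a state is typically tied to how far along a high-level STRIPS plan the agent has progressed, so states matching later plan steps receive higher potential. The sketch below illustrates one way such a Φ could be computed; the plan encoding, the subset-matching test, and the scaling constant are assumptions chosen for illustration rather than the authors' exact construction.

# A STRIPS-style plan is treated here as an ordered list of abstract states
# (sets of propositions) the agent is expected to pass through.
plan = [
    frozenset({"at(room1)"}),
    frozenset({"at(room2)", "holding(key)"}),
    frozenset({"at(room3)", "door_open"}),
]

OMEGA = 10.0  # arbitrary scaling factor for the potential

def plan_potential(abstract_state, plan):
    """Potential proportional to the furthest plan step whose propositions
    all hold in the current abstract state."""
    reached = 0
    for i, step in enumerate(plan, start=1):
        if step <= abstract_state:  # every proposition of this step holds
            reached = i
    return OMEGA * reached

# Example: an agent in room2 holding the key has reached the second plan step.
print(plan_potential(frozenset({"at(room2)", "holding(key)"}), plan))  # 20.0

In the multi-agent setting studied here, each agent could apply such a potential either to its own individual plan or to a shared joint plan, which is where the conflicts discussed below arise.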

Our results show that a potential function based on joint or individual plan knowledge can significantly improve MARL performance compared with no shaping. In addition, we investigate the limitations of individual plan knowledge as a source of reward shaping in cases where the combination of individual agent plans causes conflict.

Type: Articles
Copyright: © Cambridge University Press, 2016

