A survey on metrics for the evaluation of user simulations

Olivier Pietquin; Helen Hastie

doi:10.1017/S0269888912000343

Published online by Cambridge University Press: 28 November 2012

Olivier Pietquin: SUPELEC – IMS-MaLIS Research Group, UMI 2958 (GeorgiaTech – CNRS), 2 rue Edouard Belin, 57070 Metz, France; e-mail: olivier.pietquin@supelec.fr
Helen Hastie: School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK; e-mail: h.hastie@hw.ac.uk

Abstract

User simulation is an important research area in the field of spoken dialogue systems (SDSs) because collecting and annotating real human–machine interactions is often expensive and time-consuming. However, such data are generally required for designing, training and assessing dialogue systems. User simulations are especially needed when dialogue management strategies are optimized with machine learning methods such as reinforcement learning, where the amount of data necessary for training is larger than existing corpora. The quality of the user simulation is therefore of crucial importance because it strongly influences both the SDS performance analysis and the learnt strategy. Assessing the quality of simulated dialogues and user simulation methods remains an open issue: assessment metrics are required, but no metric has been commonly adopted. In this paper, we survey the user simulation metrics found in the literature, propose some extensions and discuss these metrics against a list of desired features.

Information

Type: Articles
Information: The Knowledge Engineering Review, Volume 28, Issue 1, March 2013, pp. 59–73

DOI: https://doi.org/10.1017/S0269888912000343
Copyright: Copyright © Cambridge University Press 2012
