
Assessing user simulation for dialog systems using human judges and automatic evaluation measures

Published online by Cambridge University Press:  01 February 2011

HUA AI
Affiliation: Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, USA
e-mail: hua@cs.pitt.edu, iamhuaai@gmail.com

DIANE LITMAN
Affiliation: Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, USA
e-mail: litman@cs.pitt.edu

Abstract

As user simulations of various kinds are built to assist dialog system development, there is an increasing need to assess their quality quickly and reliably. Previous studies have proposed several automatic evaluation measures for this purpose, but the validity of these measures has not been fully established. We present an assessment study in which human judgments of user simulation quality are collected as the gold standard against which the automatic evaluation measures are validated. We show that a ranking model can be built from the automatic measures to rank the simulations in the same order as the human judgments. We further show that the ranking model can be improved by adding a simple feature derived from time-series analysis.
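
To make the ranking-model idea concrete, here is a minimal sketch (not the authors' implementation): it trains a pairwise ranker that scores user simulations from their automatic evaluation measures so that the induced ordering matches a human-judged ranking. The feature values, the human_rank labels, and the choice of pairwise logistic regression are all illustrative assumptions.

import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

# Each row holds automatic measures computed for one user simulation
# (values and features are illustrative placeholders, not real data).
X = np.array([
    [0.62, 0.31, 1.8],
    [0.55, 0.40, 1.2],
    [0.71, 0.28, 2.1],
    [0.48, 0.35, 0.9],
])
human_rank = np.array([2, 1, 3, 0])  # human-judged quality (higher = better)

# Pairwise training data: for each simulation pair (i, j), the label
# records whether human judges ranked simulation i above simulation j.
pairs, labels = [], []
for i, j in combinations(range(len(X)), 2):
    pairs.append(X[i] - X[j])
    labels.append(int(human_rank[i] > human_rank[j]))

clf = LogisticRegression().fit(np.array(pairs), np.array(labels))

# A linear score whose ordering approximates the human ranking.
scores = X @ clf.coef_.ravel()
print(np.argsort(-scores))  # predicted ranking, best simulation first

In a sketch like this, a time-series-based measure would simply enter as one more column of X; the abstract reports that adding such a feature improves the ranking model.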

Type: Articles
Copyright: © Cambridge University Press 2011
