Developing, evaluating, and refining an automatic generator of diagnostic multiple choice cloze questions to assess children's comprehension while reading*

Published online by Cambridge University Press:  14 April 2016

JACK MOSTOW
Affiliation:
Project LISTEN, School of Computer Science, Carnegie Mellon University, RI-NSH 4103, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA e-mail: mostow@cs.cmu.edu
YI-TING HUANG
Affiliation:
Information Management, National Taiwan University No. 1, Sec. 4, Roosevelt Road, 10617 Taipei, Taiwan e-mail: d97008@im.ntu.edu.tw
HYEJU JANG
Affiliation:
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA e-mail: hyejuj@cs.cmu.edu
ANDERS WEINSTEIN
Affiliation:
School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA e-mail: andersw@cs.cmu.edu
JOE VALERI
Affiliation:
e-mail: joevaleri@gmail.com
DONNA GATES
Affiliation:
e-mail: donnamgates7123@gmail.com

Abstract

We describe the development, pilot-testing, refinement, and four evaluations of Diagnostic Question Generator (DQGen), which automatically generates multiple choice cloze (fill-in-the-blank) questions to test children's comprehension while reading a given text. Unlike previous methods, DQGen tests comprehension not only of an individual sentence but of the context preceding it. To test different aspects of comprehension, DQGen generates three types of distractors: ungrammatical distractors test syntax; nonsensical distractors test semantics; and locally plausible distractors test inter-sentential processing.

  (1) A pilot study of DQGen 2012 evaluated its overall questions and individual distractors, guiding its refinement into DQGen 2014.

  (2) Twenty-four elementary students generated 200 responses to multiple choice cloze questions that DQGen 2014 generated from forty-eight stories. In 130 of the responses, the child chose the correct answer. We define the distractiveness of a distractor as the frequency with which students choose it over the correct answer. The incorrect responses were consistent with expected distractiveness: twenty-seven were plausible, twenty-two were nonsensical, fourteen were ungrammatical, and seven were null.

  (3) To compare DQGen 2014 against DQGen 2012, five human judges categorized candidate choices without knowing their intended type or whether they were the correct answer or a distractor generated by DQGen 2012 or DQGen 2014. The percentage of distractors categorized as their intended type was significantly higher for DQGen 2014.

  (4) We evaluated DQGen 2014 against human performance based on 1,486 similarly blind categorizations by twenty-seven judges of sixteen correct answers, forty-eight distractors generated by DQGen 2014, and 504 distractors authored by twenty-one humans. Surprisingly, DQGen 2014 did significantly better than humans at generating ungrammatical distractors and marginally better than humans at generating nonsensical distractors, albeit slightly worse at generating plausible distractors. Moreover, vetting DQGen 2014's output and writing distractors only when necessary would halve the time to write them all, and produce higher quality distractors.
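
As an illustration of the response breakdown reported in evaluation (2), the short Python sketch below tabulates each response type's share of the 200 responses. The counts come from the abstract; treating the per-type share of all responses as a proxy for distractiveness is an assumption made here, since the abstract does not say how the frequency is normalized (for example per distractor or per question). The names `responses` and `choice_rates` are hypothetical.

```python
# Response counts reported in evaluation (2) of the abstract.
responses = {
    "correct": 130,
    "plausible": 27,
    "nonsensical": 22,
    "ungrammatical": 14,
    "null": 7,
}


def choice_rates(counts):
    """Return each response type's share of all responses.

    Assumption: a simple share of total responses stands in for
    distractiveness; the paper's exact normalization may differ.
    """
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


rates = choice_rates(responses)
for kind, rate in rates.items():
    print(f"{kind:14s} {rate:.3f}")
```

Under this reading, the ordering plausible > nonsensical > ungrammatical matches the expected distractiveness described in the abstract.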

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

*

This paper combines material from Mostow and Jang (2012), our AIED2015 paper (Huang and Mostow 2015) on a comparison to human performance, and substantial new content including improvements to DQGen and the evaluations reported in Sections 4.1 and 4.2. The research reported here was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080157, the National Science Foundation through Grant IIS1124240, and by the Taiwan National Science Council through the Graduate Students Study Abroad Program. We thank the other LISTENers who contributed to this work; everyone who categorized and wrote distractors; the reviewers of our BEA2012 and AIED2015 papers and this article for their helpful comments; and Prof. Y. S. Sun at National Taiwan University and Dr. M. C. Chen at Academia Sinica for enabling the first author to participate in this program. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute, the U.S. Department of Education, the National Science Foundation, or the National Science Council.

References

Agarwal, M., and Mannem, P., 2011a. Automatic gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics. 209 N. Eighth Street, Stroudsburg, PA 18360, USA, pp. 56–64.
Agarwal, M., Shah, R., and Mannem, P., 2011b. Automatic question generation using discourse cues. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics. 209 N. Eighth Street, Stroudsburg, PA 18360, USA, pp. 1–9.
Aldabe, I., and Maritxalar, M. 2010. Automatic distractor generation for domain specific texts advances in natural language processing. In Loftsson, H., Rögnvaldsson, E., and Helgadóttir, S. (eds.), The 7th International Conference on NLP, Reykjavík, Iceland, pp. 27–38, Berlin/Heidelberg: Springer.
Aldabe, I., Maritxalar, M., and Martinez, E. 2007. Evaluating and improving distractor-generating heuristics. In Ezeiza, N., Maritxalar, M., and S. M. (eds.), The Workshop on NLP for Educational Resources. In conjunction with RANLP07, Amsterdam, Netherlands, pp. 7–13. Borovets, Bulgaria.
Aldabe, I., Maritxalar, M., and Mitkov, R. 2009, July 6–10. A study on the automatic selection of candidate sentences and distractors. In Dimitrova, V., Mizoguchi, R., Boulay, B. D., and Graesser, A. (eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED2009), pp. 656–8. Brighton, UK: IOS Press.
Becker, L., Basu, S., and Vanderwende, L. 2012. Mind the gap: learning to choose gaps for question generation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 742–51. Montreal, Canada: Association for Computational Linguistics.
Biemiller, A., 2009. Words Worth Teaching: Closing the Vocabulary Gap. Columbus, OH: SRA/McGraw-Hill.
Brown, J. C., Frishkoff, G. A., and Eskenazi, M. 2005, October 6–8. Automatic question generation for vocabulary assessment. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 819–26. Vancouver, BC, Canada. Stroudsburg, PA, USA: Association for Computational Linguistics.
Burton, S. J., Sudweeks, R. R., Merrill, P. F., and Wood, B., 1991. How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty. Salt Lake City, UT: Brigham Young University Testing Services and The Department of Instructional Science.
Cassels, J. R. T., and Johnstone, A. H. 1984. The effect of language on student performance on multiple choice tests in chemistry. Journal of Chemical Education 61 (7): 613.
Chang, K.-M., Nelson, J., Pant, U., and Mostow, J. 2013. Toward exploiting eeg input in a reading tutor. International Journal of Artificial Intelligence in Education 22(1, “Best of AIED2011 Part 1”): 29–41.
Chen, W., Mostow, J., and Aist, G. S. 2013. Recognizing young readers’ spoken questions. International Journal of Artificial Intelligence in Education 21 (4): 255–69.
Coniam, D. 1997. A preliminary inquiry into using corpus word frequency data in the automatic generation of english language cloze tests. CALICO Journal 14 (2–4): 15–33.
Correia, R., Baptista, J., Mamede, N., Trancoso, I., and Eskenazi, M. 2010, September 22–24. Automatic generation of cloze question distractors. In Proceedings of the Interspeech 2010 Satellite Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, Waseda University, Tokyo, Japan.
Fellbaum, C. 2012. Wordnet. The Encyclopedia of Applied Linguistics: Blackwell Publishing Ltd. Hoboken, New Jersey, USA.
Gates, D., Aist, G., Mostow, J., Mckeown, M., and Bey, J. 2011. How to generate cloze questions from definitions: a syntactic approach. In Proceedings of the AAAI Symposium on Question Generation, pp. 19–22. Arlington, VA, AAAI Press.
Goto, T., Kojiri, T., Watanabe, T., Iwata, T., and Yamada, T. 2010. Automatic generation system of multiple-choice cloze questions and its evaluation. Knowledge Management & E-Learning: An International Journal (KM& EL) 2 (3): 210–24.
Graesser, A. C., and Bertus, E. L. 1998. The construction of causal inferences while reading expository texts on science and technology. Scientific Studies of Reading 2 (3): 247–69.
Haladyna, T. M., Downing, S. M., and Rodriguez, M. C. 2002. A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement In Education 15 (3): 309–34.
Heilman, M., and Smith, N. A. 2009. Question Generation Via Overgenerating Transformations and Ranking (Technical Report CMU-LTI-09-013). Pittsburgh, PA: Carnegie Mellon University.
Heilman, M., and Smith, N. A. 2010, June. Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pp. 609–17. Los Angeles, CA, Association for Computational Linguistics.
Hensler, B. S., and Beck, J. E. 2006, June 26–30. Better student assessing by finding difficulty factors in a fully automated comprehension measure [best paper nominee]. In Ashley, K. and Ikeda, M. (eds.), Proceedings of the 8th International Conference on Intelligent Tutoring Systems, pp. 21–30. Jhongli, Taiwan, Springer-Verlag.
Huang, Y.-T., Chen, M. C., and Sun, Y. S. 2012, November 26–30. Personalized automatic quiz generation based on proficiency level estimation. In Proceedings of the 20th International Conference on Computers in Education (ICCE 2012), pp. 553–60. Singapore.
Huang, Y.-T., and Mostow, J. 2015, June 22–26. Evaluating human and automated generation of distractors for diagnostic multiple-choice cloze questions to assess children’s reading comprehension. In Conati, C., Heffernan, N., Mitrovic, A., and Verdejo, M. F. (eds.), Proceedings of the 17th International Conference on Artificial Intelligence in Education, pp. 155–64. Madrid, Spain, Lecture Notes in Computer Science, vol. 9112. Switzerland: Springer International Publishing.
Kendall, M. G., and Babington Smith, B. 1939. The problem of m rankings. The Annals of Mathematical Statistics 10 (3): 275–87.
Kintsch, W. 2005. An overview of top-down and bottom-up effects in comprehension: the ci perspective. Discourse Processes 39 (2–3): 125–8.
Klein, D., and Manning, C. D. 2003, July 7–12. Accurate unlexicalized parsing. In E. W. Hinrichs and D. Roth (eds.), Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 423–30. Sapporo, Japan, Association for Computational Linguistics.
Kolb, P. 2008. Disco: a multilingual database of distributionally similar words. In Proceedings of KONVENS-2008 (Konferenz zur Verarbeitung natürlicher Sprache), pp. 5–12. Berlin.
Kolb, P. 2009. Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics-NODALIDA’09, Odense, Denmark.
Landis, J. R., and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics 33 (1): 159–74.
Lee, J., and Seneff, S. 2007, August 27–31. Automatic generation of cloze items for prepositions. In Proceedings of INTERSPEECH, pp. 2173–6. Antwerp, Belgium.
Li, L., Roth, B., and Sporleder, C. 2010. Topic models for word sense disambiguation and token-based idiom detection. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1138–47. Uppsala, Sweden, Association for Computational Linguistics.
Li, L., and Sporleder, C. 2009. Classifier combination for contextual idiom detection without labelled data. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 315–23. Singapore, Association for Computational Linguistics.
Lin, Y.-C., Sung, L.-C., and Chen, M. C., 2007. An automatic multiple-choice question generation scheme for english adjective understanding. In Workshop on Modeling, Management and Generation of Problems/Questions in eLearning, the 15th International Conference on Computers in Education (ICCE 2007), Amsterdam, Netherlands, pp. 137–42.
Liu, C.-L., Wang, C.-H., Gao, Z.-M., and Huang, S.-M. 2005, June 29. Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, Ann Arbor, Michigan, pp. 1–8. Stroudsburg, PA: Association for Computational Linguistics.
Ming, L., Calvo, R. A., Aditomo, A., and Pizzato, L. A. 2012. Using wikipedia and conceptual graph structures to generate questions for academic writing support. IEEE Transactions on Learning Technologies 5 (3): 251–63.
Mitkov, R., Ha, L. A., and Karamanis, N. 2006. A computer-aided environment for generating multiple choice test items. Natural Language Engineering 12 (2): 177–94.
Mitkov, R., Ha, L. A., Varga, A., and Rello, L. 2009, March 31. Semantic similarity of distractors in multiple-choice tests: extrinsic evaluation. In Basili, R. and Pennacchiotti, M. (eds.), EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pp. 49–56. Athens, Greece, Association for Computational Linguistics.
Mostow, J. 2013, July. Lessons from project listen: what have we learned from a reading tutor that listens? (keynote). In H. C. Lane, K. Yacef, J. Mostow, and P. Pavlik (eds.), Proceedings of the 16th International Conference on Artificial Intelligence in Education, pp. 557–8. Memphis, TN, LNAI, vol. 7926. Springer.
Mostow, J., Beck, J. E., Bey, J., Cuneo, A., Sison, J., Tobin, B., and Valeri, J. 2004. Using automated questions to assess reading comprehension, vocabulary, and effects of tutorial interventions. Technology, Instruction, Cognition and Learning 2 (1–2): 97–134.
Mostow, J., and Chen, W. 2009, July 6–10. Generating instruction automatically for the reading strategy of self-questioning. In Dimitrova, V., Mizoguchi, R., Boulay, B. D., and Graesser, A. (eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education, pp. 465–72. Brighton, UK: IOS Press.
Mostow, J., and Jang, H. 2012, June 7. Generating diagnostic multiple choice comprehension cloze questions. In NAACL-HLT 2012 7th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 136–46. Montréal, Association for Computational Linguistics.
Niraula, N. B., Rus, V., Stefanescu, D., and Graesser, A. C. 2014. Mining gap-fill questions from tutorial dialogues. In Proceedings of the 7th International Conference on Educational Data Mining, pp. 265–8. London, UK.
Pearson, P. D., and Hamm, D. N. 2005. The history of reading comprehension assessment. In Paris, S. G. and Stahl, S. A. (eds.), Children’s Reading Comprehension and Assessment, pp. 13–69. London, United Kingdom, CIERA.
Pino, J., Heilman, M., and Eskenazi, M. 2008. A selection strategy to improve cloze question quality. In Proceedings of the Workshop on Intelligent Tutoring Systems for Ill-Defined Domains. 9th International Conference on Intelligent Tutoring Systems, pp. 22–34. Montreal, Canada.
Piwek, P., and Boyer, K. E. 2012. Varieties of question generation: introduction to this special issue. Dialogue and Discourse 3 (2): 1–9.
Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., and Manning, C. 2010. A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 492–501. MIT, Cambridge, MA, Association for Computational Linguistics.
Rus, V., Wyse, B., Piwek, P., Lintean, M., Stoyanchev, S., and Moldovan, C. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference, pp. 251–7. Dublin, Ireland, Association for Computational Linguistics.
Shrout, P. E., and Fleiss, J. L. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86 (2): 420–8.
Sleator, D. D. K., and Temperley, D. 1993, August 10–13. Parsing english with a link grammar. Third International Workshop on Parsing Technologies, Tilburg, NL, and Durbuy, Belgium.
Smith, S., Sommers, S., and Kilgarriff, A. 2008. Learning words right with the sketch engine and webbootcat: automatic cloze generation from corpora and the web. In Proceedings of the 25th International Conference of English Teaching and Learning & 2008 International Conference on English Instruction and Assessment, pp. 1–8. Lisbon, Portugal.
Sumita, E., Sugaya, F., and Yamamoto, S. 2005. Measuring non-native speakers’ proficiency of english by using a test with automatically-generated fill-in-the-blank questions. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, pp. 61–8. Ann Arbor, Michigan, Association for Computational Linguistics.
Tapanainen, P., and Järvinen, T. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pp. 64–71. Washington, DC, Association for Computational Linguistics.
Toutanova, K., Klein, D., Manning, C., and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of the Human Language Technology Conference and Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada, pp. 252–9.
Unspecified. 2006. Tiny invaders, National Geographic Explorer (Pioneer Edition) http://ngexplorer.cengage.com/pioneer/.
van den Broek, P., Everson, M., Virtue, S., Sung, Y., and Tzeng, Y. 2002. Comprehension and memory of science texts: inferential processes and the construction of a mental representation. In Otero, J., Leon, J., and Graesser, A. C. (eds.), The Psychology of Science Text Comprehension, pp. 131–154. Mahwah, NJ: Erlbaum.
Zesch, T., and Melamud, O. 2014. Automatic generation of challenging distractors using context-sensitive inference rules. In Workshop on Innovative Use of NLP for Building Educational Applications (BEA), pp. 143–8. Baltimore, MD.
Zhang, X., Mostow, J., and Beck, J. E. 2007, July 9–13. Can a computer listen for fluctuations in reading comprehension? In R. Luckin, K. R. Koedinger, and J. Greer (eds.), Proceedings of the 13th International Conference on Artificial Intelligence in Education, pp. 495–502. Marina del Rey, CA: IOS Press.