
Formal and functional assessment of the pyramid method for summary content evaluation*

Published online by Cambridge University Press:  06 April 2009

REBECCA J. PASSONNEAU*
Affiliation: Center for Computational Learning Systems, Columbia University, NY 10115, USA. E-mail: becky@cs.columbia.edu

Abstract

Pyramid annotation makes it possible to evaluate quantitatively and qualitatively the content of machine-generated (or human) summaries. Evaluation methods must prove themselves against the same measuring stick – evaluation – as other research methods. First, a formal assessment of pyramid data from the 2003 Document Understanding Conference (DUC) is presented; this addresses whether the form of annotation is reliable and whether score results are consistent across annotators. A combination of interannotator reliability measures of the two manual annotation phases (pyramid creation and annotation of system peer summaries against pyramid models), and significance tests of the similarity of system scores from distinct annotations, produces highly reliable results. The most rigorous test consists of a comparison of peer system rankings produced from two independent sets of pyramid and peer annotations, which produce essentially the same rankings. Three years of DUC data (2003, 2005, 2006) are used to assess the reliability of the method across distinct evaluation settings: distinct systems, document sets, summary lengths, and numbers of model summaries. This functional assessment addresses the method's ability to discriminate systems across years. Results indicate that the statistical power of the method is more than sufficient to identify statistically significant differences among systems, and that the statistical power varies little across the 3 years.
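To make the scoring step referenced above concrete, the following Python fragment is a minimal, illustrative sketch rather than the DUC implementation; the names (build_pyramid, pyramid_score) and the data layout are ours. It captures the core idea: each summary content unit (SCU) is weighted by the number of model summaries expressing it, a peer summary's raw score is the sum of the weights of the SCUs it contains, and the normalized score divides by the best score attainable with the same number of SCUs.

from collections import Counter

def build_pyramid(model_scus):
    """model_scus: a list of SCU-label sets, one set per model summary.
    Returns a Counter mapping each SCU label to its weight (the number of
    model summaries that express it)."""
    weights = Counter()
    for scus in model_scus:
        weights.update(scus)
    return weights

def pyramid_score(peer_scus, weights):
    """Normalized pyramid-style score for a peer summary's set of SCU labels."""
    raw = sum(weights.get(scu, 0) for scu in peer_scus)
    # Optimal score: the same number of SCUs drawn from the top of the pyramid.
    top = sorted(weights.values(), reverse=True)[:len(peer_scus)]
    optimal = sum(top)
    return raw / optimal if optimal else 0.0

if __name__ == "__main__":
    models = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C"}, {"A", "B"}]
    pyramid = build_pyramid(models)                 # weights: A=4, B=3, C=2, D=1
    print(pyramid_score({"A", "C", "D"}, pyramid))  # (4+2+1)/(4+3+2) = 7/9 ≈ 0.78

Under this simplification, the formal assessment described in the abstract amounts to scoring the same peer summaries against two independently annotated pyramids and checking whether the induced system rankings agree.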

Type: Papers
Copyright: © Cambridge University Press 2009

