Published online by Cambridge University Press: 06 April 2009
Pyramid annotation makes it possible to evaluate the content of machine-generated (or human) summaries both quantitatively and qualitatively. Like any other research method, an evaluation method must itself prove reliable under evaluation. First, a formal assessment of pyramid data from the 2003 Document Understanding Conference (DUC) is presented; it addresses whether the form of annotation is reliable and whether scores are consistent across annotators. A combination of interannotator reliability measures for the two manual annotation phases (pyramid creation and annotation of system peer summaries against pyramid models), together with significance tests of the similarity of system scores derived from distinct annotations, shows the method to be highly reliable. The most rigorous test compares peer system rankings produced from two independent sets of pyramid and peer annotations; the two sets produce essentially the same rankings. Three years of DUC data (2003, 2005, 2006) are then used to assess the reliability of the method across distinct evaluation settings: different systems, document sets, summary lengths, and numbers of model summaries. This functional assessment addresses the method's ability to discriminate among systems across years. Results indicate that the statistical power of the method is more than sufficient to identify statistically significant differences among systems, and that this power varies little across the three years.
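To make the scoring step concrete, the following is a minimal sketch of the original pyramid score: each Summary Content Unit (SCU) in the pyramid carries a weight equal to the number of model summaries that express it, and a peer summary's score is the total weight of the SCUs it expresses, normalized by the maximum weight achievable with the same number of SCUs. The SCU names and weights below are hypothetical toy data, not drawn from the DUC annotations.

```python
from typing import Dict, List

def pyramid_score(pyramid: Dict[str, int], expressed: List[str]) -> float:
    """Original pyramid score: observed SCU weight in the peer summary,
    normalized by the best possible weight for the same SCU count."""
    observed = sum(pyramid[scu] for scu in expressed)
    # Ideal summary of the same length picks the highest-weight SCUs.
    top_weights = sorted(pyramid.values(), reverse=True)
    maximum = sum(top_weights[:len(expressed)])
    return observed / maximum if maximum else 0.0

# Toy pyramid built from 4 model summaries: SCU -> weight.
pyramid = {"scu_a": 4, "scu_b": 3, "scu_c": 2, "scu_d": 1}
# A peer expressing scu_a and scu_c scores 6 against an ideal of 7.
print(pyramid_score(pyramid, ["scu_a", "scu_c"]))  # 0.857...
```

The normalization is what lets scores be compared across summaries of different lengths, since each summary is measured against the best it could have done with as many SCUs as it expressed.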
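The ranking comparison described above can be sketched with a rank correlation between the system orderings produced by the two independent annotations; Spearman's rho is used here as an illustrative choice (the abstract does not specify the statistic), and the five-system rankings are hypothetical.

```python
from typing import Sequence

def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman rank correlation for two untied rankings:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks of 5 peer systems under two independent annotations;
# only systems 2 and 3 swap places, so agreement is high.
print(spearman_rho([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 0.9
```

A rho near 1 indicates that the two annotation sets order the systems almost identically, which is the sense in which two independent annotations "produce essentially the same rankings".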