MUSED: A multimedia multi-document dataset for topic segmentation

Published online by Cambridge University Press: 22 October 2018

PEDRO MOTA
Affiliation:
Instituto Superior Técnico, Carnegie Mellon University, Rua Alves Redol 9, Lisbon 1000-029, Portugal e-mail: pedrom@andrew.cmu.edu
MAXINE ESKENAZI
Affiliation:
Carnegie Mellon University, 6413 Gates Hillman Complex, 5000 Forbes Ave, Pittsburgh, PA 15213, USA e-mail: max@cs.cmu.edu
LUÍSA COHEUR
Affiliation:
Instituto Superior Técnico, Rua Alves Redol 9, Lisbon 1000-029, Portugal e-mail: luisa.coheur@inesc-id.pt

Abstract

Research on topic segmentation has recently focused on segmenting documents by exploiting other documents that cover the same topics. Properly evaluating such approaches requires a dataset of related documents. However, existing datasets contain few related documents per domain. In addition, most available datasets do not include documents from different media sources (PowerPoints, videos, etc.), which pose specific challenges to segmentation. We fill this gap with the MUltimedia SEgmentation Dataset (MUSED), a collection of manually segmented documents from different media sources, spanning seven domains, with an average of twenty related documents per domain. In this paper, we describe the process of building MUSED. We carry out a multi-annotator study to determine whether human judges agree and to characterize their disagreement patterns. In addition, we use MUSED to compare state-of-the-art topic segmentation techniques, including those that take advantage of related documents. Moreover, we study the impact of having documents from different media sources in the dataset. To the best of our knowledge, MUSED is the first dataset that allows a straightforward evaluation of both single- and multi-document topic segmentation techniques, as well as a study of how these behave in the presence of documents from different media sources. Results show that some techniques are indeed sensitive to different media sources, and that current multi-document segmentation models do not outperform previous models, pointing to a line of research that needs further work.

Type
Article
Copyright
Copyright © Cambridge University Press 2018 

Footnotes

*This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under reference UID/CEC/50021/2013, by the LAW-TRAIN project (H2020-EU.3.7, contract 653587), and by the Carnegie Mellon Portugal Program under grant SFRH/BD/51917/2012.
