
INVESTIGATING INTER-RATER RELIABILITY OF QUALITATIVE TEXT ANNOTATIONS IN MACHINE LEARNING DATASETS

Published online by Cambridge University Press: 11 June 2020

N. El Dehaibi*
Affiliation:
Stanford University, United States of America
E. F. MacDonald
Affiliation:
Stanford University, United States of America

Abstract


An important step when designers use machine learning models is annotating user-generated content. In this study, we investigate inter-rater reliability measures of qualitative annotations for supervised learning. We work with previously annotated product reviews from Amazon in which phrases related to sustainability are highlighted. We measure inter-rater reliability of the annotations using four variations of Krippendorff's U-alpha. Based on the results, we offer suggestions to designers on measuring the reliability of qualitative annotations for machine learning datasets.
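To make the idea of agreement on highlighted phrases concrete, the following is a minimal sketch of a chance-corrected agreement coefficient. It is not the U-alpha procedure used in the study: U-alpha operates on unitized segments of a continuum, whereas this sketch assumes each character of a review is labelled 1 (highlighted) or 0 (not highlighted) by each rater and then applies the standard nominal form of Krippendorff's alpha. The helper names `spans_to_labels` and `nominal_alpha`, and the example spans, are hypothetical and serve only as illustration.

```python
from collections import Counter
from itertools import permutations


def spans_to_labels(length, spans):
    """Turn highlighted (start, end) character spans into a 0/1 label per character.

    Hypothetical helper: `end` is exclusive, and spans past `length` are clipped.
    """
    labels = [0] * length
    for start, end in spans:
        for i in range(start, min(end, length)):
            labels[i] = 1
    return labels


def nominal_alpha(units):
    """Krippendorff's alpha for nominal labels (simplified, not U-alpha).

    `units` is a list of lists; each inner list holds the labels that the
    raters assigned to one unit (here, one character position).
    """
    # Build the coincidence matrix: every ordered pair of labels from
    # different raters within a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit coded by fewer than two raters is not pairable
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1.0 - (n - 1) * observed / expected


# Illustrative data only: two raters highlight overlapping parts of a 20-character text.
rater_a = spans_to_labels(20, [(2, 8)])
rater_b = spans_to_labels(20, [(4, 10)])
units = [list(pair) for pair in zip(rater_a, rater_b)]
print(round(nominal_alpha(units), 3))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. Working at the character level is a simplifying assumption made here; it ignores the segment lengths and boundaries that a unitizing coefficient is designed to handle.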

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
The Author(s), 2020. Published by Cambridge University Press
