An automatic approach to identify word sense changes in text media across timescales

Published online by Cambridge University Press:  16 April 2015

SUNNY MITRA
Affiliation: Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India. e-mail: sunnym@cse.iitkgp.ernet.in
RITWIK MITRA
Affiliation: Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India. e-mail: ritwikm@cse.iitkgp.ernet.in
SUMAN KALYAN MAITY
Affiliation: Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India. e-mail: sumankalyan.maity@cse.iitkgp.ernet.in
MARTIN RIEDL
Affiliation: FG Language Technology, Computer Science Department, TU Darmstadt, Darmstadt, Germany. e-mail: riedl@cs.tu-darmstadt.de
CHRIS BIEMANN
Affiliation: FG Language Technology, Computer Science Department, TU Darmstadt, Darmstadt, Germany. e-mail: biem@cs.tu-darmstadt.de
PAWAN GOYAL
Affiliation: Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India. e-mail: pawang@cse.iitkgp.ernet.in
ANIMESH MUKHERJEE
Affiliation: Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India. e-mail: animeshm@cse.iitkgp.ernet.in

Abstract

In this paper, we propose an unsupervised, automated method for identifying changes in noun senses, based on an analysis of time-varying text data available as millions of digitized books and millions of tweets posted per day. We construct distributional-thesaurus-based networks from the data at different time points and cluster each network separately to obtain word-centric sense clusters for each time point. We then propose a split/join-based approach that compares the sense clusters at two time points to determine whether a new sense has undergone ‘birth’. The approach also reveals whether an older sense has ‘split’ into more than one sense, whether a newer sense has formed from the ‘join’ of older senses, or whether a particular sense has undergone ‘death’. We apply this completely unsupervised approach (a) within the Google Books data, to identify word sense differences within a single medium, and (b) across Google Books and Twitter data, to identify differences in word sense distribution across media. We conduct a thorough evaluation of the proposed methodology, both manually and through comparison with WordNet.
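
The split/join comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how overlap between clusters is measured, so the `share` function, the thresholds `max_known` and `min_part`, and the toy clusters for the word "sick" are all assumptions made purely for illustration. Sense clusters are represented as plain sets of words, as if the clustering step had already been run at each time point.

```python
from typing import Dict, List, Set

Clusters = List[Set[str]]


def share(a: Set[str], b: Set[str]) -> float:
    """Fraction of the words in `a` that also occur in `b`."""
    return len(a & b) / len(a) if a else 0.0


def compare_sense_clusters(old: Clusters, new: Clusters,
                           max_known: float = 0.2,
                           min_part: float = 0.3) -> Dict[str, List[int]]:
    """Label sense clusters of one word at two time points.

    Thresholds are hypothetical: a cluster counts as a birth (or death)
    when at most `max_known` of its words appear anywhere at the other
    time point, and as a join (or split) when at least two clusters at
    the other time point each cover at least `min_part` of its words.
    """
    old_vocab = set().union(*old)
    new_vocab = set().union(*new)
    events: Dict[str, List[int]] = {
        "birth": [], "death": [], "split": [], "join": []
    }
    for j, n in enumerate(new):
        if share(n, old_vocab) <= max_known:
            events["birth"].append(j)   # mostly unseen words: new sense
        elif sum(share(n, o) >= min_part for o in old) >= 2:
            events["join"].append(j)    # built from several old senses
    for i, o in enumerate(old):
        if share(o, new_vocab) <= max_known:
            events["death"].append(i)   # sense no longer represented
        elif sum(share(o, n) >= min_part for n in new) >= 2:
            events["split"].append(i)   # spread over several new senses
    return events


# Toy clusters for the target word "sick" (invented data).
old_clusters = [{"ill", "unwell", "diseased", "ailing"}]
new_clusters = [{"ill", "unwell", "ailing", "feverish"},
                {"cool", "awesome", "crazy", "wicked"}]
events = compare_sense_clusters(old_clusters, new_clusters)
print(events)
```

With these toy inputs, only the second cluster at the later time point is reported as a birth, since none of its words occur in the earlier cluster. In practice the thresholds would need tuning against manually judged examples.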

Type
Articles
Copyright
Copyright © Cambridge University Press 2015 
