
Towards syntax-aware token embeddings

Published online by Cambridge University Press: 08 July 2020

Diana Nicoleta Popa*
Affiliation: Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, 700 Avenue Centrale, 38401 Saint-Martin-d’Hères, France; Naver Labs Europe, 6 Chemin de Maupertuis, 38240 Meylan, France

Julien Perez
Affiliation: Naver Labs Europe, 6 Chemin de Maupertuis, 38240 Meylan, France

James Henderson
Affiliation: Idiap Research Institute, 19 Rue Marconi, 1920 Martigny, Switzerland

Eric Gaussier
Affiliation: Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, 700 Avenue Centrale, 38401 Saint-Martin-d’Hères, France

*Corresponding author. E-mail: diana.popa@imag.fr

Abstract

Distributional semantic word representations form the basis of most modern NLP systems. Their usefulness has been proven across various tasks, particularly as inputs to deep learning models. Beyond that, much work has investigated fine-tuning generic word embeddings to leverage linguistic knowledge from large lexical resources. Other work has investigated context-dependent word token embeddings motivated by word sense disambiguation, using sequential context and large lexical resources. More recently, acknowledging the need for in-context representations of words, some work has leveraged information derived from language modelling and large amounts of data to induce contextualised representations. In this paper, we investigate Syntax-Aware word Token Embeddings (SATokE) as a way to explicitly encode, in the vectors input to a deep learning model, specific information derived from the linguistic analysis of a sentence. We propose an efficient unsupervised learning algorithm based on tensor factorisation for computing these token embeddings given an arbitrary graph of linguistic structure. Applying this method to syntactic dependency structures, we investigate the usefulness of such token representations as part of deep learning models of text understanding. We encode a sentence either by learning embeddings for its tokens and the relations between them from scratch or by leveraging pre-trained relation embeddings to infer token representations. Given sufficient data, the former is slightly more accurate than the latter, yet both provide more informative token embeddings than standard word representations, even when the word representations have been learned on the same type of context from larger corpora (namely pre-trained dependency-based word embeddings). We evaluate our proposal on a large set of supervised tasks using two major families of deep learning models for sentence understanding. We empirically demonstrate the superiority of the token representations over popular distributional representations of words on various sentence and sentence-pair classification tasks.
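
The abstract describes learning embeddings for a sentence's tokens and the relations between them from its dependency graph via tensor factorisation. The paper's exact objective is not reproduced in this excerpt; the following is a minimal sketch, assuming a RESCAL-style bilinear factorisation over (head, relation, dependent) triples trained with a margin-based ranking loss against corrupted triples. The function name, hyper-parameters, and toy sentence are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: a bilinear factorisation over dependency triples, not the paper's exact method.
# Each sentence contributes triples (head_token, relation, dependent_token) from its parse;
# we learn one vector per token occurrence and one matrix per dependency label by asking
# observed triples to score higher than corrupted ones.
import torch

def learn_token_embeddings(triples, n_tokens, n_relations, dim=50,
                           epochs=200, lr=0.1, margin=1.0, seed=0):
    torch.manual_seed(seed)
    tok = torch.nn.Parameter(torch.randn(n_tokens, dim) * 0.1)          # per-occurrence token vectors
    rel = torch.nn.Parameter(torch.randn(n_relations, dim, dim) * 0.1)  # one matrix per dependency label
    opt = torch.optim.Adagrad([tok, rel], lr=lr)

    h = torch.tensor([t[0] for t in triples])
    r = torch.tensor([t[1] for t in triples])
    d = torch.tensor([t[2] for t in triples])

    def score(hi, ri, di):
        # bilinear score  e_h^T  W_r  e_d
        return torch.einsum('bi,bij,bj->b', tok[hi], rel[ri], tok[di])

    for _ in range(epochs):
        neg_d = torch.randint(0, n_tokens, d.shape)   # corrupt the dependent token
        loss = torch.relu(margin - score(h, r, d) + score(h, r, neg_d)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return tok.detach(), rel.detach()

# Toy usage for "dogs chase cats": tokens 0..2, relations {0: nsubj, 1: dobj}
triples = [(1, 0, 0),   # chase --nsubj--> dogs
           (1, 1, 2)]   # chase --dobj--> cats
token_vecs, relation_mats = learn_token_embeddings(triples, n_tokens=3, n_relations=2)
```

Learning one embedding per token occurrence, rather than per word type, is what makes the result a token embedding: the same word in two sentences receives different vectors because it participates in different dependency triples.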

Type: Article
Copyright: © The Author(s), 2020. Published by Cambridge University Press
