Efficient bilingual lexicon extraction from comparable corpora based on formal concepts analysis

Published online by Cambridge University Press: 04 October 2021

Mohamed Chebel
Affiliation:
LIPAH Research Laboratory, Faculty of Sciences of Tunis, Tunis EL Manar University, Tunisia
Chiraz Latiri*
Affiliation:
LIPAH Research Laboratory, Faculty of Sciences of Tunis, Tunis EL Manar University, Tunisia
Eric Gaussier
Affiliation:
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
*Corresponding author. E-mail: chiraz.latiri@gnet.tn

Abstract

Bilingual corpora are an essential resource used to cross the language barrier in multilingual natural language processing tasks. Among bilingual corpora, comparable corpora have been the subject of many studies as they are both frequent and easily available. In this paper, we propose to make use of formal concept analysis to first construct concept vectors which can be used to enhance comparable corpora through clustering techniques. We then show how one can extract bilingual lexicons of improved quality from these enhanced corpora. We finally show that the bilingual lexicons obtained can complement existing bilingual dictionaries and improve cross-language information retrieval systems.
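
As a rough illustration of the formal concept analysis step mentioned above, the following minimal sketch enumerates the formal concepts of a toy document-term context. It is not the authors' implementation: the context data and the helper names (`common_terms`, `docs_with`, `formal_concepts`) are hypothetical, and the brute-force enumeration is only practical for very small inputs. A formal concept is a pair (extent, intent) in which the extent is exactly the set of documents sharing all terms of the intent, and the intent is exactly the set of terms shared by all documents of the extent.

```python
from itertools import combinations

# Toy formal context (hypothetical data): each document maps to the terms it contains.
context = {
    "d1": {"bank", "loan"},
    "d2": {"bank", "loan", "rate"},
    "d3": {"river", "bank"},
}


def common_terms(docs, context):
    """Intent: terms shared by every document in docs (all terms if docs is empty)."""
    result = set().union(*context.values())
    for d in docs:
        result &= context[d]
    return result


def docs_with(terms, context):
    """Extent: documents that contain every term in terms."""
    return {d for d, doc_terms in context.items() if terms <= doc_terms}


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing every subset of documents.

    Brute force over all subsets, so exponential in the number of documents.
    """
    concepts = set()
    docs = sorted(context)
    for size in range(len(docs) + 1):
        for subset in combinations(docs, size):
            intent = common_terms(subset, context)
            extent = docs_with(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

On this toy context the sketch yields five concepts, for example the pair ({d1, d2}, {bank, loan}). Scalable closed-itemset miners would replace the subset enumeration on corpora of realistic size.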

Type
Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press
