1. Introduction
Bilingual lexicons provide meaningful information on the semantic equivalence of words across languages and are useful for various cross-lingual tasks, such as cross-lingual information retrieval (CLIR) (Levow et al. 2005; Li and Gaussier 2012). Even though parallel corpora and the word alignment methods that can be deployed on them have proven to be useful for building bilingual lexicons (Och and Ney 2004), they are only available for a limited number of domains among resource-rich languages. Consequently, the scarcity of multilingual parallel corpora, particularly for specialized areas, has led researchers to focus their efforts on finding word translation pairs in comparable corpora (Fung and Cheung 2004; Haghighi et al. 2008; Prochasson et al. 2009; Tamura et al. 2012). Hence, the exploitation of comparable corpora has marked a turning point in the task of bilingual lexicon extraction and has raised constant interest, thanks to the abundance, continuous growth, and availability of such corpora (Morin and Hazem 2016; Zhang et al. 2017; Søgaard et al. 2018).
Bilingual lexicon extraction from a comparable corpus, sometimes referred to as bilingual lexicon induction, is the task that aims at automatically extracting translation pairs from two monolingual corpora. Most state-of-the-art approaches using comparable corpora to extract bilingual lexicons assume that “a word and its translations tend to appear in similar contexts across languages” (Fung 1998). Contexts consist of co-occurring words and are either represented as explicit vectors (Fung 1998; Laroche and Langlais 2010) or based on word embeddings (Mikolov et al. 2013; Pennington et al. 2014; Vulic and Moens 2016; Artetxe et al. 2016; Fast 2017; Hazem and Morin 2017; Xu et al. 2018). Once contexts have been identified, they are mapped across languages using a bilingual lexicon (Fung 1998; Rapp 1999). As the bilingual lexicon usually used is either a large, general dictionary or a small, domain-specific lexicon, there is a high risk of missing potential associations across languages when trying to extract bilingual lexicons in specific domains (Gaussier et al. 2004; Déjean et al. 2005; Tamura et al. 2012; Irvine and Callison-Burch 2013; Linard et al. 2015; Vulic and Moens 2016; Morin and Hazem 2016). We refer to this problem as the sparsity problem.
Facing this situation, we follow in this paper the core idea of distributional methods and propose to combine context vectors with additional, automatically extracted knowledge. We conjecture here that a gain can be expected by relying on enriched representations of words derived from formal concept analysis (FCA) (Ganter and Wille 1999). Indeed, one can formulate another distributional hypothesis with formal concepts: “If a word belonging to a formal concept C appears in a context vector V, then it is likely that all the words that belong to the same formal concept C appear as well in V.”
Our contributions in this study are threefold:
(1) We first show how FCA can be used to build monolingual and bilingual closed concepts from comparable collections. From these closed concepts, we then derive monolingual and bilingual clusters with high comparability scores.
(2) We then propose to combine standard context vectors with concept vectors based on closed concepts for extracting bilingual lexicons from comparable corpora. Our experiments show that the proposed combination improves the performance of bilingual lexicon extraction compared to the standard approach and two recent state-of-the-art unsupervised models (Zhang et al. 2017; Xu et al. 2018).
(3) Finally, we exploit the extracted bilingual lexicons in a CLIR system and show that this leads to improved quality in terms of precision and mean average precision (MAP).
It is important to note here that our goal is to show that FCA can be used to solve, at least partly, the sparsity problem. We do not claim that this is the only possible approach to do so. We nevertheless claim that FCA is an easy-to-deploy and simple approach to solve the sparsity problem.
In terms of language resources, our approach only requires unlabeled bilingual corpora and a general bilingual dictionary. In particular, it can be used, by relying on standard context vectors, for languages for which word embeddings, and in particular contextualized word embeddings, are not available and may be difficult to acquire for lack of sufficiently large unlabeled corpora and/or computational power. As such, our approach can be deployed on almost all language classes defined in Joshi et al. (2020), with the exception of The Left-Behinds, which corresponds to languages with exceptionally limited resources, and to some extent The Scraping-Bys, for which only some amount of unlabeled data is available.
The remainder of the paper is organized as follows: Section 2 presents the related work on bilingual lexicon extraction from comparable corpora. We then present FCA foundations for mining closed concepts in Section 3. Section 4 describes how to use closed concepts to improve, through clustering, the quality of a given comparable corpus, following Li and Gaussier (2010) and Li et al. (2011). We then propose to combine standard context vectors with concept vectors based on closed concepts for bilingual lexicon extraction (Section 5). The efficiency of our approach is validated by our experimental study, which shows that FCA leads to corpora of improved quality (in terms of comparability scores) as well as to better bilingual lexicons (Section 6). We finally illustrate the benefits of the bilingual lexicons extracted in a CLIR setting (Section 6). The conclusion section wraps up the article and outlines future work.
2. Related work
Distributional approaches, which aim at building representations of words based on the contexts they occur in, are at the core of methods to extract information from corpora. In the context of comparable corpora, a basic assumption is that words which are translations of each other are likely to appear in similar contexts across languages. Under this hypothesis, Fung (1995) and Rapp (1995) pioneered bilingual lexicon extraction and proposed three main steps for this task: context modeling, context-similarity calculation, and translation-pair finding. The context of a given word, which we will refer to as the head word, usually consists of neighboring words within a predefined window (Rapp 1999; Andrade et al. 2011), a sentence (Laroche and Langlais 2010), a paragraph (Fung and McKeown 1997), a document (Shao and Ng 2004), or dependency relations (Otero 2008; Garera et al. 2009). Words in a context are usually weighted, based on, for example, $tf*idf$ , pointwise mutual information, or log-likelihood ratio tests, in order to reflect the strength of their relation with the head word (Fung 1995; Rapp 1999; Fung and Lo 1998; Chiao and Zweigenbaum 2003; Andrade et al. 2011). Once context vectors have been built, they can be translated using a seed bilingual lexicon, usually a bilingual dictionary from the general domain. One can then compare context vectors from different languages using different similarity measures, such as the Euclidean distance (Fung 1995), the cosine similarity (Fung and Lo 1998), the city-block metric (Rapp 1999), the number of overlapping context words (Andrade et al. 2011), the Jensen–Shannon divergence (Pekar et al. 2006), and the weighted Jaccard index (Hazem and Morin 2012). A latent space can also be constructed from the seed lexicon to capture polysemy and synonymy prior to computing a similarity between context vectors (Gaussier et al. 2004). Lastly, the target candidate translations of a given source head word correspond to the head words of the target context vectors closest to the translation of the source context vector.
Additional clues, such as transliteration information (Shao and Ng 2004) or co-occurrence information from aligned documents (Prochasson and Fung 2011), can also be integrated in the above process. The study in Irvine and Callison-Burch (2017) reviews several such additional clues and introduces a supervised method to extract bilingual lexicons from comparable corpora. As supervision is costly and additional clues are not always available, we solely make use in this study of unsupervised methods with no additional clues. However, like the vast majority of methods, we also use a seed lexicon, as previous studies attempting to dispense with it have not been successful (Jagarlamudi et al. 2011). Lastly, another line of distributional approaches has focused on inferring multilingual topic models from parallel and comparable corpora (Vulic et al. 2015), with the possibility to address such tasks as cross-lingual event-centered news clustering (which is only a special case of cross-lingual document clustering), cross-lingual document classification, cross-lingual semantic similarity, and CLIR. The bilingual lexicons obtained with this type of approach are usually not very useful if the multilingual topic model is solely trained on comparable corpora.
More recently, Mikolov et al. (2013) introduced a new method for building word representations that aims at learning word vectors so as to maximize the probability of a word given its context. Canonical correlation analysis was then used in Faruqui and Dyer (2014) to project the embeddings of both languages into a shared space on the basis of an existing bilingual dictionary. In the same vein, an approach to learn bilingual word-embedding mappings was presented in Artetxe et al. (2016), again from a bilingual dictionary. This approach preserves monolingual invariance through the use of several constraints in connection with the method proposed in Faruqui and Dyer (2014). In addition, multilingual word embeddings were trained on sentence-aligned parallel data in Chandar et al. (2014) and on document-aligned non-parallel data in Vulic and Moens (2016) to produce bilingual word embeddings. In Hazem and Morin (2018), the authors put forward a combination of different embedding models learned from specialized and general-domain data sets, resulting in higher performance. In a more specific domain such as the biomedical domain, Heyman et al. (2018) considered bilingual lexicon extraction as a classification problem and trained a neural network combining recurrent long short-term memory and deep feed-forward networks in order to obtain word-level and character-level representations.
Lastly, Langlais and Jakubina (2017) carefully compared different approaches (using or not word embeddings) for bilingual lexicon extraction from comparable corpora and showed that word embeddings were to be preferred for frequent terms, but not for less frequent ones.
In contrast with all these previous approaches, we propose here to combine context-vector representations, based on word embeddings, with semantically related words obtained with closed concept mining methods pertaining to FCA. We first identify words in the context vector of a given head word and use word embeddings to weight context words (each coordinate corresponding to the cosine between the embedding of the head word and that of the context word). The use of semantically related words allows relying on richer representations and finally leads to improved lexicons and CLIR systems. It is worth noting that this paper is an extension of Chebel et al. (2017), as it involves a complete formalization and additional experiments regarding corpus comparability and the impact of the combination weight. In addition, we make use here of the bilingual lexicons we extract from comparable corpora in the context of CLIR systems.
3. Mathematical foundations: Key FCA settings
We present in this section the notions used for mining closed concepts. We rely here on the FCA framework for text mining presented in Ganter and Wille (1999) and adapted to our problem in Chebel et al. (2017). We first formalize an extraction context made up of documents and index terms, called the textual context.
Definition 1 A textual context is a triplet $\mathfrak{M} \;:\!=\; (\mathscr{C},\mathscr{T},\mathscr{I})$ where:
• $\mathscr{C} \;:\!=\; \{d_1, d_2, \ldots, d_n\}$ is the collection of documents (finite set of n documents);
• $\mathscr{T} \;:\!=\; \{t_1, t_2,\ldots, t_m\}$ is a finite set of m distinct words in the corpus. $\mathscr{T}$ comprises the words of the different documents in $\mathscr{C}$ ;
• $\mathscr{I}$ $\subseteq \mathscr{C}\times \mathscr{T}$ is a binary (incidence) relation. Each relation (d, t) $\in$ $\mathscr{I}$ indicates that document d $\in$ $\mathscr{C}$ contains term t $\in$ $\mathscr{T}$ .
Example 1 Consider the textual context given in Figure 1 (left). Here, $\mathscr{C}\;:\!=\; \{d_{1}, d_{2}, d_{3}, d_{4}, d_{5}\}$ and $\mathscr{T}\;:\!=\; \{{A}, {B}, {C}, {D}, {E} \}$ . ( $d_{2}$ , B) $\in$ $\mathscr{I}$ , meaning that document $d_2$ contains term B.
We first recall the basic definitions of the Galois lattice-based paradigm in FCA (Ganter and Wille 1999) and its application to closed concept mining.
Definition 2 Concept $C = (T,D)$ is defined by two sets, a set of terms T and a set of documents D, respectively called the “intension” and “extension” of the concept, such that all terms in T co-occur in all documents of D. The support of C in $\mathfrak{M}$ is equal to the number of documents in $\mathscr{C}$ containing all the terms of T. The absolute support is formally defined as follows (Han et al. 2000):
\begin{equation*}\textit{Supp}(C) \;:\!=\; \big|\{d \in \mathscr{C} \mid \forall\, t \in T,\ (d,t) \in \mathscr{I}\}\big|\end{equation*}
The relative support (aka frequency) of C $\in$ $\mathfrak{M}$ is equal to $\frac{\displaystyle \textit{Supp}(C)}{\displaystyle |\mathscr{C}|}$ , where $|\mathscr{C}|$ denotes the number of documents in the collection $\mathscr{C}$ (we denote by $|X|$ the cardinality of set X).
A concept is said to be frequent if its terms co-occur in corpus $\mathscr{C}$ a number of times greater than or equal to a user-defined support threshold, denoted minsupp. Otherwise, it is said to be infrequent (aka rare).
Example 2 Consider the textual context given in Figure 1. Since terms B and C simultaneously appear in documents $d_2$ , $d_3$ , and $d_5$ , the pair $\langle \{B,C\}, \{d_2, d_3, d_5\}\rangle$ is a formal concept. Its intension is given by the set of terms $\{B,C\}$ and its extension by the set of documents $\{d_2, d_3, d_5\}$ . Its absolute support is equal to $|\{d_2, d_3, d_5\}|$ = 3. This concept is frequent since its support is greater than $minsupp=2$ .
Definition 3 (Galois closure operator) Let $C=(T,D)$ be a concept. Two functions are defined in order to map sets of documents to sets of terms and vice versa:
\begin{equation*}\Psi \;:\; \mathscr{P}(\mathscr{T}) \rightarrow \mathscr{P}(\mathscr{C}), \quad \Psi(T) \;:\!=\; \{d \in \mathscr{C} \mid \forall\, t \in T,\ (d,t) \in \mathscr{I}\}\end{equation*}
\begin{equation*}\Phi \;:\; \mathscr{P}(\mathscr{C}) \rightarrow \mathscr{P}(\mathscr{T}), \quad \Phi(D) \;:\!=\; \{t \in \mathscr{T} \mid \forall\, d \in D,\ (d,t) \in \mathscr{I}\}\end{equation*}
where $\mathscr{P}( X)$ denotes the power set of X. Both functions $\Psi$ and $\Phi$ constitute Galois operators. $\Psi(T)$ is the set of documents containing all words of T; its cardinality is equal to Supp(T). $\Phi(D)$ is the set of words appearing in all the documents of D. Consequently, the compound operator $\Omega\;:\!=\; \Phi \circ \Psi$ is a Galois closure operator which associates to a set of words T the set of words which appear in all the documents where the words of T co-occur.
A closed concept is then defined as follows:
Definition 4 Concept $C = (T,D)$ is said to be closed if $\Omega (T)$ = T. A closed concept is thus given by a maximal set of words common to a given set of documents. A closed concept is said to be frequent w.r.t. the minsupp threshold if Supp(C) = $|\Psi(T)| \geq \textit{minsupp}$ . Hereafter, we denote a closed concept by CC.
Example 3 With respect to the previous example, $\{B,C,E\}$ is a closed termset since there is no other term appearing in all documents containing $\{B,C,E\}$ : $\{B,C,E\}$ is the maximal set of terms common to documents $\{d_2, d_3, d_5\}$ . We then have: $\Omega(\{B,C,E\}) = \{B,C,E\}$ . If minsupp is set to 2, $\{B,C,E\}$ is also frequent since $|\Psi (\{B,C,E\})| = |\{d_2, d_3, d_5\}| = 3 \geq 2$ .
It is worth noting that in our work, each closed concept CC represents a class of documents grouped by a set of representative terms. Consequently, a closed concept represents a maximal group of terms appearing in the same documents.
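To make these definitions concrete, the sketch below implements $\Psi$, $\Phi$, and the closure $\Omega = \Phi \circ \Psi$ on a small hypothetical incidence relation. The full document-term matrix is our own assumption, chosen only to be consistent with the fragments given in Examples 1–3.

```python
# Hypothetical textual context consistent with Examples 1-3:
# documents d1..d5 over terms A..E (the full matrix is assumed, not given).
CONTEXT = {
    "d1": {"A", "D"},
    "d2": {"B", "C", "E"},
    "d3": {"B", "C", "E"},
    "d4": {"A", "C", "D"},
    "d5": {"B", "C", "E"},
}

def psi(terms):
    """Psi: documents containing every term of `terms`."""
    return {d for d, ts in CONTEXT.items() if terms <= ts}

def phi(docs):
    """Phi: terms appearing in every document of `docs`."""
    return set.intersection(*(CONTEXT[d] for d in docs)) if docs else set()

def omega(terms):
    """Galois closure Omega = Phi o Psi."""
    return phi(psi(terms))

print(omega({"B", "C"}))       # {B, C} is not closed: its closure adds E
print(omega({"B", "C", "E"}))  # {B, C, E} is closed: fixed point of Omega
```

As in Example 3, $\{B,C,E\}$ is a fixed point of the closure, while $\{B,C\}$ is not.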
4. FCA-based clustering to extract coherent concepts within and across languages
This section describes the application of the above concepts to comparable corpora. In the remainder, $\mathscr{C}^{comp}= \mathscr{C}^s \cup \mathscr{C}^t$ denotes a comparable corpus, usually unbalanced in the amount of source ( $\mathscr{C}^s$ ) and target ( $\mathscr{C}^t$ ) texts. More generally, notation $\mathscr{C}$ refers to any monolingual corpus of a source or target language.
Our goal here is to improve corpus comparability for better bilingual lexicon extraction. As described in Section 6, we work on two comparable corpora, one based on the French–English language pair and the other on the Italian–English pair.
Our approach relies on three main steps, namely:
(1) Mining closed concepts: This step consists in extracting the closed concepts from comparable corpora.
(2) Translation and disambiguation: This step deals with the translations and the disambiguation of the terms in the closed concept extensions.
(3) Alignment: The extracted monolingual closed concepts are aligned based on their disambiguated translations and using an unsupervised classification algorithm. This allows grouping monolingual closed concepts of different languages into multilingual closed concepts.
The aforementioned steps are detailed in what follows.
4.1 Mining closed concepts from comparable corpora
In the remainder, for simplicity, we use the term closed concept instead of frequent closed concept. The reader should bear in mind that all the closed concepts considered occur a certain number of times. Extracting closed concepts from comparable corpora requires preprocessing steps in order to extract the most representative terms. We rely here on a part-of-speech tagger, namely TreeTagger, so as to focus on terms that are either common nouns, proper nouns, or adjectives. The rationale for this focus is that nouns and adjectives are the most informative grammatical categories and are the most likely to represent the content of documents (Barker and Cornacchia 2000). A stoplist is used to discard very common functional French, Italian, and English terms. This task is carried out on the French–English and Italian–English comparable corpora (cf. Section 6.1). The document-term context $\mathfrak{M}$ is then built by retaining only terms corresponding to the selected grammatical categories.
In order to extract closed concepts from comparable corpora, we adapt the Charm-L algorithm (Zaki and Hsiao 2005) to consider any given textual context $\mathfrak{M}$ . The algorithm extracts all the closed concepts as described in Zaki and Hsiao (2005), with respect to the minimal and maximal support thresholds minsupp and maxsupp. These thresholds are set experimentally: considering the Zipf distribution of each collection, the maximal support threshold is set so as to filter out trivial terms that occur in most documents and are thus not informative, while the minimal threshold eliminates marginal terms that occur in only a few documents.
The Charm-L algorithm iteratively generates frequent closed concepts in the form of pairs of sets of terms and documents:
\begin{equation*}CC \;:\!=\; \langle \{t_1, t_2,\ldots, t_n\}, \{d_1, d_2,\ldots,d_m\}\rangle\end{equation*}
Each pair represents a set of documents $\{d_1, d_2,\ldots,d_m\}$ (extension) sharing a set of terms $\{t_1, t_2,\ldots, t_n\}$ (intension), with a support greater than or equal to minsupp. At the end, we obtain all French closed concepts, $CC^{(f)}$ , all English closed concepts, $CC^{(e)}$ , and all Italian closed concepts, $CC^{(i)}$ .
Within a given language, one can define (dis-)similarity measures between closed concepts. We make use here of the Euclidean distance based on the $tf*idf$ (Salton and Buckley 1988) representation of each term, where tf denotes term frequency and idf inverse document frequency. For any term t in the vocabulary V of a given collection and any closed concept CC with n terms and m documents extracted from this collection, tf(t,CC) is equal to the normalized number of occurrences of t in the intension of CC: $tf(t,CC) = \frac{1}{n}$ if t appears in the intension of CC and 0 otherwise. The inverse document frequency is based on the ratio of the number of documents in the extension of CC to the number of documents which contain t, denoted $m_t$ : $idf(t,CC) = \log(1 + \frac{m}{1+m_t})$ . The distance between two closed concepts $CC_i$ and $CC_j$ is then defined as:
(5) \begin{equation}dist(CC_i,CC_j) = \sqrt{\sum_{t \in V} \big(tf(t,CC_i)\,idf(t,CC_i) - tf(t,CC_j)\,idf(t,CC_j)\big)^2}\end{equation}
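The $tf*idf$ weighting and the Euclidean distance between closed concepts just described can be sketched as follows. The toy concepts and the per-term document frequencies are made-up examples, not values from our corpora.

```python
import math

def tf(t, intension):
    """tf(t,CC) = 1/n if t is in the intension of CC (n terms), else 0."""
    return 1.0 / len(intension) if t in intension else 0.0

def idf(m, m_t):
    """idf(t,CC) = log(1 + m / (1 + m_t)), with m documents in the
    extension of CC and m_t documents containing t."""
    return math.log(1 + m / (1 + m_t))

def weight(t, cc, m_t):
    intension, extension = cc
    return tf(t, intension) * idf(len(extension), m_t)

def distance(cc_i, cc_j, vocab, doc_freq):
    """Euclidean distance between two closed concepts over vocabulary V."""
    return math.sqrt(sum((weight(t, cc_i, doc_freq[t]) -
                          weight(t, cc_j, doc_freq[t])) ** 2
                         for t in vocab))

vocab = ["A", "B", "C"]
doc_freq = {"A": 2, "B": 3, "C": 4}        # hypothetical corpus counts m_t
cc1 = ({"A", "B"}, {"d1", "d2"})           # (intension, extension)
cc2 = ({"B", "C"}, {"d2", "d3", "d4"})
print(round(distance(cc1, cc2, vocab, doc_freq), 3))
```

The distance is zero for identical concepts and grows as their weighted term profiles diverge.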
We now turn to multilingual closed concepts.
4.2 Closed concepts translation and selection
In order to relate closed concepts across languages, we first expand each term in the intension of a closed concept with its translations in the target language using an existing bilingual dictionary. To do so, starting with a closed concept in the source language, we first consider all possible translations of the terms in the intension and produce all combinations of such translations. Thus, if there are p terms in the intension of a concept, and m possible translations for each of them on average, one ends up, on average, with $m^p$ possible translations of the original closed concept. In practice, however, both p and m are small and the number of possible translations for any closed concept remains tractable (at most a few tens).
Then, for each possible translation of the intension of a closed concept, we associate an extension consisting of the set of documents containing the translated terms. By doing so, the representation of a translation parallels that of closed concepts. We finally select, among the set of target closed concepts, the one that is closest, according to the Euclidean distance described above, to a possible translation of the source closed concept. We will refer to the selected target closed concept as the most likely translation (indeed, a correct candidate translation of a closed concept is more likely to be present as a closed concept in the target language). In case of ties, the target closed concept is chosen uniformly at random.
At the end of this process, each closed concept in the source language is either left unchanged, if no translation was identified, or is associated with a closed concept in the target language.
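The expansion of an intension into its $m^p$ candidate translations can be sketched as follows. The tiny French–English dictionary and the helper name `candidate_translations` are our own illustrative assumptions.

```python
from itertools import product

# Made-up French -> English dictionary fragment (illustration only).
dictionary = {
    "maladie": ["disease", "illness"],
    "sein": ["breast", "bosom"],
}

def candidate_translations(intension):
    """All term-by-term translation combinations of an intension.
    Terms absent from the dictionary are kept as-is."""
    options = [dictionary.get(t, [t]) for t in intension]
    return [set(combo) for combo in product(*options)]

candidates = candidate_translations(["maladie", "sein"])
print(len(candidates))  # p = 2 terms, m = 2 translations each -> 4
```

With p terms and m translations per term on average, the list grows as $m^p$, which stays tractable for the small p and m observed in practice.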
4.3 Closed concept alignment
From the closed concepts obtained above, one can build clusters so as to find similar concepts within and across languages. We rely here on the standard K-means algorithm using the Euclidean distance defined in Equation (5). The similarity between closed concepts in the source language is based on their original intension and extension, while the similarity between source and target closed concepts is based on the most likely translation of the source closed concept. The number of clusters K is chosen so as to have enough clusters to cover the whole collection without separating the documents into too many clusters. We find that 300 is a reasonable choice for the larger collection we are considering (CLEF 2003), whereas 40 is reasonable for Breast Cancer. Furthermore, we rely on a discretization step with equal-width binning prior to running the K-means algorithm, as in Fayyad and Irani (1993).
Finally, we obtain clusters of closed concepts that can either comprise bilingual documents (when closed concepts from different languages are grouped) or monolingual documents (when only closed concepts from the same language are grouped).
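A minimal K-means sketch over tiny, made-up closed-concept vectors (real runs use K = 300 or K = 40 as stated above, and an off-the-shelf implementation would do equally well):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means with Euclidean distance: assign, then re-center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two obvious groups of 2-d "concept vectors" (made-up values).
X = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [0.9, 1.0]])
labels, _ = kmeans(X, k=2)
print(labels)
```

On this toy input the first two and last two vectors end up in separate clusters.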
In the next section, we make use of context vectors extracted from the comparable corpora corresponding to the bilingual clusters obtained above.
5. FCA-based approach for bilingual lexicon extraction
The description given in this section follows the one in Chebel et al. (2017). We nevertheless detail it here for completeness as well as to facilitate reading (as readers can find all the information within a single document).
As aforementioned, we propose here to enrich context vectors with concept vectors for bilingual lexicon extraction from comparable corpora. Our proposed method is based on the following steps: (1) computing context vectors, (2) building concept vectors, and (3) combining them.
5.1 Computing context vectors
We rely here on word-embedding-based context vectors (context vectors for short in the remainder), that is, standard context vectors in which the coordinates correspond to similarities between word embeddings. To do so, the word2vec toolkit is used to compute word embeddings, learning 300-dimensional representations of words with the Skip-Gram model introduced by Mikolov et al. (2013). We rely in this study on standard parameter settings, without trying to optimize parameter values: the size of the contextual window is set to 5 (Laroche and Langlais 2010), the sub-sampling rate to $10^{-5}$ , and we use negative sampling to estimate the probability of a target word.
We then compute context vectors in a standard way, using again a window of five words. The final context vector for a given word t is then given by:
\begin{equation*}\overrightarrow{\mbox{V}}_t \;:\!=\; (w_1, w_2, \ldots, w_{|\mathscr{C}|})^{T}\end{equation*}
where T denotes the transpose, $|\mathscr{C}|$ the number of words in $\mathscr{C}$ , and $w_i$ the weight of the association of the $i^{th}$ word with t, measured here by the cosine between the embeddings of the words obtained before. If a word does not appear in the context of a given word, its weight is set to 0. Examples of French, English, and Italian context vectors computed from the comparable SDA95_french, GlasgowHerald95, and SDA95_italian (CLEF’2003) corpora are given in Table 1.
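The embedding-weighted context vectors can be sketched as follows, with small random vectors standing in for the 300-dimensional word2vec embeddings; the vocabulary and token sequence are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cancer", "breast", "treatment", "patient", "study"]
emb = {w: rng.standard_normal(50) for w in vocab}  # stand-in embeddings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def context_vector(head, tokens, window=5):
    """Weight each word co-occurring with `head` within a +-window span
    by the cosine between its embedding and the head's embedding."""
    vec = {w: 0.0 for w in vocab}
    for i, tok in enumerate(tokens):
        if tok != head:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            w = tokens[j]
            if j != i and w != head and w in vocab:
                vec[w] = cosine(emb[head], emb[w])
    return vec

tokens = ["breast", "cancer", "treatment", "study", "cancer", "patient"]
v = context_vector("cancer", tokens)
print({w: round(x, 2) for w, x in v.items() if x != 0.0})
```

Words outside the window (or outside the vocabulary) keep a weight of 0, matching the definition above.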
5.2 Building concept vectors
Concept vectors are based on the closed concepts obtained as described in Section 4.1. Examples of such closed concepts, obtained from the SDA95_french, GlasgowHerald95, and SDA95_italian (CLEF’2003) comparable corpora, are given in Table 2.
Let $\mathscr{N}_{\mathscr{C}}$ denote the number of closed concepts in a collection $\mathscr{C}$ . For any term t in the vocabulary of $\mathscr{C}$ , a concept vector for t is a vector over the closed concepts of $\mathscr{C}$ , that is:
\begin{equation*}\overrightarrow{\mbox{VC}}_t \;:\!=\; \big(\mu(t,CC_1), \mu(t,CC_2), \ldots, \mu(t,CC_{\mathscr{N}_{\mathscr{C}}})\big)^{T}\end{equation*}
where $\mu(t,CC_i)$ represents the importance of t in $CC_i$ . We rely here on the weight proposed in Chebel et al. (2015), based on the $tf \times idf$ weighting schema (Salton and Buckley 1988) and defined by:
\begin{equation*}\mu(t,CC) \;:\!=\; tf(t,CC) \times idf(t,CC)\end{equation*}
with tf(t,CC) and idf(t,CC) defined as above. As $\mu$ is null when the term does not co-occur with all the terms in the intension of a closed concept, only the closed concepts containing t are effectively taken into account in the concept vector of t.
5.3 Combination
The extracted concept vectors are then combined with context vectors according to two combination strategies:
(1) Direct combination: For each head word, a single vector of dimension $|\mathscr{C}|+\#(CC)$ is built from its concept and context vectors, denoted in the following as a combined vector. It contains both the local co-occurrence information provided by the context vector and the global information provided by the concept vector. Once the combined vectors in the source language are translated, they are compared to combined vectors in the target language via the standard cosine similarity measure.
(2) Weighted combination: Context and concept vectors are treated as distinct vectors, which are translated separately and which are then compared via a weighted linear combination. Thus, the similarity between two words t (from the source corpus) and t ′ (from the target corpus) is assessed as follows:
(8) \begin{equation}SIM(t,t') = \lambda cos(\overrightarrow{\mbox{V}}_t^{\mbox{trans}},\overrightarrow{\mbox{V}}_{t'}) + (1-\lambda) cos(\overrightarrow{\mbox{VC}}_t^{\mbox{trans}},\overrightarrow{\mbox{VC}}_{t'}),\end{equation}where $\lambda \in [0,1]$ is a parameter weighing the relative importance of context and concept vectors, which can be, for example, learned by k-fold cross-validation (cf. section 6.3). “trans” denotes here that the vector has been translated, meaning that the weight of a source word is transferred to its translation(s), as provided by a bilingual dictionary for context vectors and by the most likely translation (see Section 4.2) for concept vectors.
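Equation (8) reduces to a simple interpolation of two cosine similarities; a sketch with made-up translated and target vectors:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim(ctx_src_trans, ctx_tgt, cpt_src_trans, cpt_tgt, lam=0.5):
    """Weighted combination of context-vector and concept-vector
    similarities, with lambda weighing their relative importance."""
    return (lam * cosine(ctx_src_trans, ctx_tgt)
            + (1 - lam) * cosine(cpt_src_trans, cpt_tgt))

# Made-up translated source vectors and target vectors.
ctx_s = np.array([0.9, 0.1, 0.0]); ctx_t = np.array([0.8, 0.2, 0.1])
cpt_s = np.array([0.0, 1.0, 0.5]); cpt_t = np.array([0.1, 0.9, 0.4])
print(round(sim(ctx_s, ctx_t, cpt_s, cpt_t, lam=0.7), 3))
```

With $\lambda = 1$ the score falls back to the context vectors alone, with $\lambda = 0$ to the concept vectors alone.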
6. Experimental study
Our approach was evaluated on both specialized and unspecialized comparable corpora. Note that domain-specific comparable corpora are often of small size, unlike unspecialized comparable corpora, which tend to be larger (as, for example, journalistic corpora). These particularities make the standard approach more sensitive to context representations (Morin and Hazem 2016). We use here both an intrinsic evaluation, assessing the quality of the bilingual lexicons obtained, and an extrinsic one, assessing the usefulness of these lexicons in the context of CLIR systems.
6.1 Linguistic resources
We evaluate our multilingual document clustering approach on two different corpora, using the two language pairs French–English and Italian–English. These two comparable corpora have different characteristics:
(1) Unspecialized comparable corpora: We consider a subset of the multilingual collection used in the Cross-Language Evaluation Forum CLEF’2003. We rely here on the news articles (from newspapers or news agencies) of SDA95 in French (42,615 documents), GlasgowHerald95 in English (56,472 documents), and SDA95 in Italian (48,980 documents).
(2) Specialized comparable corpus: The Breast Cancer corpus is an unbalanced corpus composed of documents collected from the Elsevier websiteFootnote h. The documents are retrieved from the medical domain, within the sub-domain of “breast cancer.” We use the same corpus as in Morin and Hazem (Reference Morin and Hazem2016). The corpus comprises 130 French documents (about 530,000 words) and 1640 English documents (about 7.4 million words).
Each corpus is preprocessed with TreeTagger, and only nouns and adjectives are used for building the concept vectors. Table 3 summarizes the main characteristics of each corpus. As a bilingual dictionary, we utilize the general French–English bilingual dictionary of Li and Gaussier (Reference Li and Gaussier2012), which contains 74,921 entries. The number of dictionary entries present in CLEF’2003 is 20,432, whereas it is 6861 for Breast Cancer. We also use an Italian–English bilingual dictionary that contains 28,744 entriesFootnote i. The number of its entries present in CLEF’2003 is 7011. Lastly, for concept vector extraction, we set minsupp to 30 for the CLEF’2003 corpora and 20 for the Breast Cancer corpus so as to focus on informative closed concepts (low values of minsupp tend to yield non-informative concepts).
6.2 Bilingual clusters comparability evaluation
The bilingual clusters constructed previously can be seen as units of a comparable corpus. We propose here to assess the quality of such units in terms of comparability. To do so, we make use of two standard comparability measures defined between documents in source and target languages, $d_s$ and $d_t$ :
(1) Binary measure: For a given source document $d_s$ and a target document $d_t$ , the binary measure counts terms in $d_s$ which have translations in $d_t$ and then normalizes these counts by the vector size. The binary measure uses the function trans(t,d), which returns 1 if a translation of term t is found in document d, and 0 otherwise. The similarity using the binary measure is computed as follows (Saad et al. Reference Saad, Langlois and Smali2014):
(9) \begin{equation}bin(d_s,d_t) = \frac{\sum_{t \in d_s} trans(t,d_t)}{\mid d_s\mid}.\end{equation}In its original form, bin is not symmetric. We symmetrize it here by averaging over both directions (source $\leftrightarrow$ target):
(10) \begin{equation}sbin(d_s,d_t) = \frac{bin(d_s, d_t)+ bin(d_t, d_s)}{2}.\end{equation}
(2) Cosine measure: The cosine measure computes the cosine similarity between source and target vectors of the documents using the standard $tf*idf$ weighting scheme (Salton and Buckley Reference Salton and Buckley1988):
(11) \begin{equation}cosine(d_s, d_t) = \frac{ \overrightarrow{d_s} \cdot \overrightarrow{d_t}}{\parallel \overrightarrow{d_s} \parallel \cdot \parallel \overrightarrow{d_t} \parallel}.\end{equation}
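The binary and symmetrized binary measures of equations (9) and (10) can be sketched as follows. This is a hedged sketch under our own representation choices: documents are given as term sets and the bilingual dictionary as a term-to-translation-set mapping; the cosine measure of equation (11) is the standard tf*idf cosine and is not repeated here.

```python
def bin_comp(d_s, d_t, dico):
    """Equation (9): fraction of terms of d_s having a translation in d_t.

    d_s and d_t are sets of terms; dico maps a source term to the set of
    its candidate translations (direction d_s -> d_t).
    """
    hits = sum(1 for t in d_s if dico.get(t, set()) & d_t)
    return hits / len(d_s) if d_s else 0.0

def sbin(d_s, d_t, dico_st, dico_ts):
    """Equation (10): symmetrized binary comparability, averaging the
    binary measure over both translation directions."""
    return (bin_comp(d_s, d_t, dico_st) + bin_comp(d_t, d_s, dico_ts)) / 2
```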
To evaluate the comparability quality of the mined bilingual clusters, we compute the comparability scores of documents within each bilingual cluster through both sbin and cosine. We then derive an average comparability score over all clusters. Table 4 summarizes the results obtained on the French–English and Italian–English corpora. For comparison purposes, we also display the comparability score of the original unclustered bilingual corpora, denoted here $\mathscr{C}^{comp}$ . This amounts to considering that there is a single bilingual cluster encompassing all documents in the collection.
As one can note, the clustering improves the comparability of the bilingual collection. This holds for both measures and all three collections considered. This result is important as bilingual clusters can be used to refine the search space of candidate translations in bilingual lexicon extraction.
In order to better understand the results of the clustering process described previously, we report in Table 5 the percentage of actual bilingual clusters in each collection as well as the percentage of bilingual clusters above specific values of sbin (namely, $0.37$ and $0.6$ ). As can be noted, the percentage of bilingual clusters is relatively small (15% and 17% on CLEF 2003 French–English and Italian–English, respectively, and 34% on Breast Cancer). This shows that the method gathers the bilingual part of the collection in a few clusters; the remaining clusters gather the documents that are specific to each language. Furthermore, the vast majority of bilingual clusters (84%) in CLEF-2003 French–English have a comparability score with sbin that is higher than the comparability score of the original bilingual corpus ( $0.37$ ). The same is true for CLEF-2003 Italian–English. On Breast Cancer, the improvement in comparability is even higher (22%), which we attribute to the fact that, as the corpus is a specialized one, the clusters obtained tend to be more homogeneous within and across languages.
6.3 Experimental evaluation of FCA-based bilingual lexicon
6.3.1 Comparative baselines
In our experiments, as comparative baselines, we consider the standard approach (Rapp Reference Rapp1999) and two recent unsupervised, neural-based approaches introduced in Xu et al. (Reference Xu, Yang, Otani and Wu2018) and Zhang et al. (Reference Zhang, Liu, Luan and Sun2017).
• The standard approach (Rapp Reference Rapp1999): The standard approach follows the three steps (modeling contexts, calculating context similarities, and finding translation pairs) described in Section 2. It is used here with weights based on word embeddings, as described in Section 5.1.
• Zhang et al. (Reference Zhang, Liu, Luan and Sun2017) approachFootnote j: This approach takes monolingual word embeddings as input and uses an adversarial network to cross the language barrier. In our experiments, we use as input the CBOW model (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013) with the default word2vec hyperparameters, trained on our comparable corpora described in Section 6.1. The embedding dimension d is 50. The word embeddings are normalized to unit length. When sampling words for adversarial training, frequent words are penalized in a way similar to Mikolov et al. (Reference Mikolov, Sutskever, Chen, Corrado and Dean2013). G is initialized with a random orthogonal matrix, and the hidden layer size D is set to 500.
• Xu et al. (Reference Xu, Yang, Otani and Wu2018) approachFootnote k: This approach makes use of optimal transport to identify word equivalents across languages based on their word embeddings. We directly use their code with word embeddings of dimension 50 trained on our comparable corpora (see Section 6.1).
6.3.2 Experimental settings and evaluation protocol
Evaluation metrics: We evaluate the performance of our approach, denoted $d_{FCA}$ , using precision (P), recall (R), F1-score (F1), and MAP as defined in Manning et al. (Reference Manning, Raghavan and Schütze2008) and Morin and Hazem (Reference Morin and Hazem2016). We will compare our results on the Breast Cancer corpus to those of Morin and Hazem (Reference Morin and Hazem2016). Note that the standard approach we use here follows exactly the same process as $d_{FCA}$ , without the enrichment with concept vectors.
Precision assesses the proportion of lists containing the correct translation, whereas recall R gives the proportion of translations that are recovered in the candidate lists. The F1-score is the harmonic mean of precision and recall. In case of multiple translations, a list is deemed to contain the correct translation as soon as one of the possible translations is present (Li and Gaussier Reference Li and Gaussier2012). The MAP (Manning et al. Reference Manning, Raghavan and Schütze2008) is used to show the ability of the algorithm to precisely rank the selected candidate translations. Assuming the total number of English words in the reference list is m, let $r_i$ be the rank of the first correct translation in the candidate translation list for the $i^{th}$ term in the evaluation set. The MAP score is then defined by (Manning et al. Reference Manning, Raghavan and Schütze2008):
(12) \begin{equation}MAP = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{r_i},\end{equation}
with the convention that if the correct translation does not appear in the top N candidates, $\frac{1}{r_i}$ is set to 0. MAP is our primary measure to compare the proposed methods.
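Under this definition, the MAP computation can be sketched as below (the function name and the rank-list representation are ours: a value of None encodes a term whose correct translation never appears among the candidates).

```python
def map_score(ranks, m, n=500):
    """MAP as used here: mean of 1/r_i over the m reference words, where
    r_i is the rank of the first correct translation of the i-th term,
    with 1/r_i taken as 0 when no correct translation appears in the
    top n candidates."""
    total = 0.0
    for r in ranks:
        if r is not None and r <= n:
            total += 1.0 / r
    return total / m
```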
Evaluation protocol: To evaluate the quality of the lexicons extracted for the various runs, including the comparative baselines described in Section 6.3.1, we use 10-fold cross-validation for CLEF’2003 and the reference list of 169 French/English single words utilized in Morin and Hazem (Reference Morin and Hazem2016) for Breast Cancer. The bilingual dictionaries are divided into three parts, namely:
• 10% of the source words together with their translations are randomly chosen and used as the test set;
• 10% of the source words together with their translations are randomly chosen and utilized as the validation set for learning the parameter $\lambda$ defined in equation (8) for weighting the combined model;
• The rest is devoted to the training corpus on which the context and concept vectors are extracted.
Note that source words not present in source context/concept vectors, or with no translation in target context/concept vectors, are excluded from the evaluation and validation sets. The value of $\lambda$ obtained through cross-validation on CLEF’2003 is $0.7$ for $SDA95\_French$ and GlasgowHerald95 (FR-EN) and $0.6$ for $SDA95\_Italian$ and GlasgowHerald95 (IT-EN). For Breast Cancer, the value of $\lambda$ is $0.6$ . We use these values in all our experiments. Table 6 illustrates the evolution of the various evaluation measures (precision, recall, F1-score, and MAP) for N equal to 200 according to different values of $\lambda$ on the two corpora (the evaluation is computed on the test set for CLEF’2003).
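The protocol above (10%/10%/80% split of the seed dictionary and selection of $\lambda$ on the validation set) can be sketched as follows. The helper names and the grid of $\lambda$ values are our own assumptions; `validation_score` stands for any validation metric, such as MAP on the held-out validation set.

```python
import random

def split_dictionary(entries, seed=0):
    """Random 10%/10%/80% split of the seed bilingual dictionary into
    test, validation, and training parts (evaluation protocol above)."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n = len(entries) // 10
    return entries[:n], entries[n:2 * n], entries[2 * n:]

def select_lambda(validation_score, grid=None):
    """Pick the lambda of equation (8) maximizing a validation score;
    validation_score is a callable mapping a lambda value to a score."""
    grid = grid if grid is not None else [i / 10 for i in range(11)]
    return max(grid, key=validation_score)
```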
6.3.3 Results and discussion
Figures 2, 3, and 4 highlight the precision values obtained respectively with the standard approach, the weighted combination, and the direct combination, using the best value of $\lambda$ for the different comparable corpora. We consider different sizes (N) for the candidate list, varying from 1 to 500. As one can see, the weighted combination outperforms both the standard approach and the direct combination, especially for medium and large candidate lists ( $N = 100, 300$ , and 500). Furthermore, while the improvement is relatively modest for the French–English general corpus (Figure 2), it is larger and more systematic for the specialized corpus (Figure 3) and the Italian–English general corpus (Figure 4). We attribute this difference to the fact that, compared to the French–English general corpus, relatively few dictionary entries are present in the specialized corpus and the Italian–English general corpus. This suggests that our method is particularly appropriate for specialized domains and for languages for which large, general bilingual dictionaries may not be available. Indeed, as already mentioned in Section 1, our approach is well adapted to low-resource languages as it can be deployed on top of traditional context vectors that do not rely on word embeddings, which are not available for all languages.
Table 7 displays the results obtained in terms of MAP. From these results, we notice that the overall MAP is improved with the concept vectors on the comparable corpus CLEF’2003 (FR-EN) with the weighted combination. This combination significantly outperforms all baselines, demonstrating that the information in concept vectors is relevant for representing words in a bilingual lexicon extraction setting. For CLEF’2003 (IT-EN), $d_{FCA}$ , with direct and weighted combinations, outperforms the standard approach ( $21.80\%$ and $20.20\%$ vs $20.10\%$ ). Nevertheless, its performance is below that of the word-embedding-based models. This can be explained by the fact that, for this pair (IT-EN), the monolingual word embeddings built on the comparable corpus are better aligned across languages than the concept vectors are. Indeed, as there are more unique English words than Italian words, the translation process used with concept vectors may have difficulties in identifying the correct translations.
For the specialized comparable corpus, namely Breast Cancer (FR-EN), $d_{FCA}$ with a weighted combination ( $42.40\%$ ) outperforms all other models. This is an important finding since, to the best of our knowledge, no previous evaluation of FCA-based models has been conducted for bilingual lexicon extraction from specialized corpora. Moreover, the experiments show that the results obtained on Breast Cancer are above the ones obtained on CLEF’2003, which can be explained by the fact that the vocabulary used in the breast cancer field is more specific and less ambiguous than the one used in a general domain corpus. It is worth noting that for Breast Cancer, the obtained result ( $42.40\%$ with the weighted combination) slightly exceeds the one reported in Morin and Hazem (Reference Morin and Hazem2016) ( $42.30\%$ , the best MAP obtained on the unbalanced version of the corpus). That said, a different bilingual dictionary is used in this study; furthermore, as the two approaches are different, they could certainly complement each other.
Recently, in Hazem and Morin (Reference Hazem and Morin2018), the authors proposed meta-embedding representations. The important finding of their work was the efficiency of the character n-gram models, namely the character-based CBOW (CharCBOW) and skip-gram (CharSG) models. For Breast Cancer, Hazem and Morin (Reference Hazem and Morin2018) report that the CharSG, CBOW, and CharCBOW models, used individually, obtain MAP scores of $36.4\%$ , $21.9\%$ , and $60.8\%$ , respectively. Our model $d_{FCA}$ with a weighted combination obtains a MAP score of $42.40\%$ , above both CharSG ( $36.4\%$ ) and CBOW ( $21.9\%$ ) but below CharCBOW. We conjecture that using CharCBOW in our approach as well could lead to even better results.
As conjectured earlier, our approach allows obtaining a representation of concept vectors that is less sparse than the one obtained with context vectors. Indeed, as shown in Table 8, the average size of vectors for CLEF’2003 increases from 28 to 41 words (FR-EN) and from 30 to 44 (IT-EN) when considering concept vectors. It furthermore increases from 19 to 36 words for Breast Cancer. This growth is important and shows that the similarity between a base word and its candidate translations relies on more information. This information is valuable as illustrated in the results discussed previously.
6.4 Embedding extracted lexicons in CLIR
CLIR is concerned with the problem of finding documents written in a language different from that of the query. Although attempts to model multilinguality in information retrieval date back to the early 1970s, renewed interest in the field appeared in the mid-1990s with the rise of the Web, as pages written in many different languages suddenly became available. International organizations and government agencies of multilingual countries have been, and still are, traditional users of CLIR systems.
There are two main approaches to crossing the language barrier in CLIR systems: either using a bilingual dictionary to translate query words, or using a machine translation system, in which case one can translate entire documents. While the second approach can lead to very good results, it is only applicable to language pairs for which machine translation systems are available. We focus here on the first approach and show that bilingual lexicons extracted from comparable corpora can complement existing bilingual dictionaries and improve CLIR systems.
6.4.1 Experimental settings
We now assess the impact of using the extracted bilingual lexicon within standard CLIR systems, namely:
• A vector space model based on Robertson’s tf and Sparck Jones’ idf (Robertson and Sparck Jones Reference Robertson and Sparck Jones1988), referred to as TF-IDF,
• BM25 with the default parameter setting given by the Terrier system,
• The Jelinek-Mercer, Dirichlet and Hiemstra versions of the language models, again with the default parameters of the Terrier system ( $\lambda =0.15$ and $\mu = 2500$ ), referred to as LM-JM, LM-DIR, and LM-H.
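For illustration, the query-likelihood score with Jelinek-Mercer smoothing can be sketched as below. This is a hedged sketch using the common convention $p(t|d) = (1-\lambda)\,p_{ml}(t|d) + \lambda\,p(t|C)$; implementations differ on which component $\lambda$ weights, so this is not necessarily Terrier's exact formulation.

```python
import math

def lm_jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.15):
    """Log query likelihood with Jelinek-Mercer smoothing:
    p(t|d) = (1-lam) * tf(t,d)/|d| + lam * tf(t,C)/|C|.
    Terms with zero smoothed probability are skipped."""
    score = 0.0
    for t in query_terms:
        p = (1 - lam) * doc_tf.get(t, 0) / doc_len \
            + lam * coll_tf.get(t, 0) / coll_len
        if p > 0:
            score += math.log(p)
    return score
```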
We use in our experiments the English text collections from the bilingual tasks of the CLEF campaigns, with English, French, and Italian queries, from 2000 to 2003. Table 9 lists the number of documents (Nd), the number of distinct words (Nw), and the average document length ( $DL_{avg}$ ) in the English document collections, as well as the number of queries, Nq, in each task (all the queries are available in all languages). As the queries from 2000 to 2002 have the same target collection, they are combined into a single task. In all our experiments, we utilize bilingual dictionaries composed respectively of 70,000 entries for the French–English language pair and 28,000 entries for the Italian–English language pair. For evaluation, we use the standard IR metric MAP to compare the different models. Finally, we rely on a paired t-test (at the $0.05$ level) to assess the significance of the differences between the various CLIR systems.
Our experiments are conducted on the French–English and Italian–English language pairs, using two bilingual lexicons: (i) the original dictionary Od, as used in the previous lexicon extraction experiments (Section 6.3), and (ii) the $d_{FCA}$ bilingual lexicon, which corresponds to the best automatically extracted lexicon obtained before (Section 5). The CLIR systems using the original dictionary Od are considered as baselines. We then combine in each experiment the extracted lexicon $d_{FCA}$ with Od using the SYN strategy (Pirkola Reference Pirkola1998; Li and Gaussier Reference Li and Gaussier2012). The idea behind this synonym operator is to translate the query terms using a bilingual lexicon and to treat all the alternative translations of a word as a synonym set, considered as a single word in the documents (Pirkola Reference Pirkola1998); this is known as structured query translation. This strategy has been shown to outperform alternatives in several studies (Ballesteros and Sanderson Reference Ballesteros and Sanderson2003).
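The SYN strategy can be sketched as follows, with hypothetical helper names of our own: each query term is replaced by the set of its translations, and the frequency of a synonym class in a document is the summed frequency of its members, as if they formed one word.

```python
def syn_translate(query_terms, dico):
    """Pirkola's SYN operator: replace each source query term by the set
    of all its translations, treated as a single synonym class. Terms
    absent from the dictionary are kept as singleton classes."""
    return [dico.get(t, {t}) for t in query_terms]

def syn_tf(syn_class, doc_tf):
    """Term frequency of a synonym class in a document: the summed
    frequencies of its members, as if they were one word."""
    return sum(doc_tf.get(w, 0) for w in syn_class)
```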
6.4.2 CLIR experiments and results
The obtained results are summarized in Table 10. They show that the combination of bilingual lexicons $Od + d_{FCA}$ leads to a significant improvement over the baselines, especially on the French–English language pair on CLEF 2000–2002. A smaller improvement is observed on CLEF 2003 for the Italian–English language pair. This last result is not really surprising as only 7011 out of the 28,744 entries of the Italian–English bilingual dictionary Od are present in CLEF 2003. One can also note that, compared to the best runs reported in Savoy (Reference Savoy2003), our approach improves the MAP for the two language pairs FR-EN ( $47.38\%$ vs $42.70\%$ ) and IT-EN ( $40.04\%$ vs $37.77\%$ ) with the Okapi-BM25 model. In addition, the improvement obtained by $Od + d_{FCA}$ with the Jelinek-Mercer model on the French–English language pair reaches $6.67\%$ on CLEF 2003 and $2.09\%$ on CLEF 2000–2002, which respectively account for $95.79\%$ and $93.12\%$ of the corresponding monolingual baselines. The improvement obtained by $Od + d_{FCA}$ with the Jelinek-Mercer model on the Italian–English language pair reaches $1.80\%$ on CLEF 2003 and $1.18\%$ on CLEF 2000–2002, which respectively account for $93.79\%$ and $87.52\%$ of the corresponding monolingual baselines. A better gain ( $2.89\%$ ) is achieved with the $tf \times idf$ model on the French–English language pair and with BM25 on the Italian–English language pair ( $2.69\%$ ).
Lastly, Figures 5, 6, 7, and 8 further validate these findings. The CLIR systems based on the Jelinek-Mercer model and the extracted lexicon Od + $d_{FCA}$ obtain the best scores for all test collections, both for exact precision at low recall and for the percentage of the corresponding monolingual performance. This shows that the lexicon entries added from formal closed concepts improve retrieval scores and lead to better CLIR models.
7. Conclusion
We have proposed in this paper a new approach to bilingual lexicon extraction based on closed concepts directly extracted from the comparable corpora under consideration. The extracted concepts are then used to build concept vectors that complement the embedding-based context vectors currently used in bilingual lexicon extraction from comparable corpora. The experimental study, conducted on two comparable corpora, a specialized one from the medical domain and a general one made of news articles, has shown that the retained concept vectors provide a partial solution to the sparsity problem encountered with context vectors. Furthermore, the quality of the lexicons extracted with both concept and context vectors is higher than that of the lexicons extracted with context vectors only. Lastly, we have integrated the bilingual lexicon extracted from comparable corpora within CLIR systems and demonstrated that the extracted lexicon based on concept vectors contributes to improving CLIR performance. All in all, our approach based on FCA is a simple, easy-to-deploy approach to addressing the sparsity problem of bilingual lexicon extraction.
In the future, we plan on testing our approach on the task in which one is not given a particular corpus but rather a particular bilingual dictionary that needs to be completed. We plan in this context to integrate the concept vectors we have used here with methods based on word embeddings in order to further assess the usefulness of concept vectors.