
Exploiting unbalanced specialized comparable corpora for bilingual lexicon extraction

Published online by Cambridge University Press: 15 June 2016

EMMANUEL MORIN
Affiliation:
Université de Nantes, LINA UMR CNRS 6241, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex 03, France. E-mail: emmanuel.morin@univ-nantes.fr
AMIR HAZEM
Affiliation:
Université de Nantes, LINA UMR CNRS 6241, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex 03, France. E-mail: amir.hazem@univ-nantes.fr

Abstract

Most work on bilingual lexicon extraction from comparable corpora rests on the implicit hypothesis that the corpora are balanced in size. However, the historical context-based projection method is relatively insensitive to the size of each part of the comparable corpus. Within this context, we carried out a series of experiments to study the influence of unbalanced specialized comparable corpora on the quality of bilingual terminology extraction. Moreover, we introduced into the context-based projection method a strategy for re-estimating word co-occurrence observations. It relies on smoothing or prediction techniques that boost the observed word co-occurrence counts, which is mainly useful for the smallest part of an unbalanced comparable corpus. Our results show that using unbalanced specialized comparable corpora significantly improves the quality of the extracted lexicons.
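As a rough illustration of what re-estimating word co-occurrence observations can look like, the sketch below applies simple additive (add-delta) smoothing to window-based co-occurrence counts. It is not the authors' implementation: the function names, the toy corpus, the window size and the value of `delta` are all assumptions made for the example, and additive smoothing is only one member of the family of smoothing techniques the abstract alludes to.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count ordered word pairs that co-occur within `window` tokens.

    `sentences` is a list of token lists; the window size is an assumption.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def additive_smoothing(counts, vocabulary, delta=0.5):
    """Re-estimate each pair count as count + delta (add-delta smoothing).

    Unseen pairs receive `delta` instead of zero; `delta` is hypothetical.
    """
    return {
        (w, c): counts.get((w, c), 0) + delta
        for w in vocabulary
        for c in vocabulary
        if w != c
    }


# Toy stand-in for the smaller part of an unbalanced comparable corpus.
small_corpus = [["breast", "cancer", "treatment"], ["cancer", "screening"]]
vocab = sorted({token for sentence in small_corpus for token in sentence})

raw = cooccurrence_counts(small_corpus)
smoothed = additive_smoothing(raw, vocab)
```

In this toy setting, a pair that never co-occurs in the small corpus, such as `("breast", "screening")`, no longer has a zero observation after smoothing, which is the kind of effect that matters most when one side of the corpus is small and its co-occurrence counts are sparse.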

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 


Footnotes

We thank the two anonymous reviewers whose comments and suggestions helped improve and clarify this manuscript. This work is supported by the French National Research Agency under grant ANR-12-CORD-0020.
