Published online by Cambridge University Press: 15 June 2016
We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present a detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (such as contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher-accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features in a phrase-based SMT system. These monolingually estimated features enhance low-resource SMT systems, in addition to allowing end-to-end machine translation without parallel corpora.
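As a rough illustration of how a discriminative model might combine such signals, the sketch below trains a small logistic regression over four similarity scores and uses it to rank candidate translations. It is a minimal sketch under stated assumptions, not the model used in this work: the choice of logistic regression, the function names (`score`, `train`, `rank_candidates`), the feature order and all numeric values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical feature order: contextual, temporal, orthographic, topic similarity.
FEATURES = ("contextual", "temporal", "orthographic", "topic")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, feats):
    """Model probability that a (source, target) word pair is a translation."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, feats)))


def train(examples, epochs=500, lr=0.5):
    """Fit logistic regression by stochastic gradient descent.

    examples: list of (feature_vector, label) pairs, with label 1 for a
    known translation pair and 0 for a non-translation pair.
    """
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    for _ in range(epochs):
        for feats, label in examples:
            err = score(weights, bias, feats) - label  # gradient of the log loss
            weights = [w - lr * err * x for w, x in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


def rank_candidates(weights, bias, candidates):
    """Sort (name, feature_vector) candidates by model score, best first."""
    return sorted(candidates, key=lambda c: score(weights, bias, c[1]), reverse=True)


# Made-up similarity scores, for illustration only (not data from this article).
TRAIN = [
    ([0.9, 0.8, 0.7, 0.9], 1),
    ([0.8, 0.6, 0.2, 0.7], 1),
    ([0.2, 0.3, 0.1, 0.2], 0),
    ([0.1, 0.2, 0.6, 0.1], 0),
]

weights, bias = train(TRAIN)
ranked = rank_candidates(weights, bias, [
    ("cand_a", [0.15, 0.2, 0.5, 0.1]),
    ("cand_b", [0.85, 0.7, 0.3, 0.8]),
])
print([name for name, _ in ranked])
```

The point of learning the weights, rather than averaging the signals, is that the model can discount a signal that is unreliable on its own: in the toy data, orthographic similarity is high for one non-translation pair, so it should carry less weight than the other three signals.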
This material is based on research sponsored by DARPA under contract HR0011-09-1-0044 and by the Johns Hopkins University Human Language Technology Center of Excellence. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government. We would like to thank David Yarowsky for his tremendous support, and for his inspiring work on – and continued ideas about – learning translations from monolingual texts. We would like to thank Alex Klementiev for his substantial contributions to this research and his comments on a draft of this article. We would like to thank Manaal Faruqui and Sneha Jha for providing the reference translations for the two Hindi paragraphs. Thank you to the two anonymous reviewers who provided valuable feedback on the first draft of this manuscript.