A unified alignment algorithm for bilingual data

Published online by Cambridge University Press: 13 September 2011

CHRISTOPH TILLMANN
Affiliation:
IBM T.J. Watson Research Center, Yorktown Heights, New York, NY 10598, USA email: ctill@us.ibm.com
SANJIKA HEWAVITHARANA
Affiliation:
Carnegie Mellon University, Pittsburgh, PA 15213, USA email: sanjika@cs.cmu.edu

Abstract

The paper presents a novel unified algorithm for aligning sentences with their translations in bilingual data. With the help of ideas from a stack-based dynamic programming decoder for speech recognition (Ney 1984), the search is parametrized in a novel way such that the unified algorithm can be used on various types of data that have been previously handled by separate implementations: the extracted text chunk pairs can be either sub-sentential pairs, one-to-one, or many-to-many sentence-level pairs. The one-stage search algorithm is carried out in a single run over the data. Its memory requirements are independent of the length of the source document, and it is applicable to sentence-level parallel as well as comparable data. With the help of a unified beam-search candidate pruning, the algorithm is very efficient: it avoids any document-level pre-filtering and uses less restrictive sentence-level filtering. Results are presented on a Russian–English, a Spanish–English, and an Arabic–English extraction task. Based on simple word-based scoring features, text chunk pairs are extracted out of several trillion candidates, where the search is carried out on 300 processors in parallel.
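The search described in the abstract can be illustrated with a small sketch. The code below is not the authors' algorithm: it is a minimal beam-pruned dynamic program over sentence positions, assuming a hypothetical toy lexicon, a made-up word-overlap score, and a fixed set of chunk shapes (one-to-one, skips, one-to-two, two-to-one). All names (`chunk_score`, `align`, `MOVES`, `beam`) are illustrative assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple

State = Tuple[int, int]  # (source sentences consumed, target sentences consumed)

# Allowed chunk shapes (assumed): 1-1, skip source, skip target, 2-1, 1-2.
MOVES = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]


def chunk_score(src: List[str], tgt: List[str], lexicon: Dict[str, Set[str]]) -> float:
    """Toy word-based score: +1 for every word with a lexicon match, -1 otherwise."""
    tgt_set = set(tgt)
    translations: Set[str] = set()
    for word in src:
        translations |= lexicon.get(word, set())
    matched_src = sum(1 for w in src if lexicon.get(w, set()) & tgt_set)
    matched_tgt = sum(1 for w in tgt if w in translations)
    return 2 * (matched_src + matched_tgt) - (len(src) + len(tgt))


def align(src_sents: List[List[str]], tgt_sents: List[List[str]],
          lexicon: Dict[str, Set[str]], beam: float = 5.0):
    """Return a monotone list of (source indices, target indices) chunk pairs."""
    n, m = len(src_sents), len(tgt_sents)
    # levels[d] holds hypotheses that have consumed d sentences in total.
    levels: List[Dict[State, Tuple[float, Optional[State]]]] = [
        {} for _ in range(n + m + 1)
    ]
    levels[0][(0, 0)] = (0.0, None)
    for d in range(n + m + 1):
        hyps = levels[d]
        if not hyps:
            continue
        top = max(score for score, _ in hyps.values())
        for (i, j), (score, _) in hyps.items():
            if score < top - beam:
                continue  # beam pruning: drop hypotheses far below the best
            for di, dj in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src = [w for s in src_sents[i:ni] for w in s]
                tgt = [w for s in tgt_sents[j:nj] for w in s]
                new = score + chunk_score(src, tgt, lexicon)
                old = levels[ni + nj].get((ni, nj))
                if old is None or new > old[0]:
                    levels[ni + nj][(ni, nj)] = (new, (i, j))
    # Follow back-pointers from the final state to recover the chunk pairs.
    pairs = []
    state: State = (n, m)
    while state != (0, 0):
        _, prev = levels[sum(state)][state]
        pairs.append((list(range(prev[0], state[0])), list(range(prev[1], state[1]))))
        state = prev
    pairs.reverse()
    return pairs


# Toy data (invented for illustration only).
toy_lexicon = {"das": {"the"}, "haus": {"house"}, "es": {"it"},
               "ist": {"is"}, "klein": {"small"}, "danke": {"thanks"}}
toy_src = [["das", "haus"], ["es", "ist", "klein"], ["danke"]]
toy_tgt = [["the", "house"], ["it", "is"], ["small"], ["thanks"]]
print(align(toy_src, toy_tgt, toy_lexicon))
```

In this sketch, hypotheses are compared only against others that have consumed the same total number of sentences, and a smaller `beam` trades search accuracy for speed. Unlike the algorithm summarized above, this toy version keeps every level in memory for the final traceback.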

Type
Articles
Copyright
Copyright © Cambridge University Press 2011

References

Brown, P., Spohrer, J., Hochschild, P. and Baker, J. 1982. Partial traceback and dynamic programming. In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP 82), Paris, France, pp. 1629–32.
Brown, P. F., Lai, J. C. and Mercer, R. L. 1991. Aligning sentences in parallel corpora. In Proceedings of ACL 91, Berkeley, CA, pp. 169–76.
Brown, P. F., Della Pietra, V. J., Della Pietra, S. A. and Mercer, R. L. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2): 263–311.
Chen, S. F. 1993. Aligning sentences in bilingual corpora using lexical information. In Proceedings of ACL 93, June 16–17, Columbus, OH, pp. 9–16.
Deng, Y., Kumar, S. and Byrne, W. 2006. Segmentation and alignment of parallel text for statistical machine translation. Natural Language Engineering 12 (4): 1–26.
Fung, P. and Cheung, P. 2004. Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of EMNLP 04, July 25–26, Barcelona, Spain, pp. 57–63.
Hewavitharana, S. and Vogel, S. 2011. Extracting parallel phrases from comparable data. In Proceedings of ACL Workshop on Building and Using Comparable Corpora, June 24, Portland, OR, pp. 61–8.
Gale, W. A. and Church, K. W. 1991. A program for aligning sentences in bilingual corpora. In Proceedings of ACL 91, June 18–21, Berkeley, CA, pp. 177–84.
Koehn, P., Och, F. J. and Marcu, D. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 03, May 27–June 1, Edmonton, Alberta, Canada, pp. 127–33.
Koehn, P. 2004. Pharaoh: a beam search decoder for phrase-based SMT models. In Proceedings of AMTA 04, September 28–October 2, Washington DC.
Ma, X. 2006. Champollion: a robust parallel text sentence aligner. In Proceedings of LREC 06, May 22–28, Genova, Italy, pp. 489–92.
Melamed, I. D. 1999. Bitext maps and alignment via pattern recognition. Computational Linguistics 25 (1): 107–30.
Mendonca, A., Graff, D. and DiPersio, D. 2009. Spanish Gigaword Corpus, 2nd ed., LDC catalog no. 2009T21. Philadelphia, PA: LDC.
Moore, R. C. 2002. Fast and accurate sentence alignment of bilingual data. In Proceedings of AMTA 05, Tiburon, CA, pp. 135–44.
Munteanu, D. S. and Marcu, D. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics 31 (4): 477–504.
Munteanu, D. S. and Marcu, D. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of COLING/ACL 06, July 17–21, Sydney, Australia, pp. 81–8.
Ney, H. 1984. The use of a one-stage dynamic programming algorithm for connected word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2): 263–71.
Och, F.-J. and Ney, H. 2004. The alignment template approach to statistical machine translation. Computational Linguistics 30 (4): 417–50.
Och, F. J. et al. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of the Joint HLT and NAACL Conference (HLT 04), May 2–7, Boston, MA, pp. 161–8.
Olive, J., Christianson, C. and McCary, J. (Editors). 2011. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. New York: Springer.
Ortmanns, S., Ney, H. and Eiden, A. 1996. Language-model look-ahead for large vocabulary speech recognition. In Proceedings of ICASSP 96, May 7–9, Atlanta, GA, pp. 2095–8.
Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 02, July 7–12, Philadelphia, PA, pp. 311–18.
Parker, R., Graff, D., Kong, J., Chen, K. and Maeda, K. 2009. English Gigaword Corpus, 4th ed., LDC catalog no. 2009T13. Philadelphia, PA: LDC.
Pike, C. and Melamed, I. D. 2004. An automatic filter for non-parallel texts. In The Comp. Volume of the Proceedings of ACL 04, July 21–26, Barcelona, Spain, pp. 114–17.
Quirk, C., Udupa, R. and Menezes, A. 2007. Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of the MT Summit XI, September 10–14, Copenhagen, Denmark, pp. 321–7.
Resnik, P. and Smith, N. 2003. The web as parallel corpus. Computational Linguistics 29 (3): 349–80.
Snover, M., Dorr, B. and Schwartz, R. 2008. Language and translation model adaptation using comparable corpora. In Proceedings of EMNLP 08, October 25–27, Honolulu, HI, pp. 856–5.
Tillmann, C. and Xu, J.-M. 2009. A simple sentence-level extraction algorithm for comparable data. In Proceedings of HLT/NAACL 09, May 31–June 5, Boulder, CO, pp. 93–6.
Tillmann, C. and Zhang, T. 2007. A block bigram prediction model for statistical machine translation. ACM-TSLP 4 (6): 1–31 (July).
Tillmann, C. 2006. Efficient dynamic programming search algorithms for phrase-based SMT. In Proceedings of the Workshop CHPSLP at HLT 06, June 4–9, New York City, NY, pp. 9–16.
Tillmann, C. 2009. A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference, August 2–7, Suntec, Singapore, pp. 225–8.
Utiyama, M. and Isahara, H. 2003. Reliable measures for aligning Japanese–English news articles and sentences. In Proceedings of ACL 03, July 7–12, Sapporo, Japan, pp. 72–9.
Zhao, B. and Vogel, S. 2002. Adaptive parallel sentences mining from WebBilingualNewsCollection. In IEEE International Conference on Data Mining (ICDM 2002), December 2–12, Maebashi City, Japan, pp. 745–8.