Unsupervised Arabic dialect segmentation for machine translation

Published online by Cambridge University Press:  23 September 2020

Wael Salloum*
Affiliation:
AI Research Department, Mendel.ai, San Jose, CA, USA
Nizar Habash
Affiliation:
AI Research Department, Mendel.ai, San Jose, CA, USA
*Corresponding author. E-mail: wael@ccls.columbia.edu

Abstract

Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity when data are already scarce. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich, and yet resource-poor variants of Standard Arabic. Our approach exploits the existence of monolingual and parallel data. We show comparable performance to state-of-the-art supervised methods for DA segmentation.

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
