A survey of automatic Arabic diacritization techniques

Published online by Cambridge University Press:  10 October 2013

AQIL M. AZMI
Affiliation:
Department of Computer Science, King Saud University, Riyadh 11543, Saudi Arabia e-mails: aqil@ksu.edu.sa, reham.imamu@gmail.com
REHAM S. ALMAJED
Affiliation:
Department of Computer Science, King Saud University, Riyadh 11543, Saudi Arabia e-mails: aqil@ksu.edu.sa, reham.imamu@gmail.com

Abstract

Modern Standard Arabic texts are typically written without diacritical markings. The diacritics are important for clarifying the sense and meaning of words, and their absence may lead to ambiguity even for native speakers. Native speakers often disambiguate the meaning successfully from context; however, many Arabic applications, such as machine translation, text-to-speech, and information retrieval, suffer from the lack of diacritics. The process of automatically restoring diacritical marks is called diacritization or diacritic restoration. In this paper we discuss the properties of the Arabic language and the issues related to the lack of diacritical markings. This is followed by a survey of recent algorithms developed to solve the diacritization problem. We also look into future trends for researchers working in this area.

Type
Articles
Copyright
Copyright © Cambridge University Press 2013 
