A survey of diacritic restoration in abjad and alphabet writing systems

FRANKLIN ỌLÁDIÍPỌ̀ ASAHIAH; ỌDẸ́TÚNJÍ ÀJÀDÍ ỌDẸ́JỌBÍ; EMMANUEL RÓTÌMÍ ADÁGÚNODÒ

doi:10.1017/S1351324917000407

Published online by Cambridge University Press: 20 November 2017

Affiliation (all three authors): Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria
E-mails: sobusola@oauife.edu.ng, oodejobi@oauife.edu.ng, eadagun@oauife.edu.ng

Abstract

A diacritic is a mark placed near or through a character to alter its original phonetic or orthographic value. Many languages around the world use diacritics in their orthography, whatever writing system the orthography is based on. In many languages, diacritics are omitted in practice, either by convention or as a matter of convenience. For readers who are not familiar with the text domain, the absence of diacritics is known to cause mild to serious readability and comprehension problems. For natural language processing systems, the absence of diacritics causes near-intractable problems. This situation has led to extensive research on diacritic restoration (or diacritization). Several techniques have been applied to the task, but existing surveys of these techniques have been restricted to a few languages and have therefore left gaps for practitioners to fill. Our survey examines diacritization from the angle of the resources deployed and the various formulations employed. It concludes by recommending that (a) any proposed diacritization technique should take into account the features of the language and the purpose served by its diacritics, and (b) evaluation metrics should be defined more rigorously so that the performance of models can be compared easily.
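To make the problem concrete, the following is a minimal Python sketch, not taken from the paper, of how dropping diacritics collapses distinct words into one form, which is the ambiguity a restoration system must resolve. The word list and the function name `strip_diacritics` are hypothetical and chosen only for illustration; the sketch assumes that diacritics are encoded as Unicode combining marks after canonical decomposition.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


# Hypothetical toy vocabulary (French-like examples), for illustration only.
words = ["résumé", "resume", "côte", "côté", "cote"]

# Group the diacritized words by their undiacritized form.
candidates = defaultdict(list)
for word in words:
    candidates[strip_diacritics(word)].append(word)

for plain, options in candidates.items():
    print(plain, "->", options)
```

Each undiacritized key maps to several diacritized candidates, so restoration needs context to choose among them. Note that marks written through a letter, such as the stroke in "ø", have no canonical decomposition in Unicode, so this simple approach leaves such letters unchanged.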

Information

Type: Survey Paper
Journal: Natural Language Engineering, Volume 24, Issue 1, January 2018, pp. 123–154

DOI: https://doi.org/10.1017/S1351324917000407
Copyright: © Cambridge University Press 2017
