Arabic spelling error detection and correction

Published online by Cambridge University Press: 18 March 2015

MOHAMMED ATTIA
Affiliation:
School of Computing, Dublin City University, Ireland, e-mail: mattia@computing.dcu.ie, josef@computing.dcu.ie Faculty of Engineering and IT, The British University in Dubai, UAE e-mail: khaled.shaalan@buid.ac.ae
PAVEL PECINA
Affiliation:
Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic e-mail: pecina@ufal.mff.cuni.cz
YOUNES SAMIH
Affiliation:
Department of Linguistics and Information Science, Heinrich-Heine-Universität Düsseldorf, Germany e-mail: samih@phil.uni-duesseldorf.de
KHALED SHAALAN
Affiliation:
Faculty of Engineering and IT, The British University in Dubai, UAE e-mail: khaled.shaalan@buid.ac.ae
JOSEF VAN GENABITH
Affiliation:
School of Computing, Dublin City University, Ireland, e-mail: mattia@computing.dcu.ie, josef@computing.dcu.ie

Abstract

A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.
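To make the three-component design concrete, here is a minimal sketch of how a dictionary, an error model and a language model can interact. It is not the authors' system: the word list and counts are hypothetical toy data, the error model is a plain optimal-string-alignment edit distance (insertions, deletions, substitutions, adjacent transpositions), the language model is reduced to unigram counts, and the names `osa_distance`, `is_misspelled` and `suggest` are invented for illustration. The code compares Unicode code points, so the same logic would apply to Arabic strings.

```python
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions and transpositions
    of adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "re" <-> "er".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Hypothetical toy data standing in for the reference word list and
# for corpus frequencies (a unigram stand-in for a language model).
WORD_COUNTS = Counter(
    {"letter": 50, "latter": 20, "later": 80, "litter": 5, "better": 60}
)
DICTIONARY = set(WORD_COUNTS)


def is_misspelled(word: str) -> bool:
    """Detection: a word is flagged if it is not in the dictionary."""
    return word not in DICTIONARY


def suggest(word: str, max_distance: int = 2, top_n: int = 3) -> list[str]:
    """Correction: rank in-dictionary candidates within max_distance.

    Candidates are ordered by edit distance first (error model),
    then by descending frequency (language model stand-in).
    """
    scored = []
    for candidate in DICTIONARY:
        dist = osa_distance(word, candidate)
        if dist <= max_distance:
            scored.append((dist, -WORD_COUNTS[candidate], candidate))
    scored.sort()
    return [candidate for _, _, candidate in scored[:top_n]]


if __name__ == "__main__":
    print(is_misspelled("lettre"))
    print(suggest("lettre"))
```

Sorting on the tuple `(distance, -count, word)` is one simple way to combine the two scores; a real system would more likely weight edit operations by error type and use contextual n-gram probabilities rather than unigram counts.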

Type
Articles
Copyright
Copyright © Cambridge University Press 2015 

Footnotes

We are grateful to our anonymous reviewers whose comments and suggestions have helped us to improve the paper considerably. This research is funded by the Irish Research Council for Science Engineering and Technology (IRCSET), the UAE National Research Foundation (NRF) (Grant No. 0514/2011), the Czech Science Foundation (grant no. P103/12/G084), DFG Collaborative Research Centre 991: The Structure of Representations in Language, Cognition, and Science (http://www.sfb991.uni-duesseldorf.de/sfb991), and the Science Foundation Ireland (Grant No. 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Dublin City University.
