
Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study

Published online by Cambridge University Press:  05 May 2020

Taghreed Tarmom
Affiliation:
School of Computing, University of Leeds, Leeds, UK
William Teahan
Affiliation:
School of Computer Science and Electronic Engineering, Bangor University, Bangor, UK
Eric Atwell*
Affiliation:
School of Computing, University of Leeds, Leeds, UK
Mohammad Ammar Alsalka
Affiliation:
School of Computing, University of Leeds, Leeds, UK
*Corresponding author. E-mail: E.S.Atwell@leeds.ac.uk

Abstract

Code-switching in online communication, where a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for text written in a single language. To address this challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, with a traditional machine learning classifier, sequential minimal optimization (SMO), implemented in the Waikato Environment for Knowledge Analysis (WEKA), working specifically on Arabic text taken from Facebook. Three experiments were conducted to detect code-switching: (1) between the Egyptian dialect and English; (2) among the Egyptian dialect, the Saudi dialect, and English; and (3) among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. PPM achieved higher accuracy than SMO in the first experiment (99.8% versus 97.5%) and in the second (97.8% versus 80.7%). In the third experiment, PPM achieved lower accuracy than SMO (53.2% versus 60.2%). Code-switching between Egyptian Arabic and English is the easiest to detect because Arabic and English are generally written in different character sets. Distinguishing between Arabic dialects and MSA is more difficult because they share the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used to train the MSA model was built from news websites and so may not represent MSA as written on Facebook. This paper also describes in detail the new Arabic corpora created for this research and our experiments.
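The observation that Arabic and English are generally written in different character sets can be illustrated with a short sketch. The Python code below is not the PPM or SMO classifier evaluated in the paper; it is a hypothetical baseline (the function names `script_of`, `tag_tokens`, and `switch_points` are assumptions introduced here) that labels each whitespace-separated token by the Unicode script of its letters and reports where the script changes. By construction it cannot separate Arabic dialects from MSA, which share one script; that is the harder case the abstract describes.

```python
import unicodedata


def script_of(token: str) -> str:
    """Label a token 'arabic', 'latin', or 'other' by majority vote over its letters."""
    counts = {"arabic": 0, "latin": 0}
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, emoji
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            counts["arabic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if counts["arabic"] == 0 and counts["latin"] == 0:
        return "other"
    return "arabic" if counts["arabic"] >= counts["latin"] else "latin"


def tag_tokens(text: str) -> list[tuple[str, str]]:
    """Split on whitespace and attach a script label to every token."""
    return [(tok, script_of(tok)) for tok in text.split()]


def switch_points(tags: list[tuple[str, str]]) -> list[int]:
    """Return token indices at which the script differs from the previous labelled token."""
    points = []
    prev = None
    for i, (_, script) in enumerate(tags):
        if script == "other":
            continue  # tokens without letters do not count as a switch
        if prev is not None and script != prev:
            points.append(i)
        prev = script
    return points


if __name__ == "__main__":
    # Illustrative mixed Arabic/English sentence (not taken from the paper's corpora).
    sample = "انا رايح the mall بكرة"
    tags = tag_tokens(sample)
    print(tags)
    print(switch_points(tags))
```

A baseline of this kind only covers the Arabic versus English case; telling Egyptian, Saudi, and MSA text apart needs a model of the language itself, which is where the paper's PPM and SMO classifiers come in.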

Type
Article
Copyright
© Cambridge University Press 2020
