Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study

Taghreed Tarmom; William Teahan; Eric Atwell; Mohammad Ammar Alsalka

doi:10.1017/S135132492000011X

Published online by Cambridge University Press: 05 May 2020

Taghreed Tarmom: School of Computing, University of Leeds, Leeds, UK
William Teahan: School of Computer Science and Electronic Engineering, Bangor University, Bangor, UK
Eric Atwell*: School of Computing, University of Leeds, Leeds, UK
Mohammad Ammar Alsalka: School of Computing, University of Leeds, Leeds, UK
*Corresponding author. E-mail: E.S.Atwell@leeds.ac.uk

Abstract

Code-switching in online communication, where a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To address this challenge, this paper investigates ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, with a traditional machine learning classifier, sequential minimal optimization (SMO), implemented in the Waikato Environment for Knowledge Analysis (WEKA), working specifically on Arabic text taken from Facebook. Three experiments were conducted to detect code-switching: (1) between the Egyptian dialect and English; (2) among the Egyptian dialect, the Saudi dialect, and English; and (3) among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved higher accuracy than SMO: 99.8% versus 97.5% in the first experiment and 97.8% versus 80.7% in the second. In the third experiment, PPM achieved lower accuracy than SMO: 53.2% versus 60.2%. Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA because these use the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model may not represent MSA Facebook text well, since it was built from news websites. This paper also describes in detail the new Arabic corpora created for this research and our experiments.
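
The general idea behind compression-based classification can be illustrated with a minimal Python sketch: train one character model per language or variety, then assign a text segment to the model that would encode it in the fewest bits. This is not PPM and not the Tawa implementation used in the paper; it substitutes a much simpler fixed-order character model with add-one smoothing. The class name `CharNgramModel`, the function `classify`, the `order` and `alphabet_size` parameters, and the tiny training strings are all assumptions made for illustration, not data or code from the study.

```python
import math
from collections import Counter


class CharNgramModel:
    """Fixed-order character model with add-one smoothing.

    A simplified stand-in for a PPM model: it has no escape mechanism
    and no backoff to shorter contexts.
    """

    def __init__(self, text, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed smoothing constant
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        padded = " " * order + text
        for i in range(order, len(padded)):
            context = padded[i - order:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def bits_per_char(self, text):
        """Average code length of `text` under this model, in bits per character."""
        if not text:
            return 0.0
        padded = " " * self.order + text
        total_bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts[context + padded[i]]
            prob = (seen + 1) / (self.context_counts[context] + self.alphabet_size)
            total_bits -= math.log2(prob)
        return total_bits / len(text)


def classify(segment, models):
    """Return the label whose model encodes `segment` in the fewest bits."""
    return min(models, key=lambda label: models[label].bits_per_char(segment))


# Toy training texts (illustrative only, not taken from the paper's corpora).
models = {
    "English": CharNgramModel("the weather is nice today and we are going to the market"),
    "Arabic": CharNgramModel("الجو جميل اليوم ونحن ذاهبون إلى السوق"),
}

print(classify("going to the market", models))
print(classify("إلى السوق", models))
```

Because the two toy training texts use different character sets, the sketch separates them easily, which mirrors the abstract's observation about Arabic versus English. Varieties that share a script would need far larger training corpora, and even then the models may compress a segment almost equally well.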

Keywords

Arabic; Corpus linguistics; Language resources; Machine learning; Sublanguages and controlled languages; Text segmentation

Information

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 6: Natural Language Processing for Similar Languages, Varieties, and Dialects, November 2020, pp. 663–676

DOI: https://doi.org/10.1017/S135132492000011X
Copyright: © Cambridge University Press 2020

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable
