Fine-grained analysis of language varieties and demographics

Francisco Rangel; Paolo Rosso; Wajdi Zaghouani; Anis Charfi

doi:10.1017/S1351324920000108

Published online by Cambridge University Press: 10 March 2020

Francisco Rangel*: Pattern Recognition and Human Language Technologies, Universitat Politècnica de València, Spain
Paolo Rosso: Pattern Recognition and Human Language Technologies, Universitat Politècnica de València, Spain
Wajdi Zaghouani: College of Humanities and Social Sciences, Hamad Bin Khalifa University, Ar-Rayyan, Qatar
Anis Charfi: Information Systems Program, Carnegie Mellon University in Qatar, Ar-Rayyan, Qatar
*Corresponding author. E-mail: kico.rangel@gmail.com

Abstract

The rise of social media empowers people to interact and communicate with anyone anywhere in the world. Anonymity helps users avoid censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about anonymous users, and this knowledge could also be useful in domains beyond security and forensics, such as marketing. In this paper, we focus on a fine-grained analysis of language varieties while also considering the authors’ demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with that of the best-performing teams in the Author Profiling task at PAN 2017. We obtained an average accuracy of 92.08% versus 91.84% for the best-performing team at PAN 2017. We also analyse the relationship between language variety identification and the authors’ gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We have also investigated the effect of the authors’ age and gender on the identification of the different Arabic varieties, as well as the effect of the corpus size on the performance of our method.

Keywords

Language variety identification; Demographics; Gender; Age; Author profiling; Cybersecurity; Arabic

Information

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 6: Natural Language Processing for Similar Languages, Varieties, and Dialects, November 2020, pp. 641–661

DOI: https://doi.org/10.1017/S1351324920000108
Copyright: © Cambridge University Press 2020
