Multilingual SMS-based author profiling: Data and methods

Published online by Cambridge University Press: 26 June 2018

MEHWISH FATIMA
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan e-mails: adeelnawab@ciitlahore.edu.pk, amnanaveed@ciitlahore.edu.pk, mehwish.fatima@ciitlahore.edu.pk, sabaanwar@ciitlahore.edu.pk, aliamasood2020@gmail.com
SABA ANWAR
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan e-mails: adeelnawab@ciitlahore.edu.pk, amnanaveed@ciitlahore.edu.pk, mehwish.fatima@ciitlahore.edu.pk, sabaanwar@ciitlahore.edu.pk, aliamasood2020@gmail.com
AMNA NAVEED
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan e-mails: adeelnawab@ciitlahore.edu.pk, amnanaveed@ciitlahore.edu.pk, mehwish.fatima@ciitlahore.edu.pk, sabaanwar@ciitlahore.edu.pk, aliamasood2020@gmail.com
WAQAS ARSHAD
Affiliation:
Department of Computer Science & IT, Superior University, Lahore, Pakistan e-mail: waqas.arshad@superior.edu.com.pk
RAO MUHAMMAD ADEEL NAWAB
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan e-mails: adeelnawab@ciitlahore.edu.pk, amnanaveed@ciitlahore.edu.pk, mehwish.fatima@ciitlahore.edu.pk, sabaanwar@ciitlahore.edu.pk, aliamasood2020@gmail.com
MUNTAHA IQBAL
Affiliation:
Al-Khwarizmi Institute of Computer Science, University of Engineering & Technology, Lahore, Pakistan e-mail: muntaha.iqbal@kics.edu.pk
ALIA MASOOD
Affiliation:
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Pakistan e-mails: adeelnawab@ciitlahore.edu.pk, amnanaveed@ciitlahore.edu.pk, mehwish.fatima@ciitlahore.edu.pk, sabaanwar@ciitlahore.edu.pk, aliamasood2020@gmail.com

Abstract

In recent years, many benchmark author profiling corpora have been developed for various genres, including Twitter, social media, blogs, hotel reviews and e-mail. However, no such standard evaluation resource has been developed for Short Message Service (SMS), a popular medium of communication that is very useful for author profiling. The primary aim of this study is to develop a large multilingual (English and Roman Urdu) benchmark SMS-based author profiling corpus. The proposed corpus contains 810 author profiles. Each profile consists of an author's SMS messages aggregated into a single document, along with seven demographic traits for that author: gender, age, native language, native city, qualification, occupation and personality type (introvert/extrovert). The secondary aims of this study are (1) to annotate the proposed corpus for code-switching at the lexical level (approximately 0.69 million tokens are manually annotated for code-switching) and (2) to apply the stylometry-based method (groups of sixty-four features) and the content-based method (twelve features) to gender identification, in order to demonstrate how the proposed corpus can be used for the development and evaluation of various author profiling methods. The results show that the content-based character 5-gram feature outperformed all other features, achieving an accuracy of 0.975 and an F1 score of 0.947 for gender identification on the entire corpus. Furthermore, our proposed corpora (SMS–AP–18 and code-switched SMS–AP–18) are freely and publicly available for research purposes.
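
As a rough illustration of the content-based feature and the two evaluation measures named in the abstract, the following minimal Python sketch extracts overlapping character 5-grams from a single profile document and computes accuracy and F1 for one class. It is not the authors' implementation: the function names `char_ngrams` and `accuracy_and_f1`, the toy message and the toy labels are all hypothetical, and the abstract does not state how the reported F1 score was averaged, so a single-class F1 is assumed here.

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Count overlapping character n-grams in one profile document.

    Hypothetical helper; the paper's own tokenisation and
    normalisation steps are not described in the abstract.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def accuracy_and_f1(gold, predicted, positive):
    """Return (accuracy, F1) where F1 is computed for the `positive` class.

    Assumption: single-class F1; the abstract does not specify
    the averaging scheme behind its reported score.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Made-up mixed English / Roman Urdu message, not taken from the corpus.
profile = "kal milte hain, see you at 5"
features = char_ngrams(profile, n=5)

# Made-up gold and predicted gender labels for four profiles.
gold = ["male", "female", "male", "male"]
predicted = ["male", "female", "female", "male"]
accuracy, f1 = accuracy_and_f1(gold, predicted, positive="male")
```

In a full pipeline, the n-gram counts for every profile would be turned into a feature matrix and passed to a classifier. Character n-grams are a natural fit for SMS text because they need no language-specific tokeniser, which matters when English and Roman Urdu are mixed within one message.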

Type
Article
Copyright
Copyright © Cambridge University Press 2018 

