Authorship analysis of aliases: Does topic influence accuracy?

Published online by Cambridge University Press: 08 October 2013

ROBERT LAYTON
Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Australia e-mails: r.layton@icsl.com.au, p.watters@ballarat.edu.au
PAUL A. WATTERS
Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Australia e-mails: r.layton@icsl.com.au, p.watters@ballarat.edu.au
RICHARD DAZELEY
Affiliation:
Data Mining and Informatics Research Group, University of Ballarat, Australia e-mail: r.dazeley@ballarat.edu.au

Abstract

Aliases play an important role in online environments by facilitating anonymity, but can also be used to hide the identity of cybercriminals. Previous studies have investigated the alias matching problem, which asks whether two aliases belong to the same author and can therefore assist with identifying users. Those studies create their training data by randomly splitting the documents associated with an alias into two sub-aliases. Models built this way regularly achieve over 90% accuracy in recovering the linkage between these ‘random sub-aliases’. In this paper, we show that random sub-alias generation is itself what enables these high accuracies, and that it therefore does not adequately model the real-world problem. In contrast, creating sub-aliases using topic-based splitting drastically reduces the accuracy of all authorship methods tested. We then present a methodology that can be applied to non-topic-controlled datasets to produce topic-based sub-aliases that are more difficult to match. Finally, we present an experimental comparison of many authorship methods to see which better match aliases under these conditions, finding that local n-gram methods perform better than the others.
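
The difference between the two ways of generating sub-aliases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `topic_of` callback and the greedy balancing rule are assumptions made for illustration, and the sketch presumes each document already carries a topic label, which the abstract does not state.

```python
import random
from collections import defaultdict


def random_sub_aliases(docs, seed=0):
    """Randomly split one alias's documents into two sub-aliases."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def topic_sub_aliases(docs, topic_of):
    """Split one alias's documents so the two sub-aliases share no topic.

    `topic_of` is a hypothetical callback returning a topic label for a
    document; how topics are obtained is outside this sketch.
    """
    by_topic = defaultdict(list)
    for doc in docs:
        by_topic[topic_of(doc)].append(doc)

    first, second = [], []
    # Assign whole topics, largest first, to whichever side is smaller,
    # so the halves stay roughly balanced (an illustrative choice).
    ordered = sorted(by_topic, key=lambda t: (-len(by_topic[t]), str(t)))
    for topic in ordered:
        target = first if len(first) <= len(second) else second
        target.extend(by_topic[topic])
    return first, second
```

With a random split, both sub-aliases usually contain documents on the same topics, so shared vocabulary can link them; with the topic-based split, no topic appears on both sides, which removes that cue.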

Type
Articles
Copyright
Copyright © Cambridge University Press 2013
