
Evaluating authorship distance methods using the positive Silhouette coefficient

Published online by Cambridge University Press:  28 September 2012

ROBERT LAYTON
Affiliation:
Internet Commerce Security Laboratory University of Ballarat, Ballarat VIC, Australia e-mail: r.layton@icsl.com.au
PAUL WATTERS
Affiliation:
Internet Commerce Security Laboratory University of Ballarat, Ballarat VIC, Australia e-mail: r.layton@icsl.com.au
RICHARD DAZELEY
Affiliation:
Data Mining and Informatics Research Group University of Ballarat, Ballarat VIC, Australia e-mails: p.watters@ballarat.edu.au, r.dazeley@ballarat.edu.au

Abstract

Unsupervised Authorship Analysis (UAA) aims to cluster documents by authorship without knowing the authorship of any documents. An important factor in UAA is the method for calculating the distance between documents. The choice of authorship distance method is considered more critical to the end result than the choice of cluster analysis algorithm. One method for measuring the correlation between a distance metric and a labelling (such as class values or clusters) is the Silhouette Coefficient (SC). The SC can be used to measure the correlation between an authorship distance method and the true authorship, and so to evaluate the quality of the distance method. However, we show that the SC can be severely affected by outliers. To address this issue, we introduce the Positive Silhouette Coefficient (PSC), given as the proportion of instances with a positive SC value. This metric is not easily altered by outliers and is therefore more robust. A large number of authorship distance methods are then compared using the PSC, and the findings are presented. This research provides insight into the efficacy of methods for UAA and presents a framework for testing authorship distance methods.
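
As an illustration of the definition in the abstract, the following is a minimal sketch of how the proportion of instances with a positive silhouette value might be computed from a precomputed distance matrix and the true author labels. It is not the authors' implementation. It assumes the standard per-instance silhouette of Rousseeuw (1987), s(i) = (b - a) / max(a, b), where a is the mean distance from instance i to the other instances with the same label and b is the smallest mean distance to the instances of any other label. The function names `silhouette_values` and `positive_silhouette_coefficient` and the toy one-dimensional "documents" are hypothetical and chosen only for this example.

```python
import numpy as np


def silhouette_values(dist, labels):
    """Per-instance silhouette s(i) = (b - a) / max(a, b).

    dist   : (n, n) symmetric matrix of pairwise distances (assumed precomputed)
    labels : length-n array of labels, e.g. the true author of each document
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    if len(unique) < 2:
        raise ValueError("at least two distinct labels are required")

    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # do not compare the instance with itself
        if not same.any():
            continue  # only instance with this label: s(i) = 0 by convention
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in unique if other != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s


def positive_silhouette_coefficient(dist, labels):
    """Proportion of instances whose silhouette value is strictly positive."""
    return float(np.mean(silhouette_values(dist, labels) > 0))


# Toy data (hypothetical): six "documents" placed on a line, two authors.
# The last document is labelled as author 1 but sits among author 0's documents.
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 1.5])
authors = np.array([0, 0, 0, 1, 1, 1])
distances = np.abs(positions[:, None] - positions[None, :])

sc = silhouette_values(distances, authors).mean()          # mean silhouette (SC)
psc = positive_silhouette_coefficient(distances, authors)  # share of positive values
print(f"SC = {sc:.3f}, PSC = {psc:.3f}")
```

In this toy case only the misplaced document has a negative silhouette value, so the PSC counts it once as a single failing instance, whereas the mean-based SC is also pulled down by how strongly negative that one value is.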

Type: Articles
Copyright © Cambridge University Press 2012

