Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers

Published online by Cambridge University Press:  27 November 2020

Nadine M. Neumann
Affiliation:
Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil e-mails: nadinemelloni@id.uff.br, plastino@ic.uff.br
Alexandre Plastino
Affiliation:
Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil e-mails: nadinemelloni@id.uff.br, plastino@ic.uff.br
Jony A. Pinto Junior
Affiliation:
Departamento de Estatística, Universidade Federal Fluminense, Niterói, RJ, Brazil e-mail: jarrais@id.uff.br
Alex A. Freitas
Affiliation:
School of Computing, University of Kent, Canterbury, Kent, UK e-mail: a.a.freitas@kent.ac.uk

Abstract

Statistical significance analysis, based on hypothesis tests, is a common approach for comparing classifiers. However, many studies oversimplify this analysis by merely checking the condition p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. The problem is serious enough that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it has been widely misused and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using Student's t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions and that evaluating the effect size and the power of the test contributes to more principled decision-making.
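
As an illustration of the kind of analysis the abstract describes, the following is a minimal sketch, not the authors' procedure. It assumes paired per-dataset accuracies for two classifiers and reports the paired t-test and Wilcoxon signed-rank p-values together with an effect size and the power of the t-test. The function name `paired_comparison`, the choice of Cohen's d on the paired differences as the effect size, the use of the observed effect as the true effect when computing power, and the synthetic accuracies are all assumptions made for this example.

```python
import numpy as np
from scipy import stats


def paired_comparison(acc_a, acc_b, alpha=0.05):
    """Compare two classifiers from paired per-dataset accuracies.

    Hypothetical helper for illustration only. Assumes the paired
    differences are not all identical (otherwise d is undefined).
    """
    acc_a = np.asarray(acc_a, dtype=float)
    acc_b = np.asarray(acc_b, dtype=float)
    diff = acc_a - acc_b
    n = diff.size

    # Paired Student's t-test and Wilcoxon signed-rank test on the same pairs.
    t_res = stats.ttest_rel(acc_a, acc_b)
    w_res = stats.wilcoxon(acc_a, acc_b)

    # Effect size (assumed here): Cohen's d for paired samples, i.e. the
    # mean difference divided by the standard deviation of the differences.
    d = diff.mean() / diff.std(ddof=1)

    # Power of the two-sided paired t-test via the noncentral t distribution,
    # treating the observed d as if it were the true effect (an assumption).
    df = n - 1
    ncp = d * np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

    return {
        "t_pvalue": float(t_res.pvalue),
        "wilcoxon_pvalue": float(w_res.pvalue),
        "cohens_d": float(d),
        "power": float(power),
    }


# Synthetic accuracies on 50 datasets (made-up numbers, not the paper's data).
rng = np.random.default_rng(0)
acc_b = rng.uniform(0.70, 0.95, size=50)
acc_a = acc_b + rng.normal(0.005, 0.01, size=50)

result = paired_comparison(acc_a, acc_b)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Reading the four numbers together is the point of the sketch: a p-value below 0.05 accompanied by a small effect size or low power supports a much weaker conclusion than the p-value alone suggests.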

Type
Research Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
