
Improving healthcare cost prediction for chronic disease through covariate clustering and subgroup analysis methods

Published online by Cambridge University Press:  15 April 2025

Zhengxiao Li
Affiliation:
School of Insurance and Economics, University of International Business and Economics, Beijing, China
Yifan Huang*
Affiliation:
School of Insurance and Economics, University of International Business and Economics, Beijing, China
Yang Cao
Affiliation:
School of Insurance and Economics, University of International Business and Economics, Beijing, China
* Corresponding author: Yifan Huang; Email: huangyf2217@163.com

Abstract

Predicting healthcare costs for chronic diseases is challenging for actuaries, as these costs depend not only on traditional risk factors but also on patients’ self-perception and treatment behaviors. To address this complexity and the unobserved heterogeneity in cost data, we propose a dual-structured statistical learning framework that integrates covariate clustering into a finite mixture of generalized linear models. The framework handles high-dimensional, sparse, and highly correlated covariates while capturing their effects on specific subgroups. It is realized by imposing a penalty based on the prior similarities among covariates. To solve the resulting optimization problem, we further propose an EM-ADMM algorithm, which combines expectation-maximization (EM) with the alternating direction method of multipliers (ADMM). We validate the stability and effectiveness of the framework through simulation and empirical studies. The results show that the framework can leverage shared information among high-dimensional covariates to improve fitting and prediction accuracy, and that covariate clustering can also uncover the covariates’ network relationships, providing valuable insights into diabetic patients’ self-perception data.
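To make the idea in the abstract concrete, the following is a minimal sketch of how an EM loop can be paired with an ADMM inner solver. It is an illustration under stated assumptions, not the authors' algorithm: it uses a Gaussian response with identity link (the simplest GLM), a weighted L1 fusion penalty between covariates that a prior similarity matrix marks as related, and a simplified M-step that applies one common penalty level to each component. The names `similarity_to_D`, `admm_fused_wls`, and `em_admm` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding, the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def similarity_to_D(S):
    # Build a weighted difference matrix: one row per covariate pair (j, l)
    # with prior similarity S[j, l] > 0, so (D @ beta) = S[j, l] * (beta_j - beta_l).
    p = S.shape[0]
    rows = []
    for j in range(p):
        for l in range(j + 1, p):
            if S[j, l] > 0:
                r = np.zeros(p)
                r[j], r[l] = S[j, l], -S[j, l]
                rows.append(r)
    return np.array(rows).reshape(-1, p)

def admm_fused_wls(X, y, w, D, lam, rho=1.0, n_iter=300):
    # ADMM for: min_beta (1/2n) * sum_i w_i (y_i - x_i' beta)^2 + lam * ||D beta||_1,
    # using the splitting z = D beta (generalized-lasso form).
    n = X.shape[0]
    A = (X.T * w) @ X / n
    b = (X.T * w) @ y / n
    M = np.linalg.inv(A + rho * D.T @ D)  # assumes X has full column rank
    z = np.zeros(D.shape[0])
    u = np.zeros(D.shape[0])
    for _ in range(n_iter):
        beta = M @ (b + rho * D.T @ (z - u))          # quadratic beta-update
        z = soft_threshold(D @ beta + u, lam / rho)   # sparse z-update
        u = u + D @ beta - z                          # scaled dual update
    return beta

def em_admm(X, y, D, K=2, lam=0.05, n_em=50, seed=0):
    # EM for a K-component Gaussian mixture of regressions; each component's
    # coefficients are updated by the penalized ADMM solver above.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    resp = rng.dirichlet(np.ones(K), size=n)  # random soft initial assignment
    for _ in range(n_em):
        # M-step: mixing weights, penalized coefficients, component variances.
        pi = resp.mean(axis=0)
        betas = np.array([admm_fused_wls(X, y, resp[:, k], D, lam)
                          for k in range(K)])
        resid = y[:, None] - X @ betas.T
        sigma2 = (resp * resid**2).sum(axis=0) / resp.sum(axis=0)
        sigma2 = np.maximum(sigma2, 1e-8)  # guard against variance collapse
        # E-step: posterior component probabilities, computed in log space.
        log_dens = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2) + np.log(pi)
        log_dens -= log_dens.max(axis=1, keepdims=True)
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)
    return pi, betas, sigma2, resp
```

In this sketch, a large `lam` drives the coefficients of similar covariates toward a common value (a covariate cluster), while `lam = 0` reduces each M-step to weighted least squares. A non-Gaussian GLM component, such as a gamma model for costs, would need a different coefficient update and is not covered here.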

Type
Research Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The International Actuarial Association

