
Externally validated clinical prediction models for estimating treatment outcomes for patients with a mood, anxiety or psychotic disorder: systematic review and meta-analysis

Published online by Cambridge University Press:  05 December 2024

Desi G. Burghoorn*
Affiliation:
University Medical Center Groningen, Department of Psychiatry, Interdisciplinary Center Psychopathology and Emotion Regulation (ICPE), University of Groningen, Groningen, The Netherlands
Sanne H. Booij
Affiliation:
University Medical Center Groningen, Department of Psychiatry, Interdisciplinary Center Psychopathology and Emotion Regulation (ICPE), University of Groningen, Groningen, The Netherlands
Robert A. Schoevers
Affiliation:
University Medical Center Groningen, Department of Psychiatry, Interdisciplinary Center Psychopathology and Emotion Regulation (ICPE), University of Groningen, Groningen, The Netherlands
Harriëtte Riese
Affiliation:
University Medical Center Groningen, Department of Psychiatry, Interdisciplinary Center Psychopathology and Emotion Regulation (ICPE), University of Groningen, Groningen, The Netherlands
Correspondence: Desi G. Burghoorn. Email: d.g.burghoorn@umcg.nl

Abstract

Background

Suboptimal treatment outcomes contribute to the high disease burden of mood, anxiety or psychotic disorders. Clinical prediction models could optimise treatment allocation, which may result in better outcomes. Although ample research on prediction models is performed, model performance in other clinical contexts (i.e. external validation) is rarely examined. This gap hampers generalisability and, as such, implementation in clinical practice.

Aims

Systematically appraise studies on externally validated clinical prediction models for estimating treatment outcomes for mood, anxiety and psychotic disorders by (1) reviewing the methodological quality and applicability of studies and (2) investigating how model properties relate to differences in model performance.

Method

The review and meta-analysis protocol was prospectively registered with PROSPERO (registration number CRD42022307987). A search was conducted on 8 November 2021 in the databases PubMed, PsycINFO and EMBASE. Random-effects meta-analysis and meta-regression were conducted to examine between-study heterogeneity in discriminative performance and its relevant influencing factors.

Results

Twenty-eight studies were included. The majority of studies (n = 16) validated models for mood disorders. Clinical predictors (e.g. symptom severity) were most frequently included (n = 25). Low methodological and applicability concerns were found for two studies. The overall discrimination performance in the meta-analysis was fair, with a wide prediction interval (0.72 [0.46; 0.89]). The between-study heterogeneity was not explained by the number or type of predictors, but by disorder diagnosis.

Conclusions

Few models seem ready for further implementation in clinical practice to aid treatment allocation. Besides the need for more external validation studies, we recommend close examination of the clinical setting before model implementation.

Type
Review
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2024. Published by Cambridge University Press on behalf of Royal College of Psychiatrists

For almost three decades, anxiety, mood and psychotic disorders have accounted for over 14% of age-standardised years lived with disability (YLDs) globally.1 The high burden of these disorders stems not only from their high prevalence, but also from the high number of patients who do not respond to initial or subsequent treatments.Reference Howes, Thase and Pillinger2 Most patients do not remit after their first treatment, and typically multiple therapies have failed before one that works is found.Reference Howes, Thase and Pillinger2 For patients, this means enduring a longer period of ineffectively treated symptoms and discomfort. For society, prolonged treatment duration puts a strain on the limited resources of mental healthcare services, causing long waiting lists for those who seek care.

Precision psychiatry

Prolonged treatment duration is related to the low clinical utility of available diagnostic systems to guide treatment choices.Reference Howes, Thase and Pillinger2 Classification systems such as the ICD-10 and DSM are descriptive in nature, based on experienced symptoms.Reference Maj3,Reference Feczko, Miranda-Dominguez, Marr, Graham, Nigg and Fair4 As such, individuals with the same diagnosis receiving similar treatments may not necessarily share biopsychosocial psychopathological mechanisms. Consequently, treatment responses vary widely.Reference Howes, Thase and Pillinger2,Reference Feczko, Miranda-Dominguez, Marr, Graham, Nigg and Fair4 To overcome this heterogeneity, the promise of bringing precision medicine, defined as ‘to integrate clinical data with patient characteristics to uncover disease subtypes and improve the accuracy with which patients are categorized and treated’, into psychiatry is very appealing.Reference Bourke, White, Kumar and M5 By integrating clinical data with relevant biological, psychological or social patient information, more specific biopsychosocial markers can be established for better treatment response prediction and, consequently, improvement of outcomes.Reference Gillan and Whelan6,Reference Insel and Cuthbert7

Clinical prediction models

Over the past two decades, precision psychiatry has gained momentum and, as such, a wealth of clinical prediction models for personalised treatment outcomes has been developed.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 However, a large translational gap between model development and clinical practice prevents successful implementation.

Requirements for successful prediction model implementation

For a prediction model to have added value in clinical practice, the model first needs to have adequate discriminative abilityFootnote a and calibrationFootnote b.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 Second, a model should be shown, through external validation, to generalise to different clinical contexts (e.g. location, treatment setting or period).Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 Third, successful implementation depends on applicability factors, such as the feasibility of obtaining relevant predictors within a clinical context in which time and financial constraints are common.Reference Wolff, Moons, Riley, Whiting, Westwood and Collins10 Previous systematic reviews of clinical prediction models in psychiatry shed light on the above-mentioned factors, either for a broad class of prediction models, including diagnostic models and models predicting onset, or for treatment outcome prediction in depression only.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 In addition, all focused on a broad range of studies (e.g. from predictor-finding or development to external validation or implementation), and only one review assessed between-study heterogeneity in a quantitative way.

Aims

This important knowledge gap regarding the prediction accuracy, methodological quality, applicability and potential sources of heterogeneity of externally validated models for treatment outcomes of common mental disorders is addressed in the current systematic review and meta-regression. We focus on four aims. First, we systematically appraise the current literature on externally validated clinical prediction models that estimate treatment outcomes for mood, anxiety and psychotic disorders, to qualitatively describe variability across studies and their models. This includes, among other study properties, a systematic categorisation of included predictor types, as precision psychiatry aims to develop prediction models containing relevant predictors, which, according to the biopsychosocial model of mental disorders,Reference Bashmi, Cohn, Chan, Tobia, Gohar, Herrera and IsHak12 should be biological and psychosocial, next to clinical predictors. Second, we critically review the methodological quality and applicability of included studies using the Prediction Model Risk of Bias Assessment Tool (PROBAST), to ensure proper interpretation of study results and to determine the validity and applicability of prediction models. Third, given the broad inclusion criteria, data are expected to be heterogeneous on characteristics such as patients’ disorders, predictors and study settings,Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 which makes a meta-analysis that merely estimates the average discriminative ability of limited use.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 However, the observed heterogeneity itself and its sources are informative for investigating how clinical and methodological aspects of the studies relate to their results.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 We therefore conduct a meta-analysis of the omnibus discrimination performance to estimate how well the currently available models perform overall, concurrently focusing on the observed heterogeneity. Fourth, if heterogeneity is confirmed, we examine whether heterogeneity in discrimination performance can be explained by a meta-regression on various study characteristics.

Method

The systematic review and meta-analysis were conducted in accordance with the Cochrane Handbook for Systematic Reviews of InterventionsReference Bourton, Page, Altman, Lundh and Hróbjartsson15 and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statementReference Page, McKenzie, Bossuyt, Boutron, Hoffmann and Mulrow16 (see the research checklist). The protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) CRD42022307987 (307987). See Supplementary file 2, available at https://doi.org/10.1192/bjo.2024.789, which describes the scope of this review in terms of patient population, intervention (treatment), comparison, outcome and type of study (PICOT).

Definitions

The key concepts central to our study, such as discrimination and calibration, are defined in the Footnotes for a clear and precise understanding of the terminology.

In- and exclusion criteria

The following inclusion criteria were adopted: (1) studies including adults with a mood, anxiety or psychotic diagnosis as confirmed by diagnostic interview or patient file; (2) studies externally validating clinical prediction models for any type of treatment outcome; and (3) studies testing multivariable models following the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement.Reference Heus, Damen, Pajouheshnia, Scholten, Reitsma and Collins17

The following exclusion criteria were adopted: (1) publication types not based on original data (abstracts, conference proceedings, reviews, meta-analyses); (2) predictor-finding studies that did not report prediction models; (3) studies with predictive models that did not evaluate a model's performance for personalised risk prediction in any recommended way;Footnote c (4) studies predicting the first onset of disorders; (5) studies including children (under the age of 12) or exclusively focused on adolescents (aged 12–18); (6) studies solely including participants recruited from the general population (i.e. not in mental healthcare settings); (7) studies exclusively focusing on postnatal mental disorders; and (8) studies not written in the English, German or Dutch language. No restrictions were placed based on the model's performance in the development data-set.

Development of the search string

To ensure conceptual validity, 13 must-find articles were identified a priori by their PubMed reference numbers (PMIDs). The search string was developed in consultation with an independent information expert (Dr Paul Braun (P.B.)) from the Central Medical Library (CMB) of the University Medical Center Groningen (UMCG). Once the algorithm for PubMed was finalised, the string was translated to PsycINFO and Excerpta Medica (EMBASE) in collaboration with P.B. All three databases were searched on 8 November 2021. Final search strings are given in Supplementary file 1.

Screening

Results from the three databases were imported into EndNote 20.1 (Clarivate, Philadelphia, PA, USA; see https://support.clarivate.com/Endnote/s/article/Download-EndNote?language=en_US). Duplicate results were first detected with the ‘find duplicates’ feature in EndNote, and thereafter manually removed after review. The remaining results were imported into Rayyan (Rayyan, Cambridge, MA, USA; see www.rayyan.ai), and the duplicate removal was repeated. Next, an order of exclusion criteria was tested blindly on ten randomly selected papers to ensure synchronisation of the eligibility determination process between the two reviewers (see Supplementary file 7). Pilot testing showed that the order required no adjustment. The order was used for the double-blinded title/abstract and full-text screening. The outcomes of the double-blinded screening were discussed during weekly meetings. A priori, a third reviewer was identified (H.R.) in case any discrepancies could not be solved during those meetings. However, consultation of the third reviewer was not needed. Some disagreements between D.G.B. and S.H.B. arose from a different understanding of what constitutes a clinical sample, resulting in a further specification of the definition of a clinical sample in the PROSPERO protocol (307987), as follows: ‘Studies with a small part of the sample being recruited from the general population (with diagnoses being established via interviews) and without information about mental healthcare, is accepted when at least a majority of the total sample is reported to receive mental health care’.

Data collection

All data were extracted by means of an in-house developed data extraction form following the critical appraisal and data extraction for systematic reviews of prediction modelling studies (CHARMS) checklist.Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman18 The data extraction form was piloted by taking a sample of five included studies. On 13 January 2022, data extraction was started. The first reviewer (D.G.B.) extracted all relevant data, and the second reviewer (S.H.B.) independently assessed data extraction validity using a random subsample of full texts (n = 6 out of 21), and S.H.B. reviewed the correctness of all reported outcomes of main interest, as shown in Table 3.

Risk of bias and applicability

For each study, the risk of bias was assessed double-blinded by D.G.B. and S.H.B. using PROBAST. With PROBAST, the risk of bias is assessed in four domains: participants, predictors, outcomes and analyses.Reference Wolff, Moons, Riley, Whiting, Westwood and Collins10 Each domain contains questions that are rated as of low, high or unclear concern. If one domain was judged to have a high risk of bias, the overall risk of bias was considered high. Similarly, if one domain was assessed as unclear, the overall risk of bias was unclear. The applicability of the models was judged using the same method of assessment. Potential publication bias of studies included in the meta-analysis was explored with a funnel plot using the R package metafor.Reference Viechtbauer19
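For illustration, such a funnel plot can be produced along the following lines (a minimal sketch with a hypothetical data.frame 'dat'; the study's own script is provided in Supplementary file 7):

library(metafor)

# 'dat' is a hypothetical data.frame with one row per validated model,
# holding a logit-transformed discrimination estimate (yi) and its
# sampling variance (vi).
fit <- rma(yi = yi, vi = vi, data = dat, method = "REML")
funnel(fit)  # asymmetry may signal publication bias (or heterogeneity)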

Data synthesis

Discrimination and calibration measures were of main interest; if multiple models were externally validated on the same data-set, the lowest and highest reported values were extracted, together with the type of measure reported. Other performance measures and descriptive data were also extracted and summarised from the papers: author(s), year of publication, source validation data-set (i.e. cohort, clinical trial), participant characteristics (i.e. type of disorder, age range), treatment characteristics (type of treatment, in- or out-patient setting, primary or secondary facility), outcome measures (definition, timing, method of assessment), predictors (psychosocial/clinical/biological nature), modelling methods (type of statistical analysis, method selection predictors), power (sample size, number of events per variable) and the handling of missing data.

Predictor type categorisation

Predictor type categorisation is not straightforward, and transparent protocols for this are lacking in previous reviews in the field of precision psychiatry.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 As such, the qualitative method of grounded coding was applied. This qualitative method is directed towards finding common categories among codes and, if possible, higher-order categorisation may yield an overarching theme.Reference Punch20 To ensure transparency in reporting and possible reproduction, the resulting codebook generated by grounded coding is given in Supplementary file 6.

Meta-analysis, sensitivity analyses and meta-regression

Formal meta-analysis to estimate an average discriminative ability is only useful if the criteria of the analysis are met.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13,Reference Eysenck21 The data included in the meta-analysis are heterogeneous on characteristics such as patients’ disorders, outcomes, predictors and study settings. However, the heterogeneity itself and its sources are informative for investigating how clinical and methodological aspects of the studies relate to their results.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14

As such, a meta-analysis of omnibus discrimination performance measures was executed as a prerequisite for the meta-regression. Any study that presented an omnibus measure of discrimination, such as the concordance statistic (c-statistic),Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9,Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman18 or, if not available, reported accuracy at a single cut-off (ASC), was used in the meta-analysis. Given that ASC is less informative than the area under the curve (AUC),Reference Bradley22,Reference Carrington, Manuel, Fieguth, Ramsay, Osmani and Wernley23 a sensitivity analysis was performed excluding ASCs.

Currently, CochraneReference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 does not provide recommendations for selecting models for meta-analysis when multiple models are reported within the same study. Therefore, to increase comparability, we decided to follow the strategy of Lee et al,Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 a review of machine learning-based prediction models in psychiatry. The best performing model per outcome was selected based on the highest reported discrimination. Multiple discrimination estimates from the same study were included under a few conditions, as specified in the a priori established decision document (see Supplementary file 6).

When measures of uncertainty were missing for the c-statistic but the total sample size and the number of events were reported, the standard error was estimated according to a formula described by Debray et al.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 Before the meta-analysis, discrimination measures were logit transformed to meet the normality assumptions. These transformations were done with the R package metamisc.Reference Debray and Jong25
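As an illustration of this step, the sketch below first approximates a missing standard error from the sample size and number of events, here using the Hanley–McNeil formula (a common approximation; we do not claim this is the exact expression of Debray et al), and then applies the logit transformation with a delta-method variance. All numbers are made up.

# Hanley-McNeil approximation of the standard error of a c-statistic,
# given the total sample size and the number of events.
se_cstat <- function(cstat, n, events) {
  n1 <- events        # participants with the event
  n2 <- n - events    # participants without the event
  q1 <- cstat / (2 - cstat)
  q2 <- 2 * cstat^2 / (1 + cstat)
  sqrt((cstat * (1 - cstat) + (n1 - 1) * (q1 - cstat^2) +
          (n2 - 1) * (q2 - cstat^2)) / (n1 * n2))
}

c_hat <- 0.72                                 # reported c-statistic (made up)
se <- se_cstat(c_hat, n = 500, events = 150)  # hypothetical sample size/events

# Logit transformation with a delta-method variance, as used before pooling:
yi <- qlogis(c_hat)                    # logit(c-statistic)
vi <- (se / (c_hat * (1 - c_hat)))^2   # variance on the logit scale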

For the assessment of the amount of heterogeneity, the common parameters I2,Footnote d tau2Footnote e and the between-study prediction interval were used in the meta-analysis, which was implemented with the R package metafor.Reference Viechtbauer19 As heterogeneity between studies was anticipated, a random-effects model was used with restricted maximum likelihood. To ease interpretation, estimates were back transformed before being presented in tables and plots with the inverse logit transformation function (transf.ilogit) of the metafor package, representing c-statistics (or ASCs).
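In outline, this pooling step can be expressed as follows (a minimal sketch with hypothetical data and object names, not the study's actual script, which is listed in Supplementary file 7):

library(metafor)

# 'dat' is a hypothetical data.frame of logit-transformed discrimination
# estimates (yi) and their sampling variances (vi), one row per model.
fit <- rma(yi = yi, vi = vi, data = dat, method = "REML")  # random effects

fit$I2    # proportion of variability attributable to between-study heterogeneity
fit$tau2  # estimated variance of the true effects

# Summary estimate with 95% confidence and prediction intervals,
# back-transformed to the c-statistic scale with the inverse logit:
predict(fit, transf = transf.ilogit)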

To investigate which clinical and methodological aspects of the studies relate to the discriminative ability, a random-effects meta-regression was conducted. The number of investigated potential effect modifiers of model performance should be limited, as the likelihood of a false-positive result among subgroup analyses and meta-regressions increases with the number of characteristics investigated.Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman18 Following the Cochrane Handbook, potential effect modifiers may include study-level characteristics, such as population type, type of intervention or treatment, length of follow-up or methodology (i.e. design and quality). For testing interactions with categorical variables, the handbook advises at least five studies per category.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 For testing interactions with continuous variables, the total number of studies should be at least 10.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 The following study characteristics were eligible for meta-regression: type and number of predictors, and type of disorder. The other above-mentioned characteristics could not be tested because of insufficient sample sizes. Regression coefficients, the associated z-scores, s.e. values, significance values and prediction intervals were reported as primary outcomes per moderator. The R script for these analyses is listed in Supplementary file 7.
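A moderator analysis of this kind can be sketched in metafor as below; the moderator coding shown is illustrative, not the study's actual coding:

library(metafor)

# Hypothetical moderator: does a model target depressive disorders?
dat$depression <- factor(dat$disorder == "depressive")

metareg <- rma(yi = yi, vi = vi, mods = ~ depression,
               data = dat, method = "REML")
summary(metareg)  # regression coefficients with s.e., z- and P-values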

Results

Study selection

The systematic search in PubMed, EMBASE and PsycINFO yielded a total of 2389 hits. Fifty-seven studies were screened full text (see Fig. 1 for a flowchart of the screening process). In total, 28 studiesReference Ashar, Clark, Gunning, Goldin, Gross and Wager26Reference Wang, Patten, Sareen, Bolton, Schmitz and MacQueen53 were part of the systematic review (see Supplementary file 3 for the list of included papers).

Fig. 1 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-conforming flowchart of the screening process.

EMBASE, Excerpta Medica.

The meta-analysis was conducted on the 21 studiesReference Athreya, Neavin, Carrillo-Roa, Skime, Biernacka and Frye27,Reference Bone, Simmonds-Buckley, Thwaites, Sandford, Merzhvynska and Rubel29,Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson and Trivedi31–Reference Hayes, Osborn, Francis, Ambler, Tomlinson and Boman37,Reference Jha, South, Trivedi, Minhajuddin, Rush and Trivedi39,Reference Kambeitz-Ilankovic, Vinogradov, Wenzel, Fisher, Haas and Betz40,Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark43–Reference Nie, Vairavan, Narayan, Ye and Li45 that either reported the needed parameters or for which these parameters could be calculated as described in the analysis section. Notably, Ashar et alReference Ashar, Clark, Gunning, Goldin, Gross and Wager26 used a continuous metric for predictive accuracy and could not be included in the meta-analysis. Multiple models from the same study were included following the previously described decision document. Therefore, from the 21 studies, 28 models were included in the meta-analysis.

Study characteristics

Source validation data-set

Five studies made use of data derived from registries, 14 studies included data from clinical trials (either open-label or randomised; nine and five, respectively), eight studies used observational cohort data and one study used survey data.

Participants and setting

Most studies focused on patients diagnosed with a mood disorder (unipolar n = 14, bipolar n = 2), followed by mixed samples of patients with severe mental illness (SMI) (n = 6) and psychotic disorders (n = 5); one study validated prediction models for anxiety and mood disorder separately. Most studies included adult patients only (n = 14), some included adolescents as well (n = 9) and in some papers the inclusion of adolescents remained unclear (n = 5). All study characteristics, including sample characteristics and care setting details, are given in Table 1. More details regarding the in- and exclusion criteria per validation data-set are listed in Supplementary file 7.

Table 1 Study characteristics

ACT, auditory-based cognitive training intervention; CBT, cognitive behavioural therapy; DEP, depressive disorders; Der, derivation sample; EET, employment, education or training status; EOT, end of treatment; GENDEP, Genome-Based Therapeutic Drugs for Depression project; GP, general practitioner; INT, interview; ISPC, International SSRI Pharmacogenomics Consortium; MARS, Munich Antidepressant Response Signature project; MHC, medical hospital centre; NMB, no meaningful benefit; OBS, observational status; OUT, out-patient status; SMI, severe mental illness; STAR*D, Sequenced Treatment Alternatives to Relieve Depression study; TAU, treatment as usual; TRS, treatment resistance; UNK, unknown.

a. Calculated from data on training set and training and validation set.

Treatment outcome

Most studies reported clinical outcomes (n = 25), such as symptom change, remission status, adverse events and care consumption. One studyReference Leighton, Krishnadas, Chung, Blair, Brown and Clark43 reported both a clinical (remission) and a psychosocial outcome (employment, education and training status). One study described the psychosocial outcome of the likelihood of committing a crime.Reference Fazel, Wolf, Larsson, Lichtenstein, Mallett and Fanshawe34 To assess the outcome, most studies used a clinician-rated instrument (n = 16), followed by self-report instruments (n = 7) and event occurrence in registry data (n = 5). The timing of the end-points varied greatly. The earliest outcome assessment was at 1 week follow-up,Reference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36 while the latest was at 6 years follow-up.Reference Perry, Osimo, Upthegrove, Mallikarjun, Yorke and Stochl49 In some studies (n = 4) the outcome assessment depended on the end of treatment.Reference Bone, Simmonds-Buckley, Thwaites, Sandford, Merzhvynska and Rubel29,Reference Kautzky, Dold, Bartova, Spies, Kranz and Souery41,Reference Perlis, Ostacher, Miklowitz, Hay, Nierenberg and Thase48,Reference Taliaz, Spinrad, Barzilay, Barnett-Itzhaki, Averbuch and Teltsh52 A detailed description of the end-point assessment per study is given in Table 1.

Predictors

With the exception of three studies,Reference Cattaneo, Ferrari, Uher, Bocchio-Chiavetto, Riva and Pariante30,Reference Fabbri, Kasper, Kautzky, Zohar, Souery and Montgomery32,Reference Kambeitz-Ilankovic, Vinogradov, Wenzel, Fisher, Haas and Betz40 all studies included clinical predictors in their model, such as symptom severity, past psychiatric history (PHX) or general medical history (GMH). Notably, studies that included psychosocial variables included biological variables as well (n = 16). Only a few studies (n = 7) included clinical predictors exclusively. A detailed overview of the predictors used per study is given in Table 2. The codebook for coding which variable is considered which type is given in Supplementary file 7.

Table 2 (Type of) variables used per study

ASC, accuracy at single cut-off; AUC, area under the curve; BMI, body mass index; Childhood AEs, childhood adverse events; GMH, general medical history; HRQOL, health-related quality of life; LSOA, lower super output area (measure of neighbourhood deprivation); MHX, psychiatric medication history; NHS, National Health Service; PANSS Gx, Positive and Negative Syndrome Scale, general psychopathology subscale; PANSS Nx, Positive and Negative Syndrome Scale, negative symptoms subscale; PANSS Px, Positive and Negative Syndrome Scale, positive symptoms subscale; PHX, psychiatric history and past psychiatric care use.

Performance measures

Out of the 28 included studies, 11 reported both calibration and discrimination measures (see Table 3). The majority of studies reported discrimination measures only (n = 16). One studyReference Cattaneo, Ferrari, Uher, Bocchio-Chiavetto, Riva and Pariante30 did not report omnibus calibration and discrimination measures, despite constructing and validating a clinical prediction model; it reported accuracy measures instead (sensitivity, specificity). No studies reported calibration measures only.

Table 3 Modelling methods and reported outcomes per study

AdaBoost, adaptive boosting; Anx, anxiety symptoms; Anypol, any polarity in bipolar; ASC, accuracy at single cut-off; AUC, area under the curve; BAC, balanced accuracy curve; Bay, Bayesian updating algorithm; BCT, boosted classification trees; Brier, Brier score; BNPV, Bayes negative predictive value; BPPV, Bayes positive predictive value; CCA, complete case analysis; C-stat, c-statistic; Dep, depressive symptoms; EPV, event per variable (lowest incidence category/number of predictors); FNR, false negative rate; GBDT, gradient boosting decision trees; GBM, gradient boosting; HLT, Hosmer–Lemeshow test; Imp, imputation; LDF, linear discriminant function; Learning, model learning type; LOCV, last observation carried forward; Log, logistic regression; Mania, manic episode in bipolar; MDB, major depressive episode in bipolar; ModR, model-based R square; NND, number needed to diagnose; NR, not recorded; NRI, net reclassification index; NPV, negative predictive value; PLR, penalised linear regression; PPV, positive predictive value; Rem, removal; Sens, sensitivity; Spec, specificity; SVM, support vector machine; Stat, statistical learning; UNK, unknown; XGBoost, eXtreme gradient boosting.

All reported measures are rounded to two decimals.

a. Highest performance was only filled in if the study reported multiple models.

b. In case of multiple models, ‘other measures’ are given for the highest performing model.

c. Calculated confidence interval.

Of the 11 studies reporting calibration, most reported calibration slopes (n = 4), while some reported calibration plots (n = 3) or results from the Hosmer–Lemeshow test (n = 3). One study reported the calibration intercept. Studies that included discrimination measures (n = 27) reported AUC statistics (n = 14), c-statistics (n = 7), ASCs (n = 5) or a continuous outcome with an R 2 measure to represent predictive accuracy (n = 1).

Modelling methods

Almost all studies constructed classification models (n = 27). Many studies (see Table 3) constructed statistical learning models (n = 16), with logistic models being the most prevalent (n = 14). A total of 11 models were constructed using machine learning only, with some form of boosting (gradient, extreme or adaptive; n = 9) and elastic net (n = 4) being the most utilised techniques. One studyReference Bone, Simmonds-Buckley, Thwaites, Sandford, Merzhvynska and Rubel29 used both statistical and machine learning methods. A detailed description of the technique utilised per study is listed in Supplementary file 7.

Handling of missing data

Several studies (n = 9) accounted for missing data by imputation only. A few studies (n = 4) excluded variables from analysis if a given share of a variable was missing and, subsequently, imputed variables. Notably, some studies (n = 7) relied on complete case analysis. One studyReference Ashar, Clark, Gunning, Goldin, Gross and Wager26 applied last observation carried forward for handling missing data. In various studies (n = 7), handling of missing data remained unclear from the text and supplement.

Risk of bias

Following PROBAST, the overall quality of reporting in the studies was low (see Supplementary file 6 for study-specific (sub)domain evaluations). Main areas of high concern with respect to bias introduction were the poor reporting on the handling of missing data, analyses and outcomes.

Applicability

Applicability was also assessed with PROBAST. Main areas of concern were participant in- and exclusion criteria, included predictors and outcome assessment. Many studies were rated as of high concern on the participant domain, mostly because of stringent in- and exclusion criteria, which harmed the generalisability of the model (n = 12). Seven studies were rated as of high concern on predictor feasibility, as operationalisation of the chosen predictors in clinical practice was challenging. For example, one studyReference Ashar, Clark, Gunning, Goldin, Gross and Wager26 included the predictor amygdala connectivity, which was calculated from a functional magnetic resonance imaging (fMRI) scan. Following the PROBAST guidelines, this study was rated as of high concern with regard to applicability. Not only are neuroimaging data-sets prone to variability in analysis,Reference Botvinik-Nezer, Holzmeister, Camerer, Dreber, Huber and Johannesson54 but the limited sample size of such studies also limits the applicability of the results.Reference Rashid and Calhoun55,Reference Schnack and Kahn56 The specialised techniques required for imaging, and their associated complexity and cost, may lower applicability for widespread clinical deployment.

A few studies (n = 2) were rated as of high concern on the outcome feasibility for various reasons. Notably, only two studiesReference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36,Reference Puntis, Whiting, Pappa and Lennox50 were of low concern in both risk of bias and applicability.

Meta-analysis

In total, 28 models (derived from 21 studies) were eligible for meta-analysis according to the pre-specified criteria. The funnel plot in Supplementary file 5 indicates heterogeneity in the discrimination performance, even between studies with a small standard error.

As expected, the heterogeneity among studies was substantial (I 2 = 97.83%, tau2 = 0.28). Overall, evidence for the performance accuracy was mixed. The overall summary discrimination performance was fair (>0.70; point estimate: 0.72; 95% prediction interval [0.46; 0.89]). However, the wide prediction interval (see Fig. 2) indicates that some models performed no better than chance (the interval extends below 0.50). Excluding the ASC-reporting studies resulted in a slightly higher summary estimate with a narrower confidence interval: 0.76 [0.71; 0.80]. The corresponding prediction interval [0.51; 0.90] lay entirely above 0.50, meaning that all included models performed better than chance. The forest plot of this sensitivity analysis is presented in Supplementary file 4.

Fig. 2 Forest plot of all eligible models (n = 28). Includes reported ASC and AUC. For a sensitivity analysis with only AUC, please refer to Supplementary file 4. ASC, accuracy at single cut-off; AUC, area under the curve; Bu, bupropion; dul, duloxetine; EET, education, employment or training status; esc, escitalopram; flu, fluoxetine; mir, mirtazapine; NHS, National Health Service; Nu.Ty.We, Cumbria Northumberland Tyne and Wear NHS Foundation Trust; par, paroxetine; pl, placebo; Rem, remission; ser, sertraline; ven, venlafaxine; wh, whole; Wh.Ba, Whittington Barnet Enfield Haringey Pennine and Humber NHS Foundation Trust.

As can be seen in Fig. 2, the two studies that reported a psychosocial outcome performed among the best. However, because of the small number of models in this category, no meta-regression on the type of outcome could be conducted.

Potential sources of heterogeneity

When applying the criteria given in the Method section, the following sources of heterogeneity could be investigated: type of disorder and number and type of included predictors.

Type of disorder

Diagnosis of a depressive disorder was a significant moderator (B = −0.47, z = −2.31, s.e. = 0.20, P = 0.02, prediction interval [−0.86; −0.07]), explaining 3.41% of the residual heterogeneity. Models predicting outcomes for individuals diagnosed with depressive disorders (n = 18) showed lower discrimination than models predicting outcomes for other disorders (n = 10). The remaining models concerned SMI, such as psychosis (n = 6) and bipolar disorder (n = 2), or mixed samples (n = 2).

Number of included predictors

The number of included predictors did not significantly moderate discrimination performance (B = 0.00, z = 0.25, s.e. = 0.01, P = 0.80, prediction interval [−0.01; 0.01]).

Type of predictors

Out of the 28 eligible models, 18 used biological, psychosocial and clinical predictors. The use of this combination of predictor types did not significantly moderate discrimination performance (B = 0.11, z = 0.46, s.e. = 0.24, P = 0.65, prediction interval [−0.35; 0.57]). The use of clinical predictors only (n = 7) also did not significantly moderate discrimination performance (B = 0.03, z = 0.13, s.e. = 0.26, P = 0.89, prediction interval [−0.48; 0.55]). As there were few models using biological predictors exclusively (n = 2), a meta-regression for biological models was not performed. There were no models using psychosocial predictors only.

Discussion

This is the first systematic review specifically examining the performance and applicability of externally validated prediction models that estimate treatment outcomes for individuals diagnosed with a mood, anxiety or psychotic disorder. Twenty-eight studies were identified. The included studies differed in source of validation data-set, participant in- and exclusion criteria, definition and timing of the outcome assessment, treatment outcome, type and number of included predictors, reporting of discrimination and calibration measures, handling of missing data and modelling methods. The majority of the studies were labelled as high risk of bias by PROBAST because of methodological concerns regarding the (poorly reported) analyses or inappropriate handling of missing data. The applicability of the included studies was overall poor because of strict in- and exclusion criteria, the included predictors (feasibility of obtaining predictors within a clinical context) and the outcome assessment. The two studies scoring low concern with regard to both risk of bias and applicabilityReference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36,Reference Puntis, Whiting, Pappa and Lennox50 shared the following characteristics: they were both pragmatic in nature (either a trial close to clinical practice or routinely collected data), predicted a clinical outcome, utilised accessible clinical, biological and psychosocial variables, employed logistic regression and imputed missing data using chained equations (see Supplementary Table 7.6).

In the meta-analysis, the overall summary discrimination measure was fair. Nevertheless, some models had such wide intervals that they should be interpreted as performing at chance level, while other models were adequate. The two models reporting psychosocial outcomes performed among the best. Of the meta-regressions performed, only the type of disorder (depression versus other disorders) explained a limited share of the heterogeneity; models for participants with depression performed worse than the others.

Despite the effort to conduct a comprehensive search and include all relevant literature, the way we conducted our title and abstract screening may have led to the omission of papers that should have been included. For example, during the screening of titles and abstracts, we excluded a paper on lithium response by Nunes et alReference Nunes, Ardau, Berghöfer, Bocchetta, Chillotti and Deiana57 (2020) because the methods in the abstract mentioned cross-validation as the validation method, while the full text also included a form of external validation (leave-one-site-out). However, we had to keep a balance between rigour and effort.

Because of resource constraints, data extraction was only partially duplicated: for the full texts, only a subsample was double-extracted. All outcomes of primary interest, particularly those outlined in Table 3, were meticulously double-checked, but despite our best efforts, minor discrepancies may have occurred in other areas.

Literature comparison and interpretation

A recent meta-reviewReference Gillett, Tomlinson, Efthimiou and Cipriani58 identified four externally validated models, of which one was a predictor-finding study according to the review of Lee et al.Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 A systematic review and meta-analysis of clinical prediction models in psychiatry above and beyond treatment outcomesReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 identified 25 externally validated predictive models. The number of studies identified in the current review illustrates a rapidly growing interest in clinical prediction model research in psychiatry. Still, the number of externally validated models is very low compared to the number of developed models. In the meta-review of Gillett et al,Reference Gillett, Tomlinson, Efthimiou and Cipriani58 the lack of external validation in the literature was noted. In the reviews of Lee et alReference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 and Salazar de Pablo et al,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 only 20.0% and 29.9% of the included studies externally validated their models, respectively. This comparison with the literature supports our notion that the field could greatly benefit from a shift in focus from developing new models to externally validating existing ones.

The finding of large methodological heterogeneity among studies constructing and validating clinical prediction models is in line with the conclusions of Gillett et alReference Gillett, Tomlinson, Efthimiou and Cipriani58 and Salazar de Pablo et al.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 More specifically, Gillett et alReference Gillett, Tomlinson, Efthimiou and Cipriani58 concluded that, within systematic reviews, the definition of treatment response, the predictor variables and their assessment varied greatly among studies, and Salazar de Pablo et alReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 concluded that the main limitation of their study was the heterogeneity of the characteristics of the prediction models. Chekroud et alReference Chekroud, Hawrilenko, Loho, Bondar, Gueorguieva and Hasan59 highlight the challenge of external validation across clinical trials, advocating the importance of identifying trial-level characteristics in relation to patient outcomes to improve model generalisability. As the growth of mental healthcare delivery systems enables extensive data collection, longitudinal validation methods are the most pragmatic form of examining a model's generalisability.

Similar to previous work by Salazar de Pablo et al,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 Gillett et alReference Gillett, Tomlinson, Efthimiou and Cipriani58 and Meehan et al,Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 the majority of the externally validated prediction models were classification models. One could speculate why this is the case. Perhaps, for clinical decision-making, the classification of treatment outcomes into binary categories is considered more clinically useful than a change in symptom score. In our review, several studies employed a change in, or a threshold on, symptom scores to determine treatment outcome status, such as remission versus non-remission or response versus non-response. Another possible reason for finding only classification models could be inherent to the search string used: if calibration and discrimination measures are underreported in non-classification models, such models are not picked up for inclusion in the review.

We extend previous work by Salazar de Pablo et al,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 as the frequent use of clinical variables and the less frequent use of biomarkers (such as genetic information) was identified in this review as well. Salazar de Pablo et alReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 found that most models were based on clinical predictors, and that there is no evidence that models including biomarkers outperform other types of models. In existing biomarker-based models, the great variability in raw data processing and feature type selection potentially hampers successful external validation.Reference Botvinik-Nezer, Holzmeister, Camerer, Dreber, Huber and Johannesson54,Reference Rashid and Calhoun55 Some even question whether biomarkers are a necessity when estimating treatment outcomes. In Meehan et al,Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 studies reporting models with a majority of biomarkers were excluded because of pragmatic concerns. In this review, no such restrictions were applied. Notably, still only a few externally validated models predominantly using biomarkers were identified, and these performed poorly. This supports the notion that biomarkers are unable, or not yet sufficiently able, to reflect psychopathological mechanisms.

Notably, we could not identify externally validated clinical prediction models that contained tests of known psychological mechanisms. For example, negative bias (i.e. the tendency to pay more attention to negative information) is a well-established characteristic in the psychopathology of patients suffering from depressive disorders, anxiety disorders, bipolar disorders and schizophrenia,Reference Fazel, Wolf, Larsson, Lichtenstein, Mallett and Fanshawe33–Reference Fiedorowicz, Merranko, Iyengar, Hower, Gill and Yen35 the severity of which can be assessed with the dot-probe computer task.Reference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36 In the review of Lee et al,Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 one study constructed a prediction model utilising a cognitive-emotional biomarker, but unfortunately, this model was not externally validated. Based on this research, it is tempting to speculate either that models using outcomes of cognitive tests fail to replicate in external validation, or that this observation identifies a promising gap in the literature that invites further exploration.

Comparing accuracy performance among reviews must be done with great caution, as the reported accuracy interval depends on the type of validation with which the included models were tested.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13,Reference Eysenck21 In this review, only externally validated models were included, regardless of the model's performance in the development data-set (see Supplementary Table 7.7). Based on the lower point estimates and wider prediction intervals of this study, one could conclude that the accuracy performance in the current study is lower than the accuracy of the prediction models reported by Salazar de Pablo et al.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 This is also observed in the review of Meehan et al.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 However, in both reviews, the sample in the meta-analyses consisted mostly of internally validated models, making higher reported discrimination measures more likely.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9,Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 Moreover, in both reviews, a protocol for the selection of models for meta-analysis was lacking.

Comparing the findings of systematic reviews and meta-analyses increases the risk of ecological fallacy, as aggregated data can mask important differences between subgroups or individuals, which is contrary to the mission of precision psychiatry. One may even argue that a narrative review would be more appropriate, as this publication type allows for a broader scope, a broader audience and more flexibility.Reference Pae60 However, narrative reviews are, because of their flexibility, more prone to bias in the gathering and evaluation of evidence.Reference Bourton, Page, Altman, Lundh and Hróbjartsson15

This review illustrates the intertwined relationship between introduced bias and applicability concerns arising from methodological design. In our review, only two out of 28 models were of low concern for both risk of bias and applicability. The reviews of Meehan et alReference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 and Salazar de Pablo et alReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 identified key weaknesses in model construction and performance: underpowered studies, failure to report key aspects of model handling (missing-data handling and predictive performance, mostly calibration) and bias-prone variable selection strategies. Not surprisingly, most models in both reviews were rated as of high concern in both the risk of bias and applicability domains.

The observation, informed by the meta-regression, that models predicting depressive treatment outcomes performed worse than the other models in our sample is novel and intriguing. A possible explanation to consider is the particular case mix of the ‘rest’ category in the meta-regression, which included mainly SMI samples. Studies using samples with more severe disorders are more often conducted in patients treated in secondary care, and may have different distributions of outcomes and predictors, which may lead to different accuracy performance.Reference Eysenck21 Relatedly, the presentation of patients with moderate severity, which is often the case for depression, is typically more heterogeneous, making it more difficult for a model to perform well than in a sample of more homogeneous, severely mentally ill patients.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 Notably, the models predicting psychosocial outcomes concerned an SMI population, raising the question of whether the good discrimination of these models is due to the sample, the type of outcome predicted or both.

The diversity of studies and settings encountered in this review reflects the challenge of clinical model implementation. The finding that many models were labelled as of ‘high concern’ with regard to risk of bias and/or applicability indicates that few models seem ready for further implementation in clinical practice to aid treatment allocation. For the implementation of clinical prediction models into practice, the need for close examination of the clinical setting before model implementation remains, regardless of whether they have been externally validated. The lower accuracy of depression models observed in the meta-regression, and the visually apparent stronger accuracy for psychosocial outcomes in the meta-analysis, are new and remarkable observations. More research is needed to understand these differences and their implications.

In conclusion, (large-scale) implementation of individualised, externally validated prediction models in clinical practice does not seem feasible in the near future. Few models seem ready for further implementation in clinical practice to aid treatment allocation. Besides the need for more external validation studies, we recommend close examination of the clinical setting before model implementation.

Supplementary material

Supplementary material is available online at https://doi.org/10.1192/bjo.2024.789

Data availability

Data availability is not applicable to this article as no new data were created or analysed in this study.

Acknowledgements

We would like to express our appreciation to Dr Peter Braun for his contributions to the development of the search string. We would like to thank Dr Ans Hovenkamp for her methodological advice during the inception of this review.

Author contributions

The PROSPERO protocol was designed and registered by S.H.B., D.G.B. and H.R. The search was performed by D.G.B. and S.H.B. Data extraction was performed by D.G.B. and S.H.B. The manuscript was written by D.G.B. and reviewed by H.R., S.H.B. and R.A.S. All authors gave final approval for the work to be published.

Funding

This research received no specific grant from any funding agency, commercial or not-for-profit sectors.

Declaration of interest

None.

Footnotes

a Definition of discrimination:Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 ‘a model's ability to discriminate between patients who will benefit and who will not benefit from a given treatment’.

b Definition of calibration:Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 ‘a model's ability to accurately predict estimated outcomes compared to observed outcomes’.

c Following the TRIPOD statement,Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 reporting calibration and/or discrimination measures, or at least accuracy at a single cut-off.

d I 2: estimate of the proportion of variability in a meta-analysis that is explained by between-study differences (e.g. methodological differences) rather than sampling error.

e Tau2: estimate of the variance of the true effect sizes, used to calculate the prediction interval.
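In standard notation (our addition, not taken from the paper; these are the conventional Higgins–Thompson and prediction-interval definitions):

I^2 = \frac{\hat{\tau}^2}{\hat{\tau}^2 + \hat{\sigma}^2} \times 100\%,
\qquad
\text{95\% PI} \approx \hat{\mu} \pm t_{k-2}\,\sqrt{\hat{\tau}^2 + \widehat{\mathrm{SE}}(\hat{\mu})^2}

where \hat{\sigma}^2 denotes the typical within-study sampling variance, \hat{\mu} the summary estimate and k the number of studies.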

References

1. GBD 2017 DALYs and HALE Collaborators. Global, regional, and national disability-adjusted life-years (DALYs) for 359 diseases and injuries and healthy life expectancy (HALE) for 195 countries and territories, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018; 392(10159): 1859–922.
2. Howes, OD, Thase, ME, Pillinger, T. Treatment resistance in psychiatry: state of the art and new directions. Mol Psychiatry 2022; 27(1): 58–72.
3. Maj, M. Why the clinical utility of diagnostic categories in psychiatry is intrinsically limited and how we can use new approaches to complement them. World Psychiatry 2018; 17(2): 121–2.
4. Feczko, E, Miranda-Dominguez, O, Marr, M, Graham, AM, Nigg, JT, Fair, DA. The heterogeneity problem: approaches to identify psychiatric subtypes. Trends Cogn Sci 2019; 23(7): 584–601.
5. Bourke, JW, White, PD. Psychological medicine. In Kumar & Clark's Clinical Medicine, 8th ed. (eds Kumar, PJ, Clark, M): 1115–91. Elsevier, 2012.
6. Gillan, CM, Whelan, R. What big data can do for treatment in psychiatry. Curr Opin Behav Sci 2017; 18: 34–42.
7. Insel, TR, Cuthbert, BN. Medicine. Brain disorders? Precisely. Science 2015; 348(6234): 499–500.
8. Meehan, AJ, Lewis, SJ, Fazel, S, Fusar-Poli, P, Steyerberg, EW, Stahl, D, et al. Clinical prediction models in psychiatry: a systematic review of two decades of progress and challenges. Mol Psychiatry 2022; 27(6): 2700–8.
9. Debray, TP, Vergouwe, Y, Koffijberg, H, Nieboer, D, Steyerberg, EW, Moons, KG. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol 2015; 68(3): 279–89.
10. Wolff, RF, Moons, KGM, Riley, RD, Whiting, PF, Westwood, M, Collins, GS, et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 2019; 170(1): 51–8.
11. Salazar de Pablo, G, Studerus, E, Vaquerizo-Serrano, J, Irving, J, Catalan, A, Oliver, D, et al. Implementing precision psychiatry: a systematic review of individualized prediction models for clinical practice. Schizophr Bull 2021; 47(2): 284–97.
12. Bashmi, L, Cohn, A, Chan, ST, Tobia, G, Gohar, Y, Herrera, N, et al. The biopsychosocial model of evaluation and treatment in psychiatry. In Atlas of Psychiatry (ed. IsHak, WW): 57–89. Springer, 2023.
13. Greco, T, Zangrillo, A, Biondi-Zoccai, G, Landoni, G. Meta-analysis: pitfalls and hints. Heart Lung Vessel 2013; 5(4): 219–25.
14. Altman, D, Ashby, D, Birks, J, Borenstein, M, Campbell, M, Deeks, J. Analysing data and undertaking meta-analyses. In Cochrane Handbook for Systematic Reviews of Interventions, ed. 6.3 (eds Deeks, J, Higgins, JPT, Altman, DG, on behalf of the Cochrane Statistical Methods Group). Cochrane, 2022.
15. Boutron, I, Page, MJ, Altman, D, Lundh, A, Hróbjartsson, A. Considering bias and conflicts of interest among the included studies. In Cochrane Handbook for Systematic Reviews of Interventions, ed. 6.3. Cochrane, 2022.
16. Page, MJ, McKenzie, JE, Bossuyt, PM, Boutron, I, Hoffmann, TC, Mulrow, CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Br Med J 2021; 372: n71.
17. Heus, P, Damen, J, Pajouheshnia, R, Scholten, R, Reitsma, JB, Collins, GS, et al. Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement. BMC Med 2018; 16(1): 120.
18. Moons, KG, de Groot, JA, Bouwmeester, W, Vergouwe, Y, Mallett, S, Altman, DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med 2014; 11(10): e1001744.
19. Viechtbauer, W. The metafor Package (R). Metafor Project, 2010 (https://www.metafor-project.org/doku.php/installation).
20. Punch, K. Analyzing qualitative data. In Introduction to Social Research, 3rd ed.: 179–88. SAGE Publications, 2014.
21. Eysenck, HJ. Systematic reviews: meta-analysis and its problems. Br Med J 1994; 309: 789.
22. Bradley, AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 1997; 30(7): 1145–59.
23. Carrington, A, Manuel, D, Fieguth, P, Ramsay, T, Osmani, V, Wernly, B, et al. Deep ROC analysis and AUC as balanced average accuracy, for improved classifier selection, audit and explanation. IEEE Trans Pattern Anal Mach Intell 2023; 45(1): 329–41.
24. Lee, Y, Ragguett, R-M, Mansur, RB, Boutilier, JJ, Rosenblat, JD, Trevizol, A, et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: a meta-analysis and systematic review. J Affect Disord 2018; 241: 519–32.
25. Debray, T, de Jong, V. Package metamisc (R), version 0.4.0. CRAN (Comprehensive R Archive Network), 2021 (https://github.com/smartdata-analysis-and-statistics/metamisc).
26. Ashar, YK, Clark, J, Gunning, FM, Goldin, P, Gross, JJ, Wager, TD. Brain markers predicting response to cognitive-behavioral therapy for social anxiety disorder: an independent replication of Whitfield-Gabrieli et al. 2015. Transl Psychiatry 2021; 11(1): 260.
27. Athreya, AP, Neavin, D, Carrillo-Roa, T, Skime, M, Biernacka, J, Frye, MA, et al. Pharmacogenomics-driven prediction of antidepressant treatment outcomes: a machine-learning approach with multi-trial replication. Clin Pharmacol Ther 2019; 106(4): 855–65.
28. Athreya, AP, Brückl, T, Binder, EB, Rush, AJ, Biernacka, J, Frye, MA, et al. Prediction of short-term antidepressant response using probabilistic graphical models with replication across multiple drugs and treatment settings. Neuropsychopharmacology 2021; 46(7): 1272–82.
29. Bone, C, Simmonds-Buckley, M, Thwaites, R, Sandford, D, Merzhvynska, M, Rubel, J, et al. Dynamic prediction of psychological treatment outcomes: development and validation of a prediction model using routinely collected symptom data. Lancet Digit Health 2021; 3(4): e231–40.
30. Cattaneo, A, Ferrari, C, Uher, R, Bocchio-Chiavetto, L, Riva, MA, Pariante, CM. Absolute measurements of macrophage migration inhibitory factor and interleukin-1-β mRNA levels accurately predict treatment response in depressed patients. Int J Neuropsychopharmacol 2016; 19(10): 1–10.
31. Chekroud, AM, Zotti, RJ, Shehzad, Z, Gueorguieva, R, Johnson, MK, Trivedi, MH, et al. Cross-trial prediction of treatment outcome in depression: a machine learning approach. Lancet Psychiatry 2016; 3(3): 243–50.
32. Fabbri, C, Kasper, S, Kautzky, A, Zohar, J, Souery, D, Montgomery, S, et al. A polygenic predictor of treatment-resistant depression using whole exome sequencing and genome-wide genotyping. Transl Psychiatry 2020; 10(1): 50.
33. Fazel, S, Wolf, A, Larsson, H, Lichtenstein, P, Mallett, S, Fanshawe, TR. Identification of low risk of violent crime in severe mental illness with a clinical prediction tool (Oxford Mental Illness and Violence tool [OxMIV]): a derivation and validation study. Lancet Psychiatry 2017; 4(6): 461–8.
34. Fazel, S, Wolf, A, Larsson, H, Lichtenstein, P, Mallett, S, Fanshawe, TR. The prediction of suicide in severe mental illness: development and validation of a clinical prediction rule (OxMIS). Transl Psychiatry 2019; 9(1): 98.
35. Fiedorowicz, JG, Merranko, JA, Iyengar, S, Hower, H, Gill, MK, Yen, S, et al. Validation of the youth mood recurrences risk calculator in an adult sample with bipolar disorder. J Affect Disord 2021; 295: 1482–8.
36. Furukawa, TA, Kato, T, Shinagawa, Y, Miki, K, Fujita, H, Tsujino, N, et al. Prediction of remission in pharmacotherapy of untreated major depression: development and validation of multivariable prediction models. Psychol Med 2019; 49(14): 2405–13.
37. Hayes, JF, Osborn, DPJ, Francis, E, Ambler, G, Tomlinson, LA, Boman, M, et al. Prediction of individuals at high risk of chronic kidney disease during treatment with lithium for bipolar disorder. BMC Med 2021; 19(1): 99.
38. Jha, MK, Minhajuddin, A, South, C, Rush, AJ, Trivedi, MH. Irritability and its clinical utility in major depressive disorder: prediction of individual-level acute-phase outcomes using early changes in irritability and depression severity. Am J Psychiatry 2019; 176(5): 358–66.
39. Jha, MK, South, C, Trivedi, J, Minhajuddin, A, Rush, AJ, Trivedi, MH. Prediction of acute-phase treatment outcomes by adding a single-item measure of activity impairment to symptom measurement: development and validation of an interactive calculator from the STAR*D and CO-MED trials. Int J Neuropsychopharmacol 2019; 22(5): 339–48.
40. Kambeitz-Ilankovic, L, Vinogradov, S, Wenzel, J, Fisher, M, Haas, SS, Betz, L, et al. Multivariate pattern analysis of brain structure predicts functional outcome after auditory-based cognitive training interventions. NPJ Schizophr 2021; 7(1): 40.
41. Kautzky, A, Dold, M, Bartova, L, Spies, M, Kranz, GS, Souery, D, et al. Clinical factors predicting treatment resistant depression: affirmative results from the European multicenter study. Acta Psychiatr Scand 2019; 139(1): 78–88.
42. Klein, NS, Holtman, GA, Bockting, CLH, Heymans, MW, Burger, H. Development and validation of a clinical prediction tool to estimate the individual risk of depressive relapse or recurrence in individuals with recurrent depression. J Psychiatr Res 2018; 104: 1–7.
43. Leighton, SP, Krishnadas, R, Chung, K, Blair, A, Brown, S, Clark, S, et al. Predicting one-year outcome in first episode psychosis using machine learning. PLoS One 2019; 14(3): e0212846.
44. Leighton, SP, Krishnadas, R, Upthegrove, R, Marwaha, S, Steyerberg, EW, Gkoutos, GV, et al. Development and validation of a nonremission risk prediction model in first-episode psychosis: an analysis of 2 longitudinal studies. Schizophr Bull Open 2021; 2(1): sgab041.
45. Nie, Z, Vairavan, S, Narayan, VA, Ye, J, Li, QS. Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study. PLoS One 2018; 13(6): e0197268.
46. Nunez, JJ, Nguyen, TT, Zhou, Y, Cao, B, Ng, RT, Chen, J, et al. Replication of machine learning methods to predict treatment outcome with antidepressant medications in patients with major depressive disorder from STAR*D and CAN-BIND-1. PLoS One 2021; 16(6): e0253023.
47. Ortiz, BB, Higuchi, CH, Noto, C, Joyce, DW, Correll, CU, Bressan, RA, et al. A symptom combination predicting treatment-resistant schizophrenia: a strategy for real-world clinical practice. Schizophr Res 2020; 218: 195–200.
48. Perlis, RH, Ostacher, MJ, Miklowitz, DJ, Hay, A, Nierenberg, AA, Thase, ME, et al. Clinical features associated with poor pharmacologic adherence in bipolar disorder: results from the STEP-BD study. J Clin Psychiatry 2010; 71(3): 296–303.
49. Perry, BI, Osimo, EF, Upthegrove, R, Mallikarjun, PK, Yorke, J, Stochl, J, et al. Development and external validation of the psychosis metabolic risk calculator (PsyMetRiC): a cardiometabolic risk prediction algorithm for young people with psychosis. Lancet Psychiatry 2021; 8(7): 589–98.
50. Puntis, S, Whiting, D, Pappa, S, Lennox, B. Development and external validation of an admission risk prediction model after treatment from early intervention in psychosis services. Transl Psychiatry 2021; 11(1): 35.
51. Soldatos, RF, Cearns, M, Nielsen, M, Kollias, C, Xenaki, LA, Stefanatou, P, et al. Prediction of early symptom remission in two independent samples of first-episode psychosis patients using machine learning. Schizophr Bull 2022; 48(1): 122–33.
52. Taliaz, D, Spinrad, A, Barzilay, R, Barnett-Itzhaki, Z, Averbuch, D, Teltsh, O, et al. Optimizing prediction of response to antidepressant medications using machine learning and integrated genetic, clinical, and demographic data. Transl Psychiatry 2021; 11(1): 381.
53. Wang, JL, Patten, S, Sareen, J, Bolton, J, Schmitz, N, MacQueen, G. Development and validation of a prediction algorithm for use by health professionals in prediction of recurrence of major depression. Depress Anxiety 2014; 31(5): 451–7.
54. Botvinik-Nezer, R, Holzmeister, F, Camerer, CF, Dreber, A, Huber, J, Johannesson, M, et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 2020; 582: 84–8.
55. Rashid, B, Calhoun, V. Towards a brain-based predictome of mental illness. Hum Brain Mapp 2020; 41: 3468–535.
56. Schnack, HG, Kahn, RS. Detecting neuroimaging biomarkers for psychiatric disorders: sample size matters. Front Psychiatry 2016; 7: 50.
57. Nunes, A, Ardau, R, Berghöfer, A, Bocchetta, A, Chillotti, C, Deiana, V, et al. Prediction of lithium response using clinical data. Acta Psychiatr Scand 2020; 141(2): 131–41.
58. Gillett, G, Tomlinson, A, Efthimiou, O, Cipriani, A. Predicting treatment effects in unipolar depression: a meta-review. Pharmacol Ther 2020; 212: 107557.
59. Chekroud, AM, Hawrilenko, M, Loho, H, Bondar, J, Gueorguieva, R, Hasan, A, et al. Illusory generalizability of clinical prediction models. Science 2024; 383(6679): 164–7.
60. Pae, CU. Why systematic review rather than narrative review? Psychiatry Investig 2015; 12(3): 417–9.
Fig. 1 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-conforming flowchart of the screening process. EMBASE, Excerpta Medica database.

Table 1 Study characteristics

Table 2 (Type of) variables used per study

Table 3 Modelling methods and reported outcomes per study

Fig. 2 Forest plot of all eligible models (n = 28). Includes reported ASC and AUC. For a sensitivity analysis with only AUC, please refer to Supplementary file 4. ASC, accuracy at single cut-off; AUC, area under the curve; Bu, bupropion; dul, duloxetine; EET, Education Employment Training status; esc, escitalopram; flu, fluoxetine; mir, mirtazapine; NHS, National Health Service; Nu.Ty.We, Cumbria Northumberland Tyne and Wear NHS Foundation Trust; par, paroxetine; pl, placebo; Rem, remission; ser, sertraline; ven, venlafaxine; wh, whole; Wh.Ba, Whittington Barnet Enfield Haringey Pennine and Humber NHS Foundation Trust.

Supplementary material

Burghoorn et al. supplementary materials 1–7 are available with the online version of this article.