Introduction
Over 25 antidepressant drugs and other therapies are effective in the treatment of major depressive disorder (MDD). While there are only small differences in the efficacy of the various treatments averaged across large groups of individuals, the response to each treatment varies substantially from individual to individual. Only a minority of individuals with MDD experience an adequate benefit from the first treatment they receive, leading many to sequentially try multiple treatments and combinations (Al-Harbi, Reference Al-Harbi2012). Each unsuccessful treatment trial lasts several months and carries the risk of side effects, frustration, and adverse outcomes, including suicide. At present, the initial selection of treatment is usually based on evidence for efficacy and tolerability averaged across groups of individuals. Improving the selection of treatment requires tools that can predict whether a given individual will respond to a specific treatment. If such tools can be applied before, or early in the course of, a treatment trial, individuals with a low likelihood of adequate therapeutic response can be redirected to treatment options that are more likely to be beneficial (Simon & Perlis, Reference Simon and Perlis2010).
Our knowledge of factors that predict treatment outcomes in depression has increased over the past decade. Known predictive factors include demographic characteristics (Fournier et al., Reference Fournier, DeRubeis, Shelton, Hollon, Amsterdam and Gallop2009), history of adverse experiences (Nanni, Uher, & Danese, Reference Nanni, Uher and Danese2012), comorbid anxiety (Fava et al., Reference Fava, Rush, Alpert, Balasubramani, Wisniewski, Carmin and Trivedi2008), symptom dimensions (Uher et al., Reference Uher, Perlis, Henigsberg, Zobel, Rietschel, Mors and McGuffin2012a), cognitive performance (Williams et al., Reference Williams, Rush, Koslow, Wisniewski, Cooper, Nemeroff and Gordon2011), molecular biomarkers (Uher et al., Reference Uher, Tansey, Dew, Maier, Mors, Hauser and McGuffin2014), as well as measures of brain structure (Colle et al., Reference Colle, Dupong, Colliot, Deflesselle, Hardy, Falissard and Corruble2018) and function (McGrath et al., Reference McGrath, Kelley, Holtzheimer, Dunlop, Craighead, Franco and Mayberg2013). While some of these predictive factors have been replicated across datasets, none is sufficiently accurate, robust, or economical for routine clinical use at an individual level.
MDD is a heterogeneous condition influenced by many factors that vary across individuals and populations. It is therefore likely that more accurate individualized prediction can be achieved by models that consider multiple factors. Multivariate predictions can be constructed from models developed in existing datasets, using machine learning tools (Vu et al., Reference Vu, Adalı, Ba, Buzsáki, Carlson, Heller and Dzirasa2018). Several groups of investigators have applied machine learning to depression treatment datasets to establish classifiers that could predict treatment outcomes (Chekroud et al., Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016; Etkin et al., Reference Etkin, Patenaude, Song, Usherwood, Rekshan, Schatzberg and Williams2015; Iniesta et al., Reference Iniesta, Hodgson, Stahl, Malki, Maier, Rietschel and Uher2018, Reference Iniesta, Malki, Maier, Rietschel, Mors, Hauser and Uher2016; Maciukiewicz et al., Reference Maciukiewicz, Marshe, Hauschild, Foster, Rotzinger, Kennedy and Geraci2018; Nie, Vairavan, Narayan, Ye, & Li, Reference Nie, Vairavan, Narayan, Ye and Li2018; Perlis, Reference Perlis2013). These results suggest that it is possible to construct parsimonious predictive models that use combinations of selected features from a large number of measurements to predict treatment effects for individuals who were not in the training dataset. A systematic review of published models found that studies using adequate methodology reported predictions with modest accuracy (Sajjadian et al., Reference Sajjadian, Lam, Milev, Rotzinger, Frey, Soares and Uher2021). However, most published studies with adequate methodology were limited to predictors of a single modality, primarily those derived from clinical questionnaires and interviews. Two prior studies that combined clinical features and molecular genetic markers reported improved accuracy of prediction compared to clinical features alone (Iniesta et al., Reference Iniesta, Hodgson, Stahl, Malki, Maier, Rietschel and Uher2018; Taliaz et al., Reference Taliaz, Spinrad, Barzilay, Barnett-Itzhaki, Averbuch, Teltsh and Lerer2021). The added benefit of multimodal measurement remains to be replicated and extended to additional data modalities, such as neuroimaging. It is also unknown whether the improved prediction is a function of combining data across measurement modalities or a result of including a larger number of predictors.
The goal of the Canadian Biomarker Integration Network in Depression (CAN-BIND) is to enhance treatment response in MDD through the prediction of treatment outcomes and personalized treatment strategies (Kennedy et al., Reference Kennedy, Downar, Evans, Feilotter, Lam, MacQueen and Soares2012; Lam et al., Reference Lam, Milev, Rotzinger, Andreazza, Blier, Brenner and Kennedy2016). In the present paper, we leverage the CAN-BIND-1 dataset to test whether multimodal measurements, composed of clinical, molecular, and brain imaging biomarkers, improve the prediction of depression treatment outcomes. By systematically examining each domain of measurement and varying the number of predictive features in a tiered approach, we aim to answer the question of whether multimodal measurement or an increased number of variables influences the accuracy of prediction.
Methods
The CAN-BIND-1 dataset
The CAN-BIND-1 study enrolled 211 adults with MDD, who were extensively assessed, offered treatment with the serotonin-reuptake inhibiting antidepressant escitalopram (10–20 mg), and invited for follow-up assessments every 2 weeks for 16 weeks (Kennedy et al., Reference Kennedy, Lam, Rotzinger, Milev, Blier, Downar and Uher2019; Lam et al., Reference Lam, Milev, Rotzinger, Andreazza, Blier, Brenner and Kennedy2016). Of these, 192 (91%) attended the assessment after 2 weeks, 180 (85%) attended the assessment 8 weeks after treatment initiation, and 166 (79%) attended the final planned assessment after 16 weeks of treatment. At week 8, participants who did not respond to escitalopram were offered additional treatment with aripiprazole (2–10 mg) as an augmentation strategy. CAN-BIND-1 was approved by the Research Ethics Boards at all recruiting sites. All participants signed an informed consent form after the study procedures had been explained. The present study uses data from the first 8 weeks, during which all participants received escitalopram as monotherapy. We include 192 participants (74 men and 118 women, mean age 35.4, s.d. = 12.8 years) who provided valid outcome data at one or more follow-ups after initiating treatment (see online Supplementary Table S1 for a comparison of participants who did and who did not contribute to analyses). Analyses that included predictors from week 2 used a subset of 188 participants (71 men and 117 women, mean age 35.3, s.d. = 12.7 years) who provided valid outcomes at week 4 or later. Further details of the CAN-BIND-1 clinical dataset are available elsewhere (Kennedy et al., Reference Kennedy, Lam, Rotzinger, Milev, Blier, Downar and Uher2019; Lam et al., Reference Lam, Milev, Rotzinger, Andreazza, Blier, Brenner and Kennedy2016). A detailed flow diagram of CAN-BIND-1 participants is depicted in online Supplementary Fig. S1.
Outcomes of antidepressant treatment
In CAN-BIND-1, we measured the severity of depressive symptoms every 2 weeks for 16 weeks with the clinician-rated Montgomery and Åsberg Depression Rating Scale (MADRS). At baseline, the participants were moderately to severely depressed, scoring on average 30 on the MADRS (range 21–47). The improvement in depressive symptoms with treatment can be indexed with a continuous measure (absolute or proportional reduction) or a dichotomized categorical outcome (remission, response). Consistent with previous studies in the field, we chose a categorical outcome to provide a metric that is comparable with the prior literature (Sajjadian et al., Reference Sajjadian, Lam, Milev, Rotzinger, Frey, Soares and Uher2021). Categorical outcome measures based on the absolute end-point score (remission) and on the proportion of change from the baseline score (response) have complementary advantages and disadvantages. The probability of remission is negatively related to baseline severity, whereas the probability of response is independent of severity at baseline (Coley et al., Reference Coley, Boggs, Beck, Hartzler and Simon2020). Since we aimed to index improvement in a way that is independent of baseline severity (Kennedy et al., Reference Kennedy, Downar, Evans, Feilotter, Lam, MacQueen and Soares2012), we chose response, defined as a reduction in MADRS of 50% or more from baseline to week 8, as the primary outcome measure. When MADRS at week 8 was missing, we used earlier time points to estimate the outcome based on mixed-effects models for repeated measures, as previously described (Uher et al., Reference Uher, Frey, Quilty, Rotzinger, Blier, Foster and Kennedy2020). When no outcome data were available at any post-baseline time point (or post-week-2 for analyses that used week 2 measurements as predictors), we did not include the participant in any analyses. The response rates were 46.9% and 47.3% for the baseline and week 2 samples, respectively.
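Operationally, the response outcome reduces to a simple threshold rule. The minimal R sketch below illustrates the definition under stated assumptions: a hypothetical data frame madrs with columns week0 and week8 holding MADRS total scores, with missing week 8 scores already estimated by the mixed-effects model (not shown). It is an illustration of the definition, not the study code.

```r
# A minimal sketch of the primary outcome definition (response), assuming a
# hypothetical data frame 'madrs' with MADRS totals in 'week0' and 'week8'.
# Missing week 8 scores are assumed to have been estimated beforehand from
# earlier time points with a mixed-effects model for repeated measures.
madrs$response <- factor(
  ifelse((madrs$week0 - madrs$week8) / madrs$week0 >= 0.5,
         "responder", "non-responder")
)
```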
Predictors
At baseline (week 0), CAN-BIND-1 participants underwent detailed assessments with interviews, questionnaires, cognitive testing, magnetic resonance neuroimaging, and blood sampling. The assessments were organized into three modalities: (1) The clinical modality included interviews to establish diagnoses, medical history, current severity of depression, and functioning; questionnaires covering depressive and anxiety symptoms, personality traits, and functioning; cognitive testing; and measurement of body weight and height to calculate body mass index (Kennedy et al., Reference Kennedy, Lam, Rotzinger, Milev, Blier, Downar and Uher2019; Lam et al., Reference Lam, Milev, Rotzinger, Andreazza, Blier, Brenner and Kennedy2016). (2) The molecular modality used blood samples to extract DNA for genomic and epigenetic analyses and to measure a comprehensive panel of microRNAs, levels of common metabolites, and inflammatory markers. (3) The neuroimaging modality used multimodal magnetic resonance imaging to obtain whole-brain structural T1- and T2-weighted images, diffusion tensor imaging of the white matter, and functional magnetic resonance imaging during resting state and depression-relevant tasks (Macqueen et al., Reference Macqueen, Hassel, Arnott, Addington, Bowie, Bray and Kennedy2019). A subset of measurements was repeated after 2 weeks (week 2). Further details of CAN-BIND-1 assessments are available elsewhere (Kennedy et al., Reference Kennedy, Lam, Rotzinger, Milev, Blier, Downar and Uher2019; Lam et al., Reference Lam, Milev, Rotzinger, Andreazza, Blier, Brenner and Kennedy2016).
Tiered selection of predictors for analysis
To optimize the use of the rich dataset with a limited sample size, we adopted a two-tiered approach to the inclusion of potential markers in predictive model development, with tiers 1 and 2 considering focused and comprehensive sets of potential predictors, respectively. Tier 1 predictors were selected based on prior published evidence of predictive value, measurement reliability, data completeness, and conceptual value, and were pre-processed, derived, or engineered (e.g. a total scale score rather than individual questionnaire items, selected regional brain volumes rather than voxel-level signal intensity, total DNA methylation rather than methylation at specific genomic loci). The exact number of tier 1 predictors was not determined a priori; however, the aim was to retain a number of predictors similar to or lower than the number of individuals in the analytic sample. The selection process resulted in a set of 134 variables measured at baseline that represented all three measurement modalities in tier 1 analyses (Table 1). For analyses including baseline and week 2 predictors, an additional 80 variables from week 2 assessments were added to the baseline tier 1 variables, resulting in a total of 214 predictors (Table 1). Where prior evidence was used in the selection of predictors, it was strictly limited to studies that did not include CAN-BIND-1 participants. Tier 2 predictors were also mostly pre-processed and adequately measured, but were included without any requirement of prior evidence of predictive value. The inclusion of a greater number of predictors allowed us to consider more comprehensive and granular information (e.g. subscores or items from a questionnaire, volume measurements of all brain regions, all known microRNAs measured with adequate reliability). In tier 2 analyses, we included 1152 predictor variables measured at baseline that comprehensively covered the key information from the three assessment modalities (Table 1). The gradual inclusion of predictors by tier and by modality allowed us to separate the contribution of multimodal measurement from the effect of a greater number of predictors, to examine the limits of machine learning analyses in a moderately sized clinical sample, and to test the value of prior evidence in variable selection. The list and number of predictors considered for inclusion in each tier are given in online Supplementary Table S2. A description of all tier 1 variables is given in online Supplementary Table S3.
Missing values
Missing values are inevitable in human datasets, and the way they are handled can influence results. One commonly used approach, complete case analysis, includes only individuals with valid values for all variables; it introduces bias except when all values are missing completely at random. The preferred alternative is imputation of missing values. For imputation of missing values on predictor variables, we chose 'missRanger' because of its capability to handle non-normally distributed data with various types of predictors. 'MissRanger' was introduced by Mayer for the imputation of multimodal datasets similar to the CAN-BIND-1 dataset and builds on the missForest algorithm (Stekhoven & Bühlmann, Reference Stekhoven and Bühlmann2012). It uses a fast implementation of random forests in the 'ranger' package (Wright & Ziegler, Reference Wright and Ziegler2017) and includes predictive mean matching, which prevents imputation with values that do not exist in the original data, such as a value of 0.5 in a 0–1 binary variable. Importantly, missRanger was integrated into the machine learning workflow so that imputation was done independently in each training and each testing set, preventing information leakage. Outcome measures were not included in the imputation procedure (see section Outcomes of Antidepressant Treatment, above, for missing values on outcomes). For visualization of missing value patterns, please see online Supplementary Fig. S2A–I.
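To illustrate the leakage-free workflow, the sketch below imputes the training and testing partitions of a cross-validation fold separately with missRanger. The data frames train_x and test_x and the pmm.k setting are illustrative assumptions, not the study's actual configuration.

```r
library(missRanger)

# Impute predictors separately within each partition of a cross-validation
# fold, so that no information flows between training and testing sets.
# 'train_x' and 'test_x' are hypothetical data frames of predictors only;
# outcome variables are excluded, as in the main analysis. Setting pmm.k > 0
# turns on predictive mean matching, so imputed values are drawn from the
# observed values (e.g. a 0-1 binary variable can never receive 0.5).
train_imp <- missRanger(train_x, formula = . ~ ., pmm.k = 3, seed = 1)
test_imp  <- missRanger(test_x,  formula = . ~ ., pmm.k = 3, seed = 1)
```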
Development and assessment of prediction models
When designing the development of the prediction model, we followed current recommendations to reduce the risk of bias (ROB) and overfitting (Moons et al., Reference Moons, Wolff, Riley, Whiting, Westwood, Collins and Mallett2019; Ranstam, Cook, & Collins, Reference Ranstam, Cook and Collins2016; Wolff et al., Reference Wolff, Moons, Riley, Whiting, Westwood, Collins and Mallett2019). To ensure complete separation of training and testing sets and minimize ROB and over-optimism, we applied a fully nested cross-validation framework, with all procedures, including the imputation of missing values, performed separately in training and testing sets within each fold of the outer cross-validation loop (Fig. 1). For each combination of predictor modality (clinical, molecular, neuroimaging, clinical + molecular, clinical + neuroimaging, molecular + neuroimaging, and all three modalities), predictor tier (tier 1 and tier 2), predictor measurement time (baseline only, or the combination of baseline and week 2), and machine learning method, we completed 100 repetitions of nested cross-validation (inner fivefold cross-validation and outer threefold cross-validation). We applied five machine learning methods with potentially complementary advantages: penalized multiple regression (hyperparameter-tuned elastic net) (Kuhn, Reference Kuhn2020), two tree-based methods [random forests and gradient boosting (GBM)], support vector machines (SVM) with a radial basis kernel (Kuhn, Reference Kuhn2020; Meyer et al., Reference Meyer, Dimitriadou, Hornik, Weingessel, Leisch, Chang and Lin2021), and Bayesian network analysis (naïve Bayes) (Meyer et al., Reference Meyer, Dimitriadou, Hornik, Weingessel, Leisch, Chang and Lin2021). We applied each machine learning method with and without feature selection using CAT scores (correlation-adjusted t-scores) in the sda package (Ahdesmäki, Zuber, Gibb, & Strimmer, Reference Ahdesmäki, Zuber, Gibb and Strimmer2015) to select the top 25 features in each training set. To prevent information leakage, feature selection was carried out in the training set only, within each fold of each repeat of the outer cross-validation. Elastic net, GBM, and random forests have additional embedded feature selection, which was likewise nested and restricted to the training set within each fold and repeat. In total, we developed and evaluated 210 models representing combinations of modality, predictor tier, predictor measurement time, machine learning method, and CAT feature selection. In each case, we predicted a categorically defined response. Since the outcome distribution was balanced and there was no a priori assumption of differential penalty for false-positive and false-negative predictions, we quantified prediction as balanced accuracy in the testing sets of the nested cross-validation. We report the mean balanced accuracy and the range of balanced accuracy across the 100 repeats of nested cross-validation. We examined variable importance to evaluate the contribution of specific predictors in the models with the highest predictive accuracy (Fig. 2; online Supplementary Figs S3–S6). The Prediction model Risk Of Bias Assessment Tool (PROBAST) (Moons et al., Reference Moons, Wolff, Riley, Whiting, Westwood, Collins and Mallett2019; Wolff et al., Reference Wolff, Moons, Riley, Whiting, Westwood, Collins and Mallett2019) assessment is reported in online Supplementary Table S4. Since the model was developed without external validation, the overall ROB was considered high even though the other domains showed low ROB.
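The schematic R sketch below shows one repeat of this pipeline for a single learner, using caret (Kuhn, Reference Kuhn2020) for the inner tuning loop, the sda package for CAT-score ranking, and a radial-kernel SVM as the example method. The predictor matrix x, outcome factor y, and any settings not stated above are illustrative assumptions, and the per-fold imputation step is omitted (see the sketch in the previous section).

```r
library(caret)  # inner-loop hyperparameter tuning
library(sda)    # correlation-adjusted t-score (CAT) feature ranking

# Balanced accuracy: the mean of per-class recall (i.e. the average of
# sensitivity and specificity for a binary outcome).
balanced_accuracy <- function(truth, pred) {
  pred <- factor(pred, levels = levels(truth))
  cm   <- table(truth, pred)
  mean(diag(cm) / rowSums(cm))
}

# One repeat of nested cross-validation: an outer 3-fold loop for testing
# and an inner 5-fold loop (inside caret::train) for hyperparameter tuning.
# 'x' is a numeric predictor matrix; 'y' is a factor outcome
# ("responder" v. "non-responder"). Both are assumed inputs.
outer_folds <- createFolds(y, k = 3)
acc <- sapply(outer_folds, function(test_idx) {
  x_train <- x[-test_idx, ]; y_train <- y[-test_idx]
  x_test  <- x[test_idx, ];  y_test  <- y[test_idx]

  # CAT-score feature selection in the training set only (top 25 features),
  # so the testing set never informs which features are used.
  ranking <- sda.ranking(x_train, y_train, verbose = FALSE)
  top25   <- ranking[1:25, "idx"]

  # Radial-kernel SVM tuned by inner fivefold cross-validation.
  fit <- train(x_train[, top25], y_train,
               method    = "svmRadial",
               trControl = trainControl(method = "cv", number = 5))

  balanced_accuracy(y_test, predict(fit, x_test[, top25]))
})
mean(acc)  # mean balanced accuracy across outer folds for this repeat
```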
Results
Prediction of response from tier 1 baseline predictors
Up to 134 predictors across the three modalities were included in tier 1 analyses. Across 70 machine learning models using seven combinations of predictor modality and five machine learning methods, each with and without feature selection, we predicted response with a mean balanced accuracy of 0.55 (median 0.55; Fig. 3; Table 1; online Supplementary Table S5). There was a gradient of increasing prediction accuracy with the inclusion of multiple predictor modalities. A single predictor modality led to a mean balanced accuracy of 0.542 (95% CI 0.541–0.543; s.e. = 0.000562); a combination of two predictor modalities led to a mean balanced accuracy of 0.553 (95% CI 0.552–0.554; s.e. = 0.000597); and the combination of features across all three modalities predicted response with a mean balanced accuracy of 0.573 across methods (95% CI 0.571–0.575; s.e. = 0.000971). The proportion of individuals misclassified in terms of response to antidepressant treatment decreased from 46.9% with chance prediction to 44.8% with the best performing set of models using baseline clinical predictors, and to 37.4% with the best set of models including all three data modalities. The distributions of balanced accuracy estimates for one, two, and three baseline data modalities are illustrated in Fig. 4A, C, E.
Of the machine learning methods, SVM achieved the highest mean accuracy (0.58). Feature selection did not affect accuracy (mean 0.55 both with and without feature selection). The highest balanced accuracy of 0.62 was achieved by SVM without feature selection using a combination of predictors from all three modalities. Molecular (global DNA methylation, plasma cholesterol, microRNA 26p), clinical (functional impairment, anhedonia), and neuroimaging (fractional anisotropy in several white matter regions) variables all contributed to the prediction (online Supplementary Fig. S3). The most accurate predictive models for each predictor combination are described in Table 1. Receiver operating characteristic (ROC) curves for the most predictive [highest area under the curve (AUC)] models are depicted in online Supplementary Fig. S7A–G.
Prediction of response from tier 2 baseline predictors
Tier 2 analyses involved a more than eightfold increase in the number of predictors across modalities compared to tier 1 (1152 v. 134 features). The 70 machine learning models covering all combinations of predictor modality and method predicted response with a mean balanced accuracy of 0.54 (median 0.54, interquartile range 0.52–0.55; Fig. 3; Table 1; online Supplementary Table S5). A single predictor modality led to a mean balanced accuracy of 0.528 (95% CI 0.527–0.529; s.e. = 0.000538); a combination of two predictor modalities achieved a mean balanced accuracy of 0.541 (95% CI 0.540–0.542; s.e. = 0.000581); and the combination of features across all three modalities predicted response with a mean balanced accuracy of 0.543 (95% CI 0.541–0.545; s.e. = 0.000942).
The choice of the machine learning method was unrelated to prediction accuracy (Fig. 3; online Supplementary Table S5). Use of CAT score feature selection was associated with a modest increase in accuracy among tier 2 analyses (mean balanced accuracy 0.54 with and 0.53 without feature selection). The highest balanced accuracy of 0.58 was seen using random forest with CAT score feature selection and the combination of clinical and neuroimaging data. When all three modalities were included, the most important contributors to response prediction were polygenic scores for insomnia and educational achievement, suicidal ideation, reduced appetite, plasma cholesterol, and methylation at several loci (online Supplementary Fig. S3). Across modalities and methods, tier 2 analyses consistently achieved slightly lower prediction accuracy than tier 1 analyses (Fig. 3; online Supplementary Table S5). ROC curves for the most predictive (highest AUC) models are depicted in online Supplementary Fig. S8A–G.
Prediction of response from baseline and week 2 predictors
In the next step, we examined how additional measurements early in the course of treatment (at week 2) improved prediction. Since the inclusion of tier 2 variables reduced prediction accuracy, we focused this stage of analysis on tier 1 predictors. We tested 70 machine learning models with up to 80 variables measured at week 2 added to baseline tier 1 predictors.
These models predicted response with a mean balanced accuracy of 0.59 (median 0.60, interquartile range 0.54–0.63; Fig. 3; Table 1; online Supplementary Table S5). There was a gradient of increasing prediction accuracy with the inclusion of multiple predictor modalities. A single predictor modality led to a mean balanced accuracy of 0.570 (95% CI 0.568–0.571; s.e. = 0.000738); a combination of two predictor modalities led to a mean balanced accuracy of 0.595 (95% CI 0.593–0.596; s.e. = 0.000703); and the inclusion of baseline and week 2 features across all three modalities predicted response with a mean balanced accuracy of 0.619 (95% CI 0.617–0.621; s.e. = 0.001070) across methods. The distributions of balanced accuracy estimates for combinations of baseline and week 2 measurements are illustrated in Fig. 4B, D, F.
The decisive factor was the inclusion of clinical modality variables measured at week 2. The most accurate predictive models included an elastic net model using clinical predictors without feature selection, and a random forest model using data from all three modalities without additional feature selection, both achieving a balanced accuracy of 0.66 (Fig. 3; Table 1). In the most accurate models, clinical variables (clinical global impression, interest-activity symptoms score, and total depression severity scores from clinician-rated and self-report instruments) measured at week 2 contributed most to the prediction of response (online Supplementary Fig. S6). ROC curves for the most predictive (highest AUC) models are depicted in online Supplementary Fig. S9A–G.
Discussion
In a medium-sized richly assessed sample of patients with MDD, we show that multimodal assessment improves the prediction of antidepressant treatment outcomes. The improvement in prediction accuracy with the inclusion of molecular and neuroimaging information is modest. The addition of week 2 measurements leads to a more substantial improvement in prediction accuracy.
The strength of prediction should be interpreted in the context of the existing literature and the known relationship between study quality and prediction accuracy. A recent meta-analysis found that studies with adequate methodology (sample size over 100 and clear separation of training and testing sets) report lower prediction accuracy than studies with small samples or inadequate separation of training and testing sets (Sajjadian et al., Reference Sajjadian, Lam, Milev, Rotzinger, Frey, Soares and Uher2021). In addition, response is the hardest outcome to predict because it is uncorrelated with baseline severity (Coley et al., Reference Coley, Boggs, Beck, Hartzler and Simon2020). Studies with adequate methodology reported mean prediction accuracies of 0.69, 0.60, and 0.56 for the prediction of treatment resistance, remission, and response, respectively (Sajjadian et al., Reference Sajjadian, Lam, Milev, Rotzinger, Frey, Soares and Uher2021). The mean accuracy of prediction across the range of models in the present study (0.56) is comparable to reports in prior studies that used adequate quality methods (Sajjadian et al., Reference Sajjadian, Lam, Milev, Rotzinger, Frey, Soares and Uher2021). The use of baseline clinical measures helped predict one additional response per 50 patients compared to chance prediction (misclassification fell from 46.9% to 44.8%, i.e. approximately two fewer misclassified individuals per 100). This level of accuracy is above chance but does not meet the standards required for meaningful clinical application (Dinga et al., Reference Dinga, Marquand, Veltman, Beekman, Schoevers, van Hemert and Schmaal2018; Uher, Tansey, Malki, & Perlis, Reference Uher, Tansey, Malki and Perlis2012b). The use of baseline clinical, molecular, and neuroimaging measures helped predict one additional response per 10 patients compared to chance prediction (misclassification fell from 46.9% to 37.4%). This still falls short of clinical significance, especially if blood and neuroimaging biomarkers are not easily accessible, but it is a step toward clinically meaningful multimodal prediction as measurements become more accessible and algorithms improve. In a prior study, the addition of molecular genetic variables to clinical features increased prediction accuracy compared to using clinical features alone (Iniesta et al., Reference Iniesta, Hodgson, Stahl, Malki, Maier, Rietschel and Uher2018, Reference Iniesta, Malki, Maier, Rietschel, Mors, Hauser and Uher2016), while others reported highly accurate predictions of depression treatment outcomes from neuroimaging variables (Cash et al., Reference Cash, Cocchi, Anderson, Rogachov, Kucyi, Barnett and Fitzgerald2019; Williams et al., Reference Williams, Korgaonkar, Song, Paton, Eagles, Goldstein-Piekarski and Etkin2015). However, these studies had small samples and did not answer the question of whether the higher accuracy was due to the unique predictive value of neuroimaging or a result of overfitting (Sajjadian et al., Reference Sajjadian, Lam, Milev, Rotzinger, Frey, Soares and Uher2021). The present study is the first to combine clinical, molecular, and neuroimaging features in a single sample. We found a gradient of increasing accuracy with the inclusion of additional modalities of measurement. Both the mean accuracy and the accuracy of the best predictive model were highest when clinical, molecular, and neuroimaging predictors were combined.
This finding supports prior reports on the added benefits of molecular (Iniesta et al., Reference Iniesta, Hodgson, Stahl, Malki, Maier, Rietschel and Uher2018) and neuroimaging (Williams et al., Reference Williams, Korgaonkar, Song, Paton, Eagles, Goldstein-Piekarski and Etkin2015) data and extends them to combinations of clinical, molecular, and neuroimaging data. The lack of an increase in prediction accuracy with tier 2 variables suggests that the improvement is due to multiple data modalities rather than simply a greater number of variables. However, the degree of improvement in prediction accuracy in our data was not as large as that previously reported for molecular (Iniesta et al., Reference Iniesta, Hodgson, Stahl, Malki, Maier, Rietschel and Uher2018) or neuroimaging (Cash et al., Reference Cash, Cocchi, Anderson, Rogachov, Kucyi, Barnett and Fitzgerald2019; Williams et al., Reference Williams, Korgaonkar, Song, Paton, Eagles, Goldstein-Piekarski and Etkin2015) data. A key decision in designing a predictive model is the number of predictors (features) to be included relative to the number of participants (observations) available. While common recommendations in predictive modeling suggest limiting the number of predictors so that a minimum number of observations per predictor is available, a recent meta-analysis found no relationship between the feature-to-observation ratio and reported prediction accuracy (Sajjadian et al., Reference Sajjadian, Lam, Milev, Rotzinger, Frey, Soares and Uher2021). The present study contributes to informing the optimal number of predictors in three ways. First, we observed no improvement in predictive accuracy with tier 2 compared to tier 1: tier 1 models with 31–134 predictors achieved more accurate predictions than tier 2 models with 194–1152 predictors. Second, the addition of predictor modalities or time points within tier 1 did improve prediction accuracy. Third, feature selection was associated with improved accuracy in tier 2, but not in tier 1. Together, these findings suggest that when predicting treatment outcomes, a predictive model can make use of a number of features smaller than or similar to the number of participants. When the number of predictors exceeds the number of participants, predictive model development becomes less efficient. The strength of correlations among predictors and the degree of association between each predictor and the outcome in different datasets may modify these conclusions.
Response to antidepressants evolves over 6–8 weeks, but early changes in symptoms within the first 2 weeks of treatment are predictive of longer-term outcomes (Szegedi et al., Reference Szegedi, Jansen, Van Willigenburg, Van Der Meulen, Stassen and Thase2009). In the present study, the addition of measures obtained 2 weeks after treatment initiation led to a striking improvement in prediction accuracy, exceeding the benefits of multimodal measurement at baseline. While multimodal models retained a degree of advantage after the inclusion of week 2 data, clinical measures obtained at 2 weeks made the most substantial contribution to the improved prediction. This marked improvement is consistent with other reports. Machine learning studies that included week 2 data (Chekroud et al., Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016; Nie et al., Reference Nie, Vairavan, Narayan, Ye and Li2018) reported more accurate predictions than those that used baseline predictors only (Chekroud et al., Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016). A recent study reported high accuracy when data obtained 4 weeks after the onset of treatment were included as predictors (Athreya et al., Reference Athreya, Brückl, Binder, John Rush, Biernacka, Frye and Bobo2021). These results point to a greater value of early treatment data compared to extensive baseline assessments. However, the clinical value of prediction and its potential to change the treatment course diminish with time from baseline, as the trajectory of response becomes apparent and many clinicians adjust treatment accordingly (Browning et al., Reference Browning, Bilderbeck, Dias, Dourish, Kingslake, Deckert and Dawson2021).
The present study benefits from rich multimodal assessment and standardized protocols. However, the results should be interpreted in light of limitations related to sample size. The present sample, although larger than those of previously reported multimodal studies, is not large enough to optimally support the learning or validation of complex prediction models. Given the size of the available dataset, we opted for nested cross-validation, which uses all parts of the dataset as both training and testing sets in different validation loops while retaining a strict separation of training and testing sets. The optimal validation strategy includes an additional step of external validation (Chekroud et al., Reference Chekroud, Bondar, Delgadillo, Doherty, Wasil, Fokkema and Choi2021). Ideally, external validation should occur in a dataset that was not available at the time of model development. CAN-BIND is presently collecting a new dataset with treatment and assessment protocols closely matching those used in the present study. This will present the first opportunity to replicate the results in a sample that is strictly external to the model development and yet fully comparable.
In conclusion, a range of machine learning analyses suggests that combining clinical data with neuroimaging and molecular biomarkers improves the prediction of antidepressant treatment outcomes over single-modality measurement. Larger samples will be required to scale up the present work and realize the full potential of rich multimodal measurements for clinically meaningful response prediction.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291722002124
Financial support
CAN-BIND is an Integrated Discovery Program carried out in partnership with, and financial support from, the Ontario Brain Institute, an independent non-profit corporation, funded partially by the Ontario government. The opinions, results, and conclusions are those of the authors and no endorsement by the Ontario Brain Institute is intended or should be inferred. Funding and/or in-kind support is also provided by the investigators' universities and academic institutions. All study medications are independently purchased at wholesale market values. Dr Uher has been supported by the Canada Research Chairs Program (file number 950 – 233141) and the Canadian Institutes of Health Research (Funding reference numbers 148394, 165835, and 178222).
Conflict of interest
Sajjadian, Uher, Ho, Hassel, Frey, Farzan, Foster, Rotzinger, Turecki, and Taylor declare no conflicts of interest. Dr Milev has received consulting and speaking honoraria from AbbVie, Allergan, Eisai, Janssen, KYE, Lallemand, Lundbeck, Neonmind, Otsuka, and Sunovion, and research grants from CAN-BIND, CIHR, Janssen, Lallemand, and Lundbeck. Dr Blier has received honoraria for advisory board participation, giving lectures, and/or scientific expertise from Allergan, Bristol Myers Squibb, Janssen, Lundbeck, Otsuka, and Pierre Fabre Medicaments, and research grants from Allergan, CAN-BIND, Janssen, Lundbeck, and Otsuka. Dr Parikh serves as a consultant to Assurex (Myriad), Aifred, Janssen, Otsuka, Sage, Takeda, Mensante, and Neonmind; has research contracts with Aifred, Janssen, Sage, and Merck, along with the Ontario Brain Institute and CIHR; and holds equity in Mensante. Dr Müller has received consulting and speaking honoraria from Lundbeck and Genomind. Dr Soares has received consulting and speaking honoraria from Pfizer, Otsuka, Bayer, and Eisai, and research grants from CAN-BIND, CIHR, OBI, and SEAMO. Dr Lam has received consulting and/or speaking honoraria or research funds from Allergan, Asia-Pacific Economic Cooperation, BC Leading Edge Foundation, Canadian Institutes of Health Research, Canadian Network for Mood and Anxiety Treatments, Healthy Minds Canada, Janssen, Lundbeck, Lundbeck Institute, Michael Smith Foundation for Health Research, MITACS, Myriad Neuroscience, Ontario Brain Institute, Otsuka, Unity Health, Vancouver Coastal Health Research Institute, Viatris, and VGH-UBCH Foundation. Dr Strother is a senior scientific advisor and shareholder in ADMdx, Inc., which receives NIH funding, and during the period of this research he had research grants from Brain Canada, the Canada Foundation for Innovation (CFI), the Canadian Institutes of Health Research (CIHR), and the Ontario Brain Institute in Canada. Dr Kennedy has received research funding or honoraria from Abbvie, Boehringer Ingelheim, Brain Canada, Canadian Institutes of Health Research, Janssen, Lundbeck, Lundbeck Institute, Ontario Brain Institute, Ontario Research Foundation, Otsuka, Servier, and Sun, and holds stock in Field Trip Health. All other authors declare no conflicts of interest.
Ethical standards
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.