Introduction
Binge-eating disorder (BED) is a prevalent eating disorder associated strongly with obesity, elevated psychiatric and medical comorbidities, and psychosocial impairment (Udo & Grilo, 2018, 2019). Specific treatments for BED are known to reduce binge eating (Grilo, 2017; Hilbert et al., 2019), but many patients do not benefit sufficiently; the leading BED treatments result in binge-eating abstinence for only half of patients (Linardon, 2018), and most treatments fail to produce clinically meaningful weight loss (Hilbert et al., 2019).
Prediction of BED treatment outcomes has proven difficult. A number of patient variables have been evaluated as predictors, including – but not limited to – various eating-disorder psychopathology scales/measures as well as specific features such as overvaluation of shape/weight, self-control, depression and negative affect, and psychiatric comorbidity (e.g. Anderson et al., 2020; Grilo, Masheb, & Crosby, 2012a; Grilo, Thompson-Brenner, Shingleton, Thompson, & Franko, 2021; Lydecker & Grilo, in press; see online Supplementary Materials). Research has also tested treatment parameters (Thompson-Brenner et al., 2013) and processes such as rapid response to treatment (⩾65% reduction in binge-eating episodes within the first month of treatment; Grilo, White, Masheb, & Gueorguieva, 2015). To date, however, no reliable predictors of BED outcome (other than rapid response) have been identified (Linardon, Brennan, & de la Piedad Garcia, 2016; Vall & Wade, 2015). One potential reason for the limited ability to predict treatment outcomes – a problem across many fields, not just eating disorders – is reliance on traditional statistical techniques, such as linear/logistic regression.
Regression methods assess univariate and linear relations between limited numbers of predictors and outcomes, and this approach (ideally informed by theory) might be poorly matched to the complexity inherent in both psychopathology architecture and treatment mechanisms (Chekroud et al., 2021; King & Resick, 2014). In addition, traditional regression models are subject to overfitting, which can result in the identification of significant predictors that lack generalizability and clinical utility (Dwyer, Falkai, & Koutsouleris, 2018; Poldrack, Huckins, & Varoquaux, 2020).
Recently, machine learning (ML) approaches have been used in attempts to enhance the prediction of hard-to-predict outcomes. ML is an umbrella term for many types of analyses sharing several commonalities. First, ML analyses are inductive, meaning that they rely on patterns in the data to generate and optimize models, as compared to relying on clinicians/researchers specifying models a priori (Kuhn & Johnson, 2013). The algorithms include tuning parameters that identify the model that results in optimal prediction (Kuhn & Johnson, 2013). Second, ML enhances generalizability through cross-validation (i.e. a method to evaluate model effectiveness and generalizability), which can be done through simulations (e.g. bootstrapping), training models on one subset of data and then testing models on a separate subset of data, or a combination of the two (Kuhn & Johnson, 2013). Third, ML algorithms can accommodate large numbers of predictors even with sample sizes in the hundreds (Poldrack et al., 2020). Early applications of ML showed promise in predicting self-injurious behaviors (e.g. Huang, Ribeiro, & Franklin, 2020). Whereas traditional statistical models predicted self-injurious behaviors barely above chance (Franklin et al., 2017), initial ML studies reported excellent prediction (Fox et al., 2019; Huang et al., 2020; Walsh, Ribeiro, & Franklin, 2017).
ML has been applied to eating disorders in several studies (Espel-Huynh et al., 2021; Haynos et al., in press; Sadeh-Sharvit, Fitzsimmons-Craft, Taylor, & Yom-Tov, 2020); ML showed increased predictive accuracy for outcomes relative to traditional models in some (Haynos et al., in press) but not other (Espel-Huynh et al., 2021) studies.
Notably, several of the initial ML studies in clinical psychology used random forests paired with a form of resampling called optimism-corrected bootstrapping (Fox et al., 2019; Huang et al., 2020; Walsh et al., 2017). Although random forests are a robust ML method (see online Supplementary Materials), pairing random forests with optimism-corrected bootstrapping is known to result in inflated estimates of model performance (Tantithamthavorn, McIntosh, Hassan, & Matsumoto, 2017), which is one of the problems ML is intended to protect against. Emerging evidence indicates that when random forests are paired with other resampling methods, such as cross-validation or traditional bootstrapping, the prediction of suicide attempts is nearly identical to that produced by logistic regression (Jacobucci, Littlefield, Millner, Kleiman, & Steinley, 2021; Littlefield et al., 2021). Collectively, findings from the suicide and eating-disorder fields call into question whether ML may be a panacea to improve treatment outcome prediction.
This study compared the accuracy of three types of predictive models (one traditional and two ML) with three types of resampling methods in the prediction of BED treatment outcomes using data from a randomized controlled trial (RCT; Grilo et al., 2020). The primary goal was to determine whether ML was superior to traditional models for predicting treatment outcomes. The secondary goal was to compare predictive accuracy across different ML models paired with different forms of resampling, to serve as an example for future researchers considering using ML. A final goal was to identify variables that most strongly predict BED treatment outcomes. We acknowledge that this last goal diverges from ML's primary purpose/promise, which is increasing predictive accuracy, not identifying single predictors (Kuhn & Johnson, 2013; Murdoch, Singh, Kumbier, Abbasi-Asl, & Yu, 2019). However, the identification of individual-level predictors may provide necessary information to enhance treatment prescription and refine therapeutic targets. Thus, this final aim represents a bridge between ML models’ aim of accurate prediction and the potentially useful convention of identifying individual variables that predict treatment outcomes.
Method
Participants
Participants were 191 patients (age 18–65 years) with BED and comorbid obesity [body mass index (BMI) ⩾ 30] who participated in an RCT testing 6-month behavioral weight-loss (BWL) and stepped-care interventions (Grilo et al., 2020). A detailed description of the RCT has been published (Grilo et al., 2020); thus, only a brief description follows. Exclusionary criteria included: concurrent treatment for eating/weight, uncontrolled medical problems, severe psychiatric conditions (psychosis, bipolar disorder, current substance dependence), or current pregnancy/breastfeeding. The majority of participants were female (n = 136, 71.2%) and identified as White (n = 150, 78.5%); mean age was 48.4 years (s.d. = 9.5) and mean BMI was 39.0 (s.d. = 6.0) kg/m².
Procedure
Participants were randomized to either BWL (n = 39) or stepped care (n = 152) delivered following manualized protocols (Grilo et al., 2020). Diagnostic and clinical interviews were performed and height/weight was measured at baseline and post-treatment, and a battery of psychometrically established measures was completed throughout treatment (months 1, 2, and 4) and at post-treatment (6 months). Post-treatment assessments were obtained for 89.5% of participants. BWL and stepped care treatments did not differ significantly in binge-eating remission (74.4% v. 66.5%) or binge-eating frequency (1.7 binges/month v. 2.7 binges/month) at post-treatment. Treatments also did not significantly differ on eating-disorder psychopathology or percent weight loss at post-treatment (5.1% v. 5.8%).
Measures
Predictor variables (see online Supplementary Materials for detailed descriptions and rationale)
Predictor variables (Table 1) included demographics, baseline BMI and clinical characteristics, rapid response, and treatment condition (BWL v. stepped care).
SCID-I/P, Structured Clinical Interview for DSM Diagnosis; EDE, Eating Disorder Examination; TFEQ, Three Factor Eating Questionnaire; QEWP, Questionnaire on Eating and Weight Pattern; EOQ, Emotional Overeating Questionnaire; FTSI, Food Thought Suppression Inventory; YFAS, Yale Food Addiction Scale; DERS, Difficulties in Emotion Regulation Scale; BSCS, Self-Control Scale; WBIS, Weight Bias Internalization Scale; BDI, Beck Depression Inventory; RSES, Rosenberg Self-Esteem Scale; IIP, Inventory of Interpersonal Problems; RRS, Rumination Scale.
Note: Binge-eating percent reduction was log-transformed for analyses, though raw values are presented here for ease of interpretation.
Psychiatric comorbidities. Structured Clinical Interview for DSM Axis I Psychiatric Disorders (SCID-I/P; First, Spitzer, Gibbon, & Williams, 1997) assessed lifetime DSM-IV (APA, 1994) psychiatric disorders. Disorder classes considered in analyses were depressive, anxiety, posttraumatic stress, and drug and alcohol use disorders.
Eating-related psychopathology. Eating Disorder Examination interview (EDE; Fairburn, Cooper, & O'Connor, 2008), Three-Factor Eating Questionnaire (TFEQ; Anglé et al., 2009), Questionnaire on Eating/Weight Patterns–Revised (QEWP-R; Spitzer, Yanovski, & Marcus, 1994), Emotional Overeating Questionnaire (EOQ; Masheb & Grilo, 2006), and Food Thought Suppression Inventory (FTSI; Barnes, Fisak, & Tantleff-Dunn, 2010) assessed multiple domains of eating-related psychopathology including: binge-eating frequency (EDE), weight/shape overvaluation (EDE), weight/shape dissatisfaction (EDE), restraint (EDE, TFEQ), behavioral indicators for loss-of-control eating for DSM-IV BED diagnosis (QEWP-R), distress about binge eating (QEWP-R), weight cycling (QEWP-R), diet history (QEWP-R), emotional overeating (EOQ), and food thought suppression (FTSI).
Other psychological symptoms/features. Psychological symptoms/features relevant to BED (theoretically/empirically) listed below were included as predictors.
Food addiction. Number of food addiction criteria met and food addiction categorization (present v. absent) were assessed using the Yale Food Addiction Scale (Gearhardt, Corbin, & Brownell, 2009).
Emotion regulation difficulties. Emotion regulation was assessed using the Difficulties in Emotion Regulation Scale (Gratz & Roemer, 2004). This 36-item self-report scale includes six subscales (nonacceptance, difficulties meeting goals, impulse control problems, low awareness, limited strategies, and low clarity), which were included as separate predictors.
Self-control. Perceived self-control was assessed with the 13-item self-report Self-Control Scale–Brief (Tangney, Baumeister, & Boone, 2004).
Weight bias internalization. Weight bias internalization, or the degree to which individuals have internalized negative beliefs about overweight or obesity, was assessed with the 11-item self-report Weight Bias Internalization Scale (Durso & Latner, 2008).
Depression scores. Depressive symptoms experienced in the past week were assessed with the 21-item self-report Beck Depression Inventory (Beck & Steer, 1987).
Self-esteem. Self-esteem was assessed with the 10-item self-report Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1989).
Interpersonal problems. The extent to which people experience difficulties in their interpersonal functioning was assessed with the 32-item self-report Inventory of Interpersonal Problems (Barkham, Hardy, & Startup, 1996).
Cognitive rumination. Two types of cognitive rumination, reflecting and brooding, were assessed with the 10-item self-report Ruminative Responses Scale (Treynor, Gonzalez, & Nolen-Hoeksema, 2003).
Physical and mental health. The 36-item self-report Short Form Health Survey (Ware & Sherbourne, 1992) assessed physical and mental functioning and quality of life.
Treatment variables. Two treatment-related variables were included as predictors: treatment condition (BWL or stepped care) and exhibiting rapid response (⩾65% reduction in binge-eating frequency at the month 1 assessment).
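The rapid-response definition above can be illustrated with a minimal sketch. It assumes only the ⩾65% threshold stated in the text; the function name, argument names, and example frequencies are hypothetical and are not taken from the trial's analysis code.

```python
def is_rapid_response(baseline_binges: float, month1_binges: float,
                      threshold_pct: float = 65.0) -> bool:
    """Return True when binge-eating frequency fell by at least threshold_pct.

    Hypothetical helper: the 65% threshold comes from the text; the
    handling of a zero baseline is an assumption made for this sketch.
    """
    if baseline_binges <= 0:
        raise ValueError("baseline frequency must be positive")
    reduction_pct = (baseline_binges - month1_binges) / baseline_binges * 100.0
    return reduction_pct >= threshold_pct


# Illustrative values only: 20 episodes at baseline.
print(is_rapid_response(20, 6))   # 70% reduction
print(is_rapid_response(20, 10))  # 50% reduction
```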
Outcome variables
Outcome variables reflected both eating-disorder psychopathology and weight loss, and included complementary approaches of analyzing variables in categorical and continuous formats.
Binge-eating abstinence and binge-eating reduction. Binge-eating abstinence was defined as having zero binge-eating episodes during the final month of treatment (EDE). Percent reduction in binge-eating episodes from pre- to post-treatment was also calculated (EDE).
Eating-disorder psychopathology. Eating-disorder psychopathology was measured using the EDE Global score (Fairburn et al., 2008).
Percent weight loss and weight loss ⩾5%. Percent weight loss was calculated by subtracting post-treatment weight from pre-treatment weight, dividing by pre-treatment weight, and multiplying by 100. A dichotomous variable was also created based on whether weight loss was ⩾5%. Losing five percent of body weight is associated with physiological benefits (Magkos et al., 2016) and is frequently used in BED and obesity treatment studies.
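As a worked illustration of the weight-loss outcomes, the following sketch implements the calculation as described in the text. The function names and the example weights are hypothetical.

```python
def percent_weight_loss(pre_weight: float, post_weight: float) -> float:
    """(pre - post) / pre * 100, as described in the text.

    Positive values indicate weight loss; negative values indicate gain.
    """
    return (pre_weight - post_weight) / pre_weight * 100.0


def lost_at_least_5_percent(pre_weight: float, post_weight: float) -> bool:
    """Dichotomous outcome: weight loss of 5% or more."""
    return percent_weight_loss(pre_weight, post_weight) >= 5.0


# Illustrative values only (any consistent weight unit works).
print(round(percent_weight_loss(100.0, 94.0), 1))
print(lost_at_least_5_percent(100.0, 94.0))
print(lost_at_least_5_percent(100.0, 98.0))
```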
Data analytic plan
Analyses were completed using R computing software (R Core Team, 2020), using the following packages: mice (van Buuren & Groothuis-Oudshoorn, 2011), caret (Kuhn, 2008), glmnet (Friedman, Hastie, & Tibshirani, 2010), and randomForest (Liaw & Wiener, 2002). dplyr (Wickham, François, Henry, & Müller, 2021) was used to clean data and ggplot2 (Wickham, 2016) was used to create figures.
Missing data
We ran analyses with both the overall sample and the subsample who completed the post-treatment assessment (n = 171; see online Supplementary Table S1 for comparison of those who did v. did not complete the post-treatment assessment). The pattern of results was highly similar and we present analyses for the full intent-to-treat sample (N = 191). The overall proportion of missing data was 2.1%. The maximum proportion of missing data was 4% for any single predictor and 15% for any single outcome. After completing diagnostics to identify variables related to missingness, data were judged to be missing at random. Missing data for baseline characteristics were imputed using multivariate imputation by chained equations. Missing data for categorical outcomes were failure imputed (e.g. if data were missing to determine binge-eating abstinence, non-abstinence was coded). Missing data for continuous outcomes were replaced with estimated marginal means for each treatment group, obtained through multilevel modeling (Grilo et al., 2020).
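Failure imputation for a categorical outcome can be sketched in a few lines. This is an illustration only: it assumes missing values are represented as `None`, and the function name is hypothetical.

```python
from typing import List, Optional


def failure_impute(outcomes: List[Optional[bool]]) -> List[bool]:
    """Code a missing categorical outcome as the unfavourable result.

    Assumption for this sketch: True means the favourable outcome
    (e.g. abstinence) and None marks a missing assessment.
    """
    return [False if value is None else value for value in outcomes]


# Illustrative values only: the second participant has no assessment.
print(failure_impute([True, None, False]))
```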
Models
Three types of models were used to predict treatment outcomes: traditional logistic/linear regression, elastic net regression, and random forests. One benefit of logistic/linear regression is high interpretability. Weaknesses include potential to overfit and traditionally limited predictive power (King & Resick, 2014). Random forests, in contrast, have higher predictive performance but are less interpretable. Elastic nets have intermediate predictive performance and interpretability. Thus, the combination of these three types of models allows for comprehensive comparison among models across a spectrum of interpretability and prediction. We describe the models briefly below. In addition, the online Supplementary Materials provide further details, and we recommend reviewing Kuhn and Johnson (2013) for comprehensive descriptions.
Elastic net is a linear regression method that contains two regularization parameters, lambda and alpha, which are tuned to achieve the best model prediction. Random forests are a non-linear ensemble method composed of hundreds of individual trees. Each tree in the forest is estimated from a random subset of predictors, and within each tree, the data are recursively partitioned to find the specific values of the predictors that divide the data into subgroups with the smallest sum-of-squares error. This process of creating subgroups within subgroups is repeated until further splits do not result in improved model fit. Results are aggregated across trees to result in an overall metric of predictive performance.
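To make the roles of lambda and alpha concrete, the sketch below computes the elastic net penalty in the parameterization documented for the glmnet package, lambda × [alpha × L1 + (1 − alpha)/2 × L2]. It shows only the penalty term, not the fitting procedure; the function name and coefficient values are hypothetical.

```python
from typing import Sequence


def elastic_net_penalty(coefs: Sequence[float], lam: float, alpha: float) -> float:
    """Penalty added to the loss for a given coefficient vector.

    alpha = 1 gives the lasso (L1) penalty, alpha = 0 gives the ridge
    (L2) penalty, and lam controls the overall strength of shrinkage.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


# Illustrative coefficients only.
betas = [1.0, -2.0]
print(elastic_net_penalty(betas, lam=0.5, alpha=1.0))  # pure lasso
print(elastic_net_penalty(betas, lam=0.5, alpha=0.0))  # pure ridge
```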
After identifying the optimal models, three types of resampling were completed and compared: repeated 10-fold cross-validation, traditional bootstrapping, and optimism-corrected bootstrapping. Resampling is an umbrella term for methods to prevent overfitting a model to data. Repeated 10-fold cross-validation and traditional bootstrapping were used per recommendations (Kuhn & Johnson, 2013; Tantithamthavorn et al., 2017). Optimism-corrected bootstrapping was used given its use in initial ML in clinical psychological science (e.g. Fox et al., 2019; Huang et al., 2020; Walsh et al., 2017).
Repeated 10-fold cross-validation splits the dataset into 10 equal-sized folds. Nine folds are used to train the model on the data and one fold is used to test the model and evaluate its performance. This process is repeated 10 times, with a separate fold held out as the test set each time. Across these 10 repetitions, results are averaged to indicate overall model performance. Bootstrap resampling means that a bootstrap sample is drawn repeatedly (n = 100), with replacement, from the overall sample. Optimism-corrected bootstrap resampling is similar to bootstrap resampling, but in addition to the model being estimated from the bootstrap samples (n = 100), the model is also estimated on the original dataset. The difference between the model's performance in the bootstrap samples and on the original dataset produces a metric called optimism, which quantifies the level of overfitting of the model to the data. The optimism value is then subtracted from the overall metric of model performance. Optimism-corrected bootstrapping should theoretically produce more stringent results. However, optimism-corrected bootstrapping results in highly inflated estimates of model performance when paired with random forests (Jacobucci et al., 2021; Tantithamthavorn et al., 2017). Thus, we include this resampling method to demonstrate the differences that can arise from various combinations of ML models and resampling methods.
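The resampling schemes described above can be sketched as follows. This is a simplified illustration, not the caret implementation used in the study: the number of cross-validation repeats (`repeats`), the seeds, and the generic `fit` and `score` callables are assumptions, and `score` is taken to be a metric where higher is better.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def repeated_kfold_indices(n: int, k: int = 10, repeats: int = 5,
                           seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k near-equal folds
        for held_out in range(k):
            test = folds[held_out]
            train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield train, test


def optimism_corrected(data: Sequence, fit: Callable, score: Callable,
                       n_boot: int = 100, seed: int = 0) -> float:
    """Apparent performance minus the average bootstrap optimism."""
    rng = random.Random(seed)
    apparent = score(fit(data), data)  # model fit and scored on the same data
    optimism = 0.0
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # sample with replacement
        model = fit(boot)
        # Optimism: performance on the bootstrap sample minus performance
        # of that same model on the original dataset.
        optimism += score(model, boot) - score(model, data)
    return apparent - optimism / n_boot


splits = list(repeated_kfold_indices(191, k=10, repeats=2))
print(len(splits))  # number of train/test pairs
```

One reading of the inflation reported for random forests is visible in this structure: a flexible model can reach near-perfect apparent performance, and because each bootstrap sample overlaps heavily with the original data, the estimated optimism may be too small to correct it.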
For each model, the following pre-processing of predictors was performed: identification and removal of predictors with near-zero variance, identifying whether any variables may be assessing similar underlying constructs, transformations for non-normal distributions, and centering and scaling. Ethnicity and two BED behavioral indicators (‘eating large amounts of food when not physically hungry’ and ‘feeling guilty, depressed, or disgusted with oneself after an eating binge’) had little variance and were removed from models. Because race and education had little variance [e.g. the only races reported in addition to White were Asian (n = 2) and Black (n = 28)], these variables were dichotomized. The largest correlation among predictors was r = 0.80 (for self-esteem and depression scores), which was below the cutoff of r = 0.90; thus, no predictors were considered too highly correlated and all predictors were entered into the models. The binge-eating frequency at baseline and binge-eating reduction were log-transformed prior to imputation.
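Two of the pre-processing steps, centering and scaling and the correlation screen, can be sketched as below. The 0.90 cutoff is taken from the text; the function names, the use of the sample standard deviation, and the toy data are assumptions, and the sketch does not guard against zero-variance columns.

```python
import math
from typing import Dict, List, Sequence, Tuple


def center_and_scale(values: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


def highly_correlated_pairs(columns: Dict[str, Sequence[float]],
                            cutoff: float = 0.90) -> List[Tuple[str, str]]:
    """List predictor pairs whose absolute correlation exceeds the cutoff."""
    names = list(columns)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson_r(columns[a], columns[b])) > cutoff]


# Toy data only: "a" and "b" are nearly collinear, "c" is not.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 1, 3, 2]}
print(highly_correlated_pairs(toy))
```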
Across all model types (i.e. logistic/linear regression, elastic net, and random forest), performance for categorical outcomes was determined based on the area under the receiver operating characteristic curve (AUC) value. AUC of 0.50 indicates chance-level prediction. AUC values are categorized as follows: ⩽0.59 = extremely poor, 0.60–0.69 = poor, 0.70–0.79 = fair, 0.80–0.89 = good, and ⩾0.90 = excellent. Performance for continuous outcomes was determined based on root mean square error (RMSE) values and R². RMSE values are in the same units as the outcome variable and indicate the average difference between the observed and predicted values. Lower RMSE indicates greater accuracy. R² indicates the proportion of outcome variance explained by the model. Confidence intervals are presented for AUCs, RMSE, and R² to facilitate comparison across models and resampling methods. For each outcome, the one standard error rule was used to select the optimal model (i.e. the most parsimonious model whose error is within one standard error of that of the best-fitting model).
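The performance metrics and the one standard error rule can be illustrated with a short sketch. Several details are assumptions: AUC is computed by pairwise comparison of cases and non-cases, the label boundaries are placed at 0.60, 0.70, 0.80, and 0.90, R² uses the 1 − SSres/SStot definition (the software used in the study may compute it differently, for example as a squared correlation), and candidate models are represented as hypothetical (complexity, error, standard error) tuples.

```python
import math
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_label(value: float) -> str:
    """Map an AUC value onto the descriptive categories listed in the text."""
    for cutoff, label in [(0.90, "excellent"), (0.80, "good"),
                          (0.70, "fair"), (0.60, "poor")]:
        if value >= cutoff:
            return label
    return "extremely poor"


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def one_se_choice(candidates: List[Tuple[int, float, float]]) -> Tuple[int, float, float]:
    """Pick the least complex model whose error is within one SE of the best."""
    best = min(candidates, key=lambda c: c[1])
    limit = best[1] + best[2]
    eligible = [c for c in candidates if c[1] <= limit]
    return min(eligible, key=lambda c: c[0])


# Toy values only.
value = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(value, auc_label(value))
print(one_se_choice([(1, 0.50, 0.02), (2, 0.41, 0.03), (3, 0.40, 0.02)]))
```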
Predictor importance
ML analyses are computationally heavy and certain models have limited interpretability and vague clinical implications. To increase the clinical utility of results, we identified the most important predictors for each optimal model and for each resampling method using the caret package. For logistic, linear, and elastic net regressions, variable importance was calculated from the absolute value of each parameter's t statistic, such that higher values indicate more important variables. For random forests, variable importance was calculated based on how much model fit changed when a predictor's values were permuted, aggregated over all trees. Results across resampling methods were similar and we averaged predictor importance across resampling methods for each model type. Variable importance calculations do not identify the directionality of associations; thus, regression coefficients for logistic, linear, and elastic net regressions are shown in online Supplementary Tables S2–S6 (directionality is not modeled with random forests).
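The two importance measures can be sketched in simplified form. The permutation version below is model-agnostic and operates on a single dataset; it is not the per-tree, out-of-bag procedure of the randomForest package. All names, the `predict` and `error` callables, and the toy data are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def t_importance(coefficient: float, standard_error: float) -> float:
    """Importance for a regression term: the absolute t statistic."""
    return abs(coefficient / standard_error)


def permutation_importance(rows: List[List[float]], targets: Sequence[float],
                           predict: Callable[[List[float]], float],
                           error: Callable[[Sequence[float], Sequence[float]], float],
                           seed: int = 0) -> List[float]:
    """Increase in prediction error after shuffling each predictor in turn."""
    rng = random.Random(seed)
    baseline = error(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between predictor j and the outcome
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(error(targets, [predict(r) for r in permuted]) - baseline)
    return importances


def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Toy example: the outcome depends only on the first predictor.
rows = [[float(i), 0.0] for i in range(20)]
targets = [r[0] for r in rows]
scores = permutation_importance(rows, targets, lambda r: r[0], mse)
print(scores[0] > scores[1])
```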
Results
Table 1 summarizes demographic, baseline clinical characteristics, and treatment outcomes. Table 2 shows AUC values for categorical outcomes and RMSE and R² values for continuous outcomes.
AUC, area under the receiver operating characteristic curve; RMSE, root mean square error; 10 repeated CV, repeated 10-fold cross-validation.
Note: Higher AUC values indicate greater predictive accuracy; lower RMSE values and higher R² values indicate greater predictive accuracy.
Across resampling methods, logistic regressions had extremely poor performance for prediction of binge-eating abstinence and poor to fair prediction of ⩾5% weight loss. Relative to logistic regressions and across resampling methods, elastic nets had similarly poor prediction of binge-eating abstinence and ⩾5% weight loss. Random forests with repeated 10-fold cross-validation and bootstrapping had AUCs similar to those of logistic regression with the same resampling methods in the prediction of binge-eating abstinence but lower AUCs than logistic regression in the prediction of ⩾5% weight loss. Random forests with optimism-corrected bootstrapping had excellent predictive performance.
Across resampling methods, for the prediction of binge-eating reduction, eating-disorder psychopathology, and weight loss, RMSE values were overall significantly lower for elastic net and random forest than for linear regression [exceptions were (1) elastic net with optimism-corrected bootstrapping in predicting binge-eating reduction and (2) elastic net and random forest with 10-fold cross-validation in predicting weight loss]. Elastic nets and random forests with 10-fold cross-validation and bootstrapping had R² values similar to those of linear regression in predicting eating-disorder psychopathology but higher values in predicting binge-eating reduction. R² values for random forests with optimism-corrected bootstrapping were significantly higher across outcomes than those for all other models and resampling methods.
The 20 predictors with the highest average importance across resampling methods are shown in Figs 1–3. The strongest predictors of binge-eating abstinence (Fig. 1) were: low weight bias internalization (logistic, elastic net, and random forest), low lack of awareness of emotions (logistic and elastic net), physical health composite (random forest), and interpersonal problems (random forest). The strongest predictors of binge-eating reduction (Fig. 1) were: higher binge-eating baseline frequency (linear, elastic net, and random forest), higher weight/shape dissatisfaction (linear, elastic net, and random forest), lower reflecting cognitive rumination (linear and elastic net), and mental health composite (random forest).
The strongest predictors of eating-disorder psychopathology (Fig. 2) were: higher weight bias internalization (linear, elastic net, and random forest), higher self-esteem (linear, elastic net, and random forest), and higher nonacceptance of emotions (linear and elastic net).
The strongest predictors of ⩾5% weight loss (Fig. 3) were a rapid response to treatment (logistic and elastic net) and mental health composite (random forest). The strongest predictors of weight loss (Fig. 3) were: lower brooding cognitive rumination (linear and elastic net), rapid treatment response (linear and elastic net), higher emotional clarity (linear and elastic net), self-control (random forest), and physical health composite (random forest).
Discussion
This study examined how accurately combinations of traditional v. ML models and resampling methods predicted BED treatment outcomes. ML models showed little advantage over traditional models in predictive accuracy across BED outcomes (binge-eating, eating-disorder psychopathology, and weight loss). Although the different analytic models revealed some important predictors of key outcomes, their accuracy was modest. In cases where elastic net regressions and random forests showed greater predictive accuracy than traditional models, the overall prediction was still poor. ML using random forests with optimism-corrected bootstrapping yielded greater model prediction accuracy than all other models.
The superior and seemingly excellent prediction stemming from random forests with optimism-corrected bootstrapping, however, is likely inflated and may not reflect true model accuracy (Jacobucci et al., 2021; Tantithamthavorn et al., 2017). This inflation is a consequence of pairing random forests with optimism-corrected bootstrapping (Tantithamthavorn et al., 2017). We emphasize this to highlight a potential problem with the emerging ML literature in clinical psychology. Specifically, the initial ML applications predicting self-injurious behaviors, which suggested the high potential promise of ML for improving the prediction of relevant outcomes in clinical psychology, used random forests with optimism-corrected bootstrapping (Fox et al., 2019; Huang et al., 2020; Walsh et al., 2017). Thus, replication of those findings may prove difficult with unbiased resampling methods. Indeed, Jacobucci et al. (2021) found that random forests with non-inflated resampling methods (i.e. repeated 10-fold cross-validation and bootstrapping) in the prediction of suicide attempts yielded AUCs similar to those of traditional logistic regression.
While we recognize that our random forest with optimism-corrected bootstrapping results are inflated and did not plan on interpreting these results, we present them for two reasons. First, given the novelty of ML in clinical psychology/behavioral medicine, we wanted to provide an example of the marked differences that emerge when different resampling methods are used with different ML models. Second, these findings echo the findings and recommendation of Jacobucci et al. (2021) that when using random forests, repeated 10-fold cross-validation or bootstrapping should be used as the resampling method.
Our findings are consistent with emerging reports, within and outside of the eating disorders field, indicating that at least within the constraints of current psychological studies, non-inflated ML models perform comparably to traditional statistical methods (Buckman et al., in press; Espel-Huynh et al., 2021; Jacobucci et al., 2021; Littlefield et al., 2021; Zuromski et al., 2019). There are, however, some examples of ML outperforming traditional models (Haynos et al., in press; Kessler et al., 2015; Wang et al., 2021; for a review, see Chekroud et al., 2021). These examples offer points of consideration related to predictor selection and sample sizes that may be necessary for ML to achieve greater potential in clinical areas (Chekroud et al., 2021; Dwyer et al., 2018). Regarding predictors, although we included 42 predictors in analyses (i.e. including many more predictors than generally considered in traditional statistical approaches), we were limited to baseline RCT data. In contrast, for example, Kessler et al. (2015) used electronic health records to predict with high accuracy suicide deaths among psychiatrically hospitalized service members.
Kessler et al. (Reference Kessler, Warner, Ivany, Petukhova, Rose, Bromet and Ursano2015) considered a total of 421 variables of multiple types (e.g. self-report, demographics) as potential predictors, and the final models included 73 predictors. Thus, increasing the number and/or variety of predictors may prove useful for enhancing accuracy (Chekroud et al., Reference Chekroud, Bondar, Delgadillo, Doherty, Wasil, Fokkema and Choi2021). Regarding sample size, although this is the largest single-site RCT for BED (N = 191), the sample is relatively small for ML algorithms. While small sample sizes can be partly overcome through methodological decisions (e.g. using repeated cross-validation), they can be problematic when they limit external validation. External validation is critical to assess the utility and generalizability of a specific ML algorithm. Thus, collecting larger samples or combining multiple samples to train, test, and validate models is a possible next step (Wang, Reference Wang2021). Finally, ML may more accurately predict treatment outcomes with time-series predictors v. baseline data alone (e.g. Espel-Huynh et al., Reference Espel-Huynh, Zhang, Thomas, Boswell, Thompson-Brenner, Juarascio and Lowe2021; Wang et al., Reference Wang, Coppersmith, Kleiman, Bentley, Millner, Fortgang and Nock2021). Overall, we believe that larger sample sizes, greater numbers of and variability in predictors, and repeated observations are important future directions in predicting eating-disorder treatment outcomes.
Our predictor importance analyses yielded evidence that adds to the limited eating disorder literature (Linardon et al., Reference Linardon, de la Piedad Garcia and Brennan2017); most clearly, findings provide further empirical confirmation for the positive prognostic significance of rapid response to treatment for BED (Grilo, White, Gueorguieva, Wilson, & Masheb, Reference Grilo, White, Gueorguieva, Wilson and Masheb2013; Grilo, White, Wilson, Gueorguieva, & Masheb, Reference Grilo, White, Wilson, Gueorguieva and Masheb2012b; Masheb & Grilo, Reference Masheb and Grilo2007). Inspection of regression coefficients (online Supplementary Tables S5 and S6) indicates that patients with rapid response were more likely than those without rapid response to attain weight reduction ⩾5% and experience greater weight loss. These findings provide further confidence for using rapid response to treatment to inform stepped-care algorithms in BED treatment (Grilo et al., Reference Grilo, White, Wilson, Gueorguieva and Masheb2012b, Reference Grilo, White, Masheb, Ivezaj, Morgan and Gueorguieva2020).
In addition, weight bias internalization was consistently among the strongest predictors of both binge-eating abstinence and eating-disorder psychopathology. Inspection of regression coefficients (online Supplementary Tables S2 and S4) indicates that greater baseline weight bias internalization was prospectively associated with a lower likelihood of binge-eating abstinence and higher eating-disorder psychopathology at post-treatment. This is the first study to find that weight bias internalization may negatively impact BED treatment response; our findings (across multiple analyses) extend the cross-sectional associations of weight bias internalization with eating-disorder psychopathology in BED (Durso et al., Reference Durso, Latner, White, Masheb, Blomquist, Morgan and Grilo2012) and obesity (Pearl & Puhl, Reference Pearl and Puhl2018). Pending external validation, our finding that greater weight bias internalization was associated with poorer eating-disorder outcomes following behaviorally based weight-loss treatments for BED could inform future treatment research testing the potential utility of incorporating cognitive interventions that address such internalized beliefs into behaviorally based interventions.
Strengths of this study include rigorous assessment methods, including independent assessors who administered investigator-based interviews and objective weight measurements. The analyses encompassed nine models for each outcome to ensure that we identified any differences that occurred across various combinations of ML models and resampling methods. We also highlight that while we considered 42 predictors given the goals of optimizing prediction and comparing models, we additionally performed logistic and linear regressions using only 10 predictors selected conceptually/empirically from the literature (plus to reduce type-I errors). Those reduced models yielded predictive performance similar to that of the models with all 42 predictors (see online Supplementary Table S7).
Several limitations are noteworthy. First, while we briefly interpret the variable importance results, we did so cautiously because predictive accuracy was roughly comparable across models (Fisher, Rudin, & Dominici, Reference Fisher, Rudin and Dominici2019). Second, even though some significant predictors emerged, their importance is relative and the overall predictive accuracy of the models was limited. Third, the sample was primarily White, non-Hispanic, and well-educated, and findings may not generalize to people with other characteristics. Fourth, while our predictor variables were quite broad and multimodal, they were not exhaustive. Finally, given the small sample size, we were unable to externally validate algorithms.
In summary, ML models with unbiased resampling methods provided a minimal advantage over traditional models in predictive accuracy for BED treatment outcomes. Improving prediction accuracy for eating disorder treatment outcomes remains a priority.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291721004748
Financial support
This research was supported by the National Institutes of Health grant R01 DK49587 (CMG). Funders played no role in the content of this paper.
Conflict of interest
The authors report no conflicts of interest. Dr Grilo and Dr Ivezaj report several broader interests, which did not influence this research or paper. Dr Grilo reports honoraria for lectures and CME activities, and royalties from Guilford Press and Taylor & Francis Publishers for academic books. Dr Ivezaj reports honoraria for journal editorial roles and lectures.
Ethical standards
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.