Introduction
Bulimia nervosa (BN) and alcohol use disorder (AUD) are two psychiatric disorders that share a number of similarities (American Psychiatric Association, 2013). First, they can both be characterized by binge behavior where patients consume a large amount of food (i.e. binge eating [BE]) or alcohol (i.e. binge drinking [BD]) within a short period of time, while not being able to stop eating or drinking or not being able to control the amount they eat or drink (American Psychiatric Association, 2013). Second, they can both have a significant impact on health, with BN having a high mortality of 1.7 per 1000 person-years and with AUD being the largest risk factor for disease and disability among 15- to 49-year-olds (Arcelus, Mitchell, Wales, & Nielsen, 2011; Griswold et al., 2018). Third, both disorders can be difficult to treat, with up to 60% of patients who receive treatment not achieving remission (Fleury et al., 2016; Linardon & Wade, 2018). Taken together, the high impact and poor treatment outcomes highlight the need for more effective therapies for BN and AUD.
One promising new form of therapy is the just-in-time adaptive intervention (JITAI) (Nahum-Shani et al., 2018). In a JITAI, support is given ‘just-in-time’ or when a patient needs it the most. For example, a patient could report their emotions, behaviors, and context with a smartphone application, and an algorithm could evaluate the risk of BE or BD based on this information, after which an intervention could be sent out when this momentary risk is elevated. The support can also be adaptive, meaning that it can be tailored to a patient's in-the-moment needs. For instance, a patient could receive a text message alert when the estimated risk for BE or BD is moderate, but a phone call when the estimated risk is high. Because of its potential benefits, several researchers have developed and implemented JITAIs in recent years, but their results have been mixed (Carpenter, Menictas, Nahum-Shani, Wetter, & Murphy, 2020; Hardeman, Houghton, Lane, Jones, & Naughton, 2019; Wang & Miller, 2020). One reason for this could be the limited adaptive nature of these JITAIs. Namely, the research designs of these JITAIs were primarily based on previous literature (e.g. which information should be gathered from participants and how it should be evaluated), which means that the decision to send out an intervention was static and based on findings from previous studies. However, if such decisions were to be based on the actual data provided by participants, a JITAI would be more adaptive and perhaps more effective.
This goal could be realized with the help of machine learning (ML) where statistical models and algorithms learn from data without explicit instruction (Shatte, Hutchinson, & Teague, 2019). An ML model could learn when individuals are at risk of BE or BD in daily life and then subsequently predict this risk when presented with new data in a JITAI. An ML model could also determine which of the many possibly assessed variables (e.g. momentary mood, location, social context, and time) are predictive of BE or BD and which ones are not. Several ML algorithms can examine a large number of variables and select only those that are most predictive of an outcome (Cai, Luo, Wang, & Yang, 2018). This kind of information could then provide targets for the interventions in a JITAI. However, researchers are confronted with specific challenges when using ML to predict daily life behavior. Namely, they need to decide whether they want to build person-specific or group-level (pooled) prediction models. On the one hand, person-specific models are trained with the data of an individual patient (Soyster, Ashlock, & Fisher, 2021). This type of model can be built more easily and can result in more person-specific information. On the other hand, pooled prediction models are trained with the data of multiple patients (Soyster et al., 2021). This model type is more difficult to build as more patients need to be included, but can result in a better model performance, particularly if the factors driving a momentary behavior are similar across the study participants.
In recent years, several studies have built person-specific and pooled models to predict BE, alcohol use, or BD in daily life (Arend et al., 2023; Bae, Chung, Ferreira, Dey, & Suffoletto, 2018; Bae et al., 2017; Levinson, Trombley, Brosof, Williams, & Hunt, 2022; Soyster et al., 2021; Walters et al., 2021). When it comes to disordered eating behavior, one study demonstrates that pooled models can predict future occurrences of BE, restriction, and purging in patients with an eating disorder (Levinson et al., 2022). By utilizing predictors related to disordered eating cognitions and behaviors, along with affect, the study shows that dietary restriction, weighing, and anxiousness are important predictors of subsequent BE. Additionally, another study indicates that person-specific models can perform well in predicting BE in patients with a binge-type eating disorder (Arend et al., 2023). This study, using a set of variables selected from feedback from clinicians and patients, reports that hunger and craving are the most common predictors of BE. When it comes to alcohol use, studies have demonstrated that pooled models can successfully use phone sensor data to predict non-heavy alcohol use as well as BD in individuals without AUD (Bae et al., 2017, 2018). In these studies, time of day, number of activities, and phone usage emerge as the most informative predictors.
Furthermore, another study finds that both person-specific and pooled models can predict alcohol use in individuals without AUD utilizing predictors related to affect, craving, and recent alcohol use (Soyster et al., 2021). More specifically, it finds that craving and feeling pressured to drink are the most common predictors across individuals. Employing a similar set of variables, with additional predictors related to social setting and location, a different study also concludes that pooled models can predict alcohol use in individuals without AUD (Walters et al., 2021). Here, craving, the availability of alcohol, and feeling that alcohol will improve mood are the most important predictors.
However, these studies have significant limitations. First, their generalizability to a broader clinical population is often limited. This is because only a few studies include a clinical sample, and those that do include a small number of participants for whom they only have a limited number of observations. This is problematic as a small sample size can have serious methodological implications in ML (Way, Sahiner, Hadjiiski, & Chan, 2010). Indeed, most studies did not hold out data when training or tuning their ML models and therefore were not able to evaluate model performance on unseen data. This implies that their models run a serious risk of overfitting and might not generalize. Second, the majority of variables in these studies assess emotions or behaviors and do not look at the social or situational context of a patient. However, previous studies show the importance of context in BE, alcohol use, and BD (Allison & Timmerman, 2007; Clapp, Shillington, & Segars, 2009). Third, to our knowledge only one study evaluated both person-specific and pooled prediction models and did so only for alcohol use, leaving unanswered the question of which model type performs best for BE and BD (Soyster et al., 2021).
Because of these limitations, it is still unclear to what extent BE, alcohol use, and BD can be predicted in the daily lives of patients with BN and/or AUD and which variables are important predictors. This study aims to fill that gap by collecting a large amount of data from a clinical sample on a variety of variables and by building methodologically sound prediction models. We followed patients with BN and/or AUD over a period of 12 months during which we used the experience sampling method (ESM) to repeatedly assess the patients' emotions, behaviors, and contexts in daily life. We then used these data to fulfill the following two objectives. First, to build and evaluate person-specific and pooled prediction models for BE, alcohol use, and BD in daily life. Second, to identify the most important predictors of these behaviors.
Methods
Study sample
The participants were drawn from a larger ESM study that followed patients with BN and/or AUD as well as control volunteers without these diagnoses in daily life. In the current study, all the data of the patients with BN (n = 50), with AUD (n = 51) or with BN and AUD (n = 19) were used, after the elimination of one patient with BN and two patients with AUD due to insufficient data (i.e. not answering two consecutive assessments, causing the lagging procedure described below to fail). Recruitment happened in Flanders, Belgium through residential and ambulatory care centers, patient groups, universities, social media, and by handing out flyers on the street. Inclusion ran from September 2019 to February 2022. The inclusion criteria were: (1) being assigned female at birth; (2) understanding the Dutch language; (3) being of age ⩾18 years; and (4) being of BMI ⩾18.5 kg/m². It was decided not to include individuals assigned male at birth as the prevalence of BN is significantly lower in this population (Galmiche, Déchelotte, Lambert, & Tavolacci, 2019). Additional inclusion criteria for patients were: (5) meeting the criteria for BN or AUD of the Diagnostic and Statistical Manual of Mental Disorders (American Psychiatric Association, 2013); (6) meeting those diagnostic criteria for a duration of ⩽5 years. This maximum was set as the importance of certain predictors of BE, alcohol use, and BD is thought to change over the course of AUD and BN (Boness, Watts, Moeller, & Sher, 2021; Pearson, Wonderlich, & Smith, 2015). For example, it is thought that BE episodes start as a rash action during moments of high negative affect, and are reinforced by a subsequent decrease in negative affect (Pearson et al., 2015).
However, it is then thought that after repeated cycles of emotional distress, urge, and BE, the behavior becomes habitual, where it persists, even when it is no longer reinforced by a decrease in negative affect (Pearson et al., 2015). Furthermore, it is thought that the roles of negative affect and positive affect change over the course of AUD, whereby changes in positive affect are more predictive of alcohol use in the beginning of AUD, while the role of negative affect increases over the course of AUD (Koob & Le Moal, 1997). Participants with AUD also needed to display a pattern of repetitive BD according to the criteria of the National Institute on Alcohol Abuse and Alcoholism (i.e. drinking 4 units of alcohol within 2 h for women) (National Institute on Alcohol Abuse & Alcoholism [NIAAA], 2022). Participants were excluded for the following reasons: (1) major medical pathology (e.g. severe liver or kidney disease, uncontrolled diabetes, cancer or untreated hyper- or hypothyroidism); (2) chronic use of sedatives (i.e. more than three times in the past three months); (3) pregnancy; (4) presence of major psychiatric pathology (i.e. schizophrenia, autism spectrum disorder, bipolar disorder, substance use disorder other than AUD). All participants gave their written consent, and the study was approved by the ethical committee of the UZ/KU Leuven.
Study design
Potential participants were initially screened via telephone or email, after which they attended an in-person assessment. Here, a psychiatry resident confirmed an individual's eligibility to participate. The participants had their weight and height measured with a calibrated scale and stadiometer and completed clinical interviews and questionnaires. All participants underwent a briefing on the ESM questions and practiced the use of the mobile application. Then, the participants entered the ESM protocol on the first Thursday after the in-person assessment. An overview of the protocol can be seen in Fig. 1. It consisted of a repeated measurement design where seven bursts of data collection were spread out over a 12-month period. The bursts had a duration of three weeks and were separated by intervals of five weeks. During the bursts, data were only collected on Thursday, Friday, and Saturday to limit the protocol's impact on the participants. These specific days were selected to consecutively gather data on both week and weekend days. Then, participants were required to respond to eight signals on each day of data collection which were sent out on a signal-contingent (i.e. semi-random) basis. The participants received 20 eurocent per answered assessment. The ESM data were initially collected with the app MobileQ (Meers, Dejonckheere, Kalokerinos, Rummens, & Kuppens, 2020). When the development of the app was discontinued in October 2020, data collection continued using m-Path (Mestdagh et al., 2022). More information about the apps can be found in the supplement (online Supplementary eMethods 1 and eTable 1).
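The burst design above implies a nominal schedule that can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that a participant completes all seven bursts and that the five-week intervals occur only between bursts (six intervals in total). The constant and function names are hypothetical and do not come from the study's scripts.

```python
# Nominal ESM schedule derived from the figures stated in the protocol description.
BURSTS = 7            # bursts of data collection
WEEKS_PER_BURST = 3   # each burst lasts three weeks
GAP_WEEKS = 5         # interval between two consecutive bursts
DAYS_PER_WEEK = 3     # Thursday, Friday, and Saturday
SIGNALS_PER_DAY = 8   # signals on each day of data collection
PAYMENT_CENTS = 20    # eurocent per answered assessment


def protocol_weeks():
    """Total length of the protocol, assuming gaps only between bursts."""
    return BURSTS * WEEKS_PER_BURST + (BURSTS - 1) * GAP_WEEKS


def scheduled_signals():
    """Number of signals a participant would receive over the full protocol."""
    return BURSTS * WEEKS_PER_BURST * DAYS_PER_WEEK * SIGNALS_PER_DAY


def max_payment_cents():
    """Payment if every scheduled signal were answered."""
    return scheduled_signals() * PAYMENT_CENTS


print(protocol_weeks(), scheduled_signals(), max_payment_cents())
```

Under these assumptions the protocol spans 51 weeks, which is consistent with the stated 12-month period. The number of signals actually scheduled for a given participant can be lower, for example after dropout.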
Measures
Baseline measures
The Structured Clinical Interview for DSM-5 (SCID-5-S) was used to confirm the diagnosis of BN or AUD and to screen for other psychiatric disorders (American Psychiatric Association [APA], 2017). BN and AUD severity were assessed using the Eating Disorder Examination Questionnaire (EDE-Q) and the Alcohol Use Disorders Identification Test (AUDIT) (Fairburn & Beglin, 1994; Saunders, Aasland, Babor, De La Fuente, & Grant, 1993).
ESM measures
At each assessment, the participants received questions evaluating different emotions, behaviors, and contexts. The exact number of items varied at each assessment as the presentation of some questions was conditional on a participant's response to a previous question. The full list of questions can be seen in Table 1. More information on the reliability and/or validity of the items can be found in the supplement (online Supplementary eMethods 2). Importantly, participants needed to indicate if they had eaten since the previous assessment. If so, they had to identify the eating event as undereating, normal eating, or overeating. Then, participants were asked if they experienced a loss of control over their eating behavior. As in previous studies, BE was defined as an episode of overeating with loss of control (Ambwani, Roche, Minnick, & Pincus, 2015). The participants were trained to interpret overeating as eating an amount of food that is definitely larger than what most people would eat under similar circumstances. Furthermore, they were instructed to interpret loss of control as wanting to stop eating, but not being able to. Similarly, participants needed to indicate whether they drank alcohol since the previous assessment and if so, how many units they drank and if they experienced a loss of control over their drinking behavior. The participants were instructed on the definition of an alcohol unit. Here, BD was defined as having consumed at least four units of alcohol since the previous assessment while alcohol use was conceptualized as having consumed at least one unit since the previous assessment.
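The outcome definitions above can be written as simple rules on a single assessment. The following is a minimal sketch, not the study's scoring code: the `Assessment` class and its field names are hypothetical, and only the thresholds (overeating with loss of control, at least one unit, at least four units) are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    """One answered ESM assessment (hypothetical field names)."""
    ate: bool                        # ate since the previous assessment
    eating_type: Optional[str]       # "undereating", "normal", "overeating", or None
    loss_of_control_eating: bool     # loss of control over eating
    alcohol_units: float             # units drunk since the previous assessment


def is_binge_eating(a: Assessment) -> bool:
    # BE: an episode of overeating accompanied by loss of control.
    return a.ate and a.eating_type == "overeating" and a.loss_of_control_eating


def is_alcohol_use(a: Assessment) -> bool:
    # Alcohol use: at least one unit since the previous assessment.
    return a.alcohol_units >= 1


def is_binge_drinking(a: Assessment) -> bool:
    # BD: at least four units since the previous assessment.
    return a.alcohol_units >= 4


example = Assessment(ate=True, eating_type="overeating",
                     loss_of_control_eating=True, alcohol_units=4)
print(is_binge_eating(example), is_alcohol_use(example), is_binge_drinking(example))
```

With these rules, every moment of BD also counts as a moment of alcohol use, while overeating without loss of control does not count as BE.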
Data analysis
Data preparation
A figure providing an overview of the data preparation procedure can be found in the supplement (online Supplementary eFigure 1). Only assessments answered within 240 min of the prompt were used in the analyses. This window was chosen to include assessments which were answered later at night, where the likelihood that patients binge eat, drink alcohol, or binge drink could be higher. First, the scoring of the conditional ESM variables was corrected. A conditional ESM variable depended on a previous ESM answer (e.g. the stressfulness of an event was only asked about if a participant answered ‘yes’ to experiencing a stressful event). The conditional ESM variables therefore included missing values, when the condition was not met, which could be filled in with zeroes (i.e. indicating that past events were not stressful at all). Second, temporal variables were created that represented assessment number (i.e. 1–8), weekday (i.e. Thursday, Friday, and Saturday), time since starting participation in the study (linear, quadratic, and cubic), and cycles with periods of 12 h and 24 h (Flury & Levri, 1999). These temporal variables have been used in previous studies predicting BE and alcohol use (Arend et al., 2023; Soyster et al., 2021). Third, to account for the varying levels of COVID-19 prevention measures throughout the study, a COVID-19 stringency variable was created based on the Oxford COVID-19 Government Response Tracker (Kira et al., 2022). More information on the temporal and COVID-19 stringency variables can be found in the supplement (online Supplementary eMethods 3). This brought the total possible number of predictors to 110. However, a predictor could not be entered in the prediction models when it had a variance of zero (i.e. it always had the same response value). More specifically, for the person-specific models, the median number of predictors for BE was 97 (Q1–Q3: 90–100), while the median number of predictors for alcohol use was 96 (Q1–Q3: 90–99), and the median number of predictors for BD was 98 (Q1–Q3: 91–99). The pooled models used all predictors. Fourth, all ESM variables except for the outcome variables were lagged by one assessment, with time between assessments measuring 102 min on average. The variables could be lagged across days, but not across weeks. The temporal variables and the COVID-19 stringency variable were not lagged and therefore remained aligned in time with the outcome variables. Fifth, observations with missing values were removed from the data.
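Two of the preparation steps lend themselves to a short illustration: the cyclic temporal variables and the removal of zero-variance predictors. The exact construction of the cycles is described in the supplement; the sketch below assumes the common encoding of a cycle as a sine and cosine pair, which is an assumption rather than a description of the study's code. The function names are hypothetical.

```python
import math


def cyclic_features(hours_since_midnight, period_hours):
    """Encode a time of day as a (sine, cosine) pair for a given period.

    Assumption: a cycle is represented by one sine and one cosine term.
    """
    angle = 2 * math.pi * hours_since_midnight / period_hours
    return math.sin(angle), math.cos(angle)


def drop_zero_variance(columns):
    """Keep only predictors that take more than one distinct value.

    columns: dict mapping a predictor name to its list of values.
    """
    return {name: values for name, values in columns.items()
            if len(set(values)) > 1}


sin24, cos24 = cyclic_features(18.5, 24)   # 18:30 on a 24 h cycle
sin12, cos12 = cyclic_features(18.5, 12)   # 18:30 on a 12 h cycle
kept = drop_zero_variance({"at_home": [1, 1, 1], "stress": [0, 3, 1]})
print(sorted(kept))
```

A predictor such as the hypothetical `at_home` above, which never changes for a given participant, carries no information for that participant's model and is dropped, which is why the number of usable predictors differs between participants.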
This resulted in a dataset which could be used to predict BE, BD, and alcohol use at a certain point in time in the future, based on the temporal variables and the COVID-19 stringency variable at that timepoint as well as the ESM variables at a previous timepoint. This was done to emulate how a machine learning-based JITAI would be used to treat a patient in daily life. For example, a model could predict whether a patient who reported to experience more stress is more likely to binge eat after two hours. As lagging across days was permitted, BE, BD and alcohol use episodes which happened at night but were reported in the morning could also be predicted.
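The lagging step can be sketched as pairing each assessment with the one immediately before it. This is a simplified illustration under stated assumptions: each row is a dictionary for one scheduled assessment of a single participant in chronological order, and the keys `week` and `answered` are hypothetical. The temporal and COVID-19 variables, which were not lagged, are left out for brevity.

```python
def lag_by_one(rows, predictors, outcome):
    """Pair each outcome with the predictors of the previous assessment.

    rows: scheduled assessments of ONE participant, in chronological order.
    Lagging across days is allowed, lagging across weeks is not.
    """
    lagged = []
    for prev, cur in zip(rows, rows[1:]):
        if prev["week"] != cur["week"]:
            continue  # never carry predictors over from a previous week
        if not (prev["answered"] and cur["answered"]):
            continue  # a missed assessment would leave missing values
        example = {name + "_lag1": prev[name] for name in predictors}
        example[outcome] = cur[outcome]
        lagged.append(example)
    return lagged


rows = [
    {"week": 1, "day": "Thu", "answered": True, "stress": 5, "be": 0},
    {"week": 1, "day": "Fri", "answered": True, "stress": 2, "be": 1},
    {"week": 1, "day": "Fri", "answered": False, "stress": None, "be": None},
    {"week": 2, "day": "Thu", "answered": True, "stress": 3, "be": 0},
]
print(lag_by_one(rows, predictors=["stress"], outcome="be"))
```

In this toy example only one usable observation remains: the stress reported on Thursday evening predicts the BE episode reported on Friday morning, which mirrors the lagging across days described above.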
Model training and evaluation
Person-specific as well as pooled prediction models were built for BE, alcohol use, and BD. Based on the definitions outlined under ESM measures, the moments of BD were also considered moments of alcohol use. This approach was taken because a JITAI would most likely focus on either alcohol use (i.e. drinking any alcohol) or BD (i.e. drinking at least 4 units in 2 h for women), rather than non-heavy alcohol use (i.e. alcohol use that is not BD). For BE, the data of the patients with BN and the patients with BN and AUD were used (n = 69). Similarly, for alcohol use and BD, the data of the patients with AUD and the patients with BN and AUD were used (n = 70). This meant that only the data of patients who displayed the behavior were included in the analyses for a specific outcome. The models were trained and evaluated with the ensr, glmnet, pROC and caret packages in R, version 4.1.1 (DeWitt, 2019; Friedman, Hastie, & Tibshirani, 2010; Kuhn, 2021; Robin et al., 2011). The scripts and data for the analyses can be found at https://rdr.kuleuven.be/dataset.xhtml?persistentId=doi:10.48804/OBLDWE. A brief description of the elastic net wrappers which were developed for this paper can be found in the supplement (online Supplementary eMethods 4). More detailed information can be found at https://github.com/nicolasleenaerts/NLML.
Person-specific
The models were fitted and evaluated on the data of each participant with nested k-fold cross-validation. A visual representation of this method can be seen in Fig. 2. More information on nested cross-validation can be found in the supplement (online Supplementary eMethods 5). For the outer loop, a stratified 5-fold cross-validation was used. Due to the stratification, the distribution of positive events was similar across folds.
The allocation of observations to specific folds was random, meaning that the observations within each fold were not temporally contiguous. However, due to the lagging procedure described above, each instance of the dependent variables was only ever predicted by the independent variables for the immediately previous observation. The continuous variables of the training folds were standardized. This can increase performance of regression-based models and simplify comparisons between model estimates (Shahriyari, 2019). Additionally, the continuous variables from the test fold were also standardized, but with the mean and standard deviation from the training folds. This separate procedure transforms the testing data to the same scale as the training data but prevents any information from leaking.
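The two ideas in this paragraph, stratified fold allocation and standardization with training statistics only, can be sketched as follows. This is an illustrative Python sketch and not the R code used in the study. The function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=0):
    """Randomly assign observation indices to k folds, class by class,
    so that positive events are spread evenly over the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


def standardize_with_training_stats(train, test):
    """Scale both sets with the mean and SD of the training values only,
    so that no information from the test fold leaks into training."""
    mu = mean(train)
    sd = stdev(train)
    scale = sd if sd > 0 else 1.0  # guard against constant predictors
    return ([(x - mu) / scale for x in train],
            [(x - mu) / scale for x in test])


labels = [1] * 5 + [0] * 20          # 5 events among 25 observations
folds = stratified_folds(labels)
print([sum(labels[i] for i in fold) for fold in folds])
print(standardize_with_training_stats([1.0, 2.0, 3.0], [4.0]))
```

In the toy example each of the five folds receives exactly one positive event. This also illustrates why a participant needs a minimum number of events before a stratified 5-fold split is possible.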
For the inner loop, an elastic net regularized regression model was fitted to the training folds of the outer loop (Zou & Hastie, 2005). Though other machine learning techniques exist, elastic net was specifically chosen as the machine learning algorithm as it is especially suited for the objectives of the current manuscript. This is because it is a performant machine learning technique that reduces overfitting, can work with high-dimensional data (i.e. more predictors than observations), handles correlated variables, but also provides information on the strength and nature of the relation between a predictor and an outcome. It combines two regularization methods, ridge regression which shrinks model estimates and LASSO regression which removes variables that do not contribute to the model. The balance between ridge and LASSO regression is expressed by a parameter alpha which varies from 0 (exclusively ridge regression) to 1 (exclusively LASSO). The strength of the regularization is defined by a parameter lambda with higher values leading to more shrinkage of the coefficients. The optimal alpha and lambda were selected with a grid search of 10 alphas and 100 lambdas (i.e. the default settings of ensr). For each possible combination, a cross-validation error was calculated with 10-fold cross-validation. The combination with the lowest cross-validation error was then used to fit the definitive elastic net model on the training folds of the outer loop. This elastic net model was then used to predict BE, BD or alcohol use in the data of the test fold of the outer loop. The predictions were then compared with the actual BE, BD, and alcohol use events in the test fold to calculate the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, positive predictive value (PPV) and negative predictive value (NPV).
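For reference, the roles of alpha and lambda described above can be written out explicitly. The expression below follows the parameterization documented for the glmnet package. That the binary outcomes were modeled with a binomial (logistic) likelihood is an assumption here, and the symbols N, x_i, y_i, and ℓ are introduced only for illustration.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right)
  \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
  \;+\; \alpha\,\lVert\beta\rVert_1\right]
```

Here ℓ is the log-likelihood contribution of observation i. Setting α = 0 leaves only the squared (ridge) penalty, setting α = 1 leaves only the absolute-value (LASSO) penalty, and a larger λ shrinks the coefficients more strongly. A grid of 10 alphas and 100 lambdas amounts to 10 × 100 = 1000 candidate combinations per training set.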
Due to the nested cross-validation, a participant needed to have a sufficient number of BE, BD or alcohol use events (n > 4) to be included in the analysis.
Pooled
The pooled models were also fitted and evaluated with nested k-fold cross-validation. However, they were trained on the pooled training data of all the participants and tested on the individual test data of each participant. More specifically, in the outer loop, the training folds were a combination of the standardized training folds of the participants. Due to the standardization at a participant-level, the values of the continuous variables represented a deviation from the within-person means. As no multilevel variant of elastic net regularized regression exists, this accounted in part for the within-person nesting of the data (Soyster et al., 2021). In the inner loop, the optimal alpha and lambda were again determined with 10-fold cross-validation and used to fit the final elastic net model. This model was then applied to the test fold of every individual participant to evaluate the AUC, sensitivity, specificity, accuracy, PPV, and NPV of the pooled model for each participant. As the training folds were pooled, a participant only needed to have one BE, BD, or alcohol use event in the test fold to be included in the analyses.
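The performance metrics named above all follow from comparing predictions with observed events in a test fold. The sketch below shows one way to compute them in plain Python. It is not the pROC or caret implementation used in the study, the function names are hypothetical, and it assumes that the predicted probabilities have already been converted to binary predictions (the threshold is not specified here).

```python
def safe_div(numerator, denominator):
    """Return None when a metric is undefined (e.g. no events in the fold)."""
    return numerator / denominator if denominator else None


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, PPV, and NPV from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, len(y_true)),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
    }


def auc(y_true, scores):
    """AUC as the share of (event, non-event) pairs in which the event
    receives the higher predicted score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))


y_true = [1, 0, 0, 1, 0]
print(classification_metrics(y_true, [1, 0, 1, 1, 0]))
print(auc(y_true, [0.9, 0.2, 0.8, 0.7, 0.1]))
```

Sensitivity and the AUC are undefined when a test fold contains no events, which is one way to see why a participant needed at least one event in the test fold to be evaluated.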
Model comparison
For each participant and each outcome, the AUC of the pooled models was subtracted from the AUC of the person-specific models. Then, to explore why some participants had a better performance with the person-specific model than with the pooled model, this difference in AUC was compared between the different analysis groups with Mann–Whitney U tests and correlated to age, BMI, EDE-Q scores, AUDIT scores, BE frequency, and BD frequency with Spearman correlations. Non-parametric tests were performed due to the non-normal distribution of the AUCs.
Validation analyses
First, to assess whether patients were more likely to binge eat, drink alcohol, or binge drink on certain days of the week (e.g. Thursday, Friday, and Saturday), generalized linear mixed models were constructed for each outcome of interest (i.e. binge eating, alcohol use, and binge drinking) with the aggregated data of all participants, with day of the week as a main effect and with a random intercept for the participants. Second, as there was an imbalance in the outcomes whereby patients typically did not display BE, alcohol use or BD, several analyses were performed to assess the validity of the results. To begin, the results of the pooled and person-specific models were compared to those of models that always predict the majority class (i.e. not BE, not drinking alcohol, not BD). Additionally, the results of the person-specific and pooled models were compared to those of models where the imbalance in the outcomes was corrected with ROSE or SMOTE. Third, as some participants changed apps over the course of the study, the impact of app type on model performance was explored. More specifically, person-specific and pooled models were constructed with an additional variable indicating whether MobileQ or m-Path was used, after which their performance was compared to that of the original models. Additionally, the AUC of the original person-specific and pooled models was correlated to the percentage of observations that a participant reported through m-Path. Spearman correlations were performed due to the non-normal distribution of the AUCs. More information on these validation analyses can be found in the supplement (online Supplementary eResults 3–5).
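The reason for comparing against a majority-class model can be made concrete with a small example. The sketch below uses made-up data with 4 events in 100 observations, a proportion chosen only for illustration. The function name is hypothetical.

```python
def majority_baseline(y_true):
    """Evaluate a model that always predicts the most frequent class."""
    n = len(y_true)
    positives = sum(y_true)
    majority = 1 if positives * 2 > n else 0
    y_pred = [majority] * n
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return {
        "majority_class": majority,
        "accuracy": correct / n,
        # Undefined when there are no events at all.
        "sensitivity": true_positives / positives if positives else None,
    }


y_true = [1] * 4 + [0] * 96   # illustrative data: 4 events in 100 observations
print(majority_baseline(y_true))
```

Such a baseline reaches 96% accuracy in this example while detecting none of the events. With rare outcomes, accuracy alone therefore says little about a model, which is why metrics such as sensitivity, PPV, and AUC are reported alongside it.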
Model predictors
For each outcome and each model type, the top 10% of predictors were identified. This was based on the raw estimates for the pooled model (as only one estimate per variable existed) and the mean estimates over all participants for the person-specific models.
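The selection of the top predictors can be sketched as a ranking of coefficients. Several details below are assumptions made for illustration and are not stated in the text: predictors are ranked by the absolute value of their estimate, a predictor that is absent from a participant's model contributes an estimate of zero to the mean, and the number of selected predictors is 10% of the total, rounded. All names are hypothetical.

```python
from statistics import mean


def mean_estimates(per_person):
    """Average each predictor's estimate over all participants.

    per_person: dict mapping a participant id to a dict of
    predictor name -> estimate. Missing predictors count as 0.0 (assumption).
    """
    names = sorted({name for est in per_person.values() for name in est})
    return {name: mean(est.get(name, 0.0) for est in per_person.values())
            for name in names}


def top_fraction(estimates, fraction=0.10):
    """Return the names of the top `fraction` of predictors,
    ranked by absolute estimate (assumption)."""
    k = max(1, round(len(estimates) * fraction))
    ranked = sorted(estimates, key=lambda name: abs(estimates[name]), reverse=True)
    return ranked[:k]


person_models = {
    "p01": {"craving": 1.0, "stress": 0.0},
    "p02": {"craving": 0.5},
}
averaged = mean_estimates(person_models)
print(averaged, top_fraction(averaged, fraction=0.5))
```

With 110 possible predictors, a 10% cut-off under this rounding rule corresponds to 11 predictors per outcome and model type.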
Results
Sample characteristics
The characteristics of the different patient groups can be found in Table 2. Additionally, the characteristics of the different analysis groups (i.e. for BE or BD/alcohol use) can be found in the supplement (online Supplementary eTable 2). Notably, the age of the patients with BN/AUD (mean = 20.4, s.d. = 1.7, CI 19.6–21.2) was lower than that of the patients with BN (mean = 22.4, s.d. = 4.1, CI 21.3–23.6). Also, the BMI of the patients with BN (mean = 25.6, s.d. = 5.9, CI 23.9–27.3) was higher than that of the patients with AUD (mean = 21.5, s.d. = 3.5, CI 20.5–22.4).
Abbreviations: ADHD, attention deficit hyperactivity disorder; AP, agoraphobia; AUD, alcohol use disorder; BMI, body mass index; BN, bulimia nervosa; CI, confidence interval; EDE-Q, Eating Disorder Examination Questionnaire; MDD, major depressive disorder; N, number; PD, panic disorder; PTSD, post-traumatic stress disorder; SAD, social anxiety disorder; SD, standard deviation.
Data characteristics
In total, 41 (34.2%) participants (16 (32.0%) BN, 19 (37.3%) AUD, 6 (31.6%) BN/AUD) dropped out of the study before the end of the ESM protocol. For every participant group (BN, AUD, BN/AUD), there was no significant difference between patients who dropped out and those who did not when it came to age, BMI, illness duration, AUDIT scores, or EDE-Q scores. The mean compliance (percentage of signals answered) per participant during the first burst was 80.4% for the patients with BN, 75.2% for the patients with AUD and 73.6% for the patients with BN/AUD. This is similar to the compliance rates of previous cross-sectional ESM studies in patients with an eating disorder or AUD (Fischer, Wonderlich, Breithaupt, Byrne, & Engel, 2018; Jones et al., 2019; Schaefer et al., 2020). In total, the patients with BN answered 12 932 (61.5%) of their scheduled beeps, while the patients with AUD answered 12 328 (62.9%) and the patients with BN/AUD answered 3947 (51.2%). The overall compliance of this study fell in the range of the lengthier ESM studies on substance use (Jones et al., 2019). More information on the reasons for dropout and the compliance per burst can be found in the supplement (online Supplementary eResults 1 and eTable 3). There was an imbalance in the outcomes whereby the median percentage of BE episodes that patients with BN and BN/AUD experienced was 14% (Q1–Q3: 6–20%), with a median percentage of alcohol use episodes that patients with AUD and BN/AUD experienced of 14% (Q1–Q3: 8–18%), while the median percentage of BD episodes that patients with AUD and BN/AUD experienced was 4% (Q1–Q3: 2–7%).
Model performance
The performance metrics had a skewed distribution within and between participants. Therefore, the median across folds and across participants was used to describe them. An extended overview can be found in Table 3. A visual summary can be seen in Fig. 3. The confusion matrices of the predictions can be found in the supplement (online Supplementary eTables 4–9).
Abbreviations: AUC, area under the curve; CV, cross-validation; N, number of participants with a successful model; NPV, negative predictive value; PPV, positive predictive value.
The performance metrics had a skewed distribution within and between participants. Therefore, they are best described by the median across folds and participants. To compare, the results after taking the mean across folds are also presented.
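As an illustration of how such medians can be obtained, the sketch below derives sensitivity, specificity, PPV, and NPV from per-fold confusion counts and then summarizes them with the median. It is written in Python for illustration only and is not the authors' R code. The function names, the example counts, and the order of aggregation (median across folds within a participant first, then median across participants) are assumptions.

```python
from statistics import median


def fold_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV for one fold.

    A metric is None when its denominator is zero
    (e.g. PPV when the model predicted no events in that fold).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def median_metric(folds_per_participant, name):
    """Median across folds per participant, then median across participants.

    This two-step order is an assumption made for the sketch.
    """
    participant_medians = []
    for folds in folds_per_participant.values():
        values = [fold_metrics(*fold)[name] for fold in folds]
        values = [v for v in values if v is not None]  # skip undefined folds
        if values:
            participant_medians.append(median(values))
    return median(participant_medians)


# Hypothetical confusion counts (tp, fp, tn, fn) for three folds each.
example = {
    "p01": [(3, 5, 30, 1), (4, 6, 28, 0), (2, 4, 32, 2)],
    "p02": [(1, 9, 25, 0), (2, 7, 27, 1), (0, 3, 33, 1)],
}

for metric in ("sensitivity", "specificity", "ppv", "npv"):
    print(metric, round(median_metric(example, metric), 3))
```

Using the median instead of the mean keeps a few folds with extreme values (for instance a fold containing a single event) from dominating the summary, which is the motivation given above for skewed metrics.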
Binge eating
A person-specific model could be fitted and evaluated for 48 (69.6%) participants. The performance of the person-specific models was poor with a median AUC of 0.61 (Q1: 0.53; Q3: 0.73), sensitivity of 0.83 (Q1: 0.67; Q3: 1.00), specificity of 0.71 (Q1: 0.56; Q3: 0.78), PPV of 0.31 (Q1: 0.22; Q3: 0.43), and NPV of 0.97 (Q1: 0.91; Q3: 1.00). The pooled model could be evaluated on 66 (95.7%) participants. Its performance was adequate with a median AUC of 0.71 (Q1: 0.60; Q3: 0.78), sensitivity of 1.00 (Q1: 0.75; Q3: 1.00), specificity of 0.75 (Q1: 0.60; Q3: 0.86), PPV of 0.33 (Q1: 0.23; Q3: 0.50), and NPV of 1.00 (Q1: 0.94; Q3: 1.00).
Alcohol use
There were 43 (61.4%) participants with a person-specific model. The performance of these models was good with an AUC of 0.80 (Q1:0.72; Q3:0.89), sensitivity of 1.00 (Q1:0.79; Q3:1.00), specificity of 0.80 (Q1:0.75; Q3:0.88), PPV of 0.38 (Q1:0.28; Q3:0.57), and NPV of 1.00 (Q1:0.96; Q3:1.00). The pooled model could be evaluated on 63 (90.0%) participants. It had an outstanding performance with an AUC of 0.90 (Q1:0.83; Q3:0.96), sensitivity of 1.00 (Q1: 1.00; Q3: 1.00), specificity of 0.88 (Q1:0.80; Q3:0.96), PPV of 0.50 (Q1:0.37; Q3:0.75), and NPV of 1.00 (Q1:1.00; Q3:1.00).
Binge drinking
A person-specific model could be fitted and evaluated for 13 (18.6%) participants. The performance of the person-specific models was good with a median AUC of 0.85 (Q1: 0.71; Q3: 0.93), sensitivity of 1.00 (Q1: 0.75; Q3: 1.00), specificity of 0.90 (Q1: 0.78; Q3: 0.96), PPV of 0.28 (Q1: 0.18; Q3: 0.50), and NPV of 1.00 (Q1: 0.98; Q3: 1.00). The pooled model could be evaluated on 49 (70.0%) participants. Its performance was outstanding with an AUC of 0.93 (Q1: 0.87; Q3: 0.98), sensitivity of 1.00 (Q1: 1.00; Q3: 1.00), specificity of 0.93 (Q1: 0.84; Q3: 0.99), PPV of 0.50 (Q1: 0.20; Q3: 0.83), and NPV of 1.00 (Q1: 1.00; Q3: 1.00).
Model comparison
The difference between the AUC of the person-specific and pooled models did not depend on patient group, and did not correlate with age, BMI, EDE-Q scores, AUDIT scores, BE frequency, or BD frequency. In other words, having a higher AUC for the person-specific model than for the pooled model was not related to any of the patient characteristics we recorded. A more detailed analysis of the relation between the number of assessments and the AUC for the person-specific models can be found in the supplement (online Supplementary eResults 2 and eFigure 2).
Validation analyses
There was no difference in the occurrence of BE or alcohol use across days. However, BD happened more frequently on Thursday (Thursday v. Saturday: β = 0.452, s.e. = 0.099, p < 0.001; Friday v. Saturday: β = 0.093, s.e. = 0.106, p = 0.377). The pooled and person-specific models exhibited lower overall accuracy than models always predicting the majority case, but they demonstrated a better predictive performance (online Supplementary eResults 3 and eTable 10). Despite the lower overall accuracy of the original models due to outcome imbalance, they exhibited higher NPV and PPV. This suggests that when the original models predicted the likelihood of an event (i.e. BE or no BE), these predictions were more correct. Moreover, adjusting for the outcome imbalance led to poorer performance overall, except for the person-specific models for BE (see online Supplementary eResults 4 and eTable 11). This aligns with recent findings indicating that correcting for an outcome imbalance may introduce bias (Van Den Goorbergh, Van Smeden, Timmerman, & Ben Van Calster, Reference Van Den Goorbergh, Van Smeden, Timmerman and Ben Van Calster2022). Additionally, incorporating app type as a predictor did not enhance model performance (see online Supplementary eResults 5 and eTable 12). However, patients who reported a higher percentage of their responses through m-Path exhibited poorer performance in the person-specific models for alcohol use (ρ = −0.424, p = 0.002), but this was not observed in other models (see online Supplementary eResults 5).
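The contrast between overall accuracy and predictive value can be made concrete with a small worked example. The sketch below is in Python, for illustration only, and uses hypothetical counts for a single participant with 100 assessments, of which 14 contain an event (matching the median event percentage reported above). The counts for the model are invented and are not taken from the study.

```python
def summarize(tp, fp, tn, fn):
    """Return accuracy, PPV and NPV; a value is None when undefined."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp) if (tp + fp) else None,
        "npv": tn / (tn + fn) if (tn + fn) else None,
    }


# Hypothetical participant: 14 events in 100 assessments.
# Baseline that always predicts the majority case ("no event").
baseline = summarize(tp=0, fp=0, tn=86, fn=14)

# Hypothetical model that catches most events at the cost of false alarms.
model = summarize(tp=12, fp=21, tn=65, fn=2)

print("baseline:", baseline)
print("model:   ", model)
```

In this example the baseline reaches a higher accuracy (0.86 v. 0.77) because it is never wrong on the 86 non-events, yet it never predicts an event, so its PPV is undefined and its NPV equals the share of non-events. The hypothetical model trades some accuracy for a defined PPV and a higher NPV, which is the pattern described above.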
Model predictors
A visual summary of the 10% best predictors for each outcome and each model type can be seen in Fig. 4.
Binge eating
For the person-specific models, BE was positively predicted by craving for a BE episode, evening hours (i.e. ping 8), doing things that you regret, feeling down, and the pleasantness of the most important event. It was negatively predicted by feeling like you can handle the situation, feeling stressed, and night (i.e. ping 1). For the pooled model, the best positive predictors were evening (i.e. ping 7, ping 8) and craving for a BE episode. The best negative predictors were being alone, experiencing physical discomfort, being under pressure, morning (ping 2), night (ping 1), being in a calm environment (i.e. passive leisure activities, experiencing no stressors), experiencing the suffering of others, and having laughed.
Alcohol use
For the person-specific models, alcohol use was positively predicted by craving alcohol, having drunk alcohol at the previous timepoint, evening (i.e. ping 7 and ping 8), pleasant company, experiencing pleasant events, and feeling cheerful. It was negatively predicted by wanting to do something. The best positive predictors for the pooled model were having drunk alcohol at the previous timepoint, evening (i.e. ping 7 and 8), smoking, and craving alcohol. The best negative predictors for the pooled model were morning/noon (ping 2, 3, 4, and 5), drinking caffeine, having eaten and being on your cellphone.
Binge drinking
For the person-specific models, BD was positively predicted by craving alcohol, being with friends, having drunk alcohol at the previous timepoint, evening (i.e. ping 8), night (i.e. ping 1), experiencing pleasant events, feeling satisfied, and doing groceries. However, BD was negatively predicted by a lack of perseverance and feeling relaxed. For the pooled model, important positive predictors were evening (i.e. ping 8), night (i.e. ping 1), having drunk alcohol at the previous timepoint, being with friends, smoking, and experiencing no positive events. Important negative predictors were noon (i.e. ping 3, 4, and 5), drinking coffee, studying/working, and having laughed.
Discussion
This study had two objectives. First, to build and evaluate person-specific and pooled prediction models for BE, alcohol use, and BD in patients with BN and/or AUD. Second, to identify the most important predictors of these behaviors.
Model performance
The AUC of the prediction models ranged from poor to outstanding, which was similar to or slightly better than the AUCs in studies predicting eating behaviors and alcohol use in healthy volunteers (Goldstein et al., Reference Goldstein, Zhang, Thomas, Butryn, Herbert and Forman2018; Soyster et al., Reference Soyster, Ashlock and Fisher2021). Furthermore, the one study that implemented ML in a JITAI found that dietary lapses could be predicted and prevented with an ML model whose AUC was similar to that of the current study's pooled model for BE (Forman et al., Reference Forman, Goldstein, Crochiere, Butryn, Juarascio, Zhang and Foster2019). This suggests that the pooled prediction models of the current study as well as the person-specific models for alcohol use and BD could be used in a JITAI for clinical use. However, this is nuanced by the median PPV of the different models in the current study, which ranges from 0.28 to 0.50. This means that patients engaged in binge eating, alcohol use, or binge drinking no more than half of the times it was predicted by the models. This could be due to inaccuracies in the models or the possibility that patients were in an at-risk state but did not engage in these behaviors at the time (e.g. by exhibiting alternative behaviors or displaying the behaviors at a later time). However, having an NPV that is higher than the PPV is thought to be advantageous as the consequences of failing to send out an intervention are deemed to be more detrimental than intervening excessively. Notably, existing literature does not provide insights into the minimum PPV required for a JITAI to be effective without causing undue inconvenience by delivering excessive interventions, and this should be investigated by future studies.
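One reason a modest PPV can coexist with a high sensitivity and specificity is that PPV also depends on how often the behavior occurs. The sketch below, in Python and for illustration only, applies the standard relation between sensitivity, specificity, and prevalence. The input values are rounded figures loosely inspired by the medians reported in the Results. Because those medians were computed separately and not observed jointly, the output is not expected to reproduce the reported PPV or NPV.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


def npv_from_rates(sensitivity, specificity, prevalence):
    """Negative predictive value via Bayes' rule."""
    true_negative = specificity * (1 - prevalence)
    false_negative = (1 - sensitivity) * prevalence
    return true_negative / (true_negative + false_negative)


# Illustrative inputs (assumptions): a frequent and a rare outcome.
scenarios = {
    "events at 14% of assessments": (1.00, 0.75, 0.14),
    "events at 4% of assessments": (1.00, 0.93, 0.04),
}

for label, (sens, spec, prev) in scenarios.items():
    print(
        label,
        "PPV =", round(ppv_from_rates(sens, spec, prev), 2),
        "NPV =", round(npv_from_rates(sens, spec, prev), 2),
    )
```

In both scenarios the PPV stays below 0.50 while the NPV is 1.00, even though the rarer outcome is given a higher specificity. This illustrates that, for infrequent events, most alerts can be false alarms without this necessarily indicating a poorly discriminating model.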
In the current study, the performance of the prediction models for BE was lower than that of the models for alcohol use and BD. This could be the result of how BE is defined, as it is more difficult to assess whether you have overeaten and lost control than whether you have drunk alcohol. This could be particularly evident in the current study, as no specific caloric amount was used to define overeating. Furthermore, it could be that some important predictors were not included in the current study. This could be remediated by basing the ESM items on the reported triggers of the individual patients, as was done in a previous study on BE (Arend et al., Reference Arend, Kaiser, Pannicke, Reichenberger, Naab, Voderholzer and Blechert2023). Lastly, it could be that other ML analysis techniques would be better suited. For example, it could be that elastic net regression is not well adapted to predict BE and that other ML techniques would result in a better predictive performance.
The results also showed that the pooled prediction models outperformed the person-specific ones. Indeed, some studies found that pooled models have a better predictive performance than person-specific ones (Ntekouli et al., 2022; Soyster et al., Reference Soyster, Ashlock and Fisher2021). However, others reported the opposite to be true (Cheung et al., Reference Cheung, Hsueh, Qian, Yoon, Meli, Diaz and Davidson2017; Rozet, Kronish, Schwartz, and Davidson, Reference Rozet, Kronish, Schwartz and Davidson2019). There could be several reasons why the current study finds that pooled models have a higher predictive performance than person-specific ones. On the one hand, the difference in performance could be the result of the substantially larger number of observations that were used to train the pooled models. Indeed, studies show that ML performance is related to dataset size (Althnian et al., Reference Althnian, AlSaeed, Al-Baity, Samha, Dris, Alzakari and Kurdi2021). One study also found that the performance of person-specific models increases with a greater number of observations until it is similar to or exceeds that of pooled models (Rozet et al., Reference Rozet, Kronish, Schwartz and Davidson2019). On the other hand, the better performance of the pooled models could be the result of the characteristics of the participants. Namely, previous studies show that group-level methods lend themselves well to samples that are homogenous (i.e. with a low inter-individual variability), which is the case in the current study (Fisher, Medaglia, & Jeronimus, Reference Fisher, Medaglia and Jeronimus2018; Molenaar, Reference Molenaar2004).
Taken together, these findings raise the question of whether pooled prediction models would translate well to a clinical setting, where it could be difficult to gather a large dataset and where more inter-individual variability is seen. However, building a person-specific model could present its own challenges. Namely, as the performance of prediction models is thought to be related to sample size, a patient would have to be followed for a considerable amount of time and need to answer a large number of assessments before a person-specific model could be fit. However, in clinical practice, it might be difficult to observe patients for longer periods of time before moving on to treatment, and some patients may struggle to answer a large number of assessments. In that case, it could be difficult to observe enough events of interest, which could be helped by aggregating small amounts of data from multiple patients in a pooled model. Therefore, as both person-specific and pooled models have their benefits and disadvantages, future studies should investigate a combination of both approaches. Indeed, studies bridging this gap show encouraging results, but further research is needed (Goldstein et al., Reference Goldstein, Zhang, Thomas, Butryn, Herbert and Forman2018; Ren et al., Reference Ren, Balkind, Pastro, Israel, Pizzagalli, Rahimi-Eichi and Webb2023). Furthermore, studies should investigate which sampling frequency provides a treatment response with only a minimum of patient burden. In the current study, patients needed to answer eight assessments per day, which might be too demanding and excessive in the context of a JITAI.
Most important predictors
There were significant differences between the most important predictors of the person-specific and pooled prediction models. This is not unexpected as previous studies have shown that the agreement between person-level and group-level analyses is limited (Fischer et al., Reference Fischer, Wonderlich, Breithaupt, Byrne and Engel2018). Furthermore, the differences between the types of predictors could have important implications for the development of JITAIs. Namely, it could be challenging to develop interventions that target the predictors of the pooled models as these mostly concern the time of day (e.g. evening or night) or recent events (e.g. experiencing something boring or being under pressure). It might be more valuable to focus on the predictors of the person-specific models as they deal with thoughts (e.g. craving), emotions (e.g. negative affect, positive affect), and behaviors (e.g. acting rash). This suggests that though pooled models might have a better performance, person-specific models could still be of value when it comes to tailoring daily life interventions.
The results also showed that there are both similarities and differences in the predictors for BE, alcohol use, and BD. First, it can be seen that craving was the most important predictor across the person-specific models of all behaviors. Though the relation between craving and alcohol use has been investigated by a large number of studies, this is less the case for BE (Cavicchioli, Vassena, Movalli, & Maffei, Reference Cavicchioli, Vassena, Movalli and Maffei2020; Novelle & Diéguez, Reference Novelle and Diéguez2018; Seo & Sinha, Reference Seo and Sinha2014). Future studies should therefore investigate the relationship between craving and BE in more depth. Second, positive events (i.e. the pleasantness of all events) and affect (i.e. feeling cheerful and satisfied) were important predictors of alcohol use and BD. This showcases the hypothesized link between positive emotions and alcohol consumption (Cooper, Frone, Russell, & Mudar, Reference Cooper, Frone, Russell and Mudar1995). However, the pleasantness of the most important event was also a predictor of BE. Though studies have shown that positive affect often decreases before a BE episode, other studies indicate that patients who act more rashly when positive affect is high also have a higher BE frequency (Michael & Juarascio, Reference Michael and Juarascio2021; Schaefer et al., Reference Schaefer, Smith, Anderson, Cao, Crosby, Engel and Wonderlich2020). Future studies should therefore explore whether positive emotions can also be a trigger for BE episodes in patients with BN. Third, BE was varyingly predicted by the different negative emotions. More specifically, it was positively predicted by feeling down and negatively predicted by feeling stressed. 
Similarly, a recent machine learning study in adolescents reports that loss of control eating is positively predicted by feeling lonely and feeling rejected, but negatively predicted by feeling stressed (Hagan, Leenaerts, Walsh, & Ranzenhofer, Reference Hagan, Leenaerts, Walsh and Ranzenhofer2024). Indeed, though studies show that negative emotions can trigger BE, others have found that negative emotions can also lead to dietary restriction and that some emotions are more related to BE than others (Berg et al., Reference Berg, Crosby, Cao, Peterson, Engel, Mitchell and Wonderlich2013; Haedt-Matt & Keel, Reference Haedt-Matt and Keel2011; Leenaerts, Vaessen, Sunaert, Ceccarini, & Vrieze, Reference Leenaerts, Vaessen, Sunaert, Ceccarini and Vrieze2023a; Mikhail, Reference Mikhail2021). Furthermore, studies in the general population as well as rodents show that a strong acute stressor with pronounced physical responses decreases food intake while a mild chronic stressor increases consumption of energy-dense food (O'connor, Jones, Conner, Mcmillan, & Ferguson, Reference O'connor, Jones, Conner, Mcmillan and Ferguson2008; Torres & Nowson, Reference Torres and Nowson2007). In the same line, the current study finds that experiencing physical discomfort and being under time pressure are related to a lower probability of BE. However, the results also highlight that stressors do play a role in BE, as experiencing no stressors is related to a lower chance of BE. Future studies should therefore explore when negative emotions lead to BE and when they lead to dietary restriction. Contrastingly, negative emotions or events were not included in the most important predictors of alcohol use or BD. 
Though the induction of negative affect has been shown to lead to increases in alcohol consumption in a laboratory context, a recent meta-analysis reports that this is not the case in daily life (Bresin, Mekawi, & Verona, Reference Bresin, Mekawi and Verona2018; Dora et al., Reference Dora, Piccirillo, Foster, Arbeau, Armeli, Auriacombe and King2022). This could be one reason why negative emotions were not an important predictor of alcohol use and BD in the current study. However, in our own recent work, we found that there was a relation between negative affect and alcohol use, but a non-linear one (Leenaerts, Vaessen, Sunaert, Ceccarini, & Vrieze, Reference Leenaerts, Vaessen, Sunaert, Ceccarini and Vrieze2023b). This could be another reason why the elastic net regression did not retain negative emotions as an important predictor as it assumes a linear relation between variables. Interestingly, being alone was a negative predictor of BE in the pooled prediction model. This seems to contradict previous research showing that patients often binge eat when they are alone (Stickney, Miltenberger, & Wolff, Reference Stickney, Miltenberger and Wolff1999). However, it is important to consider that being alone in the current study was a predictor at the previous timepoint. It could therefore be that being alone at the previous timepoint is a negative predictor, as this means that the patients are not among other people, and are therefore less likely to experience interpersonal stressors, which are known to be linked to BE (Goldschmidt et al., Reference Goldschmidt, Wonderlich, Crosby, Engel, Lavender, Peterson and Mitchell2014).
Limitations
This study has several limitations. Importantly, the sample mostly consisted of female participants who were Caucasian and had a short illness duration, resulting in a higher degree of homogeneity compared to the broader population of patients with AUD and/or BN. Consequently, the generalizability of the study's findings may be impacted. First, it's possible that predictors are differently associated with the outcomes in individuals from other backgrounds concerning sex, race, and illness duration. Second, a pooled model may not perform as effectively when applied to a more heterogeneous sample as inter-individual differences would make it more difficult for the machine learning algorithm to learn relations in the data. However, person-specific models might still demonstrate similar performance levels in a more heterogeneous sample, as their efficacy is not impacted by inter-individual differences. Third, individuals can binge eat, drink alcohol, and binge drink without meeting the criteria for AUD and/or BN. Future studies should therefore explore the efficacy of prediction models in subclinical populations. Additionally, there are several methodological limitations. First, assessments were only sent out on Thursday, Friday, and Saturday. In the current study, there was no difference in the frequency of BE or alcohol use across days, but BD happened more often on Thursdays, which is in line with the findings of previous studies (Lavender et al., Reference Lavender, Utzinger, Crosby, Goldschmidt, Ellison, Wonderlich and Le Grange2016; Van Damme et al., Reference Van Damme, Thienpondt, Rosiers, Tholen, Soyez, Sisk and Rapport2022). Nevertheless, participants could still have adapted their behavior on days when data was collected, which could have impacted the study results. Second, data were collected through more than one app. 
Though including app type as a predictor did not improve performance, participants who responded more through m-Path had a lower AUC in the person-specific models for alcohol use. This could indicate that app type was not directly related to the outcomes but could have an indirect impact on model performance (e.g. through answer times). Third, the nested nature of the data is partially accounted for in the pooled models by standardizing the continuous variables at a within-person level. Though this is a suboptimal way to handle nested data, this method was chosen as mixed effects elastic net regularized regression is not implemented in R. There are regression-based techniques that better account for multilevel data (i.e. mixed effects LASSO regression), but these techniques struggle with multi-dimensional data and highly correlated variables. Fourth, the majority of the predictors were based on ESM items, which are self-report items, and could have been biased. Future studies should explore the value of combining actively (i.e. ESM items) and passively gathered data (e.g. physiology). Fifth, the intervals between the assessments were uneven, and this is not accounted for by an elastic net model. Studies should therefore explore the use of continuous-time models which do not assume an even interval between assessments. Sixth, the cross-validation divided the data at random, which does not reflect how a JITAI would function, as a model would first be fit on the data of a patient, after which it would be applied to new registrations of a patient. Seventh, though no sample size calculations exist for elastic net regularized regression, studies on other regularized regression techniques suggest that some person-specific models in the current study might suffer from a small sample size (Riley et al., Reference Riley, Ensor, Snell, Harrell, Martin, Reitsma and Van Smeden2020). Eighth, there was a considerable amount of missing data which could have influenced the results. 
This was especially the case for the patients with both AUD and BN as they displayed a substantially lower compliance from the beginning of the study. Though no difference in the alcohol use or eating disorder characteristics was found between the patients with AUD/BN and the patients with either AUD or BN, it could be that the combination of experiencing difficulties with alcohol use and binge eating has a significantly larger impact on the patients, thereby making it more difficult for them to answer the ESM assessments. A popular technique to handle missingness is multiple imputation by chained equations (MICE), but this method struggles with the correlation between observations across time. However, there are promising deep learning-based methods, which future studies could evaluate for clinical prediction models (Kazijevs & Samad, Reference Kazijevs and Samad2023). Ninth, only one individual assessed whether the participants met the criteria for the DSM-diagnoses, making it difficult to assess the reliability of the diagnoses. Tenth, it was not possible to assess the validity of several ESM items in the current study, nor was there any information available on the validity of these items from previous studies. It is necessary for future studies to validate their ESM items to enhance the robustness of their prediction models.
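Regarding the sixth limitation, one alternative to random cross-validation is a forward-chaining (time-ordered) split, in which a model is always fit on earlier assessments and evaluated on later ones. The sketch below, in Python and for illustration only, generates such splits from the number of time-ordered assessments of one patient. It was not used in the current study, and the function name and parameters are hypothetical.

```python
def forward_chaining_splits(n_observations, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Indices are assumed to be sorted by assessment time. Every test
    index comes after every train index, so the model is never
    evaluated on assessments that precede its training data.
    """
    if min_train < 1 or min_train >= n_observations:
        raise ValueError("min_train must be between 1 and n_observations - 1")
    test_size = (n_observations - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        # The last fold absorbs any leftover observations.
        is_last = fold == n_folds - 1
        test_end = n_observations if is_last else train_end + test_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical patient with 10 assessments, first 4 reserved for training.
for train, test in forward_chaining_splits(10, n_folds=3, min_train=4):
    print("train:", train, "test:", test)
```

A split of this kind mirrors how a JITAI would operate, at the cost of fewer training observations in the early folds, which could further reduce the number of patients for whom a person-specific model can be fit.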
Conclusion
This study builds and evaluates person-specific and pooled prediction models for BE, alcohol use, and BD in patients with BN and/or AUD. The performances of the different models vary between poor and outstanding, but the pooled models outperform the person-specific ones and the models for alcohol use and BD outperform those for BE. This study also explores which variables are the most important predictors in the different models. Here, the predictors of the pooled models mostly concern the time of day and recent events, while those of the person-specific models mostly concern thoughts, emotions, and behaviors. Future studies should explore whether pooled and person-specific approaches could be combined and how BE, alcohol use, and BD can be impacted by interventions in daily life.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291724000862.
Author's Note
The present study's design and analyses were not pre-registered. However, consistent with the Transparency and Openness Promotion (TOP) guidelines, the data and scripts that support the findings of this study are available at the Research Data Repository of the KU Leuven at https://rdr.kuleuven.be/dataset.xhtml?persistentId=doi:10.48804/OBLDWE. More information on the elastic net wrappers which were developed for this paper can be found at https://github.com/nicolasleenaerts/NLML. The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.
Funding statement
A C1 grant (grant number ECA-D4671-C14/18/096) of the Special Research Fund KU Leuven to Vrieze and Ceccarini served as a PhD Scholarship for Leenaerts. Ceccarini was supported by a postdoc grant from FWO (grant number 12R1619N). No other grant of any kind was received to support this work.
Competing interests
No other disclosures were reported.