Introduction
Machine learning (ML) is based on the idea that machines (computers) can learn from historical data and be trained for pattern recognition (e.g., prediction). The prospects of using ML to aid decision-making in the medical field are promising [Reference Beam, Drazen, Kohane, Leong, Manrai and Rubin1]. Indeed, prediction models based on ML have been shown to be accurate in many clinical contexts, with performance levels comparable to, or above, those of clinicians [Reference Ben-Israel, Jacobs, Casha, Lang, Ryu and de Lotbiniere-Bassett2].
Since patients are major stakeholders in the medical field, their acceptance of ML is of paramount importance for the successful implementation of tools based on ML [Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver3,Reference Frank, Elbæk, Børsting, Mitkidis, Otterbring, Borau and Guidi4]. Public and patient trust in ML in the healthcare setting has been surveyed before [Reference Young, Amara, Bhattacharya and Wei5–Reference Reading Turchioe, Harkins, Desai, Kumar, Kim and Hermann7]. According to these studies, stakeholder trust in medical applications of ML models relies on the knowledge that the final decision lies in the hands of the health professionals, that is, that ML models are merely used for decision support. Moreover, increased information about ML models, including explainability (transparency of what drives prediction), is associated with increased trust in them [Reference Ploug, Sundby, Moeslund and Holm6,Reference Cadario, Longoni and Morewedge8,Reference Yarborough and Stumbo9].
To our knowledge, however, no prior surveys have focused on opinions on ML-based clinical decision support systems among patients receiving treatment in psychiatric services. This is an unfortunate gap in the literature, as such systems are gaining traction in the psychiatric field [Reference Danielsen, Fenger, Østergaard, Nielbo and Mors10,Reference Bernert, Hilberg, Melia, Kim, Shah and Abnousi11] and because the level of general and institutional trust is relatively low among some patient groups in psychiatry [Reference Kopacz, Ames and Koenig12,Reference Verhaeghe and Bracke13]. Therefore, the aim of this study was to investigate trust in ML-based clinical decision support systems among patients with mental disorders. Furthermore, we tested whether receiving information about ML-based clinical decision support systems increases patients’ trust in this technology.
Methods
Design
We conducted a randomized online survey experiment focusing on ML-based clinical decision support tools in psychiatric services. The study design and analysis plan were pre-registered and are available at https://doi.org/10.17605/OSF.IO/Z9385. The design of the study is shown in Figure 1.
Setting
The survey experiment was performed within the Psychiatric Services of the Central Denmark Region, which has a catchment area of approximately 1.3 million people. It comprises five public psychiatric hospitals, which provide free (tax-financed) inpatient, outpatient, and emergency psychiatric treatment.
Participants
Patients were eligible for participation if they were 18 years of age or older and received treatment in the Psychiatric Services of the Central Denmark Region. Patients were ineligible if they had a forensic sanction, received coercive treatment, had an organic mental disorder (ICD-10 code: F0X.X), or had mental retardation (ICD-10 code: F7X.X). Based on power calculations (see below), we invited 6000 randomly drawn eligible patients using the online survey service SurveyXact [14] via “e-Boks” – the secure digital mailing system used by Danish authorities to communicate with citizens [Reference Ebert, Huibers, Christensen and Christensen15]. The survey was distributed from May 26–31, 2023. A reminder was sent on June 12, 2023, to those who had not yet responded. The participants provided informed consent for study participation by ticking a box and entering their unique social security number and name. The participants did not receive any monetary incentive for participation.
After the survey was fielded, it came to our attention that participants accessing the “e-Boks” app on an Android-based device could not open the hyperlink provided in the invitation. To solve this issue, we informed the participants of a workaround for this technical problem when distributing the reminder. While this issue is likely to have reduced the response rate, we consider it unlikely to have introduced bias.
Power calculation
A power analysis (alpha = 0.05, power = 0.80, two-sided) for a pairwise comparison, assuming a small intervention effect of Cohen’s d = 0.2 (as seen in similar studies [Reference Cadario, Longoni and Morewedge8]), estimated that 1200 participants (400 per randomization arm) were required. A recent study in the same setting and using the same invitation procedure had a response rate of approximately 20% [Reference Kølbæk, Jefsen, Speed and Østergaard16]. Based on this effect size and response rate, 6000 patients were invited to participate in the study.
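For reference, this calculation can be reproduced with base R (the environment used for the analyses in this study); the following is a minimal sketch assuming a standardized effect, i.e., sd = 1 so that delta equals Cohen’s d:

```r
# Sample size for a pairwise two-sample t-test detecting a small effect
# (Cohen's d = 0.2) at alpha = 0.05 with 80% power (two-sided).
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# => n per group is approximately 394, i.e., ~400 per arm and ~1200 across
#    the three arms. With an expected ~20% response rate:
#    1200 / 0.20 = 6000 invitations.
```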
Randomization
Participants were randomly allocated to one of three arms: intervention, active control, and blank control (for a description, see the “Survey experiment” section below). As SurveyXact did not allow for standard 1/3, 1/3, 1/3 random allocation at the time of the study, the system was devised to assign the participants to one of the three arms based on the time at which they accessed the survey link in the invitation letter. Specifically, participants accessing the link in the intervals from ≥0 to 0.333 seconds, from >0.333 to 0.666 seconds, and from >0.666 to <1 second were assigned to the blank control arm, the active control arm, and the intervention arm, respectively. The participants were not aware that randomization and an intervention took place.
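SurveyXact’s internal mechanics are not public, but the allocation rule can be illustrated as follows. This is purely an illustrative sketch in R, assuming the rule operates on the sub-second fraction of the access timestamp; the function name assign_arm is hypothetical:

```r
# Illustrative sketch only -- not SurveyXact's actual implementation.
# The sub-second fraction of the moment the link is accessed determines the arm.
assign_arm <- function(access_time = Sys.time()) {
  frac <- as.numeric(access_time) %% 1  # fractional part of the current second
  if (frac <= 1/3) {
    "blank control"
  } else if (frac <= 2/3) {
    "active control"
  } else {
    "intervention"
  }
}

assign_arm()  # e.g., "active control"
```

Because access times are effectively random at the sub-second level, this scheme approximates uniform 1/3, 1/3, 1/3 allocation.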
Baseline questionnaire
When entering the survey, irrespective of allocation arm, the participants initially filled in a baseline questionnaire regarding education level, current work status, household composition (adults and children), general trust, trust in technology, and perceived understanding of standard clinical decision-making as well as perceived understanding of ML-based clinical decision support systems. Answers on trust and perceived understanding were provided on Likert scales from 0 to 10 (an English translation of the questionnaire in Danish is available in Supplementary Table 1: “Baseline questionnaire”).
Survey experiment
The baseline questionnaire was followed by the experiment in which the participants received three different types of information based on the randomized allocation:
1. Intervention: Visual and text-based information pamphlet (slides within the electronic survey – see Supplementary Table 2: “Intervention”) explaining how an ML-based clinical decision support system works and may aid clinical practice in psychiatric services.
2. Active control: Visual and text-based information pamphlet (slides within the electronic survey – see Supplementary Table 3: “Active control”) explaining a standard clinical decision process in psychiatric services without the use of an ML-based clinical decision support system.
3. Blank control: No information pamphlet.
Each information pamphlet in the online survey consisted of four slides.
Post-experiment questionnaire (outcome measure)
After the survey experiment, the participants filled in a questionnaire aimed at measuring trust and distrust in ML-based clinical decision support systems in psychiatric services. Specifically, the respondents answered the following questions: 1. “I feel safe that mental health professionals can make decisions with the support of machine learning models.” 2. “I trust that the Psychiatric Services can use machine learning models in a safe and appropriate way.” 3. “I am concerned that use of machine learning models for decision support in psychiatry will increase the risk of error.” 4. “I would like to have the opportunity to opt out of machine learning models being used for decision support in relation to my treatment in the psychiatric services.” 5. “I am concerned that healthcare services, including the psychiatric services, are becoming too dependent on machine learning models.” 6. “I am concerned that the use of machine learning models may lead to increased inequality in healthcare, including psychiatry.” 7. “The advantages of using machine learning models for decision support in psychiatry outweigh the disadvantages.” 8. “It is important to me that I can get an explanation of the basis on which a machine learning model recommends a given treatment.” 9. “I am concerned that a machine learning model may make incorrect recommendations due to inaccuracies in my medical record.” The questions were adapted from prior studies covering the same topic [Reference Ploug, Sundby, Moeslund and Holm6,Reference Esmaeilzadeh and Mirzaei17]. All questions were answered using an 11-level Likert scale ranging from 0 (“Totally disagree”) to 10 (“Totally agree”).
Choice of primary outcome measure
As it is suboptimal to sum positively and negatively worded items (after inversion) [Reference Timmerby, Nørholm, Rasmussen, Lindberg, Andreasson Aamund and Bech18], the three positively worded “trust” items (1, 2, and 7) and the five negatively worded “distrust” items (3, 4, 5, 6, and 9) were grouped a priori (see the pre-registered analysis plan). Item 8 was considered neutral and was therefore kept separate. Subsequently, principal component analyses were performed to test whether the trust and distrust items, respectively, loaded onto latent components. The number of components was determined by inspecting the scree plot and choosing the number of components before the distinct break (“elbow”) in the plot [Reference Abdi and Williams19]. An item was considered to load onto a component if it had a loading of >0.40 or <−0.40 [Reference Guadagnoli and Velicer20]. Based on the items loading onto the components, a trust total score (the sum of the positively worded items) and a distrust total score (the sum of the negatively worded items) were constructed and used as the outcome measures.
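In base R, this procedure could look roughly as follows. The data frame responses and its columns item1 to item9 are hypothetical stand-ins for the survey data, and note that prcomp returns eigenvector loadings, which differ from correlation-scaled loadings by a constant factor per component:

```r
set.seed(1)
# Hypothetical stand-in for the completers' 0-10 item ratings:
responses <- as.data.frame(matrix(sample(0:10, 992 * 9, replace = TRUE),
                                  ncol = 9,
                                  dimnames = list(NULL, paste0("item", 1:9))))

# A priori grouping: positively worded (trust) vs negatively worded (distrust) items.
trust_items    <- responses[, c("item1", "item2", "item7")]
distrust_items <- responses[, c("item3", "item4", "item5", "item6", "item9")]

pca_trust <- prcomp(trust_items, scale. = TRUE)
screeplot(pca_trust, type = "lines")  # choose the components before the "elbow"
pca_trust$rotation[, 1]               # an item loads if |loading| > 0.40

# If the expected single-component structure holds, sum scores are the outcomes:
trust_score    <- rowSums(trust_items)
distrust_score <- rowSums(distrust_items)
```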
Handling of survey responses
After clicking the generic link to the survey, the patients identified themselves by manually entering their social security number and name. If a participant completed the questionnaire more than once, the first response was used. If a participant first submitted a partial response, including the randomization element, and subsequently completed the survey while randomized to a different arm, the full response was not included in the analyses (as these participants were unblinded to the randomization). As the outcome measures were placed at the end of the survey, only completers were included in the analyses. Questions could not be left blank; thus, there were no missing values.
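A sketch of these exclusion rules in base R follows; the data frame raw and its columns are hypothetical, and the actual processing may have differed in detail:

```r
# Hypothetical example: participant "a" submitted a partial response and later
# a full response under a different arm, and is therefore excluded.
raw <- data.frame(id        = c("a", "a", "b", "c"),
                  arm       = c("blank control", "intervention",
                                "active control", "intervention"),
                  completed = c(FALSE, TRUE, TRUE, TRUE))  # ordered by submission time

full <- subset(raw, completed)
first_full <- full[!duplicated(full$id), ]  # first full response per person

# Exclude anyone whose responses fell in more than one arm (unblinded):
rearmed_ids <- names(which(tapply(raw$arm, raw$id,
                                  function(a) length(unique(a)) > 1)))
analysis <- subset(first_full, !(id %in% rearmed_ids))
analysis  # only participants "b" and "c" remain
```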
Supplementary data from electronic health records
The participants gave consent to the extraction of sociodemographic (age, sex, civil status) and clinical data (the number of contacts with the psychiatric services, including contact type (inpatient/outpatient) and the associated ICD-10 diagnoses, as well as the time since their first contact with the psychiatric services) from the electronic health records for the purpose of the study. Linkage was performed using the respondents’ unique personal identification numbers [Reference Pedersen21]. To define diagnostic subgroups, we considered each participant’s most severe main diagnosis (registered from 2011 until the time of the survey) using the following ICD-10 hierarchy: F2x (psychotic disorders) > F3x (mood disorders) > F4x (anxiety- and stress-related disorders) > F5x (eating, sleeping, and other behavioral syndromes associated with physiological disturbances) > F6x (personality disorders) > F8x (developmental disorders including autism) > F9x (child and adolescent mental disorders) > F1x (substance use disorders).
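This hierarchy can be expressed compactly in R; the helper most_severe below is a hypothetical illustration operating on raw ICD-10 codes:

```r
# Severity hierarchy over ICD-10 chapters, most severe first:
hierarchy <- c("F2", "F3", "F4", "F5", "F6", "F8", "F9", "F1")

# Hypothetical helper: reduce a patient's main diagnoses to the most severe chapter.
most_severe <- function(dx_codes) {
  chapters <- substr(dx_codes, 1, 2)       # e.g., "F32.1" -> "F3"
  ranks    <- match(chapters, hierarchy)   # position in the hierarchy (NA if outside)
  hierarchy[min(ranks, na.rm = TRUE)]      # lowest rank = most severe
}

most_severe(c("F41.0", "F32.1", "F20.9"))  # => "F2" (psychotic disorders)
```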
Statistics
The sociodemographic and clinical characteristics were summarized using descriptive statistics. Potential differences in the time used for survey completion between the randomization groups were tested using the Mann–Whitney U test. As the primary analyses, the levels of trust/distrust in ML-based clinical decision support systems, quantified by the trust/distrust sum scores, were compared pairwise between the three randomization arms via two-sample t-tests. Equivalent secondary analyses were conducted at the individual trust/distrust item level. The Pearson correlation coefficient was used to assess the correlation between the latent trust and distrust scores. As robustness analyses, two-sample t-tests of trust/distrust in ML-based clinical decision support systems were performed across the three randomization arms while stratifying by sex, age, diagnostic group (affective/anxiety disorders versus psychotic/other disorders), socioeconomic factors (e.g., educational level, current work status), baseline knowledge of machine learning as decision support, and the level of general trust. The significance threshold was set at 0.05. Correction for multiple comparisons was not performed, as the analyses were pre-registered and highly interdependent [Reference Rothman22]. All data management and statistical analyses were performed using RStudio version 2023.06.0 Build 421.
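A minimal sketch of these comparisons in base R follows; the data frame d and its columns are hypothetical placeholders for the analysis dataset, not the study data:

```r
set.seed(1)
# Hypothetical analysis dataset:
d <- data.frame(arm            = sample(c("blank control", "active control",
                                          "intervention"), 992, replace = TRUE),
                trust_score    = sample(0:30, 992, replace = TRUE),
                distrust_score = sample(0:50, 992, replace = TRUE),
                seconds        = rexp(992, rate = 1 / 300))

int <- subset(d, arm == "intervention")
act <- subset(d, arm == "active control")

wilcox.test(int$seconds, act$seconds)     # Mann-Whitney U test of completion time
t.test(int$trust_score, act$trust_score)  # primary pairwise comparison (Welch t-test)
cor(d$trust_score, d$distrust_score,
    method = "pearson")                   # correlation of trust and distrust scores
```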
Ethics
Research studies based on surveys are exempt from ethical review board approval in Denmark (waiver no. 1-10-72-138-22 from the Central Denmark Region Committee on Health Research Ethics). The study was approved by the Legal Office in the Central Denmark Region (reg. no. 1-45-70-21-23) and registered on the internal list of research projects having the Central Denmark Region as data steward (reg. no. 1-16-02-170-23). Before fielding the survey, to ensure that it was appropriate for the study population, we received feedback on the questionnaire and the “intervention” and “active control” information pamphlets from two patients who had received treatment in the Psychiatric Services of the Central Denmark Region. The respondents provided informed consent to participate in the study.
Role of the funding source
There was no funding for this study.
Results
A total of 992 invitees completed the survey (106 partial respondents were excluded). Table 1 lists their sociodemographic and clinical characteristics as well as baseline information regarding trust in institutions and technologies. See Supplementary Table 4: “Clinical characteristics of the 1,098 randomized participants” for the clinical characteristics of all 1098 randomized participants (full respondents plus partial respondents).
Note: Cell counts <5 are not specified due to risk of identification of individual patients.
a Most severe diagnosis in the period 2011–2023. (F*) indicates the ICD-10 chapter.
The randomization led to the following distribution of full respondents across arms: blank control = 319 (partial respondents = 37), active control = 343 (partial respondents = 33), and intervention = 330 (partial respondents = 36). The median response time was 280 seconds (IQR: 167) for those allocated to the blank control arm, 368 seconds (IQR: 264) for the active control arm, and 388 seconds (IQR: 244) for the intervention arm. The response times in the active control arm and the intervention arm were statistically significantly longer than in the blank control arm (active control vs blank control: W = 37551, p < 0.0001; intervention vs blank control: W = 32735, p < 0.0001) but did not differ statistically significantly from each other (active control vs intervention: W = 53222, p = 0.18).
The results of the principal component analysis are shown in Supplementary Table 5: “PCA component loadings for trust items (positively worded outcome items)” and Supplementary Table 6: “PCA component loadings for distrust items (negatively worded outcome items)” and Supplementary Figures 1 and 2. As expected a priori, the survey items were grouped into a trust component consisting of the three positively worded items and a distrust component consisting of the five negatively worded items. The trust and distrust sum scores were inversely correlated (Pearson correlation coefficient = −0.60).
Table 2 shows the responses to the items focusing on trust and distrust in ML-based clinical decision support systems across the three randomization groups. Notably, the median scores on the trust items were generally within the range of the general/institutional trust levels reported at baseline (available in Table 1).
Abbreviations: ML: Machine learning. EHR: Electronic health record. Mean differences between groups for single items and results from t-tests are shown in Supplementary Table 7.
a The full phrasing of the items and their scoring range are available in the methods section.
Figure 2 shows the results of the primary analyses, which compare the three randomization groups with regard to the trust and distrust sum scores. The intervention increased trust in ML-based clinical decision support systems when compared to the active control (mean difference in trust: 5% [95% CI: 1%; 9%], t = −2.60, df = 668.5, p = 0.0096) and the blank control arm (mean difference in trust: 4% [95% CI: 1%; 8%], t = −2.43, df = 645.0, p = 0.015). Similarly, the intervention reduced distrust in ML-based clinical decision support systems when compared to the active control (mean difference in distrust: −3% [95% CI: −5%; −1%], t = −2.60, df = 670.4, p = 0.021) and the blank control arm (mean difference in distrust: −4% [95% CI: −8%; −1%], t = 2.30, df = 646.53, p = 0.022). For both trust and distrust, there were no material or statistically significant differences between the active and the blank control arms (Trust: t = 0.208, df = 660.0, p = 0.84; Distrust: t = −0.048, df = 660.0). The equivalent results at the level of the individual trust and distrust items are listed in Supplementary Table 7: “Single items from the post-experimental questionnaire with t-tests between groups” and are in agreement with those from the analyses of the trust and distrust sum scores. Notably, the neutral item (importance of explainability of ML models) had an overall median of 10 (IQR: 8–10), with a statistically significant difference between the blank control and the intervention arms (higher in the intervention arm), but not for the other comparisons.
The results of the analyses stratified by sex, age, diagnostic group, educational level, current work status, baseline knowledge of machine learning as decision support, and the level of general trust are listed in Supplementary Table 8: “Results from intervention stratified by age, sex, diagnostic category, educational level, work status” and suggest that the intervention effect was generally consistent across the stratification groups, with a few notable exceptions. Specifically, the intervention increased trust and decreased distrust in women, but less so, or not at all, in men. Similarly, the intervention increased trust and decreased distrust for those with affective/anxiety disorders, but less so, or not at all, for those with psychotic/other disorders.
Discussion
This randomized survey experiment among patients receiving treatment in psychiatric services showed that information on ML-based clinical decision support systems may increase patient trust in such systems. Notably, the results regarding the explainability of ML models suggest that this aspect was particularly important for the respondents.
To the best of our knowledge, this study represents the first investigation into whether providing information about ML as a decision support tool may enhance patient trust in such systems, and it suggests that this is indeed the case. This result is in line with that of the study by Cadario et al., which predominantly targeted individuals from the general population and investigated the effect of receiving information on the variables (shape/size/color) driving a malignant melanoma risk prediction algorithm [Reference Cadario, Longoni and Morewedge8]. In that study, the participants received information on either human healthcare providers’ or an ML algorithm’s decision-making processes. Consistent with our study, it was found that a small amount of information about ML algorithms reduced “algorithm aversion”, that is, the reluctance to utilize algorithmic support compared to human providers [Reference Cadario, Longoni and Morewedge8]. The results from Cadario et al. also suggest that explainability is positively associated with the uptake of prediction algorithms at the general population level. This is consistent with the patient perspective observed in this study, in which the respondents agreed the most with the item stating “It is important to me that I can get an explanation of the basis on which a machine learning model recommends a given treatment.” Analogous findings have been reported in studies of patients (within, e.g., primary care, radiology, and dermatology) [Reference Reading Turchioe, Harkins, Desai, Kumar, Kim and Hermann7], clinicians (within, e.g., internal medicine, anesthesia, and psychiatry) [Reference Diprose, Buist, Hua, Thurier, Shand and Robinson23], and the general population [Reference Young, Amara, Bhattacharya and Wei5,Reference Ploug, Sundby, Moeslund and Holm6]. Taken together, these findings suggest that when developing ML models for healthcare, there should be an emphasis on explainability to ensure trust among both clinicians and patients.
The observed sex difference in the intervention effect (more effective in females than in males) may be attributable to men generally reporting higher baseline knowledge of machine learning and AI, which could have led to a ceiling effect (less “room” for increasing trust/reducing distrust in these technologies) [Reference Gillespie, Lockey and Curtis24]. That the intervention effect was more pronounced in patients with affective/anxiety disorders than in those with psychotic/other disorders is likely partly driven by the sex difference outlined above, as there are relatively more females in the former (72%) than in the latter group (63%). Furthermore, it has been reported that patients with psychotic disorders are particularly prone to reduced trust in mental healthcare professionals [Reference Verhaeghe and Bracke13], which may extend to the information received in the intervention of this study. Thus, a different type/dose of intervention may be required to reach patients with psychotic disorders.
While the results of this study are indicative of a causal effect of the information intervention, the effect was numerically quite small. This was expected, as the intervention consisted of an electronic pamphlet with only four slides of text and pictures, administered only once. Furthermore, the effect of the intervention may well be short-lived, as the trust/distrust outcome was measured immediately after the intervention. For these reasons, future studies should ideally employ more comprehensive and repeated interventions, as well as a longer interval between the intervention and the outcome measurement.
Another aspect potentially affecting the effect size of the intervention is the timing of the survey. Specifically, as the survey was fielded at the end of May 2023, it came in the aftermath of the press coverage of the open letter signed by several high-profile tech leaders calling for a pause in the development of artificial intelligence until appropriate safeguards and legislation were in place [Reference Pause Giant25]. This media coverage primarily focused on the security concerns and potential hazards associated with this emerging technology [Reference Gregory and Hern26,Reference Metz and Schmidt27]. This somewhat negative press on artificial intelligence could have rendered the respondents resistant to the information in the intervention (reducing the effect size). Conversely, it is also possible that the negative press had reduced baseline trust in the technology, leaving ample room for a positive effect of the information conveyed in the intervention (increasing the effect size). Based on the data at hand, we are, unfortunately, not able to determine the overall direction of this potential response bias.
There are limitations to this study, which should be taken into account when interpreting the results. First and foremost, the survey had a relatively low response rate, which means that selection bias may be at play. Due to the lack of clinical and sociodemographic data on those not participating, we were unable to adjust the analyses for attrition. However, in a recent two-wave survey in the same population during the COVID-19 pandemic (focusing on psychological distress/well-being among patients with mental disorders during the pandemic), we were granted force majeure access to clinical and sociodemographic data on non-respondents. This allowed us to employ inverse probability weighting to address the potential bias arising from non-response. Notably, this adjustment had no material impact on the results [Reference Kølbæk, Jefsen, Speed and Østergaard16,Reference Kølbæk, Gil, Schmidt, Speed and Østergaard28]. While this does not preclude selection bias in the present study (a different survey topic and timing), it does suggest that such bias is unlikely to be a substantial problem. Second, we did not use a validated questionnaire for measuring trust and distrust in ML as a clinical decision support tool, as such questionnaires have, to our knowledge, not yet been developed. With this inherent limitation in mind, we believe that the latent components of trust and distrust, derived from principal component analysis of items with apparent face validity, represent reasonable outcome measures. Third, we did not include attention-check questions, which means that some participants may have responded inconsistently or arbitrarily. This would have introduced noise in the data and complicated signal detection. Yet, a signal (the intervention increasing trust and reducing distrust) was indeed detected. Fourth, with regard to generalizability, it should be borne in mind that Denmark is among the most digitalized countries in the world [29], and its inhabitants may therefore be more positive towards new technology. Thus, replication of the reported findings in other countries is warranted.
In conclusion, this survey experiment suggests that receiving information on ML-based clinical decision support systems in hospital psychiatry likely increases patient trust in such systems. This is compatible with results from studies of other patient populations, clinicians, and the general population. Taken together, the literature suggests that providing appropriate information to patients will be important when implementing ML-based clinical decision support systems.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1192/j.eurpsy.2024.1790.
Data availability statement
The data cannot be shared as the participants have not consented to data sharing.
Acknowledgments
The authors are grateful to Bettina Nørremark for data management, to Anders Helles Carlsen and Maria Speed for statistical support, and to Torben Schmidt Kjeldsen for the graphical design of the survey experiment (all are affiliated with the Central Denmark Region).
Financial support
The study is supported by grants from the Lundbeck Foundation (grant number: R344-2020-1073), the Danish Cancer Society (grant number: R283-A16461), the Central Denmark Region Fund for Strengthening of Health Science (grant number: 1-36-72-4-20), and the Danish Agency for Digitisation Investment Fund for New Technologies (grant number 2020-6720) to Østergaard. Outside this study, Østergaard reports further funding from the Lundbeck Foundation (grant number: R358-2020-2341), the Novo Nordisk Foundation (grant number: NNF20SA0062874), and Independent Research Fund Denmark (grant numbers: 7016-00048B and 2096-00055A). These funders played no role in the design or conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Author contributions
All authors contributed to the conceptualization and design of the study. EP and SDØ conducted and supervised the data collection. Data management and verification was performed by EP and MB. Statistical analysis was carried out by EP. All authors contributed to the interpretation of the obtained results. EP wrote the first draft of the manuscript, which was subsequently revised for important intellectual content by the other authors. All authors approved the final version of the manuscript before submission.
Competing interest
AAD has received a speaker honorarium from Otsuka Pharmaceutical. SDØ received the 2020 Lundbeck Foundation Young Investigator Prize and SDØ owns/has owned units of mutual funds with stock tickers DKIGI, IAIMWC, SPIC25KL, and WEKAFKI, and owns/has owned units of exchange-traded funds with stock tickers BATE, TRET, QDV5, QDVH, QDVE, SADM, IQQH, IQQJ, USPY, EXH2, 2B76, IS4S, OM3X, and EUNL. The remaining authors have no financial interests to disclose.