Attempts to apply artificial intelligence (AI) and machine learning to the detection of psychiatric disorders have yielded only moderate accuracy, owing to small effect sizes and high heterogeneity.1 Nevertheless, improving prediction models by incorporating clinical assessments seems to enable clinical applications.2 However, a significant challenge arises from the nature of clinical data: medical free text, especially in psychiatry, encapsulates a wealth of information about a patient's pathology and well-being by unveiling the structure of their thinking and feeling. This information is vital but often remains inaccessible for scalable analysis because of its unstructured nature. The inability to analyse this text effectively on a large scale potentially leads to missed opportunities in clinical decision-making and research.
Recent studies have emphasised the significant impact of advanced technology on managing unstructured medical data.3 Specifically, the use of large language models (LLMs) has garnered significant attention.4 Unlike previously used methods of natural language processing (NLP) that require decomposing the text and substantial feature engineering,5 LLMs are AI models primarily designed to understand and generate text.6 They are trained on vast amounts of text data, allowing them to learn the statistical patterns and relationships within language.7
Suicidality accounts for nearly half of all emergency psychiatric admissions,8 and suicide is one of the most tragic, yet often preventable, complications of psychiatric care. Sustained efforts can lead to major reductions in in-patient suicides, from 4.2 to 0.74 per 100 000 admissions.9 Here, we hypothesise that automated tools could help identify in-patient suicide risk using underexploited clinical records. Moreover, beyond clinical application, LLMs might automatically identify and extract suicidality from electronic health records (EHRs) to enhance research.
Method
We systematically extracted n = 100 randomly selected text-based admission notes of in-patients treated in and discharged from the acute psychiatric ward of the Department of Psychiatry and Psychotherapy at the University Hospital Carl Gustav Carus Dresden between 1 January and 31 December 2023. A typical, though fictitious, account (to preserve privacy) can be found in the Supplementary material available at https://doi.org/10.1192/bjp.2024.134. We included 54 female and 46 male patients with an average age of 50 years (range 18–96 years, s.d. = 23.8 years). The most prevalent ICD-10 main diagnoses were major depressive disorder (21%), psychotic disorders (20%) and dementia (17%) (Table 1). Suicidality evaluation is part of the unedited input data, as its assessment is a required standard of care.10 However, this assessment is generally not documented in our EHRs in a structured way. Instead, the rater describes their impression, for example stating that no suicidal ideation was apparent. Variations in expressing this assessment (sometimes without mentioning ‘suicidal intent’ at all) and negations are common (e.g. ‘(no) reason to assume suicidal ideation’, ‘suicidal intent (not) clearly ruled out’, ‘wish to be dead present’), which reduced efficiency in earlier NLP assessments.11 We ensured data privacy by installing Llama-2 via the llama.cpp framework on a local hospital computer. We extracted the suicidality status from psychiatric admission notes using three different Llama-2-based models: the standard English Llama-2-70b chat model, adapted to allow deployment on low-resource consumer hardware,12 as well as two versions of Llama-2 specifically fine-tuned for the German language (‘Sauerkraut’13 and ‘Emgerman’14). We compared the models’ results with a ground truth consensus established by a resident (F.G.V.)
and a consultant psychiatrist (P.M.) as a binary variable (suicidal/not suicidal) (Fig. 1). Suicidality was defined as suicidal thoughts, ideation, plans or a suicide attempt identified at hospital admission. We applied a step-by-step approach to prompt engineering, as prompt engineering can substantially improve the performance of LLMs.15 The first prompt simply asked about suicidality in the reports (P0). In the second prompt, we added fictitious examples and explanations, starting with one example (P1) and adding one example at a time (P2), up to a maximum of three examples (P3) (Supplementary Table 1). After achieving improved performance, we incorporated a chain-of-thought approach (Fig. 1(c)). For this, the model was prompted to identify whether a patient exhibited suicidal thoughts and to provide an explanation based on the given input. Subsequently, the model's output – specifically its reasoning about suicidality – was used as the basis for a second prompt. In this subsequent interaction, the model was tasked with providing a binary response (true or false) regarding the presence of suicidality (P4). To obtain reliable estimates, we used bootstrapping, a statistical resampling technique, with 10 000 iterations.
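The two-step chain-of-thought procedure described above (P4) can be sketched as follows. This is an illustrative sketch, not the authors' code: `ask_llm` is a hypothetical placeholder for a call to the locally deployed Llama-2 model, and the prompt wording is paraphrased rather than the exact prompts used in the study (Supplementary Table 1).

```python
# Sketch of the two-step chain-of-thought classification (prompt P4).
# `ask_llm` is a hypothetical stand-in for a call to a locally deployed
# Llama-2 model (e.g. via llama.cpp); the prompt text is illustrative only.

def classify_suicidality(note: str, ask_llm) -> bool:
    """Two-step chain-of-thought classification of one admission note."""
    # Step 1: ask the model to reason in free text about suicidality.
    reasoning = ask_llm(
        "You are an attentive medical assistant with specialised knowledge "
        "in psychiatry. Does the following admission note indicate suicidal "
        "thoughts, ideation, plans or an attempt? Explain your reasoning.\n\n"
        + note
    )
    # Step 2: feed the model's own reasoning back and force a binary answer.
    verdict = ask_llm(
        "Based on the following assessment, is the patient suicidal? "
        "Answer strictly 'true' or 'false'.\n\n" + reasoning
    )
    return verdict.strip().lower().startswith("true")
```

Separating free-text reasoning from the final binary verdict mirrors the study's design, in which the model's explanation is generated first and only then reduced to a true/false label.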
All research procedures were conducted in accordance with the Declaration of Helsinki. Ethics approval was granted by the ethics committee of Technical University Dresden (reference number BO-EK-400092023). Informed consent was not necessary for this study because the research involved data from which all personal identifiers had previously been removed. The design of the study ensured that there was no interaction or intervention with participants and no potential for harm or invasion of privacy.
Results
Llama-2 extracted suicidality status from psychiatric reports with high accuracy across all five prompt designs and all three models tested. The highest overall accuracy was achieved by one of the German fine-tuned Llama-2 models (‘Emgerman’), which correctly identified suicidality status in 87.5% of the reports. With a sensitivity of 83.0% and a specificity of 91.8%, it demonstrated the highest balanced accuracy of all models (87.4%) (Fig. 2(a)).
The confusion matrix (Fig. 2(b)) also highlights areas for model improvement, particularly in reducing false negatives. To improve performance, we developed five different prompting strategies, each tested with all three models (Fig. 2(c)). The simplest prompt, which contained only a ‘system prompt’ framing the model in its role (‘You are an attentive medical assistant with specialised knowledge in psychiatry [ … ]’), one report at a time and the ultimate question of interest (‘Is the patient suicidal? Answer yes or no [ … ]’), yielded the highest sensitivity in the German fine-tuned Llama-2 model ‘Sauerkraut’ (sensitivity: 87.5%, specificity: 61.2%, balanced accuracy: 74.4%). It was closely followed by the standard English Llama-2 chat model, with a sensitivity of 85.1%, a specificity of 63.0% and a balanced accuracy of 74.1%. The Emgerman model had a markedly lower sensitivity (42.6%) but the highest specificity (98.8%). Not all models improved when examples were added to the prompt to allow for in-context learning. The Emgerman model improved substantially as more examples were added, with the lowest balanced accuracy for the prompt with no examples (66.2%) and the highest for the prompt with three examples (87.4%). The English model was robust, showing similar balanced accuracies for prompts with no, one, two or three examples (P0: 74.1%; P1: 73.3%; P2: 79.3%; P3: 80.3%). The ‘Sauerkraut’ model improved with added examples but achieved its maximum performance with two examples in the prompt. The chain-of-thought approach did not improve performance (sensitivities: ‘Emgerman’ P4 17.0%, ‘English’ P4 63.8%, ‘Sauerkraut’ P4 81.0%; specificities: ‘Emgerman’ P4 75.5%, ‘English’ P4 63.3%, ‘Sauerkraut’ P4 77.6%) (Table 2). In fact, all models deteriorated except the ‘Sauerkraut’ model, which was not negatively affected by this approach.
PPV, positive predictive value; NPV, negative predictive value.
a. All results were obtained by 10 000-fold bootstrapping, and therefore means and standard deviations are given.
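As a minimal sketch (not the authors' code), the reported metrics and their bootstrap estimates can be reproduced from binary labels and predictions as follows; all function and variable names are illustrative:

```python
import random

def sens_spec_balacc(y_true, y_pred):
    """Sensitivity, specificity and balanced accuracy from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    sens = tp / (tp + fn)   # true positive rate
    spec = tn / (tn + fp)   # true negative rate
    return sens, spec, (sens + spec) / 2

def bootstrap_balacc(y_true, y_pred, n_iter=10_000, seed=0):
    """Mean and s.d. of balanced accuracy over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        # resample case indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        if len({y_true[i] for i in idx}) < 2:
            continue  # skip degenerate resamples containing only one class
        stats.append(sens_spec_balacc([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx])[2])
    mean = sum(stats) / len(stats)
    sd = (sum((s - mean) ** 2 for s in stats) / len(stats)) ** 0.5
    return mean, sd
```

Resampling the 100 notes with replacement 10 000 times and recomputing the metric on each resample, as in the study, yields the means and standard deviations reported in Table 2.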
Discussion
We show that large language models (LLMs) demonstrate remarkable efficacy in identifying and extracting references to suicidality from psychiatric reports. Their performance, in terms of both sensitivity and specificity, was notable and improved progressively with the number of examples provided in the prompt. These findings suggest a significant advancement in the field, highlighting the potential of LLMs to revolutionise the way psychiatric medical text is analysed. In contrast to traditional natural language processing (NLP) methods, which require extensive annotation or model training, our approach uses the inference capabilities of foundation models and is applicable to comparatively small data-sets. The real-life clinical data, taken from an acute care ward in a maximum care facility in a German urban centre, were processed at the ‘edge’ – with no need to upload to commercial servers or a data-processing cloud – by an open-source model on local servers. This enables a privacy-sensitive data protection strategy in a closed loop that alleviates concerns about data leaving the care provider's control.
The good performance levels (Fig. 2), even in a (medical) domain in which the LLM was not fine-tuned, suggest even greater opportunities with further optimisation for mental health, for example in dealing with physician-level linguistic idiosyncrasies or abbreviations.16 For a clinical application such as suicide risk detection, where false negatives are likely to lead to detrimental outcomes, sensitivity should approach 100%, even at the cost of detecting more false positives, which can be resolved by further human evaluation to ensure no case is missed. In any clinical setting, the final risk assessment remains a matter for the judgement of the experienced clinician, and further research needs to elucidate risks and challenges. On the other hand, in the case of data extraction for research purposes, correctly identifying 80% of cases (i.e. a classification accuracy of 80%) might be adequate to capture a representative cohort. In comparison, randomised clinical trials for major depression may include fewer than a quarter of cases from real-life clinical cohorts, owing to strict eligibility criteria.17
Other clinical applications could include prediction and early warning of deterioration in symptom severity and a subsequent need for escalation of therapy, such as involuntary admission, restraint or forcible medication. Multiprofessional communication in interdisciplinary care provider teams including nurses, specialty therapists, psychotherapists and psychiatrists might also become more efficient, for example owing to reduced information loss during handover or case conferences.
Strengths and limitations
Our approach allows an out-of-the-box application, whereas classic NLP approaches require time-consuming training and data annotation20 and show limited21,22 or comparable performance.23 In addition, the performance of the basic language models, trained on large corpora of a variety of text data not specific to our data-set, suggests good generalisability.18 The comparatively small need for computational resources, since they are used only for inference, not for specific training,19 allows for easy application at the point of origin of the data and may therefore be more scalable than classic NLP approaches.
The potential generalisability of our approach is supported by the fact that many physicians were involved in the creation of the clinical letters, so it is highly unlikely that the notes used reflect the personal style of any particular resident. On average, 50% of the patients on our acute ward were admitted during night shifts, which are covered by a rotating pool of around 20 residents; acute ward residents themselves rotate on a 3- to 6-month basis. However, we acknowledge that a clinic-specific style may play a role. As a next step, reproducibility should be tested on a larger external validation sample.
Suicide risk was treated as a binary parameter. Future research should concentrate on a more detailed outcome that differentiates between overall suicide risk and acute high risk.24 Additionally, studies should apply extensive ground truth labelling25 and evaluate more comprehensive prompt engineering strategies.26 However, our results suggest that, at least in the case of Llama-2, more complex prompting with a chain-of-thought approach might degrade performance. For some tasks, simple example prompting that requires very few computing resources may be more suitable.
Although patient privacy concerns have been addressed, it is important to note that every LLM approach inherits ethical issues related to bias, trust, authorship and equitability.27 Expert guidelines for the development of LLMs for medical purposes should be carefully considered.28
Conclusions
We provide a proof-of-concept analysis for automated extraction of references to suicidality in in-patients from EHRs using LLMs. This study highlights the transformative potential of using LLMs to detect suicidality in clinical admission notes. The use of a psychiatry-naive model, not specifically fine-tuned to the relevant data-sets, shows high performance, which is promising for generalisability and offers potential for further improvement through more extensive in-context learning and prompt engineering. Possible applications include early warning and surveillance tools for psychiatric emergencies, preventing information transfer failures, quality assurance and evaluation of psychiatric symptoms on large clinical ‘real-world’ samples.
Supplementary material
Supplementary material is available online at https://doi.org/10.1192/bjp.2024.134.
Data availability
The data used in this study are not available for sharing because they contain information that could compromise the privacy of the research participants. The source code necessary for replicating our procedures is openly available to other researchers at https://github.com/I2C9W/LLM4Psych/tree/v0.1.0.
Author contributions
F.G.V., P.M. and I.C.W. conceptualised the study and developed the methodology in close coordination with J.N.K.; I.C.W. developed the scripts and ran the experiments; F.G.V., I.C.W., M.B., A.P., U.L. and P.M. wrote and reviewed the initial manuscript. All authors refined the draft. P.M., U.L., A.P., M.B. and J.N.K. provided supervision and resources for the project.
Funding
J.N.K. is supported by the German Federal Ministry of Health (DEEP LIVER, ZMVI1-2520DAT111), German Cancer Aid (DECADE, 70115166), the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; SWAG, 01KD2215A; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan), the German Academic Exchange Service (SECAI, 57616814), the German Federal Joint Committee (TransplantKI, 01VSF21048), the European Union's Horizon Europe and innovation programme (ODELIA, 101057091; GENIAL, 101096312), the European Research Council (ERC; NADIR, 101114631) and the National Institute for Health and Care Research (NIHR, NIHR213331) Leeds Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. F.G.V. was supported by the Federal Ministry of Education and Research (PATH, 16KISA100k). P.M. and A.P. were supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant number GRK2773/1-454245598. This work was funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Declaration of interest
J.N.K. declares consulting services for Owkin, France, DoMore Diagnostics, Norway, Panakeia, UK, Scailyte, Switzerland, Cancilico, Germany, Mindpeak, Germany, MultiplexDx, Slovakia, and Histofy, UK; furthermore he holds shares in StratifAI GmbH, Germany, has received a research grant by GSK, and has received honoraria by AstraZeneca, Bayer, Eisai, Janssen, MSD, BMS, Roche, Pfizer and Fresenius. I.C.W. received honoraria from AstraZeneca. U.L. participated in advisory boards and received honoraria by Janssen Cilag GmbH.