1. Introduction
Negation is a linguistic phenomenon that reverses, tones down, or intensifies the truth value of a linguistic unit (proposition, phrase, or word) that undergoes negation (Martí et al., Reference Martí, Taulé, Nofre, Marsó, Martín-Valdivia and Jiménez-Zafra2016). According to the literature, about half of the natural language statements in the clinical domain feature some sort of negation (Chapman et al., Reference Chapman, Bridewell, Hanbury, Cooper and Buchanan2001). Negation detection plays a crucial role in clinical text mining by identifying instances where negations modify affirmed clinical entities, including findings, diseases, procedures, body parts, and drugs. When clinical entities, such as symptoms and diagnoses, are negated, their validity is compromised. For instance, phrases like “no cough” or “no fever” indicate the absence of these symptoms (Dalianis, Reference Dalianis2018).
A negated statement is made up of three components, namely, the negation cue, the negation scope, and the negation reference, which can lie outside the negation scope. While this study focuses on Spanish, Figure 1 illustrates this with an example in English:
Negation cues are understood as words, prefixes, suffixes, or morphemes that indicate negation. A negation scope refers to words that are influenced by a cue. An out-of-scope negation reference is the part of the text to which the negation refers and which is outside its scope.
The high frequency of negation in electronic health records, clinical trials, drug prescriptions, case studies, discharge summaries, etc. has led researchers to investigate automatic methods for its processing and representation for over two decades, given the crucial role of text mining and natural language inference in medical knowledge systems. Negation has gained special attention in the last decade, especially after the emergence of various corpora tagged with negation cues and their corresponding scopes, because a correct interpretation of medical information is paramount to help reduce medical errors and strengthen decision support. From a computational perspective, the problem has been widely studied in English, and only a few years ago did works emerge to address the issue in Spanish (see, e.g., Jiménez-Zafra et al. (Reference Jiménez-Zafra, Díaz, Morante and Martín-Valdivia2019)), despite it being the second language with the most speakers in the world, according to the Cervantes Institute. After the works of Vincze et al. (Reference Vincze, Szarvas, Farkas, Móra and Csirik2008) and Morante and Blanco (Reference Morante and Blanco2012), new datasets appeared (see a detailed list in Tables 7 and 8 for Spanish) that allow for tackling the problem of identifying negation cues and their scopes automatically, following diverse approaches, which in most cases treat the problem under a token classification scheme.
Negation cues can be easily identified with a list of words or even with a machine learning classifier, but this makes it a domain- or language-dependent task (Fancellu et al., Reference Fancellu, Lopez and Webber2016). There are also more complex challenges, such as multi-word and discontinuous negation cues, which limit the performance of word lists. Notwithstanding the complexity of the task, negation cue and scope detection have achieved very competitive results. Additionally, the relevance of identifying negations to improve clinical text mining has been shown in many studies (Mutalik, Deshpande, and Nadkarni (Reference Mutalik, Deshpande and Nadkarni2001); Deléger and Grouin (Reference Deléger and Grouin2012); Santiso et al. (Reference Santiso, Casillas, Pérez, Oronoz and Gojenola2014); Casillas et al. (Reference Casillas, Pérez, Oronoz, Gojenola and Santiso2016); Gkotsis et al. (Reference Gkotsis, Velupillai, Oellrich, Dean, Liakata and Dutta2016)). Surprisingly, however, the literature (Morante and Blanco (Reference Morante and Blanco2021); Mahany et al. (Reference Mahany, Khaled, Elmitwally, Aljohani and Ghoniemy2022)) shows that little to no attention has been paid to identifying negation references located outside the negation scope, even though they account for 42.8% of negations, as we show below (see an example of this phenomenon in Figure 1). Considering that a negated statement is a semantic unit, leaving out any of its essential components during its treatment yields senseless, incomplete, or counterfactual chunks. Table 1 illustrates this by using the example in Figure 1 above:
Of these three scenarios, the only one that is actually an issue in current systems is number 1 in Table 1 above. The practical impact of this flaw on medical data processing is, however, far from minor. Table 2 shows evidence and examples of negation cues and scopes that are currently detected (left column) as well as the information that is not (right column), because it appears in the source text in the form of what we call out-of-scope references (OSRs).
This work's goal is twofold. First, we aim to identify OSRs in order to restore the sense of truncated negated statements such as the ones in Table 2 above. To this end, we augment the NUBES dataset (Lima et al. Reference Lima, Pérez, Cuadros and Rigau2020) with OSR annotations. NUBES is the largest clinical Spanish dataset available annotated with negation and uncertainty, and we extend it into NeRUBioS, the first negation and uncertainty clinical Spanish dataset with OSR annotations. We propose that the dataset allows for (1) a quantification of the OSR phenomenon and (2) using token classification to link negation scopes and their OSRs, as most of them occur in the same sentence and ambiguity is negligible (1.2%). Second, we fine-tune five BERT-based models and use transfer learning to jointly identify negation cues and scopes, their respective OSRs, and uncertainty. Our best model maintains state-of-the-art performance at negation and uncertainty detection and, at the same time, sets a competitive baseline for OSR identification and linking.
The remainder of the paper is organized as follows. Section 2 presents a survey of related works grouped by the approaches used for negation and uncertainty/speculation detection. This section also includes details of the most prominent datasets used in the field. Section 3 describes our dataset as well as our methodology for the annotation process, and provides fine-grained statistics of NeRUBioS. This section also features a description of the transfer learning experiments to jointly tackle negation and uncertainty detection as well as OSR identification and linking. Results and discussion are reported in Section 4. Finally, Section 5 wraps up with conclusions.
2. Related works
According to our review of the field, the present work is the first contribution to tackle the OSR detection and linking problem; hence the lack of related works about this specific task in this section. However, as OSRs are necessarily tied to negated statements, this study builds on a long tradition of works on negation detection.
Our survey shows that two particular aspects seem to determine the performance of negation detection systems, namely the difficulty of the problem and the language of the dataset. The task has become more complex over time. While the first systems focused mostly on negation cues, later, more sophisticated works added negation scopes and uncertainty cues and scopes to the task. This is probably why researchers achieve high F1 scores at the beginning of each generation of systems, regardless of the year or the approach. For example, a rule-based approach in 2001 (Mutalik et al., Reference Mutalik, Deshpande and Nadkarni2001) reports an F1 score of 0.96, and 22 years later, Mahany, Khaled, and Ghoniemy (Reference Mahany, Khaled and Ghoniemy2023) also report an F1 score of 0.96 using an architecture that combines deep-learning techniques.
Systems seem to lose power as awareness of the problem's nuances and challenges increases, though. This can be seen in a consistent decrease in systems' performance by approach over time, except for modern deep-learning systems, which seem to maintain high performance. Figure 2 shows this for the works listed in this section's tables, by approach.
With these generalities in mind, we briefly discuss each of the tables below, which list relevant works by approach in chronological order.
2.1 Rule-based approaches
By 2000, part-of-speech tagging had reached competitive performance. This enabled rule-based approaches to use regular expressions, syntactic and dependency parsing, and lists of negation cues for negation detection. Biomedical datasets were not as big as they are today, and they were available mainly in English, with the exception of Cotik et al. (Reference Cotik, Stricker, Vivaldi and Rodríguez Hontoria2016) and Solarte-Pabón et al. (2021b). Speculation was introduced under this approach by Apostolova, Tomuro, and Demner-Fushman (Reference Apostolova, Tomuro and Demner-Fushman2011) and Cotik et al. (Reference Cotik, Stricker, Vivaldi and Rodríguez Hontoria2016) (see Table 3). The decreasing trend in these systems' performance over time (see Figure 2) reflects the increasing complexity of the problem, which probably motivated the next generation of works using machine learning. It is interesting to note that rule-based approaches were still being used for years after the first machine learning works were reported for the task in 2008, as shown below.
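To make the core mechanism of this generation concrete, the toy sketch below shows the kind of cue-list and regular-expression matching that rule-based systems relied on; the cue list and the sentence are our own illustrative examples, not taken from any cited system.

```python
# A toy sketch of rule-based negation-cue matching with a small cue list and a regular
# expression. The cues and the example sentence are illustrative only.
import re

cues = ["no", "sin", "ausencia de", "niega"]                 # tiny Spanish cue list
pattern = re.compile(r"\b(" + "|".join(map(re.escape, cues)) + r")\b", re.IGNORECASE)

sentence = "Paciente sin fiebre y niega dolor torácico."
for match in pattern.finditer(sentence):
    print(match.group(0), match.span())                      # matched cue and its offsets
```

Real systems of this kind additionally used syntactic or dependency information and context windows to delimit the corresponding negation scope.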
2.2 Classic machine learning
The complex configuration of negation and speculation scopes as well as increasing computational capabilities encouraged the use of machine learning techniques (see Table 4). In this generation, speculation has more prominence alongside negation, and more works in Spanish are reported. However, speculation detection is not as promising yet. Most techniques in this approach involve supervised learning using linguistic features. These features are often generated using the data and the knowledge built through rule-based techniques during the first years of research on negation detection.
2.3 Hybrid approaches
As classic machine learning seemed to lag behind new challenges in negation detection, hybrid proposals emerged (see Table 5). This approach consists of mixing machine learning with some sort of pre- or post-processing or reinforcement learning. All of the works reported here use English and, for some reason, leave out uncertainty/speculation. Overall, the results did not represent as big a jump over machine-learning-only approaches as expected, which led the scientific community to move to the next paradigm, described below.
2.4 Deep learning
Deep learning applied to language processing through large language models (LLMs) truly created a new paradigm. While the rule-based, machine learning, and hybrid approaches cited here overlap in time to a certain extent, competing for good results in negation detection, deep learning has drawn almost exclusive interest from the community due to consistent and improving results on this and other tasks over time (Table 6). This can be seen in Figure 3, which shows all of the related works listed above by year. The figure shows, for example, that there is a publication time overlap of 8 years between rule-based and machine learning works and of 6 years between machine learning and hybrid methods. The overlap between classical machine learning and deep-learning publications, in turn, is only 6 years, and there is virtually no overlap between deep learning and the hybrid generation. This overlap, however, could be even smaller if we consider that the early works we cite do not exactly use LLMs but word embeddings and other neural network algorithms that came before the advent of BERT and related LLMs.
The number of deep-learning works in Spanish and other languages has increased dramatically, as can be seen in Jiménez-Zafra et al. (Reference Jiménez-Zafra, Morante, Martín-Valdivia and Lopez2020), who did a thorough review of datasets for negation in Spanish and other languages including this and other approaches. This is mostly due to the availability of models that have been trained on multilingual corpora, such as mBERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2018) (see e.g., Solarte-Pabón et al. (Reference Solarte-Pabón, Montenegro, Blazquez-Herranz, Saputro, Rodriguez-González and Menasalvas2021a)), or on datasets in the specific language. Table 7 shows a sample of the most representative datasets in Spanish for the clinical/biomedical domain. Some of these datasets are not currently annotated with negation labels, but we include them here for reference as they can potentially be used for further dataset construction. Likewise, Table 8 lists Spanish datasets used for negation detection in other domains. The third column in each table highlights in bold when the dataset contains negation or uncertainty annotations, besides other types of labels, as well as the number of labels per type.
DIANN corpus available at: http://nlp.uned.es/diann/
3. Methodology
This work follows a data-based approach which is at the core of transfer learning using large language models. Therefore, in this section we describe the dataset used and its annotation process as well as the five pre-trained models tested and the fine-tuning parameters utilized to optimize the results for each of the tackled tasks.
3.1 The dataset
Fine-tuning a pre-trained language model requires supervised learning, hence the need for a dataset annotated for the relevant task. While there are datasets available for negation and uncertainty detection, there is no dataset for OSR detection and linking. In this section we describe how we augmented an existing dataset with OSR annotations and report the experiments carried out with fine-tuned, multi-task language models for negation detection, uncertainty detection, and OSR identification and linking.
For the OSR identification and linking task, we created NeRUBioS, an augmented version of NUBES (Lima et al., Reference Lima, Pérez, Cuadros and Rigau2020), which is the largest publicly available corpus annotated for negation and uncertainty in Spanish in the biomedical domain. NeRUBioS (Negation, References, and Uncertainty in Biomedical Texts in Spanish), the first dataset of its kind, resulted from manually annotating NUBES with OSR tags (see Figure 4).
NeRUBioS contains 18,437 sentences from anonymized health records, which total 342,788 tokens and 32,562 types. Of these sentences, 7,520 contain negated statements, out of which 3,606 (47.9%) feature OSRs. Figure 5 shows the number of sentences per partition. We follow Lima et al. (Reference Lima, Pérez, Cuadros and Rigau2020) for this distribution of samples across partitions, that is, 75% of the dataset for training, 10% for development, and 15% for testing, in order to allow for a comparative analysis of each task using NUBES vs. NeRUBioS. For more statistics about NeRUBioS, see Table 9 below.
3.2 Tagset
In order to be consistent with the annotation scheme inherited from NUBES, NeRUBioS' OSRs were tagged using the BIO scheme (Ramshaw and Marcus, Reference Ramshaw and Marcus1999). Most OSRs and their negation scopes are in the same sentence, which enables the training of a model to make the linking inference between a negation and its OSR by tackling the task as a token classification problem.
After annotating every OSR, the dataset contains 11 labels (see Table 10 above). The bilingual example below illustrates the use of this tagging (Spanish version in italics):
Figures 6 and 7 show the frequency of each label across the three partitions of the dataset. Due to space limitations, the figures do not show the frequencies of the label "O," which are as follows: training = 202,364, development = 27,927, and testing = 39,674. Likewise, Figures 6 and 7 highlight the imbalance in the number of labels in the dataset, which adds to the complexity of each task. The figures reveal some relevant facts per dataset partition, for example, that most of the OSRs (NegREF) are multi-word sequences while most of the negation cues (NEG) are one-word strings. Negation scopes (NSCO) are considerably longer than the rest of the classes. Likewise, multi-word uncertainty cues (UNC) are slightly more frequent than one-word uncertainty cues, but both are much shorter than uncertainty scopes (USCO), which are also mostly long multi-word sequences.
From Figure 6, an important additional observation can be made: around 42.8% of negation references lie outside the negation scope of the negated statement (40.7% in training, 44.7% in development, and 42.9% in testing). This evidence, which we illustrate in Figure 8, backs up our goal of creating a dataset and fine-tuning a model for OSR identification and linking. This would contribute to restoring the three essential components of a negated statement (i.e., OSR, negation cue, and negation scope).
3.3 Annotation process and guidelines
This section describes the dataset annotation process used to augment NUBES into NeRUBioS. As mentioned above, we address negation references that have so far been left out by negation identification systems because they are outside the scope of the relevant negation cue (see the example in Figure 1). We therefore follow Lima et al. (Reference Lima, Pérez, Cuadros and Rigau2020) to annotate OSRs and add them to NeRUBioS together with the annotations inherited from NUBES. The dataset was manually annotated by a computational linguist and native speaker of Spanish and was thoroughly checked several times to ensure annotation consistency and terminology accuracy with the support of medical terminology experts. Likewise, as we use the methodology and annotations of Lima et al. (Reference Lima, Pérez, Cuadros and Rigau2020) as an anchor to add OSR labels, we leverage and assume their inter-annotator agreement rate, which averages 85.7% for negation and uncertainty cues and scopes, to ensure the annotation reliability of NeRUBioS.
The main challenges posed by this task are:
(1) The semantic or syntactic role of OSRs in the sentences is unpredictable, which rules out any sort of rule-based assisted annotation.
(2) Most OSRs are multi-word sequences, which may include conjunctions and commas. For example:
- abdomen blando y depresible (soft and depressible abdomen)
- molestia abdominal infraumbilical, difusa, constante (diffuse, constant, infraumbilical abdominal discomfort)
(3) OSRs may be discontinuous. For example, in the sentence below, the string nódulos en CV comes between two parts of an OSR (i.e., Fibroinolaringoscopia and cierre):
(4) OSRs and their negations are not always adjacent.
(5) There can be more than one valid OSR for a single negation in the same sentence. If this is the case and the OSRs are of different types, we call them mixed (see Table 11).
(6) OSRs may fall within the scope of another class, such as an uncertainty scope.
(7) There can be ambiguity when the same sentence includes two or more negations.
We found that the last challenge represents only 1.2% of the OSRs in the dataset and was therefore not addressed. The first six challenges were tackled by following the annotation guidelines below.
As a general rule, to delimit the boundaries of an OSR, we identify its head (mostly nouns) and extend it bidirectionally until the maximal syntactic unit is covered (e.g., noun phrase or verb phrase). That is, this unit must include its attributes or modifiers. Determiners such as definite articles, however, are annotated as “O,” that is, outside of the OSR (e.g., the_O admitted_B-NegREF patient_I-NegREF). Likewise, during the OSR annotation process, when an OSR fell within an uncertainty scope, priority was given to the OSR annotation and uncertainty labels were removed.
With this general approach in mind, we manually annotated 13,193 tokens for a total of 3,858 OSRs in the dataset. As the annotation progressed, a pattern emerged: every OSR comes before the negation, whether adjacent to it or not. This is because, when the negation reference comes after the negation, it is generally captured by the negation scope. The pattern of a negated statement with an OSR, therefore, is as follows:
Out-of-scope reference + other possible items + negation cue + [negation scope]
This pattern helped with the identification of categories the OSRs fall into. Roughly, these categories are diseases (findings), body parts (organs, tissues, etc.), types of tests or examinations, individuals or groups of people, medications, treatments, procedures, actions, and combinations of them (i.e., mixed OSRs). Table 11 lists bilingual examples of each category. The ellipsis ( $\ldots$ ) between an OSR example and its negation indicates that they are not adjacent, and the plus ( + ) sign between categories means a mixed OSR.
3.4 Fine-tuning experiments
We used NeRUBioS to fine-tune a number of state-of-the-art Large Language Models (LLMs) based on BERT (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2018). These models were retrained to tackle three tasks at a time using transfer learning, that is, negation detection, speculation detection, and OSR detection and linking with its negation scope. We report the results of five representative models that we utilized for the fine-tuning process:
- RoBERTa (Carrino et al., Reference Carrino, Armengol-Estapé, Gutiérrez-Fandiño, Llop-Palao, Pàmies, Gonzalez-Agirre and Villegas2021). The version of RoBERTa that we used is an LLM pre-trained on the medical domain. The corpus used during its training consists of a collection of documents in the biomedical-clinical domain from several sources in Spanish.
- mBERT (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2018). A version of BERT trained on large collections of documents written in 104 different languages, including Spanish. mBERT was trained to predict both the next sentence and masked words.
- BETO (Cañete et al., Reference Cañete, Chaperon, Fuentes, Ho, Kang and Pérez2020). A Spanish version of BERT trained from scratch on a large corpus. The model is similar in size to BERT-base and was trained using the word masking technique.
- RoBERTa-BNE (Gutiérrez-Fandiño et al., Reference Gutiérrez-Fandiño, Armengol-Estapé, Pàmies, Llop-Palao, Silveira-Ocampo, Pio Carrino, Armentano-Oller, Rodriguez-Penagos, Gonzalez-Agirre and Villegas2022). A general-domain Spanish version of the RoBERTa architecture, pre-trained on a 570 GB corpus obtained from the Spanish National Library (BNE).
- XLM-RoBERTa (Conneau et al., Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020). A model pre-trained on the general-domain 2.4 TB CommonCrawl corpus in 100 languages.
In order to jointly tackle these tasks as a token classification problem, each model's prediction layer was modified. Figure 9 shows the architecture we used to fine-tune the models. The architecture receives a sentence as input, carries out a subword tokenization process (RoBERTa's tokenization is shown), transforms the tokens into embeddings that encode each token's position in the sentence, and generates a contextualized representation for each token by applying multi-head self-attention. The representation of the input is then passed to a linear layer, which predicts a label for each token. Finally, a decoding step concatenates subwords to obtain the output.
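The following minimal sketch illustrates this prediction-and-decoding flow with the Hugging Face interface. The checkpoint name is an assumption (the clinical Spanish RoBERTa), and because the token-classification head is not fine-tuned here, the printed labels are placeholders; the point is only to show how per-subword predictions are merged back into word-level labels.

```python
# A minimal sketch of the tokenize -> predict -> merge-subwords flow described above.
# The checkpoint name is assumed; the classification head is randomly initialized here,
# so the printed labels are placeholders (LABEL_0 ... LABEL_10).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

ckpt = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
tok = AutoTokenizer.from_pretrained(ckpt, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(ckpt, num_labels=11)

words = ["Abdomen", "blando", ",", "no", "doloroso"]          # illustrative input sentence
enc = tok(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0].tolist()         # one label id per subword

word_level, seen = [], set()
for pos, wid in enumerate(enc.word_ids()):                    # map subwords back to words
    if wid is not None and wid not in seen:                   # keep the first subword's label
        word_level.append((words[wid], model.config.id2label[pred[pos]]))
        seen.add(wid)
print(word_level)
```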
We followed the same training, development, and test partitions described above for the fine-tuning process with all the models. To find the best hyperparameter configuration, a grid search over epochs (12, 7, 5, 3), learning rate (5e-7, 5e-5, 2e-7, 2e-5), and weight decay (0.01, 0.1) was used. The best values were 12 epochs, a learning rate of 2e-5, and a weight decay of 0.1 (see Results below). The remaining hyperparameters were kept at their default values (see Table 12 below).
We used the Transformers library and the models available on the Hugging Face Hub. A Tesla A100 GPU with 27 GB of RAM was used for all the experiments. We recorded the computational cost of the fine-tuning process for RoBERTa, mBERT, and BETO, which was 80, 50, and 50 minutes, respectively. The NeRUBioS dataset and the code implemented for this work are ready to be released in a public repository upon publication of this article.
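For concreteness, the self-contained sketch below reproduces this fine-tuning setup on a single toy sentence; it is not the code we release. The checkpoint name, the sentence, and its word-level labels are illustrative assumptions, while the hyperparameters correspond to the best values reported above.

```python
# A minimal sketch of the token-classification fine-tuning described above.
# Assumptions: the checkpoint name and the toy training example; the hyperparameters
# are the best values found by the grid search (12 epochs, lr 2e-5, weight decay 0.1).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)

labels = ["O", "B-NEG", "I-NEG", "B-NSCO", "I-NSCO", "B-NegREF", "I-NegREF",
          "B-UNC", "I-UNC", "B-USCO", "I-USCO"]               # the 11-label BIO tagset
id2label = dict(enumerate(labels))
label2id = {l: i for i, l in id2label.items()}

ckpt = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"     # assumed clinical RoBERTa
tok = AutoTokenizer.from_pretrained(ckpt, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    ckpt, num_labels=len(labels), id2label=id2label, label2id=label2id)

# One toy, word-level annotated sentence (the tags are illustrative only).
words = ["Abdomen", "blando", "y", "depresible", ",", "no", "doloroso"]
tags = ["B-NegREF", "I-NegREF", "I-NegREF", "I-NegREF", "O", "B-NEG", "B-NSCO"]

def encode(words, tags):
    """Tokenize a pre-split sentence and align word-level labels to its subwords."""
    enc = tok(words, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        if wid is None:
            aligned.append(-100)                 # special tokens are ignored by the loss
        elif wid != prev:
            aligned.append(label2id[tags[wid]])  # label the first subword of each word
        else:
            aligned.append(-100)                 # remaining subwords are ignored
        prev = wid
    return {"input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"], "labels": aligned}

train_ds = Dataset.from_list([encode(words, tags)])

args = TrainingArguments(output_dir="nerubios-sketch", num_train_epochs=12,
                         learning_rate=2e-5, weight_decay=0.1, logging_steps=1)
Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tok,
        data_collator=DataCollatorForTokenClassification(tok)).train()
```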
4. Results and discussion
This work's overall goal is to automatically identify out-of-scope negation references and to link them to their respective negation markers and scopes in the sentence. In this section, we describe the results obtained for each individual task and then evaluate OSR detection and linking together, as well as all the tasks jointly, including uncertainty detection. The values in this section's tables are the average results of the 12 epochs at the hyperparameter configuration described above. Likewise, the figures in this section plot the F1 score for the five tested models at each of the 12 epochs on both the development and the testing partitions.
The results of the models have been evaluated using precision (Pr), recall (Rec), and F-score (F1). These metrics range from 0.0 to 1.0 and are defined in equations (1), (2), and (3). Since the problem is treated as a token classification one, we used the average precision, the average recall, and the average F1 score, where the F1 score is the harmonic mean of precision and recall.
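As a concrete illustration, the averaged token-level metrics can be computed as in the sketch below; the gold and predicted tag sequences are toy examples, and macro averaging is assumed here, in line with the Macro F1 values reported later.

```python
# A minimal sketch of averaged token-level evaluation with scikit-learn.
# The gold and predicted label sequences are illustrative; macro averaging is assumed.
from sklearn.metrics import precision_recall_fscore_support

gold = ["B-NegREF", "I-NegREF", "O", "B-NEG", "B-NSCO", "I-NSCO"]
pred = ["B-NegREF", "O",        "O", "B-NEG", "B-NSCO", "I-NSCO"]

pr, rec, f1, _ = precision_recall_fscore_support(gold, pred, average="macro",
                                                 zero_division=0)
print(f"Pr={pr:.2f} Rec={rec:.2f} F1={f1:.2f}")
```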
4.1 OSR detection
RoBERTa was the best model for this task and reached F1 scores of 0.59 and 0.56 on the development and testing datasets, respectively (see Table 13 and Figure 10). To the best of our knowledge, this is the first time this problem has been addressed, which makes these results not only a novel contribution to the field but also a competitive baseline, given the complexity of the task.
As we treat the OSR identification problem as a token classification task, we do not measure whether the whole OSR was correctly labeled but rather whether each token in the OSR was correctly classified. The underlying philosophy of this methodology is that, in an information management system, a truncated OSR is more helpful than no OSR at all. However, in order to provide deeper insight into the results in Table 13, we took a further step supported by the statistics shown in Table 9. Notice that OSR length ranges from 1 to 25 words, with an average length of 3.65 words. These figures are relevant because it has been observed (see also the error analysis below) that classification accuracy decreases as phrase length increases. Based on this assumption, we extrapolated the test F1 score in Table 13 to estimate F1 scores for OSRs shorter and longer than the average length of 3.65. This may be helpful in the future to assess the usefulness of some pre- or post-processing to handle OSR length.
We use linear interpolation to estimate the F1 score for OSRs of lengths from 1 to 25, based on the observed F1 score of 0.56 on the testing dataset for the average OSR length of 3.65. This interpolation assumes a linear relationship between OSR length and F1 score. Let us denote:
- $F1_{3.65} = 0.56$: the observed F1 score for OSRs of the average length 3.65.
- $F1_1$: the F1 score for OSRs of length 1. We need this value to determine the rate of decrease.
The linear decay in F1 score as the OSR length increases is considered in the interpolation formula

$F1_n = F1_{3.65} - k\,(n - 3.65),$
where $k$ is the rate of decrease in F1 score with respect to OSR length and $ n - 3.65$ represents the deviation of the OSR length $n$ from the average length 3.65.
We need to determine the value of $k$ based on the observed F1 score for length 1 ($F1_1$). We can calculate it as follows:

$k = \dfrac{F1_1 - F1_{3.65}}{3.65 - 1}$
Once we have the value of $k$ , we can use it in the interpolation formula above to estimate the F1 score for OSRs of different lengths.
In order to determine $k$ , we need the F1 score of OSRs of length 1 ( $F1_1$ ). We calculated $k$ with a different $F1_1$ score each time starting with 0.1 up to 1 at intervals of 0.1. Figure 11 (left) shows that $F1_1$ scores below 0.6 yield negative $k$ values, which makes sense if we consider that 0.56 is the reference score to start calculating the decay.
To visualize the way OSR length and $F1_1$ values impact extrapolated F1 scores, we plot a mesh diagram (see Figure 11, right) including hypothetical $F1_1$ scores for OSRs of length 1 (n = 1) from 0.6 to 1, OSR lengths from 1 to 25, and the resulting extrapolated F1 scores. This model assumes that the shorter the OSR, the better the token classifier's performance. The graph helps identify the thresholds beyond which combinations of OSR length and $F1_1$ yield negative extrapolated F1 values. Factoring in this decay, the average extrapolated F1 score is 0.80 for OSRs of length 1 and 0.71 for OSRs of length 2, while for lengths 4 and 5 it is 0.53 and 0.44, respectively.
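The short sketch below recomputes this extrapolation; the grid of hypothetical $F1_1$ values, the observed score, and the average length are taken from the text, while the code itself is only illustrative.

```python
# A small sketch reproducing the extrapolation above: for hypothetical F1_1 values from
# 0.6 to 1.0, compute the decay rate k and the extrapolated F1 for OSR lengths 1..25,
# then average across the F1_1 grid.
import numpy as np

f1_avg_len, f1_observed = 3.65, 0.56          # observed test F1 at the average OSR length
f1_1_grid = np.arange(0.6, 1.01, 0.1)         # hypothetical F1 scores for length-1 OSRs
lengths = np.arange(1, 26)

k = (f1_1_grid - f1_observed) / (f1_avg_len - 1)                  # one decay rate per F1_1
extrapolated = f1_observed - np.outer(k, lengths - f1_avg_len)    # shape: (len(grid), 25)

for n in (1, 2, 4, 5):
    print(n, round(extrapolated[:, n - 1].mean(), 2))   # 1: 0.8, 2: 0.71, 4: 0.53, 5: 0.44
```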
4.2 Negation cue and scope detection
Results for negation on predicted cues and scopes are shown in Table 14 and Figures 12 and 13. Our fine-tuned RoBERTa reaches outstanding scores on both tasks (F1 = 0.96 and F1 = 0.89). Moreover, it is also robust enough to maintain negation detection performance despite being fine-tuned on NeRUBioS, a dataset augmented with more classes to encode an additional layer of information.
4.3 Uncertainty cue and scope detection
Results for uncertainty detection on predicted cues and scopes are less consistent (see Table 15 and Figures 14 and 15) and below those of related works, as we show later. We think this may be explained by a number of cases in which uncertainty scopes and OSRs overlapped in the same text span in NeRUBioS. During the OSR annotation process, whenever an OSR fell within an uncertainty scope, priority was given to the OSR annotation and the uncertainty labels were removed.
4.4 Negation and uncertainty detection results in perspective
Other works have reported very successful results for negation and uncertainty identification on the NUBES dataset. Table 16 and Figure 16 include these works as well as our own results with NeRUBioS so we can put them all in perspective. It is not clear, though, whether Solarte-Pabón et al. (Reference Solarte-Pabón, Torrente, Provencio, Rodríguez-Gonzalez and Menasalvas2021b) used the same testing partition the other authors cited in the table used for these scores. The results show that fine-tuning our model with more (OSR) labels does not impact negation detection performance; actually, our model performs slightly better for negation cue detection. Uncertainty detection, however, was compromised to some extent due to a number of uncertainty labels being replaced with OSR labels, as we explained above.
4.5 OSR detection and linking
This task involves OSR detection, negation cue detection, and negation scope detection. While we treat it here as a token classification problem, it can also be seen as a relation extraction task: because most OSRs and their negations are in the same sentence, once the OSR and its negation have been identified, their relation has likewise been established. Table 17 and Figure 17 show very competitive F-scores with RoBERTa in both partitions (0.84 and 0.86) for this novel task.
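The toy sketch below illustrates this linking logic: predicted BIO tags for one sentence are grouped into spans, and each NegREF span is paired with the negation cue and scope found in the same sentence. The sentence and its tags are illustrative, not taken from NeRUBioS.

```python
# A toy sketch of the same-sentence linking described above: group BIO tags into spans
# and pair every NegREF span with the negation cue/scope of its sentence.
def spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans."""
    out, cur_lab, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag == "O" and cur_lab):
            if cur_lab:
                out.append((cur_lab, " ".join(cur_toks)))
            cur_lab, cur_toks = (tag[2:], [tok]) if tag != "O" else (None, [])
        elif tag.startswith("I-") and cur_lab == tag[2:]:
            cur_toks.append(tok)
    if cur_lab:
        out.append((cur_lab, " ".join(cur_toks)))
    return out

tokens = ["Abdomen", "blando", "y", "depresible", ",", "no", "doloroso"]
tags = ["B-NegREF", "I-NegREF", "I-NegREF", "I-NegREF", "O", "B-NEG", "B-NSCO"]

found = spans(tokens, tags)
refs = [text for label, text in found if label == "NegREF"]
cues = [text for label, text in found if label == "NEG"]
scopes = [text for label, text in found if label == "NSCO"]
for ref in refs:                       # same-sentence co-occurrence establishes the link
    print({"osr": ref, "cue": cues, "scope": scopes})
```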
4.6 The joint task: OSR detection and linking + uncertainty detection
This joint task encompasses the five tasks assessed in the previous subsections, that is, OSR detection, negation cue detection, negation scope detection, uncertainty cue detection, and uncertainty scope detection. Table 18 and Figure 18 show a competitive performance for the joint task, although there is variance in these scores, brought in by the high negation detection scores on the one hand and the not-so-high OSR detection scores on the other.
With regard to the results in Table 18, RoBERTa certainly was the best model. However, mBERT’s performance is remarkable. It outperforms the other models if its multilingual capacity is considered. mBERT can solve the same task in several languages with promising results using the zero-shot approach, as shown in the machine-translated examples in Table 19.
4.7 Error analysis
We carried out an exhaustive error analysis in OSR detection and linking and have grouped errors in categories as detailed below. In the provided examples, gold and predicted OSRs are in bold while negation cues and their scopes are underlined.
- Discontinuous OSRs: Some discontinuous OSRs are not detected in their entirety. This type of OSR is very challenging for the model due to the occurrence of other items interrupting an OSR:
Gold
Paciente con los antecedentes reseñados que ingresa por episodio de pérdida de conciencia mientras se encontraba sentada, que se inicia con sensación de mareo con prodromos de visión borrosa sin otros síntomas $\ldots$
Predicted
con sensación de mareo con prodromos de visión borrosa sin otros síntomas $\ldots$
As the example shows, the model tends to correctly predict the part that is closer to the negation cue. This was expected since there are many more samples of continuous OSRs in the dataset. Additionally, the first or last part of discontinuous OSRs may feature an infrequent syntactic pattern, which adds to the already challenging identification task.
- Mixed OSRs: Some mixed OSRs are not detected in their entirety, as in the example below, where the OSR is made up of a disease (cáncer de pulmón) and a treatment (tratado con quimioterapia):
Gold
Cáncer de pulmón detectado hace 8 semanas después de consulta por signos de dolor lumbar fue tratado con quimioterapia presentando nula mejoría.
Predicted
tratado con quimioterapia presentando nula mejoría.
Mixed OSRs generally include some sort of complementary information. Therefore, in most cases, the model can identify the part of the OSR corresponding to one of its categories, but it struggles to extract the complementary part. Apparently, the myriad of patterns in mixed OSRs causes the errors in this category.
- Long OSRs: Long OSRs are frequently not detected in their entirety. In many cases, the model identifies OSR chunks with a complete semantic sense, as in this example, but they do not always amount to the complete OSR:
Gold
Paladar asimétrico con desviación de uvula a la derecha, hiperemico, no abombado.
Predicted
Paladar asimétrico con desviación de uvula a la derecha, hiperemico, no abombado.
OSRs in NeRUBioS were tagged by identifying their head (mostly nouns) and extending it bidirectionally until the maximal syntactic unit is reached. This sometimes produces very long OSRs, which can include punctuation marks, especially commas, and non-content words. This type of OSR is a challenge due to the high variety in their syntactic patterns.
- Tokenization of numbers: OSRs are sometimes truncated when numbers occur.
Gold
Micralax cánulas rectal si más de 48 horas sin deposiciones.
Predicted
de 48 horas sin deposiciones.
Numbers are highly frequent in clinical documents. Due to the tokenization process carried out by the transformer architecture, some numbers are split causing the truncation of OSRs that include a number.
Overall, most of the strings resulting from the types of errors described above are truncations, missing OSR parts, or extended OSRs. As OSRs in NeRUBioS are not categorized according to this error typology, it was not possible to quantify the number of errors in each category. However, all these categories together significantly impact the model's recall and precision, particularly because most of the OSRs are multi-word sequences. On the other hand, a qualitative analysis shows that, when the exact match constraint is relaxed, our approach can still identify useful OSR chunks and link them to their respective negation scopes. The number and usefulness of these partial matches, however, are not reflected in the evaluation metrics above, since we assess each model's exact predictions rather than any form of fuzzy string matching.
5. Conclusions
This work addressed the phenomenon of OSRs in negated statements in clinical documents in Spanish. OSRs are crucial to the integral meaning of a negated statement, and they have so far been systematically left out by negation detection systems. Our survey of the literature to date reveals that (1) this is the first time the issue has been tackled; (2) related issues such as negation and uncertainty/speculation detection have been tackled with four distinct approaches, that is, rules, classical machine learning, hybrid methods, and deep learning, which can be seen as methodological generations given their appearance in chronological order with some overlap between generations; and (3) the early works in each generation reached high performance as they initially tackled more formal features, but, as the modeling of the problem unveiled higher-level, more complex features, each generation's performance as a whole seems to decrease over time, with the exception of deep-learning models.
In order for the OSR task to be approached with transfer learning using deep-learning models, a dataset with annotated OSRs was required. Therefore, we manually augmented NUBES (Lima et al., Reference Lima, Pérez, Cuadros and Rigau2020) into NeRUBioS, the first negation and uncertainty clinical Spanish dataset with OSR annotations. For this, an annotation protocol was defined and followed. The annotation process unveiled seven challenges posed by OSRs, but it also allowed us to determine the overall pattern of negated statements. OSRs fall into nine categories and account for 42.8% of negation references in the dataset, which makes this a very relevant problem. Using NeRUBioS, we fine-tuned five BERT-based models and used transfer learning to jointly identify negation scopes and their respective OSRs as well as uncertainty. Our best model achieves state-of-the-art performance in negation detection while also establishing a competitive baseline for OSR identification (Macro F1 = 0.56) and linking (Macro F1 = 0.86). Moreover, an extrapolation of these results to OSRs of shorter lengths suggests that the F1 score for this task may go up to 0.71 for two-word OSRs and to 0.80 for one-word OSRs. The results suggest that OSR identification may be more challenging for the models than negation and uncertainty detection. An analysis of errors and results confirms that the tested models struggle with some of the OSR challenges outlined in the methodology above, plus other unforeseen features like dates and numbers. However, a qualitative assessment still shows very useful hits for OSR detection when the exact match constraint is relaxed. Uncertainty detection performance, on the other hand, was impacted by the overlap of OSRs and uncertainty scopes in some instances.
For future work, we plan to address the models' limitations, such as distant references, by combining deep-learning architectures to enhance OSR detection and linking results. This combination can also take the form of a pipeline of models trained for various related tasks. For example, many OSRs fall into the categories of diseases and species, for which a myriad of named-entity recognition (NER) systems has been proposed. The model itself, or an architecture enriched with some sort of NER-knowledge layer, might help with the identification of challenging cases such as discontinuous, mixed, and long OSRs. Additionally, while this work keeps the dataset partition sizes consistent with Lima et al. (Reference Lima, Pérez, Cuadros and Rigau2020) to allow for performance comparison, it will be interesting to balance or optimize the sizes of the development and testing partitions and see whether this, in combination with batch size, impacts the models' performance. Future iterations of NeRUBioS will also benefit from a dedicated measurement of inter-annotator agreement for OSRs. Likewise, we are adding post-processing capabilities to our system, an approach that has proven useful in previous works in the medical field (Tamayo et al. (Reference Tamayo, Gelbukh and Burgos2022b); Tamayo et al. (Reference Tamayo, Gelbukh and Burgos2022c)). This may be particularly useful to solve number splits and, more importantly, to restore uncertainty tags removed during the OSR annotation process and boost uncertainty identification performance. This seems feasible since uncertainty scopes may overlap with OSRs but uncertainty cues rarely do; therefore, cue labels might be used as anchors for uncertainty scope restoration in a post-processing stage.