Introduction
Increasingly, aspects of decision-making are aided or taken over by AI. In domains like warfare, finance, law, healthcare, insurance, and dating, algorithms inform, prepare, or generate hypotheses, diagnoses, behavioral options, or decisions. For instance, decision support systems (DSS) have been introduced to improve medical decisions, by matching the characteristics of an individual patient to a clinical knowledge base, and by providing patient-specific assessments or recommendations to support the clinician in reaching a decision. In terms of scale and speed, these systems can utilize data and observations outside human reach.Footnote 1 , Footnote 2
Questions have been raised about how humans could remain in control over the overall process and be responsible for its outcomes.Footnote 3 , Footnote 4 As different people interact with an AI system (e.g., developing it, operating it, being subject to its outcome), each requires a different set of solutions to exercise control. For humans affected by decisions involving a DSS, Article 22(1) of the EU General Data Protection Regulation (GDPR) states that a “data subject shall have the right not to be subject to a decision based solely on automated processing.”Footnote 5 Further, Article 14.4 (b-d) of the AI Act states that humans using a DSS to make a decision should be empowered to “remain aware of the possible tendency of automatically relying or over-relying on the output produced by a high-risk AI system (i.e., ‘automation bias’).”Footnote 6 It goes on to specify that humans should be able to interpret, accept, disregard, override or reverse DSS outputs. Similarly, the EU High-Level Expert GroupFootnote 7 has emphasized the importance of human oversight for trustworthy AI, for instance by having humans “in-the-loop” (i.e., humans monitoring the system’s operation and intervening during the decision cycle).
It is, however, not clear whether humans can provide the type, amount, and consistency of supervision over longer periods of time, such that it would amount to effective human oversight of machine contributions to decision-making. Rather, for general psychological reasons, human attention and concentration may wax and wane, such that effectively being in the loop might not be possible in practice.Footnote 8 Various factors such as tiredness, recklessness, boredom, or a lack of attention can play a role in automation bias. There is thus a considerable risk that humans become overly reliant on DSSs,Footnote 9 , Footnote 10 leading to deficiencies in “meaningful human control”Footnote 11 , Footnote 12 and potentially raising “responsibility gaps.”Footnote 13 , Footnote 14 Overall, a human in the loop does not ensure that effective human oversight will be exerted to the extent required for moral and legal responsibility. Rather, humans might end up being “under” the loop,Footnote 15 merely playing a symbolic role by providing formal “stamps of approval” without genuine reflection.
Therefore, this paper proposes and explores the concept of a “Reflection Machine” (RM): an additional computational system to support effective and meaningful human oversight over a DSS. Cornelissen et al.Footnote 16 have recently introduced a first technical proof-of-concept implementation. This paper focuses on how RMs provide feedback on joint human–DSS decisions and urge the human to negotiate the proposed decision, thereby increasing human involvement in the decision-making process. RMs improve human oversight by asking questions about the reasoning behind accepting or rejecting a recommendation. In other words, whereas a DSS thinks “for” the human, the RM thinks “against” them. One way for the RM to do so would be to indicate data that support an option other than the one recommended by the DSS. The questions raised by the RM add frictionFootnote 17 and thereby discourage mindless decisions and promote deliberate and reflective decision-making.
Thinking “against” requires more time and effort from the human. Hence, an effective design of an RM and an appropriate balance between the activities of the DSS and RM will be important topics for research. The growing awareness of the potential risks of AI has led to a substantial increase in ethical, political, and legal codes and regulations within the EU. The increased attention, however, has not led to an unequivocal and practically precise set of instructions for the design and development of applications.Footnote 18 , Footnote 19 Moreover, legislation such as the GDPR and the AI Act is inspired by philosophical, political, and legal considerations, but does not explicitly take psychological mechanisms of decision-making into account. Hence, further analysis is required in order to determine what “effective oversight” in human–machine decision-making amounts to, and which RM features contribute to enabling human individuals to reflect on, disregard, or override the output of a machine.
Introducing Reflection Machines in Low Back Pain Medical Decision Support Systems
While specific RMs could be used in different contexts and by different people, we introduce a (currently hypothetical but soon to be implemented) RM used in the medical context by a physician who also uses a DSS to treat low back pain (LBP) in patients.
LBP accounts for more years lived with disability worldwide than any other health condition.Footnote 20 In the Netherlands, approximately 44% of the population experiences at least one episode of LBP in their lifetime, with one in five reporting persistent back pain lasting longer than three months (chronic low back pain, CLBP).Footnote 21 CLBP results in substantial limitations in activities and leads to high healthcare and socioeconomic costs.Footnote 22 , Footnote 23
The treatment for CLBP is still debated and there is a wide variance in treatments available, ranging from standard physiotherapy to a combined physical/psychological (CPP) program and surgery. In the vast majority of patients with CLBP (85–90%), the etiology is unknownFootnote 24 and for medical specialists, it is challenging to identify patients who would benefit from surgical or non-surgical interventions. The etiology can be very different and the psychological coping strategies of the patient have a huge impact on the treatment outcome. The proposed treatment by the orthopedic surgeon is heavily dependent on the presentation of the patient and the experiences of the surgeon.
To reduce extreme variability in diagnosis or proposed treatment, a DSS has been developed, namely, the Nijmegen decision support tool for chronic low back pain (NDT-CLBP).Footnote 25 The example in Figure 1 shows the system’s output for a patient likely to benefit from surgery, which potentially means the patient is referred to the spinal surgeon for consultation. The NDT-CLBP consists of (1) questionnaires that patients complete when they are referred to secondary care (in the Sint Maartenskliniek), (2) a patient outcomes registry, and (3) formulae for calculating outcome predictions. The formulae are based on successful (responder) and disappointing (non-responder) outcomes one year after treatment. The tool supports shared decision-making between patient and physician based on patient profiles (patient characteristics related to treatment outcomes) and matches patients, based on questionnaires, to the treatment that they are most likely to benefit from.Footnote 26 The treatment options are spine surgery, a conservative combined psychological and physical pain self-management program (CPP program), and no treatment in secondary care (meaning counseling during consultation and physiotherapy in primary care). Patients are referred to either spinal surgeon consultation or non-surgical consultation.
The current version of the NDT-CLBP is rated as very helpful by orthopedic surgeons because it provides additional information. However, most of them are aware that its output might also encourage tunnel vision. Poor outcomes of surgery are seen frequently, and in most cases there can be a bias toward non-surgical treatment. Moreover, a poor prediction for surgical treatment by the NDT-CLBP aggravates that bias, limiting an objective evaluation of the patient by the orthopedic surgeon. Eventually, the DSS will also limit access to orthopedic surgeons, since non-surgical consultations will be conducted by physician assistants, who have limited knowledge about surgical treatment indications. Therefore, once a patient is on a path of non-surgical treatment, it is difficult for them to be redirected. To overcome such problems, we introduce reflection support by means of an RM. This complementary system maintains or stimulates active human involvement in machine-supported decision-making, as required by the EU’s ethical and legal codes.
The following case may serve as an illustration:
A male patient, age 40, has suffered from LBP for over a year; physiotherapy was not effective. He is using morphine but cannot work, still needs help to get dressed, and feels depressed. The pain gets worse after 500 meters of walking, spreading to both legs. After some rest, the leg pain resolves. He has problems walking with an upright posture and tends to lean forward. Walking downhill is more problematic than walking uphill. Physical examination by the general practitioner reveals no neurological deficits and the GP refers the patient to an orthopedic surgeon. The treatment for CLBP is still under debate and there is a wide variance in treatments available, ranging from standard physiotherapy to a combined physical/psychological (CPP) program and even surgery. The proposed treatment by the orthopedic surgeon heavily depends on the presentation of the patient and the experiences of the surgeon. The patient filled in all questionnaires and the NDT-CLBP estimated benefit from surgery as 28%. The orthopedic surgeon discussed all items with him, and he was sent to the CPP program. Although he became more positive, his LBP persisted. Fortunately for the patient, one of the physiotherapists in the CPP program considered the possibility that the patient had spinal stenosis. A new consultation with the orthopedic surgeon, with this specific question, resulted in an MRI scan which confirmed it. The patient was treated surgically and was relieved of all symptoms after three months.
Based on historical patient data, the NDT-CLBP in the scenario described above led the orthopedic surgeon to non-surgical treatment for the patient. As a result, the surgeon became less aware of the symptoms that fit the diagnosis of spinal stenosis, which can be successfully treated by surgery. If an RM had been in place, it could have redirected the surgeon toward a non-psychological diagnosis, that is, the spinal stenosis, as the reason for the LBP. Ideally, the DSS itself would have suggested surgical treatment. However, given the variety of different factors for CLBP, many of which cannot be adequately quantified, each case must be treated individually.Footnote 28 This is not to discount the usefulness of DSS, but the output should not be considered as universally applicable. In other words, lesser-known cases for which there is little or no data cannot be adequately addressed by the DSS. The RM thus urges the doctor to re-evaluate and re-consider the suggestion of the DSS. Figure 2 visualizes the basic contours of such a decision-making process (joint human–machine DSS & RM), in which the questions of the RM (in parallel to the physiotherapist in the case described above) increase the reflection of the orthopedic surgeon, which could influence the choice between the two options.
Reflection Machine
An RM can be described as a system that receives information about the medical situation, the NDT-CLBP’s recommendation, and (optionally) the physician’s behavior (e.g., reflection time, decision style, decision history, and preference for an option) as input. Based on this information, the RM can then produce output in the form of questions that prompt the physician to reflect on the decision more deeply. A core aspect is the identification of appropriate prompts to generate reasonable questions. This is similar to the problem of generating reasonable explanations in XAIFootnote 29 and often involves counterfactual reasoning. Counterfactuals feature prominently in human reasoning and communication about decisions.Footnote 30 , Footnote 31 Hence, an RM ideally fosters epistemic certainty about a suggestion by a DSS, which in turn promotes trust between doctor and patient.Footnote 32
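The input/output description above can be made concrete with a small interface sketch. This is a minimal illustration only, not the proof-of-concept implementation cited earlier: all class names, field names, the 30-second threshold, and the question templates are hypothetical assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PhysicianBehavior:
    """Optional behavioral input (field names are hypothetical)."""
    reflection_time_s: float
    preferred_option: Optional[str] = None


@dataclass
class RMInput:
    """What the RM receives: case data, DSS output, optional behavior."""
    patient_findings: Dict[str, bool]         # finding -> present?
    dss_recommendation: str                   # option recommended by the DSS
    dss_predicted_benefit: Dict[str, float]   # option -> estimated benefit in [0, 1]
    behavior: Optional[PhysicianBehavior] = None


def reflection_questions(rm_input: RMInput,
                         min_reflection_s: float = 30.0) -> List[str]:
    """Return counterfactual-style prompts; the rules are placeholders."""
    questions: List[str] = []
    # One counterfactual question per option the DSS did not recommend.
    for option in rm_input.dss_predicted_benefit:
        if option != rm_input.dss_recommendation:
            questions.append(
                f"Which findings would have to be different for '{option}' "
                f"to be preferable to '{rm_input.dss_recommendation}'?"
            )
    # Assumed heuristic: a very short reflection time triggers an extra prompt.
    behavior = rm_input.behavior
    if behavior is not None and behavior.reflection_time_s < min_reflection_s:
        questions.append(
            "A decision was reached quickly. "
            "Which finding contributed most to it?"
        )
    return questions
```

In this sketch the behavioral input is optional, mirroring the description above; a real RM would presumably derive its questions from a model of the evidence rather than from fixed templates.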
To propose relevant questions, the RM can explicitly take into account: (1) the weakest evidence for the accepted decision, (2) the strongest evidence for the alternative decision, and (3) missing evidence that would have had a significant impact on the recommended decision (based on unperformed but potentially useful tests). The relation between evidence and hypotheses will be formalized with a probabilistic approach,Footnote 33 distinguishing weak from strong evidence, given a decision, through model selection.Footnote 34 A variable significance analysis will identify which variables explain the decision made and select important evidence (focused on observations that have higher information gain).
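The three kinds of evidence listed above can be illustrated with a simple probabilistic sketch. Note the substitution: instead of the model-selection approach mentioned in the text, the example ranks findings by their log-likelihood ratio between two options and scores an unperformed binary test by its expected information gain. The function names and the assumption that all probabilities lie strictly between 0 and 1 for observed findings are introduced here for illustration.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

# finding -> (P(finding | accepted option), P(finding | alternative option)).
# Any numbers passed in are assumed to come from some fitted model.
Likelihoods = Dict[str, Tuple[float, float]]


def log_likelihood_ratio(p_accepted: float, p_alternative: float) -> float:
    """Positive values support the accepted option, negative the alternative."""
    return math.log(p_accepted / p_alternative)


def rank_observed(likelihoods: Likelihoods, observed: Iterable[str]
                  ) -> Tuple[Optional[str], Optional[str]]:
    """Return (weakest evidence for accepted, strongest evidence for alternative)."""
    llr = {f: log_likelihood_ratio(*likelihoods[f]) for f in observed}
    supporting = {f: v for f, v in llr.items() if v > 0}
    opposing = {f: v for f, v in llr.items() if v < 0}
    weakest_for = min(supporting, key=supporting.get) if supporting else None
    strongest_against = min(opposing, key=opposing.get) if opposing else None
    return weakest_for, strongest_against


def entropy(p: float) -> float:
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(prior_accepted: float, p_accepted: float,
                              p_alternative: float) -> float:
    """Expected entropy reduction (bits) from one unperformed binary test."""
    p_pos = prior_accepted * p_accepted + (1 - prior_accepted) * p_alternative
    p_neg = 1 - p_pos
    post_pos = prior_accepted * p_accepted / p_pos if p_pos > 0 else prior_accepted
    post_neg = (prior_accepted * (1 - p_accepted) / p_neg
                if p_neg > 0 else prior_accepted)
    expected_posterior = p_pos * entropy(post_pos) + p_neg * entropy(post_neg)
    return entropy(prior_accepted) - expected_posterior
```

Under these assumptions, the finding returned first is a candidate for a question of type (1), the second for type (2), and unperformed tests with the highest expected information gain are candidates for type (3).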
RMs and Human Decision-Making
Although the focus of this paper is primarily to introduce and elucidate the basic idea of an RM, there is a significant amount of fundamental research in the cognitive neuroscience and psychology of human decision-making that will be useful to extend the simple framework provided in Figure 2. To illustrate the possibilities for extension of the framework, we will briefly review some important literature on four topics relevant to the implementation and/or application of an RM, viz. factors influencing decisions, evidence accumulation, neural mechanisms, and surrogate decision-making. Although this research on decision-making is clearly informative, it must be kept in mind that the contexts of such studies are most often social interaction or consumer choices, using for example, investment games with personal gains or losses as tasks. Hence, a direct translation to a medical context is to be treated with caution.
First, much is known about factors that influence (for better or worse) human decision-making. In a review paper, Saposnik et al.Footnote 35 indicated that overconfidence, lower tolerance to risk, anchoring effects, and information and availability biases are associated with diagnostic inaccuracies in 36.5 to 77% of case scenarios. It seems fair to say that at least part of the attractiveness of DSSs derives from their potential to remedy (avoid or compensate for) such flaws in human decision-making. CroskerryFootnote 36 suggests that six major clusters of factors can influence the clinical reasoning during the diagnostic process. These range from individual characteristics of the decision maker to factors in the work environment to factors associated with the patient. Although the emphasis of the paper is to provide a general introduction to the main ideas of an RM, knowledge about aspects influencing decision-making will be very informative in the implementation of the prototype.
Second, decision-making takes place over time and often involves the weighing of incoming information. Busemeyer et al.Footnote 37 focus on decisions involving multi-attribute, multi-alternative choices, to be made under risk and uncertainty, where, over time, evidence accumulation takes place. Noguchi and StewartFootnote 38 present a sequential sampling version of a multi-alternative decision model. It provides a useful framework for understanding how incoming partial evidence (such as empirical evidence or test results) can be sequentially weighed and compared in terms of their support for alternative hypotheses or diagnoses. Moreover, models of evidence accumulation have been found to provide “remarkably accurate” descriptions for the neural dynamics of decision-making (albeit currently mainly in animal studies).Footnote 39
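The idea of sequentially weighing incoming evidence can be sketched with a generic two-hypothesis accumulator. This is not the multi-alternative model of Noguchi and Stewart; it is a simplified illustration in the style of a sequential probability ratio test, and the function name, the labels "A" and "B", and the threshold value are assumptions.

```python
from typing import Iterable, List, Optional, Tuple


def accumulate_evidence(llr_stream: Iterable[float],
                        threshold: float = 2.0
                        ) -> Tuple[Optional[str], List[float]]:
    """Sum log-likelihood ratios until one hypothesis reaches the threshold.

    Positive increments favor hypothesis "A", negative ones hypothesis "B".
    Returns the chosen hypothesis (or None if the evidence runs out first)
    together with the running total after each piece of evidence.
    """
    total = 0.0
    trajectory: List[float] = []
    for llr in llr_stream:
        total += llr
        trajectory.append(total)
        if total >= threshold:
            return "A", trajectory
        if total <= -threshold:
            return "B", trajectory
    return None, trajectory
```

In such a scheme, the running total gives one possible handle for timing an RM question, for instance when the total is close to a threshold but has been driven by a single strong piece of evidence.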
Third, knowledge about the neural mechanisms will be useful in the attempt to understand the why and how of the effects of the RM on the persons involved in making the decision. Rilling and SanfeyFootnote 40 review the neural mechanisms underlying social decision-making, for instance in relation to trust (see also van Baar et al.Footnote 41). Park et al.Footnote 42 investigate the neural mechanisms involved in making group decisions where the outcome of one’s decision can depend on the decisions of others.
Finally, in clinical decision-making there is a difference between the decision makers (physicians) and the ones undergoing the consequences of the decision (patients). Füllbrun et al.Footnote 43 edited a special issue focusing on decision-making for others. They suggest that it is important to consider the “psychological distance” between decision maker and recipient. They refer to a model proposed by Tunney and ZieglerFootnote 44 that captures four different perspectives that together can influence decision makers in their ultimate choice. The final decision is the weighted result of a combination of egocentric (what do I want), projected (what would I do), benevolent (what should I do), and simulated (what would you do) perspectives. Several factors (intent, significance, accountability, calibration, and empathy) determine the weights these perspectives are given in a concrete case, thereby increasing or decreasing the psychological distance between decision maker and recipient. Taken together, these insights can help to identify optimal points of RM intervention in the decision-making process.
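One simple reading of the "weighted result" of the four perspectives is a normalized weighted sum. The sketch below assumes exactly that; the original model may combine the perspectives differently, and the function name, score scale, and weights are hypothetical.

```python
from typing import Dict

PERSPECTIVES = ("egocentric", "projected", "benevolent", "simulated")


def combined_preference(scores: Dict[str, Dict[str, float]],
                        weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted average score per option across the four perspectives.

    scores[perspective][option] is how strongly that perspective favors
    the option; weights[perspective] is the weight given to it in this case.
    """
    total_weight = sum(weights[p] for p in PERSPECTIVES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    options = scores[PERSPECTIVES[0]].keys()
    return {
        option: sum(weights[p] * scores[p][option] for p in PERSPECTIVES)
        / total_weight
        for option in options
    }
```

On this reading, factors such as accountability or empathy would act by shifting the weights, which changes the combined preference without changing the individual perspectives.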
From an ethical perspective, the implications of the RM can best be understood along the lines indicated by the High-Level Expert Group on AI of the EU.Footnote 45 In their ethical guidelines for trustworthy AI, the first of seven key requirements concerns human agency and oversight. As they say:
Users should be able to make informed autonomous decisions regarding AI systems. They should be given the knowledge and tools to comprehend and interact with AI systems to a satisfactory degree and, where possible, be enabled to reasonably self-assess or challenge the system. AI systems should support individuals in making better, more informed choices in accordance with their goals.Footnote 46
We suggest that the RM precisely captures the idea of a tool that makes it possible to “reasonably self-assess or challenge” a DSS. In so doing, the ethical requirement of effective human oversight is fulfilled, or at least fulfilled to a higher degree, in that an RM “helps ensuring that an AI system does not undermine human autonomy or causes other adverse effects.”Footnote 47
Clearly, the overall framework of a joint human–machine decision-making process is complicated. First of all, there is information about the patient, historical cases and general medical knowledge. Second, there are the features of the human involved in decision-making (decision style, expertise level). Third, there are two AI systems (a DSS and an RM). The timing and frequency of information exchange between these elements will need to be specified and the standard flow of information processing and decision-making will need to be analyzed, using the above insights derived from psychology and cognitive neuroscience regarding decision-making in order to derive optimal points for RM intervention. Finally, the usefulness of an RM needs to be explored. Preliminary investigationsFootnote 48 , Footnote 49 suggest that usefulness needs to be investigated in at least three ways. First, does the RM increase effective human oversight over the machine-supported decision process? This involves measuring potential overreliance on the DSS.Footnote 50 , Footnote 51 Overreliance is worrisome insofar as it leads to culpability gaps should an erroneous suggestion be followed.Footnote 52 , Footnote 53 The RM tries to counter this problem by urging the human expert to make a more intentional decision, ideally foreseeing possible outcomes. Nevertheless, an RM is itself a new actor in the decision-making process, and in its wake new human actors are introduced in the chain of decision-making––namely, the designers of the RM. Consequently, the danger of new culpability gaps emerges. It is thus important, when building an RM, to avoid an overly complex model that leads to another “black box.” Instead, the aim of the RM is to ideally provide or assist in the creation of epistemic insights into the workings of the DSS by reducing its opacity, and by strengthening the explicit arguments used by the human in making the final decision. 
So, although the RM requires a certain level of domain knowledge, a balance between simplicity and complexity must be found.
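One possible way to operationalize overreliance is the share of cases in which the DSS recommendation was wrong and the human followed it anyway. The sketch below assumes that a retrospective reference decision is available for each case; the record fields and the metric definition are illustrative assumptions, not an established measure from the cited studies.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CaseRecord:
    dss_recommendation: str
    physician_decision: str
    reference_decision: str  # assumed retrospective reference standard


def overreliance_rate(cases: Iterable[CaseRecord]) -> Optional[float]:
    """Fraction of wrong DSS recommendations that the physician followed.

    Returns None when the DSS was never wrong, because the rate is then
    undefined.
    """
    wrong = [c for c in cases
             if c.dss_recommendation != c.reference_decision]
    if not wrong:
        return None
    followed = sum(c.physician_decision == c.dss_recommendation for c in wrong)
    return followed / len(wrong)
```

Comparing this rate between decisions made with and without the RM would give one indication of whether the RM reduces overreliance.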
Second, does the RM increase the quality of decision-making, in terms of accuracy and efficiency? This requires an analysis of false positives and false negatives, as well as decision time measurements.Footnote 54 Relatedly, the question of whether the RM contributes to the patient’s well-being arises. Successful treatment of CLBP relieves suffering, and patients have presumably more trust in doctors if they can explain their decision in ways that go beyond repeating the DSS recommendation. Patient interviews will be necessary to establish this.
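The accuracy and efficiency analysis mentioned above could be computed along the following lines. This is a minimal sketch that treats one option as the "positive" class and assumes a reference decision per case; the function name, the record layout, and the choice of "surgery" as the positive class are assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (physician decision, reference decision, decision time in seconds)
Record = Tuple[str, str, float]


def decision_quality(records: List[Record],
                     positive: str = "surgery") -> Dict[str, float]:
    """Count confusion-matrix cells and average the decision time."""
    if not records:
        raise ValueError("at least one record is required")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for decision, reference, _ in records:
        if decision == positive:
            counts["tp" if reference == positive else "fp"] += 1
        else:
            counts["fn" if reference == positive else "tn"] += 1
    result: Dict[str, float] = dict(counts)
    result["mean_decision_time_s"] = mean(t for _, _, t in records)
    return result
```

Running the same computation on decisions made with and without the RM would show whether any gain in accuracy comes at the cost of longer decision times.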
Third, how does the RM affect user (i.e., the medical expert) experience?Footnote 55 Here, the focus will be on the understandability of RM interventions, interference with workflow, user confidence in selected options, sense of agency regarding decision-making, and satisfaction levels regarding the overall decision-making process.Footnote 56 A variety of conditions needs to be controlled for, for example, level of medical expertise, automation experience, and automation expectancy.Footnote 57 , Footnote 58 Recent cases of clinical practice with CLBP will be used. In the end, an RM should not simply be a “technological fix” (i.e., solutionism). Rather, context-specific knowledge from involved stakeholders must inform the design of the RM, which is part of a broader socio-technical system, to also anticipate potential harms.
Conclusion
Effective human oversight of DSS is one of the major societal challenges posed by AI. The need for maintaining or increasing meaningful human control over machine recommendations, decisions, or actions has been expressed in several recent EU codes, regulations, and acts. In addition, DSSs bring the risk of “hollowing out” of professional skills.Footnote 59 , Footnote 60 By taking over much of the knowledge consultation and inferencing involved in decision-making, human professionals are at risk of losing their high-level skills, and moreover may experience (and resist) being side-lined. RMs offer the possibility to develop a “best of both worlds” approach, increasing opportunities for responsible decision-making without accountability gaps.
An RM will be relevant for the assessment of work disability for social security benefits.Footnote 61 In the Netherlands, the Dutch social security institute, UWV, performed 155,900 such assessments (“invaliditeitskeuringen”) in 2020.Footnote 62 A recurring problem is the consistency of assessments over time, doctors, and patients. The RM could help physicians make consistent and thoughtful diagnoses in a complex domain while maintaining human authority and accountability and ensuring the efficient use of AI resources in joint decision-making. The RM would thus also improve patient well-being.
Finally, the idea of an RM is applicable to other domains. The application of DSSs and the various issues surrounding human oversight, accountability, and professional expertise are not restricted to the medical domain, but extend to legal, financial, and policing areas as well. The significant growth in the usage of DSS emphasizes the urgent need for effective human oversight. Overall, RMs could mitigate the harms of over-relying on DSS, such as wrongful arrests resulting from the use of DSS in law enforcement. The development of RM architectures could therefore stimulate the development of trustworthy intelligent systems that is currently at the core of the EU approach to AI.