Introduction
Many clinical decisions involve an evaluation and comparison of risks and benefits. Little is known about how physicians estimate risks and benefits or how those estimations are used in decision-making. Although many studies have evaluated physician pretest probabilities, relatively few studies have prospectively compared physicians’ quantitative estimates of the risk of a condition with the true probability of that condition in real patients.Reference Poses, Cebul and Collins 1 – Reference Smith, Poses and McClish 6 Most of these studies showed that physicians provided accurate probability estimates in one or more subsets of patients or clinical situations but provided inaccurate estimates in other subsets. Prior research has identified wide variability between physicians regarding estimates of the probability of an outcome for the same patient.Reference Poses, Bekes and Copare 3 , Reference Dolan, Bordley and Mushlin 7 To our knowledge, estimates of the probability of outcomes have not been evaluated previously for patients with head injury.
An improved understanding of how physicians estimate risk and the degree to which they are accurate may facilitate promotion of effective decision-making strategies and warnings against ineffective decision strategies, both for children with head injury and for other conditions.Reference Croskerry 8 , Reference Croskerry 9 A greater understanding of the influences that impact physician decision-making may also facilitate the design of clinical practice guidelines that incorporate all information considered relevant by physicians and are more likely to influence physician behaviour.Reference Grimshaw, Eccles and Steen 10 , Reference Scott, Grimshaw and Klassen 11
To explore these issues further, we used data obtained during derivation of a clinical prediction rule for predicting brain injury in children with minor head trauma. After assessing patients and before any diagnostic imaging, physicians were asked to estimate the probability of any brain injury and the probability of an injury requiring intervention.Reference Osmond, Klassen and Wells 12 Using these data, we sought to determine the accuracy of physicians’ probability estimates, to explore influences on probability estimates, and to explore the relationship between probability estimates and decisions.
Methods
We performed a secondary analysis of data from the Canadian Assessment of Tomography for Childhood Head Injury (CATCH) derivation study, a prospective multicentre observational study used to derive a clinical prediction rule in children with a minor head injury.Reference Osmond, Klassen and Wells 12 Subjects were required to have a Glasgow Coma Score on presentation of 13 or higher and at least one of the following: witnessed loss of consciousness or disorientation, definite amnesia, persistent vomiting, or persistent irritability in the emergency department (if <2 years old). During the derivation study, physicians were asked to report clinical characteristics of included patients at initial assessment and to perform further evaluation as they deemed appropriate. All included subjects who did not undergo computed tomography (CT) during the initial visit received a phone follow-up 2 weeks after discharge in order to identify any clinically significant injuries that were not previously found. Further details of the methods have been previously reported.Reference Osmond, Klassen and Wells 12
The research ethics boards at the University of Ottawa and the University of Manitoba approved this secondary analysis. Statistical testing was performed using Stata 13.1 (StataCorp LP, College Station, Texas).
Measures
In addition to an assessment of specific clinical characteristics, physicians assessing patients for CATCH were asked to choose a value corresponding to their estimate of the probability of any brain injury (P-Injury), an estimate of the probability of an injury requiring neurologic intervention (P-Intervention), and a rating on a five-point scale of how comfortable they would feel not ordering a CT scan (Figure 1). A notice on the case report form reminded physicians that the questions should be answered before any imaging was done. Brain injury was defined as “any acute intracranial finding revealed on CT that was attributable to acute injury,” including depressed skull fractures but excluding nondepressed and basilar skull fractures.Reference Osmond, Klassen and Wells 12 The need for neurologic intervention was defined as having a craniotomy, intubation, elevation of skull fracture, intracranial pressure monitoring, or death within 7 days. Study assessments were completed by attending physicians or resident physicians in their second year of training or beyond in pediatrics, emergency medicine, or family medicine.
Accuracy of estimates
We used Pearson’s χ2-test to compare expected and observed frequencies of subjects with injury for each level of P-Injury and P-Intervention. This test could not be performed for certain levels of risk, including 0% and 100%, for which the expected frequency of subjects with or without injury was <2. We also evaluated the overall goodness-of-fit of P-Injury and P-Intervention using the Hosmer–Lemeshow test.
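The per-level comparison described above can be sketched in Python as follows. This is only an illustration, not the authors' Stata code: the function name, the example counts, and the choice to derive the expected count from the level's nominal probability are assumptions. The check on expected counts mirrors the stated threshold of 2.

```python
import math


def level_chi2(n, observed_injured, predicted_prob, min_expected=2.0):
    """Pearson chi-square (1 df) for one probability level.

    Compares observed injured/uninjured counts with the counts expected
    if the level's nominal probability were exactly right.
    Returns (chi2, p_value), or None if any expected count is below
    min_expected (the test is not performed in that case).
    """
    expected_injured = n * predicted_prob
    expected_uninjured = n - expected_injured
    if expected_injured < min_expected or expected_uninjured < min_expected:
        return None
    observed_uninjured = n - observed_injured
    chi2 = ((observed_injured - expected_injured) ** 2 / expected_injured
            + (observed_uninjured - expected_uninjured) ** 2 / expected_uninjured)
    # Upper tail of the chi-square distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical level: 200 subjects rated at 5%, 20 of whom had an injury.
result = level_chi2(200, 20, 0.05)
```

A level such as 0% or 100% returns None here because one of the expected counts is zero, which corresponds to the levels for which the test could not be performed.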
Patient characteristics
For subjects in each year of age and for each sex, we compared the expected and actual numbers of patients with any brain injury using Pearson’s χ2-test.
Decisions
We used two measures of physician decision-making that provided complementary information: 1) whether a patient received a CT scan and 2) the rating of how comfortable a physician was with not ordering a CT scan, which was performed at the time of the initial assessment. For some analyses about decisions, we created a binary measure of physician comfort: comfortable (i.e., either comfortable or very comfortable not ordering a CT) or not comfortable (i.e., neutral, uncomfortable, or very uncomfortable not ordering a CT). These categories were based on the proportion of subjects with each comfort rating who received a CT. For each level of P-Injury and P-Intervention, we determined the proportion of patients who underwent CT and the proportion that received a not comfortable rating. We used logistic regression to determine whether there was a difference in the relationship between estimated probability of injury and CT performance in young children.
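The binary comfort measure and the per-level proportions can be illustrated with a short Python sketch. The record layout, the function names, and the numeric coding of comfort (1=very uncomfortable through 5=very comfortable not ordering a CT) are assumptions made for the example; the data shown are invented.

```python
from collections import defaultdict


def is_not_comfortable(rating):
    """Assumed coding: 1=very uncomfortable, 2=uncomfortable, 3=neutral,
    4=comfortable, 5=very comfortable not ordering a CT.
    Ratings 1-3 are grouped as 'not comfortable'."""
    return rating <= 3


def proportions_by_level(records):
    """records: iterable of (probability_level, comfort_rating, received_ct).

    Returns {level: (proportion_with_ct, proportion_not_comfortable)}.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # n, n_ct, n_not_comfortable
    for level, rating, received_ct in records:
        c = counts[level]
        c[0] += 1
        c[1] += int(received_ct)
        c[2] += int(is_not_comfortable(rating))
    return {
        level: (n_ct / n, n_nc / n)
        for level, (n, n_ct, n_nc) in sorted(counts.items())
    }


# Invented example records: (P-Injury level in %, comfort rating, CT done?)
example = [(1, 5, False), (1, 3, True), (10, 2, True), (10, 1, True)]
summary = proportions_by_level(example)
```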
Inter-rater agreement
A subset of subjects had assessments by two physicians. Both attending and resident physicians could serve as the primary or secondary rater. Inter-rater agreement was determined for P-Injury, P-Intervention, and comfort ratings using weighted κ statistics (kap command in Stata 13.1 using w2 option for weights). For these evaluations, we used a numerical point scale for comfort ratings (1=very uncomfortable not ordering a CT, 5=very comfortable not ordering a CT).
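For readers unfamiliar with weighted κ, the following Python sketch computes Cohen's κ with quadratic agreement weights, 1 − ((i − j)/(k − 1))², which is our understanding of the weighting selected by the w2 option. It is a minimal illustration rather than a reproduction of the Stata routine; the function name is hypothetical, and treating the ordered probability levels as equally spaced categories is an assumption.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's weighted kappa with quadratic agreement weights.

    categories: ordered list of all possible rating values (length >= 2).
    Undefined (division by zero) when chance-expected agreement is 1,
    e.g. when both raters use a single category for every subject.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Joint table of proportions: rows = rater A, columns = rater B.
    table = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        table[index[a]][index[b]] += 1.0 / n

    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted observed agreement
    expected = 0.0  # weighted agreement expected by chance
    for i in range(k):
        for j in range(k):
            w = 1.0 - ((i - j) / (k - 1)) ** 2
            observed += w * table[i][j]
            expected += w * row[i] * col[j]
    return (observed - expected) / (1.0 - expected)


scale = [1, 2, 3, 4, 5]  # comfort rating scale
kappa = weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], scale)
```

With quadratic weights, disagreements between adjacent categories are penalized far less than disagreements at opposite ends of the scale, which suits ordered ratings such as the comfort scale.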
Physician employment status
We stratified the sample by the physician employment status of the primary rater (full-time attending, part-time attending, or resident) in order to explore differences in estimates and decisions between the groups. For each group, we used Pearson’s χ2-test to compare expected and observed frequencies of any brain injury. We also compared the frequency of injury, probability estimates, mean comfort ratings, and the proportion undergoing CT for attending v. resident physicians using the Kruskal–Wallis test for continuous variables and Pearson’s χ2-test for binary variables.
Test characteristics
We determined the test characteristics (i.e., sensitivity, specificity, positive predictive value, and negative predictive value) of physician-estimated probability using each level of P-Injury and P-Intervention as a cutpoint. We compared the test characteristics of the CATCH clinical prediction rule to the test characteristics of physician-estimated probability for each cutpoint with a sensitivity of at least 95%.
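The calculation of test characteristics at a given cutpoint can be sketched in Python as below. The function name and example data are hypothetical, and the sketch assumes that an estimate strictly greater than the cutpoint counts as a positive test; whether each level was treated as inclusive or exclusive is not specified here.

```python
def diagnostic_characteristics(estimates, injured, cutpoint):
    """Sensitivity, specificity, PPV and NPV of 'estimate > cutpoint'.

    estimates: physician-estimated probabilities (in %), one per subject.
    injured:   booleans giving the actual outcome for each subject.
    A ratio with a zero denominator is returned as None.
    """
    tp = fp = fn = tn = 0
    for estimate, has_injury in zip(estimates, injured):
        positive = estimate > cutpoint  # assumed: strict inequality
        if positive and has_injury:
            tp += 1
        elif positive:
            fp += 1
        elif has_injury:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Invented example: four subjects, cutpoint of 0%.
stats = diagnostic_characteristics([0, 1, 5, 50], [False, False, True, True], 0)
```

Repeating the call for every probability level gives one row of test characteristics per cutpoint, from which those with a sensitivity of at least 95% can be selected.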
Results
Enrolled subjects (n=3866) were assessed by 1150 physicians at 10 centres (Table 1). All subjects who were ultimately diagnosed with brain injury underwent CT prior to discharge. Further details of subject characteristics have been published previously.Reference Osmond, Klassen and Wells 12
Physician probability estimates were missing for approximately 2% (n=74) of the enrolled subjects. Subjects from 1 of the 10 centres and subjects for whom the primary rater was a resident physician were more likely to have missing probability estimates. Otherwise, no differences between subjects with and without missing estimates were identified. For 21 (0.5%) ratings, P-Intervention was greater than P-Injury. These values were excluded from further analysis because they were internally inconsistent and likely to represent errors. The following analyses were performed on 3771 subjects with consistent non-missing values for P-Injury and P-Intervention, including 153 subjects with brain injury and 24 with brain injury requiring intervention.
Accuracy of estimates
Using a two-tailed χ2-test, all levels of P-Injury from 1% to 40% had observed frequencies of injury consistent with the expected frequency (Table 2). The observed frequencies for the 50%, 75%, and 90% levels were lower than expected. For all levels of P-Intervention with an expected frequency of injury of two or greater, the observed frequency of injury was lower than the expected frequency, and this difference was significant for the 1% and 50% levels. Using the Hosmer–Lemeshow test, neither P-Injury (p=0.04) nor P-Intervention (p<0.001) was found to demonstrate overall goodness-of-fit compared to expected frequencies of injury.
* P-values for P-Injury and P-Intervention overall are from the Hosmer–Lemeshow test. P-values for individual levels of predicted probabilities are from Pearson’s χ2-tests with 1 degree of freedom, performed for levels in which all expected frequencies were ≥2.
Patient characteristics
The observed frequency of injury was consistent with the expected frequency for both sexes. When stratified by age in years of the subject, the observed frequency was consistent with the expected frequency for all ages except zero years (n=179, mean P-Injury 6.2%, actual injury 12.3%, p=0.001) and 12 years (n=237, mean P-Injury 5.6%, actual injury 2.5%, p=0.046).
Decisions
The proportion of subjects receiving a CT scan increased from 9% to 100% with increasing probability estimates (Table 2). The majority of subjects with a P-Injury of 2% or higher and the majority of those with a P-Intervention of 1% or higher received neuroimaging. The proportion of subjects who received a not comfortable rating similarly increased, although the proportion receiving a CT scan was generally higher than the proportion with a not comfortable rating. Overall, 96% of subjects with a not comfortable rating received a CT scan (very uncomfortable 97%, uncomfortable 99%, neutral 90%), and 20% of subjects with a comfortable rating received a CT scan (comfortable 40%, very comfortable 8%).
Of the subjects with an initial comfortable rating who ultimately received a CT scan, 13 of 426 (3%) were diagnosed with actual injury, and 2 of 426 (0.5%) received intervention. Of the subjects with a not comfortable rating who did not receive a CT scan, 0 of 66 were diagnosed with brain injury after discharge.
The relationship between estimated probability of injury and CT performance was not modified by young age (defined either as age less than 1 year or less than 2 years).
Inter-rater agreement
There were 333 subjects with two raters; 323 had P-Injury estimates by both raters. Of these, 14 subjects had actual brain injury, and none required intervention. Inter-rater agreement was fair to moderate for P-Injury (κ=0.46), P-Intervention (κ=0.34), and comfort ratings (κ=0.60).
Physician employment status
The majority of patients had a primary rater who was a full-time attending physician (Table 3). For all three types of physicians, the observed frequency of injury was consistent with the expected frequency. Subjects seen by resident physicians had significantly higher probability estimates (p<0.05), lower mean comfort ratings, and a higher likelihood of CT performance. The difference in the risk of injury by physician type (attending v. resident) was not statistically significant.
* P-value from Pearson’s χ2-test comparing expected with actual rates of any brain injury.
Test characteristics
The only cut-offs identified with a sensitivity of at least 95% were a risk level of >0% for both P-Injury and P-Intervention. These cut-offs generally performed less well than the corresponding CATCH rule (P-Injury compared to the medium-risk rule, P-Intervention compared to the high-risk rule).
Discussion
Collectively, physicians in this study accurately estimated the probability of brain injury visible on CT in children with minor head trauma when the predicted risk was low or moderate (<50%), which was the case for the majority of subjects. Physicians overestimated risk when predicting very high levels of risk of injury or when predicting the probability of injury requiring intervention. Inter-rater agreement for probability estimates was only fair to moderate. Although collective risk estimates for low and moderate degrees of risk were accurate, risk was underestimated in certain populations and overestimated in others. Most notably, risk was underestimated in infants, and the mean estimated probability of injury in children less than 1 year of age was approximately half the true risk of injury. Most children with an estimated probability of any injury of 2% or higher, or an estimated probability of injury requiring intervention of 1% or higher, received a CT scan.
Our results support the continued use of clinical prediction rules for identifying children with head injury who need neuroimaging, even in groups for which risk was predicted accurately. We evaluated predicted and observed risk across a broad range, but the decision to obtain a CT is binary. Clinical prediction rules are designed to maximize discriminative ability in addition to accuracy, and the CATCH rule demonstrated substantially better test characteristics than predicted probability by physicians. Furthermore, physicians in our study working in a pediatric emergency department may be more accurate at predicting risk than physicians who see children less often. In addition, although estimated risk was collectively accurate, inter-rater agreement was only fair to moderate, indicating that some physicians may have been less accurate than others when estimating risk.
Increasing estimates of the probability of injury were associated with increasing CT use and an increasing frequency of physicians indicating discomfort not ordering a CT. However, within the same level of P-Injury and P-Intervention, some patients underwent CT and some did not. Some variability in the receipt of CT would be expected because of a change in status of the patient. For example, a child who was initially well-appearing may have become lethargic, or a toddler who was initially irritable may have become playful. However, physicians also varied within levels of P-Injury regarding their degree of comfort not ordering a CT scan. This rating was done at the same time as P-Injury and would not have been affected by a later change in the patient’s status. This variability of decision-making for patients deemed to have the same probability of a poor outcome is interesting to consider. When clinical decision rules are used, the probability of the outcome is often the only factor that determines the recommended action.
Some of the observed variability in decision-making within levels of P-Injury and P-Intervention is likely secondary to differences between and within individual physicians regarding whether they were seeking to identify all visible injuries or only those injuries that required intervention.Reference Osmond, Klassen and Wells 12 , Reference Kuppermann, Holmes and Dayan 13
Variability could also be explained by factors beyond the probability of injury that affect the balance of risks and benefits for a specific patient. For example, recent studies have highlighted the increased risk of harm for younger patients from ionizing radiation used in CT.Reference Hall and Brenner 14 , Reference Pearce, Salotti and Little 15 We did not find effect modification by age on the relationship between estimated probability and decisions, but physicians did underestimate the risk of pathology in children less than 1 year. This underestimation of risk likely reflects the difficulty inherent in assessing infants. However, it is also possible that increased concerns about harm from radiation in very young patients and discomfort about ordering a CT scan somehow made physicians more likely to underestimate the probability of injury in these patients. Variability in decision-making may also be due to varying levels of concern regarding the caregiver’s ability to detect clinical worsening at home and obtain prompt follow-up.
Physician decision-making may also have been impacted by factors unrelated to the risks and benefits to the patient. For example, physicians may have been incorporating patient or family preferences into the decision. Features of the emergency department, such as the degree of crowding or how difficult it was to obtain a CT scan, may also have influenced decision-making. Unfortunately, we were unable to assess these potential confounding variables.
Overall, inter-rater reliability of probability estimates was only fair to moderate, confirming previous findings that pretest probabilities for the same patients vary between physicians.Reference Poses, Bekes and Copare 3 , Reference Dolan, Bordley and Mushlin 7 Resident physicians generally provided higher estimates of probability of injury, but their mean estimates were consistent with the observed frequency of injury for the children that they rated.
The strengths of our study include the relatively large sample size, the prospective nature of data collection using real patients, the availability of outcome data, including phone follow-up for children who did not undergo CT, and the use of two raters for a subset of subjects.
There are important limitations to our study. The wording of the questions and the choices of risk categories were selected by investigators and have not been validated. Our evaluation of the relationship of probability estimates and CT ordering was limited by the fact that final decisions about CT ordering may not have been made by the physician completing the initial assessment, and that new clinical information may have affected CT ordering. The comfort rating does not reflect all of the complexities of real-world decision-making. However, the comfort rating was always completed by the same physician who estimated probability, and, because the case report form was completed at the time of initial assessment, the comfort rating was also unaffected by any later change in clinical status.
Because of the risks of radiation, it was not ethical to perform a CT for every subject.Reference Hall and Brenner 14 , Reference Pearce, Salotti and Little 15 It is possible that some children with injuries that would have been visible on CT, but did not require intervention, were not identified. Practice patterns of individual surgeons may have affected whether subjects received intervention for certain injuries. Probability information was collected in categories rather than asking physicians to provide an estimate of probability. This practice probably increased the proportion of the probability estimation items that were completed but also limited our ability to evaluate finer estimates of probabilities. Our sample was restricted to Canadian pediatric emergency departments and may not be generalizable to settings outside of Canada or to emergency departments where both adults and children are seen. The data presented were collected between 2001 and 2005. Physician understanding about traumatic brain injury has changed substantially in the intervening years, which may impact the accuracy of predicted probabilities.
Conclusions
Physician estimates of probability of any brain injury in children were collectively accurate for children with low and moderate degrees of predicted risk. Risk was substantially underestimated in infants. Risk was overestimated in other circumstances, such as when physicians were asked to evaluate the probability of injury requiring intervention, or in the subset of children with very high risk of actual injury. Estimated risk varied between raters, indicating that some physicians were likely less accurate than others. Further research may identify differences between effective and ineffective strategies for estimating risk, both in infants and older children. Both qualitative research that explores physicians’ thought processes and quantitative research that describes real-life behaviour would be useful.
Acknowledgements
The authors thank the site investigators and other members of the Pediatric Emergency Research Canada (PERC) Head Injury Study Group for their contributions to study design and data collection: Keith Aronyk, MD (University of Alberta), Benoit Bailey (Hospital Ste-Justine), Laurel Chauvin-Kimoff (McGill University), Mark Hamilton, MD (University of Calgary), D. Anna Jarvis (University of Toronto), Gary Joubert (University of Western Ontario), Don McConnell (University of Alberta), Cheri Nijssen-Jordan (University of Calgary), Martin Pusic (Columbia University Medical Centre), Martin Reed, MD (University of Manitoba), Norm Silver (University of Manitoba), Ian Stiell (University of Ottawa), Brett Taylor (Dalhousie University), and Michael Vassilyadi, MD (University of Ottawa). We thank the following site study coordinators for their much appreciated assistance: Jennifer Spruyt, Eleanor Fitzpatrick, Rita Arsenault, Bev Irwin, Rose Jacobson, Sue Heathcote, Lanna Bryska, Nathalie Franc, Geri Siebenga St. Jean, and Diane Laforte. We also thank My-Linh Tran and Sheryl Domingo for data management, the physicians, nurses, and clerks at the study sites who voluntarily assisted with case identification and data collection, and the patients and families who participated.
Disclosure: Dr. Daymont’s time was funded by the Children’s Hospital Research Institute of Manitoba and the Manitoba Health Research Council. The CATCH Study was funded by peer-reviewed grants from the Canadian Institutes of Health Research (CIHR Funding Reference Number: MOP-43911), the Emergency Health Services Branch of the Ontario Ministry of Health, and the Alberta Children’s Hospital Foundation.
Competing interests: None declared.