INTRODUCTION
Community-acquired pneumonia (CAP) is the leading cause of death from infectious diseases in western countries, and health expenditures, in particular for in-patient management of patients with CAP, are substantial [Reference Macfarlane1, Reference Dixon2]. Accurate assessment of disease severity, risk stratification and prediction of outcome are, therefore, prerequisites for the safe identification of patients with CAP who are at low risk of complications and thus suitable for outpatient management. Several international organizations have developed prediction rules and adopted guidelines that stratify patients with CAP by predicted mortality, in order to identify those who may be managed as outpatients, optimize hospital referral and lower hospital admission rates [Reference Niederman3, Reference Woodhead4]. The pneumonia severity index (PSI) is a scoring system widely used in North America that assesses the risk of death in a two-step algorithm [Reference Fine5]. The CURB65 score is the modified version of the British Thoracic Society (BTS) assessment tool, which is based on only five predictors and is used in Europe [Reference Lim6, Reference Lim7]. The CRB65 score has been put forward as a useful substitute for the CURB65 as it does not rely on laboratory measurements and still shows acceptable discriminatory ability [Reference Lim7, Reference Neill8].
Prior to the implementation of a statistically derived prediction score, an external validation within locally generated data should be conducted [Reference Justice, Covinsky and Berlin9]. With only few exceptions [Reference Flanders10], external validation studies of pneumonia severity scores have focused on discriminative properties, i.e. the ability of the score to distinguish patients with CAP and fatal outcome from those surviving [Reference Flanders10–Reference Aujesky17]. Despite good discriminatory abilities, most validation studies found higher mortality rates of patients with PSI class III and CURB65 class 1 than was reported in the original studies. Because management strategies of patients with CAP depend on cut-off values of absolute predicted mortalities, it is essential that predicted risks agree with observed risks in the population in question. This is referred to as calibration. Miscalibration may lead to inadequate discharge of patients with high mortality (risk underestimation) or inadequate hospitalization of low-risk patients (risk overestimation).
The aim of our study was to validate the calibration and assess the need for recalibration of three well established pneumonia severity prediction scores in a tertiary-care setting in Switzerland.
METHODS
Study sample
For this analysis we pooled data from two randomized controlled studies enrolling patients with lower respiratory tract infections presenting to the emergency department (ED) of the University Hospital of Basel, Switzerland. The design of the two trials was similar and has been reported in detail elsewhere [Reference Christ-Crain18, Reference Christ-Crain19]. In brief, the first trial included 243 consecutive patients with clinically suspected lower respiratory tract infections, including acute bronchitis, exacerbation of chronic bronchitis, and CAP, admitted from December 2002 until April 2003. The second trial included 302 patients with radiologically confirmed CAP admitted between November 2003 and February 2005. In both trials, patients were randomly assigned to procalcitonin-guided antibiotic therapy (n=124 or n=151, respectively) or to standard treatment according to guidelines (n=119 or n=151, respectively) [Reference Niederman3, Reference Woodhead4]. The aim of both trials was to study whether procalcitonin-guided antibiotic treatment can reduce antibiotic consumption; 30-day mortality was monitored as a secondary endpoint. The first trial measured procalcitonin only on admission, whilst in the second trial follow-up procalcitonin measurements were performed. For the purpose of this analysis, only patients with a definite diagnosis of CAP were considered. CAP was defined as the presence of a new infiltrate on chest radiograph accompanied by one or several acquired acute respiratory symptoms and signs such as cough, sputum production, dyspnoea, fever >38·0°C, auscultatory findings of abnormal breath sounds and rales, leucocytosis >10×10⁹ cells/l, or leucopenia <4×10⁹ cells/l [Reference Niederman3]. In-patient or outpatient management of patients was not an exclusion criterion for either trial.
Patients with lower respiratory tract infections other than CAP, including bronchitis or exacerbation of chronic obstructive pulmonary disease and asthma, were not considered. Furthermore, patients with cystic fibrosis, active pulmonary tuberculosis, hospital-acquired pneumonia and severe immunosuppression (patients infected with human immunodeficiency virus and a CD4 count <350×10⁹/l, patients on immunosuppressive therapy after solid organ transplantation, neutropenic patients with a present neutrophil count <500×10⁹/l, and patients under chemotherapy with neutrophils 500–1000×10⁹/l with an expected decrease to values <500×10⁹/l) were not eligible for trial inclusion.
Patients were examined on admission to the ED by a medical resident supervised by a board-certified specialist in internal medicine. Baseline assessment included collection of clinical data and vital signs, comorbid conditions, and routine blood tests. All study forms were completed contemporaneously.
Since neither of the trials showed a significant difference between the intervention arm and the control arm regarding all-cause mortality (pooled OR 0·78, 95% CI 0·41–1·50, P=0·46), treatment assignment was not considered any further in this analysis. In addition, the coefficients of the calibration models did not differ between the intervention group and the control group of the study population for any of the three models assessed.
Both trials had been approved by the local Ethical Committees and registered in the Current Controlled Trials Database (ISRCTN04176397); all patients gave written informed consent.
Severity assessment and outcome
The PSI, CURB65 and CRB65 scores were calculated in all patients on the basis of the patients' unique set of prognostic indicators. Identical to the outcome definition of the original models [Reference Fine5, Reference Lim7, Reference Neill8], we used 30-day mortality as outcome for our validation study, which was collected in both trials as part of the trial safety monitoring.
Statistical analysis
We performed external validation of the three original models by assessing calibration and discrimination. We first studied calibration in a descriptive way by tabulating and plotting observed mortality across classes of predicted mortality as given with the original models [Reference Fine5, Reference Lim7]. We then studied calibration in the context of a simple calibration model, regressing the binary outcome (dead or alive at 30 days) from our study population on the logit of the predicted mortality from the original models using logistic regression [Reference Steyerberg20, Reference Steyerberg21]. This calibration model has the advantage of efficiency since it uses only two free parameters: an intercept α and a calibration slope β. In the ideal case of perfect validity, α=0 and β=1. The parameters can be tested with ANOVA or Wald statistics. If α or β deviates significantly from the ideal case, then there is evidence of miscalibration and model recalibration should be performed [Reference Steyerberg20, Reference Steyerberg21]. The recalibrated risk can be calculated as recalibrated risk = 1/(1 + exp[−(α + β × logit(prob))]), where logit(prob) = ln[prob/(1 − prob)], with probabilities (prob) originating from the original model and the coefficients α and β originating from the calibration model (see Appendix). We did not compare observed mortality with predicted mortality within risk classes for each of the three models because of the low efficiency of this approach. For each model with five classes this approach would result in five one-sample proportion tests (the random observed mortality against the fixed predicted mortality). We would therefore need to spend 5 degrees of freedom (d.f.) per model instead of 2 d.f. with the calibration model approach, and we would additionally face the problem of multiple comparisons [Reference Steyerberg20].
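The calibration model described above can be illustrated with a short sketch. This is not the authors' analysis code (the study used R and the Design library); it is a minimal Python illustration under stated assumptions. The function names fit_calibration_model and recalibrate, the Newton-Raphson fitting routine and the toy data are all hypothetical choices made for this example.

```python
import math


def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_calibration_model(predicted, died, n_iter=25):
    """Fit logit(P(death)) = alpha + beta * logit(predicted).

    Maximum likelihood via Newton-Raphson, starting from the ideal
    values alpha = 0 and beta = 1 (hypothetical helper, not from the paper).
    """
    x = [logit(p) for p in predicted]
    alpha, beta = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, died):
            mu = inv_logit(alpha + beta * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for alpha
            g1 += (yi - mu) * xi     # score for beta
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


def recalibrate(prob, alpha, beta):
    """Recalibrated risk for a predicted probability from the original model."""
    return inv_logit(alpha + beta * logit(prob))


# Toy data (hypothetical): two risk classes with predicted mortality 10% and 50%.
predicted = [0.1] * 10 + [0.5] * 10
well_calibrated = [1] + [0] * 9 + [1] * 5 + [0] * 5      # observed 10% and 50%
underestimated = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5    # observed 20% and 50%

print(fit_calibration_model(predicted, well_calibrated))
alpha, beta = fit_calibration_model(predicted, underestimated)
print(alpha, beta, recalibrate(0.1, alpha, beta))
```

In the first toy data set the observed mortality equals the predicted mortality in both classes, so the fit stays at the ideal values. In the second, mortality in the lower class is underestimated, and the recalibrated risk for that class moves from 10% towards the observed 20%.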
Discrimination refers to the ability of the model to assign a higher predicted mortality to all patients with an outcome (30-day mortality) compared to patients without an outcome. We assessed discrimination using the c statistic which is equal to the area under the receiver-operating characteristics (ROC) curve. Moreover, we used the Brier score as an overall measure of model performance [Reference Poses, Cebul and Centor22].
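As an illustration of the two performance measures named above, the following Python sketch computes the c statistic as the proportion of concordant event/non-event pairs (ties counted as one half) and the Brier score as the mean squared difference between predicted risk and outcome. The function names and the toy data are assumptions made for this example and are not taken from the study.

```python
def c_statistic(predicted, died):
    """Probability that a randomly chosen non-survivor has a higher
    predicted risk than a randomly chosen survivor (ties count 0.5).
    Equals the area under the ROC curve."""
    events = [p for p, d in zip(predicted, died) if d == 1]
    nonevents = [p for p, d in zip(predicted, died) if d == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))


def brier_score(predicted, died):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - d) ** 2 for p, d in zip(predicted, died)) / len(died)


# Toy data (hypothetical): four patients, the two with the highest risk die.
toy_predicted = [0.1, 0.2, 0.8, 0.9]
toy_died = [0, 0, 1, 1]
print(c_statistic(toy_predicted, toy_died))
print(brier_score(toy_predicted, toy_died))
```

A c statistic of 0·5 corresponds to no discrimination and 1·0 to perfect discrimination; a lower Brier score indicates a smaller overall prediction error.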
We used R version 2.3.1 [23] and the Design library [Reference Harrell24] for statistical analyses.
RESULTS
Baseline characteristics
Between December 2002 and February 2005, a total of 483 (96 in the first and 387 in the second trial) consecutive patients with an initial diagnosis of CAP were screened for eligibility (Fig. 1). CAP was radiologically confirmed in 373 (87 and 286) patients, who are included in this analysis. In total, 110 patients were excluded because of the use of immunosuppressive drugs (n=46), hospital-acquired pneumonia (n=17), non-CAP diagnosis (n=16), tuberculosis (n=3), cystic fibrosis (n=1), death before inclusion (n=2) or refusal of informed consent (n=25). The median age of the patients was 73 years [inter-quartile range (IQR) 59–82 years], 84 patients (23%) were smokers with a median of 40 pack years (IQR 20–50) and 90 (24%) of the patients had an underlying chronic obstructive lung disease. Forty-nine percent of the patients (n=184) were randomized to receive antibiotic treatment according to procalcitonin guidance and 51% of the patients (n=189) were allocated to the control group. The majority of the patients (95·4%) were treated as in-patients with a median length of hospital stay of 11 days (IQR 6–17 days). Outpatients were predominantly in low-risk classes of PSI score (53% in class I, 29% in class II) and of CURB65 and CRB65 scores (76% each in class 0). Baseline characteristics of our study population are summarized in Table 1. For comparison, the characteristics of the study populations in which the original models were developed are also given in Table 1 [Reference Fine5, Reference Lim7].
PSI, Pneumonia severity index; ICU, intensive care unit.
* Values are expressed as median and interquartile range (IQR).
† Because of rounding, percentages may not sum to 100.
Calibration of the three different rules
Overall, 41/373 patients died (4/96 and 37/373) and the overall mortality was 11%. The proportion of intensive care unit (ICU) admission was 8% (8/96) in the first and 10% (37/373) in the second study.
Table 2 and the calibration plots in Figure 2 illustrate predicted mortality from the original models against observed mortality within classes of predicted mortality. Compared to the observed mortality of 11%, the predicted average 30-day mortality was underestimated with each of the original models (8·4% for PSI, 5·5% for CURB65, 5·0% for CRB65, respectively). Importantly, within the low-risk classes of the original models we observed relevant mortalities of 2·5%, 5·3%, and 3·7% (PSI classes I–III, CURB65 classes 0–1, and CRB65 class 0, respectively) (Table 2). On average, the original models therefore underestimated mortality in the low-risk classes by a factor of about four. The risk estimates in the high-risk classes were more accurate for all scores. The PSI score, but not the CURB65 and CRB65 scores, slightly overestimated the risk of death in the highest risk class (Fig. 2). The four patients misclassified in the PSI and the three patients misclassified in the CURB65 score were younger [median age 67 years (IQR 62–75) and 59 years (IQR 57–61), respectively] compared to correctly classified non-survivors [median age 79 years (IQR 70–86)].
PSI, Pneumonia severity index; CURB, Confusion, Urea >7 mmol/l, Respiratory rate >30/min, low Blood pressure (systolic <90 mmHg or diastolic <60 mmHg); CRB65, the same as CURB65, but does not include urea.
Calibration models showed significant miscalibration of the β slope for all scores (P<0·001 each) underlining the necessity of recalibration (see Appendix). Calibration plots in Figure 2 show the impact of recalibration of the original models: the recalibrated mortalities are in good agreement with the observed 30-day mortality and therefore show good calibration for each model. Details on the calculation of recalibrated mortalities are given in a worked example in the Appendix. In the low-risk classes, recalibration corrected risk estimation for the PSI score (classes I–III) from 0·5% to 2·7%, for the CURB65 score (classes 0 and 1) from 1·2% to 5% and for the CRB65 score (class 0) from 0·9% to 3·5%, compared to the observed mortality of 2·5%, 5·3% and 3·7% in the corresponding risk category of each model. Within our tertiary-care setting of high-risk patients, only the recalibrated PSI was an adequate tool for the identification of low-risk patients with a predicted mortality in the low range of 1%.
Patients in the lowest risk classes of recalibrated CURB65 and CRB65 scores still had relevant mortality rates of 3·6% and 3·5% (Table 3) respectively. Only the recalibrated PSI class I score was therefore adequate for classifying patients as low risk in the range of 1%. Table 3 shows the identification of low-risk patients according to the original and the recalibrated PSI score. The original model classified 162 patients (43%) as low risk (classes I–III) with an observed mortality of 2·5%. The recalibrated PSI class I identified 41 patients (11%) as low risk with an observed mortality of 0%. Accordingly, the sensitivities and specificities of the original PSI risk model were 90·2% and 47·3%, and 100% and 12·3% for the recalibrated PSI score, respectively.
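The sensitivity and specificity reported for the recalibrated PSI score can be reproduced from the counts given in the text. The Python sketch below assumes that a 'positive' classification means not being assigned to the low-risk group and that the event of interest is death within 30 days; this reading is inferred from the reported figures rather than stated explicitly. The function name is hypothetical, and the count of 332 survivors is derived here as 373 patients minus 41 deaths.

```python
def sensitivity_specificity(deaths_not_low_risk, deaths_low_risk,
                            survivors_low_risk, survivors_not_low_risk):
    """Sensitivity: share of non-survivors NOT classified as low risk.
    Specificity: share of survivors classified as low risk."""
    sensitivity = deaths_not_low_risk / (deaths_not_low_risk + deaths_low_risk)
    specificity = survivors_low_risk / (survivors_low_risk + survivors_not_low_risk)
    return sensitivity, specificity


# Recalibrated PSI class I: 41 patients classified as low risk, none of whom died.
total_patients = 373
total_deaths = 41
total_survivors = total_patients - total_deaths   # 332 (derived, see lead-in)
low_risk_patients = 41
low_risk_deaths = 0

sens, spec = sensitivity_specificity(
    deaths_not_low_risk=total_deaths - low_risk_deaths,
    deaths_low_risk=low_risk_deaths,
    survivors_low_risk=low_risk_patients - low_risk_deaths,
    survivors_not_low_risk=total_survivors - (low_risk_patients - low_risk_deaths),
)
print(round(100 * sens, 1), round(100 * spec, 1))
```

Under these assumptions the calculation gives 100% sensitivity and 12·3% specificity, matching the values reported for the recalibrated PSI score.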
PSI, Pneumonia severity index.
* Classes I–III of the original PSI score correspond to a low mortality of ⩽1%. After recalibration of the original model, only class I corresponds to a mortality of ⩽1% and therefore classifies patients as suitable for outpatient management.
† Mortality ⩽1%.
Discriminatory ability
We performed ROC analysis to assess the discriminatory ability of the three prognostic scores (Fig. 3). The PSI score had an area under the curve (AUC) of 0·72 (95% CI 0·65–0·78). The respective values for CURB65 and CRB65 scores were 0·69 (95% CI 0·61–0·77) and 0·66 (95% CI 0·58–0·73). The corresponding Brier scores were 0·094, 0·096 and 0·098 for PSI, CURB65 and CRB65, respectively, indicating the lowest prediction error for the PSI score. Recalibration did not numerically affect the discriminatory ability of the models.
DISCUSSION
We performed an external validation study of three well established mortality prediction rules in 373 patients with CAP admitted to a tertiary-care centre in Switzerland. There was acceptable discriminatory performance but all scores markedly underestimated the mortality, particularly in the low-risk classes. This leads to misclassification of patients with a substantial mortality in the low-risk classes. As guidelines [Reference Niederman3, Reference Woodhead4] recommend outpatient management for low-risk patients, misclassification may result in inadequate discharge of patients with a considerable risk of death and potential legal consequences. Recalibration of the risk models corrected the miscalibration of predicted mortalities of all models under investigation. Of the recalibrated models, only the PSI was sensitive enough to accurately identify low-risk patients suitable for outpatient management.
Three different prediction rules, namely the PSI, CURB65 and CRB65 scores, have been proposed and extensively validated for risk stratification in CAP [Reference Niederman3–Reference Lim6]. All three rules are originally designed to identify patients who are at low risk of death and who may hence qualify for outpatient management. Algorithms from statistical prediction models reflect the risk profile of patients embedded in a certain health-care setting where the original model was derived. Consequently, when transporting these rules to different settings at different times, validation and adaptation, if needed, is recommended [Reference Steyerberg20, Reference Steyerberg21].
The original PSI, CURB65 and CRB65 classified 43%, 55% and 30% of the patients as low risk with a presumed mortality in the range of 1%. If the original models were well calibrated in our data, the observed mortalities in the low-risk classes would not have exceeded 1%. However, we observed mortalities of 2·5%, 5·3% and 3·6% indicating the need to recalibrate the models in our tertiary-care setting. Using a basic recalibration approach, the miscalibration of each model (Fig. 2) and the resulting misclassification of patients was corrected. Nevertheless, with mortality rates of 3·5% each in the lowest risk classes, the CURB65 and CRB65 scores were too insensitive to identify subjects with low mortality rates in the range of 1%. Consequently, the CURB65 and CRB65 may help to identify patients at high risk, but their ability to recognize low-risk patients is limited. Unlike the CURB65 and the CRB65, the recalibrated PSI score showed an adequate performance in the low-risk range and was able to correctly identify 11% of the patients with a mortality of 1%. Only class I of the recalibrated PSI score can therefore be used to identify patients who qualify for outpatient management.
Prior validation studies have prospectively evaluated severity scores in different clinical settings and reported high mortality rates particularly in patients of PSI class III or above and CURB65 class 1 or above [Reference Flanders10–Reference Aujesky17]. These studies, however, focused mainly on the overall discriminatory ability of the prediction rules with varying results as expressed by differences in the area under the ROC curves. The present study extends these findings by showing that a model with apparently adequate discrimination may mislead clinical decisions because of miscalibration in a particular clinical setting. Importantly, the area under the ROC curve of a prediction model is numerically not affected by miscalibration, because miscalibration affects the magnitude of the predicted risk but not the ranking of individual patients according to their predicted risks. Validation of both calibration and discrimination is thus crucial before a model derived at a different time and place is implemented in a clinical setting.
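The point that recalibration changes the level of predicted risk but not the ranking of patients can be checked numerically. The Python sketch below uses toy predicted risks and hypothetical calibration coefficients (alpha = 0·5, beta = 0·6); these are not the study's estimates, and the helper names are assumptions made for this example. Any calibration slope greater than zero gives a strictly increasing transformation, so the order of patients is preserved.

```python
import math


def recalibrate(prob, alpha, beta):
    """Logistic recalibration: inverse logit of alpha + beta * logit(prob)."""
    z = alpha + beta * math.log(prob / (1.0 - prob))
    return 1.0 / (1.0 + math.exp(-z))


def c_statistic(predicted, died):
    """Share of event/non-event pairs ranked correctly (ties count 0.5)."""
    pairs = [(pe, pn)
             for pe, de in zip(predicted, died) if de == 1
             for pn, dn in zip(predicted, died) if dn == 0]
    score = sum(1.0 if pe > pn else 0.5 if pe == pn else 0.0 for pe, pn in pairs)
    return score / len(pairs)


# Toy data (hypothetical): class-level predicted risks and 0/1 outcomes.
original = [0.01, 0.01, 0.05, 0.05, 0.2, 0.2, 0.4, 0.4]
died = [0, 0, 0, 1, 0, 1, 1, 1]

# Hypothetical coefficients; beta > 0 keeps the transformation monotone.
recalibrated = [recalibrate(p, alpha=0.5, beta=0.6) for p in original]

c_before = c_statistic(original, died)
c_after = c_statistic(recalibrated, died)
mean_before = sum(original) / len(original)
mean_after = sum(recalibrated) / len(recalibrated)
print(c_before, c_after)        # identical: ranking is unchanged
print(mean_before, mean_after)  # different: the level of risk has shifted
```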
The present study does not per se question the utility of these tools, but underlines the importance of adapting them to local settings. The study emphasizes the importance of validation by calibration and of recalibration in the case of significant miscalibration. Although our study population differs from the original derivation population, the assessment of model calibration in a high-risk setting is of interest from a pragmatic point of view. CAP guidelines recommend the use of CAP risk scores, including in EDs, but do not specify the settings in which the scores are or are not indicated. In a high-risk population, as found in our study with a high proportion of referred polymorbid patients, risk scores may underestimate mortality risks, while in a low-risk setting (e.g. primary care) risk predictions may be inadequately high. Importantly, misclassified patients were found to be younger than correctly classified patients. As age is the strongest predictor in the risk scores, the risk of younger people may be underestimated in particular in the high-risk setting.
When evaluating a model for risk stratification, one should start using pre-existing knowledge and, if available, validate and update an existing model within the setting in question instead of building a new model from scratch with all the drawbacks of overfitting and lack of reproducibility [Reference van Houwelingen25]. Recalibration of existing models is attractive because of the stability which is related to the fact that only two parameters (intercept and calibration slope) are estimated [Reference Steyerberg20] (see Methods section). Directly using the observed risk pertaining to a certain risk class is hampered through the potential imprecision due to the small number of observations within a certain class of predicted risk (Table 2). The recalibration approach is preferable since it is efficient and uses only two parameters which is of particular relevance in small samples as exemplified in our previous study [Reference Steyerberg20].
This study included a typical spectrum of CAP patients from a university hospital in Switzerland. Our study population differed from the original study population of the PSI score in terms of age, comorbidities (e.g. renal failure, heart disease and neoplastic disease) and laboratory findings. As this was a study from a European tertiary-care centre, most patients were referred and selected by family physicians requesting in-patient management. Accordingly, patients had more severe pneumonia as assessed by the PSI score, the rate of outpatient management was low, and mortality and the rate of ICU admission were higher than in the original studies. However, we consider the population from this study as representative of the European tertiary-care setting, especially for Western Europe. In this study, the 5% of patients treated as outpatients were predominantly in the lowest risk classes of PSI, CURB65 and CRB65 scores. In comparison with 11% of low-risk patients according to the recalibrated PSI score, the management of CAP patients was reasonable after all. As outlined by guidelines, mortality prediction rules should be used to support but not replace physician decision-making about outpatient or in-patient management [Reference Niederman3, Reference Woodhead4]. Patients may have rare medical conditions, and patients designated as ‘low risk’ may have medical and psychosocial contraindications to outpatient care. In particular, the ability to maintain oral intake, cognitive impairment, and the ability to carry out activities of daily living need to be considered. Thus, determination of the initial site of care still remains an ‘art of medicine’ decision that cannot yet be replaced by prediction rules [Reference Niederman3].
Some limitations should be considered in the discussion of our results. First, the number of outcome events to perform an external validation study was rather low [Reference Vergouwe26]. Second, we validated the severity scores in two trials with prospective follow-up where the issues of patient selection and representativeness of the population need to be addressed. However, the two trials consecutively included all patients with CAP, irrespective of in-patient or outpatient management. The study inclusion criteria corresponded to the criteria used in the original studies and to the criteria of CAP guidelines. In the original studies, the main reason for exclusion was non-CAP diagnosis and only a minority of patients was excluded because of severe contraindications such as immunosuppression or tuberculosis. It is reasonable to believe that the existing severity scores would rather underestimate the risk for rare conditions, and thus, an error because of unrepresentativeness would at least be conservative. Third, our analysis is based on predicted risks categorized in five risk classes, as issued by the original models. Preferably, we would have performed validation of the models based on a patient's individual risk using the coefficients of the original risk functions and each patient's risk profile. To the best of our knowledge the original risk functions are not published. With the full model to hand, a more differentiated picture of the performance in new data might have been possible.
In conclusion, without recalibration the original PSI, CURB65 and CRB65 scores misclassified patients with a relevant mortality as low risk in our Western European tertiary-care setting. Recalibration corrected miscalibration in each model, but only the PSI score was sensitive enough to truly identify patients at low risk. Based on this study, we advocate using the PSI prediction rule for severity assessment and consideration of outpatient management for patients with a PSI risk class I. Nevertheless, even recalibrated estimates need ongoing prospective validation and updating.
APPENDIX
Coefficients of the calibration models for the three validated risk scores
Example
Recalibration of the mortality estimate from the original model in a patient with PSI class III:
ACKNOWLEDGEMENTS
We thank the staff of the clinics of Emergency Medicine, Internal Medicine and Endocrinology and the Department of Clinical Chemistry, notably Fausta Chiaverio, Martina-Barbara Bingisser, Maya Kunz, Vreni Wyss and Ursula Schild, for most helpful support during the study.
DECLARATION OF INTEREST
None.