INTRODUCTION
Since 1987, a rise in notifications of tuberculosis (TB) has been observed in England [1]. This increase is believed to be real, reflecting an increase in diagnoses of TB, rather than an artefact due to improved reporting [Reference Rose, Gatto and Watson2]. Nevertheless, it has been estimated that between 7% and 27% of cases of TB in the United Kingdom are unnotified [Reference Pillaye and Clarke3]. In 1999, a revised national routine surveillance system for TB, Enhanced Tuberculosis Surveillance (ETS), was introduced to improve the completeness of reporting as well as the information on reported cases [Reference Van Buynder4]. The aim of this study was to estimate the annual incidence of TB in England and assess the completeness of reporting between 1999 and 2002 using record-linkage and capture–recapture methodology.
The accuracy and completeness of surveillance data can be increased through record-linkage between datasets of cases reported from different sources [Reference Sheldon5–Reference Mukerjee8]. This is carried out routinely for cases in ETS by linking notifications with reports of Mycobacterium tuberculosis isolates from the reference laboratories in the UK Mycobacterial Network (MycobNet). The number of cases missed can then be estimated using the overlap between the two data sources through capture–recapture analysis [9]. The preferred capture–recapture method entails log-linear modelling of at least three linked data sources [Reference Fienberg10–13]. The completeness of the different data sources can be assessed by comparison with the case ascertainment, i.e. the total number of patients observed in at least one data source, or the estimated total number of cases. Capture–recapture analysis has been used to evaluate surveillance systems of various infectious diseases in the United Kingdom [Reference Devine14–Reference Breen16]. The same methodology has been applied to TB surveillance in studies in both the United Kingdom and elsewhere [Reference Tocque17–Reference Van Hest20].
METHODS
Case definition and data sources
For the purpose of estimating the number of unobserved TB cases, i.e. cases not registered (‘observed’) in any of the linked registers studied, we defined as eligible for inclusion those active TB cases first reported to one or more of three data sources in the four years from 1 January 1999 to 31 December 2002. The three data sources were:
(1) Cases notified through ETS (Notification).
(2) Cases with M. tuberculosis complex isolates reported to MycobNet (Laboratory).
(3) Cases admitted to National Health Service hospitals with a first or secondary hospital discharge code of TB [International Classification of Diseases (ICD-10) codes A15–A19] provided from Hospital Episode Statistics (Hospital).
Two other data sources, used for cross-validation, are described below. An interval of more than 1 year between entries in each of the data sources was considered to indicate a separate episode of disease. To correct for delays in case reporting and mycobacteriological confirmation, records from the 3 months before and the 3 months after the study period were also examined.
Record-linkage
Duplicate entries within each of the three data sources were excluded. Hospital records were linked to the previously linked Notification and Laboratory records. Record-linkage software developed by the Centre for Infections establishes a likelihood of association between two records based on a core set of identifiers (date of birth, age, full postcode and sex of the patient and proximity of date of notification, initial mycobacterial isolate or hospital admission). It allows for visual inspection of available additional information on geographical location, site of disease, ethnicity and smear, culture or histopathology results (when performed). All cases with incomplete or missing information on both the date of birth and age were labelled as ‘insufficient identifiers’ and excluded.
The software allocates an a priori determined maximum number of points to each core identifier for complete agreement, reflecting the perceived relative importance of that identifier. Record pairs with full agreement of all core identifiers are automatically assigned as true links. Points are deducted proportionally to the presumed loss of information for increasing deviation from perfect linkage of each identifier to generate an aggregate score, reflecting the likelihood of association between two patient records. All categories of candidate links other than automatically assigned links were visually inspected and either accepted or rejected. Linked cases were allocated to the year of first known date of notification, culture-confirmation or hospital admission.
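The points-based scoring described above can be illustrated with a minimal sketch. The per-identifier maxima, the tolerance windows and the function names below are hypothetical; the text specifies only that maxima are set a priori, that points are deducted in proportion to the deviation from perfect agreement, and (in the Results) that the overall maximum is 4000 points. Age, which the software also uses, is omitted here for brevity.

```python
from datetime import date

# Hypothetical maxima per core identifier (assumption); only the total of
# 4000 points is taken from the text.
MAX_POINTS = {"dob": 1500, "postcode": 1000, "sex": 500, "event_date": 1000}


def date_points(a: date, b: date, max_points: int, tolerance_days: int) -> float:
    """Deduct points in proportion to the gap between two dates."""
    gap = abs((a - b).days)
    return max_points * max(0.0, 1.0 - gap / tolerance_days)


def postcode_points(a: str, b: str, max_points: int) -> float:
    """Award points in proportion to the shared leading characters."""
    a, b = a.replace(" ", "").upper(), b.replace(" ", "").upper()
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return max_points * shared / max(len(a), len(b), 1)


def link_score(rec_a: dict, rec_b: dict) -> float:
    """Aggregate likelihood-of-association score for a candidate record pair."""
    score = date_points(rec_a["dob"], rec_b["dob"], MAX_POINTS["dob"], 365)
    score += postcode_points(rec_a["postcode"], rec_b["postcode"], MAX_POINTS["postcode"])
    score += MAX_POINTS["sex"] if rec_a["sex"] == rec_b["sex"] else 0
    score += date_points(rec_a["event_date"], rec_b["event_date"],
                         MAX_POINTS["event_date"], 180)
    return score


def classify(score: float) -> str:
    """Full agreement links automatically; everything else is inspected visually."""
    if score == sum(MAX_POINTS.values()):
        return "automatic link"
    return "visual inspection"


notification = {"dob": date(1970, 1, 1), "postcode": "SW1A 1AA",
                "sex": "F", "event_date": date(2001, 5, 1)}
hospital = dict(notification, postcode="SW1A 2AA")  # one identifier deviates
print(classify(link_score(notification, notification)))
print(classify(link_score(notification, hospital)))
```

A single aggregate score keeps the visual-inspection queue ordered by plausibility, which is presumably why the Results can report acceptance rates for a score band such as ⩾3000 points.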
False-positive records and correction
All laboratory-confirmed cases reported through MycobNet were assumed true TB cases, as previously found in a local capture–recapture study in England [Reference Tocque17]. Notification and Hospital records not linked with Laboratory could potentially include three groups of false-positive records:
(1) Cases ultimately diagnosed with an infection with Mycobacteria other than tuberculosis (MOTT).
(2) Cases with a final diagnosis other than TB or MOTT infection.
(3) Cases misclassified or miscoded.
The proportion of unlinked Hospital cases attributable to MOTT infection was estimated by linking Hospital data from 2003 with a MOTT database that began in that same year. Assuming the annual proportion is similar, this estimate was used to correct the number of unlinked Hospital cases in all years under study, using a formula explained below.
In order to estimate the proportion of cases with a final diagnosis other than TB or MOTT infection, Notification cases unknown to Laboratory were linked with Treatment Outcome Monitoring (TOM) data, which contain data on Notification cases with a final diagnosis other than TB. At the time of this study TOM data were only available for 2001. The proportion of false-positive Notification cases found was used to correct all years under study, assuming the annual proportion is similar.
Previous capture–recapture studies on TB identified a considerable proportion of remaining false-positives among unlinked Hospital cases after examining individual patients' medical files [Reference Tocque17, Reference Baussano19, Reference Van Hest20]. Examining individual patients' medical files was not feasible due to the scale of this study. We estimated the proportion of these remaining false-positive cases through a population mixture model. Briefly, we used 40 covariates (number of admission days, number of admissions during the TB episode, rank number of TB diagnosis (14 possible positions) and 37 different ICD-10 TB diagnosis codes) and the incidence of Hospital records linked with Notification and/or Laboratory to estimate the number of true TB cases among unlinked records, under the assumption that all linked Hospital cases are true TB cases and unlinked Hospital cases are a mixture of true and false-positive TB cases. The best-fitting logistic regression model calculates for every Hospital case the predicted Bernoulli parameter P (reflecting the probability of being a true TB patient) from the covariates. Linked and unlinked Hospital cases have characteristic frequency distributions of values P as ‘signatures’. After standardization we used these signature curves to separate the mixture of unlinked Hospital cases, assuming the subpopulation of true TB cases has a similar signature curve to linked Hospital cases and the false-positive TB cases have a different signature curve (population mixture model available in online Appendix). The corrected annual number of true TB cases known only to Hospital was calculated using the formula:
where N_original and N_final denote the number of unlinked Hospital cases before and after, respectively, deducting the projected annual proportion of MOTT infection cases and the estimated annual proportion of remaining false-positive TB cases by logistic regression; Prop_true denotes the estimated annual proportion of true TB cases by logistic regression and Prop_MOTT the projected annual proportion of MOTT infection cases.
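The formula itself is not reproduced in this text. One reading that is consistent with the definitions given, and that is an assumption here rather than the authors' stated expression, applies the two corrections multiplicatively: N_final = N_original × (1 − Prop_MOTT) × Prop_true. The sketch below applies that assumed form to the pooled four-year figures reported in the Results (the paper applies the correction per year, and the 3·8% MOTT proportion is taken here as applying to unlinked Hospital cases).

```python
def corrected_hospital_only(n_original: float, prop_mott: float, prop_true: float) -> float:
    """Assumed form: remove MOTT cases first, then keep the estimated true fraction."""
    return n_original * (1.0 - prop_mott) * prop_true


# Pooled figures quoted in the Results section.
n_unlinked = 5733        # unlinked Hospital cases, 1999-2002
prop_mott = 0.038        # projected proportion with MOTT infection
prop_true = 0.28         # chosen proportion of true TB cases
all_observed = 28_678    # TB cases included after all corrections

n_final = corrected_hospital_only(n_unlinked, prop_mott, prop_true)
share_of_all_cases = 100 * n_final / all_observed

print(round(n_final), round(share_of_all_cases, 1))
```

Under this reading the Hospital-only cases retained come to roughly 5·4% of all included cases, which matches the proportion of true unlinked Hospital cases quoted in the Discussion; that agreement supports the assumed form but does not confirm it.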
Observed source-specific coverage rates were defined as the number of TB cases in each data source divided by the case ascertainment, expressed as a percentage.
Capture–recapture analysis
The annual and total number of unobserved TB cases was estimated on the basis of the final distribution of observed cases over the three data sources. The independence of data sources and other assumptions underlying capture–recapture analysis have been described previously [Reference Van Hest, Smit and Verhave21]. Interdependencies between the three TB data sources are probable, causing possible bias in two-source capture–recapture estimates. Three-source log-linear capture–recapture analysis was employed to take possible interdependencies into account [Reference Tocque17, Reference Baussano19, Reference Van Hest20]. Estimated source-specific coverage rates were defined as the number of TB cases in each data source divided by the estimated number of TB cases by capture–recapture analysis, expressed as a percentage.
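To make the difference between two-source and three-source estimation concrete, the following sketch uses entirely hypothetical cell counts (not the study data). It computes the two-source Lincoln–Petersen estimate, which assumes independence, and the closed-form estimate of the unobserved cell under the saturated three-source log-linear model, which allows all pair-wise dependencies but assumes no three-way interaction. Cell keys are ordered Notification, Laboratory, Hospital, with 1 meaning "observed in that source".

```python
# Hypothetical counts of cases by presence (1) or absence (0) in
# Notification, Laboratory and Hospital, in that order.
cells = {"100": 400, "010": 150, "001": 100,
         "110": 600, "101": 200, "011": 50, "111": 300}


def lincoln_petersen(n_first: int, n_second: int, n_both: int) -> float:
    """Two-source estimate of the total population, assuming independent sources."""
    return n_first * n_second / n_both


def saturated_unobserved(c: dict) -> float:
    """Estimated number of cases missed by all three sources under the
    saturated log-linear model (no three-way interaction)."""
    numerator = c["111"] * c["100"] * c["010"] * c["001"]
    denominator = c["110"] * c["101"] * c["011"]
    return numerator / denominator


observed = sum(cells.values())
n_notification = sum(v for k, v in cells.items() if k[0] == "1")
n_laboratory = sum(v for k, v in cells.items() if k[1] == "1")
n_both = sum(v for k, v in cells.items() if k[:2] == "11")

two_source_total = lincoln_petersen(n_notification, n_laboratory, n_both)
three_source_total = observed + saturated_unobserved(cells)

# Estimated Notification-specific coverage, as defined in the text.
notification_coverage = 100 * n_notification / three_source_total
print(observed, round(two_source_total), three_source_total,
      round(notification_coverage, 1))
```

In practice the more parsimonious log-linear models are fitted as Poisson regressions on the seven observed cells and compared by AIC; the closed form above applies only to the saturated model.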
RESULTS
Table 1 shows the initial annual number of cases in each of the TB data sources before record-linkage and the proportion of records excluded from the study because of ‘insufficient identifiers’. The proportion of excluded records is small for all three TB data sources and consistent over the years examined.
The record-linkage process designated 10 539 of the 16 272 (64·8%) Hospital cases as links while 5733 cases (35·2%) remained unlinked. After visual inspection of the identifiers, 94·9% of all records allocated ⩾3000 points by the record-linkage software (from a maximum of 4000 points) were accepted as true links.
Table 2 shows the number, proportion and distribution of TB cases over the data sources after record-linkage, the corrections for estimated and projected proportions of false-positive cases and the final distribution. Record-linkage between the TOM and Notification data sources for 2001 identified 4·1% of cases known only to Notification and 4·1% of cases known to Notification and Hospital with a final diagnosis of not TB or MOTT infection. Record-linkage between Hospital records and the MOTT database for 2003 identified 3·8% of Hospital cases as having MOTT infection. The population mixture model gave a range of the proportion of true TB cases known only to Hospital of 0–38%, with an upper 95% confidence limit of 50%. The value 28% (uncertainty interval 19–50%) was chosen because of good support by the model and prior expectation based on national and international reports. The total estimated and projected percentage of false-positive cases among all Hospital cases was 26·7% (4352/16 272). Since 2000 the proportion of cases known only to Notification or Laboratory has fallen each year and the number of Notification cases linked to Laboratory or Laboratory and Hospital has increased. Of all 28 678 TB cases included in this study, 2990 (10·4%) were identified in the Laboratory data source with a positive culture for M. tuberculosis but unnotified.
NOT, Notification data source; LAB, Laboratory data source; HOSP, Hospital data source.
* After correction for multiple links and exclusion of patient records with insufficient identifiers.
† After correction for estimated proportion of cases with diagnosis other than tuberculosis identified in the Treatment Outcome Monitoring (TOM) dataset.
‡ After correction for estimated proportion of unlinked Hospital cases with diagnosis of Mycobacteria other than tuberculosis (MOTT) infection and false-positive hospital records.
Table 3 shows the annual and overall observed number of TB cases after record-linkage and correction for false-positive records. The overall observed source-specific coverage rates of notified, culture-confirmed and hospitalized TB cases were 84·1%, 54·3% and 41·6% respectively. Overall observed under-notification was 15·9%. The annual observed Notification-specific coverage rate increased from 81·8% to 86·7% between 1999 and 2002. The annual observed Laboratory and Hospital source-specific coverage rates were relatively stable over the study period.
* UI, Uncertainty interval.
Table 4 shows the annual and overall estimated number of unobserved and total TB cases after capture–recapture analysis. For all estimates the saturated log-linear model was preferred based on the Akaike Information Criterion (AIC), as none of the other, more parsimonious, models produced a negative AIC [9, Reference Hook and Regal12]. The overall estimated completeness of case ascertainment was 66·7% (28 678/42 969). The overall estimated source-specific coverage rates of notified, culture-confirmed and hospitalized TB cases were 56·2%, 36·2% and 27·7% respectively. Overall estimated under-notification was 43·8%. The number of unobserved TB cases fell every year. The annual estimated Notification-specific coverage rates between 1999 and 2002 were 48·1%, 51·1%, 59·0% and 66·5% respectively. None of the approximate confidence intervals includes expected values of under-notification. We estimated that using administrative reporting dates rather than the date of actual disease onset could cause capture–recapture analysis to overestimate the number of unobserved cases by 1·5% (model available from the authors).
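The relationship between the observed and estimated coverage rates can be checked with simple arithmetic on the figures quoted above. The sketch below is only a consistency check under the definitions given in the Methods; the variable names are illustrative, and small differences from the published percentages are expected because the inputs are themselves rounded.

```python
observed_total = 28_678    # case ascertainment after record-linkage and correction
estimated_total = 42_969   # saturated log-linear capture-recapture estimate

completeness = 100 * observed_total / estimated_total
unobserved = estimated_total - observed_total

# Observed source-specific coverage rates (% of case ascertainment).
observed_coverage = {"Notification": 84.1, "Laboratory": 54.3, "Hospital": 41.6}
# Published estimated coverage rates (% of estimated total).
reported_estimated = {"Notification": 56.2, "Laboratory": 36.2, "Hospital": 27.7}

# Changing the denominator from observed to estimated total rescales each rate.
rescaled = {source: rate * observed_total / estimated_total
            for source, rate in observed_coverage.items()}

print(round(completeness, 1), unobserved)
for source in observed_coverage:
    print(source, round(rescaled[source], 1), reported_estimated[source])
```

The estimated under-notification of 43·8% is simply 100% minus the estimated Notification coverage of 56·2%.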
ACI, Approximate confidence interval.
DISCUSSION
Main findings
This study shows that record-linkage of TB data sources and cross-validation with additional TB-related datasets improve data accuracy as well as completeness of case ascertainment. Large TB data sources require sophisticated record-linkage software, and a population mixture model is needed to estimate the proportion of false-positive TB cases among unlinked hospital cases. Since the introduction of ETS the annual observed completeness of notification has increased. However, 10·4% of the observed TB cases in this study were laboratory-confirmed but unnotified. The overall observed under-notification of 15·9% is consistent with previous reports. The 43·8% overall under-notification estimated by a saturated log-linear capture–recapture model is highly inconsistent with previous reports and its validity needs further examination [Reference Pillaye and Clarke3, Reference Tocque17].
Under-notification
In this study an interval of more than 1 year between entries in each of the data sources was considered to indicate a separate episode of disease. Although the number of patients with multiple episodes of TB according to this definition was limited, possibly including a small number of patients whose disease at diagnosis warranted more than 1 year's therapy, an extended definition, e.g. a 2-year interval, would (very) slightly reduce the number of observed cases in the Laboratory and/or Hospital registers and therefore (very) slightly increase the completeness of Notification. Increasing completeness of Notification could be influenced by improved data accuracy and record-linkage over the years.
The observed completeness of the Notification register in England is similar to that in comparable studies in Italy and The Netherlands [Reference Baussano19, Reference Van Hest20]. The estimated completeness of notification in England is low because of the high estimated total number of TB patients; it is highly inconsistent with the results in The Netherlands and Italy, probably due to greater violation of the capture–recapture assumptions. The completeness of the Laboratory register is lower than the completeness of the Notification register due to the proportion of culture-negative TB cases. The observed completeness of the Laboratory register in England is lower than in The Netherlands but higher than in Italy, indicating efforts to establish bacteriological confirmation of the diagnosis in England and The Netherlands, whereas in Italy apparently more patients are treated on empirical grounds. In England and The Netherlands, the observed completeness of the Hospital register is low, probably reflecting common policies of preferably treating TB patients as outpatients, including isolation at home for infectious patients. The high proportion of hospitalized TB patients in Italy suggests a system of (initial) clinical analysis, diagnosis, treatment or isolation.
An overall observed under-notification of 15·9% suggests that in England about 1100 TB patients may be unnotified annually, of whom the majority (2990/4534) are culture-confirmed, representing 10·4% of all TB cases. This reflects the most serious public health aspect of under-notification as culture-confirmed TB cases are assumed true cases and are potentially infectious. Failure to notify laboratory-confirmed cases jeopardizes control measures, including contact tracing. The capture–recapture studies in Italy and The Netherlands show proportions of unnotified culture-confirmed TB cases of 5·5% and 4·9% respectively [Reference Baussano19, Reference Van Hest20]. The proportion of unnotified culture-confirmed TB cases in England could be an overestimate resulting from possible imperfect record-linkage or, despite our assumption, remaining false-positive records in the Laboratory data source.
Limitations due to imperfect record-linkage and false-positive records
Imperfect record-linkage causes misclassification and results in observed and estimated numbers of TB cases being too low or too high. Our data show that 94·9% of the candidate links with a high likelihood-of-association score of ⩾3000 points were accepted as true links after visual inspection, and only 5·1% with such a score were rejected. This indicates that an error of classification could have occurred in only a minority of candidate links. This fulfils our purpose of record-linkage resulting in unbiased numbers in each category, with possibly some balanced misclassification. The relatively stable annual proportional distribution of TB cases and the decreasing annual proportion of unlinked Notification and Laboratory cases give further confidence in the record-linkage software and procedure.
A low positive predictive value of TB data sources results in observed and estimated numbers of TB cases being too high. Lack of specificity of data sources used in capture–recapture studies as a limitation to the validity of this method has previously been described [Reference Papoz, Balkau and Lellouch22, Reference Borgdorff, Glynn and Vynnycky23]. Not all TB cases are defined by gold-standard laboratory confirmation and diagnosis can be based on a clinical intention to treat. The three data sources used employ different case definitions, with consequent variations in specificity. We demonstrated by cross-validation with additional datasets that patients with a final diagnosis other than TB are not always de-notified or re-classified, which also reduces the positive predictive value.
The population mixture model estimates a proportion of 72% remaining false-positive cases among unlinked Hospital cases, contributing to 26·7% false-positive cases among all Hospital cases, and resulting in a final average proportion of true unlinked Hospital cases of 5·4%. These results are in good agreement with comparable record-linkage studies of TB incidence in the United Kingdom and elsewhere, indicating a plausible logistic regression model but expressing concern about the contribution of unscrutinized Hospital data sources to accurate estimates of TB incidence [Reference Mukerjee8, Reference Tocque17, Reference Baussano19, Reference Van Hest20].
Limitations due to violation of the underlying capture–recapture assumptions
The capture–recapture findings have to be placed in the context of the limitations of this study. The assessment of the coverage of the TB data sources was based on three-source log-linear capture–recapture models, only valid in the absence of violation of their underlying assumptions: perfect record-linkage (i.e. no misclassification of records), a closed population (i.e. no immigration or emigration in the time period studied) and a homogeneous population (i.e. no subgroups with markedly different probabilities to be observed and re-observed). In two-source capture–recapture methods one must also assume independence between data sources [i.e. the probability of being observed in one data source is not affected by being (or not being) observed in another] [9]. In the three-source capture–recapture approach dependencies between two data sources (pair-wise interdependencies) can be identified and incorporated in the log-linear model. However, the three-way interaction, i.e. dependency between all three data sources, cannot be incorporated in the model and its absence must be assumed. This and other limitations of capture–recapture analysis are described elsewhere in more detail [Reference Hook and Regal12, Reference Papoz, Balkau and Lellouch22, Reference Desenclos and Hubert24–Reference Tilling29].
Violation of the perfect record-linkage assumption and the problem of possible false-positive cases have already been discussed. Violation of the closed population assumption is presumed to be limited for TB as the opportunities for notification, culture confirmation or hospitalization are, also for immigrants, largely determined within a short period of time. However, this violation could result in overestimation of the number of patients.
TB services in England are organized around close collaboration between clinicians, microbiologists and public health professionals such as communicable disease control consultants and TB nurses. The log-linear capture–recapture models with the best goodness-of-fit were saturated models, i.e. including all two-way interactions. Violation of the assumption that a (positive) three-way interaction is absent, which would bias the estimates of the true population size downwards, cannot be ruled out [Reference Hook and Regal12, Reference Cormack26, Reference Hook and Regal27, Reference Regal and Hook30].
Violation of the homogeneity assumption is also likely: age, site of disease and infectiousness, among others, can cause different probabilities of being observed in a TB data source. One way of handling possible heterogeneity is to stratify the population into more homogeneous subpopulations and then to carry out capture–recapture analyses for each of the distinct groups. However, our corrections for the projected and estimated proportion of Notification and especially Hospital records being false-positive, and incomplete availability of relevant identifiers in all data sources prevented meaningful stratification. To investigate possible bias in the log-linear capture–recapture estimates as a result of violation of the homogeneity assumption, we have re-examined the data with alternative models, as described in the capture–recapture literature [Reference Hook and Regal12, Reference Regal and Hook30, Reference Wilson and Collins31]. These models reportedly perform well when compared to log-linear capture–recapture estimates and are arguably more robust to violation of the homogeneity assumption [Reference Regal and Hook30, Reference Hook and Regal32, Reference Smit, Reinking and Reijerse33].
(1) We first applied a structural source model [Reference Regal and Hook30]. This method models potential heterogeneity of the population, partly based on prior knowledge, and estimates the probabilities of conditions that produce the relationships between the data sources; more specifically in this instance, the proportion of patients with pulmonary or extrapulmonary TB in the population. The annual and overall estimated number of unobserved and total TB cases is shown in Table 5 but the structural source model did not fit well. The number of unobserved TB cases is very high in 1999 but then falls considerably every year to lower estimates compared to the saturated log-linear model, although each year the confidence intervals of both estimates overlap. The estimated annual Notification-specific coverage rate improves every year. The approximate confidence interval of the 2002 estimate includes expected values of under-notification.
The structural source model estimates a large majority of the unobserved TB cases to have extrapulmonary TB. Local under-notification of non-respiratory TB of 47% has been reported in the United Kingdom [Reference Mukerjee8]. This possibly reflects health service organization in the United Kingdom where extrapulmonary cases are less likely to be managed by clinicians familiar with notification of infectious diseases. Apart from underestimating the burden of TB, the implications for public health are limited as extrapulmonary TB patients are rarely infectious.
(2) We tested our data using Zelterman's truncated Poisson mixture model, which is also vulnerable to possible violation of underlying assumptions [Reference Zelterman34]. This estimator and similar ones have been used in the social sciences to estimate the size of hidden populations such as illicit drug users and homeless persons [Reference Smit, Reinking and Reijerse33, 35–Reference Hay and Smit37]. A recent publication compares three-source capture–recapture model estimates with the estimates of truncated models, including Zelterman's model, for 19 datasets of infectious disease incidence and discusses the conditions where these estimates are similar or dissimilar [Reference Van Hest38]. The results of that study suggest that, for estimating infectious disease incidence and completeness of notification, independent (i.e. without pair-wise interdependencies between the data sources) and parsimonious (i.e. incorporating one or two pair-wise interdependencies between the data sources) three-source log-linear capture–recapture models are preferable. However, when saturated models are selected as the best-fit model and the estimates are unexpectedly high and seem implausible, the data should be re-examined with truncated models as a heuristic tool, in the absence of a gold standard, to identify possible failure in the saturated log-linear model. When the truncated models produce a lower and more plausible estimated number of infectious disease patients, arguments are put forward that the estimates of the truncated models could be preferable. Table 5 shows the annual and overall estimated numbers of unobserved and total TB cases. The estimated numbers of unobserved TB cases were low compared to the structural source model, especially in 1999. From 2000 onwards the estimates fell every year. According to Zelterman's model, estimated completeness of Notification was 70·7% overall and 68·5%, 63·8%, 73·6% and 76·4% for the years 1999–2002 respectively. The confidence intervals do not overlap with the other models but include expected values of under-notification in 2001 and 2002.
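For readers unfamiliar with truncated models, a minimal sketch of Zelterman's estimator follows. It uses only the number of cases seen in exactly one source (f1) and in exactly two sources (f2): the Poisson rate is estimated as 2·f2/f1 and the observed count is divided by the estimated probability of being seen at least once. The counts below are hypothetical and the function name is illustrative; how the study handled its three registers in detail is not described in this text.

```python
import math


def zelterman_total(f1: int, f2: int, n_observed: int) -> float:
    """Zelterman's truncated Poisson estimate of the total population size.

    f1, f2: cases observed in exactly one and exactly two sources.
    n_observed: all cases observed in at least one source.
    """
    rate = 2.0 * f2 / f1                   # estimated Poisson parameter
    p_observed = 1.0 - math.exp(-rate)     # probability of appearing at least once
    return n_observed / p_observed


# Hypothetical frequencies of appearance across three registers.
f1, f2, f3 = 650, 850, 300
n_observed = f1 + f2 + f3

total = zelterman_total(f1, f2, n_observed)
unobserved = total - n_observed
print(round(total), round(unobserved))
```

Because the estimate depends only on the low-frequency counts, it is less sensitive to a subgroup of cases that is captured by every source, which is the usual argument for its robustness to heterogeneity.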
In the comparative study mentioned above, the number of TB patients in England was also estimated using a Poisson heterogeneity model and a truncated binomial model [Reference Van Hest38]. Compared to the Zelterman model, the Poisson heterogeneity model estimated a slightly lower overall completeness of Notification (68·7%) and the truncated binomial model estimated a slightly higher completeness of Notification (73·3%). The latter result could be an overestimate due to some violation of the equiprobability assumption underlying the binomial model [Reference Van Hest38].
Hook & Regal state that ‘In no sense is there any proof or reassurance that application of multiple-source log-linear estimators for any particular observed data on real populations results in a valid estimate, nor even necessarily produce an estimate closer to the true value than some alternative approach’ and ‘if the saturated log-linear model is selected by any criterion the investigator should be particularly cautious about using the associated outcome’ [Reference Hook and Regal12]. Confidence in the validity of capture–recapture results may reflect publication bias in favour of successful capture–recapture studies rather than the inherent strength of this methodology [Reference Hay39].
CONCLUSION
Record-linkage, as performed in ETS, improves accuracy of surveillance data as well as completeness of case ascertainment of TB. Hospital-derived data added a limited number of possible true TB patients. Since the introduction of ETS the annual observed completeness of notification has increased. This is probably due to improvements in case reporting combined with improved data collection and record-linkage. This study shows that 10·4% of the observed TB cases in England were laboratory-confirmed but not notified. The overall observed under-notification was 15·9%, which is consistent with previous reports. Overall under-notification estimated by a saturated log-linear capture–recapture model was highly inconsistent with previous reports and could be an overestimate due to violation of the underlying assumptions, especially the homogeneity assumption, as suggested by the alternative models.
Even without capture–recapture analysis including hospital episode registers, record-linkage and case ascertainment using the two most relevant sources for infectious disease surveillance, namely notification and laboratory, as performed in ETS, will often already considerably improve knowledge of the number of patients and of infectious disease incidence rates, as well as the completeness of information on specific demographic, diagnostic or epidemiological variables. Both sources are expected to have a high specificity and hence a high positive predictive value. All unlinked laboratory cases in addition to the notifications are by definition TB cases. According to Zelterman's truncated model, in England the estimated completeness of the Notification and Laboratory records combined was 78·2%, 74·1%, 81·0% and 83·8% for 1999–2002 respectively, all within the expected range of under-notification and consistent with the results of parsimonious capture–recapture model estimates in some other European countries [Reference Baussano19, Reference Van Hest20]. Real-time record-linkage of laboratory data and incident case reports in ETS allows appropriate prospective action to be taken: local consultants in communicable disease control or TB control nurses can identify and approach the clinicians treating the unlinked culture-positive TB cases, treat the unlinked MycobNet reports as ‘pre-notifications’, and encourage the clinicians to notify these patients. This would increase the completeness of the notifications register, as would campaigns by public health authorities to raise clinicians' awareness of (compulsory) notification. Appointing a clinician, e.g. one of the consultant chest physicians, as TB coordinator in every hospital, to be consulted for each patient with TB in that hospital, including extrapulmonary cases, could further promote notification.
ACKNOWLEDGEMENTS
We thank David Quinn for designing the record-linkage software and processing the record-linkage procedures, Charlotte Anderson for providing the MOTT data and Valerie Delpech and Filip Smit for their suggestions on alternative models. Permission for this study was obtained from the Hospital Episode Statistics Security and Confidentiality Advisory Group (S&CAG).
NOTE
Supplementary material accompanies this paper on the Journal's website (http://journals.cambridge.org).
DECLARATION OF INTEREST
None.