INTRODUCTION
Surveillance of infectious diseases is an essential part of public health. Mandatory notification is one of the mechanisms to carry out such surveillance but under-notification has been widely reported. For meaningful interpretation of the number of patients with infectious diseases the completeness of notification should be estimated. This can be done through a statistical technique called capture–recapture analysis. Based on certain assumptions, capture–recapture methods use information on the overlap of linked disease registers to estimate the number of patients unknown to all registers and thus the estimated total number of patients [1]. Completeness of notification can then be assessed relative to the estimated total number of patients. In biomedical sciences capture–recapture analysis is frequently used for estimating the number of accidents and injuries [Reference LaPorte2] and patients with mostly chronic diseases such as congenital deformities [Reference Orton, Richard and Miller3], insulin-dependent diabetes mellitus [4], cancer [Reference McClish and Penberthy5], neurological conditions [Reference Tilling, Sterne and Wolfe6] or rheumatological diseases [Reference Mahr7]. Less frequently it has been used for evaluating infectious disease surveillance, especially when record-linkage is based on more than two registers.
The validity of capture–recapture estimates depends on possible violations of the underlying assumptions: cases can be uniquely identified (i.e. registers have a perfect positive predictive value), perfect record-linkage (i.e. no misclassification of records), a closed population (i.e. no immigration or emigration in the time period studied) and a homogeneous population [i.e. no subgroups with markedly different (re)capture probabilities]. In two-source capture–recapture methods one must also assume independence between registers [i.e. the probability of being observed in one register is not affected by being (or not being) observed in the other registers]. In the three-source capture–recapture approach pairwise dependencies, i.e. dependencies between two registers, can be identified and accounted for in a log-linear model [1, Reference Fienberg8–11]. The three-way (highest-order) interaction, however, i.e. dependency between all three registers, cannot be incorporated in the model and its absence must be assumed.
In epidemiological studies violation to some degree of most of the underlying capture–recapture assumptions is unavoidable. This and other limitations of capture–recapture analysis are described elsewhere in more detail [Reference Hook and Regal10, Reference Desenclos and Hubert12–Reference Chao19]. Infectious diseases carry an elevated risk that some capture–recapture analysis assumptions are violated. Especially absence of dependence between the available registers, including three-way interaction, and heterogeneity among the patients cannot be excluded and should be expected. Consequently, the validity of two-source and three-source capture–recapture studies requires critical scrutiny.
Sometimes, it becomes evident that a capture–recapture model breaks down and produces erratic results. While performing three-source log-linear model capture–recapture studies on the completeness of notification of tuberculosis in The Netherlands [Reference Van Hest20] and England we were confronted with unexpected and unrealistic estimates of tuberculosis incidence, despite using well-described procedures for finding the best log-linear model [Reference Van Hest, Smit and Verhave21]. In this context, solely relying on three-source capture–recapture analysis without any cross-validation seems to be inappropriate. We suggest that three-source capture–recapture analyses should be complemented by alternative methods to arrive at, and cross-validate, estimates of population size. Alternative models related to capture–recapture analysis have been described and offer the opportunity to cross-validate outcomes. The aim of this study is to re-examine the data of published and current three-source log-linear model capture–recapture studies on infectious disease incidence with various truncated models for incomplete count data and describe the apparent agreement or discrepancy of the estimates.
METHODS
Data sources
Data sources used were 19 datasets in 16 published or current three-source log-linear model capture–recapture studies on infectious disease incidence known to us.
Truncated population estimators
The data sources were re-examined with three alternative population estimators: a truncated binomial model, a truncated Poisson mixture model (Zelterman) [Reference Zelterman22] and a truncated Poisson model (Chao) [Reference Chao23, Reference Chao24]. Out of the many possible methods we have chosen this combination of truncated models because they have been described as an alternative to capture–recapture methods [Reference Hook and Regal10, Reference Hook and Regal25], can be used on the same data that is needed for the three-source log-linear model and are easy to apply [Reference Van der Heijden, Cruyff and Van Houwelingen26, Reference Smit, Toet, Van der Heijden, Hay, McKeganey and Birks27].
In epidemiology, truncated estimators are usually applied to frequency counts of observations of individuals in a single data source [Reference Bohning28]. They aim to estimate the number of unobserved persons (falling in the zero-frequency class) based upon information on the number of times a person has been observed. Technically, one assumes a specific truncated distribution of the observed data, e.g. Poisson or binomial, and then extrapolates from the observed series to the unobserved number of people never seen [Reference Hook and Regal10]. Observed frequency distributions may not be strictly Poisson and to relax this assumption Zelterman based his model on a Poisson mixture distribution, allegedly allowing greater flexibility and applicability on real-life data [Reference Bohning28]. Conceptual aspects of the Zelterman and Chao models have been discussed in some detail elsewhere [Reference Smit, Toet, Van der Heijden, Hay, McKeganey and Birks27, Reference Wilson and Collins29–Reference Hay and Smit31]. The simple truncated estimators do not need statistical packages. In the social sciences truncated models have been employed to estimate the size of hidden populations such as criminals [Reference Van der Heijden, Cruyff and Van Houwelingen26, Reference Rossmo and Routledge32], illegal residents [Reference Bustami, Klein and Korsholm33] and illicit drug users and homeless persons [Reference Smit, Toet, Van der Heijden, Hay, McKeganey and Birks27–Reference Wilson and Collins29, Reference Hser34]. To our knowledge, truncated estimators have not been used before to estimate the number of infectious disease patients.
As with capture–recapture analysis, the validity of the estimates of truncated models depends on the possible violation of the underlying assumptions. These assumptions are similar to the capture–recapture assumptions described earlier but in addition equiprobability (i.e. equal ascertainment probabilities of all registers) should be assumed when using multiple sources [Reference Hook and Regal10]. Some truncated models are arguably more robust to population heterogeneity because they are partly based upon the lower frequency classes, and the people seen rarely are assumed to have a greater resemblance with the people never seen. This relative insensitivity to violation of the homogeneity assumption of some truncated estimators is supported mathematically and through simulation studies but these can occasionally underestimate the true population size in the presence of heterogeneity [Reference Zelterman22, Reference Wilson and Collins29].
Frequency counts
It is possible to extract frequency counts for the truncated models from multiple-source capture–recapture data, allowing us to use the reported data from the log-linear studies for the truncated models. The ratio between the number of patients registered once (f 1) and registered twice (f 2) plays an important role in the truncated models. When ‘1’ represents being known to a register and ‘0’ represents being unknown to a register, and three linked registers are used, frequency count f 1 is the sum of the cells 100, 010 and 001 in the 2×2×2 contingency table and frequency count f 2 corresponds to the sum of the cells 110, 101 and 011. Similarly, patients observed in all three registers, f 3, are denoted as 111. For all 19 datasets the number of patients in these seven cells are shown later. We use the f 1/f 2 ratio to examine a possible relationship between this ratio and the performance of truncated models vis-à-vis the log-linear models.
RESULTS
Table 1 (available in the online version of the paper) shows the various three-source log-linear model capture–recapture studies of infectious disease incidence and completeness of notification with the number of patients observed and their frequency counts, the objective of the study, the data sources used and the selected log-linear model. The studies involved eight infectious diseases and were performed at the local, regional or national level. One study collected data over a 4-months period, the other studies over 1- to 5-year periods. The observed number of patients varied from 33 to 28 678. Notification, laboratory and hospital registers were the most conventional data sources used. The distribution of the patients over three linked registers in the various three-source capture–recapture studies of infectious diseases is shown in Table 2 (available online).
The log-linear and truncated model estimates with their respective confidence and prediction intervals are shown in Table 3 (available online), as well as the f 1/f 2 ratio among the observed patients and the coefficient of variation of the data source probabilities (see Discussion). The capture–recapture studies varied in estimated number of patients from 46 to 42 969. A second truncated Poisson estimator, Chao's bias-corrected homogeneity model,
gave similar estimates as Chao's heterogeneity model [Reference Chao35]. A second truncated binomial estimator,
gave similar estimates as the truncated binomial model used (data not shown).
f 1/f 2 ratio
On the basis of the f 1/f 2 ratio the studies can be divided in four categories:
(a) f 1/f 2<0·5 (dataset 7). In this study all estimates were similar but the number of observed patients was small.
(b) 0·5<f 1/f 2<1·5 (datasets 1, 2, 6, 8–10, 13a, 13b, 14, 16a, 16b). In these studies the truncated binomial model and Zelterman's model gave similar results as the independent (without interactions) or parsimonious log-linear model while Chao's model estimates were slightly higher. When a saturated log-linear model (with all two-way interactions) was selected the truncated estimates were considerably lower than the log-linear model estimates.
(c) 1·5<f 1/f 2<3·5 (datasets 5, 11, 12, 15). In the first study the results of all truncated models were similar to the parsimonious log-linear model estimate but the number of observed patients was small. In the second study the estimates of Zelterman's and Chao's truncated models were lower but within the 95% confidence interval (CI) of the parsimonious log-linear model estimate while the truncated binomial model estimate was considerably lower. In the third study all truncated model estimates were considerably lower than the saturated log-linear model estimate, the truncated binomial estimate again being lowest. In the fourth study all truncated model estimates were lower than the saturated log-linear model estimate but fell within the broad 95% CI, the truncated binomial model estimates again lowest.
(d) f 1/f 2>3·5 (datasets 3a, 3b, 4). In all studies the truncated model estimates were considerably higher than the parsimonious log-linear model estimates, especially the Zelterman and Chao models.
Selected log-linear model
On the basis of the selected log-linear model the studies can be divided in three categories:
(a) Independent log-linear model (datasets 1, 2). In these studies the truncated models produce similar estimates as the log-linear model.
(b) Parsimonious log-linear model (datasets 3–11). In the 11 studies with a parsimonious log-linear model selected three observations can be made:
• In the three studies with f 1≫f 2 (datasets 3, 4) the truncated binomial model estimates a higher number of patients than the log-linear model while the truncated Poisson and Poisson mixture models estimate a considerably higher number of patients.
• In the three studies (datasets 5–7) with a small number of observed patients the estimates of the log-linear model and truncated Poisson, Poisson mixture and binomial models are comparable.
• In the studies with the f 1/f 2 ratio between 0·5 and 1·5 (datasets 8–10, 13b) the truncated model estimates are similar to the log-linear model but the Chao models can be relatively higher and in one study the truncated Poisson mixture estimate was relatively low.
(c) Saturated log-linear model (datasets 12–16, apart from 13b). In all but one of the studies with a saturated model selected (datasets 12, 13a, 14, 16) the truncated models gave considerable lower and mutually comparable estimates.
DISCUSSION
Main findings
In three-source log-linear model capture–recapture studies of infectious disease incidence with an independent log-linear model selected, truncated models yield comparable estimates. The truncated models also give similar results when parsimonious log-linear models are selected and the number of patients is limited or the f 1/f 2 ratio is between 0·5 and 1·5. When f 1≫f 2 truncated models give considerable higher estimates than parsimonious log-linear models. Compared to saturated log-linear models the truncated models produce considerably lower and often more plausible estimates.
Capture–recapture analysis and chronic diseases
For human diseases capture–recapture analysis has predominantly been applied to estimate the prevalence, incidence or completeness of registers of specific groups of diseases, often diseases with a chronic character as mentioned earlier. Apparently the characteristics of most of these diseases, their patients and their registers best fulfil criteria for feasibility of capture–recapture studies as well as validity of the underlying assumptions. Perhaps with the exemption of some neurological and rheumatological conditions the case-definition is probably unambiguous and uniform over the various registers. Arguably, for these categories of diseases sufficient registers are available and possible relationships between these registers, e.g. clinical registers, laboratory registers, health insurance registers or patient support and advocacy group registers, be they positive or negative, could be avoided by source selection or source merging or accounted for in a log-linear model, thus limiting violation of the independent registers assumption. The permanent character of most of these conditions can reduce violation of the closed population assumption.
Capture–recapture analysis and infectious diseases
For infectious diseases the number of available registers for record-linkage, usually notification-, laboratory- or hospital-based registers, is often limited and (strong) positive interaction between these registers should be expected as a result of the characteristics of infectious disease diagnosis and treatment, and public health regulations. Infectious disease control and surveillance is often organized around close collaboration between clinicians, microbiologists and public health professionals, such as infectious disease and tuberculosis physicians and nurses. Only two of the 19 datasets studied selected the independent log-linear model and 11 datasets selected parsimonious log-linear models incorporating one or two pairwise dependencies. However, six datasets selected the saturated log-linear model, i.e. including all two-way interactions and assuming absence of the three-way interaction [Reference Hook and Regal16, Reference Regal and Hook36]. Our studies of tuberculosis incidence in England and, before correction for suggested imperfect record-linkage and remaining false-positive hospital cases, in The Netherlands both selected a saturated model, resulting in unexpectedly and unrealistically high estimates of the number of tuberculosis patients. The two previous three-source log-linear model capture–recapture studies of tuberculosis incidence resulted in a parsimonious model and both produced plausible estimates within the range of prior expectations [Reference Tocque37, Reference Baussano38]. According to Hook & Regal, if the saturated model is selected by any criterion the investigator should be particularly cautious about using the associated outcome [Reference Hook and Regal10]. At the time of our studies on tuberculosis incidence all but one of the published three-source log-linear capture–recapture studies of infectious incidence used independent or parsimonious log-linear models (studies 1–11). The one published study selecting a saturated log-linear model (study 12) gave a much higher estimate (n=1314) of the number of hepatitis A patients in an outbreak in Taiwan than later established by serology results (n=545) [Reference Chao19]. Recently a three-source log-linear model capture–recapture study of meningococcal disease incidence also selected a saturated log-linear model and resulted in relatively high estimates (study 16) [Reference De Greeff39]. Perhaps confidence in the validity of capture–recapture results may reflect publication bias in favour of apparently successful capture–recapture studies [Reference Hay40]. The unexpectedly high estimates of the saturated log-linear model capture–recapture studies do not result from violation of the ‘absent three-way interaction’ assumption. In the case of infectious disease registers, existing three-way interaction is almost certainly positive, causing a capture–recapture estimate biased downwards [Reference De Greeff39]. The reason for the high estimates must, therefore, be violation of (a combination of) the other underlying assumptions. After correction for possible false-positive records and possible imperfect record-linkage the capture–recapture studies on tuberculosis and meningococcal disease in The Netherlands (studies 13 and 16) produced much lower and lower estimates, respectively. Compared to an initial saturated log-linear model, a covariate log-linear capture–recapture model, reducing violation of the homogeneity assumption, also resulted in a much lower estimate of 886 (95% CI 827–1022) Legionnaires' disease patients in The Netherlands (study 15).
Truncated estimators and infectious diseases
Infectious disease studies where an independent log-linear model was selected produce estimates very similar with the truncated models, which can be partly explained by the independent register assumption underlying the truncated models when applied to three registers. That truncated estimators perform well when data are sparse is demonstrated in studies 5, 6 and 7 as the estimates of the log-linear and the various truncated models are similar. The truncated models also give similar results as the log-linear models when 0·5<f 1/f 2<1·5 but give considerably higher estimates when f 1≫f 2. In the case of saturated log-linear models (studies 12–16), with unexpectedly high estimates of infectious disease incidence, the lower truncated model estimates are more plausible but are they also preferable? We have two arguments to support the view they might be:
(1) In study 12 the saturated log-linear model estimated 1314 patients with hepatitis A infection in an outbreak in Taiwan while the truncated models estimate between 500 and 600 patients. The National Quarantine Service of Taiwan, on the basis of serology tests, later concluded that the true number of infected persons was about 545, making this one of the few capture–recapture datasets where later a true number of patients was established [Reference Chao19].
(2) A saturated log-linear model in dataset 13a gave an implausible estimate of 2053 (95% CI 1871–2443) tuberculosis patients in The Netherlands in 1998, while truncated models estimated between 1600 and 1675 patients. The implausible estimate caused the investigators to have a critical look at the data again and make further corrections for probable imperfect record-linkage and possible remaining false-positive records in the hospital register. The parsimonious log-linear model of dataset 13b fitted the adjusted data well and gave an estimate of 1547 (95% CI 1513–1600) tuberculosis patients and corresponding truncated model estimates. The initial truncated model estimates came relatively close to the final log-linear model estimate.
The equiprobability and number of data source assumptions
The truncated binomial model assumes that all sources have the same probability of capturing a case. In addition the truncated Poisson model assumes an infinite number of sources, although in our data the number of sources was limited to three. On this argument the truncated binomial model for three data sources is a more realistic alternative estimator. However, any departure from equiprobability results in an estimation error, which analytically is overestimation (see Appendix). Realistic estimates of this error can be obtained from the data. In Table 3 the last column shows the coefficients of variation, a measure of variability in the coverages of the three data sources for each study. This is calculated as the standard deviation divided by the mean from the three quantities N 1 (number of cases known on source 1), N 2 (number of cases known on source 2) and N 3 (number of cases known on source 3). We demonstrate the possible effect of violation of the equiprobability assumption by studies 4 and 11. For study 4, which has a high coefficient of variation (0·86), if the sources were truly independent, the number of unobserved cases would be 702, calculated by fitting the log-linear model with main effects only. Our truncated binomial estimator gives 1325 cases, nearly twice as large. For study 11, with a low coefficient of variation (0·06), independence implies that there are 155 unobserved cases, while the truncated binomial estimate is 212, an overestimation by about 30%. Studies 3 and 4 indicate that the high f 1/f 2 ratios result from violation of the equiprobability assumption, producing overestimates by the truncated models.
Two-source validation
Any three-source study can be used to test two-source estimation by treating one source as though it were a complete list of cases and extract a complete 2×2 table. We demonstrate this for two studies, numbers 4 and 11, which we chose above for their coefficients of variation and took register 3 as the complete set. Validation was by comparing the Petersen estimator (N 10N 01/N 11) [1] and the truncated binomial estimator, which for two lists is f 12/(4f 2), on the 2×2 table with the known ‘unlisted’ number. For study 4 there were 451 ‘unlisted’ cases, i.e. on neither of registers 1 and 2. The Petersen estimator is 37 and the truncated binomial estimator 42. The two estimators are similar because registers 1 and 2 have approximately equal coverage but both are far short of the true figure (Zelterman and Chao models estimates are 79 and 84, respectively). For study 11 there were 161 ‘unlisted’ cases and the two estimators were 57 and 64. Again the estimators agree but are short of the true figure. Now the Zelterman and Chao model estimates are 107 and 130, respectively, and perform slightly better. However, we had some hesitation in extracting 2×2 tables from three-source capture-recapture data, more specifically from capture–recapture studies on infectious disease incidence. As explained earlier, (positive) interdependencies between the three conventional registers used for such studies should be expected. Extracting 2×2 tables ignores possible conditional dependence confounding the results thus obtained. The log-linear model in study 4 included one interaction term for pairwise dependencies and the log-linear model in study 11 included two such interaction terms, which may explain the underestimation in the Petersen and truncated estimators. We therefore also validated the two studies with independent log-linear models (studies 1 and 2). We took register 2 as the complete set for study 1 and register 3 as the complete set for study 2. For study 1 there were 73 ‘unlisted’ cases. The Petersen estimator, 43, is a little low, but the truncated binomial estimator, at 201, is too high (Zelterman and Chao model estimates are 397 and 401, respectively). The discrepant (over)estimate by the truncated models can be explained by the different coverages of registers 1 and 3, i.e. violation of the equiprobability assumption. In study 2 the coefficient of variation was low and the coverage of registers 1 and 2 similar. For study 2 there were 22 ‘unlisted’ cases. The Petersen estimator and the truncated binomial estimator are both 25 and similar to the known ‘unlisted’ number, explained by almost absent violation of both the independent sources and equiprobability assumptions. The Zelterman and Chao model estimates are 43 and 51, respectively and the discrepancy with the truncated binomial model estimate can be explained by violation of the ‘infinite number of sources’ assumption.
Alternative models
As an alternative to log-linear capture–recapture models a structural source model has been proposed [Reference Regal and Hook36]. Whereas log-linear models only partly identify and incorporate dependencies between registers, the structural source model models potential interdependencies of the registers and heterogeneity of the population, partly based on prior knowledge, and estimates the probabilities of conditions that produce these interactions between the registers. However, the published data of the capture–recapture studies were insufficient to re-examine these studies with a structural source model.
CONCLUSION
We have indicated conditions where estimates of infectious disease incidence from log-linear models are similar or dissimilar to alternative truncated models for incomplete count data. Our results suggest that for estimating infectious disease incidence and completeness of notification independent and parsimonious three-source log-linear capture–recapture models are preferable. When saturated models are selected as best-fitting model and the estimates are unexpectedly high and seem implausible, first, the data should be re-examined with truncated models as a heuristic tool, in the absence of a gold standard, to identify possible failure in the saturated log-linear model when the truncated models produce a lower estimated number of infectious disease patients. Second, in case of such discrepancy between the log-linear and the truncated model estimates, the data should be re-examined for possible violation of the underlying capture–recapture assumptions, such as imperfect record-linkage, false-positive records or heterogeneity, corrected and the capture–recapture analysis repeated on the corrected data. When after repeated capture–recapture analysis the discrepancy between the log-linear and the truncated model estimates remains or no violation of the underlying assumptions can be identified, the investigator should be cautious about using the associated outcome [Reference Hook and Regal10]. Using truncated model estimates as an early alert could prevent flawed capture–recapture estimates finding their way into the scientific literature. The role of the f 1/f 2 ratio in the agreement or disagreement between three-source log-linear capture–recapture and truncated model estimates for the number of infectious disease patients, especially when a parsimonious log-linear model is selected, should be the subject of further mathematical or statistical studies.
APPENDIX
Equations for the truncated population estimators
Truncated binomial model:
Truncated Poisson mixture model:
Truncated Poisson heterogeneity model:
Equiprobability
If the truncated binomial model is true, i.e. if the sources are independent and equiprobable with probability of capturing any case=p, our estimator (f 1)2/(3f 2) is correct in the sense that the expected number of unlisted cases is given by
If we introduce a small departure from equiprobability so that the list probabilities are (p−h, p, p+h) instead of (p, p, p), the estimation error can be defined as
Differentiating with respect to h, we find that
so that we overestimate, at least for small h. The same happens if we consider an asymmetrical departure (p−h, p, p). In that case,
and there is again an overestimate.
DECLARATION OF INTEREST
None.
NOTE
Supplementary information accompanies this paper on the Journal's website (http://journals.cambridge.org).