INTRODUCTION
Foodborne diseases (FBDs) remain a growing concern for high levels of morbidity and mortality in the human population worldwide [Reference Havelaar1]. There are many indicators hinting at an increase in the global incidence of FBDs [Reference Todd2, Reference Käferstein3]. For industrialized countries, it has been estimated that one-third of the population suffers from foodborne illness every year [4]. A recent study estimated 37·2 million illnesses, 228 744 hospitalizations, and 2612 deaths each year due to FBDs in the United States alone [Reference Scallan5]. Furthermore, FBD morbidity and mortality are assumed to be extensive in resource-limited regions of the world although solid data are lacking in these regions [Reference Motarjemi and Käferstein6, 7]. Beyond morbidity and mortality, FBDs also exert an important impact on development and trade [Reference Kuchenmüller8].
Challenges associated with incomplete data have been emphasized in studies estimating the global burden of pathogen-specific FBDs such as non-typhoidal Salmonella gastroenteritis and typhoid fever [Reference Majowicz9, Reference Crump, Luby and Mintz10]. The lack of data, particularly from developing countries, makes it difficult to calculate global estimates of disease burden, and this hinders appropriate allocation and prioritization of resources for food safety intervention efforts. The current study explores novel Bayesian statistical methods for addressing missing data using the WHO Vital Registration (VR) mortality dataset. The specific aims are twofold: to select a subset of meaningful predictors of FBD mortality and to predict missing FBD mortality rates using the selected variables. This report represents a first attempt to cluster WHO countries based on average values of the selected predictors, and to use the variables and the clusters in a Bayesian hierarchical modelling framework to predict FBD mortality rates for countries with missing data.
METHODS
Dataset
Mortality data
Data regarding mortality rates associated with FBDs were obtained from the WHO VR database (2000–2005). The FBDs in the database include bacterial and viral gastroenteritis, parasitic diseases, and hepatitis A and E. The International Classification of Diseases coding system (ICD-10) was used to classify those diseases; FBDs associated with chemicals and biotoxins were not included in this study due to the lack of specific ICD codes [11, Reference Hanson12]. Mortality rates were averaged over the available years and the mean rate was expressed per 100 000 population based on the 2005 population census. We log-transformed the mortality rate data to stabilize its variance and allow the data to be more normally distributed. Out of 194 WHO countries, only 48 had complete data about national mortality rates associated with FBDs.
Predictor set
We obtained predictors for mortality associated with FBDs from publicly available databases (Food and Agricultural Organization; FAO [13], and World Bank [14]) in two steps. First, we identified 113 predictors from these databases for 48 countries with complete mortality rate data. Next, we selected a meaningful subset of predictors from those 113 predictors. Finally, we retrieved the values for the selected predictors from all 194 WHO countries. The search criteria for the predictors included an established direct and/or indirect association with FBD mortality and the predictor's potential to be a modifiable risk factor, i.e. whether it can be changed through intervention [Reference Todd2, Reference Hanson12, Reference Jahan and Valdez15].
Statistical analysis
Analyses of the data were performed using the freeware statistical tools R version 3.1.2 and JAGS (Just Another Gibbs Sampler) version 3.4.0 [16, Reference Plummer17]. The models were specified and parameterized in R and the analyses were performed by calling JAGS from R using the R2jags package [Reference Plummer17]. A stepwise approach was followed for variable selection and estimation of missing mortality rates associated with FBD as follows.
Correlation analysis (CorA)
First, among a total of 113 predictors, we excluded those having missing values and restricted the analyses to complete cases because CorA requires that values must be present for all predictors. This resulted in 46 predictors with complete values. These variables were subjected to pairwise CorA to identify highly correlated and redundant variables. Those pairs of predictors with a high correlation coefficient, i.e. r⩾0·85 [Reference Taylor18], were identified and one member of them was retained based on biological plausibility.
Elastic net regularization (ENR)
Following CorA, we applied ENR. This method offers a statistically appealing regression approach to select meaningful subsets of predictors of mortality associated with FBDs for the 48 countries with complete mortality data. ENR is a flexible variable selection method proposed by Zou & Hastie, which was developed to overcome the flaws of the commonly used Ordinary Least Squares approach with regard to prediction accuracy [Reference Zou and Hastie19]. The basic form of the linear regression model used to perform variable selection with ENR is shown in equation (1):
where Y is a vector of log-total mortality rates (response variable), X is an n × p matrix of predictors, β is p vector of regression coefficients and e is the vector of residual errors. The ENR uses a mixture of the l 1 [Least Absolute Shrinkage and Selection Operator (LASSO)] and l 2 (ridge regression) penalties, which allows both automatic variable selection and shrinkage, respectively [Reference Zou and Hastie19]. The ENR has two parameters, α and λ. We set α at 0·5 (1 = LASSO and 0 = ridge, 0·5 being for ENR) and performed cross-validation to find the optimal value of regularization parameter λ. The optimal λ value was applied during variable selection. The glmnet package in R was used to perform the ENR procedure [Reference Friedman, Hastie and Tibshirani20]. A detailed description of regression-based ENR as a data-mining technique can be found in the literature [Reference Zou and Hastie19, Reference Friedman, Hastie and Tibshirani20].
Imputation of missing values
After we selected variables using ENR, we searched values of these variables for the remaining 146 countries from publicly available and validated databases (FAO, World Bank) [13, 14]. Whenever multiple values for a given country were available, we took the value for the year closest to 2005. Since not all countries may have full information for the selected predictors, Multiple Imputation (MI) was performed to fill-in missing predictor values using the MICE (Multiple Imputation using Chained Equations) package in R [Reference Buuren21]. The percentage of countries missing the variables is indicated in column two of Table 1.
a Number (%) of countries missing the variable.
MI helps to handle missing data, where missing values are replaced by random draws from the predictive distribution of the missing data given the observed data [Reference Buuren21, Reference Schafer22]. The procedure generates m numbers of complete datasets (also called multiply imputed datasets) ready for further analysis. The optimum number of m varies across studies and may depend on the study design and the proportion of values missing. Prior work suggests that 5–10 multiply imputed datasets are minimally sufficient for generating valid variable estimates [Reference Graham, Olchowski and Gilreath23]. We used 20 multiply imputed datasets in this study. The imputed values were averaged across the number of multiply imputed datasets to provide values for a given missing value. Convergence of the imputation process was assessed by visually examining density plots of each variable to evaluate the plausibility of imputed values across the number of iterations.
Cluster analysis (CA)
Following the imputation step, we carried out CA. The purpose of CA is to aggregate countries into groups based on similar values of predictors such that countries within a cluster have homogenous mortality rates. Although several types of clustering methods exist, we compared four commonly used hierarchical clustering methods to identify the optimal clustering solution for our dataset: single linkage, complete linkage, UPGMA (unweighted pair group mean average) and Ward's minimum variance methods [Reference Borcard, Gillet and Legendre24]. We used visual examination of the resulting dendrograms, Gower's distance [Reference Gower25] and cophenetic correlation to select the method of choice [Reference Saraçli, Doğan and Doğan26]. Based on established rules, smaller Gower's dissimilarity coefficients and larger cophenetic correlations indicate that the preferred clustering solution best fits the data. We selected UPGMA as the clustering method of choice for our data. Thereafter, we determined the optimal number of clusters using the gap-statistic, which is one of the most popular methods for estimating the number of clusters in a dataset [Reference Tibshirani, Walther and Hastie27]. In addition, we evaluated the average silhouette width (SW) which is a composite index reflecting the compactness and separation of clusters. A high SW indicates that the clusters are homogenous. A detailed technical description of these methods can be found elsewhere [Reference Gower25–Reference Rousseeuw28].
Bayesian hierarchical regression
After having developed a dataset for all 194 countries, we fit a Bayesian hierarchical regression model (BHM) for predicting log-total mortality rate associated with FBDs. We incorporated the clusters obtained from the CA as random effects into our BHM. The regression model fit to the data was formulated as follows (2):
where Y i denotes the response variable (log-total mortality); α and β are the intercept and the regression coefficients, respectively; n is the total number of countries; X ki denotes the predictors (K = 8); and J is the number of clusters. The variance (σ 2), α's and β's are parameters estimated from the data. In addition to the model constructed using our four cluster solution, we evaluated a new model incorporating the WHO's Global Environmental Monitoring System/Food (GEMS) cluster for comparing the results. The GEMS/Food cluster categorizes the WHO countries into 17 groups based on food consumption and dietary intake of various chemicals [29]. A non-hierarchical Bayesian framework was also fit to the data, which does not take into account any clustering of the data.
Valid inference from the above model assumes that the missingness in the system is Missing At Random (MAR). Missing data is considered MAR whenever the missingness can be explained by one or more predictors in the dataset. Although it is not possible to directly test the MAR assumption based on the data alone, MAR can be demonstrated by showing association between predictors and missingness of the response variable [Reference Mason30]. For testing the validity of the assumption, we created a dummy variable for whether mortality rate is missing or not, and ran a logistic regression to statistically test if any of the variables are associated with missingness (Table 2). A strong statistical association indicates that the MAR assumption is valid. A detailed description of missing data mechanisms can be found in the literature [Reference Rubin31].
a Log-transformed variables.
b Labour force participation rate for females aged 15–24 years.
Some of the predictors in our dataset were not normally distributed, and therefore, we log-transformed them to stabilize their distribution before applying the regression approach. This subset of predictors were ‘Per capita animal calorie consumption’, ‘Birth per adolescent’, ‘Fertility rate’, ‘Maternal death risk’ and ‘Kilocalories per day’.
In a Bayesian framework all the parameters in the model must have a prior distribution, which is a way of quantifying lack of knowledge about the parameters [Reference Gelman and Hill32]. We assigned all the coefficients to have non-informative prior distribution (i.e. a normal distribution with mean = 0 and a precision = 0·01). This implies that the magnitudes of the regression coefficients are expected to lie between −10 and 10. The prior for the precisions, i.e. the inverse variances, were defined in terms of the standard deviation parameters and given uniform prior distributions on the range (0, 10). (The R-JAGS code for the Bayesian framework applied in this project is provided in Supplementary Appendix A.)
We ran the model for 50 000 iterations with a burn-in of 5000 (i.e. we discarded the first 5000 iterations). We assessed convergence by running two chains of dispersed initial values, and then by observing autocorrelation and density plots of the parameters from the models' outputs. Whenever more than one model was to be evaluated for fit, we used Deviance Information Criteria (DIC) and the effective number of parameters (pD) as model fit comparison tools [Reference Spiegelhalter, Best and Carlin33]. The DIC is a Bayesian alternative of Akaike's Information Criteria for comparing competing models, and the pD is a measure of the complexity of the model [Reference Spiegelhalter, Best and Carlin33]. A difference in DIC of more than 5–10 units is regarded as strong evidence in favour of the model with smaller DIC [Reference Spiegelhalter34].
Model validation
In order to assess the predictive performance of our model, we carried out cross-validation using part of the dataset with complete information on mortality rates. We implemented the leave-one-out cross-validation method (LOOCV) used to estimate the generalizability of a model in the absence of external data [Reference Stone35]. This method takes one observation out of the data, sets it aside as a ‘testing set’, and fits the model using the remaining data, called the ‘training set’ to assess statistical predictions of the model. The resulting coefficients are then applied to the ‘test set’, to generate predicted values that are compared to the observed value of that single case. This procedure is performed repeatedly for all observations of the data and the mean absolute error (MAE) of prediction is calculated [equation (3)] and compared to the baseline MAE (i.e. the MAE computed without cross-validation). This comparison allows assessment of ‘out-of-sample’ predictive performance of the model whenever no external data exist [Reference Willmott and Matsuura36]. The MAE is mathematically expressed as follows:
where n = total number of test sets, f i = predicted log-total mortality and y i = observed log-total mortality. Ninety-five percent confidence intervals (95% CIs) for the MAEs were computed using a non-parametric bootstrapping method with 2000 replications from the boot.ci procedure in R. A small MAE indicates a better out-of-sample prediction of the model.
Sensitivity analysis
We assessed robustness of the results by changing the priors. We also examined the model outputs after specifying priors separately for each cluster. Sensitivity analysis regarding priors was conducted by first assigning uninformative priors to means and inverse variances, and then changing the priors of the precision by a factor of 10 as shown in Table 4. Additionally we investigated the stability of predictions by randomly deleting mortality rates and refitting the model, and also by randomly adding plausible hypothetical mortality rates for a subset of countries missing the data and refitting the model.
RESULTS
Correlation analysis
Out of 46 predictors screened by means of pairwise CorA, we found 20 of them to be highly correlated, of which we dropped 14 predictors that were redundant. This results in 32 final predictors for further analysis (see the correlation matrix in Fig. 1).
Elastic net regularization
The remaining 32 variables that were retained after CorA were subjected to ENR. Eight non-zero coefficient variables were selected as the final subset of predictors. We used these variables for CA and regression analysis, as indicated in the next sections. Description of the eight variables and the proportion of missing values for these variables across all 194 countries are indicated in Table 1.
Cluster analysis
The average SW (0·59) and the Mantel optimal cluster methods resulted in two and three clusters, respectively, while the gap statistic suggested four clusters to be optimal. As the gap statistic is the most recommended approach, we decided to partition our dataset into four clusters as shown in the dendrogram (Fig. 2) [Reference Tibshirani, Walther and Hastie27]. One hundred and forty-two countries were grouped into cluster 1, whereas 29, 3, and 20 countries were categorized into clusters 2, 3, and 4, respectively. The majority of the countries in cluster 1 are high- and middle-income countries, whereas many of the countries in the other three clusters are low- or middle-income countries.
Regression and model validation
Model validation and fit
The results of a statistical test to assess the validity of the MAR assumption are shown in Table 2. Three of the eight predictors were significantly associated with missingness in the data indicating that the MAR assumption holds for the current analysis. Table 3 shows the results of goodness-of-fit assessment (DIC and pD) and the MAEs for the three models having different structures. No substantial differences were observed regarding either DIC or MAE in the three models (Table 2). However, the model fitted with the GEMS cluster (hierarchical B in Table 2) has the highest pD due to the large number of clusters in the data. This model also did not converge well even after a higher number of iterations. We selected the BHM because its structure allows it to ‘borrow strength’ (i.e. to pool information) across clusters. The latter characteristic is very helpful whenever data are lacking within clusters. Assessment of autocorrelation, density plots and trace plots showed that convergence criteria for the BHM were met.
a Four cluster random effect; b GEMS cluster random effect; c Deviance Information Criteria; d Effective number of parameters (measure of model complexity); e mean absolute error obtained from the fitted model; f MAE obtained from the model after ‘Leave One Out’ cross-validation; g Bootstrap 95% confidence interval.
Sensitivity analysis and prediction of mortality rates
Table 4 indicates the range and specifications of priors for the variance and mean components used to evaluate robustness of the BHM. Varying the priors to become more informative did not substantially change the predictions of log-total mortality rates for countries within cluster 1 (Fig. 3a ) and cluster 4 (Fig. 3d ). These clusters contain countries with information regarding mortality rates. Conversely, the priors substantially influenced the uncertainty of predictions within the clusters lacking any mortality rate data, i.e. cluster 2 (Fig. 3b ) and cluster 3 (Fig. 3c ).
1/σ 2, precision (inverse of the variance). In the JAGS model, priors for variances are specified by precision. Gamma distribution is frequently used to specify priors for precision. b0, Average log-total mortality rate; b, priors for regression coefficients.
We set the same prior distribution for all clusters (also called exchangeable priors) and specified smaller values for the variance of the means and precision parameters (see prior set 3 in Table 4) for the final prediction. Constraining the parameter values to be within –10 and 10 [e.g. specifying the prior of b0 at dnorm(0, 0·1)] was not a serious restriction. Since the model is on a log-scale, it is not possible to see values as extreme as –10 or 10, which correspond to mortality rate of e −10 or e10 per 100 000 population.
Deleting the observed mortality rate from cluster 4 (i.e. removing the mortality rate value of Guatemala) and refitting the model resulted in a very large uncertainty of predicted log-total mortality rates for all countries in this cluster (Fig. 4d ). On the other hand, randomly adding plausible values for a subset of countries in cluster 2 (Fig. 4b ) and cluster 3 (Fig. 4c ), i.e. clusters that lack any observed mortality rate data, considerably reduced the uncertainty of predictions. This indicates that uncertainty is highest for clusters with little or no information and the predicted mortality rate for countries lacking the data can be substantially improved if mortality rate values are obtained for a few countries in the cluster. The change in predicted log-total mortality was minimal for countries in cluster 1 (Fig. 4a ) whenever mortality rate values were added or deleted from the other clusters as part of the sensitivity analysis.
Supplementary Appendix B shows the final predicted log-total mortality rates for all countries using the BHM. None of the countries in clusters 2 and 3 had observed mortality rates. For countries within these clusters, the BHM predicted wide 95% CIs for the median of log-total mortality rates indicating large uncertainty. On the other hand, the uncertainty of predictions was very small for countries within clusters 1 and 4. The overall median (95% CI) of the predicted log-total mortality rate ranged from −1·23 (−2·03 to −0·44) for Greece to 5·04 (2·68–7·36) for Afghanistan, which yields median mortality rates of 0·29 (0·13–0·63) and 155·19 (14·66–1572·85) per 100 000 population, respectively. As indicated in Supplementary Appendix B, some of the 95% CIs of predictions did not contain the observed mortality rates.
DISCUSSION
Lack of sufficient and complete data for mortality and morbidity from many countries poses an ongoing challenge for the estimation of global burden of FBDs [Reference Jahan and Valdez15].
The eight predictors we selected in this report are proxy attributes to capture the socioeconomic, food-production, hygiene, and health status of countries. This is in agreement with a previous study regarding variable selection to estimate the missing incidence of specific FBDs [Reference McDonald37]. In addition, a frequentist-based analysis of part of the current dataset highlighted that both health and non-health-related variables can be used as proxy predictors to measure mortality associated with FBDs [Reference Hanson12].
Grouping of WHO countries based on the main predictors of FBD mortality was a novel attempt to create data-driven clusters of countries with relatively homogenous mortality rates. This clustered structure will help ‘borrow strength’ from similar countries while predicting missing FBD mortality rates. The WHO countries have been previously grouped based on geographical attributes (e.g. WHO subregions) or other parameters depending on the goal of the classification scheme. For example, the GEMS Food cluster is designed to group countries based on food consumption and risk assessment [29]. In a recent study to predict missing national-level incidence of specific FBDs, the WHO subregions and the GEMS regions were used as random effects [Reference McDonald37]. In our study, comparison of predictions and assessment of model fit using the GEMS cluster vs. our data-driven clusters indicated that the model with the GEMS cluster lacks convergence while our four-cluster solution has converged quite well. In our study, 47 of the 48 countries with complete mortality rate data grouped into cluster 1, which indicates that those countries that routinely report mortality rates have similarities in the eight predictor values.
Most missing-data analysis approaches require that the MAR assumption is fulfilled for the results to be valid [Reference Little and Rubin38]. Although the MAR assumption, as such, is not testable, we justified its validity for our dataset by demonstrating a statistical association of the predictors with the missingness of mortality rates. Consequently, three predictors were significantly associated with missingness, implying that the missing data can be partly explained by the predictors in the dataset. A previous simulation study has shown that an erroneous assumption of MAR will often have only a slight impact on the estimates [Reference Collins, Schafer and Kam39]. Meanwhile, it is also important to note that all other missing-data analysis approaches require assumptions that are just as difficult to justify.
In our study, the choice of the best-fit model used for prediction of mortality associated with FBDs was based on comparison of model fit parameters, out-of-sample prediction performance, and the method's suitability for analysis of missing data with hierarchical structure. To evaluate predictive performance of a model, using the same data as were initially used to build the model may introduce over-fitting problems [Reference Arlot and Celisse40]. Furthermore, collecting new data to validate the model for predictive strength is not feasible and, therefore, the LOOCV method solves an over-fitting problem and helps to assess the predictive accuracy. Although we did not observe a substantial difference between the three models regarding MAEs, we preferred the BHM for a number of reasons. The structure of a BHM enables ‘borrowing strength’ across clusters which improves prediction of mortality rates. The latter is particularly essential whenever data are missing [Reference Hong41]. Moreover, a BHM facilitates the estimation of several parameters over similar units (e.g. countries within clusters) in order to improve the precision of the estimated effects for each unit. It has also been described previously that a Bayesian approach allows for a more efficient use of data as the method does not depend on the asymptotic theory of large sample approximation [Reference Congdon42]. This is essential whenever there are few observations and a high proportion of missing values in the dataset.
Part of the dataset used herein has been analysed previously using a classical frequentist framework [Reference Hanson12]. Our Bayesian approach, however, has several important advantages over this likelihood-based frequentist method. A Bayesian approach includes an opportunity to assign pertinent information (prior) to unknown parameters (including missing values distribution) [Reference Gelman and Hill32]. This is particularly useful for analysis of a dataset with missing values. Second, Bayesian models can be easily updated rationally when new data become available. As such, future research on FBDs can directly utilize the current results as priors. Additionally, BHM provides a convenient setting for a dataset with inherent hierarchical structure. In our dataset, countries within clusters are assumed to have more similar mortality rates compared to the rates between countries across clusters. Furthermore, implementing BHM allows the pooling of information across clusters, such that clusters with little or no data ‘borrow strength’ of the log-mortality rates from other clusters. In our analyses, the predicted median mortality rates for countries in clusters 2 and 3 were smoothed towards the overall average population estimate (Fig. 4). Although the predicted median mortality rates are close to the overall average log-mortality rate, the uncertainty is large as a direct result of the data quality or lack thereof. The reduction in uncertainty of the predictions was achieved by adding hypothetical but plausible data for a subset of countries and this has a practical implication. For example, data collection strategies for mortality rates can be based on cluster information. If mortality data can be obtained from a proportion of countries from a properly defined cluster, one can use BHM to predict mortality rates for the remaining countries missing the data and thereby make optimal usage of the data available.
A limitation of this study is that the predictors were selected from countries with complete FBD mortality rate data (i.e. from 48 countries that are mostly middle- and high-income countries). At the same time, the clustering of all WHO countries was based on the values of these selected predictors. Since country-level socioeconomic factors may determine the incidence of mortality associated with FBDs, the risk distribution regarding mortality rates due to FBDs may be different between countries. On the other hand, it is possible that similar potential risk factors (predictors) may be shared between developed and developing countries despite the differences in disease burden [Reference Todd2, Reference Käferstein, Motarjemi and Bettcher43]. The other limitation is that there might be a risk of excluding strong predictors while dropping variables containing missing values to fulfil the statistical requirements of CorA. However, the eight selected predictors are considered to be meaningful predictors for FBD mortality.
The high proportion of missing values in the dataset might be the cause for some of the predictions to be outside the observed range. Therefore, the estimates from the final model in this report should be interpreted with caution. The FBD mortality predictions obtained from our suggested method can be improved by obtaining data across regions of diverse social and geographical characteristics of more countries. Such valid additional data would decrease the uncertainty of estimates and possibly change the clustering of the countries. The general approach presented in this paper, however, would not change. Finally, whenever resources are limited, the selected variables can provide suggestions for future data collection regarding risk factors of FBD mortality.
SUPPLEMENTARY MATERIAL
For supplementary material accompanying this paper visit http://dx.doi.org/10.1017/S0950268815003234.
ACKNOWLEDGEMENTS
This work was supported by the NIH Ruth L. Kirschstein National Research Service Award Institutional Training Grant T32 RR023916 and T32 OD010423. The views expressed in this document are solely those of the authors and do not represent the views of the World Health Organization.
DECLARATION OF INTEREST
None.