Bladder cancer (BC) is the most common malignancy of the urinary tract and the seventh leading cause of cancer mortality (2·8 % of all cancer deaths), with nearly 430 000 new cases and 165 000 deaths per year worldwide(Reference Ferlay, Soerjomataram and Dikshit1,Reference Siegel, Miller and Jemal2). According to Al-Zalabani et al., up to 80 % of BC can be attributed to lifestyle factors, including occupation, smoking, exercise and diet(Reference Al-Zalabani, Stewart and Wesselius3). In particular, it is biologically plausible that dietary factors influence BC risk, because beneficial as well as harmful components of the diet are excreted through the urinary tract and come into direct contact with the epithelium of the bladder(Reference Piyathilake4). However, as stated in the report by the World Cancer Research Fund/American Institute for Cancer Research(Reference Wiseman5), there is still only ‘limited’ evidence for a role of diet in BC risk.
Analysis of overall dietary patterns related to BC has gained considerable attention in recent years(Reference Westhoff, Wu and Kiemeney6,Reference Witlox, van Osch and Brinkman7). Instead of looking at individual foods or nutrients, analysis of dietary patterns examines the effects of the overall diet, taking into account the inter-correlations in the consumption of various foods and nutrients. Conceptually, dietary patterns represent a broader picture of food and nutrient consumption, and their analysis may help in better understanding and preventing the development of common cancers.
Several conventional techniques are available for extracting dietary patterns: investigator-driven methods, such as dietary indices and dietary scores; and data-driven methods, such as factor analysis, cluster analysis and principal component analysis. Although these techniques are widely used and might reveal important information on the relation between dietary patterns and common cancers, their conclusions are partly subjective because they rest on a series of a priori assumptions, which may differ among researchers. A relatively new approach in the field of nutritional epidemiology is ‘data mining’. Data mining is a process that uses a variety of data analysis tools to extract hidden predictive information from large data sets. This technique is considered to be a powerful technology with great potential to help people focus on the most important information in their data(Reference Han and Kamber8). A previous study in the field of nutritional epidemiology already showed that data mining made it possible to define unexpected dietary patterns that might not be recognised using conventional statistical methods(Reference Hearty and Gibney9). Therefore, in the present study, we used this technique to examine combinations of foods at the individual level in order to extract food groups related to BC risk.
Methods
Study population
The data set used in the present study is part of the ‘BLadder cancer Epidemiology and Nutritional Determinant (BLEND)’ study, which aims to assess the association between diet and BC risk. Details on the methodology of the BLEND consortium have been described elsewhere(Reference Goossens, Isa and Brinkman10). The present study included data from eighteen case–control studies(Reference Bernstein and Ross11–Reference Taylor, Umbach and Stephens28) and one nested case–cohort study(Reference van den Brandt, Goldbohm and van ‘t Veer29) providing information on diet and BC, from twelve different countries across the world, including data on 8320 BC cases and 23 231 non-cases within the age range of 18–100 years. Each study ascertained incident BC, defined to include all urinary bladder neoplasms according to the International Classification of Diseases for Oncology (ICD-O-3 code C67), using population-based cancer registries, health insurance records or medical records. Each participating study was approved by the local ethics committee. Informed consent was obtained from all individual participants included in each study. Most of the BC cases were diagnosed and histologically confirmed in the 1990s.
Data collection
All included studies made use of a validated self-administered FFQ or an FFQ administered by a trained interviewer. The dietary data were harmonised using the Eurocode 2 Core classification codebook(Reference Poortvliet, Klensin and Kohlmeier30). This codebook consists of main food groups and their first- and second-level subgroups(Reference Hastie, Tibshirani and Friedman31). In order to reduce the variance of individual food items across the world (online Supplementary Table S1), foods were assigned to eleven main groups: milk and dairy products (A); eggs and egg products (B); meats and meat products (C); fishes and fish products (D); fats, oils and their products (E); grains and grain products (F); pulses, seeds, kernels, nuts and their products (G); vegetables and vegetable products (H); fruits and fruit products (I); sugars and sugar products (J); and beverages (non-milk, K). All food groups were measured as servings of food intake per week and divided into quartiles, with Q1–Q4 ranging from the lowest to the highest intake. In addition to information on diet, the BLEND data set also included data on study characteristics (design, method of dietary assessment and geographical region), participant demographics (age (continuous), sex (male, female)) and smoking status (never/current/former).
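The quartile coding of weekly servings can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `quartile_labels`, the toy intake values and the choice to derive cut points from all observed (non-missing) intakes with the ‘inclusive’ quantile method are assumptions made for illustration, since the text does not state how the cut points were computed.

```python
from statistics import quantiles

def quartile_labels(servings_per_week):
    """Assign Q1..Q4 to each intake value; missing values (None) stay None."""
    observed = [v for v in servings_per_week if v is not None]
    # Three cut points that divide the observed intakes into four groups
    # (assumption: cut points are taken over all observed values).
    q1, q2, q3 = quantiles(observed, n=4, method="inclusive")
    labels = []
    for v in servings_per_week:
        if v is None:
            labels.append(None)   # left blank, not imputed
        elif v <= q1:
            labels.append("Q1")   # lowest intake
        elif v <= q2:
            labels.append("Q2")
        elif v <= q3:
            labels.append("Q3")
        else:
            labels.append("Q4")   # highest intake
    return labels

# Hypothetical weekly servings for one food group; None marks a missing answer.
intakes = [0, 2, 4, 6, 8, 10, 12, 14, None]
labels = quartile_labels(intakes)
```

With these toy values the cut points are 3·5, 7 and 10·5 servings/week, so each quartile receives two participants and the missing record remains unlabelled.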
Baseline analysis
Continuous variables were described as mean and standard deviation, and categorical variables as absolute and relative frequencies. Missing values were tested for missing at random (MAR) or missing completely at random (MCAR)(Reference Rubin32,Reference Van Ness, Murphy and Araujo33). To test for MAR, logistic regression was performed with a missing data indicator created for each variable. No significant relationship between the missingness indicators and the outcome of interest suggests MAR. The assumption that missing data are MCAR was assessed using Little’s MCAR χ² test(Reference Little34,Reference Li35).
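The idea of a missing data indicator can be sketched in Python as follows. This is a simplified stand-in rather than the regression models fitted in the study: for a single binary indicator, the odds ratio from a 2 × 2 table equals the exponentiated coefficient of a univariable logistic regression, so the sketch computes that odds ratio and a Wald z statistic directly. The function name `missingness_odds_ratio` and the toy data are hypothetical.

```python
import math

def missingness_odds_ratio(values, is_case):
    """Odds ratio (and Wald z) of 'value is missing' for cases v. non-cases."""
    # 2 x 2 counts keyed by (case status, missing indicator).
    counts = {(c, m): 0 for c in (True, False) for m in (True, False)}
    for v, c in zip(values, is_case):
        counts[(bool(c), v is None)] += 1
    a, b = counts[(True, True)], counts[(True, False)]     # cases: missing, observed
    c_, d = counts[(False, True)], counts[(False, False)]  # non-cases: missing, observed
    # Note: this simple version fails if any cell count is zero.
    odds_ratio = (a * d) / (b * c_)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c_ + 1 / d)
    z = math.log(odds_ratio) / se_log_or
    return odds_ratio, z

# Toy data: 10 cases (4 with missing intake) and 10 non-cases (2 with missing intake).
values = [None] * 4 + [3] * 6 + [None] * 2 + [5] * 8
is_case = [True] * 10 + [False] * 10
odds_ratio, z = missingness_odds_ratio(values, is_case)
```

In this toy example the odds ratio is about 2·67 with |z| below 1·96, so missingness would not be significantly related to the outcome at the 5 % level.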
Data mining method
All the eleven main food groups and the non-diet variables (i.e. age, sex and smoking status) were selected and entered into data mining procedures.
A classification technique called C5.0(Reference Pandya and Pandya36), which is a variant of the C4.5 algorithm developed by Ross Quinlan, was used since it can represent solutions both as decision trees and as rulesets(Reference Quinlan37). It builds a decision tree from the training/validation sets using the concept of information entropy. The decision tree is built by splitting the data into two parts at the value of the variable that yields the highest normalised information gain. That is, it splits on the value of the chosen variable that most efficiently separates positive and negative observations (i.e. BC status: case and non-case). The pruning severity of the model was set at the default level of 75. This level yielded the lowest complexity (i.e. the minimum number of records in each tree branch required to allow a split) with sufficient accuracy. Standard 10-fold cross-validation was used, in which the entire eligible BLEND data set was divided into ten approximately equally sized parts. Nine parts were used in turn as training sets, and the remaining tenth part was used as the validation set. The validation set (10 %) was chosen within the entire data set according to the distribution of BC status. Participants with missing values were taken into account by using the ratio of the participants with missing values multiplied by the information entropy of the subset of participants without missing values for each variable(Reference Quinlan38). The C5.0 classification algorithm was run for the included diet and non-diet variables within the BLEND data set; meanwhile, variable importance (i.e. attribute usage) for the C5.0 model was calculated by determining the percentage of training set samples that fall into all the terminal nodes after the split, which defines the importance value of each diet and non-diet variable in relation to BC(Reference Karaolis, Moutiris and Pattichis39–Reference Louppe, Wehenkel and Sutera42).
These importance values range from 0 to 100 %, where 0 % indicates ‘unimportant’ and 100 % indicates ‘extremely important’. Both continuous and categorical variables were included in the models. Node splits in continuous variables can occur at any value and were not predetermined.
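The normalised information gain used to choose each split can be illustrated with a short Python sketch. It follows the published description of C4.5 for categorical variables, in which the gain is computed on records with a known value and then scaled by the fraction of records whose value is known, and in which ‘missing’ counts as one extra branch in the split information. It is an assumption that C5.0 as run in this study behaves identically; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5-style gain ratio for a categorical variable; None marks missing."""
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    known_fraction = len(known) / len(values)
    groups = {}
    for v, y in known:
        groups.setdefault(v, []).append(y)
    # Information gain measured on records with a known value only ...
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    gain = entropy([y for _, y in known]) - remainder
    # ... then scaled down by the share of records whose value is known.
    gain *= known_fraction
    # Split information treats 'missing' as one additional branch.
    split_info = entropy([v if v is not None else "missing" for v in values])
    return gain / split_info if split_info > 0 else 0.0

# Toy examples (hypothetical data, not BLEND records).
perfect = gain_ratio(["M", "M", "F", "F"], ["case", "case", "non", "non"])
useless = gain_ratio(["M", "F", "M", "F"], ["case", "case", "non", "non"])
with_missing = gain_ratio(["M", "M", "F", "F", None],
                          ["case", "case", "non", "non", "case"])
```

A variable that separates cases from non-cases perfectly obtains a ratio of 1, an uninformative variable obtains 0, and the same perfect split with one missing record is penalised (about 0·53 here).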
Rules were then generated by using the ‘ruleset’ function in C5.0, which transformed the decision tree into specific contexts associated with BC. The BC status (either case or non-case) was predicted by each rule, with a value between 0 and 100 % indicating the confidence of the prediction. The overall performance of the C5.0 classifier was evaluated by classification accuracy, true positive rate, false positive rate and the receiver operating characteristic curve with its AUC. Classification accuracy is the number of correctly classified instances in the validation set divided by the total number of these instances, expressed as a percentage. The greater the classification accuracy, the better the classifier. A sensitivity analysis was performed by categorising age into six groups (years): ≤55, 55–60, 60–65, 65–70, 70–75 and >75, based on the same data mining procedure.
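The performance measures named above can be computed as in the following Python sketch. The labels and scores are toy values, the 0·5 threshold is an assumption, and the AUC is obtained from the rank-based (Mann–Whitney) formulation rather than from any particular package, so this is an illustration and not the study's evaluation code.

```python
def confusion_metrics(y_true, y_pred):
    """Return accuracy, true positive rate and false positive rate."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn)   # sensitivity: cases correctly called cases
    fpr = fp / (fp + tn)   # non-cases wrongly called cases
    return accuracy, tpr, fpr

def auc(y_true, scores):
    """AUC as the probability that a case scores higher than a non-case."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a case and a non-case count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy validation fold: 1 = BC case, 0 = non-case; scores are rule confidences.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

accuracy, tpr, fpr = confusion_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

For this toy fold the accuracy is 0·75, the true positive rate 0·75, the false positive rate 0·25 and the AUC 0·875.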
All data analyses were performed with R software version 3.5.1 (using packages ‘C50’ and ‘caret’ developed by Max Kuhn; ‘rpart’ developed by Beth Atkinson; ‘ROCR’ developed by Tobias Sing and Oliver Sander).
Results
Baseline analyses of the included data
The characteristics of the BLEND participants are presented in Table 1. In total, 31 551 participants were included in the analyses, of whom 8320 (26·37 %) were BC cases. The mean age of non-cases (59 years) was lower than that of cases (62 years), and most of the participants were Caucasian (92·27 %). Approximately 66·68 % of participants had ever smoked: 33·62 % of all participants were current smokers and 33·06 % were former smokers.
* Age was coded as the original continuous values and six categorical values, food intakes were coded as quartile-order categorical values, and the other variables were coded as categorical dummy values.
† Q1–Q4: lowest intake to highest intake (servings/week).
Significant results of logistic regression for food-group variables indicated that missing dietary data were not MAR (all P MAR < 0·05). Little’s test also provided evidence against the assumption that missing data were MCAR (all P MCAR < 0·001). Rejection of both MAR and MCAR indicates the missing values are missing not at random. Therefore, the observations with missing data could not be deleted, and the missing values were marked as blank and not replaced by any value.
Extraction of food groups in relation to bladder cancer via the data mining procedure
Fig. 1 presents an example of a decision tree with three different variables. The variables are ranked according to how they were used to split the participants from decision nodes to end nodes. Position 1 (A) corresponds to the variable that in all trees is the first variable used to split; position 2 (B) corresponds to the variable that on average is the second variable used to split, and so on until, finally, all the participants were split into BC cases and non-cases. ‘Sex’ is the first-ranked split of the tree, which indicates that the dietary patterns related to BC differ between males and females. The non-diet variables (age, sex and smoking status) and five food groups (C, E, F, H and K) were identified as having an influence on the development of BC. The observed importance values of these variables are (Fig. 2): sex (100 %); smoking status (74·60 %); age (62·80 %); beverages (55·81 %); grains and grain products (37·98 %); vegetables and vegetable products (24·30 %); fats, oils and their products (2·95 %); meats and meat products (2·71 %). The other input variables had an importance value of 0 % and were, therefore, considered non-relevant for BC development. The overall classification accuracy was 75·10 %, with a true positive rate of 0·86 and a false positive rate of 0·31 (the receiver operating characteristic curves for each cross-validation run, with AUC from 0·690 to 0·701, are presented in online Supplementary Fig. S1).
Table 2 presents the eight extracted rules resulting in a BC outcome after application of the ‘ruleset’ classifier of C5.0, with a classification accuracy of 74·90 %. The variables identified by the ‘decision tree’ approach were also identified by the ‘ruleset’ approach. Here, we see that male current/former smokers tended to be BC cases and male never smokers tended to be non-cases. However, whether participants were classified as case or non-case also depended on their dietary habits. Females show relatively simple rules, in which only ‘grains and grain products’ and ‘beverages (non-milk)’ were identified as related to BC.
C, meats and meat products; E, fats, oils and their products; F, grains and grain products; H, vegetables and vegetable products; K, beverage (non-milk).
* Age: years old; C–K: servings/week.
† Q1–Q4: lowest intake to highest intake (servings/week).
A sensitivity analysis in which age was transformed into a categorical variable was performed using the same C5.0 algorithm; the results were similar, with the same food groups identified as related to BC (online Supplementary Fig. S2).
Discussion
To our knowledge, this is among the first studies to apply a data mining approach to extract food groups associated with BC risk based on the complexity of combined food intake. By applying the C5.0 algorithm, the decision tree and rules derived from this approach showed that sex, smoking status, age and five food groups (C: meats and meat products, E: fats, oils and their products, F: grains and grain products, H: vegetables and vegetable products, K: beverages (non-milk)) are related to BC risk in both males and females. Apart from the well-established factors for BC identified in the data mining procedures (e.g. age, sex and smoking), the association of diet, especially of specific dietary patterns, with BC risk deserves to be explored because evidence on this topic is limited and because dietary patterns reflect a person’s dietary exposure in aggregate rather than in isolation.
Although the use of data mining is relatively new for unravelling diet in relation to the cancer risk, previous studies already examined dietary intake with BC risk using other techniques. In 2008, De Stefani et al. (Reference De Stefani, Boffetta and Ronco43) found that the dietary patterns labelled as ‘sweet beverages’ (high loadings of coffee, tea and added sugar) and ‘Western’ (high loadings of red meat, fried eggs, potatoes and red wine) were directly associated with the risk of BC based on factor analysis. In addition, the negative influence of the Western diet was also observed for BC recurrence: BC patients in the highest tertile of adherence to a Western dietary pattern had a 48 % higher risk of recurrence of BC compared with patients in the lowest tertile(Reference Westhoff, Wu and Kiemeney6). The Western diet is especially low in fresh fruits and vegetables, but generally high in saturated fats and red and processed meats. Results from the present study are in line with these results, with respect to high intake of fat being associated with an increased risk for the development of BC and high intake of vegetables and vegetable products being associated with a reduced risk.
Previous studies on single food items or food groups in relation to BC risk also reported that a high intake of vegetables was associated with a reduced risk of BC(Reference Xu, Zeng and Liu44–Reference Yao, Yan and Ye47). These studies suggest that the preventive effect could possibly be due to the antioxidant action of vegetables(Reference Boeing, Bechthold and Bub48,Reference Riboli and Norat49) and that each serving of vegetables may result in a 10 % decline in risk. Although the data mining approach is very powerful, the present study identified only ‘vegetables and vegetable products’ as a possible main food group related to BC risk. It remains unclear which specific subgroup is responsible (e.g. starchy/non-starchy, processed/fresh, citrus/cruciferous). Detailed analyses of BLEND data may help to resolve this uncertainty.
Limited evidence is available on the influence of ‘grains and grain products’ on BC risk. However, our findings are in line with results from a previously conducted case–control study(Reference Chatenoud, Tavani and La Vecchia50), suggesting that a high intake of whole grains may reduce the risk of BC. In contrast, a more recent study found that BC risk was negatively influenced by a high intake of refined carbohydrate foods(Reference Augustin, Taborelli and Montella51). Thus, future detailed analyses, especially those distinguishing whole grains from refined grain products, may be useful. Of note, our results on grain products might have been influenced by the fact that the ‘grains and grain products’ group of the present study included sweet ‘Fine bakery wares’, such as ‘Sweet biscuits and cookies’, which are high in sugar and thereby promote obesity, which is known to be a risk factor for BC(Reference Sun, Zhao and Yang52).
Only a few studies have examined the association between fats, oils and their products and BC risk; these were summarised in a systematic review. This review showed that total fat intake was positively related to BC risk when combining results from three case–control studies. However, no such association was observed in cohort studies(Reference La Vecchia and Negri53). The present study is consistent with the findings from the case–control studies, in that a positive association was found.
A meta-analysis reported that overall meat intake was not related to the risk of BC; however, high intakes of red and processed meat were reported as significant risk factors for BC, with risk increases of 17 % and 10 %, respectively(Reference Wang and Jiang54). This increase is probably caused by the N-nitroso compounds found in red and processed meats, which have been proposed as possible bladder carcinogens(Reference Catsburg, Gago-Dominguez and Yuan55). In the present study, a high intake of ‘meats and meat products (C)’ was associated with an increased risk of developing BC. Again, future studies investigating specific types of meat could identify the types of meat or meat products that might have beneficial effects.
Because the bladder is an excretory organ, fluid intake might play an important role in the development of BC. A well-established risk factor is arsenic(Reference Baris, Waddell and Beane Freeman56), to which people are most likely exposed through drinking water. Evidence on the influence of other fluid sources on BC risk, however, is lacking or inconsistent. Here, we observed that high beverage intake was positively associated with BC risk. Again, it should be noted that only total ‘beverage’ intake was assessed, including both beverages with a potential protective effect on BC risk (e.g. green tea(Reference Miyata, Matsuo and Araki57)) and beverages with a potential harmful effect on BC risk (e.g. alcoholic(Reference Vartolomei, Iwata and Roth58) and sweet non-alcoholic beverages(Reference De Stefani, Boffetta and Ronco43)). It, therefore, remains unclear which beverages caused the observed increase in BC risk.
Since nutrition and cancer epidemiology is a complex field, the use of advanced analytic tools, such as data mining, is becoming increasingly important for unravelling associations between diet and health. Data mining has demonstrated its potential to complement conventional statistical regression, particularly for non-linear phenomena such as dietary habits(Reference Huys and Jirsa59), and it does not require a priori assumptions about the relationship between diet and health outcomes(Reference Crutzen and Giabbanelli60). In addition, data mining splits the data into training and validation sets, and the use of cross-validation in particular gives relatively accurate predictive estimates. Furthermore, overfitting of both the decision tree and the rules, which is often problematic in conventional statistical techniques applied to a large number of variables and observations such as the BLEND data set, could be minimised by the reduced error pruning technique in C5.0(Reference Pandya and Pandya36). A strength of the present study is the high classification accuracy, which indicates that the data mining methodology could adequately handle missing data and complex measurements. Therefore, the food groups revealed in the present study could be considered foods or patterns related to BC development.
A limitation of our study, however, is that data mining in nutritional cancer epidemiology might only be useful for identifying key food items and can therefore only be seen as a hypothesis generator; its findings need further detailed investigation in order to establish causation. Furthermore, we should acknowledge that it is a complicated technique that requires special knowledge and expertise; thus, translating the results of data mining into simple health messages is a difficult challenge. In addition, the trees and rules retrieved in the present study only include main food groups; conflicting effects on BC risk of food subgroups or specific items were therefore inevitable. Another limitation might have arisen from the designs of the data collection, which may have introduced recall and/or selection bias, especially in case–control studies. In addition, for most included studies, the exposure variable was assessed by FFQ. Therefore, measurement error and misclassification of study participants in terms of the exposure and outcome are unavoidable because of (a) the inability of an FFQ to capture many details of dietary intake, such as all kinds and exact amounts of foods consumed, (b) the difficulty in quantifying the intake and (c) the high dependency on memory, all of which may have influenced the robustness of the dietary patterns extracted via the data mining procedure(Reference Rodrigo, Aranceta and Salvador61). Lastly, due to the nature of data mining techniques such as C5.0, there are concerns regarding multiple testing and spurious associations, so some of the observed associations might be due to chance alone.
Conclusion
In summary, the data mining technique provided an effective approach to identify food groups related to BC risk in the large epidemiological BLEND study. The main findings from this study support data mining as a valuable additional methodology in nutrition and cancer epidemiology, which deserves further examination.
Disclaimer
Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer/World Health Organization.
Acknowledgements
We gratefully acknowledge all principal investigators for their willingness to participate in this joint project. The author E. Y. W. Y. acknowledges financial support from the China Scholarship Council (no. 201706310135).
This work was partly funded by the World Cancer Research Fund International (WCRF 2012/590) and European Commission (FP7-PEOPLE-618308).
The Hessen case–control study on bladder cancer was supported by the Bundesanstalt für Arbeitsschutz (no. F 1287). The Kaohsiung study was supported by grant NSC 85-2332-B-037-066 from the National Scientific Council of the Republic of China. The Stockholm case–control study was supported by grant from the Swedish National Cancer Society and from the Swedish Work Environment Fund. The Roswell Park Memorial Institute case–control study on bladder cancer was supported by Public Health Service grants CA11535 and CA16056 from the National Cancer Institute. The New England bladder cancer study was funded in part by grant numbers 5 P42 ES007373 from the National Institute of Environmental Health Sciences, NIH and CA57494 from the National Cancer Institute, NIH. The Italian case–control study on bladder cancer was conducted within the framework of the CNR (Italian National Research Council) Applied Project ‘Clinical Application of Oncological Research’ (contracts 94·01321.PF39 and 94·01119.PF39), and with the contributions of the Italian Association for Cancer Research, the Italian League against Tumours, Milan, and Mrs. Angela Marchegiano Borgomainerio. The Brescia bladder cancer study was partly supported by the International Agency for Research on Cancer. The French INSERM study was supported by a grant from the Direction Générale de la Santé, Ministère des Affaires Sociales, France. The Molecular Epidemiology of Bladder Cancer and Prostate Cancer was supported in part by grants ES06718 (to Z.-F. Z.), U01 CA96116 (to A. B.), and CA09142 from the NIH National Institute of Environmental Health Sciences, the National Cancer Institute, the Department of Health and Human Services, and by the Ann Fitzpatrick Alper Program in Environmental Genomics at the Jonsson Comprehensive Cancer Center, UCLA. The Women’s Lifestyle and Health Study was funded by a grant from the Swedish Research Council (grant number 521‐2011‐2955). 
The Netherlands Cohort Study on diet and cancer was supported by the Dutch Cancer Society. The RERF atomic bomb survivors study was supported by The Radiation Effects Research Foundation (RERF), Hiroshima and Nagasaki, Japan, a public interest foundation funded by the Japanese Ministry of Health, Labour and Welfare (MHLW) and the USA Department of Energy (DOE). The research was also funded in part through DOE award DE-HS0000031 to the National Academy of Sciences. This publication was supported by RERF Research Protocol RP-A5-12. The VITamins and Lifestyle Study (VITAL) was supported by a grant (R01CA74846) from the National Cancer Institute.
Study conception and design: A. W. and M. P. Z.; analyses and interpretation of data: E. Y. Y. and C. S.; drafting of the manuscript: E. Y. Y.; revised the manuscript: A. W. and M. P. Z.; provided the data and revised the manuscript: A. W., M. C. S., X. J., L. T., J. M., E. K., P. v. d. B., C. M. L., H. P., G. S., M. F. A., M. R. K., C. L. V., S. P., A. C., K. G., K. C. J., S. B., Z. F. Z., C. B., J. A. T., E. W., E. J. G., E. W. and J. P.; approved the manuscript: all authors.
All the authors declare no conflicts of interest.
Supplementary material
For supplementary material referred to in this article, please visit https://doi.org/10.1017/S0007114520001439