Hostname: page-component-cd9895bd7-fscjk Total loading time: 0 Render date: 2024-12-26T07:18:16.722Z Has data issue: false hasContentIssue false

Principal components analysis of diet and alternatives for identifying the combination of foods that are associated with the risk of disease: a simulation study

Published online by Cambridge University Press:  11 April 2014

Ioannis Bakolis*
Affiliation:
Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, LondonW2 1PG, UK
Peter Burney
Affiliation:
Respiratory Epidemiology and Public Health Group, Imperial College, National Heart and Lung Institute, Emmanuel Kaye Building, Manresa Road, LondonSW3 6LR, UK
Richard Hooper
Affiliation:
Centre for Primary Care and Public Health, Blizard Institute, Barts and The London School of Medicine and Dentistry, Abernethy Building, 2 Newark Street, Whitechapel, LondonE1 2AT, UK
*
*Corresponding author: I. Bakolis, email i.bakolis@imperial.ac.uk
Rights & Permissions [Opens in a new window]

Abstract

Dietary patterns derived empirically using principal components analysis (PCA) are widely employed for investigating diet–disease relationships. In the present study, we investigated whether PCA performed better at identifying such associations than an analysis of each food on a FFQ separately, referred to here as an exhaustive single food analysis (ESFA). Data on diet and disease were simulated using real FFQ data and by assuming a number of food intakes in combination that were associated with the risk of disease. In each simulation, ESFA and PCA were employed to identify the combinations of foods that are associated with the risk of disease using logistic regression, allowing for multiple testing and adjusting for energy intake. ESFA was also separately adjusted for principal components of diet, foods that were significant in the unadjusted ESFA and propensity scores. For each method, we investigated the power with which an association between diet and disease could be identified, and the power and false discovery rate (FDR) for identifying the specific combination of food intakes. In some scenarios, ESFA had greater power to detect a diet–disease association than PCA. ESFA also typically had a greater power and a lower FDR for identifying the combinations of food intakes that are associated with the risk of disease. The FDR of both methods increased with increasing sample size, but when ESFA was adjusted for foods that were significant in the unadjusted ESFA, FDR were controlled at the desired level. These results question the widespread use of PCA in nutritional epidemiology. The adjusted ESFA identifies the combinations of foods that are causally linked to the risk of disease with low FDR and surprisingly good power.

Type
Full Papers
Copyright
Copyright © The Authors 2014 

Over the last two decades, there has been an explosion in the use of principal components analysis (PCA) to identify dietary patterns in nutritional epidemiological studies( Reference Slattery, Boucher and Caan 1 Reference Bertuccio, Rosato and Andreano 10 ). The main reason for the explosion of PCA is that recent results have raised questions about the role of diet in the aetiology of certain chronic diseases. A lack of empirical evidence from randomised controlled trials based on observational findings has been noted in chronic diseases such as asthma( Reference Shaheen, Newson and Rayman 11 ), cancer( Reference Greenberg, Baron and Tosteson 12 , Reference Schatzkin, Lanza and Corle 13 ) and CVD( Reference Hennekens, Buring and Manson 14 ), and it has been argued that single foods or nutrients may be less important than dietary patterns in causing disease. Since the use of a single variable to explore associations between diet and disease has led to unreliable results, an alternative is to look at a small number of dietary dimensions each made up of a combination of foods with the use of PCA.

PCA of data from FFQ allocates food items according to the degree with which their reported intakes are correlated. The principal components identified are referred to as ‘dietary patterns’ and can be investigated in relation to health and disease. This approach proved successful, for example, in two large US cohorts, in which the same two patterns of diet – a ‘prudent’ dietary pattern characterised by intake of vegetables, fruit, legumes, whole grains, fish and poultry and a ‘Western’ dietary pattern characterised by intake of red meat, processed meat, refined grains, sweets and desserts, French fries and high-fat dairy products – were identified and subsequently linked to differences in the occurrence of CHD( Reference Hu, Rimm and Stampfer 15 , Reference Fung, Willett and Stampfer 16 ), colon cancer( Reference Wu, Hu and Fuchs 17 , Reference Fung, Hu and Fuchs 18 ) and chronic obstructive pulmonary disease( Reference Varraso, Fung and Barr 19 , Reference Varraso, Fung and Hu 20 ).

PCA is often claimed to resolve the issue of confounding between different dietary exposures( Reference Randall, Marshall and Graham 21 , Reference Jacques and Tucker 22 ), though, even where there is a causal effect from diet, it is not clear whether the foods singled out in a PCA are those that are directly associated with the risk of disease, or are simply confounded with other foods. Some authors have also suggested that by aggregating the effects of different foods, PCA can demonstrate the effects that are too small to be detected by analysing each food separately( Reference Hu 2 , Reference Newby, Weismayer and Akesson 23 ). Slattery( Reference Slattery 24 ) claimed that eating patterns derived from PCA characterise the diet-associated disease risk better than any one food or nutrient. However, in order for this statement to be useful in disease prevention, we need PCA to identify all of the foods (and only those foods) that, in combination, increase or decrease the risk of disease.

As far as we are aware, justifications of this kind for using PCA have not been critically evaluated. We set out to investigate, using simulation, whether PCA really performs better in these respects than an analysis of individual food intakes.

Materials and methods

Simulation

Simulations were performed in Stata 10 (Stata Corporation). To simulate a dietary dataset with a realistic correlation structure, we sampled with replacement from a reference dataset of real FFQ data. We used two different sets of reference dietary data to allow replication of our findings. These datasets comprised 856 adults aged 16–50 years living in Greenwich who took part in the Food, Lifestyle and Asthma in Greenwich (F.L.A.G.) survey (dataset 1)( Reference Shaheen, Sterne and Thompson 25 ) and 201 adults aged 29–54 years living in Ipswich and Norwich who took part in the UK European Community Respiratory Health Survey (ECHRS) II diet survey (dataset 2)( Reference Hooper, Heinrich and Omenaas 26 ). Quantitative FFQ recorded frequencies of 217 different foods (from never to 6 d/week) in the FLAG survey and of seventy-four different foods (from never to 7 d/week) in the UK ECRHS II diet survey over the previous 12 months. Using both datasets, we created food intake variables by estimating weekly intake (g) of foods and food groups by multiplying frequency of consumption by the weight of standard portion sizes using British food composition tables( Reference Paul, Southgate and Buss 27 ). The UK ECHRS II study was conducted according to the guidelines laid down in the Declaration of Helsinki, and all procedures involving human participants were approved by the Ipswich Hospital and Norfolk and Norwich Hospital ethics committees in the UK( Reference Hooper, Heinrich and Omenaas 26 ). The F.L.A.G. survey was approved by the Greenwich Research Ethics Committee( Reference Shaheen, Sterne and Thompson 25 ). Written informed consent was obtained from all the participants in both studies.

In each simulation, we assumed that disease risk depended on a linear combination of m food intakes derived from the FFQ. Let us suppose that these foods are indexed i 1, i 2,…, i m , and absolute food intakes $$x _{ i _{1}},\, x _{ i _{2}},\,\ldots ,\, x _{ i _{ m }} $$ are standardised to have zero mean and unit standard deviation. We assumed a logistic model for the disease risk, p, that is:

(1) $$ p = 1/(1 + \,exp\,( - ( a + b _{1} x _{ i _{1}} + b _{2} x _{ i _{2}} + \ldots + b _{ m } x _{ i _{ m }}))). $$

This model has been widely used in previous simulated epidemiological studies( Reference Fewell, Davey Smith and Sterne 28 , Reference Peduzzi, Concato and Kemper 29 ). We chose the constant a so that the baseline risk at the average intake of all foods was 0·15, i.e. a= ln(0·15/(1 − 0·15)). The constants b 1, b 2,…, b m were chosen so that the OR per standard deviation of food intake was 1·5 or 1/1·5 depending on whether the food was assumed to increase or decrease the risk of disease, i.e. b j = ± ln(1·5). This is comparable to significant OR for dietary patterns that were identified in the systematic review by Newby & Tucker( Reference Newby and Tucker 4 ).

The value of m was chosen to be about one in seven of the total number of foods on the FFQ, i.e. m= 30 (of 217) in dataset 1 and m= 10 (of 74) in dataset 2. The m foods were chosen in two different ways. First, they were randomly chosen in each simulation. Results of these simulations inform us about the average performance of different methods when we do not restrict a priori the combinations of foods that might be important for disease risk. We considered two models of this kind: in model 1, all m foods were assumed to be protective; in model 2, all m foods were assumed to increase the risk of disease. Second, they were predetermined in each simulation to be foods making up a ‘Western’ dietary pattern, which was assumed to be positively associated with the risk of disease (model 3). These foods are listed in Table 1, and were chosen as food intakes with the highest positive loadings on a ‘Western’ dietary pattern obtained using PCA from the original reference dataset (further details are available from the authors). Model 3 might be expected to favour PCA as a means of identifying dietary associations, since the model is based on a principal component in the population.

Table 1 List of foods comprising a ‘Western’ dietary pattern for each dataset*

* Dataset 1 is from the Food, Lifestyle and Asthma in Greenwich (F.L.A.G) survey( Reference Shaheen, Sterne and Thompson 25 ); Dataset 2 is from the UK European Community Respiratory Health Survey (ECRHS) II diet survey( Reference Hooper, Heinrich and Omenaas 26 ).

Having simulated the food intake data for a new individual in the sample, we then calculated the probability of the outcome, p, using equation 1. We determined whether or not the individual had the disease by generating a uniform random number between 0 and 1, and observing whether it was less than p.

The simulated dietary data were then subjected to a PCA (conducted on the correlation matrix of the reported food intakes) with varimax rotation, and the resulting dietary patterns were investigated for their associations with the risk of disease using logistic regression. We considered the results of extracting two, five and ten principal components, since this was the range of components identified in the majority of dietary pattern studies( Reference Kant 3 , Reference Newby and Tucker 4 ).

For comparison, we looked at the results of analysing each food on the FFQ separately in relation to the risk of disease, a process we refer to as an exhaustive single food analysis (ESFA). This is a univariate or independent screening approach as we know from, for example, SNP screening in statistical genetics( Reference Laird and SpringerLink 30 ). All regression analyses of dietary patterns and individual food intakes were adjusted for total energy intake. Adjustment for energy intake is strongly advised in the analysis of observational nutritional studies( Reference Willett 31 , Reference Jakes, Day and Luben 32 ). As pointed out by Willet( Reference Willett 31 ), adjustment for total energy intake should be considered because the level of intake might be a risk factor, might distort the effect of a food or a nutrient on the potential outcome, and variations in nutrient intake between individuals might reflect variations in individuals' energy intake levels.

ESFA was, in the first instance, unadjusted for the effects of other foods, but in order to deal with confounding, we also carried out ESFA adjusting for other foods using three different methods:

  1. (a) Adjusting for the first five principal components of diet. Because net confounding by correlated foods and patterns could result in biased associations between diet and disease, adjustment for dietary patterns has been suggested in the literature( Reference Imamura, Lichtenstein and Dallal 33 ). We chose five components of diet because that was the number of dietary patterns of diet that were identified in the original populations (data not shown).

  2. (b) Adjusting for all foods that were significant in the unadjusted ESFA. In this case, as a first step, we ran an ESFA procedure adjusting for total energy intake keeping the food variables (covariates) with those regression coefficient estimates that had a P value lower than a specific threshold. The threshold was determined by the procedure of Benjamini & Hochberg( Reference Benjamini and Hochberg 34 ), controlling the rate of our false discoveries at 20 %. Then, we re-ran an ESFA procedure adjusting for energy intake and for all these foods (covariates) that were significant in the first round of analysis (unadjusted ESFA). This method is conceptually similar to the iterative sure independence screening method proposed by Fan & Lv( Reference Fan and Lv 35 ). However, herein, we chose our covariates based on a multiple test procedure and not on a penalised likelihood method. Furthermore, we aimed to control the false discovery rate (FDR), whereas the iterative sure independence screening method focused on missed discoveries.

  3. (c) Adjusting for a propensity score for predicting the amount of each index food intake consumed from other food intakes( Reference Imai and van Dyk 36 ). Once the propensity score was estimated, it was used as a confounder in our multivariate model( Reference Sturmer, Joshi and Glynn 37 ).

The process of simulation and testing was repeated a large number of times (10 000) in order to determine the long-run performance of PCA and ESFA in detecting the associations between diet and disease.

Using the ‘powerlog’ sample size calculation routine in Stata( 38 ), we determined that a sample size of 330 would achieve 80 % power at the 5 % significance level to detect an OR of 1·5 per standard deviation, using an unadjusted logistic regression with no allowance for multiple testing. We present results herein for sample sizes of 300, 1200 and 4800.

Evaluating the performance of exhaustive single food analysis and principal components analysis

First, we investigated the statistical power with which ESFA and PCA could detect whether there was any association between diet and disease. For the ESFA, we considered that an association had been found if any of the food intakes were significantly associated with the risk of disease after applying a Bonferroni correction for the number of foods tested (family-wise P< 0·05)( Reference Miller 39 ). For the PCA, we considered an association had been found if any of the dietary patterns were significantly associated with the risk of disease after applying a Bonferroni correction for the number of patterns identified.

We also wanted to see how well the two procedures identified the specific combination of food intakes that were causally associated with the risk of disease. We compared the power and the FDR of ESFA and PCA for detecting these associations. In this context, we extended the concept of ‘power’ to mean the proportion of foods included in the model that were correctly identified as significantly associated with the disease outcome (Fig. 1). The FDR is the proportion of discoveries, or significant findings, that are false (Fig. 1).

Fig. 1 How the results of dietary analyses can be broken down. TN, true negatives; FN, false negatives; FP, false positives; TP, true positives. Power = TP/(FN+TP). False discovery rate (FDR) = FP/(FP+TP) if FP+TP>0. FDR = 0 if FP+TP = 0.

For the ESFA, we considered that there was a ‘significant’ effect of a food if it was identified as such using the multiple testing procedure of Benjamini & Hochberg( Reference Benjamini and Hochberg 34 ), with a nominal FDR set to 20 %. For the PCA, we considered that there was a ‘significant’ effect of a food if it had a correlation >0·3 or < − 0·3 with a dietary pattern that was significantly associated with the risk of disease (P< 0·05) – this being the way in which individual foods tend to be highlighted in a PCA. It should be noted that the procedure of Benjamini & Hochberg( Reference Benjamini and Hochberg 34 ) is designed to control the FDR at no more than the nominal level, but here false discoveries (of foods) occur not just as random errors, but also because of confounding with other foods, so the nominal rate may be exceeded.

Results

Table 2 displays the estimates of the power of PCA and ESFA for detecting an association between diet and disease, for different sample sizes and numbers of principal component scenarios. In model 3, the estimates of power for a sample size of 300 were all close to 100 %, so the estimates for a sample size of 100 are also given. Neither method consistently outperformed the other in this respect.

Table 2 Power (%) of exhaustive single food analysis (ESFA) and principal components analysis (PCA) for detecting any association between diet and disease*

F.L.A.G., Food, Lifestyle and Asthma in Greenwich; ECRHS, European Community Respiratory Health Survey.

* All estimates of power have a standard error < 0·5 %.

In models 1 and 2, one in seven foods (thirty in dataset 1 and ten in dataset 2) are selected at random in each replication from the foods on the FFQ. In model 1, all selected food intakes are negatively associated with the risk of disease; in model 2, all selected food intakes are positively associated with the risk of disease. In model 3, foods comprising a ‘Western’ dietary pattern (thirty in dataset 1 and ten in dataset 2; see Table 1) are used in each replication, with all these food intakes being positively associated with the risk of disease.

Power of PCA exceeds that of ESFA.

Table 3 shows the estimates of the power and FDR of PCA and ESFA for identifying the combinations of food intakes that were causally associated with the risk of disease. In most parts of the table, ESFA ‘dominates’ PCA in the sense of having both a higher power and a lower FDR. In the remainder of the table, neither one dominates the other.

Table 3 Power and false discovery rate (FDR) (%) of exhaustive single food analysis (ESFA) and principal components analysis (PCA) for detecting the foods that are causally linked to the risk of disease*

F.L.A.G., Food, Lifestyle and Asthma in Greenwich; ECRHS, European Community Respiratory Health Survey.

* All estimates of power and FDR have a standard error < 0·5 %.

In models 1 and 2, one in seven foods (thirty in dataset 1 and ten in dataset 2) are selected at random in each replication from the foods on the FFQ. In model 1, all selected food intakes are negatively associated with the risk of disease; in model 2, all selected food intakes are positively associated with the risk of disease. In model 3, foods comprising a ‘Western’ dietary pattern (thirty in dataset 1 and ten in dataset 2; see Table 1) are used in each replication, with all these food intakes being positively associated with the risk of disease.

Power of PCA exceeds that of ESFA, but FDR is also higher.

§ FDR of PCA is lower than that of ESFA, but power is also lower.

Adjusting for other foods that were significant in an unadjusted analysis successfully controlled the FDR at about the 20 % nominal level, though with some loss of power, particularly with low sample sizes (Table 4). Attempting to control the FDR of ESFA by adjusting for principal components of diet, or by adjusting for a propensity score, was not successful (Table 4).

Table 4 Power and false discovery rate (FDR) (%) of exhaustive single food analysis with different methods of adjustment for other foods*

F.L.A.G., Food, Lifestyle and Asthma in Greenwich; ECRHS, European Community Respiratory Health Survey.

* All estimates of power and FDR have a standard error < 0·5 %.

In models 1–3, one in seven foods (thirty in dataset 1 and ten in dataset 2) are selected at random in each replication from the foods on the FFQ. In model 1, all selected food intakes are negatively associated with the risk of disease; in model 2, all selected food intakes are positively associated with the risk of disease. In model 3, foods comprising a ‘Western’ dietary pattern (thirty in dataset 1 and ten in dataset 2; see Table 1) are used in each replication, with all these food intakes being positively associated with the risk of disease.

Discussion

In some scenarios, ESFA had greater power than PCA to detect an association of diet with the risk of disease. Allowing for multiple testing using the procedure of Benjamini & Hochberg( Reference Benjamini and Hochberg 34 ), ESFA also typically had a higher power and a lower FDR for identifying the combinations of foods that were causally linked with the risk of disease than PCA in which foods were singled out if they correlated highly (>0·3 or < − 0·3) with a significant dietary pattern. Even when a simplified ‘Western’ dietary pattern was the real culprit, PCA could not outperform ESFA in reconstructing the foods that were linked with the risk of disease. These findings were replicated in two different FFQ datasets.

In principal components regression, there is no standard procedure for identifying foods as ‘significantly associated’ with the disease outcome. We used a definition that matches what researchers typically do when they single out a combination of foods to mention in their abstract( Reference Shaheen, Jameson and Syddall 40 ) or when they devise an intervention based on the results( Reference Bertuccio, Rosato and Andreano 10 ); that is, to list the foods that are highly correlated with a dietary pattern that is itself strongly associated with the disease outcome. We wanted to know whether this method was better at correctly identifying the combinations of foods that are associated with the risk of disease than an ESFA. ‘Significant’ foods are defined differently in each case (they are different methods), but the performance of each method is assessed using the same criterion: how often does it get it right? In fact, we use two criteria: FDR and power. The two methods can differ in both these respects, making comparison a little harder (think of two different diagnostic tests that might differ both in their sensitivity and specificity); nevertheless, if we observe one method to have both a lower FDR and a higher power, then we can conclude that it has superior performance.

It is common to try to control the FDR at a low level( Reference Benjamini and Hochberg 34 ). We have used a nominal FDR of 20 %; in genetic studies, where the use of FDR is well established, FDR between 5 and 20 % are recommended depending on the circumstances( Reference Benjamini and Yekutieli 41 ). It should be noted that 20 % is still well below the FDR of individual hypothesis testing using P< 0·05 as a cut-off( Reference Ioannidis 42 ). However, it is concerning that the observed FDR of both ESFA (nominally controlled at 20 %) and PCA increase in an uncontrolled fashion as the sample size and power increase (Table 3). It is worth noting that when one in seven foods are causally linked with the risk of disease, as here, an FDR of about 86 % would be achieved by selecting ‘significant’ foods entirely at random. The uncontrolled FDR occurs because all food intakes are correlated to some extent with the causal foods, leading to false positive findings (more so as power increases). We tried a variety of approaches to control for other foods in the ESFA, and found that the FDR could be successfully controlled at the nominal level by adjusting for foods that were significant in a univariate analysis. Achieving a given power requires about twice the sample size of an unadjusted ESFA (with its inflated FDR) and four times the sample size from our original calculation, i.e. for an unadjusted analysis with the criterion P< 0·05.

Our models included one in seven of the foods on the FFQ. We repeated our simulations with a smaller number of foods in the models, and obtained qualitatively similar findings (see online supplementary Tables S1–S3). We did not consider interactive effects of foods in our models; this requires further investigation. Studies have claimed that dietary constituents interact with each other in complex ways( Reference Messina, Lampe and Birt 43 , Reference Sacks, Obarzanek and Windhauser 44 ), and these interactions have an effect, for example, on the risk of CVD( Reference Halliwell 45 ) and breast cancer( Reference Edefonti, Randi and La Vecchia 46 ). PCA is often recommended as a way of dealing with interactions between foods( Reference Hu 2 , Reference Varraso, Fung and Hu 20 ). Furthermore, foods could be associated with the risk of disease in a non-linear way. However, it is questionable whether linear combinations of food intakes produced by the PCA adequately address the issues of modelling interactions or capture potential non-linear effects between diet and disease.

Although we considered a model based on a ‘Western’ dietary pattern, there is no reason why foods with truly causal effects should be foods that are highly correlated with each other in order to be associated with the risk of disease. Hence, in the present simulation study, foods associated with the risk of disease were not necessarily highly correlated. Note it is not obvious that ESFA will outperform PCA in this case, since ESFA must pay a much higher penalty for multiple testing of all the individual foods, while each principal component may include – even if only by chance – a number of foods that have causal effects in the same direction. Where disease risk is explained by other factors that are confounded with diet, however, this confounding is more likely to be at the level of a dietary pattern than with an individual food. A ‘prudent’ dietary pattern, for example, is associated with older age( Reference Agurs-Collins, Rosenberg and Makambi 47 ), female sex( Reference Robinson, Syddall and Jameson 48 ), non-smoking( Reference Fung, Rimm and Spiegelman 49 ), higher income( Reference Perrin, Dallongeville and Ducimetiere 50 ), higher educational level( Reference Raberg Kjollesdal, Holmboe-Ottesen and Wandel 51 ), exercise( Reference Lopez-Garcia, Schulze and Fung 52 ) and supplement use( Reference Heidemann, Schulze and Franco 53 ). These factors are likely to be associated with a number of food intakes contributing to a ‘prudent’ dietary pattern rather than with any one of these foods in particular. This is another reason for adjusting each food effect for others found to be significant, as we suggest this should help control for both measured and some unmeasured confounding. We did not include non-dietary confounders in our simulations because there are just too many different potential confounders and models for their effects to be considered. We suspect, however, that as long as there are foods with truly causal effects, the findings of the present study will generalise to situations where other confounders have been explicitly adjusted for.

PCA does not aim to obtain a clear picture of disease variation but to summarise the overall dietary intake variation in the population. As Jolliffe( Reference Jolliffe 54 ) points out, the mistake when PCA is employed for distinguishing between healthy and unhealthy individuals is to assume that the separation between these groups is along the axis of greatest variation in diet. More appropriate statistical approaches in this respect are linear discriminant analysis( Reference Jolliffe 54 ) and discriminant analysis of principal components( Reference Jombart, Devillard and Balloux 55 ).

Other multivariate approaches such as cluster analysis( Reference Reedy, Wirfalt and Flood 56 ), and data reduction techniques such as reduced-rank regression( Reference Hoffmann, Schulze and Schienkiewitz 57 ) provide an alternative way for identifying dietary patterns. The main advantage of cluster analysis over PCA is that cluster analysis creates mutually exclusive groups that can easily be used in the analysis. Reduced-rank regression constructs dietary patterns according to the covariance matrix of specific biomarkers (taken by blood samples, for example) that are associated with dietary intake variables and assumed to be linked with the investigated disease outcome. Thus, the dietary pattern that is identified to be associated with the risk of disease for a specific biological reason and the combinations of foods that characterise the pattern describe a specific biomarker in the causal pathway between diet and disease.

Another approach is factor analysis, an ambiguous term which in certain textbooks include both PCA and common factor analysis. Even when investigators say that they are employing factor analysis in nutritional studies, they may be employing PCA. This misconception depends on the statistical package that these studies have used, because certain statistical packages (e.g. SAS) treat PCA as a special category of factor analysis. Factor analysis, which is a different statistical technique from PCA( Reference Bartholomew, Steele and Moustaki 58 ), is not recommended for the analysis of nutritional data( Reference Hu 2 ), although factor analysis based on the principal factor method gives generally similar results to PCA( Reference Schulze and Hoffmann 59 ).

Since ESFA outperforms PCA in the present simulation study, dealing with high-dimensional multivariate dietary exposures could be treated as a problem of variable model selection, that is, finding the non-zero regression coefficients in an unknown regression model. Our adjusted ESFA is similar to the iterative sure independence screening method for ultra-high-dimensional data( Reference Fan and Lv 35 ). Other forms of penalised likelihood estimation methods have been developed in the last decade to cope with high-dimensional data and have been lately reviewed( Reference Fan and Lv 60 ). These methods could be potentially useful in nutritional epidemiological studies, and further research is needed.

There is a growing interest in designing dietary interventions around foods rather than nutrients( Reference Jacobs, Gross and Tapseli 61 ) and around particular foods rather than dietary patterns( Reference Jacobs, Gross and Tapseli 61 , Reference Mann and Aune 62 ). Specifically, Jacobs et al. ( Reference Jacobs, Gross and Tapseli 61 ) suggests that the evidence for beneficial effects of a ‘prudent’ diet comes from interventions that only modified the intake of one or two foods. Furthermore, McCann et al. ( Reference McCann, Weiner and Graham 63 ) suggests that fruits and vegetables alone provided the highest discrimination among endometrial cancer cases and controls compared with PCA and other methods of characterisation. Mann & Aune( Reference Mann and Aune 62 ), evaluating the evidence that fruits and vegetables can prevent diabetes, have called for more studies looking at the effects of specific fruits and vegetables.

In conclusion, an FFQ-wide study of associations between food intakes and disease risk outperforms an analysis of dietary patterns derived from PCA. Analysing each food adjusting for others allows truly causal effects to be identified with a low rate of false discoveries and surprisingly good power. Although PCA has proved extremely popular in nutritional epidemiology to date, we question its routine use in this context.

Supplementary material

To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0007114514000221

Acknowledgements

The present study is part of a programme of research funded by the Department of Health Policy Research Programme (DHTBX P16621).

All authors were responsible for the reported research and made substantial contributions to the conception and design of the simulation study, analysis and interpretation of the data, drafting of the manuscript or revising it critically for important intellectual content. I. B. prepared the manuscript, with R. H. and P. B. contributing to it. I. B. conducted the computer programming and simulation and data analysis. All authors read and approved the final version of the manuscript.

None of the authors has any personal, commercial, political, academic or financial conflicts of interest to declare.

References

1 Slattery, ML, Boucher, KM, Caan, BJ, et al. (1998) Eating patterns and risk of colon cancer. Am J Epidemiol 148, 416.Google Scholar
2 Hu, FB (2002) Dietary pattern analysis: a new direction in nutritional epidemiology. Curr Opin Lipidol 13, 39.CrossRefGoogle ScholarPubMed
3 Kant, AK (2004) Dietary patterns and health outcomes. J Am Diet Assoc 104, 615635.Google Scholar
4 Newby, PK & Tucker, KL (2004) Empirically derived eating patterns using factor or cluster analysis: a review. Nutr Rev 62, 177203.Google Scholar
5 Naska, A, Fouskakis, D, Oikonomou, E, et al. (2006) Dietary patterns and their socio-demographic determinants in 10 European countries: data from the DAFNE databank. Eur J Clin Nutr 60, 181190.Google Scholar
6 Knudsen, VK, Orozova-Bekkevold, IM, Mikkelsen, TB, et al. (2008) Major dietary patterns in pregnancy and fetal growth. Eur J Clin Nutr 62, 463470.CrossRefGoogle ScholarPubMed
7 Northstone, K & Emmett, P (2005) Multivariate analysis of diet in children at four and seven years of age and associations with socio-demographic characteristics. Eur J Clin Nutr 59, 751760.Google Scholar
8 Iqbal, R, Anand, S, Ounpuu, S, et al. (2008) Dietary patterns and the risk of acute myocardial infarction in 52 countries: results of the INTERHEART study. Circulation 118, 19291937.Google Scholar
9 Cottet, V, Touvier, M, Fournier, A, et al. (2009) Postmenopausal breast cancer risk and dietary patterns in the E3N-EPIC prospective cohort study. Am J Epidemiol 170, 12571267.Google Scholar
10 Bertuccio, P, Rosato, V, Andreano, A, et al. (2013) Dietary patterns and gastric cancer risk: a systematic review and meta-analysis. Ann Oncol 24, 14501458.Google Scholar
11 Shaheen, SO, Newson, RB, Rayman, MP, et al. (2007) Randomised, double blind, placebo-controlled trial of selenium supplementation in adult asthma. Thorax 62, 483490.Google Scholar
12 Greenberg, ER, Baron, JA, Tosteson, TD, et al. (1994) A clinical trial of antioxidant vitamins to prevent colorectal adenoma. Polyp Prevention Study Group. N Engl J Med 331, 141147.Google Scholar
13 Schatzkin, A, Lanza, E, Corle, D, et al. (2000) Lack of effect of a low-fat, high-fiber diet on the recurrence of colorectal adenomas. Polyp Prevention Trial Study Group. N Engl J Med 342, 11491155.CrossRefGoogle ScholarPubMed
14 Hennekens, CH, Buring, JE, Manson, JE, et al. (1996) Lack of effect of long-term supplementation with beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J Med 334, 11451149.Google Scholar
15 Hu, FB, Rimm, EB, Stampfer, MJ, et al. (2000) Prospective study of major dietary patterns and risk of coronary heart disease in men. Am J Clin Nutr 72, 912921.CrossRefGoogle ScholarPubMed
16 Fung, TT, Willett, WC, Stampfer, MJ, et al. (2001) Dietary patterns and the risk of coronary heart disease in women. Arch Intern Med 161, 18571862.CrossRefGoogle ScholarPubMed
17 Wu, K, Hu, FB, Fuchs, C, et al. (2004) Dietary patterns and risk of colon cancer and adenoma in a cohort of men (United States). Cancer Causes Control 15, 853862.Google Scholar
18 Fung, T, Hu, FB, Fuchs, C, et al. (2003) Major dietary patterns and the risk of colorectal cancer in women. Arch Intern Med 163, 309314.Google Scholar
19 Varraso, R, Fung, TT, Barr, RG, et al. (2007) Prospective study of dietary patterns and chronic obstructive pulmonary disease among US women. Am J Clin Nutr 86, 488495.Google Scholar
20 Varraso, R, Fung, TT, Hu, FB, et al. (2007) Prospective study of dietary patterns and chronic obstructive pulmonary disease among US men. Thorax 62, 786791.Google Scholar
21 Randall, E, Marshall, JR, Graham, S, et al. (1990) Patterns in food use and their associations with nutrient intakes. Am J Clin Nutr 52, 739745.Google Scholar
22 Jacques, PF & Tucker, KL (2001) Are dietary patterns useful for understanding the role of diet in chronic disease? Am J Clin Nutr 73, 12.CrossRefGoogle ScholarPubMed
23 Newby, PK, Weismayer, C, Akesson, A, et al. (2006) Longitudinal changes in food patterns predict changes in weight and body mass index and the effects are greatest in obese women. J Nutr 136, 25802587.Google Scholar
24 Slattery, ML (2008) Defining dietary consumption: is the sum greater than its parts? Am J Clin Nutr 88, 1415.CrossRefGoogle ScholarPubMed
25 Shaheen, SO, Sterne, JA, Thompson, RL, et al. (2001) Dietary antioxidants and asthma in adults: population-based case–control study. Am J Respir Crit Care Med 164, 18231828.Google Scholar
26 Hooper, R, Heinrich, J, Omenaas, E, et al. (2010) Dietary patterns and risk of asthma: results from three countries in European Community Respiratory Health Survey-H. Br J Nutr 103, 13541365.Google Scholar
27 Paul, AA, Southgate, DA & Buss, DH (1986) McCance and Widdowson's ‘The composition of foods’: supplementary information and review of new compositional data. Hum Nutr Appl Nutr 40, 287299.Google ScholarPubMed
28 Fewell, Z, Davey Smith, G & Sterne, JA (2007) The impact of residual and unmeasured confounding in epidemiologic studies: a simulation study. Am J Epidemiol 166, 646655.Google Scholar
29 Peduzzi, P, Concato, J, Kemper, E, et al. (1996) A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49, 13731379.Google Scholar
30 Laird, NM & SpringerLink, (2011) The Fundamentals of Modern Statistical Genetics. New York, NY: Springer Science+Business Media, LLC. http://dx.doi.org/10.1007/978-1-4419-7338-2.Google Scholar
31 Willett, W (1998) Nutritional Epidemiology, 2nd ed. Oxford: Oxford University Press.CrossRefGoogle Scholar
32 Jakes, RW, Day, NE, Luben, R, et al. (2004) Adjusting for energy intake – what measure to use in nutritional epidemiological studies? Int J Epidemiol 33, 13821386.Google Scholar
33 Imamura, F, Lichtenstein, AH, Dallal, GE, et al. (2009) Confounding by dietary patterns of the inverse association between alcohol consumption and type 2 diabetes risk. Am J Epidemiol 170, 3745.Google Scholar
34 Benjamini, Y & Hochberg, Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc: Ser B (Methodol) 57, 289300.Google Scholar
35 Fan, J & Lv, J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc: Ser B (Methodol) 70, 849911.Google Scholar
36 Imai, Kosuke & van Dyk, David A (2004) Causal inference with general treatment regime: generalizing the propensity score. 99, 854866.Google Scholar
37 Sturmer, T, Joshi, M, Glynn, RJ, et al. (2006) A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol 59, 437447.Google Scholar
38 Ender PB (2002) powerlog: command to perform logistic regression power analysis. http://www.ats.ucla.edu/stat/stata/ado/analysis/.Google Scholar
39 Miller, RG (1981) Simultaneous Statistical Inference, 2nd ed. Springer: Verlag.Google Scholar
40 Shaheen, SO, Jameson, KA, Syddall, HE, et al. (2010) The relationship of dietary patterns with adult lung function and COPD. Eur Respir J 36, 277284.CrossRefGoogle ScholarPubMed
41 Benjamini, Y & Yekutieli, D (2005) Quantitative trait loci analysis using the false discovery rate. Genetics 171, 783790.Google Scholar
42 Ioannidis, JP (2005) Why most published research findings are false. PLoS Med 2, e124.Google Scholar
43 Messina, M, Lampe, JW, Birt, DF, et al. (2001) Reductionism and the narrowing nutrition perspective: time for reevaluation and emphasis on food synergy. J Am Diet Assoc 101, 14161419.Google Scholar
44 Sacks, FM, Obarzanek, E, Windhauser, MM, et al. (1995) Rationale and design of the Dietary Approaches to Stop Hypertension trial (DASH). A multicenter controlled-feeding study of dietary patterns to lower blood pressure. Ann Epidemiol 5, 108118.Google Scholar
45 Halliwell, B (1996) Antioxidants in human health and disease. Annu Rev Nutr 16, 3350.CrossRefGoogle ScholarPubMed
46 Edefonti, V, Randi, G, La Vecchia, C, et al. (2009) Dietary patterns and breast cancer: a review with focus on methodological issues. Nutr Rev 67, 297314.Google Scholar
47 Agurs-Collins, T, Rosenberg, L, Makambi, K, et al. (2009) Dietary patterns and breast cancer risk in women participating in the Black Women's Health Study. Am J Clin Nutr 90, 621628.Google Scholar
48 Robinson, S, Syddall, H, Jameson, K, et al. (2009) Current patterns of diet in community-dwelling older men and women: results from the Hertfordshire Cohort Study. Age Ageing 38, 594599.Google Scholar
49 Fung, TT, Rimm, EB, Spiegelman, D, et al. (2001) Association between dietary patterns and plasma biomarkers of obesity and cardiovascular disease risk. Am J Clin Nutr 73, 6167.Google Scholar
50 Perrin, AE, Dallongeville, J, Ducimetiere, P, et al. (2005) Interactions between traditional regional determinants and socio-economic status on dietary patterns in a sample of French men. Br J Nutr 93, 109114.Google Scholar
51 Raberg Kjollesdal, MK, Holmboe-Ottesen, G & Wandel, M (2010) Associations between food patterns, socioeconomic position and working situation among adult, working women and men in Oslo. Eur J Clin Nutr 64, 11501157.Google Scholar
52 Lopez-Garcia, E, Schulze, MB, Fung, TT, et al. (2004) Major dietary patterns are related to plasma concentrations of markers of inflammation and endothelial dysfunction. Am J Clin Nutr 80, 10291035.Google Scholar
53 Heidemann, C, Schulze, MB, Franco, OH, et al. (2008) Dietary patterns and risk of mortality from cardiovascular disease, cancer, and all causes in a prospective cohort of women. Circulation 118, 230237.Google Scholar
54 Jolliffe, IT (2010) Principal Component Analysis, 2nd ed. New York/London: Springer.Google Scholar
55 Jombart, T, Devillard, S & Balloux, F (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11, 94.CrossRefGoogle ScholarPubMed
56 Reedy, J, Wirfalt, E, Flood, A, et al. (2010) Comparing 3 dietary pattern methods – cluster analysis, factor analysis, and index analysis – with colorectal cancer risk: The NIH-AARP Diet and Health Study. Am J Epidemiol 171, 479487.Google Scholar
57 Hoffmann, K, Schulze, MB, Schienkiewitz, A, et al. (2004) Application of a new statistical method to derive dietary patterns in nutritional epidemiology. Am J Epidemiol 159, 935944.Google Scholar
58 Bartholomew, DJ, Steele, F, Moustaki, I, et al. (2002) The Analysis and Interpretation of Multivariate Data for Social Scientists. Boca Raton, FL: Chapman & Hall/CRC Press.Google Scholar
59 Schulze, MB & Hoffmann, K (2006) Methodological approaches to study dietary patterns in relation to risk of coronary heart disease and stroke. Br J Nutr 95, 860869.Google Scholar
60 Fan, J & Lv, J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20, 101148.Google Scholar
61 Jacobs, DR Jr, Gross, MD, Tapseli, LC, et al. (2009) Food synergy: an operational concept for understanding nutrition. Am J Clin Nutr 89, 1543S1548S.Google Scholar
62 Mann, J & Aune, D (2010) Can specific fruits and vegetables prevent diabetes? BMJ 341, c4395.Google Scholar
63 McCann, SE, Weiner, J, Graham, S, et al. (2001) Is principal components analysis necessary to characterise dietary behaviour in studies of diet and disease? Public Health Nutr 4, 903908.Google Scholar
Figure 0

Table 1 List of foods comprising a ‘Western’ dietary pattern for each dataset*

Figure 1

Fig. 1 How the results of dietary analyses can be broken down. TN, true negatives; FN, false negatives; FP, false positives; TP, true positives. Power = TP/(FN+TP). False discovery rate (FDR) = FP/(FP+TP) if FP+TP>0. FDR = 0 if FP+TP = 0.

Figure 2

Table 2 Power (%) of exhaustive single food analysis (ESFA) and principal components analysis (PCA) for detecting any association between diet and disease*

Figure 3

Table 3 Power and false discovery rate (FDR) (%) of exhaustive single food analysis (ESFA) and principal components analysis (PCA) for detecting the foods that are causally linked to the risk of disease*

Figure 4

Table 4 Power and false discovery rate (FDR) (%) of exhaustive single food analysis with different methods of adjustment for other foods*