Number of studies included in the review and summary of identified statistical tests and test combinations used
A total of 102 papers were screened, of which 60 were included in the review and 42 were excluded for the reasons given in the methods. Six different statistical tests were identified, five with one set of interpretation criteria each and one with two sets of criteria (cross-classification in the same or opposite tertiles), resulting in a total of seven possible validity interpretation outcomes (Table 1). The most commonly used test was the correlation coefficient (57 studies, 18 combinations), followed by cross-classification (28 studies, 12 combinations), Bland Altman analyses (27 studies, 10 combinations), t-test or Wilcoxon signed rank test (22 studies, 7 combinations), weighted Kappa coefficient (15 studies, 9 combinations) and percent difference (5 studies, 4 combinations) (86.). Twenty-one different combinations of the six statistical tests were identified in the 60 studies. The majority of combinations included three or fewer tests, with the correlation coefficient featuring as a single test (delineated as a “combination” in Table 2) and in all but three of the remaining 20 combinations. Bland Altman analyses and cross-classification were included in approximately half of the combinations, with the weighted Kappa coefficient used less often. The least used test in combinations was the percent difference (Table 2). None of the reviewed studies that included Bland Altman analyses considered the clinical importance of the width of the limits of agreement (LOA) in their discussion and conclusions regarding the validity of the method being tested. Furthermore, all studies concluded that the test dietary assessment method was valid for use in the respective populations.
Table 1
Summary of identified statistical tests and interpretation criteria for validation of dietary intake assessment methods
Statistical test | Facet of validity reflected | Good | Acceptable | Poor
Correlation coefficient | Strength and direction of association at individual level [8] | | | |
Paired t-test/Wilcoxon signed rank test [8, 22, 23, 25, 27, 28, 33, 34, 36, 48, 49, 52-56, 60, 62, 65, 66, 68, 69, 91] | Agreement at group level [8] | | | |
Percent difference [8, 22, 23, 25, 27, 28, 33, 34, 49, 52-56, 60, 65-72, 91] | Agreement at group level (size and direction of error) [8] | | | >10% |
Cross-classification (tertiles, quartiles or quintiles) [8, 22, 31, 32, 35-38, 41, 42, 44-51, 55-61, 63-69, 91] | Agreement (including chance) at individual level [8] | In same tertile/quartile: ≥50% [2]; in opposite tertile/quartile: ≤10% [2] | | In same tertile/quartile: <50% [2]; in opposite tertile/quartile: >10% [2] |
Weighted Kappa statistics (coefficient) [8, 24, 26, 30, 40, 43, 54, 58, 59, 63, 64, 66-69, 91] | Agreement (excluding chance) at individual level [8] | | | |
Bland Altman analysis (correlation between mean and mean difference) [6, 21, 33, 34, 37-39, 43, 50, 53, 54, 61, 63, 69, 76, 92] | Presence, direction and extent of bias at group level [6, 76] | | | |
Table 2
Summary of statistical test combinations applied in reviewed validation studies
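Combination | Correlation coefficient | Paired t-test/Wilcoxon signed rank test | Cross-classification | Percent difference | Weighted Kappa coefficient | Bland Altman analysis | Number of studies | |
1 | X | X | | | | | 8 | |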
2 | X | | | | | | 7 | |
3 | X | | X | | | X | 6 | |
4 | X | X | | | | X | 5 | |
5 | X | | | | | X | 5 | |
6 | X | | X | | | | 5 | |
7 | X | | X | | X | | 4 | |
8 | X | X | X | | | | 2 | |
9 | X | X | X | | X | X | 2 | |
10 | X | | X | X | | X | 2 | |
11 | X | | | | X | | 2 | |
12 | X | X | | | X | X | 2 | |
13 | X | X | X | | | X | 2 | |
14 | X | | X | X | | | 1 | |
15 | X | | | | X | X | 1 | |
16 | X | | | X | X | | 1 | |
17 | X | | X | | X | X | 1 | |
18 | X | X | X | | X | | 1 | |
19 | | | X | | X | X | 1 | |
20 | | | | X | | | 1 | |
21 | | | X | | | | 1 | |
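As a consistency check, the per-test tallies quoted in the text can be recomputed from the rows of Table 2. The sketch below assumes the six test columns are ordered as correlation coefficient, paired t-test/Wilcoxon signed rank test, cross-classification, percent difference, weighted Kappa coefficient and Bland Altman analysis. That order is not stated in the table rows themselves; it is inferred here because it is the ordering that reproduces the counts given in the text.

```python
# Table 2 rows: one character per test column ("X" = test used, "." = not used),
# followed by the number of studies that used that combination.
# ASSUMPTION: the column order below is inferred, not read from a table header.
TESTS = [
    "correlation coefficient",
    "paired t-test / Wilcoxon",
    "cross-classification",
    "percent difference",
    "weighted Kappa",
    "Bland Altman",
]
ROWS = [
    ("XX....", 8), ("X.....", 7), ("X.X..X", 6), ("XX...X", 5),
    ("X....X", 5), ("X.X...", 5), ("X.X.X.", 4), ("XXX...", 2),
    ("XXX.XX", 2), ("X.XX.X", 2), ("X...X.", 2), ("XX..XX", 2),
    ("XXX..X", 2), ("X.XX..", 1), ("X...XX", 1), ("X..XX.", 1),
    ("X.X.XX", 1), ("XXX.X.", 1), ("..X.XX", 1), ("...X..", 1),
    ("..X...", 1),
]


def tally(rows):
    """Return {test name: (number of studies, number of combinations)}."""
    result = {}
    for col, name in enumerate(TESTS):
        used = [n for pattern, n in rows if pattern[col] == "X"]
        result[name] = (sum(used), len(used))
    return result


totals = tally(ROWS)
total_studies = sum(n for _, n in ROWS)

for name, (studies, combos) in totals.items():
    print(f"{name}: {studies} studies, {combos} combinations")
print("total studies:", total_studies)
```

Under the assumed column order, the tallies agree with the study and combination counts reported in the text, and the study counts across the 21 combinations sum to 60.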
Explanation of identified tests, facets of validity reflected and suggested interpretation criteria
Details regarding the identified tests, interpretation criteria and facets of validity reflected are as follows (the interpretation criteria themselves are presented in Table 1 and are not repeated in the text):
Correlation coefficients (Pearson, Spearman or intraclass) are widely used in validation studies and measure the strength and direction of the association between the two different measurements at individual level [8, 14-69]. They do not, however, measure the level of agreement between the two methods. In cases where more than one questionnaire is used, for instance multiple weighed records or 24-hour recalls, de-attenuated correlation coefficients can be used to adjust for day-to-day variation [32]. Correlation coefficient values can range between −1 (perfect negative correlation) and 1 (perfect positive correlation), with a coefficient of zero reflecting no linear relationship between the two measurements [70]. Because correlation coefficients do not provide any insight into the level of agreement between two measurements [8, 71, 72], it is not appropriate to use these tests as the sole determinant of validity [73].
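The distinction between association and agreement can be illustrated with a small sketch. The intake values are hypothetical and chosen so that the test method reports exactly 1.5 times the reference value for every participant; the `pearson` helper is written out here for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical energy intakes (kJ); not taken from the test data set.
reference = [8000.0, 9500.0, 11000.0, 12500.0, 14000.0]
test = [1.5 * r for r in reference]

r = pearson(test, reference)
mean_diff = sum(t - ref for t, ref in zip(test, reference)) / len(reference)

print("correlation:", round(r, 3))
print("mean difference (test - reference):", mean_diff)
```

In this constructed case the correlation is perfect even though the test method overestimates every participant's intake by 50%, which is why a correlation coefficient on its own says nothing about agreement.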
The paired t-test or Wilcoxon signed rank test reflects agreement between two measures at group level [74, 75]. Assessment of the mean percent difference between the reference and test measure reflects agreement at group level (size and direction of error at group level) [76, 77]. To calculate the mean percentage difference, the reference value is subtracted from the test value, and the result is divided by the reference value and multiplied by 100 for each participant [74, 75]. The mean percentage difference is then calculated for the total sample.
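The calculation described above can be sketched as follows. The function name and the protein intake values are hypothetical and serve only to illustrate the arithmetic.

```python
def mean_percent_difference(test, reference):
    """Mean over participants of (test - reference) / reference * 100."""
    per_participant = [
        (t - r) / r * 100.0 for t, r in zip(test, reference)
    ]
    return sum(per_participant) / len(per_participant)


# Hypothetical protein intakes (g) for four participants.
reference = [80.0, 100.0, 50.0, 40.0]
test = [72.0, 90.0, 45.0, 44.0]

mpd = mean_percent_difference(test, reference)
print("mean percent difference:", round(mpd, 1))
```

Here three participants are underestimated by 10% and one is overestimated by 10%, so the sample mean is about −5%. Because positive and negative individual errors cancel, the statistic describes the size and direction of error for the group and not for any individual.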
Cross-classification of participants for both the test and reference methods into categories, usually according to tertiles, quartiles or quintiles depending on the sample size, allows calculation of the percentage of participants correctly classified in the same category and the percentage misclassified in the opposite category [2, 8, 78]. Accurate classification is important because it indicates to what extent the dietary intake assessment method is able to rank participants correctly; it therefore reflects agreement at individual level [79]. Ranking of dietary intake data is especially important in the investigation of diet-disease associations [8, 80]. However, cross-classification of data is limited in that the percentage of agreement includes chance agreement [8].
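A minimal sketch of cross-classification into tertiles is given below. It assumes rank-based tertile assignment with no tied values, and the intake values are hypothetical; the helper names are illustrative and not taken from any of the reviewed studies.

```python
def tertiles(values):
    """Assign each value to tertile 0, 1 or 2 by rank (assumes no ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    groups = [0] * n
    for rank, i in enumerate(order):
        groups[i] = rank * 3 // n
    return groups


def cross_classify(test, reference):
    """Return (% classified in same tertile, % in opposite tertile)."""
    t_groups, r_groups = tertiles(test), tertiles(reference)
    n = len(test)
    same = sum(1 for a, b in zip(t_groups, r_groups) if a == b)
    opposite = sum(1 for a, b in zip(t_groups, r_groups) if abs(a - b) == 2)
    return 100.0 * same / n, 100.0 * opposite / n


# Hypothetical intakes for six participants.
reference = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
test = [12.0, 55.0, 33.0, 25.0, 48.0, 15.0]

same, opposite = cross_classify(test, reference)
print("same tertile (%):", round(same, 1))
print("opposite tertile (%):", round(opposite, 1))
```

In this example four of six participants fall in the same tertile and two in the opposite tertile, so the same data would meet the ≥50% criterion for the same tertile while failing the ≤10% criterion for the opposite tertile. The two cross-classification criteria are therefore interpreted separately.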
The weighted Kappa coefficient is typically used for data that are ranked into categories or groups and excludes chance agreement [2, 8, 10]. The magnitude of the weighted Kappa coefficient is mostly determined by factors such as the weighting applied and the number of categories included in the scale [80]. Weighted Kappa coefficient values range from −1 to 1, with values between 0 and 1 generally being expected [81]. Values of zero or close to zero can be considered as an indication of agreement “no more than pure chance”, while negative values indicate agreement “worse” than can be expected by chance alone [80]. The weighting of the Kappa coefficient depends on the number of categories or groups; for instance, if there are three categories, a score of 1 is allocated to participants in the same group, 0.5 to those in adjacent groups and 0 to those in opposite groups [80]. The unweighted Kappa coefficient, by contrast, does not take into account the degree of disagreement between methods, and all disagreement is treated equally as total disagreement. It also does not indicate whether agreement or lack thereof is because of a systematic difference between the two methods, or because of random differences (error because of chance) [80].
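The three-category weighting described above (1, 0.5 and 0) can be sketched as a linearly weighted Kappa. The weighting formula w = 1 − |i − j| / (k − 1) is one common choice and is an assumption here, not necessarily the scheme used in the reviewed studies; the tertile assignments are hypothetical.

```python
def weighted_kappa(test_groups, ref_groups, k=3):
    """Linearly weighted Kappa for categories coded 0..k-1."""
    n = len(test_groups)
    # Agreement weights: 1 on the diagonal, 0 for the most distant pair.
    # For k = 3 this gives 1 (same), 0.5 (adjacent) and 0 (opposite).
    w = [[1.0 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]

    counts = [[0] * k for _ in range(k)]
    for a, b in zip(test_groups, ref_groups):
        counts[a][b] += 1

    row = [sum(counts[i]) / n for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]

    observed = sum(
        w[i][j] * counts[i][j] / n for i in range(k) for j in range(k)
    )
    expected = sum(
        w[i][j] * row[i] * col[j] for i in range(k) for j in range(k)
    )
    return (observed - expected) / (1.0 - expected)


# Hypothetical tertile assignments (0 = lowest, 2 = highest) for six participants.
test_groups = [0, 2, 1, 1, 2, 0]
ref_groups = [0, 0, 1, 1, 2, 2]

kappa = weighted_kappa(test_groups, ref_groups)
print("weighted Kappa:", round(kappa, 2))
```

For these assignments the weighted observed agreement is 4/6 and the weighted chance agreement is 5/9, giving a coefficient of 0.25. Identical assignments for both methods give 1.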
Bland-Altman analysis reflects the presence, direction and extent of bias, as well as the level of agreement between two measures at group level [10]. Spearman correlation coefficients are calculated between the mean of the two methods and the difference between the two methods to establish the association between the size of the error and the mean of the two methods, which reflects the presence and direction of proportional bias [8, 10, 72, 82]. If proportional bias is present, i.e. the difference grows in one direction as the mean intake becomes larger, the Spearman rank correlation coefficient between the mean intakes and the difference between intakes will be significant [72].
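The check for proportional bias can be sketched as a Spearman rank correlation between each participant's mean of the two methods and the corresponding difference. The sketch below uses the no-ties formula for Spearman's rho and omits the significance test; the intake values are hypothetical and constructed so that the error grows with intake.

```python
def ranks(values):
    """Ranks 1..n in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties assumed."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical intakes: the overestimate increases as intake increases.
reference = [100.0, 200.0, 300.0, 400.0, 500.0]
test = [105.0, 220.0, 345.0, 480.0, 625.0]

means = [(t + r) / 2.0 for t, r in zip(test, reference)]
diffs = [t - r for t, r in zip(test, reference)]

rho = spearman(means, diffs)
print("Spearman rho (mean vs difference):", rho)
```

A coefficient near zero would suggest that the size of the error does not depend on the level of intake, whereas a coefficient that differs significantly from zero points to proportional bias in the direction of its sign.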
Bland-Altman analysis includes plotting the difference between the measurements (test − reference measure) (y-axis) against the mean of the two measures [(test measure + reference measure) / 2] (x-axis) for each subject to illustrate the magnitude of disagreement and to identify outliers and trends in bias [8, 72, 76, 83]. The LOA (95% confidence limits of the normal distribution) are calculated as the mean difference ± 1.96 SD [72, 84] and reflect over- and underestimation of estimates [72]. It is important to note that Bland and Altman [83] indicated that “the decision about what is acceptable agreement is a clinical one; statistics alone cannot answer the question.”
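The LOA calculation can be sketched as follows, using the sample standard deviation of the individual differences. The function name and intake values are hypothetical, and the sketch covers only the numerical summary, not the plot.

```python
import statistics


def limits_of_agreement(test, reference):
    """Return (mean difference, lower LOA, upper LOA, % of points within LOA)."""
    diffs = [t - r for t, r in zip(test, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    within = sum(1 for d in diffs if lower <= d <= upper)
    return bias, lower, upper, 100.0 * within / len(diffs)


# Hypothetical intakes for six participants.
reference = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
test = [48.0, 63.0, 66.0, 85.0, 88.0, 104.0]

bias, lower, upper, pct = limits_of_agreement(test, reference)
print("mean difference:", round(bias, 2))
print("LOA:", round(lower, 2), "to", round(upper, 2))
print("% of differences within LOA:", pct)
```

The width of the interval (upper minus lower) is the quantity that has to be judged against what is clinically acceptable for the nutrient in question; the calculation itself gives no indication of whether that width is acceptable.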
Illustration of the application of identified statistical tests and interpretation criteria using a test data set
The mean (SD) and median (IQ range) estimates for energy and nutrient intakes derived from the test data set are presented in Table 3 (not alluded to in the discussion section). Key outcomes of the application of the six statistical tests and seven interpretation criteria (two for cross-classification) for the assessment of the relative validity of these variables are as follows (Table 4):
Table 3
Mean (SD) and median (IQ range) estimates for energy and select nutrient intakes derived from the test data set
Energy (kJ) | 12881 | 10093 | 12463 (5854) | 12475 (7686–17091) | 13819 (3677) | 13661 (11003–16225) | −1356 (6025) | −9192 - 8602 | −1642 (−6439 - 2830) |
Protein (g) | 56 | 46 | 67.3 (34.0) | 66.0 (44.7-91.3) | 84.9 (26.2) | 81.6 (68.8-99.5) | −17.6 (38.5) | −75.5-48.0 | −19.3 (−42.4 - 4.2) |
Fat (g) | 118 | 93 | 69.5 (46.9) | 54.8 (38.3 – 86.5) | 83.9 (31.5) | 82.8 (62.7-102.5) | −14.4 (55.5) | −83.3-62.1 | −12.8 (−54.0-16.9) |
Carbohydrates (g) | 130 | 130 | 475.7 (222.3) | 475.7 (289.1-618.3) | 494.1 (148.6) | 487.7 (397.3-572.0) | −18.5 (215.4) | −326.2-325.5 | −16.3 (−153.1-140.3) |
Folate (mcg) | 400 | 400 | 558.0 (355.2) | 454.0 (331.0-788.0) | 419.6 (235.7) | 405.8 (247.5-552.7) | 138.4 (393.7) | −401.6-842.9 | 108.6 (0.9-305.0) |
Vitamin A (mcg) | 900 | 700 | 192.6 (231.1) | 92.0 (51.0-242.0) | 346.7 (276.9) | 277.0 (171.1-402.4) | −154.2 (360.9) | −895.2-373.9 | −105.2 (−278.8-5.1) |
Iron (mg) | 8 | 18 | 12.1 (7.1) | 12.3 (6.5-15.7) | 16.1 (5.5) | 15.4 (11.4-0.2) | −6.0 (8.2) | −18.9 – 10.5 | −6.2 (−11.1- -1.7) |
Table 4
Statistical test outcomes and interpretation for energy and nutrient intakes derived from the test data set
Nutrient | Spearman correlation coefficient | Wilcoxon signed rank test | Percent difference (%) | Cross-classification: % in same tertile | Cross-classification: % in opposite tertile | Weighted Kappa coefficient | Bland Altman analysis (correlation between mean and difference) |
Energy (kJ) | 0.26 | P > 0.05 | −9.8 | 46.8 | 19.2 | 0.20 | P < 0.001 |
Validity interpretation | Acceptable | Good | Good | Poor | Poor | Acceptable | Biased |
Protein (g) | 0.23 | P < 0.01 | −19.1 | 42.6 | 23.4 | 0.12 | P < 0.05 |
Validity interpretation | Acceptable | Poor | Acceptable | Poor | Poor | Poor | Biased |
Fat (g) | 0.01 | P > 0.05 | −6.9 | 34.0 | 31.9 | −0.01 | P > 0.05 |
Validity interpretation | Poor | Good | Good | Poor | Poor | Poor | Not biased |
Carbohydrates (g) | 0.40 | P > 0.05 | −1.4 | 50.0 | 17.4 | 0.25 | P < 0.01 |
Validity interpretation | Acceptable | Good | Good | Good | Poor | Acceptable | Biased |
Folate (mcg) | 0.40 | P < 0.05 | 33.0 | 53.2 | 8.5 | 0.30 | P < 0.01 |
Validity interpretation | Acceptable | Poor | Poor | Good | Good | Acceptable | Biased |
Vitamin A (mcg) | 0.15 | P < 0.01 | −22.9 | 34.0 | 14.9 | 0.03 | P > 0.05 |
Validity interpretation | Poor | Poor | Poor | Poor | Poor | Poor | Not biased |
Iron (mg) | 0.38 | P > 0.05 | −24.8 | 51.1 | 23.4 | 0.29 | P < 0.01 |
Validity interpretation | Acceptable | Good | Poor | Good | Poor | Acceptable | Biased |
Total energy intake
Two interpretations showed good validity (Wilcoxon signed rank test and % difference), two showed acceptable validity (Spearman correlation and weighted Kappa coefficient) and three showed poor validity (cross-classification: % in same & opposite tertiles and Bland Altman analyses).
Total protein intake
No interpretation showed good validity, two showed acceptable validity (Spearman correlation and % difference) and five showed poor validity (Wilcoxon signed rank test, weighted Kappa coefficient, cross-classification: % in same & opposite tertiles and Bland Altman analyses).
Total fat intake
Three interpretations showed good validity (Wilcoxon signed rank test, % difference and Bland Altman) and four showed poor validity (Spearman correlation, cross-classification: % in same & opposite tertiles and weighted Kappa coefficient).
Total carbohydrate intake
Three interpretations showed good validity (Wilcoxon signed rank test, % difference and cross-classification: % in same tertile), two showed acceptable validity (Spearman correlation and weighted Kappa coefficient) and two showed poor validity (cross-classification: % in opposite tertile and Bland Altman analyses).
Folate intake
Two interpretations showed good validity (cross-classification: % in same & opposite tertiles), two showed acceptable validity (Spearman correlation and weighted Kappa coefficient) and three showed poor validity (Wilcoxon signed rank test, % difference and Bland Altman analyses).
Vitamin A intake
All interpretations showed poor validity with the exception of the Bland Altman analyses, which indicated that bias was not present.
Iron intake
Two interpretations showed good validity (Wilcoxon signed rank test and cross-classification: % in same tertile), two showed acceptable validity (Spearman correlation and weighted Kappa coefficient) and three showed poor validity (% difference, cross-classification: % in opposite tertile and Bland Altman analyses).
The width of the LOA for total energy, macronutrient and micronutrient intakes can most probably be interpreted as being wide when considered within the context of their respective DRIs. The percentage of data points within the LOA is above 95% for all nutrients, with the exception of vitamin A (89.4%) and total energy (93.6%) (data presented in the footnote to Table 4).