Discussion
In agreement studies with a sole focus on the difference between paired measurements, as in our study 1, the data are ideally displayed by means of Bland-Altman plots, possibly optimised by log-transforming the original data and accounting for heterogeneity and/or trends over the measurement scale [
10,
20]. In study 1, we observed the duality between Bland-Altman limits of agreement on the one hand and the corresponding RC on the other hand. Actually, various authors of recently published agreement studies defined the repeatability coefficient (or coefficient of repeatability) as 1.96 times the standard deviation of the paired differences [
21‐
25], which is algebraically the same as 2.77 times the within-subject standard deviation in simple settings such as our study 1 (the standard deviation of a paired difference equals √2 times the within-subject standard deviation, and 1.96 × √2 ≈ 2.77). Lodge et al. referred to the RC as 2.77 times the within-subject standard deviation [
26].
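The duality between the Bland-Altman limits of agreement and the RC can be illustrated numerically. The following sketch uses simulated, hypothetical paired readings (subject count, SUV range, and within-subject SD are illustrative assumptions, not values from our studies) and verifies that 1.96 times the standard deviation of the paired differences equals 2.77 times the within-subject standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired SUV-like readings: n subjects, two repeat
# measurements each, with a true within-subject SD of sigma_w
n, sigma_w = 40, 0.5
truth = rng.uniform(2.0, 8.0, n)
x1 = truth + rng.normal(0.0, sigma_w, n)
x2 = truth + rng.normal(0.0, sigma_w, n)
d = x1 - x2

# Bland-Altman 95% limits of agreement: mean difference +/- 1.96 * SD(d)
loa_low = d.mean() - 1.96 * d.std(ddof=1)
loa_high = d.mean() + 1.96 * d.std(ddof=1)

# Repeatability coefficient, two algebraically identical forms
# (assuming no systematic bias between the repeats):
rc_from_diffs = 1.96 * np.sqrt(np.mean(d ** 2))  # 1.96 * RMS of the differences
s_w = np.sqrt(np.mean(d ** 2) / 2.0)             # within-subject SD estimate
rc_from_sw = 1.96 * np.sqrt(2.0) * s_w           # = 2.77 * s_w

assert np.isclose(rc_from_diffs, rc_from_sw)     # the two forms coincide
```

With negligible bias, the limits of agreement are approximately ±RC around zero, which is the duality observed in study 1.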
Modelling a more complex situation, in which both fixed and random effects must be accounted for, naturally leads to a mixed effects model, as in our study 2. Here, we applied VCA in order to provide the relevant RCs. However, the estimation of both fixed effects and variance components was prone to large uncertainty, which was reflected in the widths of the respective 95 % CIs. In general, the estimation of variance components requires larger sample sizes than the estimation of fixed effects, since the former relates to second moments and the latter to first moments of random variables [
27]. How many observations suffice to demonstrate agreement?
An ad hoc literature search in PubMed (using the search term
((reproducibility OR repeatability OR agreement)) AND SUV) for the period 1st January 2013 to 30th June 2015 revealed 153 studies with sample sizes between eight [
28] and 252 [
29], where most studies included up to 40 patients. Despite the increased interest in the conduct of agreement and reliability studies over recent decades, investigations into sample size requirements remain scarce [
30,
31]. Carstensen reckoned that little information is gained in a method comparison study beyond the inclusion of 50 study subjects, using three repetitions [
20]. In the context of multivariable regression modelling, 10 to 20 observations should be available per continuous variable and per level of a categorical variable in order to fit a statistical model with sufficiently accurately estimated regression coefficients [
32‐
37]. Level refers here to a category of a categorical variable; for instance, the variable ‘time point’ in our study 2 had five levels, meaning five realised time points. The abovementioned rule of thumb can lead to large sample sizes in an agreement study, even when only a few explanatory fixed and random variables are involved. In our study 2, we had 10 levels in total (five time points, three observers, and two scanners), leading to 10 × 20 = 200 observations. Unfortunately, we could only gather around 150 observations due to slow patient accrual, but we learned that at least 20 observations should be employed per continuous variable and per level of a categorical variable in an agreement study in order to meet the challenge of estimating variance components sufficiently accurately. Note that subject-observer interaction, i.e., the extra variation in a subject due to a specific observer, can only be isolated when repeated measurements per observer are available [
16].
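The 20-observations-per-level heuristic applied to the design of study 2 can be sketched as follows (the level counts are those of study 2; the heuristic itself is the rule of thumb discussed above, not an exact sample-size calculation):

```python
# Rule-of-thumb sample-size check for the mixed-model design of study 2:
# at least 20 observations per level of each categorical variable
levels = {"time point": 5, "observer": 3, "scanner": 2}
per_level = 20

n_required = per_level * sum(levels.values())
print(n_required)  # 20 * (5 + 3 + 2) = 200
```

The roughly 150 observations actually gathered fall short of this target, which is consistent with the wide CIs observed for the variance components.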
We understand repeatability as an agreement and not a reliability assessment (see
Appendix), whereas the ICC is occasionally used as a repeatability assessment [
38,
39]. Since the ICC is heavily dependent on between-subject variation and may produce high values for heterogeneous patient groups [
30,
31], it should be used exclusively for the assessment of reliability. We hope to contribute to a more consistent usage of terms in the future, in line with the published guidelines for reporting reliability and agreement studies [
6]. Further, we reckon that the biggest challenge is most likely a clear understanding of exactly which question a researcher seeks to answer before undertaking an agreement or a reliability study. Guyatt, Walter, and Norman pointed out that reliability indices (like the ICC) are used for discriminative purposes, whereas agreement parameters (like the RC) are used for evaluative purposes [
40]. The former focuses on a test’s ability to divide patients into groups of interest despite measurement error, whereas the latter relies on the measurement error itself; with small measurement errors, it is possible to detect even small changes over time [
11].
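This dependence of the ICC on between-subject variation can be made concrete with the standard one-way formulas ICC = σ_b²/(σ_b² + σ_w²) and RC ≈ 2.77 σ_w. In the following sketch (the SD values are hypothetical, chosen purely for illustration), the ICC rises sharply for a more heterogeneous patient group while the RC, which depends only on the measurement error, is unchanged:

```python
import math

def icc_and_rc(sigma_b, sigma_w):
    """One-way random-effects ICC and repeatability coefficient
    from the between- and within-subject SDs (population values)."""
    icc = sigma_b ** 2 / (sigma_b ** 2 + sigma_w ** 2)
    rc = 1.96 * math.sqrt(2.0) * sigma_w  # ~2.77 * sigma_w
    return icc, rc

# Identical measurement error, but patient groups of different heterogeneity
icc_homog, rc_homog = icc_and_rc(sigma_b=0.5, sigma_w=0.5)  # homogeneous group
icc_heter, rc_heter = icc_and_rc(sigma_b=2.0, sigma_w=0.5)  # heterogeneous group

print(round(icc_homog, 2), round(icc_heter, 2))  # 0.5 0.94 - the ICC jumps
print(rc_homog == rc_heter)                      # True - the RC is unchanged
```

This is why a high ICC in a heterogeneous sample says little about the measurement error itself, and why the RC is the appropriate quantity for evaluative questions.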
Moreover, the choice of independent variables in the linear mixed effects model (i.e., the potential sources of variation in the data) and the decision to treat a factor as fixed or random is far from trivial and requires thorough planning in order to reflect the clinical situation in the most appropriate way. Is the assessment of inter-observer variability limited to only a few observers, because these are the only ones handling cases in daily routine (treating ‘observer’ as a fixed effect), or is each observer merely a representative of a pool of several potential observers (treating ‘observer’ as a random effect)? In the former case, every observer reads all scans; in the latter case, every observer assesses only a randomly assigned portion of all scans, which distributes the assessment work across several observers (thereby easing data collection) and increases generalisability.
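The fixed-versus-random distinction translates directly into how a mixed model is specified. The sketch below uses simulated, hypothetical data (subject count, observer shifts, and SDs are invented for illustration) with statsmodels; note that `MixedLM` specifies variance components within the grouping factor, so the ‘random observer’ model here captures observer variation nested within subject, and a fully crossed random observer effect would need a different tool:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 30 subjects, each read by the same three observers
rng = np.random.default_rng(1)
subj = np.repeat(np.arange(30), 3)
obs = np.tile([1, 2, 3], 30)
suv = (rng.normal(5.0, 1.0, 30)[subj]            # subject-to-subject variation
       + np.array([0.0, 0.2, -0.1])[obs - 1]     # small observer shifts
       + rng.normal(0.0, 0.3, 90))               # residual measurement error
df = pd.DataFrame({"suv": suv, "subject": subj, "observer": obs})

# Observer as FIXED effect: one coefficient per observer;
# conclusions apply to exactly these observers
m_fixed = smf.mixedlm("suv ~ C(observer)", df, groups="subject").fit()

# Observer as RANDOM effect: observers are viewed as draws from a pool,
# so a single observer variance component is estimated instead
m_random = smf.mixedlm("suv ~ 1", df, groups="subject",
                       vc_formula={"observer": "0 + C(observer)"}).fit()
```

The fixed-effect fit yields one coefficient per observer, whereas the random-effect fit yields a single observer variance component, mirroring the two clinical scenarios described above.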
Apart from factors like observer, time point, and scanner, FDG PET quantification itself is affected by technical (e.g. relative calibration between PET scanner and dose calibrator, paravenous administration of FDG), biological (e.g. blood glucose level, patient motion or breathing), and physical factors (e.g. scan acquisition parameters, ROI, blood glucose level correction) [
41]. In our studies, intra- and inter-observer agreement was assessed with respect to the post-imaging process; therefore, technical, biological, and most physical factors did not come into play, whereas the size and type of ROI used are observer-specific and, thus, cannot be modelled separately from the factor ‘observer’. When investigating day-to-day variation of the scans and when dealing with multi-centre trials, the PET procedure guideline [
5] should be adhered to in order to maintain the accuracy and precision of quantitative PET measurements as well as possible. The technical, biological, and physical factors discussed by Boellaard [
41] can, in principle, partly be included in a statistical model as explanatory variables; however, only those that justify the respective increase in sample size should be considered (see the discussion on appropriate sample sizes above).
The guidelines for reporting reliability and agreement studies [
6] include 15 issues to be addressed in order to improve the quality of reporting. Doing so can result in separate publications on agreement and/or reliability apart from the main study, as Kottner et al. put it [
6]:
“Studies may be conducted with the primary focus on reliability and agreement estimation itself or they may be a part of larger diagnostic accuracy studies, clinical trials, or epidemiological surveys. In the latter case, researchers report agreement and reliability as a quality control, either before the main study or by using data of the main study. Typically, results are reported in just a few sentences, and there is usually only limited space for reporting. Nevertheless, it seems desirable to address all issues listed in the following sections to allow data to be as useful as possible. Therefore, reliability and agreement estimates should be reported in another publication or reported as part of the main study.”