Quantitative evaluation of competing measurement devices, in cases where one device has not been shown to be superior, is a significant need in science generally [1, 2, 5–7], and particularly within the radiological sciences community. Specifically, this issue is encountered when comparing distinct positional verification methods for image-guided radiotherapy (IGRT) [34, 53–56]. The difficulty of assessing competing platforms is particularly vexing, as it impedes efforts at cross-platform comparison. Our group [24, 50] and others have implemented several distinct methods for presenting such analysis [25, 27, 33, 34, 45, 57]. Our previous efforts have utilized several extant method comparison statistical presentations (including Bland-Altman [7], Lin's concordance [58], and Deming orthogonal regression [59, 60]); however, what was gained in completeness was lost in parsimony. To this end, we sought to define an improved algorithm for practical comparison of distinct imaging methodologies, with a non-fixed number of repeated measurements per patient, in the absence of a "gold standard".

Often, inappropriate statistical analyses are implemented in lieu of formal method comparison statistics. The analysis of different measurement devices is not as straightforward as the initial observer may suppose. Bland and Altman demonstrated that mean comparison and linear regression are insufficient for comparing differing measurement techniques [1]. The Bland-Altman method is succinct and easily interpretable, making it a classic of the medical literature. In a series of seminal papers [1–7], Bland and Altman defined the standard methodology for comparing differing measurements, as well as establishing effective techniques for accounting for inter- and intra-method variability/repeatability. However, while the Bland-Altman methodology remains the current benchmark, it fails (by design, one should note) to generate a formalized p-value, instead recommending that a clinically meaningful difference between measures be utilized. Additionally, though repeatability estimation is a recommended component of accurate method comparison, the calculation for more than two replicates is somewhat unwieldy using the methodology proposed by Bland and Altman. Since many IGRT datasets span >30 repeated daily measures, a statistical methodology that can readily integrate large replicate numbers is desirable.

The COM3PARE methodology presented herein represents an attempt to integrate several desirable methodological attributes into a unified, readily performed statistical process. COM3PARE has several advantages over existing method comparison statistical analyses. Specifically, compared to general linear model (GLM)-based approaches [61, 62] (such as the t-test, linear regression, and ANOVA [63]), which fail to account for multiple sources of random variance, the linear mixed effects (LME)-based COM3PARE platform integrates variance estimation at multiple hierarchical levels (i.e., between- and within-measurement methods/subjects) [48]. From a practical point of view, this allows factor-wise assessment of the procedural or technical variability of each of the two methods rather than a combined assessment, so that the exact source of disagreement can be determined. COM3PARE is also resilient to uneven numbers of replicates per device, a feature of great practical utility in a clinical setting such as daily IGRT recording, where the number of IGRT fractions received by each patient may differ based on fractionation regimen or clinical exigency. Additionally, because COM3PARE can fit differences in said variability to a hypothesis-testing-friendly Bonferroni-corrected p-value output while still implementing clinician-determined thresholds for agreement, there is greater interpretability of statistical output with no loss of clinical relevance. For instance, one could specify a priori that measurement differentials >1 mm would represent a lack of interchangeability globally.

Data presentation in this study was performed to illustrate potential applications of COM3PARE for replicated image-based measurements of the kind frequently encountered in radiation oncology. The specific dataset included has been previously presented using standard method comparison approaches. By revisiting these data using COM3PARE, we hope to illustrate the implementation of what we perceive to be a more usable and parsimonious approach to conceptualizing method comparison for IGRT applications, expanding upon, rather than obviating, the previous work. With regard to the specific dataset presented herein, our analysis points to the difficulties possible when comparing IGRT platforms. For instance, having set our criteria pre-analysis, we were surprised to note that differing measurement methods proved preferable in distinct axes (e.g., CBCT in the X-axis, kV X-ray in the Z-axis), while appearing by said criteria interchangeable in the Y-axis. A possible explanation for this phenomenon may lie in the imaging methodologies themselves. For CBCT, before three-dimensional reconstruction, data are acquired as axial slices (X-axis), while, prior to DRR referencing, the kV X-ray system uses orthogonal projections at oblique angles, parallel to the superoinferior plane (Z-axis). Consequently, the intra-subject repeatability of a method may be tied to the reference plane of image acquisition, though this remains conjecture based on a single dataset.

To our knowledge, this technique represents the first formal hypothesis testing approach to integrate inter-method bias, inter-subject variability, and intra-subject variability of two methods with any number of replicated measurements for image-guided radiotherapy. Based on the conceptual schema presented in the "Methods" section, we propose that the following criteria be formally evaluated as features of future image-guided radiotherapy measurement comparison studies comparing two imaging platforms, where multiple repeated observations on the same subject are possible. To meet our criteria for interchangeability [48]:
1. The bias and overall agreement must fall within a pre-specified range (e.g., bias/agreement of <0.1 cm between IGRT devices).
2. There should be no statistically significant difference (at a pre-specified threshold, e.g., p < 0.05) in the inter-subject variability of the two methods.
3. There should be no statistically significant difference in the intra-subject variability (i.e., repeatability) of the two methods.
4. In cases where criteria 2 and 3 are NOT met, the preferred IGRT technique is the one exhibiting the lower intra-subject variability (i.e., greater repeatability).
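The criteria above can be expressed as a simple per-axis decision rule. The sketch below is illustrative only: the function name `assess_interchangeability` and its inputs (a bias estimate and p-values assumed to come from an LME-based variance comparison of the two methods) are hypothetical stand-ins for the outputs of a COM3PARE-style analysis, not part of the published method, and criterion 1 is simplified here to a single bias threshold.

```python
# Illustrative sketch (not the authors' implementation): the four
# interchangeability criteria encoded as a decision rule. Inputs are
# assumed to come from a linear mixed-effects analysis of two IGRT methods.

def assess_interchangeability(bias_cm, p_inter, p_intra,
                              intra_sd_a, intra_sd_b,
                              bias_limit_cm=0.1, alpha=0.05):
    """Return a verdict for one measurement axis.

    bias_cm      -- estimated inter-method bias (cm)
    p_inter      -- p-value for the difference in inter-subject variability
    p_intra      -- p-value for the difference in intra-subject variability
    intra_sd_a/b -- intra-subject SDs (repeatability) of methods A and B
    """
    # Criterion 1 (simplified): bias within the pre-specified clinical range.
    if abs(bias_cm) >= bias_limit_cm:
        return "not interchangeable: bias exceeds threshold"
    # Criteria 2 and 3: no significant difference in either variance component.
    if p_inter >= alpha and p_intra >= alpha:
        return "interchangeable"
    # Criterion 4: otherwise prefer the more repeatable method.
    return "prefer A" if intra_sd_a < intra_sd_b else "prefer B"
```

Applied separately to each axis, a rule of this form would reproduce the axis-specific pattern described above, where one method is preferred in one axis while the methods remain interchangeable in another.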