Bayesian framework
Let \(h=1,\ldots,M\), \(i=1,\ldots,N_{h}\), and \(j=1,\ldots,R\) be our indices for the study, the patient-physician encounter, and the rater, respectively. In the OPTION\(^{5}\) analysis the number of studies is \(M=3\), the numbers of encounters within the three studies are \((N_{1}=201, N_{2}=72, N_{3}=38)\), and the number of raters is \(R=2\), although the methodology applies to all values of these. Let \(Y\) denote the OPTION\(^{5}\) score divided by 100 (for ease of interpretation), \(\theta\) the true amount of shared decision making (SDM), and \(X\) indicate the use of a PDA. Although the effect of \(X\) is of interest to this field, our objective is to adjust for its effect so as to ensure that the evaluations of ICC are meaningful. Our statistical model is
$$ Y_{hij}|\theta_{hi},X_{hi}\sim\text{ Normal}\left(\mu_{hij},v_{hi}^{2}\right)I(0,1) $$
(3)
where
$$ I\{y_{hij}\in(0,1)\} $$
(4)
restricts the probability distribution of the measured amount of SDM to the interval 0 to 1, and
$$\begin{array}{*{20}l} \mu_{hij}&=\theta_{hi}+\beta_{1}(j-1.5)+\beta_{2}(X_{hi}-\bar{X}) \end{array} $$
(5)
$$\begin{array}{*{20}l} v^{2}_{hi}&=\sigma_{h}^{2}\theta_{hi}(1-\theta_{hi}) \end{array} $$
(6)
with \(\bar{X}\) denoting the sample mean value of \(X\). The dependence of \(v_{hi}^{2}\) on \(\theta_{hi}\) implies that the ICC depends on the true amount of SDM in the encounter; its mathematical expression is referred to as a variance function.
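As an illustration of Eqs. (5) and (6), the following minimal Python sketch evaluates the mean and the variance function for a single rating. The function name and all numerical values are hypothetical and were chosen only for illustration; they are not estimates from the case study.

```python
def rating_mean_and_variance(theta, rater, x, x_bar, beta1, beta2, sigma2):
    """Evaluate Eqs. (5)-(6) for one rating.

    theta  : true amount of SDM for the encounter, in (0, 1)
    rater  : rater index j (1 or 2 when R = 2)
    x      : PDA indicator for the encounter (0 or 1)
    x_bar  : sample mean of the PDA indicator
    """
    mu = theta + beta1 * (rater - 1.5) + beta2 * (x - x_bar)  # Eq. (5)
    v2 = sigma2 * theta * (1.0 - theta)                       # Eq. (6)
    return mu, v2


# Hypothetical parameter values, for illustration only.
for theta in (0.1, 0.5, 0.9):
    mu, v2 = rating_mean_and_variance(theta, rater=1, x=1, x_bar=0.5,
                                      beta1=0.02, beta2=0.1, sigma2=0.04)
    print(f"theta={theta:.1f}  mu={mu:.3f}  v2={v2:.4f}")
```

Under these assumed values the variance function is largest at \(\theta=0.5\) and symmetric about it, which is the behaviour the variance function is intended to capture.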
We view the encounters as a random sample from a large population of possible encounters about which we wish to make inferences. The sampling of the encounters and the sampling variability in them is represented mathematically as
$$ \theta_{hi} \vert \text{study}\sim \text{ Normal}\left(\gamma_{h}, \tau_{h}^{2}\right) I(0,1) $$
(7)
where \(I\), defined in (4), restricts the possible amount of SDM to be a proportion (0 to 1). The specification for \(\theta_{hi}\) depends on parameters which are indexed by \(h\), giving each study its own mean and variance. We set our prior distributions as follows:
$$\begin{array}{*{20}l} \gamma_{h}&\sim\text{ Normal}\left(\beta_{0},\omega^{2}\right)\\ \beta_{k}&\sim\text{ Normal}\left(b_{0}I(k=0),B^{2}\right),\hspace{0.5 cm} k=0,1,2\\ \sigma_{h}^{-2}&\sim\text{ Gamma}(v_{1},v_{1})\\ \tau_{h}^{-2}&\sim \text{ Gamma}(v_{2},v_{2})\\ \omega^{-2}&\sim \text{ Gamma}(v_{3},v_{3}) \end{array} $$
The choice of normal and gamma distributions for the regression (mean or location) and the variance (scale) parameters is common in practice as the conditional posterior distributions of each parameter conditional on the remaining parameters and the data are also normal and gamma distributions. This simplifies model estimation and computation.
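The hierarchy in Eqs. (3) to (7) can also be read as a generative recipe for the data. The sketch below simulates one study under that reading. It is an illustrative assumption, not the estimation procedure used in the paper: the function names are hypothetical, the parameter values are made up, the PDA indicator is drawn as a fair coin flip, and truncation is handled by simple rejection sampling.

```python
import random


def truncated_normal(rng, mean, sd, low=0.0, high=1.0):
    """Draw from Normal(mean, sd^2) restricted to (low, high) by rejection.

    Rejection is simple but can be slow when `mean` lies far outside
    (low, high) relative to `sd`.
    """
    while True:
        draw = rng.gauss(mean, sd)
        if low < draw < high:
            return draw


def simulate_study(rng, n_encounters, gamma_h, tau_h, sigma_h,
                   beta1=0.0, beta2=0.0, n_raters=2):
    """Simulate (theta, x, scores) for each encounter of one study."""
    # Hypothetical design: PDA use assigned by a fair coin flip.
    x = [rng.randint(0, 1) for _ in range(n_encounters)]
    x_bar = sum(x) / n_encounters
    data = []
    for i in range(n_encounters):
        theta = truncated_normal(rng, gamma_h, tau_h)              # Eq. (7)
        sd = sigma_h * (theta * (1.0 - theta)) ** 0.5              # Eq. (6)
        scores = []
        for j in range(1, n_raters + 1):
            mu = theta + beta1 * (j - 1.5) + beta2 * (x[i] - x_bar)  # Eq. (5)
            scores.append(truncated_normal(rng, mu, sd))           # Eq. (3)
        data.append((theta, x[i], scores))
    return data


# Hypothetical parameter values, for illustration only.
rng = random.Random(1)
for theta, x_i, scores in simulate_study(rng, n_encounters=5, gamma_h=0.4,
                                         tau_h=0.1, sigma_h=0.2,
                                         beta1=0.02, beta2=0.05):
    print(f"theta={theta:.3f}  x={x_i}  scores={[round(y, 3) for y in scores]}")
```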
The desire for the prior distribution to impart virtually no information onto the analysis is accomplished by specifying distributions with very large variances for the model parameters. As a consequence, the data are solely responsible for estimating the model. In this application we set \(b_{0}=0.4\), \(B^{2}=10\), and \(v_{l}=10^{-3}\) for \(l=1,2,3\). Note that parameters such as \(\theta_{hi}\) that have restricted ranges may be assigned prior distributions with almost no mass within the allowable range if the density is not truncated. If the allowed range is a region over which the unrestricted distribution is essentially flat, then the truncated distribution will be close to uniform, essentially assuming that all allowable values of the parameter are equally likely. As well, parameters may have values such that the mean of the unrestricted distribution is outside the allowed range, and the truncated distribution will still be well-defined. Although the inverse-Gamma prior distributions assumed here for the variance parameters have been shown to yield undesirable results in some applications [33], we found that they were well suited to our case study in the sense that the results were quite robust to the prior distribution parameters. For example, the results with \(v_{l}=10^{-2}\) for \(l=1,2,3\) were numerically almost identical to those with \(v_{l}=10^{-3}\) for \(l=1,2,3\). We attribute this result to the fact that in our case study the scale of the data has a finite range, which prevents the tails of a prior distribution from having a substantial impact on the posterior.
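To make the notion of a diffuse prior concrete, the sketch below computes two quantities under stated assumptions. First, it assumes that Gamma\((v,v)\) is written in the shape-rate parameterization (the text does not state the parameterization), in which case the precision has prior mean 1 and prior variance \(1/v\). Second, it checks how flat a Normal\((b_{0},B^{2})\) density is over the interval (0, 1), using \(b_{0}=0.4\) and \(B^{2}=10\) purely as convenient illustrative values for a diffuse distribution that is then truncated.

```python
import math


def gamma_mean_var(shape, rate):
    """Mean and variance of a Gamma distribution (shape-rate form, assumed)."""
    return shape / rate, shape / rate ** 2


def normal_pdf(x, mean, var):
    """Density of an untruncated Normal(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Prior spread of the precision parameters for the two values of v compared.
for v in (1e-3, 1e-2):
    mean, var = gamma_mean_var(v, v)
    print(f"v={v:g}: prior mean={mean:g}, prior variance={var:g}")

# Flatness of a diffuse normal density over the allowed range (0, 1):
# ratio of the smallest to the largest density value on a grid.
b0, B2 = 0.4, 10.0
grid = [k / 100 for k in range(101)]
dens = [normal_pdf(x, b0, B2) for x in grid]
flatness = min(dens) / max(dens)
print(f"min/max density ratio on [0, 1]: {flatness:.3f}")
```

A ratio close to 1 indicates that the truncated density is nearly uniform on (0, 1), which is the situation described in the paragraph above.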
Under the above model, the ICC for an encounter in study \(h\) with SDM of \(\theta^{*}\) is given by
$$ ICC_{h}(\theta^{*})=\frac{\tau_{h}^{2}}{\tau_{h}^{2}+\sigma_{h}^{2}\theta^{*}(1-\theta^{*})} $$
(8)
Two salient features are evident in Eq. (8). Firstly, the within (\(\sigma^{2}\)) and between (\(\tau^{2}\)) encounter variance and scale parameters depend on the index for study. Therefore, the ICC is study specific. Secondly, the within-encounter scale parameter is multiplied by \(\theta^{*}(1-\theta^{*})\), which crucially allows for the ability of raters to agree, or rate consistently, to depend on the actual amount of SDM. Because it is easier to distinguish cases against a baseline level of a trait close to 0% or 100% than cases in which the trait is about 50% present (this is seen from the fact that the variability of a restricted scale is greatest around its middle point), the involvement of the binomial variance form \(\theta^{*}(1-\theta^{*})\) makes intuitive sense.
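The dependence of Eq. (8) on the level of SDM can be seen in a small numerical sketch. The function name and the values of \(\tau_{h}^{2}\) and \(\sigma_{h}^{2}\) below are hypothetical, chosen only to show the shape of the relationship.

```python
def icc(theta_star, tau2, sigma2):
    """Study-specific ICC at SDM level theta_star, following Eq. (8)."""
    within = sigma2 * theta_star * (1.0 - theta_star)
    return tau2 / (tau2 + within)


# Hypothetical variance parameters for one study, for illustration only.
tau2, sigma2 = 0.01, 0.04
for theta_star in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"theta*={theta_star:.1f}  ICC={icc(theta_star, tau2, sigma2):.3f}")
```

With these assumed values the ICC is lowest at \(\theta^{*}=0.5\), where the within-encounter variance \(\sigma_{h}^{2}\theta^{*}(1-\theta^{*})\) is largest, and rises symmetrically toward the ends of the scale.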
In practice, one may choose a value of \(\theta^{*}\) that has particular meaning or relevance to the application at which to compute the ICC. If multiple values of \(\theta^{*}\) are important (e.g., the baseline levels for various population strata), a separate ICC can be reported for each of them. Alternatively, or additionally, one may also choose to average over a population of values of \(\theta^{*}\). For example, if we expect the population of patient-physician encounters on which the instrument will be applied to be described by the probability distribution,
$$\begin{array}{*{20}l} \theta^{*} &\sim \pi(\theta^{*}) = \text{ Normal}\left(\gamma_{h}, \tau_{h}^{2}\right)I(0,1), \end{array} $$
it follows that the population average ICC, given by
$$ {ICC}_{h}=\tau_{h}^{2} \int_{0}^{1} \left(\tau_{h}^{2} + \sigma_{h}^{2}\theta^{*}(1-\theta^{*})\right)^{-1}\pi(\theta^{*})\, d\theta^{*}, $$
(9)
should be computed. The evaluation of multiple measures of ICC yields a much more informative profile of an instrument’s performance than the presently used single number summary derived under overly restrictive assumptions. This function is designed in such a way that the user directly specifies a distribution for \(\theta^{*}\) to maintain flexibility in the calculation of the ICC. This distribution can be specified with known parameters to avoid integration over the hyperparameters \(\gamma_{h}\) and \(\tau_{h}\) for simplicity. Alternatively, the user could assume a hierarchical prior where integration over these parameters would also be necessary.
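For the simpler case of known parameters, the integral in Eq. (9) is one-dimensional and can be approximated numerically. The sketch below uses a midpoint rule with the truncated normal density for \(\theta^{*}\). The function names, the choice of quadrature rule, and the parameter values are all assumptions made for illustration; in a full Bayesian analysis the same calculation would be repeated over posterior draws of the parameters.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd^2) truncated to the interval (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    z = (x - mean) / sd
    mass = normal_cdf((1.0 - mean) / sd) - normal_cdf((0.0 - mean) / sd)
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi) * mass)


def population_average_icc(gamma, tau2, sigma2, n=2000):
    """Midpoint-rule approximation to the population-average ICC, Eq. (9)."""
    sd = math.sqrt(tau2)
    step = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * step
        icc_t = tau2 / (tau2 + sigma2 * t * (1.0 - t))  # Eq. (8) at theta* = t
        total += icc_t * truncated_normal_pdf(t, gamma, sd) * step
    return total


# Hypothetical known parameters for one study, for illustration only.
print(round(population_average_icc(gamma=0.3, tau2=0.01, sigma2=0.04), 3))
```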
The ICC can also be defined for a scenario where encounters are pooled across studies. Assuming an equal probability of selecting an encounter from each study, the marginal variance across these encounters is \(\omega^{2} + \bar{\tau}^{2} + \bar{\sigma}^{2}\theta^{*}(1-\theta^{*})\) (a more general expression may be substituted if the study selection probabilities are unequal). Hence, the corresponding measure of ICC is given by
$$ {ICC}_{\text{Marg}}(\theta^{*})=\frac{\omega^{2}+\bar{\tau}^{2}}{\omega^{2}+\bar{\tau}^{2}+\bar{\sigma}^{2}\theta^{*}(1-\theta^{*})} $$
(10)
Typically, one would see
$$ {ICC}_{\text{Marg}}(\theta^{*}) \geq \overline{ICC} (\theta^{*}) $$
(11)
Although the pooled or marginal ICC is well-defined under a specified model for sampling encounters from the individual studies, it can be misleading if the intended use of the instrument is to compare encounters across a homogeneous population of subjects (e.g., the reference population for a single study). In that setting, \({ICC}_{\text{Marg}}(\theta^{*})\) makes the instrument look better in a meaningless way, because it overstates the heterogeneity between subjects relative to the heterogeneity in the population that the instrument will actually be used to compare or discriminate between in practice.
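A small numerical sketch may help to compare Eq. (10) with the study-specific ICCs. It assumes that \(\bar{\tau}^{2}\) and \(\bar{\sigma}^{2}\) are the arithmetic means of the study-specific parameters and that \(\overline{ICC}(\theta^{*})\) is the arithmetic mean of the \({ICC}_{h}(\theta^{*})\); both readings follow from the equal-selection-probability assumption but are not spelled out in the text. All parameter values and function names are hypothetical.

```python
def icc(theta_star, tau2, sigma2):
    """Study-specific ICC, Eq. (8)."""
    return tau2 / (tau2 + sigma2 * theta_star * (1.0 - theta_star))


def marginal_icc(theta_star, omega2, tau2_by_study, sigma2_by_study):
    """Pooled (marginal) ICC, Eq. (10), with equal study weights (assumed)."""
    tau2_bar = sum(tau2_by_study) / len(tau2_by_study)
    sigma2_bar = sum(sigma2_by_study) / len(sigma2_by_study)
    between = omega2 + tau2_bar
    return between / (between + sigma2_bar * theta_star * (1.0 - theta_star))


def mean_study_icc(theta_star, tau2_by_study, sigma2_by_study):
    """Arithmetic mean of the study-specific ICCs (assumed meaning of ICC-bar)."""
    values = [icc(theta_star, t2, s2)
              for t2, s2 in zip(tau2_by_study, sigma2_by_study)]
    return sum(values) / len(values)


# Hypothetical parameters for three studies, for illustration only.
omega2 = 0.005
tau2_by_study = [0.010, 0.012, 0.008]
sigma2_by_study = [0.04, 0.05, 0.03]

theta_star = 0.5
print("marginal ICC:  ", round(marginal_icc(theta_star, omega2, tau2_by_study, sigma2_by_study), 3))
print("mean study ICC:", round(mean_study_icc(theta_star, tau2_by_study, sigma2_by_study), 3))
```

In this made-up example the marginal ICC exceeds the mean of the study-specific ICCs because the between-study variance \(\omega^{2}\) is added to the numerator, in line with inequality (11).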
Summarizing the above, the three forms of ICC are seen to be components of a two-dimensional family of measures of ICC defined under the full statistical model we developed to account for the intricacies of the data. The dimensions are: 1) whether or not the ICC is specific to a particular level of the quantity being studied versus averaging over a distribution of values of that quantity; 2) whether or not variability between studies is included in the between encounter variance (which corresponds to whether or not it is desired for the instrument to discriminate between encounters from different studies in practice). Combining these two dimensions, there are four general types of ICC that are available under the general approach we have proposed.
Case study design
Our main study of interest can be found in [30]. Data were collected from two previous trials, the Chest Pain Choice trial (Study 1) and the Osteoporosis Choice Randomized trial (Studies 2 and 3) [37, 38]. Both trials randomly assigned patients to receive either an intervention involving use of a Personal Decision Aid (PDA) or usual care [37, 38]. The Osteoporosis Choice Randomized trial contains a subgroup of participants who used the World Health Organization’s Fracture Risk Assessment Tool (FRAX®) [38]. For the purposes of our analysis, we consider patients who used FRAX® as a separate study group (Study 3). The Chest Pain Choice trial recruited participants from St. Mary’s Hospital Mayo Clinic in Rochester, MN, while the Osteoporosis Choice Randomized trial recruited from 10 general care and primary care practices in the Rochester, MN area [37, 38].
The patient-clinician encounters were audio-visually recorded, and two raters independently assessed the recording of each encounter across these three clinical studies of decision aids using the Observer OPTION\(^{5}\) SDM tool. A total of 311 clinical encounters were included in the study. Table 1 summarizes these encounters across the three studies of interest. The overall Observer OPTION\(^{5}\) score was calculated for each encounter and rater [30]. The goal of the following analysis is to determine the concordance of the two raters despite the heterogeneity of the study groups and inherent heteroscedasticity.
Table 1
Encounters from the three randomized studies which compared the impact of PDAs to standard care
1 | 101 | 100 | 201 |
2 | 37 | 35 | 72 |
3 | 13 | 25 | 38 |
Total | 151 | 160 | 311 |
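As a simple consistency check on Table 1, the sketch below verifies that the two count columns sum to the row totals and that the row totals sum to the 311 encounters. The variable names are hypothetical, and the two count columns are treated as unlabeled because their headers are not given here.

```python
# Rows of Table 1: study -> (first count column, second count column, row total).
rows = {
    1: (101, 100, 201),
    2: (37, 35, 72),
    3: (13, 25, 38),
}
totals = (151, 160, 311)

# Each row total should equal the sum of its two count columns.
row_ok = all(a + b == total for a, b, total in rows.values())

# Each column total should equal the sum down that column.
col_sums = tuple(sum(row[k] for row in rows.values()) for k in range(3))
col_ok = col_sums == totals

print("rows consistent:", row_ok)
print("columns consistent:", col_ok)
```

The row totals (201, 72, 38) are the per-study encounter counts used in the model.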
In this particular case, the recorded encounters from all three studies were re-rated by the same two raters. Hence, we assume that the differences across the studies are due to the differences in populations and imposed interventions across each study. In many cases, there would also be heterogeneity across raters of studies, leading to even greater between-study heterogeneity than is observed in this case.