Background
An assessment of diagnostic accuracy is crucial in the development of medical testing procedures [1]. Comparing the accuracy of these procedures, in terms of their sensitivities and specificities [2, 3] relative to a gold standard, is essential to ensuring that the most appropriate tests are deployed in the clinical setting [4, 5]. The focus of this paper is sample size re-estimation in the comparison of two candidate tests to a gold standard. However, diagnostic accuracy studies do not necessarily involve comparisons; many such studies report the accuracy of a single test.
At the outset of a study, a sample size is calculated based on assumptions about the expected changes in sensitivity and specificity and, in a prospective design, the likely prevalence of the condition in the sample. However, the initial assumptions about the study parameters, especially the conditional dependence between the two tests, may prove to be inaccurate, resulting in an over- or under-powered study. A planned interim analysis allows the study’s sample size to be updated, using the information observed at the interim stage to refine the sample size estimate. A resulting increase in sample size allows the time, cost and patient discomfort already invested in the study to yield valid results, while a decrease means that less time and cost will be expended overall and patients will not undergo unnecessary testing [6].
There are well-established methodologies for interim sample size re-estimation in treatment studies for continuous and normally distributed response variables [7–11], some of which provide mechanisms to maintain blinding in the study [8–10]. Methods also exist for re-estimation with binary response variables [12, 13], and mechanisms to maintain blinding have been proposed in this more complex situation, where the variance and mean parameters are not separable [14]. Proschan [15] gives an overview of sample size re-estimation procedures based on a nuisance parameter. Specifically, procedures are considered for testing the difference in means between two samples with a common, unknown variance, and the difference in proportions between two groups with an unknown overall proportion. In the case of normally distributed data, the independence of the sample variance and sample mean ensures that the validity of estimates is unaffected by the interim sample size re-estimation, and this is shown to hold asymptotically in the binary case. However, Proschan does not consider the case of paired data, which is the focus of the current paper. Furthermore, the implications of sample size re-estimation in the context of comparative diagnostic studies, which are inherently different from those in treatment (randomised controlled) studies [16], have not been fully explored in the statistical literature.
A number of salient differences in interim analysis between studies comparing diagnostic tests and those comparing treatments are highlighted in Gerke et al. [5] and Gerke et al. [16]. Firstly, in paired diagnostic accuracy studies, full blinding is often not possible; specifically, it may not be possible to blind certain types of test from the patient, the person administering the test, the person interpreting the test, or the person measuring the outcome. However, as long as the results of the two tests being compared are temporarily blinded from the person measuring the outcome, this is not a major threat to a study’s validity [17]. In fact, it has the advantage that patients can benefit from their clinicians knowing the results of both diagnostic tests after testing has taken place. Secondly, in diagnostic accuracy studies, early cessation of the study due to futility is not as easy to establish as in treatment studies. The reasons for this are that 1) treatment studies often test a single outcome, while diagnostic studies test two outcomes, sensitivity and specificity, and futility must be established for both simultaneously, and 2) patient outcomes may only be seen further downstream from the test results [18]. Thirdly, the sample size required for a hypothesis test in diagnostic studies, powered to a given level, is closely related to the conditional dependence between the two testing procedures, which has been shown to present problems in a number of contexts [5, 19–24]. More specifically, the lower the conditional dependence between the tests, the greater the required sample size, with the largest sample size implied by the maximum negative dependence, given the specified alternative hypotheses. This level of conditional dependence between the tests is one of the primary factors driving the required sample size estimate, and it is often difficult to estimate a priori. Gerke et al. [5] assert that, for comparative diagnostic studies, an interim sample size re-estimation bears no threat to the validity of the study as long as it is planned. However, Gerke et al. [5] do not provide justification for this assertion and, furthermore, their assertion does not take the inherent uncertainty of the interim data into account. This study aims to present a method, and give practical guidelines for its application, for the initial estimation and interim re-estimation of sample size in a paired diagnostic study. The method uses information on the conditional dependence between tests at the interim to potentially reduce the required sample size while maintaining the approximate nominal statistical power of the experiment as a whole. While we present a method of estimating the size of the conditional dependence to reduce sample size, it should also be noted that there is a body of literature dealing with the problems caused by conditional dependence in other areas [25–27].
The remainder of the article is organised as follows. The methods section outlines sample size estimation methods for paired diagnostic test studies, introduces a motivating example application, and then proposes a new method for re-estimation based on a multinomial likelihood. The results section first provides extensive simulations of the method under various real-world conditions and then applies the proposed sample size re-estimation method to the motivating example. The article continues with a brief discussion of the place of this study in the literature and the optimal interim sample size to choose. Finally, the conclusion summarises and restates the major outcomes of this study.
Methods
A representation of data from a paired comparative diagnostic accuracy study is given in Table 1. The subjects are initially divided according to whether they are discovered, via the gold standard test, to be diseased or non-diseased. They are then further subdivided according to whether they test positive or negative on tests A and B. For example, the cell \( n_A \) represents subjects who were found to have the disease via the gold standard test and also tested positive on both tests A and B, while the cell \( n_F \) denotes subjects who tested negative on the gold standard and test B but positive on test A.
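Table 1
Paired study design
Gold standard | Test A | Test B +ive | Test B -ive |
Diseased | +ive | \( n_A \) | \( n_B \) |
Diseased | -ive | \( n_C \) | \( n_D \) |
Non-diseased | +ive | \( n_E \) | \( n_F \) |
Non-diseased | -ive | \( n_G \) | \( n_H \) |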
A possible initial sample size calculation, using a normal approximation of the logarithm of the ratio of sensitivities and specificities, and assuming a comparison between a new test, test A, and an existing test, test B, follows from Alonzo et al. [21], where a full derivation can be found. The experiment as a whole jointly tests for improvement in both sensitivity and specificity to pre-specified levels; the sample size is calculated for each, and the larger of the two is chosen to power the study. Note that this paper concentrates on the situation in which superiority is tested for both sensitivity and specificity. However, the method elaborated below should be extendable to situations where we are interested in testing non-inferiority in either or both of sensitivity and specificity. For details on the construction of the confidence intervals and hypothesis tests in these situations see Alonzo et al. [21]. In the case of superiority, the initial sample size calculation for sensitivity is given by:
$$ {n}_{p1}={\left(\frac{Z^{\left(1-\beta \right)}+{Z}^{\left(1-\alpha /2\right)}}{\mathit{\log}{\gamma}_1}\right)}^2\left(\frac{\left({\gamma}_1+1\right){TPR}_B-2 TPPR}{\gamma_1{TPR}_B^2}\right)/\pi $$
(1)
where α is the type I error rate of the study and β is the type II error rate, so that 1 − β is the power of the study. The main quantity of interest, \( \gamma_1 \), is the ratio of true positive rates, \( \gamma_1 = TPR_A / TPR_B \). \( TPR_B \) is the true positive rate (sensitivity) of test B, i.e. \( TPR_B = (n_A + n_C)/(n_A + n_B + n_C + n_D) \); \( TPR_A \) is the true positive rate (sensitivity) of test A, i.e. \( TPR_A = (n_A + n_B)/(n_A + n_B + n_C + n_D) \); TPPR is the proportion of diseased patients who test positive on both tests, i.e. \( TPPR = n_A/(n_A + n_B + n_C + n_D) \); and π is the prevalence of disease. The null hypothesis is that \( \gamma_1 = 1 \) and the alternative hypothesis is that \( \gamma_1 \ne 1 \).
For testing superiority of specificity we are interested in the true negative rates so the formula is instead:
$$ {n}_{n1}={\left(\frac{Z^{\left(1-\beta \right)}+{Z}^{\left(1-\alpha /2\right)}}{\mathit{\log}{\gamma}_2}\right)}^2\left(\frac{\left({\gamma}_2+1\right){TNR}_B-2 TNNR}{\gamma_2{TNR}_B^2}\right)/\left(1-\pi \right) $$
(2)
where \( \gamma_2 \), the main quantity of interest, is the ratio of true negative rates, \( \gamma_2 = TNR_A / TNR_B \); \( TNR_A = (n_G + n_H)/(n_E + n_F + n_G + n_H) \) is the true negative rate (specificity) of test A; \( TNR_B = (n_F + n_H)/(n_E + n_F + n_G + n_H) \) is the true negative rate (specificity) of test B; and TNNR is the proportion of non-diseased patients who test negative on both tests, \( TNNR = n_H/(n_E + n_F + n_G + n_H) \).
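Equations 1 and 2 share one functional form, so they can be evaluated with a single helper. The following is a minimal Python sketch, not the authors' code; the function names `ratio_sample_size`, `sensitivity_sample_size` and `specificity_sample_size` are hypothetical, and the default values of `alpha` and `beta` are illustrative assumptions.

```python
from math import log
from statistics import NormalDist


def ratio_sample_size(gamma, rate_b, joint, group_fraction, alpha=0.05, beta=0.2):
    """Total sample size from equation 1 (sensitivity) or equation 2 (specificity).

    gamma          -- hypothesised ratio of rates, test A over test B
    rate_b         -- TPR_B (equation 1) or TNR_B (equation 2)
    joint          -- TPPR (equation 1) or TNNR (equation 2)
    group_fraction -- prevalence pi (equation 1) or 1 - pi (equation 2)
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    z_term = ((z(1 - beta) + z(1 - alpha / 2)) / log(gamma)) ** 2
    variance_term = ((gamma + 1) * rate_b - 2 * joint) / (gamma * rate_b ** 2)
    return z_term * variance_term / group_fraction


def sensitivity_sample_size(gamma1, tpr_b, tppr, prevalence, **kwargs):
    """Equation 1: the diseased fraction of the sample is the prevalence."""
    return ratio_sample_size(gamma1, tpr_b, tppr, prevalence, **kwargs)


def specificity_sample_size(gamma2, tnr_b, tnnr, prevalence, **kwargs):
    """Equation 2: the non-diseased fraction of the sample is 1 - prevalence."""
    return ratio_sample_size(gamma2, tnr_b, tnnr, 1 - prevalence, **kwargs)
```

The functions return an unrounded value; the study would be powered on the larger of the two results.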
It is interesting to note that, following the notation of Vacek [25] and considering the population 2 × 2 tables (Table 1), the conditional dependence of the two tests can be denoted by \( e_b \) and \( e_a \), the conditional covariances when the gold standard disease status is positive or negative, respectively [25]. The probability of both tests being positive in diseased subjects can therefore be expressed as \( TPPR = TPR_A \cdot TPR_B + e_b \), and the probability of both tests being negative in non-diseased subjects as \( TNNR = TNR_A \cdot TNR_B + e_a \). When \( e_a = e_b = 0 \) the tests are conditionally independent; when \( e_a \) and/or \( e_b \ne 0 \), the response on one test changes the probability of that response on the other test. For example, when \( e_b > 0 \), a diseased individual who responds positively on test A is more likely to respond positively on test B.
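The decomposition of the joint positive rate into the product of the marginal rates plus a covariance can be checked directly on a 2 × 2 table. The sketch below is in Python and is illustrative only: the function name `stratum_summary` is hypothetical and the counts are invented for the example, not taken from any study.

```python
def stratum_summary(both_pos, a_only, b_only, both_neg):
    """Marginal rates, joint positive rate and covariance for one gold-standard stratum."""
    n = both_pos + a_only + b_only + both_neg
    rate_a = (both_pos + a_only) / n      # proportion positive on test A
    rate_b = (both_pos + b_only) / n      # proportion positive on test B
    joint = both_pos / n                  # proportion positive on both tests
    covariance = joint - rate_a * rate_b  # e_b when the stratum is the diseased group
    return rate_a, rate_b, joint, covariance


# Hypothetical counts for 100 diseased subjects (not taken from the study).
tpr_a, tpr_b, tppr, e_b = stratum_summary(both_pos=40, a_only=10, b_only=5, both_neg=45)

# With e_b > 0, a positive result on test A raises the chance of a positive on test B.
p_b_given_a = 40 / (40 + 10)
print(f"TPR_A={tpr_a:.2f} TPR_B={tpr_b:.2f} TPPR={tppr:.2f} e_b={e_b:.3f}")
print(f"P(B+ | A+)={p_b_given_a:.2f} versus TPR_B={tpr_b:.2f}")
```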
For initial estimates of TPPR and TNNR, from Alonzo et al. [21] we can use the fact that \( TPPR \ge (1 + \gamma_1) TPR_B - 1 \) and \( TNNR \ge (1 + \gamma_2) TNR_B - 1 \) to obtain lower bounds on the possible values of TPPR and TNNR under the specified hypotheses. The required sample size is largest when \( TPPR = (1 + \gamma_1) TPR_B - 1 \) and \( TNNR = (1 + \gamma_2) TNR_B - 1 \); these estimates therefore represent the “worst case scenarios” of maximal negative conditional dependence between the tests, conditional on the fixed values of \( TPR_A \) and \( TPR_B \). The sample size implied by using these levels of TPPR and TNNR would very likely overpower the study, i.e. more participants would be recruited than is strictly necessary to achieve the power specified by β. The required sample size is smallest when the conditional dependence between tests A and B is maximal, conditional on the fixed values of \( TPR_A \) and \( TPR_B \), i.e. when \( TPPR = TPR_B \) and \( TNNR = TNR_B \). The implied sample size in this case would likely underpower the study, i.e. too few participants would be recruited to reach the power specified by β. The sample size in this “best case scenario” can be substantially lower than that in the worst case scenario. Conservatively, it might be thought a good idea to always use the “worst case scenario” sample size estimate, which will always power the study sufficiently. However, in cases where the recruitment and testing of participants comes at a premium, both financially and in terms of discomfort to the patients, it might be preferable to apply a more nuanced strategy. Furthermore, the “worst case scenario” sample size implies the highly unlikely condition of maximal negative conditional dependence between two tests that are performed on the same patients to detect the same disease. The implied sample size based on this condition is not recommended [28]. One possibility, to enable a more accurate evaluation of the conditional dependence between the two tests, and thus of the required sample size, is to perform a planned interim sample size re-estimation, using the interim information to refine the sample size estimate.
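The range of joint proportions permitted by the hypothesised marginals can be written as a small helper. This is a hedged Python sketch with a hypothetical function name, `joint_bounds`. The clamp at zero and the use of `min` are assumptions added for generality (they are the standard bounds for a 2 × 2 table with fixed margins) and reduce to the bounds quoted above when test A is hypothesised to be the better test and the two rates sum to at least one.

```python
def joint_bounds(gamma, rate_b):
    """Permitted range of TPPR (or TNNR) given the ratio gamma and the rate of test B.

    The lower end corresponds to maximal negative conditional dependence (largest
    sample size), the upper end to maximal positive dependence (smallest sample size).
    """
    rate_a = gamma * rate_b
    if not (0 < rate_b <= 1 and 0 < rate_a <= 1):
        raise ValueError("hypothesised rates must lie in (0, 1]")
    lower = max(0.0, rate_a + rate_b - 1)  # equals (1 + gamma) * rate_b - 1 when non-negative
    upper = min(rate_a, rate_b)            # equals rate_b when gamma >= 1
    return lower, upper
```

Evaluating equation 1 or 2 at the two ends of this interval gives the worst-case and best-case sample sizes, respectively.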
At a planned interim, where a proportion of the overall sample has been collected, we would have some information about the true values of TPPR, TNNR, π, \( TPR_B \) and \( TNR_B \); however, these values would come from only a limited sample. The crucial parameters to use in re-estimation are those related to the conditional dependence between the tests, i.e. TPPR and TNNR, as these values are difficult to estimate and it is unlikely that research exists which can provide approximate values for them. Conversely, \( TPR_B \) and \( TNR_B \), the sensitivity and specificity of an established test, may have known values in the literature, and these should preferably be used over those from the relatively small interim sample. For π, the prevalence, a judgement must be made as to whether any pre-existing estimate would be a more accurate reflection of the true prevalence in the specific study population than the interim estimate. In the example given below, we use values for TPPR, TNNR and π at the interim in the sample size calculation.
Naively, it might appear that interim sample size re-estimation would entail a straightforward re-application of eqs. (1) and (2) with π, and TPPR in the case of (1) or TNNR in the case of (2), replaced with the estimates at the interim point. However, this approach does not take into account the inherent uncertainty in the interim parameter estimates of TPPR, TNNR and π, nor the fact that only a specific range of values for TPPR and TNNR is actually possible under the alternative hypothesis. An approach which does take these factors into account is re-estimation of the sample size based on maximum likelihood estimation, at the interim, of the parameters in question under a multinomial model. This model is constrained by the hypothesised values of \( TPR_A \), \( TPR_B \), \( TNR_A \) and \( TNR_B \), i.e. the marginals in Table 1.
Application
The numerical example we use involves an interim sample size recalculation of a study comparing the incremental benefits to sensitivity and specificity of augmenting current methods for diagnosing pancreatic cancer with positron emission tomography (PET) and computed tomography (CT) technologies. The alternative hypotheses were that sensitivity would rise from 81% to 90% and specificity would rise from 66% to 80%. Additionally, the expected prevalence of pancreatic cancer from the literature was 47%.
To calculate the sample size for sensitivity, equation 1 was used; taking \( \alpha = 0.05 \), \( \beta = 0.2 \), \( \widehat{\gamma}_1 = \frac{0.9}{0.81} \), \( \widehat{TPR_B} = 0.81 \), \( \widehat{TPPR} = 0.71 \) and \( \widehat{\pi} = 0.47 \) gives a sample size of 598. To calculate the sample size for specificity, equation 2 was used; taking \( \alpha = 0.05 \), \( \beta = 0.2 \), \( \widehat{\gamma}_2 = \frac{0.8}{0.66} \), \( \widehat{TNR_B} = 0.66 \), \( \widehat{TNNR} = 0.46 \) and \( \widehat{\pi} = 0.47 \) gives a sample size of 409. The minimum sample sizes for sensitivity and specificity, given \( \widehat{TPPR} = 0.81 \) and \( \widehat{TNNR} = 0.66 \), are 186 and 106, respectively. Given the disparity between the minimum and maximum sample size estimates, it was decided to re-assess the sample size at a planned interim.
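The four figures quoted above can be reproduced by evaluating equations 1 and 2 at the two extremes of conditional dependence. The following Python sketch is illustrative only; the helper `n_required` and the variable names are hypothetical, and the results are rounded to the nearest integer, which is an assumption about how the quoted values were obtained.

```python
from math import log
from statistics import NormalDist

Z = NormalDist().inv_cdf          # standard normal quantile function
ALPHA, BETA, PREVALENCE = 0.05, 0.2, 0.47


def n_required(rate_a, rate_b, joint, group_fraction):
    """Equations 1 and 2 in a common form, with gamma = rate_a / rate_b."""
    gamma = rate_a / rate_b
    z_term = ((Z(1 - BETA) + Z(1 - ALPHA / 2)) / log(gamma)) ** 2
    return z_term * ((gamma + 1) * rate_b - 2 * joint) / (gamma * rate_b ** 2) / group_fraction


results = {
    "sensitivity, worst case (TPPR = 0.71)": n_required(0.90, 0.81, 0.71, PREVALENCE),
    "sensitivity, best case (TPPR = 0.81)": n_required(0.90, 0.81, 0.81, PREVALENCE),
    "specificity, worst case (TNNR = 0.46)": n_required(0.80, 0.66, 0.46, 1 - PREVALENCE),
    "specificity, best case (TNNR = 0.66)": n_required(0.80, 0.66, 0.66, 1 - PREVALENCE),
}
for label, n in results.items():
    print(f"{label}: {round(n)}")
```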
Table 2 gives the results after data from 187 participants had been collected. The observed values at the interim are \( \widehat{TPPR} = 0.80 \), \( \widehat{TNNR} = 0.66 \) and \( \widehat{\pi} = 0.44 \). Taking a naive approach and plugging these values directly into equations 1 and 2, the implied sample sizes become 242 for sensitivity and 100 for specificity, giving a total sample size for the study of 242 (or 342 and 145, respectively, had we also used the interim values of \( TPR_B \) and \( TNR_B \)). However, this method does not take into account the fact that \( \widehat{TPPR} \) and \( \widehat{TNNR} \) are random variables, whereas we are actually interested in the true values of TPPR and TNNR under the specified alternative hypothesis. In fact, had the observed value for TPPR been equal to 0.86, the sample size given by the naive method would have been −22, because \( \widehat{TPPR} \) would have exceeded the mean of the hypothesised \( TPR_A \) and \( TPR_B \), making the numerator of equation 1 negative; such a value is impossible under the specified marginals, where TPPR cannot exceed \( TPR_B \). Clearly, the naive method, which uses the random value of a single cell, is inappropriate, and a method that uses information about the value of TPPR from all of the observed cells and the specified marginals is required.
Table 2
Interim PET diagnostic study results
Gold standard | Post-PET | Test B +ive | Test B -ive |
Diseased | +ive | 66 | 3 |
Diseased | -ive | 3 | 10 |
Non-diseased | +ive | 21 | 4 |
Non-diseased | -ive | 11 | 69 |
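The naive plug-in calculation described above can be traced from the counts in Table 2. This Python sketch is illustrative, with hypothetical names throughout. It assumes that Table 2 follows the cell layout of Table 1 and that the interim estimates are rounded to two decimal places before being substituted, as in the text.

```python
from math import log
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function


def n_required(rate_a, rate_b, joint, group_fraction, alpha=0.05, beta=0.2):
    """Equations 1 and 2 in a common form, with gamma = rate_a / rate_b."""
    gamma = rate_a / rate_b
    z_term = ((Z(1 - beta) + Z(1 - alpha / 2)) / log(gamma)) ** 2
    return z_term * ((gamma + 1) * rate_b - 2 * joint) / (gamma * rate_b ** 2) / group_fraction


# Interim counts, ordered (A+ B+, A+ B-, A- B+, A- B-); layout assumed to match Table 1.
diseased = (66, 3, 3, 10)
non_diseased = (21, 4, 11, 69)
n_dis, n_non = sum(diseased), sum(non_diseased)

tppr_hat = round(diseased[0] / n_dis, 2)       # both tests positive among diseased
tnnr_hat = round(non_diseased[3] / n_non, 2)   # both tests negative among non-diseased
pi_hat = round(n_dis / (n_dis + n_non), 2)     # interim prevalence

naive_sens = n_required(0.90, 0.81, tppr_hat, pi_hat)
naive_spec = n_required(0.80, 0.66, tnnr_hat, 1 - pi_hat)

# An observed TPPR above the mean of the hypothesised rates makes the formula negative.
impossible = n_required(0.90, 0.81, 0.86, pi_hat)
print(round(naive_sens), round(naive_spec), round(impossible))
```

The negative value in the last line illustrates why a single observed cell proportion cannot safely be substituted for a parameter that is constrained by the hypothesised marginals.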
Sample size re-estimation via maximum likelihood estimation of TPPR
For illustration purposes, we discuss the re-estimation of the sample size for sensitivity; the estimation procedure for specificity is analogous. Taking test A as the test with the highest expected diagnostic utility, i.e. the “new” test whose performance we are comparing to the “standard”, the probabilities corresponding to the cells for diseased subjects in Table 1, given maximally negative conditional dependence between the tests, are: \( p_1 = TPR_B - (1 - TPR_A) \), \( p_2 = 1 - TPR_B \), \( p_3 = 1 - TPR_A \), \( p_4 = 0 \). The probabilities of the cells when the conditional dependence between the tests is maximally positive are given by: \( p_1 = TPR_B \), \( p_2 = TPR_A - TPR_B \), \( p_3 = 0 \), \( p_4 = 1 - TPR_A \). We could alternatively specify these cell probabilities according to the covariance between the two tests. Specifically, Vacek [25] gives the maximum value of the covariance as \( TPR_B (1 - TPR_A) \) and the minimum value as \( -(1 - TPR_A)(1 - TPR_B) \). Thus, the maximum and minimum values for the cells can be ascertained by taking the product of the marginal probabilities associated with a cell and adding the maximum or minimum value of the covariance, for cells \( p_1 \) and \( p_4 \), or subtracting it, for cells \( p_2 \) and \( p_3 \). For example, the minimum value for \( p_1 \) is \( TPR_A \cdot TPR_B - (1 - TPR_A)(1 - TPR_B) \). Between the minimum and maximum values lies every permissible joint configuration. Let these possible joint configurations be expressed as a vector, \( \boldsymbol{p} \), with \( p_1 = TPPR \), where \( \sum_{i=1}^4 p_i = 1 \), \( p_1 + p_2 = TPR_A \) and \( p_1 + p_3 = TPR_B \).
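Because the marginals are fixed, the whole vector of cell probabilities is determined by its first element. A minimal Python sketch of this one-parameter family follows; the function name `cell_probabilities` and the numerical tolerance are assumptions made for illustration.

```python
def cell_probabilities(p1, rate_a, rate_b, tol=1e-12):
    """Cell probabilities (p1, p2, p3, p4) implied by p1 and the fixed marginals.

    Cells are ordered (A+ B+, A+ B-, A- B+, A- B-), so that
    p1 + p2 = rate_a, p1 + p3 = rate_b and the four cells sum to one.
    """
    lower = max(0.0, rate_a + rate_b - 1)  # maximal negative dependence
    upper = min(rate_a, rate_b)            # maximal positive dependence
    if not lower - tol <= p1 <= upper + tol:
        raise ValueError(f"p1 must lie in [{lower}, {upper}]")
    return (p1, rate_a - p1, rate_b - p1, 1 - rate_a - rate_b + p1)


# The two extreme configurations for hypothesised rates of 0.90 (test A) and 0.81 (test B).
most_negative = cell_probabilities(0.71, 0.90, 0.81)
most_positive = cell_probabilities(0.81, 0.90, 0.81)
```

The covariance for any member of the family is `p1 - rate_a * rate_b`, which moves from its minimum to its maximum as `p1` moves across the interval.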
When the conditional dependence is maximally positive, the required sample size is at its smallest; when it is maximally negative, the required sample size is at its largest. At the beginning of the experiment we do not know which of these possible levels of conditional dependence our data will be generated under, and thus we use the largest possible sample size estimate, which is usually overly conservative.
However, at the interim we can use our observed data to evaluate the likelihood of those data having been generated under each of the permissible joint configurations of cell probabilities, given the implied range of probabilities under a multinomial model. A simple method of extracting an estimate of TPPR is to maximise the likelihood function of the interim data given the values of \( \boldsymbol{p} \) implied by the marginal probabilities:
$$ \mathcal{L}\left(\boldsymbol{p}| x\right)=\kern0.5em \prod_{i=1}^4{\boldsymbol{p}}_i^{x_i} $$
(3)
where \( \boldsymbol{p} \) is the vector of joint probabilities defined above and \( x \) is the vector of observed cell frequencies. The constraints imposed on the above multinomial likelihood make the parameter space one-dimensional; substituting the constraints in order to express the likelihood in terms of \( p_1 \) gives:
$$ \mathcal{L}\left({p}_1| x\right)={p}_1^{x_1}{\left(\ {TPR}_A - {p}_1\right)}^{x_2}{\left(\ {TPR}_B - {p}_1\right)}^{x_3}{\left(1 - {TPR}_A - {TPR}_B + {p}_1\right)}^{x_4} $$
(4)
$$ {\ p}_1\in \left[\ {TPR}_B-\left(1 - {TPR}_A\right),{\ TPR}_B\right] $$
Code to estimate this in R, via optimisation of the negative log-likelihood, is given in the Appendix. In effect, this method bounds the value of the conditional dependence between the minimum and maximum values under the specified marginals and then uses information from the frequencies in the four cells of the table to infer the most probable value of \( p_1 \). We can use this estimate, \( \widehat{p}_1 \), as our value of \( \widehat{TPPR} \) and use the observed value of the prevalence (if required) as our measure of \( \widehat{\pi} \) in equation 1 to re-estimate the sample size at the interim.
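As an illustration of the constrained maximisation, the following Python sketch maximises the logarithm of equation 4 over the permitted interval. It is not the R code from the Appendix: the function names are hypothetical, and golden-section search is a technique chosen here because the log-likelihood is concave in \( p_1 \); any bounded one-dimensional optimiser would serve.

```python
from math import log


def log_likelihood(p1, counts, rate_a, rate_b):
    """Log of equation 4 for counts ordered (A+ B+, A+ B-, A- B+, A- B-)."""
    probs = (p1, rate_a - p1, rate_b - p1, 1 - rate_a - rate_b + p1)
    total = 0.0
    for x, p in zip(counts, probs):
        if x == 0:
            continue              # an empty cell contributes nothing
        if p <= 0:
            return float("-inf")  # observed data are impossible under this p1
        total += x * log(p)
    return total


def mle_p1(counts, rate_a, rate_b, iterations=100):
    """Maximise the log-likelihood over the permitted range of p1."""
    lo, hi = max(0.0, rate_a + rate_b - 1), min(rate_a, rate_b)
    ratio = (5 ** 0.5 - 1) / 2    # golden-section step
    for _ in range(iterations):
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if log_likelihood(left, counts, rate_a, rate_b) < log_likelihood(right, counts, rate_a, rate_b):
            lo = left             # the maximum lies to the right of `left`
        else:
            hi = right            # the maximum lies to the left of `right`
    return (lo + hi) / 2


# Diseased-stratum counts from Table 2 with hypothesised rates of 0.90 and 0.81.
p1_hat = mle_p1((66, 3, 3, 10), rate_a=0.90, rate_b=0.81)
print(f"estimated TPPR: {p1_hat:.4f}")
```

The resulting estimate always lies inside the interval allowed by the hypothesised marginals, so substituting it into equation 1 cannot produce a negative sample size.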
Discussion
This paper has presented a robust method of sample size re-estimation for use in paired diagnostic accuracy studies where the conditional dependence between the two tests may be unknown or inaccurately estimated at the start of the study. Given the results, a specific protocol is suggested for sample size estimation for the experiment as a whole. Rather than basing the estimate on the case of maximal negative conditional dependence between tests, and thus on the largest possible sample size, as suggested in Alonzo et al. [21], we would suggest an alternative strategy, the robustness of which is highlighted in Table 6. Specifically, we suggest initially estimating the sample size at the maximal positive conditional dependence between tests, i.e. using \( TPPR = TPR_B \), which gives the smallest possible sample size, and then re-estimating the final sample size using the method simulated in Table 6. As long as the initial estimate for prevalence is close to accurate, this protocol is deemed appropriate, as it balances the risk of recruiting more participants than might actually be needed against collecting the most information about the true conditional dependence at the interim. Table 6 provides strong evidence for the integrity of this method in providing at minimum the nominal power while reducing the sample size when the true conditional dependence is higher than maximally negative. Should the interim sample size be some other value, the maximum likelihood method will still be appropriate, although it should be kept in mind that the larger the interim sample size, as a proportion of the total possible sample size, the more accurate the interim sample size estimates will be for individual cases.
Interestingly, the sample size values in Table 6 seem to be somewhat greater, even when using our method, than those typically seen in the literature on diagnostic test accuracy studies; see, for example, van Enst et al. [29]. Although it is difficult to know the specifics of the 859 studies mentioned in the van Enst collection of meta-analyses, e.g. clinically significant differences, sample size estimation and hypothesis testing procedures, it is striking that the 50% covariance sample size is only 87 (IQR 45–185) participants. Very few of our sample sizes in Table 6 are this low for the size of effect (ratios) we are considering, even using our method of sample size reduction. It may be that many diagnostic accuracy studies commissioned do not carefully consider their sample sizes. While the method discussed here, of estimating the conditional dependence between the tests via maximum likelihood, given constraints imposed by the specified marginals and under a multinomial model, is pertinent to paired diagnostic accuracy tests, there is little reason why similar processes could not be extended to similar problems. The kernel of the method, maximum likelihood estimation of the parameter related to the conditional dependence using a constrained multinomial model, is equally valid in other applications involving sample size re-estimation for paired binary 2 × 2 tables.