Methods
Model for assessing incremental value without a gold standard
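The observed data may be described by a latent class model, which assumes that the standard test (T1) and the new test (T2) are imperfect measures of an underlying latent variable D, the true disease status. Both tests and the disease status are assumed to be dichotomous, positive (+) or negative (-), based on standard cut-offs. The observed data follow a multinomial distribution in which the probability of each of the 4 combinations of the 2 test results can be expressed in terms of the sensitivity and specificity of both tests and the prevalence. Furthermore, each probability is a mixture of patients who are D+ and D-:
$$
\begin{aligned}
P(T_1^+, T_2^+) &= \pi\,\mathrm{sens}_1\,\mathrm{sens}_2 + (1-\pi)(1-\mathrm{spec}_1)(1-\mathrm{spec}_2)\\
P(T_1^+, T_2^-) &= \pi\,\mathrm{sens}_1(1-\mathrm{sens}_2) + (1-\pi)(1-\mathrm{spec}_1)\,\mathrm{spec}_2\\
P(T_1^-, T_2^+) &= \pi(1-\mathrm{sens}_1)\,\mathrm{sens}_2 + (1-\pi)\,\mathrm{spec}_1(1-\mathrm{spec}_2)\\
P(T_1^-, T_2^-) &= \pi(1-\mathrm{sens}_1)(1-\mathrm{sens}_2) + (1-\pi)\,\mathrm{spec}_1\,\mathrm{spec}_2
\end{aligned}
\tag{1}
$$
where π = P(D+) is the prevalence of disease, sens_j = P(Tj+|D+) is the sensitivity of the jth test (j = 1, 2) and spec_j = P(Tj-|D-) is the specificity of the jth test.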
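As a minimal sketch, the four cell probabilities of the conditional independence model can be computed directly from the five parameters. The function name `joint_test_probabilities` and its argument names are illustrative assumptions and are not taken from the authors' WinBUGS program.

```python
def joint_test_probabilities(prev, sens1, spec1, sens2, spec2):
    """Return P(T1, T2) for the four result combinations, assuming the
    two tests are conditionally independent given true disease status."""
    probs = {}
    for t1 in ("+", "-"):
        for t2 in ("+", "-"):
            # P(T1, T2 | D+): product of per-test probabilities among the diseased
            p_pos = (sens1 if t1 == "+" else 1 - sens1) * (sens2 if t2 == "+" else 1 - sens2)
            # P(T1, T2 | D-): product of per-test probabilities among the non-diseased
            p_neg = (1 - spec1 if t1 == "+" else spec1) * (1 - spec2 if t2 == "+" else spec2)
            # Mixture over D+ and D- weighted by the prevalence
            probs[(t1, t2)] = prev * p_pos + (1 - prev) * p_neg
    return probs


# Illustrative parameter values only
cells = joint_test_probabilities(prev=0.3, sens1=0.7, spec1=0.9, sens2=0.8, spec2=0.95)
```

The four returned probabilities sum to 1, which is why only 3 degrees of freedom are available to estimate the 5 parameters.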
The latent class model in Equation (1) is non-identifiable because the number of unknown parameters (5, i.e., sensitivity and specificity of both tests and prevalence) exceeds the degrees of freedom (3, i.e., possible test combinations - 1). This model can be estimated using a Bayesian approach with informative priors on at least 2 parameters (5 unknown parameters - 3 degrees of freedom) [10,19]. The prior information is combined with the observed data to obtain a joint posterior distribution. A sample from the posterior distribution can be drawn using Markov Chain Monte Carlo methods such as the Gibbs sampler [20]. To perform a Bayesian analysis, prior information on sensitivity and specificity must be expressed as probability distributions, such as the Beta distribution. Parameters for which no prior information is available may follow objective prior distributions, such as the Uniform distribution, which assigns equal weight to all possible values.
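The sampling scheme can be illustrated with a short Gibbs sampler of the kind commonly used for two-test latent class models under conditional independence. It alternates between drawing the latent number of truly diseased subjects in each cell and drawing each parameter from its conjugate Beta distribution. This is a sketch under stated assumptions, not the authors' WinBUGS program: the function name, the cell ordering, the example counts and the Beta prior parameters are all hypothetical, and Beta(1, 1) stands in for the Uniform prior.

```python
import numpy as np

PARAMS = ("prev", "sens1", "spec1", "sens2", "spec2")


def gibbs_latent_class(counts, priors, n_iter=3000, burn_in=500, seed=0):
    """Sketch of a Gibbs sampler for two conditionally independent tests.

    counts: 4 cell counts ordered (T1+,T2+), (T1+,T2-), (T1-,T2+), (T1-,T2-).
    priors: dict mapping each name in PARAMS to Beta (a, b) parameters.
    Returns a dict of posterior draws per parameter, burn-in removed.
    """
    rng = np.random.default_rng(seed)
    n = np.asarray(counts)
    t1 = np.array([1, 1, 0, 0], dtype=bool)  # T1 positive in each cell
    t2 = np.array([1, 0, 1, 0], dtype=bool)  # T2 positive in each cell
    theta = {k: priors[k][0] / sum(priors[k]) for k in PARAMS}  # start at prior means
    draws = {k: np.empty(n_iter) for k in PARAMS}

    def draw(name, successes, failures):
        a, b = priors[name]
        return rng.beta(a + successes, b + failures)

    for i in range(n_iter):
        # Probability of each cell among D+ and among D-
        p_pos = np.where(t1, theta["sens1"], 1 - theta["sens1"]) * np.where(t2, theta["sens2"], 1 - theta["sens2"])
        p_neg = np.where(t1, 1 - theta["spec1"], theta["spec1"]) * np.where(t2, 1 - theta["spec2"], theta["spec2"])
        # P(D+ | cell), then the latent number of D+ subjects in each cell
        ppv = theta["prev"] * p_pos / (theta["prev"] * p_pos + (1 - theta["prev"]) * p_neg)
        y = rng.binomial(n, ppv)
        z = n - y
        # Conjugate Beta updates given the latent disease counts
        theta["prev"] = draw("prev", y.sum(), z.sum())
        theta["sens1"] = draw("sens1", y[t1].sum(), y[~t1].sum())
        theta["spec1"] = draw("spec1", z[~t1].sum(), z[t1].sum())
        theta["sens2"] = draw("sens2", y[t2].sum(), y[~t2].sum())
        theta["spec2"] = draw("spec2", z[~t2].sum(), z[t2].sum())
        for k in PARAMS:
            draws[k][i] = theta[k]
    return {k: v[burn_in:] for k, v in draws.items()}


# Hypothetical data and priors, for illustration only
priors = {"prev": (1, 1), "sens1": (58.1, 24.9), "spec1": (128.7, 14.3),
          "sens2": (1, 1), "spec2": (1, 1)}
posterior = gibbs_latent_class([172, 108, 104, 616], priors)
```

Posterior medians and 95% credible intervals can then be read off each array, for example with `np.percentile(posterior["sens2"], [2.5, 50, 97.5])`.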
Estimation of incremental value
Let P(D+|T1, T2) denote the positive predictive probability given the results of T1 and T2, and let P(D+|T1) denote the positive predictive probability given T1 alone. Following Pencina et al. [7], we define the IDI as the difference of the differences between the expected (E) positive predictive probabilities with and without the new test, conditional on D+ and D-:
$$
\mathrm{IDI} = \bigl\{E[P(D^+ \mid T_1, T_2) \mid D^+] - E[P(D^+ \mid T_1) \mid D^+]\bigr\} - \bigl\{E[P(D^+ \mid T_1, T_2) \mid D^-] - E[P(D^+ \mid T_1) \mid D^-]\bigr\}
\tag{2}
$$
In the original definition by Pencina et al. [7], the predicted probabilities were derived from separate models: the old model based on T1 alone and the new model based on both T1 and T2. In the absence of a gold standard, the true disease status is unknown and must be estimated. We assumed that the latent class model for the joint results of T1 and T2 in Equation (1) provides the best estimate of an individual’s disease status, under the assumption of conditional independence. All predicted probabilities, whether conditional on T1 and T2 or on T1 alone, were derived from this model. Furthermore, all the probabilities in Equation (2) can be expressed as functions of the sensitivity, specificity and prevalence estimates from the latent class model. For example, applying Bayes' theorem to Equation (1), the positive predictive probability when both tests are positive is
$$
P(D^+ \mid T_1^+, T_2^+) = \frac{\pi\,\mathrm{sens}_1\,\mathrm{sens}_2}{\pi\,\mathrm{sens}_1\,\mathrm{sens}_2 + (1-\pi)(1-\mathrm{spec}_1)(1-\mathrm{spec}_2)}
$$
The predictive values above can also be used to calculate the AUC. It may be calculated from the Wilcoxon rank sum statistic comparing predictive values in the groups D+ and D- as follows [5]:
$$
\mathrm{AUC} = \frac{R_{D^+} - N_{D^+}(N_{D^+} + 1)/2}{N_{D^+}\,N_{D^-}}
\tag{3}
$$
where $R_{D^+}$ is the sum of the ranks of the positive predictive values calculated among the disease positive subjects, and $N_{D^+}$ and $N_{D^-}$ are the numbers of disease positive and disease negative subjects, respectively. The AUC based on the probability conditional on T1 was subtracted from the AUC based on the probability conditional on T1 and T2 to obtain the AUC difference (AUCdiff). A WinBUGS program for estimating the latent class model and the IDI and AUCdiff statistics appears in Additional file 1: Table S1.
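The following sketch shows one way the IDI and AUCdiff could be computed for given parameter values under the conditional independence model. It is not the authors' WinBUGS program. In particular, the AUC here is the population analogue of the rank-sum statistic (the probability that a D+ subject has a higher predictive value than a D- subject, counting ties as one half), which is an assumption made so that no individual-level disease status needs to be sampled. All names are hypothetical.

```python
from itertools import product


def incremental_value(prev, sens1, spec1, sens2, spec2):
    """Return (IDI, AUCdiff) implied by the model parameters.
    Test results are coded 1 = positive, 0 = negative."""
    cells = list(product((1, 0), repeat=2))
    # P(T1, T2 | D+) and P(T1, T2 | D-) under conditional independence
    pos = {(a, b): (sens1 if a else 1 - sens1) * (sens2 if b else 1 - sens2) for a, b in cells}
    neg = {(a, b): (1 - spec1 if a else spec1) * (1 - spec2 if b else spec2) for a, b in cells}
    # Predictive probabilities P(D+ | T1, T2) and P(D+ | T1) by Bayes' theorem
    new = {c: prev * pos[c] / (prev * pos[c] + (1 - prev) * neg[c]) for c in cells}
    old_by_t1 = {}
    for a in (1, 0):
        num = prev * (sens1 if a else 1 - sens1)
        old_by_t1[a] = num / (num + (1 - prev) * (1 - spec1 if a else spec1))
    old = {c: old_by_t1[c[0]] for c in cells}

    def expected(score, dist):
        # Expected predictive probability within one disease group
        return sum(score[c] * dist[c] for c in cells)

    def auc(score):
        # P(score in D+ > score in D-) + 0.5 * P(tie)
        total = 0.0
        for c_pos in cells:
            for c_neg in cells:
                if score[c_pos] > score[c_neg]:
                    weight = 1.0
                elif score[c_pos] == score[c_neg]:
                    weight = 0.5
                else:
                    weight = 0.0
                total += weight * pos[c_pos] * neg[c_neg]
        return total

    idi = (expected(new, pos) - expected(old, pos)) - (expected(new, neg) - expected(old, neg))
    return idi, auc(new) - auc(old)


# Illustrative parameter values only
idi, auc_diff = incremental_value(prev=0.3, sens1=0.7, spec1=0.9, sens2=0.8, spec2=0.95)
```

Applying such a function to each posterior draw of the parameters would give a posterior sample of the two statistics, from which medians and credible intervals follow.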
Simulation study of model performance
We used the model in Equation (1) to generate simulated datasets to illustrate the change in IDI and AUCdiff when varying the sensitivity and specificity of T2. In all simulations, we assumed a sample size of N = 1000 and that both T1 and T2 were performed on all individuals. The sensitivity and specificity of T1 were set at 0.7 and 0.9, respectively; the prevalence was set at 0.3. We considered situations where the sensitivity (S) and/or specificity (C) of T2 was better (i.e., S2 = 0.8 and/or C2 = 0.95), worse (S2 = 0.6 and/or C2 = 0.8), or no different from that of T1. The true values of IDI and AUCdiff were calculated in each simulation setting using Equation (2) and Equation (3), respectively.
We generated 1000 datasets under each setting. We then fit the latent class model to the simulated datasets and estimated the AUCdiff and IDI statistics under each scenario. We used the results of the simulated datasets to estimate the frequentist properties of the AUCdiff and IDI statistics: average bias (i.e., the average difference between the true value and the posterior median across 1000 datasets), average coverage (i.e., the proportion of the 1000 datasets for which the posterior credible interval of a statistic included its true value) and average 95% posterior credible interval length.
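A minimal sketch of the data-generating step and of the three frequentist summaries is given below. The function names and the cell ordering are assumptions. The bias is computed here as posterior median minus true value, which is one possible sign convention.

```python
import numpy as np


def simulate_counts(prev, sens1, spec1, sens2, spec2, n=1000, n_datasets=1000, seed=0):
    """Draw multinomial cell counts ordered (T1+,T2+), (T1+,T2-), (T1-,T2+), (T1-,T2-)
    from the conditional independence latent class model."""
    rng = np.random.default_rng(seed)
    probs = [
        prev * sens1 * sens2 + (1 - prev) * (1 - spec1) * (1 - spec2),
        prev * sens1 * (1 - sens2) + (1 - prev) * (1 - spec1) * spec2,
        prev * (1 - sens1) * sens2 + (1 - prev) * spec1 * (1 - spec2),
        prev * (1 - sens1) * (1 - sens2) + (1 - prev) * spec1 * spec2,
    ]
    return rng.multinomial(n, probs, size=n_datasets)


def frequentist_summary(true_value, medians, lower, upper):
    """Summarise posterior medians and credible limits across simulated datasets."""
    medians, lower, upper = (np.asarray(x, dtype=float) for x in (medians, lower, upper))
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "average_bias": float(np.mean(medians - true_value)),  # median minus truth
        "coverage": float(np.mean(covered)),
        "average_length": float(np.mean(upper - lower)),
    }


# One simulation setting with illustrative values (T2 better than T1)
datasets = simulate_counts(0.3, 0.7, 0.9, 0.8, 0.95)
```

Each row of `datasets` is one simulated 2 × 2 table. The model would be fitted to each row, and the resulting medians and interval limits passed to `frequentist_summary`.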
As mentioned above, we need to specify at least 2 informative prior distributions for the model to be identifiable. We used 2 informative priors for the sensitivity and specificity of T1 (prior distribution ranging 0.7 ± 0.1 for sensitivity and 0.9 ± 0.05 for specificity) and uniform priors for the other parameters (i.e., sensitivity and specificity of T2 and prevalence). Prior information on the sensitivity and specificity of T1 was expressed as Beta(α,β) distributions by equating the midpoint of the range to the mean (μ) and one-quarter of the range to the standard deviation (σ) in order to obtain the alpha and beta parameters:
$$
\alpha = \mu\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right), \qquad \beta = (1-\mu)\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right)
$$
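The conversion from a prior range to Beta parameters can be sketched with the usual method-of-moments formulas for the Beta distribution. The function name `beta_from_range` is a hypothetical helper introduced for illustration.

```python
def beta_from_range(lower, upper):
    """Convert a prior range into Beta(alpha, beta) parameters by matching
    the mean to the midpoint and the standard deviation to range / 4."""
    mean = (lower + upper) / 2
    sd = (upper - lower) / 4
    common = mean * (1 - mean) / sd**2 - 1
    if common <= 0:
        raise ValueError("range too wide for a Beta distribution with this mean")
    return mean * common, (1 - mean) * common


# A range of 70% to 80% gives approximately Beta(224.25, 74.75)
alpha, beta = beta_from_range(0.70, 0.80)
```

The same helper applied to a range of 96% to 99% gives approximately Beta(421.53, 10.81), which matches the values reported for the LTBI application.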
Impact of modeling conditional dependence
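If the tests are positively correlated within the D+ and D- groups, then their sensitivity and specificity may be overestimated or underestimated if this conditional dependence is ignored [21,22]. We carried out additional simulations assuming that the 2 tests are conditionally dependent, while retaining the same values for sensitivity and specificity as given above in the simulations involving the conditional independence model. The joint probabilities may be expressed as:
$$
\begin{aligned}
P(T_1^+, T_2^+) &= \pi\,[\mathrm{sens}_1\,\mathrm{sens}_2 + \mathrm{covs}] + (1-\pi)\,[(1-\mathrm{spec}_1)(1-\mathrm{spec}_2) + \mathrm{covc}]\\
P(T_1^+, T_2^-) &= \pi\,[\mathrm{sens}_1(1-\mathrm{sens}_2) - \mathrm{covs}] + (1-\pi)\,[(1-\mathrm{spec}_1)\,\mathrm{spec}_2 - \mathrm{covc}]\\
P(T_1^-, T_2^+) &= \pi\,[(1-\mathrm{sens}_1)\,\mathrm{sens}_2 - \mathrm{covs}] + (1-\pi)\,[\mathrm{spec}_1(1-\mathrm{spec}_2) - \mathrm{covc}]\\
P(T_1^-, T_2^-) &= \pi\,[(1-\mathrm{sens}_1)(1-\mathrm{sens}_2) + \mathrm{covs}] + (1-\pi)\,[\mathrm{spec}_1\,\mathrm{spec}_2 + \mathrm{covc}]
\end{aligned}
\tag{4}
$$
where covs and covc are the covariances between the tests in the D+ and D- groups, respectively.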
As described in Dendukuri and Joseph [21], these parameters were assumed to be bounded such that covs ~ dunif(0, min(sens1, sens2) - sens1 × sens2) and covc ~ dunif(0, min(spec1, spec2) - spec1 × spec2), allowing only for positive conditional dependence. To simulate data from Equation (4), we set covs and covc to the midpoint of their range to reflect a moderate degree of conditional dependence. Due to the addition of 2 unknown parameters, we need to provide informative priors on at least 4 parameters (7 unknown parameters - 3 degrees of freedom). Although the bounds on the covariance provide partial information, we estimated the model with additional informative priors on the sensitivity and specificity of T2. The corresponding Beta(α,β) prior distributions can be found in Additional file 2: Table S2.
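The conditional dependence model can be sketched as follows: the covariance is added to the concordant cells and subtracted from the discordant cells within each disease group, and the upper bound of each covariance is the one stated above. The function names are hypothetical and the parameter values are illustrative.

```python
def covariance_upper_bounds(sens1, spec1, sens2, spec2):
    """Upper bounds of covs and covc when only positive dependence is allowed."""
    return min(sens1, sens2) - sens1 * sens2, min(spec1, spec2) - spec1 * spec2


def joint_probabilities_dependent(prev, sens1, spec1, sens2, spec2, covs, covc):
    """P(T1, T2) under conditional dependence; results coded 1 = positive, 0 = negative."""
    probs = {}
    for t1 in (1, 0):
        for t2 in (1, 0):
            sign = 1 if t1 == t2 else -1  # add covariance to concordant cells
            p_pos = (sens1 if t1 else 1 - sens1) * (sens2 if t2 else 1 - sens2) + sign * covs
            p_neg = (1 - spec1 if t1 else spec1) * (1 - spec2 if t2 else spec2) + sign * covc
            probs[(t1, t2)] = prev * p_pos + (1 - prev) * p_neg
    return probs


# Covariances set to the midpoint of their range, as in the simulations
max_covs, max_covc = covariance_upper_bounds(0.7, 0.9, 0.8, 0.95)
cells = joint_probabilities_dependent(0.3, 0.7, 0.9, 0.8, 0.95,
                                      covs=max_covs / 2, covc=max_covc / 2)
```

Setting both covariances to zero recovers the conditional independence probabilities.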
Non-identifiable latent class models are known to be heavily influenced by the subjective prior information used. While some may argue that it is impossible to study the consequences of prior misspecification because the prior information is subjectively defined for a given application, it is possible to study the impact of prior misspecification in a limited way in a simulated setting. We can expect that as the prior information moves away from the true values, the bias of the posterior estimates increases. However, this bias would also depend on the relative weight of the prior versus the data. Clearly, a weak prior distribution would cause less bias than a strong prior in the event that the prior is misspecified.
To examine the sensitivity of the IDI and AUCdiff statistics to prior information, we considered the following three types of prior misspecification that are likely to occur in practice: i) we replaced the range of prior information on the sensitivity and specificity of the standard test (T1) by point estimates equal to their true values (these would be very strong prior distributions); ii) we used point estimates of the sensitivity and specificity of T1 that were close to but not equal to their true values; and iii) we used wide prior distributions on the sensitivity and specificity of T1 that covered the true value but were not centered on it (these would be weak prior distributions).
Situations (i) and (ii) are akin to assuming that the sensitivity and specificity of the standard test are perfectly known [23]. This assumption, though hard to justify, is not uncommonly made in studies of the accuracy or effectiveness of a new diagnostic test in the absence of a gold standard [19,24]. Situation (iii) reflects the consequences of misspecifying the relative importance of the true values when specifying an informative prior. In practice, misspecified prior information is more likely to be close to the true values than to be completely unrelated to them.
Bayesian estimation
We used the BRugs package within R to fit the latent class model to each simulated dataset. To assess convergence, we ran 3 chains of the Gibbs sampler with different initial values. Convergence was checked by visual inspection of the history and density plots, and the Brooks-Gelman-Rubin statistic available within BRugs. We ran 50,000 iterations and dropped the first 5,000 burn-in iterations to report summary statistics based on 45,000 iterations (AUC was based on 5,000 iterations after model convergence). Median estimates from the posterior distribution are reported along with their 95% credible interval (CrI).
Application to diagnosis of LTBI
We evaluated the incremental value of the QFT over the TST in data from 2 published studies in India and Portugal, where both tests were performed simultaneously in healthcare workers with different BCG vaccination exposure. The TST has been shown to be less specific when the BCG vaccine is administered after infancy (e.g., during adolescence) or with multiple shots [18,25]. The Indian study consists of 719 healthcare workers, of whom 71% had a BCG vaccine scar [26]. Since the BCG vaccine is given once at birth in India, we expect the TST and QFT to perform similarly with respect to specificity [27]. In contrast, the Portuguese study consists of 1218 healthcare workers, of whom 70% had received ≥1 BCG vaccination after birth, which would lower the TST specificity [28].
We obtained prior information on the sensitivity and specificity of the TST from a previous meta-analysis [18]. For the Indian data, the TST sensitivity ranged from 70% to 80%, while its specificity ranged from 96% to 99%. We expressed this as Beta(224.25, 74.75) and Beta(421.53, 10.81) distributions for the sensitivity and specificity, respectively. For the Portuguese data, the TST sensitivity also ranged from 70% to 80%, while its specificity ranged from 55% to 65%, the latter corresponding to a Beta(229.8, 153.2) distribution.
Furthermore, we used Equation (4) to adjust for conditional dependence, since both TST and QFT measure cellular immune responses to M. tuberculosis antigens. We used informative priors for the sensitivity and specificity of QFT based on the same meta-analysis [18]. The sensitivity ranged from 70% to 80%, while the specificity ranged from 96% to 99% for both studies. These values were transformed into Beta(224.25, 74.75) and Beta(421.53, 10.81) distributions for the sensitivity and specificity, respectively. The prevalence and covariances were assumed to follow Uniform distributions. To study the sensitivity of the results to the form of the prior distribution, we replaced the Beta prior distributions by Uniform prior distributions with the same 95% CrI limits as those mentioned above. To further study the sensitivity of the results to the prior distribution, we used a wider prior distribution whose 95% credible interval covered the lower and upper limits of the 95% confidence interval estimated for each individual study included in the meta-analysis.
Decision rules for LTBI diagnosis based on observed data
In practice, the diagnosis of LTBI is based on observed test results rather than predicted probabilities from a latent class model [29]. Therefore, another way to view the incremental value of QFT is as the increase in the number of individuals who are correctly classified (i.e., true positive and true negative) within the D+ and D- groups, compared to the classification based on TST alone. We compared the following decision rules based on one or both tests: 1) diagnose LTBI if TST+; 2) diagnose LTBI if both TST+ and QFT+; 3) diagnose LTBI if either TST+ or QFT+. The number of D+ patients correctly classified by a decision rule is estimated as P(D+|rule+) multiplied by the number of patients who satisfy the rule. Similarly, the number of D- patients correctly classified is estimated as P(D-|rule-) multiplied by the number of patients who do not satisfy the rule.
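The calculation for the three decision rules can be sketched as below. For brevity the sketch assumes conditional independence between the tests, whereas the application adjusts for conditional dependence. The generic labels T1 and T2 stand for TST and QFT, and the counts and parameter values are hypothetical.

```python
from itertools import product

# Decision rules on (T1, T2) results coded 1 = positive, 0 = negative
RULES = {
    "T1+": lambda t1, t2: t1 == 1,
    "T1+ and T2+": lambda t1, t2: t1 == 1 and t2 == 1,
    "T1+ or T2+": lambda t1, t2: t1 == 1 or t2 == 1,
}


def correctly_classified(counts, prev, sens1, spec1, sens2, spec2):
    """Return {rule: (expected correctly classified D+, expected correctly classified D-)}."""
    cells = list(product((1, 0), repeat=2))
    # Joint probabilities P(T1, T2, D+) and P(T1, T2, D-)
    joint_pos = {(a, b): prev * (sens1 if a else 1 - sens1) * (sens2 if b else 1 - sens2)
                 for a, b in cells}
    joint_neg = {(a, b): (1 - prev) * (1 - spec1 if a else spec1) * (1 - spec2 if b else spec2)
                 for a, b in cells}
    results = {}
    for name, rule in RULES.items():
        met = [c for c in cells if rule(*c)]
        unmet = [c for c in cells if not rule(*c)]
        # P(D+ | rule+) and P(D- | rule-)
        p_dpos = sum(joint_pos[c] for c in met) / sum(joint_pos[c] + joint_neg[c] for c in met)
        p_dneg = sum(joint_neg[c] for c in unmet) / sum(joint_pos[c] + joint_neg[c] for c in unmet)
        # Multiply by the observed number of patients who do / do not satisfy the rule
        results[name] = (p_dpos * sum(counts[c] for c in met),
                         p_dneg * sum(counts[c] for c in unmet))
    return results


# Hypothetical observed counts and parameter values
counts = {(1, 1): 172, (1, 0): 108, (0, 1): 104, (0, 0): 616}
table = correctly_classified(counts, 0.3, 0.7, 0.9, 0.8, 0.95)
```

Comparing the entries for the two-test rules with the entry for "T1+" gives the change in correctly classified patients attributable to the second test.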
As this study was conducted using simulated data and data from published articles, ethics approval was not required.
Discussion
We have described how latent class models can provide information on the incremental value of a new diagnostic or screening test even in the absence of a gold standard test. Our simulations show that both the AUCdiff and IDI statistics can provide useful information on the incremental value in the absence of a gold standard. As in the case when a gold standard is present, the IDI statistic has a larger relative magnitude compared to the AUCdiff and can be interpreted as the average improvement in the predictive value. By considering different simulation settings for the new test’s accuracy, we found that the incremental value was greatest when both sensitivity and specificity of a new test were better than the standard test and that both incremental value statistics were close to zero when the new test was of no value. When adjusting for conditional dependence between tests, the incremental value of T2 was lower. When the model was mis-specified and ignored conditional dependence between the tests, both incremental value statistics were over-estimated as expected.
Bayesian estimation is particularly useful for latent class models that are non-identifiable due to insufficient degrees of freedom in the data, since it allows for the use of information external to the observed data. In our models, we used informative priors on the sensitivity and specificity of the standard test. As was the case in our motivating example of LTBI, evidence on these parameters can be obtained from the literature. One criticism of the latent class models we have used is their sensitivity to prior information. We believe that we have used the best available information on sensitivity and specificity of the TST and QFT tests resulting from a meta-analysis. Further, we carried out sensitivity analyses to other prior distributions. As we have shown, the Bayesian approach also provides credible intervals that have good coverage properties, unlike the limitations of the approximate frequentist intervals described previously for the IDI statistic [30].
We have argued that in the absence of a gold standard, all available test results are needed in the latent class model to provide the best estimate of the true disease status. Indeed, clinicians almost always rely on all available clinical information to make diagnostic decisions [24]. Alternative approaches to latent class analysis, including use of a composite reference standard or panel diagnosis, define a decision rule to definitively classify patients as disease positive or disease negative. Once such a definitive classification of disease status is obtained, methods for estimating incremental value in the presence of a gold standard may be used. The concern with this approach, of course, is that it may lead to reference standard bias [31]. In some situations, it may not be possible to implement these alternatives. In our motivating example, there were only two tests. Thus, it was not possible to define a composite reference standard. Moreover, if we used the simplistic approach of treating the older test as a gold standard, it would be equivalent to assuming that the new test has no incremental (or added) value. Hence, we feel a latent class approach is particularly valuable in this setting.
Since this is the first paper on the topic of estimating incremental value in the absence of a gold standard, we chose to focus on illustrating the concept in the simplest case involving only observed data from two diagnostic tests. We recognize that our model does not include patient characteristics (e.g., age) that may play a role in the diagnostic decision-making process, thereby limiting the variation in predicted probabilities. Further research is needed to extend latent class models to incorporate such covariates that could have an effect on the prevalence, sensitivity or specificity [32]. In addition, more complex models can be used when test results are continuous or there are more than 2 tests involved. Findings from previous work showing that increased sample size and an increase in informativeness of the prior distributions improve the precision of parameter estimates from a non-identifiable latent class model would also apply here [33], as all incremental statistics that we have described are functions of the prevalence, sensitivity and specificity parameters in the latent class model. In addition, more research is needed to examine the impact of the choice of a particular conditional dependence structure and the degree of conditional dependence on estimates of incremental value.
Another future direction would be estimating the more common NRI using plausible risk thresholds or even the category-free version [12]. In particular, our method may be able to address a recent criticism that the NRI cannot measure improvements in risk prediction at the population level [34], since the latent class model incorporates prevalence into the estimates. It should be mentioned that a number of recent articles have been critical of the IDI and NRI statistics [30,35]. In particular, it has been pointed out that they can be inflated for miscalibrated prediction models whereas the AUC may not be. This remains to be studied in the context when there is no gold standard.
An alternative approach to adding covariates to the model is to carry out a subgroup analysis. In our LTBI example, we estimated incremental value within subgroups defined by study setting. The QFT had different incremental value beyond the TST depending on the population and BCG vaccination policy. In low-risk groups, using the TST+ and QFT+ decision rule could help avoid unnecessary LTBI therapy. On the other hand, using the TST+ or QFT+ decision rule could help clinicians who are worried about missing LTBI cases in high-risk groups, such as HIV/AIDS patients and young children. Such reasoning has been used to support cost-effectiveness analyses of the TST and QFT for diagnosis of LTBI [36]. The Bayesian approach we propose is an improvement over such approaches since it takes into account the joint uncertainty in the sensitivity and specificity parameters of both tests [37].
Several national guidelines now exist on using IGRAs such as the QFT, and many low-incidence countries recommend a two-step process: if TST is positive then perform the QFT as a confirmatory test [29]. This approach is equivalent to the “diagnose LTBI if TST+ and QFT+” rule in terms of incremental value but is cheaper since not all patients receive both tests. In fact, the “diagnose LTBI if QFT+” rule (data not shown) would give similar results compared to the TST+ and QFT+ rule in the Portuguese data. However, the QFT is sold as a commercial kit that is more expensive than the TST. Ultimately, the decision to implement a new test into practice will depend on many factors, including patient preferences, risk of complications and cost considerations.
Competing interests
The authors declare that they have no competing interests.