To compare two diagnostic tests i and j, we want to estimate the difference in their performance. In reality, however, this difference may vary from one paper (study) to another. We therefore write Δ_{i,j,p} = PERF_{i,p} - PERF_{j,p}, where the difference Δ depends on the paper index p, and PERF_{i,p} is the observed performance of test i in paper p. To simplify notation, assume that a single number measures the performance of each test in each paper. We relax this assumption later, allowing for the distinction between the two types of error (FNR and FPR, or equivalently TPR and FPR). We decompose the differences
(1) Δ_{i,j,p} = PERF_{i,p} - PERF_{j,p} = δ_{i,j} + δ_{i,j,p},
where δ_{i,j} is the 'average' difference between the two tests, and δ_{i,j,p} is the deviation of the observed difference within paper p from the average δ_{i,j}. The δ_{i,j} is an estimator for the difference between the performances of the two tests. Note that by using the deviation parameterization (similar to an ANOVA model) [[12], pp. 51 & 45] we explicitly accept and account for the fact that the observed difference varies from one paper to another, while estimating the 'average' difference. This is similar to a random-effects approach, where a random distribution is assumed for the Δ_{i,j,p} and the mean parameter of that distribution is estimated. In other words, one does not need to assume a 'homogeneous' difference between the two tests across all papers and then estimate the 'common' difference [13].
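As a numeric sketch of decomposition (1), the 'average' difference and the per-paper deviations can be computed from hypothetical per-paper performance values (the numbers below are illustrative, not from the example dataset):

```python
# Hypothetical per-paper performance (e.g. LOR) of tests i and j in three papers.
perf_i = {1: 1.9, 2: 2.3, 3: 2.0}
perf_j = {1: 1.1, 2: 1.6, 3: 1.5}

# Observed per-paper differences: Delta_{i,j,p} = PERF_{i,p} - PERF_{j,p}
delta_p = {p: perf_i[p] - perf_j[p] for p in perf_i}

# 'Average' difference delta_{i,j} and per-paper deviations delta_{i,j,p}
delta_avg = sum(delta_p.values()) / len(delta_p)
deviations = {p: d - delta_avg for p, d in delta_p.items()}

# The decomposition reassembles exactly: Delta_{i,j,p} = delta_{i,j} + delta_{i,j,p}
for p in delta_p:
    assert abs(delta_p[p] - (delta_avg + deviations[p])) < 1e-12
```

By construction the deviations sum to zero across papers, which is exactly the deviation (ANOVA-style) parameterization described above.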
The observed test performance, PERF, may be measured on several different scales: paired measures such as sensitivity and specificity, positive and negative predictive values, likelihood ratios, post-test odds, and post-test probabilities for normal and abnormal test results, as well as single measures such as accuracy, risk or rate ratio or difference, Youden's index, area under the ROC curve, and the odds ratio (OR). When OR is used as the performance measure, the marginal logistic regression model
(2) logit(Result_{pt}) = β0 + β1*Disease_{pt} + β2*PaperID_{pt} + β3*Disease_{pt}*PaperID_{pt} + β4*TestID_{pt} + β5*Disease_{pt}*TestID_{pt} + β6*TestID_{pt}*PaperID_{pt} + β7*Disease_{pt}*TestID_{pt}*PaperID_{pt}
implements the decomposition of the performance. Model (2) is fitted to the (repeated-measures) grouped binary data, where the 2-by-2 tables of gold standard versus test results are extracted from each published paper. In model (2), Result is an integer-valued variable for a positive test result (depending on the choice of software, for grouped binary data Result is usually replaced by the number of positive test results over the total sample size for each group); Disease is an indicator for the actual presence of disease, ascertained by the gold standard; PaperID is a categorical variable for the papers included in the meta-analysis; and TestID is a categorical variable for the tests included. The regression coefficients β2 to β7 can be vector-valued, i.e. have several components, so the corresponding categorical variables should be represented by a suitable number of indicator variables in the model. The indexes p and t signify paper p and test t; they define the repeated-measures structure of the data [10]. Note that model (2) fits the general case where two or more tests are available for the disease and each test has been studied in one or more papers. Some papers may have studied more than one test, hence the results are not independent. Also, the collection of tests studied may change from one paper to another, hence the matched groups are incomplete.
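The grouped binary data layout behind model (2) can be sketched as follows, with one row per paper-test-disease group (the counts are hypothetical, for illustration only):

```python
# One 2-by-2 table per (paper, test) pair: gold standard (Disease) vs test result.
tables = {
    # (paper, test): (true_pos, false_neg, false_pos, true_neg)
    (1, "A"): (40, 10, 5, 45),
    (1, "B"): (35, 15, 8, 42),   # paper 1 studied two tests: dependent results
    (2, "A"): (30, 20, 10, 40),  # paper 2 studied only test A: incomplete groups
}

# Grouped binary rows: (PaperID, TestID, Disease, positives, total) --
# the "number of positives over total sample size" outcome form from the text.
rows = []
for (paper, test), (tp, fn, fp, tn) in tables.items():
    rows.append((paper, test, 1, tp, tp + fn))   # diseased group
    rows.append((paper, test, 0, fp, fp + tn))   # non-diseased group
```

Each 2-by-2 table thus contributes two grouped rows, and the shared PaperID encodes the repeated-measures dependence across tests within a paper.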
From model (2) one can show that

LOR_{pt} = β1 + β3*PaperID_{pt} + β5*TestID_{pt} + β7*TestID_{pt}*PaperID_{pt}

and therefore the difference between the performance of two tests i and j, measured by LOR, is

LOR_{pi} - LOR_{pj} = β5*TestID_{pi} - β5*TestID_{pj} + β7*TestID_{pi}*PaperID_{pi} - β7*TestID_{pj}*PaperID_{pj},

where we identify δ_{i,j} of the decomposition model (1) with β5*TestID_{pi} - β5*TestID_{pj}, and δ_{i,j,p} with β7*TestID_{pi}*PaperID_{pi} - β7*TestID_{pj}*PaperID_{pj}.
If there is an obvious and generally accepted diagnostic test that can serve as a reference category (RefCat) to which the other tests are compared, then a "simple" parameterization for the tests is sufficient. Usually, however, this is not the case. When there is no perceived referent test to which the other tests are to be compared, a "deviation from means" coding is preferred for the tests. Using the deviation parameterization for both TestID and PaperID in model (2), one can show that β5*TestID_{pt} is the average deviation of the LOR of test t from the overall LOR (the β1), where the overall LOR is the average over all tests and all papers. Therefore β5*TestID_{pt} of model (2) is equivalent to the δ_{i,j} of the decomposition model (1), and β7*TestID_{pt}*PaperID_{pt} to the δ_{i,j,p}.
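A minimal sketch of "deviation from means" (sum-to-zero) coding, assuming the convention that the last level is the omitted one (software differs on which level is omitted):

```python
def deviation_code(levels, level):
    """Sum-to-zero coding: k-1 indicator columns for k levels; the omitted
    (here: last) level is coded -1 in every column, so the coded effects
    average to zero -- the 'deviation from means' parameterization."""
    if level == levels[-1]:
        return [-1] * (len(levels) - 1)
    return [1 if level == l else 0 for l in levels[:-1]]

tests = ["A", "B", "C"]
print(deviation_code(tests, "A"))  # [1, 0]
print(deviation_code(tests, "C"))  # [-1, -1]
```

With this coding each test's coefficient is interpreted as a deviation from the overall mean rather than from an arbitrary referent test, matching the interpretation of β5*TestID_{pt} above.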
Proportional odds ratio model
Model (2) expands each study to its original sample size, and uses patients as the primary analysis units. Compared to a random-effects model, where papers are the primary analysis units, it has more degrees of freedom. However, in a real case not every test is studied in every paper; rather, most tests are absent from any given paper. The tests-by-papers data structure is therefore incomplete, with many unmeasured cells, and the three-way interaction model (2) may become over-parameterized. One may then want to drop the interaction terms involving both TestID and PaperID (the β6*TestID_{pt}*PaperID_{pt} and β7*Disease_{pt}*TestID_{pt}*PaperID_{pt} terms). Then for the reduced model
(3) logit(Result_{pt}) = β0 + β1*Disease_{pt} + β2*PaperID_{pt} + β3*Disease_{pt}*PaperID_{pt} + β4*TestID_{pt} + β5*Disease_{pt}*TestID_{pt}
we have

LOR_{pt} = β1 + β3*PaperID_{pt} + β5*TestID_{pt},

where the paper and test effects are completely separate. We call this reduced model the Proportional Odds Ratio (POR) model: the ratio of the odds ratios of two tests is assumed to be constant across papers, while the odds ratio of each test is allowed to vary across papers. Note the difference from the proportional odds model, where the ratio of odds is assumed to be constant [14]. In the POR model
(4) OR_{pt} = OR_p * exp(β5*TestID_{pt}),  t = 1, 2, ..., k,  p = 1, 2, ..., m,

where t is an index for the k diagnostic tests, and p is an index representing the m papers included in the analysis. OR_p is a function capturing the way the OR changes across papers. Then, to compare two diagnostic tests i and j,
OR_{pi} / OR_{pj} = exp(β5*TestID_{pi} - β5*TestID_{pj}),

where the ratio of the two ORs depends only on the difference between the effect estimates of the two tests, and is independent of the underlying OR_p across the papers. Thus the model makes no assumptions about the shape of OR_p (and in particular about the homogeneity of the ORs), but merely specifies a relationship between the ORs of the two tests.
One may want to replace the PaperID variable with a smooth function of FPR or TPR, such as restricted natural cubic splines. There are two potential advantages. First, this may preserve some degrees of freedom, which one can then spend by adding covariates to the model to measure their potential effects on the performance of the diagnostic tests; thus one would be able to explain why the performance of the same test varies across papers. Second, it allows plotting a ROC curve where the OR is not constant along the curve, a flexible ROC (HetROC) curve.
(5) logit(Result_{pt}) = β0 + β1*Disease_{pt} + β2*S(FPR_{pt}) + β3*Disease_{pt}*S(FPR_{pt}) + β4*TestID_{pt} + β5*Disease_{pt}*TestID_{pt} + β6*X_{pt} + β7*Disease_{pt}*X_{pt}
To test the POR assumption one may use model (2), where the three-way interaction of Disease and TestID with PaperID is included. However, for the majority of real datasets this would mean an over-parameterized model. Graphics can be used for a qualitative check of the POR assumption: for instance, the y-axis can be the LOR, while the x-axis is the paper number. To produce such a plot, it may be better to have the papers ordered in some sense. One choice is to compute an unweighted average of the (observed) ORs of all the tests a paper studied, use it as the OR of that paper, and then sort the papers by these ORs. The OR of a test may vary from one paper to another (with no restriction), but the POR assumption is that the ratio of the ORs of two tests remains the same from one paper to another. If the ORs of a test across papers are shown as a smooth curve, one expects the curves of two tests to be proportional to each other; on the log-OR scale, this means that the vertical distance between the two curves remains the same across the x-axis. To compute the observed LOR for a test in a paper, one may need to add some value (such as 1/2) to the cell counts if some counts are zero; however, this could introduce some bias into the estimates.
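The observed-LOR computation and the paper ordering for this diagnostic plot can be sketched as follows (hypothetical tables; the 1/2 continuity correction is applied only when a cell is zero, which is one of several conventions):

```python
import math

def observed_lor(tp, fn, fp, tn, cc=0.5):
    """Observed log odds ratio from a 2-by-2 table; a continuity
    correction cc is added to every cell when any cell is zero."""
    if 0 in (tp, fn, fp, tn):
        tp, fn, fp, tn = (x + cc for x in (tp, fn, fp, tn))
    return math.log((tp * tn) / (fn * fp))

# Hypothetical 2-by-2 tables per (paper, test).
tables = {(1, "A"): (40, 10, 5, 45), (1, "B"): (35, 15, 8, 42),
          (2, "A"): (18, 2, 0, 20)}   # zero cell -> correction applied

# Unweighted average of the observed ORs of the tests each paper studied,
# used to sort the papers along the x-axis of the plot.
paper_or = {}
for (p, t), cells in tables.items():
    paper_or.setdefault(p, []).append(math.exp(observed_lor(*cells)))
order = sorted(paper_or, key=lambda p: sum(paper_or[p]) / len(paper_or[p]))
```

Plotting each test's LOR against this paper order, proportional ORs appear as curves separated by a constant vertical distance on the log scale.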
Among the approaches for modeling repeated-measures data, we use generalized estimating equations to estimate the marginal logistic regression [15]. Software for estimating the parameters of a marginal POR model is widely available, including SAS (the genmod procedure), R (the geese function), and STATA (the xtgee command), with R being freely available open-source software [16].
One may use a non-linear mixed-effects modeling approach on the cell-count data to estimate the parameters of the POR model. The paper effect is declared as random, and the interaction of the random effect with Disease is included in the model, as indicated in model (2). However, such mixed-effects non-linear models are hard to converge, especially for datasets in which many papers study only one or a small number of the included tests (such as the dataset presented as the example in this paper). If convergence is good, it may be possible to fit a mixed model with the interaction of Disease, Test, and the random paper effect. Such a model relaxes the POR assumption, besides relaxing the assumption of OR homogeneity; in other words, one can use it to test the POR assumption quantitatively. Note that the LOR estimate from a marginal model is interpreted as a population average, while that from a mixed model is a conditional average, so there is a slight difference in their meaning.
Expanding the proportional odds ratio model
One may use the frameworks of generalized linear models (GLM) and generalized estimating equations (GEE) to extend the POR model and apply it to different scenarios. By using a suitable GLM link function and random component [[17], p. 72], one may fit the POR model to multi-category diagnostic tests, using baseline-category logits, cumulative logits, adjacent-categories logits, or continuation-ratio logits [[17], chapter 8]. A loglinear 'Proportional Performance' (PP) regression may be fitted to the cell counts, treating them as Poisson. Also, one may fit the PP model to the LORs directly, assuming a Gaussian random component with an identity link function. Comparing GEE estimates from fitting the model to the 2-by-2 tables, GEE estimates from fitting the model directly to the LORs, and a mixed model fitted to the LORs, statistical power usually decreases across the three. There is also the issue of incorporating the sample sizes, which differ across studies. Note that some nuisance parameters, such as the intercept and the coefficients of all main effects, need not be estimated, as they are no longer present in the model fitted directly to the LORs.
One may avoid dichotomizing the results of the diagnostic test by using the 'likelihood ratio' as the performance measure, and fitting a PP model to such a continuous outcome. For a scenario where the performance of a single test has been measured multiple times within the same study, for example with different diagnostic calibrations (multiple thresholds), the POR estimated by the GEE incorporates the data dependencies. When there is a multi-layer and/or nested clustering of repeated measures, software to fit a mixed-effects POR model may be more readily available than an equivalent GEE POR.
When the POR model is implemented as a logistic regression on 2-by-2 tables, it uses a grouped binary data structure. It takes minimal effort to fit the same logistic model to "ungrouped" binary data, the so-called "individual level" data.
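The expansion from grouped to individual-level records can be sketched as follows (hypothetical counts):

```python
# Grouped rows: (PaperID, TestID, Disease, positives, total) -- hypothetical.
grouped = [(1, "A", 1, 3, 5), (1, "A", 0, 1, 4)]

# Ungrouped ("individual level") rows: one record per patient, with
# Result coded 1 for a positive test result and 0 otherwise.
ungrouped = [(p, t, d, 1) for (p, t, d, k, n) in grouped for _ in range(k)] + \
            [(p, t, d, 0) for (p, t, d, k, n) in grouped for _ in range(n - k)]

print(len(ungrouped))  # 9 records = 5 diseased + 4 non-diseased patients
```

The same logistic model fitted to either layout yields the same coefficient estimates; only the data representation differs.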
Methods of meta-analysis that allow different outcomes (and different numbers of outcomes) to be measured per study, such as those of Gleser and Olkin [18] or DuMouchel [19], may be used to implement the POR model. This avoids conducting parallel meta-analyses, which is usually less efficient.