Background
The use of systematic literature review to inform evidence-based practice in diagnostics is rapidly expanding. Although the primary diagnostic literature is extensive, a number of problems remain for systematic reviews of diagnostic tests. Appropriate methods for the rigorous evaluation of diagnostic technologies have been well established [1-5]. However, available studies have generally been poorly designed and reported [6-8]. Similarly, although a number of quality checklists for diagnostic accuracy studies have been proposed [9] and there is growing evidence on the effects of bias in such studies [10], there has been no rigorously evaluated, evidence-based quality assessment tool for diagnostic studies.
The objective of this study was to investigate, using regression analysis, the impact of quality on the results of a diagnostic meta-analysis. A large diagnostic systematic review was required to provide sufficient studies for regression analysis of the impact of individual components of quality on results.
We have recently completed a systematic review that aimed to determine the most appropriate pathway for the diagnosis and further investigation of urinary tract infection (UTI) in children [11]. It included an assessment of the accuracy of tests for three different clinical stages of UTI: the diagnosis of UTI, localisation of infection, and further investigation of patients with confirmed UTI. The nature of the tests included in these three clinical sections of the review differed. Tests used to diagnose UTI were generally laboratory-based or near-patient methods, with relatively objective interpretation of results, e.g. dipstick tests and microscopy. By contrast, tests used to investigate confirmed UTI mainly utilised imaging technologies, which are largely subjective in their interpretation and for which diagnostic thresholds are difficult to define. Tests used to localise infection spanned both categories. We hypothesised that the components of methodological quality affecting results were likely to differ between the three sections of the review. Such potential differences may indicate a need for topic-specific checklists for the assessment of quality in diagnostic studies.
A secondary aim of this study was to contribute to the evaluation of QUADAS, an evidence-based tool for the assessment of the quality of diagnostic accuracy studies that was specifically developed for use in systematic reviews of diagnostic tests [12], by investigating the importance of specific QUADAS items.
Methods
We used QUADAS [12] (Table 1) to assess the quality of primary studies included in the review. Items were rated as 'yes', 'no', or 'unclear'. We examined differences in the individual QUADAS items fulfilled, as well as their impact on test performance. The review divided the diagnosis and further investigation of UTI into the following three clinical stages: diagnosis of UTI, localisation of infection, and further investigation of confirmed UTI. Each stage used different types of diagnostic test, which were considered to involve different quality concerns.
Table 1
The QUADAS tool
1. Was the spectrum of patients representative of the patients who will receive the test in practice?
2. Were selection criteria clearly described?
3. Is the reference standard likely to correctly classify the target condition?
4. Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests? (disease progression bias)
5. Did the whole sample, or a random selection of the sample, receive verification using a reference standard of diagnosis? (partial verification bias)
6. Did patients receive the same reference standard regardless of the index test result? (differential verification bias)
7. Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)? (incorporation bias)
8. Was the execution of the index test described in sufficient detail to permit replication of the test?
9. Was the execution of the reference standard described in sufficient detail to permit its replication?
10. Were the index test results interpreted without knowledge of the results of the reference standard? (test review bias)
11. Were the reference standard results interpreted without knowledge of the results of the index test? (diagnostic review bias)
12. Were the same clinical data available when test results were interpreted as would be available when the test is used in practice? (clinical review bias)
13. Were uninterpretable/intermediate test results reported?
14. Were withdrawals from the study explained?
We analysed results grouped by clinical stage. Within these groups, we pooled studies of similar tests or test combinations where sufficient data were available and where pooling was clinically meaningful (Table 2). Based on published guidance [13, 14], we required a minimum of ten studies for regression analysis.
Table 2
Tests included in/excluded from the regression analysis (number of studies in parentheses)

| Clinical stage | Included | Excluded |
| --- | --- | --- |
| Diagnosis | Dipstick: nitrite (23), LE (14), nitrite or LE positive (15); Microscopy: pyuria (28), bacteriuria (22) | Clinical history (6); Dipstick: nitrite and LE positive (9), glucose (4), protein (2), blood (1), protein and LE positive (1), combinations of 3 dipstick tests (5); Microscopy: pyuria or bacteriuria (8), pyuria and bacteriuria (8); Culture: standard (1), dipslide (1); Combinations of different tests (10) |
| Localisation | Ultrasound (20); Laboratory based tests (16) | Clinical history (5); Imaging techniques: MCUG (7), MRI (1), CT (1), IVP (4), cystography (2), scintigraphy (3) |
| Further investigation | Detection of reflux: ultrasound (28): standard (11), contrast enhanced (17) | Detection of reflux: IVP (4), voiding radionuclide cystography (3), NAG/creatinine ratio (1), scintigraphy (3), risk scoring system (1); Prediction of scarring: ultrasound (2), IVP (1), non-invasive indicators (1), MCUG (2); Detection of scarring: IVP (4), static scintigraphy (7), dynamic scintigraphy (2), MCUG (4), cystography (1), MRI (1), US and MCUG (1) |
We estimated summary receiver operating characteristic (SROC) curves using the following regression model [15]. For each study we calculated:

D = logit(sensitivity) - logit(1 - specificity) = log diagnostic odds ratio (DOR)

S = logit(sensitivity) + logit(1 - specificity)

and estimated a and b by regressing D against S across studies:

D = a + bS
We used both weighted and unweighted models, weighting by sample size in the weighted model. We chose to weight by sample size rather than by inverse variance, a method sometimes used in this type of analysis, because we believe that inverse-variance weighting can produce biased results: the DOR is associated with its variance, so large DORs will inevitably have large variances, and this will be reflected in the weightings.
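The weighted fit of the D = a + bS regression can be sketched as follows. This is a minimal illustration, not the analysis code used in the review; the study tuples (sensitivity, specificity, sample size) are hypothetical values chosen only to exercise the functions.

```python
import math

def logit(p):
    """Log-odds transform."""
    return math.log(p / (1 - p))

def d_and_s(sens, spec):
    """Transform one study's sensitivity/specificity into (D, S):
    D = logit(sens) - logit(1 - spec) = log diagnostic odds ratio
    S = logit(sens) + logit(1 - spec)"""
    return (logit(sens) - logit(1 - spec),
            logit(sens) + logit(1 - spec))

def fit_sroc(studies):
    """Closed-form weighted least-squares fit of D = a + b*S,
    weighting each study by its sample size."""
    pts = [(d_and_s(se, sp), n) for se, sp, n in studies]
    w = [n for _, n in pts]
    s = [p[1] for p, _ in pts]
    d = [p[0] for p, _ in pts]
    sw = sum(w)
    s_bar = sum(wi * si for wi, si in zip(w, s)) / sw
    d_bar = sum(wi * di for wi, di in zip(w, d)) / sw
    b = (sum(wi * (si - s_bar) * (di - d_bar) for wi, si, di in zip(w, s, d))
         / sum(wi * (si - s_bar) ** 2 for wi, si in zip(w, s)))
    a = d_bar - b * s_bar
    return a, b

# hypothetical studies: (sensitivity, specificity, sample size)
a, b = fit_sroc([(0.90, 0.80, 120), (0.85, 0.85, 60), (0.95, 0.70, 200)])
```

Because the model has a single covariate (S), the closed-form weighted least-squares solution above suffices; swapping the sample-size weights for inverse-variance weights would change only the `w` list.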
We assessed between-study heterogeneity through visual examination of forest plots and statistically using the Q statistic [16]. Where sufficient data were available, we used regression analysis to investigate whether individual QUADAS items, and additional variables thought likely to be associated with diagnostic accuracy, were associated with the DOR, and hence whether differences in these items between studies accounted for some of the observed heterogeneity. Where data were available, the following additional variables were investigated:
Patient age (<2 years, <5 years, <12 years and <18 years) was included to examine possible variation with age within the paediatric population.
The geographic region where studies were conducted was included to account for possible regional differences in test technology and infective agent.
Specific variations in index test technique were also included: for microscopy for pyuria and bacteriuria, a variable indicating whether the sample was centrifuged; for microscopy for bacteriuria, a variable indicating whether a Gram stain was used; and for ultrasound for the detection of reflux, a variable indicating whether or not a contrast agent was used.
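The Q statistic used to assess heterogeneity can be sketched on the log-DOR scale as below. This is the conventional form, which weights by inverse variance (distinct from the sample-size weighting used in the meta-regression); the 2x2 tables are hypothetical, and the 0.5 continuity correction is one common convention rather than necessarily the one used in the review.

```python
import math

def log_dor_and_var(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its variance from a 2x2 table
    (0.5 added to each cell as a continuity correction)."""
    tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    log_dor = math.log((tp * tn) / (fp * fn))
    var = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return log_dor, var

def cochran_q(tables):
    """Cochran's Q for heterogeneity of log DORs across studies.
    Under homogeneity, Q follows a chi-square distribution with
    k - 1 degrees of freedom."""
    stats = [log_dor_and_var(*t) for t in tables]
    w = [1 / v for _, v in stats]        # inverse-variance weights
    y = [ld for ld, _ in stats]
    pooled = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, y))
    return q, len(tables) - 1

# hypothetical 2x2 tables: (TP, FP, FN, TN)
q, df = cochran_q([(90, 10, 10, 90), (50, 50, 50, 50)])
```

A large Q relative to its degrees of freedom indicates heterogeneity beyond chance, motivating the covariate investigation described above.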
The SROC model [15] was extended to include each of the 14 QUADAS items and each of the variables above as individual covariates [17]. As each QUADAS item can be scored as "yes", "no" or "unclear", we included QUADAS items as categorical variables with three possible levels, thus including the comparisons "yes vs no" and "yes vs unclear". This allowed us to make some distinction between associations of aspects of methodological quality with test performance and associations of completeness of reporting with test performance. A number of QUADAS items received only two of the three possible scores (i.e. were scored either "yes" or "no", "yes" or "unclear", or "no" or "unclear"); these items were therefore included as dichotomous variables.
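The categorical coding of a three-level QUADAS item can be illustrated as follows. The helper name `quadas_dummies` and the ratings list are hypothetical; the coding uses "yes" as the reference category so the two indicators correspond to the "yes vs no" and "yes vs unclear" comparisons.

```python
def quadas_dummies(ratings):
    """Code one 3-level QUADAS item ('yes'/'no'/'unclear') as two
    indicator variables with 'yes' as the reference category:
    (no_vs_yes, unclear_vs_yes). Also return the observed levels;
    an item with only two observed levels collapses to one dummy."""
    rows = [(int(r == "no"), int(r == "unclear")) for r in ratings]
    observed = sorted(set(ratings))
    return rows, observed

# hypothetical ratings of one item across four studies
rows, observed = quadas_dummies(["yes", "no", "unclear", "yes"])
```

Returning the observed levels makes it easy to spot the items that, as described above, received only two of the three possible scores and so enter the model as a single dichotomous variable.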
A multivariate linear regression analysis was conducted. Initially, we performed univariate analyses with each item included separately in the model. Items that showed moderate evidence of an association with D, defined as p < 0.10, were investigated further using step-down regression analysis. All such items were entered into the multivariate model and then dropped in a step-wise fashion, with the item showing the weakest evidence of an association (largest p-value) dropped first. For covariates with more than one level, evidence of an association of one indicator variable with test performance was considered sufficient for inclusion in the model. The final model was reached when all remaining items showed strong evidence of an association with D, defined as p < 0.05. Interaction terms were not included.
Associations of covariates with D were expressed as relative diagnostic odds ratios (RDOR). The DOR is an overall measure of diagnostic accuracy, calculated as the odds of a positive result among diseased persons divided by the odds of a positive result among non-diseased persons; a test that provides no diagnostic evidence has a DOR of 1.0. The RDOR is calculated as the DOR when the covariate is present divided by the DOR when the covariate is absent, and therefore indicates the overall impact of a given covariate on diagnostic accuracy.
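The step-down selection and the RDOR interpretation can be sketched as below. The `p_value(model_items, item)` callback is hypothetical, standing in for the significance test from the fitted regression, which is not reproduced here; the thresholds mirror the p < 0.10 entry and p < 0.05 retention criteria described above.

```python
import math

def rdor(coefficient):
    """Relative diagnostic odds ratio for a covariate in the log-DOR
    model: RDOR = exp(coefficient). RDOR > 1 means studies with the
    covariate present report higher diagnostic accuracy."""
    return math.exp(coefficient)

def step_down(items, p_value, enter=0.10, stay=0.05):
    """Step-down selection: start from all items with univariate
    p < enter, then repeatedly drop the item with the largest
    multivariate p-value until every remaining item has p < stay."""
    current = [i for i in items if p_value([i], i) < enter]
    while current:
        worst = max(current, key=lambda i: p_value(current, i))
        if p_value(current, worst) < stay:
            break
        current.remove(worst)
    return current
```

For example, a covariate with a coefficient of 0 in the log-DOR model gives rdor(0.0) == 1.0, i.e. no impact on accuracy.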
Discussion
The methodological quality of primary studies remains a significant issue for systematic reviews of diagnostic tests [8, 205, 206]. The STARD initiative has provided clear guidance for the reporting of diagnostic accuracy studies [5]; this should have a positive impact on the quality of the diagnostic literature in the future. The QUADAS tool facilitates systematic evaluation of the quality of diagnostic accuracy studies and was specifically developed for use in systematic reviews of diagnostic tests [12]. However, where studies are poorly reported, the information that can be derived from quality assessment is limited: we cannot know whether an unreported QUADAS item reflects a true methodological flaw or poor reporting of a study that may be methodologically sound. Many of the studies included in our review were poorly reported, so our assessment of the impact of components of methodological quality on diagnostic accuracy may partially reflect completeness of reporting. Whilst poor reporting remains a widespread problem, it is almost impossible to assess the impact of components of methodological quality on the results of diagnostic meta-analyses.
The common practice of using summary quality scores in systematic reviews has been widely debated elsewhere [207-209]. Summary scores, when used to inform quality-based analyses, may mask important effects of individual quality components [210]. As we report, the numbers of QUADAS items adequately addressed by studies included in our review were similar across the three clinical stages assessed in the review. Had the number of QUADAS items fulfilled been used as a summary score, potentially important variations in the individual items fulfilled would have been hidden. We therefore advocate that components of quality assessment should be reported fully, and their impact on outcome measures analysed individually rather than as summary scores.
Although ours was a large review (it included 187 studies reporting 487 data sets), our analysis of the impact of methodological quality on diagnostic accuracy was severely limited both by the diversity of the included studies (few tests were evaluated by sufficient studies to allow meaningful meta-analytic pooling and investigation of heterogeneity) and by incomplete reporting. All of the data sets used were sub-optimal, in that the numbers of observations were low in comparison to the number of variables investigated in the multivariate analyses [13]. Although different types of diagnostic test were evaluated in the three clinical stages of the review, generalisability is limited in that all data concerned a single condition (UTI). A number of the items found to be associated with test performance related to specific test methodologies (e.g. Gram stain and contrast-enhanced ultrasound) and have no generalisability elsewhere. These items showed associations in both the weighted and unweighted analyses. For the individual quality items there were some differences between the results of the weighted and unweighted analyses. In general, the weighted analyses showed more intuitive associations, whereas unweighted analyses more often produced results that were difficult to explain. For example, for leukocyte esterase dipstick tests the unweighted analysis found the test to be more accurate in children aged <12 years than in those aged <2 years; this might be expected, probably reflecting a higher likelihood of sample contamination in younger children, yet no difference in accuracy was found between children aged <18 years and those aged <2 years. For tests for both the diagnosis and the further investigation of UTI, weighted analyses showed an association between a number of variables relating to quality of reporting and diagnostic accuracy (well-reported studies had higher DORs). We might expect this association to extend to diagnostic accuracy studies of all types of tests, but the present study is not adequate to demonstrate this. Weighted analysis of studies of ultrasound for the detection of reflux showed that the DOR was higher where studies reported sufficient information to determine that disease progression bias had been avoided; this association was not shown in the unweighted analysis. Disease progression bias is a particular issue for imaging studies of this type, where follow-up examinations (used as the reference standard of diagnosis) may be scheduled some time after the ultrasound (usually the initial examination).
The information derived from these analyses is also limited by the use of the summary ROC approach to pool studies. This method takes the DOR as the dependent variable. The DOR is a single indicator of test performance: it shows how much more frequently a positive test result occurs in a person with the condition of interest than in one without it, relative to how much more frequently a negative result occurs in a person without the condition than in one with it. Using the DOR to investigate heterogeneity means that we cannot assess whether the factors investigated are associated with paired measures of diagnostic accuracy, such as sensitivity and specificity, or positive and negative likelihood ratios. Often, factors that increase sensitivity will decrease specificity, and vice versa; factors producing this pattern of change may have no effect on an overall measure such as the DOR, so using the DOR to investigate heterogeneity may miss clinically relevant associations. Recently a new method for pooling sensitivity and specificity, known as the "bivariate model", has been developed [211]. It preserves the underlying two-dimensional nature of the data and produces direct pooled estimates of sensitivity and specificity, incorporating any correlation that might exist between these two measures. The model can be extended to include explanatory variables, giving separate effects on sensitivity and specificity. This method has two advantages over the standard method: (1) the pooled estimates of sensitivity and specificity take into account the correlation between these two measures; and (2) the effect of possible sources of heterogeneity on both sensitivity and specificity can be investigated in a single model, rather than only their effect on a single measure of test performance, the DOR. These methods may have potential applications in future studies of this type.
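As a small illustration of the two-dimensional data the bivariate model operates on (fitting the full random-effects model requires specialist software and is not attempted here; the study pairs below are hypothetical):

```python
import math

def logit(p):
    """Log-odds transform."""
    return math.log(p / (1 - p))

def logit_pairs(studies):
    """Per-study (logit sensitivity, logit specificity) pairs - the
    paired data that the bivariate model pools directly, instead of
    collapsing each study to a single DOR."""
    return [(logit(se), logit(sp)) for se, sp in studies]

def correlation(pairs):
    """Sample correlation between logit(sens) and logit(spec).
    A negative value reflects the usual trade-off between
    sensitivity and specificity that the bivariate model
    incorporates when pooling."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# hypothetical studies trading sensitivity against specificity
pairs = logit_pairs([(0.90, 0.70), (0.80, 0.80), (0.70, 0.90)])
r = correlation(pairs)
```

When studies trade sensitivity against specificity, as here, the correlation is strongly negative; a DOR-only analysis would treat such studies as interchangeable, which is exactly the limitation described above.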
Conclusion
Given the limitations we describe, the results of this study should be treated as hypothesis generating. Further work is needed to elucidate the influence of components of the methodological quality of primary studies on the results of diagnostic meta-analyses; large data sets of well-reported primary studies are needed to address this question, and without significant improvements in the reporting of primary studies, progress in this area will be limited. The components of quality assessment should always be reported, and their impact on summary outcome measures investigated, individually rather than as summary quality scores. Careful consideration should be given to the choice of weighting when conducting regression analyses; weighting by sample size appears the most appropriate method for analyses of diagnostic accuracy studies, but this area requires further investigation.
Acknowledgements
We would like to thank Professor Martin Bland of the University of York and Mr Roger Harbord of the University of Bristol for their statistical advice.
The work was done as part of a project commissioned and funded by the NHS R&D Health Technology Assessment Programme (project number 01/66/01). The views expressed in this review are those of the authors and not necessarily those of the Standing Group, the Commissioning Group, or the Department of Health.
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
All authors contributed towards the conception and design of the study and the interpretation of the data. They also read and approved the final manuscript. PW and MW participated in data extraction, the analysis of data, and drafted the article.