Introduction

Chronic pancreatitis (CP) is a disabling inflammatory disease of the pancreas characterized by severe recurrent or continuous abdominal pain and considerable impact on the quality of life [1–4]. Patients with CP usually develop endocrine and exocrine insufficiency during the course of the disease as a result of the progressive loss of pancreatic parenchyma.

There is a lack of international consensus regarding the initial diagnosis of CP, particularly at its early stages. The diagnosis is often made by a combination of clinical symptoms (e.g. abdominal pain, malabsorption, diabetes mellitus), pancreatic function tests (e.g. fecal elastase-1) and morphological abnormalities seen on imaging (e.g. calcifications, ductal lesions, pseudocysts) [5, 6]. Imaging plays a key role in the diagnosis and therapeutic management of patients with CP. The most frequently used imaging modalities for CP are endoscopic ultrasonography (EUS), endoscopic retrograde cholangiopancreatography (ERCP), magnetic resonance imaging (MRI), computed tomography (CT) and ultrasonography (US).

The aim of this meta-analysis was to determine the diagnostic accuracy of imaging modalities for the initial diagnostic assessment of CP.

Methods

Search

A search was performed in the Cochrane Library, MEDLINE, EMBASE and CINAHL databases up to September 2016, without restrictions on publication date or language. The search included terms for chronic pancreatitis, EUS, ERCP, MR imaging, CT and US. For the detailed search strategy, see Appendix Table 5.

Selection of studies

Two reviewers (YI and MAK) independently screened all search hits on title and abstract, and eligible articles on full text. Disagreements were resolved through discussion with a third reviewer (MAB). Studies were eligible when EUS, ERCP, MR imaging, CT or US was evaluated in patients with suspected CP. Duplicates, reviews, letters, case reports and book chapters were excluded. The remaining studies were potentially eligible and their full text was retrieved. To identify additional relevant studies, the reference lists of the included studies were checked manually. Studies were included if they met the following criteria: (1) sufficient data were reported to construct 2 × 2 tables (true positive, false positive, true negative and false negative); (2) the imaging technique was compared with a reference standard (e.g. surgery, histology, follow-up). Exclusion criteria were: (1) evaluation of imaging techniques other than the aforementioned (e.g. PET-CT, EUS-FNA, EUS-elastography); (2) imaging techniques used for treatment of patients with CP (e.g. therapeutic ERCP, EUS-guided pseudocyst drainage); (3) in vitro studies; (4) studies that included fewer than five patients with CP; (5) studies in which no separate analysis was done for patients with CP; and (6) full-text articles that were not available or retrievable.

Data extraction and critical appraisal

Data were extracted systematically from the included studies by using a structured study record form. The following study design and patient characteristics were extracted: name of the first author, country of origin, year of publication, name of journal, study design, total number of patients included, number of included patients with CP, median or mean age, the proportion of male patients, and the patient inclusion criteria.

The following imaging characteristics were extracted: type of imaging modality, scoring criteria, technical features for each modality, and reported observer experience. Data on the reference standard (e.g. clinical follow-up, surgery, histology) were also extracted.

The methodological quality of the included articles was assessed with the Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) tool [7]. The QUADAS-2 tool evaluates the risk of bias in four domains (patient selection, index test, reference standard, flow and timing) and the clinical applicability in the first three domains. Signaling questions were used to help assess the risk of bias and applicability. Possible answers were ‘yes’, ‘no’ or ‘unclear’, in which ‘yes’ indicates no risk of bias. In addition, the GRADE scoring system for diagnostic tests was used, which assesses the quality of evidence for each imaging modality [8, 9]. Although the criteria are applicable to diagnostic test accuracy, the methods are less well established than for interventional studies [10]. Two reviewers (YI and MAK) independently applied the QUADAS-2 tool and the GRADE scoring system, and all disagreements were resolved by consensus.

Data analysis

Overall diagnostic accuracy

For each included study we constructed a 2 × 2 contingency table for each imaging modality. If diagnostic accuracy was compared between different observers, mean values were calculated. Sensitivity and specificity estimates, the positive and negative predictive values, and the accuracy were calculated from the reconstructed contingency tables. We used the I² statistic with 95% confidence interval (95% CI) to quantify heterogeneity [11]. Mean logit sensitivity and specificity were acquired, and the anti-logit transformation was then applied to calculate summary estimates of sensitivity and specificity with 95% CIs. Forest plots were made to visualize the sensitivity and specificity with the 95% CIs. Summary estimates of sensitivity and specificity, including 95% CI, were obtained by using a random-effects model [12]. In cases where a negative covariance between the logit sensitivity and logit specificity was obtained, summary receiver operating characteristic (sROC) curves were generated for each separate imaging modality. We used the z test to evaluate differences in sensitivity and specificity between the five imaging modalities. A p value of less than 0.05 indicated a statistically significant difference.
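As an illustration of the per-study calculations described above, the following minimal Python sketch derives the accuracy measures from a single 2 × 2 table and computes a 95% CI for a proportion on the logit scale, followed by the anti-logit back-transformation. The names `TwoByTwo`, `accuracy_measures` and `logit_ci`, the 0.5 continuity correction, and the example counts are assumptions made for illustration only; they are not taken from the included studies or from the software used in this meta-analysis.

```python
import math
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts of a 2 x 2 table: index test versus reference standard."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return the standard diagnostic accuracy measures for one table."""
    total = t.tp + t.fp + t.fn + t.tn
    return {
        "sensitivity": t.tp / (t.tp + t.fn),
        "specificity": t.tn / (t.tn + t.fp),
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "accuracy": (t.tp + t.tn) / total,
    }


def inv_logit(x: float) -> float:
    """Anti-logit transformation: map a logit back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def logit_ci(events: int, total: int, z: float = 1.96) -> tuple:
    """95% CI for events/total, computed on the logit scale.

    Assumption: 0.5 is added to both cells so that proportions of
    0 or 1 still have a finite logit and standard error.
    """
    a = events + 0.5
    b = (total - events) + 0.5
    est = math.log(a / b)             # logit of the corrected proportion
    se = math.sqrt(1.0 / a + 1.0 / b)  # standard error on the logit scale
    return inv_logit(est - z * se), inv_logit(est + z * se)


# Hypothetical study: 50 patients with CP, 50 without.
table = TwoByTwo(tp=40, fp=5, fn=10, tn=45)
measures = accuracy_measures(table)
sens_low, sens_high = logit_ci(table.tp, table.tp + table.fn)
```

Working on the logit scale keeps the confidence limits inside the 0 to 1 range, which is why the interval is back-transformed rather than computed directly on the proportion.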

Heterogeneity exploration

To explore causes of heterogeneity, the following QUADAS-2 factors were incorporated in the bivariate model for all imaging modalities, and their effect on sensitivity and specificity was evaluated. Each factor was classified as low risk of bias versus high or unclear risk of bias: (a) patient selection, (b) criteria for the index test used, (c) sufficient description of and verification with the reference standard, and (d) flow and timing.

Head to head comparison

A head to head comparison was performed in studies that compared the diagnostic accuracy of two or more imaging modalities. Heterogeneity was quantified by the I² statistic, with 95% CI. Random-effects (I² > 25%) and fixed-effects (I² ≤ 25%) models were used to obtain summary estimates of sensitivity and specificity, which were compared with one another by a paired z test.
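The model choice described above can be sketched in Python as follows. This is a simplified univariate illustration under stated assumptions, not the analysis run in RevMan or SAS: it pools logit-transformed estimates by inverse-variance weighting, derives I² from Cochran's Q, and switches to a DerSimonian-Laird random-effects estimate when I² exceeds 25%. The function names `pool_logit` and `z_test` are hypothetical, and the z test shown is the unpaired form, which ignores the within-study correlation that a paired z test accounts for.

```python
import math


def pool_logit(estimates, variances, threshold=25.0):
    """Pool logit-scale estimates; choose the model from I^2.

    Returns (pooled_estimate, standard_error, i2_percent, model_name).
    """
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum_w

    # Cochran's Q and the I^2 statistic (percentage, floored at 0).
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    if i2 <= threshold:
        return fixed, math.sqrt(1.0 / sum_w), i2, "fixed"

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum_re
    return pooled, math.sqrt(1.0 / sum_re), i2, "random"


def z_test(est_a, se_a, est_b, se_b):
    """Two-sided z test for the difference of two pooled estimates.

    Assumption: the two estimates are treated as independent.
    """
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p value
    return z, p


# Hypothetical logit sensitivities and their variances.
homogeneous = pool_logit([1.0, 1.0, 1.0], [0.1, 0.2, 0.1])
heterogeneous = pool_logit([0.0, 2.0, 4.0], [0.1, 0.1, 0.1])
z_value, p_value = z_test(2.0, 0.3, 1.0, 0.4)
```

In the first hypothetical set the estimates agree exactly, so I² is 0% and the fixed-effects result is returned; in the second, I² is well above 25% and the random-effects weights are used instead.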

For data analysis, Review Manager (RevMan, version 5.3. Copenhagen: The Cochrane Collaboration, 2014) and SAS (version 9.3; SAS Institute, Cary, NC) were used. We adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [13].

Results

Study selection

The initial search resulted in 11,111 hits, of which 2988 duplicates were removed, resulting in a total of 8123 titles and abstracts that were screened for eligibility. The full text of 277 articles was retrieved; 43 of these articles fulfilled the inclusion criteria. See Appendix Table 6 for the excluded articles. Figure 1 shows the flow chart of the search.

Fig. 1 Flow chart

Study and patient characteristics

Study characteristics, including the reference standard for the diagnosis of CP for each included study, are listed in Table 1. The 43 included studies were published between 1975 and 2016; 26 studies were prospective and 23 studies were published after the year 2000. A total of 3460 patients were evaluated, of which 1242 patients were diagnosed with CP [14–56]. The age of the patients ranged from 36 to 65 years, with a median of 50% male. Criteria for selection of patients were those with suspected pancreatic disease or patients with suspected CP. Patient characteristics are depicted in Table 2.

Table 1 Study characteristics of included studies
Table 2 Patient characteristics of included studies

The risk of bias, assessed by QUADAS-2, was low in 28% of the studies and high in 19% of the studies. The concerns about applicability were low in 30% of the studies and high in 40% of the studies. The QUADAS-2 characteristics for each domain are depicted in Fig. 2 and outlined for each study in Appendix Table 7. The quality of evidence for all five imaging modalities according to the GRADE scoring system was very low. The GRADE scores for each imaging modality and characteristics for each study are outlined in Appendix Tables 8 and 9.

Fig. 2 Summary of study quality (QUADAS-2)

EUS was the most frequently evaluated imaging modality; 16 studies including 1249 patients [15, 19–23, 27, 28, 36, 37, 41, 42, 48, 51, 53, 56]. ERCP was studied in 11 studies including 742 patients [14, 20, 26, 28, 29, 33, 34, 39, 46, 50, 52]; MRCP, including secretin-enhanced MRCP, was evaluated in 14 studies including 933 patients [14, 16–18, 25, 30, 38, 42–44, 47, 49, 54, 55]; CT in 10 studies including 700 patients [20, 24, 25, 29, 31, 33, 40, 45, 46, 50] and abdominal US in 10 studies which included 1005 patients [20, 24, 26, 29, 32, 34–36, 46, 50]. The imaging characteristics for each study and modality in an individual study are listed in Appendix Table 11. Three of the 43 articles reported on complications of the imaging modality used; these were complications related to ERCP (being post-ERCP pancreatitis) with a mean complication rate of 4% [14, 20, 28].

Overall diagnostic accuracy

Analyses for summary estimates of sensitivity and specificity were done for EUS, ERCP, MRI, CT and US (Table 3). Figures 3 and 4 show sensitivity and specificity of individual studies in forest plots and in receiver operator curves (ROC), respectively. For MRI and US, a negative covariance between the logit sensitivity and logit specificity was not obtained; therefore, no sROC could be drawn for these modalities. The summary estimate of sensitivity for EUS, ERCP, MRCP, CT and US was 81%, 82%, 78%, 75% and 67%, respectively. The summary estimate of specificity for EUS, ERCP, MRCP, CT and US was 90%, 94%, 96%, 91% and 98%, respectively. Sensitivity of ERCP was significantly higher than sensitivity of US (p = 0.018). Other pairwise comparisons of sensitivity between imaging modalities revealed no significant difference. Specificity did not differ significantly among the modalities (Table 3). Sensitivity and specificity values for each study are listed in Appendix Table 10.

Table 3 Estimated overall sensitivity, specificity and heterogeneity according to imaging modality
Fig. 3 Forest plot for sensitivity and specificity

Fig. 4 Receiver operator curves (ROC)

Heterogeneity exploration

The bivariate model for heterogeneity exploration showed that the factor ‘flow and timing’ was significantly associated with a higher sensitivity of US (p = 0.01). ‘Description and verification with the reference standard’ was significantly associated with a higher specificity for MRCP (p = 0.0002).

Head to head comparison

Six head to head comparisons were performed (Table 4). In the summary estimates of the head to head studies, the specificity of ERCP and EUS, and the sensitivity of ERCP, EUS and CT, were significantly higher than those of US.

Table 4 Head to head comparison

The head to head comparison of US versus ERCP yielded a sensitivity of 57% (49–65%) versus 78% (71–85%) (p < 0.001); and a specificity of 94% (74–99%) versus 98% (89–100%) (p = 0.003), respectively [20, 26, 29, 34, 46, 50]. The comparison between US and CT yielded a sensitivity of 58% (49–66%) and 77% (68–83%) (p = 0.002), respectively [20, 24, 29, 46, 50]. Finally, the comparison of EUS versus US yielded a sensitivity of 90% (82–98%) versus 63% (49–76%) (p = 0.001); and a specificity of 100% versus 91% (82–99%) (p = 0.04), respectively [20, 36]. There were no significant differences in the sensitivity and specificity estimates between ERCP and EUS [20, 28, 53], MRCP and sMRCP [30, 47, 55] or ERCP and CT [20, 29, 33, 46, 50]. The heterogeneity (I²) in the comparison between US and ERCP was higher (>25%) than in the other comparisons (I² ≤ 25%).

Discussion

EUS, ERCP, MRI and CT all have comparably high diagnostic accuracy in the initial diagnosis of chronic pancreatitis. EUS and ERCP perform best and US has the lowest accuracy. The choice of imaging modality can therefore be made on the basis of invasiveness, local availability, experience and costs.

Several recent guidelines [57–59] advocate the use of EUS, MRCP or CT for the diagnosis of CP, although summary estimates of their accuracy, thus far, were lacking. There is one guideline from Germany on CP that has reported sensitivity and specificity regarding EUS, ERCP, MRCP and US, although not for CT [60]. In this guideline 14 studies were selected, reporting ranges rather than pooling the data on sensitivity and specificity estimates. This approach yielded results slightly different from those in the present meta-analysis. For example, the guideline reports a sensitivity of 70–80% for ERCP and 88% for MRI versus summary estimates of 82% and 78%, respectively, in the present meta-analysis. The European Society of Radiology (ESR) is developing the ESR iGuide, a clinical decision support system for European imaging referral guidelines, covering various clinical scenarios, indications and recommendations (www.esriguide.org) [61–63]. The results from the present systematic review may be useful to incorporate in that system.

We excluded three studies that provided sensitivity and specificity estimates only, because it was not possible to extract sufficient data to construct 2 × 2 tables and calculate the diagnostic accuracy values [64–66]. In the study by Wang et al., estimates of sensitivity and specificity for EUS, ERCP and US were in line with the present results; the sensitivities of MR imaging and CT, however, were much lower (66% and 61%) [66]. The studies by Clave et al. and Orti et al. showed a lower sensitivity of ERCP (62% and 70%, respectively) compared to the present results (82%) [64, 65].

The risk of missing important studies was minimized by performing a search in four major databases by two reviewers independently, without setting any restrictions for language and publication date. However, this systematic review has some limitations. The heterogeneity of the pooled studies was moderate to high in all analyses (between 39% and 93%). However, in the head to head comparison analyses, the heterogeneity was low in most comparisons (<25%). Furthermore, the heterogeneity of the reference standards used in the studies could have influenced individual study results. Surgery, histology and long-term follow-up of patients are reliable methods. Some reference standards, such as the use of endoscopic pancreatic function test (ePFT) for establishing the diagnosis of CP, could have resulted in under- or overestimation of the sensitivity and specificity. In addition, the diagnosis of CP and the criteria used are different in different stages of the disease (e.g. absence of calcifications in the early phase of the disease). Another limitation was that our analyses included imaging studies and imaging protocols performed over the last 40 years in different centres with inherent variations in techniques and equipment. Especially in the last decade the quality of some imaging modalities (e.g. MRCP and CT) has improved considerably. Also there were concerns about the quality of the available evidence, as assessed by QUADAS-2 and the GRADE scoring system.

The highest scores for accuracy in the diagnosis of CP were found for EUS and ERCP, but these are invasive techniques. ERCP has a relatively high risk of complications, such as post-ERCP pancreatitis (1.6–15.7%, mean complication rate of 4%) and is nowadays only used for therapeutic purposes (e.g. stenting of pancreatic duct) [67–69]. To date, diagnostic ERCP has largely been replaced by EUS and the cross-sectional imaging modalities CT and MRCP.

It has been suggested that CT is better at detecting parenchymal calcifications and intraductal calcifications compared to MRCP [70–73]. On the other hand, MRCP is more often able than CT to detect significant abnormalities of the pancreatic duct (e.g. PD dilatation and strictures) and slight changes of the pancreatic parenchyma and side branches, which can be attributed to early signs of CP (i.e. atrophy, side branch ectasia) [74]. Early diagnosis can also lead to a timely start of treatment, which has been associated with improved long-term outcome [75]. Nevertheless, for very early CP this association needs to be established in further research, such as the ESCAPE trial, evaluating the effect of early intervention in patients with CP [76]. As diagnostic sensitivity of CT and MRCP is not significantly lower than that of ERCP and EUS, and specificity is comparable, non-invasive modalities except for US are a likely first choice in patients with suspected pancreatic disease including chronic pancreatitis.