Readers
Twelve board-certified breast radiologists who use breast ultrasound in their practices were recruited as readers for the trial. Remuneration for 3.5 days was at the prevailing US rate. Eleven readers were from the USA and one from Great Britain. Eleven had no experience with AWBU; one had reviewed a limited number of AWBUs 8 years earlier during the developmental phase of the technology. No reader had foreknowledge of the positivity rate of the test set.
Each reader received a 4-h tutorial from one author (KK) explaining operation of the AWBU reading station. The readers reviewed and discussed approximately 12 AWBUs with known cancers that were not part of the test set; these cases were excluded from the test set because either palpable findings were present or there were no concurrent mammograms. Nothing else concerning the study was discussed, other than the use of the data form (Appendix A) and the number of cases to be reviewed.
Procedure
A set of 51 malignant cases (3 cases with bilateral cancers), including invasive and in situ cancers, was collected for the trial (Table 1). Screening mammography and AWBU were performed within 2 months of each other. No cancers were associated with prospective palpable findings or symptoms suggestive of cancer. All mammograms showed heterogeneously dense or extremely dense breast tissue (BIRADS density 3 or 4) on the original reports. All imaging was performed from 2003 to 2008, and the data set included all cases in the AWBU archives meeting the above criteria. Twelve cancers were included that were not prospectively reported on either imaging technique but were visible in retrospect. Of these, four became palpable within 1 year and three after more than 1 year; five were discovered in a subsequent screening round, three by AWBU only and two by both AWBU and mammography.
Table 1
Pathological diagnosis of 51 positive cases (54 cancers)
DCIS | 2 | 0 | 4 | 6 |
IDC | 17 | 19 | 5 | 41 |
ILC | 3 | 2 | 1 | 6 |
Mixed IDC and ILC | 0 | 1 | 0 | 1 |
Total | 22 | 22 | 10 | 54 |
Fifty-one normal cases performed from 2003 to 2008 were matched with each of the positive cases for the following factors:
2. Digital or analog mammogram
4. American breast cup size (A–DD)
5. ACR BIRADS breast density
6. Implant (saline or silicone) and location (pre- or retropectoral)
The normal case matching factors 1 to 7 that was closest in age to the positive case was selected as the normal partner case. The mean difference in age between a positive case and its matched normal was 31 days.
Testing occurred on a subsequent date at each reader’s own site with only the reader and a research assistant (monitor) present. The same monitor was present for all readers. She had no knowledge of the test set makeup, had no mammography or ultrasound training, reviewed the test data forms in real-time for completeness, and transferred the data to the study database.
At each test site, 102 mammograms were placed on a film alternator in a random order that was generated once and used for all readers. Each reader's review time, excluding breaks, was recorded. The upper half of a data form (Appendix A) was completed for each case, checked by the monitor, and entered into the database.
Four questions were asked:
1. Would you request further evaluation based on this mammogram, or recommend routine screening?
2. Where is/are the most suspicious lesion(s) (up to 3), identifying their location by breast and clock-face position?
3. What would be your prediction of the final ACR BIRADS category after any needed diagnostic workup was completed?
4. What is your confidence level that the woman has or does not have cancer (DMIST likelihood scale)?
The American College of Radiology Breast Imaging Reporting and Data System (BIRADS) is a seven-point scale (0 = incomplete, needs additional assessment; 1 = normal; 2 = benign; 3 = probably benign; 4a = possible malignancy; 4b = probable malignancy; 5 = highly suggestive of malignancy) designed to categorize the results of mammography and other imaging studies [3, 11]. Scores from 1 to 5 were allowed. As in the DMIST [12], readers were asked to predict a BIRADS score before any diagnostic workup.
The DMIST likelihood rating is a seven-point scale expressing confidence in the diagnosis, ranging from definitely not cancer to definitely cancer [3, 11, 12].
A location response was recorded as correct if the marked clock-face position fell within the half of the breast centered on the middle of the cancer.
A true positive (TP) was recorded for mammography for any malignant case if ‘callback’ was marked for mammography and at least one correct tumor location was identified. A TP was recorded for mammography plus AWBU if ‘callback’ was marked on either or both halves of the form in a malignant case, with at least one correct location identified. Thus, a TP correctly identified with mammography remained a TP even if it was not identified again on AWBU, and AWBU findings could change the outcome to TP if a cancer was correctly identified with AWBU but missed with mammography. We evaluated readings on a per-case (i.e., per-patient) basis rather than a per-score basis because screening serves as a “go/no-go” gatekeeper for subsequent workup [13].
A true negative (TN) was recorded for mammography for any normal case if ‘callback’ was not marked for mammography. A TN was recorded for mammography plus AWBU for any normal case if ‘callback’ was not marked on the second half of the form. This allowed the reader to reverse a callback for an asymmetric density seen mammographically but cleared as not suspicious by AWBU. To validate TN status, all cases were followed for at least 1 year.
A false positive (FP) was recorded for mammography in two situations:
1. Callback was marked for mammography in a normal case.
2. Callback was marked for mammography in a cancer case, but none of the marked locations corresponded to the cancer.
An FP was recorded for mammography plus AWBU in the same two situations as above when callback was marked for AWBU. A false negative (FN) was recorded for mammography when callback was not marked in a cancer case in the mammography portion of the form. Similarly, an FN was recorded for mammography plus AWBU when callback was not marked in a cancer case in either portion of the form.
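The per-case scoring rules above can be summarized in a short sketch. This is purely illustrative: the function name, arguments, and location codes are invented for clarity and are not part of the study's data forms or software.

```python
# Hypothetical sketch of the per-case outcome rules described above.
# Location codes (e.g., "R2" for right breast, 2 o'clock) are invented.

def score_case(is_cancer, callback, marked_locations, true_locations):
    """Classify one reading of one case as TP, FP, TN, or FN.

    is_cancer: whether the case contains a proven malignancy
    callback: whether the reader marked 'callback' for this technique
    marked_locations / true_locations: sets of accepted location codes
    """
    if is_cancer:
        if callback and marked_locations & true_locations:
            return "TP"  # callback with at least one correct location
        if callback:
            return "FP"  # callback, but no marked location matches the cancer
        return "FN"      # no callback on a cancer case
    return "FP" if callback else "TN"  # normal case

# A cancer at right 2 o'clock, called back with a correct mark:
print(score_case(True, True, {"R2"}, {"R2"}))   # TP
# Called back, but only a wrong location was marked:
print(score_case(True, True, {"L7"}, {"R2"}))   # FP
```

Under these rules, a correct mammographic TP cannot be lost at the combined reading, while an AWBU-only detection can upgrade an FN to a TP, matching the asymmetry described above.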
The 102 AWBUs were reviewed by the readers on a review station brought by the research assistant acting as monitor. Readers worked approximately 8 h daily for 3 days, with breaks at their choosing. They were given the corresponding mammograms with each AWBU and completed the second half of the data sheet with the knowledge from their mammogram-only evaluation available. The same questions were answered for AWBU, and the reading time for each AWBU was recorded.
Statistical analysis
Analyses were conducted in a multi-reader multi-case (MRMC) framework in which each reader screened all cases and each case was read with both screening techniques. The MRMC design efficiently reduces the number of readers and cases needed to detect improvements across techniques [14]. Analyses appropriate for an MRMC design were chosen both to correctly model correlations between readings on the same case across readers and to correctly estimate standard errors. Unless specified otherwise, analyses were conducted in SAS software version 9.2 (SAS Institute Inc., Cary, NC, USA). We present F statistics, shown as F(numerator degrees of freedom, denominator degrees of freedom), and p values for comparisons between mammography plus AWBU and mammography alone.
Cases identified for further imaging were assessed by four binary measures: sensitivity = number of TP/number of cancer cases; specificity = number of TN/number of non-cancer cases; positive predictive value (PPV) = number of TP/(number of TP + FP cases); and negative predictive value (NPV) = number of TN/(number of FN + TN cases). Random-effect logistic regression models were used to test whether each binary measure differed significantly between mammography plus AWBU and mammography alone. To account for the MRMC framework, we included random effects for readers and cases, similar to the DBM model [15].
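The four binary measures can be sketched directly from per-case counts. This is a toy illustration, not the SAS random-effect modeling used in the study, and the counts below are invented for the example.

```python
# Minimal sketch of the four binary screening measures defined above,
# computed from per-case counts. The counts are illustrative only.

def binary_measures(tp, fp, tn, fn):
    return {
        "sensitivity": tp / (tp + fn),  # TP / cancer cases
        "specificity": tn / (tn + fp),  # TN / non-cancer cases
        "ppv": tp / (tp + fp),          # TP / all callbacks with a mark
        "npv": tn / (tn + fn),          # TN / all non-callbacks
    }

# Hypothetical reader: 40 of 51 cancers called back correctly,
# 10 of 51 normals called back.
m = binary_measures(tp=40, fp=10, tn=41, fn=11)
print({k: round(v, 3) for k, v in m.items()})
```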
Accuracy was assessed through BIRADS ratings and DMIST likelihood scores, comparing two commonly used indicators of accuracy between mammography plus AWBU and mammography alone: the area under the curve (AUC) and the figure of merit (FOM). The FOM incorporates into an AUC-like summary both each reader's confidence level and the region of suspected malignancy. Because it includes both confidence level and location accuracy, the FOM is more powerful than the AUC for detecting differences between techniques. We include both analyses, as described below:
Areas under the curve (AUC) were estimated in DBM MRMC 2.1 [15] (available from http://perception.radiology.uiowa.edu) using the trapezoidal/Wilcoxon method. Readers and patients were treated as random factors. We also present reader-averaged receiver operating characteristic (ROC) curves; average values were calculated from separate ROC analyses conducted for each reader using the PROC LOGISTIC procedure.
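The trapezoidal/Wilcoxon estimate of the AUC for a single reader is equivalent to the probability that a cancer case receives a higher confidence score than a normal case, with ties counted as one half. A minimal sketch, using made-up DMIST-style ratings (the study itself used DBM MRMC 2.1, not this code):

```python
# Illustrative trapezoidal (Wilcoxon/Mann-Whitney) AUC for one reader's
# confidence scores. Ratings below are invented for the example.

def wilcoxon_auc(cancer_scores, normal_scores):
    """AUC = P(score on cancer case > score on normal case); ties count 0.5."""
    wins = 0.0
    for c in cancer_scores:
        for n in normal_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(normal_scores))

# Seven-point likelihood ratings for a handful of hypothetical cases:
print(wilcoxon_auc([7, 6, 5, 4], [1, 2, 4, 3]))  # 0.96875
```

Averaging such per-reader AUCs gives the reader-averaged summary; the MRMC software additionally models reader and case variability to produce valid standard errors.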
Figures of merit (FOM) were estimated using jackknife alternative free-response receiver operating characteristic (JAFROC) methodology as implemented in JAFROC version 1.0 [16] (available from http://www.devchakraborty.com). The FOM is defined as the probability that a cancer on an abnormal image is scored higher than a falsely marked location on a normal image; it is analogous to the area under the ROC curve, and a higher FOM indicates better reader performance.
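The FOM comparison can be sketched as follows. This toy version assumes the JAFROC-1 convention of comparing each correctly localized lesion rating against the highest-rated false mark per normal case (unmarked normal cases treated as rating 0); the function name and ratings are invented, and the study's estimates came from the JAFROC software, not this code.

```python
# Hedged sketch of a JAFROC-style figure of merit: the probability that a
# correctly localized lesion is rated higher than the highest-rated false
# mark on a normal case. All ratings below are invented.

def jafroc_fom(lesion_ratings, normal_case_false_marks):
    """lesion_ratings: one rating per correctly localized lesion.
    normal_case_false_marks: per normal case, a list of false-mark ratings
    (empty list if the reader marked nothing; treated as rating 0)."""
    highest_false = [max(marks, default=0) for marks in normal_case_false_marks]
    wins = 0.0
    for lesion in lesion_ratings:
        for false in highest_false:
            if lesion > false:
                wins += 1.0
            elif lesion == false:
                wins += 0.5
    return wins / (len(lesion_ratings) * len(highest_false))

# Three localized lesions vs. three normal cases (one unmarked):
print(jafroc_fom([5, 4, 3], [[2], [], [4, 1]]))
```

Unlike the ROC AUC sketch above, a high confidence score on a cancer case only counts in the FOM's favor when the lesion was also correctly localized, which is why the FOM rewards location accuracy as well as confidence.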