Introduction
1. Relevance
   1.1. What problem is the application intended to solve, and who is the application designed for? Define the scope of application; end-users; research vs. clinical use; usage as double reader, triage, or other; outputs (diagnosis, prognosis, quantitative data, other); indications and contra-indications
   1.2. What are the potential benefits, and for whom? Consider benefits for patients, radiologists/referring clinicians, institution, society
   1.3. What are the risks associated with the use of the AI system? Consider risks of misdiagnosis (including legal costs), of negative impact on workflow, and of negative impact on quality of training
2. Performance and validation
   2.1. Are the algorithm's design specifications clear? Check robustness to variability of acquisition parameters; identify features (radiomics) or network architecture (deep learning) used
   2.2. How was the algorithm trained? Assess population characteristics and acquisition techniques used, labeling process, confounding factors, and operating point selection
   2.3. How has performance been evaluated? Check proper partitioning of training/validation/testing data, and the representativeness and open availability of data. Assess human benchmarks, application scope during evaluation, and the source of clinical validation
   2.4. Have the developers identified and accounted for potential sources of bias in their algorithm? Assess training data collection, bias evaluation, stratification analyses
   2.5. Is the algorithm fixed or adapting as new data comes in? Check whether user feedback is incorporated, whether regulatory approval is maintained, and whether results are comparable with previous versions
3. Usability and integration
   3.1. How can the application be integrated into your clinical workflow? Consider integration with your information technology (IT) platform, check for compliance with ISO usability standards, and consider issues related to practical management of the software
   3.2. How exactly does the application impact the workflow? Identify modifications to bring to your current workflow, and identify roles in the new workflow (physicians and non-physicians)
   3.3. What are the requirements in terms of IT infrastructure? Consider on-premise vs. cloud solutions, identify requirements in terms of hardware and network performance, and consider network security issues
   3.4. Interoperability: how can the data be exported for research and other purposes? Check whether the export formats are suitable
   3.5. Will the data be accessible to non-radiologists (referring physicians, patients)? Check whether the form of the output is suitable for communication with patients and referring physicians
   3.6. Are the AI model's results interpretable? Check whether, and which, interpretability tools (e.g., visualization) are used
4. Regulatory and legal aspects
   4.1. Does the AI application comply with the local medical device regulations? Check whether the manufacturer obtained regulatory approval for the country where the application will be used (CE, FDA, UKCA, MDSAP, or other local guidance), and for which risk class
   4.2. Does the AI application comply with the data protection regulations? Check whether the manufacturer complies with local data protection regulations and provides contractual clauses protecting patients' data
5. Financial and support services considerations
   5.1. What is the licensing model? Assess one-time fee vs. subscription models, total costs, scalability
   5.2. How are user training and follow-up handled? Check whether training sessions are included and under which conditions further training can be obtained
   5.3. How is the maintenance of the product ensured? Check whether regular maintenance is included, and assess the procedures for downtime and repair
   5.4. How will potential malfunctions or erroneous results be handled? Assess the procedures in the event of malfunction, and for post-market surveillance and follow-up
1. What problem is the application intended to solve, and who is the application designed for?
2. What are the potential benefits and risks, and for whom?
3. Has the algorithm been rigorously and independently validated?
4. How can the application be integrated into your clinical workflow, and is the solution interoperable with your existing software?
5. What are the IT infrastructure requirements?
6. Does the application conform to the medical device and personal data protection regulations of the target country, and under which risk class?
7. Have return on investment (RoI) analyses been performed?
8. How is the maintenance of the product ensured?
9. How are user training and follow-up handled?
10. How will potential malfunctions or erroneous results be handled?
Relevance
What problem is the application intended to solve, and who is the application designed for?
- What are the medical conditions to be diagnosed, treated, and/or monitored?
- Who are the intended end-users (e.g., radiologists, clinicians, surgeons), and what are their required qualifications and training?
- What are the principles of operation of the device and its mode of action?
- Is the application intended to be used as a research tool or for clinical use?
- Will the AI solution be used as a double reader, to triage examinations, to perform quality control, or for some other function [10]?
- Does the application provide useful information that was not available before?
- Are there any other considerations, such as patient selection criteria, indications, contra-indications, or warnings?

For SaMD, the “intended use statement” of the product regulatory documentation should provide this information.
What are the potential benefits, and for whom?
Patients
Radiologists and referring physicians
- Increased productivity and decreased reporting time, which can impact clinicians’ and radiologists’ satisfaction [18]
- Increased time spent with patients, which can impact patients’ and radiologists’ satisfaction [19]
- Reduced time spent on “menial” tasks
- Faster diagnosis in time-sensitive situations (e.g., stroke)
- Potential decrease in physical or psychological strain
- Increased quality control, with reduced malpractice risk and legal and insurance costs
Institution
Society
What are the risks associated with the use of the AI system?
Performance and validation
Are the algorithm’s design specifications clear?
- Which image processing steps are used? How are differences in resolution, contrast, and intensity handled on images from different machines?
- For radiomics approaches, which features does the algorithm assess? How does the algorithm represent images prior to learning and analysis? This information can then be linked back to peer-reviewed literature for critical appraisal of performance.
- For deep learning AI algorithms, which neural network is used (e.g., U-Net is a popular architecture for segmentation)? Such information, ideally with reference to the relevant literature, may help identify possible failure modes of the algorithm. Vendors should be able and willing to explain broadly how their algorithms operate to both non-specialists and specialists embedded within radiology departments. If not, this should count as a negative point in the competitive analysis with other solutions.
How was the algorithm trained?
- What data was used to train the AI algorithm? This must include the number of patients, controls, and images, and the occurrence of pathology or abnormality. Clinical and demographic data on patients (with inclusion and exclusion criteria) must be provided, together with information about the location and type of acquisition sites. Technical parameters including vendors, modalities, spatial and temporal resolution of images, acquisition sequence details, field strength if applicable, patient position, injection of contrast agents, and the like must be specified. The sample used to develop the algorithm should have characteristics representative of the target population for which the algorithm will be used, to avoid bias (i.e., same age, ethnicity, breast typology, etc.), and should also follow the same processing steps that will be applied during deployment [25].
- How was labeling performed? What was the experience level of the readers? How many readers per case? Were the readers given realistic conditions for image interpretation? In particular, did they have access to native-resolution images, with their usual viewers and tools? Did they have access to relevant clinical information and other images? Was there a time constraint?
- Are there confounding factors in the data? For example, in multi-site data, were more patients at one site diagnosed with a particular disease than at another site?
- Based on which criteria were the operating points chosen, and on which dataset?
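As an illustration of the last point, a common (though not the only) criterion is to pick the threshold that maximises Youden's J on a held-out validation set. The sketch below is a toy calculation with hypothetical scores and labels, intended only to make the question concrete; a vendor should document which criterion and which dataset were actually used.

```python
# Illustrative sketch: choosing a classifier operating point on a held-out
# validation set by maximising Youden's J (sensitivity + specificity - 1).

def youden_operating_point(scores, labels):
    """Return the threshold on `scores` that maximises Youden's J.

    labels: 1 = diseased, 0 = control. A case is called positive
    when its score is >= the threshold.
    """
    best_threshold, best_j = None, -1.0
    positives = sum(labels)
    negatives = len(labels) - positives
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_j, best_threshold = j, threshold
    return best_threshold, best_j

# Hypothetical validation-set scores (higher = more suspicious) and labels.
val_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
val_labels = [0,   0,   0,    1,   0,   1,   1,   1]
threshold, j = youden_operating_point(val_scores, val_labels)
```

Other criteria (e.g., fixing sensitivity at a target value) are equally legitimate; what matters is that the choice and the dataset it was made on are disclosed.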
How has performance been evaluated?
- What data was used to validate and tune the AI algorithm? Is there an overlap with the training data? If so, this is a red flag.
- What data was used to test the AI algorithm? Is there an overlap with the training and validation data? Again, this is a red flag.
- Is the test set realistic? Is it representative of the population in which the system will be used (e.g., age, sex, BMI, prevalence of pathologies, comorbidities)? If not, radiologists should be aware that results could be sub-optimal in some cases that have not been thoroughly tested, such as obese patients.
- Are the test set (including imaging and clinical data) and the ground truth available and/or open for reproducibility?
- Has the algorithm been benchmarked against experts in the field?
- Are performance results reported for the AI algorithm as a stand-alone clinical decision support system, or as a second reader? Has the added value for human readers (in terms of performance) been assessed?
- Was the clinical validation performed by sources external to and independent of the creator of the algorithm? Is the clinical study design of good quality?
- How reproducible is the algorithm against variability in acquisition parameters (e.g., contrast, signal-to-noise ratio, resolution)? This is typically a weak point in academic/research systems, where AI algorithms can easily latch onto acquisition details unrelated to pathology if these are confounders; commercial systems should present evidence that they are reproducible in the deployment environment [26].
- How repeatable (deterministic) is the algorithm? For algorithms outputting single values (e.g., volumetry), the repeatability coefficient and Bland-Altman plots should be provided.
- How does the algorithm handle differences in data quality? Was the algorithm evaluated on artefactual or otherwise non-ideal data? What were the results?
- For classification algorithms (e.g., diagnosis): Are both threshold-dependent metrics (e.g., sensitivity, specificity) and threshold-independent metrics (such as the area under the receiver operating characteristic (ROC) curve) reported? For imbalanced datasets, are appropriate metrics (balanced accuracy, no-information rate, Cohen's kappa, etc.) provided? Are confidence intervals provided?
- For regression algorithms (e.g., linking clinical scores or liquid biomarker levels to images, such as bone age assessment): Are both a metric of typical performance (mean absolute error (MAE)) and one more sensitive to extreme errors (root mean-squared error (RMSE)) provided? For forecasting (prognosis), is a benchmark against the one-step naïve forecast, e.g., using the mean absolute scaled error (MASE) [27], provided?
- For detection algorithms (e.g., anomaly detection in mammography): Are metrics presented both in terms of patient-level classification (with an explicit and motivated definition of true/false positives and negatives) and in terms of the trade-off between anomaly-level sensitivity and the false-positive rate per case, such as the free-response ROC (FROC) curve? Is the matching criterion, such as an intersection-over-union threshold, clearly defined?
- For segmentation algorithms: Are both overall voxel-level metrics, such as the Jaccard or Dice coefficients, and absolute volume differences provided? Are instance-level metrics, such as per-lesion accuracy, provided?
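To make two of the metrics named above concrete, the sketch below gives toy pure-Python implementations of the AUC (as the rank statistic it is equivalent to) and the Dice coefficient. The input numbers are hypothetical and serve only to illustrate the definitions, not any particular product's performance.

```python
def auc(scores, labels):
    """Threshold-independent AUC: the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative one
    (ties count as 0.5). labels: 1 = positive, 0 = negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def dice(mask_a, mask_b):
    """Dice coefficient between two binary masks, each given as a set of
    voxel coordinates: 2*|A∩B| / (|A| + |B|)."""
    overlap = len(mask_a & mask_b)
    return 2.0 * overlap / (len(mask_a) + len(mask_b))

# Hypothetical numbers for illustration only.
perfect_auc = auc([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1])      # perfect separation -> 1.0
overlap_dice = dice({(0, 0), (0, 1), (1, 1)},
                    {(0, 1), (1, 1), (1, 2)})               # 2*2/(3+3) = 2/3
```

Note that a high Dice score on large structures can coexist with poor per-lesion detection of small ones, which is why the checklist asks for instance-level metrics as well.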
Have the developers identified and accounted for potential sources of bias in their algorithm?
Is the algorithm fixed or adapting as new data comes in?
- Does the system adapt to your local data over time or via updates?
- Is feedback obtained from the users (such as pointing out erroneous detections) incorporated?
- If the algorithm undergoes continuous improvement, is that covered by the regulatory approval? Currently, no adaptive AI systems have regulatory approval, though this may change as the technology progresses.
- If performance is improved in future updates, the algorithm has changed. How are results obtained with prior versions handled? Will they remain valid, and can they still be compared with results from the new version of the algorithm?
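One way to quantify whether quantitative outputs (e.g., volumetry) from a new version remain comparable with the previous one is a Bland-Altman analysis of paired measurements. The sketch below uses hypothetical volume values purely for illustration; in practice this would be run on a fixed local test set re-processed with both versions.

```python
# Sketch: Bland-Altman bias and 95% limits of agreement between two
# algorithm versions run on the same cases.
import statistics

def bland_altman(old, new):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [n - o for o, n in zip(old, new)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired volumetry results (mL) for five cases.
old_volumes = [10.2, 11.5, 9.8, 12.0, 10.9]   # version 1
new_volumes = [10.4, 11.6, 9.9, 12.3, 11.0]   # version 2
bias, lower, upper = bland_altman(old_volumes, new_volumes)
```

A non-negligible bias or wide limits of agreement would indicate that longitudinal comparisons across versions need correction or re-baselining.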
Usability and integration
How can the application be integrated into your clinical workflow?
- Is manual interaction needed, or is the processing performed automatically in the background?
- How fast is the processing cycle from data acquisition to the result?
- How can the processing status of a specific dataset be checked?
- Is there integration of identity management with the hospital system?
- Are there different roles/users defined in the product?
- Who can assign new users and/or roles? How much work does this represent?
- If interaction is needed, are all actions trackable?
How exactly does the application impact the workflow?
What are the requirements in terms of information technology (IT) infrastructure?
Interoperability—how can the data be exported for research and other purposes?
- How can the data be exported for research purposes? Are there accessible application programming interfaces (APIs), such as a DICOMweb interface?
- Is the output in a standards-compliant format, such as a Digital Imaging and Communications in Medicine (DICOM) structured report (SR) following SR template identifier (TID) 1500?
- Are standard export formats (e.g., simple comma-separated values (CSV) files) supported?
- Are the results saved, or must the computation be performed anew every time?
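As a minimal illustration of what a research-friendly tabular export looks like, the sketch below writes hypothetical AI results to CSV using only Python's standard library. The field names and values are assumptions for illustration; in practice the data should come from the vendor's documented interface (e.g., a DICOM SR or DICOMweb endpoint) rather than be re-typed by hand.

```python
# Sketch: exporting per-study AI results to a CSV string.
import csv
import io

# Hypothetical results as produced by some upstream parsing step.
results = [
    {"study_uid": "1.2.840.0001", "finding": "nodule", "volume_ml": 0.42},
    {"study_uid": "1.2.840.0002", "finding": "none",   "volume_ml": 0.0},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["study_uid", "finding", "volume_ml"])
writer.writeheader()
writer.writerows(results)
csv_text = buffer.getvalue()   # ready to save or hand to analysis tools
```

A vendor that only offers screenshots or PDF reports, with no machine-readable export along these lines, effectively locks the data away from research use.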
Will the data be accessible to non-radiologists (referring physicians, patients)?
Are the AI model’s results interpretable?
Regulatory and legal aspects
Does the AI application comply with the local medical device regulations?
- For Europe, is the AI application CE marked?
- For the US, is the AI application FDA-cleared or FDA-approved?
- For other countries, which local medical device regulations apply (e.g., UKCA, MDSAP)?
Does the AI application comply with the data protection regulations?
- What are the contractual guarantees given by the manufacturer? Are there specific clauses in the contract related to the protection of data?
- Does the manufacturer have a reference person for data protection issues?
- Does the processing of data occur on premises or remotely? Is the manufacturer or the subcontractor hosting the processing compliant with the information security standards ISO 27001/27017/27018?
- Is the data pseudonymized, and if so, where are the mapping tables stored?
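To make the last question concrete: a common pseudonymization scheme replaces the patient ID with a salted hash and keeps the mapping table (and the salt) on premises, never in the dataset shared with the vendor. The sketch below is illustrative only; the identifiers are hypothetical and real deployments need a vetted security design, not this toy.

```python
# Sketch: pseudonymization with a separately stored mapping table.
import hashlib

SALT = b"site-specific-secret"   # stored securely on premises, never shared

def pseudonymize(patient_id, mapping):
    """Return a salted-hash pseudonym and record it in the local mapping table."""
    pseudo = hashlib.sha256(SALT + patient_id.encode()).hexdigest()[:12]
    mapping[pseudo] = patient_id   # mapping table stays with the data controller
    return pseudo

mapping_table = {}                 # kept on premises only
pseudo_id = pseudonymize("PAT-000123", mapping_table)
```

The key point for the checklist is the location of `mapping_table`: if the re-identification key leaves the institution, the data is effectively identifiable to whoever holds it.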
Financial and support services considerations
What is the licensing model?
- Does the manufacturer offer a trial period? Is it possible to proceed to a real-life evaluation of the product on the hospital's own data before purchase?
- What are the exact costs now and in the future (installation costs, yearly software license, maintenance fees, costs of potential future updates, internal efforts, etc.)?
- How does the solution scale to more users or more DICOM modalities (devices)? Would there be additional costs?
- Is the AI system offered through an “App store” portal from an established EHR, dictation, or PACS vendor, or through an AI marketplace? If so, will the purchase of that application simplify your access to other applications in the future (by leveraging the same computing architecture and/or AI user interface)?
How are user training and follow-up handled?
- Does the purchase of the product include training sessions? Who should participate, and how much time is required per function?
- Can additional training sessions be arranged for new users? How much would that cost?
- If a question comes up, is there a way to contact the vendor, and is there a guaranteed reaction time?
How is the maintenance of the product ensured?
- Will there be regular maintenance?
- If the product is down, would it still be possible to proceed with reading the relevant images by other means? What is the procedure for repair? What would be the delay? Who would have to cover the costs?
- What is the guaranteed uptime of the servers the software runs on?
How will potential malfunctions or erroneous results be handled?
- How will malfunctions be addressed? If severe, is there a guarantee that the problem will be fixed?
- What is the pathway for reporting a potential malfunction? Is there automatic monitoring in place, or do the users have to report malfunctions themselves?
- What are the adverse event reporting pathways?
- How are post-market surveillance and post-market clinical follow-up to be conducted?