Before implementation – predictive performance
During the development phase, the internal validation of the predictive performance of a tool is the first step to make sure that the tool is doing what it is intended to do [54, 55]. Predictive performance is defined as the ability of the tool to utilise clinical and other relevant patient variables to produce an outcome that can be used to support diagnostic, prognostic or therapeutic decisions made by clinicians and other healthcare professionals [12, 13]. The predictive performance of a tool is evaluated using measures of discrimination and calibration [53]. Discrimination refers to the ability of the tool to distinguish between patients with and without the outcome under consideration. It can be quantified with measures such as sensitivity, specificity, and the area under the receiver operating characteristic curve, AUC (or concordance statistic, c). The D-statistic is a measure of discrimination for time-to-event outcomes, commonly used in validating the predictive performance of prognostic models on survival data [95]. The log-rank test, sometimes referred to as the Mantel-Cox test, is used to establish whether the survival distributions of two samples of patients are statistically different; it is commonly used to validate the discriminative power of clinical prognostic models [96]. Calibration, on the other hand, refers to the accuracy of prediction and indicates the extent to which expected and observed outcomes agree [48, 56]. Calibration is assessed by plotting the observed outcome rates against their corresponding predicted probabilities. This is usually presented graphically as a calibration plot showing a calibration line, which can be described with a slope and an intercept [97]. It is sometimes summarised using the Hosmer-Lemeshow test or the Brier score [98]. To avoid over-fitting, a tool’s predictive performance must always be assessed out-of-sample, either via cross-validation or bootstrapping [56]. Of more interest than the internal validity is the external validity (reliability or generalisability), where the predictive performance of a tool is estimated in independent validation samples of patients from different populations [52].
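To make two of these measures concrete, the following minimal sketch (illustrative only, not drawn from any of the cited studies; the data and function names are ours) computes the c-statistic as the proportion of concordant event/non-event pairs, and the Brier score as the mean squared difference between predicted probabilities and observed binary outcomes.

```python
# Illustrative sketch: discrimination (c-statistic / AUC) and calibration
# summary (Brier score) for predicted probabilities vs. observed outcomes.

def c_statistic(y_true, y_prob):
    """AUC as the probability that a randomly chosen event case is ranked
    above a randomly chosen non-event case (ties count as 0.5)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))

def brier_score(y_true, y_prob):
    """Mean squared prediction error; lower is better, and an uninformative
    constant prediction of 0.5 scores 0.25."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Toy data: three non-events and three events with predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print(c_statistic(y_true, y_prob))
print(brier_score(y_true, y_prob))
```

In practice these would be computed on a held-out validation sample (or across cross-validation folds), not on the data used to fit the tool.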
During implementation – potential effect & usability
Before wide implementation, it is important to learn about the estimated potential effect of a predictive tool, when used in clinical practice, on three main categories of measures: 1) clinical effectiveness, such as improving patient outcomes, estimated through clinical effectiveness studies; 2) healthcare efficiency, including saving costs and resources, estimated through feasibility and cost-effectiveness studies; and 3) patient safety, including minimising complications, side effects, and medical errors. These categories are defined by the Institute of Medicine as objectives for improving healthcare performance and outcomes, and are prioritised differently by clinicians, healthcare professionals and health administrators [99, 100]. The potential effect of a predictive tool is defined as the expected, estimated or calculated impact of using the tool on different healthcare aspects, processes or outcomes, assuming the tool has been successfully implemented and is used in clinical practice as designed by its developers [41, 101]. A few predictive tools have been studied for their potential to enhance clinical effectiveness and improve patient outcomes. For example, the spinal manipulation clinical prediction rule was tested, before implementation, on a small sample of patients to identify those with low back pain most likely to benefit from spinal manipulation [102]. Other tools have been studied for their potential to improve healthcare efficiency and save costs. For example, using a decision analysis model, and assuming all eligible children with minor blunt head trauma were managed using the CHALICE rule (Children’s Head Injury Algorithm for the Prediction of Important Clinical Events), it was estimated that CHALICE would reduce unnecessary, expensive head computed tomography (CT) scans by 20% without risking patients’ health [103-105]. Similarly, the use of the PECARN (Paediatric Emergency Care Applied Research Network) head injury rule was estimated to potentially improve patient safety by minimising the exposure of children to ionising radiation, resulting in fewer radiation-induced cancers and a lower net loss of quality-adjusted life years [106, 107].
In addition, it is important to learn about the usability of predictive tools. Usability is defined as the extent to which a system can be used by the specified users to achieve specified and quantifiable objectives in a specified context of use [108, 109]. There are several methods to make a system more usable, and many definitions of usability have been developed based on different perspectives of what usability is and how it can be evaluated, such as the mental effort required, user attitudes, or user interaction, reflected in the ease of use and acceptability of systems [110, 111]. Usability can be evaluated by measuring the effectiveness of task management in terms of accuracy and completeness, measuring the efficiency of utilising resources in completing tasks, and measuring users’ satisfaction, comfort with, and positive attitudes towards, the use of the tools [112, 113]. More advanced techniques, such as think-aloud protocols and near-live simulations, have recently been used to evaluate usability [114]. Think-aloud protocols are a major method in usability testing, since they produce a larger set of information and richer content. They are conducted either retrospectively or concurrently, and each method has its own way of detecting usability problems [115]. Near-live simulations provide users, during testing, with an opportunity to go through different clinical scenarios while the system captures interaction challenges and usability problems [116, 117]. Some researchers add learnability, memorability and freedom from errors to the measures of usability. Learnability is an important aspect of usability and a major concern in the design of complex systems; it is the capability of a system to enable its users to learn how to use it. Memorability, on the other hand, is the capability of a system to enable its users to remember how to use it when they return. Learnability and memorability are measured through subjective survey methods, asking users about their experience after using systems, and can also be measured by monitoring users’ competence and learning curves over successive sessions of system usage [118, 119].
After implementation – post-implementation impact
Some predictive tools have been implemented and used in clinical practice for years, such as the PECARN head injury rule or the Ottawa knee and ankle rules [120-122]. In such cases, clinicians might be interested to learn about their post-implementation impact. The post-implementation impact of a predictive tool is defined as the achieved change or influence of the tool on different healthcare aspects, processes or outcomes, after the tool has been successfully implemented and used in clinical practice as designed by its developers [2, 42]. Similar to the measures of potential effect, post-implementation impact is reported along three main categories of measures: 1) clinical effectiveness, such as improving patient outcomes; 2) healthcare efficiency, such as saving costs and resources; and 3) patient safety, such as minimising complications, side effects, and medical errors. These three categories of post-implementation impact measures are prioritised differently by clinicians, healthcare professionals and health administrators. In this phase of evaluation, we follow the main concepts of the GRADE framework, where the level of evidence for a given outcome is first determined by the study design [64, 65, 68]. High-quality experimental studies, such as randomised and nonrandomised controlled trials, and systematic reviews of their findings, sit at the top of the evidence hierarchy, followed by well-designed observational cohort or case-control studies and, lastly, subjective studies, opinions of respected authorities, and reports of expert committees or panels [65-67]. For simplicity, we did not include GRADE’s detailed criteria for higher and lower quality of studies. However, effect sizes and potential biases are reported as part of the framework, so that consistency of findings, trade-offs between benefits and harms, and other considerations can also be assessed.
Applying the GRASP framework to grade five predictive tools
In order to show how GRASP works, we applied it to grade five randomly selected predictive tools: the LACE Index for Readmission [125], the Centor Score for Streptococcal Pharyngitis [126], Wells’ Criteria for Pulmonary Embolism [123, 124, 127], the Modified Early Warning Score (MEWS) for Clinical Deterioration [128] and the Ottawa Knee Rule [122]. In addition to these seven primary studies describing the development of the five predictive tools, our systematic search of the published evidence revealed a total of 56 studies validating, implementing, and evaluating the five tools. The LACE Index was evaluated and reported in six studies, the Centor Score in 14 studies, the Wells’ Criteria in ten studies, the MEWS in 12 studies, and the Ottawa Knee Rule in 14 studies. To apply the GRASP framework and assign a grade to each predictive tool, the following steps were conducted: 1) the primary study or studies were first examined for basic information about the tool and the reported details of development and validation; 2) the other studies were examined for their phases of evaluation, levels of evidence and direction of evidence; 3) mixed evidence was sorted into positive or negative; 4) the final grade was assigned and supported by a detailed justification. A summary of grading the five tools is shown in Table 2, and a detailed GRASP report on each tool is provided in Additional file 1: Tables S3-S7.
Table 2
Summary of Grading the Five Predictive Tools
The LACE Index is a prognostic tool designed to predict 30-day readmission or death of patients after discharge from hospital. It uses multivariable logistic regression analysis of four administrative data elements: length of stay, admission acuity, comorbidity (Charlson Comorbidity Index) and emergency department (ED) visits in the last 6 months, to produce a risk score [125]. The tool has been tested for external validity twice: using a sample of 26,045 patients from six hospitals in Toronto and a sample of 59,652 patients from all hospitals in Alberta, Canada. In both studies, the LACE Index showed positive external validity and superior predictive performance to similar previous tools endorsed by the Centers for Medicare and Medicaid Services in the United States [129, 130].
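The additive structure of the index can be sketched as a simple scoring function. The point values below are paraphrased from the published index and should be treated as illustrative; verify them against the primary study before any use.

```python
# Illustrative sketch of a LACE-style additive risk score (point values
# paraphrased from the published index; verify against the primary study
# before any clinical use).

def lace_score(length_of_stay, acute_admission, charlson_index, ed_visits_6m):
    # L: length of stay in days, binned into points
    if length_of_stay < 1:
        l_pts = 0
    elif length_of_stay <= 3:
        l_pts = length_of_stay        # 1-3 days map to 1-3 points
    elif length_of_stay <= 6:
        l_pts = 4
    elif length_of_stay <= 13:
        l_pts = 5
    else:
        l_pts = 7
    # A: acuity of the admission (admitted via the ED)
    a_pts = 3 if acute_admission else 0
    # C: Charlson Comorbidity Index, capped at 5 points
    c_pts = charlson_index if charlson_index < 4 else 5
    # E: ED visits in the 6 months before admission, capped at 4 points
    e_pts = min(ed_visits_6m, 4)
    return l_pts + a_pts + c_pts + e_pts

print(lace_score(5, True, 2, 1))  # 4 + 3 + 2 + 1 points
```

The underlying model is a logistic regression; the point system is a simplification that maps each risk score to a predicted probability of readmission or death.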
Two studies examined the predictive performance of the LACE Index on small sub-population samples: 507 geriatric patients in the United Kingdom and 253 congestive heart failure patients in the United States, and found that the index performed poorly [131, 132]. Two more studies reported that the LACE Index performed well, but not better than the tools the authors themselves had developed [133, 134]. Using the mixed evidence protocol, the mixed evidence here supports external validity, since the two studies with negative conclusions were conducted on very small samples of patients and on different subpopulations from the one for which the LACE Index was developed. There was no published evidence on the usability, potential effect or post-implementation impact of the LACE Index. Accordingly, the LACE Index has been assigned Grade C1.
The Centor Score is a diagnostic tool that uses a rule-based algorithm on clinical data to estimate the probability that pharyngitis is streptococcal in adults who present to the ED complaining of sore throat [126]. The score has been tested for external validity multiple times, and all the studies reported positive conclusions [135-142]. This qualifies the Centor Score for Grade C1. One study, a multicentre cluster RCT, conducted usability testing of the integration of the Centor Score into electronic health records. The study used “think aloud” testing with ten primary care providers and post-interaction surveys, in addition to screen captures and audio recordings, to evaluate usability. Within the same study, “near live” testing was conducted with eight primary care providers. The study reported positive usability of the tool and positive user feedback on its ease of use and usefulness [143]. This qualifies the Centor Score for Grade B1.
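As a rule-based algorithm, the score is a simple count of clinical criteria, one point each. The sketch below is illustrative (criterion names are ours, paraphrased from the original description) and omits the probability of streptococcal infection attached to each score level.

```python
# Illustrative sketch of the Centor criteria as a rule-based count: one point
# per criterion present (criterion names paraphrased; the mapping from score
# to probability of streptococcal pharyngitis is omitted here).

def centor_score(tonsillar_exudates, tender_cervical_nodes,
                 fever_history, cough_absent):
    criteria = [tonsillar_exudates, tender_cervical_nodes,
                fever_history, cough_absent]
    return sum(1 for present in criteria if present)

# Example: exudates and tender nodes present, no fever history, no cough
print(centor_score(True, True, False, True))  # 3 of 4 criteria
```

Higher counts correspond to a higher probability that the pharyngitis is streptococcal, which is what makes the score usable at the point of care without laboratory results.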
Evidence of the post-implementation impact of the Centor Score is mixed. One RCT conducted in Canada reported a clinically important 22% reduction in overall antibiotic prescribing [144]. Four other studies, three of which were RCTs, reported that implementing the Centor Score did not reduce antibiotic prescribing in clinical practice [145-148]. Using the mixed evidence protocol, we found that the mixed evidence does not support a positive post-implementation impact of the Centor Score. Therefore, the Centor Score has been assigned Grade B1.
Wells’ Criteria is a diagnostic tool used in the ED to estimate the pre-test probability of pulmonary embolism [123, 124]. Using a rule-based algorithm on clinical data, the tool calculates a score that can exclude pulmonary embolism without diagnostic imaging [127]. The tool was tested for external validity multiple times [149-153] and its predictive performance has also been compared to that of other predictive tools [154-156]. In all studies, Wells’ Criteria was reported to be externally valid, which qualifies it for Grade C1. One study conducted usability testing of the integration of the tool into the electronic health record system of a tertiary care centre’s ED. The study identified a strong desire for the tool and received positive feedback on the usefulness of the tool itself. Subjects responded that they felt the tool was helpful and organised, and did not compromise clinical judgement [157]. This qualifies Wells’ Criteria for Grade B1. The post-implementation impact of Wells’ Criteria on the efficiency of computed tomography pulmonary angiography (CTPA) utilisation has been evaluated through an observational before-and-after intervention study, which found that Wells’ Criteria significantly increased the efficiency of CTPA utilisation and decreased the proportion of inappropriate scans [158]. Therefore, Wells’ Criteria has been assigned Grade A2.
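Unlike the Centor count, Wells’ Criteria is a weighted sum: each clinical finding contributes a different number of points. The sketch below paraphrases the published weights and uses the commonly cited two-tier “likely/unlikely” cut-off; both should be verified against the primary studies before any use.

```python
# Illustrative sketch of the Wells pulmonary-embolism score as a weighted
# rule-based sum (weights and cut-off paraphrased from the published
# criteria; verify against the primary studies before any clinical use).

PE_WEIGHTS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "recent_immobilisation_or_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "haemoptysis": 1.0,
    "active_malignancy": 1.0,
}

def wells_pe_score(findings):
    """findings: dict mapping criterion name -> bool (absent keys count False)."""
    return sum(w for name, w in PE_WEIGHTS.items() if findings.get(name))

findings = {"heart_rate_over_100": True, "haemoptysis": True}
score = wells_pe_score(findings)
print(score, "PE likely" if score > 4 else "PE unlikely")
```

A score in the “unlikely” range, typically combined with a negative D-dimer test, is what allows pulmonary embolism to be excluded without imaging.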
The Modified Early Warning Score (MEWS) is a prognostic tool for the early detection of inpatients’ clinical deterioration and potential need for higher levels of care. The tool uses a rule-based algorithm on clinical data to calculate a risk score [128]. The MEWS has been tested for external validity multiple times in different clinical areas, settings and populations, and these studies reported that the tool is externally valid [159-165]. However, one study reported that the MEWS poorly predicted the in-hospital mortality risk of patients with sepsis [166]. Using the mixed evidence protocol, the mixed evidence supports external validity, qualifying the MEWS for Grade C1. No literature has been found regarding its usability or potential effect.
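Structurally, the MEWS maps each routinely recorded vital sign to points via banded thresholds and sums the points. The band boundaries below follow one commonly cited version of the score, but published variants differ; treat them as placeholders and verify against the local protocol.

```python
# Illustrative sketch of a MEWS-style early-warning score: each vital sign
# is mapped to points via banded thresholds and the points are summed.
# Band boundaries are placeholders from one commonly cited version of the
# score; published variants differ, so verify before any clinical use.

INF = float("inf")

# Each list: (upper_bound_exclusive, points), in ascending order of the bound
RESP_BANDS = [(9, 2), (15, 0), (21, 1), (30, 2), (INF, 3)]   # breaths/min
HR_BANDS = [(41, 2), (51, 1), (101, 0), (111, 1), (130, 2), (INF, 3)]
SBP_BANDS = [(71, 3), (81, 2), (101, 1), (200, 0), (INF, 2)]  # mmHg
TEMP_BANDS = [(35.0, 2), (38.5, 0), (INF, 2)]                 # degrees C
AVPU_POINTS = {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}

def band(value, cutpoints):
    """Return the points of the first band whose upper bound exceeds value."""
    for upper, points in cutpoints:
        if value < upper:
            return points

def mews(resp_rate, heart_rate, systolic_bp, temperature, avpu):
    return (band(resp_rate, RESP_BANDS) + band(heart_rate, HR_BANDS)
            + band(systolic_bp, SBP_BANDS) + band(temperature, TEMP_BANDS)
            + AVPU_POINTS[avpu])

print(mews(12, 80, 120, 37.0, "alert"))  # unremarkable vitals score 0
```

In deployed systems, a score above a locally agreed trigger threshold prompts escalation, such as more frequent observation or a call to a rapid response team.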
The MEWS has been implemented in different healthcare settings. One observational before-and-after intervention study failed to prove a positive post-implementation impact of the MEWS on patient safety in acute medical admissions [167]. However, three more recent observational before-and-after intervention studies reported a positive post-implementation impact of the MEWS on patient safety. One study reported a significant increase in the frequency of patient observation and a decrease in serious adverse events after intensive care unit (ICU) discharge [168]. The second reported a significant increase in the frequency of vital signs recording, both 24 h post-ICU discharge and in the 24 h preceding unplanned ICU admission [169]. The third, an 8-year study, reported that the 4 years after implementation showed significant reductions in the incidence of cardiac arrests, the proportion of patients admitted to the ICU and their in-hospital mortality [170]. Using the mixed evidence protocol, the mixed evidence supports a positive post-implementation impact. The MEWS has been assigned Grade A2.
The Ottawa Knee Rule is a diagnostic tool used to exclude the need for an X-ray for a possible bone fracture in patients presenting to the ED, using a simple five-item manual checklist [122]. It is one of the oldest, most accepted and most successfully used rules in CDS. The tool has been tested for external validity multiple times. One systematic review identified 11 studies, six of which, involving 4249 adult patients, were appropriate for pooled analysis; the pooled results showed high sensitivity and specificity [171]. Furthermore, two studies discussed the post-implementation impact of the Ottawa Knee Rule on healthcare efficiency. One nonrandomised controlled trial with before-after and concurrent controls included a total of 3907 patients seen during two 12-month periods before and after the intervention. The study reported that the rule decreased the use of knee radiography without patient dissatisfaction or missed fractures, and was associated with reduced waiting times and costs per patient [172]. Another nonrandomised controlled trial reported that the proportion of ED patients referred for knee radiography was reduced, and that practice based on the rule was associated with significant cost savings [173]. Accordingly, the Ottawa Knee Rule has been assigned Grade A1.
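The five-item checklist translates directly into an any-of rule: radiography is indicated if any single item is positive, which is what gives the rule its high sensitivity for fracture. The sketch below paraphrases the published items; the function name and wording are ours.

```python
# Illustrative sketch of the Ottawa knee rule as a five-item checklist:
# radiography is indicated if ANY item is positive (items paraphrased from
# the published rule; not for clinical use).

def ottawa_knee_xray_indicated(age, isolated_patellar_tenderness,
                               fibular_head_tenderness,
                               cannot_flex_to_90_degrees,
                               cannot_bear_weight_four_steps):
    return any([
        age >= 55,
        isolated_patellar_tenderness,
        fibular_head_tenderness,
        cannot_flex_to_90_degrees,
        cannot_bear_weight_four_steps,
    ])

# A young patient with none of the findings needs no X-ray under the rule
print(ottawa_knee_xray_indicated(30, False, False, False, False))
```

Because the rule only ever rules imaging out when every item is negative, its post-implementation impact is measured as reduced radiography use without missed fractures, exactly as the two trials above report.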
In Additional file 1, a summary of the predictive performance of the five tools is shown in Table S8, and the c-statistics of the LACE Index, Centor Score, Wells’ Criteria and MEWS are reported in Figure S4. The usability of the Centor Score and Wells’ Criteria is reported in Table S9, and the post-implementation impact of Wells’ Criteria, the MEWS and the Ottawa Knee Rule is reported in Table S10.