Background
During the last decade, several machine learning models have been introduced to predict outcomes in the intensive care unit (ICU) [1–3]. These models include a gradient boosting machine (GBM) model for in-hospital mortality [3], a recurrent neural network-based model for major complications, and a hybrid convolutional neural network and long short-term memory (LSTM) model for 3- to 14-day mortality [4]. Previous studies have reported excellent performance for these models, suggesting their potential use in clinical practice [5].
However, the real-time clinical performance of these models remains unclear because most were developed to predict mid- to long-term outcomes using data from the first 24 h of ICU admission [6–9]. In general, the ICU mortality rate peaks within the first 24 h and then declines with ongoing management in the ICU [10, 11]. Before such models are applied to routine monitoring in clinical practice, their real-time performance should be validated.
Another challenge in applying machine learning models to real-world clinical practice is that performance can vary depending on the training data and clinical setting [12]. A recent review reported that approximately half of ICU mortality prediction models have never been externally validated [13]. Some variables used in previous models, such as insurance type and diagnosis codes, are not standardized across countries, making international application difficult [3]. Therefore, models that use variables commonly measured in most clinical settings should be developed and validated in multinational cohorts to ensure good performance across clinical settings.
Here, we aimed to develop a machine learning-based real-time prediction model for short-term (24 h) mortality risk in critically ill patients using only variables readily available from electronic health records in most clinical settings. We reduced the number of input parameters to avoid overfitting and developed an ensemble model that combines the outputs of several different model architectures. We then validated the model’s performance using international datasets from Asia, America, and Europe. We hypothesized that the performance of a real-time model predicting short-term ICU mortality from a minimal set of common clinical variables with ensemble machine learning techniques would be maintained in international validation.
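To illustrate the ensemble idea, the following is a minimal sketch of a soft-voting ensemble in Python (scikit-learn). The base learners, hyperparameters, and unweighted probability averaging shown here are assumptions for illustration only; the architectures, features, and aggregation actually used in the study may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Hypothetical base learners standing in for the study's model architectures.
base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200),
    GradientBoostingClassifier(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
]

def fit_ensemble(X_train, y_train):
    """Fit every base learner on the same development data."""
    for model in base_models:
        model.fit(X_train, y_train)
    return base_models

def predict_risk(models, X):
    """Soft voting: average the predicted 24-h mortality probabilities."""
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return probs.mean(axis=1)
```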
Discussion
In this study, we developed and internationally validated a machine learning-based model for real-time prediction of mortality within 24 h in critically ill patients. Although the model was developed from single-center data, by using common clinical features and ensemble techniques it outperformed conventional risk scores when applied in real time to the internal and external validation cohorts. However, performance declined slightly in the external validation.
Previous studies have reported that machine learning-based mortality prediction models for critically ill patients can achieve significantly better predictive performance than conventional scoring systems, such as the APACHE II or SAPS II [6, 9, 22–25]. However, most models were designed to predict mortality at a single time point, such as 24 h after admission [22, 25], which hardly reflects management during the ICU stay. Real-time models have been applied only during the first 24 h after admission [24], at 1-day intervals [23], or for long-term outcomes [6]. We therefore developed a model that can be applied hourly, intended for real-time monitoring in the ICU, and evaluated its performance.
Previous studies have also reported that the accuracy of mortality prediction models declines in the later stages of the ICU stay [6, 26]. Despite exhibiting a similar decline in performance over time, our model’s AUROC remained above 0.82 in both the internal and external validations, except for the AmsterdamUMCdb (Additional file 1: Fig. E9c). This may be because our model was trained to predict short-term mortality and optimized for real-time performance. In both the internal and external datasets, the model’s score consistently increased as death approached in patients who died, demonstrating its utility for real-time monitoring in the ICU (Fig. 2a).
When real-time models are applied in clinical practice, alarm fatigue is a major concern that can lead to complete deactivation of the alarms [27]. However, at a sensitivity of 0.891, the MACPD of our model was 2.344, fewer than three alarms per bed per day. Although the alarm rate increased more than twofold in the MIMIC-III, threefold in the AmsterdamUMCdb, and fourfold in the eICU-CRD, it remained significantly lower than that of the NEWS and was fewer than 10 alarms per bed per day. We considered this alarm rate acceptable and unlikely to increase the risk of alarm fatigue.
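As a rough illustration of how such an alarm rate can be estimated, the sketch below assumes that MACPD denotes the mean number of alarms per bed per day, that predictions are generated hourly, and that the column names (stay_id, risk, died_within_24h) are hypothetical; the study’s actual definitions and thresholding procedure may differ.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

def alarms_per_bed_day(scores: pd.DataFrame, target_sensitivity: float = 0.891) -> float:
    """Estimate the mean alarm count per bed per day at a fixed sensitivity.

    Assumes one row per hourly prediction with columns 'stay_id',
    'risk' (model output in [0, 1]) and 'died_within_24h' (0/1 label).
    """
    # Pick the highest threshold whose sensitivity reaches the target
    # (roc_curve returns thresholds in decreasing order).
    fpr, tpr, thresholds = roc_curve(scores["died_within_24h"], scores["risk"])
    threshold = thresholds[np.argmax(tpr >= target_sensitivity)]

    # An alarm fires whenever an hourly risk estimate crosses the threshold.
    alarms = (scores["risk"] >= threshold).groupby(scores["stay_id"]).sum()

    # With hourly predictions, 24 rows correspond to one bed-day of monitoring.
    bed_days = scores.groupby("stay_id").size() / 24.0
    return float(alarms.sum() / bed_days.sum())
```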
Our model included features routinely measured and monitored in the ICU, such as heart rate, SpO2, and GCS; relying on such commonly monitored variables allows the model to be applied easily in daily care without specialized laboratory tests or monitoring equipment. Moreover, the model explains its predictions for each patient at each point in time. As shown in Additional file 1: Fig. E7, Shapley values indicate the impact of each input feature on the model output. Since the European Union’s General Data Protection Regulation took effect in May 2018, the interpretability of algorithmic decision-making models has become essential [23]. Nevertheless, whether changing the variables based on feature importance improves outcomes requires further investigation.
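Per-patient, per-hour explanations of this kind can be produced with the shap library. The sketch below wraps a model-agnostic KernelExplainer around the hypothetical ensemble from the sketch in the Background; the background sample size, input variables (X_train, X_current_hour), and feature_names are assumptions, not the study’s implementation.

```python
import shap  # model-agnostic Shapley value estimation

# A small background sample summarises the training distribution
# (X_train is a hypothetical feature matrix from model development).
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(predict_risk_fn, background)  # predict_risk_fn: callable returning risk per row

# Per-feature contributions to the current hourly risk estimate of one patient;
# X_current_hour is assumed to be a 2D array of shape (1, n_features).
shap_values = explainer.shap_values(X_current_hour)
for name, value in zip(feature_names, shap_values[0]):
    print(f"{name}: {value:+.3f}")
```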
Although our model showed better calibration, with the lowest ECE of the scoring systems compared, it still tended to underestimate the risk of mortality in both internal and external cohorts. The low mortality rate in the development cohort (1%) may account for this underestimation. Models other than iMORS were originally developed to provide early warning of deterioration in general-ward patients and also underestimate the risk of mortality [28]. Furthermore, although some models, such as APACHE II, were developed to estimate the risk of mortality in ICU patients, they provide an overall risk of mortality rather than short-term mortality. Considering that the calibration analysis of the cohort with the highest mortality rate, the AmsterdamUMCdb, showed the highest ECE for all prediction models compared with the other datasets, differences in mortality rate may partly explain the underestimation.
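For reference, the following is a minimal sketch of a binned calibration-error computation, assuming ECE here denotes the expected calibration error over equal-width probability bins; the number of bins and binning scheme used in the study may differ.

```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Weighted mean absolute gap between predicted risk and observed
    mortality within each equal-width probability bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # Include 1.0 in the last bin.
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return ece
```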
Across all subgroups of age, sex, ethnicity, insurance, and ICU type in both internal and external cohorts, our model showed good performance, with AUROCs of > 0.85, suggesting the broad applicability of the prediction model to all types of critically ill patients. Interestingly, the AUROC of our model was highest for Native Americans in both the MIMIC-III and eICU-CRD and lowest for Asians in the eICU-CRD (Additional file 1: Fig. E8a). This divergence could be attributed to the limited number of Native American patients and to possible differences in the data distribution of the Asian population in the eICU-CRD compared with the Korean cohort used for training. Our model also showed good predictive ability for all ICU types, with AUROCs of > 0.90 in the internal testing dataset and > 0.83 in the external cohorts. In both the internal testing dataset and the external cohorts, the AUROC was consistently lower for the CCU than for the other ICU types, except in the AmsterdamUMCdb. This can be attributed to the distinct role of the CCU, which serves both as an ICU and as a post-procedural care unit after cardiac intervention. At SNUH, the CCU plays a limited role as an ICU and does not provide specialized modalities, such as mechanical ventilation or continuous renal replacement therapy. Limited therapeutic options, as well as the characteristics and severity of illness, may affect the distribution of patients in the CCU.
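Subgroup performance of this kind can be summarised with a short grouped computation. The sketch below assumes hypothetical column names ('risk', 'died_within_24h', and a grouping column such as 'icu_type') and skips subgroups that contain only one outcome class; it is not the study’s analysis code.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(df: pd.DataFrame, group_col: str) -> pd.Series:
    """AUROC of the model's risk estimates within each subgroup.

    Assumes one row per prediction with columns 'risk' (model output),
    'died_within_24h' (0/1 label), and the grouping column.
    """
    results = {}
    for group, g in df.groupby(group_col):
        if g["died_within_24h"].nunique() == 2:  # AUROC needs both classes
            results[group] = roc_auc_score(g["died_within_24h"], g["risk"])
    return pd.Series(results).sort_values()

# Example: per-ICU-type AUROC in an external validation cohort.
# print(subgroup_auroc(external_predictions, "icu_type"))
```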
The applicability of prediction models in clinical practice is as crucial as their predictive performance. Because our model uses vital signs and laboratory tests routinely measured in the ICU, and its input design incorporates real-time updates with each new value, it lends itself to automated mortality prediction using real-time data from electronic health record systems. Furthermore, the model’s explainable nature, which identifies the factors contributing to the predicted mortality, indicates its potential utility as a clinical decision-support tool. Therefore, clinical trials validating the clinical utility of the model are warranted. Improving the model’s performance with additional inputs, newer architectures, and more data should also be considered in future studies.
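One simple way to realise “real-time updates with each new value” is to maintain an hourly feature grid in which the most recent measurement of each variable is carried forward. The sketch below assumes long-format, timestamped data with hypothetical column names; it illustrates the general idea rather than the input design actually used in the study.

```python
import pandas as pd

def hourly_feature_matrix(events: pd.DataFrame, stay_id) -> pd.DataFrame:
    """Hourly input matrix for one ICU stay, carrying the latest value of
    each variable forward until a new measurement arrives.

    Assumes `events` has columns 'stay_id', 'charttime' (datetime),
    'variable', and 'value' in long format.
    """
    stay = events[events["stay_id"] == stay_id]
    wide = (
        stay.pivot_table(index="charttime", columns="variable",
                         values="value", aggfunc="last")
            .sort_index()
    )
    # Resample to an hourly grid and forward-fill the most recent known values.
    return wide.resample("60min").last().ffill()
```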
This study has several limitations. First, we developed our model using data from a single tertiary academic hospital, where the distribution of patients differed from that of other institutions. The presence of a specialized unit for close monitoring in the general ward, such as a “sub-ICU,” might imply an increased severity of illness in patients admitted to the ICU. Although external validation using the MIMIC-III, eICU-CRD, and AmsterdamUMCdb showed that the model performed well, the mortality rates in the external cohorts were similar to or higher than that of the development cohort. Therefore, the prediction model should be applied with caution, and recalibration may be required for other cohorts, particularly those with lower mortality rates. Second, the predictive performance was reduced in the external cohorts. As shown in Additional file 1: Table E2, mortality differed among the cohorts, although the two cohorts from the USA (MIMIC-III and eICU-CRD) were relatively similar. Differences in the severity of individual features may reduce the model’s performance. Third, the model’s performance on the external cohorts showed a decreasing trend as ICU length of stay was prolonged and patient age increased. Except for the 70–79-year age subgroup in the AmsterdamUMCdb, predictive performance declined consistently with increasing age and longer ICU stay (Additional file 1: Fig. E9c, d). Despite this reduction in performance in high mortality-risk subgroups, the model maintained an acceptable AUROC of over 0.8 across all age and ICU length-of-stay subgroups, except for ICU stays of more than 8 days in the AmsterdamUMCdb. Finally, all validations in this study were conducted retrospectively; unavoidable bias may therefore be present, and prospective validation is required. Whether predicting mortality in the ICU can improve outcomes should also be evaluated in future studies.