Background
Heart failure is a complex clinical syndrome caused by structural or functional impairment of the heart [
1,
2]. Heart failure has a high incidence in critically ill patients, especially among those in intensive care units (ICUs), and it is responsible for poor outcomes by causing myocardial injury and increased in-hospital mortality [
3]. Critical-illness scoring systems, such as the Acute Physiology and Chronic Health Evaluation II (APACHE-II) and the Simplified Acute Physiology Score II (SAPS-II), are widely used in critical care medicine. However, they have been only modestly successful in heart failure populations [
4‐
6]. The prognosis for critically ill patients with advanced heart failure remains poor, and a proportion of these patients require higher-acuity care in the ICU. A more precise risk stratification tool is needed to improve the quality of heart failure care in the ICU [
7,
8]. On the other hand, traditional prediction models for heart failure based on logistic regression, such as the Get With The Guidelines-Heart Failure (GWTG-HF) registry model, may not capture the multi-dimensional correlations that carry prognostic information in large, high-dimensional datasets, even though ICU monitoring equipment provides a wealth of patient measurements [
9]. In contrast, novel machine learning techniques can capture the nonlinear relationship between patients’ prognosis and clinical manifestations and identify patterns from large datasets that have many variables [
10‐
12]. Extreme gradient boosting (XGBoost) is an ensemble learning algorithm that combines multiple weak learners sequentially to obtain a stronger model that can learn more complex decision boundaries and efficiently handle missing data [
13]. XGBoost has gained considerable popularity in recent years, in part because it has featured in many winning solutions to Kaggle structured-data competitions. Moreover, XGBoost has performed well in prognostic prediction models [
14‐
16].
In this study, we used XGBoost to build a risk prediction model for in-hospital mortality among critically ill patients with heart failure that is more precise than traditional prediction models and critical-illness scoring systems. We further validated the machine learning model by plotting the decision curve and assessing its predictive performance in an external population.
Materials and methods
Database
Two distinct databases were used for this study. The model was developed from a retrospective analysis of a cohort of patients from the Medical Information Mart for Intensive Care (MIMIC-III), a large public database that includes information on 46,520 patients who were admitted to ICUs from 2001 to 2012 at the Beth Israel Deaconess Medical Center in Boston, MA, USA [
17]. The database contains records of demographics, hourly vital signs from bedside monitors, laboratory tests, International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes, and other clinical characteristics. Users were required to pass a test to qualify to register for the database and to be approved by the MIMIC-III database administration staff. The second cohort of patients, used as a validation dataset, was drawn from the Telehealth Intensive Care Unit (eICU) Collaborative Research Database (eICU-CRD). The eICU-CRD, a multi-center critical care database, covers more than 200,000 ICU stays of 139,367 unique patients admitted to ICUs between 2014 and 2015 from 208 hospitals in the United States [
18]. After passing a training course, “Protecting Human Research Participants,” on the website of the National Institutes of Health, we had permission to extract data from the two databases for research purposes (certification number: 37903239).
Study population
The study focused on ICU patients with heart failure. From the MIMIC-III and the eICU-CRD, we extracted the patients who were diagnosed with heart failure at admission to an ICU according to ICD-9 codes or who were recorded as heart failure patients. Other criteria for inclusion were (I) heart failure without sepsis at admission to the ICU; (II) older than 16 years and younger than 90 years; (III) first hospital stay and first ICU admission; (IV) ICU stay longer than 24 h; (V) ICU vital signs data and laboratory test data available.
Initially, we extracted as many features as possible from the MIMIC-III database for constructing the baseline model and for feature screening. First, we collected demographic data, including age, gender, weight, height, and ethnicity. Then, the vital signs data and laboratory data during the first 24 h after admission to the ICU were extracted, including heart rate, blood pressure, respiratory rate, temperature, oxyhemoglobin saturation (SpO2), creatinine, chloride, glucose, hematocrit, hemoglobin, platelet count, potassium, partial thromboplastin time (PTT), prothrombin time (PT), sodium, blood urea nitrogen (BUN), white blood cell (WBC) count, red blood cell count, red cell distribution width (RDW), partial pressure of oxygen (pO2), partial pressure of carbon dioxide (pCO2), and HCO3. The clinicians and nurses collected these data hourly. To extract more information from these features, we took the maximum, minimum, mean, and range of each vital sign and laboratory value over this period as candidate features. Comorbidities of patients were also collected. The urine output and Glasgow Coma Scale were calculated for the first 24 h after ICU admission. The primary endpoint was all-cause in-hospital mortality, so patients without discharge information were excluded from the final cohort. Finally, these features were integrated into a single data frame for analysis. The data were extracted with SQL queries in PostgreSQL.
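As an illustration of the summary features described above, the following minimal sketch computes the maximum, minimum, mean, and range of one variable over the first 24 h. It is not the extraction code used in the study; the function name, the input layout (pairs of hours since ICU admission and value), and the example readings are assumptions made for illustration.

```python
from statistics import mean


def summarize_first_24h(readings):
    """Return max, min, mean, and range of one variable in the first 24 h.

    `readings` is a list of (hours_since_icu_admission, value) pairs.
    Missing values are passed as None and skipped (an assumed convention).
    """
    values = [v for t, v in readings if 0 <= t < 24 and v is not None]
    if not values:
        # No usable measurement in the window: leave every summary missing.
        return {"max": None, "min": None, "mean": None, "range": None}
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean(values),
        "range": max(values) - min(values),
    }


# Hypothetical heart-rate readings (hour, beats per minute) for one patient.
# The reading at hour 30 lies outside the 24-h window and is ignored.
heart_rate = [(0, 88), (1, 92), (2, None), (3, 110), (30, 70)]
features = {
    f"heart_rate_{name}": value
    for name, value in summarize_first_24h(heart_rate).items()
}
```

Applying the same function to every vital sign and laboratory variable yields four candidate features per variable.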
Data preprocessing
After data extraction, the data set was preprocessed. Records with physiologically impossible values were eliminated. We then transformed character variables into categorical variables; unordered categorical variables were coded by one-hot encoding. Missing data, which were common in the databases, can introduce bias into subsequent analysis [
19,
20]; to limit this bias, we excluded covariates with > 40% missing data and patients with > 20% missing covariates. For missing data imputation, we compared three methods: (1) median imputation, (2) random forest imputation, and (3) XGBoost imputation. Since the XGBoost method gave the best predictive performance in the baseline model, we selected it to handle the missing data.
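The exclusion thresholds above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes that covariates are filtered before patients and that patient-level missingness is computed over the retained covariates (the text does not state the order), and it shows median imputation, the simplest of the three compared methods, rather than the XGBoost imputation that was finally selected. All names and values are hypothetical.

```python
from statistics import median


def filter_and_impute(rows, max_col_missing=0.40, max_row_missing=0.20):
    """rows: list of dicts mapping covariate name -> value (None if missing)."""
    columns = list(rows[0].keys())
    n = len(rows)
    # 1) Drop covariates that are missing in more than 40% of patients.
    kept_cols = [
        c for c in columns
        if sum(r[c] is None for r in rows) / n <= max_col_missing
    ]
    if not kept_cols:
        return []
    # 2) Drop patients missing more than 20% of the retained covariates
    #    (assumed order of operations).
    kept_rows = [
        {c: r[c] for c in kept_cols}
        for r in rows
        if sum(r[c] is None for c in kept_cols) / len(kept_cols) <= max_row_missing
    ]
    # 3) Median imputation of the remaining gaps.
    for c in kept_cols:
        observed = [r[c] for r in kept_rows if r[c] is not None]
        if not observed:
            continue  # nothing to impute from; leave the gaps as they are
        fill = median(observed)
        for r in kept_rows:
            if r[c] is None:
                r[c] = fill
    return kept_rows


# Hypothetical toy cohort of five patients and six covariates.
rows = [
    {"age": 70, "sodium": 140, "bun": 20, "wbc": 9.0, "platelets": 200, "lactate": None},
    {"age": 65, "sodium": None, "bun": 30, "wbc": 11.0, "platelets": 180, "lactate": None},
    {"age": 80, "sodium": 138, "bun": 40, "wbc": 7.0, "platelets": 250, "lactate": 2.1},
    {"age": None, "sodium": None, "bun": None, "wbc": 8.0, "platelets": 220, "lactate": None},
    {"age": 55, "sodium": 142, "bun": 25, "wbc": 10.0, "platelets": 150, "lactate": 1.5},
]
clean = filter_and_impute(rows)
```

In this toy cohort, the lactate column (60% missing) and the fourth patient (three of five retained covariates missing) are excluded, and the one remaining sodium gap is filled with the median of the observed values.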
Model development
Generating the risk prediction model consisted of two stages: feature selection and model building. The feature selection stage identified the smallest and most predictive subset of features for inclusion in the final prediction model to minimize overfitting, since a model that is overfitted to the training cohort loses predictive power in other populations. We used the permutation-based XGBoost selection method, which ranks features by the XGBoost variable importance metric and eliminates features one by one to obtain the best predictive subset (details in Additional file
1: Fig. S2).
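The one-by-one elimination described above can be sketched generically as follows. This is an assumption-laden illustration rather than the authors' procedure: `fit_and_score` is a hypothetical callable that trains a model on a feature subset and returns a validation score together with per-feature importances, and `toy_fit_and_score` is an invented stand-in that replaces real model training so that the example is self-contained.

```python
def backward_elimination(features, fit_and_score):
    """Remove the least important feature one at a time; keep the best subset.

    `fit_and_score(subset)` (hypothetical) returns
    (validation_score, {feature: importance}) for a model trained on `subset`.
    """
    current = list(features)
    best_subset, best_score = list(current), float("-inf")
    while current:
        score, importance = fit_and_score(current)
        # ">=" prefers the smaller subset on ties, because subsets only shrink.
        if score >= best_score:
            best_subset, best_score = list(current), score
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    return best_subset, best_score


# Invented stand-in for model training: each feature carries a fixed "signal",
# and every extra feature costs a small penalty that mimics overfitting.
SIGNAL = {"anion_gap": 0.30, "age": 0.20, "urine_output": 0.15,
          "noise_a": 0.0, "noise_b": 0.0}


def toy_fit_and_score(subset):
    score = 0.5 + 0.5 * sum(SIGNAL[f] for f in subset) - 0.01 * len(subset)
    return score, {f: SIGNAL[f] for f in subset}


selected, score = backward_elimination(list(SIGNAL), toy_fit_and_score)
```

With the toy scorer, the two uninformative features are dropped first and the three informative ones are retained. In practice, the callable would wrap model training and cross-validated evaluation.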
Since the aim was to provide decision-making support for clinicians in evaluating the risk of in-hospital mortality of heart failure patients after ICU admission, the primary outcome of the model was the mortality rate of the ICU patients. The machine learning model was developed with the XGBoost algorithm [
21,
22]. The algorithm relies on iterative correction of the residuals of previous weak models, meaning that each new classifier is fitted on the basis of the previous classifiers to optimize predictive power [
23,
24]. The MIMIC-III dataset provides more detailed information than the eICU dataset. First, after data preprocessing, the candidate feature set contained 177 features in the MIMIC-III dataset but only 89 in the eICU dataset. All the eICU features were also present in the MIMIC-III dataset, whereas the MIMIC-III dataset contained additional features from blood gas analysis and comorbidity information, such as arterial base excess, plasma bicarbonate, hematocrit, chronic pulmonary heart disease, valvular disease, pulmonary circulation, hypothyroidism, and so on. Second, the MIMIC-III study cohort comprised 5676 patients, whereas the eICU cohort comprised 1349. To construct better models and to find the most discriminating subset of variables, we used the MIMIC-III dataset as derivation data. We randomly divided the derivation data into a training cohort (90%) and a testing cohort (10%). The training cohort was used to train the predictive model, and the testing cohort was used to validate its performance. To train the machine learning model, we used tenfold cross-validation in the training cohort for hyperparameter tuning [
25]. We took the best predictive model and calculated the area under the receiver operating characteristic curve (AUC) in the testing cohort. We also constructed other models (logistic regression and SAPS-II) to compare with the machine learning model in the testing cohort. For logistic regression, we constructed a new feature set from variable interactions. Then, the performance of stepwise logistic regression, Lasso, Ridge, and Elastic Net was compared between the original feature set and the new feature set (details in Additional file
1: Fig. S2). The stepwise logistic regression model was conducted using these significant variables identified by forward stepwise analysis with each variable iteratively added to minimize the Akaike Information Criterion (AIC). Finally, the best model was selected and compared with the machine learning model. The data extraction process and model building were conducted with Python 3.8.3.
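The splitting and evaluation steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the study's code: the random seed, the rounding of the 10% testing cohort, and the function names are all hypothetical, and the AUC is computed directly as the probability that a randomly chosen death is ranked above a randomly chosen survivor.

```python
import random


def train_test_split_indices(n, test_fraction=0.10, seed=0):
    """Randomly split indices 0..n-1 into training and testing lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def auc(labels, scores):
    """AUC by pairwise comparison of positives and negatives; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 90/10 split of a derivation cohort of 5676 patients, then tenfold
# cross-validation inside the training cohort for hyperparameter tuning.
train_idx, test_idx = train_test_split_indices(5676)
cv_splits = list(k_fold(train_idx, k=10))

# Tiny hypothetical example: 1 = died in hospital, 0 = survived.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise AUC is quadratic in the cohort size and is meant only to make the definition explicit; a rank-based implementation would be preferable for large cohorts.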
Discussion
In this work, we used machine learning to construct a risk prediction model for in-hospital mortality among heart failure patients in intensive care units. Compared with traditional risk prediction, machine-learning techniques can capture the nonlinearity between risk predictors and mortality from large amounts of high-dimensional data [
26‐
28]. The techniques can overcome the challenge of accurately identifying high-risk patients in the ICU, especially for those with complex phenotypes, such as heart failure [
29]. Matthew et al. [
30] demonstrated the superiority of machine learning methods for predicting the risk of heart failure. Our machine learning model had the best discrimination among the three predictive models, with an AUC of 0.831 in the internal validation dataset. According to the decision curve analysis (DCA) of the three models, the net benefit of the XGBoost model was the highest, suggesting that the XGBoost model is optimal. It also had acceptable performance, with an AUC of 0.809 (95% CI 0.805–0.814) in the external validation. The XGBoost model had satisfactory calibration and good risk stratifying ability both in the internal testing dataset and the external validation dataset.
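For readers unfamiliar with decision curve analysis, the net benefit at a threshold probability p_t is conventionally defined as TP/n − (FP/n) × p_t/(1 − p_t), where TP and FP are the numbers of true and false positives among n patients when everyone with a predicted risk of at least p_t is treated as high risk. The sketch below evaluates this formula on invented data; it assumes this standard definition and is not the analysis code of the study.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient as high risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = died) and predicted risks for ten patients.
labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.2, 0.1, 0.3, 0.05, 0.7, 0.15, 0.1]

nb_model = net_benefit(labels, probs, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
```

A decision curve plots these quantities over a range of thresholds; a model is preferred where its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero.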
Using the XGBoost model, we divided the predicted risk probabilities in the derivation population into five strata: < 5% (very low), 5–10% (low), 10–30% (moderate), 30–50% (high), and > 50% (very high). The same risk strata were then applied to the external validation dataset, which documented the feasibility of using the XGBoost model to stratify risk in patients from other populations. With the XGBoost model, the risk probability of each patient can inform and support clinicians in decision making. However, there were some deaths in the low-risk strata and some survivors in the high-risk strata. We suspect that these exceptions may be due to different phenotypes of heart failure patients within the various risk strata. For instance, Matthew et al. [
31] identified phenogroups of patients with machine learning-based unsupervised cluster analysis. Consequently, we may use other methods for further analysis and for making experimental validations in future research.
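The risk strata defined in this section (< 5%, 5–10%, 10–30%, 30–50%, > 50%) can be expressed as a small lookup, sketched below. The handling of values that fall exactly on a boundary is an assumption, since the text does not specify it, and the predicted risks and outcomes are invented for illustration.

```python
from collections import defaultdict

STRATA = ["very low", "low", "moderate", "high", "very high"]


def risk_stratum(p):
    """Map a predicted in-hospital mortality probability to a risk stratum."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p < 0.05:
        return "very low"
    if p < 0.10:   # assumed: exactly 5% counts as low
        return "low"
    if p < 0.30:   # assumed: exactly 10% counts as moderate
        return "moderate"
    if p <= 0.50:  # exactly 50% stays in "high", since very high is > 50%
        return "high"
    return "very high"


def tabulate(patients):
    """Return {stratum: (number of patients, number of deaths)}."""
    counts = defaultdict(lambda: [0, 0])
    for prob, died in patients:
        entry = counts[risk_stratum(prob)]
        entry[0] += 1
        entry[1] += died
    return {s: tuple(counts[s]) for s in STRATA if s in counts}


# Hypothetical (predicted risk, died) pairs.
patients = [(0.02, 0), (0.07, 0), (0.07, 1), (0.20, 0),
            (0.45, 1), (0.50, 1), (0.80, 1), (0.90, 0)]
table = tabulate(patients)
```

Tabulating observed deaths per stratum in this way makes exceptions visible, such as the death in the low stratum and the survivor in the very high stratum of the toy data.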
The machine learning-based model identified 24 variables from the feature set. Anion gap was most associated with death among ICU heart failure patients through the predictive model. Age was generally associated with death, and the Glasgow Coma Scale was also a predictor of mortality in ICU patients. Blood coagulation status at ICU admission, such as platelet count and PTT, was associated with in-hospital mortality among heart failure patients. Disturbance of blood coagulation has been reported to seriously threaten patients’ survival [
32]. However, most heart failure patients receive anticoagulant therapy, which can add to coagulation abnormalities. Hence, clinicians should be cautious in prescribing anticoagulant therapy for patients who are at high risk because the agents may increase the risk of inducing coagulopathy. To enable faster and more accurate coagulation management, thromboelastography (TEG) or rotational thromboelastometry (ROTEM) could be applied early in high-risk patients [
33]. Furthermore, the high-risk patients may receive mechanical thromboprophylaxis with intermittent pneumatic compression, graduated compression stockings, or percutaneous left atrial appendage closure [
34,
35]. Urine output was the third most important predictor in the predictive model, and a higher urine output may indicate a better prognosis. Lin et al. [
36] indicated that decreased urine output could be a compensatory mechanism to maintain intravascular volume, and in that circumstance, patients may be at risk of renal injury. Meanwhile, oliguria and worsening renal function may drive fluid retention, increasing the burden on the heart, which damages the heart and aggravates the symptoms of heart failure. Several studies in HF patients have demonstrated that fluid overload is independently associated with increased mortality [
37,
38]. One reason is that HF patients are at risk of death not only from cardiovascular disease but also from multiorgan failure. Many blood gas analysis features were among the most important features in the predictive models: pO2, pCO2, anion gap, and arterial base excess. However, the machine learning method could only show that these features were associated with mortality in heart failure patients; it could not explain the underlying mechanisms. Hence, further research is needed to determine the role of these features in ICU patients with heart failure.
As a retrospective analysis, this study has limitations. First, our predictive model was constructed from a single-center dataset, which may not be appropriate for other populations. Although our model had good performance in the external dataset, it needs verification in other datasets and populations. Second, because of missing data, some features that have been identified as risk predictors of heart failure, such as N-terminal pro-B-type natriuretic peptide [
39,
40], were not assessed. Third, we did not make the most of time sequence data monitoring from the ICU; we only extracted the minimum, maximum, mean, and range of features within 24 h. The pattern of change for a period in a feature may contain information that can increase the prediction and understanding of mechanisms. In future work, we could divide the 24 h into shorter time intervals. One strategy is that the 24 h period can be divided into two time periods according to the maximum or minimum point of each time series feature. Then, we could extract additional summary statistics of the feature for the two time periods, such as mean value, variance, deviation and Shannon entropy, and incorporate them in the statistical models [
41]. Nonetheless, our model can help clinicians identify ICU patients with heart failure who are at high risk for in-hospital mortality.
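The future-work strategy outlined above, splitting the 24-h window at the maximum or minimum of a series and summarizing each part, might look like the following minimal sketch. Several choices are assumptions made for illustration: the extreme point is assigned to the first segment, "deviation" is read as the standard deviation, population variance is used, and Shannon entropy is computed after equal-width binning into four bins.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def shannon_entropy(values, n_bins=4):
    """Entropy in bits after equal-width binning (binning is an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant segment carries no information
    width = (hi - lo) / n_bins
    bins = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in bins.values())


def split_window_features(series, split_at="max"):
    """Split a series at its maximum (or minimum) and summarize both parts."""
    pick = max if split_at == "max" else min
    cut = pick(range(len(series)), key=lambda i: series[i])
    segments = {"before": series[: cut + 1], "after": series[cut + 1:]}
    out = {}
    for name, seg in segments.items():
        if not seg:
            continue  # the extreme value was the last point in the window
        var = pvariance(seg)
        out[name] = {
            "mean": mean(seg),
            "variance": var,
            "std": math.sqrt(var),
            "entropy": shannon_entropy(seg),
        }
    return out


# Hypothetical hourly heart-rate values for one patient.
series = [80, 90, 120, 100, 100, 100]
features = split_window_features(series, split_at="max")
```

In the toy series, the segment up to the peak is variable and has a high entropy, whereas the segment after the peak is constant and has an entropy of zero.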