Missing data
There were 7,019 consecutive admissions in total. Records with missing values for admission type (n = 142, of which 47 resulted in death) or SAPS II score (n = 10) were excluded from the analysis, leaving 6,867 admissions. GCS had 977 missing values; these were assumed to be normal (value = 15) and were imputed accordingly in both the training and validation sets. The proportion of missing values for the other relevant variables ranged from 0% to 10%: urine production within 24 hours (n = 310), lowest bicarbonate (n = 536), urea (n = 693), mechanical ventilation within 24 hours after admission (n = 0), lowest systolic blood pressure (n = 214), and lowest pH (n = 670). The tree-fitting algorithm automatically handled missing values as described below.
Statistical analysis
For continuous variables, we used the t distribution to calculate 95% confidence intervals (CIs) and the Welch modification of the two-sample t test to calculate p values for differences between means; this modification does not assume equal variances in the survivor and non-survivor groups. We used Wilson's method to calculate 95% CIs for proportions and binomial probabilities, such as the mortality rate in the various patient subgroups and the positive predictive values (PPVs). The two-sided proportion test with Yates' continuity correction was used for testing differences between proportions (except for differences between PPVs, for which bootstrapping [with 1,000 bootstrap samples] was used because the patient groups partially overlap). Bootstrapping with 1,000 bootstrap samples was also used to calculate the CI of the difference between Brier scores. The Hosmer-Lemeshow test with 10 degrees of freedom was used for testing model calibration.
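The Welch statistic and the Wilson interval can be sketched as follows. This is a minimal illustration in Python with hypothetical data values, not the analysis code actually used (which was written in S-PLUS):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom;
    unlike Student's t test, equal variances are not assumed."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

def wilson_ci(events, n, z=1.96):
    """Wilson score 95% CI for a binomial proportion (e.g. a mortality rate)."""
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical ages in a survivor and a non-survivor group
t, df = welch_t([62, 70, 75, 68, 71, 66], [78, 81, 74, 85, 79, 88])

# Hypothetical subgroup: 30 deaths among 200 admissions
lo, hi = wilson_ci(30, 200)
```

Unlike the normal-approximation ("Wald") interval, the Wilson interval stays within [0, 1] and behaves well for small subgroups, which matters when mortality is estimated in small tree leaves.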
In this study, data were analyzed by means of RPA, among other methods [10]. RPA is an alternative to more standard model-based regression techniques for multivariable analyses. In contrast to such numeric-based techniques, RPA results in a symbolic representation called a classification tree, which can easily be interpreted as a collection of 'if-then rules,' each with a condition part and a conclusion part. An example of a rule is 'IF the GCS score is greater than 6 AND the patient is admitted to the ICU after planned surgery AND the urine production during the first 24 hours is more than 1.25 liters, THEN the risk of dying before hospital discharge is 11.8%.' The classification tree is obtained by finding the split – a variable and its cut-point value (for example, GCS score of more than 6) – that 'best' partitions the whole group of patients into two subgroups. These subgroups, one fulfilling and one not fulfilling the condition in the split, appear graphically under a left and a right branch emanating from the group.
The term 'best' refers to the partition resulting in the lowest entropy, meaning essentially that the probability of the event (such as survival status) differs most between the two subgroups. Next, each subgroup in turn is itself further partitioned (hence the term 'recursive partitioning' in RPA). This process is repeated until a stopping criterion is met. Each path from the root to a leaf node in the tree corresponds to an if-then rule whose conclusion part consists of the probability of the event in the leaf node.
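The entropy-minimizing split search can be illustrated with a short sketch. The GCS values and outcomes below are hypothetical, and `best_split` is an illustrative helper rather than the Rpart implementation:

```python
import math

def entropy(labels):
    """Shannon entropy of a binary outcome list (0 = survived, 1 = died)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(values, labels):
    """Find the cut-point on one variable that minimizes the weighted
    entropy of the two resulting subgroups (the 'best' split)."""
    n = len(values)
    best_cut, best_w = None, float("inf")
    for cut in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= cut]
        right = [l for v, l in zip(values, labels) if v > cut]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if w < best_w:
            best_cut, best_w = cut, w
    return best_cut, best_w

# Hypothetical GCS values with associated hospital mortality
gcs = [3, 4, 5, 6, 7, 9, 12, 14, 15, 15]
died = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cut, w = best_split(gcs, died)  # the split 'GCS <= 6' separates the outcomes perfectly
```

In a full tree-growing algorithm this search runs over every candidate variable, and the variable–cut-point pair with the lowest weighted entropy becomes the split at that node.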
When the tree algorithm finds the split that best partitions a group of observations, it also identifies 'surrogate splits,' which are used to handle missing values. A surrogate split partitions the observations in a way very similar to the original split (in terms of the 'left' and the 'right' subgroups). Suppose that the original split is 'minimum bicarbonate of less than 22.6 mmol/l'; for an observation missing the minimum bicarbonate value, the surrogate split 'maximum bicarbonate of less than 25.3 mmol/l' can be used to decide whether the observation should go to the left or the right branch. The surrogate-split mechanism is, in effect, a flexible way to impute a missing value depending on where it is encountered in the tree.
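A surrogate split is chosen by its agreement with the primary split: among candidate splits on other variables, the one sending the largest fraction of observations to the same branch is ranked first. A minimal sketch, with hypothetical bicarbonate values:

```python
def agreement(primary_side, candidate_side):
    """Fraction of observations that a candidate surrogate split sends
    to the same branch as the primary split."""
    same = sum(p == c for p, c in zip(primary_side, candidate_side))
    return same / len(primary_side)

# Primary split: minimum bicarbonate < 22.6 mmol/l ('L' = left branch)
min_bicarb = [18, 20, 21, 24, 26, 28]
primary = ['L' if v < 22.6 else 'R' for v in min_bicarb]

# Candidate surrogate: maximum bicarbonate < 25.3 mmol/l
max_bicarb = [21, 24, 26, 27, 29, 30]
surrogate = ['L' if v < 25.3 else 'R' for v in max_bicarb]

score = agreement(primary, surrogate)  # 5 of 6 observations agree
```

An observation missing the primary variable is then routed using the highest-ranked surrogate for which it has a value, so the effective imputation depends on where in the tree the missing value is encountered.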
The surrogate splitter contains information that is typically similar to what would be found in the primary splitter. In our study, the root of the tree corresponds to the whole developmental sample and is associated with the prevalence (the a priori probability) of hospital mortality in the developmental set. Each variable is then assessed to determine which one discriminates most (in terms of information gain) between those who were discharged from hospital alive and those who did not survive hospital treatment.
This process is repeated on the new nodes, creating a tree structure as a result. The tree was first allowed to grow fully, deliberately overfitting the data. The optimal tree size was then determined as the size that yields the minimum cross-validation error, as described below, and the overgrown tree was pruned back to this size.
Cross-validation is performed for increasing tree sizes (in essence, the tree size corresponds to the number of nodes in the tree). The cross-validation error is based on a 10-fold cross-validation in which the developmental set is randomly split into 10 mutually exclusive subsets. Nine subsets are used to grow a new tree of the given size, and the 10th is used to assess the accuracy of that tree in predicting the outcomes in the held-out subset. This process is repeated so that each of the 10 subsets serves once as the held-out set, resulting in 10 error estimates. The cross-validation error associated with a given tree size is the average classification error of the 10 trees of that size. The cross-validation error will usually first decrease with tree size, reach a minimum that is associated with the optimal tree size, and then start increasing again owing to overfitting.
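The fold construction and the size selection can be sketched as follows. The error curve below is invented purely to show the typical decrease-minimum-increase shape; `kfold_indices` and `optimal_size` are illustrative helpers, not the Rpart internals:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Randomly partition n observation indices into k mutually
    exclusive folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def optimal_size(cv_errors):
    """Pick the tree size with the minimum mean cross-validation error.
    cv_errors maps tree size -> list of per-fold error estimates."""
    mean_err = {size: sum(e) / len(e) for size, e in cv_errors.items()}
    return min(mean_err, key=mean_err.get)

folds = kfold_indices(100, 10)

# Hypothetical error curve: decreases, reaches a minimum at size 5, then rises
cv_errors = {
    2: [0.30] * 10,
    3: [0.25] * 10,
    5: [0.20] * 10,
    8: [0.22] * 10,
    12: [0.27] * 10,
}
best = optimal_size(cv_errors)  # tree size at the minimum of the curve
```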
The resulting pruned tree was then validated by measuring its predictive performance on the validation set, which was not used in any way during the development of the tree. We used systematic sampling, assigning every third consecutive admission to the validation set.
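The systematic one-in-three sampling amounts to simple index slicing over the temporally ordered admissions (the admission IDs below are hypothetical):

```python
# Systematic sampling: every third consecutive admission goes to the
# validation set; the remainder forms the developmental set.
admissions = list(range(1, 13))   # hypothetical admission IDs in temporal order
validation = admissions[2::3]     # every third admission: 3, 6, 9, 12
developmental = [a for a in admissions if a not in validation]
```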
We used the Rpart package for recursive partitioning [11] and the generalized linear model function for fitting logistic regression models within the statistical environment S-PLUS (commercially available software, Insightful Corporation, Seattle, USA) [12].
The predictive ability of the tree was compared with the predicted mortality based on the original SAPS II score and with the predicted mortality based on the SAPS II model after first-level customization for a Dutch population of very elderly patients by means of the developmental database [13]. First-level customization means refitting the model to obtain new coefficients without changing the score itself; second-level customization implies adapting each item of the score and was not attempted here. A receiver operating characteristic (ROC) curve was generated for the logistic regression SAPS II models and the classification tree. The ROC curve is a graphical display of sensitivity plotted against 1 – specificity for all possible thresholds that can be used to predict hospital mortality. Estimates of the area under the ROC curve (ROC-AUC) and its standard error were obtained using the non-parametric approach of DeLong and colleagues [14]. The ROC-AUC measures the discriminative ability of a model. It is not, strictly speaking, a proper scoring rule; that is, its maximum value can also be obtained when the predictions are not equal to the true probabilities. This is because it is not sensitive to the distance between the predicted probability and the true probability of an event, which is an aspect of calibration. Therefore, we also measured the Brier score (that is, the mean of the squared errors of the predictions), which is a proper scoring rule, and performed a Hosmer-Lemeshow test with 10 degrees of freedom.
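The two performance measures can be sketched directly from their definitions; the predictions and outcomes below are hypothetical, and the rank-based AUC formula shown here is equivalent to the trapezoidal area under the empirical ROC curve (the DeLong method additionally provides the standard error, which is not reproduced here):

```python
def roc_auc(preds, outcomes):
    """ROC-AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen death receives a higher prediction than a
    randomly chosen survivor, counting ties as one half."""
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def brier(preds, outcomes):
    """Brier score: mean squared error of the probabilistic predictions.
    Lower is better; it is a proper scoring rule."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)

# Hypothetical predicted mortalities and observed outcomes (1 = died)
auc = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])   # perfect ranking -> 1.0
score = brier([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

The example illustrates the distinction drawn above: the AUC is already maximal here because the ranking is perfect, yet the Brier score is nonzero because the predicted probabilities are not exactly 0 or 1.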