Background
a) Patients, treatment and variables

| Study and marker | Remarks |
|---|---|
| Marker | OS = 86-probe-set gene-expression signature |
| Further variables | v1 = age, v2 = sex, v3 = NPM1, v4 = FLT3-ITD |
| Reference | Metzeler et al. (2008) |
| Source of the data | GEO (reference: GSE12417) |
| Patients | | n | Remarks |
|---|---|---|---|
| Training set | Assessed for eligibility | 163 | Disease: acute myeloid leukemia; patient source: German AML Cooperative Group 1999-2003 |
| | Excluded | 0 | |
| | Included | 163 | Treatment: following the AMLCG-1999 trial; gene expression profiling: Affymetrix HG-U133 A&B microarrays |
| | With outcome events | 105 | Overall survival: death from any cause |
| Validation set | Assessed for eligibility | 79 | Disease: acute myeloid leukemia; patient source: German AML Cooperative Group 2004 |
| | Excluded | 0 | |
| | Included | 79 | Treatment: 62 following the AMLCG-1999 trial, 17 receiving intensive chemotherapy outside the study; gene expression profiling: Affymetrix HG-U133 Plus 2.0 microarrays |
| | With outcome events | 33 | Overall survival: death from any cause |
| Relevant differences between training and validation sets | |
|---|---|
| Data source | Same research group, different time (see above) |
| Follow-up time | Much shorter in the validation set (see text) |
| Survival rate | Higher in the validation set (see Figure 2) |
b) Statistical analyses of survival outcomes

| Analysis | n | e | Variables considered | Results/remarks |
|---|---|---|---|---|
| **A: preliminary analysis (separately on training and validation sets)** | | | | |
| A1: univariate | t.: 163; v.: 79 | t.: 105; v.: 33 | v1 to v4 | Kaplan-Meier curves (Figure 1) |
| **B: evaluating clinical model and combined model on validation data (models fitted on training set, evaluated on validation set)** | | | | |
| B1: overall prediction | t.: 163; v.: 79 | t.: 105; v.: 33 | OS, v1 to v4 | Prediction error curves (Figure 5); integrated Brier score (text) |
| B2: discriminative ability | t.: 163; v.: 79 | t.: 105; v.: 33 | OS, v1 to v4 | Comparison of Kaplan-Meier curves for risk groups, with medians as cutpoints (Figure 6) and k-means clustering (data not shown, see text); C-index (text); K-statistic (text) |
| B3: calibration | t.: 163; v.: 79 | t.: 105; v.: 33 | OS, v1 to v4 | Kaplan-Meier curve vs. average individual survival curves for risk groups (Figure 7); calibration slope (text) |
| **C: multivariate testing of the omics score in the validation data (only validation set involved)** | | | | |
| C1: significance | 79 | 33 | OS, v1 to v4 | Multivariate Cox model (Table 3) |
| **D: comparison of the predictive accuracy of clinical and combined models through cross-validation in the validation data (only validation set involved)** | | | | |
| D1: overall prediction | 79 | 33 | OS, v1 to v4 | Prediction error curves based on repeated cross-validation (Figure 8), on repeated subsampling (data not shown, see text), and on repeated bootstrap resampling (data not shown, see text); integrated Brier score based on cross-validation (text) |
| **E: subgroup analysis (E1-E3 based on training and validation sets, E4 and E5 only on validation set; for all, separate analyses for the female and male populations)** | | | | |
| E1: overall prediction | female t.: 88, v.: 46; male t.: 74, v.: 33 | female t.: 54, v.: 16; male t.: 51, v.: 17 | OS, v1 to v4 | Prediction error curves (Figure 9) |
| E2: discriminative ability | as E1 | as E1 | OS, v1 to v4 | C-index (text); K-statistic (text) |
| E3: calibration | as E1 | as E1 | OS, v1 to v4 | Calibration slope (text) |
| E4: significance | female v.: 46; male v.: 33 | female v.: 16; male v.: 17 | OS, v1 to v4 | Multivariate Cox model (text) |
| E5: overall prediction | as E4 | as E4 | OS, v1 to v4 | Prediction error curves based on cross-validation (Figure 10) |

n = number of patients; e = number of observed events; t. = training set; v. = validation set.
a) Patients, treatment and variables

| Study and marker | Remarks |
|---|---|
| Marker | OS = 8-probe-set gene-expression signature |
| Further variables | v1 = age, v2 = sex, v3 = FISH, v4 = IGVH |
| Reference | Herold et al. (2011) |
| Source of the data | GEO (reference: GSE22762) |
| Patients | | n | Remarks |
|---|---|---|---|
| Training set | Assessed for eligibility | 151 | Disease: chronic lymphocytic leukemia; patient source: Department of Internal Medicine III, University of Munich (2001-2005) |
| | Excluded | 0 | |
| | Included | 151 | Criteria: sample availability; gene expression profiling: 44 Affymetrix HG-U133 A&B microarrays, 107 Affymetrix HG-U133 Plus 2.0 microarrays |
| | With outcome events | 41 | Overall survival |
| Validation set | Assessed for eligibility | 149 | Disease: chronic lymphocytic leukemia; patient source: Department of Internal Medicine III, University of Munich (2005-2007) |
| | Excluded | 18 | Due to missing clinical information |
| | Included | 131 | Criteria: sample availability; gene expression profiling: 149 qRT-PCR (only selected genes) |
| | With outcome events | 40 | Overall survival |
| Relevant differences between training and validation sets | |
|---|---|
| Data source | Same institution, different time (see above) |
| Measurement of gene expressions | Affymetrix HG-U133 vs. TaqMan LDA (see text) |
| Survival rate | Lower in the validation set (see Figure 4) |
b) Statistical analyses of survival outcomes

| Analysis | n | e | Variables considered | Results/remarks |
|---|---|---|---|---|
| **F: preliminary analysis (separately on training and validation sets)** | | | | |
| F1: univariate | t.: 151; v.: 131 | t.: 41; v.: 40 | v1 to v4 | Kaplan-Meier curves (Figure 3) |
| **G: multivariate testing of the omics score in the validation data (only validation set involved)** | | | | |
| G1: significance | 131 | 40 | OS, v1 to v4 | Multivariate Cox model (Table 5) |
| **H: comparison of the predictive accuracy of clinical and combined models through cross-validation in the validation data (only validation set involved)** | | | | |
| H1: overall prediction | 131 | 40 | OS, v1 to v4 | Prediction error curves based on cross-validation (Figure 11); integrated Brier score based on cross-validation (text) |

n = number of patients; e = number of observed events; t. = training set; v. = validation set.
Data
Acute myeloid leukemia
Chronic lymphocytic leukemia
Methods
Scores
Strategies
- the procedure used to derive a combined prediction score;
- the evaluation scheme used to compare the prediction accuracy of the clinical and combined prediction scores on the validation set.
Evaluation criteria
| Aspect | Measure | Characteristics |
|---|---|---|
| Discriminative ability | Kaplan-Meier curves for risk groups | Better with greater distance between the Kaplan-Meier curves for the low- and high-risk groups |
| | C-index | Estimates the concordance probability, i.e. the probability that the score correctly orders two patients with respect to their survival times; higher values correspond to better prediction |
| | K-statistic | Alternative to the C-index; works only under the proportional hazards assumption |
| Calibration | Survival curves | Compares the observed survival function with the average predicted curve |
| | Calibration slope | Computes the regression coefficient of the prognostic score as unique predictor; the best values are those close to 1; related to overfitting issues |
| Overall prediction | Prediction error curves | Present the Brier score versus time; the closer the curves are to the X-axis, the better the prediction |
| | Integrated Brier score | Computes the area under the prediction error curve; the smaller the value, the better the prediction |
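To make the C-index entry above concrete, here is a minimal pure-Python sketch of Harrell's concordance index for right-censored data. The function name and the toy times, event indicators, and risk scores are our own illustration, not taken from the study data.

```python
def c_index(times, events, scores):
    """Harrell's concordance index for right-censored survival data.

    `times` are observed times, `events` flag deaths (1) vs. censoring (0),
    and higher `scores` are taken to mean higher risk (shorter survival).
    """
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            # order the pair so that `a` has the shorter observed time
            a, b = (i, j) if times[i] < times[j] else (j, i)
            # a pair is usable only if the earlier time is an actual event
            if times[a] == times[b] or not events[a]:
                continue
            usable += 1
            if scores[a] > scores[b]:
                concordant += 1.0   # higher-risk patient died first
            elif scores[a] == scores[b]:
                concordant += 0.5   # tied scores count as half-concordant
    return concordant / usable

# Toy data (invented): the risk scores agree with every usable pair,
# so the C-index is 1.0.
times  = [2.0, 5.0, 3.0, 8.0, 6.0]
events = [1,   0,   1,   1,   0]
risk   = [0.9, 0.2, 0.7, 0.1, 0.3]
print(c_index(times, events, risk))  # → 1.0
```

Note that pairs in which the shorter observed time is censored carry no ordering information and are simply skipped, which is why the two censored patients above do not penalize the score.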
Results
Acute myeloid leukemia
| Variable | Coeff | Sd(coeff) | P-value |
|---|---|---|---|
| Omics score | 0.523 | 0.243 | 0.0312 |
| Age (continuous) | 0.022 | 0.015 | 0.1340 |
| Sex (male) | 0.643 | 0.404 | 0.1114 |
| FLT3-ITD | 0.436 | 0.440 | 0.3220 |
| NPM1 (mutated) | -0.377 | 0.404 | 0.3497 |

Log-hazard ratios

| Variable | Training | Validation |
|---|---|---|
| Omics score | 0.642 (0.172) | 0.523 (0.243) |
| Age (continuous) | 0.021 (0.008) | 0.022 (0.015) |
| Sex (male) | -0.024 (0.208) | 0.643 (0.404) |
| FLT3-ITD | 0.448 (0.253) | 0.436 (0.440) |
| NPM1 (mutated) | -0.370 (0.215) | -0.377 (0.404) |
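As a reading aid for tables like these: exponentiating a Cox log-hazard ratio gives the hazard ratio, and exponentiating coefficient ± 1.96 × SE gives an approximate 95% Wald confidence interval. A minimal sketch (the helper name is ours), applied to the validation-set omics-score entry from the table:

```python
import math

def hazard_ratio(log_hr, se, z=1.96):
    """Convert a Cox log-hazard ratio and its standard error
    into a hazard ratio with an approximate 95% Wald CI."""
    return (math.exp(log_hr),
            math.exp(log_hr - z * se),
            math.exp(log_hr + z * se))

# Omics score, validation set: coefficient 0.523, SE 0.243 (see table)
hr, lo, hi = hazard_ratio(0.523, 0.243)
# hr ≈ 1.69, 95% CI ≈ (1.05, 2.72): a one-unit increase in the score
# multiplies the hazard by about 1.69
```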
Chronic lymphocytic leukemia
| Variable | Coeff | Sd(coeff) | P-value |
|---|---|---|---|
| Omics score | -0.589 | 0.150 | 8.65×10⁻⁵ |
| Age (continuous) | 0.113 | 0.023 | 6.82×10⁻⁷ |
| Sex (female) | 0.157 | 0.343 | 0.6472 |
| FISH = 1 | 0.171 | 0.459 | 0.7092 |
| FISH = 2 | 1.352 | 0.590 | 0.0219 |
| FISH = 3 | -0.195 | 0.665 | 0.7694 |
| FISH = 4 | -0.459 | 0.427 | 0.2823 |
| IGVH (mutated) | 0.695 | 0.416 | 0.0949 |
Discussion
Conclusion
- When testing is performed for a multivariate model on the validation data, the omics score may have a significant p-value but show poor or no added predictive value when measured using criteria such as the Brier score. This is because a test in multivariate regression tests whether the effect of the omics score is zero, but does not assess how much accuracy is gained by including it in the model.
- To gain information on, and "validate", predictive value, it is necessary to apply models with and without the omics score to the validation data. There are essentially two ways to do that.
- The first approach (denoted "evaluating the clinical model and the combined model on validation data" in this paper) consists of fitting a clinical model and a combined model on the training data and comparing the prediction accuracy of both models on the validation data. This is the most intuitive way to proceed in low-dimensional settings. The problem in high-dimensional settings is that the omics score is likely to overfit the training data. As a result, its effect may be overestimated when its regression coefficient is estimated on the same set that was used for its construction. We have seen how this leads to serious problems, especially in terms of poor calibration. Furthermore, this approach is not applicable when the omics data have been measured with different techniques in the training and validation sets, as in the CLL data.
- The second approach, which we recommend in high-dimensional settings, consists of using a cross-validation-like procedure to compare models with and without the omics score using the validation set. By using the validation set only, we avoid the overfitting problem described above. When using this approach, it is recommended to perform as many repetitions of CV as is computationally feasible (and to average the results over the repetitions) in order to achieve more stable results.
- Alternatively, one could fit the models on the validation set and use an additional third set to assess them. This approach would avoid cross-validation procedures, which are known to be affected by high variance, especially in high-dimensional settings. However, the opportunity to assess the models on a third set is rarely given in the context of omics data, since datasets are usually too small to be split.
- In any case, it is important that the training and validation sets are completely independent. The practice of evaluating the prediction ability of a model, correctly fitted only on the training set, on the whole dataset obtained by merging the training and validation sets is not appropriate. It results in an overoptimistic estimate of prediction accuracy, because the overoptimism due to evaluation on the training data is only partially mitigated by the correct estimate obtained on the independent validation data [7, 55].
- All in all, our procedures are in line with the recommendations given in a recent paper by Pepe and colleagues [22]. That paper suggests that, in the case of a binary outcome, all tests of equality between the discriminative abilities of the clinical and combined scores address the same null hypothesis, namely that the coefficient of a predictor in a regression model is zero. Assuming that this statement also roughly applies to the survival analysis framework considered in our paper, it means that we can rely on the likelihood ratio test of the regression coefficient of the omics score in the combined Cox model to test the difference in performance between the models with and without omics predictors. However, the same authors also stress that estimating the magnitude of the improvement in prediction ability is much more important than testing for its presence [22]. This cannot be done by looking at the regression coefficient of the omics score alone, as often discussed in the literature [56, 57] and illustrated through our AML data example. In this paper we have presented procedures to quantify the improvement in prediction accuracy of a model containing an omics score derived from high-dimensional data, in order to validate its added predictive value.
- Subgroup analyses can give valuable insights into the predictive value of the score, and are therefore illustrated here through the example of the AML dataset. Such analyses should be motivated by a clear biological rationale and, importantly, performed only as far as the sample sizes allow. One should also keep in mind that they may be affected by multiple-testing issues; their results should be interpreted from an exploratory perspective.
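The repetition-and-averaging mechanics recommended above for cross-validation can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `repeated_kfold` and `toy_score` are invented names, and the toy evaluation (squared error of a constant predictor on binary outcomes) merely stands in for a real survival metric such as the integrated Brier score of competing Cox models.

```python
import random

def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (train, test) index lists for `repeats` independently
    shuffled k-fold splits of n observations."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                              # new shuffle per repeat
        folds = [idx[i::k] for i in range(k)]         # k disjoint folds
        for j in range(k):
            train = [i for f in folds[:j] + folds[j + 1:] for i in f]
            yield train, folds[j]

def toy_score(train_y, test_y):
    """Stand-in for a real survival metric (e.g. the integrated Brier
    score): squared error of a constant predictor fitted on `train_y`."""
    pred = sum(train_y) / len(train_y)
    return sum((y - pred) ** 2 for y in test_y) / len(test_y)

# Invented binary outcomes, standing in for (time, status) survival data.
y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
scores = [toy_score([y[i] for i in tr], [y[i] for i in te])
          for tr, te in repeated_kfold(len(y), k=3, repeats=20, seed=1)]
avg = sum(scores) / len(scores)   # average over all folds and repetitions
```

Averaging over many shuffled repetitions, rather than reporting a single k-fold run, is what stabilizes the comparison between the models with and without the omics score.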