Background
Providing guidance on specific therapies for pathologically distinct tumor types and stages, to maximize treatment efficacy and minimize toxicity, is an important goal in clinical oncology [1-6]. Prediction models that combine the TNM staging system (primary tumor (T), regional lymph nodes (N), and distant metastasis (M)) with basic clinical covariates to classify patients as high-risk or low-risk for treatment recommendation have been in use for more than a decade.
Recent developments in microarray technology have accelerated research on genomic biomarker classifiers for safety assessment, disease diagnosis and prognosis, and prediction of response for patient assignment [6-10]. Several microarray studies have shown an association between patient survival and gene expression profiles [10-20]. Some recent publications have investigated the use of microarray gene expression data, alone or in combination with clinical covariates [21-23], as an improvement over the standard approach of using only clinical variables to estimate patient survival. It is well known that using all genes to develop a microarray-based prediction model can degrade its performance, so selecting a subset of relevant genes is an important part of developing a microarray-based classifier. However, there is currently no consensus about which types of algorithms are best for modeling gene expression data, alone or in combination with clinical variables, for binary prediction. Selecting the most relevant genes to develop prediction models for survival risk presents additional challenges.
In the evaluation of a prediction model, the two most important considerations are 1) predictability (predictive performance): the ability to predict the survival risks of patients accurately, and 2) reproducibility (generalizability): the ability of the model to predict new samples generated at different locations or at different times. A good prediction model should perform well in both respects. Higher reproducibility does not necessarily imply better predictability; indeed, reproducibility is meaningful only when the model has good predictability. Evaluation of the performance of prediction models for binary outcomes has been well studied for data generated from a single medical center. The performance of a binary classifier is typically evaluated in terms of positive and negative predictive values, sensitivity and specificity, and/or accuracy, all of which are based on the numbers of true and false positives and true and false negatives. In contrast, when the outcome is survival time in the presence of censored observations, the measure of predictability is less apparent. Survival prediction modeling is usually performed to classify patients into two or more risk groups, so that patients can be treated according to their risk group, rather than to predict exact survival times. Common measures to assess the predictive performance of a survival prediction model include hazard ratios, significant differences in the Kaplan-Meier survival curves between identified risk groups, the concordance index [24-26], Brier scores [27], an absolute measure of predictive accuracy [28], and several others [29-33]. Together, these measures evaluate different aspects of the predictability of a model and its ability to accurately characterize a patient’s survival risk.
Assessing the generalizability of a prediction model means determining whether its performance is reproducible for similar data generated at the same or different locations and/or at different times. A prediction model is intended to be applied to new samples. In addition to performing well on samples obtained from the current study, its predictive performance must generalize across different studies. A prediction model developed from one study and shown to perform well there might not reproduce its performance when applied to other studies. The lack of reproducibility of predictive signatures and predictive models across studies is a recognized problem. Several large-scale screening studies [34-36] have identified gene signatures with high predictive performance in their original discovery datasets, yet a recent report has indicated that these signatures seldom overlap across different studies [37]. For example, Shedden et al. [21] attempted unsuccessfully to validate the signatures reported by Chen et al. [13]. This lack of reproducibility makes it difficult to apply these biomarkers clinically for treatment recommendation.
Justice et al. [38] considered two terms for assessing a prognostic system: accuracy (calibration and discrimination) and generalizability (reproducibility and transportability). They defined calibration as “predicted probability is neither too high nor too low” for an individual patient and discrimination as “relative ranking of individual risk is in correct order”. Reproducibility was defined as “ability to produce accurate predictions among patients not included in the development of the system but from the same population”, and transportability as “ability to produce accurate predictions among patients drawn from a different but plausibly related population”. Note that the assessment of accuracy evaluates predictive performance. However, in the context of Justice et al. [38], the accuracy assessment covers both the predicted probabilistic risk of an individual patient (calibration) and the ranking of his/her risk relative to other patients (discrimination). This paper focuses only on evaluating how well the ranking of risks matches observed survival times and on classifying patients into risk categories accordingly. Furthermore, generalizability as defined by Justice et al. [38] consists of reproducibility (internal validity) and transportability (external validity). Predictive performance (or accuracy) should be evaluated on patients that were not included in the model development [7, 39]. Typically, the current samples are used in two ways: (i) as training samples to develop the prediction model and (ii) as test (future) samples to assess predictive performance [7, 39-41]. That is, assessment of reproducibility within a study has been an integral part of model development. A prediction model developed from a single study does not reflect the many sources of variability outside research conditions, such as the historical, geographic, methodologic, spectrum, and follow-up interval aspects described by Justice et al. [38], and its evaluation represents an internal validation [39]. In this paper, reproducibility refers to the ability to reproduce performance on patients from other studies, that is, transportability. Therefore, “reproducibility” here has the same meaning as “generalizability”. The term “reproducibility” is commonly used in the evaluation of different platforms, studies, gene signatures, etc. [42, 43]. More detailed approaches and discussions on the development of a prediction model from a single study are given in the Discussion section. The definitions of the terms considered in this paper are summarized in Table 1.
Table 1. Definitions of key terms
Term | Definition |
Predictability (Predictive performance) | Ability of a model to predict risk scores of patients that can match their survival risks (not survival times). |
Generalizability (Reproducibility) | Ability of a model to predict risk scores of patients generated from different studies (different locations or different times). |
Consistency | Agreement between two centers to predict the risk scores of a targeted center. |
Transferability | Agreement between one center and the targeted center to predict risk scores of targeted center. |
Internal validation | An assessment of predictive performance of a model in which the available data are divided into a training set and a test set, the model is developed in the training set and applied to the test set. |
The primary objective of this paper is to present approaches for investigating the reproducibility of predictive models and signatures across different medical centers. Reproducibility across centers is evaluated in terms of consistency and transferability. Consistency is the agreement between the risk scores predicted by two centers. Transferability from one center to another is the agreement between the risk scores of the second center as predicted by each of the two centers. We considered eight risk prediction models based on established approaches for modeling clinical variables and microarray gene expression data. Two recent studies, on lung cancer [21] and colon cancer [20], in which data were collected from more than one center, are used to evaluate the predictability and generalizability of predictive models and signatures.
The first step in evaluating the reproducibility of a prediction model is to assess its predictability. In theory, some models may have good predictability but poor reproducibility, or vice versa. Models with both high predictability and high reproducibility are obviously desirable. Since standard measures for assessing the predictability of survival prediction models have not been fully established, various predictability measures are considered in the evaluation. The predictability and reproducibility measures of each of the eight models are calculated to assess its overall performance. However, we do not attempt to propose or identify the best approach or model for predicting patient survival risk; the purpose is to illustrate the differences among the eight models.
Methods
Models developed from training dataset
Eight survival prediction models for estimating patient survival risk were considered: two clinical models, two gene expression models, and four models that combine each of the two clinical models with each of the two gene expression models. The two clinical models were 1) the Cox proportional hazards model (Model A) and 2) the regression tree (Model B); both are well-established methods for modeling survival data. All clinical variables, including AJCC (American Joint Committee on Cancer) stage, gender, age, and histology, were considered in both models. The Cox proportional hazards approach involved fitting the relevant clinical variables in a multivariate Cox model [44]. The regression tree approach consisted of two steps. The first step was to use a standard survival tree model [45-50] to classify patients into different risk groups according to their incidence rates. The second step was to fit a univariate Cox model with the patients’ incidence rates as the independent variable.
Gene expression data typically involve a large number of genes, so selecting a subset of relevant genes to enhance predictive performance is an important part of model development. The data were first analyzed using univariate Cox models to select a set of “significant” genes. There may still be too many significant genes for a single model, which can make the model estimates unstable. Dimension reduction by principal component analysis can be applied to extract k relevant meta-genes, which are linear combinations of all the selected genes. The value of k can be tuned by cross-validation, but we set k = 5. An alternative approach is to use the k most significant genes to develop the model; here we set k = 10. For the set of selected genes, two gene expression models were developed using a multivariate Cox model with covariates given by 1) the first five principal components (Model C) and 2) the top 10 ranked genes (Model D). Each gene expression model was additively combined with each clinical model to give four combined clinical and gene expression models: E = A + C, F = A + D, G = B + C, and H = B + D. A summary of the eight models is given in Additional file 1: Table S1.
Assessment of predicted risk scores for the patients in test dataset
The regression coefficients of the fitted Cox model developed from the training data, $\hat{\beta}_{\mathrm{train}}$, were used to compute the predictive risk scores for each patient in the test data, $X_{\mathrm{test}}\hat{\beta}_{\mathrm{train}}$. The predictive risk scores were then used to compute predictive performance measures to evaluate the survival prediction model built from the training data. Although the continuous risk scores for the test data are adequate for ranking risk levels, clinicians often use stratified risk groups to present risk categories to patients. Therefore, both approaches are considered in the evaluation: single-group analysis and two-group comparison.
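As a minimal sketch of this scoring step, the risk score of a test patient is the linear predictor formed from that patient's covariates and the coefficients estimated on the training data. The function name `risk_scores`, the coefficient values, and the two covariates below are hypothetical and chosen only for illustration; they are not taken from the fitted models in this paper.

```python
from typing import List, Sequence


def risk_scores(x_test: Sequence[Sequence[float]],
                beta_train: Sequence[float]) -> List[float]:
    """Return the Cox linear predictor (risk score) for each test patient."""
    scores = []
    for row in x_test:
        if len(row) != len(beta_train):
            raise ValueError("each patient needs one value per coefficient")
        scores.append(sum(x * b for x, b in zip(row, beta_train)))
    return scores


# Hypothetical coefficients for two covariates (say, stage and age).
beta_train = [0.8, -0.02]
# Hypothetical test patients, one row per patient: [stage, age].
x_test = [[1.0, 60.0], [2.0, 70.0], [3.0, 50.0]]
scores = risk_scores(x_test, beta_train)
# A higher score means a higher predicted hazard; rank patients from
# highest to lowest predicted risk.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Only the ordering of the scores matters for ranking patients, which is why the baseline hazard of the Cox model does not need to be estimated for this step.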
In the single-group analysis, the p-value of the hazard ratio, $R^2$, Somers’ rank correlation $D_{xy}$ [26], and the time-dependent receiver operating characteristic (ROC) curve are obtained to evaluate the predictive scores. The p-value of the hazard ratio and $R^2$ are calculated from a univariate Cox model fitted to the predictive risk scores. Somers’ rank correlation $D_{xy}$ and $R^2$ measure goodness of fit in terms of agreement and explained variation, respectively, between the risk scores and the survival times. The ROC curve is a measure of the predictive ability of binary classifiers; Heagerty et al. [30] first extended it to a time-dependent ROC curve, ROC(t), for censored survival data to evaluate a diagnostic marker. They showed that estimating the true positive rates, TPR(t), and false positive rates, FPR(t), of the ROC(t) curve by conditional probabilities can lead to an inconsistency in the form of negative probability mass. However, the ROC(t) curve used in this paper does not produce this inconsistency (Additional file 1).
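The rank-agreement measure can be illustrated with a short sketch. It computes Harrell's concordance index C for right-censored data and converts it to Somers' rank correlation through the standard relation D_xy = 2(C - 0.5). The sketch assumes that a higher risk score should correspond to shorter survival and that a pair of patients is usable only when the shorter time is an observed event; the function name and the toy data are hypothetical.

```python
from typing import Sequence


def somers_dxy(times: Sequence[float],
               events: Sequence[int],
               scores: Sequence[float]) -> float:
    """Somers' Dxy between risk scores and censored survival times."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair is usable when patient i had an event strictly before
            # the observed time of patient j.
            if events[i] and times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0   # earlier event has the higher score
                elif scores[i] == scores[j]:
                    concordant += 0.5   # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    c_index = concordant / usable
    return 2.0 * (c_index - 0.5)


# Toy data: the third patient is censored (event indicator 0).
times = [5.0, 8.0, 12.0, 20.0]
events = [1, 1, 0, 1]
dxy_perfect = somers_dxy(times, events, [2.0, 1.5, 0.3, 0.1])
dxy_partial = somers_dxy(times, events, [2.0, 0.2, 0.3, 0.1])
```

A value of 1 indicates that the risk scores order all usable pairs correctly, 0 indicates no better than random ordering, and negative values indicate a reversed ordering.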
The two-group comparison is the most frequently used approach for performance assessment. The test data are first divided into high-risk and low-risk groups by a cutoff threshold, and the Cox model or log-rank test is then applied to compare survival times between the two groups. This approach depends on the choice of threshold; we use the median of the training scores. A significant p-value implies that the survival times of the high-risk and low-risk groups defined by the risk scores differ significantly. For completeness, we calculated the p-values of both the hazard ratio in the Cox model and the log-rank test.
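The two-group comparison can be sketched as follows: test patients are labeled high-risk when their score exceeds the median of the training scores, and a standard two-group log-rank test compares the groups. Treating scores strictly above the median as high-risk is an assumption of this sketch, and the function names and toy data are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def split_by_training_median(train_scores: Sequence[float],
                             test_scores: Sequence[float]) -> List[int]:
    """Label each test patient 1 (high risk) or 0 (low risk)."""
    cutoff = median(train_scores)
    return [1 if s > cutoff else 0 for s in test_scores]


def logrank_test(times: Sequence[float], events: Sequence[int],
                 groups: Sequence[int]) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)                      # at risk
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the events in group 1 at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("variance is zero; groups cannot be compared")
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


groups = split_by_training_median([0.1, 0.5, 0.9], [0.2, 0.5, 0.7])
chi2, p_value = logrank_test([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1],
                             [1, 1, 1, 0, 0, 0])
```

Because the cutoff comes from the training scores, the two test groups need not be of equal size.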
Correlation coefficient for measure of consistency and transferability
Reproducibility of survival risk predictions was evaluated in terms of two measures: consistency and transferability. Both measure the agreement between the risk scores predicted by two centers for a targeted center. Consistency is the agreement between two centers in predicting the risk scores of another center, which can be one of the two centers or an independent third center. Transferability from one center to another is the agreement between the risk scores of the second center as predicted by each of the two centers. Transferability can be assessed in terms of 1) whether a predictive model developed at one center can be applied to predict the survival risk of patients from other centers (model transferability) or 2) whether the signature markers of a predictive model developed at one center can be applied to predict patients from other centers (signature transferability). Transferability characterizes the applicability of a model built at one center when it is applied to the targeted center, whereas consistency characterizes the agreement between two centers in predicting a targeted center. Consistency is a general term covering two or more centers. Since the agreement between two centers is of primary interest, the assessment of transferability is more useful.
Agreement between two centers is evaluated using Pearson’s correlation coefficient.
The transferability from center i to center j can be expressed mathematically as $\rho(X_j\hat{\beta}_i, X_j\hat{\beta}_j)$, and the consistency of centers i and j in predicting center k can be expressed as $\rho(X_k\hat{\beta}_i, X_k\hat{\beta}_j)$, where $X_i$ and $\rho$ are the predictor matrix of center i and the Pearson correlation coefficient, and $\hat{\beta}_i$, $\hat{\beta}_j$, and $\hat{\beta}_k$ are the coefficients of the fitted models developed from centers i, j, and k, respectively. When k = j (or i), the consistency is identical to the transferability from center i to j (or from j to i). The coefficient $\hat{\beta}_i$ can be estimated from the model developed using the entire dataset or using a partial dataset from center i. Using the entire dataset evaluates the transferability once, i.e., the correlation coefficient between the two sets of predicted scores developed by the two centers is computed once. Using partial data, on the other hand, allows the correlation coefficient to be computed multiple times with different partial datasets. In the analysis of the two cancer datasets (Results), the coefficients are estimated from the entire dataset for the lung cancer data and from a partial dataset for the colon data.
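The two agreement measures can be sketched in a few lines, assuming that each center's model is summarized by a coefficient vector and that risk scores are linear predictors. Transferability from center i to center j is then the Pearson correlation between the scores of center j's patients computed with the coefficients of center i and with those of center j; consistency replaces the patients of center j with those of a target center k. All names and numbers below are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear_scores(x: Matrix, beta: Sequence[float]) -> List[float]:
    """Risk score (linear predictor) for each row of the predictor matrix."""
    return [sum(v * b for v, b in zip(row, beta)) for row in x]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two score vectors."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    if s_aa == 0 or s_bb == 0:
        raise ValueError("scores are constant; correlation is undefined")
    return s_ab / math.sqrt(s_aa * s_bb)


def consistency(x_k: Matrix, beta_i: Sequence[float],
                beta_j: Sequence[float]) -> float:
    """Agreement of centers i and j in predicting target center k."""
    return pearson(linear_scores(x_k, beta_i), linear_scores(x_k, beta_j))


def transferability(x_j: Matrix, beta_i: Sequence[float],
                    beta_j: Sequence[float]) -> float:
    """Agreement of centers i and j in predicting center j itself."""
    return consistency(x_j, beta_i, beta_j)


# Hypothetical predictor matrix of center j and two coefficient vectors.
x_j = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
beta_i = [1.0, 0.5]
beta_j = [0.9, 0.7]
r_ij = transferability(x_j, beta_i, beta_j)
```

Writing transferability as a special case of consistency mirrors the statement that the two coincide when the target center is one of the two centers being compared.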
Discussion and conclusions
Development of a risk prediction model for clinical use involves two stages: 1) model development based on a set of signatures, and 2) model validation in a prospective clinical trial. This paper mainly considers the first stage. Model development itself involves two steps: 1) model building and 2) model (analytical) validation. Model building involves fitting a Cox survival model with a selected set of relevant predictor signatures from the present study. Model validation assesses whether the fitted model can predict the relative risk of patient samples from the available data, which can include the present study and other studies. Since a prediction model is typically developed from a single study, model validation often refers to the assessment of predictive performance.
Two methods are commonly used to assess the performance of a prediction model: the split-sample method and the cross-validation method. In the split-sample procedure, the dataset is split into two subsets (either by randomly splitting the entire dataset or by designating a test dataset): a training set for model building and a test set for model validation. Cross-validation involves repeatedly splitting the data into a training set and a test set, and the predictive performance is the “average” over the numerous training-test partitions. The split-sample procedure provides a single analysis of performance metrics such as $D_{xy}$ and p-values. For data from a single-center study, cross-validation can be more valuable than the random split method. The analysis of data from a multicenter study is often limited to evaluating performance metrics using the split-sample method [20, 21]. A multicenter study, however, provides valuable data for further model validation. In addition to investigating the predictability of a model (Tables 2, 3, 4, 5, 7, and 8, and Additional file 1: Tables S2-S5), this paper presents statistical analyses to illustrate an assessment of cross-center reproducibility. Assessment of reproducibility across centers provides another layer of model validation.
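The difference between the two procedures can be made concrete with a small sketch that generates the index partitions. It is an illustration of the general techniques, not the analysis code of this paper; the function names, the default test fraction, the number of folds, and the `metric` callback are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Partition = Tuple[List[int], List[int]]  # (training indices, test indices)


def split_sample(n: int, test_fraction: float = 0.5, seed: int = 0) -> Partition:
    """One random split of n patients into a training set and a test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def k_fold(n: int, k: int = 5, seed: int = 0) -> List[Partition]:
    """k training-test partitions; each patient is tested exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    partitions = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        partitions.append((train, test))
    return partitions


def cross_validated(metric: Callable[[List[int], List[int]], float],
                    n: int, k: int = 5, seed: int = 0) -> float:
    """Average a performance metric over the k training-test partitions."""
    values = [metric(train, test) for train, test in k_fold(n, k, seed)]
    return sum(values) / len(values)


train_idx, test_idx = split_sample(10, test_fraction=0.3)
partitions = k_fold(10, k=5)
```

In practice, `metric` would fit a model on the training indices and return a measure such as Somers' rank correlation on the test indices; the split-sample method yields one such value, whereas cross-validation averages several.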
Cross-validation has also been applied to tune the parameters of training methods for gene selection, such as univariate selection, forward stepwise selection, principal components regression, supervised principal components regression, partial least squares regression, ridge regression, and LASSO [52-54], and this approach may lead to less overfitted models. However, poor prediction ability can result not only from overfitting but also from other causes, such as inappropriate models, and prediction ability indices are more appropriate for assessing performance. Thus, cross-validation for parameter tuning was not applied to obtain the models in this paper; some of the models we used are well-established methods [33, 41].
Cross-center reproducibility is measured by the correlation between two sets of predicted scores derived from two centers, whereas the standard performance assessment considers the analysis of predicted scores from the test data. We use two terms to describe cross-center reproducibility: consistency and transferability. Consistency is a general term referring to the agreement between two sets of predicted scores derived from two entities (centers). Transferability refers specifically to the agreement between two sets of risk scores for a target center, where one set is derived from a model developed at the target center and the other is predicted by another center. The risk scores of the target center can be the training scores derived from fitting the entire dataset (lung cancer data), or they can be predicted scores derived from fitting a partial dataset (colon cancer data). Because most cancer studies do not involve more than two centers, transferability should be the more useful measure in practice for assessing reproducibility between two centers.
The lung cancer data came from four medical centers. The predictability of each of the eight models was assessed by evaluating the performance metrics for between-center predictions (Additional file 1: Tables S2-S5). The consistency and transferability of the predicted scores derived from two centers were further evaluated (Table 6). In this analysis, the entire dataset was used; that is, the consistency and transferability correlations were evaluated for the entire target center. Model A appears to perform best in terms of both predictability and cross-center reproducibility. In terms of cross-center prediction, the prediction from HLM to DFCI is the best, which could result from the high agreement of the clinical variables between the datasets. We conclude that a good prediction model shows high cross-center consistency. It should be emphasized, however, that higher consistency does not necessarily imply better performance.
The colon cancer data were analyzed slightly differently. The colon dataset consisted of 55 VMC patients and 177 MCC patients. We used MCC as the target center to evaluate the prediction models built from VMC. In this analysis, 55 randomly selected MCC patients (a partial dataset) were used to develop a model to predict the remaining 122 patients, and this model was compared with the model developed from the 55 VMC patients. The consistency and transferability correlations were evaluated only for the 122 patients. We considered signature transferability and model transferability to assess the generalizability of the prediction models. Although a major concern in the validation of a microarray-based prediction model is model transferability, i.e., the usefulness of a transferred model outside of its intended use, it is also desirable that the signature developed from the internal dataset be applicable to predicting future samples.
For gene expression Models C and D, the signature consistency appears to be higher than the model consistency (Figure 1). However, neither model performs well when compared to the clinical models (Tables 7 and 8). It should be emphasized that higher consistency does not necessarily imply better performance. Model G has the highest $D_{xy}$ value, but Models A and B have better consistency in both classifier transferability and signature transferability. Overall, Model A (Cox model) appears to perform more consistently than Model B (regression tree). The reproducibility across centers or studies of cancer survival models that include gene expression data may therefore still be controversial; this could be caused by geographic and/or methodologic variation and should be studied extensively. Finally, the transferability of a model or signature is meaningful only after the model has established its performance.
Generally, a prediction model has medical utility only if it enables clinicians to make better treatment decisions for individual patients. Establishing the medical utility of a prediction model requires validation in a prospective clinical trial. Although a number of publications have discussed the design and analysis of clinical trials for validating cancer prognostic and predictive models [55-58], very few such trials have been conducted. A major reason is that prediction models lack the reproducibility needed to justify a prospective clinical validation trial. In this paper, we illustrate an analytical (external) validation of risk prediction modeling to assess reproducibility across studies. Based on the analyses of the two cancer datasets, we conclude that the models with clinical variables appear to perform well, with a high degree of consistency and transferability, and that the inclusion of gene expression variables shows little improvement.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JJC conceived the study and wrote the manuscript. HCC developed and implemented the methodology and performed the analysis. Both authors read and approved the manuscript.