Patients and materials
Gene-expression data from 230 stage I to III breast cancers, without individual patient identifiers, were provided to the MAQC project by the University of Texas M.D. Anderson Cancer Center (MDACC) Breast Cancer Pharmacogenomic Program. Gene-expression results were generated from fine-needle aspiration specimens of newly diagnosed breast cancers before any therapy. The biopsy specimens were collected sequentially between 2000 and 2008 during a prospective pharmacogenomic marker-discovery study approved by the institutional review board. These specimens contain 70% to 90% pure neoplastic cells with minimal stromal contamination [12]. All patients signed informed consent for genomic analysis of their cancers. Patients received 6 months of preoperative (neoadjuvant) chemotherapy including paclitaxel, 5-fluorouracil, cyclophosphamide, and doxorubicin, followed by surgical resection of the cancer. Response to preoperative chemotherapy was categorized as a pathologic complete response (pCR = no residual invasive cancer in the breast or lymph nodes) or residual invasive cancer (RD). The prognostic value of pCR has been discussed extensively in the medical literature [13]. Genomic analyses of subsets of this sequentially accrued patient population were reported previously [9,14,15]. For each endpoint, we used the first 130 cases as a training set to develop prediction models, and the next 100 cases were set aside as an independent validation set. Table 1 and Additional file 1 show the patient and sample characteristics of the two data sets.
Gene-expression profiling
Needle-aspiration specimens of the cancer were placed into RNAlater™ solution (Qiagen, Germantown, MD, USA) and stored at -80°C until further analysis. RNA extraction and gene-expression profiling were performed in multiple batches over time, as described previously [16,17], by using Affymetrix U133A microarrays (Affymetrix, Santa Clara, CA, USA). Gene-expression data have been uploaded to the Gene Expression Omnibus website under accession number GSE16716. Normalization was performed by using MAS 5.0 software (Affymetrix, Santa Clara, CA, USA) with default settings. Quality-control assessment of the hybridization results was performed with SimpleAffy software from Bioconductor; the percentage of present calls had to be greater than 30%, the scaling factor less than 3, and the 3'/5' ratios less than 3 for β-actin and less than 1.3 for GAPDH. These quality-control metrics are presented for each case in Additional file 2.
Ranking of classification problems by informative feature utility score
To assess the relative difficulty of the three classification problems that we selected to study, we adopted an approach similar to that described in [18]. This method defines the utility of a feature i as its Fisher score,

$$f_i = \frac{(\mu_{1i} - \mu_{2i})^2}{\sigma_{1i}^2 + \sigma_{2i}^2},$$

where $\mu_{1i}$ and $\mu_{2i}$ are the class means, and $\sigma_{1i}$ and $\sigma_{2i}$ are the class standard deviations for the feature i, respectively. If the features are ordered so that $f_1 \geq f_2 \geq \ldots$, then, for each endpoint, the cumulative information is defined as

$$F = \sum_{i=1}^{N} f_i,$$

where N is the sample size. This cumulative information score assumes that the features are independent and that their effect on the classification performance is additive. This is rarely the case, as features are often correlated. Nonetheless, this cumulative information score is a simple and straightforward approach to estimate the relative difficulty of a classification problem early in the classifier-development process: an easier problem tends to have larger values of F than does a more difficult problem.
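As a rough illustration, the sketch below (not the authors' code) computes the Fisher scores and the cumulative information in R; reading the cumulative score as the sum of the top N sorted Fisher scores, with N the sample size, is our interpretation of the definition above.

```r
## Minimal sketch: Fisher scores and cumulative information.
## `x` is a samples-by-features expression matrix, `y` a two-level class factor.
fisher_score <- function(x, y) {
  cls <- levels(factor(y))
  x1 <- x[y == cls[1], , drop = FALSE]
  x2 <- x[y == cls[2], , drop = FALSE]
  (colMeans(x1) - colMeans(x2))^2 /
    (apply(x1, 2, sd)^2 + apply(x2, 2, sd)^2)
}

cumulative_information <- function(x, y) {
  f <- sort(fisher_score(x, y), decreasing = TRUE)  # f1 >= f2 >= ...
  N <- nrow(x)                                      # sample size
  sum(f[seq_len(min(N, length(f)))])                # sum over the top N features (our reading)
}
```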
Classification algorithms
We examined eight classifiers in combination with the previously mentioned feature-selection methods: linear discriminant analysis (LDA), diagonal linear discriminant analysis (DLDA), quadratic discriminant analysis (QDA), logistic regression (LREG), two k-nearest-neighbors classifiers with k = 3 (KNN3) and k = 11 (KNN11), and support vector machines with a radial basis function kernel using two different values of the kernel parameter, γ = 0.5 (SVM05) and γ = 2.0 (SVM2). Overall, 40 models were developed for each of the three prediction problems.
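For concreteness, a minimal R sketch of how these classifiers could be instantiated is shown below; the package choices (MASS, class, e1071) and wrapper names are our assumptions, not the implementation used in the study.

```r
## Sketch only: one possible instantiation of the eight classifiers in R.
library(MASS)    # lda(), qda()
library(class)   # knn()
library(e1071)   # svm()

fit_lda  <- function(x, y) lda(x, grouping = y)
fit_qda  <- function(x, y) qda(x, grouping = y)
fit_lreg <- function(x, y) glm(y ~ ., data = data.frame(x, y = y), family = binomial)
## DLDA (LDA with a diagonal pooled covariance matrix) has no base-R function and
## would need to be coded directly or taken from a dedicated package.
predict_knn3  <- function(xtr, ytr, xte) knn(xtr, xte, cl = ytr, k = 3)
predict_knn11 <- function(xtr, ytr, xte) knn(xtr, xte, cl = ytr, k = 11)
fit_svm05 <- function(x, y) svm(x, y, kernel = "radial", gamma = 0.5, probability = TRUE)
fit_svm2  <- function(x, y) svm(x, y, kernel = "radial", gamma = 2.0, probability = TRUE)
```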
Estimation of predictive performance
Leave-N-out cross-validation and other resampling methods of the training set are often used to select a final predictor for independent validation. Therefore, it is important to understand how resampling-based predictive performance correlates with predictive performance on independent validation cases. To study this question, we used a nested two-level cross-validation scheme, in which the cross-validation in the outer loop had the role of estimating the performance of the whole modeling procedure, whereas the cross-validation in the inner loop was used for selecting the optimal number of features [
19].
The procedure in the inner loop is as follows. For each combination of a feature-selection method F and a classification algorithm C, the number of features j(F, C) in the model was treated as a free parameter (within a predefined set of allowable values) and was optimized. In the inner loop, a repeated (five times), stratified (to preserve the proportion of the two classes in all training and testing splits) fivefold cross-validation was used to define the number of features that maximized the AUC. A ranking of the features was first obtained by applying F to the reduced internal training set (obtained by leaving aside one fold of the current training set). The classifier C was then trained on the same set, but using only the top j(F, C) features. The predictions on the internal testing set (the left-out fold) were recorded, and the procedure was repeated. At the end, an estimate of the AUC was obtained for the given combination of F, C, and j(F, C). The procedure was repeated with different folds, and an average AUC estimate was obtained for each F, C, and j(F, C). The optimal number of features j*(F, C) was selected as the value of j(F, C) yielding the highest average AUC. The numbers of features allowed for each model were chosen a priori, to avoid overfitting and to limit the computation time. For the prediction of ER status, the allowed numbers of features were all values from 2 to 15, whereas for both pCR endpoints, they were {2, 5, 8, ..., 41}, with 41 being almost half the size of the smallest training set (n = 85, ER-negative cancers). Pseudo-code detailing the cross-validation scheme is provided in Additional file 3. To avoid adding variability due to randomly partitioning the data into folds, all estimates were obtained on the same splits of the data.
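As an illustration only, the following R sketch (an approximation under our assumptions, not the authors' code; see Additional file 3 for their pseudo-code) implements one version of the inner loop: repeated, stratified fivefold cross-validation over a grid of candidate feature-set sizes, returning the size with the highest average AUC. The helpers rank_features and fit_predict are hypothetical placeholders for the feature-selection method F and the classifier C, and the use of the pROC package is an assumption.

```r
## Sketch of the inner loop selecting j*(F, C).
library(pROC)

select_n_features <- function(x, y, sizes, rank_features, fit_predict,
                              n_repeats = 5, n_folds = 5) {
  auc_mat <- matrix(0, nrow = n_repeats * n_folds, ncol = length(sizes))
  row <- 0
  for (r in seq_len(n_repeats)) {
    ## stratified folds: assign folds within each class, then recombine
    folds <- unsplit(lapply(split(seq_along(y), y),
                            function(idx) sample(rep(1:n_folds, length.out = length(idx)))),
                     y)
    for (k in seq_len(n_folds)) {
      row <- row + 1
      tr <- folds != k
      te <- !tr
      ranking <- rank_features(x[tr, ], y[tr])      # apply F on the internal training set
      for (s in seq_along(sizes)) {
        feats <- ranking[seq_len(sizes[s])]         # top j(F, C) features
        score <- fit_predict(x[tr, feats], y[tr], x[te, feats])  # train C, score left-out fold
        auc_mat[row, s] <- as.numeric(auc(y[te], score, quiet = TRUE))
      }
    }
  }
  sizes[which.max(colMeans(auc_mat))]               # size with the highest average AUC
}
```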
We investigated two methods in the outer loop. The first method is a stratified, 10-times-repeated fivefold cross-validation (10 × 5-CV). In each of the five cross-validation iterations, 80% of the data were first used as input to the inner-loop procedure for selecting features and training the classifier with the selected features; the remaining 20% of the data were then used to test the classifier. The 95% CI for the area under the receiver operating characteristic curve (AUC) was approximated by [AUC - 1.96 SEM, AUC + 1.96 SEM]. The SEM was estimated by computing, for each of the 10 repeats, the standard error of the mean of the five AUC estimates produced by the 5-CV, and then averaging these 10 estimates.
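A small sketch of how this interval could be assembled, assuming the fold-level AUC estimates from the 10 repeats of 5-CV are collected in a 10 × 5 matrix (our representation, not necessarily the authors'):

```r
## Sketch: mean AUC and 95% CI from 10 x 5-CV.
## `auc_cv` is a 10 x 5 matrix of AUC estimates (rows = repeats, columns = folds).
summarise_cv_auc <- function(auc_cv) {
  mean_auc <- mean(auc_cv)
  ## one SEM per repeat from its five fold-level AUCs, then averaged over the 10 repeats
  sem <- mean(apply(auc_cv, 1, function(a) sd(a) / sqrt(length(a))))
  c(AUC = mean_auc, lower = mean_auc - 1.96 * sem, upper = mean_auc + 1.96 * sem)
}
```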
The second method in the outer loop is a bootstrap-based method, also known as a smoothed version of cross-validation [20]. Efron and Tibshirani [20] proposed the leave-one-out bootstrap method for the error-rate performance metric, and their technique was recently extended by Yousef and colleagues [11] to the AUC. This method uses a leave-pair-out (LPO) bootstrap approach to estimate the mean AUC (the mean over training sets) and a "delta method after bootstrap" to estimate the variability of the estimated mean AUC. We point out that this variability captures both the effect of the finite training-set size and the effect of the finite testing-set size. In the LPO approach, multiple (n = 5,000) training sets are obtained by stratified bootstrap resampling, and each training set is used as input to the inner-loop procedure for selecting features and training the classifier with the selected features. In testing, each pair of cases (one from the positive class and one from the negative class) is tested on the classifiers trained on the bootstrap samples that do not contain the two held-out cases. The Wilcoxon-Mann-Whitney statistic of the prediction results on pairs of cases is averaged over all bootstrap training sets and is used to estimate the mean AUC. An advantage of this technique is that it allows the variability of the AUC estimator to be estimated by using the influence-function method [11,20]. By assuming that the estimated AUC is asymptotically normal, the 95% CI of the AUC can be approximated by [AUC - 1.96 SEM, AUC + 1.96 SEM].
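The simplified R sketch below (not the authors' code) illustrates the LPO bootstrap point estimate; train_and_score is a hypothetical wrapper around the full inner-loop feature selection plus classifier training, the influence-function variance estimate of [11,20] is omitted, and pair contributions are pooled across bootstrap samples rather than averaged per pair first.

```r
## Simplified sketch of the leave-pair-out bootstrap AUC estimate.
lpo_bootstrap_auc <- function(x, y, train_and_score, B = 5000) {
  y   <- factor(y)
  pos <- which(y == levels(y)[2])
  neg <- which(y == levels(y)[1])
  num <- 0; den <- 0
  for (b in seq_len(B)) {
    boot <- c(sample(pos, replace = TRUE), sample(neg, replace = TRUE))  # stratified bootstrap
    out  <- setdiff(seq_along(y), boot)                                  # cases not in this sample
    if (length(intersect(pos, out)) == 0 || length(intersect(neg, out)) == 0) next
    score <- train_and_score(x[boot, , drop = FALSE], y[boot], x[out, , drop = FALSE])
    names(score) <- out
    for (i in intersect(pos, out)) {
      for (j in intersect(neg, out)) {
        ## Wilcoxon-Mann-Whitney contribution of the held-out (positive, negative) pair
        si <- score[as.character(i)]
        sj <- score[as.character(j)]
        num <- num + (si > sj) + 0.5 * (si == sj)
        den <- den + 1
      }
    }
  }
  as.numeric(num / den)
}
```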
The estimated performance and the associated CIs from the training and internal-assessment process are compared with the independent-validation performance. The conditional validation performance was obtained by selecting features and training the classifier on the training data set and testing on the validation data set. This performance is conditional on the particular finite training set and may vary as the training set varies. We therefore also estimated the mean of this conditional performance over multiple training sets, obtained by bootstrapping the training set multiple times and averaging the conditional AUCs measured on the validation set [21].
We also estimated the variability of the conditional validation performance and decomposed the variance into two components: the variability due to the finite size of the training set and the variability due to the finite size of the test set [21]. The training variability reflects the stability of the classifier performance when the training set varies, and the testing variability reflects the expected performance variation for different test sets.
To compare the ability of the performance estimates of 10 × 5-CV and the LPO bootstrap to predict the performance on the independent set, we used a root mean square error (RMSE) measure, defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{K}\sum_{F,C}\left(\overline{\mathrm{AUC}}_{F,C}^{\,\mathrm{internal}} - \overline{\mathrm{AUC}}_{F,C}^{\,\mathrm{validation}}\right)^{2}},$$

where F and C index the feature-selection method and the classifier, respectively, K is the number of (F, C) combinations, $\overline{\mathrm{AUC}}$ denotes the mean AUC, and the superscript "internal" can be "10 × 5-CV" or "LPO bootstrap."
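Given vectors of internal and validation mean-AUC estimates over the 40 (F, C) model combinations, the RMSE reduces to a one-line computation in R (sketch):

```r
## Sketch: RMSE between internal (10 x 5-CV or LPO bootstrap) and validation mean AUCs;
## both arguments are vectors indexed by the (F, C) model combinations.
rmse_internal_vs_validation <- function(auc_internal, auc_validation) {
  sqrt(mean((auc_internal - auc_validation)^2))
}
```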
Estimation of predictor learning over increasing training-set size
Predictor learning was evaluated for the models that performed nominally best in independent validation for each of the three prediction problems. All 230 cases were included in the analysis to fit learning curves to these three models. For the ER-status endpoint, 10 different training-sample sizes, ranging from n = 60 to n = 220 in increments of 20, were used to estimate the dependence of the performance parameters on the sample size. For each sample size, 10 different random samples were drawn from the full set by stratified sampling, and fivefold cross-validation was used to assess the error rate and AUC of the models, with all model parameters recalculated for each sample. A similar approach was taken for the pCR (n = 50, 70, ..., 210) and "pCR in ER-negative cancer" (n = 25, 40, ..., 85) predictors. Following the work of Fukunaga [22], the following learning-curve model was fit to the resulting AUC:
$$Y = a + \frac{b}{\mathrm{TrainingSize}}.$$
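Because this model is linear in 1/TrainingSize, it can be fit by ordinary least squares; a minimal R sketch (our illustration, assuming matched numeric vectors auc and train_size of observed AUCs and training-sample sizes):

```r
## Sketch: fit Y = a + b/TrainingSize by ordinary least squares.
fit_learning_curve <- function(auc, train_size) {
  lm(auc ~ I(1 / train_size))   # intercept estimates a; slope estimates b
}
## The fitted intercept a is the extrapolated performance as the training size grows large.
```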
Congruence of different predictors at gene and functional pathway level
We were interested in examining the congruence of the informative features selected by different methods for the same prediction endpoint, and also across different endpoints. Both gene-level and pathway-level analyses were performed as described previously [23]. MetaCore protein-function classification was used to group genes into protein functions, and GeneGo pathway maps were used for functional classification of predictive features. We assessed congruency by using the kappa statistic. The input for kappa involves "learners" that classify a set of objects into categories. We considered each feature-selection method as a learner and each probe set as an object. The probe sets used in this analysis are presented in Additional file 4. Each probe set from the rank-ordered lists is categorized by each feature-selection method as either 1 (that is, selected as informative) or 0 (that is, not selected). By using such a 0/1 matrix of all probe sets × all feature-selection methods for every prediction endpoint as input, we calculated Cohen's kappa as a measure of congruency. For the pathway-level analysis, we mapped the probe sets to pathway lists by using hypergeometric enrichment analysis. The pathways were ranked by enrichment P values, and the top n pathways (where n equals the number of genes in the input list, for comparison and consistency between the two levels) were selected for presentation.
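As an illustration of the gene-level congruency computation, the sketch below derives pairwise Cohen's kappa values from a 0/1 selection matrix (probe sets × feature-selection methods); whether a pairwise or a multi-rater form of kappa was used in the study is not stated, so this pairwise form is our assumption.

```r
## Sketch: pairwise Cohen's kappa between feature-selection methods.
cohen_kappa <- function(a, b) {
  po <- mean(a == b)                                   # observed agreement
  pe <- mean(a) * mean(b) + mean(1 - a) * mean(1 - b)  # agreement expected by chance
  (po - pe) / (1 - pe)
}

pairwise_kappa <- function(sel) {                      # sel: probe sets x methods, 0/1 entries
  m <- ncol(sel)
  k <- matrix(1, m, m, dimnames = list(colnames(sel), colnames(sel)))
  for (i in seq_len(m - 1)) for (j in (i + 1):m)
    k[i, j] <- k[j, i] <- cohen_kappa(sel[, i], sel[, j])
  k
}
```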
All statistical analysis was performed by using R software.