Background
Our goal is to associate patient-related characteristics and treatment outcome, tumor pathology and biomarker status with newly derived information from genomic and proteomic studies to advance the theranostics of breast carcinoma. Cellular heterogeneity of tissue specimens has been a complicating factor in determining analyte (protein or gene) levels in specific cell types, e.g., [
1‐
4]. Numerous studies, including our own, have reported a “molecular signature” of different cancer types, including breast cancer. However, there is great variation in methods utilized to obtain these gene expression profiles, including the use of breast cancer cell lines [
5], whole tissue extraction [
6‐
11] and laser capture microdissection (LCM) procured cells [
12‐
15]. In order to obtain a clinically relevant gene set for breast cancer, our hypothesis was examined under the premise that a particular gene should be present in multiple gene expression profiles despite the differences in methodology used to determine the molecular signature. By data-mining these studies collectively, a gene set was compiled and analyzed for clinical utility in breast cancer patients.
In this study, we constructed Cox proportional hazards [
16‐
18] models to predict risk of disease recurrence and overall survival, using a selected panel of candidate biomarkers with suspected association with breast cancer outcomes. To rigorously develop our models, we used the least absolute shrinkage and selection operator (LASSO) [
19] for variable selection and evaluated their predictive ability using repeated splits of the data into training and test sets. Models based on gene expression are compared with models based on clinical information to evaluate the gain in predictive accuracy over standard clinical management parameters. Our ultimate goal is to develop a clinically-relevant gene expression-based test for use in hospital laboratories, in order to assist in clinical decisions improving breast cancer management, as well as gain insight into the interrelationships between the genes and clinical outcome of breast cancer patients. Our approach includes the identification of new molecular targets for drug design and developing companion diagnostics.
Methods
The investigations described were part of a study that is approved by the Human Subject Protection Program Institutional Review Board at the University of Louisville. A unique IRB-approved Database and Biorepository composed of de-identified tissue specimens previously collected under stringent conditions [
20] for clinical assays of estrogen (ER) and progestin receptors (PR) were used. De-identified specimens of primary invasive ductal carcinoma of the breast obtained from tissue biopsies collected from 1988–1996 were examined using REMARK criteria [
21]. Germaine to our studies (e.g., [
22,
23]), analyses of ER and PR were performed by FDA-approved methods quantifying levels of these clinical biomarkers under stringent quality control measures (e.g., [
20,
24]) unlike the majority of reports that used immuno-histochemical analyses prior to the release of the College of American Pathologists/American Society of Clinical Oncology (CAP/ASCO) Guidelines [
25]. Patients were treated with the standard of care at the time of diagnosis. Tissue-based properties (e.g., pathology, grade, size, and tumor marker expression) and patient-related characteristics (e.g., age, race, smoking status, menopausal status, stage, and nodal status) were utilized to determine relationships between gene expression and clinical parameters. A retrospective analysis of frozen tissue specimens from 225 biopsies of invasive ductal carcinoma was performed (Additional file
1: Figure S1). De-identified clinical and pathological characteristics for each patient evaluated in the study are included in Additional file
2: Table S1. Tissue sections utilized for analyses of gene expression contained a median of 60% breast carcinoma cells (range of 10-95%) and 25% stromal cells (range of 5-65%).
Gene list selection
In order to obtain a clinically relevant gene set for this investigation, our hypothesis was that a particular gene should be present in multiple gene expression profiles of breast cancer despite the differences in methodology used to determine the molecular signature. GenBank Accession numbers (NCBI) of genes deciphered from our studies using LCM-procured carcinoma cells and those of other published studies [
5‐
14] were entered into the UniGene database (National Center for Biotechnology Information (NCBI)), which separates GenBank sequences into a non-redundant set of gene-oriented clusters. UniGene identifiers for all studies were compiled into Microsoft® Access and analyzed collectively. This comparison identified genes appearing in at least three signatures, generating candidates (EVL, NAT1, ESR1, GABRP, ST8SIA1, TBC1D9, TRIM29, SCUBE2, IL6ST, RABEP1, SLC39A6, TPBG, TCEAL1, DSC2, FUT8, CENPA, MELK, PFKP, PLK1, XBP1, MCM6, BUB1, PTP4A2, YBX1, LRBA, GATA3, CX3CL1, MAPRE2, GMPS and CKS2) for investigating associations with clinical behavior of breast cancer. PGR was also included in the candidate gene list due to its known implications in breast carcinoma [
20].
Gene expression analyses
Levels of mRNA expression were analyzed after isolation with Qiagen (Valencia, CA) RNeasy® RNA isolation kits. Quality of RNA was evaluated with Agilent RNA 6000 Nano Kits and the Bioanalyzer™ Instrument (Agilent Technologies, Palo Alto, CA). Total RNA extracted from the intact tissue section was reverse transcribed in a solution of 250 mMTris-HCl buffer, pH 8.3 containing 375 mMKCl, and 15 mM MgCl2 (Invitrogen, Carlsbad, CA), 0.1 M DTT (dithiothreitol, Invitrogen), 10 mMdNTPs (Invitrogen), 20 U/reaction of RNasin™ ribonuclease inhibitor (Promega, Madison, WI) and 200 U/reaction of Superscript™ III RT (reverse transcriptase, Invitrogen) with 5 ng T7 primers. cDNA obtained from this reverse transcription reaction was diluted 10-fold in 2 ng/μl polyinosinic acid and used in qPCR reactions.
qPCR reactions were performed in a 384-well plate using a total volume of 10 μl/well. Reactions contained Power Sybr™ Green PCR Master Mix (Applied Biosystems, Foster City, CA), forward/reverse primers and diluted cDNA obtained from the reverse transcription reaction. Primers were designed with Primer Express™ (Applied Biosystems) to generate sequences closer to the 3’ end of the transcript for use with the oligo (dT) primer in reverse transcription reactions. qPCR reactions were performed in triplicate with duplicate wells in each 384-well plate. Relative gene expression levels were determined using the ΔΔCt method using ACTB for normalization and Universal Human Reference RNA (Stratagene, La Jolla, CA) as the calibrator.
Power
The power available in this study to detect a hazard ratio of a given magnitude was determined by the following formula,
[
26]. Here α = 0.05/32, β = 0.2,
z are quantiles from the standard normal distribution,
D = 68 is the number of breast-cancer related mortality outcomes, and σ = 1.8 is the median standard deviation of the log
2 expression values among all 32 genes. The result is that there is 80% power in the current study to detect hazard ratios of 1.116 or larger (equivalently, 0.90 or smaller) per unit increase in log
2 expression.
Descriptive statistics and univariate survival analysis
Summary statistics were reported for both gene expression values and clinical covariates. Univariate Cox regression models [
16] were fitted to evaluate the association of both gene expression values and clinical covariates with overall and disease-free survival. Calculations and model development were performed using log
2 transformations of relative gene expression data as determined by qPCR (Additional file
3: Table S2). To account for multiple comparisons,
p-values were adjusted to control the false-discovery rate (FDR). Because the gene expression values were highly correlated, the method of Benjamini and Yekutieli (BY) [
27], which controls for multiple dependent hypothesis tests, was used in lieu of the standard Benjamini and Hochberg (BH) method [
28] (the BH method, however, was used for clinical covariates).
Multivariable Cox models, variable selection, and predictive accuracy
A multivariable Cox proportional hazards model was used to develop a predictive model of overall and disease-free survival, based on the gene expression values and clinical covariates. The model has the following form
Where
x
1
,…,x
p
are covariates (here, either gene expression values or clinical covariates),
λ(t|x
i
) is the hazard at time
t for the
i
th
observation,
λ
0
(t) is the unspecified baseline hazard function, and
is the vector of regression coefficients [
29].
Due to the noted shortcomings of stepwise selection strategies [
30] and the high correlation between gene expression values, initial variable selection to determine which genes were significant predictors of breast cancer survival and recurrence was done by incorporating a LASSO (least absolute shrinkage and selection operator), or L
1, penalty [
19] on the regression coefficients
β
1, …
β
p
. The LASSO penalizes the size of the parameter vector,
so that unimportant variables (variables whose
β coefficients are close to zero) are removed from the model. This results in a penalized log partial likelihood function of the form
, where
l(
β) denotes the standard Cox log partial likelihood. The maximum likelihood estimates
are those which maximize this penalized likelihood. The parameter
λ is the shrinkage parameter and determines the extent of variable selection, with larger values corresponding to a larger penalty and a greater number of variables removed. The optimal value for
λ was determined using 10-fold cross-validation.
To better assess predictive ability and model performance, we performed 1000 independent splits of the data into training (70%) and test (30%) samples. Splits into training and test samples were stratified on the basis of tumor stage, so that training and test samples were balanced on percent composition of each tumor stage. For each split, a Cox regression model with a LASSO penalty was used to simultaneously fit the model and perform variable selection amongst the 32 genes. For each model, the selected genes and their associated
β coefficients were recorded, and the number of times that each gene was kept in a model was tabulated. A permutation test was used to calculate a null distribution and determine the significance threshold for the number of times (out of 1000 total permutations) that each gene was retained in a model. Genes with counts above the highest count among the permuted data sets were declared to be significant (roughly corresponding to an empirical p-value of 1/32 = 0.03). Performance of each model was evaluated by the C-index for right-censored data [
31], calculated on the test data. The C-index estimates the probability that, for a randomly selected pair of individuals, the individual with the higher risk score (shorter predicted survival time) has the shorter actual event time. Additionally, predictions based on the L
1-penalized Cox model were used to separate patients in the test data into low and high risk classes based on the linear predictor
, with the cut-point for low/high risk based on the median of the linear predictors from the training data. Kaplan-Meier plots based on the original (non-permuted) data were compared to those obtained from the permuted data in order to validate the prognostic significance of the models evaluated.
The selected genes were again used to fit multivariable Cox models based on 1000 independent splits of the data into training and test samples, without any variable selection. C-indexes were calculated for test data predictions based on models fitted to the training data. To assess whether the gene expression values offered any gain in prediction over clinical parameters, models with clinical covariates significantly associated with disease mortality and recurrence were compared with models including both gene expression values and clinical covariates. C-indexes were also calculated separately for ER+ and ER- subsets of breast cancer patients, to assess whether the gene signature was equally effective in each subset. Empirical 95% confidence intervals for the differences in C-indexes between the two sets of models were calculated using the 2.5th and 97.5th percentiles of the differences.
All analyses were performed using R version 2.14.1 [
32]. Univariate Cox models were fitted using the R package
survival[
33]
, while multivariable Cox models with the LASSO penalty were fitted using the
penalized package [
34]. The C-index was calculated using the
rcorrcens function in the
rms package [
35], and adjustment for multiple comparisons was done using the
multtest package [
36].
Validation using the TRANSBIG data
Gene expression models for both overall disease survival and recurrence were validated using AffymetrixU133a GeneChip data collected by the TRANSBIG Consortium [
37,
38]. These data consisted of clinical and gene expression measurements on 198 node-negative patients from five different medical centers. The data were obtained from the Bioconductor package ‘breastCancerTRANSBIG’ [
39], and processed to remove duplicate probes mapping to the same Entrez Gene ID (probes with the largest variability are retained). The final gene expression data set consisted of measures on 12,701 transcripts (genes) for 198 patients. Since qRT-PCR and microarray measurements do not always correlate well, rather than validate the fitted models based on our data, we validated whether the genes selected were important for predicting breast cancer survival and recurrence. Therefore, we split the data into 1000 training (70%) and test (30%) samples, and fit Cox regression models based on genes selected for mortality and recurrence to the training sets. Separate models were also fitted based on clinical data and a randomly selected gene set of the same size, to evaluate whether our gene expression model offered improved performance relative to this information. C-index values were calculated for all models based on predictions for the test data sets. Gene expression model fitting and C-indexes calculations were also performed separately for ER+ and ER- subsets of breast cancer patients to evaluate any differences in model fit and efficacy for either ER+ or ER- carcinomas.
Discussion
For decades, the protein products of ESR1 (estrogen receptor-α) and PGR (progesterone receptor) have been recognized as predictors of prognosis and response to hormone therapy, e.g., Tamoxifen (e.g., [
20,
23‐
25,
40‐
42]). Therefore, it is expected that any gene subset predicting clinical behavior of breast cancer would contain ESR1 and/or PGR, and indeed this was shown earlier (e.g., [
10,
13,
23,
43]). However, several genes identified in our investigation, which were clinically relevant to breast cancer outcome, represent new targets for developing diagnostics and potentially designing targeted therapies.
The protein product of SLC39A6 (LIV-1) has been reported to transport zinc into the cytoplasm from either outside the cell or from intracellular stores [
44,
45]. There is increasing evidence that aberrant expression of the SLC39A family of zinc transporters promotes the epithelial-to-mesenchymal transition and leads to uncontrolled cell growth [
45‐
47]. LIV-1 protein was shown to be regulated by estrogen, hence associated with ESR1 expression [
45,
48]. In addition, elevated LIV-1 protein expression in breast cancer has been associated with improved clinical outcome [
48].
GABRP encodes the π-subunit of the g-aminobutyric acid (GABA) receptor, which is a transmembrane protein that is poorly understood, especially in breast tissue [
49,
50]. GABRP was reported to be down-regulated in 76% of breast cancers and was progressively down-regulated as tumor growth progressed, suggesting that its role may be as a tumor suppressor [
50].
Wang et al. suggested that LRBA plays a role in the EGF receptor pathway [
51]. In light of the fact that LRBA is a member of the WBW (WDL (WD-like)-BEACH-WD40) gene family, structural features suggest it is involved in a signaling pathway requiring interactions with other proteins, inositol phospholopids or PKA [
51]. LRBA was also shown to be induced by mitogens in immune cells and over-expressed in several cancer types compared to normal tissue [
51]. Of interest to our studies, LRBA was identified as co-clustering in breast tumor biopsies expressing estrogen receptor-α [
52].
Although the specific intracellular functions of TBC1D9 are unknown, the TBC1 domain family of proteins is known to stimulate the GTPase activity of RAB proteins [
53]. While the role of TBC1D9 is unknown in breast cancer, there is evidence that alterations in RAB GTPases play a role in progression of certain carcinomas [
54].
NAT1 metabolically activates aromatic and heterocyclic amines to electrophilic intermediates that initiate carcinogenesis [
55,
56]. The high frequency of NAT1 acetylators genotypes are important modulators of cancer susceptibility [
55]. Breast cancer tissues are reported to exhibit lower promoter methylation rates than normal breast, and DNA hypomethylation of the NAT1 gene plays a significant role in breast carcinogenesis [
57]. Recently, small molecular inhibitors of NAT1 have been successful in inhibiting proliferation and invasiveness of breast cancer cells in culture [
58].
Since functions of the genes in the clinically-relevant molecular signatures are involved in a variety of critical pathways in cellular differentiation and growth, no collective relationship was obvious. However, since the combined expression levels of these genes in a breast cancer biopsy appear strongly associated with a patient’s risk of recurrence and overall survival, we examined the gene set in relation to various parameters used in clinical management of the lesion. Our goal was to ascertain the competency of the gene signatures identified in our investigation to predict breast cancer outcome in comparison with that derived from conventional clinical information alone. Our objective included development of targeted gene sets (small molecular signatures) that reproducibly and objectively deduced the clinical course of breast cancer by removing the subjectivity that is often encountered from assessing multiple clinical parameters.
The variable selection strategy that we employed for determining significant genes whose expression predicted both disease mortality and recurrence involved fitting LASSO regulated Cox regression models to multiple splits of the data into training and test samples. This was accomplished as a robust approach to avoid over-fitting and produce an authentic assessment of the predictive ability of these models. To contrast, backwards elimination fitted to the entire data set retained 13 genes associated with mortality prediction and 10 genes for assessing risk of recurrence, and would undoubtedly produce an over-fitted model inapplicable with external data. Additionally, PGR, TBC1D9, and NAT1, which were found to be important predictors based on gene selection using LASSO, were not included in either backwards elimination model.
However, our approach is not without limitations. In particular, selection of genes aggregated over each gene individually, and ignored particular combinations of genes which occurred in the LASSO selected models. Assessment of these models revealed the most common model to be PGR alone (27% and 18% for mortality and recurrence, respectively), followed by various two-gene combinations which occurred at small frequencies (7% or less). Thus, there was no clear consensus of a clinically relevant gene combination based on the LASSO selected models. Another potential limitation is that all genes were retained with consistent direction of effect above the permutation testing threshold, without optimizing the number of genes to include in the final model. However, additional analyses of disease mortality by sequentially adding the most frequently occurring genes retained by LASSO resulted in little difference between models containing two to five genes (median C-indexes all 0.65). Lastly, our approach did not evaluate potential interactions between genes or non-linear effects of gene expression values. These maybe incorporated by including interaction terms in the Cox regression models, or by using a non-parametric approach [
59,
60]. However, the cost in potential predictive gain is increased computational burden and reduced interpretability, and the multivariable main effects Cox model is an important reference point on which to build more complicated models.
Since we used multiple testing and training splits of the entire data set to determine clinically relevant gene predictors, our assessment of predictive accuracy based on fitted models using these genes is slightly biased. This could be avoided by using two nested splits into testing and training samples, similar to that used in the double cross-validation method. But this approach would potentially result in a different number and combination of genes for each outer split, and would thus prevent assessment of a given combination of genes in predicting breast cancer mortality and recurrence. Another alternative would be a single split of the data into testing and training samples, but then the results could be sensitive to the particular split that is utilized. Our approach of using multiple testing/training splits to determine clinically relevant genes followed by a second round of multiple testing/training splits to determine predictive ability produces a model that is robust with an assessment that is only slightly upward biased. The true accuracy of our multivariable gene expression models most likely lies between the median values for the LASSO selected models and the five/six gene signature models, which is comparable to the accuracy based on clinical parameters alone.
The ability of multivariable gene expression models to accurately predict both breast cancer mortality and recurrence was evaluated relative to models based on standard clinical parameters. Further, inclusion of gene expression values was evaluated in the context of improving predictions relative to that provided by the use of clinical information alone. Results indicated that use of gene expression signatures alone was comparable to that derived from clinical information in their value for predicting both breast cancer mortality and recurrence. Thus the predictive competency of the gene signatures identified in our investigation was confirmed for assessing breast cancer outcome.
Furthermore gene expression models appeared to improve predictions for recurrence relative to those using clinical information alone (though the result was not statistically significant). Our approach for combining gene expression and clinical data was one-dimensional, in that only main effects for each variable were included in the model. However, additional investigations for disease mortality were performed to determine whether statistically significant interactions between gene expression values and disease stage, ER/PR status and treatment regimen existed. Although 11 of the 32 genes investigated exhibited significant unadjusted p-values for interaction with disease stage, none remained significant after adjustment for multiple comparisons. Nevertheless, an expanded examination of more complex models including interactions between gene expression and clinical information is warranted in future studies.
Examination of our novel gene signature separately among patients with either ER+ or ER- breast carcinomas revealed that although the predictive accuracy was diminished relative to the entire cohort, no differences in either accuracy or fitted model coefficients existed between the two subsets (Figure
6). This is in contrast to the TRANSBIG validation study, where considerable heterogeneity in both estimated model coefficients for gene expression values and predictive accuracy existed between ER+ and ER- subsets of breast cancer (Figure
8, Additional file
6: Figure S4). However, when considering the gene expression values themselves stratified by ER status, there was good agreement between our data set and that of the TRANSBIG study (Additional file
4: Figure S2 and Additional file
5: Figure S3). This suggests an alternative mechanism for the observed heterogeneity in the TRANSBIG study, perhaps involving nodal status since the TRANSBIG study consisted solely of node negative patients. However, an investigation of potential interactions between gene expression values and nodal status for predicting disease mortality and recurrence revealed no significant interactions after adjustment for multiple comparisons. Additionally, among the genes selected in our model, only NAT1 had a significant unadjusted p-value (p = 0.015) for interacting with nodal status. However, a more exhaustive model evaluation among subsets of patients stratified by clinical covariates (nodal status, ER status) is an area of future research.
Validation of selected genes using the TRANSBIG study indicated that though our gene subset performed relatively well for predicting disease mortality (median C-index of 0.61, Figure
7), the performance for disease recurrence was somewhat disappointing (median C-index of 0.56). Predictions on disease mortality did, however, compare favorably to those based on clinical data and those based on the previously published Veridex signature. It should be noted that there was considerable heterogeneity in the fitted Beta coefficients between the TRANSBIG study and our investigation, indicating that models fitted to one data set cannot be directly used to make predictions in the other data set. This can be attributed to a variety of reasons, including patient genetic and physiological heterogeneity, differences in sample processing and tissue collection, differences between qRT-PCR and microarray expression measurements, and the fact that the TRANSBIG study consisted solely of node negative patients (though, no significant interaction was found between nodal status and the genes in our novel gene signature). In light of this, the gene expression models derived in our investigation warrant further examination, validation, and possible refinement in a clinical trial setting in order to be adopted as routine clinical tests.
Conclusions
Our goal is to identify small, clinically-relevant gene subsets to develop gene expression-based tests and gain insight into the interrelationships between these genes and clinical outcome. Based on published reports describing statistical and apparent clinical significance of various large molecular signatures of breast cancer, mRNA levels of each of 32 gene candidates were evaluated by qPCR in 225 breast carcinoma specimens. Over-expression of ten genes (RABEP1, PGR, NAT1, PTP4A2, SLC39A6, ESR1, EVL, TBC1D9, FUT8 and SCUBE2) was associated with reduced time to disease-related mortality, while four genes (RABEP1, PGR, SLC39A6 and FUT8) were associated with reduced recurrence times.
Multivariable analyses using the LASSO revealed that expression of PGR, ESR1, NAT1, GABRP, TBC1D9, SLC39A6 and LRBA of the 32 gene candidates was collectively the most important predictors for both breast cancer mortality and recurrence. Molecular signatures consisting of either five genes (PGR, GABRP, TBC1D9, SLC39A6 and LRBA) for predicting disease mortality or six genes (PGR, ESR1, GABRP, TBC1D9, SLC39A6 and LRBA) for predicting disease recurrence were identified. When taken alone, gene signatures were as effective in predicting recurrence/mortality as standard clinical parameters. However, combining the gene signature and clinical information resulted in an improvement for predicting disease recurrence relative to that derived from clinical information alone. Results from this investigation advanced the findings derived from our earlier studies [
13,
15,
61,
62], as well as those of other investigators [
5‐
12,
14] exploring breast cancer gene expression profiles. Importantly, our results identified small, biologically significant and clinically relevant gene sets in breast cancer biopsies, which predict risk of recurrence and overall survival of breast cancer patients. These molecular signatures have been sufficiently evaluated to warrant examination in a larger independent patient population (a validation cohort) such as a cooperative clinical trial, e.g., NSABP or SWOG, to verify significance for development of a routine clinical test to assess risk of breast cancer recurrence and overall survival. Prediction of the clinical outcome at the time of surgical removal of the primary lesion will facilitate improved treatment planning and disease surveillance thus enhancing individualized patient care.
Competing interests
The authors declare that they have no competing interests. The University of Louisville has filed a patent related to expression of these genes and their relationship to predicting breast cancer outcome.
Authors’ contributions
SAA performed qPCR, interpreted data, and drafted the manuscript. GNB performed statistical analysis, interpreted data, and drafted the manuscript. JLW conceived the study, designed and interpreted experiments, and drafted the manuscript. All authors read and approved the final manuscript.