Introduction
Oestrogen receptor (ER) negative (-) breast cancer accounts for about 30% of all breast cancer cases and generally has a worse prognosis than ER positive (+) disease [1, 2]. Nevertheless, a significant proportion of ER- cases have shown a favourable outcome and could potentially benefit from a less aggressive course of therapy [3]. Reliable identification of such ER- patients with a good prognosis is, however, difficult and at present only possible through examining histopathological factors.
Recently, attempts have been made to explain the observed clinical heterogeneity of ER- disease in terms of gene expression signatures [4-7]. However, most of these studies clearly indicated the difficulty of identifying a prognostic gene expression signature for ER- disease [4, 6, 7], unlike ER+ breast cancer where a multitude of alternative prognostic signatures have been identified [3, 8-11]. Nevertheless, using an integrative analysis of gene expression microarray data from three untreated (no chemotherapy) ER- breast cancer cohorts (a total of 186 patients) [3, 8, 10] and a novel feature selection method [11], it was possible to identify a seven-gene immune response expression module associated with good prognosis. This suggests that at least part of the observed clinical heterogeneity in ER- disease can be explained on the basis of mRNA expression levels [5]. Specifically, overexpression of this immune response gene module identified a subclass of basal ER- breast cancer, about 25% of all ER- cases, with a reduced risk of distant metastasis (hazard ratio [HR] = 0.49; range 0.29 to 0.83; p = 0.009) compared with ER- cases without overexpression of this module [5], a result that was validated in two independent untreated test cohorts (58 ER- samples) [9, 12].
The important role that immune system-related gene expression signatures play in breast cancer prognosis has been further supported by four recent reports [13-16]. Specifically, one study reported that high expression of lymphocyte-associated genes identifies a good prognosis subgroup within lymph node negative (LN-) human epidermal growth factor receptor 2 positive (HER2+) breast cancer [13]. A further study focused on LN- breast cancer and identified a prognostic B-cell metagene signature, confirming that overexpression of this signature correlated with good prognosis in ER- breast cancer, while underexpression correlated with good prognosis in ER+ breast cancer [14]. A similar contrasting result between ER- and ER+ breast cancer was also found by deriving a gene expression signature for lymphocytic infiltration (LI) and demonstrating its positive and negative association with good prognosis in ER- and ER+ disease, respectively [15]. All these results are consistent with our findings and highlight the importance of stratifying breast cancer patients into ER+ and ER- subtypes before associations with clinical outcome can be derived [5, 16].
The discovery and construction of a molecular classifier that can robustly identify ER- patients with a good prognosis is important for two main reasons. First, identification of ER- patients with a good prognosis based on histopathological predictors like LN status or Adjuvant! is far from optimal [17]. Second, reliable identification of ER- patients with a good prognosis could help guide their management by allowing less aggressive treatment regimens for such patients. Building on our previous results [5], here we report on the construction of a seven-gene prognostic classifier and further validate this single-sample predictor across six (four untreated and two partially treated) independent ER- breast cancer cohorts: 'UPP' [12], 'JRH-2' [9], 'UNC248' [18], 'CAL' [19], 'Loi' [20] and 'Kreike' [6]. This therefore confirms the validity of this classifier in more than 469 ER- patients.
Materials and methods
Linear and quadratic discriminant analysis
Before discussing Mixture Discriminant Analysis (MDA), it is convenient to briefly review Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) [21]. We assume that we have a training data set $X$ of dimension $p \times N$, where $p$ is the number of dimensions (ie, genes) and $N$ is the number of training samples (ie, tumour samples). We also assume that we have a test set $Y$ of dimension $p \times n$ and that we have $C$ phenotype classes among the training set samples.
In the training process of discriminant analysis one attempts to learn parameters that specify the clusters associated with each of the phenotype classes. In the maximum likelihood framework, one learns parameters $(\pi, \theta) = (\pi_k, \theta_k;\ k = 1, \ldots, C)$ such that the likelihood function
$$L(\pi, \theta) = \prod_{i=1}^{N} \sum_{k=1}^{C} \pi_k f_k(x_i \mid \theta_k) \tag{1}$$
is maximised. In the above, $f_k$ denotes the probability function that specifies the probability that the observation $x_i$ is generated from cluster $k$, $\pi_k$ denotes the weight of this cluster and $\theta_k$ parameterises the cluster. The optimisation of the likelihood is performed using the EM-algorithm, subject to the constraint that $\sum_{k=1}^{C} \pi_k = 1$, yielding estimates $(\hat{\pi}, \hat{\theta})$.
Having estimated the parameters, we can now classify a test sample $y$ using Bayes' Theorem as follows. The probability that $y$ belongs to class $k$ is just the posterior probability $p(k \mid y)$, which by Bayes' Theorem can be written as
$$p(k \mid y) = \frac{\hat{\pi}_k f_k(y \mid \hat{\theta}_k)}{\sum_{c=1}^{C} \hat{\pi}_c f_c(y \mid \hat{\theta}_c)} \tag{2}$$
Assigning $y$ to the class which maximises this posterior probability (the maximum probability criterion) minimises the expected misclassification error. Thus,
$$k = \mathrm{class}(y) = \arg\max_{c = 1, \ldots, C} p(c \mid y)$$
To compute the posterior probabilities one needs to estimate the functions $f_k$ or, if the functional form is prespecified, the parameters $\theta_k$. The simplest functional approximation one can make is to assume that the clusters are multivariate Gaussians, so that
$$f_k(y \mid \theta_k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2} (y - \mu_k)^{T} \Sigma_k^{-1} (y - \mu_k)\right)$$
where $\mu_k$ is the mean and $\Sigma_k$ the covariance matrix of the Gaussian. If, furthermore, we assume that the covariance matrices are identical for each cluster (ie, $\Sigma_k = \Sigma\ \forall k$), then the classification function becomes a linear function of $y$, known as LDA. In the more general case where the covariance matrices of each class are allowed to differ, the classification function is a quadratic form in $y$ and the analysis is known as QDA.
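The linear versus quadratic distinction can be made explicit with a short worked derivation. This is a standard textbook sketch rather than a formula taken from the original analysis; the discriminant function $\delta_k$ is notation introduced here for illustration. Taking the logarithm of $\pi_k f_k(y)$ for a Gaussian cluster and dropping terms that do not depend on $k$:

```latex
% QDA: class-specific covariance, quadratic in y
\delta_k(y) = \log \pi_k - \tfrac{1}{2} \log |\Sigma_k|
              - \tfrac{1}{2} (y - \mu_k)^{T} \Sigma_k^{-1} (y - \mu_k)

% LDA: with \Sigma_k = \Sigma for all k, the terms
% -\tfrac{1}{2}\log|\Sigma| and -\tfrac{1}{2} y^{T} \Sigma^{-1} y
% are common to every class and cancel, leaving a function linear in y
\delta_k(y) = y^{T} \Sigma^{-1} \mu_k
              - \tfrac{1}{2} \mu_k^{T} \Sigma^{-1} \mu_k + \log \pi_k
```

In both cases the sample is assigned to the class with the largest $\delta_k(y)$, which is equivalent to maximising the posterior probability.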
Mixture Discriminant Analysis
The assumption that a phenotype class is best modelled by a multivariate Gaussian is often violated. In the context of gene-expression analysis, gene expression profiles often exhibit bi- or multimodality, even when restricted to one phenotype class [5]. Similarly, gene expression profiles typically also have longer tails than Gaussians. In such circumstances, it seems more appropriate to model each $f_k$ as a mixture of multivariate Gaussians, since any general density can be approximated by such a mixture. Therefore, one assumes that
$$f_k(y) = \sum_{j=1}^{G_k} \tau_{kj}\, \mathcal{N}(y \mid \mu_{kj}, \Sigma_{kj}) \tag{4}$$
where $\mathcal{N}(y \mid \mu, \Sigma)$ denotes a multivariate Gaussian density, $\tau_{kj}$ are the mixing weights within class $k$, and the number of Gaussians to use for phenotype label $k$ is given by $G_k$. This number may or may not be specified in advance, resulting in a variety of different implementations. In ordinary MDA [22], one assumes that $G_k$ is known in advance for each class $k$ and that the covariance matrices are all identical (ie, $\Sigma_{kj} = \Sigma$). However, these assumptions are not necessary and instead one can use the training data to learn the best mixture model fit for each phenotype class using, for example, the Bayesian Information Criterion (BIC) [21] or a variational Bayesian framework for model selection [23]. This model selection step is a cluster-inference procedure that yields estimates for $(\tau_{kj}, \mu_{kj}, \Sigma_{kj}, G_k)$, from which classification of test samples proceeds as before using the maximum probability criterion. Therefore, MDA is a direct generalisation of LDA and QDA and may reduce to these if the data does not support multiple components per phenotype class [21].
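To illustrate the model selection step, the following is a minimal sketch of choosing the number of Gaussian components for one phenotype class by BIC. It is not the implementation used in this study: it is restricted to one dimension, uses a plain EM loop written for illustration, and the function names (`fit_mixture`, `bic`) and the expression values are made up. BIC is taken here as $-2 \log L + k \log n$, so lower is better; some software reports the negated quantity.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_mixture(data, g, n_iter=200, var_floor=1e-6):
    """Fit a g-component 1-D Gaussian mixture by EM; return (params, log-likelihood)."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start: split the sorted data into g contiguous chunks.
    chunks = [xs[i * n // g:(i + 1) * n // g] for i in range(g)]
    means = [sum(c) / len(c) for c in chunks]
    grand = sum(xs) / n
    variances = [max(sum((x - grand) ** 2 for x in xs) / n, var_floor)] * g
    weights = [1.0 / g] * g
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means and variances.
        for j in range(g):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )
    return (weights, means, variances), loglik

def bic(loglik, g, n):
    """BIC with 3g - 1 free parameters (g means, g variances, g - 1 weights)."""
    return -2.0 * loglik + (3 * g - 1) * math.log(n)

# Made-up, clearly bimodal expression values for a single phenotype class.
data = [0.0, 0.1, -0.1, 0.2, -0.2, 5.0, 5.1, 4.9, 5.2, 4.8]
scores = {g: bic(fit_mixture(data, g)[1], g, len(data)) for g in (1, 2)}
best_g = min(scores, key=scores.get)
```

With these illustrative values the two-component model is preferred, which is the situation in which MDA does not reduce to LDA or QDA.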
Classification in heterogeneous cancers: the MDAhet classifier
Using mixtures of Gaussians, the densities of each phenotype class can be estimated more accurately. Thus, provided that the inferred Gaussian components are biologically meaningful, this approach should in general lead to an improved classification performance. However, the implicit assumption in MDA is that we are still interested in classifying samples into the C phenotype classes, whereas in certain circumstances we may be only interested in classifying into certain subtypes within the phenotype classes. Therefore, while in MDA one allows for heterogeneity of each phenotype label by estimating the density of each class as a mixture of Gaussians, classification is subsequently performed into each phenotype class. On the other hand, it is possible to classify samples into the Gaussian subcomponents inferred for each phenotype class, a variation of MDA called Heterogeneous Mixture Discriminant Analysis (MDAhet), because this explicitly takes the heterogeneity of each phenotype class into account by attempting to classify the samples into these subcomponents.
As an example, consider the case of two phenotype classes with MDA predicting two Gaussian components for each class. Thus, training data is used to learn the parameters and weights for four Gaussian clusters and classification of test samples is subsequently performed via the Bayes' classifier (equation 3) on these four subclasses. Note therefore that in MDAhet, the cluster-inference step of MDA is used to define the classes for which classification is then performed. Since these inferred classes make up subtypes of the original phenotype labels, this classification framework explicitly takes the heterogeneity of the phenotypes into account.
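The difference between MDA and MDAhet in this four-cluster example can be sketched as follows. This is an illustrative one-dimensional toy, not the classifier built in this study: the fitted parameters in `model`, the class names and the function names are all hypothetical. Both rules use the same posterior over subcomponents; MDA sums it within each phenotype class, whereas MDAhet assigns the sample to the single most probable subcomponent.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical fitted model: class -> (class weight pi_k,
#                                      [(tau_kj, mu_kj, var_kj), ...])
model = {
    "good": (0.4, [(0.5, 2.0, 0.25), (0.5, 0.0, 0.25)]),
    "poor": (0.6, [(0.7, -0.5, 0.25), (0.3, 1.0, 0.25)]),
}

def subcomponent_posteriors(y, model):
    """Posterior probability of every (class, component) pair given y."""
    joint = {}
    for k, (pi_k, comps) in model.items():
        for j, (tau, mu, var) in enumerate(comps):
            joint[(k, j)] = pi_k * tau * normal_pdf(y, mu, var)
    total = sum(joint.values())
    return {key: val / total for key, val in joint.items()}

def classify_mda(y, model):
    """MDA: sum subcomponent posteriors within each class, pick the best class."""
    class_post = {}
    for (k, _), p in subcomponent_posteriors(y, model).items():
        class_post[k] = class_post.get(k, 0.0) + p
    return max(class_post, key=class_post.get)

def classify_mdahet(y, model):
    """MDAhet: pick the single most probable (class, component) pair."""
    post = subcomponent_posteriors(y, model)
    return max(post, key=post.get)

y = 2.2
mda_label = classify_mda(y, model)        # a phenotype class
mdahet_label = classify_mdahet(y, model)  # a subtype within a class
```

The MDAhet output carries strictly more information than the MDA output, since the phenotype class can always be read off the first element of the returned pair.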
In the context of cancer gene-expression studies, deriving reliable prognostic classifiers has proved difficult in certain cancers, as is the case for ER- breast cancer. Typically, in the context of prognosis one would expect discriminative gene-expression profiles to exhibit bimodal distributions with the two modes mapping roughly to the two prognostic groups (good and poor) [11]. However, as previously shown [5], the best candidate gene-expression prognostic markers can also exhibit bimodal (or multimodal) profiles (ie, mixtures of Gaussians) within a given prognostic class, indicating that these phenotypes are themselves heterogeneous and that classification analysis should attempt to take this heterogeneity explicitly into account. Thus, in such circumstances the proposed classifier MDAhet seems the more appropriate classification scheme to use.
Time-dependent negative predictive value analysis
Following the work by Heagerty and colleagues [24], we estimate time-dependent sensitivity $SE(t)$ and specificity $SP(t)$ values using Kaplan-Meier estimators for the predicted subclasses. In our context, we assume that samples have been classified into two groups, so that the predictor $X = 1$ predicts poor prognosis, while $X = 0$ predicts good prognosis (ie, the 'good-up' group). Thus,
$$SE(t) = P(X = 1 \mid T \le t) = \frac{[1 - \hat{S}(t \mid X = 1)]\, P(X = 1)}{1 - \hat{S}(t)}, \qquad SP(t) = P(X = 0 \mid T > t) = \frac{\hat{S}(t \mid X = 0)\, P(X = 0)}{\hat{S}(t)}$$
where $T$ denotes the time to event, $\hat{S}(t)$ denotes the Kaplan-Meier estimator for the overall survival function, while $\hat{S}(t \mid X = c)$ denotes the Kaplan-Meier survival estimate for the particular subgroup $X = c$ ($c = 0, 1$) [24], and $P(X = c)$ is estimated by the proportion of samples assigned to group $c$. In our context, however, the most important performance measure is the negative predictive value (NPV), since this is the probability of correctly identifying a patient with a good prognosis. Adapting the same methods as used by Heagerty and colleagues [24] we can obtain time-dependent estimates for the NPV and positive predictive value (PPV) simply as:
$$NPV(t) = P(T > t \mid X = 0) = \hat{S}(t \mid X = 0), \qquad PPV(t) = P(T \le t \mid X = 1) = 1 - \hat{S}(t \mid X = 1)$$
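As a rough illustration of how these time-dependent measures follow from Kaplan-Meier estimates, here is a minimal sketch. It assumes a binary predictor coded as above (1 = predicted poor, 0 = predicted good), estimates $P(X = c)$ by the sample proportion, and uses made-up follow-up data; the function names are hypothetical and the code does not guard against the degenerate cases where the overall survival estimate equals 0 or 1.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t); events: 1 = event observed, 0 = censored."""
    surv = 1.0
    event_times = sorted({ti for ti, ei in zip(times, events) if ei == 1 and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == u and ei == 1)
        surv *= 1.0 - deaths / at_risk
    return surv

def subgroup(times, events, x, c):
    """Follow-up times and event indicators of the samples with predictor X = c."""
    pairs = [(ti, ei) for ti, ei, xi in zip(times, events, x) if xi == c]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def time_dependent_measures(times, events, x, t):
    """Time-dependent SE, SP, NPV and PPV for a binary predictor x."""
    p1 = sum(x) / len(x)   # estimated P(X = 1)
    p0 = 1.0 - p1          # estimated P(X = 0)
    s_all = km_survival(times, events, t)
    s0 = km_survival(*subgroup(times, events, x, 0), t)
    s1 = km_survival(*subgroup(times, events, x, 1), t)
    return {
        "SE": (1.0 - s1) * p1 / (1.0 - s_all),
        "SP": s0 * p0 / s_all,
        "NPV": s0,
        "PPV": 1.0 - s1,
    }

# Made-up follow-up times (years), event indicators and predicted groups.
times = [1, 2, 3, 8, 5, 6, 7, 9]
events = [1, 1, 1, 0, 0, 1, 0, 0]
x = [1, 1, 1, 1, 0, 0, 0, 0]
measures = time_dependent_measures(times, events, x, t=4)
```

In this toy data set no sample in the predicted-good group has an event before four years, so the NPV at four years equals 1, while three of the four predicted-poor samples have had an event, giving a PPV of 0.75.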
Discussion
Based on the seven genes we had identified previously as defining an immune response-related prognostic module in ER- breast cancer, we have now constructed a single-sample classifier and have validated it in six external, independent ER- cohorts, four of which were untreated populations. Remarkably, we find that overexpression of this immune response-module considerably reduces the risk of disease-specific death or distant metastasis in both untreated and partially treated ER- populations (HR = 0.15; 95% confidence interval 0.07 to 0.36; $p < 10^{-6}$) (Table 4). Importantly, we also found that this association is independent of LN status (Table 4). In terms of binary outcome measures, the classifier shows clinical promise with consistently high NPV values across all test cohorts, even when time-dependent outcome measures are taken into account (Table 3). For example, the NPV and sensitivity values at four years after surgery were 100% in four of the six cohorts and in all cases larger than 85%. Thus, the classifier could potentially be used for identifying high-grade ER- patients who may benefit from a less aggressive course of chemotherapy, or from none at all.
The remarkably high NPV values in the test cohorts, however, raise some important questions. First, we found that the performance in the test sets was better than in the training set (Tables 3 and 4). While this is true for the NPV analysis, the Cox-regression analysis also shows that the 95% confidence intervals (CI) overlap. Therefore, statistically, there is no discrepancy. In any case, a plausible explanation for why the performance is slightly worse in the training set could be related to the merging step involved in building the training set [5]. By merging different microarray expression sets together we gain power from the considerable increase in sample size; however, merging may also compromise the accuracy of the expression profiles, because these need to be renormalised before merging is performed [5]. Therefore, it is entirely plausible that small errors in the merging procedure may have affected the classifier's performance in the training set. In this context it is important to point out that the training set is only used to derive a classifier and that the gold-standard evaluation of any classifier is determined by its performance in the test cohorts [25]. As shown here, the MDAhet classifier is strongly prognostic across six totally independent breast cancer cohorts profiled on different array platforms.
A second important point relates to the nature of the MDAhet classifier. As remarked in a previous study [9], in the context of validating gene expression signatures across different array platforms, some renormalisation is inevitable. Thus, our MDAhet classifier is not strictly speaking a single-sample predictor because the gene expression value of a test sample needs to be renormalised (a simple centering and scaling) across all the test samples in the same cohort, before classification is performed. However, this does not preclude the classifier from being a potential single-sample predictor because in the clinical setting such platform differences would not exist and so no normalisation step would be necessary. Hence, in line with other classifiers presented in the literature [9, 26] our MDAhet classifier is also a single-sample predictor because, modulo the normalisation step, the classification is performed solely with information taken from the training set (Table 1).
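The renormalisation step mentioned above can be sketched as a per-gene centering and scaling across the samples of one test cohort. This is a minimal illustration under assumptions: the exact scaling used in the study is not specified here beyond "centering and scaling", so the sample standard deviation is an assumed choice, and the expression values, the gene label `gene_b` and the function name are made up.

```python
from statistics import mean, stdev

def centre_and_scale(values):
    """Centre one gene's values on the cohort mean and scale to unit sample SD."""
    m = mean(values)
    s = stdev(values)  # assumed choice: sample standard deviation
    return [(v - m) / s for v in values]

# Made-up expression values for two genes across five samples of one cohort.
cohort = {
    "SPP1": [7.1, 8.4, 6.9, 9.3, 7.8],
    "gene_b": [3.2, 2.9, 4.1, 3.7, 3.0],
}
normalised = {gene: centre_and_scale(vals) for gene, vals in cohort.items()}
```

Because the mean and standard deviation are computed across the whole cohort, the normalised value of any one sample depends on the other samples in that cohort, which is why the classifier is not, strictly speaking, a single-sample predictor when platforms differ.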
Given the association of overexpressed immune response related genes with good prognosis in ER- breast cancer, as supported now by several studies [5, 13-16], it is natural to ask about the biological meaning of such overexpression. One plausible explanation for the overexpression of immune response genes in these tumours is a higher degree of LI, because some of the genes involved are lymphocyte markers [13-15] and LI itself is associated with good prognosis in ER- breast cancer patients [6, 14, 15]. However, there is also evidence for a more complex role of the mRNA expression of these genes [5]. First, it was found that the prognostic performance of the seven-gene module previously reported [5] was independent of LI. Second, it was shown that the good prognosis class was heterogeneous with only about half of the cases mapping closely to medullary breast cancer, a morphologically distinct subclass associated with high LI and marginally better prognosis as compared with the other ER- subtypes (ie, the basal and the HER2+ subtypes) [5, 27]. Thus, the best prognosis is attained by the other half of the samples that are not necessarily related to high LI and medullary breast cancer [5]. All these findings are consistent with the marginal association of LI or LI-associated gene expression with good prognosis in ER- breast cancer, as reported recently [6, 13-15], and suggest that only part of the overexpression of the immune response-module is due to LI [5]. Lending further support to this, it was also found that one gene member (SPP1) is consistently underexpressed in patients with a good prognosis. To conclude, we can therefore hypothesise that the MDAhet classifier and associated immune response-module may be identifying another good prognosis ER- subset of tumours, but with a significantly better prognosis than medullary high-LI breast cancer (Tables 3 and 4). In any case, even if the expression pattern of the immune response-module is entirely due to variable LI, the MDAhet classifier appears to provide a much more reliable prognostic classifier than LI-scores derived from immunohistochemistry [6] or lymphocyte-specific gene expression markers [14, 15]. Further larger studies with reliable LI data are required to answer these questions conclusively [15].
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AET conceived of the study, performed all statistical analyses and wrote the manuscript. CC contributed to the writing of the manuscript.