Background
Colorectal cancer is one of the most common malignancies in the western world and accounts for about 10% of all cancer deaths in both Europe and the USA. Traditionally, colorectal cancer classification (Dukes, AJCC (American Joint Committee on Cancer)) is based on the extent of the cancer: depth of tumor invasion into the wall of the intestine, number of nearby affected lymph nodes and whether the cancer has metastasized to other organs of the body. Surgery is curative for a large proportion of patients at early stages, but is not enough for many patients at advanced stages. Most of these patients need adjuvant chemotherapy in order to avoid relapse or to increase survival. Unfortunately, only a small portion of them show an objective response to chemotherapy, which makes it difficult to correctly predict patients' clinical outcome [1]. Microarray gene expression profiling is a powerful tool for the identification of prognostic gene signatures. Supervised analysis of gene expression has been used to discover gene signatures to identify patients at risk of recurrence in colon cancer [2-8]. Recently, two extensively validated gene signatures have been reported, Oncotype-DX and ColoPrint [9, 10]. A different approach is to use unsupervised analysis. Clustering methods group together samples with similar expression profiles. With this strategy, new subtypes of tumors can emerge or the existing classification may be redefined, resulting in more uniform groups of tumors. Molecular homogeneity may be essential in order to identify the specific biological pathways affected, to discover precise drug targets in each subgroup or to obtain individual survival classifiers. Previous attempts to subdivide colon tumors into sub-classes or to correlate gene expression with Dukes stages using unsupervised analysis have not been conclusive. Some authors were able to correctly classify normal colon, Dukes B and C but not Dukes A and D, and no new subgroups were identified [11]. Others were able to classify normal tissue with Dukes A in one group and B with C in another cluster, while D clustered separately [12]. Other authors were unable to find differences between stages B, C and D [13]. Some reports found differences between normal and tumor tissue and genes differentially expressed between metastatic and nonmetastatic samples [14-16], or segregated normal tissue from primary carcinomas and from liver metastasis and carcinomatoses [14, 17]. Other authors, using class comparison between Dukes A and D, identified a gene signature that could be used for the classification of low- and high-risk patients in Dukes B and C [7]. Another interesting approach described the identification of a gene expression profile generated from an experimental model of colon cancer metastasis that was able to predict cancer recurrence in patients with colon cancer [18]. Other authors reported that the epidermal growth factor receptor pathway was up-regulated in metachronous liver metastasis while angiogenesis was up-regulated in synchronous liver metastasis [19]. Unfortunately, even though an ample selection of gene signatures has been reported in the literature, almost none of them has reached clinical practice. There is a need for prognostic and predictive factors to provide authoritative information for medical decisions in routine clinical practice. Our study was mainly aimed at obtaining more homogeneous groups of tumors in colorectal adenocarcinomas, hypothesizing that discovering molecularly more uniform groups of tumors would also be likely to discriminate patients with different clinical outcomes. In addition, understanding the biological pathways underlying each tumor subtype could help in the future to find the appropriate treatment regime.
Methods
Patients
Patients from all stages were selected, keeping approximately equal proportions of each stage (24 Dukes A or AJCC (6th edition) stage I; 26 B or II; 19 C or III; and 19 D or IV). Tumor samples were taken from the Bank of Tumors of the Hospital Clinico San Carlos between 2001 and 2006. The Bank of Tumors follows the rules established by the hospital, including the patient consent approved by the Ethical Committee of the Hospital Clinico San Carlos.
Histological analysis of tumor samples
Many reports have shown the importance of tumor-associated stroma in the development of cancer; therefore, for our study we did not use laser microdissection to isolate just the transformed epithelial cells. We wanted to analyze both the tumor cells from the malignant epithelia and the altered surrounding stroma. We took a representative fragment of the complete tumor and carried out a very detailed pathological analysis of the frozen tumor fragments used to extract the RNA and of the corresponding paraffin-embedded tumor tissue. Only samples with more than 80% of tumor component were included, considering tumor stroma as part of the tumor component.
RNA extraction and quality control
RNA was extracted directly from the frozen samples using TRIZOL (Invitrogen, Carlsbad, CA) and a homogenizer (Ultraturrax T8-S8N-5 G Rose Scientific Ltd, Canada). Afterwards, RNA was treated with DNAse using the RNeasy Micro Kit (Qiagen GmbH, Germany). RNA quality was measured with an Agilent Bioanalyzer 2100 (Agilent Technologies, Palo Alto, USA) and only good-quality samples, with an RIN (RNA Integrity Number) [20, 21] higher than 7.5, were selected for the analysis.
Microarray analysis
Agilent G4112 microarrays were used to analyze gene expression in 88 colon tumors and 7 normal colon tissues. A reference RNA preparation (a pool of normal colon tissue RNAs obtained from 68 individuals) was used for double hybridization: tumor-Cy5/pool-Cy3, normal-Cy5/pool-Cy3. Agilent recommended protocols were followed. Fluorescence was measured and normalized (LOWESS) using the Agilent microarray scanner and Feature Extraction software. A Quality Control Report was generated to discard the microarrays that did not fulfill good-quality criteria. From the original 44 K features microarray, a total of 28462 spots without flags in 90% of the microarrays were used. Only probes that were significantly (p < 0.01) up- or down-regulated vs. the reference pool in at least 7 samples (considering the 7 normal tissue samples as the smallest group) were selected, yielding 17392 spots. Probes with the same gene identification were averaged to obtain a total of 14764 genes. For classification purposes we chose the genes that showed the highest variation between tumors, selecting the genes that in more than 7 samples had at least a 2.5-fold change from the gene median value; this resulted in 1722 genes that were used for the unsupervised analysis of the 89 samples (tumor CT102 was replicated). Cluster reproducibility was measured by the robustness index (R-index) and by the discrepancy index (D-index) [22]; analyses were performed using BRB-ArrayTools developed by Dr. Richard Simon and the BRB-ArrayTools Development Team. Transcript Profiling: [ArrayExpress E-TABM-723].
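The variability filter described above (a gene is kept when more than 7 samples deviate at least 2.5-fold from the gene median) can be sketched as follows. This is a minimal illustration and not the BRB-ArrayTools implementation: it assumes that expression values are log2 ratios against the reference pool and that there are no missing values. The function name `select_variable_genes` and its parameters are hypothetical.

```python
import numpy as np

def select_variable_genes(log2_ratios, fold=2.5, min_samples=7):
    """Return a boolean mask of the genes (rows) kept for clustering.

    log2_ratios: genes x samples array of log2(sample / reference) values
    (assumed input format). A gene is kept when MORE than `min_samples`
    samples differ from the gene median by at least `fold`, up or down.
    """
    x = np.asarray(log2_ratios, dtype=float)
    # Center each gene on its own median across samples.
    centered = x - np.median(x, axis=1, keepdims=True)
    # A 2.5-fold change corresponds to |log2 difference| >= log2(2.5).
    threshold = np.log2(fold)
    n_deviating = (np.abs(centered) >= threshold).sum(axis=1)
    return n_deviating > min_samples
```

Applied to a genes x samples matrix, the returned mask can be used to subset the rows before clustering, for example `filtered = matrix[select_variable_genes(matrix)]`.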
Functional analysis of KEGG pathways
A functional analysis of KEGG pathways using class comparison tools (Goeman's global test, the LS and KS tests, and Efron-Tibshirani's test) was carried out to find differentially affected pathways between the four tumor subtypes. A total of 164 gene sets were studied and the significance threshold was set at p = 0.005. Multiple comparisons were corrected using resampling and gene permutations. Goeman's method tests the null hypothesis that no genes within a given gene set are differentially expressed, whereas the LS test, the KS test and Efron-Tibshirani's method test whether the average degree of differential expression is greater than expected from a random sample of genes (BRB-ArrayTools). Therefore, the KEGG pathways selected had to be significant in at least two tests: Goeman's test and any of the other three tests carried out.
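The selection rule for pathways (significant in Goeman's test and in at least one of the other three tests) can be written as a small filter. The sketch below only encodes that rule. It assumes that the per-test p-values have already been computed elsewhere and that "significant" means strictly below the threshold. The dictionary layout and the test-name keys are hypothetical.

```python
def select_pathways(p_values, alpha=0.005):
    """Keep pathways significant in Goeman's test AND in at least one of
    the LS, KS or Efron-Tibshirani tests.

    p_values: dict mapping pathway name -> dict of test name -> p-value
    (assumed layout; the key names below are illustrative).
    """
    other_tests = ("LS", "KS", "Efron-Tibshirani")
    selected = []
    for pathway, p in p_values.items():
        goeman_ok = p["Goeman"] < alpha
        other_ok = any(p[test] < alpha for test in other_tests)
        if goeman_ok and other_ok:
            selected.append(pathway)
    return selected
```

Requiring agreement between a self-contained test (Goeman) and a competitive test (LS, KS or Efron-Tibshirani) is more conservative than accepting either family alone, which matches the reasoning given in the text.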
Tissue microarrays (TMA), IHC and mutation analysis
Tissue microarrays were assembled as in [23] for immunohistochemical analysis of β-catenin (clone 17c2, Novocastra Laboratories Ltd., Newcastle upon Tyne, UK), M30 (M30 CytoDEATH, Roche Diagnostics GmbH, Mannheim, Germany) for apoptosis and KI67 (clone M1B1, Dako, Glostrup, Denmark) for proliferation. Mutations in KRAS, BRAF and PI3K as well as microsatellite instability (MSI) were also assessed. See Additional file 1: Supplementary Information for more information about the protocols followed for antibody staining and analysis of MSI and gene mutations.
Identification of tumor subgroups in an independent data set
The data set of Eschrich et al. [2] was used as an external patient collection. Data were combined using the method published by Hu et al. [24]. In our data set, genes with the same UniGene Cluster ID were averaged and genes without a UniGene Cluster ID were eliminated, resulting in 11017 genes out of the 14764 genes and 96 samples (normal and tumor samples). The Eschrich data set consists of 78 samples (23 B, 22 C, 30 D and 3 adenomas) and 32208 normalized transcripts. Spots without IDs or with more than 25% missing values were eliminated and spots with the same UniGene Cluster ID were averaged. Genes with at least 90% of data available were selected to obtain a total of 9229 genes.
Combination of data sets: both data sets were combined using the software “Distance Weighted Discrimination” (https://genome.unc.edu/pubsup/dwd/) to obtain a collection of 174 samples (166 tumors) and 5319 common genes.
Classification of the external data set: A Nearest Centroid predictor was built in our data set including only genes differentially expressed between classes at p < 0.001. LOOCV (Leave-One-Out Cross-Validation) and 100 random permutations were used to compute the misclassification rate. This predictor was subsequently used to classify the external samples into the four novel clusters.
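As an illustration of how a nearest-centroid predictor and its leave-one-out error can be computed, a minimal sketch follows. It is not the BRB-ArrayTools procedure: Euclidean distance is an assumption, the gene selection step (p < 0.001) and the permutation test are omitted, and all function names are hypothetical. In a full analysis the gene selection would be repeated inside each cross-validation fold to avoid an optimistic error estimate.

```python
import numpy as np

def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test sample to the class with the closest centroid
    (Euclidean distance; the metric is an assumption of this sketch)."""
    train_x = np.asarray(train_x, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    train_y = np.asarray(train_y)
    labels = sorted(set(train_y.tolist()))
    # One centroid per class: the mean expression profile of its samples.
    centroids = np.array([train_x[train_y == c].mean(axis=0) for c in labels])
    dist = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    return [labels[i] for i in dist.argmin(axis=1)]

def loocv_error(x, y):
    """Leave-one-out misclassification rate for samples x (rows) and labels y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    wrong = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # hold out sample i
        pred = nearest_centroid_predict(x[keep], y[keep], x[i:i + 1])[0]
        wrong += int(pred != y[i])
    return wrong / len(y)
```

Once the cross-validated error is acceptable, the same `nearest_centroid_predict` call with the complete training set can be used to assign external samples to the clusters.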
Hierarchical clustering: To analyze whether the external patient set clustered with our patients in the same tumor subtypes, centered Pearson correlation and average-linkage hierarchical clustering of the combined set (159 tumor samples, excluding samples from cluster-5, normal tissues and adenomas) was carried out using the 461 genes common to both data sets out of the 1722 originally selected genes.
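The combination of centered Pearson correlation and average linkage can be illustrated with a deliberately naive implementation. The sketch below assumes a samples x genes matrix without missing values, uses 1 minus the Pearson correlation as the distance, and cuts the tree at a fixed number of clusters. The function name is hypothetical and the quadratic search is only suitable for small examples.

```python
import numpy as np

def average_linkage_clusters(samples, n_clusters):
    """Agglomerative clustering with average linkage on 1 - Pearson r.

    samples: samples x genes array. Returns one cluster index per sample.
    """
    x = np.asarray(samples, dtype=float)
    # np.corrcoef centers each row, so this is the centered correlation.
    dist = 1.0 - np.corrcoef(x)
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean of all pairwise sample distances.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(len(x), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

For data sets of realistic size, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="correlation"` would normally be preferred over this explicit loop.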
Generation of a low-stroma-subtype predictor
Eschrich samples were classified as belonging to the Low-stroma-subtype or to the other tumor subtypes using the K-nearest-neighbor prediction method with K = 3 (KNN3). A predictor was generated in our data set using the 461 genes common to both data sets out of the 1722 originally selected genes. Genes included in the predictor were differentially expressed between classes at p < 0.001. LOOCV and 100 random permutations were used to compute the misclassification rate.
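For orientation, a K-nearest-neighbor vote with K = 3 can be sketched in a few lines. Euclidean distance and the function name `knn_predict` are assumptions of this sketch; the distance metric and the gene selection used by the prediction software are not reproduced here.

```python
from collections import Counter
import numpy as np

def knn_predict(train_x, train_y, sample, k=3):
    """Label a sample by majority vote among its k nearest training samples
    (Euclidean distance, assumed for illustration)."""
    train_x = np.asarray(train_x, dtype=float)
    dist = np.linalg.norm(train_x - np.asarray(sample, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

With two classes and an odd K, as in the Low-stroma versus other-subtypes problem, the vote cannot end in a tie.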
Statistical analysis and correlations with clinical parameters and survival analysis
Qualitative variables are given with their frequency distribution. Quantitative variables are given with their mean and standard deviation (SD). Means were compared with the Kruskal-Wallis test. Proportions were compared by the chi-square test for independent groups. Survival functions were estimated by the actuarial method. Cumulative risks over time and their corresponding standard errors (SE) are provided along with the number of patients at risk (n). The likelihood exact test was used to compare survival functions for the different subgroups. A Cox proportional hazards regression model was fitted. Significance was taken as a drop in the likelihood estimator of the models compared. Adjusted hazard ratios (HR) and their 95% confidence intervals (95% CI) are provided in the results. In each hypothesis contrast the assumption of rate proportionality was verified. In all hypothesis contrasts (survival analysis and clinical parameter correlations) the null hypothesis of no difference was rejected with a type I or α-error of less than 0.05. Correction of p-values was not performed. Statistical analysis was performed with SPSS 15.0 for Windows (SPSS Inc., Chicago, USA).
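The actuarial (life-table) estimate mentioned above can be sketched as follows. In each interval, patients censored during the interval are counted as at risk for half of it, so the effective number at risk is n minus half the number censored. The function name, the interval width and the tuple layout are hypothetical choices for this sketch, and standard errors are not computed.

```python
def actuarial_survival(times, events, width):
    """Life-table survival estimate.

    times:  follow-up time of each patient (non-negative numbers)
    events: 1 if the event (e.g. death) was observed, 0 if censored
    width:  length of each interval, in the same units as `times`
    Returns a list of (start, end, n_at_risk, deaths, censored, survival).
    """
    if any(t < 0 for t in times):
        raise ValueError("follow-up times must be non-negative")
    n_at_risk = len(times)
    survival = 1.0
    start = 0.0
    table = []
    while n_at_risk > 0:
        end = start + width
        in_interval = [e for t, e in zip(times, events) if start <= t < end]
        deaths = sum(1 for e in in_interval if e)
        censored = len(in_interval) - deaths
        # Actuarial adjustment: censored patients count for half the interval.
        effective = n_at_risk - censored / 2.0
        survival *= 1.0 - deaths / effective
        table.append((start, end, n_at_risk, deaths, censored, survival))
        n_at_risk -= deaths + censored
        start = end
    return table
```

The cumulative risk reported for an interval is one minus the survival value in the last column of the corresponding row.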
Discussion
A general approach to finding prognostic markers in colon cancer is supervised analysis of gene expression. Class comparison between patients with good and bad prognosis has been carried out, and gene signatures that discriminate between high- and low-risk patients have been reported [2-10]. In this study, we have used a different strategy, hypothesizing that the identification of distinct molecular tumor subtypes would also be likely to discriminate patients with different clinical outcomes. In addition, understanding the biological pathways underlying each tumor subtype would likely help to find the appropriate treatment scheme.
We report a molecular classification of colon adenocarcinomas into four novel tumor subtypes identified by unsupervised analysis of gene expression. Tumor-associated stroma was clearly associated with this classification, characterizing a Low-stroma-subtype and a High-stroma-subtype. Mucinous histology, MSI, BRAF mutations as well as lower levels of nuclear β-catenin characterize the Mucinous-subtype. Tumor subtypes were independent of the histopathological stages. The lack of association with histopathological staging is important because it implies that tumor subtypes are established from the initial stages of the tumor, and could consequently contribute to the selection of patients at early stages. Additionally, it explains why many studies were unable to reliably associate molecular classification with Dukes stages [11, 12, 14, 17]. The nature of the genes expressed in each cluster and the biological pathways affected supported the association of the molecular and pathological parameters with the tumor subgroups. The Low-stroma-subtype, High-stroma-subtype and Mucinous-subtype were robust, associated with biological characteristics and validated in an external patient set. The combination of two different microarray studies in one data set is challenging; many of the important genes in each data set may be lost in the merged spreadsheet. Even so, when we combined our data set with the external data set we still kept important features in the combined data set. The novel molecular subtypes were also identified in the external data set (at least three of the four clusters).
Relevant reports identified stroma gene signatures associated with survival in diffuse large-B-cell lymphoma [26] and in breast cancer [27], reflecting the importance of the tumor microenvironment in the aggressive progression of the disease [28]. Moreover, a report in colon cancer showed that the presence of a high amount of stroma predicts worse survival for stage I-II colon cancer patients [29]. Stroma was highly associated with our molecular classification. Genes corresponding to pathways related to cell communication, ECM-receptor interaction, focal adhesion and CAMs were down-regulated in the Low-stroma-subtype and up-regulated in the High-stroma-subtype and in the Mucinous-subtype. The High-stroma-subtype had the highest percentage of stroma in the tumors and the highest level of stromal components. The Mucinous-subtype also had high levels of stroma-associated genes and its proportion of stroma was not significantly lower than in the High-stroma-subtype. Although clusters-3 and -4 share similar expression patterns of some of these stromal genes, other important genes clearly differ between these two subtypes: genes characteristic of goblet cells, trefoil factors and mucins, as well as other genes like REGIV, COX2 or CD55, are specifically up-regulated in cluster-4 or Mucinous-subtype.
The microenvironment is important for tumor development and, more interestingly, may be the target of novel treatments. In this line, promising studies are underway. Initial studies using antibodies against activated fibroblast proteins, like FAP, did not obtain objective tumor responses [30]; however, new developments are taking advantage of the enzymatic activity of FAP. With this strategy, a prodrug is administered in an inactive form that is proteolytically activated by the FAP present in cancer-activated fibroblasts localized in the tumor microenvironment. Once activated, the drug targets any cell contained in the tumor [30, 31]. Other anti-stroma therapies under development target integrin-extracellular matrix interactions [32, 33] or target tumor stroma using T cells [34] or human mesenchymal stem cells [35, 36]. Consequently, this is an active field of research and the identification of a high-stroma subtype group of patients may be essential to obtain benefit from these treatments, administering anti-stroma therapies just to this group of patients.
Since our survival results showed that the Low-stroma-subtype identified lower-risk patients and the High-stroma-subtype and Mucinous-subtype identified higher-risk patients, our results contradict many reports indicating that MSI tumors have a better clinical outcome than MSS/L tumors [37, 38]. However, the Mucinous-subtype retains important factors usually found in tumors with poor prognosis: a) mucinous tumors have a worse clinical outcome, and in our study mucinous and MSI tumors clustered together; b) high levels of SPP1, FAP, GREMLIN1, CD55 or REGIV, among others, have been reported to be associated with cancer invasion, metastasis and poor prognosis in colon cancer [39-45], and these genes are up-regulated in clusters-3 and -4; c) the increased levels of TFF2 and MUC1, characteristic of the Mucinous-subtype, have also been associated with a poor clinical outcome [46]; d) BRAF mutations have been shown to be a worse prognostic factor [47, 48], and four out of the five MSI tumors in the Mucinous-subtype harbor BRAF mutations. For all these reasons, we could expect patients of the High-stroma-subtype and Mucinous-subtype to have a worse clinical outcome.
The largest cluster was the Low-stroma-subtype, which shows key clinical properties that especially distinguish it from the other tumor subtypes. First, a 167-gene signature associated with this group of tumors distinguished low-risk patients in an external clinical cohort. Second, eight different reported gene signatures, including the extensively validated Oncotype-DX and ColoPrint [2-4, 7-10, 18], classified the Low-stroma-subtype patients in one group and the other tumor subtypes in a second group. Comparing microarray analyses across different studies and platforms is challenging. In general there is little or no overlap among different gene signatures. In our study we found that eight different reported survival predictors and our 167-gene Low-stroma-subtype predictor, with almost no overlap among them, recognized the same group of patients in our data, the Low-stroma-subtype. Furthermore, our 167-gene Low-stroma-subtype predictor was able to identify the patients with better clinical outcome in the external data set. What is important and relevant for clinical application is recognizing the same type of patients, not demonstrating overlap among different gene lists. This coincidence is important to confirm the potential of microarray gene expression for the identification of low-risk patients. Nevertheless, it should be noted that survival outcomes have not been confirmed with our own survival data and in the setting of a multivariable analysis. A larger sample of homogeneous groups of patients will be necessary to establish the prognostic value of this molecular classification.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
BPV, FMS and EDR supervised and designed the original study with the collaboration of TC. SHP, JALA and JSA performed the pathological study. ARL and BPV carried out the microarray experiments. ARL, BPV, GLC and FMS analyzed the data. CFP and ARL performed the statistical analysis. AC, JS and RA performed the selection and clinical study of the patients. All authors contributed to revising the article. BPV wrote the paper. All authors read and approved the final manuscript.