Background
Prostate cancer is an extremely heterogeneous disease [
1]. For patients with castration-resistant prostate cancer (CRPC), overall survival can range widely from months to years. Accurate prediction of survival is crucial for clinical management and for patient stratification into clinical trials. Unfortunately, monitoring genetic alterations in metastatic prostate cancer has been inhibited by the difficulty in obtaining serial metastatic biopsies, since these are not routinely needed for clinical management. Blood-based biomarker assays are minimally invasive and can be easily implemented in clinical practice. As such, diagnostic and prognostic models built on peripheral blood gene expression have been reported for various types of cancers [
2‐
9]. Two recently published studies from our respective groups [
10,
11] suggested that the RNA transcript levels of specific gene sets in whole blood samples were significantly associated with overall survival in patients with CRPC. However, the lists of genes identified by the two studies were completely non-overlapping and questions remained regarding the underlying pathogenic processes reflected by the two distinct signatures.
Such lack of consistency is not uncommon in genome-wide biomarker discovery studies given the large pool of candidate genes with complex correlation structures, relatively small sample sizes, the noisy nature of high-throughput technologies, and cross-platform variables. Specifically, a six-gene signature reported by Ross et al. [
11] was derived from qRT-PCR profiling and modeling of 168 pre-selected genes associated with inflammation, immune response, angiogenesis, apoptosis, tumor suppression, cell cycle, DNA repair, and tumor progression using whole-blood RNA samples from CRPC patients. Gene expression changes in patients with increased mortality were associated with down-regulation of cellular and humoral immunity and monocyte differentiation towards the production of tissue macrophages. A second signature developed by Olmos et al. [
10] was constructed by selecting top-ranking differentially expressed genes from microarray whole-blood RNA profiling data, comparing a group of CRPC patients showing worse survival with the remaining CRPC patients. The resulting gene signature associated poor prognosis with increased CD71(+) erythroid progenitor cells. While both models strongly predicted prognosis, the very different gene signatures suggested different underlying immunological drivers.
Computational techniques can improve the results of genome-wide biomarker discovery studies, although each has its own shortcomings. For instance, meta-analysis identifies robust biomarkers that correlate with the phenotype of interest across multiple datasets [
12]. However, multiple datasets must be available with similar experimental designs. Advanced machine learning techniques, such as ElasticNet [
13], can construct predictive models from genomic data, but these models are overly reliant on the training dataset; the resulting algorithms cannot distinguish genuine from random correlations with phenotype. Furthermore, there is often no clear molecular mechanism underlying these biomarker models. As a result, it is difficult to develop biological interpretations of the generated models.
To overcome these issues, we developed a novel computational strategy that builds robust prognostic models by selecting genes within stable co-expression modules. This method integrates independent mRNA expression datasets that come from different experimental designs, and derives stable co-expression modules among candidate signature genes. Representative genes are then selected from each stable co-expression module to build a predictive model. This method thus generates gene expression models which, together with underlying biological pathways, facilitate hypothesis formation. We applied this novel strategy to reanalyze the Olmos et al. [
10] dataset and generated a superior four-gene prognostic model. The new model was then validated in two independent CRPC cohorts.
Methods
Workflow of a co-expression module-based integrative approach to build robust prognostic models
Step 1. Create a list of candidate prognostic genes
The Olmos dataset [
10] was downloaded from GEO (GSE37199) and the non-CRPC samples were removed from the dataset. A list of candidate prognostic genes was created by applying differential expression analysis to the two groups of CRPC patients with different survival outcomes in the Olmos dataset. We used the R package LIMMA [
14] and identified 2,209 candidate prognostic genes at a false discovery rate of <0.05 [
15].
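The false discovery rate filter in step 1 can be illustrated with a minimal Python sketch. It assumes that the per-gene P values have already been produced by the differential expression analysis and that the Benjamini-Hochberg procedure is the FDR method; the text does not name the procedure, and the function names `benjamini_hochberg` and `select_candidates` are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_candidates(gene_p_values, fdr=0.05):
    """Keep genes whose adjusted P value is below the FDR cutoff.

    gene_p_values: dict mapping gene symbol -> raw P value (assumed input).
    """
    genes = list(gene_p_values)
    q_values = benjamini_hochberg([gene_p_values[g] for g in genes])
    return [g for g, q in zip(genes, q_values) if q < fdr]
```

Calling `select_candidates` on a dictionary of raw P values returns the genes that would enter step 2 under these assumptions.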
Step 2. Identify stable co-expression modules among candidate prognostic genes
We extracted whole blood gene expression profiles of 437 males from the Iceland Family Blood (IFB) study [
16] and 99 male samples from the Genotype-Tissue Expression (GTEx) study [
17]. Based on each of the two datasets, we identified co-expression modules among the up-regulated and down-regulated candidate genes from step 1, separately using the R package WGCNA [
18]. We then compared modules derived from the two datasets and ranked the overlap between modules according to their significance (Fisher’s exact test). We designated modules with significant overlap between the two datasets (P value of Fisher’s exact test <0.01) as stable co-expression modules. If the list of up-regulated stable co-expression modules was not of the same length as that of the down-regulated ones, we discarded the bottom-ranking stable co-expression modules from the longer list to make them the same length.
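As a rough illustration of the module comparison in step 2, the sketch below computes a one-sided Fisher’s exact test P value for the overlap between two modules as a hypergeometric upper tail. It assumes that the gene universe is the set of candidate genes measured in both datasets, which the text does not specify; `overlap_p_value` and `stable_pairs` are hypothetical names.

```python
from math import comb


def overlap_p_value(module_a, module_b, universe_size):
    """One-sided Fisher's exact test P value for the overlap of two gene sets.

    Computed as the hypergeometric probability of seeing at least the
    observed number of shared genes when drawing len(module_b) genes
    from a universe that contains len(module_a) "marked" genes.
    """
    a, b = set(module_a), set(module_b)
    k, n_a, n_b = len(a & b), len(a), len(b)
    total = comb(universe_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


def stable_pairs(modules_1, modules_2, universe_size, cutoff=0.01):
    """Rank cross-dataset module pairs by overlap significance.

    modules_1, modules_2: dicts mapping module name -> iterable of genes.
    Returns (P value, name_1, name_2) tuples below the cutoff, best first.
    """
    pairs = [
        (overlap_p_value(g1, g2, universe_size), n1, n2)
        for n1, g1 in modules_1.items()
        for n2, g2 in modules_2.items()
    ]
    return sorted(p for p in pairs if p[0] < cutoff)
```

The ranked output can then be truncated so that the up-regulated and down-regulated lists have the same length, as described above.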
Step 3. Identify functional cores of stable co-expression modules
We carried out gene set enrichment analysis for each stable co-expression module from step 2 using two types of gene sets. The first type comprised the canonical pathways downloaded from the MSigDB database [
19]. The second set consisted of genes overexpressed in specific types of hematopoietic cells, obtained from the HematoAtlas study [
20]. The functional core of each module was defined as the intersection between the module and its most significantly enriched canonical pathway (P value of Fisher’s exact test <1×10⁻⁴, corresponding to a family-wise error rate of 0.1 after Bonferroni correction). In case there was no significantly enriched canonical pathway for the module (the first type of gene set), we used the intersection between the module and its most significantly enriched gene set of cell type-specific overexpression (the second type of gene set).
Step 4. Select representative genes for each co-expression module
From the functional core of each stable co-expression module (step 3), a representative gene was selected as the most differentially expressed between good and poor prognosis groups in step 1. To avoid selecting genes with very low expression levels, we also required the expression level of the representative gene to be higher than that of half of the genes in the genome. We thus obtained two lists of representative genes from up-regulated and down-regulated modules, respectively, which were ordered according to the ranking of their corresponding modules, i.e. the P value of the overlap significance (step 2).
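The selection rule in step 4 can be sketched as follows. The sketch assumes that "most differentially expressed" means the smallest differential expression P value from step 1 and that the expression filter is the genome-wide median of mean expression; both are interpretations rather than details confirmed by the text, and `pick_representative` is a hypothetical name.

```python
from statistics import median


def pick_representative(core_genes, de_p_value, mean_expression):
    """Pick one representative gene from a module's functional core.

    core_genes: genes in the functional core (step 3).
    de_p_value: dict gene -> differential expression P value (step 1).
    mean_expression: dict gene -> mean expression for all genes measured.
    """
    # Expression must exceed that of half of all measured genes.
    cutoff = median(mean_expression.values())
    eligible = [g for g in core_genes if mean_expression[g] > cutoff]
    if not eligible:
        return None  # no core gene passes the expression filter
    # Among eligible genes, take the most significant one.
    return min(eligible, key=lambda g: de_p_value[g])
```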
Step 5. Train and cross-validate prognostic models
We then built gene models based on the representative genes (step 4), using the Olmos dataset as the training dataset and the naïve Bayesian classifier (R package e1071) as the learning algorithm. The feature-independence assumption of the naïve Bayesian classifier was largely satisfied since the representative genes were chosen from modules with distinct expression profiles. We used leave-one-out cross-validation to determine the optimal number of genes included in the model (Additional file
1).
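The model-size selection in step 5 can be sketched with a small Gaussian naïve Bayes classifier and leave-one-out cross-validation. This is an illustrative Python stand-in, not the e1071 implementation used in the study. It assumes that the columns of `X` are already ordered by module rank, that models are grown by adding genes in that order, and that classification accuracy is the selection criterion; all function names are hypothetical.

```python
from math import log, pi
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [mean(c) for c in cols],
            # Small constant avoids division by zero for constant features.
            "var": [pvariance(c) + 1e-9 for c in cols],
        }
    return model


def predict(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(params):
        score = log(params["prior"])
        for v, mu, var in zip(x, params["mean"], params["var"]):
            score += -0.5 * log(2 * pi * var) - (v - mu) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))


def loocv_accuracy(X, y):
    """Leave-one-out cross-validated accuracy."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += predict(fit_gaussian_nb(train_X, train_y), X[i]) == y[i]
    return hits / len(X)


def best_model_size(X, y, max_genes):
    """Score models built from the first k genes, k = 1..max_genes."""
    scores = {
        k: loocv_accuracy([row[:k] for row in X], y)
        for k in range(1, max_genes + 1)
    }
    return max(scores, key=scores.get), scores
```

Here `X` is a list of per-patient expression vectors and `y` the corresponding prognosis labels; ties between model sizes are resolved in favor of the smaller model.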
Validation sets I and II
The first validation dataset (I) consisted of 25 CRPC patients recruited from Mount Sinai Medical Center in New York. Whole-blood RNA was extracted using the PAXgene RNA extraction kit. After proper RNA quality control, the samples were sent for RNA-seq at the Genomic Core Facility at Mount Sinai. Illumina HiSeq 2500 was used for RNA-seq with 100 nt single read and poly(A) enriched library. The TopHat software was used to generate fragments per kilobase of exon per million fragments mapped (FPKM) values for each gene. We applied a gene-wise standardization strategy [
21,
22] to adjust for the platform difference between the training and validation datasets. More specifically, for each gene in the validation dataset, we linearly transformed the log2 FPKM value to make its median and median absolute deviation the same as those in the training dataset. We then calculated the four-gene score based on the gene expression after transformation. Similarly, to calculate the Ross six-gene score in the validation dataset, we scaled the log2 FPKM values according to the gene distribution in the Ross training dataset [
11]. Since the original data (by qRT-PCR using a custom Taqman array) to optimize the parameters and the cutoff value of the Olmos nine-gene score were no longer available, such transformation was not applicable to this score.
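The gene-wise standardization described above amounts to a linear rescaling of each gene. A minimal sketch, assuming the inputs are log2 expression values for a single gene across patients (`match_to_training` is a hypothetical name):

```python
from statistics import median


def mad(values):
    """Median absolute deviation (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def match_to_training(validation_values, training_values):
    """Linearly transform one gene's validation values so that their
    median and MAD equal those of the same gene in the training data."""
    v_med, v_mad = median(validation_values), mad(validation_values)
    t_med, t_mad = median(training_values), mad(training_values)
    if v_mad == 0:
        raise ValueError("validation values have zero MAD; cannot rescale")
    scale = t_mad / v_mad
    return [t_med + (v - v_med) * scale for v in validation_values]
```

Applying this function gene by gene puts the validation measurements on the scale of the training dataset, so that a score and cutoff derived on the training platform can be evaluated on the transformed values.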
To get four-gene PCR measurements for validation set I, first-strand cDNA was synthesized from oligo-dT primed RNA templates using SuperScript® III First-Strand Synthesis System for RT-PCR (Life Technologies). Expression levels of individual genes in the four-gene signature were determined on the ViiA7 qPCR instrument using custom-made Taqman Array Cards (Life Technologies) with the Taqman Universal qPCR master mix. The delta Ct value was normalized using 18S RNA as endogenous control. To adjust the platform difference, we did a similar transformation of delta Ct value according to its distribution in the training dataset.
The second validation dataset (II) consisted of 66 CRPC patients recruited from the Urology Clinic at the University of Technology in Munich, Germany. Whole blood samples were collected in PAXgene™ Blood RNA tubes. The four-gene qPCR measurements were obtained as described for the first validation set.
Ethical considerations
The first validation dataset (I) consisted of 25 CRPC patients recruited from Mount Sinai Medical Center in New York. The PPHS (Program for the Protection of Human Subjects) at Mount Sinai Medical Center approved the study (protocol #10-1180; PI: W. Oh) to allow blood collection. All patients provided written informed consent to allow linking of clinical data and serum specimens for research purposes through participation in this specimen-banking protocol.
The second validation dataset (II) consisted of 66 CRPC patients recruited from the Urology Clinic at the University of Technology in Munich, Germany. The study was approved by the Ethics Committee (Ethikkommission, Fakultät für Medizin) (project #313/13; PI: M. Heck) to allow blood collection, and all patients provided written informed consent.
The IFB dataset was downloaded from the GEO database with accession number GSE7965. The Olmos dataset was downloaded from the GEO database with accession number GSE37199. The GTEx dataset was downloaded from the dbGaP database with study accession phs000424.v5.p1. These three datasets are publicly available. Further consent for using these datasets was not required.
Discussion
Herein, we developed a module-based integrative computational strategy to construct robust prognostic models from expression profiles by dissecting candidate genes into stable co-expression modules that were functionally related to cancer progression. The advantages of our strategy and the resulting four-gene model are summarized below.
First, in selecting signature genes to be included in the model, we focused on stable co-expression modules that reflect the activity of biological pathways rather than individual genes. It is not a ‘black box’ learning approach, but rather a gene-selection approach guided by underlying biology. We showed that all of the up-regulated modules were overexpressed in myeloid cells and all of the down-regulated modules were over-expressed in lymphoid cells. A simplistic interpretation would be that observed mRNA expression changes may represent alterations in the composition of hematopoietic cells during prostate cancer progression. However, the four-gene score performed better than cell count-based clinical parameters in both validation datasets (Tables
3 and
4), suggesting that cell component change was only one factor contributing to the patients’ prognosis. For example, there was a significant correlation between the gene expression level of TMEM66 (overexpressed in T cells) and lymphocyte count (Additional file
1: Figure S6A, Pearson’s correlation coefficient = 0.48), indicating that TMEM66 expression level reflected changes in lymphocyte abundance. However, TMEM66 gene expression level predicted patient survival much better than lymphocyte count in a bivariate Cox regression model (P = 0.002 and 0.2 for TMEM66 and lymphocyte count, respectively), suggesting that TMEM66 gene expression level carried more prognostic information than T cell abundance or changes in lymphocyte count. Another related cell count-based clinical measurement is the neutrophil to lymphocyte ratio (NLR), which has been shown to be prognostic in several cancer studies [
28‐
31]. We similarly observed a trend of patients with higher NLR having a worse survival outcome (Additional file
1: Figure S7). However, since the HR was relatively small (1.52 and 1.38 for validation sets I and II) and the sample size in our study was smaller than those of the previous studies, the prognostic power of NLR was not statistically significant in our validation sets (Tables
3 and
4,
P >0.05). While there was a significant correlation between the four-gene score and the NLR in our study (Additional file
1: Figure S6B, Pearson’s correlation coefficient = 0.55), our four-gene score demonstrated much better prognostic power than NLR. We reason that besides cell count changes, gene expression levels also reflect cellular or pathway activity, and it is likely that the alteration of both the abundance and activity of different cells eventually leads to differential prognostic outcomes. Another explanation is that the expression change also reflects a combination of cell count changes of multiple types or sub-types of cells which were not directly measured in our study. The observation that up-regulated stable co-expression modules were also overexpressed in early erythroid cells, myeloid progenitor cells, and hematopoietic stem cells suggests that their up-regulation may come from myeloid-derived cells whose counts are not routinely measured. For example, they may represent myeloid progenitor cells which have ‘leaked’ from bone marrow due to metastasis [
32] or circulating myeloid-derived suppressor cells, which have been shown to greatly influence tumor progression and metastasis [
33].
Second, the module-based procedure enabled us not only to comprehensively represent diverse pathways but also to distinguish biological signals from data-specific ‘noise’. There are many advanced machine learning algorithms (e.g. Lasso [
34] and ElasticNet [
13]) which can automatically select the best set of features to be included in the model. However, since the features are usually learned entirely from the training dataset, they may be biased to dataset-specific effects. For instance, the model trained using ElasticNet showed high accuracy in the training dataset by cross-validation, but such high accuracy failed to be reproduced in the independent validation datasets (Additional file
1: Figures S8 and S9 and Supplementary Methods in Additional file
1).
Third, the new four-gene model was evaluated in a multi-stage, multi-platform, and multi-institutional process. The training dataset and the two validation datasets were generated from CRPC cohorts recruited at three different institutions using three different platforms, i.e. Affymetrix array, RNA-seq, and qPCR. Our four-gene model performed extremely well across all of these datasets with a universal cutoff value. We also showed that the four-gene score was stable across intra-patient and inter-day blood samples and that it changed along with disease progression. More details about the four-gene score variability can be found in Additional file
1.
There are many important clinical and translational implications to these data. First, if host immune function is so reproducibly critical to prostate cancer progression and survival, then current efforts to model therapeutic efficacy in certain models, such as patient-derived xenografts, will likely fail to represent the true outcome in patients. Second, the current development of promising immunotherapies in cancer, including vaccines, checkpoint inhibitors, and other immunomodulatory agents, will clearly need improved biomarkers to predict benefit and to better guide personalized therapies. Whole blood RNA profiles hold great promise in evaluating such baseline and serial changes in immune parameters, given their ability to provide a potentially holistic view of the key RNA transcripts involved in clinical benefit. Finally, clinical trial stratification using prognostic and predictive models based on whole blood RNA profiles will enable more rapid drug development by targeting specific populations that not only have differential outcomes in CRPC but also have different baseline characteristics that would make them more likely to benefit from specific therapies.
Despite these encouraging findings, there are important limitations and unaddressed questions that need further study. For instance, some alternative biomarker approaches, such as circulating tumor cell count [
35], were not directly compared in this study. Halabi et al. [
36,
37] described how standard clinical variables can be used to predict prognosis for CRPC. While we included as many clinical parameters as were available to us, several variables were not available in our current study (e.g. opioid analgesic use and Eastern Cooperative Oncology Group performance status). Follow-up studies are needed to uncover the causal and mechanistic interactions between blood gene expression changes and clinical disease progression.
Acknowledgements
The project was partially funded by a Young Investigator Award from the Prostate Cancer Foundation (LW), R01MH090948 (JZ), and U01AG046170 (JZ). None of the aforementioned funding bodies were involved in the study design and conduct. We would like to thank our team of clinical coordinators, including Teena Kochukoshy, Manpreet Brar, and Victoria Gresia, for consenting patients, collecting blood, and providing database support for the study. JDB's laboratory is supported by a Cancer Research UK Centre grant, Experimental Cancer Medicine Centre funding, a Prostate Cancer UK and Movember Centre of Excellence grant, and a National Institute for Health Research Biomedical Research Center to the Royal Marsden/ICR.
Competing interests
All authors declare that they have no competing interests.
Authors’ contributions
LW contributed to all data analysis and wrote the first draft of the manuscript. YG contributed to experimental design, sample acquisition, and data generation. UCV contributed to sample acquisition and data generation. MH, MR, and RN contributed to sample acquisition of the Validation set II. MG and ES contributed to data interpretation. JDB and DO contributed to data acquisition of the training set. JZ and WKO conceptualized the project and guided the experimental design, data analysis, and manuscript writing. All authors read and approved the final manuscript.