Introduction
Survival prediction using high-dimensional omics data has been widely discussed in precision medicine, particularly in cancer research [1-3]. Genomic data, which carry abundant hereditary information, largely determine the phenotypic heterogeneity of cancer patients [4, 5]. In recent years, high-throughput sequencing technologies have facilitated the extensive use of genomic information to predict patient prognosis [6]. The challenge lies in constructing efficient and robust survival prediction models in the context of high-dimensional data.
Regularization methods, such as Lasso, relaxed Lasso, and elastic-net, are recognized as powerful modeling tools that yield predictive and interpretable models [7]. These methods have been extended to the Cox model to better handle survival data [8]. When applied to genomic data, they construct models based on individual genes, treating them as independent predictors. However, the progression and prognosis of cancer are regulated by multiple biological signaling pathways, so incorporating pathway-level information into model building can represent the underlying biological processes more accurately [9-11]. In this light, several extensions, such as the group Lasso (grlasso) and the composite minimax concave penalty (cMCP), integrate biological pathway information as group structure into the modeling procedure [12, 13]. In addition, several attempts have been made to build pathway-based modeling strategies. Chen and Wang proposed integrating predefined biological pathway information with gene expression profiles for cancer prognosis [14]. Zhang et al. proposed a two-stage strategy that integrates risk scores derived from pathway-based models for cancer survival prediction [15]. Kim et al. used a directed random walk algorithm that navigates the pathway network to extract effective genomic features [16]. However, most of these are single-model methods, which usually lead to unstable predictions, and the others employ concepts similar to naive stacking.
Stacking is an ensemble learning method that combines cross-validated (CV) predictions from multiple different algorithms or models [17]. By leveraging the strengths of different models, stacking often yields more robust and accurate predictions than a single model [18]. However, applying stacking to survival data is more complex because the predicted survival probability varies over time. Wey et al. proposed using the inverse probability of censoring weighted Brier Score (IPCW-BS) as the objective function for survival stacking models based on multiple time points [19]. Golmakani and Polley assumed that all candidate models satisfy the proportional hazards condition and used the cross-validated negative log partial likelihood as the objective function [20]. Tibshirani et al. showed that a logistic regression fitted to the events at different time points approximates the Cox model estimate, so survival analysis can be cast as a stacking classification problem [21]. Ginestet et al. proposed an ensemble procedure based on a pseudo-observation-based AUC loss to optimally stack predictions from survival algorithms [22].
In the present study, we introduced a novel survival stacking method that integrates group structure information to improve the robustness of cancer survival prediction using high-dimensional omics data. Our approach grouped the genomic data into multiple sub-datasets based on biological pathway knowledge, and sub-models were then trained independently on the features in each sub-dataset. In addition to a non-negative linear combination of sub-models obtained with a traditional optimization method based on the integrated Brier Score (IBS) loss function, we proposed a Bayesian hierarchical generalized linear model (BhGLM) with a non-negative mixture double-exponential (DE) prior, as well as an artificial neural network (ANN), to ensemble the predictions of the sub-models. We compared the proposed methods with several competitors, including widely used penalized survival methods and extensions that consider group structure, through a simulation study and real-world data applications. The results showed that the proposed survival stacking strategy had favorable properties in prediction and interpretability.
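The idea of a non-negative linear combination of sub-model predictions can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a squared-error (Brier-type) loss at a single time point, ignores censoring weights and the integration over time that the IBS involves, and uses plain projected gradient descent as a stand-in optimizer. The function name `fit_nonneg_weights` and its parameters are hypothetical.

```python
def fit_nonneg_weights(preds, target, n_iter=2000, lr=0.1):
    """Non-negative stacking weights by projected gradient descent.

    preds  : list of rows; row i holds the K sub-model predictions for subject i
    target : list of observed outcomes in [0, 1] at one time point
    Hypothetical helper: censoring is ignored for simplicity.
    """
    n, k = len(preds), len(preds[0])
    w = [1.0 / k] * k  # start from equal weights
    for _ in range(n_iter):
        # residual of the current weighted combination for each subject
        resid = [sum(wj * pj for wj, pj in zip(w, row)) - t
                 for row, t in zip(preds, target)]
        # gradient of the mean squared error with respect to each weight
        grad = [2.0 / n * sum(r * row[j] for r, row in zip(resid, preds))
                for j in range(k)]
        # gradient step, then projection onto the non-negative orthant
        w = [max(0.0, wj - lr * gj) for wj, gj in zip(w, grad)]
    return w


# Toy data (invented): sub-model A is informative, sub-model B is a constant.
target = [1, 1, 0, 0, 1, 0]
preds = [[float(t), 0.5] for t in target]
weights = fit_nonneg_weights(preds, target)
```

In this toy example the weight of the uninformative sub-model is driven towards zero, which mirrors how a sparse non-negative combination can highlight the sub-models (here, pathways) that carry predictive signal.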
The paper is organized as follows. Section 2 presents a detailed illustration of the proposed strategy. Section 3 compares the prediction performance of the proposed method with that of existing methods through a simulation study. In Section 4, the proposed methods are applied to several real-world datasets. Lastly, Section 5 concludes the paper and discusses several critical issues related to our methods.
Applications to real data
We applied the proposed method to three real-world cancer datasets with survival records and large-scale gene expression profiles. For these datasets, gene expression data were standardized using the covariates function in the BhGLM package. We randomly partitioned the original data into two subsets of equal sample size: one for training models and the other for evaluating model performance. The process was repeated 100 times to guard against chance results arising from a particular data split. To ensure a balanced response, we performed a log-rank test comparing the survival curves of the training and test data, and retained only the splits with log-rank P > 0.5 as balanced splits for further analysis. Genes were mapped to pathways using genome annotation tools. More precisely, we first mapped gene symbols to Entrez IDs using the annotateI package and then mapped genes to KEGG pathways (default parameters) using the clusterProfiler package [31].
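The repeated split with a log-rank balance check described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the code used in the study: the two-sample log-rank statistic is computed by hand and referred to a chi-square distribution with one degree of freedom, and the names `logrank_p` and `balanced_splits`, the seed, and the toy data are hypothetical.

```python
import math
import random


def logrank_p(time_a, event_a, time_b, event_b):
    """P value of the two-sample log-rank test (chi-square, 1 df)."""
    data = [(t, e, 0) for t, e in zip(time_a, event_a)]
    data += [(t, e, 1) for t, e in zip(time_b, event_b)]
    o_minus_e, var = 0.0, 0.0
    for t in sorted({s for s, e, _ in data if e}):
        n = sum(1 for s, _, _ in data if s >= t)            # at risk, both groups
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)  # at risk, group A
        d = sum(1 for s, e, _ in data if s == t and e)       # events at time t
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        o_minus_e += d_a - d * n_a / n                        # observed - expected
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    # survival function of chi-square(1) at x equals erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(o_minus_e ** 2 / var / 2))


def balanced_splits(times, events, n_repeats=100, p_min=0.5, seed=0):
    """Draw n_repeats random half splits; keep those with log-rank P > p_min."""
    rng = random.Random(seed)
    idx = list(range(len(times)))
    half = len(idx) // 2
    kept = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        train, test = sorted(idx[:half]), sorted(idx[half:])
        p = logrank_p([times[i] for i in train], [events[i] for i in train],
                      [times[i] for i in test], [events[i] for i in test])
        if p > p_min:
            kept.append((train, test, p))
    return kept


# Toy data (invented): 20 subjects, alternating event and censoring.
times = list(range(1, 21))
events = [1, 0] * 10
kept = balanced_splits(times, events)
```

Each retained tuple holds the training indices, the test indices, and the log-rank P value of that split, so downstream model fitting can loop over `kept` only.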
TCGA breast cancer dataset
We obtained the transcriptome profiles (in TPM format) and the corresponding latest survival information for TCGA Breast Cancer (BRCA) from the GDC Data Portal (https://portal.gdc.cancer.gov/). We selected the female samples that had both survival outcomes and gene expression profiles. Genes with zero expression in more than 50% of samples were filtered out, and those with variance above the 20% quantile were retained. The final dataset consisted of 1060 samples and 13,745 genes. These genes were mapped to 140 pathways involving 3855 genes (see Supplementary Table 3).
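The gene filtering step can be sketched as below. This is a hedged illustration, not the authors' pipeline: it assumes the zero-expression filter is applied first and the variance quantile is then computed on the remaining genes, and it uses a simple empirical quantile. The function name `filter_genes`, its parameters, and the toy expression values are hypothetical.

```python
import math
from statistics import variance


def filter_genes(expr, max_zero_frac=0.5, var_quantile=0.2):
    """Keep genes that are not mostly zero and whose variance exceeds a quantile.

    expr : dict mapping gene name -> list of expression values across samples
    """
    # step 1: drop genes with zero expression in more than max_zero_frac of samples
    nonsparse = {g: v for g, v in expr.items()
                 if sum(x == 0 for x in v) / len(v) <= max_zero_frac}
    # step 2: sample variance of each remaining gene
    variances = {g: variance(v) for g, v in nonsparse.items()}
    ranked = sorted(variances.values())
    # empirical quantile: value at rank ceil(q * n) among the sorted variances
    cutoff = ranked[max(0, math.ceil(var_quantile * len(ranked)) - 1)]
    # step 3: retain genes whose variance lies above the cutoff
    return [g for g in nonsparse if variances[g] > cutoff]


# Toy data (invented): six genes measured in four samples.
expr = {
    "g_zero": [0, 0, 0, 5],   # zero in 75% of samples
    "g1": [1, 1, 1, 1.1],     # lowest variance
    "g2": [1, 2, 3, 4],
    "g3": [1, 3, 5, 7],
    "g4": [2, 6, 10, 14],
    "g5": [0, 10, 20, 30],
}
retained = filter_genes(expr)
```

Here `g_zero` is removed by the zero-expression filter and `g1` by the variance filter, leaving the four more variable genes.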
Prior to the stacking process, we performed an initial pathway screening to identify pathways with potential predictive value. We fitted a Lasso Cox model for each of the 140 pathways separately in the original data and obtained the C-index for each pathway. A total of 116 pathways had a C-index > 0.5. However, many of them were not predictive but introduced variance, which was detrimental to the ensemble prediction. We therefore restricted the candidate pathways to those with a C-index > 0.55, resulting in 48 pathways for the subsequent analysis.
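The screening rule can be illustrated with a short sketch. It assumes that a risk score per subject is already available for each pathway (in the study these come from pathway-specific Lasso Cox models, which are not refitted here) and computes Harrell's C-index directly. The names `c_index` and `screen_pathways`, the pathway labels, and the toy data are hypothetical.

```python
def c_index(times, events, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had an event before time j;
    it is concordant when subject i also has the higher risk score.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable


def screen_pathways(times, events, risk_by_pathway, threshold=0.55):
    """Return the pathways whose C-index exceeds the threshold."""
    scores = {name: c_index(times, events, risk)
              for name, risk in risk_by_pathway.items()}
    return {name: c for name, c in scores.items() if c > threshold}


# Toy data (invented): four subjects, the last one censored.
times = [1, 2, 3, 4]
events = [1, 1, 1, 0]
risk_by_pathway = {
    "pathway_A": [4, 3, 2, 1],  # higher risk for earlier events
    "pathway_B": [1, 2, 3, 4],  # reversed ordering
}
selected = screen_pathways(times, events, risk_by_pathway)
```

Only `pathway_A` passes the threshold in this example; with real data the threshold trades off the number of candidate sub-models against the variance they add to the ensemble.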
Table 3 summarizes the average time-AUC and time-BS at the three time points for the various methods applied to the BRCA dataset. In general, gsslasso and grlasso showed better predictive performance than the other single-model methods. Pathway-stacking methods outperformed single-model methods in terms of discrimination. The stacking methods also showed good calibration at the early and middle survival times. Among the survival stacking methods, solnp(Lasso) showed consistently good calibration across time but inferior discrimination. nsslasso(Lasso) performed well in the early and middle periods, while ANN(Lasso) achieved better discrimination at the middle and late survival times.
Table 3
The measurements (mean (SD)) of penalty and group penalty methods and pathway-stacking methods for the TCGA breast cancer dataset (N = 1060) from 100 random splits into a training set (N = 530) and a test set (N = 530)
Single-model methods | AUC (time point 1) | BS (time point 1) | AUC (time point 2) | BS (time point 2) | AUC (time point 3) | BS (time point 3) |
Lasso | 0.509(0.086) | 0.064(0.088) | 0.549(0.060) | 0.093(0.074) | 0.555(0.064) | 0.151(0.048) |
gsslasso | 0.560(0.096) | 0.030(0.039) | 0.574(0.066) | 0.064(0.034) | 0.599(0.062) | 0.133(0.023) |
grlasso | 0.569(0.082) | 0.060(0.083) | 0.582(0.064) | 0.089(0.071) | 0.595(0.063) | 0.150(0.045) |
grSCAD | 0.543(0.068) | 0.101(0.108) | 0.544(0.058) | 0.124(0.091) | 0.561(0.060) | 0.170(0.058) |
cMCP | 0.558(0.095) | 0.123(0.113) | 0.548(0.069) | 0.143(0.095) | 0.559(0.080) | 0.183(0.060) |
Pathway-stacking methods |
solnp(Lasso) | 0.600(0.069) | 0.028(0.005) | 0.605(0.060) | 0.077(0.009) | 0.608(0.050) | 0.191(0.011) |
nLasso(Lasso) | 0.598(0.074) | 0.028(0.005) | 0.609(0.056) | 0.078(0.009) | 0.613(0.043) | 0.190(0.013) |
nsslasso(Lasso) | 0.605(0.071) | 0.028(0.005) | 0.611(0.056) | 0.078(0.009) | 0.615(0.044) | 0.190(0.013) |
ANN(Lasso) | 0.593(0.067) | 0.028(0.005) | 0.619(0.059) | 0.077(0.009) | 0.622(0.046) | 0.204(0.011) |
An advantage of nLasso and nsslasso is that they can identify important pathways owing to their sparsity. When applied to the whole TCGA BRCA dataset, nsslasso(Lasso) and nLasso(Lasso) selected similar pathways. nsslasso(Lasso) found three pathways: Huntington’s disease (relative weight, w = 0.962), HIF-1 signaling pathway (w = 0.076), and Leishmaniasis (w = 0.062) (see Supplementary Table 4). nLasso(Lasso) found four pathways: Huntington’s disease (w = 0.749), HIF-1 signaling pathway (w = 0.114), Leishmaniasis (w = 0.086), and Oxidative phosphorylation (w = 0.051), the first three of which were selected by both methods.
METABRIC dataset
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) data comprise comprehensive information on more than 2000 breast cancer patients, including clinical data, gene expression data, and mutation data. We obtained gene expression data and survival data from cBioPortal (https://www.cbioportal.org/). After data preprocessing (as described in Section 4.1), we obtained a dataset with 1420 samples and 19,494 genes. These genes were mapped to 146 pathways involving 3709 genes (see Supplementary Table 5).
After the pathway pre-screening, we included 138 of the 146 pathways, those with a C-index > 0.60, in the following analysis. Among the single-model methods, grlasso again had the best predictive performance. Pathway-stacking methods showed favorable discrimination compared with grlasso (Table 4). nLasso(Lasso) and nsslasso(Lasso) performed well in both discrimination and calibration.
Table 4
The measurements (mean (SD)) of penalty and group penalty methods and pathway-stacking methods for the METABRIC dataset (N = 1420) from 100 random splits into a training set (N = 710) and a test set (N = 710)
Single-model methods | AUC (time point 1) | BS (time point 1) | AUC (time point 2) | BS (time point 2) | AUC (time point 3) | BS (time point 3) |
Lasso | 0.705(0.022) | 0.160(0.003) | 0.679(0.018) | 0.213(0.004) | 0.651(0.020) | 0.235(0.005) |
gsslasso | 0.701(0.022) | 0.159(0.004) | 0.675(0.017) | 0.215(0.006) | 0.653(0.019) | 0.239(0.008) |
grlasso | 0.699(0.020) | 0.160(0.003) | 0.681(0.017) | 0.213(0.004) | 0.660(0.018) | 0.235(0.005) |
grSCAD | 0.695(0.022) | 0.162(0.004) | 0.677(0.021) | 0.215(0.005) | 0.655(0.022) | 0.235(0.006) |
cMCP | 0.697(0.024) | 0.161(0.004) | 0.671(0.020) | 0.215(0.004) | 0.644(0.021) | 0.237(0.005) |
Pathway-stacking methods |
solnp(Lasso) | 0.706(0.021) | 0.162(0.003) | 0.682(0.016) | 0.222(0.003) | 0.663(0.019) | 0.235(0.005) |
nLasso(Lasso) | 0.712(0.020) | 0.163(0.005) | 0.688(0.016) | 0.221(0.006) | 0.668(0.019) | 0.218(0.006) |
nsslasso(Lasso) | 0.712(0.020) | 0.163(0.005) | 0.688(0.016) | 0.221(0.006) | 0.669(0.019) | 0.218(0.007) |
ANN(Lasso) | 0.718(0.020) | 0.177(0.007) | 0.692(0.016) | 0.228(0.009) | 0.671(0.019) | 0.227(0.013) |
The survival stacking model (nsslasso(Lasso)) fitted to the METABRIC dataset identified seven pathways (Supplementary Table 6). nLasso(Lasso) found the same seven pathways, namely MAPK signaling pathway (w = 0.018), Focal adhesion (w = 0.041), Cellular senescence (w = 0.170), Choline metabolism in cancer (w = 0.125), Endocytosis (w = 0.014), Carbon metabolism (w = 0.311), and Apoptosis (w = 0.215), as well as two additional pathways: PPAR signaling pathway (w = 0.099) and p53 signaling pathway (w = 0.007).
TCGA ovarian cancer dataset
As with the BRCA data, we acquired the TCGA ovarian cancer (OV) dataset from the GDC Data Portal. After data preprocessing, we obtained a dataset with 415 samples and 13,764 genes. These genes were mapped to 124 pathways involving 3596 genes (see Supplementary Table 7).
After pre-screening, a total of 90 pathways had a C-index > 0.5, and the highest C-index was 0.58. We selected all 90 pathways for the following analysis. Table 5 shows that the pathway-stacking methods outperformed the single-model methods in prediction accuracy and variance (lower standard deviation, especially for BS). The four stacking methods had similar and stable prediction performance.
Table 5
The measurements (mean (SD)) of penalty and group penalty methods and pathway-stacking methods for the TCGA OV dataset (N = 415) from 100 random splits into a training set (N = 207) and a test set (N = 208)
Single-model methods | AUC (time point 1) | BS (time point 1) | AUC (time point 2) | BS (time point 2) | AUC (time point 3) | BS (time point 3) |
Lasso | 0.525(0.063) | 0.154(0.041) | 0.518(0.052) | 0.230(0.040) | 0.512(0.047) | 0.241(0.137) |
gsslasso | 0.558(0.057) | 0.112(0.012) | 0.548(0.039) | 0.223(0.014) | 0.535(0.047) | 0.241(0.030) |
grlasso | 0.547(0.054) | 0.149(0.063) | 0.551(0.050) | 0.228(0.016) | 0.548(0.052) | 0.240(0.009) |
grSCAD | 0.549(0.057) | 0.152(0.063) | 0.556(0.049) | 0.228(0.016) | 0.548(0.048) | 0.241(0.009) |
cMCP | 0.523(0.048) | 0.171(0.068) | 0.518(0.035) | 0.235(0.015) | 0.514(0.045) | 0.244(0.010) |
Pathway-stacking methods |
solnp(Lasso) | 0.562(0.058) | 0.117(0.011) | 0.559(0.042) | 0.231(0.007) | 0.547(0.041) | 0.227(0.007) |
nLasso(Lasso) | 0.558(0.061) | 0.117(0.011) | 0.559(0.039) | 0.236(0.010) | 0.549(0.038) | 0.232(0.008) |
nsslasso(Lasso) | 0.562(0.059) | 0.117(0.011) | 0.560(0.038) | 0.236(0.010) | 0.551(0.037) | 0.232(0.009) |
ANN(Lasso) | 0.564(0.054) | 0.117(0.011) | 0.570(0.039) | 0.239(0.005) | 0.551(0.036) | 0.227(0.006) |
In this application, nsslasso(Lasso) identified four pathways (Supplementary Table 8). nLasso(Lasso) found two additional pathways, Cell cycle (w = 0.038) and Proteasome (w = 0.079), alongside the four pathways selected by nsslasso(Lasso), to which it assigned different weights: Influenza A (w = 0.360), Peroxisome (w = 0.268), B cell receptor signaling pathway (w = 0.128), and T cell receptor signaling pathway (w = 0.129).
Discussion
The present study proposed a novel survival stacking strategy that incorporates genome pathway information into the development of cancer prognosis models. This strategy showed an advantage over existing methods that rely on a single group model (such as grlasso, grSCAD, and gsslasso) by using stacking to improve prediction robustness. Additionally, we extended the super learner to a hierarchical GLM and an ANN, thereby enriching the ways in which sub-models can be combined. In general, solnp uses the IBS as its objective function and thus obtains a lower time-BS. Hierarchical Lasso and sslasso inherit the sparsity property, which makes them effective at handling multiple sub-models. The sslasso super learner outperformed Lasso in certain cases, while in others the two methods performed similarly. The ANN method can capture more nonlinear relationships, leading to better prediction performance. However, it may also capture more noise and overfit the data.
In the simulation study, stacking methods consistently showed better discrimination than the single-model methods, except in Scenarios 1 and 4. Scenarios 1 and 4 represented situations with a higher theoretical generalized R² or a small residual variance, in which the predictive information was easy to capture. The advantage of the stacking methods was not evident there because the single-model methods already achieved fairly good prediction. However, stacking methods showed better discrimination than any single model in noisier situations because they could borrow strengths from various models. Real-world data are typically characterized by a higher level of noise, which may account for the favorable performance of the proposed methods in the real-world data applications [32]. However, this may come at the expense of some calibration accuracy.
A notable feature of stacking with nsslasso is the interpretability of the resulting models. First, the proposed stacking method shows increased sensitivity in identifying disease-related pathways, which may be too subtle for gene-level models to detect [33]. Second, we applied the methods that consider group structure (e.g., gsslasso) to the real-world data (see Supplementary Table 9). The results indicated that although gsslasso showed good predictive performance, it did not effectively indicate pathway importance. Third, unlike Lasso, which imposes an equal penalty on all coefficients, sslasso adaptively applies weak shrinkage to strong effects and strong shrinkage to weak effects [33]. We observed that sslasso tended to retain fewer pathways, whereas Lasso tended to include more pathways with small effects. For instance, nsslasso(Lasso) identified several important pathways in the METABRIC dataset, such as cellular senescence, choline metabolism in cancer, carbon metabolism, apoptosis, and PPAR signaling pathway. These pathways are deeply involved in the cell cycle and the carcinogenesis process [34, 35]. nLasso(Lasso) found two additional weak-signal pathways, namely the MAPK and p53 signaling pathways. These two well-known pathways are associated with the prognosis of breast cancer [36, 37]. However, many MAPK family genes and TP53 are also contained in the other four pathways, indicating that the two pathways provide limited additional information (Supplementary Table 6). Similarly, the Huntington’s disease pathway identified in TCGA BRCA contains TP53. Huntington’s disease seems to be unrelated to the prognosis of breast cancer. However, several epidemiological studies have shown a lower risk of cancer among patients with Huntington’s disease [38-40]. Additional research has examined their relationship at the molecular level, including the impact of Huntington and ErbB2/HER2 signaling on the development and metastasis of breast cancer [41, 42].
Overall, the proposed methods have advantageous features for identifying pathways that offer prognostic information. In addition, the weights assigned to the pathway-based sub-models indicate their predictive importance. We anticipate that focused research on these prioritized pathways will aid in discovering cancer targets. Another property of the pathway-based stacking strategy is that sub-models are constructed independently, which circumvents the gene-overlapping issue. Moreover, the stacking methods share better discrimination than the single-model methods, which may help identify high-risk patients. A limitation of our approach is that it takes more time owing to the CV procedure in sub-model construction, but this cost pays off in more robust and accurate prediction. Last but not least, although the proposed survival stacking strategy is based on a two-level gene-pathway structure, our ideas can be naturally generalized to other biological processes with similar hierarchical levels.