Background
For a long time, breast cancer has been a leading cancer diagnosed in women worldwide, with 276,480-odd new diagnoses per year, accounting for 30% of female cancers. Breast cancer represents the second highest death rate, behind lung and bronchus cancer, responsible for more than 42,170 deaths per year (15% of all cancer-related deaths in women) [
1]. According to reports, after diagnosis and treatment of the primary tumor, approximately 30% of breast cancer patients may experience metastasis, causing approximately 90% of cancer-related deaths. Compared with early-stage breast cancer, metastatic breast cancer has a significantly reduced cure rate and can even be incurable. Although high-throughput sequencing technology has significantly advanced treatments for cancer, it has little effect on the treatment of metastatic breast cancer [
2,
3]. Early assessment of metastatic status and recurrence risk is essential to improve breast cancer prognosis. To date, the effective clinical treatment targets for metastatic breast cancer are ER, PR and HER2 [
4]. Relatively few studies utilize appropriate methods to predict breast cancer metastatic status. Therefore, finding new biomarkers related to metastasis is an urgent task to predict the metastatic status of breast cancer and provide new therapeutic targets.
With the explosive growth of high-throughput sequencing technology, big data has become a hot topic of research in the field of oncology. Continuous accumulation of multiomics sequencing data has supported cancer research from a bioinformatic perspective. However, the characteristics of tumor gene expression data, such as high dimensionality, small sample sizes, and category imbalance, usually bring about great computational challenges [
5]. Machine learning, a kind of computer algorithm that improve its performance automatically with experience, has unique advantages in solving problems such as clustering, classification and regression. Many machine learning approaches are applied to deal with biological multiomics data of high-dimensional samples [
6]. Compared with traditional biometric methods, the maximum likelihood method is more flexible and has been widely used in oncology [
7]. Using machine learning algorithms, many studies have achieved improved accuracy by using various tumors to predict the diagnosis and survival outcome of breast [
8‐
10], ovarian [
11], and lung cancers [
12], among others. In terms of the classification of breast cancer, the support vector machine (SVM) algorithm was used to classify breast cancer patients into triple-negative and non-triple-negative groups using tumor gene expression data [
10]. Based on serum biomarkers and clinicopathological data instead of sequencing data, the random forest (RF)-based model was used to predict the metastatic status of breast cancer; however, the area under the receiver operating characteristic (ROC) curve was only 0.75 [
8], indicating a low accuracy rate in clinical practice.
Due to their tendency of learning in large classes while ignoring small classes, traditional machine learning algorithms aim at high accuracy (ACC) without considering the misclassification cost, leading to great bias in classifiers [
13]. For example, during the cancer diagnosis, 98% of patients are typically tumor-free, and only 2% have cancer; if the model simply predicts that everyone is tumor-free, then the overall prediction accuracy is as high as 98%. Neglection of any patient with cancer can lead to fatal outcomes clinically. In addition, the cost of diagnosing metastatic patients as nonmetastatic is much higher than that of the opposite diagnosis. There is a critically unmet medical need to distinguish metastatic from nonmetastatic breast cancers. The eXtreme Gradient Boosting (XGBoost), as a variant of the Gradient Boosting Machine (GBM), is an open-source machine learning classifier developed by Chen et al. [
14]. XGBoost has been widely applied for classification problems. There have been reports on the capabilities of XGBoost in handling label-imbalanced data by adjusting the weights of positive and negative samples [
15]. XGBoost is often more accurate in cancer research than other machine learning algorithms, such as the RF, SVM, logistic regression (LR), and K-nearest neighbors (KNN) algorithms. For instance, XGBoost is the most precise model for predicting the 1-year survival rate of patients with non-small-cell lung cancer (NSCLC) bone metastases [
16]. XGBoost can deduce the tissues of origin for 10 different cancer types with better performance than other traditional machine learning algorithms [
17].
In this research, an efficient model of eXtreme Gradient Boosting (XGBoost) optimized by a grid search algorithm is established to realize auxiliary identification of metastatic breast tumors based on gene expression. Estimated by ten-fold cross-validation, the optimized XGBoost classifier achieved an overall higher mean AUC of 0.82 compared to other classifiers, such as DT, SVM, KNN, LR, and RF. A novel 6-gene signature (SQSTM1, GDF9, LINC01125, PTGS2, GVINP1, and TMEM64) was selected by feature importance ranking, and a series of in vitro experiments were conducted to verify the potential role of each gene. We explored the potential role of each gene of the proposed gene signature during breast cancer metastasis from the viewpoint of tumor cells and immune cells. Our results will allow for a more accurate prediction of the metastatic status of breast cancer and will benefit the mining of breast cancer metastasis-related biomarkers.
Methods
Data preparation
Tumor expression data for modeling in this research were based upon data generated by the Cancer Genome Atlas (TCGA) database. All paired clinical data and transcript profiles of breast cancer (BRCA) samples were obtained and trimmed from the TCGA Data Portal by R package “GDCRNATools” [
18]. The original data consisted of 1,097 samples in total. According to the pathologic_M column in the clinical information table, data rows with M0 and M1 status remained unchanged, and data with MX status (ambiguous metastatic status) were removed. A total of 923 BRCA samples, including 901 non-metastatic samples and 22 metastatic samples, were finally retained. Then, we grouped the BRCA samples into 2 groups depending on the status of pathologic metastasis. The metastatic and non-metastatic groups were labeled M1 and M0, respectively. M0 and M1 were used as labels for binary samples before classification by the following machine learning algorithms. Single-cell sequencing data were achieved from the Gene Expression Omnibus (GEO) database, GEO accession number was GSE162726 [
19]. The R package “Seurat” [
20] was used for the quality control and integration of the single-cell RNA-seq data.
After data were downloaded and integrated, we grouped the BRCA samples into 2 groups according to the status of pathologic metastasis. The metastatic and non-metastatic groups were labeled M1 and M0, respectively. Samples with unclear metastatic status were removed. The R package “DESeq2” [
21] was adopted to generate DEGs between the two groups based on a negative binomial distribution. The significance criteria for DEGs was
P value < 0.05, and the |log2 Fold-change| ≥ 1. The R package “EnhancedVolcano” was used to conduct the volcano plot and visualize the results of differential expression analyses. Then, clustering and visualization of the non-redundant biological terms of genes in a functionally grouped network was conducted with the Cytoscape (V3.8.0) desktop application and the “ClueGO” plug-in.
Machine learning model selection
Different machine learning models were adopted to decide which model was the best one suitable for the present study. The XGBoost, DT, SVM, KNN, LR, and RF classifiers were used to establish the classification model. Tenfold cross-validation was performed for each model, and the ROC curve was plotted to calculate the mean area under the curve (AUC). The model with the highest mean AUC value was selected for modeling. We used Jupyter Notebook (version 6.1.4), a web-based application for interactive computing in Anaconda Navigator (anaconda3), to implement different machine learning algorithms. Scikit-learn module in Python (version 3.9) programming was adopted.
XGBoost classifier
XGBoost classifier is a gradient boosting method that combines the regression tree [
14]. The goal function of the XGBoost algorithm model is
\(obj(\theta ) = L(\theta ) + \Omega (\theta )\), where
\(L\left( \theta \right)\) is the training loss function, and
\(\Omega \left( \theta \right)\) is the complexity function of the tree.
\(L(\theta ) = \sum\nolimits_{i = 1}^{{\text{n}}} {l(y_{i} ,\hat{y}_{i} )}\),
\(l(y_{i} ,\hat{y}_{i} )\) corresponds to the training loss function for each sample, where
\(y_{i}\) represents the true value of the
ith sample, and
\(\hat{y}_{i}\) represents the estimated value of the
ith sample.
\(\hat{y}_{i} = \sum\nolimits_{k = 1}^{K} {f_{k} (x_{i} )} ,f_{k} \in F\), where
K represents the number of trees,
F represents all possible DT, and
f denotes a specific CART tree.
\(\Omega (f) = \gamma T + \frac{1}{2}\lambda \sum\nolimits_{i = 1}^{T} {w_{i}^{2} }\), where
wi is the score on the
ith leaf node, and
T is the number of leaf nodes in the tree. By adjusting parameters, the objective function was continuously optimized, and optimal results were obtained. The grid search algorithm was used to optimize the hyper-parameters, including max_depth, min_child_weight, gamma, subsample, colsample_bytree and learning_rate in each iteration.
Cell transfection to obtain knockdown cell lines
The lentivirus construction to knockdown SQSTM1, GDF9, LINC01125, PTGS2, GVINP1, and TMEM64 was purchased from Genepharma (Shanghai, China). Breast cancer cell line MCF-7 was plated in six-well dishes at 50% confluence and then infected with the above 6 lentiviruses (termed as shSQSTM1, shGDF9, shLINC01125, shPTGS2, shGVINP1, and shTMEM64), or control (termed as shCtrl) in MCF-7 cell, respectively. Stable cell lines were generated by selection using puromycin at a concentration of 4 µg/mL for 2 weeks. The cell transfection protocol described above was in accordance with the manufacturer’s instructions.
In MTT (3-[4,5-dimethylthiazol-2-yl]-2,5 diphenyl tetrazolium bromide) assay, the shSQSTM1, shGDF9, shLINC01125, shPTGS2, shGVINP1, shTMEM64 and shCtrl MCF-7 cells were seeded into 96-well plates (Cat. # 3599, Cornning) at a density of 2 × 10
3 cells per well over night. Before adding 150 μL of DMSO, 20 μL of MTT at a concentration of 5 mg/mL was added to each well and incubated for 4 h. Then, a microplate reader was used to measure the optical density at 490 nm. Colony formation assay and transwell assay were performed according to our previous research [
22]. Cell migration was observed using a wound healing assay. Transfected MCF-7 cells were maintained in 6-well plates and upon reaching 90% confluence, scratches were created using micropipette tips. Cells were washed 3 times using sterile PBS to wash off non-adherent cells generated by the scratch, and fresh serum-free medium was replaced to continue culturing the cells. The wound status was observed at 0 h and 24 h after scratching with an X71 inverted microscope (Olympus). The means of intercellular distances were calculated using the ImageJ software. All experiments were performed in triplicates.
Correlations between the proposed gene signature and immune cells
We analyzed the correlation between the expression of the selective gene signature and several immune cell markers to determine the association of infiltrating immune cells with our proposed gene signature. Immune gene markers were selected from the website of R&D Systems or from the GEPIA 2.0 [
23] recommendations, including markers of B cell, naïve T cell, effector T cell, resident memory T cell, T helper 1 (Th1) cell, regulatory T cell (Treg), T cell exhaustion, macrophage, tumor-associated macrophage (TAM), monocyte, natural killer (NK) cell, neutrophil, and dendritic cell (DC). Gene expression correlation analysis was performed of BRCA tumor datasets of TCGA expression data by GEPIA 2.0. Correlation coefficients were determined by the Spearman method.
Discussion
Breast cancer is one of the most common malignancies among women worldwide and is the leading cause of most cancer-related deaths. The high mortality rate of breast cancer has been linked to multiple factors, with metastasis identified as the main cause [
1,
29]. Several multigene assays have been employed in studies of breast cancer. For example, a 70-gene signature has been identified for better prediction of clinical outcome and contributes to the treatment decisions for women with early-stage breast cancer in selecting patients for adjuvant chemotherapy according to standard clinicopathological criteria [
30]. An 18-gene signature for predicting relapse in the indicated breast cancer subtype of ER-positive, HER2-negative breast cancer has been derived using penalized Cox regression [
31]. A 5-gene metabolic signature has been demonstrated to predict worse overall and disease-free survival in patients with breast cancer through proteomic profiling [
32]. Compared to predicting prognosis and subtyping for breast cancer, relatively few studies utilize appropriate methods to predict breast cancer metastatic status. Therefore, finding new biomarkers related to metastasis is an immediate task to predict the metastatic status of breast cancer and provide new therapeutic targets. Some studies have conducted preliminary explorations. For instance, a 39-gene signature was reported to screen out breast cancer patients with early metastasis using survival prediction analysis (AUC = 0.734) [
33]. Based on the integrated gene expression profiles and clinical information, a 51-gene signature and a centroid classifier were constructed to predict bone metastasis in breast cancer (AUC = 0.66) [
34]. However, the lack of a high AUC and the large number of gene signatures identified in previous studies mean that the efficiency in clinical application remains to be proven.
In this paper, we utilized machine learning algorithms for data mining, followed by biological experiments for experimental validation of the selective gene signature. We constructed a novel 6-gene signature (SQSTM1, GDF9, LINC01125, PTGS2, GVINP1, and TMEM64) and used an XGBoost model to predict the metastatic status in breast cancer (AUC = 0.82). Meanwhile, we explored the potential role of each gene of the proposed gene signature during breast cancer metastasis from the viewpoints of tumor and immune cells. Based on the results above, we could infer that SQSTM1 functioned from the perspective of tumor cells since it was significantly upregulated in metastatic breast cancer, and its knockdown attenuated the ability of tumor cells to invade metastases. A previous study reported the ability of SQSTM1 to extend the mRNA half-life of pro-metastatic factors in melanoma cells [
35] and mediate the epithelial-to-mesenchymal transition in nasopharyngeal carcinoma cells [
36]. Therefore, it may be possible to target the SQSTM1 gene for high expression in the metastatic breast cancer group, providing a basis for drug development against tumor cells as targets.
The metastatic cascade relies on reciprocal interactions between cancer cells and their microenvironment. Immune cells in the tumor microenvironment (TME) are known to facilitate metastasis formation [
37,
38]. In the present study, based on our in vitro experiments and the fact that the expression of SQSTM1 was upregulate ed in metastatic breast cancer tissues, while the expression of the remaining genes was downregulated, we speculated that 5 genes, LINC01125, GDF9, PTGS2, GVINP1, and TMEM64, contributed to breast cancer metastasis from the perspective of immune cells. Our results showed that these 5 genes might play a role in the metastasis of breast cancer through naïve T cells, effector T cells, resident memory T cells and DC populations. Decreased expression of these genes in metastatic breast cancer tissues weakened the function of some T cell subsets and antigen-presenting cells (APCs) such as DCs, which could in turn decrease the immune function and therefore promote breast cancer cell metastasis. Previous studies also showed that tumor cells could directly present antigens to CD8 T cells via MHC class I molecules, and initiating immune responses required DCs to exert antigen presentation [
39]. Lymph nodes of metastatic breast cancer have a significantly increased proportion of CD8 T cells and a skewing toward an effector or memory phenotype of CD4 and CD8 T cells, indicating an ongoing immune response [
40].
Through comparison each gene of our gene signature with existing studies separately, we found that there are many studies on the association between gene PTGS2, gene SQSTM1 and breast cancer. Researches on gene LINC01125 [
25], gene GDF9 [
41,
42] and gene GVINP1 [
43] is very limited, no studies have shown that there is a link between gene TMEM64 and breast cancer. Most of the genes we have proposed in this research have been biologically verified, which also proves the reliability of our gene signature. Meanwhile, we discovered new gene that not reported association with breast cancer, which also provides clues for our follow-up research.
Conclusions
In conclusion, our present research constructed a novel 6-gene signature (SQSTM1, GDF9, LINC01125, PTGS2, GVINP1, and TMEM64) by feature importance score and used an XGBoost model to predict the metastatic status in breast cancer (AUC = 0.82, higher than the previous studies to our knowledge). In summary, we assigned the effects of SQSTM in tumor cells as a risk factor and the effects of other 5 genes (GDF9, LINC01125, PTGS2, GVINP1, and TMEM64) in immune cells as protective factors. Therefore, mining gene expression data using appropriate machine learning algorithm can predict the metastatic status of breast cancer more accurately and can assist physician decision-making to some extent. Biomarkers used to predict metastasis of breast cancer can be used as complements to serological indicators and imaging examination in clinical, it also provides new targets and ideas for the treatment of metastatic breast cancer.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.