In this study, for the first time, we established a groundbreaking subtyping system for BC that integrates the gut microbiota, human gene expression patterns, and patient prognosis, enabling the prediction of molecular characteristics and treatment responses. A novel subtype characterized by an increase in genetic mutations and a highly complex immune environment was identified and termed “challenging BC”. Additionally, a score index related to patient prognosis was developed, facilitating the identification of “challenging BC” cases and the prediction of therapeutic responses in BC patients. Overall, we leveraged multiomics data analyses based on the gut microbiome using machine learning methods to provide a robust scientific foundation for predicting molecular characteristics and treatment responses in patients with “challenging BC”.
Methods
Sample collection
We analysed 16S rRNA sequencing data of the gut microbiome collected from four public datasets (PRJNA861885, PRJNA817689, PRJNA639644, and PRJNA658160). PRJNA861885 included 260 normal samples and 428 CRC samples. PRJNA817689 and PRJNA639644 included 140 normal samples and 124 GC samples, and PRJNA658160 included 308 normal samples and 350 BC samples.
The neoadjuvant therapy data were obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) public database (GSE163882).
A total of 6 patients with BC were enrolled from the Department of Breast Surgery, Fudan University Shanghai Cancer Center, Shanghai Medical College, Fudan University (Shanghai, P. R. China) in 2022. Fresh breast tumour tissues were collected for single-cell transcriptome analysis (2 samples) and single-cell spatial transcriptome analysis (4 samples). All diagnoses of BC were based on histopathology and were made in accordance with the World Health Organization criteria. Ethical approval for the study was obtained from the Fudan University Shanghai Cancer Center Ethics Committee. Our single-cell transcriptome data and single-cell spatial transcriptome data have been deposited in NCBI BioProjects GSE252175 and GSE252176.
Microbiome analysis
The raw sequencing data were in FASTQ format. Paired-end reads were then preprocessed using Trimmomatic software [55] to detect and trim ambiguous (N) bases. Low-quality sequences with an average quality score of less than 20 were also removed using the sliding window trimming approach. After trimming, paired-end reads were assembled using FLASH software [56]. The parameters used for assembly were as follows: 10 bp of minimal overlap, 200 bp of maximum overlap and a 20% maximum mismatch rate. Further denoising was performed on the sequences as follows: reads with ambiguous sequences, homologous sequences, or fewer than 200 bp were removed; reads in which 75% of the bases had a quality score of more than 20 (Q20) were retained; and chimaeric reads were then detected and removed. These steps were performed using QIIME software [57] (version 1.8.0).
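The read-level retention rule described above can be illustrated with a minimal Python sketch. It is not the QIIME implementation: the function names, the Phred+33 quality encoding, and the reading of "75% of the bases" as "at least 75%" are assumptions made for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_read_filter(sequence, quality_string, min_length=200,
                       min_fraction=0.75, min_quality=20):
    """Return True if a read survives the length, ambiguity and Q20 checks."""
    if "N" in sequence.upper():
        return False  # read contains an ambiguous base
    if len(sequence) < min_length:
        return False  # read is shorter than 200 bp
    scores = phred_scores(quality_string)
    if not scores:
        return False  # no quality information available
    # Count bases whose quality score is more than 20.
    high_quality = sum(1 for q in scores if q > min_quality)
    return high_quality / len(scores) >= min_fraction
```

Chimaera detection and removal of homologous sequences are separate steps and are not covered by this sketch.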
The clean reads were subjected to removal of primer sequences and clustering to generate operational taxonomic units (OTUs) using Vsearch software [58] with a cutoff of 97% similarity. The representative read of each OTU was selected using the QIIME package. All representative reads were annotated and BLASTed against the Silva database version 123 (or Greengenes) (16S/18S rDNA) using the Ribosomal Database Project (RDP) classifier [59] (confidence threshold, 70%). All representative reads were annotated and searched against the Unite database (ITS rDNA) using BLAST [60].
Clusters were then identified. Based on the identified differentially abundant genera between normal and cancer samples, we further employed a random forest model to assess the importance of the genera, selecting those with an importance greater than the average value to obtain the final genus information [61]. Using a self-organizing map (SOM) neural network [62] implemented in the kohonen library, we determined the optimal number of clusters and performed clustering of the cancer samples. For each identified cluster type, we used a decision tree classifier to establish classification rules.
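The selection rule for genera (importance greater than the average) can be sketched as follows. The importance values and genus labels below are invented placeholders, not results from this study, and the function name is hypothetical. In practice the importances would come from the fitted random forest model.

```python
def select_above_mean(importances):
    """Keep the features whose importance exceeds the mean importance."""
    mean_importance = sum(importances.values()) / len(importances)
    return sorted(name for name, value in importances.items()
                  if value > mean_importance)


# Placeholder importances (illustrative only; the mean is 0.25).
example_importances = {
    "genus_A": 0.30,
    "genus_B": 0.05,
    "genus_C": 0.45,
    "genus_D": 0.20,
}
selected_genera = select_above_mean(example_importances)
```

With these placeholder values, only `genus_A` and `genus_C` exceed the mean and are retained.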
To identify human-related genes via KEGG, we used PICRUSt [63] to predict the KEGG pathways represented in the gut microbiome data across the three cancer datasets. We integrated these predictions with the microbial clustering data and employed one-way ANOVA to identify specific pathways with significant differential enrichment across the subgroups. Subsequently, we determined the functional modules shared among CRC, GC, and BC. Then, the KEGG pathways that were also present in humans were retained. Finally, we screened the genes involved in these pathways and integrated cancer patient survival data from TCGA to identify gene sets significantly associated with survival.
Identification of TCGA-BRCA cancer sample subtypes and construction of the score model
Subtyping method: Cluster analysis was performed using the R ConsensusClusterPlus [64] package with a distance-based k-means algorithm, with the number of resampling iterations (reps) set to 1000.
Scoring method [31]: For each sample, the score was calculated as ∑ (beta × Exp), where the sum runs over the genes in the model, beta is the independent prognostic coefficient obtained through univariate Cox regression analysis of the gene, and Exp is the expression level of the gene.
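The score defined above is a weighted sum, which the following minimal Python sketch makes explicit. The gene names, coefficients, and expression values are placeholders, and the function name `prognostic_score` is hypothetical. Real coefficients would come from the Cox regression analysis.

```python
def prognostic_score(coefficients, expression):
    """Compute sum(beta * Exp) over the genes listed in `coefficients`."""
    missing = [gene for gene in coefficients if gene not in expression]
    if missing:
        raise KeyError(f"expression missing for genes: {missing}")
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


# Placeholder inputs (illustrative only).
example_betas = {"gene_a": 0.5, "gene_b": -0.25}
example_expression = {"gene_a": 4.0, "gene_b": 4.0, "gene_c": 9.0}
example_score = prognostic_score(example_betas, example_expression)
```

Genes without a coefficient (here `gene_c`) do not contribute to the score, and a missing expression value raises an error rather than being treated silently as zero.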
Pathway enrichment analysis [65]: Gene set variation analysis (GSVA) is an algorithm building on gene set enrichment analysis (GSEA) that is available at http://www.gsea-msigdb.org/. Analysis of hallmark gene sets and pathways was conducted using the GSVA package in R. The limma package in R was used to identify significantly differentially expressed genes (DEGs) in pairwise comparisons. The R packages GSEABase, clusterProfiler, and org.Hs.eg.db were used for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of the differentially expressed genes. The Benjamini–Hochberg procedure was used to control the false discovery rate (FDR; p.adj) for multiple comparisons, and FDR < 0.05 was applied as the threshold for selection.
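For readers unfamiliar with the Benjamini–Hochberg adjustment, a minimal Python sketch is given below. It illustrates the standard procedure and is not the code used in the analysis, which relied on R packages. The function name and the example P values are placeholders.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that the adjusted
    # values never increase as the raw P values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Placeholder P values (illustrative only).
example_p = [0.01, 0.04, 0.03, 0.20]
example_fdr = benjamini_hochberg(example_p)
significant = [i for i, q in enumerate(example_fdr) if q < 0.05]
```

In this example, only the first test passes the FDR < 0.05 threshold, even though three raw P values are below 0.05.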
Single-cell transcriptome analysis
Sequencing data quality control and gene quantification: Raw data generated via high-throughput sequencing, in fastq format, were processed using the official 10x Genomics software Cell Ranger (version 7.0.1). This software allows the acquisition of data quality statistics and alignment to the reference genome (human: GRCh38, mouse: mm10). By identifying cell-specific barcode markers and unique molecular identifiers (UMIs) for each mRNA molecule within a cell, Cell Ranger quantifies high-throughput single-cell transcriptome data, calculating quality control statistics such as the number of high-quality cells, the median number of genes, and sequencing saturation.
Gene quantification quality control and data preprocessing: After preliminary quality control processing with Cell Ranger, additional quality control processing was performed using Seurat (version 4.0.0). Based on the distribution of indicators such as nUMI, nGene, and percent.mito, filtering criteria were applied to retain high-quality cells. Cells were retained if they had a gene count of greater than 200, a UMI count of greater than 1000, a log10GenesPerUMI value of greater than 0.7, a mitochondrial UMI percentage of less than 5%, and a red blood cell gene percentage of less than 5%. Additionally, DoubletFinder software (version 2.0.3) was utilized to remove doublet cells. After quality control, the NormalizeData function in Seurat was applied for data normalization.
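The cell-level thresholds listed above can be expressed as a single predicate, sketched here in Python. Two points are assumptions rather than statements from the text: log10GenesPerUMI is taken to be log10(nGene) / log10(nUMI), and the red blood cell criterion is taken to be the fraction of UMIs assigned to haemoglobin genes. The function and argument names are hypothetical.

```python
import math


def passes_cell_qc(n_genes, n_umis, mito_umis, rbc_umis):
    """Return True if a cell satisfies all of the stated QC thresholds."""
    if n_genes <= 200 or n_umis <= 1000:
        return False  # too few detected genes or UMIs
    # Assumed definition of the complexity metric log10GenesPerUMI.
    log10_genes_per_umi = math.log10(n_genes) / math.log10(n_umis)
    mito_fraction = mito_umis / n_umis
    rbc_fraction = rbc_umis / n_umis  # assumed: haemoglobin-gene UMIs
    return (log10_genes_per_umi > 0.7
            and mito_fraction < 0.05
            and rbc_fraction < 0.05)
```

Doublet removal is a separate step that requires the full expression matrix and is not represented in this predicate.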
Dimensionality reduction and clustering analysis: The FindVariableGenes function (mean.function = FastExpMean, dispersion.function = FastLogVMR) in Seurat was used to select the top 2000 highly variable genes (HVGs). Principal component analysis (PCA) was performed using the expression profiles of the highly variable genes, and the results were visualized in two-dimensional space using uniform manifold approximation and projection (UMAP; a nonlinear dimensionality reduction technique).
Identification of marker genes: The FindAllMarkers function in Seurat (test.use = presto) was used for marker gene identification. This process allowed the identification of genes that were upregulated in each cell type compared to the other cell types, thus serving as potential marker genes. Visualization of the identified marker genes was performed with the VlnPlot and FeaturePlot functions.
Cell type identification: Via the SingleR package (version 1.4.1), the expression profiles of the cells to be identified were correlated with a common reference dataset. The cell type with the highest correlation in the reference dataset was assigned to the cells being identified, reducing subjective interference. The identification principle involved calculating the Spearman correlation coefficient between the expression profile of each cell in the sample and each annotated cell expression profile in the reference dataset, with the cell type with the highest correlation selected as the final identified type.
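The identification principle (assigning the reference label with the highest Spearman correlation) can be sketched in plain Python as follows. This is a simplified illustration of the principle only; the SingleR package performs additional steps that are not reproduced here. The profiles, labels, and function names are placeholders.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no rank information
    return cov / math.sqrt(var_x * var_y)


def assign_cell_type(cell_profile, reference_profiles):
    """Return the reference label with the highest Spearman correlation."""
    return max(reference_profiles,
               key=lambda label: spearman(cell_profile, reference_profiles[label]))


# Placeholder profiles over the same four genes (illustrative only).
example_cell = [5.0, 1.0, 0.0, 3.0]
example_reference = {"type_a": [10, 2, 1, 6], "type_b": [0, 8, 9, 1]}
example_label = assign_cell_type(example_cell, example_reference)
```

Because the correlation is rank-based, the assignment depends on the ordering of gene expression within a profile rather than on its absolute scale.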
Differential gene expression and enrichment analyses: The FindMarkers function in Seurat (test.use = presto) was used to select differentially expressed genes. Genes with a P value less than 0.05 and a fold change greater than 1.5 were considered significantly differentially expressed. GO term and KEGG pathway enrichment analyses of the significantly differentially expressed genes were conducted using the hypergeometric distribution test.
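The hypergeometric test used for enrichment asks how likely it is to observe at least the given overlap between the DEG list and a gene set by chance. A minimal Python sketch of this upper-tail probability follows. The function and argument names are hypothetical, and the numbers in the example are placeholders.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` belong to the term."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total


# Placeholder example: 10 background genes, 4 annotated to the term,
# 3 DEGs selected, 2 of which are annotated.
example_p_value = hypergeom_enrichment_p(10, 4, 3, 2)
```

In this example the probability is (36 + 4) / 120 = 1/3, so an overlap of 2 would not be considered enriched.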
Multiplex immunofluorescence staining
We conducted multiplex immunofluorescence (mIF) staining using antibodies specific for CD4 (rabbit monoclonal, clone EPR19514, Abcam, Cat# ab183685), CD8 (rabbit monoclonal, clone EPR21769, Abcam, Cat# ab217344), CCL5 (RANTES) (rabbit polyclonal, clone 25HCLC, Thermo Fisher, Cat# 710001), CD103 (integrin alpha E) (mouse monoclonal, clone 2E7, Thermo Fisher, Cat# 14-1031-82), and FoxP3 (rabbit monoclonal, clone D6O8R, Cell Signaling Technology, Cat# 12653). Tissue sections were deparaffinized with xylene, rehydrated with ethanol, and subjected to antigen retrieval by boiling in Tris-EDTA buffer (pH 9.0) for 15 min. Endogenous peroxidase activity was blocked by incubation with 3% hydrogen peroxide at room temperature for 15 min. Nonspecific antigens were blocked by incubation with a goat serum solution for 30 min. The sections were then incubated with primary antibodies overnight at 4 °C and with horseradish peroxidase (HRP)-conjugated secondary antibodies at room temperature for 30 min. Subsequently, the sections were incubated with Opal tyramide signal amplification (TSA) fluorochromes (Opal Colour Manual IHC Kit, Perkin Elmer, NEL811001KT) at 37 °C for 20 min. Between each run, the antibody (Ab)-TSA complexes in the sections were removed by microwave heating, and the sections were blocked with a goat serum solution. In the final run, 4’,6-diamidino-2-phenylindole, dihydrochloride (DAPI) was added for visualization of nuclei, and the sections were mounted with glycerin.
Single-cell spatial transcriptome analysis
Sequencing data quality control and gene quantification: Raw data generated via high-throughput sequencing, in fastq format, were processed using the official 10x Genomics software Space Ranger (version 2.0.1) for the Visium spatial transcriptome sequencing data and bright-field microscopy slice images. The software detected the capture regions of tissues on the chip, aligned them to the reference genome (human: GRCh38, mouse: mm10), and, based on spatial barcode information, differentiated the reads for each spot. Statistical evaluations included the total spot count, reads per spot, detected gene count, and UMI count, providing an assessment of sample quality.
Gene quantification quality control and data preprocessing: After preliminary quality control processing with Space Ranger, further quality control and processing were performed using Seurat (version 4.3.0) [66]. The sctransform function was used to normalize the data, detect high-variance features, and store the data in the SCT matrix.
Dimensionality reduction and clustering analysis: The FindVariableGenes function in Seurat was used to select the top 3000 highly variable genes. PCA was conducted using the expression profiles of the highly variable genes, and the results were visualized in two-dimensional space using UMAP (nonlinear dimensionality reduction).
Identification of spatial feature genes: The FindAllMarkers function in Seurat (test.use = bimod) was employed for the identification of marker genes, revealing genes upregulated in each spot group compared to the other spot groups. These genes represented potential marker genes for each spot group, and visualization of the identified marker genes was performed with the VlnPlot and FeaturePlot functions.
Spatial cell type annotation: Robust cell type decomposition (RCTD) [67] (version 1.1.0) is a robust cell type deconvolution method that leverages cell type profiles obtained via single-cell RNA-seq to decompose mixtures of cell types while correcting for differences across sequencing techniques. For RCTD, the create.RCTD function was used with default parameters, ensuring at least 1 cell per cell type and at least 1 UMI per spot. The run.RCTD function was used with doublet_mode set to FALSE, allowing the cell type composition of each spot to be inferred.
Differential gene expression and enrichment analyses: The FindMarkers function of Seurat was used for selection of differentially expressed genes, and genes with a P value less than 0.05 and a fold change greater than 1.5 were identified by filtering. GO term and KEGG pathway enrichment analyses of the significantly differentially expressed genes were conducted using the hypergeometric distribution test.
PDX mouse models and drug treatment
Tumour tissues isolated from patient YZL_1 were dissected into 1-mm³ pieces. After NOG mice were anaesthetized, the BC tissues were subcutaneously implanted into the right superior flank. When the tumour diameter reached 1 cm (approximately 60 days after transplantation), we removed the subcutaneous PDX tumours, dissected them into 3 pieces of approximately 2 × 2 × 2 mm each, and then retransplanted the pieces into the flanks of the nude mice to allow tumour growth for approximately one month. The mice were euthanized after no more than 5 weeks or when the tumour diameter reached 10 mm. Beginning on the seventh day after transplantation, each mouse in the drug treatment groups received 20 mg/kg sonidegib or 10 mg/kg rapamycin every two days via tail vein injection. Beginning on the seventh day after transplantation, each mouse in the control group received placebo every two days via tail vein injection.
Subtype identification based on TCGA classification
Following the classification process, we employed the random forest algorithm, as implemented in the R package randomForest, to develop a predictive model based on the gene expression profiles and classification data. This model allows expression profile data from new datasets to be used as input to predict the corresponding subtype.
For the single-cell sequencing data, we first screened the expression of the 700 genes associated with gut microbiota-related metabolic pathways and patient survival within each cell and determined their average expression levels.
Statistical analysis
Student’s t test and the Mann‒Whitney test were applied to compare continuous variables and categorical variables, respectively, where appropriate. The associations between clinical information and metabolic pathway-based subtypes were examined using the chi-square test and Fisher’s exact test. Survival curves were constructed using the Kaplan‒Meier method and compared with the log-rank test. Univariate and multivariate Cox proportional hazards regression models with or without adjustment for available prognostic clinical covariates were used to calculate hazard ratios (HRs) and 95% confidence intervals. Correlations were analysed with Spearman correlation analysis. All the statistical analyses were performed with R software or GraphPad Prism software.
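As an illustration of how the Kaplan‒Meier survival estimate is built from follow-up data, a minimal Python sketch is given below. It is not the code used in the analysis, which was performed in R or GraphPad Prism. The function name and the follow-up times are placeholders, with an event flag of 1 for an observed event and 0 for censoring.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set.
        at_risk -= removed
    return curve


# Placeholder follow-up data (illustrative only).
example_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1])
```

Censored subjects lower the number at risk without producing a step in the curve, which is why the estimate drops only at times 1, 2, and 4 in this example.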