Tissue sampling and processing
Following the prostatectomy of 13 patients, a four-millimetre tissue core was collected from the prostate tumour site, conditional on histopathological verification [17, 18]. The patient cohort ranged from 52 to 78 years of age and from a CAPRA-S risk score of 0 (attributed to benign tissue samples, harvested from a site far from a low-grade, low-volume cancer) to 7 (Supplementary file 4). Unless otherwise specified, all procedures were carried out at 4 °C. Tissue blocks were washed in phosphate-buffered saline (PBS) solution for 2 min and minced for 2 min with a scalpel. Homogenised tissue was added to a solution (total volume of 7 ml) composed of 1 mg/ml collagenase IV (Worthington Biochemical Corp, USA), 0.02 mg/ml DNase I (New England Biolabs, USA) and 0.2 mg/ml dispase (Merck, USA). The homogenised tissue was serially digested in a shaker incubator at 37 °C and 180 rpm (4 g), through three steps of 5, 10 and 10 min. The final 3 min were dedicated to sedimentation at 0 rpm. After each digestion step, the supernatant was aspirated and filtered through a 70 μm strainer into a pre-chilled tube, and the solution was diluted with 15 ml of Dulbecco’s PBS containing 2% bovine serum (dPBS-serum) to quench the enzymatic reaction. The resulting cumulative solution was then centrifuged at 300 g for 5 min, the supernatant was collected, and the cell pellet was resuspended in 1 ml of dPBS-serum before labelling (Fig. S1).
Antibody labelling, flow cytometry and cell storage
The cell preparation was labelled with the following antibodies: CD3-BV711 (Becton Dickinson, San Jose, CA), EpCAM-PE (BD Biosciences, USA), CD31-APC (BD Biosciences, USA), CD90-PerCP-Cy5.5 (Becton Dickinson, San Jose, CA), CD45-APC-Cy7, and CD16-Pacific Blue (BD Biosciences, USA). All antibodies were used at the concentrations recommended by the manufacturers and incubated for 30 min at 4 °C. Following labelling, the cells were diluted to 5 ml and centrifuged at 300 g for 5 min. The supernatant was removed, and the cell pellet was resuspended in dPBS-serum. The viability dye (7AAD) was added to the suspension to a final concentration of 5 μg/ml. Epithelial cells, fibroblasts, myeloid cells and T cells were sorted using a fluorescence-activated cell sorting (FACS) Aria III cell sorter (Becton Dickinson, San Jose, CA). The cell sorting strategy utilised a three-stage design: (i) a series of gates based on forward scatter (FSC) and side scatter (SSC) to exclude debris, cell clumps and doublets; (ii) a gate to exclude all dead cells; and (iii) a combination of the fluorescent antibodies to allow purification of the above cell types. The four cell types were identified as follows. T cells: FSC and SSC low, PI negative, EpCAM and CD31 negative, CD3 and CD45 positive. Epithelial cells: FSC and SSC high, PI negative, CD31 and CD90 negative, EpCAM positive. Myeloid cells: FSC and SSC high and medium, PI negative, CD31 and EpCAM negative, CD16 positive. Fibroblasts: FSC and SSC high, PI negative, EpCAM and CD31 negative, CD90 positive. The four purified populations were sorted directly into 1.5 ml conical tubes and stored on dry ice immediately after collection, before permanent storage at −80 °C.
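The marker combinations of stage (iii) amount to a small decision rule, which can be sketched as follows. This is an illustration only: it assumes that the scatter and viability gates have already been applied and that each event has been reduced to positive or negative calls per marker. The function name and the dictionary layout are hypothetical and are not part of the sorter software.

```python
def classify_event(pos):
    """Return the sorted population matching the marker calls, or None.

    `pos` maps a marker name to True (positive) or False (negative).
    Scatter and viability gating are assumed to have happened upstream.
    """
    matches = []
    # T cells: EpCAM and CD31 negative, CD3 and CD45 positive.
    if not pos["EpCAM"] and not pos["CD31"] and pos["CD3"] and pos["CD45"]:
        matches.append("T cell")
    # Epithelial cells: CD31 and CD90 negative, EpCAM positive.
    if not pos["CD31"] and not pos["CD90"] and pos["EpCAM"]:
        matches.append("epithelial")
    # Myeloid cells: CD31 and EpCAM negative, CD16 positive.
    if not pos["CD31"] and not pos["EpCAM"] and pos["CD16"]:
        matches.append("myeloid")
    # Fibroblasts: EpCAM and CD31 negative, CD90 positive.
    if not pos["EpCAM"] and not pos["CD31"] and pos["CD90"]:
        matches.append("fibroblast")
    # Assumption: events matching zero or several definitions are not sorted.
    return matches[0] if len(matches) == 1 else None


event = {"CD3": False, "CD45": False, "CD16": False,
         "CD31": False, "CD90": False, "EpCAM": True}
label = classify_event(event)
```

Rejecting events that satisfy more than one definition is a choice made for this sketch; the text does not state how such events were handled.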
RNA extraction was performed in two batches (comprising 6 and 7 patients, for a total of 24 and 28 samples, respectively) on consecutive days. To minimise time-dependent methodological biases, the two patient batches were balanced for Gleason score (means 2.00 and 2.71, standard deviations 2.50 and 1.86; Supplementary file 4) and for days elapsed from tissue processing (means 197 and 222, standard deviations 46.3 and 71.9; Supplementary file 4). The RNA extraction was performed using the miRNeasy Micro Kit (Qiagen; Cat #217084), according to the manufacturer’s protocol. Briefly, cell pellets were lysed with QIAzol lysis reagent, treated with chloroform, and centrifuged to separate the aqueous phase. Total RNA was precipitated from the aqueous phase using absolute ethanol, filtered through the MinElute spin column and treated with DNase I to remove genomic DNA. The RNA-bound columns were washed with the buffers RWT and RPE before eluting the total RNA with 14 μl of RNase-free water. RNA quantity was estimated using a TapeStation (Agilent).
Sequencing libraries for low-input total RNA samples (up to 10 ng) were prepared using the SMART-Seq v4 Ultra Low Input RNA Kit (Clontech), according to the manufacturer’s protocol. The first-strand cDNA synthesis utilised the 3′ SMART-Seq CDS Primer II-A together with the SMART-Seq v4 Oligonucleotide. The cDNA amplification was carried out on a thermocycler using PCR Primer II-A and the following PCR conditions: 95 °C for 1 min; 12 cycles of 98 °C for 10 s, 65 °C for 30 s and 68 °C for 3 min; 72 °C for 10 min; and a hold at 4 °C until completion. The PCR-amplified cDNA was purified using AMPure XP beads and processed with the Nextera XT DNA Library Preparation Kits (Illumina, Cat. # FC-131-1024 and FC-131-1096) as per the protocol provided by the manufacturer.
Sequencing library preparation for the remaining samples (10–100 ng) was carried out using the TruSeq RNA Sample Preparation Kit v2. The poly-A containing mRNA was purified using oligo-dT bound magnetic beads, followed by fragmentation. The first-strand cDNA synthesis utilised random primers, and second-strand cDNA synthesis was carried out using DNA Polymerase I. The cDNA fragments then underwent end repair, the addition of a single ‘A’ base and ligation of the RNA adapters. The adaptor-ligated cDNA samples were bead-purified and enriched with PCR (15 cycles) to generate the final RNA-seq library.
The SMART-Seq v4 RNA and TruSeq RNA libraries were sequenced on an Illumina NextSeq 500 to generate 15–20 million 75 base pair paired-end reads for each sample. The batch effect due to sequencing runs was minimised by pooling all 52 libraries and carrying out three sequential runs on a NextSeq 500 sequencer.
Sequencing data quality control, mapping and read counting
The quality of the sequenced reads for each sample was checked using FastQC [19]. Reads were trimmed for custom Nextera Illumina adapters; low-quality fragments and short reads were filtered out using BBDuk (jgi.doe.gov) with default settings. All remaining reads were aligned to the reference genome hg38 using the STAR aligner [20] with default settings. Quality control of the alignment was performed with RNA-SeQC [21]. For each sample, gene transcript abundance was quantified as the number of reads per gene (read count) using featureCounts [22] with the following settings: isPairedEnd = T, requireBothEndsMapped = T, checkFragLength = F, useMetaFeatures = T. All sequenced reads that did not align to the human reference genome were assigned to bacterial and viral reference genomes using Kraken [23] with default settings.
Statistical inference of differential gene transcript-abundance
Changes in transcriptional levels along the CAPRA-S risk score [24] were estimated independently for each cell type (epithelial, fibroblast, myeloid and T cell). The CAPRA-S risk score is a combination of (i) the blood concentration of prostate-specific antigen (PSA); (ii) the presence of a positive surgical margin (SM); (iii) Gleason score; (iv) the presence of seminal vesicle invasion (SVI); (v) the extent of extracapsular extension (ECE); and (vi) lymph node involvement. The RNA extraction batch was used as a further covariate. Due to the absence of publicly available models for non-linear monotonic regression along a continuous covariate, a new Bayesian inference model was implemented. This model is based on the simplified Richards’ curve [25] (Eq. 1) but re-parameterised to improve numerical stability (Eq. 2). In particular, the standard parameterisation suffers from non-identifiability issues if the slope is close to zero; furthermore, in the case of an exponential-like trend, the upper plateau is not supported by data and tends to infinity.
$$ GL\left(X,\alpha,\beta,k\right)=\frac{k}{1+e^{-\left(\alpha+X\beta\right)}} $$
(1)
$$ GLA\left(X,y_0,\beta,\eta\right)=\frac{y_0\left(1+e^{\eta\beta_1}\right)}{1+e^{\eta\beta_1-X\beta}} $$
(2)
The new parameter y0 represents the intercept on the y axis, η represents the point of inflection on the x-axis, β represents the matrix of coefficients (i.e. slope coefficients, without the intercept term), β1 represents the coefficient of interest (i.e. main slope), and k the upper plateau of the generalised sigmoid function.
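The relationship between the two parameterisations can be checked numerically. The sketch below is a minimal illustration for a single covariate (so that β reduces to β1); the function names and the example values are assumptions made for this illustration and are not taken from TABI.

```python
import math


def gl(x, alpha, beta, k):
    """Simplified Richards' curve (Eq. 1), single covariate."""
    return k / (1.0 + math.exp(-(alpha + x * beta)))


def gla(x, y0, beta, eta):
    """Re-parameterised curve (Eq. 2), single covariate (beta == beta_1)."""
    return y0 * (1.0 + math.exp(eta * beta)) / (1.0 + math.exp(eta * beta - x * beta))


# Arbitrary example values (hypothetical).
y0, beta, eta = 50.0, 1.5, 2.0

# Quantities of Eq. 1 implied by the parameters of Eq. 2.
k = y0 * (1.0 + math.exp(eta * beta))  # upper plateau
alpha = -eta * beta                    # intercept term of the linear predictor

at_zero = gla(0.0, y0, beta, eta)        # equals y0, the y-axis intercept
at_eta = gla(eta, y0, beta, eta)         # equals k / 2, the inflection point
flat = gla(3.0, y0, 0.0, eta)            # a zero slope gives a constant y0
same = gl(0.7, alpha, beta, k)           # matches gla(0.7, y0, beta, eta)
```

With a zero slope, Eq. 2 collapses to the constant y0 regardless of η, whereas in Eq. 1 the plateau k and the intercept α are no longer separately determined by the data; this is the stability issue described above.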
Bayesian inference was used to infer the values of all parameters of the model, with TABI (GitHub: stemangiola/TABI@v0.1.3). The probabilistic framework Stan [26] was used to encode the joint probability function of the model (Eq. 3). We partitioned the transcriptomic dataset into blocks of 5000 genes to decrease the analysis run-time. This Bayesian model is based on a negative binomial distribution (parameterised by mean and overdispersion). To account for differences in sequencing depth across samples, a sample-wise normalisation parameter was added to the negative binomial expected value. The slope parameter for the main covariate (β1) was subject to a regularised horseshoe prior [27] to increase the robustness of the inference of transcription changes and to help anchor data from different samples for normalisation. The role of this prior is to impose a sparsity assumption on the gene-wise transcriptional changes; that is, most genes are not differentially transcribed. The overall distribution of the gene intercepts follows a gamma probability function. The following joint probability density defines the statistical model.
$$ \begin{array}{l}P\left(\gamma\right)P\left(\delta\right)P\left(\sigma\right)P\left(\eta\right)P\left(\xi\right)P\left(\dot{\beta}\mid\xi\right)\\ \left(\prod\limits_{r=2}^{R}P\left(\beta_r\mid\sigma\right)\right)\left(\prod\limits_{g=1}^{G}P\left(y_{0g}\mid\gamma^{\prime},\gamma^{\prime\prime}\right)\right)\\ \left(\prod\limits_{g=1}^{G}\prod\limits_{s=1}^{S}P\left(Y_{g,s}\mid\hat{Y},\delta,\omega\right)\right)\end{array} $$
(3)
$$ Y_{t,g}\sim NB\left(\exp\left(\delta_t\right)\hat{Y}_{t,g},\omega\right) $$
(4)
$$ \hat{Y}_{t,g}=GLA\left(X_t,y_{0_g},\beta_g,\eta_g\right) $$
(5)
$$ \beta_{g,1}\sim \mathrm{RegHorseshoe}\left(\dots\right) $$
(6)
$$ {\displaystyle \begin{array}{l}{\beta}_{g,k}\sim N\left(0,{\sigma}_k\right);k>1\\ {}{\sigma}_k\sim HalfN\left(0,1\right)\end{array}} $$
(7)
$$ {\displaystyle \begin{array}{l}{y}_{0_g}\sim Gamma\left({\gamma}_1+1,{\gamma}_2\right)\\ {}{\gamma}_i\sim Exponential(1)\\ {}\omega \sim Gamma\left(1.02,2\right)\end{array}} $$
(8)
$$ {\displaystyle \begin{array}{l}{\eta}_g\sim N\left(0,1\right)\\ {}{\delta}_t\sim N\left(0,1\right);\sum {\delta}_t\sim N\left(0,0.001\ast T\right)\end{array}} $$
(9)
Y represents raw transcript abundance, \( \hat{Y} \) represents the expected values of transcript abundance, and X represents the design matrix (with no intercept term and scaled covariates). The regression function also includes β, which represents the gene-wise matrix of factors (i.e. slopes excluding the intercept term), and \( y_0 \) and η, which represent the gene-wise y-intercept and the inflection point of the generalised reparameterised sigmoid function (Eq. 2). γ represents the hyperparameters of \( y_0 \). The other parameters of the negative binomial function are δ, which represents the normalisation factors, and ω, which represents overdispersion. The regularising prior (imposing the sparsity assumption) over the covariate of interest β1 (first column of β) is defined by the hyperparameter list ξ [27] (i.e. nu_local = 1; nu_global = 1; par_ratio = 0.8; slab_df = 4; slab_scale = 0.5), while σ represents the standard deviations of the other factors (in our case only the batch). Multidimensional scaling [28] was used to map the data in two-dimensional space.
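The likelihood term of Eqs. 4 and 5 can be written out for one gene as a short sketch. It is a simplified illustration, not the Stan code of TABI: it uses a single covariate, omits all priors, and assumes that ω enters the negative binomial as in the mean and overdispersion form where the variance is mu + mu²/ω. All function names are hypothetical.

```python
import math


def gla(x, y0, beta, eta):
    """Re-parameterised sigmoid of Eq. 2 for a single covariate."""
    return y0 * (1.0 + math.exp(eta * beta)) / (1.0 + math.exp(eta * beta - x * beta))


def nb_logpmf(y, mu, phi):
    """Negative binomial log-probability with mean mu.

    Assumption: phi is an inverse-overdispersion term, variance = mu + mu**2 / phi.
    """
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (mu + phi))
            + y * math.log(mu / (mu + phi)))


def gene_loglik(counts, x, delta, y0, beta, eta, omega):
    """Log-likelihood of one gene across samples (Eqs. 4 and 5, priors omitted)."""
    total = 0.0
    for y_t, x_t, delta_t in zip(counts, x, delta):
        # Expected abundance, rescaled by the sample-wise normalisation factor.
        mu = math.exp(delta_t) * gla(x_t, y0, beta, eta)
        total += nb_logpmf(y_t, mu, omega)
    return total


# Hypothetical toy data: three samples along a scaled covariate.
ll = gene_loglik(counts=[10, 20, 40], x=[0.0, 1.0, 2.0], delta=[0.0, 0.0, 0.0],
                 y0=10.0, beta=1.0, eta=2.0, omega=5.0)
```

In the full model this term is multiplied by the priors of Eqs. 6 to 9 and evaluated jointly over all genes in a block, which is what allows the normalisation factors δ to be shared across genes.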
Gene annotation
Each gene (g) was considered well fitted by the model if its read counts fell outside the 95th percentile of the generated quantities for three or fewer samples (according to posterior predictive check standards [29]). Among the well-fitted genes, those for which the 0.95 credible interval of the posterior distribution of the factor of interest β1g did not include the value 0 were labelled as differentially transcribed. The credible interval is a numerical range within which an unobserved parameter value falls with a given probability. In contrast to standard practice for frequentist models operating on confidence intervals and p-values, the credible interval probability threshold in this study was not altered for multiple hypothesis testing, consistent with standard practice in Bayesian statistics [30].
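The two labelling rules can be sketched on posterior draws as follows. This is a minimal illustration under stated assumptions: intervals are taken as central (equal-tailed) intervals computed with a simple nearest-rank rule, the posterior predictive criterion is read as a central 95% interval of the generated quantities, and all function names are hypothetical.

```python
import math


def central_interval(draws, prob=0.95):
    """Equal-tailed interval from posterior draws (nearest-rank quantiles)."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - prob) / 2.0
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return lower, upper


def is_well_fitted(observed, predictive_draws, max_outliers=3):
    """True if at most `max_outliers` samples fall outside their predictive interval."""
    outliers = 0
    for y, draws in zip(observed, predictive_draws):
        lower, upper = central_interval(draws)
        if y < lower or y > upper:
            outliers += 1
    return outliers <= max_outliers


def is_differentially_transcribed(beta1_draws):
    """True if the 0.95 credible interval of the main slope excludes zero."""
    lower, upper = central_interval(beta1_draws)
    return lower > 0 or upper < 0


# Hypothetical draws: one slope clearly positive, one centred on zero.
positive_slope = [i / 100 for i in range(1, 101)]
null_slope = [i / 100 - 0.5 for i in range(1, 101)]
```

Because the sparsity-inducing prior already shrinks the slopes of most genes towards zero, the same 0.95 threshold is applied to every gene without further adjustment.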
To interpret the inflection point along the CAPRA-S risk score covariate (i.e. the point of maximum slope, indicating at what stage of the disease a transcriptional change happens) in a biologically meaningful way, the inflection point was adjusted to the log scale. Because the lower plateau of our generalised sigmoid function was set to 0 (to limit the number of parameters needed to model it), the inflection point of the logarithm-transformed function is not defined. Therefore, we calculated the inflection point (\( \dot{X} \)) of the log sigmoid forcing a plateau at 1 (i.e. log (0) = 1; Eq. 10; Fig. S7). This new inflection point can be calculated as the value on the x-axis at half the distance between zero and the upper plateau of the generalised reparameterised sigmoid function (Eq. 10).
$$ \dot{X}=\frac{\beta_1\eta -\mathit{\log}\left({e}^{\frac{y0}{2}}\sqrt{e^{y0\eta }+1-1}\right)}{y0} $$
(10)
Genes were functionally annotated with gene ontology categories [31] using biomaRt [32]. Furthermore, genes were annotated with the Protein Atlas database [33] to identify those that interface with the extracellular environment, encoding cell-surface and secreted proteins. For a more in-depth analysis of possible interactions between cell types, we compiled a cell-type-specific annotation database for cell-surface and secreted protein-coding genes (Supplementary file 3).
Differential tissue composition analyses
The differential tissue composition analysis is composed of two integrated modules. The first module infers tissue composition from whole-tissue gene transcript abundances based on reference transcriptional profiles of pure cell types (deconvolution). The second module performs a beta regression on the inferred proportions along the factor of interest (and additional covariates). Bayesian inference allows the transfer of uncertainty between the two modules (GitHub: stemangiola/ARMET@v0.7.1). The probabilistic framework Stan [26] was used to encode the joint probability function of the model [34]. The 0.95 credible interval of the posterior distributions was used as a significance threshold.
The supervised deconvolution was based on deconvolution signatures created using a curated collection of 250 publicly available transcriptional profiles (included in BLUEPRINT [35], ENCODE [36], GSE89442 [37] and GSE107011 [38]) encompassing 8 broad categories of cell types and 18 cell phenotypes. Genes whose transcription varied across datasets (detected using limma [28]) were used to identify highly correlated datasets. The Pearson correlation was calculated for all-versus-all samples. The samples with a Pearson correlation greater than 0.99 were discarded as redundant. Each cell-type category was classified as belonging to a node of the cell-differentiation tree, which includes epithelial cells, fibroblasts, endothelial cells and immune cells in the first level, and B cells, T cells, natural killer cells, monocyte-derived cells and granulocytes in the second level. For each cell type in the differentiation tree, the gene-transcript abundance was modelled using a negative binomial distribution (parameterised by mean and overdispersion). Differences in sequencing depth across biological replicates were modelled with a replicate-wise exposure rate term ϵ that multiplies the expected transcript abundance (mean). For each cell-type pair of the same level, 40 genes (20 for each direction) were selected that (i) were abundant (had a mean value higher than the median of all genes), and (ii) segregated the two cell types (having the largest gap between the upper quantile of one cell type and the lower quantile of the other; 95% credible interval). The gene selection for each level was represented by the union of marker genes for all cell-type pairs. The inference was carried out along the two levels of the hierarchy, and the inference for each node (e.g. T cells) was relative to its parent (e.g. immune cells).
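The redundancy filter on the reference profiles can be sketched as below. The text does not state which member of a highly correlated pair was retained, so this illustration assumes a greedy rule that keeps the first profile encountered; the function names and the toy profiles are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(profiles, threshold=0.99):
    """Keep a profile only if it correlates at most `threshold` with every kept one.

    Assumption: profiles are visited in insertion order and the first of a
    redundant pair is retained.
    """
    kept = []
    for name, values in profiles.items():
        if all(pearson(values, profiles[other]) <= threshold for other in kept):
            kept.append(name)
    return kept


# Toy abundance profiles over four genes (hypothetical values).
profiles = {
    "sample_a": [1.0, 2.0, 3.0, 4.0],
    "sample_b": [2.0, 4.0, 6.0, 8.1],  # nearly proportional to sample_a
    "sample_c": [4.0, 1.0, 3.0, 2.0],
}
kept = drop_redundant(profiles)
```

In the analysis described above the correlation is computed only on the genes that vary across datasets, which keeps the comparison focused on informative genes.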
Analysis of tumour microenvironment using multiplex immunohistochemistry
Slides (3 μm) from formalin-fixed and paraffin-embedded (FFPE) tissue were taken from a total of 63 core biopsies of localised prostate cancer across 17 patients. A pathological evaluation was performed to define the tumour and surrounding benign tissue areas for each biopsy. The methodology for performing multiplex immunohistochemistry, cell type classification and localisation has been detailed by Keam et al. [39]. Briefly, slides were deparaffinised and rehydrated with xylene and ethanol. Fluorochrome-coupled antibodies against human CD68 (macrophages and dendritic cells), high molecular weight cytokeratin (HMWCK; epithelial basal cells), CD3 (T cells), CD20 (B cells), CD11c (dendritic cells), and PDL1 were used. The dye DAPI was used for nuclei staining. The Vectra 3.0 Automated Quantitative Pathology Imaging System (Perkin Elmer, MA) was used for imaging, as detailed by Keam et al. [39]. The software HALO was used for cell segmentation and phenotyping. Stromal cells were defined by negative selection for all antibodies (DAPI positive) and by filtering for large size (cell area > 70) and highly elongated shape (ratio of the largest to the smallest dimension > 2; 0.9 percentile; 0.9 percentile; Fig. S6).
Cell type proximity was quantified as the number of cells within a radius of 20 cell sizes from a selected cell, averaged per tissue area (5 cell size units) to smooth the measure and to avoid information duplication due to tight cell clusters. The relative cell size was set to 15 units, the observed median cell length in the coordinate system. The statistics were summarised at the biopsy level. When the distance between two cell types was measured, only the biopsies including both cell types were selected. The robust regression analyses were performed using the R heavy package [40] on the log-transformed proximity measure. The co-proximity analysis between epithelial basal cells and PDL1+ macrophages and T cells was performed at the single-cell level (averaged by tissue area of 5 cell size units). We calculated the proximity on a radius of 50 relative cell sizes to ensure good coverage of both T cells and PDL1+ macrophages and to decrease sparsity. Only the epithelial basal cells in immune-rich areas (with > 5 neighbouring T cells) were considered.
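The core of the proximity measure, counting neighbours within a radius expressed in cell sizes, can be sketched as follows. The sketch is deliberately reduced: it omits the averaging per tissue area and the robust regression, uses a brute-force search, and all function names and coordinates are hypothetical.

```python
import math

CELL_SIZE = 15.0  # relative cell size in coordinate units, as stated in the text


def neighbours_within(focal, others, radius_cells=20, cell_size=CELL_SIZE):
    """Count cells in `others` within `radius_cells` cell sizes of `focal`."""
    radius = radius_cells * cell_size
    return sum(1 for other in others if math.dist(focal, other) <= radius)


def mean_proximity(focal_cells, other_cells, radius_cells=20):
    """Average neighbour count over all focal cells of one biopsy."""
    counts = [neighbours_within(cell, other_cells, radius_cells)
              for cell in focal_cells]
    return sum(counts) / len(counts)


# Hypothetical coordinates: two focal cells and three cells of another type.
focal_cells = [(0.0, 0.0), (1000.0, 0.0)]
other_cells = [(100.0, 0.0), (300.0, 0.0), (301.0, 0.0)]
count = neighbours_within(focal_cells[0], other_cells)
average = mean_proximity(focal_cells, other_cells)
```

For the co-proximity analysis the same count would be taken with `radius_cells=50`; for whole slides a spatial index would be preferable to the quadratic search used here.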