Processed mass spectrometry spectra (mzML files) for all independent cancer types were obtained from the CPTAC database [8]. The database of protein sequences was prepared in two ways. First, the human proteome in which every tryptophan was changed to each of the other amino acids except arginine and lysine was used as the search database, referred to as database 1 (fully substitutant). Second, to optimize true positives, we generated a second database (database 2, optimized database), which includes the canonical human proteome (UniProt) together with the substitutant tryptic peptides (length > 5 and < 50 amino acids) that span a tryptophan residue, with the tryptophan substituted by each of the other amino acids. The analyses of the two databases are presented separately on the server. Additional details on the FASTA file are available on the GitHub page and in the description section on the server. Briefly, MSFragger searched the mzML spectral files against the custom database for peptide detection with the following parameters: Precursor mass lower: −20 ppm, Precursor mass upper: 20 ppm, precursor mass tolerance: 20 ppm, calibrate mass: True, Deisotoping: True, mass offset: False, isotope error: Standard, digestion: Strictly tryptic (Max. missed cleavages: 2), Variable modifications: 15.99490 M 3, 42.01060 [^ 1, 144.1021 n^ 1, 144.1021 S 1, Min Length: 7, Max Length: 50, digest mass range: 500:5000 Da, Max Charge: 2, remove precursor range: −1.5, 1.5, topN peaks: 300, minimum peaks: 15, precursor range: 1:6, add Cysteine: 57.021464, add Lysine: 229.162932, among other basic parameters (Supplementary Table
1). Next, PeptideProphet validated the detected peptides with the following parameters: accmass: TRUE, decoyprobs: TRUE, expectScore: TRUE, Glycosylation: FALSE, ICAT: FALSE, masswidth: 5, minimum probability after first pass of a peptide: 0.9, minimum number of NTT in a peptide: 2, among other parameters (Supplementary Table
1). Isobaric quantification was then performed with the following parameters: bestPSM: TRUE, level: 2, minProb: 0.7, ion purity cut-off: 0.5, tolerance: 20 ppm, among other parameters (Supplementary Table
1). Next, to retain only confident peptides, stringent False Discovery Rate (FDR) filtering was applied with the following parameters: FDR < 0.01, peptideProbability: 0.7, among other parameters (Supplementary Table
1). Next, TMT-Integrator was used to create integrated reports with isobaric quantification across all samples with the following parameters: retention time normalization: False, minimum peptide probability on top of FDR filtering: 0.9, among other parameters (Supplementary Table
1).
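The construction of the substitutant peptide entries described above can be illustrated with a minimal Python sketch. It is not the script used to build the FASTA files. It assumes cleavage after every K or R with no proline exception, up to two missed cleavages, and one tryptophan substituted at a time; the function names and the `excluded` argument are hypothetical. Passing `excluded="KR"` mimics the rule stated for database 1, where arginine and lysine are not used as replacements.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tryptic_digest(sequence, max_missed_cleavages=2):
    """In-silico digest: cut after every K or R (assumption: no proline rule)."""
    fragments, start = [], 0
    for idx, residue in enumerate(sequence):
        if residue in "KR":
            fragments.append(sequence[start:idx + 1])
            start = idx + 1
    if start < len(sequence):
        fragments.append(sequence[start:])

    # Join neighbouring fragments to model missed cleavages.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed_cleavages + 1, len(fragments))
        for j in range(i, last):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


def substitutant_peptides(sequence, excluded=""):
    """Return sorted unique peptides with a single W replaced by another residue.

    Only tryptic peptides spanning a W with length > 5 and < 50 are used.
    Substituting one W at a time is an assumption of this sketch.
    """
    results = set()
    for peptide in tryptic_digest(sequence):
        if not (5 < len(peptide) < 50) or "W" not in peptide:
            continue
        for pos, residue in enumerate(peptide):
            if residue != "W":
                continue
            for new in AMINO_ACIDS:
                if new == "W" or new in excluded:
                    continue
                results.add(peptide[:pos] + new + peptide[pos + 1:])
    return sorted(results)


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    for pep in substitutant_peptides("MAWGTEKLLR")[:5]:
        print(pep)
```

Each returned peptide would then be written as a separate FASTA entry, appended to the canonical proteome in the case of the optimized database.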
Substitutant peptides were retrieved from the TMT-Integrator (version 3.1.0) reports. Using an R script, peptides with a log2-transformed intensity above 0 in a sample were considered positively detected in that sample. As described before [13], for intra-tumour-type analysis a filter on the maximum number of samples was applied to retain peptides with more specific expression. W > F substitutants were exempt from this filter because of their exclusive and specific distribution wherever significant. This exclusivity was demonstrated for all tumour types in the analysis of database 1 [13], whereas GBM, UCEC, and PDA did not show it in the analysis of database 2. The filter optimizes the signal for gene expression correlation analysis. The same script was also used to generate bar plots of the cumulative number of tryptophan substitutants detected in the scans.
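The detection and filtering rule above can be sketched as follows. This is a Python illustration, not the R script used in the study. The record layout, the function names, and the `max_samples` threshold are assumptions, since the threshold value is not stated here.

```python
import math


def detected_samples(log2_intensity):
    """Samples in which a peptide counts as detected (log2 intensity > 0)."""
    return {
        sample
        for sample, value in log2_intensity.items()
        if value is not None and not math.isnan(value) and value > 0
    }


def count_substitutants(records, max_samples, exempt=("W>F",)):
    """Count detected substitutant peptides per sample.

    Peptides detected in more than `max_samples` samples are dropped,
    unless their substitution type is listed in `exempt`.
    """
    counts = {}
    for record in records:
        hits = detected_samples(record["log2_intensity"])
        if record["substitution"] not in exempt and len(hits) > max_samples:
            continue  # too widely detected to be specific
        for sample in hits:
            counts[sample] = counts.get(sample, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy records; peptide and sample names are placeholders.
    records = [
        {"peptide": "p1", "substitution": "W>F",
         "log2_intensity": {"s1": 1.2, "s2": 0.5, "s3": 0.1}},
        {"peptide": "p2", "substitution": "W>Y",
         "log2_intensity": {"s1": 0.3, "s2": 0.4, "s3": 2.0}},
        {"peptide": "p3", "substitution": "W>Y",
         "log2_intensity": {"s1": 0.7, "s2": -0.2, "s3": None}},
    ]
    print(count_substitutants(records, max_samples=2))
```

In the toy data, the W>Y peptide found in all three samples is removed by the filter, while the W>F peptide with the same distribution is kept.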
Gene expression data were downloaded in GCT format from the PDC database. For each sample, the counts of W-substitutants were combined with the gene expression profiles. Perl scripts were written to count the number of substitutants when a gene is lowly expressed (intensity < 0) or highly expressed (intensity > 0). P-values for the comparison were calculated using the Wilcoxon test.
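The comparison described above can be sketched in Python as follows. This is not the Perl code used in the study. It assumes the unpaired two-sided Wilcoxon rank-sum (Mann-Whitney U) variant with a normal approximation and tie correction; the text does not specify which variant was used. The sample layout and the gene name `GENE_A` are hypothetical.

```python
import math


def split_by_expression(samples, gene):
    """Group substitutant counts by the sign of the gene's expression value."""
    low, high = [], []
    for sample in samples:
        value = sample["expression"].get(gene)
        if value is None:
            continue
        if value > 0:
            high.append(sample["substitutant_count"])
        elif value < 0:
            low.append(sample["substitutant_count"])
        # Samples with a value of exactly 0 fall in neither group.
    return low, high


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, p-value).

    Uses average ranks for ties and a normal approximation without
    continuity correction, so p-values are rough for very small groups.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical toy samples.
    samples = [
        {"expression": {"GENE_A": e}, "substitutant_count": c}
        for e, c in [(-1.0, 1), (-0.5, 2), (-0.2, 3), (0.4, 10), (0.9, 11), (1.3, 12)]
    ]
    low, high = split_by_expression(samples, "GENE_A")
    print(rank_sum_test(low, high))
```

For real analyses, an established statistics library is preferable to this hand-written approximation, particularly for small sample sizes where exact p-values differ.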