pVAC-Seq analysis of The Cancer Genome Atlas patients
For our analyses, we used somatic mutations identified with MuTect [36] from Mutation Annotation Format (MAF) files for 18 cancer types (see Additional file 1: Table S1) in TCGA, retrieved using gdc-scan (v1.0.0) [37]. The MAF files were then converted to tumor-normal pair variant call format (VCF) files using the maf2vcf tool in the vcf2maf software package [38], with the GRCh38/hg38 genome build available from the Broad Institute resource bundle [39] used as the reference genome. Because these VCFs still contained data for both tumor and normal samples, we removed the data from the paired normal sample, leaving final, tumor-only VCF files for compatibility with pVAC-Seq, which accepts only single-sample VCFs. The number of patients varied by disease type (see Additional file 1: Figure S1).
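The normal-sample removal step can be illustrated with a minimal sketch. It is not the script used in this study; it assumes a tab-delimited VCF in which sample columns follow the nine fixed columns, and the function name `drop_sample_column` and the sample labels `TUMOR` and `NORMAL` are hypothetical.

```python
def drop_sample_column(vcf_lines, sample_to_drop):
    """Yield VCF lines with one sample column removed.

    Assumption: sample columns start at index 9, after the fixed
    CHROM..FORMAT columns, as in a standard VCF.
    """
    drop_index = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Search only among sample columns; raises ValueError if absent.
            drop_index = fields.index(sample_to_drop, 9)
        if drop_index is None:
            raise ValueError("data line encountered before the #CHROM header")
        yield "\t".join(f for i, f in enumerate(fields) if i != drop_index)


# Hypothetical two-sample record used only for illustration.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR\tNORMAL",
    "chr1\t12345\t.\tA\tT\t.\tPASS\t.\tGT:AD\t0/1:20,15\t0/0:30,0",
]
tumor_only = list(drop_sample_column(example_vcf, "NORMAL"))
```

Locating the column by name from the header, rather than by a fixed position, keeps the sketch independent of the order in which the tumor and normal samples were written.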
We then annotated these VCF files using Variant Effect Predictor (VEP, v88) [40]. VEP was run according to pVAC-Seq’s recommendations [41], with the Downstream and Wildtype VEP plugins [42] enabled, gene symbols added to output where available, and mutation consequence terms based on Sequence Ontology annotation guidelines [43]. We also used VEP’s GRCh38 annotation cache (rather than querying remotely) for efficiency.
As the TCGA data we obtained did not allow us to calculate patient-specific HLA types, we assumed each tumor could occur in the setting of any HLA allele type, allowing us to explore neoepitope distributions among a broader theoretical population. To do this, we generated a list of HLA alleles to use for subsequent analysis based on allele frequencies originating from the Allele Frequency Net Database [44] and summarized for use in the software POLYSOLVER (v1.0) [45]. The average frequencies across races (Asian, Black, and Caucasian) of alleles for each HLA gene (HLA-A, HLA-B, and HLA-C) were calculated and normalized to sum to 100%. We then selected the top 145 average-frequency HLA alleles for subsequent analysis (see Additional file 1: Tables S2–S4), encompassing all HLA alleles among 99% of individuals in the general population.
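The averaging and normalization described above can be sketched as follows for a single HLA gene. This is an illustration only: the function name `rank_alleles` is hypothetical, the frequencies are made-up toy values rather than Allele Frequency Net Database data, and treating an allele missing from a population as having frequency zero is an assumption.

```python
def rank_alleles(freqs_by_population, top_n):
    """Average allele frequencies across populations, rescale the averages
    to sum to 100%, and return the top_n alleles with their rescaled values."""
    alleles = set()
    for freqs in freqs_by_population.values():
        alleles.update(freqs)
    n_pops = len(freqs_by_population)
    # Assumption: an allele absent from a population contributes 0.
    mean_freq = {
        a: sum(f.get(a, 0.0) for f in freqs_by_population.values()) / n_pops
        for a in alleles
    }
    total = sum(mean_freq.values())
    normalized = {a: 100.0 * v / total for a, v in mean_freq.items()}
    # Sort by descending frequency; break ties by allele name for determinism.
    ranked = sorted(normalized.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Toy frequencies for HLA-A only (illustrative values, not real data).
toy_hla_a = {
    "Asian": {"HLA-A*02:01": 0.10, "HLA-A*24:02": 0.30},
    "Black": {"HLA-A*02:01": 0.20, "HLA-A*24:02": 0.05},
    "Caucasian": {"HLA-A*02:01": 0.30, "HLA-A*01:01": 0.15},
}
top_two = rank_alleles(toy_hla_a, top_n=2)
```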
pVAC-Seq (v4.0.8) was run for each patient and allele combination using 9mer epitopes generated from 17-mer peptides surrounding each missense mutation, and using MHC binding predictions generated by NetMHCpan (v2.8) [46]. For each resulting neoepitope from pVAC-Seq, additional metrics were applied as described below. Note that for the purposes of this study, only epitopes resulting from missense mutations were considered for further analysis, and all peptides were considered to be expressed at equal levels. Note also that neoepitopes from breast cancer (BRCA), cervical cancer (CESC) and melanoma (SKCM) were not assessed for protein sequence similarity against peptides other than their paired normal epitopes.
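To make the relationship between the 17-mer peptide and its 9mer epitopes concrete, the sketch below enumerates every window of length 9 that contains the mutated residue. It is not pVAC-Seq code; the function name `windows_containing` and the example sequence are hypothetical, and placing the mutated residue at the center of the 17-mer (eight flanking residues on each side) is an assumption.

```python
def windows_containing(peptide, mutation_index, k=9):
    """Return all length-k substrings of peptide that include the residue
    at mutation_index (0-based)."""
    first_start = max(0, mutation_index - k + 1)
    last_start = min(mutation_index, len(peptide) - k)
    return [peptide[s:s + k] for s in range(first_start, last_start + 1)]


# Hypothetical 17-mer with the mutated residue (W) at the central position.
mutant_17mer = "ACDEFGHIWLMNPQRST"
epitopes = windows_containing(mutant_17mer, mutation_index=8)
```

Under the centered-mutation assumption, a 17-mer yields nine 9mers, with the mutated residue occupying each of the nine epitope positions exactly once.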
To assess the degree to which HLA alleles might have overlapping preference for putatively novel binding neoepitopes predicted for mutations across TCGA (see Neoepitope Prioritization Metrics), 1000 random sets of six of the previously described set of HLA alleles (two HLA-A, two HLA-B, and two HLA-C alleles) were chosen using the random.sample function (without replacement) from the random module in Python 2.7.13 [47] (the combinations tested are available in Additional file 2: Table S5). All unique amino acid sequences of neoepitopes that bound to one or more alleles within each random allele set were counted; separate counts were kept for neoepitopes that bound to one, two, three, four, five, or six of the six alleles (i.e. increasing levels of overlap). The script for randomly sampling allele sets and determining overlap is available in our GitHub repository [48].
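The sampling and overlap counting can be sketched as below. This is not the repository script; it is written for Python 3, the helper names `sample_allele_set` and `overlap_counts` are hypothetical, and the binder sets are toy data invented for illustration.

```python
import random
from collections import Counter


def sample_allele_set(alleles_by_gene, rng, per_gene=2):
    """Draw per_gene alleles without replacement for each HLA gene."""
    chosen = []
    for gene in ("HLA-A", "HLA-B", "HLA-C"):
        chosen.extend(rng.sample(alleles_by_gene[gene], per_gene))
    return chosen


def overlap_counts(allele_set, binders_by_allele):
    """Map overlap level (number of alleles bound) to the number of unique
    neoepitope sequences observed at that level."""
    alleles_bound = Counter()
    for allele in allele_set:
        for epitope in binders_by_allele.get(allele, set()):
            alleles_bound[epitope] += 1
    return Counter(alleles_bound.values())


# Toy inputs (the binder assignments are hypothetical).
toy_alleles = {
    "HLA-A": ["HLA-A*01:01", "HLA-A*02:01", "HLA-A*24:02"],
    "HLA-B": ["HLA-B*07:02", "HLA-B*08:01", "HLA-B*44:02"],
    "HLA-C": ["HLA-C*04:01", "HLA-C*07:01", "HLA-C*07:02"],
}
toy_binders = {
    "HLA-A*01:01": {"SLYNTVATL", "KLVVVGAVG"},
    "HLA-A*02:01": {"SLYNTVATL"},
    "HLA-B*07:02": {"SLYNTVATL", "GILGFVFTL"},
}

rng = random.Random(0)  # seeded only to make the illustration repeatable
allele_set = sample_allele_set(toy_alleles, rng)
counts = overlap_counts(allele_set, toy_binders)
```

Repeating the last two lines 1000 times and accumulating `counts` would mirror the number of random allele sets described above.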
For comparison, we assessed recurrence rates among 2,813,809 simulated neoepitopes (9mers) mirroring the size of the TCGA data set. These neoepitopes were drawn randomly from the GRCh38 peptidome, with subsequent introduction of a random single amino acid substitution at a random position along each 9mer. These simulated peptides were labeled by patient and disease site to produce a random set of peptides for each patient equivalent in size to that patient’s predicted neoepitope repertoire. We repeated this process for a smaller set of 1000 simulated neoepitopes to assess trends in peptide similarity scores. The gene of origin of the random peptide and the gene corresponding to its closest peptide match in the human proteome were retained for protein sequence similarity analysis (see “Tumor vs. closest human peptide sequence similarity” below).
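A minimal sketch of simulating one such peptide follows. It is an illustration, not the study's code: the function name `simulate_neoepitope` and the toy protein sequences are hypothetical, and choosing a protein uniformly before choosing a window is a simplifying assumption that does not weight proteins by length.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulate_neoepitope(proteins, rng, k=9):
    """Pick a random k-mer from a random protein and introduce one random
    amino acid substitution at a random position.

    Returns (protein_name, normal_kmer, mutated_kmer).
    """
    eligible = [(name, seq) for name, seq in sorted(proteins.items()) if len(seq) >= k]
    name, seq = rng.choice(eligible)
    start = rng.randrange(len(seq) - k + 1)
    normal = seq[start:start + k]
    position = rng.randrange(k)
    # Exclude the original residue so the substitution always changes the peptide.
    alternatives = [aa for aa in AMINO_ACIDS if aa != normal[position]]
    mutated = normal[:position] + rng.choice(alternatives) + normal[position + 1:]
    return name, normal, mutated


# Toy "peptidome" with made-up names and sequences.
toy_proteins = {
    "GENE1": "MKTAYIAKQRQISFVKSHFSRQ",
    "GENE2": "MSDNELLKRAVEDLGYTPW",
}
rng = random.Random(1)
gene, normal_9mer, simulated_9mer = simulate_neoepitope(toy_proteins, rng)
```

Returning the protein name alongside the peptide corresponds to retaining the gene of origin for the later similarity analysis.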
Analysis of neoepitopes in melanoma patients
We identified patient-specific neoepitopes in whole exome sequencing data from 12 patients selected from a study exploring genomic features of response to immunotherapy in melanoma patients [49]. Reads were aligned against the GRCh37d5 reference genome using the Sanger cgpmap workflow [50]. This workflow uses bwa-mem (v0.7.15–1140) [51] and biobambam2 (v2.0.69) [52] to generate genome coordinate-sorted alignments with duplicates marked. Realignment around indels and base recalibration were performed using Genome Analysis Toolkit (v3.6) [53]. Variants were called using VarScan (v2.3.9) [54] in accordance with the methods outlined in the workflow [50]. VCF files were annotated using VEP (v88) [40] as described above. For all missense single nucleotide variants identified, the tumor and normal protein epitopes of 8, 9, 10, and 11 amino acids in length were produced by reconstructing the nucleotide sequence surrounding the mutation using its coordinates from the VCF file and the CDS in the hg19 gene transfer format file [55], and translating this sequence into amino acids. Each patient’s HLA type was determined from FASTQ files using Optitype (v1.3.1) [56], and the binding affinity of all predicted tumor and normal epitopes was predicted with NetMHCpan (v2.8) [46] for each epitope and patient-specific HLA allele combination. Additional prioritization metrics were applied as described below.
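The translation step can be sketched as follows, under strong simplifying assumptions: the coding sequence is already assembled on the coding strand, the variant position is given as a 0-based offset into that sequence, and the variant is a missense substitution. The function names and the toy sequence are hypothetical, and this is not the code used in the study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def paired_epitopes(cds, snv_pos, alt_base, lengths=(8, 9, 10, 11)):
    """Return (normal, tumor) epitope pairs of each length that span the
    residue affected by a single-nucleotide substitution."""
    mutated_cds = cds[:snv_pos] + alt_base + cds[snv_pos + 1:]
    normal, tumor = translate(cds), translate(mutated_cds)
    residue = snv_pos // 3
    pairs = []
    for k in lengths:
        first = max(0, residue - k + 1)
        last = min(residue, len(tumor) - k)
        for s in range(first, last + 1):
            pairs.append((normal[s:s + k], tumor[s:s + k]))
    return pairs


# Toy 21-codon sequence: GAA (Glu) at codon index 10 becomes GTA (Val).
toy_cds = "ATG" + "GCT" * 9 + "GAA" + "GCT" * 10
pairs = paired_epitopes(toy_cds, snv_pos=31, alt_base="T")
```

Each pair holds the normal and tumor versions of the same window, which is the pairing needed to predict binding affinity for both peptides.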
Features associated with immunogenicity
To assess how well our prioritization metrics reflect a neoepitope’s ability to elicit an immune response, we applied our criteria to predicted neoepitopes from six studies in which peptide-specific immune responses were measured [3, 11, 20, 62–64]. For data from all studies, we used only peptides which had complete information regarding the neoepitope and its paired normal peptide, as well as complete data regarding epitope-level immune response, providing a total cohort of 419 peptides. Because only binary immune response data were available from Ott et al. [11] and Bjerregaard et al. [20], we generated binary response data from the Carreno et al. [62] dataset for compatibility: a neoepitope was considered to have elicited an immune response if neoantigen-specific T cells made up more than 10% of lymph+/CD8+ gated cells. Of the seven peptides from the Le et al. [64] dataset that were tested for clonal T cell expansion, the three peptides that demonstrated clonal T-cell expansion were considered to have elicited an immune response, while those that only demonstrated immune reactivity in an ELISpot assay were considered not to have elicited an immune response. Among peptides evaluated in co-culture experiments from Tran et al. [3] and Gros et al. [63], those that were T-cell reactive were considered to have elicited an immune response. We produced a linear model to determine the relationship between our neoepitope novelty criteria and peptide-specific immune response (see “Statistical analysis” below). For comparison, we also trained support vector machine (SVM) and Random Forest models with 10-fold cross-validation using Scikit-learn for Python [65].
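The per-study rules for deriving a binary response label can be summarized in a short sketch. The record layout and field names (`study`, `percent_specific_cd8`, `clonal_expansion`, `t_cell_reactive`, `responded`) are hypothetical and chosen only for illustration; the thresholds and rules follow the text above.

```python
CARRENO_THRESHOLD_PERCENT = 10.0


def elicited_response(record):
    """Return True if a peptide is counted as having elicited an immune
    response under the per-study rules described in the text."""
    study = record["study"]
    if study == "Carreno":
        # Strictly greater than 10% neoantigen-specific cells.
        return record["percent_specific_cd8"] > CARRENO_THRESHOLD_PERCENT
    if study == "Le":
        # ELISpot reactivity alone does not count as a response.
        return bool(record["clonal_expansion"])
    if study in ("Tran", "Gros"):
        return bool(record["t_cell_reactive"])
    if study in ("Ott", "Bjerregaard"):
        # These studies already report a binary outcome.
        return bool(record["responded"])
    raise ValueError("unknown study: %r" % study)


# Hypothetical records, not data from the cited studies.
examples = [
    {"study": "Carreno", "percent_specific_cd8": 12.5},
    {"study": "Carreno", "percent_specific_cd8": 10.0},
    {"study": "Le", "clonal_expansion": False, "elispot_reactive": True},
    {"study": "Tran", "t_cell_reactive": True},
]
labels = [elicited_response(r) for r in examples]
```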
Statistical analysis
Statistical analysis was performed using R (v3.3.2) in RStudio. To test the relationship between per-allele neoepitope burden and neoepitope frequency across the TCGA dataset, we obtained the Pearson’s product-moment correlation and associated p-value using a two-sided test. To determine whether the difference in binding affinity between tumor and paired normal peptides differs between epitopes with and without an amino acid change at an anchor position, we applied a Wilcoxon rank sum test. We also used the Wilcoxon rank sum test to compare tumor vs. paired normal peptide sequence similarity scores between epitopes with novel vs. non-novel binding changes, and to compare the difference in tumor vs. paired normal peptide sequence similarity scores for the neoepitope with its paired normal epitope and its closest matching human peptide from BLAST for matching versus non-matching genes. A Welch’s two sample t-test was used to compare the similarity of neoepitopes to bacterial vs. viral peptides. We used the package pROC [66] to obtain AUROC scores and the lm function to determine the relationship between our continuous predictors and observed peptide-specific immune response data. Our analyses are available as an R script on our GitHub repository [48].
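For readers unfamiliar with AUROC, the quantity can be illustrated with a small Python sketch. This is not the pROC implementation used in the analysis; it computes the equivalent pairwise definition (the probability that a responder scores higher than a non-responder, with ties counted as one half), and the function name `auroc` and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores; ties contribute 0.5."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy predictor scores and binary immune-response labels (illustrative only).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the number of peptides, which is acceptable for a cohort of a few hundred peptides but would be replaced by a rank-based computation for larger data.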