A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal

James X. Sun; Yuting He; Eric Sanford; Meagan Montesion; Garrett M. Frampton; Stéphane Vignot; Jean-Charles Soria; Jeffrey S. Ross; Vincent A. Miller; Phil J. Stephens; Doron Lipson; Roman Yelensky

doi:10.1371/journal.pcbi.1005965

Abstract

A key constraint in genomic testing in oncology is that matched normal specimens are not commonly obtained in clinical practice. Thus, while well-characterized genomic alterations do not require normal tissue for interpretation, the germline or somatic origin of a significant number of alterations cannot be determined in the absence of a matched normal control. We introduce SGZ (somatic-germline-zygosity), a computational method for predicting somatic vs. germline origin and homozygous vs. heterozygous or sub-clonal state of variants identified from deep massively parallel sequencing (MPS) of cancer specimens. The method does not require a patient matched normal control, enabling broad application in clinical research. SGZ predicts the somatic vs. germline status of each alteration identified by modeling the alteration’s allele frequency (AF), taking into account the tumor content, tumor ploidy, and the local copy number. Accuracy of the prediction depends on the depth of sequencing and copy number model fit, which are achieved in our clinical assay by sequencing to high depth (>500x) using MPS, covering 394 cancer-related genes and over 3,500 genome-wide single nucleotide polymorphisms (SNPs). Calls are made using a statistic based on read depth and local variability of SNP AF. To validate the method, we first evaluated performance on samples from 30 lung and colon cancer patients, where we sequenced tumors and matched normal tissue. We examined predictions for 17 somatic hotspot mutations and 20 common germline SNPs in 20,182 clinical cancer specimens. To assess the impact of stromal admixture, we examined three cell lines, which were titrated with their matched normal to six levels (10–75%). Overall, predictions were made in 85% of cases, with 95–99% of variants predicted correctly, a significantly superior performance compared to a basic approach based on AF alone.
We then applied the SGZ method to the COSMIC database of known somatic variants in cancer and found >50 that are in fact more likely to be germline.

Author summary

We introduce SGZ, a computational method for predicting somatic vs. germline origin and homozygous vs. heterozygous or sub-clonal state of variants identified from deep massively parallel sequencing of clinical formalin-fixed, paraffin embedded (FFPE) cancer specimens. The method does not require fresh tissue or a patient matched normal control, enabling broad application in clinical research. It supports functional prioritization and interpretation of alterations discovered on routine testing and may inform clinical decision making and ultimately expand treatment choices for cancer patients.

Citation: Sun JX, He Y, Sanford E, Montesion M, Frampton GM, Vignot S, et al. (2018) A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal. PLoS Comput Biol 14(2): e1005965. https://doi.org/10.1371/journal.pcbi.1005965

Editor: Roland L. Dunbrack Jr., Fox Chase Cancer Center, UNITED STATES

Received: September 25, 2015; Accepted: January 5, 2018; Published: February 7, 2018

Copyright: © 2018 Sun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files. Sample variant data has been deposited in the NCI's Genomic Data Commons Data Portal under accession number phs001179 and can be accessed at https://gdc.cancer.gov/about-gdc/contributed-genomic-data-cancer-research/foundation-medicine/foundation-medicine. The SGZ software is ready and available on GitHub at https://github.com/jsunfmi/SGZ.

Funding: All authors in this study were funded by Foundation Medicine, Inc. (www.foundationmedicine.com). The funder had a role in study design, data collection and analysis, decision to publish, and preparation of the manuscript.

Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: JXS, YH, MM, GMF, JSR, VAM, PJS, and DL are paid employees of Foundation Medicine. JXS, YH, ES, GMF, JSR, VAM, PJS, DL, and RY are shareholders of Foundation Medicine. ES and RY are former employees of Foundation Medicine, but were employees when the studies in this paper were conducted.

This is a PLOS Computational Biology Methods paper.

Introduction

Characterization of clinical cancer specimens using MPS for targeted treatment selection is becoming increasingly common [1–5]. These procedures generate large numbers of alterations per patient, only a minority of which are potential oncogenic drivers or therapeutically relevant, while the rest are either passenger mutations or germline polymorphisms that are typically functionally benign [6]. Although most therapeutic strategies will focus on variants that have already been well-characterized in the literature, an important opportunity to discover novel oncogenic targets will arise as hundreds of thousands of clinical cancer cases are sequenced. An essential component in this on-going analysis will be prioritizing uncharacterized variants for further follow-up, with somatic versus germline origin determination being a critical step.

The definitive approach to distinguishing somatic mutations from germline variants requires sequencing the tumor alongside a patient matched normal, and subsequently performing a comparison: variants detected in tumor tissue but not present in the normal control are advanced as mutation candidates [7–10]. However, while it is possible to establish protocols for paired collection in the academic cancer center setting, sequencing a patient matched normal specimen is not part of broad oncology practice, and known cancer drivers targetable by approved or investigational therapies can usually be discerned from tumor sequencing alone from well-established databases such as COSMIC [11]. It is therefore likely that as clinical cancer sequencing becomes routine and widespread, matched normal data will not be available for the majority of cases, foreclosing a significant opportunity for novel discovery and potential future therapeutic benefit unless this limitation is overcome. Although methods have been developed to determine germline status by matching to public germline databases like dbSNP or by sequencing a large number of normal individuals to serve as surrogates for a matched normal [12], such methods cannot adequately account for rare germline variants that are private to a family or small population.

We present SGZ, a novel computational method for predicting the somatic vs. germline origin of variants discovered in cancer specimens (Fig 1) without the need for a matched normal sample. In this method, the cancer specimen is sequenced to high depth (>500x) using MPS, in our implementation by a targeted clinical assay of 394 cancer-related genes and over 3,500 genome-wide SNPs [1]. SGZ leverages the precise measurement of the allele frequencies of variants of interest offered by deep sequencing and a statistical model of genome-wide copy number and tumor/normal admixture to characterize the mutational state of the variants. The method is generally applicable to any MPS sequencing platform where the sequencing depth is sufficient, an accurate model of copy number can be created, and the tumor specimen is sufficiently admixed with the surrounding normal tissue.

Fig 1. SGZ method overview.

The SGZ pipeline is overviewed in panel A. Key components include fitting an optimal copy number model to the genome‐wide log‐ratio and minor allele frequency profiles (B), and modeling the expected allele frequencies of germline, somatic, and subclonal somatic mutations (C). In panel B, the dots in the top panel correspond to log ratios at each exon sequenced, segmented and fitted to discrete copy number levels, while the dots on the bottom panel are germline SNP minor allele frequencies. In panel C, examples of expected variant allele frequencies are shown for various scenarios of copy number and tumor purity. The expected allele frequencies are shown for germline (blue), somatic (red), and subclonal somatic (yellow).

https://doi.org/10.1371/journal.pcbi.1005965.g001

Methods

The SGZ method works as follows (Fig 1, S1 Fig): For each sample, we first execute a standard MPS variant analysis pipeline, which aligns unique sequence reads and obtains candidate mutations with associated mutant allele frequencies [1]. The pipeline also creates a genome-wide copy number profile based on coverage and allele frequencies at over 3,500 SNPs, which is segmented and modeled to estimate the overall tumor purity (p) and ploidy (Ψ), as well as the per segment copy number (C) and minor allele count (M). An overview of our copy number detection approach is shown in Fig 2. To obtain a log-ratio profile of signal intensity, aligned tumor sequence reads are normalized by dividing read depth by that of a process-matched normal control, followed by a GC-content bias correction using Lowess regression. The minor allele frequency (MAF) profile is obtained from the heterozygous genome-wide SNPs. These constitute the observed data for the statistical model.

Fig 2. Copy number detection overview.

Aligned DNA sequences of the tumor specimen are normalized against a process‐matched normal, producing log‐ratio and minor allele frequency (MAF) data. Next, whole‐genome segmentation is performed using a circular binary segmentation (CBS) algorithm on the log‐ratio data. Then, a Gibbs sampler fitted copy number model and a grid‐based model are fit to the segmented log‐ratio and MAF data, producing genome‐wide copy number estimates. Finally, the degree of fit of candidate models returned by Gibbs sampling and grid sampling are compared and the optimal model is selected by an automated heuristic.

https://doi.org/10.1371/journal.pcbi.1005965.g002

We fit the log-ratio and MAF data with a statistical model that predicts the genome-wide copy number profile. This is done in two steps. First, we use the circular binary segmentation (CBS) algorithm to divide the genome into segments of equal copy number [13]. CBS recursively divides the log-ratio data into individual segments until each segment is homogeneous, such that no further division leads to a statistically significant difference in signal level. Depending on the aneuploidy and data quality of a sample, the number of segments can range from 22 to a few hundred. Second, we use the segment-based log-ratios and MAFs to fit the statistical copy number model. Briefly, if S_i is a genomic segment, let l_i be its length and C_i be its copy number. The tumor ploidy Ψ of the sample is the length-weighted mean copy number, $\Psi = \sum_i l_i C_i / \sum_i l_i$. If r_i is the random variable representing the median-normalized log-ratio coverage of all exons within S_i, and p is the tumor purity, we model r_i as a Normal distribution:

$$r_i \sim N\left(\log_2 \frac{p\,C_i + 2(1-p)}{p\,\Psi + 2(1-p)},\ \sigma_{r_i}\right) \qquad (1)$$

where σ_ri is the SD of the log-ratio data in segment S_i, reflecting the noise observed. Similarly, if the random variable f_i represents the MAF of SNPs within segment S_i, M_i the copy number of minor alleles in S_i, distributed as integer 0 ≤ M_i ≤ C_i/2, and σ_fi the SD of the SNP data at segment S_i, we model f_i as:

$$f_i \sim N\left(\frac{p\,M_i + (1-p)}{p\,C_i + 2(1-p)},\ \sigma_{f_i}\right) \qquad (2)$$
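The expected values at the center of this model can be sketched in a few lines. The sketch below assumes the usual purity-weighted mixture of tumor and diploid normal cells, i.e. an expected log-ratio of log2((pC + 2(1−p)) / (pΨ + 2(1−p))) and an expected MAF of (pM + (1−p)) / (pC + 2(1−p)); the function names are illustrative and are not taken from the SGZ software.

```python
import math

def ploidy(lengths, copy_numbers):
    """Length-weighted mean copy number across segments (tumor ploidy)."""
    total = sum(lengths)
    return sum(l * c for l, c in zip(lengths, copy_numbers)) / total

def expected_log_ratio(p, c, psi):
    """Expected median-normalized log2 ratio for a segment of copy number c,
    at tumor purity p and tumor ploidy psi (assumed mixture model)."""
    return math.log2((p * c + 2 * (1 - p)) / (p * psi + 2 * (1 - p)))

def expected_maf(p, c, m):
    """Expected SNP minor allele frequency for a segment with total copy
    number c and minor allele count m (assumed mixture model)."""
    return (p * m + (1 - p)) / (p * c + 2 * (1 - p))
```

For instance, a balanced diploid segment (C = 2, M = 1) in a diploid tumor has an expected log-ratio of 0 and an expected MAF of 0.5 at any purity, whereas a segment with C = 3 and M = 1 at 50% purity has an expected MAF of 0.4.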

Given this model of the log-ratio and MAF, a two-step approach is used to find the optimal fit of model parameters C_i and M_i at each segment, as well as the genome-wide model parameters tumor purity (p) and ploidy (Ψ). First, an initial fit is assessed using the JAGS software package [14], a Gibbs sampling based Markov Chain Monte Carlo algorithm. Assuming a sample has 200 segments after segmentation, the total number of parameters is more than 400. Based on our pipeline design, there are around 10,000 observed SNPs and 50,000 observed median-normalized log-ratios. After checking the convergence of all parameters, the following key MCMC parameters are employed: sampling size at 500, burn-in size at 500, thinning interval at 1 and 9 chains. Second, a grid-based method is used to find alternative solutions that can also fit the model [15]. The grid-based method evaluates the mean-squared-error between the measured and the expected copy numbers, over a grid of different tumor purity and ploidy. All local minima in the grid are considered as model candidates.
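The grid-based step can be illustrated with a simplified sketch. It assumes that the "measured" copy number of a segment is obtained by inverting the expected log-ratio relation for a given purity and ploidy, and that the "expected" copy number is the nearest non-negative integer; segment lengths and MAF data are ignored for brevity. These choices, and all function names, are assumptions made for illustration and may differ from the method of [15] and from the authors' pipeline.

```python
import math

def expected_log_ratio(c, p, psi):
    """Assumed mixture model for the log2 ratio of a segment with copy number c."""
    return math.log2((p * c + 2 * (1 - p)) / (p * psi + 2 * (1 - p)))

def implied_copy_number(r, p, psi):
    """Invert the log-ratio relation: copy number implied by r at (p, psi)."""
    return (2 ** r * (p * psi + 2 * (1 - p)) - 2 * (1 - p)) / p

def grid_mse(log_ratios, purities, ploidies):
    """MSE between implied and nearest-integer copy numbers on a (p, psi) grid."""
    grid = {}
    for i, p in enumerate(purities):
        for j, psi in enumerate(ploidies):
            total = 0.0
            for r in log_ratios:
                c = implied_copy_number(r, p, psi)
                total += (c - max(0, round(c))) ** 2
            grid[(i, j)] = total / len(log_ratios)
    return grid

def local_minima(grid):
    """Grid cells whose MSE is no larger than that of any adjacent cell."""
    minima = []
    for (i, j), value in grid.items():
        neighbours = [grid[(a, b)]
                      for a in (i - 1, i, i + 1)
                      for b in (j - 1, j, j + 1)
                      if (a, b) != (i, j) and (a, b) in grid]
        if all(value <= n for n in neighbours):
            minima.append((i, j))
    return minima

# Synthetic example: four equal-length segments, purity 0.6, ploidy 2.5.
purities = [k / 10 for k in range(1, 10)]
ploidies = [k / 10 for k in range(15, 41)]
log_ratios = [expected_log_ratio(c, 0.6, 2.5) for c in (2, 3, 1, 4)]
grid = grid_mse(log_ratios, purities, ploidies)
candidates = local_minima(grid)
```

On this noise-free synthetic input the simulated purity and ploidy appear among the candidates; other local minima on the grid illustrate why several candidate models have to be compared.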

The goodness of fit of all copy number models returned by Gibbs sampling and grid sampling is assessed by the mean squared error (MSE) of the log-ratios of all segments and the MSE of the MAFs of all segments. The Gibbs model is the default optimal model and is compared to the grid-based copy number models at the first three local minima. A grid-based copy number model is selected as the final optimal model if it meets all of the following five requirements: 1) the MSE of log-ratios and the MSE of MAFs are both reduced; 2) the ploidy is higher than 1.2; 3) the model does not have excessive copy number loss events (CN = 0); 4) it is not a more complex model, which is defined by a higher ploidy delta of at least 1.1 and a lower purity delta of at least 0.1; 5) it is not a high purity sample (predicted purity > 0.99) unless an independent high purity prediction algorithm agrees.

Given the output of the copy number model, each variant’s measured AF is compared to expectation at its local segment i: $AF_{germline} = \frac{p V_i + 1 - p}{p C_i + 2(1-p)}$ vs. $AF_{somatic} = \frac{p V_i}{p C_i + 2(1-p)}$, where V_i is the variant allele count in the tumor, which can be either M_i or C_i − M_i. To determine whether a variant is predicted somatic, germline, or ambiguous, we used the following statistical model. Define y := (n, f), where y is the variant data comprising read depth n and allele frequency f; G := germline hypothesis; and S := somatic hypothesis. Given the germline hypothesis G, the probability of y is obtained using the 2-tailed binomial test P(y|G; AF_germline) = Bin(nf, n, AF_germline). Given the somatic hypothesis S, the probability of y is obtained using the 2-tailed binomial test P(y|S; AF_somatic) = Bin(nf, n, AF_somatic). A variant is predicted somatic if P(y|S; AF_somatic) > α and P(y|G; AF_germline) ≤ α. A variant is predicted germline if P(y|S; AF_somatic) ≤ α and P(y|G; AF_germline) > α. A variant is predicted subclonal somatic if P(y|S; AF_somatic) ≤ α, P(y|G; AF_germline) ≤ α, and f < AF_somatic / 1.5. Subclonal somatic predictions are made only in samples with a tumor purity of greater than 20%. A variant is declared ambiguous and not called if none of the conditions above holds. The variable α is set to 0.01. All possible prediction outcomes are enumerated in S2A Fig, with an example sample shown in S2B Fig.
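The decision rule can be sketched as follows. The sketch assumes expected allele frequencies of the form AF_germline = (pV + 1 − p) / (pC + 2(1 − p)) and AF_somatic = pV / (pC + 2(1 − p)), takes a single candidate V as input, and computes the 2-tailed binomial p-value as the total probability of outcomes no more likely than the observed one. That definition of the 2-tailed test, and the function names, are assumptions for illustration rather than details confirmed by the SGZ software.

```python
import math

def binom_two_tailed(k, n, q):
    """Two-tailed binomial p-value: total probability of outcomes that are
    no more likely than observing k successes in n trials at rate q."""
    if q <= 0.0:
        return 1.0 if k == 0 else 0.0
    if q >= 1.0:
        return 1.0 if k == n else 0.0
    def pmf(i):
        log_p = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                 + i * math.log(q) + (n - i) * math.log(1 - q))
        return math.exp(log_p)
    probs = [pmf(i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in probs if x <= cutoff))

def expected_afs(p, c, v):
    """Expected germline and somatic AF for variant allele count v in the tumor."""
    denom = p * c + 2 * (1 - p)
    return (p * v + 1 - p) / denom, p * v / denom

def classify_variant(n, f, p, c, v, alpha=0.01):
    """Apply the somatic / germline / subclonal / ambiguous rule for one variant."""
    af_germline, af_somatic = expected_afs(p, c, v)
    k = round(n * f)  # number of variant-supporting reads
    p_germline = binom_two_tailed(k, n, af_germline)
    p_somatic = binom_two_tailed(k, n, af_somatic)
    if p_somatic > alpha and p_germline <= alpha:
        return "somatic"
    if p_somatic <= alpha and p_germline > alpha:
        return "germline"
    if (p_somatic <= alpha and p_germline <= alpha
            and f < af_somatic / 1.5 and p > 0.20):
        return "subclonal somatic"
    return "ambiguous"
```

As a usage example, at 30% purity in a balanced diploid segment (C = 2, V = 1) the expected germline and somatic AFs are 0.50 and 0.15, so at 500x depth a variant observed at AF 0.50 is called germline, one at 0.15 somatic, and one at 0.05 subclonal somatic.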

Similar to prior studies [15–18], the SGZ method classifies the tumor zygosity of the mutation (homozygous vs. heterozygous) or predicts that the mutation resides in a minor subclone. A variant in the tumor is classified as homozygous if all copies in the tumor carry the mutant allele (V = C and V ≠ 0), heterozygous if both the reference and the mutant alleles are present (V ≠ C and V ≠ 0), and not in tumor if the tumor only carries the reference (V = 0, applicable only to germline variants). A somatic mutation is further classified as subclonal if the allele frequency is significantly less than the lowest expected allele frequency.
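The zygosity rules translate directly into code. The sketch below also enumerates the candidate (origin, V) states for a segment with copy number C and minor allele count M, taking V to be either M or C − M; the function names are hypothetical and the enumeration is an illustration, not a reproduction of S2A Fig.

```python
def zygosity(v, c):
    """Tumor zygosity for variant allele count v at total copy number c."""
    if v == 0:
        return "not in tumor"  # only meaningful for germline variants
    if v == c:
        return "homozygous"
    return "heterozygous"

def candidate_states(c, m):
    """Candidate (origin, V, zygosity) states for a segment with copy
    number c and minor allele count m, where V is m or c - m."""
    states = []
    for v in sorted({m, c - m}):
        for origin in ("germline", "somatic"):
            if v == 0 and origin == "somatic":
                continue  # a somatic variant must be present in the tumor
            states.append((origin, v, zygosity(v, c)))
    return states
```

For a copy number 3 segment under LOH (M = 0), the candidates are a germline variant absent from the tumor and a germline or somatic homozygous variant; for a balanced diploid segment (C = 2, M = 1) only heterozygous states remain.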

Results

Method validation datasets

We validated SGZ in three different ways, including (1) specimens with matched normal where the true origin of all alterations was known, (2) cell-line admixtures that modeled the impact of varying tumor purity on the inference, and (3) a large set of clinical FFPE specimens with known somatic drivers where real-world somatic variant recovery was assessed.

The first dataset consisted of 87 specimens from 30 non-small cell lung and colon cancer patients, where for each patient we studied three samples: the primary tumor, a metastatic site, and adjacent-tissue matched normal (S3 Table). All DNA was extracted from fresh-frozen clinical specimens. The primary and metastatic tumors uniformly contained a mixture of malignant and benign epithelial, stromal and inflammatory cells. The gold standard origin of a mutation was established by the following rules: a variant that appeared in the matched normal with significant allele frequency was considered germline, and tumor-only variants were called somatic. In several samples, low levels of tumor had infiltrated the matched normal sample, so that the normal sample carried mutations at allele frequencies <10%. These were regarded as somatic mutations. A total of 330 unique variants were detected and evaluated, including 70% (N = 231) germline and 30% (N = 99) somatic according to the gold standard. SGZ was applied to the primary and metastatic tumor samples to make somatic/germline predictions. DNA from the 30 non-small cell lung and colon cancer patients was obtained from Institute Gustave Roussy [19, 20].

To assess the robustness of the method to different levels of tumor purity, we examined three cancer cell lines (HCC-1937, HCC-1954, & NCI-H1395), which were titrated with their matched lymphoblastoid normal to six levels of tumor purity (10%, 20%, 30%, 40%, 50%, 75%). A total of 42 unique variants were detected by our pipeline and used for validation (S4 Table).

The third dataset comprised 20,182 clinical FFPE tissue samples sent to Foundation Medicine for FoundationOne testing. The samples covered a variety of tumor types and originated from a wide diversity of cancer centers and community oncology practices. To evaluate SGZ predictions of germline/somatic origin, we examined predictions at 17 known somatic hotspot mutations (e.g. BRAF V600, KRAS G12) and 20 common germline SNPs. To assess SGZ predictions of tumor zygosity, we selected the most frequently mutated somatic variants at oncogenes (BRAF, EGFR, IDH1, KRAS, NRAS, PIK3CA) and tumor suppressor genes (TP53, RB1, PTEN) for analysis. To assess the ability of SGZ to detect subclonal mutations, we examined EGFR T790M, a common subclonal tyrosine kinase inhibitor resistance mutation, in all the non-small cell lung samples in this dataset (N = 69). The FoundationOne assay platform, its clinical application, and an early description of the cohort genomics are described in Frampton et al. [1].

Method validation results

To demonstrate the importance of taking into account the genome-wide copy number profile for somatic/germline prediction, we applied SGZ to the three validation datasets and compared SGZ to a method that does not take tumor aneuploidy into account (referred to as “basic method”), in which a variant is classified as germline if its mutation frequency is near 50% or 100%, or otherwise is classified as somatic [21] (S1 Method).

SGZ yielded somatic vs. germline calls for 85% of variants in the lung and colon samples, 83% of variants in the three cell line admixtures, and 84% of the 17 somatic hotspot mutations and 20 common germline variants in the 20,182 Foundation Medicine clinical samples. Among these calls, 95%, 97% and 96% of the somatic mutations were predicted correctly, respectively; 99%, 97%, and 97% of the germline mutations were predicted correctly, respectively. In contrast, the basic method was able to make predictions for 100% of the variants in the three datasets, but predicted somatic variants correctly only 67%, 92% and 95% of the time, and germline variants correctly 87%, 41% and 51% of the time, which is significantly lower than the accuracy of SGZ. Importantly, in none of the three datasets did the basic method achieve satisfactory performance for germline and somatic mutations simultaneously (Table 1, S1 Table). In the cell line dataset, out of a total of 184 short variants that were correctly classified by SGZ, 63 were incorrectly classified by the basic method due to local copy number deviation from 2 and/or zygosity deviation from the heterozygous state, strongly suggesting that copy number variation must be taken into account in order to make accurate predictions (S5 Table, S3 Fig).

Table 1. Validation of somatic and germline predictions.

https://doi.org/10.1371/journal.pcbi.1005965.t001

SGZ had a no-call rate in around 15% of mutations in the lung and colon samples and the Foundation Medicine clinical dataset due to multiple factors (Fig 3), including excessively high tumor purity (>95%), gross deviations of the copy number model at the variant site, observed mutation AF compatible with both somatic and germline AF expectations, and observed AF outside of both somatic and germline expectations.

Fig 3. Breakdown of no-calls made by SGZ.

Reasons behind no-calls made by SGZ are shown for (left) all variants in 30 lung and colon samples and (right) 17 somatic hotspot mutations and 20 common germline variants within 20,182 clinical samples.

https://doi.org/10.1371/journal.pcbi.1005965.g003

To characterize the performance of SGZ as a function of tumor purity, we captured the call rate and prediction accuracy of SGZ at each tumor purity level in the cell-line dataset (Table 2). Overall, the call rate is between 75% and 94%, and the prediction accuracy ranges from 88% to 100%. As expected, the call rate at 10% tumor purity is the highest among all dilution levels, due to the large difference between expected germline and somatic AF (S1 Fig). Though not available in this dataset, it is expected that the call rate would rapidly drop to 0% as the tumor purity exceeds 90%, due to much smaller differences between somatic and germline variant AF expectations. For germline and somatic prediction accuracy, a high level of accuracy is maintained from 10% through 75% tumor purity. It is also expected that the prediction accuracy would drop as tumor content exceeds 90%.

Table 2. SGZ performance as a function of tumor purity in the cell line dataset.

https://doi.org/10.1371/journal.pcbi.1005965.t002

To assess SGZ predictions of tumor zygosity, we examined data from the most frequently mutated somatic variants at oncogenes (BRAF, EGFR, IDH1, KRAS, NRAS, PIK3CA) and tumor suppressor genes (TP53, RB1, PTEN) in the Foundation Medicine clinical sample set. Alterations in oncogenes are expected to be mostly heterozygous, as a single mutation is required for activation, whereas the tumor suppressor genes are expected to have the first functional copy inactivated via mutation, and the second inactivated through loss-of-heterozygosity (LOH) [22]. Our predictions of tumor zygosity are concordant with the roles these genes play: TP53 and RB1 were determined to have >90% of mutations under LOH, while BRAF V600 and KRAS G12 mutations showed no significant enrichment for LOH (Table 3).

Table 3. Tumor zygosity predictions of somatic mutations in 20,182 clinical samples.

https://doi.org/10.1371/journal.pcbi.1005965.t003

To assess our ability to detect subclonal mutations, we examined EGFR T790M in the non-small cell lung carcinoma subset of our dataset, where the mutation would be expected to occur in tyrosine kinase inhibitor resistant subclones. Indeed, we discovered a significant enrichment of subclonal somatic vs. somatic heterozygous/homozygous calls for T790M, with a ratio of 1.5 (41/28), compared to a ratio of only 0.24 (1043/4282) for the 17 somatic hotspot mutation sites. The SGZ method was also applied to predict the clonality and zygosity of 12 ESR1 mutations in estrogen receptor positive breast cancer biopsies and determined these ESR1 mutations to be somatically acquired, clonal biomarkers of endocrine resistance [23].

Cancer database application

Despite best efforts of cancer investigators leveraging matched normal controls, germline variants may erroneously get nominated and recorded as somatic mutations in the literature and public catalogues of somatic variation, due to the challenges inherent in large scale sequencing studies and MPS data analysis. These variants may divert scarce resources needed for functional follow-up or potentially mislead therapeutic choice if pursued clinically. It would thus be beneficial if these false somatic variants could be collectively flagged and potential interpretation and application corrected.

To discover mutations that may be misclassified as somatic in a public database, we applied the SGZ method to the 20,182 clinical specimens to identify variants predicted to be germline but annotated in COSMIC (v62) as somatic. To confidently call a variant as germline, we required germline predictions in multiple specimens and obtained p-values using a binomial model of the SGZ error rate: we tabulated the number of somatic, germline, and ambiguous predictions for each variant and obtained P(S|n_G, n_S), the probability of a variant being somatic given n_G germline calls and n_S somatic calls. Using Bayes' rule and a flat prior, i.e. P(G) = P(S) = 0.5,

$$P(S \mid n_G, n_S) = \frac{P(n_G, n_S \mid S)}{P(n_G, n_S \mid S) + P(n_G, n_S \mid G)}.$$

Multiple observations were modeled as binomial distributions: $P(n_G, n_S \mid S) = \mathrm{Bin}(n_G,\ n_G + n_S,\ e_G)$ and $P(n_G, n_S \mid G) = \mathrm{Bin}(n_S,\ n_G + n_S,\ e_S)$, with e_G as the single-sample germline error rate, i.e. the probability of SGZ making an error given that a germline prediction is made, and e_S as the single-sample somatic error rate. We used conservative parameters e_G = 0.05 and e_S = 0.10, which are higher than the error rates from Fig 2A. P(G|n_G, n_S) can be readily obtained as 1 − P(S|n_G, n_S).
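The cohort-level calculation can be sketched as below. It assumes that, under the somatic hypothesis, the n_G germline calls are errors occurring at rate e_G among the n_G + n_S calls made, and that under the germline hypothesis the n_S somatic calls are errors at rate e_S; ambiguous predictions are ignored. This reading of the binomial model and the function names are assumptions for illustration.

```python
import math

def binom_pmf(k, n, q):
    """Binomial probability of k events in n trials at rate q."""
    return math.comb(n, k) * q ** k * (1 - q) ** (n - k)

def posterior_somatic(n_g, n_s, e_g=0.05, e_s=0.10):
    """P(S | n_G, n_S) under a flat prior P(G) = P(S) = 0.5."""
    n = n_g + n_s
    like_somatic = binom_pmf(n_g, n, e_g)   # germline calls are errors if truly somatic
    like_germline = binom_pmf(n_s, n, e_s)  # somatic calls are errors if truly germline
    return like_somatic / (like_somatic + like_germline)

def posterior_germline(n_g, n_s, e_g=0.05, e_s=0.10):
    """P(G | n_G, n_S) = 1 - P(S | n_G, n_S)."""
    return 1.0 - posterior_somatic(n_g, n_s, e_g, e_s)
```

With 45 germline calls and no somatic calls, the posterior probability of a somatic origin under these assumptions is vanishingly small. For very large cohorts the two likelihoods can underflow to zero, in which case a log-space implementation would be needed.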

Table 4 shows the top 10 variants present in COSMIC but strongly predicted by our method to be germline. Each variant was predicted germline in at least 45 samples. Although 9 of 10 variants were annotated as confirmed somatic, the number of database entries was low (≤4) for every variant, reinforcing that the somatic annotation is likely inaccurate. Further evidence of germline origin was that most variants had an entry in dbSNP, though few were classified as common SNPs. The full list of seventy COSMIC variants predicted to be germline is given in S2 Table.

Table 4. Likely somatic status mis-annotation in COSMIC, predicted by SGZ to be germline in multiple samples in Foundation Medicine sample set†.

https://doi.org/10.1371/journal.pcbi.1005965.t004

Discussion

The SGZ method leverages deep MPS to predict variant somatic and germline origin without a matched normal control. While the definitive approach for discovery of novel somatic mutations includes sequencing a patient matched normal, SGZ supports functional prioritization and interpretation of alterations discovered on routine testing performed with tumor alone and can enable assay development and clinical research.

There are several limitations of the SGZ method. Samples must have adequate admixture of the surrounding normal tissue. The exact mixture requirement depends on sequencing depth, which is considered in our statistical model on a per mutation basis, but given our coverage depth of >500X, somatic versus germline calling generally requires at least 10% normal tissue, i.e. tumor content under 90%. This held for 97% of solid tumor clinical cancer specimens that we sequenced. Zygosity calling required estimated tumor purity to be at least 20%, which held for 76% of our sample set.

Accuracy of the copy number model is likewise important. Minor misfit of the model can lead to an elevated rate of no calls, and major misfit of the model can lead to misclassification of somatic versus germline status, especially when tumor content is high, where the expected difference between germline and somatic allele frequency is reduced (S1 Fig). However, in copy number modeling, a key subset of copy number models is mathematically equivalent in terms of SGZ predictions, which improves robustness (S1 Note). Additionally, in low tumor content samples, the differences in expected allele frequencies between germline and somatic mutations are large, hence more robust to deviations in copy number model.

As shown in S1 Fig, there are also limited scenarios where the differences in expected allele frequencies between germline and somatic mutations are small, hence a prediction cannot be made. For example, a mutation with measured allele frequency of 33% in a genomic region with copy number 3 and LOH is equally likely to be either “germline and not in tumor” or “somatic and homozygous”. Finally, there is a scenario in which a subclonal somatic mutation produces an allele frequency equivalent to the expected germline frequency, misclassifying the mutation as germline. In practice, this is rare.
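The 33% example can be checked by hand, assuming expected allele frequencies of the form AF_germline = (pV + 1 − p) / (pC + 2(1 − p)) and AF_somatic = pV / (pC + 2(1 − p)). For a copy number 3 segment under LOH (C = 3, M = 0), "germline and not in tumor" corresponds to V = 0 and "somatic and homozygous" to V = 3:

```latex
\begin{align*}
AF_{\text{germline}}(V=0) &= \frac{1-p}{3p + 2(1-p)} = \frac{1-p}{2+p}, \\
AF_{\text{somatic}}(V=3) &= \frac{3p}{3p + 2(1-p)} = \frac{3p}{2+p}, \\
AF_{\text{germline}} = AF_{\text{somatic}} &\iff 1 - p = 3p \iff p = 0.25, \\
\text{at which both equal } & \frac{0.75}{2.25} = \frac{1}{3} \approx 33\%.
\end{align*}
```

Under these assumed forms, the two hypotheses coincide exactly at a tumor purity of 25%, and they remain statistically indistinguishable at nearby purities whenever the gap between the two expected frequencies is small relative to the sampling noise at the available read depth.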

Despite these limitations, SGZ achieved high accuracy in validation studies, reaching call rates of 85% and accuracy of 95–99% when applied to individual samples. Importantly, for recurrent mutations (the typical focus of cancer studies and clinical research), a key way to improve performance is to apply SGZ to a large cohort of samples and tabulate, for each recurrent mutation, the number of times a germline or somatic prediction is made. This information can be used to annotate variants for which somatic/germline status is unknown or in doubt. When applied over a large cohort of samples, SGZ can aid in the discovery of novel recurrent somatic mutations, along with their clonality and zygosity status [23, 24]. Conversely, SGZ can also identify germline variants not yet catalogued in public databases such as dbSNP and flag them for exclusion from further consideration as cancer drivers. In this report, we describe the computational approach, which may be implemented on any cancer deep sequencing platform with copy number modeling support, and provide both the methodology and a detailed worksheet to ease implementation (S1 Fig). We also apply the method to generate a proposed re-annotation of a large number of variants currently believed to be somatic, in the hope of improving the reliability of publicly available cancer information. Ultimately, the application of SGZ may inform clinical decision making and expand treatment choices for cancer patients.

Supporting information

S1 Method. Basic somatic/germline prediction comparator method.

https://doi.org/10.1371/journal.pcbi.1005965.s001

(PDF)

S2 Method. Finding COSMIC variants in dbSNP.

https://doi.org/10.1371/journal.pcbi.1005965.s002

(PDF)

S1 Fig. Table of expected mutational allele frequencies.

https://doi.org/10.1371/journal.pcbi.1005965.s003

(PDF)

S2 Fig. All possible SGZ prediction outcomes and an example of a cancer specimen across the genome.

https://doi.org/10.1371/journal.pcbi.1005965.s004

(PDF)

S3 Fig. Exemplar high-aneuploidy NCI-H1395 cell line diluted with matched normal to 50% tumor purity, illustrating the advantage of SGZ's use of a copy number model for making correct germline/somatic predictions compared to the basic method.

https://doi.org/10.1371/journal.pcbi.1005965.s005

(PDF)

S1 Table. Somatic hotspot mutations and germline polymorphisms used for SGZ validation.

https://doi.org/10.1371/journal.pcbi.1005965.s006

(PDF)

S2 Table. List of variants in COSMIC predicted to be germline.

https://doi.org/10.1371/journal.pcbi.1005965.s007

(PDF)

S3 Table. Summary of 84 samples from 30 non-small cell lung and colon cancer patients.

https://doi.org/10.1371/journal.pcbi.1005965.s008

(PDF)

S4 Table. Mutations from cell line dataset that were detected by our pipeline and used for SGZ validation.

https://doi.org/10.1371/journal.pcbi.1005965.s009

(PDF)

S5 Table. Mutations incorrectly classified by the basic method and correctly classified by SGZ in regions of copy number change in the cell line dataset.

https://doi.org/10.1371/journal.pcbi.1005965.s010

(PDF)

S1 Note. Equivalence of a subset of SGZ solutions to copy number model fitting.

https://doi.org/10.1371/journal.pcbi.1005965.s011

(PDF)

References

1. Frampton GM, Fichtenholtz A, Otto GA, Wang K, Downing SR, He J, et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol. 2013;31(11):1023–31. pmid:24142049
2. Stephens PJ, Tarpey PS, Davies H, Van Loo P, Greenman C, Wedge DC, et al. The landscape of cancer genes and mutational processes in breast cancer. Nature. 2012;486(7403):400–4. pmid:22722201
3. Cancer Genome Atlas Research Network. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014;513(7517):202–9. pmid:25079317
4. Kanchi KL, Johnson KJ, Lu C, McLellan MD, Leiserson MD, Wendl MC, et al. Integrated analysis of germline and somatic variants in ovarian cancer. Nat Commun. 2014;5:3156. pmid:24448499
5. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155(1):27–38. pmid:24074859
6. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–8. pmid:23770567
7. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22(3):568–76. pmid:22300766
8. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012;28(3):311–7. pmid:22155872
9. Li A, Liu Y, Zhao Q, Feng H, Harris L, Wang M. Genome-wide identification of somatic aberrations from paired normal-tumor samples. PLoS One. 2014;9(1):e87212. pmid:24498045
10. Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics. 2012;28(7):907–13. pmid:22285562
11. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011;39(Database issue):D945–50. pmid:20952405
12. Hiltemann S, Jenster G, Trapman J, van der Spek P, Stubbs A. Discriminating somatic and germline mutations in tumor DNA samples without matching normals. Genome Res. 2015;25(9):1382–90. pmid:26209359
13. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5(4):557–72. pmid:15475419
14. Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). 2003.
15. Van Loo P, Nordgard SH, Lingjaerde OC, Russnes HG, Rye IH, Sun W, et al. Allele-specific copy number analysis of tumors. Proc Natl Acad Sci U S A. 2010;107(39):16910–5. pmid:20837533
16. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413–21. pmid:22544022
17. Li Y, Xie X. Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity. Bioinformatics. 2014;30(15):2121–9. pmid:24695406
18. Rasmussen M, Sundstrom M, Goransson Kultima H, Botling J, Micke P, Birgisson H, et al. Allele-specific copy number analysis of tumor samples with aneuploidy and tumor heterogeneity. Genome Biol. 2011;12(10):R108. pmid:22023820
19. Vignot S, Frampton GM, Soria JC, Yelensky R, Commo F, Brambilla C, et al. Next-generation sequencing reveals high concordance of recurrent somatic alterations between primary tumor and metastases from patients with non-small-cell lung cancer. J Clin Oncol. 2013;31(17):2167–72. pmid:23630207
20. Vignot S, Lefebvre C, Frampton GM, Meurice G, Yelensky R, Palmer G, et al. Comparative analysis of primary tumour and matched metastases in colorectal cancer patients: evaluation of concordance between genomic and transcriptional profiles. Eur J Cancer. 2015;51(7):791–9. pmid:25797355
21. Jones S, Anagnostou V, Lytle K, Parpart-Li S, Nesselbush M, Riley DR, et al. Personalized genomic analyses for cancer mutation discovery and interpretation. Sci Transl Med. 2015;7(283):283ra53. pmid:25877891
22. Bieging KT, Mello SS, Attardi LD. Unravelling mechanisms of p53-mediated tumour suppression. Nat Rev Cancer. 2014;14(5):359–70. pmid:24739573
23. Jeselsohn R, Yelensky R, Buchwalter G, Frampton G, Meric-Bernstam F, Gonzalez-Angulo AM, et al. Emergence of constitutively active estrogen receptor-alpha mutations in pretreated advanced estrogen receptor-positive breast cancer. Clin Cancer Res. 2014;20(7):1757–67. pmid:24398047
24. Ross JS, Wang K, Gay LM, Al-Rohil RN, Nazeer T, Sheehan CE, et al. A high frequency of activating extracellular domain ERBB2 (HER2) mutation in micropapillary urothelial carcinoma. Clin Cancer Res. 2014;20(1):68–75. pmid:24192927
