Introduction
The study of the contribution of germline genetic variation to disease risk or treatment outcome typically requires blood or saliva samples as a source of constitutive DNA. Depending on the phenotype studied, such samples may not be banked and readily available. Samples may have to be prospectively collected, which hinders studies requiring long-term follow-up, or obtained after contacting potential subjects of interest, which can be logistically and ethically challenging, or impossible if a patient has relocated or died. In cancer research, there is a growing interest in directly profiling tumor tissue to obtain germline measures such as ancestry, polygenic risk, and HLA-typing [1]. Array-based genotyping followed by imputation from a reference population has been a standard method to genotype genome-wide SNPs, but its compatibility with DNA obtained from archival tissue specimens remains to be established [2, 3]. The approach can be challenging when the amount of tissue available for research is limited, which is often the case with surgical excisions of premalignant lesions or with most needle biopsies.
Recently, low-coverage whole-genome sequencing (lc-WGS) has emerged as an attractive alternative to single nucleotide polymorphism (SNP) arrays by offering higher throughput at a reduced cost, reduced DNA input, and improved genotyping accuracy [4–6]. In fact, recent studies have shown the feasibility of using frozen tissue for germline profiling by imputing genotypes from off-target reads repurposed from tumor-targeted panel sequencing data, effectively equivalent to ultra-low coverage (less than 0.1×) whole-genome sequencing [1]. It is therefore likely that, in the absence of available targeted sequencing data, lc-WGS can be performed with DNA of lower quality and quantity to enable the imputation of germline variants from archival tissue specimens.
If accurate, such an approach could have important implications for the study of the contribution of inherited risk factors to the progression of premalignant disease. For many cancer types, the widespread adoption of cancer screening has led to an increase in the detection of premalignant lesions. Despite such efforts, screening has had limited impact on overall survival [7]. Clinical guidelines vary widely, from watchful waiting or biopsy, as for prostatic intraepithelial neoplasia, to surgery and adjuvant treatment, as for ductal carcinoma in situ (DCIS) of the breast [8, 9]. In the absence of reliable progression risk biomarkers and models, these interventions may have deleterious consequences at the two clinical extremes: delay in life-saving treatment or complications from overtreatment. DCIS is the most common breast cancer-related diagnosis, comprising ~20% of annual cases in the U.S. [10]. In breast disease, factors that impact the risk of a breast cancer subsequent event (BCSE), defined as an in situ or invasive breast neoplasm developing at least 6 months after treatment of a DCIS diagnosis, include age, lesion size and grade, hormone receptor status, and molecular profile. Their combined use in risk models such as the University of Southern California/Van Nuys Prognostic Index has not produced a reliable BCSE risk prediction model, and additional, more in-depth molecular and histological characterization is needed [11–16].
Given the independence between DCIS and associated BCSE in upwards of 20% of cases, as evidenced by molecular studies comparing genomic profiles of initial DCIS and subsequent ipsilateral BCSE, systemic risk factors need to be considered in addition to those related to the index lesion [17]. While penetrant germline pathogenic variants in breast cancer susceptibility genes such as BRCA1, BRCA2, CHEK2, PALB2, and PMS2 exist and represent strong risk factors, they are present in only 1.5% of all women [18]. Meanwhile, population-based genome-wide association studies (GWAS) have identified multiple common variants associated with lifetime risk of invasive breast cancer (IBC) [2, 19]. The same SNPs have also been associated with risk of DCIS, demonstrating the shared genetic susceptibility for IBC and DCIS [20]. It is, however, unclear whether these SNPs are also associated with DCIS progression. Polygenic risk scores (PRS) derived from the allelic burden of risk-associated SNPs are now being added to common breast cancer risk models, significantly improving their performance, with individuals in the top percentile having a three- to fivefold increase in lifetime risk relative to women with risk scores in the middle quintile of those studied [21, 22]. It is thus possible that DCIS patients with elevated breast cancer PRS are also at higher risk of BCSE and that the addition of PRS could improve DCIS prognostic models, as it has for lifetime breast cancer risk models. Since BCSE can occur years after the initial DCIS diagnosis and is uncommon (observed in 10 to 25% of patients after 10 years, depending on treatment and known risk factors), a retrospective study is much more feasible for the purposes of validation [23, 24]. Formalin-fixed paraffin-embedded (FFPE) tissue (referred to as archival tissue) from the DCIS biopsy or resection is therefore the only source of genetic material available, and its validity for genome-wide genotyping of germline variants would be critical to the feasibility of such a study.
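To make the notion of a PRS as an allelic burden concrete, the following minimal sketch computes a score as the weighted sum of risk-allele dosages. It is an illustration under stated assumptions rather than the scoring pipeline used in this study: the function name, the SNP identifiers, and the weights are hypothetical, and the weights are assumed to be per-allele log odds ratios taken from a published score.

```python
from typing import Mapping


def polygenic_risk_score(dosages: Mapping[str, float],
                         weights: Mapping[str, float]) -> float:
    """Return the weighted sum of risk-allele dosages.

    dosages: risk-allele dosage per SNP (0 to 2; imputed values may be fractional).
    weights: per-allele effect size per SNP (assumed to be log odds ratios).
    Only SNPs present in both inputs contribute to the score.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[snp] * dosages[snp] for snp in shared)


# Hypothetical SNP identifiers and weights, for illustration only.
weights = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}
dosages = {"snp_a": 2.0, "snp_b": 1.0, "snp_c": 0.0}
score = polygenic_risk_score(dosages, weights)
```

In practice, raw scores are usually standardized against a reference distribution before individuals are assigned to percentiles; that step is omitted here.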
Here we evaluate the validity of repurposing archival tissue specimens for germline genetic studies. We performed lc-WGS and imputed genotypes for 10 pairs of matching blood and tumor tissue samples to benchmark the accuracy of calling genome-wide genotypes and HLA haplotypes and of implementing PRS. The results indicate that germline genotypes and haplotypes obtained from archival tissue DNA are highly accurate. Using this methodology, we present a use case measuring breast cancer PRS in 36 DCIS patients and explore its association with BCSE.
Discussion
Here we rethink the traditional design of germline genetic studies by asking: when typical DNA sources such as blood, saliva or urine are unavailable, can we extract the same information from archival tissue specimens? Often collected for histological examination and diagnosis and then stored indefinitely, these samples offer an abundant source of genetic material from patients with potentially long clinical follow-up. Using lc-WGS and recent advances in genotype imputation, we measured the concordance of germline genotypes obtained from blood DNA and archival tissue DNA in 10 different individuals. Archival tissue faithfully represented the germline profile of common SNPs obtained from blood, both at the genome-wide level and across well-established PRS. Beyond concordance at the SNP level, we also demonstrated accurate genotyping at highly polymorphic HLA alleles. To our knowledge, we present the first evidence that HLA-typing using lc-WGS from archival tissue is as accurate as clinical-grade HLA-typing. The restriction of the comparison to lung tissue has some limitations. The lung tissue was on average more recent than that used for the breast cancer study and may therefore have yielded higher-quality DNA, leading to an overestimation of the accuracy of the approach. Lower DNA quality would, however, not be expected to affect the genotypes of a particular patient, but rather to result in a larger number of non-evaluable samples due to overall assay failure. Lung cancer may also have increased mutational burden or genomic instability, resulting in a possible underestimation of the accuracy compared to breast DCIS. A more comprehensive collection of tissue types, grades, ages, and fixation conditions would be needed to fully understand the impact of histological variables on archival tissue-based genotyping. Nevertheless, our results support the future utilization of archival tissue to construct large retrospective studies to characterize the role of germline variants in disease etiology, progression, and treatment.
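The blood-versus-tissue comparison discussed above rests on a genotype concordance measure. A minimal sketch of one such measure follows, assuming hard-called genotypes coded as alternate-allele counts (0, 1, or 2) and keyed by a variant identifier. The function name and the example data are hypothetical, and the metric used in this study may differ (for example, a dosage correlation or a concordance restricted to non-reference calls).

```python
from typing import Mapping, Optional


def genotype_concordance(blood: Mapping[str, Optional[int]],
                         tissue: Mapping[str, Optional[int]]) -> float:
    """Fraction of variants called in both samples whose genotypes agree.

    Genotypes are alternate-allele counts (0, 1, 2); None marks a missing call.
    """
    shared = [v for v in blood
              if v in tissue and blood[v] is not None and tissue[v] is not None]
    if not shared:
        raise ValueError("no variants are called in both samples")
    matches = sum(blood[v] == tissue[v] for v in shared)
    return matches / len(shared)


# Hypothetical genotypes for one matched pair, for illustration only.
blood_gt = {"var1": 0, "var2": 1, "var3": 2, "var4": 1, "var5": None}
tissue_gt = {"var1": 0, "var2": 1, "var3": 1, "var4": 1, "var5": 2}
concordance = genotype_concordance(blood_gt, tissue_gt)
```

Because most common variants are homozygous reference in any one individual, an overall concordance of this kind can look high even when heterozygous calls are poorly recovered, which is why stratified or non-reference metrics are often reported alongside it.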
The use of archival tissue as a source of constitutive DNA will enable a wealth of retrospective studies by repurposing specimens archived by most clinical sites to help address the genetic underpinnings of diseases with long-term outcomes, such as the progression of pre-malignant lesions presented here. Such studies would otherwise require either long follow-up after the initial sample collection or a massive and costly effort to retrospectively collect blood or saliva samples. In contrast, provided the subjects have undergone diagnostic biopsies or surgical treatment in the course of their clinical care or study participation, their left-over specimens can be used to enable post-hoc genetic analysis. Such studies would require approval by an Institutional Review Board (IRB) and, since 2015, informed consent needs to be explicit about the use of specimens and data for genetic research and the privacy risk it entails [37]. IRBs commonly waive the requirement for consent from patients who are deceased or lost to follow-up; however, such data need to be distributed with caution and are typically protected by a Data Access Policy with which the researcher has to comply. As such, while our approach can enable large retrospective genetic studies where informed consent may be waived, the eligibility of each patient and the overall data sharing policy need to be carefully considered.
Our report includes the application of the approach to interrogate the contribution of genetic factors to breast DCIS progression. The relatively good outcome of the disease has provided little justification for a thorough collection of risk variables, especially those related to inherited risk. However, the overtreatment of DCIS and its harms are being increasingly acknowledged, and systematic reviews of clinicopathological factors have not resulted in reliable models of progression [11, 12, 38]. Most epidemiological studies need to be large due to the slow progression and rarity of poor outcomes, and rely exclusively on medical chart review [24, 39, 40]. As such, additional factors that are hard or impossible to collect from the charts, such as mammography or magnetic resonance imaging, digital pathology, or germline inherited factors, have not been as thoroughly and systematically investigated. We made the narrow hypothesis that lifetime breast cancer susceptibility (which can be seen as progression from normal to malignant epithelium) and progression of DCIS share the same genetic risk factors. We tested this hypothesis by measuring breast PRS in a small cohort of carefully selected DCIS subjects using our approach. Interestingly, the effect size observed (HR = 2) is higher than that observed for other risk factors for DCIS progression, suggesting that PRS could significantly improve previous risk models [22, 41, 42]. Thanks to the accurate PRS estimates obtained from left-over surgical specimens, we were able to see that germline variation likely contributed significantly to DCIS progression, to an extent similar to or greater than previously investigated risk factors such as grade, age, and HER2 overexpression [38]. The case–control design of the study is suboptimal and the findings would need to be validated in a larger cohort of unselected patients, where a more comprehensive set of covariates, including treatment, would be accounted for. Subsequent larger studies would also be important to evaluate competing risk models for subsequent in situ versus invasive disease, or laterality of the event, where PRS may contribute more in particular contexts. Given its modest cost and relative experimental simplicity, our approach, combined with a state-of-the-art imputation strategy, can likely be scaled up, provided diagnostic sections or left-over specimens can be found. Several large DCIS cohorts are generating mutational profiles, including some with lc-WGS and associated clinical outcomes, which would be particularly suitable for validation in the future [17, 43].
In the study of malignant progression, as well as the onset and progression of multiple other diseases, the overactivity or inactivity of the immune system represents a key factor. A large proportion of the variation in immune traits is inherited, and yet the role of this contribution in disease progression is poorly understood [44, 45]. In particular, the genetic diversity of the major histocompatibility complex (MHC), one of the most polymorphic regions of the genome, makes it challenging to study the role of the adaptive immune system. In the context of tumorigenesis, the failure of the MHC to present antigens to the immune system is being increasingly recognized as contributing to cancer immune evasion and failure to respond to immune checkpoint inhibitors [46–48]. The determination of the HLA haplotypes, which encode the MHC, is typically limited to the setting of organ or bone marrow transplants and is not typically performed in other epidemiological studies. Recent reports, however, show the importance of the HLA type in understanding the exposed mutanome, and its consideration can have important predictive value in the context of immunotherapies [33, 34, 49]. But with a lack of systematic HLA-typing, or the absence of genetic material to perform it, such studies are hard to replicate or scale up. To address this, we demonstrated that we can assign 4-field alleles to the HLA-A, -B, -C, -DRB1, and -DQB1 genes by reference-informed imputation of lc-WGS data [6]. These imputed HLA types had comparable accuracy to deep targeted sequencing of the HLA locus with a fraction of the required DNA input (5 vs 40,000 ng) and with a simplified protocol (no need for targeted capture). The improvement in both sample requirement and throughput supports the evaluation of lc-WGS with imputation as a replacement for current clinical standard tests.
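As an illustration of how HLA calls from two assays can be compared at a chosen field resolution, the sketch below truncates allele names to a given number of fields and counts matching alleles per gene without regard to allele order. It is a simplified sketch and not the benchmarking procedure used here: the function names are hypothetical, the allele names are example inputs, and expression suffixes or ambiguity codes are not handled.

```python
from collections import Counter
from typing import Mapping, Sequence, Tuple


def truncate_fields(allele: str, n_fields: int) -> str:
    """Keep the first n_fields colon-separated fields, e.g. A*02:01:01:01 -> A*02:01."""
    gene, _, fields = allele.partition("*")
    return gene + "*" + ":".join(fields.split(":")[:n_fields])


def hla_allele_matches(reference: Mapping[str, Sequence[str]],
                       imputed: Mapping[str, Sequence[str]],
                       n_fields: int) -> Tuple[int, int]:
    """Return (matched, total) allele counts across genes at the given resolution."""
    matched = total = 0
    for gene, ref_alleles in reference.items():
        ref_counts = Counter(truncate_fields(a, n_fields) for a in ref_alleles)
        imp_counts = Counter(truncate_fields(a, n_fields) for a in imputed.get(gene, ()))
        # Multiset intersection: each reference allele can be matched at most once.
        matched += sum((ref_counts & imp_counts).values())
        total += sum(ref_counts.values())
    return matched, total


# Example calls for one individual, for illustration only.
reference_calls = {"A": ["A*02:01:01:01", "A*24:02:01:01"],
                   "B": ["B*07:02:01:01", "B*44:02:01:01"]}
imputed_calls = {"A": ["A*24:02:01:01", "A*02:01:01:02"],
                 "B": ["B*07:02:01:01", "B*44:03:01:01"]}
two_field = hla_allele_matches(reference_calls, imputed_calls, 2)
four_field = hla_allele_matches(reference_calls, imputed_calls, 4)
```

Reporting accuracy at more than one resolution in this way makes it clear whether discordant calls differ at the protein level (first two fields) or only in synonymous or non-coding fields.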
While offering many benefits, lc-WGS paired with imputation still has some limitations for germline profiling of archival tissue. As in previous reports benchmarking lc-WGS imputation, error increases with decreasing minor allele frequency [5, 6]. This would preclude the use of this strategy for the identification of rare variants of high penetrance associated with familial risk (BRCA, Lynch, or Li-Fraumeni syndromes). Similarly, genotypes at rare risk-associated SNPs or HLA types only found in small populations would be more likely to be missed by this approach. In the future, the availability of even larger and more diverse reference populations may help mitigate this effect. For the purposes of this study we utilized the unrestricted 1000G reference panel (N = 3202 haplotypes); however, larger, though restricted, panels such as the Haplotype Reference Consortium (HRC) (N = 64,976) or TopMed (N = 53,831) exist [25, 50, 51]. Low coverage depth represents an additional limitation of our approach. While a restricted number of reads sequenced from a WGS library can result in decreased imputation accuracy, another, tumor-specific source of decreased coverage is somatic copy number alterations (CNA). We observed that regions of copy number loss had decreased imputation accuracy. Similar observations were recently reported in a study performing germline imputation from discarded reads from targeted sequencing of tumor-derived tissue [1]. Here the choice of the tissue source, or the possibility to dissect normal histological regions, can help mitigate these effects. Indeed, the use of adjacent normal tissue, pre-malignant or low-grade lesions, lymphocytic aggregates, or lymph node specimens would enrich for diploid cells, resulting in fewer inaccurately imputed genomic regions. In contrast, imputation in high-grade lesions or invasive tumors with prominent aneuploidy needs to be carefully considered; its limitations may be mitigated in larger datasets where available CNA profiles could be used as prior information in the imputation strategy.