Background
Previous studies have identified very few germline variants associated with familial lung cancer (LC), even though familial aggregation of LC was first observed half a century ago. The increased familial risk of LC observed in our previous study provided indirect evidence that genetic factors contribute to susceptibility to LC [1], echoing that early observation of familial aggregation.
In LC, no somatic driver mutations have been found for 20% of cases of adenocarcinoma and 60% of cases of squamous carcinoma [2]. One explanation for the mutational heterogeneity observed in cancer is that genes act together in various signaling and regulatory pathways and protein complexes [3]. Accordingly, a pan-cancer network approach that examines combinations of genes may be necessary [4]. Genetic susceptibility to LC may be polygenic and heterogeneous, conferred by relatively common polymorphisms with low penetrance and modest effect sizes [5]. Germline variations may have an important impact on the etiology of complex trait-related pathways, which cannot be explained by common variants. To date, more than 10 genome-wide association studies (GWAS) have examined inherited susceptibility to LC, and relatively few loci have been confirmed [6–11]. Moreover, the results have shown certain differences in inherited susceptibility to LC between Eastern and Western populations. However, despite these studies, most of the heritability of LC remains unexplained. In one study using next-generation sequencing, genes carrying disruptive germline mutations that differed between familial and sporadic LC were identified [12]. However, because GWAS analyze each genomic nucleotide position independently, it is difficult to assess the complex interactions among the many genes containing these SNPs.
Emerging studies have shown that many heritable traits and susceptibilities are caused not by single-gene mutations but by the accumulation of SNPs in many functionally related genes. In a recent GWAS of same-sex sexual behavior, the 5 SNPs identified by traditional single-locus statistical criteria explained less than 1% of the heritability, far less than the actual heritability (25%–32%). This demonstrated that SNPs in many other genes also contribute to the trait, although the contribution of each SNP is minor [13]. Studies on genes associated with educational attainment also revealed that numerous SNPs in nearly 100 functionally related genes collectively predict the trait [14, 15]. Hence, in this study, we conducted a whole-exome sequencing (WES)-based epidemiological analysis of the pedigrees with the highest genetic susceptibility to LC to analyze the potential genetic background, specifically under the hypothesis that multiple SNPs in a group of functionally related genes confer familial LC susceptibility.
Methods
Study design and participants
More than 1300 patients were screened from 2009 to 2010, and 633 patients with pathologically diagnosed LC were enrolled as probands. The first-degree relatives of both the patients and their spouses were study participants, yielding 565 spouse pedigrees. We collected information on sex, age, lung disease history, race, occupational exposure, living environment, and smoking history for probands and controls (Supplementary Table S1). A detailed description of this study is given in our previous article [16]. The goal of this study was to characterize the familial genetic susceptibility to LC.
Statistical analyses
We evaluated the risk factors using stepwise logistic regression with the diagnosis of LC as the dependent variable and the following independent variables: age cohort, sex, lung disease history, living environment, occupational exposure, smoking history, and number of affected first-degree relatives. Univariate and multivariate-adjusted odds ratios (ORs) with 95% confidence intervals (CIs) for LC were calculated using the binary logistic regression model. The estimates were adjusted for sex, age cohort, lung disease history, living environment, and occupational exposure. All statistical tests were performed using SPSS 17.0. Two-sided P values of less than 0.05 were considered statistically significant.
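As an illustration of the univariate estimates described above, the OR for a single binary risk factor and its 95% Wald CI can be computed directly from a 2×2 table; for one binary predictor this coincides with the estimate from a univariate binary logistic regression. The sketch below is not the SPSS procedure used in the study and does not cover the stepwise or adjusted models. The function name and the counts in the usage example are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Return (OR, lower, upper) for a 2x2 table using the Wald interval.

    z=1.96 corresponds to a two-sided 95% confidence interval.
    """
    a, b, c, d = (exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls)
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only (not data from this study).
or_, lo, hi = odds_ratio_ci(40, 20, 60, 80)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

A CI that excludes 1 corresponds to a two-sided P value below 0.05 for the same Wald test.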
Exome sequencing
Probands with adenocarcinoma and at least two first-degree relatives with LC were chosen for exome sequencing because these patients carry the highest genetic risk. Healthy controls were selected by matching demographic factors and levels of exposure to kitchen oil, tobacco and living environment variables. Genomic DNA from blood and from cancer or para-cancer tissues (normal tissues adjacent to the cancer) was extracted with a Tiangen Blood/Cell/Tissue genomic DNA extraction kit (Tiangen). A genome sequencing library was constructed using a NEBNext DNA Library Prep Kit for Illumina (New England Biolabs). Exome capture was performed using a SeqCap EZ ExomeV3-Plus kit (Nimblegen). The libraries were sequenced on Illumina HiSeq-2000/2500 sequencers. High-quality reads passing the Illumina filter were kept for subsequent bioinformatics analysis.
The clean reads (adapter trimmed) were mapped against the human reference genome GRCh37/hg19 (downloaded from the UCSC Genome Browser) using the FANSe2 algorithm [17] with the parameters -E4 -I0 -S14 -M1. By piling up the mapped reads, genomic positions with a sequencing depth of at least 10× were kept for SNV (single nucleotide variation) detection. SNVs were detected using Fisher’s exact test against the null hypothesis that the nucleotides at this position were all identical to the reference genome, with a significance threshold of 0.01. This variant calling procedure has been experimentally validated and showed almost perfect accuracy and sensitivity [18]. Germline SNPs were defined as nucleotides in para-cancer/blood samples that differed from those in the reference genome. Somatic mutations were defined as SNVs detected in cancer samples but absent in the corresponding para-cancer sample. The workflow is illustrated in Fig. S2.
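The depth filter and Fisher's exact test described above can be sketched as follows. The text does not specify how the contingency table is built, so this sketch assumes a 2×2 table that compares the observed (alternate, reference) read counts with an expected row of the same depth containing reference reads only, and it uses a one-sided test. It is not the published calling code; the function names are hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact P value for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric distribution with all
    margins fixed, where X is the count in the top-left cell.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(col1, x) * comb(total - col1, row1 - x)
                    for x in range(a, upper + 1))
    return numerator / comb(total, row1)


def call_snv(ref_count, alt_count, min_depth=10, alpha=0.01):
    """Return True if the position passes the depth filter and the test.

    Assumption: the null row is (0 alternate reads, depth reference reads).
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return False
    p_value = fisher_one_sided(alt_count, ref_count, 0, depth)
    return p_value < alpha


# Hypothetical read counts at three positions.
print(call_snv(ref_count=4, alt_count=6))  # depth 10, 6 alternate reads
print(call_snv(ref_count=5, alt_count=5))  # depth 10, 5 alternate reads
print(call_snv(ref_count=3, alt_count=3))  # depth 6, below the 10x filter
```

Under these assumptions a position at exactly 10× depth needs at least 6 alternate reads to pass the 0.01 threshold; a two-sided test or a different null table would shift that boundary.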
Gene annotations were taken from the refflat file downloaded from the UCSC Genome Browser. Nonsynonymous germline SNPs and somatic SNVs were used for network analyses.
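The germline and somatic definitions given above reduce to set operations on per-sample SNV calls, followed by a filter on the annotated effect. The following is a minimal sketch under assumed data structures: each variant is a (chromosome, position, reference base, alternate base) tuple, and the effect labels are hypothetical stand-ins for the refFlat-based annotation. All variants shown are invented for illustration.

```python
def classify_variants(normal_snvs, tumor_snvs):
    """Split SNV calls into germline and somatic sets.

    normal_snvs: SNVs called in the para-cancer or blood sample.
    tumor_snvs:  SNVs called in the matched cancer sample.
    """
    germline = set(normal_snvs)
    # Somatic: present in the tumor, absent from the matched normal sample.
    somatic = set(tumor_snvs) - germline
    return germline, somatic


def keep_nonsynonymous(variants, effects):
    """Keep variants whose annotated effect is 'nonsynonymous'.

    effects is a hypothetical dict mapping variant -> effect label.
    """
    return {v for v in variants if effects.get(v) == "nonsynonymous"}


# Invented example calls (not data from this study).
normal = {("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T")}
tumor = {("chr1", 1000, "A", "G"), ("chr3", 3000, "G", "A")}
effects = {
    ("chr1", 1000, "A", "G"): "nonsynonymous",
    ("chr2", 2000, "C", "T"): "synonymous",
    ("chr3", 3000, "G", "A"): "nonsynonymous",
}

germline, somatic = classify_variants(normal, tumor)
network_input = keep_nonsynonymous(germline | somatic, effects)
print(sorted(network_input))
```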
Network analysis
The network of mutated genes was generated using STRING-DB 9.1 [19] and visualized using Cytoscape software v3.0.2 [20]. To ensure high confidence in the analysis, the minimum required interaction score was set to “high confidence (0.700)”, and only “experiments, databases and gene fusion” were accepted as evidence sources for protein-protein interactions (PPIs). The graph properties of the networks were also calculated using Cytoscape. KEGG pathway enrichment analysis was performed using KEGG online tools (http://www.kegg.jp/). Details of the SVM classification are described in the Supplementary Methods.
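The edge filter described above (score of at least 0.700, evidence restricted to experiments, databases and gene fusion) can be sketched on a generic edge list as shown below. This is a simplification: the record layout, channel labels and gene names are hypothetical, the scores are invented, and the sketch does not reproduce how STRING recomputes a combined score when evidence channels are restricted. Node degree is included as one example of a graph property.

```python
from collections import defaultdict

# Assumed labels for the three accepted evidence channels.
ALLOWED_CHANNELS = {"experiments", "databases", "gene_fusion"}
MIN_SCORE = 0.700


def filter_edges(edges, min_score=MIN_SCORE, allowed=ALLOWED_CHANNELS):
    """Keep edges at or above min_score with at least one allowed channel."""
    kept = []
    for edge in edges:
        has_allowed_evidence = bool(allowed & set(edge["channels"]))
        if edge["score"] >= min_score and has_allowed_evidence:
            kept.append(edge)
    return kept


def node_degrees(edges):
    """Count the number of retained interactions per gene."""
    degree = defaultdict(int)
    for edge in edges:
        degree[edge["a"]] += 1
        degree[edge["b"]] += 1
    return dict(degree)


# Invented edges with placeholder gene names (not results from this study).
EXAMPLE_EDGES = [
    {"a": "GENE_A", "b": "GENE_B", "score": 0.92, "channels": ["experiments"]},
    {"a": "GENE_A", "b": "GENE_C", "score": 0.81, "channels": ["textmining"]},
    {"a": "GENE_B", "b": "GENE_C", "score": 0.55, "channels": ["databases"]},
    {"a": "GENE_C", "b": "GENE_D", "score": 0.70, "channels": ["databases", "textmining"]},
]

kept = filter_edges(EXAMPLE_EDGES)
print(node_degrees(kept))
```

In this example the second edge is dropped because its only evidence is text mining, and the third because its score is below 0.700.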
Discussion
Our previous study showed that an increased risk of LC was associated with the number of affected relatives [21]. The risk of LC development is significantly higher in patients with adenocarcinoma with familial aggregation. Further analysis of these results indicated that familial risks are compatible with genetic predisposition but may also reflect shared environmental exposures in addition to genetic factors.
As a highly complex disease, LC cannot be explained by single specific mutations. Highly variable somatic mutations may provide a temporal and limited explanation for the progression, but not the incidence, of LC. Our results also showed that the five probands shared no somatic mutations. In contrast, germline SNPs can be indicators of susceptibility to the disease. The GWAS-identified susceptibility loci of LC showed only marginal statistical significance for incidence, suggesting that a rational combination of many genetic loci may be more suitable for predicting LC incidence, particularly in the context of familial LC.
Based on the concepts of systems biology, we aimed to screen germline SNP networks that may contribute to familial LC. We confirmed that, despite differences among SNPs in the familial LC probands, these patients shared the same enrichment in the PI3K/AKT pathway, highlighting this pathway as a major predictor of familial susceptibility to LC.
PI3K activates multiple downstream pathways, such as the RAS, ERK and mTOR pathways, which are crucial for protein synthesis, cell survival, cell growth and proliferation [22–24]. Somatic alterations, including mutations and amplifications, in PI3K pathway genes such as PTEN, PIK3CA, PIK3R1, and AKT are often found in various human cancers, including lung cancer, and activate the PI3K/AKT pathway, driving carcinogenesis [24, 25]. Genetic alterations of the PI3K pathway have rarely been reported in familial lung cancer [26]. Some other specific loci or genes, such as 6q23–25 and ARHGEF5, have also been reported in familial lung cancers [27, 28]. The actual function of these genetic or genomic alterations needs further investigation. Many PI3K/AKT pathway inhibitors have been designed as therapeutic agents for multiple cancer types [23, 29]. In contrast, germline variations in the PI3K pathway, particularly the combinatorial effects of multiple SNPs in this pathway, are often overlooked. Notably, individual germline SNPs in the PI3K pathway, rather than shared SNPs or somatic mutations, were found to be related to familial LC in this study.
Germline SNPs are not the direct cause of LC, and most patients with familial LC did not harbor known driver somatic mutations. Therefore, one possible explanation for the function of these dispersed germline SNPs is as follows: germline SNPs in these patients may provide a fragile “network basis” of nonsynonymous SNP-containing genes enriched in the PI3K/AKT pathway. Although a single SNP possesses only minor malignant potency, the accumulation of many such SNPs collectively makes a perceptible contribution to susceptibility, as evidenced in the studies on same-sex sexual behavior and cognitive abilities [13–15]. Networking of such SNP-containing genes may push the PI3K/AKT pathway into an unstable or precancerous state, resulting in susceptibility to cancer initiation. The fragile network will collapse into imbalance and increase the risk of cancer development if further somatic mutations occur. These somatic mutations are not necessarily driver mutations, but together with the fragile germline-determined PI3K/AKT pathway, this nonrobust system will easily become unbalanced by random environmental fluctuations and may develop in an emergent and/or chaotic manner, resulting in cancer. This hypothetical explanation of the basic role of PI3K/AKT SNPs in cancer is echoed by a series of systems biology approaches. For example, alterations in CDK1 and CDK2 enzyme kinetics parameters will disrupt the regular cell cycle [29], and in vitro tumor cell proliferation dynamics follow a fractal structure different from normal oscillatory dynamics [30]. Although mathematical nonlinear theories have been used to model carcinogenesis in purely theoretical approaches [31], our results may provide explicit experimental support for this view. This theory may also apply to other types of cancer.
Compared with the well-known Knudson “two-hit hypothesis”, which emphasizes the successive inactivation of the two alleles of a tumor-suppressor gene, our “mutation network basis hypothesis” emphasizes the interactions of the SNP-involved gene sets rather than a single tumor-suppressor gene. Compared with Nordling’s “multimutation theory”, which assumes that the genesis of cancer requires the accumulation of multiple consecutive mutations, our “mutation network basis hypothesis” emphasizes the importance of the inherited fragile network arising from germline SNPs. Therefore, the “two-hit hypothesis” and “multimutation theory” may be more suitable for explaining the incidence of sporadic cancer.
Although our study was limited by the small number of familial LC pedigrees, owing to the rarity of pedigrees meeting such strict criteria, our results provide insights into the management of familial susceptibility to LC based on several concepts. First, accurate whole-exome or whole-genome sequencing, not just genetic testing of a small gene panel or several specific SNPs, should be applied to everyone during early life to evaluate risk at a systems level. The decreasing price of sequencing makes this approach affordable. Second, individuals at high familial risk should adjust their lifestyle to avoid inducing factors, such as smoking, air pollution and mutagens. Third, high-risk populations should undergo more frequent screening to detect early-stage tumors. Finally, healthy individuals in families with familial LC should undergo such WES tests because they may share the same fragile germline basis as the probands. This echoes a recent study showing that population genomic screening of all young adults is extremely cost-effective in preventing disease and enhancing quality of life [32]. Our results suggest that such WES-level genomic screening might be more useful in families with familial LC.