Background
Lung cancer is one of the leading causes of cancer death worldwide [
1,
2]. Most patients are diagnosed at an advanced stage, so are not able to undergo surgical removal of tumors [
1]. As a result, the overall 5-year survival rate is low. Detection at an early stage, when treatment might be more effective, would therefore help reduce lung cancer mortality. For this reason, a well-established assessment model that could identify individuals at high risk would greatly benefit patients, clinicians and researchers.
Lung cancer is a polygenic disease, for which many genetic factors appear to play an important role in disease development [
2,
3]. During the past three years, several genome-wide association (GWA) studies have identified a number of genetic susceptibility loci associated with lung cancer risk [
4‐
9], but most of these studies were conducted in populations of European descent, and many identified risk alleles have not been adequately evaluated in Asian populations.
In addition, when examined individually, each of the genetic susceptibility loci only confers a small to moderate disease risk, and is of limited utility in risk prediction. It is possible that combining multiple disease-related loci with modest effects into a genetic risk score (GRS) may be useful to identify subgroups that are at high risk of lung cancer [
10,
11]. Several lung cancer risk assessment models have been proposed, including the Bach model, Spitz model, and Liverpool Lung Project (LLP) model [
12‐
15]. However, most predictors from these models focus on demographic and clinical factors, and, to our knowledge, no report has quantified the risk of lung cancer using a combination of newly identified risk loci in a Chinese population.
In this case–control study, we evaluate the discriminatory and predictive ability of the cumulative effect of several SNPs associated with lung cancer risk in populations of European descent, and estimate the proportion of genetic variance explained by the selected risk loci in a Chinese population.
Methods
Subjects
A total of 2,283 lung cancer cases and 2,785 cancer-free controls, all genetically unrelated Han Chinese, were enrolled in this study from Shanghai Zhongshan Hospital, Shanghai Chest Hospital, the First Affiliated Hospital of Nanjing Medical University, Beijing Union Medical College Hospital, and Wuhan Union Hospital, China. Eligible patients had histopathologically confirmed lung cancer, had no previous history of cancer, and were not receiving radiotherapy or chemotherapy for any other condition. Control participants were randomly selected from individuals receiving routine physical examinations in local hospitals or from participants in a community-based screening program for non-communicable diseases. They were frequency-matched to the cases by age, gender and residential area.
Information on smoking was collected by interview. Individuals who had smoked less than one cigarette per day for less than one year in their lifetime were defined as nonsmokers. The remaining individuals were divided into light and heavy smokers using a threshold of 25 pack-years (the median pack-years among the controls). All participants provided written informed consent for study participation, and the study was approved by the institutional review board of each participating institution.
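The smoking categories described above can be illustrated with a short sketch. It assumes the conventional definition of a pack-year (20 cigarettes per day for one year) and assumes that exposure exactly at the 25 pack-year threshold counts as heavy; the text does not state how the boundary was handled, and the function names are hypothetical.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Pack-years = packs per day (20 cigarettes per pack) x years smoked."""
    return cigarettes_per_day / 20.0 * years_smoked


def smoking_category(cigarettes_per_day: float, years_smoked: float,
                     threshold: float = 25.0) -> str:
    """Classify a participant as nonsmoker, light smoker, or heavy smoker."""
    # Nonsmoker: less than one cigarette per day for less than one year.
    if cigarettes_per_day < 1 and years_smoked < 1:
        return "nonsmoker"
    # Assumption: exposure at or above the threshold is treated as heavy.
    if pack_years(cigarettes_per_day, years_smoked) >= threshold:
        return "heavy"
    return "light"


if __name__ == "__main__":
    # Illustrative participants only, not study data.
    for cpd, yrs in [(0, 0), (10, 20), (20, 30)]:
        print(cpd, yrs, smoking_category(cpd, yrs))
```

Using the control median as the cut-off splits smokers among the controls into two groups of roughly equal size.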
Selection of genetic risk factors and genotyping
We reviewed the literature on GWAS and large cohort studies published up until June, 2011, and selected those lung cancer risk SNPs from GWAS demonstrating
p < 5E-6 or from large cohort studies with evidence of replication at
p < 0.05. In total, five SNPs were selected for analysis (Table
1).
Table 1
Selected SNPs associated with lung cancer*
| SNP | Gene | Chromosomal region | Call rate | Risk allele | OR (95% CI) |
| --- | --- | --- | --- | --- | --- |
| rs2736100 | TERT | 5p15 | 99.5% | C | 1.18 (1.10-1.26) |
| rs402710† | CLPTM1L | 5p15 | 99.6% | C | 1.22 (1.13-1.32) |
| rs1051730 | CHRNA5, CHRNA3 | 15q25 | 98.8% | T | 1.35 (1.5-1.45) |
| rs4083914 | RGS17 | 6q23-25 | 99.7% | G | 1.80 (1.36-2.39) |
| rs4488809 | TP63 | 3q28 | 100% | T | 1.27 (1.14-1.41) |
Blood samples were collected from each subject at the time of recruitment, and genomic DNA was extracted using the QIAamp DNA Maxi kit (Qiagen GmbH). All SNPs were genotyped on the Sequenom MassARRAY iPLEX platform, which uses matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry. Primer sequences are available on request. Overall, more than 98% of genotypes were successfully determined for every SNP; 5% of samples were randomly selected for re-genotyping as quality control, and showed a reproducibility of 100%.
Genetic risk score computation
Two approaches were used to calculate the genetic risk score (GRS): a simple risk allele count method (count GRS, cGRS) and a weighted method based on the genotype frequencies for each SNP and effect sizes (allelic odds ratios) from our study (weighted GRS, wGRS). Under the log-additive model, the three genotypes AA, AB, and BB (A, low-risk allele; B, high-risk allele) of an SNP have relative risks of 1, OR, and $\mathrm{OR}^2$, respectively. If the B allele has frequency $p$, the average relative risk in the population is $u = (1-p)^2 + 2p(1-p)\,\mathrm{OR} + p^2\,\mathrm{OR}^2$. The adjusted risk values for the AA, AB, and BB genotypes are $1/u$, $\mathrm{OR}/u$, and $\mathrm{OR}^2/u$, respectively. Missing genotypes were assigned a value of 1. The combined weighted risk score was calculated as $\mathrm{wGRS} = \mathrm{SNP}_1 \times \mathrm{SNP}_2 \times \mathrm{SNP}_3 \times \mathrm{SNP}_4$, where $\mathrm{SNP}_1$ to $\mathrm{SNP}_4$ are the weighted risk scores of the individual SNPs.
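The two scores can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each genotype's relative risk is divided by the population average u, missing genotypes in the count score are simply skipped (the text only states the rule for the weighted score), and the allele frequencies and odds ratios in the usage example are hypothetical placeholders rather than study estimates.

```python
from math import prod


def count_grs(risk_allele_counts):
    """cGRS: number of risk alleles carried (0, 1 or 2 per SNP).

    Assumption: missing genotypes (None) contribute nothing to the count.
    """
    return sum(c for c in risk_allele_counts if c is not None)


def average_relative_risk(p, odds_ratio):
    """u = (1-p)^2 + 2p(1-p)OR + p^2 OR^2 for risk allele frequency p."""
    q = 1.0 - p
    return q * q + 2.0 * p * q * odds_ratio + p * p * odds_ratio ** 2


def snp_weight(n_risk_alleles, p, odds_ratio):
    """Adjusted risk for one SNP: OR^k / u, or 1 for a missing genotype."""
    if n_risk_alleles is None:
        return 1.0
    return odds_ratio ** n_risk_alleles / average_relative_risk(p, odds_ratio)


def weighted_grs(risk_allele_counts, snp_params):
    """wGRS: product of the per-SNP adjusted risks.

    snp_params is a list of (risk allele frequency, allelic OR) pairs.
    """
    return prod(snp_weight(k, p, o)
                for k, (p, o) in zip(risk_allele_counts, snp_params))


if __name__ == "__main__":
    # Hypothetical frequencies and odds ratios, for illustration only.
    params = [(0.40, 1.15), (0.30, 1.10), (0.25, 1.20), (0.35, 1.12)]
    genotypes = [2, 1, None, 0]
    print(count_grs(genotypes), round(weighted_grs(genotypes, params), 3))
```

Dividing by u centres each SNP's weight so that its population average is 1 under Hardy-Weinberg proportions, which is why a missing genotype can be assigned the neutral value 1.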
Percentage of genetic variance explained
The percentage of genetic variance was estimated under a liability threshold model [
16]. Allele frequencies and effect sizes corresponding to the ORs were used to calculate the variance explained by each SNP as $2p(1-p)\beta^2$ (p, risk allele frequency; β, additive allelic effect).
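As a rough numerical illustration of the per-SNP quantity 2p(1-p)β², the sketch below takes β to be the natural logarithm of the allelic odds ratio. That choice is a simplifying assumption made here: under a liability threshold model the effect would be expressed on the liability scale, and the text does not give the exact conversion. The input values are hypothetical.

```python
from math import log


def variance_explained(p, odds_ratio):
    """Additive variance contribution 2p(1-p)beta^2 for one SNP.

    Assumption: beta is approximated by ln(OR).
    """
    beta = log(odds_ratio)
    return 2.0 * p * (1.0 - p) * beta ** 2


def total_variance_explained(snp_params):
    """Sum of per-SNP contributions, assuming the SNPs act independently."""
    return sum(variance_explained(p, o) for p, o in snp_params)


if __name__ == "__main__":
    # Hypothetical (risk allele frequency, OR) pairs, not study estimates.
    params = [(0.40, 1.15), (0.30, 1.20)]
    for p, o in params:
        print(p, o, round(variance_explained(p, o), 5))
    print("total:", round(total_variance_explained(params), 5))
```

The expression is largest for common alleles (p near 0.5) with large effects, so a rare allele contributes little variance even when its odds ratio is sizeable.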
Statistical analysis
Logistic regression was used to test the association between genetic variants and lung cancer risk. The classification ability of the model was assessed using the area under the receiver operating characteristic (ROC) curve (AUC), also known as the concordance (c) statistic. The Hosmer-Lemeshow test was used to evaluate the calibration of the risk estimated from our data. Internal validation of the models was carried out using a bootstrap method with 1000 replications to adjust model parameters for potential over-fitting. A second validation was performed by randomly dividing the study population into two unequal groups (one with 75% of the population and the other with the remaining 25%). The larger group (training set) was used to rebuild the same model, which was then tested on the remaining 25% of the population (test set). All analyses were conducted using Statistical Analysis System (SAS) software (version 8.2; SAS Institute, Cary, NC). All p values were two-sided, and p values < 0.05 were considered statistically significant.
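The validation steps can be sketched with the Python standard library. The analyses in this study were run in SAS; the code below is only an illustration with hypothetical function names. It computes the c statistic as pairwise concordance, resamples a fixed risk score to show the variability of the AUC (a simplification: it does not refit the model, so it is not the over-fitting correction itself), and draws a random 75%/25% split.

```python
import random


def auc(scores, labels):
    """c statistic: probability that a random case outscores a random
    control, with ties counted as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def bootstrap_auc_interval(scores, labels, n_boot=1000, seed=0):
    """Approximate 95% percentile interval of the AUC from resampling
    subjects with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both cases and controls
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


def split_75_25(n, seed=0):
    """Random index split into a 75% training set and a 25% test set."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(0.75 * n)
    return idx[:cut], idx[cut:]


if __name__ == "__main__":
    # Simulated scores: cases are shifted upward relative to controls.
    rng = random.Random(1)
    labels = [1] * 50 + [0] * 50
    scores = [rng.gauss(0.3 * y, 1.0) for y in labels]
    print(round(auc(scores, labels), 3))
    print(bootstrap_auc_interval(scores, labels, n_boot=200))
```

In a full optimism correction, the model would be refitted on each bootstrap sample and evaluated on both that sample and the original data, with the average difference subtracted from the apparent AUC.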
Discussion
In this study, we systematically evaluated the clinical utility of five SNPs identified in recent GWA and large cohort studies of lung cancer. Using data from a large case–control study that enrolled 5,068 participants, we found that most of the genetic variants identified previously in other populations (rs2736100, rs402710, rs4488809, and rs4083914) were also associated with risk of lung cancer in a Chinese population. In addition, we showed that a wGRS accounting for the adjusted effect size of each SNP was a better predictor than a cGRS, and had a stronger association with lung cancer risk than any single SNP. Although the weighted genetic risk score alone had only moderate predictive ability, it gave better discrimination between lung cancer cases and cancer-free controls (AUC of ROC curve, 0.639) when combined with smoking status in the logistic regression model.
Several lung cancer risk assessment models have previously been proposed [
12‐
15], but most predictors focused on traditional risk factors such as family history of lung cancer, smoking status, environmental exposure, age and gender. In contrast to these, genetic scores derived from inherited genetic variations offer the advantage of stability during the lifetime of the individual.
Previous studies have indicated that inherited genetic variants might account for an important fraction of lung cancer developmental risk [
18,
19]. Recent GWA studies of lung cancer in populations of European ancestry identified three lung cancer susceptibility loci: 5p15 (TERT-CLPTM1L), 15q25 (CHRNA 3–5) and 6p21 (BAT3-MSH5) [
4‐
9]. McKay et al. [
4] reported two independent markers of lung cancer at the 5p15 region, rs2736100 (TERT) and rs402710 (CLPTM1L). Furthermore, the association between rs2736100 and lung cancer was also replicated in Asian populations [
20,
21]. Of the five SNPs evaluated in this study, we observed a strong signal at rs2736100 in accordance with previous reports.
The 15q25 region, which encodes nicotinic acetylcholine receptor subunits, has been reported to be associated with lung cancer risk [
6‐
8]. We evaluated the rs1051730 SNP from this region in the present study, but it showed no association with disease risk. It is conceivable that the rs1051730 allele frequency in the Chinese Han population (MAF, 0.02) is too low to confirm the effects seen in European populations [
22]. Reported risk SNPs at 6p21 (rs3117582 and rs3131379) are not polymorphic in the Chinese Han population, and were therefore excluded from this study. The SNPs rs4488809 and rs4083914, previously identified by GWA and large cohort investigations, were also significantly associated with lung cancer risk in this study [
23,
24].
Of the five SNPs evaluated in this study, the strongest signal was found for rs4488809, for which each risk allele conferred a 21% elevated risk of lung cancer. The three other SNPs (rs2736100, rs402710, and rs4083914) were also associated with lung cancer risk, albeit at lower levels (<18% per risk allele). The estimated proportion of genetic variance explained by these four SNPs was 4.02%, including 1.82% due to rs4488809 and 1.33% due to rs2736100. This suggests that the genetic susceptibility loci identified by GWA and large cohort studies in other populations confer only a small to moderate risk in a Chinese population when considered individually, and are of little use on their own in lung cancer risk assessment.
To overcome this, a genetic risk score combining multiple loci might improve the identification of persons at high risk for developing lung cancer. Our results showed that although wGRS was highly associated with lung cancer susceptibility, a model including wGRS alone did not provide a better predictive capacity than a model including traditional factors (c statistic for wGRS alone, 0.551). Smoking history was also associated with lung cancer risk in this study, in agreement with previous reports [
12,
25]. Moreover, wGRS in combination with smoking status showed better predictive ability (c statistic, 0.639). Indeed, the c statistic decreased by 0.020 when wGRS was removed from the full model, indicating that genetic risk factors could improve the discriminatory ability of the traditional assessment model, although the improvement was moderate.
This study has a number of limitations. First, the susceptibility loci identified by GWA and large cohort studies with evidence of replication were associated with lung cancer risk through strong linkage disequilibrium, and conferred only moderate effects. Many additional susceptibility loci for lung cancer remain to be discovered, and it is possible that rare variants with high penetrance would explain the remaining heritability [
26]. Next generation sequencing technologies offer hope in the future research of such variants [
27]. Recently, several newly identified SNPs were reported [
28‐
30]. Combining these new SNPs might improve the classification of lung cancer risk. Second, because only a limited set of traditional risk factors was available, the full predictive model established in this study provided only a moderate level of classification accuracy, with a c statistic of 0.639, which is inadequate for risk prediction. The discriminatory capability of our model might be improved by including additional factors such as history of bronchitis, emphysema or pneumonia, asbestos exposure, and family history of lung cancer. Third, our assessment model lacked external validation, even though our estimates of the ROC AUC were corrected for over-fitting by bootstrapping and internal validation was conducted. Finally, as this was a retrospectively designed study, the results need to be validated in a large-scale, prospective study.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HL, LXY, DRL, WMW and LJ conceived and designed the experiments. HL, LXY and XYZ performed the experiments. HL and XYZ analyzed the data. JQ, JCW, HYC, WWF, HCL and LJ contributed reagents, materials or analysis tools. DRL, WMW and HL wrote the manuscript. All authors read and approved the final manuscript.