Background
The prevention and early diagnosis of breast cancer are among the main objectives of cancer research. There are different models to estimate cancer risk based on genetic or non-genetic factors associated with a high or moderate predisposition [1, 2]. In recent years, the extensive use of genome-wide association studies (GWAS) has led to the identification of low-susceptibility alleles (SNPs). These SNPs are usually combined in a polygenic risk score (PRS), which, in combination with non-genetic factors, reflects the risk of developing breast cancer [3]. We recently described a PRS of 76 low-susceptibility SNPs for breast cancer that allows the general population to be stratified. According to this score, women at low and high risk of developing breast cancer presented 0.5-fold and 2.5-fold risks, respectively, relative to women in the middle quintile [4]. Previous studies have shown that breast density, family history and PRS models composed of 77 [5], 83 [6] or, more recently, 313 SNPs [7] help to identify women at risk. The combination of phenotype and PRS increases the likelihood of identifying women at risk who require personalized follow-up, particularly when an individual exceeds the risk threshold.
Although there are previous studies in Caucasian populations, this is the first to combine a PRS of 92 SNPs with other risk factors, such as mammographic density (MD), reproductive factors, and family history, in a Spanish population of 1097 women. The main objective was to analyze the usefulness of this approach in our population using a multivariable logistic method based on the combination of these variables.
Methods
Study design: description of cohorts
The present study was submitted to and approved by the Clinical Research Ethics Committee (CEIC) of the Hospital Clínico Universitario de Valencia (Spain) - September 29th, 2016 (2016/169) and July 13th, 2018 (2018/139) - and was conducted in compliance with the Helsinki Declaration.
This is a case-control study compiling full genotyping and phenotypic data for a cohort recruited between January 2017 and December 2018 from two sources: Hospital Clínico Universitario de Valencia and the Valencian Community Screening Programme (General Directorate of Public Health), both in the Autonomous Community of Valencia (on the Mediterranean coast). A total of 867 healthy women and 640 breast cancer patients aged 30–70 were recruited. Patients had developed breast cancer within a maximum period of 5 years prior to data collection, while controls were women who had not developed breast cancer during the same period. Participants with incomplete phenotypic data or genotyping failure were excluded, which left 1097 participants: 642 healthy women and 455 breast cancer cases.
The patient cohort was composed of 45% Luminal A, 20% Luminal B, 20% Her-2 positive and 15% Triple Negative tumors (approximate percentages).
Data collection
Clinical information was collected for all subjects at recruitment: family history of breast cancer, date of birth, age, age at menarche, age at menopause, age at first pregnancy, and mammographic density (MD). Breast density was assessed from craniocaudal and mediolateral oblique mammographic projections by a radiologist with more than 10 years of experience. The radiologist used the image viewer system (DICOM, from General Electric GIMD company), classifying MD according to Boyd’s semiquantitative scale [8].
SNP selection and genotyping
As in our previous PRS risk analysis [4], we initially selected 76 SNPs from the European Collaborative Oncological Gene Environment Study (COGS) [9]. These SNPs were significant or showed a trend towards significance in our previous validation with Spanish samples. The correlation of the genetic variants analyzed with prediction of breast cancer risk in women of the Spanish population has already been described [4]. In brief, we analyzed the performance of our PRS using the 76 selected SNPs for breast cancer risk prediction in a Spanish case and control cohort. The initial selection was extended to 123 SNPs by including additional SNPs obtained from the OncoArray Project [10]. Of these, 28 SNPs with an OR close to 1 (0.95 < OR < 1.05) and another 3 SNPs with platform genotyping failure were removed. In this way, a total of 92 SNPs [11–16] were eventually employed for the current analysis (Online Resource 1).
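The selection rule described above (discard SNPs whose OR lies strictly between 0.95 and 1.05, then discard SNPs that failed on the platform) can be illustrated with a minimal sketch. The function name, the dictionary layout and the SNP identifiers are hypothetical and used only for illustration; the actual SNP list is the one given in Online Resource 1.

```python
def select_snps(snp_odds_ratios, failed_snps, low=0.95, high=1.05):
    """Keep SNPs whose OR is outside the open interval (low, high)
    and that did not fail genotyping on the platform.

    snp_odds_ratios: dict mapping a SNP identifier to its published OR.
    failed_snps: set of SNP identifiers with platform genotyping failure.
    """
    return [
        snp
        for snp, odds_ratio in snp_odds_ratios.items()
        if not (low < odds_ratio < high) and snp not in failed_snps
    ]


# Hypothetical toy input: identifiers and ORs are made up for illustration.
toy_ors = {"snp_a": 1.20, "snp_b": 1.01, "snp_c": 0.88, "snp_d": 1.10}
kept = select_snps(toy_ors, failed_snps={"snp_d"})
```

In the study itself, the same two filters reduce the 123 candidate SNPs to 92 (123 − 28 − 3).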
The genotyping method has been described previously [4]. In short, 10 ml of peripheral blood was collected in an EDTA tube. One μg of deoxyribonucleic acid (DNA) was used for the genotypic analysis (minimum concentration of 25 ng/μL). Genotyping was performed with the Open Array® Real-Time PCR platform (Life Technologies) using the Acufill® system and Taqman® probes. The data obtained were analyzed using Genotyper software. Samples with a call rate < 0.95 were discarded. SNPs with a genotyping rate < 0.95 and SNPs generating errors in control duplicates were also excluded.
Statistical analysis
Sample size was calculated with a 95% confidence level (two-tailed test), 80% statistical power, a control-case ratio of 1.3 and an initial prevalence of breast cancer of 12%; the required total sample size was 1138 women, close to the size of our case-control cohort (1097). In an initial exploratory univariable process, the case/control ratio of each risk factor was compared. During this step, the Wilcoxon test was used with a two-sided p-value threshold of 0.05.
The PRS was based on a combined effect of 92 SNPs statistically associated with breast cancer. This strategy considers an independent effect of each SNP, ignoring departures from a multiplicative model [17]. The PRS was derived for each study subject using the formula:
$$ PRS=\beta_1 x_1+\beta_2 x_2+\dots +\beta_k x_k+\dots +\beta_{92} x_{92} $$
where x_k is the number of risk alleles (0, 1 or 2) for SNP k. The β_k weights are the ORs of the risk alleles associated with breast cancer described in Online Resource 1. This strategy has been used in other studies [5, 6]. The resulting values are normalized using the median PRS value of the control samples of the cohort.
In the phenotypic analysis, the phenotypic categories were transformed into quantitative variables using the ORs described in the Pollan et al. study [8], except for family history, the ORs of which were based on the Pharoah et al. study [18]. In addition, the age of women (age at diagnosis for patients and age at interview for controls) was grouped into five-year periods, as in other publications [19], which allowed the groups to be transformed into quantitative variables. The final number of cases and controls in our cohort was 455 and 642, respectively.
For the univariable analysis, logistic regression adjusted for age and centre was applied to each risk factor. The coefficients of the model were standardized using the reghelper library of R [20]. Additionally, the PRS was adjusted for the first five principal components. The interaction effect between variables was also evaluated using the likelihood ratio test (LRT). All analyses were two-sided and employed a p-value threshold of 0.05.
To confirm the independence of the PRS and other phenotypic risk factors, pairwise Spearman correlations of unaffected controls were evaluated.
For the multivariable study, we performed a logistic regression analysis that incorporated the statistically significant variables obtained in the previous steps, including the interaction terms. Family history and age at menarche were also included in the analyses, even though they were not significant, since they are well-known risk factors. The significance of the final model was evaluated using the Wald test [21]. To assess the accuracy of the final multivariable model, a global Hosmer-Lemeshow goodness-of-fit test was performed using deciles [22].
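For readers unfamiliar with the Hosmer-Lemeshow test, the statistic can be sketched as follows: women are ordered by predicted probability, split into groups (deciles in this study), and observed and expected case counts are compared in each group. The function below is a hypothetical illustration rather than the implementation used in the study; it returns only the statistic, which is conventionally compared with a chi-square distribution with (groups − 2) degrees of freedom.

```python
def hosmer_lemeshow_statistic(probs, outcomes, groups=10):
    """Hosmer-Lemeshow statistic for predicted probabilities.

    probs: predicted probabilities of being a case.
    outcomes: observed outcomes (1 = case, 0 = control), same order.
    groups: number of equal-sized risk groups (10 gives deciles).
    """
    # Order by predicted probability only (stable for ties).
    pairs = sorted(zip(probs, outcomes), key=lambda t: t[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of cases
        observed = sum(y for _, y in chunk)   # observed number of cases
        mean_p = expected / size
        denominator = size * mean_p * (1 - mean_p)
        if denominator > 0:
            statistic += (observed - expected) ** 2 / denominator
    return statistic
```

A small statistic (large p-value) indicates that predicted and observed case counts agree across risk groups.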
To evaluate improvement in risk prediction for the different models and risk factors, the area under the curve (AUC) was evaluated [23] as a measure of discrimination between cases and control women. This calculation was performed using the pROC [24] library of R. To avoid possible overfitting of the model, the 95% confidence interval (CI) of the AUC was assessed using a cross-validation strategy [25]. This step was based on the calculation of the AUC in 1000 permutations using a random selection of 90% of women as a training set and the remaining 10% as a test set.
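The resampling scheme can be sketched in Python as follows (the study itself used the pROC library of R). This is a simplified illustration under stated assumptions: `fit` is a hypothetical callable that trains a model on the training split and returns a scoring function, the interval is taken from the 2.5th and 97.5th percentiles of the resampled AUCs, and splits whose test set lacks either cases or controls are skipped.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random case scores higher than a
    random control (ties count as 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def repeated_holdout_auc(rows, labels, fit, n_repeats=1000,
                         test_frac=0.10, seed=0):
    """Mean AUC and percentile interval over random 90/10 splits.

    fit(train_rows, train_labels) must return a function mapping one
    row to a risk score (hypothetical interface for this sketch).
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, round(test_frac * len(rows)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        y_test = [labels[i] for i in test]
        if len(set(y_test)) < 2:
            continue  # AUC is undefined without both classes
        score = fit([rows[i] for i in train], [labels[i] for i in train])
        aucs.append(auc([score(rows[i]) for i in test], y_test))
    aucs.sort()
    lower = aucs[int(0.025 * (len(aucs) - 1))]
    upper = aucs[int(0.975 * (len(aucs) - 1))]
    return sum(aucs) / len(aucs), (lower, upper)
```

Because the model is refitted on each training split and scored only on held-out women, the resulting interval reflects out-of-sample discrimination rather than the apparent fit.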
Finally, women were stratified into deciles based on their final individual risk factor, obtained from the multivariable model. The ORs of the extreme deciles were evaluated using logistic regression, taking the 40–60% range as the reference.
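The stratification step can be illustrated with a short sketch. It assumes that deciles are defined by rank over all women (cases and controls together) and computes the unadjusted cross-product OR of one decile against the 40–60% reference group (deciles 5 and 6), which equals the OR from a logistic regression with a single group indicator and no further covariates. Function names are hypothetical.

```python
def decile_of(scores):
    """Assign each score a decile from 1 to 10 by rank
    (ties are broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    deciles = [0] * len(scores)
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // len(scores) + 1
    return deciles


def odds_ratio_vs_reference(deciles, is_case, target, reference=(5, 6)):
    """Unadjusted OR of the target decile versus the reference deciles.

    Raises ZeroDivisionError if the target decile has no controls or
    the reference group has no cases.
    """
    def counts(group):
        cases = sum(1 for d, c in zip(deciles, is_case) if d in group and c)
        controls = sum(1 for d, c in zip(deciles, is_case)
                       if d in group and not c)
        return cases, controls

    target_cases, target_controls = counts({target})
    ref_cases, ref_controls = counts(set(reference))
    return (target_cases * ref_controls) / (target_controls * ref_cases)
```

An OR above 1 for the tenth decile and below 1 for the first decile indicates that the risk score separates women at the extremes from those in the middle of the distribution.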
Based on the characteristics of our cohort, the final individual risk factor proposed in this study describes the relative risk of women in the Spanish population of suffering breast cancer in a maximum period of 5 years.
Discussion
In recent years, there have been various proposals for multivariable models that stratify women who might suffer breast cancer according to their individual risk. Different biomarkers have been analyzed as possible predictors, including phenotypic and non-phenotypic markers, and environmental and genetic factors.
One approach to measuring genetic variables is the polygenic risk score (PRS). This strategy is based on variable numbers of statistically significant low-penetrance variants obtained from large GWAS analyses [5, 32].
Our study was based on a relatively small cohort of women, and both our univariable and multivariable models were adjusted for center of origin.
Employing a specific PRS based on 92 SNPs, we obtained an OR of 1.41 (1.24–1.61) that was consistent with the results of other published studies of Caucasian populations using different numbers of SNPs (from 18 to 313) [5, 32–34].
The AUC-ROC was 0.62, with a 95% CI of 0.56–0.66, which is also in line with the literature, where values range from 0.58 to 0.65 in European populations and from 0.53 to 0.64 in non-European populations [35].
Regarding the univariable analysis of phenotypic risk factors, the most statistically significant results in terms of discriminant variables were obtained for menopause status and mammographic density, which once again is consistent with previous studies [28, 29, 36, 37]. Other reproductive factors, such as later age when giving birth for the first time and later age at menarche, have been identified as risk factors for breast cancer [38]. In our study, the former was statistically significant (p-value = 0.03, OR = 1.15), while the latter was not (p-value = 0.061).
The ORs of the risk factors obtained in our cohort differ from those previously reported. The most evident differences are the lack of statistical significance for family history and age at menarche. However, the direction (positive or negative) of these well-established effects is concordant with our results. On the other hand, the magnitude of the OR for mammographic density was lower than that reported in the literature. These differences may be due to the low number of women in our cohort; however, the concordance of the effect, direction and magnitude of the different ORs of our population corroborates the validity of our study as a first proof of concept in a Spanish population.
Additionally, the joint association of our PRS92 with transformed continuous phenotypic variables, such as MD, reproductive factors and family history, was examined in our Spanish population. We did not find any significant correlation between genotypic and phenotypic variables; a multiplicative model would possibly describe this in greater depth and help to improve breast cancer risk estimation.
The precision of the multivariable model increased when we added two statistically significant interaction terms associated with women’s age: menopause and mammographic density. Such interactions have previously been observed, and we detected an increase of AUC-ROC from 0.74 (95% CI: 0.71–0.77) to 0.80 (95% CI: 0.77–0.83) (Table 2), a rise that was statistically significant and offered a final value slightly higher than those of other similar multivariable studies [39].
We were able to stratify the control group within our model (Fig. 3), in which both extremes showed important differences. The last decile included 1% of controls vs 9% of cases, with an OR of 12.9 and a p-value of 3.43E-07. In contrast, the first decile presented an inverse proportion (1% of cases and 9% of controls); in this case, the OR was 0.097, with a p-value of 1.86E-08. These results indicate the capacity of the multivariable model to stratify women according to risk of developing breast cancer.
In summary, our results indicate that using the multivariable logistic model and a combination of genetic, phenotypic and interaction variables is an effective approach for stratifying women in the Spanish population according to individual risk of suffering breast cancer within a 5-year period, with a capacity similar to that observed in other studies in European and non-European populations. Due to the nature of our study, different biases could have affected the precision of the results; for example, there may have been selection and length biases. Additionally, the small size of our cohort could have led to overfitting of the model in terms of risk estimation or the over/under representation of a specific tumor type. However, in spite of these limitations, our analysis provides proof of concept in a population that has not been studied until now. Larger series are necessary in order to confirm our data and initiate the use of this type of screening method in the Spanish population.