Study population and sample collection
The study population consisted of 3619 participants from two observational studies for whom genetic data were available: Colorectal Cancer Genetics & Genomics (CRCGEN;
n = 1801) and Colorectal Cancer Screening (COLSCREEN;
n = 1818). CRCGEN combines data from two Spanish case-control studies. The first, conducted at the University Hospital of Bellvitge, L’Hospitalet del Llobregat (Barcelona), recruited incident pathology-confirmed CRC cases (
n = 304) and age and sex frequency-matched hospital controls (
n = 293) during the period 1996–1998. The control group was randomly selected among patients without previous CRC admitted to the same hospital during the same period. To avoid selection bias, the criterion of inclusion in the control group was a new diagnosis. The second study was conducted in parallel at the University Hospital of Bellvitge and the Hospital of León, León, during 2007–2015 and recruited a total of 633 incident CRC cases (313 Bellvitge and 320 León) and 571 population controls (free of CRC, 164 Bellvitge and 407 León). Controls were subjects selected from the primary health care lists of the hospitals’ referral areas, frequency matched by age and sex, who were invited to participate. COLSCREEN is a cross-sectional screening cohort study designed to recruit participants from the ongoing population-based CRC screening program conducted by the Catalan Institute of Oncology, L’Hospitalet del Llobregat (Barcelona), from 2011 to 2020. The design of the CRC screening program, based on FOBT, has been published elsewhere [
12,
13]. Exclusion criteria for participation in the biennial screening program were as follows: gastrointestinal symptoms; family history of hereditary or familial colorectal cancer; personal history of CRC, adenomas, or inflammatory bowel disease; colonoscopy in the previous 5 years or a FIT within the last 2 years; terminal disease; and severe disabling conditions. Most of the participants in the COLSCREEN study were invited to participate after a positive FIT result (
n = 1242, 77%; ≥ 20 μg Hb/g feces), but we also invited a sample of 362 subjects with a negative FIT result (< 20 μg Hb/g feces) to undergo a colonoscopy. To increase the CRC sample size, we further included in the COLSCREEN study 70 newly diagnosed CRC cases identified by the hospital CRC Functional Unit. Although these clinically diagnosed CRC cases were labeled under COLSCREEN because they were recruited at the same time as that study, they were combined with the CRCGEN cases for analysis and were excluded when analyses were restricted to screening participants.
Colonoscopy and/or CRC histological reports were examined and used to classify COLSCREEN participants into different categories following the proposal by Castells et al. for risk stratification of patients with CRC and/or serrated polyps [
14]: low-risk lesions (LRL;
n = 286), intermediate-risk lesions (IRL;
n = 352), high-risk lesions (HRL;
n = 226), and CRC screening program cases (
n = 70). Controls (free of CRC) were classified into population-based controls and screening controls (normal colonoscopy or no-risk lesions).
All participants who agreed to take part in the CRCGEN and COLSCREEN studies provided written informed consent and donated a blood sample at recruitment. The ethics committees of both hospitals (University Hospital of Bellvitge and Hospital of León) approved the study protocols (PR148/08, PR073/11, PR084/16).
Genotyping and single nucleotide polymorphism data
Genotyping within the CRCGEN study was conducted in 2016 from blood DNA using the Infinium OncoArray-500K BeadChip (Illumina, San Diego, CA), which contains 500,000 SNPs, whereas genotyping within the COLSCREEN study was performed in 2019 using the Infinium Global Screening Array v2.0 (Illumina, San Diego, CA), which includes nearly 800,000 SNP markers. The correlation of the minor allele frequency (MAF) between arrays was excellent (Pearson r > 0.99).
SNPs were filtered out if the Hardy-Weinberg equilibrium
p value was < 1e−04 or if the MAF was < 0.001. Multiallelic SNPs, SNPs outside the autosomes or chromosome X, and SNPs with > 5% of missing values were also excluded. Likewise, duplicated or related samples, samples with > 1% of missing values, and samples with sex discordance were excluded. Whole genome imputation was performed using the Michigan Imputation Server [
15], for each dataset separately, using the Haplotype Reference Consortium panel (HRC.r1.1.2016) for the CEU population as reference.
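As an illustration only, the marker-level filters described above can be written as a small Python predicate. The analyses themselves were run with R and PLINK; the `SnpStats` record, its field names, and the example values are hypothetical, and only the thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds stated in the text.
HWE_P_MIN = 1e-4        # exclude if Hardy-Weinberg p value is below this
MAF_MIN = 0.001         # exclude if minor allele frequency is below this
SNP_MISSING_MAX = 0.05  # exclude if more than 5% of genotypes are missing
ALLOWED_CHROMS = {str(c) for c in range(1, 23)} | {"X"}


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary record (not the authors' data layout)."""
    snp_id: str
    chrom: str
    n_alleles: int
    maf: float
    hwe_p: float
    missing_rate: float


def passes_qc(snp: SnpStats) -> bool:
    """Return True if the SNP survives every marker-level filter."""
    if snp.hwe_p < HWE_P_MIN:
        return False
    if snp.maf < MAF_MIN:
        return False
    if snp.n_alleles != 2:  # multiallelic markers are dropped
        return False
    if snp.chrom not in ALLOWED_CHROMS:
        return False
    if snp.missing_rate > SNP_MISSING_MAX:
        return False
    return True


# Made-up example records.
good = SnpStats("snp_a", "5", 2, 0.21, 0.40, 0.01)
rare = SnpStats("snp_b", "5", 2, 0.0005, 0.40, 0.01)
mito = SnpStats("snp_c", "MT", 2, 0.21, 0.40, 0.01)
kept = [s.snp_id for s in (good, rare, mito) if passes_qc(s)]
```

The sample-level filters (duplicates, relatedness, > 1% missingness, sex discordance) would be applied in the same spirit to per-sample summaries.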
In 2020, Thomas et al. reported 140 SNPs associated with CRC risk in a large-scale GWAS [
6]. For the present study, a total of 133 SNPs were used to calculate the PRS. One SNP was not found in our data after imputation (rs6928864), and six SNPs (rs35470271, rs145364999, rs755229494, rs77969132, rs373585858, and rs556532366) were excluded because of a low imputation quality information index (
R² < 0.3). None of the 133 SNPs were in linkage disequilibrium (Additional file
1: Tab. S1).
Statistical analysis
Odds ratios (OR) and 95% confidence intervals (95% CI) were estimated using unconditional logistic regression models to evaluate associations between each analyzed variant and the outcome, defined as cases (IRL, HRL, or CRC, either screening or clinical) versus controls (normal colonoscopy, LRL, or population control). All samples were combined for this analysis after potential confounders (age, sex, genetic ancestry, family history, and array) had been explored. Although the crude and fully adjusted models provided very similar OR estimates, we report the adjusted ORs.
A principal component analysis (PCA) of ancestry-informative marker SNPs (AIMs), which included 1397 HapMap samples, was performed [
16]. Based on the ethnicity of the HapMap samples, we classified our samples by ancestry: European (
n = 3509), Latino (
n = 90), and African (
n = 20) (Additional file
2: Fig. S1). We decided not to restrict the sample to European ancestry, since the 110 non-European samples had a minimal impact on the estimates and this population also participates in CRC screening. We adjusted the analyses for the first 5 PCs, though only the first two were associated with the outcome. We also performed a sensitivity analysis excluding subjects of non-European ancestry.
To assess genetic susceptibility, two approaches were used: an unweighted PRS and a weighted PRS (w-PRS). Each SNP was coded as 0, 1, or 2 copies of the risk allele. The PRS was defined as the sum of risk alleles across all 133 SNPs. The w-PRS was calculated using the published
β values reported by Thomas et al. as weights. However, because part of our data (CRCGEN) was used by Huyghe et al. as discovery data [
7], the unweighted PRS was preferred, even though the weights used had been corrected for winner’s curse bias.
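The two scores can be sketched as follows. This is a minimal Python illustration, assuming genotypes are already coded as risk-allele counts; the function names and the three-SNP example with its β values are hypothetical and are not taken from Thomas et al. The scores in the study were computed with PLINK.

```python
from typing import Sequence


def _check_dosages(dosages: Sequence[int]) -> None:
    """Genotypes must be coded as 0, 1, or 2 copies of the risk allele."""
    for d in dosages:
        if d not in (0, 1, 2):
            raise ValueError(f"invalid risk-allele count: {d!r}")


def unweighted_prs(dosages: Sequence[int]) -> int:
    """PRS: plain count of risk alleles across all SNPs."""
    _check_dosages(dosages)
    return sum(dosages)


def weighted_prs(dosages: Sequence[int], betas: Sequence[float]) -> float:
    """w-PRS: risk-allele counts weighted by per-SNP log odds ratios."""
    _check_dosages(dosages)
    if len(dosages) != len(betas):
        raise ValueError("dosages and betas must have the same length")
    return sum(d * b for d, b in zip(dosages, betas))


# Hypothetical participant with three SNPs (the study used 133).
example_dosages = [0, 1, 2]
example_betas = [0.10, 0.20, 0.05]  # made-up weights
prs = unweighted_prs(example_dosages)
w_prs = weighted_prs(example_dosages, example_betas)
```

With 133 biallelic SNPs, the unweighted score can range from 0 to 266 risk alleles.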
The PRS was initially analyzed as the response in a multivariable linear model to assess potential confounding. Although only sex and two ancestry components were significant, we calculated an adjusted PRS as the residuals of the linear model that included sex, age, five ancestry components, array, and family history of CRC. We then used the adjusted PRS to estimate averages for the different risk groups (population control, screening control, LRL, IRL, HRL, screening CRC, and clinically diagnosed CRC). Differences in mean PRS between categories of each variable (ethnicity, sex, age, and FIT result) were assessed using Student’s
t test. To identify distributional changes across groups, the non-parametric Mann-Kendall trend test was used [
17,
18]. Stratified analyses by sex and age (< 60, ≥ 60) were also carried out.
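For readers unfamiliar with the Mann-Kendall trend test, a minimal Python sketch of its statistic follows. It uses the usual normal approximation with tie correction and continuity correction. The input (one summary value per risk group, ordered by severity) and the example numbers are assumptions for illustration; the text does not specify how the series was constructed, and the analysis was run in R.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import NormalDist
from typing import Sequence, Tuple


def mann_kendall(values: Sequence[float]) -> Tuple[int, float, float]:
    """Return (S, Z, two-sided p) for an ordered series of values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # S: concordant minus discordant pairs, earlier value a vs later value b.
    s = sum((b > a) - (b < a) for a, b in combinations(values, 2))
    # Variance of S under the null, corrected for tied values.
    tie_sizes = Counter(values).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18
    if var_s == 0:  # every value identical: no evidence of trend
        return s, 0.0, 1.0
    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, z, p


# Made-up mean adjusted PRS for seven ordered risk groups.
hypothetical_means = [-0.9, -0.6, -0.2, 0.1, 0.4, 0.8, 1.1]
s_stat, z_stat, p_value = mann_kendall(hypothetical_means)
```

The normal approximation is rough for very short series, for which exact tables are normally preferred.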
Further, we performed additional analyses restricted to screening samples, which were dichotomized into cases with pathogenic lesions (n = 648; IRL, HRL, and screening CRC) and controls (n = 956; LRL and screening controls). Additional stratified analyses by sex, age (< 60, ≥ 60), and FIT result (positive, negative) were conducted within this subset. We carried out two exploratory analyses to verify the consistency of the results. First, we used the w-PRS instead of the PRS. Second, cases were redefined as HRL and screening CRC (n = 296), and controls comprised the IRL group together with LRL and screening controls (n = 1308).
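The two case definitions can be checked against the reported group sizes with a short Python sketch. The lesion-group counts are those given in the text; the screening-control count of 670 is not stated explicitly and is derived here as 956 − 286, and the dictionary and function names are hypothetical.

```python
from typing import Dict, Set, Tuple

# Screening participants by colonoscopy category.
GROUP_COUNTS: Dict[str, int] = {
    "screening control": 670,  # derived: 956 controls minus 286 LRL
    "LRL": 286,
    "IRL": 352,
    "HRL": 226,
    "screening CRC": 70,
}


def dichotomize(case_groups: Set[str]) -> Tuple[int, int]:
    """Return (n_cases, n_controls) for a given set of case categories."""
    unknown = case_groups - GROUP_COUNTS.keys()
    if unknown:
        raise ValueError(f"unknown groups: {sorted(unknown)}")
    n_cases = sum(n for g, n in GROUP_COUNTS.items() if g in case_groups)
    n_controls = sum(GROUP_COUNTS.values()) - n_cases
    return n_cases, n_controls


# Main definition: IRL, HRL, and screening CRC are cases.
main_split = dichotomize({"IRL", "HRL", "screening CRC"})
# Exploratory definition: only HRL and screening CRC are cases.
strict_split = dichotomize({"HRL", "screening CRC"})
```

Both splits partition the same 1604 screening participants, which also matches the 1242 FIT-positive plus 362 FIT-negative subjects.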
The predictive accuracy of the models for discriminating cases from controls was assessed with sensitivity, specificity, and the area under the receiver operating characteristic curve (aROC), as implemented in the pROC R package [
19]. To reduce the impact of estimating the model coefficients in our own data, we used fivefold cross-validation to calculate the aROC. The
roc.test function from the same package, with the DeLong test, was used to compare two aROCs. The utility of the PRS was assessed by calculating the positive predictive values (PPV) and negative predictive values (NPV). Because most of the subjects had been selected on the basis of a previous FIT, our sample was not representative of the prevalence of lesions in the average-risk population; estimates of the population PPV and NPV were therefore calculated as a weighted average of the values calculated by strata according to FIT result. Based on the number of participants with a positive FIT result in the population screening program, sampling weights of 0.06 and 0.94 were applied for participants with a positive and a negative FIT result, respectively [
13].
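The FIT-weighted predictive values can be illustrated with a short Python sketch. It follows the description literally, as a weighted average of stratum-specific values with weights 0.06 and 0.94; the function names and the 2×2 counts are hypothetical and do not reproduce the study data.

```python
from typing import Tuple

W_FIT_POSITIVE = 0.06  # sampling weight for FIT-positive participants
W_FIT_NEGATIVE = 0.94  # sampling weight for FIT-negative participants


def ppv_npv(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Positive and negative predictive values from a 2x2 table."""
    return tp / (tp + fp), tn / (tn + fn)


def population_estimate(value_fit_pos: float, value_fit_neg: float) -> float:
    """Weighted average of a stratum-specific value across FIT strata."""
    return W_FIT_POSITIVE * value_fit_pos + W_FIT_NEGATIVE * value_fit_neg


# Made-up counts for a "high PRS" classifier within each FIT stratum.
ppv_pos, npv_pos = ppv_npv(tp=60, fp=40, tn=50, fn=50)  # FIT positive
ppv_neg, npv_neg = ppv_npv(tp=10, fp=40, tn=90, fn=10)  # FIT negative

population_ppv = population_estimate(ppv_pos, ppv_neg)
population_npv = population_estimate(npv_pos, npv_neg)
```

Because the FIT-negative stratum carries 94% of the weight, the population estimates sit close to the values observed among FIT-negative participants.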
Lastly, a quantile plot was produced by stratifying the screening population according to the adjusted PRS. Participants were categorized into deciles based on the distribution in the control group. The first PRS decile was treated as the reference category. The ORs and corresponding 95% CIs for the association between PRS and CRC risk (including IRL, HRL, and CRC) were estimated using unconditional logistic regression models.
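A minimal Python sketch of the decile analysis follows, assuming cut points are taken from the control scores and each decile is compared with the first. It computes a crude OR with a Wald 95% CI from a 2×2 table, which matches an unadjusted logistic regression with a categorical predictor but is not the authors' R code. The function names and example data are hypothetical, and the `inclusive` quantile method is chosen only because it corresponds to R's default quantile type.

```python
from bisect import bisect_right
from math import exp, log, sqrt
from statistics import quantiles
from typing import List, Sequence, Tuple


def decile_cutpoints(control_scores: Sequence[float]) -> List[float]:
    """Nine cut points that split the control distribution into deciles."""
    return quantiles(control_scores, n=10, method="inclusive")


def assign_decile(score: float, cutpoints: Sequence[float]) -> int:
    """Decile index from 1 (lowest scores) to 10 (highest scores)."""
    return bisect_right(cutpoints, score) + 1


def odds_ratio_wald(cases_dec: int, controls_dec: int,
                    cases_ref: int, controls_ref: int
                    ) -> Tuple[float, float, float]:
    """Crude OR versus the reference decile with a Wald 95% CI."""
    odds_ratio = (cases_dec * controls_ref) / (controls_dec * cases_ref)
    se_log_or = sqrt(1 / cases_dec + 1 / controls_dec
                     + 1 / cases_ref + 1 / controls_ref)
    lower = exp(log(odds_ratio) - 1.96 * se_log_or)
    upper = exp(log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Made-up control scores and made-up counts for the top vs first decile.
hypothetical_controls = list(range(1, 101))
cuts = decile_cutpoints(hypothetical_controls)
top_decile = assign_decile(100, cuts)
or_top, ci_low, ci_high = odds_ratio_wald(30, 10, 10, 30)
```

By construction, each decile holds about 10% of controls, so an excess of cases in the upper deciles translates directly into ORs above 1 relative to the first decile.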
All statistical analyses and graphical representations were carried out using R statistical software (R Foundation for Statistical Computing, Vienna, Austria) and SNP selection and PRS calculations were performed using PLINK version 1.9 [
20]. All statistical tests were two-sided and statistical significance was set at
α = 5%.