Background
Highly heritable triglycerides (TG) [1] are an important measure of heart health. Having excess levels of TG can increase the risk of heart disease. Identified common variants only explain 12% to approximately 15% of the variance in lipid levels [2]. A substantial proportion of lipid heritability is unexplained [3]. This suggests that rare (minor allele frequency [MAF] < 1%) or intermediate variants (0.01 < MAF < 0.05) with potentially larger effect sizes or other mechanisms, such as gene–environment interactions, may play a role in explaining the substantially missing heritability.
Clear evidence shows that lipids vary by age. A handful of lipid loci with age-dependent effects were identified from candidate gene studies and genome-wide association study (GWAS) [4, 5]. However, few of these explored the role of gene–age interaction for rare and intermediate variants in lipid levels. More than 74.6% of variants are rare and intermediate variants [6], which may have a larger effect size than common variants and explain substantial proportions of lipid variance. In this GAW20 study, we attempted to detect the effect of gene–age interactions on TG for rare and intermediate variants with novel statistical methods.
Anzeige
Due to the allelic heterogeneity and the extreme rarity of individual variants [7], most existing methods focus on improving the power of detecting gene–environment (G × E) interactions only for individual markers, especially for common variants, and are not optimal for detecting rare variants. Although there has been interest in multiple-marker analysis by grouping single-nucleotide polymorphisms (SNPs) into biologically meaningful SNP-sets (eg, SNPs in a gene or pathway) to improve power and interpretability, the existing SNP set analysis has focused on testing for the marginal effect of a SNP set [8, 9]. Limited work has been done on testing the interactions between a SNP set and an environmental variable, especially as it pertains to rare variants. Although the SNP-set–based interaction sequence kernel association test (iSKAT) [10] can be applied to detect G × E interactions in rare variants, its power is very restricted and lacks robustness to the shape of the data in many circumstances. Motivated by the need for powerful methods to test G × E interactions for rare variants, we developed two novel methods: testing of optimally weighted combination of SNP environment interaction (TOW-SE) and a variable weight TOW-SE (VW-TOW-SE) to identify G × E interactions for SNP sets of common and/or rare variants in GWAS, exome, or next-generation sequencing data. Our simulation studies show that the Type I error rates of the proposed methods are under control. Comparing the two methods with the iSKAT, and the VW-TOW-SE is the most powerful test, TOW-SE is the second most powerful test when G × E interaction effect exists for both rare and common variants.
The Genetic Analysis Workshop (GAW) 20 “SimulationBestOneRepresentative” data includes TG levels before and after treatment with fenofibrate and genotyped genome-wide SNPs from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study [11]. We imputed chromosomes 1, 6, 8, 9, 10, and 17, where five major main-effect causal SNPs of TG reside. We applied the proposed methods (TOW-SE and VW-TOW-SE) and iSKAT to the unrelated individuals (sample size n = 246) to test TG susceptible to gene–age interactions on the five imputed genes. The main effect of common SNP was simulated in the five regions. However, the gene–age interaction effect was not included. As expected, using the proposed methods, none of the regions indicated significantly gene–age interaction effects on TG.
Methods
Consider a sample of n individuals. Each individual has been genotyped at M variants in a genomic region (a gene or a pathway). For the ith individual, denote yi as the trait value (continuous or binary); Ei as the environmental variable (continuous or binary); Gi = (gi1, ⋯, giM) as the genotypic scores at M variants, where gim ∈ {0, 1, 2} is the number of minor alleles the ith individual has at the mth variant. Zi denotes the potential confounder covariates.
We use the generalized linear model (GLM):
Anzeige
(1)
to model the relationship between trait values and G × E interactions EiGi, where f (∙) is a monotone “link” function. For a quantitative trait, f (∙) will be an identity link function. For a binary trait, a logit link function will be used. Coefficients of each term in Eq. (1) are denoted by α0, a, β, Ϛ, and η, respectively. To test for the G × E interaction for a SNP set of M SNPs is equivalent to testing the null hypothesis H0 : β = 0 in Eq. (1).
To test H0 : β = 0 in Eq. (1), we developed a score test by treating α0, a, Ϛ, and η as nuisance parameters. First, we adjusted both trait value yi and G × E interaction EiGi for the covariates Zi, the genotypic score Gi and the environmental variable Ei by applying linear regression and obtaining residuals. Denote\( {\overset{\sim }{y}}_i \) as the residual of yi and \( {\overset{\sim }{X}}_i=\left({\overset{\sim }{x}}_{i1},\cdots, {\overset{\sim }{x}}_{iM}\right) \)as the residual of EiGi. Then, the relationship between\( {\overset{\sim }{y}}_i \) and \( {\overset{\sim }{X}}_i=\left({\overset{\sim }{x}}_{i1},\cdots, {\overset{\sim }{x}}_{iM}\right) \) can be modeled by the GLM:
$$ f\left(E\Big({\overset{\sim }{y}}_i|{\overset{\sim }{X}}_i\Big)\right)={\beta_0}^{\ast }+{\overset{\sim }{X}}_i{\beta}^{\ast } $$
(2)
To test Ho : β = 0 in equation (1) is equivalent to test Ho : β∗ = 0 in equation (2). Sha et al. [12] proposed a score test to test Ho : β∗ = 0 in GLM. However, for rare variants’ SNP–environment interactions, the score test may lose power as a consequence of the sparse data and a large degree of freedom. To increase power by effectively using information from data, we proposed to test the G × E interactions by testing the effect of a weighted combination of SNP–environment interactions, \( {\overset{\sim }{x}}_i={\sum}_{m=1}^M{w}_m{\overset{\sim }{x}}_{im}. \)
To test\( {\overset{\sim }{x}}_i={\sum}_{m=1}^M{w}_m{\overset{\sim }{x}}_{im} \), the score test is:
$$ {\displaystyle \begin{array}{c}S\left({w}_1,\cdots, {w}_M\right)=n\frac{{\left({\sum}_{i=1}^n\left({\overset{\sim }{y}}_i-\overset{\sim }{y}\right)\left({\overset{\sim }{x}}_i-\overset{\sim }{x}\right)\right)}^2}{\sum \limits_{i=1}^n{\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)}^2{\sum}_{i=1}^n{\left({\overset{\sim }{x}}_i-\overline{\overset{\sim }{x}}\right)}^2}\\ {}=n\frac{{\left({\sum}_{m=1}^M{w}_m{\sum}_{i=1}^n\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)\left({\overset{\sim }{x}}_{im}-{\overline{\overset{\sim }{x}}}_m\right)\right)}^2}{\sum \limits_{i=1}^n{\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)}^2{\sum}_{i=1}^n{\left({\overset{\sim }{x}}_i-\overline{\overset{\sim }{x}}\right)}^2}\end{array}} $$
(3)
It reaches its maximum \( {S}_o\left({w}_1^0,\cdots, {w}_M^0\right)=n\sum \limits_{i=1}^n\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)\left({\overset{\sim }{x}}_i^0-{\overline{\overset{\sim }{x}}}^0\right)/{\sum}_{i=1}^n{\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)}^2 \) when \( {w}_m^0=\frac{\sum \limits_{i=1}^n\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)\left({\overset{\sim }{x}}_{im}-{\overline{\overset{\sim }{x}}}_m\right)}{\sum \limits_{i=1}^n{\left({\overset{\sim }{x}}_{im}-{\overline{\overset{\sim }{x}}}_m\right)}^2} \); \( {\overset{\sim }{x}}_i^0={\sum}_{m=1}^M{w}_m^0{\overset{\sim }{x}}_{im}, \) as rare variants are essentially independent. Thus, \( {w}_m^0 \) is the optimal weight. We define \( {T}_{T- SE}=\sum \limits_{i=1}^n\left({\overset{\sim }{y}}_i-\overline{\overset{\sim }{y}}\right)\left({\overset{\sim }{x}}_i^0-{\overline{\overset{\sim }{x}}}^0\right) \) as the statistic to Test the effect of the Optimally Weighted combination of SNP-Environment interactions (TOW-SE), which is equivalent to \( {S}_o\left({w}_1^0,\cdots, {w}_M^0\right) \) when we use a permutation test to evaluate p-values.
We analytically derive optimal weights for TOW-SE. The optimal weight \( {w}_m^0 \) will put a big weight to SNP–environment interactions that have strong association with the trait of interest and also adjust the direction of the association. Moreover, it will put big weights to SNP–environment interactions with small variations that are often rare variants. iSKAT assigns weights to variants based on the MAFs via a beta function. It put decent nonzero weights for variants with MAF in (0.01, 0.05). If MAFs of causal variants are not in the range of (0.01, 0.05), iSKAT will be less powerful than TOW-SE. TOW-SE targets rare variants and may lose power when testing G × E effects of both rare and common variants.
To test for the G × E interaction of both rare and common variants, we propose variable weight TOW-SE (VW-TOW-SE). We divide variants into rare and common. Let Tr and Tc denote the test statistic of TOW-SE for rare and common variants, respectively. Let \( {T}_{\lambda }=\lambda \frac{T_r}{\sqrt{\mathit{\operatorname{var}}\left({T}_r\right)}}+\left(1-\lambda \right)\frac{T_c}{\sqrt{\mathit{\operatorname{var}}\left({T}_c\right)}}. \) Denote pλ as the p-value of Tλ. The test statistic of VW-TOW-SE is defined as TVW − T − SE = min0 ≤ λ ≤ 1pλ. We will use permutations to evaluate p-values of both TT − SE and TVW − T − SE.
Anzeige
Simulations
Following the simulation setting in Lin et al. [10], we conducted simulation studies using the GAW17 empirical mini-exome sequenced data. The data set contains genotypes of 697 unrelated individuals on 3205 genes. Research shows that SNP rs11583200 on gene ELAVL4 is associated with body mass index [13] and rare variants on gene ELAVL4 are associated with the quantitative trait Q1 in the GAW17 data [14]. Therefore, we chose gene ELAVL4 in our simulation study. There are 10 variants on gene ELAVL4 of which 8 are rare variants and 2 are common variants. The rare variants threshold was chosen as 0.01. We use the program fastPHASE [15] to infer haplotypic phase for the 697 individuals and calculate haplotype frequencies. To generate the genotype of an individual, we generate 2 haplotypes according to the haplotype frequencies. The quantitative trait was generated using the following model:where Z1~N(0, 1); Z2~Binomial(1, 0.5) and ϵ~N(0, 1).The environmental variable E is assumed to be continuous following standard normal distribution and we set α1 = 0.015; EG is the rare variants G × E interaction and EGc is one common variant G × E interaction.
$$ Y=0.5{Z}_1+0.5{Z}_2+E{\alpha}_1+{\boldsymbol{G}}^{\boldsymbol{T}}{\alpha}_2+\boldsymbol{E}{\boldsymbol{G}}^{\boldsymbol{T}}\beta +\boldsymbol{E}{\boldsymbol{G}}_{\boldsymbol{c}}{B}^c+\epsilon $$
(4)
We consider two scenarios: (a) with main effect and (b) without main effect in the model (4). When there are no main effects, we set the magnitudes of vector α2 = 0.3 for each element and their signs are randomly sampling from (−1, 1). When there are no main effects, we set α2 = 0. To evaluate the Type I error, we set β and βc all to 0. To evaluate power, we vary the number of non-zero elements βj in β. We set the magnitude of the nonzero βj as |βj| = c, and increase c from 0.02 to 0.1. Of the βj, 50% are positive. βc is positive and twice the magnitude of βj. The sample size is 2000 for each scenario. P values are estimated by 10,000 permutations. The Type I error rates and power are evaluated using 1000 replicated samples.
GAW20 data analysis
We applied TOW-SE, VW-TOW-SE, and iSKAT to the GAW20 “SimulationBestOneRepresentative” data, which includes TG levels before and after treatment with fenofibrate and genotyped genome-wide SNPs from the GOLDN project [11]. We imputed chromosomes 1, 6, 8, 9, 10, and 17 with minimac2 software [16]. Five major main effect causal SNPs (rs9661059, rs736004 [LYRM4], rs1012116, rs10828412, and rs4399565 [HS3ST3A1]) of TG reside on chromosomes 1, 6, 8, 9, 10, and 17, respectively. The 1000 Genomes project haplotypes integrated phase I served as the reference panel. It includes 1092 individuals. Our analysis was based on 246 unrelated individuals. Both pre-treatment and post-treatment TG values were provided for two visits. We used the log ratio of the pre-treatment mean and the post-treatment mean as the phenotype trait in our G × E interaction analysis. For individuals who did not have two visits, we just used the existing value. The median age is 64 years (range: 28–83 years). We used the centered age in our analysis. The median of the TG ratio is 1.54 (range: 0.72–4.24). We excluded 8 individuals from our analysis because of completely missing post-treatment TG values.
We evaluated the performance of TOW-SE, VW-TOW-SE, and iSKAT by testing G × E interaction effect on TG for the aforementioned 5 regions. Each region consists of 10 SNPs, and the fifth SNP in each region is the major main effect causal SNP of TG. Additionally, we assessed the effect of region size on the power of the tests using the region of rs4399565.
Anzeige
Results
Table 1 shows that the Type I error rates of all the three methods are under control. Power comparisons of the three tests (VW-TOW-SE, TOW-SE, and iSKAT) for different values of G × E effect for rare and common variants are given in Fig. 1 (with main effect) and Fig. 2 (without main effect). The power of the three tests increases as the effect size increases. When there is a G × E interaction effect for both rare and common variants, VW-TOW-SE is the most powerful test and TOW-SE is the second most powerful test. Table 2 shows that the median of the MAF ranges from 0.003 to 0.013 in the 5 regions. When we apply the three tests to test G × E interaction effect, under Bonferroni correction, none of the regions are significantly associated with TG. Table 3 suggests that all of the three methods perform better when the region size is larger.
Table 1
Type I error rates for both rare and common variants in the presence of main effects (top panel) and in the absence of main effects (bottom panel) for n = 2000
α-level | TOW-SE | iSKAT | VW-TOW-SE | |
---|---|---|---|---|
With main effect | ||||
n = 2000 | 0.050 | 0.050 | 0.055 | 0.059 |
0.010 | 0.011 | 0.012 | 0.015 | |
0.001 | 0.000 | 0.001 | 0.000 | |
Without main effect | ||||
n = 2000 | 0.05 | 0.051 | 0.061 | 0.056 |
0.01 | 0.011 | 0.013 | 0.009 | |
0.001 | 0.000 | 0.003 | 0.001 |
Table 2
Results of testing G × E interaction in the five causal regions using the three methods
Region name (SNP set) | Median of MAF (range) |
P
TOW-SE
|
P
iSKAT
|
P
VW-TOW-SE
|
---|---|---|---|---|
rs736004 | 0.013 (0.001–0.387) | 0.020 | 0.022 | 0.019 |
rs1012116 | 0.015 (0.002–0.167) | 0.024 | 0.030 | 0.021 |
rs4399565 | 0.006 (0.001–0.449) | 0.022 | 0.023 | 0.019 |
rs9551059 | 0.007 (0.001–0.275) | 0.031 | 0.030 | 0.037 |
rs10828412 | 0.003 (0.001–0.394) | 0.100 | 0.099 | 0.095 |
Table 3
Region size effect analysis for the three methods based on the region of rs4399565
Region size | Median of MAF (range) |
P
TOW-SE
|
P
iSKAT
|
P
VW-TOW-SE
|
---|---|---|---|---|
10 SNPs | 0.006 (0.001–0.449) | 0.022 | 0.023 | 0.019 |
20 SNPs | 0.008 (0.001–0.449) | 0.014 | 0.018 | 0.015 |
30 SNPs | 0.008 (0.001–0.449) | 0.013 | 0.014 | 0.012 |
×
×
Discussion
The computation time for TOW-SE and VW-TOW-SE using 10,000 permutations for analyzing 1000 individuals in a region that includes 50 SNPs is 9 s and 20 s, respectively. Suppose a whole genome sequencing data with 13,498,188 SNPs, TOW-SE will take 674 h (28 days) to conduct a whole genome analysis. The effect analysis of the region size suggests that the three methods will perform better when the region size is larger. However, the larger the region size, the higher chance for collinearity to appear in the region, which makes the computation more complex. To minimize the problem of collinearity, we recommend a region size between 10 SNPs and 30 SNPs when we apply the proposed methods to a genome-wide scan.
Conclusions
In summary, we developed two novel statistical methods: TOW-SE and VW-TOW-SE by grouping SNPs into biologically meaningful SNP-sets, which improved power and interpretability. Simulation studies show that the proposed methods yielded well-controlled Type I error rates under all study conditions. When gene environment interaction effect exists for both rare and common variants, VW-TOW-SE is the most powerful test, TOW-SE is the second most powerful test, and iSKAT is the least powerful test.
Funding
Publication of this article was supported by NIH R01 GM031575.
Anzeige
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.