Discussion
Genome-wide association studies have arguably become the mainstay of identifying genetic risk factors for complex disease. However, these studies cannot identify which gene(s) in an associated region are responsible for the association, and testing all variants individually and independently is likely suboptimal. Here, we used an integrative method that combines the genetic component of gene expression with genetic association analysis in severe COPD and quantitative emphysema to predict differentially expressed genes. Importantly, this method focuses on the association of the genetic component of gene expression, not gene expression as a whole, as is typical in most gene expression studies. We also provided additional support for our results by examining a second gene expression dataset and by performing colocalization analysis, which attempts to identify whether association signals for gene expression and a phenotype of interest appear to be driven by the same causal variant(s). We implicated genes that are genetically regulated in known COPD-susceptibility loci, such as FAM13A, and also found genes in regions that were not previously reported: WNT3 for severe COPD, and DCBLD1 and LILRA3 for quantitative emphysema.
We found a novel association of WNT3 in lung tissue with severe COPD in two gene expression datasets. Although variants surrounding this gene in the 17q21 locus were not genome-wide significant in our COPD GWAS (Fig. 3), the top signal (rs9912530) is in strong LD with variants previously reported in GWAS of FEV1 [18, 19], interstitial lung disease [20], and idiopathic pulmonary fibrosis [21] (r² with these previously described variants, 0.55–0.72). WNT3 (Wnt family member 3) encodes Wnt3, a critical component of the Wnt-beta-catenin-TCF signaling pathway [22] and a required signal for the apical ectodermal ridge in limb patterning [23]. Deficient WNT3 is associated with tetra-amelia syndrome, a Mendelian disease characterized by an absence of all limbs. The top signal is also in strong LD with variants associated with various complex diseases such as Parkinson’s disease and celiac disease (r² 0.72–0.79) [24, 25]. Previous expression studies of small airway epithelium found that this gene, along with other members of the Wnt signaling pathway, was down-regulated in smokers compared with nonsmokers [26]. Of interest, FAM13A, a well-supported COPD susceptibility gene, has been implicated in the beta-catenin/Wnt signaling pathway through protein degradation [27]. While there is substantial interest in Wnt signaling in lung disease [28], the contribution of WNT3 to the pathogenesis of COPD requires further investigation. To address whether these findings were specific for severe COPD, we repeated the analysis including moderate disease (GOLD 2). All of our genes were at least nominally significant, though overall the significance of our findings was attenuated (Additional file 1: Table S4).
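For readers less familiar with the LD statistic quoted above, r² between two biallelic variants can be approximated as the squared Pearson correlation of their genotype dosages (0, 1, or 2 copies of the alternate allele). The following is a minimal illustrative sketch only, not the pipeline used in this study; the function name `ld_r2` and the input arrays are hypothetical, and in practice LD is usually taken from a phased reference panel.

```python
import numpy as np

def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two genotype dosage vectors.

    Assumption: each input holds one dosage (0, 1, or 2) per individual,
    in the same individual order. This is the genotype-based r^2, which
    approximates the haplotype-based r^2.
    """
    a = np.asarray(dosage_a, dtype=float)
    b = np.asarray(dosage_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("dosage vectors must have the same length")
    if a.std() == 0 or b.std() == 0:
        # A monomorphic variant has no variance, so r^2 is undefined.
        raise ValueError("r^2 is undefined for a monomorphic variant")
    r = np.corrcoef(a, b)[0, 1]
    return float(r ** 2)
```

Because r² is a squared correlation, it does not depend on which allele is coded as the alternate allele, so two variants whose dosages are perfectly inversely related are still in complete LD under this measure.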
For emphysema, we identified novel associations of LILRA3 and DCBLD1 using whole blood and lung tissue, respectively, and validated these findings in additional gene expression datasets. LILRA3 (leukocyte immunoglobulin like receptor A3), located in the 19q13 locus, encodes a soluble receptor for class I major histocompatibility complex (MHC) antigens that is expressed in monocytes and B cells. Our top GWAS hit in this locus was not genome-wide significant (rs384116 with P = 1.88 × 10⁻⁵; Additional file 1: Figure S1) and is 13 Mb away from the previously reported locus [16] that contains EGLN2 and RAB4B (rs7937; r² 0.002). It is in modest LD with variants suggestively associated with FEV1/FVC [18] (r² 0.44), and in strong LD with variants genome-wide significantly associated with HDL-C level [29] and prostate cancer [30] (r² 0.92–0.99). Blood may be the most relevant tissue for this gene, as it is preferentially expressed [31] with a high estimate of heritability of gene expression in whole blood [32]. However, it may also have an effect in other tissues, given its broad eQTL effects identified by multi-tissue eQTL analysis [33]. This was supported by the suggestive signals for this gene using lung tissue in S-PrediXcan analysis (P = 7.71 × 10⁻⁵ in GTEx-Lung and 1.38 × 10⁻⁴ in the Lung-eQTL Consortium, with the same direction of effect). Nonetheless, its functional role in COPD has not been described previously. Our other novel association identified in lung tissue, DCBLD1 (discoidin, CUB and LCCL domain containing 1), located in the 6q22 locus, is an integral component of cell membranes and binds to oligosaccharides [34]. GWAS signals in this locus are also below genome-wide significance (Additional file 1: Figure S2). Our top GWAS variant at this locus was in LD with variants associated with lung cancer [35] (r² 0.54).
In addition to novel associations, our study also provides insight into disease-associated genes in known COPD susceptibility loci. We identified six genes (FAM13A, GPRIN3, HYKK, PSMA4, EGLN2, and RAB4B) in three known COPD-susceptibility loci whose genetic component of gene expression in blood or in lung tissue is associated with severe COPD. Five of these six genes are not the most proximal to the top associated SNP, a phenomenon previously observed in other genetic association studies [36, 37]. These findings underscore the complexity of genetic regulation in tissues and also identify multiple potential effector genes in the same locus. For example, in 15q25, PSMA4, and not CHRNA3 (the nearest gene to the top GWAS hit), was highlighted in S-PrediXcan and colocalization analysis. Although a role for IREB2 has been clearly demonstrated [38], our study suggested that other genes in the locus, particularly PSMA4, which encodes a subunit of the proteasome complex that acts in the proteolytic pathway [39], may also be of biologic importance.
At the 4q22 locus, an association for FAM13A identified using DGN-Blood was not validated in the GTEx-Blood dataset. However, a significant but directionally opposite association was identified in the Lung-eQTL Consortium dataset. To further explore this phenomenon, we examined individual SNP eQTL data from Framingham Heart Study (FHS) blood and from lung tissue from the Lung-eQTL Consortium (Additional file 1: Supplementary Methods). We confirmed that SNPs have opposite directions of effect in lung and blood (Additional file 1: Figures S3 and S4). This finding is consistent with prior reports describing significant and opposite tissue-specific effects of eQTLs [33, 40, 41]. The interpretation of this phenomenon is not clear, but it may be a result of pleiotropic effects of FAM13A [42, 43]. Of note, a recent analysis of emphysema-related gene expression in blood and lung tissue [44] found that the expression of genes in the two tissues is often opposite; together, our findings highlight the tissue-specific genetic regulation of genes in COPD susceptibility loci. At the 19q13 locus, while both EGLN2 and RAB4B were successfully validated, only GWAS and eQTL signals for EGLN2 colocalized. This genetic locus has been associated with COPD [16] and smoking behavior [45]. Although the causal gene(s) in this region remain unclear, methylation and expression studies support the role of EGLN2 in this region [46]. EGLN2 (egl-9 family hypoxia inducible factor 2) encodes an enzyme that regulates the degradation of the alpha subunit of hypoxia inducible factor (HIF) [47]. Gene and protein expression of HIF-1α is reduced in lung tissue samples from COPD patients [48].
Although ATF6B (activating transcription factor 6 beta) and ITGA1 (integrin subunit alpha 1) were not successfully validated, we cannot rule out the possibility of false negatives due to differences between the transcriptome datasets used for validation, and they are potentially interesting candidates for COPD. ATF6B was implicated in the unfolded protein response (UPR) pathway during endoplasmic reticulum (ER) stress following cigarette smoke exposure, and may contribute to lung inflammation in patients with COPD [49], while integrins were found to be involved in COPD through the mitogen-activated protein kinase (MAPK) pathway [50, 51]. This region also harbors variants associated with FEV1/FVC [52]. Decreased expression of ITGA1 was observed in the small airways of patients with low FEV1 [53].
Our analysis assesses only the genetic component of gene expression. We also investigated whether these genes were differentially expressed in COPD patients, in 464 blood samples from the COPDGene study [54] and 151 lung tissue samples [55] (Additional file 1: Supplementary Methods and Tables S5–S8). These genes were not differentially expressed, with the exception of LILRA3, which was nominally significant with %LAA-950 (P = 0.03). Given that the genetic component of gene expression was replicated, we believe that the genetic findings are robust, and speculate that these null findings could be due to non-genetic (i.e., environmental) perturbations that may occur downstream of, or as a result of, the genetic effects. In fact, in several cases measurements of mRNA or protein are actually opposite to those predicted by genetic risk. For example, SERPINA1 risk alleles result in decreased alpha-1 antitrypsin levels and increased risk for COPD, yet, on average, alpha-1 antitrypsin levels in patients with COPD are actually elevated. Similarly, genetic variants in AGER and DSP affect transcript or protein levels in the direction opposite to what is measured in disease [4, 56, 57]. The mechanisms underlying our genetic findings, as well as those for AGER and DSP, that result in null or opposite-direction effects require further experimental investigation.
In addition to examination of individual loci, we applied pathway enrichment analysis to nominally significant differentially expressed genes in severe COPD and quantitative emphysema, both in whole blood and lung tissue. This analysis identified enrichment of the T cell receptor signaling pathway in emphysema. This finding is consistent with reports that found antigen-specific T cell differentiation in lungs of patients with severe emphysema [58]. Our analysis using gProfileR does not assess direction of effect, and the relative up- or down-regulation of specific genes in this pathway makes determination of direction difficult. To attempt to infer direction, we used Gene Set Enrichment Analysis (GSEA; [59]). In these results, the TCR signaling pathway and downstream TCR response were up-regulated, though these results were not statistically significant (Additional file 1: Table S9). Further study will be needed to determine the combined effects of COPD genetic susceptibility variants on T cell function and whether these explain some of the immune dysfunction seen in COPD [60, 61]. The finding of enrichment of genes in the proteasome core complex further suggested a role for the proteasome in COPD, as described previously. Somewhat surprisingly, we observed enrichment of the asthma pathway in KEGG using genes identified in quantitative emphysema. This finding complements the description of substantial genetic correlation of COPD and asthma [4], and the presence of quantitative emphysema (or lung hyperinflation) in asthmatic patients [62].
Our study did not identify associations of genetically regulated differential expression of genes at some previously reported GWAS loci. Moreover, some of the associations identified in our discovery dataset were not successfully validated in a second transcriptome dataset. These findings indicate some of the limitations of our approach. First, as S-PrediXcan uses cis genetic variants as predictors for gene expression, variants that have little or no effect on transcript abundance or act in trans would not be detected by this approach [63]. Second, although most genetic variants implicated by GWAS are likely regulatory, only a minority of genetic loci are explained by existing eQTLs [64]. This may be due to lack of data in the appropriate tissue, cell type, or biologic conditions; or the heterogeneity of gene expression studies of bulk tissue. We may overcome these issues as more gene expression datasets and newer techniques such as single-cell gene expression profiling [65] become widely available. Moreover, issues such as cell type composition, sample collection methods, disease status, and differences in analytic methods also made the overlapping analysis challenging. Third, the number of genes available for an analysis depends on the power and sample size of the expression data used in constructing a gene expression prediction model [8, 9]. Given the noisy and condition-specific nature of gene expression datasets, variants with small effects on gene expression may be undetectable at the sample sizes available. Additionally, the difference in sample size among transcriptome databases decreases our power to validate or discover more genes.
However, despite technical and population differences, most cis-eQTLs appear to be consistent between studies [66]. Therefore, despite the sometimes modest overall correlation between predicted and measured gene expression, associations of the genetic component of gene expression, as inferred by imputed gene expression, have been successful in identifying disease-associated genes that complement existing methods.
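To make concrete how a gene-level association can be inferred from GWAS summary statistics and an expression prediction model, the following is a minimal sketch of the summary-statistic approximation described in the S-PrediXcan literature, in which the gene-level Z-score is a weighted sum of SNP-level GWAS Z-scores, Z_g ≈ Σ_l w_l (σ_l / σ_g) Z_l. It is an illustration under stated assumptions, not the software used in this study: the function name `spredixcan_z` and its inputs are hypothetical, the prediction weights are assumed to come from a pre-trained cis-eQTL model, and the SNP covariance matrix is assumed to come from a reference panel.

```python
import numpy as np

def spredixcan_z(weights, gwas_z, snp_cov):
    """Approximate gene-level Z-score from SNP-level GWAS Z-scores.

    weights : prediction-model weights w_l for the SNPs in the gene's model
    gwas_z  : GWAS Z-scores (beta / se) for the same SNPs, same order,
              aligned to the same effect allele as the weights
    snp_cov : covariance matrix of SNP dosages from a reference panel
    """
    w = np.asarray(weights, dtype=float)
    z = np.asarray(gwas_z, dtype=float)
    cov = np.asarray(snp_cov, dtype=float)
    if cov.shape != (w.size, w.size) or z.size != w.size:
        raise ValueError("weights, Z-scores and covariance must be aligned")
    sigma_l = np.sqrt(np.diag(cov))   # per-SNP standard deviations
    sigma_g = np.sqrt(w @ cov @ w)    # SD of predicted expression
    if sigma_g == 0:
        raise ValueError("predicted expression has zero variance")
    return float(np.sum(w * sigma_l * z) / sigma_g)
```

The sign of the result indicates whether genetically predicted higher expression is associated with higher or lower risk, which is why flipping the sign of the weights, as can happen when eQTL effects differ by tissue, reverses the inferred direction of association.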