Materials and methods
Study population
The current meta-analysis incorporated 6,643 individuals derived from two GWAS on BMD at the lumbar spine and femoral neck, the HKSC Study (
n = 778), and dCG Study (dCG,
n = 5,858) [
2]. In the Hong Kong Osteoporosis Study, 800 unrelated women with extreme high or low BMD were selected from a HKSC cohort with extreme BMD. These subjects were selected from a database (>9,000 Southern Han Chinese volunteers) at the Osteoporosis Centre of the University of Hong Kong. Low-BMD subjects are defined as those with a BMD Z-score ≤ −1.28 at either the lumbar spine (LS) or femoral neck (FN) (the lowest 10% of the total cohort). High-BMD subjects comprised individuals with BMD Z-score ≥ +1.0 at either site. Subjects who reported diseases or environmental factors that may affect BMD and bone metabolism were excluded. The recruitment procedure and exclusion criteria have been detailed elsewhere [
3]. The demographic data of studied population are provided in Supplementary Table
1.
BMD and anthropometric measurements
BMD (grams per square centimeter) at the LS and FN was measured by dual-energy X-ray absorptiometry (Hologic QDR 4500 plus, Hologic Waltham, MA, USA) with standard protocol. The in vivo precision of the machine was 1.2% and 1.5% for LS and FN BMD, respectively.
Genotyping
The discovery sample was genotyped via the Infinium assay (Illumina, San Diego, CA) with Human610-quad chip, including 564,214 SNPs. PLINK (version 1.04) was used for data management and quality-control statistics. Seven hundred seventy-eight individuals and 488,853 SNPs were retained for analysis following application of strict quality-control criteria. Subjects were excluded according to the following criteria: (1) genotyping call rate <95% (n = 5), (2) autosomal heterozygosity <27% or >31% (the same five subjects with low genotyping call rate), (3) being related or identical to other individuals in the sample (n = 7), and (4) discordance of observed gender and estimated sex (n = 3). SNPs were excluded if (1) genotyping call rate was ≤95% (1,158 SNPs), (2) Hardy–Weinberg equilibrium p value <1.0 × 10 −4 (904 SNPs), (3) minor allele frequency <0.01 (73,589 SNPs). The average genotyping call rate of retained SNPs was 99.91%.
In silico gene-based GWAS of European subjects
All GWA data for spine and hip BMD of dCG were retrieved from the publication [
2].
Phenotype modeling
To be comparable and compatible in meta-analysis, both studies used standardized residuals of raw BMDs following adjustment for age, weight, and sex (dCG) in the linear regression model.
Definition of gene locus
We defined the gene locus by its position based on UCSC and ±50 kb. Fifty kilobytes is the mean distance between the top intergenic SNPs and the nearest genes identified in the latest meta-analysis of GWAS of BMD [
1].
SOX6 and
MEF2C were excluded from the mean distance calculation as they were the outliers.
Statistical analysis
In the genome-wide association study, the association test of SNPs with standardized residuals of BMD was implemented via PLINK (version 1.04). For each SNP, the asymptotic
p value for the relationship between the number of minor alleles and BMD was derived from a two-sided
t statistic assuming the minor allele has an additive effect. We identified disease-associated genes with two stages. The first stage was the gene-based test using a simulation approach that took account of LD structure of different populations based on HapMap phase two information. Gene-based analysis in each population was done using VEGAS [
4]. In brief, each SNP was assigned to each of 17,787 autosomal genes according to positions on the UCSC Genome Browser hg18 assembly. In order to capture regulatory regions and SNPs in LD, the gene boundaries were defined as 51 kb of 5′ and 3′ UTRs. Then, for a given gene with
n SNPs, association
p values were first converted to upper tail chi-squared statistics with 1 degree of freedom (
df). The observed gene-based test statistic was then the sum of all (or a pre-defined subset) of the chi-squared 1 df statistics within that gene. In the current study, we summed the statistics of all SNPs located within a gene. Using the Monte Carlo approach, a large number of multivariate normal vectors were simulated, and the empirical gene-based
p value was calculated as the proportion of simulated test statistics that exceeded the observed gene-based test statistic. In the second stage, we attempted to meta-analyze the findings from both populations, to increase statistical power and to assess the consistency of evidence in two ethnicities using weighted Z-transformed test as implemented in the R. A weighted Z-transformed test was chosen because it has been suggested that when the number of tests are small, the weighted Z-transformed test performs better than other combination probability methods, such as Fisher's test and generalized binomial test [
5,
6].
Gene-based genome-wide significant level and suggestive level
Among 17,640 genes included in the analysis, 14,605 overlapped with either 5′ and/or 3′ genes with the average overlapping size per gene size (overlapping size with other gene/gene size) 0.62. We therefore arbitrarily defined the gene-based genome-wide significant level as 0.05/(3,035 × 1 + 14,605 × 0.38) = 5.8 × 10−6, while the suggestive level was 1/(3,035 × 1 + 14,605 × 0.38) = 1.2 × 10−4.
Identification of enriched physiological role in genes associated with BMD
The top 35 genes were imported into the Ingenuity Pathways Analysis (IPA) Software (Ingenuity Systems, Redwood City, CA, USA) to obtain networks for further analyses and to determine whether their physiological role was enriched. These top 35 genes were chosen because 35 was the limited number of genes/molecules required to form a functional regulatory gene network in the later gene network inference analysis. The enriched physiological roles were ranked by the p values of the Fisher's Exact Test that indicated the probability of the input gene (from the gene-based GWAS) being associated with genes in the physiological roles by chance.
Gene network inference via knowledge-based data mining
We next analyzed biological interactions among top hits using the IPA tool. The gene annotations from the top hits with suggestive p value were entered into the IPA analysis tool to construct the biological networks of the clustered genes. Networks are generated from the gene set by maximizing the specific connectivity of the input genes, which represents their interconnectedness with each other relative to other molecules to which they are connected in Ingenuity's Knowledge Database. Networks were limited to 35 molecules each to keep them to a functional size. The p value of probability for the genes forming a network was calculated using the right-tailed Fisher's Exact Test based on the hypergeometric distribution.
Discussion
GWA is a powerful tool that can identify genes associated with common diseases or traits such as BMD variation. Nonetheless, GWA studies usually focus on the most significant individual variants without considering the global evidence of the gene tested. It should be noted that allelic heterogeneity (i.e., presence of more than one susceptibility allele in a locus or gene) greatly reduces the power for testing of an individual SNP [
7,
8]. Therefore, a gene-based test can ameliorate the situation by simply testing the global null hypothesis about the SNPs located per gene. The gene-based test is a direct and powerful means of protecting the overall false-positive rate when a collection of loci are tested, because the
p value from the gene-based test has already corrected the number of SNPs included via a simulation approach. Using gene-based analysis of GWA data, our study confirmed several well established candidate genes and suggested several novel genes and loci for BMD variation. Importantly, most of these genes did not contain any SNP that reached genome-wide significance, so the potential importance of these genes would not have been recognized in the absence of gene-based association study.
An ethnic-specific BMD gene may underlie BMD variation in southern Chinese and Europeans. In line with the observations of our recent GWAS, there was no overlap of genes in the significant or suggestive gene list from HKSC and dCG populations We recently identified a SNP rs2273601 in
JAG1 that was associated with spine BMD (
p value = 1.06 × 10
−8) in 1,520 HKSC subjects with extreme BMD; nonetheless, only a modest association of this SNP with spine BMD was observed in three Caucasian cohorts (
p range, 0.007–0.045) [
3]. In the current study, we observed that top hip BMD genes were more consistent in HKSC and dCG, as reflected by the inflation factor and the results from independent
t testing (Supplementary methods, Supplementary Figures
3 to
4, and Supplementary Table
2). The discrepancies of gene-based association results for spine BMD in two populations may be due to a number of factors such as lifestyle, diet, and genetic background. Although these factors may also affect hip BMD, the possibility that spine BMD may be more susceptible than hip BMD to gene and environment interaction cannot be excluded. If this hypothesis is true, identification of gene and environmental interaction will benefit genetic research into osteoporosis and clinical practice.
The study design of HKSC and dCG also differed. In our HKSC cohort, we studied subjects at the extremes of BMD distribution. Studying subjects at the extremes of a quantitative phenotype has proven useful in identifying functional rare variants [
9,
10]. The genes identified in our HKSC cohort may therefore harbor more rare variants than the dCG cohort. Future deep sequencing of the top hits will be required to validate the first hypothesis. The dCG cohort also included both men and women, while our HKSC cohort included only women. Since sex-specific genetic architecture has been well demonstrated for BMD variation [
11‐
13], this difference likely accounts for some differences in the findings. Although the number of subjects in the HKSC cohort was fewer, the HKSC cohort captured information from the extreme 25% (cases, lowest 10%; super control, highest 15%) of 3,200 subjects. Other heterogeneity in different ethnicities, such as lifestyle, diet, LD structure, might also contribute to the difference in the strength of findings [
13].
Interpretation of the gene-based results required extra attention. For example, two spine suggestive genes (
CCDC55 and
EFCAB5) identified in HKSC harbored the SNP rs4470197 which showed a strong association signal with spine BMD (
p = 8.1 × 10
−6). This SNP was located between these two genes, and the gene-based
p value was partly contributed by the
p value of rs4470197. Nonetheless, it is unknown whether rs4470197 is associated with
CCDC55 or
EFCAB5 or both.
CCDC55 (coiled-coil domain containing 55) and
EFCAB5 (EF-hand calcium binding domain 5) are newly annotated genes with no known function; both are conserved in a number of animals such as the chimpanzee, cow, mouse, rat, and chicken. A future functional study is required to validate their role in bone metabolism. The most significant hip BMD gene identified in HKSC was
KPNA4 (karyopherin alpha 4 (importin alpha 3)). The primary function of karyopherins is to recognize nuclear localization signals (NLSs) and dock NLS-containing proteins to the nuclear pore complex. A number of bone genes contain NLS, such as
RUNX2 and
PTHrP. A recent study [
14] demonstrated that NLS of
PTHrP regulates skeletal development, including bone mass and osteoblast development. Therefore, defective recognition of NLS may affect bone metabolism.
The findings in the dCG cohort were similar to the findings in meta-analysis, despite the fact that
CKAP became significant and
C6orf97 became insignificant in the meta-analysis for hip BMD. In the meta-analysis, we identified a number of gene loci that have been implicated in bone metabolism in the latest meta-analysis of GWAS in 19,195 subjects [
1], such as 6q25 and 12q13 for spine BMD and 11p11.2 for hip BMD. We also identified a number of novel suggestive loci associated with BMD. 1q21.3 encompasses late cornified envelope protein (LCE) gene cluster and keratinocyte proline-rich protein (
KPRP) and is known as the epidermal differentiation complex [
15]. Both
LCE2A and
LCE4A were induced and responsive to the extracellular calcium level and UV irradiation. Though thought to be mainly involved in skin conditions (such as psoriasis [
16]), deletion of LCEs was also associated with rheumatoid arthritis [
17], thus offering an insight into the role of LCEs in the autoimmune system. The mammalian Forkhead Box (Fox) family of transcription factor is involved in a number of biological processes such as tissue-specific transcription, cell fate determination during embryogenesis, and cell survival.
FOXE1 at 9q22 was identified as a BMD candidate gene in the current study.
FOXE1 is involved in thyroid organogenesis and development of cleft palate [
18,
19]. A recent study has shown that this gene is also associated with skeletogenesis in zebrafish. Knocking down of
FOXE1 in zebrafish using morpholino resulted in severe reduction in the expression of
sox9a,
col1a1, and
runx2. In addition, this gene and another candidate gene in the same gene family identified in the recent meta-analysis [
1],
FOXL1, are downstream targets of Hedgehog-Gli signaling pathway [
20,
21]. The Hedgehog and Gli signaling pathway is important in bone development [
22] and osteoblast differentiation [
23].
CDK5RAP2 (CDK5 regulatory subunit associated protein 2) at 9q33.2 is involved in the regulation of neuronal differentiation and associated with microcephaly [
24]. Microcephaly is a disease in which head size is smaller than average and is often associated with osteoporosis [
25,
26]. Adrenergic, alpha-1D-receptor (
ADRA1D) at 20p13 is a G-protein coupled receptor that mediates actions in the sympathetic nervous system through a number of neurotransmitters, such as catecholamines, epinephrine, and norepinephrine. The sympathetic nervous system is important in bone mass regulation [
27,
28]; male mice without beta1/beta2 adrenergic receptor have increased cortical bone mass [
29]. The role of
ADRA1D in bone metabolism has been demonstrated in MC3T3-E1 osteoblast-like cells, in which
ADRA1D is expressed in MC3T3E-1 cells, and
RANKL expression is regulated via alpha-adrenergic receptor stimulation in osteoblasts [
30]. Eukaryotic translation initiation factor 6 (eIF6) at 20q12 is a gene that controls translation at the rate-limiting step of initiation. Recently, Gandin et al. demonstrated that heterozygous mice of eIF6 had fewer hepatic and adipose cells due to impaired G1/S cell cycle progression [
31]. They found that the reduction of adipose tissue was due to a decreased proliferation of pre-adipocytes derived from mesenchymal stem cells. Although bone phenotype was not investigated in their study, we believe that eIF6 could affect bone metabolism by regulating the cell number of osteoblasts, since both adipocytes and osteoblasts are derived from the same progenitor–mesenchymal stem cell; eIF6 also regulates Wnt/beta-catenin signaling via regulation of beta-catenin synthesis [
32].
Collectively, our data showed that the BMD genes identified in our meta-analysis play an important role in bone metabolism. Although additional studies will be necessary to validate their function, our current findings indicate that these BMD genes are involved in connective tissue development and function and skeletal and muscular system development and function using bio-function analysis implemented in IPA (
p < 0.05) (Tables
6 and
7). Due to the limitation of genechip, gene-based GWAS may not be able to identify BMD genes if the SNP coverage of the BMD genes is low. One method to remedy this is to perform more genotyping with denser SNP; another method is to perform gene network inference to identify genes that are connected with other BMD genes. Using the gene network inference approach, several bone-related hub genes or complexes have been identified, such as ERK1/2 [
33,
34], P38 MAPK [
35,
36] (Fig.
1a), prostaglandin E2 [
37], and TNF [
38] (Fig.
1b). Overlaying the gene network with known canonical signaling pathways revealed that aryl hydrocarbon receptor signaling; role of osteoblasts, osteoclasts, and chondrocytes in rheumatoid arthritis; and role of macrophages, fibroblasts, and endothelial cells in rheumatoid arthritis (7 genes out of 35 genes in each signaling pathway) were the predominant themes of the spine BMD gene network (Supplementary Table
3a), whereas acute phase response signaling (8 genes out of 35 genes) was the predominant theme of the hip BMD gene network (Supplementary Table
3b). Interestingly, acute phase response was one of the underlying mechanisms of action of bisphosphonate in the treatment of osteoporosis [
39]. Our findings suggest that hip BMD genes
F2,
MBL2, and
HMOX1 may be the genes involved in bisphosphonate treatment and may be used to monitor treatment response.
There are a number of limitations in the current gene-based GWAS. First, our definition of gene-based GWAS significance level may not be accurate. The most accurate way would be to use simulation; however, this would require extremely heavy computations, as the number of SNPs included in each study and the number of independent genes will vary from study to study. The LD structure also varies in different ethnicities. Nonetheless, our gene-based GWAS significant level 5.8 × 10
−6 was not much different to the conservative Bonferroni-corrected GWAS significance level of 2.8 × 10
−6 (=0.05/17,640, assuming each gene is independent to each other). Second, our definition of the gene locus (±50 kb 5′ upstream and 3′ downstream of the coding region) might strongly affect the test statistics and hence the gene-based
p value. Noting that large boundaries lead to a longer overlapping region with the neighbor genes, hence some markers are included in multiple genes. Thus, we justified how long the boundaries should be included by averaging the distance between the top intergenic SNPs identified in the recent meta-analysis of GWAS to the nearest coding genes [
1]. Notably, the highly significant SNP may also inflate the test statistics in a number of nearby genes, which poses interpretation difficulty. Thirdly, although a gene-based approach can identify genes with multiple causal SNPs with small effect size, it cannot identify genes with only one very significant SNP, while other SNPs in the gene do not show any significant
p value. Fourthly, we did not account for the heterogeneity in the meta-analysis although there is obvious heterogeneity between HKSC and dCG studies (as reflected by the Q statistics, data not shown). Moreover, since the sample size of the dCG cohort was much larger than the HKSC cohort, many significant
p values of the top findings were driven primarily by the dCG study. Caution should therefore be exercised in interpreting meta-analysis findings, especially when our current data suggested that there was a large genetic heterogeneity for spine BMD present between Chinese and European. Lastly, correction for stratification or any inflation has not been established in gene-based GWAS study; therefore, all QC should be done in the single-locus GWAS before performing the gene-based GWAS.
In conclusion, our results demonstrate the potential applicability of a gene-based approach to the interpretation and further mining of GWAS data. The importance of a gene-based approach is that single-locus GWAS mainly focuses on the association between a single marker and disease trait. It may not be able to identify a disease gene that harbors several causal variants with small effect size (allelic heterogeneity). Testing the overall effect of all SNPs in a gene, thus leveraging this information, may provide significant power to identify disease genes. In this study, we identified and/or confirmed a number of BMD genes. These BMD genes were significantly enriched in connective tissue development and function and skeletal and muscular system development and function. Using a gene network inference approach, we observed that a large number of BMD genes were connected with each other and contributed to a significant physiological function related to bone metabolism. Our approach suggests a concept of how variation in multiple genes linked in a functional gene network contributes to BMD variation and provides a useful tool to reveal the hidden information of GWAS that would be missed in single SNP analysis.