Simulation settings
We conducted simulation studies to evaluate the performance of CONTO. Because CONTO requires only summary-level data, we directly sampled two sets of Z scores from a given multivariate normal (MVN) distribution under various scenarios. Specifically, for a gene in the first population, we generated its Z scores randomly from MVN((0, 0), Λ) under H00 with probability λ00, from MVN((τ10, 0), Λ) under H10 with probability λ10, from MVN((τ01, 0), Λ) under H01 with probability λ01, or from MVN((τ11, 0), Λ) under H11 with probability λ11. For the same gene in the second population, we drew its Z scores at random from MVN((0, 0), Λ) under H00 with probability λ00, from MVN((0, 0), Λ) under H10 with probability λ10, from MVN((0, τ01), Λ) under H01 with probability λ01, or from MVN((0, τ11), Λ) under H11 with probability λ11. The magnitudes of τ10, τ01 and τ11 measure the strength of association, with larger values indicating stronger association signals. For simplicity, we set τ10 = τ01 = τ11 = 2, 3 or 4, and the total number of genes to 10000, 15000, or 20000.
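The sampling scheme can be illustrated with a minimal sketch. It assumes that Λ is the identity matrix, so that each Z score is an independent unit-variance normal draw, and it uses a simplified mean structure in which a gene carries signal τ only in the population(s) where it is associated. Both are assumptions for illustration and may differ from the exact parameterization used in the simulations; the function name `simulate_gene_z_scores` and the seed are hypothetical.

```python
import random

# Hypothesis labels: H00 (no association), H10 (first population only),
# H01 (second population only), H11 (both populations).
STATES = ("H00", "H10", "H01", "H11")


def simulate_gene_z_scores(n_genes, probs, tau, seed=1):
    """Draw one Z score per population for each gene from a four-state mixture.

    Assumption: Lambda is the 2x2 identity, so the two Z scores are
    independent normals with unit variance; only their means depend on
    the hypothesis state of the gene.
    """
    rng = random.Random(seed)
    # Assumed means (population 1, population 2) for each state.
    means = {
        "H00": (0.0, 0.0),
        "H10": (tau, 0.0),
        "H01": (0.0, tau),
        "H11": (tau, tau),
    }
    # Assign each gene a state with probabilities (l00, l10, l01, l11).
    states = rng.choices(STATES, weights=probs, k=n_genes)
    z = [
        (rng.gauss(means[s][0], 1.0), rng.gauss(means[s][1], 1.0))
        for s in states
    ]
    return states, z


# Example: 10000 genes, equal mixing of the three non-null states, tau = 3.
states, z = simulate_gene_z_scores(10000, (0.40, 0.20, 0.20, 0.20), tau=3.0)
```

Keeping the state labels alongside the Z scores makes it possible to score any method against the known truth afterwards.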
After obtaining Z scores, we transformed them into P values based on the normal approximation. In our simulation, we set Λ to be a two-dimensional identity matrix. We considered three probability settings with λ11 ≠ 0 to evaluate FDR control and power: (i) λ00 = 0.40, λ10 = 0.20, λ01 = 0.20, and λ11 = 0.20, constructing a highly polygenic but less overlapped genetic architecture, in which 40% of genes were related to the trait in each population and approximately 33.3% of associated genes were shared across populations; (ii) λ00 = 0.80, λ10 = 0.05, λ01 = 0.05, and λ11 = 0.10, building a less polygenic and moderately overlapped genetic architecture, in which 15% of genes were related to the trait in each population and approximately 50% of associated genes were shared across populations; (iii) λ00 = 0.90, λ10 = 0.01, λ01 = 0.01, and λ11 = 0.08, generating a sparse but highly overlapped genetic architecture, in which 9% of genes were related to the trait in each population and approximately 80% of associated genes were shared across populations. These simulation parameters were largely chosen based on the results of our real data applications (see below).
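The percentages quoted for each setting follow directly from the mixing probabilities: the fraction of associated genes in a population is λ10 + λ11 (or λ01 + λ11), and the shared fraction is λ11 / (λ10 + λ01 + λ11). The sketch below checks this arithmetic and shows one way to convert a Z score to a P value. The two-sided form is an assumption, since the text only states that a normal approximation was used, and the function names are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided P value of a Z score under the standard normal (assumed form)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def architecture(l00, l10, l01, l11):
    """Summarize a probability setting.

    Returns the fraction of genes associated in population 1, the fraction
    associated in population 2, and the share of associated genes that are
    common to both populations.
    """
    assert abs(l00 + l10 + l01 + l11 - 1.0) < 1e-9, "probabilities must sum to 1"
    pop1 = l10 + l11
    pop2 = l01 + l11
    shared = l11 / (l10 + l01 + l11)
    return pop1, pop2, shared


settings = {
    "i": (0.40, 0.20, 0.20, 0.20),
    "ii": (0.80, 0.05, 0.05, 0.10),
    "iii": (0.90, 0.01, 0.01, 0.08),
}
for name, lambdas in settings.items():
    pop1, pop2, shared = architecture(*lambdas)
    print(f"setting ({name}): pop1={pop1:.2%}, pop2={pop2:.2%}, shared={shared:.1%}")
```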
Besides CONTO, we also applied three other composite-null methods for comparison (Additional file 1): JST [44], the joint significance composite-null test (JT-comp) [64], and the divide-aggregate composite-null test (DACT) [65]. We repeated each simulation setting 10^3 times and report the average across replicates for each method.
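For concreteness, the evaluation over replicates can be sketched as follows. This is a generic illustration of how the false discovery proportion and power are usually computed from known truth labels, not the authors' evaluation code. The names `truth` and `discovered` are hypothetical, and how each method declares a gene significant is left outside the sketch.

```python
def fdp_and_power(truth, discovered):
    """Compute the false discovery proportion and power for one replicate.

    truth: list of booleans, True if the gene is truly shared (H11).
    discovered: list of booleans, True if a method declared the gene shared.
    """
    n_discoveries = sum(discovered)
    n_true = sum(truth)
    true_positives = sum(t and d for t, d in zip(truth, discovered))
    # By convention the FDP is 0 when nothing is discovered.
    fdp = (n_discoveries - true_positives) / n_discoveries if n_discoveries else 0.0
    power = true_positives / n_true if n_true else 0.0
    return fdp, power


def average_over_replicates(results):
    """Average (fdp, power) pairs; the mean FDP is the empirical FDR."""
    n = len(results)
    mean_fdp = sum(r[0] for r in results) / n
    mean_power = sum(r[1] for r in results) / n
    return mean_fdp, mean_power
```

Averaging the per-replicate false discovery proportion gives the empirical FDR that is compared against the nominal level.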
Summary statistics of 31 complex diseases from the EAS and EUR populations
We applied these methods to 31 complex traits of EAS-only or EUR-only individuals available from distinct GWAS consortia (Table 1 and Additional file 1: Table S1). These traits were analyzed in our previous work; more detailed descriptions can be found therein and in the respective original papers [28, 62]. We downloaded summary statistics for these traits and performed stringent quality control in both populations for each trait: (i) removed SNPs without an rs identifier; (ii) filtered out non-biallelic SNPs and those with strand-ambiguous alleles; (iii) deleted SNPs whose alleles did not match those in the 1000 Genomes Project [66]; (iv) excluded duplicated SNPs and those with inconsistent alleles between the EAS and EUR populations; (v) kept only common SNPs (MAF > 1%) that were shared between the two populations; (vi) removed SNPs located within the major histocompatibility complex region because of its complicated LD structure.
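The six quality-control filters can be sketched in a few lines. This is a simplified illustration rather than the pipeline actually used. The record fields (`rsid`, `a1`, `a2`, `maf`, `chr`, `pos`), the `ref_alleles` lookup, the choice to drop every copy of a duplicated SNP, and the MHC coordinates (chromosome 6, 25 to 34 Mb) are all assumptions introduced here.

```python
from collections import Counter

VALID_BASES = set("ACGT")
# Strand-ambiguous allele pairs (A/T and C/G).
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}
# Assumed MHC boundaries; the exact coordinates are not given in the text.
MHC = (6, 25_000_000, 34_000_000)


def passes_qc(snp, ref_alleles, min_maf=0.01):
    """Apply the single-population filters (i), (ii), (iii), (v) and (vi)."""
    a1, a2 = snp["a1"], snp["a2"]
    if not snp["rsid"].startswith("rs"):                    # (i) rs identifier
        return False
    if a1 not in VALID_BASES or a2 not in VALID_BASES or a1 == a2:
        return False                                        # (ii) biallelic SNP
    if frozenset((a1, a2)) in AMBIGUOUS:                    # (ii) strand-ambiguous
        return False
    if ref_alleles.get(snp["rsid"]) != frozenset((a1, a2)):
        return False                                        # (iii) reference match
    if snp["maf"] <= min_maf:                               # (v) common SNPs only
        return False
    chrom, start, end = MHC
    if snp["chr"] == chrom and start <= snp["pos"] <= end:  # (vi) MHC region
        return False
    return True


def shared_snps_after_qc(eas, eur, ref_alleles):
    """Return rsids passing QC in both populations with consistent alleles."""
    def clean(snps):
        counts = Counter(s["rsid"] for s in snps)
        # (iv) drop every copy of a duplicated SNP (an assumed convention).
        return {s["rsid"]: s for s in snps
                if counts[s["rsid"]] == 1 and passes_qc(s, ref_alleles)}

    eas_ok, eur_ok = clean(eas), clean(eur)
    shared = []
    for rsid in eas_ok.keys() & eur_ok.keys():              # (v) shared SNPs
        e, u = eas_ok[rsid], eur_ok[rsid]
        # (iv) alleles must agree between populations, in either order.
        if frozenset((e["a1"], e["a2"])) == frozenset((u["a1"], u["a2"])):
            shared.append(rsid)
    return sorted(shared)
```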
Table 1
Number of associated SNPs discovered by JST and CONTO for traits in the EAS and EUR populations
SCZ | 21 | 0 | 186 | 57 | eGFR | 71 | 22 | 205 | 312 |
RA | 27 | 5 | 48 | 87 | ANM | 28 | 7 | 100 | 127 |
T2D | 293 | 115 | 310 | 824 | PLT | 261 | 29 | 429 | 625 |
COA | 19 | 14 | 112 | 111 | RBC | 195 | 8 | 568 | 438 |
AOA | 22 | 19 | 57 | 53 | MCV | 351 | 30 | 388 | 782 |
PCA | 27 | 1 | 44 | 85 | HCT | 40 | 3 | 315 | 281 |
BMI | 95 | 1 | 1027 | 291 | MCH | 251 | 20 | 316 | 726 |
Height | 698 | 228 | 455 | 1544 | MCHC | 136 | 23 | 125 | 224 |
DBP | 33 | 0 | 802 | 130 | HGB | 34 | 7 | 303 | 182 |
SBP | 84 | 4 | 643 | 252 | MONO | 40 | 4 | 250 | 151 |
PP | 57 | 2 | 315 | 122 | NEUT | 44 | 0 | 158 | 135 |
HDL | 113 | 52 | 0 | 305 | EO | 40 | 2 | 259 | 173 |
LDL | 74 | 22 | 18 | 166 | BASO | 32 | 5 | 29 | 101 |
TC | 101 | 64 | 22 | 204 | LYMPH | 29 | 1 | 163 | 78 |
TG | 42 | 41 | 14 | 109 | WBC | 69 | 12 | 149 | 259 |
HbA1c | 56 | 47 | 34 | 96 | | | | | |
After quality control, we ran MAGMA with genotypes of 504 EAS or 503 EUR individuals from the 1000 Genomes Project as the reference panel. We defined the set of cis-SNPs for each gene according to the annotation file provided by VIGAS [67]. A P value and a Z score were thus available for each gene of each trait. To handle possible residual influence of population stratification, family structure and cryptic relatedness [68–71], we further applied genomic control to the gene-based association results of MAGMA whenever inflation of the gene-level test statistics was observed (inflation factor > 1.05). We took the resulting P values or Z scores as input to JST, JT-comp, DACT and CONTO to detect trait-associated genes shared across the EAS and EUR populations.
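The genomic control step can be illustrated with the standard median-based estimator. The text does not state which estimator of the inflation factor was used, so the formula below (median of squared Z scores divided by the median of a chi-square distribution with 1 degree of freedom) is an assumption, as are the function names.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(z_scores):
    """Median-based genomic inflation factor of a set of Z scores."""
    return statistics.median([z * z for z in z_scores]) / CHI2_1DF_MEDIAN


def genomic_control(z_scores, threshold=1.05):
    """Rescale Z scores only when the inflation factor exceeds the threshold.

    Returns the (possibly adjusted) Z scores and the estimated inflation factor.
    """
    lam = inflation_factor(z_scores)
    if lam <= threshold:
        return list(z_scores), lam
    # Dividing Z by sqrt(lambda) divides the chi-square statistic by lambda.
    scale = math.sqrt(lam)
    return [z / scale for z in z_scores], lam


def two_sided_p(z):
    """Two-sided P value of an adjusted Z score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

After rescaling, the median-based inflation factor of the adjusted Z scores equals 1 by construction, and the adjusted P values can be recomputed from the adjusted Z scores.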
Afterwards, for each trait we classified genes into three groups: (i) null genes, which were not associated with the trait in either population; (ii) population-specific genes, which were associated with the trait in only the EAS or only the EUR population; (iii) population-common genes, which were associated with the trait in both populations. To characterize the genes in these groups, we used several conservation scores to examine the extent to which a particular gene varied across populations: the phyloP score [72], the phastCons score [73], and the dN/dS ratio [74]. A higher phyloP or phastCons score indicates stronger conservation, whereas a smaller dN/dS ratio indicates stronger conservation. We obtained these scores from [75] and compared the average scores across all genes of these traits among the three groups using the Friedman F test.
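As a rough illustration of the group comparison, the sketch below computes the Friedman chi-square statistic and its F transformation (the Iman-Davenport form). It assumes that each trait acts as a block and that the three gene groups are the treatments, with one average score per trait and group. This layout and the exact test variant are assumptions, as the text does not specify them. No tie correction is applied, and the P value, which needs an F distribution with (k - 1) and (n - 1)(k - 1) degrees of freedom, is omitted.

```python
def average_ranks(values):
    """Rank values from smallest to largest, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = rank
        i = j + 1
    return ranks


def friedman_statistics(table):
    """Return (chi-square statistic Q, F statistic) for an n-by-k table.

    table[i][j] is the average score of group j for trait (block) i.
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    denom = n * (k - 1) - q
    # With perfectly concordant rankings the denominator is 0 and F is unbounded.
    f = (n - 1) * q / denom if denom > 0 else float("inf")
    return q, f
```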