A practical approach to adjusting for population stratification in genome-wide association studies: principal components and propensity scores (PCAPS)

Huaqing Zhao; Nandita Mitra; Peter A. Kanetsky; Katherine L. Nathanson; Timothy R. Rebbeck

doi:10.1515/sagmb-2017-0054

Published by De Gruyter December 4, 2018

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

Genome-wide association studies (GWAS) are susceptible to bias due to population stratification (PS). The most widely used method to correct bias due to PS is principal components (PCs) analysis (PCA), but there is no objective method to guide which PCs to include as covariates. Often, the ten PCs with the highest eigenvalues are included to adjust for PS. This selection is arbitrary, and patterns of local linkage disequilibrium may affect PCA corrections. To address these limitations, we estimate genomic propensity scores based on all statistically significant PCs selected by the Tracy-Widom (TW) statistic. We compare a principal components and propensity scores (PCAPS) approach to PCA and EMMAX using simulated GWAS data under no, moderate, and severe PS. PCAPS reduced spurious genetic associations regardless of the degree of PS, resulting in odds ratio (OR) estimates closer to the true OR. We illustrate our PCAPS method using GWAS data from a study of testicular germ cell tumors. PCAPS provided a more conservative adjustment than PCA. Advantages of the PCAPS approach include reduction of bias compared to PCA, consistent selection of propensity scores to adjust for PS, the potential ability to handle outliers, and ease of implementation using existing software packages.

Keywords: bias; principal components analysis; propensity score; testicular germ cell tumors; Tracy-Widom statistic
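
The bias that the abstract attributes to population stratification can be illustrated with a toy calculation. The sketch below is not the PCAPS procedure. It assumes two hypothetical subpopulations whose labels are known, with invented carrier frequencies and disease risks, and it uses a Mantel-Haenszel stratified estimate as a simple stand-in for adjustment. In a real GWAS the subpopulation labels are latent, which is why summaries such as PCs or propensity scores are used instead. All names (`Table2x2`, `expected_table`, `pool`, `mantel_haenszel_or`) are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Table2x2:
    """Carrier status by disease status for one group of subjects."""
    case_carrier: float
    control_carrier: float
    case_noncarrier: float
    control_noncarrier: float

    @property
    def total(self) -> float:
        return (self.case_carrier + self.control_carrier
                + self.case_noncarrier + self.control_noncarrier)

    def odds_ratio(self) -> float:
        return ((self.case_carrier * self.control_noncarrier)
                / (self.control_carrier * self.case_noncarrier))


def expected_table(n: int, carrier_freq: float, disease_risk: float) -> Table2x2:
    """Expected counts when the variant has NO effect (true OR = 1).

    Disease risk depends only on the subpopulation, not on carrier status.
    """
    carriers = n * carrier_freq
    noncarriers = n - carriers
    return Table2x2(
        case_carrier=carriers * disease_risk,
        control_carrier=carriers * (1 - disease_risk),
        case_noncarrier=noncarriers * disease_risk,
        control_noncarrier=noncarriers * (1 - disease_risk),
    )


def pool(tables: List[Table2x2]) -> Table2x2:
    """Collapse strata into one table, ignoring subpopulation membership."""
    return Table2x2(
        case_carrier=sum(t.case_carrier for t in tables),
        control_carrier=sum(t.control_carrier for t in tables),
        case_noncarrier=sum(t.case_noncarrier for t in tables),
        control_noncarrier=sum(t.control_noncarrier for t in tables),
    )


def mantel_haenszel_or(tables: List[Table2x2]) -> float:
    """Mantel-Haenszel common odds ratio across strata."""
    num = sum(t.case_carrier * t.control_noncarrier / t.total for t in tables)
    den = sum(t.control_carrier * t.case_noncarrier / t.total for t in tables)
    return num / den


# Hypothetical subpopulations: both carrier frequency and disease risk differ.
strata = [
    expected_table(10_000, carrier_freq=0.2, disease_risk=0.05),
    expected_table(10_000, carrier_freq=0.6, disease_risk=0.20),
]

crude_or = pool(strata).odds_ratio()      # ignores stratification
adjusted_or = mantel_haenszel_or(strata)  # adjusts for subpopulation

print(f"crude OR: {crude_or:.2f}")
print(f"stratum-adjusted OR: {adjusted_or:.2f}")
```

With these invented inputs the pooled analysis reports an odds ratio of roughly 1.75 even though the variant has no effect within either subpopulation, while the stratum-adjusted estimate returns to 1. This is the kind of spurious association that adjustment for PS aims to remove.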

Supplementary Material

The online version of this article offers supplementary material (DOI: https://doi.org/10.1515/sagmb-2017-0054).

Published Online: 2018-12-04

Statistical Applications in Genetics and Molecular Biology, Volume 17, Issue 6
