Published in: BMC Proceedings 7/2016

Open Access | 1 October 2016 | Proceedings

Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data

Authors: Emily R. Holzinger, Silke Szymczak, James Malley, Elizabeth W. Pugh, Hua Ling, Sean Griffith, Peng Zhang, Qing Li, Cheryl D. Cropp, Joan E. Bailey-Wilson



Abstract

Current findings from genetic studies of complex human traits often do not explain a large proportion of the estimated variation of these traits due to genetic factors. This could be due, in part, to overly stringent significance thresholds in traditional statistical methods, such as linear and logistic regression. Machine learning methods, such as Random Forests (RF), are an alternative approach to identifying potentially interesting variants. One major issue with these methods is that there is no clear way to distinguish between probable true hits and noise variables based on the calculated importance metric. To address this issue, we are developing the Relative Recurrency Variable Importance Metric (r2VIM), an RF-based variable selection method. Here, we apply r2VIM to the Genetic Analysis Workshop 19 data from unrelated individuals, with simulated systolic blood pressure as the phenotype. We compare the number of “true” functional variants identified by r2VIM with the number identified by linear regression analyses that use a Bonferroni correction to calculate a significance threshold. Our results show that r2VIM performed comparably to linear regression. Our findings are a proof of concept for r2VIM: when the optimal importance score threshold is used, it identifies a similar number of functional and nonfunctional variants to a more commonly used technique.
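The two selection rules contrasted in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The abstract does not define how r2VIM scores variables, so the sketch assumes that a variable's relative importance in a run is its raw importance divided by the absolute value of the lowest importance observed in that run, and that a variable is kept only if this ratio reaches a chosen factor in every run. The function names, the variant names, and the example scores are hypothetical.

```python
from typing import Dict, List


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def select_recurrent(importances: Dict[str, List[float]], factor: float) -> List[str]:
    """Return variables whose relative importance is >= factor in every run.

    ASSUMPTION (not stated in the abstract): the relative importance in a run
    is the raw score divided by the absolute value of the lowest score in that
    run, which is treated as an estimate of the noise level.
    """
    if not importances:
        return []
    n_runs = len(next(iter(importances.values())))
    noise = []
    for run in range(n_runs):
        lowest = min(scores[run] for scores in importances.values())
        if lowest >= 0:
            # Without a negative score there is no noise estimate to scale by.
            raise ValueError(f"run {run} has no negative importance score")
        noise.append(abs(lowest))
    return sorted(
        name
        for name, scores in importances.items()
        if all(scores[run] / noise[run] >= factor for run in range(n_runs))
    )


# Hypothetical importance scores for four variants across two forest runs.
example_scores = {
    "snp_a": [0.9, 1.2],
    "snp_b": [0.5, 0.05],
    "snp_c": [-0.1, -0.2],
    "snp_d": [0.02, 0.01],
}
selected = select_recurrent(example_scores, factor=3.0)
threshold = bonferroni_threshold(alpha=0.05, n_tests=1000)
```

In this toy input only `snp_a` clears the factor in both runs; `snp_b` scores well in the first run but not the second, which is the kind of unstable signal a recurrence requirement is meant to filter out. The regression side of the comparison would instead keep any variant whose p-value falls below `threshold`.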
Metadata
Title
Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data
Authors
Emily R. Holzinger
Silke Szymczak
James Malley
Elizabeth W. Pugh
Hua Ling
Sean Griffith
Peng Zhang
Qing Li
Cheryl D. Cropp
Joan E. Bailey-Wilson
Publication date
1 October 2016
Publisher
BioMed Central
Published in
BMC Proceedings / Supplement issue 7/2016
Electronic ISSN: 1753-6561
DOI
https://doi.org/10.1186/s12919-016-0021-1
