Genome-Wide Complex Trait Analysis (GCTA): Methods, Data Analyses, and Interpretations

Genome-Wide Association Studies and Genomic Prediction

Part of the book series: Methods in Molecular Biology (MIMB, volume 1019)

Abstract

Estimating genetic variance is traditionally performed using pedigree analysis. Using high-throughput DNA marker data measured across the entire genome, it is now possible to estimate and partition genetic variation from population samples. In this chapter, we introduce methods and a software tool called Genome-wide Complex Trait Analysis (GCTA) to estimate genomic relationships between pairs of conventionally unrelated individuals using genome-wide single nucleotide polymorphism (SNP) data, to estimate the variance explained by all SNPs simultaneously on genomic or chromosomal segments or over the whole genome, and to perform a joint and conditional multiple-SNP association analysis using summary statistics from a meta-analysis of genome-wide association studies and linkage disequilibrium between SNPs estimated from a reference sample.

References

  1. Hindorff LA, Sethupathy P, Junkins HA et al (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106(23):9362–9367

  2. Maher B (2008) Personal genomes: the case of the missing heritability. Nature 456(7218):18–21

  3. Yang J, Benyamin B, McEvoy BP et al (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42(7):565–569

  4. Yang J, Manolio TA, Pasquale LR et al (2011) Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet 43(6):519–525

  5. Davies G, Tenesa A, Payton A et al (2011) Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Mol Psychiatry 16(10):996–1005

  6. Deary IJ, Yang J, Davies G et al (2012) Genetic contributions to stability and change in intelligence from childhood to old age. Nature 482(7384):212–215

  7. Lee SH, Decandia TR, Ripke S et al (2012) Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat Genet 44(3):247–250

  8. Gibson G (2010) Hints of hidden heritability in GWAS. Nat Genet 42(7):558–560

  9. Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five years of GWAS discovery. Am J Hum Genet 90(1):7–24

  10. Teslovich TM, Musunuru K, Smith AV et al (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466(7307):707–713

  11. Heid IM, Jackson AU, Randall JC et al (2010) Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nat Genet 42(11):949–960

  12. Lango Allen H, Estrada K, Lettre G et al (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467(7317):832–838

  13. Speliotes EK, Willer CJ, Berndt SI et al (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42(11):937–948

  14. Ripke S, Sanders AR, Kendler KS et al (2011) Genome-wide association study identifies five new schizophrenia loci. Nat Genet 43(10):969–976

  15. Yang J, Ferreira T, Morris AP et al (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 44(4):369–375

  16. Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88(1):76–82

  17. Hayes BJ, Visscher PM, Goddard ME (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res 91(1):47–60

  18. Strandén I, Garrick DJ (2009) Technical note: derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci 92(6):2971–2975

  19. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–4423

  20. Patterson HD, Thompson R (1971) Recovery of inter-block information when block sizes are unequal. Biometrika 58(3):545–554

  21. Purcell S, Neale B, Todd-Brown K et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575

  22. Lee SH, van der Werf JH (2006) An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree. Genet Sel Evol 38(1):25–43

  23. Jorjani H, Klei L, Emanuelson U (2003) A simple method for weighted bending of genetic (co)variance matrices. J Dairy Sci 86(2):677–679

  24. Hill WG, Thompson R (1978) Probabilities of non-positive definite between-group or genetic covariance matrices. Biometrics 34:429–439

  25. Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:2–19

  26. Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA

  27. Falconer DS (1965) The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann Hum Genet 29:51–71

  28. Dempster ER, Lerner IM (1950) Heritability of threshold characters. Genetics 35(2):212–236

  29. Lee SH, Wray NR, Goddard ME, Visscher PM (2011) Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet 88(3):294–305

  30. Price AL, Weale ME, Patterson N et al (2008) Long-range LD can confound genome scans in admixed populations. Am J Hum Genet 83(1):132–135

  31. Gilmour AR, Thompson R, Cullis BR (1995) Average information REML: an efficient algorithm for variance parameters estimation in linear mixed models. Biometrics 51:1440–1450

Appendix A

In Eq. 4, we have

$$ \mathbf{y}=\mathbf{Xb}+\sum_{i=1}^{r}\mathbf{g}_i+\mathbf{e} \quad \mathrm{and} \quad \operatorname{var}(\mathbf{y})=\mathbf{V}=\sum_{i=1}^{r}\mathbf{A}_i\sigma_i^2+\mathbf{I}\sigma_{\mathrm{e}}^2, $$

of which Eq. 2 is a special case with r = 1. By default, GCTA uses the average information (AI) REML algorithm [31] to estimate the variance components \( \sigma_i^2 \) and \( \sigma_{\mathrm{e}}^2 \) iteratively. In the tth iteration, \( \mathbf{q}^{(t)}=\mathbf{q}^{(t-1)}+(\mathbf{H}^{(t-1)})^{-1}\left.\frac{\partial L}{\partial \mathbf{q}}\right|_{\mathbf{q}^{(t-1)}} \), where \( \mathbf{q} \) is a vector of the estimates of variance components (\( \hat{\sigma}_1^2 \), …, \( \hat{\sigma}_r^2 \) and \( \hat{\sigma}_{\mathrm{e}}^2 \)); L is the log likelihood function of the mixed linear model (ignoring the constant), \( L=-\frac{1}{2}(\log|\hat{\mathbf{V}}|+\log|\mathbf{X}^{\prime}\hat{\mathbf{V}}^{-1}\mathbf{X}|+\mathbf{y}^{\prime}\mathbf{P}\mathbf{y}) \) with \( \hat{\mathbf{V}}=\sum_{i=1}^{r}\mathbf{A}_i\hat{\sigma}_i^{2(t-1)}+\mathbf{I}\hat{\sigma}_{\mathrm{e}}^{2(t-1)} \) and \( \mathbf{P}=\hat{\mathbf{V}}^{-1}-\hat{\mathbf{V}}^{-1}\mathbf{X}(\mathbf{X}^{\prime}\hat{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}^{\prime}\hat{\mathbf{V}}^{-1} \); H is the average of the observed and expected information matrices [22],

$$ \mathbf{H}=\frac{1}{2}\left[ \begin{array}{cccc} \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} & \cdots & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{P}\mathbf{y} \\ \vdots & \vdots & \vdots & \vdots \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} & \cdots & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{P}\mathbf{y} \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} & \cdots & \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} & \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{P}\mathbf{y} \end{array} \right]; $$

and \( \frac{{\partial L}}{{\partial \mathbf{q}}} \) is a vector of first derivatives of the log likelihood function with respect to each variance component,

$$ \frac{\partial L}{\partial \mathbf{q}}=-\frac{1}{2}\left[ \begin{array}{c} \mathrm{tr}(\mathbf{P}\mathbf{A}_1)-\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} \\ \vdots \\ \mathrm{tr}(\mathbf{P}\mathbf{A}_r)-\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} \\ \mathrm{tr}(\mathbf{P})-\mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{y} \end{array} \right]. $$
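As a rough illustration of the AI-REML update above, the following NumPy sketch computes P, the score vector, and the matrix H for one iteration and then applies the update to q. It is not GCTA's source code: the function name `ai_reml_step`, the simulated relationship matrix, and the toy phenotype are all assumptions made only for this example, and dense matrix inversion is used for clarity rather than efficiency.

```python
import numpy as np

def ai_reml_step(y, X, A_list, q):
    """One AI-REML update, q_new = q + H^{-1} dL/dq (illustrative sketch only).

    q holds (sigma_1^2, ..., sigma_r^2, sigma_e^2); A_list holds A_1, ..., A_r.
    """
    n = len(y)
    mats = list(A_list) + [np.eye(n)]          # I plays the role of "A" for sigma_e^2
    V = sum(s * M for s, M in zip(q, mats))    # V = sum_i A_i sigma_i^2 + I sigma_e^2
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    # P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1
    P = Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)
    Py = P @ y
    APy = [M @ Py for M in mats]               # A_i P y for each component
    # Score vector: -1/2 [tr(P A_i) - y' P A_i P y]
    grad = np.array([-0.5 * (np.trace(P @ M) - Py @ v)
                     for M, v in zip(mats, APy)])
    # H matrix: 1/2 y' P A_i P A_j P y
    H = 0.5 * np.array([[u @ P @ v for v in APy] for u in APy])
    return q + np.linalg.solve(H, grad), grad, H

# Hypothetical toy data: one random-effect component (r = 1), intercept only.
rng = np.random.default_rng(1)
n, m = 60, 200
Z = rng.standard_normal((n, m))                # stand-in for standardized genotypes
A = Z @ Z.T / m                                # toy relationship matrix
X = np.ones((n, 1))
y = (1.0 + Z @ rng.standard_normal(m) * np.sqrt(0.5 / m)
     + rng.standard_normal(n) * np.sqrt(0.5))
q0 = np.full(2, np.var(y, ddof=1) / 2)         # equal split of the phenotypic variance
q1, grad, H = ai_reml_step(y, X, [A], q0)
print("updated variance components:", q1)
```

Each entry of H is computed as (A_i P y)' P (A_j P y), which relies on P and every A_i being symmetric and avoids forming the products P A_i P A_j P explicitly.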

We also provide in GCTA two optional algorithms to estimate the variance components, which we call the direct REML and EM-REML. For the direct REML algorithm, the variance components in the tth iteration are estimated as

$$ \mathbf{q}^{(t)}=\left[ \begin{array}{cccc} \mathrm{tr}(\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_1) & \cdots & \mathrm{tr}(\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_r) & \mathrm{tr}(\mathbf{P}\mathbf{A}_1\mathbf{P}) \\ \vdots & \vdots & \vdots & \vdots \\ \mathrm{tr}(\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_1) & \cdots & \mathrm{tr}(\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_r) & \mathrm{tr}(\mathbf{P}\mathbf{A}_r\mathbf{P}) \\ \mathrm{tr}(\mathbf{P}\mathbf{P}\mathbf{A}_1) & \cdots & \mathrm{tr}(\mathbf{P}\mathbf{P}\mathbf{A}_r) & \mathrm{tr}(\mathbf{P}\mathbf{P}) \end{array} \right]^{-1}\left[ \begin{array}{c} \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} \\ \vdots \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{y} \end{array} \right]. $$
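The direct REML update can be sketched in the same way. The snippet below is a minimal NumPy illustration, assuming dense matrices and a hypothetical function name `direct_reml_step`; the toy data are simulated only so that the example runs on its own, and none of it is taken from GCTA's implementation.

```python
import numpy as np

def direct_reml_step(y, X, A_list, q):
    """One direct REML update: solve [tr(P A_i P A_j)] q_new = [y' P A_i P y]."""
    n = len(y)
    mats = list(A_list) + [np.eye(n)]          # last "A" is I, for sigma_e^2
    V = sum(s * M for s, M in zip(q, mats))
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    P = Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)
    Py = P @ y
    PA = [P @ M for M in mats]                 # P A_i (P itself for the residual)
    # tr(P A_i P A_j) computed as the sum of elementwise products of P A_i and (P A_j)'
    T = np.array([[np.sum(Pi * Pj.T) for Pj in PA] for Pi in PA])
    rhs = np.array([Py @ M @ Py for M in mats])  # y' P A_i P y
    return np.linalg.solve(T, rhs)

# Hypothetical toy data: one random-effect component, intercept only.
rng = np.random.default_rng(2)
n, m = 60, 200
Z = rng.standard_normal((n, m))
A = Z @ Z.T / m
X = np.ones((n, 1))
y = (Z @ rng.standard_normal(m) * np.sqrt(0.5 / m)
     + rng.standard_normal(n) * np.sqrt(0.5))
q0 = np.full(2, np.var(y, ddof=1) / 2)
q1 = direct_reml_step(y, X, [A], q0)
print("updated variance components:", q1)
```

Unlike the AI-REML step, this update needs the full n × n products P A_i to form the traces, which is one way to see why it is the more expensive of the two per iteration.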

The direct REML algorithm is generally more robust but computationally less efficient than AI-REML. For the EM-REML algorithm, each variance component is estimated as

$$ \sigma_i^{2(t)}=\left[\sigma_i^{4(t-1)}\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_i\mathbf{P}\mathbf{y}+\mathrm{tr}\left(\sigma_i^{2(t-1)}\mathbf{I}-\sigma_i^{4(t-1)}\mathbf{P}\mathbf{A}_i\right)\right]/n. $$
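A minimal NumPy sketch of the EM-REML update follows, under the same assumptions as before: dense matrices, a hypothetical function name `em_reml_step`, and simulated toy data that are not part of the chapter. The trace term is expanded as n σ² − σ⁴ tr(P A_i).

```python
import numpy as np

def em_reml_step(y, X, A_list, q):
    """One EM-REML update applied to every variance component (sketch only)."""
    n = len(y)
    mats = list(A_list) + [np.eye(n)]          # last "A" is I, for sigma_e^2
    V = sum(s * M for s, M in zip(q, mats))
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    P = Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)
    Py = P @ y
    # [sigma^4 y'P A_i P y + tr(sigma^2 I - sigma^4 P A_i)] / n
    return np.array([(s**2 * (Py @ M @ Py) + n * s - s**2 * np.trace(P @ M)) / n
                     for s, M in zip(q, mats)])

# Hypothetical toy data: one random-effect component, intercept only.
rng = np.random.default_rng(3)
n, m = 60, 200
Z = rng.standard_normal((n, m))
A = Z @ Z.T / m
X = np.ones((n, 1))
y = (Z @ rng.standard_normal(m) * np.sqrt(0.5 / m)
     + rng.standard_normal(n) * np.sqrt(0.5))
q0 = np.full(2, np.var(y, ddof=1) / 2)
q1 = em_reml_step(y, X, [A], q0)
print("updated variance components:", q1)
```

With positive starting values and positive semidefinite A_i, the updated components stay positive, so no boundary handling is needed in this step.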

EM-REML is robust in that the likelihood is guaranteed to increase after each iteration, but it is extremely slow to converge. We therefore do not recommend choosing EM-REML in GCTA unless the starting values are known to be very close to the estimates. The GCTA option for choosing among the REML algorithms is --reml-alg, with the input value 0 for AI-REML (default), 1 for the direct REML algorithm, and 2 for EM-REML.

At the beginning of the iteration process, all the variance components are initialized to an arbitrary value, i.e., \( \sigma_i^{2(0)}=\sigma_{\mathrm{P}}^2/(r+1) \), which is subsequently updated by the EM-REML algorithm, \( \sigma_i^{2(1)}=[\sigma_i^{4(0)}\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_i\mathbf{P}\mathbf{y}+\mathrm{tr}(\sigma_i^{2(0)}\mathbf{I}-\sigma_i^{4(0)}\mathbf{P}\mathbf{A}_i)]/n \). The EM-REML algorithm is used as an initial step to determine the direction of the iteration updates because it is robust to poor starting values. We also provide options (--reml-priors and --reml-priors-var) in GCTA for users to specify starting values. After one EM-REML iteration, GCTA switches to the AI-REML algorithm (or to whichever of the other two algorithms has been selected) for the remaining iterations until the iteration converges, with the criterion \( L^{(t)}-L^{(t-1)}<10^{-4} \), where \( L^{(t)} \) is the log likelihood of the tth iteration.

By default, any variance component that escapes from the parameter space (i.e., its estimate is negative) will be set to \( 10^{-6}\times \sigma_{\mathrm{P}}^2 \). If a component keeps escaping from the parameter space, it will be constrained at \( 10^{-6}\times \sigma_{\mathrm{P}}^2 \). There is an option in GCTA (--reml-no-constrain) that allows the estimates of variance components to be negative. This is justified because if a parameter is zero, an unbiased estimate of this parameter has a 50% chance of being negative. In practice, however, a negative variance component is usually difficult to interpret. We also provide an option (--reml-maxit) for users to specify the maximum number of iterations, after which the iteration process stops even if it has not converged.
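The overall iteration scheme described in this appendix (equal-split starting values, one EM-REML step, AI-REML steps until the log likelihood changes by less than 10^-4, resetting negative estimates to 10^-6 × σ_P², and a cap on the number of iterations) can be sketched as below. This is an illustrative NumPy outline under several assumptions, not GCTA's code: the names `reml_parts` and `fit_reml` are hypothetical, σ_P² is taken to be the sample variance of y, convergence is checked on the absolute change in L, and a component that repeatedly escapes is simply reset each time rather than permanently constrained.

```python
import numpy as np

def reml_parts(y, X, mats, q):
    """Return (L, P, Py) at variance components q, with L ignoring the constant."""
    V = sum(s * M for s, M in zip(q, mats))
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    XtVinvX = X.T @ VinvX
    P = Vinv - VinvX @ np.linalg.solve(XtVinvX, VinvX.T)
    Py = P @ y
    L = -0.5 * (np.linalg.slogdet(V)[1] + np.linalg.slogdet(XtVinvX)[1] + y @ Py)
    return L, P, Py

def fit_reml(y, X, A_list, maxit=100, tol=1e-4):
    """EM-REML for the first step, then AI-REML until convergence or maxit."""
    n = len(y)
    mats = list(A_list) + [np.eye(n)]
    var_p = np.var(y, ddof=1)                  # assumed estimate of sigma_P^2
    floor = 1e-6 * var_p                       # value used for escaping components
    q = np.full(len(mats), var_p / len(mats))  # sigma_P^2 / (r + 1)
    # Initial EM-REML step
    _, P, Py = reml_parts(y, X, mats, q)
    q = np.array([(s**2 * (Py @ M @ Py) + n * s - s**2 * np.trace(P @ M)) / n
                  for s, M in zip(q, mats)])
    L_prev, P, Py = reml_parts(y, X, mats, q)
    for it in range(1, maxit + 1):
        APy = [M @ Py for M in mats]
        grad = np.array([-0.5 * (np.trace(P @ M) - Py @ v)
                         for M, v in zip(mats, APy)])
        H = 0.5 * np.array([[u @ P @ v for v in APy] for u in APy])
        q = q + np.linalg.solve(H, grad)       # AI-REML update
        q = np.where(q < 0, floor, q)          # reset negative estimates
        L, P, Py = reml_parts(y, X, mats, q)
        if abs(L - L_prev) < tol:              # absolute change: an assumption here
            return q, L, it, True
        L_prev = L
    return q, L_prev, maxit, False

# Hypothetical toy data: one random-effect component, intercept only.
rng = np.random.default_rng(4)
n, m = 300, 400
Z = rng.standard_normal((n, m))
A = Z @ Z.T / m
X = np.ones((n, 1))
y = (Z @ rng.standard_normal(m) * np.sqrt(0.5 / m)
     + rng.standard_normal(n) * np.sqrt(0.5))
q_hat, L_hat, n_iter, converged = fit_reml(y, X, [A])
print(q_hat, L_hat, n_iter, converged)
```

The `maxit` and `tol` arguments mirror the roles of --reml-maxit and the 10^-4 criterion in the text; they are parameters of this sketch only and do not reproduce GCTA's command-line interface.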

Copyright information

© 2013 Springer Science+Business Media, LLC

Cite this protocol

Yang, J., Lee, S.H., Goddard, M.E., Visscher, P.M. (2013). Genome-Wide Complex Trait Analysis (GCTA): Methods, Data Analyses, and Interpretations. In: Gondro, C., van der Werf, J., Hayes, B. (eds) Genome-Wide Association Studies and Genomic Prediction. Methods in Molecular Biology, vol 1019. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-447-0_9

  • DOI: https://doi.org/10.1007/978-1-62703-447-0_9

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-62703-446-3

  • Online ISBN: 978-1-62703-447-0

  • eBook Packages: Springer Protocols
