
Successful strategies for human microbiome data generation, storage and analyses

  • Review
  • Journal of Biosciences

Abstract

Current interest in the clinical use of new tools for improving human health is now focused on techniques for studying the human microbiome and its interaction with environmental and clinical covariates. This review outlines statistical strategies that were developed in past studies and can inform the successful design and analysis of controlled perturbation experiments on the human microbiome. We carefully outline what the data are, what their imperfections are, and how we need to transform, decontaminate and denoise them. We show how to identify the important unknown parameters and how we can leverage the variability we see to produce efficient models for prediction and uncertainty quantification. We encourage a reproducible strategy that builds on best-practice principles and can be adapted for effective experimental design and reproducible workflows. Nonparametric, data-driven denoising strategies already provide the best strain identification and decontamination methods. Data-driven models can be combined with uncertainty quantification to provide reproducible aids to decision making in the clinical context, as long as careful, separate, registered confirmatory testing is undertaken. Here we provide guidelines for effective longitudinal studies and their analyses. Lessons learned along the way are that visualizations at every step can pinpoint problems and outliers, and that normalization and filtering improve power in downstream testing. We recommend collecting and binding the metadata and covariates to sample descriptors and recording complete computer scripts in an R markdown supplement, which can reduce opportunities for human error and enable collaborators and readers to replicate all the steps of the study. Finally, we note that optimizing the bioinformatic and statistical workflow involves adopting a wait-and-see approach that is particularly effective in cases where features such as ‘mass spectrometry peaks’ and metagenomic tables can only be partially annotated.
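
As a rough illustration of the filtering, transformation and metadata-binding steps mentioned in the abstract, the following minimal Python sketch works on a toy count table. It is not the workflow described in the review, which recommends R markdown. The sample IDs, taxon names, covariates, the prevalence threshold and the choice of an asinh transform are all assumptions made here for illustration only.

```python
import math


def prevalence_filter(counts, min_samples=2):
    """Keep taxa observed (count > 0) in at least `min_samples` samples."""
    taxa = {t for sample in counts.values() for t in sample}
    keep = {
        t for t in taxa
        if sum(1 for s in counts.values() if s.get(t, 0) > 0) >= min_samples
    }
    return {sid: {t: c for t, c in s.items() if t in keep}
            for sid, s in counts.items()}


def asinh_transform(counts):
    """Apply asinh, which is close to log(2x) for large counts and defined at zero."""
    return {sid: {t: math.asinh(c) for t, c in s.items()}
            for sid, s in counts.items()}


def bind_metadata(counts, metadata):
    """Attach covariates to each sample, failing loudly if sample IDs do not match."""
    unmatched = set(counts) ^ set(metadata)
    if unmatched:
        raise ValueError(f"sample IDs without a match: {sorted(unmatched)}")
    return {sid: {"counts": counts[sid], "covariates": metadata[sid]}
            for sid in counts}


# Hypothetical toy data: three samples, three taxa, two covariates.
counts = {
    "S1": {"taxonA": 120, "taxonB": 0, "taxonC": 3},
    "S2": {"taxonA": 95, "taxonB": 1, "taxonC": 0},
    "S3": {"taxonA": 210, "taxonB": 0, "taxonC": 0},
}
metadata = {
    "S1": {"subject": "P1", "day": 0},
    "S2": {"subject": "P1", "day": 7},
    "S3": {"subject": "P2", "day": 0},
}

filtered = prevalence_filter(counts, min_samples=2)
transformed = asinh_transform(filtered)
bound = bind_metadata(transformed, metadata)
```

Keeping the covariates in the same object as the transformed counts means that any later subsetting or reordering of samples carries the metadata along with it, which is the kind of human error the abstract's recommendation is meant to prevent.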


Notes

  1. See Don Knuth’s famous quote that ‘premature optimization is the root of all evil in computer programming’.


Acknowledgements

The work was partly supported by NIH Grant AI112401. The author is thankful to Dr. Yogesh Shouche and the team at ICMR2018 for the opportunity to provide this short personal review of the challenges in designing and analyzing microbiome studies.

Author information

Corresponding author

Correspondence to Susan Holmes.

About this article

Cite this article

Holmes, S. Successful strategies for human microbiome data generation, storage and analyses. J Biosci 44, 111 (2019). https://doi.org/10.1007/s12038-019-9934-y

  • DOI: https://doi.org/10.1007/s12038-019-9934-y
