Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline

  • Original Article
  • Metabolomics

Abstract

Missing values occur widely in mass spectrometry metabolomic datasets and can originate from a number of sources, both technical and biological. Currently, little is known about these data: their distributions across datasets, whether they need to be considered in the data processing pipeline and, most importantly, the optimal way of assigning them values prior to univariate or multivariate data analysis. Here, we address all of these issues using direct infusion Fourier transform ion cyclotron resonance mass spectrometry data. We show that missing data are widespread, accounting for ca. 20% of data and affecting up to 80% of all variables, and that they do not occur randomly but rather as a function of signal intensity and mass-to-charge ratio. We demonstrate that missing data estimation algorithms have a major effect on the outcome of data analysis when comparing biological sample groups, including by t test, ANOVA and principal component analysis. Furthermore, results varied significantly across the eight algorithms that we assessed for their ability to impute entries whose values were known but had been labelled as missing. Based on all of our findings, we identified k-nearest neighbour (KNN) imputation as the optimal missing value estimation approach for our direct infusion mass spectrometry datasets. However, we believe the wider significance of this study is that it highlights the importance of missing metabolite levels in the data processing pipeline and offers an approach to identify optimal ways of treating missing data in metabolomics experiments.
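The evaluation idea summarised in the abstract (hide entries whose values are known, impute them, then compare against the truth) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `knn_impute` and `masked_error`, the peaks-by-samples matrix layout, the Euclidean distance over shared columns, the random masking scheme and the normalised RMSE score are all assumptions chosen for illustration. In particular, the abstract reports that real missing values do not occur at random, so uniform random masking is a simplification.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in X (rows = peaks, columns = samples; an assumed layout)
    with the mean of the k nearest rows that are observed in that column."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        # Compare row i with every row over the columns observed in both.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[i] = np.inf              # a row is not its own neighbour
        dist[n_shared == 0] = np.inf  # no shared columns, not comparable
        for j in np.where(missing[i])[0]:
            candidates = np.where(~missing[:, j] & np.isfinite(dist))[0]
            if candidates.size == 0:
                continue              # nothing to borrow from; leave as NaN
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

def masked_error(X_complete, frac=0.2, k=3, seed=0):
    """Hide a random fraction of known entries, impute them, and return
    the RMSE on the hidden entries divided by their standard deviation."""
    X_complete = np.asarray(X_complete, dtype=float)
    rng = np.random.default_rng(seed)
    mask = rng.random(X_complete.shape) < frac
    X_masked = np.where(mask, np.nan, X_complete)
    X_hat = knn_impute(X_masked, k=k)
    scored = mask & ~np.isnan(X_hat)  # skip entries that could not be filled
    err = X_hat[scored] - X_complete[scored]
    return np.sqrt(np.mean(err ** 2)) / np.std(X_complete[scored])
```

Running `masked_error` on a complete submatrix with several values of `k`, or with a different imputation function swapped in, gives a rough way to rank methods on a given dataset. A lower score means the hidden values were recovered more closely.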

Acknowledgments

We thank Drs. Alessia Lodi, Stefano Tiziani and Chris Bunce for provision of the FT-ICR MS datasets of the cancer cell extracts.

Author information

Corresponding author

Correspondence to Mark R. Viant.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 2993 kb)

About this article

Cite this article

Hrydziuszko, O., Viant, M.R. Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline. Metabolomics 8 (Suppl 1), 161–174 (2012). https://doi.org/10.1007/s11306-011-0366-4

  • DOI: https://doi.org/10.1007/s11306-011-0366-4
