A toolbox to explore NMR metabolomic data sets using the R environment

https://doi.org/10.1016/j.chemolab.2013.04.015

Abstract

We describe the implementation of graphical and statistical tools developed in the free R software environment to explore metabolomic data sets. This toolbox, whose latest releases are available from the authors upon request, includes existing univariate, bivariate and multivariate approaches accompanied by various graphical displays and interactive facilities. In practice, only very basic knowledge of R is required: from Excel data files as input to graphical and numerical outputs, the user is led through a set of questions that simply have to be answered. We illustrate the potential of the toolbox on a data set from a 1H NMR metabolomic study of cerebellums from a murine model of Alzheimer's disease. We show how various graphical techniques complement one another to provide information that is easier to interpret. In particular, a simple correlation study can be highly meaningful, and competitive with a more sophisticated multivariate analysis, when using ad hoc graphical representations depending on the level of interest: global, multiple or single metabolite focus.

Introduction

Molecular biology is now strongly driven by high-throughput facilities resulting in large-scale molecular profiling. Many aspects of a biological system can be studied by exploring the “omics-land” (www.omics.org) with ad hoc technologies: genomics (sequencer), transcriptomics (microarray), proteomics (mass spectrometry, MS), metabolomics (mainly Nuclear Magnetic Resonance, NMR, and MS), etc. These techniques generate a huge amount of data from a single biological sample. In the “omics” context, metabolomics plays a key role, as the metabolome can be viewed as the response of living systems to biological perturbations (genetic modification, physiological and/or pathophysiological stimuli, for instance).

NMR-based metabolomic analyses provide spectrum-shaped data that require specific mathematical pre-processing. De-noising, baseline adjustment, peak detection, alignment of multiple peaks, binning, etc. are topics that have triggered many methodological developments [1], [2], [3], [4].

Once the pre-processing has been done, exploratory analyses must be performed to face the overwhelming amount of data. Unsupervised (Principal Component Analysis, PCA) and supervised (Projections to Latent Structures-Discriminant Analysis, PLS-DA) methods are commonly used to highlight relevant underlying information [5].

In addition, relevant features can be extracted from the analysis of the correlation matrix between peaks or buckets defined from a spectrum. In particular, analyzing correlated peaks can be very meaningful to identify signals from the same molecule as well as metabolites belonging to the same metabolic pathway [6]. The link between statistical correlation and relationship in the underlying metabolic network cannot be drawn directly [7] but this topic has stimulated many investigations [8], [9] and an in-depth analysis of the correlation matrix could reveal useful information [10], [11].
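As a minimal illustration of the correlation matrix discussed above, the following sketch computes pairwise Pearson correlations between buckets. It is written in plain Python rather than R and is not taken from the toolbox; the bucket names (`b1`, `b2`, `b3`), the intensity values and the helper functions `pearson` and `correlation_matrix` are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(table):
    """Build a nested dict of correlations.

    `table` maps a bucket name to its intensities across samples
    (one value per sample, same sample order for every bucket).
    """
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}


# Hypothetical bucket intensities for five samples (illustrative values only).
buckets = {
    "b1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b2": [2.1, 3.9, 6.2, 8.0, 9.8],  # roughly proportional to b1
    "b3": [5.0, 4.0, 3.0, 2.0, 1.0],  # decreases as b1 increases
}
corr = correlation_matrix(buckets)

for name, row in corr.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

In this toy example `b1` and `b2` are strongly positively correlated, as two signals from the same molecule might be, while `b1` and `b3` are strongly negatively correlated. As the text notes, such a correlation does not by itself establish a relationship in the underlying metabolic network.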

The purpose of the present work was to develop, in the free R software environment [12], statistical and graphical techniques to study NMR data. In practice, only very basic knowledge of R is required. The user just has to source the R script file corresponding to the desired analysis. Once the script is sourced, data are imported from xls files. The user can then customize the analysis by answering some very simple questions asked by the program, such as “Do you want different colors? yes/no” or “Insert the order to display boxplots”. Output files including numerical and graphical results are produced and stored in one main directory with sub-directories so that results are easy to locate. This toolbox includes univariate, bivariate and multivariate statistical analyses. Univariate approaches offer a set of graphical and numerical tools to provide information for each variable of the data set. Bivariate approaches lead to various representations of the correlation matrix to explore highly correlated metabolites. Multivariate methods then provide a global overview of the data set in either an unsupervised (PCA) or a supervised (PLS-DA) framework. Sparse versions of these methods enable the selection of the most relevant variables to focus on. All these tools are accompanied by interactive facilities that make the biological interpretation of the results easier.
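To make the univariate step more concrete, the sketch below computes the kind of per-group numerical summary that typically accompanies a boxplot of one variable. It is an assumption about what such a summary could contain, not a description of the toolbox's actual output, and it uses plain Python rather than R. The function `summarize_by_group`, the group labels and the intensity values are hypothetical.

```python
import statistics


def summarize_by_group(values, groups):
    """Return {group: {"n", "mean", "median", "sd"}} for one variable.

    `values` holds one intensity per sample and `groups` the matching
    group label of each sample, in the same order.
    """
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    return {
        group: {
            "n": len(vals),
            "mean": statistics.mean(vals),
            "median": statistics.median(vals),
            "sd": statistics.stdev(vals),  # sample standard deviation
        }
        for group, vals in by_group.items()
    }


# Hypothetical intensities of one bucket in six samples (illustrative only).
groups = ["control"] * 3 + ["transgenic"] * 3
intensity = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]

summary = summarize_by_group(intensity, groups)
for group, stats in summary.items():
    print(group, stats)
```

Applying the same function to every column of the data table gives one summary per variable, which is the sense in which the approach is univariate.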

Several other packages or toolboxes already exist with a quite similar purpose: see for instance metaP-Server [13], and the metabonomic [14] and MUMA [15] R packages. Other references can be found in a recent review [16], and the in-house Matlab codes used in many analyses are sometimes available upon request. In general, as far as the statistical analysis of the data is concerned, i.e. once the pre-processing has been done, these packages essentially deal with multivariate unsupervised (PCA) or supervised (PLS-DA) analysis or with classification approaches (k-nearest neighbors). In our toolbox, we opt for a larger choice of graphical representations rather than of methods. We also include bivariate approaches, which are less commonly used except in the STOCSY representation [6].

To illustrate the potential of our toolbox, we present the data of a 1H NMR metabolomic study comparing the cerebellum metabolism of control and Alzheimer's disease (AD) model mice. In this case study, we chose to concentrate on the interpretation of the correlation matrix with three levels of interest. The first one is global, as it takes place before the identification of potential discriminant metabolites. It uses pairwise correlations on the whole set of metabolites in the samples. The second and third levels of interest require a prior selection of either several variables of interest (multiple) or only one (single). We illustrate in this case study the pros and cons of three graphical techniques: STOCSY [6], heatmap [17] and correlation networks [18], [19]. We show that, depending on the level of interest, some are more appropriate than others for providing information that is easy to interpret.
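The difference between the global and the single level of interest can be sketched as follows, in plain Python rather than R. The global view keeps every pair of variables whose absolute correlation exceeds a threshold, giving the edge list of a correlation network; the single view ranks all other variables by their correlation with one chosen variable, in the spirit of a STOCSY-like display. The threshold of 0.9, the function names `correlation_edges` and `single_focus`, and the data are assumptions made for illustration; the toolbox's own construction rules are not given in this text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(table, threshold=0.9):
    """Global level: list (a, b, r) for each pair with |r| >= threshold."""
    names = sorted(table)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:
                edges.append((a, b, r))
    return edges


def single_focus(table, driver):
    """Single level: rank the other variables by |r| with `driver`."""
    pairs = [
        (name, pearson(table[driver], values))
        for name, values in table.items()
        if name != driver
    ]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical bucket intensities for five samples (illustrative values only).
buckets = {
    "b1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "b3": [5.0, 4.0, 3.0, 2.0, 1.0],
    "b4": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to the others
}

edges = correlation_edges(buckets, threshold=0.9)
ranking = single_focus(buckets, "b1")

print("network edges:", [(a, b, round(r, 3)) for a, b, r in edges])
print("ranked against b1:", [(name, round(r, 3)) for name, r in ranking])
```

The multiple level of interest would sit between the two: the same edge list restricted to a previously selected subset of variables.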

A set of R scripts that requires only basic knowledge in R for the user is available from the authors upon request.

Methods

An overview of the methods available in our toolbox is displayed in Fig. 1. They are presented in the following paragraphs according to the number of variables the methods can deal with simultaneously.

Software

Routines for computation and graphics were written in the free software environment R using various packages, including xlsReadWrite to read and write Excel files [28] (this dependency restricts the user to 32-bit Windows), igraph [29] and tcltk (based on Tcl/Tk, www.tcl.tk) to build and plot networks, and mixOmics for multivariate analysis [23]. Concretely, very basic knowledge in R is required. The user has only to source one of the R script files corresponding to the analysis he

Sample collection and tissue extraction

Seventeen cerebellums were collected after cervical dislocation of 8 control (Tg−) and 9 transgenic (Tg+) AppSwe Tg2576 mice. Tissues were extracted according to Beckonert's procedure [30] with methanol/chloroform/water. The upper methanol/water phase was collected. Methanol was eliminated by vacuum centrifugation (Speed-Vac). Borate buffer (550 μL) at pH 10.0 was added to the remaining aqueous phase, which was then lyophilized and stored at −80 °C. Before NMR analysis, the dried-frozen extract

Reviewer assessments

B. Féraud

Institut de Statistique Biostatistique et Sciences Actuarielles, Université Catholique de Louvain. 20 voie du Roman Pays 1348 Louvain-La-Neuve, Belgique.

I, Baptiste Féraud, PhD researcher (at ISBA, UCL, Belgium) working on 2D-NMR metabonomics under the supervision of Prof. Bernadette Govaerts and Prof. Michel Verleysen declare to have tested this R toolbox in a strictly independent way. I received from Mr. Stéphane Balayssac and Mr. Sébastien Déjean all necessary information and files

Conclusion

The toolbox we developed provides many graphical techniques associated with statistical methods in order to facilitate the interpretation of the results for addressing biological problems. This toolbox can be easily handled by inexperienced R users because: (i) inputs are xls files, (ii) questions are asked of the user to design the desired analysis, (iii) interactive facilities are available to customize some graphics and (iv) outputs are stored in various sub-directories. We have illustrated

Acknowledgments

The authors are grateful to Floriane Gaffet, Nadia Saouate, Thibault Duprat and Leïla Ait Ou Ammi who contributed to the development of the software during their internships.

References (47)

  • C.D. Pederzolli et al. N-acetylaspartic acid promotes oxidative stress in cerebral cortex of rats. International Journal of Developmental Neuroscience (2007)
  • F.B. Goldstein. Biosynthesis of N-acetyl-l-aspartic acid. Biochimica et Biophysica Acta (1959)
  • A. Antoniadis et al. Nonparametric pre-processing methods and inference tools for analyzing time-of-flight mass spectrometry data. Current Analytical Chemistry (2007)
  • A.C. Sauve et al. Normalization, baseline correction and alignment of high throughput mass spectrometry data
  • R. Rousseau et al. Comparison of some chemometric tools for metabonomics biomarker identification. Chemometrics and Intelligent Laboratory Systems (2007)
  • O. Cloarec et al. Statistical total correlation spectroscopy: an exploratory approach for latent biomarker identification from metabolic 1H NMR data sets. Analytical Chemistry (2005)
  • D. Camacho et al. The origin of correlations in metabolomics data. Metabolomics (2005)
  • R. Steuer. On the analysis and interpretation of correlations in metabolomic data. Briefings in Bioinformatics (2006)
  • M. Müller-Linow et al. Consistency analysis of metabolic correlation networks. BMC Systems Biology (2007)
  • E. Allen et al. Correlation Network Analysis reveals a sequential reorganization of metabolic and transcriptional states during germination and gene-metabolite relationships in developing seedlings of Arabidopsis. BMC Systems Biology (2010)
  • S. Sato et al. Time-resolved metabolomics reveals metabolic modulation in rice foliage. BMC Systems Biology (2008)
  • R Development Core Team. R Foundation for Statistical Computing, Vienna, Austria (2012)
  • G. Kastenmüller et al. metaP-Server: a web-based metabolomics data analysis tool. Journal of Biomedicine and Biotechnology (2011)
    1

    These two authors equally contributed to this work.
