Bioinformatics is a field that enables data collection, analysis, parsing, and contextual interpretation, and it supports decision-making on those bases. Bioinformatics can be defined as conceptualizing biology in terms of molecular components and applying “informatics techniques” borrowed from disciplines such as applied mathematics, computer science, and statistics to understand and organize information on a large scale (Luscombe et al
2001). The major challenge is to reduce the dimensionality by selecting informative metabolic signals from the highly noisy raw data. Chemometric tools are widely used to achieve this goal. Chemometrics is defined as the science of extracting useful information from chemical systems by data-driven means (Brereton
2014). It may be applied to solve both descriptive and predictive problems using biochemical data. In multivariate methods, representative samples are presented as points in the space of the initial variables. The samples can then be projected into a lower-dimensional space defined by components or latent variables, such as a line, a plane, or a hyperplane, which can be seen as the shadow of the initial data set viewed from its best perspective. The sample coordinates on the newly defined latent variables are the scores, while the directions of variance onto which the samples are projected are the loadings. The loadings vector for each latent variable contains the weights of each of the initial variables (metabolites) for that latent variable.

Chemometric methods are mainly divided into unsupervised and supervised methods. In unsupervised methods, no assumptions are made about the samples and the aim is mainly exploratory. These methods attempt to reveal patterns or clustering trends in the data that underpin relationships between the samples, and they highlight, through visualization, the variables responsible for these relationships. In metabolomics data, metabolic similarity shapes the observed clustering. Principal component analysis (PCA) (Hotelling
1933) is a widely used, projection-based pattern recognition method that reduces the dimensionality of the data by creating components. Principal component analysis allows a two- or three-dimensional visualization of the data. Because it makes no assumptions about the samples, it is used as an initial visualization and exploratory tool to detect trends, groups, and outliers. It allows simpler global visualization by representing the variance in a small number of uncorrelated latent variables. Independent component analysis (ICA) is another unsupervised method; it is a blind source separation technique that separates multivariate signals into additive subcomponents (Bouveresse and Rutledge
2016). Its interpretation is similar to that of PCA, but instead of orthogonal components, it calculates non-Gaussian, mutually independent components (Wang et al
2008; Al-Saegh
2015). Compared with PCA, ICA, which is also a linear method, may offer benefits for untargeted metabolomics. ICA has been successfully used in metabolomics (Li et al
2012; Monakhova et al
2015; Liu et al
2016). Other unsupervised methods, such as clustering, aim to identify naturally occurring clusters in the data set by using similarity measures defined by distance and linkage metrics (Wiwie et al
2015). A dendrogram or a heat map can be created to visualize the similarities between samples. Commonly used clustering methods include correlation matrix analysis, k-means clustering (Hartigan and Wong
1979), hierarchical cluster analysis (Johnson
1967), and self-organizing maps (Kohonen
1990; Goodwin et al
2014).

In supervised methods, samples are assigned to classes or each sample is associated with a specific outcome value, and the aim is mainly explanatory and predictive. When the response variable is discrete (e.g., control group versus diseased group), the task is called classification. When the response variable is continuous (e.g., metabolite concentration), the task is called regression. The main purposes of supervised techniques are (i) to determine the association between the response variable and the predictors (metabolites) and (ii) to make accurate predictions based on the predictors. In metabolomics biomarker discovery, it is important during modeling to find the simplest combination of metabolites that can produce a suitably effective predictive outcome. The biomarker discovery process thus involves two parameters: the biomarker utility and the number of metabolites used in the predictive model. The main challenges are therefore predictor selection and the evaluation of the fit and predictive power of the built model. Predictor selection aims to identify, from among the detected metabolites, those that best explain and predict the biological or clinical outcome. Different predictor selection techniques have been described. Some of these strategies use univariate or multivariate statistical properties of the variables as filters (loading weights, variable importance in projection scores, or regression coefficients), while others are based on optimization algorithms (Saeys et al
2007; Yi et al
2016). Recently, another elegant method has been reported that essentially combines estimation of Mahalanobis distances with principal component analysis and variable selection using a penalty metric instead of dimension reduction (Engel et al
2017). This method was successfully applied to screening for inherited metabolic diseases (IMD). Finally, performance metrics are needed to assess the predictive power of the model. Commonly used statistics include the root mean square error (RMSE) for regression problems and sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve for classification models. Independent test data sets are desirable, but data collection may be expensive or hampered by limited sample availability, as is the case for rare diseases such as IMD. In such cases, various resampling methods, such as cross-validation, bootstrapping, and jackknifing, are used to make efficient use of the available data set (Westad and Marini
2015). Various supervised techniques can be used in metabolomics. Some of the most widely used include linear discriminant analysis (LDA) (Balog et al
2013; Ouyang et al
2014) and partial least squares (PLS) methods such as PLS-discriminant analysis (PLS-DA) (Wold et al
2001) and orthogonal-PLS-DA (OPLS-DA) (Trygg and Wold
2002; Manwaring et al
2013), as well as support vector machines (Cortes and Vapnik
1995; Lin et al
2011) and random forest (Breiman
2001; Huang et al
2015). Recently, Habchi et al proposed an innovative supervised method based on ICA, called IC-DA. This method has been successfully applied to DIMS metabolomics data and could be useful for high-throughput screening (Habchi et al
2017). Furthermore, new methods based on topological data analysis are drawing interest and seem promising because of their intrinsic flexibility and their exploratory and predictive abilities (Liu et al
2015; Offroy and Duponchel
2016). Recently, a new method called statistical health monitoring (SHM) has been adapted from industrial statistical process control: an individual metabolic profile is compared with a healthy one in a multivariate fashion. Abnormal metabolite patterns are thus detected, and a more intelligible interpretation is enabled (Engel et al
2014). This approach has been successfully applied in IMD investigations (Engel et al
2017).

The aim of a metabolomics study and its data analysis strategy are highly interdependent. Moreover, multivariate and univariate data analysis pipelines are not mutually exclusive; they are often used together to improve information recovery. For further details on data analysis techniques and tools used in metabolomics, the reader may refer to recent reviews on this topic (Gromski et al
2015; Ren et al
2015; Misra and van der Hooft
2016).
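
As an illustration of the scores and loadings described in this section, and of resampling when samples are scarce, the following minimal Python sketch may help. It is not taken from any of the cited works. It assumes a synthetic data matrix (40 samples by 50 metabolites, with the first five metabolites shifted in one group), and the function names `pca` and `loo_accuracy` are hypothetical. The classifier is a simple nearest-centroid rule in PCA score space, used here only as a stand-in for the supervised methods named above (it is not LDA, PLS-DA, or any other method cited in this section).

```python
import numpy as np


def pca(X, n_components=2):
    """PCA by singular value decomposition of the mean-centered data.

    Returns scores (samples x components), loadings (variables x components),
    the fraction of variance explained by each component, and the column means.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # sample coordinates
    loadings = Vt[:n_components].T                   # variable weights
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components], mean


def loo_accuracy(X, y, n_components=2):
    """Leave-one-out cross-validation of a nearest-centroid classifier.

    PCA is refitted on the training samples of every split so that the
    held-out sample never influences the model used to predict it.
    """
    n = X.shape[0]
    correct = 0
    for i in range(n):
        train = np.arange(n) != i
        _, loadings, _, mean = pca(X[train], n_components)
        train_scores = (X[train] - mean) @ loadings
        test_score = (X[i] - mean) @ loadings
        classes = np.unique(y[train])
        centroids = np.array(
            [train_scores[y[train] == c].mean(axis=0) for c in classes]
        )
        distances = np.linalg.norm(centroids - test_score, axis=1)
        correct += int(classes[np.argmin(distances)] == y[i])
    return correct / n


# Synthetic example (an assumption for illustration, not real metabolomics data).
rng = np.random.default_rng(0)
n_per_group, n_metabolites = 20, 50
X = rng.normal(size=(2 * n_per_group, n_metabolites))
y = np.array([0] * n_per_group + [1] * n_per_group)
X[y == 1, :5] += 3.0  # first five "metabolites" are elevated in group 1

scores, loadings, explained, mean = pca(X, n_components=2)

# Variables with the largest absolute weight on the first component.
top = np.argsort(np.abs(loadings[:, 0]))[::-1][:5]
accuracy = loo_accuracy(X, y, n_components=2)

print("variance explained by PC1 and PC2:", np.round(explained, 3))
print("metabolite indices with the largest PC1 loadings:", sorted(top.tolist()))
print("leave-one-out accuracy:", accuracy)
```

In this sketch the loadings of the first component point to the shifted variables, which mirrors how loadings are read to identify the metabolites behind a separation seen in the scores. Refitting the projection inside each cross-validation split is a deliberate choice: fitting it once on all samples would let the held-out sample influence the model.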