Statistical approaches used in metabolomics
Two types of data analysis approaches, univariate analysis and multivariate analysis, have been widely used in metabolomics projects, including most of the CF-related studies referenced in this paper.
Univariate analysis examines each metabolite separately. It includes parametric methods, such as the paired t test, the Welch t test, and linear models, and non-parametric methods, such as the Wilcoxon signed rank test, the Mann-Whitney test, and Kruskal-Wallis ANOVA. Parametric methods are appropriate when the data approximately meet the normality assumption; data transformations such as the log transformation are often used to improve normality. The t test is frequently used to compare two classes, e.g., CF vs. non-CF. While the t test is easy to use, an advantage of a linear model is that multiple confounders can be included in the model, so that metabolic variation due to these confounders is adjusted for. Such linear models were used, for example, in the blood-based studies referenced here [6, 7]. Univariate analysis of metabolomics data usually involves hundreds of tests (one for each metabolite); therefore, control of the false discovery rate (FDR) in multiple testing is very important. The most commonly used FDR control methods are the Q value [29] and the Benjamini-Hochberg procedure [30].
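To make the multiple-testing step concrete, the following is a minimal sketch of the Benjamini-Hochberg adjustment, written in Python with the standard library only. It assumes that one p value per metabolite has already been obtained from a univariate test; the function names, the metabolite names, and the p values are hypothetical and serve only as an illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down to the smallest. Each p value is
    # scaled by m / rank, and the running minimum keeps the adjusted
    # values monotone and capped at 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_metabolites(p_by_metabolite, fdr=0.05):
    """Names of metabolites whose adjusted p value is at or below `fdr`."""
    names = list(p_by_metabolite)
    adjusted = benjamini_hochberg([p_by_metabolite[n] for n in names])
    return [n for n, q in zip(names, adjusted) if q <= fdr]


# Hypothetical p values, one per metabolite (illustration only).
example_p = {
    "metabolite_A": 0.001,
    "metabolite_B": 0.008,
    "metabolite_C": 0.039,
    "metabolite_D": 0.041,
    "metabolite_E": 0.60,
}
print(significant_metabolites(example_p, fdr=0.05))
```

In this made-up example, metabolites C and D have raw p values below 0.05 but are no longer selected after adjustment, which illustrates why FDR control matters when hundreds of metabolites are tested.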
Multivariate analysis examines all of the metabolites in the data simultaneously. It detects important metabolic variation through dimension reduction. It includes unsupervised classification methods such as principal component analysis (PCA) and supervised classification methods such as partial least squares discriminant analysis (PLS-DA) and orthogonal partial least squares discriminant analysis (OPLS-DA). PCA detects major variation in the data without using sample classes and is therefore suitable for data exploration and for detecting outliers. Most of the CF studies compare two or more classes of samples, e.g., CF vs. non-CF [6], CF with low inflammation vs. high inflammation [16], stable CF vs. unstable CF [19], and CF vs. PCD vs. healthy subjects [20]; such studies benefit from supervised classification methods. Unlike PCA, which looks for the major metabolic variation in the data, PLS-DA and OPLS-DA look for metabolic variation that is specifically related to the study classes. A major challenge of multivariate analysis is overfitting, i.e., the model fits the data at hand so closely that it does not generalize to new data. Overfitting occurs when the model describes noise in the data. Cross validation can be used to detect and reduce overfitting. In cross validation, the samples are split into a training set and a testing set, and the model built on the training set is evaluated on the testing set, as was done in several of the studies here [6, 16, 19, 20, 25]. A limitation of cross validation is that all of the samples used for validation come from the same study. The most rigorous validation is external validation, i.e., validation performed in an independent data set. Montuschi et al. provided good examples of using external validation to identify CF-related biomarkers [19, 20]. When an independent data set is not available, permutation-based validation can be used as an alternative, although it was not used in any of the referenced CF studies. In permutation-based validation, the predictive power of the model is compared with the predictive power obtained from hundreds or thousands of data sets in which the class labels have been permuted. Using it in combination with cross validation provides a more reliable validation result than cross validation alone.
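The combination of cross validation and permutation-based validation can be sketched as follows, in Python with the standard library only. This is an illustration under stated assumptions, not the procedure of any referenced study: a simple nearest-centroid classifier stands in for PLS-DA or OPLS-DA, the data are synthetic, and all function names and parameter values are hypothetical.

```python
import random


def nearest_centroid_predict(train_X, train_y, test_X):
    """Assign each test sample to the class with the closest training mean.
    This simple classifier is a stand-in for PLS-DA (an assumption)."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda c: sq_dist(x, centroids[c]))
            for x in test_X]


def cv_accuracy(X, y, folds):
    """Fraction of samples classified correctly while held out."""
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = nearest_centroid_predict(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(y)


def permutation_test(X, y, n_folds=5, n_permutations=200, seed=0):
    """Compare the cross-validated accuracy with label-permuted accuracies."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    # The same folds are reused for the real and the permuted labels.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    observed = cv_accuracy(X, y, folds)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)  # break the link between data and classes
        if cv_accuracy(X, shuffled, folds) >= observed:
            at_least_as_good += 1
    # Adding 1 to both counts keeps the p value above zero.
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, p_value


def make_synthetic(n_per_class=20, n_features=30, n_informative=5,
                   shift=3.0, seed=1):
    """Synthetic data: class 1 is shifted in the first few features."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


X, y = make_synthetic()
accuracy, p_value = permutation_test(X, y)
print(f"cross-validated accuracy = {accuracy:.2f}, "
      f"permutation p = {p_value:.3f}")
```

If the model captures a real class difference, few or none of the permuted data sets should reach the observed cross-validated accuracy, and the permutation p value is small. If the observed accuracy is no better than what permuted labels typically give, the apparent class separation is likely due to overfitting.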
It is important to choose an appropriate statistical method based on the purpose of the analysis and the statistical assumptions. The statistical methods and their assumptions are summarized in Table 1. The results of univariate analysis and multivariate analysis often complement each other. Examining results from both approaches allows researchers to look at the data from different perspectives and to extract as much information as possible from the data.
Table 1
Common statistical approaches used in metabolomics data analysis
| Category | Method | Purpose | Assumptions |
| --- | --- | --- | --- |
| A. Methods that analyze each metabolite separately | | | |
| Parametric methods | Paired t test | Compare two groups | Random sampling, normality, paired samples, no major outliers |
| | Student t test | Compare two groups | Random sampling, normality, independent samples, equal variances, no major outliers |
| | Welch t test | Compare two groups | Random sampling, normality, independent samples, variances may be unequal, no major outliers |
| | Linear model | Compare two or more groups, with the possibility to control for confounders | Random sampling, linearity and additivity, errors are independent, homoscedastic, and normally distributed, no major outliers |
| Nonparametric methods | Wilcoxon signed rank test | Compare two groups | Random sampling, paired samples, differences between paired samples have a symmetrical distribution |
| | Mann-Whitney U test | Compare two groups | Random sampling, independent samples |
| | Kruskal-Wallis ANOVA | Compare more than two groups | Random sampling, independent samples |
| B. Methods that analyze all of the metabolites simultaneously | | | |
| Unsupervised classification methods | PCA | Detect major patterns in the data, detect outliers | Linearity |
| Supervised classification methods | PLS-DA | Find metabolites that best separate two or more study groups | Linearity, no major outliers |
| | OPLS-DA | Find metabolites that best separate two or more study groups, with easier result interpretation than PLS-DA | Linearity, no major outliers |