Background
The STRATOS initiative and the STRATOS topic group TG9 “High-dimensional data”
Study design
Methods
Structure of the paper
Section | Analytical goals | Common approaches | Examples |
---|---|---|---|
IDA | Initial data analysis and preprocessing | ||
IDA1 | Identify inconsistent, suspicious or unexpected values | Visual inspection of univariate and multivariate distributions | Histograms, boxplots, scatterplots, correlograms, heatmaps |
IDA2 | Describe distributions of variables, and identify missing values and systematic effects due to data acquisition | Descriptive statistics, tabulation, analysis of control values, graphical displays | Measures for location and scale, bivariate measures, RLE plots, MA plots, calibration curve, PCA, Biplot |
IDA3 | Preprocess the data | Normalization, batch correction | Background correction, baseline correction, centering and scaling, quantile normalization, ComBat, SVA |
IDA4 | Simplify data and refine/update analysis plan if required | Recoding, variable filtering and exclusion of uninformative variables, construction of new variables, removal of variables or observations due to missing values, imputation | Collapsing categories, variable filtering, discretizing continuous variables, multiple imputation |
EDA | Exploratory data analysis | ||
EDA1 | Identify interesting data characteristics | Graphical displays, descriptive univariate and multivariate statistics | PCA, Biplot, multidimensional scaling, t-SNE, UMAP, neural networks |
EDA2 | Gain insight into the data structure | Cluster analysis, prototypical samples | Hierarchical clustering, k-means, PAM, scree plot, silhouette values |
TEST | Identification of informative variables and multiple testing | ||
TEST1 | Identify variables informative for an outcome | Test statistics, modelling approaches | t-test, permutation test, limma, edgeR, DESeq2 |
TEST2 | Perform multiple testing | Multiple tests, control for false discoveries | Bonferroni correction, Holm’s procedure, multivariate permutation tests, Benjamini-Hochberg (BH), q-values |
TEST3 | Identify informative groups of variables | Tests for groups of variables | Gene set enrichment analysis, over-representation analysis, global test, topGO |
PRED | Prediction | ||
PRED1 | Construct prediction models | Variable transformations, variable selection, dimension reduction, statistical modelling, algorithms, integrating multiple sources of information | Log-transform, standardization, superPC, ridge regression, lasso regression, elastic net, boosting, SVM, trees, random forest, neural networks, deep learning |
PRED2 | Assess performance and validate prediction models | Choice of performance measures, internal and external validation, identification of influential points | MSE, MAE, ROC curves, AUC, misclassification rate, Brier score, calibration plots, deviance, subsampling, cross-validation, bootstrap, use of external datasets |
Results
IDA: Initial data analysis and preprocessing
IDA1: Identify inconsistent, suspicious or unexpected values
Histograms

Histograms divide the range of values into intervals and then count how many values fall into each interval. They can be useful to visualize the shape of the data distribution and identify outlying points. Sometimes use of a transformation before plotting will improve visualization by providing better resolution of densely packed values and drawing more extreme values closer to the main body of the distribution.
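As a minimal sketch (numpy only, simulated right-skewed data), the effect of a log transformation before binning can be seen directly in the bin counts: on the raw scale one bin dominates, while on the log scale the counts are spread more evenly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed, densely packed values with a few extremes (simulated log-normal data)
values = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Histogram on the raw scale: most values crowd into the first few bins
raw_counts, _ = np.histogram(values, bins=20)

# Log-transform before binning: spreads out the dense region and pulls
# extreme values closer to the main body of the distribution
log_counts, _ = np.histogram(np.log(values), bins=20)

# Fraction of values in the single fullest bin, before and after transforming
raw_peak = raw_counts.max() / raw_counts.sum()
log_peak = log_counts.max() / log_counts.sum()
```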
Boxplots

A boxplot (also called box-and-whisker plot) is a graphical display that gives a quick impression of the location and spread of data values and thus makes the comparison between variables simpler. A central box indicates the values that include the central 50% of the data (interquartile range), the median is indicated by a line within the box, and the lines extending vertically from the box (whiskers) indicate the range of all values that are not further than 1.5 times the interquartile range from the edges of the box. In addition, a commonly used option is to plot points individually that are outside the range indicated by the whiskers. When boxplots are used to display variables with many values (like the expression values of all genes within an experiment), it is expected that many values fall into this category, and plotting them individually can create the impression of many extreme values.
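The quantities a boxplot displays can be computed directly; the following sketch (simulated data, numpy only) derives the box, the whisker ends, and the points that would be plotted individually.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data with two planted extreme values
x = np.concatenate([rng.normal(size=200), [8.0, -6.0]])

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the central box

# Whiskers extend to the most extreme data points within 1.5 * IQR of the box
lower = x[x >= q1 - 1.5 * iqr].min()
upper = x[x <= q3 + 1.5 * iqr].max()

# Points beyond the whiskers are drawn individually
flagged = x[(x < lower) | (x > upper)]
```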
Scatterplots

Scatterplots display one variable plotted against another, with each axis corresponding to one of the two variables. Both variables may be observed (e.g., expression of one gene against expression of a different gene), or one of the two variables could be a factor such as time, order of entry into study, or order in which a measurement such as an assay was conducted. Plotted points may represent the values of two variables for each of the study subjects, or each point could represent one of many different variables measured on an individual subject. For HDD, plots in which each point represents a different variable may contain an extremely large number of points making them hard to interpret due to many overlapping plotting symbols. Strategies such as use of semi-transparent colors for the plotted points or density plots, where regions with more observations appear darker in the plot, may be necessary. Another strategy is to randomly sample points to create a subset that provides a less dense plot.
Correlograms

A correlogram (or corrgram) is a graphical representation of the correlation matrix [26], depicting patterns of relations among variables directly from the correlation matrix. In a correlogram, the values are rendered to depict the sign and magnitude of each correlation. Further, the variables can be reordered such that similar variables are positioned adjacently, in order to facilitate the perception of the relations. Since correlograms visualize correlation matrices, they are only useful for LDD, i.e., if the number of variables is not too large. Of course, the correlations themselves can be computed from high-dimensional vectors. Figure 1 [27] shows an example of a correlogram.
Heatmaps

A common two-dimensional visualization method is a heatmap [28] where the individual values contained in a data matrix are represented as colors in boxes of a rectangular grid. Sometimes raw data values are centered or scaled within rows or columns prior to display, which can be particularly helpful when rows or columns represent variables having different ranges or measurement scales. Clear description of any such centering and/or scaling is essential for proper interpretation of the figure. Choice of color-palette and ordering of rows and columns can both heavily influence the information conveyed by the graphical display. Complementary colors (e.g., red and green, blue and orange) can be used to emphasize two sides of a centered scale. Examples include many published heatmaps for gene expression microarray data in which shades of red might represent degrees of overexpression (relative to median or mean) of a gene and shades of green could represent underexpression. Another consideration for a heatmap display is the ordering of the rows and columns. Sometimes there is an ordering of the observations based on experimental design, for example, samples collected in a time course experiment are represented as ordered columns in the heatmap. As a quality check, it can be useful to order columns by sequence in which samples were assayed. Unexpected trends may indicate assay drift or batch effects. If rows correspond to factors such as gene transcript or protein levels, it can be illuminating to order them according to similarity of pattern across observations. Various clustering methods can be applied to construct orderings of observations or variables. These orderings might be illustrated by dendrograms, which can be displayed along axes of the heatmap to depict the distance structure (see section “EDA2.1: Cluster analysis” for discussion of clustering methods). Figure 2 [29] shows an example of a heatmap.
IDA2: Describe distributions of variables, and identify missing values and systematic effects due to data acquisition
Measures for location and scale

As a measure of location, the mean is standard for continuous data, the median is robust to extreme values, and the mode is often used for categorical data. Such measures can be extended to higher dimensions by calculating them component-wise, i.e., for every variable separately, and then collecting the values into a vector.

As a measure of scale, the standard deviation for continuous data and the median absolute deviation from the median (MAD) as a robust counterpart are often used. The coefficient of variation scales the standard deviation by dividing by the mean and is helpful for comparing variables that are measured on different scales.
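The component-wise extension of these measures is straightforward; a minimal sketch with simulated data (numpy only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 4))  # 50 observations, 4 variables

# Component-wise location: one value per variable (column)
col_means = X.mean(axis=0)
col_medians = np.median(X, axis=0)

# Component-wise scale
col_sd = X.std(axis=0, ddof=1)
# MAD: median absolute deviation from the column median
# (often multiplied by 1.4826 for consistency with the SD under normality)
col_mad = np.median(np.abs(X - col_medians), axis=0)
# Coefficient of variation: scale relative to location
col_cv = col_sd / col_means
```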
Bivariate measures

Bivariate descriptive statistics are based on pairs of variables; often the correlation coefficient is used to quantify the relationship between two variables. The classical Pearson correlation coefficient captures only linear relationships, whereas Spearman’s rank-based correlation coefficient may more effectively capture strong non-linear, but monotonic, relationships.
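The difference between the two coefficients is easy to demonstrate with a perfectly monotonic but non-linear relationship (simulated data; Spearman computed as the Pearson correlation of the ranks, assuming no ties):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.1, 4.0, size=200)
y = np.exp(x)  # strongly non-linear, but perfectly monotonic, relationship

pearson = np.corrcoef(x, y)[0, 1]

def ranks(a):
    # rank transform (no ties occur in this simulated example)
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(len(a))
    return r

# Spearman correlation = Pearson correlation of the ranks
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
```

For the monotonic relationship above, Spearman's coefficient is 1 while Pearson's is noticeably smaller.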
RLE plots

Relative log expression (RLE) plots [32] can be used for visualizing and detecting unwanted variation in HDD. They were developed for gene expression microarray data but are now popular especially for the analysis of single-cell expression data. For each variable (e.g., expression of a particular gene), first, its median value across all observations is calculated. Then, the median is subtracted from all values of the corresponding variable. Finally, for each observation, a boxplot is generated of all deviations across the variables. Comparing the boxplots, if one of them looks different with respect to location or spread, this may indicate a problem with the data from that observation. RLE plots are particularly useful for assessing the effects of normalization methods that are applied for removing unwanted variation, which might be due to, e.g., batch effects; see also section “IDA3.2: Batch correction.” An example RLE plot is presented in Fig. 3 [32].
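A minimal numpy sketch of the RLE computation, with one simulated sample carrying an artificial shift (all data illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# 100 genes (rows) x 6 samples (columns); sample 5 has a shifted distribution
X = rng.normal(size=(100, 6))
X[:, 5] += 2.0  # simulated unwanted per-sample shift, e.g. a technical artifact

# RLE: subtract each gene's median across samples ...
deviations = X - np.median(X, axis=1, keepdims=True)

# ... then summarize the deviations per sample (one boxplot per sample);
# here we just look at the per-sample medians of the deviations
sample_medians = np.median(deviations, axis=0)
```

In an RLE plot the boxplot for the shifted sample stands out from the others, whose boxes are centered near zero.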
MA plots (Bland–Altman plots)

A natural way to assess concordance between measurements that are supposed to be replicates is to construct a simple scatterplot and look for deviations from the 45-degree line. However, a preferred approach is to construct a Bland–Altman plot [33] instead of a scatterplot. In the omics literature, this plot is often referred to as an MA plot [34]. The horizontal (“x”) axis of a Bland–Altman plot is the mean of the paired measurements, and the vertical (“y”) axis is their difference, often after the measurements have been log-transformed. The advantage of this plot compared to a traditional scatterplot is that it allows better visualization of differences against a reference horizontal line at height zero and improved ability to detect changes in the variability (spread) of those differences moving along the x-axis (see section “IDA4: Simplify data and refine/update analysis plan if required”). An example of a Bland–Altman plot is presented in Fig. 4 [35].
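The MA coordinates are simple to compute; the following sketch simulates two noisy replicate measurements (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
true = rng.lognormal(mean=6.0, sigma=1.0, size=500)  # simulated expression levels
rep1 = true * rng.lognormal(0.0, 0.1, size=500)      # two noisy replicate
rep2 = true * rng.lognormal(0.0, 0.1, size=500)      # measurements of the same samples

log1, log2 = np.log2(rep1), np.log2(rep2)

# Bland-Altman / MA coordinates: mean on the x-axis, difference on the y-axis
A = (log1 + log2) / 2.0
M = log1 - log2

# For true replicates, M should scatter around the reference line at zero
mean_M = M.mean()
```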
Calibration curve

A typical calibration process for a single-analyte assay might involve running a series of reference standard samples with known values of the target analyte, followed by construction of a calibration curve. This curve can then be inverted to produce a mathematical correction that is applied to the raw measured values from the test samples. A multiplicative factor applied to all raw assay values is a simple example of such a correction.

In the setting of HDD such as omics data, it would be infeasible to construct a separate calibration curve for every analyte measured by the assay. Instead, calibration approaches used for omics assays typically rely on corrections derived either from a small subset of the analytes measured by the assay platform or from assumptions about the global distribution of the measured values across all analytes measured. An example of the subset approach in the context of gene expression arrays is the calculation of a mean over a small set of so-called “housekeeper genes”, whose expression levels are expected to be roughly constant across all samples being analyzed. This mean is compared to a specified reference value to generate a multiplicative factor specific to each sample, which is then applied globally across the expression data for all genes measured for the sample. Figure 5 [37] shows several examples of calibration curves.
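The housekeeper-based correction described above can be sketched as follows (simulated data; designating the first ten rows as "housekeeper" genes is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n_genes, n_samples = 200, 5
X = rng.lognormal(mean=4.0, sigma=1.0, size=(n_genes, n_samples))

# Simulate an unknown technical factor per sample (e.g. loading differences)
factors = np.array([1.0, 1.5, 0.7, 2.0, 0.9])
X = X * factors

# Hypothetical "housekeeper" genes: rows 0-9, assumed roughly constant across samples
hk = slice(0, 10)
hk_means = X[hk, :].mean(axis=0)

# Reference value, here the average housekeeper level over all samples
reference = hk_means.mean()

# Sample-specific multiplicative factor, applied globally to all genes of that sample
X_cal = X * (reference / hk_means)
```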
Principal component analysis (PCA)

The basic idea of PCA [38] is to transform the (possibly correlated) variables into a set of linearly uncorrelated variables as follows. The first variable (first principal component) is constructed to capture as much of the total multidimensional variability in the data as possible; the second is uncorrelated with the first and maximizes capture of the residual variability (i.e., variability not already captured by the first principal component), and so on. Each principal component is a linear combination of the original variables. The result is a set of uncorrelated variables of decreasing importance, in the sense that the variables are ranked from the most informative (the first principal component, i.e., the one with the highest variance) to the least informative (the one with the lowest variance). The positions of each observation in the new coordinate system of principal components are called scores, and the loadings indicate how strongly the variables contribute to each PC. A major portion of the total variation in the data is often captured by the first few principal components alone, which are then the only ones retained for further analysis. Use of principal components can thus greatly reduce the dimension of the data, typically without losing much information (with respect to variability in the data). In the context of IDA, often the first two principal components are plotted to inspect for peculiarities in the data. Figure 6 [39] shows a PCA plot constructed from high-dimensional gene expression profiles generated from analysis of lymphoma specimens.
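A minimal PCA sketch via the singular value decomposition of the centered data matrix (simulated data with more variables than observations, as is typical for HDD):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 100  # more variables than observations
X = rng.normal(size=(n, p))
X[:15, :] += 1.5  # two simulated groups of observations differing across many variables

# Center each variable, then use the SVD to obtain the principal components
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                    # coordinates of the observations (scores)
loadings = Vt.T                   # contribution of each variable to each PC
explained = s**2 / np.sum(s**2)   # proportion of variance per component
```

Plotting the first two columns of `scores` against each other gives the usual PCA inspection plot; here the simulated group structure dominates the first component.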
Biplot

Biplots, introduced by Gabriel [40], are designed to show PCs’ contributions with regard to both observations and variables. In a biplot, both the principal component scores and loadings are plotted together. The most common biplot is a two- or three-dimensional representation, where any two (or three) PCs of interest are used as the axes. Since often most of the variation in the data is explained by the first few PCs, it usually suffices to concentrate on plotting those. The biplot allows identifying samples that are “different” from the majority of samples, and at the same time, it illustrates nicely where these differences occur, i.e., for which variables the samples show different values. Figure 7 [39] shows a biplot for the same data used for the PCA plot.
IDA3: Preprocess the data
Background correction

A classic example of such a step is a background correction applied to data generated from some of the earliest microarrays [41]. In this approach, the signal of interest is obtained by summarizing the pixel intensity values within a designated region or “spot” (e.g., corresponding to location of probe for a particular gene) on a scanned image of a hybridized array. Ideally, pixels for areas outside the spots should have zero intensity, but this is rarely the case because of the fluorescence of the array surface itself. This fluorescence is termed the background. Because background may contaminate the measurement of spot fluorescence, the signal in the spot should be corrected for it by subtracting the fluorescence measured in the background.
Baseline correction

In proteomic mass spectrometry [42], the counterpart of background correction is “baseline correction.” In mass spectrometry, the mass-to-charge ratio (m/z) of molecules present in a sample is measured. A resulting mass spectrum is an intensity vs. m/z plot representing the distribution of proteins in a sample. In this technology, chemical noise is usually present in the spectrum, which is typically caused by chemical compounds such as solvent or sample contaminants that did not originate from the analyzed biological sample. Chemical noise can cause a systematic upward shift of measured intensity values from the true baseline across a spectrum. The presence of baseline noise poses a problem, as the intensity is used to infer the relative abundance of molecules in the analyzed sample. A baseline shift will distort those relative measures; hence, baseline subtraction is typically applied when preprocessing mass spectrometry data.
Centering and scaling

Normalization aimed at addressing between-run differences typically involves re-centering or re-scaling data obtained for a particular run by applying a correction factor that captures the difference between the measurements from that run and measurements from some type of average over multiple runs or from a reference run. The correction factor may be obtained by using internal controls or standards. These can be either analytes known to be present in the sample or analytes added to the sample that should, theoretically, yield the same measurements if the same amount of sample material is measured. If the measured values of internal standards differ across runs, then these internal control or standard values can be used for re-centering or re-scaling purposes.

An alternative approach is to use a run-based estimate of the constant that is calculated across the many measured variables for an individual sample. Examples include re-centering or re-scaling the measurements by their mean value (as in the total ion current normalization of mass spectrometry data), or by an estimate reflecting the amount of processed biological material (as in library size normalization of next-generation sequencing data).

Data preprocessing terminology can be confusing for high-dimensional omics data. Although centering and scaling are often referred to generically as standardization, here, centering and scaling will refer to adjustment of all values of one observation (across variables). Standardization, meaning centering and scaling of all values of a variable (across observations), is described in section “PRED1.1: Variable transformations.”
Quantile normalization

Quantile normalization [43] is a widely used normalization procedure that addresses between-run differences and has been popular for use with omics data. The method assumes that the distribution of measured values across the many analytes measured is roughly similar from sample to sample, with only relatively few analytes accounting for differences in phenotypes (biological or clinical) across samples. Quantiles of the distribution of raw measured values (e.g., across genes) for each sample are adjusted to match a reference distribution, which is obtained either from a reference sample or constructed as some sort of average over a set of samples. Although the numerical quantiles are forced to match, the particular analyte (e.g., pertaining to a certain gene) to which the quantile corresponds can vary from sample to sample, thus capturing the biological differences across samples. Figure 8 [44] shows the effect of quantile normalization.
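A minimal sketch of quantile normalization against an average reference distribution (simulated data; assumes no tied values):

```python
import numpy as np

rng = np.random.default_rng(8)
# genes x samples; simulated sample 2 measured on a shifted, stretched scale
X = rng.normal(size=(500, 3))
X[:, 2] = X[:, 2] * 3.0 + 1.0

# Rank of each value within its sample (column)
ranks = np.argsort(np.argsort(X, axis=0), axis=0)

# Reference distribution: mean of the sorted values across samples
reference = np.sort(X, axis=0).mean(axis=1)

# Replace each value by the reference value of the same rank
X_qn = reference[ranks]
```

After normalization all samples share exactly the same distribution of values, while the within-sample ordering of the genes is preserved.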
ComBat

ComBat is a widely used batch correction method that has been shown to have generally good performance [47]. For each gene, this method estimates location and scale parameters for each batch separately. Then the data are transformed using these parameter estimates so that the location and scale parameters are the same across batches. The method is robust to outliers even in small sample sizes and is thus especially well-suited for HDD analysis. ComBat-Seq [48] is an extension specifically developed for count data using a negative binomial model, and it is compatible with differential expression algorithms that require counts. Figure 9 [49] shows an example of the effect of ComBat, comparing the results with and without using this batch correction.
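The core location/scale idea can be sketched as follows; note this is a plain per-gene, per-batch standardization for illustration only, and omits the empirical Bayes shrinkage of the batch parameters that distinguishes ComBat and makes it robust in small samples:

```python
import numpy as np

rng = np.random.default_rng(9)
n_genes = 300
batch = np.array([0] * 10 + [1] * 10)            # 20 samples in two batches
X = rng.normal(size=(n_genes, 20))
X[:, batch == 1] = X[:, batch == 1] * 2.0 + 3.0  # simulated batch 1: shifted and stretched

X_adj = X.copy()
for b in (0, 1):
    cols = batch == b
    # Per gene: estimate the batch location and scale ...
    mu = X[:, cols].mean(axis=1, keepdims=True)
    sd = X[:, cols].std(axis=1, ddof=1, keepdims=True)
    # ... and transform so both parameters agree across batches
    X_adj[:, cols] = (X[:, cols] - mu) / sd

batch_means = [X_adj[:, batch == b].mean() for b in (0, 1)]
```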
SVA (surrogate variable analysis)

Variability in measurements may arise from unknown technical sources or from biological sources that are not expected or controllable, and can affect the accuracy of statistical inference in genome-wide expression experiments. SVA [50] was developed to deal with such unmeasured factors, which influence gene expression by introducing spurious signal or confounding biological signal. SVA identifies unobserved factors and constructs surrogate variables that can be used as covariates in subsequent analyses to improve the accuracy and reproducibility of the results. SVA was first developed for microarray data and later adapted for sequencing data [51].
IDA4: Simplify data and refine/update analysis plan if required
Collapsing categories

When a categorical variable has substantial imbalance in its distribution across categories, especially when relatively few observations are assigned to a certain category, it can cause instability in analyses: models incorporating such variables can be strongly influenced by the rare categories. To avoid the undue influence of a rare category on the analysis, it may be necessary to accept the loss of information incurred by collapsing the variable, i.e., merging the rare category with another category that is similar in terms of content but more frequent.
Variable filtering

Variable filtering is typically accomplished by calculation of a score for each variable, followed by exclusion of variables having a score below a threshold from further analyses. Modelling or multiple testing procedures can then be applied only to the resulting variable set. However, in order to preserve the correct error control in multiple testing, it is crucial that the filtering is independent of the test statistics that will be used to analyze the filtered data [52]. This is generally accomplished using “nonspecific” filters, where the filtering does not depend on the outcome data. For example, when comparing groups using two-sample t-tests, first removing the variables that exhibit a small difference in the mean values of the classes and then applying the multiple testing corrections to the remaining variables leads to greatly inflated type I errors and overoptimistic multiplicity adjusted p-values. In contrast, type I error is correctly controlled if the filter is based on the overall variance or mean of the variables (combined across both groups), filtering out the variables with small overall variability or low overall expression [52‐54]. Although computationally helpful, filtering that does not inflate errors also does not necessarily increase statistical power; for example, Bourgon et al. [52] showed an example for Affymetrix gene expression data, where filtering out a large proportion of the genes with low expression actually decreased the number of true discoveries.
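A nonspecific variance filter, computed without reference to any outcome or group labels, can be sketched as (simulated data):

```python
import numpy as np

rng = np.random.default_rng(10)
n_genes, n_samples = 1000, 40
X = rng.normal(size=(n_genes, n_samples))
X[:50, :] *= 4.0  # a simulated minority of genes with high overall variability

# Nonspecific filter: score each variable by its overall variance,
# computed WITHOUT using any group labels (outcome)
scores = X.var(axis=1, ddof=1)
threshold = np.quantile(scores, 0.90)  # keep the top 10% most variable genes
keep = scores > threshold

X_filtered = X[keep, :]
```

Because the filter never looks at the outcome, multiple testing corrections applied to `X_filtered` retain their error control.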
Variable filtering is also performed implicitly by some methods that can be used in regression modelling. These methods include the lasso, which will be discussed in the context of prediction modelling in section “PRED: Prediction.”
Discretizing continuous variables

Discretization of a variable refers to the process of converting or partitioning a continuous variable into a nominal or ordinal categorical variable. Often, the variable is discretized into partitions of equal width (e.g., when constructing a histogram) or of equal frequencies (e.g., quartiles). Alternatively, the categorization may be based on historical context, for example if it is known that age above a certain threshold is a risk factor for a specific outcome. However, categorization introduces several problems and is often criticized in LDD [56, 57], especially in its extreme version with only two groups, called dichotomization. This simplification of the data structure often leads to a considerable loss of power, and the use of a data-driven optimal cutpoint for dichotomization of a variable leads to serious bias in prediction models that include the variable.
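Equal-frequency discretization into quartile groups can be sketched as (simulated data, numpy only):

```python
import numpy as np

rng = np.random.default_rng(13)
age = rng.uniform(20, 80, size=400)  # simulated continuous variable

# Equal-frequency discretization: cutpoints at the quartiles
cuts = np.quantile(age, [0.25, 0.5, 0.75])
group = np.digitize(age, cuts)  # ordinal codes 0..3 (Q1..Q4)

counts = np.bincount(group)  # group sizes, equal by construction
```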
Multiple imputation

Multiple imputation is a widely used approach for handling missing data under the MAR scenario. It uses a regression model based on the available variables to predict the missing values. In an iterative fashion, missing values of a specific variable are predicted using a regression model that depends on the other observed variables, and the resulting predicted value is used in the main regression model. To account for the uncertainty in the imputation, multiple imputed datasets are generated and then analyzed, and the results are summarized according to “Rubin’s rule” [61]. Software for multiple imputation is widespread in major statistical packages. As described above, for HDD, a pre-selection of variables is often advisable before applying multiple imputation.

Future directions for HDD analysis include a more detailed look at MAR settings (as all procedures provided so far are fully justified only when the MCAR assumption is tenable), the addition of auxiliary information for specifying the imputation model, and the development of analysis methods that can directly cope with missing values, such as robust PCA and random forests. The best method also depends on the analysis goal, such as cluster analysis or developing a prediction model.
EDA: Exploratory data analysis
EDA1: Identify interesting data characteristics
Multidimensional scaling (MDS)

Multidimensional scaling requires as input a distance matrix whose elements are the distances between all pairs of observations calculated in the original (high-dimensional) space, together with a specification of the lower-dimensional space (often two-dimensional) onto which the data should be projected. A representation of the data points in the lower-dimensional space, called an embedding, is constructed such that the distances between pairs of observations are preserved as much as possible. Functions that quantify the level of agreement between pairwise distances before and after dimension reduction are called stress functions. MDS implements mathematical algorithms to minimize the specified stress function.

Classical multidimensional scaling was first introduced by Torgerson [63]. Mathematically, it uses an eigenvalue decomposition of a transformed distance matrix to find an embedding. Torgerson [63] set out the foundations for this work, but further developments of the technique, associated with the name principal coordinates analysis, are attributed to Gower [64]. While classical multidimensional scaling uses an eigenvector decomposition to embed the data, non-metric multidimensional scaling (nMDS) [65] uses optimization methods.
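Classical multidimensional scaling can be sketched directly from its definition, double-centering the squared distance matrix and taking the top eigenvectors (simulated data, numpy only):

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(20, 50))  # 20 observations in a 50-dimensional space

# Input to MDS: matrix of pairwise Euclidean distances
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Classical MDS: double-center the squared distances ...
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D**2) @ J

# ... then embed using the top eigenvectors of B (largest eigenvalues first)
eigval, eigvec = np.linalg.eigh(B)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

k = 2  # dimension of the embedding, typically two for plotting
embedding = eigvec[:, :k] * np.sqrt(eigval[:k])
```

Using all positive eigenvalues reproduces the original Euclidean distances exactly; the two-dimensional embedding keeps only the directions that preserve them best.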
T-Distributed Stochastic Neighbor Embedding (t-SNE)

Some newer approaches to derive lower-dimensional representations of data avoid the restriction of PCA that the new coordinates must be linear transformations of the original ones. One popular approach is t-SNE [66], which is a variation of Stochastic Neighbor Embedding (SNE) [67]. It is the most commonly used technique in single-cell RNA-Seq analysis. t-SNE explicitly optimizes a loss function, minimizing the Kullback–Leibler divergence between the distributions of pairwise similarities between observations (subjects) in the original space and in the low-dimensional space. PCA plots, which are typically based on the first two or three principal component scores, focus on preserving the distances between data points widely separated in high-dimensional space, whereas t-SNE aims to provide representations that preserve the distances between nearby data points. This means that t-SNE reduces the dimensionality of the data mainly based on local properties of the data. t-SNE requires the specification of a tunable parameter known as “perplexity”, which can be interpreted as a guess for the number of effective neighbors (the number of neighbors that are considered close). Figure 10 shows the result of t-SNE on a dataset with eight classes.
Uniform manifold approximation and projection (UMAP)

t-SNE has been shown to efficiently reveal local data structure and has been widely used for identifying subgroups of populations in cytometry and transcriptomic data. However, it has some limitations. It does not preserve the global structure of the data well, i.e., relations between observations that are far apart are not captured well by the low-dimensional representation. A further drawback is the large computation time for HDD, especially for very large sample size n. A newer approach called uniform manifold approximation and projection (UMAP) [68] overcomes some of these limitations by using a different probability distribution in high dimensions. In particular, the construction of the initial neighborhood graph is more sophisticated, e.g., by incorporating weights that reflect uncertainty. In addition, UMAP directly uses the number of nearest neighbors as tuning parameter instead of the perplexity, thus making tuning more transparent. On real data, UMAP has been shown to preserve as much of the local structure as t-SNE and more of the global data structure, with more reproducible results and shorter run time [69].
Neural networks

Neural networks provide another way to identify non-linear transformations to obtain lower-dimensional representations of HDD, which in many cases outperform simple linear transformations [70]. The concept is briefly described in section “PRED1.5: Algorithms” in the context of reducing the number of variables in preparation for development of prediction models or algorithms. Yet, research is ongoing to determine how best to develop low-dimensional representations and corresponding derived variables, and which of those derived variables might be most suitable depending on their subsequent use for statistical modelling or other purposes.
EDA2: Gain insight into the data structure
Hierarchical clustering
Hierarchical clustering is a popular class of clustering algorithms, mostly in an agglomerative version, where initially all objects are assigned to their own cluster, and then iteratively, the two most similar clusters are joined, representing a new node of the clustering tree [72]. The similarities between the clusters are recalculated, and the process is repeated until all observations are in the same cluster. The distance metric to be used for comparing two individual objects is specified by the researcher. For defining distances between two clusters of objects, there are also several options. In hierarchical clustering, the approach for measuring between-cluster distance is referred to as the linkage method. Single linkage specifies the distance between two clusters as the closest distance between the objects from two clusters; average linkage calculates the mean of those distances, and complete linkage specifies the largest distance. Single linkage has the disadvantage that it tends to generate long thin clusters, whereas complete linkage tends to yield clusters that are more compact, and average linkage typically produces clusters with compactness somewhere in between. Hierarchical clustering results are often displayed in a tree-like structure called a dendrogram. A dendrogram is viewed from the bottom up, with each object beginning in its own cluster as the terminal end of a branch and eventually being merged with other objects as clusters are formed climbing up the branches of the tree toward the root where all objects are combined into one cluster. The heights in the tree at which the clusters are merged correspond to the between-cluster distances. Cutting the tree at a particular height defines a number of clusters. Although the hierarchical structure displayed in the dendrogram may seem appealing, it should be interpreted with caution as there can be substantial information loss incurred as a result of enforcing a flattened tree structure. 
Figure 11 [73] shows an example of a dendrogram resulting from hierarchical clustering.
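The agglomerative scheme described above can be sketched for one-dimensional data (a toy illustration with a hypothetical `agglomerate` helper; real analyses would use dedicated implementations such as `hclust` in R or `scipy.cluster.hierarchy`):

```python
def agglomerate(points, linkage="average"):
    """Merge clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]          # each object starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # all pairwise distances between members of the two clusters
                d = [abs(a - b) for a in clusters[i] for b in clusters[j]]
                if linkage == "single":
                    dist = min(d)             # closest pair of objects
                elif linkage == "complete":
                    dist = max(d)             # farthest pair of objects
                else:                         # average linkage
                    dist = sum(d) / len(d)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((sorted(clusters[i] + clusters[j]), dist))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                       # j > i, so deletion is safe
    return merges

# the two closest objects (5.0 and 5.2) are merged first
merges = agglomerate([1.0, 1.5, 5.0, 5.2], linkage="single")
```

The recorded merge heights are exactly what a dendrogram displays on its vertical axis.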
k-means
A popular partitioning clustering algorithm is k-means [74]. For its traditional implementation, the researcher must specify the number of clusters. First, random objects are chosen as initial centroids for the clusters. Then the algorithm proceeds by iterating between two steps: (i) comparing each observation to the mean of each cluster (centroid) and assigning it to the cluster for which the squared Euclidean distance from the observation to the cluster centroid is minimized, and (ii) recalculating cluster centroids based on the current cluster memberships. The iterative process continues until no observations are reassigned. k-means is not guaranteed to converge to the optimal cluster assignments that minimize the sum of within-cluster variances, and it can be strongly influenced by the selected number of clusters and initial cluster centroids. Nonetheless, it is a relatively simple algorithm to understand and implement and is widely used. Figure 12 [75] visualizes the k-means algorithm with an example.
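The two iterated steps can be sketched for one-dimensional data (illustration only; the initial centroids are fixed by hand here rather than chosen randomly, and empty clusters simply keep their old centroid):

```python
def kmeans(points, centroids, max_iter=100):
    """Minimal sketch of Lloyd's algorithm in one dimension."""
    for _ in range(max_iter):
        # (i) assign each observation to the nearest centroid (squared distance)
        labels = [min(range(len(centroids)),
                      key=lambda k: (p - centroids[k]) ** 2) for p in points]
        # (ii) recompute each centroid as the mean of its current members
        new = []
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            new.append(sum(members) / len(members) if members else centroids[k])
        if new == centroids:          # no centroid moved: converged
            return labels, centroids
        centroids = new
    return labels, centroids

labels, centers = kmeans([1.0, 1.2, 4.8, 5.0], centroids=[0.0, 6.0])
```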
PAM

Several important extensions and generalizations of k-means have been developed. PAM (partitioning around medoids, [76]) allows using arbitrary distances instead of Euclidean distance, and instead of mathematically calculated centroids, actual observations are selected as prototypes of clusters. The algorithm iteratively improves a starting solution with respect to the sum of distances of all observations to their corresponding prototypes, until no improvement can be obtained by replacing one current prototype with another observation.
Scree plots

One traditional approach for estimating the number of clusters is the construction of a scree plot, which involves plotting some measure of within-cluster variation on the y-axis and the number of clusters assumed in applying the algorithm on the x-axis. For hierarchical clustering, which does not require a priori specification of the number of clusters, a similar plot can be constructed by “cutting” the dendrogram at different levels corresponding to a range of numbers of clusters. The optimal number of clusters is determined by visual inspection, as the point where a line connecting the points shows a kink and further increases in the number of clusters yield only a diminished reduction in within-cluster variation. In HDD, noise accumulating over the variables, coupled with the lack of any guarantee that applications of the algorithms identify the optimal clusterings, may lead to scree plots that fail to reveal a strong indication for the number of clusters. Figure 13 [80] shows such a typical scree plot.
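The quantity plotted on the y-axis can be computed as follows (a toy sketch in which the cluster assignments for each number of clusters are fixed by hand; in practice they would come from re-running a clustering algorithm for each candidate number of clusters):

```python
def wss(points, labels):
    """Total within-cluster sum of squares for a given cluster assignment."""
    total = 0.0
    for k in set(labels):
        members = [p for p, l in zip(points, labels) if l == k]
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return total

points = [1.0, 1.2, 4.8, 5.0, 9.0, 9.4]
# within-cluster variation for 1, 2, and 3 clusters (assignments fixed by hand)
curve = [wss(points, [0, 0, 0, 0, 0, 0]),
         wss(points, [0, 0, 0, 0, 1, 1]),
         wss(points, [0, 0, 1, 1, 2, 2])]
# the curve drops steeply up to the true number of clusters, then flattens
```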
Silhouette values

Silhouette values are numerical tools for estimating the number of clusters [81]. The silhouette value of a single observation measures how well the observation fits its assigned cluster by comparing its average similarity to members of its own cluster with the average similarity to the next best cluster. It is scaled such that the value 1 corresponds to an optimal fit (similarities to members of its own cluster extremely large compared to the next best cluster) and −1 to the worst case (similarities to members of its own cluster extremely small compared to the best other cluster). The average silhouette width (asw) is then defined as the average of all individual silhouette values and quantifies the quality of the clustering result. The asw requires no distributional assumptions for the data. In contrast, when using distribution-based clustering, typically so-called information criteria are required for selecting the number of clusters. These balance the coherence of the clusters (as large as possible) against the number of clusters (as small as possible). Figure 14 [82] shows a silhouette plot that visualizes the silhouette values of observations that were grouped into four clusters.
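For small one-dimensional data, silhouette values and the asw can be computed directly (an illustrative sketch using absolute differences as distances):

```python
def silhouette(i, points, labels):
    """Silhouette value of observation i: (b - a) / max(a, b), where a is the
    mean distance to the other members of its own cluster and b the smallest
    mean distance to any other cluster (the 'next best' cluster)."""
    own = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    a = sum(abs(points[i] - points[j]) for j in own) / len(own)
    b = min(
        sum(abs(points[i] - points[j]) for j in range(len(points))
            if labels[j] == lab) / labels.count(lab)
        for lab in set(labels) if lab != labels[i]
    )
    return (b - a) / max(a, b)

points = [1.0, 1.2, 5.0, 5.2]
labels = [0, 0, 1, 1]
# average silhouette width: close to 1 for this well-separated clustering
asw = sum(silhouette(i, points, labels) for i in range(len(points))) / len(points)
```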
TEST: Identification of informative variables and multiple testing
TEST1: Identify variables informative for an outcome
t-test
The t-test is a standard test for comparing the means of two groups for continuous outcomes (e.g., blood pressure or tumor size after therapy for a treatment and a control group, or expression values of a gene for two patient groups with different diseases). The null hypothesis is that the true difference between the group means is 0, and the alternative hypothesis is that it is not 0 (two-sided testing). The t-statistic underlying the usual t-test equals the ratio of the observed mean difference and a pooled standard error of both groups. It is important to note that the validity of a statistical test depends on assumptions that should be checked. For this t-test, the assumptions include independence of the observations, approximate normal distribution of the variable in each group, and similar variance of the variable irrespective of group. t-tests tend to be sensitive to outliers, and in such situations, alternative nonparametric tests may be preferred. Extensions include the Welch test, if group variances are not assumed equal, and one-way ANOVA (analysis of variance), when more than two groups are compared.
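The t-statistic just described can be computed directly (a sketch assuming equal variances, hence the pooled estimate; in practice one would use a statistics package):

```python
import math

def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance (equal-variance t-test)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)      # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    se = math.sqrt(sp2 * (1 / nx + 1 / ny))            # pooled standard error
    return (mx - my) / se                              # mean difference / SE

t = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])   # negative: first mean smaller
```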
Permutation test

The idea behind a permutation test is to scramble the data to mimic a null hypothesis situation in which a variable is not associated with a particular outcome or phenotype. For the simple example of comparing the distribution of a variable between two phenotype classes, a permutation test would randomly scramble or re-assign class labels to the collection of observations. For each data permutation, the test statistic is calculated and recorded. After this statistic has been calculated on many permuted versions of the data, a p-value can be computed as the proportion of permutations for which the calculated test statistic was as extreme as or more extreme than the test statistic calculated on the original data.
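A minimal sketch of this procedure, using the absolute difference in group means as the test statistic (the add-one correction to avoid zero p-values is an assumption beyond the text, but is common practice):

```python
import random

def perm_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    obs = abs(sum(x) / len(x) - sum(y) / len(y))   # observed statistic
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                        # scramble the class labels
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(sum(px) / len(px) - sum(py) / len(py)) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)               # add-one correction (assumption)

p_far = perm_test([1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0])  # small p
p_same = perm_test([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])     # p = 1
```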
Limma

Linear Models for Microarray Data (limma), developed by Smyth and colleagues [95, 96] and implemented in the R package limma, was designed to address several challenges of multiple testing for HDD. Limma offers a unifying, statistically based framework for multiple testing that uses empirical Bayes shrinkage methods in the context of linear models. Initially popularized in the context of traditional gene expression analysis with microarrays, limma is based on normal distribution theory. It evolved from a procedure to modify t-statistics by “borrowing information” across variables to improve variance estimation and increase statistical power. Limma provides a way to balance the need for small type I errors for testing individual variables in HDD settings against statistical power to identify true discoveries. Designs more complex than simple two-group comparisons are easily accommodated by limma’s linear model framework. Although it was developed originally to identify differentially expressed genes for normalized measurements from microarrays, it has also been used successfully for the analysis of data generated by other omics technologies, e.g., proteomics [97].
For simplicity of explanation, the focus of discussion here is how limma works in the context of simple two-group comparisons as an extension of the familiar t-test. Limma relies on the concept of borrowing information across a collection of similar variables (e.g., expression levels for the thousands of genes measured on a microarray). Many omics studies have relatively small sample size compared to the number of variables, so the idea of borrowing information across a very large number of variables is very attractive. If one can assume that the true variances across the many variables follow some overarching distribution, then variance estimates for individual variables that are imprecise due to small sample size can be made more precise by shrinking them toward a variance estimate that is pooled from all variables. The amount of shrinkage depends on the distribution estimated (empirical Bayes) or assumed (Bayes) for the true variances. Limma is based on an empirical Bayes approach that assumes normally distributed variables and shrinks the individual variances toward the mean of the estimated distribution of true variances.
Out of this empirical Bayes framework comes the moderated t-statistic, which is similar in form to the usual t-statistic, but with an adjusted estimate of standard deviation for each variable that has been shrunk toward the mean of the distribution of variances, replacing the usual sample standard deviation estimate in the denominator. These shrunken estimates are more precise, as reflected in larger degrees of freedom achieved by “gathering strength” across the many variables, resulting in higher statistical power to identify true discoveries.
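The shrinkage underlying the moderated t-statistic can be sketched as follows. This is a simplified illustration, not the limma implementation: the prior variance `s2_0` and prior degrees of freedom `d0` would be estimated from all genes by limma's empirical Bayes step, but here they are supplied by hand, and equal group sizes with equal variances are assumed:

```python
import math

def moderated_t(mean_diff, s2_g, n_per_group, s2_0, d0):
    """Sketch of a moderated t-statistic: the gene-wise variance s2_g is
    shrunk toward the prior variance s2_0, weighted by the prior degrees
    of freedom d0 and the residual degrees of freedom d_g."""
    d_g = 2 * n_per_group - 2                            # residual df, two groups
    s2_tilde = (d0 * s2_0 + d_g * s2_g) / (d0 + d_g)     # shrunken variance
    se = math.sqrt(s2_tilde * (2 / n_per_group))
    return mean_diff / se

t_plain = moderated_t(1.0, 1.0, 3, 4.0, 0)    # d0 = 0: ordinary t-statistic
t_shrunk = moderated_t(1.0, 1.0, 3, 4.0, 4)   # variance pulled toward 4.0
```

Shrinking toward a larger prior variance decreases the statistic, while shrinking a noisily large gene-wise variance toward a smaller prior would increase it.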
An additional advantage of limma is the complexity of experimental designs that it can handle. Many extensions beyond two-class comparison problems can be accommodated by the linear model framework. Comparisons can be made between more than two classes, including linear contrasts, for example to assess linear trends in means across classes. In addition, limma offers a powerful set of tools to address a broad range of experimental settings in which data can be reasonably represented by a Gaussian linear model. Included in the limma framework are factorial designs, which consist of two or more factors with levels (discrete possible values), for which all combinations across the factors are investigated. This allows the analysis of main effects and interactions between variables.
The evolution of technologies for gene expression analysis from microarrays to sequencing-based approaches such as RNA-Seq presented new statistical challenges for HDD analysis. Gene expression measurements generated by these newer technologies are typically count data rather than continuous intensity values as for microarray technologies. Count data are generally not compatible with assumptions of normally distributed data on which limma relies. For example, RNA-Seq measures the number of reads (DNA fragments) that map to specific genomic locations or features represented on a reference genome. Two extensions to limma were developed to address gene expression measurements expressed as counts. Limma-trend shrinks the gene-wise variances of the log-transformed count values toward a global mean–variance trend curve. Limma-voom extends this idea further by also taking into account global differences in counts between samples, for example due to different sequencing depths.
Several other methods to analyze count data were developed independently of the limma extensions, founded on negative binomial models to characterize the distribution of count data. The negative binomial distribution includes the Poisson as a special case and is generally preferred in the setting of modern gene expression analysis. It offers greater flexibility for modelling variances of counts, particularly when those counts are not large or when the number of replicates for each biological group or condition is not large.
edgeR

The edgeR procedure [98] assumes that the read count for a particular genomic feature follows a negative binomial (NB) distribution. Although a genomic variable of interest need not correspond exactly to a gene, in the following the term gene is used for simplicity of discussion. Much of the discussion is framed in terms of gene expression count data arising from RNA-Seq measurements, but the developers note that the methods implemented in edgeR also apply more generally to count data generated by other omics technologies, including ChIP-Seq for epigenetic marks and DNA methylation analyses.
The measured count for gene g in sample i is assumed to follow a NB distribution with mean equal to the library size for that sample (total number of DNA fragments generated and mapped) multiplied by a parameter representing the relative abundance of gene g in the experimental group j to which sample i belongs. The variance of the count for a specific gene based on the NB distribution is assumed to be a function of the mean and a dispersion parameter; specifically, the variance is modeled as the sum of the technical variation and the biological variation. Technical variation for gene expression and other types of omics count data can usually be adequately modeled as a Poisson variable, but biological variability between samples leads to additional variability. To incorporate this additional variability, an “overdispersion” term is introduced into the variance. Specifically, the variance of a count is modeled as the mean multiplied by the sum of one and the mean multiplied by a term that represents the squared coefficient of variation of biological variation between samples. This expression reflects a partition of the variance into contributions from technical and biological variation. When there is no biological variation between samples, e.g., when samples are true technical replicate sequencing runs from a single library produced for a sample, this variance reduces to the Poisson variance, which equals the mean. This model provides a flexible and intuitive expression for the variance and incorporates the dependence of the variance on the mean that is expected for count data.
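The variance expression described above can be written down directly (a toy illustration; in edgeR the dispersion, i.e., the squared biological coefficient of variation, is estimated from the data rather than supplied by hand):

```python
def nb_variance(mu, bcv):
    """Variance of a count in the NB model: the mean (technical, Poisson part)
    plus the squared mean times the squared biological coefficient of
    variation (biological part): mu * (1 + mu * bcv**2)."""
    return mu * (1 + mu * bcv ** 2)

var_tech = nb_variance(100.0, 0.0)   # no biological variation: Poisson, var = mean
var_bio = nb_variance(100.0, 0.4)    # biological part dominates at high counts
```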
Using an empirical Bayes approach similar in flavor to that described for limma, the edgeR procedure borrows information across genes to shrink the gene-specific dispersion parameters toward a model describing the distribution of dispersion parameters. The simplest model is one in which all genes share a common dispersion parameter, which can be estimated from the data. Allowing greater flexibility, dispersion parameters can be modeled as a smooth trend as a function of the average read count for each gene. To allow for further gene-specific reasons for variation in the count of a gene, empirical Bayes methods are employed to estimate weighted averages that combine gene-specific dispersion estimates with those arising from dispersion models, in this way “shrinking” gene-specific dispersion estimates toward the overall model.
The edgeR software allows the user to compare gene expression between groups when there are replicate measurements in at least one of the groups and more generally when the group mean structure can be expressed as a linear model. Scientific questions of interest can be framed in terms of inferences about the relative abundance parameters in the linear model. For example, one might wish to compare the relative abundance of a particular gene transcript in a group of samples taken from cell cultures that had not been exposed to a new drug to that in samples from cultures after exposure to the new drug. There could be interest in examining the pattern of change in relative abundance of the gene, sampling from a series of cultures that are exposed to the new drug for differing lengths of time. From the specified linear model and shrunken variance estimates, the edgeR software can perform gene-wise tests of significance, based on likelihood ratio statistics, for any parameters or contrasts of parameters in the mean model.
DESeq2

DESeq2 [99] is another widely used method for differential analysis of count data. Its performance is comparable to that of edgeR in terms of false discovery control and statistical power to detect differentially expressed genes. It also uses a negative binomial model for the counts, with a variance expression that incorporates a dispersion parameter, as described for edgeR. Dispersion parameters are modeled across genes as a smooth curve depending on average gene expression strength. Using empirical Bayes methods, gene-specific dispersion parameters are shrunk toward the curve by an amount dependent on how close the individual dispersion estimates tend to be to the fitted smooth curve and on the sample size (through the degrees of freedom).

A feature of DESeq2 that distinguishes it from other methods is the incorporation of shrinkage into the estimation of mean parameters. Shrinkage of mean parameters, e.g., fold-changes, has appeal because researchers tend to find larger effects more convincing. Genes that attain statistically significant effects but exhibit small effect sizes are frequently manually filtered out due to concern that the significance could be due to random experimental noise. The shrinkage of fold-changes implemented by DESeq2 provides a more statistically based approach to address these less reliable findings, which are observed particularly often for genes with small counts. Additional useful features of DESeq2 include options for outlier detection.
TEST2: Multiple testing
Test result | Null hypothesis true | Null hypothesis false | Total |
---|---|---|---|
Rejected | V | U | R |
Not rejected | m0 − V | m1 − U | m − R |
Total | m0 | m1 | m |
Bonferroni correction

The Bonferroni correction specifies that when m statistical tests are conducted, each one should use a critical level of α/m, where α is the desired type I error for the full collection of tests. For example, a Bonferroni correction applied in the setting of 10,000 hypothesis tests would require that an individual test reach statistical significance at a critical level of 0.05/10,000 = 0.000005. Achieving this level of significance would require an extremely large sample size or effect size (e.g., magnitude of association) in order for an individual test to have reasonable power.
Holm’s procedure

Order the p-values from smallest to largest as p(1), p(2), ..., p(m), where m is the number of tests. Beginning with p(1), proceed in order, comparing each p(i) to the critical value α/(m − i + 1). Stop the first time that p(i) exceeds the critical value α/(m − i + 1); call this index j. Declare all p-values p(1), p(2), ..., p(j−1) to be statistically significant.
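The step-down rule above takes only a few lines to implement (returning the indices of the rejected hypotheses):

```python
def holm(pvals, alpha=0.05):
    """Indices of hypotheses rejected by Holm's step-down procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # ascending p-values
    rejected = []
    for rank, i in enumerate(order):                   # rank 0 -> critical alpha/m
        if pvals[i] > alpha / (m - rank):              # critical value alpha/(m-i+1)
            break                                      # stop at the first failure
        rejected.append(i)
    return rejected

rej = holm([0.01, 0.04, 0.03, 0.005])   # rejects the tests with p = 0.005 and 0.01
```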
This procedure controls the FWER to be no more than α. It is clear from comparing the sequential Holm critical values with the fixed Bonferroni critical value that the Holm procedure has the potential to reject more tests and therefore offers greater power, although when the number of tests m is very large, as is often the case in HDD, the actual difference in critical values can be extremely small.
Westfall-Young permutation procedure

The Westfall-Young permutation procedure [104] is a multivariate permutation procedure to control the FWER that is more efficient (powerful) in finding true discoveries than Bonferroni-like procedures (such as the Bonferroni and Holm procedures). It exploits the correlations among variables, which are preserved in the permutation process, since all variables are permuted at the same time. The method is a step-down procedure similar to the Holm method. After p-values are calculated for all variables and ranked, the class labels are permuted many times and the corresponding p-values are calculated. Then the successive minima of these new p-values are retained and compared to the original p-values. For each variable, the proportion of permutations in which the minimum new p-value is smaller than the original p-value gives the adjusted p-value.
Benjamini-Hochberg (BH)

The Benjamini–Hochberg procedure [108] to control the FDR specifies that the ith ordered (smallest to largest) unadjusted p-value is compared to the threshold (α/m)·i, where i is the ranking of the p-value, m is the total number of tests, and α is the desired level of FDR control. Then the largest p-value that is smaller than its threshold is identified, and the corresponding test and all tests with a smaller p-value are considered significant. Alternatively, one can convert the unadjusted p-values to FDR-adjusted p-values, where the adjusted p-value associated with a variable represents the smallest value of FDR at which the procedure would have rejected the test associated with that variable. The intuition behind this correction is linked to the fact that the p-values of null variables for independent tests are uniformly distributed; therefore, the ranked p-values should lie approximately on the line y = i/m. In the presence of true positive variables (non-null hypotheses), one would expect a higher concentration of small p-values, and therefore an excess of p-values falling below the line i/m for lower ranks.
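The BH step-up rule can be implemented directly (returning the indices of rejected hypotheses; note that a p-value failing its own threshold is still rejected when some larger p-value passes its threshold):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (BH procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # ascending p-values
    k = 0                                              # largest rank passing its threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:               # compare p(i) to (alpha/m)*i
            k = rank
    return order[:k]                                   # reject the k smallest p-values

rej = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.09])   # rejects two tests
```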
q-values

Adjusted p-values can also be calculated for FDR-controlling procedures. For a particular variable, the FDR-adjusted p-value is sometimes called a q-value and can be interpreted as the expected proportion of false positives among all variables with test statistics at least as extreme (with smaller adjusted p-values) as the observed value for the variable under examination [110]. Thus, the q-value estimates the FDR that would be obtained if this specific p-value were used as the upper threshold for the inclusion of variables in the list of discoveries. Therefore, q-values do not have an obvious interpretation at the level of a single hypothesis. A related limitation is that the interpretation of the FDR results should be restricted to the complete list of discoveries obtained from the analysis, as the properties of subsets, with respect to what number or proportion of false discoveries they might contain, are not well defined.
TEST3: Identify informative groups of variables
Gene set enrichment analysis (GSEA)

The popular gene set enrichment analysis (GSEA; [118]) and its extensions are considered mixed approaches, as they test whether any of the variable groups is associated with the outcome variable and whether any of the variable groups is enriched in variables associated with the outcome variable. A summary statistic is computed for each variable, a relative enrichment score based on a signed Kolmogorov–Smirnov statistic is calculated for each group, and its significance is evaluated using permutations. The groups with scores above or below a threshold are called enriched, and the false positive rate is evaluated using a permutation procedure that permutes the specimens rather than the variables. Efron and Tibshirani [119] proposed basing the score on a standardized “maxmean” statistic (the standardized maximum of positive and negative summary statistics in each group), thus improving the power of the method.
Over-representation analysis

Over-representation analysis (ORA; [120]) uses a concept similar to that of GSEA. It determines which variable groups are more present (overrepresented) in a subset of a given list of “interesting” variables than would be expected by chance. This can also be applied in situations where GSEA is used, but then instead of the Kolmogorov–Smirnov statistic the hypergeometric distribution is used for determining the significance of the over-representation, and thus a subjective cutoff for the summary statistic must be chosen a priori.
Global test

The global test [121] is based on the estimation of a regression model in which all the variables belonging to the group are included as covariates, and the global null hypothesis that none of the variables is associated with the outcome variable is tested. The method is particularly good at identifying groups containing many variables, each of which might have a relatively small effect.
topGO

The topGO algorithm [122] provides methods for testing specific gene groups defined via the Gene Ontology (GO). The Gene Ontology is a widely recognized comprehensive reference for gene annotations. It assigns genes to GO terms belonging to three main domains: biological processes, molecular functions, and cellular components. The corresponding gene groups (defined according to GO terms) are widely used prespecified groups of variables, often referred to as gene sets. However, when scoring the relevance of GO terms with methods such as those mentioned above, the high redundancy of many terms, which results in many similar groups of variables, makes the list of the most significant groups highly redundant as well. topGO provides algorithms for testing GO terms while accounting for the relationships between the corresponding gene groups. As a result, the final list of the most significant groups better represents the diversity of all significant groups; see Figure 16 [123] for an example result of the topGO algorithm.
PRED: Prediction
PRED1: Construct prediction models
Log-transform

Variables with nonnegative values are frequently encountered in practice and typically have a right-skewed distribution. A logarithmic transformation may be helpful to make the distribution of the data more symmetric. In principle, instead of X, the derived variable log(X) is used as input for prediction modelling [131]. An example in a high-dimensional context is gene expression microarray data, which typically enter a prediction model after being log2 transformed (see, e.g., [43]). Transformations other than the logarithmic one are of course also possible, but are used more rarely.
Standardization

Another variable transformation often performed in high-dimensional contexts is standardization. Here, the variable is centered (the mean of the variable is subtracted from each value) and scaled (each centered value is divided by the standard deviation of the variable). This procedure has advantages from an interpretation point of view. For example, the intercept of a linear model including age would represent a person of average age instead of a hypothetical person of age 0. Further, standardization is crucial for the correct implementation of many regularized methods (e.g., lasso and ridge regression, see section “PRED1.4: Statistical modelling”). Note that standardization can cause problems when applying a prediction model to a new dataset. In this case, one either has to use the correction factors calculated from the original dataset or re-compute them on the new dataset, which is problematic because then individual predictions depend on other observations that happened to be included in the new dataset. Standardization is not mutually exclusive with other transformations, e.g., the logarithmic transformation described above, and thus it is often performed in addition (i.e., after the logarithmic transformation).
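The point about reusing the correction factors from the original dataset can be made concrete (a one-variable sketch; `fit_scaler` and `apply_scaler` are hypothetical helper names):

```python
import math

def fit_scaler(values):
    """Mean and standard deviation estimated on the original (training) data."""
    m = sum(values) / len(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return m, s

def apply_scaler(values, m, s):
    """Center and scale values with the stored correction factors."""
    return [(v - m) / s for v in values]

train = [50.0, 60.0, 70.0]
m, s = fit_scaler(train)            # factors come from the original dataset
z_new = apply_scaler([65.0], m, s)  # new observations reuse them unchanged
```

Reusing `m` and `s` keeps each new prediction independent of which other observations happen to be in the new dataset.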
Supervised principal components (SuperPC)

In SuperPC [55], PCA is conducted on a subset of preliminarily selected variables: first, a variable selection method (see above) is used to reduce the number of prediction variables. The additional step in comparison with standard PCA is thus that the subset of predictors is selected based on their association with an outcome, explaining the name supervised. Then, a classical PCA is performed on the reduced space (i.e., considering only the selected variables). The newly constructed components are then used for prediction.
Ridge regression, lasso regression, and the elastic net

Two of the most commonly used constrained regression methods are ridge regression and the lasso. Interestingly, the problem of minimizing a loss function under particular constraints can be mathematically rewritten as the minimization of the same loss function with an additional penalty term. Consequently, ridge regression estimates the regression coefficients by minimizing the negative log-likelihood (in linear regression this corresponds to the sum of squared errors) plus a penalty term defined as the sum of the squared values of the coefficients. For the lasso, the penalty term is instead the sum of the absolute values of the coefficients. In both cases, the amount of penalty to be added is controlled by a tuning parameter, which must be chosen either by the user or as part of the algorithm (usually by cross-validation).
A nice property of the lasso penalty is that it forces many regression coefficients to be 0, providing implicit variable selection (those predictor variables whose coefficients are estimated to equal 0 are removed from the model). However, compared with ridge regression, the lasso has more difficulty in handling correlations among prediction variables. To take advantage of the strengths of both methods, a solution that combines both penalties has been proposed under the name elastic net [141]. A further tuning parameter (in addition to the one that controls the strength of the penalty) must be chosen to define the balance between the two types of penalty. For extreme values of this parameter, namely 0 and 1, the elastic net reduces to ridge regression and the lasso, respectively.
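For a single predictor without an intercept, both penalized estimates have closed forms, which makes the different behavior of the two penalties visible (an illustrative sketch, not how the estimators are computed in real software):

```python
def ridge_slope(x, y, lam):
    """Ridge estimate for one predictor, no intercept: minimizes
    sum((y - b*x)**2) + lam * b**2, which has a closed form."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def lasso_slope(x, y, lam):
    """Lasso estimate for the same problem: soft-thresholding; the
    coefficient becomes exactly 0 once lam is large enough."""
    num = sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x)
    if num > lam / 2:
        return (num - lam / 2) / den
    if num < -lam / 2:
        return (num + lam / 2) / den
    return 0.0                      # implicit variable selection

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # unpenalized slope is exactly 2
b_ridge = ridge_slope(x, y, 14.0)          # shrunk toward 0, but nonzero
b_lasso = lasso_slope(x, y, 100.0)         # set exactly to 0: variable removed
```

Ridge shrinks the coefficient smoothly toward 0, whereas the lasso can set it exactly to 0, which is the source of its implicit variable selection.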
Boosting

An alternative to adding constraints to solve the dimensionality problem for HDD is to pursue a stagewise approach. Starting from the simplest model (e.g., in regression, the null model), a single new predictor variable is added at each step, gradually improving the model [142, 145]. The basic idea of boosting (combining several partial improvements to obtain a good final model) works particularly well when the improvements are small. Therefore, at each step, a regularized approach to the univariate problem is applied. For example, in a regression problem, rather than allowing only a single opportunity to add each predictor variable and produce its coefficient estimate, boosting allows a regression coefficient to be updated several times. At each step, the method selects the variable whose regression coefficient is to be updated, based on the minimization of the loss function.
Valuable properties already mentioned for the lasso, such as shrinkage and intrinsic variable selection, are also achieved by boosting. Shrinkage results from the use of a loss function incorporating a penalty to constrain parameter estimates. The stagewise nature of the procedure potentially allows for stopping before all predictors have been added to the model, effectively setting the regression coefficients of the remaining predictor variables to zero. When to stop updating the model to avoid excessive complexity and, consequently, overfitting is a crucial decision for which several criteria have been proposed; see, e.g., Mayr et al. [146].
Support vector machine (SVM)

A support vector machine (SVM) is a typical example of an algorithmic method developed in the machine learning context [150]. It is mostly used for classification, i.e., to predict the response class of observations (e.g., healthy vs. sick patients), but can also be applied for regression. An SVM divides a set of observations into classes in such a way that the widest possible area around the class boundaries remains free of observations; it is a so-called large margin classifier. The main idea is to construct a (p − 1)-dimensional hyperplane (imagine a two-dimensional plane in a three-dimensional space, or a straight line in a plane) which separates the observations based on their response class. Often it is unrealistic to find such a perfectly separating hyperplane, and one must accept some misclassified observations. Therefore, in the standard extended version of an SVM, observations on the wrong side of the boundaries are allowed, but their number and their combined distance to the boundary are restricted, such that a tuning parameter, usually denoted by C, defines how much “misclassification” is allowed. In addition, extensions based on kernel methods allow for non-linear separating boundaries.
Trees and random forests
One of the simplest algorithmic tools for prediction is a tree, in which the prediction is based on binary splits on the variable space. For example, a simple tree could have two nodes (splits): a root (the first split), which divides the space into two regions based on the presence of a genetic mutation, and a second node that divides the observations with this mutation again into two parts, based on another mutation. A tree can be grown further, until a predetermined (usually via cross-validation) number of regions in the variable space is reached [151]. In many studies, variables are measured on different scales (binary, ordinal, categorical, continuous) and several binary splits are possible, raising the issue of multiple testing. Algorithms which do not correct for multiple testing are biased in favor of variables allowing several cut points over binary variables [152]. |
Simple trees are often unstable, i.e., fitting a tree to subsets of the data leads to very different estimated trees. One idea to solve this problem is to aggregate the results of trees computed on several bootstrap samples (bagging = Bootstrap AGGregatING, [153]). For a continuous response, the predictions of the different trees are typically averaged; for a categorical response, the proportion of trees predicting each category is used as an estimate of the probability of that category.
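The two aggregation rules just described can be sketched as follows (a minimal illustration in Python; `tree_preds` stands for the predictions that the individual bootstrapped trees return for one new observation):

```python
# Aggregating the predictions of bootstrapped trees (bagging), as described above.

def aggregate_continuous(tree_preds):
    """Continuous response: average the individual tree predictions."""
    return sum(tree_preds) / len(tree_preds)

def aggregate_categorical(tree_preds):
    """Categorical response: proportion of trees voting for each category,
    used as an estimate of the class probabilities."""
    counts = {}
    for pred in tree_preds:
        counts[pred] = counts.get(pred, 0) + 1
    return {cls: n / len(tree_preds) for cls, n in counts.items()}

print(aggregate_continuous([2.0, 3.0, 4.0]))        # 3.0
print(aggregate_categorical(["A", "A", "B", "A"]))  # {'A': 0.75, 'B': 0.25}
```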
While bagging partially mitigates the instability problem, often it is not very effective, due to the strong correlation among the trees. Random forests [154] improve upon this approach by limiting the correlation among the trees through use of only a random subset of the variables in the construction of each tree. As in bagging, the results of the different trees are then aggregated to obtain a final prediction rule. Tuning parameters such as the size of the subset and the number of bootstrap samples must be chosen, but often default values are successfully used. While using the default values is often a good strategy in the LDD case, this is not necessarily the case for HDD problems. For example, the best size of the variable subset depends on the total number of available variables [155]. An overview of random forests, from early development to recent advances, was provided by Fawagreh et al. [156].
Neural networks and deep learning
In recent years, machine learning techniques like neural networks and deep learning have gained much interest due to their excellent performance in image recognition, speech recognition, and natural language processing [157, 158]. They are based on variable transformations: in neural networks, the predictor variables are transformed in a generally non-linear fashion through what is called an activation function. One popular choice for the activation function is a sigmoid or logistic function, which is applied to a linear combination of predictor variables (the coefficients used in the linear combination, which provide the individual contribution of each predictor variable, are called weights). These new transformed variables (neurons in machine learning terminology, latent variables in statistical terms) form the so-called hidden layers, which are used to build the predictor. Mathematical theorems show that increasing the number of hidden layers and decreasing the number of neurons in each layer can improve the prediction performance of neural networks.
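The transformation performed by one hidden layer can be written out explicitly. The following sketch (toy weights and inputs, chosen only for illustration) applies the sigmoid activation to a weighted sum of the predictors, producing one neuron value per weight vector:

```python
import math

# One hidden layer of a feed-forward neural network (illustrative sketch).
# Each neuron applies the sigmoid activation to a weighted sum of the inputs.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(x, weights, biases):
    """Transform input x into one neuron value per row of `weights`."""
    return [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.0]                      # two predictor variables
weights = [[1.0, 2.0], [-1.0, 0.5]]  # one weight vector per hidden neuron
biases = [0.0, 0.1]
neurons = hidden_layer(x, weights, biases)  # latent variables fed to the next layer
```

Stacking several such layers, each taking the previous layer's neurons as input, yields the deep architectures discussed next.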
Neural networks with many hidden layers are referred to as deep learning. The choice of the tuning parameters (activation function, number of hidden layers, and number of neurons per layer) characterizes the different kinds of neural networks (and deep learning algorithms). In high-dimensional contexts, special approaches (e.g., selecting variables or setting weights to zero) are used to avoid overfitting.
Deep learning methods are extremely successful when the number of observations is very large (as in image classification and speech recognition based on huge databases). However, they tend to produce overfitted models for typical biomedical applications in which the number of observations (e.g., number of patients or subjects) does not exceed a few hundred or thousand (see Miotto et al. [159] for a discussion of opportunities and challenges).
PRED2: Assess performance and validate prediction models
Mean squared error (MSE) and mean absolute error (MAE)
Mean squared error (MSE) and mean absolute error (MAE), sometimes denoted as mean squared prediction error (MSPE) and mean absolute prediction error (MAPE) to emphasize the fact that they are computed on a test set (see discussion below), are commonly used measures to evaluate the prediction performance of a model in the case of a continuous target variable. They are computed by averaging the squared differences or the absolute differences, respectively, between the values predicted by the model and the true values of the target variable. Note that the MSE, being a quadratic measure, is sometimes reported after a square root transformation, the so-called root mean squared error (RMSE).
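These definitions translate directly into code (a minimal sketch with toy numbers):

```python
import math

# MSE, MAE and RMSE for a continuous target, computed on a test set
# (toy predicted and true values for illustration).

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.5, 3.0]
print(mse(y_true, y_pred))             # (0.25 + 0.25 + 1.0) / 3 = 0.5
print(mae(y_true, y_pred))             # (0.5 + 0.5 + 1.0) / 3 ≈ 0.667
print(math.sqrt(mse(y_true, y_pred)))  # RMSE, back on the scale of the target
```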
ROC curves and AUC
A receiver operating characteristic (ROC) curve is a graphical plot that facilitates visualization of the discrimination ability of a binary classification method. Many statistical methods classify observations into two classes based on estimated probabilities of their membership. If the probability is larger than a threshold, then the response is classified as positive (e.g., sick), otherwise as negative (e.g., healthy). This threshold is most often set to 0.5 or to the prevalence of positive cases in the dataset. Choosing a lower threshold corresponds to more positive predictions, with the consequence of increasing the percentage of observations correctly classified positive among those actually positive (sensitivity), at the potential cost of decreasing the percentage of observations correctly classified negative among those actually negative (specificity). Conversely, a larger threshold generally leads to lower sensitivity and higher specificity.
The ROC curve is typically constructed by plotting 1 − specificity (x-axis) against sensitivity (y-axis) for all possible values of the threshold. The result is a curve that indicates how well the method discriminates between the two classes. Models with the best discrimination ability will correspond to ROC curves occupying the top left corner of the plot, corresponding to simultaneous high sensitivity and high specificity. A ROC curve close to the diagonal line from lower left to upper right represents poor discrimination ability that is no better than random guessing, e.g., by flipping a coin. The information provided by the ROC curve is often summarized in one single number by calculating the area under the curve (AUC). The best classifiers obtain an AUC value close to 1, while methods no better than random guessing exhibit values close to 0.5. Figure 17 [185] shows an example ROC curve corresponding to high discrimination ability with AUC = 0.90 (and confidence interval [0.86, 0.95]).
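A minimal sketch (toy scores and labels) shows both building blocks: sensitivity and specificity at a given threshold, and the AUC computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which equals the area under the empirical ROC curve:

```python
# Sensitivity/specificity at one threshold, and the AUC via the
# rank-comparison formulation (equivalent to the area under the
# empirical ROC curve). Toy scores and labels for illustration.

def sens_spec(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]  # estimated probabilities of being positive
labels = [1, 1, 0, 1, 0]            # true classes
print(sens_spec(scores, labels, 0.5))  # sensitivity 2/3, specificity 1.0
print(auc(scores, labels))             # 5/6 ≈ 0.833
```

Sweeping the threshold over all observed scores and plotting the resulting (1 − specificity, sensitivity) pairs traces out the ROC curve itself.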
Caution is advised regarding the risk of overestimating the performance of a classifier based solely on the AUC value, as the binary decision depends on an optimized threshold, which can be quite different from 0.5. This problem is especially important for HDD, since there is a lot of flexibility to tune and optimize the classifier, including the decision threshold, based on the large number of predictor variables. Calibration plots (see below) are also important to assess whether the classifier is well calibrated, i.e., whether estimated probabilities correspond to similar proportions in the data.
Misclassification rate
A simpler measure of the prediction ability in the case of a categorical response is the misclassification rate, which quantifies the proportion of observations that have been erroneously classified by the model. Here, in contrast to AUC, smaller values are better. While this measure is simple and can be used even if the classifier does not assign probabilities to observations, but only predicts classes, it does not differentiate between false positives and false negatives. Therefore, the overall misclassification rate can depend heavily on the mix of truly positive and truly negative cases in the test set.
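As a quick illustration (toy labels), the misclassification rate is simply the fraction of disagreements between predicted and true classes:

```python
# Misclassification rate: proportion of wrongly classified observations
# (smaller is better). Works with predicted classes only, no probabilities needed.

def misclassification_rate(y_true, y_pred):
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

y_true = ["sick", "healthy", "sick", "healthy"]
y_pred = ["sick", "sick", "sick", "healthy"]
print(misclassification_rate(y_true, y_pred))  # 1/4 = 0.25
```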
Brier score
While the misclassification rate only measures accuracy, the Brier score also takes into account the precision of a predictor [180, 186]. The Brier score can be applied for binary, categorical, or time-to-event predictions. It calculates quadratic differences between predicted probabilities and observed outcomes. Thus, it can be considered the counterpart, for these prediction targets, of the MSE used for regression models. The Brier score is particularly useful because it captures both aspects of a good prediction, namely calibration (similarity between the actual and predicted survival time) and discrimination (ability to predict the survival times of the observations in the right order). For survival data, the Brier score is generally plotted as a function of time, where higher curves mean worse models. Alternatively, the area under the Brier score curve is computed, leading to the integrated Brier score, which summarizes the prediction error in a single number (lower being better).
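For the binary case, the Brier score reduces to the mean squared difference between the predicted probability and the observed 0/1 outcome (a minimal sketch with toy numbers):

```python
# Brier score for a binary outcome: mean squared difference between the
# predicted probability of the positive class and the observed 0/1 outcome
# (lower is better). Toy values for illustration.

def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

probs = [0.9, 0.2, 0.6]  # predicted probabilities of the positive class
outcomes = [1, 0, 0]     # observed outcomes
print(brier_score(probs, outcomes))  # (0.01 + 0.04 + 0.36) / 3 ≈ 0.137
```

A confident and correct prediction (0.9 for an observed 1) contributes little; a confident but wrong one would contribute heavily, which is how the score rewards both calibration and discrimination.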
Calibration plots
Calibration plots for statistical prediction models can be used to visually check if the predicted probabilities of the response variable agree with the empirical probabilities. For example, for logistic regression models, the predicted probabilities of the target outcome are grouped into intervals, and for all observations within each interval the proportion of observations positive for the target outcome is calculated. The means of the predicted values are then plotted against the proportion of true responders across the intervals. For survival models, the Kaplan–Meier curve (the observed survival function) can be compared with the average of the predicted survival curves of all observations. Poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making [187]. Figure 18 [187] visualizes different types of miscalibration using calibration plots.
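The grouping step can be sketched as follows (a toy example with two probability intervals): for each interval, the mean predicted probability and the observed proportion of positives are computed, yielding the points that a calibration plot would display.

```python
# Grouping predicted probabilities into equal-width intervals and computing,
# per interval, the mean prediction and the observed proportion of positives:
# the points plotted in a calibration plot. Toy data for illustration.

def calibration_points(probs, outcomes, n_bins=2):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # which interval p falls into
        bins[idx].append((p, y))
    points = []
    for group in bins:
        if group:
            mean_pred = sum(p for p, _ in group) / len(group)
            obs_prop = sum(y for _, y in group) / len(group)
            points.append((mean_pred, obs_prop))
    return points

probs = [0.1, 0.2, 0.8, 0.9]
outcomes = [0, 0, 1, 1]
print(calibration_points(probs, outcomes))  # points close to the diagonal
```

For a well-calibrated model the points lie near the diagonal (mean prediction ≈ observed proportion); systematic deviations indicate the types of miscalibration shown in Figure 18.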
Deviance
The deviance measures a distance between two probabilistic models and is based on likelihood functions. It can be used for model comparison for any kind of response variable for which a likelihood function can be specified. For a Gaussian response, it corresponds (up to a constant) to the MSE and thus provides a measure of goodness-of-fit of the model compared to a null model without predictors. When the deviance is computed on the training set (see discussion below) to choose the “best” model among several alternatives, it is often regularized: a penalty term is added which disfavors larger models (large p, where p is the number of predictor variables), leading to information criteria such as the AIC (penalty equal to 2p) and the BIC (penalty equal to p · log n). The specific choice of the information criterion is difficult and depends, e.g., for classification tasks, also on the relative importance of sensitivity and specificity [188].
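With the penalties stated above, the two criteria are straightforward to compute from a model's deviance (toy numbers for illustration):

```python
import math

# AIC and BIC as penalized versions of the deviance, using the penalties
# stated above: 2p for AIC and p * log(n) for BIC.

def aic(deviance, p):
    return deviance + 2 * p

def bic(deviance, p, n):
    return deviance + p * math.log(n)

# A larger model fits better (lower deviance) but pays a higher penalty;
# BIC penalizes model size more strongly than AIC once n > e^2 ≈ 7.4.
print(aic(100.0, p=5))         # 110.0
print(bic(100.0, p=5, n=200))  # 100 + 5 * log(200) ≈ 126.5
```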
Subsampling
Subsampling is probably the most straightforward procedure to address the stability issue discussed above. Instead of relying on the result of one single split into training and test sets, the prediction measure is computed for a large number (at least 100) of splits. In practice, for each split, the model (or algorithm) is trained on one part of the data and evaluated on the rest. The results are then averaged to yield a summary measure of performance.
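The procedure can be sketched in a few lines (a minimal illustration; `fit_and_score` is a hypothetical placeholder for training a model on the training part and evaluating it on the test part):

```python
import random

# Subsampling: repeat a random training/test split many times and average
# the performance measure over the splits.

def subsample_estimate(data, fit_and_score, n_splits=100, test_fraction=0.25):
    rng = random.Random(1)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(fit_and_score(train, test))
    return sum(scores) / len(scores)

# Trivial stand-in for a real model: "score" = mean of the test observations.
data = list(range(20))
estimate = subsample_estimate(data, lambda tr, te: sum(te) / len(te))
```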
Cross-validation
Subsampling can substantially improve stability compared to the use of a single data split, but a potential criticism is that it does not guarantee (for a finite number of replications) that all observations are used equally frequently in the training set and in the test set. Cross-validation ensures balance by splitting the observations into K approximately equal-sized portions (folds) and using, in turn, K − 1 folds to build the model and the remaining fold to evaluate its performance. Each observation is then used K − 1 times to train the model and once to test it. The K results are then averaged. One drawback of classical cross-validation is that the procedure relies on the specific split into K folds. To address this issue, the cross-validation procedure can be repeated several times, combining the ideas of cross-validation and subsampling [193].
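The balance property is easiest to see in code (a minimal sketch): each observation appears in exactly one test fold and in K − 1 training sets.

```python
# K-fold cross-validation: split n observations into K approximately
# equal-sized folds; each fold serves once as test set while the
# remaining K - 1 folds form the training set.

def kfold_indices(n, k):
    folds = [list(range(n))[i::k] for i in range(k)]  # approx. equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in kfold_indices(n=6, k=3):
    # Fit the model on `train`, evaluate on `test`; the K results are averaged.
    pass
```

In practice the observations are shuffled (or stratified by outcome) before assigning them to folds; repeating the whole procedure with different shuffles gives repeated cross-validation.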
Bootstrapping and its modifications
Similar to subsampling, bootstrapping is based on the idea of generating a large number of training and test sets. In contrast to subsampling, bootstrapping generates training sets of the same size as the original sample, by resampling observations with replacement [194, 195]. Since some observations are used multiple times in a bootstrap training set, other observations are not included at all; these form the test set on which the model is evaluated.
Bootstrapping is known to overestimate the prediction error, since each bootstrap training set contains, on average, only about 63.2% of the distinct observations of the full dataset. Adjustments to the method have been proposed, for example, the 0.632 bootstrap and the 0.632+ bootstrap [196]. Both modifications balance the overestimated bootstrap-based error estimate against the heavily underestimated error estimate computed on the training set. An overview of many different bootstrapping approaches for practitioners and researchers was provided by [197].
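A minimal sketch of the two ingredients: drawing a bootstrap training set with its out-of-bag test set, and combining the two error estimates with the 0.632 weighting (toy error values for illustration).

```python
import random

# Bootstrap resampling: draw n observations with replacement as the training
# set; observations never drawn ("out-of-bag") form the test set. The 0.632
# estimate then balances the optimistic training error against the
# pessimistic out-of-bag error.

def bootstrap_split(n, rng):
    train = [rng.randrange(n) for _ in range(n)]  # same size, with replacement
    in_bag = set(train)
    oob = [i for i in range(n) if i not in in_bag]  # out-of-bag observations
    return train, oob

def err_632(train_error, oob_error):
    return 0.368 * train_error + 0.632 * oob_error

rng = random.Random(7)
train, oob = bootstrap_split(10, rng)
# Roughly 36.8% of the observations are expected to be out-of-bag.
print(err_632(train_error=0.05, oob_error=0.30))  # 0.368*0.05 + 0.632*0.30 = 0.208
```

The 0.632+ variant additionally adapts the weights when the model is strongly overfitted; the fixed-weight version above is the simpler of the two.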
Use of external datasets (“external validation”)
While resampling-based approaches can be useful to evaluate and compare the performance of prediction models, they do not meet the validation standards typically desired in real-world scenarios. Generally, the goal is to develop a prediction model that generalizes well to independent patient cohorts. This refers both to future patients from the same, say, clinical centers as those from which the data used for the construction of the model were obtained, and to patients from different clinical centers [198‐200]. Resampling techniques reflect the model performance for independent patient data only if the distribution of the independent data is the same as that of the original data. This assumption can justifiably be questioned when high-dimensional omics or other biomarker data are involved, which may be generated in a new laboratory or according to modified methods, or at the very least be subject to different batch effects (see section “IDA3.2: Batch correction”). For all of these reasons, validation on external data (cohorts) is essential to have sufficient confidence in the performance of a predictor for clinical use.
Good reporting to improve transparency and reproducible research
Discussion
All other topic groups work on issues that are also relevant for the analysis of HDD, although their papers are written in the context of LDD. Appropriate study designs (TG5) are key to improving research in the health sciences; it is well known that mistakes in design are often irremediable [229]. Nearly all studies in HDD and LDD have to cope with missing data (TG1, [58]), and data preprocessing is a relevant topic for all studies, closely related to tasks in initial data analysis (TG3). In the analysis of LDD, the importance of IDA has largely been ignored, and a recent review showed that reporting of IDA is sparse [230]. In section “IDA: Initial data analysis and preprocessing,” we provided a discussion of IDA aspects in the context of HDD. Measurement error and misclassification (TG4) are common problems in many LDD and HDD studies that are often ignored in practice [231]. Studies with a survival time outcome are popular in HDD and have to cope with several issues discussed in the survival analysis group (TG8, [232]).
In the context of LDD, TG2 published a review focusing on approaches and issues for deriving multivariable regression models for description [136]. Although analyses of HDD concentrate more on models for prediction, some of the issues are also relevant, and the very large number of variables and (too) small sample sizes severely exacerbate some of these problems. For LDD, issues in deriving models for prediction are discussed by TG6 [233]. Finally, the overarching aim of many HDD studies is to discover knowledge that is causally related to an outcome of interest. However, causal inference poses several important challenges (TG7, [234]).
Conclusions
1. Methods for visual inspection of univariate and multivariate distributions: Histograms, boxplots, scatterplots, correlograms, heatmaps (Table 2)
2. Methods for descriptive statistics: Measures for location and scale, bivariate measures, RLE plots, MA plots (Table 3)
3. Method for analysis of control values: Calibration curve (Table 4)
4. Methods for graphical displays: Principal component analysis (PCA), Biplot (Table 5)
5. Methods for background subtraction and normalization: Background correction, baseline correction, centering and scaling, quantile normalization (Table 6)
6. Methods for batch correction: ComBat, SVA (surrogate variable analysis) (Table 7)
7. Method for recoding: Collapsing categories (Table 8)
8. Method for filtering and exclusion of variables: Variable filtering (Table 9)
9. Method for construction of new variables: Discretizing continuous variables (Table 10)
10. Method for imputation of missing data: Multiple imputation (Table 11)
11. Methods for graphical displays: Multidimensional scaling, t-SNE, UMAP, neural networks (Table 12)
12. Methods for cluster analysis: Hierarchical clustering, k-means, PAM (Table 13)
13. Methods for estimation of the number of clusters: Scree plots, silhouette values (Table 14)
14. Methods for hypothesis testing for a single variable: t-test, permutation test (Table 15)
15. Methods for hypothesis testing for multiple variables in HDD: limma, edgeR, DESeq2 (Table 16)
16. Methods for multiple testing corrections: Bonferroni correction, Holm’s procedure, Westfall-Young permutation procedure (Table 18)
17. Methods for multiple testing corrections controlling the FDR: Benjamini-Hochberg, q-values (Table 19)
18. Methods for multiple testing for groups of variables: Gene set enrichment analysis (GSEA), over-representation analysis, global test, topGO (Table 20)
19. Methods for variable transformations: Log-transform, standardization (Table 21)
20. Method for dimension reduction: Supervised principal components (Table 22)
21. Methods for statistical modelling with constraints on regression coefficients: Ridge regression, lasso regression, elastic net, boosting (Table 23)
22. Methods for statistical modelling with machine learning algorithms: Support vector machine, trees, random forests, neural networks and deep learning (Table 24)
23. Methods for assessing performance of prediction models: MSE, MAE, ROC curves, AUC, misclassification rate, Brier score, calibration plots, deviance (Table 25)
24. Methods for validation of prediction models: Subsampling, cross-validation, bootstrapping, use of external datasets (Table 26)