Background
The role of a medical practitioner is to perform interventions on patients that are as beneficial as possible in a broad sense, taking into account short- and long-term health outcomes, quality of life, and often other aspects such as ease of use or applicability to a wider set of indications. In contrast, it is the goal of medical researchers to develop new interventions that are superior—or non-inferior with fewer side effects—to those already in existence. Rightfully, the unbiased, systematic evaluation of new and previously existing interventions is considered crucial by the medical community and is a focal point of the medical literature. The concept of “evidence-based medicine” (EBM) has been receiving growing attention and credibility for decades.
We can draw a parallel between medicine and computational statistics, with the statistical consultant analogous to the medical practitioner and the applied statistical researcher to the medical researcher; see Table 1 for an overview of this analogy, which will be developed throughout the paper. The role of a statistical consultant is to analyze the client’s data such that the results help to answer a research question as completely as possible, uncovering and approximating a truth that is assumed to exist behind this question. Once more, there may be other considerations, such as cost and computation time. The goal of applied statistical researchers is to develop data analysis methods and tools that are, again, in some sense superior to those already in existence. Here, however, the parallel between medicine and computational statistics ends: the unbiased evaluation of these new statistical methods and tools in real-data settings, including their comparison to existing methods, usually receives only scant attention in the literature. In this paper, we explore the disparity between evidence-based medicine and “evidence-based computational statistics” by examining the methodological aspects of benchmark studies, i.e., the systematic comparison of statistical methods using real datasets.
Table 1
Analogy between clinical research and computational statistical research
| | Clinical research | Computational statistical research |
| --- | --- | --- |
| Trial type | In vitro/animal study | Simulation |
| | Clinical trial | Benchmark study |
| | Blinded | Neutral and blind analysis |
| | (Placebo) controlled | (Null-model) controlled |
| | Cross-over | Paired samples |
| | Multi-arm | Multiple methods |
| Investigators | Trialist | Researcher conducting benchmark experiment |
| | Medical researcher | Methodological researcher in computational statistics |
| | Sponsor | Methodological researcher in computational statistics |
| Observation unit | Clinical trial patients | Real datasets |
| Comparators | Therapies, interventions and controls | Statistical and machine learning methods |
| Problem | Treatment of medical condition | Answering a question using data, e.g. prediction problem |
| Context | Patient’s preference, social context | Substantive context |
| | Personalized medicine | Meta-learning |
| Objectives | Improving patient’s health | Yielding reliable answer, e.g. increasing prediction performance |
| | Selecting and applying therapy to patient | Selecting and applying methods to datasets |
| | Application by medical practitioner | Application by statistical practitioner/consultant |
| Endpoints | Relevant clinical endpoints | Error rate, AUC, computing time, etc. |
| | Missing value (e.g. dropout) | Failure to produce output |
Greenhalgh et al. [1] state: “It is more than 20 years since the evidence based medicine working group announced a ‘new paradigm’ for teaching and practicing clinical medicine. Tradition, anecdote, and theoretical reasoning from basic sciences would be replaced by evidence from high quality randomized controlled trials and observational studies, in combination with clinical expertise and the needs and wishes of patients.” Our aim is to start a discussion on “evidence-based” data analysis in which “tradition, anecdote, and theoretical reasoning from basic sciences [including simulations] would be [complemented] by evidence from high-quality [benchmark studies], in combination with [statistical] expertise and the needs and wishes of the [substantive scientists]”. Our discussion is based on an analogy between clinical trials, which play a crucial role in evidence-based medicine, and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions (see Table 1).
In computational statistics, “evidence” can be generated through theoretical considerations (e.g., the proof that an algorithm converges or the asymptotic relative efficiency of a test), by simulations (i.e., with artificial datasets randomly drawn from a particular distribution) or through real-data examples. However, theory is often of little help in highly complex real-world situations, since it usually requires unrealistic simplifying assumptions regarding the data structure. In this paper we focus on the role of real-data analysis (as opposed to simulations) and the design of such studies. This type of evidence can be seen as “empirical evidence”.
As a specific example of the greater incorporation of evidence into the evaluation and presentation of statistical methodology, the recently established STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative [2] aims to provide guidance on the choice of statistical method based on empirical evidence and experts’ experience in the context of observational studies in medical research. However, such groundbreaking projects are still in their infancy, and the concept of evidence and the role of real data in this context are not yet well defined.
In machine learning—a field focused on prediction as opposed to explanation—ideas relating to evidence from real data are becoming commonplace. Systematic comparison studies based on real data, often denoted benchmark experiments, form a core part of the literature; their realization is made substantially easier by databases of datasets available for benchmarking, such as the UC Irvine (UCI) machine learning repository [3] and the OpenML platform [4]. Machine learning challenges [5], which can be seen as collective benchmarking studies, are also receiving substantial attention from the community. Machine learning scientists work to obtain empirical evidence on the performance of algorithms (often algorithms for constructing prediction models) on real datasets, in analogy to medical doctors obtaining empirical evidence on the performance of therapies on human patients (see Table 1). Machine learning scientists are further aware of the no-free-lunch theorem (i.e., that no algorithm works best in all situations) and again address this problem through evidence-based research, by evaluating which algorithm performs best in which situations and even automating this process—a task known as meta-learning.
We statisticians are generally reluctant to adopt the concept of evidence-based decision-making: that the choice of statistical method to use in a given situation should be guided primarily by “evidence” in general and by empirical evidence more specifically. This reluctance is also strong in the context of prediction modelling, which lies in the domains of both statistics and machine learning and has long been addressed by machine learners through benchmark experiments. Some statisticians feel that a more evidence-based approach implies jettisoning the experience of the statistical consultant in favor of a suspect set of guidelines inspired by oversimplification. The idea that the choice of a method might be reached in a more or less automatic manner makes us uncomfortable. Statisticians often argue that no ruleset or guideline can replace the judgement of an expert statistician, nor can a ruleset take into account all aspects of a problem, such as the substantive context. Interestingly, these are exactly the types of arguments invoked by EBM sceptics. Medical doctors questioning EBM argue that an evidence-based approach based on systematic rules cannot cope with the complexity of individual cases—e.g., with respect to multi-morbidity—and, again, ignores important considerations such as the wishes and social backgrounds of patients.
The existence of specific datasets (in statistics) or patients (in medicine) with complexity that cannot be accommodated by simple evidence-based rules may be seen as an argument in favor of the need for more evidence, i.e. evidence tailored to particular dataset or patient profiles. This need has long been acknowledged in medical research and is being addressed in the emergence of personalized/individualized medicine, with subgroup and interaction effect analyses in clinical trials being simple steps in this direction. Similarly, in computational sciences, the development and use of meta-learning are steps towards tailored algorithms.
Medical doctors or statisticians may still maintain that even the best and most individualized evidence cannot replace expert intuition, nor adequately take into account the specific context of the choice. This is a controversial issue. One may argue that an expert’s intuition is simply the result of the unconscious collection of evidence from personal experience during a career, and that such evidence could also be formalized as systematic rules. If so, the question becomes whether one can trust machines to derive rules as reliably as the brain of an expert, and whether this can be achieved in practice considering the current state-of-the-art of computational sciences. In a given situation, the amount of information (e.g., number of cases) and the source of the information (e.g., type and quality of studies and data) play further crucial roles in this appraisal.
Statisticians may also contend that the need for empirical evidence in statistics is not as strong as in medicine, since one can in principle subject a dataset to as many statistical methods as one desires, whereas the same cannot be said for patients and interventions. Clearly, it does not harm a dataset to undergo different statistical analyses, but it may harm patients to undergo different interventions before the most appropriate one is identified. While the sense of this argument is evident, it is well known that applying a large number of statistical methods to a dataset and deciding which one is the “right one” based on the results may yield substantial problems relating to “fishing for significance” [6]. An illustration is provided by an experiment in which 29 statisticians were asked to analyze a dataset with the goal of assessing the potential correlation between the skin color of football players and red cards [7]. Perhaps surprisingly to some, but likely not to many statisticians, the researchers obtained very different—and partly contradictory—answers! Which result should be reported as definitive? Researchers are obviously tempted to report the one that best fits their goals. Due to multiple comparison effects, this strategy is likely to yield false research findings that are simply the result of optimization and data dredging and would fail to be validated using independent data [6, 8].
When considering types of evidence, statisticians are usually keener to evaluate their methods using data simulated from known distributions as opposed to conducting benchmark studies consisting of a large number of real datasets. The use of simulations as opposed to real-data analysis can be considered analogous to using in vitro studies or animal trials as opposed to patient-based experiments: one can control important factors—genotypes, age, diet, etc. in medicine; the dataset size, the signal strength, etc. in computational science—and thus obtain homogeneous groups and “know the truth”. In the context of, say, hypothesis testing or the estimation of a variable’s effect for the purpose of explanation (as opposed to prediction), the truth is unknown in real data, thus making simulations indispensable for evaluating how well statistical methods uncover the truth. Moreover, one can simulate as much data as computationally feasible, allowing reliable systematic evaluation of the methods in the considered simulation settings. In many situations, simulations are indispensable. However, even with the best simulations, one would often remain uncertain as to the performance of the examined methods in the much more complex real world.
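To make this contrast concrete, the following minimal sketch (our own illustration, not drawn from the cited literature) shows in what sense the truth is “known” in a simulation: the data are generated under a null hypothesis that is true by construction, so the rejection rate of a test can be checked directly against its nominal level.

```python
# Minimal simulation sketch: both groups are drawn from the same known
# distribution, so the null hypothesis is true by construction and the
# empirical type I error of the test can be compared with the nominal level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n, alpha = 2000, 30, 0.05

rejections = 0
for _ in range(n_sim):
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    y = rng.normal(loc=0.0, scale=1.0, size=n)
    if stats.ttest_ind(x, y).pvalue < alpha:
        rejections += 1

print(f"Empirical type I error: {rejections / n_sim:.3f} (nominal level {alpha})")
```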
In this context, we would like to start and fuel a discussion on the appropriate design of real-data studies yielding evidence in statistical research, always with careful consideration of a dataset’s specificities/substantive context and without discarding expert intuition and simulations. In analogy to EBM and the choice of therapies, large-scale benchmarking research in statistics may yield tentative rulesets and guidelines to facilitate the choice of data analysis methods without dictating them. We would like to discuss the question—without claiming to have the ultimate answers—of the role of EBM-inspired concepts in real-data benchmark analysis in computational applied statistics.
In this paper, we assume that the performance of a statistical method on a real dataset can be objectively assessed using some criterion. This is the case, for example, for prediction methods: natural criteria are error measures such as the error rate (in the case of binary classification) or the Brier score (in the case of survival prediction), which can be estimated through resampling techniques such as cross-validation. This is the context we will use to explain our ideas. Methods whose performance cannot be quantified using real data, such as hypothesis tests or effect estimation procedures, are not considered here. Moreover, unless stated otherwise, we assume that a method is well defined and runs automatically on a dataset without human intervention such as parameter initialization or preprocessing. The issue of human intervention is discussed further in “Role of the user” section.
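As a concrete, deliberately simple illustration of this setting, the sketch below estimates the cross-validated error rate of a single off-the-shelf classifier on a single dataset; the dataset and method are placeholders chosen for convenience, not those examined in the studies discussed later.

```python
# Illustrative sketch: the performance criterion assumed throughout is an
# objectively computable measure, here the misclassification error rate
# estimated by 10-fold cross-validation, with no manual tuning or preprocessing.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder binary classification dataset
method = LogisticRegression(max_iter=5000)  # runs automatically, no human intervention

accuracy = cross_val_score(method, X, y, cv=10, scoring="accuracy")
print(f"Cross-validated error rate: {1 - accuracy.mean():.3f}")
```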
While one cannot incorporate all aspects of EBM into the evaluation of statistical methods through benchmarking, we claim that some precepts commonly accepted in EBM may be helpful in defining a concept of “evidence-based computational science”. A simple example is that of sample size, the extensively researched question of how many patients are required in a clinical trial in order to make valid statistical claims about a result. Analogously, in benchmarking, in order to draw conclusions from real-data analysis that go beyond illustrative anecdotal statements, it is important to consider an adequate number of datasets; see Boulesteix et al. [9] for a discussion of the precise meaning of “an adequate number”. In the remainder of this paper, we discuss further concepts essential to formulating evidence-based statements in computational research using real datasets. After the presentation of a motivating example in “Motivating example” section, the significant question of the definition of selection criteria for datasets is addressed in “Selecting datasets: a major challenge” section, while other concepts from medical sciences such as analysis protocols, placebos, evidence levels, and bias are discussed in “Further EBM-related concepts” section (see Table 2 for an overview of these concepts).
Table 2
Some ideas for the improvement of benchmarking practice
| Concept | Section | Comment |
| --- | --- | --- |
| Sample size calculation | | Possible and desirable [9] |
| Strict inclusion criteria | Sec. 3 | Possible and desirable |
| Trial protocol | Sec. 4.1 | Principle might be helpful in adapted form |
| Quality control | Sec. 4.2 | Principle might be helpful in adapted form, e.g. via platforms like OpenML [4] |
| Placebo/reference | Sec. 4.3 | Principle might be helpful in adapted form |
| Blinding | Sec. 4.4.1 | Principle might be helpful in adapted form |
| Intention-to-treat | Sec. 4.4.2 | Adequate treatment and reporting of missing values: possible and desirable |
| Levels of evidence | Sec. 4.5 | Principle might be helpful in adapted form |
Motivating example
Since the early 2000s, numerous supervised classification algorithms have been proposed in the bioinformatics and statistics literature to handle the so-called n ≪ p problem, i.e., the case where the number p of candidate covariates (for example, gene expression data collected using microarray technology) by far exceeds the number n of patients in the dataset. Common approaches to handling this dimensionality problem include preliminary variable selection, dimension reduction, penalized regression, or methods borrowed from machine learning [10].
In this setting, the comparison of existing classification methods using real microarray datasets has been the topic of a number of papers [11–18], which do not aim at demonstrating the superiority of a new method proposed by the authors: they consider only existing methods, thus implying a certain level of neutrality [19]. Most of these papers consider measures such as the error rate or the area under the curve to assess the performance of the classification methods. For each considered performance measure, the results are essentially presented as an N×M table, where N is the number of considered real datasets and M is the number of considered methods. The number M of methods varies across papers and ranges between M=2 and M=9. These methods are sometimes combined with different variable selection or dimension reduction techniques, thus yielding a higher number of investigated method variants. The methods considered vary across papers and include, for example, discriminant analysis, tree-based methods, nearest neighbors, and support vector machines.
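In structural terms, each of these studies amounts to filling such an N×M table of performance values. The sketch below shows this benchmark loop with placeholder datasets and methods (none of them taken from the papers discussed here).

```python
# Structural sketch of an N x M benchmark table (datasets in rows, methods in
# columns); the datasets and methods are placeholders, not those of the cited studies.
import numpy as np
from sklearn.datasets import load_breast_cancer, load_digits, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

datasets = {  # N = 3 datasets, purely for illustration
    "breast_cancer": load_breast_cancer(return_X_y=True),
    "digits": load_digits(return_X_y=True),
    "wine": load_wine(return_X_y=True),
}
methods = {  # M = 2 methods
    "logistic_regression": LogisticRegression(max_iter=5000),
    "nearest_neighbors": KNeighborsClassifier(),
}

# results[i, j] = cross-validated error rate of method j on dataset i
results = np.empty((len(datasets), len(methods)))
for i, (dataset_name, (X, y)) in enumerate(datasets.items()):
    for j, (method_name, clf) in enumerate(methods.items()):
        accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        results[i, j] = 1 - accuracy.mean()
        print(f"{dataset_name:>13} | {method_name:>19} | error rate {results[i, j]:.3f}")
```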
We feel that these studies contain major flaws if examined in the light of the principles of clinical research. First of all, most are underpowered: with the exception of the study by de Souza et al. [18] (N=65 datasets) and that of Statnikov et al. [17] (N=22 datasets), they use between N=2 and N=11 datasets to compare the methods, too few to achieve reasonable power when comparing the performances of classification methods [9]. Following the sample size calculation approach outlined in Boulesteix et al. [9], the required number of datasets to detect a difference in error rates between two methods of, say, 3% with a paired-sample t-test at a significance level of 0.05 and a power of 80% is as high as N=43 if the standard deviation (over the datasets) of the difference in error rates is 7%—a standard deviation common in this setting [9].
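This figure can be reproduced with the usual normal-approximation formula for a paired (one-sample) comparison; the short calculation below is a sketch under exactly the assumptions stated above (3% difference, 7% standard deviation of the difference, two-sided significance level 0.05, 80% power). An exact t-distribution-based calculation may give a slightly different number.

```python
# Sketch of the dataset sample size calculation quoted above: normal
# approximation for a paired/one-sample comparison of mean error rates.
from math import ceil
from scipy.stats import norm

delta = 0.03   # difference in error rates to be detected
sd = 0.07      # standard deviation of the difference across datasets
alpha = 0.05   # two-sided significance level
power = 0.80   # desired power

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_datasets = ceil(((z_alpha + z_beta) * sd / delta) ** 2)
print(f"Required number of datasets: N = {n_datasets}")  # prints N = 43
```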
Furthermore, it is not clear how the datasets included in the studies were selected. No clear search strategy or inclusion criteria are described in the papers. Some of the datasets are used in several studies and are sometimes denoted “benchmark datasets”, but none are used in all studies. The problem of selecting datasets for inclusion in a benchmark study is further discussed in “Selecting datasets: a major challenge” section. It is also unclear whether datasets were eliminated from a study after having originally been included, since this is not documented. We will come back to this problem in “Non-compliance and missing values” section.
Regarding neutrality, even if the aims of these studies are not to demonstrate the superiority of a particular “favorite” new method, authors are sometimes likely to have preferences for, or better competence in, one or several methods compared to the other(s). For example, one author team may have a very strong background in a particular classification method, say support vector machines, as suggested by their publication record. They are then likely to be more proficient users of this method than of the other(s) (although we admittedly cannot verify this conjecture!), which may induce bias, as further discussed in “Neutrality and blinding” section.