Marker studies should ideally be designed and conducted with a specific clinical question in mind, just like therapeutic trials, but unfortunately, rigorously conducted marker studies seem to be the exception rather than the rule [6]. Many marker studies are conducted with a lack of attention to design and in the absence of a clinically meaningful marker question, that is, for what clinical use the marker is being considered or proposed [7]. Frequently marker studies are conducted retrospectively on 'convenience' specimen sets, which are assembled based on availability and may represent patients with highly diverse pathologic and clinical characteristics. The specimens may have been collected under unknown conditions, and the quality and completeness of associated clinical and pathological data may be unreliable. All of this heterogeneity makes it difficult to identify a coherent clinical setting in which the marker might be useful, even if the study is able to identify statistically significant associations between the marker and patient characteristics or outcomes.
Ideal execution of retrospective predictive marker studies
Predictive marker studies are most reliable when conducted using specimens that had been collected as part of a prospective clinical trial that randomized patients between a standard of care treatment and a new therapy for which the marker is being assessed for its predictive utility. Such specimen collections can be used to provide a high level of evidence for a marker's predictive or prognostic clinical utility under appropriate conditions, including careful pre-specification of the statistical analysis plan for evaluation of the marker [8].
A risky practice in the evaluation of predictive markers is to look for an association of a marker with clinical outcome by studying only patients who receive the new therapy. The problem with this approach is that prognostic effects can be confused with predictive effects. Suppose, for example, that the marker under study has a substantial prognostic effect so that patients with high levels of the marker will have better clinical outcome than patients with low levels of the marker regardless of what treatment the patient receives. Looking only at the patients treated with the new therapy might lead one to conclude erroneously that the improved outcome for patients with high levels of the marker was due to a preferential benefit of the new therapy for that marker-defined subgroup when, in fact, it is possible that the new therapy benefits no patients. Results of these types of studies can be misleading in the opposite direction as well. This can occur if the marker predicts for poor outcome under standard therapy, but patients with this marker benefit from the new therapy. The marker might exhibit no association with clinical outcome in the setting of new therapy, but only because the outcome for patients who were positive for the marker had been improved by the new therapy to be equivalent to the outcome for patients who were not positive for the marker. These examples underscore the need for examination of an appropriate control group (placebo or standard of care) when evaluating a potential predictive marker. A randomized trial provides an ideal setting to ensure that no other confounding factors influenced which patients received the standard therapy versus the new therapy.
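The two scenarios above can be made concrete with a small numeric sketch. The hazard rates below are invented purely for illustration (they are not taken from any study), and the names `PROGNOSTIC_ONLY`, `MASKED_PREDICTIVE`, and `summarize` are hypothetical. The sketch compares what a new-therapy-only analysis would see with what a randomized comparison reveals.

```python
# Hypothetical hazard rates (events per patient-year) for each (marker, arm) cell.
# All numbers are invented for illustration; "high" plays the role of marker-positive.

# Scenario 1: purely prognostic marker. High-marker patients do better on
# either treatment, and the new therapy benefits no one.
PROGNOSTIC_ONLY = {
    ("high", "standard"): 0.5, ("high", "new"): 0.5,
    ("low", "standard"): 1.0, ("low", "new"): 1.0,
}

# Scenario 2: the marker predicts poor outcome under standard therapy, but the
# new therapy brings high-marker patients back to the level of the others.
MASKED_PREDICTIVE = {
    ("high", "standard"): 2.0, ("high", "new"): 1.0,
    ("low", "standard"): 1.0, ("low", "new"): 1.0,
}


def summarize(hazards):
    """Return the hazard ratios seen with and without a control arm."""
    treatment_hr_high = hazards[("high", "new")] / hazards[("high", "standard")]
    treatment_hr_low = hazards[("low", "new")] / hazards[("low", "standard")]
    return {
        # What a study restricted to the new-therapy arm would estimate.
        "marker_hr_new_arm_only": hazards[("high", "new")] / hazards[("low", "new")],
        # What a randomized comparison estimates within each marker subgroup.
        "treatment_hr_marker_high": treatment_hr_high,
        "treatment_hr_marker_low": treatment_hr_low,
        # Ratio of treatment hazard ratios; 1.0 means no predictive effect.
        "interaction_ratio": treatment_hr_high / treatment_hr_low,
    }


for name, scenario in [("prognostic only", PROGNOSTIC_ONLY),
                       ("masked predictive", MASKED_PREDICTIVE)]:
    print(name, summarize(scenario))
```

In the first scenario the new-arm-only marker hazard ratio is well below 1 even though both treatment hazard ratios equal 1. In the second the new-arm-only marker hazard ratio equals 1 even though the treatment halves the hazard in the high-marker subgroup. Only the comparison with the control arm separates the two situations.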
Extensive exploratory data analyses may result in spurious findings
Often extensive data analyses are performed in prognostic and predictive studies in a quest for associations between markers and clinical outcomes that yield statistically significant P values. Because multiple markers can be tested for association with multiple clinical endpoints in several patient subgroups, the chance of generating spuriously significant results in retrospective prognostic and predictive marker studies can be substantial [9].
Consider testing the association between a marker and a clinical outcome in each of four disjoint patient subgroups. If each statistical test is performed at the usual significance level of 0.05 and no true association exists in any subgroup, the probability that a statistically significant result will be obtained in at least one of the four subgroups is approximately 19% (1 - 0.95^4 ≈ 0.185). Now consider multiple types of clinical outcomes, multiple markers, and multiple cut-points applied to dichotomize continuous markers, and the likelihood of such a study producing at least one statistically significant result by chance can become very large. A similar problem occurs when treatment differences are tested in a clinical trial comparing two or more treatment arms in a multitude of subsets defined by markers or other patient characteristics. If treatment differences are found in some subsets and not others, investigators are tempted to claim that they have identified predictive subgroups. Most often such findings are spurious due to the multiple testing and are not confirmed in subsequent studies. Statisticians sometimes explain this phenomenon as 'if you torture the data long enough, they will confess to anything'. If these statistically significant findings are then retrofitted to a clinical question and published with no indication of the exploratory context in which the results were obtained, the result may represent a serious distortion of the significance (both statistical and clinical) of the findings. Together with the long-recognized problem of publication bias favoring studies that report positive findings, the consequence may be a body of literature that is heavily influenced by false-positive findings.
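The four-subgroup figure follows from the standard formula for the chance of at least one false positive among independent tests, 1 - (1 - alpha)^k. The sketch below evaluates it for a few values of k. It assumes that the tests are independent and that no true association exists; tests across correlated outcomes or nested cut-points are not independent, so for those the numbers are only a rough guide. The function names are hypothetical, and the Bonferroni threshold is included only as one familiar adjustment, not as a method prescribed by the text.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false-positive result among n_tests
    independent tests, each run at level alpha, when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate
    at or below alpha (Bonferroni adjustment)."""
    return alpha / n_tests


# 4 subgroups; then, hypothetically, 4 subgroups x 5 markers and
# 4 subgroups x 5 markers x 3 outcomes.
for k in (1, 4, 20, 60):
    print(f"{k:3d} tests: FWER = {familywise_error_rate(k):.3f}, "
          f"Bonferroni per-test level = {bonferroni_threshold(k):.5f}")
```

With four tests the family-wise error rate is about 0.185, matching the roughly 19% quoted above; with a few dozen tests a chance finding becomes close to certain.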
Sample sizes for adequate statistical power
Evaluation of predictive and prognostic markers using specimens collected within treatment trials is not a panacea, however. When designing clinical trials, sample size is generally determined to permit sufficient statistical power to detect a treatment effect of a pre-specified size. Often marker questions are either not specified during the planning stage for a therapeutic trial, or if they are, they are usually relegated to secondary aims that the study might not be sized to address with high statistical power. Add to that an inability to collect specimens from some patients in the trial, and statistical power can be diminished further. Major determinants of statistical power for analyses examining prognostic and predictive markers and their association with time-to-event endpoints (for example, time to disease recurrence or progression, or time to death) include the testing significance level (alpha or type I error), the expected number of events, the distribution of the marker (for example, positivity rate for a binary marker), the treatment randomization ratio, and the magnitude of effect.
Understanding the proper quantification of prognostic and predictive effects is important for determination of clinical utility and proper study design to evaluate those effects. For survival analyses, the effect of a binary prognostic marker is usually expressed as a hazard ratio. The relevant effect for a binary predictive marker is a treatment-by-marker interaction. Presence of a treatment-by-marker interaction means that the treatment effect, that is, the difference in clinical outcome between a new treatment and a standard treatment, differs depending on the status of the patient's marker. An interaction effect is often expressed in statistical terms as a ratio of the treatment hazard ratios, with one treatment hazard ratio being calculated in the marker 'positive' subgroup and the other treatment hazard ratio being calculated in the marker 'negative' subgroup. A treatment-by-marker interaction is most clinically relevant when it is a qualitative interaction. Qualitative means that the direction of treatment benefit is reversed in one marker subgroup compared to the other. For example, the new treatment might confer a substantial survival advantage to patients who are positive for the marker, but it may be the same or worse than standard treatment in the marker negative subgroup. Quantitative interactions occur when the treatment benefit is in the same direction but of different magnitude in the two patient subgroups. Unless the differential magnitude leads to a different treatment decision, a quantitative interaction may not translate to clinical utility of a test based on the marker.
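As a minimal sketch of these definitions, the snippet below computes the ratio of treatment hazard ratios and labels an interaction as qualitative or quantitative. It works on point estimates only and ignores sampling uncertainty, so it illustrates the terminology rather than an inferential procedure. The function names and the convention that a hazard ratio below 1 means benefit from the new treatment are assumptions made for this example.

```python
import math


def interaction_ratio(hr_marker_pos, hr_marker_neg):
    """Ratio of treatment hazard ratios (marker-positive over marker-negative).
    Each input is the new-versus-standard hazard ratio within one subgroup."""
    return hr_marker_pos / hr_marker_neg


def classify_interaction(hr_marker_pos, hr_marker_neg, tol=1e-9):
    """Label the interaction implied by two subgroup treatment hazard ratios.

    Assumed convention: hazard ratio < 1 means the new treatment is beneficial.
    'qualitative'  : benefit in one subgroup, no benefit or harm in the other
    'quantitative' : same direction of effect, different magnitude
    'none'         : identical treatment effect in both subgroups
    """
    log_pos = math.log(hr_marker_pos)
    log_neg = math.log(hr_marker_neg)
    if abs(log_pos - log_neg) <= tol:
        return "none"
    benefit_pos = log_pos < -tol
    benefit_neg = log_neg < -tol
    if benefit_pos != benefit_neg:
        return "qualitative"
    return "quantitative"


# Hypothetical subgroup hazard ratios, for illustration only.
print(interaction_ratio(0.5, 0.8), classify_interaction(0.5, 0.8))
print(interaction_ratio(0.5, 1.2), classify_interaction(0.5, 1.2))
```

In practice the classification would need confidence intervals for both subgroup hazard ratios and a formal interaction test, since two noisy point estimates can fall on opposite sides of 1 by chance.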
The statistical power of a prognostic or predictive marker study depends on the distribution of the marker values in the patient population as well as the size of the effect that one aims to detect. When comparing survival between two groups of patients, for example, patients who receive two different treatments or patients whose tumors do versus do not express a particular marker, power is maximized when the groups are of equal size. Therefore, if a clinical trial has been designed with one-to-one randomization to detect a specified hazard ratio between treatment groups, a test for the effect of a binary prognostic marker of the same magnitude as the treatment effect will have lower power than the treatment comparison when the prevalence of the marker differs substantially from 50%.
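The dependence of power on marker prevalence can be illustrated with Schoenfeld's approximation for a two-group log-rank comparison, in which the number of events needed is proportional to 1 / (p(1 - p)), where p is the fraction of patients in one group. This formula is brought in here as an assumption for illustration and is not part of the original text. The function names are hypothetical, and the approximation ignores covariates, censoring patterns, and the negligible opposite-tail rejection probability.

```python
from math import ceil, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def marker_test_power(n_events, hazard_ratio, prevalence, alpha=0.05):
    """Approximate power of a two-sided log-rank test comparing two groups,
    where `prevalence` is the fraction of patients in one group
    (for example, the marker-positive fraction)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    signal = sqrt(n_events * prevalence * (1.0 - prevalence)) * abs(log(hazard_ratio))
    return _STD_NORMAL.cdf(signal - z_alpha)


def events_required(hazard_ratio, prevalence, alpha=0.05, power=0.80):
    """Approximate number of events needed to reach the requested power."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _STD_NORMAL.inv_cdf(power)
    denominator = prevalence * (1.0 - prevalence) * log(hazard_ratio) ** 2
    return ceil((z_alpha + z_beta) ** 2 / denominator)


# Hypothetical design: hazard ratio 0.5, two-sided alpha 0.05, 80% power.
balanced_events = events_required(0.5, prevalence=0.5)
for p in (0.5, 0.3, 0.1):
    print(f"prevalence {p:.1f}: events needed = {events_required(0.5, p)}, "
          f"power with {balanced_events} events = "
          f"{marker_test_power(balanced_events, 0.5, p):.2f}")
```

A trial sized for a balanced treatment comparison therefore loses power for a marker comparison of the same magnitude as the marker prevalence moves away from 50%.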
Testing a treatment-by-marker interaction is even more challenging. Adequately powering a clinical trial to test a treatment-by-marker interaction can easily require two to four times the sample size required to detect a treatment effect, unless there is a fairly dramatic treatment effect nearly exclusive to the subgroup that the biomarker predicts will benefit. Importantly, the biomarker-defined subgroup should not be defined post hoc by exploratory analyses and then tested as though it had been pre-specified, unless proper care has been taken to adjust statistically for this form of multiple testing to avoid false-positive findings. An added problem is the inaccuracy of some marker assays. If assay inaccuracies cause misclassification of patients with regard to marker status, this error will further reduce the statistical power for detecting predictive marker effects. Taken together, these considerations explain why it can be so difficult to establish the utility of prognostic and predictive marker tests in a statistically rigorous way when the marker-related questions are retrofitted to therapeutic trials.
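A rough sense of why interaction tests are so demanding comes from comparing variances on the log hazard ratio scale. Under simplifying assumptions that are not from the original text (1:1 randomization, and events spread across the four marker-by-arm cells in proportion to their size), the variance of the overall log treatment hazard ratio is about 4/D for D total events, while the variance of the log ratio of subgroup hazard ratios is about 4/(D p (1 - p)) for marker prevalence p. The sketch below turns this into an event multiplier; `interaction_event_multiplier` and `effect_ratio` are hypothetical names.

```python
def interaction_event_multiplier(prevalence, effect_ratio=1.0):
    """Approximate factor by which the number of events must grow to test a
    treatment-by-marker interaction with the same power as the overall
    treatment comparison.

    prevalence   : fraction of patients who are marker-positive (0 < p < 1)
    effect_ratio : |log(ratio of subgroup hazard ratios)| divided by
                   |log(overall treatment hazard ratio)|; values above 1 mean
                   the interaction is larger than the overall effect

    Assumes 1:1 randomization and events distributed across cells in
    proportion to cell size (a simplification).
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    if effect_ratio <= 0.0:
        raise ValueError("effect_ratio must be positive")
    return 1.0 / (prevalence * (1.0 - prevalence) * effect_ratio ** 2)


# Hypothetical settings, for illustration only.
for p, r in [(0.5, 1.0), (0.2, 1.0), (0.5, 2.0)]:
    print(f"prevalence {p}, effect ratio {r}: "
          f"multiplier = {interaction_event_multiplier(p, r):.2f}")
```

Under these assumptions an interaction of the same size as the overall treatment effect needs at least four times as many events, and the penalty shrinks only when the interaction is considerably larger than the overall effect, as when the benefit is concentrated in one marker subgroup. Misclassification of marker status, which this sketch does not model, would shrink the observable interaction and raise the multiplier further.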
If no suitable treatment trials with adequate specimen collection exist to answer an important predictive or prognostic marker question, other options remain: to prospectively design a trial specifically to answer the marker question, or to try to combine specimens or marker data across several completed trials. Many options have been proposed for designing trials to validate marker-based tests [10], but such trials can be costly and currently are conducted less frequently than trials designed purely to answer a treatment question. Alternatively, combining data across different marker studies might be possible, but care must be taken to select studies that represent the full spectrum of relevant studies, regardless of publication status or presence of statistically significant findings. Not only must patient characteristics and treatments be comparable in order for the studies to be combined sensibly, but the marker assays used in the different studies also need to be comparable. All of these options require adequate resources, clear and unbiased reporting of studies, sharing of data, and potentially sharing of specimens.