Background
The multi-arm multi-stage (MAMS) clinical trial design described by Royston et al. [1, 2] for time-to-event outcomes and by Bratton et al. [3] for binary outcomes is a relatively simple and effective framework for accelerating the evaluation of new treatments. The design has already been successfully implemented in cancer [4] and is starting to be used in other areas such as tuberculosis [5].
In this particular family of MAMS designs, multiple experimental arms are compared to a common control at a series of interim analyses on an appropriate intermediate outcome (I) that is on the causal pathway to the definitive primary outcome of the study (D). In cancer, a common choice of D is overall survival with failure-free survival (a composite of progression-free and overall survival) used for I [6]. Alternatively, if a suitable I outcome is unavailable then D itself or, in some cases, D observed at an earlier time point could be used [7]. At each interim analysis, recruitment is stopped to experimental arms that fail to show a predetermined minimum level of benefit over the control on I. Recruitment continues to the next stage of the study to all remaining experimental arms and the control. Experimental arms that pass all interim analyses continue to the final stage of the study, at the end of which they are compared to the control on D.
Two useful measures of type I error rate in a MAMS trial are the pairwise (PWER) and familywise (FWER) type I error rates. The PWER is the probability of incorrectly rejecting the null hypothesis for D for a particular experimental arm at the end of the study regardless of other experimental arms in the study. In contrast, the FWER is the probability of incorrectly rejecting the null hypothesis for D for at least one experimental arm in a multi-arm study and gives the type I error rate for the trial as a whole. Royston et al. [2] provide a calculation for the PWER; however, it is made under the assumption that the null hypotheses for I and D for a particular experimental arm are true. In practice, a treatment that is ineffective on D may have an effect on I different from that under the null hypothesis and we show how this affects the PWER. In particular, the PWER can often be higher than the value calculated by the method of Royston et al. [2] and so we show how to determine and control its maximum value.
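The dependence of the PWER on the effect on I can be illustrated with a minimal numerical sketch. It is not the calculation of Royston et al. [2]; it assumes a hypothetical two-stage design in which the stage-1 z-statistic on I and the final z-statistic on D are bivariate normal, with negative values indicating benefit. The function name, the stagewise levels and the correlation of 0.6 are illustrative assumptions only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pwer_two_stage(mu_i, alpha_1=0.5, alpha_2=0.025, corr=0.6, steps=4000):
    """P(pass stage 1 on I and reject H0 for D) when the arm has no effect on D.

    mu_i : mean of the stage-1 z-statistic on I (0 = null for I,
           negative = benefit on I).
    corr : ASSUMED correlation between the I and D statistics (|corr| < 1).
    """
    z1 = STD_NORMAL.inv_cdf(alpha_1)  # arm continues if Z_I < z1
    z2 = STD_NORMAL.inv_cdf(alpha_2)  # H0 for D rejected if Z_D < z2
    sd_cond = sqrt(1.0 - corr ** 2)   # sd of Z_I given Z_D
    lower = z2 - 10.0                 # effectively minus infinity
    h = (z2 - lower) / steps
    total = 0.0
    # Trapezoidal rule over d for phi(d) * P(Z_I < z1 | Z_D = d)
    for k in range(steps + 1):
        d = lower + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        p_pass = STD_NORMAL.cdf((z1 - mu_i - corr * d) / sd_cond)
        total += weight * STD_NORMAL.pdf(d) * p_pass
    return total * h


if __name__ == "__main__":
    for mu in (0.0, -1.0, -2.0, -4.0):
        print(mu, pwer_two_stage(mu))
```

Under these assumptions the PWER increases as the benefit on I grows and approaches the final-stage significance level (here 0.025), because an arm with a strong effect on I passes the interim analysis almost surely.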
In a MAMS trial with more than one experimental arm, controlling the FWER rather than the PWER might be more appropriate, particularly if the trial is confirmatory [8]. A calculation of the FWER using a simulation of trial-level data has previously been described in [9] and we use this to show how the FWER can vary for different underlying treatment effects on I. We determine the scenario under which the FWER is maximised and thus describe how it may be controlled in the strong sense, that is, for any set of underlying treatment effects on I and D. In an example, we use the methodology to estimate the maximum PWER and FWER of the original design of the STAMPEDE (Systemic Therapy in Advancing or Metastatic Prostate Cancer: Evaluation of Drug Efficacy) trial in prostate cancer [6] and show how the trial design may have looked had the FWER been controlled in the strong sense at some conventional level.
Results
The STAMPEDE trial in prostate cancer started as a six-arm four-stage trial using the methodology described by Royston et al. [1, 2]. The trial used failure-free survival as I and overall survival as D. Recruitment began in 2005 and was completed in 2013. The original design of the trial is shown in Table 1. An allocation ratio of A=0.5 was used for this design so that one patient was allocated to each experimental arm for every two patients allocated to the control. Because distinct hypotheses were being tested in each of the five experimental arms, the design focus for STAMPEDE was on the pairwise comparisons of each experimental arm against control, with emphasis on the control of the pairwise type I error.
Table 1
Design of the six-arm four-stage STAMPEDE trial in prostate cancer
Stage | Target HR | Outcome | Significance level | Power | Control-arm events |
1 | 0.75 | FFS | 0.500 | 0.95 | 113 |
2 | 0.75 | FFS | 0.250 | 0.95 | 216 |
3 | 0.75 | FFS | 0.100 | 0.95 | 334 |
4 | 0.75 | OS | 0.025 | 0.90 | 403 |
Overall | | | 0.013 | 0.83 | |
Using Eq. 1, the PWER was estimated to be 0.013. However, as explained above, the maximum PWER is actually equal to the final-stage significance level, α_4=0.025. Using the calculation described in ‘Methods’, the maximum FWER of the original STAMPEDE design was 0.103.
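A rough numerical check of the maximum FWER is sketched below. It is not the calculation in ‘Methods’; it assumes that the maximum FWER equals that of a one-stage trial in which every experimental arm is tested against the shared control at the final-stage level, and that the final test statistics of any two arms have correlation A/(A+1), where A is the allocation ratio. The function name and the integration settings are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def max_fwer(n_arms, alloc_ratio, alpha_final, steps=4000, limit=8.0):
    """Approximate maximum FWER for n_arms experimental arms sharing a control.

    ASSUMPTION: the final z-statistics are equicorrelated standard normals
    with correlation rho = A / (A + 1), written as
        Z_k = sqrt(rho) * U + sqrt(1 - rho) * E_k,
    so that, given U = u, the arms are independent.
    """
    rho = alloc_ratio / (alloc_ratio + 1.0)
    z_crit = STD_NORMAL.inv_cdf(alpha_final)  # reject H0 if Z_k < z_crit
    h = 2.0 * limit / steps
    prob_no_rejection = 0.0
    # Trapezoidal rule over u for phi(u) * P(no arm rejected | U = u)
    for k in range(steps + 1):
        u = -limit + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        p_reject = STD_NORMAL.cdf((z_crit - sqrt(rho) * u) / sqrt(1.0 - rho))
        prob_no_rejection += weight * STD_NORMAL.pdf(u) * (1.0 - p_reject) ** n_arms
    return 1.0 - prob_no_rejection * h


if __name__ == "__main__":
    # Original STAMPEDE design: five experimental arms, A = 0.5, alpha_4 = 0.025
    print(max_fwer(n_arms=5, alloc_ratio=0.5, alpha_final=0.025))
```

With the STAMPEDE inputs this sketch gives a value close to the 0.103 reported above, and with a single experimental arm it reduces to the final-stage significance level.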
Although the FWER was not controlled in STAMPEDE, below we use the trial in an example to show how strong FWER control can be achieved in a MAMS design with I≠D. Using a search procedure over α_4 in nstage, similar to that used above for the two-stage designs, we found that final-stage significance levels of α_4=0.0054 and α_4=0.0113 would have been required to control the FWER at 2.5 % and 5 %, respectively. Stata code for determining the final-stage significance level for a FWER of 2.5 % is shown in the Appendix.
Consequently, the required number of D events on the control arm in the final stage would have increased from 403 to 558 and 485, respectively (as estimated by nstage), which may have led to a prolonged trial should any experimental arm reach the final stage. Thus, investigators designing and conducting a trial should consider carefully the necessity of controlling the FWER in their trial, and whether it is achievable from a practical point of view.
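The search for a final-stage significance level that gives a target maximum FWER can be sketched as a simple bisection. This is not the Stata code in the Appendix or the nstage search itself; it assumes the one-stage, Dunnett-type approximation for the maximum FWER with between-arm correlation A/(A+1), and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def max_fwer(n_arms, alloc_ratio, alpha_final, steps=4000, limit=8.0):
    """Maximum FWER under an ASSUMED equicorrelated-normal, one-stage model."""
    rho = alloc_ratio / (alloc_ratio + 1.0)
    z_crit = STD_NORMAL.inv_cdf(alpha_final)
    h = 2.0 * limit / steps
    prob_no_rejection = 0.0
    for k in range(steps + 1):  # trapezoidal rule over the shared component u
        u = -limit + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        p_reject = STD_NORMAL.cdf((z_crit - sqrt(rho) * u) / sqrt(1.0 - rho))
        prob_no_rejection += weight * STD_NORMAL.pdf(u) * (1.0 - p_reject) ** n_arms
    return 1.0 - prob_no_rejection * h


def final_alpha_for_fwer(target_fwer, n_arms, alloc_ratio, tol=1e-7):
    """Bisection for the final-stage level whose maximum FWER equals target_fwer."""
    # Bonferroni gives a level that is too small or exact; the unadjusted
    # level is too large or exact, so the solution lies between the two.
    lo, hi = target_fwer / n_arms, target_fwer
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if max_fwer(n_arms, alloc_ratio, mid) > target_fwer:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for target in (0.025, 0.05):
        print(target, final_alpha_for_fwer(target, n_arms=5, alloc_ratio=0.5))
```

For the STAMPEDE inputs the results should be broadly in line with the levels of 0.0054 and 0.0113 quoted above, although exact agreement with nstage is not guaranteed under these simplifying assumptions.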
Discussion
The MAMS design is an effective and relatively simple approach for accelerating the evaluation of multiple new treatments. It works by simultaneously assessing experimental arms against a common control in a single trial, stopping recruitment to poorly performing arms during the trial, and allowing interim assessments to be based on an outcome that is observed earlier than the primary outcome of the study. In this article, we described how the type I error rate for each individual experimental arm and for the trial as a whole can be determined and controlled in I≠D designs and I=D designs with non-binding stopping guidelines. We also investigated the impact of the underlying treatment effect on the type I error rate in I≠D designs and showed that it is possible for the PWER to be higher than previously thought, with the maximum value being equal to the final-stage significance level of the trial, α_J. Similarly, for I≠D designs with more than two arms, the maximum FWER does not depend on the stagewise significance levels prior to the final stage and can be calculated simply by treating the design as a standard one-stage trial with the PWER equal to α_J. We found that even for arms with modest effects on I but no effect on D (a scenario often seen in practice), the type I error rate can approach these maximum values quite rapidly. Thus, controlling the maximum PWER or FWER should be an important design consideration in any future MAMS trials.
An advantage of controlling the maximum PWER or FWER of the trial by α_J is the increased flexibility of allowing recruitment to poorly performing experimental arms to be continued to the next stage without inflating the type I error rate. This flexibility allows arms showing promising effects on other important outcome measures to be assessed further, albeit at the expense of a larger sample size. Interim stopping guidelines can also be non-binding in I=D designs if the maximum PWER and FWER are controlled by α_J only. Another benefit is that the FWER calculation is somewhat simplified and is similar to the Dunnett procedure for a one-stage trial [15]. However, I=D MAMS designs with binding stopping rules may also be used in practice and so a method for controlling their PWER or FWER is required. Alternatively, other approaches to designing MAMS trials with a single normally distributed outcome have been proposed in [11, 13]. Methods for controlling the FWER in these designs are available (e.g. using the mams package in R) and, unlike the MAMS designs we have considered in this paper, stopping guidelines for efficacy such as those in standard group sequential trials (e.g. [16, 17]) can be built into the design. Other approaches are also available for multi-arm trials with strong FWER control where only the most promising treatment is to be selected at an interim analysis based on a combination of both short- and long-term endpoint data [18, 19]. Such designs are, therefore, more suited to situations where the best of several treatments is to be determined, as might often be the case in a pharmaceutical setting.
There is currently much debate over whether the FWER should be controlled in a multi-arm study. It has been argued that FWER control is most appropriate in confirmatory settings [20] and has also been proposed for exploratory studies to limit the chance of evaluating an ineffective treatment in a potentially expensive confirmatory study [8]. However, Hughes [21] argues that adjusting for multiple comparisons should not be a requirement, since no such adjustment would have been made if each experimental arm were evaluated in a separate two-arm study. Freidlin et al. [22] suggest that this argument is only reasonable if each treatment is distinct and a multi-arm trial was used purely for efficiency reasons. If, on the other hand, the experimental arms are closely related (e.g. if they are different doses or schedules of the same drug), then the FWER should be controlled. Despite this guidance, Wason et al. [12] show that many multi-arm confirmatory trials do not correct for multiple testing even if the treatments are closely related. It remains unclear whether the FWER should be controlled in confirmatory trials of several distinct treatments and further guidance from regulators is required [12].
There has recently been much discussion over the adding of arms to an ongoing MAMS design, such as the STAMPEDE trial, which to date has added three new arms since it commenced [8, 23, 24]. Adding new experimental arms is advantageous as it obviates the often lengthy process of initiating a new trial. However, the impact of adding arms on the FWER in the class of MAMS designs discussed here has not yet been fully explored. Therefore, methods for quantifying and, in some cases, controlling the FWER in such a trial are required. In addition, it is not clear how much the FWER will be inflated when arms are added only when existing arms are dropped for lack of benefit. A related question is whether a sequentially rejective procedure, such as that described by Proschan et al. [25], could be applied to the MAMS design [26]. Such a procedure relaxes future stopping guidelines if arms are dropped during the course of the trial, so that the power for the remaining comparisons is increased without inflating the FWER. For instance, if a two-stage trial initially has two experimental arms and recruitment to one arm is stopped at the first analysis, then the question is whether a final-stage significance level that is higher than that proposed in the initial design could be used.
Abbreviations
FFS, failure-free survival; FWER, familywise error rate; HR, hazard ratio; MAMS, multi-arm multi-stage; OS, overall survival; PWER, pairwise error rate; STAMPEDE, Systemic Therapy in Advancing or Metastatic Prostate Cancer: Evaluation of Drug Efficacy
Acknowledgments
We are grateful to Patrick Royston for his helpful comments on a previous draft. We also thank the associate editor and two reviewers for their useful comments on the earlier version of this article. This work was supported by the UK Medical Research Council (MRC) London Hub for Trials Methodology Research, through grant MC_EX_G0800814 (510636,MQEL).