Background
Multiple primary outcomes may be specified in a randomised controlled trial (RCT) when it is not possible to use a single outcome to fully characterise the effect of an intervention on a disease process [1–3]. The use of multiple primary outcomes (or ‘endpoints’) is becoming increasingly common in RCTs. For example, a third of neurology and psychiatry trials use multiple primary outcomes [4]. Data on two primary outcomes (abstinence and time to dropout from treatment) were collected in a trial evaluating the effectiveness of a behavioural intervention for substance abuse [5] and data on four primary outcomes were collected in a trial evaluating a multidisciplinary intervention in patients following a stroke [6]. Typically, these outcomes are correlated and often one or more of the outcomes has missing values.
Typically, multiple statistical tests are performed to investigate the effectiveness of the intervention on each outcome. If two outcomes are analysed independently of each other at the nominal significance level of 0.05, then the probability of finding at least one false positive significant result increases to 1 − (1 − 0.05)² ≈ 0.098. This probability is known as the familywise error rate (FWER). One approach to control the FWER at its desired level is to adjust the p-values corresponding to each statistical test used to investigate the intervention effects. Many adjustments have been proposed, including the Bonferroni [7], Holm [8], Hochberg [9], Hommel [10] and Dubey/Armitage-Parmar [11] methods. Once the p-values have been adjusted, they can be compared to the nominal significance level. For example, in the trial on substance abuse [5], two unadjusted p-values, 0.010 and 0.002, were reported. If the Bonferroni method had been used, the p-values would have been adjusted to 0.020 and 0.004 and compared to the significance level α of 0.05. Alternatively, the significance level could be adjusted (to 0.05/2 = 0.025 in this example) and compared to the unadjusted p-values.
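The arithmetic in this paragraph can be reproduced with a short sketch. The function names below are illustrative only and do not come from any package used in the paper; the FWER formula assumes the tests are independent, as in the text.

```python
def fwer_independent(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


if __name__ == "__main__":
    # Two outcomes tested at the nominal 0.05 level
    print(round(fwer_independent(0.05, 2), 4))  # 0.0975
    # Unadjusted p-values reported in the substance abuse trial example
    print([round(p, 3) for p in bonferroni_adjust([0.010, 0.002])])  # [0.02, 0.004]
```

Comparing the adjusted p-values to 0.05 gives the same decisions as comparing the unadjusted p-values to 0.05/2 = 0.025.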
In clinical trials, it is also important to consider the power of the tests to detect an intervention effect. In the context of multiple outcomes, the power of the study can be defined in a number of ways depending on the clinical objective of the trial: i) ‘disjunctive power’, ii) ‘conjunctive power’ or iii) ‘marginal power’ [12].
The disjunctive power (or minimal power [13]) is the probability of finding at least one true intervention effect across all of the outcomes [12, 14]. The conjunctive power (or maximal power [13]) is the probability of finding a true intervention effect on all outcomes [14]. It may be noted that the disjunctive and conjunctive power have previously been referred to as ‘multiple’ and ‘complete’ power respectively [13]. The marginal (or individual) power is the probability of finding a true intervention effect on a particular outcome and is calculated separately for each outcome. When the clinical objective is to detect an intervention effect for at least one of the outcomes, the disjunctive power and marginal power are recommended, whereas the conjunctive power is recommended when the clinical objective is to detect an intervention effect on all the outcomes [12, 14]. In this paper, we focus on the former clinical objective and therefore on disjunctive and marginal power.
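To make the three definitions concrete, the following sketch computes the disjunctive and conjunctive power from a set of marginal powers. It assumes the outcomes are independent and ignores any multiplicity adjustment, which is a simplification made here for illustration; with correlated outcomes these product formulas no longer hold. The function names are hypothetical.

```python
from math import prod


def disjunctive_power(marginal_powers):
    """P(at least one effect detected), assuming independent outcomes."""
    return 1.0 - prod(1.0 - power for power in marginal_powers)


def conjunctive_power(marginal_powers):
    """P(all effects detected), assuming independent outcomes."""
    return prod(marginal_powers)


if __name__ == "__main__":
    marginal = [0.80, 0.80]  # illustrative marginal power for two outcomes
    print(round(disjunctive_power(marginal), 2))  # 0.96
    print(round(conjunctive_power(marginal), 2))  # 0.64
```

Under these assumptions the disjunctive power is at least as large as every marginal power, and the conjunctive power is no larger than any of them.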
The power requirements of a trial should match the clinical objective, which needs to be pre-specified when designing the study, and the sample size calculation should be performed accordingly. In current practice, the sample size calculations for trials often focus on the marginal power for each outcome. An approach that has been recommended and is often used in trials is to calculate the sample size separately for each of the primary outcomes by applying a Bonferroni correction to adjust the significance level [15]. The largest value of the sample size is then considered as the final sample size for the trial [16].
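The approach described above can be sketched as follows. This is a minimal illustration that assumes a two-arm trial with equal allocation, standardised effect sizes and the usual normal-approximation formula for comparing two means; it is not the methodology of any specific package, and the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80, n_outcomes=1):
    """Per-arm sample size for one outcome at a Bonferroni-adjusted level.

    Normal approximation for a two-sided comparison of two means with
    unit variance; an assumption made for this sketch.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - (alpha / n_outcomes) / 2.0)
    z_beta = z.inv_cdf(power)
    return ceil(2.0 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


def trial_n_per_arm(effect_sizes, alpha=0.05, power=0.80):
    """Largest per-outcome sample size, each computed at level alpha / M."""
    m = len(effect_sizes)
    return max(n_per_arm(d, alpha, power, m) for d in effect_sizes)


if __name__ == "__main__":
    print(n_per_arm(0.35))                  # 129 (no adjustment)
    print(n_per_arm(0.35, n_outcomes=2))    # 156 (Bonferroni, two outcomes)
    print(trial_n_per_arm([0.2, 0.4]))      # 476 (driven by the smaller effect)
```

The outcome with the smallest anticipated effect size determines the final sample size under this rule.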
Missing outcome data are common in RCTs [17] and will inevitably reduce the power and efficiency of the study [18], which may result in failure to detect true intervention effects as statistically significant.
When using multiple primary outcomes, there is limited guidance as to which method(s) should be used to take account of multiplicity in the sample size calculation and during the statistical analysis.
Some studies have compared a selection of methods which adjust p-values to account for multiplicity when handling multiple outcomes in trials. Sankoh, Huque and Dubey [11] compare a selection of adjustment methods for statistical analysis in terms of FWER but they do not evaluate the methods with respect to the power obtained. Blakesley et al. discuss both FWER and power requirements for selected methods for a large number of outcomes with varying degrees of correlation [19]. Lafaye de Micheaux et al. provide formulae to calculate the power and sample size for multiple outcomes [20], which require several assumptions to be made about the outcomes, including normality and whether the covariance matrix between the outcomes is known or not. They discuss global testing procedures, including the Hotelling T² method. None of these studies have investigated the adjustment methods in the presence of missing data.
There is limited literature discussing the sample size requirements for clinical trials with multiple primary outcomes where the clinical objective is to detect an intervention effect for at least one of the outcomes. Dmitrienko, Tamhane and Bretz [14] and Senn and Bretz [13] provide some discussion regarding the sample size in the context of multiple outcomes. However, neither discusses sample size in the context of which adjustment method should be used, and they do not provide a comparative table showing the implications of the type of desired power for the required sample sizes.
In this paper, we compare easy-to-use methods to adjust p-values in terms of FWER and power, when investigating two, three and four outcomes in the presence of complete outcome data and outcome data with missing values. We also consider a range of correlations between the outcomes. We consider both marginal and disjunctive power. Based on our findings, we provide practical recommendations on the adjustment methods which could be used for the sample size calculation and analysis of RCTs with multiple primary outcomes. We also present tables showing the implications of using the marginal and disjunctive power on the required sample size for a trial under different scenarios.
Methods
We assume that we have a two-arm trial in which there are M primary outcomes. We are interested in testing the null hypotheses \( H_j \) (j = 1, … , M) that there is no intervention effect on the nominated outcomes. The test statistics \( t_j \) are used to test the null hypotheses \( H_j \). Further suppose that there is an overall null hypothesis \( H(M)={\bigcap}_{j=1}^M{H}_j. \) Under this overall hypothesis, the joint test statistic \( (t_1, \dots , t_M) \) has an M-variate distribution. We denote by \( p_j \) the marginal, unadjusted p-values obtained from the appropriate statistical test associated with analysing each outcome separately in a univariate framework. For example, when analysing continuous outcomes, an unpaired Student’s t-test may be used, or when analysing binary outcomes a Chi-squared test may be used to investigate the intervention. To control the FWER, a correction method is then applied to the unadjusted p-values \( p_j \). We compare the following commonly used adjustment methods in this paper: Šidák, Bonferroni, Holm, Hochberg and Hommel. In addition, we consider the Dubey/Armitage-Parmar (D/AP) adjustment and the Stepdown minP resampling procedure, which take account of the pairwise correlation between the outcomes.
The method proposed by Šidák is defined as \( p_j^{\mathrm{\check{S}i}}=1-{\left(1-p_j\right)}^M \). Equivalently, the significance level could be adjusted to \( \alpha^{\mathrm{\check{S}i}}=1-{\left(1-\alpha \right)}^{1/M} \), where α is the unadjusted significance level. Under the assumption that the outcomes are independent, the adjustment can be derived as
$$ \begin{array}{c} P\left(\text{no Type I error on } 1 \text{ test}\right)=1-\alpha^{\mathrm{\check{S}i}},\\ \to P\left(\text{no Type I error on } M \text{ tests}\right)={\left(1-\alpha^{\mathrm{\check{S}i}}\right)}^M,\\ \to P\left(\text{at least one Type I error on } M \text{ tests}\right)=1-{\left(1-\alpha^{\mathrm{\check{S}i}}\right)}^M=\alpha . \end{array} $$
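A short numerical check of the Šidák adjustment is sketched below. The function names are illustrative; the second function confirms that testing each of M independent outcomes at the adjusted level returns an overall error rate of α.

```python
def sidak_adjust(p_values):
    """Sidak-adjusted p-values: 1 - (1 - p)^M."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def sidak_alpha(alpha, n_tests):
    """Per-test significance level: 1 - (1 - alpha)^(1/M)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


if __name__ == "__main__":
    alpha_si = sidak_alpha(0.05, 2)
    print(round(alpha_si, 5))                      # 0.02532
    print(round(1.0 - (1.0 - alpha_si) ** 2, 5))   # 0.05
    print([round(p, 6) for p in sidak_adjust([0.010, 0.002])])  # [0.0199, 0.003996]
```

The adjusted level 0.02532 is slightly larger than the Bonferroni level 0.025, so the Šidák adjustment is marginally less conservative.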
The Bonferroni method is the most common approach to account for multiplicity due to its simplicity. In this method, the unadjusted p-values \( p_j \) are multiplied by the number of primary outcomes M (and capped at 1).
The Dubey/Armitage-Parmar (D/AP) method is an ad hoc method based on the Šidák method, which takes into account the correlation between the outcomes [11]. The adjusted p-value is \( p_j^{adj}=1-{\left(1-p_j\right)}^{g(j)} \), where \( g(j)=M^{1-\mathrm{mean}\,\rho (j)} \) and \( \mathrm{mean}\,\rho (j) \) is the mean correlation between the jth outcome and the remaining M − 1 outcomes. When using this method in the analysis of multiple outcomes, the mean correlation may be estimated from the data. There has been little theoretical work to assess the performance of this approach [11]. One of the attractive properties of the D/AP procedure, which may have contributed to its development, is that when the mean correlation is zero, g(j) = M and the D/AP adjustment coincides with the Šidák adjustment (and is therefore approximately the Bonferroni adjustment), and when the mean correlation is one, g(j) = 1 and the D/AP adjusted and the unadjusted p-values are the same.
The Holm method [8] is a step-down method, whereby the unadjusted p-values are ordered from smallest \( p_{(1)} \) to largest \( p_{(M)} \) and each unadjusted p-value is adjusted as \( p_{(k)}^{Holm}=\left(M-k+1\right)\,p_{(k)} \), where k = 1, … , M is the rank of the corresponding p-value. Then, starting with the most significant (smallest) p-value, each adjusted p-value is compared to the nominal significance level until a p-value greater than the significance level is observed, after which the method stops and no further outcomes are declared significant [21].
The Hochberg step-up method [9] is similar to the Holm step-down method but works in the other direction. For this method, the unadjusted p-values are ranked from largest \( p_{(M)} \) to smallest \( p_{(1)} \) and adjusted as \( p_{(k)}^{Hoch}=\left(M-k+1\right)\,p_{(k)} \). Starting with the least significant (largest) p-value, each adjusted p-value is compared to the pre-specified significance level until a p-value lower than the significance level is observed, after which the method stops and that outcome and all outcomes with smaller p-values are declared significant [21]. Contrary to the Šidák based approaches, this is a semiparametric method, meaning the FWER is only controlled when the joint distribution of the hypothesis test statistics is known, most commonly multivariate normal [22].
The Hommel method [10] is another data-driven stepwise method. For this method, the unadjusted p-values are ranked from largest \( p_{(M)} \) to smallest \( p_{(1)} \). Then let l be the largest integer for which \( p_{\left(M-l+j\right)}>\frac{j\alpha }{l} \) for all j = 1, … , l. If no such l exists, then all outcomes can be deemed statistically significant; otherwise, all outcomes with \( p_i\le \frac{\alpha }{l} \) may be deemed statistically significant, where i = 1, … , M. To control the FWER, the Hommel method requires that the joint distribution of the overall hypothesis test statistic is known.
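The D/AP, Holm, Hochberg and Hommel procedures described above can be sketched in a few lines. This is an illustrative implementation rather than the code used in the paper: the function names are hypothetical, the Holm and Hochberg functions add the usual running maximum or minimum (and a cap at 1) so that the adjusted p-values reproduce the sequential stopping rules, and the Hommel function returns reject decisions using the rule with the largest integer l and threshold α/l.

```python
def dap_adjust(p_values, mean_correlations):
    """D/AP: 1 - (1 - p_j)^g(j) with g(j) = M^(1 - mean rho(j))."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** (m ** (1.0 - r))
            for p, r in zip(p_values, mean_correlations)]


def holm_adjust(p_values):
    """Step-down: smallest p-value first, multiplier M - k + 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order, start=1):
        running_max = max(running_max, (m - rank + 1) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def hochberg_adjust(p_values):
    """Step-up: largest p-value first, multiplier M - k + 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for step, i in enumerate(order):
        rank = m - step  # rank in ascending order; the largest has rank M
        running_min = min(running_min, (m - rank + 1) * p_values[i])
        adjusted[i] = running_min
    return adjusted


def hommel_reject(p_values, alpha=0.05):
    """Reject decisions from the Hommel rule."""
    m = len(p_values)
    ordered = sorted(p_values)  # ordered[k - 1] is p_(k)
    largest_l = 0
    for l in range(1, m + 1):
        if all(ordered[m - l + j - 1] > j * alpha / l for j in range(1, l + 1)):
            largest_l = l
    if largest_l == 0:
        return [True] * m
    return [p <= alpha / largest_l for p in p_values]


if __name__ == "__main__":
    p = [0.03, 0.04]  # hypothetical unadjusted p-values
    print(holm_adjust(p))      # [0.06, 0.06]: nothing significant at 0.05
    print(hochberg_adjust(p))  # [0.04, 0.04]: both significant at 0.05
    print(hommel_reject(p))    # [True, True]
```

The hypothetical example shows why the Hochberg and Hommel methods can be more powerful than Holm: both p-values are below 0.05 but neither is below 0.025.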
Another step-down method to adjust p-values is the ‘Stepdown minP’ procedure [23, 24]. Unlike the previous methods, it does not make any assumptions regarding the distribution of the joint test statistic. Instead, it attempts to approximate the true joint distribution by using a resampling approach. This method takes into account the correlation structure between the outcomes and therefore may yield more powerful tests compared to the other adjustment methods [25]. The Stepdown minP adjusted p-values are calculated as follows: 1) calculate the observed test statistics using the observed data set; 2) resample the data with replacement within each intervention group to obtain bootstrap resamples, compute the resampled test statistics for each resampled data set and construct the reference distribution using the centred and/or scaled resampled test statistics; 3) calculate the critical value of a level α test based on the upper α percentile of the reference distribution, or obtain the raw p-values by computing the proportion of bootstrapped test statistics that are as extreme or more extreme than the observed test statistic [26]. That is, the Stepdown minP adjusted p-value for the jth outcome is defined as [24, 26] \( p_j^{minP}=\max_{k=1,\dots, j}\left\{\Pr \left(\min_{l=k,\dots, M}\,p_l\le p_k\;\middle|\;H(M)\right)\right\}, \) where \( p_k \) is the unadjusted p-value for the kth outcome, \( p_l \) is the unadjusted p-value for the lth outcome (l = k, … , M), and H(M) is the overall null hypothesis.
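The final step of the definition can be sketched as follows, assuming that a matrix of resampled p-values generated under the overall null hypothesis is already available (how those resamples are produced, for example by bootstrapping centred data within each arm, is outside this sketch). The function name and inputs are hypothetical and this is not the implementation in the NPC package.

```python
def stepdown_minp(observed_p, null_p):
    """Step-down minP adjusted p-values.

    observed_p: list of M unadjusted p-values.
    null_p: list of B rows, each holding M p-values resampled under H(M).
    """
    m = len(observed_p)
    n_resamples = len(null_p)
    order = sorted(range(m), key=lambda j: observed_p[j])  # most significant first
    counts = [0] * m
    for row in null_p:
        running_min = 1.0
        # Successive minima over the outcomes ranked k, ..., M
        for k in range(m - 1, -1, -1):
            running_min = min(running_min, row[order[k]])
            if running_min <= observed_p[order[k]]:
                counts[k] += 1
    adjusted = [0.0] * m
    running_max = 0.0
    for k in range(m):
        # Enforce monotonicity: the maximum over ranks 1, ..., k
        running_max = max(running_max, counts[k] / n_resamples)
        adjusted[order[k]] = running_max
    return adjusted


if __name__ == "__main__":
    observed = [0.04, 0.01]  # hypothetical unadjusted p-values
    resampled = [            # hypothetical null resamples (B = 5)
        [0.50, 0.30],
        [0.03, 0.60],
        [0.20, 0.005],
        [0.90, 0.80],
        [0.02, 0.50],
    ]
    print(stepdown_minp(observed, resampled))  # [0.4, 0.2]
```

In practice B would be in the thousands; five rows are used here only so that the counting can be followed by hand.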
Although resampling-based methods have previously been recommended for clinical trials with multiple outcomes, they are not widely used in practice [25]. The Stepdown minP has been shown to perform well when compared to other resampling procedures [26] and was therefore investigated in this paper.
We perform a simulation study to evaluate the validity of these methods to account for potentially correlated multiple primary outcomes in the analysis and sample size of RCTs. We focus on two, three and four outcomes, as a review of trials with multiple primary outcomes in the psychiatry and neurology field found that the majority of the trials had considered two primary outcomes [4]. Additionally, it has been recommended that a trial should have no more than four primary outcomes [27]. We estimate the familywise error rate (FWER), the disjunctive power to detect at least one intervention effect and the marginal power to detect an intervention effect on a nominated outcome in a variety of scenarios.
Simulation study
We used the following model to simulate values for two continuous outcomes \( \boldsymbol{Y}_i={\left(Y_{i,1},Y_{i,2}\right)}^T \),
$$ \boldsymbol{Y}_i=\boldsymbol{\beta}_0+\boldsymbol{\beta}_1 x_i+\boldsymbol{\epsilon}_i \tag{2} $$
where \( x_i \) indicates whether participant i received intervention or control, \( \boldsymbol{\beta}_1={\left(\beta_{11},\beta_{12}\right)}^T \) is the vector of the intervention effects for each outcome, and \( \boldsymbol{\epsilon}_i \) are errors which are realisations of a multivariate normal distribution \( \boldsymbol{\epsilon}_i={\left(\epsilon_{i,1},\epsilon_{i,2}\right)}^T\sim N\left(\left(\begin{array}{c}0\\ 0\end{array}\right),\left(\begin{array}{cc}1& \rho \\ \rho & 1\end{array}\right)\right), \) with ρ ∈ {0.0, 0.2, 0.4, 0.6, 0.8}. The model was also extended to simulate three and four continuous outcomes. When simulating three and four outcomes we specified compound symmetry, meaning that the correlation between any pair of outcomes is the same. We explored both uniform intervention effect sizes and varying effect sizes across outcomes. For the uniform intervention effect sizes, we specified an effect size of 0.35 for all outcomes, that is \( \boldsymbol{\beta}_1={\left(0.35,0.35\right)}^T \), \( \boldsymbol{\beta}_1={\left(0.35,0.35,0.35\right)}^T \) or \( \boldsymbol{\beta}_1={\left(0.35,0.35,0.35,0.35\right)}^T \) for the two, three and four outcome scenarios respectively. This represents a medium effect size, which reflects the anticipated effect size in many RCTs [28]. For the varying intervention effect sizes, we specified that \( \boldsymbol{\beta}_1={\left(0.2,0.4\right)}^T \), \( \boldsymbol{\beta}_1={\left(0.2,0.3,0.4\right)}^T \) or \( \boldsymbol{\beta}_1={\left(0.1,0.2,0.3,0.4\right)}^T \) for the two, three and four outcome scenarios respectively. We also explored the effect of skewed data by transforming the outcome data with uniform intervention effect sizes to have a gamma distribution with shape parameter 2 and scale parameter 2. The gamma distribution is often used to model healthcare costs in clinical trials [29, 30] and may also be appropriate for skewed clinical outcomes.
We set the sample size to 260 participants, with an equal number of participants assigned to each arm. This provides 80% marginal power to detect a clinically important effect size of 0.35 for each outcome, using an unpaired Student’s t-test at an unadjusted significance level of 0.05. We introduced missing data under the assumption that the data were missing completely at random (MCAR). When simulating two outcomes, 15% and 25% of the observations in outcomes 1 and 2 respectively are missing, and on average approximately 4% of the observations (0.15 × 0.25) would be missing for both outcomes. When simulating three outcomes, 15% of the observations are missing in one outcome and 25% of the observations are missing in the other two outcomes. When simulating four outcomes, 15% of the observations are missing in two outcomes and 25% of the observations are missing in the other two outcomes. This proportion of missingness in outcomes is often observed in RCTs [31–34].
We estimated the FWER and disjunctive power by specifying no intervention effect (β1j = 0) and an intervention effect (β1j ≠ 0), respectively, and calculating the proportion of times an intervention effect was observed on at least one of the outcomes. The marginal power was similarly estimated but we calculated the proportion of times an intervention effect was observed on the nominated outcome. For each scenario we ran 10,000 simulations. The simulations were run using R version 3.4.2. The Stepdown minP procedure was implemented using the NPC package.
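A scaled-down version of this simulation for two outcomes can be sketched as follows. It is an illustration under stated assumptions, not the authors' R code: the intercept is set to zero, p-values come from a normal approximation rather than Student's t-test, only the Bonferroni adjustment is applied, each outcome is analysed on its available cases, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean, variance

STD_NORMAL = NormalDist()


def z_test_p_value(control, treated):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(control) / len(control) + variance(treated) / len(treated)) ** 0.5
    z = (fmean(treated) - fmean(control)) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate_trial(n_per_arm, beta1, rho, missing, rng):
    """Simulate one two-outcome trial; return the two unadjusted p-values."""
    data = {0: ([], []), 1: ([], [])}  # arm -> (outcome 1 values, outcome 2 values)
    for arm in (0, 1):
        for _ in range(n_per_arm):
            e1 = rng.gauss(0.0, 1.0)
            # Second error has unit variance and correlation rho with the first
            e2 = rho * e1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
            for j, err in enumerate((e1, e2)):
                if rng.random() >= missing[j]:  # MCAR: observed with prob 1 - missing[j]
                    data[arm][j].append(beta1[j] * arm + err)
    return [z_test_p_value(data[0][j], data[1][j]) for j in range(2)]


def estimate_rates(n_sims, n_per_arm, beta1, rho,
                   missing=(0.0, 0.0), alpha=0.05, seed=1):
    """Proportion of simulations with any rejection, and per-outcome rejections."""
    rng = random.Random(seed)
    any_hit = 0
    hits = [0, 0]
    for _ in range(n_sims):
        p_values = simulate_trial(n_per_arm, beta1, rho, missing, rng)
        rejected = [2 * p <= alpha for p in p_values]  # Bonferroni with M = 2
        any_hit += any(rejected)
        for j in range(2):
            hits[j] += rejected[j]
    return any_hit / n_sims, [h / n_sims for h in hits]


if __name__ == "__main__":
    fwer, _ = estimate_rates(2000, 130, (0.0, 0.0), rho=0.4)
    power, marginal = estimate_rates(2000, 130, (0.35, 0.35), rho=0.4,
                                     missing=(0.15, 0.25))
    print("FWER:", fwer)
    print("Disjunctive power:", power, "Marginal power:", marginal)
```

With no intervention effect the first returned value estimates the FWER; with a non-zero effect it estimates the disjunctive power, and the second returned value gives the marginal power for each outcome.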
We calculated the sample size based on disjunctive power using the R package “mpe” [35] and we calculated the sample size based on the marginal power using the R package “samplesize” [36]. The statistical methodology used for the sample size calculation in these packages is described in Additional file 1.
Discussion
When using multiple primary outcomes in RCTs, it is important to control the FWER for confirmatory phase III trials. One approach to do this is to adjust the p-values produced by each statistical test for each outcome. Additionally, some of the outcomes are likely to have missing values; consequently, this needs to be considered when choosing an appropriate method to adjust the p-values.
Statistical analysis
We found that all methods investigated, except the D/AP, controlled the FWER. This agrees with the results previously reported in [19]. The Stepdown minP performed best in terms of FWER, but the R package used to implement the method uses listwise deletion, removing participants with at least one missing value before the analysis, which results in a loss of power. The validity of this approach depends on how the method is implemented and the extent of the missing data.
We recommend that the Hommel method is used to control FWER when the distributional assumptions are met, as it provides slightly more disjunctive power than the Bonferroni and Holm methods. The distributional assumption associated with the Hommel method is not restrictive and is met in many multiplicity problems arising in clinical trials [22]. Even when the data followed a skewed distribution, the Hommel method performed well, showing it may be used to analyse a variety of outcomes, including those with a skewed distribution.
Given the availability of software packages to implement the more powerful approaches, there is little reason to use the less powerful methods, such as the Holm method. For example, the Hommel method can easily be implemented in R or SAS. Even though it is not currently available in Stata or SPSS, the p-values can be copied across and adjusted in R. However, if the assumptions cannot be met, the simpler Holm method could be used.
When the intervention effect size varied across the outcomes, we found that the differences in disjunctive power between the methods were less pronounced. It appeared that the outcome with the largest effect size ‘dominated’ the disjunctive power. When the sample size is based on the disjunctive power, the outcomes with the largest effect size would have high marginal power, whereas the outcome with the smallest effect size would have low marginal power, much below the overall desired level of power. It follows that when investigators are looking for an intervention effect for at least one outcome, it is unlikely that they will see an intervention effect on the outcomes with the smaller effect sizes without seeing an intervention effect on the outcomes with the largest effect size. Consequently, in this scenario, it may be advisable to pick the outcome(s) with the largest effect size as the primary outcome(s) and treat the other outcomes as secondary outcomes; however, this decision will need to account for the relative clinical importance of the outcomes. Alternatively, when the intervention effect size varies across the outcomes, investigators may wish to consider ‘alpha spending’, in which the total alpha (usually 0.05) is distributed or ‘spent’ across the M analyses.
We appreciate that in practice the choice of the adjustment method may also depend on other factors, such as the availability of simultaneous confidence intervals and unbiased estimates. It is standard practice to report the 95% confidence intervals alongside point estimates and p-values. When using multiple primary outcomes, it may be necessary to adjust the confidence interval so that it corresponds to the p-values adjusted for multiplicity. The confidence interval may be easily adjusted when using Bonferroni or Holm adjustments, using the R function “AdjustCIs” in the package “Mediana” [38]. However, it is not straightforward to adjust the confidence interval when using the Hochberg and Hommel methods. Consequently, the confidence intervals reported may not align with the p-values when these adjustments are used. As stated in the European Medicines Agency (EMA) guidelines, in this instance, the conclusions should be based on the p-values and not the confidence intervals [3]. If confidence intervals that correspond to the chosen multiplicity adjustment are not available or are difficult to derive, then the EMA guidelines advise that simple but conservative confidence intervals are used, such as those based on the Bonferroni correction [3].
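A Bonferroni-based confidence interval of the kind mentioned above can be sketched by replacing the usual critical value with the one for level α/M. The sketch assumes a normally distributed estimate with a known standard error; the function name and the numbers in the example are hypothetical, and this is not the “AdjustCIs” implementation.

```python
from statistics import NormalDist


def bonferroni_ci(estimate, std_error, n_outcomes, alpha=0.05):
    """Two-sided interval with simultaneous coverage of at least 1 - alpha."""
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_outcomes))
    return estimate - z * std_error, estimate + z * std_error


if __name__ == "__main__":
    # Hypothetical estimate and standard error for one of two primary outcomes
    lower, upper = bonferroni_ci(0.35, 0.125, n_outcomes=2)
    print(round(lower, 3), round(upper, 3))
```

With two outcomes the critical value rises from about 1.96 to about 2.24, so each interval is roughly 14% wider than the unadjusted 95% interval.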
The statistical analysis plan of a trial should clearly describe how the outcomes will be tested, including which adjustment method, if any, will be used [39].
Our review of trials with multiple outcomes showed that the majority of the trials analysed the outcomes separately without any adjustments for multiple comparisons [4]. Where adjustment methods were used, only the most basic methods were used, possibly due to their ease of implementation. The Bonferroni method was the most commonly used method, although the Holm and Hochberg methods were also used. As a consequence, we focused on relatively simple techniques in this paper. However, more advanced approaches, such as graphical methods to control the FWER, are available and described in Bretz et al. [40] and Bretz et al. [41].
It is not necessary to control the FWER for all types of trial designs, for example, for trial designs with co-primary outcomes where all outcomes have to be declared statistically significant for the intervention to be deemed successful. The FDA guidelines state that in this scenario no adjustment needs to be made to control the FWER [39] and the ‘conjunctive’ power is used. We have not evaluated the conjunctive power as it is not relevant to the scenarios considered in this paper. The conjunctive power may be substantially reduced compared to the marginal power for each outcome [39] and is never larger than the marginal power [13]. The conjunctive power behaves in reverse to the disjunctive power in that, as the correlation between the outcomes increases, the conjunctive power increases.
Additionally, multiplicity adjustments may not be necessary for early phase drug trials. However, it is generally accepted that adjustments to control the FWER are required in confirmatory studies, that is, when the goal of the trial is the definitive proof of a predefined key hypothesis for the final decision making [42].
Sample size
When designing a clinical trial, it is important to calculate the sample size needed to detect a clinically important intervention effect. Usually the number of participants that can be recruited in a trial is restricted because of ethical, cost and time implications. The sample size calculation for a trial is usually based on an appropriate statistical method which will be used for the primary analysis depending on the study design and objectives. The sample size can vary greatly depending on whether the marginal power or overall disjunctive power is used, highlighting the importance of calculating the sample size based on the trial objective. To account for multiplicity in the sample size calculation, we recommend that the Bonferroni adjustment is used. The Bonferroni adjustment can be applied easily within the sample size calculation using an analytical formula [39], and our simulation study showed that it maintains the FWER at an acceptable level for low to moderate correlation between the outcomes. Additionally, there is not much loss in power when using the Bonferroni adjustment, compared to the other methods, in the presence of missing data. In contrast, the other methods investigated in this paper are data driven and therefore it is not clear how these can be incorporated without prior data.
One approach that has previously been used to calculate the sample size for multiple primary outcomes is to calculate the sample size based on the individual marginal powers for each outcome and to choose the maximum sample size for the trial [43]. This approach guarantees adequate marginal power for each individual test. However, this approach will overestimate the number of participants required if the investigators are interested in disjunctive power. Moreover, it may be problematic to achieve that sample size in trials where recruitment is a problem and may result in trials being closed down prematurely. Finally, the sample size should be inflated to account for the expected amount of missing data.
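One common convention for the final inflation step, assumed here rather than prescribed by the paper, is to divide the number of participants needed with complete data by the expected proportion of participants with observed outcomes. The function name and the example figures are hypothetical.

```python
from math import ceil


def inflate_for_missing(n_complete, missing_fraction):
    """Participants to recruit so that about n_complete have observed outcomes."""
    if not 0.0 <= missing_fraction < 1.0:
        raise ValueError("missing_fraction must be in [0, 1)")
    return ceil(n_complete / (1.0 - missing_fraction))


if __name__ == "__main__":
    # Hypothetical requirement of 156 participants per arm with complete data
    print(inflate_for_missing(156, 0.15))  # 184
    print(inflate_for_missing(156, 0.25))  # 208
```

When the outcomes have different expected amounts of missing data, using the largest expected proportion is the cautious choice under this convention.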
Study extensions and limitations
In this paper, we only explored continuous outcomes. However, in RCTs binary outcomes or a combination of continuous and binary outcomes may be used. For two binary outcomes, the maximum possible pairwise correlation between the outcomes will be less than one in absolute magnitude [44] and therefore we would expect similar results but with less pronounced differences between methods for the strong correlations.
Additionally, we only explored global effects, that is, either no intervention effect on any of the outcomes (β1j = 0) or an intervention effect on all the outcomes (β1j ≠ 0). Global effects are most realistic when the strength of the correlation between the outcomes is moderate to strong. However, in practice a mixture of no effects and some intervention effects may be observed, especially when the strength of the correlation between the outcomes is weak.