Design operating characteristics
We evaluate the performance of the proposed designs based on the motivating trial of the LOXL2 inhibitor. The study drug is hypothesized to exert a therapeutic effect in fibrosis and cancer by inhibiting fibroblast activation and thereby altering the pathologic matrix in different disease states. The consequences of inhibiting fibroblast activation include substantial reduction of desmoplasia, decreased expression of growth factors and cytokines, lack of formation of tumor vasculature, and increased necrosis, pyknosis, and autophagy of tumor cells. Given the hypothesized action of this MTA, the investigators assumed that efficacy either increases or increases and then plateaus in the tested dose range. Toxicity was assumed to be nondecreasing with an increasing dose. Two dose levels were to be evaluated, and a maximum of 54 patients were to be enrolled. The primary goal of the study was to determine whether both dose levels would result in a target response rate of greater than or equal to 30% against a null hypothesis of 10%. If both doses achieved the target response rate and appeared comparable, the investigators would proceed with the lower dose for subsequent testing. However, if activity was primarily seen at the higher dose level, or if both doses achieved the target response rate, yet the higher dose had a considerably higher response rate, the investigators would test the higher dose in subsequent studies.
In this study, response was defined as the clinical response based on the International Working Group criteria. In particular, stable disease with improvement in bone marrow fibrosis score, clinical improvement, partial remission, or complete remission would be considered a response. Toxicity was defined as dose-limiting toxicity (DLT) with pre-specified categories and grades. In the corresponding phase I study, four dose levels had been evaluated in patients with advanced solid tumors, and three patients had been treated at each dose level. Given a patient’s weight of 70 kg, the two middle dose levels tested in the phase I trial were very close to the dose levels considered in this study. No DLTs or drug-related severe adverse events had been observed at any dose in the phase I trial. To elicit an informative prior distribution for toxicity at a given dose, we incorporated the phase I toxicity data at the same and higher dose levels: every three patients without toxicity at a higher dose level were counted as five patients without toxicity at the lower dose level studied in this phase II trial. Our final prior distributions for the probabilities of toxicity were and , both of which were obtained by assuming a beta(0.5,0.5) prior distribution prior to observing the phase I toxicity data.
We first compare our proposed designs with two alternative designs, both of which use continuous monitoring rules for futility and efficacy. The first is an independent design that uses Bayesian hypothesis tests with a nonlocal alternative prior [15] at each dose level. The null and alternative hypotheses at the two dose levels are for $i = 1, 2$. Arm $i$ ($i = 1, 2$) is terminated for efficacy if , and is terminated for futility if , with $P_{a^\star} > 0.5$ being a cutoff value to be tuned by simulations. We conclude $H_0$ if both and are found to hold; we conclude $H_1$ if and are found to hold; and we conclude that both doses are promising if and are found to hold. For these independent designs, we cannot obtain an exact posterior probability that the two response rates are equal, so we use approximations. If both doses are concluded to be promising, then we conclude $H_2$ if $p(\theta_2 - \theta_1 > 0.1 \mid x) \le P_d$, and conclude $H_3$ otherwise, with $P_d$ being a threshold to be calibrated by simulations. We assign independent prior distributions to the toxicity probabilities at the two dose levels, terminate the trial if both dose levels are toxic, and close either arm if the corresponding dose level is toxic.
The second design we assess for comparison is based on the Bayesian isotonic regression transformation (BIT) [23]. The prior distributions are independent Uniform(0,1) distributions for both $\theta_1$ and $\theta_2$. After data are observed, the unconstrained posterior distributions of $\theta_1$ and $\theta_2$ are independent beta distributions. For each pair $(\theta_1, \theta_2)$ drawn from the unconstrained posterior beta distributions, the order-restricted posterior samples are obtained as weighted averages of $(\theta_1, \theta_2)$ when the order is violated, or otherwise remain unchanged, where the weights $\omega$ are proportional to the unconstrained posterior precision at the two dose levels. Consider three posterior probabilities: $p_1 = P(\theta_1 > \theta^\star - \delta,\ \theta_2 > \theta^\star - \delta \mid \text{data})$, $p_2 = P(\theta_1 < \theta_0 + \delta,\ \theta_2 > \theta^\star - \delta \mid \text{data})$, and $p_3 = P(\theta_1 < \theta_0 + \delta,\ \theta_2 < \theta_0 + \delta \mid \text{data})$. The trial is terminated if the maximum sample size is reached, $p_1 > P_h$, or $p_2 + p_3 > P_i$. This check is repeated after the outcome of each subsequently treated patient is observed. At the end of the trial, if $p_1 > P_h$, we conclude that both doses are promising, and claim $H_2$ if $p(\theta_1 = \theta_2 \mid \text{data}) \ge p(\theta_1 < \theta_2 \mid \text{data})$, and $H_3$ otherwise. If $p_2 + p_3 > P_i$, we conclude $H_0$ if $p_2 - p_3 \le P_j$, and conclude $H_1$ otherwise. The rule for early termination due to toxicity is the same as in the BHT approach.
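The isotonic transformation and the three monitoring probabilities described for the BIT design can be sketched by Monte Carlo as follows. This is a minimal illustration under stated assumptions, not the implementation used in the simulations: the function names are hypothetical, the posterior precision is taken to be the inverse variance of each unconstrained beta posterior, and the defaults use the values $\theta_0 = 0.1$, $\theta^\star = 0.3$, and $\delta = 0.1$ reported for the simulations.

```python
import random


def beta_precision(a, b):
    """Inverse variance of a Beta(a, b) distribution (assumed precision measure)."""
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return 1.0 / variance


def bit_posterior_probs(x1, n1, x2, n2, theta0=0.1, theta_star=0.3,
                        delta=0.1, draws=20000, seed=1):
    """Monte Carlo estimates of (p1, p2, p3) after the isotonic transformation.

    x1, x2 are the numbers of responders among n1, n2 patients at the lower
    and higher dose. Uniform(0,1) priors give Beta(x + 1, n - x + 1) posteriors.
    """
    rng = random.Random(seed)
    a1, b1 = x1 + 1, n1 - x1 + 1
    a2, b2 = x2 + 1, n2 - x2 + 1
    w1, w2 = beta_precision(a1, b1), beta_precision(a2, b2)
    low, high = theta0 + delta, theta_star - delta
    c1 = c2 = c3 = 0
    for _ in range(draws):
        t1 = rng.betavariate(a1, b1)
        t2 = rng.betavariate(a2, b2)
        if t1 > t2:
            # Order violated: pool both draws into a precision-weighted average.
            t1 = t2 = (w1 * t1 + w2 * t2) / (w1 + w2)
        c1 += t1 > high and t2 > high   # both doses near or above the target
        c2 += t1 < low and t2 > high    # only the higher dose promising
        c3 += t1 < low and t2 < low     # neither dose promising
    return c1 / draws, c2 / draws, c3 / draws


if __name__ == "__main__":
    # Hypothetical interim data: 1/12 responders at the lower dose, 6/12 at the higher.
    print(bit_posterior_probs(1, 12, 6, 12))
```

Because the pooled draws satisfy $\theta_1 = \theta_2$ exactly, the same samples also give a positive estimate of $p(\theta_1 = \theta_2 \mid \text{data})$, which is what the final $H_2$ versus $H_3$ comparison relies on.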
Given that little toxicity was found in the phase I studies, we assumed low toxicity probabilities in our simulation scenarios, specifically, 0.15 at both dose levels. The upper limit of the toxicity probability . The null and target response rates are $\theta_0 = 0.1$ and $\theta^\star = 0.3$. We chose $\tau_1 = 0.06$ and $\tau_2 = 0.015$ to correspond with a prior mode at 0.3 and a value of interest, the between-dose difference in response rate of 0.1. The between-dose difference of interest refers to a minimal clinically meaningful difference between doses. To facilitate the comparison of the performances of several methods, the cutoffs of each method were chosen to approximately match the resulting type I error and average sample size under $H_0$, based on simulations. The cutoffs used were $P_a = 0.65$, $P_b = 1.2$, $P_c = 0.8$, $P_{a^\star} = 0.7$, $P_d = 0.11$, $P_e = 0.7$, $P_f = 0.65$, $P_g = 1.3$, $P_h = 0.65$, $P_i = 0.65$, $P_j = 0.4$, $P_k = 0.02$, and $\delta = 0.1$. For the independent design, the maximum sample size was set at 27 at each dose level. For all designs, we used a minimum total sample size of 24 (12 at each dose level for the independent design) as “burn-in”, and continuous monitoring of futility and efficacy after the burn-in period. In addition, we monitored toxicity continuously starting from the first patient. We constructed 12 scenarios with different true response rates at the two dose levels. Under each scenario, we simulated 1,000 trials.
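The stated link between the $\tau$ values and the prior mode can be checked numerically. The sketch below rests on an assumption that this section does not state: that the nonlocal prior is an inverse moment (iMOM) prior with shape parameters $k = 1$ and $\nu = 2$. The function names are hypothetical. Under that assumption the mode lies $\sqrt{2\tau/3}$ away from the null value, which gives a distance of 0.2 for $\tau_1 = 0.06$ (a mode at $0.1 + 0.2 = 0.3$) and 0.1 for $\tau_2 = 0.015$.

```python
import math


def imom_log_density(d, tau, k=1.0, nu=2.0):
    """Unnormalised log iMOM density at distance d > 0 from the null value.

    Assumed kernel: (d^2)^(-(nu + 1) / 2) * exp(-(d^2 / tau)^(-k)).
    """
    u = d * d
    return -(nu + 1) / 2 * math.log(u) - (u / tau) ** (-k)


def imom_mode_offset(tau, k=1.0, nu=2.0):
    """Closed-form distance from the null value to the prior mode."""
    return math.sqrt(tau * (2 * k / (nu + 1)) ** (1 / k))


def grid_mode_offset(tau, step=1e-4, upper=0.9):
    """Numerical check: maximise the log density over a fine grid of distances."""
    grid = [i * step for i in range(1, int(upper / step))]
    return max(grid, key=lambda d: imom_log_density(d, tau))


if __name__ == "__main__":
    for tau in (0.06, 0.015):
        print(tau, round(imom_mode_offset(tau), 4), round(grid_mode_offset(tau), 4))
```

If the prior used in the designs has a different form or different shape parameters, the closed-form expression changes, but the grid search can be reused with the corresponding log density.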
The operating characteristics of the four designs are summarized in Table 1, with the joint BHT design labeled ‘BHT-A’, the independent BHT design labeled ‘indep’, the BMA design ‘BMA’, and the BIT design ‘BIT’. We also conducted a sensitivity analysis to evaluate different prior probabilities of the four hypotheses under the joint BHT design. Instead of assigning equal prior probability to each of the four hypotheses, we assigned equal prior probability, i.e., 1/3, to each of $H_0$, $H_1$, and the composite hypothesis that both doses are promising, and hence 1/6 probability to each of $H_2$ and $H_3$. The corresponding results are shown in Table 1 under column ‘BHT-B’. Under each scenario, we list the true response rates at the two dose levels in the top row, the probability of concluding that each of the four hypotheses is true in the next four rows, the probability of concluding that both dose levels are promising in the sixth row, and the average sample size and proportion of inconclusive trials in the bottom two rows, respectively.
Table 1
Probability of concluding each hypothesis, average sample size, and proportion of inconclusive trials (toxicity probability = 0.15)

| | BHT-A | BHT-B | indep | BMA | BIT |
|---|---|---|---|---|---|
| 0.1 & 0.1 | | | | | |
| $P(H_0)$ | 0.936 | 0.935 | 0.894 | 0.94 | 0.935 |
| $P(H_1)$ | 0.041 | 0.049 | 0.057 | 0.037 | 0.039 |
| $P(H_2)$ | 0.022 | 0.014 | 0.003 | 0.018 | 0.016 |
| $P(H_3)$ | 0 | 0 | 0 | 0.003 | 0.01 |
| $P(H_2 \text{ or } H_3)$ | 0.022 | 0.014 | 0.003 | 0.021 | 0.026 |
| avg ss | 25.209 | 25.205 | 24.985 | 24.709 | 24.621 |
| inconclusive (proportion) | 0.001 | 0.002 | 0.046 | 0.002 | 0 |
| 0.2 & 0.2 | | | | | |
| $P(H_0)$ | 0.434 | 0.524 | 0.401 | 0.519 | 0.496 |
| $P(H_1)$ | 0.133 | 0.183 | 0.225 | 0.106 | 0.115 |
| $P(H_2)$ | 0.386 | 0.246 | 0.104 | 0.267 | 0.255 |
| $P(H_3)$ | 0.036 | 0.029 | 0.025 | 0.088 | 0.129 |
| $P(H_2 \text{ or } H_3)$ | 0.422 | 0.275 | 0.129 | 0.355 | 0.384 |
| avg ss | 26.782 | 27.805 | 26.341 | 26.919 | 25.995 |
| inconclusive (proportion) | 0.011 | 0.018 | 0.245 | 0.02 | 0.005 |
| 0.1 & 0.2 | | | | | |
| $P(H_0)$ | 0.577 | 0.619 | 0.609 | 0.701 | 0.674 |
| $P(H_1)$ | 0.326 | 0.336 | 0.341 | 0.203 | 0.241 |
| $P(H_2)$ | 0.072 | 0.034 | 0.013 | 0.047 | 0.036 |
| $P(H_3)$ | 0.01 | 0.002 | 0.004 | 0.045 | 0.048 |
| $P(H_2 \text{ or } H_3)$ | 0.082 | 0.036 | 0.016 | 0.092 | 0.084 |
| avg ss | 27.881 | 27.231 | 25.673 | 25.704 | 25.426 |
| inconclusive (proportion) | 0.015 | 0.009 | 0.034 | 0.004 | 0.001 |
| 0.3 & 0.3 | | | | | |
| $P(H_0)$ | 0.085 | 0.108 | 0.086 | 0.124 | 0.108 |
| $P(H_1)$ | 0.104 | 0.192 | 0.211 | 0.061 | 0.081 |
| $P(H_2)$ | 0.618 | 0.557 | 0.355 | 0.503 | 0.455 |
| $P(H_3)$ | 0.187 | 0.141 | 0.141 | 0.3 | 0.353 |
| $P(H_2 \text{ or } H_3)$ | 0.805 | 0.698 | 0.496 | 0.803 | 0.808 |
| avg ss | 25.594 | 26.214 | 25.779 | 26.084 | 25.505 |
| inconclusive (proportion) | 0.006 | 0.002 | 0.206 | 0.012 | 0.003 |
| 0.4 & 0.4 | | | | | |
| $P(H_0)$ | 0.006 | 0.019 | 0.009 | 0.012 | 0.013 |
| $P(H_1)$ | 0.032 | 0.078 | 0.103 | 0.011 | 0.03 |
| $P(H_2)$ | 0.611 | 0.589 | 0.497 | 0.573 | 0.487 |
| $P(H_3)$ | 0.351 | 0.314 | 0.319 | 0.404 | 0.47 |
| $P(H_2 \text{ or } H_3)$ | 0.962 | 0.903 | 0.816 | 0.977 | 0.957 |
| avg ss | 24.326 | 24.755 | 24.742 | 24.436 | 24.458 |
| inconclusive (proportion) | 0 | 0 | 0.072 | 0 | 0 |
| 0.5 & 0.5 | | | | | |
| $P(H_0)$ | 0 | 0 | 0 | 0 | 0 |
| $P(H_1)$ | 0.007 | 0.017 | 0.014 | 0.002 | 0.002 |
| $P(H_2)$ | 0.495 | 0.498 | 0.513 | 0.558 | 0.491 |
| $P(H_3)$ | 0.498 | 0.485 | 0.457 | 0.44 | 0.507 |
| $P(H_2 \text{ or } H_3)$ | 0.993 | 0.983 | 0.97 | 0.998 | 0.998 |
| avg ss | 24.028 | 24.188 | 24.21 | 24.025 | 24.072 |
| inconclusive (proportion) | 0 | 0 | 0.016 | 0 | 0 |
| 0.1 & 0.3 | | | | | |
| $P(H_0)$ | 0.258 | 0.295 | 0.277 | 0.396 | 0.355 |
| $P(H_1)$ | 0.604 | 0.645 | 0.676 | 0.435 | 0.501 |
| $P(H_2)$ | 0.085 | 0.029 | 0.02 | 0.03 | 0.051 |
| $P(H_3)$ | 0.045 | 0.024 | 0.012 | 0.129 | 0.088 |
| $P(H_2 \text{ or } H_3)$ | 0.13 | 0.053 | 0.033 | 0.159 | 0.139 |
| avg ss | 26.862 | 26.96 | 25.328 | 26.711 | 25.606 |
| inconclusive (proportion) | 0.008 | 0.007 | 0.014 | 0.01 | 0.005 |
| 0.1 & 0.4 | | | | | |
| $P(H_0)$ | 0.092 | 0.092 | 0.077 | 0.171 | 0.136 |
| $P(H_1)$ | 0.765 | 0.84 | 0.877 | 0.553 | 0.723 |
| $P(H_2)$ | 0.06 | 0.029 | 0.016 | 0.017 | 0.018 |
| $P(H_3)$ | 0.081 | 0.037 | 0.027 | 0.247 | 0.123 |
| $P(H_2 \text{ or } H_3)$ | 0.141 | 0.066 | 0.042 | 0.264 | 0.141 |
| avg ss | 25.871 | 25.68 | 24.84 | 27.431 | 25.754 |
| inconclusive (proportion) | 0.002 | 0.002 | 0.004 | 0.012 | 0 |
| 0.3 & 0.4 | | | | | |
| $P(H_0)$ | 0.03 | 0.031 | 0.024 | 0.04 | 0.04 |
| $P(H_1)$ | 0.152 | 0.249 | 0.274 | 0.043 | 0.109 |
| $P(H_2)$ | 0.425 | 0.383 | 0.319 | 0.379 | 0.283 |
| $P(H_3)$ | 0.392 | 0.336 | 0.324 | 0.535 | 0.565 |
| $P(H_2 \text{ or } H_3)$ | 0.817 | 0.719 | 0.643 | 0.914 | 0.848 |
| avg ss | 25.244 | 25.52 | 25.291 | 25.116 | 25.191 |
| inconclusive (proportion) | 0.001 | 0.001 | 0.059 | 0.003 | 0.003 |
| 0.3 & 0.5 | | | | | |
| $P(H_0)$ | 0.006 | 0.004 | 0.005 | 0.006 | 0.011 |
| $P(H_1)$ | 0.179 | 0.295 | 0.293 | 0.041 | 0.133 |
| $P(H_2)$ | 0.233 | 0.2 | 0.193 | 0.236 | 0.162 |
| $P(H_3)$ | 0.581 | 0.501 | 0.496 | 0.714 | 0.689 |
| $P(H_2 \text{ or } H_3)$ | 0.814 | 0.701 | 0.689 | 0.95 | 0.851 |
| avg ss | 24.74 | 25.202 | 25.063 | 24.832 | 25.374 |
| inconclusive (proportion) | 0.001 | 0 | 0.013 | 0.003 | 0.005 |
| 0.4 & 0.5 | | | | | |
| $P(H_0)$ | 0 | 0.001 | 0.002 | 0.002 | 0.002 |
| $P(H_1)$ | 0.049 | 0.087 | 0.11 | 0.011 | 0.029 |
| $P(H_2)$ | 0.336 | 0.335 | 0.34 | 0.405 | 0.319 |
| $P(H_3)$ | 0.615 | 0.577 | 0.533 | 0.582 | 0.65 |
| $P(H_2 \text{ or } H_3)$ | 0.951 | 0.912 | 0.874 | 0.987 | 0.969 |
| avg ss | 24.289 | 24.408 | 24.514 | 24.319 | 24.359 |
| inconclusive (proportion) | 0 | 0 | 0.014 | 0 | 0 |
| 0.4 & 0.6 | | | | | |
| $P(H_0)$ | 0.001 | 0.001 | 0 | 0.001 | 0 |
| $P(H_1)$ | 0.053 | 0.114 | 0.112 | 0.01 | 0.036 |
| $P(H_2)$ | 0.159 | 0.14 | 0.175 | 0.237 | 0.176 |
| $P(H_3)$ | 0.787 | 0.745 | 0.711 | 0.752 | 0.788 |
| $P(H_2 \text{ or } H_3)$ | 0.946 | 0.885 | 0.886 | 0.989 | 0.964 |
| avg ss | 24.193 | 24.491 | 24.412 | 24.177 | 24.341 |
| inconclusive (proportion) | 0 | 0 | 0.002 | 0 | 0 |
In all scenarios, there are few early terminations due to toxicity because the true toxicity probabilities are assumed to be low and the prior distributions for toxicity are informative (summary not shown). In the first scenario, $H_0$ is true. The five designs are tuned to yield similar probabilities of declaring $H_0$ (0.936, 0.935, 0.894, 0.94, and 0.935 for BHT-A, BHT-B, indep, BMA, and BIT, respectively) and similar average sample sizes, with the independent BHT design performing a little worse. In the second scenario, although the true response rate of 0.2 lies between the null value 0.1 and the target value 0.3, it may be desirable to claim $H_0$ because neither dose achieves the target response rate (although this is debatable). The BHT-B and BMA designs result in higher probabilities of claiming $H_0$, and use slightly larger sample sizes. The independent BHT design performs the worst. In the third scenario, i.e., 0.1 & 0.2, we may similarly want to claim $H_0$. BMA and BIT yield the highest probabilities of claiming $H_0$, and also use fewer patients than the joint BHT. In the next three scenarios, $H_2$ is true. The probability of concluding that both doses are promising is higher under BHT-A, BMA, and BIT. BHT-A has the largest probability of correctly claiming $H_2$ under scenarios 0.3 & 0.3 and 0.4 & 0.4, and BMA has the largest probability of claiming $H_2$ under scenario 0.5 & 0.5. The results in these three scenarios highlight the advantages of using nonlocal alternative priors in BHT and BMA, as these priors are perceived to be helpful in identifying equality of response rates between doses. In the next two scenarios, $H_1$ is true. The independent BHT and BHT-B designs lead to the highest probabilities of claiming $H_1$. The independent BHT design also uses the smallest sample sizes on average. In these two scenarios, BMA performs the worst, because the response rate estimates from the model in which the two response rates are equal are averaged into the final results, which pulls down the estimate of the higher response rate. In the last four scenarios, $H_3$ is true. BMA yields the highest probability of concluding that both doses are promising in all four scenarios. This may be explained by the fact that the incorrect model that assumes equality of the response rates (i.e., $M_1$) in fact strengthens the claim that both doses are promising. For 0.3 & 0.4 and 0.3 & 0.5, BMA and BIT yield the highest probabilities of claiming $H_3$. For 0.4 & 0.5 and 0.4 & 0.6, BHT-A and BIT result in the highest probabilities of $H_3$. As expected, BHT-B leads to a smaller probability of concluding that both doses are promising than BHT-A. In 11 of the 12 scenarios, the proportion of inconclusive trials is highest under the independent BHT design; the exception is scenario 0.1 & 0.4, where it is slightly higher under BMA (0.012 vs. 0.004).
The overall results suggest that the BHT-A design performs the best among all designs. The BMA design performs reasonably well (except in scenarios 0.1 & 0.3, and 0.1 & 0.4), similarly to or marginally better than the BIT design and independent BHT design. The BHT-B design tends to perform adequately in most scenarios, but not the best in any scenario. To summarize the robustness of the performances of all five designs, we counted the number of scenarios out of 10 scenarios (excluding scenarios 0.2 & 0.2, and 0.1 & 0.2) in which each design performs the best or almost the best (in terms of the percentage of drawing the correct conclusion) across all five designs and the number of scenarios in which the design performs inadequately (defined as when the chance of drawing a correct conclusion is at least 15 percentage points less than that of the best design for that scenario). For example, the corresponding numbers are (3,1) for the BHT-A design, meaning that in 3 out of the 10 scenarios, the design performs the best or nearly the best, and in 1 out of the 10 scenarios it performs inadequately. The corresponding pairs of numbers for the BHT-B, indep, BMA, and BIT designs are (0,2), (2,3), (2,2), and (3,2), respectively. We excluded scenarios 0.1 & 0.2 and 0.2 & 0.2 because it is unclear which conclusion should be deemed ‘correct’ in these scenarios. We re-counted these numbers by further excluding scenarios 0.3 & 0.4 and 0.4 & 0.5, because they represent minimal levels of clinically meaningful difference in the response rate between doses, and thus may be of less relevance than other scenarios. With these exclusions, the numbers are (3,0), (0,1), (2,2), (2,2), and (1,2) for the BHT-A, BHT-B, indep, BMA, and BIT designs, respectively. These results demonstrate the robust performance of the proposed BHT-A design.
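The ‘inadequate’ rule used in these counts (a correct-conclusion probability at least 15 percentage points below the best design in the scenario) can be sketched as follows. The function name and the `tol` parameter are hypothetical; in particular, the text does not give a numerical definition of ‘almost the best’, so `tol` defaults to 0 and only the single best design is flagged. The example uses the $P(H_3)$ values reported in Table 1 for scenario 0.3 & 0.4.

```python
def best_and_inadequate(correct_prob, margin=0.15, tol=0.0):
    """Classify designs within one scenario.

    correct_prob maps a design label to its probability of drawing the
    correct conclusion. A design is 'inadequate' if it falls at least
    `margin` below the best design, and 'near best' if it is within `tol`
    of the best (tol is an illustrative assumption, not from the text).
    """
    best = max(correct_prob.values())
    # Rounding guards against floating-point noise in the differences.
    near_best = sorted(d for d, p in correct_prob.items()
                       if round(best - p, 6) <= tol)
    inadequate = sorted(d for d, p in correct_prob.items()
                        if round(best - p, 6) >= margin)
    return near_best, inadequate


if __name__ == "__main__":
    # P(H3) under scenario 0.3 & 0.4, where H3 is the correct conclusion.
    scenario_03_04 = {"BHT-A": 0.392, "BHT-B": 0.336, "indep": 0.324,
                      "BMA": 0.535, "BIT": 0.565}
    print(best_and_inadequate(scenario_03_04))
```

Applying the function to each scenario and tallying the two lists per design gives pairs of the kind reported above; the tallies depend on how ‘almost the best’ is operationalized.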
Comparison to the independent Simon optimal two-stage designs
We also compared our proposed Bayesian designs to the independent Simon optimal two-stage designs, perhaps the most commonly used design for single-arm phase II trials. Since we have demonstrated that the BHT-A design performs more robustly than the BMA and BHT-B designs, we compare only the BHT-A and independent Simon two-stage designs. We conduct this comparison separately because the Simon design does not include early stopping for efficacy. For simplicity, we assumed toxicity was low at both dose levels again, and only considered efficacy.
We extended the Simon optimal two-stage design to a setting with two doses as follows. First, we applied the two-stage design to each dose level independently. We aimed to control the type I error at 0.05 when both response rates were 0.1 and the type II error at 0.2 when both response rates were 0.3. Accordingly, at each dose the type I error was set to 0.0253 and the type II error to 0.106. Under the optimality criterion of Simon, the required maximum sample size for each dose was 45, with 17 in the first stage. We concluded $H_0$ if both doses were rejected, $H_1$ if only the lower dose was rejected, and that both doses are promising if neither dose was rejected. Here rejection of a dose means that the dose is not considered to be promising (that is, the hypothesis $\theta_1 = \theta_0$ or $\theta_2 = \theta_0$ is concluded). If both doses were concluded to be promising, we claimed $H_2$ if $RR_1 \ge RR_2 - \delta$, and $H_3$ otherwise, where $RR_1$ and $RR_2$ were the observed proportions of patients who responded at the lower and higher doses, respectively. We first used $\delta = 0.05$, and performed additional sensitivity analyses using $\delta = 0.03$ and 0.07. If only the higher dose was rejected, the trial was claimed to be inconclusive. We compare with the independent Simon designs for two reasons: 1) in our experience at MD Anderson Cancer Center, this is a commonly used approach for designing phase II oncology trials with more than one dose group, even under a plausible assumption that the response rates are ordered between dose levels; 2) we are not aware of a published version of the Simon optimal two-stage design for ordered dose groups.
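The per-dose error rates and the two-dose conclusion rule can be sketched as below. The error split assumes that the two arms are analyzed independently, so that the overall type I error is the chance that at least one dose is declared promising and the overall power is the chance that both are; this reading is an inference, but it reproduces the stated values of 0.0253 and 0.106. The function names are hypothetical.

```python
import math


def per_dose_error_rates(overall_alpha=0.05, overall_power=0.8):
    """Per-dose type I and type II errors for two independent arms.

    Solves 1 - (1 - alpha)^2 = overall_alpha and (1 - beta)^2 = overall_power.
    """
    alpha = 1 - math.sqrt(1 - overall_alpha)
    beta = 1 - math.sqrt(overall_power)
    return alpha, beta


def two_dose_conclusion(reject_low, reject_high, rr1, rr2, delta=0.05):
    """Combine two independent Simon designs into one trial-level conclusion.

    reject_low / reject_high: True if that dose is judged not promising.
    rr1 / rr2: observed response proportions at the lower and higher dose.
    """
    if reject_low and reject_high:
        return "H0"
    if reject_low:
        return "H1"
    if reject_high:
        return "inconclusive"
    # Neither dose rejected: both promising, so compare observed response rates.
    return "H2" if rr1 >= rr2 - delta else "H3"


if __name__ == "__main__":
    alpha, beta = per_dose_error_rates()
    print(round(alpha, 4), round(beta, 3))
    print(two_dose_conclusion(False, False, rr1=0.30, rr2=0.33))
```

Changing `delta` to 0.03 or 0.07 gives the rule used in the sensitivity analyses; only the split between $H_2$ and $H_3$ is affected, not the probability that both doses are declared promising.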
To make the BHT-A design comparable with the Simon two-stage designs, we modified our monitoring rule to allow early stopping only for futility. Specifically, we terminated the trial if $p(H_0 \mid x)$ was above 0.848, and closed the lower dose arm if $p(H_1 \mid x)$ was above 0.848. The BHT-A design used continuous monitoring after the outcomes of a minimum of 24 patients across both doses had been observed. The maximum sample size was also set to 45 at each dose level. At the end of the trial, we claimed $H_0$, $H_1$, or that both doses are promising if the corresponding posterior probability was above 0.5. If both doses were claimed to be promising, we concluded $H_2$ if $p(x \mid H_2)/p(x \mid H_3) > 1.37$, and concluded $H_3$ otherwise.
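The futility-only monitoring rule and the final decision rule can be written as a small decision function. This sketch takes the posterior probabilities and the Bayes factor as inputs and does not compute them; the function names are hypothetical, and returning "inconclusive" when no posterior probability exceeds 0.5 is an assumption consistent with the inconclusive trials reported in the results.

```python
def interim_action(post_h0, post_h1, cutoff=0.848):
    """Futility-only monitoring, applied once at least 24 outcomes are observed."""
    if post_h0 > cutoff:
        return "terminate trial"
    if post_h1 > cutoff:
        return "close lower-dose arm"
    return "continue"


def final_conclusion(post_h0, post_h1, post_both, bf_h2_vs_h3,
                     post_cutoff=0.5, bf_cutoff=1.37):
    """End-of-trial conclusion.

    post_both is the posterior probability that both doses are promising;
    bf_h2_vs_h3 is the Bayes factor p(x | H2) / p(x | H3).
    """
    if post_h0 > post_cutoff:
        return "H0"
    if post_h1 > post_cutoff:
        return "H1"
    if post_both > post_cutoff:
        return "H2" if bf_h2_vs_h3 > bf_cutoff else "H3"
    return "inconclusive"  # assumption: no hypothesis reaches the cutoff


if __name__ == "__main__":
    # Hypothetical posterior summaries at an interim look and at the end of a trial.
    print(interim_action(post_h0=0.10, post_h1=0.87))
    print(final_conclusion(0.05, 0.15, 0.80, bf_h2_vs_h3=2.0))
```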
We considered the same 12 scenarios, and under each scenario we simulated 1,000 trials. The operating characteristics of both designs are shown in Table 2, with $\delta = 0.05$, 0.03, and 0.07 for the Simon designs labeled as ‘Simon I’, ‘Simon II’, and ‘Simon III’, respectively. The column ‘SS’ shows the average total sample size; columns ‘SS 1’ and ‘SS 2’ show the average sample size for the lower and higher doses, respectively. The type I error rate was slightly lower under BHT-A. For scenario 0.3 & 0.3, the BHT-A design performed much better than the Simon I design, with a 14% higher probability of concluding that both doses are promising (0.905 vs. 0.794) and a 35% higher $P(H_2)$ (0.773 vs. 0.574), both in relative terms. For scenario 0.1 & 0.3, BHT-A resulted in a slightly lower $P(H_1)$ than Simon I. In other scenarios, the BHT-A and Simon I designs perform comparably. In scenario 0.1 & 0.1, where neither dose level is promising, BHT-A required at least six fewer patients at each dose than the Simon I design. The comparisons with the Simon II and Simon III designs are similar. In summary, compared with the Simon optimal two-stage designs, our proposed BHT-A design can terminate trials of unpromising doses early, by utilizing a continuous monitoring rule and nonlocal alternative prior distributions in the hypothesis tests.
Table 2
Comparisons of BHT-A and Simon optimal two-stage designs

| Design | $P(H_0)$ | $P(H_1)$ | $P(H_2)$ | $P(H_3)$ | $P(H_2 \text{ or } H_3)$ | SS | SS 1 | SS 2 | inconclusive (proportion) |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 & 0.1 | | | | | | | | | |
| BHT-A | 0.978 | 0.014 | 0.003 | 0.000 | 0.003 | 34.5 | 17.1 | 17.4 | 0.005 |
| Simon I | 0.954 | 0.022 | 0.001 | 0.000 | 0.001 | 47.1 | 23.6 | 23.5 | 0.022 |
| Simon II | 0.954 | 0.022 | 0.001 | 0.000 | 0.001 | 47.1 | 23.6 | 23.5 | 0.022 |
| Simon III | 0.954 | 0.022 | 0.001 | 0.000 | 0.001 | 47.1 | 23.6 | 23.5 | 0.022 |
| 0.2 & 0.2 | | | | | | | | | |
| BHT-A | 0.408 | 0.142 | 0.408 | 0.003 | 0.411 | 68.6 | 33.3 | 35.3 | 0.039 |
| Simon I | 0.261 | 0.250 | 0.203 | 0.040 | 0.243 | 73.2 | 36.7 | 36.5 | 0.246 |
| Simon II | 0.261 | 0.250 | 0.174 | 0.069 | 0.243 | 73.2 | 36.7 | 36.5 | 0.246 |
| Simon III | 0.261 | 0.250 | 0.221 | 0.022 | 0.243 | 73.2 | 36.7 | 36.5 | 0.246 |
| 0.1 & 0.2 | | | | | | | | | |
| BHT-A | 0.623 | 0.343 | 0.027 | 0.000 | 0.027 | 53.6 | 23.6 | 30.0 | 0.007 |
| Simon I | 0.515 | 0.457 | 0.008 | 0.004 | 0.012 | 60.0 | 23.7 | 36.3 | 0.016 |
| Simon II | 0.515 | 0.457 | 0.008 | 0.004 | 0.012 | 60.0 | 23.7 | 36.3 | 0.016 |
| Simon III | 0.515 | 0.457 | 0.011 | 0.001 | 0.012 | 60.0 | 23.7 | 36.3 | 0.016 |
| 0.3 & 0.3 | | | | | | | | | |
| BHT-A | 0.039 | 0.056 | 0.773 | 0.132 | 0.905 | 86.7 | 42.8 | 43.9 | 0.000 |
| Simon I | 0.008 | 0.094 | 0.574 | 0.220 | 0.794 | 85.5 | 42.9 | 42.6 | 0.104 |
| Simon II | 0.008 | 0.094 | 0.506 | 0.288 | 0.794 | 85.5 | 42.9 | 42.6 | 0.104 |
| Simon III | 0.008 | 0.094 | 0.632 | 0.162 | 0.794 | 85.5 | 42.9 | 42.6 | 0.104 |
| 0.4 & 0.4 | | | | | | | | | |
| BHT-A | 0.002 | 0.008 | 0.706 | 0.284 | 0.990 | 89.7 | 44.7 | 44.9 | 0.000 |
| Simon I | 0.000 | 0.014 | 0.675 | 0.300 | 0.976 | 89.3 | 44.6 | 44.7 | 0.010 |
| Simon II | 0.000 | 0.014 | 0.588 | 0.387 | 0.976 | 89.3 | 44.6 | 44.7 | 0.010 |
| Simon III | 0.000 | 0.014 | 0.748 | 0.227 | 0.976 | 89.3 | 44.6 | 44.7 | 0.010 |
| 0.5 & 0.5 | | | | | | | | | |
| BHT-A | 0.000 | 0.001 | 0.659 | 0.340 | 0.999 | 90.0 | 45.0 | 45.0 | 0.000 |
| Simon I | 0.000 | 0.001 | 0.691 | 0.305 | 0.996 | 89.9 | 45.0 | 44.9 | 0.003 |
| Simon II | 0.000 | 0.001 | 0.626 | 0.370 | 0.996 | 89.9 | 45.0 | 44.9 | 0.003 |
| Simon III | 0.000 | 0.001 | 0.762 | 0.234 | 0.996 | 89.9 | 45.0 | 44.9 | 0.003 |
| 0.1 & 0.3 | | | | | | | | | |
| BHT-A | 0.150 | 0.811 | 0.030 | 0.006 | 0.036 | 64.9 | 23.9 | 41.0 | 0.003 |
| Simon I | 0.104 | 0.872 | 0.006 | 0.016 | 0.022 | 66.4 | 23.6 | 42.8 | 0.002 |
| Simon II | 0.104 | 0.872 | 0.002 | 0.020 | 0.022 | 66.4 | 23.6 | 42.8 | 0.002 |
| Simon III | 0.104 | 0.872 | 0.007 | 0.015 | 0.022 | 66.4 | 23.6 | 42.8 | 0.002 |
| 0.1 & 0.4 | | | | | | | | | |
| BHT-A | 0.021 | 0.949 | 0.007 | 0.023 | 0.030 | 65.2 | 20.9 | 44.4 | 0.000 |
| Simon I | 0.013 | 0.957 | 0.001 | 0.028 | 0.029 | 68.5 | 23.9 | 44.6 | 0.001 |
| Simon II | 0.013 | 0.957 | 0.000 | 0.028 | 0.029 | 68.5 | 23.9 | 44.6 | 0.001 |
| Simon III | 0.013 | 0.957 | 0.002 | 0.027 | 0.029 | 68.5 | 23.9 | 44.6 | 0.001 |
| 0.3 & 0.4 | | | | | | | | | |
| BHT-A | 0.006 | 0.097 | 0.358 | 0.539 | 0.897 | 87.5 | 42.7 | 44.8 | 0.000 |
| Simon I | 0.000 | 0.100 | 0.336 | 0.556 | 0.892 | 87.7 | 42.9 | 44.8 | 0.008 |
| Simon II | 0.000 | 0.100 | 0.261 | 0.631 | 0.892 | 87.7 | 42.9 | 44.8 | 0.008 |
| Simon III | 0.000 | 0.100 | 0.412 | 0.480 | 0.892 | 87.7 | 42.9 | 44.8 | 0.008 |
| 0.3 & 0.5 | | | | | | | | | |
| BHT-A | 0.000 | 0.116 | 0.071 | 0.813 | 0.884 | 87.3 | 42.3 | 45.0 | 0.000 |
| Simon I | 0.000 | 0.104 | 0.075 | 0.820 | 0.896 | 87.9 | 43.0 | 45.0 | 0.001 |
| Simon II | 0.000 | 0.104 | 0.050 | 0.846 | 0.896 | 87.9 | 43.0 | 45.0 | 0.001 |
| Simon III | 0.000 | 0.104 | 0.122 | 0.774 | 0.896 | 87.9 | 43.0 | 45.0 | 0.001 |
| 0.4 & 0.5 | | | | | | | | | |
| BHT-A | 0.000 | 0.010 | 0.289 | 0.701 | 0.990 | 89.7 | 44.7 | 45.0 | 0.000 |
| Simon I | 0.000 | 0.014 | 0.353 | 0.632 | 0.984 | 89.6 | 44.6 | 45.0 | 0.000 |
| Simon II | 0.000 | 0.014 | 0.280 | 0.705 | 0.984 | 89.6 | 44.6 | 45.0 | 0.000 |
| Simon III | 0.000 | 0.014 | 0.424 | 0.560 | 0.984 | 89.6 | 44.6 | 45.0 | 0.000 |
| 0.4 & 0.6 | | | | | | | | | |
| BHT-A | 0.000 | 0.021 | 0.051 | 0.928 | 0.979 | 89.4 | 44.4 | 45.0 | 0.000 |
| Simon I | 0.000 | 0.016 | 0.074 | 0.910 | 0.984 | 89.6 | 44.6 | 45.0 | 0.000 |
| Simon II | 0.000 | 0.016 | 0.051 | 0.934 | 0.984 | 89.6 | 44.6 | 45.0 | 0.000 |
| Simon III | 0.000 | 0.016 | 0.116 | 0.868 | 0.984 | 89.6 | 44.6 | 45.0 | 0.000 |
To compare the robustness of the performances of all four designs, i.e., BHT-A and Simon I, II and III designs, we similarly report the numbers of scenarios in which each design performs the best and inadequately. The pairs of numbers are (5,0), (2,1), (6,2), and (4,1) out of 10 scenarios, and (4,0), (2,1), (4,2), and (4,0) out of the 8 scenarios, for the BHT-A, Simon I, II, and III designs, respectively, with the same 10 and 8 scenarios selected in Section “Design operating characteristics”. We extended the definitions for ‘best’ and ‘inadequate’ by also accounting for situations where an average total sample size is reduced by 10 or more when the percentages of drawing the correct conclusion are similar (i.e., scenario 0.1 & 0.1). These results suggest that the proposed BHT-A design performs more robustly than the three versions of the independent Simon optimal two-stage designs.