Boundary selection of equivalence regions in (Bayesian) equivalence tests
In this section, a new proposal is made for how to determine the equivalence region for Bayesian equivalence tests in practice. Note that the proposal deals primarily with equivalence tests. However, the interval Bayes factor and the ROPE can easily be used for Bayesian superiority tests, too, although these are not studied in this paper. For example, a ROPE can be selected as [c,∞) for some \(c\in \mathbb {R}\) to resemble a superiority test of H0: θ≥c against its alternative, and the interval Bayes factor could be extended to use an interval hypothesis (c,∞) in the same way. Inferiority tests would work accordingly. However, the simulation study deals only with Bayesian equivalence tests. The results will be used later to implement the new proposal made in this section, and to reanalyse the illustrative example.
Regarding the choice of the equivalence region, Morey and Rouder [41] stressed: “Choices of the equivalence regions and weights of the point nil reflect reasoned beliefs about the problem at hand. In fields where interesting effects are smaller (...) the width of the null region may be (...) small. In other fields, where interesting effect sizes are larger (...) the region may be made larger to suit. The task of selecting boundaries is simplified somewhat by the parameterizations. The models are parameterized with respect to standardized effect size. General guidelines already exist (Cohen, 1988), and we note that many journals require reporting some measure of effect size.” (Morey and Rouder [41], pp. 25-26)
For a variety of quantities used in biomedical research, widely accepted standards exist for how to interpret different magnitudes of these quantities. Examples are effect sizes, which have a tradition of being categorized in the biomedical, social and psychological sciences, see Cohen [34]. For effect sizes, a widely accepted ROPE R around a null hypothesis H0: δ=0 is given as R=[−0.1,0.1], whose boundaries δ=−0.1 and δ=0.1 are half of the magnitude necessary for at least a small effect as defined by Cohen [34]. Both Kruschke [58] and Morey and Rouder [41] proposed this default ROPE on δ. However, the range of proposals for how to select the equivalence region (no matter whether for frequentist or Bayesian equivalence tests) is broad, and below only the most established options, with a focus on the biomedical sciences, are outlined briefly:
(i)
According to Lakens et al. [38], researchers often know better which sample sizes are attainable in their field of work than which effect sizes can be expected to be observed in a study. As the amount of available data limits the effect size which can be detected, researchers can derive the smallest effect size which they can detect after selecting a test level α and their sample size n, and use this smallest detectable effect size as the equivalence boundary. Note that although this method seems to apply primarily to frequentist tests, because the Bayesian paradigm contains no concept of a type I error, the results of the simulation study presented below allow this method to be used for Bayesian equivalence tests, too.
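This derivation can be sketched in a few lines. The following Python snippet is illustrative only (the analyses in this paper use R): it applies the common normal-approximation formula for the smallest standardized effect size detectable by a two-sided two-sample t-test with n observations per group, and the 80% target power is an assumed value for illustration.

```python
from math import sqrt
from statistics import NormalDist

def smallest_detectable_effect(n, alpha=0.05, power=0.80):
    """Normal approximation to the smallest standardized effect size
    detectable by a two-sided two-sample t-test with n per group:
    d ~ (z_{1-alpha/2} + z_{power}) * sqrt(2/n)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sqrt(2 / n)

# A study limited to n = 50 per group could justify an equivalence
# boundary of about d = 0.56; larger samples shrink the boundary.
for n in (50, 100, 200):
    print(n, round(smallest_detectable_effect(n), 3))
```

The resulting value then serves as the equivalence boundary, in the spirit of Lakens et al. [38].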
(ii)
The U.S. Food and Drug Administration has recommended equivalence bounds for establishing bioequivalence [75]; for a discussion see Senn [76].
(iii)
Cook et al. [77, 78] proposed three methods: the anchor method for determining the minimally clinically important difference (MCID), in which the judgement of relevant stakeholders is used, compare Jaeschke et al. [79]; the distribution method, in which both the standard error of a measurement and the smallest detectable difference of a statistical test are employed; and the health economic method, which asks which effect in “health units” is necessary to justify the amount of money spent on the treatment or therapy.
(iv)
Weber and Popova [80] recommended incorporating meta-analyses to determine the equivalence region.
(v)
Simonsohn [81] proposed to set the equivalence boundary at the effect size which a previous study would have had ≈33% power to detect. For details see also Lakens et al. [38].
(vi)
Ferguson [82], Beribisky, Davidson and Cribbie [83] and Rusticus and Eva [84] argued for incorporating pilot studies to determine the equivalence region.
(vii)
Other approaches and examples which select the equivalence region based on prior research are given in Perugini, Gallucci & Constantini [85] and Kordsmeyer and Penke [86].
(viii)
In case none of the other justifications of equivalence boundaries is possible, Maxwell, Lau and Howard [87] proposed to use a trivially small value like an effect size of δ=0.10 according to Cohen [34].
(ix)
Kruschke [36] provides an in-depth discussion of selecting the boundaries for the ROPE in the Bayesian approach.
(x)
Finally, “the ideal specific meaningful effect should be made through a multi-faceted decision-making process” ([83], p. 5); see also Rogers et al. [88].
Now, in addition to these proposals, another one is made: to use objective criteria like the type I error rate, power and robustness to the prior selection to determine the equivalence region (or to decide between available Bayesian equivalence tests). This has the advantage of being a stronger justification than using recommended default values such as δ=0.1 – see point (viii) – and it can easily be combined with the other approaches. For example, method (i) can be used to select a desired type I error level α and specify the attainable sample size n in the frequentist paradigm. The results of the simulation study presented in this paper allow this method to be used for Bayesian equivalence tests, too. They make it possible to determine which equivalence region is compatible with these desiderata and which power is attained. While the equivalence region compatible with the desired objective criteria may turn out to be too broad or too narrow, this approach allows the consequences of selecting an equivalence region to be judged more objectively. Also, if prior research or pilot studies strongly recommend a specific equivalence region – see approaches (iii)-(vi) – the results can be used to investigate the resulting type I error rate and power when selecting this equivalence region, and to pick the Bayesian equivalence test with the best properties for a specified equivalence region and prior distribution.
Design of the simulation study
To use the new method for equivalence region selection, a simulation study was performed to analyze the behaviour of the different approaches to Bayesian equivalence testing in the setting of Welch’s two-sample t-test. This section details the design of the simulation study. The next section presents the results and the section thereafter discusses these and shows how to apply them in practice by revisiting the illustrative example.
Pairs of data were simulated which consist of two samples, one for each group, both of which are normally distributed. Four settings were selected to investigate the sensitivity of the approaches: In the first setting, no effect was present, and both groups were identically distributed as standard normal \(\mathcal {N}(0,1)\). This allows studying the type I error rate produced by each of the approaches presented in the previous sections. In the second setting, a small effect was present, and the first group was simulated as \(\mathcal {N}(2.89,1.84)\) and the second group as \(\mathcal {N}(3.5,1.56)\), resulting in a true effect size of
$$\begin{array}{*{20}l} \delta=\frac{(2.89-3.5)}{\sqrt{\left(\left(1.84^{2}+1.56^{2}\right)/2\right)}}\approx -0.357 \end{array} $$
(3)
In the third simulation setting, a medium effect was present. The first group was generated according to a \(\mathcal {N}(254.08,2.36)\) distribution, and observations in the second group followed a \(\mathcal {N}(255.84,3.04)\) distribution, resulting in a true effect size of
$$\begin{array}{*{20}l} \delta=\frac{(254.08-255.84)}{\sqrt{\left(\left(2.36^{2}+3.04^{2}\right)/2\right)}}\approx -0.646 \end{array} $$
(4)
The last setting modelled data in the first group as \(\mathcal {N}(15.01,3.4)\) and in the second group as \(\mathcal {N}(19.91,5.8)\), which yields a true effect size of
$$\begin{array}{*{20}l} \delta=\frac{(15.01-19.91)}{\sqrt{\left(\left(3.4^{2}+5.8^{2}\right)/2\right)}}\approx -1.03 \end{array} $$
(5)
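All three true effect sizes are instances of the same formula: Cohen's d with the root mean square of the two group standard deviations in the denominator. A short Python sketch (the study itself uses R) reproduces Eqs. (3)-(5):

```python
from math import sqrt

def true_effect_size(mu1, mu2, sd1, sd2):
    """Cohen's d with the root mean square of the two group
    standard deviations as denominator, as in Eqs. (3)-(5)."""
    return (mu1 - mu2) / sqrt((sd1 ** 2 + sd2 ** 2) / 2)

print(true_effect_size(2.89, 3.5, 1.84, 1.56))       # ~ -0.357 (small)
print(true_effect_size(254.08, 255.84, 2.36, 3.04))  # ~ -0.646 (medium)
print(true_effect_size(15.01, 19.91, 3.4, 5.8))      # ~ -1.03 (large)
```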
For each of the four effect size settings, 1000 datasets following the corresponding group distributions as detailed above were simulated. This procedure was repeated for different sample sizes n, ranging from n=10 to n=200 in steps of size 10, to investigate the influence of the sample size n on the different approaches. For the equivalence testing approaches based on Bayes factors, the Bayes factor BF01 was computed for each data set. The equivalence testing approaches based on the ROPE were also computed for each data set. First, for each data set the overlapping hypotheses Bayes factor \(BF_{01}^{\text {OH}}\) was computed via transitivity by employing two JZS Bayes factors as detailed in Appendix A. The Cauchy prior width r0 under the null hypothesis was selected as a tenth of the Cauchy prior width r1 under the alternative in all simulations. Three settings \(C(0,1/\sqrt {2})\), \(C(0,1)\) and \(C(0,\sqrt {2})\) were chosen under \(H_{1}^{\text {OH}}\), which are based on the recommendations of Rouder et al. [28] and Kelter [40]. The corresponding priors under the null hypothesis \(H_{0}^{\text {OH}}\) in the OH model are then given as \(C(0,1/(\sqrt {2}\cdot 10))\), \(C(0,1/10)\) and \(C(0,\sqrt {2}/10)\).
Second, the non-overlapping hypotheses Bayes factor \(BF_{01}^{\text {NOH}}\) was computed according to the numerical integration routine given in Morey et al. [41]. The hyper-parameter ν was chosen as ν0=1, and the scale of the resulting Cauchy prior on δ was selected as \(1/\sqrt {2}\), 1 and \(\sqrt {2}\) to make the results of the OH model and NOH model comparable (for details on the relationship between the \(t_{\nu _{0}}\)-prior and the Cauchy prior C(0,γ) on δ, see Appendix A in Morey et al. [41]).
Third, notice that the informed Bayes factor for equivalence testing proposed by Van Ravenzwaaij et al. [44], using the default hyper-parameter μδ=0 with varying Cauchy scales \(\gamma =1/\sqrt {2}\), \(\gamma =1\) and \(\gamma =\sqrt {2}\), was not computed for each data set, because it yields results identical to those of the NOH model of Morey et al. [41] (interested readers can check this in the commented replication script provided at the Open Science Foundation under https://osf.io/2cs75/).
Fourth, the 95% and 100% ROPE equivalence tests based on the standard HPD interval were computed for each data set, and subsequently, the ROPE equivalence test based on the (100%) BF=1 support interval was conducted.
All simulations were repeated for three different ROPEs: the recommended default ROPE [−0.1,0.1] around δ=0, a narrower ROPE of [−0.05,0.05] and a slightly wider ROPE of [−0.15,0.15]. This allows judging the influence of the ROPE itself on the obtained results, next to the influence of the prior elicitation and sample size. The ROPEs were selected to include the widely recommended default choice R=[−0.1,0.1], as well as a larger and a smaller one. ROPEs of substantial size (e.g. [−0.4,0.4]) are of less interest, as accepting a very wide interval hypothesis (like H0: δ∈[−0.4,0.4]) is of limited use in practice. Also, effects like δ≥0.2 would already be categorized as small according to Cohen [34], so a ROPE of [−0.2,0.2] would include effects which are often regarded as non-negligible.
The quantities of interest in the simulations were the type I and type II error rates, the power and the robustness to the prior modeling. Also, the total error rate was of interest. While formally Bayesian statistical theory has no concept of a type I or type II error, a Bayes factor BF01<3 (or BF10≥3) was interpreted as a false-positive result when the true effect size δ was zero. Similarly, if an effect was present (no matter whether small, medium or large), a Bayes factor of BF01≥3 (or BF10<3) was interpreted as a false-negative result, a type II error. The threshold reflects at least moderate evidence for or against a hypothesis according to conventional Bayes factor scales [33, 60].
A result based on the 95% ROPE or 100% ROPE equivalence test using an HPD or support interval was interpreted as false-positive when the interval was located completely outside the corresponding ROPE around δ=0 although the true effect size was zero. Similarly, if the HPD or support interval was located entirely inside the ROPE but the true effect size was nonzero, this was interpreted as a type II error.
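The interval-versus-ROPE classification underlying this error definition can be made explicit in a short sketch (Python here for illustration; the study's implementation is in R, and the function name is hypothetical):

```python
def rope_decision(interval, rope):
    """Classify a posterior (HPD or support) interval against a ROPE:
    entirely inside  -> equivalence accepted,
    entirely outside -> equivalence rejected (an effect is declared),
    otherwise        -> undecided."""
    lo, hi = interval
    r_lo, r_hi = rope
    if r_lo <= lo and hi <= r_hi:
        return "accept"
    if hi < r_lo or lo > r_hi:
        return "reject"
    return "undecided"

rope = (-0.1, 0.1)
print(rope_decision((-0.05, 0.08), rope))  # accept
print(rope_decision((0.25, 0.60), rope))   # reject
print(rope_decision((-0.05, 0.30), rope))  # undecided
```

A "reject" verdict when the true δ is zero is then counted as a false positive, and an "accept" verdict under a nonzero δ as a type II error.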
The percentage of type I and type II errors was computed as the number of false-positive (respectively, false-negative) results divided by n=1000. This is a Monte Carlo estimate of the type I and type II error probabilities of the different Bayesian equivalence testing approaches and a quantity crucial for making research reproducible [89]. Their sum was calculated as a Monte Carlo estimate of the total error rate of a method.
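These Monte Carlo estimates are plain proportions over the 1000 verdicts per setting, as this illustrative Python sketch shows (the paper's simulations run in R; the verdict labels here are placeholders for the Bayes factor and ROPE decision rules described above):

```python
def error_rate(verdicts, error_label):
    """Monte Carlo estimate of an error probability: the share of
    simulated data sets whose verdict equals the given error label."""
    return sum(v == error_label for v in verdicts) / len(verdicts)

# Toy stand-in for 1000 simulation runs under a true null (delta = 0):
verdicts = ["false positive"] * 47 + ["correct"] * 953
print(error_rate(verdicts, "false positive"))  # 0.047
```

The total error rate of a method is then the sum of its estimated type I and type II error rates.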
As solutions based on the ROPE only require a posterior distribution p(δ|x) of the effect size, for all results the corresponding posterior p(δ|x) of the NOH model of Morey and Rouder [41] was used, based on 5000 MCMC draws, as implemented in the BayesFactor R package [90]. This ensures that differences in the obtained results are not caused by the different statistical models on which the posterior distribution is based. The ROPE indices were computed via the bayestestR package [47], and the OH and NOH Bayes factors of Morey et al. [41] were computed via the BayesFactor R package [90].
The statistical programming language R [91] was used for the simulations. A commented replication script which reproduces all results and figures is provided at the Open Science Foundation at https://osf.io/2cs75/.