In order to investigate the accuracy of the Berger-Exner test for detecting third-order selection bias in RCTs, the numbers of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results were recorded from RCT simulations. A ‘true positive’ (TP) result was observed when the test indicated a positive result in the presence of bias, and a ‘true negative’ (TN) result when the test indicated a negative result in the absence of bias. Accordingly, a ‘false positive’ (FP) result was observed when the test indicated a positive result in the absence of bias, and a ‘false negative’ (FN) result when the test indicated a negative result in the presence of bias. These numbers were converted into two rates: the test sensitivity, defined as the proportion of cases with third-order selection bias that show a positive test result, and the test specificity, defined as the proportion of cases without third-order selection bias that show a negative test result [19]. For high accuracy, both sensitivity and specificity should ideally be reasonably high, i.e., > 80%. In order to obtain a single summary measure of predictive value, the Diagnostic Odds Ratio (DOR) was computed. The DOR combines sensitivity and specificity into one predictive summary measure and is defined as: DOR = (TP × TN)/(FP × FN) [19]. The DOR may range from zero to infinity, has no pre-defined cut-off threshold, and is used to compare the predictive evidence strength of different diagnostic parameter settings. A DOR value of, or close to, 1.00 provides no predictive evidence and corresponds to the rising diagonal in Summary Receiver Operating Characteristic (SROC) graphs. The higher the DOR value (> 1.00), the better the predictive accuracy [20].
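The three accuracy measures above can be sketched as follows; this is an illustrative Python fragment (the study did not publish such code), and the TP/TN/FP/FN counts used in the example are hypothetical, not study results.

```python
# Illustrative computation of sensitivity, specificity, and the DOR
# from the counts of a single parameter set (hypothetical numbers).
def accuracy_measures(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)   # proportion of biased runs correctly flagged
    specificity = tn / (tn + fp)   # proportion of unbiased runs correctly cleared
    dor = (tp * tn) / (fp * fn)    # DOR = (TP x TN) / (FP x FN)
    return sensitivity, specificity, dor

sens, spec, dor = accuracy_measures(tp=23, tn=24, fp=1, fn=2)
```

With these made-up counts, sensitivity is 0.92, specificity is 0.96, and the DOR is 276, i.e., well above the uninformative value of 1.00.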
Seven different randomisation procedures for each of three “trait frequencies” within each of the three sample sizes (N) were included. The trait frequency (TF) was defined as the percentage of subjects with a characteristic/trait ‘X’. This trait was assumed to act as a confounding factor causing intervention success (Y = 1), regardless of the intervention group.
Trial simulation and parameters
For the purpose of this study, an RCT simulation was constructed by assuming the comparison of two interventions (A and B) with dichotomous outcomes (intervention failure: Y = 0; intervention success: Y = 1). A simulated RCT consisted of three components: (i) a sequence of subject ID (accession) numbers; (ii) a sequence of the Reverse Propensity Score (RPS) [11] per subject ID with regard to the propensity of the subject to be allocated to Group A; and (iii) a sequence of dichotomous outcomes per subject ID (Y = 1 or 0). The RPS reflects the probability of allocation of a patient to group A [2]. For example, with block size 4 and an [ABAB] block, the sequence of RPS values would be 2/4, 1/3, 1/2, 0/1, respectively, reflecting the ratio of the number of remaining A allocations within the block to the overall number of remaining allocations within the block.
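The [ABAB] example can be reproduced with a minimal sketch; this is illustrative Python (the study implemented the RPS in R as “rps.gen”), and the function name is ours.

```python
# Minimal sketch of the RPS within one block: before each allocation,
# the ratio of remaining A slots to all remaining slots in the block.
from fractions import Fraction

def rps_sequence(block):
    """RPS before each allocation in `block`, a string such as 'ABAB'."""
    return [Fraction(block[i:].count("A"), len(block) - i)
            for i in range(len(block))]

# For 'ABAB': 2/4, 1/3, 1/2, 0/1 (the fractions reduce to 1/2, 1/3, 1/2, 0)
rps = rps_sequence("ABAB")
```

An RPS of 1 or 0 means the next allocation is fully predictable, which is precisely what a biased investigator could exploit.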
The parameters trait frequency (TF), subject number (N) and type of randomisation method (RM) were introduced into the simulation (Table 1). The seven randomisation methods were: fixed block randomisation with block size 4, 6, or 8; block randomisation with randomly varying block sizes 4, 6, and 8 with equal probability (1/3); and the maximal procedure [9] with a maximum tolerated imbalance (MTI) of 2, 3, or 4. The trait frequency (TF) was set to 10%, 20% or 50% of the total number of subjects (N), which in turn was set to 120, 240 or 480 subjects.
Table 1
Generated parameter sets for both scenarios (TF given as % / number of subjects with trait ‘X’)

Set | Randomisation method | N | TF (%/n)
01 | Fixed/BS = 4 | 120 | 10/12
02 | Fixed/BS = 4 | 120 | 20/24
03 | Fixed/BS = 4 | 120 | 50/60
04 | Fixed/BS = 4 | 240 | 10/24
05 | Fixed/BS = 4 | 240 | 20/48
06 | Fixed/BS = 4 | 240 | 50/120
07 | Fixed/BS = 4 | 480 | 10/48
08 | Fixed/BS = 4 | 480 | 20/96
09 | Fixed/BS = 4 | 480 | 50/240
10 | Fixed/BS = 6 | 120 | 10/12
11 | Fixed/BS = 6 | 120 | 20/24
12 | Fixed/BS = 6 | 120 | 50/60
13 | Fixed/BS = 6 | 240 | 10/24
14 | Fixed/BS = 6 | 240 | 20/48
15 | Fixed/BS = 6 | 240 | 50/120
16 | Fixed/BS = 6 | 480 | 10/48
17 | Fixed/BS = 6 | 480 | 20/96
18 | Fixed/BS = 6 | 480 | 50/240
19 | Fixed/BS = 8 | 120 | 10/12
20 | Fixed/BS = 8 | 120 | 20/24
21 | Fixed/BS = 8 | 120 | 50/60
22 | Fixed/BS = 8 | 240 | 10/24
23 | Fixed/BS = 8 | 240 | 20/48
24 | Fixed/BS = 8 | 240 | 50/120
25 | Fixed/BS = 8 | 480 | 10/48
26 | Fixed/BS = 8 | 480 | 20/96
27 | Fixed/BS = 8 | 480 | 50/240
28 | Varying | 120 | 10/12
29 | Varying | 120 | 20/24
30 | Varying | 120 | 50/60
31 | Varying | 240 | 10/24
32 | Varying | 240 | 20/48
33 | Varying | 240 | 50/120
34 | Varying | 480 | 10/48
35 | Varying | 480 | 20/96
36 | Varying | 480 | 50/240
37 | Maximal procedure/MTI = 2 | 120 | 10/12
38 | Maximal procedure/MTI = 2 | 120 | 20/24
39 | Maximal procedure/MTI = 2 | 120 | 50/60
40 | Maximal procedure/MTI = 2 | 240 | 10/24
41 | Maximal procedure/MTI = 2 | 240 | 20/48
42 | Maximal procedure/MTI = 2 | 240 | 50/120
43 | Maximal procedure/MTI = 2 | 480 | 10/48
44 | Maximal procedure/MTI = 2 | 480 | 20/96
45 | Maximal procedure/MTI = 2 | 480 | 50/240
46 | Maximal procedure/MTI = 3 | 120 | 10/12
47 | Maximal procedure/MTI = 3 | 120 | 20/24
48 | Maximal procedure/MTI = 3 | 120 | 50/60
49 | Maximal procedure/MTI = 3 | 240 | 10/24
50 | Maximal procedure/MTI = 3 | 240 | 20/48
51 | Maximal procedure/MTI = 3 | 240 | 50/120
52 | Maximal procedure/MTI = 3 | 480 | 10/48
53 | Maximal procedure/MTI = 3 | 480 | 20/96
54 | Maximal procedure/MTI = 3 | 480 | 50/240
55 | Maximal procedure/MTI = 4 | 120 | 10/12
56 | Maximal procedure/MTI = 4 | 120 | 20/24
57 | Maximal procedure/MTI = 4 | 120 | 50/60
58 | Maximal procedure/MTI = 4 | 240 | 10/24
59 | Maximal procedure/MTI = 4 | 240 | 20/48
60 | Maximal procedure/MTI = 4 | 240 | 50/120
61 | Maximal procedure/MTI = 4 | 480 | 10/48
62 | Maximal procedure/MTI = 4 | 480 | 20/96
63 | Maximal procedure/MTI = 4 | 480 | 50/240
Study scenarios
As both interventions were assumed to have no effect, and thus would not lead to ‘success’ on their own, the distribution of subjects with trait ‘X’ (Y = 1) between groups A and B served as an indicator of the presence or absence of selection bias. When subject allocation strictly followed a true random sequence, subjects with trait ‘X’ were evenly distributed between groups A and B and neither intervention group was superior to the other (Scenario 1: no selection bias). Subversion of the random allocation, by correct prediction of the random sequence through use of the RPS combined with knowledge of which subjects carry trait ‘X’, allowed allocation of these subjects in favour of intervention group A (Scenario 2: third-order selection bias). In this scenario, intervention A was superior to intervention B solely by virtue of an uneven distribution of subjects with trait ‘X’ (Y = 1), equal to the specified TF.
Scenario 1 was simulated, using R statistical software, by assigning the first N × TF/100 participants to Y = 1 and all others to Y = 0. Scenario 2 was simulated, likewise in R, by assigning Y = 1 to the subjects with the highest RPS values, favouring intervention A over B in accordance with the TF. For each parameter set, 25 individual random sequences (‘runs’) were generated.
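The two outcome-assignment rules can be sketched as below. This is an illustrative Python rendering of our reading of the text (the study used R); function names are ours, and the tie-breaking for equal RPS values is an assumption.

```python
# Hedged sketch of outcome assignment in the two scenarios.
def scenario1_outcomes(n, tf):
    """No bias: the first n*tf/100 subjects carry trait X (Y = 1)."""
    n_trait = n * tf // 100
    return [1] * n_trait + [0] * (n - n_trait)

def scenario2_outcomes(rps, n_trait):
    """Third-order bias: Y = 1 goes to the n_trait highest-RPS slots,
    i.e. the positions most predictably allocated to group A."""
    order = sorted(range(len(rps)), key=lambda i: rps[i], reverse=True)
    y = [0] * len(rps)
    for i in order[:n_trait]:
        y[i] = 1
    return y
```

For example, with N = 120 and TF = 10%, Scenario 1 marks exactly 12 subjects with Y = 1; in Scenario 2 those 12 land on the most predictable allocation slots instead.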
In order to illustrate the effect size inflation due to third-order selection bias at the various parameter settings of TF and N in the RCT simulations, a fixed-effect meta-analysis of all simulated RCTs per TF/N setting was conducted using RevMan 4.2.10 statistical software. Pooled odds ratios (OR) with 95% confidence intervals (CI) were computed from dichotomous datasets consisting, per intervention group, of (i) the number of subjects with intervention failure (Y = 0) and (ii) the total number of subjects per TF/N setting. A ratio of odds ratios (ROR) was calculated as the OR of all datasets of Scenario 2 divided by the OR of Scenario 1. In the absence of bias (Scenario 1), the OR is 1.00, indicating no difference in intervention failures between the test and control group. An OR below 1.00 indicates fewer intervention failures in the test group, and an OR above 1.00 indicates fewer intervention failures in the control group. In Scenario 2, bias favours the test group over the control group, i.e., the test group has fewer failed interventions (n), resulting in a lower OR. Because the OR of Scenario 1 is larger than the OR of Scenario 2, the calculated ROR is necessarily lower than 1.00. In line with convention [21], an ROR below 1.00 indicates an overestimation of the effect size in the former group of datasets in comparison to that of the referent group of datasets. From the established ROR, the effect size overestimation in percent, [1 − ROR] × 100%, was calculated. All OR were computed, using RevMan 4.2.10 statistical software, from the subject number (N) and the number of failed interventions (n) per intervention group.
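The OR, ROR and overestimation arithmetic can be illustrated as follows; the study computed the pooled ORs in RevMan, and the counts below are made-up numbers, not simulation results.

```python
# Illustrative OR and ROR computation with hypothetical failure counts.
def odds_ratio(fail_test, n_test, fail_ctrl, n_ctrl):
    """OR of intervention failure, test group vs. control group."""
    odds_test = fail_test / (n_test - fail_test)
    odds_ctrl = fail_ctrl / (n_ctrl - fail_ctrl)
    return odds_test / odds_ctrl

or_s1 = odds_ratio(54, 60, 54, 60)   # Scenario 1: balanced, OR = 1.00
or_s2 = odds_ratio(48, 60, 58, 60)   # Scenario 2: fewer failures in test group
ror = or_s2 / or_s1                  # ROR < 1.00 under bias
overestimation = (1 - ror) * 100     # effect size overestimation in %
```

With these hypothetical counts the ROR is about 0.14, i.e., an effect size overestimation of roughly 86%.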
As the actual effect of either intervention was of no interest in this context, it was set at zero (Y = 0). Hence, when compared with each other, neither intervention would yield a result superior to that of the other (odds ratio = 1.00). This setting allowed investigation of the bias effect alone.
From the various parameters (randomisation method, TF and N), a total of 63 different parameter sets for the simulated RCTs were generated; these are presented in Table 1. All RCT simulations were conducted in four steps, using R statistical software, based on the generated variables ID, BLOCK, TRT, RPS and Y:
Step 1: Generation of subject identification (ID) i = 1:N;
Step 2: Generation of randomisation blocks/MTI (BLOCK) and randomisation according to the different RMs;
Step 3: Generation of the 2-arm treatment (TRT) in each block. For fixed and varying block randomisation, the two treatments have equal probability of being assigned to either group, and within each block the number of each treatment is the same. Thus, random numbers were generated from a standard uniform distribution for each block of size m_i, i = 1,…,N; the m/2 numbers above the median were assigned to intervention A (TRT = 1) and the remaining m/2 to intervention B (TRT = 0). For the maximal procedure (MP), the generation of the treatment sequence was based on the MTI [9]; i.e., a qualified random sequence D must satisfy I_k(D) ≤ MTI for all k and I_N(D) = 0, where I_k(D) = |S_k,A(D) − S_k,B(D)|, S_k,A(D) = Σ X_i(D) for i = 1,…,k, and S_k,B(D) = k − S_k,A(D), with X_i(D) = 1 if sequence D assigns the i-th patient to treatment group A and X_i(D) = 0 if it assigns the i-th patient to treatment group B.
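The admissibility condition for the maximal procedure can be sketched directly from these definitions; this is an illustrative Python check (the study generated the sequences in R), and the function name is ours.

```python
# Sketch of the MTI admissibility check for the maximal procedure:
# a sequence D qualifies if the running imbalance I_k(D) = |S_kA - S_kB|
# never exceeds the MTI and the final imbalance I_N(D) is zero.
def qualifies(seq, mti):
    """seq is a list of X_i(D) values: 1 for group A, 0 for group B."""
    s_a = 0
    for k, x in enumerate(seq, start=1):
        s_a += x                          # S_kA: A-allocations so far
        imbalance = abs(s_a - (k - s_a))  # I_k(D) = |S_kA - S_kB|
        if imbalance > mti:
            return False
    return imbalance == 0                 # I_N(D) = 0: final balance
```

For example, [A, B, A, B] qualifies for any MTI ≥ 1, whereas [A, A, A, B] violates an MTI of 2 at the third allocation.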
Step 4: Calculation of the RPS from the generated block and treatment information, using the code “rps.gen” in R (Additional file 1: Appendix file 1), on the basis of the block information (BLOCK) and the treatment information (TRT).
Bias testing
The Berger-Exner test was applied for bias testing. The test consists of linear regression analysis, conducted separately per intervention group, with the RPS as independent and the Y-values as dependent variable [2]; the resulting p-values were recorded. In order to investigate the influence of various alpha levels on the test accuracy, alpha was set at 1%, 5% and 20%.
A true negative (TN) result was established when both p-values, for intervention groups A and B, were above 0.01, 0.05 or 0.20 for alpha levels of 1%, 5% and 20% (two-sided), respectively. A true positive (TP) result was established when at least one of the p-values of either intervention group was below 0.01, 0.05 or 0.20 for alpha levels of 1%, 5% and 20%, respectively. The number of false positive (FP) results was calculated by subtracting the total number of TN results from the total number of runs per parameter set, i.e., FP = 25 − TN. The number of false negative (FN) results was calculated by subtracting the total number of TP results from the total number of runs per parameter set, i.e., FN = 25 − TP.
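The classification rule applied to each run can be sketched as follows; the per-group regression p-values would come from the Berger-Exner regressions described above, and this Python fragment (function name ours) only illustrates the decision logic.

```python
# Sketch of the per-run classification: a run is test-positive if either
# group's regression p-value falls below alpha; the label then depends on
# whether bias was actually simulated in that run.
def classify_run(p_a, p_b, alpha, bias_present):
    positive = (p_a < alpha) or (p_b < alpha)  # at least one group flags bias
    if bias_present:
        return "TP" if positive else "FN"
    return "FP" if positive else "TN"
```

For instance, with alpha = 5%, p-values of 0.03 and 0.40 yield TP in a biased run but FP in an unbiased one.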
Data analysis and summary measures
From the 63 separate parameter sets, the established total numbers of true negative/false positive (TN/FP) and true positive/false negative (TP/FN) results per alpha level were entered into Meta-DiSc Version 1.4 statistical software [22], and the pooled specificity and sensitivity with 95% confidence intervals (CI) for each alpha level were computed. In addition, symmetrical Summary Receiver Operating Characteristic (SROC) curves per alpha level were generated from these data. The SROC curve shows the relationship between the sensitivity and the complement of the specificity across all individual test results, i.e., between the true positive rate, TP/(TP + FN), and the false positive rate, FP/(FP + TN).
The influence of the parameters N, TF and RM was investigated by computing the diagnostic odds ratio (DOR, with 95% CI) from the relevant TN/FP and TP/FN data per parameter setting of each study parameter and alpha level (Table 1), using Meta-DiSc Version 1.4 statistical software.
All data pooling in Meta-DiSc 1.4 was based on the standard DerSimonian-Laird random-effects model [22]. The DerSimonian-Laird method produces a random-effects meta-analysis that incorporates the assumption that different studies estimate different, yet related, effects. The model may not be optimal but remains valid even when the random effects are not normally distributed. In addition, the model allows the treatment effects to differ across runs, around an underlying true effect with a between-runs variance, using a non-iterative method to estimate the treatment effect variance.
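The non-iterative between-runs variance estimate at the heart of this method can be sketched as below; this is a generic Python rendering of the standard DerSimonian-Laird moment estimator (the study used Meta-DiSc), with made-up inputs.

```python
# Minimal sketch of the non-iterative DerSimonian-Laird estimate of the
# between-run variance tau^2, from per-run effects y_i and within-run
# variances v_i (illustrative inputs, not study data).
def dersimonian_laird_tau2(effects, variances):
    w = [1 / v for v in variances]                      # fixed-effect weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1                               # Cochran's Q and its df
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - df) / c)                       # truncated at zero
```

When the per-run effects are identical, the estimate is zero and the model reduces to a fixed-effect analysis; excess heterogeneity (Q above its degrees of freedom) yields a positive between-runs variance.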