Background
The commonly used method for a random effects meta-analysis is the DerSimonian and Laird approach (DL method) [1]. It is used by popular statistical programs for meta-analysis, such as Review Manager (RevMan [2]) and Comprehensive Meta-analysis [3]. However, it is well known that the method is suboptimal and may lead to too many statistically significant results when the number of studies is small and there is moderate or substantial heterogeneity [4–10]. If a treatment is inefficacious and testing is done at a significance level of 0.05, the error rate should be 5%, i.e. only one in 20 tests should result in a statistically significant result. For the DL method, the error rate can be substantially higher, unless the number of studies is large (≫ 20) and there is no or only minimal heterogeneity [4–10].
Given this deficiency, alternative methods for random effects meta-analysis have been proposed. In particular, the method described by Hartung and Knapp [4–6] and by Sidik and Jonkman [11, 12] (HKSJ method) is claimed to be simple and robust [13]. Simulations have shown that the HKSJ method performs better than DL, especially when there is heterogeneity and the number of studies in the meta-analysis is small [4–14]. This means that for most meta-analyses the HKSJ method might be more appropriate than the conventional DL method. In a sample of 22,453 meta-analyses, Davey et al. show that the number of studies in a meta-analysis is often relatively small, with a median of 3 studies (Q1–Q3: 2–6), and only 1% of meta-analyses containing 28 studies or more [15]. Some detectable heterogeneity is present in about half of meta-analyses of clinical studies [15–18].
Based on earlier results that showed that the results of a single large trial were unreliable [19], we hypothesized that meta-analysis methods, including HKSJ, would perform less adequately when the meta-analysis is carried out on a mixture of very unequal-sized studies, e.g. one large and several small trials. Such a situation is not uncommon. In a random sample of 186 systematic reviews of the Cochrane Database [18] the ratio between large and small trial sizes ranged between 1 and 1650, with a median of 5 and an interquartile range from 3 to 10. Sixty per cent of the reviews contained no large trials, but 40% had one trial that was at least twice as large as the median trial size, 25% had one trial that was at least five times larger, and 10% had one trial that was at least 10 times larger.
Although several simulations have shown that the HKSJ method performs better than the DL method, the focus in these studies was not on a systematic evaluation of the effects of specific trial size mixtures in combination with low trial numbers. They either only reported the overall results of various mixtures combined or they studied only a limited number of combinations. In order to investigate the impact of unequal study sizes, we used simulations, mimicking such realistic conditions rather than situations where trials have implausibly similar sample sizes. We focused on meta-analyses with small numbers of studies (up to 20) with a dichotomous outcome (odds ratio, relative risk) or a continuous outcome. To mimic the variation in trial sizes, we explicitly varied the sample sizes of the trials within the simulated meta-analyses, varying from scenarios where all trials in a meta-analysis were of equal size, to scenarios with only one large trial, 10 times as large as the other trials, or one small trial, 10 times smaller than the other trials.
To complement the simulations, we used empirical data from recent meta-analyses of interventions (added or updated in 2012) in the Cochrane Database of Systematic Reviews (CDSR) to assess the number of nominally statistically significant findings (p < 0.05) of both methods in practice. This allows us to examine whether inferences would differ substantially between the two methods.
Currently not all standard software packages, such as Review Manager, provide an option to perform an HKSJ analysis, although the HKSJ method is computationally not complicated and the importance of suitable methods for meta-analyses with small numbers of trials is apparent. Version 3.0 of Comprehensive Meta-analysis [3] will contain the HKSJ method (personal communication by Julio Sánchez-Meca, September 2013). The R package metafor [20] and the metareg command in Stata [21] also include the HKSJ method. However, not everybody will be acquainted with the use of R or Stata. Moreover, use of these packages is not straightforward when a post-hoc conversion is desired, i.e. when the results of a DL random effects analysis must be converted to the HKSJ approach. To fill this gap, we show step by step how the HKSJ analysis can be performed without the use of these packages, when the results of a common random effects (DL) meta-analysis are available, e.g. from a systematic review. This conversion is applicable for continuous outcomes and for outcomes where metrics are log-transformed, like the risk ratio (RR), odds ratio (OR), hazard ratio (HR) or Poisson rate. This simple modification of the common random effects analysis will improve the summary results, and it can be done through some basic calculations or a few statements in Excel. An Excel file is available as web material (Additional file 1). R code for the metafor package is provided in Appendix 3.
The simulations, the selection of empirical data and the statistical analysis are described in the Methods section. In the Results section the error rates for the DL and HKSJ methods for several realistic simulated scenarios are provided. For the Cochrane meta-analyses, we present the number of nominally statistically significant findings with the DL and HKSJ methods. The conversion of DL results into HKSJ results is illustrated, including examples from systematic reviews as presented in the Cochrane Library.
Methods
We used simulated data as well as empirical data of the Cochrane 2012 Issues to evaluate the DL and HKSJ approaches. The pooled effect estimate is equal for both approaches, but the methods differ with respect to the calculation of the confidence interval and the statistical test. For DL, these are based on the normal distribution, whereas for the HKSJ method, they are based on the t-distribution with the degrees of freedom equal to the number of trials minus one, and a weighted version of the DL standard error. Detailed statistical methods are presented in Appendix 1.
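To give a sense of how much the switch from the normal to the t-distribution matters on its own, the following minimal sketch compares the two-sided 95% critical values for small numbers of trials. The t quantiles are standard tabulated values, hard-coded here for illustration; the names `T_975` and `ci_multiplier_ratio` are ours and do not come from any package. Because the HKSJ method also rescales the standard error, this ratio is only one part of the difference in interval width.

```python
from statistics import NormalDist

# Standard tabulated 97.5% quantiles of the t-distribution, keyed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262, 19: 2.093}

# 97.5% quantile of the standard normal distribution (about 1.96), as used by DL.
z_975 = NormalDist().inv_cdf(0.975)


def ci_multiplier_ratio(k):
    """Ratio of the t critical value (k - 1 df) to the normal critical value for k trials."""
    return T_975[k - 1] / z_975


# Number of trials, t critical value, and its ratio to the normal critical value.
for k in (2, 3, 5, 10, 20):
    print(k, T_975[k - 1], round(ci_multiplier_ratio(k), 2))
```

With only two or three trials the t critical value is several times larger than the normal one, whereas with 20 trials the two are nearly equal.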
Methods - simulations
Our first aim was to investigate the error rates of the HKSJ meta-analysis method in comparison to the common (DL) method for various realistic scenarios, i.e. combinations of study sizes, study size mixtures and heterogeneity in series of just a few trials. Therefore we simulated series of trials with two to 20 studies, where each series provided the data for one meta-analysis. First, we considered series that consisted of equally sized trials, each with two groups of 25, 50, 100, 250, 500 or 1000 subjects. Second, we looked into series of trials with different trial sizes, in which the percentage of large trials was 25%, 50% or 75%, e.g. a series of one large trial and three small trials. Average group sizes were 100, 250, 500 or 1000 subjects, and the large trials had 10 times more subjects than the small trials. For example, a series of six small (normal) and two large trials, with an average group size of 100, has group sizes of 31 and 308 in the small and large trials, respectively. Third, we simulated extreme scenarios, in which a series had only small trials, except for one large one, or only large trials, except for one small one. Both continuous and dichotomous outcomes were evaluated. For continuous outcomes, a normally distributed overall mean difference between the group means was simulated. In the trials with a dichotomous outcome, the event rates in the groups varied between scenarios and ranged from 0.1 to 0.9, in steps of 0.2. The heterogeneity was superimposed and set at I² = 0, 0.25, 0.50, 0.75 and 0.9. I² represents the heterogeneity, i.e. the degree of inconsistency in the studies’ results, in comparison to the total amount of variation [16, 22]. The levels correspond to no, low, moderate, high and very high heterogeneity, respectively [16].
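The group sizes in a mixed series follow from the average group size and the 10:1 size ratio. The sketch below reproduces the worked example of six small and two large trials; the function name and the rounding to the nearest integer are our assumptions, not a description of the actual simulation code.

```python
def group_sizes(n_small, n_large, average, ratio=10):
    """Group sizes (small, large) such that the mean group size over all trials equals `average`.

    Solves (n_small * s + n_large * ratio * s) / (n_small + n_large) = average for s.
    Rounding to whole subjects is an assumption made for this illustration.
    """
    k = n_small + n_large
    small = average * k / (n_small + ratio * n_large)
    return round(small), round(ratio * small)


# Six small and two large trials with an average group size of 100.
print(group_sizes(6, 2, 100))
```

For this example the unrounded small group size is 800/26 ≈ 30.8, which gives the 31 and 308 subjects per group quoted in the text.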
Our aim was to evaluate the error rate, i.e. the percentage of statistically significant meta-analyses when the overall mean treatment difference was zero. Hence we simulated series with an overall treatment difference equal to zero and performed on each series a DL [1] and an HKSJ [11] random effects meta-analysis. The two-sided significance level was 0.05. For each scenario, we simulated 10,000 series of trials. In the ideal situation, 5% of the 10,000 meta-analyses should have a statistically significant result when the significance level is 0.05. For the scenarios with the dichotomous outcome we determined the error rate when the OR was evaluated (logistic model) and when the RR was estimated. In these cases, meta-analysis was done on the logarithmic scale, and the error rates were determined for OR = 1 or RR = 1. More details can be found in Appendices 1 and 2.
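The logic of such an error-rate simulation can be sketched in a few lines. This is a simplified illustration, not the simulation code used for this study: it draws study-level effect estimates directly from a normal distribution with known within-study variances instead of simulating individual subjects, it uses fewer replications, and all names (`dl_hksj`, `error_rates`, `T_975`) are hypothetical. The t critical values are standard tabulated quantiles.

```python
import random
from statistics import NormalDist

Z_975 = NormalDist().inv_cdf(0.975)      # normal critical value used by DL
T_975 = {2: 4.303, 3: 3.182, 4: 2.776}   # tabulated t critical values, keyed by df


def dl_hksj(y, v):
    """Return (estimate, var_DL, var_HKSJ) for effects y with within-study variances v."""
    k = len(y)
    w = [1 / vi for vi in v]                                  # fixed effects weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))   # heterogeneity statistic Q
    tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))
    ws = [1 / (vi + tau2) for vi in v]                        # random effects weights
    est = sum(wi * yi for wi, yi in zip(ws, y)) / sum(ws)
    var_dl = 1 / sum(ws)
    var_hksj = sum(wi * (yi - est) ** 2 for wi, yi in zip(ws, y)) / ((k - 1) * sum(ws))
    return est, var_dl, var_hksj


def error_rates(v, tau2, n_sim=4000, seed=1):
    """Fraction of significant DL and HKSJ results when the true overall effect is zero."""
    rng = random.Random(seed)
    k = len(v)
    sig_dl = sig_hksj = 0
    for _ in range(n_sim):
        # Each study estimate varies by within-study error plus between-study heterogeneity.
        y = [rng.gauss(0.0, (vi + tau2) ** 0.5) for vi in v]
        est, var_dl, var_hksj = dl_hksj(y, v)
        sig_dl += abs(est) > Z_975 * var_dl ** 0.5
        sig_hksj += abs(est) > T_975[k - 1] * var_hksj ** 0.5
    return sig_dl / n_sim, sig_hksj / n_sim


# Three small trials and one trial ten times as large (variance ten times smaller).
print(error_rates([1.0, 1.0, 1.0, 0.1], tau2=1.0))
```

With equal within-study variances the HKSJ test in this sketch reduces to a one-sample t-test on the study estimates, so its error rate should be close to 5% in that case; unequal study sizes are where the two methods become more interesting to compare.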
Methods - empirical data from the 2012 Cochrane Database of Systematic Reviews
Cochrane Reviews are systematic reviews of primary research in human health care and health policy, and are internationally recognised as the highest standard in evidence-based health care [23]. The aim of the Cochrane collaboration is to provide accessible and credible evidence to guide decision making in medicine and public health. We were very fortunate that the UK Cochrane Editorial Unit provided us with the statistical data added to the CDSR in 2012, which allowed us to assess the number of statistically significant results in real data.
Many Cochrane reviews include multiple meta-analyses. Many of those overlap or are based on correlated data. Usually, the first analysis is the primary analysis. Hence, we decided to use per review only the first meta-analysis that was based on at least three studies. In order to maximize the number of meta-analyses, we used both the first continuous and the first binary outcome meta-analysis, whenever possible. Thus some systematic reviews provided none, and some provided one or two meta-analyses for our research. We always performed a random effects meta-analysis, even when the authors originally performed a fixed-effects analysis. Details can be found in Appendix 1.
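The selection rule can be written down compactly. The sketch below is purely illustrative: the record layout (`id`, `outcome`, `n_studies`) is a hypothetical representation of a review's meta-analyses in their order of appearance, not the format of the data we received.

```python
def select_meta_analyses(review):
    """Pick the first continuous and the first binary meta-analysis with at least three studies."""
    chosen = {}
    for ma in review:  # meta-analyses in the order in which they appear in the review
        kind = ma["outcome"]
        if kind in ("continuous", "binary") and ma["n_studies"] >= 3 and kind not in chosen:
            chosen[kind] = ma
    return list(chosen.values())


# Hypothetical review with four meta-analyses.
review = [
    {"id": "1.1", "outcome": "binary", "n_studies": 2},
    {"id": "1.2", "outcome": "binary", "n_studies": 5},
    {"id": "1.3", "outcome": "continuous", "n_studies": 4},
    {"id": "2.1", "outcome": "binary", "n_studies": 8},
]
print([ma["id"] for ma in select_meta_analyses(review)])
```

In this made-up review, analysis 1.1 is skipped because it has fewer than three studies, and 2.1 is skipped because a binary outcome meta-analysis has already been selected.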
It is impossible to determine which of the Cochrane reviews compared treatments that truly had equal efficacy. It is thus unknown which of the statistically significant results were in fact false positive findings, so we could not determine the false positive error rate. Hence we decided to present the total number of significant findings of the DL and HKSJ methods instead of the error rates. This provides an indication of the impact a change from DL to HKSJ would have in practice.
Discussion
The DL approach to random effects meta-analysis is still the standard method, almost to the exclusion of all other methods. This might be considered remarkable, bearing in mind the high false positive rates of the DL method which have been shown repeatedly with simulations [4–14] and also an empirical study suggesting that results are sensitive to the choice of random effects analysis method [26]. Thorlund et al. did an empirical assessment of method-related discrepancies in 920 Cochrane primary outcome meta-analyses of ≥ 3 studies [26]. In total, 326 (35.4%) meta-analyses were statistically significant when the analysis was based on a t-distribution, as in the HKSJ method, and 414 (45%) when it was based on the normal distribution, as in the DL method. Our evaluation of Cochrane meta-analyses of interventions gave a similar result: a substantially larger number of significant findings with the DL method than with the HKSJ method. Our simulations suggest that among the DL significant findings in the Cochrane reviews there may be a considerable number of false positives.
DL results can easily be converted into HKSJ results, which have a much better performance. We confirmed this with simulations, for mixtures of trial size distributions in settings with up to 20 trials per meta-analysis. When there was heterogeneity, the mean error rates of the DL approach were consistently higher than those of the HKSJ approach, although the latter also doubled to 10% in scenarios with only one large trial. When there was no heterogeneity, the DL error rates were lower than 5%, and the HKSJ rates were approximately 5%.
However, there are some limitations with respect to the HKSJ analysis method. Although the error rates of the HKSJ method were closer to the 5% level than those of the DL method, our simulations showed that in some scenarios the HKSJ error rates more or less doubled, whereas the DL error rates could be more than four times too high in these same settings. Hence, the results of the HKSJ analysis are also not perfect. As we hypothesized, the error rates were maximal if one of the trials in the meta-analysis was substantially larger than the other ones.
Further, when study numbers are small, the distribution of the treatment effects is unknown and does not necessarily follow the normal or t-distribution. Kontopantelis and Reeves [27] showed that with slight heterogeneity the coverage of the HKSJ method was consistently 94% when the true effects were not distributed according to the normal or t-distribution, but with larger heterogeneity the non-parametric permutation (PE) method of Follmann and Proschan [7] performed better than the HKSJ method. However, the PE method can only be performed when the number of studies is larger than five, whereas many meta-analyses are smaller [15]. Several other methods have been developed, like the Quantile Approximation (QA) method [28], the Profile Likelihood approach [29], natural weighting instead of empirically based weighting of studies [30], use of fixed effects estimates with a random effects approach to heterogeneity [31] and more recently, higher-order likelihood inference methods [32]. However, most of these methods are based on asymptotic statistics and they may therefore be less robust in case of a limited number of trials, or they remain difficult to use in practice, because no statistical packages are available to perform them and it is very difficult to carry out the calculations with standard software. Regarding the non-asymptotic, computationally straightforward QA method, Sánchez-Meca and Marín-Martínez [13] have already shown that it was outperformed by the HKSJ method. It would require a very extensive evaluation to investigate the performance of all of these methods. We restricted ourselves to the HKSJ method because of its computational simplicity, and we show that HKSJ results can easily be derived from DL results.
As far as we know, we are the first to present systematically the error rates in relation to explicit trial size mixtures when the numbers of trials range from 2 to 20. Follmann and Proschan [7] show that for certain trial size mixtures and low numbers of trials the DL error rates can be highly increased; however, they did not evaluate the HKSJ method. The results reported by Hartung, Knapp and Makambi [4–6, 8, 9] imply that for meta-analyses of three, six or twelve studies the DL error rates for studies with similar sizes were closer to 5% than for studies of different sizes, and that the HKSJ method performed much better than DL in the latter situation. However, they did not report the explicit relationship between the trial size mixtures and error rates as we do (Table 1). Sánchez-Meca and Marín-Martínez [13] also varied the sample size ratios in their simulations. They concluded that the average sample size scarcely affected the performance of the different methods, but this was based on the combined results of 5–100 studies and they presented no results of particular trial size mixtures.
As all studies show that in settings with few studies the HKSJ method always resulted in error rates closer to 5% than the DL method, the latter method should not be used and the HKSJ method should be the standard approach. To facilitate its more widespread application, the conversion of DL results into HKSJ results is presented step by step. At the same time, we urge caution when any random effects model, including HKSJ, is applied to situations where there are very few studies, and even more so when the sample sizes of the combined studies are very different. Even the HKSJ confidence intervals may be too narrow in these situations, and inferences may be spurious if the confidence intervals are taken at face value.
Appendix 1: Statistical details
For $k$ studies, let the random variable $y_i$ be the effect size estimate from the $i$th study. The random effects model can be defined as follows:

$$y_i = \delta_i + e_i, \qquad e_i \sim N(0, \sigma_i^2),$$

for $i = 1, \ldots, k$, where $\delta_i = \delta + d_i$; $e_i$ and $d_i$ are independent, and $d_i \sim N(0, \tau^2)$. $\sigma_i^2$ is the within-study variance, describing the extent of estimation error of $\delta_i$, and the parameter $\tau^2$ represents the heterogeneity of the effect size between the studies.
For studies with dichotomous outcomes where no events were observed in one or both arms, the computation of the random effects model yields a computational error. In these cases, before performing any meta-analysis, we added 0.5 to all cells of such a study.
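As an illustration of this correction, the sketch below computes a log odds ratio and its variance from a 2×2 table. It uses the standard large-sample variance of the log odds ratio (the sum of the reciprocal cell counts); the function name is ours, and applying the correction whenever any cell is zero, rather than only when an arm has no events, is a simplifying assumption.

```python
from math import log


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance from a 2x2 table.

    a, b: events and non-events in the treatment arm
    c, d: events and non-events in the control arm
    If any cell is zero, 0.5 is added to all four cells of the study (assumption:
    the correction is triggered by any empty cell, not only by zero events).
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


# No events in the treatment arm: the uncorrected odds ratio would be zero.
print(log_odds_ratio(0, 10, 5, 5))
```

Without the correction, the logarithm of a zero odds ratio and the reciprocal of a zero cell count are both undefined, which is the computational error referred to above.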
Random effects analysis
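Let $w_i = 1/\sigma_i^2$ be the fixed effects weights, i.e. the inverse of the within-study variance $\sigma_i^2$, and let

$$\hat{\delta}_F = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}$$

be the fixed effects estimate of $\delta$.
Let $Q$ be the heterogeneity statistic

$$Q = \sum_{i=1}^{k} w_i \left(y_i - \hat{\delta}_F\right)^2.$$

Then

$$\hat{\tau}^2 = \max\left\{0, \frac{Q - (k - 1)}{\sum_{i=1}^{k} w_i - \sum_{i=1}^{k} w_i^2 \big/ \sum_{i=1}^{k} w_i}\right\}$$

is an estimate of the variance $\tau^2$.
The random effects estimate for the average effect size $\delta$ is

$$\hat{\delta}_{DL} = \frac{\sum_{i=1}^{k} w_i^* y_i}{\sum_{i=1}^{k} w_i^*},$$

where

$$w_i^* = \frac{1}{\sigma_i^2 + \hat{\tau}^2}.$$

The DerSimonian and Laird method estimates the variance of $\hat{\delta}_{DL}$ by

$$\widehat{\mathrm{var}}_{DL} = \frac{1}{\sum_{i=1}^{k} w_i^*}$$

and uses the normal distribution to derive P-values and confidence intervals.
In contrast, the Hartung, Knapp, Sidik and Jonkman method estimates the variance of $\hat{\delta}_{DL}$ by

$$\widehat{\mathrm{var}}_{HKSJ} = \frac{\sum_{i=1}^{k} w_i^* \left(y_i - \hat{\delta}_{DL}\right)^2}{(k - 1) \sum_{i=1}^{k} w_i^*}$$

and uses the t-distribution with $k - 1$ degrees of freedom to derive P-values and confidence intervals, with $k$ the number of studies in the meta-analysis.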
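Because the random effects weights sum to the inverse of the DL variance, the HKSJ variance equals the DL variance multiplied by the factor q = Σ w*ᵢ (yᵢ − δ̂)² / (k − 1), where w*ᵢ are the random effects weights and δ̂ is the pooled random effects estimate. This suggests a post-hoc conversion from reported DL results, sketched below under stated assumptions: the inputs are the study estimates, the relative random effects weights in per cent (as typically shown in a forest plot), the pooled estimate and its DL standard error. The function name and the example numbers are made up for illustration, and the t critical values are standard tabulated quantiles. For ratio measures such as OR or RR, all inputs should be on the log scale.

```python
from math import sqrt

# Standard tabulated 97.5% quantiles of the t-distribution, keyed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def dl_to_hksj(effects, weights_pct, pooled, se_dl):
    """Convert DL random effects results into an HKSJ standard error and 95% CI.

    effects:     study effect estimates
    weights_pct: relative random effects weights in per cent (summing to 100)
    pooled:      pooled random effects estimate
    se_dl:       DL standard error of the pooled estimate
    """
    k = len(effects)
    # Absolute weights: the random effects weights sum to 1 / se_dl**2.
    w = [p / 100 / se_dl ** 2 for p in weights_pct]
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects)) / (k - 1)
    se_hksj = sqrt(q) * se_dl
    t = T_975[k - 1]
    return se_hksj, (pooled - t * se_hksj, pooled + t * se_hksj)


# Made-up example: three equally weighted studies.
se, ci = dl_to_hksj([0.1, 0.5, 0.9], [100 / 3] * 3, pooled=0.5, se_dl=0.2)
print(round(se, 4), tuple(round(x, 3) for x in ci))
```

The pooled estimate itself is unchanged by the conversion; only the standard error and the reference distribution differ.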
Heterogeneity estimates
Although $\hat{\tau}^2$ or $Q$ can be used as measures of the heterogeneity, Higgins and Thompson [16] propose

$$I^2 = \max\left\{0, \frac{Q - (k - 1)}{Q}\right\}.$$

$I^2$ is a relative measure. It compares the variation due to heterogeneity ($\tau^2$) to the total amount of variation in a ‘typical’ study ($\tau^2 + \epsilon^2$), where $\epsilon$ is the standard error of a typical study of the review [33]:

$$I^2 = \frac{\tau^2}{\tau^2 + \epsilon^2}.$$
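Both views of I² are easy to compute. The sketch below assumes the usual definition based on Q, I² = max{0, (Q − (k − 1))/Q}, and inverts the relation I² = τ²/(τ² + ε²) to find the between-study variance that corresponds to a target I² for a given typical within-study variance ε². The function names are ours.

```python
def i_squared(q, k):
    """I^2 from the heterogeneity statistic Q and the number of studies k, truncated at zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - (k - 1)) / q)


def tau2_for_target_i2(i2, typical_var):
    """Between-study variance tau^2 such that tau^2 / (tau^2 + typical_var) equals i2 (i2 < 1)."""
    return i2 * typical_var / (1 - i2)


print(i_squared(8.0, 5))              # Q = 8 with 5 studies
print(tau2_for_target_i2(0.75, 1.0))  # tau^2 needed for I^2 = 0.75 when epsilon^2 = 1
```

For example, a target I² of 0.75 requires a between-study variance three times the typical within-study variance.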
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
GB conceived the idea. JIH contributed substantially to the study design, developed the software and performed the statistical analyses. JIH, GB and JPAI drafted the manuscript, and read and approved the final manuscript.