Background
Meta-analysis provides a way of quantitatively synthesising the results of medical studies or trials that target a particular research question. As shown in a 2005 review of the clinical research literature [1], it is still most common to meta-analyse results across clinical studies using the inverse variance approach, to yield a 'fixed' or 'common' effect estimate. By obtaining individual patient data (IPD) from all trials in a meta-analysis, some aspects of clinical heterogeneity can be minimised through data cleaning [2]. However, regardless of whether the meta-analysis is based on IPD or aggregate data, substantial statistical heterogeneity between studies may still remain.
Cochran's $Q$ statistic has long been used to assess statistical heterogeneity in meta-analysis. When $Q$ is larger than its expected value $E[Q]$ under the null hypothesis of no heterogeneity, the difference $Q - E[Q]$ can be used to furnish the most popular estimate of the heterogeneity parameter, using the DerSimonian and Laird method [3]. Higgins and Thompson's $I^2$ statistic [4,5] is also a simple function of $Q$ and quantifies the proportion of total variation that is between trial heterogeneity. Unlike $Q$, $I^2$ is designed to be independent of the number of trials constituting the meta-analysis and independent of the outcome's scale, so it can easily be compared across meta-analyses. It is now reported as standard, with or without Cochran's $Q$.
The presence of significant and substantial heterogeneity demands some form of action. Ideally, after exploration of the data, heterogeneity can be explained by variation in the constituent trials' characteristics. If this is not possible then some may feel a meta-analysis is inappropriate altogether, whereas others would opt for fitting a random effects model to the data instead. There is no accepted rule for deciding when a move from a fixed to a random effects model is the right course of action [6]. Clearly, all other things being equal, the larger the magnitude of the heterogeneity the stronger the case for a shift. However, as the amount of heterogeneity increases, so too does the potential impact of moving from one model to the other. Thus, with increasingly diverging interpretations, it is sometimes very difficult to make a satisfactory decision on which model to choose, or indeed whether to pool the trials in a meta-analysis at all.
In Methods we review the standard approach to meta-analysis and heterogeneity quantification based on the Q statistic. We then introduce a similar approach based on a 'generalised Q' statistic that has recently been proposed. In Results we analyse the summary data from 18 separate IPD meta-analyses to see whether the original conclusions could have been sensitive to the choice of fixed or random effects model. A more in-depth analysis is then conducted on the two meta-analyses with the largest observed heterogeneity. The 18 meta-analyses are then used to illustrate the relative performance of the standard and generalised Q statistics in measuring the extent of heterogeneity present. Finally, in Discussion and Conclusions we review the issues raised and offer recommendations for the future quantification and reporting of heterogeneity in meta-analysis.
The data
The MRC Clinical Trials Unit has carried out systematic reviews and IPD meta-analyses, predominantly in cancer, since 1991. Their common primary aim has been to assess whether treatment interventions have improved patient survival. Specific areas of focus include cancers of the brain, lung, cervix, ovaries and bladder. Table 1 shows the summary statistics of 18 such IPD meta-analyses [7-17]. The usefulness of these meta-analyses is that they all pre-specified subgroup analyses by trial and patient characteristics in order to explain potential heterogeneity. For illustration, the analyses here ignore any pre-specified groupings and consider only the primary outcome of overall survival. A two-stage approach was taken for each meta-analysis treatment comparison. That is, fixed effect hazard ratio estimates were calculated for each trial using the log rank method [18]; these estimates were then combined using fixed and random effects models in the same manner as for aggregate data. The meta-analyses differed in terms of their size (from 5-19 studies), their fixed effect hazard ratio effect estimate (0.65-1.20) and their heterogeneity (Q statistic p-values from $1.97 \times 10^{-5}$ to 0.99 and $I^2$ from 0 to 75%).
Table 1
The summary statistics for 18 meta-analyses carried out by the MAG.
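| Number of trials | Q, p-value | $I^2$ (%) | Fixed effect hazard ratio (confidence interval) p-value |
| --- | --- | --- | --- |
| 18 | 44.48, 0.00 | 62 | 1.05 (0.93-1.19) 0.39 |
| 18 | 20.83, 0.23 | 18 | 0.76 (0.67-0.85) 0.00 |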
| 5 | 9.18, 0.06 | 56 | 0.65 (0.53-0.80) 0.00 |
| 9 | 7.27, 0.51 | 0 | 0.91 (0.83-1.01) 0.08 |
| 6 | 2.25, 0.81 | 0 | 0.75 (0.60-0.96) 0.02 |
| 17 | 28.98, 0.02 | 45 | 1.04 (0.96-1.12) 0.33 |
| 7 | 3.63, 0.73 | 0 | 0.98 (0.83-1.14) 0.76 |
| 25 | 22.32, 0.56 | 0 | 0.90 (0.83-0.97) 0.01 |
| 11 | 39.63, 0.00 | 75 | 0.84 (0.74-0.95) 0.01 |
| 19 | 21.92, 0.24 | 18 | 0.98 (0.91-1.06) 0.69 |
| 11 | 12.83, 0.23 | 22 | 0.93 (0.83-1.05) 0.23 |
| 9 | 14.78, 0.06 | 46 | 0.88 (0.79-0.98) 0.02 |
| 9 | 10.35, 0.24 | 23 | 0.91 (0.80-1.05) 0.21 |
| 12 | 2.57, 1.00 | 0 | 1.02 (0.93-1.12) 0.66 |
| 9 | 13.06, 0.11 | 39 | 1.21 (1.08-1.34) 0.00 |
| 14 | 11.80, 0.54 | 0 | 0.89 (0.76-1.03) 0.12 |
| 6 | 10.37, 0.07 | 52 | 0.89 (0.78-1.01) 0.06 |
| 12 | 13.29, 0.27 | 17 | 0.85 (0.78-0.92) 0.00 |
Methods
Consider a meta-analysis of $M$ studies. When the effect estimate of study $i$ out of $M$, denoted by $\hat{\theta}_i$, is assumed to be normally distributed with a known variance $\sigma_i^2$, then one can think of the study estimates as centered around a common mean parameter $\theta$ as in formula (1):

$$\hat{\theta}_i = \theta + u_i + \epsilon_i \qquad (1)$$

The $\epsilon_i$ term relates to the precision of study $i$'s estimate, and is assumed to follow a $N(0, \sigma_i^2)$ distribution. The $u_i$ term is assumed to have zero mean and a variance of $\tau^2$; it is included to represent potential between trial heterogeneity. When $\tau^2$ equals 0 all studies provide an estimate of the same mean parameter $\theta$. Under the assumption that $\tau^2$ is 0 the fixed effect (FE) estimate, associated variance and assumed asymptotic distribution can be obtained:

$$\hat{\theta}_{FE} = \frac{\sum_{i=1}^{M} W_i \hat{\theta}_i}{\sum_{i=1}^{M} W_i}, \qquad V_{FE} = \frac{1}{\sum_{i=1}^{M} W_i}, \qquad \hat{\theta}_{FE} \sim N(\theta, V_{FE}) \qquad (2)$$

where $W_i = 1/\sigma_i^2$ is study $i$'s precision.
Heterogeneity quantification using the standard Q-statistic
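If the fixed effects assumption is true then Cochran's statistic:

$$Q = \sum_{i=1}^{M} W_i (\hat{\theta}_i - \hat{\theta}_{FE})^2 \qquad (3)$$

should follow, asymptotically, a $\chi^2$ distribution, with expected value equal to $M - 1$. However, if $\tau^2$ is non zero so that there is a degree of heterogeneity among the trials, study $i$ provides an estimate of $\theta + u_i$ and the expected value of $Q$ equals

$$E[Q] = (M - 1)\left(1 + \frac{\tau^2}{s^2}\right) \qquad (4)$$

where $s^2 = \frac{(M-1)\sum_i W_i}{\left(\sum_i W_i\right)^2 - \sum_i W_i^2}$, and is referred to as the 'typical' within study variance.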
The most commonly applied estimate of $\tau^2$ is due to DerSimonian and Laird [3]. This simply replaces $E[Q]$ in formula (4) with its observed value in formula (3) and solves for $\tau^2$ to give what we term $\hat{\tau}^2_{DL} = s^2\,(Q - (M-1))/(M-1)$. This estimate is truncated to zero if negative and then used to provide a re-weighted overall mean estimate $\hat{\theta}_{RE}$ (and variance $V_{RE}$) by replacing $W_i$ in (2) with $1/(\sigma_i^2 + \hat{\tau}^2_{DL})$. The 'RE' subscript denotes 'random effects'. This method has become synonymous with random effects meta-analysis because of its ease of use: it does not require statistical maximisation software and does not impose constraints on the distribution of the random effects $u_i$ [19]. Furthermore, $\hat{\tau}^2_{DL}$ can be used to furnish the most popular measure of the extent of heterogeneity, Higgins and Thompson's $I^2$ value [4,5], since

$$I^2 = \frac{\hat{\tau}^2_{DL}}{\hat{\tau}^2_{DL} + s^2} = \frac{Q - (M - 1)}{Q}$$

when $Q > M - 1$.
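The chain from Cochran's $Q$ to the DerSimonian and Laird estimate and $I^2$ can be sketched in a few lines of R. The data are hypothetical values chosen to show visible heterogeneity, and all object names are illustrative assumptions rather than part of the original analysis.

```r
# Hypothetical log hazard ratios and within-study variances (illustrative only)
theta_hat <- c(-0.45, 0.10, -0.60, 0.05, 0.30)
sigma2    <- c(0.04, 0.09, 0.02, 0.05, 0.12)
M <- length(theta_hat)

W        <- 1 / sigma2
theta_FE <- sum(W * theta_hat) / sum(W)

# Cochran's Q and its p-value against a chi-squared(M - 1) reference
Q   <- sum(W * (theta_hat - theta_FE)^2)
p_Q <- pchisq(Q, df = M - 1, lower.tail = FALSE)

# 'Typical' within-study variance
s2 <- (M - 1) * sum(W) / (sum(W)^2 - sum(W^2))

# DerSimonian and Laird estimate, truncated at zero
tau2_DL <- max(0, s2 * (Q - (M - 1)) / (M - 1))

# Re-weighted (random effects) mean and variance
W_RE     <- 1 / (sigma2 + tau2_DL)
theta_RE <- sum(W_RE * theta_hat) / sum(W_RE)
V_RE     <- 1 / sum(W_RE)

# Inconsistency statistic based on the DL estimate
I2_DL <- tau2_DL / (tau2_DL + s2)

print(c(Q = Q, p = p_Q, tau2_DL = tau2_DL, I2_DL = I2_DL, theta_RE = theta_RE))
```

Computing $I^2$ as $\hat{\tau}^2/(\hat{\tau}^2 + s^2)$ rather than directly from $Q$ makes it straightforward to substitute a different estimate of $\tau^2$ later.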
From a philosophical perspective, fixed effect and random effects estimates target very different quantities. Fixed effect models estimate the weighted mean of the study estimates, whereas random effects models estimate the mean of a distribution from which the study estimates were sampled. However, if model (1) is correct and we are additionally willing to assume that the $u_i$ terms are independent of the $\epsilon_i$ terms, then they should both provide estimates of the same parameter $\theta$. Another consequence of this independence assumption is that the individual study estimates should be independent of the $\epsilon_i$ terms, and hence we do not expect the magnitude of the effect estimate to be correlated with its precision.
Heterogeneity quantification using a 'generalised' Q-statistic
$\hat{\tau}^2_{DL}$ is very easy to calculate but may itself be a misleading estimate of the true heterogeneity present. More sophisticated likelihood-based methods, such as 'REML' [20] or Bayesian methods using MCMC [21], may be preferred, but are more computationally demanding to calculate and also impose distributional assumptions on the random effects. Recently, a method has been championed that combines some of the computational simplicity of the DerSimonian and Laird method with the rigor and accuracy of likelihood based approaches. DerSimonian and Kacker [22] (and others [23-25]) have noted that a generalisation of the $Q$ statistic in equation (3) can be written as:

$$Q(\tau^2) = \sum_{i=1}^{M} W_i^* (\hat{\theta}_i - \hat{\theta}_{RE})^2 \qquad (5)$$

where $W_i^* = 1/(\sigma_i^2 + \tau^2)$ and where $\hat{\theta}_{RE}$ is also calculated from equation (2) by replacing $W_i$ with $W_i^*$. Like the standard $Q$ statistic in equation (3), this also follows a $\chi^2_{M-1}$ distribution under the null hypothesis of no heterogeneity. Paule and Mandel [23] (PM) and DerSimonian and Kacker [22] propose to estimate $\tau^2$ by iterating equation (5) until $Q(\tau^2)$ equals its expected value of $M-1$; this estimate will be referred to as $\hat{\tau}^2_{PM}$. DerSimonian and Kacker recommend using $\hat{\tau}^2_{PM}$ since it is still very easy to obtain, is guaranteed to have at most one solution and provides a more accurate estimate of $\tau^2$ that closely mirrors both the REML estimate and the generalized Bayes estimate [24], which are both much harder quantities to obtain computationally.
Viechtbauer [25] suggests that equation (5) can additionally be used to provide an $\alpha$-level confidence set for $\hat{\tau}^2_{PM}$, by finding the values of $\tau^2$ that equate $Q(\tau^2)$ with the $\alpha/2$th and $(1-\alpha/2)$th percentiles of the $\chi^2_{M-1}$ distribution. He showed that this method performed very well in a simulation study that evaluated its coverage properties compared to a range of other methods, such as Biggerstaff and Tweedie [26] and Sidik and Jonkman [27], primarily because it is based on an exact $\chi^2$ distribution, rather than a distributional approximation.
A criticism one might therefore have of $I^2$ is that its standard definition is intertwined with the commonly applied DerSimonian and Laird estimate $\hat{\tau}^2_{DL}$. A generalised $I^2$ statistic, say $I^2_{\hat{\tau}}$, could easily be defined for a meta-analysis with typical within study variance $s^2$ as

$$I^2_{\hat{\tau}} = \frac{\hat{\tau}^2}{\hat{\tau}^2 + s^2}$$

for any estimate of the between study variance $\hat{\tau}^2$. From now on we will refer to Inconsistency statistics specifically utilising the DL method as $I^2_{DL}$ and those specifically utilising the PM method as $I^2_{PM}$. The term $I^2$ will be reserved for discussing the general concept of Inconsistency.
Reference intervals for $I^2_{DL}$ and $I^2_{PM}$
Since the Inconsistency statistic is a data derived estimate, it is possible to plot a confidence (or 'reference') interval around it to highlight its inherent uncertainty. Higgins and Thompson [4] recommend basing reference intervals for $I^2_{DL}$ on the variance of the related 'H' measure, since they are simple functions of one another. This involves using one of two formulae depending on the value of $Q$ relative to $M$. The lower bound of these intervals, if negative, is curtailed at zero. We will calculate $(1-\alpha)$ level reference intervals for $I^2_{PM}$ for each meta-analysis as

$$\left( \frac{\hat{\tau}^2_L}{\hat{\tau}^2_L + s^2},\; \frac{\hat{\tau}^2_U}{\hat{\tau}^2_U + s^2} \right)$$

where $\hat{\tau}^2_U$ and $\hat{\tau}^2_L$ represent the values of $\tau^2$ equating $Q(\tau^2)$ to the lower $\alpha/2$ and upper $1-\alpha/2$ percentiles of the relevant $\chi^2$ distribution, respectively. Because $Q(\tau^2)$ decreases as $\tau^2$ increases, the lower percentile yields the upper bound.
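A reference interval of this kind can be sketched by solving for $\tau^2$ at the two $\chi^2$ percentiles and mapping each solution through $\tau^2/(\tau^2 + s^2)$. As before, `uniroot` stands in for the iterative scheme, and the helper names, the search limit of 100 and the data are hypothetical choices made for illustration.

```r
# Generalised Q statistic for a candidate value of tau^2
gen_Q <- function(tau2, theta_hat, sigma2) {
  W_star  <- 1 / (sigma2 + tau2)
  theta_w <- sum(W_star * theta_hat) / sum(W_star)
  sum(W_star * (theta_hat - theta_w)^2)
}

# Value of tau^2 (truncated at zero) at which Q(tau^2) equals `target`
tau2_at <- function(target, theta_hat, sigma2, upper = 100) {
  f <- function(tau2) gen_Q(tau2, theta_hat, sigma2) - target
  if (f(0) <= 0) return(0)
  uniroot(f, lower = 0, upper = upper, tol = 1e-10)$root
}

# Hypothetical log hazard ratios and within-study variances (illustrative only)
theta_hat <- c(-0.45, 0.10, -0.60, 0.05, 0.30)
sigma2    <- c(0.04, 0.09, 0.02, 0.05, 0.12)
M     <- length(theta_hat)
alpha <- 0.10

# 'Typical' within-study variance
W  <- 1 / sigma2
s2 <- (M - 1) * sum(W) / (sum(W)^2 - sum(W^2))

# Q(tau^2) is decreasing, so the upper percentile gives the lower bound
tau2_L  <- tau2_at(qchisq(1 - alpha / 2, df = M - 1), theta_hat, sigma2)
tau2_PM <- tau2_at(M - 1, theta_hat, sigma2)
tau2_U  <- tau2_at(qchisq(alpha / 2, df = M - 1), theta_hat, sigma2)

# Map each tau^2 value to the Inconsistency scale
I2 <- function(tau2) tau2 / (tau2 + s2)
print(c(lower = I2(tau2_L), I2_PM = I2(tau2_PM), upper = I2(tau2_U)))
```

With only five studies the interval is typically very wide, which is the practical point of reporting it alongside the point estimate.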
Conclusions
In this paper we have restricted our focus to the estimation of the meta-analytical quantities $\tau^2$, $I^2$ and the overall mean parameter $\theta$, as well as providing confidence intervals for the latter two. We note that this does not reflect the state-of-the-art in what can be estimated via a random effects meta-analysis; one can for instance also estimate trial level effect parameters ($\theta + u_i$), predict the likely effects of future studies and test hypotheses relating to these additional parameters [19]. With this in mind, we make the following tentative conclusions.
The actual magnitude of the estimate of $\tau^2$ is often overlooked as a heterogeneity measure [41], and in keeping with modern developments the DerSimonian and Laird estimate is no longer considered to be the best choice [22,24]. We recommend using the PM estimate for $\tau^2$, and by extension the $I^2_{PM}$ it implies, since it is still very easy to calculate but shares much of the accuracy and rigor of more complex methods. Van der Tweel and Bollen [42] use the PM method to estimate the overall random effects mean $\theta_{RE}$ and heterogeneity parameter within the context of a sequential meta-analysis, but appear to stick with the original $I^2_{DL}$ for other aspects of their analysis. We recommend that practitioners additionally make use of the PM estimate in the Inconsistency measure $I^2_{PM}$.
R code to estimate $\hat{\tau}^2_{PM}$, $\theta_{RE}$ and $I^2_{PM}$ (with confidence intervals) is provided below.
An $I^2$ of over 75% has traditionally been considered as indicating a high level of inconsistency, $I^2$ values above 50% as moderate and $I^2$ values below 25% as low. It is tempting to consider a random effects model when the $I^2$ is high. However, the range of the reference intervals shown in Figure 6 (left) highlights the considerable uncertainty around this measure. The recently updated Cochrane handbook [6] now gives overlapping rather than mutually exclusive regions for low, moderate and high heterogeneity, but when the heterogeneity is measured with as much uncertainty as in the Cervix 3 meta-analysis (90% reference intervals for $I^2_{PM}$ of 0% to 93%) any categorisation feels dubious. Inconsistency intervals based on the $I^2_{PM}$ statistic will generally be wider than those based on the standard $I^2_{DL}$ measure but are a more accurate reflection of the uncertainty present. These findings are based on a fairly large simulation study for widely varying $\tau^2$, typical within study variance $s^2$ and trial number $M$. Although the simulated data were normally distributed, we do not think the conclusions would have changed if the study effects had been drawn from a non-standard distribution. By plotting $I^2_{PM}$ at the lower and upper reference levels, as well as at a spread of more central measures such as the mean, median and mode, one can easily and effectively convey this uncertainty to the analyst. For a comprehensive comparison of methods for estimating the heterogeneity parameter $\tau^2$ see Biggerstaff and Tweedie [26] or Viechtbauer [25].
In the presence of heterogeneity, the naive and automatic application of the random effects model has been widely criticised. It is sensible to conduct a further investigation of the data [34,43,44], but this may not lead to the identification of any explanatory factors. If unexplained heterogeneity also leads to large differences between the fixed and random effects estimates, there is the obvious prospect that conflicting clinical interpretations could arise. When funnel plot asymmetry is the predominant cause of this, $I^2$ statistics have a less meaningful interpretation. For this reason Rücker et al. [37] have recently proposed an alternative 'G' statistic that expresses the inconsistency between studies after this asymmetry has been accounted for (through a bias correction for small study effects). As demonstrated on the NSCLC meta-analysis, the Henmi-Copas method, which combines a fixed effects estimate with a 'random effects' confidence interval, provides an alternative way of dealing with funnel plot asymmetry without making an explicit bias correction. Both the approaches of Rücker et al. and Henmi and Copas appear to offer sensible and practical solutions to this problem, and merit further investigation.
R code
This code calculates point estimates and $\alpha$-level confidence intervals for $\hat{\tau}^2_{PM}$, $\hat{\theta}_{RE}$ and $I^2_{PM}$, given the estimated effect sizes y, within study standard errors s and desired type I error Alpha. This code is based on the algorithm suggested by DerSimonian and Kacker [22].
PM = function(y, s, Alpha = 0.1, tol = 1e-10, maxit = 1000){
  # y: estimated effect sizes; s: within study standard errors; Alpha: type I error
  k = length(y) ; df = k - 1 ; sig = qnorm(1 - Alpha/2)
  # Chi-squared(df) reference values that Q(tausq) is matched to
  low = qchisq(Alpha/2, df) ; up = qchisq(1 - Alpha/2, df)
  med = qchisq(0.5, df) ; mn = df
  mode = max(df - 2, 0)   # mode of a chi-squared(df) distribution
  Quant = c(low, mode, mn, med, up) ; L = length(Quant)
  Tausq = numeric(L) ; Isq = numeric(L) ; MU = numeric(L)
  CI = matrix(nrow = L, ncol = 2)
  # 'Typical' within study variance
  v = 1/s^2 ; sum.v = sum(v) ; typS = sum(v*(k-1))/(sum.v^2 - sum(v^2))
  for(j in 1:L){
    tausq = 0
    for(it in 1:maxit){
      w = 1/(s^2 + tausq) ; sum.w = sum(w)
      yW = sum(y*w)/sum.w ; Q1 = sum(w*(y - yW)^2)
      Q2 = sum(w^2*(y - yW)^2) ; Fval = Q1 - Quant[j]
      # Stop once Q(tausq) has reached (or started below) the target value
      if(Fval <= tol) break
      tausq = tausq + Fval/Q2   # Newton step, since dQ/dtausq = -Q2
    }
    MU[j] = yW ; V = 1/sum.w
    Tausq[j] = max(tausq, 0) ; Isq[j] = Tausq[j]/(Tausq[j] + typS)
    CI[j,] = yW + sig*c(-1,1)*sqrt(V)
  }
  # Element 3 of each output corresponds to Q(tausq) = df, i.e. the PM estimate
  return(list(tausq = Tausq, muhat = MU, Isq = Isq, CI = CI, quant = Quant))
}
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JFT and SB produced an early version of this paper. JB substantially revised the paper to bring it to its current form. AJC and JFT provided invaluable advice to JB during this process. All authors read and approved the final manuscript.