Evolution of Heterogeneity (I2) Estimates and Their 95% Confidence Intervals in Large Meta-Analyses

  • Kristian Thorlund ,

    thorluk@mcmaster.ca

    Affiliations Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Copenhagen Trial Unit, Department 3344, Centre for Clinical Intervention Research, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark

  • Georgina Imberger,

    Affiliation Copenhagen Trial Unit, Department 3344, Centre for Clinical Intervention Research, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark

  • Bradley C. Johnston,

    Affiliations Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Department of Anaesthesia and Pain Medicine, The Hospital for Sick Children, Toronto, Ontario, Canada

  • Michael Walsh,

    Affiliations Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Division of Nephrology, Department of Medicine, McMaster University, Hamilton, Ontario, Canada

  • Tahany Awad,

    Affiliation Copenhagen Trial Unit, Department 3344, Centre for Clinical Intervention Research, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark

  • Lehana Thabane,

    Affiliations Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Biostatistics Unit, Father Sean O'Sullivan Research Centre, St Joseph's Healthcare – Hamilton, Hamilton, Ontario, Canada

  • Christian Gluud,

    Affiliation Copenhagen Trial Unit, Department 3344, Centre for Clinical Intervention Research, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark

  • P. J. Devereaux,

    Affiliations Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Population Health Research Institute, McMaster University, Hamilton, Ontario, Canada

  • Jørn Wetterslev

    Affiliation Copenhagen Trial Unit, Department 3344, Centre for Clinical Intervention Research, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark

Abstract

Background

Assessment of heterogeneity is essential in systematic reviews and meta-analyses of clinical trials. The most commonly used heterogeneity measure, I2, provides an estimate of the proportion of variability in a meta-analysis that is explained by differences between the included trials rather than by sampling error. Recent studies have raised concerns about the reliability of I2 estimates, due to their dependence on the precision of included trials and time-dependent biases. Authors have also advocated use of 95% confidence intervals (CIs) to express the uncertainty associated with I2 estimates. However, no previous studies have explored how many trials and events are required to ensure stable and reliable I2 estimates, or how 95% CIs perform as evidence accumulates.

Methodology/Principal Findings

To assess the stability and reliability of I2 estimates and their 95% CIs in relation to the cumulative number of trials and events in a meta-analysis, we examined 16 large Cochrane meta-analyses, each including a sufficient number of trials and events to reliably estimate I2, and monitored the I2 estimates and their 95% CIs at each year of publication. In 10 of the 16 meta-analyses, the I2 estimates fluctuated by more than 40% over time. The median numbers of events and trials required before the cumulative I2 estimates stayed within +/−20% of the final I2 estimate were 467 and 11, respectively. No major fluctuations were observed after 500 events and 14 trials. The 95% confidence intervals provided good coverage over time.

Conclusions/Significance

I2 estimates need to be interpreted with caution when the meta-analysis only includes a limited number of events or trials. Confidence intervals for I2 estimates provide good coverage as evidence accumulates, and are thus valuable for reflecting the uncertainty associated with estimating I2.

Introduction

Measures of heterogeneity are essential in systematic reviews and meta-analyses of clinical trials [1]–[6]. The most commonly used heterogeneity measure, I2, provides an estimate of the proportion of variability in a meta-analysis that is explained by differences between the included trials rather than by sampling error [2], [3]. Several studies have demonstrated important shortcomings of the I2 measure [7]–[12]. I2 estimates may be particularly unreliable in meta-analyses including a small number of trials (e.g., less than 10 trials) due to lack of power [7], [8]. I2 estimates may be underestimated as a result of time-lag bias [9], [10]. Moreover, comparably higher or lower precision in the most recently added trials may inflate or deflate I2 under different circumstances [8], [11].

Imprecise or biased estimates of heterogeneity can have serious consequences [6], [12]. Underestimation of heterogeneity may inappropriately prevent exploration of the cause(s) of heterogeneity. Overestimation of heterogeneity may inappropriately prevent a meta-analysis from being done at all. Overestimation may also trigger inappropriate exploration of the cause(s) of heterogeneity. For example, large I2 estimates may prompt authors to exhaust all possibilities of subgroup analyses - a practice notorious for its tendency to yield findings that cannot be replicated [13].

In response to the above identified shortcomings, it has been proposed that reported I2 estimates should be accompanied by their associated 95% confidence interval (CI) [6], [12]. Confidence intervals may be a desirable addition to the single I2 estimate; they give an appreciation of the spectrum of possible degrees of heterogeneity (e.g., mild to moderate), allowing for more appropriate interpretation of the overall intervention effect estimate. One concern, however, is the possibility that the I2 estimate's dependence on power, trial weights, and time-lag bias may cause fluctuations beyond the play of chance. With such fluctuations, the 95% CIs may not retain their desired coverage.

To explore these issues we selected a sample of 16 large Cochrane meta-analyses, each including a sufficient number of trials, patients and events to provide reliable treatment effect estimates and I2 estimates. We retrospectively re-analysed the data for each meta-analysis, starting with the first chronological trial, and calculating a cumulative I2 estimate and its associated 95% CI after each new trial was added to the meta-analysis. We then estimated the number of events and trials generally needed for I2 estimates and 95% CIs to converge.

Statistical framework and theoretical considerations

In this section we first outline the construct of the I2 measure and its associated 95% CI. Second, we provide an overview of meta-analysis factors and properties of the I2 measure that may inappropriately affect the magnitude of the I2 estimate. Lastly, we provide the rationale for empirically studying I2 estimates and their associated 95% CIs over time.

Measuring heterogeneity between studies

Higgins et al. proposed the now widely popular measure of heterogeneity, I2, as well as methods for calculating its associated 95% CI [2], [3]. I2 expresses the proportion of variability in a meta-analysis which is explained by between-trial heterogeneity rather than by sampling error. Mathematically, I2 is expressed as I2 = τ2/(σ2 + τ2), where τ2 denotes the between-trial heterogeneity, σ2 denotes some common sampling error across trials, and σ2 + τ2 is the total variation in the meta-analysis. In practice, I2 is usually calculated as (Q−df)/Q×100%, where Q is Cochran's homogeneity test statistic and df is the degrees of freedom (the number of trials minus 1) [2], [3], [14]. Higgins et al. explored a number of methods for obtaining 95% CIs for the I2 estimate [2]. For this study, we use the method referred to as the test-based method in Higgins et al. [2]. This method yields good coverage in most situations and is easy to calculate [2]. The required calculations for this method are outlined in Appendix S1.
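To make the construction concrete, the short Python sketch below computes Cochran's Q, the I2 estimate, and a test-based 95% CI from a set of trial effect estimates and their sampling variances. The interval uses the standard error of ln(H), with H = sqrt(Q/df), as commonly described for Higgins and Thompson's test-based method [2]; the exact expressions used in this study are given in Appendix S1, so the function name, the edge-case handling, and the example data below are illustrative assumptions rather than the authors' own code.

```python
import numpy as np

def i_squared_with_ci(effects, variances, z=1.96):
    """Cochran's Q, I2 (in %), and a test-based 95% CI (a sketch; the exact
    expressions used by the authors are in Appendix S1). Assumes at least
    three trials so that both branches below are well defined."""
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)          # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)              # fixed-effect pooled estimate
    q = float(np.sum(w * (effects - pooled) ** 2))        # Cochran's Q
    k = len(effects)
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Test-based interval on ln(H), where H = sqrt(Q / df)
    ln_h = 0.5 * np.log(max(q, 1e-12) / df)
    if q > k:
        se_ln_h = 0.5 * (np.log(q) - np.log(df)) / (np.sqrt(2.0 * q) - np.sqrt(2.0 * k - 3.0))
    else:
        se_ln_h = np.sqrt((1.0 / (2.0 * (k - 2.0))) * (1.0 - 1.0 / (3.0 * (k - 2.0) ** 2)))

    def h_to_i2(h):
        # Convert an H value to I2, truncating negative values at 0
        return max(0.0, (h ** 2 - 1.0) / h ** 2) * 100.0

    lo = h_to_i2(np.exp(ln_h - z * se_ln_h))
    hi = h_to_i2(np.exp(ln_h + z * se_ln_h))
    return i2, lo, hi

# Example with hypothetical trial data (log risk ratios and their variances):
print(i_squared_with_ci([-0.4, -0.1, 0.2, -0.6, 0.0], [0.04, 0.05, 0.03, 0.06, 0.05]))
```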

Factors affecting I2 estimates

I2 estimates may be unreliable due to lack of power and precision [7], [8], [11], due to the presence of time-dependent biases [9], [10], or due to dependence on trial weights and precisions.

Power and precision.

Since I2 is a monotonically increasing function of Cochran's Q, large values of Q result in large I2 estimates and small values of Q result in small I2 estimates. The power of Cochran's Q depends on the number of trials and the precision of the trials (i.e., the number of patients and events in the trials) [7], [8], [11]. When the number of trials or their respective precision is small, Cochran's Q usually has inappropriately low power to detect heterogeneity, and therefore tends to yield conservative (low) test values [7], [8]. To illustrate, the median number of trials is seven for Cochrane meta-analyses and 12 for meta-analyses published in paper journals [15], [16]. The median sample size in randomized clinical trials is typically less than 100 in most medical specialties [17], [18]. Thus, it is common for Cochran's Q to have low power. This lack of power is likely to cause underestimation of I2, particularly if there are few events among the included trials [7].

Time-dependent bias.

Time-dependent bias (i.e., time-lag bias and publication bias) is a known threat to the validity of the pooled estimate of effect in meta-analyses [19]–[21]. In addition, time-dependent bias may compromise the validity of heterogeneity estimates [9], [10]. It is accepted that statistically significant trials with large intervention effect estimates usually get published the fastest [21]. If a meta-analysis is conducted at a time when all trials yield large, promising treatment effects, the similarity across trials will result in a relatively small I2 estimate. If the meta-analysis is updated some years later, this update is likely to include trials that found more moderate, neutral, or negative treatment effects. The inclusion of such trials will generate larger estimates of heterogeneity.

Dependence on trial weights and precisions.

From the mathematical expression I2 = τ2/(σ2 + τ2), it is clear that relatively large sampling errors across trials will result in small I2 estimates, and conversely, that relatively small sampling errors across trials will result in large I2 estimates [2], [3], [8], [11]. The “common” sampling error, σ2, across trials may change considerably over time. For example, if early trials enroll a more homogeneous or heterogeneous set of patients than later trials, if they have shorter follow-up than later trials, or if changes are made to the definition of the outcome measure (e.g., the definition of myocardial infarction has changed considerably over the past decades), then the “common” sampling error may be considerably different at a later stage of a meta-analysis than it was in the early stage.

Provided the between-study variance, τ2, remains relatively stable over time, changes in the “common” sampling error may cause considerable changes in I2 estimates over time. Further, if the between-study variance itself changes considerably over time, changes in the “common” sampling error may either inflate or deflate the representation of such changes in the I2 estimate.
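The following hypothetical calculation, using the identity I2 = τ2/(σ2 + τ2) with invented variance values, illustrates how strongly I2 can drift when the between-trial variance stays fixed but the "common" sampling error shrinks as larger trials are added:

```python
# Hypothetical values only: fixed between-trial variance tau2, shrinking
# "common" sampling error sigma2, evaluated with I2 = tau2 / (sigma2 + tau2).
tau2 = 0.04                          # assumed between-trial variance (log risk-ratio scale)
for sigma2 in (0.16, 0.08, 0.02):    # e.g., small early trials, then progressively larger trials
    i2 = tau2 / (sigma2 + tau2) * 100
    print(f"sigma2 = {sigma2:.2f} -> I2 = {i2:.0f}%")
# Output: 20%, 33%, 67% -- the same tau2 looks mild or substantial
# depending solely on the precision of the included trials.
```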

The need to assess convergence of I2 estimates and confidence intervals

From the above discussion, it is evident that I2 estimates may incur considerable fluctuations over time. Currently, no studies have explored the magnitude of this problem, and no recommendations exist as to how many events or trials are needed to achieve adequately stable I2 estimates in meta-analysis.

It has been proposed that I2 estimates should be reported with their associated 95% CIs. By construction, a conventional frequentist CI represents the spectrum of results that would include the true underlying value in a particular proportion (typically 95%) of cases if the experiment were independently repeated many times. In meta-analysis, we can conceptually think of an ‘experiment’ as a set of trials ‘sampled’ randomly from a universe of all possible trials. However, as outlined in the sections above, the patterns with which different types of trials are included in a meta-analysis over time are typically not random. For example, small trials are likely to precede larger trials. Thus, the statistical assumptions on which I2 confidence intervals are based may not hold in many meta-analyses. For this reason, it is important to explore empirically how 95% confidence intervals perform as more trials accumulate over time.

Materials

In a previous empirical study, we extracted data from 920 binary ‘primary outcome’ meta-analyses in Cochrane systematic reviews [22]. We had defined primary outcomes as one of the first three outcomes in the first comparison group [22]. The data set only included meta-analyses that pooled results across all trials; meta-analyses reporting only sub-totals were excluded. For this current study, we used the same population of meta-analyses and selected the subset of meta-analyses that met the following eligibility criteria:

  1. The total number of included trials surpassed 30. We employed this eligibility criterion because the number of trials is an important determinant of the reliability of estimates of variation between trial results (i.e., I2). Since we accept the final cumulative I2 as representing a good approximation of the ‘truth’, it was important that the number of trials was large enough to make it likely that the final I2 had converged and was stable.
  2. The total number of patients surpassed a required information size (i.e., required meta-analysis sample size) based on α = 5% and β = 20% (i.e., 80% power). The required information size used for each meta-analysis was powered to detect a 25% relative risk reduction, assuming a control group event rate equal to the median of all trials. Calculation of a required information size requires an estimate of heterogeneity [23]. For the purpose of estimating a reasonable required information size (and allowing confidence that the final effect estimate is reliable), we chose to assume a 50% degree of heterogeneity for these calculations (a sketch of this type of calculation is given after this list). I2 is a function of Cochran's Q, and Cochran's Q is a function of the sum of squared differences between each trial effect estimate and the meta-analysed effect estimate. Thus, if the meta-analysed effect estimate cannot be considered reliable, I2 may not be reliable either. The use of a fixed a priori I2 = 50% for the heterogeneity correction is based on a recent comprehensive simulation study, in which we demonstrated that the meta-analysed effect estimate is highly unlikely to be over- or underestimated after the cumulative number of patients has surpassed a 50% heterogeneity-corrected information size [24].
  3. The disease of interest was a common disease. We employed this criterion because most interventions for common diseases yield intervention effects close to a 25% relative risk reduction or smaller, thus giving credence to our considerations for the required information size (above criterion).
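
As referred to in criterion 2, the sketch below illustrates one way such a heterogeneity-adjusted required information size can be computed. It assumes the adjustment is a simple 1/(1 − I2) inflation of a conventional two-group sample size with I2 = 50%, in the spirit of the diversity-adjusted information size of Wetterslev et al. [23]; the authors' exact formula may differ, and the 20% control group event rate used in the example is hypothetical.

```python
from statistics import NormalDist

def required_information_size(control_event_rate, rrr=0.25, alpha=0.05, power=0.80, i2=0.50):
    """Total number of patients required, inflated for heterogeneity by
    1 / (1 - I2). This inflation factor is an assumption made here for
    illustration; the authors' exact adjustment may differ (see refs [23], [24])."""
    pc = control_event_rate
    pe = pc * (1.0 - rrr)                              # event rate under a 25% relative risk reduction
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2)    # 1.96 for a two-sided 5% alpha
    z_beta = NormalDist().inv_cdf(power)               # 0.84 for 80% power
    n_per_group = (z_alpha + z_beta) ** 2 * (pc * (1 - pc) + pe * (1 - pe)) / (pc - pe) ** 2
    return 2.0 * n_per_group / (1.0 - i2)              # total patients, inflated for heterogeneity

# With a hypothetical 20% control group event rate, roughly 3,600 patients are required.
print(round(required_information_size(0.20)))
```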

From the pool of 920 meta-analyses, 18 meta-analyses originally met these criteria. Post hoc, we elected to exclude two of them: each included two significantly different subgroups in which all or the majority of trials in the second subgroup had been conducted after the trials in the first subgroup, and we therefore did not find it appropriate to assess convergence of I2 in these meta-analyses. Thus, 16 meta-analyses were included. Table 1 presents the characteristics of the 16 included meta-analyses.

Table 1. Characteristics of the 16 included meta-analyses.

https://doi.org/10.1371/journal.pone.0039471.t001

Analysis

For each of the 16 meta-analyses, we calculated and plotted the cumulative I2 estimate and its associated 95% CI after each year of publication. We accepted the final I2 estimate (i.e., the I2 estimated from the meta-analysis including all trials) as representing a good approximation of the ‘truth’. First, we assessed the variation of I2 estimates over time by calculating the difference between the maximum and minimum observed I2 estimates over time in each meta-analysis. We refer to this difference as the fluctuation span of I2. Second, we assessed how many events and trials were required for the cumulative I2 estimate to become stable. We considered I2 estimates moderately stable and highly stable from the points where the cumulative I2 estimate came within a +/−20% and a +/−10% absolute distance, respectively, of the final cumulative I2 estimate and stayed within that distance. Third, we recorded the cumulative number of trials and events at the points where the 95% CIs temporarily did not include the final I2 estimate. At these time points, we assessed how far the closest CI limit was from the final I2 estimate. That is, if the final I2 estimate was above the temporary 95% CI, we calculated the distance between the upper CI limit and the final I2 estimate, and vice versa if the final I2 estimate was below the 95% CI.
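
A minimal sketch of this cumulative re-analysis is given below. The (year, effect, variance) data structure and the grouping by publication year are assumptions made for illustration, and the function reuses the i_squared_with_ci() sketch from the statistical framework section rather than the software actually used for the analyses.

```python
from collections import defaultdict

def cumulative_i2_by_year(trials):
    """trials: iterable of (publication_year, effect, variance) tuples
    (an assumed structure; the review data are not reproduced here).
    Recomputes I2 and its 95% CI on all trials published up to and
    including each year, then reports the fluctuation span of I2."""
    by_year = defaultdict(list)
    for year, effect, variance in trials:
        by_year[year].append((effect, variance))

    effects, variances, path = [], [], []
    for year in sorted(by_year):
        for effect, variance in by_year[year]:
            effects.append(effect)
            variances.append(variance)
        if len(effects) >= 3:                      # the CI sketch above needs at least 3 trials
            i2, lo, hi = i_squared_with_ci(effects, variances)
            path.append((year, i2, lo, hi))

    i2_values = [p[1] for p in path]
    fluctuation_span = max(i2_values) - min(i2_values)
    return path, fluctuation_span
```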

Results

Columns 2–4 in Table 2 present the minimum, the maximum, and the fluctuation span of I2 values observed over time in each of the 16 included meta-analyses. The median, minimum, and maximum fluctuation spans were 47.5%, 15%, and 81%, respectively. Ten of the 16 meta-analyses (62.5%) had a fluctuation span larger than 40%. Columns 5–8 in Table 2 present the number of trials and events required for the cumulative I2 estimate to become moderately and highly stable. In 3 of the 16 meta-analyses (meta-analyses 14–16), the cumulative I2 estimates were moderately stable throughout the entire meta-analysis. For the remaining 13 meta-analyses, the median (minimum to maximum) number of trials and events required to become moderately stable was 11 (5 to 25) and 467 (138 to 1894), respectively. The median (minimum to maximum) number of trials and events required to become highly stable was 20 (10 to 37) and 958 (257 to 2766). Further, graphical inspection revealed that, except for two meta-analyses (meta-analyses 8 and 10, see Figures 1, 2, 3, 4), no major fluctuations occurred after the first point at which the cumulative meta-analysis included at least 500 events and 15 trials.

Figure 1. Evolution of the cumulative I2 estimates and their associated 95% confidence intervals (CIs) over the accumulation of events in meta-analyses (1) to (8).

The cumulative I2 estimates are represented by the dot-dashed lines, and their associated cumulative 95% CIs by the dotted lines.

https://doi.org/10.1371/journal.pone.0039471.g001

Figure 2. Evolution of the cumulative I2 estimates and their associated 95% confidence intervals (CIs) over the accumulation of events in meta-analyses (9) to (16).

The cumulative I2 estimates are represented by the dot-dashed lines, and their associated cumulative 95% CIs by the dotted lines.

https://doi.org/10.1371/journal.pone.0039471.g002

Figure 3. Evolution of the cumulative I2 estimates and their associated 95% confidence intervals (CIs) over the accumulation of trials in meta-analyses (1) to (8).

The cumulative I2 estimates are represented by the dot-dashed lines, and their associated cumulative 95% CIs by the dotted lines.

https://doi.org/10.1371/journal.pone.0039471.g003

Figure 4. Evolution of the cumulative I2 estimates and their associated 95% confidence intervals (CIs) over the accumulation of trials in meta-analyses (9) to (16).

The cumulative I2 estimates are represented by the dot-dashed lines, and their associated cumulative 95% CIs by the dotted lines.

https://doi.org/10.1371/journal.pone.0039471.g004

Table 2. Fluctuation spans of I2 values and the number of trials and events required to become stable.

https://doi.org/10.1371/journal.pone.0039471.t002

In 3 of the 16 meta-analyses (meta-analyses 7, 9 and 14), the 95% CIs temporarily did not include the final I2 estimate (see figures 1, 2, 3, 4). In meta-analysis 7, the 95% CI at the second publication year was 0–69% and the final I2 estimate was 77%. The cumulative number of events and trials at this point was 349 and 5. In meta-analysis 9, the 95% CI at the fourth publication year was 57–88% and the final I2 estimate was 54%. The cumulative number of events and trials at this point was 177 and 5. In meta-analysis 14, the 95% CI at the third and fourth publication year was 77–94% and 75–93% and the final I2 estimate was 74%. The cumulative number of events and trials was 349 and 7 at the third year of publication and 407 and 8 at the fourth year of publication.

Discussion

In summary, our findings suggest that I2 estimates are likely to incur considerable fluctuations when a meta-analysis includes fewer than roughly 500 events and fewer than 15 trials, and that 95% CIs for the I2 estimate provide good coverage over time. All instances where the 95% CI temporarily did not include the final I2 estimate occurred in cumulative meta-analyses with fewer than 500 events and 10 trials. However, even in the rare cases where the 95% CIs did not include the final I2 estimate, it is unlikely that inferences about the degree of heterogeneity based on the temporary 95% CIs would have differed from inferences based on the final I2 estimate.

Our study offers several strengths. First, it represents the first empirical evaluation of the evolution of I2 estimates and their associated 95% confidence intervals over time. Second, our results provide novel insights into the use of one of the most important inferential measures, I2, in meta-analytic practice. Third, we selected meta-analyses that included a sufficiently large number of trials and patients to help ensure a reliable sample.

Our study has a number of limitations. We only evaluated I2 estimates and their associated 95% CIs after each year of publication. Since all of the included meta-analyses included more than one trial in some years, it is possible that the I2 estimates in some of the 16 meta-analyses became stable at a smaller number of events and trials than indicated in Table 2. Some of the numbers of events and trials required to reach convergence presented in Table 2 may therefore be overestimates. However, a preliminary analysis plotting the I2 estimates by trial (results not shown), with trials ordered alphabetically (according to the first author's last name or trial acronym), did not reveal additional fluctuations compared with the by-publication-year plots.

Only 16 meta-analyses were eligible, and they covered a limited spectrum of medical areas. Our findings may therefore not be generalizable to meta-analyses that bear little resemblance to the meta-analyses included in this study. We also did not examine meta-analyses published in paper journals. A number of differences between Cochrane meta-analyses and journal-based meta-analyses have been documented (e.g., meta-analyses published in paper journals are more likely to present statistically significant findings) [15], [16]. One could therefore speculate that fluctuations in I2 estimates may differ between Cochrane and paper journal meta-analyses. In the second section of this paper (statistical framework and theoretical considerations), we explained that I2 estimates may fluctuate due to lack of power, time-dependent bias, and evolving trial weights and precisions. We did not perform an in-depth assessment of the degrees to which each of these factors caused I2 estimates to fluctuate in the 16 meta-analyses. We believe that empirical analysis of a data set obtained through more lenient eligibility criteria (e.g., only insisting on a minimum of 15 trials), or specifically tailored simulation studies, might cast more light on these issues. While such studies would provide advantages over this study, they would also come with limitations that are not present in the current study. We therefore believe an in-depth evaluation of the causes of I2 fluctuations and convergence is best kept separate from this study. Finally, we did not examine whether any of the review authors took precautions regarding the uncertainty associated with I2 estimates (especially in early versions of the systematic reviews, where the meta-analysis included fewer than 500 events and 15 trials). However, given the paucity of methodological literature on the I2 measure just five years ago, it is likely that most Cochrane review authors would have been unaware of the issues related to uncertainty associated with estimating I2.

The median number of trials in a meta-analysis is 7 in Cochrane reviews and 12 in systematic reviews published in paper journals [15], [16]. With clinical trial sample sizes typically smaller than 100 [17], [18], it is likely that most published meta-analyses include fewer than 500 events and 15 trials and will therefore incur considerable fluctuations in I2. Hence, there is a need to present the I2 estimate with its associated 95% CI. As outlined in the statistical framework section, 95% CIs represent the spectrum of results that would include the true underlying value in 95% of all ‘meta-analysis experiments’. As such, clinicians should consider what their interpretation of the degree of heterogeneity would be if the I2 estimate were equal to the lower 95% CI limit and, separately, the upper 95% CI limit. This should yield an appropriate spectrum of interpretations and adequately reflect the uncertainty.

Unreliable I2 estimates have potential negative implications for the assessment of the reliability of intervention effect estimates. Recent literature as well as the GRADE initiative has promoted the need for assessing intervention effects in relation to the strength of evidence [23]–[28]. One of the factors to consider when assessing the overall quality of evidence is the precision of the pooled estimate of effect, which is assessed, in part, by considering the required (or optimal) information size [23]–[28]. However, to carry out such assessments reliably it is necessary to have a good idea of the expected degree of heterogeneity in the meta-analysis, and if this is not possible, one should at least carry out sensitivity assessments based on a plausible spectrum of degrees of heterogeneity. Uninformed use of the current I2 estimate does not provide a solid basis for such assessments, but interpretation of the I2 estimate in relation to the cumulative amount of evidence and the associated 95% CI does.

Previous studies have already identified limitations associated with the I2 measure as well as the uncertainty associated with I2 estimates [7], [8], [11], [12]. Our study adds to the previous literature by introducing temporality. However, as pointed out above, our findings have limitations and need confirmation in simulation studies and perhaps other empirical studies. Example papers that put statistical inferences about the degree of heterogeneity into a clinical context are also needed. The latter may be realized if confidence intervals become an integral part of the widely used systematic review software Review Manager as well as other meta-analysis software packages [29].

In conclusion, I2 estimates are likely to fluctuate considerably in meta-analyses with fewer than roughly 500 events and 15 trials. Confidence intervals for I2 estimates provide good coverage as evidence accumulates, and are thus valuable for reflecting the uncertainty associated with estimating I2. It is our hope that the next updates of systematic review and meta-analysis software packages, such as Review Manager, will include confidence intervals for the I2 estimate.

Supporting Information

Appendix S1.

Author Contributions

Conceived and designed the experiments: KT GI BCJ TA LT CG PJD JW MW. Performed the experiments: KT. Analyzed the data: KT. Contributed reagents/materials/analysis tools: KT TA. Wrote the paper: KT GI BCJ TA LT CG PJD JW MW.

References

  1. Engels EA, Schmid CH, Terrin N, Olkin I, Lau J (2000) Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses. Statistics in Medicine 19: 1707–1728.
  2. Higgins JP, Thompson S (2002) Quantifying heterogeneity in a meta-analysis. Statistics in Medicine 21: 1539–1558.
  3. Higgins JP, Thompson S, Deeks J, Altman DG (2003) Measuring inconsistency in meta-analyses. British Medical Journal 327: 557–560.
  4. Higgins JP, Green S (2009) Cochrane Handbook for Systematic Reviews of Interventions, version 5.0.0. John Wiley & Sons.
  5. Lau J, Ioannidis JP, Schmid CH (1998) Summing up evidence: one answer is not always enough. Lancet 351: 123–127.
  6. Ioannidis JP (2008) Interpretation of tests of heterogeneity and bias in meta-analysis. Journal of Evaluation in Clinical Practice 14: 951–957.
  7. Huedo-Medina T, Sánchez-Meca J, Marín-Martínez F (2006) Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychological Methods 11: 193–206.
  8. Mittlböck M, Heinzl H (2006) A simulation study comparing properties of heterogeneity measures in meta-analyses. Statistics in Medicine 25: 4321–4333.
  9. Jackson D (2006) The implications of publication bias for meta-analysis' other parameter. Statistics in Medicine 25: 2911–2921.
  10. Jackson D (2007) Assessing the implications of publication bias for two popular estimates of between-study variance in meta-analyses. Biometrics 63: 187–193.
  11. Rücker G, Schwarzer G, Carpenter JR, Schumacher M (2008) Undue reliance on I2 in assessing heterogeneity may mislead. BMC Medical Research Methodology 8: 79.
  12. Ioannidis JP, Patsopoulos NA, Evangelou E (2007) Uncertainty in heterogeneity estimates in meta-analyses. BMJ 335: 914–916.
  13. Guyatt G, Wyer P, Ioannidis JP (2008) When to believe a subgroup analysis. In: Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. McGraw-Hill.
  14. Cochran WG (1954) The combination of estimates from different experiments. Biometrics 10: 101–129.
  15. Moher D, Tetzlaff J, Tricco AC, Sampson M, Altman DG (2007) Epidemiology and reporting characteristics of systematic reviews. PLoS Medicine 4: e78.
  16. Tricco AC, Tetzlaff J, Pham B, Brehaut J, Moher D (2009) Non-Cochrane vs. Cochrane reviews were twice as likely to have positive conclusion statements: cross-sectional study. Journal of Clinical Epidemiology 62: 380–386.
  17. Chan AW, Altman DG (2005) Epidemiology and reporting of randomized clinical trials published in PubMed journals. Lancet 365: 1159–1162.
  18. Gluud C (2006) The culture of designing hepato-biliary randomised clinical trials. Journal of Hepatology 44: 607–615.
  19. Dickersin K (1990) The existence of publication bias and risk factors of its occurrence. Journal of the American Medical Association 263: 1385–1389.
  20. Dwan K, Altman DG, Arnaiz J, Bloom J, Chan AW, et al. (2008) Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS ONE 3: e3081.
  21. Ioannidis JP (1998) Effect of the statistical significance of results on the time to completion and publication of randomized efficacy trials. Journal of the American Medical Association 279: 281–286.
  22. Thorlund K, Wetterslev J, Awad T, Thabane L, Gluud C (2012) Comparison of statistical inferences from the DerSimonian-Laird and alternative random-effects model meta-analyses - an empirical assessment of 920 primary outcome Cochrane meta-analyses. Research Synthesis Methods 4: 238–253.
  23. Wetterslev J, Thorlund K, Brok J, Gluud C (2009) Estimating required information size by quantifying diversity in a random-effects meta-analysis. BMC Medical Research Methodology 9.
  24. Thorlund K, Imberger G, Walsh M, Chu R, Gluud C, et al. (2011) The number of patients and events required to limit the risk of overestimation of intervention effects in meta-analysis - a simulation study. PLoS ONE 6: e25491.
  25. GRADE Working Group (2009) Grading quality of evidence and strength of recommendations in clinical practice guidelines. Part 1 of 3. An overview of the GRADE approach and grading quality of evidence about interventions. Allergy 64: 669–677.
  26. Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, et al. (2008) GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 336: 924–926.
  27. Thorlund K, Devereaux PJ, Wetterslev J, Guyatt G, Ioannidis JP, Thabane L, Als-Nielsen B, Gluud LL, Gluud C (2009) Can trial sequential monitoring boundaries reduce spurious inferences from meta-analyses? International Journal of Epidemiology 38: 276–286.
  28. Wetterslev J, Thorlund K, Brok J, Gluud C (2008) Trial sequential analysis may establish when firm evidence is reached in cumulative meta-analysis. Journal of Clinical Epidemiology 61: 64–75.
  29. The Nordic Cochrane Centre, The Cochrane Collaboration (2008) Review Manager (RevMan), version 5.0. Copenhagen.