Background
A meta-analysis aggregates indexes of effectiveness from individual trials into one pooled estimate. When the outcome of interest is a dichotomous variable, the commonly used effect sizes include the odds ratio (OR), the relative risk (RR), and the risk difference (RD). When the outcome is a continuous variable, the effect size is commonly represented as either the mean difference (MD) or the standardised mean difference (SMD) [1].
The MD is the difference in means between the treatment group and the control group, while the SMD is the MD divided by a standard deviation (SD) derived from one or both of the groups. Depending on how this SD is calculated, the SMD has several versions, such as Cohen's d [2], Glass's Δ [3], and Hedges' g [4].
When the outcome is measured in different units across trials, we have no choice but to use the SMD to combine the outcomes in a meta-analysis. On the other hand, when the outcome is measured in the same unit in every trial, we can, in theory, use either the MD or the SMD. In this latter case, there currently appears to be no unanimous agreement about which effect size is preferable, and different textbooks of meta-analysis provide differently nuanced recommendations about selecting the appropriate effect size for continuous variables.
According to the Cochrane Handbook for Systematic Reviews of Interventions [1], the “selection of summary statistics for continuous data is principally determined by whether studies all report the outcome using the same scale when the MD can be used.” The American Psychological Association (APA) Task Force on Statistical Inference maintains that “if the units of measurement are meaningful on a practical level (e.g. number of cigarettes smoked per day), then we usually prefer a MD to a SMD” [5]. Egger et al. write that “the overall treatment effect [in terms of SMD] can also be difficult to interpret as it is reported in units of standard deviation rather than in units of any of the measurement scales used in review” [6].
On the other hand, some authors recommend the SMD along with, or over, the MD. The APA Publication Manual suggests that it can often be valuable to report not only the MD but also the SMD [7]. Borenstein, in his “Introduction to Meta-Analysis” [8], wrote that if the unit is unfamiliar, the SMD serves as an easy way to judge the magnitude of the effect, thanks to the general rules of thumb described by Cohen, which suggest that an SMD of 0.2 represents a “small” effect, an SMD of 0.5 a “medium” effect, and an SMD of 0.8 a “large” effect [2]. For example, when you read that a treatment group’s mean post-treatment score on scale X was 10 points higher than that of a control group, there is no way of appreciating how much of a difference this actually represents unless you are very familiar with the scale being used. But if the difference is expressed as an SMD of 0.5, for example, you can understand that it represents moderate effectiveness in comparison with the control. In fact, Tian et al. noted that the SMD does not depend on the unit of measurement, which is why it has been widely used as a measure of intervention effect in many applied fields [9].
The preferability of the MD or the SMD can be examined from three aspects. First, which of the two is clinically more interpretable? The arguments summarised above seem to concern mainly this aspect. Second, which is more generalizable, given that any summary index should generalize well enough to be applied to the next group of patients [10]? Third, which is statistically more powerful, given that we expect meta-analyses to provide as precise an estimate of the treatment effect as possible and to be as sensitive as possible to differences among treatments?
To the best of our knowledge, no systematic assessment focusing on the second or third aspect of the MD and the SMD has been conducted. The objective of this research was, therefore, to examine empirically which index is more generalizable and statistically powerful in meta-analyses when the same unit is used: the MD or the SMD?
Discussion
We empirically examined 1068 reviews from the Cochrane Database of Systematic Reviews. The I2 index, an index of heterogeneity, was significantly smaller when the meta-analysed results were expressed in terms of the SMD rather than the MD. When each of the included RCTs was compared against the pooled results of the remaining RCTs in each meta-analysis, the SMD showed a significantly greater percentage agreement than the MD in both the random-effects and the fixed-effect models for all degrees of heterogeneity. On the other hand, no statistically significant difference was found between the MD and the SMD in terms of statistical power for identifying significant results in either the random-effects or the fixed-effect model.
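For reference, the I2 index follows from Cochran's Q statistic of an inverse-variance fixed-effect pooling. The following is a minimal sketch under the standard formulas; the trial-level effect estimates and standard errors in the example are made up, not taken from the reviews analysed here:

```python
def pool_and_i2(effects, ses):
    """Fixed-effect inverse-variance pooling plus the I2 heterogeneity index.

    effects -- per-trial effect estimates (MD or SMD)
    ses     -- their standard errors
    Returns (pooled_estimate, I2_percentage).
    """
    weights = [1 / se**2 for se in ses]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (e - pooled)**2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, i2

# Two widely separated trials yield a very high I2:
pooled, i2 = pool_and_i2([0.2, 0.8], [0.1, 0.1])  # pooled = 0.5, I2 ≈ 94%
```

Because I2 is truncated at zero, perfectly homogeneous trials (Q ≤ df) report I2 = 0% rather than a negative value.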
Our research agrees with one previous study regarding heterogeneity and statistical power. That study examined a relatively new index of effect, the ratio of means, for analysing continuous outcomes in comparison with the more traditional effect sizes, namely the MD and the SMD. Consequently, it examined only the meta-analyses in which the ratio of means could be calculated, further limited the analyses to meta-analyses containing five or more studies, and examined the random-effects model only. Nonetheless, that study also found that the P-value did not statistically differ between the MD and the SMD and that the SMD was less heterogeneous (defined by P < 0.1 for the Q statistic) than the MD [17, 18]. When we conducted sensitivity analyses using our dataset by including only the meta-analyses that contained five or more studies, the results were similar.
The percentage agreement figures for continuous outcomes, whether expressed as the MD or the SMD, were very low in our study, even when the associated I2 index was reasonably low. Clinicians and researchers should therefore be advised to use the most generalizable index of effectiveness, while keeping in mind that the actual degree of expected overlap may not be as high as one would expect. We found a 5-percentage-point difference in the degree of expected agreement between results expressed as the SMD and those expressed as the MD; we consider this difference clinically meaningful and non-negligible.
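The leave-one-out comparison underlying these figures can be sketched as follows. Note that the paper's exact agreement criterion is not given in this excerpt; here, purely for illustration, agreement is defined as the left-out trial's 95% confidence interval containing the fixed-effect pooled estimate of the remaining trials:

```python
def loo_agreement(effects, ses, z=1.96):
    """Leave-one-out percentage agreement (illustrative criterion).

    For each trial, pool the remaining trials (fixed-effect, inverse
    variance) and check whether the left-out trial's 95% CI contains
    that pooled estimate. Returns the percentage of agreeing trials.
    """
    agree = 0
    for i, (e, se) in enumerate(zip(effects, ses)):
        rest = [(x, s) for j, (x, s) in enumerate(zip(effects, ses)) if j != i]
        w = [1 / s**2 for _, s in rest]
        pooled = sum(wi * x for wi, (x, _) in zip(w, rest)) / sum(w)
        if e - z * se <= pooled <= e + z * se:
            agree += 1
    return 100 * agree / len(effects)
```

Under this criterion, fully consistent trials agree 100% of the time, while a single outlying trial both fails to cover the others' pooled estimate and drags that estimate away from the remaining trials.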
Much to our surprise, however, the Cochrane authors often pooled their continuous outcomes as the MD and/or the SMD even when the I2 statistics suggested extreme heterogeneity. While it is true that meta-analyses of continuous outcomes tend to be associated with a greater I2 value than those of dichotomous outcomes because of the former’s greater statistical power, the generalizability of such meta-analytic results is highly suspect when the I2 values are as high as 80% or 90%.
The comparison of statistical power in the context of greater heterogeneity merits a comment. In this research, we found that the SMD was more generalizable and less heterogeneous than the MD and that there was no significant difference in statistical power between the two. Strictly speaking, however, if one statistic is less generalizable and more heterogeneous than another, it is meaningless to discuss their relative detection power: a high percentage agreement and a low heterogeneity are prerequisites for any meaningful discussion of statistical power.
As we outlined in the Introduction, whether the MD or the SMD is clinically preferable as a summary index for meta-analyses of continuous outcomes remains controversial. When the outcome is measured in the same natural unit, such as the amount of bleeding or the number of days of hospitalisation, the MD is definitely better than the SMD from the viewpoint of interpretability. However, our study suggests that the SMD may be preferable to the MD from the viewpoint of generalizability. When the outcome is measured using the same patient-reported outcome (PRO) measure, such as the Hamilton Rating Scale for Depression [19], the Short-Form 36 Quality of Life Questionnaire [20], or the Severity Scoring of Atopic Dermatitis [21], even though all the outcomes are measured in the same unit, the superior interpretability of the MD is not guaranteed unless most clinicians are very familiar with that scale. And for many clinicians in most fields of medicine, such universally known and used PRO instruments are probably rare to non-existent. In such instances, the SMD might be more interpretable than the MD for two reasons. First, the SMD can be interpreted using the general rule of thumb reported by Cohen, in which an SMD of 0.2 represents a small effect, an SMD of 0.5 a medium effect, and an SMD of 0.8 or larger a large effect [2]. Second, the SMD can be directly and easily converted into a “number needed to treat” (NNT) if the control event rate can be assumed [22, 23]. In all these instances of PRO instruments, the SMD appears to be more generalizable than the MD.
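The SMD-to-NNT conversion can be sketched with the normal-distribution method commonly attributed to Furukawa; we assume this method here for illustration, and the exact procedure in the cited references may differ in detail. Given an assumed control event rate (CER, the proportion of control patients achieving the outcome of interest), the SMD implies an experimental event rate, from which the NNT follows:

```python
from statistics import NormalDist

def smd_to_nnt(smd, cer):
    """Convert an SMD into a number needed to treat (NNT).

    cer -- assumed control event rate (proportion of control patients
           achieving the favourable outcome), 0 < cer < 1.
    Assumes the normal-distribution conversion often attributed to
    Furukawa; the exact method in the cited references may differ.
    """
    nd = NormalDist()
    # Shift the control group's normal quantile by the SMD to get the
    # implied experimental event rate, then invert the risk difference.
    eer = nd.cdf(smd + nd.inv_cdf(cer))
    return 1 / (eer - cer)

# e.g. an SMD of 0.5 with a 20% control response rate gives an NNT of about 6
nnt = smd_to_nnt(0.5, 0.20)
```

The resulting NNT depends on the assumed CER, which is why the conversion is usually tabulated over a range of plausible control event rates.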
There is another way to increase the interpretability of continuous outcomes. When the minimal important change (MIC) is known, various methods have been proposed to facilitate the interpretability of continuous outcomes [24], including conversion into MIC units and dichotomisation at the MIC threshold. Each has its own advantages and disadvantages, and a comprehensive discussion of their relative merits is beyond the scope of the present study. Unfortunately, the MIC is often either not known or, if known, may not be very precise for most existing PRO measures.
Finally, the SMD may have another important limitation. Because its value derives from the difference in means between the treatment and control groups divided by their SD, the SMD would be overestimated if the variability of the patients is artificially or accidentally reduced, and underestimated if the variability is increased [25]. We think that our results concerning the greater generalizability of the SMD partially allay this concern: despite the variability of the SD across trials, the SMD had better external applicability and can therefore be said to have been less vulnerable to over- or underestimation. Nevertheless, the possibility of a too small or too large SD, and a correspondingly overestimated or underestimated SMD, should always be borne in mind when interpreting the SMD.
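This sensitivity to the SD is easy to demonstrate numerically. In the following sketch (with made-up numbers), the same 5-point mean difference yields twice the SMD when the sample's variability is halved, for example through overly restrictive inclusion criteria:

```python
import math

def cohens_d_from_summary(md, sd_t, n_t, sd_c, n_c):
    """Cohen's d from summary statistics: MD over the pooled SD."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return md / pooled_sd

# Identical 5-point mean difference, but the restricted sample's SD is halved:
d_broad = cohens_d_from_summary(5, 10, 50, 10, 50)     # d = 0.5
d_restricted = cohens_d_from_summary(5, 5, 50, 5, 50)  # d = 1.0
```

The treatment effect in the outcome's own units is unchanged, yet the standardised effect doubles, which is precisely the over-estimation described above.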
Our research has two limitations. First, we did not consider the influence of multiple comparisons in our analyses. However, even if we corrected the alpha level from 0.05 to 0.0042 using the Bonferroni method to account for our 12 comparisons (0.05/12 ≈ 0.0042), only the difference in the absolute Z-score in the random-effects model would lose its significance; all the other results would not change. Second, we did not categorise the nature of the continuous outcomes. It is possible that subjective continuous outcomes are more prone to unstable measurement and hence more heterogeneous than objective continuous outcomes. How generalizability and power may be influenced by such differences in continuous outcomes will be an important topic for future research.
Competing interests
The author(s) declare that they have no competing interests.
Authors’ contributions
NT and TAF conceived and designed the study. NT, YO, AT and YH extracted data. TS checked data extraction and analysed data. All authors commented on drafts of the manuscript and approved the final manuscript.