Background
A multicentre randomized controlled trial (RCT) is an experimental study "conducted according to a single protocol but at more than one site and, therefore, carried out by more than one investigator"[
1]. Multicentre RCTs are usually carried out for two main reasons. First, they provide a feasible way to accrue sufficient participants to achieve reasonable statistical power to detect the effect of an experimental treatment compared with some control treatment. Second, by enrolling participants of more diverse demographics from a broader spectrum of geographical locations and various clinical settings, multicentre RCTs increase generalizability of the experimental treatment for future use [
1].
Randomization is the most important feature of RCTs, for on average it balances known and unknown baseline prognostic factors between treatment groups, in addition to minimizing selection bias. Nevertheless, randomization does not guarantee complete balance of participant characteristics especially when the sample size is moderate or small. Stratification is a useful technique to guard against potential bias introduced by imbalance in key prognostic factors. In multicentre RCTs, investigators often use a stratified randomization design to achieve balance over key differences in study population (e.g. environmental, socio-economic or demographical factors) and management team (e.g. patient administration and management) at centre level to improve precision of statistical analysis [
2]. Regulatory agencies recommend that stratification variables in design should usually be accounted for in analysis, unless the potential value of adjustment is questionable (e.g. very few subjects per centre) [
1].
The current study was motivated by the COMPETE II trial which was designed to determine if an integrated computerized decision support system shared by primary care providers and patients could improve management of diabetes [
3]. A total of 511 patients were recruited from 46 family physician practices. Individual patients were randomized to one of the two intervention groups, stratified by physician practice, using permuted blocks of size 6. The number of patients treated by one physician varied from 1 to 26 (quartiles = 7.25, 11, 15; mean = 11; standard deviation [SD] = 6). The primary outcome was a continuous variable representing the change of a 10-point process composite score based on eight diabetes-related component variables from baseline to a mean of 5.9 months' follow-up. A positive change indicated a favourable result. During the study, the possibility of clustering within physician practice and its consequences for statistical analysis were a concern to the investigators. The phenomenon of clustering emerges when outcomes observed from patients managed by the same centre, practice or physician are more similar than outcomes from different centres, practices or physicians. Clustering often arises in situations where patients are selective about which centre they belong to, patients in a centre or practice are managed according to the same clinical care paths, or patients influence each other in the same cluster [
4]. Intraclass (or intracentre) correlation (ICC) is often used to quantify the average correlation between any two outcomes within the same cluster [
5]. It is a number between zero and one. A large value indicates that within-cluster observations are similar relative to observations from other clusters and each observation within cluster contains less unique information. This implies that the independence assumption which many standard statistical models are based on is violated. An ICC of zero indicates that individual observations within the same clusters are uncorrelated and different clusters on average have similar observations.
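As an illustrative sketch only (not the estimation procedure used in this study), the ICC can be estimated from grouped continuous outcomes with the classical one-way analysis of variance estimator. The function names `anova_icc` and `simulate_centres`, the seed and the simulated values are all hypothetical; the simulated data assume a random centre intercept with total variance fixed at one.

```python
import random
from statistics import mean


def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC from a list of per-centre outcome lists."""
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    grand = sum(sum(c) for c in clusters) / n_total
    centre_means = [mean(c) for c in clusters]
    # Between-centre and within-centre mean squares.
    msb = sum(n * (m - grand) ** 2 for n, m in zip(sizes, centre_means)) / (k - 1)
    msw = sum((y - m) ** 2 for c, m in zip(clusters, centre_means) for y in c) / (n_total - k)
    # "Average" centre size, which allows for unequal centre sizes.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    icc = (msb - msw) / (msb + (n0 - 1) * msw)
    return max(0.0, icc)  # negative estimates are truncated at zero


def simulate_centres(n_centres, centre_size, icc, rng):
    """Hypothetical data: random centre intercept plus patient-level noise (total variance 1)."""
    data = []
    for _ in range(n_centres):
        centre_effect = rng.gauss(0.0, icc ** 0.5)
        data.append([centre_effect + rng.gauss(0.0, (1.0 - icc) ** 0.5)
                     for _ in range(centre_size)])
    return data


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
sim = simulate_centres(n_centres=200, centre_size=20, icc=0.2, rng=rng)
print(round(anova_icc(sim), 3))  # should land near the true value of 0.2
```

With few centres or few patients per centre the estimate becomes noisy, which is one reason a rough prior estimate of the ICC is useful at the design stage.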
Through a literature review, we identified six statistical methods that were sometimes employed to analyze continuous outcomes in multicentre RCTs: A. simple linear regression (two sample t-test), B. fixed-effects regression, C. mixed-effects regression, D. generalized estimating equations (GEE), E-1. fixed-effects centre-level analysis, and E-2. random-effects centre-level analysis. The first four methods use patient as unit of analysis, yet address centre effects differently [
6‐
8]. Simple linear regression completely ignores centre effects that are likely to arise from two sources: (1) possible differences in environmental, socio-economic or treatment factors between centres, and (2) potential correlation among patients within centres. Although stratified randomization attempts to minimize the impact of centre on standard error of the treatment effect by ensuring that the treated and control groups are largely balanced with respect to centre, failure to control for stratification in analysis will likely inflate variance of the effect estimate. The fixed-effects model treats each participating centre as a fixed intercept to control for possible population or environmental differences among centres. This model assumes that study subjects from the same centre have independent outcomes, i.e. the intraclass correlation is fixed at zero. The mixed-effects model incorporates dependence of outcomes within a centre and treats centres as random intercepts. Proposed by Liang and Zeger [
9], the generalized estimating equation (GEE) model extends generalized linear regression with continuous, categorical or count outcomes to correlated observations within clusters. Under a commonly used and perhaps oversimplified assumption, that the degree of similarity between any two outcomes from a centre is equal, an exchangeable correlation structure can be used to assess the treatment effect in Models C and D, although the within- and between-centre variances are estimated differently in these two models. Methods E-1 and E-2 are routinely employed to combine information from different studies in meta-analysis [
10]. One can also apply them to aggregate treatment effects over multiple centres [
11‐
13]. The overall effect is obtained as the inverse-variance weighted average of the within-centre treatment differences across centres.
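The fixed-effects centre-level analysis (Method E-1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pooled_centre_effect` and the input layout (one pair of treated and control outcome lists per centre) are hypothetical, and centres with fewer than two patients in either arm are skipped because no within-centre variance can be estimated for them.

```python
from math import sqrt
from statistics import mean, variance


def pooled_centre_effect(centres):
    """centres: list of (treated_outcomes, control_outcomes) pairs, one pair per centre."""
    weighted_sum = 0.0
    weight_total = 0.0
    for treated, control in centres:
        # A within-centre variance needs at least two patients per arm.
        if len(treated) < 2 or len(control) < 2:
            continue
        diff = mean(treated) - mean(control)
        var_diff = variance(treated) / len(treated) + variance(control) / len(control)
        if var_diff == 0:
            continue  # degenerate centre: zero variance would give an infinite weight
        weight = 1.0 / var_diff  # inverse-variance weight
        weighted_sum += weight * diff
        weight_total += weight
    estimate = weighted_sum / weight_total
    standard_error = sqrt(1.0 / weight_total)
    return estimate, standard_error


example = [
    ([2.0, 4.0], [0.0, 2.0]),  # within-centre difference 2, variance of difference 2
    ([3.0, 5.0], [2.0, 4.0]),  # within-centre difference 1, variance of difference 2
    ([5.0], [1.0, 2.0]),       # skipped: only one treated patient
]
print(pooled_centre_effect(example))
```

The random-effects variant (Method E-2) would add an estimate of the between-centre variance of the treatment effect to each `var_diff` before weighting; that step is not shown here.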
To date, only a few studies have been carried out to compare the performance of statistical models in analyzing multicentre RCTs using Monte Carlo simulation [
6,
7,
14], whereas many studies assessed the impact of ICC in cluster randomization trials. Moerbeek et al [
6] compared the simple linear regression model, fixed-effects regression and fixed-effects centre-level analysis with equal centre size. Pickering et al [
7] examined the bias, precision and power of three methods: simple regression, fixed-effects and mixed-effects regression, assuming block randomization of size 2 or 4 on a continuous outcome. In the presence of imbalance and non-orthogonality, they found that ignoring centres or incorporating them as random effects led to greater precision and smaller type II error compared with treating centres as fixed effects. Performance of the GEE approach and centre-level methods was not investigated in that work. Jones et al [
14] compared the fixed-effects and random-effects regression models to a two-step Frequentist procedure as well as a Bayesian model, in the presence of treatment by centre interaction, and recommended fixed-effects weighted method for future analysis of multicentre trials. The investigation was further expanded to assessing correlated survival outcomes from large multicentre cancer trials. A series of random-effects approaches were proposed to account for centre or treatment by centre heterogeneity in proportional hazards models [
15,
16].
A lack of definitive evidence on which models perform best in various situations led to this comprehensive simulation study to examine the performance of all six commonly used models with continuous outcomes. The objective was to assess their comparative performance in terms of bias, precision (simulation standard deviation (SD) and average estimated SE), and mean squared error (MSE) of the point estimator of the treatment effect, empirical coverage of the 95% confidence interval (CI) and the empirical statistical power, over a wide spectrum of ICC values and centre sizes. We did not consider treatment by centre interaction in this study, partly because clinicians and trialists have been making efforts to standardize the conduct and management of multicentre trials via, for instance, uniform patient selection criteria, staff training, and trial monitoring and auditing to reduce heterogeneity of treatment effects among centres. Furthermore, it is uncommon to find clinical trials designed with sufficient power to detect treatment by covariate interactions.
In this paper, we survey in detail six methods to investigate the effect of a treatment in multicentre RCTs. We outline the design and analysis of an extensive simulation study, and report how model performance varies with ICC, centre size and the number of centres. We also present the estimated effect of the computerized decision support system on management of diabetes using different methods.
Discussion
In this paper, we investigated six modelling strategies in a Frequentist framework to study the effect of an experimental treatment compared to the control treatment in the context of multicentre RCTs with a continuous outcome. We focused on three designs with equal or varying centre sizes and a treatment allocation ratio of 1:1 in the absence of treatment by centre interaction. Results of this simulation study showed that, when the proportion of patients allocated to the experimental treatment was the same in each centre or subject to chance imbalance only, models using patient-level and centre-level data yielded unbiased point estimates of treatment effect across a wide spectrum of ICC values. Ignoring stratification by centre or within-centre correlation did not bias the estimated treatment effects even when ICC was large. In fact, Parzen et al showed that mathematically the usual two-sample t-test, naively assuming independent observations of the response within centre was asymptotically unbiased in this context [
30].
The simulation study also indicated that these models produced different standard errors of the treatment effect estimate, and the properties of interval estimates were affected by several factors: whether and how centre effects were incorporated in analysis, the combination of centre size and number of participating centres, and the level of non-orthogonality of the observed data. Treating centre as a random intercept resulted in the most precise estimate, and nominal values of coverage and power were attained in all circumstances. The fixed-effects model performed very similarly to the mixed-effects model in balanced designs, but was slightly less efficient when the number of centres was large (J > 20) in an unbalanced design. Pickering and Weatherall observed the same pattern in their simulation study comparing three patient-level models with small ICC values [
7]. The GEE model using the information sandwich covariance method tended to underestimate the standard error when the number of centres was small, a property noticed by other researchers [
20,
31]. This resulted in higher statistical power. That is, the treatment effect estimate was more likely to be significant with a smaller standard error, but was associated with lower coverage of the confidence interval. Murray et al suggested that at least 40 centres should be used to ensure a reliable estimate of the standard error in the context of cluster randomized trials [
32]. Our simulation results suggested that such a cut-off value was also applicable to multicentre RCTs. Failure to control for centre effects in any form resulted in inflation of the standard error, falsely high interval coverage and a sizable drop in power as ICC increased. Parzen et al quantified the impact of correlation among observations within centre on the variance of the treatment effect estimate in Model A as 1/(1-ICC) [
30]. Alternatively, one may consider a variant of robust variance estimation or a GEE model with an independent working correlation to control for the impact of ICC on variance estimation using the t-test. Centre-level models generally produced larger standard errors and lower coverage or power than the patient-level models. The centre-level random-effects model incorporated variability of the treatment effect over centres, and was not a fair comparator to other models. Interestingly, this model seemed to fare better than the centre-level fixed-effects model in terms of precision and coverage even though the simulated datasets contained no treatment by centre interaction. Although the random-effects centre-level model may be a reasonable alternative to patient-level models when the number of patients per centre is large (≥30), centre-level models cannot adjust for patient-level covariates, a potentially fatal drawback in the presence of patient prognostic imbalance.
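One possible reading of the "robust variance estimation" alternative mentioned above is a cluster-robust (sandwich) standard error for the simple difference in means, with centre as the cluster. The sketch below is an assumption-laden illustration rather than the method evaluated in this study: the function name `cluster_robust_se` is hypothetical, and no small-sample correction is applied, so with few centres the standard error may be too small, in line with the behaviour of the sandwich estimator noted earlier in this section.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def cluster_robust_se(outcomes, treated, centre):
    """Difference in means with a sandwich standard error clustered on centre.

    outcomes: list of floats; treated: list of 0/1 flags; centre: list of centre labels.
    """
    y_t = [y for y, t in zip(outcomes, treated) if t == 1]
    y_c = [y for y, t in zip(outcomes, treated) if t == 0]
    mean_t, mean_c = mean(y_t), mean(y_c)
    # Each patient's residual is scaled by its arm size and summed within centre;
    # this is the treatment-coefficient row of the usual sandwich formula for
    # a regression of the outcome on an intercept and a treatment indicator.
    score = defaultdict(float)
    for y, t, j in zip(outcomes, treated, centre):
        if t == 1:
            score[j] += (y - mean_t) / len(y_t)
        else:
            score[j] -= (y - mean_c) / len(y_c)
    effect = mean_t - mean_c
    return effect, sqrt(sum(s ** 2 for s in score.values()))


y = [1.0, 3.0, 0.0, 2.0]
arm = [1, 1, 0, 0]
# Treating every patient as their own cluster gives the usual
# heteroscedasticity-robust standard error.
print(cluster_robust_se(y, arm, ["a", "b", "c", "d"]))
# Pairing one treated and one control patient per centre removes the centre shift.
print(cluster_robust_se(y, arm, ["a", "b", "a", "b"]))
```

In the second call the entire spread in the outcomes is a shift between the two centres, so the clustered standard error collapses to zero; real data would of course retain within-centre noise.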
Statisticians have different viewpoints on treating centre effects and treatment by centre interaction as fixed or random effects when analyzing multicentre RCTs [
12,
13,
21,
33]. Our simulation results demonstrated the advantage of treating centres as random intercepts in the absence of treatment by centre interaction. When many centres enrol only a few patients and allocation is unbalanced, the random intercept models can give more precise estimates of the treatment effect than the fixed intercept models, because they recover inter-centre information in unbalanced situations. For instance, in a multicentre RCT consisting of 45 centres each recruiting 4 patients, the empirical variance of the estimator of the treatment effect resulting from the fixed-effects model was 24.8% and 26.0% greater than that from the random-effects model when the ICC was 0.01 and 0.05, respectively. These percentages compare the empirical variance of 0.162² with 0.145² for ICC = 0.01, and 0.174² with 0.155² for ICC = 0.05 (Table 6, scenario 4). We therefore take the same position as Grizzle [
33] and Agresti and Hartzel [
12] that, "Although the clinics are not randomly chosen, the assumption of random clinic effect will result in tests and confidence intervals that better capture the variability inherent in the system more realistically than when clinic effects are considered fixed".
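The 24.8% and 26.0% figures quoted in this paragraph can be checked directly from the four empirical standard deviations reported in the text. The short script below only reproduces that arithmetic; the dictionary layout and the function name `percent_variance_increase` are illustrative choices.

```python
# Empirical SDs of the treatment effect estimator quoted in the text
# (45 centres of 4 patients each), keyed by ICC.
empirical_sd = {
    0.01: {"fixed": 0.162, "random": 0.145},
    0.05: {"fixed": 0.174, "random": 0.155},
}


def percent_variance_increase(sd_fixed, sd_random):
    """Percentage by which the fixed-effects variance exceeds the random-effects variance."""
    return 100.0 * (sd_fixed ** 2 / sd_random ** 2 - 1.0)


for icc, sd in empirical_sd.items():
    increase = percent_variance_increase(sd["fixed"], sd["random"])
    print(f"ICC = {icc}: {increase:.1f}% larger variance under the fixed-effects model")
```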
Our results have some implications for the design of multicentre RCTs in the absence of treatment by centre interaction. First, regardless of the pre-determined allocation ratio, permuted block randomization (of relatively small block sizes) should be used to maintain approximate balance or orthogonality (i.e. same treatment allocation proportion across centres [
7]) between treatments and centres, so that their individual effects can be evaluated independently. Variable block sizes can be used to strengthen allocation concealment. Second, for a given sample size, the number of patients randomized in the majority of centres should be sufficiently large to ensure a reliable estimate of within-centre variation. Third, it is essential for investigators to obtain a rough estimate of ICC for within-centre responses, through literature review or a pilot study. To reach nominal power of 80% or 90% (in the absence of clustering), centre effects should be taken into consideration in sample size assessment. When centre effects are included without treatment by centre interaction, the analysis becomes more powerful than a two-sample t-test. One method to assess sample size is to start with a two-sample t-test for continuous outcomes (ignoring centre effects) and then multiply the original estimated error variance by a variance inflation factor of 1/(1-ICC). This factor has the effect of increasing the required sample size. Ignoring centre effects thus results in a larger sample size in the absence of interaction. Sample size determined using the information sandwich covariance of the GEE model could lead to a slight loss of power when the number of centres is small (<40) and no proper adjustment is made. Lastly, there is no particular reason to require equal numbers of patients to be enrolled in all participating centres, and this is seldom the case in practice. Throughout the simulations, we observed similar results for studies of equal and varying centre sizes. In the study, we considered three scenarios representing the particular centre composition of the COMPETE II trial.
For discussion of the potential impact of enrolment patterns on the point and interval estimates of treatment effect, readers can refer to the publications on random enrolment versus determined enrolment, and relative efficiency between equal and unequal cluster sizes in the reference list [
34,
35].
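The sample size adjustment described in this paragraph can be sketched as below. This is a rough illustration under stated assumptions: it uses the standard normal approximation to the two-sample t-test with 1:1 allocation, and the function name `n_per_arm`, its parameters and the example effect size are hypothetical rather than taken from the COMPETE II design.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sigma, icc=0.0, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided comparison of two means (normal approximation).

    delta: difference in means to detect; sigma: outcome SD ignoring centre;
    icc: anticipated intraclass correlation, applied as the factor 1/(1 - ICC).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    inflated_variance = sigma ** 2 / (1.0 - icc)  # variance inflation factor 1/(1 - ICC)
    return ceil(2.0 * (z_alpha + z_beta) ** 2 * inflated_variance / delta ** 2)


# Hypothetical standardized difference of 0.5 with 80% power:
print(n_per_arm(delta=0.5, sigma=1.0))           # no clustering
print(n_per_arm(delta=0.5, sigma=1.0, icc=0.1))  # variance inflated for ICC = 0.1
```

Because the factor 1/(1-ICC) grows quickly as the ICC approaches one, even a rough ICC estimate from the literature or a pilot study can change the planned sample size noticeably.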
The current ICH E9 guideline recommends that researchers investigate treatment effect using a model that allows for centre differences in the absence of treatment by centre interaction [
1]. However, it is implausible or impractical to include centre effects in statistical modelling or to stratify randomization by centre when it is anticipated from the start that trials may have very few subjects per centre. As acknowledged in the document, these recommendations are based on fixed-effects models. Mixed-effects models, on the other hand, may also be used to explore the centre and treatment by centre interaction effects, especially when the number of centres is large [
1]. Our simulation results indicated that when a considerable number of centres contain only a few patients, adjusting for centre as a fixed effect may lead to reduced precision (depending on the distribution of patients between arms) compared with the naïve unadjusted analysis. Our work complements the ICH E9 guideline by studying the impact of intraclass correlation on the assessment of treatment effects - a challenge that is seldom discussed, although routinely faced by investigators in reality. Our investigation suggests that (1) ignoring centre effects completely may cause substantial overestimation of the standard error, a spurious increase in coverage of the confidence interval and a reduction in power; and (2) mixed-effects models and GEE models, if employed appropriately, can produce accurate and precise effect estimates, regardless of the degree of clustering. We recommend considering these methods in developing future guidelines.
When the number of patients per centre is very small, it is not practical to include centre as a fixed effect to analyze patient-level data, as centre effects cannot be reliably estimated, and precision of the treatment effect will be compromised. In fact for extremely small centres, all patients may be allocated to the same treatment group, and such centres will be ignored by the fixed-effects model [
36‐
39]. The alternatives include collapsing all centres to perform a two-sample t-test, collapsing smaller centres to create an artificial centre and treating it as a fixed effect, and exploring other models discussed above. The mixed-effects model utilizes small centres more efficiently by "borrowing" information from larger centres. The GEE approach models the average treatment difference across all centres and adjusts for centre effects through a uniform correlation structure. This is an intuitively more efficient model which unfortunately does not always converge when the number of patients per centre is highly variable (simulation scenarios 7 and 8). In the current study, non-convergence problems were more likely to arise for very small or large ICC values (less than 0.1 or greater than 0.4 for block size 2 or 4) due to non-positive definite working correlation matrices, and the frequency could be as high as 10% after 2000 iterations. Conversely, convergence problems did not occur for the mixed-effects models in any scenario. Our results show that analysis of trials consisting of very small centres (i.e. those containing fewer than 2 patients per arm) using centre-level models may not be an optimal strategy, because the within-centre standard deviation of the treatment difference cannot be estimated for such centres, and consequently these very small centres are excluded from the analysis.
Results of two large empirical studies and one systematic review of cluster RCTs in primary care clinics suggested that most ICC values on physical, functional and social measures were less than 0.10 [
26‐
28]. The estimated ICC in the COMPETE II trial using GEE and linear mixed-effects model, on the other hand, was 0.124 and 0.138, respectively. We chose to include rare yet possible large ICC values (0-0.75) in this simulation to examine the overall trend of model performance by ICC, and for the purpose of completeness and generalizability. Readers should anticipate the ICC values likely to emerge from their studies when interpreting these results. Throughout the work, we quantified correlation among subjects within centre using ICC, the most commonly used concept to assess clustering in biomedical literature. As indicated in previous sections, ICC reflects the interplay of two variance components in multicentre data: the between-centre variance and within-centre variance. These variance components are relatively easy to interpret for analysis of continuous outcomes using linear models. For analysis of binary or time-to-event data from multicentre trials using generalized mixed and frailty models, interpretation of centre heterogeneity can present challenges because random effects are linked to the outcome via nonlinear functions [
40]. Reparameterization of the probability density function may be used to assess the impact of within- and between-centre variance. Interested readers can refer to Duchateau and Janssen [
40] for more details.
A major limitation of the study is that it did not address model performance when the treatment by centre interaction exists. The interactions may be due to different patient populations or variable standard of care. Interested readers may read Moerbeek et al [
6] for formulas of the variance of the treatment effect estimator in different models and Jones et al [
14] for simulation results. Future studies addressing interaction effects in multicentre RCTs are needed. Datasets in the current paper were generated based on a moderate treatment effect reflected by the standardized mean difference between the treatment and control groups. More or less prominent treatment effects are also likely to occur in clinical studies and similar findings are expected. The current study investigated continuous outcomes in two groups from a Frequentist perspective. The models discussed above can be naturally extended to compare three or more treatments. Agresti and Hartzel [
12] surveyed different methods to evaluate treatments for binary outcomes in multicentre RCTs. Non-parametric approaches and Bayesian methods are also available to obtain treatment contrast. Interested readers can refer to Aitkin [
41], Gould [
11], Smith et al [
42], Legrand et al [
16], and Louis [
43], to name a few.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
RC participated in the design of the study, simulation, analysis and interpretation of data, and drafting and revision of the manuscript. LT contributed to the conception and design of the study, interpretation of data and revision of the manuscript. JM contributed to the design of the study and revision of the manuscript. AH contributed to acquisition of data and critical revision of the manuscript. EP and PJD advised on critical revision of the manuscript for important intellectual content. All authors have read and approved the final manuscript.