The reporting and analysis of SWTs pose many of the same challenges as for CRTs, and the guiding principles developed for CRTs can be applied. However, some challenges are unique to SWTs, and guidance to overcome them is currently absent. One issue is standardised reporting of the design of SWTs, and Copas et al., in this series, address terminology and a taxonomy of stepped-wedge trials for clearer presentation of the designs [3]. In this article we focus on two further issues: reporting of results of SWTs, and selecting an optimal analysis strategy that is statistically efficient and leads to unbiased estimates of the effect of the intervention with appropriately characterised confidence levels.
We first discuss the two issues outlined above in more detail. We then describe how ten recently reported SWTs approached these two issues. Finally, we critically appraise the analytic approach taken by three ‘case studies’ that represent a range of different elements of SWT design. We conclude by discussing issues raised by this investigation and identify some potential ways forward.
Issues in the reporting and analysis of an SWT
Aspects of the design of SWTs are described in detail in Copas et al. [3]. Clusters are collections of individuals, such as schools, homes, or hospitals. SWTs randomly allocate clusters to ‘groups’ of clusters that cross over to the intervention at different ‘crossover points’. SWTs have up to three main phases [3]. For all SWTs there will be a ‘rollout period’ during which groups of clusters are crossing over from the control condition (often ‘business as usual’) to the intervention condition [4]. At any one time during this rollout period, some groups of clusters will have been allocated to receive the intervention condition while others will have been allocated to receive the control condition. The time between the crossover of successive groups is referred to here as the time between successive crossover points, and sometimes elsewhere as ‘step length’. Outcome data may be collected before the rollout period, when all clusters are in the control condition, or later, when all clusters are in the intervention condition.
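The allocation pattern described above can be sketched as a small table of 0/1 indicators, one row per group and one column per period. This is a minimal illustration under stated assumptions: the function name `stepped_wedge_schedule` and its parameters are hypothetical, one group is assumed to cross over at each crossover point, and periods are assumed to be of equal length.

```python
def stepped_wedge_schedule(n_groups, periods_before=1, periods_after=1):
    """Return one row per group; entry j is 1 if the group has crossed over by period j.

    Hypothetical helper. Assumes one group crosses over at each crossover point,
    so the rollout period contains n_groups - 1 periods between crossover points.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    n_periods = periods_before + (n_groups - 1) + periods_after
    schedule = []
    for g in range(n_groups):
        first_exposed = periods_before + g  # period in which group g starts the intervention
        schedule.append([1 if j >= first_exposed else 0 for j in range(n_periods)])
    return schedule


for row in stepped_wedge_schedule(3):
    print(row)
```

With three groups and one period on either side of the rollout period, the first column is all control and the last column is all intervention, which corresponds to the optional phases before and after rollout.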
SWTs are characterised by the timing of the participants’ enrolment and exposure to control and/or intervention conditions within the trial, the duration of follow-up, and the measurements collected during follow-up. For example, individuals may be enrolled and outcome data may be collected by following individuals over time until some event occurs, such as death or becoming a disease case, or until any other ‘censoring’ event when they become ineligible for follow-up. Alternatively, data may be collected at discrete time points over the course of follow-up. Analysing and reporting the data arising from this array of study design characteristics pose some common and some unique challenges.
Reporting of SWTs
Standardisation of reporting practices has greatly aided the interpretation and synthesis of results from CRTs [5]. In contrast, there is no standard reporting template for SWTs in public health or any other field.
Two features make SWTs more complex to report than the equivalent CRTs, requiring adaptation of the approaches to reporting CRTs. First, SWTs randomly allocate clusters to groups that determine the timing of introduction of the intervention, rather than, as in CRTs, to the study control and intervention conditions [3]. There may be many groups to which clusters are allocated, and the number of groups will always be greater than the number of conditions. Second, in SWTs, the data corresponding to the intervention condition will be, on average, collected later than data corresponding to the control condition [3].
Participant flow, or CONSORT diagrams [6], are very often used in the reporting of CRTs [7]. These diagrams include rates of recruitment, refusal, drop-out, loss to follow-up, and missing outcome data by study condition among clusters and individuals [8, 9]. The large number of allocation groups and the crossover from control to intervention allocation can make it less straightforward to present a participant flow (that is, CONSORT diagram) for SWTs relative to CRTs.
Another almost universally supported characteristic of CRT reporting is an assessment of whether or not the randomisation procedure has resulted in study conditions that are balanced at baseline in terms of important covariates [6]. This is because although randomisation ensures that there is no systematic bias in allocation, the number of clusters may not be large enough to assume that there are no chance imbalances [10]. The large number of groups, and correspondingly small number of clusters per group, may mean that presentation of group characteristics is infeasible. Researchers may prefer to present an assessment of balance by condition. However, balance between the conditions of SWTs often also depends on the presence of secular trends in the outcome. To assess the risk of bias, it may be important to differentiate between imbalances that are due to chance and imbalances that are systematic.
The presence of a secular trend in the outcome will bias an unadjusted comparison between outcomes corresponding to intervention and control conditions [4]. Accounting for the potential bias from secular trends in the outcome is a key feature of the analysis methods (see below). Reporting and assessment of the trend is important for understanding the extent of the risk of bias, and the appropriateness of the analysis method.
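The bias from a secular trend can be illustrated with a noise-free sketch. The layout (three groups, four periods), the linear trend of one unit per period, and the names `SCHEDULE`, `cell_mean`, and `naive_difference` are all assumptions made for illustration; they are not taken from any of the trials discussed. The true intervention effect is set to zero, so any non-zero estimate is bias.

```python
# Illustrative stepped-wedge layout (an assumption, not taken from any trial):
# 3 groups, 4 periods; an entry is 1 once the group has crossed over.
SCHEDULE = [
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]


def cell_mean(period, exposed, trend=1.0, effect=0.0):
    """Noise-free mean outcome for one group-period cell (hypothetical model)."""
    return trend * period + effect * exposed


def naive_difference(schedule, trend=1.0, effect=0.0):
    """Unadjusted difference: mean of intervention cells minus mean of control cells."""
    treated, control = [], []
    for row in schedule:
        for period, exposed in enumerate(row):
            target = treated if exposed else control
            target.append(cell_mean(period, exposed, trend, effect))
    return sum(treated) / len(treated) - sum(control) / len(control)


naive = naive_difference(SCHEDULE, trend=1.0, effect=0.0)
print(f"true effect = 0.0, naive estimate = {naive:.2f}")
```

Because intervention cells sit later in calendar time on average, the unadjusted comparison attributes the trend to the intervention (about 1.67 units here); setting `trend=0.0` removes the discrepancy.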
Analysis of an SWT
Ideally, the analysis method for an SWT will (1) yield an unbiased estimate of the intervention effect, (2) appropriately reflect the level of uncertainty in the point estimate, and (3) be as statistically efficient as possible.
Since SWTs are a type of CRT, principles of analysis for CRTs can be used to guide the analyses in SWTs. For instance, data on individuals, or any other sub-cluster unit, are likely to be correlated with data from others in the same cluster [11]. There is a rich literature on this issue because it also arises in parallel CRTs [10, 11]. However, analysis of SWTs poses some additional challenges. In particular, in SWTs the effect estimate is potentially confounded by secular changes in the outcome [4]. This is rarely an issue for CRTs as clusters allocated to the intervention and control conditions are usually followed up (that is, data are collected) over the same time period.
Taking these issues into consideration, there are several ways to analyse data from an SWT. These are primarily individual-level analyses and adopt one of two broad approaches to address potential bias from secular trends.
The first approach compares outcomes associated with the control and intervention conditions within the periods between successive crossover points, implicitly taking into account secular trends by conditioning on time. Observations corresponding to periods when all clusters are in the control or intervention condition do not contribute to the effect estimate (except indirectly to increase the precision by adjustment using outcome data from before randomisation). Parametric or semi-parametric models available include Cox regression or conditional logistic regression. Alternatively, researchers could calculate the intervention effect size for each of several time intervals, such as the periods between successive crossover points, and plot or summarise these. The advantage of this approach is that it preserves the randomisation; it is sometimes referred to as a ‘vertical’ analysis [12]. This approach also avoids the need to specify time trends in the outcome. A disadvantage of a vertical analysis is that it is unclear how to acknowledge appropriately the clustering of participants over time within clusters. For this reason, we have not observed any strictly vertical analyses in the literature. While we have observed analysis by Cox regression conditioning on time [12], this was in conjunction with a random (frailty) effects analysis, so that, in order to account for clustering, the analysis used information over time and not solely vertical information in the estimate of the effect.
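The option of calculating an effect for each period and then summarising can be sketched as follows. This is a minimal illustration, not a recommended estimator: the function name `vertical_estimate`, the noise-free example data, and the unweighted average across periods are all assumptions, and the sketch returns a point estimate only, so it does not address the clustering problem noted above.

```python
def vertical_estimate(schedule, means):
    """Within-period intervention-minus-control differences and their simple average.

    schedule[g][j] is 1 if group g is in the intervention condition in period j.
    means[g][j] is the mean outcome of group g in period j.
    Periods in which all groups share one condition are skipped.
    """
    diffs = {}
    n_groups = len(schedule)
    n_periods = len(schedule[0])
    for j in range(n_periods):
        treated = [means[g][j] for g in range(n_groups) if schedule[g][j] == 1]
        control = [means[g][j] for g in range(n_groups) if schedule[g][j] == 0]
        if treated and control:
            diffs[j] = sum(treated) / len(treated) - sum(control) / len(control)
    if not diffs:
        raise ValueError("no period contains both conditions")
    overall = sum(diffs.values()) / len(diffs)  # unweighted average (an assumption)
    return diffs, overall


# Hypothetical noise-free data: baseline 10, trend 1.5 per period, true effect 2.
schedule = [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
means = [[10 + 1.5 * j + 2.0 * x for j, x in enumerate(row)] for row in schedule]
diffs, overall = vertical_estimate(schedule, means)
print(diffs, overall)
```

In this example only the two periods inside the rollout contribute, and the secular trend cancels within each period without being modelled.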
A second approach explicitly takes into account secular trends by producing an intervention effect adjusted for time trends, which are also estimated. This method compares outcomes corresponding to the control and intervention conditions within the periods between successive crossover points as well as between these periods in the same clusters, and maximises efficiency [4, 13]. This comparison includes, along with the vertical comparison, a controlled before-after comparison, sometimes referred to as a ‘horizontal comparison’, that is not, strictly speaking, a randomised comparison. The validity of this horizontal comparison therefore requires that the secular trend of the outcome be accounted for in each cluster. Secular trends may arise from changes in the level of the outcome in the population and also from changes in the constituents of the sample in the trial, for example, from attrition from a closed cohort. Time trends are commonly entered into the model as fixed effects, often as factors simply reflecting the periods between crossover points, with the assumption that the trend is the same in all clusters. This assumption may not be correct: the trend may vary across clusters and also may change in form when clusters cross over to the intervention condition. In some cases, the secular trend can be described using a linear trend (or higher orders) so as to reduce the number of parameters to be estimated; however, a companion paper in this series found that the number of parameters estimated does not substantially affect the power [13]. Researchers sometimes include in the dependent variable outcome data that were collected while all clusters were allocated to either the control or the intervention condition, which will introduce before-after comparisons that are not controlled and could introduce bias if the analysis model is badly mis-specified. This design decision is discussed in Copas et al. [3].
Individual-level models can gain efficiency and appropriately reflect the level of uncertainty in the point estimate by accounting for the clustering in the data using random effects [4], generalised estimating equations (GEE) with a working correlation matrix (for example, exchangeable or autoregressive), or robust standard errors. Multiple levels of clustering (for example, wards within hospitals or repeated measures of the same individuals) can be taken into account with these methods [14]. Adjustment for individual and cluster-level covariates can be made.
The standard mixed model approach to estimating the intervention effect, as described by Hussey and Hughes and ignoring further covariates for adjustment [4], involves fitting a model of the form:
$$ {Y}_{ijk}={\beta}_0+{\beta}_j+{\beta}_{effect}{X}_{ij}+{u}_i+{\varepsilon}_{ijk} $$
where the outcome $Y_{ijk}$ is measured for individual $k$ at time $j$ within cluster $i$; $\beta_j$ and $\beta_{effect}$ are fixed effects for the $j$ time points (often the periods between successive crossover points) and the intervention effect, respectively; $X_{ij}$ is an indicator of whether cluster $i$ has been allocated to start the intervention condition by time $j$ (taking the value 0 if not and 1 if it has); $u_i$ is a cluster random effect with mean zero across clusters; and $\varepsilon_{ijk}$ is the individual-level residual error. The assumptions made by this model are not discussed in detail in Hussey and Hughes [4], but they can be assessed. These include the lack of any interaction between the intervention and either time or duration of intervention exposure, and an assumption of exchangeability: that any two individuals within a cluster are equally correlated, regardless of whether they are in the same or different exposure conditions and regardless of time. A key further assumption is that the effect of the intervention is common across clusters. An important implication of these assumptions (and of the inclusion of comparisons of different periods between successive crossovers in the same clusters) is that, unlike in the typical CRT, much information concerning the population intervention effect can be gained from a small number of clusters if these have a large number of participants [4]. However, if the effect of the intervention is assumed to be, but is not, common across clusters, then the estimate of the intervention effect from the mixed effect model may have spuriously high precision. In mixed model analyses, varying intervention effects across clusters need to be explicitly considered, whereas the GEE approach is robust to mis-specifying the correlation of measurements within clusters, so it is less important to consider whether the effect varies across clusters in a GEE analysis.
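The exchangeability assumption can be made concrete by writing out the correlation implied by the model above. As a sketch, assume that the cluster random effect $u_i$ and the residual $\varepsilon_{ijk}$ are independent, with variances written here as $\tau^2$ and $\sigma^2$ (these two symbols are introduced for illustration and do not appear in the original). Then, for two different individuals $k \ne k'$ in the same cluster $i$, at any times $j$ and $j'$:

$$ \operatorname{Var}\left({Y}_{ijk}\right)={\tau}^2+{\sigma}^2,\qquad \operatorname{Cov}\left({Y}_{ijk},{Y}_{ij^{\prime }{k}^{\prime }}\right)={\tau}^2,\qquad \rho =\frac{\tau^2}{\tau^2+{\sigma}^2} $$

The correlation $\rho$ does not depend on $j$, $j'$, or the exposure indicators, which is exactly the statement that any two individuals in a cluster are equally correlated regardless of time or condition.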
Lag in the intervention effect
Many interventions delivered at the cluster level will have a delay between the time when a cluster is allocated to start the intervention, and when changes in the outcome are likely to happen. This time is referred to here as the ‘lag-period’ and can be considered similar to a short-term ‘carry-over’ seen in one-way crossover trials [15]. In an SWT, lags may be due to training or installation time or because there is a lag in outcome response (for example, the delay in disease response to intervention). Although a lag in changes to the outcome in the intervention condition of a parallel trial may occur, it can be addressed by delaying measurement of outcomes, in both conditions, until after the lag-period is over. In SWTs, this is not so simple because the time between crossover points may not be long enough to avoid collecting data during the lag-period.
To account for hypothesised lags, investigators may consider including a fractional term for the intervention (that is, one ranging from 0 to 1) to reflect the time to reach full fidelity [16]. Alternatively, lags could be accounted for by excluding observations during the lag-period (similar to the ‘wash-out’ period in cross-over trials [17]), or by shifting the crossover point so as to correspond with the end of the lag-period and assigning outcomes during the lag-period as corresponding to the control condition. Decisions about how to account for lags should be pre-specified so that they can be interpreted as ‘intention-to-treat’ analyses [18], as opposed to commonly conducted ‘on-treatment’ analyses, where being ‘on treatment’ is determined post hoc [19].
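The three ways of handling a lag described above differ only in how the exposure value is coded for observations that fall inside the lag-period. The sketch below makes that explicit. The function name `code_exposure`, the strategy labels, and the linear ramp used for the fractional term are assumptions for illustration; the source does not specify the shape of the fractional term.

```python
def code_exposure(time_since_crossover, lag_length, strategy):
    """Exposure value for one observation under a pre-specified lag strategy.

    time_since_crossover: time elapsed since the cluster's allocated crossover
        point (negative before crossover).
    lag_length: hypothesised length of the lag-period, in the same units.
    strategy: 'fractional', 'exclude', or 'shift' (hypothetical labels).
    Returns 0.0 to 1.0, or None when the observation is to be excluded.
    """
    if strategy not in ("fractional", "exclude", "shift"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if time_since_crossover < 0:
        return 0.0  # still in the control condition
    if time_since_crossover >= lag_length:
        return 1.0  # lag-period over: full intervention exposure
    # From here on the observation falls inside the lag-period.
    if strategy == "fractional":
        return time_since_crossover / lag_length  # linear ramp (an assumption)
    if strategy == "exclude":
        return None  # drop the observation, as in a wash-out period
    return 0.0  # 'shift': treat the lag-period as control time


for strategy in ("fractional", "exclude", "shift"):
    print(strategy, [code_exposure(t, 2, strategy) for t in (-1, 0, 1, 2, 3)])
```

Fixing `lag_length` and `strategy` in the analysis plan, rather than choosing them after seeing the data, is what keeps the resulting analysis interpretable as intention-to-treat.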
Ensuring fidelity of the intervention over time may be more challenging for an SWT than a CRT because many SWTs are conducted due to limitations in the capacity of the implementation set-up staff [20] and take place over long periods of time [1, 2]. Loss of fidelity may arise from the turnover of staff, degradation of equipment, or from an acquired ‘resistance’ to the intervention, for example, as would be expected with a behaviour-change advertisement campaign. This could be assessed analytically with an interaction between time since crossing over to intervention and the intervention effect (although this will have low power to detect a difference) or graphically.
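One way to write such an interaction, offered as a sketch rather than as the specification used in any reviewed trial, is to extend the Hussey and Hughes model with a term for time since crossover. Here $s_i$ (the first time point at which cluster $i$ is in the intervention condition) and $\gamma$ (the change in the intervention effect per time point since crossover) are symbols introduced for illustration, and the linear form of the change is an assumption:

$$ {Y}_{ijk}={\beta}_0+{\beta}_j+{\beta}_{effect}{X}_{ij}+\gamma {X}_{ij}\left(j-{s}_i\right)+{u}_i+{\varepsilon}_{ijk} $$

Setting $\gamma = 0$ recovers the model with a constant intervention effect, so a test of $\gamma = 0$ is a test for change in the effect over time since crossover, and a negative $\gamma$ would be consistent with loss of fidelity.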
Unlike for CRTs, no clear framework exists to guide when and how particular methods should be used that account for the challenging characteristics of SWTs described above. We therefore reviewed recently published SWTs to investigate the range of methods used by researchers to analyse and report these trials, appraise them, and make recommendations for future research.