Criteria for considering studies for this review
Types of studies
Randomised controlled trials (RCTs), including cross-over and cluster randomised trials, that examine the effects of exercise interventions in adults with knee or hip OA in any setting will be included. A study will be considered an RCT if the authors explicitly state that it was randomised [22] or if randomisation cannot be ruled out. Quasi-randomised trials will be excluded.
Participants
People with symptomatic OA of the knee or hip joint, diagnosed clinically (e.g. American College of Rheumatology Criteria), radiographically (e.g. Kellgren Lawrence grading) or by other imaging (e.g. magnetic resonance imaging), will be included. There is no restriction on the severity or stage of the disease. Hence, studies involving patients from early to ‘end-stage’ OA, including participants with pain following joint replacement, will be considered.
Studies that use mixed disease categories from which the knee and/or hip OA subgroup cannot be identified, or that fail to specify the joint condition explicitly as OA, will be excluded. As an arbitrary threshold, more than 80 % of joint replacement patients must have had their surgery attributed to OA for the study to be included [23].
Interventions
Interventions that involve exercise prescribed in any form, such as strengthening exercise, aerobic exercise or mind-body exercise (e.g. tai-chi, yoga), will be eligible for inclusion. In instances where the term ‘exercise’ is not used by the investigators, any physical training that fulfils the basic characteristics of exercise (i.e. a structured programme which is repeated/practised on a regular basis, e.g. three times a week) will also be included. Single and combined interventions will be included, for example exercise ± adjunct treatment versus a non-exercise intervention such as education, manual therapy or electrotherapy. Additional file 2: Table S1 shows how the pairwise comparisons from a study will determine its eligibility.
Exercise will be classified based on the four core exercise types described by the American College of Sports Medicine (ACSM) [24]. An additional category of mind-body exercise will also be included in our study. The classifications are as follows:
(a) Strengthening/resistance exercise: exercise that aims to improve the muscle’s ability to exert force and involves applying resistance against a contracting muscle.
(b) Aerobic exercise: exercise that aims to improve cardiorespiratory fitness and involves repetitive movement of large muscle groups, performed at moderate to vigorous intensity for prolonged periods of time.
(c) Flexibility exercise: exercise that improves the ability to move a joint throughout its range of motion and includes various types of stretching exercises.
(d) Neuromotor exercise: exercise that improves motor skills such as balance and coordination.
(e) Mind-body exercise: exercise that combines physical exercise with meditation or mindfulness. The latter is defined as the intention to be aware and engaged in the present moment, i.e. keeping attention on one’s breath and movements without being disturbed by other issues [25]. It is a set of mindful movements with a primary purpose of relaxation. Prototypical examples include Tai Chi, Qi Gong and Yoga.
If the authors have not clearly identified the exercise element of interest, an intervention that includes more than one category of exercise will be categorised as mixed exercise. Other components (such as home-based versus class-based delivery) will be added to each category if necessary to flag the difference. Further classifications may be made as appropriate. SLG (a sports physician) and MH (a physiotherapist) will be involved in classifying the exercise interventions.
Comparators
All comparators which have been used in the exercise trials will be included. However, for the purpose of the network meta-analysis, we need to include a common comparator across different exercises. The common comparator is defined as the comparator which had been used by at least two trials for two different exercises. These could be a no treatment/attention control, a waiting list control, a leaflet/education control, a ‘sham’ exercise or an active control (another type of exercise or intervention).
Outcomes
The primary outcome will be pain. Secondary outcomes will include self-reported function, objectively measured function (physical tests) and QoL. Except for outcomes based on physical testing, a hierarchical selection of measurement scales following Fransen [26] and Regnaux [27] will be adopted if more than one scale is used for the same outcome. However, a lower ranked scale may be given priority over a higher ranked scale if (i) it is more comprehensively reported in the study or (ii) the direction of effect of the higher ranked scale is unclear.
a. Primary outcome: pain
The hierarchical selection, in descending order of preference, is as follows:
- Pain overall
- Pain on walking
- Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) pain subscale
- Pain on activities other than walking
- WOMAC global scale
- Lequesne osteoarthritis (OA) index global score
- Other algo-functional scale
- Patient’s global assessment
- Other outcomes
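The hierarchical selection above amounts to picking the first scale in the preference list that a study reports. The following is a minimal sketch of that rule only; the function name and the normalised scale labels are hypothetical, and the two override conditions (more comprehensive reporting, unclear direction of effect) are not modelled because they require reviewer judgement.

```python
# Preference order for pain scales, highest priority first (from the list above).
# The lower-case labels are illustrative; real extraction would need a mapping
# from each trial's own wording to these labels.
PAIN_HIERARCHY = [
    "pain overall",
    "pain on walking",
    "womac pain subscale",
    "pain on activities other than walking",
    "womac global scale",
    "lequesne oa index global score",
    "other algo-functional scale",
    "patient's global assessment",
    "other outcomes",
]


def select_scale(reported, hierarchy=PAIN_HIERARCHY):
    """Return the highest-ranked scale among those a study reports, or None."""
    reported_normalised = {name.strip().lower() for name in reported}
    for scale in hierarchy:
        if scale in reported_normalised:
            return scale
    return None
```

For example, a study reporting both the WOMAC pain subscale and pain on walking would contribute its pain-on-walking result under this rule.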
b. Secondary outcomes
(i) Functional outcomes: self-reported and objective outcomes
Hierarchical selection for self-reported measures, in descending order of preference, is as follows:
- Global disability score
- Walking disability
- WOMAC disability sub-score
- Composite disability scores other than WOMAC
- Disability other than walking
- WOMAC global scale
- Lequesne OA index global score
- Other functional scales
For objective performance-based outcomes, a different approach is used and may involve identifying the commonest physical test through ‘vote counting’, i.e. selecting the test that is reported by most studies. This is because no physical tests, either singly or in combination, have been deemed to be sufficiently investigated or robust to warrant any definitive recommendations for the assessment of physical function in hip and knee OA patients [28]. Alternatively, if the correlations between the tests are known, we may consider pooling the results and generating only one effect size estimate. In the event of any discrepancies, a consensus will be sought within the research team.
(ii) QoL
Hierarchical selection for QoL measures is as follows:
- General health of short form (SF)-36 or SF-12
- EuroQol (EQ)-5D
- World Health Organization quality of life
- Mental/physical component summary of SF-36
- Mental/physical component summary of SF-12
- Sickness impact profile
- Nottingham health profile
- Other quality of life scales
- Others
(iii) Others
If widely reported by authors, additional measurements such as structural, biomechanical or physiological parameters, including features on imaging (radiographs, magnetic resonance imaging, ultrasound), will also be included.
Study time points
There is no pre-specified study endpoint for eligibility. Outcomes at different time points during the follow-up period will be recorded, and the commonest point across all trials will be used as a primary outcome point. Additionally, outcome points during follow-up will be grouped into intervals (e.g. ≤1 month, 1–3 months, 3–6 months, 6–12 months, 12–24 months etc.) for a secondary analysis for the time-dependent effect.
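As an illustration of the interval grouping described above, the sketch below assigns a follow-up time in months to one of the example intervals. Two points are assumptions rather than part of the protocol: each upper bound is treated as inclusive (so exactly 3 months falls in 1–3 months), and anything beyond 24 months goes to a single open-ended bin standing in for the "etc.".

```python
import bisect

# Upper bounds (in months) of the example intervals given in the text.
UPPER_BOUNDS = [1, 3, 6, 12, 24]
# Labels for each interval; the final ">24 months" bin is an assumption.
LABELS = [
    "<=1 month",
    "1-3 months",
    "3-6 months",
    "6-12 months",
    "12-24 months",
    ">24 months",
]


def follow_up_interval(months):
    """Map a follow-up time in months to its interval label (upper bound inclusive)."""
    if months < 0:
        raise ValueError("follow-up time cannot be negative")
    # bisect_left returns the index of the first bound >= months,
    # which makes each upper bound inclusive.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, months)]
```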
Data collection
Study selection process
Preliminary screening will be by title and abstract. The full text of potentially eligible citations will be retrieved for final screening and data extraction. Study selection will be done by one reviewer, with a second reviewer periodically validating a random sample. A third reviewer will be involved if any discrepancy arises, i.e. if either reviewer is unsure or the two reviewers cannot agree.
Data extraction and management
A structured database has been created in Microsoft Access for data entry. Where a study is able to contribute two pairwise comparisons, data from four intervention groups will be extracted, and the study will be treated as two separate comparisons. Conversely, studies that have produced companion publications may be combined to maximise the yield of data.
When pairing of multiple groups cannot be performed satisfactorily (such as when there are three groups in a study), combining groups may be considered. If all intervention groups are eligible for inclusion, the exercise groups may be combined into a single group to be compared against a single control group. The reverse will be done if two control groups are eligible but only one exercise group is available to contribute to the comparison. The standard deviation for the composite group will be calculated as described in the Cochrane Handbook [29] (Additional file 4). If there is no clinical or practical justification for combining the groups as described, the intervention or comparator that is deemed to be the simplest/commonest will be chosen instead. The decision to adopt either approach will be described and explained in the summary table.
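The group-combining step can be sketched with the standard Cochrane Handbook formula for merging two groups' sample sizes, means and standard deviations. This is a minimal illustration, not a reproduction of Additional file 4; the function name is hypothetical.

```python
import math


def combine_groups(n1, mean1, sd1, n2, mean2, sd2):
    """Combine two groups into one, returning (n, mean, sd).

    The combined SD includes both the within-group variability and the
    variability due to the difference between the two group means.
    """
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    within = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
    between = (n1 * n2 / n) * (mean1 - mean2) ** 2
    sd = math.sqrt((within + between) / (n - 1))
    return n, mean, sd
```

The result matches the mean and sample SD that would be obtained from the pooled raw data. For more than two groups, the function can be applied repeatedly, combining two groups at a time.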
The following data will be extracted.
i. Study level:
- Study identification details (i.e. title, author, year of publication)
- Number of study arms
- Number of patients
- Duration of study
- Population: hospital/healthcare centre, community
- Study design: parallel, crossover
ii. Participant level:
- Demographic: age, gender
- Associated risk factors: body mass index (BMI), history of injury
- Diagnostic criteria used for recruitment
- Composition of patients with knee and hip OA
- Disease severity
iii. Intervention level:
- Types of exercise: strengthening, aerobic, balance/skills, flexibility/ROM, mind-body or mixed
- Measurement of exercise adherence
- Setting: group/class versus individual, hospital versus home-based
- Composition of patients with hip and knee OA
- Number of patients randomised
- Patient demographics by group
- Comparator types
- Baseline measurements
iv. Outcome level:
- Types of outcome and measurement tool used
- Whether results are adjusted
- Intention-to-treat (ITT) analysis
- Number of patients analysed/attrition at various stages of study
- Types of effect size and relevant statistics (standard deviation, confidence interval etc.)
- Primary and all measurement time points
An abridged data extraction form (Additional file 5) will be used for data validation. The data extraction will be performed by one reviewer while the second reviewer will independently perform validation. Discrepancies in data will be resolved through discussion between the two reviewers. A third reviewer may also be involved if needed.
Missing data
If the required data cannot be extrapolated from the published article, attempts will be made to contact the authors for additional information. Where this is not possible, we will use statistical imputations as appropriate. Details of imputations and assumptions used will be reported.
As a missing standard deviation (SD) is one of the most commonly encountered situations, the calculations for converting other forms of summary data to an SD are listed in Additional file 4. In studies where insufficient information is provided for the calculation, the SD may be substituted with the widest SD obtained from other eligible studies.
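Two conversions of this kind that are commonly needed are sketched below: recovering a group's SD from a standard error of the mean, and from a confidence interval around the mean. The contents of Additional file 4 are not reproduced here, so these are the standard large-sample formulas and should be read as an assumption about what is listed there; the function names are hypothetical. For small groups, the normal quantile `z` would need to be replaced by the appropriate t-distribution value.

```python
import math


def sd_from_se(se, n):
    """SD of a group from the standard error of its mean and its sample size."""
    return se * math.sqrt(n)


def sd_from_ci(lower, upper, n, z=1.96):
    """SD of a group from a confidence interval for its mean.

    z=1.96 corresponds to a 95 % interval under a normal approximation.
    """
    return math.sqrt(n) * (upper - lower) / (2 * z)


def widest_sd(sds_from_other_studies):
    """Fallback when no conversion is possible: the largest SD among eligible studies."""
    return max(sds_from_other_studies)
```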
Analysis
The analysis will proceed from meta-analysis (MA) to NMA and then to individual patient data (IPD) meta-analysis if the data are available. The MA aims to confirm/update the efficacy of different exercises, whereas the NMA aims to determine the relative efficacy between different types of exercises. The IPD meta-analysis aims to examine who responds better to a specific exercise, that is, to determine predictors of response. Data will be processed using software such as Microsoft Access, Excel, Stata and WinBUGS/SAS.
Unit of analysis
For calculation of effect size, the unit of analysis is the mean (and standard deviation) of each trial. Demographic features of participants who were randomised to each group at the commencement of studies will be described. For the summary of effect, independent pairwise comparisons will be the unit of analysis. That is, if a study has four independent arms which can be paired, the study will contribute two effect size estimates to the meta-analysis. If a study has only three independent arms, only one effect size will be synthesised for the summary effect estimate.
Results for different outcomes and at different time points will be reported separately. Any aggregation of data will be indicated and justified.
Measures of treatment effect
The effect size for continuous data (pain, functional outcome and QoL) will be based on the standardised mean difference (Cohen’s d). Whenever possible, the post-treatment mean score will be used for calculation of effect size. If this is not available, the mean change score will be used instead.
A correction factor will be applied to Cohen’s d to obtain the unbiased effect size (Hedges’ g). Unless these effect sizes are available by default in the statistical software, calculations will be based on the equations listed in Additional file 4. Point estimates of effect size will be reported with their 95 % confidence intervals for pairwise meta-analysis and 95 % credibility intervals for NMA.
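The step from Cohen’s d to Hedges’ g can be sketched as follows. The equations of Additional file 4 are not reproduced here; the sketch assumes the usual pooled-SD definition of d, the common approximation J = 1 − 3/(4·df − 1) for the small-sample correction, and a normal-approximation confidence interval. Function names are hypothetical.

```python
import math


def cohen_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardised mean difference using the pooled within-group SD."""
    pooled_sd = math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )
    return (mean_t - mean_c) / pooled_sd


def hedges_g(d, n_t, n_c, z=1.96):
    """Return (g, lower, upper): bias-corrected effect size with an approximate CI."""
    df = n_t + n_c - 2
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor (approximation)
    g = j * d
    # Large-sample variance of d, scaled by j**2 to give the variance of g.
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    se_g = j * math.sqrt(var_d)
    return g, g - z * se_g, g + z * se_g
```

With 20 participants per arm, the correction shrinks d by about 2 %; the adjustment matters most for the small trials that are common in this literature.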
An exercise that delivers the minimal clinically important difference (MCID) for the respective outcome of interest (e.g. 30 % improvement in visual analogue score from baseline, 20 % improvement in WOMAC function from baseline) will be deemed effective [30]. An exercise that delivers a predefined MCID compared to another (e.g. a difference of 1 point on a numerical rating scale) will be considered more efficacious.
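The percentage-based MCID check can be illustrated with a short calculation. This sketch assumes a scale on which lower scores are better (as with pain) and that the baseline score is positive; the function names and the example values are hypothetical.

```python
def percent_improvement(baseline, follow_up):
    """Percentage improvement from baseline on a scale where lower is better."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100 * (baseline - follow_up) / baseline


def meets_mcid(baseline, follow_up, threshold_percent):
    """True if the improvement from baseline reaches the MCID threshold."""
    return percent_improvement(baseline, follow_up) >= threshold_percent
```

For instance, a mean pain score falling from 60 to 40 is a 33 % improvement and would meet a 30 % threshold, whereas a fall from 60 to 45 (25 %) would not.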
Assessment of publication bias
Empirical methods to assess publication bias are not considered to be better than visual assessment of funnel plots [22]. Hence, we will use funnel plots to investigate publication bias, which can be detected by the presence of gaps in the plot. This scatter plot of study size against treatment effect will show a void in the lower left section of the graph if the assumption holds that studies with a small sample size and a non-significant effect size tend to go unpublished.
Assessment of heterogeneity
We will use the chi-squared (χ²) test to examine homogeneity, with the level of significance set at p < 0.1. To quantify the impact of heterogeneity on the pooled estimates, we will use the I² statistic. The value of I² indicates the magnitude of heterogeneity in the analysis and is interpreted as: 0–40 %, might not be important; 30–60 %, may represent moderate heterogeneity; 50–90 %, may represent substantial heterogeneity; and 75–100 %, considerable heterogeneity [29]. Assessment of similarity and consistency of the estimate will be performed to ensure that the direct and indirect estimates agree. If the direct and indirect evidence agree, the estimates will be pooled to increase the power of the point estimates.
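To make the heterogeneity statistics concrete, the sketch below computes Cochran’s Q (the χ² homogeneity statistic) from study effect sizes and their variances, and derives I² from it as (Q − df)/Q, truncated at zero. It assumes inverse-variance weights; the function names are hypothetical. The p-value for Q would be obtained by comparing it with a χ² distribution on df degrees of freedom, which is left to the statistical software.

```python
def cochran_q(effects, variances):
    """Return (Q, df) for a set of study effect sizes and their variances."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return q, len(effects) - 1


def i_squared(q, df):
    """Percentage of total variability attributable to heterogeneity."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100
```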
Risk of bias assessment
A modified Cochrane risk of bias assessment tool will be used to assess the quality of studies [31] (Additional file 6). This assessment tool considers the various sources of bias by examining the methods of randomisation, concealment, blinding and handling of missing data. The response for each criterion will be yes, no or unclear, following the predefined guidelines outlined in Additional file 6.
- Was the randomization procedure adequate?
- Were there more than 100 subjects in each treatment group?
- Was the treatment allocation adequately concealed?
- Were physicians blinded to the intervention?
- Were patients blinded to the intervention?
- Were outcome assessors blinded to the intervention?
- Was incomplete outcome data adequately assessed?
- Was intention-to-treat analysis used?
- Were the treatment and control group similar at baseline?
- Are all pre-specified outcomes of interest reported in the pre-specified way?
Wherever a published protocol for an included study is available, it will be used as a supplementary source of information to enhance the quality assessment of the trial. Again, the assessment will be performed by one reviewer, with a second reviewer performing random validation in a sample of the included studies. Quality criteria will be used for extended subgroup analysis.
Data synthesis
ITT and adjusted results will be extracted whenever possible. If final scores and change-from-baseline scores are both reported, preference will be given to the final score. For data that need to be extracted visually from graphs, the readings will be rounded to the nearest 0.5.
A random effects model will be used for all analyses. If there are discrepancies between the direct estimates from MA and the indirect estimates from NMA, i.e. if the 95 % confidence interval of the difference between the direct and indirect estimates does not include 0, we will consider the network inconsistent. Further analysis will be performed to identify the reason. However, the presence or absence of incoherence will not be determined solely by statistical methods, since these are themselves subject to statistical error. Type I error (falsely concluding that the direct and indirect evidence are inconsistent) can occur in a complex network because multiple tests have to be performed. Type II error (falsely concluding that the direct and indirect evidence are consistent) can occur when the dataset in the network is small [21]. Hence, the statistical significance of inconsistency will be weighed against its clinical significance.
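The inconsistency criterion described above can be sketched as a simple comparison of one direct and one indirect estimate of the same contrast. The sketch assumes the two estimates are statistically independent, so that their variances add, and uses a normal approximation for the 95 % interval; the function name and example values are hypothetical, and the protocol does not specify this exact calculation.

```python
import math


def inconsistency_check(direct, se_direct, indirect, se_indirect, z=1.96):
    """Return (difference, (lower, upper), inconsistent).

    'inconsistent' is True when the confidence interval of the difference
    between the direct and indirect estimates excludes 0.
    """
    diff = direct - indirect
    se_diff = math.sqrt(se_direct ** 2 + se_indirect ** 2)
    lower, upper = diff - z * se_diff, diff + z * se_diff
    inconsistent = lower > 0 or upper < 0
    return diff, (lower, upper), inconsistent
```

A flag from this check would only be a prompt for further investigation, in line with the caution above about Type I and Type II errors.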
Other than assessing the discrepancy of the estimates, the NMA will also be able to assess the strength and diversity of the treatment network. Comparisons that are supported by a large number of RCTs can be distinguished from those that are informed by only a small number of RCTs.
One major advantage of using a Bayesian approach in NMA is that it can produce a result for all comparisons in a connected network without the presence of a common comparator. On the other hand, different prior distributions can generate different results, and therefore a sensitivity analysis is always required. As prior knowledge on exercise efficacy is inconclusive, a non-informative prior will be used in our analysis. Posterior distributions of the model parameters will be used to present the results of the NMA.
In IPD predictive regression modelling, only clinically meaningful treatment and covariate interaction terms at study and patient level will be investigated [32]. If the sample size is deemed sufficient, different predictive models may be explored for different subgroups. To minimise loss of data, we will avoid dichotomising continuous predictors for analysis [33]; checks for linearity of the predictor-outcome relationship will be performed during modelling. If a continuous predictor needs to be converted to a categorical one, this will be performed prior to data analysis. Multiple imputation may be used to address missing data.
Subgroup analysis
Extended subgroup analysis will be performed based on the quality criteria of the study (sample size, blinding etc.), patient characteristics (age, BMI, gender etc.), disease (severity, joint involved etc.), exercise and comparator types. Other subgroup analyses will also be undertaken as appropriate.
Sensitivity analysis
Sensitivity analysis will be performed to ascertain whether the results are robust. Situations where sensitivity analysis will be indicated include (i) when imputation of missing data has been employed; (ii) when arbitrary decisions have been made with respect to study selection or subgroup data aggregation/segregation; and (iii) when outlying studies are suspected. Small-study effects will be specifically assessed, as smaller studies tend to report larger effect sizes and distort the summary estimates of meta-analysis [34].
Meta-regression will be used to assess the association of various covariates in the model. This includes adjusting for study level covariates such as mean age, gender ratio, sample size, allocation concealment, blinding, duration of treatment and type of exercise. Baseline severity of the disease such as pain score may be included as a covariate when the post-treatment score has been used for analysis in lieu of change score.