Background
Physical inactivity is considered a risk factor for many life-threatening diseases and is regarded as a major burden on public health; international and national guidelines therefore recommend that all adults engage in moderate to vigorous physical activity (MVPA) for at least 30 minutes per day[1-3]. Patients with osteoarthritis (OA) are less physically active than the general adult population, and fewer fulfill the recommendation of 30 minutes of MVPA per day[4,5]. Being physically active according to the recommended guidelines is beneficial in preserving function and reducing symptoms[6], and physical activity (PA) is recommended as a first-line treatment that should be offered to all individuals with hip or knee OA[7,8]. The efficacy and importance of PA and exercise for patients with OA of the lower limbs have been emphasized in several studies[9-12].
Valid and reliable methods for PA assessment are essential for studying its health effects. Frequency, duration and intensity are important factors when evaluating PA as a protective factor against OA progression and functional decline[13]. Numerous methods for assessing PA are available, and they can be categorized into three main groups: self-reported assessments (questionnaires, rating scales, diaries), activity monitors (accelerometers, pedometers, heart rate monitors) and direct assessment of energy expenditure (doubly labelled water, indirect calorimetry). Self-administered questionnaires, including the Physical Activity Scale for the Elderly (PASE), can potentially capture all types of activities and allow grading by intensity. They are widely used because they are inexpensive and easy to administer, and they are considered particularly useful in large epidemiological and longitudinal studies. However, questionnaires have obvious weaknesses with regard to recall and reporting bias. In contrast, accelerometers measure body acceleration and thereby quantify the amount and intensity of movement[14]. Accelerometers often serve as a comparator when the validity of questionnaires is evaluated, as they are expected to measure the same construct[15].
Although many self-administered questionnaires are available, evidence for their validity and reliability is limited[13]. PASE has been found to correlate significantly, in the expected directions, with physical performance, knee pain and knee functioning in patients with knee pain[6,16], and previous studies have reported correlation coefficients of 0.16, 0.43 and 0.49 when compared to an accelerometer in the general, elderly population[17-19]. However, the validity of PASE has not been evaluated in patients with hip OA by comparing it to an accelerometer. The purpose of this study was therefore to evaluate the construct validity and the test-retest reliability of the Norwegian version of the PASE in patients with hip OA.
Discussion
This is the first study to address the test-retest reliability and the construct validity of the PASE in patients with hip OA, and the first study to evaluate the validity of the Norwegian version of the PASE. It is also one of relatively few studies evaluating the construct validity of a self-administered instrument for assessing PA by comparing it to an accelerometer, a method for direct measurement of PA, in patients with OA[13].
In our study, 67% of patients with hip OA fulfilled the recommendation of at least 30 minutes of accumulated MVPA per day, but only 30% fulfilled the recommendation of at least 30 minutes of MVPA per day in blocks of at least 10 minutes. Nevertheless, a larger percentage of the hip OA patients fulfilled the recommendations than of the general Norwegian population. Only 20% of the general adult Norwegian population fulfill these recommendations, and a decline in the amount of PA was present after the age of 64 years. Mean counts per minute were 338 in the general population, compared to 370 in our study[28]. The patients in our study had high levels of PA when compared to other studies investigating levels of PA by accelerometers in OA patients[4,5,32]. Hirata et al.[32] found that women with hip OA were engaged in MVPA for 17 minutes per day, and only 14% met the recommendation of more than 30 minutes of accumulated MVPA per day[32]. For patients with knee OA, mean time spent on MVPA was 14-25 minutes per day[4,5] and 30% met the recommendations[4]. However, studies on PA levels in patients with knee OA may not be a valid comparison for the patients in our study. These previous studies[4,5,32] may have included patients with more progressive and severe OA than we did in our study, where patients with a Harris Hip Score below 60 points were excluded from participation. It is also important to stress that the hip OA patients in our study originally participated in an RCT where the importance of PA was emphasized through a patient education program, and this may have altered their PA levels. However, no change in total PASE score was found over the 16-month follow-up of the RCT[20]. In addition, selection bias is possible, i.e. patients with a more positive attitude to PA might have been more likely to participate, and the education level was high. Thirty-nine percent of the patients in our study had more than 12 years of education, compared to 28% in the general Norwegian population (http://www.ssb.no/utniv). The levels of PA found in this study may therefore not be representative of the hip OA population in general.
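The gap between the 67% and 30% figures follows from how MVPA minutes are counted. The sketch below illustrates the two counting rules on per-minute accelerometer counts. It is a simplified illustration only: the cut-point value, the function name and the example day are hypothetical, and blocks are taken as strictly uninterrupted runs, which may differ from the processing used in the study.

```python
from typing import List, Tuple


def mvpa_minutes(counts_per_minute: List[int], cutpoint: int,
                 min_block: int = 10) -> Tuple[int, int]:
    """Return (accumulated MVPA minutes, MVPA minutes in blocks >= min_block).

    A minute counts as MVPA when its count is at or above `cutpoint`.
    A block is a run of consecutive MVPA minutes with no interruption
    (a simplifying assumption).
    """
    accumulated = 0
    in_blocks = 0
    run = 0  # length of the current run of consecutive MVPA minutes
    for count in counts_per_minute:
        if count >= cutpoint:
            accumulated += 1
            run += 1
        else:
            if run >= min_block:
                in_blocks += run
            run = 0
    if run >= min_block:  # close a run that lasts until the end of the record
        in_blocks += run
    return accumulated, in_blocks


# Hypothetical cut-point and day: three 5-minute spells plus one 15-minute spell.
ILLUSTRATIVE_CUTPOINT = 2000
day = ([2500] * 5 + [100] * 20) * 3 + [2500] * 15 + [100] * 30
accumulated, in_blocks = mvpa_minutes(day, ILLUSTRATIVE_CUTPOINT)
print(accumulated, in_blocks)  # 30 15
```

In this made-up day the person reaches 30 accumulated minutes but only 15 minutes in blocks of at least 10 minutes, so the accumulated recommendation is met while the block-based one is not.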
PA has also been estimated in a representative sample of elderly Norwegians using the PASE[26]. The mean total PASE score was 127, quite consistent with the findings in our study on hip OA patients, where the total PASE score was 143 and 125 at test and retest, respectively.
Measurement properties of an instrument are related to the population and context in which it is being used. In this study we evaluated the test-retest reliability of the PASE in patients with hip OA by calculating the ICC2.1 and by estimating the standard error of measurement (SEM) and the minimal detectable change (MDC). There is no absolute consensus regarding limits for what should be considered an acceptable ICC value. For the evaluation of instruments assessing PA, Terwee et al.[13,15] and Forsèn et al.[33] have suggested, and used, 0.70 as a cut-off for acceptable test-retest reliability. On this basis the test-retest reliability for the total PASE score was considered acceptable, with an ICC2.1 of 0.77. However, Terwee et al.[34] also suggested that the lower limit of the 95% CI of the ICC should exceed 0.60, and for the total PASE score the lower limit of the 95% CI was slightly below this, at 0.56. The Norwegian version of PASE has previously been found to have acceptable reliability when tested in the general, elderly population, with an internal consistency of items (Cronbach's alpha) of 0.73 and a test-retest reliability coefficient (Pearson's) of 0.93-0.99[26].
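For readers unfamiliar with the coefficient, the sketch below shows one way an ICC of this type can be computed from a subjects-by-occasions table. It assumes that ICC2.1 denotes the two-way random-effects, absolute-agreement, single-measurement form (often written ICC(2,1)); the function name and the example scores are illustrative and are not the study data.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) from a table where scores[i][j] is subject i on occasion j.

    Uses the two-way ANOVA mean squares: between subjects (msr),
    between occasions (msc) and residual (mse).
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of occasions (2 for test-retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Illustrative scores only: every subject scores one point higher at retest.
shifted = [[1, 2], [2, 3], [3, 4]]
print(icc_2_1(shifted))
```

Because this form measures absolute agreement, a uniform shift between occasions lowers the coefficient even when subjects keep exactly the same rank order, which is why a systematic change between test and retest matters for the interpretation.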
The SEM and MDC of the total PASE score were 31 and 87, respectively, indicating that 87 represents the smallest within-person change in score that can be interpreted as a real change, exceeding measurement error. However, a change exceeding the measurement error is not necessarily clinically relevant; clinical relevance can be evaluated by estimating the Minimal Clinically Important Difference (MCID). It is advised that the MCID is estimated by using an anchor-based approach[35-37]. However, distribution-based approaches for estimating the MCID have also been proposed, and the MCID has been found to equal approximately 0.5 SD at baseline[38] or approximately one SEM[39]. To be able to distinguish important changes from measurement error and to measure changes over time, the MCID should exceed the MDC[15], but by as small a margin as possible. The limits of agreement (LoA) indicate that if a subject completes a questionnaire twice, the second score could be as much as these limits smaller or larger than the first score, due to measurement error. Thus, the MCID should also lie outside the LoA[15]. Despite an acceptable test-retest ICC of the total PASE score, we consider the reliability to be moderate, due to the large measurement error and wide LoA when compared to the mean total PASE score.
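The relation between the SEM and the MDC can be checked with a short calculation. The sketch below assumes two commonly used formulas, SEM = SD × sqrt(1 − ICC) and MDC at the 95% level = 1.96 × sqrt(2) × SEM; this section does not state which formulas were applied, and the function names are illustrative.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from a between-subject SD and an ICC."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem: float) -> float:
    """Minimal detectable change at the 95% confidence level.

    sqrt(2) reflects that a change score combines the error of two measurements.
    """
    return 1.96 * math.sqrt(2.0) * sem


# Using the rounded SEM of 31 reported for the total PASE score.
print(round(mdc95(31), 1))  # 85.9
```

With the rounded SEM of 31 the result is about 86, close to the reported MDC of 87. The small gap would be consistent with the SEM having been rounded before reporting, although that is an assumption.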
In our study, a significant decline in total PASE score of 18 points was present from test to retest, indicating a systematic error. We may therefore question whether the situation or the subjects actually were stable. When systematic error is present, it is often attributed to a learning effect. However, this is unlikely when the instrument of interest is a self-administered questionnaire. A more plausible explanation may be that wearing the Actigraph GT1M encouraged the patients to increase their activity levels during the week the PASE referred to. According to Reiser and Schlenk[40], direct observation of PA by accelerometry may modify the pattern and level of PA among the participants, and may therefore bias the results.
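A systematic difference between test and retest, and the limits of agreement around it, can be summarized as in the sketch below. It assumes the usual Bland-Altman definition (mean difference ± 1.96 × SD of the differences); the function name and the scores are illustrative and are not the study data.

```python
import math
from typing import Sequence, Tuple


def bland_altman(test: Sequence[float],
                 retest: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean difference, lower LoA, upper LoA) for retest minus test."""
    diffs = [b - a for a, b in zip(test, retest)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample standard deviation of the paired differences.
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Illustrative scores only: every subject scores lower at retest.
test_scores = [100, 120, 140, 160]
retest_scores = [90, 100, 130, 140]
mean_diff, lower, upper = bland_altman(test_scores, retest_scores)
print(mean_diff, lower, upper)
```

A mean difference clearly different from zero, as in this made-up example, points to a systematic shift between occasions rather than random measurement error alone.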
Furthermore, this study evaluated the construct validity of the PASE by comparing it to an accelerometer, the Actigraph GT1M, and to another PA questionnaire, the IPAQ. As proposed by Terwee et al.[30], we tested predefined specific hypotheses including the expected direction and magnitude of correlations. In this study we found no significant correlation between the total PASE score and the Actigraph GT1M mean total counts per minute. The correlation coefficient was 0.30, in line with our a priori hypothesis. It was comparable to previous studies investigating the correlation between PASE and accelerometers in different populations, where correlations ranging from 0.16 to 0.52 have been reported[17-19,41,42]. The correlation did not reach the cut-off for what we considered a satisfactory correlation, above 0.50, as suggested by Terwee et al.[15]. Whereas self-report PA questionnaires have been found to over-report levels of PA compared to accelerometers[43,44], Leenders et al.[45] found that accelerometers significantly underestimated PA-related energy expenditure when compared to the doubly labelled water method. This may be due to some limitations of accelerometers. They can only provide measurements for the period during which activity is recorded, cannot measure water exercises, and fail to measure activities such as cycling and upper limb exercise correctly. Overestimation of total PA levels when using questionnaires and underestimation when using accelerometers may to some degree explain the discrepancy between the two methods for measuring PA.
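That a coefficient of 0.30 is not statistically significant in a sample of this size can be verified with the usual t test for a correlation coefficient. The sketch below assumes n = 33 (the number of patients with accelerometer data reported under the limitations) and a two-sided 5% level; the section does not state which type of correlation coefficient was used, and the critical value is taken from a standard t table.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing a correlation of r against zero, with n - 2 df."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Two-sided 5% critical value of Student's t with 31 degrees of freedom.
T_CRIT_DF31 = 2.040

t_value = correlation_t_statistic(0.30, 33)
print(round(t_value, 2), t_value > T_CRIT_DF31)  # 1.75 False
```

Under these assumptions the statistic stays below the critical value, which agrees with the absence of a significant correlation reported above.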
The correlation between the total PASE score and IPAQ MET-minutes per week was moderate, at 0.61, and barely within our a priori hypothesis of a correlation between 0.6 and 0.9. Both PASE and IPAQ are self-administered with a seven-day recall period, but household and work activities are included in the PASE and weighted quite highly, whereas the IPAQ mainly captures leisure-time PA. This may, at least partly, explain the discrepancy between the two questionnaires. Both questionnaires were originally developed for use in a general population (generic), with PASE being specifically designed for an elderly population.
The PASE is not designed to measure and report different PA intensity levels separately. One might therefore argue that acceptable test-retest reliability for the overall score is what matters. However, assessment of intensity seems valuable when investigating the effect of exercise and PA, especially for evaluating the dose-response relationship and for establishing recommendations for patients with OA regarding amount and intensity. We therefore wanted to evaluate these specific items, to assess whether a PA questionnaire is able to provide reliable and valid data for PA intensity. The ICC2.1 for the sub-scores for household/work-related PA and for leisure-time PA was 0.69 and 0.53, respectively, and the ICC2.1 for the items for light, moderate and vigorous PA intensity was 0.46, 0.20 and 0.68, respectively. None of the ICCs for the sub-scores or the single item scores exceeded 0.7, which we interpreted as the cut-off for acceptable reliability, and the 95% CIs were wide for all the sub-scores and items. The SEM and the MDC were also large compared to the mean values of the sub-scores and items, indicating moderate to low reliability.
Our a priori hypothesis, that each intensity category of the PASE would correlate most strongly with the corresponding intensity category of the Actigraph GT1M, was confirmed for moderate PA intensity and vigorous PA intensity, but not for light PA intensity. However, all correlation coefficients were below 0.46. This indicated that the intensity items of the PASE were not able to distinguish between light, moderate and vigorous PA intensity, and we therefore consider the PASE not to be valid or reliable for assessing PA intensity. The PASE item for moderate PA intensity correlated more strongly with the IPAQ category for walking than with the IPAQ category for moderate PA intensity. This may be because the IPAQ includes a specific item for assessing walking activities, whereas walking activities are included in the items for light, moderate and vigorous PA intensity in the Norwegian version of PASE. Walking is a widespread leisure-time activity in Norway, and is likely to be scored in the PASE item for moderate PA intensity, giving a higher correlation with IPAQ walking than with IPAQ moderate PA intensity.
This study has some limitations. Both the analysis of test-retest reliability and the analysis of construct validity by comparing PASE to the Actigraph GT1M were based on data obtained from 33 patients. After consulting a statistician, and given that other studies have used similar sample sizes[19,33], we decided to include 40 patients in this study. According to the statistician, a sample size between 30 and 40 is usually sufficient when evaluating outcome measurements that use a continuous scale. According to Terwee et al.[15], the sample size in reliability and/or validity studies evaluating PA assessment tools should exceed 50. A recently developed scoring system for rating the methodological quality of studies on measurement properties suggests that a sample size of 100 should be considered excellent, 50 good, 30 fair and under 30 poor[46]. The correlation between PASE and IPAQ was based on data from only 25 patients. The Norwegian version of IPAQ has been validated for the Norwegian population, but it includes "don't know" as an option for duration of activity, which challenges the interpretation and the score calculations.
The use of the Actigraph GT1M and the IPAQ to evaluate construct validity has some weaknesses. The doubly labelled water method is often considered to be the gold standard for measuring PA[15], but is seldom used to evaluate the validity of PA questionnaires, as it is expensive, time-consuming and relies on access to both technical expertise and equipment. Only two studies have validated the PASE by comparing it to doubly labelled water, and they found correlation coefficients of 0.28[47] and 0.68[48]. However, the doubly labelled water method is affected by the basal metabolic rate, and it cannot capture frequency, duration and intensity of activity. Accelerometers may therefore represent a more appropriate comparator because they can provide information on amount, pattern and intensity of PA, and therefore seem to measure the same construct as most PA questionnaires[15]. There is evidence for reasonable correlation between waist-worn accelerometers and the doubly labelled water method in adults, with correlations ranging from 0.30 to 0.83[49]. IPAQ was also included as a comparator because it is a widely used PA questionnaire, but like other questionnaires it is vulnerable to recall and reporting bias. Previous studies comparing IPAQ and accelerometers/activity monitors have reported correlation coefficients between 0.29 and 0.35[50-52]. However, Ainsworth[53] states that questionnaires may be suitable for assessing PA for most patients. More sophisticated methods, like accelerometers, provide more precise measurements, but are less practical for use in clinical settings. Kayes and McPherson[54] emphasize that PA questionnaires and accelerometers both have weaknesses, but that both methods are likely to assess important aspects of the PA construct. Use of both tools may therefore be appropriate to capture all aspects of PA.
Authors' contributions
All authors participated in the design of the study, contributed to drafting the article, and read and approved the final manuscript. IS carried out the patient inclusion, handled the administration of questionnaires and accelerometers, and carried out the statistical analysis. EK carried out the processing of the Actigraph GT1M data.