Background
Physical activity (PA) has been linked to a great number of physical and mental health benefits [1]. For example, there is strong evidence that moderate-to-vigorous physical activity (MVPA) reduces the risks of type 2 diabetes, coronary heart disease, depression and all-cause mortality [2, 3]. To achieve substantial health benefits, adults should perform at least 150 min of moderate-to-vigorous intensity aerobic activities per week as well as muscle-strengthening activities on two or more days per week [4]. This evidence is often based upon large cohort studies or randomized controlled trials using self-reported PA. However, the conclusions drawn from these studies depend on the quality of the assessment of exposures and outcomes.
For a long time, PA was exclusively assessed by self-report measures (e.g., questionnaires, diaries) due to the lack of alternatives. Today, research can rely on device-based measures such as accelerometers. Since accelerometry is not always feasible in large epidemiological studies and because questionnaires still provide important information about type (e.g., walking, cycling) and domains of PA (e.g., home, leisure time), questionnaires remain popular to gather valuable information at low cost [5]. In fact, a large part of the evidence forming the basis of current PA guidelines is based on questionnaire data [1]. Until now, many questionnaires have been developed since no gold standard for the measurement of PA exists [6].
Despite the existence of many different questionnaires, no conclusive recommendations can be provided for the best questionnaires to assess PA in various populations due to the inconsistent results regarding their measurement properties [7, 8]. The results for construct validity are often unsatisfying. Average correlations with accelerometer data of r = 0.22 for moderate and r = 0.32 for vigorous PA were reported [8], and for total PA the coefficients ranged from 0.04 to 0.47 [9]. This means that there is at best only 25% shared variance between these two methods [10]. Moreover, it seems that the measurement quality of PA questionnaires has not improved considerably over the last decades, for example, when comparing newly developed with already existing versions [11].
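The step from a correlation coefficient to "shared variance" can be made explicit: for a Pearson-type correlation, the proportion of shared variance is the squared coefficient. The short sketch below applies this to the coefficients quoted above. It assumes that the reported values can be read as Pearson-type correlations; the function name `shared_variance` and the labels are illustrative only.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures with correlation r (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return r * r


# Coefficients quoted in the paragraph above; the labels are illustrative.
reported = {
    "moderate PA (average)": 0.22,
    "vigorous PA (average)": 0.32,
    "total PA (upper end of range)": 0.47,
}

for label, r in reported.items():
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.1%}")
```

Under this reading, r = 0.47 corresponds to about 22% shared variance, and a correlation of 0.50 would be needed to reach 25%, which is consistent with the "at best" wording.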
One shortcoming of a PA questionnaire is the reporting error associated with the recall period [12]. In a typical administration, a person is asked to recall and summarize all physical activities performed in a defined period (e.g., the past/usual week or past month). This means that a person should be able to correctly report frequency, duration and intensity of PA over the defined period. However, the person’s PA level is determined not only by the true amount of PA but also by the ability to recall all relevant activities. The accuracy of the recall may further be influenced by the type of activity. For example, the recall may be more difficult for sporadic, brief and low-intensity activities [13, 14].
However, evidence is accumulating that light PA also provides important health benefits such as reductions in mortality risks and improved cardiometabolic health, especially for inactive populations [15–17]. Moreover, current PA guidelines emphasize that incidental, intermittent activities (e.g., less than 10 consecutive minutes) also provide health benefits [1, 4]. Many existing questionnaires do not capture these brief activities [8]. A short recall period may be needed to capture all these relevant activities and, thus, help to improve the quality of PA measurement using questionnaires. The advantages of a short recall period have been highlighted previously. For example, Matthews et al. [14] noted that it would reduce the cognitive demands on participants because the recall would rely strongly on the recollection of behaviors using episodic memory. Hence, the authors recommended using multiple short-term recalls to obtain more accurate behavior-disease associations.
Although shorter recall periods (e.g., previous day) have the potential to limit reporting errors, they have typically been applied only in diaries or records such as the “Activities Completed Over Time in 24 Hours” (ACT24) [18]. These formats tend to show better agreement with device-based measures of PA (e.g., 0.48 ≤ r ≤ 0.60 for ACT24) but at the expense of a high burden for participants. Therefore, their feasibility in large studies, for example when using multiple measurements per week, is limited. Until now, only two PA questionnaires have used a recall period of one day, and both showed good agreement with accelerometers (e.g., r = 0.71 for the Danish Physical Activity Questionnaire and r = 0.74 for the Daily Activity Questionnaire) [19, 20].
Since these two questionnaires are either too long or developed for a population of patients (i.e., after total hip arthroplasty), the measurement properties of a short daily PA questionnaire in a non-patient population are unknown. A short version could also help to increase feasibility when using multiple measurements. Therefore, the aim of this study was to assess the construct validity of a short self-administered daily PA questionnaire (Physical Activity Questionnaire for 24 h [PAQ24]) within a sample of the healthy population, namely young active adults. For the design of the questionnaire, we modified the International Physical Activity Questionnaire - Short Form (IPAQ-SF) and reduced the recall period from one week to one day.
Regarding PA, we hypothesized that the use of a daily recall period would result in: i) satisfying relative agreement between the PAQ24 and other measures of the construct (established PA questionnaire, accelerometer); and ii) satisfying absolute agreement between the PAQ24 and these instruments. As recommended [8], we assumed correlations ≥ 0.50 between questionnaire and accelerometer and ≥ 0.70 between PA questionnaires as evidence for satisfying relative agreement. No thresholds for absolute agreement could be defined since there is no gold standard for the measurement of PA [6].
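The relative-agreement criteria can be illustrated with a small sketch that computes a rank correlation and compares it with the two a priori thresholds. Spearman's ρ is used here because the Discussion reports ρ. The helper names, the `THRESHOLDS` mapping and the example minutes are hypothetical and are not study data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences (must not be constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A priori thresholds stated in the hypotheses above.
THRESHOLDS = {"accelerometer": 0.50, "questionnaire": 0.70}


def meets_criterion(rho, comparator):
    """True if rho reaches the threshold for the given comparator type."""
    return rho >= THRESHOLDS[comparator]


# Hypothetical minutes of total PA for six participants (not study data).
paq24_minutes = [30, 45, 60, 90, 120, 150]
accelerometer_minutes = [25, 50, 40, 80, 100, 160]

rho = spearman_rho(paq24_minutes, accelerometer_minutes)
print(f"rho = {rho:.2f}, criterion met: {meets_criterion(rho, 'accelerometer')}")
```

A rank-based coefficient only reflects whether the two instruments order participants similarly, which is why absolute agreement is treated as a separate hypothesis.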
Discussion
The purpose of this study was to assess the construct validity of a short self-administered daily PA questionnaire (PAQ24) in young active adults. We expected that the short recall period would result in satisfying construct validity. However, the results of the study revealed inconclusive evidence for the construct validity of the PAQ24. Compared to accelerometry, the PAQ24 showed satisfying relative agreement (i.e., ρ ≥ 0.50) on five out of seven days when assessing Total PA. The relative agreements for the overall week (i.e., averages per day) were unsatisfying for all scores, including Total PA. Similar moderate, but not satisfying (ρ < 0.70), agreements were observed when comparing scores of PAQ24 and GPAQ. Furthermore, absolute agreements for both daily and weekly scores were poor because of wide limits of agreement (LOA). Additional analyses using different scores of the PAQ24 or accelerometer intensity cut points resulted in similar or lower agreements.
PA reported in the PAQ24 varied from day to day with the highest minutes on Saturday and the lowest on Sunday. This variation may be influenced by daily differences in the amount of leisure time, participation in sport events or convenience of scheduling [44]. For daily PA assessments, it is important to consider this variation in PA [45]. In general, the use of multiple short-term measurements should increase the ability of the instrument to distinguish between true variation in PA and other sources of error [46]. However, depending on the data collection method, study population and choice of PA score under investigation, the number of measurements needed to capture this true variation in PA may vary. For example, three to five days of accelerometer monitoring might be needed to obtain accurate levels of PA in adults [45, 47] but more days are required for the assessment of Inactivity [48], when using self-report methods [46], or in specific populations such as children [49]. The impact of the variance in PA behaviors on reliability and the minimum number of measurements when using daily PA questionnaires such as the PAQ24 must be evaluated in future studies. This would also help in drawing conclusions about the feasibility of such a questionnaire.
Overall, the daily results for Total PA were comparable to more sophisticated 24 h diaries and recalls [8, 11, 18, 50]. The relative agreement of weekly Total PA between PAQ24 and accelerometer did not meet our criterion for satisfying validity but was in the upper range of results when using previous questionnaires [8, 11, 51]. For example, a systematic review of 23 studies on the construct validity of the IPAQ-SF reported correlations ranging from 0.09 to 0.39 for total PA when compared to device-based measures of PA [51]. Absolute agreement of weekly PA scores of the PAQ24 was rather poor. We observed smaller LOA for weekly compared to daily scores due to reduced random error when using averages of multiple measurements. Moreover, increases in Total PA and VPA were associated with changes in the observed difference between PAQ24 and accelerometer (e.g., shifted from under- to over-reporting with increasing PA scores).
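Why averaging over several days narrows the LOA can be sketched with a small simulation. The code uses the conventional Bland-Altman definition (mean difference ± 1.96 SD of the paired differences). All simulated quantities (sample size, error magnitudes, a constant true PA level per person) are invented for illustration and do not reproduce the study's data or its exact analysis.

```python
import random
from statistics import mean, stdev


def limits_of_agreement(questionnaire, accelerometer, z=1.96):
    """Return (lower LOA, mean difference, upper LOA) for paired values."""
    diffs = [q - a for q, a in zip(questionnaire, accelerometer)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    return bias - z * sd, bias, bias + z * sd


def simulate(n_people=200, n_days=7, seed=0):
    """Simulate single-day and week-averaged scores (hypothetical values)."""
    rng = random.Random(seed)
    day_q, day_a, week_q, week_a = [], [], [], []
    for _ in range(n_people):
        true_pa = rng.uniform(30, 120)   # assumed true minutes per day
        person_bias = rng.gauss(0, 10)   # stable per-person reporting error
        acc = [true_pa + rng.gauss(0, 5) for _ in range(n_days)]
        rep = [true_pa + person_bias + rng.gauss(0, 30) for _ in range(n_days)]
        day_q.append(rep[0])             # a single day of data
        day_a.append(acc[0])
        week_q.append(mean(rep))         # average per day over the week
        week_a.append(mean(acc))
    return (day_q, day_a), (week_q, week_a)


daily, weekly = simulate()
d_lo, d_bias, d_hi = limits_of_agreement(*daily)
w_lo, w_bias, w_hi = limits_of_agreement(*weekly)
daily_width = d_hi - d_lo
weekly_width = w_hi - w_lo
print(f"daily LOA width:  {daily_width:.1f} min")
print(f"weekly LOA width: {weekly_width:.1f} min")
```

In this sketch only the day-to-day random error shrinks with averaging. The stable per-person error remains, so the weekly LOA become narrower but do not approach zero.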
The lack of agreement between PAQ24 and accelerometer may be attributable to differences in individual characteristics and the measurement quality of both methods. For example, it has been shown that brief, unstructured or low-intensity activities are difficult to recall [13, 14] and that the level of agreement varies depending on factors such as age, weight status or accelerometer data processing [52]. Our participants perceived difficulties with classifying the intensity of activities and with reporting the total volume of walking time. On the other hand, accelerations during activities such as cycling or resistance training may not be accurately captured by device-based measurement [24]. However, no improvements in the agreement were observed after excluding these activities. We also observed daily variation in the agreements, which could be influenced by both random and systematic error. For example, activities that are poorly detected by the accelerometer may have been performed on specific days. Likewise, some days may include more structured activities and events (e.g., exercise sessions, competitions), which are easier to recall [13].
Neither questionnaires nor accelerometers are perfect tools to measure PA. This lack of a real gold standard has been acknowledged by several researchers [6, 53]. In addition to the disadvantage of reporting errors, a further limitation is that questionnaires are always developed for a specific population (e.g., elderly, adults, youth, pregnancy) and the identification of the most qualified ones is difficult [7, 8, 54, 55]. Moreover, the interpretation of questions in the questionnaire (e.g., intensity description) is influenced by characteristics of the participant such as perceived confidence [56] and origin (e.g., different countries and cultures need cross-cultural adaptations) [57]. These individual characteristics can limit the measurement quality and may result in an under- or overestimation of self-reported PA. On the other hand, PA data derived from accelerometry are influenced by several decisions of the researcher. Depending on brand [58], body placement [59, 60] and sampling frequency [47], the data used for subsequent analysis are already affected by the researchers’ prior choices. Also, several other decisions (e.g., intensity cut points, epoch length, filters, algorithms to detect non-wear, requirements for a valid day/week) have been shown to influence the PA estimates from the accelerometer [47, 61]. The current lack of consensus on best practices to handle accelerometer data hampers the quality of the assessment of measurement properties of PA questionnaires, since accelerometers are often considered the “reference” measure.
The agreements for the overall week between PAQ24 and accelerometer were lower than what we would assume based on the pattern in the daily results. This may be due to a stronger influence of different systematic (e.g., additive and multiplicative) errors [62]. For example, consistent under- or over-reporting of PA can influence the estimation of the mean, the dispersion or the participants’ ranking order when considering averages per day. Such a reporting bias may exist for only some participants (differential recall bias) and can either increase or decrease the level of agreement [63]. Furthermore, our results showed that the true level of PA was also related to the level of agreement, namely by changes in over- and under-reporting with different PA levels (see results from Bland-Altman analysis). These influences on the repeatability of the tools could have reduced the weekly compared to daily relative agreements, even if one instrument were free of error [64].
The results also demonstrated poor absolute agreements and only moderate correlations between PAQ24 and GPAQ. Neither Total PA nor VPA met our criterion for satisfying construct validity. However, similar correlations were reported in previous studies when comparing forms of the IPAQ with the GPAQ [65, 66]. This lower agreement can be influenced by differences in the questionnaire format. The PAQ24 includes separate questions for cycling, walking, swimming and resistance training, whereas the GPAQ combines these activities into fewer questions and obtains information using different domains of PA [28]. The use of different recall periods (e.g., “typical week” in the GPAQ) could also have reduced the level of agreement [67].
ST of the PAQ24 was strongly related to ST of the GPAQ but less to Inactivity from the accelerometer. Also, Bland-Altman analyses indicated poor absolute agreements for daily and weekly ST, which seems to be in line with previous results that usually showed an under-reporting of ST compared to the accelerometer [52]. This poor agreement may be partly explained by difficulties in reporting ST, as mentioned by some participants, and the lower accuracy of wrist-worn accelerometers (without further use of inclinometers) in differentiating between non-movement positions such as lying, sitting or standing [68]. However, the results of weekly ST are comparable to previous questionnaires [11]. Finally, PA was not associated with aerobic capacity (see Additional file 5), which may be due to the non-overlapping parts of the two concepts [69], the variability in PA [46] and the homogeneity of the sample regarding their usually high fitness levels.
The results of the present study must be interpreted with respect to our specific sample, since the participants were highly active and trained students. The participants had an affinity for sports and exercise and should therefore be able to estimate the intensity of PA better than a sample with a different background. Many participants were members of a sports club with fixed weekly training days and were registered for obligatory university exercise courses. Taking this into account, it might have been easier for them to recall PA, compared to the general population. This strongly limits the generalizability of our results. Future studies are therefore needed to evaluate the PAQ24 and other promising daily PA questionnaires in representative samples of the general population.
Finally, we tried to improve the measurement properties of PA questionnaires by using a short recall period. Although we modified an existing questionnaire for our purposes, we do not recommend using the PAQ24 to measure PA in other studies. As early as 2000, Sallis and Saelens [5] recognized the existence of too many different questionnaires and recommended using only the most qualified ones for future research. Therefore, we strongly recommend using an existing questionnaire whenever possible. The choice of the questionnaire should follow the purpose of the study and the evaluation of measurement properties (e.g., content validity, reliability, construct validity, responsiveness). Several reviews on measurement properties of PA questionnaires have been published [7, 8, 11, 53, 70] and may help in the selection of the most qualified questionnaires. However, we invite researchers to use our questionnaire in future validation studies to further improve the measurement quality of PA questionnaires. For example, using smartphone applications for the daily assessment may increase feasibility. Future studies should also evaluate the measurement errors associated with multiple measurements as well as the minimum required days of monitoring. Overall, we, together with others [14], argue that multiple short-term recalls are a promising approach to overcome important shortcomings of traditional PA questionnaires.
Strengths and limitations
First, the specific study population (young active adults) limits the generalizability of the findings. Second, although participants were instructed to complete the PAQ24 before going to bed, we did not assess whether they were still awake and active after they completed the questionnaire. Third, results for absolute agreement showed a strong dependence on accelerometer intensity cut points, which should be considered when interpreting the results. This dependence seems reasonable when using lower or higher cut points, and it affected the mean difference rather than the magnitude and variation of differences (i.e., LOA). Finally, this study did not assess the effect of a short recall period using an experimental design comparing it with a recall period of a week. On the other hand, the study has several strengths: i) the use of raw accelerometry to increase transparency and comparability between studies; ii) reporting the influence of different accelerometer intensity cut points on the results; iii) the use of guidelines regarding the validation of PA questionnaires (e.g., specifying a priori hypotheses) [7, 8, 71]; and iv) data collection within a short period, which reduces the influences caused by changes in weather, seasons or types of activities.