Background
Activity trackers are developed to increase an individual’s awareness of physical activity behavior throughout the day. It is well known that regular physical activity decreases the risk of many chronic diseases and can improve quality of life [1–3]. A commonly used physical activity guideline is the 10,000 steps/day norm: healthy adults are recommended to take 10,000 steps per day to maintain physical fitness and health [4]. However, many people worldwide are not aware of whether they comply with this recommendation [1]. In addition, previous research has indicated that most people tend to overestimate their level of physical activity [5, 6]. Activity trackers may help to overcome this issue.
Over the past five to ten years, an increasing number and variety of activity trackers have become available on the consumer market. Activity trackers are small, user-friendly devices that measure the number of steps taken and/or the amount of time spent performing physical activities at different intensities. Most activity trackers also use algorithms to convert the number of steps into measures such as the distance covered and the number of calories burned. Associated (mobile) applications provide users with insight into their individual physical activity behavior over a certain period of time. This might work as a motivator to increase physical activity [7, 8]. Consumer activity trackers might also be beneficial for scientific research because of their ease of use and relatively low cost. Examples of popular devices are the Fitbit, Jawbone Up, and Withings Pulse.
For accurate measurement and interpretation of the data, these devices must be reliable and valid. A number of studies have examined consumer tracker accuracy [6, 9–18]; however, six studies were based upon earlier versions of Fitbit devices, and the methodology for assessing reliability and validity varied considerably. For example, different types of activity were used (walking on a treadmill at different speeds, lab cycling, walking stairs, daily activities), and different gold standards were utilized (energy expenditure [EE] measured by breath-to-breath analysis, self-reported physical activity translated to EE [in METs], and real step count). Five studies were performed under laboratory conditions [9–11, 14, 16], and six studies examined the reliability or validity of activity trackers during (semi-structured) free-living conditions [6, 12, 13, 15, 17, 18]. The validity of activity trackers may differ between free-living conditions and standardized lab conditions because walking speeds, directions, intensities, etc. vary more in free-living conditions. To date, no studies have assessed the reliability and validity of consumer trackers in both laboratory and free-living conditions. The aim of this study was to determine the reliability and validity of ten consumer activity trackers in both a standardized laboratory condition and in free-living conditions.
Discussion
Ten popular consumer activity trackers were tested for their reliability and validity for measuring step count. Seven out of ten trackers were reliable (Lumoback, Fitbit Flex, Jawbone Up, Misfit Shine, Withings Pulse, Fitbit Zip, and Digiwalker), and five of these trackers also demonstrated high validity in laboratory conditions (Lumoback, Jawbone Up, Misfit Shine, Withings Pulse, and Fitbit Zip). The Moves app and Nike+ Fuelband exhibited low reliability and low validity in laboratory conditions. In free-living conditions, the Fitbit Zip showed the highest validity and the Nike+ Fuelband showed low validity.
The validity of the ten activity trackers in laboratory conditions was examined with three methods, the first of which was to assess systematic differences. According to Tudor-Locke et al. [23], activity monitors should not exceed a 1 % error deviation (MAPE) from the gold standard during walking on a treadmill at a speed of 3 mph (4.8 km/h) in order to be considered accurate. In the controlled lab condition, five trackers met this criterion: the Lumoback, Jawbone Up, Misfit Shine, Withings Pulse, and Fitbit Zip. The Digiwalker and Omron had an error deviation slightly higher than the 1 % threshold, i.e., 1.2 % and 2.5 %, respectively, which still represents a very low MAPE. The Fitbit Flex (5.6 %), Moves app (9.6 %), and Nike+ Fuelband (18 %) exhibited greater deviation errors: the Fitbit Flex and Nike+ Fuelband underestimated the number of steps, and the Moves app overestimated the number of steps. Some trackers were also examined for systematic differences in other studies using comparable conditions. Melanson et al. [26] found an accuracy of 97.8 % for the Digiwalker SW-200 during walking on the treadmill at speeds between 3.0 and 3.5 mph (4.8–5.6 km/h), which is in accordance with our finding of 1.2 % error. In the study of De Cocker et al. [27], the Omron differed by an average of 6.7 % from the gold standard. The slightly smaller difference of 2.5 % determined in our study could possibly be explained by the longer duration of the treadmill test in this study (30 min vs. 5 min), which decreases the relative size of measurement error. Case et al. [16] found an error of +6.2 % for the Moves app installed on an iOS device and an error of −6.7 % for the Moves app installed on an Android device. The MAPE found for the iOS device was a bit lower than the +9.6 % difference in our study. An explanation could be the different version of the iPhone that was utilized (iPhone 5S compared to the 4S in our study). For the Nike+ Fuelband, Case et al. found a mean underestimation of 22.7 %. This was in line with our finding of 18 % underestimation.
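As an illustration of the first method, the following minimal sketch computes the mean absolute percentage error (MAPE) and the signed mean percentage error of a tracker against the gold standard. The function names and the step counts are hypothetical and are not taken from the study data; the 1 % threshold is the laboratory criterion cited above.

```python
from statistics import mean


def percentage_errors(tracker_steps, gold_steps):
    """Signed percentage error of each tracker count relative to the gold standard."""
    return [100.0 * (t - g) / g for t, g in zip(tracker_steps, gold_steps)]


def mape(tracker_steps, gold_steps):
    """Mean absolute percentage error (size of the error, ignoring direction)."""
    return mean(abs(e) for e in percentage_errors(tracker_steps, gold_steps))


def mean_percentage_error(tracker_steps, gold_steps):
    """Signed mean error: negative = underestimation, positive = overestimation."""
    return mean(percentage_errors(tracker_steps, gold_steps))


# Hypothetical step counts for three participants (illustration only).
gold = [3000, 3100, 2900]
tracker = [2970, 3131, 2900]

LAB_THRESHOLD = 1.0  # percent, laboratory criterion cited in the text
print(f"MAPE: {mape(tracker, gold):.2f} %")
print(f"Signed error: {mean_percentage_error(tracker, gold):.2f} %")
print("Within lab threshold:", mape(tracker, gold) <= LAB_THRESHOLD)
```

The signed error shows the direction of the deviation (under- or overestimation), whereas the MAPE shows its size; errors in opposite directions cancel in the former but not in the latter.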
The second method to determine validity was to examine the ICCs between the trackers and the gold standard. In the laboratory study, all trackers demonstrated good to excellent agreement with the gold standard, with the exception of the Moves app, Nike+ Fuelband, and Fitbit Flex. Two other studies also examined correlations between the activity trackers and the gold standard in laboratory conditions. For the Fitbit One, Takacs et al. [14] ascertained concordance correlations between 0.97 and 1.0 for five different speeds on the treadmill with manual step counting as the gold standard. This was in accordance with our finding for the Fitbit Zip (ICC .99). For the Digiwalker SW-200, Beets et al. determined an ICC of .99 compared to real step count for children walking on a treadmill at the same speed (4.8 km/h) [28]. This is considerably higher than the ICC found in our study (ICC .65). However, when we removed the four outliers from our analyses, our ICC increased to .94, which is more in line with the findings of Beets et al.
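To make the second method concrete, the sketch below computes an intraclass correlation coefficient between a tracker and the gold standard. This section does not state which ICC form was used, so the sketch assumes the two-way random-effects, absolute-agreement, single-measures form, often written ICC(2,1); the function name and the data are hypothetical.

```python
from statistics import mean


def icc_absolute_agreement(ratings):
    """ICC(2,1) for a table with one row per participant and one column per
    measurement method (e.g. [gold_standard, tracker]).

    Assumption: two-way random effects, absolute agreement, single measures.
    """
    n = len(ratings)      # participants
    k = len(ratings[0])   # measurement methods
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical data: the tracker counts exactly 100 steps too few for everyone.
data = [[3000, 2900], [3100, 3000], [2900, 2800]]
print(round(icc_absolute_agreement(data), 3))
```

Under this assumed form, a constant offset between tracker and gold standard lowers the ICC even when the two measures are perfectly correlated, which is why absolute-agreement ICCs are stricter than a plain correlation coefficient.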
The third and last way to examine validity was to assess the level of agreement by visualizing the data with Bland-Altman plots [29]. The difference between the lower and upper limits of agreement (mean difference ± 1.96 SD of the difference scores) ranged from 46 steps (Fitbit Zip) to 2422 steps (Nike+ Fuelband). The Lumoback, Jawbone Up, Misfit Shine, Withings Pulse, and Fitbit Zip showed the narrowest limits of agreement (less than 300 steps), which equals less than 10 % and less than 3 min of walking. This can be considered a relatively small range. Taken together with the small systematic differences of these trackers (less than 1 %), this suggests that the Lumoback, Jawbone Up, Misfit Shine, Withings Pulse, and Fitbit Zip can be used interchangeably with the gold standard when walking on a treadmill. The systematic differences and the range between the upper and lower limits of agreement of the Moves app (1436 steps) and the Nike+ Fuelband (2422 steps) are considered too large for these trackers to be used interchangeably with the gold standard.
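The limits of agreement described above can be sketched as follows: the mean difference (bias) between tracker and gold standard, plus and minus 1.96 times the standard deviation of the differences. The function name and the step counts are hypothetical, and the sample standard deviation (n − 1 denominator) is an assumption.

```python
from statistics import mean, stdev


def limits_of_agreement(tracker_steps, gold_steps):
    """Return (bias, lower limit, upper limit, width) for a Bland-Altman analysis.

    Assumption: the sample standard deviation of the differences is used.
    """
    diffs = [t - g for t, g in zip(tracker_steps, gold_steps)]
    bias = mean(diffs)   # systematic difference
    sd = stdev(diffs)    # spread of the individual differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    return bias, lower, upper, upper - lower


# Hypothetical step counts for three participants (illustration only).
gold = [3000, 3000, 3000]
tracker = [2990, 3000, 3010]

bias, lower, upper, width = limits_of_agreement(tracker, gold)
print(f"bias={bias:.1f}, limits=({lower:.1f}, {upper:.1f}), width={width:.1f} steps")
```

The width (upper minus lower limit) corresponds to the "difference between the lower and upper limits of agreement" reported in the text; a tracker can have a bias near zero and still show a large width if individual errors are large but cancel out on average.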
To summarize, the lab results show that most trackers are valid with the Lumoback, Jawbone Up, Misfit Shine, Withings Pulse, and Fitbit Zip demonstrating the highest validity. The Moves app and Nike+ Fuelband are clearly invalid. It should be noted that, in a controlled lab condition, there is no variation in walking speed, intensity, direction, etc. which is in contrast to real life. Therefore, validity was also tested in free-living conditions.
The first way to validate activity trackers in free-living conditions was to assess systematic differences. In free-living conditions, an acceptable mean deviation from the gold standard is 10 % [23]. Eight activity trackers met this criterion. The Nike+ Fuelband and Moves app showed larger percentages of underestimation: 24.0 % and 37.6 %, respectively. Lee et al. [12] investigated various consumer trackers during different semi-structured activities (the participants followed a 69-min protocol) and compared total energy expenditure with the gold standard (breath-to-breath analysis). The Fitbit Zip, Jawbone Up, and Nike+ Fuelband differed by 10.1 %, 12.2 %, and 13.0 %, respectively, from the gold standard. The differences for the Fitbit Zip and Jawbone Up are greater than those in our study, which could possibly be explained by the different outcome measure that was utilized in the study of Lee et al. (energy expenditure vs. step count). The difference between the Nike+ Fuelband and the gold standard is smaller than in the present study (24 %). However, Lee et al. already noted inconsistent results for the Nike+ Fuelband (a relatively small MAPE but also a low correlation with the gold standard) and, therefore, advised interpreting these results with caution. Ferguson et al. [17] investigated five similar devices (Jawbone Up, Nike+ Fuelband, Misfit Shine, Withings Pulse, and Fitbit Zip) in free-living conditions for 48 h. They ascertained differences of 8.1 %, 25.6 %, 10.1 %, 6.3 %, and 4.3 %, respectively. These values are in line with our findings; the somewhat larger differences can be explained by the longer period of measurement. De Cocker et al. [27] investigated the Omron during free-living conditions and used the Digiwalker as a criterion measure. They reported a more substantial difference between the two devices than was found in the present study (36.9 % vs. 0.4 %), which may be a result of non-walking activities, a longer period of measurement, and the different gold standard.
The second way to determine the validity of the activity trackers during free-living conditions was to calculate ICCs. All activity trackers were highly correlated with the gold standard (ActivPAL). The Nike+ Fuelband and the Moves app showed ICCs that were a bit lower and had broad confidence intervals (.83 [CI .37; .94] and .80 [CI .05; .99], respectively). The high ICCs in the free-living study can be partially attributed to the differences in activity patterns between the participants during the test day; more variation between participants increases the chances of a high ICC. Lee et al. [12] indicated similar results for the Fitbit Zip, Jawbone Up, and Nike+ Fuelband, i.e., high correlations for the Fitbit Zip and Jawbone Up and a lower correlation for the Nike+ Fuelband. Tully et al. investigated the validity of the Fitbit Zip in free-living conditions; the Fitbit Zip was worn for seven days along with the Actigraph accelerometer. They reported a high correlation (Spearman's rho = .91) between steps/day measured by the Fitbit Zip and by the Actigraph [15]. In addition, Ferguson et al. reported similar correlations for the Jawbone Up, Nike+ Fuelband, Misfit Shine, Withings Pulse, and Fitbit Zip in their free-living study of 48 h [17].
Finally, the level of agreement of the activity trackers with the gold standard during free-living conditions was assessed with Bland-Altman plots. The difference between the lower and upper limits of agreement ranged from 861 steps (Fitbit Zip) to 5150 steps (Moves app). For the Fitbit Zip, the range of 861 steps (less than 1000 steps, i.e., about 10 min of walking) appears to be sufficiently low for the device to be a valid measure in scientific research. The Misfit Shine and Lumoback demonstrated slightly larger limits of agreement (1400 and 1590 steps, respectively), which still indicates good validity. For the other trackers, the limits of agreement show that, despite the relatively small systematic error (below 400 steps [10 %] for eight of the ten trackers), larger individual differences are evident, resulting in a lower validity.
To summarize, the validity of eight of the ten trackers was good during free-living conditions whereby the Fitbit Zip showed the best validity. The validity of the Nike+ Fuelband is low for measuring steps in free-living conditions.
Our study has some limitations. First, in the laboratory condition, only one type of activity was examined (walking); however, activity trackers may perform differently during different activities or at different velocities (such as slow walking). The advantage of the 30-min measurement was that reliable data for average walking speed were obtained. Second, for examining free-living activity, we used a time span of 9:00–16:30 in which mostly ‘occupational activity’ was measured. The advantage of this method was that we were able to make a realistic comparison between the different trackers with different wearing positions because cycling was excluded. Cycling could have biased the results between centrally worn and wrist-worn trackers. However, the trackers might perform differently during a greater variety of activities such as more intensive exercise. These activities were not measured in this study. The third limitation was that, in the free-living condition, the Nike+ Fuelband and Moves app were tested with fewer participants. Because of a reasonable power (62 %), consistent results with the laboratory condition, and consistent results with other studies [12, 16, 17], the results of the Nike+ Fuelband are considered reliable. For the Moves app, only preliminary conclusions can be drawn on the validity in free-living conditions. This is due to the low N and consequently lower power of 39 %, and because the Moves app was tested on different types of phones compared to the laboratory study (Android vs. iOS devices). Therefore, the results of the free-living condition cannot be compared with the lab condition because the different types of firmware may have influenced the results. However, our results for the Moves app on the different types of phones are comparable with the study of Case et al. [16], who showed that Android devices are associated with a modest underestimation and iOS devices with a modest overestimation of step counts.
By combining the results of both conditions, it can be concluded that the validity of most activity trackers is good (Fitbit Zip, followed by Misfit Shine and Lumoback) or acceptable (Fitbit Flex, Jawbone Up, Withings Pulse, Omron, and Digiwalker). Looking at the wearing position of the trackers (wrist-worn for the Fitbit Flex, Jawbone Up, and Nike+ Fuelband and centrally worn, i.e., close to the pelvis or trunk, for the remaining devices), our results indicate that centrally worn activity trackers exhibit better validity than wrist-worn activity trackers, especially during free-living conditions. For wrist-worn activity trackers, more measurement error can occur because of the greater variation in the way the arms are used in free-living conditions. This finding is supported by the research of Atallah et al. [30].
When choosing a device, several considerations can be taken into account. First, the goal of physical activity measurement should be considered. For individual users, it is most important that the change in physical activity is clearly displayed; therefore, devices should be reliable. For large-scale research, the validity of a tracker is important in order to be able to compare physical activity levels of different groups. In addition, the type of activity that will be measured should be considered so that an appropriate wearing position can be chosen. For example, wrist-worn activity trackers are better able to measure upper limb activity, and ankle-worn trackers are better able to measure lower limb activity (e.g., cycling) [31]. Furthermore, a consumer can choose between a more advanced (and usually more expensive) device and a simpler, more affordable device. This study demonstrated that less expensive devices are not necessarily less valid.
Authors’ contributions
TJMK participated in the design of the study, undertook data collection for the laboratory study, undertook statistical analysis, and wrote the manuscript. MLD participated in the design of the study, undertook data collection for the free-living study, and contributed to writing the manuscript. SRS participated in the design of the study, contributed to data collection for the free-living study, contributed to statistical analysis of the free-living results, and contributed to writing the manuscript. WPK participated in the design of the study, advised on the statistical analysis, and contributed to writing the manuscript. CPS participated in the design of the study and contributed to writing the manuscript. MG participated in the design of the study, supervised the execution of this study, and contributed to writing the manuscript. All authors have approved the final version.