Adaptation and translation processes
The construct of the original MARS version was judged to be conceptually equivalent, since all five domains of the scale are highly relevant and appropriate to the mobile apps available in Italian app stores, and no item required major modification. However, after two forward versions had been produced, an issue arose regarding the translation of IT terminology. Although these semantic neologisms are now widely used by IT specialists in Italy, such loanwords may be unfamiliar to people outside the IT field, such as health professionals. Moreover, technical terms related to smartphones (especially gesture commands) have been introduced only recently. To produce a reconciled version of the scale, these technical terms were divided into three categories: calques (e.g. screen – schermo), adapted loanwords (e.g. navigation – navigazione) and non-adapted loanwords (e.g. swipe). Terms in the first two categories were reported in their commonly used “Italianized” dictionary form. By contrast, non-adapted loanwords were mostly reported in the original English spelling, accompanied by their referential meaning in Italian. During the production of the two forward translations, the authors of the source questionnaire were contacted for clarification of two items.
A backward translation was subsequently produced; having been judged satisfactory by the research staff, this was sent, without modification, to the corresponding author of the original scale. The intralingual equivalence between the original and backward-translated versions was then discussed with the team of researchers of the source tool. In general, the backward-translated version was deemed highly congruent with the original version. Most comments made by the MARS developers concerned shades of meaning of single words (e.g. use of adverbs of degree). However, some other comments highlighted a possible problem of non-equivalence (e.g. “through games” was not judged equivalent to “through gamification”). All these comments were addressed by finding the closest translation. Some small modifications were also made in the description/classification section [by adding options “Windows Phone” and “Public body” to the items on app platform and affiliation, respectively, and by combining the options “CBT – Behavioral (positive events)” and “CBT – Cognitive (thought-challenging)” into a single option “CBT – Cognitive behavioral therapy”]. Since there were no major changes to be made in the reconciled version, a second backward translation was judged to be unnecessary. It was deemed that the scale format, instructions and measurement would not affect the operational equivalence.
App selection and piloting
A total of 579 apps were retrieved; after the removal of duplicates (N = 132), 447 apps were screened. Of these, 398 were excluded on the basis of exclusion criterion 1 (no Italian version, N = 42), criterion 2 (not relevant to primary prevention, N = 348) or criterion 3 (targeted at healthcare professionals, N = 8). One app available in the Windows Store could not be downloaded, owing to unmet technical requirements (the Nokia Lumia 520 has no front camera). Thus, 48 apps were included.
The first 5 apps were piloted and the percentage of absolute agreement was computed. This varied substantially across subscales: it was at least 60 % for engagement (64 %), information (69 %) and subjective quality (60 %), but only 50 % for functionality and 27 % for aesthetics. Following comparison and review of the pilot results, both raters repeated and discussed the training course in order to improve the alignment of their ratings. No modifications to the scale were deemed necessary. We then proceeded to evaluate the reliability and validity of the final Italian version of MARS (Additional file 1).
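For illustration, the percentage of absolute agreement used in the pilot can be computed as follows. This is a minimal sketch, not the authors' code, and the item scores shown are hypothetical:

```python
# Sketch (not the authors' code): percentage of absolute agreement
# between two raters, i.e. the share of items scored identically.
def percent_agreement(rater1, rater2):
    """Return the % of items on which the two raters gave the same score."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be two non-empty sequences of equal length")
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return 100.0 * matches / len(rater1)

# Hypothetical scores on one item for the 5 pilot apps:
r1 = [4, 3, 5, 2, 4]
r2 = [4, 3, 4, 2, 4]
print(percent_agreement(r1, r2))  # 4 of 5 scores match -> 80.0
```

In the pilot, this proportion would be computed per subscale over all of that subscale's item ratings, which is how figures such as 64 % for engagement arise.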
Main study
A total of 43 apps were tested in the validation study. Most of these (N = 30, 70 %) were for the Android platform (since searches in Google Play were conducted first), while 9 (21 %) and 4 (9 %) apps were downloaded from Apple and Windows Stores, respectively. About half of the apps were affiliated to unknown or commercial developers, while 26, 12 and 12 % were developed through the participation of non-commercial organizations, government/public health authorities and universities, respectively. The median time since the last update was 12.3 (interquartile range: 6.5–20.0) months. Only one app had previously been tested in formal studies (item 19 of the information subscale); this item was therefore excluded from all calculations.
As shown in Table 1, the distribution of all composite scores was approximately symmetric, as no skewness coefficient exceeded |1|. The D’Agostino test confirmed normal distributions of all the summary scores produced by both raters. The subscale scores of the two raters were very close to each other, the between-rater difference never exceeding 10 % and ranging from 0 % (aesthetics) to 7.2 % (subjective quality). Paired t tests showed no significant differences between the raters’ scores (engagement: p = .22; functionality: p = .54; aesthetics: p = .99; information: p = .86; MARS total score: p = .41; subjective quality: p = .19). The functionality subscale was probably subject to a ceiling effect, as the proportion of apps with top scores exceeded the pre-specified criterion of 15 %.
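The floor/ceiling check described above can be sketched as follows. This is an illustrative implementation, not the study's code; in particular, how close to the scale extremes a score must be to count towards the effect (`tol`) is an assumption:

```python
# Sketch: floor and ceiling effects as the % of apps scoring at the
# extremes of a 1-5 subscale; values above 15 % are commonly taken to
# indicate a floor/ceiling effect (the criterion used in the text).
def floor_ceiling(scores, low=1.0, high=5.0, tol=0.25):
    """Return (% of scores near the minimum, % near the maximum).

    `tol` (how close to the extreme counts) is an assumption of this sketch.
    """
    n = len(scores)
    floor = 100.0 * sum(s <= low + tol for s in scores) / n
    ceiling = 100.0 * sum(s >= high - tol for s in scores) / n
    return floor, ceiling

# Hypothetical subscale means for four apps:
print(floor_ceiling([1.0, 1.0, 5.0, 3.0]))  # -> (50.0, 25.0)
```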
Table 1
Mean scores, distribution and floor and ceiling effects, by rater and subscale

| Subscale | Skewness, Rater 1 | Skewness, Rater 2 | Mean (SD), Rater 1 | Mean (SD), Rater 2 | Floor %, Rater 1 | Floor %, Rater 2 | Ceiling %, Rater 1 | Ceiling %, Rater 2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Engagement | 0.39 | 0.17 | 2.87 (0.87) | 2.96 (0.79) | 0 | 0 | 2.3 | 0 |
| Functionality | −0.28 | −0.87 | 4.10 (0.67) | 4.15 (0.80) | 0 | 0 | 18.6 | 18.6 |
| Aesthetics | −0.67 | −0.56 | 3.34 (0.99) | 3.34 (0.94) | 7.0 | 4.7 | 4.7 | 2.3 |
| Informationᵃ | −0.64 | −0.34 | 3.49 (0.80) | 3.48 (0.72) | 0 | 0 | 0 | 0 |
| MARS total score | −0.36 | −0.34 | 3.45 (0.66) | 3.48 (0.66) | 0 | 0 | 0 | 0 |
| Subjective quality | 0.39 | 0.51 | 2.49 (1.17) | 2.31 (0.99) | 7.0 | 14.0 | 2.3 | 0 |

ᵃ Item 19 excluded from all calculations (see text)
The ICCs were deemed excellent for 4 of the 5 subscales and for the MARS total mean quality score, and good for the functionality subscale (Table 2). The ICCs of single items ranged from .59 to .93, with a mean of .82 (SD: .11): the estimates of 7, 9 and 6 items were classified as excellent, good and moderate, respectively (Additional file 2: Table S1). The lowest ICCs (.59 and .60) were observed for items 17 (visual information) and 5 (target group), respectively.
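As a worked illustration of how an inter-rater ICC can be obtained from an ANOVA decomposition, the sketch below implements the two-way random-effects, absolute-agreement, single-measures coefficient (ICC(2,1) in the Shrout and Fleiss notation). Whether this is the exact ICC model used in the study is an assumption of this sketch:

```python
# Sketch: two-way random-effects ICC for absolute agreement, single
# measures (Shrout & Fleiss ICC(2,1)), computed from mean squares.
def icc_2_1(ratings):
    """`ratings`: one row per app, one column per rater."""
    n = len(ratings)                         # number of targets (apps)
    k = len(ratings[0])                      # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_tot = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_tot - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                  # between-targets mean square
    msc = ss_cols / (k - 1)                  # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))       # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfectly agreeing raters give an ICC of 1:
print(icc_2_1([[1, 1], [2, 2], [3, 3], [4, 4]]))  # -> 1.0
```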
Table 2
Intra-class correlation coefficients, by subscale

| Subscale | ICC | 95 % CI |
| --- | --- | --- |
| Engagement | .91 | .84–.95 |
| Functionality | .88 | .77–.93 |
| Aesthetics | .93 | .87–.96 |
| Informationᵃ | .95 | .90–.97 |
| MARS total score | .96 | .93–.98 |
| Subjective quality | .95 | .89–.97 |

ᵃ Item 19 excluded from all calculations (see text)
All Cronbach’s α coefficients were judged to be at least acceptable, regardless of rater and subscale. Notably, they were categorized as excellent for the MARS total score and the subjective quality subscale (Table 3). Moreover, the MARS total score displayed relatively stable internal consistency, as shown by the Spearman-Brown prophecy formula (Rater 1: .81; Rater 2: .84). The estimated internal consistency of the average of the MARS total scores assigned by the 2 raters was also good (.85).
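The two reliability quantities mentioned above can be sketched as follows; this is illustrative code, not the study's, and the formulas are the standard ones for Cronbach's α and the Spearman-Brown prophecy:

```python
# Sketch: Cronbach's alpha from per-item score lists, plus the
# Spearman-Brown prophecy formula for the reliability of a test
# lengthened (or shortened) by a given factor.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """`items`: one score list per item, all over the same set of apps."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

def spearman_brown(r, factor):
    """Predicted reliability when the test is `factor` times as long."""
    return factor * r / (1 + (factor - 1) * r)

# Doubling a test with reliability .5 predicts a reliability of about .67:
print(round(spearman_brown(0.5, 2), 3))
```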
Table 3
Cronbach’s α coefficients, by rater and subscale

| Subscale | Rater 1: α (95 % CI) | Rater 2: α (95 % CI) |
| --- | --- | --- |
| Engagement | .85 (.76–.91) | .84 (.75–.90) |
| Functionality | .77 (.63–.87) | .87 (.79–.92) |
| Aesthetics | .92 (.86–.95) | .88 (.81–.93) |
| Informationᵃ | .73 (.57–.84) | .71 (.54–.83) |
| MARS total score | .90 (.85–.94) | .91 (.87–.94) |
| Subjective quality | .95 (.92–.97) | .93 (.89–.96) |

ᵃ Item 19 excluded from all calculations (see text)
The convergent validity of the Italian MARS was established, as the item-subscale and item-total correlation coefficients of both raters exceeded the cut-off value of .2; after correction for overlap, most item-total ρs (16/22 for Rater 1 and 18/22 for Rater 2) were ≥ .5 (Table 4). Some item-total correlation coefficients were, however, not statistically significant (item 14 for both raters, item 13 for Rater 1 and item 4 for Rater 2), as shown by the corresponding 95 % CIs. Similarly, as shown by the correlation matrices (Additional file 2: Figure S1), most ρs were > .2, and the average inter-item correlation coefficient also fulfilled the pre-specified criterion (Rater 1: .40; Rater 2: .43).
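The "correction for overlap" means that each item is correlated with the total computed from the remaining items, so the item does not inflate its own coefficient. A minimal sketch (not the study's code) of a corrected item-total Spearman correlation:

```python
# Sketch: corrected item-total Spearman correlation. The item is
# correlated with the total of the *other* items; Spearman's rho is the
# Pearson correlation of the ranks, with average ranks for ties.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                           # extend over a tie group
        avg = (i + j) / 2 + 1                # average rank of the group
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def corrected_item_total(item, other_items):
    rest_total = [sum(vals) for vals in zip(*other_items)]
    return pearson(ranks(item), ranks(rest_total))   # Spearman's rho

# An item perfectly monotone with the rest of the scale gives rho = 1:
print(corrected_item_total([1, 2, 3, 4], [[1, 2, 3, 4], [2, 3, 4, 5]]))
```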
Table 4
Corrected item-subscale and item-total Spearman’s ρ correlation coefficients, by rater

| Subscale | Item | Item-subscale ρ (95 % CI), Rater 1 | Item-subscale ρ (95 % CI), Rater 2 | Item-total ρ (95 % CI), Rater 1 | Item-total ρ (95 % CI), Rater 2 |
| --- | --- | --- | --- | --- | --- |
| Engagement | 1 | .80 (.65–.89) | .81 (.68–.89) | .78 (.61–.88) | .74 (.55–.86) |
| | 2 | .82 (.64–.92) | .79 (.64–.89) | .78 (.62–.88) | .75 (.57–.86) |
| | 3 | .47 (.20–.68) | .71 (.52–.84) | .35 (.06–.59) | .62 (.37–.78) |
| | 4 | .62 (.35–.82) | .44 (.15–.69) | .54 (.24–.76) | .28 (−.03–.56) |
| | 5 | .61 (.39–.77) | .54 (.28–.74) | .77 (.60–.88) | .69 (.48–.84) |
| Functionality | 6 | .64 (.41–.82) | .62 (.40–.76) | .48 (.20–.69) | .42 (.11–.67) |
| | 7 | .50 (.22–.72) | .71 (.51–.84) | .33 (.02–.60) | .62 (.38–.79) |
| | 8 | .75 (.56–.88) | .78 (.63–.87) | .45 (.17–.68) | .74 (.57–.86) |
| | 9 | .65 (.44–.81) | .80 (.62–.90) | .53 (.29–.70) | .73 (.53–.86) |
| Aesthetics | 10 | .69 (.45–.84) | .60 (.36–.78) | .82 (.66–.91) | .69 (.50–.83) |
| | 11 | .75 (.55–.89) | .88 (.80–.93) | .60 (.35–.78) | .75 (.57–.86) |
| | 12 | .86 (.73–.93) | .87 (.76–.93) | .68 (.48–.82) | .75 (.55–.87) |
| Informationᵃ | 13 | .33 (.03–.58) | .43 (.14–.66) | .30 (−.02–.59) | .43 (.11–.69) |
| | 14 | .32 (.01–.59) | .34 (.01–.63) | .23 (−.11–.54) | .27 (−.06–.56) |
| | 15 | .70 (.51–.84) | .76 (.62–.83) | .61 (.35–.80) | .58 (.36–.76) |
| | 16 | .49 (.22–.71) | .51 (.29–.67) | .73 (.54–.86) | .56 (.28–.77) |
| | 17 | .54 (.23–.77) | .54 (.28–.71) | .63 (.39–.79) | .71 (.52–.84) |
| | 18 | .62 (.42–.76) | .59 (.36–.77) | .61 (.38–.78) | .57 (.33–.76) |
| Subjective quality | 20 | .94 (.90–.97) | .89 (.80–.94) | .89 (.79–.94) | .83 (.69–.90) |
| | 21 | .88 (.77–.94) | .86 (.75–.92) | .81 (.67–.89) | .81 (.69–.88) |
| | 22 | .88 (.81–.92) | .79 (.65–.86) | .81 (.65–.90) | .69 (.51–.80) |
| | 23 | .95 (.91–.97) | .94 (.89–.97) | .89 (.79–.94) | .88 (.79–.94) |

ᵃ Item 19 excluded from all calculations (see text)
Pearson’s correlation coefficients between the subscales making up the MARS total score (the objective subscales) are reported in Table 5. Only the subscales “engagement” and “aesthetics” showed r values above .7 for both raters. Of the 22 items considered, 20 (91 %) displayed a higher correlation with their own subscale than with the other subscales (Additional file 2: Table S2).
Table 5
Between-subscale (objective subscales) Pearson’s r correlation coefficients, by rater (Rater 1: upper right triangle; Rater 2: lower left triangle)

| | Engagement | Functionality | Aesthetics | Informationᵃ |
| --- | --- | --- | --- | --- |
| Engagement | – | .29 (−.01–.54) | .72 (.54–.84) | .61 (.38–.77) |
| Functionality | .34 (.04–.58) | – | .34 (.04–.58) | .53 (.27–.72) |
| Aesthetics | .77 (.61–.87) | .47 (.20–.68) | – | .43 (.15–.65) |
| Informationᵃ | .56 (.31–.74) | .66 (.45–.80) | .49 (.22–.69) | – |

ᵃ Item 19 excluded from all calculations (see text)
Bootstrapped generalised Ferguson’s δ coefficients ranged from .84 to .96 and from .86 to .96 for Raters 1 and 2, respectively, indicating that the questionnaire is able to establish differences among the apps. As shown by Loevinger’s H coefficients, the scalability of all the subscales and of the MARS total score was acceptable, exceeding the threshold value of .3 (Additional file 2: Table S3).
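Ferguson's δ measures discrimination: it is 1 when scores are spread uniformly over all attainable values and 0 when every app receives the same score. A minimal sketch of the generalised coefficient (Loevinger's H, by contrast, would typically come from Mokken-scaling software and is not sketched here):

```python
# Sketch: generalised Ferguson's delta, a discrimination index.
# `k` is the number of attainable score values; `f` are the observed
# frequencies of each value among the n apps.
from collections import Counter

def ferguson_delta(scores, k):
    n = len(scores)
    sum_f2 = sum(f * f for f in Counter(scores).values())
    return k * (n * n - sum_f2) / (n * n * (k - 1))

print(ferguson_delta([1, 2, 3, 4, 5], k=5))   # uniform spread -> 1.0
print(ferguson_delta([3, 3, 3, 3, 3], k=5))   # no discrimination -> 0.0
```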
Thirty-seven (86 %) apps had at least one vote in an app store; of these, 31 had at least 5 votes and 23 at least 10. As shown in Table 6, the number of votes an app had received affected the strength of the association between MARS scores (MARS star rating, MARS total score and subjective quality subscale) and the star ratings available in the app stores: the more votes, the stronger and more statistically significant the positive association, regardless of rater and item/scale. Even on applying the 10-vote cut-off, however, the statistically significant correlation coefficients were only poor to moderate.
Table 6
Correlation coefficients between rating systems available in app stores and MARS star rating, total and subjective quality scores, by number of ratings cut-off and rater

| Score | N (%) | Votes cut-off | Coefficient, Rater 1 | p, Rater 1 | Coefficient, Rater 2 | p, Rater 2 |
| --- | --- | --- | --- | --- | --- | --- |
| MARS star rating (N23)ᵃ | 37 (86.0) | 1 | .18 | .28 | .26 | .12 |
| | 31 (72.1) | 5 | .25 | .17 | .31 | .086 |
| | 23 (53.5) | 10 | .50 | .015 | .46 | .028 |
| MARS total scoreᵇ | 37 (86.0) | 1 | .02 | .92 | .09 | .62 |
| | 31 (72.1) | 5 | .03 | .89 | .09 | .62 |
| | 23 (53.5) | 10 | .43 | .041 | .37 | .081 |
| App subjective qualityᵇ | 37 (86.0) | 1 | .16 | .35 | .20 | .23 |
| | 31 (72.1) | 5 | .19 | .30 | .26 | .16 |
| | 23 (53.5) | 10 | .50 | .015 | .54 | .008 |
The MARS total score of the apps developed by governmental or non-profit organizations or universities [3.83 (SD 0.47)] was significantly higher (t = 4.25, p < .001) than that of the apps from unknown/commercial developers [3.12 (SD 0.61)]. The effect size was large [d = 1.30 (95 % CI: 0.62–1.98)]. Similarly, comparison of the single subscales (Additional file 2: Figure S2) revealed lower scores for the apps from unknown/commercial developers; the largest effect size, 1.61 (95 % CI: 0.90–2.32), was seen for the information subscale, while the smallest concerned aesthetics [d = 0.76 (95 % CI: 0.13–1.40)].
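The effect size used in this comparison is Cohen's d with a pooled standard deviation; a minimal sketch (illustrative data, not the study's):

```python
# Sketch: Cohen's d for two independent groups, using the pooled
# standard deviation of the two samples.
def cohens_d(group_a, group_b):
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (ma - mb) / pooled_sd

# Means one pooled SD apart give d = 1:
print(cohens_d([2, 3, 4], [1, 2, 3]))  # -> 1.0
```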
The internal consistency of the MARS total score was very similar between the Italian and the original versions [αs of .92 vs .90, respectively; F = 1.25, p = .45]. In our study, the ICC for the total score was substantially higher (.96 vs .79), with non-overlapping 95 % CIs. By contrast, the Australian version of MARS displayed higher concurrent validity with the app stores’ rating systems, though the difference did not reach statistical significance at the .05 level on applying either the 5-vote (z = 1.62, p = .11) or the 10-vote (z = 0.53, p = .59) cut-off.
The model that best predicted the MARS total score consisted of two predictors: the app-store star rating and the developer’s affiliation. The former was, however, not statistically significant (b = 0.09, p = .48). By contrast, institutional (governmental, non-profit organization, university) affiliation was significantly (p < .001) associated with a 0.82-point increase in the MARS total score. The model explained 39.8 % of the variance.
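A two-predictor linear model of this kind can be fitted by ordinary least squares via the normal equations; the sketch below solves the resulting 3×3 system with Cramer's rule. The data are purely illustrative, not the study data:

```python
# Sketch: OLS fit of y = b0 + b1*x1 + b2*x2 (e.g. star rating plus an
# affiliation dummy), via the normal equations and Cramer's rule.
def ols_two_predictors(x1, x2, y):
    n = len(y)
    sx1x2 = sum(p * q for p, q in zip(x1, x2))
    # X'X and X'y for the design matrix with columns [1, x1, x2]:
    a = [[n, sum(x1), sum(x2)],
         [sum(x1), sum(v * v for v in x1), sx1x2],
         [sum(x2), sx1x2, sum(v * v for v in x2)]]
    b = [sum(y),
         sum(p * q for p, q in zip(x1, y)),
         sum(p * q for p, q in zip(x2, y))]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(a)
    coefs = []
    for col in range(3):                 # Cramer's rule, column by column
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        coefs.append(det3(m) / d)
    return coefs                         # [intercept, b_x1, b_x2]

# Exactly linear data are recovered: y = 1 + 0.5*x1 + 0.8*x2
x1, x2 = [1, 2, 3, 4], [0, 1, 0, 1]
y = [1.5, 2.8, 2.5, 3.8]
print([round(c, 6) for c in ols_two_predictors(x1, x2, y)])  # approx [1.0, 0.5, 0.8]
```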