Background
Methods
Article selection
- Purpose of the study was to evaluate one or more measurement properties
- Instrument under study was an HR-PRO instrument
- English language publications
- Systematic reviews, case reports, letters to editors
- Studies that evaluated the construct validity of two or more instruments at the same time by correlating the instruments' scores with each other, without indicating one of the instruments as the instrument of interest. In these studies, it is unclear which instrument's construct validity is being assessed.
Selection of participants
Procedure
Statistical analyses
Results
Measurement property | Percentage agreement | Intraclass kappa^a |
---|---|---|
Internal consistency | 94 | 0.66 |
Reliability | 94 | 0.77 |
Measurement error | 94 | 0.02^b |
Content validity | 84 | 0.29 |
Structural validity | 86 | 0.48 |
Hypotheses testing | 87 | 0.29 |
Cross-cultural validity | 95 | 0.66^b |
Criterion validity | 93 | 0.23^b |
Responsiveness | 96 | 0.81 |
Interpretability | 86 | 0.02^b |
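Several rows above pair a high percentage agreement with a kappa near zero. The sketch below illustrates how that can happen when one rating category is rare. It uses Cohen's kappa for two raters as a simpler stand-in for the intraclass kappa reported in the table, and all counts are hypothetical, chosen only for illustration. The function name `agreement_and_kappa` is likewise an assumption, not something taken from the study.

```python
def agreement_and_kappa(both_yes, a_only, b_only, both_no):
    """Return (observed agreement, Cohen's kappa) for a 2x2 table of two raters.

    both_yes / both_no: articles where the raters gave the same answer.
    a_only / b_only: articles where only rater A (or only rater B) said yes.
    """
    n = both_yes + a_only + b_only + both_no
    p_observed = (both_yes + both_no) / n

    # Marginal proportion of "yes" ratings for each rater.
    p_a_yes = (both_yes + a_only) / n
    p_b_yes = (both_yes + b_only) / n

    # Agreement expected by chance if the raters were independent.
    p_chance = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    if p_chance == 1.0:
        # Kappa is undefined when both raters always give the same single answer.
        return p_observed, float("nan")

    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Hypothetical counts: "yes" is rare, so chance agreement is already very high.
skewed = agreement_and_kappa(both_yes=0, a_only=3, b_only=3, both_no=94)

# Hypothetical counts: same observed agreement, but balanced categories.
balanced = agreement_and_kappa(both_yes=47, a_only=3, b_only=3, both_no=47)

print(f"skewed:   agreement={skewed[0]:.2f}, kappa={skewed[1]:.2f}")
print(f"balanced: agreement={balanced[0]:.2f}, kappa={balanced[1]:.2f}")
```

Both hypothetical tables have 94% observed agreement, yet kappa differs sharply because kappa corrects for the agreement expected by chance, which depends on how the ratings are distributed over the categories.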
Item nr | Item | N (minus articles with 1 rating)^a | % agreement | N | Kappa |
---|---|---|---|---|---|
Box A. Internal consistency (n = 195)^b | | | | | |
A1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 185 | 82 | 193 | 0.06 |
Design requirements | | | | | |
A2^c | Was the percentage of missing items given? | 183 | 87 | 190 | 0.48 |
A3^c | Was there a description of how missing items were handled? | 180 | 90 | 187 | 0.54 |
A4 | Was the sample size included in the internal consistency analysis adequate? | 177 | 87 | 185 | 0.06^d |
A5^c | Was the unidimensionality of the scale checked? i.e. was factor analysis or IRT model applied? | 180 | 92 | 187 | 0.69 |
A6 | Was the sample size included in the unidimensionality analysis adequate? | 166 | 79 | 178 | 0.27 |
A7 | Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? | 179 | 85 | 187 | 0.31^d |
A8^c | Were there any important flaws in the design or methods of the study? | 174 | 86 | 179 | 0.22^d |
Statistical methods | | | | | |
A9 | for Classical Test Theory (CTT): Was Cronbach's alpha calculated? | 179 | 93 | 187 | 0.27^d,e |
A10 | for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? | 151 | 91 | 165 | 0.17^d,e |
A11 | for IRT: Was a goodness of fit statistic at a global level calculated? e.g. χ², reliability coefficient of estimated latent trait value (index of (subject or item) separation) | 154 | 93 | 167 | 0.46^d,e |
Box B. Reliability (n = 141)^b | | | | | |
Design requirements | | | | | |
B1^c | Was the percentage of missing items given? | 129 | 87 | 140 | 0.39 |
B2^c | Was there a description of how missing items were handled? | 125 | 91 | 137 | 0.43^d |
B3 | Was the sample size included in the analysis adequate? | 127 | 77 | 139 | 0.35 |
B4^c | Were at least two measurements available? | 129 | 98 | 140 | 0.72^d |
B5 | Were the administrations independent? | 129 | 73 | 139 | 0.18 |
B6^c | Was the time interval stated? | 125 | 94 | 136 | 0.50^d |
B7 | Were patients stable in the interim period on the construct to be measured? | 126 | 75 | 138 | 0.24 |
B8 | Was the time interval appropriate? | 125 | 84 | 137 | 0.45 |
B9 | Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions | 127 | 83 | 138 | 0.30 |
B10^c | Were there any important flaws in the design or methods of the study? | 117 | 77 | 129 | 0.08 |
Statistical methods | | | | | |
B11 | for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? | 119 | 86 | 133 | 0.59^e |
B12 | for dichotomous/nominal/ordinal scores: Was kappa calculated? | 111 | 81 | 127 | 0.32^e |
B13 | for ordinal scores: Was a weighted kappa calculated? | 111 | 83 | 127 | 0.42^e |
B14 | for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic | 108 | 81 | 124 | 0.35^e |
Box D. Content validity (n = 83)^b | | | | | |
Design requirements | | | | | |
D1 | Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? | 62 | 79 | 83 | 0.33 |
D2 | Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) | 62 | 76 | 83 | 0.46 |
D3 | Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) | 62 | 66 | 83 | 0.21 |
D4 | Was there an assessment of whether all items together comprehensively reflect the construct to be measured? | 62 | 66 | 83 | 0.15 |
D5^c | Were there any important flaws in the design or methods of the study? | 58 | 76 | 78 | 0.13 |
Box E. Structural validity (n = 118)^b | | | | | |
E1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 99 | 78 | 116 | 0^f |
Design requirements | | | | | |
E2^c | Was the percentage of missing items given? | 95 | 87 | 110 | 0.41 |
E3^c | Was there a description of how missing items were handled? | 93 | 91 | 109 | 0.55 |
E4 | Was the sample size included in the analysis adequate? | 94 | 87 | 109 | 0.56^d |
E5^c | Were there any important flaws in the design or methods of the study? | 89 | 84 | 103 | 0.27 |
Statistical methods | | | | | |
E6 | for CTT: Was exploratory or confirmatory factor analysis performed? | 92 | 90 | 106 | 0.51^d,e |
E7 | for IRT: Were IRT tests for determining the (uni-) dimensionality of the items performed? | 62 | 87 | 80 | 0.39^e,f |
Box F. Hypotheses testing (n = 170)^b | | | | | |
Design requirements | | | | | |
F1^c | Was the percentage of missing items given? | 158 | 87 | 168 | 0.41 |
F2^c | Was there a description of how missing items were handled? | 159 | 92 | 169 | 0.60^d |
F3 | Was the sample size included in the analysis adequate? | 157 | 84 | 167 | 0.12^d |
F4 | Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? | 158 | 74 | 168 | 0.42 |
F5 | Was the expected direction of correlations or mean differences included in the hypotheses? | 159 | 75 | 169 | 0.26^e |
F6 | Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? | 159 | 82 | 168 | 0.29^e |
F7^c | for convergent validity: Was an adequate description provided of the comparator instrument(s)? | 125 | 83 | 136 | 0.30 |
F8^c | for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? | 124 | 81 | 135 | 0.35 |
F9^c | Were there any important flaws in the design or methods of the study? | 131 | 81 | 145 | 0.17 |
Statistical methods | | | | | |
F10 | Were design and statistical methods adequate for the hypotheses to be tested? | 150 | 78 | 161 | 0.00^d,e,f |
Box G. Cross-cultural validity (n = 33)^b | | | | | |
Design requirements | | | | | |
G1^c | Was the percentage of missing items given? | 25 | 88 | 32 | 0.52 |
G2^c | Was there a description of how missing items were handled? | 22 | 82 | 30 | 0.32 |
G3 | Was the sample size included in the analysis adequate? | 26 | 81 | 33 | 0.23 |
G4^c | Were both the original language in which the HR-PRO instrument was developed, and the language in which the HR-PRO instrument was translated described? | 28 | 89 | 33 | 0.34^d |
G5^c | Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages | 28 | 86 | 33 | 0.46 |
G6 | Did the translators work independently from each other? | 28 | 89 | 33 | 0.61 |
G7 | Were items translated forward and backward? | 28 | 100 | 33 | 1.00 |
G8^c | Was there an adequate description of how differences between the original and translated versions were resolved? | 28 | 86 | 33 | 0.50 |
G9^c | Was the translation reviewed by a committee (e.g. original developers)? | 25 | 88 | 31 | 0.56 |
G10^c | Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? | 21 | 90 | 29 | 0.61 |
G11^c | Was the sample used in the pre-test adequately described? | 28 | 79 | 32 | 0^f |
G12 | Were the samples similar for all characteristics except language and/or cultural background? | 26 | 81 | 31 | 0.41 |
G13^c | Were there any important flaws in the design or methods of the study? | 26 | 85 | 31 | 0.42 |
Statistical methods | | | | | |
G14 | for CTT: Was confirmatory factor analysis performed? | 27 | 74 | 32 | 0.03^e,f |
G15 | for IRT: Was differential item function (DIF) between language groups assessed? | 13 | 77 | 23 | 0.28^e,f |
Box H. Criterion validity (n = 57)^b | | | | | |
Design requirements | | | | | |
H1^c | Was the percentage of missing items given? | 35 | 91 | 56 | 0.59^d |
H2^c | Was there a description of how missing items were handled? | 35 | 97 | 56 | 0.79^d |
H3 | Was the sample size included in the analysis adequate? | 35 | 69 | 54 | 0.06 |
H4 | Can the criterion used or employed be considered as a reasonable 'gold standard'? | 37 | 62 | 57 | 0^f |
H5^c | Were there any important flaws in the design or methods of the study? | 33 | 79 | 54 | 0.10 |
Statistical methods | | | | | |
H6 | for continuous scores: Were correlations, or the area under the receiver operating curve calculated? | 37 | 78 | 56 | 0.16^e |
H7 | for dichotomous scores: Were sensitivity and specificity determined? | 29 | 83 | 47 | 0.28^e,f |
Box I. Responsiveness (n = 79)^b | | | | | |
Design requirements | | | | | |
I1^c | Was the percentage of missing items given? | 71 | 82 | 76 | 0.14^d |
I2^c | Was there a description of how missing items were handled? | 73 | 92 | 77 | 0.36^d |
I3 | Was the sample size included in the analysis adequate? | 72 | 72 | 76 | 0.40 |
I4^c | Was a longitudinal design with at least two measurements used? | 73 | 100 | 78 | 1.00^d |
I5^c | Was the time interval stated? | 73 | 89 | 78 | 0.25^d |
I6^c | If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? | 72 | 78 | 75 | 0.17 |
I7^c | Was a proportion of the patients changed (i.e. improvement or deterioration)? | 70 | 97 | 73 | 0.32^d |
Design requirements for hypotheses testing | | | | | |
For constructs for which a gold standard was not available | | | | | |
I8 | Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? | 65 | 69 | 72 | 0.35 |
I9 | Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 60 | 78 | 65 | 0.19^e |
I10 | Were the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 61 | 90 | 66 | 0.05^d,e |
I11^c | Was an adequate description provided of the comparator instrument(s)? | 56 | 70 | 63 | 0^f |
I12^c | Were the measurement properties of the comparator instrument(s) adequately described? | 56 | 80 | 63 | 0.06 |
I13^c | Were there any important flaws in the design or methods of the study? | 63 | 71 | 68 | 0.03 |
Statistical methods | | | | | |
I14 | Were design and statistical methods adequate for the hypotheses to be tested? | 63 | 73 | 67 | 0.21^e,f |
Design requirements for comparison to a gold standard | | | | | |
For constructs for which a gold standard was available | | | | | |
I15 | Can the criterion for change be considered as a reasonable 'gold standard'? | 21 | 67 | 28 | 0^f |
I16^c | Were there any important flaws in the design or methods of the study? | 12 | 67 | 21 | 0^f |
Statistical methods | | | | | |
I17 | for continuous scores: Were correlations between change scores, or the area under the Receiver Operator Curve (ROC) curve calculated? | 28 | 79 | 39 | 0.47^e,f |
I18 | for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? | 28 | 79 | 37 | 0.15^e |
Box J. Interpretability (n = 42)^b | | | | | |
J1^c | Was the percentage of missing items given? | 22 | 95 | 41 | 0.80 |
J2^c | Was there a description of how missing items were handled? | 21 | 76 | 41 | 0.19 |
J3 | Was the sample size included in the analysis adequate? | 23 | 74 | 41 | 0^f |
J4^c | Was the distribution of the (total) scores in the study sample described? | 23 | 74 | 41 | 0.08 |
J5^c | Was the percentage of the respondents who had the lowest possible (total) score described? | 20 | 95 | 40 | 0.84 |
J6^c | Was the percentage of the respondents who had the highest possible (total) score described? | 21 | 90 | 41 | 0.70 |
J7^c | Were scores and change scores (i.e. means and SD) presented for relevant (sub) groups? e.g. for normative groups, subgroups of patients, or the general population | 21 | 76 | 41 | 0.05 |
J8^c | Was the minimal important change (MIC) or the minimal important difference (MID) determined? | 19 | 89 | 40 | 0.26^d |
J9^c | Were there any important flaws in the design or methods of the study? | 21 | 71 | 41 | 0^f |
Item nr | Item | N (minus articles with 1 rating)^a | % agreement | N | Kappa |
---|---|---|---|---|---|
Generalisability Box (n = 866)^b | ^c | | | | |
Was the sample in which the HR-PRO instrument was evaluated adequately described? In terms of: | | | | | |
1^d | median or mean age (with standard deviation or range)? | 733 | 86 | 865 | 0.36 |
2^d | distribution of sex? | 735 | 88 | 863 | 0.38^e |
3 | important disease characteristics (e.g. severity, status, duration) and description of treatment? | 746 | 80 | 862 | 0.39^f |
4^d | setting(s) in which the study was conducted? e.g. general population, primary care or hospital/rehabilitation care | 735 | 89 | 863 | 0.30^e |
5^d | countries in which the study was conducted? | 733 | 90 | 861 | 0.40^e |
6^d | language in which the HR-PRO instrument was evaluated? | 733 | 86 | 861 | 0.41^e |
7^d | Was the method used to select patients adequately described? e.g. convenience, consecutive, or random | 729 | 81 | 857 | 0.40 |
8 | Was the percentage of missing responses (response rate) acceptable? | 724 | 82 | 849 | 0.48 |