Background
Health-related quality of life (HRQoL) is a crucial outcome metric used in settings from clinical trials [1, 2] to population health surveillance [3‐7]. The Veterans RAND questionnaire (VR) is a multi-attribute generic instrument measuring patient-reported HRQoL. The instrument has a long form (VR-36) and a short form (VR-12), both measuring a physical component summary (PCSVR) and a mental component summary (MCSVR). The VR-36 also comprises eight scales, which correspond closely to those of the Medical Outcomes Study (MOS) Short Form 36 version 1.0 (SF-36, [8‐10]).
The VR instruments were created to address the veteran population in the United States (US) [11]. The Veterans Health Administration (VHA) is a national health care system that serves over nine million military veterans in the US. It is one of the largest integrated health care systems in the US. This patient population has special medical needs, is older, poorer and sicker (with more diseases than veterans nationally), and has a higher percentage of men than the general adult population [12‐14]. The creation of the VR instruments has been previously documented [13‐16], and the instruments have been shown to be valid for the VA population [13, 17‐27] as well as other general US populations [28‐35]. The English-language VR instruments have become an integral part of registries [36] and studies of national US health programs [18, 37, 38], including the evaluation of the Medicare Advantage Program by the Centers for Medicare and Medicaid Services (CMS). Advantages of the VR instruments include their validity in older and sicker populations, their availability (all instruments are in the public domain) and their strong psychometric properties across different and wide-ranging socio-demographic and clinical groups.
In this study, we translated and culturally adapted the VR-36 into the German language (Germany) and validated the VR-36 and VR-12 in a population of German patients undergoing inpatient rehabilitation. The German VR-36 and VR-12 were comprehensively validated and compared to the SF-36 and SF-12 in inpatient populations of orthopedic and psychosomatic rehabilitation patients (the two largest clinical indications of German inpatient rehabilitation patients).
The SF-36 and the SF-12 are considered gold standards of self-assessed generic health instruments and they have been extensively distributed and used across a wide range of countries, populations and purposes. They are recommended for measuring patient outcomes in the medical rehabilitation setting in Germany [39‐42]. Since the field of medical rehabilitation has been one of the most common applications of the SF-36 in the German-speaking countries, it was important to compare the measurement properties of the VR instruments to the SF instruments in this setting.
Measures
In addition to the VR and SF instruments, the patient questionnaires contained several other self-report measures. These measures were chosen to correspond to the eight scales and the summary scores of the VR instruments in order to validate the VR instruments.
The EQ-5D-5L questionnaire is an internationally widely used preference-based measure of self-assessed health [44‐46]. The questionnaire measures impairments in five dimensions of health using five items, each with five levels of impairment, and a thermometer-like visual analogue scale (EQ VAS). The values of the five items can be converted into a preference-based single utility index. In the present study, index values were calculated using the German tariff [47].
The Centers for Disease Control and Prevention (CDC) “Healthy Days” is a generic HRQoL questionnaire containing four items measuring self-rated health and the number of disability days (out of the last 30) due to physical and mental health or limitations in activities [48, 49]. The instrument is valid and reliable [48].
The Hannover Functional Abilities Questionnaire (HFAQ) is a 12-item generic measure of (physical) functional ability in daily activities [50‐52]. Each item has three levels of functioning. All items can be combined into an additive summary score.
The Depression, Anxiety and Stress Scale (DASS) is an extensively validated measure of mental health [53, 54]. In this study, the short form (21-item, DASS-21) instrument was used.
The Graded Chronic Pain Scale (GCPS) is an internationally established instrument developed by Von Korff et al. [55, 56]. The seven-item GCPS measures self-rated pain intensity and pain disability using 0 to 10 numeric rating scales, plus one item on the number of disability days due to pain in the past three months. Summation of GCPS items produces scores describing pain intensity and pain disability.
The Index for the Assessment of Health Impairments, IMET [57, 58], measures participation as defined by the WHO International Classification of Functioning, Disability and Health, ICF. The 9-item questionnaire has been applied and tested in several samples of rehabilitation patients with different clinical indications. It is suitable as a screening method to assess the risk of failed professional reintegration of rehabilitation patients. The instrument has been shown to be an economical, highly practicable, valid and reliable operationalization of “activities and participation” according to the concept of the ICF. Norm values for the IMET were assessed in a random sample of Lübeck inhabitants between 19 and 79 years of age, and enable classification of limitations in participation for people undergoing rehabilitation or suffering from chronic diseases.
The vitality subscale of the Indicators of the REhabilitation Status (IRES-VE) was included to examine the construct validity of the VR items on vitality [59]. In Germany, the IRES is recommended (in addition to the SF-36) for rehabilitation research and practice [42].
Statistical analysis
The VR-36 and the VR-12 were analyzed regarding the completeness of data at the scale level, distributional properties, construct validity, known-groups validity, internal consistency (as one aspect of reliability), and responsiveness to change. This was done for the summary scores of the VR-36 and the VR-12 (physical component score (PCSVR) and mental component score (MCSVR)) as well as the eight VR-36 scales: physical functioning (PFVR-36), role functioning/physical (RPVR-36), role functioning/emotional (REVR-36), vitality (VTVR-36), mental health (MHVR-36), social functioning (SFVR-36), pain (BPVR-36), and general health (GHVR-36). The VR instruments have not previously been used in German populations and normed scores have not yet been developed. Therefore, summary scores and scales were scored according to the VR-36 and VR-12 algorithms, using a t-score transformation with a mean of 50 and a standard deviation of 10, normed to a general sample of the US population for the summary scales (PCS and MCS) [23, 60‐62]. The scoring algorithms for the VR-36 and the VR-12 impute missing data. The VR-12 algorithm extrapolates scores based on the pattern of missing items; the VR-36 algorithm applies mean imputation at the subscale level if fewer than 50% of the subscale items are missing. In all analyses, all available data were used (available case analysis). Because the SF-36 and the SF-12 instruments are well validated across a range of populations, they were used as the comparator to the VR instruments for all analyses.
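The subscale-level rule and the t-score transformation described above can be illustrated with a minimal Python sketch. It is not the official VR-36 scoring algorithm: the function names, the example responses and the reference mean and SD are hypothetical placeholders, and the pattern-based extrapolation used for the VR-12 is not shown.

```python
from statistics import mean

def impute_scale(items):
    """Replace missing items (None) with the mean of the answered items.

    Returns None when half or more of the items are missing, i.e. the
    scale is only completed if fewer than 50% of its items are missing.
    """
    answered = [x for x in items if x is not None]
    n_missing = len(items) - len(answered)
    if not answered or n_missing / len(items) >= 0.5:
        return None
    fill = mean(answered)
    return [fill if x is None else x for x in items]

def t_score(raw, ref_mean, ref_sd):
    """Standardize a raw score to a reference mean of 50 and SD of 10."""
    return 50.0 + 10.0 * (raw - ref_mean) / ref_sd

# Hypothetical 4-item scale with one missing response (made-up values).
responses = [3, None, 4, 5]
completed = impute_scale(responses)
scale_score = mean(completed)
# ref_mean and ref_sd are placeholders, not the published US norms.
print(completed, scale_score, t_score(scale_score, ref_mean=3.5, ref_sd=1.0))
```

With two of four items missing, `impute_scale` returns `None`, so the scale would be treated as missing rather than imputed.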
Completeness of data is an indicator of data quality and acceptance of the questionnaire by respondents. The percentage of non-missing responses was calculated for the eight VR-36 scales, stratified by respondent characteristics (e.g. clinical indication, age, sex, education). No imputation was carried out to deal with missing data for statistical analyses.
Distributional properties (such as means, standard deviations and range) for the VR instruments were analyzed on the scale and summary score levels. To compare the distributional properties of the PCS and MCS for both the VR-12 and SF-12 as well as the VR-36 and the SF-36, classical statistical indices of distribution such as mean, standard deviation, minimum, maximum, skewness (to assess and compare the type and strength of asymmetry) and kurtosis (as a measure of the steepness or flatness of the frequency distribution) were assessed. The Kolmogorov-Smirnov test was used to compare the distributions of the two summary scores of the VR and the SF, i.e. PCSVR and PCSSF as well as MCSVR and MCSSF. Kernel density plots using the Epanechnikov function were used to visually examine the distributions of summary scores and scales.
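As a rough illustration of these indices, the sketch below computes skewness, excess kurtosis, the two-sample Kolmogorov-Smirnov statistic and an Epanechnikov kernel density estimate in plain Python. The scores and the bandwidth are made up, the function names are hypothetical, and no p-value is computed; in practice a statistics package would be used, as in the study.

```python
from statistics import mean, pstdev

def skewness(xs):
    """Moment-based skewness (0 for a symmetric distribution)."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

def epanechnikov_kde(xs, x, bandwidth):
    """Density estimate at x with kernel K(u) = 0.75 * (1 - u^2), |u| <= 1."""
    total = 0.0
    for v in xs:
        u = (x - v) / bandwidth
        if abs(u) <= 1:
            total += 0.75 * (1 - u * u)
    return total / (len(xs) * bandwidth)

# Made-up summary scores for two instruments (not study data).
vr_scores = [30, 35, 40, 45, 50]
sf_scores = [40, 45, 50, 55, 60]
print(skewness(vr_scores), excess_kurtosis(vr_scores))
print(ks_statistic(vr_scores, sf_scores))
print(epanechnikov_kde(vr_scores, 40, bandwidth=10))
```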
Construct validity refers to the degree of accuracy with which a measurement instrument captures the construct it claims to measure. To examine construct validity, Pearson correlation coefficients (rp) between VR summary scores (PCSVR and MCSVR) and other self-completed health measures were assessed. We compared these to the correlations between the PCSSF and MCSSF and other self-completed health measures. Correlation coefficients were compared using significance tests for correlations from independent samples [63]. The correlations between PCSVR and other self-reported physical health measures (e.g. HFAQ, CDC Physical unhealthy days, GCPS Disability) were expected to be higher (convergent validity) than those with self-report measures of mental health (divergent validity). Similarly, MCSVR was expected to be more strongly correlated with self-reported mental health measures (e.g. DASS-Anxiety, DASS-Stress, DASS-Depression, CDC Mental unhealthy days) than with physical measures. Both PCS and MCS were expected to be similarly correlated with generic self-report measures (e.g. EQ VAS, IMET) and GCPS-Pain. Correlations were interpreted as follows: rp < 0.1 small, 0.3 ≤ rp < 0.5 moderate, rp ≥ 0.5 high/strong [64].
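One common way to compare two Pearson correlations from independent samples is Fisher's z-transformation; whether this is the exact procedure described in [63] is an assumption. The sketch below uses made-up correlations and sample sizes, and the function name is hypothetical.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher z-test for the difference of two independent correlations.

    Returns the z statistic and a two-sided p-value based on the
    standard normal distribution.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p

# Made-up example: r = 0.60 in one group vs. r = 0.40 in another,
# 103 respondents each (not study data).
z, p = compare_independent_correlations(0.60, 103, 0.40, 103)
print(round(z, 3), round(p, 3))
```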
Known-groups validity is a criteria-based technique to investigate the ability of a measure to discriminate between groups known to differ in the construct of interest. For this study, known groups were defined by clinical indication (psychosomatic, orthopedic), treatment program (“curative therapy”, typically for chronically ill patients, vs. “medical follow-up treatment”, generally after joint replacement; only for orthopedic patients), age (< 45 years, 45–65 years, > 65 years), duration of rehabilitation (median), sick days in the past 12 months, and self-rated health (SRH, “excellent/very good/good” vs. “fair/poor”). We examined whether mean PCSVR and mean MCSVR scores differed significantly between those pre-defined groups using t-tests for two groups or ANOVA for more than two groups.
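For illustration, a one-way ANOVA F statistic can be computed as follows. This is a sketch with made-up scores for three hypothetical age groups; it returns only the F value and its degrees of freedom and leaves the p-value to a statistics package. With exactly two groups, F equals the square of the pooled-variance t statistic.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Made-up scores for three hypothetical groups (not study data).
age_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(age_groups))
```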
Internal consistency (IC) is a measure of reliability. A scale is considered reliable if its items are homogeneous, i.e., highly correlated because they measure the same underlying construct [65]. In this study, Cronbach's alpha was used as a measure of IC with α ≥ 0.7 interpreted as acceptable, α ≥ 0.8 as good, and α ≥ 0.9 as excellent.
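Cronbach's alpha can be computed from the item variances and the variance of the total score. The sketch below uses the standard formula with made-up responses; the function names and the label for values below 0.7 are our own, since the text only names the three bands above.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def interpret_alpha(alpha):
    """Apply the thresholds used in the text (0.7 / 0.8 / 0.9)."""
    if alpha >= 0.9:
        return "excellent"
    if alpha >= 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    return "below 0.7"  # label not taken from the source

# Made-up responses of four respondents to a 3-item scale (not study data).
answers = [[1, 2, 2], [2, 2, 3], [3, 4, 4], [4, 4, 5]]
alpha = cronbach_alpha(answers)
print(round(alpha, 3), interpret_alpha(alpha))
```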
Responsiveness refers to a self-assessed health instrument’s ability to capture changes in health over time [66]. The raw differences in SF and VR summary scores from t1 to t2 were divided by the pooled standard deviation of change to produce standardized response means (SRM), or by the baseline standard deviation to produce standardized effect sizes (SES). Because we assessed patients before and after an intensive treatment, analyses were restricted to respondents who reported stable (t1 = t2) or improved (t1 < t2) health on a single SRH item (n = 133) in order to assess responsiveness to health improvements. We further checked improvement (from t1 to t2) for all PCS and MCS scores of all four instruments using paired t-tests. The magnitude of changes in scores (expressed as SRM and SES) was interpreted as follows: values < 0.3 were considered small, values between 0.3 and 0.59 medium, and values ≥ 0.6 large [67]. Since there are different methods to estimate the magnitude of change within groups, and consensus is lacking on their interpretation [68], we calculated both SES and SRM for comparison purposes. Because of the repeated-measurement design, the measurements are correlated, which has been shown to affect the magnitude of the SRM [69]; to account for this, we additionally correlated both measurements (Pearson correlation coefficient, rt1/t2).
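The two responsiveness indices can be sketched as follows. The example assumes that the SRM denominator is the sample standard deviation of the individual change scores, which is the usual definition; the scores are made up and the function names are hypothetical.

```python
from statistics import mean, stdev

def srm(t1, t2):
    """Standardized response mean: mean change / SD of the change scores."""
    change = [after - before for before, after in zip(t1, t2)]
    return mean(change) / stdev(change)

def ses(t1, t2):
    """Standardized effect size: mean change / SD of the baseline scores."""
    change = [after - before for before, after in zip(t1, t2)]
    return mean(change) / stdev(t1)

def magnitude(value):
    """Interpretation bands from the text: < 0.3, 0.3 to 0.59, >= 0.6."""
    size = abs(value)
    if size < 0.3:
        return "small"
    if size < 0.6:
        return "medium"
    return "large"

# Made-up summary scores of four patients at admission (t1) and discharge (t2).
scores_t1 = [30, 35, 40, 45]
scores_t2 = [36, 39, 48, 47]
print(round(srm(scores_t1, scores_t2), 3), magnitude(srm(scores_t1, scores_t2)))
print(round(ses(scores_t1, scores_t2), 3), magnitude(ses(scores_t1, scores_t2)))
```

Because the SRM depends on the variability of change and the SES on the variability at baseline, the two indices can differ noticeably for the same data, which is one reason for reporting both.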
Data were analyzed using IBM SPSS Statistics 24 and Stata SE 13. Wherever applicable, analyses were stratified by clinical indication (orthopedic or psychosomatic rehabilitation).
Discussion
This research project (1) translated and culturally adapted the English VR-36 to the German language (Germany) and (2) validated the adapted VR-36 and VR-12 in German orthopedic and psychosomatic inpatient rehabilitation patients. This article provides details of the translation and cultural adaptation process of the German VR and the main findings of the validation study.
The German translation of the VR was prepared according to "state of the art" criteria for cultural adaptation of self-assessed health questionnaires using forward and backward translations. The study produced a self-report questionnaire that is conceptually and semantically equivalent to the English-language VR-36. The only difficulty during translation concerned the role physical (RP) and role emotional (RE) items, which produced double negatives when the question stems and responses were taken together. This was resolved by a slight change in the wording of the response categories.
The German VR-36 is the third cultural adaptation and translation of the VR after the Spanish and the Chinese versions. Three more language versions (Japanese, Russian, Polish) are being planned.
The validation phase of this study found the VR instruments to be acceptable, valid and moderately to strongly responsive to improvements in health. We indirectly compared the German VR-36 and VR-12 to the well-established SF-36 and SF-12, and found the instruments to be comparable in their distributional properties, validity, and responsiveness. Data quality indicators, such as the extent of item non-response, show the VR to be acceptable instruments in a German rehabilitation population, and were similar to those of the SF instruments. PCS score distributions were similar for VR and SF instruments. However, the MCSVR was distributed more in the lower range of the scale than the MCSSF. The VR scales and summary scores were moderately to strongly correlated with expected external measures such as self-reported pain, physical functioning, mental functioning and disability. Both the long and the short form of the VR could distinguish between patient types (orthopedic and psychosomatic), durations of rehabilitation and levels of self-rated health, while both PCSVR-12 and PCSVR-36 could also distinguish between types of therapy, and PCSVR-12 could distinguish whether the patient had over 100 sick days in the last year. The short version (VR-12) was as responsive as the VR-36 and SF-36. Thus, the VR was established as a valid and responsive measure of quality of life in orthopedic and psychosomatic samples of German inpatient rehabilitation patients.
The number of studies using one of the instruments of the VR family is increasing every year, with well over 400 publications [70]. The developers of the VR family provided the original psychometric evidence for the VR-36 and VR-12 [13, 15, 16, 23].
Item-level missing values were low and comparable to other studies, suggesting high acceptability. While in this study 1.8% to 6.5% of responses were missing per question for the baseline VR-36, Kronzer et al. [71] reported missing values in adult patients undergoing elective surgery of 1.5 to 3.7% per question on the baseline VR-12 and of 3.3 to 8.9% on the follow-up VR-12 (median 56 days).
Descriptive statistics indicated acceptable distributional characteristics. Summary scale means and SDs of the PCSVR-36 are comparable with the results of the Veterans Health Study (VHS), in which the VR-36 was administered to nearly 2,500 veterans receiving ambulatory care (VHS PCSVR-36: 37.12 ± 11.85, this study: 38.50 ± 10.2), but the MCSVR-36 is different (VHS: 47.81 ± 12.23, this study: 36.2 ± 14.2) [17]. The differences in MCS may be a function of the populations sampled; although the means differed, the SDs were quite similar.
The validity results are comparable with other studies investigating physically impaired patients: a study of patients undergoing knee arthroplasty [31] found a moderate correlation between the PCSVR-12 and a disease-specific measure (KOOS-pain score: 0.57). Since only a few studies have investigated the factor structure of the VR-36, e.g. [60], this needs further investigation.
Oak et al. [31] found the PCSVR-12 to capture statistically significant improvements in n = 45 pre- and postoperatively tracked patients who underwent knee arthroplasty. They found no statistical differences in internal or external responsiveness to change among the EQ-5D, VR-12 and PROMIS 10 physical instruments, with SRMs of 0.681 for the PCSVR-12 and 0.103 for the MCSVR-12 (SRM EQ-5D: 0.704, PROMIS 10 physical: 0.721, PROMIS 10 mental: 0.083). An SRM of VR-12 scores between baseline and the end of therapy (0.549) can be calculated from the results of Levy et al.’s study of physical therapy received through tele-rehabilitation [73]. This is extremely similar to what we found for the VR-12 in orthopedic patients. Bedigrew et al.’s [74] study of an orthotic and rehabilitation program found statistically significant improvements only in the PCS but not in the MCS. For orthopedic patients, we found PCS to be less sensitive to changes in both SF and VR than the MCS, with the VR-12 similar or more sensitive to improvements than the SF instruments. However, the VR-36 was found to be slightly less sensitive to improvements than the SF-36 for psychosomatic patients.
Although the VR-36 and VR-12 are based on version 1 of the SF-36 and SF-12, the VR instruments use a five-level response format for the role physical and role emotional scales, whereas the SF version 1 instruments use a two-level format. The SF version 2 uses five-level response scales for those scales, but has slightly different wording and is in general a different instrument than version 1. This difference is likely the source of differences in distribution and responsiveness in our comparison of the VR to SF version 1 instruments. The floor was raised and the ceiling lowered with the 5-point set of response choices for the role physical and role emotional scales compared with the dichotomized choices for the SF version 1 instruments [16]. Previous findings suggest that this could also be a possible explanation for the differences in responsiveness [16]. Gornet et al. [35] investigated the conversion of the SF-36 to PCSVR-12 and MCSVR-12 in 1,968 patients who underwent lumbar (n = 1559) and cervical (n = 409) surgery between 1998 and 2013. They found the SF-36 and converted VR-12 mean scores, the mean (pre to post) change scores for PCS and MCS, and the minimum detectable change (MDC) to be extremely similar. However, as their study only collected SF-36 data, they could not compare how 2-level and 5-level response categories in the two scales might differ.
The primary limitation of this study is the indirect comparison of the instruments: the VR-36, VR-12, SF-36 and SF-12 were completed by different patients. The design choice was to minimize respondent burden and frustration as the four instruments are very similar. Although patients were randomized to the study arms, there could be underlying differences across the groups not captured by demographic or patient characteristics. Thus, it is possible that the detected distribution and responsiveness differences may in part be due to differences in the sample characteristics and perhaps unmeasured variables and not due to the instruments themselves.
Due to the length of the time interval between measurements (four to six weeks) and the intervention, it was not feasible to investigate test-retest reliability. Even after a week, which is the usual lag time between test and retest, we would expect patients to change as they are undergoing intense rehabilitation treatment. This is why we investigated internal consistency as a measure of internal reliability. However, test-retest reliability is still to be investigated for the German version of the VR.
Furthermore, the German VR was validated in an inpatient rehabilitation setting, and the results may not be generalizable to other populations or to outpatient rehabilitation settings. Future research applying the German VR in other settings is necessary. The instruments were also administered only as a paper-and-pencil survey. As self-assessment questionnaires are increasingly being used in electronic formats, the comparison between the classical paper-and-pencil format and newer computer-based applications should be studied.
Since this is the first study of this new German instrument, which aimed to adapt and test it in the German population, German norms have not yet been developed. Therefore, for the evaluation in this study, we relied on the US norms. Developing German norms will be one of the next steps of instrument development.
Acknowledgements
This research was part of a project funded by the German Pension Insurance (DRV Nord, Germany, Grant No. 205). We want to thank MEDIAN Klinik Bad Sülze, MediClin Dünenwald Klinik Trassenheide, “Moorbad” Bad Doberan, MEDIAN Klinik Heiligendamm, Reha-Klinik “Garder See” GmbH, Lohmen for recruiting patients and fruitful cooperation, and Shasi Poon for providing professional copy editing services. We want to thank Daniel Bullinger (DB), freelance translator established 1990 in Hamburg, Germany, and Stephen C. France (SF), generally sworn interpreter from the Hanover Regional Court and authorized translator for the English language, for forward and backward translation of the VR-36/12.