Target Questionnaire: Translation Procedure
Questionnaire translation can be approached in many different ways, from the classical back-translation pioneered by Brislin forty years ago [19] to the more recent TRAPD procedure and its derivatives [6, 9]. Different approaches are justified by different goals, so the actual goals (and their priority) should always be declared
before beginning the translation work. For a medical questionnaire, at least three main objectives can be identified: to preserve "equivalence"; to obtain a psychometric tool "useful" in the clinic and in clinical trials; and to attain full "comprehensibility" of the medical questions. Equivalence is what we commonly expect from a translation. What is really meant depends greatly on the researcher, so much so that Herdman et al. could identify no fewer than 19 different meanings for this term [10]. Clinical usefulness must be interpreted here as usefulness for the TCM practitioner. It includes using the questionnaire as a convenient filing system for anamnesis, but also providing a quantitative outcome for clinical trials. Comprehensibility is related both to the TCM theory and to the local cultural context. When a medical questionnaire is translated from a source language to a target language, the source and the target populations often share the same medical paradigms. When this happens, the three above-mentioned objectives are likely not to interact with each other, or to interact minimally. As the medical theory is shared, the target and source populations also share a sort of common language.
In our case the situation is different. Not only do we have to bridge two totally different languages, we also have to face different medical paradigms. The main result is that our three objectives interact strongly. An excessive effort towards equivalence may be detrimental to comprehensibility. Each patient interprets questions on the basis of his or her cultural context. The risk is that an Occidental patient, when answering a TCM question, misinterprets it, and therefore does not provide what is actually useful for the TCM practitioner. These interaction mechanisms are at work in any translation, but may be particularly relevant here. Since the three objectives cannot all be reached to the same degree simultaneously, the choice of priorities must be made explicit. Of course, this choice influences the selection of the translation procedure.
Our first priority was clinical usefulness. Equivalence was of course a concern, but a secondary one. Generally speaking, equivalence is desirable "for the cross-cultural comparison of results to be valid" [10]. The idea is that scores from different trials might be compared, for example in multicentre trials. As the questionnaire, conceived in a Chinese cultural context, was applied to Occidental patients, serious threats to equivalence were to be expected anyway. Therefore, we decided that giving priority to the equivalence issues would be inadvisable whenever comprehensibility and clinical usefulness were at stake. This does not necessarily imply that equivalence is not ensured, but equivalence will have to be substantiated a posteriori. The specific case of operational equivalence is considered in the next section.
A modified TRAPD procedure was considered more suitable than back-translation for achieving our objectives. Weaknesses and inadequacies of back-translation have been summarized by Harkness et al. (see [20], page 468). Ponce et al. discuss some potential flaws of back-translation, and clearly warn that "translators have an incentive to choose word-for-word translations instead of striving for concept equivalence" [21].
The original Chinese version is written with clear and concise wording. This is due partly to the nature of the TCM lexicon, which rarely uses specialized words to designate syndromes, and partly to the original authors, who obviously made an effort to simplify questions. This is one of the reasons why we considered it safe to rely on one main translation only. In fact, the entire process up to the final version was not a direct, straightforward translation. It was a careful balancing of the linguistic issues, of the psychometric characteristics, and of the adaptation to the cultural (and medical) context. The main translation could have been the final version, but the secondary translation emphasized issues of measurement equivalence, and the team discussions delved more deeply into adherence to TCM theory. It is only the harmonious fusion of these three aspects that allowed a meaningful and useful final version. This attempt at fusion is the core of our translation, when compared with other procedures. Of course, we do not recommend our method for the general case, where it would be unnecessarily burdensome and time-consuming. However, it proved to be efficient for the ChQoL. We suggest its use whenever the translation targets deeply different cultures, with very different medical contexts.
Target Questionnaire: Response Scales
The response scale originally proposed for the ChQoL was a five-point Likert scale [2]. In this work, we intentionally adopted a VAS. Apart from a careful consideration of the general advantages and disadvantages (a critical discussion of VAS can be found in [22-24]), our choice to depart from the original scale was motivated by four reasons.
First, we were particularly interested in the actual score distribution. Several items ask questions which, although perfectly intelligible, are rarely related to HRQoL in Occidental countries. For example, were the respondents able to utilize the entire continuous scale? And, if so, how widespread was this practice among respondents? Did they simplify their task assuming an essentially dichotomous model of good/poor health? A five-point Likert scale, which provides ordinal data, could in principle answer some of these questions, but a continuous scale was considered more suitable for our purpose.
Second, in the initial round of debriefing interviews we found some resistance to the five-point Likert scale. Several respondents found this scoring method unnatural, especially when the question concerned expressing emotions. The threat of annoyance is really important for our O.R. Unit, because of the poor health conditions and the high psychological reactivity of some patients.
Third, a VAS is known to be sensitive and reproducible [25-28]. It is widely used in oncology, even for multidimensional instruments [29]. In some cases, like pain assessment, a VAS is preferable to other kinds of scale, because it provides a closer description of the patients' experiences [30]. These characteristics are particularly useful in TCM clinical trials. TCM therapies may bring clinical results which, in the short term, are weaker than those brought by many pharmacological therapies. In these cases, a higher psychometric sensitivity is obviously of help.
Fourth, respondents dealing with an analogue scale in a test-retest have less chance to recall their previous answers in order to show consistency [24]. Test-retest is an important aspect of reliability. Although we do not consider it in this paper, we plan to investigate it in the future.
Our interpretation of the preference for the VAS among our patients is that evaluating our emotional status requires placing ourselves in a continuum. With the Likert scale, the respondent has to mentally adapt each of the 5 responses to an emotional status, and then decide if that answer "fits". The same question is likely to be re-read several times (possibly five, with really inattentive respondents). With the continuous VAS the respondent only has to spot the correct orientation of the scale regarding the question. The task requires less linguistic and comprehension effort, and is more intuitive and straightforward. On the whole, it is less stressful.
This interpretation is founded on explicit feedback from the respondents during the first round of the retrospective debriefing interviews. One common comment was that joy, anger, depression or fear (items 33 to 50) are hardly quantifiable by ticking boxes. Other respondents felt "forced" into one of the five choices, which was unpleasant for them. However, results from other researchers contrast with our interpretation. Guyatt et al. [31] consider filling Likert scales more intuitive than selecting a position on a continuous line. Children and elderly people have been reported to prefer a Likert scale to a VAS, or to have problems understanding the VAS itself [32-35]. Gift reviews some difficulties reported for VAS [17]. Generally speaking, the preference for one scale over another depends both on the scale and on the respondents. It is likely that different groups react in different ways. Our group was made up of female oncological patients, and comparative studies with different groups could help clarify this point.
Another departure from the Chinese source lies in the orientation of the response scales. In the ChQoL-CN, 22 items out of 50 had a reverse (i.e. negative) polarity, the highest score corresponding to the poorest health status. Sometimes questionnaires are designed in such a way that polarity is reversed in approximately 50% of the items, in an attempt to force the respondent to pay more attention to the question, and avoid bias. This was not the original aim of the Chinese authors, as apparent from the distribution of the scales among Facets. In the ChQoL-CN, all items in Facets "Complexion" (4 items) and "Joy" (4 items), as well as in all the 4 Facets included in the "Vitality & Spirit" Domain (12 items), are positively oriented, whilst the Facets "Depression" (6 items) and "Fear" (3 items) show a reversed orientation. Obviously the developers' main goal was to optimize the response scale within the single Facet, whenever possible.
During the first round of debriefing interviews, it was found that the change in orientation from one item to another was confusing for many respondents and led to erroneous scoring. Consequently we decided to make all response scales conform to a positively oriented scale. This required the rephrasing of 22 questions. The second round of debriefing interviews showed no further problems concerning response scales.
Target Questionnaire: Equivalence
Assessing questionnaire equivalence is not an easy task. A convenient framework for equivalence is provided by Herdman et al. [11]. These authors identify six key types of equivalence: Conceptual, Item, Semantic, Operational, Measurement, and Functional. An exhaustive discussion of equivalence for the two ChQoL versions must be deferred to another paper. This discussion would also require more experimental data. Nonetheless, there are a few points which can be discussed here. They may bring to light some limitations of the present work.
Operational equivalence is the main issue. This kind of equivalence refers to "the possibility of using a similar questionnaire format, instructions, mode of administration and measurement methods" [11]. Adopting a VAS instead of a 5-point Likert scale, and rewording several items to conform to a positively oriented scale, does not necessarily mean that full Operational equivalence has been waived. Nevertheless, a VAS and a 5-point Likert scale cannot be claimed to be equivalent a priori. Hasson et al. show that a replacement of Likert scales with VAS is actually possible, but interchangeability is not necessarily ensured [36]. Lund et al. compare a VAS with a verbal rating scale, and find systematic disagreements when the VAS is transformed into a categorical scale [37].
Our adoption of a VAS was a trade-off between the full exploitation of the ChQoL psychometric potential for Italian patients and the aprioristic preservation of Operational equivalence. At this stage we are more interested in the former issue than in the latter. Our aim was to find a final version where the Italian patient would understand the significance of each question in exactly the same way as the Chinese patient. Within Herdman's framework, we tried to favor Conceptual and above all Semantic equivalence. Conceptual equivalence ensures that questions have "the same relationship to the underlying concept in both cultures", whilst Semantic equivalence "is concerned with the transfer of meaning across languages, and with achieving a similar effect on respondents in different languages" [11]. Our choice for a VAS and for a positive orientation of items was based on our relational experience with our patients, but it was particularly guided by the quotation above, regarding Semantic equivalence.
Our conclusions are founded on a specific sample. First of all, our respondents were Occidental patients. We by no means suggest that our choices are optimal for other cultures. For example, Wong et al. [5] studied the validity of the ChQoL in Hong Kong. In that context, it would have made no sense for Wong and colleagues to adopt our (or similar) choices for the response scales. These choices are useful for the Italian cultural context, but they may be totally unnecessary in different cultures. Secondly, our sample is made up of female oncological patients with a recent breast cancer diagnosis. We selected this sample because we deal with this kind of patient on a daily basis. Of course this sample is not generic, and it has peculiar characteristics. These patients may show heavy emotional and psychological suffering. They also may experience postural problems and limb disability. In a sense, our sample was a sort of "worst case benchmark" for the ChQoL. Our finding is that the ChQoL is robust enough to be applicable to this kind of patient, provided that some modifications to the response scale are implemented. Future research may find that no advantage is gained from the modified response scales when the sample comprises generic patients only. The numerical results in Table 1 should not be taken as a norm for generic populations. Our opinion is that equivalence (as a whole) is better preserved with our changes to the response scales than without. A literal translation is not necessarily faithful, as it may not preserve Semantic equivalence. We may be mistaken, and an experimental comparison of our results (on a wider and more generic sample) with those obtained from a Chinese cultural context is necessary to settle this issue. Until this comparison is completed, full equivalence between the ChQoL-IT and the ChQoL-CN cannot be claimed, and the ChQoL-IT should not be used for cross-cultural comparative studies.
Clinical Testing: Scores and Distribution
The raw scores are not normally distributed. The choice of a VAS instead of a Likert scale efficiently highlights this point. The ChQoL capability to reveal floor and ceiling effects is in fact of great psychometric interest. These effects may reflect the presence of psychological resistances. Being able to unveil these resistances is particularly useful for the clinical psychologist dealing with frail patients, as oncological patients often are.
Severe deviations from normality, including skewness and ceiling effects, are not uncommon in Patient Reported Outcomes (e.g. [38, 39]). Usually they can be reduced by modifying the response scales (e.g. [40, 41]). However, more than just simple skewness, here we deal with a multimodal distribution. All shapes in Figure 2 are consistent with a superimposition of two or three bell-shaped curves. The Two-step Cluster Analysis confirms that the score distribution is optimally split into 2 clusters for 35 items, and into 3 clusters for 15 items. One possible interpretation is that the respondents of the ChQoL are faced with unusual questions, which to them are seemingly unrelated to HRQoL. As a result, the respondents tend to simplify their task, dichotomizing their responses as "yes or no". Some of the items generate more indecision than others, so that a third intermediate peak may be found in the score distribution. As a whole, a three-peak model seems reasonable for all items, with a chance that one (or even two) peaks turn out to be too low to be detected. If this model holds, a 3-point Likert scale is naturally suggested. Its adoption would make the ChQoL nimbler, and simplify data collection. It would also make localization in other languages easier. A drawback could be reduced sensitivity to changes, or reduced discriminability among individuals. It has been reported that placebo effects may sometimes remain undetected when using strictly binary outcomes, whilst they are detected when using continuous outcomes [42].
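The floor and ceiling effects discussed in this section can be quantified item by item. The following is only a minimal sketch, not the procedure used in the study: it assumes VAS scores on a 0-100 scale, and the function name `floor_ceiling_fractions` and the 5-point `margin` are hypothetical choices made for illustration.

```python
def floor_ceiling_fractions(scores, low=0.0, high=100.0, margin=5.0):
    """Fractions of valid scores within `margin` of each end of the scale.

    Assumptions (not from the paper): scores lie in [low, high],
    and missing answers are encoded as None.
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no valid scores")
    n = len(valid)
    floor = sum(1 for s in valid if s <= low + margin) / n
    ceiling = sum(1 for s in valid if s >= high - margin) / n
    return floor, ceiling


# Made-up scores for one item (illustrative only, not study data)
item_scores = [2, 3, 4, 50, 55, 96, 97, 98, 99, 100]
floor, ceiling = floor_ceiling_fractions(item_scores)
print(floor, ceiling)
```

Large fractions at both ends of the scale for the same item would be consistent with the dichotomizing behaviour hypothesized above, although a histogram remains the more direct check.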
Clinical Testing: Additivity
In Tukey's test, additivity is conceived as a lack of interaction between the respondents and the items, in the framework of a linear model. Additivity is a desirable property, because it simplifies handling of missing data and development of concise indexes to sum up a scale. The main result from Table 4 is that almost one half of the Facets ("Complexion" and "Stamina"; "Verbal Expression"; "Joy", "Anger" and "Depression") show evidence of non-additivity. This suggests that using raw ChQoL data as outcomes in a clinical trial is not advisable. Raw scores should undergo some kind of pre-analysis correction. A corrective factor γ is provided by the TTN itself. Its aim is to yield additivity. When an Anscombe-Tukey transformation is applied, i.e. when all scores within one Facet are raised to the power γ, additivity is achieved for all Facets but "Joy". In fact, the transformation suggested by the TTN is not necessarily helpful for reducing non-additivity. The TTN assumes a quadratic model for the hypothetical respondent-by-item interaction. If the actual kind of non-additivity is different, the Anscombe-Tukey transformation may be ineffective. We can infer that there is a complex kind of respondent-by-item interaction for Facet "Joy". For this Facet, additivity can be achieved at p ≥ 0.05 using γ ≃ 3.9, but of course such a high γ heavily distorts the frequency distribution, and it is not of any practical use. It should be noted that Facet "Joy" belongs to the Emotional Domain: evaluating and expressing emotions is not an easy task, and a strong respondent-by-item interaction is more plausible than for the other two Domains.
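The test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the software actually used for Table 4: it applies the textbook one-degree-of-freedom formulation to a complete respondents-by-items table, and it estimates the power as 1 - B·(grand mean), where B is the coefficient of the product of respondent and item effects. The function name `tukey_nonadditivity` is hypothetical.

```python
def tukey_nonadditivity(y):
    """Tukey's one-degree-of-freedom test on a respondents-by-items table.

    `y` is a list of rows (one per respondent, one column per item)
    with no missing values. Returns (F, df_remainder, power_estimate).
    """
    n_rows, n_cols = len(y), len(y[0])
    df_rem = (n_rows - 1) * (n_cols - 1) - 1
    if df_rem < 1:
        raise ValueError("table too small for the test")
    grand = sum(sum(row) for row in y) / (n_rows * n_cols)
    # Main effects: deviations of row and column means from the grand mean
    alpha = [sum(row) / n_cols - grand for row in y]
    beta = [sum(y[i][j] for i in range(n_rows)) / n_rows - grand
            for j in range(n_cols)]
    ss_alpha = sum(a * a for a in alpha)
    ss_beta = sum(b * b for b in beta)
    if ss_alpha == 0 or ss_beta == 0:
        raise ValueError("row or column effects are all zero")
    # Cross-product term capturing interaction of the form alpha_i * beta_j
    cross = sum(alpha[i] * beta[j] * y[i][j]
                for i in range(n_rows) for j in range(n_cols))
    ss_nonadd = cross ** 2 / (ss_alpha * ss_beta)
    # Residual sum of squares of the purely additive model
    ss_resid = sum((y[i][j] - grand - alpha[i] - beta[j]) ** 2
                   for i in range(n_rows) for j in range(n_cols))
    ss_rem = ss_resid - ss_nonadd
    f_stat = ss_nonadd / (ss_rem / df_rem) if ss_rem > 0 else float("inf")
    # Estimated power to which the scores should be raised (assumed formula)
    power = 1.0 - (cross / (ss_alpha * ss_beta)) * grand
    return f_stat, df_rem, power


# Synthetic table whose square root is exactly additive (not study data)
table = [[(10 + a + b) ** 2 for b in (-1, 0, 1)] for a in (-2, -1, 0, 1, 2)]
f_stat, df_rem, gamma = tukey_nonadditivity(table)
print(f_stat, df_rem, gamma)
```

For this synthetic table the estimated power comes out close to 0.5, as expected for data that become additive after a square-root transformation. The F statistic has 1 and `df_rem` degrees of freedom; the corresponding p-value can be read from an F distribution in any statistics package.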
Obtaining additivity for 12 out of 13 Facets is a satisfactory result. Nonetheless the γ values are tailored to our specific sample, and they are all different. In order to make practical applications easier, we explored two alternatives. The first is to compute γ at the Domain level, and not at the Facet level. Then the γ can be applied either to the pool of items within one Facet or to the pool of items within one Domain, and the TTN can be run again. When applied within Facet, problems are again encountered for Facet "Joy", and additionally for "Anger" and "Depression". All these Facets belong to the Emotional Domain. When applied within Domain (γ in Table 4, penultimate column, three last rows), full additivity is finally achieved. The second alternative is to try a flat γ = 1.5 (which is the mean of the three γ found for the three Domains). When proceeding by Facet, the p-value is too low (p < 0.05) to support additivity for three Facets in the Emotional Domain, and also for one Facet in the Vitality & Spirit Domain. When proceeding by Domain, the p-values are acceptable, although for the Vitality & Spirit Domain the p-value is lower than before (p = 0.26 versus p = 0.83).
Altogether, a constant γ = 1.5 seems a practical choice. It provides additivity at the Domain level, which is our main interest. It is low enough not to excessively distort data. According to Table 4, it lies within the range of acceptable γ values, thus increasing the chances of applicability to different populations of patients. We suggest that an Anscombe-Tukey transformation with γ = 1.5 be applied to the ChQoL-IT scores whenever the scores are used to estimate missing data or to provide summary indexes for the three Domains. This (or an equivalent) kind of transformation has been shown to be necessary for our specific sample of female patients suffering from breast cancer, in order to recover additivity. When dealing with different populations, the necessity and advisability of a γ = 1.5 transformation should of course be checked.
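As an illustration of the suggested transformation, the sketch below raises each score to the power γ = 1.5 before computing a summary value. It is a sketch under stated assumptions only: scores are taken to be non-negative, missing answers are encoded as `None`, and the `domain_index` helper (a plain mean of the transformed, non-missing scores) is a hypothetical summary, not an index defined in this paper.

```python
def power_transform(scores, gamma=1.5):
    """Raise each non-missing, non-negative score to the power gamma."""
    out = []
    for s in scores:
        if s is None:
            out.append(None)  # keep missing answers in place
        elif s < 0:
            raise ValueError("scores must be non-negative")
        else:
            out.append(s ** gamma)
    return out


def domain_index(scores, gamma=1.5):
    """Hypothetical summary: mean of the transformed non-missing scores."""
    values = [v for v in power_transform(scores, gamma) if v is not None]
    if not values:
        raise ValueError("no valid scores")
    return sum(values) / len(values)


# Made-up answers of one respondent within one Domain (not study data)
respondent = [4.0, 9.0, None, 16.0]
transformed = power_transform(respondent)
index = domain_index(respondent)
print(transformed, index)
```

Because the transformation is monotonic for non-negative scores, it preserves the ranking of answers while changing their spacing, which is why a moderate γ distorts the distribution less than a large one.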