This intra-individual comparison of answering scales shows that patients’ perception of hospital care in this public teaching hospital is high, and that the numeric response scale does not substantially reduce floor and ceiling effects compared to the labelled adjectival response scale. Moreover, the overlap of the numerical responses, when plotted against the categorical response scale, indicates the difficulty of defining patient groups with regard to their satisfaction with healthcare. Cronbach’s alpha is clearly higher with the numeric response scale. The individual item-level correlation between the response scales is high for the questions about intent to return, quality of treatment, and treatment with respect and dignity, but low for the questions about satisfactory information transfer. Finally, the distribution of the percentage score is comparable on both response scales, which leads to a high overall correlation.
Limitations of the study
Our study has some limitations. The main limitation was that only overall questions on patient satisfaction were available for the response scale comparison, since the nationwide questionnaire was limited to these five evaluating questions. A further limitation was a response rate of only 41%, which is, however, in line with response rates between 42 and 48% in previous studies [7, 20]. Resulting bias is unlikely, since the baseline characteristics of responders and non-responders differed only slightly and the percentage of missing items was very small compared to other studies in similar settings [21, 22], still allowing for a certain generalisability. Moreover, several studies have shown that non-respondents generally do not differ significantly from respondents in terms of satisfaction scores [23] or sociodemographic characteristics [24]. A further limitation consists of differences between the questions on the two response scales with regard to exact wording, the polarity of the response scale and the order of the questions, and of the fact that the five questions on the LS are a subset of a 17-item questionnaire, whereas the nationwide survey included only these five general questions. This may partly explain the divergent answers on the two answering scales, which result in lower correlation coefficients and can be seen as a consequence of the pragmatic approach.
Findings in relation to other studies
The high patient satisfaction in the present study is in line with the findings of an international evaluation [25], in which Switzerland was found to stand out by having very high quality of care ratings compared to other European countries and the United States.
Questionnaires are convenient for monitoring in-patient satisfaction, but the reliability and validity of the findings are reduced by skewed response distributions [8]. Different methods have been proposed to reduce the ceiling effect. In our study, we investigated the influence of the length of the scale together with labelled categories versus end-anchors. We found a less pronounced ceiling effect, with a persistent albeit less left-skewed response distribution, on the longer end-anchored numeric scale than on the shorter adjectival scale with labelled categories. Rather unexpectedly, the skewness on the numeric answering scale was nearly as marked as on the LS. The different polarity of the two answering scales, with the LS having positive answers on the left-hand side and the NS on the right-hand side, might partially explain the higher ceiling effect found in the answers on the LS, since there is evidence of bias towards the left side [26]. People tend to use the first satisfactory response option in a presented questionnaire, as shown in studies which directly compared a reversed to a non-reversed answering scale [27, 28]. However, if the opposite polarity of the two answering scales had a strong influence in our setting, we would expect the response distribution of the numeric scale (with negative answers on the left-hand side) to be more symmetrical. Moret et al. [20] achieved a reduction in ceiling effect by extending the response scale from a four-point to a five-point format, but with an unbalanced scale including three positive and two negative choices. In contrast to our results, Garratt et al. [7] showed that a 10-point scale produced a highly skewed distribution compared to a balanced 5-point scale, which showed a fairly symmetrical distribution. However, in both cited studies, the ceiling effect was less pronounced than in our data [7, 20]. Similarly, in a randomised comparison of four patient satisfaction questionnaires, Perneger et al. [29] found the lowest ceiling effect in a questionnaire using an unbalanced five-point Likert scale. These results suggest that an unbalanced five- or six-point scale might outperform both our three- to four-point scale and our 11-point end-anchored scale. Alternatively, an even stronger imbalance with four positive categories might be necessary to render our high ratings approximately normally distributed. A normal distribution of questionnaire results can also be achieved by using an overall score [30, 31]. However, even by combining all five answers into an overall percentage score, we did not obtain a fairly symmetric distribution. It might be argued that it is not possible to achieve a symmetrical distribution with general questions about patient satisfaction [32, 33], which renders the selection of the questions used for the national benchmark disputable.
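To make the terms concrete, floor and ceiling effects are commonly quantified as the share of respondents choosing the lowest and the highest response category of an item. The following is a minimal Python sketch under that assumption. The function name and the example answers are hypothetical illustrations and are not taken from the study data or its analysis code.

```python
from collections import Counter


def floor_ceiling(responses, lowest, highest):
    """Return (floor, ceiling): fractions of answers at the two scale ends."""
    if not responses:
        raise ValueError("no responses given")
    counts = Counter(responses)
    n = len(responses)
    return counts[lowest] / n, counts[highest] / n


# Hypothetical answers to one item on an 11-point numeric scale (0-10, 10 = best).
numeric_answers = [10, 10, 9, 10, 8, 10, 7, 10, 9, 10]
floor, ceiling = floor_ceiling(numeric_answers, lowest=0, highest=10)
print(f"floor: {floor:.0%}, ceiling: {ceiling:.0%}")
```

The same function can be applied to an adjectival scale once its categories are coded as integers, with `highest` set to the code of the most positive category.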
Apart from the length of the answering scale, the two scales also differed in the number of labelled categories. The use of an adjectival scale for comparative purposes can be limited by its lack of sensitivity for detecting small changes and may strongly depend upon the choice of wording and the literacy of the patient. Accordingly, Downie et al. showed improved discrimination with a numerical rating scale as compared to a four-point descriptive scale and a continuous scale with two end-anchors [34]. On the other hand, Streiner and Norman provided evidence that end-labelled scales tend to pull responses to the ends [35]. This is in line with our data, where a fairly large difference was found between the percentages in the highest category (10) and the next one (9) in the responses on the numerical scale using end-anchors. Garratt et al. argued that the higher ceiling effect of their longer 10-point scale could be explained by this phenomenon of end-anchors pulling responses to the end [7]. Conversely, one could expect an end-aversion bias, especially if the endpoints are labelled with absolute words such as ‘Always’ as opposed to ‘Almost always’.
Our results further suggest that the internal consistency (Cronbach’s alpha) was higher for the NS than for the LS, which is consistent with the study of Hendriks et al. [21]. However, with a lower limit of the Cronbach’s alpha above 0.75, both scales can formally be accepted as a valid measure of patient satisfaction.
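For reference, Cronbach’s alpha for k items is k/(k − 1) × (1 − sum of the item variances / variance of the total score). The sketch below implements this standard formula in Python using sample variances. The function name and the example ratings are hypothetical and only illustrate the calculation; they do not reproduce the study's analysis.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: one list per respondent, holding that respondent's answer to each item.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Transpose so that each tuple holds all answers to one item.
    items = list(zip(*rows))
    sum_item_var = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum_item_var / total_var)


# Hypothetical answers of four respondents to five items on a 0-10 scale.
example = [
    [10, 9, 10, 8, 9],
    [7, 8, 7, 6, 7],
    [9, 9, 10, 9, 8],
    [5, 6, 5, 4, 6],
]
alpha = cronbach_alpha(example)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Respondents with missing items would have to be excluded or imputed before calling the function, since the formula assumes complete answers.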
Comparing the two answering scales, we found a high individual item-level correlation in three out of five questions. These three questions cover the domains of intention to return, quality of treatment and respectful behaviour. Laerhoven et al. [36] compared three different types of response scales and found Spearman rank correlation coefficients that all lay above the plausible range of values from our study. This finding might be explained by the above-mentioned additional differences between the two questionnaires apart from the length and type of scale (e.g. polarity and order of questions). There is consistent evidence that the most important factor affecting patient satisfaction is the patient-practitioner relationship, including information provision [2, 3]. In the present study, this was the domain in which the answers on the two response scales diverged most. From the collected information, we are unable to determine which of the questionnaires contains the valid individual response. On the other hand, it is not unexpected that the correlation is lower in these two information questions. Whereas questions about treatment with respect and dignity and intention to return imply almost binary answers, in the sense that one is either willing to return or not, a question about the provision of adequate information is much more complex and depends on the content as well as the wording of the given information. Moreover, the adjectival response scale consists of only three categories, which does not allow for much differentiation.
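As an illustration of the item-level comparison discussed above, the Spearman rank correlation between two ordinal scales can be computed as the Pearson correlation of the average ranks, which handles the many ties that short scales produce. The following Python sketch assumes that the labelled scale has already been recoded so that higher codes mean higher satisfaction on both scales. All names and the example answers are hypothetical and are not the study data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation with average ranks for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired answers of six patients to one question:
# labelled scale recoded to 1-4 (4 = best), numeric scale 0-10 (10 = best).
labelled = [4, 4, 3, 4, 2, 3]
numeric = [10, 9, 8, 10, 5, 9]
rho = spearman(labelled, numeric)
print(f"Spearman rho: {rho:.2f}")
```

With a strong ceiling effect, most patients share the top rank on both scales, so a few discordant answers can lower the coefficient noticeably.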
Implications for daily practice and further research
In order to further improve the nationwide benchmark at this point, the questions and answering scale need to be refined to enhance the discriminative power for the high end of patient satisfaction, since both response scales showed a strong ceiling effect. This is especially relevant, because patient satisfaction may influence many important outcomes such as compliance, overall well-being and consumer choice. The choice of these five overall questions for a national benchmark should be challenged and a different set of reporting questions defining quality aims tested.