Background
Routine Outcome Monitoring (ROM) is gathering momentum as an adjunct to treatment [1, 2] and as a basis for outcome management [3]. In the Netherlands, ROM has been stimulated by health insurers, resulting in a nationwide implementation of ROM in clinical practice to serve both goals: providing feedback on individual treatment progress and on outcomes attained with groups of patients (aggregated outcomes). The present paper focuses on the latter. Currently about 45% of all remunerated treatments can be evaluated, and results are aggregated and used to give feedback to institutions on their performance in terms of outcome [4]. For ROM assessments of patients with severe mental illness (SMI), the Health of the Nation Outcome Scales (HoNOS) [5] is used, a well-known rating scale generally completed by the professional who delivers care. The HoNOS comprises 12 items, each with five response options (scoring range is 0–48), and has good psychometric properties [6]. Outcome on clinical problems and psychosocial functioning is assessed by comparing pretest and posttest total scores on the HoNOS for each patient. The simplest, most straightforward and most commonly used outcome indicator in treatment outcome research is the average change from pretest to posttest score, converted into a standardized change score or within-group effect size (ES) indicator [7, 8]. For benchmarking in the Netherlands, we have adapted this approach to a change score based on transformed T-scores (ΔT) [9]. However, average change offers rather abstract information on the performance of treatment institutions. It would be informative to know what proportion of patients have benefitted from treatment or can be considered as recovered, yielding a performance indicator with direct appeal.
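As a minimal sketch of the within-group effect size mentioned above, the snippet below divides the mean pretest-posttest change by the pretest standard deviation. The choice of the pretest SD as standardizer is an assumption made for illustration (the text does not state the denominator), and the scores are hypothetical.

```python
from statistics import mean, stdev


def within_group_effect_size(pretest, posttest):
    """Mean pre-post change divided by the pretest SD.

    Assumption: the pretest SD is the standardizer; other choices
    (e.g. the SD of the change scores) are also in use.
    """
    if len(pretest) != len(posttest):
        raise ValueError("pretest and posttest must be paired")
    # Lower HoNOS scores mean less impairment, so pre - post > 0 is improvement.
    changes = [pre - post for pre, post in zip(pretest, posttest)]
    return mean(changes) / stdev(pretest)


# Hypothetical HoNOS total scores (range 0-48) for five patients.
pre = [14, 18, 10, 22, 16]
post = [12, 15, 11, 17, 15]
es = within_group_effect_size(pre, post)
print(f"ES = {es:.2f}")
```

A positive value indicates average improvement, because lower HoNOS totals reflect less impairment.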
Jacobson et al. [10–12] have proposed a method to delineate the treatment results of individual patients, comprising criteria for clinically significant and statistically reliable change. The outcome is deemed significant if a patient’s posttest score is within the functional range; a patient has reliably changed if the pretest-posttest change is larger than a chance fluctuation due to instrument measurement error. Various revisions of the Jacobson-Truax (JT) approach have been proposed [13–15], finding extensive application in comparing outcomes of groups of patients [16–18] as well as in ROM for individual patients [19]. Recently we evaluated the practicality of this approach as an indicator of institutional performance, using pretest and posttest scores on self-report measures. The JT approach was deemed a worthy addition to traditional performance indicators such as pretest-posttest ES or change in T-score (ΔT), as it illustrates these numerical values in a clinically meaningful manner with children and adolescents [20], and with adults with common mental disorders such as depression and anxiety disorders [9].
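The reliable-change criterion described above can be sketched with the standard Jacobson-Truax formula, in which the standard error of the difference is derived from the pretest SD and the reliability of the instrument. The SD and reliability values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not HoNOS estimates.

```python
import math


def reliable_change_threshold(sd_pretest, reliability, z=1.96):
    """Smallest absolute change that exceeds measurement error.

    Standard Jacobson-Truax formulation:
      SE_measurement = SD * sqrt(1 - r)
      SE_difference  = sqrt(2) * SE_measurement
      threshold      = z * SE_difference
    """
    se_measurement = sd_pretest * math.sqrt(1.0 - reliability)
    se_difference = math.sqrt(2.0) * se_measurement
    return z * se_difference


def is_reliable_change(pretest, posttest, threshold):
    """True if the absolute pre-post change exceeds the threshold."""
    return abs(pretest - posttest) > threshold


# Hypothetical values: SD = 10 and reliability = 0.75 (not HoNOS estimates).
threshold = reliable_change_threshold(sd_pretest=10.0, reliability=0.75)
print(f"Reliable change threshold: {threshold:.2f} points")
```

Using z = 1.96 corresponds to a 95% criterion; a 90% criterion would use z = 1.645, which lowers the threshold.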
Application of JT to rating scales, such as the HoNOS, is less common than its application to self-report measures. The results appear of limited use when the JT approach is applied to HoNOS for the SMI population, as usually a very large proportion of patients is deemed unchanged. This may reflect the chronicity of SMI, where change, let alone (clinical) recovery or remission, is relatively uncommon within the time frame of one or two years. It may be caused by lack of responsiveness to change of the HoNOS, especially for patients with low pretest scores to begin with [21], but it may also be due to the stringency of JT criteria, particularly for reliable change [22]. In a paper published in 2005, Parabiaghi et al. [23] proposed for the HoNOS total score that a change of at least 8 points is required to deem a patient as statistically reliably changed. Such a change in score is substantial and infrequent in care provided to the majority of patients with SMI, but is also a stringent criterion when the HoNOS is applied to evaluate outpatient care for common mental disorders [21]. Other values for reliable change and alternative statistical approaches to arrive at performance indicators for use with the HoNOS have been proposed by Burgess et al. [24]. They discuss the merits of effect size (ES), reliable change index (RCI), and standard error of measurement (SEM), proposing various threshold values for these indicators to distinguish unchanged from changed patients (improved or deteriorated), varying in statistical uncertainty. Utilization of each threshold score yields three possible outcomes: no significant change, significant improvement, and significant deterioration.
To obtain an improved categorization for use with HoNOS data, Parabiaghi et al. [22] described a revised approach to JT in a more recent paper from 2014. This approach (JTrevised) focuses more on outcome than on change, underlining the significance of slightly changed and unchanged subjects. Where JT distinguishes two classes of patients (dysfunctional and functional), Parabiaghi et al. propose three classes of severity for the HoNOS total score: mild (< 10), moderate (10–13), and severe (> 13). They also propose two levels of meaningful change: reliably changed (RCI 90%; at least 8 points) and minimally changed (at least 4 points change). Potentially, the method proposed by Parabiaghi et al. [22] is an improvement over the traditional JT approach: as it allows for a more comprehensive categorization of treatment results, it seems better suited to meet the demands of clinical reality.
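The two building blocks of the revised approach, the severity classes and the change levels, can be sketched as follows. The cut-offs are those quoted above; the function names and the returned labels are hypothetical. The mapping of these building blocks onto the eight final outcome categories is deliberately left out, because the full decision rules are not spelled out here.

```python
def severity_class(honos_total):
    """Severity class of a HoNOS total score: mild (<10), moderate (10-13), severe (>13)."""
    if honos_total < 10:
        return "mild"
    if honos_total <= 13:
        return "moderate"
    return "severe"


def change_level(pretest, posttest, reliable=8, minimal=4):
    """Size and direction of the pre-post change.

    Returns a (level, direction) tuple. Lower scores mean less impairment,
    so a drop in score counts as improvement.
    """
    delta = pretest - posttest
    if abs(delta) >= reliable:
        level = "reliable"
    elif abs(delta) >= minimal:
        level = "minimal"
    else:
        return ("none", "stable")
    return (level, "improvement" if delta > 0 else "worsening")


# Hypothetical patient: pretest 20 (severe), posttest 12 (moderate).
print(severity_class(20), severity_class(12), change_level(20, 12))
```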
In the present study, we compared several categorical approaches as clinical illustrations of ES and ΔT: classifications into three categories (improved, unchanged, and deteriorated) based on ES and RCI threshold values, dichotomous classifications (JTRCI and JTCS), the more complex classification of JT into four categories (recovered, improved, unchanged, and deteriorated; JTRCI&CS), and the proposed revised JT of Parabiaghi et al. [22] into eight categories (JTrevised). We evaluated which categorical method is most suitable to denote outcome for patients with SMI by comparing the ranking of institutions according to ES and/or ΔT with their ranking based on categorical outcomes. ES/ΔT was chosen as the reference method, as this outcome indicator is appropriate given the continuous nature of the data, and it is the most commonly used effect indicator to denote within-group effect size in treatment outcome research [8]. We therefore examined which of the categorical methods revealed the largest differences in outcome between mental health institutions, examined whether rankings based on continuous and categorical methods were concordant, and evaluated the informative value of each method.
It is important to note that the aim of the present study was to compare performance indicators for their ability to assess differences in outcome of care among institutions. Variation in outcome between providers enables us to compare performance indicators. The aim was not to compare the performance of the participating institutions per se. Case mix differences and differences in completeness of the data among institutions preclude firm conclusions regarding their comparative performance. We consequently chose to anonymize institutions. The reader should note that the ranking of institutions does not necessarily reflect an order in the quality of care provided; it is merely a reflection of differences in outcome, which may well be due to case mix differences or other factors affecting outcome, such as timing of assessments or proficiency in use of the HoNOS.
Results
The initial dataset comprised 16,771 patients who received treatment; 8402 (50.1%) were assessed at pretest and for 38.0% of these posttest data were available, yielding a final sample of 3189 patients with complete pretest and posttest data and an overall response rate of 19.1%. Table 1 presents background information on the participating patients. The duration of care episodes ranged from 30 to 446 days (M = 297.8; SD = 93.4), with no significant differences between institutions. Pretest selection, posttest attrition and overall response rates (the proportion of care episodes with complete pretest and posttest data) varied considerably between institutions (range: 6.1–33.8%).
There were statistically significant, albeit small, differences between institutions in mean age and gender; overall 58.3% of patients were males and gender was unevenly distributed among the 10 institutions (χ2(9) = 103.44; p < .001), with Institution 4 treating more males (82.1% vs. 58.3% for the total population). Participants’ age ranged from 17 to 84 years (M = 40.7; SD = 12.3) and varied among institutions (F(9) = 7.34; p < .001; η2 = .02), with Institutions 4, 6, and 9 treating somewhat younger patients. The mean pretest score on the HoNOS differed significantly between institutions (F(9) = 18.04; p < .001; η2 = .05), with Institutions 1, 6, and 10 having lower scores (i.e. less impairment in function) than the others according to Bonferroni corrected pairwise comparisons.
There were large differences in diagnostic composition of the case mix among institutions. Table 2 presents this diagnostic information. The largest group among the diagnoses is psychotic disorders (47.3%), followed by mood/anxiety/somatoform disorders (17.9%) and personality disorders (11.4%). The smallest groups are pervasive developmental disorders (10.5%), substance abuse (5.0%), and bipolar disorder (4.9%). Patient composition differs significantly among institutions (χ2(54) = 1583.09; p < .001), with Institution 4 treating more patients with substance-related disorders (36.9%) and fewer psychotic disorders (13.4%), Institution 5 treating more patients with mood/anxiety/somatoform disorders (36.0%) and personality disorders (18.8%), Institution 6 treating more patients with pervasive developmental disorders (44.7%), and Institutions 8, 5, and 1 treating more personality disorders (19.4, 18.8 and 17.7%, respectively).
Table 2
Overview of the case mix composition regarding main psychiatric diagnosis per institution
Diagnosis | | Inst. 1 | Inst. 2 | Inst. 3 | Inst. 4 | Inst. 5 | Inst. 6 | Inst. 7 | Inst. 8 | Inst. 9 | Inst. 10 | Total |
Psych. Dis. | N | 132 | 128 | 127 | 39 | 108 | 108 | 282 | 195 | 264 | 125 | 1508 |
Psych. Dis. | % | 49.8 | 66.7 | 59.1 | 13.4 | 32.1 | 31.6 | 55.8 | 44.5 | 64.9 | 62.8 | 47.3 |
MAS | N | 27 | 23 | 25 | 61 | 121 | 29 | 110 | 80 | 64 | 31 | 571 |
MAS | % | 10.2 | 12.0 | 11.6 | 21.0 | 36.0 | 8.5 | 21.8 | 18.3 | 15.7 | 15.6 | 17.9 |
Pers. Dis. | N | 47 | 11 | 21 | 38 | 63 | 28 | 37 | 85 | 21 | 12 | 363 |
Pers. Dis. | % | 17.7 | 5.7 | 9.8 | 13.1 | 18.8 | 8.2 | 7.3 | 19.4 | 5.2 | 6.0 | 11.4 |
Perv. DD | N | 32 | 20 | 14 | 36 | 17 | 153 | 29 | 15 | 11 | 9 | 336 |
Perv. DD | % | 12.1 | 10.4 | 6.5 | 12.4 | 5.1 | 44.7 | 5.7 | 3.4 | 2.7 | 4.5 | 10.5 |
Bipolar Dis. | N | 18 | 5 | 7 | 4 | 14 | 17 | 12 | 23 | 38 | 18 | 156 |
Bipolar Dis. | % | 6.8 | 2.6 | 3.3 | 1.4 | 4.2 | 5.0 | 2.4 | 5.3 | 9.3 | 9.0 | 4.9 |
Substance | N | 4 | 2 | 17 | 107 | 6 | 4 | 5 | 12 | 0 | 1 | 158 |
Substance | % | 1.5 | 1.0 | 7.9 | 36.9 | 1.8 | 1.2 | 1.0 | 2.7 | 0.0 | 0.5 | 5.0 |
Other | N | 5 | 3 | 4 | 5 | 7 | 3 | 30 | 28 | 9 | 3 | 97 |
Other | % | 1.9 | 1.6 | 1.9 | 1.7 | 2.1 | 0.9 | 5.9 | 6.4 | 2.2 | 1.5 | 3.0 |
Psych. Dis. = psychotic disorders; MAS = mood/anxiety/somatoform disorders; Pers. Dis. = personality disorders; Perv. DD = pervasive developmental disorders; Bipolar Dis. = bipolar disorder; Substance = substance-related disorders.
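The test of independence between diagnosis and institution can be illustrated with the counts in Table 2. The sketch below computes a plain Pearson chi-square statistic by hand; it assumes that the reported test was of this kind and that the tabulated counts are the full input, so the printed value is not guaranteed to match the reported statistic exactly.

```python
# Patient counts per diagnosis (rows) and Institution 1-10 (columns), from Table 2.
counts = [
    [132, 128, 127, 39, 108, 108, 282, 195, 264, 125],  # psychotic disorders
    [27, 23, 25, 61, 121, 29, 110, 80, 64, 31],         # mood/anxiety/somatoform
    [47, 11, 21, 38, 63, 28, 37, 85, 21, 12],           # personality disorders
    [32, 20, 14, 36, 17, 153, 29, 15, 11, 9],           # pervasive developmental
    [18, 5, 7, 4, 14, 17, 12, 23, 38, 18],              # bipolar disorder
    [4, 2, 17, 107, 6, 4, 5, 12, 0, 1],                 # substance-related
    [5, 3, 4, 5, 7, 3, 30, 28, 9, 3],                   # other
]


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of diagnosis and institution.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


chi2, df = pearson_chi_square(counts)
print(f"chi2({df}) = {chi2:.2f}")
```

With seven diagnostic groups and ten institutions the degrees of freedom are (7 − 1) × (10 − 1) = 54, which agrees with the value given in the text.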
Differences among institutions in pretest-posttest change on HoNOS T-scores were analyzed in a 2 (time) × 10 (institution) repeated-measures ANOVA. This revealed statistically significant main effects of time (F(1) = 233.4; p < .001; η2 = .068) and institution (F(9) = 23.6; p < .001; η2 = .063), reflecting a difference over time as well as between institutions regardless of time. More importantly, there was a significant time × institution interaction (F(9,3179) = 3.33; p < .001; η2 = .009), revealing that the pretest-to-posttest change differed among institutions. Pairwise comparisons of institutions (with Bonferroni correction) revealed that Institutions 5 to 10 reported larger pretest-posttest differences than Institutions 1 to 4.
Ranking of institutions was based on ES and ΔT. Hence, institutions with a higher rank number have a larger ES than those with a low rank number, as Table 3 shows. This table also presents results using threshold values for ES, SEM, and JTRCI90. All categorizations reveal significant differences among institutions (all p < .001). The proportions of reliably changed patients from Table 3 using the RCI threshold of at least 8 points as proposed by Parabiaghi et al. [23] varied among institutions (χ2(9) = 58.1; p < .001), as did the proportions of patients with a posttest score < 5, denoting a clinically significant change (χ2(9) = 42.8; p < .001). Finally, combining both indices in four outcome categories also reveals differences among institutions (χ2(27) = 111.4; p < .001). Institutions with a higher rank number had more recovered (Institution 9: 17.4% vs. Institution 1: 4.9%) and fewer deteriorated patients (Institution 9: 10.6% vs. Institution 6: 3.8%). The results indicate that 11.2% (n = 356) of patients had recovered, 6.1% (n = 196) had improved, 75.9% (n = 2421) remained unchanged, and 6.7% (n = 215) had deteriorated. The large proportion of unchanged patients results from the stringent RCI criterion of at least 8 points change. The ranking of institutes diverges considerably among indicators, and most indicators have no statistically significant association with ES, except for the improved and reliable change (JTRCI) indicators, which correspond best with ES. All in all, most of the indicators proposed by Burgess et al. [24] and JTRCI&CS based on raw scores are insufficiently concordant with ES.
Table 3
Effect Size (ES) and percentage of patients in outcome categories based on raw HoNOS scores and classification according to various ESmedium, SEM, and JTRCI-90 threshold values, and according to JTRCI95, JTCS, and JTRCI&CS
Indicator | Category | Inst. 1 | Inst. 2 | Inst. 3 | Inst. 4 | Inst. 5 | Inst. 6 | Inst. 7 | Inst. 8 | Inst. 9 | Inst. 10 | Total | Ranking | Rho |
Continuous: | ES | 0.06 | 0.18 | 0.19 | 0.24 | 0.29 | 0.29 | 0.32 | 0.38 | 0.38 | 0.39 | 0.29 | 1 2 3 4 5/6 7 8/9 10 | |
Categorical: |
ESmedium ≤ −4 | deteriorated | 20.0 | 19.3 | 20.5 | 20.7 | 15.2 | 11.4 | 17.8 | 14.8 | 20.9 | 14.1 | 17.3 | 9 4 3 1 2 7 5 8 10 6 | .35 |
−4 < ESmedium < 4 | unchanged | 56.2 | 48.4 | 43.7 | 40.0 | 49.7 | 57.6 | 45.0 | 42.7 | 34.4 | 46.7 | 45.9 | 6 1 5 2 10 7 3 8 4 9 | .38 |
ESmedium ≥ 4 | improved | 23.8 | 32.3 | 35.8 | 39.3 | 35.1 | 31.0 | 37.2 | 42.5 | 44.7 | 39.2 | 36.8 | 1 5 2 8 4 3 6 9 10 7 | .71* |
SEM ≤ −5 | deteriorated | 15.5 | 13.0 | 15.3 | 17.6 | 12.5 | 9.1 | 14.5 | 11.0 | 16.2 | 11.1 | 13.5 | 4 9 1 3 7 2 5 10 8 6 | .35 |
−5 < SEM < 5 | unchanged | 63.4 | 60.4 | 56.3 | 49.3 | 57.1 | 68.1 | 53.7 | 52.7 | 45.9 | 54.8 | 55.5 | 6 1 2 5 3 10 7 8 4 9 | .55 |
SEM ≥ 5 | improved | 21.1 | 26.6 | 28.4 | 33.1 | 30.4 | 22.8 | 31.9 | 36.3 | 37.8 | 34.2 | 30.9 | 1 6 2 3 5 7 4 10 8 9 | .79** |
JTRCI90 ≤ −9 | deteriorated | 8.3 | 4.2 | 7.0 | 9.3 | 3.6 | 2.3 | 5.1 | 3.9 | 7.6 | 2.5 | 5.4 | 4 1 9 3 7 2 8 5 10 6 | .43 |
−9 < JTRCI90 < 9 | unchanged | 85.3 | 82.8 | 80.0 | 71.4 | 87.5 | 88.0 | 82.2 | 79.2 | 69.0 | 83.9 | 80.6 | 6 5 1 10 2 7 3 8 4 9 | .24 |
JTRCI90 ≥ 9 | improved | 6.4 | 13.0 | 13.0 | 19.3 | 8.9 | 9.6 | 12.7 | 16.9 | 23.3 | 13.6 | 14.1 | 1 5 6 7 2/3 10 8 4 9 | .51 |
JTRCI (change ≥ 8) | reliable change | 9.4 | 14.6 | 14.9 | 19.7 | 11.3 | 14.3 | 17.4 | 21.7 | 27.0 | 15.6 | 17.3 | 1 5 6 2 3 10 7 4 8 9 | .64* |
JTCS (post < 5) | clinical change | 13.6 | 9.9 | 7.0 | 7.2 | 8.6 | 14.9 | 16.0 | 15.5 | 15.2 | 21.1 | 13.3 | 3 4 5 2 10 1 6 8 7 9 | .56 |
JTRCI&CS | deteriorated | 9.1 | 5.2 | 8.4 | 10.3 | 3.9 | 3.8 | 5.9 | 5.7 | 10.6 | 4.5 | 6.7 | 9 4 1 3 7 8 2 10 5 6 | .14 |
JTRCI&CS | unchanged | 81.5 | 80.2 | 76.7 | 70.0 | 84.8 | 81.9 | 76.6 | 72.6 | 62.4 | 79.9 | 75.9 | 5 6 1 2 10 3 7 8 4 9 | .39 |
JTRCI&CS | improved | 4.5 | 4.7 | 3.3 | 4.1 | 3.0 | 6.1 | 6.3 | 8.7 | 9.6 | 8.0 | 6.1 | 5 3 4 1 2 6 7 10 8 9 | .75* |
JTRCI&CS | recovered | 4.9 | 9.9 | 11.6 | 15.5 | 8.3 | 8.2 | 11.1 | 13.0 | 17.4 | 7.5 | 11.2 | 1 10 6 5 2 7 3 8 4 9 | .24 |
Table 4 presents the results when we convert the HoNOS scores to T-scores. Again, institutions with a high rank number performed better (ΔT range 3.3–4.3) than those with a lower rank number (ΔT range 0.9–3.0). Using the threshold of a change ΔT > 5 [9, 20], the proportions of reliably changed patients differed significantly among institutions (χ2(9) = 29.8; p < .001), as did the proportions of patients crossing the threshold of CS = 42.5 (pretest ≥ 42.5; posttest < 42.5), denoting clinically significant change (χ2(9) = 30.4; p < .001). Combining the two indices into JTRCI&CS with four categories also reveals significant differences among institutions (χ2(27) = 76.1; p < .001). Furthermore, with the traditional JTRCI&CS method applied to T-scores, patients were more evenly distributed over the outcome categories: in total 18.8% (n = 598) of patients were considered recovered, 22.2% (n = 709) had improved, 40.0% (n = 1277) remained unchanged, and 19.0% (n = 605) had deteriorated. Institutions with a higher rank have more recovered patients (Institution 10: 24.6% vs. Institution 1: 12.5%) and fewer deteriorated patients (Institution 10: 17.1% vs. Institution 1: 23.4%). The Rho correlation coefficients indicate that the categorical rankings based on T-scores (Table 4) are more concordant with the continuous indicator than the rankings based on raw HoNOS difference scores (ES in Table 3), with JTRCI&CS recovery having the highest concordance with ΔT, followed by the category of unchanged patients. However, lack of concordance is also noteworthy. Institution 9, for instance, has the second-highest ranking based on ΔT, but also the third-largest proportion of deteriorated patients (based on JTRCI&CS; see Table 4).
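The four-category JT classification on T-scores can be sketched as a small decision rule using the thresholds quoted above (change > 5 T-score points; cut-off 42.5). The exact order in which the criteria are combined is an assumption based on the usual Jacobson-Truax scheme, and the function name and example scores are hypothetical.

```python
from collections import Counter


def classify_jt(pre_t, post_t, rci=5.0, cutoff=42.5):
    """Classify one patient as recovered, improved, unchanged, or deteriorated.

    Assumed rules (usual Jacobson-Truax scheme):
      - recovered:    reliable improvement AND crossing the cut-off
      - improved:     reliable improvement without crossing the cut-off
      - deteriorated: reliable worsening
      - unchanged:    everything else
    """
    change = pre_t - post_t  # positive = improvement (lower T = less impairment)
    if change > rci:
        if pre_t >= cutoff and post_t < cutoff:
            return "recovered"
        return "improved"
    if change < -rci:
        return "deteriorated"
    return "unchanged"


# Hypothetical (pretest, posttest) T-scores for four patients.
patients = [(50.0, 40.0), (55.0, 48.0), (45.0, 44.0), (40.0, 47.0)]
outcomes = Counter(classify_jt(pre, post) for pre, post in patients)
for category, n in outcomes.items():
    print(f"{category}: {100 * n / len(patients):.1f}%")
```

Under these rules a patient can cross the cut-off without being counted as recovered when the change is not reliable, which is consistent with the clinically changed proportions in Table 4 being somewhat larger than the recovered proportions.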
Table 4
Mean ΔT and percentage of patients in outcome categories based on T-scores according to the JT approach
Indicator | Category | Inst. 1 | Inst. 2 | Inst. 3 | Inst. 4 | Inst. 5 | Inst. 6 | Inst. 7 | Inst. 8 | Inst. 9 | Inst. 10 | Total | Ranking | Rho |
Continuous: | Mean ΔT | 0.9 | 2.0 | 2.0 | 2.6 | 2.8 | 3.0 | 3.3 | 4.1 | 4.3 | 4.3 | 3.1 | 1 2/3 4 5 6 7 8 9/10 | |
Categorical: |
JTRCI-95 (> 5) | improved | 33.6 | 34.4 | 37.7 | 40.0 | 37.5 | 36.5 | 43.0 | 46.8 | 46.9 | 45.7 | 41.0 | 1 2 6 5 3 4 7 10 8 9 | .86** |
JTCS (42.5) | changed | 14.0 | 14.1 | 19.1 | 14.8 | 17.6 | 20.5 | 23.6 | 23.1 | 23.8 | 25.6 | 20.2 | 1 2 4 5 3 6 8 7 9 10 | .95** |
JTRCI&CS | deteriorated | 23.4 | 18.2 | 21.4 | 22.1 | 16.1 | 14.3 | 20.2 | 16.0 | 21.9 | 17.1 | 19.0 | 1 4 9 3 7 2 10 5 8 6 | .38 |
JTRCI&CS | unchanged | 43.0 | 47.4 | 40.9 | 37.9 | 46.4 | 49.1 | 36.8 | 37.2 | 31.2 | 37.2 | 40.0 | 6 2 5 1 3 4 8/10 7 9 | .65* |
JTRCI&CS | improved | 21.1 | 21.4 | 19.1 | 26.9 | 22.0 | 18.7 | 21.4 | 24.9 | 23.6 | 21.1 | 22.2 | 6 3 1 10 2/7 5 9 8 4 | .26 |
JTRCI&CS | recovered | 12.5 | 13.0 | 18.6 | 13.1 | 15.5 | 17.8 | 21.6 | 21.9 | 23.3 | 24.6 | 18.8 | 1 2 4 5 6 3 7 8 9 10 | .93** |
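The concordance between a continuous and a categorical ranking can be illustrated with a rank correlation computed from the rounded values in Table 4. The sketch below implements Spearman's rho as a Pearson correlation on tie-averaged ranks; whether the reported coefficients were computed in exactly this way is an assumption, and because the table values are rounded the result need not reproduce the reported coefficient.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Rounded values for Institutions 1-10, taken from Table 4.
mean_delta_t = [0.9, 2.0, 2.0, 2.6, 2.8, 3.0, 3.3, 4.1, 4.3, 4.3]
pct_recovered = [12.5, 13.0, 18.6, 13.1, 15.5, 17.8, 21.6, 21.9, 23.3, 24.6]
rho = spearman_rho(mean_delta_t, pct_recovered)
print(f"rho = {rho:.2f}")
```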
The results of Table 5 show the categorization according to the revised JT proposed by Parabiaghi et al. [22]. A significant difference among institutions in these categories is found, with higher rates of patients in the “mild illness” and “improvement to mild illness” categories and lower rates of “stability in severe illness” or “worsening in/to severe illness” among institutions with a high ranking (χ2(63) = 230.9; p < .001). Correspondence between ranking of institutes according to ES and the JTrevised categorization is low, except for the category “improvement to mild illness”.
Table 5
Percentage of subjects classified into 8 outcome categories based on raw scores according to the revised JT approach of Parabiaghi et al. [22]
Outcome category | Inst. 1 | Inst. 2 | Inst. 3 | Inst. 4 | Inst. 5 | Inst. 6 | Inst. 7 | Inst. 8 | Inst. 9 | Inst. 10 | Total | Ranking | Rho |
Mild illness | 40.4 | 17.7 | 21.4 | 15.2 | 22.9 | 36.3 | 29.3 | 24.9 | 21.1 | 34.2 | 26.4 | 4 2 9 3 5 8 7 10 6 1 | .13 |
Improvement to mild illness | 18.5 | 17.7 | 22.3 | 18.3 | 20.5 | 22.8 | 26.9 | 28.3 | 31.2 | 32.7 | 24.6 | 2 4 1 5 3 6 7 8 9 10 | .92** |
Improvement to moderate illness | 3.8 | 6.8 | 7.0 | 11.4 | 7.7 | 4.4 | 6.9 | 8.2 | 7.1 | 4.5 | 6.9 | 1 6 10 2 7 3 9 5 8 4 | .23 |
Improvement within severe illness | 1.5 | 7.8 | 6.5 | 9.7 | 6.8 | 3.8 | 3.4 | 5.9 | 6.4 | 2.0 | 5.3 | 1 10 7 6 8 9 3 5 2 4 | −.29 |
Stability in moderate illness | 12.1 | 12.0 | 8.4 | 9.3 | 10.4 | 11.4 | 9.1 | 7.8 | 5.9 | 8.0 | 9.2 | | |
Worsening to moderate illness | 4.9 | 5.2 | 3.3 | 2.8 | 2.7 | 2.0 | 4.0 | 3.9 | 4.4 | 3.0 | 3.6 | 2 1 9 7 8 3 10 4 5 6 | .26 |
Stability in severe illness | 7.2 | 20.3 | 15.8 | 17.9 | 18.8 | 11.7 | 8.9 | 11.4 | 10.3 | 8.0 | 12.5 | | |
Worsening in/to severe illness | 11.7 | 12.5 | 15.3 | 15.5 | 10.1 | 7.6 | 11.5 | 9.6 | 13.5 | 7.5 | 11.4 | 4 3 9 2 1 7 5 8 6 10 | .52 |
Discussion
In the present study, we compared various categorical indicators on their usefulness to illustrate differences between institutions regarding treatment outcome. The primary aim of the study was to test the suitability of various categorical methods to denote treatment outcome in mental health care for patients with SMI using the HoNOS as assessment instrument. We were fortunate to find differences in outcomes between institutions and could use their data to evaluate various methods to delineate outcome. We also assessed the suitability of a number of methods to compare institutions. The results revealed differences in ranking institutions between the two continuous indicators (ES and ΔT) and the categorical indicators (SEM, JTRCI, JTCS, JTRCI&CS, JTrevised). Indicators based on categorical outcomes yielded quite divergent rankings; the categories of the traditional JT approach were most concordant with the continuous outcome indicators ES and ΔT, particularly JTRCI and JTCS based on T-scores.
The traditional JT approach (JTRCI&CS) with four categories is applied frequently in practice and provides useful information on patients’ condition after treatment [9, 11, 16]. However, as an outcome indicator for aggregated data it has some serious drawbacks. As the indicator classifies patients into four categories, it is impossible to rank health institutions consistently: ranking according to proportion of recovered patients yields a different order than ranking according to proportion of reliably changed patients, and so forth. A possible solution would be to collapse the four categories into two, in order to get a ranking based on less complex information, but this reduces information value and statistical power. Fedorov, Mannino and Zhang [33] calculated that dichotomizing information leads to a substantial loss of statistical power (at least 36% reduction when data are made binary and 19% when data are converted to three categories). These percentages are based on optimal cut-off points. In practice, the loss of statistical power may be greater. Indeed, Markon, Chmielewski and Miller [34] showed that a sample needed to be twice as large when moving from a continuous to a dichotomous outcome. Statistical power can be increased by adding more categories, but this reintroduces the complexity of interpreting the outcome data.
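Loss figures of the size quoted above can be obtained from a textbook calculation: the Fisher information about the mean of a standard normal variable that is retained after grouping it into categories, relative to the ungrouped variable. The sketch below is an illustration under that normality assumption and is not necessarily the derivation used in [33]; the cut-off of ±0.612 for three categories is an approximately optimal value assumed here.

```python
import math


def normal_pdf(x):
    """Standard normal density; zero at plus or minus infinity."""
    if math.isinf(x):
        return 0.0
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def grouped_efficiency(cutoffs):
    """Share of the information about the mean that survives categorization.

    For each category (lower, upper) the contribution is
    (pdf(lower) - pdf(upper))**2 / P(lower < X < upper); the ungrouped
    standard normal variable carries information 1.
    """
    edges = [-math.inf, *sorted(cutoffs), math.inf]
    info = 0.0
    for lower, upper in zip(edges, edges[1:]):
        probability = normal_cdf(upper) - normal_cdf(lower)
        density_drop = normal_pdf(lower) - normal_pdf(upper)
        info += density_drop ** 2 / probability
    return info


binary_loss = 1.0 - grouped_efficiency([0.0])           # split at the median
three_loss = 1.0 - grouped_efficiency([-0.612, 0.612])  # assumed near-optimal cut-offs
print(f"two categories:   {100 * binary_loss:.0f}% loss")
print(f"three categories: {100 * three_loss:.0f}% loss")
```

Moving the cut-offs away from these positions lowers the retained information further, in line with the remark that the loss may be greater in practice.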
Another drawback of the JTRCI&CS method is that it will result in a large proportion of “unchanged” patients if a stringent criterion of RCI ≥ 8 is applied to raw HoNOS scores. Such a large category provides little information and is hard to interpret, as we are unsure whether to regard this outcome as disappointing or as successful stabilization (this of course also depends on the goal of treatment or care). Using various alternative cut-off values for deterioration or improvement, as proposed by Burgess et al. [24], does not lead to a categorization highly concordant with ES. The present results show that applying the JT categorization after raw scores have been converted into transformed T-scores yields a more even distribution of patients over the outcome categories. Moreover, ranking of institutes according to proportion of recovered patients based on transformed T-scores is more concordant with outcome according to ΔT than the ranking using raw scores. We therefore recommend using transformed T-scores with the proposed cut-off values RCI > 5 and CS = 42.5 (corresponding to RCI > 2 to RCI > 4, depending on the position on the scale, and CS = 8 in raw score on the HoNOS) as the most suitable approach to convey differences in performance between institutions, given that this indicator is methodologically sound as it uses data that have been transformed into a normal distribution.
Parabiaghi et al. [22] evaluated a more refined approach for meaningful change and outcome. We examined this approach and compared it with the traditional JT approach, to investigate how these categorical methods compare in their convergence with the continuous method and how they compare in denoting outcome in a meaningful way. The results indicate that the proposed revision may have advantages over the traditional JT approach, as it provides a quite meticulous and clinically meaningful way to denote clinical status and outcome of care for individual patients with SMI. JTrevised may thus be more informative for clinicians when monitoring progress and choosing the most appropriate course of treatment as compared to the traditional JT approach. Further validation of JTrevised is needed to justify use of its more refined outcome categories. It should also be noted that the threshold value for change based on SEM (change ≥ 4 is deemed meaningful) needs validation, as it is far more lenient than the RCI90 ≥ 9 based on the formulas proposed by Jacobson and Truax and the reliability of the HoNOS may not justify the chosen low-threshold value. Future research, for instance directly comparing the predictive validity of the categorization according to the traditional JT approach and the JTrevised in terms of further course of treatment, will reveal which approach best predicts need for care after the first year. However, the Parabiaghi approach is deemed too complex for research on groups of patients or for use as a performance indicator comparing aggregated outcomes of institutions: with eight categories it is not considered a practical or more appropriate alternative to the simpler traditional JTRCI&CS with four categories.
A strength of the present study is its use of real-life data, collected in everyday clinical practice. The study also uses a large data set, in number of both institutions and patients per institution, which boosts confidence in the generalizability of the findings for clinical practice in the Netherlands and provides ample statistical power to find differences among methods to denote outcome. Indeed, substantial variation in outcome was found among institutions, offering a realistic test of the usefulness of various approaches to denote outcome of patients in care for SMI.
A limitation of the study is that only data from the first year of care were analyzed. Patients with SMI typically stay in care for a longer period. Their change in subsequent years of care is likely to be substantially smaller, as may also be the case for outcome variation between institutions. It should be noted that the substantial differences between institutions in case mix composition for demographics and clinical features of patients, as well as differences in completeness of provided data, imply that outcomes of institutions are potentially confounded by these pretest differences. For example, institutions’ patient populations vary in pretest severity, a variable strongly associated with posttest scores and gain scores; this implies that the level of pretest severity is also associated with categorical outcomes. Higher average pretest levels leave more room for reliable improvement, lower pretest levels leave less room but make achieving recovery status more likely. In addition, case mix composition between institutions also differed in ratio of inpatients to community patients. This underscores the need for case mix correction when comparing institutional performance. We reanalyzed the data after case mix correction for several variables that appeared associated with outcome (pretest severity, age, and bipolar disorder). This case mix model explained 23% of outcome variation (predominantly by pretest HoNOS scores). Correction did influence average outcome of institutions, but overall the ranking of institutes remained the same. However, differences between institutions diminished somewhat, and with this smaller contrast between institutions the rankings of the various approaches were more diverse. Consequently, the concordance between approaches was also more varied. As a further limitation of the study, response rates for institutions ranged from 6.1 to 33.8%, compromising the representativeness of the data for the institutions. 
Hence, the present results do not necessarily reflect differences in quality of care between institutions and should be examined cautiously, also bearing in mind that comparing institutions was not our aim. Moreover, the overall response rate limits the generalizability of the study findings, as we do not know whether outcome data are missing systematically.
The HoNOS total score may be considered too small a basis to evaluate the outcome of an individual patient or appraise the overall performance of mental health institutions. Use of the HoNOS is widespread, not only for outcome monitoring but also to assign patients to clusters based on their treatment needs. Large datasets have thus become available to evaluate the psychometric quality of the instrument, and some negative findings have emerged. For instance, the HoNOS appears not to be associated with need-for-care as operationalized by costs of treatment in a large British cohort of 1343 patients with common mental health problems [35]. For this patient group, the sensitivity to change in severity of psychopathology of the HoNOS appears to be limited as only three items (7, 8, and 9) seem relevant and appropriate [21]. The utility of the HoNOS for clustering patients into groups of various need levels has been questioned as well [36]. Furthermore, the factorial structure of the HoNOS has been criticized: the HoNOS does not appear to be unidimensional, which casts doubt on the validity of calculating a total score. Various multidimensional factorial models have been proposed, but none appears to have sufficient fit to be deemed good over the full range of psychiatric disorders [37]. Further development of measurement instruments for appropriate outcome domains (assessing severity of symptomatology, functioning, and personal recovery) is therefore needed, and several such projects are currently underway, internationally as well as in the Netherlands. Finally, the present study lacks an external criterion to validate the various methods to denote outcome. Additional information on patients’ posttreatment functioning is needed, such as continued use of mental health care after the first year of treatment or long-term follow-up data (e.g. several years after treatment has ended).