Background
Over 66% of patients with chronic schizophrenia who are started on atypical antipsychotics for treatment of moderate to severe symptoms will fail to show moderate improvement after 3 months of treatment, and 1 in 3 will fail to show even minimal improvement [1, 2]. When treatment does not rapidly bring about symptomatic improvement, patients have less long-term functional progress, higher health care costs, and a reduced likelihood of remission [3]. These patients are more likely to discontinue treatment than those who experience rapid improvement [4], and treatment discontinuation can have devastating consequences in this population [5]. The ability to rapidly identify patients who are unlikely to improve on a given treatment allows for early intervention (eg, a change in dosage, a new schedule of delivery, an alternative medication) that could lead to better symptomatic relief, increased treatment adherence, decreased burden of suffering for patients and families, and decreased health care resource utilization.
Until recently, scientists and clinicians believed that at least 3 to 4 weeks of treatment were needed for patients to experience clinically meaningful symptomatic improvement in response to antipsychotic therapy. It has now been convincingly shown that a substantial amount of improvement occurs within 2 weeks or less of initiating treatment [6-8]. Failure to respond within this time frame has been shown in multiple post hoc analyses to strongly predict later non-response with continued use of the same agent [1, 9, 10]. More recently, the predictive power of early response/non-response was shown in a prospective study of patients with chronic disease [2] and in a retrospective study of patients experiencing first episode psychosis [11].
To date, studies assessing early symptom improvement [1, 2, 9, 10] as a predictor of longer-term response have used a predetermined percent reduction from baseline in the Brief Psychiatric Rating Scale (BPRS) [12] Total score or the Positive and Negative Syndrome Scale (PANSS) [13] Total score at a point early in treatment as the criterion for differentiating likely responders and likely non-responders; for example, a 20% improvement in PANSS Total score at Week 2 of treatment might identify patients likely to respond at Week 8. However, the BPRS and PANSS are rarely used in clinical practice. The BPRS consists of 18 items and requires 20 to 30 minutes to administer, and the PANSS consists of 30 items and takes up to 40 minutes to administer. Time constraints related to patient care make using these scales to guide clinical decision making all but impossible.
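The percent-reduction criterion described above can be sketched in a few lines of Python. This is only an illustration: the function names and the `floor` parameter are hypothetical, and the cited studies may have computed percent change differently. The `floor` argument reflects the fact that PANSS items are scored from 1 to 7, so the lowest possible 30-item Total score is 30; whether that minimum was subtracted before computing percent change is not stated here, so the default leaves it at 0.

```python
def percent_reduction(baseline: float, current: float, floor: float = 0.0) -> float:
    """Percent reduction from baseline.

    `floor` is the lowest score the scale can take (assumption: 0 unless the
    caller chooses to subtract the scale minimum, e.g. 30 for PANSS Total).
    """
    span = baseline - floor
    if span <= 0:
        raise ValueError("baseline must be greater than the scale floor")
    return 100.0 * (baseline - current) / span


def meets_early_criterion(baseline: float, week2: float,
                          threshold_pct: float = 20.0, floor: float = 0.0) -> bool:
    """True if the Week 2 score has fallen by at least `threshold_pct` percent."""
    return percent_reduction(baseline, week2, floor) >= threshold_pct


# Hypothetical patient: Total score 100 at baseline, 80 at Week 2.
print(percent_reduction(100, 80))            # uncorrected percent reduction
print(percent_reduction(100, 80, floor=30))  # with the scale minimum subtracted
print(meets_early_criterion(100, 80))
```

The same raw change yields a larger percent reduction once the scale minimum is subtracted, so any fixed cut-off such as 20% only has meaning alongside the convention used to compute it.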
In this study we used data from 6 randomized, double-blind clinical trials of atypical antipsychotic medications for treatment of moderately to severely ill patients with chronic schizophrenia to develop a simple decision tree employing early symptom improvement to predict longer-term response to treatment. Specifically, we used classification and regression tree (CART) analysis [14, 15] to identify what amount of change in which of the fewest PANSS symptom measures at the earliest time in treatment is most predictive of response or non-response at Week 8 of treatment. Once a decision tree was created, we tested its validity by applying it to predict response in 2 independent studies with similar designs and patient populations.
Discussion
We used CART analysis on data pooled from 6 large, double-blind, randomized atypical antipsychotic therapy trials to develop a simple 2-branch decision tree that uses the amount of improvement in 6 PANSS items after 2 weeks of treatment to predict the likelihood of longer-term response to continued use of the same treatment by Week 8. When the model was applied to the data set from which it was created, we identified likely responders with 79% accuracy, likely non-responders with 75% accuracy, and a small group of patients (8%) in whom a prediction could not be made. In analyses of the 3 predicted response groups, patients identified as likely responders were more commonly female and had a shorter duration of illness than likely non-responders. They were also more severely ill at baseline across a spectrum of symptom domains, and showed significantly more rapid and more pronounced improvement over time.
At the first branch of the decision tree, patients were assessed for improvement in a composite variable consisting of 5 of the PANSS Positive items. A separate analysis of individual PANSS items showed strong correlation among these 5 items, lending strength to our model. Several other prediction studies have identified early improvement in positive symptoms as the primary predictor of later response and non-response. Correll et al [9] evaluated BPRS scores as a predictor of later non-response and identified lack of improvement in the thought disturbance factor as the strongest predictor of non-response. Similarly, in 2 studies in which early improvement in PANSS Total score was shown to be predictive of later response [1, 2], 40% of the overall improvement was due to the PANSS Positive subscore, with the other subscores contributing only 27% to 33% of the improvement.
Finding that early improvement in positive symptoms accurately predicted later improvement confirmed results of earlier studies and extended these results by specifying how early improvement can be assessed and how much improvement is required to accurately predict later response. We focused on a population of patients with chronic schizophrenia who were moderately to severely ill, with predominantly positive symptoms. This describes the majority of patients who come to medical attention and who are initiated or re-initiated on antipsychotic treatment, or who have their current medication changed or adjusted. Also, antipsychotics are most effective in treating positive symptoms, so a model that predicts later response based on early response is likely to include them. Whether this model applies to patients without prominent positive symptoms will be an interesting area for future research.
Our decision tree was strengthened and stabilized by the large number of patients and the diversity of studies pooled to create our data set. Although each of the studies had a similar design and similar enrollment criteria, they were carried out at both U.S. and international sites, included 5 different atypical antipsychotics, and were completed over an 11-year period. An analysis of model performance by study and by compound suggested that the creation of the model had not been driven by any one study or any one compound, but rather was a reasonable reflection of its component parts.
Many strategic decisions are made during a CART analysis, and the final usefulness of a classification tree depends heavily on the skill and judgment of the statisticians, scientists, and clinicians who determine the input variables and prune the output. We chose changes in PANSS item scores obtained at Weeks 1 and 2 as input variables based on prior work with prediction models. Our analysis of PPV and NPV by study week illustrates the trade-offs inherent in choosing the time for early assessment. While PPV stayed relatively stable regardless of the time point chosen, NPV improved as treatment time passed. However, in the clinical setting, delaying a treatment decision in order to improve NPV translates into additional suffering for some patients who will not improve on present therapy, so we included only very early time points as input variables. We also chose individual PANSS items as input variables rather than the PANSS Total score. The practicality of using a tool consisting of only 6 questions rather than administering the entire 30-question PANSS cannot be overemphasized.
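For readers less familiar with the two measures, the sketch below shows how PPV and NPV follow from a 2x2 table of predicted versus observed response. The class name, field names, and counts are hypothetical and chosen only for round numbers; they are not data from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionCounts:
    """2x2 table of early prediction versus observed Week 8 outcome."""
    true_pos: int    # predicted responder, did respond
    false_pos: int   # predicted responder, did not respond
    true_neg: int    # predicted non-responder, did not respond
    false_neg: int   # predicted non-responder, did respond


def ppv(c: PredictionCounts) -> float:
    """Positive predictive value: share of predicted responders who responded."""
    predicted_pos = c.true_pos + c.false_pos
    if predicted_pos == 0:
        raise ValueError("no patients were predicted to respond")
    return c.true_pos / predicted_pos


def npv(c: PredictionCounts) -> float:
    """Negative predictive value: share of predicted non-responders who did not respond."""
    predicted_neg = c.true_neg + c.false_neg
    if predicted_neg == 0:
        raise ValueError("no patients were predicted not to respond")
    return c.true_neg / predicted_neg


# Illustrative counts only (not study data).
counts = PredictionCounts(true_pos=40, false_pos=10, true_neg=30, false_neg=20)
print(f"PPV = {ppv(counts):.2f}, NPV = {npv(counts):.2f}")
```

Repeating this calculation with predictions made at each candidate assessment week is one way to lay out the trade-off between earlier decisions and higher NPV.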
Pruning is another area where investigator input is required. A decision tree that is too complex will likely reflect the idiosyncrasies of the data set used to create it, while a tree that is too simple may not fit any data set very well. In constructing this decision tree, we considered pruning back to the first branch, at which point the NPV was 72%. By including the additional branch, we improved the NPV to 75%, but in doing so, added another level of complexity to what could have been a single-question algorithm. This balance between complexity and increased predictive value must be weighed by the practicing clinician.
We found that our model, which is based on symptom improvement at Week 2, predicted response at Week 8 more accurately than did change in the CGI-I at Week 2. Accurate prediction of later response may require a careful assessment of the current severity of a few distinct symptoms rather than a summary measure of change over 2 weeks that includes consideration of the patient's history, psychosocial circumstances, symptoms, behavior, and functioning.
Constructing a decision tree using CART analysis requires having access to a very large database with a consistent set of patients (eg, similar inclusion/exclusion criteria), consistent study design (eg, timing, blinding, randomization), and consistent clinical measurements (eg, PANSS). With the appropriate database, CART analysis is a powerful method for answering clinical questions. It is non-parametric; that is, no assumptions are made regarding the underlying distribution of variables, and it permits consideration of both continuous and categorical data. It can handle data sets with large numbers of input variables, each with many possible values; identify significant high-order interactions among variables; detect and quantify non-linear relations; determine which predictors can be dropped because they contain redundant information; and handle missing values through a process of "surrogate substitution." The tree can be simplified because the response variables are, for the most part, conditionally independent of variables not in the tree. CART analysis is limited in that it can be unstable; that is, a small change in the input variables or a fresh data sample can lead to construction of a very different classification tree. In addition, CART analysis is limited in that it can account for only one response variable [14, 15].
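To make the branching step concrete, the following minimal sketch searches a single continuous predictor for the cut-off that best separates responders from non-responders. It assumes the Gini impurity criterion, a common choice for classification trees; this section does not state which splitting criterion or software was used in the present analysis, and the function names and toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity for binary labels (1 = responder, 0 = non-responder)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(values: Sequence[float],
               labels: Sequence[int]) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted impurity) of the best single cut-off.

    The threshold is None when no cut-off lowers impurity below that of the
    unsplit sample.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold: Optional[float] = None
    best_impurity = gini(labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_impurity = weighted
    return best_threshold, best_impurity


# Toy data: point improvement on one item and later response (not study data).
improvement = [0, 1, 2, 3]
responded = [0, 0, 1, 1]
print(best_split(improvement, responded))
```

A full CART procedure repeats this search over every candidate predictor at every node, which is why a small change in the sample can shift the winning variable or cut-off and produce a noticeably different tree.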
The answer provided by CART analysis (the percent likelihood of response) is in line with the mindset of most clinicians. After initiating treatment, clinicians routinely evaluate their patients, make judgments about their progress, and decide whether they should continue treatment or make a change. This decision tree simply provides evidence-based tools (an algorithm and its associated PPV and NPV) to help quantify the decision-making process. Clinicians must still actively consider the unique presentation of each patient; for example, a patient with a high probability of failure as determined by the algorithm may benefit from continuing the treatment if they have already failed several previous medications, and a patient with a high probability of success may have treatment discontinued due to an intolerable adverse event.
We recognize that when outcomes have been categorized as dichotomous (response/non-response), other approaches such as logistic regression can be used to create a predictive model. The strengths of CART compared with logistic regression are worth noting in this setting. First, CART analysis offers a great deal of flexibility because it requires very few assumptions about the data. Due to its natural branching process, CART is able to handle many dozens of potential predictor variables as well as potential interactions between predictors and non-linear effects of individual predictors. CART also has algorithms that account for missing data among the predictors, whereas with logistic regression, entire observations are eliminated if a single predictor is missing. Furthermore, at the end of a CART analysis, subgroups (responders and non-responders) are specifically defined by the variable selection and cut-offs created by the methodology. With logistic regression, significant predictors are identified, but the coefficients do not automatically define subgroups of patients. Finally, our experience suggests that the stepwise, dichotomous approach used with a decision tree more closely mimics clinical decision-making than does the use of odds ratios, which indicate association but do not clearly direct treatment decisions.
Two patient populations identified in our analysis deserve special attention from a clinical standpoint: first, the 8% of patients in whom a prediction could not be made, and second, the 24% who were misidentified. The decision criteria laid out in this model allow for a category of patients (those who show improvement in the excitement item at Week 2, but who do not show significant improvement in positive symptoms) in whom the prediction of later response is no better than a coin toss. Fortunately, this population is relatively small, <10% in 14 of the 15 analysis subsets (learning data set analysis, validation analyses, analysis by study, and analysis by compound). Of those patients in the learning data set who were misidentified, the majority were identified as non-responders but eventually did respond. The clinical ramifications of changing treatment when no change is needed versus continuing use of an ineffective treatment must be carefully weighed by all those involved.
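The three predicted groups discussed in this section can be summarized as a small rule. The sketch below is inferred from the description given here (positive-symptom composite at the first branch, excitement item at the second) and is not the published algorithm. The names are hypothetical, and the two boolean inputs stand in for the actual cut-offs, which are not stated in this section.

```python
from enum import Enum


class Prediction(Enum):
    LIKELY_RESPONDER = "likely responder"
    LIKELY_NON_RESPONDER = "likely non-responder"
    NO_PREDICTION = "no prediction"


def classify_week2(positive_improved: bool, excitement_improved: bool) -> Prediction:
    """Map Week 2 findings to a predicted Week 8 group.

    `positive_improved`: the positive-symptom composite met its improvement
    cut-off (cut-off value not given in this section).
    `excitement_improved`: the excitement item met its improvement cut-off
    (cut-off value not given in this section).
    """
    if positive_improved:
        return Prediction.LIKELY_RESPONDER
    if excitement_improved:
        # Improvement in excitement without positive-symptom improvement:
        # later response is described as no better than a coin toss.
        return Prediction.NO_PREDICTION
    return Prediction.LIKELY_NON_RESPONDER


for pos in (True, False):
    for exc in (True, False):
        print(pos, exc, classify_week2(pos, exc).value)
```

Keeping "no prediction" as an explicit third outcome, rather than forcing every patient into one of two classes, mirrors the way the undetermined group is reported separately in the analyses.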
Although we developed this model using response data from Week 8, the strength of the model was diminished only slightly when it was used to predict response at Weeks 12 and 24. Further research is needed to see if a model derived using response data from longer-term treatment might have greater accuracy in predicting response beyond Week 8.
Currently, this decision tree is appropriately used only for adult patients with chronic schizophrenia who are moderately to severely ill. It needs to be tested in other patient populations, such as those with chronic schizophrenia who are only mildly ill, and in patients with first episode psychosis. In addition, it needs to be validated in a prospective study design and to be assessed for inter-rater reliability. It is possible that introduction of additional variables (functional measurements or patients' perception of benefit, for example) would have altered the final decision tree. Further model building using data from different populations and different input variables is needed to establish the stability of this particular decision tree and to perhaps construct a tree with even better predictive characteristics and user-friendliness. In addition, were a tool such as this decision tree to be developed for use in the clinical setting, input on clear and effective phrasing would be needed from clinicians. For example, would verbal descriptions such as "substantial improvement" or numeric descriptions such as "a 2-point drop" be preferred?
Competing interests
All authors are employees of Eli Lilly and Company and/or one of its subsidiaries, and all authors own stock in Eli Lilly and Company.
Authors' contributions
Authors SR and LC conceived of and designed this study. Author LC performed the statistical analysis. Author SR helped to draft the manuscript. All authors contributed substantially to interpretation of the data, critically revised the manuscript, and read and approved the final manuscript.