Background
The continued presence of unrelieved pain as a serious public health issue has been well documented [1-4]. Opioids have increasingly been used in recent years to address the problem, and they may be an important component of treatment for chronic pain [5]. However, some patients who receive a prescription for opioids may be prone to non-adherence or misuse behaviors, such as escalating their opioid doses, visiting multiple providers, or other overt drug-seeking behaviors [6, 7]. The identification of opioid misuse is thus critical in the treatment of chronic non-cancer pain [7], but many physicians lack adequate training to make such an identification [8].
To assist clinicians in recognizing aberrant drug-related behavior among patients who have been prescribed opioids, Butler et al. [7] introduced the Current Opioid Misuse Measure (COMM), a 17-item self-report questionnaire. Because the COMM is designed to help assess whether a given respondent is currently misusing opioids, it should be distinguished from other instruments that predict future misuse [9]. Previous research has validated the COMM against the Aberrant Drug Behavior Index (ADBI), a combination measure that incorporates information from a questionnaire taken by the patient, a questionnaire taken by the treating clinician, and urine toxicology results. The COMM was found to exhibit adequate sensitivity and specificity in both its original validation study [7] and a cross-validation study using a new population of patients [9].
Although most respondents can finish the full-length COMM in a reasonable amount of time, some individuals may be unable or unwilling to complete all 17 items. Members of compromised subpopulations (e.g., those with physical ailments and those with low reading levels) are less likely to accept lengthy screeners [10]. Administering fewer items may improve both the response rate [11] and the quality of an individual's answers [12]. Shortening an instrument may also lessen the stress associated with taking it and decrease the likelihood that respondents will drop out midway through [13]. Finally, reducing respondent burden is especially critical for lowering drop-out rates when a questionnaire is given to patients on multiple occasions over time [13, 14]; such longitudinal tracking of patients was listed as a goal of the COMM in its original validation study [7]. These considerations demonstrate the need for a version of the COMM that lessens respondent burden while maintaining the sensitivity and specificity of the full-length instrument.
A number of recent articles have shown that modern advancements in computer-based testing can enhance the efficiency of an assessment [15-17]. In particular, computer-based forms have the potential to achieve the same levels of sensitivity and specificity as their paper-and-pencil counterparts while using fewer items on average [18, 19]. The reason for this advantage is that computer-based forms can track a respondent's answers as the test progresses. By performing interim analyses of these answers while the assessment is underway, a computer program can customize the test to the individual taking it. A well-known component of this customization is the use of sequential stopping rules to determine the appropriate test length for a given respondent. Two such stopping rules that have been proposed for computer-based testing are curtailment and stochastic curtailment. As will be explained in the Methods section, curtailment and stochastic curtailment attempt to shorten each respondent's questionnaire while preserving the test outcome ("positive" or "negative") that would have been obtained if the full-length questionnaire had been used. In retrospective analyses of responses to the Medicare Health Outcomes Survey [18] and the Center for Epidemiologic Studies-Depression (CES-D) scale [19], both methods substantially reduced the average number of items administered without compromising sensitivity and specificity. However, no previous study has investigated either method as a means of enhancing the efficiency of the COMM.
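The logic of (deterministic) curtailment can be illustrated with a short sketch. It assumes, for concreteness, the COMM's 0-4 item scoring, its 17-item length, and the ≥ 9 screen-in cutoff used later in this article; the function itself is our illustration, not code from the studies cited.

```python
def curtailment_decision(scores_so_far, n_items=17, max_item_score=4, cutoff=9):
    """Deterministic curtailment for a summed-score screener.

    Stop "positive" as soon as the interim score already meets the cutoff;
    stop "negative" as soon as even maximal scores on all remaining items
    could not reach the cutoff. Otherwise, administer the next item.
    """
    interim = sum(scores_so_far)
    remaining = n_items - len(scores_so_far)
    if interim >= cutoff:
        return "screened in"
    if interim + remaining * max_item_score < cutoff:
        return "screened out"
    return "continue"

# After 15 items answered "never" (0), even two maximal answers (4 + 4 = 8)
# cannot reach the cutoff of 9, so the test can stop with a negative result.
print(curtailment_decision([0] * 15))   # -> screened out
print(curtailment_decision([4, 4, 1]))  # -> screened in
```

Because the rule only stops when the final classification is already mathematically determined, the shortened and full-length outcomes always agree.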
The purpose of this study is twofold. First, we describe how the stopping rules of curtailment and stochastic curtailment can be applied to the COMM. Second, we evaluate how successful these rules are in reducing respondent burden and maintaining sensitivity and specificity. For the latter objective, we conducted an analysis of existing data from individuals who had already been administered the full-length COMM and been classified via the ADBI; thus, the research constitutes a “proof-of-concept” study for using a computer-based COMM in the future.
The article is organized as follows. The Methods section describes how the data were collected, how the COMM and ADBI are scored, how curtailment and stochastic curtailment can be used in conjunction with the COMM, and what analyses were performed. The Results section presents sensitivity and specificity values, the average test length of each method, and other statistics. The Discussion and Conclusions sections explore the implications of the study and present ideas for future work.
Results
Following the application of exclusion rules, the training dataset had a final sample size of n = 214, while the test dataset had a final sample size of n = 201. In the training dataset, the mean (SD) COMM score was 10.1 (7.5); 104 respondents (48.6%) were “screened in” by the COMM using a ≥ 9 cutoff, and 73 respondents (34.1%) were classified as positive by the ADBI. In the test dataset, the mean (SD) COMM score was 8.9 (6.9); 86 respondents (42.8%) were “screened in” by the COMM using a ≥ 9 cutoff, and 64 respondents (31.8%) were classified as positive by the ADBI.
Table 1 presents descriptive statistics for each item of the COMM. In both the training and test datasets, the median value for every item was either 0 ("never") or 1 ("seldom"). For 16 of the 17 items, the median in the training dataset was equal to the median in the test dataset; the lone exception was item 8 ("How often have you had trouble controlling your anger (e.g., road rage, screaming, etc.)?"), which had a median of 1 in the training dataset and a median of 0 in the test dataset. In both datasets, the item with the highest mean was item 1 ("How often have you had trouble with thinking clearly or had memory problems?"), which had a mean of 1.3 in the training dataset and 1.5 in the test dataset. No item's mean in the training dataset differed by more than 0.2 from its mean in the test dataset.
Table 2 presents stopping boundaries for each of the sequential methods under study (curtailment, SC-99, SC-95, and SC-90). That is, for each sequential method examined herein, the table indicates the scores that result in early stopping for each stage of the test. Scores that produce a "screened out" result are labeled "Negative stopping," while scores that produce a "screened in" result are labeled "Positive stopping." For example, after 10 items have been completed, curtailment never stops to produce a "screened out" result (denoted "N/A" in Table 2 to indicate "Not Applicable"), but it stops to produce a "screened in" result if the respondent's score is ≥ 9 at that stage. Continuing in the same row of the table, SC-99 stops after 10 items if the respondent's score is ≤ 2 ("screened out") or ≥ 9 ("screened in"); SC-95 stops if the respondent's score is ≤ 3 ("screened out") or ≥ 8 ("screened in"); SC-90 stops if the respondent's score is ≤ 4 ("screened out") or ≥ 8 ("screened in"). These results illustrate that SC-90 is the most liberal stopping rule of the four (it has the largest range of scores that result in early stopping), followed by SC-95, SC-99, and curtailment (the most conservative stopping rule).
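The stage-10 row just described can be expressed as a small look-up sketch. The (negative, positive) bounds below are copied from that row of Table 2; the function and variable names are ours.

```python
# (negative bound, positive bound) after 10 items, from Table 2.
# None means no early stopping of that type at this stage.
STAGE_10_BOUNDS = {
    "curtailment": (None, 9),
    "SC-99": (2, 9),
    "SC-95": (3, 8),
    "SC-90": (4, 8),
}

def check_stage_10(method, interim_score):
    """Return the stopping decision after 10 items for one method."""
    negative, positive = STAGE_10_BOUNDS[method]
    if negative is not None and interim_score <= negative:
        return "screened out"
    if interim_score >= positive:
        return "screened in"
    return "continue"

# An interim score of 8 after 10 items is screened in by SC-95 and SC-90,
# but continues under the more conservative SC-99 and curtailment.
for method in STAGE_10_BOUNDS:
    print(method, check_stage_10(method, 8))
```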
Table 2 Stopping boundaries for curtailment and stochastic curtailment (based on the training dataset: n = 214)

| Items completed | Curtailment: negative stopping | Curtailment: positive stopping | SC-99: negative stopping | SC-99: positive stopping | SC-95: negative stopping | SC-95: positive stopping | SC-90: negative stopping | SC-90: positive stopping |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | Score = 4 |
| 2 | N/A | N/A | N/A | Score = 8 | N/A | Score ≥ 6 | N/A | Score ≥ 5 |
| 3 | N/A | Score ≥ 9 | N/A | Score ≥ 9 | N/A | Score ≥ 7 | N/A | Score ≥ 6 |
| 4 | N/A | Score ≥ 9 | N/A | Score ≥ 8* | N/A | Score ≥ 7 | Score = 0 | Score ≥ 6 |
| 5 | N/A | Score ≥ 9 | N/A | Score ≥ 8* | N/A | Score ≥ 7 | Score = 0 | Score ≥ 6 |
| 6 | N/A | Score ≥ 9 | N/A | Score ≥ 9 | Score = 0 | Score ≥ 8 | Score ≤ 1 | Score ≥ 7 |
| 7 | N/A | Score ≥ 9 | Score = 0 | Score ≥ 9 | Score ≤ 2 | Score ≥ 8 | Score ≤ 2 | Score ≥ 7 |
| 8 | N/A | Score ≥ 9 | Score ≤ 1 | Score ≥ 9 | Score ≤ 3 | Score ≥ 8 | Score ≤ 3 | Score ≥ 8 |
| 9 | N/A | Score ≥ 9 | Score ≤ 1 | Score ≥ 9 | Score ≤ 3 | Score ≥ 8 | Score ≤ 3 | Score ≥ 8 |
| 10 | N/A | Score ≥ 9 | Score ≤ 2 | Score ≥ 9 | Score ≤ 3 | Score ≥ 8 | Score ≤ 4 | Score ≥ 8 |
| 11 | N/A | Score ≥ 9 | Score ≤ 3 | Score ≥ 9 | Score ≤ 4 | Score ≥ 9 | Score ≤ 4 | Score ≥ 8 |
| 12 | N/A | Score ≥ 9 | Score ≤ 3 | Score ≥ 9 | Score ≤ 4 | Score ≥ 9 | Score ≤ 4 | Score ≥ 8 |
| 13 | N/A | Score ≥ 9 | Score ≤ 4 | Score ≥ 9 | Score ≤ 5 | Score ≥ 9 | Score ≤ 6 | Score ≥ 9 |
| 14 | N/A | Score ≥ 9 | Score ≤ 5 | Score ≥ 9 | Score ≤ 6 | Score ≥ 9 | Score ≤ 6 | Score ≥ 9 |
| 15 | Score = 0 | Score ≥ 9 | Score ≤ 5 | Score ≥ 9 | Score ≤ 6 | Score ≥ 9 | Score ≤ 7 | Score ≥ 9 |
| 16 | Score ≤ 4 | Score ≥ 9 | Score ≤ 6 | Score ≥ 9 | Score ≤ 6 | Score ≥ 9 | Score ≤ 7 | Score ≥ 9 |
| 17 | Score ≤ 8 | Score ≥ 9 | Score ≤ 8 | Score ≥ 9 | Score ≤ 8 | Score ≥ 9 | Score ≤ 8 | Score ≥ 9 |

N/A = not applicable (no early stopping of that type at that stage). *Adjusted upward to Score ≥ 9 in the constrained version of stochastic curtailment.
Most of the stopping boundaries in Table 2 are monotonically nondecreasing: as the stage of the test advances, the cutoff score required for early stopping generally does not decrease. The one exception is SC-99, which has a "Positive stopping" requirement of ≥ 9 at stage 3 but only ≥ 8 at stages 4 and 5. This result may be surprising, as a score of 8 would intuitively be considered weaker evidence of a final "screened in" result at stage 4 or 5 than at stage 3. The result is possible, however, because the stopping boundaries were obtained by fitting a separate logistic regression model at each stage of the questionnaire, without constraining the boundaries to be monotonic. If monotonic boundaries are preferred, a simple constrained version of stochastic curtailment may be defined whereby the boundaries are adjusted to be nondecreasing. To be conservative, the stopping rule at stages 4 and 5 would be adjusted upward to ≥ 9, rather than adjusting the stopping rule at stage 3 downward to ≥ 8. The constrained boundaries are noted in Table 2.
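The procedure described above, a separate logistic regression per stage followed by a conservative (upward) monotone adjustment, can be sketched as follows. The data here are synthetic, generated only so the block runs; γ = 0.95 mimics SC-95, and all names are ours rather than the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N_ITEMS, MAX_SCORE, CUTOFF, GAMMA = 17, 4, 9, 0.95

# Synthetic stand-in for a training dataset: 300 respondents with items
# scored 0-4, using person-level propensities so both outcomes occur.
propensity = rng.beta(1, 6, size=(300, 1))
X = rng.binomial(MAX_SCORE, propensity, size=(300, N_ITEMS))
screened_in = X.sum(axis=1) >= CUTOFF            # full-length result

# Stage-by-stage positive boundaries: at each stage, regress the
# full-length result on the interim score, then find the smallest interim
# score whose predicted probability of a final "screened in" is at least
# GAMMA (None if no attainable score qualifies at that stage).
pos_bounds = []
for stage in range(1, N_ITEMS):
    interim = X[:, :stage].sum(axis=1).reshape(-1, 1)
    model = LogisticRegression().fit(interim, screened_in)
    grid = np.arange(stage * MAX_SCORE + 1).reshape(-1, 1)
    p_in = model.predict_proba(grid)[:, 1]
    hits = np.flatnonzero(p_in >= GAMMA)
    pos_bounds.append(int(hits[0]) if hits.size else None)

# Conservative monotone constraint: raise any later boundary that dips
# below an earlier one (a running maximum), rather than lowering the
# earlier boundary.
constrained, running = [], None
for b in pos_bounds:
    if b is None:
        constrained.append(None)
    else:
        running = b if running is None else max(running, b)
        constrained.append(running)

print("unconstrained:", pos_bounds)
print("constrained:  ", constrained)
```

The negative boundaries would be obtained symmetrically, as the largest interim score whose predicted probability of a final "screened in" is at most 1 − γ.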
Table 3 presents the sensitivity and specificity values of each method, as well as statistics related to respondent burden. The full-length COMM had a sensitivity of 0.703 and a specificity of 0.701 for predicting the ADBI. Curtailment and SC-99 always produced the same result ("screened in" or "screened out") as the full-length COMM; hence, these methods had a sensitivity and specificity of 1 for predicting the full-length COMM, as well as a sensitivity and specificity of 0.703 and 0.701, respectively, for predicting the ADBI. SC-95 and SC-90 did not always match the result of the full-length COMM, but their sensitivities for it were 0.977 and 0.965, respectively, and their specificities for it were 1 and 0.991, respectively. Moreover, these methods exhibited sensitivities of 0.688 or more, and specificities of 0.708 or more, for predicting the ADBI.
Table 3 Sensitivity, specificity, and respondent burden of each method (based on the test dataset: n = 201)

| Method | Sensitivity (full-length COMM) | Specificity (full-length COMM) | Sensitivity (ADBI) | Specificity (ADBI) | Mean test length | SD of test length | Stopped early (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Full-length COMM | 1 | 1 | 0.703 | 0.701 | 17.0 | 0.0 | 0.0 |
| Curtailment | 1 | 1 | 0.703 | 0.701 | 13.3 | 4.2 | 71.6 |
| SC-99 | 1 | 1 | 0.703 | 0.701 | 10.7 | 3.9* | 88.1 |
| SC-95 | 0.977 | 1 | 0.703 | 0.715 | 8.7 | 4.0 | 90.0 |
| SC-90 | 0.965 | 0.991 | 0.688 | 0.708 | 7.0 | 4.2 | 96.5 |

*Under the constrained version of SC-99, the SD of test length was 3.8.
Regarding respondent burden, the full-length COMM, by definition, never stops prior to the seventeenth item; therefore, its average (SD) test length was 17.0 (0.0), and its percentage of respondents whose tests stopped early was 0%. The average (SD) test lengths for curtailment, SC-99, SC-95, and SC-90 were 13.3 (4.2), 10.7 (3.9), 8.7 (4.0), and 7.0 (4.2), respectively. The percentage of respondents whose tests stopped early was at least 71.6% for every sequential stopping method; the highest such value was 96.5%, which was observed for SC-90.
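Statistics of this kind come from replaying each respondent's recorded answers under a stopping rule. A minimal sketch for curtailment (the simplest of the four rules), using toy response vectors and our own function names:

```python
def replay_curtailed(responses, cutoff=9, max_item_score=4):
    """Replay one respondent's recorded answers under curtailment.

    Returns (result, number_of_items_administered). By construction, the
    result always matches the full-length classification.
    """
    n, interim = len(responses), 0
    for i, answer in enumerate(responses, start=1):
        interim += answer
        if interim >= cutoff:
            return "screened in", i                  # cutoff already met
        if interim + (n - i) * max_item_score < cutoff:
            return "screened out", i                 # cutoff now unreachable
    # Unreachable for a nonempty response vector, but kept for safety.
    return "screened out", n

# Toy test set: an all-zero respondent stops negative after 15 items
# (two maximal answers could only add 8 < 9), and an all-ones respondent
# stops positive at item 9.
toy = [[0] * 17, [1] * 17, [4, 4, 1] + [0] * 14]
results = [replay_curtailed(r) for r in toy]
lengths = [k for _, k in results]
print(results)
print("mean test length:", sum(lengths) / len(lengths))
```

Consistent with Table 2, an all-zero respondent cannot be screened out before item 15 under curtailment.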
We note that if the constrained version of SC-99 were used, the results would be nearly identical to those of SC-99. The two methods had the same sensitivity values, specificity values, and percentages of respondents whose tests stopped early. Their average test lengths were equal to one decimal place. The standard deviation of test lengths was 3.8 for the constrained SC-99, as opposed to 3.9 for SC-99.
Discussion
There are many well-known benefits of computer-based testing, including automated scoring, immediate data entry, and facilitated tracking of change over time [29]. As described above, computer-based testing can also be coupled with sequential stopping rules to reduce the respondent burden of an assessment. Two such stopping rules, both of which enhanced the efficiency of the Medicare Health Outcomes Survey [18] and the CES-D [19] in previous post-hoc simulations, are curtailment and stochastic curtailment. However, prior to the current study, neither of these stopping rules had been investigated for use with the COMM.
Results of the study indicated that both curtailment and stochastic curtailment have the potential to reduce the respondent burden of the COMM without compromising sensitivity and specificity. Curtailment lowered the average test length by 22% while maintaining the same sensitivity and specificity as the full-length COMM. SC-99 also maintained sensitivity and specificity values identical to those of the full-length COMM while reducing the average test length by 37%. The sensitivities and specificities of SC-95 and SC-90 for predicting the ADBI were within 1.5% of those of the full-length COMM, while these methods reduced the average test length by 49% and 59%, respectively.
Which sequential stopping rule to use operationally will depend on the practitioner’s desired level of concordance with the full-length COMM. If the practitioner requires that the result of the shortened version (“screened in” or “screened out”) match that of the full-length version for all respondents, then curtailment is the correct method to use. We note that the result of SC-99 also matched that of the full-length COMM for every respondent considered in the current study; however, due to its stochastic nature (and unlike curtailment), SC-99 is not guaranteed to match the full-length COMM for 100% of future respondents. If the practitioner is willing to accept a possible decrement in sensitivity and/or specificity to achieve a greater reduction in respondent burden, then the more aggressive SC-99 may be preferred to curtailment. Further gains in average test length can be achieved by using the more liberal SC-95 or SC-90, although these methods may also exhibit less concordance with the result of the full-length COMM.
While the methods examined herein produced substantial improvements in efficiency, these improvements could potentially be enhanced by considering the ordering of the COMM items. The current article assumed that items would be presented in the same order as in the paper-and-pencil version of the COMM. This assumption was made to promote comparability between the computerized and paper-and-pencil versions. However, it ignores the possibility that in a computerized version of the COMM, items could be judiciously ordered to augment the gains made by curtailment and stochastic curtailment. Previous research [30] has shown that presenting the most informative items (e.g., the items selected first by a stepwise logistic regression procedure) at the beginning of the test reduces the average test lengths of curtailed and stochastically curtailed assessments without loss of classification accuracy. Future research should investigate the impact of item ordering in a computer-based COMM.
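One way such an ordering might be derived is a greedy forward-stepwise logistic regression, as sketched below. The training data are synthetic (items differ only in random endorsement rates, so the resulting order is purely illustrative), and all names are ours, not the procedure of the cited work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
N_ITEMS = 17

# Synthetic training responses: each item has its own endorsement rate.
item_rates = rng.uniform(0.05, 0.30, size=N_ITEMS)
X = rng.binomial(4, item_rates, size=(250, N_ITEMS))
y = X.sum(axis=1) >= 9                      # full-length screen-in result

# Greedy forward selection: repeatedly add the item that, together with
# those already chosen, best predicts the full-length classification.
selected, remaining = [], list(range(N_ITEMS))
while remaining:
    def fit_score(j):
        cols = X[:, selected + [j]]
        model = LogisticRegression(max_iter=1000).fit(cols, y)
        return model.score(cols, y)         # in-sample accuracy
    best = max(remaining, key=fit_score)
    selected.append(best)
    remaining.remove(best)

print("suggested administration order:", selected)
```

Items chosen earliest by this procedure would be administered first, so that curtailment or stochastic curtailment could reach a stopping boundary after fewer items.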
Another mechanism by which the COMM's statistical properties could potentially be improved is the use of a more sophisticated classification model. The simple ≥ 9 cutoff rule is desirable from the standpoints of interpretability and logistical ease, but more rigorous statistical classification tools might achieve better sensitivity and specificity. Curtailment and stochastic curtailment have previously been studied alongside a multiple logistic regression classification rule [18, 30]; such a rule has the added benefit of facilitating the inclusion of demographic information in the model if desired. Additionally, classifying respondents via Item Response Theory (IRT) and computerized adaptive testing (CAT), which would allow the item ordering to be individualized at the respondent level, could be explored. Although the suitability of IRT and CAT for application to the COMM has not yet been examined, a comparison between the curtailed, stochastically curtailed, and CAT-based versions of the COMM could be illuminating. Previous comparisons using the CES-D suggested that stochastic curtailment can achieve reductions in test length similar to those of CAT while exhibiting greater concordance with the classifications of the full-length instrument [19].
Regarding limitations of the current study, one was its retrospective nature: existing responses were analyzed post-hoc rather than being collected prospectively. However, such post-hoc simulation to establish a "proof-of-concept" is a typical first step in evaluating potential computer-based testing procedures [14, 18, 19, 21]. A second limitation was the sample size of the study, which was smaller than those of previous applications of curtailment and stochastic curtailment [18, 19]. However, the fact that the stopping rules were successful when applied to the test dataset, despite having been trained on a relatively modest training dataset, suggests the robustness of the methodology. Third, the methods were evaluated using only one test dataset, limiting the generalizability of the results. We further caution readers that while the look-up table for curtailment (Table 2) is applicable to any population for which a ≥ 9 cutoff is appropriate, the look-up tables for SC-99, SC-95, and SC-90 (also shown in Table 2) are sample-specific and may not be suitable for other populations. Therefore, before stochastic curtailment is used in a different population of respondents, new look-up tables should be calculated from pilot data in that population. See Finkelman et al. [19] for a thorough description of how to calculate such look-up tables.
It should be reiterated that at 17 items, the full-length COMM is not unduly time-consuming for many individuals who take it. Nevertheless, reducing the respondent burden of an assessment can yield benefits such as an enhanced response rate [11], including among members of compromised subgroups [10]. Other potential advantages of shortening an instrument are an improvement in the quality of answers obtained [12], a reduction in respondents' stress levels [13], and a lower drop-out rate [13]. Alleviating respondent burden may be particularly important for the COMM, given that this questionnaire was designed to be administered on multiple occasions to track patient status over time [7].
We also emphasize that the COMM was not designed as a standalone mechanism for classification. Rather, it was developed as a screening tool to help clinicians in their assessment of risk for opioid misuse [7, 9]. Therefore, the curtailed and stochastically curtailed versions should likewise be regarded as aids to clinicians rather than as standalone classification tools.
Further investigations should be conducted to replicate this study's results in different populations. In addition to retrospective analyses, curtailment and stochastic curtailment should be pilot-tested in live computer-based administrations. Subjects from compromised subpopulations, such as those with physical ailments and those with low reading levels, should be included in the pilot testing, given that these subpopulations are most likely to benefit from reduced test lengths [10]. The results of the live tests should be compared to the results of post-hoc simulation. All of these steps will be undertaken in future work.
Competing interests
SFB is an employee and shareholder of Inflexxion, Inc. Inflexxion holds the copyright for the Current Opioid Misuse Measure (COMM)®.
Authors’ contributions
MDF contributed to the conception of the study, analyzed the data, and prepared the manuscript. RJK contributed to the conception of the study and made significant comments on the manuscript. DZ and NS made significant comments on the interpretation of data and the manuscript. SFB contributed to the conception of the study, acquired the data, and made significant comments on the manuscript. All authors read and approved the final manuscript.