Background
Delirium is a preventable [1, 2] acute confusional disorder. In the US, delirium affects over 2.3 million hospitalized older adults each year [3] at an estimated total annual cost of $152 billion [4]. Recognition of delirium is a prerequisite for developing a coherent treatment program. However, delirium remains under-recognized and is consequently mismanaged in most clinical settings [5].
Formal diagnostic criteria for delirium were first codified in 1980 in the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders, Version 3 (DSM-III) [6]. Different definitions have appeared in subsequent DSM versions [7-9]. The first appearance of delirium in the International Classification of Diseases occurred in ICD-10 [10]. While the DSM clearly captures the key elements of the delirium syndrome, the DSM criteria themselves can be challenging to apply diagnostically, both in clinical practice and in research settings, particularly for patients who are not communicative [11]. Additionally, the DSM-IV criteria require knowledge of the underlying cause before a diagnosis can be made. In clinical practice, delirium is usually recognized first, and a search for the underlying cause then proceeds. Wide discrepancies in case identification have been reported when different criteria are used [11-13].
There are many methods for research and clinical diagnosis of delirium, operationalizing either the International Classification of Diseases (ICD) or DSM criteria [14]. The most commonly used algorithm for case identification of delirium is the Confusion Assessment Method (CAM) [15]. The CAM reduces the nine original DSM-III-R criteria to four key features, requiring the presence of both 1) acute change in mental status with a fluctuating course and 2) inattention, and either 3) disorganized thinking or 4) altered level of consciousness. A recent comprehensive review showed its strong performance characteristics and widespread use [16]. The CAM algorithm has been used in over 1600 publications over the past 14 years, more than 10 times more frequently than the DSM criteria [16]. The recommended interview prior to completion of the CAM is a short cognitive screening tool, including assessment of attention [17]. However, different researchers may operationalize the CAM features differently. To maximize the accuracy and reliability of the CAM, standardized mental status and neuropsychiatric assessments, questionnaires and ratings should be used to assess delirium symptoms [18]. However, because such assessments may require up to 30 minutes for administration and scoring [18], they are impractical for clinical use and burdensome for research studies. Therefore, reducing the length of screening interviews is an important step in improving case identification. Item response theory is a statistical tool that can help in this process. The goal of our work is to identify the most efficient set of items to determine the presence or absence of each of the CAM features.
Item response theory (IRT) encompasses a set of psychometric tools that, among other things [19], can help in the selection of optimal test questions to shorten instruments [20-25]. IRT is a statistical framework that relates observed patient data (responses to test items, or diagnostic signs and symptoms) to theoretical (i.e., latent) and presumed continuously distributed constructs. IRT can be considered an extension of classical factor analysis [26] and is a useful tool in test construction because it provides a framework for expressing characteristics of test-takers and test items on a uniform metric. IRT and factor analysis are isomorphic when the factor analysis is performed on a matrix of polychoric correlations and only one latent variable is modeled [26-28]. In this study, the unidimensional factor analysis results are item response theory results, and more globally the multidimensional factor analysis results are multidimensional item response theory results [29]. The ordinal dependent variable approach to factor analysis was described by Birnbaum in Lord and Novick's seminal work on IRT [30] and formalized by Christoffersson [31] and Muthén [32].
In our approach, insofar as unidimensionality is an assumption of IRT [33], we sought first to assess the extent to which our data satisfied this assumption before moving on to formal IRT analyses. In many IRT parameter estimation procedures, item parameters are assumed to be fixed and invariant across population subsamples [34]. This feature makes possible the construction of tests for specific uses or specific populations: tests can be constructed using only some items from a larger bank of items but still produce estimates of person level on the same metric as other tests using different items from the bank.
IRT posits models that express a person's response ($y_{ij}$), person-level trait ($\theta_i$), and item parameters ($a_j$, $b_j$). Let $y_{ij}$ represent person $i$'s response to item $j$ that is observed as correct (or symptom present) ($y = 1$) or incorrect (or symptom not expressed) ($y = 0$). The probability that a randomly selected person from the population expresses a symptom is

$$P_j(\theta_i) = P(y_{ij} = 1 \mid \theta_i) = G\left[a_j(\theta_i - b_j)\right]$$

where $G$ is some cumulative probability transformation, usually the inverse logit, but the normal cumulative distribution function is also used. The unobserved variable $\theta$ (e.g., latent level for the CAM feature of inattention) is often assumed to be distributed normally with mean zero and unit variance. The difference between a person's latent trait level ($\theta_i$) and the item difficulty (or item location, or symptom severity level, $b_j$) defines the probability that a person will display a symptom (e.g., "Trouble keeping track of what was being said," for the CAM feature of inattention). $P_j(\theta)$ describes the increasing probability of a randomly chosen patient displaying indicator $y_j$ with increasing values of the latent trait $\theta$.
If an item's symptom severity exceeds the person's level on the underlying trait, the person is less likely than not to express the symptom. The precise probability is modified by the strength of the relationship between the latent trait and the item response, captured with the item discrimination parameter ($a_j$). When logistic regression estimation procedures are used, it is common to include a scaling constant ($D$) so that the logit parameters are standardized [35].
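The response model described above can be sketched in a few lines of Python. This is only an illustration, assuming the inverse-logit form of G and a two-parameter logistic parameterization; the function name `response_probability` and the example parameter values are hypothetical and are not taken from the study.

```python
import math


def response_probability(theta, a, b, D=1.0):
    """Probability of expressing a symptom under a two-parameter logistic model.

    theta: person's latent trait level
    a:     item discrimination
    b:     item difficulty (symptom severity)
    D:     optional scaling constant (1.0 keeps the logit metric;
           1.7 is a commonly used value that approximates the normal ogive)
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: moderately discriminating, slightly severe symptom.
a, b = 1.3, 0.25
for theta in (-1.0, 0.25, 1.5):
    p = response_probability(theta, a, b)
    print(f"theta={theta:+.2f}  P(y=1)={p:.3f}")
```

When the trait level equals the item difficulty, the probability is exactly 0.5. A larger discrimination value makes the curve steeper around that point.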
Building tests to suit specific uses can employ the concept of item information [30]. Item information is expressed with $I_j(\theta) = a_j^2 P_j(\theta)[1 - P_j(\theta)]$. The more highly discriminating an item is, the more peaked its information function. Information functions are centered over the item difficulty parameter. Information is analogous to reliability in the sense that it expresses measurement precision: the standard error of measurement at a given trait level is the inverse square root of the information at that level. Due to the assumption of local independence, item information functions are additive. Local independence, an important basic assumption in IRT along with unidimensionality, means that, conditional on the latent trait, an answer to one item is not contingent or statistically dependent upon an answer to another item. The curve describing the sum of information over the underlying trait is called a test information curve. Taken together, it is possible to achieve fine control over where and how well a given item set measures a latent trait along the latent trait distribution (subject to the availability of items with the desired parameters).
The goal of this paper was to identify the shortest set of mental status assessment questions and interviewer observations that could be used to efficiently provide relevant information for screening about a patient's level on four CAM diagnostic features. We present our approach to developing an item bank for the future development of a screening tool using item response theory and related psychometric methods. The context is the future development of predictive tests for distinguishing persons who satisfy each of the four CAM criteria for delirium. Our substantive goal was to develop a parsimonious set of indicators for each of the four key CAM features of delirium to be considered in further developing brief clinically useful screening measures [15].
Results
Participant characteristics are summarized in Table 2. The mean age was greater than 80 years. Women represented over two thirds of the sample. Baseline cognitive impairment was common: the mean Mini-Mental State Examination (MMSE) score was 21.4 (standard deviation [SD] 6.3). CAM-defined delirium was present in about 1 in 8 of the sample.
Table 2
Characteristics of study participants
Characteristic | n or M | (% or SD) | [Range] |
Total [n (%)] | 4598 | (100) | |
Age [M (SD)] | 81.5 | (7.7) | [64.0–104.0] |
Sex [n (%)] | | | |
Male | 1425 | (31.0) | |
Female | 3172 | (69.0) | |
Race/Ethnicity [n (%)] | | | |
White | 3918 | (85.2) | |
Black/African American | 269 | (5.9) | |
Other races | 29 | (0.6) | |
Missing | 382 | (8.3) | |
Delirium Present [n (%)] | 611 | (13.3) | |
Mini‐Mental State Examination Score [M (SD)] (scored 0–30, 30 best) | 21.4 | (6.3) | [0.0–30.0] |
Mini‐Mental State Examination Score group [n (%)] | | | |
Severe cognitive impairment (0–17) | 1018 | (22.1) | |
Cognitive impairment (18–23) | 1560 | (33.9) | |
No cognitive impairment (24–30) | 2019 | (43.9) | |
The clinical expert panel defined CAM feature indicators from source items drawn from the MMSE orientation items, digit span, and DSI. We analyzed the resulting 135 indicators following the psychometric modeling steps described in the methods (multi-collinearity checking, dimensionality determination, IRT). Results are summarized in Table 1. This table lists by CAM feature (column 1) the number of indicators proposed by the clinical expert panel (column 2), the number of indicators remaining after empirical multi-collinearity checking (column 3), the number of significant eigenvalues following permuted parallel analysis (column 4), and the marginal reliability of each feature at $\theta_{50}^{+}$ (column 5). Columns 6-7 summarize model fit statistics and estimates of a single factor model fit to the indicator set, and columns 8 and 9 the model fit statistics for the $m$-dimensional model. As indicated in Table 1, no indicator set had more than two significant eigenvalues based on the permuted parallel analysis. Column 10 summarizes whether large secondary loadings were observed (secondary factor loading exceeded the common factor loading for a given item) in the BFA. Column 11 reports the final adjudication of the expert panel on the number of retained dimensions. Three indicator sets identified more than one secondary factor, and the expert panel agreed with the results. When only one significant eigenvalue was detected, model fit statistics were generally good (CFIs > 0.94 and RMSEAs < 0.05) [57].
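The idea behind counting significant eigenvalues with a permuted parallel analysis can be sketched as follows. This is a simplified illustration and not the study's procedure: it uses Pearson correlations in place of polychoric correlations, and the function names, the simulated data, and the 95th-percentile threshold are all assumptions made for the example.

```python
import numpy as np


def sorted_eigenvalues(data):
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def permuted_parallel_analysis(data, n_permutations=200, percentile=95, seed=0):
    """Count leading eigenvalues that exceed their permutation-based threshold.

    Each column is shuffled independently, which preserves item marginals
    but destroys the associations between items.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    observed = sorted_eigenvalues(data)
    null = np.empty((n_permutations, data.shape[1]))
    for r in range(n_permutations):
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        null[r] = sorted_eigenvalues(shuffled)
    threshold = np.percentile(null, percentile, axis=0)
    n_significant = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        n_significant += 1
    return n_significant, observed, threshold


def simulate_one_factor(n_persons=500, n_items=6, seed=0):
    """Hypothetical binary indicators driven by a single latent trait."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n_persons)
    b = np.linspace(-1.0, 1.0, n_items)
    p = 1.0 / (1.0 + np.exp(-1.5 * (theta[:, None] - b[None, :])))
    return (rng.random((n_persons, n_items)) < p).astype(int)


data = simulate_one_factor()
n_sig, observed, threshold = permuted_parallel_analysis(data, n_permutations=100)
print("significant eigenvalues:", n_sig)
print("largest observed eigenvalue:", round(float(observed[0]), 2))
```

Shuffling within columns gives a reference distribution of eigenvalues expected when the items are unrelated. Observed eigenvalues are retained only while they exceed that reference.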
The next step was to identify items that provided high information content in a region of the underlying trait assessed by the items. We did this by evaluating the item information at the 50th percentile of the latent trait distribution underlying the indicator set (or sub-set) among those participants who were rated as CAM feature positive. We identify this level of the latent trait as the 50th percentile ($\theta_{50}^{+}$) curve. An example of one such curve is shown in Figure 2. This figure plots item information curves for the indicators identified by the Clinical Expert Panel as measures of CAM feature 2 (inattention, direct interview). All indicators are illustrated, but we highlight two for discussion: "List the months of the year backward" (heavy dotted line) and "List the days of the week backward" (solid bolded line). The box and whisker plots beneath the horizontal axis indicate the distributions of posterior estimates of latent trait scores for participants ultimately classified as CAM feature 2 (inattention) positive and negative. Vertical reference lines for key percentiles of the CAM feature positive group are illustrated in the main panel.
This figure illustrates several important points about the analysis of this indicator set. First, the latent trait distributions for the CAM feature positive and negative sub-groups show wide separation. Nevertheless, most of the item difficulty parameters (located where the information functions peak) are above the 75th percentile of the CAM feature positive group. Thus, most of the indicators in this set contribute the most information at very severe levels of the underlying trait. Such items would not be useful for screening purposes, even if the assessed symptoms were pathognomonic of delirium. Our goal is to derive a test information curve tuned for screening purposes. We approach this by choosing the items with the most information at the 50th percentile for our item bank. The two highlighted items provide the most information at the 50th percentile of the latent trait distribution in the feature positive group. This is the area of the latent trait of greatest interest for screening purposes.
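The item-selection step described above, ranking indicators by their information at the median trait level of the feature positive group, might look like the following sketch. It assumes a two-parameter logistic model; the function names, indicator labels, parameter values, and trait scores are hypothetical.

```python
import math
from statistics import median


def item_information(theta, a, b):
    """Item information under a two-parameter logistic model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)


def rank_items_at_median(item_params, positive_thetas, top_k=5):
    """Return the median trait level of the feature positive group and the
    top_k indicator names ranked by information at that level."""
    theta_50 = median(positive_thetas)
    ranked = sorted(
        item_params,
        key=lambda name: item_information(theta_50, *item_params[name]),
        reverse=True,
    )
    return theta_50, ranked[:top_k]


# Hypothetical indicators: name -> (discrimination, difficulty).
item_params = {
    "near_median": (1.5, 0.3),
    "too_severe": (1.5, 3.0),
    "weak": (0.5, 0.3),
}
# Hypothetical latent trait estimates for feature positive participants.
positive_thetas = [-0.5, 0.3, 1.1]

theta_50, selected = rank_items_at_median(item_params, positive_thetas, top_k=2)
print(theta_50, selected)
```

In this toy example the highly discriminating but very severe indicator ranks last. This mirrors the point that such items contribute little information where screening decisions are made.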
The top 5 delirium indicators ranked in order of information at the 50th percentile of the latent trait distribution for the CAM feature positive subgroup are displayed in Table 3. The tabulated indicators comprise 39 original assessment items. In Table 3 we also present the item discrimination ($a$) and difficulty ($b$) parameters for each indicator.
Table 3
Source items and indicator IRT parameters for top five indicators identified for each dimension of each CAM feature*
Indicator | a | b |
Feature 1 - Acute Change and Fluctuating Course - Direct interview ($\theta_{50}^{+}$ = −0.20) |
Felt confused during the past day | 0.96 | 1.72 |
Thought you were not really in (name of facility) | 1.00 | 2.21 |
Saw things that were not really there | 1.33 | 2.29 |
Thought things were moving that were not really moving | 0.98 | 2.66 |
Heard things that were not really there | 1.55 | 2.56 |
Feature 1 - Acute Change and Fluctuating Course - Observational ($\theta_{50}^{+}$ = 1.17) |
Level of consciousness fluctuated | 2.97 | 1.80 |
Level of attention fluctuated | 1.83 | 1.46 |
Speech/thinking fluctuated | 1.98 | 1.77 |
Evidence of disturbance of sleep | 1.97 | 1.83 |
Psychomotor activity fluctuated | 1.57 | 2.43 |
Feature 2 - Inattention - Direct interview, First Factor ($\theta_{50}^{+}$ = 0.22) |
What is the year? † | 1.57 | 1.14 |
What is the month? † | 1.86 | 1.17 |
What is the day of the week? † | 1.21 | 0.78 |
What type of place is this? † | 1.55 | 1.23 |
What is the name of this place? † | 1.12 | 0.24 |
Second Factor ($\theta_{50}^{+}$ = 0.27) |
Days of the week backwards | 1.65 | 1.29 |
Months of the year backwards | 1.17 | 0.20 |
Digit span backwards 3 Numbers ‡ | 1.12 | 0.85 |
Digit span backwards 4 Numbers ‡ | 1.20 | −0.34 |
Digit span forwards 4 Numbers ‡† | 1.11 | 2.09 |
Feature 2 - Inattention - Observational ($\theta_{50}^{+}$ = 0.38) |
Trouble keeping track of what was being said | 1.26 | 0.25 |
Level of attention fluctuated | 1.74 | 1.55 |
Unaware of environment | 2.09 | 1.91 |
Distracted by environmental stimuli | 1.28 | 2.06 |
Staring into space | 1.09 | 2.11 |
Feature 3 - Disorganized Thinking - Direct interview ($\theta_{50}^{+}$ = 0.67) |
What type of place is this? † | 1.56 | 1.23 |
What is the year? † | 1.49 | 1.17 |
What is the month? † | 1.74 | 1.20 |
What is the day of the week? † | 1.20 | 0.79 |
What is the name of this place? † | 1.11 | 0.24 |
Feature 3 - Disorganized Thinking - Observational §, First Factor ($\theta_{50}^{+}$ = 1.03) |
Unclear or illogical flow of ideas | 2.07 | 1.29 |
Changes the subject suddenly | 1.83 | 1.90 |
Conversation was rambling | 1.36 | 1.68 |
Words or phrases that were disjointed or inappropriate | 1.33 | 2.21 |
Speech/thinking fluctuated | 1.17 | 2.27 |
Feature 4 - Fluctuating Course and Altered Level of Consciousness - Observational, First Factor ($\theta_{50}^{+}$ = 1.99) |
Sleepy, or stuporous, or comatose | 9.70 | 1.70 |
Disturbance of sleep | 3.18 | 1.81 |
Lethargy and sluggishness | 1.41 | 1.44 |
Slowness of motor response | 1.23 | 1.70 |
Expressed a paucity of thoughts | 0.97 | 3.23 |
Second Factor ($\theta_{50}^{+}$ = −0.14) |
Restlessness | 1.44 | 2.02 |
Speech unusually fast or pressured | 0.74 | 3.71 |
Excessive absorption with ordinary objects | 2.31 | 2.29 |
Increased speed of motor response | 0.69 | 4.49 |
Grasping/picking | 2.68 | 2.20 |
Of note, we did not pursue IRT modeling for the second observational factor of Feature 3 (disorganized thinking) because only three items loaded on this factor: limited speech, paucity of thoughts, and slow speech. We also did not include the direct interview items of Feature 4 (altered level of consciousness) because the item set was redundant with Feature 2 (inattention-direct interview). For Feature 4 (altered level of consciousness-observational), the second factor showed all items having very low information content at the 50th percentile, so for this feature, we made our decision based on the 75th percentile in the CAM feature positive group.
The marginal reliability estimates for each of the CAM IRT-derived features are shown in Table 1. The marginal reliability estimates are based on the mean standard error of the IRT scores for the items at the 50th percentile of the latent trait distribution for the CAM feature positive group. Most marginal reliability estimates were 0.80 or higher (values closer to 1 indicate higher reliability), suggesting good reliability in the region of the latent trait relevant to screening.
Discussion
Through an iterative process pairing a clinical expert panel with psychometric data analysis, we have identified a set of 48 indicators, derived from 39 items that are optimal for screening patients for the four core features of delirium as defined in the CAM algorithm. The symptoms assessed are clinically relevant and optimize psychometric properties for screening. The resulting item pool can be used to develop short form screening instruments for clinical or research use.
A challenge we faced in our item selection procedure is what criteria to use for selecting candidate items that would be optimal for screening. To this end, we generated item information functions for each indicator, and selected indicators that maximized information around the median underlying latent trait level for persons with each CAM feature positive. Some items, even those that are pathognomonic for a particular CAM feature, may have been omitted if they provide most of their information around a level of severity that is not relevant for screening. Our approach leads to measures that maximize measurement precision of underlying latent traits at a level that is important for separating persons who are or who are not classified as demonstrating the CAM feature.
Our goal was to define a set of items from which clinical researchers can construct a short form for the routine screening of delirium to replace lengthy batteries of mental status, neuropsychological assessment, and observational items. The significance of this work lies in the future establishment of a validated instrument for delirium screening. Our work represents a first step in the development of a more refined delirium screening instrument. The approach used here may be more widely applicable to a broad array of conditions for which screening relies on multi-item assessment batteries. The innovation of the approach we used in this study is the use of IRT to select optimal items for screening that maximize psychometric information at the latent trait level that discriminates between persons who do and do not demonstrate the four core features of delirium described in the CAM algorithm. The items were chosen in an iterative fashion that incorporates an interdisciplinary perspective from both clinical and methodological expertise in measurement research. The novel approach used in this study for case identification in delirium allows the interdisciplinary team to select items based on item information at the 50th percentile for those who screen positive on the specific CAM feature. Ideally, in the near future our analysis will be enhanced by computer assisted bedside interviewing with well characterized item banks and adaptive testing algorithms tuned to distinct purposes (e.g., grading delirium severity, screening for probable delirium).
Several caveats are worthy of discussion. First, our study involved a single, albeit very large, sample of acutely ill elderly patients. Future work will be needed to replicate our findings in other samples. Second, the operationalization of the critical theta value for screening could have been incorrect; however, we performed sensitivity analyses demonstrating that using values other than the median among CAM feature positive persons identified similar items. Third, any delirium tool developed from the identified items would need to be validated in an independent cohort. We are actively pursuing this work.
The DSM-IV and ICD-10 are used for diagnosis and coding by trained clinicians. In contrast, the design and purpose of the current study was to identify items for delirium screening based on the four CAM features, which can be done by both clinicians and trained non-clinicians. Therefore, this research may not directly inform diagnosis relying only and strictly on the DSM and ICD.
Another limitation is that age, sex, and race/ethnicity have not been considered in this analysis. These factors have been shown to be associated with the differential expression of signs and symptoms in other psychiatric and cognitive disorders, although not necessarily in delirium. Our results assume that the measurement of symptoms of CAM features is invariant across major sociodemographic groups. A future direction for potentially improving the current instrument is to examine measurement bias due to age and gender.
Competing interests
The authors declare that they have no competing interests.
Authors’ contribution
FMY participated in the design, analysis, and drafted the article. RNJ participated in the acquisition of data, conception of design and analysis, and drafted the article. SKI participated in the design and analysis and provided critical revision of the manuscript. DT participated in the analysis and critical revision of the manuscript draft. PKC participated in the conception of the design and analysis and provided critical revision of the manuscript. JLR participated in the acquisition of data and provided critical revision of the manuscript. LHN participated in the analysis and provided critical revision of the manuscript. ERM obtained support for the research, participated in the acquisition of data, contributed to the design and analysis and provided critical revision of the manuscript. All authors read and approved the final manuscript.