Background
In the U.S., the number of adults aged 65+ in the year 2000 was approximately 35 million. By 2050, this figure is expected to rise to nearly 82 million [
1]. The potential burden to healthcare becomes apparent if we couple these figures with evidence indicating that 55 years of age is the median age at which chronic disability becomes detectable [
2]. Such forecasts have prompted gerontologists and geriatricians to consider more seriously prevention-type models, with an emphasis on the earliest stages of functional decline. Increased interest in the maintenance of function and prevention of disability has led to relatively new diagnostic criteria, such as symptoms of frailty or preclinical disability. The utility of identifying individuals at 'high risk' for future functional decline rests on the notion that this is potentially an easier state to reverse than overt disability [
3]. Intervention programs designed to prevent functional decline in older adults show that participants with relatively good functional status or moderate frailty are those who benefit the most from these programs [
4]. However, 'prehabilitation' strategies necessitate the use of assessment measures that exhibit a high degree of sensitivity. Standardised tests of physical performance have been employed with increasing frequency in recent years, presumably to meet this demand for greater sensitivity [
5].
Activities of Daily Living (ADL) [
6] and Instrumental Activities of Daily Living (IADL) [
7] were developed to assess capabilities relating to the maintenance of self and lifestyle, including self-care, keeping one's life-space in order, and obtaining resources [
8]. When compared to performance-based measures (e.g., walk time), ADLs and IADLs generally display weaker face validity, reproducibility, and sensitivity to change [
9]. Also, as the emphasis has shifted toward early detection in community-dwelling older adults, for whom dependency in self-reported ADL-IADLs is uncommon, researchers often have to cope with large ceiling effects, in which greater than 90% of subjects endorse no 'difficulty' or 'dependency' on ADL tasks [
10]. It has been proposed that the relative standing of ADL-IADLs could be enhanced by improving construct validity to levels that are at least equivalent to those of physical performance measures [
11]. Enhancements of this nature have progressed relatively slowly. The justification for improving construct validity in ADL-IADLs, rather than abandoning them in favour of performance measures, can be found in two observations. First, there is evidence that self-reported ADL-IADLs and performance-based measures are comparable to each other, but usually measure different aspects of functioning [
5]. Second, combining information from self-report and performance measures has been shown to increase prognostic value, particularly in high-functioning older adults [
10].
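To make the magnitude of such ceiling effects concrete, the short sketch below computes the percentage of respondents reporting no difficulty on any item. The 0 = 'no difficulty' coding and the simulated sample are illustrative assumptions, not data from any particular scale.

```python
import numpy as np

def ceiling_effect(responses: np.ndarray) -> float:
    """Percentage of respondents at the scale's ceiling.

    responses: (n_subjects, n_items) array in which 0 codes
    'no difficulty' (an assumed coding); a subject sits at the
    ceiling when every item response is 0.
    """
    at_ceiling = (responses == 0).all(axis=1)
    return 100.0 * at_ceiling.mean()

# Illustration: 1,000 community-dwelling subjects, 8 ADL items,
# with difficulty endorsed on roughly 3% of item administrations.
rng = np.random.default_rng(0)
sample = (rng.random((1000, 8)) > 0.97).astype(int)
print(f"Ceiling effect: {ceiling_effect(sample):.1f}%")
```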
One reason why the psychometric properties of self-reported ADL-IADLs can be insufficient pertains to the ordinal nature of Likert scoring methods. This traditional, and still the most common, aggregate method of scoring computes a raw total score by summing responses to individual items. Despite the popularity of the aggregate scoring method, there are well-established problems with raw scale scores that make them difficult to interpret [
12]. One problem pertains to weighting each item equally; the total score method assumes that each item or symptom on the scale represents an equal level of severity, which is almost never true [
13]. Furthermore, the two methods (i.e., IRT vs. Likert scoring) can diverge considerably with respect to difficulty ranks. For example, it has been demonstrated that, within a 16-item scale, five Likert item scores differed by three or more ranks when compared with Partial Credit (Rasch model) scores [
14].
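A toy example makes the equal-weighting problem concrete: under summative scoring, two respondents with very different severity profiles can obtain identical totals. The items and scores below are invented purely for illustration.

```python
# Four items scored 0-2 (0 = no difficulty), ordered from the least
# to the most demanding task (an assumed ordering for illustration):
# eating, dressing, shopping, climbing stairs.
respondent_a = [2, 2, 0, 0]  # difficulty confined to basic self-care
respondent_b = [0, 0, 2, 2]  # difficulty confined to demanding tasks

# Likert summative scoring treats these profiles as equivalent,
# although they plausibly reflect different levels of severity.
print(sum(respondent_a), sum(respondent_b))  # 4 4
```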
Revised ADL-IADLs, through the use of Item Response Theory (IRT), avoid the pitfalls of aggregated approaches to self-reported disability. In contrast to traditional summative scoring methods, IRT models meet the conceptual requirements of order and additivity [
15]. This is primarily achieved by converting the ordinal level data into interval level log-odds units, which are computed for items and persons separately and then placed on a common scale [
16]. "With the priority placed on establishing interval units of measure, the investigator derives complementary tools for understanding the nature of scale's meaning and, more importantly, provides a substantive context within which an individual's score on a scale may be interpreted" [[
17], p.52]. Establishing interval level units permits one to identify important features of the construct that have been excluded. These gaps in measurement (typically referred to as construct under-representation) are worth investigating because they are thought to undermine construct validity. Such gaps also imply uneven rates of change in the construct being measured. For instance, a one-point increase on a 10-point scale can represent different amounts of improvement at different parts of the functional status scale; it might be more difficult for a person to improve from 9 to 10 than from 4 to 5 [
18].
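For reference, the dichotomous Rasch model makes this conversion explicit: writing $P_{ni}$ for the probability that person $n$ succeeds on item $i$, the log-odds of success is simply the difference between the person's ability $\theta_n$ and the item's difficulty $b_i$, so that persons and items share a common interval (logit) scale:

$$\ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \theta_n - b_i, \qquad P_{ni} = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}.$$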
Construct validity for ADL-IADL scales can also be enhanced by formally confirming a hierarchy of decline, for example, by supporting or refuting the expectation that 'Stepping over obstacles' is a more challenging task than 'Walking over a level surface' [
19]. Establishing a hierarchy of functional decline conveys more than the typical simple summation of functional loss, and may have predictive value for the clinician monitoring older adults: if the sequence is accelerated or out of order, it may indicate the need for interventions [
20]. IRT-based transformations allow items to be ranked unequivocally on a hierarchy of item difficulty, from easiest to most difficult [
21]. Ordering items or tasks by group mean scores does not imply that this ordering also holds at the individual level. "Any set of items can be ordered by item mean scores, but whether such ordering also holds for individuals has to be ascertained by means of empirical research. Only when the set of items has an invariant item ordering (IIO) can their cumulative structure be assumed to be valid at the lower aggregation level for individuals" [[
22], p.579].
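A minimal sketch of the empirical logic behind such a check, simplified from formal IIO procedures (the binary data layout and grouping variable are assumptions for illustration), compares each item pair's pooled difficulty ordering against the ordering observed within subgroups:

```python
import numpy as np
from itertools import combinations

def iio_violations(responses: np.ndarray, groups: np.ndarray) -> list:
    """Flag item pairs whose pooled difficulty ordering reverses in a subgroup.

    responses: (n_subjects, n_items) binary matrix, 1 = task performed
    without difficulty. groups: one subgroup label per subject (e.g.,
    rest-score or age strata). A reversal of the pooled ordering within
    any subgroup counts as evidence against invariant item ordering.
    """
    pooled = responses.mean(axis=0)
    violations = []
    for i, j in combinations(range(responses.shape[1]), 2):
        pooled_sign = np.sign(pooled[i] - pooled[j])
        for g in np.unique(groups):
            sub = responses[groups == g]
            sub_sign = np.sign(sub[:, i].mean() - sub[:, j].mean())
            if pooled_sign != 0 and sub_sign == -pooled_sign:
                violations.append((i, j, g))
                break
    return violations
```

Dedicated IIO procedures add significance testing and effect-size criteria to such raw comparisons, but the underlying question is the same.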
In addition to improving the validity of ADL-IADL measures by reducing ceiling effects, identifying construct under-representation, and confirming a formal item hierarchy, IRT methods can expand upon classical approaches to instrument reliability. Knowing the instrument's reliability provides information about the variance or error associated with the person's true score. The true score refers to the average score a person would receive if they were tested repeatedly (necessarily hypothetical) [
23]. Instrument reliability relating to disability can tell us whether observed changes are due to, for example, an intervention aimed at attenuating severity, or to problems with the precision of an instrument. An unreliable disability instrument may therefore underestimate the size of the benefit obtained from an intervention. IRT enhances interpretive power by providing measurement precision that varies with a person's ability level [
24]. This information (i.e., error that varies by person performance) can be used to identify the most sensitive part of the instrument or scale under investigation [
25]. Whereas in CTT a single number (e.g., the internal-consistency reliability coefficient, or the SEM based on that reliability) is used to quantify the measurement precision of a test, in IRT a continuous function is required to convey comparable information [
26].
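In the Rasch case, for example, that continuous function is the test information $I(\theta)$, the sum of the item information functions; its inverse square root yields a standard error that is specific to each ability level:

$$I(\theta) = \sum_{i=1}^{L} P_i(\theta)\bigl(1 - P_i(\theta)\bigr), \qquad \mathrm{SE}(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}.$$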
The goal of this systematic review is to identify manuscripts that use Item Response Theory to revise or develop ADL-IADL scales used for community-dwelling older adults. These revised scales should: (i) assess internal validity (cause and effect) by formally confirming a hierarchy of functional decline; (ii) enhance content validity, i.e., reduce ceiling effects to thresholds approaching 15%; and (iii) quantify construct under-representation (i.e., gaps in coverage) by converting the raw aggregated disability score into interval level measurement. A by-product of these goals will be the identification of ADL-IADL instruments that are highly sensitive to the early stages of disability and more accurate in detecting change over time. Lastly, this review is not concerned with establishing the superiority of one method over another (i.e., item response theory vs. classical test theory) in relation to scale analysis.
Discussion
This review was concerned with the evolution and enhancement of ADL-IADL scales that specifically target high-functioning community-dwelling older adults. It has been proposed that the relative standing of self-report ADL-IADLs could be enhanced by improving construct validity to levels at least equivalent to those of physical performance measures. To address these challenges, this review investigated constructs related to scale hierarchy, ceiling effects, and the establishment of interval level measurement that enables the identification of construct under-representation.
Seven scales from this review were able to establish interval level measurement using parametric IRT procedures, thus enabling greater accuracy when considering change scores as well as identifying construct under-representation. With regard to construct under-representation, all scales in this review presented with relatively large gaps in coverage, with the exception of McHorney and Cohen [
69]. When IRT methods are used to transform the ordinal nature of ADL scales to interval level data, diagnostic precision [
15] and sensitivity to clinical change are enhanced [
74]. Comparing disability measurements between patients, or within patients at different moments in time, is complicated. Change scores based on Likert summative scoring need to be interpreted with caution. It has been noted that assessing change in terms of estimated trait level rather than raw scores can yield more accurate estimates of change [
75]. If non-equal intervals exist between adjacent items, change scores for subjects with different levels of ability may misrepresent the amount of change, or fail to detect change in the latent trait [
51]. Furthermore, Fraley et al. [
76] demonstrated that analyses of change at the raw-score level and analyses of change using the latent-trait metric may lead to opposite conclusions. In one example, they presented results showing that highly anxious individuals are relatively less stable over time when considered at the raw-score level, but more stable over time when considered at the latent-trait level. Thus, failing to understand the scaling properties of an instrument can lead to grossly inaccurate conclusions [
77].
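A small worked example, assuming an idealised Rasch scale of ten equally difficult dichotomous items, illustrates the point. Under that assumption the estimated person logit for raw score r out of L is ln(r/(L − r)), so improving from 4 to 5 corresponds to ln(5/5) − ln(4/6) ≈ 0.41 logits, whereas improving from 8 to 9 corresponds to ln(9/1) − ln(8/2) ≈ 0.81 logits: the same one-point raw gain represents roughly twice the latent change near the top of the scale.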
Four scales met IRT standards for ascertaining item hierarchy at the individual level, as opposed to merely establishing item hierarchy at the population level. Despite the comprehensive coverage of McHorney and Cohen [
69], this manuscript made use of the 2PL IRT model, which does not provide the added advantage of invariant item ordering; Ligtvoet et al. [
22] point out that Sijtsma and Hemker [
44] proved that the graded response model used in McHorney and Cohen does not imply invariant item ordering. Invariant item ordering is clinically useful because improved understanding of the sequence of functional change or decline and its natural trajectory in aging would open up opportunities for thinking about early intervention and/or ways to change this trajectory [
20,
78]. Ligtvoet et al. [
22] report that IIO is a strong requirement in measurement practice, and that researchers sometimes assume that fitting an IRT model implies that items have the same ordering by difficulty or popularity for all individuals, but this assumption requires modification. In following this rather strict criterion for IIO, our final pool of scales was relatively limited. This resulted in very few items that were common to other scales, thus allowing for only modest patterns of functional decline to emerge.
It has been noted that, within the last 25 years, interest in measuring functional status among the nondisabled elderly has expanded dramatically because of the aging of the population and its implications for health care policy. As a result, measures of ADLs and IADLs have increasingly been applied to community-dwelling individuals, resulting in substantial ceiling effects [
79]. Four of the twelve scales were exceptional in reducing ceiling effects: Kempen and Suurmeijer [
38] reported 5% of subjects at the ceiling level; Fortinsky et al. [
14] also reported a ceiling effect of 5%; Haley et al. [
66] and Jette et al. [
67] observed a ~1% and 0% ceiling effect, respectively. However, it should be considered whether the success of the scales used in Kempen and Suurmeijer as well as Fortinsky et al. is being driven more by sample characteristics than by scale sensitivity. Both scales were categorised in Table
1 as having the 'least healthy' samples of older adults. The Kempen and Suurmeijer sample were all new users of professional home help, and 77% of subjects were female. Gender should be considered here, as previous studies have reported gender differences in functional disability, with elderly women reported to have higher functional disability than elderly men [
80]. The Fortinsky et al. sample were described as Medicare-eligible with a recent history of home care services, and one third of the sample was aged 85 or above. Despite Haley et al. and Jette et al. also having a large proportion of female subjects, their samples appear much healthier than the two other samples mentioned above. Thus, we are more confident that their low percentage of ceiling effects has more to do with scale characteristics than with sample characteristics.
The success related to improved content validity can be attributed to the development of more difficult items. The items used in Haley et al. [
66] are very different from traditional IADL items (e.g., assessing the ability to 'Run a half mile'). In an effort to approach the novel status of a 0% ceiling effect, Haley et al. increased item difficulty. However, it has become apparent that 'newly developed' items designed to limit ceiling effects in high functioning populations lie outside the realm of daily experience, and thus may prove less reliable. For instance, questions about walking difficulty over a distance of one-quarter mile or more may be answered inaccurately simply because the respondent has not attempted to walk such a distance in quite some time [
81]. Furthermore, it has been noted that the 'Vigorous activities' item (from a sample of chronically ill or psychiatric subjects) may have shown misfit due to a lack of actual engagement in these activities within a typical day [
82].
Lawton's instrumental activities of daily living [
7] were thought to reflect a greater degree of complexity than the previously developed ADLs, and thus would be more applicable to a broader population of older adults. However, it seems that these traditional IADLs are most responsive to community-dwelling older adults who show early signs of cognitive pathology, such as mild cognitive impairment. It has been shown that a majority of the traditional IADLs are more closely associated with physical fitness than with cognitive complexity [
83]. In an effort to reduce ceiling effects and to track change in community-dwelling older adults, scale developers have chosen to assess tasks that are increasingly physically demanding, e.g., 'Run a half mile' or 'Vigorous activities'. However, the Late Life FDI scale presented in Jette et al. [
67] utilises difficult items (as evidenced by a ceiling effect of 0%), while maintaining a degree of complexity, e.g., the 'Travel out of town' item or 'Invite people into home'. And yet this scale does have two relatively large 'gaps' in coverage that might make tracking change over time problematic. Also, these sorts of items may prove cumbersome for tracking progress in 'prehabilitation' (e.g., cognitive training) over relatively short intervention periods. It might be fruitful to explore the embedded components of a complex task such as 'Travel out of town', much the same way geriatricians have scrutinised the subtasks involved in bathing [
84,
85].
Another avenue for increasing scale sensitivity in community-dwelling older adults is to alter the wording and thus the context in which activities are performed. Fries et al. [
86] provide a review (with a mixed patient population) on the effects of altered context. In this review, Avlund et al. [
54], like Jette et al. [
67], explored atypical disability wording in an effort to reduce ceiling effects in community-dwelling populations (Avlund et al. is cited in the 'close to inclusion' section of this manuscript). Avlund et al. [
53] compared 'tiredness' and 'reduced speed' classifications, and found that the reduced speed scale was more effective in reducing ceiling effects. However, Avlund et al. [
55] advocated the rejection of the reduced speed scale (in favour of 'tiredness') due to severe heterogeneity across age groups, as well as model fit difficulties. Avlund et al. [
55] also compared dependency (i.e., 'do you need help?') vs. tiredness and found that the tiredness scale was more suitable for measuring change among well older adults. At the same time, Fried et al. [
87] were altering scale classification by asking whether health or physical problems result in ADL-IADL tasks being completed with less frequency, or whether such problems cause individuals to modify how they perform a particular functional task. Lastly, from this review, Schumacker [
16] used the uncommon categorization of 'Do you have fear?' when performing various ADL-IADL activities. The result was massive floor effects, and the manuscript was ultimately excluded from this review because of poor reliability. It is worth mentioning that the categorization or wording of a particular ADL-IADL item (i.e., the differences that often occur between large surveys or cohorts) can prevent data from being pooled to create much larger samples with increased statistical power. This topic lies beyond the scope of this review, but it should be noted that IRT equating procedures can be used to bring different groups together for comparisons on a common scale. The potential for such methods can be seen in Jagger et al. [
88] in which there was a desire to make disability comparisons between five national surveys.
A primary advantage of IRT is the extension of reliability. Traditionally, reliability (i.e., the degree to which a scale is free of measurement error) has been reported as a single value representing a scale's average reliability. IRT, however, is able to evaluate measurement error, or precision, at various points along the scale continuum (e.g., the disability construct). This is valuable because precision along the continuum is not uniform, and thus is expected to vary. This information is summarised by the test information function, which allows for the estimation of the standard error of measurement at each subject's ability level. Despite the obvious utility, only one manuscript from this review chose to estimate the test information statistic, namely Dubuc et al. [
62].
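As an illustration of how such a function is used (assuming a fitted dichotomous Rasch model; the item difficulties below are invented for the example), test information can be evaluated across the ability continuum and inverted to give an ability-specific standard error:

```python
import numpy as np

def rasch_test_information(theta: np.ndarray, difficulties: np.ndarray) -> np.ndarray:
    """Test information I(theta) for a dichotomous Rasch model:
    the sum over items of P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulties[None, :])))
    return (p * (1.0 - p)).sum(axis=1)

# Invented item difficulties (in logits) for an 8-item ADL-style scale.
b = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.5, 2.5])
theta_grid = np.linspace(-4, 4, 9)
se = 1.0 / np.sqrt(rasch_test_information(theta_grid, b))

for t, s in zip(theta_grid, se):
    print(f"theta = {t:+.1f}  SE = {s:.2f}")
```

Standard errors are smallest where item difficulties are concentrated and grow toward the extremes of the continuum, identifying the region where the scale measures most precisely.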
Our review contains only one 2PL manuscript, which could be viewed as a study limitation. Some authors have suggested that 1PL models, as compared to 2PL models, are unsuitable as a final model for describing data resulting from functional status items [
21]. Similarly, the fit of an IRT model can be examined with a likelihood ratio test, which reflects the expectation that the more parameters used to describe item and subject behaviour, the better the model will fit the data [
56]. However, the 1PL model is more robust [
21] and has the advantage of ensuring that items can be ordered unambiguously, in the sense that their item characteristic curves do not cross [
65]. The 1PL model (along with the rating scale model for polytomous items) is the only well-known parametric IRT model that has nonintersecting IRFs [
72]. Additionally, the item fit statistics available for the 2PL model are of limited reliability for scales containing few items and overly sensitive in large samples [
58]. A further limitation relates to the unavailability of data, which resulted in some logit data being extracted from figures rather than tables; this should have only a small impact on the accuracy of reporting. Finally, several studies in this review used fewer than 100 subjects in their IRT analyses, which may be small even by Rasch standards. It has been proposed that a sample size of 100 will provide 95% confidence of item calibration. However, it has also been suggested that the adequacy of test targeting influences sample-size requirements, and thus a well-targeted test may produce adequate location precision with fewer than 100 subjects [
51].
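The nonintersection property claimed for the 1PL model follows directly from the model equations (a standard derivation, not specific to any reviewed scale). Under the 2PL model,

$$P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},$$

two item characteristic curves cross where $a_1(\theta - b_1) = a_2(\theta - b_2)$, i.e., at

$$\theta^{*} = \frac{a_1 b_1 - a_2 b_2}{a_1 - a_2},$$

which exists whenever $a_1 \neq a_2$. Under the 1PL model all discriminations are equal, so no crossing point exists and the items retain the same difficulty ordering at every ability level.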
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors have read and approved the final manuscript. RF, IJD, EJA and JMS have taken part in designing and planning of the study, as well as editing pre-submission drafts of this manuscript. RF conducted data collection.