Background
Studies of within-individual change in cognitive performance are critical to advancing our understanding of the impact of aging and disease on cognitive functioning. However, repeated test administration can result in observed gains in performance that may be due to familiarity with test content or context rather than true differences in underlying cognitive ability. These spurious gains, referred to as practice or retest effects, may be due to familiarity with the testing situation, reduced anxiety, or changes in the environment [
1]. Retest effects are observed across a variety of cognitive domains [
2,
3] and can last for several years [
1]. If practice or retest effects are not considered, longitudinal studies may underestimate the magnitude of decline attributable to aging or age-related disorders and may even erroneously suggest longitudinal gains in performance [
1,
4,
5].
Although multiple methods have been proposed to correct for retest effects in longitudinal studies, there is no clear consensus on the best approach [
6]. Moreover, measuring or adjusting for practice or retest effects is further complicated when an acute event or insult, such as major surgery, acute illness, or an intervention, is anticipated to influence test performance. The cohort examined in this observational study provides a particularly instructive example of this challenge: older adults undergoing major surgery whose cognitive test performance was assessed immediately before surgery and for some time afterward. Patients’ observed cognitive performance over time is expected to be influenced not only by retest effects but also by the combined effects of their baseline cognitive abilities, individual variation over time, older age, major surgery (including hospitalization, anesthesia, psychoactive medications, and postoperative complications), and, in some cases, delirium, an acute confusional state that is common after surgery in older adults [
7]. Both surgery [
8–10] and delirium specifically [
11–14] have been shown to be associated with acute and long-term cognitive decline.
Our goal in this study was to evaluate different methods of correcting for the effects of practice or retest on repeat test administration in the context of an observational study of older adults undergoing elective surgery. Our objective is related to, but distinct from, characterizing change using reliable change indices: we are specifically interested in assessing change when there are more than two repeated observations and in investigating the impact of an acute insult or exposure on short- and long-term cognitive change. We first examined the raw data without retest correction, then applied three methods of retest correction that have been used in previous studies. Two retest correction approaches, mean difference correction [
15] and predicted difference correction [
16–18], rely on retest-adjusted cognitive scores based on performance in a non-surgical comparison group. A model-based correction [
19,
20], in which data from the non-surgical comparison group were modeled directly as a reference group, was also assessed. We contrast results and inferences from the four approaches with regard to overall trends and differences in short-term (e.g., 1–2 months) and longer-term (e.g., 6–18 months) cognitive change between patients who developed post-operative delirium and those who did not. In comparing the retest correction methods, we aim to provide insights into the strengths and weaknesses of each for future longitudinal studies of older adults that employ serial cognitive testing in the setting of acute insults, such as surgery, hospitalization, acute illness, or delirium.
Discussion
Retest effects are a known source of bias in longitudinal studies. Our goal was to contrast results and inferences derived from four existing approaches in the literature for addressing practice or retest effects in an observational study of cognitive performance following elective surgery. Overall, we found that all four approaches provided nearly equivalent information and would lead to identical inferences regarding relative mean differences between groups defined by a key post-operative outcome (delirium). However, the approaches produced qualitatively different impressions of absolute performance over the 18 months following surgery: uncorrected GCP scores increased by more than 1/5 of a population SD two months after surgery in both the surgical and NSC groups, suggesting that this increase is primarily due to retest rather than surgery; in contrast, retest-corrected GCP scores revealed a decline in performance in both surgical groups at month 1 and a recovery to baseline by month 2.
Our main finding, that substantive inferences and effect size estimates relating to the relative impact of exposure variables on cognitive changes are robust to different approaches to retest effects, echoes similar findings reported by Vivot et al. (2016) despite those authors considering a different context of study (long-running observational cohort studies of cognitive aging vs. follow-up studies of a clinical cohort), types of exposures (diabetes and depression vs. delirium) and studied approaches to handling practice and retest effects (model-based approaches with no control group vs. data manipulation and model-based approaches with a control group) [
20]. Our findings are also congruent with those of Salthouse (2016) [
45], who concluded that the primary impact of practice and retest effects is to distort mean age-trend trajectories, whereas slopes over time are relatively unconfounded by retest effects. Notably, although they reached conclusions similar to ours, Vivot et al. and Salthouse used different approaches to quantify practice and retest effects (quasi-longitudinal or sequential cohort designs).
Taken together, our study and those of Vivot et al. and Salthouse, which used different practice/retest effect adjustment methods suited to different study designs and research questions, indicate that retest effects influence age-related trends but are less important for understanding the relative impact of risk factors. Therefore, correction for retest effects is especially important for descriptive and natural history studies. For analytical epidemiology studies, such as those characterizing the impact of discrete risk factors (e.g., delirium, depression, diabetes), the value of addressing retest effects lies mainly in interpretation: setting the context for observed differences and effects of exposures. Of the four approaches examined, only Approach 1 (no correction) contradicts our a priori assumption, based on clinical observations, that surgery causes an acute decline in cognition. Furthermore, Approach 2 (mean difference correction) is the most straightforward method for interpreting cognitive trends, since Approach 3 (predicted difference correction) changes the rank order and Approach 4 (model-based correction) requires extra post hoc transformations to generate interpretable results.
Each retest correction method has other important strengths and weaknesses that should be considered before choosing the best approach for a particular study (Table
1). The primary strengths of Approach 1 are that it does not manipulate the observed data and does not require a control group. However, because it is difficult (or impossible) to separate the effects of retest and exposure, it is appropriate only for studies examining relative differences between groups or long-term cognitive change, provided that a slope is fit only to data collected after the first 2–3 administrations, by which point the largest retest gains have already occurred. Approach 4 utilizes raw GCP data from the surgical group as well as from the NSC group to model retest effects and variation in retest effects across a population. Although the latter is a strength of this approach, its importance is attenuated by our finding that model fit indices and standard errors were nearly identical across approaches, suggesting that variance in retest effects does not substantially impact inferences. A primary limitation of Approach 4 is that using an NSC group as the reference group in a statistical model restricts the types of hypotheses that can be tested. For instance, surgery type, anesthesia type, or other exposure-related factors cannot be modeled as covariates or predictors of cognitive decline, since these variables cannot be collected in patients who are not undergoing surgery. This significantly limits the applicability of this method for many studies.
Rather than modeling the NSC data directly, Approaches 2 and 3 use the NSC data to generate retest-corrected GCP scores. The key difference between Approach 2 and Approach 3 is that Approach 2 uses a constant retest effect correction for every participant while Approach 3 allows the magnitude of the retest effect correction to vary by the participant’s baseline GCP. A primary strength of Approach 3 is that variables that could influence retest effects can be used to predict retest effects on an individual basis. Indeed, various characteristics have been suggested to influence retest effects [
46,
47], including baseline cognitive ability [
5,
44]. However, this literature is inconsistent, making it difficult to reliably select appropriate prediction variables [
46,
48]. Because we did not expect to draw more definitive conclusions about predictors of retest in our sample than prior work, and to keep the four Approaches as consistent as possible for comparison purposes, we chose not to include other potential predictors of retest in Approach 3. However, should consistent drivers of retest emerge, this would be a considerable advantage for using Approach 3 since those variables could be included in the retest prediction model. In building the predictive model for Approach 3, we observed that baseline GCP scores were negatively correlated with change in GCP scores in the NSC sample. The phenomenon of regression to the mean [
49,
50] leads to the expectation that, when two measures are positively correlated and have similar variance (e.g., baseline and follow-up GCP), the change score between them will be negatively correlated with baseline performance. So, while this observation is expected given regression to the mean, it may also signal limitations of using a linear model to describe the dependence of follow-up scores on baseline scores.
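The expected negative correlation can be made explicit under simplifying assumptions. The notation here is introduced only for illustration and is not taken from the study's analysis: let $Y_1$ and $Y_2$ denote baseline and follow-up scores with exactly equal variance $\sigma^2$ and correlation $\rho$, where $0 < \rho < 1$.

```latex
\begin{align*}
\operatorname{Cov}(Y_2 - Y_1,\, Y_1)
  &= \operatorname{Cov}(Y_2, Y_1) - \operatorname{Var}(Y_1)
   = \rho\sigma^2 - \sigma^2
   = -(1-\rho)\,\sigma^2 < 0, \\
\operatorname{Var}(Y_2 - Y_1)
  &= \sigma^2 + \sigma^2 - 2\rho\sigma^2
   = 2(1-\rho)\,\sigma^2, \\
\operatorname{Corr}(Y_2 - Y_1,\, Y_1)
  &= \frac{-(1-\rho)\,\sigma^2}{\sigma \cdot \sigma\sqrt{2(1-\rho)}}
   = -\sqrt{\frac{1-\rho}{2}} .
\end{align*}
```

For example, a hypothetical test-retest correlation of $\rho = 0.8$ would give a correlation of about $-0.32$ between change and baseline even if no participant's underlying ability changed.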
Approach 2 is a relatively straightforward method that uses a constant transformation derived from the NSC group. This transformation is applied uniformly across all participants and is therefore uncorrelated with any other variable. This property may explain the primary reason for differences in the model outputs between Approaches 2 and 3. Approach 3 allows the retest correction to vary as a function of a participant’s baseline GCP, but cognitive performance is known to be correlated with delirium, our independent variable of interest in the present analyses. Approach 2 is therefore preferred over Approach 3 when the variable of interest is correlated with cognitive performance, because a baseline-dependent correction may then bias estimates of group differences. The primary advantages of Approach 2, mean difference correction, are its relatively straightforward application, that it enables interpretation of both relative differences and absolute performance, and that its hypothesized limitation, namely that it does not account for variability due to the precision with which the retest correction is estimated, has a negligible impact on inferences and model fit.
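The contrast between a constant correction and a baseline-dependent correction can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the scores are toy values, the function names are hypothetical, and the single follow-up visit and simple least-squares fit are simplifications rather than the procedure used in this study.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def mean_difference_correction(nsc_base, nsc_follow, surg_follow):
    """Approach 2 sketch: subtract the comparison group's mean gain
    (one constant) from every surgical follow-up score."""
    retest = mean(f - b for b, f in zip(nsc_base, nsc_follow))
    return [s - retest for s in surg_follow], retest


def predicted_difference_correction(nsc_base, nsc_follow, surg_base, surg_follow):
    """Approach 3 sketch: predict each participant's retest gain from
    baseline, using a line fit in the comparison group."""
    gains = [f - b for b, f in zip(nsc_base, nsc_follow)]
    intercept, slope = fit_line(nsc_base, gains)
    corrected = [f - (intercept + slope * b)
                 for b, f in zip(surg_base, surg_follow)]
    return corrected, slope


# Hypothetical toy scores (not study data).
nsc_base = [50, 55, 60, 65, 70]
nsc_follow = [54, 58, 62, 66, 70]   # larger gains at lower baselines
surg_base = [50, 60, 70]
surg_follow = [51, 60, 69]

corrected_constant, retest_constant = mean_difference_correction(
    nsc_base, nsc_follow, surg_follow)
corrected_predicted, slope = predicted_difference_correction(
    nsc_base, nsc_follow, surg_base, surg_follow)
```

In this toy data the constant correction is 2 points for everyone, whereas the fitted slope is negative, so the baseline-dependent correction subtracts more from participants with lower baseline scores. If an exposure of interest were also related to baseline score, the second correction would differ systematically between exposure groups, which is the concern raised above.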
Our study also compared shorter- and longer-term cognitive performance between the surgical and NSC groups. The mean score differed across groups only at the 1-month assessment, where the lingering effects of surgery, including postoperative delirium, are presumed to depress scores for some people. It is also worth noting that the surgical group had a slightly lower score at baseline but an equivalent score at month 6 and beyond, suggesting that the performance of the surgical group at baseline might have been depressed by factors related to the impending surgery (e.g., stress, pain, use of pain medications) rather than by true differences in cognitive abilities. If surgical cohorts systematically have lower baseline cognitive scores than an NSC group, then a retest prediction model derived in the NSC sample based on baseline cognitive scores (as in Approach 3) may create biased predictions when applied to a surgical sample. This potential source of bias should be considered in future studies that plan to use Approach 3 to correct for retest effects.
This study offers an innovative contribution to the study of retest effects because it specifically assesses approaches that are applicable to observational studies with longitudinal cognitive assessment with two or more time-points aimed at investigating the impact of an acute insult or exposure on short- or long-term cognitive change. Indeed, some of the most common methodologies for controlling for retest effects cannot be evaluated using this type of study design. For instance, although many studies have evaluated the “reliable change index” [
51–56], these methods are less applicable to studies with more than two assessments. Additionally, “boost” correction [
4,
57], which typically uses a step function to model improvement after the first assessment (i.e., using a function of 0–1–1–1…1 across assessment time points), will not work as designed if there are other factors affecting performance at the second assessment besides retest effects. In the present study, surgery occurred in the intervening period between the first and second assessment; thus the retest “boost” would be biased (likely attenuated) by surgery effects. In contrast, the four approaches evaluated herein are appropriate for study populations where the second or third test administration co-occurs with the acute exposure under study.
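The attenuation described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the boost model used in the cited work: the scores are hypothetical, the function names are illustrative, and the model has only an intercept and the step indicator (no time slope), in which case the least-squares boost estimate reduces to the mean of later scores minus the first score.

```python
from statistics import mean


def boost_indicator(n_assessments):
    """Step coding 0-1-1-...-1: 0 at the first assessment, 1 afterwards."""
    return [0] + [1] * (n_assessments - 1)


def estimate_boost(scores):
    """Least-squares boost for score = intercept + boost * indicator.
    Requires at least two assessments."""
    ind = boost_indicator(len(scores))
    first = [s for s, i in zip(scores, ind) if i == 0]
    later = [s for s, i in zip(scores, ind) if i == 1]
    return mean(later) - mean(first)


# Hypothetical scores over four assessments (not study data).
retest_only = [50, 53, 53, 53]   # assumed true retest gain of 3 points
with_insult = [50, 49, 53, 53]   # same gain, plus a 4-point acute drop at visit 2

boost_clean = estimate_boost(retest_only)
boost_confounded = estimate_boost(with_insult)
```

With these toy values the step function recovers the assumed 3-point gain when nothing else happens at the second assessment, but the estimate shrinks to under 2 points once an acute drop coincides with that assessment.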
This study also has several limitations. First, our study was based on an observational cohort study and does not provide a “gold standard” by which to measure retest effects. Fortunately, all four approaches provided similar estimates of effect, and inferences were qualitatively indistinguishable for our primary point of inference: differences between the delirium+ and delirium- groups in longitudinal trends of cognitive performance. Second, all models examined here assumed that incomplete data were missing at random. It is possible that bias due to non-random drop-out affected both our retest effect estimates in the NSC group and our modeling of the effect of delirium in the surgical sample. The latter has been investigated in prior work, which found that estimates of long-term decline in SAGES were robust to multiple different assumptions about missing data (see the Supplementary Appendix of reference [
13]). In the NSC sample, all participants returned for the second assessment at month 1, five participants (4%) did not return for their third assessment at month 2, and an additional eight participants (11%) did not return for their fourth assessment at month 6. Although all participants returned for the second assessment, when the greatest retest gains are usually observed, it remains possible that drop-outs may have influenced our retest effect estimations. Third, because it remains unclear which variables consistently influence retest effects, it is possible that our findings may not be generalizable to cohorts that are younger, more racially and ethnically diverse, or less educated. Moreover, it is possible that important differences in our NSC group (e.g., greater baseline cognitive performance, more years of education, fewer physical and functional impairments) compared to the surgical cohort impacted our results. However, a recent study of a community-based cohort of older adults (mean age 77, N = 4073) found that, similar to other studies [
48], retest effects did not differ as a function of individual differences in race/ethnicity, sex, language, years of education, literacy, apolipoprotein E ε4 status, or cardiovascular risk [
46]. Fourth, because retest effect may vary by cognitive domain [
58], it is possible that the optimal retest correction would also vary by cognitive domain. This is an important area for future study. Finally, the retest correction approaches analyzed have been previously studied in various fields and were selected specifically for this study design, but the chosen methods are not an exhaustive list, and it is possible that alternative approaches exist. In fact, a gold standard approach for this type of study might be repeated observations prior to the acute event, such that the retest effect has been exhausted before the event or insult of interest [
44,
59]. However, as with SAGES, in which participants were enrolled in anticipation of an impending surgical procedure, such an approach might not be feasible for other studies.