Background
The success of behavioral interventions depends, in part, on the fidelity with which they are delivered. Fidelity is the “degree to which an intervention was implemented as described in the original protocol or as it was intended by the program developers” [1]. Dane and Schneider summarize that fidelity encompasses (1) program adherence, (2) dose of the program delivered, (3) quality of delivery, (4) participant engagement, and (5) differentiation between critical program features [2]. Assessment of fidelity is crucial because an intervention’s failure to produce the desired change in targeted outcomes (e.g., diet, suicide rates, smoking) may result from poor implementation rather than from a poorly designed program [3-5]. In preparing to measure fidelity, researchers must decide who will provide fidelity ratings, on what aspects of fidelity, at what frequency and intervals, with what mode of data collection, and with what standard of fidelity in mind [6]. Researchers must also balance psychometric rigor with pragmatic value [7].
There are two main approaches to sourcing information on fidelity: direct assessment (i.e., observer report) and indirect assessment (i.e., self-report) [6]. Direct measures involve ratings completed by trained observers from videotape, audiotape, or live observation. Indirect measures include self-reports submitted on paper or through technology [8]. Direct observation methods are considered more valid but are more resource-intensive; self-report methods require fewer resources and capture the implementer’s valuable perceptions [9]. Further, despite the psychometric advantages of direct measures, a 2017 review found that researchers use direct and indirect measures equally often [10].
Few studies have examined the conditions under which direct and indirect measures of fidelity are most appropriate, and few evaluations provide scientific justification for deciding who provides fidelity information. Further, most available comparisons are limited to studies in the mental health field, and in these studies the correspondence between approaches has varied. In some studies, therapists’ self-reported ratings of fidelity to treatment skills and strategies have demonstrated statistically weak relationships with direct measures of fidelity [11, 12], with individuals reporting higher fidelity for themselves than observers reported for them [13]. However, other studies have found a more nuanced relationship. Correspondence between direct and indirect measures has been shown to be stronger for some practices (e.g., practice coverage, client comprehension, homework assignment) than for others (e.g., type of exercises completed with the client) [14]. Further, although therapists overestimated their fidelity, their indirect ratings were at least consistent across time in their correspondence with direct measures [15]. These studies suggest that indirect measures may still provide useful information in some circumstances for understanding variability in implementation. However, researchers have yet to characterize the circumstances under which indirect measures provide this utility. Further, it is unclear whether approaches to fidelity measurement in other fields relate to one another in a similar way. For example, WAVES [16] and High 5 [17] were obesity prevention and nutrition promotion interventions delivered in elementary schools. Both of these behavior change efforts collected direct and indirect measures of fidelity; neither completed nor planned comparisons between the measures. There could be important differences from the field of mental health given the variety of experience, education, and training levels held by those asked to implement nutrition interventions.
Regardless of the source of information, researchers typically struggle to balance at least three elements to obtain a valid and stable measure of fidelity: (a) resource constraints, (b) the ideal number of fidelity assessments, and (c) the ideal interval for fidelity assessment [18]. These decisions might differ by intervention and by implementation setting. However, guidance is largely lacking. Fidelity measures in mental health studies comparing direct and indirect assessment have ranged widely, from one session [11] to every session for a set period of time [15]. Examples of fidelity frequency and intervals for interventions aimed at student behavior change in classrooms range from weekly for 8–10 weeks [19, 20] to once per year [20]. Frequency of collection may also differ by source of information, with self-report being collected more frequently than direct report when both are used within the same study. For example, self-reported fidelity logbooks for each lesson were requested in both the WAVES study [16] and the Krachtvoer healthy diet intervention [21]; direct observations were conducted three times per year per school and once per classroom per year in these studies, respectively. These choices may reflect the resource-intensive nature of direct observations and illustrate a lack of standards in the field about how much fidelity observation is adequate.
The unique strengths and weaknesses of approaches and schedules for monitoring fidelity deserve further exploration across a broader range of intervention types and implementer characteristics. To progress toward guidelines that inform researchers’ choices about selection of fidelity measurements, an important first step is to replicate the comparisons of direct and indirect measures that exist in the field of mental health. Additionally, illustrating how the captured information varies across frequently collected intervals, and how different approaches to analyzing these data shape conclusions, could guide other studies. To this end, our primary objective was to compare direct and indirect measures of fidelity to a nutrition promotion curriculum among early educators across time. We highlight the use of three distinct descriptive analytical approaches and the differential conclusions they suggest.
Discussion
This case study compares direct and indirect assessments of fidelity to a complex intervention for nutrition promotion in educational settings. Our findings illustrate that, on average, observed and self-reported fidelity may seem consistent despite weak correlations and individual cases of extreme overreporting by those implementing the intervention. These real-world data provide an example to help ground future decisions about the “who, what, and when” of fidelity measurements as well as how these data can be analyzed. Few guidelines are available to community-based interventions for making decisions about fidelity measurement. Improvements in standards for fidelity measurement may contribute to reduced numbers of “Type III errors,” in which interventions are deemed ineffective due to poor implementation rather than a true inability to produce the desired effect [46].
Consistency between direct and indirect assessments in our study differed by the component of fidelity assessed and by the time of the school year/intervention. On average, educators in our study reported higher scores than did observers, consistent with the finding that cases fell in the upper left quadrant of scatterplots more often than in the bottom right. This is consistent with previous observations that indirect assessments are prone to self-report bias [6]. In our study, evidence of possible self-report bias was more prevalent (as indicated by cases in the upper left of scatterplots) for some practices than others. These differences suggest that self-report bias may be content dependent, reflecting not only a desire to represent oneself well but also a true gap in understanding of how the evidence should be enacted. This is consistent with mental health research in which therapists more accurately reported on their use of some techniques than others [14, 15]. In our study, educators made greater shifts toward consistency with observer reports across the school year when less subjectivity was involved in the ratings (e.g., number of children in the group vs. used the puppet enthusiastically). Future research should systematically document the traits of evidence-based practices across disciplines on which implementers are better able to rate themselves than on others (e.g., concrete versus abstract practices).
Our case study illustrates frequent collection of both direct and indirect assessments of fidelity across study classrooms. This approach provides a unique opportunity compared to other nutrition intervention studies, which typically select a sub-sample of implementers to be assessed at varying intervals (e.g., one lesson per school, three lessons per term per school, 50% of classrooms observed), often using either direct or indirect assessment [16, 17, 20, 47]. In our study, both types of fidelity assessment occurred with every unit of the WISE intervention, which coincided with every month of the school year. On average, educators demonstrated increases in fidelity for some components across time (i.e., involving children as prescribed in hands-on activities), decreases for other components across time (i.e., role modeling), and variability likely due to the content of the unit (i.e., higher observed role modeling for berries, less for greens and green beans). In our study, the intervention content was confounded with time of year. Therefore, patterns may reflect calendar effects such as fatigue as the school year draws to an end (e.g., Role Modeling) or distractions from other demands around the time of the holidays (e.g., the drop in Mascot Use in December). Researchers cannot assume that improvements due to practice effects are uniform or that a single observed measure of fidelity provides an accurate assessment of the entire intervention period. Decisions about the timing and frequency of fidelity assessment may benefit from aligning resources with the nature of the intervention itself. Researchers should consider content shifts in the intervention (e.g., the fruit/vegetable changes in WISE) or contextual seasonality effects (e.g., the school year) as key variables to inform the measurement schedule. Infrequent or one-time assessment of fidelity may mask the true relations between direct and indirect assessments for some interventions.
This study intentionally illustrated the application of simple analytic techniques that other research teams or leaders in the community (e.g., administrators at schools and hospitals) could use throughout an implementation study with low burden. Findings illustrate that the type of analysis used to compare direct and indirect assessments can lead to different conclusions. In our data, the means of direct and indirect assessments were closest for the Hands On exposure component even though correlations were often weak and negative. Additionally, the means for Mascot Use appeared to track together, capturing similar peaks and valleys in fidelity across time despite the gap between the overall means. Examination of scatterplots suggested a more problematic relationship for several individual cases. Thus, interpretation of means and correlations may lead to conclusions that are not true for individuals. Researchers can also appropriately address this issue by using multilevel models that account for assessments nested in time within the individual and the individual nested within the site [13], although this approach may be less pragmatic for monitoring throughout implementation.
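The three descriptive comparisons discussed here can be sketched with a few lines of code. The snippet below is an illustration only: the paired ratings and the fidelity cutoff are invented for demonstration and are not data from this study.

```python
# Illustrative sketch (hypothetical data): three simple ways to compare
# direct (observer) and indirect (self-report) fidelity ratings.
from statistics import mean

observer = [2.0, 3.5, 4.0, 2.5, 3.0, 4.5, 2.0, 3.5]  # direct assessment
self_rep = [4.0, 3.5, 4.5, 4.0, 3.0, 4.5, 3.5, 4.0]  # indirect assessment
CUTOFF = 3.0  # hypothetical threshold separating "low" from "high" fidelity

# 1) Compare means: these can look consistent even when individual cases diverge.
mean_gap = mean(self_rep) - mean(observer)

# 2) Pearson correlation: agreement in the pattern of ratings across implementers.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

r = pearson(observer, self_rep)

# 3) Scatterplot quadrants: classify each case by its (direct, indirect) pair.
#    "upper_left" = high self-report but low observed score (possible bias).
def quadrant(direct, indirect, cutoff=CUTOFF):
    vertical = "upper" if indirect >= cutoff else "bottom"
    horizontal = "right" if direct >= cutoff else "left"
    return f"{vertical}_{horizontal}"

quadrants = [quadrant(d, i) for d, i in zip(observer, self_rep)]
print(round(mean_gap, 2), round(r, 2), quadrants.count("upper_left"))
# → 0.75 0.49 3
```

With these invented data, the mean gap is modest and the correlation moderate, yet three of eight cases sit in the upper left quadrant, mirroring the point above that summary statistics and case-level views can support different conclusions.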
Collecting both direct and indirect assessments of fidelity at key intervention points may be useful to inform implementation strategies. For example, audit and feedback is an evaluation of performance for a set period of time that is given to an implementer verbally, on paper, or electronically [24]. The provision of audit and feedback would differ for a case in the bottom left quadrant (low fidelity by both indirect and direct assessment), who is aware he/she is not enacting the practice, and an individual in the upper left (high indirect fidelity and low direct fidelity), who reports using the practice when observers report otherwise. Cases in the bottom left may not believe the evidence works or may not be motivated to enact the practice. Cases in the upper left may misunderstand the meaning of the practice or lack the skill to use it. Providing differential feedback to educators in these two scenarios could result in greater shifts to the upper right quadrant. If resources are limited, interventions could collect direct and indirect assessments only until cases are consistently in the upper right quadrant. The joint measurement and comparison of direct and indirect fidelity assessments is a promising application for improving feedback to implementers, given previous research showing that fidelity monitoring supports staff retention when used as part of a supportive consultation process [48]. In mental health interventions, practitioners have reported that feedback on their fidelity helps support their learning and practice [48]. Improving the nuance of this feedback through comparison of direct and indirect assessments may prove even more useful.
The present study has both limitations and strengths. First, the resource-intensive nature of direct fidelity assessment limited the size and diversity of our sample to communities in only two locations. This limitation is likely to affect other studies as well [6], and a balance between study size and rigor of evaluation must be considered. Further, as with most fidelity studies, we developed the fidelity measures to reflect the target intervention, which meant that full validation was not feasible. For assessing Use of Mascot, item content was not an exact match between the direct and indirect assessments. Further, teachers did not receive separate training in use of the fidelity instrument, as the observers did. However, the establishment of interrater reliability for the observed measure and adherence to existing guidelines for fidelity measurement development [6] support the value of our approach. Future work should consider what aspects of fidelity can be standardized to apply across diverse contexts and interventions. Finally, we did not design this study to capture adaptations to the intervention; we conceived of all departures from our definition of fidelity as equally detrimental and did not document any potentially appropriate shifts. We made this decision because we conceived of the targets of fidelity monitoring in our study as the active ingredients necessary for influencing change. However, Wiltsey-Stirman and colleagues have identified a range of potential adaptations applicable to complex behavioral interventions (e.g., shortening, adding, repeating) and documented in a review of existing studies that adaptations to psychotherapies were not detrimental [49]. Embedding measures to codify adaptations is important for a holistic understanding of how an intervention is implemented [50-52]. Future researchers should consider including measures of adaptation in evaluation plans.
A number of strengths balance these limitations. The research team collected fidelity data frequently in all classrooms, a primary strength of the study. The availability of both direct and indirect assessments across the year allowed us to make comparisons at multiple points in time and across all units of the intervention. In addition, the research team designed the fidelity tool to be brief, simple, and specific to the core evidence-based components of WISE. We sought to employ a pragmatic approach, which is key to minimizing burden on implementers and encouraging fidelity monitoring as a routine process [48]. Although use of the tool was variable, teachers did not voice concerns about the one-page assessment.
A number of opportunities exist for future research in fidelity assessment. When making connections between fidelity and health outcomes, it is unclear whether an aggregate measure should be used or whether multiple indicators of fidelity across time would be more appropriate for inclusion in statistical models (i.e., an early, middle, and late fidelity score). Resource limitations may prevent multiple measures of fidelity, in which case researchers lack guidance on when to time the assessment [53] or how to model variability in its distance from measurement of the outcomes. Currently, Beidas and colleagues [54] are conducting a randomized controlled trial to compare the costs and accuracy of three approaches to fidelity measurement (i.e., behavioral rehearsal, chart-stimulated recall, and self-report) in cognitive behavioral therapy interventions. Future studies will need to determine whether such findings replicate in other content areas in which the implementers and interventions have different characteristics. For example, future research could compare direct and indirect measures after more rigorous training of the implementers in self-assessment or after an initial coaching session comparing an indirect to a direct assessment. In our work with WISE specifically, we will seek to determine a minimum level of fidelity that corresponds to significant impacts on various child outcomes. Similarly, we will determine how differently timed fidelity measurements relate to outcomes. Finally, considerations of fidelity measurement source and timing will be important for future studies that seek to test associations between implementation strategies and shifts in fidelity to core components. For example, measuring fidelity relative to the delivery and use of implementation supports (e.g., newsletters in this study) may provide insight into the impact of implementation strategies in particular contexts. Further, quality fidelity measures of both the innovation and the implementation intervention will be essential to tease out which strategies contribute to improved implementation [55].