Intervention Description
2.a.
Standard: The intervention must be described at a level that would allow others to implement/replicate it, including the content of the intervention, the characteristics and training of the providers, characteristics and methods for engagement of participants, and the organizational system that delivered the intervention.
2.b.
Standard: A clear theory of causal mechanisms (including identification of mediators as well as outcomes) should be stated.
2.c.
Standard: A clear statement of “for whom” and “under what conditions” the intervention is expected to be effective should be stated.
2.d.
Standard: The core components of the intervention (i.e., those hypothesized to be essential for achieving the advertised outcomes) and the theory relating these components to the outcomes must be identified and described.
A clear and complete description of the intervention is necessary to guide practice, to provide a basis for sound measurement of its implementation, and to support replication. The standards for describing the intervention have been modified from Flay et al. (2005) to require that in addition to describing the intervention, an account of the theoretical mechanism through which the intervention is expected to influence the outcome is also provided. Chen (1990) and MacKinnon (2008) discuss the two components of this theoretical mechanism: The “action theory” corresponds to how the treatment will affect mediators, and the “conceptual theory” focuses on how the mediators are related to the outcome variables. To meet this standard, authors should provide an account of both the action and conceptual theories. Making these theories explicit should help the developer to identify the features of the intervention that are most central to the action theory. These features should be clearly identified as the “core” components of the intervention. These core components should be fully described.
The statement regarding the conditions under which the intervention is expected to be efficacious should clarify the populations, settings, times, and outcomes, or the “range of application” for the intervention. In so doing, the underlying assumptions about hypothesized similarities in the causal structure as well as anticipated limitations to application across populations, settings, times, and outcomes will be documented. This statement should also define the broad target for future dissemination if the intervention is demonstrated to be effective.
The level of detail included in these statements describing the intervention must be sufficient so that others would be able to replicate the intervention.
2.e.
Standard: The anticipated timing of effects on theoretical mediators and ultimate outcomes must be described.
The intervention theory should also clarify when (relative to the intervention) the expected outcomes should be observed. The description of timing of outcomes should be based on an understanding of the developmental epidemiology of the targeted behavior. Is the intervention expected to influence the ultimate outcomes immediately (as, for example, adding lighting to a parking lot would be expected to influence theft from the lot), or is a lag anticipated (as, for example, encouraging attachment to school in elementary school children might be expected to reduce substance use during adolescence)?
2.f.
Standard: It is necessary to characterize the research evidence supporting the potential that the intervention will affect outcomes that have practical significance in terms of public health impact.
Demonstrated public health impact is essential at a later stage of program development, but it should not be ignored at the efficacy stage. This standard requires that an argument be made connecting the observed outcomes to outcomes of practical significance. This connection can be accomplished by collecting and reporting data on such outcomes (e.g., the number of subjects who stopped using tobacco as a result of the intervention, the number of child abuse and neglect cases or crimes prevented, or increases in the number of high school graduates). If such outcomes are not available at the efficacy trial stage, a logical argument can be made to connect the available outcomes with outcomes of practical significance. For example, a study may collect data on known precursors of criminal activity such as low self-control or poor parental supervision. Making use of data collected by others that link these proximal outcomes to outcomes of practical importance, the researcher can characterize the potential of the intervention to produce practically meaningful outcomes.
Measures and Their Properties
3.a.
Standard: The statement of efficacy can only be about the outcomes (e.g., mediators as well as problem and well-being outcomes) that are measured and reported [Reporting Standard].
3.b.
Standard: The quality and quantity of implementation must be measured and reported.
3.b.i.
Standard: Precursors to actual implementation such as completion of training, practitioner-coach ratio, caseload, staff qualifications, and availability of necessary resources must be measured and reported.
3.b.ii.
Standard: The integrity and level of implementation/delivery of the core components of the intervention must be measured and reported.
3.b.iii.
Standard: The acceptance, compliance, adherence, and/or involvement of the target audience in the intervention activities must be measured and reported.
3.b.iv.
Standard: Level of exposure should be measured, where appropriate, in both the treatment and control conditions.
Implementation fidelity influences intervention outcomes (Durlak and Dupre 2008; Fixsen et al. 2005), and the quality of implementation of preventive interventions when delivered in “real-world” settings is often suboptimal (Gottfredson and Gottfredson 2002; Hallfors and Godette 2002; Ennett et al. 2003). Assessing implementation fidelity and quality is an important activity at all stages of development of an EBI (Allen et al. 2012), and tools to guide researchers in the reporting of implementation fidelity (e.g., the Oxford Implementation Index, Montgomery et al. 2013a) are available. It is important to understand the extent to which the core components of the intervention can be varied and still achieve the desired effect and to document modifications that occur in the field. Although information on the quality and quantity of implementation will become more important in later stages of research, it is essential that such information be collected and reported in earlier trials that produce desired outcomes. This information will provide a benchmark against which implementation levels in later trials can be compared.
The level of implementation of the intervention is meaningful only in comparison to what is present in the comparison condition. Many interventions contain elements that are likely to be present in the comparison group as well as in the treatment group. For example, most drug treatment courts involve intensive probation, frequent judicial hearings, drug testing, and drug treatment. A lower dosage of these same components is likely to be part of usual service for a “treatment as usual” comparison group. Similarly, interventions that are related to the intervention of interest may be present in the control condition. It is important to document the differences between the services provided to the treatment and comparison groups while taking care not to allow the measurement itself to influence what is delivered in the control condition.
It is desirable during the efficacy trial period to assess not only the quality and quantity of implementation (Efficacy Standard 3.b.), but also the factors that are likely to be related to variation in implementation. These factors include features of the intervention such as the amount and type of training involved in implementing the intervention during the efficacy trial, the clarity of the intervention materials, the type of setting in which the intervention is tested, and external (social, economic, and political) forces in the larger community. Many efficacy trials are small in scope and would therefore not provide sufficient variability across different conditions of these factors to provide useful data without deliberate manipulation. Spoth et al. (2013) recommend embedding research on factors that are likely to be relevant in the dissemination stage (e.g., factors that might influence implementation quality when delivered in natural settings, factors that might influence communities’ decisions to select or adopt the intervention, etc.) into earlier stage research studies. Efficacy studies, for example, might randomly assign units to different levels of training and technical assistance, or to different levels of organization development assistance. Short of conducting this type of rigorous research on these factors, qualitative data on factors that are perceived to be related to implementation quality would provide a useful starting point for more thorough investigation into these factors at a later stage.
3.c.
Standard: Clear cost information must be reported [Reporting Standard].
Flay et al. (2005, p. 167) included a standard stating that “clear cost information must be readily available” before an intervention is ready for scaling up. A discussion of the types of costs that should be included in the cost calculation was also provided. Glasgow and Steiner (2012) and Spoth et al. (2013) underscore the need for such information in community decisions to adopt and to sustain evidence-based practices later on. Prevention scientists can begin to pave the way for accurate cost tracking during the efficacy trial stage by documenting program costs.
Of course, assessing costs is not straightforward. There is currently no accepted standard to guide cost assessment, and considerable variability exists in what elements are included. Costs incurred during efficacy and effectiveness trials are likely to include significant costs related to conducting the research that are difficult to separate from the costs likely to be incurred by communities later adopting the intervention. Further, costs are likely to change over time as programs evolve. The Institute of Medicine recently held a workshop on standards for benefit-cost assessment of preventive interventions (http://www.iom.edu/Activities/Children/AnalysisofPreventiveInterventions/2013-NOV-18.aspx), and a recently formed SPR task force is studying this topic and will soon provide much needed guidance in this area. Prevention scientists should be guided by the forthcoming recommendations of these groups. In the meantime, investigators are encouraged to include in their cost reporting not only the cost of intervention materials and training, but also projected costs to the delivering organization, as discussed in Flay et al. (2005). These include:
- Nonresearch investments in delivery of staff training
- On-site time
- Facility, equipment, or resource rental and maintenance
- Reproduction of materials
- Value of volunteer labor and donated space and equipment
- Attendant delivery costs for consultants, clerical staff, and physical plants
Foster et al. (2007) provide a more detailed discussion of cost elements in prevention interventions and how they can be measured.
Researchers do not usually estimate cost-effectiveness in efficacy trials. Even at the efficacy level, however, it is desirable to estimate cost-effectiveness (i.e., the cost of achieving the observed change in the outcome). This information will influence decisions to adopt the intervention at a later stage and so should be collected during earlier stages if possible.
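The calculation itself is straightforward once per-participant costs and outcomes are tracked. The sketch below, using entirely hypothetical numbers, illustrates an incremental cost-effectiveness ratio of the kind described above (the cost of achieving one unit of change in the outcome relative to the control condition).

```python
# Illustrative sketch with hypothetical numbers: incremental cost-effectiveness ratio (ICER)
# = difference in per-participant cost divided by difference in mean outcome.

cost_per_participant_tx = 420.0    # intervention materials, training, delivery time
cost_per_participant_ctrl = 55.0   # cost of services received under usual care
mean_outcome_tx = 0.62             # e.g., proportion abstinent at follow-up
mean_outcome_ctrl = 0.48

incremental_cost = cost_per_participant_tx - cost_per_participant_ctrl
incremental_effect = mean_outcome_tx - mean_outcome_ctrl

icer = incremental_cost / incremental_effect
print(f"Cost per additional unit of outcome: ${icer:,.0f}")
```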
- Desirable Standard: It is desirable to collect data on outcomes that have clear public health impact.
- Desirable Standard: It is desirable to measure potential side effects or iatrogenic effects.
3.d.
Standard: There must be at least one long-term follow-up at an appropriate interval beyond the end of the intervention. For policy interventions whose influence is expected to continue for an indefinite period of time, evidence must be presented for a sustained effect of the policy for an appropriate interval after the policy was put in place.
The positive effects of an intervention may diminish rapidly or slowly, or broaden and increase over time. Some interventions may demonstrate effects on problems that emerge later in development, such as substance use or abuse, sexual behavior, mental disorder, criminal behavior, or drunk driving (Griffin et al. 2004; Olds et al. 2004; Wolchik et al. 2002). Flay et al. (2005) recommended a follow-up interval of at least 6 months after the intervention but noted that the most appropriate interval may be different for different kinds of interventions. We believe that the 6-month time frame is a reasonable minimum time frame to demonstrate that effects observed at the end of the intervention do not dissipate immediately, but a more accurate picture of intervention effects requires that measurement time points coincide with the theory of timing of intervention effects specified in the intervention description (see above). This theory should be developed based on an understanding of the developmental epidemiology of the targeted behavior. For example, to demonstrate efficacy of a fifth grade intervention on outcomes that arise during adolescence, it is necessary to include measurement during adolescence rather than only 6 months later. The causal theory linking the intervention to the ultimate outcomes (see Efficacy Standard 2.e.) should specify proximal outcomes and the expected timing of effects on them. The timing of measurement of both the proximal and ultimate outcomes should conform to this theory.
3.e.
Standard: Measures must be psychometrically sound.
The measures used must either be of established quality, or the study must demonstrate their quality. Quality of measurement consists of construct validity and reliability.
3.e.i.
Standard: Construct validity—Valid measures of the targeted behavior must be used, following standard definitions within the appropriate related literature.
3.e.ii.
Standard: Reliability—Internal consistency (alpha), test–retest reliability, and/or reliability across raters must be reported.
3.e.iii.
Standard: Where “demand characteristics” are plausible, there must be at least one form of data (measure) that is collected by people different from the people who are applying or delivering the intervention. This is desirable even for standardized achievement tests.
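As one illustration of the reliability evidence called for in Standard 3.e.ii., the sketch below computes Cronbach's alpha for a simulated multi-item scale. The data, item names, and sample size are hypothetical; test-retest reliability would simply be the correlation of total scores across two measurement occasions.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency for a set of items scored in the same direction.

    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated 5-item scale for 200 respondents (hypothetical data).
rng = np.random.default_rng(1)
trait = rng.normal(size=200)
items = pd.DataFrame({f"item{i}": trait + rng.normal(scale=1.0, size=200) for i in range(1, 6)})

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
# Test-retest reliability: r_tt = np.corrcoef(score_time1, score_time2)[0, 1]
```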
Theory Testing
4.
Standard: The causal theory of the intervention should be tested.
Although the primary emphasis in efficacy trials is on demonstrating that an intervention is efficacious for producing certain outcomes, understanding the causal mechanism that produces this effect will allow for greater generalization to the theory of the intervention rather than to the specific components of the intervention. For example, the knowledge that implementing a specific model of cooperative learning in a classroom increases achievement test scores is valuable. But knowledge that the mechanism through which this effect occurs is increased time on-task is even more valuable because it facilitates the development of additional interventions that can also target the same mediator. It is therefore important to measure the theoretical mediators that are targeted by the intervention.
As noted earlier (see Efficacy Standard 2.d.), the intervention theory involves both an “action theory” of how the treatment will affect mediators and a “conceptual theory” of how the mediators are related to the outcome variables. Both aspects of the intervention theory should be tested at the efficacy stage. Tests of the action theory probe the extent to which each core component influences the mediators it is hypothesized to move. The strongest of such tests would systematically “dismantle” the intervention into core components. That is, they would randomly assign subjects to conditions involving different core components and compare the effects of the different combinations on the hypothesized mediators. Of course, such designs are seldom feasible when the subjects are schools or communities. However, testing for intervention effects on the hypothesized mediators would provide a test of the action theory of the intervention as a whole.
Testing the conceptual theory involves analysis of mediating mechanisms. Despite recent advances in methods for testing mediational processes (e.g., MacKinnon 2008), these methods are not as well developed as are methods for testing causal effects of the intervention. Intervention theories often involve complex causal processes involving numerous mediators operating in a chain. Testing such complex causal chains in a rigorous fashion is not yet possible with existing tools (Imai et al. 2012). Even for simple theories involving only one mediator, it is not possible to test the theory underlying the intervention except in comparison with another theory. Available tools allow only for a rudimentary examination of the behavior of theorized mediating variables. Even so, such rudimentary tests can provide valuable information about which of the proposed mediators are both responsive to the intervention and correlated with the outcomes. Such information, although not constituting a strong test of the full intervention theory, at least provides information about which mechanisms are consistent with the stated theory. These tests should be conducted at the efficacy stage.
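A minimal sketch of such a rudimentary test is shown below, assuming a single mediator and simulated data: the "a" path (action theory) and "b" path (conceptual theory) are estimated by ordinary least squares, and a bootstrap confidence interval is formed for their product. Variable names and effect sizes are hypothetical, and a full analysis would follow MacKinnon (2008) rather than this simplified illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# Simulated data consistent with a simple action/conceptual theory:
# treatment -> mediator (action theory), mediator -> outcome (conceptual theory).
treat = rng.binomial(1, 0.5, n)
mediator = 0.5 * treat + rng.normal(size=n)
outcome = 0.4 * mediator + 0.1 * treat + rng.normal(size=n)
df = pd.DataFrame({"treat": treat, "m": mediator, "y": outcome})

# Action theory: effect of treatment on the mediator (path a).
a = smf.ols("m ~ treat", data=df).fit().params["treat"]
# Conceptual theory: effect of the mediator on the outcome, controlling for treatment (path b).
b = smf.ols("y ~ m + treat", data=df).fit().params["m"]
print(f"indirect effect (a*b) = {a * b:.3f}")

# Bootstrap confidence interval for the indirect effect.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    d = df.iloc[idx]
    a_i = smf.ols("m ~ treat", data=d).fit().params["treat"]
    b_i = smf.ols("y ~ m + treat", data=d).fit().params["m"]
    boot.append(a_i * b_i)
print("95% CI:", np.percentile(boot, [2.5, 97.5]))
```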
We caution that high-quality measurement of theoretical mechanisms will often be costly because it may require additional measurement waves between the intervention and the ultimate outcome as well as additional modes of measurement (e.g., observations). In some cases, the ultimate outcome may be decades in the future. Although measuring and testing causal mechanisms is critical to advancing science, in reality doing so may require trade-offs with the strength of the test of the effect of the intervention on the ultimate outcome. This trade-off creates tension that will have to be resolved over time as mediation analysis strategies improve and funding sources increase funding to allow for more rigorous testing of theoretical pathways through which interventions affect outcomes. In the meantime, this tension should be resolved in a way that preserves the integrity of the test of the intervention on the outcomes.
Valid Causal Inference
5.a.
Standard: The design must have at least one control condition that does not receive the tested intervention.
The control condition can be no-treatment, attention-placebo, or wait-listed. Or, it can be some alternative intervention or treatment as usual (e.g., what the participants would have received had the new intervention not been introduced), in which case the research question would be, “Is the intervention better than a current one?”
5.b.
Standard: Assignment to conditions must minimize bias in the estimate of the relative effects of the intervention and control condition, especially due to systematic selection (e.g., self-selection or unexplained selection), and allow for a legitimate statistical statement of confidence in the results.
Although there are many sources of bias in the estimation of causal effects, selection effects are the most serious and prevalent in prevention research. Researchers should assign units to conditions in such a way as to minimize these biases. Such assignment reduces the plausibility of alternative explanations for the causes of observed outcomes. This then increases the plausibility of causal inference about the intervention. The design and the assumptions embedded in the design must take into account exactly how people or groups were selected into intervention and control conditions and how influences on the treatment and control conditions other than the intervention might differ.
5.b.i.
Standard: For generating statistically unbiased estimates of the effects of most kinds of preventive interventions, well-implemented random assignment is best because it is most clearly warranted in statistical theory.
Within the context of ethical research, it is necessary to use randomization whenever possible to ensure the strongest causal statements and produce the strongest possible benefits to society (Fisher et al. 2002). Many objections to randomization may be unfounded (Cook and Payne 2002). Randomization is possible in most contexts and situations. Gerber et al. (2013) provide numerous examples of RCTs conducted to test policies in diverse areas such as migration, education, health care, and disease prevention. The White House recently sponsored an event to encourage the use of RCTs to test different policy options for social spending (http://www.whitehouse.gov/blog/2014/07/30/how-low-cost-randomized-controlled-trials-can-drive-effective-social-spending). In fact, the Cochrane registry (www.cochrane.org) contains over 700,000 entries on randomized trials. The level of randomization should be driven by the nature of the intervention and the research question. Randomization can be of individuals or of intact groups such as classrooms, schools, worksites, neighborhoods, or clinics (Boruch 2005; Gerber et al. 2013). Also, the timing of intervention can be randomly assigned to allow a short-term comparison of outcomes between the early and later intervention groups.
5.b.ii.
Standard: Reports should specify exactly how the randomization was done and provide evidence of group equivalence. It is not sufficient to simply state that participants/units were randomly assigned to conditions [Reporting Standard].
Because correct randomization procedures are not always implemented or sometimes break down in practice, it is essential that the randomization process be described in sufficient detail so that readers can judge the likelihood that the initial randomization process was correct and has not broken down despite being initially implemented correctly. The description of the process should include details of exactly how cases were assigned to conditions and a discussion of the extent to which the assignment was well concealed, or could have been guessed at or tampered with. A “well-implemented” random assignment is one in which this possibility is judged to be small. Post-randomization checks on important outcomes measured prior to the intervention should be provided so that the pretreatment similarity of the experimental groups can be assessed.
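The sketch below illustrates, with a hypothetical roster of schools, the kind of auditable assignment procedure and post-randomization balance check this standard asks authors to report: a documented seed, a simple 1:1 allocation, and a comparison of pretest means across conditions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical roster of 40 schools with a pretest measure of the outcome.
rng = np.random.default_rng(20240101)   # seed documented so the assignment is auditable
schools = pd.DataFrame({
    "school_id": range(40),
    "pretest": rng.normal(50, 10, 40),
})

# Simple 1:1 randomization: shuffle the roster and assign the first half to treatment.
shuffled = schools.sample(frac=1, random_state=2024).reset_index(drop=True)
shuffled["condition"] = ["treatment"] * 20 + ["control"] * 20

# Post-randomization balance check on the pretest (report, do not merely assert, equivalence).
tx = shuffled.loc[shuffled.condition == "treatment", "pretest"]
ct = shuffled.loc[shuffled.condition == "control", "pretest"]
t, p = stats.ttest_ind(tx, ct)
print(f"pretest means: tx={tx.mean():.1f}, control={ct.mean():.1f}; t={t:.2f}, p={p:.2f}")
```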
Although random assignment is the strongest possible design for generating statistically unbiased estimates of intervention effects, and although perceived obstacles to random assignment are often not as difficult to overcome as initially anticipated, random assignment studies sometimes involve important trade-offs, especially in terms of generalization or statistical power. Researchers must sometimes rely upon fallback designs, hoping that the estimates of effects produced using these designs approach those that would be obtained through a random assignment study. There has been much debate about which designs should be considered as suitable alternatives when the trade-offs involved in random assignment are too costly. Fortunately, evidence from within-study comparisons of different alternatives versus random assignment has yielded invaluable information about which designs are likely to yield results most comparable to those obtained from random assignment studies. These within-study comparisons directly compare the effect size obtained from a well-implemented random assignment design with the effect size from a study that shares the same treatment group as the randomized study but has a nonrandomized comparison group instead of a randomly formed one. In these studies, the effect size obtained from the randomized arm of the study serves as a benchmark against which to compare the effect size obtained from the nonrandomized arm of the study.
Cook et al. (2008) summarize what has been learned from within-study comparisons and report results from 12 recent studies in an attempt to identify features of nonrandomized designs whose results match those from randomized designs most closely. This research identifies only two alternative designs, regression discontinuity designs and comparison time series designs, which reliably generate unbiased estimates of treatment effects. There are now a total of seven studies comparing regression discontinuity and experimental estimates at the regression discontinuity cutoff score and there are six comparing experimental and interrupted time series or comparison time series designs with a nontreatment comparison series. All point toward the causal viability of the quasi-experimental design in question. The third design considered in Cook et al. (2008) involves matched comparison group designs without a time series structure. In their paper, the results from these designs approach those of random assignment studies only under certain restrictive conditions—when the selection process happens to be completely known and measured well and when local intact comparison groups are chosen that heavily overlap with treatment groups on pretest measures of the outcome. Since then, somewhat conflicting claims have been made about the other quasi-experimental design features that promote causal estimates close to those of an experiment. The regression discontinuity, comparison time series, and matched group designs are described below, along with potential trade-offs involved with each. The trade-offs anticipated for randomized designs and the fallback design under consideration should be carefully weighed against each other when determining the strongest possible design for a given study.
Research on alternative quasi-experimental designs for evaluation studies is evolving quickly. The standards articulated here take advantage of the most rigorous research available to date, but we expect that as the field continues to evolve, additional alternative designs will be identified using the within-group comparisons strategy.
5.b.iii.
Standard: Well-conducted regression discontinuity designs are second only to random assignment studies in their ability to generate unbiased causal estimates.
Regression discontinuity designs involve determining who receives an intervention based on a cutoff score on a preintervention measure. The cutoff score might be based on merit or need, or on some other consideration negotiated with the other research partners. For example, students with reading scores below the 25th percentile might be assigned to a tutoring intervention while the remaining students serve as a control, or communities whose per capita income level falls below a certain point might be targeted for certain services while those above the cut-point are not. The regression of the outcome on the assignment score is used to estimate intervention effects. Intervention effects are inferred by observing differences in the slopes and/or intercepts of the regression lines for the different groups. This design provides unbiased estimates of the treatment effects because, as in randomized studies, the selection model is completely known.
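A minimal analysis sketch is shown below, using simulated data for the tutoring example: the assignment score is centered at the cutoff, separate slopes are allowed on each side, and the coefficient on the treatment indicator estimates the effect at the cutoff. Real applications would also examine functional form and bandwidth choices; all numbers here are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500

# Hypothetical example: students with reading scores below the 25th percentile get tutoring.
reading = rng.normal(50, 10, n)
cutoff = np.percentile(reading, 25)
tutored = (reading < cutoff).astype(int)
outcome = 0.6 * reading + 4.0 * tutored + rng.normal(0, 5, n)   # true effect of 4 points

df = pd.DataFrame({
    "y": outcome,
    "treat": tutored,
    "score_c": reading - cutoff,   # center the assignment variable at the cutoff
})

# Local linear specification with separate slopes on each side of the cutoff;
# the coefficient on `treat` is the estimated effect at the cutoff.
rd = smf.ols("y ~ treat + score_c + treat:score_c", data=df).fit()
print(rd.params["treat"], rd.conf_int().loc["treat"].values)
```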
Cook et al. (2008) analyzed three within-study comparisons contrasting causal estimates from a randomized experiment with those from regression discontinuity studies. The regression discontinuity design studies produced comparable causal estimates to the randomized studies at points around the cutoff point. There are now four further studies in each of which the authors claim that the regression discontinuity and experimental results are similar at the cutoff. There is also one (Wing and Cook 2013) showing that when a pretest comparison function is added to the regular regression discontinuity (called a “comparison regression discontinuity function”), this mitigates the disadvantages of the regression discontinuity relative to the experiment: regression discontinuity is more dependent on knowledge of functional forms, its statistical power is lower, and causal generalization is limited to the cutoff score (Shadish et al. 2002; Trochim 1984, 2000). Although Wing and Cook (2013) is the only relevant study with an experimental benchmark, its results indicate that a comparison regression discontinuity function can enhance statistical power almost to the level of the experiment, can help support conclusions about proper functional form that the nonparametric experiment does not need, and can attain causal conclusions away from the cutoff (and not just at it) that are similar to those of the experiment. So adding this particular comparison to the regression discontinuity function can significantly reduce the limitations of the regression discontinuity design relative to an experiment.
5.b.iv.
Standard: For some kinds of large-scale interventions (e.g., policy interventions, changes to public health law, whole-state interventions) where randomization is not practical or possible, comparison time series designs can provide unbiased estimates of intervention effects.
Flay et al. (2005) included a standard recommending the use of interrupted time series designs for large-scale interventions where randomization was not feasible. The logic of this design is that the effect of an intervention can be judged by whether it affects the intercept or slope of an outcome that is repeatedly measured (Greene 1993; Nerlove and Diebold 1990; Shadish et al. 2002). For example, Wagenaar and Webster (1986) evaluated the effects of Michigan’s mandatory automobile safety seat law for children under 4 by comparing the rate of injuries to passengers 0–3 years old for the 4 years prior to enactment of the law and a year-and-three-quarters after its enactment. Flay et al. (2005) pointed out that these designs could be strengthened by using comparison series in locations in which the intervention was not implemented, by using naturally occurring “reversals” of policies to test whether the outcome responds to both reversals and reinstatements of the policy, and by increasing the number of baseline time points.
Time series designs with only a single treatment group are rarely unambiguously interpretable because the effects of the intervention are often confounded with other events occurring at the same time. Often broad reforms are made in response to highly publicized, often emotionally laden, incidents. These incidents may result in numerous reforms that fall into place at roughly the same time, making it impossible to isolate the effects of only one of them using time series analysis. This is why all but one test of the similarity of experimental and interrupted time series results deals with a comparison time series design rather than a single group interrupted time series design. Using the comparison time series design, the same outcome measures might be collected in a neighboring county or state or (in studies of school policy reform) in a grade level not affected by the reform. These designs, if well implemented, provide a means by which confounding effects due to co-occurring events can be ruled out. A nascent literature (reviewed in St. Clair et al. 2014) comparing the estimates from these comparison time series designs with those of randomized designs suggests that the comparison time series designs produce unbiased estimates of treatment effects (this assumes, of course, that there are few studies with conflicting results sitting in “file drawers”). Wagenaar and Komro (2013) encourage the use of these comparison time series designs for research evaluating public health laws and policies and discuss a number of design features (e.g., multiple comparison groups and multiple levels of nested comparisons, reversals, replications, examination of dose response) that can be used to further enhance these designs. These designs have broad utility for a wide variety of research needs including establishing theory-based functional forms of intervention effects over time (e.g., understanding the diffusion S-curves, tipping point transitions, and decay functions that often characterize effects when going to scale). We conclude that the comparison time series design, but not the single group interrupted time series design, can provide a useful alternative to randomized designs.
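The sketch below illustrates the basic comparison time series logic with simulated monthly rates for a hypothetical policy jurisdiction and a neighboring comparison jurisdiction: the interaction of the treated-series indicator with the post-policy indicator estimates the change in level in the policy series beyond any change shared with the comparison series. A full analysis would also model changes in trend and serial correlation; all values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical monthly injury rates: 48 pre-policy and 24 post-policy months,
# for a policy state and a neighboring comparison state.
months = np.arange(72)
post = (months >= 48).astype(int)
frames = []
for state, effect in [("policy", -6.0), ("comparison", 0.0)]:
    rate = 40 - 0.05 * months + effect * post + rng.normal(0, 2, 72)
    frames.append(pd.DataFrame({
        "rate": rate, "month": months, "post": post,
        "treated": int(state == "policy"),
    }))
df = pd.concat(frames, ignore_index=True)

# Comparison interrupted time series: `treated:post` estimates the post-policy change in
# level in the treated series over and above any change shared with the comparison series.
model = smf.ols("rate ~ month + post + treated + treated:post + treated:month", data=df).fit()
print(model.params["treated:post"])
```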
5.b.v.
Standard: Non-randomized matched control designs can rarely be relied on to produce credible results. They should be used only under the following conditions: (a) Initial group differences are minimized, especially by identifying intact comparison groups that are local to the treatment group and demonstrably heavily overlap with it on at least pretest measures of the outcome (Cook et al. 2008); (b) the process by which treatment subjects select into the intervention group (or are selected into it) is fully known, well-measured, and adequately modeled (Diaz and Handa 2006; Shadish et al. 2008); or (c) the covariates used to model any group differences remaining after careful comparison group choice lead to no detectable pretest difference between the treatment and comparison groups in adequately powered tests. To this last end, it is desirable to explicate the selection process and use it to choose covariates or, where this is not possible, to include as many covariates as possible that tie into multiple domains.
Early reviews of within-study comparisons (Glazerman et al. 2003; Bloom et al. 2005) concluded that estimates of effects from studies using common strategies for equating groups (e.g., matching, analysis of covariance, propensity scoring, selection modeling) are often wrong. A well-known example is the research on hormone replacement therapy for women where prior nonrandom trials suggested positive effects but a large randomized trial found harmful effects (Shumaker et al. 2003). A more recent summary of within-study comparisons comes to a slightly more optimistic conclusion about the value of nonrandomized matched comparison group designs. Cook et al. (2008) summarized results from nine within-study comparisons of random assignment versus matched comparison groups. Some but not all of these matched comparison estimates were similar to the estimates obtained in the randomized arm of the study. Cook et al. (2008) described the very specific conditions under which nonrandomized matched designs produce similar results to randomized designs.
First, studies in which researchers designed the study beforehand to identify an intact comparison group that was “likely to overlap with the treatment group on pretest means and even slopes” (Cook et al. 2008, p. 736) resulted in comparable experimental and nonexperimental study effect size estimates. For example, Bloom et al. (2005) evaluated the National Evaluation of Welfare-to-Work Strategies. One part of this evaluation reported on five sites in which a comparison group from a randomized trial conducted in a different job training center was used as a matched comparison group, but the comparison training centers were located in the same state (or the same city in four of the five sites), and the measurement was taken at the same time as the measures for the subjects in the job training sites that were the focus of the evaluation. No matching of individual cases was conducted, but the careful selection of intact groups from similar locations and times resulted in pretest means and slopes that did not differ between the treatment and comparison groups. Conversely, when intervention and comparison sites were from different states or even different cities within a state, the groups were not at all equivalent and differences could not be adjusted away. Cook et al. (2008) concluded that the use of intact group matching, especially using geographic proximity as a matching variable, is a useful strategy for reducing initial selection bias.
The second condition under which effect sizes from nonrandomized matched comparison group designs matched those from randomized studies involved treatment and nonrandomized comparison groups that differed at pretest but where the selection process into treatment was known and modeled (Diaz and Handa 2006). An example of this type of study comes from an evaluation of PROGRESA in Mexico. In this study, eligible villages were randomly assigned to receive the intervention or not, and eligible families in treatment villages were compared with eligible families in control villages on outcomes. Eligibility for the intervention was based on scores on a measure of material welfare, both at the village level and at the individual family level within village. The design identified villages that were too affluent to be eligible for PROGRESA. These villages were clearly different from the villages that participated in PROGRESA, but the selection mechanism that resulted in some villages and families being selected into the study and others not was completely known and measured. Once the same measure of material welfare that had determined eligibility for PROGRESA was statistically controlled, selection bias was reduced to essentially zero.
These conditions—intact group matching and complete knowledge of the selection process—are rare. It is not yet clear when nonrandomized matched comparison group designs that do not meet these conditions will yield valid results, regardless of the technique used for statistical adjustment. However, the likelihood of bias is reduced when initial group equivalence is inferred from adequately powered no-difference results on multiple, heterogeneous baseline measures that include at least one wave of pretest measures of the main study outcome. This is the criterion currently advocated by the What Works Clearinghouse of the Institute of Education Sciences (http://ies.ed.gov/ncee/wwc/).
In nonexperimental studies, the choice of data analysis technique is not very important for reducing selection bias. Direct comparisons of ordinary least squares and propensity score matching methods have not shown much of a difference to date (Glazerman et al. 2003; Bloom et al. 2005; Shadish et al. 2008; Cook et al. 2009), though the latter is theoretically preferable because it is nonparametric and requires demonstrated overlap between the treatment and comparison groups on observed variables. More leverage for reducing selection bias comes from (a) preintervention theoretical analysis of the selection process into treatment—or even direct observation of this process—so as to know which covariates to choose, (b) selecting local comparison groups that maximize group overlap before any covariate choice, and (c) using a heterogeneous and extensive collection of covariates that, at a minimum, includes one or more pretest waves of the main study outcome (Cook et al. 2009).
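For illustration, the sketch below estimates propensity scores from covariates assumed to drive selection (including a pretest measure of the outcome) and performs a simple nearest-neighbor match. The data are simulated, and an actual study would use a dedicated matching package, check covariate overlap and balance, and follow the design guidance above rather than rely on this bare-bones version.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 1000

# Hypothetical observational data: selection into treatment depends on a pretest
# measure of the outcome and a second covariate.
pretest = rng.normal(0, 1, n)
risk = rng.normal(0, 1, n)
p_treat = 1 / (1 + np.exp(-(0.8 * pretest + 0.5 * risk)))
treat = rng.binomial(1, p_treat)
posttest = pretest + 0.3 * treat + rng.normal(0, 1, n)
df = pd.DataFrame({"pretest": pretest, "risk": risk, "treat": treat, "post": posttest})

# Estimate propensity scores from the covariates thought to drive selection.
ps_model = LogisticRegression().fit(df[["pretest", "risk"]], df["treat"])
df["ps"] = ps_model.predict_proba(df[["pretest", "risk"]])[:, 1]

# Greedy 1:1 nearest-neighbor matching on the propensity score (with replacement).
treated = df[df.treat == 1]
controls = df[df.treat == 0]
matched_effects = []
for _, row in treated.iterrows():
    j = (controls["ps"] - row["ps"]).abs().idxmin()
    matched_effects.append(row["post"] - controls.loc[j, "post"])
print(f"matched estimate of treatment effect: {np.mean(matched_effects):.2f}")
```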
The evidence to date suggests that randomized studies are less vulnerable to bias than nonrandomized studies, but that regression discontinuity and comparison time series designs may be suitable alternatives to randomized studies. However, it bears repeating that a poorly implemented randomized design is as likely to yield biased results as a nonrandomized study. Randomization can be subject to tampering. But even when executed faithfully, randomized trials often suffer from differential attrition across study groups, which often reduces group equivalence and renders the study results ambiguous. Therefore, we caution against any process for identifying efficacious interventions that relies on the initial study design without carefully considering the quality of implementation of the design. We also provide the following standard to guide reporting of randomization procedures (above) and analysis and reporting of study attrition:
5.c.
Standard: The extent and patterns of missing data must be addressed and reported.
Analyses to minimize the possibility that observed effects are significantly biased by differential patterns of missing data are essential. Sources of missing data include attrition from the study, missed waves of data collection, and failure to complete particular items or individual measures. Missing data are particularly troubling when the extent and pattern of missingness differ across experimental conditions. Differences across conditions in the nature and magnitude of attrition or other missingness can bias estimates of intervention effects if they are not taken into account. Note that differential measurement attrition can occur even when the rates of attrition are comparable across groups.
Schafer and Graham (2002) discuss methods of analyzing data in the face of various kinds of missingness. One common strategy is to impute missing data based on the data that are available. Appropriate application of these imputation methods requires assumptions about the pattern of missingness, however, and these assumptions are often not justified in practice. The required assumption is that missing data are “missing at random,” which means that there is no discernible pattern to the missingness once measured variables are controlled. If this assumption cannot be met (as is often the case), sensitivity tests should be conducted to probe the likely impact that missing data might have on the estimates of the intervention effect (Enders 2011; Imai 2009; Muthen et al. 2011).
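One simple form of sensitivity test is a delta adjustment: missing outcomes are imputed under increasingly pessimistic assumptions about how unobserved cases differ from observed ones, and the intervention effect is re-estimated at each step. The sketch below illustrates the idea with simulated data in which treated participants with poor outcomes drop out more often; all values are hypothetical, and the approach is far simpler than the methods cited above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 600

treat = rng.binomial(1, 0.5, n)
outcome = 0.3 * treat + rng.normal(size=n)
# Differential missingness: treated participants with poor outcomes drop out more often.
p_missing = np.where((treat == 1) & (outcome < 0), 0.35, 0.10)
observed = rng.uniform(size=n) > p_missing
df = pd.DataFrame({"treat": treat, "y": np.where(observed, outcome, np.nan)})

def effect(d):
    return d.loc[d.treat == 1, "y"].mean() - d.loc[d.treat == 0, "y"].mean()

print(f"complete-case estimate: {effect(df):.3f}")

# Delta adjustment: assume unobserved treated outcomes are worse than observed ones
# by `delta`, impute accordingly, and see how the estimated effect shifts.
for delta in [0.0, -0.25, -0.5]:
    imputed = df.copy()
    fill_t = imputed.loc[imputed.treat == 1, "y"].mean() + delta
    imputed.loc[(imputed.treat == 1) & imputed.y.isna(), "y"] = fill_t
    fill_c = imputed.loc[imputed.treat == 0, "y"].mean()
    imputed.loc[(imputed.treat == 0) & imputed.y.isna(), "y"] = fill_c
    print(f"delta={delta:+.2f}: estimated effect = {effect(imputed):.3f}")
```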
Statistical Analysis
6.a.
Standard: Statistical analysis must be based on the design and should aim to produce a statistically unbiased estimate of the relative effects of the intervention and a legitimate statistical statement of confidence in the results.
6.b.
Standard: In testing main effects, the analysis must assess the treatment effect at the level at which randomization took place.
In many contexts in which prevention researchers carry out their work, the participants belong to naturally occurring groups that often must be taken into account when conducting statistical tests. For example, if a researcher is testing a drug prevention curriculum in third grade classrooms, the fact that the students belong to classrooms means that student responses may not be independent of those of other students in the same classroom, and this has an important impact on the validity of the statistical tests. Often, researchers will randomize at a higher level (e.g., the school) but analyze the data at a lower level (e.g., individuals). Doing so almost always results in a violation of the assumption of the statistical independence of observations. Even small violations of this assumption can have very large impacts on the standard error of the effect size estimate (Kenny and Judd 1986; Murray 1998), which in turn can greatly inflate the type I error rate (e.g., Scariano and Davenport 1987). In these situations, analysts must conduct analyses at the level of randomization and must correctly model the clustering of cases within larger units (Brown 1993; Bryk and Raudenbush 1992; Hedeker et al. 1994; Zeger et al. 1988). For example, if an intervention is delivered at the clinic level (e.g., some clinics deliver a new intervention, others do not), then clinics should be randomly assigned to conditions, and the statistical analyses must take into account that patients are nested within clinics.
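A minimal sketch of such a clinic-randomized analysis is shown below: simulated patients are nested within clinics, clinics (not patients) carry the treatment assignment, and a random intercept for clinic models the nonindependence of patients within the same clinic. The data and effect sizes are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n_clinics, patients_per_clinic = 20, 30

rows = []
for clinic in range(n_clinics):
    treat = clinic % 2                   # clinics, not patients, are randomized
    clinic_effect = rng.normal(0, 0.5)   # shared clinic-level variation
    for _ in range(patients_per_clinic):
        y = 0.3 * treat + clinic_effect + rng.normal()
        rows.append({"clinic": clinic, "treat": treat, "y": y})
df = pd.DataFrame(rows)

# Random intercept for clinic accounts for the clustering of patients within clinics.
model = smf.mixedlm("y ~ treat", data=df, groups=df["clinic"]).fit()
print(model.summary())
```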
6.c.
Standard: In testing main effects, the analysis must include all cases assigned to treatment and control conditions (except for attrition—see above).
6.d.
Standard: Pretest differences must be measured and statistically adjusted, if necessary.
That is, when differences between groups on pretest measures of outcomes or covariates related to outcomes are observed, models testing intervention effects should incorporate these pretest values in a manner that adjusts for the preexisting differences.
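In practice this is often an ANCOVA-style model in which the pretest enters as a covariate, as in the hypothetical sketch below; the treatment coefficient then estimates the intervention effect adjusted for preexisting differences.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(13)
n = 300
pretest = rng.normal(size=n)
treat = rng.binomial(1, 0.5, n)
post = 0.8 * pretest + 0.25 * treat + rng.normal(size=n)
df = pd.DataFrame({"pretest": pretest, "treat": treat, "post": post})

# ANCOVA-style adjustment: the `treat` coefficient is the effect adjusted for the pretest.
adjusted = smf.ols("post ~ treat + pretest", data=df).fit()
print(adjusted.params["treat"], adjusted.conf_int().loc["treat"].values)
```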
6.e.
Standard: When multiple outcomes are analyzed, the researcher must provide a clear rationale for the treatment of multiple outcomes, paying close attention to the possibility that conclusions may reflect chance findings.
There is no consensus on the best way to handle this issue in prevention research. However, an expert panel recently convened by the U.S. Department of Education Institute of Education Sciences explored ways of appropriately handling multiple comparisons (Schochet 2007). This panel recommended that outcomes be prioritized to reflect the design of the intervention and that confirmatory analyses be conducted to test global hypotheses within the main domains identified as central to the study’s hypotheses. For example, a program might include a tutoring component aimed at improving academic performance and a social skills curriculum aimed at improving social competency skills. Schochet (2007) recommends that multiple measures of academic performance (e.g., teacher reports of academic competence, grade point average, and standardized reading and math scores) be combined into one scale to test the hypothesis that the program influences academic performance and that multiple measures of social competency (e.g., goal setting, decision-making, and impulse control) be combined into a second scale to test the hypothesis that it influences social competency skills. The report recommends against testing each of the multiple measures as a separate outcome.
Our standard does not require that researchers follow this advice, but rather that they attend carefully to potential misinterpretations due to the analysis of multiple correlated outcomes and provide a clear rationale for the treatment of multiple outcomes.
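As a hypothetical illustration of the composite approach, the sketch below standardizes several correlated measures within one domain and averages them into a single confirmatory outcome; if separate tests are reported instead, a correction such as Bonferroni (dividing alpha by the number of tests) is one simple way to acknowledge the multiplicity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)
n = 250

# Hypothetical correlated measures within one domain (academic performance).
gpa = rng.normal(3.0, 0.5, n)
reading = 20 * (gpa - 3.0) + rng.normal(50, 10, n)
teacher_rating = 1.0 * (gpa - 3.0) + rng.normal(0, 1, n)
academic = pd.DataFrame({"gpa": gpa, "reading": reading, "teacher": teacher_rating})

# One confirmatory composite per domain: standardize each measure and average them,
# rather than testing every measure separately and inflating the type I error rate.
z = (academic - academic.mean()) / academic.std(ddof=1)
academic_composite = z.mean(axis=1)
print(academic_composite.describe())

# If separate tests are reported anyway, a Bonferroni correction uses alpha / 3 here.
```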
Efficacy Claims—Which Outcomes?
7.a.
Standard: Results must be reported for every targeted outcome that has been measured in the efficacy study, regardless of whether they are positive, nonsignificant, or negative [Reporting Standard].
7.b.
Standard: Efficacy can be claimed only for constructs with a consistent pattern of nonchance findings in the desired direction. When efficacy claims are based on findings from more than one study, efficacy can be claimed only for constructs for which the average effect across studies is positive.
Note first that this standard pertains to constructs rather than to measures of constructs. For studies reporting findings for multiple measures of the same construct, an omnibus test that corrects for alpha inflation must confirm a nonchance effect in the desired direction (see Standard 6.e.).
This standard can be met either within one study or through replication. Replication has two main purposes in Prevention Science: To rule out chance findings and to demonstrate that results obtained in one study are robust to variations in time, place, and certain implementation factors. The latter are generalizability issues that are most appropriately addressed in effectiveness trials. Before an intervention can be judged to be a suitable candidate for effectiveness trials, though, the possibility that positive results were due to chance must be minimized.
Flay et al. (2005) called for at least two different studies of an intervention, each meeting all of the other efficacy standards, before an intervention could be labeled as “efficacious.” This standard is consistent with recent calls in Psychology for more direct replication studies to rule out chance findings. Pashler and Harris (2012) note that replication studies that test the same experimental procedure are extremely rare in psychological research, but they are essential to the conduct of science. They calculate that more than a third of published positive results are likely to be erroneous, even when researchers set low alpha levels (e.g., .05). Further, “conceptual” replications, in which researchers vary aspects of the intervention or the research operations, do not help to rule out chance findings because failures to replicate in such studies are too easily attributed to the variations tested rather than to the possibility that the earlier results were due to chance.
Flay et al. (2005, p. 162) recognized that exact replication in which the same intervention is tested on “a new sample from the same population, delivered in the same way to the same kinds of people, with the same training, as in the original study” is rare, and suggested that “flexibility may be required in the application of this standard … until enough time passes to allow the prevention research enterprise to meet this high standard.”
Time has passed. The prevention research enterprise appears no closer to reaching this high standard for replication to rule out chance findings, and funding agencies are no more likely today to fund replications simply to verify the results of an earlier study than they were 10 years ago. When replication studies are conducted, they are much more likely to be for the purpose of testing variations in the intervention or of generalizing results to different settings or populations than for ruling out chance findings. Although the accumulation of positive results from this type of replication study does eventually rule out chance findings, we regard these studies as generalizability studies most appropriate for interventions that have met all of the efficacy standards.
How should chance be ruled out at the efficacy stage? Chance can be ruled out in a single study if the magnitude of the intervention effect observed in a well-designed and well-conducted trial is so large that it is extremely unlikely to have arisen by chance given a true null hypothesis. That is, highly improbable significance levels lend confidence to the conclusion that the results are unlikely to be due to chance. For example, using Pashler and Harris’ (2012) reasoning, significant findings at the .005 level would translate into an actual error rate of approximately 5%. Differences of this magnitude from a single trial should suffice to rule out chance.
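The arithmetic behind this reasoning can be made explicit. Among significant findings, the share that are false positives depends on the alpha level, statistical power, and the prior probability that tested interventions are truly effective. The sketch below uses illustrative assumptions (80% power, 10% of tested hypotheses true) that are broadly consistent with the figures cited above; the specific values are assumptions for illustration, not results reported in the text.

```python
def false_positive_share(alpha: float, power: float = 0.8, prior: float = 0.10) -> float:
    """Share of significant results that are false positives, given the alpha level,
    statistical power, and the prior probability that a tested intervention truly works."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return false_positives / (false_positives + true_positives)

for alpha in (0.05, 0.005):
    share = false_positive_share(alpha)
    print(f"alpha = {alpha}: about {share:.0%} of significant findings would be false positives")
```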
When intervention effects from a single efficacy trial are not large enough to confidently rule out chance, one or more additional trials are needed. Data from these additional trials, when combined together with the first trial, must achieve a sample size large enough to test whether findings for the combined dataset exceed chance levels. Also, in order to rule out chance at the efficacy level, it is important that all experimental units be exposed to the same intervention rather than to different variants of the intervention, as is often the case in subsequent trials of an intervention. As noted by Flay et al. (2005), efficacy trial replications should be “exact” replications (Hunter 2001) in which the same intervention is tested on a new sample of the same population, delivered in the same way by the same kinds of people, with the same training, as in the original study, or “scientific” replications (Hunter 2001) in which all aspects are exactly replicated except that the study samples come from similar populations rather than the exact same population (such as is likely in a multisite evaluation of an intervention). Judgments about the similarity of the population should be made on the basis of the program developer’s statement of the range of application of the intervention (see Standard 2.c.).
7.c.
Standard: For an efficacy claim, there must be no serious negative (iatrogenic) effects on important outcomes.
Reporting
8.
Standard: Research reports should include the elements identified in the 2010 CONSORT guideline or a relevant extension of these guidelines.
Several of the standards articulated above are standards for reporting about prevention research. For example, Standard 2.a. requires that the intervention “be described at a level that would allow others to implement/replicate it,” and Efficacy Standard 7.a. states that “results must be reported for every targeted outcome that has been measured in the efficacy study.” Most of the standards pertain to research methods that should be fully described in reports of the research. Unfortunately, reporting of interventions tested and the methods used to evaluate them is often suboptimal, even in our best journals (Grant et al. 2013), and this often makes it difficult to judge the quality of the evidence from potentially important prevention trials. Research reports are often brief, omitting or inadequately reporting important information. Incomplete and inaccurate reporting results in underuse of the research.
Incomplete reporting is a problem in other disciplines as well. This has led to the development of numerous guidelines for reporting of research across different fields, the most well known of which is the Consolidated Standards of Reporting Trials (CONSORT) guideline, which has been recently updated (Schulz et al. 2010). CONSORT is intended to facilitate the writing of transparent reports by authors and appraisal of reports by research consumers. It consists of a checklist of 25 items related to the reporting of methods, including the design, who the participants were, how the sample was identified, baseline characteristics of the participants on key variables, how the sample size was determined, statistical methods used, participant flow through the study (including attrition analysis), the delivery, uptake, and context of interventions, as well as subsequent results.
The CONSORT guideline was developed by biomedical researchers to guide reporting of health-related clinical trials. It is therefore not broad enough to cover all reporting issues relevant for reporting of Prevention Science research. An extension of the CONSORT guideline has been proposed to guide transparent reporting of implementation, including how intervention implementation is adapted in the trial (Glasgow and Steiner 2012). Other extensions of the CONSORT guideline have been tailored to certain types of research common in Prevention Science (e.g., cluster randomized trials, Campbell et al. 2012). Nevertheless, the available guidelines are insufficient to cover many types of research in Prevention Science.
A new CONSORT extension for randomized controlled trials in social and psychological research is under development (Gardner et al. 2013) and is likely to address many of the special reporting issues in Prevention Science research (Mayo-Wilson et al. 2013). Indeed, this effort is addressing several aspects of intervention research discussed in the earlier SPR standards of evidence (Flay et al. 2005), such as active ingredients or mechanisms of interventions, interventions that operate and outcomes that are analyzed at multiple levels (e.g., individual, family, school, community), intervention implementation, the role of context (e.g., effectiveness versus efficacy; site differences in multisite randomized trials), subgroup analysis, and intervention adaptation. This effort is well underway, with systematic reviews of guidelines and reporting practices (Grant et al. 2013), a modified Delphi process, and a formal expert consensus meeting completed (Montgomery et al. 2013a; see project website: www.tinyurl.com/consort-study), and is likely to produce highly relevant reporting guidelines for Prevention Science. CONSORT guidelines cover only randomized trials. For nonrandomized designs, an appropriate reporting guideline should be used, such as the TREND statement (Des Jarlais et al. 2004) for behavioral and public health interventions. We note that this guideline would benefit from updating to reflect advances in causal designs described in Efficacy Standard 5.b.
We encourage SPR to collaborate in the development of standards for reporting of randomized controlled trials in social and psychological research and to create a task force, or join in with other groups, to work on refining these standards to make them broadly applicable to a wider range of research designs. As these more specific guidelines become available, the standard should be changed to reflect their availability.