The Hedges Team of the Health Information Research Unit (HIRU) at McMaster University is conducting an expansion and update of our 1991 work on search filters or 'hedges' to help clinicians, researchers, and policymakers harness high-quality and relevant information from MEDLINE [14]. We planned this work within the larger context of the Hedges Project before data collection and analyses began.
Journal selection
The editorial group at HIRU prepares four evidence-based medical journals: ACP Journal Club, Evidence-based Medicine, Evidence-based Nursing, and, up to 2003, Evidence-based Mental Health. These journals help keep healthcare providers up to date. To produce these secondary journals, the editorial staff has identified 170 journals that regularly publish clinically relevant research in the areas of focus of the evidence-based journals (i.e., general internal medicine, family practice, nursing, and mental health). Journals were evaluated for inclusion in this set if they had the highest Science Citation Index Impact Factors in their field or were recommended by clinicians and librarians who collaborate with HIRU, based on their perceived yield of important papers. The editorial staff then monitors each journal's yield of original studies and reviews meeting criteria of scientific merit and clinical relevance (criteria below) to determine whether it should be kept on the list or replaced with a higher-yielding nominated journal.
Study identification and classification
On an ongoing basis, six research associates review each of these journals and apply methodological criteria to each item to determine whether the article is eligible for inclusion in the evidence-based publications. For the purpose of the Hedges Project (i.e., to develop search strategies for large bibliographic databases such as MEDLINE), we expanded the data collection effort and began intensive training and calibration of the research staff in 1999. In this manuscript, we report the κ statistic measuring chance-adjusted agreement among the six research associates for each classification procedure.
We reported the training and calibration process in detail elsewhere [15]. Briefly, prior to the first inter-rater reliability test, the research staff met to develop the data collection form and a document outlining the coding instructions and category definitions, using examples from the 1999 literature. Meetings of the research staff revealed differences in interpretation of the definitions (early κ values were as low as 0.54). Intensive discussion periods and practice sessions using actual articles were used to hone the definitions and remove ambiguities (goal κ > 0.8). The six research associates received the same articles packaged with the data collection form and the instructions document (available from the authors on request), and each independently and blindly reviewed each article and recorded their classifications on the data collection forms. We conducted three reliability tests during 1999. The fourth and final inter-rater reliability test was conducted approximately 14 months after the process had commenced, using a sample of 72 articles randomly selected across the 170 journal titles. In calculating the κ statistic for methodological rigor, raters had to agree on the purpose category for an item to be included in the calculation (Table 1 describes the purpose categories and the criteria for methodological rigor for each one). We analyzed data using PC-agree (software written by Richard Cook; maintained by Stephen Walter, McMaster University, Hamilton, Ontario, Canada).
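The chance-adjusted agreement reported here was computed with PC-agree; as a minimal illustrative sketch (not that software), the κ statistic for a pair of raters can be computed as follows, with hypothetical pass/fail grades standing in for the study's actual ratings:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-adjusted agreement between two raters over the same items:
    kappa = (observed agreement - expected agreement) / (1 - expected agreement)."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Proportion of items on which the raters agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical methodological-rigor grades from two raters on ten articles
a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
b = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # → 0.58
```

Extending this to six raters (as in the study) would require a multi-rater statistic such as Fleiss' κ, but the chance-correction logic is the same.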
Table 1
Purpose categories: definitions and criteria of methodological rigor
Purpose category | Definition | Criteria of methodological rigor |
Etiology (causation and safety) | Content pertains directly to determining if there is an association between an exposure and a disease or condition. The question is "What causes people to get a disease or condition?" | Observations concerned with the relationship between exposures and putative clinical outcomes; data collection is prospective; clearly identified comparison group(s); blinding of observers of outcome to exposure. |
Prognosis | Content pertains directly to the prediction of the clinical course or the natural history of a disease or condition with the disease or condition existing at the beginning of the study. | Inception cohort of individuals all initially free of the outcome of interest; follow-up of at least 80% of patients until occurrence of a major study end point or to the end of the study; analysis consistent with study design. |
Diagnosis | Content pertains directly to using a tool to arrive at a diagnosis of a disease or condition. | Inclusion of a spectrum of participants; objective diagnostic reference standard OR current clinical standard for diagnosis; participants received both the new test and some form of the diagnostic standard; interpretation of the diagnostic standard without knowledge of test result and vice versa; analysis consistent with study design. |
Treatment | Content pertains directly to an intervention for therapy (including adverse effects studies), prevention, rehabilitation, quality improvement, or continuing medical education. | Random allocation of participants to comparison groups; outcome assessment of at least 80% of those entering the investigation accounted for in 1 major analysis at any given follow-up assessment; analysis consistent with study design. |
Economics | Content pertains directly to the economics of a healthcare issue with the economic question addressed being based on the comparison of alternatives. | Question is a comparison of the alternatives; alternative services or activities compared on outcomes produced (effectiveness) and resources consumed (costs); evidence of effectiveness must come from a study of real patients that meets the above-noted criteria for diagnosis, treatment, quality improvement, or a systematic review article; effectiveness and cost estimates based on individual patient data (micro-economics); results presented in terms of the incremental or additional costs and outcomes of one intervention over another; sensitivity analysis if there is uncertainty. |
Clinical prediction guide | Content pertains directly to the prediction of some aspect of a disease or condition. | Guide is generated in one or more sets of real patients (training set); guide is validated in another set of real patients (test set). |
For the purposes of the Hedges Project, we defined a review as any full-text article that was bannered as a review, overview, or meta-analysis in the title or in a section heading, or that indicated in the text that the authors' intention was to review or summarize the literature on a particular topic [15]. To be considered a systematic review, the authors had to clearly state the clinical topic of the review, how the evidence was retrieved and from what sources (i.e., name the databases), and provide explicit inclusion and exclusion criteria. The absence of any one of these three characteristics classified a review as a narrative review. The inter-rater agreement for this classification was almost perfect (κ = 0.92, 95% confidence interval 0.89 – 0.95).
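The systematic-versus-narrative decision rule described above can be sketched as a small function (the function name and boolean inputs are illustrative, not part of the project's tooling):

```python
def classify_review(states_topic, describes_search_sources, has_explicit_criteria):
    """Classify a review: all three characteristics must be present
    for it to count as a systematic review; otherwise it is narrative."""
    if states_topic and describes_search_sources and has_explicit_criteria:
        return "systematic review"
    return "narrative review"

# A review that names its topic and databases but lacks explicit
# inclusion/exclusion criteria is classified as narrative
print(classify_review(True, True, False))  # → narrative review
```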
Next, we classified all reviews according to whether they were concerned with the understanding of healthcare in humans. Examples of studies that would not have a direct effect on patients or participants (and are thus excluded from analysis) include studies describing the normal development of people, basic science studies, gender and equality studies in the health professions, and studies of research methodology issues. The inter-rater agreement for this classification was almost perfect (κ = 0.87, 95% confidence interval 0.89 – 0.96).
A third level of classification placed reviews in purpose categories (i.e., what question(s) the investigators were addressing) that we defined for the Hedges Project: etiology (causation and safety), prognosis, diagnosis, treatment, economics, clinical prediction guides, and qualitative (Table 1) [15]. The inter-rater agreement for this classification was 81% beyond chance (κ = 0.81, 95% confidence interval 0.79 – 0.84).
A fourth level of classification graded reviews for methodological rigor, placing them in pass and fail categories. To pass, a review had to include a statement of the clinical topic (i.e., a focused review question); explicit statements of the inclusion and exclusion criteria; a description of the search strategy and study sources (i.e., a list of the databases); and at least one included study that satisfied the methodological rigor criteria for the purpose category (Table 1). For example, reviews of treatment interventions had to include at least one study with random allocation of participants to comparison groups and assessment of at least one clinical outcome. All narrative reviews were placed in the fail category. We refer to systematic reviews that passed this evaluation as rigorous systematic reviews. Again, the inter-rater agreement for this classification was almost perfect (κ = 0.89, 95% confidence interval 0.78 – 0.99).
For this report, we retrieved data on review articles, including the complete bibliographic citation (including journal title), the pass/fail methodological grade, and the review type (narrative or systematic).
Data analysis
Data were arrayed in frequency tables. We conducted nonparametric univariate analysis (Kruskal-Wallis test) to assess the relationship between the number of citations and the type of review. We assessed the correlation between journal impact factor and citation counts. Then, using multiple linear regression, we determined the ability of the independent variables – methodological quality of the reviews and journal source – to predict the dependent variable, the number of citations (after log transformation). Because journal source entered the model as a covariate, this analysis adjusted not only for impact factor but also for other journal-specific factors not captured by that measure.
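The two analyses above can be sketched as follows; this is a minimal illustration using simulated citation counts (all data, group sizes, and variable names are hypothetical, not the study's actual data or software):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical citation counts for three review types
narrative = rng.poisson(8, 30)
systematic = rng.poisson(15, 30)
rigorous = rng.poisson(25, 30)

# Kruskal-Wallis test: do citation counts differ by review type?
h, p = stats.kruskal(narrative, systematic, rigorous)

# Multiple linear regression on log-transformed citation counts,
# with methodological quality (fail=0/pass=1) and journal dummies
# as predictors (journal dummies stand in for journal-specific effects)
citations = np.concatenate([narrative, systematic, rigorous]) + 1  # avoid log(0)
quality = np.concatenate([np.zeros(30), np.ones(60)])
journal = rng.integers(0, 3, 90)  # three hypothetical source journals
X = np.column_stack([
    np.ones(90),                   # intercept
    quality,
    (journal == 1).astype(float),  # journal dummies; journal 0 is the reference
    (journal == 2).astype(float),
])
y = np.log(citations)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"Kruskal-Wallis H = {h:.1f}, p = {p:.3g}")
print(f"quality coefficient (log-citation scale) = {coef[1]:.2f}")
```

Fitting the journal dummies alongside quality is what stratifying by journal accomplishes: each journal contributes its own baseline, so the quality coefficient is estimated within journals rather than confounded by them.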