Focus of the COSMIN checklist
The COSMIN checklist is focused on evaluating the methodological quality of studies on measurement properties of HR-PROs. We chose to focus on HR-PROs because of the complexity of these instruments: they measure constructs that are both multidimensional and not directly measurable.
In addition, we focused on evaluative applications of HR-PRO instruments, i.e. longitudinal applications assessing treatment effects or changes in health over time. Specifying the evaluative application is necessary because the requirements for measurement properties vary with the application of the instrument [8]. For example, instruments used for evaluation need to be responsive, while instruments used for discrimination do not.
The COSMIN Steering Committee (Appendix 1) searched the literature to determine how measurement properties are generally evaluated. Two searches were performed. (1) A systematic literature search was performed to identify all existing systematic reviews on measurement properties of health status measurement instruments [9]. From these reviews, information was extracted on which measurement properties were evaluated, and on the standards that were used to evaluate the measurement properties of the included studies. For each measurement property, we found several different standards, some of which were contradictory [9]. (2) The Steering Committee also performed a second systematic literature search (available on request from the authors) to identify methodological articles and textbooks containing standards for the evaluation of measurement properties of health status measurement instruments. Articles were selected if their purpose was to present a checklist or standards for measurement properties. Standards identified in both searches were used as input in the Delphi rounds.
International Delphi study
Subsequently, a Delphi study was performed, which consisted of four written rounds. The first questionnaire was sent in March 2006, the last questionnaire in November 2007. We decided to invite at least 80 international experts to participate in our Delphi panel in order to ensure 30 responders in the last round. Based on previous experiences with Delphi studies [10, 11], we expected that 70% of the people invited would agree to participate, and that 65% of these people would complete the first list. Once started, we expected that 75% would stay involved. We included experts in the fields of psychology, epidemiology, statistics, and clinical medicine. Among those invited were authors of reviews, methodological articles, or textbooks. Experts had to have at least five publications on the (methods of) measurement of health status in PubMed. We invited people from different parts of the world.
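The expected rates above can be chained into a rough estimate of the final panel size. The sketch below assumes the simplest reading, in which the three rates multiply sequentially; the function names and this multiplicative model are illustrative assumptions, not the authors' stated calculation, and the rates may have been applied differently in practice.

```python
import math

# Expected rates as stated in the text; chaining them multiplicatively
# is an assumption made for illustration only.
AGREE_RATE = 0.70       # invited experts who agree to participate
COMPLETE_RATE = 0.65    # of those, experts who complete the first list
RETENTION_RATE = 0.75   # of those, experts who stay involved


def expected_final_responders(invited: int) -> float:
    """Expected number of responders in the last round (hypothetical model)."""
    return invited * AGREE_RATE * COMPLETE_RATE * RETENTION_RATE


def invitations_needed(target: int) -> int:
    """Smallest number of invitations whose expectation reaches the target."""
    overall_rate = AGREE_RATE * COMPLETE_RATE * RETENTION_RATE
    return math.ceil(target / overall_rate)


if __name__ == "__main__":
    print(round(expected_final_responders(80), 1))
    print(invitations_needed(30))
```

Under this reading the overall expected yield is roughly one final responder per three invitations, which is consistent with the phrasing "at least 80" being a lower bound rather than an exact number of invitations.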
In the first round, we asked questions about which measurement properties should be included in the checklist, and about their terms and definitions. For example, for the measurement property internal consistency we asked ‘which term do you consider the best for this measurement property?’, with the response options ‘internal consistency’, ‘internal consistency reliability’, ‘homogeneity’, ‘internal scale consistency’, ‘split-half reliability’, ‘internal reliability’, ‘structural reliability’, ‘item consistency’, ‘intra-item reliability’, or ‘other’ with some space to give an alternative term. Regarding the definitions, we asked ‘Which definition do you consider the best for internal consistency?’, and provided seven definitions that were found in the literature and the option ‘other’ where a panel member could provide an alternative definition.
In round two, we introduced questions about preferred standards for each measurement property. We asked questions about design issues, e.g. ‘Do you agree with the following requirements for the design of a study evaluating internal consistency of HR-PRO instruments in an evaluative application? (1) One administration should be available. (2) A check for uni-dimensionality per (sub) scale should be performed. (3) Internal consistency statistics should be calculated for each (sub) scale separately’. The panel could answer each item on a 5-point scale ranging from strongly disagree to strongly agree. Next, the panel was asked to rate which statistical methods they considered adequate for evaluating the measurement property concerned. A list of potentially relevant statistical methods for each measurement property was provided. For example, for internal consistency the following often used methods were proposed: ‘Cronbach’s alpha’, ‘Kuder-Richardson formula-20’, ‘average item-total correlation’, ‘average inter-item correlation’, ‘split-half analysis’, ‘goodness of fit (IRT) at a global level, i.e. index of (subject) separation’, ‘goodness of fit (IRT) at a local level, i.e. specific item tests’, or ‘other’. Panel members could indicate more than one method.
In the third round, we presented the most often chosen method, both the one based on CTT and the one based on IRT, and asked if the panel considered this method the most preferred method to evaluate the measurement property. For internal consistency, these were ‘Cronbach’s alpha’ and ‘goodness of fit (IRT) at a global level, i.e. index of (subject) separation’, respectively. In the third round, the panel members were also asked whether the other methods (i.e. ‘Kuder-Richardson formula-20’, ‘average item-total correlation’, ‘average inter-item correlation’, ‘split-half analysis’, ‘goodness of fit (IRT) at a local level, i.e. specific item tests’) were considered appropriate. Panel members could also have indicated ‘other methods’ in round 2. For internal consistency, the methods indicated were ‘eigen-values or percentage of variance explained of factor analysis’, ‘Mokken Rho’ or ‘Loevinger H’. In round 3, the panel was also asked whether they considered these methods appropriate for assessing internal consistency.
In the final Delphi round, all measurement properties and standards that the panel agreed upon were integrated by the Steering Committee into a preliminary version of the checklist for evaluating the methodological quality of studies on measurement properties.
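For readers unfamiliar with Cronbach’s alpha, the CTT-based method the panel chose most often for internal consistency, the following is a minimal sketch of the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The function name and the data layout (one list of item scores per respondent) are assumptions made for illustration; the sketch is not taken from the COSMIN materials.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one (sub)scale.

    scores: list of respondents, each a list of k item scores
    (hypothetical layout chosen for this example).
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")

    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Made-up responses: 4 respondents, 3 items.
    data = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4]]
    print(round(cronbach_alpha(data), 3))
```

In line with the design requirement quoted above, the statistic would be computed for each (sub)scale separately, after a check for unidimensionality.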
In each Delphi round, the results of the previous round were presented in a feedback report. Panel members were asked to rate their (dis)agreement with the proposals on a 5-point scale (strongly disagree, disagree, no opinion, agree, strongly agree). The panel members were encouraged to give arguments for their choices to convince other panel members, to suggest alternatives, or to add new issues. Consensus on an issue was considered to be reached when at least 67% of the panel members indicated ‘agree’ or ‘strongly agree’ on the 5-point scale. If less than 67% agreement was reached on a question, we asked it again in the next round, providing the arguments for and against given by the panel members, or we proposed an alternative. When no consensus was reached, the Steering Committee took the final decision.
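The consensus rule can be expressed compactly. The sketch below is illustrative only: the function name is hypothetical, and it assumes that every panel member who rated the item counts in the denominator (including ‘no opinion’), which the text does not state explicitly.

```python
# Response options of the 5-point scale described in the text.
SCALE = ("strongly disagree", "disagree", "no opinion", "agree", "strongly agree")
AGREEING = {"agree", "strongly agree"}


def consensus_reached(ratings, threshold=0.67):
    """Return True when at least `threshold` of the ratings agree.

    Assumption: all submitted ratings, including 'no opinion',
    count toward the denominator.
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    for rating in ratings:
        if rating not in SCALE:
            raise ValueError(f"unknown rating: {rating!r}")
    share = sum(rating in AGREEING for rating in ratings) / len(ratings)
    return share >= threshold


if __name__ == "__main__":
    # Made-up round: 21 of 30 panel members agree or strongly agree.
    ratings = ["agree"] * 15 + ["strongly agree"] * 6 + ["disagree"] * 9
    print(consensus_reached(ratings))
```

Note that with a threshold of 67%, exactly two-thirds agreement (e.g. 20 of 30) falls just short, so such an item would be asked again in the next round under this reading.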
When necessary, we asked the panel members to indicate the preferred statistical methods separately for each measurement theory, i.e. Classical Test Theory (CTT) or Item Response Theory (IRT), or for each type of score, such as dichotomous, nominal, ordinal, or continuous scores.