Introduction
Measuring patient-reported outcomes (PROs) has become common clinical practice. This is primarily because a patient’s perspective on their health is central to a number of conditions, and because patients have become more forthcoming in describing their health status and illness experience. PRO measurements can facilitate patient involvement in decision-making about their own care, and may help healthcare professionals to identify patients’ concerns. This measurement is also essential in clinical research, as PROs are frequently used as study endpoints. As a consequence, new PRO measures are now regularly developed.
Prior to using PRO measures in clinical practice or research, instruments need to be developed and validated carefully, in order to avoid biased results that might lead to incorrect interpretations. The development process of a PRO is fairly well defined [1],[2]. The development stage for a PRO questionnaire, as proposed by Fayers and Machin [2], includes generating an initial hypothetical model, defining the target population, generating items by qualitative methods, and pre-testing and field-testing the questionnaire. The validation stage aims to assess the measurement properties of the PRO measure. This includes the assessment of validity (content validity, face validity, construct validity and criterion validity), reliability (repeatability and internal consistency) and responsiveness. This psychometric validation step is very important for a new PRO measure to be accepted and widely used [1],[2].
Sample size is recognized as a key parameter for the planning of studies in many areas of clinical research. This is exemplified by its use in a growing number of published guidelines including: CONSORT (CONsolidated Standards Of Reporting Trials) [3], STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) [4], TREND (Transparent Reporting of Evaluations with Nonrandomized Designs) [5], STARD (STAndards for the Reporting of Diagnostic accuracy studies) [6], STREGA (Strengthening the reporting of genetic association studies) [7], as well as in the recently published CONSORT PRO [8].
Nevertheless, sample size is only briefly mentioned in most guidelines published to help researchers design studies aimed at assessing PRO psychometric properties, or to evaluate the methodological quality of those studies [1],[9]-[11]. Moreover, the International Society for Quality of Life research (ISOQOL) recently defined minimum standards required for the design and selection of a PRO measure but did not mention sample size determination [12]. Inappropriate sample size can lead to erroneous findings in many aspects of PRO development and validation, in particular the identification of the correct structure of the questionnaire (e.g. number of dimensions and items in each dimension). Despite this, no consensus exists on how to define sample size with the same rigour as found in most clinical research based on clinical or biological criteria; instead, sample size is often determined arbitrarily or from a subject-to-item ratio.
Our motivation was to examine how developers of new PRO measures currently determine their sample size, and report the critical steps of psychometric validation of their newly developed instruments, in terms of design, measurement properties, and statistical methods. To our knowledge, the last review aimed at investigating the methodology used in the construction of a measurement scale was reported in 1995 [13].
The aim of the study was to perform a comprehensive literature review in order to list and describe the practices used in primary psychometric validation studies of PRO measures, with a particular focus on sample size determination.
Materials and methods
A literature review was conducted from September 2011 to September 2012, on articles published between January 2009 and September 2011, following the Centre for Review and Dissemination’s (CRD) guidelines for undertaking reviews in health care [14], and recommendations published by Mokkink [15]. It comprised three stages:
Search strategy: Identification of articles by specifying inclusion and exclusion criteria, keywords and search strings in the PubMed database.
Selection: Article pre-selection by reading titles, followed by a selection by reading abstracts.
Extraction: Extraction of data from articles into a reading grid, followed by a synthesis.
Psychometric properties definitions (Table 1)
Table 1
Psychometric properties definitions in the field of health-related assessment
Content validity | The ability of an instrument to reflect the domain of interest and the conceptual definition of a construct. In order to claim content validity, there is no formal statistical testing, but item generation process should include a review of published data and literature, interviews from targeted patients and an expert panel to approach item relevance [2]. |
Face validity | The ability of an instrument to be understandable and relevant for the targeted population. It concerns the critical review of an instrument after it has been constructed and generally includes a pilot testing [2]. |
Construct validity | The ability of an instrument to measure the construct that it was designed to measure. A hypothetical model has to be formed, the constructs to be assessed have to be described and their relationships have to be postulated. If the results confirm prior expectations about the constructs, the instrument may be valid [2]. |
Convergent validity | Involves that items of a subscale correlate higher than a threshold with each other, or with the total sum-score of their own subscale [2]. |
Divergent validity | Involves that items within any one subscale should not correlate too highly with external items or with the total sum-score of another subscale [2]. |
Known group validity | The ability of an instrument to be sensitive to differences between groups of patients that may be anticipated to score differently in the predicted direction [2]. |
Criterion validity | The assessment of an instrument against the true value, or a standard accepted as the true value. It can be divided into concurrent validity and predictive validity [2]. |
Concurrent validity | The association of an instrument with accepted standards [2]. |
Predictive validity | The ability of an instrument to predict future health status or test results. Future health status is considered as a better indicator than the true value or a standard [2]. |
Reliability | Determining that a measurement yields reproducible and consistent results [2]. |
Internal consistency | The ability of an instrument to have interrelated items [2]. |
Repeatability | (Test-retest reliability) The ability of the scores of an instrument to be reproducible if it is used on the same patient while the patient’s condition has not changed (measurements repeated over time) [2]. Measurement error is the systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured [17]. |
Responsiveness | The ability of an instrument to detect change when a patient’s health status improves or deteriorates [2]. |
The definitions of the psychometric properties were agreed by consensus among the investigators prior to beginning the review. This is important because experts often employ different terminologies and definitions for the same concept [1],[2],[9],[10],[12],[16],[17]. Standard references of psychometric theory in the field of health-related assessment were used to define the psychometric properties that were collected [2],[17].
Literature review
The authors, including three statisticians (EA, JBH, VS) and a public health physician (LM), took part in the literature review, and were responsible for designing and performing the search strategy, article selection and data collection.
Stage 1: Search strategy
The primary inclusion and exclusion criteria were chosen to meet the objective of the study: to examine how many individuals are included in PRO validation studies, and how developers of PRO measures report the steps involved in psychometric validation, including sample size determination. Because the focus was on primary studies, we excluded studies that reported translation and transcultural validation, revised scale validation and scale revalidation.
Inclusion criteria were:
Measure of a patient reported outcome (PRO)
Report of a scale construction and evaluation of its psychometric properties (primary study)
Published in English or French
Published from January 2009 to September 2011
Abstracts available on PubMed
Report of psychometric properties validation
Exclusion criteria were:
Instruments with a predominantly diagnostic, screening or prognostic purpose
Systematic review articles
Comparisons of scale psychometric properties
Transcultural adaptation and translation validation studies
Studies using a scale without performing any validation
Symptom inventory
Scale revalidation on another sample or deepening of scale psychometric properties
Short or revised form of a scale
Articles exclusively related to content and face validation.
The PubMed database was searched for relevant articles, as it is the main medical database and our focus was on PRO measures. Because the use of searchable technical terms for indexing international literature in databases is not always up-to-date, we defined a search strategy composed of free text terms, synonyms and MeSH terms with high sensitivity but low specificity. The search included the terms or expressions “score”, “scale”, “index”, “indicator”, “outcome”, “composite”, “construction”, “development”, “item selection”, “validation”, and “questionnaire”, but excluded the terms or expressions “translation”, “transcultural” and “cross-cultural”. The search string is provided in Appendix 1.
Stage 2: Article selection
To pre-select articles, EA reviewed the titles of all records retrieved from the initial search. LM, JBH and VS then performed an independent review of the same articles by evenly sharing the full list of titles. Inclusion or exclusion disagreements were resolved by a third reviewer (e.g. disagreements between EA and LM, on the titles they both read, were resolved by JBH). Once articles were pre-selected by title, the same selection and disagreement resolution process was applied to the available abstracts. There were two kinds of disagreements: those related to inclusion or exclusion of articles and those related to the reason for exclusion.
The number of articles remaining after the second stage was still fairly large. In order to proceed with a manageable data extraction phase, in terms of time and available resources, whilst keeping the data representative of the literature, a sample of articles was randomly selected (Additional file 1 (AF1)), using the sample function in R 2.12.1. The data from these articles were extracted and entered into the reading grid by EA. In addition, LM, JBH and VS each reviewed 10 randomly selected articles independently from EA.
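As an illustration of this random selection step, the following is a minimal sketch in Python rather than the authors' R code. The population of 422 retained articles is the figure reported in the Discussion; the sample size of 150, the seed, and the function name `sample_articles` are assumptions made only for the example.

```python
import random


def sample_articles(article_ids, n_sample, seed):
    """Draw a simple random sample without replacement, reproducibly."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return sorted(rng.sample(article_ids, n_sample))


# 422 articles were retained after the selection stage (figure from the text).
# The sample size (150) and the seed (2012) are illustrative assumptions.
retained = list(range(1, 423))
selected = sample_articles(retained, n_sample=150, seed=2012)
print(len(selected), "articles selected for data extraction")
```

Fixing the seed is a design choice that lets another reviewer regenerate exactly the same list of articles.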
The extraction reading grid was formulated based on psychometric properties definitions from the standard references [2],[17]. The variables of the grid were discussed among the authors and yielded 60 variables in 5 domains (general information on article, study and scale, determination of sample size, items distribution and evaluation of psychometric properties) to describe the reporting of studies in terms of design, measurement properties and statistical methods (Additional file 2 (AF2)).
Statistical analysis
To evaluate whether the reviewers agreed with each other, the proportion of observed agreements, P0, and the kappa coefficient were computed. This allowed the judgement consistency related to inclusion and exclusion of articles to be measured, in both the pre-selection step, and the subsequent selection step [18].
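The two agreement measures can be sketched as follows. This is a minimal example that assumes Cohen's kappa for two raters (the text does not specify which kappa variant was used); the function name and the ten include/exclude decisions are hypothetical.

```python
from collections import Counter


def agreement_and_kappa(ratings_a, ratings_b):
    """Return (P0, kappa) for two raters who classified the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    # P0: proportion of items on which the two raters agree
    p0 = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Pe: agreement expected by chance, from each rater's marginal frequencies
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(count_a) | set(count_b)
    pe = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    if pe == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return p0, (p0 - pe) / (1 - pe)


# Hypothetical decisions of two reviewers on ten titles
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 6 + ["include"]
p0, kappa = agreement_and_kappa(reviewer_1, reviewer_2)
print(f"P0 = {p0:.2f}, kappa = {kappa:.2f}")
```

Kappa corrects the raw agreement P0 for the agreement that would be expected by chance alone, which is why it is lower than P0 here.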
Descriptive statistical analyses (mean, standard error and frequencies) for each variable of the extraction reading grid were performed.
The software R 2.12.1 was used.
Discussion
This literature review aimed to describe validation practices, with a primary focus on sample size, and focussed on 114 psychometric validation studies of new PRO measures, published between January 2009 and September 2011. The process of validation requires collecting a comprehensive body of evidence on the measurement properties of a scale, including content and face validity, construct validity, criterion validity, reliability and responsiveness. Numerous literature reviews aimed at describing psychometric properties of scales exist, but they are limited to a specific disease, with the objective of comparing and choosing the appropriate instrument [19]-[23]. To our knowledge only one review, dating from 1995, aimed to investigate the methodology used in the construction of a measurement scale and proposed recommendations [13]. Given the widespread use of PRO measures, it is of interest to obtain a clear picture of how these measures are currently validated, and especially how sample size is planned.
Results of the review revealed that the method used for the sample size determination was defined a priori in less than 10% of articles. Four per cent of articles compared the number of included patients to a subject-to-item ratio a posteriori, to justify their sample size. Thus, 86% of the validation studies did not provide any robust justification for the sample size included. This high rate is of concern, because determining a sample size is required to achieve a given precision, or to have enough power to reject a false null hypothesis while being confident in this result. It is therefore of interest to motivate researchers to control the type II error, or to think a priori about the precision they want to have, before testing the null hypothesis regarding the structure of a scale. The lack of consensus regarding how to compute the sample size was pointed out in two papers of the review [24],[25]. Indeed, the subject-to-item ratio is a frequently used method to determine the sample size required to perform an exploratory factor analysis (EFA), but recommendations vary. For several authors, this ratio is partly determined by the nature of the data, i.e. the stronger the data, the smaller the sample size can be. Strong data display uniformly high communalities without cross-loadings [26]. Recommendations range from 2 to 20 subjects per item [27],[28], with an absolute minimum of 100 to 250 subjects [29]-[31]. Comrey and Lee [32] provided the following guidance: 100 = poor, 200 = fair, 300 = good, 500 = very good, ≥1000 = excellent. In the articles reviewed in this study, the mean subject-to-item ratio was 28, with a minimum of 1 and a maximum of 527.
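The two rules of thumb discussed above can be sketched in a few lines of Python. The function names and the example study (250 subjects, 20 items) are hypothetical, and treating each Comrey and Lee benchmark as the lower bound of a band is an assumption of this sketch, since the guidance itself only labels the five values.

```python
def subject_to_item_ratio(n_subjects, n_items):
    """Number of included subjects per questionnaire item."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return n_subjects / n_items


def comrey_lee_label(n_subjects):
    """Label a sample size using the Comrey and Lee benchmarks.

    Assumption: each benchmark is read as the lower bound of its band.
    """
    benchmarks = [
        (1000, "excellent"),
        (500, "very good"),
        (300, "good"),
        (200, "fair"),
        (100, "poor"),
    ]
    for minimum, label in benchmarks:
        if n_subjects >= minimum:
            return label
    return "below the lowest benchmark"


# Hypothetical validation study: 250 subjects answering a 20-item scale
ratio = subject_to_item_ratio(250, 20)
label = comrey_lee_label(250)
print(f"{ratio:.1f} subjects per item, rated '{label}'")
```

The example shows how the two rules can disagree: a ratio of 12.5 subjects per item falls within the 2 to 20 range quoted above, while the same sample is only rated fair on the absolute scale.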
Recommendations in the literature for the sample size determination when conducting a confirmatory factor analysis (CFA) are also disparate (ranging from 150 to 1000 subjects), and seem to depend on the normality of data, and parameter estimation methods [33]-[36]. Some authors have suggested two different sample size planning methods for CFA. MacCallum et al. [37] suggested, in 1996, a method to determine the minimum sample size required to achieve a given level of power, for a test of fit using the RMSEA fit index. More recently, Lai and Kelley [38] developed a method to obtain sufficiently narrow confidence intervals for the model parameters of interest. These methods seem to be unused by PRO developers.
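For readers unfamiliar with the RMSEA-based approach, its logic can be summarized as follows. This is a sketch in our own notation rather than a reproduction of the original derivation: N is the sample size, d the degrees of freedom of the model, ε0 and εa the RMSEA values under the null and alternative hypotheses, α the type I error rate, and π the power.

```latex
% Noncentrality parameters implied by the two RMSEA values
\lambda_0 = (N - 1)\, d\, \varepsilon_0^{2},
\qquad
\lambda_a = (N - 1)\, d\, \varepsilon_a^{2}

% Critical value c of the test statistic under the null hypothesis
P\!\left( \chi^{2}_{d}(\lambda_0) > c \right) = \alpha

% Power of the test when the alternative RMSEA holds
% (case \varepsilon_a > \varepsilon_0)
\pi = P\!\left( \chi^{2}_{d}(\lambda_a) > c \right)
```

The minimum sample size is then the smallest N for which π reaches the desired level (for instance 0.80). Because both noncentrality parameters grow with N and d, models with more degrees of freedom reach a given power with fewer subjects.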
Moreover, whether for an EFA or a CFA, most published recommendations do not take a position on the issue of sample size [1],[9]-[12], which does not facilitate good practice. For example, the COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) group assessed whether the included sample size was “adequate” [11], but did not define its meaning or interpretation, and the Scientific Advisory Committee of the Medical Outcomes Trust noted that developers should include methods of sample size determination [9].
The current absence of clear guidance and the lack of consensus about how to compute a priori sample size are two key issues of sample size determination.
Several technical pitfalls in psychometric validation were also highlighted. The first is that descriptive information about item and score distributions was rarely given, although we consider it important. For example, the missing value rate was evaluated in only 22% of the studies, yet an item with many missing values is probably not relevant or understandable for patients.
The second one deals with content validity. Involving patients during the development phase of the instrument is encouraged, in order to ensure content validity of a PRO measure, and to represent patient values [39]. This is particularly central in the Food and Drug Administration guidance [1] and this recommendation has to be supported. However, our literature review showed that patients were less often asked for interviews or focus groups than experts, whereas they are in the best position to describe their illness experience.
Finally, CFA was seldom (16%) performed for the study of construct validity. In the framework of a CFA, hypotheses about the relationships between items and factors, and between factors, have to be postulated [33] and, once a hypothesized model is established, a CFA tests whether it provides a good fit to the observed data. This makes CFA a method that is probably better suited than EFA for validation of instruments with a predefined measurement model. The practice of defining the structure during the development phase of a PRO measure should be followed, but was mentioned in only 2 of the reviewed papers.
Our research has some limitations. The first one relates to the absence of unique reference terminologies and definitions of measurement properties. This made the standardized extraction of data challenging. Mokkink [17] confirmed this by concluding that the correct answer probably does not exist. We selected two references in the field of health-related assessment [2],[17] and tried to be as clear as possible, so that readers understood the concepts that were explored. The second limitation relates to the fairly short publication period included in our literature search. This was a deliberate decision. We anticipated that even in a short period, many publications would be included, and this was confirmed by the retention of 422 relevant articles using our selection process. This prompted us to use a reductive random selection step to make the data extraction phase manageable, whilst keeping the results representative of the targeted literature, and representative of current practices in terms of psychometric validation. Indeed, there is no reason to expect an important change in practices, as no recommendation in terms of sample size determination has been published since 2011. It should be noted that we deliberately included only publications on the primary validation of PRO measures. Indeed, other validations of PRO measures (for new linguistic versions of an existing PRO measure or a validation in another population) involve slightly different questions and would not necessarily compare with primary validation. Hence, we preferred not to include those. Another possible limitation was that only the PubMed database was used, but we were specifically interested in psychometric validation practices in the medical field. Finally, only articles published in English or French were included, as none of the authors were fluent in other languages.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
EA was involved in the conception and the design of the experiment. EA performed the experiment, analysed and interpreted the data. EA was involved in drafting the manuscript. LM was involved in the conception and the design of the experiment. LM was involved in drafting the manuscript. AR was involved in the conception and the design of the experiment. AR was involved in revising the manuscript critically. JBH was involved in the conception and the design of the experiment. JBH was involved in drafting the manuscript. VS was involved in the conception and the design of the experiment. VS was involved in drafting the manuscript. All authors read and approved the final manuscript.