Background
When information about the costs of alternative treatments is used to guide healthcare policy decisions, it is the total budget needed to treat patients with the disease that is relevant. Estimates of these total costs are based on various cost categories, such as direct healthcare costs (costs of healthcare resources used by patients) and indirect healthcare costs (costs due to lost productivity). Health impairments among workers can result in considerable costs due to work disability, sickness absence, and productivity loss at work, all imposing a substantial financial burden on the employee, the employer and society as a whole. Various studies on different diseases have shown that indirect costs, henceforth called ‘productivity costs’, contribute substantially to total costs, illustrating how important the consequences of disease are for work performance [1‐4]. Productivity costs refer to the costs associated with lost or impaired ability to work or to engage in leisure activities due to morbidity and lost economic productivity due to death [5]. A study in the Netherlands showed that the productivity costs due to low back pain can be as high as 93% of the total costs of this impairment [3]. In Germany, productivity costs due to asthma amounted to 75% of the total costs [2]. A large study in the USA among workers with common health conditions showed that productivity costs substantially exceeded the direct costs. Moreover, presenteeism costs, or costs due to reduced productivity while still at work, appeared to represent up to 60% of all costs [4].
Converting changes in health-related productivity into a financial metric makes these changes more interpretable. However, there is no agreement on how to quantify time lost due to health impairments or how to assign a monetary value to the lost productivity. A sound estimation of productivity costs requires sound measurement of the relevant components, which in turn improves the comparability and interpretability of productivity changes. At present, the comparability of estimated productivity costs is hampered by substantial differences in the costs of the items considered and in the methods used for measuring sickness absence and presenteeism, as well as by differing and insufficient methodology for valuing these measurements.
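As an illustration of what such a conversion involves, the following minimal sketch values lost time at the worker's wage rate. This valuation rule is an assumption made only for the example (the text above stresses that no agreed method exists), and the record fields, the function name `productivity_cost` and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProductivityRecord:
    """One worker's self-reported data for a recall period (hypothetical fields)."""
    absent_hours: float          # hours not worked because of health problems
    hours_worked: float          # hours actually spent at work
    performance_fraction: float  # self-rated output while at work, 0.0 to 1.0
    hourly_wage: float           # gross hourly wage in local currency


def productivity_cost(rec: ProductivityRecord) -> dict:
    """Value lost time at the wage rate (an assumption, not an agreed standard)."""
    if not 0.0 <= rec.performance_fraction <= 1.0:
        raise ValueError("performance_fraction must be between 0 and 1")
    # Absenteeism: hours away from work, valued at the hourly wage.
    absenteeism_cost = rec.absent_hours * rec.hourly_wage
    # Presenteeism: hours at work scaled by the share of output that was lost.
    presenteeism_hours = rec.hours_worked * (1.0 - rec.performance_fraction)
    presenteeism_cost = presenteeism_hours * rec.hourly_wage
    return {
        "absenteeism": absenteeism_cost,
        "presenteeism": presenteeism_cost,
        "total": absenteeism_cost + presenteeism_cost,
    }


# Made-up figures: 8 hours absent, 32 hours worked at 75% of usual output.
example = productivity_cost(
    ProductivityRecord(absent_hours=8, hours_worked=32,
                       performance_fraction=0.75, hourly_wage=20.0)
)
```

Even in this simple form, the result depends on how absence and reduced performance are measured and on which wage is applied, which is why differences in instruments and valuation methods limit comparability across studies.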
In recent decades, a large number of measurement methods and instruments have been developed to quantify health-related productivity changes. These instruments are preferably self-administered by workers with health impairments because objective measurements of productivity changes are unable to capture reduced productivity while still at work (i.e. presenteeism). Several studies have shown that presenteeism contributes substantially to the estimated total costs of health impairments among workers [6‐10]. The comparability across studies estimating productivity changes and associated costs is poor, since methods of measuring changed productivity vary considerably.
There is thus an urgent need for practical and applicable knowledge and insight into the reliability, validity and responsiveness of these instruments. Regarding the validity of the instruments, one should keep in mind that the extent to which a valid measurement of productivity loss, especially presenteeism, can be achieved is often influenced by many factors (e.g. the amount of teamwork required in the job, the work setting, the desired actual production output, etc.) [6, 11].
Although several researchers have provided comprehensive reviews of existing instruments that measure productivity changes, the methodological quality of the reviewed studies remains unclear [12, 13]. Consequently, judgements on the quality of the studies cannot be made. If the methodological quality of a study on the measurement properties of a specific instrument is appropriate, the results can be used to assess the quality of the instrument at issue, including its measurement properties. However, if the methodological quality of the study is inadequate, the results cannot be trusted and the quality of the instrument under study remains unclear, regardless of the magnitude or strength of the estimates presented [14]. Therefore, in this systematic review both the methodological quality of the study and the quality of the instrument, based on its psychometric properties, are taken into account. The main aim of this systematic review is to critically appraise and compare the measurement properties of generic, self-reported instruments measuring health-related productivity changes.
Conclusions and discussion
Twenty-five studies on measurement properties of 15 generic self-reported instruments measuring health-related productivity changes have been systematically reviewed, and their methodological quality has been evaluated using the COSMIN checklist in a best evidence synthesis. The WLQ is the most frequently evaluated instrument. For this instrument, structural validity and content validity showed a strong and a moderate positive level of evidence, respectively. For measurement error and cross-cultural validity, no information was available, and internal consistency and criterion validity resulted in conflicting evidence. Reliability, hypotheses testing and responsiveness resulted in limited negative and moderate negative evidence respectively. Due to poor methodological quality, the EWPS, WPAI, and WPSI showed unknown levels of evidence for almost half of the information on measurement properties. For eight questionnaires (AMA-guides, PRODISQ, HLQ, Q&Q, WBA-P, HWQ, WHI, VOLP) at least half of the information on measurement properties per questionnaire was lacking. Four instruments (WLQ, WHO-HPQ, SPS, and PRODISQ) showed strong or moderate positive levels of evidence for some of the measurement properties.
The main strength of most studies was that they reported detailed information regarding the population characteristics, sampling methods, the setting and the country where the studies were conducted.
There were, however, many limitations. First, the generalizability of the results of the studies on measurement properties was low, mainly because of selective samples, missing or incomplete reporting of how missing values were handled, and inadequate sample sizes.
Second, most studies recruited convenience samples, which might not cover the entire target population [44]. The study by Ozminkowski et al. was the only one that consecutively recruited workers with job-related accidents or injuries, which could be a reasonable representation of the workers with lost productivity at the workplace [21].
Third, although Zhang and colleagues examined the measurement properties in two countries (UK and Ireland), no international samples demonstrated the cross-cultural validity of their measures [22, 23]. Most studies were conducted in the United States, which makes it difficult to discern whether the instruments are appropriate for study populations outside the United States. The results of this review emphasize the need for international studies on measurement properties, as well as additional evaluation studies conducted worldwide, to examine the cross-cultural appropriateness of these measures and improve generalizability. The Work Role Functioning Questionnaire (WRFQ) measures perceived difficulties in meeting work demands among employees given their physical and emotional problems. The WRFQ addresses work outcomes in an attempt to describe how health affects work role functioning [45‐48]. Although the WRFQ is intended to identify, not value, decreased productivity, it is a useful example because several studies have translated and adapted it to Canadian French [45], Brazilian Portuguese [46], Dutch [47], and Spanish [48]. These studies demonstrate a systematic procedure for translation and cross-cultural adaptation that can serve as a model for future studies attempting to adapt and validate instruments in other cultures.
Fourth, almost half of the reviewed studies reported item and unit nonresponses under 50%, which might indicate selection bias, further hampering the generalizability of the results [44]. Inadequate descriptions of the handling of missing values might suggest non-random missing items, which could bias the results and lead to misinterpretation and misjudgement of the measurement properties of an instrument. Furthermore, if missing values are handled inappropriately, parameter estimates can be biased, and sample sizes and thus statistical power can be reduced. High percentages of missing values on specific items might even indicate that an item is not relevant for the study population or is ambiguously formulated, which hampers the validity of the instruments [17, 49]. In light of these flaws in previous studies, future studies should report response rates accurately, describe how missing items were handled, and examine and report whether nonresponse was random.
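The kind of reporting asked for above can be sketched as follows. This is a minimal illustration only: the function names, the item labels and the flagging threshold are hypothetical and are not taken from any of the reviewed studies, and missing answers are assumed to be stored as `None`.

```python
from typing import Dict, List, Optional

# One questionnaire: item label -> answer, with None marking a missing answer.
Response = Dict[str, Optional[int]]


def unit_response_rate(n_invited: int, n_returned: int) -> float:
    """Share of invited workers who returned a questionnaire."""
    if n_invited <= 0:
        raise ValueError("n_invited must be positive")
    return n_returned / n_invited


def item_missing_rates(responses: List[Response], items: List[str]) -> Dict[str, float]:
    """Share of returned questionnaires with a missing answer, per item."""
    if not responses:
        raise ValueError("no responses supplied")
    n = len(responses)
    return {
        item: sum(1 for r in responses if r.get(item) is None) / n
        for item in items
    }


def flag_items(rates: Dict[str, float], threshold: float) -> List[str]:
    """Items whose missing rate exceeds a study-specific (hypothetical) threshold."""
    return sorted(item for item, rate in rates.items() if rate > threshold)


# Made-up data: four returned questionnaires out of ten invitations.
returned = [
    {"q1": 3, "q2": None, "q3": 2},
    {"q1": 4, "q2": 1, "q3": None},
    {"q1": 2, "q2": None, "q3": 5},
    {"q1": 5, "q2": 2, "q3": 4},
]
rates = item_missing_rates(returned, ["q1", "q2", "q3"])
flagged = flag_items(rates, threshold=0.3)
response_rate = unit_response_rate(n_invited=10, n_returned=4)
```

Such a tabulation only describes the amount of missing data. Whether the missingness is random still has to be examined separately, for example by comparing respondents and nonrespondents on known characteristics.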
Fifth, based on the results of this systematic review it can be concluded that the information regarding the measurement properties of generic self-reported instruments measuring health-related productivity changes is mostly limited or of poor to fair methodological quality. The results should be treated with caution due to the missing information on the remaining measurement properties, especially measurement error and cross-cultural validity, for which (almost) no information was available.
Sixth, although it is difficult to determine criterion validity without a real gold standard for health-related productivity change instruments, most studies considered the extent to which scores on the instrument of interest adequately reflected a predetermined comparator. Criterion validity was therefore the most frequently evaluated property, but only five studies on this measurement property were of good methodological quality. As a consequence, the SPS and the WBA-P yielded moderate and limited positive levels of evidence for criterion validity, respectively.
Seventh, it is difficult to determine the responsiveness of the different health-related productivity change instruments because almost all of the retrieved studies were of poor or fair methodological quality regarding responsiveness. Because the instruments are often used as an outcome measure to determine the costs of lost productivity, specific hypotheses regarding expected correlations with other constructs must be formulated a priori when developing a new measure.
Eighth, an internal consistency statistic is interpretable only when the items together form a reflective model, so that the interrelatedness among the items is meaningful. The assessment of internal consistency therefore depends on the quality assessment of structural validity, and vice versa. If the structural validity was not assessed by analysing the unidimensionality (so that there is no evidence that the scales are unidimensional), no internal consistency statistic can be properly interpreted. Four studies were of good methodological quality on structural validity for the WLQ and SPS, and also showed good internal consistency for both instruments. Most of the instruments with unknown levels of evidence due to poor methodological quality regarding internal consistency also lacked information on structural validity.
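For readers less familiar with internal consistency statistics, the sketch below computes Cronbach's alpha, one widely used statistic of this kind. The choice of alpha is an assumption made for illustration, and the function name and the data are made up. The code does not check unidimensionality, so, in line with the point above, its output is interpretable only if structural validity has been established separately.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete cases.

    scores[i][j] is respondent i's answer to item j of a single scale.
    """
    if len(scores) < 2:
        raise ValueError("at least two respondents are required")
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every respondent must answer every item")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Made-up data: three items that rise and fall together across four respondents.
alpha = cronbach_alpha([
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
])
```

In this constructed example the items are perfectly related, so alpha reaches its maximum of 1. A high value can also arise for a set of items that is not unidimensional, which is why the statistic cannot stand in for an assessment of structural validity.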
In addition, some general issues on measuring productivity changes should be addressed. First, the concept of productivity loss due to illness is, according to economic theory, based on the concept of a production function where output is a function of capital input, labour input and technology. The focus of most productivity measurement instruments, as has been seen, is on the individual’s labour input: they measure the time a person is not at work due to health complaints (absenteeism), or is not productive while at work due to health complaints (presenteeism). However, job and workplace characteristics also play a key role and differ among countries, reflecting the socio-political context in which the study takes place. For example, in some countries that have a workers’ compensation system, such as Canada and the United States, there is a differentiation between work-related and non-work-related disability. In other countries, such as the Netherlands, no such differentiation exists. Due to these variations arising from the socio-political context, one cannot assume a ‘one size fits all’ mentality when comparing instrument effectiveness across countries or cultures. Transparency in reporting the key aspects of measurement and validation of health-related productivity would improve the comparability and usability of the results for occupational and health economic decision making.
Another point to be addressed is that although the COSMIN taxonomy might contribute to a better understanding of the terminology used in validation studies and provides a structured procedure for evaluating the methodological quality of studies on measurement properties, it leaves considerable room for interpretation of the checklist items. To minimize differences in interpretation between reviewers (CYGN, AER, SE), decisions had to be made on how to score the different items. For example, a problem encountered during the rating of ‘criterion validity’ was the absence of a gold standard for health-related productivity change instruments. One way this problem was dealt with was by assuming that the original long version of a shortened instrument under assessment was an adequate comparator, and could thus be seen as a ‘gold standard’. Furthermore, since the reviewed studies concerned the measurement properties of self-reported instruments, which yield subjective data, it was agreed that objective, registered data could serve as an adequate comparator as well. Predetermined and transparent arguments that the gold standard is ‘gold’ had to be thoroughly discussed and decided beforehand to assess criterion validity.
Finally, although agreement was reached that objective data could be seen as a ‘gold standard’ for collecting lost productivity data, it should be kept in mind that both objective and self-reported instruments have their advantages and disadvantages, which need to be weighed per research question. For example, when using objective insurance data, a particular disadvantage is that the data reflect what has been compensated. What has been compensated does not necessarily reflect the actual time a worker has been unable to work. Productivity changes related to sick leave should therefore always be supplemented by the productivity changes due to decreased work performance, i.e. presenteeism, to avoid underestimation.
Recommendations
Although only cautious advice can be provided on the most appropriate instruments to capture changes in productivity for use in occupational and economic health practice, the WLQ is cautiously recommended at the moment because it is the most frequently evaluated instrument and moderate and strong positive evidence was found for content and structural validity, respectively. However, negative evidence was found for reliability, hypotheses testing and responsiveness. The WLQ has been used only in English-speaking study populations. Using the PRODISQ is cautiously preferred when conducting a study in the Netherlands based on its strong positive evidence for content validity, although evidence for the other measurement properties is lacking. In order to improve the interpretation of the PRODISQ scores, more research regarding the measurement properties (aside from content validity) is needed. The Stanford Presenteeism Scale (SPS) can also be cautiously recommended as it was evaluated in two studies and showed strong positive results for internal consistency and structural validity, and moderate positive results for hypotheses testing and criterion validity. However, limited negative evidence was available for reliability and content validity, and information on the other measurement properties was lacking.
Better knowledge and use of key methodological principles based on quality checklists, such as COSMIN, is recommended so that future studies evaluating the measurement properties of new and existing instruments are of high quality. High-quality studies that evaluate and provide strong evidence for the unknown measurement properties, especially cross-cultural validity, are recommended to improve the generalizability and applicability of generic self-reported health-related productivity change instruments. Given the large number of available productivity instruments, the development of new instruments is not recommended; improvement of the existing instruments is recommended instead.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors (CYGN, SMAAE, FJN, AER) made substantial contributions to conception and design, and analysis and interpretation of the data. All authors have been involved in drafting the manuscript and revising it critically for important intellectual content. Three reviewers determined the methodological quality of the studies (CYGN, AER, SMAAE). Consensus on the methodological quality of the studies and the evidence synthesis was reached through discussion (CYGN, SMAAE, FJN, AER). All authors have given their final approval of the version to be published.