Introduction
Goldberg’s General Health Questionnaire (GHQ) [1] items have been used frequently by population and health service researchers for measuring levels of clinically significant but non-specific psychological distress. Tens of thousands of survey respondents and patients from a variety of populations and health care settings have completed one of the four available versions with 12, 28, 30 or (rarely) 60 items [2, 3]. Simple scoring methods and cut-off scores for “caseness” are commonly applied, and such practice has supported a large volume of studies.
A range of psychometric and technological developments have taken place in educational, social survey and clinically oriented assessment research over recent decades. Among the most important are those that allow for some aspect of personalization, especially if these can be aligned to methods that are efficient, reduce burden, and appeal to respondents. Additionally, from a “psychometric epidemiology” [2] perspective two aspirations remain: (1) to integrate what can be known about individuals or populations from items across versions and (2) to apply the item set in a manner that does not rely on the “legacy” or fixed-length versions [4]. In this paper, we address the second aspect by providing a full demonstration of the computerized adaptive testing (CAT) paradigm [5] as it might be adopted for GHQ-30 data or other item pools: to personalize assessments, make them more efficient, and tailor them in length and administration to the mode needed for a specific implementation (e.g., pencil and paper, mobile device, desktop computer).
Although CAT originated in educational settings where the target for measurement would typically be an examinee’s ability level, our exposition here is in the wider setting of population health, social science, or epidemiological and lifestyle surveys. CAT is an approach involving computer-based administration of questionnaires using principles able to adapt the content to the score level of the person. Such adaptation is based on the concept of item information introduced in item response theory (IRT) modelling. Specifically, CAT algorithms select and administer the most informative items for each respondent based on (1) known item characteristics obtained from prior calibration using IRT models and (2) what is known about an individual’s level of the measured attribute (construct) from their responses to previous questions. In CAT, the required level of measurement accuracy for the target construct is usually fixed, instead of fixing the number of items as in the traditional approach. CAT then selects optimal item sequences until this goal is met. As a result, typically fewer items are administered and each respondent encounters a unique set of items, with the potential benefit that the questions presented might seem more relevant to the respondent, since they are targeted closer to their distress level. These two features are synergistic, hence they result in improved efficiency [6].
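The selection step described here can be sketched as follows, using a two-parameter logistic (2PL) model purely for illustration. The item parameters below are invented for demonstration and are not calibrated GHQ-30 values:

```python
import numpy as np

# Hypothetical 2PL item parameters (NOT calibrated GHQ-30 values).
a = np.array([1.2, 0.8, 1.5, 1.0, 2.0])   # discrimination
b = np.array([-1.0, 0.0, 0.5, 1.2, 0.3])  # difficulty (severity)

def response_probability(theta, a, b):
    """2PL probability of endorsing each item at a given theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of each item at the current theta estimate."""
    p = response_probability(theta, a, b)
    return a**2 * p * (1.0 - p)

theta_hat = 0.0                      # current provisional estimate
administered = {2}                   # indices of items already shown
info = item_information(theta_hat, a, b)
info[list(administered)] = -np.inf   # never re-administer an item
next_item = int(np.argmax(info))     # maximum-information selection
```

After each response, theta_hat would be re-estimated and the selection repeated until the stopping rule (e.g. a target standard error) is satisfied.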
CAT principles have been successfully applied in mental health assessment [6–8] and were found to outperform traditional static tests [9]. However, the increase in efficiency may, in specific contexts, not be sufficient to justify the added technical requirements for CAT administration [9]. Fortunately, recent developments and the availability of open-source CAT algorithms [10–13] make implementation easier and less costly.
The aim of our study was to evaluate the potential of CAT for the GHQ-30 item pool and to demonstrate the generally agreed steps [14, 15] required for the transition from a fixed-length test to an adaptive version. For this purpose, we used data collected with traditional methods (i.e. paper and pencil self-completion). The structure of the study was as follows: we first followed an established approach [14] to estimate the IRT parameters, evaluate model fit, and derive the psychometric properties of the items (i.e. calibrate the item pool). Building on these results, we aimed to contribute further detail on a more complex scenario: repeated adaptive administration in longitudinal studies. For this we examined how a CAT version of the GHQ-30 could be used to measure change in psychological distress. We begin, however, with the usual case of a single GHQ-30 administration, as is applicable to a cross-sectional study.
Discussion
Traditionally, applied health and epidemiological survey research has relied on fixed-length questionnaires to measure subjective (mental) health and related constructs. Because most were developed originally as paper forms, few researchers have experimented with more flexible modes of administration. Fixed-length instruments are popular among researchers because of their familiarity, ease of administration, widespread use and simple scoring (traditionally sum scores). In addition, any comparison of results with studies using the same set of items is straightforward. Thus, there has been little appetite for potentially more optimal administration designs where technology is needed. Traditional questionnaire surveys are often lengthy in terms of number of items and time consuming to complete, and they may therefore place a considerable burden on patients, some of which might be avoided.
This study provided a GHQ-30 calibration (model fit assessment, DIF analysis, and estimation of item parameters) and considerations regarding adaptive administration of the GHQ-30 over time in longitudinal studies. The simulation showed that adaptive administration of the GHQ-30 becomes useful when the required reliability is approximately 0.84 or lower. In that case, a CAT administration would deploy, on average, only half (or fewer) of the 30 items. Our simulation showed, however, that the utility of CAT also depends on the respondent’s distress level: for individuals with little distress, all or nearly all items are deployed.
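The correspondence between a target reliability and a CAT stopping rule can be made explicit: when θ is scaled to unit population variance, reliability is approximately 1 − SE(θ)², so a required reliability of 0.84 translates into a standard-error threshold of about 0.4. A minimal illustration:

```python
import math

# Link between target reliability and the CAT stopping rule,
# assuming theta is scaled to population variance 1:
#   reliability ~ 1 - SE(theta)^2
def se_threshold(target_reliability):
    """Standard-error cutoff at which the CAT may stop."""
    return math.sqrt(1.0 - target_reliability)

print(se_threshold(0.84))  # ~0.4
print(se_threshold(0.90))  # ~0.316 (stricter, so more items needed)
```

The higher the required reliability, the smaller the tolerated standard error and the more items the algorithm must administer before stopping.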
Various θ estimators and item selection methods have recently become available in CAT. We selected frequently used options, and in terms of efficiency the results suggested similar performance for most of them. However, an informative (standard normal) prior helped to further reduce the number of items, especially for lower reliabilities. Researchers should nevertheless be cautious when specifying informative priors, as priors that do not correspond with the population distribution may have adverse effects on the number of administered items [39].
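As an illustration of how such a prior enters the estimation, the following sketch computes an EAP (expected a posteriori) θ estimate with a standard normal prior on a quadrature grid. The 2PL item parameters and the response pattern are invented for demonstration only:

```python
import numpy as np

# EAP estimation of theta with a standard normal prior.
grid = np.linspace(-4, 4, 81)            # quadrature points
prior = np.exp(-0.5 * grid**2)           # N(0, 1) prior, unnormalised

# Hypothetical 2PL parameters and an observed response pattern.
a = np.array([1.2, 0.8, 1.5])
b = np.array([-0.5, 0.0, 0.7])
responses = np.array([1, 0, 1])

# Likelihood of the response pattern at every grid point.
p = 1.0 / (1.0 + np.exp(-a[:, None] * (grid[None, :] - b[:, None])))
likelihood = np.prod(np.where(responses[:, None] == 1, p, 1 - p), axis=0)

# Posterior mean = EAP estimate; the prior pulls it towards 0.
posterior = prior * likelihood
theta_eap = float(np.sum(grid * posterior) / np.sum(posterior))
```

With few responses the prior dominates and shrinks the estimate towards the population mean, which is why a prior that mismatches the population can distort early item selection and thus test length.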
The GHQ-30 was developed as a screening measure to be used by epidemiology, health science and mental health researchers. “Screening” describes two different strategies with different consequences for the usefulness of CAT administrations. In the first strategy, a short test is applied to a large population to identify (groups of) at-risk respondents who might be subject to further (typically longer and/or more expensive) diagnostic tests. For this strategy, screening tests do not necessarily need to be highly precise. Instead, they need to be valid, i.e. show high correlations with the disorder in question, as gauged for example by sensitivity, specificity or predictive values. The reliability of 0.84 mentioned in the previous paragraph can typically be considered sufficient for such purposes, and an adaptive version of the GHQ-30 may be an improvement over traditional modes of administration. The second strategy uses the test itself to identify whether an individual respondent may have an unrecognized disorder. For such applications, a highly reliable test is needed to allow for clear decisions about whether an individual is above or below a relevant severity threshold. For this, the confidence interval around the individual severity level or the relevant threshold needs to be small: this decreases the number of cases for which the severity threshold is included in the confidence interval around the individual’s severity level (or the severity level lies within the interval around the threshold, respectively) [3, 38]. A reliability of 0.84 seems rather low for such decisions. These considerations highlight the important role that both measurement accuracy and validity play in such assessments. Both strategies rest on the assumption that the test is valid in general (appropriate sensitivity, specificity, predictive values).
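The confidence-interval logic behind the second strategy can be sketched as follows; the threshold and estimates used here are illustrative, not GHQ-30 cut-offs:

```python
# Classify a respondent only when the confidence interval around
# the estimated severity excludes the caseness threshold.
def screening_decision(theta_hat, se, threshold, z=1.96):
    """Return a decision, or 'undecided' if the CI covers the threshold."""
    lower, upper = theta_hat - z * se, theta_hat + z * se
    if lower > threshold:
        return "above threshold"
    if upper < threshold:
        return "below threshold"
    return "undecided"

# With SE = 0.4 (reliability ~0.84) the 95% interval spans +/-0.78,
# so estimates anywhere near the threshold remain undecided.
print(screening_decision(0.5, 0.4, 0.8))   # undecided
print(screening_decision(-0.5, 0.4, 0.8))  # below threshold
```

This makes concrete why a reliability of 0.84 suffices for population triage but is rather low for individual diagnostic decisions: the wider the interval, the larger the band of severity levels for which no clear decision can be made.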
For both strategies, reliable data on the costs associated with the different screening decisions can help to optimise the process. Only the first strategy, however, would allow combining the CAT algorithm with further selection rules, such as choosing the most predictive items: trading off reliability in favour of validity may be an option there, while it would not be for the second strategy.
Our study suggests that the utility of adaptive administration of GHQ-30 items is problematic for the measurement of individual change in longitudinal studies, as high reliability is required and all or nearly all items need to be deployed. However, for an assessment of group-level changes in distress, the (random) errors in individual distress change scores cancel out, and thus CAT administration may still be a viable option. In addition, the correlations in Table 3 suggest highly similar changes in distress levels (apart from possible linear drifts), captured by either the complete set of GHQ-30 items or the CAT administration, even for low reliability cutoffs (for which considerably fewer items are administered).
An additional potential benefit of CAT administration in longitudinal studies is that respondents measured over time are likely to be exposed to different items (from the same instrument/item pool) each time they are assessed, whilst keeping the metric of person estimates comparable. This is potentially useful design science for app-based or web-based data collections [40]. With the recent introduction of mobile devices that increase the frequency of assessment, and perhaps add a new dimension of user-friendliness to questionnaire item delivery, a principled approach to the use of a large item bank could avoid item fatigue or compromise due to over-exposure and thus might help with respondents’ engagement. Obviously, such a benefit is suppressed if all or nearly all items are administered. As noted above, CAT algorithms would administer nearly all GHQ-30 items at each occasion to capture individual changes in distress reliably. In summary, CAT administration of the GHQ-30 in longitudinal studies which aim to evaluate individual changes would do no harm but may lack utility.
One limitation of this study is that the \(\theta_{\text{baseline}}\) and \(\theta_{\text{followup}}\) estimates based on the complete set of GHQ-30 items are point estimates and thus not the true values of \(\theta\). Therefore \(\theta_{\text{change}}\) is not true change and is associated with a standard error of measurement. This may limit the size and interpretation of the correlations in Table 3. However, the uncertainty associated with the point estimates of \(\theta\) is symmetric and therefore tends to cancel out in large samples such as the one used here.
An additional limitation of our simulation is that we have not considered additional CAT parameters such as item exposure control (i.e. whether the researcher wants to restrict or balance the administration profile for the item set or subsets) or alternative termination criteria (when the CAT stops administering items, e.g. the precision of the latent \(\theta\)). In principle, we had no a priori reason with this GHQ item set to control the frequency of any item selection. However, it is worth acknowledging that one concern in CAT is that standard specifications tend to result in the most informative items being selected too often and the least informative most rarely, and therefore item exposure control might need to be thought through further in practical applications of adaptive GHQ-30 administrations [4].
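One simple exposure-control scheme of the kind alluded to here is "randomesque" selection, which samples from the k most informative remaining items rather than always taking the single best. The information values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_with_exposure_control(info, administered, k=3):
    """Pick uniformly at random among the k most informative
    not-yet-administered items, to spread item exposure."""
    candidates = [i for i in np.argsort(info)[::-1] if i not in administered]
    top_k = candidates[:k]
    return int(rng.choice(top_k))

# Hypothetical item information at the current theta estimate.
info = np.array([0.26, 0.16, 0.31, 0.18, 0.92])
item = select_with_exposure_control(info, administered={4})
```

Across many respondents this spreads administrations over the near-optimal items at a small cost in efficiency, mitigating the over-exposure concern noted above.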
As a final limitation, one could argue that the technical resources needed for any CAT application in survey practice might prove to be a barrier to implementation. While this is certainly a limitation in settings where assessments are not routinely administered on electronic devices, it is not true for surveys. Population surveys usually employ computer-assisted personal interviewing (CAPI) techniques, i.e. electronic devices, to document interviewer-rated as well as self-rated responses [41]. The costs of these techniques were initially a matter of controversy [42], but, among other factors, the reduced resource use in survey post-processing and the increased quality of the collected data have led to today’s widespread use. In addition, open-source CAT algorithms have become available [10–13]. Their integration into CAPI systems is possible and remains a largely untapped resource [42, 43].
In conclusion, the GHQ-30 can be adapted for CAT administration for screening populations. In settings that do not require individual diagnostic assessments, adaptive presentation can shorten the GHQ-30 considerably and still produce useful estimates of psychological distress for group comparisons. These benefits can be realized in cross-sectional as well as longitudinal surveys. For the assessment of individual changes in distress over time, however, CAT administration may lack utility, as nearly all items must be administered to reach satisfactory reliability of change scores.