Statistical analysis
We conducted psychometric analyses in accordance with the PROMIS analysis plan [37]. An IRT model requires that three assumptions are met: unidimensionality, local independence, and monotonicity.
First, we examined unidimensionality with Confirmatory Factor Analysis (CFA) on the polychoric correlation matrix, using Weighted Least Squares with Mean and Variance adjustment (WLSMV) estimation in the R package lavaan (version 0.5-23.1097) [38]. For unidimensionality, all items must load on a single factor. The Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) were used to evaluate model fit. We report scaled fit indices, which are considered more accurate than unscaled indices [39, 40]. In addition, we performed an Exploratory Factor Analysis (EFA) on the polychoric correlation matrix with WLSMV estimation, using the R package psych (version 1.7.5) [41]. Following the PROMIS analysis plan and the recommendations of Hu and Bentler [42], we considered the evidence for unidimensionality sufficient if CFI > 0.95, TLI > 0.95, RMSEA < 0.06, SRMR < 0.08, the first factor in the EFA accounted for at least 20% of the variability, and the ratio of the variance explained by the first factor to that explained by the second was greater than four [37, 43].
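As an illustration, these checks could be set up in R roughly as follows. This is a minimal sketch: the data frame `items` (one ordinal column per item) is a hypothetical placeholder, not part of the original analysis.

```r
# Unidimensionality checks: single-factor CFA (WLSMV) and EFA on polychorics.
library(lavaan)
library(psych)

# Single-factor CFA on the polychoric correlations with WLSMV estimation
model <- paste("participation =~", paste(names(items), collapse = " + "))
fit <- cfa(model, data = items, ordered = names(items), estimator = "WLSMV")
fitMeasures(fit, c("cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))

# EFA on the polychoric correlation matrix; check the variance criteria
pc  <- polychoric(items)
efa <- fa(pc$rho, nfactors = 2, fm = "wls", n.obs = nrow(items))
efa$values[1] / ncol(items)    # proportion of variance, first factor (>= 0.20)
efa$values[1] / efa$values[2]  # first-to-second eigenvalue ratio (> 4)
```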
Second, we evaluated local independence: after controlling for the dominant factor, there should be no significant covariance among item responses. We examined the residual correlation matrix resulting from the single-factor CFA described above and considered residual correlations greater than 0.20 to be indicators of possible local dependence [37].
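A sketch of this check, reusing the lavaan fit from above (assuming the residual correlations are extracted as shown; the `$cov` element holds the residual correlation matrix when `type = "cor"`):

```r
# Local independence: inspect residual correlations from the single-factor CFA
resid_cor <- residuals(fit, type = "cor")$cov
# Flag item pairs whose residual correlation exceeds 0.20
which(abs(resid_cor) > 0.20 & upper.tri(resid_cor), arr.ind = TRUE)
```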
Third, we assessed monotonicity: the probability of endorsing a higher item response category should increase (or at least not decrease) with increasing levels of the underlying construct [37]. We evaluated monotonicity by fitting a non-parametric IRT model (Mokken scaling), using the R package mokken (version 2.8.4) [44, 45]. This model yields non-parametric item response curve estimates, showing the probabilities of endorsing each response category, which can be visually inspected to evaluate monotonicity. We evaluated model fit by calculating the scalability coefficient H per item and for the total scale. We considered monotonicity acceptable if the scalability coefficient of each item was at least 0.30 and the scalability coefficient for the total scale was at least 0.50 [45].
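A minimal sketch of this step with the mokken package, using the same hypothetical `items` data:

```r
# Monotonicity check via Mokken scaling
library(mokken)
mono <- check.monotonicity(items)
summary(mono)      # inspect monotonicity violations per item
H <- coefH(items)  # Loevinger's scalability coefficients
H$Hi               # per-item H (criterion: >= 0.30)
H$H                # total-scale H (criterion: >= 0.50)
```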
To study IRT model fit, we fitted a logistic Graded Response Model (GRM) to the data using the R package mirt (version 3.3.2) [46]. A GRM models two kinds of item parameters: item slopes and item thresholds. The item slope reflects the discriminative ability of the item, with higher slope values indicating better ability to discriminate between adjoining levels of the construct. Item thresholds reflect item difficulty and locate the items along the measured trait (i.e., the construct of interest) [47]. For items with five response categories, four item thresholds are estimated. To assess the fit of the GRM, we used a generalization of Orlando and Thissen's S-X² statistic for polytomous data [48]. This statistic compares observed and expected response frequencies under the estimated IRT model and quantifies the differences between them. The criterion for good fit of an item is an S-X² p-value greater than 0.001 [49].
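As a sketch, the GRM fit and the S-X² item-fit check could look as follows in mirt (the `fit_stats` argument name is an assumption based on recent mirt versions):

```r
# GRM fit and S-X2 item-fit statistics
library(mirt)
grm <- mirt(items, model = 1, itemtype = "graded")
coef(grm, IRTpars = TRUE, simplify = TRUE)  # slopes (a) and thresholds (b1-b4)
itemfit(grm, fit_stats = "S_X2")            # good item fit: p > 0.001
```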
We used DIF analyses to examine measurement invariance, i.e., whether people from different groups (e.g., males vs. females) with the same level of the trait (participation) have different probabilities of giving a certain response to an item [50]. We evaluated DIF for age (median split: ≤ 53 years, > 53 years), gender (male, female), education (low, middle, high), region (north, east, south, west), ethnicity (native, first- and second-generation Western immigrant, first- and second-generation non-Western immigrant), and language (English vs. Dutch). For the language comparison, we compared our sample to the US PROMIS 1 Social Supplement, obtained from the HealthMeasures Dataverse repository [51], which was used to develop these item banks [18]. We selected only the participants from this Supplement who were recruited from the US general population (Polimetrix sample, n = 1008). From this group, we used 429 people with complete data for the Ability to Participate in Social Roles and Activities item bank and 424 people with complete data for the Satisfaction with Social Roles and Activities item bank for the DIF analyses. We evaluated DIF with a series of ordinal logistic regression models, using the R package lordif (version 0.3-3) [52], which models the probability of giving a certain response to an item as a function of the trait, a (dichotomous or ordinal) group variable, and the interaction between the trait and the group variable. We used a McFadden's pseudo-R² change of 2% between the models as the criterion for DIF [52]. Uniform DIF exists when the magnitude of the DIF is consistent across the entire range of the trait; non-uniform DIF exists when the magnitude or direction of the DIF differs across the trait.
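A sketch of one such analysis with lordif, here for gender (`gender` is a hypothetical grouping vector for the same respondents as `items`):

```r
# DIF analysis with ordinal logistic regression, McFadden's pseudo-R2 criterion
library(lordif)
dif_gender <- lordif(items, gender, criterion = "R2",
                     pseudo.R2 = "McFadden", R2.change = 0.02)
summary(dif_gender)  # flags items whose pseudo-R2 change exceeds 2%
```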
Finally, we evaluated reliability. Reliability within IRT is conceptualized as “information.” Information (I) is inversely related to the standard error (SE) of the estimated construct or trait level (called theta, θ), as indicated by the formula:
$$\mathrm{SE}(\theta) = \frac{1}{\sqrt{I(\theta)}}.$$
The SE can differ across theta [47, 53]. Theta is estimated from the GRM and scaled with a mean of 0, an SD of 1, and an effective range of −4 to 4. On this scale, reliability is approximately 1 − SE²(θ), so an SE of 0.316 corresponds to a reliability of 0.90 and an SE of 0.548 to a reliability of 0.70. For each person, we calculated four theta scores: one based on all items of the item bank, one based on the standard 8-item short form (version 8a), and two based on different CAT simulations. In the first simulated CAT, we used the standard PROMIS stopping rules: the CAT stopped when an SE of 3 on the T-score metric was reached (comparable to a reliability slightly higher than 0.90) or when a maximum of 12 items had been administered (PROMIS recommends a minimum of 4 items, but this could not be specified in catR, so we used no minimum). In the second simulated CAT, we administered a fixed number of 8 items to compare the reliability of the CAT with that of the short form. In all simulations, the starting item was the item with the highest information value at the average level of participation in the population (θ = 0), in line with PROMIS practice. We used the R package catR (version 3.12) for the CAT simulations [54]. We used maximum likelihood (ML) to estimate thetas, to prevent biased scores in people with extreme responses [55]. ML, however, cannot estimate θ for response patterns that consist exclusively of extreme responses. Therefore, we set the scale boundaries in the CAT simulations to −4 to 4, so that people who scored 1 or 5 on all CAT items received a theta score of −4 or 4, respectively.
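As a sketch, the precision-based CAT simulation could be set up in catR roughly as follows, reusing the GRM parameters estimated with mirt above; the exact list options are assumptions based on the catR documentation:

```r
# Simulated CAT: stop at SE <= 0.3 (SE of 3 on the T-score metric) or 12 items
library(catR)
bank <- as.matrix(coef(grm, IRTpars = TRUE, simplify = TRUE)$items)  # a, b1-b4
sim <- randomCAT(trueTheta = 0, itemBank = bank, model = "GRM",
                 start = list(theta = 0, startSelect = "MFI"),
                 test  = list(method = "ML", itemSelect = "MFI", range = c(-4, 4)),
                 stop  = list(rule = c("precision", "length"), thr = c(0.3, 12)),
                 final = list(method = "ML", range = c(-4, 4)))
sim$thFinal  # final theta estimate
sim$seFinal  # its standard error
```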
We transformed theta scores into T-scores, as recommended by PROMIS, using the formula T = (θ × 10) + 50. A T-score of 50 represents the average score of the study population, with a standard deviation of 10. We plotted the SE across T-scores for the entire item banks, for the standard 8-item short forms (version 8a), and for the two CAT simulations [54]. We also plotted the distribution of T-scores in our population, to show the reliability of the item bank in relation to the distribution of scores in the population.
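A minimal sketch of this conversion and plot, assuming hypothetical vectors `theta` and `se_theta` holding the estimates and standard errors obtained above:

```r
# T-score conversion and SE-by-T-score plot
t_score <- 10 * theta + 50
se_t    <- 10 * se_theta
plot(t_score, se_t, xlab = "T-score", ylab = "SE (T-score metric)")
abline(h = 10 * 0.316, lty = 2)  # reference line: reliability of 0.90
```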