Descriptive statistics
Descriptive statistics were used to describe the demographic and clinical characteristics of the sample. The FreBAQ-G was summarized using range, median, mean and standard deviation for the total score. The frequencies in each response category were also reported.
IRT modelling was used to assess the cross-cultural validity and the psychometric properties of the FreBAQ-G. Because the 9 items of the FreBAQ-G are ordinally scaled, a polytomous IRT model should be used [26]. Based on statistical analysis, the graded response model (GRM) was selected [26]. The assumptions of the statistical IRT model (local independence, dimensionality and model fit) were investigated. Details about the model selection and the tests of the IRT assumptions are given in the Appendix.
Psychometric properties of the FreBAQ-G
Psychometric properties of the FreBAQ-G, including scalability, internal consistency, item characteristics, test characteristics and test reliability, were calculated. Differential item functioning (DIF) was used to evaluate item invariance, that is, whether respondents from different subgroups of the German-speaking sample who have the same trait level also have the same probability of giving a particular response to the items of the FreBAQ-G.
Internal consistency was estimated using Cronbach’s α. Internal consistency is considered acceptable if α > 0.7 [28]. Loevinger’s Hj scalability coefficient is reported as a measure of homogeneity. The coefficient reflects how accurately the items order the respondents along the measured latent trait (back specific self-perception) [29]. As a rule of thumb, items with Loevinger’s Hj < 0.3 show poor or no scalability, values between 0.3 and 0.4 indicate useful but weak scalability, values between 0.4 and 0.5 indicate moderate scalability and values > 0.5 indicate strong scalability [30].
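As an illustration of the two indices above, the following minimal Python sketch computes Cronbach’s α from a respondents-by-items score matrix and maps a Loevinger’s Hj value to the rule-of-thumb categories. It is not the Stata code used in this study. The function names, the example data and the handling of values lying exactly on a category boundary are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])                      # number of items
    item_columns = list(zip(*scores))       # transpose to items x respondents
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def classify_scalability(h):
    """Rule-of-thumb label for Loevinger's H_j.

    Assumption: boundary values (0.3, 0.4, 0.5) are assigned as coded here;
    the text does not specify how exact boundary values are treated.
    """
    if h < 0.3:
        return "poor/no scalability"
    if h < 0.4:
        return "useful but weak scalability"
    if h <= 0.5:
        return "moderate scalability"
    return "strong scalability"


# Hypothetical responses of four persons to three items (not study data).
example_scores = [[0, 1, 1], [1, 2, 2], [2, 3, 3], [3, 4, 4]]
print(cronbach_alpha(example_scores))
print(classify_scalability(0.45))
```

In the example data the three items rise in lockstep across respondents, which is the limiting case in which α reaches its maximum.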
After fitting the GRM, the test and item characteristics were evaluated. In IRT modelling, a person’s ability in the latent trait (in this study, “back specific self-perception”) is measured on a logit scale that follows a standard normal (Z) distribution with a mean of 0 and an SD of 1 (range from −4 to 4) [26]. This logit scale is called theta (θ) and is represented on the x-axis of every IRT graph. The θ scale is not sample specific [26, 31, 32], so that even when the questionnaire is administered to other groups or in other languages, the items should have the same properties, yielding comparable scores. Hence, the item and test characteristics of the current study should be comparable to those of the original English-language version reported by Wand et al. [20].
The test characteristic curve visualizes the relationship between the IRT-based estimated ability in the latent trait “back specific self-perception” for each person and the expected classical sum-score, based on classical test theory [26]. This helps to understand which FreBAQ-G sum-score is expected for a person with NSCLBP with a certain trait level on the current scoring system.
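The idea behind the test characteristic curve can be sketched in a few lines of Python: for a given θ, the expected sum-score is the sum over items of the expected item score under the GRM. This is a minimal sketch assuming the logistic form of the GRM. The item parameters below are hypothetical and are not the FreBAQ-G estimates; only the structure (9 items scored 0 to 4) is taken from the text.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a logistic graded response model.

    thresholds: increasing difficulty parameters b_1..b_m.
    Returns m + 1 probabilities for the response categories 0..m.
    """
    # Cumulative probabilities P(X >= k | theta) for k = 1..m.
    cum = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    bounds = [1.0] + cum + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


def expected_sum_score(theta, items):
    """Expected classical sum-score at theta; items is a list of (a, thresholds)."""
    return sum(
        k * p
        for a, thresholds in items
        for k, p in enumerate(grm_category_probs(theta, a, thresholds))
    )


# Hypothetical parameters: 9 identical items, each with 5 categories (0..4).
HYPOTHETICAL_ITEMS = [(1.5, [-1.5, -0.5, 0.5, 1.5])] * 9

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(expected_sum_score(theta, HYPOTHETICAL_ITEMS), 2))
```

Evaluating the function over a grid of θ values and plotting the result gives the test characteristic curve; with 9 items scored 0 to 4 the curve is bounded by 0 and 36.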
The test information function shows how precisely the FreBAQ-G can estimate the level of the respondent’s ability in the latent trait [26]. Thereby, the test information function helps to decide which region on the latent trait continuum can be estimated most precisely (or most poorly). This concept is closely related to the concept of reliability [32]; therefore, the test information function also visualizes the standard error (SE). In IRT, the SE varies for each level of the latent trait. The SE can be used to calculate the estimated overall mean reliability, often described as marginal reliability, using the formula: reliability = 1 − mean(SE)² [33].
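The link between test information, SE and marginal reliability can be illustrated with a short Python sketch. It uses the standard IRT relation SE(θ) = 1/√I(θ) and applies the reliability formula as it is printed above, that is, squaring the mean SE. This reading is an assumption: marginal reliability is often computed from the mean of the squared SEs instead, which gives a slightly different value. The SE values are hypothetical.

```python
import math


def standard_error(information):
    """Standard error of theta at a point with the given test information."""
    return 1.0 / math.sqrt(information)


def marginal_reliability(standard_errors):
    """Reliability = 1 - mean(SE)^2, read literally from the formula in the text.

    Assumption: theta is scaled to variance 1 and the mean SE is squared;
    a common variant averages the squared SEs instead.
    """
    mean_se = sum(standard_errors) / len(standard_errors)
    return 1.0 - mean_se ** 2


# Hypothetical person-level standard errors (not study data).
example_se = [0.3, 0.4, 0.5]
print(standard_error(4.0))
print(marginal_reliability(example_se))
```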
The item characteristics include item discrimination (slope), item difficulty (threshold) and item information [26]. The item discrimination parameter (a) describes the slope of the item characteristic curve. Higher values indicate better item discrimination, which means that items with higher values are more sensitive in detecting a difference in the latent trait (back specific self-perception). Values > 1 are desirable [26]. Item discrimination and item information are very closely related [26, 32]. The item difficulty parameter (b) describes the point on the x-axis (θ value) at which the probability of choosing a response option is 50% (threshold). Because the GRM is a cumulative model, the item difficulty parameters are cumulative: each threshold refers to the probability of responding in a given category or a higher one [26]. Item difficulty parameters are calculated for each item. A person whose back specific self-perception is not impaired is expected to choose response option 0 (never feels like that), whereas a person with highly impaired back specific self-perception should have a high probability of choosing response option 4 (always, or most of the time feels like that). The category characteristic curves visualize which response option a person with a given trait level is most likely to choose.
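The roles of the discrimination and difficulty parameters can be made explicit by writing out the model. The equations below give the logistic form of the GRM for an item with response options 0 to 4. The indices i (item) and k (response option) are notation introduced here for illustration, and the logistic parameterization is an assumption about the fitted model.

```latex
% Cumulative probability of responding in option k or higher on item i
P^{*}_{ik}(\theta) = P(X_i \ge k \mid \theta)
  = \frac{1}{1 + \exp\left[-a_i\,(\theta - b_{ik})\right]},
  \qquad k = 1, \dots, 4

% Probability of choosing exactly option k
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),
  \qquad P^{*}_{i0}(\theta) = 1, \quad P^{*}_{i5}(\theta) = 0
```

Setting θ = b_ik gives P*_ik(θ) = 0.5, which is the 50% threshold described above. Plotting P_ik(θ) for k = 0, ..., 4 yields the category characteristic curves of item i.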
Finally, differential item functioning (DIF) was used to assess the assumption of item invariance [26]. Item invariance implies that the FreBAQ-G is independent of particular sample characteristics. DIF is present for a given item if individuals with the same ability level (back specific self-perception) but belonging to different groups (e.g. gender) do not have the same probability of responding to the item with the same rating [26]. Therefore, item invariance can be considered a measure of fairness.
Cross-cultural validity
Cross-cultural validity refers to the equivalence of measurement across different cultural groups [28]. Cross-cultural validity was investigated using IRT techniques. In a first step, we pooled the data of the German version (FreBAQ-G, N = 271) with those collected for the English-language validation study (FreBAQ, N = 251) in an Australian study population [20]. To detect differential item functioning (DIF), we first investigated the item properties (difficulty and discrimination) separately for the German and English versions using the graded response model (GRM). To differentiate between uniform DIF (difference in item difficulty only) and non-uniform DIF (difference in item difficulty and discrimination), the mean item difficulty was calculated per polytomous item with the slopes of all items set to 1 [34]. The calibrated mean item difficulties were plotted with the German items on the y-axis and the English items on the x-axis. To facilitate interpretation, an identity line with a slope of 1 was drawn through the origin of the plot. Additionally, control lines representing the 95% CI were drawn around the identity line. Items that fall outside these control lines are suspected to demonstrate DIF [28, 31]. The item discrimination parameters were plotted in the same way. In addition, we used the IRT-LR test (likelihood ratio test) to confirm both uniform and non-uniform DIF [34, 35]. The IRT-LR test procedure compares hierarchically nested IRT models: one model fully constrains the item parameters to be equal between the German and the English versions, whereas the other models allow the item parameters to be freely estimated between groups. Finally, we used a multiple-group GRM with a correction for observed DIF to validate the performance of the classical sum-score of the English and German versions [34].
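The comparison of nested models in the IRT-LR test rests on the usual likelihood ratio statistic, shown here in its general form. The symbols are notation introduced for illustration: L_constrained is the maximized likelihood of the model with item parameters held equal across the two language versions, L_free that of the model in which the parameters of the studied item are estimated separately per group, and Δp the number of additional free parameters.

```latex
G^{2} = -2\left[\ln L_{\text{constrained}} - \ln L_{\text{free}}\right],
\qquad
G^{2} \sim \chi^{2}_{\Delta p} \quad \text{under the null hypothesis of no DIF}
```

A large G² relative to the χ² reference distribution suggests that constraining the parameters worsens the fit, that is, that the item functions differently in the two versions.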
Construct validity: associations of self-perception of the back with back pain related parameters
The relationship between the IRT-based estimated FreBAQ-G score (θ) and pain intensity, disability and fear avoidance beliefs was calculated using correlation statistics (Pearson’s r). Finally, multiple linear regression with the FreBAQ-G score (estimated as θ) as the dependent variable was performed to identify the best predictors.
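For readers less familiar with the correlation statistic, a minimal Python sketch of Pearson’s r is given below. It is illustrative only and does not reproduce the Stata analysis; the function name and the example values for θ and pain intensity are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical theta estimates and pain intensity ratings (not study data).
theta_scores = [-1.2, -0.4, 0.1, 0.8, 1.5]
pain_intensity = [2, 4, 3, 6, 7]
print(round(pearson_r(theta_scores, pain_intensity), 3))
```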
Statistical analyses were performed using Stata 16.1 (StataCorp LLC, USA). The IRT model fit statistics were calculated using the student version of IRTPRO 4.2 (Scientific Software International Inc., USA).