Introduction
Osteoporosis is a worldwide health problem contributing considerably to health care costs, morbidity and mortality [1]. Due to the increasing age of the population, this problem will become even more substantial over time. Osteoporosis is a good candidate for screening, because it has a long preclinical phase and cost-effective therapeutic possibilities are readily available [2]. Therefore, early detection may be a reasonable strategy to prevent osteoporosis-related fractures [3].
Bone mineral density (BMD) can be derived from quantitative computed tomography (QCT), ultrasound, and dual energy x-ray absorptiometry (DXA), but the World Health Organization (WHO) definition for osteoporosis only includes the DXA T-score [4]. DXA is a strong predictor of bone deterioration and fracture risk [5]. Nevertheless, osteoporosis remains underdiagnosed in the general population [6]. Recent studies proposed the use of regular clinical computed tomography (CT)-scans for bone mineral density assessment as an opportunistic screening method [7–14]. In this way, BMD surrogates could be derived from CT without additional radiation dose. Deterioration of BMD could be determined in the early stages of disease, which could prompt suitable medical treatment of osteoporosis, and thus prevent fractures in patients at risk. Additionally, scans performed in screening programs such as lung cancer screening or coronary artery disease evaluation may also be used, since BMD measurements on CT have been shown to predict all-cause mortality in lung cancer screening participants [15].
Bone density measurements on CT are mostly performed by manually placing a region of interest (ROI) in a lumbar or thoracic vertebra. Although ROI measurements may in the future be performed automatically by software, the placement of a ROI is currently performed manually, and thus may vary in size, shape and location [16]. This could introduce variability in measurements, which may depend on the experience of the observer. Furthermore, the inter-examination reliability is currently unknown. High inter-examination reliability would enable monitoring changes over time; for example, after an intervention. A few studies addressed the reproducibility of bone density assessment of the vertebrae [9, 17], but data are missing for unenhanced low-dose thoracic CT. Therefore, the aim of this study was to assess the inter-observer and inter-examination reliability and agreement in attenuation measurements of the vertebrae on low-dose unenhanced CT.
Discussion
In this study, we found excellent inter-examination reliability for manual bone density measurements of the vertebrae. Limits of agreement ranged from -26 to 28 HU, which means a change of at least 28 HU is needed in order to detect a real change in bone attenuation. Therefore, these results have to be taken into account when planning to use bone density measurements for longitudinal studies (e.g., for measuring therapeutic effects). Inter-observer reliability was good to excellent and limits of agreement with the mean ranged from -12 to 12 HU, which indicates that observers can be discordant with the mean estimated bone attenuation by 12 HU.
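As an illustration of how limits of agreement such as those above are commonly obtained, the following is a minimal sketch of the Bland-Altman calculation (mean difference plus or minus 1.96 times the standard deviation of the paired differences). The function name `limits_of_agreement`, the multiplier `z`, and the attenuation values are assumptions made for illustration; they are not the study's code or data.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second, z=1.96):
    """Return (bias, lower, upper) in HU for paired attenuation measurements.

    Assumes the conventional Bland-Altman definition: bias +/- z * SD of the
    paired differences. This is a sketch, not the study's actual analysis.
    """
    if len(first) != len(second):
        raise ValueError("paired measurements must have equal length")
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = z * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Hypothetical baseline and follow-up attenuation values (HU), not study data.
baseline = [120.0, 98.0, 143.0, 110.0, 87.0, 131.0]
follow_up = [126.0, 91.0, 150.0, 104.0, 95.0, 127.0]

bias, lower, upper = limits_of_agreement(baseline, follow_up)
print(f"bias = {bias:.1f} HU, limits of agreement {lower:.1f} to {upper:.1f} HU")
```

Under this definition, a difference between two examinations that falls inside the interval cannot be distinguished from measurement variability.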
Our results imply that manual placement of a ROI in L1 is a reliable method for the quantification of vertebral attenuation. Therefore, in a lung cancer screening setting, low-dose chest CTs may be used to measure bone attenuation. Because these measurements are performed manually, in theory, experience could influence the precision of the measurement. However, the present study shows that radiological experience has no major effect on attenuation measurements. Moreover, ICCs between more experienced observers were not better than between less experienced observers. Low-dose CT scans could therefore gain a role in early detection of osteoporosis.
With the recent recommendation on the implementation of lung cancer screening [24], a large number of subjects will receive a low-dose chest CT. Besides screening for lung cancer, this can provide an opportunity for the assessment of other abnormalities, such as chronic obstructive pulmonary disease and coronary artery calcifications [25]. Because smoking is associated with lower bone density [26], this could be an opportunity for the detection of osteoporosis in this smoking population. Diagnosing low bone density as well could improve the yield and cost-effectiveness of lung cancer screening.
To our knowledge, this is the first study to describe the inter-examination agreement and reliability of attenuation measurements of the vertebrae on unenhanced low-dose CT in a large population. In addition, we extensively studied inter-observer agreement and reliability. Although several studies used attenuation measurements in the search for an appropriate screening tool for osteoporosis, studies on the agreement and reliability are lacking.
Ohara et al. [9] studied the correlation between pulmonary emphysema and reduced bone density. For this purpose, they used manual vertebral bone measurements. They validated their measurements by calculating correlation coefficients of two observers. This resulted in ICCs of 0.995, 0.993, 0.950 and 0.996 for T4, T7, T10 and L1, respectively. Their strength was the evaluation of multiple vertebral levels, but they concluded that the average bone density of three thoracic vertebral bones was highly correlated with bone density in L1 alone (r = 0.914, p < 0.001). Pickhardt et al. [7] elaborated on this and found that measurements at L1 are as accurate as or more accurate than the results at other levels, including multilevel assessment. Also, Romme et al. [27] showed no added value of using three thoracic vertebral levels to assess bone density compared to one measurement at L1. Although L1 seems to provide the most accurate results in terms of attenuation measurements, this vertebral level is not always included on thoracic CT. In our population of 376 participants, 89 (23.7 %) measurements were made at a vertebral level different from L1.
Both Pickhardt et al. [17] and Romme et al. [27] studied inter-observer agreement and found limits of agreement between two observers of -6 HU to 16 HU for T12-L5 and -9 to 5 HU for T4-T7-T10, respectively. We add to this by using six observers with different experience levels and showing limits of agreement with the mean ranging from -12 to 12 HU. The intra-observer limits of agreement from Romme et al. ranged from -9 to 5 HU in 20 participants. Our limits of agreement were substantially wider, ranging from -26 to 28 HU, but could be more realistic as a result of a larger study cohort.
Besides presenting positive results in terms of agreement and reliability, it is important to estimate the impact of these results on clinical practice. In order to perform reclassification analyses, we used a threshold of 110 HU to define osteoporosis, which was derived from Pickhardt et al. [7]. This threshold was proposed for a routine care population with lower osteoporosis risk because of its high specificity. Buckens et al. [28] validated this threshold as being optimal as compared to DXA. By using this threshold, 159 (55.4 %) participants were classified as having osteoporosis. This high prevalence is in line with some findings of osteoporosis prevalence in a high-risk chronic obstructive pulmonary disease (COPD) cohort [29]. With this heavy smoking population being at risk for osteoporosis as well, these prevalence numbers could be appropriate. Another explanation for the high prevalence could be that the HU in the vertebra was systematically lower compared to the studies by Pickhardt and Buckens.
Our reclassification analysis showed that inter-examination variability can lead to a different diagnosis in 11.2 % of included participants. Moreover, inter-observer variability can lead to misclassification of up to 22.1 % of participants. As a consequence, when the measured bone density is close to a threshold that defines disease, the effect of variability within a patient and between different observers could be substantial. Variability therefore has to be taken into account in the development of guidelines for osteoporosis screening.
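The reclassification idea can be sketched as follows: each participant is classified against the 110 HU threshold on two measurements, and the fraction whose classification changes is counted. This is a minimal sketch under stated assumptions; in particular, treating values strictly below the threshold as osteoporosis, the names `is_osteoporotic` and `reclassified_fraction`, and the example values are illustrative choices and not taken from the study.

```python
THRESHOLD_HU = 110.0  # threshold used in the text to define osteoporosis


def is_osteoporotic(hu, threshold=THRESHOLD_HU):
    """Assumed rule: attenuation strictly below the threshold counts as osteoporosis."""
    return hu < threshold


def reclassified_fraction(first, second, threshold=THRESHOLD_HU):
    """Fraction of participants whose classification differs between two measurements."""
    if len(first) != len(second):
        raise ValueError("paired measurements must have equal length")
    if not first:
        raise ValueError("at least one participant is required")
    changed = sum(
        is_osteoporotic(a, threshold) != is_osteoporotic(b, threshold)
        for a, b in zip(first, second)
    )
    return changed / len(first)


# Hypothetical attenuation values (HU) for four participants, not study data.
examination_1 = [105.0, 112.0, 90.0, 140.0]
examination_2 = [113.0, 108.0, 95.0, 138.0]

fraction = reclassified_fraction(examination_1, examination_2)
print(f"{100 * fraction:.1f} % of participants changed classification")
```

In this toy example, only participants whose values lie within a few HU of the threshold change classification, which mirrors the point that variability matters most near the cut-off.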
Our study has limitations. First, the follow-up scans for the assessment of inter-examination variability were performed three months after baseline. In this period, CT attenuation values could have altered. However, we think the impact will be limited because decline in bone density progresses slowly. Still, a follow-up CT examination directly after baseline would be preferable to eliminate changes over time. Second, we only used measurements of one vertebra in our evaluation and did not include more vertebral levels. Nevertheless, one may assume that, even if bone attenuation may vary at each vertebral level, inter-observer and inter-examination agreement may be similar [9]. Moreover, previous studies have shown that ROI placement in multiple vertebrae does not add value compared to one measurement at L1 [7, 27]. Third, as a consequence of the study design of the lung cancer screening trial, only a small proportion of our cohort consisted of women. However, in this cohort, no difference was seen in inter-examination differences in HU between men and women. Lastly, although our scanners were calibrated weekly, we did not use a calibration phantom in this study as is done in QCT of the spine. We were therefore unable to provide BMD as milligrams hydroxyapatite per cubic centimetre and our method has lower precision compared to QCT [30, 31]. However, previous studies have shown that, although precision was lower compared to QCT, BMD estimation techniques without phantom calibration were nevertheless promising for assessing fracture risk [11].
In conclusion, this study shows that bone attenuation can be measured by manual ROI placement on unenhanced low-dose chest CT examinations with good reliability. However, when developing guidelines for early detection of osteoporosis, variability still has to be taken into account. While the distinctive character of this technique is excellent, diagnostic studies are needed to confirm these results, to evaluate its accuracy and ultimately its cost-effectiveness.