Introduction
The Greulich and Pyle (G&P) method remains the most common method for determining bone age (BA); it involves comparing the X-ray image of the left hand with reference images [1, 2]. Compared with other BA determination methods, G&P can be easily implemented in clinical practice, but it depends on the reader’s experience, takes time, and shows considerable inter- and intraobserver variability [1, 3, 4].
Several programs based on deep learning have been able to circumvent these limitations in recent years, providing a BA estimate within seconds that is comparable to that of expert readers and free of relevant intraobserver variability [5-9]. Artificial intelligence (AI)-assisted BA assessment aims to reduce the dependence on human input and provides a better alternative to the subjective traditional BA assessment [10, 11].
Four commercially available programs for determining BA using the G&P method are currently admitted for medical use in Europe by Conformité Européenne (CE): BoneXpert (Visiana, since March 2009); VUNO Med-BoneAge (Vuno, since May 2018); IB Lab PANDA (ImageBiopsy Lab, Vienna, Austria, since November 2020); and BoneView Bone Age (Gleamer, since March 2023) [12]. BoneXpert, the first commercial CE-certified AI software for automated BA determination, was launched in 2009 [9]; it determines the BA of every single bone of the hand and wrist by analyzing shape, intensity, and texture, and calculates the overall BA as the aggregate of the individual values [13]. BoneXpert was developed as an “autonomous reader,” completely replacing a human reader. In contrast, the CE-certified AI programs PANDA and BoneView were developed to assist the human reader: in clinical practice, images and results must still be checked by a human reader using the G&P atlas.
Besides internal validation as part of CE certification, independent external validation is particularly important for assessing the real-life performance of AI-based programs. To our knowledge, all AI-supported programs for BA determination have been validated separately, except for one study comparing two programs on the same data [14].
Each of the performed validation studies provides quality parameters, such as the mean absolute deviation or the root mean squared error for BA estimation, which potential buyers and users can use to estimate the performance of the software. However, the differences between those validation studies regarding age, ethnicity, and population size, as well as the varying expertise of the G&P expert readers representing the reference standard, render the comparison between the different programs less objective.
Thus, the current study aimed to compare the performance of three AI-supported programs for determining BA in a single collective of Central European Caucasians.
Materials and methods
Patient cohort and selection
A patient collective already described in another context was used for this retrospective study [15]. It comprised patients aged 1–18 years who had received an X-ray of the left hand for clinical BA determination between 2011 and 2020. All images were taken at a Central European tertiary center within an institute of pediatric radiology. The radiology information system was used to identify 5612 images. After categorizing all patients into 1-year age groups, nine male and nine female patients were randomly selected for each group, resulting in the inclusion of 153 male and 153 female patients, as published in Gräfe et al [15] (Supplementary Fig. 1). A positive vote of the local ethics committee was obtained for the retrospective analysis of the images, and informed consent was waived because of the retrospective study design (EK 46/2020).
Subgroup for the most common age range
In infants and older adolescents, BA is rarely determined according to G&P in everyday clinical practice, partly because the hand is not yet sufficiently ossified in young children and already too ossified in near adults. Hence, gender-based subgroups were formed covering the middle 90% of the age distribution. The 5th and 95th age percentiles required for this purpose were calculated from the chronological age (CA) of every BA exam performed in the same center between 2014 and 2024 (6183 examinations, 3178 exams of male patients).
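The percentile-based subgroup definition can be sketched as follows. This is an illustrative reconstruction with simulated ages, not the study’s actual data or code (the original analysis was performed in R):

```python
import random
import statistics

# Illustrative sketch of the mid-90% age band; the chronological ages
# below are simulated, not the 6183 real examinations from the study.
random.seed(0)
ca_years = [random.uniform(1, 18) for _ in range(3178)]

# statistics.quantiles with n=20 returns 19 cut points; the first and
# last correspond to the 5th and 95th percentiles.
cuts = statistics.quantiles(ca_years, n=20)
p5, p95 = cuts[0], cuts[-1]

# Share of exams falling inside the main age range (roughly 90%).
share = sum(p5 <= ca <= p95 for ca in ca_years) / len(ca_years)
print(f"main age range: {p5:.1f}-{p95:.1f} y ({share:.0%} of exams)")
```

With real examination data, `ca_years` would simply be replaced by the recorded chronological ages per sex.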
Determination of bone age
All X-ray images used for the study were taken on an Axiom Aristos FX (Siemens Healthcare) without a scatter grid, with a 0.1 mm copper filter. The left hand including the wrist was always examined in posterior-anterior projection.
All four vendors of CE-certified, AI-assisted BA determination according to G&P for the European market were asked to participate in this study: three vendors agreed (Visiana, ImageBiopsy Lab, Gleamer), while no response was received from the fourth (VUNO).
All 306 radiographs were examined in independent sessions by three readers (two pediatric radiologists, D.G. and A.B.B., with 7 and 4 years of experience, respectively, and a pediatric endocrinologist, R.P., with 30 years of experience in pediatric BA determination) to determine the manual BA using the atlas of G&P [2]. If the ages of the carpals and epiphyses differed, the age of the epiphyses was preferred [15, 16]. The mean of the three readers’ BA estimates served as the reference for further evaluation (“ground truth”).
Likewise, all 306 images were submitted for evaluation to the included AI programs for BA determination: BoneXpert (v. 3.2.2), PANDA (v. 1.13.21), and BoneView (v. 2.3.1.1). BoneXpert and PANDA were available as local standalone apps, whereas BoneView evaluation was cloud-based. BoneView rejects analysis for a CA below 3 years and returns every BA above 17 years of age as “≥ 17 y”.
Statistics
The evaluation was conducted using RStudio Version 2023.06.2 (PBC). Mean values with standard deviations (SD) or medians with interquartile ranges (IQR) were given as appropriate. The mean error, the mean absolute error (MAE), and the root mean squared error (RMSE) were determined as performance parameters. A Bland–Altman plot was used to determine the fixed bias and the limits of agreement as a measure of the variation between AI and human readers. The mean SD between the three AI programs and between the three readers was determined as the root mean square
\({RMS}=\sqrt{\frac{1}{n}\left({SD}_{1}^{2}+{SD}_{2}^{2}+\ldots +{SD}_{n}^{2}\right)}\),
where \({SD}_{i}\) is the standard deviation of the three bone age values of the i-th patient obtained from the three AI programs or from the three readers. The SDs of the AI programs were compared with the SDs of the human readers using a two-tailed Wilcoxon signed-rank test. Next, the assumption of a constant difference between the methods, i.e., zero slope in the Bland–Altman plots, was relaxed: the differences were regressed on the averages, and the results were used to obtain prediction intervals for the differences between AI-estimated BA and the ground truth. Additionally, prediction equations and prediction intervals of the ground truth, given an AI measurement, were derived [17]. A two-tailed Student’s t-test was used to test whether the slope in the prediction equation equaled one (equivalent to a constant mean error, i.e., no slope in the Bland–Altman plot). The R package “MethComp” was used for these computations. The Friedman rank sum test was applied to the squared differences between AI-estimated BA and the ground truth to compare the accuracy of the different AI methods. Interreader variabilities of the expert readers for this cohort have been reported previously [15]. Differences in central tendency were assessed using Student’s t-test, with the significance level set at 0.05.
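For illustration, the basic performance parameters and the RMS of per-patient SDs can be computed as in this minimal Python sketch (the study itself used R); all numbers are hypothetical:

```python
import math
import statistics

# Hypothetical BA values (years): "ground truth" is the mean of three
# expert readers; ai_estimate is one program's output for the same patients.
ground_truth = [5.0, 8.2, 11.5, 14.1, 16.7]
ai_estimate = [5.4, 7.9, 11.9, 14.6, 16.2]

diffs = [a - g for a, g in zip(ai_estimate, ground_truth)]
me = statistics.mean(diffs)                              # mean error (fixed bias)
mae = statistics.mean(abs(d) for d in diffs)             # mean absolute error
rmse = math.sqrt(statistics.mean(d * d for d in diffs))  # root mean squared error

# Bland-Altman 95% limits of agreement around the fixed bias.
sd_diff = statistics.stdev(diffs)
loa = (me - 1.96 * sd_diff, me + 1.96 * sd_diff)

# RMS of per-patient SDs: each row holds the three BA values one patient
# received from the three AI programs (or from the three readers).
per_patient = [[5.4, 5.1, 5.6], [7.9, 8.3, 8.0], [11.9, 11.4, 11.7]]
rms = math.sqrt(statistics.mean(statistics.stdev(row) ** 2 for row in per_patient))

print(f"ME={me:.2f} MAE={mae:.2f} RMSE={rmse:.2f} "
      f"LoA=({loa[0]:.2f}, {loa[1]:.2f}) RMS-SD={rms:.2f}")
```

The regression-based prediction intervals described above would additionally require fitting the differences against the averages (as done in the R package “MethComp”), which is omitted here for brevity.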
Discussion
This study is the first to compare three of the four programs for BA determination based on the atlas of G&P that have been approved for the European market as medical devices.
All three programs analyzed in the present study (BoneXpert, PANDA, and BoneView) predict BA according to G&P reliably and with similar quality, with an MAE of 0.48–0.56 years. However, it should be mentioned that BoneView does not analyze the age group up to 3 years of CA, which is particularly difficult to evaluate but also rarely encountered in clinical practice. In addition, BoneView returns every BA above 17 years as “≥ 17 y” due to forensic considerations, which resulted in a rejection rate of one-fifth of the whole cohort.
In the subgroup containing the main age range, the MAE even fell between 0.50 and 0.52 years. Notably, the mean standard deviation among the three AI programs was lower than that among the human readers, consistent with findings in the literature and indicating a quality feature [18]. In previous validation studies, all three programs showed excellent agreement between AI-based and expert reader assessments, with deviations similar to those observed here: BoneXpert had an MAE of 0.34 years and an RMSE of 0.45 years [5], and BoneView had an MAE of 0.49 years [19]. To the authors’ best knowledge, only one recent study has evaluated more than one AI-supported program for BA determination concurrently (BoneXpert and Med-BoneAge) and found no significant differences between the two AI programs and expert readers [14]. Med-BoneAge was not included in the current study due to a lack of response from the vendor.
The differences between the AI programs become apparent in the details: while BoneXpert produced the best results over the entire age range of 1–18 years, PANDA showed an even smaller deviation from the expert readers’ evaluation within the main age range, which we defined as covering 90% of all BA examinations in clinical practice. Outside this age range, however, the results of PANDA were worse. It is therefore not surprising that the main age range determined in this study largely coincides with the age range of intended use of PANDA (> 3 years and < 16 years for females, < 17 years for males). Whereas BoneView rejects examinations beyond the recommended CA before analysis, PANDA analyzes every patient (and, in principle, even X-rays not meant for G&P BA analysis, such as chest X-rays) and leaves it to the user to validate the correct input data. This is the trade-off for the high acceptance rates of PANDA and BoneView. Nevertheless, for cases with a CA outside the intended use range, PANDA issues a warning in the findings report. Considering the ceiling effect for BA beyond the recommended CA (Fig. 1), the results for PANDA in near adults must be interpreted with caution.
Additionally, PANDA analyses have been shown to be quite robust against non-straight posterior-anterior projections of the hand [20]. In contrast to BoneView and PANDA, BoneXpert has an internal quality control for hand radiographs; in the study dataset, it rejected 2.2% of the examinations.
The clinical relevance of a mean absolute deviation of approximately half a year is unclear, especially since human readers in this study and in the literature show a similar standard deviation [18]. Furthermore, the extent to which mean deviations of weeks to a few months, as observed among the three programs studied, matter for clinical practice in the context of such high natural variability remains speculative. Without the adjustments described in Table 3, the proportional bias would lead BoneXpert to overestimate, and BoneView to underestimate, BA in prepubescent children.
The software programs also differ in their requirements for the selection of X-ray images and the patient population. Of the three AI software programs analyzed, BoneXpert is the most widely used and established and has been validated in many studies across different ethnicities and diseases [13]. Moreover, BoneXpert was validated in an actual Caucasian population [21]. In contrast to the other software, BoneXpert was validated for BA determination based on both G&P and the Tanner–Whitehouse method, and it can analyze both left- and right-hand radiographs in the anteroposterior view [13]. BoneXpert is unique in having an intended use not only as an AI-assist tool but also as an AI-replace tool. The additional option of determining a “bone health index” also extends the sheer BA determination by the AI software (“AI-expand”) [13]. BoneView is the newest AI software and can likewise analyze anteroposterior views and radiographs of the left or right hand [19].
The authors acknowledge several limitations of the study. The study population comprised patients with suspected or proven growth failure from a single center; since some pathologies affect not only BA but also bone morphology (e.g., achondroplasia), a center with a different patient composition may achieve different results. Furthermore, the cohort studied comprised Caucasians only, so the results may not generalize to other ethnicities.
The economic requirements of the programs were not compared as part of the study, as pricing depends on both the volume of analyses and the negotiating conditions of the clinical facilities, such as whether other programs from the same provider are already in operation on-site. BoneView rounds each BA to an age category given in the G&P atlas, while PANDA and BoneXpert give continuous values. Since even the gold standard, the mean value of three expert readers, does not correspond exactly to G&P stages, the performance of BoneView without rounding to G&P age categories could, in theory, be higher (Supplementary Tables 2 and 3). In the commercial version, however, only the rounded values are available. Most importantly, the “gold standard,” the expert reader, can itself be flawed: inter- and intra-reader variability is inherent to the G&P atlas, and averaging the BA estimates of three expert readers was used to mitigate this limitation.
In summary, all three AI-supported programs showed a high degree of accuracy in determining the G&P BA. Differences emerge in the details, particularly in infants and near adults.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.