Introduction
Acromegaly is a rare, progressive disease caused by oversecretion of growth hormone (GH) and elevated levels of insulin-like growth factor 1 (IGF-1) in the bloodstream [1]. A GH-secreting pituitary tumor is the cause of acromegaly in more than 95% of patients, and surgical treatment remains the first-line therapy in most cases [2].
Many variables influence the likelihood of surgical success and endocrinological remission, such as age, Knosp grade, repeat surgery, or even different somatostatin receptor subtypes [3-5]. The more factors that come into play, the harder it becomes for clinicians to account for them and their interactions. Based on these patient features, machine learning (ML) can be applied to tailor treatment to a patient's individual characteristics in the era of "personalized medicine" [6]. It has become evident that ML has strong potential for outcome prediction and sometimes even outperforms traditional statistical modeling techniques [7, 8].
The ability to predict outcomes such as gross total resection (GTR) and biochemical remission (BR), as well as clinically relevant complications such as intraoperative cerebrospinal fluid (CSF) leaks, from simple preoperatively available information would be beneficial for risk-benefit counseling and shared decision-making. For some complications such as intraoperative CSF leaks, modifiable risk factors could even be adjusted based on personal risk, and precautions such as lumbar drainage could be taken in individuals with a high predicted risk [9]. For these reasons, we aimed to develop and externally validate clinical prediction models for outcomes after transsphenoidal surgery for acromegaly.
Discussion
In this study, we evaluated the feasibility of predicting surgical and endocrinological outcomes after transsphenoidal surgical treatment of acromegaly. Using data from two registries, three clinical prediction models were trained and subsequently externally validated. The results were promising and demonstrate substantial potential for clinical application of ML.
In surgical treatment of acromegaly, normalization of GH levels through total resection is crucial. Treatment-refractory acromegaly puts patients at risk of early mortality [21]. Consequently, a more aggressive surgical approach is justified in refractory cases. The percentage reduction in GH has been shown to correlate closely with the fraction of tumor removed during surgery for acromegaly [22]. Furthermore, low serum GH levels indicate persistent remission, whereas with higher levels the probability of recurrent disease, which is linked to significant mortality, is markedly greater [23, 24]. Even intraoperative CSF leaks are detrimental to endocrinological outcomes, since they have been shown to inhibit hormonal recovery after surgery, apart from their inherent risk of persistent CSF fistulas and meningitis [25, 26].
Surgical outcome depends on many variables that are hard to account for, including surgical experience, skill, and caseload [2], which makes prediction difficult. ML methods can derive a simple risk assessment model from relatively complex data [4, 5]. ML has therefore been shown to support improved shared decision-making as well as enhanced patient care through modification of risk factors [7, 27, 28]. However, some factors cannot be captured by any model; prediction models will always remain just that: models of reality. Therefore, ML should never replace the careful study of imaging, the contemporary literature, and surgical experience. Rather, it should be seen as supplemental information available to surgeons, complementing the existing evidence and enabling personalized risk-benefit assessment. There is good evidence that ML can improve surgical decision-making and in some cases may even outperform expert predictions [28].
Other important tools available to physicians include simple scores and classifications, such as the Knosp classification [29] or the Zurich Pituitary Score [30]. While these scores are well validated and robustly predict outcomes such as GTR, they are difficult to tailor to specific patient characteristics because they stratify patients into broad risk groups. For the ML models established in this investigation, some of these classifications were combined with other recognized prognostic factors to deliver predictions tailored to each individual patient. Few valid comparisons can be made between the performance of ML models and these scoring systems, since performance measures such as sensitivity and specificity are rarely reported for the scores. A systematic review by Dhandapani et al. [31] allows comparison with the raw Knosp classification and its relationship with GTR: the usual dichotomization of the Knosp classification (grades 1 and 2 vs. grades 3 and 4) yielded a sensitivity of 66.4% and a specificity of 90.3% for GTR [31]. Furthermore, in future studies, incorporating additional endocrinological parameters such as preoperative IGF-1 or early postoperative GH into the model might yield better performance for BR prediction [32-34]. However, the rationale of this study was to develop a simple tool that provides meaningful predictions using only basic, preoperatively available data.
The developed models demonstrated good generalizability, performing similarly well on the external validation data as on the training data. The GTR and BR models had a high positive predictive value (PPV), making them suitable as "rule-in" models. Conversely, the CSF leak model demonstrated a high negative predictive value (NPV) and is thus more suitable in a "rule-out" setting.
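The distinction between "rule-in" and "rule-out" use follows directly from how these metrics are computed from a 2×2 confusion matrix. The following minimal sketch illustrates the arithmetic; the counts are purely hypothetical and do not come from the study cohorts.

```python
# Illustrative only: hypothetical confusion-matrix counts, not study data.
def diagnostic_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV, and NPV from 2x2 counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # how much to trust a positive call ("rule-in")
    npv = tn / (tn + fn)           # how much to trust a negative call ("rule-out")
    return sensitivity, specificity, ppv, npv

# A high-PPV model (few false positives) is useful for ruling an outcome in,
# whereas a high-NPV model (few false negatives) is useful for ruling it out.
sens, spec, ppv, npv = diagnostic_metrics(tp=60, fp=5, fn=20, tn=15)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} ppv={ppv:.2f} npv={npv:.2f}")
```

Note that, unlike sensitivity and specificity, PPV and NPV depend on the outcome prevalence in the cohort, which is why a model's "rule-in" or "rule-out" character may shift between centers with different endpoint incidences.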
A major criticism of ML-based prediction models is that they can act as a "black box" [35]. Especially with deep neural networks, it is often impossible to understand why a certain prediction was made: fed the required data, the algorithm can often provide a precise outcome prediction, but its internal decision-making process remains opaque. In this study, this problem was mitigated first by relying on algorithms of a complexity suited to tabular medical data. In addition, insight into the decision-making process can be gained by evaluating the variable importances listed in Table 3. In ML, interpretability can involve an inherent trade-off against predictive power.
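One common, model-agnostic way to obtain variable importances of the kind reported in Table 3 is permutation importance: shuffle one feature's values across patients and measure how much the model's performance drops. The sketch below demonstrates the idea on a toy dataset with an entirely hypothetical stand-in "model"; it is not the procedure or data used in this study.

```python
# A minimal sketch of permutation variable importance. The dataset and the
# "trained model" below are hypothetical, for illustration only.
import random

random.seed(0)

# Toy dataset: feature 0 drives the label, feature 1 is pure noise.
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def model(row):
    """Stand-in for a trained classifier: thresholds feature 0."""
    return 1 if row[0] > 0.5 else 0

def accuracy(X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(X, y, feature):
    """Accuracy drop when one feature's values are shuffled across patients."""
    baseline = accuracy(X, y)
    col = [row[feature] for row in X]
    random.shuffle(col)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
    return baseline - accuracy(X_perm, y)

print("importance of feature 0:", permutation_importance(X, y, 0))  # large drop
print("importance of feature 1:", permutation_importance(X, y, 1))  # no drop
```

Shuffling the informative feature destroys the model's accuracy, while shuffling the noise feature changes nothing, which is exactly the contrast an importance ranking conveys.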
In conclusion, prediction of complex outcomes like BR and GTR, which are certainly also governed by "unmeasurable" factors such as surgeon experience, from simple input data remains a difficult task, although ML already provided relatively accurate predictions in this pilot study. Using more complex input variables would probably improve performance, but overly complex inputs could be undesirable, as they would make application of the models impractical. This study aimed to create a simple tool that provides meaningful predictions using basic, preoperatively available data, and the developed models are proof that this is no longer mere wishful thinking. To the best of the authors' knowledge, there are no other published, externally validated clinical prediction models for outcomes of transsphenoidal pituitary surgery in acromegalic patients. Once these models are enhanced with additional patient data and more participating centers to foster generalizability, integration into a publicly available web application would be feasible.
Limitations
The main limitation of our study is the relatively small sample size. Although a sizable surgical cohort of over 300 acromegalic patients was included for training, one of the largest contemporary single-center cohorts in the literature, this sample size is still rather small for ML. For example, evaluation of model calibration usually requires larger amounts of data; recalibration would not change anything in this respect and would only artificially improve calibration [36, 37]. Larger amounts of data would also likely improve general model performance. Even though external validation was carried out and demonstrated the generalizability of our models, including more participating centers to create a multicenter model could additionally account for differences in surgical strategies and related factors. Another important consideration is that these models are not applicable to centers with radically different treatment protocols. Importantly, surgical outcomes are also influenced by surgical experience and caseload [38], inherently limiting the generalizability of any prediction model, score, or classification for surgical outcome. For example, a significantly different endpoint incidence may lead to systematic over- or underestimation of the outcome probability by the developed models [36]. Furthermore, all clinical prediction models are unable to reliably predict extreme cases that fall outside the range of the training data (extrapolation) [39, 40]. Moreover, our models were trained on "real-world" registry data: the rate of BR was higher than the rate of GTR due to supplemental treatments such as radiation and medical therapy. While this reflects real-world clinical practice, with some patients undergoing multiple treatments, our models may be less suitable for predicting the chance of BR from tumor resection alone. Problems may also arise from poor inter-rater reliability between physicians [41, 42]; this has been demonstrated in particular for the Knosp and Hardy classifications.
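The data requirement for calibration assessment can be made concrete with a binned reliability check: predicted probabilities are grouped into bins, and the mean predicted probability in each bin is compared with the observed event rate. The sketch below uses hypothetical predictions and outcomes; with only a handful of patients per bin, the observed rates are very noisy, which is why calibration evaluation needs large samples.

```python
# A minimal sketch of a binned calibration (reliability) check.
# Predicted probabilities and outcomes below are hypothetical examples.
def calibration_bins(probs, outcomes, n_bins=4):
    """Mean predicted probability vs. observed event rate per probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, o))
    result = []
    for b in bins:
        if b:  # skip empty bins
            mean_pred = sum(p for p, _ in b) / len(b)
            obs_rate = sum(o for _, o in b) / len(b)
            result.append((round(mean_pred, 2), round(obs_rate, 2), len(b)))
    return result

probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 1, 1, 1, 1]
for mean_pred, obs_rate, n in calibration_bins(probs, outcomes):
    print(f"predicted~{mean_pred} observed={obs_rate} (n={n})")
```

A well-calibrated model shows observed rates close to the mean predicted probability in every bin; systematic deviations of the kind discussed above (e.g., a different endpoint incidence at another center) would appear as a consistent offset across bins.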