1 Introduction
Intraoperative monitoring is applied in cerebellopontine angle (CPA) surgery to detect and avoid nerve damage. In vestibular schwannoma (VS) surgery, monitoring of free-running EMG, facial motor evoked potentials (MEP) and direct nerve stimulation (DNS) supports preservation of facial and vestibulocochlear function and consequently postoperative quality of life [1, 2]. Monitoring of free-running EMG examines continuous EMG activity recorded by needle electrodes in the facial muscles for specific pathological patterns, so-called “A-trains”. The overall quantity of A-trains (“traintime”) has been shown to correlate with the degree of postoperative facial palsy [3, 4]. The positive predictive value of the method with fixed risk thresholds is ~64%, which is comparable to the values published for MEP and DNS [1, 4].
A limiting factor is the occurrence of false-positive cases with high amounts of A-trains and no severe deterioration of facial function [1, 5, 6]. In a previous study [6], we demonstrated that such patients frequently show a so-called “split” facial nerve [7]. In these cases, the intermediate nerve (NI) takes a course in the CPA separate from the facial nerve, carrying motor fibers targeting the facial muscles [6, 8–10]. Irritation of the NI provokes comparably large amounts of A-trains. Potentially due to the low functional importance of intermedius motor fibers, this is frequently not accompanied by corresponding deficits [6]. Unfortunately, characteristics of “intermedius” A-trains are not significantly different from those of “facial” A-trains [11], which prevents differentiation of the two entities. Instead, so-called A-train “clusters”, i.e. A-trains occurring in most recording channels within a short time, are more frequent in patients with a separate NI on a group level [11]. In addition, observation of a separate NI becomes more frequent with larger tumor size but is rare in cases with very large tumors [11].
These findings suggest complex interactions between tumor size, NI, surgical manipulation, A-train activity and correlation to outcome. It seems therefore unsurprising that fixed traintime thresholds largely independent of tumor size and without consideration of a separate NI suffer from limitations.
In the current study, we employ machine learning and specifically neural networks (NN) to calculate an outcome parameter similar to House-Brackmann (HB) grades [12] based on traintime, tumor size and preoperative functional status. An advantage of NN relevant to our application is their ability to integrate different data types and to capture complex interactions. While understanding how a successful neural network achieves its performance is notoriously difficult, even a pure black-box approach may have clinical merit if it outperforms estimation based on direct interpretation of parameters alone.
The main goal of our study is therefore to provide an improved tool to estimate postoperative facial nerve outcome with the potential for real-time intraoperative application for facial nerve monitoring.
2 Methods
2.1 Patients
Data from 200 consecutive adult patients who had undergone VS surgery between 7/2006 and 8/2016 were selected retrospectively and anonymized. This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committee of the University Hospital Halle (Saale) (Ref. Number 2018-138). All patients whose data were included in the study had given their written informed consent for scientific usage of their data. Inclusion criteria were first VS surgery, availability of complete continuous intraoperative EMG recordings from clinical routine as well as facial nerve outcome data from follow-up after at least 6 months. Exclusion criteria were previous irradiation and neurofibromatosis.
2.2 Recordings
Continuous EMG was recorded during the complete surgical procedure as described previously [3, 4]. In short, 15 mm long non-insulated needle electrodes were placed parallel in the facial muscles with an interelectrode distance of 5 mm. For each of the 3 main nerve branches, 4 electrodes were positioned on the operated side. Referencing neighboring electrodes resulted in 3 bipolar channels per branch. The ground electrode was placed in the contralateral upper arm. Data were recorded with a Grass-Telefactor 15LT biosignal amplifier (West Warwick, RI, USA) at approximately 7 kHz, using a 5 Hz high-pass filter.
2.3 EMG processing
Recorded data were evaluated postoperatively by computer-assisted visual inspection using in-house software. Extending automated marking [3], on- and offsets of individual A-trains were marked. In addition, A-train clusters [11] were identified visually. Subsequently, the durations of all A-train events were summed per channel, yielding a total of 9 traintime values for each patient.
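The per-channel summation described above can be illustrated with a minimal sketch. The event representation (channel index, onset and offset in seconds) and the function name are assumptions made for illustration; the in-house software is not described in that detail.

```python
def traintime_per_channel(events, n_channels=9):
    """Sum A-train durations per channel.

    events: iterable of (channel_index, onset_s, offset_s) tuples
            (hypothetical format, times in seconds).
    Returns a list with one traintime value per channel.
    """
    totals = [0.0] * n_channels
    for channel, onset, offset in events:
        if offset < onset:
            raise ValueError("offset precedes onset")
        totals[channel] += offset - onset
    return totals


# Example: two A-trains in channel 0, one in channel 8
example_events = [(0, 1.0, 1.5), (0, 3.0, 3.25), (8, 10.0, 12.0)]
example_traintime = traintime_per_channel(example_events)
```

With 3 bipolar channels for each of the 3 nerve branches, the result holds the 9 traintime values per patient.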
2.4 Clinical data
Clinical data were extracted from clinical documentation: preoperative and immediate postoperative facial nerve function as well as follow-up after 6 months, graded according to House-Brackmann [12]. The HB grading system distinguishes 6 degrees of facial palsy: 1 represents normal function, 2 to 5 represent dysfunction from mild to severe and 6 represents total paralysis. HB ≥ 4 is of particular clinical relevance, as eye closure on the affected side is no longer possible. HB degrees were checked and corrected, if necessary, by a single experienced evaluator (author JP) to mitigate the limited interrater reliability of HB grading [13]. Intraoperative observation of a separate NI was taken from the surgeon’s documentation.
2.5 Relationship to postoperative outcome
The relationship of traintime, tumor size and NN estimates of postoperative outcomes (postoperative and follow-up HB grades) with the actually observed outcomes was evaluated using Spearman partial rank correlation as applied previously [4]. A statistically significant partial correlation suggests an association which is not explained by the covariates, e.g. between traintime and outcome independent of tumor size [4]. The correlation of only the raw traintime and tumor size with the outcome, i.e. without first passing through the networks, was evaluated to yield a baseline performance against which network outputs were compared.
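As an illustration of the partial rank correlation, the following sketch computes a first-order Spearman partial correlation for a single covariate. It assumes the standard formula applied to tie-averaged ranks; the exact implementation used in the study is not specified, and all function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_spearman(x, y, z):
    """Spearman correlation of x and y, controlling for one covariate z."""
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Here x could be traintime, y the HB grade and z the tumor size. Significance testing is omitted from the sketch.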
2.6 Neural networks and logistic regression models
Feed-forward networks with different input parameters, a single hidden layer and simultaneous postoperative and follow-up HB grades as outputs were constructed using the feedforward function of the Matlab Deep Learning Toolbox (Matlab R2021a, The Mathworks, Natick, MA, USA). The number of hidden-layer neurons was chosen equal to the number of inputs. Continuous network outputs were rounded and interpreted as estimated HB grades. The networks were therefore trained to recognize the association between input parameters and the target “patterns” of HB grade pairs (postoperative and follow-up).
The procedure utilized a Levenberg-Marquardt training function and mean squared error for performance evaluation. Data were randomly separated into 75% (150 datasets) training and 25% (50 datasets) validation splits. Performance was evaluated in only the validation split by calculating the chi2 statistic between estimated HB grades and postoperative and follow-up HB grades. For more intuitive interpretation, chi2 values were transformed into Cramér’s V effect sizes. For 5 × 5 tables (evaluated HB 1–5), values below 0.05 are considered negligible, 0.05–0.13 small, 0.13–0.22 medium and above 0.22 large [14].
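The transformation from chi2 to Cramér’s V can be sketched as follows, assuming the usual definition V = sqrt(chi2 / (n · (min(rows, columns) − 1))) and a contingency table of estimated versus observed HB grades. Function names are illustrative.

```python
import math


def chi_square(table):
    """Pearson chi2 statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells of empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for a chi2 value from an n_rows x n_cols table."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Perfect agreement in a 2 x 2 table gives V = 1
perfect = [[10, 0], [0, 10]]
v_perfect = cramers_v(chi_square(perfect), 20, 2, 2)
```

Under the assumption of n = 50 validation cases and a 5 × 5 table, a chi2 value of 47.7 corresponds to V ≈ 0.49 with this formula.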
To illustrate the performance of a more transparent model, multivariable multinomial logistic regression models (LRM) were trained and evaluated with the feature combination showing the best NN performance, applying the same methodology.
2.7 Statistical evaluation of performance
NN training depends on random choice of training and validation splits as well as random initialization of synapse weights between layers. To better estimate overall NN performance, we applied bootstrapping to sample the performance distribution observed with many networks. The approach repeated a single run of calculations 1000 times, yielding 1000 estimates, i.e., chi2 values of the comparison between network output and outcomes.
The mean and 95% confidence interval of the resulting distribution were taken as overall performance. For calculation of significance, the distribution was compared to a surrogate distribution using a Kolmogorov-Smirnov (KS) test. The surrogate distribution was constructed by shuffling the input data of the validation split with respect to the outcome values. Chi2 values were then calculated using the surrogate network output. This procedure was also repeated 1000 times, yielding the surrogate distribution.
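The surrogate approach can be sketched as follows. The scoring function is a hypothetical placeholder for "apply the trained network to the shuffled inputs and compute chi2 against the outcomes", and only the two-sample KS statistic D is computed; the p-value calculation is omitted.

```python
import random


def surrogate_distribution(inputs, outcomes, score_fn, n_rep=1000, seed=0):
    """Scores obtained after shuffling inputs with respect to outcomes.

    score_fn(inputs, outcomes) is a hypothetical callable returning one
    performance value, e.g. a chi2 statistic.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rep):
        shuffled = list(inputs)
        rng.shuffle(shuffled)  # breaks the pairing of input and outcome
        scores.append(score_fn(shuffled, outcomes))
    return scores


def ks_statistic(a, b):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Toy score: number of positions where input and outcome agree
def agreement(xs, ys):
    return sum(1 for x, y in zip(xs, ys) if x == y)


grades = [1, 2, 3, 4, 5] * 10
toy_surrogate = surrogate_distribution(grades, grades, agreement, n_rep=100)
```

A large D between the observed and the surrogate distribution indicates that performance depends on the pairing of inputs and outcomes.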
2.8 Comparison of different input sets
The primary endpoint of our study was the evaluation of NN with traintime, tumor size and preoperative facial nerve function as inputs. Additionally, we evaluated performance when adding the information that a separate intermedius and/or A-train clusters were observed. Performance differences are discussed based on 95% confidence intervals (CI). Overlapping CI were interpreted as a lack of significant differences, which is considered conservative [15].
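The CI comparison can be illustrated with a short sketch. It assumes a percentile interval taken from the distribution of performance values, which is one common choice; how the intervals were derived in the study is not stated, so this is an assumption.

```python
def percentile_interval(samples, level=0.95):
    """Percentile interval of a sample distribution (nearest-rank style)."""
    ordered = sorted(samples)
    n = len(ordered)
    cut = round((1 - level) / 2 * n)  # values trimmed from each tail
    return ordered[cut], ordered[n - cut - 1]


def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Example with 1000 values, as obtained from 1000 repetitions
example_ci = percentile_interval(range(1, 1001))
```

Two input sets would then be regarded as not significantly different whenever intervals_overlap returns True for their intervals.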
2.9 Evaluation of tumor size
Networks trained on traintime, tumor size and preoperative facial nerve function were further analyzed to study the influence of tumor size. The complete dataset was subdivided into groups according to Koos grades. Chi2 values were then calculated for each group individually. Due to comparable preoperative HB grades in most patients and therefore also within tumor size subgroups, the observed group correlations then necessarily must depend on traintime. Mean correlations and 95%-CI are reported over all 1000 randomizations. For evaluation of differences between tumor size categories, a general linear regression model (GLM) was fitted to the network estimates, taking tumor size and sample size in the groups into account to control for the different patient numbers in tumor size groups, ranging from 18 with Koos 1 to 70 with Koos 3.
2.10 Influence of a separate intermedius nerve
NN performance was investigated regarding the influence of a separate NI. Based on all 200 patients, chi2 values of estimates and clinical HB grades were calculated for patients with and without separate NI in each of the 1000 randomizations and compared with the KS test. Unlike the remaining analyses, this evaluation was performed in the complete sample rather than in only the validation split, because the random selection of 50 cases in each randomization would have led to varying and frequently unbalanced percentages of cases with a separate NI. Since chi2 statistics and to some degree Cramér’s V are sensitive to the sample size, comparison to the performance of other neural networks evaluated in only the smaller validation split is limited.
4 Discussion
We utilized machine learning approaches in a group of 200 patients undergoing VS surgery. Our results show that these methods can combine preoperative facial nerve function, tumor size and intraoperative traintime to estimate postoperative facial nerve outcomes. Performance exceeds that obtained by evaluating the features alone, including when tumor size is controlled for. Performance did not improve when observation of a separate NI and/or detection of A-train clusters were added to the analysis. Prediction improved when A-train clusters were removed from the detected traintime, mainly due to improvements in patients without a separate NI. Improved prediction may support intraoperative decision making as well as recognition of which surgical maneuvers carry an increased risk of postoperative palsy.
Our previous studies demonstrated that a separate NI can give rise to an excessive amount of A-trains not related to postoperative palsy [6, 11], which limits outcome estimation based on free-running EMG alone. Since observation of a separate intermedius is related to tumor size [11], which itself yields predictive information [4, 17, 18], we hypothesized that considering this interaction could improve outcome estimation.
Indeed, integrating preoperative facial nerve function, traintime and tumor size outperformed outcome estimation using only tumor size or traintime. Although performance was generally lower in patients with a separate NI, combined analysis also resulted in improvements in this subgroup.
Preoperative facial nerve function and tumor size have been shown to impact intraoperative monitoring. Facial MEP, for example, correlate with tumor size already at the start of surgery [19], while traintime interpretation should consider preoperative deficits [3]. Our results show that NN approaches integrate these different modalities, effectively implementing such clinical recommendations in a formalized and objective manner.
Utilizing corrected traintime resulted in a considerable improvement even if only mean traintime was considered. Correction increased chi2 from 31.9 to 47.7 (Cramér’s V from 0.39 to 0.49). The combination with preoperative HB and tumor size then showed the best performance of all tested combinations. Correction was based on our previous finding that patients with a separate NI show A-train clusters significantly more often than patients without [11], similar to patients with previous surgery or irradiation [5]. We argued that these clusters are an expression of a hyperexcitable or more vulnerable NI.
The result that removing A-train clusters is beneficial for HB estimation supports the idea that such excessive, clinically uninformative traintime may be caused by a separate NI [6, 11]. It is however surprising that adding the observation of a separate intermedius or the presence of clusters as NN inputs was not helpful and even partially decreased performance. Furthermore, the effect was largely present in the subgroup without separate NI, while patients with a separate NI did not benefit (Table 4).
Consequently, the results indicate that A-train clusters generally over-represent actual damage to the facial nerve, not only when a split nerve course is encountered. Cluster traintime should therefore be given less weight than traintime from singular A-trains or be removed entirely. In the current study, however, correction was not sufficient to ameliorate the impact of a separate NI. There are several potential reasons. First, due to practical factors, A-train clusters were identified visually. This strategy may have resulted in marking only the clearest clusters, while the phenomenon might in fact be subtler and manifest as a “spectrum of over-representation”. Furthermore, topography, time and distance between occurrences and the relationship to singular A-trains were not evaluated.
Even if such information would not alleviate the intermedius “issue”, NN offer further potential for improvement. NN allow integration of more information sources, above and beyond the evaluated features. For example, FMEP [19–21] or direct electrical stimulation [22] could be utilized for a multimodal monitoring approach. In addition, determination of the facial nerve course [23] could add valuable anatomical information.
Overall, estimated HB grades corresponded well to clinical evaluation. In moderate ranges, we observed deviations by one, sometimes two degrees (Fig. 3). Such variability may partially be caused by the subjective nature of HB grading itself or its practical application [24–26]. Scheller et al. [13] investigated the interobserver variability of HB grading as part of a randomized multi-center phase III trial. In that study, too, HB grades varied between observers to an extent comparable to our results. HB grades were also most consistent when facial nerve function was normal or mildly impaired. NN estimates are therefore well within the range of this variability. Further improvement may require the use of a more objective grading system with better interrater reliability [26–29].
Finally, a significant disadvantage of neural networks is their “black box” nature, i.e., how they achieve their performance is notoriously difficult to interpret. In comparison, LRM are more accessible, as the resulting regression coefficients allow direct interpretation of the relative feature importance. The performance of LRM in our study was lower than that of NN, but still within a clinically useful range. It is conceivable that more training data may result in further improvement. Future studies should therefore conduct more detailed comparisons, including further computational approaches to combine multimodal information.