Background
Materials and methods
Signal recordings and MEP data
Exploratory data analysis
Machine learning pipeline
- Random forest (RF) [26]: an ensemble learning method that combines multiple decision trees for supervised classification
- K-nearest neighbor (kNN) [27]: a supervised classifier that uses proximity to assign a data point to a class
- Logistic regression (LogReg) [28]: a statistical method that estimates the parameters of a logistic model to classify data
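The text names the three classifiers but not their implementation. A minimal sketch using scikit-learn (the instantiation shown here is illustrative; the actual hyperparameters were tuned separately, see Sect. "Tuning the hyperparameters"):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative defaults only -- not the authors' tuned settings.
classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "LogReg": LogisticRegression(max_iter=1000),
}
```

Each model exposes the same `fit`/`predict` interface, so the three can be swapped through an identical pipeline.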
Preselection
Splitting and data augmentation
Tuning the hyperparameters
Dimensionality reduction
- Principal component analysis (PCA): a technique that linearly transforms the data into a new coordinate system capturing most of the variance of the data in fewer dimensions [31].
- Simple feature extraction (FE): a custom-made Python function extracting onset latency (see Sect. "Exploratory data analysis"), peak latency (i.e., latency of the first peak), end of signal (defined as the onset latency of the inverse of the signal), maximum, minimum, area under the curve (AUC), and number of peaks.
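The authors' FE function is not reproduced in the text; the sketch below shows one plausible NumPy implementation of the seven listed features. The relative threshold, the peak criterion (local maxima above threshold), and the sample-based latencies are assumptions, not the published implementation:

```python
import numpy as np

def extract_features(signal, threshold=0.1):
    """Sketch of the seven-feature extraction described in the text.

    ASSUMPTIONS: onset = first sample exceeding 10% of the absolute
    maximum; peaks = local maxima above that threshold; latencies are
    in samples (the paper reports milliseconds).
    """
    sig = np.asarray(signal, dtype=float)
    above = np.abs(sig) > threshold * np.abs(sig).max()
    onset = int(np.argmax(above))                     # first supra-threshold sample
    # end of signal: onset latency of the time-reversed signal
    end = len(sig) - 1 - int(np.argmax(above[::-1]))
    # local maxima above the threshold count as peaks
    peaks = np.where((sig[1:-1] > sig[:-2]) &
                     (sig[1:-1] > sig[2:]) &
                     above[1:-1])[0] + 1
    peak_latency = int(peaks[0]) if len(peaks) else int(np.argmax(sig))
    return {
        "onset_latency": onset,
        "peak_latency": peak_latency,                 # latency of the first peak
        "end_of_signal": end,
        "maximum": float(sig.max()),
        "minimum": float(sig.min()),
        "auc": float(np.abs(sig).sum()),              # rectified area, discrete approximation
        "n_peaks": int(len(peaks)),
    }
```

The returned dictionary has exactly the seven dimensions per signal referred to later in Sect. "Training the classifiers".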
Training the classifiers
- Four muscles simultaneously (APB versus EXT versus TA versus AH)
- Within-upper-extremity comparison (EXT versus APB)
- Across-extremities comparison (EXT versus TA)
- Raw, unprocessed data (dimensions per signal: 1650)
- PCA-reduced data (reduced to capture 95% of the variance of the data; dimensions per signal: 20–40)
- Feature-extracted data (dimensions per signal: 7; see Sect. "Dimensionality reduction")
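The PCA reduction to 95% explained variance can be expressed directly in scikit-learn, where a fractional `n_components` retains just enough components to reach that variance fraction (the random data below is a stand-in for the raw 1650-dimensional MEP signals; the authors' exact call is not given):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1650))   # stand-in for raw MEP signals, 1650 samples each

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches 95%.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

On the real data this yielded 20–40 dimensions per signal, as stated above.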
Assessing performance
- Accuracy: the number of correct classifications divided by the total number of samples in the test dataset, expressed as a percentage.
- ROC AUC: area under the receiver operating characteristic curve, which plots the true positive rate against the false positive rate.
- F1: a performance metric that combines the precision (positive predictive value) and recall (sensitivity) scores of a model:
$$F1 = \frac{2}{\frac{1}{Recall}+\frac{1}{Precision}} = \frac{2TP}{2TP+FP+FN}$$
where TP stands for true positive, FP for false positive, and FN for false negative. In the case of the 4-muscle comparison, which is a multiclass problem, we used the so-called 'macro' averaging, which determines the F1 score for each label and computes their unweighted mean.
Neurophysiologist task sheets
Results
MEP data
| Muscle | Normalized amplitude | Peak latency (ms) | # Peaks | Normalized AUC |
|---|---|---|---|---|
| EXT | 0.09 ± 0.12 | 17.53 ± 10.14 | 1 ± 1.5 | 0.4 ± 0.6 |
| APB | 0.34 ± 0.36 | 20.01 ± 6.38 | 2 ± 1.1 | 1.02 ± 1.29 |
| TA | 0.20 ± 0.25 | 24.69 ± 9.36 | 2 ± 1.5 | 0.84 ± 1.13 |
| AH | 0.11 ± 0.23 | 31.03 ± 15.94 | 2 ± 1.8 | 2.0 ± 1.83 |
Classification
Four muscles
| Task | Classifier | Raw Acc | Raw F1 | Raw ROC AUC | PCA Acc | PCA F1 | PCA ROC AUC | FE Acc | FE F1 | FE ROC AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Four muscles | RF | 0.83 | 0.72 | 0.9 | 0.75 | 0.67 | 0.88 | 0.77 | 0.71 | 0.9 |
| Four muscles | kNN | 0.71 | 0.64 | 0.75 | 0.69 | 0.6 | 0.74 | 0.7 | 0.66 | 0.86 |
| Four muscles | LogReg | 0.28 | 0.24 | 0.45 | 0.3 | 0.26 | 0.47 | 0.73 | 0.67 | 0.87 |
| EXT vs. APB | RF | 0.89 | 0.88 | 0.94 | 0.84 | 0.84 | 0.91 | 0.83 | 0.83 | 0.9 |
| EXT vs. APB | kNN | 0.79 | 0.79 | 0.79 | 0.76 | 0.76 | 0.76 | 0.79 | 0.79 | 0.85 |
| EXT vs. APB | LogReg | 0.48 | 0.47 | 0.85 | 0.53 | 0.5 | 0.44 | 0.8 | 0.8 | 0.85 |
| EXT vs. TA | RF | 0.97 | 0.95 | 0.98 | 0.92 | 0.9 | 0.96 | 0.88 | 0.84 | 0.94 |
| EXT vs. TA | kNN | 0.89 | 0.85 | 0.85 | 0.89 | 0.84 | 0.84 | 0.87 | 0.82 | 0.87 |
| EXT vs. TA | LogReg | 0.43 | 0.4 | 0.41 | 0.46 | 0.42 | 0.43 | 0.88 | 0.85 | 0.91 |
Two muscles
Benchmarking human performance
Discussion
Data quality and class imbalance
Importance of model and parameter choices
Interpretation of machine learning results
Comparing ML results to human decision making
Potential clinical applications
What is needed to (try to) successfully implement ML in IOM
- Data quality: high-quality recordings and awareness of inherent variability
- Labeling: standardized protocols and clear labeling rules [44]. This should also include tracking sources of variance during the surgical procedure (e.g., expected and unexpected noise such as cautery, drilling, or anesthesia processes) and biases of data collection (e.g., only upper-extremity MEPs for threshold reasons)
- Adequate quantification of the outcome: defined, standardized outcome scores at defined postoperative time points [10], and outcome scores for the ML task, to limit interpretation bias
- More data [45]: pooled data from multiple centers
- Understanding data: exploratory data analysis to determine how the data are distributed, analysis of imbalances in both features and labels, and extraction of meaningful information to feed into ML, in particular to address, to some extent, the "black box" nature of ML [46].
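The label-imbalance check mentioned in the last point is a few lines of standard Python; the class counts below are invented for illustration, not the study's actual sample sizes:

```python
from collections import Counter

# Hypothetical per-muscle label counts -- NOT the study's real distribution.
labels = ["APB"] * 120 + ["EXT"] * 45 + ["TA"] * 60 + ["AH"] * 15

counts = Counter(labels)
total = len(labels)
for muscle, n in counts.most_common():
    print(f"{muscle}: {n} ({n / total:.1%})")

# Imbalance ratio: largest class over smallest class; large values
# suggest macro-averaged metrics and/or augmentation are needed.
imbalance = max(counts.values()) / min(counts.values())
```

Running such a check before training makes imbalance visible early and motivates choices like the macro F1 averaging used above.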