Assessment of symptom characteristics and cardiovascular risk factors
In all patients, a standardized history was taken in the ED, including the following features: symptom quality (vertigo, dizziness, double vision), symptom onset (acute, lingering), symptom duration (10–60 min, > 60 min), symptom intensity (by visual analogue scale), preceding triggers (yes, no), accompanying features (ear symptoms, central neurological symptoms), and CVRF (diabetes, high blood pressure (> 140 mmHg), nicotine abuse, atrial fibrillation, family history, prior stroke or myocardial infarction). Health-related quality of life and functional impairment were assessed by questionnaires: European Quality of Life Score—5 dimensions—5 levels (EQ-5D-5L), including subscores for anxiety, pain, activity, self-care, and mobility (each ranging from 1 to 5, with 5 indicating worst impairment) [16], EQ visual analogue scale (EQ-VAS) (ranging from 0 to 100, with 100 being the best status), Dizziness Handicap Inventory (DHI) (ranging from 0 to 100 points) [17], and modified Rankin scale (mRS) (ranging from 0 to 6 points).
Quantitative assessment of vestibular, ocular motor and postural functions
Videooculography (VOG): Vestibular and ocular motor signs were documented by VOG (EyeSeeCam®, EyeSeeTec GmbH, Munich, Germany) during the acute stage of symptoms, including nystagmus in straight-ahead position (slow phase velocity (SPV) (°/sec), amplitude (°), horizontal and vertical component, with and without fixation), horizontal vestibulo-ocular reflex (VOR) function by vHIT (gain, presence of refixation saccades), gaze-evoked nystagmus (SPV (°/sec), horizontal and vertical component, lateral and vertical gaze positions), saccades (velocity (°/sec), horizontal and vertical direction), smooth pursuit (gain, horizontal and vertical direction), fixation suppression of the VOR (gain, horizontal direction), and skew deviation (cover test in six gaze positions). VOR gain was rated as pathological for values < 0.7. Suppression of spontaneous nystagmus (SPN) was rated positive if the horizontal or vertical component of the SPV decreased by at least 40% on fixation.
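The two VOG rating rules above (pathological VOR gain and positive fixation suppression of SPN) can be sketched as simple predicates; the function and parameter names here are hypothetical and not taken from the study's code:

```python
def vor_gain_pathological(gain, cutoff=0.7):
    """VOR gain below the 0.7 cutoff is rated pathological."""
    return gain < cutoff

def fixation_suppresses_spn(spv_without_fixation, spv_with_fixation,
                            min_decrease=0.40):
    """Suppression of spontaneous nystagmus is rated positive if the
    SPV component decreases by at least 40% on fixation."""
    if spv_without_fixation == 0:
        return False
    decrease = (abs(spv_without_fixation) - abs(spv_with_fixation))
    return decrease / abs(spv_without_fixation) >= min_decrease
```

For example, an SPV of 10°/sec without fixation dropping to 5°/sec with fixation (a 50% decrease) would be rated as positive suppression.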
Testing of subjective visual vertical (SVV): The SVV was measured by the bucket test method as described previously [18, 19]. Ten repetitions (5 clockwise/5 counterclockwise rotations) were performed and the mean of the deviations was calculated. The normal range was defined as 0 ± 2.5° [19].
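The SVV evaluation reduces to averaging the deviations over the ten rotations and checking the result against the 0 ± 2.5° normal range; a minimal sketch (function name hypothetical):

```python
def svv_deviation(cw_trials, ccw_trials, normal_range=2.5):
    """Mean SVV deviation (degrees) over 5 clockwise and 5
    counterclockwise bucket rotations; values outside 0 +/- 2.5 deg
    are rated abnormal. Returns (mean deviation, is_abnormal)."""
    trials = list(cw_trials) + list(ccw_trials)
    mean_dev = sum(trials) / len(trials)
    return mean_dev, abs(mean_dev) > normal_range
```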
Posturography: A posturographic measurement of body sway was performed using a mobile device (Wii Balance Board®, Nintendo Co. Ltd., Kyoto, Japan). Four conditions were tested: bipedal standing with eyes open/closed, upright tandem standing with eyes open/closed. For each condition, the sway pattern, normalized path length, root mean square, and peak-to-peak values in medio-lateral and anterior–posterior direction were analyzed.
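The sway parameters named above (normalized path length, root mean square, and peak-to-peak in the medio-lateral and anterior–posterior directions) can be computed from a center-of-pressure trace as sketched below; the function name, sampling rate, and array layout are illustrative assumptions, not the study's implementation:

```python
import numpy as np

def sway_metrics(cop, fs=50.0):
    """cop: (N, 2) array of center-of-pressure samples, columns =
    (medio-lateral, anterior-posterior). fs: sampling rate in Hz.
    Returns (path length normalized by recording duration,
             per-axis RMS about the mean, per-axis peak-to-peak)."""
    cop = np.asarray(cop, dtype=float)
    duration = len(cop) / fs
    steps = np.diff(cop, axis=0)                      # sample-to-sample displacement
    path_length = np.sum(np.sqrt((steps ** 2).sum(axis=1)))
    centered = cop - cop.mean(axis=0)
    rms = np.sqrt((centered ** 2).mean(axis=0))       # (ML, AP)
    ptp = cop.max(axis=0) - cop.min(axis=0)           # (ML, AP)
    return path_length / duration, rms, ptp
```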
Classification methods
We prospectively evaluated two established diagnostic index tests, the HINTS and ABCD2 clinical scores for stroke detection, to establish a baseline classification performance. We compared these baselines against the performance of various modern machine-learning techniques. The latter learn the mapping of 305 input features (from history taking, questionnaires, and instrumentation-based examinations) to the output class of stroke vs. peripheral AVS. The classification performance is quantified with three diagnostic test measures [20], namely the area-under-the-curve of the receiver-operating-characteristic (ROC-AUC), accuracy, and F1-score, defined as:
$${\text{Accuracy}}=\frac{TP+TN}{N}$$
$$\text{F1-score}= \frac{2\cdot {\text{precision}}\cdot {\text{recall}}}{{\text{precision}}+{\text{recall}}};\quad {\text{precision}}=\frac{TP}{TP+FP};\quad {\text{recall}}=\frac{TP}{TP+FN}$$
Here, TP/TN/FP/FN indicate the number of true-positive/true-negative/false-positive/false-negative detections, respectively, and N indicates the number of test samples overall. The established diagnostic index tests and each of the machine-learning techniques are described briefly in the following.
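These three measures are available in scikit-learn (the library the paper uses for LR/RF/ANN); a small worked sketch on toy labels, where the ROC-AUC is computed from the soft posterior scores rather than the binarized predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]                 # 1 = stroke, 0 = peripheral AVS
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7]     # model posterior p(stroke | features)
y_pred = [int(p > 0.5) for p in y_prob]     # binarized decisions

acc = accuracy_score(y_true, y_pred)        # (TP + TN) / N
f1 = f1_score(y_true, y_pred)               # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)         # uses the soft scores, not y_pred
```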
HINTS: The HINTS clinical scoring system aggregates a risk score for the detection of vestibular stroke, as proposed in [9]. HINTS constitutes a 3-step examination, based on Head Impulse, gaze-evoked Nystagmus, and Test of Skew. HINTS indicates a central pattern if the horizontal head impulse test is normal, and/or a direction-changing nystagmus in eccentric gaze and/or a skew deviation is detected. Consequently, in our data set we give 1 point per central HINTS item and define a HINTS score cutoff value of ≥ 1 as indicative of vestibular stroke. From this binary value for stroke diagnosis, we compute the detection accuracy and F1-score. Additionally, we perform a receiver-operating-characteristic (ROC) analysis, varying the HINTS cutoff over our data set, to obtain an area-under-the-curve (AUC) score.
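The per-item HINTS scoring with the ≥ 1 cutoff described above can be written as a two-line rule (function names hypothetical):

```python
def hints_score(head_impulse_normal, direction_changing_nystagmus,
                skew_deviation):
    """One point per central HINTS finding: normal horizontal head
    impulse test, direction-changing gaze-evoked nystagmus, skew deviation."""
    return (int(head_impulse_normal) + int(direction_changing_nystagmus)
            + int(skew_deviation))

def hints_indicates_stroke(score, cutoff=1):
    """Score >= 1 is rated as indicative of vestibular stroke."""
    return score >= cutoff
```

Varying `cutoff` over 0–3 across the data set traces out the ROC curve from which the AUC is obtained.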
ABCD2: ABCD2 is an aggregative scoring system for clinical detection of stroke, as proposed in [21] and validated in [22]. ABCD2 is based on the following features: age ≥ 60 years (1 point); blood pressure ≥ 140/90 mmHg (1 point); clinical features: unilateral weakness (2 points), speech impairment without weakness (1 point); duration ≥ 60 min (2 points) or 10–59 min (1 point); and diabetes (1 point). For stroke detection in our study, we consider ABCD2 scores at a cutoff value of ≥ 3. We apply this cutoff to our data set prospectively and obtain the accuracy and F1-score, as well as a ROC-AUC score.
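The ABCD2 point allocation above translates directly into code; the function signature is an illustrative assumption:

```python
def abcd2_score(age, systolic_bp, diastolic_bp, unilateral_weakness,
                speech_impairment, duration_min, diabetes):
    """ABCD2: Age, Blood pressure, Clinical features, Duration, Diabetes."""
    score = 0
    score += 1 if age >= 60 else 0
    score += 1 if (systolic_bp >= 140 or diastolic_bp >= 90) else 0
    if unilateral_weakness:            # weakness takes precedence
        score += 2
    elif speech_impairment:            # speech impairment without weakness
        score += 1
    if duration_min >= 60:
        score += 2
    elif duration_min >= 10:
        score += 1
    score += 1 if diabetes else 0
    return score                       # 0..7; >= 3 rated as stroke here
```

For example, a 65-year-old diabetic patient with blood pressure 150/85 mmHg, speech impairment without weakness, and 45 min symptom duration scores 1 + 1 + 1 + 1 + 1 = 5, above the ≥ 3 cutoff.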
Logistic Regression (LR): In descriptive statistics, LR is used to report the goodness-of-fit of a linear set of equations mapping a set of input features (i.e., observations) to a binary descriptor variable (e.g., a stroke indicator variable). In this work, we use LR in a prospective/predictive manner. We regularize LR with a combined L1 and L2 loss, which allows learning of a Lasso-like sparse model while still maintaining the regularization properties of a ridge classifier [23, 24]. The balancing ratio between the L1 and L2 losses is optimized during learning as a hyper-parameter. After fitting the LR parameters to samples in a training set, we apply the fitted model to samples in a holdout test set to obtain a logistic posterior probability of stroke. We binarize the soft decision output of LR at a posterior probability \(p\left({\text{stroke}}|{\text{features}}\right)>0.5\), from which accuracy and F1-score are calculated. The AUC value is obtained by an ROC analysis on the probabilistic predictions for all samples.
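In scikit-learn, the combined L1/L2 regularization corresponds to the elastic-net penalty, with the balancing ratio (`l1_ratio`) tunable as a hyper-parameter; a sketch on synthetic stand-in data (the actual study uses the 305 clinical features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the 305-feature clinical data set
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 305))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=120) > 0).astype(int)

# elastic net = combined L1 (Lasso-like sparsity) + L2 (ridge) penalty;
# the L1/L2 balance l1_ratio is tuned as a hyper-parameter
lr = LogisticRegression(penalty="elasticnet", solver="saga",
                        l1_ratio=0.5, max_iter=5000)
search = GridSearchCV(lr, {"l1_ratio": [0.1, 0.5, 0.9]}, cv=3)
search.fit(X, y)

p_stroke = search.predict_proba(X)[:, 1]   # logistic posterior p(stroke|features)
y_pred = (p_stroke > 0.5).astype(int)      # binarized decision
```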
Random Forest (RF): RF bundles an ensemble of decision tree (DT) models to compensate for tree overfitting [25] by vote aggregation [26]. In this work, we tune the number of DTs within the range of 5 to 50 trees towards optimal prediction performance. Due to the vote aggregation from the ensemble, an RF yields a probabilistic posterior. Accuracy, F1-score, and ROC-AUC are calculated on this posterior.
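Tuning the ensemble size over the stated 5–50 range and reading off the vote-aggregated posterior can be sketched as follows (grid values and toy data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = (X[:, 0] > 0).astype(int)

# tune the number of decision trees within 5..50
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [5, 10, 25, 50]}, cv=3)
search.fit(X, y)

# vote aggregation over the ensemble yields a probabilistic posterior
posterior = search.predict_proba(X)[:, 1]
```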
Artificial neural network (ANN): Computer-aided diagnosis has advanced due to the application of machine-learning techniques [27]. In particular, our own previous work [28–30], as well as numerous works in the related literature [31], have demonstrated the effectiveness and modeling flexibility of ANNs for computer-aided diagnosis in medicine. Here, we apply a multilayer perceptron (MLP) with 305 input neurons, two hidden layers (128 and 64 neurons, respectively), and two softmax-activated output neurons for classification. Due to the softmax activation at the output layer, our ANN also yields a probabilistic posterior, allowing the calculation of accuracy, F1-score, and ROC-AUC.
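A comparable 305→128→64→2 MLP can be set up in scikit-learn as sketched below (toy data; for two classes, scikit-learn's single logistic output is mathematically equivalent to the two-neuron softmax output described above):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 305))        # stand-in for the 305 input features
y = (X[:, 0] > 0).astype(int)

# two hidden layers with 128 and 64 neurons; the output is a logistic
# unit, equivalent to a two-neuron softmax for binary classification
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500,
                    random_state=0)
mlp.fit(X, y)
posterior = mlp.predict_proba(X)[:, 1]   # probabilistic posterior
```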
Geometric matrix completion (GMC): Geometric deep learning [32] is a novel field of deep learning and has been introduced for computer-aided diagnosis in medicine only recently [33]. In previous work, we have shown that it is advantageous to construct multiple population graphs from meta-features of patients [34, 35]. We further proposed GMC [36] (denoted in the following as SingleGMC) to alleviate the common problem of missing values in medical data sets [37]. Recently, we have combined these ideas into multi-graph matrix completion (MultiGMC) [38]. Here, we apply both the original SingleGMC approach [36] and MultiGMC to our data set. In SingleGMC, we use a single graph constructed from age and ABCD2 scores. Graph connections are calculated based on similarity measures using age (age difference ± 5 years) and ABCD2 scores (± 1 score). For SingleGMC, the graph connectivity is the sum of these similarity measures. In MultiGMC, instead of taking the sum, we use them as two separate graphs. We learn separate patient representations within these two graphs (a single spectral convolutional layer per graph) and aggregate them via self-attention before computing the classification posterior [38]. The calculation of accuracy, F1-score, and ROC-AUC is performed as for LR/RF/ANN.
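The population-graph construction described above (edges for age difference ≤ 5 years and ABCD2 difference ≤ 1, summed for SingleGMC, kept separate for MultiGMC) can be sketched as follows; this covers only the graph-building step, not the spectral convolution or matrix completion itself:

```python
import numpy as np

def similarity_graphs(ages, abcd2, age_tol=5, score_tol=1):
    """Build two binary patient-similarity graphs: an edge if
    |age_i - age_j| <= 5 years, and an edge if |ABCD2_i - ABCD2_j| <= 1.
    SingleGMC uses their sum as connectivity; MultiGMC keeps them
    as two separate graphs."""
    ages = np.asarray(ages, dtype=float)[:, None]
    abcd2 = np.asarray(abcd2, dtype=float)[:, None]
    A_age = (np.abs(ages - ages.T) <= age_tol).astype(float)
    A_score = (np.abs(abcd2 - abcd2.T) <= score_tol).astype(float)
    np.fill_diagonal(A_age, 0)      # no self-loops
    np.fill_diagonal(A_score, 0)
    return A_age, A_score, A_age + A_score   # last: SingleGMC connectivity
```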
The models LR, RF, and ANN were based on implementations in the scikit-learn machine-learning library [39], while GMC [36] and MultiGMC [38] are custom implementations based on PyTorch [40].
Statistical analysis
Unlike HINTS and ABCD2, which are evaluated prospectively on the entire data set, training the machine-learning models on the entire data set would result in overfitting and an overly optimistic performance estimate. Instead, we split the data into a training set and a test set to obtain a prospective classification performance for our investigated models. All machine-learning-based classification results were thus obtained following a rigorous ten-fold cross-validation scheme [41], with stratified label sampling to account for class imbalance and a data split ratio of 90% training vs. 10% testing data. To perform hyper-parameter tuning for all methods, we monitored the tuned model performances on a withheld validation set (10% of the training set). We compared the best-performing model to the other four models in terms of classification accuracy by pair-wise, two-tailed, non-parametric hypothesis tests (Wilcoxon signed-rank test) at a significance level of p < 0.05.
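The stratified ten-fold scheme with a nested 10% validation split and the per-fold Wilcoxon comparison can be sketched as follows; the toy data and the baseline comparator (a majority-class dummy model standing in for a second classifier) are illustrative:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import wilcoxon

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

acc_lr, acc_dummy = [], []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 90/10 split
for train_idx, test_idx in skf.split(X, y):
    # withhold 10% of the training fold as a validation set for tuning
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.1,
        stratify=y[train_idx], random_state=0)
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
    acc_lr.append(accuracy_score(y[test_idx], lr.predict(X[test_idx])))
    acc_dummy.append(accuracy_score(y[test_idx], dummy.predict(X[test_idx])))

# pair-wise, two-tailed, non-parametric test on the per-fold accuracies
stat, p = wilcoxon(acc_lr, acc_dummy)
```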
Furthermore, to make the results of the machine-learning classifiers more explainable, we used the RF classifier to compute which features contribute the most towards the detection of stroke. Such an analysis constitutes a fundamental technique in the domain of machine-learning interpretability [42]. Feature importance was calculated according to the Mean Decrease in Impurity (MDI) measure [43], as implemented in scikit-learn [39]. We ranked the discriminative power of features by sorting the MDI coefficients and report the top 10 most important features utilized by the RF during classification. For these features, univariate analysis of quantitative values was performed for patients with vestibular stroke and vestibular neuritis (% for categorical variables, mean ± SD for continuous variables). The parameters were compared between groups using either the Chi-square test or the Mann–Whitney U-test at a significance level of p < 0.05.
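In scikit-learn, the MDI coefficients are exposed as the fitted forest's `feature_importances_` attribute; ranking and reporting the top features can be sketched on synthetic data (here only two of eight toy features are informative, which the ranking recovers):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 8))
y = (X[:, 2] + X[:, 5] > 0).astype(int)   # only features 2 and 5 carry signal
names = [f"feature_{i}" for i in range(8)]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
mdi = rf.feature_importances_             # Mean Decrease in Impurity per feature

# sort features by MDI and keep the top 10 most important
top = sorted(zip(names, mdi), key=lambda t: t[1], reverse=True)[:10]
```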