Background
Parkinson’s disease (PD) is the second most common neurodegenerative disorder after Alzheimer’s disease [1], affecting more than 10 million people worldwide [2]. The cardinal features of PD are bradykinesia (slowness of movement), followed by tremor at rest, rigidity, and postural instability [3]. Prevalence of PD increases rapidly over the age of 60 [4], and both the global incidence and the economic costs associated with PD are expected to rise rapidly in the near future [5, 6]. Since its discovery in the 1960s, levodopa has been the gold standard treatment for PD and is highly effective at improving motor symptoms [7]. However, after prolonged levodopa therapy, 40% of individuals develop levodopa-induced dyskinesia (LID) within 4–6 years [8]. LIDs are involuntary movements characterized by a non-rhythmic motion flowing from one body part to another (chorea) and/or involuntary contractions of opposing muscles causing twisting of the body into abnormal postures (dystonia) [9].
To provide optimal relief of parkinsonism and dyskinesia, treatment regimens must be tailored on an individual basis. While PD patients regularly consult their neurologists to inform treatment adjustments, these consultations occur intermittently and can fail to identify important changes in a patient’s condition. Furthermore, the standard clinical rating scales used to record characteristics of PD symptoms require specialized training to perform and are inherently subjective, thus relying on the experience of the rater [10]. Paper diaries have also been used for patient self-reports of symptoms, but patient compliance is low and interpretation of symptoms can differ significantly between patients and physicians [11, 12].
Computerized assessments are an attractive potential solution, allowing automated evaluation of PD signs to be performed more frequently without the assistance of a clinician. The information gathered from these assessments can be relayed to a neurologist to supplement existing clinic visits and inform changes in management. In addition, computerized assessments are expected to provide an objective measurement of signs, and therefore be more consistent than a patient self-report. Computer vision is an appealing modality for assessment of PD and LID: a vision-based system would be completely noncontact and require minimal instrumentation in the form of a camera for data capture and a computer for processing.
To address the inherent subjectivity and inconvenience of current practices in PD assessment, efforts have been made to develop systems capable of objective evaluation of signs. Studies generally involve the recording of motion signals while participants perform tasks from clinical rating scales or execute a predefined protocol of activities of daily living (ADL).
Wearable sensing has thus far been the most popular technology for PD assessment, using accelerometers, gyroscopes, and/or magnetometers to record movements. These sensors are often packaged together as inertial measurement units (IMUs). Keijsers et al. continuously monitored participants during a 35-item ADL protocol and predicted dyskinesia severity in one-minute intervals [13]. Focusing on upper limb movements, Salarian et al. attached gyroscopes to the forearms to estimate tremor and bradykinesia severity [14], while Giuffrida et al. used a custom finger-mounted sensor to estimate the severity of rest, postural, and kinetic tremors [15]. Patel et al. investigated multiple tasks from the Unified Parkinson’s Disease Rating Scale (UPDRS) motor assessment to determine the best tasks and movement features for predicting tremor, bradykinesia, and dyskinesia severity [16]. With a single ankle-mounted IMU, Ramsperger et al. were able to identify leg dyskinesias in both lab and home environments [17]. Delrobaei et al. used a motion capture suit composed of multiple IMUs to track joint angles and generated a dyskinesia severity score that correlated well with clinical scores [18]. Parkinsonian gait has also attracted considerable attention and is the type of gait most studied with wearable sensors [19]. While wearable systems have the potential to be implemented in a discreet and wireless fashion, they still require physical contact with the body. Furthermore, standardization is required regarding the quantity and placement of sensors needed to capture useful movement signals.
In contrast to wearable sensors, vision-based assessment requires only a camera for data capture and a computer for processing. These assessments are noncontact and do not require additional instrumentation to capture more body parts. However, the current state of vision-based assessment for PD and LID is very limited. Multi-colored suits were used for body part segmentation in parkinsonian gait analysis [20, 21], or environments were controlled to simplify extraction of relevant movements [22, 23]. Points on the body have also been manually landmarked in video and tracked using image registration to observe global dyskinesia [24]. More complex camera hardware (e.g. the Microsoft Kinect) can track motion in 3D with depth sensors and has been used to characterize hand movements [25], as well as to analyze parkinsonian gait [26, 27] and assess dyskinesia severity [28] using the Kinect’s skeletal tracking capabilities. Multi-camera motion capture systems can capture 3D movements more accurately by tracking the position of reflective markers attached to the points of interest. While they have been explored in the context of PD [29, 30], their prohibitive costs and complicated experimental setup make them impractical outside of research use.
While human pose estimation in video has been actively studied in computer science for several decades, the recent emergence of deep learning has led to substantial improvements in accuracy. Deep learning is a branch of machine learning built on neural networks. These networks, inspired by simplified models of the brain, are composed of layers of neurons that individually perform basic operations but can be connected and trained to learn complex data representations. One major advantage of deep learning is the automatic discovery of useful features, whereas conventional machine learning approaches use hand-engineered features that require domain knowledge to achieve good performance. Convolutional neural networks (CNNs) are a specific deep learning architecture that takes advantage of inherent properties of images to improve efficiency. Toshev and Szegedy were the first to apply deep learning to pose estimation, framing joint position prediction as a cascaded regression problem with CNNs as regressors [31]. Chen and Yuille took advantage of the representational power of CNNs to learn the conditional probabilities of the presence of body parts and their spatial relations in a graphical model of pose [32]. Wei et al. iteratively refined joint positions by incorporating long-range interactions between body parts over multiple stages of replicated CNNs [33].
The use of deep learning for PD assessment is still in its early stages, although a few recent studies have applied deep learning to the classification of wearable sensor data [34, 35] as well as the extraction of gait parameters [36]. Therefore, an excellent opportunity exists to assess the readiness of deep learning models for vision-based assessment of PD. We have previously shown that features derived from videos of PD assessments using deep learning pose estimation algorithms were correlated with clinical scales of dyskinesia [37]. This paper substantially extends the preliminary results by analyzing additional motor tasks for parkinsonism and by evaluating the predictive power of the chosen feature set.
The key contributions of this paper are as follows:
1. Evaluating the feasibility of extracting useful movement information from 2D videos of Parkinson’s assessments using a general-purpose deep learning-based pose estimation algorithm
2. Extracting features from movement trajectories and training a machine learning algorithm for objective, vision-based assessment of motor complications in PD (i.e. parkinsonism and LID)
3. Determining the accuracy of predicting scores of individual tasks in validated clinical PD assessments using vision-based features, as well as predicting total scores of PD assessments using a subset of the full clinical assessment suitable for video analysis
Results
Binary classification and regression results for the communication and drinking tasks are shown in Table 5, while results for the leg agility and toe tapping tasks are given in Table 6. Errors provided are the standard deviation of results when cross-validation was run multiple times. For binary classification, the number of ratings binarized to the negative class (i.e. “no dyskinesia” or “no parkinsonism”) is denoted by n0 and indicates whether the classification task was well balanced. There are some disparities between the number of videos (Table 1) and the number of samples shown in Tables 5 and 6, as some videos did not have all possible ratings available.
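As a concrete illustration of this setup, the binarization of ordinal severity ratings into a negative class (whose size is reported as n0) can be sketched as follows. The threshold and example ratings here are illustrative assumptions, not the study’s actual values:

```python
# Hypothetical sketch: binarize ordinal severity ratings (e.g. 0-4) into
# absence/presence classes. The threshold is an illustrative assumption.

def binarize_ratings(ratings, threshold=1):
    """Map each rating to 0 (below threshold: 'no dyskinesia'/'no
    parkinsonism') or 1 (at/above threshold: sign present)."""
    return [0 if r < threshold else 1 for r in ratings]

ratings = [0, 0, 1, 2, 3, 0, 1, 4, 0, 2]
labels = binarize_ratings(ratings)
n0 = labels.count(0)   # size of the negative class, reported as n0
print(labels, n0)
```

A well-balanced task is one where n0 is close to half the total number of samples.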
Table 5
Results for communication and drinking tasks (UDysRS)
Communication (n = 128) |
Binary Classification | Neck n0 = 48 | Rarm n0 = 60 | Larm n0 = 54 | Trunk n0 = 60 | Rleg n0 = 57 | Lleg n0 = 59 | Mean |
F1 | 0.941 ± 0.003 | 0.920 ± 0.004 | 0.929 ± 0.014 | 0.960 ± 0.009 | 0.819 ± 0.007 | 0.865 ± 0.007 | 0.906 ± 0.002 |
AUC | 0.935 ± 0.006 | 0.957 ± 0.004 | 0.946 ± 0.005 | 0.983 ± 0.002 | 0.852 ± 0.007 | 0.907 ± 0.005 | 0.930 ± 0.001 |
Regression | Neck | Rarm | Larm | Trunk | Rleg | Lleg | Mean |
RMS | 0.559 ± 0.008 | 0.399 ± 0.008 | 0.465 ± 0.011 | 0.513 ± 0.011 | 0.579 ± 0.009 | 0.590 ± 0.011 | 0.518 ± 0.005 |
r | 0.712 ± 0.017 | 0.760 ± 0.022 | 0.645 ± 0.029 | 0.760 ± 0.024 | 0.522 ± 0.021 | 0.490 ± 0.024 | 0.661 ± 0.011 |
Drinking (n = 118) |
Binary Classification | Neck n0 = 61 | Rarm n0 = 79 | Larm n0 = 81 | Trunk n0 = 60 | Rleg n0 = 70 | Lleg n0 = 66 | Mean |
F1 | 0.711 ± 0.026 | 0.148 ± 0.054 | 0.289 ± 0.068 | 0.643 ± 0.013 | 0.594 ± 0.046 | 0.617 ± 0.020 | 0.500 ± 0.015 |
AUC | 0.774 ± 0.007 | 0.418 ± 0.033 | 0.557 ± 0.015 | 0.687 ± 0.014 | 0.673 ± 0.027 | 0.696 ± 0.012 | 0.634 ± 0.005 |
Regression | Neck | Rarm | Larm | Trunk | Rleg | Lleg | Mean |
RMS | 0.724 ± 0.003 | 0.737 ± 0.005 | 0.575 ± 0.005 | 0.701 ± 0.008 | 0.586 ± 0.008 | 0.622 ± 0.009 | 0.657 ± 0.003 |
r | 0.075 ± 0.008 | −0.150 ± 0.015 | −0.003 ± 0.018 | 0.099 ± 0.020 | 0.087 ± 0.026 | 0.147 ± 0.025 | 0.043 ± 0.008 |
Table 6
Results for leg agility and toe tapping tasks (UPDRS)
Binary Classification | Leg agility: Right n0 = 43 | Left n0 = 36 | Mean | Toe tapping: Right n0 = 39 | Left n0 = 36 | Mean |
F1 | 0.538 ± 0.012 | 0.725 ± 0.036 | 0.631 ± 0.022 | 0.755 ± 0.018 | 0.694 ± 0.027 | 0.725 ± 0.019 |
AUC | 0.699 ± 0.017 | 0.842 ± 0.028 | 0.770 ± 0.007 | 0.842 ± 0.006 | 0.704 ± 0.015 | 0.773 ± 0.010 |
Regression | Leg agility: Right | Left | Mean | Toe tapping: Right | Left | Mean |
RMS | 0.648 ± 0.024 | 0.462 ± 0.023 | 0.555 ± 0.013 | 0.614 ± 0.014 | 0.615 ± 0.014 | 0.614 ± 0.009 |
r | 0.504 ± 0.049 | 0.710 ± 0.058 | 0.618 ± 0.029 | 0.383 ± 0.034 | 0.360 ± 0.032 | 0.372 ± 0.022 |
Binary classification of communication task features achieved a mean AUC of 0.930, while drinking task performance had a mean AUC of 0.634. For the leg agility task, the mean AUC was 0.770, while the AUC for the toe tapping task was 0.773. The mean correlation between LID severity predictions and ground truth ratings for the communication task was 0.661, compared to 0.043 for the drinking task. For PD severity predictions, the mean correlations were 0.618 and 0.372 for the leg agility and toe tapping tasks, respectively.
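For reference, the two headline metrics above can be computed from their standard definitions. This is a generic sketch (not the study’s code), using the rank-based formulation of AUC and Pearson’s r:

```python
# Generic implementations of the reported metrics: AUC via the
# Mann-Whitney (rank-sum) formulation, and Pearson's correlation r.
from math import sqrt

def auc(labels, scores):
    """Probability that a randomly chosen positive sample is scored above a
    randomly chosen negative one (ties count one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

For example, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, and a perfectly linear relationship gives r = 1.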
For multiclass classification, the overall accuracy on the communication task was 71.4%. Sensitivity and specificity for each class are provided in Table 7. For predicting the total validated scores on the UDysRS Part III and UPDRS Part III, the results are given in Table 8. The correlation between predicted and ground truth ratings was 0.741 and 0.530 for the UDysRS and UPDRS, respectively.
Table 7
Multiclass classification results for communication task
Class | n | Sensitivity | Specificity |
LID | 26 | 96.2% ± 3.8% | 95.7% ± 0.9% |
Normal | 17 | 9.4% ± 3.2% | 89.7% ± 3.0% |
PD | 34 | 83.5% ± 4.5% | 68.4% ± 1.3% |
Overall Accuracy | 77 | 71.4% ± 2.8% | |
Table 8
Results for prediction of validated scores. UDysRS Part III is predicted using features from the communication and drinking tasks, while UPDRS Part III is predicted using features from the communication, leg agility (all joints) and toe tapping tasks
| UDysRS Part III | UPDRS Part III |
RMS | 2.906 ± 0.084 | 7.765 ± 0.154 |
r | 0.741 ± 0.033 | 0.530 ± 0.026 |
Discussion
The purpose of this study was to determine whether features derived from PD assessment videos using pose estimation could be used for detection and severity estimation of parkinsonism and dyskinesia. Random forest classifiers and regressors were trained for the communication, drinking, leg agility, and toe tapping tasks. The task with the best performance was the communication task. This was not surprising, as it is well known clinically that the communication task elicits involuntary movements [45]. Although the RMS error appears similar for the drinking task, the correlation of 0.043 shows that performance was poor in comparison to the communication task. This is because most ratings for the drinking task were between 0 and 2, emphasizing that both RMS and correlation are necessary to accurately portray performance. However, the mean AUC greater than 0.5 indicates that features from the drinking task still had slight discriminative power for detecting dyskinesia, even though they were inconsistent for measuring its severity. Drinking task arm subscore performance was noticeably worse than for other subscores, likely due to the inability to discern voluntary from involuntary movements, as well as increased occlusion of the upper limbs during movement. Multiclass classification of the communication task had poor sensitivity (< 10%) in detecting normal movements. The class that was best discriminated was LID. Intuitively, the communication task does not prompt participants to move voluntarily; therefore, the slowness or absence of movement in PD and the lack of voluntary movement in the normal class can be confused with each other. This contrasts with the larger involuntary movements present in LID, which are easily identifiable.
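The point that RMS error alone can mask poor predictive value is easy to demonstrate with made-up numbers: a predictor that tracks severity perfectly but with a constant offset and a predictor that always outputs the same value can have identical RMS error, yet only the first carries any severity information (the constant predictor has zero variance, so its correlation with the ratings is undefined/zero):

```python
# Toy illustration (made-up numbers) of why both RMS and correlation are
# needed: identical RMS error, completely different predictive value.
from math import sqrt

def rmse(pred, truth):
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

truth = [0, 1, 0, 1, 0, 1, 0, 1]      # toy ratings in a narrow range
offset = [t + 0.5 for t in truth]     # perfectly correlated (r = 1)
constant = [0.5] * len(truth)         # zero variance: no severity information
print(rmse(offset, truth), rmse(constant, truth))   # both 0.5
```

Reporting only the RMS of 0.5 would make the two predictors look equivalent.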
Although only features from a subset of the full assessments were used to predict the total UPDRS Part III and UDysRS Part III scores, predictions had moderate to good correlation with total scores. This implies that this technology could use an abbreviated version of these clinical scales, although further analyses with a larger population would be required for validation. Previous studies have used measures derived from simple tasks such as the timed up and go [46] and a touchscreen finger tapping and spiral drawing test [47] to achieve moderate to good correlation with the total UPDRS Part III score. While the RMS error for the total UPDRS Part III appears much larger than the RMS error for the UDysRS Part III, this is consistent with the range of possible values for each scale: the UPDRS Part III has a range of 0–112 compared to the UDysRS Part III’s range of 0–28. It may be possible to improve performance on task subscores by using joints from the entire body, as motor complications in one part of the body are likely to be correlated with motor complications elsewhere. However, these correlations would be unlikely to generalize across a population, as each person’s PD manifests differently. Likewise, only features extracted from a specific task were used for predicting that task’s rating, despite a possible performance boost from using additional task features. Each task was included in its respective rating scale to capture different facets of motor complications, and the correlations between these tasks would be unique to each individual.
No explicit feature selection was performed despite the large number of features relative to samples. Although the random forest algorithm is generally resistant to overfitting, feature selection can often still prune uninformative features. However, after evaluating several feature selection methods, no performance boost was observed compared to applying random forest with all features. Dimensionality reduction methods were not tested, as feature transformation would reduce interpretability and make further analysis more difficult. Likewise, more complex algorithms that learn feature representations were not considered, as the discovered features may not have been clinically useful. While the emphasis of this analysis was on model accuracy, the parity of performance even after feature selection indicates that future models could be built with comparable performance and a smaller set of features. Identification of features that consistently perform well or poorly is the next step towards deployment of more lightweight models.
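As a sketch of the kind of univariate feature selection that could be evaluated in this setting (the specific methods tested in the study are not detailed here), features can be ranked by the magnitude of their correlation with the clinical rating and the top k retained:

```python
# Hypothetical univariate feature selection: rank features by |Pearson r|
# with the target rating and keep the k highest-scoring ones.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def top_k_features(X, y, k):
    """X: list of samples, each a list of feature values. Returns indices of
    the k features most correlated (in magnitude) with y."""
    n_feats = len(X[0])
    scores = [abs(pearson_r([row[j] for row in X], y)) for j in range(n_feats)]
    return sorted(range(n_feats), key=lambda j: scores[j], reverse=True)[:k]
```

Keeping such a filter interpretable (each retained feature is an original movement feature) is consistent with the study’s decision to avoid feature transformations.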
The use of 2D pose estimation was motivated by visual inspection of motor complications during Parkinson’s assessments and observation of gross movements. It was hypothesized that 2D pose estimation would extract movement information accurate enough to infer the severity of motor complications. While the results indicate that features derived from CPM pose estimation could capture clinically relevant information from videos, this serves only as an indirect measure of the accuracy of pose estimation. In preliminary testing, a benchmark composed of video frames from the dataset was used to assess CPM. All body parts were well detected except for the knees. Knee detection was complicated by the hospital gowns worn by participants, which provided insufficient texture to discern knee location. This means that the involuntary opening and closing motions of the knees were poorly tracked, which may explain why leg subscore predictions were the worst in the communication task. However, ankles were well tracked, so this is not expected to have significantly affected performance on the leg agility task.
As the MPII dataset on which CPM was trained contained images of individuals sitting, the model could generalize to the PD assessment videos. A further evaluation by Trumble et al. supports the accuracy of CPM, as a CPM-based 3D pose estimation with multiple views performed well in comparison to other vision-based and wearable algorithms when validated against motion capture data [48]. The quality of trajectories generated using CPM and the derived features should generalize well to other studies of PD assessments, as the video recording quality is consistent with recommended recording protocols and the videos used for initial validation of the UDysRS [49, 50]. However, the CPM model pre-trained on MPII is limited by its inability to track head turning and does not detect feet and hands. In the future, an improved model could be trained specifically with images more representative of clinical or home environments, as well as augmented datasets that include head orientation and foot and hand positions. Models that impose biomechanical restrictions on joint positioning [51] or integrate video information for 3D pose estimation [52] could also improve performance.
The optical flow-based method for extracting motion from toe tapping took advantage of the foot being anchored by the heel. The algorithm may not be transferable to other applications, as it relied on assumptions about foot location with respect to the ankle. For example, upper body measures of parkinsonism such as hand open/close and pronation/supination often involve significant arm motion and video motion blur, which would not be feasible to track accurately using the optical flow-based method without a more complicated approach. Furthermore, generalizability to other toe tapping applications could be limited by differences in recording conditions. While this toe tapping algorithm cannot be directly evaluated by its accuracy at tracking foot motion, it is possible to compare its relative performance against other studies that have assessed toe tapping. Heldman et al. used an accelerometer heel-clip mounted on the person’s shoe, while Kim et al. used a gyrosensor mounted on the top of the foot [53, 54]. Heldman et al. achieved r = 0.86 with an RMS error of 0.44, and Kim et al. achieved r = 0.72–0.81 for different features when compared against the UPDRS toe tapping score. There is a gap in performance, as the vision-based method presented here is less accurate at tracking the motion. However, accuracy is traded for convenience: the vision-based approach remains easier to use than wearables, as it requires no special hardware or attachment of sensors.
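To make this concrete, the kind of post-processing a 1-D toe displacement trace supports (as optical flow might provide) can be sketched as follows. This is a hypothetical example, not the study’s algorithm: the tapping rate is estimated by counting upward zero crossings of the mean-removed signal:

```python
# Hypothetical post-processing of a 1-D toe displacement signal:
# estimate taps/second from upward zero crossings.
import math

def tap_rate(signal, fps):
    """Count upward zero crossings of the mean-removed signal and divide by
    the recording duration to get taps per second."""
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a < 0 <= b)
    return crossings / (len(signal) / fps)

# Synthetic 2 Hz "tapping" sampled at 30 fps for 3 seconds
fps = 30
signal = [math.sin(2 * math.pi * 2 * t / fps + 0.5) for t in range(3 * fps)]
print(tap_rate(signal, fps))   # → 2.0
```

A real signal would first need smoothing and amplitude thresholding to reject noise and partial movements, which is where recording conditions begin to matter.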
Due to differing experimental conditions and rating scales used in past studies, it is difficult to perform a direct comparison in terms of system performance. The closest study in terms of experimental protocol was that of Rao et al., who analyzed videos of the communication task and tracked manually landmarked joint locations to develop a dyskinesia severity score [24]. They report good correlation between their score and the UDysRS Part IV (single rating of disability) score (Kendall tau-b correlation 0.68–0.85 for different neurologists). Their study used non-rigid image registration for tracking, which could not infer joint positions when occluded and could not recover if a joint position was lost. In contrast, deep learning-based pose estimation learns the structure of the human body from training data and can often make accurate predictions of joint locations even when the joints are not visible. Dyshel et al. leveraged the Kinect’s skeletal tracking to extract movement parameters from tasks from the UPDRS and the Abnormal Involuntary Movement Scale (AIMS) [28]. They trained a classifier to detect dyskinesia with an AUC of 0.906 and quantified dyskinesia severity based on the percentage of a movement classified as dyskinetic. This quantitative measure had good correlation with AIMS scores (general correlation coefficient 0.805). In wearable sensing, Patel et al. reported classification errors of 1.7% and 1.2% for parkinsonism and dyskinesia, respectively, using tasks from the UPDRS [16]. Tsipouras et al. detected dyskinesia with 92.51% accuracy in a continuous recording of multiple ADLs [55]. Eskofier et al. applied CNNs to accelerometer recordings of the pronation/supination and hand movement tasks and achieved a parkinsonism classification accuracy of 90.9% [34]. In our work, the best performance for binary classification of dyskinesia was in the communication task, with an AUC of 0.930. This is comparable with other studies, including those using wearables, although the difficulty of classification is highly dependent on the length of the motion segments to be classified and the type of motion performed. For parkinsonism, the best binary classification performance was for the toe tapping task, with an AUC of 0.773. This is not as high as the dyskinesia classification performance and can likely be attributed to the distribution of ratings. In the communication task, 30–40% of ratings for subscores were at the lower limit of the scale (i.e. 0), whereas for the leg agility and toe tapping tasks this percentage was much smaller (less than 3%). Threshold selection for binarizing scores was based on balancing classes and therefore may not have been optimal with respect to clinical definitions. Ideally, the solution would be to gather sufficient data to represent all ratings and to select thresholds either under clinical supervision or by discovery of an optimal separation between groups.
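The class-balancing threshold choice described above can be sketched as follows (an illustrative implementation with made-up ratings, not the study’s exact procedure): among the candidate cut points, pick the one whose negative/positive split is closest to even:

```python
# Illustrative balance-driven threshold selection for binarizing ratings.

def balanced_threshold(ratings):
    """Return the cut point t (ratings < t -> negative class) that makes the
    two class sizes as equal as possible."""
    candidates = sorted(set(ratings))[1:]   # cutting below the minimum is useless
    def imbalance(t):
        n0 = sum(r < t for r in ratings)
        return abs(n0 - (len(ratings) - n0))
    return min(candidates, key=imbalance)

ratings = [0, 0, 0, 0, 1, 1, 2, 3]          # skewed toward the low end
print(balanced_threshold(ratings))          # → 1 (splits 4 vs 4)
```

When ratings cluster at 0, as in the communication task, this balance-driven cut coincides with the clinical 0-versus-nonzero boundary; when few ratings are 0, as in leg agility and toe tapping, it drifts away from it, which is the concern raised above.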
Limitations
As the videos in this dataset were not captured with subsequent computer vision analysis in mind, there were recording issues that introduced noise, including different camera angles and zoom levels. Despite these concerns, the videos are representative of the quality of videos used by clinicians for PD assessment, and the availability of the data outweighed the unnecessary burden on participants of performing a new experiment. However, manual intervention was required for task segmentation and person localization. For this feasibility study, the videos were of sufficient quality; however, standardization of recording protocols to eliminate camera shake should improve algorithm performance and consistency. Future studies could use deep learning algorithms that take advantage of temporal information in videos for more accurate pose estimation [52]. In addition, CPM’s accuracy for pose estimation was limited by the resolution of the input video (368 × 368). Performance could be improved with algorithms accepting higher resolution video or by applying refinements for subpixel accuracy. Calibrating cameras to a known distance in advance would enable movement amplitudes to be measured in a unit of length comparable to other studies (e.g. metres). Although single-camera systems offer the possibility of convenient, non-contact measurement of PD motor complications, occlusions and the fixed nature of cameras can limit use cases, especially in outdoor environments. Resolving human pose in 3D is also significantly more difficult and less accurate without multiple cameras. The optical flow-based method used for toe tapping has not been validated in the context of foot motion estimation. It will be important to define the scope of applications to mitigate these limitations.
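The calibration idea mentioned above reduces to a single scale factor in the simplest case. A minimal sketch, assuming the movement is roughly parallel to the image plane and at the same depth as a reference object of known length (the numbers are illustrative):

```python
# Minimal pixel-to-metre conversion via a known-length reference object.

def pixels_to_metres(displacement_px, ref_length_m, ref_length_px):
    """Convert a pixel displacement to metres using a reference object of
    known physical length visible at the same depth."""
    metres_per_pixel = ref_length_m / ref_length_px
    return displacement_px * metres_per_pixel

# e.g. a 0.5 m reference spanning 200 px; a 36 px movement amplitude
print(pixels_to_metres(36, 0.5, 200))
```

A proper camera calibration (intrinsics plus distance) would be needed when movement is not planar or depth varies across the frame.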
The recruitment criteria selected individuals with moderate levels of dyskinesia; therefore, the study population reflects only a segment of the patient population. The small sample size should also be increased in follow-up studies to ensure generalizability of results. In addition, a small number of tasks from the UPDRS and UDysRS were not assessed for practical reasons. While adjustments of rating scales are common practice, studies have shown that the UPDRS and UDysRS retain validity despite multiple missing items [56, 57]. Future studies should also include healthy participants as controls.
Regression performance is reported using correlation; however, it is unclear what would be a clinically useful level of agreement. Furthermore, while a high correlation may indicate that a method is able to mimic clinicians, validation based on agreement with clinical ratings does not provide insight into whether such technologies can achieve better sensitivity to clinically important changes than subjective rating scales. Additional investigation is required to compare the sensitivity of the proposed system to validated clinical measures.