4.1 Main findings
We compared different artifact annotation strategies for blood pressure data points captured in an AIMS database. Live-annotated artifacts were frequently not identified as artifacts retrospectively (sensitivity of 0.32). The learning algorithms we subsequently developed to automatically identify artifacts were not able to adequately model artifacts that were annotated during live observation. Although the performance of these algorithms increased when retrospective annotations were modelled, the overall performance remained moderate.
Artifacts in invasive blood pressure measurements have different causes, such as movement or measurement technique artifacts. Some artifacts were short-lasting and harder to pick up retrospectively, while others were longer-lasting (Table 4). For example, movement artifacts according to artifact definition 1 were retrospectively identified as artifacts (definition 3) in only 2% of cases. On the other hand, artifacts according to definition 2 (> 30 s of artifactual signal) were retrospectively identified in 13% of cases. In the present study, the AIMS stored data points as the calculated median of one minute of data. It is therefore plausible that short-lasting artifacts did not result in artifactual data points in the AIMS database that could be identified retrospectively. Nonetheless, we would have expected a larger difference as a result of this effect. In addition, the proportion of retrospectively identified artifacts varied across artifact causes. These differences were likely the result of differences in the information available per situation. For example, data points with systematic errors in blood pressure measurement due to the height of the arterial sensor were easier to identify from the AIMS record than artifacts caused by movement.
In the present study we present different methods to manually define artifacts in AIMS data, and compare these definitions with each other. Others have analyzed differences between artifact annotations, but these comparisons were between different raters who received the same annotation task, i.e. retrospective annotation. The present study shows that, to make research reproducible, it is important to describe not only who annotated data, but also when and how data points were marked as artifacts [7].
We prospectively collected data during a period of twelve weeks. This resulted in a reasonable quantity of observations. Nevertheless, the incidence of artifacts in the present study was quite low (2%), so the amount of data available for the learning algorithms might have been too small. We observed procedures mainly in the maintenance phase of surgery, as we expected it would be more complex to label artifacts precisely during induction and emergence, where many events happen at the same time. The artifact incidence was similar to what was previously found during maintenance in pediatric surgery, which was also lower than during induction or emergence [3]. Furthermore, the type of surgery could have affected the incidence of artifacts: for example, the cohort had a high proportion of neurosurgical procedures, during which patient movement is limited and the surgical field is further away from the blood pressure sensor than in other types of surgery.
In the present study, only one researcher annotated the data retrospectively, which can be considered a limitation. Annotation quality could have been improved if more than one researcher had annotated the data. On the other hand, because differences between raters would also need to be evaluated, an additional annotator would have required considerably more than twice the time we spent thus far. Moreover, the goal of this research was not to compare raters with each other, as this has already been done previously [7].
We used two definitions, based on artifact duration, to translate the live observations into an artifact definition (definitions 1 and 2). Another approach could have been to combine the severity of the artifact, for example the deviation from baseline, with its duration. In theory, short extreme artifacts (e.g. flush events) can affect analysis differently than long but less extreme artifacts (e.g. due to the height of the pressure sensor). In our situation the duration of artifacts was more important, since our system stores the median of 12 consecutive blood pressure measurements. We therefore used only the duration of artifacts, but in other situations this definition might be too limited.
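To illustrate this point, consider a minimal sketch (in Python; the sample values and the five-second sampling interval are illustrative assumptions) of how a median over 12 consecutive measurements suppresses a short but extreme artifact, while a longer-lasting, milder artifact still shifts the stored value:

```python
import statistics

# Hypothetical mean arterial pressure samples (mmHg), one every 5 s,
# i.e. 12 samples per stored one-minute median.
baseline = [78, 80, 79, 81, 80, 79, 80, 78, 81, 80, 79, 80]

# Short, extreme artifact (e.g. a flush event) affecting 2 of 12 samples.
flush = baseline.copy()
flush[5], flush[6] = 180, 175

# Longer, milder artifact (e.g. a mispositioned pressure sensor):
# a systematic +15 mmHg offset on 8 of 12 samples.
offset = [v + 15 if 2 <= i < 10 else v for i, v in enumerate(baseline)]

print(statistics.median(baseline))  # 80.0: the undisturbed stored value
print(statistics.median(flush))     # 80.0: the short artifact is filtered out
print(statistics.median(offset))    # 94.0: the long artifact shifts the value
```

Under such a storage scheme, severity alone cannot make a short artifact visible in the stored data points, which motivates a duration-based definition.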
We used two research assistants for live observations, which could have resulted in differences in the way data were annotated. We observed minor differences in artifact incidence between the two groups of procedures, which were probably due to differences in procedure types (Table 1). The number of artifacts according to retrospective annotation (definition 3) varied in a similar way between these two subgroups (data not presented). Unfortunately, we did not perform double-coded observations to compare the two observers adequately.
We purposefully used only automatically collected physiologic data captured during the anesthetic procedure as the source of features. We made this choice to ensure that the resulting methodology and workflow would be generalizable, even when no data other than vital signs are available. This approach makes the methodology broadly applicable. On the other hand, we tried to model a (human) decision, i.e. manual artifact identification, with limited information, which likely hurt the performance of the learning algorithms. None of the learning algorithms, as presented here, performed well enough to be applied in future research. We showed in a post-hoc analysis that performance could improve by adding additional data points. Nevertheless, the information available to these algorithms was probably still too limited. To explore this further, future research could focus on adding not only more observations but also more features to the model, drawn from data commonly available in research databases. For example, patient characteristics, procedure type, medication administration, or other events around the data point of interest could be added.
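As a sketch of what such feature enrichment could look like (hypothetical Python using pandas; all table and column names are illustrative assumptions, not our actual data model), contextual information could be joined onto the per-data-point vital-sign features:

```python
import pandas as pd

# Hypothetical vital-sign features per stored blood pressure data point.
vitals = pd.DataFrame({
    "case_id": [1, 1, 2],
    "minute": [10, 11, 5],
    "map_mmhg": [80, 94, 62],
    "map_delta": [1.0, 14.0, -3.0],   # change from the previous stored value
})

# Hypothetical contextual data, commonly available in research databases.
cases = pd.DataFrame({
    "case_id": [1, 2],
    "age_years": [54, 7],
    "procedure": ["neurosurgery", "abdominal"],
})
meds = pd.DataFrame({
    "case_id": [1],
    "minute": [11],
    "vasopressor_bolus": [True],      # event near the data point of interest
})

# Join the contextual features onto each data point.
features = (vitals
            .merge(cases, on="case_id", how="left")
            .merge(meds, on=["case_id", "minute"], how="left")
            .fillna({"vasopressor_bolus": False}))
print(features)
```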
4.2 Implications
Before we can say anything about the implications of this study, we first need to consider the definition of an artifact. Is every measurement in an AIMS database that is based on a disturbed or imperfect signal an artifact? Or does a live-observed disturbance in a signal only produce an artifact when the stored data point differs from what we would expect for that patient at that particular time during anesthesia? But in the latter case, how do we define an expected value? These questions show that artifact annotation is a subjective matter, and a question of definition. It is important that researchers report what they considered to be artifacts, even when this process was done manually.
Despite this issue in defining artifacts, artifact annotation could still be automated using learning algorithms. The present study showed that this is not straightforward and might still require a considerable investment of time to collect manually annotated data points for training. We observed around 95 h of anesthesia live, while using retrospective data from 328 h improved the performance of the learning algorithms only marginally. Observing this much data live would have been very labor intensive and likely not feasible. In contrast, retrospectively annotating these data took a single person around three hours using a custom-made registration application. This makes retrospective annotation better suited than live annotation for removing artifacts from research data.
Even though the machine learning algorithms performed poorly in the present study, our approach is still insightful for those who want to apply similar annotation tools and models to their own AIMS data. Future research could focus on improving the performance and developing applications of the methods presented in this paper. To minimize the time spent on manual data collection, we suggest optimizing this process using an active learning strategy. With this strategy, only those data points are annotated that contribute most to the learning algorithm, which could significantly reduce the time spent annotating data [18, 19].
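As a minimal sketch of such a strategy (hypothetical Python using scikit-learn; the pool-based uncertainty sampling shown here is one common active learning approach, not necessarily the exact method of the cited works), the model repeatedly selects the data points it is least certain about for manual annotation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pool of unlabeled data points (rows of vital-sign features)
# with simulated ground-truth artifact labels standing in for a human rater.
X_pool = rng.normal(size=(1000, 5))
y_true = (X_pool[:, 0] + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)

# Seed set: a few manually annotated points from each class.
labeled = list(np.flatnonzero(y_true == 1)[:10]) + \
          list(np.flatnonzero(y_true == 0)[:10])

model = LogisticRegression()
for _ in range(10):  # each round, a human annotates a few selected points
    model.fit(X_pool[labeled], y_true[labeled])
    proba = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(proba - 0.5)       # closest to 0.5 = least certain
    ranked = np.argsort(uncertainty)[::-1]   # most uncertain first
    new = [i for i in ranked if i not in labeled][:5]
    labeled.extend(new)                      # "annotate" the selected points

print(f"annotated {len(labeled)} of {len(X_pool)} data points")
```

In a real workflow the selected points would be presented to a human annotator rather than looked up in simulated labels, so only the most informative fraction of the pool needs manual review.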