Within this panel of recently available ICU ventilators, no technical feature clearly differentiated the devices; conversely, ergonomics and interface features were considered inadequate, thus increasing the risks of misuse and adverse events.
While volume delivery and pressurization accuracy are critical issues, few differences were observed between devices. However, errors and imprecision in volume delivery and pressurization were substantial for some ventilators. In all cases, volume delivery was lower than expected, as already observed [18].
The triggering performance observed in our study is similar to that reported in a previous study concerning emergency transport ventilators [8] and tended to be higher than previously observed [9, 17]. These results may be explained by different respiratory mechanics and BTPS conditions [8]. No device achieved a triggering delay shorter than 50 ms, and the delay exceeded 100 ms for two devices under normal respiratory mechanics conditions. As already described, flow or pressure triggering has not varied significantly over the last decade [7].
During non-invasive ventilation, patient–ventilator asynchronies are frequent [21, 22] and mainly related to leaks around the interfaces and/or overassistance. Our mean asynchrony index was close to that observed in other studies [8, 23]. Non-invasive ventilation algorithms that are implemented in most devices were able to decrease the asynchrony index significantly and might thus be systematically turned on during non-invasive ventilation, in an attempt to limit non-invasive ventilation failure.
Ergonomics assessment
While manufacturers have made considerable efforts to improve technical performance, the increasing complexity of devices may in fact result in design errors. Not only are the devices’ full capabilities underutilized, but their main functions may also often be handled improperly.
Human error has been demonstrated to be a leading cause of morbidity and death during medical care [10, 24‐26]. Many devices have interfaces that are so poorly designed and difficult to use that they can increase the risks associated with the medical equipment and device-induced human error. Human error may be to some extent inevitable and equally caused by human performance and machine performance. In order to limit the number of errors, computing technology and human–machine interfaces should be designed to correspond to human characteristics of reasoning and memory constraints [11]. It is also well known that the working memory of humans is limited and that the number of variables displayed on screens is excessive. This results in a large cognitive load (i.e. mental workload) on the user, which is also a determinant of human error [27]. An interface with a human-centred design increases efficiency and satisfaction and decreases the rate of medical error. While these data are integrated in the ventilators’ interface development by manufacturers, and while ergonomics are as essential as technical performance, very few studies have assessed ventilator ergonomics, and many were limited to timed tasks and subjective evaluation [13‐15].
To the best of our knowledge, this is the first time that such an ergonomics evaluation of ICU mechanical ventilators has been performed, globally integrating the four main dimensions that enable a comprehensive approach to the problem: (1) tolerance to error; (2) ease of use; (3) efficiency; and (4) engagement. Tolerance to error may be directly linked to efficiency, and ease of use to engagement. While all four dimensions may be considered independently, they are in fact related to one another (Fig. 1). Most previous ergonomics evaluations have mainly focused on tolerance to error, while data on the three other dimensions were often missing.
The use of pupil diameter measurement, heart and respiratory rate or tidal volume activation to assess ergonomics is new in the ICU field. Compared to subjective psychological measurements, these are objective data that allow the estimation of the physiological stress induced by a device’s interface and an indirect assessment of the interface’s usability.
The objective task results are often considered the most representative of the ergonomic differences between devices. Although we entirely agree that not all scenarios have the same importance, it is still surprising that some ventilators could not be powered on by a majority of physicians or that the NIV mode could not be easily activated. Excluding the powering on/off tests from the analysis, considering that these may be very different tasks that have been deliberately made difficult by the manufacturers for safety reasons, did not modify the overall results. While it may be one of the main tasks routinely performed, reading the ventilator settings had the worst results of all tasks, probably because of the absence of a homogenized terminology among manufacturers. As already observed in another recent study, the lack of sensitivity of the S1 touch screen was specifically considered by the participants as responsible for an increased mental workload and higher rates of task failures [28]. The physicians praised the Servo-U interface, but the interface also tended to induce high mental workload during specific tasks, thus generating frustration and higher task failure rates.
The pupillary diameter variation is linked to mental workload and is used to assess cognitive skills [29, 30]. However, we must consider the variability related to the light reflex induced by the laboratory environment and the devices themselves [31]. To some extent, this could explain the results from the V500, which has a higher screen luminosity than the other devices. Heart and respiratory rates and/or tidal volume variations are linked to emotional behaviour [32‐34]. The better results that were observed with the Avea can be explained by the fact that this device was well known to all participants. With these parameters, our results for the other devices clearly show differences in how users perceived task completion. Importantly, while the evolution of physiological parameters may not provide results comparable to those obtained with the psycho-cognitive scores, it is consistent with the objective task completion rate results.
The System Usability Scale [20] and NASA Task Load Index [35, 36] are validated psycho-cognitive tools to assess devices’ interface.
The SUS is a very easy scale to administer to participants. It can be used on small sample sizes with reliable results, and it can effectively differentiate between usable and unusable systems. A SUS score above 68 would be considered above average and anything below 68 is considered below average.
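For readers unfamiliar with how a SUS value is obtained, the following is a minimal sketch of the conventional scoring rule for the ten-item questionnaire (odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5). The function names and the example responses are illustrative assumptions and are not taken from the study data; only the benchmark of 68 comes from the text above.

```python
# Illustrative sketch of standard SUS scoring; names and sample data are hypothetical.

SUS_BENCHMARK = 68  # average value quoted in the text


def sus_score(responses):
    """Return the 0-100 SUS score for one participant.

    `responses` holds the ten item answers in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - response
    return total * 2.5


def is_above_average(score, benchmark=SUS_BENCHMARK):
    """True when the score lies above the benchmark value."""
    return score > benchmark


# Hypothetical participant: agrees with positive items, disagrees with negative ones.
example = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example))                    # 75.0
print(is_above_average(sus_score(example)))  # True
```

In a study such as this one, the per-participant scores would then typically be averaged for each device before being compared with the benchmark.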
The NASA-TLX is a flexible, well-established and widely used multidimensional assessment tool that enables quick and easy workload estimation in order to assess a task or a system. It has been used in a great variety of domains and is considered one of the most reliable questionnaires for measuring workload in a healthcare setting. The higher the weighted TLX, the higher the mental workload and the more ‘difficult to use’ the device is. Each individual dimension can also be considered on its own, either those dependent on users’ perception of the task (mental workload, temporal workload and physical workload) or those dependent on the interaction between the subject and the task itself, which may be mostly related to the interface (effort, performance and frustration). Mental demand describes how much mental and perceptual activity is required to perform the task (e.g. thinking, deciding, calculating). Physical demand describes how much physical activity is required (e.g. pushing, pulling, turning). Temporal demand describes how much time pressure is perceived to fulfil the task (was it slow and leisurely? Or rapid and frantic?). Effort describes how hard the subject has to work (mentally and physically) to accomplish the required level of performance. Performance describes how satisfied the subject feels or whether he/she thinks they were successful in accomplishing the goals. Frustration describes how insecure, discouraged or irritated the subject feels after accomplishing the task. The subscale rating decreases inter-/intra-individual variability, thus allowing the number of subjects in the experiment to be reduced.
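The weighted TLX mentioned above can be illustrated with a minimal sketch of the conventional procedure: each of the six dimensions is rated, each dimension receives a weight equal to the number of times it is chosen across the 15 pairwise comparisons, and the weighted score is the sum of rating times weight divided by 15. The 0 to 100 rating scale, the function names and the example values are assumptions made for illustration and do not reproduce the study data.

```python
# Illustrative sketch of weighted NASA-TLX scoring; names and sample data are hypothetical.
from itertools import combinations

DIMENSIONS = ("mental", "physical", "temporal",
              "performance", "effort", "frustration")


def tlx_weights(choices):
    """Count how often each dimension was picked in the 15 pairwise comparisons.

    `choices` maps every pair from combinations(DIMENSIONS, 2) to the
    dimension the participant judged to contribute more to workload.
    """
    pairs = set(combinations(DIMENSIONS, 2))
    if set(choices) != pairs:
        raise ValueError("choices must cover exactly the 15 dimension pairs")
    weights = {dimension: 0 for dimension in DIMENSIONS}
    for pair, winner in choices.items():
        if winner not in pair:
            raise ValueError("the chosen dimension must belong to its pair")
        weights[winner] += 1
    return weights  # the six weights always sum to 15


def weighted_tlx(ratings, weights):
    """Weighted workload score: sum(rating * weight) / 15.

    `ratings` maps each dimension to a rating, assumed here to be on a
    0-100 scale.
    """
    return sum(ratings[d] * weights[d] for d in DIMENSIONS) / 15


# Hypothetical participant who always picks the first dimension of each pair.
choices = {pair: pair[0] for pair in combinations(DIMENSIONS, 2)}
ratings = {"mental": 80, "physical": 20, "temporal": 60,
           "performance": 40, "effort": 70, "frustration": 50}

weights = tlx_weights(choices)
print(weights["mental"], weights["frustration"])  # 5 0
print(weighted_tlx(ratings, weights))             # 54.0
```

Because a dimension that is never chosen receives a weight of zero, it does not contribute to the weighted score, which is one reason why the individual subscales may also be examined on their own.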
Previous studies have shown the influence of experience on SUS scores [37], and the better results of the Avea can clearly be related to the users’ knowledge of and experience with this device and not specifically to a better interface. Given the overall expertise of all the physicians from the five ICUs with this device, it was used in the comparison as a reference value.
When considering both psycho-cognitive assessment tools, two devices (V680 and S1) could be considered as below our reference device in terms of usability and induced mental workload. In terms of usability, all devices except the R860 and the Servo-U were equal to or below a SUS value of 60, far below the acceptable average value of 68, which suggests that, from an ergonomics point of view, considerable work remains to be done to improve the devices’ usability. With regard to the other ventilators, the SUS and NASA-TLX values did not differ, which is consistent with the physiological analyses. Although the devices’ interfaces are globally equivalent, the level of failure observed for some devices, combined with the high induced mental workload and the low usability score, clearly shows that device development is poorly adapted to end users. Considering our results and the impact of tasks on dimensions like performance and effort for some devices, manufacturers might primarily focus on interface simplification and rationalization, immediately providing the most important settings and alarms on a first screen and leaving expert settings to a second one. However, given the heterogeneity of individual physicians, the perfect ventilator may be a difficult goal to achieve, and even with experience, some element of frustration and/or temporal workload may still occur, as with our reference device.
Limitations
As with other bench tests, the main limitation of our study may be the inability to extrapolate our results to the real clinical situation. First, our technical evaluation was performed on a model, which cannot mimic the complexity of all interactions between a patient and a ventilator. The ASL5000 is a simulator and it remains different from patients, mainly because the spontaneous inspiratory profile is not modified by pressurization during the inspiratory phase. However, the bench simulates most other situations and combinations that can be encountered in the clinical field. Second, the objective and subjective ergonomics measurements were assessed under standardized conditions that may be considered as different from real-life conditions. In order to be able to use various physiological sensors during the ergonomics evaluation, we chose not to use a high-fidelity environment with a manikin. We agree that human behaviour under test may be significantly affected by the context and set-up of the experiment. However, since we only included experts, it would have been difficult to reach our experimental goals while also pursuing a greater degree of immersion, which may not be necessary with these types of physicians. A simulated condition may never reproduce all the complexities of the interactions between a patient, a clinician and a ventilator, especially if the tester is an experienced clinician [39]. There are many techniques available for usability evaluation, such as cognitive walk-through, expert reviews, focus groups, Delphi technique, heuristic evaluation or objective timed task completion, all of them providing different information [38]. To the best of our knowledge, our study is the only one to provide a global and complete ergonomics evaluation, taking into account different techniques. Third, we may also consider that the small number of senior ICU physicians included in our study does not enable firm conclusions to be drawn. Given its design, the ergonomics evaluation required a considerable amount of dedicated time from the physicians to undergo the different scenarios and the various measurements performed by the experimental team. Moreover, none of them were familiar with the six tested devices, which exacerbated the difficulty in recruitment. It was therefore unrealistic to use more testers, and this drawback tended to be limited by the use of a device that was known to everyone as a comparison and by the fact that we included physicians from five different ICUs. The pairwise comparison that is performed while using the NASA-TLX also limits inter-/intra-individual variability. Finally, the use of the Avea as a ‘reference’ also highlights a specific limitation of subjective psycho-cognitive scales. The better results of the Avea, with both the SUS and the NASA-TLX, clearly indicate that these values may be highly influenced by previous experience. Such a bias was limited within our evaluation by the fact that, in order to assess ease of use, we only included naive subjects, thereby limiting the impact of such experience on the evaluation.