Background
Auscultation has been considered an integral part of physical examination since the time of Hippocrates. The stethoscope, introduced by Laennec [2] more than two centuries ago, was one of the first medical instruments that enabled internal body structures and their functioning to be examined.
The stethoscope remains a tool that can provide potentially valuable clinical information. However, the results of such examinations are strongly subjective and cannot be shared or communicated easily, mostly because they depend on doctors’ experience and perceptual abilities, which leads to differences in their assessments depending on their specialization (Hafke et al., submitted for publication). Another important issue is the inconsistent nomenclature of respiratory sounds. This problem is widely recognized [1], but to date there is no standardized worldwide classification of the types of phenomena appearing in the respiratory system [10]. Different doctors use a variety of terms for the same sound, and the same term is used to describe different sounds. Lung sounds, as defined by Sovijarvi et al. [14], comprise all respiratory sounds heard or detected over the chest wall or within the chest, including normal breathing sounds and adventitious sounds. In general, respiratory sound is characterized by low-level noise during inspiration and is hardly audible during expiration. The latter is longer than the former [12]. The noise spectrum of normal respiratory sound (typically 50–2500 Hz) is broader over the trachea (up to 4000 Hz) [11].
Adventitious sounds are abnormalities (pathologies) superimposed on normal breathing sounds. They can be divided into two sub-classes depending on their duration: continuous (stationary) sounds, i.e., wheezes and rhonchi, and discontinuous (non-stationary) sounds, i.e., fine or coarse crackles.
Wheezes are continuous tonal sounds with a frequency range from less than 100 Hz to more than 1 kHz and a duration longer than 80 ms [8]. They are generally recognized correctly and rarely misinterpreted, which makes them probably the most easily recognized pathological sound [7]. However, as Hafke et al. (submitted for publication) showed, when describing previously recorded sounds, doctors have difficulty assigning this kind of pathology to the correct breathing phase, i.e., inspiratory wheezes were confused with expiratory wheezes and vice versa.
Rhonchi are continuous, periodic, snoring-like sounds, similar to wheezes but of lower fundamental frequency (typically below 300 Hz), with a duration typically longer than 100 ms [8]. They are one of the most ambiguous classes of pathological sounds, as they are often considered to lie on the boundary between wheezes and crackles (especially of the coarse type) and may therefore be mistaken for either [15]. Although many authors have suggested “rhonchus” as a separate category [10], some doctors use the term “low-pitch wheeze” [6]. Because rhonchi have features of both wheezes and crackles, respondents often classify them differently. As Hafke et al. showed, this depends strongly on the examiner’s experience. Moreover, in the cited research, the advantage of pulmonologists was clearly visible: in their case, the proportion of correct rhonchi detections was 51.2%, while for the other groups this value did not exceed 30%, which was the lowest result for all the phenomena taken into account.
Finally, crackles are short, explosive sounds of a non-tonal character. They tend to appear during both inspiration and expiration. Two categories of this phenomenon have been described: fine and coarse crackles. They differ in typical duration (ca. 5 ms and ca. 15 ms, respectively) and frequency (broad-band) and may appear in different respiratory system disorders [3]. This is why the proper detection and evaluation of crackles is of high importance.
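The nominal characteristics quoted above can be collected in a small lookup structure. The following is only an illustrative sketch: the class and field names are hypothetical, the values are copied from the descriptions in this section, and the structure is a summary, not a classifier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AdventitiousSound:
    """Nominal description of one class of adventitious sound (hypothetical type)."""
    name: str
    continuous: bool      # True for stationary sounds, False for non-stationary ones
    duration_ms: str      # typical duration as quoted in the text
    frequency: str        # frequency description as quoted in the text


# Values taken from the descriptions in this section.
SOUNDS: Tuple[AdventitiousSound, ...] = (
    AdventitiousSound("wheeze", True, "> 80", "from < 100 Hz to > 1 kHz"),
    AdventitiousSound("rhonchus", True, "> 100", "fundamental typically < 300 Hz"),
    AdventitiousSound("fine crackle", False, "ca. 5", "broad-band"),
    AdventitiousSound("coarse crackle", False, "ca. 15", "broad-band"),
)


def names_by_class(continuous: bool) -> List[str]:
    """Return the names of all continuous (or all discontinuous) sounds."""
    return [s.name for s in SOUNDS if s.continuous == continuous]


continuous_sounds = names_by_class(True)
discontinuous_sounds = names_by_class(False)
```

Here `continuous_sounds` lists wheezes and rhonchi, and `discontinuous_sounds` lists the two crackle types, mirroring the two sub-classes described above.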
Auscultation includes the evaluation of sound character, its intensity, frequency, and pathological signals occurring in the breathing sound. Its subjective nature is widely recognized, which has led to a new era of developments, for instance computer-based techniques.
Recordings made with electronic stethoscopes may be further analyzed by a digital system in terms of their acoustic features and, after proper signal processing, delivered to the doctor at an enhanced level of quality or even complemented by a visual representation, e.g., a spectrogram. The latter should be considered as an association between an acoustical signal and its visual representation, and is beneficial to the learning and understanding of those sounds, not only for medical students [13], but also for doctors diagnosing patients.
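As an illustration of what a spectrogram contains, the following minimal sketch computes short-time Fourier transform magnitudes for a synthetic tone using only the Python standard library. It is not the signal processing used in this study: the sampling rate, frame length, hop size, and the 500 Hz test tone are assumptions chosen to keep the example small, and the function name `stft_magnitudes` is hypothetical.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len=128, hop=64):
    """Return one magnitude spectrum (bins 0..frame_len//2) per analysis frame."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT for bin k (slow but dependency-free).
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


SAMPLE_RATE = 4000   # Hz, assumed for the example
FRAME_LEN = 128      # samples per frame, assumed
TONE_HZ = 500        # hypothetical tonal component

# Synthetic, purely tonal test signal.
signal = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(1024)]
spectrogram = stft_magnitudes(signal, frame_len=FRAME_LEN, hop=64)

# The strongest bin of the first frame, converted to Hz.
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
```

In this toy case the energy concentrates in a single frequency row across all frames, which is how a continuous tonal sound appears in a spectrogram; a short broad-band event would instead spread across many bins within only a few frames.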
Currently, the subject of greatest attention in the field of computer-based medicine is neural networks (NNs). NNs are a particularly fast-developing area of machine learning; they learn from examples, as humans do. A decade ago, NNs were one of many available classifiers. They were trained on a small set of high-level features and produced probability scores of a sample belonging to one of several predefined classes. Their popularity rose sharply when it was shown that deeper neuron structures are able to learn intermediate features from low-level representations by themselves. In many artificial intelligence (AI) tasks, including audio signal analysis and medicine, these intermediate features learned by the NN are much more distinctive and descriptive than hand-crafted features.
Contemporary deep neural networks (DNNs) operate on raw signals directly and are therefore able to identify and exploit all the important dependencies that these signals provide. To do so, however, they need a large number of training examples. Once this requirement is met, the NN algorithm is able to match or even surpass human performance. This is also believed to be the best strategy for dealing with respiratory sounds.
Therefore, the aim of this study was to compare the efficiency of AI and a group of five physicians in identifying respiratory sounds in four main classes of pathological signals, according to [10]: wheezes (with no differentiation into sub-classes), rhonchi, and coarse and fine crackles.
Analysis
Results
A GS was used as the point of reference (100%) for the tagging of recordings performed by the doctors and the NN. Confusion matrices could therefore be analyzed: the values of recall (the proportion of actual positives that are correctly identified as such, also called sensitivity), precision (the fraction of relevant instances among the retrieved instances), specificity (the proportion of actual negatives that are correctly identified), and the F1-score (the harmonic mean of precision and recall) were computed for the phenomena detected by the doctors and by the NN in comparison with the GS. First, the chi-square test (α = 0.05) was performed to investigate whether there is a difference between the data gathered for the doctors and for the NN. The null hypothesis of no difference was rejected for all four phenomena. Therefore, the results gathered for the doctors and the NN are statistically different. Detailed results are presented in Table 2.
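The chi-square comparison can be sketched as follows. The text does not state how the contingency tables were built, so this example assumes a 2×2 table (rater × agreement with the GS) and uses hypothetical counts; the function name `chi_square_2x2` is also hypothetical. It computes the Pearson statistic without continuity correction and the p-value for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, NOT data from this study:
# rows = (doctors, NN), columns = (agrees with GS, disagrees with GS).
counts = [[60, 40],
          [80, 20]]

ALPHA = 0.05
stat, p_value = chi_square_2x2(counts)
reject_null = p_value < ALPHA
```

With one degree of freedom, rejecting at α = 0.05 is equivalent to the statistic exceeding the critical value of about 3.841.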
Table 2
Juxtaposition of recall (sensitivity), precision, specificity, and F1-score (all in %) for doctors (pediatricians) and NN
Phenomenon | Recall, doctors | Recall, NN | Precision, doctors | Precision, NN | Specificity, doctors | Specificity, NN | F1-score, doctors | F1-score, NN |
Coarse crackles | 56.1 | 56.1 | 34.6 | 40.7 | 84.6 | 88.2 | 42.8 | 47.1 |
Fine crackles | 72.3 | 83.9 | 39.5 | 52.5 | 69.8 | 79.3 | 51.1 | 64.6 |
Wheezes | 58.1 | 78.2 | 66.1 | 57.7 | 90.7 | 82.2 | 61.8 | 66.4 |
Rhonchi | 67.3 | 87.6 | 55.9 | 61.1 | 85.3 | 84.6 | 61.0 | 72.0 |
Mean | 63.5 | 76.5 | 49.0 | 53.0 | 82.6 | 83.6 | 54.2 | 62.5 |
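The four measures reported in Table 2 follow directly from the entries of a binary confusion matrix. The sketch below applies the definitions given above; the function name and the example counts are hypothetical and are not taken from the study's data.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute recall, precision, specificity, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives, and true negatives relative to the reference tagging.
    """
    recall = tp / (tp + fn)            # sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for one phenomenon, for illustration only.
metrics = detection_metrics(tp=30, fp=20, fn=10, tn=140)
```

For these example counts, recall is 0.75, precision 0.6, specificity 0.875, and the F1-score about 0.667. Because F1 ignores true negatives, a rater can have high specificity and still a low F1-score when the phenomenon is rare.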
The lowest F1-score was observed for coarse crackles, both for the doctors’ descriptions and for the NN. This may be partially due to the rare occurrence of coarse crackles in the analyzed database (see Table 1). Moreover, this kind of phenomenon is often confused with other types of crackles or rhonchi (Hafke et al., submitted for publication), so its correct detection might be problematic. However, it is important to note that the NN F1-score, which reflects its performance in correct phenomena detection, is higher than that of the medical descriptions (47.1% vs. 42.8%).
The highest F1-scores were obtained for rhonchi and wheezes (both continuous, “musical” sounds). For rhonchi, the doctors’ descriptions reach an F1-score of 61.0% relative to the GS, while the NN is much more accurate at 72.0%. This supports the view that rhonchi are ambiguous in character, which results in poor detection performance by the doctors (probably caused by mistaking them for other phenomena, as evidenced by the doctors’ low precision and recall (sensitivity) values compared with the NN).
When it comes to wheezes, despite the slightly lower values of precision and specificity noted for the NN, its final performance, expressed as the F1-score, is better than that of human tagging: 66.4% for the NN vs. 61.8% for the doctors.
It can also be noted that the AI-based analysis is more accurate in detecting rhonchi and wheezes. This may be because it is based mainly on spectrograms, which accurately reflect the tonal content of a recording. The doctors’ descriptions are based mainly on acoustical cues, while the visual representation is used rather as an additional, supporting tool. This may be an important factor influencing the proper detection of pathology, especially when a phenomenon is of an ambiguous nature (e.g., rhonchi) or is accompanied by louder sounds that make it barely audible (e.g., silent wheezes).
The biggest difference in F1-scores, indicating a marked predominance of the new automatic system over the doctors, is observed for fine crackles: 64.6% vs. 51.1%. All of the other parameters are also higher for the NN.
Generally, for each of the four phenomena, the F1-score for the NN is higher than for the doctors, by an average of 8.4 percentage points (p.p.), which clearly indicates the advantage of the tested algorithm over the group of doctors. The NN is on average 13 p.p. more sensitive and 4 p.p. more precise than the reference group of pediatricians.
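The average differences quoted above can be checked against Table 2 with a few lines of arithmetic. This sketch assumes that the eight numeric columns of Table 2 are doctors/NN pairs for recall, precision, specificity, and F1-score, in that order; this reading reproduces the F1 values quoted in the text. The dictionary layout and the function name `mean_gain` are hypothetical.

```python
# (doctors, NN) pairs in percent, transcribed from Table 2 under the
# column reading stated above.
TABLE_2 = {
    "coarse crackles": {"recall": (56.1, 56.1), "precision": (34.6, 40.7),
                        "specificity": (84.6, 88.2), "f1": (42.8, 47.1)},
    "fine crackles":   {"recall": (72.3, 83.9), "precision": (39.5, 52.5),
                        "specificity": (69.8, 79.3), "f1": (51.1, 64.6)},
    "wheezes":         {"recall": (58.1, 78.2), "precision": (66.1, 57.7),
                        "specificity": (90.7, 82.2), "f1": (61.8, 66.4)},
    "rhonchi":         {"recall": (67.3, 87.6), "precision": (55.9, 61.1),
                        "specificity": (85.3, 84.6), "f1": (61.0, 72.0)},
}


def mean_gain(metric):
    """Average NN-minus-doctors difference for one metric, in percentage points."""
    gains = [nn - doctors
             for doctors, nn in (row[metric] for row in TABLE_2.values())]
    return sum(gains) / len(gains)


f1_gain = mean_gain("f1")                # about 8.35 p.p.
recall_gain = mean_gain("recall")        # about 13.0 p.p.
precision_gain = mean_gain("precision")  # about 3.98 p.p.

# The phenomenon with the largest F1 difference between the NN and the doctors.
largest_f1_gap = max(TABLE_2, key=lambda name: TABLE_2[name]["f1"][1] - TABLE_2[name]["f1"][0])
```

The computed averages are consistent with the rounded figures of 8.4, 13, and 4 p.p. given in the text, and the largest per-phenomenon F1 difference is found for fine crackles.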
Discussion
The main goal of this research was to investigate how effectively pathological respiratory sounds are detected by doctors and by the automatic analysis system based on the NNs developed by the authors.
To measure performance, the GS was established as a set of 522 recordings taken from the respiratory system of 50 pediatric patients and gathered during auscultation using electronic stethoscopes in real situations. Since auscultation tends to be subjective and there is no objective measure of correctness, those recordings were then tagged (described) by doctors and experienced acousticians in terms of pathological phenomena content. The recordings with consistent taggings were taken as a point of reference. The inconsistent ones were described by a consilium (two experienced pediatricians and one acoustician). Only positively verified recordings were used in the next steps of the experiment. In this way, a very reliable GS was established, which served as the point of reference for evaluating and comparing the descriptions of both the doctors and the newly developed NN.
Since the statistical analysis showed that the performance of the two groups (the doctors and the NN) differs significantly, it is reasonable to state that the ML-based analysis using the NN algorithm introduced here is more efficient in detecting all four pathological phenomena (wheezes, rhonchi, and coarse and fine crackles), which is reflected in the high values of recall (sensitivity) and the F1-score. It is worth noting that, in terms of the F1-score, the biggest difference between the performance of the doctors and the NN was observed for fine crackles (51.1% vs. 64.6%), where the NN clearly outperformed the doctors. Moreover, the NN performance is also higher than that of the doctors in the case of ambiguous sounds (i.e., rhonchi), which tend to be misinterpreted or evaluated improperly in everyday medical practice. Finally, the difference between the performance of the doctors and the NN was smaller for the recognition of wheezes; however, this is because the doctors’ performance on these signals, which are the easiest to interpret, is relatively high.
Thus, the potential of the proposed solution seems to be enormous. It must also be emphasized that the NN algorithm was trained using thousands of recordings and taggings, which makes the results unique and reliable.
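The first step of the GS procedure described in this section, separating recordings with consistent taggings from those that need review by the consilium, can be sketched as below. The data layout, identifiers, and function name are hypothetical; the text does not describe how the taggings were stored or compared, so exact set equality is used here as an assumed criterion of consistency.

```python
from typing import Dict, FrozenSet, List, Tuple

Tagging = FrozenSet[str]  # set of phenomena one rater marked in a recording


def split_by_agreement(
    taggings: Dict[str, List[Tagging]]
) -> Tuple[Dict[str, Tagging], List[str]]:
    """Separate recordings whose raters all agree from those needing review."""
    accepted: Dict[str, Tagging] = {}
    needs_review: List[str] = []
    for recording_id, tags in taggings.items():
        if all(t == tags[0] for t in tags):
            accepted[recording_id] = tags[0]   # consistent: used as reference
        else:
            needs_review.append(recording_id)  # inconsistent: goes to the consilium
    return accepted, needs_review


# Hypothetical example with three raters per recording.
example = {
    "rec_001": [frozenset({"wheezes"})] * 3,
    "rec_002": [frozenset({"rhonchi"}), frozenset({"wheezes"}), frozenset({"rhonchi"})],
    "rec_003": [frozenset()] * 3,  # all raters found no pathological phenomena
}

accepted, needs_review = split_by_agreement(example)
```

In this example `rec_001` and `rec_003` are accepted directly, while `rec_002` would be passed on for review.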
Conclusions
To conclude, the NN algorithms used in this experiment can be described as a very efficient tool for pathological sound detection. This is why AI may become a valuable support for doctors, medical students, and care providers (including lay ones), both in diagnosing and monitoring on the one hand and in training and education on the other. The database we built is itself a very good tool in this field. Moreover, AI algorithms can also benefit lay people by allowing them to monitor their respiratory system at home, which makes this solution valuable in many areas, e.g., patient safety, reaction speed in case of danger, and reducing the cost of treatment.
It must also be emphasized that many publications correlate pathological sounds with a particular disease; however, the relationship is more complicated. Many publications show that the efficiency of physicians is very low [1, 10]; thus, the AI solution is a first step toward making auscultation more objective, with fewer incorrect identifications, and thus toward enabling physicians to correlate sounds with diseases more reliably.
Finally, AI algorithms can also be used in other areas, such as heart disease, which makes this field even more promising, especially given that this experiment, which was carried out in real conditions rather than in a laboratory, demonstrated the high performance of the NN.