Background
In recent years, medical imaging has become crucial in the clinical decision-making process, playing an important role in improving public health due to its ability to extract information for diagnosis and treatment purposes. The use of large medical imaging databases also poses the challenge of handling such an amount of information in a way that is reliable and useful for the clinical expert. In addition, many medical imaging-based procedures present low repeatability, mainly due to the subjective appreciation of the analyzed data, the variability of the image conditions, or even the expert's training for a specific task. Besides the subjectivity, the manual characterization of a large image dataset is a tedious and time-consuming task that inevitably leads to decreasing performance over time for the same expert. In that sense, computer-based systems that provide image storage and analysis through a common repeatable procedure ensure an objective and reliable environment for the specialists, thereby improving the productivity and efficiency of clinical practice.
In ophthalmology, retinal image analysis is a useful tool for the noninvasive diagnosis of many relevant diseases, such as hypertension, diabetes or atherosclerosis. Common symptoms of those pathologies include neovascularization, occurrence of pathological structures, or increased tortuosity, which can be observed by analyzing the vascular tree of the eye fundus. Given the importance of the eye fundus study, Sirius (System for the Integration of Retinal Images Understanding Services) was proposed in [1] as a computer-aided diagnosis tool for the analysis of retinal images. It provides a framework for ophthalmologists or other experts to work collaboratively using retinal image-based applications in a distributed, fast and reliable environment. Sirius integrates several image processing algorithms structured as independent modules. One of the modules is in charge of the automatic calculation of the arterio-venous ratio (AVR) [2], a relevant biomarker for determining the vascular risk associated with diseases that affect the circulatory system, such as hypertension. Another module localizes microaneurysms [3], which are small red points that appear in the early stages of diabetic retinopathy. A third module focuses on measuring the tortuosity of the blood vessels [4, 5], that is, how and how many times a vessel bends, which is complementary to the AVR parameter. Tortuosity is an indicator of a number of vascular and nonvascular diseases such as diabetic retinopathy, cerebrovascular disease, stroke, and ischemic heart disease [6‐9]. This module integrates four different tortuosity metrics of reference [10‐13].
The validation of the Sirius modules against the manual evaluations performed by clinical experts is crucial to ensure a repeatable and reliable analysis of the biomedical parameters that are extracted from the retinal images. The prognostic value of the AVR, as computed in Sirius, has been clinically validated by Pose et al. [14]. The subsequent validation of this module has been carried out in different real environments involving several health care systems [1]. Moreover, additional evaluations of the Sirius vessel width measurement have been conducted on the DRIVE and REVIEW databases [15, 16]. Regarding the tortuosity module, a preliminary validation over a set of retinal images previously classified as tortuous / non-tortuous has been presented by Sánchez et al. [4, 5].
Although retinal vascular tortuosity underlies both vascular and systemic diseases, its manual characterization is affected by several limitations that still restrict its use to research purposes. The systematic reviews of retinal vessel tortuosity measures and related clinical findings conducted by Kalitzeos et al. [17] and Abdalla et al. [18] compile the main limitations of using retinal vascular tortuosity as a clinical marker for diagnostic, treatment and monitoring purposes. One of the main limitations is the lack of a precise and standard guide for tortuosity assessment regarding the image acquisition, the measurement location and the consequent calculation. In clinical practice, the manual characterization of retinal vascular tortuosity is mostly based on clinical experience, identifying relative characteristics such as the dissimilarity to normal healthy vessels in terms of length, width or number of twists, among others, and also evaluating changes in and around each vessel. Therefore, the grading is performed on a subjective scale, resulting in a tedious and time-consuming task with remarkable inter- and intra-expert variability. Another aspect stated in these reviews is that different diseases produce different tortuosity effects [9, 19, 20], so vascular tortuosity should be analyzed from each specific pathological point of view. In addition, the absence of unified public datasets, the limited size of the existing ones, and the differences in the segmentation techniques for extracting blood vessels and in the medical state of the patients at the moment of screening hinder the validation of the available computational measurements. Additionally, most computational metrics depend on one or two factors such as the curvature or the number of twists. However, the experts, based on their experience, consider additional parameters such as dilation, elongation, vessel calibers or branching angles [21, 22], among others, that are not incorporated in the current computational metrics of reference. The limitations extracted from these reviews indicate the need to standardize the image acquisition, parameter calculation and analysis of retinal vascular tortuosity so that it becomes more useful and reliable for supporting clinical decision-making processes.
In the work described herein, a complete and exhaustive multi-expert validation procedure for the Sirius tortuosity module is proposed. This study aims, first, to lay the foundations for advancing the standardization of retinal vascular tortuosity as a clinical biomarker with diagnostic potential. Once consistent clinical criteria are established, the prognostic performance of the objective computational measurements of reference is validated.
In order to cover the entire spectrum of expert knowledge, the validation experiments included a group of five experts with graded levels of expertise who usually work in an ophthalmological service of the health care system, from the head of the service to resident physicians. The manual rating was performed on the basis of a four-grade qualitative scale from non-tortuous to severe tortuosity, complemented with non-tortuous / tortuous and asymptomatic / symptomatic binary classifications. A rating procedure divided into several rounds was designed in order to set a consensual ground truth and extract uniform criteria. To this end, first, the five experts separately rated the whole dataset in a blind process. In order to gain consensus, the discrepancies were analyzed, followed by a second rating round carried out by each expert. Finally, a joint session involving all the experts was held to set fully consensual rates. The expert agreement was thus analyzed throughout the rating procedure and, then, the consensual rates were set as the reference against which the individual manual and automatic measurements were compared. In this way, the prognostic performance of the tortuosity metrics presented in [4] was evaluated in relation to the experts' performance.
This paper is organized as follows: the “Materials and methods” section describes the designed dataset, the details of the automatic tortuosity metrics and the procedure for the multi-expert validation. Next, the “Results” section presents all the conducted experiments and the “Discussion” section discusses the obtained results and the constraints and potential of the tortuosity characterization. Finally, the “Conclusions” section presents the conclusions and possible future work.
Discussion
The results extracted from the overall comparison among the manual rates show that there is a high inter-rater variability, especially for the four-grade scale. Regarding the binary classifications, the experts agree more often in the discrimination between asymptomatic / symptomatic than between non-tortuous / tortuous retinal images. In the rating round R2, the percentage of images with full consensus decreased, mostly due to the rates of E5, the control expert who did not attend the meeting to discuss the discrepancies and kept their initial criteria, which indicates the utility and suitability of the meeting. In contrast, there is a slight increase in the percentage of images where at least four experts converge, since the discussion made it possible to unify criteria and gain consensus.
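The two agreement levels mentioned above (full consensus, and at least four of the five experts converging) can be computed as in the following minimal sketch. The function name, the `min_agree` parameter and the example ratings are hypothetical and are not taken from the study data.

```python
from collections import Counter

def agreement_fractions(ratings_per_image, min_agree=4):
    """Return (fraction with full consensus, fraction with >= min_agree raters agreeing).

    ratings_per_image: one list of labels per image, one label per rater.
    min_agree: assumed threshold for partial consensus (4 of 5 raters here).
    """
    if not ratings_per_image:
        raise ValueError("at least one image is required")
    full = 0
    partial = 0
    for ratings in ratings_per_image:
        # Size of the largest group of raters that gave the same label.
        largest_group = Counter(ratings).most_common(1)[0][1]
        if largest_group == len(ratings):
            full += 1
        if largest_group >= min_agree:
            partial += 1
    total = len(ratings_per_image)
    return full / total, partial / total

# Hypothetical four-grade ratings (0 = non-tortuous ... 3 = severe), five raters per image.
example_ratings = [
    [0, 0, 0, 0, 0],  # full consensus
    [1, 1, 1, 1, 2],  # four raters converge
    [0, 1, 1, 2, 3],  # no consensus
    [2, 2, 2, 2, 2],  # full consensus
]
full_fraction, partial_fraction = agreement_fractions(example_ratings)
print(full_fraction, partial_fraction)
```

Note that every image with full consensus also counts towards the partial-consensus fraction, which is why the two figures can move in opposite directions between rounds.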
The Cohen's kappa indexes show low or fair agreement between rounds for the four experts who attended the session to clarify the discrepancies found in R1, since they changed their criteria for the second round R2 after the meeting. However, E5, the control expert who was not involved in that session, made a similar rating in both rounds, since their criteria were not influenced or modified, and thus presents a high intra-rater agreement. According to the data shown in Table 2, in the round R1 the experts were more conservative for asymptomatic cases, whereas in the round R2 the sensitivity for symptomatic cases increased. The change in the criteria is mainly due to the fact that the initial rates correspond to a global assessment of the whole retina, mostly focused on the main vessels, whereas the expert meeting for analyzing the discrepancies led to a more local analysis that took each specific vessel into account during the round R2. The criteria refinement is also reflected in the low index between VR1 and VR2. However, the rates obtained by combining R1 and R2 are quite close to the consensual rates in Rc, since VR1R2 represents the majority inclination, comprising the conservative criteria based on the global perception followed in R1 as well as the analysis of specific vessels considered in R2.
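For reference, Cohen's kappa corrects the observed agreement p_o between two sets of ratings for the agreement p_e expected by chance, as kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal unweighted implementation; the ratings are hypothetical and only illustrate how a rater who repeats their own labels scores higher than one who changes criteria between rounds.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both sequences must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the marginal label frequencies of each sequence.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both sequences use a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical binary ratings (0 = asymptomatic, 1 = symptomatic) in two rounds.
stable_r1 = [0, 0, 1, 1]
stable_r2 = [0, 0, 1, 1]    # same labels in both rounds
changed_r1 = [0, 0, 1, 1]
changed_r2 = [0, 1, 0, 1]   # agreement no better than chance

kappa_stable = cohen_kappa(stable_r1, stable_r2)
kappa_changed = cohen_kappa(changed_r1, changed_r2)
print(kappa_stable, kappa_changed)
```

The same function applies to inter-rater comparisons by passing the ratings of two different experts for the same round.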
With respect to the objective tortuosity measurements, Fig. 4 shows that their prognostic performance falls below the experts' performance, by differing margins. As detailed before, the analyzed computational metrics incorporate parameters such as the amplitude, number of twists or curvature of the retinal blood vessels, depending on each case. The results show that the best performance is provided by the metrics that integrate information about how many times a vessel changes its convexity. In particular, the metric that reached the best score was the Grisan proposal, followed by the Onkaew proposal, since both combine the number of segments with constant convexity within a vessel with the evaluation of such segments. In contrast, the Hart and Trucco proposals analyze each vessel globally, regardless of whether its curvature keeps a constant sign or presents twists.
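To make the distinction between a global measure and a twist-aware measure concrete, the sketch below computes two generic quantities on a vessel centerline given as a polyline: the arc-to-chord length ratio, which treats the vessel globally, and the number of convexity changes, obtained from sign changes in the turning direction. This is an illustration under simplifying assumptions and is not an implementation of the Grisan, Onkaew, Hart or Trucco metrics; the function names and the sample centerlines are hypothetical.

```python
import math

def arc_chord_ratio(points):
    """Global measure: path length divided by the endpoint-to-endpoint distance."""
    chord = math.dist(points[0], points[-1])
    if chord == 0:
        raise ValueError("closed or degenerate centerline: chord length is zero")
    arc = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return arc / chord

def convexity_changes(points, eps=1e-12):
    """Count how many times the turning direction flips along the polyline."""
    turn_signs = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        # z-component of the cross product of consecutive segment vectors.
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if abs(cross) > eps:  # ignore locally straight stretches
            turn_signs.append(cross > 0)
    return sum(a != b for a, b in zip(turn_signs, turn_signs[1:]))

# Hypothetical centerlines: a straight vessel and an S-shaped vessel.
straight = [(0, 0), (1, 0), (2, 0)]
s_shape = [(0, 0), (1, 1), (2, 0), (3, -1), (4, 0)]

print(arc_chord_ratio(straight), convexity_changes(straight))
print(arc_chord_ratio(s_shape), convexity_changes(s_shape))
```

The S-shaped example has one convexity change, which the arc-to-chord ratio alone cannot distinguish from a single wide arc of the same length.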
Tortuosity characterization. Constraints and potential
The assessment of retinal vascular tortuosity is affected by several factors that prevent its use for diagnostic and treatment purposes. In particular, the lack of precise and standard guides for tortuosity characterization leads to a remarkable disagreement among the experts. In this sense, the multi-expert validation process, based on a rating procedure in several stages, is proposed in order to lay the foundations for advancing the standardization of retinal vascular tortuosity as a potential indicator for diagnostic purposes.
Besides the subjective appreciation of tortuosity signs, the manual characterization also depends on the experience of the rater. In order to cover the entire spectrum of expert knowledge, a group of five clinicians belonging to different levels of an ophthalmological service was considered for the rating procedure. In particular, the group includes the head of the service, experienced clinicians with different levels of expertise, and resident physicians. In this way, the manual characterization of retinal vascular tortuosity incorporates assessments at different levels of expertise and from different medical profiles. In order to avoid biased rates, the information related to the patient's medical state was not known by the experts at the time of the manual rating. The rating procedure was performed individually in a totally blind process in which the experts were only instructed to stick to the evaluation of the vascular tortuosity, disregarding other clinical findings that could bias the rating.
Despite this, there are also limitations related to the availability of normative data due to the absence of unified public datasets. Moreover, even the available datasets, public or private, present limitations in terms of type and size. This, along with the lack of a standard regarding the computational algorithms used for extracting the blood vessels or the location of the tortuosity measurements, hinders the validation of the available computational methods. Furthermore, different diseases produce different tortuosity effects, so vascular tortuosity should be analyzed from each specific pathological point of view. In this work, given the association of retinal vascular tortuosity with diabetes and, more specifically, diabetic retinopathy, diabetic patients were considered representative for this study. Although vascular tortuosity underlies other pathologies as well, the dataset was limited to diabetic patients in order to analyze a representative cohort of homogeneous data. With respect to the type and size of the retinal images, the implemented methods allow a high degree of normalization in the computed tortuosity values, independently of the acquisition procedure. Regarding the location or zone of the vessels involved in the tortuosity computation, this analysis is based on the metrics of reference in the literature [10‐13], included in the Sirius framework [1]. According to these metrics, the vascular tree is extracted by means of a consolidated computational methodology [26], and all the vessels composing the vascular tree are involved in the global tortuosity computation.
Regarding the prognostic performance of the computational metrics, despite the acceptable results of some of the metrics, all of them remain below the experts' performance. The metrics of reference generally use mathematical properties that depend on one or two factors such as curvature or number of twists. However, the experts, based on their experience, analyze a larger set of properties, which sets their assessment apart from the computational metrics.
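One common way to place a computational metric and an individual expert on the same footing is to score both against the consensual reference using sensitivity and specificity. The sketch below assumes binary labels and a hypothetical decision threshold on the metric value; the scoring scheme, the threshold and all the values are illustrative assumptions and are not the figures reported in this study.

```python
def sensitivity_specificity(reference, predicted):
    """Sensitivity and specificity of binary predictions against a binary reference."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both positive and negative cases")
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical consensual labels (True = tortuous) and metric values per image.
consensus = [False, True, False, True]
metric_values = [0.10, 0.40, 0.35, 0.80]
threshold = 0.30  # assumed cut-off, chosen only for this example

metric_labels = [value >= threshold for value in metric_values]
expert_labels = [False, True, False, False]  # hypothetical individual expert

metric_sens, metric_spec = sensitivity_specificity(consensus, metric_labels)
expert_sens, expert_spec = sensitivity_specificity(consensus, expert_labels)
print(metric_sens, metric_spec)
print(expert_sens, expert_spec)
```

Sweeping the threshold over the range of metric values would trace the trade-off between the two quantities for each metric.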
Conclusions
Retinal vascular tortuosity constitutes a potential indicator of relevant vascular and non-vascular diseases, so a reliable quantitative measurement would be a potential biomarker for early detection and disease prevention. However, there is no precise and standard definition of vascular tortuosity and, consequently, its manual characterization is a subjective task with high variability. This work aims to establish the basis for advancing the standardization of retinal vascular tortuosity as a clinical marker with diagnostic potential, thereby allowing the robust validation of computational measurements to ensure an objective and reliable environment for the retinal experts. For this purpose, a multi-expert validation procedure is presented in order to assess the prognostic performance of the computational calculation of vascular tortuosity following the main strategies of reference, which are included in the Sirius framework. The presented validation included the participation of a group of five experts and considered a four-grade scale from non-tortuous to severe tortuosity as well as non-tortuous / tortuous and asymptomatic / symptomatic binary classifications. The rating procedure comprised two rating rounds, in which each expert manually rated the whole dataset, and a subsequent final joint consensus session where the debatable cases were discussed to reach a global agreement.
For the multi-expert validation procedure, first, the expert agreement was analyzed across the different rating rounds. The intra- and inter-rater reliability were computed and the discrepancies were discussed with the experts in order to clarify the criteria and extract additional information from their clinical perception. This rating process made it possible to gain consensus among the experts and obtain consensual rates comprising all the unified criteria extracted throughout the process. Therefore, the consensual rates were set as the reference for validating the computational tortuosity metrics included in this analysis. Once consolidated clinical criteria were established, the prognostic performance of the computational measurements was compared to the experts' performance, allowing a robust assessment of the strengths and limitations of the different tortuosity metrics of reference.
The multi-expert validation provided acceptable results, especially regarding the Grisan proposal. However, all of the considered computational measurements remain below the experts' performance. The analyzed metrics use mathematical properties to define the degree of tortuosity according to one or two factors such as the amplitude, the curvature or the number of twists, depending on each case. In contrast, the experts, based on their experience, analyze additional parameters such as neovascularization, vessel caliber or the distinction between arteries and veins that are not currently incorporated in the existing computational metrics, which causes differences between the automated assessment and the expert perception. The results extracted from this work demonstrate that the metrics of reference do not provide a full representation of the expert perception, so additional parameters should be incorporated in the computational metrics in order to obtain a more accurate and reliable tortuosity assessment. Thus, future work in this research line includes the integration of additional properties in new computational proposals that could bring the performance of the computational metrics closer to the knowledge of the expert clinicians.